I analyzed 3,823,919 references cited in 96,685 research papers, chosen at random from those uploaded to PubMed Central between the years 2016 and 2021, in order to answer the question:
How can you tell whether a reference is too old to cite in a research article?
I used the BioC API to download the data (see the References section below).
Here’s a summary of the key findings:
1- When searching for references to cite, you should aim to find those published within the past 13 years. However, 25% of references cited in published research papers are older than this, and there is no convincing evidence that higher-quality articles cite more recent sources.
2- Looking at the same data from the author’s point of view, we can say that:
You should not expect your paper to be cited much within its first year of publication. Citations are estimated to peak between 3 and 13 years after publication and then gradually taper off, but your paper can still be cited even 27 years after publication!
How old is the average cited source?
Looking at the density plot below we see that:
A large portion of the references cited in research papers are less than 5 years old, and the majority are less than 10 years old.
Note that the reference age is calculated by subtracting the publication year of the reference from that of the paper citing it.
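A minimal Python sketch of that calculation follows. The function name and the choice to discard negative ages (a reference dated after the citing paper, which usually signals a metadata problem) are assumptions made for illustration, not the code used for the analysis.

```python
from typing import Optional


def reference_age(citing_year: int, reference_year: int) -> Optional[int]:
    """Age of a reference in years, relative to the paper that cites it."""
    age = citing_year - reference_year
    # Assumption: a negative age means inconsistent metadata, so it is discarded.
    return age if age >= 0 else None


# A paper published in 2019 citing sources from 2019, 2012 and 1992.
ages = [reference_age(2019, year) for year in (2019, 2012, 1992)]
print(ages)  # [0, 7, 27]
```

Because only publication years are compared, a reference published in the same year as the citing paper has an age of 0.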
The table below shows that:
The median reference cited in a research paper is 7 years old, and 75% of references were published within the 13 years before the citing paper. Still, 5% of the cited references are older than 27 years, and some papers even cite historical sources.
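|Statistic|Reference Age|
|---|---|
|Minimum Reference Age|0 years old|
|5th Percentile|1 year old|
|25th Percentile|3 years old|
|50th Percentile (Median)|7 years old|
|Mean Reference Age|9.78 years old|
|75th Percentile|13 years old|
|95th Percentile|27 years old|
|Maximum Reference Age|2020 years old|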
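Summary statistics like these can be computed from a flat list of reference ages. The sketch below uses only the Python standard library and a nearest-rank percentile; the helper names, the percentile method, and the toy data are assumptions for illustration, so its output will not match the table.

```python
import math
import statistics


def percentile(sorted_ages, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    rank = max(1, math.ceil(p * len(sorted_ages) / 100))
    return sorted_ages[rank - 1]


def summarize(ages):
    """Return the minimum, selected percentiles, mean and maximum of the ages."""
    ordered = sorted(ages)
    return {
        "min": ordered[0],
        "p25": percentile(ordered, 25),
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "p75": percentile(ordered, 75),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
    }


# Toy data: ten made-up reference ages, including one very old source.
toy_ages = [0, 1, 2, 3, 5, 7, 8, 10, 13, 40]
summary = summarize(toy_ages)
print(summary["median"], summary["mean"])  # 6.0 8.9
```

Even in this toy example, a single very old source pulls the mean above the median, which is why the median and the percentiles describe a typical reference better than the mean does.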
Do higher-quality articles cite more recent sources?
In order to answer this question, the quality of a given article will be judged by the impact factor of the journal in which it was published. Although impact factors are not perfect measures of quality, it could be argued that they provide a good proxy for our purposes.
So I collected the journal impact factor (JIF) for 71,579 articles and divided the dataset into 2 groups:
- research papers published in low-impact journals (JIF ≤ 3): this subset consisted of 34,758 articles and 1,247,373 references
- research papers published in high-impact journals (JIF > 3): this subset consisted of 36,821 articles and 1,791,061 references
The median reference in both groups was 7 years old. The mean, however, differed: the average reference was 10.1 years old in the low-impact group and 9.3 years old in the high-impact group. Because the medians are identical and the means differ by less than a year, there isn’t enough evidence to conclude that higher-quality articles cite more recent sources.
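The comparison can be sketched as follows, pooling the reference ages of all papers on each side of the JIF threshold. The function name, the input layout (one pair of impact factor and reference ages per paper), and the toy numbers are assumptions for illustration only.

```python
import statistics

JIF_THRESHOLD = 3  # split used in the text: low impact is JIF <= 3, high impact is JIF > 3


def compare_by_impact(papers):
    """papers: iterable of (journal_impact_factor, [reference ages]) pairs."""
    groups = {"low": [], "high": []}
    for jif, ages in papers:
        # Pool the references of every paper in the same impact group.
        groups["low" if jif <= JIF_THRESHOLD else "high"].extend(ages)
    return {
        name: {"median": statistics.median(ages), "mean": statistics.mean(ages)}
        for name, ages in groups.items()
        if ages
    }


# Toy data: four made-up papers, two on each side of the threshold.
toy_papers = [
    (1.8, [2, 7, 30]),
    (2.9, [1, 7, 12]),
    (5.4, [3, 7, 9]),
    (11.2, [4, 7, 20]),
]
result = compare_by_impact(toy_papers)
print(result["low"]["median"], result["high"]["median"])  # 7.0 7.0
```

In the toy data, as in the real dataset, both groups share the same median while the means differ, because a few old references shift the mean without moving the median.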
References
- Comeau DC, Wei CH, Islamaj Doğan R, Lu Z. PMC text mining subset in BioC: about 3 million full text articles and growing. Bioinformatics, btz070, 2019.