I analyzed the methods sections of 43,110 randomly chosen research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to check the popularity of 125 statistical methods in medical research.
I used the BioC API to download the articles (see the References section below).
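As a rough sketch of what such a download could look like, the snippet below builds a request URL for the BioC PMC web service and fetches one article as JSON. The URL pattern, the helper names `bioc_url` and `fetch_article`, and the example ID are assumptions for illustration, not the exact code used for this study; check the BioC API documentation for the current endpoint.

```python
import json
from urllib.request import urlopen

# Assumed URL pattern for the BioC PMC open-access service (verify against the docs).
BIOC_BASE = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"


def bioc_url(article_id: str, fmt: str = "json", encoding: str = "unicode") -> str:
    """Build the request URL for one article, identified by a PMID or PMCID."""
    if fmt not in {"json", "xml"}:
        raise ValueError("fmt must be 'json' or 'xml'")
    return f"{BIOC_BASE}/BioC_{fmt}/{article_id}/{encoding}"


def fetch_article(article_id: str):
    """Download one article in BioC JSON and return the parsed data (needs network access)."""
    with urlopen(bioc_url(article_id), timeout=30) as response:
        return json.load(response)
```

Calling `fetch_article("PMC1234567")` (a made-up ID) would return the parsed JSON for that article, from which the methods section can then be extracted.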
Here’s a summary of the key findings
The most popular statistical tests in research articles are:
- Student’s t-test: Used to compare the mean of a population to a theoretical value, or compare means between 2 populations.
- Chi-square test: Used to compare 2 proportions.
- Mann-Whitney U test: Used to compare medians between 2 populations.
- One-way ANOVA and Kruskal-Wallis test: Used to compare means (ANOVA) or medians (Kruskal-Wallis) between more than 2 populations.
- Kaplan-Meier estimator: Used to estimate the survival function when analyzing time-to-event data.
- Log-rank test: Used to compare survival times between 2 groups.
The most popular statistical models in research articles are:
- Logistic regression: Used to study the relationship between 1 or more predictor variables and 1 binary outcome variable.
- Linear regression: Used to study the relationship between 1 or more predictor variables and 1 continuous outcome variable.
- Cox regression: Used to study the relationship between 1 or more predictor variables and the survival time of patients.
Top statistical methods overall
The top 10 chart below shows that comparing means and proportions between study groups, and analyzing survival data are the most common types of statistical techniques found in research papers:
To visualize the popularity of all 125 statistical tests and models, I created a Fisher-shaped word cloud, which is a cluster of words showing the most popular ones in bolder and larger fonts:
And here’s a table for those of you who prefer to look at numbers:
| Rank | Statistical Test/Model | Number of Articles | Percentage (In 43,110 Articles) |
|---|---|---|---|
| 3 | Mann-Whitney U test | 8063 | 18.70% |
| 11 | Fisher’s exact test | 2578 | 5.98% |
| 16 | Tukey’s HSD test | 1834 | 4.25% |
| 17 | Wilcoxon signed-rank test | 1454 | 3.37% |
| 22 | Repeated measures ANOVA | 882 | 2.05% |
| 29 | Support vector machines | 449 | 1.04% |
| 38 | One sample t-test | 296 | 0.69% |
| 40 | Duncan’s new multiple range test | 268 | 0.62% |
| 41 | Cochran’s Q test | 259 | 0.60% |
| 45 | Linear discriminant analysis | 250 | 0.58% |
| 54 | Partial least squares discriminant analysis | 151 | 0.35% |
| 55 | Analysis of similarities | 146 | 0.34% |
| 56 | Negative binomial regression | 140 | 0.32% |
| 57 | Mauchly’s sphericity test | 125 | 0.29% |
| 58 | Principal component analysis | 124 | 0.29% |
| 63 | Kaiser-Meyer-Olkin test | 100 | 0.23% |
| 70 | Item-total correlation test | 53 | 0.12% |
| 74 | Partial least squares regression | 40 | 0.09% |
| 86 | Goodman and Kruskal’s gamma | 11 | 0.03% |
| 88 | Vuong’s closeness test | 10 | 0.02% |
| 89 | Wald–Wolfowitz runs test | 9 | 0.02% |
| 95 | Cramér–von Mises test | 5 | 0.01% |
| 96 | Fay and Wu’s H | 5 | 0.01% |
| 103 | Hoeffding’s independence test | 3 | 0.01% |
| 104 | Dixon’s Q test | 2 | 0.00% |
| 105 | Ramsey RESET test | 2 | 0.00% |
| 106 | Sequential probability ratio test | 1 | 0.00% |
| 110 | Cochran’s C test | 1 | 0.00% |
| 112 | Van der Waerden test | 1 | 0.00% |
| 113 | Tukey’s test of additivity | 0 | 0.00% |
| 120 | Information matrix test | 0 | 0.00% |
| 123 | Squared ranks test | 0 | 0.00% |
| 124 | Principal components regression | 0 | 0.00% |
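The percentages in the table are the number of articles mentioning a method divided by the 43,110 articles analyzed. A small sketch of that arithmetic, using three counts from the table (the function name `percent_of_articles` is only illustrative):

```python
TOTAL_ARTICLES = 43_110  # number of articles analyzed in this study


def percent_of_articles(count: int, total: int = TOTAL_ARTICLES) -> str:
    """Share of articles mentioning a method, formatted with 2 decimals."""
    return f"{100 * count / total:.2f}%"


# Counts copied from the table above.
mentions = {
    "Mann-Whitney U test": 8063,
    "Fisher's exact test": 2578,
    "Tukey's HSD test": 1834,
}

for name, count in mentions.items():
    print(f"{name}: {percent_of_articles(count)}")
```

Rounding to 2 decimals is also why methods mentioned in only 1 or 2 articles show up as 0.00%.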
Popularity of normality tests
Normality tests are used to determine if the data follow a normal distribution, an essential requirement for many statistical tests and models.
The most used normality test was the Shapiro-Wilk test (mentioned in 5.06% of research papers), followed by the Kolmogorov-Smirnov test (4.39%), the Anderson-Darling test (0.08%), the Jarque-Bera test (0.02%), and the Cramér–von Mises test (0.01%).
Popularity of machine learning algorithms
Machine learning algorithms are divided into 2 classes:
- Supervised learning algorithms
- Unsupervised learning algorithms
1- Supervised learning algorithms are models used to predict an outcome, given 1 or more predictors. The most popular algorithms in our data were:
- Neural networks (mentioned in 1.46% of research papers)
- Non-linear regression (1.14%)
- Random forest (1.05%)
- Support vector machines (1.04%)
- Lasso regression (0.67%)
- Classification and regression trees (0.38%)
- Naïve Bayes (0.37%)
- Gradient boosted models (0.22%)
- Ridge regression (0.14%)
Note that these models were far less popular than inferential models such as linear and logistic regression (10.35% and 15.04% respectively).
2- Unsupervised learning algorithms are methods used to discover patterns or group unlabeled data (i.e. in cases where we don’t have an outcome variable). The most popular methods in this category were:
- Factor analysis (mentioned in 1.86% of research papers)
- K-means clustering (0.98%)
- Principal component analysis (0.29%)
Popularity of Bayesian methods
Bayesian methods were mentioned in only 5.5% of research papers, and there is no sign of an increasing Bayesian trend between the years 2016 and 2021.
This, however, does not reflect the importance of Bayesian methods. In fact, some of the best books on statistics, such as Regression and Other Stories by Gelman, Hill and Vehtari, incorporate at least some form of Bayesian thinking when teaching frequentist statistics.
Challenges I faced while analyzing text data for this study
As a bonus, I thought it would be interesting to share some of the challenges I had to deal with while searching the methods sections of these 43,110 research papers for mentions of statistical methods.
All of the problems mentioned below were handled with appropriate regular expressions, which are sequences of characters used to search for a particular pattern (here, one corresponding to a statistical test or model) in the text.
1. Different spellings
- Hosmer–Lemeshow, Hosmer Lemeshow, Hosmer and Lemeshow, Hosmer & Lemeshow, etc.
- K-nearest neighbors, K-nearest neighbours (British English spelling), KNN.
- Bonferroni correction, Bonferroni method, Bonferroni’s method.
- Chi square test, Chi-squared test, or χ² test.
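A minimal sketch of how such spelling variants could be folded into one regular expression per method. The patterns and the helper name `methods_mentioned` are illustrative assumptions, not the exact expressions used in this study:

```python
import re

# One case-insensitive pattern per method, covering the variants listed above.
PATTERNS = {
    # hyphen, en dash, space, "and", or "&" between the two names
    "Hosmer-Lemeshow test": re.compile(
        r"\bhosmer\s*(?:[-–]|and|&)?\s*lemeshow\b", re.IGNORECASE
    ),
    # American and British spelling, plus the abbreviation
    "K-nearest neighbors": re.compile(
        r"\bk[-\s]nearest[-\s]neighbou?rs?\b|\bKNN\b", re.IGNORECASE
    ),
    # "chi square", "chi-squared", "χ2", "χ²"
    "Chi-square test": re.compile(
        r"\b(?:chi[-\s]?squared?|χ\s*[2²])\s+tests?\b", re.IGNORECASE
    ),
}


def methods_mentioned(text: str) -> set:
    """Return the names of all methods whose pattern occurs in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}
```

Because each paper contributes a set of method names, a method is counted once per paper no matter how many times or in how many spellings it appears.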
2. Incorrect spellings
Several incorrect spellings were surprisingly common, for instance:
| Incorrect Spelling | Number of occurrences | Correct Spelling |
|---|---|---|
| Cochrane Q test | 743 times | Cochran Q test |
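One way to tolerate a misspelling like this is to make the extra letter optional in the pattern. The following is a sketch under that assumption (the name `mentions_cochran_q` is hypothetical):

```python
import re

# "Cochran" or the misspelled "Cochrane", optionally followed by 's, then a standalone "Q".
COCHRAN_Q = re.compile(r"\bcochrane?(?:['’]s)?\s+q\b", re.IGNORECASE)


def mentions_cochran_q(text: str) -> bool:
    """True if the text mentions Cochran's Q test under either spelling."""
    return COCHRAN_Q.search(text) is not None
```

Requiring the standalone "Q" keeps unrelated phrases such as "Cochrane Library" from being counted.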
3. Split words
For instance, when searching for the number of papers that reported the use of linear regression, it would be incomplete to search only for the phrase “linear regression”, since the model is sometimes reported as “linear and logistic regression were used”. It would also be wrong to search only for the word “linear”, since “linear discriminant analysis” is a different statistical method from linear regression.
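The split-word case can be sketched with a pattern that lets other model names sit between “linear” and “regression” as long as they are joined by commas, “and”, “or”, or “&”. This is an illustrative assumption about how such a pattern might look, not the exact expression used here:

```python
import re

# Separators allowed inside an enumeration such as "linear, logistic, and Cox regression".
SEP = r"(?:,\s*(?:and|or)\b|,|&|\b(?:and|or)\b)"

LINEAR_REGRESSION = re.compile(
    # The lookbehind rejects "non-linear" and "nonlinear", which are different methods.
    rf"(?<![\w-])linear\b(?:\s*{SEP}\s*[\w-]+)*\s+regressions?\b",
    re.IGNORECASE,
)


def mentions_linear_regression(text: str) -> bool:
    """True if 'linear' is tied to 'regression', directly or through an enumeration."""
    return LINEAR_REGRESSION.search(text) is not None
```

With this pattern, “linear and logistic regression were used” counts as a mention, while “linear discriminant analysis” does not.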
4. Alternative names
For instance, Student’s t-test is also known as: Independent t-test, Independent-samples t-test, and Two-sample t-test. This complicates the analysis as it requires knowledge of all synonyms for all statistical tests.
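A minimal sketch of how synonyms could be mapped back to one canonical name, so that each variant counts toward the same method. The synonym list comes from the example above; the data structure and the name `canonical_methods` are assumptions for illustration:

```python
import re

# Canonical name -> regex fragments for each known synonym.
SYNONYMS = {
    "Student's t-test": [
        r"student['’]?s?\s+t[-\s]?tests?",
        r"independent(?:[-\s]samples?)?\s+t[-\s]?tests?",
        r"two[-\s]samples?\s+t[-\s]?tests?",
    ],
}

# Join the fragments of each method into a single case-insensitive pattern.
COMPILED = {
    name: re.compile(r"\b(?:" + "|".join(variants) + r")\b", re.IGNORECASE)
    for name, variants in SYNONYMS.items()
}


def canonical_methods(text: str) -> list:
    """Return the sorted canonical names of all methods mentioned in the text."""
    return sorted(name for name, pattern in COMPILED.items() if pattern.search(text))
```

Adding a newly discovered synonym then only requires one more fragment in the list for that method.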
Although I did my best to find these synonyms, if you notice anything missing or want to report an error, please email me using the form on the contact page.
References
- Comeau DC, Wei CH, Islamaj Doğan R, and Lu Z. PMC text mining subset in BioC: about 3 million full text articles and growing, Bioinformatics, btz070, 2019.