Our team’s work on data reuse aims to prevent errors in data analysis and inference, both pre- and post-publication. We offer statistical analysis verification to collaborators before papers are submitted for publication and advise them to share data and code for reproduction and reuse by others. We also reach out to journal editors and authors when we notice statistical errors in published papers, request access to the data, reanalyze them, and share the corrected results with the scientific community. Our team’s success is bolstered by its members’ diverse academic backgrounds and demographics, and we are motivated to enhance rigor, reproducibility, and transparency in research and scientific communication through data sharing and reuse.
Our team came together after Dr. Allison became dean of our school in 2017 and brought a diverse group of faculty on board, including Drs. Tekwe, Zoh, Owora, Agley, Otten, Vorland, Jamshidi-Naeini, and Siddique. Our Biostatistics Consulting Center (BCC), a team of eight professional biostatisticians directed by Stephanie Dickinson and including Lilian Golzarri-Arroyo, was also substantially expanded and strengthened.
Our workflow generally centers on the BCC, which provides high-quality support for data management and analysis, with a focus on meticulous workflow and file management, data sharing, and archiving. When given a dataset, a BCC statistician sometimes performs the initial analyses, and a second BCC statistician independently verifies the analysis. In other cases, collaborators have already performed their analysis, and the BCC serves as the second statistician to ensure no errors have been made. In still other cases, we observe likely statistical errors when reading the literature and obtain the raw data, either from open repositories or by contacting the authors. BCC statisticians assist in reanalyzing these datasets to correct errors, and all members of our team participate in publishing the corrections.
Our team is at the forefront of addressing errors in science and seeks to inculcate a new generation steadfast in pursuing the truth. We often do this by publishing letters to the editor flagging errors, and in cases where we can obtain data from the authors, we have published the corrected results. Since we began formally tracking the statistical errors we identify in 2017, we have flagged over 75 papers that might or do contain errors and have published or submitted more than 46 letters or PubPeer posts identifying errors or confirming good practices in the literature. Some of these errors served as the basis for at least five tutorial articles on statistical and reporting issues in childhood obesity, nutrition, and aging. Dr. Allison has led this effort, working on error correction for over a decade. Per a Scopus search of letters Dr. Allison has published with colleagues, he has 123 unique coauthors on at least 57 letters in 43 journals over the last 10 years. Together, we lead a multi-institutional effort to increase the recognition of errors in science. Indeed, our letters serve the enduring function of identifying common errors. Our work, especially if widely adopted, can help normalize error identification and correction in science.
Our efforts are most effective when data are available for reuse, whether publicly or when authors share them. We recommend that authors make their anonymized raw data publicly available to the extent that they can. Even more importantly, if data cannot be made public, we recommend unrestricted access to the data by professional biostatisticians for evaluation, whether pre- or post-publication. In analyzing colleagues’ data before submission for peer review, we have found mistakes at every step of the process. Sometimes the analyst accidentally used the wrong version of the data; at other times the results were incorrectly described in the text. We have also seen incorrect methods used, in which case we advised on the correct methods and code. We also advise on archiving the de-identified data and code to be shared publicly so they are ready to pull off the shelf as needed. The extent to which we identify errors post-publication demonstrates the critical importance of data reuse for preventing and correcting errors. Our approach is applicable to any scientific effort based on data. Any team can likely adopt our procedures, though many may not yet appreciate the importance of doing so.
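One simple safeguard against the wrong-version problem is to record a cryptographic hash of the archived dataset and re-check it before every analysis run. The sketch below is an illustration of that practice, not our team’s specific tooling; the file path and hash value are hypothetical placeholders.

```python
# Minimal sketch: confirm an analysis is running against the archived dataset
# version by comparing its SHA-256 hash to the hash recorded at archiving time.
# The path "data/analysis_v2.csv" and the expected hash are hypothetical.
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "replace-with-hash-recorded-at-archiving-time"  # placeholder

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

actual = sha256_of(Path("data/analysis_v2.csv"))
if actual != EXPECTED_SHA256:
    raise RuntimeError(f"Dataset hash {actual} does not match the archived version.")
print("Dataset version verified.")
```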
Our team offers a framework for error mitigation following the slogan “prevent, detect, admit, correct.” Existing standards identify findings that are “clearly wrong,” and our approach emphasizes errors within that scope. First and foremost, we want to help prevent errors. To that end, we have implemented a three-step verification process for all projects we work on (a sketch of the first step follows the list below):
Verify that the code produces exactly the results presented in the manuscript
Verify that the methods used in the code are accurately reported in the manuscript
Verify that the methods used are appropriate, and advise on any alternatives
These steps rely on our colleagues providing all data and code for reuse. As such, we advocate for complete transparency.
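As an illustration of the first verification step, the sketch below recomputes a simple two-group comparison from a hypothetical de-identified dataset and flags any statistic that does not match the values transcribed from the manuscript. The file name, column names, and reported values are all hypothetical; this is a minimal sketch of the idea, not the BCC’s actual verification code.

```python
# Minimal sketch of verification step 1: confirm that re-running the analysis
# reproduces the results reported in the manuscript.
import pandas as pd
from scipy import stats

# Values transcribed from the manuscript (hypothetical).
REPORTED = {"mean_diff": 2.41, "t_stat": 2.87, "p_value": 0.005}
TOLERANCE = 0.005  # allow for rounding in published tables

df = pd.read_csv("trial_data.csv")  # de-identified dataset shared by the authors (hypothetical)
treated = df.loc[df["group"] == "treatment", "outcome"]
control = df.loc[df["group"] == "control", "outcome"]

t_stat, p_value = stats.ttest_ind(treated, control)
recomputed = {
    "mean_diff": treated.mean() - control.mean(),
    "t_stat": t_stat,
    "p_value": p_value,
}

# Flag any discrepancy between the recomputed and reported values.
for name, reported_value in REPORTED.items():
    if abs(recomputed[name] - reported_value) > TOLERANCE:
        print(f"MISMATCH in {name}: reported {reported_value}, recomputed {recomputed[name]:.4f}")
    else:
        print(f"{name} reproduces within tolerance.")
```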
It is often challenging to obtain raw data from the authors of publications we think may contain errors. We have found the following general steps most successful. In polite correspondence, we contact the authors to request the raw data, quoting journal data sharing policies where relevant. If the authors do not reply or refuse, we copy the editors and publishers to request their involvement. If we can obtain the raw data, we reanalyze the results and publish the corrections. If we cannot, we publish a letter to the editor explaining our reasoning about the likely errors and noting that the authors need to make the corrections.
These procedures are easily replicable by anyone with relevant expertise, and as they become more widespread, error correction will become normalized.
Data (and code) sharing can prevent or minimize the risk of analytical and statistical errors. It is nearly impossible to understand completely how an analysis was done from the methods section of a manuscript alone. For example, how outliers were treated and whether the model’s assumptions were met (such as normality of residuals in a regression model) are often opaque yet crucial to interpreting results. Data sharing therefore enables readers to reproduce the analysis and understand exactly what was done. This process also discourages authors from implementing inappropriate analyses (authors may take approaches yielding more favorable results if they are not asked to share data).
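For instance, with shared data a reader can refit the reported model and check its assumptions directly. The sketch below assumes a hypothetical shared file, shared_data.csv, with hypothetical outcome, exposure, and age columns; it illustrates the kind of residual-normality check that a methods section alone rarely makes possible.

```python
# Minimal sketch: refit the reported regression from shared data and check
# the normality of its residuals, a detail often omitted from methods sections.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("shared_data.csv")  # hypothetical shared dataset

# Refit the model as described in the manuscript (hypothetical specification).
model = smf.ols("outcome ~ exposure + age", data=df).fit()
print(model.summary())

# Shapiro-Wilk test and skewness as quick checks on the residuals.
shapiro_stat, shapiro_p = stats.shapiro(model.resid)
print(f"Shapiro-Wilk W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Residual skewness = {stats.skew(model.resid):.3f}")
```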
Further, researchers can use public data to implement other potential analyses. They can run sensitivity analyses, such as adding covariates not considered in the original analysis, to evaluate how robust the conclusions are. The data can also be used to test newly developed models and generate additional hypotheses, as we have done thanks to others’ generous data sharing.
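As a hedged illustration of such a sensitivity analysis, the sketch below compares an unadjusted and a covariate-adjusted regression fit to a hypothetical public dataset (public_dataset.csv, with hypothetical outcome, exposure, age, and sex columns) to see how much the exposure estimate moves after adjustment.

```python
# Minimal sketch: covariate-addition sensitivity analysis on publicly shared data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("public_dataset.csv")  # hypothetical public dataset

base = smf.ols("outcome ~ exposure", data=df).fit()
adjusted = smf.ols("outcome ~ exposure + age + sex", data=df).fit()

# If the exposure coefficient shifts materially after adjustment, the
# original conclusion may be less robust than reported.
for label, fit in [("unadjusted", base), ("adjusted", adjusted)]:
    coef = fit.params["exposure"]
    lo, hi = fit.conf_int().loc["exposure"]
    print(f"{label}: exposure = {coef:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```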