Submission

introduction
title
Trust, but Verify - Data Sharing and Reuse
short description
Data sharing and reuse give the field built-in mechanisms for self-correction. We have been trailblazers in this practice since 2010.
Submission Details
Submission Category
Data reuse
Abstract / Overview

We contacted the original investigators of 24 brief alcohol intervention studies to obtain raw data, instruments, intervention materials, and IRB documents. We also systematically extracted study-level data from 309 independent samples published from 1980 through 2018. The result is a comprehensive data set with levels of granularity ranging from item-level to study-level data. We harmonized key variables, applied advanced measurement models, and developed new statistical models to analyze the data appropriately. We are on track to have published 64 articles by year-end. Over the last 3 to 4 years, we have publicly shared data and code at Mendeley Data so that others can replicate and build on our work. The shared files have been viewed and downloaded for use in new publications.

Team

Project INTEGRATE includes core investigators and data contributors. For this DataWorks! Challenge, we invited the ten core investigators who have been most active in recent years. Team members bring expertise from several disciplines: Psychology, Education, Sociology, Statistics, Biostatistics, and Public Health. Investigators from every career stage are represented, including 1 research scientist, 3 assistant professors, 1 associate professor, and 5 professors. We are based at 5 higher education institutions, 4 in the US (University of North Texas HSC, U Washington, U Oregon, Rutgers) and 1 in Hong Kong (Hong Kong University). Four of us are women, 6 are racial minorities, and 3 are first-generation immigrants. The collaboration started with mostly early-career investigators at Rutgers and U Washington. As our careers grew, new investigators joined because they shared the vision of strengthening research through data-driven, innovative modeling work. Dr. Mun, Team Captain, oversees all data management responsibilities. We meet formally twice a month and more frequently in small groups for specific projects. We use cloud-based productivity apps and follow a team-based process to verify all work.

Potential Impact

As the past 2 years of the COVID pandemic laid bare, persuading people to change their behavior can be quite difficult. We launched Project INTEGRATE in 2010 to understand and promote effective intervention strategies for reducing alcohol consumption and alcohol-related negative consequences among college students. The scientific premise of using large-scale individual participant data from multiple studies is three-fold: 1) by pooling data across studies, we gain statistical power and more precise estimates for low-base-rate behaviors such as substance use; 2) we obtain diverse samples, outcome measures, interventions, and control comparisons, and can account for their effects on overall effect size estimates; and 3) we can check and clean the data and apply more appropriate, up-to-date modeling approaches. Project INTEGRATE is arguably the most successful project in the social and behavioral sciences to reuse existing data to generate valuable clinical insights. We are on track to publish 64 papers by year-end. We attribute this success to the team's work in developing new methods and software packages and code, testing them in simulation studies, and disseminating the findings in publications. We have learned that data reuse is quite challenging because of between-study heterogeneity, such as differences in intervention adaptations and samples. Through this work, we discovered that the conclusion of a Cochrane Systematic Review (Foxcroft et al., withdrawn) was problematic and explained why to the field (Mun, Atkins, et al., 2015), in an article that has been cited in policy documents twice and linked out 1473 times. As this episode indicates, evidence that can be replicated and trusted has become more valuable. We have reported our findings and shared data and code at Mendeley Data for researchers to replicate. For example, the data and code for Huh et al. (2019) have been downloaded 110 times and viewed 709 times (see doi:10.17632/4dw4kn97fz.2).
Where we have seen a pattern of inadequate statistical modeling in primary studies, we have provided recommendations to clinical trial investigators on what to avoid and what to do instead. We have also offered small incentives to primary investigators (e.g., small consulting fees, occasional updates, co-authorship or group authorship when warranted, and project promotional items) and developed a website and a Twitter account to increase our reach. Project INTEGRATE provides a successful collaboration model for data sharing and reuse.

Replicability

Project INTEGRATE has adopted the latest methods from the fields of Statistics, Measurement, Meta-analysis, and Data Science. We have presented our work at meetings of national and international professional societies (e.g., the Society for Prevention Research, the Joint Statistical Meetings, the Society for Research Synthesis Methods) and published in multidisciplinary journals (Prevention Science, Statistics in Medicine, Statistics and Its Interface). Early on, we were interested in analyzing all data simultaneously in a single-step estimation, as suggested by several prominent psychologists. However, we quickly learned that this is not feasible in many data situations. In psychology, data harmonization via commensurate measures has been the most discussed approach to combining individual participant data. However, harmonizing constructs that are not directly measurable or quantifiable, such as depression, is much more challenging than harmonizing age, sex, race/ethnicity, or smoking frequency. There are also many other reasons why analyzing and pooling data in a single-step estimation may not be feasible or ideal. We have demonstrated that differences in intervention arms (Huh et al., 2015, 2019), sample characteristics (Jiao et al., 2020), outcome distributions (Mun et al., 2022), and measures (Mun et al., 2015, 2019) can pose barriers to analysis, and we have provided methodological solutions. Our approach can be replicated because we have shared our data and code.
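
As an illustration only, one widely used alternative to single-step estimation is a two-stage approach: estimate the intervention effect within each study first, then pool the study-level estimates with weights that reflect their precision. The sketch below uses a standard DerSimonian-Laird random-effects estimator. It is not taken from the Project INTEGRATE code shared at Mendeley Data; the function name `pool_random_effects` and the input values are hypothetical.

```python
import math


def pool_random_effects(estimates, variances):
    """Pool per-study effect estimates (DerSimonian-Laird random effects).

    estimates: per-study effect estimates (stage one, computed elsewhere)
    variances: sampling variances of those estimates
    Returns (pooled_estimate, standard_error, tau_squared).
    """
    k = len(estimates)
    if k != len(variances) or k < 2:
        raise ValueError("need at least two studies with matching inputs")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w

    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to each study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return pooled, se, tau2


if __name__ == "__main__":
    # Hypothetical stage-one results from three studies (made-up numbers).
    effects = [-0.10, -0.25, 0.05]
    sampling_vars = [0.02, 0.03, 0.015]
    print(pool_random_effects(effects, sampling_vars))
```

A two-stage design lets each study be modeled on its own terms (its own measures, arms, and outcome distribution) before pooling, at the cost of some flexibility compared with a single joint model.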

Potential for Community Engagement and Outreach

Clinical trials require considerable resources, but many of the variables they collect are never analyzed or reported in the literature. For example, Mun et al. (2022) discovered that only 1 out of 15 studies had reported driving after drinking as an outcome, although the measure was available. Across the studies included in Project INTEGRATE, only half of the outcomes assessed were reported in publications (Li et al., 2019). More generally, clinical trials are often underpowered, and many studies do not examine their data beyond testing whether interventions are efficacious. Leveraging data from existing studies provides opportunities to reexamine evidence, or the lack thereof, and to test novel hypotheses using computationally powerful emerging techniques. For investigators who share their data, the benefits are that (1) sharing can extend the life cycle of valuable data; (2) the impact of their work increases; (3) the overall quality of their work may improve, especially if they follow recommendations for appropriate statistical modeling and reporting; and (4) this open science practice helps increase public trust in science. In sum, the entire knowledge generation system benefits from data sharing and reuse, which in turn improves public health.
