Data sharing has transformed discovery research in the genomics arena, but has not yet become the norm for epidemiological datasets. ClinEpiDB is an online resource that attempts to bridge this gap by facilitating appropriate sharing of data from clinical & epidemiological studies/trials. We integrate data using an ontology-driven approach, providing metadata & supporting documentation for context. An intuitive point-and-click website supports data browsing, subsetting & visualization, enabling diverse users to explore potential associations, or download data for advanced analyses. By facilitating access to and interrogation of high-quality, large-scale datasets, ClinEpiDB aims to spur collaboration & discovery, enhancing global health.
The ClinEpiDB team leverages infrastructure & staff developed in the context of the Eukaryotic Host, Pathogen & Vector Bioinformatics Resource Center (VEuPathDB.org), & includes 70+ data scientists on 4 continents. All staff working on these resources– regardless of their physical location or source of funding– are managed through collaborative teams, with performance assessed on the basis of the work we make possible by others, rather than traditional academic metrics like publications. We collaborate in real time across multiple time zones, using Slack as our office and regular Zoom calls & shared documentation to keep the team working in harmony.
ClinEpiDB encompasses ~15 FTEs, driven by epidemiologists, but also including data specialists, ontologists & outreach staff, in addition to core infrastructure support. Outreach specialists have domain expertise in global health & epidemiology and engage with the scientific community to identify & prioritize key datasets for integration, and educate relevant communities on platform use. Data specialists process & integrate the data, collaborating with ontologists on mapping to harmonized ontologies. Principal investigators maintain funding & provide long-term project vision.
Over 20+ years, the VEuPathDB team has become adept at integrating increasingly diverse Omic-scale datasets, and is widely viewed as an honest broker in facilitating data sharing. ClinEpiDB was developed in response to a 2015 request from NIH to explore whether the relational schema that has been so successful at integrating genome informatics datasets, enabling researchers to ask their own questions, might also accommodate epidemiological datasets from the Int’l Centers of Excellence in Malaria Research. A pilot project focusing on a longitudinal cohort study in Uganda with extensive clinical, socioeconomic & entomological data was released in 2018, and has since been supplemented by numerous other studies. By facilitating the structuring, integration, interrogation and sharing of epidemiological data, the ClinEpiDB platform has proved appealing to a variety of data providers and funders, allowing us to establish a dedicated ontology team, expand expertise in other areas of infectious disease, maternal, newborn & child health, etc., improve tools for data analysis & visualization, and greatly reduce the per-study effort and cost of data loading.
Our data sharing policy is straightforward: while we are happy to facilitate fulfillment of the data release requirements imposed by funders, publishers, etc., study teams always retain ownership of their data in ClinEpiDB. We work as an extension of the study team: helping to define variables, harmonize ontologies, clean data, organize metadata, and provide tools for analysis. Interestingly, this has often proved to be the fastest route to public data release: by providing tools that help the study team to understand their own data, we enhance enthusiasm for sharing, triggering favorable attention from others and expediting scientific discovery, impact & policy implementation. A tiered system enables study teams to appropriately manage data access requests; nothing is ever released until both the data providers and ClinEpiDB agree that it is ready to go live. Providers control the level of data access, deciding whether permission is required to view and/or download data, and how data requests are reviewed.
While many data repositories support clinical and epidemiological data deposition, we believe that ClinEpiDB is unique in both making data & metadata available and providing transparent, user-friendly tools and interfaces that lower barriers for data exploration & analysis directly within the browser.
ClinEpiDB promotes FAIR data principles to make data Findable, Accessible, Interoperable and Reusable.
The ClinEpiDB approach to data sharing can readily be replicated in similar database instances constructed elsewhere, as all code underlying the ClinEpiDB platform (and the broader VEuPathDB project) is publicly available in GitHub.
It is a truism that scientific advances build upon prior research, which means that in this data-rich era, it is imperative that high quality datasets be Findable, Accessible, Interoperable, and Reusable if they are to fulfill their potential. Data sharing is an integral part of the entire research process, and increasingly required as funders seek to enhance transparency, preserve research outputs, and get the most out of their investment, while publishers strive to ensure reproducibility and impact. But data sharing is more than an obligation: it directly benefits data producers by fostering collaborations, increasing visibility and making their work more fruitful.
Over the past generation, FAIR access to genomic-scale datasets has revolutionized all aspects of biomedical research, but has been a challenge for epidemiological datasets. Our experience demonstrates that working closely with data providers during the process of data sharing results in data that is cleaner and well-structured, with accompanying metadata that is clear and complete.