High-quality, error-free and complete genome assembly data generated by the Vertebrate Genomes Project (VGP), the Earth BioGenome Project (EBP) and others are made public through the GenomeArk data repository. The VGP developed the GenomeArk to make genomic data for all species available to the community and to advance biomedical research and conservation efforts. Once generated, the data are uploaded to the GenomeArk following quality-control checks. Bioinformatic and manual assembly curation teams process and upload raw data, intermediate assemblies, and draft and final reference genome assemblies. Our approach makes the data for every species, in every form, accessible to the research community as soon as they are generated.
Jarvis is Chair of the VGP and co-director of the Vertebrate Genome Lab (VGL) at Rockefeller University, the main sequencing hub of the VGP. Jarvis founded the GenomeArk and co-designed and implemented it with the Phillippy Lab. He oversees the roles of the members involved in data upload and maintenance, and he negotiated with Amazon Web Services to waive storage fees and provide an Amazon S3 bucket to support the GenomeArk. Fedrigo is the Executive Director of the VGL and oversees the generation, upload and upkeep of data on the GenomeArk. Formenti is the Bioinformatics Lead at the VGL and oversees the generation and processing of genome sequencing data. Mahmoud is the storyteller and writer of the team. The VGP council and leaders in genomics formulated the data sharing policies. The international sequencing hubs that make up the VGP, the EBP and other projects (in the UK, Germany and around the world) generate raw sequencing data and processed genome assemblies and upload them to the GenomeArk. Bioinformatic and manual curation teams at every hub process the data and upload them to the GenomeArk and, eventually, to public archives. The GenomeArk is in constant development based on feedback from all the hubs and users, making it a truly international, collaborative effort.
The VGP was founded with the goal of producing high-quality, complete and error-free genome assemblies and sequencing data for all vertebrate species. Its founding created the need for an open-access platform that could handle large data sets and make raw data, intermediate results and final results immediately available to the community. The US (GenBank-NCBI) and European (EBI-ENA) public archives could not initially provide such capabilities, so we developed the GenomeArk between 2015 and 2017. We now also host high-quality reference genomes from the EBP, which aims to sequence the genomes of all 2 million eukaryotic species on Earth, making the GenomeArk a repository for all genomes. Our goal is to promote species conservation and enable a new era of discovery across the life sciences using comparative genomics. While the VGP contributes its data to public archives, such archives do not allow the upload and instant accessibility of genomic data in its multiple forms (raw and final). The GenomeArk provides a complementary solution that allows anyone to access the data immediately, as genomes are assembled and annotated and as analyses are performed. We designed the GenomeArk with user experience, accessibility and transparency in mind. We use a standardized data structure with dedicated schemes and conventions for file naming and folder structures. Automated quality-control validation protocols check the completeness and accuracy of the uploaded raw data before they become available. We allow only trained personnel to operate the GenomeArk (we provide free webinar training), and we have automated backups in place to protect data integrity. Only admins can upload, modify or delete data on the GenomeArk, whereas members of the public can freely download the data. We also provide collaborators with a free, dedicated repository (genomeark-upload) through which they can share their data with us.
We developed the GenomeArk on GitHub, a popular and accessible code repository, thereby making public both the current code used to create the GenomeArk and its full history. This allows reproducibility and data sharing at every level, including web design. Our project has inspired other large sequencing projects, such as the Darwin Tree of Life (DToL), to create their own data repositories. We recommend that researchers adopt our data sharing practices, specifically the transparency of the GenomeArk from data repository to web design, as this will drive biomedical discoveries forward.
We developed the GenomeArk as a novel repository for high-quality reference genomes while making use of existing processes. We directly connect the final assemblies in GenBank to our data on the GenomeArk, so that data, metadata and intermediate assemblies can be easily traced. In addition, our quality controls and data backup systems are consistent with the existing standards that maintain the integrity of publicly available data. The GenomeArk plays a complementary role to existing public archives; its unique feature is that it gives users instant access to the data, and the freedom to download and analyze them in multiple forms, as soon as they are generated. We used easily accessible platforms such as Amazon Web Services to support the storage of our large data sets. We developed the GenomeArk using GitHub, which tracks the entire state of the code at any point in time. This improves the quality of the code, allows users to identify and correct errors, makes it possible to revert to earlier versions and supports automated validation of the code. Since GitHub is open-access, users can view the current code and its history, and anyone can replicate the code used to generate the GenomeArk. A successful example of the replicability of the GenomeArk is the DToL project at the Wellcome Sanger Institute. The accessible nature of the GenomeArk allows it to serve as a model open-access data repository for any large-scale genomics project.
The availability of species-specific genome assembly data has significantly benefited the research community. Our platform has helped raise public awareness of species conservation efforts and has enabled international collaborations in the genomics community. Our data have been used to make exciting biomedical discoveries, including elucidating the adaptive evolution of immune genes in bats, resolving gene family orthologies, characterizing forebrain cell type evolution in vocal learning brain regions, and determining the breadth of the host range of SARS-CoV-2. The significance of the GenomeArk and its contribution to biomedical research was discussed in detail in a Nature special issue. The GenomeArk has been designed for and by the community: our web design, user interface and data-specific features have all been developed, and continue to be modified, based on active feedback from users. The GenomeArk gives researchers around the world the autonomy to readily download the data in any form (raw or processed) so that they can perform their own analyses. We have inspired others to use the GenomeArk as a model for open-access genome assembly data repositories. We look forward to contributing to many other discoveries and more international collaborations.