The Monarch Initiative is an international consortium that leads the development of key global standards and semantic data integration technologies. The Human Phenotype Ontology (HPO), used extensively for rare disease, has demonstrably improved diagnostic efficacy. The Mondo ontology reconciles disease knowledge and has created a community of best practices for disease nomenclature and identification. The LinkML, SSSOM, and GA4GH Phenopackets standards support translational data integration and make data fundamentally more reusable at scale. The Monarch Knowledge Graph (KG) integrates disease knowledge and contains 831,521 genotype-phenotype associations from 126 species to facilitate mechanism discovery for diagnostics, biodiversity, and evolutionary biology.
The Monarch Initiative tells a story of scientific enchantment, with teams that came together to address a fundamental problem in biomedicine: the lack of interoperability in genotype-phenotype data. In 2008, two projects were underway. One aimed to create interoperability across species using semantics and built algorithms to identify the causes of human diseases whose mechanisms were not yet known. The other aimed to codify the free-text phenotypic descriptions associated with diseases. Instead of competing, the two projects merged to form the Monarch Initiative and combined their clinical and semantic expertise to create one of the most robust and well-regarded ontologies in the world, the HPO. For over a decade, the Monarch team has built reproducible semantic integration methods that enable nuanced classification of phenotypic data across species and sources. The leadership team meets weekly, and each member has distinct areas of responsibility spanning Curation, Ontologies, Algorithms, and the Database & Portal. The ethos is openness and team science, and the team is but a microcosm of a much larger community of users and contributors from fields as diverse as evolutionary biology, medicine, and biodiversity.
The Monarch Initiative has filled a critical gap in the availability of genotype-phenotype data for disease diagnostics and mechanism discovery. It focuses on 1) maximizing data reuse from a disparate landscape of disease and phenotypic knowledge and data sources; 2) developing community standards and ontologies to facilitate data reuse and interoperability; and 3) improving the lives of patients with rare diseases through engagement, diagnosis, and identification of treatments.
Inconsistency in how each data source curates different associations, such as disease-variant or gene-phenotype associations, poses a significant challenge. Creating semantic tools to harmonize these data for computational reuse has required over a decade of sustained effort and coordination. Core strategies include best practices for identifier management and reconciliation, and community development of species-neutral ontologies such as Uberon for anatomy, the Gene Ontology (GO) for function, and uPheno for phenotypes. Monarch has built some of the most widely used resources for disease knowledge standardization: Mondo, Phenopackets (also an ISO standard), and HPO. Phenotyping patients using HPO has increased diagnostic yield by over 20% through use of Monarch’s semantically integrated data, which extends coverage of genotype-phenotype associations in humans from ~21% to ~84%.
Many programs assume that a new ontology or standard is needed, when extending existing resources is often more interoperable and sustainable. Reconciling ontologies after the fact requires years more work than planning for interoperability in advance. Unclear or poor data licensing is also a challenge for redistribution; we recommend the metrics on reusabledata.org to help resources make their data legally reusable. Poor identifier provisioning is another barrier, and our community best practices and identifier services help resources reconcile and manage their identifiers to enable data reuse at scale.
Monarch data reuse is notable for its scope and for enabling mechanism discovery across species and diverse domains such as rare disease diagnosis, evolutionary biology, veterinary medicine, and biodiversity. The wide-ranging impact of Monarch resources is often understated: they have standardized the entirety of anatomy within the GO, helped to realize precision medicine through patient classification, and supported other resources in their efforts to become interoperable in their schemas, identifiers, semantics, and exchange of data.
In partnership with over 40 resources, Monarch has created a public and fully provenanced KG with 32,573,812 nodes and 14,064,433 edges to enable mechanism discovery across species and to reveal the phenotypic diversity of life. A number of standards and reproducible processes are used to maintain the Monarch resources: a LinkML-based graph standard called the Biolink Model, which provides the semantic structure for data harmonization and includes source provenance and evidence; Koza, a data transformation pipeline that supports ingest and curation of a highly heterogeneous landscape of data sources; the Simple Standard for Sharing Ontological Mappings (SSSOM), used to harmonize the ontologies used by the sources; and generalizable standards such as JSON-LD and Compact URIs. Automated quality control and manual curation processes are in place to ensure the quality and accuracy of the data we reuse and of the relations between ontology terms. Monarch’s computable evidence, provenance, and attribution model is used by GA4GH, ClinGen, and others for relevance ranking and interpretation. The curation process involves members of the Monarch team and domain experts from a diversity of fields across species. Involving stakeholder communities not only raises the value of the shared knowledge but has also ensured the scalability and sustainability of Monarch. All of the code and standards are published on GitHub, and the entirety of the Monarch platform is reproducible by others.
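To make the role of Compact URIs and SSSOM-style mappings more concrete, the following is a minimal sketch in Python, not Monarch's actual pipeline code. It assumes a small illustrative prefix map and two made-up mapping rows: the namespace `SRCA`, its URL, and the identifiers `MONDO:9999999` and `MONDO:9999998` are hypothetical placeholders, and the function names are our own. The column names `subject_id`, `predicate_id`, and `object_id` follow the SSSOM convention.

```python
# Illustrative prefix map. The OBO PURL pattern is the usual one for OBO
# ontologies; "SRCA" is a hypothetical source namespace used only here.
PREFIX_MAP = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "SRCA": "https://example.org/source-a/",
}


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand a Compact URI such as 'HP:0001250' into a full URI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"Unknown or malformed CURIE: {curie!r}")
    return prefix_map[prefix] + local_id


def contract_uri(uri, prefix_map=PREFIX_MAP):
    """Contract a full URI back into a CURIE; the longest namespace wins."""
    matches = [(ns, p) for p, ns in prefix_map.items() if uri.startswith(ns)]
    if not matches:
        raise ValueError(f"No known namespace for URI: {uri!r}")
    namespace, prefix = max(matches, key=lambda m: len(m[0]))
    return f"{prefix}:{uri[len(namespace):]}"


# Hypothetical SSSOM-style rows; the identifiers are placeholders, not real
# curated mappings.
MAPPINGS = [
    {"subject_id": "SRCA:123", "predicate_id": "skos:exactMatch",
     "object_id": "MONDO:9999999"},
    {"subject_id": "SRCA:456", "predicate_id": "skos:broadMatch",
     "object_id": "MONDO:9999998"},
]


def build_exact_lookup(mappings):
    """Keep only exact matches, which are safe to use for identifier rewriting."""
    return {
        row["subject_id"]: row["object_id"]
        for row in mappings
        if row["predicate_id"] == "skos:exactMatch"
    }


def harmonize(curie, lookup):
    """Rewrite a source identifier to its mapped term; leave others unchanged."""
    return lookup.get(curie, curie)


if __name__ == "__main__":
    lookup = build_exact_lookup(MAPPINGS)
    print(expand_curie("HP:0001250"))
    print(harmonize("SRCA:123", lookup))
    print(harmonize("SRCA:456", lookup))
```

In this sketch only `skos:exactMatch` rows are used to rewrite identifiers; broader or narrower matches are left untouched, on the assumption that collapsing them would merge concepts that are not equivalent.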
While some may view Monarch as a “research data parasite,” the reality is that through community partnerships and Monarch’s extensive data reuse and harmonization, many sources have improved their data provisioning, standardization, licensing, and community engagement. The result is greater rigor, reproducibility, and data quality, from which all data resources benefit. The socio-technical engineering strategies used to bring diverse community members together (e.g., fission yeast, rodent, and amphibian researchers) have led to some of the most widely utilized and contributed-to ontologies in biomedicine, as well as democratized access to knowledge (e.g., Mondo). The impact is substantial: patients have more equitable access to a diagnosis, and many informatics resources have been “lifted up” through their use of and contribution to Monarch-led standards. Monarch’s leadership in creating interoperability across languages, domains, contexts (clinical, basic, patient), and sources has set new community expectations and is changing the culture in biomedicine. Data reuse at this scale has made building on each source easier, faster, and more rigorous, and has fostered unprecedented cross-disciplinary collaboration and attribution.