Researchers aspire to disseminate their data, but they often do not recognize the specialized terminology embedded in their discussions. This terminology can obscure their data, materials, or techniques from individuals beyond their specific domain, or even beyond their immediate research group. Yet science manuscripts often pass peer review without their experimental details being vetted for accuracy, clarity, or completeness. These deficiencies in communication are only found after publication, often painfully, by readers hoping to understand, find, or repeat the experiments in their own labs. They are also found by database curators, whose job is to make those data more accessible and interoperable. With the ever-growing corpus of science publications, database curators cannot keep up with well-reported articles, let alone tease out data that were deficiently reported. microPublication is a novel journal paradigm that takes a major step toward addressing the publication of deficient or opaque data by embedding scientific copy editing and database-level curation within the publication workflow. Including database curators in our workflow allows us to vet and validate domain-specific standards in nomenclature and experiment reporting. That is, our pipeline has errors corrected BEFORE publication. This re-architecture of the publishing workflow removes numerous obstacles to making data reusable and has far-reaching implications for scholarly communication.
microPublication Journals: Ensuring data reusability by bridging data producers with data annotators
Recipe
1 Publishing Venue
1 Publishing Team (Managing editors, Science Officers, Authors, Reviewers)
1 Institutional Library Archiver
1 Science Community Network that includes at least
1 Database, 1 Database Curator, and 1 Community-respected and trusted scientist
Biological Copy-editing with automated BioEntity identification, verification, and linking
1 Article limited in size to a curatable nugget of information
microPublication Biology is a peer-reviewed journal that redesigns the scholarly communication workflow to make data validation and curation part of the publishing process. With a focus on single-experiment results, our platform has sped up the rate at which data reach the public. Such results were often left out of the typical science article for a number of reasons: they were negative results, they belonged to a project that was dropped, or they were cut from the narrative for lacking novelty or not being ground-breaking. Getting these data to the public is a first step in making data FAIR. However, as any researcher can tell you, not all data that are published are available, nor are they necessarily published correctly.
Our novel publishing architecture provides a critical line of communication between authors and community-expert curators. These curators provide ‘scientific’ copy editing, ensuring that the reported data adhere to community standards, including adherence to nomenclature rules and completeness of data reporting. Because these curators are placed within the publishing workflow, errors in reporting are corrected before they are enshrined in the literature.
An unplanned benefit is that authors become trained to see and address errors and omissions, helping them in future manuscript preparations. Since a good portion of our authors are early-career academics (undergraduate and graduate students), this early exposure to the relevance of nomenclature and reporting standards may have a long-lasting positive impact on their future scholarly communications.
Our system is open for any publisher to adopt; in fact, we hope all science publishers will make room in their publishing workflows for ‘scientific’ copy editing and database curation.
Our team uses automated named-entity recognition scripts and database curators to vet and annotate science nomenclature and results as part of the publication workflow, that is, BEFORE publication. This removes at least three major time-and-effort roadblocks to data reuse: (1) identifying data that need to be captured and assigning machine-friendly identifiers where appropriate; (2) eliminating errors in authors’ reporting of their data and ensuring they adhere to community standards; and (3) curating the data into the relevant community database for wider accessibility. The end results are well-described and annotated data and metadata that adhere to community standards. These articles and data are indexed and findable on PubMed as well as in community databases. Data and metadata are licensed as CC BY 4.0, so they are freely available for reuse by the community as long as the primary source is cited, thereby establishing and encouraging provenance.
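To illustrate step (1), the sketch below shows one minimal, dictionary-based approach to recognizing gene mentions and attaching machine-friendly identifiers; the lexicon, regular expression, and WormBase-style identifiers are illustrative placeholders, not our production pipeline.

```python
import re

# Illustrative lexicon mapping community-standard gene names to
# machine-friendly database identifiers (WormBase-style gene IDs).
LEXICON = {
    "daf-16": "WBGene00000912",
    "unc-119": "WBGene00006843",
}

# Rough pattern for C. elegans-style gene names: 3-4 lowercase letters,
# a dash, and a number.
GENE_PATTERN = re.compile(r"\b[a-z]{3,4}-\d+\b")

def vet_entities(text):
    """Tag each candidate gene mention as linked (known identifier) or
    flagged for curator review (absent from the community lexicon)."""
    report = []
    for mention in GENE_PATTERN.findall(text):
        if mention in LEXICON:
            report.append((mention, "linked", LEXICON[mention]))
        else:
            report.append((mention, "flag-for-curator", None))
    return report

# A flagged mention ("dafo-16") would be routed back to the author
# for correction BEFORE publication.
print(vet_entities("Expression of daf-16 was reduced; dafo-16 is a typo."))
```

In practice, the flagged mentions are exactly where the curator-author conversation described above happens, before the error is enshrined in the literature.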
Hyperlinking of BioEntities is not a novel effort. The Genetics Society of America has already implemented our hyperlinking pipeline in their publishing workflow for their journals, Genetics and G3: Genes | Genomes | Genetics. This pipeline has served as a prototype for assessing the usefulness of such an effort. microPublication Biology seeks to take the project further by (1) expanding the linking tool set to invite more stakeholders in science publishing and data use to take part, and (2) building curation tools and data sharing pipelines for adoption by relevant databases.
microPublication expands data sharing and usability laterally as well as vertically in the scholarly communication space. We bring databases and libraries forward in the publishing process, speeding the rate at which data are distributed and discovered. We are also building tools that greatly speed data validation and curation, increasing the value of data for sharing at an earlier stage.
microPublication brings data to the databases rather than requiring database curators to ferret out data they can curate through routine PubMed searches. During our workflow, we (1) have the author declare the species they work with upon submission; (2) identify errors and omissions in the authors’ reporting of their data, such as typos and incomplete or inconsistent reporting, that would normally stall curation at the relevant database; and (3) have authors correct and/or clarify these errors and omissions according to community standards. Databases invest heavily in automating the identification of publications that contain curatable data. For model organism databases, one huge stumbling block to identifying relevant papers is knowing what species the authors are reporting on. While species recognition does not seem like a difficult problem, it turns out to be a major issue for curators and ultimately researchers, such as when a researcher uses a gene name that could be Mouse, Rat, or Human. By requiring species identification upon article submission, microPublication reduces the time and effort that down-the-line entity recognition and data extraction demand, for us as well as for the databases.
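The sketch below shows why the declared species matters for entity resolution: the same gene symbol can resolve to different identifiers in different organisms, so a bare symbol is ambiguous. The symbol-to-identifier table is illustrative.

```python
# Illustrative cross-species lookup: the same gene symbol names different
# genes in different organisms, so a bare symbol cannot be linked.
SYMBOL_TO_ID = {
    ("Sod1", "Mus musculus"): "MGI:98351",
    ("Sod1", "Rattus norvegicus"): "RGD:3731",
}

def resolve(symbol, declared_species):
    """Resolve a gene symbol using the species the author declared at
    submission; without that declaration the mention stays ambiguous."""
    try:
        return SYMBOL_TO_ID[(symbol, declared_species)]
    except KeyError:
        raise ValueError(f"{symbol!r} is not a recognized symbol for {declared_species}")

print(resolve("Sod1", "Mus musculus"))       # -> MGI:98351
print(resolve("Sod1", "Rattus norvegicus"))  # -> RGD:3731
```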
By implementing database curation efforts within our workflow, we have a catalytic impact on science communication in general. Curators who annotate data from the published literature often are not aware of, or do not have access to, an article until months, if not years, after the work is published. At that point, authors are not likely to remember, or have access to, the details of the reagents they used or the specifics of an experiment that needs clarification; however, these details are critical for proper data reuse. Our system allows curators to interact with authors at the time of data reporting, when authors are available and incentivized to receive guidance from curators on clarifying the reporting of their research. The feedback that curators give has tremendous value and impact on published research, since the level of detail at which a curator evaluates the work differs from that of a peer reviewer, who is tasked with a broader assessment of the research.
microPublications are brief articles, focused on a single experiment, that capture data that would otherwise not be published, publicly available, or trackable. The size and focus of our articles allow reviewers and curators to be more comprehensive in their feedback than they could be with larger articles, which can typically contain 10 experiments. microPublication authors receive incredibly detailed feedback on their one or two experiments, and on their communication of the results, to a degree that would not be possible with a multi-experiment article. The outcome of work within our platform is an article whose data reporting and communication have been seriously vetted.
BioEntity linking has also been employed with the GSA journals. It provided key communication between authors and database curators for BioEntity vetting, with the journal playing a key role in ‘enforcing’ author compliance with correcting errors. The process was deemed useful, but it was an isolated step in the workflow, with no easy line of communication between curator and author. microPublication has expanded on this process by embedding the tools and curators within the publishing process and is building curation tools within the platform, the ultimate goal being to train authors to curate their own work.
Our recipe for making data more usable relies on understanding and adhering to community nomenclature and data reporting standards. These standards are built and used by the curators of community-expert databases. While any stakeholder in scholarly communication can reach out at any time for guidance and input from a curator, we found that curator input is most efficient and effective at the time one’s data are published: authors are then most likely to follow through with making their data comply with community standards and to fix any errors that may affect the reporting of their data. We support and encourage authors and other publishers to reach out to the relevant knowledgebases at this point to have their articles vetted for compliance with community standards, rigor, and reproducibility.
For a peer-reviewed journal, recognition and trust are everything. If we are an awardee, we will publish blog posts on the microPublication Biology website and have all participating databases post an announcement. We will include DataWorks on our fliers and steel-cut stickers. It will be loud.
Anyone who wants to adopt curation into their publishing workflow would need to be able to accommodate database-specific validation pipelines and workflows, as sketched below.
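As one way to picture this, the sketch below registers per-database validators behind a common interface; the database name and the nomenclature rule are hypothetical stand-ins for whatever each community database actually requires.

```python
import re

# Registry of database-specific validators behind a common interface.
VALIDATORS = {}

def register(database):
    """Decorator that files a validation function under a database name."""
    def wrap(fn):
        VALIDATORS[database] = fn
        return fn
    return wrap

@register("WormBase")
def validate_wormbase(entities):
    # Hypothetical rule: C. elegans gene names are 3-4 lowercase letters,
    # a dash, and a number. Return the entities that break the rule.
    return [e for e in entities if not re.fullmatch(r"[a-z]{3,4}-\d+", e)]

def validate(database, entities):
    """Run the registered validator; nonconforming entities are returned
    so they can be sent back to the author for correction."""
    return VALIDATORS[database](entities)

print(validate("WormBase", ["daf-16", "DAF16"]))  # -> ['DAF16']
```

A registry like this lets a publisher plug in one validator per partner database without changing the surrounding publishing workflow.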
The first step is to reach out and establish a good collaboration with the community-supporting database, that is, the database that establishes nomenclature standards for, and annotates, the published data. Its curatorial staff are the experts who will determine what is important to vet and how to capture data useful to the community.
The microPublication Biology team came together in 2016 with the vision of building a system in which expert curation is part of the publishing workflow rather than an unsustainable task performed by database curators after publication. The core team, composed of two established C. elegans researchers, two Scientific Curators, a Software Architect, and a strong Software Developer, had strong ties with the Model Organism Resource WormBase, the authoritative knowledgebase dedicated to the model organism Caenorhabditis elegans. WormBase was built specifically to manage data produced by the worm community: WormBase curators identify, capture, standardize, and annotate data from the published literature to make those data more findable, accessible, interoperable, and reusable by the community. Unfortunately, with the rapid pace of scientific discovery, we all knew that we would not be able to keep up with the ever-growing body of published literature without the help of the authors who generate the data in the first place.
Our model of focusing on single experiments has resulted in rapid and transparent dissemination of scientific research findings that have proved valuable to our communities. These qualities allow us to recruit top scientists in the biological sciences, and our editorial team has expanded to over 50 people serving in different capacities: Scientific Curators, Managing Editors, Science Officers (Senior Scientific Editors), and Scientific Curators from various databases. Managing Editors liaise daily with researchers, bioinformaticians, librarians, and other stakeholders to make sure that the journal follows best practices in data sharing and data management. We also collaborate with the California Institute of Technology Library, which serves as our publisher, policy adviser, and article data archiver. We regularly meet with Caltech Library staff to discuss best practices for collecting article metadata.