Our project involves two open-source tools for mining crystallization data for biomolecules.
One tool extracts, parses, and defines chemical details found in free text metadata in the Protein Data Bank, the worldwide repository of biomolecular structural information. The tool enables easy access to these metadata, which provide critical details about successfully solved structures.
Our second tool is a GUI for viewing crystallization images that provides a usable implementation of MARCO (MAchine Recognition of Crystallization Outcomes), a state-of-the-art neural network classifier.
Both tools are open-source and available to the community on GitHub, accompanied by tutorials; the first also includes a downloadable database hosted on Zenodo.
The team comprises two collaborators whose research focus areas are protein crystallization, image analysis, and machine learning. We collaborate on a number of projects, all focused on structural biology. We work extensively with students, who are actively involved in building the tools, and we train them both to annotate the tools on GitHub so that users can access them and to write up the techniques they develop for publication.
We handle data management and oversight through regular meetings and through work in the GitHub repository. The team is also invested in presentations and outreach that make potential users aware of the tools we’ve developed. We have worked hard to present the tools in the literature, and we always emphasize their accessibility and editability.
We also have frequent contact with tool users to fix bugs and assist with implementation.
Our main goal in developing tools for data reuse is to overcome the primary obstacle to diffraction-based biomolecular structure determination: acquiring a crystal construct of the target. Structural biology is a cornerstone of biological investigation, providing critical frameworks for understanding basic bioscience, probing disease processes, advancing drug discovery, and informing energy science research.
The Protein Data Bank (PDB) is the primary worldwide repository for macromolecular structural data, with close to 200,000 structures deposited. Download statistics for 2021 underscore the central role that structures play as a driving force in all areas of science: over 719 million files were downloaded from the PDB website that year, corresponding to nearly 2 million downloads per day.
Close to 90% of PDB data arises from diffraction-based methods, which rely on the molecule forming a crystal. Our open-source tools, developed by undergraduate student interns working with us during 2018 and 2020, use data-reuse and tool-access methods to overcome the primary bottleneck to structural biology: getting the molecule into a crystal.
Our first approach to the bottleneck uses the crystallization metadata in the PDB to better guide successful experiments. We created a tool that scrapes the PDB to access the difficult-to-parse details in free-text fields. We have made the tool available, along with the resulting database in a downloadable format for users who are less familiar with running scripts and command-line operations. The Python-based tool creates a standardized dictionary of chemical components found in crystallization conditions and converts the often irregular text into a standardized format that can be mined further. Our presentation of the tool also includes examples of how the mined data can be used to study the process of crystallization.
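To make the idea of standardizing free-text crystallization conditions concrete, here is a minimal Python sketch. It is not the tool's actual code: the function names, the synonym table, and the regular expressions are all hypothetical and were chosen only for illustration. The sketch assumes that a condition is a comma- or semicolon-separated list of "amount unit name" fragments, optionally followed by a pH.

```python
import re

# Hypothetical synonym table (illustration only): maps lowercase name
# variants found in free text to a single standardized component name.
SYNONYMS = {
    "nacl": "sodium chloride",
    "sodium chloride": "sodium chloride",
    "peg 3350": "PEG 3350",
    "peg3350": "PEG 3350",
    "polyethylene glycol 3350": "PEG 3350",
    "tris": "Tris",
    "tris-hcl": "Tris",
}

# Assumed fragment shape: "<number> <unit> [w/v] <name>", e.g. "20% W/V PEG 3350".
COMPONENT = re.compile(
    r"^(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mM|M|%)\s*"
    r"(?:\(?[wv]/v\)?\s+)?(?P<name>.+)$",
    re.IGNORECASE,
)
PH = re.compile(r"pH\s*(?P<ph>\d+(?:\.\d+)?)", re.IGNORECASE)


def normalize_unit(unit):
    """Return a canonical spelling for the few units this sketch handles."""
    unit = unit.lower()
    return {"mm": "mM", "m": "M"}.get(unit, unit)


def parse_condition(text):
    """Convert one free-text condition into a standardized dictionary."""
    components = []
    ph = None
    for fragment in re.split(r"[,;]", text):
        fragment = fragment.strip()
        # Pull out a pH value if present, then drop it from the fragment.
        ph_match = PH.search(fragment)
        if ph_match:
            ph = float(ph_match.group("ph"))
            fragment = PH.sub("", fragment).strip()
        match = COMPONENT.match(fragment)
        if not match:
            continue  # skip fragments this sketch cannot interpret
        # Collapse whitespace and lowercase before the synonym lookup;
        # unknown names pass through in lowercase.
        raw_name = " ".join(match.group("name").lower().split())
        components.append({
            "name": SYNONYMS.get(raw_name, raw_name),
            "amount": float(match.group("amount")),
            "unit": normalize_unit(match.group("unit")),
        })
    return {"components": components, "pH": ph}


if __name__ == "__main__":
    print(parse_condition("0.2M NACL, 20% W/V PEG 3350, 0.1 M TRIS-HCL PH 8.5"))
```

A table-driven lookup keeps the chemistry knowledge in data rather than in code, so new spellings can be added without touching the parser. Real free-text entries are far messier than this example, so a production parser would need many more patterns and a much larger dictionary.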
Our second tool is designed to operate at the experimental level. In experiments to identify crystallization conditions, imaging is a primary means of rapidly assessing outcomes. Our second tool is built to support the high-throughput nature of this experimental process. We provide a user-friendly GUI for visualizing crystallization experiments that also incorporates a recently developed deep learning algorithm for automated scoring of crystallization outcomes.
For both tools, we make extensive use of GitHub for hosting scripts, annotating operating instructions and tutorials, and enabling collaboration both within the team and with external users. In the PDB data mining project, we used existing methods to build tools that extract and parse free-text metadata, ultimately yielding a dictionary of chemical components important in crystallization outcomes. The scripts themselves are available and ready to run. Further, we have made the resulting database available in an easily downloadable format on Zenodo for researchers who may not be comfortable running the tool's command-line scripts for a fresh extraction. Similarly, in the GUI that we’ve built for crystallization images, a major component of the project was to create an accessible implementation of the MARCO algorithm. One of our major emphases has been making tools that are usable by researchers beyond the developer community.
A key component of data sharing and reuse is making the resources that are generated accessible to a large number and range of users. More accessible and available data resources provide an important entry point for a) students interested in pursuing these data-focused subject areas as research topics, b) researchers whose scientific endeavors can make use of the resources, and c) outreach to new communities. Our experiences with both tools have highlighted the benefits of making data resources accessible by providing both the resulting databases and the tools to construct fresh databases from continuously updated source repositories. Open-source tools and available scripts also enable a dialogue with users, which informs troubleshooting and shows how the tools are being used on different platforms and in different contexts.
Ultimately, we encourage others to share and reuse data and to implement ways for users, collaborators, students, and others to make better use of the massive data resources that are now available. The goal is to build bridges to these data resources with potential users in mind.