Submission

introduction

title

Normalizing imaging data towards replicability

short description

We provide a computationally efficient layer that removes batch effects in imaging data, for implementation in public image data repositories.

Submission Details

Submission Category

Data reuse

Abstract / Overview

With the exponential growth in image data collection, more advanced analyses are focusing on making full use of high-dimensional images to improve personalized prediction and prevention strategies. However, batch effects arising from different machines often influence quantitative image features and can produce overwhelming variability between analyses. This places a heavy burden on the replicability of results and on data reuse. We provide a layer of normalization for digital imaging data, including high-resolution images such as the mammograms in our project, each composed of 13 million pixels. This technique is being applied to several datasets to ensure uniform interpretation and dissemination.
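To make the idea of a batch-effect normalization layer concrete, here is a minimal sketch of one generic approach: standardizing pixel intensities within each batch (for example, each machine) so that all batches share a common location and scale. This is an illustration only and is not the submission's method, which is not described here. The function name `normalize_batches`, the batch labels, and the toy pixel values are all hypothetical.

```python
from statistics import mean, pstdev


def normalize_batches(images, batches):
    """Standardize pixel intensities within each batch (illustrative only).

    images  : list of images, each a flat list of pixel intensities
    batches : list of batch labels (e.g. machine IDs), one per image
    Returns new images rescaled so each batch has mean 0 and std 1.
    """
    # Pool pixel values by batch so each machine gets its own mean and spread.
    pooled = {}
    for img, label in zip(images, batches):
        pooled.setdefault(label, []).extend(img)

    stats = {}
    for label, values in pooled.items():
        mu = mean(values)
        sigma = pstdev(values)
        # Guard against a constant batch, where the spread would be zero.
        stats[label] = (mu, sigma if sigma > 0 else 1.0)

    return [
        [(p - stats[label][0]) / stats[label][1] for p in img]
        for img, label in zip(images, batches)
    ]


# Toy data (hypothetical): machine "B" reads brighter and with more contrast.
images = [[10, 20, 30], [20, 30, 40], [110, 140, 170], [140, 170, 200]]
batches = ["A", "A", "B", "B"]
normalized = normalize_batches(images, batches)
```

After this step, images from the two hypothetical machines are on the same scale, so downstream features no longer reflect the machine-specific offset and contrast. A real layer for mammograms with millions of pixels would need array-based computation and a more careful statistical model; this sketch shows only the location-and-scale idea.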

Team

Dr. Shu Jiang is an Assistant Professor in the Division of Public Health Sciences at Washington University. She has strong training in statistical theory and methodology. In 2021, Dr. Jiang received the prestigious MERIT award from the NIH for her breast cancer methodology study. In 2022, she was recognized with the 40 Under 40 Public Health Catalyst Award by the Boston Congress of Public Health for her work in breast cancer prevention.

Dr. Hufeng Zhou is a Research Scientist in the Department of Biostatistics at the Harvard T. H. Chan School of Public Health. He is a dedicated computational biologist with more than a decade of broad experience in proteomics, transcriptomics, and genomics research, and an experienced developer of bioinformatics databases. He has extensive experience leading a bioinformatics research team as a junior faculty member at Harvard Medical School.

Drs. Jiang and Zhou were officemates during their postdoctoral fellowships at the Harvard School of Public Health. Dr. Jiang will be responsible for statistical methods development, while Dr. Zhou will bring expertise in computational infrastructure for implementation in public databases. They will meet weekly to move the project forward.

Potential Impact

For the past year, we have been working extensively with high-resolution digital mammogram imaging data from the Joanne Knight Breast Health Cohort at Washington University and the Nurses Study at Harvard University. We encountered batch effects in these data and have been working to solve this problem ever since.
Following the FAIR principles (Findable, Accessible, Interoperable, and Reusable), we adopted centralized cloud-based data management that allows convenient use of open-source methods with detailed annotation and documentation. Drs. Jiang and Zhou each maintain an open-source repository for this purpose.
We recommend that everyone follow the FAIR principles, because good data management supports scientific discovery in a more efficient and replicable way. This has been highly recommended by the NIH since 2016.
Our way of reusing and sharing data is compelling because both data and analytical methods are now moving to the cloud. For big datasets such as imaging data, it is crucial that the first layer of data preprocessing be uniform across studies and backed by substantial computing power, in support of the FAIR principles. Our proposed approach allows scientists to focus solely on discovery and innovation with big biomedical imaging data, and so facilitates more efficient and replicable data analysis.

Replicability

As in genetic studies, adjusting for batch effects is crucial to ensure replicability across studies. However, this is a new challenge for images, and we have drawn on our previous experience with genetic data in applying these ideas to images.
The proposed methods are robust, transparent, and replicable: we report measures of statistical uncertainty, such as p-values and confidence intervals, so that others can directly replicate the results.

Potential for Community Engagement and Outreach

"Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process." [Wilkinson MD. et al. Scientific Data (2016)]

The key to accelerating scientific discovery and maximizing societal benefits lies in data sharing and reuse. Through data sharing and reuse, the dissemination of scientific results will be of interest not only to the statistical and bioinformatics communities, but also to members of the general population, whose routine digital imaging data are held in electronic health records. This can increase their awareness of and involvement in their own personalized health care, so that each individual can contribute their own part.

Therefore, maximizing data sharing and reuse will improve the quality of health care research and the well-being of the general population, and will bring societal benefits that extend into the future.

Supporting Information (Optional)

Supporting Documentation 01

https://pubmed.ncbi.nlm.nih.gov/34854477/

Supporting Documentation 02

https://pubmed.ncbi.nlm.nih.gov/34435196/

Supporting Documentation 03

https://favor.genohub.org/
