New computational biology tool automates and standardizes genome sequencing analysis

The open-access metapipeline-DNA resource makes analyzing large and complicated data more accessible

In a single experiment, scientists can decipher the entire genomes of many patient samples, animal models or cultured cells. To fully realize the potential to study biology at this unprecedented scale, researchers must be equipped to analyze the titanic troves of data generated by these new methods.

Scientists at Sanford Burnham Prebys Medical Discovery Institute and the University of California Los Angeles published findings on March 17, 2026, in Cell Reports Methods describing how they built and tested a new computational tool for tackling massive and complex sequencing datasets. The new resource, named metapipeline-DNA, may also help standardize sequencing data analysis across research labs.

The sequence of a single human genome represents about 100 gigabytes of raw data, the rough equivalent of 20,000 smartphone photos. The sheer scale of experimental data increases significantly as tens or hundreds of genomes are added into the mix.
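As a rough illustration of that scale, the arithmetic can be sketched in a few lines of Python. The 100-gigabyte and 20,000-photo figures come from the estimate above; the function name and the decimal unit conversions (1 TB = 1,000 GB) are assumptions made for this sketch, not part of metapipeline-DNA.

```python
# Illustrative arithmetic only; both constants are the article's estimates.
GB_PER_GENOME = 100          # raw data per human genome
PHOTOS_PER_GENOME = 20_000   # the article's smartphone-photo comparison


def raw_data_terabytes(n_genomes: int) -> float:
    """Approximate raw data volume in terabytes (assuming 1 TB = 1,000 GB)."""
    return n_genomes * GB_PER_GENOME / 1000


# Photo size implied by the comparison, in megabytes (assuming 1 GB = 1,000 MB).
implied_photo_mb = GB_PER_GENOME * 1000 / PHOTOS_PER_GENOME

for n in (1, 10, 100):
    print(f"{n:>3} genomes ~ {raw_data_terabytes(n):.1f} TB of raw data")
```

Under these assumptions the comparison implies photos of about 5 megabytes each, and a 100-genome experiment approaches 10 terabytes of raw data.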

As the technology to produce this data has rapidly advanced over the last 10-15 years and become more affordable and accessible, many labs have built their own software to use for analysis, or customized open-access tools shared freely by colleagues. Some of these resources only work on specific supercomputing or cloud computing systems.

This fragmented software landscape can complicate collaboration across institutions, add difficulties when labs move to new institutions or institutions switch to new computing solutions, and contribute to a lack of standardization as well as challenges reproducing studies with different tools.

“Bioinformatics pipelines for genomic sequencing data such as metapipeline-DNA are designed to standardize analysis of all this data to make sure it is processed in a uniform way, and in a reproducible way,” said Yash Patel, MSc, a cloud and AI infrastructure architect at Sanford Burnham Prebys and co-first author of the study.

Yash Patel, MSc, is a cloud and AI infrastructure architect at Sanford Burnham Prebys. Image credit: Sanford Burnham Prebys.

“The goal is to automate quality control, determination of genetic variants and all the other analysis steps to make it much easier so that researchers do not need to write their own code to process their data.”

The metapipeline-DNA development team emphasized the software’s ability to detect and recover from common errors. Even with the powerful supercomputing clusters scientists use to analyze sequencing data, failed runs can cost days of computing time and delay new discoveries.

Paul Boutros, PhD, MBA, is director and professor in the NCI-Designated Cancer Center at Sanford Burnham Prebys and senior vice president of Data Sciences. Image credit: Sanford Burnham Prebys.

“In designing the software, we focused on making sure that the choices we present to the users are fully validated before the pipeline runs,” said Paul Boutros, PhD, MBA, director and professor in the NCI-Designated Cancer Center at Sanford Burnham Prebys and senior vice president of Data Sciences.

“In our lab, we don’t want to suffer a setback due to a preventable configuration error, and we don’t want it to happen to anyone using our pipelines.”
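The idea of validating choices before a run starts can be sketched briefly in Python. This is a minimal illustration of the general fail-fast pattern, not the metapipeline-DNA implementation: the option names, the allowed aligner values and the `RunConfig`, `validate` and `run_pipeline` names are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical option values; these are not metapipeline-DNA's real settings.
ALLOWED_ALIGNERS = {"aligner_a", "aligner_b"}


@dataclass
class RunConfig:
    sample_id: str
    input_path: str
    aligner: str
    cpus: int


def validate(config: RunConfig) -> list[str]:
    """Return every problem found, so the user can fix them all at once."""
    problems = []
    if not config.sample_id.strip():
        problems.append("sample_id must not be empty")
    if not Path(config.input_path).is_file():
        problems.append(f"input file not found: {config.input_path}")
    if config.aligner not in ALLOWED_ALIGNERS:
        problems.append(f"unknown aligner: {config.aligner}")
    if config.cpus < 1:
        problems.append("cpus must be at least 1")
    return problems


def run_pipeline(config: RunConfig) -> None:
    """Refuse to start if the configuration has any known problem."""
    problems = validate(config)
    if problems:
        # Fail fast, before any computing time is spent.
        raise ValueError("invalid configuration: " + "; ".join(problems))
    print(f"starting analysis of {config.sample_id}")
```

Collecting all problems before raising, rather than stopping at the first one, is a design choice in this sketch: it lets a user correct a configuration in one pass instead of discovering errors one failed launch at a time.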

The collaborative development process has included 43 contributors making 1,408 pull requests to enhance the underlying code, and 46 individuals submitting 1,124 suggestions, requests for features and/or reports of issues.

To improve the ability of metapipeline-DNA to determine where changes in the genome have occurred, the scientists worked with the Genome in a Bottle Consortium led by the U.S. Department of Commerce’s National Institute of Standards and Technology. By incorporating this public-private-academic consortium’s meticulously validated resources, the researchers reduced the rate of false positives without compromising the tool’s ability to find true genetic variants.

The researchers also produced two case studies demonstrating the pipeline’s capabilities for cancer research. The investigators used metapipeline-DNA to analyze sequencing data from five patients who donated both normal tissue and tumor samples to the Pan-Cancer Analysis of Whole Genomes dataset, as well as another five from The Cancer Genome Atlas.

The next step is to get metapipeline-DNA into more labs to accelerate discoveries, and to continue improving the resource with more user feedback.

“This tool should enable labs to process data without needing a lot of background in computation or computer infrastructure, and without having to optimize for their specific computing environment,” said Patel.

In addition, the authors plan to build upon this foundation to create automated, end-to-end solutions for analyzing sequencing of other biological molecules such as RNA and proteins.

“Workflows across different biomolecules can share the architecture, automation and quality control methods of metapipeline-DNA such that improvements to any single pipeline can improve the others,” said Boutros, the senior and corresponding author of the manuscript.

“We’re excited to expand to other data-intensive high-throughput sequencing techniques to continue improving the pace and efficiency of discovery in our lab, at Sanford Burnham Prebys and throughout the research community.”

Patel shares first authorship of the study with Chenghao Zhu, PhD, a research assistant professor at Sanford Burnham Prebys, and Takafumi Yamaguchi, MSc, a senior bioinformatician in the Boutros lab at Sanford Burnham Prebys.

Additional authors include:

Rupert Hugh-White, Mao Tian and Wenshu Zhang at Sanford Burnham Prebys
Nicholas K. Wang, Nicholas Wiltsie, Nicole Zeltser, Alfredo E. Gonzalez, Helena K. Winata, Yu Pan, Mohammed Faizal Eeman Mootor, Timothy Sanders, Sorel T. Fitz-Gibbon, Cyriac Kandoth, Julie Livingstone, Lydia Y. Liu, Benjamin Carlin, Aaron Holmes, Jieun Oh, John Sahrmann, Shu Tao, Stefan Eng, Kiarod Pashminehazar, Arpi Beshlikyan, Madison Jordan, Selina Wu, Jaron Arbet, Beth Neilsen, Roni Haas, Yuan Zhe Bugh, Gina Kim, Joseph Salmingo, Wenyan Zhao, Aakarsh Anand, Edward Hwang, Anna Neiman-Golden, Philippa Steinberg, Prateek Anand, Raag Agrawal and Brandon L. Tsai at the University of California Los Angeles

The study was supported by the National Institutes of Health, National Cancer Institute, Department of Defense, Department of Health and Human Services, National Library of Medicine, Canadian Institutes of Health Research, Jonsson Comprehensive Cancer Center, Howard Hughes Medical Institute and Prostate Cancer Foundation.

The study’s DOI is 10.1016/j.crmeth.2026.101340.
