Recent advances in genome sequencing have improved many aspects of human genome analysis. One area that has benefited is variant calling: identifying and cataloging genetic variations, such as SNPs and indels, from sequencing data. Variant calling has improved not only across the genome as a whole, but also in complex regions that were previously unapproachable. Now, new research describes the Platinum Pedigree benchmark, a comprehensive truth set of genomic variation that characterizes both simple and complex variation.
This work is published in Nature Methods in the paper, “The Platinum Pedigree: a long-read benchmark for genetic variants.”
The Platinum Pedigree, which was built by scientists from PacBio in collaboration with researchers at the University of Washington, the University of Utah, and several other institutions, was developed using deep sequencing from three sequencing platforms across a 28-member, multi-generational family (CEPH-1463). More specifically, the group used Mendelian inheritance in the large pedigree to filter variants across PacBio high-fidelity (HiFi), Illumina, and Oxford Nanopore Technologies platforms.
By tracking the inheritance of genetic variants from parents to multiple children, the study catalogs over 37 Mb of genetic variation segregating within the family, ranging from single-nucleotide variants to large structural variants.
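The idea of filtering by Mendelian inheritance can be illustrated with a minimal Python sketch. This is not the consortium's pipeline: it is a simplified check on made-up genotypes, assuming diploid calls encoded as pairs of allele indices (0 for reference, 1 for alternate). The function names, site labels, and data are all hypothetical. A site is kept only if every child's genotype can be explained by one allele from each parent.

```python
# Simplified illustration of Mendelian-consistency filtering.
# All names and genotypes here are hypothetical; this is NOT the
# Platinum Pedigree pipeline, only a sketch of the general idea.

Genotype = tuple[int, int]  # diploid call, e.g. (0, 1) for heterozygous


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """Return True if the child could have inherited one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def filter_sites(sites: list[dict]) -> list[str]:
    """Keep the IDs of sites where every child is consistent with both parents."""
    kept = []
    for site in sites:
        if all(
            is_mendelian_consistent(child, site["mother"], site["father"])
            for child in site["children"]
        ):
            kept.append(site["id"])
    return kept


# Toy data: three sites in one nuclear family with several children.
SITES = [
    {"id": "site_A", "mother": (0, 1), "father": (0, 0),
     "children": [(0, 1), (0, 0), (0, 1)]},
    # site_B: a child carries an allele that neither parent has, so it is dropped.
    {"id": "site_B", "mother": (0, 0), "father": (0, 0),
     "children": [(0, 0), (0, 1)]},
    {"id": "site_C", "mother": (1, 1), "father": (0, 1),
     "children": [(1, 1), (0, 1)]},
]

if __name__ == "__main__":
    print(filter_sites(SITES))
```

In a family with many children, a sequencing or calling error is unlikely to produce a pattern that stays consistent across every child, which is why a large pedigree is a strong filter. A check like this would also flag genuine de novo mutations as inconsistent, so a real pipeline would need to handle those separately.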
The dataset introduces the first large pedigree-validated tandem repeat and structural variant truth sets. The team generated a variant map “with over 4.7 million single-nucleotide variants, 767,795 insertions and deletions (indels), 537,486 tandem repeats and 24,315 structural variants, covering 2.77 Gb of the GRCh38 genome.”
The benchmark also adds more than 200 million bases, extending the benchmark regions to 2.77 Gb. The added regions include difficult-to-map areas such as segmental duplications and low-complexity regions, and the benchmark contains eight percent more small variants.
“Comprehensive benchmarking datasets that include all variant types are foundational to progress in genomics methods development and the application of AI-driven tools, as well as to our understanding of genomic variation for both research and diagnostic purposes,” said Zev Kronenberg, PhD, Senior Manager at PacBio. “The Platinum Pedigree benchmark doesn’t just include simple variants in easy-to-sequence regions, it includes variants from across the entire genome, including regions that were previously excluded from benchmarks due to their complex nature.”
To demonstrate how better benchmarks can improve AI and ML methods, the researchers retrained Google’s DeepVariant, a popular software tool that employs deep learning to identify genetic variants, using the Platinum Pedigree benchmark data. The updated DeepVariant model reduced errors by up to 34% genome-wide, with even higher gains in the most challenging regions of the genome.
These types of reference datasets are critical to advancing clinical research and to powering the use of genomics in diagnosing rare diseases, understanding cancer heredity, and more.
The Platinum Pedigree benchmark is freely available and already being used by scientists to develop new sequence analysis tools and validate clinical sequencing workflows. It also provides a roadmap for future benchmarking efforts, especially those involving more complete genomes like T2T-CHM13.
The full dataset, analysis code, and pipelines are publicly available at: https://github.com/Platinum-Pedigree-Consortium.
The post Platinum Pedigree Benchmark Boosts AI Accuracy in Genomic Variant Detection appeared first on GEN – Genetic Engineering and Biotechnology News.