New AI Tool, ShortStop, Searches the Genome for Microproteins

Proteins sustain life as we know it, serving many important structural and functional roles throughout the body. But these large molecules have cast a long shadow over a smaller subclass of proteins called microproteins. Microproteins have been lost in the 99% of DNA disregarded as “noncoding,” hiding in vast, dark stretches of unexplored genetic code. But despite being small and elusive, their impact may be just as big as that of larger proteins.

Salk Institute scientists have now developed a computational tool, ShortStop, that allows them to explore the mysterious dark side of the genome in search of microproteins. Using ShortStop, researchers can probe genetic databases and identify DNA stretches in the genome that likely code for microproteins. Importantly, ShortStop also predicts which microproteins are most likely to be biologically relevant, saving time and money in the search for microproteins involved in health and disease.

ShortStop shines a new light on existing datasets, spotlighting microproteins formerly impossible to find. The Salk team reported using the tool to analyze a lung cancer dataset, finding 210 new microprotein candidates, including one standout validated microprotein, that may make good therapeutic targets in the future.

“Most of the proteins in our body are well known, but recent discoveries suggest we’ve been missing thousands of small, hidden proteins—called microproteins—coded by overlooked regions of our genome,” said Alan Saghatelian, PhD, professor and holder of the Dr. Frederik Paulsen Chair at Salk. “For a long time, scientists only really studied the regions of DNA that coded for large proteins and dismissed the rest as ‘junk DNA,’ but we’re now learning that these other regions are actually very important, and the microproteins they produce could play critical roles in regulating health and disease.”

Senior author Saghatelian and colleagues reported their findings in BMC Methods, in a paper titled “ShortStop: a machine learning framework for microprotein discovery,” concluding “ShortStop addresses a key gap in microprotein research—the lack of scalable tools to characterize microproteins and standardized negative training data to train machine learning models for microproteins.”

Compared to standard proteins that can range from hundreds to thousands of amino acids long, microproteins typically contain fewer than 150 amino acids, making them harder to detect using standard protein analysis methods. The authors wrote, “The human UniProt/Swiss-Prot database includes over 20,000 well-characterized proteins, but only about 10% are microproteins—proteins shorter than 150 amino acids…It is still unclear whether this low number is due to true biological limits or because many microproteins have not been discovered yet.”

Instead of searching for the microproteins themselves, scientists can search large, publicly available datasets for the DNA sequences that make them. They know that certain stretches of DNA called small open reading frames (smORFs) can contain the instructions for making microproteins. But while current experimental methods have already cataloged thousands of smORFs, these tools remain time-consuming and expensive. Furthermore, not all smORFs translate to biologically meaningful microproteins. Existing methods can’t discriminate between functional and nonfunctional microprotein-generating smORFs. This difficulty separating potentially functional microproteins from nonfunctional microproteins has stalled their discovery and characterization. “Thousands of smORFs are actively translated, but it remains unclear which give rise to bioactive microproteins,” the team stated. This means that scientists must independently test each microprotein to determine whether it is functional or not.
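For readers who want a concrete picture of what a smORF is, the following is a minimal Python sketch of scanning a DNA string for short open reading frames. It is an illustration only and does not represent ShortStop or any published pipeline: the function name `find_smorfs`, the `min_aa` floor of 10 residues, and the restriction to ATG start codons on the forward strand are all assumptions made for brevity. The upper bound of 150 amino acids follows the definition quoted earlier in the article.

```python
# Illustrative only: a naive forward-strand scan for small open reading frames.
# Names and thresholds (find_smorfs, min_aa) are hypothetical, not from ShortStop.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def find_smorfs(dna, min_aa=10, max_aa=150):
    """Return (start, end, n_residues) for each ATG-to-stop ORF with
    min_aa <= n_residues < max_aa, scanning the three forward frames."""
    dna = dna.upper()
    hits = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_residues = (i - start) // 3  # stop codon not counted
                if min_aa <= n_residues < max_aa:
                    hits.append((start, i + 3, n_residues))
                start = None
    return hits


# A made-up sequence: ATG, eleven GCT codons, then a TAA stop.
toy = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
print(find_smorfs(toy))  # [(2, 41, 12)]
```

A real analysis would also consider the reverse strand, alternative start codons, and evidence of translation, none of which this sketch attempts.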

Cells express a novel ShortStop-predicted microprotein (green), with cell nuclei stained blue. The pattern suggests microproteins are localized either in endosomes, which are organelles responsible for sorting and transporting cellular cargo, or in lysosomes, which are organelles that collect and remove cellular waste. [Salk Institute]

ShortStop is a computational framework that radically alters this workflow, optimizing smORF discovery by sorting microproteins into functional and nonfunctional categories. The key to ShortStop’s two-class sorting is how it’s trained as a machine learning system. Its training relies on a negative control dataset of computer-generated random smORFs. The framework is designed to help researchers prioritize smORF-encoded microproteins for further research, the scientists commented. “ShortStop provides a much-needed foundation by generating a consistent and realistic negative training dataset, enabling machine learning tools to better distinguish between smORFs that resemble known microproteins and those that do not.”

ShortStop compares identified smORFs against these decoys to quickly decide whether a new smORF is likely to be functional or nonfunctional. “Specifically, ShortStop classifies translated smORFs based on shared protein features with either well-characterized microproteins in Swiss-Prot, referred to as SAMs (Swiss-Prot Analog Microproteins), or with artificially generated non-canonical microproteins, termed PRISMs (Physicochemically Resembling In Silico Microproteins).”
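The general idea of labeling a peptide by whether it looks more like a reference set or a decoy set can be sketched with a toy nearest-centroid classifier. This is not ShortStop's model: the two features, the function names, and every sequence below are invented for illustration and carry no biological meaning. Only the class labels SAM and PRISM come from the paper.

```python
# Toy two-class sketch, NOT ShortStop's model or features.
# All sequences are made up; feature choices are assumptions for illustration.
import math

HYDROPHOBIC = set("AVILMFWC")
CHARGED = set("DEKR")


def features(peptide):
    """Fraction of hydrophobic and of charged residues in a peptide."""
    n = len(peptide)
    return (
        sum(aa in HYDROPHOBIC for aa in peptide) / n,
        sum(aa in CHARGED for aa in peptide) / n,
    )


def centroid(peptides):
    """Mean feature vector of a set of peptides."""
    rows = [features(p) for p in peptides]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def classify(peptide, sam_refs, prism_refs):
    """Label a peptide by whichever class centroid is closer (Euclidean)."""
    x = features(peptide)
    d_sam = math.dist(x, centroid(sam_refs))
    d_prism = math.dist(x, centroid(prism_refs))
    return "SAM" if d_sam <= d_prism else "PRISM"


# Hypothetical reference sets standing in for annotated microproteins
# and computer-generated decoys.
sam_refs = ["MKKLLDEERK", "MDEKRKLEEA"]
prism_refs = ["MAVILLFWCA", "MLLIVFAWCV"]

print(classify("MKEDRKLEDK", sam_refs, prism_refs))  # SAM
print(classify("MVVLLIFAAC", sam_refs, prism_refs))  # PRISM
```

The sketch only shows why a consistent negative set matters: without the decoy centroid there would be nothing to compare a candidate against.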

ShortStop cannot definitively say whether a smORF will code for a biologically relevant microprotein, but this two-class system narrows down the experimental pool immensely. Now researchers can spend less time manually sorting through datasets and failing at the bench.

When the researchers applied ShortStop to a previously published smORF dataset, they identified eight percent as likely functional microproteins, prioritizing them for targeted follow-up.

“When applied to a published dataset of translating smORFs, ShortStop classified about eight percent as candidates with biochemical properties resembling Swiss-Prot microproteins (i.e., called SAMs),” the team reported. “The remaining 92% resembled in silico generated sequences (i.e., called PRISMs), representing noncanonical proteins, non-functional peptides, or regulatory translation events.”

First author Brendan Miller, PhD, a postdoctoral researcher in Saghatelian’s lab, added, “What makes ShortStop especially powerful is that it works with common data types, like RNA sequencing datasets, which many labs already use. This means we can now search for microproteins across healthy and diseased tissues at scale, which will reveal new insights into human biology and unlock new paths for diagnosing and treating diseases, such as cancer and Alzheimer’s disease.”

The team also analyzed genetic data from human lung tumors and adjacent normal tissue to create a list of potential functional smORFs. Among the smORFs ShortStop found, one microprotein stood out: it was expressed more in tumor tissue than normal tissue, suggesting it may serve as a biomarker or functional microprotein for lung cancer. “Among the ShortStop-identified SAMs, the most upregulated in tumors was an alternative microprotein encoded by a COL1A1 transcript (COL1A1-MP),” they wrote. “The identification of this lung cancer-related microprotein demonstrates the value of ShortStop and machine learning to prioritize candidates for future research and therapeutic development.”
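One common way to express "more upregulated in tumor than in normal tissue" is a log2 fold change of mean expression, sketched below. The article does not say how the authors computed or tested upregulation, so this is a generic illustration: the function names, the pseudocount of 1, and the smORF names and expression values are all hypothetical.

```python
# Generic fold-change ranking sketch; not the authors' analysis.
# smORF names and expression values are invented placeholders.
import math


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression.
    The pseudocount (an assumption) avoids division by zero."""
    mean_t = sum(tumor) / len(tumor)
    mean_n = sum(normal) / len(normal)
    return math.log2((mean_t + pseudocount) / (mean_n + pseudocount))


def rank_by_upregulation(expr):
    """Sort candidates from most to least upregulated in tumor."""
    scored = {name: log2_fold_change(t, n) for name, (t, n) in expr.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# name -> (tumor samples, normal samples), arbitrary units
expr = {
    "smORF_B": ([3, 3], [3, 3]),
    "smORF_A": ([7, 7], [1, 1]),
    "smORF_C": ([0, 0], [3, 3]),
}
ranked = rank_by_upregulation(expr)
print(ranked[0])  # ('smORF_A', 2.0)
```

A real comparison would use many more samples and a statistical test rather than a ratio of means alone.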

Saghatelian said, “There’s so much data that already exists that we can now process with ShortStop to find novel microproteins associated with health and disease, stretching from Alzheimer’s to obesity and beyond. My team is really good at making methods, and with data from other Salk faculty, we can integrate these methods and accelerate the science.”

In their paper the researchers concluded, “By providing a classification framework rooted in biochemical features, ShortStop offers a practical solution for targeting smORFs in functional studies, benchmarking new discovery tools, and advancing microprotein research.” They acknowledge that ShortStop is not intended for standalone use, but rather to help researchers prioritize candidates for functional studies. “Overall, ShortStop provides a computational framework for systematic microprotein discovery, allowing researchers to prioritize candidates for functional studies while also offering a foundation for future method development and benchmarking in the field.”

The post New AI Tool, ShortStop, Searches the Genome for Microproteins appeared first on GEN – Genetic Engineering and Biotechnology News.