Genomic language model mitigates chimera artifacts in nanopore direct RNA sequencing

Data availability

Raw and processed data generated in this study, including dRNA-seq using the SQK-RNA002 and SQK-RNA004 kits, as well as direct cDNA sequencing of VCaP cells, have been deposited in the Gene Expression Omnibus (GEO) under accession number GSE277934. Source data are provided with this paper.

Code availability

DeepChopper (v1.2.6), implemented in Rust and Python, is open source and available on GitHub (https://github.com/ylab-hi/DeepChopper) under the Apache License, Version 2.0. The package can be installed via PyPI (https://pypi.org/project/deepchopper/) using pip, with wheel distributions provided for Windows, Linux, and macOS to ensure easy cross-platform installation. An interactive demo is available on Hugging Face (https://huggingface.co/spaces/yangliz5/deepchopper), allowing users to test DeepChopper’s functionality without local installation. For large-scale analyses, we recommend using DeepChopper on systems with GPU acceleration. Detailed system requirements and optimization guidelines are available in the repository’s documentation.
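
Before running a large-scale analysis, it can be useful to confirm that the package is installed and that a GPU is visible. The snippet below is a minimal sketch, not part of DeepChopper itself: the helper name `check_environment` is hypothetical, and the GPU check assumes PyTorch is the deep learning backend in the environment. It uses only the standard library plus an optional PyTorch import, and makes no DeepChopper API calls.

```python
from importlib import metadata
from importlib.util import find_spec


def check_environment(package: str = "deepchopper") -> dict:
    """Report the installed version of `package` and whether a CUDA GPU is visible.

    Hypothetical helper for illustration; it does not call any DeepChopper API.
    """
    try:
        # Looks up the distribution installed by pip, e.g. `pip install deepchopper`.
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment

    gpu_available = False
    # Assumption: PyTorch is the backend; skip the check if it is not installed.
    if find_spec("torch") is not None:
        import torch

        gpu_available = bool(torch.cuda.is_available())

    return {"package": package, "version": version, "gpu_available": gpu_available}


if __name__ == "__main__":
    print(check_environment())
```

If `version` is `None`, installing from PyPI with `pip install deepchopper` should resolve it. If `gpu_available` is `False`, the analysis would presumably fall back to CPU, which may be slow for large datasets.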

Acknowledgements

This project was supported in part by NIH grants R35GM142441 and R01CA259388 awarded to R.Y., and NIH grants R01CA256741, R01CA278832, and R01CA285684 awarded to Q.C.

Author information

Author notes

  1. These authors contributed equally: Yangyang Li, Ting-You Wang.

Authors and Affiliations

  1. Department of Urology, Northwestern University Feinberg School of Medicine, Chicago, IL, USA

    Yangyang Li, Ting-You Wang, Qingxiang Guo, Yanan Ren, Xiaotong Lu, Qi Cao & Rendong Yang

  2. Robert H. Lurie Comprehensive Cancer Center, Northwestern University Feinberg School of Medicine, Chicago, IL, USA

    Qi Cao & Rendong Yang

Authors

  1. Yangyang Li
  2. Ting-You Wang
  3. Qingxiang Guo
  4. Yanan Ren
  5. Xiaotong Lu
  6. Qi Cao
  7. Rendong Yang

Contributions

Y.L., T.Y.W. and R.Y. designed the study with Q.C. Y.L. and T.Y.W. performed the analysis. Q.G., Y.R. and X.L. performed the experiments. Y.L. designed and implemented the model and computational tool. Y.L., T.Y.W., Q.G. and R.Y. wrote the manuscript. R.Y. supervised this work.

Corresponding author

Correspondence to Rendong Yang.

Ethics declarations

Competing interests

R.Y. has served as an advisor/consultant for Tempus AI, Inc. This relationship is unrelated to and did not influence the research presented in this study. The other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Ying Chen, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, Y., Wang, TY., Guo, Q. et al. Genomic language model mitigates chimera artifacts in nanopore direct RNA sequencing. Nat Commun (2026). https://doi.org/10.1038/s41467-026-68571-5
