Data availability
Raw and processed data generated in this study, including dRNA-seq using the SQK-RNA002 and SQK-RNA004 kits, as well as direct cDNA sequencing of VCaP cells, have been deposited in the Gene Expression Omnibus (GEO) under the accession number GSE277934. Source data are provided with this paper.
Code availability
DeepChopper (v1.2.6), implemented in Rust and Python, is open source and available on GitHub (https://github.com/ylab-hi/DeepChopper) under the Apache License, Version 2.0. The package can be installed via PyPI (https://pypi.org/project/deepchopper/) using pip, with wheel distributions provided for Windows, Linux, and macOS to ensure easy cross-platform installation. An interactive demo is available on Hugging Face (https://huggingface.co/spaces/yangliz5/deepchopper), allowing users to test DeepChopper’s functionality without local installation. For large-scale analyses, we recommend using DeepChopper on systems with GPU acceleration. Detailed system requirements and optimization guidelines are available in the repository’s documentation.
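As a minimal sketch of verifying an installation obtained with `pip install deepchopper`, the snippet below uses only the Python standard library to look up the installed distribution and compare it with the version reported here. The helper names `installed_version` and `describe_install` are hypothetical and are not part of DeepChopper; the check assumes the distribution is registered under the PyPI name `deepchopper`.

```python
from importlib import metadata
from typing import Optional

# Version reported in this section; the PyPI name is taken from the URL above.
REPORTED_VERSION = "1.2.6"
DIST_NAME = "deepchopper"


def installed_version(dist_name: str = DIST_NAME) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def describe_install(found: Optional[str], expected: str = REPORTED_VERSION) -> str:
    """Summarise how the installed version relates to the reported one."""
    if found is None:
        return f"{DIST_NAME} is not installed; try: pip install {DIST_NAME}"
    if found == expected:
        return f"{DIST_NAME} {found} matches the reported version"
    return f"{DIST_NAME} {found} is installed; the reported version is {expected}"


print(describe_install(installed_version()))
```

A mismatch is not necessarily a problem, since newer releases may have been published after the version reported here.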
Acknowledgements
This project was supported in part by NIH grants R35GM142441 and R01CA259388 awarded to R.Y., and NIH grants R01CA256741, R01CA278832, and R01CA285684 awarded to Q.C.
Ethics declarations
Competing interests
R.Y. has served as an advisor/consultant for Tempus AI, Inc. This relationship is unrelated to and did not influence the research presented in this study. The other authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Ying Chen, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Y., Wang, TY., Guo, Q. et al. Genomic language model mitigates chimera artifacts in nanopore direct RNA sequencing. Nat Commun (2026). https://doi.org/10.1038/s41467-026-68571-5
