Data availability
The spectral data generated in this study for LC-MS/MS analysis of the data-bearing proteins have been deposited in the MassIVE database under accession code MSV000098849 [https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=7f1745678fc846728d68438e2f742071]. Source data are provided with this paper.
Code availability
The code of the in-house sequencing program is protected owing to patent restrictions, but may be made available for academic exchange and collaboration on email request to the corresponding author, with an expected response time of about one week.
Acknowledgements
We thank Prof. Yanxiang Zhao and Prof. Clarence Chun Ting Wong (PolyU) for their help and useful discussions on this project. This work was supported by the National Key Research and Development Program of China (Grant No. 2024YFF0725800, Z.P.Y.), the Hong Kong Research Grants Council (Grant Nos. R5013-19F, C5026-24GF, C4002-20WF, C4014-23G, CRS_CUHK405/23 and AoE/M-402/25-N, Z.P.Y.), the Faculty of Science (Grant No. 1-WZA2, Z.P.Y.), the University Research Facility in Chemical and Environmental Analysis, and the University Research Facility in Life Sciences of The Hong Kong Polytechnic University.
Ethics declarations
Competing interests
The authors declare the following competing interests: Y.Z., C.C.A.N., C.L., and Z.P.Y. are inventors on a related patent entitled “Data storage using proteins” (Zhongping Yao, Yin Zhou, Cheuk Chi Ng, Chengxi Liu; PCT patent application No. PCT/CN2023/108347, filed on 20 July 2023; CN patent application No. 202380100287.2, filed on 7 January 2026; US Non-Provisional patent application No. 19/501,196, filed on 12 January 2026; EP patent application No. 23945486.1, filed on 16 January 2026). W.M.T. and F.C.M.L. declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Chunhai Fan, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhou, Y., Ng, C.C.A., Liu, C. et al. Data storage and retrieval with unnatural proteins expressed via E. coli. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70061-7
DOI: https://doi.org/10.1038/s41467-026-70061-7
