Data storage and retrieval with unnatural proteins expressed via E. coli

Data availability

The spectral data generated in this study for LC-MS/MS analysis of the data-bearing proteins have been deposited in the MassIVE database under accession code MSV000098849 [https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=7f1745678fc846728d68438e2f742071]. Source data are provided with this paper.

Code availability

The code of the in-house sequencing program is protected by patent restrictions, but may be made available for academic exchange and collaboration upon email request to the corresponding author, with an expected response time of about one week.

References

  1. Wright, A. Worldwide IDC Global DataSphere Forecast, 2025–2029. IDC https://my.idc.com/getdoc.jsp?containerId=US53363625 (2025).

  2. Ng, C. C. A. et al. Data storage using peptide sequences. Nat. Commun. 12, 4242 (2021).

  3. Rössler, S. L., Grob, N. M., Buchwald, S. L. & Pentelute, B. L. Abiotic peptides as carriers of information for the encoding of small-molecule library synthesis. Science 379, 939–945 (2023).

  4. Extance, A. How DNA could store all the world’s data. Nature 537, 22–24 (2016).

  5. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).

  6. Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).

  7. Roy, R. K. et al. Design and synthesis of digitally encoded polymers that can be decoded and erased. Nat. Commun. 6, 7237 (2015).

  8. Huang, Z. et al. Binary tree-inspired digital dendrimer. Nat. Commun. 10, 1918 (2019).

  9. Cafferty, B. J. et al. Storage of information using small organic molecules. ACS Cent. Sci. 5, 911–916 (2019).

  10. Nagarkar, A. A. et al. Storing and reading information in mixtures of fluorescent molecules. ACS Cent. Sci. 7, 1728–1735 (2021).

  11. Rosenstein, J. K. et al. Principles of information storage in small-molecule mixtures. IEEE T. Nanobiosci. 19, 378–384 (2020).

  12. Zhang, H. et al. Rational incorporation of any unnatural amino acid into proteins by machine learning on existing experimental proofs. Comput. Struct. Biotechnol. J. 20, 4930–4941 (2022).

  13. Zhang, G. & Zhu, T. F. Mirror-image trypsin digestion and sequencing of D-proteins. Nat. Chem. 16, 592–598 (2024).

  14. Hendy, J. Ancient protein analysis in archaeology. Sci. Adv. 7, eabb9314 (2021).

  15. Service, R. F. Protein power. Science 349, 372–373 (2015).

  16. Welker, F. et al. Ancient proteins resolve the evolutionary history of Darwin’s South American ungulates. Nature 522, 81 (2015).

  17. Chen, C.-S. et al. A proteome chip approach reveals new DNA damage recognition activities in Escherichia coli. Nat. Methods 5, 69–74 (2008).

  18. Fisher, M. A., McKinley, K. L., Bradley, L. H., Viola, S. R. & Hecht, M. H. De novo designed proteins from a library of artificial sequences function in Escherichia coli and enable cell growth. PLoS ONE 6, e15364 (2011).

  19. Pan, X. & Kortemme, T. Recent advances in de novo protein design: principles, methods, and applications. J. Biol. Chem. 296, 100558 (2021).

  20. Shen, H. et al. De novo design of self-assembling helical protein filaments. Science 362, 705–709 (2018).

  21. Chen, Z. et al. Self-assembling 2D arrays with de novo protein building blocks. J. Am. Chem. Soc. 141, 8891–8895 (2019).

  22. Chevalier, A. et al. Massively parallel de novo protein design for targeted therapeutics. Nature 550, 74–79 (2017).

  23. Cao, L. et al. De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370, 426–431 (2020).

  24. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).

  25. Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).

  26. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  27. Hallgren, J. et al. DeepTMHMM predicts alpha and beta transmembrane proteins using deep neural networks. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2022.04.08.487609v1 (2022).

  28. Hughes, C., Ma, B. & Lajoie, G. A. De novo sequencing methods in proteomics. Methods Mol. Biol. 604, 105–121 (2010).

  29. Sun, B., Kovatch, J. R., Badiong, A. & Merbouh, N. Optimization and modeling of quadrupole orbitrap parameters for sensitive analysis toward single-cell proteomics. J. Proteome Res. 16, 3711–3721 (2017).

  30. Price, W. N. et al. Large-scale experimental studies show unexpected amino acid effects on protein expression and solubility in vivo in E. coli. Microb. Inform. Exp. 1, 1–20 (2011).

  31. Torres, J. M., Borja, C., Gibert, L., Ribot, F. & Olivares, E. G. Twentieth-century paleoproteomics: lessons from Venta Micena Fossils. Biology 11, 1184 (2022).

  32. Warinner, C., Korzow Richter, K. & Collins, M. J. Paleoproteomics. Chem. Rev. 122, 13401–13446 (2022).

  33. Yu, Z., An, B., Ramshaw, J. A. & Brodsky, B. Bacterial collagen-like proteins that form triple-helical structures. J. Struct. Biol. 186, 451–461 (2014).

  34. Qiu, Y., Zhai, C., Chen, L., Liu, X. & Yeo, J. Current insights on the diverse structures and functions in bacterial collagen-like proteins. ACS Biomater. Sci. Eng. 9, 3778–3795 (2021).

  35. Xu, C., Yu, Z., Inouye, M., Brodsky, B. & Mirochnitchenko, O. Expanding the family of collagen proteins: recombinant bacterial collagens of varying composition form triple-helices of similar stability. Biomacromolecules 11, 348–356 (2010).

  36. Peng, Y. Y. et al. Towards scalable production of a collagen-like protein from Streptococcus pyogenes for biomedical applications. Microb. Cell Fact. 11, 1–8 (2012).

  37. Huang, X. et al. Storage-D: A user-friendly platform that enables practical and personalized DNA data storage. Imeta 3, e168 (2024).

  38. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).

  39. Tirmazi, H. & Tran, T. P. An introduction to protein cryptography. Preprint at https://eprint.iacr.org/2025/089 (2025).

  40. Chen, W. D. et al. Combining data longevity with high storage capacity—layer-by-layer DNA encapsulated in magnetic nanoparticles. Adv. Funct. Mater. 29, 1901672 (2019).

  41. Luo, B. et al. High-capacity information storage using peptide-encapsulated hydrogels for long-term data preservation. Commun. Mater. 6, 183 (2025).

  42. He, M., Stoevesandt, O. & Taussig, M. J. In situ synthesis of protein arrays. Curr. Opin. Biotechnol. 19, 4 (2008).

  43. Wiesler, S. C. & Weinzierl, R. O. Robotic high-throughput purification of affinity-tagged recombinant proteins. Methods Mol. Biol. 1286, 97–107 (2015).

  44. Restrepo-Pérez, L., Joo, C. & Dekker, C. Paving the way to single-molecule protein sequencing. Nat. Nanotechnol. 13, 786–796 (2018).

  45. Brinkerhoff, H., Kang, A. S. W., Liu, J., Aksimentiev, A. & Dekker, C. Multiple rereads of single proteins at single–amino acid resolution using nanopores. Science 374, 1509–1513 (2021).

  46. Afshar Bakshloo, M. et al. Nanopore-based protein identification. J. Am. Chem. Soc. 144, 2716–2725 (2022).

  47. Jiang, J. et al. Protein nanopore reveals the renin–angiotensin system crosstalk with single-amino-acid resolution. Nat. Chem. 15, 578–586 (2023).

  48. Reed, B. D. et al. Real-time dynamic single-molecule protein sequencing on an integrated semiconductor device. Science 378, 186–192 (2022).

  49. Niu, H. et al. Direct mapping of tyrosine sulfation states in native peptides by nanopore. Nat. Chem. Biol. 21, 716–726 (2025).

  50. Yao, Z., Ng, C. C., Lau, C. M. & Tam, W. M. Data storage using peptides. US Patent No. 11,315,023 (2022).

Acknowledgements

We thank Prof. Yanxiang Zhao and Prof. Clarence Chun Ting Wong (PolyU) for their help and useful discussions on this project. This work was supported by the National Key Research and Development Program of China (Grant No. 2024YFF0725800, Z.P.Y.), the Hong Kong Research Grants Council (Grant Nos. R5013-19F, C5026-24GF, C4002-20WF, C4014-23G, CRS_CUHK405/23 and AoE/M-402/25-N, Z.P.Y.), the Faculty of Science (Grant No. 1-WZA2, Z.P.Y.), the University Research Facility in Chemical and Environmental Analysis, and the University Research Facility in Life Sciences of The Hong Kong Polytechnic University.

Author information

Author notes

  1. These authors contributed equally: Yin Zhou, Cheuk Chi A. Ng.

Authors and Affiliations

  1. Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China

    Yin Zhou, Cheuk Chi A. Ng, Chengxi Liu & Zhong-Ping Yao

  2. Research Institute for Future Food, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China

    Yin Zhou, Cheuk Chi A. Ng, Chengxi Liu & Zhong-Ping Yao

  3. Research Centre for Chinese Medicine Innovation, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China

    Yin Zhou, Cheuk Chi A. Ng, Chengxi Liu & Zhong-Ping Yao

  4. State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China

    Yin Zhou, Cheuk Chi A. Ng, Chengxi Liu & Zhong-Ping Yao

  5. Shenzhen Key Laboratory of Food Biological Safety Control, The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China

    Yin Zhou, Cheuk Chi A. Ng, Chengxi Liu & Zhong-Ping Yao

  6. Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China

    Wai Man Tam & Francis C. M. Lau

Authors

  1. Yin Zhou
  2. Cheuk Chi A. Ng
  3. Chengxi Liu
  4. Wai Man Tam
  5. Francis C. M. Lau
  6. Zhong-Ping Yao

Contributions

Y.Z., C.C.A.N., and Z.P.Y. designed the experiments. W.M.T. and F.C.M.L. encoded and decoded the data with the error-correction schemes. Y.Z. and C.C.A.N. performed the LC-MS/MS analysis. C.L. and Y.Z. performed the native and denatured MS and stability analysis. Y.Z., C.C.A.N., C.L., and W.M.T. drafted the manuscript. Z.P.Y. and F.C.M.L. revised the manuscript. Z.P.Y. initiated and coordinated the whole project.

Corresponding author

Correspondence to Zhong-Ping Yao.

Ethics declarations

Competing interests

The authors declare the following competing interests: Y.Z., C.C.A.N., C.L., and Z.P.Y. are inventors on a related patent entitled “Data storage using proteins” (Zhongping Yao, Yin Zhou, Cheuk Chi Ng, Chengxi Liu; PCT patent application No. PCT/CN2023/108347, filed on 20 July 2023; CN patent application No. 202380100287.2, filed on 7 January 2026; US Non-Provisional patent application No. 19/501,196, filed on 12 January 2026; EP patent application No. 23945486.1, filed on 16 January 2026). W.M.T. and F.C.M.L. declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Chunhai Fan, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhou, Y., Ng, C.C.A., Liu, C. et al. Data storage and retrieval with unnatural proteins expressed via E. coli. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70061-7

  • DOI: https://doi.org/10.1038/s41467-026-70061-7