Sequencing a DNA analog composed of artificial bases

Introduction

DNA forms the informational basis of Terran life through its ability to store, transmit, and evolve genetic information. Two features let it do so: 1) rules of base pairing (i) size complementarity and (ii) hydrogen bond complementarity provide a direct mechanism for information-conserving replication, and 2) a repeating backbone charge. The repeating backbone charge is required for any informational biopolymer to support evolution under the “Polyelectrolyte Theory of the Gene”¹ because it controls strand-strand interactions, but also allows informational units to be exchanged without dramatically changing the overall biophysical properties.

Many research groups have added new nucleotides and new nucleobase pairs to the four canonical nucleobases in standard Terran (A, C, G, T) DNA^2,3,4,5,6. Some of these retain both pairing rules^4,5. Others retain size complementarity, while dispensing with inter-base hydrogen bonds³. Others retain intra hydrogen bonds, but dispense with size⁷. In all cases, however, anthropogenic pairs are sparsely introduced into double helices that are largely stabilized by standard A:T and G:C pairs. For anthropogenic pairs that are not joined by inter-base hydrogen bonds, this sparseness is a requirement for a functioning informational biopolymer.

ALternative Isoinformational ENgineered (ALIEN) DNA moves beyond this constraint. Here, the goal is an entirely anthropogenic informational system, without any Terran nucleotides at all. Recently, this was done using four synthetic nucleotides:

B: isoguanine 6-amino-9[(1′-β-D-2′-deoxyribofuranosyl)-4-hydroxy-5-(hydroxymethyl)-oxolan-2-yl]-1H-purin-2-one, which pairs with

S: 3-methyl-6-amino-5-(1′-β-D-2′-deoxyribofuranosyl)-pyrimidin-2-one, and

P: 2-amino-8-(1′-β-D-2′-deoxyribofuranosyl)-imidazo-[1,2a]-1,3,5-triazin-[8H]-4-one, which pairs with

Z: 6-amino-3-(1′-β-D-2′-deoxyribofuranosyl)-5-nitro-1H-pyridin-2-one.

ALIEN DNA retains the standard deoxyribose-phosphate polyelectrolyte backbone of standard DNA, and uses both size and hydrogen bonding complementarity to form specific pairs^8,9,10 (Fig. 1A). It is “isoinformational”, since each site holds two bits of information, the same as with standard DNA. Further, ALIEN DNA can form the same A and B double helical forms as standard DNA. Both the polyelectrolyte backbone and rule-base pairing allow ALIEN DNA to meet the requirements for Darwinian evolution¹. Anthropogenic genetic systems including P and Z bases have been shown to evolve medically useful receptors, ligands, and catalysts with high affinity^{11,12,13,14,15,16}, support highly multiplexed PCR¹⁷, and are components of commercial diagnostics products¹⁸.

DNA built from B, P, S, and Z is intrinsically biocompatible because it retains the properties of canonical DNA but cannot hybridize to any biological DNA sequence^8,9,10. Such DNA, completely orthogonal to biological DNA, has applications in synthetic biology such as expanded codon libraries^19,20, chimeric PCR primers for cleaner multiplex PCR¹⁷, and rapid assembly of nanostructures¹⁴. Because P:Z and B:S pairs are both joined with three hydrogen bonds, PZBS DNA folds have higher melting points than comparable ACGT DNA.

ALIEN DNA oligonucleotides can be synthesized using standard solid phase phosphoramidite chemistry^{8,9,10,21,22,23}. However, standard commercial sequencing platforms are incompatible with this material that is entirely anthropogenic. Transliteration techniques have enabled sequencing of six-letter ACGTPZ DNA²⁴, by probabilistically converting P:Z pairs to A:T and C:G pairs. Comparison of multiple reads reveal the locations of sites that are consistently ACGT, while those which are read as sometimes T, sometimes C, are inferred to be Z, and those which are read as sometimes A, sometimes G, are inferred to be P. However this strategy relies on base-pair mismatching and only works if the anthropogenic components are relatively sparse. This strategy has not been demonstrated for 4-letter DNA consisting of B, P, S, and Z, and does not work well with dense, long tracts of non-canonical bases because of its use of base pair mismatches.

Realizing the full potential of ALIEN DNA requires sequencing technology that is routine, accurate, and agnostic to both the chemical identity of bases and alphabet size of anthropogenic genetic systems. Further, the development of anthropogenic genetic systems is ongoing, with the potential for new analogues of P, Z, S, and B optimized for chemically stability, biocompatibility, or with different chemical functional groups useful for aptamers and aptazymes^12,15,25,26. Thus, it is important that any sequencing technology developed can be rapidly adapted to alternative alphabets with minimal cost of development.

Nanopore strand sequencing^27,28,29 is a well-accepted strategy for sequencing DNA which requires no chemical label, has single molecule sensitivity, and needs no prior DNA amplification (Fig. 1B). Critically, unlike other techniques that are based on chemical recognition of DNA bases, nanopore sequencing identifies bases physically based on their interaction with an ion current passing through the nanopore. This makes nanopores sensitive to the detailed chemical structure of the base, and therefore easily adaptable to new bases and analytes. As examples, nanopore sequencing has been used to detect base modifications that include methylated cytosine^30,31,32,33, base adducts³⁴, and base isomers such as pseudouridine³⁵. Nanopore techniques are even being developed for peptide sequencing³⁶. We, in collaboration with others, have used nanopore sequencing for reference sequencing of the 8-letter system [A,C,G,T,P,Z,S,B]³⁷, the 12-letter system [A,C,G,T,P,Z,S,B,J,V,K,X]³⁸, DNA with negatively charged nucleobase (Z^–)³⁹, and a 6-letter system containing a base pair between two hydrophobic nucleotides⁴⁰. Each of these works demonstrated that nanopores could identify isolated incorporations of many different non-canonical bases surrounded by ACGT DNA of known sequence. De novo sequencing of DNA with continuous tracts of non-canonical bases is a more challenging problem. Essentially, instead of finding and identifying deviations from what is expected of ACGT DNA, one has to produce an entire sequence blind.

In enzyme-actuated nanopore strand-sequencing²⁷, a single nanoscale pore (nanopore) is incorporated into an artificial lipid bilayer membrane and electrically connects two electrolyte reservoirs (Fig. 1B, C). An ion current flows through the nanopore in response to a voltage applied across the membrane. This ion current is measured as the primary nanopore signal. Nucleic acid molecules coupled to a motor enzyme are captured by the electric field near the pore due to their negative charge. DNA is drawn into the pore until the enzyme comes to rest on the rim of the pore. DNA motion through the pore is controlled by the motor enzyme, which translocates along the DNA in discrete steps. For strand sequencing with the Hel308 helicase, the helicase is loaded on an 8-base overhang at a strand’s 3′ end while the strand’s 5′ end is threaded into the pore. As the DNA strand is pulled into the pore, the complementary strand is rapidly unzipped by the nanopore until the Hel308 helicase comes to rest on the pore rim. We find nearly all reads in this configuration begin with Hel308 helicase at the 3′ end of the DNA, suggesting that Hel308 helicase is not particularly good at initiating unwinding at ssDNA-dsDNA junctions⁴¹. With the complementary strand removed, Hel308 helicase begins translocating along the template strand from 3′ to 5′, feeding DNA through the pore in two ~half-nucleotide steps per ATP hydrolysed^41,42.

Bases within the pore constriction block the ion current through the pore to differing degrees because each base has a unique shape and electrostatic characteristics that alter the flow of ions uniquely through the tight pore constriction. Even small molecular differences in the nucleotides, such as base methylation³⁰ or isomeric nucleobases such as guanine and B^37,38, produce detectable ion current differences that can be used to determine base identity.

Oxford Nanopore Technologies (ONT) produce commercial nanopore sequencers which have sufficient accuracy, high throughput, and are capable of long reads⁴³. However, the inner workings of the ONT devices are proprietary and lack the experimental customizability required for sequencing more exotic analytes. Further, the 9-base k-mer of the current ONT “chemistry” necessitates building of a large training set which is prohibitively expensive for strands of ALIEN DNA. We used variable-voltage sequencing with the pore MspA for this work because of its higher experimental customizability compared to commercial devices, as well as the higher information content embedded in variable-voltage signal⁴⁴ compared to the constant voltage signal.

In variable-voltage nanopore sequencing^37,40,44, a voltage waveform is applied instead of a DC-voltage. This stretches the DNA within the nanopore periodically, effectively “flossing” DNA back and forth several times during a typical enzyme step, allowing repeated measurements of each nucleotide.

Previous efforts laid the groundwork for results presented here. We showed P, Z, B, and S produced different ion current signals within limited sequence contexts, suggesting ALIEN DNA composed of P, Z, B, and S might be straightforwardly sequenced without additional pore mutation. We also found that P, Z, B, and S possess acceptable compatibility with the Hel308 helicase commonly used for nanopore sequencing³⁷. While the Hel308 helicase did tend to dissociate from the DNA when translocating over dense tracts of the C-glycoside bases S and Z, helicase processivity was shown to be sufficient for sequencing short synthetic oligos³⁷. In this work, we demonstrate de novo variable-voltage sequencing of DNA containing long, contiguous regions of ALIEN DNA bases. To achieve this, we follow the method which enabled the first sequencing of ACGT DNA⁴⁵: building empirical reference maps of measured ion currents to DNA sequence for all possible k-mers in a range of sequence contexts.

Results

The ion current measured in variable-voltage nanopore sequencing with MspA is affected by the 6 nucleotides in or adjacent to the pore constriction^44,45. This is due to the size of the pore constriction (as it compares to the size of individual bases) compounded with the rapid Brownian motion of the DNA within the pore. The set of bases which affect the ion current is referred to as a k-mer, where the number k is the number of bases which contribute to the ion current. For variable-voltage sequencing ACGT DNA with MspA, we have found that a 6-mer describes nanopore data well, where the central 4 bases of the 6-mer describe the majority of the variation seen in the data^44,45. By establishing a reference map for all possible k-mers, the ion current trace of an analyte sequence can be decoded into sequence.

We built a variable-voltage 4-mer (k = 4) model by measuring every possible PZBS 4-mer using a compact de Bruijn sequence (Fig. 2, Supplementary Figs. 1–3, Methods and Supplementary Information section S1), which contains all such 4-mers. Just as the cyclical sequence “AABB” contains all 2-letter combinations of A and B (AA, AB, BB, and, wrapping around, BA), a similar 256-base long de Bruijn sequence can be constructed which contains all 4-letter-long combinations of PZSB. Because phosphoramidite DNA synthesis limits the length of oligonucleotides and previous results suggested Hel308 helicase may have reduced processivity on dense regions of S and Z, we divided this de Bruijn sequence into 12 separately-synthesized strands. We designed and synthesized four additional oligonucleotide strands to allow for limited repeated measurements of some 4-mers. These additional strands helped to ensure internal consistency within the 4-mer sequencing model, i.e., we aligned nanopore signals to DNA sequence ensuring that each instance of a repeated 4-mer appeared identically. All model-training strands were designed with a variable 28 nt region of ALIEN DNA sandwiched between two sections of ACGT-DNA used for ion current calibration and helicase loading (Fig. 1C, Supplementary Fig. 4). A complete list of DNA sequences used in this study and their associated consensus traces are found in Supplementary Table 1 and Supplementary Figs. 5–7 respectively. Model training and subsequent sequencing was performed using the purely ALIEN sections of the DNA strands. A summary of all collected data can be found in Table S2.

To benchmark the performance of our sequencing model, we de novo sequenced four testing strands containing 28 nucleotide pseudorandom sequences of ALIEN DNA using a hidden Markov model solver^46,47 developed for variable-voltage nanopore sequencing⁴⁴ (Fig. 3, Supplementary Information section S2). We achieved a mean, per-base accuracy of 74 ± 1% (s.e.m.) for single passages of single DNA molecules (Fig. 3C), comparable with per-base variable-voltage nanopore sequencing accuracy of standard DNA (79.3 ± 0.3% s.e.m.)⁴⁴. For comparison, sequencing accuracy on training data is 83.7 ± 0.5% s.e.m (Supplementary Fig. 8), indicating some overfitting of the model to the training data.

**Fig. 3: De novo sequencing accuracy of ALIEN DNA.**

The breakdown of basecall errors by base shows that P and Z are called at nearly 90% accuracy (Fig. 3D). In contrast, we found that B was miscalled at a higher rate than the other bases (63%), particularly as insertions/deletions.

To further characterize our sequencing models, we estimated the mutual information between the bases and nanopore signal encoded in both the ALIEN and ACGT 4-mer models⁴⁸ (Supplementary Fig. 10). Mutual information measures the amount of shared information between two sets of data. The question posed here is “How much of the information about the base at location X is encoded in the variable-voltage conductance curve?” Mutual information is encoded in bits (0 or 1) and one needs 2 bits of information to fully specify one of four bases. We find that the estimated mutual information between sequence and signal is comparable between both models, despite the ALIEN model containing orders of magnitude fewer sequence contexts than the ACGT model, suggesting that we have encoded a substantial fraction of the available information about PZBS DNA in the model and that similar, or higher, magnitude sequencing accuracy to ACGT can be obtained with additional training data.

While single-read accuracy is a simple and straight-forward metric for benchmarking sequencing performance, in practice, nanopore reads of the same sequence are often combined into a more accurate consensus sequence. Here we implement a multiple sequence alignment consensus technique for ALIEN nanopore reads which increases the accuracy over single-read sequencing (Fig. 4, Supplementary Information section S3, Supplementary Figs. 11–13).

**Fig. 4: Consensus sequences of the ALIEN DNA determined through multiple sequence alignment of read basecalls.**

Discussion

Because model training was done with a small number of short oligonucleotides, our initial model was built from relatively few unique sequence contexts for each 4-mer (Supplementary Fig. 15). Many 4-mers in our sequencing model are based on a single sequence context. Further, while 4-mers characterize most of the variation observed in variable-voltage conductance curves, we found previously that 6-mers provide a more complete characterization with ACGT DNA. In a 4-mer model, the surrounding sequence context effectively adds increased variance to 4-mer measurements⁴⁴. Because we’ve only measured most 4-mers in one or two sequence contexts, we anticipate that sequencing accuracy for ALIEN DNA can be improved dramatically by scaling up the training library size 10 or 100 fold. For example, the 6-mer ACGT model parameterized 4⁶ = 4096 unique ACGT 6-mers using two separate kilobase-scale genomic DNA templates for model training measurements⁴⁴, compared to the 4⁴ = 256 4-mers parametrized here using 12 short oligos. The datasets used to train the ONT machine learning sequencing models are many orders of magnitude larger⁴⁹. Given how well our limited 4-mer sequencing model performs, we expect that the use of larger ALIEN training libraries will allow the accuracy of ALIEN sequencing to converge with or exceed that of standard nanopore sequencing of four-nucleotide ACGT DNA.

The high error rate of calling B compared with the other bases remains conspicuous. Future improvements in identifying B will have the largest impact on total ALIEN sequencing performance. The tendency of B to undergo tautomerization with a non-trivial proportion of B bases adopting an enolic form may play a role in its sequencing difficulty⁵⁰. Other origins of systematic model errors may include errors from the phosphoramidite chemistry used to synthesize the ALIEN oligonucleotides. This synthesis process is known to produce strands with mutations and chemical adducts³⁷ that are known to affect the nanopore signal and could contaminate both the training and testing datasets. Because B is the leading contributor to ALIEN DNA sequencing errors, 2D or 1D² sequencing strategies⁵¹, in which reads of both the primary strand and its complement are combined to yield a higher accuracy read, should deliver significant improvements in sequencing accuracy even at the single-molecule level, because the complementary base S is sequenced with higher accuracy than B.

The DNA oligos used here consisted of short sections of ALIEN DNA sandwiched between adapters made of ACGT DNA. This was primarily a design choice. Nanopore strand sequencing does not read the first or last several bases of each strand. This is not an issue for sequencing because in practice sequencing adapter strands are ligated to either end of the strand of interest. Due to the higher per-base cost of solid phase synthesis for bases PZSB we chose to use sections of ACGT for these unread sections of the DNA. A leading error in solid-phase synthesis is missed base incorporation at the single molecule level. While poly-acrylamide gel electrophoresis (PAGE) can be used to ensure strand purity, this error mode significantly affects yield for strands significantly longer than 100 nucleotides. Thus, all strands used here were 90 nucleotides long.

In previous work, we found that ALIEN DNA bases interacted somewhat differently with the DNA motor enzyme as compared to standard bases ACGT, modifying motor enzyme step kinetics. We analysed our training and test data to determine how enzyme step dwell time and the probability of an enzyme backwards step were affected by ALIEN bases within the enzyme (Supplementary Information section S4). Similar to canonical bases⁵², we find that ALIEN bases affect these kinetic observables in a sequence-specific manner (Supplementary Fig. 14). For example, we observe that B is associated with a higher rate of backwards stepping when it is 17 nt upstream of the pore constriction. Such sequence-dependent kinetic observables could be integrated into the nanopore sequencing algorithm, resulting in higher basecall accuracy as compared to using the ion current signal alone. As reported previously, Hel308 helicase appeared to have a higher rate of dissociation from the DNA strand in regions dense in C-glycoside bases S and Z, however the 28 nucleotide long region of ALIEN DNA used here was not a significant challenge. Sequencing of longer sections of ALIEN DNA may require further optimization of the Hel308 helicase via mutation or use of another motor enzyme more compatible with PZBS DNA. Sequencing of 28 consecutive PZBS bases shown here would be sufficient for sequencing of ALIEN DNA aptamers.

This demonstration of full-factorial de novo sequencing of a DNA analog containing entirely artificial bases is an important step forward for the many burgeoning fields that use anthropogenic nucleic acid systems. We show that systems which are biomimetic and yet biorthogonal can be easily substituted into nanopore sequencing workflows to generate training data and enable sequencing of new alphabets in short order with minimal additional development. The system used here is not yet optimized for sequencing accuracy, throughput, sample size, or read-length with ALIEN DNA. The ONT nanopore platforms have orders-of-magnitude higher sequencing throughput than the custom nanopore setup used here, and suggest what may be achievable for sequencing of ALIEN DNA. It is conceivable that with a large and complex enough training library, one could generate training data for an entire high-accuracy sequencing model for a new artificial alphabet in a single experiment. However, training a 9-mer sequencing model would require a training library 4⁵ = 1024 times the size of the current training set. Additionally, the efficacy of the ONT helicase to translocate on contiguous sections of ALIEN DNA remains untested. Typical input requirements for experiments reported here are ~1ul of 500 nMolar DNA (0.5pmol of DNA). Again, the commercial ONT devices suggests that through optimization, input requirements for nanopore sequencing of ALIEN DNA can be reduced 100-1000 fold.

Still, this work shows how nanopore systems can be readily adapted to directly sequence entirely anthropogenic informational molecules. This will enable faster and more economical progress in the field of synthetic biology^19,53, technology that leverages anthropogenic bases in molecular diagnostics^54,55,56, evolvable nucleic acid ligands/catalysts/receptors^12,13,15,50, and ultra-dense digital data storage^6,57,58,59. It may even help humankind search for alien life, which by theory must be able to support Darwinian evolution, but need not do so with the same biopolymers as used in all (known) life forms here on Earth⁶⁰.

Methods

Pore establishment

A single M2-NNN-MspA nanopore was established in a 1,2-di-O-phytanyl-sn-glycero-3-phosphocholine (DOPHPC) lipid bilayer using the painting method⁶¹. Lipids were purchased from Avanti Polar Lipids. Cis and trans wells contained 400 mM KCl, 10 mM HEPES, buffered to pH 8.00 ± 0.05. The cis well additionally contains 10 mM MgCl₂ and 1 mM ATP. With a bilayer established ~1 ul of M2-NNN MspA is added to the cis well and mixed. A single MspA insertion into the bilayer is detected by a jump in the measured ion current to ~145 pA with 180 mV applied. Then remaining MspA in solution is washed out via perfusion of 1 ml of buffer.

DNA preparation

DNA oligos containing AEGIS nucleotides were constructed using AEGIS phosphoramidites (Firebird Biomolecular Sciences LLC. www.firebirdbio.com) synthesized on an ABI 394 instrument. Oligonucleotides were purified on 10% denatured PAGE and then desalted on C18 Sep-Pak cartridge, lyophilized and shipped from Florida to the University of Washington. DNA strands were resuspended in 100 mM KCl to a final concentration of 20 µM. Sequences were then mixed with their complement DNA strand and were heated to 90 °C and slowly cooled by ∼10 °C per minute to 4 °C to form dsDNA constructs shown in Fig. 1C. Once annealed, these constructs were further diluted to a concentration of 1 µM and ~1 µl additions of this strand was added to the cis chamber. A complete list of DNA strands used in this work is given in Table S1.

Nanopore experiment

DNA molecules bound to Hel308 helicase are electrophoreticaly captured by the nanopore. As the 5′ end of the DNA is pulled into the pore, the complementary strand is displaced by the pore leaving only the ssDNA template with Hel308 resting on the pore rim. Hel308 will then translocate from 3′ to 5′ in discrete steps approximately half a nucleotide in length⁴², drawing the ssDNA back out of the pore. Over time ATP is converted to ADP within the cis well. We perfused new reagents into the cis well every 45 min to maintain consistent [ATP]. Variable voltage sequencing is performed by applying a periodic voltage waveform (200 Hz triangle wave, 100 mV peak-to-peak offset to 150 mV) to repeatedly stretch and relax DNA within the pore vestibule, “flossing” the DNA in the constriction by ±1 nt much faster than a typical Hel308 step (~10 Hz).

Data acquisition

Data was acquired with custom LabVIEW software on an Axopatch 200B amplifier at 50 kHz. During experiments, we applied a 200 Hz, 100 mV peak-to-peak triangle waveform in addition to a constant 150 mV DC offset.

Operating conditions

All experiments were run at 400 mM trans [KCl] and 400 mM cis [KCl] with 10 mM HEPES at pH 8.0 and 10 mM [MgCl2] (cis only) at temperature 37 ± 1 degrees Celsius. Once a single M2-NNN MspA nanopore was established, a buffer with the above conditions along with ATP was perfused to the cis well. ATP was ordered from Sigma Aldrich. Experiments were performed at various ATP concentrations between 100 and 1000 µM, but the majority of data was collected at 200 µM ATP. The perfusion is done to maintain constant concentrations of the reactants/products in the reaction volume. DNA, DTT, and Hel308 were added to final concentrations of 10 nM, 1 mM, and 200 nM, respectively. Reactants/products were re-perfused every 45 min.

Proteins

Hel308 from Thermococcus gammatolerans EJ3 (accession number WP 015858487.1 [https://www.ncbi.nlm.nih.gov/protein/WP_015858487.1/]) and M2-NNN-MspA (accession number CAB56052.1 [https://www.ncbi.nlm.nih.gov/protein/CAB56052.1], with the mutations (D90N/D91N/D93N/D118R/D134R/E139K)) were expressed in E. coli. and purified using a 6xhis-tag.

Validation of sequences containing PZBS bases

Synthesis errors of oligos by solid phase synthesis can manifest in two categories: systematic errors and random single-molecule errors. Systematic errors could include the wrong sequence being made, contaminated phosphoramidites, or sequence biases such as secondary structures which systematically affect synthesis. Random errors can include incomplete deprotection of nucleotides, instances of unwanted chemical adducts which can be attached to bases or chemical degradation products such as abasic lesions. Put simply, systematic errors are those which manifest uniformly across all copies of the strand, while random errors are different strand to strand.

P, Z, S, and B phosphoramidites undergo tight quality control. dP, dZ, and dS phosphoramidites were purchased from Firebird Biomolecular (Purity are 97%, 98%, and 97%, respectively). dB phosphoramidite was purchased from Glen Research (Purity is 99.5%)

Previous Liquid Chromatography-Mass Spec. performed on homopolymer strands used in Thomas et al.³⁷ showed such strands are successfully produced using solid phase synthesis. Leading errors consisted of single cyanoethyl base adducts and single-base deletions, both of which can be assumed to be distributed randomly along the DNA strands. Roughly 30% of such molecules contained either a base deletion or cyanoethyl base adduct somewhere along their 90-nucleotide length. This is supported by observed variations in phosphoramidite synthesized DNA within our lab (including ACGT-DNA purchased from IDT). Further, previously published results verified proper construction of PZSB strands constructed by solid phase synthesis via X-ray crystallography.

All constructs used in this work were PAGE purified after synthesis, showing predominantly full-length constructs so systematic errors resulting in change of strand length can be ruled out. In addition, nanopore reads of the synthesized DNA contained reads of the natural DNA handles at both the 5′ and 3′ ends (Fig. 2). Further, nanopore traces contained the expected number of steps for the number of bases contained in each strand. Training data was also checked for internal consistency. For example, several k-mers (e.g. “BBBZ”, “ZBZP” and “PZSS”) appear multiple times within different sequencing contexts within the training strands and each time produce similar conductance curves. Additionally, the ability of the sequencing model to predict ion currents of previously unmeasured sequences (Fig. 3B) and to sequence with meaningful accuracy is also proof positive that the desired strands were indeed constructed as designed in both the training and test datasets.

Chemical names.

B:isoguanine 6-amino-9[(1′-β-D-2′-deoxyribofuranosyl)-4-hydroxy-5-(hydroxymethyl)-oxolan-2-yl]-1H-purin-2-one

P: 2-amino-8-(1′-β-D-2′-deoxyribofuranosyl)-imidazo-[1,2a]-1,3,5-triazin-[8H]-4-one

S: 3-methyl-6-amino-5-(1′-β-D-2′-deoxyribofuranosyl)-pyrimidin-2-one

Z: 6-amino-3-(1′-β-D-2′-deoxyribofuranosyl)-5-nitro-1H-pyridin-2-one

Change point detection

A key requirement of enzyme-actuated nanopore sequencing is the segmentation of a conductance trace into segments corresponding to the discrete steps taken by the motor enzyme. Here we use the change-point detection algorithm⁶² previously demonstrated⁴⁴ with slight improvements. Previously, change point detection was conducted on a model composed of 5 principal components of periodic functions that represent the raw variable-voltage data of a single read⁴⁴. In this work, we first subtract the mean cycle ion current for each read individually and use this signal representation solely for the purposes of change point detection. This helps to calibrate out pore-to-pore and read-to-read variations in pore ion current, removing significant systematic errors. This works because the important feature of the data is the sequence-dependent change in pore conductance. Ion current through the nanopore changes non-linearly with applied voltage in ways that depend on the capacitance of the setup. This is known to be variable for example, due to changes in bilayer area/thickness, subtraction of the event-average cycle ion current removes all elements of the non-linear conductance change and isolates the part which we care about: the sequence-dependent change in conductance at various applied voltages.

We construct a model of the variable-voltage data based on principal component vectors, but model the mean cycle normalized signal instead of the raw signal. The mean cycle current subtraction allows us to reduce our model complexity to 2 principal components (Supplementary Fig. 1) as compared to the 5 used in Noakes et al. ⁴⁴.

Capacitance compensation

Because of the applied voltage waveform in variable-voltage sequencing, we must subtract out the large charging and discharging ion current due to the large capacitance of the lipid bilayer membrane that supports the nanopore if we hope to extract information from the signal. We use the identical computational strategy detailed in ref. ⁴⁴, which assumes that the non-capacitive component of the positive-going and negative-going ion currents should be identical when variable voltage is applied as a triangle wave. This is because the capacitive current depends only on the rate of change of the applied voltage which, with a triangle wave, has the same magnitude but opposite sign. Thus the nanopore conductances at a given voltage can be calculated as the mean of the up-going and down-going currents.

Conductance normalization

As in ref. ⁴⁴, we perform a normalization step to remove the remaining DNA-position independent contribution to each state’s conductance curve. This nuisance component is predominately caused by two factors: the intrinsically non-ohmic character of the nanopore conductance when translocating a charged molecule, and the effects of voltage-dependent DNA stretching on the number of bases sensed by the pore constriction. To address the first factor, we find the average conductance at each voltage over each state in a read and subtract this from each state’s conductance curve. The second factor can be subtracted to first order by subtracting a linear component from each curve segment. We find the magnitude of the linear voltage response as a function of conductance by finding the average linear component for each curve averaged over all curve segments in a read. This average slope is independent of the DNA sequence and can be interpreted as the pore’s average change in conductance as a function of applied voltage. We then subtract this line from each state’s conductance curve. The resulting curve segment called the “normalized conductance” contains primarily the DNA-position dependent portion of the nanopore’s conductance curve. This process is originally described in ref. ⁴⁴.

Principal component reduction

As in ref. ⁴⁴, we reduce the dimensionality of the conductance curves through principal component reduction, using linear combinations of the top 3 principal component vectors to characterize a conductance curve. This allows for more efficient computation during sequencing and better estimates of feature covariances while retaining 98% of the variance between conductance curves. Because the 3 top principal component vectors roughly correspond to the basic shape characteristics of a conductance curve, they generalize satisfactorily to ALIEN sequencing.

Noise measurement

Noise is calculated by taking the standard deviation of the conductance measurements at each voltage sampled (Eq. 1). The mean of this is taken over all voltages and serves as the holistic noise measurement of the state. The standard error of the measurement noise is computed in Eq. 2, where N is defined as the total number of conductance measurements across all sampled voltages. This is converted into stiffness (Eq. 3) for inclusion into the 4×4 stiffness matrix, which characterizes the full state certainty.

$${sigma }_{G}={{rm{mean}}}({sigma }_{{G}_{V1}},{sigma }_{{G}_{V2}},ldots,{sigma }_{{G}_{V101}})$$

(1)

$${sigma }_{{sigma }_{G}}=,frac{{sigma }_{G}}{sqrt{2N-2}}$$

(2)

$${K}_{{sigma }_{G}}=,{sigma }_{{sigma }_{G}}^{-2}$$

(3)

Filtering

Data is next filtered to 1) remove spurious “bad levels” and 2) consolidate over-segmented data and enzyme backsteps. We determine the optimal ordering of observed conductance states through a two-stage filtering process as described previously⁴⁴ with slight modifications.

The removal filter finds and removes observed states that are not informative of the DNA sequence. They may occur due to transient noise spikes, mid-read gating of the nanopore, enzyme-specific transients (“flickers”), and over-segmented states. The removal filter of⁴⁴ assigns a bad state probability to each state using a 12-feature Support Vector Machine (SVM). Input features 1–9 are the principal component coefficients of the evaluated state, the previous state, and the subsequence state. Feature 10 is the value of the single conductance measurement in a state that maximally deviates from the mean conductance of the read. Feature 11 characterizes the deviation between the principal component reduced curve and the original conductance curve. Feature 12 is the score of the best match through alignment of the evaluated state with the ACGT 6-mer model. All of these features, with the exception of feature 12, generalize to measurements of ALIEN DNA, so we use the removal filter with feature 12 disabled and tune the sensitivity threshold accordingly.

The recombination filter finds and combines states that represent repeated measurements of the same DNA position within a read. This filter fixes errors due to over-segmentation arising during change point detection, resulting in consecutive states of the same DNA position, as well as repeated states due to the enzyme stepping backward (towards 3′) along the DNA. Because change-point detection and enzyme stepping behavior are similar in ALIEN sequencing, we use the recombination filter as-is.

Read polishing

The final quality control step in read processing is to polish the reads by hand to remove any remaining “bad”, “hold”, or backstep states missed by the previous filtering process. We also trim states at both ends of the read to match the bounds of the consensus trace, typically only redundant calibration region states caused by homopolymer T’s at the 5′ end and redundant states caused by GA repeats at the 3′ end (Supplementary Fig. 4).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The nanopore data generated in this study have been deposited in the figshare database under accession code https://doi.org/10.6084/m9.figshare.28199681.v2⁶³.

Code availability

The analysis code underlying this article are available in figshare and can be accessed at https://doi.org/10.6084/m9.figshare.28199681.v2⁶³.

References

Benner, S. A. Rethinking nucleic acids from their origins to their applications. Philos. Trans. R. Soc. B: Biol. Sci. 378, 20220027 (2023).

CAS Google Scholar
Kool, E. T., Morales, J. C. & Guckian, K. M. Mimicking the Structure and Function of DNA: Insights into DNA Stability and Replication. Angew. Chem. Int. Ed. 39, 990–1009 (2000).

CAS Google Scholar
Henry, A. A. & Romesberg, F. E. Romesberg, Beyond A, C, G and T: augmenting nature’s alphabet. Curr. Opin. Chem. Biol. 7, 727–733 (2003).

CAS PubMed Google Scholar
Geyer, C. R., Battersby, T. R. & Benner, S. A. Nucleobase pairing in expanded Watson-Crick-like genetic information systems. Structure 11, 1485–1498 (2003).

CAS PubMed Google Scholar
Hirao, I. et al. An unnatural base pair for incorporating amino acid analogs into proteins. Nat. Biotechnol. 20, 177–182 (2002).

CAS PubMed Google Scholar
Tabatabaei, S. K. et al. Expanding the Molecular Alphabet of DNA-Based Data Storage Systems with Neural Network Nanopore Readout Processing. Nano Lett. 22, 1905–1914 (2022).

ADS CAS PubMed PubMed Central Google Scholar
Liu, H. et al. A Four-Base Paired Genetic Helix with Expanded Size. Science 302, 868–871 (2003).

ADS CAS PubMed Google Scholar
Hoshika, S., Shukla, M. S., Benner, S. A. & Georgiadis, M. M. Visualizing “Alternative Isoinformational Engineered” DNA in A- and B-Forms at High Resolution. J. Am. Chem. Soc. 144, 15603–15611 (2022).

CAS PubMed Google Scholar
Hoshika, S. et al. Hachimoji DNA and RNA: A genetic system with eight building blocks. Science 363, 884–887 (2019).

ADS CAS PubMed PubMed Central Google Scholar
Shukla, M. S., Hoshika, S., Benner, S. A. & Georgiadis, M. M. Crystal structures of ‘ALternative Isoinformational ENgineered’ DNA in B-form. Philos. Trans. R. Soc. B: Biol. Sci. 378, 20220028 (2023).

CAS Google Scholar
Zhang, L. et al. An Aptamer-Nanotrain Assembled from Six-Letter DNA Delivers Doxorubicin Selectively to Liver Cancer Cells. Angew. Chem. Int. Ed. 59, 663–668 (2020).

ADS CAS Google Scholar
Jerome, C. A., Hoshika, S., Bradley, K. M., Benner, S. A. & Biondi, E. In vitro evolution of ribonucleases from expanded genetic alphabets. Proc. Natl. Acad. Sci. 119, e2208261119 (2022).

CAS PubMed PubMed Central Google Scholar
Biondi, E. et al. Laboratory evolution of artificially expanded DNA gives redesignable aptamers that target the toxic form of anthrax protective antigen. Nucleic Acids Res. 44, 9565–9577 (2016).

CAS PubMed PubMed Central Google Scholar
Biondi, E. & Benner, S. A. Artificially Expanded Genetic Information Systems for New Aptamer Technologies. Biomedicines 6, 53 (2018).

PubMed PubMed Central Google Scholar
Zhang, L. et al. Aptamers against Cells Overexpressing Glypican|3 from Expanded Genetic Systems Combined with Cell Engineering and Laboratory Evolution. Angew. Chem. Int. Ed. 55, 12372–12375 (2016).

CAS Google Scholar
Shaker, S. et al. Cancer cell target discovery: comparing laboratory evolution of expanded DNA six-nucleotide alphabets with standard four-nucleotide alphabets. Nucleic Acids Res. 53, gkaf072 (2025).
Yang, Z., Chen, F., Chamberlin, S. G. & Benner, S. A. Expanded genetic alphabets in the polymerase chain reaction. Angew. Chem. Int. Ed. 49, 177–180 (2010).

CAS Google Scholar
Yang, Z., Benner, S. A. Compositions for the Multiplexed Detection of Viruses (2020).
Zhou, A. X.-Z., Sheng, K., Feldman, A. W. & Romesberg, F. E. Progress toward Eukaryotic Semisynthetic Organisms: Translation of Unnatural Codons. J. Am. Chem. Soc. 141, 20166–20170 (2019).

CAS PubMed PubMed Central Google Scholar
Bain, J. D., Switzer, C., Chamberlin, R. & Benner, S. A. Ribosome-mediated incorporation of a non-standard amino acid into a peptide through expansion of the genetic code. Nature 356, 537–539 (1992).

ADS CAS PubMed Google Scholar
Yang, Z., Hutter, D., Sheng, P., Sismour, A. M. & Benner, S. A. Artificially expanded genetic information system: a new base pair with an alternative hydrogen bonding pattern. Nucleic Acids Res. 34, 6095–6101 (2006).

CAS PubMed PubMed Central Google Scholar
Wang, X. et al. Biophysics of Artificially Expanded Genetic Information Systems. Thermodynamics of DNA Duplexes Containing Matches and Mismatches Involving 2-Amino-3-nitropyridin-6-one (Z) and Imidazo[1,2-a]-1,3,5-triazin-4(8H)one (P). ACS Synth. Biol. 6, 782–792 (2017).

CAS PubMed Google Scholar
Pham, T. M. et al. DNA Structure Design Is Improved Using an Artificially Expanded Alphabet of Base Pairs Including Loop and Mismatch Thermodynamic Parameters. ACS Synth. Biol. 12, 2750–2763 (2023).

CAS PubMed PubMed Central Google Scholar
Yang, Z., Chen, F., Alvarado, J. B. & Benner, S. A. Amplification, Mutation, and Sequencing of a Six-Letter Synthetic Genetic System. J. Am. Chem. Soc. 133, 15105–15112 (2011).

CAS PubMed PubMed Central Google Scholar
Hollenstein, M., Hipolito, C. J., Lam, C. H. & Perrin, D. M. A self-cleaving DNA enzyme modified with amines, guanidines and imidazoles operates independently of divalent metal cations (M 2+). Nucleic Acids Res. 37, 1638–1649 (2009).

CAS PubMed PubMed Central Google Scholar
Gold, L. et al. Aptamer-based multiplexed proteomic technology for biomarker discovery. PLOS ONE 5, e15004 (2010).
Manrao, E. A. et al. Reading DNA at single-nucleotide resolution with a mutant MspA nanopore and phi29 DNA polymerase. Nat. Biotechnol. 30, 349–353 (2012).

CAS PubMed PubMed Central Google Scholar
Loman, N. J. & Watson, M. Successful test launch for nanopore sequencing. Nat. Methods 12, 303–304 (2015).

CAS PubMed Google Scholar
Kasianowicz, J. J., Brandin, E., Branton, D. & Deamer, D. W. Characterization of individual polynucleotide molecules using a membrane channel. Proc. Natl. Acad. Sci. 93, 13770–13773 (1996).

ADS CAS PubMed PubMed Central Google Scholar
Laszlo, A. H. et al. Detection and mapping of 5-methylcytosine and 5-hydroxymethylcytosine with nanopore MspA. Proc. Natl. Acad. Sci. USA 110, 18904–18909 (2013).

ADS CAS PubMed PubMed Central Google Scholar
Rand, A. C. et al. Mapping DNA methylation with high-throughput nanopore sequencing. Nat. Methods 14, 411–413 (2017).

CAS PubMed PubMed Central Google Scholar
Schreiber, J. et al. Error rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine along individual DNA strands. Proc. Natl. Acad. Sci. 110, 18910–18915 (2013).

ADS CAS PubMed PubMed Central Google Scholar
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407–410 (2017).

CAS PubMed Google Scholar
Nookaew, I. et al. Detection and discrimination of DNA Adducts Differing In Size, Regiochemistry, And Functional Group By Nanopore Sequencing. Chem. Res. Toxicol. 33, 2944–2952 (2020).

CAS PubMed PubMed Central Google Scholar
Tavakoli, S. et al. Semi-quantitative detection of pseudouridine modifications and type I/II hypermodifications in human mRNAs using direct long-read sequencing. Nat. Commun. 14, 334 (2023).

ADS CAS PubMed PubMed Central Google Scholar
Brinkerhoff, H., Kang, A. S. W., Liu, J., Aksimentiev, A. & Dekker, C. Multiple rereads of single proteins at single–amino acid resolution using nanopores. Science 374, 1509–1513 (2021).

ADS CAS PubMed PubMed Central Google Scholar
Thomas, C. A. et al. Assessing readability of an 8-Letter Expanded Deoxyribonucleic Acid Alphabet With Nanopores. J. Am. Chem. Soc. 145, 8560–8568 (2023).

CAS Google Scholar
Kawabe, H. et al. Enzymatic synthesis and nanopore sequencing of 12-letter supernumerary DNA. Nat. Commun. 14, 6820 (2023).

ADS CAS PubMed PubMed Central Google Scholar
Smith, D. C. et al. Nanopores map the acid-base properties of a single site in a single DNA molecule. Nucleic Acids Res. 52, 7429–7436 (2024).

CAS PubMed PubMed Central Google Scholar
Ledbetter, M. P. et al. Nanopore sequencing of an expanded genetic alphabet reveals high-fidelity replication of a predominantly hydrophobic unnatural base pair. J. Am. Chem. Soc. 142, 2110–2114 (2020).

CAS PubMed PubMed Central Google Scholar
Craig, J. M. et al. Revealing dynamics of helicase translocation on single-stranded DNA using high-resolution nanopore tweezers. Proc. Natl. Acad. Sci. 114, 11932–11937 (2017).

ADS CAS PubMed PubMed Central Google Scholar
Derrington, I. M. et al. Subangstrom single-molecule measurements of motor proteins using a nanopore. Nat. Biotechnol. 33, 1073–1075 (2015).

CAS PubMed PubMed Central Google Scholar
Ying, Y.-L. et al. Nanopore-based technologies beyond DNA sequencing. Nat. Nanotechnol. 17, 1136–1146 (2022).

ADS CAS PubMed Google Scholar
Noakes, M. T. et al. Increasing the accuracy of nanopore DNA sequencing using a time-varying cross membrane voltage. Nat. Biotechnol. 37, 651–656 (2019).

CAS PubMed PubMed Central Google Scholar
Laszlo, A. H. et al. Decoding long nanopore sequencing reads of natural DNA. Nat. Biotechnol. 32, 829–833 (2014).

CAS PubMed PubMed Central Google Scholar
Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13, 260–269 (1967).

Google Scholar
Durbin, R., Eddy, S. R., Krogh, A., Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, 1998).
Ross, B. C. Mutual Information between Discrete and Continuous Data Sets. PLOS ONE 9, e87357 (2014).

ADS PubMed PubMed Central Google Scholar
Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).

PubMed PubMed Central Google Scholar
Eberlein, L. et al. Tautomeric equilibria of nucleobases in the Hachimoji expanded genetic alphabet. J. Chem. Theory Comput. 16, 2766–2777 (2020).

CAS PubMed PubMed Central Google Scholar
Lin, B., Hui, J. & Mao, H. Nanopore technology and its applications in gene sequencing. Biosensors 11, 214 (2021).

CAS PubMed PubMed Central Google Scholar
Craig, J. M. et al. Determining the effects of DNA sequence on Hel308 helicase translocation along single-stranded DNA using nanopore tweezers. Nucleic Acids Res. 47, 2506–2513 (2019).

CAS PubMed PubMed Central Google Scholar
Malyshev, D. A. et al. A semi-synthetic organism with an expanded genetic alphabet. Nature 509, 385–388 (2014).

ADS CAS PubMed PubMed Central Google Scholar
Glushakova, L. G. et al. Detecting respiratory viral RNA using expanded genetic alphabets and self-avoiding DNA. Anal. Biochem. 489, 62–72 (2015).

CAS PubMed PubMed Central Google Scholar
Glushakova, L. G. et al. High-throughput multiplexed xMAP Luminex array panel for detection of twenty two medically important mosquito-borne arboviruses based on innovations in synthetic biology. J. Virological Methods 214, 60–74 (2015).

CAS Google Scholar
Kawabe, H. et al. Harnessing Non-standard Nucleic Acids for Highly Sensitive Icosaplex (20-plex) Detection of Microbial Threats for Environmental Surveillance. ACS Synth. Biol.14, 470–484 (2025).
Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).

CAS PubMed Google Scholar
Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628–1628 (2012).

ADS CAS PubMed Google Scholar
Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).

ADS CAS PubMed PubMed Central Google Scholar
Carr, C. E. et al. Advancing the search for extra-terrestrial genomes in 2016 IEEE Aerospace Conference (2016), pp. 1–15.
Laszlo, A. H., Derrington, I. M. & Gundlach, J. H. MspA nanopore as a single-molecule tool: From sequencing to SPRNT. Methods 105, 75–89 (2016).

CAS PubMed PubMed Central Google Scholar
LaMont, C. H. & Wiggins, P. A. The Development of an Information Criterion for Change-Point Analysis. Neural Comput. 28, 594–612 (2016).

MathSciNet PubMed Google Scholar
Thomas, C. A., Laszlo, A. H. Craig, Jonathan M., Sequencing a DNA Analog Composed of Artificial Bases, figshare https://doi.org/10.6084/m9.figshare.28199681.v2 (2025).

Download references

Acknowledgements

This research was supported by grants from the National Human Genome Research Institute U24HG011735 (AHL, JHG, SAB), the National Institutes of Health, National Human Genome Research Institute R01HG005115 (AHL & JHG), the National Institutes of Health under the Director’s Award Number R01GM128186 (SH & SAB), the National Science Foundation MCB-1939086 (SH & SAB), and the National Institutes of General Medical Sciences 1R01GM141391-01A1 (SH & SAB).

Author information

Authors and Affiliations

Department of Physics, University of Washington, Seattle, WA, USA

Christopher A. Thomas, Henry Brinkerhoff, Jonathan M. Craig, Desislava Mihaylova, Akira M. Pfeffer, Michaela C. Franzi, Sarah J. Abell, Jessica D. Carrasco, Jens H. Gundlach & Andrew H. Laszlo
Foundation for Applied Molecular Evolution, Alachua, FL, USA

Shuichi Hoshika & Steven A. Benner

Authors

Christopher A. Thomas
Henry Brinkerhoff
Jonathan M. Craig
Shuichi Hoshika
Desislava Mihaylova
Akira M. Pfeffer
Michaela C. Franzi
Sarah J. Abell
Jessica D. Carrasco
Jens H. Gundlach
Steven A. Benner
Andrew H. Laszlo

Contributions

C.A.T., H.B., J.M.C., D.M., and A.M.P. analysed nanopore data. C.A.T., M.C.F., S.J.A., and J.D.C. performed nanopore experiments. S.H. synthesized and purified ALIEN oligonucleotides. A.H.L., S.A.B. and J.H.G. directed research. C.A.T, A.H.L, J.M.C, S.H., S.A.B., and J.H.G. wrote the paper.

Corresponding author

Correspondence to Andrew H. Laszlo.

Ethics declarations

Competing interests

The authors declare the following competing financial interest(s): S.A.B. owns Firebird Biomolecular Sciences, which makes various ALIEN reagents available for sale. S.H. is affiliated with Firebird Biomolecular Sciences. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Lingjun Li, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Thomas, C.A., Brinkerhoff, H., Craig, J.M. et al. Sequencing a DNA analog composed of artificial bases. Nat Commun 16, 7240 (2025). https://doi.org/10.1038/s41467-025-61991-9

Download citation

Received: 11 February 2025
Accepted: 08 July 2025
Published: 06 August 2025
DOI: https://doi.org/10.1038/s41467-025-61991-9