Main
In prokaryotes, the CRISPR-Cas system provides antiviral immunity by recognizing and disrupting intruding viral DNA through DNA sequence recognition1. This system has been harnessed for precise genome editing in various organisms and cell types1,2,3,4,5. Identifying naturally occurring Cas nucleases and engineering Cas enzyme variants with different features is crucial for various research and clinical applications. CRISPR-Cas nucleases form protein:DNA contacts to initiate target site recognition through a protospacer adjacent motif (PAM)6,7,8,9,10,11. The location (5′ or 3′ of the spacer) and sequence of the PAM differs among different types of CRISPR-Cas system11,12; however, efficient and accurate identification of PAM requirements in eukaryotic cells remains a bottleneck in the discovery and characterization of novel Cas nucleases and their engineered variants.
Various methods have been developed for PAM identification, including in silico13 and in vitro cleavage assays12,14,15,16, bacterial-based assays7,10 including the PAM screen achieved by NOT-gate repression (PAM-SCANR)17, cell-free transcription–translation (TXTL) systems18,19, fluorescence-based20 assays including the PAM definition by observable sequence excision (PAM-DOSE)21, human cell library-based approaches22,23 and scalable human cell expression followed by the in vitro cleavage reaction hybrid method, high-throughput PAM determination assay (HT-PAMDA)24,25. Each method has its own advantages and limitations (reviewed elsewhere13,25). In general, in silico and bacterial depletion results may not be easily translated to mammalian cell contexts. In vitro cleavage assays have the advantages of managing large libraries15; however, in vitro methods require laborious protein purification, and the cleavage kinetics may not accurately reflect the conditions in a living organism. Previous in vivo methods require introducing synthetic random oligos as PAM candidates into live cells, which are challenging for maintaining high-diversity sequence libraries. Moreover, fluorescence (GFP or RFP)-based enrichments are associated with particularly low efficiency.
Repetitive sequences in the mammalian genome, flanked by diverse sequences, are a potential resource for characterizing the PAM preferences of naturally occurring and engineered Cas nucleases. Here we developed a direct PAM identification method called GenomePAM, which uses highly repetitive sequences in the mammalian genome. To characterize the PAM requirements of Cas enzymes, we identified genomic repeats flanked by highly diverse sequences where the constant sequence can be used as the protospacer in CRISPR-Cas genome editing experiments. The cleaved genomic regions can then be analysed using methods such as the genome-wide unbiased identification of double strand breaks (DSBs) enabled by sequencing (GUIDE-seq)26 that enriches double strand oligodeoxynucleotide (dsODN)-integrated fragments by anchor multiplex PCR sequencing (AMP-seq)27. GenomePAM is highly efficient and accurate at characterizing the mammalian-cell-based PAM of an enzyme because every single cell contains just one full set of identical-complexity candidate PAM library. GenomePAM can also simultaneously assess the potency of thousands of on-target sites across the genome and the fidelity of tens of thousands of potential off-target sites of a Cas nuclease, facilitating performance comparison of different Cas nucleases. Moreover, GenomePAM can be used to better understand and compare genome-wide chromatin accessibility profiles of different cell types.
Results
Method design
The human genome contains highly repetitive sequences28, most of which are not suitable for use as protospacers due to low-complexity flanking sequences. However, a subset of these sequences can be used for PAM preference identification, provided they have the following features: (1) The number of unique flanking sequences of a given length in the human genome is comparable with, or not significantly smaller than, the number of potential PAMs to be tested. For example, the PAM of SpCas9 (Streptococcus pyogenes Cas9) and its variants may range from 1 to 3 bases; therefore, the number of unique 3-nt-long flanking sequences should preferably be 64 (=43). In the case of a SaCas9 (ref. 29) (Staphylococcus aureus Cas9) and its variants, the PAM may range from 3 to 4 bases; therefore, the number of unique flanking sequences of 4-nt length should preferably be ~256 (=44). (2) The flanking sequences should have highly diverse or nearly completely random sequence compositions.
To characterize the PAM of SpCas9 and its variants, we analysed the human genome for all possible 20-nt-long sequences and their flanking sequence diversities. For example, there are 8,471 occurrences of the sequence 5′-GTGAGCCACTGTGCCTGGCC-3′ (part of an Alu sequence; hereafter referred to as ‘Rep-1’) distributed across the human genome (Fig. 1a; ~16,942 occurrences in a human diploid cell) with nearly random flanking sequences of 10-nt length at its 3′ end, making it a suitable candidate as the protospacer sequence for PAM identification (Fig. 1b). For type II Cas nucleases with 3′ PAMs, such as SpCas9 and SaCas9, Rep-1 can be directly used for PAM preference characterization. For type V Cas nucleases with their PAM at the 5′ end of the spacer, such as FnCas12a12 (Francisella novicida Cas12a), the reverse complementary sequence 5′-GGCCAGGCACAGTGGCTCAC-3′ (‘Rep-1RC’) can be used as the protospacer sequence (Fig. 1b). Since Cas nucleases can often tolerate a few base mismatches (off targets), we calculated the numbers of 20-nt sequences with 1, 2, 3 and 4 base mismatches. For Rep-1, these numbers were 48,207, 206,767, 579,336 and 1,350,488, respectively, and >2 million in total in the human genome (hg38). Thus, using Rep-1 or Rep-1RC as the protospacer, there are potentially >4 million targets in a single human diploid cell. A list of example repeats, their occurrences, flanking sequence diversity and their use are shown in Extended Data Fig. 1.
a, Genome-wide distribution of the Rep-1 sequence in the human genome. b, SeqLogo plot showing nucleotide frequency at each position in the 10 bases at 3′ of Rep-1 and the 10 bases at 5′ of Rep-1RC (reverse complement of Rep-1). Rep-1 and Rep-1RC sequences can be used to characterize PAM preferences of Cas nucleases with 3′ PAM and 5′ PAM, respectively. c, GenomePAM workflow for the identification of PAM preferences using the highly repetitive genome sequence Rep-1 as the protospacer and the GUIDE-seq experiments to capture cleaved genomic sites of SpCas9. CMV, cytomegalovirus promoter. d, An example of GUIDE-seq output. Each line shows one SpCas9 cleavage site and the mismatch bases are colour coded. The flanking sequences and GUIDE-seq read counts of each site are shown on the right side. e, SeqLogo plot summary for SpCas9 PAM preferences using their corresponding read counts as weights and stratified by perfect-match and mismatch targets. f, GenomePAM Table reporting the enriched PAM sequences and counts, along with the numbers and percentages of corresponding genomic sites edited, and associated statistical significance. P values were derived from two-sided chi-square test.
To leverage these genomic repeats to characterize the PAM of various CRISPR-Cas enzymes, the repeat sequence Rep-1 was chosen as the protospacer target. The corresponding spacer was cloned into a guide RNA (gRNA) expression cassette to be used along with a plasmid that encodes the candidate Cas nuclease. To identify which repeats within the genome were cleaved in an experiment, we adapted the GUIDE-seq26 method to capture cleaved genomic sites in HEK293T cells (Fig. 1c). Only those sites whose flanking sequences contain functional PAMs can be cleaved by the Cas nuclease. Cell toxicity after large numbers of DSBs occur in one cell was reported previously when using CRISPR to target highly repetitive element LINE1 (ref. 30) or unique repeat sequences associated with temozolomide mutational signature31. To assess this toxicity, we measured cell viability in four different conditions, including Lipofectamine 3000 transfection controls, and in two different cell lines (Methods). The results showed largely similar cell viability across different transfection conditions at 24 h and 48 h after transfection in HEK293T (Extended Data Fig. 2a) and HepG2 cells (Extended Data Fig. 2b). During the GUIDE-seq data analysis, the candidate PAM was set as unknown (‘NNNNNNNNNN’) and 13,908 sites across the genome were identified (Fig. 1d). The mismatch bases were typically located at positions 8–11 of the targets and were transitions of the intended bases (Fig. 1d). The resulting PAMs were then summarized using their corresponding read counts as weights and used for SeqLogo plotting (Fig. 1e), which was stratified by perfect-match and mismatch targets. Beyond the descriptive SeqLogo, the consistent genomic background sequences inspired us to create an iterative ‘seed-extension’ method. This approach identifies statistically significant enriched motifs and reports the percentages of edited genomic sites at each iteration step (Fig. 1f and Methods).
Performance of GenomePAM on SpCas9, SaCas9 and FnCas12a
To evaluate the performance of GenomePAM, we chose three Cas nucleases with well-established PAMs: SpCas9, SaCas9 and FnCas12a. We used Rep-1 for GenomePAM analysis of SpCas9 and SaCas9, and Rep-1RC for FnCas12a. The results showed that the PAM preferences for SpCas9, SaCas9 and FnCas12a were NGG at 3′, NNGRRT (R is G or A) at 3′ and YYN (Y is T or C) at 5′ sides of the spacers, respectively, consistent with previous results7,10,12,29,32 (Fig. 2a–c; top, perfect match; bottom, mismatch; Fig. 2d–f, 4-base heat map of relative PAM cleavage value (PCV); Methods; GUIDE-seq results are listed in Supplementary Tables 1–3 and visualized in Supplementary Tables 4–6). GenomePAM Table analysis showed that, for SpCas9 3′ PAM, the most significant single base was the G at position 3 (1,103 (65.6%) of total 1,681 targets in human genome edited), the most significant two bases were GG at positions 2 and 3 (449 (94.1%) out of total 477 targets edited), and no further significant bases (Fig. 2g). For SaCas9 3′ PAM, the GenomePAM Table analysis showed increasing significance for G at position 3, GR at 3–4, GRR at 3–5, and GRRT at 3–6, respectively (Fig. 2h). The corresponding percentages of edited targets were 44.6%, 62.1–73.9%, 80.0–93.3%, and 96.7–98.0%, respectively (Fig. 2i). For FnCas12a 5′ PAM, the GenomePAM Table analysis showed a Y position −3 (5.3–6.0% of targets edited) and a YY at position −2 to −3 (8.5–9.6% of targets edited). For the known PAM wobble bases6,33, namely the 2nd position of N[G/A]G in SpCas9 and the 6th position of NNGRR[T/A] in SaCas9, the proportions of the prominent bases ([G] in SpCas9 and [T] in SaCas9) in the mismatch target-associated PAMs were higher than those perfect-match-associated PAMs (Fig. 2j; left, SpCas9; right, SaCas9; both P < 0.01). We also performed validation experiments in three other cell lines, HepG2, Huh7 and HeLa cell lines (Extended Data Fig. 3). The results showed nearly identical PAM profiles as those using HEK293T cells. Furthermore, we performed further tests using another repeat sequence (Rep-2: 5′-GAGCCACCGTGCCTGGCCTC-3′) that occurs 1,126 times in the human genome (~2,252 occurrences in a human diploid cell) (Extended Data Fig. 1) as the protospacer for GenomePAM analysis of SpCas9, SaCas9 and FnCas12a. The GenomePAM results were nearly the same, namely, NGG at 3′, NNGRRT at 3′ and TTTN at 5′ for the three nucleases, respectively (Extended Data Fig. 4). We further compared SpCas9 canonical (NGGN) and non-canonical PAM (NAGN, NGAN, NBGG; ‘B’ is C, G or T) PCVs derived from GenomePAM with indel frequencies reported using three well-established methods21,22,24. The analyses showed high correlations (Fig. 2k,l, versus two assays: R = 0.96, P < 1 × 10−100; Fig. 2m, versus PAM-DOSE: R = 0.92, P < 2.6 × 10−27). Our results recapitulate the known PAM requirements of type II and type V Cas nucleases, demonstrating that our genome-based PAM determination method (GenomePAM) is effective.
a–i, SeqLogo results for SpCas9, SaCas9 and FnCas12a PAM preferences in HEK293T cells with perfect-match spacers (a–c, top) and with mismatch spacers (a–c, bottom), and plotted in a 4-base heat map of relative PCV (d–f) and corresponding GenomePAM Tables (g–i). P values were derived from two-sided chi-square test. j, Percentage of ‘G’ at the 2nd position of SpCas9 PAM (left) and percentage of ‘T’ at the 6th position of SaCas9 PAM (right), by cleaved target types (perfect match versus mismatch). Data are presented as mean ± s.d. k–m, Correlations between PCV of GenomePAM and indel frequencies reported by previous methods: ref. 22 (k), HT-PAMDA24 (l) and PAM-DOSE21 (m) across various SpCas9 PAM sequences, including canonical (NGGN) and non-canonical PAMs (NAGN, NGAN, NBGG; ‘B’ is C, G or T). Linear regression lines are plotted, with 95% confidence intervals indicated as grey areas.
Characterization of challenging PAM preferences
Long and complicated PAMs in naturally occurring Cas nucleases pose a challenge to identifying their PAMs experimentally. We evaluated the performance of GenomePAM on Campylobacter jejuni Cas9 (CjCas9), which was reported to require a 7-nt PAM NNNNACA34 and an 8-nt one NNNNRYAC35. We performed GenomePAM assay on CjCas9 using Rep-1 as the protospacer in HEK293T cells. The results showed that CjCas9 required NNNNRYAC as its PAM in HEK293T cells (Fig. 3a; top, perfect match; bottom, mismatch). Because the optimal length of protospacer for CjCas9 was shown to be 22 bases35, we tested using extended Rep-1 to 21 (5′-YGTGAGCCACTGTGCCTGGCC-3′; Y is C or T) and 22 (5′-GYGTGAGCCACTGTGCCTGGCC-3′) bases. The results showed nearly the same NNNNRYAC PAM preferences for both 21 and 22 base protospacers (Fig. 3b,c; top, perfect match; bottom, mismatch). Relative PCVs using these protospacers are visualized in heat maps (Fig. 3d–f). GenomePAM Table analysis consistently showed the most enriched sequence ACAC at positions 5–8 (Fig. 3g–i). The longer protospacers with 21 and 22 bases showed marked increases in cleavage activities compared with the 20-base spacer for CjCas9 (Fig. 3j). The PAM requirement on the 8th position was relatively relaxed (Fig. 3b,c), and there were increases in the numbers of off-target sites (Fig. 3a–c, bottom).
a–i, SeqLogo results for CjCas9 when using different lengths of spacers: 20 bases (‘GTGAGCCACTGTGCCTGGCC’) (a), 21 bases (‘YGTGAGCCACTGTGCCTGGCC’; ‘Y’ is ‘C’ or ‘T’) (b) and 22 bases (‘GYGTGAGCCACTGTGCCTGGCC’) (c) with perfect-match cleaved sites (top) in HEK293T cells and in their mismatch cleaved sites (bottom), and corresponding 4-base heat map of relative PCVs (d–f) and GenomePAM Tables (g–i). P values were derived from two-sided chi-square test. j, The numbers of perfect-match sites (left) and mismatch sites (right) cleaved by CjCas9 using 20-nt, 21-nt and 22-nt spacers. k, SeqLogo results for SpRY PAM preferences in HEK293T cells with perfect-match spacers (left) and mismatch spacers (right).
Another challenging scenario in characterizing PAM preference is when there is little preference. Engineering Cas nucleases to relax PAM requirements can broaden potential applications, such as using the SpRY variant of SpCas9 with nearly no PAM restriction (previously described to be NRN > NYN)24. Additional near-PAMless Cas variants have also been developed36,37. However, depletion-based methods may not be efficient in identifying Cas nucleases with nearly no PAM preferences25. Being a positive selection method, GenomePAM found that as expected, SpRY exhibited a very minimal PAM requirement, being nearly PAMless across 5,003 perfect-match loci and 23,946 mismatch loci in HEK293T cells (Fig. 3k).
GenomePAM for characterizing novel Cas PAM
After establishing the simplicity and accuracy of the GenomePAM assay in identifying various Cas nucleases, we sought to demonstrate its utility in PAM identification for novel Cas discovery. Using a metagenomics approach (Methods) to analyse recent data in the NCBI Sequence Read Archive (SRA), we identified one novel type V-A CRISPR-Cas candidate derived from Ruminococcus Dsp902787825, named RuCas12a (Fig. 4a). We performed the GenomePAM assay using ‘Rep-1’ as the protospacer to characterize its potential activity and 5′ PAM requirement. The result revealed that the PAM preference of RuCas12a was TTYN at its 5′ end (Fig. 4b,c). The GenomePAM Table showed dominant TTC at positions −4 to −2 (Fig. 4d, bottom row). To further validate and comprehensively evaluate genome editing efficiency of RuCas12a, we used 20 regular (non-repetitive) genomic sites containing a 5′ end ‘TTTG’ in human genes CD34, CFTR, DNMT1, EMX1, HBB, LPA, POLQ, RFN2, TTR and VEGFA (spacer and primer sequences are listed in Supplementary Table 7). The editing efficiencies ranged from 3.4% to 40.6% across the 20 genomic loci in HEK293T cells (Fig. 4e). We also applied GenomePAM for identifying PAM of novel type II Cas nuclease and found a novel Cas9 from Tissierella sp., named TiCas9. TiCas9 clusters closely to SpCas9 and ScCas9, implying that it is a type II-A Cas nuclease (Fig. 4f). GenomePAM analysis revealed that TiCas9 had an NNNACT PAM (Fig. 4g–i). We further validated its potencies across 20 endogenous loci with a 3′-NNNACT PAM in genes CD34, CTCF, EMX1, POLQ and VEGFA in HEK293T cells, which showed up to ~30% editing efficiency using its native gRNA scaffold (Fig. 4j; spacer and primer sequences are listed in Supplementary Table 8).
a, Phylogenetic tree for the type V-A Cas system, RuCas12a. b–d, SeqLogo results for RuCas12a PAM preferences in HEK293T cells with perfect-match spacers (b, top) and mismatch spacers (b, bottom), associated 4-base heat map of relative PCV (c) and the GenomePAM Table (d). P values were derived from two-sided chi-square test. e, Editing efficiency of RuCas12a on 20 regular genomic sites in human genes CD34, CFTR, DNMT1, EMX1, HBB, LPA, POLQ, RFN2, TTR and VEGFA with a 5′-TTTG PAM. Data are presented as mean ± s.d. f, Phylogenetic tree for the type II CRISPR-Cas system, TiCas9. g–i, SeqLogo results for TiCas9 PAM preferences in HEK293T cells with perfect-match spacers (g, top) and mismatch spacers (g, bottom), and a 4-base heat map of relative PCV (h) and the GenomePAM Table (i). P values were derived from two-sided chi-square test. j, Editing efficiency of TiCas9 on 20 regular genomic sites in human genes CD34, CTCF, EMX1, POLQ and VEGFA with a 3′-NNNACT PAM. Data are presented as mean ± s.d.
GenomePAM facilitates Cas PAM engineering
Engineering Cas PAM preference to expand targetability represents an attractive strategy for broad applications37. To this end, we questioned whether GenomePAM could facilitate Cas variant discovery. We assessed this using TiCas9 as an example by first applying GenomePAM to profile pooled mutant variants and, upon evidence of altered mixed PAMs, applied GenomePAM characterization of single-mutant variants (Fig. 5). Because there are many Cas9 nucleases recognizing G/C-rich PAM, we aimed to engineer TiCas9 for recognizing A/T-rich PAM, namely, to relax the C at position 5 of NNNACT. Using AlphaFold 3 (ref. 38), we identified that K1315 was the only residue found to interact with G at position 5 on the complementary strand (Fig. 5a). We constructed an NNK library encoding for all 20 amino acids at position 1315. GenomePAM analysis of the pooled variants showed dramatically altered base compositions at position 5, without affecting positions 4 and 6, in the aggregated PAMs (Fig. 5b–d). We then assessed all the 19 a.a. variants individually. The variant K1315Q showed completely no restriction at position 5 (Fig. 5e–g), namely, an ANT PAM at positions 4–6, while another 18 variants showed varied preferences at position 5 (Extended Data Fig. 5). Then, 16 endogenous sites in RNF2 harbouring PAM positions 4–6 (4 ACT, 4 ATT, 4 AGT and 4 AAT) were used to validate the variant K1315Q versus wild type (WT) (spacer and primer sequences are listed in Supplementary Table 9). The results were consistent with SeqLogo, PCV visualization and the GenomePAM Table (Fig. 5h versus 5b–g). Interestingly, even though the SeqLogo of the variant K1315Q showed no noticeable dominant base at position 5 (Fig. 5e), the GenomePAM Table (Fig. 5g) showed that, for PAM positions 4–6, the proportions of genome-wide target sites edited were highest with AGT, followed by ACT, and the lowest with AAT and ATT, largely consistent with the indel percentages at the 16 endogenous sites tested individually (Fig. 5h).
a, Left: the structure of the TiCas9 and sgRNA complex was predicted using AlphaFold 3 and visualized in PyMOL 3.0. The PAM-interacting domain containing polar residue–DNA interactions within 4 Å are shown. Middle: hydrogen bonds between K1315 or Q1191 and nitrogenous base in the anti-sense strand of PAM are indicated by yellow dashed line, with corresponding distances labelled. Right: SeqLogo showing the GenomePAM result of TiCas9 WT. b–d, SeqLogo visualization of the 3′ PAM preference of the pooled NNK library of the TiCas9 1315 variants (b), the associated 4-base heat map of relative PCV (c) and the GenomePAM Table (d). P values were derived from two-sided chi-square test. e–g, SeqLogo visualization of the 3′ PAM preference of the TiCas9 K1315Q variant (e), the associated 4-base heat map of relative PCV (f) and the GenomePAM Table (g). P values were derived from two-sided chi-square test. h, Indel percentages using TiCas9 WT and its K1315Q variant on 16 endogenous sites in RNF2 harbouring different PAM sequences (positions 4–6: ACT, ATT, AGT and AAT). Data are presented as mean ± s.d.
Comparison of genome-wide potency and specificity
Many SpCas9 variants have been developed to reduce off-target effects (for example, SpCas9-HF1 (ref. 39), HypaCas9 (ref. 40), eSpCas9(1.1)41, Sniper-Cas9 (ref. 42) and Sniper2L-Cas9 (ref. 43)) and broaden PAM compatibilities (for example, xCas9 (ref. 44)). Oftentimes, dozens of gene loci are used to assess the fidelity and activity of Cas9 variants40,42. It would be desirable to simultaneously evaluate Cas nuclease potency and specificity with a less laborious method than traditional library-based approaches22,45. A method based on large-scale synthetic oligos (n = 26,891) containing targeting sequences and mismatch sequences has been developed for this purpose22. GenomePAM uses a single protospacer oligo that provides thousands of perfect-match sites and millions of mismatch sites in one human cell. Therefore, we sought to evaluate the feasibility of simultaneously comparing genome-wide potency and specificity of different Cas9 variants. We performed GenomePAM experiments with ‘Rep-1’ as the targeting protospacer for WT SpCas9 and six variants in parallel (SpCas9-HF1, eSpCas9(1.1), HypaCas9, xCas9, Sniper-SpCas9, and Sniper2L-SpCas9), with the same amounts of Cas and of sgRNA expression plasmids (Extended Data Fig. 6). The ratios of on-to-off target sites were highest for SpCas9-HF1 (mean 1.13), eSpCas9(1.1) (mean 1.08) and HypaCas9 (mean 0.93), followed by xCas9 (mean 0.57), Sniper2L-Cas9 (mean 0.23) and Sniper-Cas9 (mean 0.20), and lowest for WT SpCas9 (mean 0.13) (Fig. 6a). Similarly, the ratios of on-to-off target reads were 1.12, 2.28, 0.78, 0.75, 0.36, 0.31 and 0.17, respectively (Fig. 6b). To evaluate Cas9 cleavage dynamics with increasing probing data, we sampled datasets from 100,000 up to 1 million sequencing reads for GUIDE-seq analysis. The number of on-target sites identified given the same amount of sequencing data was highest (the most potent) in WT, followed by Sniper2L-SpCas9, comparable in Sniper-SpCas9 and eSpCas9(1.1), and lowest in SpCas9-HF1, HypaCas9 and xCas9 (Fig. 6c). The numbers of off-target sites identified given the same amount of data were lowest (the most specific) in xCas9, HypaCas9, SpCas9-HF1 and eSpCas9(1.1), comparable in Sniper-SpCas9 and Sniper2L-SpCas9, and highest (the least specific) in WT (Fig. 6d).
a, Ratio of the number of on-target sites to the number of off-target sites for seven SpCas9 variants. Data are presented as mean ± s.d. b, Ratio of the number of on-target sequencing reads to the number of off-target sequencing reads for seven SpCas9 variants. Data are presented as mean ± s.d. c,d, The number of on-target sites (c) and the number of off-target sites (d) detected when using randomly downsampled datasets, from 0.1 million to 1 million raw sequencing reads. e, Relative activities (defined as the number of perfect-match sites relative to that of WT SpCas9) and specificities (defined as the ratio of perfect-match to mismatch site numbers relative to the ratio in SpCas9-HF1) for seven SpCas9 variants.
To compare general activity and specificity of different SpCas9 variants in one place, we used the 1M-read datasets. We defined relative activity as the number of perfect-match sites relative to the number of perfect-match sites identified by WT SpCas9, and relative specificity as the ratio of perfect-match to mismatch target site numbers relative to the same ratio in SpCas9-HF1 (because SpCas9-HF1 had the highest ratio among the seven SpCas9 tested here) (Fig. 6e) The scatterplot showed that WT and Sniper2L-SpCas9s were more potent but less specific than other variants, whereas eSpCas9, SpCas9-HF1 and HypaCas9 were more specific but less potent than WT SpCas9 (Fig. 6e). To evaluate whether using a different repetitive spacer for GenomePAM can affect general activity and specificity, we used Rep-3 (Extended Data Fig. 1) and the results were similar to those obtained with Rep-1 (Extended Data Figs. 7 and 8).
Profiling chromatin accessibility in different human cell lines
Chromatin conformation has been shown to affect Cas nuclease genome editing46,47 on relatively small numbers of genomic targets but not on the genome-wide scale. To assess genome-wide targeting profiles in different cell lines, we performed GenomePAM assays using SpCas9 and ‘Rep-1’ in HEK293T, HepG2, Huh7 and HeLa cell lines in triplicates (Fig. 7). Genome-wide chromatin accessibility was defined as the number of targeting reads per 5-M-base chromosome window. Relative to HEK293T, the results from triplicates of the same cell line showed consistent and reproducible genome-wide chromatin profiles (Extended Data Fig. 9). Interestingly, hepatocyte-derived cell lines HepG2 and Huh7 showed very similar chromatin accessibility profiles in contrast to HeLa cells, indicating similar chromatin accessibility in the same tissue (Fig. 7).
The proportion of cleavage read counts in each 5-million-base chromosomal window was divided by the proportion in the corresponding chromosomal window in the HEK293T cells (mean of triplicates) and log2 transformed. A higher cleavage proportion relative to HEK293T is coloured in red and a relatively lower proportion in blue.
Discussion
In this study, we developed a new method called GenomePAM and demonstrated its simplicity, accuracy and capability in assaying PAM preferences of previously established SpCas9, SaCas9 and FnCas12a nucleases, as well as complicated and challenging PAM recognitions as in SpRY and CjCas9. We also demonstrated the potential of GenomePAM for simultaneous comparison of potencies (thousands of perfectly matched loci) and, when combined with GUIDE-seq, fidelities (tens of thousands of off-target sites) of various Cas nucleases and variants. Compared with other PAM identification methods using regular PCR amplicon sequencing, GenomePAM uses GUIDE-seq and thus requires relatively more skills to perform. However, GUIDE-seq has been one of the main methods for assessing CRISPR off-target effects in both research and therapeutic settings48,49. Using GenomePAM, we rapidly identified one type II Cas TiCas9 and one type V Cas RuCas12a that are both active in human cells. Directed by AlphaFold 3, we further accelerated the PAM engineering of TiCas9 to expand its targetability using GenomePAM. In addition, we demonstrated at the genome-scale that genomic accessibility of a given CRISPR-Cas design differs among cells of different tissue types. We envision that GenomePAM will be widely useful for the discovery, characterization and comprehensive evaluation of PAM recognition, potency and fidelity of CRISPR-Cas nucleases and engineered variants.
GenomePAM is capable of direct identification of challenging PAMs in human cells. Different PAM preferences for CjCas9 have been reported. An in silico prediction followed by a biochemical digestion assay reported that the PAM for CjCas9 was NNNNACA34, while an in vitro cleavage assay followed by in vivo second-step analyses on each of the positions from 5 to 8 showed that the optimal PAM was NNNNRYAC35. Maintaining large-scale libraries consisting of many sequences is challenging. Previous methods have attempted to address this by using sequential rounds of experiments in exceptionally challenging situations, with progressively lengthened PAM candidate sequences12,14. Maintaining rich library complexities in large-scale screening experiments is often challenging, but is not an issue in GenomePAM since every single cell contains one full set of potential PAM candidates, and maintaining PAM candidate sequence diversity is also not an issue in GenomePAM. We used GenomePAM in a one-round experiment directly in human cells and showed that, without previous protein purification and without introducing a library of synthetic oligos, the PAM preference of CjCas9 was NNNNRYAC. Since GenomePAM is a positive selection method, it can be used to efficiently identify PAM requirements when there are no preferences25.
Methods that can compare potency and fidelity of various Cas nucleases simultaneously are highly desirable. One such method involves constructing stable cell lines with balanced expression of Cas nucleases and variants to be compared, followed by transduction of a large pool of synthetic oligos (n = 26,891, on- and off-target sequences) at a carefully controlled multiplicity of infection (MOI) into these stable expression cells to compare potency and fidelity of various Cas nucleases22. One advantage of this approach is that it includes different on-target and off-target sequences. GenomePAM takes advantage of highly repetitive sequences in every cell (thousands of on-target and tens of thousands of off-target sequences), is much simpler and of low cost. One limitation of GenomePAM is that it uses relatively limited kinds of on-target sequences, although they appear thousands of times in one cell. However, this can be compensated for by using different repetitive sequences, such as Rep-1, Rep-3 or Rep-4 for result confirmation and validation, and in different cell types. Indeed, our results showed that using different sequences (Rep-1, Rep-2 or Rep-3) as the protospacers for GenomePAM analyses gave the same results in PAM characterizations for the different Cas nucleases tested in this study. However, a new Cas nuclease might have a scaffold sequence that interferes with the repeats, potentially forming strong secondary structures and affecting GenomePAM results. We recommend using at least two different repeats as GenomePAM spacers for novel Cas nucleases. Another possibility is to combine different repetitive sequences in one experiment, although we have not tested this ourselves yet. In such a case, bioinformatic analysis would need to use one repetitive sequence at a time and repeat the data analysis for all sequences. Chromatin accessibility affects Cas nuclease activity, as shown on a genome-wide scale. The GenomePAM assay is minimally biased by chromatin accessibility, probably due to the large number of accessible perfect-match targets in each cell.
Measuring chromatin accessibility is important in understanding basic cellular processes, including transcription, replication, chromosome segregation and DNA repair50. A variety of techniques such as Dnase-seq51 and ATAC-seq52,53 enable quantifying genome-wide chromatin accessibility. Genome accessibility to CRISPR-Cas targeting is known to differ among different cell types but has been demonstrated only in limited and selected loci. GenomePAM demonstrates clearly that the CRISPR-Cas genome-wide accessibility profiles differ among different cell types. GenomePAM may complement existing methods for studying genome-wide chromatin dynamics.
Methods
Identification of repeat sequences
The human genome (hg38) was used to calculate the frequencies of all 20-mer sequences using jellyfish tools54. Because a spacer starting with a G base at 5′ is required for most Cas nucleases, we selected all 20-mer sequences starting with a 5′ G. To avoid simple homopolymers and to increase base composition diversity, we also excluded those 20-mers containing ‘AAA’, ‘TTT’, ‘CCC’ or ‘GGG’. Among the remaining 20-mer sequences, to retrieve their flanking sequences, we used BWA55 to map their chromosomal coordinates, and retrieved 10 bases upstream and 10 bases downstream using samtools56. The diversity of the flanking sequences of each of the 20-mer sequence was plotted using the ggseqlogo57 package. We defined PCV as the ratio of the percentage of a given PAM sequence among all sequences of the same length captured by GenomePAM to the percentage of the same PAM sequence among all sequences of the same length in the human genome (hg38). To better visualize the PAM recognition pattern in a 4-base heat map, a relative PCV was calculated by log2 transforming PCV and normalizing to the PAM sequence with the highest PCV.
CRISPR-Cas identification
Metagenomes were downloaded from EMBL-EBI MGnify, NCBI GenBank and the Joint Genome Institute, or assembled in-house using raw sequencing reads from the NCBI Sequence Read Archive. We used a combinatorial pipeline that includes CCTyper58, CRISPRcasIdentifier59 and OPFI60 to predict putative Cas proteins. MinCED61 was used to identify CRISPR arrays; any CRISPR arrays located adjacent to the predicted Cas, which typically comprises a CRISPR operon, were subjected to further analysis. Putative sequences of Cas were scanned using Interproscan62 to identify and annotate conserved domains. Selected Cas proteins were aligned with MAFFT63, and a phylogenetic tree was constructed using FastTree2 (ref. 64).
Cell culture
HEK293T (CRL3216, ATCC), HepG2 (CRL11997, ATCC) and HeLa (CCL-2, ATCC) cell lines were purchased from the American Type Culture Collection (ATCC). The Huh7 (01042712, Sigma) cell line was purchased from Sigma. HEK293T cells, HeLa cells and Huh7 cells were cultured in Dulbecco’s modified Eagle medium (C11995500BT, GIBCO), HepG2 cells were cultured in Eagle’s minimum essential medium (30-2003, ATCC) supplemented with 10% fetal bovine serum (10270-106, GIBCO), and all cells were incubated at 37 °C with 5% CO2 in a constant-temperature incubator. Cell passaging was performed at a 1:3 split ratio when the cells reached 90% confluence.
Plasmids and oligonucleotides
The plasmids used in these experiments were purchased from the non-profit plasmid repository Addgene. The plasmid lentiCRISPRv2 (Addgene, 52961) was used to express wild-type SpCas9; BPK2139 (Addgene, 65776) to express wild-type SaCas9; pY004 (Addgene, 69976) to express wild-type FnCas12a; pET-CjCas9 (Addgene, 89754) to express wild-type CjCas9; and the plasmids BPK1520 (Addgene, 65777), BPK2660 (Addgene, 70709), pU6-Fn-crRNA (Addgene, 78958) and pU6-cj-E sgRNA (Addgene, 169915) were used to express SpCas9 sgRNA, SaCas9 sgRNA, FnCas12a crRNA and CjCas9 sgRNA, respectively. Oligonucleotide duplexes corresponding to the target spacer sequences were purchased from GENEWIZ.
Plasmid construction
Oligonucleotide duplexes corresponding to sgRNA sequences (paired top and bottom single-stranded oligos) were annealed together using the following programme: 95 °C, 3 min; 70 cycles of (95 °C, 1 min, with −1 °C per cycle); 4 °C hold. Annealed DNA segments were inserted into BsmbI digested sgRNA expression plasmids. After transformation into bacteria and selection, the plasmids were purified by PureLink HiPure Plasmid Midiprep kit (Invitrogen). Sequences of guide insertion in the plasmids were confirmed by Sanger sequencing (BGI)
dsODN preparation
dsODN oligos were purchased from GENEWIZ with HPLC purification. Each oligo was resuspended in 1× TE buffer (ThermoFisher, 12090015) to a final concentration of 250 µM. These oligos were then annealed at 100 µM in 1× annealing buffer (10 mM Tris-HCl, 50 mM NaCl, 1 mM EDTA, pH7.4) on a thermocycler. The programme was 95 °C, 3 min; 70× (95 °C, 1 min, −1 °C per cycle); 4 °C hold.
Cell transfection
Guide RNAs and Cas protein plasmids were transfected into cells using Lipofectamine 3000 transfection reagent (ThermoFisher, L3000015) following manufacturer instructions. Cells were cultured at a density of 1 × 105 per well in a 24-well plate. For each well, 100 ng of gRNA and 400 ng of Cas expression plasmids together with 5 pmol of annealed dsODN were mixed with 1 µl of P3000 reagents in 25 µl Opti-MEM medium and then mixed with 1.5 µl Lipofectamine 3000 reagent in 25 µl Opti-MEM medium to generate a total volume of 50 µl DNA–lipid complex, followed by incubation for 10 min at room temperature. The transfection complex was added into individual wells. The plate was maintained in a cell culture incubator for 48–72 h.
Cell viability assay
HEK293T and HepG2 cells were seeded in 96-well plates at 2.0 × 104 per well and transfected at four different conditions: (1) SpCas9 plasmid + Rep-1 sgRNA plasmid + dsODN; (2) SpCas9 plasmid + Rep-1 sgRNA plasmid; (3) SpCas9 plasmid + non-targeting sgRNA plasmid + dsODN; and (4) Lipofectamine 3000 only. Cell viability assay was performed with Enhanced Cell Counting Kit-8 (Beyotime, C0043) according to manufacturer instruction. Briefly, 10 µl CCK-8 labelling reagent was added to each well and incubated at 37 °C with 5% CO2 in a humidified atmosphere for 1 h. Cell viability was then determined using absorbance at 450 nm and evaluated at 0 h, 24 h and 48 h after transfection.
DNA extraction
Genomic DNA was extracted using the MiniBEST Universal Genomic DNA Extraction kit (TaKaRa) and quantified using Qubit dsDNA HS Assay kit (Invitrogen) in a Qubit 3.0 fluorometer.
GUIDE-seq
Genomic DNA was extracted at 48–72 h post transfection and 500 ng of DNA was used for next-generation sequencing library construction according to our previous work26 with modification65 (see updated dsODN sequences and amplification primers with discussion notes in Supplementary Table 10). Briefly, the library preparation procedure contains enzymatic fragmentation, end repair, A-tailing, adaptor ligation and two rounds of anchored nested PCR. The libraries were quantified with KAPA Library Quantification kits and sequenced on a NextSeq 1000 System (Illumina) using a 300-cycle kit (2× 150-bp paired-end) with standard Illumina sequencing workflow (that is, no need to adjust indexing cycles or use customized sequencing primers). Sequencing data (FASTQ files) were analysed using the GenomePAM pipeline with the off-target identification steps adapted from the GUIDE-seq pipeline (https://github.com/tsailabSJ/guideseq). The off targets were identified using the criteria of ≤6 mismatch bases with the intended targeting protospacer.
GenomePAM Table
To identify enriched PAM motifs over genomic background, we developed the algorithm GenomePAM Table66, involving the computational steps implemented in an R script to: (1) Identify the most significantly enriched single-base motif: we define the edited value as the sum of GUIDE-seq-detected genomic site numbers and GUIDE-seq read counts, with the latter linearly scaled to match the range of the former. The maximum value equals the highest number of genomic sites considered for all combinatorial potential motifs. Within the same motif window, a chi-square test is used to compare the edited value against the corresponding genomic background counts among all motifs. (2) Extend from the position identified in Step 1 bidirectionally: extend one base towards the 5′ end or one base towards the 3′ end and calculate the new edited values. Between the two extensions, the one with higher statistical significance is recorded and used for the next round of extension. (3) Repeat Step 2: continue extending in both directions until the ends of candidate bases are reached. Record all significant motifs without limiting motif length. (4) Report enriched motifs: report the enriched motifs along with the percentages of corresponding genomic sites edited, retaining only those motifs with increasing percentages from each iteration step (Fig. 1f).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Details of target sites identified have been included in the Supplementary Tables. Raw Fastq data are available at SRA (ID 1258724-BioProject-NCBI)67. Source data are provided with this paper.
Code availability
The GenomePAM pipeline is available on GitHub66. The only input file required by the GenomePAM pipeline is the identifiedOfftargets.txt from the GUIDE-seq pipeline.
References
-
Barrangou, R. et al. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712 (2007).
-
Hsu, P. D., Lander, E. S. & Zhang, F. Development and applications of CRISPR-Cas9 for genome engineering. Cell 157, 1262–1278 (2014).
-
Sternberg, S. H. & Doudna, J. A. Expanding the biologist’s toolkit with CRISPR-Cas9. Mol. Cell 58, 568–574 (2015).
-
Doudna, J. A. & Charpentier, E. The new frontier of genome engineering with CRISPR-Cas9. Science 346, 1258096 (2014).
-
Glemzaite, M. et al. Targeted gene editing by transfection of in vitro reconstituted Streptococcus thermophilus cas9 nuclease complex. RNA Biol. 12, 1–4 (2015).
-
Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–821 (2012).
-
Jiang, W., Bikard, D., Cox, D., Zhang, F. & Marraffini, L. A. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nat. Biotechnol. 31, 233–239 (2013).
-
Jinek, M. et al. Structures of Cas9 endonucleases reveal RNA-mediated conformational activation. Science 343, 1247997 (2014).
-
Gasiunas, G., Barrangou, R., Horvath, P. & Siksnys, V. Cas9–crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria. Proc. Natl Acad. Sci. USA 109, E2579–E2586 (2012).
-
Kleinstiver, B. P. et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature 523, 481–485 (2015).
-
Nishimasu, H. et al. Crystal structure of Cas9 in complex with guide RNA and target DNA. Cell 156, 935–949 (2014).
-
Zetsche, B. et al. Cpf1 is a single RNA-guided endonuclease of a class 2 CRISPR-Cas system. Cell 163, 759–771 (2015).
-
Karvelis, T., Gasiunas, G. & Siksnys, V. Methods for decoding Cas9 protospacer adjacent motif (PAM) sequences: a brief overview. Methods 121–122, 3–8 (2017).
-
Karvelis, T. et al. Rapid characterization of CRISPR-Cas9 protospacer adjacent motif sequence elements. Genome Biol. 16, 253 (2015).
-
Gao, L. et al. Engineered Cpf1 variants with altered PAM specificities. Nat. Biotechnol. 35, 789–792 (2017).
-
Kleinstiver, B. P. et al. Engineered CRISPR-Cas12a variants with increased activities and improved targeting ranges for gene, epigenetic and base editing. Nat. Biotechnol. 37, 276–282 (2019).
-
Leenay, R. T. et al. Identifying and visualizing functional PAM diversity across CRISPR-Cas systems. Mol. Cell 62, 137–147 (2016).
-
Wimmer, F., Mougiakos, I., Englert, F. & Beisel, C. L. Rapid cell-free characterization of multi-subunit CRISPR effectors and transposons. Mol. Cell 82, 1210–1224.e6 (2022).
-
Marshall, R. et al. Rapid and scalable characterization of CRISPR technologies using an E. coli cell-free transcription–translation system. Mol. Cell 69, 146–157.e3 (2018).
-
Wang, S. et al. Identification of SaCas9 orthologs containing a conserved serine residue that determines simple NNGG PAM recognition. PLoS Biol. 20, e3001897 (2022).
-
Tang, L. et al. Efficient cleavage resolves PAM preferences of CRISPR-Cas in human cells. Cell Regen. 8, 44–50 (2019).
-
Kim, N. et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat. Biotechnol. 38, 1328–1336 (2020).
-
Kim, N. et al. Deep learning models to predict the editing efficiencies and outcomes of diverse base editors. Nat. Biotechnol. 42, 484–497 (2024).
-
Walton, R. T., Christie, K. A., Whittaker, M. N. & Kleinstiver, B. P. Unconstrained genome targeting with near-PAMless engineered CRISPR-Cas9 variants. Science 368, 290–296 (2020).
-
Walton, R. T., Hsu, J. Y., Joung, J. K. & Kleinstiver, B. P. Scalable characterization of the PAM requirements of CRISPR–Cas enzymes using HT-PAMDA. Nat. Protoc. 16, 1511–1547 (2021).
-
Tsai, S. Q. et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat. Biotechnol. 33, 187–198 (2015).
-
Zheng, Z. et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat. Med. 20, 1479–1484 (2014).
-
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).
-
Friedland, A. E. et al. Characterization of Staphylococcus aureus Cas9: a smaller Cas9 for all-in-one adeno-associated virus delivery and paired nickase applications. Genome Biol. 16, 257 (2015).
-
Smith, C. J. et al. Enabling large-scale genome editing at repetitive elements by reducing DNA nicking. Nucleic Acids Res. 48, 5183–5195 (2020).
-
Tan, I. L. et al. Targeting the non-coding genome and temozolomide signature enables CRISPR-mediated glioma oncolysis. Cell Rep. 42, 113339 (2023).
-
Ran, F. A. et al. In vivo genome editing using Staphylococcus aureus Cas9. Nature 520, 186–191 (2015).
-
Jiang, W. et al. Dealing with the evolutionary downside of CRISPR immunity: bacteria and beneficial plasmids. PLoS Genet. 9, e1003844 (2013).
-
Fonfara, I. et al. Phylogeny of Cas9 determines functional exchangeability of dual-RNA and Cas9 among orthologous type II CRISPR-Cas systems. Nucleic Acids Res. 42, 2577–2590 (2014).
-
Kim, E. et al. In vivo genome editing with a small Cas9 orthologue derived from Campylobacter jejuni. Nat. Commun. 8, 14500 (2017).
-
Miller, S. M. et al. Continuous evolution of SpCas9 variants compatible with non-G PAMs. Nat. Biotechnol. 38, 471–481 (2020).
-
Nishimasu, H. et al. Engineered CRISPR-Cas9 nuclease with expanded targeting space. Science 361, 1259–1262 (2018).
-
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
-
Kleinstiver, B. P. et al. High-fidelity CRISPR-Cas9 nucleases with no detectable genome-wide off-target effects. Nature 529, 490–495 (2016).
-
Chen, J. S. et al. Enhanced proofreading governs CRISPR-Cas9 targeting accuracy. Nature 550, 407–410 (2017).
-
Slaymaker, I. M. et al. Rationally engineered Cas9 nucleases with improved specificity. Science 351, 84–88 (2016).
-
Lee, J. K. et al. Directed evolution of CRISPR-Cas9 to increase its specificity. Nat. Commun. 9, 3048 (2018).
-
Kim, Y. et al. Sniper2L is a high-fidelity Cas9 variant with high activity. Nat. Chem. Biol. 19, 972–980 (2023).
-
Hu, J. H. et al. Evolved Cas9 variants with broad PAM compatibility and high DNA specificity. Nature 556, 57–63 (2018).
-
Schmid-Burgk, J. L. et al. Highly parallel profiling of Cas9 variant specificity. Mol. Cell 78, 794–800.e8 (2020).
-
Chen, X. et al. Probing the impact of chromatin conformation on genome editing tools. Nucleic Acids Res. 44, 6482–6492 (2016).
-
Liu, G., Yin, K., Zhang, Q., Gao, C. & Qiu, J.-L. Modulating chromatin accessibility by transactivation and targeting proximal dsgRNAs enhances Cas9 editing efficiency in vivo. Genome Biol. 20, 145 (2019).
-
Gillmore, J. D. et al. CRISPR-Cas9 in vivo gene editing for transthyretin amyloidosis. N. Engl. J. Med. 385, 493–502 (2021).
-
Frangoul, H. et al. CRISPR-Cas9 gene editing for sickle cell disease and β-thalassemia. N. Engl. J. Med. 384, 252–260 (2021).
-
Mansisidor, A. R. & Risca, V. I. Chromatin accessibility: methods, mechanisms, and biological insights. Nucleus 13, 236–276 (2022).
-
Boyle, A. P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322 (2008).
-
Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).
-
Buenrostro, J. D. et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell 173, 1535–1548.e16 (2018).
-
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
-
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
-
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
-
Wagih, O. Ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).
-
Russel, J., Pinilla-Redondo, R., Mayo-Muñoz, D., Shah, S. A. & Sørensen, S. J. CRISPRCasTyper: automated identification, annotation, and classification of CRISPR-Cas loci. CRISPR J. 3, 462–469 (2020).
-
Padilha, V. A., Alkhnbashi, O. S., Shah, S. A., de Carvalho, A. C. P. L. F. & Backofen, R. CRISPRcasIdentifier: machine learning for accurate identification and classification of CRISPR-Cas systems. Gigascience 9, giaa062 (2020).
-
Rybarski, J. R., Hu, K., Hill, A. M., Wilke, C. O. & Finkelstein, I. J. Metagenomic discovery of CRISPR-associated transposons. Proc. Natl Acad. Sci. USA 118, e2112279118 (2021).
-
Bland, C. et al. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8, 209 (2007).
-
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
-
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
-
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
-
Tan, Y. et al. Rationally engineered Staphylococcus aureus Cas9 nucleases with high genome-wide specificity. Proc. Natl Acad. Sci. USA 116, 20969–20976 (2019).
-
Yu, M. et al. Supporting data for the manuscript ‘GenomePAM directs PAM characterization and engineering of CRISPR-Cas nucleases using mammalian genome repeats’. NCBI https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1258724 (2025).
-
Yu, M. et al. Zheng-NGS-Lab/genomePAM: mammalian genome repeats for direct PAM requirement characterization and activity profiling of CRISPR-Cas nucleases. GitHub https://github.com/Zheng-NGS-Lab/genomePAM (2025).
Acknowledgements
We thank support from Lau Grant LC230003 (Z.Z.), Swedish Research Council 202001418 (Z.Z.); Research Grants Council of the Hong Kong Special Administrative Region 11103024 and T12-101/23-N (Z.Z.); Shenzhen Medical Research Fund B2402002 (Z.Z.); the InnoHK initiative of the Innovation and Technology Commission of the Hong Kong Special Administrative Region Government (Z.Z.); Lau Grant LC230002 (L.L.); the Kayden–Lambert MGH Research Scholar Award 2023–2028 (B.P.K.); and National Institutes of Health (NIH) grant DP2-CA281401 (B.P.K.). We thank O. W. C. Leung for proofreading the manuscript.
Funding
Open access funding provided by Karolinska Institute.
Ethics declarations
Competing interests
Z.Z. is a co-founder and a scientific advisor for and holds equity in GenEditBio whose interest is reviewed and regulated by institutional Outside Practice policies annually. Z.Z. and M.Y. are inventors on a patent (WIPO Patent application no. WO2024230784A1, 2024) for GenomePAM technologies. B.P.K. is an inventor on patents or patent applications (WIPO Patent application no. WO2021151065A2, 2021) filed by MGB for HT-PAMDA technologies that describe genome engineering technologies and methods to characterize the PAM. B.P.K. consults for Novartis Venture Fund, Foresite Labs and Jumble Therapeutics, and is on the scientific advisory board of Acrigen Biosciences, Life Edit Therapeutics and Prime Medicine. B.P.K. has a financial interest in Prime Medicine, Inc., a company developing therapeutic CRISPR-Cas technologies for gene editing. B.P.K.’s interests were reviewed and are managed by MGH and MGB in accordance with their conflict-of-interest policies. S.Q.T. is a co-inventor on a patent (WIPO Patent application no. WO2015200378A1, 2015) for GUIDE-seq, and a member of the scientific advisory boards of Ensoma and Prime Medicine. B.W., L.I. and J.L. are employees of GenEditBio. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Chase Beisel, Pranam Chatterjee and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Occurrence and flanking sequence diversity of selected repetitive sequences.
The list shows 10 example repetitive sequences (Rep-1, Rep-1RC, Rep-2, Rep-2RC, Rep-3, Rep-3RC, Rep-4, Rep-4RC, Rep-U1 and Rep-U2), with their occurrences in human genome (hg38), diversity of flanking 10 bases and explaining notes for their suitabilities to be used as spacer for GenomePAM analysis.
Extended Data Fig. 2 Cell viability assay when transfected with four different conditions in HEK293T (a) and HepG2 (b) cells.
1) SpCas9 plasmid + Rep-1 sgRNA plasmid + dsODN; 2) SpCas9 plasmid + Rep-1 sgRNA plasmid; 3) SpCas9 plasmid + non-target sgRNA plasmid + dsODN; and 4) Lipofectamine 3000 only. Absorbance at 450 nm were evaluated 0 h, 24 h and 48 h after transfection.
Extended Data Fig. 3 Evaluation of GenomePAM on SpCas9 in three other cell lines than HEK293T.
SeqLogo results for SpCas9 proteins PAM preferences in HepG2 (a, b), Huh7 (c, d) and HeLa (e, f) cells. (a, c, e) were summarized by associated perfect match spacers and (b, d, f) mismatch spacers.
Extended Data Fig. 4 GenomePAM results of SpCas9 and SaCas9 using Rep-2 as the spacer, and FnCas12a using Rep-2RC as the spacer in HEK293T cells.
See Extended Data Fig. 1 for the sequences of Rep-2 and Rep-2RC. The PAMs were summarized by associated perfect match spacers (a, c, e) and by mismatch spacers (b, d, f).
Extended Data Fig. 5 GenomePAM analyses on 18 TiCas9 variants in HEK293T cells.
PAM preferences for the 18 TiCas9 (K1315A, K1315C, K1315D, K1315E, K1315F, K1315G, K1315H, K1315I, K1315L, K1315M, K1315N, K1315P, K1315R, K1315S, K1315T, K1315V, K1315W and K1315Y) were summarized by associated perfect matched spacers.
Extended Data Fig. 8 General activities and specificities of different SpCas9 variants using Rep-3 as the spacer in GenomePAM assay.
(a) The number of on-target sites to the number of off-target sites of seven SpCas9 variants. (b) The number of on-target sequencing reads to the number of off-target sequencing reads of seven SpCas9 variants. The number of on-target sites (c) and the number of off-target sites (d) detected when using randomly down-sampled datasets, from 0.1 M to 1 M raw sequencing reads. (e) Relative activities (defined as the number of perfect match sites relative to that of the WT SpCas9) and specificities (defined as the ratio of perfect match to mismatch site numbers relative to the ratio in SpCas9-HF1) of seven SpCas9 variants.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yu, M., Ai, L., Wang, B. et al. GenomePAM directs PAM characterization and engineering of CRISPR-Cas nucleases using mammalian genome repeats. Nat. Biomed. Eng (2025). https://doi.org/10.1038/s41551-025-01464-y
-
Received:
-
Accepted:
-
Published:
-
DOI: https://doi.org/10.1038/s41551-025-01464-y







