Main
The accelerating pace of plant genomics has facilitated structure-based and homology-based comparative approaches to infer the function of proteins regulating important traits; however, the functional annotation of this genetic variation has lagged. It remains very difficult, if not impossible, to predict gene function from sequence alone. Meeting this challenge is a stepping stone toward rationally designing more resilient crop varieties using gene editing1. This gap is particularly acute for transcriptional regulators, a class of proteins involved in virtually all aspects of plant biology2. Sequence-to-function prediction of transcriptional regulators is hindered by the high disorder and poor conservation of their regulatory domains3,4,5, which limits the power of comparative genomic algorithms to decipher how regulatory activity is encoded at the protein level. Moreover, mutations in regulatory domains are well tolerated or have subtle phenotypic effects, which has conspired against the use of classic random mutagenesis screens6. Directed mutagenesis studies are often guided by structural data and sequence conservation, but designing effective mutagenesis experiments becomes challenging when these constraints are absent and the search space is too large.
In recent years, techniques such as STARR-seq7, DAP-seq8 and TARGET9 have enabled mapping the functional requirements of DNA sequences bound and regulated by plant transcription factors (TFs). However, the large scale in planta characterization of the regulatory activity of multiple TFs simultaneously remains challenging and labor intensive10. Given the limitations of traditional methods, the high-throughput study of TF regulatory domains has focused on exploring their protein sequence space using massively parallel reporter assays (MPRAs) in cultures of animal or microbial cells11. In this approach, TF libraries are transformed into cultured cells where they drive the expression of a reporter gene. Cells are then sorted according to the level of reporter expression and next-generation sequencing is used to determine the abundance of each transgene in each sorted bin, a proxy for their activity12. These assays empirically characterize transcriptional regulators at a large scale and can yield the breadth of data necessary to train predictive models of the location and function of TF regulatory domains13,14,15. Such models provide a way to perform inference on protein sequences that are too costly or time-consuming to characterize experimentally16.
Over the last decade, reporter-based MPRAs of protein libraries in microbes such as Saccharomyces cerevisiae (yeast) and animal cell cultures have opened whole additional avenues of inquiry and led to a number of predictive models to predict TF activity13,14,15,17,18,19,20,21,22,23. Although proteins of plant origin have been studied in a high-throughput manner in yeast15,19, these assays have so far not been conducted in plant systems. As a result, our view of plant transcriptional regulation in general and of plant TFs in particular is inherently biased by nonplant model organisms that diverged from plants more than 1 billion years ago24. These assumptions are inextricably embedded in current models to predict sequence features dictating the activity of plant TFs15. More broadly, MPRAs can be used to study other classes of trans-acting and protein-coding genetic elements beyond TFs such as immune effectors or metabolic enzymes whose activity in plants depends on other plant-specific proteins25. There is therefore a need for a high-throughput method to measure protein-coding gene libraries expressed in a plant host.
Here we introduce ENTRAP-seq (enrichment of nuclear trans-elements reporter assay in plants with sequencing), an MPRA based on magnetic sorting of isolated nuclei that is capable of measuring trans-acting coding variants in plant cells. We deployed this method in Nicotiana benthamiana leaves expressing gene libraries delivered by Agrobacterium tumefaciens, allowing us to simultaneously measure the transactivation capacity of at least 2,000 protein variants in planta. We combined this technology with a custom machine learning pipeline to predict and experimentally validate activation domains (ADs) from proteins across 1,495 plant viral genomes. We identified hundreds of peptides of viral origin capable of activating transcription when recruited to a reporter. These potential ADs exhibit positional conservation and tend to occur in structural proteins and enzymes not known to be involved in host gene regulation, such as capsid proteins and RNA polymerases. Moreover, we demonstrate that many of these viral ADs can be used as DNA parts for plant synthetic biology, opening the door to large-scale mining of parts optimized for plant engineering. Comparing the activity of plant TF sequences between yeast and N. benthamiana revealed pervasive discrepancies between these two systems, challenging current assumptions about AD conservation and highlighting the need for in planta assays. Finally, we show that ENTRAP-seq can be used to map, dissect and engineer the ADs of plant TFs. We used machine-based design to guide semirational mutagenesis of the central transcriptional activator in photoperiodic flowering, CONSTANS, identifying sequence features critical for its function and enabling the generation of a new allelic series that includes repressors. ENTRAP-seq bridges a methodological gap in plant biology by enabling the multiplexed characterization of natural and synthetic coding variation in plant cells.
Results
Optimizing magnetic nucleus sorting as a reporter assay
To measure the activity of multiple protein-coding elements in parallel, we sought to develop an MPRA based on separating nuclei by the level of expression of a reporter transgene (Fig. 1a). By placing this reporter under the control of a protein-coding library of regulators (that is, trans-regulatory elements), we aimed to sort nuclei into two fractions: a pull-down fraction enriched with regulators that activate the reporter, and a flow-through fraction enriched with repressors and/or inactive elements. Next-generation sequencing is used to quantify the relative enrichment of each trans-element in each fraction, which is expected to be proportional to their activity (that is, activators or repressors) and strength (Fig. 1a). As the expression platform, we used Agrobacterium-mediated transient expression in N. benthamiana, also known as agroinfiltration (Fig. 1a). Crucially, it was previously shown that transgene expression in this system follows Poisson-like statistics26,27. This makes it possible to minimize the number of cells expressing more than one trans-element from the library by infiltrating Agrobacterium cultures at a very low density. Single transgene expression can mitigate unwanted interactions that may occur when different trans-elements are expressed in the same cell.
a, Overview of the ENTRAP-seq assay. A library of trans-acting elements, such as TFs of varying strengths, is cloned in binary vectors for Agrobacterium-mediated transformation. These trans-elements regulate a promoter driving the expression of an outer nuclear envelope tag (ONEtag) containing a biotin acceptor peptide. A transgene for the constitutive expression of the E. coli biotin ligase BirA is also included to biotinylate the ONEtag. The library as well as the ONEtag reporter and the BirA transgene are transformed into Agrobacterium and mixed cultures are infiltrated into leaves of N. benthamiana. The ONEtag is then biotinylated by constitutively expressed BirA using endogenous biotin. Next, nuclei are isolated from infiltrated leaves and incubated with magnetic streptavidin-coated beads, allowing for magnetic sorting. DNA is extracted from the unsorted and sorted nuclei samples and the abundance of each library element is determined by sequencing the T-DNAs. The enrichment of a given trans-element in the sorted versus unsorted population correlates with its activation strength. b, Live cell fluorescence microscopy images of N. benthamiana epidermis cells expressing a nuclear marker (H2B-RFP) and different ONEtag candidate versions fused to GFP. Scale bars 20 μm. Shown on top of each set of images is a schematic of the ONEtag variant tested. At least three cells were imaged for each construct to confirm consistent localization. c, Schematic of the construct used in d and e to compare promoter activity (RFP fluorescence) with ONEtag accumulation (GFP fluorescence). d, Scatter plot showing RFP (x-axis) and GFP (y-axis) fluorescence of leaves expressing the construct in c containing the SlWIP1-C ONEtag combined with a second strain carrying Gal4DBD or Gal4-VP16. e, Same as d for a ONEtag containing SlWIP2-C. f, Schematic of the experiment to test the sensitivity of the SlWIP2-C ONEtag. Gal4DBD fusions to effector domains of Arabidopsis TFs of varying strengths were used to drive the expression of the reporter. g, Scatter plot of the RFP (x-axis) and GFP (y-axis) fluorescence from leaves expressing the constructs in f. Open circles show mean ± s.d.; r, Pearson’s correlation coefficient. In d, e and g, n = 24 biological replicates (3 plants, 2 leaves per plant, 4 leaf punches per plant). Panel a partially created with BioRender.com.
To link reporter expression to nucleus sorting, we used a chimeric tag composed of a fusion of a biotin acceptor peptide to GFP and an outer nuclear envelope targeting domain (hereafter ONEtag) that can be biotinylated by a coexpressed BirA transgene using endogenous biotin. Nuclei extracts incubated with streptavidin magnetic beads can then be sorted using magnets. Similar approaches have been used to purify nuclei from specific cell types in transgenic plants28, but there are no reports of magnetic sorting of transiently transformed N. benthamiana tissues. We hypothesized that transiently transformed nuclei could be sorted into two fractions: a pull-down fraction enriched in transfer DNAs (T-DNAs) encoding activators, and a flow-through fraction depleted of strong activators and enriched in T-DNAs coding for repressors and neutral elements (Fig. 1a). For this approach to yield quantitative differences in enrichment, magnetic nucleus sorting must be sensitive enough to detect continuous differences in ONEtag expression levels among cells, rather than merely distinguishing between nuclei that do and do not express the ONEtag, as has been previously done to isolate cell type-specific nuclei29. To optimize magnetic nucleus sorting as a quantitative readout of transcriptional activity in agroinfiltrated N. benthamiana, we pursued two design goals. First, the transiently expressed ONEtag should localize to the nuclear envelope, as determined by the GFP subcellular localization. Second, as for any sensitive quantitative reporter, the level of ONEtag accumulation must correlate with the transcriptional activity of the promoter driving its expression.
As the first candidate ONEtag, we tested the AtRanGAP WPP domain, previously used to label nuclei in specific tissues in Arabidopsis. To confirm ONEtag subcellular localization, we coexpressed a nuclear marker (pUBQ10::H2B:mScarlet, hereafter H2B-red fluorescent protein (RFP)). When transiently expressed in N. benthamiana, the WPP tag did not localize to the nuclear envelope, as indicated by its failure to encircle the H2B-RFP signal (Fig. 1b). A similar pattern was observed for the WPP domain of the closest N. benthamiana RanGAP homolog. Since the C terminus of WPP proteins has also been used to target the nuclear envelope, we tested the C-terminal half of the N. benthamiana homolog of AtWIP1, which also failed to localize to the nuclear envelope. We next screened seven domains known to localize to the outer nuclear envelope in their native contexts, including the C termini of two tomato WIP proteins. Several candidates showed correct localization in N. benthamiana (Extended Data Fig. 1a), and we selected the C-terminal halves of WIP1 and WIP2 (SIWIP1/2-C) for further characterization (Fig. 1b).
We next focused on the second design criterion, that is, ensuring that the degree of ONEtag protein accumulation correlates with the transcriptional activity of the promoter driving its expression. To measure promoter activity independent of ONEtag protein abundance, we used an RFP-T2A-ONEtag fusion (Fig. 1c). RFP fluorescence intensity served as a proxy for promoter activity while GFP fluorescence intensity provided a metric for ONEtag accumulation. To drive the RFP-T2A-ONEtag reporter, we placed this transgene under the control of a minimal 35S promoter with 6 upstream Gal4 DNA-binding sites (UAS-min35S::RFP:T2A:ONEtag). To activate expression of this reporter, we coinfiltrated it in N. benthamiana with previously generated strains containing the Gal4 DNA-binding domain (Gal4DBD) alone or fused to the strong VP16 activator (Gal4-VP16)10. As expected, the level of RFP fluorescence was higher in leaves coexpressing the SlWIP1-C ONEtag and Gal4DBD-VP16 compared to Gal4DBD (Fig. 1d). However, GFP fluorescence was comparable between Gal4DBD-VP16 and Gal4DBD, showing that the increase in transcriptional activity driven by the VP16 fusion did not result in a commensurate increase in SlWIP1-C ONEtag protein accumulation (Fig. 1d). Cell microscopy confirmed that the ONEtag and RFP proteins are efficiently separated after being produced from the same transcripts (Extended Data Fig. 1b). While removal of biotin ligase recognition peptide did not relieve the buffering effect, removing SlWIP1-C was sufficient for GFP fluorescence to increase in response to VP16, similar to RFP (Extended Data Fig. 1c,d). We therefore replaced SlWIP1-C with SlWIP2-C. For this ONEtag, GFP and RFP accumulated linearly in response to VP16 showing that the accumulation of this tag closely correlates with the activity of the promoter driving it (Fig. 1e). Removing RFP-T2A from the constructs made accumulation of SlWIP1-C ONEtag slightly responsive to Gal4-VP16 but to a far lesser degree than SlWIP2-C (Extended Data Fig. 1e). To confirm that the accumulation of the SlWIP2-C ONEtag tracks with promoter activity across a large dynamic range, we coexpressed it with eight previously generated Gal4DBD fusions spanning a wide range of transcriptional strengths10. Across combinations, RFP and GFP fluorescence correlated well (R2 = 0.91, Fig. 1f,g and Supplementary Fig. 1). Hence, accumulation of the SlWIP2-C ONEtag is highly sensitive and quantitatively reflects the transcriptional activity of the promoter driving its expression.
Small-scale library validation of ENTRAP-seq
Having optimized the purification tag, we next sought to validate that our approach can be used to measure the activity of trans-acting elements. For enrichment in the nucleus pull-down fraction to serve as a metric of activity, the probability of pulling down a given nucleus should be correlated with its level of ONEtag expression. Further, for this pull-down bias to yield a quantitative metric of reporter expression, the correlation between promoter activity and ONEtag abundance that we observed at the tissue level should also apply in individual nuclei. The ultimate goal of ENTRAP-seq is to serve as a proxy measure of the level of expression of a reporter. Hence, we sought to benchmark ENTRAP-seq by comparing its output (that is, TF enrichment in the pull-down based on next-generation sequencing) against that of a well-established fluorescence reporter assay.
To demonstrate that ENTRAP-seq meets these criteria, we characterized a control set of 26 previously generated synthetic TFs consisting of 53 amino-acid (aa) fragments—hereafter referred to as ‘tiles’—fused to Gal4DBD19 (Supplementary Dataset 1) and spanning a wide range of activity, from repressors to strong activators19. We cloned these Gal4DBD fusion tiles under the constitutive pMas promoter in an Agrobacterium T-DNA expression plasmid containing the SlWIP2-C UAS-min35S::RFP:T2A:ONEtag construct and a constitutive BirA transgene (Fig. 2a). All three cassettes were placed in the same T-DNA to ensure that biotinylated ONEtag expression occurs exclusively in cells that express a synthetic TF. We used this backbone (pSA676v2) for all subsequent ENTRAP-seq experiments and fluorescence reporter assays involving Gal4DBD fusions. We previously determined that a low Agrobacterium OD (optical density at 600 nm) of approximately 0.0025 is optimal for expressing libraries of mixed trans-elements in N. benthamiana as it maximizes the number of transformed cells while minimizing the fraction of these cells that express more than one transgene26. At this OD, approximately 10% of leaf cells express a transgene and about 90% of these cells express a single one. For the following nucleus pull-down experiments, we therefore infiltrated the 26 control strains at a total mix OD of 0.0025. To purify and sort nuclei, we developed a protocol based on pipelines optimized for other plant species, with several modifications29,30 (Methods).
a, T-DNA constructs used to validate ENTRAP-seq. A synthetic promoter containing 6× Gal4-UAS sites followed by the minimal 35S promoter is used to drive the ONEtag fused to RFP by a self-cleaving T2A peptide. Downstream of the ONEtag, the plasmid contains a constitutive BirA expression cassette and a constitutively expressed fusion between the Gal4DBD and variable 53-aa tiles of various strengths. b, Infiltration setup for nucleus pull-down and microscopy experiments shown in c and d. Strains carrying the 26 constructs from a are grown separately and mixed in equal OD ratios before leaf infiltration at a final total OD of 0.0025. This mix is combined with a strain carrying a constitutive H2B-BFP nuclear marker infiltrated at OD 0.1. This tissue is then subjected to magnetic nucleus pull-down and the BFP signal is used to identify the nuclei of transformable leaf cells. c, Distribution of ONEtag expression levels in nuclei from different purification fractions. Histograms are normalized to have the same area. Nuclei with low GFP levels are enriched in the flow-through and there is a progressive enrichment in the pull-down with increasing GFP expression. Brackets show the P values of one-sided Mann–Whitney U-tests between histograms. d, Scatter plot of RFP versus GFP fluorescence in single nuclei in the pull-down fraction from c. e, Schematic of the experiment to compare ENTRAP-seq enrichment with fluorescence reporter assays. Left: the control set of 26 tiles from a was infiltrated at a low OD for nucleus pull-down and sequencing of the flow-through and pull-down fractions. Right: each of the strains in a is infiltrated individually at an OD of 0.2. The activity of each tile is determined by measuring the RFP fluorescence of infiltrated leaves in a plate reader. f, x-axis, transcriptional activity of each tile based on ENTRAP-seq (mean ± s.d. enrichment of each tile T-DNA in the nuclei pull-down fraction compared to the flow-through sample). y-axis, tile transcriptional activity measured individually using a fluorescence plate reader (mean ± s.d., RFP fluorescence). For nuclei pull-down and sequencing n = 4 biological replicates. For plate reader assays n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf). ⍴, Spearman’s correlation coefficient.
To fluorescently label transformable nuclei in the different purification fractions, we infiltrated the tile mix with a strain carrying a pUBQ10::H2B:BFP transgene (BFP, blue fluorescent protein, hereafter H2B-BFP) at OD 0.1, which ensures homogeneous and strong labeling across cells (Fig. 2b). Using the H2B-BFP nuclear marker, we counted nuclei in the pull-down and the flow-through fractions and calculated their level of ONEtag expression by measuring GFP fluorescence (Methods). Using this approach, we counted approximately 300,000 nuclei in the pull-down fraction starting from 10 g of fresh tissue (about 20 fully infiltrated leaves) (Extended Data Fig. 2a). The flow-through fraction contained approximately 10 million nuclei (Extended Data Fig. 2a). These numbers are within a factor of 2 from an estimate of the total number of transformed cells in the sample (estimation in Extended Data Fig. 2b). As shown in Fig. 2c, nuclei in the pull-down had a higher level of ONEtag expression than those in the flow-through, while a second pull-down showed an intermediate distribution (Fig. 2c). These distributions show a large degree of overlap such that the relative abundance in the pull-down with respect to the flow-through increases with ONEtag abundance (Fig. 2c). This feature is critical for the pull-down-to-flow-through ratio to serve as a continuous metric of transcriptional strength. We obtained qualitatively similar results in a biological replicate of this experiment (Extended Data Fig. 2c). Comparing mCherry and GFP fluorescence at the single nucleus level revealed strong correlation between reporter activity and ONEtag accumulation (Fig. 2d, Pearson r = 0.83), further validating the premise of our method.
Next, we sought to demonstrate that ENTRAP-seq recapitulates reporter activity as measured by a low-throughput method. To this end, we asked whether the degree of pull-down enrichment of each of the 26 control tiles correlates with their transcriptional strength measured using an established reporter assay. We infiltrated each construct individually and used a fluorescence plate reader to measure the tissue RFP and GFP fluorescence normalized to a constitutive BFP strain. This ratio was then expressed as a function of the no-tile Gal4DBD control. In contrast to the mixed infiltrations for nucleus pull-down, single tile infiltrations for fluorescence measurements were done at OD 0.2 to ensure a strong fluorescence signal (Fig. 2e and Methods section ‘Fluorescence plate reader reporter assays’). The basal fluorescence of the min35S promoter is close to leaf autofluorescence, hindering the detection of transcriptional repression in fluorescence reporter assays (Supplementary Fig. 2). Hence, in the case of repressor tile strains we coinfiltrated them with a third construct containing a synthetic promoter with embedded Gal4 binding sites, which has a four-times higher basal expression level than min35S31 (Supplementary Fig. 2). To measure the ENTRAP-seq enrichment of these tiles we combined the corresponding 26 Agrobacterium cultures in equal OD ratios and infiltrated the mix at a final OD of 0.0025. Nuclei purified from these leaves were subjected to magnetic sorting and total DNA was extracted from the pull-down and flow-through fractions. Amplicon sequencing of the T-DNA tile region was used to determine the relative abundance of each tile in each fraction to obtain their level of enrichment. Tile enrichment correlated very well with promoter activity as measured by leaf RFP fluorescence (Spearman ⍴ = 0.97), validating sequencing enrichment as a proxy for reporter expression, at least at this library size (Fig. 2f). As expected, leaf GFP fluorescence also correlated well with tile enrichment (Extended Data Fig. 2d, Spearman ⍴ = 0.97) and with RFP fluorescence (Extended Data Fig. 2e, Pearson r = 0.96).
Overall, these results demonstrate that by linking reporter activity to relative pull-down enrichment, ENTRAP-seq provides a method to characterize trans-element activity in plant cells.
Accuracy and reproducibility in large libraries of viral protein tiles
We next sought to leverage ENTRAP-seq to discover new potential transcriptional regulators. Plant pathogen attack is usually linked to widespread transcriptomic changes in the host through the action of pathogen-encoded transcriptional regulators32. There are well-known examples of such regulators delivered by pathogenic bacteria into plant cells, but how plant viruses manipulate host transcription is much less understood33. In the case of DNA viruses, transcriptional regulation of the viral genome is part of the pathogenic cycle. Plant viruses have very small genomes (typically in the order of 10–15 kilobase pairs, Extended Data Fig. 3a) that encode just a few proteins (normally about 4, Extended Data Fig. 3b), none of which resemble known TFs in terms of sequence homology and presence of known TF domains34. Notably, many of the most widely used genetic parts in plant and mammalian genetic engineering—for example, the VP16 AD, the 35S promoter and the SV40 NLS—originate from plant and animal viruses. More broadly, a better molecular understanding of the viral transcriptional arsenal could help devise new crop protection strategies.
To identify potential viral transcriptional regulators, we searched the plant viral proteome for putative ADs. Throughout this study, we adopted a strictly functional definition of ADs, namely, short peptides that are sufficient for strongly activating transcription of a promoter when recruited via a heterologous DBD. The fact that a peptide of viral origin fits this definition does not necessarily indicate that its protein of origin is involved in gene regulation, nor that this peptide is required for regulating transcription within its protein context. Indeed, many potent ADs are found in nonnuclear proteins that lack known gene regulatory functions, such as metabolic enzymes in mitochondria and chloroplasts19. A peptide may adopt a different structural conformation when isolated and fused to a DBD compared to its native protein context. As a result, tiles derived from TFs can sometimes behave as ADs in reporter assays even though they seem dispensable for the transactivation function of the TF itself. This functional definition is meant to emphasize that AD identification in reporter recruitment assays is but a first step toward fully characterizing the regulatory function of a protein.
As of the time of writing, the National Center for Biotechnology Information (NCBI) database contained 1,495 plant viral genomes and 5,607 proteins. To maximize sequence diversity, we first clustered the protein sequences at 90% sequence identity, yielding 4,537 representative nonredundant proteins. Each sequence was then tiled using 53-aa windows and 10-bp overlaps (Fig. 3a). To estimate the likelihood of each tile being a putative AD, we used PADDLE, a deep convolutional neural network that predicts acidic ADs14. The convolutional layer enables PADDLE to identify sequence-to-function relationships, especially within disordered sequences where traditional structure-based or homology-based comparative genomic algorithms fail. Although the model was trained on data from a yeast-based screen using yeast protein sequences, acidic ADs are evolutionarily conserved and we previously demonstrated that PADDLE can identify ADs from plants19. Only tiles with the highest predicted z-scores were selected for experimental testing (Fig. 3a). Finally, we further optimized protein diversity by selecting sequences that maximized the Hamming distance between binary protein encodings (Fig. 3a).
a, Schematic of the computational pipeline to generate libraries of putative transcriptional ADs from plant viruses. From the top, protein open reading frame sequences from genomes of all plant viruses in NCBI were deduplicated and clustered using 90% sequence identity and then fragmented into 53-aa tiles with 10-aa overlaps. PADDLE was used to predict the transactivation strength of these tiles. The top 2,000 tiles with the highest predicted activity were selected (PADDLE z-score >4.9). Protein sequences were binary encoded, and a sequence diversity maximization algorithm was applied to iteratively select sequences that maximized the average Hamming distance from those already included in the subset. This approach enriched for sequence-level diversity in the selected proteins to generate nested libraries of 250, 500, 1,000 and 2,000 tiles. b, Tile activity as measured by ENTRAP-seq (x-axis, sequencing-based enrichment) plotted against the mean ± s.d. tile activity measured individually using a fluorescence reporter assay (y-axis). Two biological replicates of the sequencing experiment are shown (orange and black circles) for libraries of 250, 500, 1,000 and 2,000 viral tiles. Blue lines show a sigmoidal fit to the data with the shaded blue area indicating ± 1 s.d. of residuals as an empirical confidence bound. Horizontal dashed lines indicate the middle point of the sigmoid fits. Vertical dashed lines indicate the enrichment value corresponding to the sigmoid midpoint +1 s.d. This value was taken as the lower bound of enrichment to classify tiles as activators. n for plate reader assays was 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf). c, Reproducibility between biological replicates (Pearson’s correlation in enrichment, y-axis) as a function of the absolute number of tiles that pass a threshold number of mapped reads (x-axis). The set of tiles counted on the x-axis was generated by imposing coverage cutoffs such that only tiles with more counts than an arbitrary and variable threshold (color bar) were considered to calculate the Pearson correlation. d, Reproducibility across different libraries. The activity of each tile was first averaged across both biological replicates of each library of a given size. Scatter plots show the activity of tiles that are shared between libraries of different sizes. The Pearson correlation (r) between libraries is shown.
Although our nuclei enrichment method performed well on the control set of 26 tiles (Fig. 2f), the scalability of ENTRAP-seq to larger libraries was not immediately clear. To address this question, we created 4 nested libraries of 250, 500, 1,000 and 2,000 viral tiles (Fig. 3a and Supplementary Dataset 2). All except for the 2,000-tile library also contained the 26 control tiles for which we had previously obtained single infiltration fluorescence data (Fig. 2f). For the 2,000-tile library, a new set of 26 viral tiles was measured individually in a fluorescence plate reader (Supplementary Dataset 1). These internal controls allowed us to validate the output of the assay by comparing the activity of these tiles in terms of reporter fluorescence and tile enrichment (as in Fig. 2f). Across all libraries, we found that the activity of a given control tile as measured individually in a fluorescence plate reader strongly correlates with its enrichment following magnetic pull-down, for two biological replicates of each library (Spearman ⍴ = 0.72–0.97) (Fig. 3b). These validation experiments also allowed us to obtain a calibration value to classify tiles as activators based on their sequencing-based enrichment. To this end, we fitted the validation data to a sigmoidal curve and calculated the enrichment value corresponding to the midpoint of this sigmoid plus its estimation error (Fig. 3b). This yielded an enrichment value of approximately 1.0–1.1, which was used as a cutoff to classify viral tiles as activators in subsequent analyses (Fig. 3b).
We found no consistent correlation between tile coverage in the library and tile enrichment ratio (Supplementary Fig. 3). When the enrichment of all tiles was considered, we found that biological replicates correlated well (Pearson r = 0.5–0.55) and this correlation increased substantially to r = 0.7–0.8 by using coverage cutoffs to filter out tiles with low representation in the library (Fig. 3c). Because of the nested nature of the libraries, it was also possible to compare tile enrichment across libraries. This revealed a high degree of reproducibility even across libraries (Pearson r = 0.59–0.8 without applying coverage cutoffs) (Fig. 3d). Thus, our assay is accurate and reproducible in libraries of up to 2,000 tiles.
Although the ENTRAP-seq signal correlates well with fluorescence reporter assays (Figs. 2f and 3b), we wished to understand to what extent ENTRAP-seq can be used as a method to resolve strength differences between TFs based on enrichment alone. We therefore sought to determine the assay resolution by estimating the measurement error in the largest tested viral library of 2,000 tiles. We define resolution as the minimum enrichment difference between two proteins to be able to classify them as activators of different strengths with a statistical significance of P ≤ 0.05. To this end, we leveraged replicate data to obtain a distribution of within-protein noise and estimate the error (Extended Data Fig. 4a, see Supplementary Information for details). Using 2 s.d. of the within-protein noise as a significance cutoff we found the resolution of ENTRAP-seq to be ~0.45 enrichment units for a single biological replicate of a 2,000-tile library (Extended Data Fig. 4b, see Supplementary Information for details). Since the enrichment values of activators span 1–2.5 enrichment units, this resolution allows us to distinguish at most 3 quantitatively distinct classes of activator strengths (see Supplementary Information for calculation details). As expected, this uncertainty is reduced by a factor of ~1/√n for n = 2 replicates, with a resolution of ~0.3 enrichment units when the 2 biological replicates of the 2,000-tile library are combined (Extended Data Fig. 4c,d; see Supplementary Information for calculation details). Given the dynamic range of enrichment values, a resolution of 0.3 units allows distinguishing multiple statistically different classes of activator TF based on their ENTRAP-seq enrichment signal. This enrichment uncertainty was largely independent of enrichment values (Extended Data Fig. 4e,f). As shown in Fig. 3c, the within-protein experimental error can be drastically reduced by filtering out tiles that are underrepresented in the library. Such reductions in measurement noise via replication or library coverage cutoffs can improve the assay resolution to ~0.15 enrichment units (Extended Data Fig. 4). Hence, the enrichment ratio behaves like a coarse-grained quantitative metric that can resolve a few different transcriptional strengths. It should be noted that, to a first approximation, enrichment and true TF strength (as measured by a fluorescence reporter assay) follow a logarithmic relationship (Figs. 2f and 3b), meaning that a resolution of 0.3 enrichment units encompasses a wide range of activator TF strengths. As a point of comparison, traditional low-throughput fluorescence reporter assays in N. benthamiana can detect much smaller differences in activation strength of 10–20% between constructs35. Thus, the 2,000 times higher throughput of ENTRAP-seq with respect to established methods comes at the expense of fine quantitative resolution.
Identification of putative ADs in plant viral proteins
Our pipeline for discovering potential viral ADs combining machine learning predictions with high-throughput measurements yielded 1,105 tiles that passed the activator enrichment cutoff, confirming that PADDLE can predict acidic ADs in plants19 (Fig. 4a, Extended Data Fig. 5 and Supplementary Figs. 4–6). These tiles are found throughout the phylogenetic tree of plant viruses (Fig. 4a). Two large families of plant single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA) viruses (Geminiviridae and Caulimoviridae, respectively) as well as multiple positive single-stranded RNA (+ssRNA) families (especially Potyviridae and Betaflexiviridae) were particularly rich in acidic ADs (Fig. 4a). We did not find any ADs in only 4 relatively small families: Amalgaviridae, Aspiviridae, Mitoviridae and Nanoviridae (−ssRNA, −ssRNA, +ssRNA and ssDNA, respectively). This contrasts with recent global surveys of human viruses where ADs were found almost exclusively in a few families of dsDNA viruses22,36. Although our screen did not encompass all viral genomes, it identified approximately twice as many tiles as the more comprehensive studies of human viral ADs, which reported ~600 ADs. This highlights the utility of using a predictive model such as PADDLE to bias the AD search when it is not technically feasible to explore the whole sequence space. In general, our search algorithm based on PADDLE performed remarkably well in terms of predicting ADs; around 50% of the tested tiles qualify as activators based on our calibration thresholds, although there was substantial variation in predictive accuracy across taxa (Fig. 4a, Extended Data Fig. 5 and Supplementary Figs. 4–6).
a, Phylogenetic tree of plant viruses showing the number of tiles selected from each viral genome (gray), the maximum PADDLE predicted z-score (light green), whether a tile from this viral genome was selected for characterization (dark green), the number of tiles that were experimentally measured (purple) and the proportion of measured tiles that were activators (orange). dsRNA, double-stranded RNA. b, Gene ontology (GO) enrichment analysis of tiles measured as activators across all libraries. c, Detailed view of tiles from the proteins most enriched in activator tiles in b. Top: AD prediction (PADDLE z-score) of each tile along the length of all proteins belonging to each given class. The amino-acid position of each tile was normalized by dividing by the length of the protein it originates from. The horizontal dashed line shows the z-score cutoff used to select tiles for testing. Blue dots show tiles that were experimentally tested but did not pass our activator enrichment cutoff. Orange dots show tiles that were tested and had an enrichment higher than the cutoff. Bottom: average ± 1 s.d. protein disorder as predicted by Metapredict. d, Selected strong activator viral tiles tested individually in N. benthamiana using a Gal4DBD recruitment fluorescence reporter assay. Shown is the RFP fluorescence normalized to the Gal4DBD alone (Methods section ‘Fluorescence plate reader reporter assays’). e, A set of tiles tested in d were tested by recruiting them to a synthetic reporter driving GFP using dCas9 fusions. Shown is the normalized leaf GFP fluorescence. sgRNA, single-guide RNA. In d and e, n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf), brackets indicate two-sided Welch’s t-test comparisons against VP16 with a Bonferroni multiple comparison correction, P < 0.05, **P < 0.01, ***P < 0.001, NS, not significant. Box plots show the median ± IQR, whiskers correspond to minimum and maximum values, excluding outliers.
To better understand the origin of these potential regulators, we performed GO term enrichment analysis. We found activator tiles in proteins from a wide range of functional annotations, although we note that due to the biased nature of our search not all functional classes are represented equally in our libraries (Fig. 4b and Extended Data Fig. 6). The presence of ADs in structural and catalytic proteins may reflect the multifunctional nature of viral proteins and mirrors the findings of a recent study that tiled human viruses, which found that 90% of putative viral ADs are found in nonregulatory proteins22. Four kinds of plant viral proteins showed a notable enrichment of activator tiles: geminiviral AL2-like proteins, RNA-dependent RNA polymerases (RdRp), helicases and helical capsid proteins (Fig. 4b). The AL2 protein, also known as TrAP (transcriptional activator protein), is involved in transactivating viral promoters in Geminiviridae and has been shown to regulate genomic promoters by physically interacting with host TFs37,38. AL2 lacks a DBD but contains a C-terminal acidic region that can transactivate a synthetic reporter when fused to Gal4DBD39. Thus, the recovery of strong ADs in AL2-like proteins using ENTRAP-seq validates our framework for discovering new viral-encoded transcriptional regulators. Our high-resolution tiling of these proteins allowed us to locate the position of their putative ADs (Fig. 4c). Consistent with the C terminus of AL2 being important for transactivation, we found that ADs are consistently located in three positions within a highly disordered region in the C terminus of diverse AL2 proteins (Fig. 4c and Supplementary Fig. 7). The positions of RdRp activator tiles are also relatively conserved but unlike AL2 these putative ADs are clustered in multiple distinct regions throughout the protein, most of which are predicted to fold into stable structures (Fig. 4c and Supplementary Fig. 7). Similar to RdRp, putative activator tiles in viral helicases and helical viral capsid proteins consistently fall in a few structured regions (Fig. 4c and Supplementary Figs. 7 and 8). In terms of prediction accuracy, most predicted ADs in AL2 and helical capsid proteins turned out to be true activators while the opposite was true for RdRp and helicases (Fig. 4c and Supplementary Fig. 7). These data indicate that a dedicated regulatory protein such as AL2 might activate transcription in a similar manner to canonical eukaryotic TFs with disordered ADs, while multifunctional viral proteins may repurpose stably folded and conserved catalytic or structural domains to accomplish this task. Alternatively, the peptides from these nonregulatory proteins may not actually function as ADs in their native context but simply have a sequence composition that led us to serendipitously identify them as activators in our screening setup. This large dataset should serve as a comprehensive starting point for future testing of specific hypotheses regarding the role of these putative ADs in their native protein context.
The discovery of these putative viral regulators prompted us to consider their potential as new plant synthetic biology tools. To find potent and compact activators that could serve as alternatives to existing field standards, we individually cloned 20 tiles that were among the most enriched in at least 1 library. We then tested them individually using the same Gal4-UAS fluorescence reporter assay described above, side by side with two potent popular ADs used in the field, VP16 and VPR40. All tested tiles activated the reporter at least 20-fold over the DBD alone (Extended Data Fig. 7 and Supplementary Table 3). We repeated this experiment for the strongest seven candidates and verified that four performed as well or better than VP16 and VPR (Fig. 4d), confirming that tile enrichment correlates well with AD strength. To determine whether these ADs perform well in different genetic contexts, we fused them to dCas9 and tested their activity when recruited to a synthetic promoter driving GFP as a reporter41 (Fig. 4e). dCas9-VPR outperformed all other fusions, but one tile was stronger than VP16 when fused to dCas9 (Fig. 4e). Since VPR is considerably larger than our viral tiles (553 aa versus 53 aa), this compact viral AD may be advantageous for vectors with a limited size capacity such as viral vectors42,43.
To characterize potential new repressor domains of viral origin, we followed a similar strategy, selecting 20 tiles that were highly depleted from the pull-down across libraries (Extended Data Fig. 7). As field standard controls, we included four of the most potent previously characterized synthetic ethylene-responsive element binding factor-associated amphiphilic repression (EAR) repression motifs44. Only 3 out of 20 of these putative viral repressors had significantly lower reporter expression than Gal4DBD but none surpassed synthetic EAR motifs in potency. The remaining 17 out of 20 tiles were statistically indistinguishable from the Gal4DBD control (Extended Data Fig. 7). Thus, the dynamic range of ENTRAP-seq spans from transcriptionally neutral tiles to the most potent known activators but it is not large enough to resolve transcriptional repressors. To specifically detect transcriptional repression, it may be necessary to use a reporter with a higher basal expression level in combination with changes to the pull-down protocol itself20.
In addition to minimizing construct size, these viral ADs offer a diverse set of genetic parts to reduce unintended consequences in synthetic programs, including cross-interactions with modules sharing identical parts and gene silencing arising from context-dependent effects.
Partial conservation of AD function between distantly related eukaryotes
Recently, plant protein sequences have been used in MPRAs in yeast to identify peptides that are sufficient for transcriptional activation in this model system15,19. Deriving insight about plant ADs from these experiments in a distantly related host depends on the degree to which transactivation mechanisms are conserved between these two systems. It is well documented that there is conservation of transcriptional activation mechanisms among eukaryotes, including between yeast and plants10,45. Decades of insights about plant TFs derived from yeast-based assays and our own results showing that a model trained in yeast successfully predicts plant viral ADs support this idea19. On the other hand, plants have highly derived and expanded TF families46,47,48 and plant transcriptional regulation displays many regulatory features not shared by other eukaryotes19,24,49. Not surprisingly, yeast-based screens can miss known ADs in well-studied plant TFs15. A survey of tens of ADs found that transactivation potency in yeast does not predict strength in N. benthamiana19,24 while an earlier study showed that the quantitative activity of mutant variants of a maize TF was largely consistent between yeast and maize50. Thus, it is not clear the degree to which yeast and plants have their own taxa-specific activation mechanisms and whether the quantitative activation strength of ADs is consistent between these two models19,24. To systematically study these questions, we took advantage of an existing dataset where thousands of protein tiles from Arabidopsis TFs were tested in yeast15. We leveraged ENTRAP-seq to test a subset of these tiles in N. benthamiana to compare their activity across model systems. We randomly selected 250 tiles while keeping the same distribution of yeast transcriptional strengths (referred to as PADI scores15) from the original library. We then cloned this library in the same vector used for our viral libraries and performed an ENTRAP-seq experiment.
Comparing the rank ordering of activity between yeast and N. benthamiana revealed a very limited correlation (Spearman ⍴ = 0.18, P = 0.0075) (Fig. 5a). Indeed, numerous strong yeast activators were depleted from the ENTRAP-seq pull-down and multiple enriched tiles in N. benthamiana were not found to be activators in yeast (Fig. 5a). Because yeast-based MPRAs are heavily biased toward identifying acidic ADs, we hypothesized that tiles not found to be activators in yeast might belong to other classes of ADs. To classify tiles as likely plant activators we used an enrichment threshold of 1.2: the minimum enrichment value for validated activators from the viral libraries. Among the 40 tiles that had an ENTRAP-seq enrichment greater than 1.2 and a PADI score lower than 1 (labeled ‘quadrant 4’ in Fig. 5a) we found two glutamine (Q)-rich sequences that were not classified as activators in yeast. The transactivation role of Q-rich stretches is conserved in eukaryotes51. The first of these polyQ ADs is in the middle region (MR) of the TF gene AUXIN RESPONSE FACTOR 19 (ARF19, AT1G19220, Fig. 5b). The middle region is known to be important for transactivation in activator auxin response factors (ARFs)52 but the yeast assay mapped the ADs of ARF19 and other activator ARFs to regions flanking the middle region15. The second polyQ stretch belongs in the C terminus of the TF gene AGAMOUS-LIKE 65 (AGL65, AT1G18750), a region that is thought to contain the AD in this family of TFs53 (Fig. 5c).
a, Comparison of tile transcriptional activity between yeast and N. benthamiana. A set of peptide tiles from Arabidopsis TFs previously measured in yeast was tested in N. benthamiana. Scatter plot shows the activity of each tile in yeast (PADI score, y-axis) and N. benthamiana (tile enrichment, x-axis). The horizontal dashed line indicates the PADI threshold value used to classify tiles as activators in yeast. The vertical dashed line shows the enrichment value below which we never detected validated activator tiles in Fig. 3b. b, Q enrichment along the ARF19 TF protein shown as percentage of Q residues in a moving window of 20 aa. The green bar indicates the position of the ARF AD known as middle region (MR). Orange bars show the position of ADs according to the yeast assay. The blue bar shows the position of the AD in plants according to ENTRAP-seq. c, Same as b for the AGL65 TF. No ADs were identified for this TF in yeast. d, The degree of structural disorder of each tile (y-axis) is plotted against its ENTRAP-seq activity in N. benthamiana (x-axis). The histogram on the right includes tiles likely to be activators in both yeast and N. benthamiana as well as N. benthamiana only (quadrants 2 and 4 in a). e, Predicted disorder in the NAC001 TF. The green bars indicate the NAC dimerization domain and the transactivation domain. The blue bar shows the AD identified by ENTRAP-seq and the orange bar shows the position of the ADs active in yeast. f, Validation of the tiles highlighted in blue in b, c and e using a fluorescence reporter assay. g, A random selection of tiles that were activators in yeast but de-enriched in tobacco (‘quadrant 1’ in a) tested in a fluorescence reporter assay against the Gal4DBD alone as a control. In f and g, n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf), *P < 0.05, **P < 0.01, ***P < 0.001, NS, not significant in a two-sided Welch’s t-test against Gal4DBD with a Bonferroni multiple comparisons correction. Box plots show the median ± IQR, whiskers correspond to minimum and maximum values excluding outliers.
An alternative hypothesis for why ADs can be active in plants but not in yeast is that these tiles are not ‘bona fide’ ADs, namely, peptides in intrinsically disordered regions that are sufficient to directly recruit the core transcriptional machinery to the promoter. Instead, these tiles may code for folded protein domains that bind plant-specific TFs present in N. benthamiana that in turn contain canonical ADs that activate transcription in yeast. To explore this possibility, we calculated the predicted disorder of tiles in the library and compared this prediction with their ENTRAP-seq enrichment. We found these two metrics to be completely uncorrelated (Spearman ⍴ = 0.13, P = 0.054) and tiles that are active in plants (enrichment >1.2) are equally likely to be structured as to be disordered (Fig. 5d). Thus, activator tiles with a plant-specific bias do not necessarily form folded domains. However, among highly structured tiles we found cases that do fit our initial hypothesis. For example, a tile in the TF gene NAC DOMAIN CONTAINING-PROTEIN 1 (NAC001, AT1G01010) containing part of a structured domain involved in homo-dimerization and hetero-dimerization within NAC family TFs was found to be highly active in plants but not in yeast54 (Fig. 5e). The NAC TF family is plant-specific and contains acidic ADs in their C-terminal half that are activators in yeast55 (Fig. 5e). Thus, in N. benthamiana, this NAC001 tile might simply recruit endogenous NAC activators that, in turn, contain ADs that are active in both N. benthamiana and yeast. In yeast, however, this NAC001 tile may be incapable of activating the reporter because this organism lacks interacting NAC partners.
We confirmed that the polyQ ADs of ARF19 and AGL65 as well as the NAC001 AD activated the reporter in our fluorescence plate reader assay (Fig. 5f and Supplementary Table 3). In this comparative ENTRAP-seq experiment, we found 14 tiles that had a PADI score >1 and an enrichment value <1.2, corresponding to likely yeast-only activators (labeled ‘quadrant 1’ in Fig. 5a). To test whether these plant-derived sequences correspond to ADs that activate transcription in yeast but not in plants, we individually measured 12 out of 14 of these tiles. Most (9 out of 12) of these tiles did not activate transcription in N. benthamiana in a fluorescence reporter assay, in fact, 5 out of 12 were repressors, consistent with their depletion from the ENTRAP-seq pull-down (Fig. 5g and Supplementary Table 3). These results are consistent with the notion that plant-specific as well as yeast-specific AD tiles are not uncommon. The stark differences in the rank order of activity between the two systems for most tiles suggest that there are fundamental differences in transcriptional activation mechanisms between these two models, although other explanations are conceivable as well. For example, the reporter in the yeast assay is chromosomally integrated while the ENTRAP-seq reporter is transiently expressed. To better understand the degree of conservation of ADs between these two model eukaryotes, future studies will be necessary to leverage ENTRAP-seq to measure all tiles across all Arabidopsis TFs in an unbiased and systematic manner. Finally, these findings show that it is wise to exercise caution when interpreting tile MPRA data of plant proteins tested in yeast as both false positives and false negatives can arise when using one model system to represent the biology of another.
Rationally tuning the activity of a plant TF
One of the implicit assumptions of AD recruitment assays is that TFs are largely modular56. If this is the case, it may be possible to study and engineer plant TFs in high throughput by isolating, mutating and screening their ADs for altered functionality. Because ENTRAP-seq can reliably detect coarse-grained quantitative differences in enrichment (Extended Data Fig. 4 and Supplementary Information), and enrichment is highly correlated with activation strength (Figs. 2f and 3b), we hypothesized that the method could be used to screen for mutant AD variants displaying a continuous spectrum of activity. To test this idea, we leveraged ENTRAP-seq to characterize a mutant library of the Arabidopsis TF gene CONSTANS (CO, AT5G15840) with the goal of creating an allelic series containing weaker and stronger activators than wild type (WT) as well as alleles that switch CO from activation to repression of target genes. Natural variation in CO and its homologs is associated with flowering time, a key agronomic trait57,58. Although Arabidopsis CO appears to act exclusively as a transcriptional activator, related CO-like genes in this and other species can act as repressors, suggesting that it is feasible to switch the CO regulatory role via mutations59,60,61. In agreement with the role of CO in its native context and previous reports, transient expression in N. benthamiana of a fusion between CO and a BFP (BFP-CO) strongly activates expression of the FLOWERING LOCUS T (FT, AT1G65480) promoter (1.9 kb upstream of the ATG, hereafter FTpro) as measured by a GFP reporter assay62 (Fig. 6a and Supplementary Table 3).
a, Transactivation assay for CO function in N. benthamiana. A strain carrying GFP driven by the Arabidopsis FTpro was infiltrated with a strain carrying pUBQ::BFP:CO or an empty vector (EV). pUBQ::mCherry was used as a normalization. Shown is the P value of a two-sided Welch’s t-test. b, Location of the CO AD. Solid lines show the probability of residues being part of an acidic AD according to the PADDLE and TADA models. The shaded area shows the probability of being a disordered region according to Metapredict. Red bar on top shows the empirically determined AD. The dashed lines indicate the 69-aa region used to build a tile library. c, Screening of the CO AD tile library using nuclei enrichment. Mutant tiles are fused to the Gal4DBD and used to drive the expression of the magnetic sorting reporter. d, Histogram showing the distribution of tile enrichment values. The dashed line and the marker show the mean enrichment of six WT tiles. Error bar corresponds to s.d. e, Heatmap showing the enrichment values of tiles carrying single amino-acid substitutions. f, The single residue substitutions shown in e were classified in three categories, weaker than WT (blue), similar to WT (yellow) and stronger than WT (red). The stacked bars show the fraction of mutations in e belonging to each activity category. g, Transactivation assay to test the effect of mutations in the CO AD in the context of the full-length protein. Mutations identified in c–f as being weaker (loss-of-function) or stronger (gain-of-function) than the WT AD were introduced into the CO coding sequence and used to drive expression of the FTpro GFP reporter. h, FT promoter activity in the presence of various CO mutant alleles. Predicted activity is based on the enrichment of the corresponding AD tile with respect to the WT tile. In a and h, n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf), box plots show the median ± IQR, bars correspond to minimum and maximum values excluding outliers. ***P < 0.001, **P < 0.01, *P < 0.05, NS, not significantly in a two-sided Welch’s t-test comparison against WT with a Bonferroni multiple comparison correction.
CO contains a poorly conserved, highly disordered middle region that is sufficient for transactivation in a Gal4DBD recruitment assay62 (Fig. 6b). Although multiple CO alleles have been identified in random mutagenesis screens, all lie outside this middle region and are loss of function63 (Fig. 6b). To better define the position of the CO AD within this middle region, we leveraged PADDLE as well as TADA, another model trained on datasets from yeast-based screens that can predict acidic ADs13,15. Both models have a high degree of agreement and point to a region around aa position 175 as the putative CO AD (Fig. 6b). We therefore selected a 69-aa window in this region as the putative AD (Fig. 6b). To explore the sequence space of the CO AD, we generated a library of 350 tiles carrying 1–4 aa substitutions and including the WT AD as a control. The substitutions were rationally designed to either increase or decrease AD strength based on features important for acidic AD function in yeast and animal cells, namely (1) depletion of positive residues (R and K), (2) enrichment of negative residues (D and E), (3) enrichment of hydrophobic residues (particularly aromatics and leucine), (4) depletion of continuous stretches of hydrophobic residues and (5) lack of stable secondary structure6,13,64,65. Following the same rationale, we also mutated every residue to K, P, W and D. As internal controls we included five WT tiles. Finally, we introduced ten random synonymous mutations in each tile in the library to better distinguish between highly similar tiles during sequencing. Following the same strategy used for our previous experiments, we cloned these tiles as Gal4DBD fusions in our magnetic sorting reporter and performed an ENTRAP-seq assay (Fig. 6c).
The enrichment of the CO AD tiles spanned a range of values comparable to other libraries and the enrichment of WT tiles was highly reproducible (1.12 ± 0.02). Perhaps not surprisingly, most tiles were less enriched than WT, corresponding to potential loss-of-function mutants (Fig. 6d). We found a moderate agreement between tile activity as measured by ENTRAP-seq and the predictions from PADDLE and TADA (Spearman ⍴ = 0.38 and 0.4, respectively) although the two models largely agreed with one another (Pearson r = 0.89) (Supplementary Fig. 9). This suggests that models trained in plants could be exploited in the future to guide the design of mutant plant AD screens. Comparing the activity of each tile to its biochemical properties showed that, as expected for an acidic AD, positive charge and the number of F or W residues have a significant correlation with activity (Spearman ⍴ = −0.38 and 0.24, respectively, Extended Data Fig. 8). Throughout the AD, mutation of residues to K tended to decrease the activity and most of the strong loss-of-function AD alleles are K substitutions (Fig. 6e,f). Most gain-of-function single mutants correspond to W or D substitutions (Fig. 6e,f). Mutating residues to P, which severely disrupts protein secondary structure, rarely had strong effects (Fig. 6e,f). These results are consistent with the CO AD behaving like a classic intrinsically disordered acidic AD and highlights the power of combining predictive models with semirational mutagenesis guided by known AD compositional biases.
To test the effect of these mutations in the context of the full-length CO protein, we chose ten putative loss-of-function and gain-of-function alleles among the tiles at either end of the enrichment value distribution. We then introduced the coding mutations from each of these tiles into the full-length CO expression construct and tested their effect individually using the FTpro fluorescence reporter previously established (Fig. 6g and Supplementary Table 3). As controls we also cloned three previously isolated semidominant loss-of-function alleles corresponding to point mutations outside the AD identified in random mutagenesis screens. Out of ten predicted loss-of-function CO mutants, seven were significantly weaker activators than WT and two of these were as weak as classic alleles. The M186P mutant behaved as a gain-of-function allele, in contrast to its activity as a Gal4DBD fusion tile (Fig. 6h). The ethyl methanesulfonate (EMS) mutant controls as well as some of our isolated alleles switched CO activity from an activator into a repressor, consistent with the EMS mutants being semidominant in Arabidopsis63. Out of the ten predicted gain-of-function mutant alleles, five were stronger activators than WT (Fig. 6h). Among the rest, four retained activator function but were not statistically stronger than WT and one putative gain-of-function allele (Y194W) was a slightly weaker activator than WT. We conducted an independent whole-experiment replicate of this figure and obtained similar results (Pearson r = 0.92, Extended Data Fig. 9). Collectively, these data demonstrate that tile activity in the context of an ENTRAP-seq assay can be predictive of tile function within the original TF protein environment. This allowed us to rapidly engineer a new allelic series of a plant TF, including alleles that switch its function from activator to repressor.
Discussion
The long timescales required to generate transgenic plants has historically been a limiting factor in efforts to systematically map the sequence-to-function protein landscape of transcriptional regulators and other classes of proteins66. Compared to stable transgenic approaches, protoplast transformation and transient expression in N. benthamiana can substantially accelerate the pace of sequence characterization of gene variants in planta. These systems have recently been used as expression platforms for multiplexed reporter assays to measure the activity of tens of thousands noncoding variants at a time7,24. However, these technologies remain limited to library elements acting in cis to the reporter, such as enhancers and promoters. Gene libraries acting in trans (for example, single-guide RNA and protein-coding libraries) represent a unique challenge because transgenes that regulate the same reporter can readily interfere with one another if expressed in the same cell. To solve this problem, it was first necessary to achieve single-transgene expression, which was only very recently demonstrated for agroinfiltration26,27. The second missing piece was linking transgene activity to a measurable output, something that ENTRAP-seq achieves by biasing the nucleus pull-down probability by the level of reporter expression.
While methods such as DAP-seq8 and TARGET9 have shed light on TF DNA-binding preferences and cis-regulatory landscapes, they do not capture how TF regulatory activity itself is encoded at the protein level. This gap is particularly critical given that many transcriptional regulatory domains—such as ADs—are intrinsically disordered and poorly conserved, making them difficult to identify through structure-based and homology-based algorithms. Beyond mapping transcriptional circuitry, ENTRAP-seq provides a powerful framework for dissecting and engineering protein function. Its parallelized, library-based design is well-suited for the systematic exploration of sequence-to-function relationships, akin to deep mutational scanning approaches that have revealed critical residues and domains in other systems66. By screening libraries of natural variants or synthetic constructs, ENTRAP-seq enables the direct measurement of how coding sequence variation impacts regulatory activity: an essential step in linking genotype to phenotype. This capability is particularly relevant for proteins involved in pathways that are not well conserved across eukaryotes and has broad implications for plant biology and crop improvement.
As a method to quantify reporter activity, ENTRAP-seq shows a remarkable level of agreement with traditional low-throughput approaches. Indeed, combining all the cross-comparisons between enrichment and fluorescence reporter assays presented in this study reveals a tight correlation (Extended Data Fig. 9, Spearman ⍴ = 0.83, N = 497 comparisons in total). This meta-analysis also shows that our method to define activators based on enrichment is highly conservative: there are zero tiles that pass the enrichment cutoff but fail to activate the reporter in a fluorescence assay (Extended Data Fig. 10). This lack of false positives at the expense of false negatives should be kept in mind when interpreting our AD identification results. Our analysis of ENTRAP-seq resolution demonstrated that the degree of enrichment is a coarse-grained but ultimately continuous metric; however, this information is necessarily lost when collapsing this metric into a binary. Although the absolute levels of enrichment of ENTRAP-seq (approximately 0.5–2.5 times) are small compared to other enrichment-based MPRAs, the standard error of this method is also much smaller. Z-score normalization of enrichment from ENTRAP-seq and a recently published enrichment-based cell culture MPRA22 shows that both methods have comparable dynamic ranges (Supplementary Fig. 10). Together, these findings highlight the robustness, precision and conservative reliability of ENTRAP-seq as an approach for identifying and approximating the activity of transcriptional regulators.
Recently, genome editing technologies have opened the door to generating TF alleles with quantitative differences in activity as a means to modulate quantitative traits in crops67. These efforts have focused on editing cis-regulatory regions rather than TF coding sequences. We foresee that coding mutations in TF regulatory domains may provide an alternative approach to engineer such allelic series. Compared to cis-regulatory mutations, coding mutations are more likely to conserve expression levels and spatiotemporal expression patterns. These mutations could be identified by combining ENTRAP-seq screening with predictive sequence-to-function models of TF activity. Because quantitative and qualitative aspects of plant transcription are not necessarily conserved in yeast, future models of plant TF activity will need to be trained on diverse plant-derived sequences expressed in plant cells. First-generation yeast models such as PADDLE were based on 5,000–10,000 protein tiles, a scale that is within reach by combining a few ENTRAP-seq runs. Similarly, by combining several ENTRAP-seq runs, genome scale coverage of all TFs in a plant genome may also be feasible. Overall, ENTRAP-seq addresses a key technical gap in plant functional genomics by systematically linking protein sequence to transcriptional output in a native plant context, allowing for the functional annotation of regulatory domains at scale.
Methods
Molecular cloning
Except for dCas9 fusions, all validation and candidate tiles tested individually in fluorescence reporter assays were cloned in the pCambia1300-based backbone pSA676v2 containing 3 transgenes: 6xUAS-min35S::RFP:T2A:SlWIP2-C ONEtag, pUBQ10::BirA and pMas::Gal4DBD-dropout. The dropout consists of an Escherichia coli GFP reporter flanked by BsaI sites for negative GFP colony selection. Individual tiles were cloned in BsaI-digested pSA676v2 using Gibson assembly. Tiles fused to dCas9 were cloned into backbone pSA742 containing p35S::dCas9-dropout using Gibson assembly. The sequence of pSA676v2 and pSA742 as well as other plasmids used in this study can be found in https://public-registry.jbei.org. For library cloning, see Methods section ‘Library cloning’.
Live imaging of leaf tissues
Leaf samples 3 days postinfiltration were mounted in water. All images were acquired in a Zeiss LSM 710 confocal microscope using a ×63 oil objective. Channels were acquired sequentially using 20 z slices every 0.5 μm and 1,024 × 1,024 pixel image size. The zoom was 2.0 and the wavelengths were 568 nm excitation, 585–630 nm emission for RFP and 488 nm excitation, 494–581 nm emission for GFP. The pinhole in both channels was set to 1 a.u.
Imaging and fluorescence quantification of isolated nuclei
Purified nuclei were mounted in nucleus purification buffer (NPB) buffer in a hemocytometer and imaged using a Leica DM6B epifluorescence microscope. Eight images were acquired from each purification fraction. For each image, three z-stacks were acquired sequentially, one for each channel. Each z-stack consisted of 5 z-sections every 20 µm. The BFP nuclear marker was imaged with a 4,6-diamidino-2-phenylindole filter (excitation 350 ± 50, emission 460 ± 50), the ONEtag was imaged using the L5 filter (excitation 480 ± 40, emission 527 ± 30) and mCherry RFP was imaged using the TXR filter (excitation 560 ± 20, emission 630 ± 38). Images were acquired using a ×5 dry objective for an area of 2.641 mm2 and 2,048 × 2,048 pixels (pixel size, 0.78 µm2). To avoid detector saturation, the laser power and the camera exposure were set to 17% and 600 ms, respectively, for all channels. Stacks were max-projected in z for all subsequent analyses. Representative BFP images were manually curated to label nuclei and background and these labels were used to train a random forest classifier. This model was then used to generate a binary segmentation of all BFP images. Only objects with an area between 40 µm2 and 400 µm2 were considered to be true nuclei. Binary nuclear masks were superimposed on the GFP and RFP images to calculate the nuclear fluorescence in these channels. Nucleus fluorescence was calculated as the average background-subtracted pixel intensity of the 20% brightest pixels per nucleus. The average pixel intensity outside a dilated nuclear mask was used as background.
Library cloning
Single-stranded oligo pools containing tiles of each library were obtained from Twist Biosciences. The oligos were flanked by a shared overhang sequence containing BsaI cut sites for cloning into the ENTRAP-seq backbone pSA676v2. The overhangs were 5′ ACGTCAGTGTGGTCTCTAGGT and TAAGCTTAGAGACCAAGCGCTTA 3′ (BsaI sites underlined, cloning overhangs in bold). Oligo pool DNA was resuspended in 150 μl of water and used for a 50 μl PCR reaction using primers binding to the overhangs (tileLib-F and tileLib-R, Supplementary Table 2: primers). The PCR reaction contained 1 μl of resuspended oligos as template, 2.5 μl of each primer at 10 µM, 19 μl of water and 25 μl of 2× NEB Q5 high fidelity DNA polymerase master mix. The reactions were run using the standard conditions provided by the Q5 manufacturer using a melting temperature of 65 °C, an elongation time of 15 s and 15 cycles. A control 10-μl reaction was run for 35 cycles. Both reactions were run side by side on a 1% agarose gel to confirm correct amplification. The PCR was purified using Zymoclean DNA clean and concentrator columns using two washes with 600 μl of wash buffer and eluting in 30 μl of water. The dsDNA amplified oligo pool was then digested overnight at 37 °C using the following conditions: 13 μl of water, 30 μl of purified PCR, 5 μl of NEB 10× HF buffer and 2 μl of NEB BsaI-HF. The digestion was purified using Zymoclean DNA clean and concentrator columns using 2 washes with 600 μl of wash buffer and eluting in 20 μl of water. For the backbone, approximately 10 μg of pSA676v2 DNA were obtained from a standard miniprep. The plasmid was digested overnight at 37 °C with NEB BsaI-HF using the following reaction conditions: 50 μl of plasmid miniprep DNA, 3 μl of restriction enzyme, 8 μl of 10× HF buffer and 19 μl of water. The entire digestion was run on a 1% agarose gel and the band corresponding to the backbone was purified using the Zymoclean gel DNA recovery kit using 2 washes with 600 μl of wash buffer and eluting in 20 μl. The digested dsDNA library was ligated into digested pSA676v2 using NEB T7 ligase. Here, 500 ng of backbone (approximately 6 μl) and enough insert to achieve a ratio of 3:1 insert:backbone were added to 120 μl of the final reaction volume ensuring that the total DNA concentration did not exceed 10 ng μl−1. The reaction was run using 3 cycles of 2 h at 22 °C and 4 h at 16 °C. The ligations were column-purified and eluted in 15 μl of water.
Library transformation into bacteria
One tube of E. coli SIG10 MAX electrocompetent cells (Millipore Sigma CMC0004) was electroporated with 2 μl of ligated library and recovered in Luria-Bertani (LB) medium at 37 °C for 1 h. Following recovery, 1 μl of cells was plated in 50 μg ml−1 Kanamycin LB plates in triplicate to assess the number of colony forming units (CFUs) per transformation. The minimum number of CFUs used was ten times the library size. The rest of the culture was inoculated in 50 ml of 50 μg ml−1 Kanamycin LB and grown overnight at 37 °C. A miniprep was run on 5 ml of the overnight culture to obtain plasmid DNA for Agrobacterium transformation. For each library, Agrobacterium GV3101::pMP90 was electroporated using 10 μl of competent cells, 2 μl of library miniprep (approximately 500 ng of DNA) and 88 μl of water in each cuvette. Cells were recovered in 1 ml of LB at 28 °C for 2 h. Five such transformations per library were used to increase the number of CFUs. Following recovery, 1 μl of cells was plated in LB plates containing 50 μg ml−1 rifampicin, 50 μg ml−1 kanamycin and 30 μg ml−1 gentamicin in triplicate to assess the number of CFUs per transformation. The minimum number of Agrobacterium CFUs used was ten times the library size. Recovered Agrobacterium cells were inoculated in 100 ml of LB containing the same antibiotic concentrations and grown for 2 days at 30 °C shaking. Glycerol cryostocks were prepared from these cultures.
Plant growth conditions
N. benthamiana was grown in a controlled environment room at 25 °C of temperature, 60% humidity, 16 h/8 h light/dark cycle and a light intensity of 120 μmol m−2 s−1. Before agroinfiltration, plants were grown for 29 days (sowing at day 0) in individual pots using Sunshine Mix no. 4 soil (Sungro) supplemented with 499 Osmocote 14-14-14 fertilizer (ICL) at 5 ml l−1. Immediately after agroinfiltration plants were returned to the growth room for 3 days until sampling.
Agrobacterium growth conditions
For fluorescence plate reader assays Agrobacterium glycerol stocks were streaked onto LB agar plates supplemented with antibiotics: 50 μg ml−1 rifampicin, 50 μg ml−1 kanamycin and 30 μg ml−1 gentamicin. The day before infiltration, individual colonies were inoculated into liquid LB medium containing the same antibiotics and incubated overnight at 30 °C, shaking. On the day of infiltration, cultures were diluted 1:10 into LB with identical antibiotic concentrations and incubated further under the same conditions for 3 h or until an OD of ~1.0 was reached. For gene libraries, transformed cells previously recovered for 2 h in LB without antibiotics were transferred to 100 ml of LB containing the same antibiotic concentrations listed above. Two days later, the OD of this culture was measured and the culture was diluted 1:10 if the OD was higher than 1.0 before infiltration. Alternatively, approximately 100 μl of library glycerol cryostocks were inoculated into LB containing antibiotics the day before infiltration. The day of infiltration, the culture was diluted 1:10 and grown until reaching an OD of ~1.0.
Agroinfiltration
Agrobacterium liquid cultures with an OD between 0.5 and 1.0 were pelleted by centrifugation at 4,000g for approximately 10 min and resuspended in an equal volume of infiltration buffer (10 mM MES, pH 5.6; 10 mM MgCl2; 150 μM Acetosyringone). The resuspended cultures were incubated with shaking at room temperature for 1 h before OD measurements were taken. Next, a 1:5 dilution of each culture was prepared in infiltration buffer, and the OD of this dilution was determined using a spectrophotometer. Final OD dilutions for infiltration were then made using the same buffer. The mixtures were infiltrated into the 6th and 7th leaves (counting upward, with cotyledons considered as leaves 1 and 2) within 1 h of preparation. For fluorescence reporter assays, 3 mixes were infiltrated per leaf at an OD of 0.2 of each strain. For nucleus purification infiltrations the culture was diluted to OD 0.0025.
Fluorescence plate reader reporter assays
For leaf fluorescence measurements of transcriptional reporters, four 6-mm leaf disks were cut from each agroinfiltrated leaf using a single-hole puncher and deposited on top of 350 µl of tap water in a clear-bottom black 96-well plate. All reporter constructs were cloned in the ENTRAP-seq construct backbone (pSA676v2). The DBD-only control contains a stop codon instead of a tile (pSA676v2s). For normalization, a pMAS::BFP strain was coinfiltrated (pSA639). Both strains were diluted to OD 0.4 and mixed in equal ratios to a final OD of 0.2 each. Three days postinfiltration the plates were read on a BioTek Synergy H1 plate reader using 488-nm excitation and 520-nm emission for GFP, 587-nm excitation and 615-nm emission for RFP, 400-nm excitation and 454-nm emission for BFP. The normalized reporter activity reported throughout this work corresponds to a double normalization to BFP and to Gal4DBD alone. First, the RFP fluorescence of each leaf punch was divided by its BFP fluorescence. Next, this value was divided by the mean RFP/BFP fluorescence of Gal4DBD.
NPB and NPBt
NPB was prepared following the recipe from ref. 28: 20 mM MOPS pH 7, 40 mM NaCl, 90 mM KCl, 2 mM EDTA pH 8, 0.5 mM EGTA pH 8, 0.5 mM spermidine, 0.2 mM spermine and 1 complete protease inhibitor tablet per 100 ml. A 5× stock solution was prepared adding everything except for fresh ingredients: spermidine, spermine and protease inhibitors. The fresh ingredients were added to the 1× solution up to 1 day before nucleus pull-down. NPB with triton (NPBt) was prepared following the same recipe and protocol except that 0.1% (v/v) Triton X-100 was added.
Preparing tissue for nucleus sorting
Agrobacterium cultures containing gene libraries were diluted to a final OD of 0.0025 in infiltration buffer and infiltrated into 8–10 4-week-old N. benthamiana plants (Methods section ‘Agroinfiltration’). The full leaf area was infiltrated. Three days postinfiltration, leaves were harvested and the mid vein was removed. An ~5-mm perimeter around the edge of the leaves was discarded as well. Then, 10 g of leaf material was chopped into ~1-cm2 pieces using a razor blade. The chopped leaves were submerged in 50 ml of NPB in a beaker and placed in a vacuum chamber for 10 min to infiltrate NPB into the tissue (Methods section ‘NPB and NPBt’). Further rounds of vacuum filtration were applied until the tissue was completely soaked in NPB. Next, the tissue was washed three times with tap water and tapped dry with paper towels. Finally, the leaf material was flash frozen in liquid nitrogen, ground into a very fine powder using mortar and pestle and stored at −80 °C.
Nucleus pull-down
Frozen ground tissue (Methods section ‘Preparing tissue for nucleus sorting’) was resuspended in 100 ml of ice-cold NPB and passed through a 70-μm strainer (Corning CLS431751) into 2 50-ml tubes to separate nuclei from cells and other larger leaf fragments, keeping the material on ice at all times. This filtrate was centrifuged at 500g for 10 min at 4 °C. The pellets from both tubes were combined by resuspending them in 20 ml of NPB. The resuspension was passed through a 100-μm strainer (Fisher Scientific, cat. no. 07000221) to remove remaining pieces of tissue. In parallel, 100 μl of Dynabeads MyOne streptavidin beads C1 (Thermo Fisher, cat. no. 65001) were washed by resuspending them in 1.5 ml of NPB, separating them in a magnetic rack and resuspending in 100 μl of NPB. The washed dynabeads were then added to the resuspended nuclei and the mix was incubated for 30 min, shaking gently in a cold room. The following steps were carried out in a cold room. The nuclei were diluted in NPBt to a final volume of 100 ml split into 2 50-ml tubes. Each tube was then attached to 4 neodymium magnets (LOVIMAG no. B0CBTSFFSB) using an elastic band (see Supplementary Fig. 11 for details) and placed in this setup for 30 min to pull-down nuclei. The flow-through of this pull-down (that is, the supernatant) was carefully removed and centrifuged for 10 min at 500g, resuspended in 25 ml of NPB per tube and centrifuged again at 500g for 10 min to obtain a pellet corresponding to the flow-through sample. The pull-down from the first magnetic separation was washed twice by resuspending it in 50 ml of NPBt and placing it in the magnetic rack for 20 min. The washed nuclei were next resuspended in 15 ml of NPBt in a 15-ml tube and placed in the magnetic rack again for 20 min. The pellet from this separation was resuspended in 3 ml of NPB and 10 μl of this sample was taken for imaging. The rest of the sample was centrifuged at 1,000g for 10 min to obtain a pellet corresponding to the pull-down sample. The pellets corresponding to the flow-through and pull-down samples were resuspended in 1 ml and 200 μl of cetyltrimethylammonium bromide (CTAB), respectively (100 mM Tris-Cl pH 8, 20 mM EDTA pH 8, 1.4 M NaCl, 2% (w/v) CTAB, 1% PVP 40,000). These samples were stored at −20 °C until DNA extraction.
Nucleus DNA extraction
Purification of DNA from the flow-through and pull-down samples was carried out based on a previously published protocol68. CTAB-resuspended pellets were incubated at 65 °C for 30 min vortexing every 10 min. A volume approximately equal to the pellet volume of phenol:chloroform:isoamyl alcohol 25:24:1 v:v was added per tube, and tubes were vortexed thoroughly. Samples were centrifuged at 20,000g for 2 min at room temperature. The top aqueous layer was removed and mixed with an equal volume of CTAB in a new tube. Samples were vortexed and incubated at 65 °C for 20 min. A volume of chloroform:isoamyl alcohol 96:4 v:v equal to the volume of CTAB added in the last step was added to each tube. Samples were vigorously vortexed and centrifuged at 20,000g at room temperature for 2 min. The top aqueous layer was transferred to a new tube and 70% of this volume of isopropanol was gently added to the wall of each tube. Samples were mixed gently by inverting 5–10 times and placed and −20 °C for 20 min. The DNA was pelleted by centrifugation at 20,000g for 15 min. The supernatant was removed, and the pellets were washed with ice-cold 70% ethanol. Ethanol was removed and the pellets were air dried on a bench top before resuspending them in either 54 μl (pull-down) or 108 μl (flow-through) of water.
Sequencing library preparation
For each sample, 480 μl of PCR reaction was prepared containing 240 μl of Q5 2× master mix, 168 μl of water, 24 μl of nucleus DNA extraction and 24 μl of a 10-μM stock of each primer. Primers were designed to anneal to regions immediately adjacent to the tile sequences and contained Nextera-style sequencing adapters and 8-bp indexes (Supplementary Table 2, primers). Throughout this work we used the same number of PCR cycles for all samples, which was determined as follows. We ran quantitative PCR (qPCR) reactions using the pull-down and flow-through DNA samples of 2 replicates of the viral 250-tile library, using the same set of primers used for preparing sequencing libraries. The qPCR Ct value of these reactions was 22–23 cycles. We added 2 more cycles to obtain extra DNA for sequencing, resulting in 25 cycles used for library preparation PCR. The library PCR were run on a 2% agarose gel, and bands of the expected size were excised and purified using a Zymo gel purification column following the manufacturer’s instructions. Libraries were eluted in 50 μl of water and run in a fragment analyzer to confirm that the size distribution matched the expected amplicon size. Whenever necessary, libraries were size-selected using SPRI bead cleanup. Samples from the 26 control strains experiments shown in Fig. 2 were PCR amplified using the same protocol except that primers did not contain any extensions. Adapters were added to the purified PCR products using the Nanopore Native Barcoding Kit 96 V14 (cat. no. SQK-NBD114.96).
DNA sequencing
The libraries prepared from the control set of 26 mixed strains shown in Fig. 2 were sequenced using the entire flow cell of a Nanopore minION for each replicate. The rest of the samples were run on an Illumina NextSeq 2000 sequencer for 100 cycles using 30% PhiX and single-end reads covering the 5′ end of the sequencing library amplicon.
Mapping reads to tiles
The 2FAST2Q69 package version 2.6.0 was run using the following parameters: feature length of 76 bp, allowed mismatches 2, minimal Phred score 20, upstream search sequence TGGATCTGGAGGT, mismatches in upstream sequence 0, and minimal upstream sequence Phred score of 30.
Calculation of tile enrichment
For each tile, the number of mapped reads in the pull-down sample was divided by the total number of reads mapped to all the library tiles in this sample to obtain a number corresponding to the proportion of pull-down reads mapping to each specific tile. The same calculation was performed for each tile in the flow-through sample. The ratio between these two values corresponds to the tile enrichment.
Calculation of tile coverage
We define tile coverage as the percentage of all mapped reads in both pull-down and flow-through samples that map to a given tile. To obtain this percentage the number of mapped reads in the pull-down and flow-through samples are added and the reads that match each tile are divided by the total number of mapped reads.
Designing viral libraries
A total of 1,495 plant virus genomes were downloaded from NCBI, yielding 5,607 unique proteins. These proteins were clustered at 90% sequence identity using CD-HIT70, resulting in 4,537 representative proteins. Each protein was fragmented into 53-aa tiles with a 10-aa stride, generating 190,520 tiles. Tile size was kept to 53 aa to match the tile size in the dataset used to train the PADDLE AD prediction model14. This size is large enough to capture common ADs while minimizing PCR bias due to long amplicon size3,4. PADDLE was used to perform inference on each tile. To construct a diverse subset for experimental screening, the top 2,000 tiles with the highest predicted z-score were binary encoded, and a greedy selection algorithm was applied to maximize the pairwise Hamming distance between selected encodings. This approach prioritized sequences with the most dissimilar binary representations, thereby enhancing diversity in the DNA library.
Machine-based predictions of ADs
In addition to PADDLE, TADA was used to predict transactivation strength, and Metapredict71 was used to identify disorder regions in proteins. Because PADDLE and TADA use different input sizes (53 aa and 40 aa, respectively), CONSTANS and CO variants were tiled with a 1-aa stride, and tile-level predictions were aggregated into residue-level profiles by averaging the scores of all overlapping tiles at each position. Residues not covered by any tile were assigned a score of zero. PADDLE and TADA predictions were independently min–max normalized to scale scores between 0 and 1 for comparison between models.
Calibration of ENTRAP-seq from fluorescence data
To model the relationship between ENTRAP-seq and fluorescence measurements, we fit a sigmoid function to the enrichment ratios as a function of log of normalized RFP fluorescence for control tiles. The standard deviation of residuals from the fitted curve was used to estimate confidence bounds. We defined the activation threshold as the upper confidence bound at the midpoint of the sigmoidal fit.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Raw sequencing reads are publicly available through the NCBI SRA Database under BioProject accession PRJNA1254296 (ref. 72). The identity and fluorescence of internal control tiles is provided in Supplementary Dataset 1. Tile sequence information as well as the enrichment ratios for each sequencing experiment are provided in Supplementary Dataset 2. Raw and processed fluorescence data are provided in Supplementary Table 3. All plasmids are available from the JBEI registry at https://public-registry.jbei.org/. A list of all the plasmids used in this study can be found in Supplementary Table 1. The sequences of plasmids that were created for this study can be accessed from the JBEI public registry at https://public-registry.jbei.org/.
Code availability
All code and models related to this study are publicly available on GitHub (github.com/shih-lab/magnets)73.
References
-
Borowsky, A. T. & Bailey-Serres, J. Rewiring gene circuitry for plant improvement. Nat. Genet. 56, 1574–1582 (2024).
-
Gonzalez, D. H. Plant Transcription Factors: Evolutionary, Structural and Functional Aspects 1st edn (Academic, 2015).
-
Udupa, A., Kotha, S. R. & Staller, M. V. Commonly asked questions about transcriptional activation domains. Curr. Opin. Struct. Biol. 84, 102732 (2024).
-
Soto, L. F. et al. Compendium of human transcription factor effector domains. Mol. Cell 82, 514–526 (2022).
-
Ferrie, J. J., Karr, J. P., Tjian, R. & Darzacq, X. ‘Structure’-function relationships in eukaryotic transcription factors: the role of intrinsically disordered regions in gene regulation. Mol. Cell 82, 3970–3984 (2022).
-
Staller, M. V. et al. A high-throughput mutational scan of an intrinsically disordered acidic transcriptional activation domain. Cell Syst. 6, 444–455 (2018).
-
Jores, T. et al. Identification of plant enhancers and their constituent elements by STARR-seq in tobacco leaves. Plant Cell 32, 2120–2131 (2020).
-
O’Malley, R. C. et al. Cistrome and epicistrome features shape the regulatory DNA landscape. Cell 165, 1280–1292 (2016).
-
Brooks, M. D., Reed, K. M., Krouk, G., Coruzzi, G. M. & Bargmann, B. O. R. The TARGET system: rapid identification of direct targets of transcription factors by gene regulation in plant cells. Methods Mol. Biol. 2594, 1–12 (2023).
-
Hummel, N. F. C. et al. The trans-regulatory landscape of gene networks in plants. Cell Syst. 14, 501–511 (2023).
-
Kinney, J. B. & McCandlish, D. M. Massively parallel assays and quantitative sequence-function relationships. Annu. Rev. Genomics Hum. Genet. 20, 99–127 (2019).
-
Kinney, J. B., Murugan, A., Callan, C. G. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).
-
Erijman, A. et al. A high-throughput screen for transcription activation domains reveals their sequence features and permits prediction by deep learning. Mol. Cell 78, 890–902 (2020).
-
Sanborn, A. L. et al. Simple biochemical features underlie transcriptional activation domain diversity and dynamic, fuzzy binding to Mediator. eLife 10, e68068 (2021).
-
Morffy, N. et al. Identification of plant transcriptional activation domains. Nature 632, 166–173 (2024).
-
Hie, B. L. & Yang, K. K. Adaptive machine learning for protein engineering. Curr. Opin. Struct. Biol. 72, 145–152 (2022).
-
Arnold, C. D. et al. A high-throughput method to identify trans-activation domains within transcription factor sequences. EMBO J. 37, e98896 (2018).
-
Ravarani, C. N. et al. High-throughput discovery of functional disordered regions: investigation of transactivation domains. Mol. Syst. Biol. 14, e8190 (2018).
-
Hummel, N. F. C., Markel, K., Stefani, J., Staller, M. V. & Shih, P. M. Systematic identification of transcriptional activation domains from non-transcription factor proteins in plants and yeast. Cell Syst. 15, 662–672 (2024).
-
Tycko, J. et al. High-throughput discovery and characterization of human transcriptional effectors. Cell 183, 2020–2035 (2020).
-
DelRosso, N. et al. Large-scale mapping and mutagenesis of human transcriptional effector domains. Nature 616, 365–372 (2023).
-
Ludwig, C. H. et al. High-throughput discovery and characterization of viral transcriptional effectors in human cells. Cell Syst. 14, 482–500 (2023).
-
Alcantar, M. A., English, M. A., Valeri, J. A. & Collins, J. J. A high-throughput synthetic biology approach for studying combinatorial chromatin-based transcriptional regulation. Mol. Cell 84, 2382–2396 (2024).
-
Voichek, Y., Hristova, G., Mollá-Morales, A., Weigel, D. & Nordborg, M. Widespread position-dependent transcriptional regulatory sequences in plants. Nat. Genet. 56, 2238–2246 (2024).
-
Arndell, T. et al. Pooled effector library screening in protoplasts rapidly identifies novel Avr genes. Nat. Plants 10, 572–580 (2024).
-
Alamos, S. et al. Quantitative dissection of Agrobacterium T-DNA expression in single plant cells reveals density-dependent synergy and antagonism. Nat. Plants 11, 1060–1073 (2025).
-
Carlson, E. D., Rajniak, J. & Sattely, E. S. Multiplicity of the Agrobacterium infection of Nicotiana benthamiana for transient DNA delivery. ACS Synth. Biol. 12, 2329–2338 (2023).
-
Wang, D. & Deal, R. B. Epigenome profiling of specific plant cell types using a streamlined INTACT protocol and ChIP–seq. Methods Mol. Biol. 1284, 3–25 (2015).
-
Deal, R. B. & Henikoff, S. The INTACT method for cell type-specific gene expression and chromatin profiling in Arabidopsis thaliana. Nat. Protoc. 6, 56–68 (2011).
-
Reynoso, M. A. et al. Nuclear transcriptomes at high resolution using retooled INTACT. Plant Physiol. 176, 270–281 (2018).
-
Cai, Y.-M. et al. Rational design of minimal synthetic promoters for plants. Nucleic Acids Res. 48, 11845–11856 (2020).
-
Tehrani, N. & Mitra, R. M. Plant pathogens and symbionts target the plant nucleus. Curr. Opin. Microbiol. 72, 102284 (2023).
-
Boch, J., Bonas, U. & Lahaye, T. TAL effectors–pathogen strategies and plant resistance engineering. N. Phytol. 204, 823–832 (2014).
-
Hull, R. Plant Virology 5th edn (Academic, 2013).
-
Tang, S. N. et al. Causes and consequences of experimental variation in Nicotiana benthamiana transient expression. Preprint at bioRxiv https://doi.org/10.1101/2025.06.12.659391 (2025).
-
Liu, X. et al. Human virus transcriptional regulators. Cell 182, 24–37 (2020).
-
Guerrero, J. et al. Manipulation of the plant host by the geminivirus AC2/C2 protein, a central player in the infection cycle. Front. Plant Sci. 11, 591 (2020).
-
Wang, L. & Lozano-Durán, R. Manipulation of plant RNA biology by geminiviruses. J. Exp. Bot. 74, 2311–2322 (2023).
-
Hartitz, M. D., Sunter, G. & Bisaro, D. M. The tomato golden mosaic virus transactivator (TrAP) is a single-stranded DNA and zinc-binding phosphoprotein with an acidic activation domain. Virology 263, 1–14 (1999).
-
Chavez, A. et al. Highly efficient Cas9-mediated transcriptional programming. Nat. Methods 12, 326–328 (2015).
-
Zinselmeier, M. H. et al. Optimized dCas9 programmable transcriptional activators for plants. Plant Biotechnol. J. 22, 3202–3204 (2024).
-
Tycko, J. et al. Development of compact transcriptional effectors using high-throughput measurements in diverse contexts. Nat. Biotechnol. 43, 1525–1538 (2025).
-
Weiss, T. et al. Viral delivery of an RNA-guided genome editor for transgene-free germline editing in Arabidopsis. Nat. Plants 11, 967–976 (2025).
-
Markel, K., Sabety, J., Wijesinghe, S. & Shih, P. M. Design and characterization of a transcriptional repression toolkit for plants. ACS Synth. Biol. 13, 3137–3143 (2024).
-
Li, J. et al. Activation domains for controlling plant gene expression using designed transcription factors. Plant Biotechnol. J. 11, 671–680 (2013).
-
Seoighe, C. & Gehring, C. Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet. 20, 461–464 (2004).
-
Lai, X., Chahtane, H., Martin-Arevalillo, R., Zubieta, C. & Parcy, F. Contrasted evolutionary trajectories of plant transcription factors. Curr. Opin. Plant Biol. 54, 101–107 (2020).
-
Shiu, S.-H., Shih, M.-C. & Li, W.-H. Transcription factor families have much higher expansion rates in plants than in animals. Plant Physiol. 139, 18–26 (2005).
-
Paterson, A. H. & Queitsch, C. Genome organization and botanical diversity. Plant Cell 36, 1186–1204 (2024).
-
Sainz, M. B., Goff, S. A. & Chandler, V. L. Extensive mutagenesis of a transcriptional activation domain identifies single hydrophobic and acidic amino acids important for activation in vivo. Mol. Cell. Biol. 17, 115–122 (1997).
-
Atanesyan, L., Günther, V., Dichtl, B., Georgiev, O. & Schaffner, W. Polyglutamine tracts as modulators of transcriptional activation from yeast to mammals. Biol. Chem. 393, 63–70 (2012).
-
Ulmasov, T., Hagen, G. & Guilfoyle, T. J. Activation and repression of transcription by auxin-response factors. Proc. Natl Acad. Sci. USA 96, 5844–5849 (1999).
-
Cho, S. et al. Analysis of the C-terminal region of Arabidopsis thaliana APETALA1 as a transcription activation domain. Plant Mol. Biol. 40, 419–429 (1999).
-
Ernst, H. A., Olsen, A. N., Larsen, S. & Lo Leggio, L. Structure of the conserved domain of ANAC, a member of the NAC family of transcription factors. EMBO Rep. 5, 297–303 (2004).
-
Olsen, A. N., Ernst, H. A., Leggio, L. L. & Skriver, K. NAC transcription factors: structurally distinct, functionally diverse. Trends Plant Sci. 10, 79–87 (2005).
-
Brent, R. & Ptashne, M. A eukaryotic transcriptional activator bearing the DNA specificity of a prokaryotic repressor. Cell 43, 729–736 (1985).
-
Rosas, U. et al. Variation in Arabidopsis flowering time associated with cis-regulatory variation in CONSTANS. Nat. Commun. 5, 3651 (2014).
-
Takahashi, Y., Teshima, K. M., Yokoi, S., Innan, H. & Shimamoto, K. Variations in Hd1 proteins, Hd3a promoters, and Ehd1 expression levels contribute to diversity of flowering time in cultivated rice. Proc. Natl Acad. Sci. USA 106, 4555–4560 (2009).
-
Tan, J. et al. OsCOL10, a CONSTANS-like gene, functions as a flowering time repressor downstream of Ghd7 in rice. Plant Cell Physiol. 57, 798–812 (2016).
-
Steinbach, Y. The Arabidopsis thaliana CONSTANS-LIKE 4 (COL4)—a modulator of flowering time. Front. Plant Sci. 10, 651 (2019).
-
Nemoto, Y., Nonoue, Y., Yano, M. & Izawa, T. Hd1,a CONSTANS ortholog in rice, functions as an Ehd1 repressor through interaction with monocot-specific CCT-domain protein Ghd7. Plant J. 86, 221–233 (2016).
-
Tiwari, S. B. et al. The flowering time regulator CONSTANS is recruited to the FLOWERING LOCUS T promoter via a unique cis-element. N. Phytol. 187, 57–66 (2010).
-
Robson, F. et al. Functional importance of conserved domains in the flowering-time gene CONSTANS demonstrated by analysis of mutant alleles and transgenic plants. Plant J. 28, 619–631 (2001).
-
Staller, M. V. et al. Directed mutational scanning reveals a balance between acidic and hydrophobic residues in strong human activation domains. Cell Syst. 13, 334–345.e5 (2022).
-
Sigler, P. B. Transcriptional activation. Acid blobs and negative noodles. Nature 333, 210–212 (1988).
-
Ho, W. W. H. & Weigel, D. Structural features determining flower-promoting activity of Arabidopsis FLOWERING LOCUS T. Plant Cell 26, 552–564 (2014).
-
Rodríguez-Leal, D., Lemmon, Z. H., Man, J., Bartlett, M. E. & Lippman, Z. B. Engineering quantitative trait variation for crop improvement by genome editing. Cell 171, 470–480 (2017).
-
Weigel, D. & Glazebrook, J. Arabidopsis: A Laboratory Manual 1st edn (Cold Spring Harbor Laboratory, 2002).
-
Bravo, A. M., Typas, A. & Veening, J.-W. 2FAST2Q: a general-purpose sequence search and counting program for FASTQ files. PeerJ 10, e14041 (2022).
-
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
-
Emenecker, R. J., Griffith, D. & Holehouse, A. S. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys. J. 120, 4312–4319 (2021).
-
Alamos, S. et al. Multiplexed profiling of transcriptional regulators in plant cells. Raw sequencing reads. NCBI SRA https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1254296/ (2025).
-
Alamos, S. et al. Multiplexed profiling of transcriptional regulators in plant cells. Source code. GitHub https://github.com/shih-lab/magnets (2025).
Acknowledgements
We thank all members of the Shih laboratory. From weekly growing healthy plants to making Agrobacterium competent cells to discussing results, their support enabled this research. We are thankful to N. Hummel for introducing us to agroinfiltration and the study of TFs in plants and for sharing countless plasmids. A. Lanctot and M. Kesselhaut provided insightful comments during the preparation of this article. A. Sarkiss helped with infiltrations. We also thank several colleagues with whom we discussed preliminary data and ideas during the development of this project. We had productive and inspiring discussions with M. Staller, R. Deal, J. M. Alvarez, P. Lang, H. Garcia, C. Queitsch and G. Coupland. A few scientists provided critical technical feedback as well. Y. Tang and J. Bailey-Serres shared constructs for nuclear envelope targeting, A. Deutschbauer gave us advice for library cloning and sequencing, and R. Kuo and N. Prywes guided us through the early stages of next-generation sequencing. Finally, we thank S. McDevitt from QB3 Genomics, UC Berkeley, Berkeley, CA, USA RRID:SCR_022170 and B. McCarthy from the UC Berkeley DNA sequencing facility for their advice and assistance. This work was part of the Department of Energy Joint BioEnergy Institute (https://www.jbei.org) supported by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, through contract no. DE-AC02-05CH11231 between Lawrence Berkeley National Laboratory and the US Department of Energy. L.W. was supported by the National Science Foundation Graduate Research Fellowship.
Ethics declarations
Competing interests
A patent for characterizing gene libraries in plant cells was filed by the Lawrence Berkeley National Laboratory with S.A. and P.M.S. as inventors. P.M.S. has financial interests in BasidioBio. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Additional ONEtag optimization details.
(a) Fluorescence microscopy images showing the subcellular localization of ONEtag variants tagged with GFP and coexpressed with an RFP nuclear marker. From top to bottom the corresponding plasmid names are pSA266, pSA267, pSA268, pSA270, pSA272. (b) Cells expressing the SlWIP1-C and SlWIP2-C ONEtag fused to RFP by a self-cleaving T2A peptide (plasmids pSA271 and pSA91, respectively). In (a) and (b) at least 3 cells were imaged for each construct to confirm consistent localization, scale bar = 20 um. (c) Top: schematic of the construct. BLRP was removed from the construct shown in Fig. 1c and the responsiveness of this reporter to VP16 was tested in a fluorescence plate reader. (d) Top: schematic of the construct. SlWIP1-C was removed from the construct shown in Fig. 1c and the responsiveness of this reporter to VP16 was tested in a fluorescence plate reader. In (c) and (d) each circle corresponds to one leaf punch. (e) Left: schematic of the experiment. The RFP-T2A tag was removed from both the SlWIP1 and SlWIP2 constructs shown in Fig. 1c and their responsiveness to VP16 was tested in a fluorescence plate reader (plasmids SA752 and pSA753, respectively). Right: leaf GFP fluorescence with or without VP16 driving reporter expression. Box plots show the median ± IQR, whiskers correspond to minimum and maximum values, excluding outliers. N = 24 biological replicates (3 plants, 2 leaves per plant, 4 leaf punches per leaf).
Extended Data Fig. 2 Validation details of ENTRAP-seq for measuring reporter expression.
(a) Total number of nuclei obtained in each purification fraction in the experiment from Fig. 2c (rep.1) and (c) (rep.2). (b) Order-of-magnitude estimate of the number of cells transformed by a strain carrying the ONEtag based on the numbers obtained from Alamos et al. 26. (c) Biological replicate of the experiment shown in Fig. 2c. Shown are the distributions of ONEtag expression levels in nuclei from different purification fractions. The histograms are normalized to have the same area. Nuclei with low GFP levels are enriched in the flow-through and there is a progressive enrichment in the pulldown with increasing GFP expression. P values were obtained using a one-sided Mann–Whitney U test. (d) Extended data for the experiment from Fig. 2e, f. X-axis: transcriptional activity of each tile based on ENTRAP-seq (mean ± s.d. enrichment of each tile T-DNA in the nuclei pulldown fraction compared to the flow-through sample). Y-axis: mean ± s.d. GFP expression measured for each tile individually using a fluorescence plate reader. (e) Extended data for the experiment from Fig. 2e, f. Mean ± s.d. GFP (x-axis) and RFP (y-axis) fluorescence for each control tile. n for nuclei pulldown and sequencing = 4 biological replicates. n for plate reader assays = 32 biological replicates (4 plants with 2 leaves per plant and 4 leaf punches per leaf). ⍴ = Spearman correlation coefficient, r = Pearson correlation coefficient.
Extended Data Fig. 3 Overview of plant virus genomes and tile selection.
(a) Distribution of viral genome lengths. (b) Number of proteins in each viral genome. (c) Distribution of protein lengths. (d) Sequences were clustered at 90% identity (dashed line) to maximize protein diversity, and representative proteins were tiled for activator prediction using PADDLE. (e) The top 2,000 tiles with the highest PADDLE predicted z-score, corresponding to a threshold of 4.9 (dashed line), were selected for experimental characterization.
Extended Data Fig. 4 Estimating the resolution and error of ENTRAP-seq using empirical noise distributions.
(a) Distribution of Δ values in the 2000 tile viral library where Δ is defined as the difference in enrichment of a given tile between two biological replicates. The dashed red line indicates 2 empirical standard deviations in Δ values. (b) Resolution in one replicate: The distribution in (a) is used as a null distribution to estimate the measurement error, defined as 2σ. This 2σ value corresponds to the 95% confidence interval for the Δ between tiles that are not different in strength. Hence, the minimal detectable Δ with p ≤ 0.05 is defined as Δ larger than 2σ. The plot shows this detectable Δ as a function of the number of tiles passing a library coverage cutoff in the 2000 tile library (as in Fig. 3c) (see Supplementary Text for calculation details). (c) Distribution of residuals in the 2000 tile viral library, defined as the orthogonal distance of the enrichment value of a given tile with respect to the the rep1 = rep2 (y = x) line (see Supplementary Text for details). The dashed red line indicates 2 empirical standard deviations in residual values. (d) Resolution using two replicates: The distribution in (c) is used as a null distribution to estimate the error of comparing the mean enrichment between two tiles, defined as 2σ. This 2σ value corresponds to the 95% confidence interval for the difference between the mean enrichment of two tiles that are not different in strength. Hence, the minimal detectable enrichment difference in two replicates with p ≤ 0.05 is defined as difference larger than 2σ residuals. The plot shows this detectable difference as a function of the number of tiles passing a library coverage cutoff (as in Fig. 3c) (see Supplementary Text for calculation details). (e) Resolution is independent of enrichment. For each tile, the residuals were calculated as in (c) and plotted against the mean enrichment of that tile. (f) Assessment of data heteroscedasticity. Tiles in (e) were binned according to their enrichment and, for each bin, the standard deviation in residual values was calculated (y-axis). The s.d. of residuals hovers around 0.2 for all bins, showing that residuals are similarly variable across the enrichment dynamic range.
Extended Data Fig. 5 Fraction of selected tiles identified as activators across viral families.
(a) For each virus family, the number of activator tiles is normalized by the total number of tiles selected for the corresponding family. (b) Same as (a) but showing absolute counts. (c) same as (b) showing the absolute counts in a log scale. Note that the viral families shown here do not exactly match the families in the tree of Fig. 4a. Some families such as Totiviridae had too few tested tiles to be displayed in the tree. Some families shown in the tree (for example, Amalgaviridae) did not have any tiles that passed our criteria for testing.
Extended Data Fig. 6 Functional annotation of proteins in the viral libraries.
(a) GO term analysis across the whole library reveals enrichment of specific protein domains across library sequences, highlighting overrepresentation of functional classes among the tested viral protein sequences. (b) Clusters of Orthologous Groups (COG) categories in the whole library. (c) Protein descriptor categories among proteins for which we found an AD.
Extended Data Fig. 7 Testing potential strong activator and repressor tiles from plant viruses.
(a) Data overview. 40 tiles with a high degree of enrichment (20 putative activators) or de-enrichment (20 putative repressors) in at least one library were selected for individual testing in a fluorescence reporter assay. Shown is the mean reporter fluorescence of each tile infiltrated individually (y-axis) as a function of tile enrichment for each replicate and library where this tile was present (x-axis). (b) Reproducibility of the data in (a). Scatter plot shows the enrichment of each candidate tile in two biological replicates. (c) Reporter fluorescence for each candidate activator tile expressed as fold change over the DBD alone. Gal4 DBD fusions to VP16 and VP4 were included for comparison. (d) Reporter fluorescence for each candidate repressor tile expressed as fold change over DBD alone. Gal4 DBD fusions to synthetic EAR motifs were included for comparison. In (c) and (d), boxplots show the median and the interquartile range, whiskers show minimum and maximum values, excluding outliers. *** p < 0.001, ** p < 0.01, ns not significant of two-sided Welch’s t-tests against DBD with Bonferroni multiple comparison correction, n = 24 biological replicates (3 plants, 2 leaves per plant, 4 leaf punches per plant).
Extended Data Fig. 8 Biochemical properties of the CO acidic AD.
The transcriptional activity of each tile in the CO AD library (y axis) is plotted as a function of different biochemical and sequence features. The WT tile is shown as a white circle. (a) Tile charge calculated as isoelectric point at pH 7.0. (b) number of hydrophobic residues. (c) Number of selected hydrophobic residues, phenylalanine (F) and tryptophan (W) and ⍴ = Spearman correlation coefficient.
Extended Data Fig. 9 Replicate of CO alleles regulating the FT promoter driving GFP.
The experiment shown in Fig. 6h was replicated on a different week. The FT promoter activity driven by a given CO allele on the first replicate (y-axis) is plotted against the results from the same allele from the second replicate (x-axis). n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per plant). GoF, gain of function, LoF, loss of function. Error bars show the to the mean ⨦ s.d.
Extended Data Fig. 10 ENTRAP-seq correlates well with fluorescence plate reader measurements.
Scatter plot showing all presented cross-validation experiments combined. Tile transcriptional activity based on individual infiltrations followed by fluorescence plate reader assays (y-axis) is compared against tile activity based on ENTRAP-seq experiments (enrichment, x-axis). Legend indicates where the data was originally presented in the text. Refer to the original figure for information on biological replicates.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Alamos, S., Waldburger, L., Dee, A. et al. Multiplexed profiling of transcriptional regulators in plant cells. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02880-w
-
Received:
-
Accepted:
-
Published:
-
Version of record:
-
DOI: https://doi.org/10.1038/s41587-025-02880-w






