Introduction
The Enzyme Commission (EC) number classification system is a pivotal tool for categorizing enzymes based on the reactions they catalyze1,2,3. An EC number consists of four hierarchical numbers (e.g., 1.1.1.1), which represent a class, a subclass, a sub-subclass, and a serial number, respectively4. For instance, the malate dehydrogenase enzymatic function is designated EC 1.1.1.37: the first level (EC 1) denotes oxidoreductases; the second level (EC 1.1) indicates enzymes acting on the CH-OH group of donors; the third level (EC 1.1.1) specifies NAD+ or NADP+ as the acceptor; and the last level (EC 1.1.1.37) identifies malate dehydrogenase itself5,6. This hierarchical classification characterizes enzyme functions and their roles in diverse biochemical pathways and facilitates the annotation of enzymes in biological databases. The significance of the EC number system extends beyond categorization, underpinning advances in drug discovery, metabolic engineering, and ecological sustainability7,8,9.
Accurate classification of EC numbers is essential for understanding enzyme functions and their roles in metabolic pathways. Computational approaches for EC number prediction fall into three broad groups: (1) protein structure-based, (2) sequence similarity-based, and (3) machine learning-based methods. Protein structure-based models (e.g., COFACTOR10, the i-TASSER suite11, AutoDock12, and, more recently, deep learning-enhanced tools such as GraphEC13, DeepFRI14, and LM-GVP15) examine structural templates and perform homology analysis through protein threading or graph-based representations of 3D structures. While effective for identifying folds and functional sites, these models are constrained by the limited availability of high-quality structural data. Sequence similarity-based approaches (e.g., EFICAz16, ModEnzA17, PRIAM18, and EnzML19) utilize conserved patterns and motifs identified by multiple sequence alignment (MSA) to infer enzyme functions. Despite their substantial utility in identifying functional domains, these approaches (1) are computationally expensive, (2) are ineffective when closely related reference sequences are absent, and (3) often require additional post-hoc analyses.
A promising solution is machine learning, which effectively captures complex relationships within enzyme sequences. Current state-of-the-art models include ECPred20, DEEPre21, DeepEC22, ECPICK23, DeepECtransformer24, and CLEAN25, which show enhanced predictive power. However, substantial room for improvement remains, as these models often struggle with limited and imbalanced training data, particularly for enzymes with underrepresented functions for which few references are available26. Our preliminary data reveal that state-of-the-art methods achieve an F1-score of only around 70% for underrepresented enzymes (e.g., N ≤ 25), which account for 41% of EC numbers in the dataset. This low performance highlights the challenge of achieving accurate predictions for underrepresented EC numbers.
Model interpretability also remains a critical challenge, particularly in establishing trustworthy predictions. Beyond achieving high predictive accuracy, it is essential to understand the rationale behind a model’s predictions to ensure biological relevance, especially when applied to high-stakes decisions. In EC number prediction, an interpretation scheme validates a model’s predictions by comparing highlighted amino acids with established biological knowledge, such as conserved motifs or functional sites, to ensure that the final prediction relies on biologically known and relevant components. In ECPICK, model interpretation provided insights into how specific motifs or regions influence predictions, yielding domain-specific evidence23. In addition to increasing trustworthiness in the predictions, it also has the potential to uncover new motif sites or functional regions, which could drive further experimental investigations and expand our understanding of enzyme functions.
Incomplete annotations in biological databases present a significant opportunity to improve the predictive performance of deep learning models. For instance, as of September 2022, 15.4% of protein sequences in the Swiss-Prot database were incompletely annotated, often missing the last levels of the EC classification (e.g., 1.1.-.-). These incomplete annotations arise from several factors, including the labor-intensive and time-consuming nature of experimental characterization, the vast number of newly discovered enzymes that have not yet been studied, and the challenges of reliably assigning EC numbers to enzymes with ambiguous functions26. These limitations point to the potential of enhancing model performance by developing methods that can effectively handle incomplete annotations. To the best of our knowledge, all current state-of-the-art models rely solely on completely annotated data when training for EC number prediction, which deprives these models of valuable data.
In this study, we develop a Hierarchical Interpretable Transformer model (HIT-EC) that advances EC number classification by leveraging the hierarchical structure of EC numbers and incompletely labeled sequences (Fig. 1). HIT-EC employs a four-level transformer architecture that aligns with the EC number hierarchy, which provides context-aware predictions at each level. The major contributions of the HIT-EC model are as follows: (1) statistically significant improvement of the predictive performance, (2) a training strategy to handle incomplete annotations, and (3) an evidential approach to provide trustworthy predictions (Fig. 1). The performance of HIT-EC was assessed by extensive experiments across multiple evaluation settings with a large dataset (over 200,000 sequences), including (1) a repeated stratified hold-out benchmark, (2) external validation using newly registered enzymes, and (3) the evaluation of the predictive performance on complete genomes of various species. In the experiments, HIT-EC consistently outperformed the state-of-the-art methods across all evaluation settings. HIT-EC improved micro- and macro-averaged F1-scores by at least 6% and 4%, respectively, in the cross-validation experiment, and also showed a 6% improvement in F1-score for underrepresented enzyme classes with fewer than 25 sequences. Furthermore, we demonstrated the trustworthiness of HIT-EC’s predictions by comparing the regions highlighted by our method to established biological knowledge within the CYP106A2 enzyme family, which provided evidence of the similarities between our interpretation and well-known functional and structural characteristics.
A Collection of protein sequences from UniProt, the Protein Data Bank, and the KEGG database. B The innovative hierarchical interpretable transformer, named HIT-EC, which combines local and global flows. C HIT-EC trains on incomplete EC numbers using the masked loss mechanism. D Assessment of the predictive performance against three state-of-the-art benchmark models across various experimental settings. E HIT-EC provides trustworthy predictions, validated by aligning the model's interpretation with established biological knowledge.
Results
In this section, we present the performance evaluation of HIT-EC through intensive experiments in various settings: (1) the micro- and macro-averaged F1-scores on the repeated stratified hold-out dataset, (2) the performance of HIT-EC on underrepresented classes within this dataset, (3) an ablation study analyzing the contribution of HIT-EC’s components, (4) the predictive performance on newly registered enzymes, and (5) the performance on various microbial species.
Repeated stratified hold-out experiment for performance comparison
We evaluated the performance of HIT-EC by comparing it to state-of-the-art models in a cross-validation setting. We considered manually curated protein sequences from Swiss-Prot27 and the Protein Data Bank (PDB)28, released prior to September 2022. Sequences were filtered to retain only non-redundant entries with a maximum length of 1,023 amino acids. We excluded EC numbers with fewer than ten distinct sequences to maintain sufficient samples for the stratified splits and avoid unreliable estimates for very data-sparse classes, consistent with prior EC prediction studies22,23. We also excluded EC numbers containing an n placeholder (e.g., EC 3.5.1.n3), which denotes a preliminary, non-final serial assignment. The data preprocessing resulted in a stratified hold-out dataset of approximately 200,000 sequences categorized into 1,938 EC numbers. The dataset was split into training (80%), validation (10%), and test (10%) sets by stratified sampling, preserving the class ratios; a minimal sketch of this split is shown below. This repeated stratified hold-out procedure approximates Monte Carlo cross-validation29. HIT-EC integrates a global flow with level-specific local flows and combines their outputs for hierarchical EC classification, while supporting training with partial ECs (e.g., 3.5.-.-) via a masked loss that ignores unannotated levels (see Methods). HIT-EC explains its predictions by assigning relevance scores across the input protein sequence, providing domain-specific evidence for each decision to support trustworthy prediction.
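For illustration, the two-stage stratified partition can be sketched as follows with scikit-learn. This is a minimal example, assuming one EC label per sequence for the stratification key; variable names are illustrative and not taken from the released HIT-EC code.

```python
# Minimal sketch of the 80/10/10 stratified split (illustrative names;
# assumes a single EC label per sequence for the stratification key).
from sklearn.model_selection import train_test_split

def stratified_holdout(sequences, labels, seed=0):
    # 80% training set, stratified by EC number to preserve class ratios.
    seq_train, seq_rest, y_train, y_rest = train_test_split(
        sequences, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split the remaining 20% evenly into validation and test (10% each).
    seq_val, seq_test, y_val, y_test = train_test_split(
        seq_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (seq_train, y_train), (seq_val, y_val), (seq_test, y_test)
```

Excluding EC numbers with fewer than ten sequences, as described above, ensures each class has enough members for every stratum of the split.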
We considered three state-of-the-art benchmark methods, CLEAN25, ECPICK23, and DeepECtransformer (DeepECT)24, for the comparison. We used Optuna30 for hyper-parameter optimization, tuning the hyper-parameters to minimize the loss on the validation set. For the final predictions, we selected class-specific thresholds that maximize the F1-score on the validation set for each EC number. The optimal hyper-parameter values of HIT-EC were an attention head count of 2, an embedding dimension (d) of 1024, a dropout rate of 0.1, a β of 0.59, and a learning rate of 8.75e−5. For the benchmark models, we employed the optimal hyperparameters reported in their original publications, given that these parameters were optimized on closely related datasets, specifically prior releases of Swiss-Prot or PDB. For training HIT-EC, the stochastic weight averaging method was applied during the last 15 epochs to improve the model's generalization performance and stability31. All benchmark models were trained on the same dataset, ensuring a consistent and fair comparison across all evaluation tasks. To ensure reproducibility, this experiment was conducted ten times. Throughout this study, we used HIT-EC and all benchmark models trained in this repeated stratified hold-out experiment for a fair comparison in the sections “External validation using newly registered enzymes”, “Species-specific performance comparison using KEGG”, and “Computational efficiency analysis”, and we refer to these models as cross-validation models.
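The class-specific threshold selection described above can be sketched as a simple grid search per EC number; the snippet below is illustrative (the grid resolution and variable names are assumptions, not the actual pipeline).

```python
# Sketch of class-specific decision thresholds: for each EC number, pick
# the cutoff that maximizes the F1-score on the validation set.
import numpy as np
from sklearn.metrics import f1_score

def select_thresholds(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    """y_true, y_prob: arrays of shape (n_sequences, n_ec_classes)."""
    thresholds = np.full(y_true.shape[1], 0.5)
    for c in range(y_true.shape[1]):
        scores = [f1_score(y_true[:, c], y_prob[:, c] >= t, zero_division=0)
                  for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    # At inference: predict class c iff y_prob[:, c] >= thresholds[c].
    return thresholds
```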
HIT-EC achieved the highest micro-averaged F1-score of 0.93 ± 0.01, followed by CLEAN (0.88 ± 0.01), ECPICK (0.82 ± 0.06), and DeepECT (0.79 ± 0.05) (Fig. 2A). HIT-EC demonstrated an 18% improvement over DeepECT, a 15% improvement over ECPICK, and a 6% improvement over CLEAN. Statistical significance was assessed by Wilcoxon signed-rank tests (p < 0.01 for all comparisons). Note that CLEAN always predicts at least one EC number for any given protein sequence. While this strategy results in relatively high F1-scores, it also produces a substantial number of false positives, yielding a False Discovery Rate (FDR) of 1 when applied to non-enzyme sequences. In contrast, the other models employ thresholding for the final predictions, assigning EC numbers only to high-confidence scores. For sequences with confidence scores below the threshold, all EC classes were treated as negative predictions during evaluation. The FDRs computed using non-enzyme protein sequences with taxonomy information in Swiss-Prot were 0.15, 0.17, and 0.16 for DeepECT, ECPICK, and HIT-EC, respectively. However, the FDR of CLEAN was 1.
A Micro-averaged F1-scores across ten independent repeated stratified hold-out experiments (n = 10 independent computational replicates). B Macro-averaged F1-scores across the same ten experiments (n = 10). C Micro-averaged F1-scores computed only on underrepresented EC classes (N ≤ 25; n = 10). Across all panels, performance distributions are shown using boxplots overlaid with individual data points. Boxplots indicate the median (center line), 25th and 75th percentiles (bounds of the box), whiskers extending to the minimum and maximum values, and all individual data points. Each replicate corresponds to a fully independent repeated stratified hold-out training and evaluation split and all models were evaluated on identical splits. Statistical comparisons between HIT-EC and the benchmark models (DeepECT, ECPICK, CLEAN) were performed using a one-sided Wilcoxon signed-rank test (paired, non-parametric; n = 10), reflecting the directional hypothesis that HIT-EC improves performance. Statistically significant differences are indicated as ** (p < 0.01). Source data, including all p-values and effect sizes for comparisons between HIT-EC and the benchmark models, are provided in the Source Data file.
HIT-EC also showed the highest macro-averaged F1-score of 0.84 ± 0.02, surpassing CLEAN (0.80 ± 0.02), ECPICK (0.75 ± 0.05), and DeepECT (0.58 ± 0.07) (p < 0.01 for all comparisons) (Fig. 2B). The higher macro-averaged F1-score indicates HIT-EC’s robust performance across all EC numbers, while micro-averaged F1-scores can be biased towards the majority classes.
We further evaluated the performance considering only underrepresented EC classes, which contain fewer than 25 sequences. The repeated stratified hold-out dataset included 799 underrepresented EC numbers. HIT-EC also demonstrated superior performance in this setting, achieving the highest micro-averaged F1-score of 0.77 ± 0.02, followed by CLEAN (0.73 ± 0.04), ECPICK (0.67 ± 0.05), and DeepECT (0.47 ± 0.04) (Fig. 2C). HIT-EC outperformed DeepECT by 64%, ECPICK by 15%, and CLEAN by 5% (p < 0.01 for all comparisons).
We examined F1-scores specific to each EC level to assess model performance across the hierarchy, which highlights the improved predictive performance resulting from the hierarchical design and the training strategy for incomplete EC numbers (Table 1). HIT-EC achieved the highest micro-averaged F1-score at levels 2 and 3, reaching 0.949 ± 0.011 and 0.942 ± 0.012, respectively. These scores reflect relative gains over DeepECT of 23% and 22%, over ECPICK of 6% and 9%, and over CLEAN of 1% and 7%, respectively. At level 1, HIT-EC ranked second (0.953 ± 0.010), with a marginal difference from CLEAN (0.968 ± 0.007), while outperforming DeepECT and ECPICK by 22% and 2%, respectively. In the macro-averaged F1 setting, HIT-EC achieved the best performance at all levels, with improvements over DeepECT ranging from 20% (level 1) to 53% (level 3), over ECPICK from 3% (level 1) to 13% (level 3), and over CLEAN from 0% (level 1) to 10% (level 3). These findings underscore HIT-EC’s consistent and robust performance across all EC levels.
Ablation study
We assessed the individual contributions of HIT-EC components by comparing HIT-EC with three controls: (1) a baseline transformer without hierarchical mechanisms, (2) a baseline transformer trained with the masked loss for incomplete EC numbers, and (3) HIT-EC trained on only complete EC numbers. The baseline transformer corresponds to the standard encoder-only transformer architecture32 without hierarchical modeling or the training strategy for incomplete EC numbers. For all the models, we used hyperparameters selected by Optuna in the repeated stratified hold-out setup30 and the same dataset described in the section “Repeated stratified hold-out experiment for performance comparison”.
The results highlight the benefits of both the hierarchical architecture and the integration of incomplete annotations during training, which collectively enhance generalization on underrepresented classes and macro-level predictive performance. For the micro-averaged F1-score (Fig. 3A), the combination of hierarchical features and incomplete EC training reached 0.93 ± 0.01, similar to the performance of the baseline transformers and 2% higher than HIT-EC without incomplete annotations (0.91 ± 0.02). For the macro-averaged F1-score (Fig. 3B), the model with all components achieved 0.83 ± 0.02, showing a 4% improvement over the baseline transformer (0.80 ± 0.03), a 3% increase over the baseline transformer with incomplete annotation training (0.81 ± 0.02), and a 1% gain over HIT-EC without incomplete annotations (0.82 ± 0.02). For underrepresented EC classes (Fig. 3C), HIT-EC obtained a micro-averaged F1-score of 0.77 ± 0.03, representing a 7% improvement over the baseline transformer (0.72 ± 0.03), an 8% increase over the baseline transformer with incomplete annotation training (0.71 ± 0.03), and a 3% improvement over HIT-EC without incomplete annotations (0.75 ± 0.01). All comparisons were statistically significant based on Wilcoxon signed-rank tests (p < 0.01), with the exception of the comparison between HIT-EC and the baseline transformers in the micro-averaged F1-score setting, where no statistically significant difference was observed.
A Micro-averaged F1-scores across ten independent repeated stratified hold-out experiments (n = 10 independent computational replicates). B Macro-averaged F1-scores across the same ten experiments (n = 10). C Micro-averaged F1-scores computed only on underrepresented EC classes (N ≤ 25; n = 10). Across all panels, performance distributions are shown using boxplots overlaid with individual data points. Boxplots indicate the median (center line), 25th and 75th percentiles (bounds of the box), whiskers extending to the minimum and maximum values, and all individual data points. Each replicate corresponds to a fully independent repeated stratified hold-out training and evaluation split and all models were evaluated on identical splits. Statistical comparisons between HIT-EC and the ablated models (baseline transformer, transformer with incomplete annotations, and HIT-EC without incomplete annotations) were performed using a one-sided Wilcoxon signed-rank test (paired, non-parametric; n = 10), reflecting the directional hypothesis that HIT-EC improves performance. Source data, including all p-values and effect sizes for comparisons between HIT-EC and the benchmark models, are provided in the Source Data file.
We observe that the incomplete-label strategy benefits HIT-EC significantly more than the baseline. This may be because HIT-EC leverages incomplete EC labels (e.g., 3.5.-.-) to learn the first levels of the EC hierarchy within the local flows, whereas the baseline learns from the same labels in a single output space, which provides limited guidance for the specific class.
External validation using newly registered enzymes
We conducted experiments using newly registered enzyme sequences in Swiss-Prot for external validation. We collected a new dataset consisting of all Swiss-Prot sequences released after September 2022, referred to as the New-28245 dataset. The average sequence identity between each new sequence and its most similar training sample was 47.19%. We note that this similarity, although indicative of relatedness, remains below the threshold at which function transfer becomes trivial33,34.
We conducted three complementary evaluations on New-28245 (Fig. 4): (1) the performance of the cross-validation models (Fig. 4A), (2) the performance of the same models on the underrepresented EC classes (N ≤ 25) (Fig. 4B), and (3) the performance of publicly available pre-trained models (Fig. 4C). For the first two evaluations, we applied the cross-validation models to the 2240 sequences in New-28245 associated with at least one EC number present in the repeated stratified hold-out dataset. Across these sequences, the average sequence identity to the most similar training sample was 65.90%. HIT-EC achieved the highest micro-averaged F1-score among the cross-validation models, reaching 0.68 ± 0.07 on all sequences and 0.42 ± 0.02 on underrepresented ECs. CLEAN followed with 0.65 ± 0.10 and 0.38 ± 0.09, respectively. ECPICK reached 0.46 ± 0.10 and 0.21 ± 0.09, while DeepECT performed the lowest with 0.33 ± 0.13 and 0.09 ± 0.04. The improvements of HIT-EC over ECPICK, DeepECT, and CLEAN were statistically significant (p < 0.05, Wilcoxon signed-rank test).
A Micro-averaged F1-scores from cross-validation models evaluated on all protein sequences in the New-28245 dataset (N = 2240; n = 10 independent computational replicates). B Micro-averaged F1-scores from cross-validation models evaluated only on sequences belonging to underrepresented EC classes (N = 260; n = 10). C Micro-averaged F1-scores of publicly available pre-trained models evaluated on all sequences in the New-28245 dataset (N = 28,245, n = 1). Panels (A) and (B) display performance distributions using boxplots overlaid with individual data points. Boxplots indicate the median (center line), the 25th and 75th percentiles (bounds of the box), whiskers extending to the minimum and maximum values, and all individual data points. Each replicate corresponds to a fully independent repeated stratified hold-out training and evaluation split and all models were evaluated on identical splits. Statistical comparisons between HIT-EC and the benchmark models were performed using a one-sided Wilcoxon signed-rank test (paired, non-parametric; n = 10), reflecting the directional hypothesis that HIT-EC improves performance. Source data, including all p-values and effect sizes for comparisons between HIT-EC and the benchmark models, are provided in the Source Data file.
Additionally, we performed the evaluations using publicly available pre-trained models, downloading the latest version of each. For HIT-EC, we re-trained the model using all available data released before September 2022, including Swiss-Prot, PDB, and manually curated entries from the KEGG database. We incorporated TrEMBL enzymes to broaden HIT-EC's EC number coverage. For EC numbers represented by fewer than 80 sequences, we added sequences from TrEMBL, re-labeling their EC numbers by aligning them to sequences in Swiss-Prot and PDB using DIAMOND35 (minimum percent identity of 50% and coverage of 75%)22,23. This data augmentation produced around 450,000 training sequences, and the resulting HIT-EC model covered 4255 EC numbers. HIT-EC again showed the highest micro-averaged F1-score of 0.932, followed by CLEAN (0.889), DeepECT (0.877), and ECPICK (0.806) (Fig. 4C).
The cross-validation models were trained without TrEMBL in the repeated stratified hold-out experiment to prioritize annotation quality and a fair protocol. We trained every method on the same manually curated Swiss-Prot and PDB dataset. In prior related studies, methods were often compared across different datasets, and baseline models were not retrained on the same data. We re-trained all baselines on identical splits to ensure fairness. TrEMBL was then added in the final training to broaden label coverage and improve practical utility.
Species-specific performance comparison using KEGG
We assessed the models’ performances on complete genome datasets derived from KEGG36 to further evaluate the robustness of these models, using the benchmark models trained in the cross-validation experiments. In this experiment, we considered microbial genomes that exhibit great diversity in sequence composition and evolutionary trajectories. Unlike human proteins that have been extensively studied and annotated, microbial proteomes often include novel or poorly characterized sequences. Specifically, we evaluated the models using fourteen microbial species, including Shigella flexneri (strain 301), Acinetobacter pittii (strain PHEA-2), Campylobacter jejuni (strain NCTC 11168), Staphylococcus aureus (strain NCTC 8325), Pseudomonas aeruginosa (strain PAO1), Bacillus subtilis (strain 168), Escherichia coli (strains O157:H7 and K-12), Mycobacterium tuberculosis (strain H37Rv), Klebsiella pneumoniae (strain HS11286), Caulobacter vibrioides (strain NA1000), Chlamydia trachomatis (strain D/UW-3/CX), Listeria monocytogenes (strain EGD-e), and Coxiella burnetii (strain RSA 493). These fourteen microbial strains are widely recognized as model microorganisms whose genomes have been extensively studied for decades. Their enzyme annotations are supported not only by KEGG but also by multiple independent resources, such as UniProt37 and EcoCyc38, which are continually updated through manual curation. We minimized the risk of automated annotations in KEGG by limiting the evaluation to such well-annotated organisms, thereby enhancing the validity of our performance metrics.
In this evaluation, we only considered protein sequences associated with EC numbers that the cross-validation models were capable of predicting. HIT-EC exhibited the highest F1-scores in eight of the fourteen species and showed the best performance on average. HIT-EC achieved a micro-averaged F1-score of 0.81 ± 0.06 on average, which was significantly higher than ECPICK (0.78 ± 0.04), CLEAN (0.77 ± 0.04), and DeepECT (0.67 ± 0.08) (Table 2). The overall superior performance of HIT-EC on this comprehensive genome dataset indicates its robustness and capability in large-scale enzyme function prediction. This computational result also implies that HIT-EC effectively captures patterns in diverse protein families and is robust to variations in sequence composition. This experimental setting reflects realistic genomic contexts, where complete proteomes are to be annotated. Given the broad diversity of microorganisms and the complexity of their protein sequences, this evaluation provides insight into the models’ ability to handle genome-wide predictions effectively.
Computational efficiency analysis
We rigorously evaluated the computational efficiency of HIT-EC by recording the execution times of the ten cross-validation models on the enzyme sequences from New-28245. For this evaluation, each enzyme was processed individually, using a batch size of 1. All experiments were performed on a single NVIDIA A30 Tensor Core GPU, which was dedicated exclusively to this task to ensure consistent performance measurements. Table 3 summarizes the average execution times and standard deviations per enzyme sequence for each method. HIT-EC achieved an average runtime of 38 ± 9 milliseconds per sequence, substantially faster than CLEAN (3915 ± 767 ms), DeepECT (519 ± 34 ms), and ECPICK (206 ± 23 ms). Our approach was 99.0% faster than CLEAN, 92.7% faster than DeepECT, and 81.6% faster than ECPICK. The observed differences in execution times can be attributed to the distinct design choices of the competing methods. CLEAN’s reliance on contrastive learning requires computing the distances between the query embedding and all training embeddings, resulting in significant computational overhead. In contrast, DeepECT runs multiple sequence alignment every time it fails to predict an enzyme, which further increases its processing time. Meanwhile, ECPICK’s ensemble learning strategy necessitates executing ten individual models to produce a single prediction, contributing to its comparatively longer runtime. These results highlight the superior efficiency of our method and its potential for high-throughput applications in resource-constrained environments.
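A sketch of the latency protocol follows: each sequence is processed with a batch size of 1, and the GPU is synchronized around the timed region so that asynchronous kernels are fully accounted for. Model and input names are illustrative assumptions, not the benchmarking harness used in this study.

```python
# Illustrative per-sequence latency measurement on a single GPU.
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, token_sequences, device="cuda"):
    model.eval().to(device)
    times = []
    for tokens in token_sequences:           # one sequence at a time
        x = tokens.unsqueeze(0).to(device)   # batch size of 1
        torch.cuda.synchronize()             # flush pending kernels
        start = time.perf_counter()
        _ = model(x)
        torch.cuda.synchronize()             # wait for the forward pass
        times.append((time.perf_counter() - start) * 1000.0)  # ms
    return sum(times) / len(times)
```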
Model interpretation for trustworthy predictions
We validated domain-specific evidence derived from the relevance scores computed by the final HIT-EC model. For the assessment, we compared the relevance scores with characterized motif sites within enzyme sequences of (1) the Cytochrome P450 (CYP) family, specifically CYP106A2 [EC 1.14.15.8], and (2) representative enzymes for the seven first-level EC classes. First, we investigated CYP106A2, which is a bacterial steroid hydroxylase that hydroxylates a broad range of steroids and whose protein structure has been extensively studied39. We considered thirteen protein sequences of the well-characterized bacterial CYP106A2 enzyme family: two from the PDB and eleven from Swiss-Prot, each sharing less than 60% amino-acid-level similarity with 5XNT40 and 4YT341. Then, we computed the relevance scores for each protein sequence and visualized them on a multiple sequence alignment (MSA) generated with Clustal Omega42. We compared the relevance scores of HIT-EC against ECPICK's relevance scores, as ECPICK was the only approach that provided evidential scores at the amino-acid level.
The interpretation results showed that HIT-EC enhanced the detection of key motif sites (oxygen-binding, EXXR, and heme-binding motifs). Figure 5 illustrates the CYP106A2 sequences, with functional domains emphasized using colored boxes. Conserved sequences are highlighted in red using ESPript343. Motif sites, including the oxygen-binding, EXXR, and heme-binding domains, are outlined with blue boxes, while substrate recognition sites (SRS 1-6) within the CYP106A2 family are marked with green boxes. The signature regions of the enzyme within the CYP106A2 family are illustrated by purple boxes. The comparison of the relevance scores of ECPICK and HIT-EC is visualized in Fig. 5 using a color scale, such that high scores are in red and low scores are in white. Both models effectively located the key motif sites within the CYP106A2 family with high scores, including the oxygen-binding motif and heme-binding domain, which are pivotal for defining the first and second levels of EC classification (e.g., 1.14). However, HIT-EC produced more discriminative scores for these critical motif regions compared to ECPICK. Specifically, HIT-EC clearly identified the EXXR motif, which ECPICK did not recognize.
HIT-EC identifies key amino acids contributing to the prediction, specifying the critical regions essential for the enzyme’s activity as domain-specific evidence for the trustworthy predictions. The blue box corresponds to the main function site of enzymes, the purple box represents the signature region of the enzyme, and the green box denotes the Substrate Recognition Sites (SRS). Source data are provided as a Source Data file.
Then, we further examined the reliability of HIT-EC's interpretation considering representative enzymes from the seven first-level EC classes. The representative enzymes include alcohol dehydrogenase (UniProt ID: P00330, EC 1.1.1.1), hexokinase (P19367, EC 2.7.1.1), carboxylesterase (P23141, EC 3.1.1.1), carbonic anhydrase (P00918, EC 4.2.1.1), glucose-6-phosphate isomerase (P06744, EC 5.3.1.9), glutamine synthetase (P15104, EC 6.3.1.2), and cytochrome c oxidase subunit I (P00395, EC 7.1.1.9) (Table 4). HIT-EC correctly predicted the EC numbers of all seven enzymes. Interestingly, HIT-EC predicted multiple EC numbers for P23141 (EC 3.1.1.1), P00918 (EC 4.2.1.1), and P15104 (EC 6.3.1.2). For P23141, HIT-EC assigned EC 3.1.1.1, EC 3.1.1.13, and EC 3.1.1.56. P23141 is the major hepatic carboxylesterase (EC 3.1.1.1), and its multiple substrate reactions on sterol esterase (EC 3.1.1.13) and 4-methylumbelliferyl acetate (EC 3.1.1.56) are well documented44,45. HIT-EC captured this broadened substrate spectrum. For P00918, HIT-EC predicted EC 4.2.1.69 as well as EC 4.2.1.1. P00918 is mainly associated with the standard carbonic anhydrase reaction (EC 4.2.1.1) and can also catalyze the hydration of cyanamide at a much lower level (EC 4.2.1.69)46. For P15104, HIT-EC predicted EC 6.3.1.2 and EC 2.3.1.225. Although EC 6.3.1.2 is the main EC number for P15104, it has been reported that EC 2.3.1.225 (DHHC S-acyltransferase activity) is co-mentioned alongside P15104 in palmitoylation contexts47.
We computed the relevance scores of the seven enzymes and investigated their reliability by comparing the scores against known functional features. For each enzyme, we analyzed the relevance scores along with twenty reference enzymes belonging to the same EC class. Their relevance scores are visualized in the Supplementary Figures, where known substrate-binding sites and enzyme signatures are marked with green and purple boxes, respectively. Specifically, for alcohol dehydrogenase (EC 1.1.1.1), the relevance scores highlighted a sequence motif that coincided with the zinc-containing alcohol dehydrogenase signature (Table 4). The catalytic/structural Zn2+ sites and the substrate-binding site also matched regions of high relevance scores (Supplementary Fig. S1). For hexokinase (EC 2.7.1.1), the relevance scores highlighted the hexokinase domain signature (Table 4). We observed identical highlighting of the hexokinase signature across the reference sequences of varying lengths (Supplementary Fig. S2). For carboxylesterase (EC 3.1.1.1), the relevance scores highlighted the carboxylesterases type-B signature (Supplementary Fig. S3), and the substrate-binding site showed the highest relevance signal (Table 4). For carbonic anhydrase (EC 4.2.1.1), the alpha-carbonic anhydrase signature and the substrate-binding site were highlighted by high relevance scores (Table 4, Supplementary Fig. S4). For glucose-6-phosphate isomerase (GPI, EC 5.3.1.9), a canonical GPI signature and four conserved GPI motifs were highlighted (Table 4, Supplementary Fig. S5). For glutamine synthetase (EC 6.3.1.2), the relevance scores highlighted the main signature of the glutamine synthetase catalytic domain (Table 4, Supplementary Fig. S6). For cytochrome c oxidase subunit I (EC 7.1.1.9), the COX I signature and the membrane-bound region were highlighted by the relevance scores (Supplementary Fig. S7).
Trustworthy prediction of understudied enzymes in microbial bioremediation
We further evaluated HIT-EC's performance using understudied enzymes in microbial bioremediation and provided predictions of their EC numbers along with interpretation. We considered two enzymes identified from microbial genome sequences: (1) CYP199A35 from strain Amycolatopsis magusensis KCCM40447 and (2) a PET hydrolase from strain Streptomyces sp. W2061. These protein sequences are not registered in any public databases, and their EC numbers have not yet been assigned. CYP199A35 is a candidate biocatalyst for microbial bioremediation of aromatic pollutants, catalyzing the hydroxylation of aromatic compounds48, and PET hydrolases are promising enzymes that enhance the biodegradation of polyethylene terephthalate (PET) as a potential solution for plastic degradation49. We identified both enzyme sequences by microbial genome mining, expressed and purified the corresponding proteins, and experimentally verified their enzymatic activities48,49. For CYP199A35, our genomic analysis of Amycolatopsis magusensis KCCM40447 (NCBI accession NZ_JASCSI000000000) showed that the monooxygenase gene (WP_282773100.1) lies adjacent to the 4-hydroxybenzoate transporter (pcaK) locus48. In our experiments, the enzyme encoded by gene WP_282773100.1 O-demethylated the para-O-methyl benzoates 4-MB and 4-MCA to 4-HB and 4-HCA, respectively. This activity occurred with both the native two-component system (Mbr) and a non-physiological three-component system (Pdr/Pdx). Accordingly, WP_282773100.1 was reannotated as CYP199A3548. For the PET hydrolase, a combination of in silico and biochemical sequence and structure analyses of a Streptomyces PET hydrolase revealed conserved catalytic motifs typical of PET-active esterases, and molecular-dynamics simulations supported stable binding to bis(2-hydroxyethyl) terephthalate (BHET)49. The Streptomyces PET hydrolase successfully degraded a single ester bond of BHET, resulting in the formation of MHET, as determined by LC-MS analysis49. According to the experimental verification, CYP199A35 and the PET hydrolase are expected to be assigned EC 1.14.99.15 and EC 3.1.1.101, respectively. Both enzymes are relevant for bioremediation and have been understudied. There is only one curated reference in Swiss-Prot and none in the PDB for EC 1.14.99.15 (for CYP199A35), while 10 and 41 curated reference enzymes are available in Swiss-Prot and the PDB, respectively, for EC 3.1.1.101 (for PET hydrolases).
HIT-EC correctly predicted the EC numbers of the two enzymes: EC 1.14.99.15 for CYP199A35 and EC 3.1.1.101 for the PET hydrolase. We examined HIT-EC relevance scores for the target enzymes and closely related references to support the trustworthiness of the predictions. First, for CYP199A35 (Fig. 6A), HIT-EC highlighted the B-class P450 signature regions (purple boxes) and key functional sites (blue boxes), including the cytochrome P450 cysteine heme-iron ligand signature and metal-coordination sites. These findings align with known features of CYP199A-family monooxygenases. The conserved P450 cysteine heme-iron ligand signature provides structural evidence for monooxygenase chemistry typical of EC 1.14. For the interpretation analysis, the reference sequences included CYP199A-family enzymes, such as CYP199A1 from Bradyrhizobium japonicum USDA110, BAC46313.1; CYP199A4 from Rhodopseudomonas palustris HaA2, PDB 5KDB; CYP199A2 from R. palustris CGA009, WP_011157377.1; CYP199A25 from Arthrobacter sp., WP_091467048; and CYP199A35 from Amycolatopsis magusensis KCCM40447, WP_282773100.1. Second, HIT-EC highlighted PET hydrolase/cutinase-like signature regions (purple boxes) and the Gly-X-Ser-X-Gly motif characteristic of α/β-hydrolase-fold serine hydrolases (blue box) in Fig. 6B. This pattern indicates a serine hydrolase mechanism (EC 3.1). The reference sequences included PDB 6ILW, Ideonella sp.; 4EB0, leaf-branch-compost cutinase; 4CG1, Thermobifida alba; 4WFJ, Saccharomonospora viridis; 1JFR, Streptomyces exfoliatus; and the Streptomyces PET hydrolase, Streptomyces exfoliatus.
Blue boxes indicate main functional sites; purple boxes mark enzyme-specific signature regions. † denotes the sequence that HIT-EC analyzed. A HIT-EC predicted EC 1.14.99.15 for CYP199A35. The relevance scores highlight five CYP signature regions and three main functional sites that are characteristic of the CYP199A subfamily. B HIT-EC predicted EC 3.1.1.101 for the Streptomyces PET-hydrolase sequence. The interpretation highlights one main functional site and six PET-hydrolase signature regions, which are essential components of PET-degrading cutinase-like enzymes. Source data are provided as a Source Data file.
Discussion
In this study, we demonstrated that HIT-EC significantly advances EC number prediction through its innovative hierarchical interpretable transformer approach. HIT-EC achieved superior predictive performance in various evaluation settings, including cross-validation with large datasets, external validation, and species-specific evaluation. Importantly, HIT-EC significantly outperformed the benchmark models in predicting underrepresented EC numbers, which can be attributed to its innovative architectural design and learning strategy that handles incomplete annotations. HIT-EC’s transformer-based hierarchical framework captures complex dependencies between the four levels of the EC hierarchy, while preserving broader contextual information of the protein sequence, which results in enhanced predictive power for overall and underrepresented EC classes.
Furthermore, we enhanced the trustworthiness in the predictions through evidential deep learning. By incorporating attention mechanisms and relevance propagation, HIT-EC produces domain-specific evidence for the prediction, which aligns with established biological knowledge. HIT-EC accurately identifies conserved motifs and functional regions in enzymes, as demonstrated with the CYP106A2 family analysis and representative enzymes of the seven major EC families. The evidential approach not only ensures the reliability of its predictions, but also highlights regions of potential biological significance, offering insights that could guide further research and experimental validation.
As a future research direction, further improvement of the predictive power for underrepresented EC numbers would be critical. HIT-EC, while superior to state-of-the-art models, still showed low performance when classifying underrepresented EC numbers (micro-averaged F1-score of 0.77). In this study, we focused on EC classes with 10 to 25 enzyme references, representing roughly 20% of all EC classes across all databases. Classes with fewer than ten references account for about 58% of all EC classes across databases. Future work should broaden coverage to include these extremely rare classes. Additionally, the computational cost of the hierarchical transformer architecture may pose challenges for large-scale deployments. Future work could focus on optimizing the model’s efficiency without compromising its predictive and interpretative capabilities. HIT-EC represents a significant step forward in EC number prediction, advancing deep learning techniques with interpretability to deliver accurate and trustworthy results.
Methods
In this section, we elucidate the proposed HIT-EC model, focusing on three core components: the architecture of the proposed framework, the approach for training with incomplete EC annotations, and the strategy for enhancing model interpretability.
Architecture of HIT-EC
HIT-EC consists of five modules: (1) an embedding layer that encodes a protein sequence to characterize its physicochemical properties as numerical representations, (2) a positional encoding layer that incorporates sequence order information, (3) four transformer encoders, each of which corresponds to a level of the EC hierarchy and estimates level-specific local predictions, (4) a global linear layer that generates a global prediction of the EC numbers, and (5) the aggregation of the global prediction and the level-specific local predictions to form the final prediction. The HIT-EC architecture captures both local and global dependencies within sequences, while leveraging the hierarchical structure of the EC nomenclature.
First, the embedding layer converts amino acid sequences into physicochemical representations50. Each amino acid in a sequence is mapped to a fixed-dimensional vector using an embedding matrix $\mathbf{E} \in \mathbb{R}^{23 \times d}$, where d is the dimension of the embeddings. The value 23 corresponds to the 20 amino acids, the special token X (representing the amino acids B, Z, U, and O), the classification token, and the padding token. Each row of $\mathbf{E}$ represents a specific amino acid, and the matrix values are learned during training to capture meaningful biochemical and structural properties. The embedding layer projects each sequence into a continuous vector space. By transforming the discrete sequence data into a numerical format, the embedding layer provides the foundation for downstream layers to process and extract relevant features for enzyme classification.
Second, the positional encoding layer integrates sequence order information into the model to capture the sequential nature of protein structures32. Since transformers lack inherent sequence-order awareness, positional encodings are added to the input embeddings to provide this positional context. HIT-EC employs sinusoidal positional encodings, where each position p in the sequence is assigned a unique vector using sine and cosine functions of different frequencies. The positional encoding $\mathcal{P}(\cdot)$ for each dimension i at position p is computed as
$$\mathcal{P}(p, 2i) = \sin\left(\frac{p}{10000^{2i/d}}\right), \qquad (1)$$

$$\mathcal{P}(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/d}}\right), \qquad (2)$$
where 2i and 2i + 1 represent even and odd indices, respectively. These encodings are then added to the amino acid embeddings, resulting in $\boldsymbol{\mathcal{A}} \in \mathbb{R}^{S \times d}$, where S is the sequence length (i.e., S = 1,024). Each row of $\boldsymbol{\mathcal{A}}$ corresponds to an amino acid, containing both physicochemical properties and positional information. This positional information is crucial for capturing sequence-specific patterns, such as conserved motifs and functional regions, which are associated with the EC number classification.
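For concreteness, a minimal PyTorch sketch of Eqs. (1) and (2), combined with the learned 23-token embedding, is shown below; the dimensions and the padding index are illustrative assumptions.

```python
# Sinusoidal positional encoding (Eqs. (1)-(2)) added to the amino-acid
# embeddings; a minimal sketch with illustrative dimensions.
import torch

def sinusoidal_encoding(S: int, d: int) -> torch.Tensor:
    p = torch.arange(S, dtype=torch.float32).unsqueeze(1)   # positions (S, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)          # even dimensions
    angle = p / torch.pow(10000.0, i / d)                   # (S, d/2)
    P = torch.zeros(S, d)
    P[:, 0::2] = torch.sin(angle)   # Eq. (1): even indices
    P[:, 1::2] = torch.cos(angle)   # Eq. (2): odd indices
    return P

# 23 tokens: 20 amino acids + X + classification token + padding token
# (padding_idx=0 is an illustrative choice).
embedding = torch.nn.Embedding(23, 1024, padding_idx=0)
# A = embedding(token_ids) + sinusoidal_encoding(1024, 1024)
```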
Third, HIT-EC introduces a transformer-based architecture that implements the complementary paradigms of local and global flows to enhance the hierarchical classification of enzyme functions. The local flow models the dependency of each level on the previous level of the EC hierarchy, capturing hierarchical relationships51. This ensures that the predictions for each level are informed by the previous level's output. In contrast, the global flow treats each level independently by reintroducing the original sequence embeddings into the subsequent encoders:
$$\mathbf{I}_j = \boldsymbol{\mathcal{A}} + \mathbf{O}_{j-1}, \qquad (3)$$
where $\mathbf{I}_j$ is the input of the jth encoder, and $\mathbf{O}_{j-1}$ is the output of the (j − 1)th encoder. Note that $\mathbf{I}_1 = \boldsymbol{\mathcal{A}}$. To enable global classification for enzyme prediction, we add a special classification token as the first token of the input sequence. This token serves as a summary representation of the entire sequence. By including this token, the model can aggregate sequence-level information efficiently for downstream classification tasks. By combining both local and global flows, HIT-EC considers both the specific relationships at each hierarchical level and the overall context of the protein sequence.
HIT-EC includes a hierarchical structure of four encoders to predict each level of the EC hierarchy, where each encoder is composed of a multi-head self-attention mechanism and a position-wise feed-forward neural network. The multi-head self-attention mechanism captures relationships between amino acids within a sequence, identifying multiple relevant interactions in parallel. We use h attention heads, with queries ($\mathbf{Q}$) and keys ($\mathbf{K}$) using the per-head dimension $d_k = d/h$, while the value ($\mathbf{V}$) projection is set to the model dimension d. The attention scores are
$$\mathbf{P} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right). \qquad (4)$$
We then obtain the per-head representation with
$$\mathbf{X}^{(i)} = \mathbf{P}^{(i)}\mathbf{V}^{(i)}, \qquad (5)$$
and the attention heads are aggregated additively:
$$\mathbf{X} = \sum_{i=1}^{h} \mathbf{X}^{(i)}, \qquad \mathbf{X} \in \mathbb{R}^{S \times d}. \qquad (6)$$
This allows each head to carry a richer representation52. The position-wise feed-forward network then refines the representations learned by the attention mechanism to detect more complex patterns. Layer normalization and residual connections are also incorporated to stabilize training and enhance convergence. Each encoder generates both a sequence representation and a level-specific prediction. Let $\mathbf{P}_{\mathcal{L}} = [\mathbf{P}_1, \mathbf{P}_2, \mathbf{P}_3, \mathbf{P}_4]$ denote the concatenated level-specific local predictions, where each $\mathbf{P}_k$ corresponds to the model's level-specific predictions at EC level k.
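The widened-value attention of Eqs. (4)-(6) can be sketched in PyTorch as follows: queries and keys use the per-head dimension $d_k = d/h$, the value projection keeps the full model dimension d for every head, and the head outputs are summed rather than concatenated. Class and variable names are illustrative.

```python
# Sketch of the widened-value multi-head attention (Eqs. (4)-(6)).
import math
import torch
import torch.nn as nn

class WidenedValueAttention(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        self.h, self.d, self.d_k = h, d, d // h
        self.q = nn.Linear(d, h * self.d_k)
        self.k = nn.Linear(d, h * self.d_k)
        self.v = nn.Linear(d, h * d)   # value projection widened to d per head

    def forward(self, x):              # x: (B, S, d)
        B, S, _ = x.shape
        Q = self.q(x).view(B, S, self.h, self.d_k).transpose(1, 2)  # (B,h,S,d_k)
        K = self.k(x).view(B, S, self.h, self.d_k).transpose(1, 2)
        V = self.v(x).view(B, S, self.h, self.d).transpose(1, 2)    # (B,h,S,d)
        # Eq. (4): scaled dot-product attention scores per head.
        P = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        X = P @ V                      # Eq. (5): per-head representations
        return X.sum(dim=1)            # Eq. (6): additive aggregation -> (B,S,d)
```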
Fourth, HIT-EC simultaneously generates global predictions for all levels of the EC hierarchy using a linear layer after processing through the encoders. The input of this layer is the representation of the classification token from the fourth encoder. The global prediction $\mathbf{P}_{\mathcal{G}}$ is obtained by applying a fully connected layer to this classification token representation:
$$\mathbf{P}_{\mathcal{G}} = \mathrm{Linear}(\mathbf{C}_4), \qquad (7)$$
where $\mathbf{C}_4$ is the output embedding of the classification token from the final encoder layer.
Finally, HIT-EC produces its final prediction by aggregating the local ($\mathbf{P}_{\mathcal{L}}$) and global ($\mathbf{P}_{\mathcal{G}}$) predictions:
$$\mathbf{P}_{\mathrm{final}} = \beta \mathbf{P}_{\mathcal{G}} + (1-\beta) \mathbf{P}_{\mathcal{L}}, \qquad (8)$$
where β ∈ [0, 1] is a hyperparameter that balances the hierarchical specificity provided by the local predictions against the comprehensive view captured by the global prediction.
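The following sketch wires Eqs. (3), (7), and (8) together. Encoder internals are omitted, and the assumption that each level-specific head reads the classification token is ours for illustration; the global head outputs the concatenated all-level label space so that shapes match Eq. (8).

```python
# High-level sketch of the local/global flow (Eqs. (3), (7), (8)).
import torch
import torch.nn as nn

class HierarchicalFlow(nn.Module):
    def __init__(self, encoders, level_heads, global_head, beta: float):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)        # four transformer encoders
        self.level_heads = nn.ModuleList(level_heads)  # one classifier per EC level
        self.global_head = global_head                 # linear layer on [CLS]
        self.beta = beta

    def forward(self, A):               # A: embeddings + positional encodings
        out, local_preds = None, []
        for enc, head in zip(self.encoders, self.level_heads):
            inp = A if out is None else A + out   # Eq. (3): I_j = A + O_{j-1}
            out = enc(inp)
            local_preds.append(head(out[:, 0]))   # level-specific local prediction
        P_L = torch.cat(local_preds, dim=-1)      # concatenated local predictions
        P_G = self.global_head(out[:, 0])         # Eq. (7): [CLS] of 4th encoder
        return self.beta * P_G + (1 - self.beta) * P_L   # Eq. (8)
```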
Training with incomplete EC number data
HIT-EC incorporates a masked objective function to handle incompletely labeled data in hierarchical multi-label classification settings. Specifically, the masked loss function adjusts the contribution of unannotated levels in the EC hierarchy, such that only the annotated levels contribute to the computed loss during training. This approach utilizes binary cross-entropy (BCE) loss to quantify the discrepancy between predicted and ground-truth values. Let $y_{ij}$ be the label for sequence i at level j. We define a binary mask $\mathbf{M}$, with $m_{ij} = 1$ if the ground truth $y_{ij}$ is available and $m_{ij} = 0$ otherwise. The total loss for a batch of N sequences across the four levels of the EC hierarchy is computed as
$$L = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{4} m_{ij} L_{ij}, \qquad (9)$$
where $L_{ij}$ is the BCE loss for the predicted probability $\widehat{y}_{ij}$ at level j of the EC hierarchy, given the ground truth $y_{ij}$.
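A minimal sketch of Eq. (9) follows; it assumes logits as inputs and averages the BCE within each level over that level's classes, which is one plausible reduction rather than the exact released implementation.

```python
# Masked binary cross-entropy (Eq. (9)): unannotated levels (e.g., the
# missing tails of "3.5.-.-") receive mask 0 and contribute nothing.
import torch
import torch.nn.functional as F

def masked_bce_loss(logits, targets, mask):
    """logits, targets: lists of four (N, C_j) tensors, one per EC level;
    mask: (N, 4) binary tensor with m_ij = 1 iff level j of sequence i
    is annotated."""
    N = mask.shape[0]
    loss = 0.0
    for j in range(4):
        # Per-sequence BCE at level j, averaged over that level's classes.
        L_ij = F.binary_cross_entropy_with_logits(
            logits[j], targets[j], reduction="none").mean(dim=1)   # (N,)
        loss = loss + (mask[:, j] * L_ij).sum()
    return loss / N
```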
Model interpretation
HIT-EC’s interpretation scheme integrates attention flow with gradient-based relevance propagation, offering a comprehensive view of the model’s decision-making process. HIT-EC identifies biologically significant regions in protein sequences, such as conserved motifs or functional domains like active sites or binding pockets, that contribute to enzymatic activity. Additionally, it accommodates multi-label classification, allowing for separate relevance scores to be assigned for each level of the EC number prediction, thereby providing independent explanations for each hierarchical level.
We designed an interpretation scheme by freezing the trained predictor and introducing a dedicated interpretation path for gradient-based explanations. The inference path follows the transformer attention mechanism with the widened value projection described earlier, trained to convergence and then frozen. The interpretation path instead uses the standard attention module32 and is trained on the same task with the same labels, reproducing the predictor's decision boundaries while keeping a clear and stable gradient flow for interpretation. We therefore re-train the regular attention formulation in the interpretation path to preserve clear gradient behavior and maintain fidelity to the frozen predictor.
Relevance scores are computed for each amino acid by adapting an explainable framework for transformer-based models53. The method propagates gradients of the output prediction with respect to the input tokens through the attention layers and combines them with the attention weights to generate the final relevance scores. The contribution of each input feature is appropriately reflected in the corresponding relevance score. The relevance score ri for an input token xi is computed as:
$$r_i = \sum_{l=1}^{T}\sum_{j=1}^{S} \frac{\partial y}{\partial a_{ij}^{l}}\, a_{ij}^{l}, \qquad (10)$$
where T is the number of attention layers (i.e., T = 4), S is the number of tokens (i.e., S = 1,024), $a_{ij}^{l}$ is the attention score from token i to token j at layer l, and y is the model's final prediction. We first run the inference path to determine the predicted class. Relevance scores are then computed for each EC level using the interpretation path, and the final interpretation is obtained by averaging the relevance scores across all levels to provide a comprehensive, hierarchy-aware interpretation of model predictions.
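A minimal sketch of Eq. (10) is shown below. It assumes the per-layer attention matrices were stored and had retain_grad() called during the forward pass, and that they are averaged over heads to shape (S, S); both are simplifying assumptions for illustration.

```python
# Gradient-times-attention relevance (Eq. (10)) for one output y.
import torch

def relevance_scores(attentions, y):
    """attentions: list of T tensors of shape (S, S), requires_grad=True
    and retain_grad() called before the backward pass;
    y: scalar model output for the predicted class at one EC level."""
    y.backward()                      # populates a.grad for every layer
    r = torch.zeros(attentions[0].shape[0])
    for a in attentions:              # sum over layers l = 1..T
        # Elementwise (dy/da) * a, summed over attention targets j.
        r = r + (a.grad * a).detach().sum(dim=1)
    return r                          # one relevance score per input token
```

Averaging the resulting per-level scores, as described above, yields the final hierarchy-aware interpretation.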
Statistical analysis
To assess the statistical significance of the performance differences between the benchmark models, we employed the one-tailed Wilcoxon signed-rank test with a sample size of n = 10, corresponding to the number of cross-validation models. We used significance thresholds of p < 0.05 and p < 0.01, as detailed in the Results section. All statistical tests were performed using the wilcoxon function from the scipy.stats module in SciPy version 1.15.254.
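For reference, the paired one-sided test can be invoked as follows; the score values shown are placeholders for illustration only, not results from this study.

```python
# One-sided Wilcoxon signed-rank test over n = 10 paired replicate scores;
# alternative="greater" encodes the directional hypothesis that HIT-EC
# outperforms the benchmark. Values below are illustrative placeholders.
from scipy.stats import wilcoxon

hit_ec_f1   = [0.93, 0.92, 0.94, 0.93, 0.95, 0.91, 0.93, 0.94, 0.92, 0.93]
baseline_f1 = [0.88, 0.86, 0.89, 0.87, 0.90, 0.85, 0.88, 0.88, 0.86, 0.87]
stat, p_value = wilcoxon(hit_ec_f1, baseline_f1, alternative="greater")
```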
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Data analyzed during this study are available in the Zenodo repository at https://doi.org/10.5281/zenodo.1770365255. Source data are provided with this paper as a Source Data file.
Code availability
The code used to develop the model, perform the analyses, and generate all results in this study is publicly available and has been deposited in GitHub at https://github.com/datax-lab/HIT-EC under the MIT license. The specific version of the code associated with this publication is archived in Zenodo and is accessible via https://doi.org/10.5281/zenodo.1770365255.
References
-
Tipton, K. F. & Boyce, S. History of the enzyme nomenclature system. Bioinformatics 16, 34–40 (2000).
-
Robinson, P. K. Enzymes: principles and biotechnological applications. Essays Biochem. 59, 1–41 (2015).
-
Hatzimanikatis, V., Li, C., Ionita, J. A. & Broadbelt, L. J. Metabolic networks: enzyme function and metabolite structure. Curr. Opin. Struct. Biol. 14, 300–306 (2004).
-
Danchin, A. Enzyme nomenclature, recommendations (1992) of the nomenclature committee or the international union of biochemistry and molecular biology. Biochimie 75, 501 (1993).
-
McDonald, A. G., Boyce, S. & Tipton, K. F. Explorenz: the primary source of the iubmb enzyme list. Nucleic Acids Res. 37, 593–597 (2009).
-
Artimo, P. et al. Expasy: Sib bioinformatics resource portal. Nucleic Acids Res. 40, 597–603 (2012).
-
Noelker, C., Hampel, H. & Dodel, R. Blood-based protein biomarkers for diagnosis and classification of neurodegenerative diseases: current progress and clinical potential. Mol. Diagnosis Ther. 15, 83–102 (2011).
-
Xiao, X., Lin, W.-Z. & Chou, K.-C. Recent advances in predicting protein classification and their applications to drug development. Curr. Top. Medicinal Chem. 13, 1622–1635 (2013).
-
Qu, G., Li, A., Acevedo-Rocha, C. G., Sun, Z. & Reetz, M. T. The crucial role of methodology development in directed evolution of selective enzymes. Angew. Chem. Int. Ed. 59, 13204–13231 (2020).
-
10. Zhang, C., Freddolino, P. L. & Zhang, Y. COFACTOR: improved protein function prediction by combining structure, sequence and protein–protein interaction information. Nucleic Acids Res. 45, W291–W299 (2017).
11. Yang, J. et al. The I-TASSER Suite: protein structure and function prediction. Nat. Methods 12, 7–8 (2015).
12. El-Hachem, N., Haibe-Kains, B., Khalil, A., Kobeissy, F. H. & Nemer, G. AutoDock and AutoDockTools for protein–ligand docking: beta-site amyloid precursor protein cleaving enzyme 1 (BACE1) as a case study. Methods Mol. Biol. 1598, 391–403 (2017).
13. Song, Y. et al. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat. Commun. 15, 8180 (2024).
14. Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).
15. Wang, Z. et al. LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction. Sci. Rep. 12, 6832 (2022).
16. Tian, W., Arakaki, A. K. & Skolnick, J. EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference. Nucleic Acids Res. 32, 6226–6239 (2004).
17. Desai, D. K., Nandi, S., Srivastava, P. K. & Lynn, A. M. ModEnzA: accurate identification of metabolic enzymes using function-specific profile HMMs with optimised discrimination threshold and modified emission probabilities. Adv. Bioinform. 2011, 743782 (2011).
18. Claudel-Renard, C., Chevalet, C., Faraut, T. & Kahn, D. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res. 31, 6633–6639 (2003).
19. De Ferrari, L., Aitken, S., van Hemert, J. & Goryanin, I. EnzML: multi-label prediction of enzyme classes using InterPro signatures. BMC Bioinform. 13, 1–12 (2012).
20. Dalkiran, A. et al. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinform. 19, 1–13 (2018).
21. Li, Y. et al. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics 34, 760–769 (2018).
22. Ryu, J. Y., Kim, H. U. & Lee, S. Y. Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers. Proc. Natl Acad. Sci. USA 116, 13996–14001 (2019).
23. Han, S.-R. et al. Evidential deep learning for trustworthy prediction of enzyme commission number. Brief. Bioinform. 25, bbad401 (2024).
24. Kim, G. B. et al. Functional annotation of enzyme-encoding genes using deep learning with transformer layers. Nat. Commun. 14, 7370 (2023).
25. Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).
26. Ferdous, S., Shihab, I. F. & Reuel, N. F. Effects of sequence features on machine-learned enzyme classification fidelity. Biochem. Eng. J. 187, 108612 (2022).
27. Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).
28. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
29. Arlot, S. & Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010).
30. Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: a next-generation hyperparameter optimization framework. In Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 2623–2631 (ACM, 2019).
31. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. & Wilson, A. G. Averaging weights leads to wider optima and better generalization. In Proc. 34th Conf. Uncertainty in Artificial Intelligence (UAI 2018) (eds. Globerson, A. & Silva, R.) 876–885 (AUAI Press, 2018).
32. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
33. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).
34. Addou, S., Rentzsch, R., Lee, D. & Orengo, C. A. Domain-based and family-specific sequence identity thresholds increase the levels of reliable protein function transfer. J. Mol. Biol. 387, 416–430 (2009).
35. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
36. Kanehisa, M. The KEGG database. In 'In Silico' Simulation of Biological Processes: Novartis Foundation Symposium (eds. Bock, G. & Goode, J. A.) 91–103 (John Wiley & Sons, 2002).
37. Bateman, A. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, https://doi.org/10.1093/nar/gky1049 (2019).
38. Keseler, I. M. et al. The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res. 45, https://doi.org/10.1093/nar/gkw1003 (2017).
39. Schmitz, D., Janocha, S., Kiss, F. M. & Bernhardt, R. CYP106A2 – a versatile biocatalyst with high potential for biotechnological production of selectively hydroxylated steroid and terpenoid compounds. Biochim. Biophys. Acta Proteins Proteom. 1866, 11–22 (2018).
40. Kim, K.-H. et al. Crystal structure and functional characterization of a cytochrome P450 (BaCYP106A2) from Bacillus sp. PAMC 23377. J. Microbiol. Biotechnol. 27, 1472–1482 (2017).
41. Janocha, S., Carius, Y., Hutter, M., Lancaster, C. R. D. & Bernhardt, R. Crystal structure of CYP106A2 in substrate-free and substrate-bound form. ChemBioChem 17, 852–860 (2016).
42. Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 27, 135–145 (2018).
43. Robert, X. & Gouet, P. Deciphering key features in protein structures with the new ENDscript server. Nucleic Acids Res. 42, W320–W324 (2014).
44. Crow, J. A. et al. Inhibition of carboxylesterase 1 is associated with cholesteryl ester retention in human THP-1 monocyte/macrophages. Biochim. Biophys. Acta Mol. Cell Biol. Lipids 1781, https://doi.org/10.1016/j.bbalip.2008.07.005 (2008).
45. Pindel, E. V. et al. Purification and cloning of a broad substrate specificity human liver carboxylesterase that catalyzes the hydrolysis of cocaine and heroin. J. Biol. Chem. 272, https://doi.org/10.1074/jbc.272.23.14769 (1997).
46. Guerri, A., Briganti, F., Scozzafava, A., Supuran, C. T. & Mangani, S. Mechanism of cyanamide hydration catalyzed by carbonic anhydrase II suggested by cryogenic X-ray diffraction. Biochemistry 39, https://doi.org/10.1021/bi000937c (2000).
47. Eelen, G. et al. Role of glutamine synthetase in angiogenesis beyond glutamine synthesis. Nature 561, https://doi.org/10.1038/s41586-018-0466-7 (2018).
48. Pardhe, B. D., Paudel, L., Han, S.-R. & Oh, T.-J. Genomic insight into O-demethylation of 4-methoxybenzoate by a two-component system from Amycolatopsis magusensis KCCM40447. Heliyon 10, e25083 (2024).
49. Thapa, G. et al. In silico analysis and biochemical characterization of Streptomyces PET hydrolase with bis(2-hydroxyethyl) terephthalate biodegradation activity. J. Microbiol. Biotechnol. 34, 1836–1847 (2024).
50. ElAbd, H. et al. Amino acid encoding for deep learning applications. BMC Bioinform. 21, 1–14 (2020).
51. Wehrmann, J., Cerri, R. & Barros, R. Hierarchical multi-label classification networks. In Proc. 35th Int. Conf. Mach. Learn. (eds. Dy, J. & Krause, A.) Vol. 80, 5075–5084 (PMLR, 2018).
52. Bhardwaj, R., Majumder, N., Poria, S. & Hovy, E. More identifiable yet equally performant transformers for text classification. In Proc. 59th Annu. Meet. Assoc. Comput. Linguistics & 11th Int. Joint Conf. Natural Lang. Process. (eds. Zong, C., Xia, F., Li, W. & Navigli, R.) 1172–1182 (Association for Computational Linguistics, 2021).
53. Chefer, H., Gur, S. & Wolf, L. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proc. IEEE/CVF Int. Conf. Comput. Vis. 397–406 (IEEE, 2021).
54. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
55. Dumontet, L., Han, S.-R., Lee, J. H., Oh, T.-J. & Kang, M. HIT-EC package. Zenodo https://doi.org/10.5281/zenodo.17703652 (2025).
Acknowledgements
The authors acknowledge the support of the National Science Foundation Major Research Instrumentation program (NSF MRI, Grant No. 2117941 to M.K.) and the National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (RS-2024-00441423 to T.O. and RS-2024-00354012 to S.H.).
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Xiaozhou Luo, Yajie Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Dumontet, L., Han, S.-R., Lee, J. H. et al. Trustworthy prediction of enzyme commission numbers using a hierarchical interpretable transformer. Nat. Commun. 17, 1146 (2026). https://doi.org/10.1038/s41467-026-68727-3