Introduction
Protein secondary structure prediction has long been a focus of academic attention, and it has contributed greatly to structural genomics and other biomedical fields1. The field has made remarkable progress and has now reached the frontier at which the single-sequence and single-structure approximations break down2. Improved accuracy in secondary structure prediction has significant implications for biological and biomedical research.
Accurate secondary structure information serves as a crucial intermediate step toward reliable tertiary structure modeling, especially for proteins lacking homologous templates. Moreover, it facilitates functional annotation of newly sequenced proteins by revealing structural motifs associated with specific biological activities. In biomedical contexts, improved secondary structure prediction can aid in drug discovery by identifying potential binding regions or allosteric sites and in understanding the structural impacts of disease-related mutations. Therefore, our model’s enhanced performance not only contributes to methodological advancements but also holds practical value in supporting downstream tasks in structural biology and precision medicine.
Lightweight models
In the context of protein secondary structure prediction, lightweight models refer to simpler, more computationally efficient models that typically use fewer parameters and less complex architectures. These models often rely on basic features such as amino acid sequences, with or without additional evolutionary information, and typically employ simpler algorithms such as linear models, basic neural networks, or shallow learning models. While they are faster and more resource efficient, their accuracy tends to be lower than that of more advanced, complex models.
Machine learning plays an increasingly important role in research3,4; in particular, it is crucial for extracting meaningful patterns from large-scale protein datasets to predict structure and function. Large amounts of protein data are used for learning and information extraction to predict protein structure and function. Commonly used prediction methods include machine learning and deep learning approaches based on amino acid sequences. Sequence-based methods include machine learning algorithms such as support vector machines (SVMs)5, random forests and artificial neural networks (ANNs)6, which transform amino acid sequences into numerical vectors and predict the structure and function of proteins via feature extraction techniques. Multiple encoding schemes, such as PSSM encoding7, one-hot encoding8, and HMM encoding9, are integrated, and classifiers are combined to improve the learning ability of the prediction model10.
Deep learning algorithms, including convolutional neural networks (CNNs)11, recurrent neural networks (RNNs)12 and transformers13, have been widely used in protein secondary structure prediction. These models automatically learn and extract features from protein data, enabling more accurate prediction of protein structure and function. CNNs in particular offer translational invariance and locality when used for protein secondary structure prediction.
The existing lightweight models for protein secondary structure prediction face several limitations, such as suboptimal accuracy, which typically falls short of the theoretical limit of 88%. These models are often constrained by limited datasets and insufficient feature utilization, which hinders their generalization capabilities. They also struggle to capture non-local interactions between amino acid residues, which is important for accurate secondary structure prediction. Additionally, their simplicity increases the risk of overfitting due to limited data and a lack of regularization. More advanced models with deeper architectures and better feature integration are needed to overcome these challenges.
Knowledge distillation
Knowledge distillation is a model compression and knowledge transfer method originally proposed by Hinton et al.15 that has also been regarded as a new form of ensembling16. The main idea is to improve a student model with a simple structure, limited training data and weaker performance by having it learn from a larger, more complex and more effective teacher model. Currently, advanced protein secondary structure prediction methods require substantial computational resources, which limits their use on common hardware; similarly, protein language models are widely used, but most also have high computational and memory requirements. Protein language models can improve performance by learning rich features from large datasets. Attempts have been made to apply distillation learning to protein secondary structure prediction17.
Wang et al.18 proposed the DSM-Distil method to address protein secondary structure prediction (PSSP) for low-homology proteins. By deeply combining a dynamic scoring matrix (DSM) with knowledge distillation (KD), they significantly improved prediction performance under low-quality multiple sequence alignments (MSAs).
In protein secondary structure prediction, the potential of knowledge distillation (KD) has not been fully exploited. Existing methods often suffer from three key limitations: (1) lack of cross-quality knowledge transfer, where models trained on low-quality data fail to benefit from high-quality supervision; (2) weak integration between pretrained models (e.g., BERT) and task-specific predictors, often treating them as separate modules rather than enabling deep interaction; and (3) rigid distillation strategies that ignore the dynamic nature of protein data, such as fixed feature forms or static temperature parameters. To address these gaps, this study introduces a multi-level KD design: transferring global knowledge from high-MSA-quality teacher models to low-quality students, fusing pretrained and domain-specific features through attention mechanisms, and incorporating dynamic feature representations with adaptive attention-based distillation. These improvements highlight new directions for enhancing model generalization and biological awareness in lightweight prediction frameworks.
Protein language model
Natural language processing techniques have been rapidly adopted in computational biology, especially protein language models (PLMs): deep learning models that transform protein sequences into rich, high-dimensional, context-aware representations that can be used for protein fold classification, function prediction, and protein secondary structure prediction and design. Protein language models can be combined with machine learning and deep learning techniques to achieve protein secondary structure prediction.
Despite the progress made in protein language modeling, challenges remain. Effectively extracting and utilizing protein sequence features is still difficult, and Q8 secondary structure prediction poses significant challenges due to its finer category granularity, severe class imbalance, and difficulty in capturing long-range dependencies. Unlike Q3, Q8 requires distinguishing eight subtle structural classes, including rare ones such as the π-helix and β-bridge, which are underrepresented in datasets such as CB513. Additionally, high-dimensional input features and the limited generalization ability of traditional machine learning models hinder prediction performance. Ensemble methods improve accuracy but carry high computational costs and overfitting risks. Future improvements may rely on deep learning models that better capture sequence context, dynamic resampling strategies to handle data imbalance, and the integration of evolutionary and physicochemical features for richer representations.
Main contribution
In this paper, a new temporal convolutional network (TCN) layer and a new combined model called “improved TCN-BiRNN-MLP” are proposed. The structure is built on a TCN, a bidirectional recurrent neural network (BiRNN) and a multilayer perceptron (MLP). Word vectors are extracted from one-hot encoded features and the physicochemical properties of proteins via word2vec-based segmentation; knowledge distillation allows the student model to learn the rich features of the ProtT5-XL-UniRef teacher model, and multiple datasets are used for eight-state (Q8) and three-state (Q3) secondary structure prediction. The datasets comprise the protein primary and secondary structure data of the classic datasets TS115 and CB51319, which contain 115 and 319 protein secondary structure sequences and amino acid sequences, respectively, as well as entries from the PDB protein database20 deposited between 2018-06-06 and 2020. The final Q8 accuracies were 88.6%, 86.1%, and 95.5%, and the Q3 accuracies were 91.1%, 90.4%, and 97.0%, respectively. The reported results are the maximum values over all training rounds, all of which are strong.
The main contributions of this paper are as follows: (1) a novel TCN model that improves upon the original TCN layer; (2) an improved TCN-BiRNN-MLP structure built on word2vec-based segmentation, which yields better protein secondary structure prediction results; and (3) the use of a protein language model as the teacher for knowledge distillation of the common network layers, which achieves better results on smaller configurations.
Materials and methods
Datasets
To evaluate the validity of the model, three datasets are used separately. First, the two classical datasets TS115 and CB513 contain 115 and 319 protein secondary structure sequences and amino acid sequences, respectively; in addition, 15,078 protein entries deposited between 2018-06-06 and 2020 were obtained from the PDB database. These structures are derived from X-ray crystallography at a resolution of 2.5 Å or better, with no chain breaks and fewer than 5 unknown amino acids. In general, the secondary structure of proteins and peptides can be defined in terms of eight states, namely, H (α-helix), G (3₁₀-helix), I (π-helix), E (extended β-strand), B (isolated β-bridge), T (turn), S (bend), and C (coil/other); the states (E, B) are usually merged as E, (H, G, I) as H, and (C, S, T) as C, simplifying the above eight states (Q8) into three (Q3).
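The eight-to-three-state reduction described above can be expressed compactly; the following is a minimal Python sketch of this mapping, built directly from the merging rule stated in the text (the function and variable names are illustrative).

```python
# Minimal sketch of the Q8 -> Q3 reduction described above:
# (H, G, I) -> H, (E, B) -> E, (C, S, T) -> C.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helical states
    "E": "E", "B": "E",             # strand / bridge states
    "C": "C", "S": "C", "T": "C",   # coil, bend, turn
}

def q8_to_q3(q8_labels: str) -> str:
    """Convert an 8-state secondary structure string into its 3-state (Q3) form."""
    return "".join(Q8_TO_Q3[state] for state in q8_labels)

print(q8_to_q3("HGEETSC"))   # -> "HHEECCC"
```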
Feature extraction
This paper uses word2vec technology to extract features and generate word vectors for protein primary structure sequences. The input features consider two key factors: (1) one-hot encoding and (2) the physicochemical properties of amino acids. One-hot encoding is a simple and direct method for representing protein sequences, where each amino acid is uniquely represented by a binary vector; the physicochemical properties of amino acids, such as polarity, charge and size, can effectively express the properties of protein sequences. Compared with other encoding methods, one-hot encoding is efficient and direct, whereas other encoding methods often require more computing resources and complex preprocessing. The main operations of this paper are as follows:
Step 1:
Initially, a sliding window mechanism is employed to segment protein sequences into local k-mer fragments, enabling the capture of contextual information within the sequence. These fragments are subsequently used to train a Word2Vec model on the Skip-Gram architecture, yielding semantic embedding vectors that effectively represent the local dependencies among amino acids. This representation provides contextual support essential for downstream tasks.
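As an illustration of Step 1, the sketch below builds overlapping k-mer “words” with a sliding window and trains a Skip-Gram Word2Vec model on them with gensim; the window length k = 3, the 300-dimensional vector size and the training hyperparameters are assumptions chosen for illustration rather than the paper’s exact settings.

```python
from gensim.models import Word2Vec

def to_kmers(sequence: str, k: int = 3) -> list[str]:
    """Slide a window of length k over the sequence to produce overlapping k-mer tokens."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Each protein sequence becomes one "sentence" of k-mer tokens (toy sequences shown here).
corpus = [to_kmers(seq) for seq in ["MKTAYIAKQR", "GSSGSSGPNS"]]

# Skip-Gram (sg=1) embeddings capturing local dependencies among amino acids.
w2v = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=1, epochs=10)

vector = w2v.wv["MKT"]   # 300-dimensional contextual embedding of the k-mer "MKT"
```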
Step 2:
To enhance the comprehensiveness of feature representation, a set of 14 physicochemical properties is incorporated for each amino acid. These include attributes such as hydropathy, flexibility, polarity, charge, pKa values, solvent accessibility, and entropy, which collectively describe the structural, chemical, and thermodynamic characteristics of the residues. The physicochemical vector is concatenated with standard one-hot encoding to form a composite feature representation that captures both symbolic identity and biochemical behavior.
Step 3:
The final input features are constructed by integrating the contextual embeddings derived from Word2Vec with the concatenated physicochemical and one-hot representations. This fusion strategy combines sequential semantics with residue-level intrinsic properties, thereby providing a richer and more discriminative input for model training and prediction. Empirical results demonstrate that this integrated approach consistently outperforms the use of individual feature types, confirming its effectiveness and robustness in protein sequence analysis.
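The sketch below illustrates how the three feature types of Steps 1-3 could be concatenated per residue; the 14-entry physicochemical table is shown with placeholder values, and the k-mer centering strategy is an assumption, since the paper does not specify how the contextual embeddings are aligned to individual residues.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Placeholder table of 14 physicochemical properties per amino acid
# (hydropathy, flexibility, polarity, charge, pKa, ...); real values would be filled in here.
PHYSCHEM = {aa: np.zeros(14, dtype=np.float32) for aa in AMINO_ACIDS}

def residue_features(sequence, w2v, k=3):
    """Concatenate one-hot, physicochemical, and Word2Vec context features for each residue."""
    feats = []
    for i, aa in enumerate(sequence):
        one_hot = np.eye(len(AMINO_ACIDS), dtype=np.float32)[AA_INDEX[aa]]
        # Take the k-mer roughly centered on residue i (clipped at the sequence ends).
        start = max(0, min(i - k // 2, len(sequence) - k))
        kmer = sequence[start:start + k]
        context = w2v.wv[kmer] if kmer in w2v.wv else np.zeros(w2v.vector_size, dtype=np.float32)
        feats.append(np.concatenate([one_hot, PHYSCHEM[aa], context]))
    return np.stack(feats)   # shape: (sequence_length, 20 + 14 + embedding_dim)
```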
The final prediction is made via the model structure described in this paper; the complete workflow is shown in Fig. 1.
Methods
The preprocessed data are first passed through the Text-Embedding layer, which embeds the word-vector features and applies dropout with a rate of 0.2 for regularization. To enable the network to capture localized features, the input data are segmented into short sequences via a sliding window and then fed into the improved TCN layer.
(1) First, the word vectors are transformed into three different scales: 1, 9, and 81. These scaled vectors are then integrated into an MSTCN via a 1 × 1 convolutional neural network (1DConv). Next, the MSTCN output is reversed to generate forward and backward representations, which are processed through the same 1 × 1 convolutional network for feature fusion. Finally, the refined feature vectors are extracted via the improved TCN, which enhances the overall representation capability.
(2) For the BiRNN model, three bidirectional GRU layers and one bidirectional LSTM layer are used to capture global interactions within the protein sequences. Two dropout layers are added to stabilize the gradients during training.
(3) Finally, an MLP is used for classification. The MLP is a linear stack of two fully connected layers, each followed by an activation function and a dropout layer, with the final fully connected output producing the model prediction. Each fully connected layer is followed by a ReLU activation, which leaves positive inputs unchanged and maps negative inputs to 0. Its purpose is to provide a nonlinear transformation so that the model can fit complex data distributions; without ReLU, stacking multiple fully connected layers reduces to a single linear transformation, which cannot fully exploit the modeling capability of the neural network.
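A minimal PyTorch sketch of this classification head is given below; the hidden width of 1024 and dropout of 0.1 follow the MLP section later in the paper, while the input dimension of 768 (matching the improved TCN output) and the intermediate compression are assumptions.

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Sketch of the MLP classifier: fully connected layers, each followed by ReLU and dropout."""
    def __init__(self, in_dim=768, hidden_dim=1024, num_classes=8, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),            # expand
            nn.Linear(hidden_dim, hidden_dim // 2), nn.ReLU(), nn.Dropout(dropout),   # compress
            nn.Linear(hidden_dim // 2, num_classes),                                  # output layer
        )

    def forward(self, x):          # x: (batch, seq_len, in_dim)
        return self.net(x)         # per-residue class logits: (batch, seq_len, num_classes)

logits = MLPHead()(torch.randn(32, 50, 768))   # -> (32, 50, 8)
```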
To improve feature learning, the model uses distillation learning with ProtT5-XL-UniRef, one of the most effective protein language models, as the teacher model. Given ProtT5-XL-UniRef’s richer data and features, this approach aims to enhance the overall model performance.
The specific model structure is shown in Fig. 2.
Improved TCN
This paper adopts a multiscale and bidirectional processing model because a single TCN extracts only unidirectional features and does not fully utilize spatial information or the relationships between preceding and succeeding elements. Previous studies have explored multiscale spatial feature extraction and bidirectional TCNs to improve TCN performance in various fields21, but none of these approaches capture information comprehensively. In addition, learning structural features of proteins is particularly important for predicting their 3D or secondary structure. We therefore use the improved TCN22 model, which first extracts spatial features at multiple scales and then extracts temporal dependencies bidirectionally. This ensures that the model already has an in-depth understanding of the multidimensional features of each part before considering the temporal dependence of the sequence.
In the protein secondary structure prediction task, the model first receives the preprocessed amino acid sequence input and encodes it into the corresponding feature vector representation. The input is a batch of 32 samples, each containing 50 word IDs. The TextEmbedding module first maps each word ID into a 300-dimensional space to form a (32, 50, 300) tensor, which is then processed by a dropout layer with a probability of 0.3, leaving the dimensions unchanged.
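A minimal PyTorch sketch of this embedding stage is shown below; the batch size, sequence length, 300-dimensional embedding and dropout probability of 0.3 follow the description above, while the vocabulary size of 25 is an assumed placeholder.

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Sketch of the embedding stage: word-ID lookup followed by dropout."""
    def __init__(self, vocab_size=25, embed_dim=300, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # maps each word ID to a 300-dim vector
        self.drop = nn.Dropout(dropout)                    # p = 0.3, dimensions unchanged

    def forward(self, word_ids):                 # word_ids: (batch, seq_len) = (32, 50)
        return self.drop(self.embed(word_ids))   # -> (32, 50, 300)

x = TextEmbedding()(torch.randint(0, 25, (32, 50)))   # (32, 50, 300) tensor
```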
The model then extracts spatial features through the improved multi-scale TCN module to capture structural information at different scales (1, 9, 81), making full use of the local and global spatial correlations in the sequence. Next, the bidirectional TCN module extracts time-dependent information in the forward and backward directions to model the contextual relationships between amino acid residues and enhance the model’s perception of long-range dependencies in the sequence. The forward branch passes through three residual blocks and average pooling, and the reverse branch performs the same operations on the reversed sequence to produce symmetric features. After the two are combined along the channel dimension to form 512-channel features, they are projected to 768 channels through a 1 × 1 convolution, and finally a high-level representation aligned with the original sequence is output.
After the multi-scale spatial features and bidirectional temporal features are obtained, the model fuses this information, classifies each amino acid position through a fully connected layer, and finally outputs the corresponding secondary structure label (such as α-helix, β-strand or random coil). Throughout this process, the model achieves in-depth feature extraction at both the structural and temporal levels, effectively improving prediction accuracy and sequence modeling capability.
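The following is a simplified PyTorch sketch of the multi-scale bidirectional TCN stage described above. The scales (1, 9, 81), the 512-channel bidirectional fusion and the 1 × 1 projection to 768 channels follow the text; realizing the scales as dilation rates, the 256-channel branch width, and the omission of the residual blocks and pooling are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBiTCN(nn.Module):
    """Simplified sketch: multi-scale convolutions, forward/reversed branches, 1x1 fusion."""
    def __init__(self, in_dim=300, channels=256, scales=(1, 9, 81), out_dim=768):
        super().__init__()
        # One dilated conv per scale; padding = dilation keeps the sequence length unchanged.
        self.scale_convs = nn.ModuleList(
            [nn.Conv1d(in_dim, channels, kernel_size=3, dilation=d, padding=d) for d in scales]
        )
        self.merge = nn.Conv1d(channels * len(scales), channels, kernel_size=1)   # 1x1 scale fusion
        self.project = nn.Conv1d(2 * channels, out_dim, kernel_size=1)            # 512 -> 768

    def branch(self, x):                               # x: (batch, in_dim, seq_len)
        multi = torch.cat([torch.relu(conv(x)) for conv in self.scale_convs], dim=1)
        return self.merge(multi)                        # (batch, channels, seq_len)

    def forward(self, x):                               # x: (batch, seq_len, in_dim)
        x = x.transpose(1, 2)                           # Conv1d expects (batch, channels, length)
        fwd = self.branch(x)                            # forward branch
        bwd = self.branch(x.flip(-1)).flip(-1)          # reversed branch, flipped back for alignment
        fused = torch.cat([fwd, bwd], dim=1)            # 512 channels after combining both branches
        return self.project(fused).transpose(1, 2)      # (batch, seq_len, 768)

features = MultiScaleBiTCN()(torch.randn(32, 50, 300))   # -> (32, 50, 768)
```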
BiRNN
After the improved multi-scale bidirectional TCN module extracts spatial and temporal features, the model passes the extracted multi-dimensional features as sequence inputs to the BiRNN structure for further temporal modeling. Specifically, the model inputs the output of the TCN into the bidirectional GRU and bidirectional LSTM.
Both the GRU and LSTM are recurrent neural networks (RNNs). In this paper, both models adopt a bidirectional processing method; that is, the bidirectional GRU (BiGRU) and bidirectional LSTM (BiLSTM) both combine forward and backward GRU or LSTM structures.
Among them, the BiGRU is extended from the standard GRU, and the GRU mainly consists of an update gate and reset gate. At time t, the GRU determines the update method of the current hidden state through the gating mechanism, where the update gate controls the degree of retention of the state information of the previous moment in the current state, and the reset gate determines how much past information is forgotten. The bidirectional GRU combines a forward GRU and a backward GRU. The forward GRU is mainly used to capture the historical information of the input data, whereas the backward GRU can obtain the future information of the input data so that the model can simultaneously consider the forward and backward dependencies of the time series, thereby improving the ability of feature extraction.
The BiLSTM adopts a similar bidirectional processing strategy. The standard LSTM consists of a forget gate, an input gate, and an output gate. The forget gate determines whether past information is retained or discarded, the input gate controls the update of current information, and the output gate determines the hidden state output at the current moment. The BiLSTM combines forward and backward LSTMs to capture the global information of the sequence more fully and to enhance the model’s ability to understand time series data.
The protein primary structure sequence first enters the bidirectional GRU layer, which generates 512-dimensional primary time series features for each step and passes them as input to the bidirectional LSTM layer for deep semantic fusion. The LSTM further captures long-distance dependencies by precisely regulating the cell state while retaining the 512-dimensional bidirectional encoding of each time step (the two hidden layers have the same dimensions). Finally, through the same unpacking and reversing process, the output is a mixed feature tensor aligned with the input sequence. In addition, to integrate the feature outputs of the BiGRU and BiLSTM, this paper uses a 1 × 1 convolutional neural network (1DConv) to fuse the two results and further enhance the model’s feature expression capability. This fusion fully exploits the advantages of the different RNN variants, making the model more adaptable and generalizable when processing time series tasks.
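A minimal PyTorch sketch of this BiRNN stage is given below; the three bidirectional GRU layers, the single bidirectional LSTM layer, the 512-dimensional bidirectional outputs and the 1 × 1 convolutional fusion follow the text, while the input dimension of 768 and the hidden size of 256 per direction are inferred assumptions, and sequence packing and dropout are omitted.

```python
import torch
import torch.nn as nn

class BiRNNBlock(nn.Module):
    """Sketch of the BiRNN stage: BiGRU, then BiLSTM, fused with a 1x1 Conv1d."""
    def __init__(self, in_dim=768, hidden=256):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, num_layers=3, bidirectional=True, batch_first=True)
        self.bilstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.fuse = nn.Conv1d(4 * hidden, 2 * hidden, kernel_size=1)   # fuse BiGRU + BiLSTM outputs

    def forward(self, x):                        # x: (batch, seq_len, in_dim)
        gru_out, _ = self.bigru(x)               # (batch, seq_len, 512)
        lstm_out, _ = self.bilstm(gru_out)       # (batch, seq_len, 512)
        both = torch.cat([gru_out, lstm_out], dim=-1).transpose(1, 2)   # (batch, 1024, seq_len)
        return self.fuse(both).transpose(1, 2)   # (batch, seq_len, 512)

out = BiRNNBlock()(torch.randn(32, 50, 768))     # -> (32, 50, 512)
```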
MLP
After the BiGRU and BiLSTM extract time series features and fuse them through a 1 × 1 convolution, the model obtains a more expressive feature representation. These fused time series features are input into the MLP for final structural classification and prediction.
An MLP is a supervised deep learning model that maps input data to output predictions. The MLP class in this paper includes key steps such as initializing the network parameters, constructing hidden layers, and defining forward propagation. The initialization phase sets the input size, output size, hidden layer sizes, and other parameters. The first fully connected layer linearly expands the features of each position and then introduces nonlinearity through the ReLU activation function; the second fully connected layer further compresses the feature space. Hidden layers are constructed from a hidden-size list (1024), each followed by an activation function, and overfitting is suppressed with dropout layers using a probability of 0.1. Finally, the output layer completes the MLP architecture. During forward propagation, the input data pass through the hidden layers, activation functions and dropout, and the prediction results are obtained through the output layer.
In the forward propagation process, the features output by the BiRNN are processed through hidden layers, activation functions and dropout in turn; finally, the probability distribution of each structural category at each sequence position is obtained through the Softmax function. In this way, the model can not only capture the structural and temporal characteristics of the input sequence but also accurately map these features to corresponding prediction results, thereby achieving accurate recognition of protein secondary structures.
Knowledge distillation
Knowledge distillation aims to transfer knowledge from a large teacher model to a smaller student model, allowing the student to process data comparably to, and more efficiently than, the teacher. The distillation procedure in this paper follows the method of Zhao et al.22. The teacher model is ProtT5, and the student model is the combined improved TCN-BiRNN-MLP model of Sections 2.3.1–2.3.3. The student model is trained on softened softmax outputs from the teacher model, and KL divergence is chosen as the distillation loss.
In this work, we adopt a pretrained teacher model based on the ProtT5-XL-UniRef50 checkpoint, which was publicly released by Rostlab on the Hugging Face Hub as the file prot_t5_xl_uniref50.pt. This model was originally introduced in the bio_embeddings publication and is designed for protein sequence representation learning.
The ProtT5-XL-UniRef50 model is based on the T5-3B encoder-decoder architecture and was pretrained in a self-supervised manner via a masked language modeling (MLM) objective. Specifically, 15% of the amino acids in the input sequences are randomly masked, and the model is trained to reconstruct them. Pretraining was conducted on the UniRef50 dataset, which contains approximately 45 million protein sequences.
Unlike the original T5 model, which employs span-based denoising, this version uses a BART-style MLM objective. The model contains approximately 3 billion parameters and was trained on a TPU Pod (V2-256) for a total of 991.5k steps. During pretraining, rare amino acids were replaced with ‘X’, and input sequences were tokenized with a vocabulary size of 21. The resulting embeddings have been shown to capture important biophysical properties relevant to protein structure.
We used this pretrained model as a teacher to distill structural knowledge into our downstream model.
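For reference, per-residue ProtT5 embeddings of the kind used here are typically obtained as in the sketch below, which follows the standard usage of the Rostlab checkpoint through the Hugging Face transformers library; the paper’s own pipeline (e.g., via bio_embeddings and the prot_t5_xl_uniref50.pt file) may differ in detail.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

sequence = "MKTAYIAKQR"
# ProtT5 expects space-separated residues, with rare amino acids mapped to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (1, len + 1, 1024), includes </s>
embeddings = hidden[0, : len(sequence)]               # per-residue 1024-dim teacher features
```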
ProtT5’s outputs provide rich contextualized representations learned from massive protein sequence corpora, capturing structural and evolutionary information that traditional one-hot or handcrafted features lack. By using these outputs as soft targets in knowledge distillation, the student model can learn not only the correct labels but also the nuanced inter-class relationships encoded in ProtT5’s predictions. This facilitates smoother convergence and improved generalizability, especially on small or imbalanced datasets. Further analysis of how these embeddings guide the learning process, such as attention visualization or layer-wise ablation, could further strengthen the interpretability of the model and the explanation of its performance.
During training, KL divergence is used as the loss function to measure the difference between the predicted distribution of the student model and the output distribution of the teacher model. The labels used in the knowledge distillation process are obtained from the outputs of ProtT5, a pretrained model for protein sequences. In the distillation setup, the ProtT5 model is used to generate embeddings for each protein sequence. These embeddings are then passed through the teacher model (a CNN in our case), which predicts the structural labels for each sequence. These predicted labels are treated as soft targets and used to train the student model. Specifically, the student model is trained to minimize the loss between its predictions and the soft labels provided by the teacher model, which are derived from the ProtT5 outputs.
By minimizing the divergence, the student model can better fit the knowledge distribution of the teacher model. Under such a training mechanism, the student model can not only learn basic classification information from the real label but also obtain its judgment “tendency” between different categories from the teacher model, thereby enhancing the generalization ability and performance of the model. This distillation strategy effectively improves the accuracy and stability of the model in the task of protein secondary structure prediction.
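The sketch below illustrates the distillation objective described above: the KL divergence between the student’s and the teacher’s softened distributions, blended with the ordinary cross-entropy on the true labels via the α coefficient (α = 0.3 per the sensitivity analysis); the temperature value and the exact weighting scheme are assumptions, as the paper does not report them explicitly.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, alpha=0.3, T=2.0):
    """KD loss: student_logits/teacher_logits are (num_residues, num_classes) per-residue
    logits; true_labels are (num_residues,) class indices."""
    # Soft targets: temperature-scaled teacher distribution vs. student log-distribution.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, true_labels)
    return alpha * kd + (1.0 - alpha) * ce

loss = distillation_loss(torch.randn(1600, 8), torch.randn(1600, 8), torch.randint(0, 8, (1600,)))
```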
Experiment
Evaluation metrics and parameter settings
In this work, Q8 accuracy and Q3 accuracy22 are used to measure the goodness of fit of the model. The 8-state secondary structures are H, G, I, E, B, S, T, and C, and the 3-state secondary structures are H, E, and C. The Q3 and Q8 accuracies are the ratios of the number of correctly predicted residues to the total number of residues S, as defined in Eqs. (1)-(2):
$$Q_{3}=\frac{S_{C}+S_{E}+S_{H}}{S}\times 100$$
(1)
$$Q_{8}=\frac{S_{H}+S_{G}+S_{I}+S_{E}+S_{B}+S_{C}+S_{T}+S_{S}}{S}\times 100$$
(2)
where Si (i ∈ {H, E, C} or {H, G, I, E, B, C, T, S}) denotes the number of residues of type i that are correctly predicted. The tools used in this study are Python 3.10 and a GPU with approximately 40 GB of memory. To clarify the sliding window size and decomposition settings of each model, the relevant parameter settings are listed in Table 1.
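As a worked illustration of Eqs. (1)-(2), the helper below computes Q3 or Q8 accuracy as the percentage of residues whose predicted state matches the observed state; the function name and example strings are illustrative.

```python
def q_accuracy(predicted: str, observed: str) -> float:
    """Q3/Q8 accuracy: percentage of residues whose predicted state matches the observed state."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

print(q_accuracy("HHHEECCC", "HHHEEECC"))   # 7 of 8 residues correct -> 87.5
```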
Results
To evaluate the validity of the model, experiments were conducted using three datasets: TS115, CB513, and PDB. TS115 and CB513 are classic datasets used to assess the model’s fit. To validate the structure and its combinations, multiple modeling approaches were compared. The baseline model uses a TCN-GRU with a linear layer for classification. Additionally, the BiTCN-BiGRU-MLP and improved TCN-BiRNN-MLP models were applied to the three datasets to assess their effectiveness. The results are presented in Table 2, with bold text indicating the best performance and underlined text indicating the second-best performance; the same convention is used for all tables in this section.
Table 2 shows that the distilled model generally performs better than the model without distillation, and performance improves as the model structure becomes more complex, with each configuration on average 1–2% better than the preceding one; at the same time, all of the models outperform the baseline.
For three datasets (TS115, CB513, PDB), the improved TCN-BiRNN-MLP model yields higher accuracy than do the traditional TCN-GRU and BiTCN-BiGRU-MLP models. For the TS115 dataset, the Q3 of the improved model reached 91.0%, which was an increase of 0.1% and 1.6% compared with 90.9% for the TCN-GRU and 89.4% for the BiTCN-BiGRU-MLP. For the Q8 task, the performance of the improved model was 88.3%, which was 2.0% and 0.7% higher than that of the TCN-GRU (86.3%) and BiTCN-BiGRU-MLP (87.6%) models, respectively.
By incorporating knowledge distillation, the performance of the model was further enhanced across all datasets. For example, the distillation-improved TCN-BiRNN-MLP model achieves 91.1% accuracy on the Q3 task of the TS115 dataset, a 0.2% improvement over the TCN-GRU model’s 90.9%, and 88.6% accuracy on the Q8 task, a 2.3% improvement over the TCN-GRU model’s 86.3%. For the PDB dataset, the distillation-improved TCN-BiRNN-MLP model reached 97.0% accuracy on the Q3 task and 95.5% accuracy on the Q8 task, both the highest among all the evaluated models. Similar gains were observed on the CB513 dataset, where the distillation-improved TCN-BiRNN-MLP model achieved 89.4% on the Q3 task and 86.1% on the Q8 task.
Paired t-test results confirm that these improvements are statistically significant (p < 0.05) across all datasets and both the Q3 and the Q8 metrics, indicating that the performance gains are unlikely to be due to random variation. Notably, the distillation-improved TCN-BiRNN-MLP model consistently outperformed its non-distilled counterpart and all other benchmark models, particularly in the more complex Q8 task, achieving 88.6% on TS115, 86.1% on CB513, and 95.5% on PDB — all statistically significant improvements over the best non-distilled baseline. These findings demonstrate that the integration of knowledge distillation not only yields consistent accuracy gains but also leads to statistically robust performance improvements in both the Q3 and the Q8 secondary structure prediction tasks.
Ablation study
To further assess the effectiveness of the model, ablation experiments were conducted, as shown in Fig. 3(a-b), which illustrate the performance of the BiRNN, BiRNN-MLP, improved TCN-BiRNN-MLP, and distillation-improved TCN-BiRNN-MLP on different datasets for three-state (Q3) and eight-state (Q8) prediction. The results of the ablation study indicate that each additional module enhances the model’s performance, supporting the validity of the model architecture.
Ablation study: (a) comparison of ablation results on different datasets for three-state (Q3) structures; (b) comparison of ablation results on different datasets for eight-state (Q8) structures.
Moreover, to better understand the complementary effects of each functional module, the embedded representations learned by the four models were visualized via t-distributed stochastic neighbor embedding (t-SNE) after dimensionality reduction, as shown in Fig. 4(a-d). The meaning of the points in the t-SNE diagrams is indicated by the legend in the upper right corner, which shows the three states (C, H, E), where C (blue) represents coil, H (pink) represents helix, and E (cyan) represents β-strand.
Ablation study: (a–d) t-SNE visualization of the embedding features learned by the BiRNN, BiRNN-MLP, improved TCN-BiRNN-MLP and distillation-improved TCN-BiRNN-MLP models.
From the t-SNE plots, we observe that although some local clusters form, the three secondary structure categories (C, H, E) are not completely separable in the embedded 2D space. This suggests that the current feature representation captures some structural differences but may not fully disentangle all classes, especially C and H. The findings also show that both three-state and eight-state predictions improved as model complexity increased. Additionally, the t-SNE visualizations reveal that the BiRNN model alone did not capture features effectively; as model complexity increased, the model became more efficient at extracting meaningful features from protein amino acid sequences, leading to more effective modeling results.
Comparison
(1) Comparison with advanced algorithms.
To evaluate the performance of the proposed model, it is compared with several state-of-the-art algorithms, including CNN-LSTM24, Transformer13, MLPRNN16 and GAN-BiRNN25. These advanced algorithms were integrated into the framework of the model presented in this paper with a consistent data preprocessing method: one-hot encoding and physicochemical properties serve as input features for the traditional models, whereas Word2Vec is employed separately to learn distributed representations of amino acid sequences from their contextual relationships without relying on one-hot or physicochemical encoding.
These methods all use the data preprocessing and output structure of this article but only replace the model prediction structure. The model structures are introduced separately:
(a) CNN-LSTM.
The CNN-LSTM in this article adopts parts of the referenced design. First, a CNN model is designed that contains two convolutional layers, a pooling layer, and a ReLU activation layer; the features from the activation layer are fed into a softmax classifier to obtain the first probability output. Then, an LSTM model is designed that contains a sequence layer and a last layer; the features from the last layer are fed into a linear layer to obtain the second probability output.
(b) Transformer.
The transformer in this paper adopts parts of the referenced design. A transformer with an attention mechanism, which has been successful in natural language translation, is applied to detect the relevant sequence context of each amino acid position in the entire protein sequence in order to predict its secondary structure type. This deep learning architecture helps capture the long-range interactions between amino acids (analogous to words in English sentences) that are relevant to secondary structure prediction. The encoder maps the symbolic word-vector representation of the protein to a sequence of continuous values, and the decoder then generates the output sequence of secondary structure labels autoregressively from these internal features.
(c) MLPRNN.
The MLPRNN in this paper adopts parts of the referenced design. The MLPRNN consists of a BiGRU and two MLP blocks: two stacked BiGRU blocks are flanked by two MLP blocks. Both MLP blocks have one hidden layer. The dimensions of the hidden layer and output layer of the first MLP block are 256 and 512, respectively. The BiGRU block receives the 512-dimensional output of the first MLP block and is followed by the second MLP block, which also has one hidden layer of dimension 256. Finally, a softmax unit applied to the output of the second MLP block produces the prediction.
(d) GAN-BiRNN.
The GAN-BiRNN in this paper adopts parts of the referenced design, mainly combining a GAN with the BiRNN model of this paper. The generator is used to generate a “fake” secondary structure, and the discriminator judges the authenticity of a secondary structure: when the discriminator’s input is a structure produced by the generator, it should be judged “fake”; when a real secondary structure is input, it should be judged “true”. The error in the discriminator’s judgment is then computed via the loss function.
The results are shown in Fig. 5.
Comparison: (a) comparison with other methods on different datasets for three-state (Q3) structures; (b) comparison with other methods on different datasets for eight-state (Q8) structures.
(2) Comparison with other methods.
To further validate the model and its structure, this study utilizes not only PDB data but also two widely recognized datasets, TS115 and CB513. A comparison with the literature is conducted by reviewing the Q8 and Q3 accuracies reported on these datasets for other methods relative to the optimized model presented in this paper.
(a) DBN-CABS.
DBN-CABS uses the energy feature representation derived from the C-Alpha, C-Beta side chain (CABS) algorithm instead of the traditional PSSM (position-specific scoring matrix) feature. The model adopts a single-layer restricted Boltzmann machine (RBM) architecture to capture the energy relationship between residues in the protein sequence through the energy minimization principle (Boltzmann equation) to predict the secondary structure.
(b) HRNN.
The HRNN adopts a two-dimensional RNN framework and integrates five different types of recurrent network structures (GRU, LSTM, bidirectional RNN, BiGRU and BiLSTM) into a unified model. By combining the PSSM feature input of the protein sequence, it effectively extracts its temporal and spatial contextual information, thereby improving the accuracy of secondary structure prediction.
(c) CRRNN.
The Convolutional-Residual-Recurrent Neural Network (CRRNN) is a deep architecture that includes a CNN, a residual network (ResNet) and a Bi-GRU. Its structure includes a local block for extracting local sequence features (including two layers of the 1D-CNN and the original input), three stacked BiGRU layers (with two residual connections), and a fully connected layer and a Softmax output layer, which can achieve simultaneous prediction of protein Q8 and Q3 secondary structures. This design combines local convolution and long-range dependency modeling capabilities to effectively improve the prediction accuracy.
(d) ILMCNet.
ILMCNet is a deep neural network model for protein secondary structure prediction. The model combines the embedding representation containing evolutionary information generated by the protein language model (ProtTrans) and the position information obtained by sine-cosine position encoding to form rich input features. The feature enhancement module uses a multi-layer Transformer encoder to further extract contextual information through the self-attention mechanism. The feature extraction module integrates CNN and BiLSTM to capture local features and long-distance dependencies, respectively. Finally, the classification prediction module generates an initial score through a fully connected layer, uses a conditional random field (CRF) to model the dependencies between secondary structure labels, and decodes the optimal label sequence through the Viterbi algorithm.
(e) AttSec.
AttSec is a protein secondary structure prediction model based on the Transformer architecture. It extracts pairwise features between amino acid embeddings through a self-attention mechanism and uses 2D convolutional blocks to capture local patterns in these features. AttSec uses only protein embeddings generated by a pre-trained language model (ProtT5-XL-U50) as input and outperforms existing methods on multiple benchmark datasets, especially for protein predictions that lack homologous sequences.
The results are shown in Table 3. Some of the compared studies do not report results on the TS115 dataset, so those entries are omitted here.
The results presented in Table 3 demonstrate the effectiveness of the distillation-improved TCN-BiRNN-MLP model in protein secondary structure prediction. Compared with other state-of-the-art models, this approach achieves superior performance on the TS115 and CB513 datasets. Specifically, the model reaches an impressive 91.1% for Q3 and 88.6% for Q8 on the TS115 dataset and 90.4% for Q3 and 86.1% for Q8 on the CB513 dataset.
Our model combines the TCN and BiRNN to extract deep temporal features and fully capture the long-range dependencies of protein sequences. Moreover, it introduces knowledge distillation technology to improve the generalization ability and prediction accuracy by learning from more complex teacher models. In addition, the MLP layer further optimizes feature representation and improves classification performance, making the model’s Q3 and Q8 prediction results on multiple datasets better than those of existing methods.
Sensitivity analysis
This paper introduces the distillation-improved TCN-BiRNN-MLP model, which integrates an improved TCN, a BiRNN, an MLP, and knowledge distillation. The improved TCN is composed of three key components: (1) the TCN model; (2) multimodal fusion; and (3) forward and backward propagation. To explore the relationship between performance and the number of scales, five different scale configurations, including {1, 2, 4, 8}, were tested with the goal of identifying the optimal multiscale setting. All experiments used the validation set to select the optimal scale configuration.
Additionally, the distillation model, which combines the teacher and student losses for backpropagation and optimization, employs an α coefficient. This coefficient was varied from 0.1 to 0.5 to determine the optimal value and assess the model’s sensitivity to it. For this analysis, the eight-state (Q8) data from the TS115 dataset were selected because they involve a larger number of prediction targets and a smaller dataset size, making them more responsive to parameter variations.
For the sensitivity analysis of the scales, the α coefficient is fixed at 0.3, as shown in Fig. 6(a). Performance improves as the scale increases, but the differences are small, and larger scales still carry a risk of overfitting. Therefore, an appropriate scale can be selected, and the scale is fixed at {1, 2, 4, 8}. The sensitivity analysis of the α coefficient is shown in Fig. 6(b). Because the teacher and student losses are combined, neither a larger nor a smaller α is necessarily better. The changes are not pronounced, but the best result is obtained at α = 0.3, after which performance declines. Therefore, α = 0.3 and the scale set {1, 2, 4, 8} are selected as the hyperparameters of the model.
Parameter sensitivity analysis: (a) parameter sensitivity analysis of the scales; (b) parameter sensitivity analysis of α in knowledge distillation.
The experimental results show that as the scale increases, the Q3/Q8 prediction accuracy improves, but the gains diminish and the risk of overfitting grows. To determine the optimal configuration, we conducted hyperparameter tuning on the validation set; therefore, {1, 2, 4, 8} is ultimately selected as the optimal scale configuration to balance accuracy and generalization ability.
For the knowledge distillation parameter α, a smaller α (such as 0.1) will cause the student model to be unable to fully learn the knowledge of the teacher model, affecting performance, whereas a larger α (such as 0.5) may make the student model overly dependent on the teacher model, limiting its own learning ability. Therefore, on the basis of the validation set performance, α = 0.3 is selected as the optimal value, considering both the stability and the generalization capability.
This analysis shows that an appropriate scale combination can effectively improve the feature extraction ability, and a reasonable α value can optimize the distillation effect, ensuring that the student model can effectively learn the knowledge of the teacher model while maintaining a certain degree of independence.
Conclusions
This paper presents an improved TCN-BiRNN-MLP model based on word2vec segmentation, which demonstrates promising accuracy in predicting protein secondary structures. Compared with other methods, this paper combines the natural language processing technique word2vec29 and knowledge distillation with the basic improved TCN-BiRNN-MLP model and demonstrates the effectiveness of the proposed model through ablation experiments, parameter sensitivity analysis and other experiments.
This paper addresses the problems that unidirectional TCNs cannot fully utilize spatial information, that the BiRNN model is limited in feature fusion, and that the MLP structure has insufficient expressive ability in protein sequence prediction tasks. To this end, an improved TCN-BiRNN-MLP model combined with knowledge distillation is proposed. The model uses an improved TCN for multiscale feature extraction and bidirectional sequence modeling, combines a BiGRU and a BiLSTM to enhance temporal feature learning, and performs classification through an MLP. In addition, ProtT5 is introduced as a teacher model for knowledge distillation to improve the generalization ability of the student model. The experimental results show that this method achieves excellent performance in protein sequence prediction tasks and can capture protein structure information more comprehensively.
Although the model has made progress in terms of accuracy and generalizability, it still has several limitations. For example, it has high computational complexity and a long training time, and its performance may be limited on small datasets. In addition, the impact of the knowledge distillation hyperparameters (such as the α value) on final performance needs further exploration. Future work may optimize the computational efficiency of the model, explore lighter network structures, and incorporate more external protein databases to improve the adaptability and stability of the model.
Statement
All methods were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved by a named institutional and/or licensing committee. Informed consent was obtained from all the subjects and/or their legal guardian(s). The School of Computer Science at Liaocheng University approved the study.






