A model for tobacco growing area classification based on time series features of thermogravimetric analysis

Biotechnology for Biofuels and Bioproducts volume 18, Article number: 90 (2025)

Abstract

Biomass is greatly influenced by geographic location, soil composition, environment, and climate, making the efficient and accurate identification of growing areas highly significant. This study proposes a classification model for tobacco growing areas based on time series features from thermogravimetric analysis (TGA). The model combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network to process the derivative thermogravimetric (DTG) data, exploiting their inherent time series properties and the continuous, dynamic relationship between temperatures to classify tobacco growing areas. Using 375 tobacco samples from ten different provinces, the CNN extracts local features, while the LSTM captures long-term dependencies in the DTG data. The dataset used in this study has a limited sample size, a wide variety of classes, and an imbalance in the number of samples across these classes. Despite these challenges, the model achieves 86.4% accuracy on the test set, significantly surpassing the performance of the traditional Support Vector Machine model, which only achieves 68.2% accuracy. Additionally, the model reveals key temperature ranges crucial for growing area classification, which are associated with the pyrolysis temperature ranges of volatile components, hemicellulose, cellulose, lignin, and CaCO3 in the tobacco. This model lays the groundwork for the future use of geographical labels to accurately represent tobacco’s style and quality, enabling more precise differentiation and improved quality control.

Introduction

Biomass, which includes various organic materials like plant residues, fecal matter, and biodegradable waste, is an abundant and renewable resource [1]. Pyrolysis is a thermochemical conversion process conducted in an oxygen-limited or oxygen-free environment. It transforms biomass into hydrocarbon biofuels, oxygenated additives, and petrochemical substitutes, enhancing energy efficiency, reducing waste volume, and aiding in pollution control [2, 3]. However, different categories of biomass, as well as the same category from different growing areas or positions on the stalk, differ markedly in the energy efficiency of biomass conversion, which in turn affects the quality of biofuels, waste volume reduction, and environmental pollution control [4,5,6,7]. Thermogravimetric analysis (TGA) is a commonly employed and effective technique for studying the pyrolysis process, particularly in characterizing biomass to identify its growing areas, composition, and other properties [8,9,10,11].

Similarly, as an agricultural crop with a wide variety of types, tobacco is highly influenced by geographic location, soil composition, and environmental conditions. Hence, the characteristics and quality of tobacco vary significantly across growing areas [12, 13]. These regional variations result in distinct differences in tobacco grade, flavor, and overall quality, all of which are crucial factors in determining its value. However, near the borders of two or more growing areas, tobacco may not exhibit clear differences based solely on visual or physical traits. This makes it essential to develop a reliable method for classifying tobacco by growing area, ensuring accurate differentiation and improving quality control. In addition, during raw material collection, handling, or transportation, human error or other factors may make it impossible to determine the origin of the tobacco. Currently, the evaluation of tobacco grades and other characteristics depends primarily on manual grading and sensory evaluation, which are highly subjective and difficult to quantify, making them inadequate to meet the growing demands of the tobacco industry. Accurately identifying the geographical origin of tobacco can help the tobacco industry make more informed decisions in quality control, blending, and product development. TGA monitors the mass loss of a sample in relation to temperature, which can be represented as the thermogravimetric (TG) curve [14], enabling rapid identification of its quality, composition, and growing area [8, 15,16,17,18,19]. Recently, the use of Machine Learning (ML) techniques in TGA has significantly advanced tobacco analysis. Studies have shown the powerful capabilities of ML in feature extraction and pattern recognition. Zhang et al. [20] used a support vector machine (SVM) model, enhanced by a genetic algorithm, to distinguish eight flavor types of tobacco with a prediction accuracy of 83.3%. Similarly, Yin et al. [8] developed an SVM model to classify tobacco from two distinct growing areas with an accuracy of 91.67%. Additionally, Wei et al. [18] developed an accurate tobacco pyrolysis model using the extremely randomized trees algorithm, which agreed closely with experimental results.

Though the combination of the first derivative of the TG (DTG) and ML has been widely applied in the tobacco field, existing models typically treat or process DTG data from different temperatures as independent and unrelated variables [8, 18, 20, 21]. This approach overlooks the fact that the data are taken at successive, equally spaced points in temperature and exhibit an inherent order and sequential relationship. In other words, measurements at certain temperature points cannot be skipped without affecting the validity of the data for subsequent temperatures. Traditional machine learning models, such as SVM [8], are unable to take advantage of such a relationship, leading to limitations in prediction accuracy. This study, for the first time in the tobacco field, considers the inherent time series properties of DTG data. During TGA, the mass loss rate reflects a continuous and dynamic relationship with temperature, which is linearly correlated with time. This study develops a model that integrates Convolutional Neural Networks (CNN) with Long Short-Term Memory (LSTM). The CNN effectively extracts local features from the DTG data, highlighting patterns such as peaks and trends in the time series [22], while the LSTM captures and models long-term dependencies within the time series data [23,24,25,26,27]. This model improves the accuracy of growing area classification and reveals the key temperature ranges critical for tobacco growing area classification, which are associated with the pyrolysis temperature ranges of substances within tobacco.

Materials and methods

Material preparation

All tobacco samples from ten different growing areas were provided by the Technology Center of China Tobacco Zhejiang Industrial Co., Ltd. (Hangzhou, China). The growing areas are Fujian (FJ), Yunnan (YN), Guizhou (GZ), Hunan (HA), Hubei (HB), Henan (HN), Sichuan (SC), Shandong (SD), Anhui (AH), and Guangxi (GX) provinces. The specific sample quantities across different provinces are provided in Table 1. Prior to analysis, the tobacco was finely crushed into a powder. The powder that passed through a 40-mesh sieve and was retained on a 60-mesh sieve was collected, ensuring that the tobacco particle size ranged from 0.250 to 0.425 mm. The screened powder was sealed and stored for further analysis.

Table 1 The specific quantity of samples across different provinces

Pyrolysis experiment

The TG data for all samples were obtained from a thermogravimetric analyzer (TA Instruments, Q500, Version 20.13, Build 39). In each test, the sample’s powder was placed in a platinum pan. The temperature–time curve controlled by the program is shown in Fig. 1. Specifically, the sample was first heated from room temperature to 100 °C at a rate of 30 °C/min and then held isothermally at 100 °C for 5 min to fully remove free water. After this dehydration process, the sample’s mass was recorded as m100. The sample was then heated to 800 °C at a rate of 10 °C/min. Throughout this heating stage, temperature was linearly related to time. Therefore, the temperature sequence can be treated as a special type of time series, with the mass of the samples at each temperature in the sequence being monitored and denoted as mT. All tests were conducted under consistent conditions, with nitrogen serving as the sample purge gas and the balance purge gas. The sample purge flow was held steady at 60.0 mL/min, while the balance purge flow was maintained at 40.0 mL/min.

Fig. 1 Schematic of the temperature–time program and the corresponding mass-time curve during a typical test

To normalize the data, the mass of the samples at different temperatures is expressed as a percentage of the mass after the dehydration process, m100. The TG is calculated using the following formula:

$$\text{TG}_{T}=\frac{m_{T}}{m_{100}}\times 100\%$$

(1)

For clarity and improved visual presentation, the average TG curves of each class are shown in Fig. 2. To minimize overlap, the TG curves are shifted along the y-axis. The differences between the average curves of different categories are minimal, suggesting that further analysis is needed to identify potential distinguishing features between them. In this study, the first derivative of the TG curve, referred to as the DTG curve, is used to calculate the mass loss rate of the samples. It helps reduce noise interference, highlights the specific temperature points where significant mass changes occur, and provides useful cues for detailed analysis [28].

Fig. 2 The average TG curves of each class

The methodology for data preprocessing

In TGA data acquisition, environmental factors, time-dependent factors, sampling, and sample preparation inevitably introduce noise and abnormal deviations, which can negatively affect prediction accuracy and model performance. Hence, preprocessing steps are crucial in classification models to ensure accuracy and robustness [29, 30]. This study performs a two-step preprocessing of the raw thermogravimetric data: first, the Savitzky-Golay (SG) smoothing technique is applied to smooth the data and calculate DTG [31]. Second, the isolation forest algorithm is used to identify abnormal samples [32, 33]. The SG smoothing algorithm works by fitting polynomials within a local window to smooth the data. As shown in Fig. 3, the DTG curve obtained after SG smoothing is smoother than the curve from direct differentiation. This method reduces high-frequency noise interference and highlights the key values of the pyrolysis process. Because several samples show significant fluctuations in the final heating stage that are hard to smooth, that stage is excluded and the temperature data are standardized to 6800 points, ranging from 100 to 780 °C with a 0.1 °C interval.
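
The SG-based differentiation described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the window half-width, the synthetic TG curve, and the function name `savgol_first_derivative` are assumptions made for illustration, and the closed form used here only covers local polynomials of degree 1 or 2 on evenly spaced data.

```python
def savgol_first_derivative(y, step, half_window):
    """First derivative of evenly spaced data via a Savitzky-Golay filter.

    For a symmetric window and a local polynomial of degree 1 or 2, the
    least-squares slope at the window centre has the closed form
    sum(k * y[i + k]) / (step * sum(k * k)), k = -half_window..half_window.
    Points closer than half_window to either end are returned as None.
    """
    offsets = range(-half_window, half_window + 1)
    norm = step * sum(k * k for k in offsets)
    out = [None] * len(y)
    for i in range(half_window, len(y) - half_window):
        out[i] = sum(k * y[i + k] for k in offsets) / norm
    return out


# Hypothetical TG curve (%), sampled every 0.1 °C: TG = 100 - 0.02 * (T - 100)
temps = [100 + 0.1 * j for j in range(50)]
tg = [100 - 0.02 * (t - 100) for t in temps]
dtg = savgol_first_derivative(tg, step=0.1, half_window=5)
```

On this straight-line TG curve the interior DTG values recover the slope of -0.02 %/°C. A wider window smooths more strongly but also flattens narrow peaks, so the window would need to be chosen against the real data.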

Fig. 3 Curve of TG, DTG, and SG-smoothed DTG of a typical test

This study aims to develop a time-series-based model for classifying the growing areas of representative tobacco samples, so all the samples were subjected to anomaly detection after smoothing. Isolation forest is an unsupervised anomaly detection method based on tree structures. Its main idea is to isolate abnormal samples by constructing multiple randomly partitioned trees [32]. During the construction of these trees, abnormal samples are typically easier to separate than normal points, requiring fewer steps in the tree partitioning process. As a result, abnormal samples tend to be isolated close to the root of the trees, while normal samples require more partitioning steps. The abnormal status of a sample is evaluated using an anomaly score, which is calculated as follows:

$$s(x,N)=2^{-\frac{E(h(x_{i}))}{c(m)}},\quad i=1,2,3,\ldots ,N$$

(2)

where h(xi) is the number of edges traversed by sample xi in an isolation tree, subscript i is the sample index, and the sample quantity is N. E(h(xi)) represents the average of h(xi) across all isolation trees. c(m) is the average path length of an unsuccessful search in a binary search tree built from m samples; it normalizes E(h(xi)) and depends only on m, the number of samples used to build each isolation tree.
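
As a worked illustration of Eq. (2), the following Python sketch evaluates the anomaly score for a given mean path length. The normalizer follows the standard isolation forest formulation, in which its argument is the number of samples used to build each tree; the subsample size of 256 and the function names are assumptions for illustration, not values reported in this study.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Mean path length of an unsuccessful binary-search-tree lookup among
    n points, used to normalise isolation-tree path lengths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Anomaly score 2 ** (-E(h(x)) / c(n)) from Eq. (2)."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


c = average_path_length(256)                # 256 is an assumed subsample size
typical = anomaly_score(c, 256)             # path equal to the average
isolated_early = anomaly_score(c / 4, 256)  # much shorter path
```

A sample whose mean path equals the normalizer scores exactly 0.5, while a sample isolated after a quarter of that depth scores about 0.84, which is why thresholds slightly above 0.5 separate unusual samples from typical ones.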

After removing the abnormal samples, the dataset is split into three sections: training, validation, and test sets, in an 8:1:1 ratio. Due to the imbalance in the sample sizes of different classes, stratified sampling is used to enhance the robustness of the model training. Stratified sampling ensures that the distribution of each class is consistent across all data subsets, thus mitigating bias caused by data imbalance [34]. Additionally, to minimize bias and enhance the stability of the model across different subsets, the dataset is randomly split during hyperparameter tuning, enabling a more comprehensive evaluation of the model parameters.
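
A stratified 8:1:1 split of this kind can be sketched in a few lines of Python. The class counts below are hypothetical (they are not the counts from Table 2), and the rounding rule for small classes is an assumption; the study's MATLAB implementation may handle remainders differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * ratios[1])
        n_test = round(len(indices) * ratios[2])
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])  # remainder goes to training
    return train, val, test


# Hypothetical imbalanced label list (not the counts from Table 2)
labels = ["YN"] * 50 + ["GZ"] * 30 + ["AH"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

Because each class is shuffled and cut on its own, every subset keeps roughly the same class proportions as the full dataset, which is the property the stratification is meant to guarantee.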

Modeling of CNN-LSTM

A hybrid model, which includes CNN and LSTM, is employed to classify the tobacco growing areas based on the time series features of the DTG curve. The CNN-LSTM model, shown in Fig. 4, mainly comprises three key components: CNN, LSTM, and Classification. All modeling and computations were performed in MATLAB (The MathWorks Inc., Natick, Massachusetts, USA).

Fig. 4 The CNN-LSTM model developed to classify the growing areas of tobacco

The CNN mainly consists of the input, convolution, pooling, Rectified Linear Unit (ReLU), and Flatten layers. Specifically designed to receive preprocessed data, the input layer serves as the initial stage in the model. Filters in the convolution layer are applied to the input data to extract local features, emphasizing patterns such as peaks and trends in the time series. The pooling layer decreases the spatial size of the features, preserving essential information while lowering computational complexity [22]. The ReLU layer introduces non-linearity to the network, allowing it to capture more intricate relationships in the data. Moreover, the Flatten layer reshapes the features into a one-dimensional vector, converting them into the format required by the LSTM layer.
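
To make the roles of these layers concrete, the sketch below applies a one-dimensional convolution, max pooling, and ReLU to a short sequence, in the order the layers are listed above. The kernel values, pooling size, and input sequence are hypothetical; in the trained model the kernels are learned, and the pooling type is not stated in the text, so max pooling is an assumption.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation, the operation used by CNN layers."""
    width = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def max_pool1d(signal, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


def relu(signal):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in signal]


# Hypothetical DTG fragment with one peak, and a hand-picked kernel that
# responds to rising edges (a trained network would learn its kernels)
dtg = [0.0, 0.1, 0.4, 0.9, 0.4, 0.1, 0.0, 0.0]
features = relu(max_pool1d(conv1d(dtg, [-1.0, 0.0, 1.0]), size=2))
```

The rising flank of the peak produces the only non-zero feature, which shows how a convolution kernel can respond to a local pattern while pooling shortens the sequence.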

The LSTM is adept at managing and modeling long-term dependencies in the time series [23]. As shown in Fig. 5, the most basic unit of the LSTM layer is commonly known as the memory cell. The cell takes the current time step data Xt as input, along with the hidden state ht-1 and the cell state Ct-1 from the previous time step. The cell, which includes the forget gate, input gate, and output gate, enables the LSTM to capture and model dynamic relationships within the time series data. The forget gate regulates how much of the previous cell state Ct-1 is retained or forgotten in the current time step, based on ht-1 and Xt. The input gate determines how much of the new candidate information derived from Xt is stored in the cell state Ct for the current time step. The output gate controls the output Yt and transmits the hidden state ht and the cell state Ct from the current time step to the next.

Fig. 5 Schematic illustration of the LSTM layer

The forget gate is calculated as follows:

$$f_{t}=\sigma (W_{f}\cdot [h_{t-1},X_{t}]+b_{f})$$

(3)

The input gate is calculated as follows:

$$i_{t}=\sigma (W_{i}\cdot [h_{t-1},X_{t}]+b_{i})$$

(4)

$$\widetilde{C}_{t}=\tanh (W_{C}\cdot [h_{t-1},X_{t}]+b_{C})$$

(5)

$$C_{t}=f_{t}\cdot C_{t-1}+i_{t}\cdot \widetilde{C}_{t}$$

(6)

The output gate is calculated as follows:

$$o_{t}=\sigma (W_{o}\cdot [h_{t-1},X_{t}]+b_{o})$$

(7)

$$Y_{t}=h_{t}=o_{t}\cdot \tanh (C_{t})$$

(8)

where W and b represent the weight matrix and bias parameters. The subscripts f, i, C, and o represent the forget gate, input gate, cell state, and output gate, respectively. The hyperbolic tangent function is represented by tanh, while the sigmoid function is denoted by σ.
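
Equations (3) to (8) can be traced step by step with a small Python sketch. To keep it short, the input, hidden state, and cell state are scalars, so each weight matrix reduces to a pair of numbers; the `params` layout and the function names are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for scalar input and scalar state, Eqs. (3)-(8).

    params maps each gate name ("f", "i", "C", "o") to (w_h, w_x, b), the
    weights applied to h_prev and x_t and the bias.
    """
    def pre(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))                 # forget gate, Eq. (3)
    i_t = sigmoid(pre("i"))                 # input gate, Eq. (4)
    c_tilde = math.tanh(pre("C"))           # candidate state, Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde      # cell state, Eq. (6)
    o_t = sigmoid(pre("o"))                 # output gate, Eq. (7)
    h_t = o_t * math.tanh(c_t)              # hidden state / output, Eq. (8)
    return h_t, c_t


# With all weights and biases zero, every gate equals 0.5 and the candidate is 0
zero = {gate: (0.0, 0.0, 0.0) for gate in "fiCo"}
h1, c1 = lstm_step(x_t=0.3, h_prev=0.0, c_prev=1.0, params=zero)
```

In this degenerate case the forget gate halves the previous cell state, so the new cell state is 0.5 and the output is 0.5·tanh(0.5). With trained weights the gates instead vary with temperature, which is how the cell decides what to carry forward along the DTG sequence.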

To prevent overfitting and improve generalization, a dropout layer is applied to the extracted features after the LSTM layer. The LSTM output is fed through the dropout layer, followed by the fully connected (FC) layer and the softmax (SM) layer, which generates the final classification results. In addition, mini-batch training is adopted to improve training efficiency and reduce overfitting, with the batches shuffled at the start of every epoch. The training process employs the Adam optimizer, which adaptively adjusts the learning rate for each parameter or weight to ensure stable and efficient convergence. The cross-entropy loss function is used to optimize the multi-class classification task effectively. Model performance is periodically evaluated on the validation set to fine-tune the parameters of the training process. Additionally, early stopping is employed to prevent overfitting by terminating training when the validation loss stops improving.
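
The early stopping rule mentioned above can be sketched as follows. The patience value and the validation losses are hypothetical, and the exact criterion used by the authors (for example, how often validation is run) is not specified in the text, so this is one common variant rather than their implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive evaluations; if that never happens, the
    last epoch index is returned.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: improvement stalls after the third value
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.65]
stop_at = early_stopping_epoch(losses, patience=3)
```

Here training stops at index 5, three evaluations after the best loss of 0.60 was first reached; a tie with the best value does not count as an improvement in this variant.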

The evaluation of the model

The evaluation of CNN component

The primary function of the CNN is to extract local features, so the Fisher Discriminant Ratio (FDR) is introduced to compare the separability of the extracted features with that of the initial features [35]. FDR is used as a measure of data separability in Linear Discriminant Analysis [36]. In this study, FDR is used to measure the discriminative ability of the features by calculating the ratio of between-class variance to within-class variance at each temperature point. Specifically, the within-class variance (σwithin) represents the consistency among samples within the same class, with a smaller value indicating that samples within the same class are more coincident. Conversely, the between-class variance (σbetween) measures the distinction between different classes, where a larger value suggests that the classes are more distinguishable. The formula of FDR is given as

$$\text{FDR}(T)=\frac{\sigma_{between}^{2}(T)}{\sigma_{within}^{2}(T)}=\frac{\sum_{i=1}^{C}N_{i}[\mu_{i}(T)-\mu (T)]^{2}}{\sum_{i=1}^{C}\sum_{x(T)\in w_{i}}[x(T)-\mu_{i}(T)]^{2}}$$

(9)

where C represents the number of classes, N is the total number of samples, Ni is the sample count of class wi, x(T) is the value of samples from class wi at temperature point T, and

$$\mu_{i}(T)=\frac{1}{N_{i}}\sum_{x(T)\in w_{i}}x(T)$$

(10)

is the mean of class wi at temperature point T, and

$$\mu (T)=\frac{1}{N}\sum_{i=1}^{N}x_{i}(T)$$

(11)

is the mean of all samples at temperature point T.

The FDR is an effective index for evaluating the discriminative ability of the feature across different classes. A higher FDR value indicates that the feature has better discriminative ability and contributes significantly to the classification of the model. Conversely, a lower FDR value suggests that the feature has a limited ability to classify, potentially requiring further optimization.
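
The FDR of Eqs. (9) to (11) is straightforward to compute for a single temperature point, as the following Python sketch shows. The two-class data are hypothetical and chosen only to contrast a well-separated feature with an overlapping one; the function name is an assumption.

```python
def fisher_discriminant_ratio(values, labels):
    """FDR of one feature (one temperature point) following Eqs. (9)-(11)."""
    overall_mean = sum(values) / len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    between = 0.0
    within = 0.0
    for members in groups.values():
        class_mean = sum(members) / len(members)
        between += len(members) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in members)
    return between / within


# Hypothetical DTG values at two temperature points for two classes
labels = ["A", "A", "A", "B", "B", "B"]
separated = fisher_discriminant_ratio([1.0, 1.1, 0.9, 2.0, 2.1, 1.9], labels)
overlapping = fisher_discriminant_ratio([1.0, 2.0, 1.5, 1.1, 1.9, 1.6], labels)
```

The first feature gives an FDR of 37.5 because the class means are far apart relative to the spread inside each class, whereas the second gives a value close to zero. Applying the function at every temperature point yields a curve like the one used to compare raw and CNN-processed data.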

The evaluation of LSTM component

To validate the effectiveness of LSTM in capturing long-term dependencies across the temperature series and modeling the inherent sequential relationships between temperatures, this study extracts the output data from the LSTM layer and applies t-distributed Stochastic Neighbor Embedding (t-SNE) [37]. t-SNE is an unsupervised, nonlinear dimensionality reduction technique commonly used to visualize high-dimensional data; here it projects the features of the data after LSTM onto a two-dimensional plane. The results are compared with the dimensionality reduction outcome of the data after CNN.

The evaluation of CNN-LSTM model

For comparison, an SVM model is established, replacing the LSTM component shown in Fig. 4. SVM is a classical ML model that can effectively handle high-dimensional feature data [8, 38]. The SVM model separates feature samples of different classes by optimizing the hyperplane, thereby achieving the classification task. In this study, a linear kernel function is selected to build the SVM model, since the number of features greatly exceeds the number of samples [39,40,41]. In the SVM model, the penalty parameter C significantly affects the classification performance [39], so a grid search method is employed to optimize C.

Results and discussion

Analysis of samples

Preprocessing data and splitting the dataset

As shown in Eq. (2), the more anomalous a sample is, the shorter its path and the higher its anomaly score. The histogram of anomaly scores is shown in Fig. 6. When the threshold was set at 0.6, fewer abnormal samples were detected, leading to a decline in classification performance. On the other hand, when the threshold was set at 0.5, many samples were labeled as abnormal samples, but there was no significant improvement in the model’s accuracy. After comparative screening, samples with anomaly scores greater than 0.55 are removed [42], resulting in a total of 8 samples being excluded from the dataset. This anomaly detection method provides more stable and reliable data inputs for subsequent modeling.

Fig. 6 The histogram of anomaly scores for all samples

Table 2 summarizes the distribution of 375 samples from 10 growing areas across different provinces and subsets after abnormal sample removal, providing an overview of the dataset used in this study. Tobacco leaves are consumed gradually and cannot be stored for extended periods. In addition, industrial enterprises typically source tobacco from various growing regions based on their preferences for raw materials, leading to an uneven distribution of tobacco from these areas. Furthermore, the research requires the selection of representative samples. This results in a dataset characterized by a small overall sample size, a broad variety of classes, and an imbalance in the number of samples across different classes. These characteristics present challenges for developing a robust classification model, as the limited sample size may reduce generalization ability, and the imbalance in sample distribution could lead to biased predictions favoring the class with the most samples.

Table 2 Distribution of samples across different provinces and subsets

The average SG-smoothed DTG curves are shown in Fig. 7. Compared to the average TG curves in Fig. 2, the DTG curves provide a more detailed reflection of the mass loss rate at specific temperature ranges, highlighting differences that are less evident in the TG curves. DTG can be interpreted as the pyrolysis of five types of substances at different temperatures: volatile substances, hemicellulose, cellulose, lignin, and CaCO3 [16, 43, 44]. As shown in Fig. 7, the average DTG curve of AH province is deconvoluted into five peaks. Specifically, at around 190 °C, the first peak represents the pyrolysis of volatile components. The peak near 275 °C represents the pyrolysis of hemicellulose, followed by another peak at approximately 320 °C for cellulose pyrolysis [44]. Lignin contributes to a significant peak around 430 °C, marking its primary mass loss [43]. A distinct valley is observed near 220 °C, resulting from the overlap of multiple pyrolysis processes. Additionally, the peak associated with the thermal decomposition of CaCO3 and other inorganic salts occurs at around 620 °C [43]. The temperatures corresponding to the peaks and valleys of tobacco from different growing areas are not entirely consistent, which is one of the reasons for considering the time series relationship in DTG data.

Fig. 7 The average SG-smoothed DTG curves for each class and peak deconvolution of the average DTG curve for AH province

Furthermore, this study calculates the Pearson correlation between different temperature points across all samples and presents it as a heatmap in Fig. 8, in which yellow regions mark temperature points with high positive correlations. The correlation levels divide into five temperature ranges, which correspond to the five substances mentioned earlier: volatile components, hemicellulose, cellulose, lignin, and CaCO3. During the pyrolysis of each substance, the correlation between nearby temperature points, even at intervals of up to 30 °C, remains consistently high, indicating that the mass loss rates at these temperatures are strongly interdependent. In contrast, the correlation between temperatures corresponding to different substances is relatively low. Figure 8 thus shows that during pyrolysis the mass loss rates at adjacent temperature points are not isolated but exhibit a continuous and dynamic relationship. Therefore, it is essential to incorporate this temperature dependence into modeling approaches, as it plays a critical role in capturing the inherent time-series properties of the data and in accurately identifying the key temperature regions that enhance classification performance.
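
The correlation matrix behind such a heatmap can be computed as in the following Python sketch, where each row is one sample and each column one temperature point. The 4 × 3 data matrix is hypothetical, and the function names are assumptions; the real input would have one column per temperature point of the DTG curve.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(samples):
    """samples[s][t] is the DTG value of sample s at temperature index t."""
    columns = list(zip(*samples))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Hypothetical DTG values: 4 samples x 3 temperature points
samples = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 1.0],
]
corr = correlation_matrix(samples)
```

The first two temperature points rise and fall together across samples and therefore correlate perfectly, while the third is negatively correlated with both. Blocks of high correlation along the diagonal of the full matrix are what the five temperature ranges in the heatmap correspond to.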

Fig. 8 Pearson correlation heatmap between different temperature points across all samples

Hyperparameters of the CNN-LSTM model

The hyperparameters of the CNN-LSTM model are selected to ensure efficient training and robust performance. Important hyperparameters, such as max epochs, mini-batch size, learning rate, loss function, optimizer type, and gradient decay factor, are fine-tuned during the training process. Additionally, techniques such as data shuffling, the piecewise learning rate schedule, and early stopping are employed. A detailed summary of the hyperparameters is provided in Table 3.

Table 3 Hyperparameters of the CNN-LSTM model

The cross-entropy loss listed in the table is widely used for classification tasks. Given the true label \(y_{i}\) and the predicted label \(\widehat{y}_{i}\) from the model, the loss function Loss is calculated as follows:

$$Loss=-\frac{1}{N}\sum_{i=1}^{N}y_{i}\log_{2}(\widehat{y}_{i})$$

(12)
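
A minimal Python sketch of Eq. (12) follows. It assumes one-hot true labels, so each sample contributes the negative logarithm of the probability assigned to its true class, and it uses the base-2 logarithm as printed in Eq. (12); many frameworks use the natural logarithm instead, which only rescales the loss by a constant factor. The class scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_classes, log=math.log2):
    """Mean of -log(probability assigned to the true class) over N samples."""
    losses = [-log(p[k]) for p, k in zip(probabilities, true_classes)]
    return sum(losses) / len(losses)


# Hypothetical scores for two samples over three classes
probs = [softmax([2.0, 0.0, 0.0]), softmax([0.0, 0.0, 0.0])]
loss = cross_entropy(probs, true_classes=[0, 1])
```

The second sample receives a uniform prediction and therefore contributes log2(3) to the loss, while the first, more confident and correct prediction contributes much less.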

The weight matrix and bias parameters θ of the CNN-LSTM model are updated using the Adam algorithm. In its basic gradient-descent form, the parameter update rule is as follows:

$$\theta_{t+1}=\theta_{t}-\eta \cdot \frac{\partial Loss}{\partial \theta_{t}}$$

(13)

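where \(\eta\) is the learning rate, \(\frac{\partial Loss}{\partial \theta_{t}}\) is the gradient of the Loss with respect to the parameters \(\theta_{t}\), and the subscript t denotes the epoch number. Adam additionally rescales each step using running estimates of the first and second moments of the gradient.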
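
For reference, one step of the standard Adam update can be sketched in Python as below. The decay factors and epsilon are the commonly used defaults and the toy quadratic loss is hypothetical; the values actually used in this study are those listed in Table 3, which are not reproduced here.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise the toy loss (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta, m, v = 0.0, 0.0, 0.0
first = None
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * (theta - 3.0), m, v, t, lr=0.01)
    if t == 1:
        first = theta
```

The first step moves the parameter by almost exactly the learning rate regardless of the gradient's magnitude, which illustrates the per-parameter rescaling that distinguishes Adam from the plain rule in Eq. (13).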

Comprehensive evaluation

The evaluation of the CNN component

To evaluate the role of the CNN and identify which temperature points contribute more to the classification, FDR is calculated for both the raw data and the data after CNN, as shown in Fig. 9. The FDR of the data after CNN is improved at the majority of temperature points compared to the raw data, with significant enhancements at several key temperature ranges mentioned in section “Preprocessing data and splitting the dataset”. These temperature ranges correspond to the pyrolysis of hemicellulose, cellulose, lignin, volatile components, and CaCO3, respectively. Furthermore, the valley observed at 220 °C in Fig. 3 also contributes substantially to the classification.

Fig. 9 Calculated FDR of the raw data and the data after CNN at different temperatures

The evaluation of the CNN-LSTM model

As described in section “The evaluation of CNN-LSTM model”, an SVM model is established, and a grid search method is employed to optimize the parameter C. The results are presented in their logarithmic form in Fig. 10. The training set quickly achieves 100% accuracy as the value of C increases, while the validation set accuracy initially improves and then decreases. The optimal parameter is determined to be lg(C) = 2.9 based on the best validation set accuracy, which is 78.1%. However, the trained CNN-SVM model achieves an accuracy of only 68.2% on the test set, as shown in the classification confusion matrix of the CNN-SVM model on subsets in Fig. 11. This indicates that the model suffers from overfitting and classifies the test set poorly.

Fig. 10 The effect of parameter C on the training and validation accuracy in the SVM model

Fig. 11 Confusion matrix for growing areas classification results of the CNN-SVM model on subsets

This study employs LSTM to capture the dynamic relationships and time-series properties of the DTG data, with the model manually fine-tuned. The training process optimization is shown in Fig. 12. Due to the use of mini-batch training, every epoch consists of 5 iterations. After 958 epochs, the validation loss met the early stopping criterion, and the training process was terminated.

Fig. 12 Training and validation accuracy and loss curves during the CNN-LSTM model iterative process

As shown in Fig. 13, the features of the data after LSTM and CNN are projected onto a two-dimensional plane by the t-SNE. The data after LSTM achieves clearer regional separations and exhibits better clustering performance while significantly reducing dimensionality from about 6800 to only 256 features. In other words, the LSTM reduces the computational complexity for the subsequent classification component while improving classification performance by capturing long-term dependencies across the temperature series, as it has already shown strong effectiveness in an unsupervised setting.

Fig. 13 t-SNE nonlinear dimensionality reduction results of data after CNN and LSTM

The classification confusion matrix for the subsets is shown in Fig. 14. The model achieves an accuracy of 96.2% on the training set and 92.7% on the validation set, with a final classification accuracy of 86.4% on the test set. Errors typically occur in the classes with smaller sample sizes, and tobacco from the GZ province growing area tends to be misclassified as originating from SC province across all three subsets. Guizhou and Sichuan are neighboring provinces located in the Yunnan-Guizhou Plateau of southwest China, sharing similar geographic and climatic characteristics. Therefore, this misclassification is both acceptable and understandable.
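
Accuracy figures of this kind follow directly from a confusion matrix, as the short Python sketch below illustrates. The 3-class matrix is hypothetical and is not taken from Fig. 14; it only shows how overall accuracy and per-class recall expose errors concentrated in smaller classes.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified samples (diagonal) over all samples."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[k] / sum(row) for k, row in enumerate(confusion)]


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted);
# these counts are illustrative and are not taken from Fig. 14
confusion = [
    [8, 0, 0],
    [0, 3, 2],
    [0, 1, 6],
]
overall = accuracy(confusion)
recalls = per_class_recall(confusion)
```

In this example the overall accuracy is 85%, but the smallest class is recovered only 60% of the time, so reporting per-class recall alongside accuracy is useful for imbalanced datasets.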

Fig. 14 Confusion matrix for growing areas classification results of the CNN-LSTM model on subsets

The CNN-LSTM model achieves significantly higher accuracy compared to the SVM, demonstrating its ability to effectively capture the dynamic relationships and time-series properties of the DTG data and thereby enhance classification accuracy.

Conclusion

This study successfully develops a classification model for tobacco growing areas based on time series features from TGA. The hybrid CNN-LSTM model significantly improves classification accuracy compared to the SVM model. This study treats DTG as time series data for the first time in the tobacco industry, effectively capturing the dynamic relationships and long-term dependencies within the data. By using time series data, the model captures the continuous and dynamic relationship between temperatures, enabling the classification of tobacco growing areas. Despite the dataset’s small sample size, wide variety of classes, and imbalance in the number of samples across different classes, the model achieves an accuracy of 86.4% on the test set in classifying tobacco from 10 growing areas. This result significantly outperforms the SVM model, which achieves only 68.2% accuracy. The model identifies key temperature ranges, which are associated with the pyrolysis of tobacco components such as volatile substances, hemicellulose, cellulose, lignin, and inorganic salts. These temperature ranges are critical for accurate classification. However, the time-series-based classification model was only validated for tobacco, and its applicability to other biomass materials needs further investigation. Future work could focus on validating the model using DTG curves from other biomass materials to extend its applicability to a broader range of biomass types.

Availability of data and materials

No datasets were generated or analysed during the current study.

References

  1. Vuppaladadiyam AK, Vuppaladadiyam SSV, Awasthi A, Sahoo A, Rehman S, Pant KK, Murugavelh S, Huang Q, Anthony E, Fennel P, Bhattacharya S, Leu S-Y. Biomass pyrolysis: a review on recent advancements and green hydrogen production. Bioresour Technol. 2022;364: 128087. https://doi.org/10.1016/j.biortech.2022.128087.

    Article  PubMed  CAS  Google Scholar 

  2. Hu B, Liu J, Xie W, Li Y, Lu Q. Chapter 6 – Biofuels production using pyrolysis techniques. In: Jeguirim M, Zorpas AA, editors. Adv. Biofuels Prod .Optim Appl. Amsterdam: Elsevier; 2024. p. 103–25. https://doi.org/10.1016/B978-0-323-95076-3.00010-7.

    Chapter  Google Scholar 

  3. Faridi IK, Tsotsas E, Heineken W, Koegler M, Kharaghani A. Spatio-temporal prediction of temperature in fluidized bed biomass gasifier using dynamic recurrent neural network method. Appl Therm Eng. 2023;219: 119334. https://doi.org/10.1016/j.applthermaleng.2022.119334.

    Article  Google Scholar 

  4. Sharma A, Pareek V, Zhang D. Biomass pyrolysis—a review of modelling, process parameters and catalytic studies. Renew Sustain Energy Rev. 2015;50:1081–96. https://doi.org/10.1016/j.rser.2015.04.193.

    Article  CAS  Google Scholar 

  5. Castellano JM, Gómez M, Fernández M, Esteban LS, Carrasco JE. Study on the effects of raw materials composition and pelletization conditions on the quality and properties of pellets obtained from different woody and non woody biomasses. Fuel. 2015;139:629–36. https://doi.org/10.1016/j.fuel.2014.09.033.

    Article  CAS  Google Scholar 

  6. Fernandez A, Saffe A, Pereyra R, Mazza G, Rodriguez R. Kinetic study of regional agro-industrial wastes pyrolysis using non-isothermal TGA analysis. Appl Therm Eng. 2016;106:1157–64. https://doi.org/10.1016/j.applthermaleng.2016.06.084.

    Article  CAS  Google Scholar 

  7. Liu N, Dou C, Yang X, Bai B, Zhu S, Tian J, Wang Z, Xu L, Shen B. Effects of pretreatment procedure, compositional feature and reaction condition on the devolatilization characteristics of biomass during pyrolysis process: a review. J Energy Inst. 2025;118: 101943. https://doi.org/10.1016/j.joei.2024.101943.

    Article  CAS  Google Scholar 

  8. Yin C, Deng X, Yu Z, Liu Z, Zhong H, Chen R, Cai G, Zheng Q, Liu X, Zhong J, Ma P, He W, Lin K, Li Q, Wu A. Auto-classification of biomass through characterization of their pyrolysis behaviors using thermogravimetric analysis with support vector machine algorithm: case study for tobacco. Biotechnol Biofuels. 2021;14:106. https://doi.org/10.1186/s13068-021-01942-w.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  9. Sarkar P, Mukherjee A, Sahu SG, Choudhury A, Adak AK, Kumar M, Choudhury N, Biswas S. Evaluation of combustion characteristics in thermogravimetric analyzer and drop tube furnace for Indian coal blends. Appl Therm Eng. 2013;60:145–51. https://doi.org/10.1016/j.applthermaleng.2013.06.054.

    Article  CAS  Google Scholar 

  10. Vamvuka D, Kakaras E, Kastanaki E, Grammelis P. Pyrolysis characteristics and kinetics of biomass residuals mixtures with lignite. Fuel. 2003;82:1949–60. https://doi.org/10.1016/S0016-2361(03)00153-4.

    Article  CAS  Google Scholar 

  11. Song Z, Zhang X, Li X, Zhang J, Shao J, Zhang S, Yang H, Chen H. Machine learning assisted prediction of specific surface area and nitrogen content of biochar based on biomass type and pyrolysis conditions. J Anal Appl Pyrolysis. 2024;183: 106823. https://doi.org/10.1016/j.jaap.2024.106823.

    Article  CAS  Google Scholar 

  12. Jiang Y-M, Cui W-H, Dong Q-L. Comprehensive evaluation and analysis of tobacco planting environment based on space technology: comprehensive evaluation and analysis of tobacco planting environment based on space technology. Chin J Plant Ecol. 2012;36:47–54. https://doi.org/10.3724/SP.J.1258.2012.00047.

    Article  Google Scholar 

  13. Teng L, Jiang G, Ding Z, Wang Y, Liang T, Zhang J, Dai H, Cao F. Evaluation of tobacco-planting soil quality using multiple distinct scoring methods and soil quality indices. J Clean Prod. 2024;441: 140883. https://doi.org/10.1016/j.jclepro.2024.140883.

    Article  Google Scholar 

  14. Li Y, Li J, Zhou S, Meng B, Wu T. A review on thermogravimetric analysis-based analyses of the pyrolysis kinetics of oil shale and coal. Energy Sci Eng. 2024;12:329–55. https://doi.org/10.1002/ese3.1627.

    Article  CAS  Google Scholar 

  15. Peng Y, Hao X, Qi Q, Tang X, Mu Y, Zhang L, Liao F, Li H, Shen Y, Du F, Luo K, Wang H. The effect of oxygen on in-situ evolution of chemical structures during the autothermal process of tobacco. J Anal Appl Pyrolysis. 2021;159: 105321. https://doi.org/10.1016/j.jaap.2021.105321.

    Article  CAS  Google Scholar 

  16. Mu Y, Peng Y, Tang X, Ren J, Xing J, Luo K, Fan J, Zhang K. Experimental and kinetic studies on tobacco pyrolysis under a wide range of heating rates. ACS Omega. 2022;7:1420–7. https://doi.org/10.1021/acsomega.1c06122.

    Article  PubMed  CAS  Google Scholar 

  17. Peng Y, Bi Y, Dai L, Li H, Cao D, Qi Q, Liao F, Zhang K, Shen Y, Du F, Wang H. Quantitative analysis of routine chemical constituents of tobacco based on thermogravimetric analysis. ACS Omega. 2022;7:26407–15. https://doi.org/10.1021/acsomega.2c02243.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  18. Wei H, Xing J, Luo K, Peng Y, Fan J, Zhang K, Wang H. Predicting tobacco pyrolysis based on chemical constituents and heating conditions using machine learning approaches. Fuel. 2023;335: 126895. https://doi.org/10.1016/j.fuel.2022.126895.

    Article  CAS  Google Scholar 

  19. Migliaccio R, Cerciello F, Oliano MM, Russo C, Apicella B, Senneca O. Effect of oxidative atmospheres on thermochemical degradation of tobacco: discriminating between oxidative pyrolysis and combustion. Fuel. 2024;374: 132313. https://doi.org/10.1016/j.fuel.2024.132313.

    Article  CAS  Google Scholar 

  20. Zhang T, Wang L, Mei J, Wang A, Qiao X, Wang B, Li Q, Li B. Construction of a flavor category discrimination model based on thermal analysis spectra of flue-cured tobacco. Tob Sci Technol. 2020;53:75–80. https://doi.org/10.16135/j.issn1002-0861.2020.0001.

    Article  Google Scholar 

  21. Stańczyk U, Jain LC. Feature selection for data and pattern recognition. In: Stańczyk U, Jain LC, editors. Feature Sel Data Pattern Recognit. Berlin: Springer; 2015. https://doi.org/10.1007/978-3-662-45620-0_1.

    Chapter  Google Scholar 

  22. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. https://doi.org/10.1038/nature14539.

    Article  PubMed  CAS  Google Scholar 

  23. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

    Article  PubMed  CAS  Google Scholar 

  24. Zhao B, Lu H, Chen S, Liu J, Wu D. Convolutional neural networks for time series classification. J Syst Eng Electron. 2017;28:162–9. https://doi.org/10.21629/JSEE.2017.01.18.

    Article  CAS  Google Scholar 

  25. Talebjedi B, Laukkanen T, Holmberg H, Syri S. Advanced design and operation of Energy Hub for forest industry using reliability assessment. Appl Therm Eng. 2023;230: 120751. https://doi.org/10.1016/j.applthermaleng.2023.120751.

    Article  Google Scholar 

  26. Brahma I, Singh S. Experimental, numerical and deep learning modeling study of heat transfer in turbulent pulsating pipe flow. Appl Therm Eng. 2024;244: 122685. https://doi.org/10.1016/j.applthermaleng.2024.122685.

    Article  Google Scholar 

  27. Lu D, Suh Y, Won Y. Rapid identification of boiling crisis with event-based visual streaming analysis. Appl Therm Eng. 2024;239: 122004. https://doi.org/10.1016/j.applthermaleng.2023.122004.

    Article  Google Scholar 

  28. Menczel JD, Prime RB. Thermal analysis of polymers: fundamentals and applications, thermal analysis of polymers: fundamentals and applications. 2008. https://xueshu.baidu.com/usercenter/paper/show?paperid=fadc8ec707e737d7e003967eb0be303b. Accessed 26 Dec 2024.

  29. Alshdaifat E, Alshdaifat D, Alsarhan A, Hussein F, El-Salhi SMFS. The effect of preprocessing techniques applied to numeric features, on classification algorithms’ performance. Data. 2021;6:11. https://doi.org/10.3390/data6020011.

    Article  Google Scholar 

  30. Mishra P, Biancolillo A, Roger JM, Marini F, Rutledge DN. New data preprocessing trends based on ensemble of multiple preprocessing techniques. TrAC Trends Anal Chem. 2020;132: 116045. https://doi.org/10.1016/j.trac.2020.116045.

    Article  CAS  Google Scholar 

  31. Savitzky A, Golay MJ. Smoothing and differentiation of data by simplified least squares procedures. Anal Chem. 1964;36:1627–39. https://doi.org/10.1021/ac60214a047.

    Article  CAS  Google Scholar 

  32. Liu FT, Ting K, Zhou ZH. Isolation Forest, In: 2008 Eighth IEEE Int. Conf. Data Min., 2008. https://www.semanticscholar.org/paper/Isolation-Forest-Liu-Ting/00a1077d298f2917d764eb729ab1bc86af3bd241.

  33. Liu X, Aldrich C. Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models. Fuel. 2023;335: 126891. https://doi.org/10.1016/j.fuel.2022.126891.

    Article  CAS  Google Scholar 

  34. He H, Ma Y. Imbalanced datasets: from sampling to classifiers, in: Imbalanced Learn. Found. Algorithms Appl., IEEE, 2013: pp. 43–59. https://doi.org/10.1002/9781118646106.ch3.

  35. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936;7:179–88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.

    Article  Google Scholar 

  36. Zeng X, Naghedolfeizi M, Arora S, Yousif N, Aberra D. Selection of principal components based on Fisher discriminant ratio. 2016; 98710K. https://doi.org/10.1117/12.2227045.

  37. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

    Google Scholar 

  38. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97. https://doi.org/10.1007/BF00994018.

    Article  Google Scholar 

  39. Hsu C, Chang C, Lin CJ. A practical guide to support vector classification. 2003.

  40. Yuan G-X, Ho C-H, Lin C-J. Recent advances of large-scale linear classification. Proc IEEE. 2012;100:2584–603. https://doi.org/10.1109/JPROC.2012.2188013.

    Article  Google Scholar 

  41. Mallat S. Understanding deep convolutional networks. Philos Trans R Soc Math Phys Eng Sci. 2016;374:20150203. https://doi.org/10.1098/rsta.2015.0203.

    Article  CAS  Google Scholar 

  42. Liu FT, Ting KM, Zhou Z-H. Isolation-based anomaly detection. ACM Trans Knowl Discov Data. 2012;6:1–39. https://doi.org/10.1145/2133360.2133363.

    Article  Google Scholar 

  43. Skreiberg A, Skreiberg Ø, Sandquist J, Sørum L. TGA and macro-TGA characterisation of biomass fuels and fuel mixtures. Fuel. 2011;90:2182–97. https://doi.org/10.1016/j.fuel.2011.02.012.

    Article  CAS  Google Scholar 

  44. Liao J, Lu Z, Hu S, Li Q, Che L, Chen XD. Effects of prewash on the pyrolysis kinetics of cut tobacco. Dry Technol. 2017;35:1368–78. https://doi.org/10.1080/07373937.2017.1320803.

    Article  Google Scholar 

Download references

Acknowledgements

This research was supported by the Science Foundation of China Tobacco Zhejiang Industrial (Grant No. ZJZY2024C002) and Zhejiang Province Natural Science Foundation (Grant No. LTGS23E060001).

Funding

This research was supported by the Science Foundation of China Tobacco Zhejiang Industrial (Grant No. ZJZY2024C002) and Zhejiang Province Natural Science Foundation (Grant No. LTGS23E060001).

Author information

Authors and Affiliations

  1. Key Laboratory of Refrigeration and Cryogenic Technology of Zhejiang Province, Zhejiang University, Hangzhou, 310027, China

    Jiaxu Xia & Zhihua Gan

  2. Cryogenic Center, Hangzhou City University, Hangzhou, 310015, China

    Jiaxu Xia, Guanqun Luo & Zhihua Gan

  3. Technology Center, China Tobacco Zhejiang Industrial Co., Ltd, Hangzhou, 310012, China

    Yunong Tian, Xianwei Hao & Yuhan Peng

Authors

  1. Jiaxu Xia
  2. Yunong Tian
  3. Xianwei Hao
  4. Yuhan Peng
  5. Guanqun Luo
  6. Zhihua Gan

Contributions

Jiaxu Xia: Methodology, validation, writing—original draft, Conceptualization. Yunong Tian: Methodology, validation, writing—review and editing. Xianwei Hao: Methodology, validation, writing—review and editing. Yuhan Peng: Data curation, investigation, funding acquisition. Guanqun Luo: Data curation, validation, writing—review and editing. Zhihua Gan: Project administration, validation, writing—review and editing. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yuhan Peng or Guanqun Luo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xia, J., Tian, Y., Hao, X. et al. A model for tobacco growing area classification based on time series features of thermogravimetric analysis. Biotechnol. Biofuels Bioprod. 18, 90 (2025). https://doi.org/10.1186/s13068-025-02682-x

Keywords