Development and validation of machine learning models based on blood routine tests and tumor markers in early screening of primary bronchogenic lung cancer

Wenjing Deng; Lijuan Pan; Haolin Wang; Yulong Liu; Xuelian Peng; Chunyan Yang; Jin Li; Baoru Han

doi:10.21037/tlcr-2025-970

Original Article

Wenjing Deng¹#, Lijuan Pan²#, Haolin Wang¹#, Yulong Liu², Xuelian Peng², Chunyan Yang², Jin Li², Baoru Han¹

¹College of Artificial Intelligence Medicine, Chongqing Medical University, Chongqing, China; ²Department of Laboratory Medicine, The Affiliated Dazu’s Hospital of Chongqing Medical University, Chongqing, China

Contributions: (I) Conception and design: W Deng, L Pan, H Wang, J Li, B Han; (II) Administrative support: J Li, B Han; (III) Provision of study materials or patients: W Deng, L Pan, H Wang, Y Liu, X Peng, C Yang; (IV) Collection and assembly of data: L Pan, W Deng, H Wang; (V) Data analysis and interpretation: W Deng, L Pan, H Wang, J Li, B Han; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work as co-first authors.

Correspondence to: Jin Li, MD, PhD. Department of Laboratory Medicine, The Affiliated Dazu’s Hospital of Chongqing Medical University, 1073 South 2nd Ring Road, Chongqing 402360, China. Email: lijin@hospital.cqmu.edu.cn; Baoru Han, PhD. College of Artificial Intelligence Medicine, Chongqing Medical University, Building 6, Lanyuan, No. 61, University Town Middle Road, Shapingba District, Chongqing 400016, China. Email: baoruhan@cqmu.edu.cn.

Background: Primary bronchogenic lung cancer (PBLC) poses a serious threat to human health with its high mortality rate largely attributed to challenges in reliable early detection. Hence, the early identification of PBLC is essential for subsequent patient treatment. Machine learning (ML) models that utilize accessible data, such as routine blood tests and tumor markers, present a promising approach for enhancing early screening rates. This study aims to construct an ML prediction model based on the combined analysis of routine blood tests and tumor markers and to establish an early intelligent screening platform for PBLC through systematic integration and development of technology so as to improve the early screening rate of PBLC.

Methods: This study used samples from the PBLC group and the healthy control (HC) group collected from 2018 to 2023 (n=1,054). Data from The Affiliated Dazu’s Hospital of Chongqing Medical University were used for model construction and internal validation (n=767), and data from the other hospitals of the Chongqing Dazu District People’s Hospital Medical Community were used for external validation (n=287). Feature selection with the least absolute shrinkage and selection operator (LASSO) algorithm retained 14 features spanning routine blood tests and tumor markers. Subsequently, 10 ML models were trained and compared on eight evaluation metrics, including accuracy, sensitivity, specificity, and area under the curve (AUC), to develop an early PBLC prediction tool.

Results: Among the ML models compared for early prediction of PBLC, the eXtreme Gradient Boosting (XGBoost) model achieved an AUC above 0.980 in both internal and external validation. Basophils, lymphocytes, and carcinoembryonic antigen (CEA) ranked highest in feature importance for early PBLC prediction, suggesting that indicators from routine blood tests and tumor markers jointly influence predictive performance and underscoring the practicality of integrating these two types of indicators in model development.

Conclusions: The ML models developed possess substantial application value in the early screening of PBLC, which is beneficial for the prompt detection and treatment of individuals diagnosed with PBLC.

Keywords: Primary bronchogenic lung cancer (PBLC); machine learning (ML); routine blood tests; tumor markers; least absolute shrinkage and selection operator (LASSO)

Submitted Aug 23, 2025. Accepted for publication Nov 20, 2025. Published online Dec 29, 2025.

Highlight box

Key findings

• Machine learning (ML) algorithms can be used for early prediction of primary bronchogenic lung cancer (PBLC).

• Combining routine blood tests with tumor markers can help improve the early screening rate of PBLC.

What is known and what is new?

• The use of simple indicators for early prediction of PBLC is crucial for early treatment.

• The eXtreme Gradient Boosting (XGBoost) model we developed achieved an area under the curve above 0.980 in both internal and external validation, highlighting the practicality of constructing models by integrating routine blood tests and tumor markers.

What is the implication, and what should change now?

• The ML algorithm based on routine blood test and tumor marker parameters has good practical value in the early screening of PBLC, providing clinicians with a more accurate and efficient predictive tool.

Introduction

Primary bronchogenic lung cancer (PBLC) is a prevalent malignant tumor and a substantial hurdle to prolonging life expectancy. Its incidence and mortality rates have been steadily rising, posing a significant worldwide public health issue. According to the National Cancer Center, PBLC accounted for more than one fifth (1.0606 million) of China’s 4.8247 million new cancer cases in 2022 (1). The high mortality rate is inextricably linked to late-stage diagnosis, often compounded by an intricate etiology related to factors such as heavy smoking, living environment, air quality, genetics, and lung infection (2,3). This underscores the urgent need for effective early screening methods (4), which are essential to improve patient survival rates.

Previously, imaging was the main screening standard for PBLC, and a growing number of departments and researchers now advocate the introduction of lung cancer screening (LCS) programs. Although low-dose computed tomography (LDCT) is used to screen high-risk populations, its role is complex. Research has highlighted that the efficacy of computed tomography (CT) in lowering mortality rates in LCS remains debatable, particularly in advanced-stage PBLC (5). Furthermore, while CT is potentially advantageous for high-risk individuals, its possible detrimental effects remain unclear (6). Concurrently, traditional methods relying solely on a limited number of serum tumor markers do not adequately meet the screening and diagnostic needs for early-stage PBLC (7). These challenges, coupled with high costs and cost-effectiveness considerations, limit the widespread implementation of current screening protocols (8). These limitations call for a shift toward non-invasive, cost-effective mass screening assays. Although routine blood tests and tumor markers are widely accessible, their inherently low specificity and the complexity of the underlying biological patterns make it difficult to extract reliable diagnostic value from these data-rich sources.

Herein lies the transformative potential of machine learning (ML). Recent advances in ML have enabled more accurate predictions based on numerous prognostic factors, thereby offering patients more tailored and efficient treatment strategies (9). Researchers utilizing large medical datasets have achieved favorable outcomes across various clinical scenarios and disease types, including diagnosis and treatment evaluation (10). Current ML applications have shown remarkable success in PBLC early diagnosis, achieving high accuracy (95.6%) in detecting and classifying pulmonary nodules from CT scans (11). Coupled with liquid biopsy, ML facilitates non-invasive screening by analyzing complex circulating biomarkers, such as exosomal features (12) and cell-free DNA (cfDNA) alterations (13).

However, the clinical utility of current ML applications faces two major hurdles: scalability and specificity. Some promising models rely on technically demanding modalities, such as peripheral blood transcriptomics (14) or surface-enhanced Raman spectroscopy (SERS) features of serum exosomes (15), incurring high costs and complexity that restrict widespread implementation. More critically, models based on routine blood indicators often suffer from suboptimal diagnostic specificity, as these markers are non-specific and influenced by various non-malignant conditions (16). Therefore, there is an urgent need for a cost-effective and highly specific ML strategy that efficiently integrates accessible routine clinical parameters with targeted tumor markers to bridge the gap in the early detection of PBLC.

This study aims to develop and validate an ML model for the early screening of PBLC using data from routine blood tests and tumor markers. The goal is to find the optimal model and apply it to a visualization prediction system, thereby providing effective clinical decision support for early screening. We present this article in accordance with the TRIPOD reporting checklist (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-970/rc).

Methods

Data sources and study population

All subjects in this study were from the Chongqing Dazu District People’s Hospital Medical Community (including The Affiliated Dazu’s Hospital of Chongqing Medical University, Longshui Branch, Wangu Branch, Third District Branch, Shima Branch, Zhuxi Branch, Longshi Branch, Shiwan Branch, Jinshan Branch, Guoliang Branch, Baoxing Branch, Zhifeng Branch, Baoding Branch, Huilong Branch, and Yongxi Branch), a total of 15 hospitals. The data comprised clinical indicators recorded for patients from January 2018 to December 2023. All records were anonymized, and any information that could compromise patient privacy was deleted.

In addition, we performed two types of tests on patients. Firstly, routine blood tests were conducted. Peripheral blood was collected from the participants and tested using a routine blood analyzer (BC7500, Mindray, Shenzhen, China). The tests included basophils count (Baso#), basophils percentage (Baso%), eosinophils count (Eos#), eosinophils percentage (Eos%), hemoglobin (Hb), hematocrit (Hct), lymphocytes count (Lymph#), lymphocytes percentage (Lymph%), mean corpuscular Hb (MCH), MCH concentration (MCHC), mean corpuscular volume (MCV), mean platelet volume (MPV), monocytes count (Mono#), monocytes percentage (Mono%), neutrophils count (Neut#), neutrophils percentage (Neut%), platelet-large cell ratio (P-LCR), plateletcrit (PCT), platelet distribution width (PDW), platelet count (PLT), red blood cell count (RBC), red cell distribution width-coefficient of variation (RDW-CV), and white blood cell count (WBC). Secondly, tumor markers were detected. Similarly, peripheral blood was collected from the subjects, and carcinoembryonic antigen (CEA) and carbohydrate antigen 125 (CA125) were detected using a fully automated biochemical analyzer (Chemistry XPT, Siemens, Erlangen, Germany), while cytokeratin 19 fragment (CYFRA21-1), neuron-specific enolase (NSE), and squamous cell carcinoma-associated antigen (SCC-Ag) were detected using a fully automated chemiluminescence immunoassay analyzer (Wan 200+, Xiamen Youmaike, Xiamen, China). Through a comprehensive analysis of these tumor markers and routine blood indicators, a deeper understanding of the traits of PBLC patients can be achieved.

The inclusion criteria for this study were as follows: (I) participants were aged between 18 and 90 years; (II) patients who had undergone tumor marker and routine blood tests at a hospital were selected; (III) a professional clinician determined whether the patient had PBLC based on a comprehensive consideration of clinical symptoms, imaging results, pathological examinations, and molecular biological test results; (IV) patients diagnosed with PBLC, including but not limited to different pathological types such as lung adenocarcinoma, lung squamous cell carcinoma, and small cell lung cancer (SCLC); (V) those who sought medical attention for “routine physical examination” and had no history of lung disease were included in the healthy control (HC) group; (VI) complete recording of the participants’ main clinical information; (VII) all clinical test data were collected before the administration of medication. The exclusion criteria were as follows: (I) patients with incomplete or invalid results for routine blood tests and tumor markers; (II) patients with other serious malignant tumors besides PBLC; (III) patients with active infectious diseases; (IV) patients with autoimmune diseases.

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethical Committee at The Affiliated Dazu’s Hospital of Chongqing Medical University (approval No. DZ2025-03-020). All participating hospitals were informed and agreed to the study. Given the retrospective design of the study, the Ethics committee granted exemption from obtaining informed consent.

Data preprocessing

This study summarized and organized the data extracted from the hospitals, used the data from The Affiliated Dazu’s Hospital of Chongqing Medical University (DZ-H) for model construction and internal validation, and used data from the remaining 14 hospitals in the Medical Community of Dazu District People’s Hospital of Chongqing (MC-H) for external validation. The same processing method was applied to all data. Firstly, we removed rows or columns with missing values and outliers and refined the features containing a large number of null values and invalid data. Secondly, we used the least absolute shrinkage and selection operator (LASSO) algorithm (17,18) for feature selection, which automatically filters key features through an L1 regularization constraint, reducing the data dimensionality to speed up model training and improve efficiency (19). Thirdly, the target label (clinical diagnosis) was converted into a binary variable (PBLC or HC), making the task a binary classification problem. Finally, for imbalanced samples, random undersampling (20) was used to reduce the number of majority-class samples so that the classes were more balanced, preventing the model from being biased toward the majority class during training (Figure 1).
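The random undersampling step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function name `random_undersample`, the toy rows, and the choice to equalize class counts exactly are assumptions made for illustration, and the paper does not state at which point of the pipeline the step was applied.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Size of the minority class: every class is cut down to this count.
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # keep the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Made-up toy data: six rows of class 0 and three rows of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [1.1], [1.2], [1.3]]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
balanced_rows, balanced_labels = random_undersample(rows, labels, seed=1)
print(Counter(balanced_labels))
```

In practice, undersampling is usually restricted to the training portion so that validation data keep their natural class distribution; whether the study did so is not stated.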

Figure 1 Details of the PBLC dataset. The figure intuitively shows the characteristics of all data and the status of the dataset (A), and the status of the datasets of three groups: the overall group (B), the male group (C), and the female group (D). DZ-H, The Affiliated Dazu’s Hospital of Chongqing Medical University; LC, lung cancer; MC-H, Longshui Branch, Wangu Branch, Third District Branch, Shima Branch, Zhuxi Branch, Longshi Branch, Shiwan Branch, Jinshan Branch, Guoliang Branch, Baoxing Branch, Zhifeng Branch, Baoding Branch, Huilong Branch, and Yongxi Branch; PBLC, primary bronchogenic lung cancer.

Statistical analysis

While developing the model, we used Python 3.12 as the programming environment, together with the scikit-learn 1.5.1 ML library, SHAP 0.46.0, Matplotlib 3.10.0, pandas 2.2.2, NumPy 1.26.4, and Streamlit 1.37.1.

We statistically analyzed the distribution of demographic characteristics and routine laboratory parameters in DZ-H and MC-H, and calculated the proportions of PBLC and HC samples, the median age of patients, and the proportions of men and women. Furthermore, the differences in age, gender, tumor markers, and routine blood tests between the two groups were compared, and the corresponding means, standard deviations (SDs) (21), and P values were calculated. For these inter-group comparisons, we used Welch’s t-test (unequal variance t-test) (22). This method is an improvement on the traditional Student’s t-test and is particularly suitable when the two groups have unequal variances and imbalanced sample sizes. A P value of less than 0.05 (23) was taken to indicate a statistically significant difference between the PBLC and HC groups. These analyses clarified the differences and characteristic distributions between patients with and without PBLC.
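For illustration, Welch’s t statistic and its Welch-Satterthwaite degrees of freedom can be computed directly from group summaries. The sketch below is a minimal standard-library version, not the authors' code; the function name `welch_t` is hypothetical, and the example call uses the Hb means, SDs, and group sizes reported for DZ-H in Table 1. Converting the statistic into a P value requires the t distribution, which is omitted here.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hb (g/L) in DZ-H, taken from Table 1: PBLC 120.49 (22.83), n=409; HC 144.27 (15.21), n=358.
t_hb, df_hb = welch_t(120.49, 22.83, 409, 144.27, 15.21, 358)
print(f"t = {t_hb:.2f}, df = {df_hb:.1f}")
```

Unlike Student’s t-test, the variances are not pooled, so the degrees of freedom are generally non-integer and fall between the smaller group's n−1 and n1+n2−2.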

ML algorithms

In the face of complex data processing and analysis challenges, we explored several advanced ML models. In this article, we selected ten representative ML algorithms: Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), Multilayer Perceptron (MLP), K-Nearest Neighbor (KNN), Decision Tree (DT), Gradient Boosting Decision Tree (GBDT), eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Adaptive Boosting (AdaBoost).

SVM (24) is a maximum-margin classifier mostly used for classification tasks. It uses kernel functions to construct the optimal separating hyperplane, which allows nonlinear decision boundaries and makes it particularly suitable for small datasets with high-dimensional features. NB (25) calculates the posterior probability of each class for a given data sample using Bayes’ theorem, making it well suited to high-dimensional sparse data. MLP (26) is an artificial neural network (ANN) trained with the back-propagation algorithm and is suitable for nonlinear problems. RF (27) aggregates many decision trees, improving model stability through their diversity while maintaining high prediction accuracy, which makes it suitable for tasks requiring high generalization ability. GBDT (28), XGBoost (29), LightGBM (30), and AdaBoost (31) are all boosting-based ensemble methods that improve model accuracy by sequentially combining weak learners, effectively handling nonlinear relationships and high-dimensional features. DT (32) is a tree-structured method for regression and classification problems, which is suitable for small to medium-sized datasets. KNN (33) assigns a class to a sample by computing the distances between that sample and the training samples. It is sensitive to feature scale and to the choice of k, and is often used for local pattern recognition and small datasets. By comparing these ML algorithms, this study aims to reveal their performance differences in the classification task and to select the optimal model based on the data characteristics.
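To make the remark about KNN's sensitivity to feature scale concrete, the following minimal sketch standardizes the features before computing Euclidean distances and then takes a majority vote among the k nearest training samples. The function names and the toy values (two made-up features per sample) are assumptions for illustration and do not reproduce the study's data or code.

```python
import math
from collections import Counter


def fit_scaler(rows):
    """Column-wise means and standard deviations (SD falls back to 1.0 if a column is constant)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return means, sds


def scale(row, means, sds):
    """Standardize one sample with the training means and SDs."""
    return [(v - m) / s for v, m, s in zip(row, means, sds)]


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted((math.dist(r, query), lab)
                     for r, lab in zip(train_rows, train_labels))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]


# Made-up toy samples: [feature on a 0-1 scale, feature on a 0-100 scale].
train = [[0.05, 34.0], [0.10, 33.0], [0.08, 35.0],
         [0.45, 20.0], [0.50, 19.0], [0.40, 22.0]]
train_labels = [0, 0, 0, 1, 1, 1]
means, sds = fit_scaler(train)
scaled_train = [scale(r, means, sds) for r in train]
prediction = knn_predict(scaled_train, train_labels, scale([0.42, 21.0], means, sds), k=3)
print(prediction)
```

Without the scaling step, the second feature would dominate the distance simply because its numeric range is larger.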

Model development and evaluation process

We split the DZ-H data 8:2 for model construction and internal validation. Using a random search, hyperparameters were randomly sampled from specified ranges to train the model, and the best combination was identified and saved. During the random search, the AUC was used as the evaluation criterion, as it reflects how well the model distinguishes between positive and negative samples. Subsequently, we used stratified 10-fold cross-validation (34) to evaluate the performance of the optimal model. Unlike ordinary 10-fold cross-validation, stratified 10-fold cross-validation preserves the class proportions in every fold, which makes it suitable for imbalanced datasets and improves the reliability of the evaluation (Table S1) (35).
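The idea behind stratified 10-fold cross-validation can be sketched without any library: the indices of each class are shuffled and dealt round-robin into the folds, so every fold keeps roughly the overall class ratio. The function name `stratified_folds` and the toy label vector are hypothetical; the study itself lists scikit-learn among its tools, whose splitter may differ in detail.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with per-class counts differing by at most one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Deal this class round-robin, continuing where the previous class stopped
        # so that the total fold sizes also stay balanced.
        for pos, idx in enumerate(idxs):
            folds[(offset + pos) % k].append(idx)
        offset += len(idxs)
    return folds


# Made-up labels: 33 positives and 27 negatives.
labels = [1] * 33 + [0] * 27
folds = stratified_folds(labels, k=10, seed=0)
for fold_id, val_idx in enumerate(folds):
    held_out = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    positives = sum(labels[i] for i in val_idx)
    print(fold_id, len(train_idx), len(val_idx), positives)
```

Each fold serves once as the validation set while the remaining nine folds are used for training.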

Subsequently, this study used data from the MC-H for external validation, which can better simulate the real medical environment. In this process, we verified the key evaluation indicators of each model and focused on drawing the ROC curves of different models on the internal validation set to clearly show the performance differences between different models and provide a basis for subsequent model selection.
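As a reminder of what the area under the ROC curve measures, the sketch below computes it as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This pairwise form is equivalent to the area under the empirical ROC curve; the function name `roc_auc` and the toy scores are assumptions for illustration only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up predicted probabilities for three positives and three negatives.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

The double loop is quadratic in the sample size, which is acceptable for a sketch; library implementations use a rank-based computation instead.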

Finally, using the confusion matrix, we calculated a series of performance indicators for the optimized and trained models. These indicators include accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, F1 score, area under the curve (AUC), and average precision (AP) (36). Accuracy is the percentage of correct predictions across all samples; PPV and NPV (37) indicate the reliability of the model’s positive and negative predictions, respectively; sensitivity and specificity (38) describe the model’s capability to identify positive and negative samples, respectively; the F1 score (39) is the harmonic mean of precision and recall and balances these two metrics; the area under the receiver operating characteristic curve (AUROC) measures the model’s overall discriminative ability across thresholds; and, for imbalanced data, the area under the precision-recall curve (AUPRC) (40) focuses more on the detection stability of positive samples. The bootstrap method was used to calculate the confidence interval (CI) of each evaluation indicator (Figure 2).
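The threshold-based indicators and their bootstrap confidence intervals can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the percentile bootstrap with 1,000 resamples is one common choice that the paper does not specify, and the labels and predictions are made up.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Accuracy, PPV, NPV, sensitivity, specificity, and F1 from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def bootstrap_ci(y_true, y_pred, name, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement and recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])[name]
        if value == value:  # skip resamples where the metric is undefined (NaN)
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


# Made-up labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
result = metrics(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, "accuracy")
print(result["accuracy"], ci_low, ci_high)
```

AUC and AP are computed from the predicted scores rather than from a single confusion matrix, but the same resampling loop applies to them.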

Figure 2 PBLC data operation process. AdaBoost, Adaptive Boosting; AUC, area under the curve; DT, decision tree; GBDT, Gradient Boosting Decision Tree; KNN, K-Nearest Neighbor; LightGBM, Light Gradient Boosting Machine; MLP, Multilayer Perceptron; NB, Naive Bayes; NPV, negative predictive value; PBLC, primary bronchogenic lung cancer; PPV, positive predictive value; RF, Random Forest; SVM, Support Vector Machine; XGBoost, eXtreme Gradient Boosting.

Shapley additive explanations

Shapley additive explanations (SHAP) analysis applies the Shapley value from game theory to rank the importance of each feature, supporting the investigation of model interpretability. It can systematically analyze the direction and magnitude of each feature’s influence in the prediction model (41-43). The effect of each feature on the model was quantified by its associated SHAP value. A positive SHAP value means that the feature increases the model output for that sample; conversely, a negative SHAP value means that the feature decreases it (44-46).
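The Shapley value underlying SHAP can be illustrated by exact enumeration over feature subsets for a tiny model. The sketch below is not the SHAP library and not the study's XGBoost model: the scoring function `toy_score`, its coefficients, the sample, and the baseline are all made up, and "absent" features are simply replaced by baseline values, which is one common convention.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample; features outside a subset take baseline values."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_score(features):
    """Hypothetical risk score with one interaction term (illustration only)."""
    a, b, c = features
    return 2.0 * a - 0.05 * b + 0.1 * c + 0.02 * a * c


x = [0.4, 20.0, 5.0]          # made-up sample
baseline = [0.1, 34.0, 1.0]   # made-up reference values
phi = shapley_values(toy_score, x, baseline)
print([round(v, 4) for v in phi])
```

The values sum to the difference between the score of the sample and the score of the baseline, which is the additivity property that SHAP plots rely on. Exact enumeration grows exponentially with the number of features, so tree models are normally explained with specialized algorithms instead.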

Results

Patient characteristics and variables

This study systematically screened the clinical characteristics of PBLC patients using LASSO regression and retained 14 features with predictive value: CA125, CEA, CYFRA21-1, Baso%, Eos%, Hb, Lymph#, Lymph%, MCHC, Mono%, Neut#, PCT, PDW, and RDW-CV. In the DZ-H data, the P value for Eos% was less than 0.05 and the P values of all other indicators were less than 0.001, indicating differences between the PBLC and HC groups (Table 1). In the MC-H data, the P values of CA125, Baso%, Hb, Lymph#, Lymph%, MCHC, Mono%, Neut#, and RDW-CV were all less than 0.001, indicating highly statistically significant differences in these indicators between the PBLC and HC groups. The P value for PDW was less than 0.05, indicating a statistically significant difference between the two groups. Conversely, no significant difference was observed between the two groups for CEA, CYFRA21-1, Eos%, and PCT (Table 2).

Table 1

Patient demographic characteristics and distribution of routine laboratory parameters from DZ-H

Variables	Total	PBLC group	HC group	P value
Number of patients	767 [100]	409 [53.32]	358 [46.68]
Age, years	57.00 (48.00–65.00)	64.00 (55.00–67.00)	51.00 (40.25–58.00)	<0.001^†
Gender				<0.001^†
Male	466 [100]	312 [66.95]	154 [33.05]
Female	301 [100]	97 [32.23]	204 [67.77]
CA125, U/mL	50.61 (184.67)	85.97 (247.57)	10.22 (7.15)	<0.001^†
CEA, ng/mL	79.48 (583.58)	147.97 (793.30)	1.23 (1.07)	<0.001^†
CYFRA21-1, ng/mL	6.33 (15.61)	9.96 (20.68)	2.19 (1.16)	<0.001^†
Baso, %	0.27 (0.33)	0.43 (0.31)	0.09 (0.24)	<0.001^†
Eos, %	2.54 (2.89)	2.31 (2.37)	2.81 (3.36)	<0.05^‡
Hb, g/L	131.59 (22.94)	120.49 (22.83)	144.27 (15.21)	<0.001^†
Lymph#, 10⁹/L	1.61 (0.69)	1.25 (0.62)	2.02 (0.53)	<0.001^†
Lymph, %	26.39 (11.55)	19.80 (10.19)	33.92 (7.79)	<0.001^†
MCHC, g/L	324.32 (12.19)	322.61 (12.75)	326.27 (11.21)	<0.001^†
Mono, %	7.59 (4.17)	9.28 (4.90)	5.65 (1.70)	<0.001^†
Neut#, 10⁹/L	4.41 (2.71)	5.14 (3.35)	3.58 (1.29)	<0.001^†
PCT, %	0.25 (0.12)	0.27 (0.16)	0.23 (0.05)	<0.001^†
PDW	12.88 (2.67)	12.19 (2.67)	13.66 (2.45)	<0.001^†
RDW-CV	14.05 (1.81)	14.75 (2.07)	13.26 (0.98)	<0.001^†

Data are presented as median (Q1–Q3), n [%], or mean (SD). ^†, the P value is less than 0.001, indicating that the difference between the PBLC group and HC group is highly statistically significant. ^‡, the P value is less than 0.05, indicating that the difference between the PBLC group and HC group is statistically significant. Baso%, basophils percentage; CA125, carbohydrate antigen 125; CEA, carcinoembryonic antigen; CYFRA21-1, cytokeratin 19 fragment; DZ-H, The Affiliated Dazu’s Hospital of Chongqing Medical University; Eos%, eosinophils percentage; Hb, hemoglobin; HC, healthy control; Lymph#, lymphocytes count; Lymph%, lymphocytes percentage; MCHC, mean corpuscular hemoglobin concentration; Mono%, monocytes percentage; Neut#, neutrophils count; PBLC, primary bronchogenic lung cancer; PCT, plateletcrit; PDW, platelet distribution width; RDW-CV, red cell distribution width-coefficient of variation; SD, standard deviation.

Table 2

Patient demographic characteristics and distribution of routine laboratory parameters from MC-H

Variables	Total	PBLC group	HC group	P value
Number of patients	287 [100]	118 [41.11]	169 [58.89]
Age, years	57.00 (50.50–66.00)	63.00 (56.00–69.00)	54.00 (47.00–60.00)	<0.001^†
Gender				<0.001^†
Male	154 [100]	87 [56.49]	67 [43.51]
Female	133 [100]	31 [23.31]	102 [76.69]
CA125, U/mL	20.68 (38.21)	34.10 (51.64)	11.31 (20.33)	<0.001^†
CEA, ng/mL	23.18 (271.97)	54.51 (423.24)	1.30 (0.88)	0.18
CYFRA21-1, ng/mL	6.58 (35.95)	12.34 (55.66)	2.55 (1.81)	0.059
Baso, %	0.23 (0.32)	0.49 (0.31)	0.05 (0.18)	<0.001^†
Eos, %	2.70 (2.73)	2.79 (3.05)	2.63 (2.48)	0.65
Hb, g/L	136.56 (20.40)	126.86 (21.63)	143.34 (16.44)	<0.001^†
Lymph#, 10⁹/L	1.68 (0.66)	1.31 (0.63)	1.94 (0.55)	<0.001^†
Lymph, %	29.17 (10.98)	21.90 (10.87)	34.24 (7.74)	<0.001^†
MCHC, g/L	324.85 (12.23)	321.92 (10.44)	326.90 (12.98)	<0.001^†
Mono, %	7.17 (3.37)	9.25 (4.02)	5.72 (1.72)	<0.001^†
Neut#, 10⁹/L	3.87 (2.11)	4.59 (2.84)	3.37 (1.17)	<0.001^†
PCT, %	0.23 (0.07)	0.22 (0.09)	0.23 (0.06)	0.66
PDW	13.25 (2.94)	12.56 (3.17)	13.73 (2.68)	<0.05^‡
RDW-CV	13.73 (1.59)	14.31 (1.86)	13.32 (1.23)	<0.001^†

Data are presented as median (Q1–Q3), n [%], or mean (SD). ^†, the P value is less than 0.001, indicating that the difference between the PBLC group and HC group is highly statistically significant. ^‡, the P value is less than 0.05, indicating that the difference between the PBLC group and HC group is statistically significant. Baso%, basophils percentage; CA125, carbohydrate antigen 125; CEA, carcinoembryonic antigen; CYFRA21-1, cytokeratin 19 fragment; Eos%, eosinophils percentage; Hb, hemoglobin; HC, healthy control; Lymph#, lymphocytes count; Lymph%, lymphocytes percentage; MC-H, Longshui Branch, Wangu Branch, Third District Branch, Shima Branch, Zhuxi Branch, Longshi Branch, Shiwan Branch, Jinshan Branch, Guoliang Branch, Baoxing Branch, Zhifeng Branch, Baoding Branch, Huilong Branch, and Yongxi Branch; MCHC, mean corpuscular hemoglobin concentration; Mono%, monocytes percentage; Neut#, neutrophils count; PBLC, primary bronchogenic lung cancer; PCT, plateletcrit; PDW, platelet distribution width; RDW-CV, red cell distribution width-coefficient of variation; SD, standard deviation.

Comparison of model internal and external validation results

In the DZ-H internal validation, the XGBoost model had the highest AUC, 0.982 (95% CI: 0.960–0.996) (Figure 3A). It also had the highest accuracy, sensitivity, and F1 score, with values of 0.929 (95% CI: 0.883–0.968), 0.929 (95% CI: 0.883–0.968), and 0.929 (95% CI: 0.884–0.968), respectively (Figure 3B, Table S2). We then used the MC-H data for external validation of the trained models. Across all datasets, the XGBoost model still performed best, with the highest NPV, AUC, and AP values of 0.993 (95% CI: 0.977–1.000), 0.991 (95% CI: 0.982–0.997), and 0.987 (95% CI: 0.977–0.996), respectively (Figure 3C,3D, Figure 4, and Table S3). Therefore, the external validation supports the effectiveness of the XGBoost model, which can be used for the early screening of PBLC.

Figure 3 Comparison of ROC curves (A,C) and PR curves (B,D) between different models in the PBLC and HC groups during internal and external validation. The curve represents the ROC or PR curves for different ML algorithms. In the ROC curve, the horizontal axis represents the false positive rate and the vertical axis represents the true positive rate. In the PR curve, the horizontal axis represents the recall rate, and the vertical axis represents the precision rate. Different colors represent different algorithms. AdaBoost, Adaptive Boosting; AP, average precision; AUC, area under the curve; DT, Decision Tree; GBDT, Gradient Boosting Decision Tree; HC, healthy control; KNN, K-Nearest Neighbor; LightGBM, Light Gradient Boosting Machine; ML, machine learning; MLP, Multilayer Perceptron; NB, Naive Bayes; PBLC, primary bronchogenic lung cancer; PR, precision-recall; RF, Random Forest; ROC, receiver operating characteristic; SVM, Support Vector Machine; XGBoost, eXtreme Gradient Boosting.

Figure 4 Evaluation results of different models in the PBLC group and HC group in the internal validation and external validation. The performance of various models on six indicators is compared: accuracy, NPV, PPV, specificity, sensitivity, and F1-score. (A) Different column colors represent different models. The longer the column, the better the algorithm performs on the evaluation indicator. (B) Each vertex represents an ML algorithm, and these vertices are connected to the center point by lines. The coordinates of the center point are (0.2, 0), which serves as the average performance benchmark for all classification indicators. The farther away from the center point, the better the algorithm performs on the evaluation indicator. AdaBoost, Adaptive Boosting; DT, Decision Tree; GBDT, Gradient Boosting Decision Tree; HC, healthy control; KNN, K-Nearest Neighbor; LightGBM, Light Gradient Boosting Machine; ML, machine learning; MLP, Multilayer Perceptron; NB, Naive Bayes; NPV, negative predictive value; PBLC, primary bronchogenic lung cancer; PPV, positive predictive value; RF, Random Forest; SVM, Support Vector Machine; XGBoost, eXtreme Gradient Boosting.

Analysis of model interpretability

The interpretability of the prediction model was assessed using SHAP analysis to rank feature importance (Figure S1). As illustrated in Figure 5, the top 14 contributing features are identified, with Baso%, Lymph%, and CEA exerting the most significant influence on the model’s output. This analysis highlights the key clinical indicators most valuable for PBLC diagnosis.

Figure 5 Feature importance analysis of different groups based on SHAP values. SHAP summary plot of 14 features based on the XGBoost model. Baso%, basophils percentage; CA125, carbohydrate antigen 125; CEA, carcinoembryonic antigen; CYFRA21-1, cytokeratin 19 fragment; Eos%, eosinophils percentage; Hb, hemoglobin; Lymph%, lymphocytes percentage; MCHC, mean corpuscular hemoglobin concentration; Mono%, monocytes percentage; Neut#, neutrophils count; PCT, plateletcrit; PDW, platelet distribution width; RDW-CV, red cell distribution width-coefficient of variation; SHAP, Shapley Additive Explanations; XGBoost, eXtreme Gradient Boosting.

Developing the user interface

To support clinicians in early PBLC diagnosis, we built a PBLC risk prediction tool around the XGBoost model. Integrating the optimal ML model into the back end of the prediction tool helps doctors evaluate the PBLC risk of patients efficiently. The homepage displays the same-day numbers of hospital visits, admissions, patients awaiting diagnosis, and discharges, as well as statistics on patient age and gender. In addition, a dedicated feedback function lets users send suggestions or feedback directly to the administrator, thereby improving the efficiency and quality of medical services and enhancing communication between physicians and system administrators (Figure 6A).

Figure 6 Online platform of the system. (A) Home page interface. (B) Patient diagnosis report interface.

Doctors can log in to the system to view the personal and test information of all patients. After entering a set of patient test information, doctors can click the “view” button to check whether the input values of each indicator are correct (Figures S2-S4). If they are, doctors can click the “view diagnostic results” button to open the patient’s PBLC diagnosis report as a reference. Drawing on their own clinical experience, doctors can decide whether to adopt the predicted diagnosis and disease probability, and they can record within the system whether they confirm the result (Figure 6B). This design not only provides doctors with a valuable reference but also improves the reliability of diagnostic results, thereby enhancing doctors’ trust in the system’s predictions.

Discussion

Principal results

Early screening for PBLC mainly relies on traditional methods, such as assessment of clinical symptoms, imaging examinations, and histopathological analysis. However, these methods have certain disadvantages: first, clinical symptoms often appear only in the late stage of PBLC, making early screening difficult (47); second, although imaging examinations such as X-rays and CT scans can detect lung abnormalities, they cannot always distinguish accurately between benign and malignant tumors, and there is a risk of misdiagnosis and missed diagnosis (48); finally, although histopathological analysis is considered the gold standard for diagnosis, inaccurate results can still occur due to factors such as sampling errors, specimen processing, and subjective judgement (49). Furthermore, as an invasive procedure, needle biopsy carries the risk of tumor cells spreading along the needle tract (50). To address these limitations, it is important to use PBLC data to develop ML models for early screening and prediction. ML algorithms can effectively integrate and analyze PBLC-related data, such as routine blood tests and tumor markers. Compared with traditional methods, ML technology can not only significantly improve the accuracy and efficiency of early PBLC screening but also support precise treatment planning based on the individualized clinical characteristics of patients. Based on these advantages, we constructed a high-precision early PBLC prediction model by selecting an optimal ML algorithm.

In our study, we chose the LASSO algorithm to automatically select the most predictive clinical features for model construction. LASSO uses an L1 regularization penalty to shrink the coefficients of unimportant features to zero, achieving automatic variable selection and avoiding bias from manual selection. The retained sparse feature set directly reflects the key predictors, facilitating clinical interpretation and application. Its advantages in feature selection have been demonstrated in clinical feature extraction in studies of sepsis in the elderly (51), metabolic dysfunction-associated steatotic liver disease (52), and non-small cell lung cancer (NSCLC) (53).
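The way an L1 penalty drives small coefficients exactly to zero can be seen in a simplified setting. The sketch below assumes standardized, mutually uncorrelated features, in which case each LASSO coefficient is the unpenalized coefficient passed through a soft-thresholding step. The feature names, coefficient values, and penalty strength `lam` are hypothetical, and the sketch illustrates the mechanism only, not the feature selection pipeline used in this study.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients for four standardized features.
unpenalized = {
    "feature_a": 0.80,
    "feature_b": -0.35,
    "feature_c": 0.05,
    "feature_d": -0.02,
}
lam = 0.10  # hypothetical penalty strength

lasso = {name: soft_threshold(beta, lam) for name, beta in unpenalized.items()}
selected = [name for name, beta in lasso.items() if beta != 0.0]

print(lasso)
print(selected)
```

Features whose unpenalized effect is smaller in magnitude than the penalty are removed entirely, while the remaining coefficients are shrunk by the same amount. In practice, the penalty strength is usually chosen by cross-validation.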

Our results show that the XGBoost model exhibited superior predictive performance among the candidate algorithms. Its performance was particularly outstanding in internal validation, with an AUC exceeding 0.980. In addition, all other evaluation metrics, including accuracy, PPV, NPV, sensitivity, specificity, F1-score, and AP, exceeded 0.890, demonstrating excellent overall performance. Notably, in external validation, the model also maintained good generalization ability, with the AUC remaining above 0.980 and the other seven key evaluation metrics consistently above 0.800, further supporting its reliability in practical applications. In medical research, the XGBoost algorithm, which is based on gradient boosting decision trees, has been widely used because of its superior predictive performance (54-56). This study experimentally demonstrated that XGBoost exhibits significantly higher accuracy than the other traditional ML algorithms in the early screening of PBLC. To fully evaluate the robustness of the model and to avoid overfitting, a stratified 10-fold cross-validation method was used. The results show that XGBoost not only has a higher predictive accuracy but also significantly better generalization ability than the other nine models.
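The 10-fold cross-validation described above can be sketched as follows, assuming it refers to stratified splitting, in which every fold keeps roughly the same proportion of cases and controls as the full dataset. The function `stratified_folds` and the cohort of 40 cases and 60 controls are hypothetical and use only the Python standard library.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class across folds in turn.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical cohort: 40 cases (label 1) and 60 healthy controls (label 0).
labels = [1] * 40 + [0] * 60
folds = stratified_folds(labels, k=10)

for fold_id, test_indices in enumerate(folds):
    held_out = set(test_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    n_cases = sum(labels[i] for i in test_indices)
    print(fold_id, len(train_indices), len(test_indices), n_cases)
```

Each fold serves once as the held-out set while the model is fitted on the other nine, and the reported metric is typically the average over the ten held-out evaluations.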

Through an interpretability analysis of the XGBoost model using SHAP, this study identified several important clinical features for predicting PBLC. On the one hand, the basophil and lymphocyte percentages contributed most to the model’s predictions. Abnormalities in routine blood inflammatory indicators may be closely associated with an inflammatory response in the body (57,58). As chronic inflammation is closely related to the occurrence of tumors, persistent inflammation may lead to the proliferation of bronchial alveolar stem cells and promote the carcinogenesis of lung epithelial cells (59). On the other hand, CEA, CA125, and CYFRA21-1 also strongly influenced the model’s output. CEA is an acidic glycoprotein encoded by the CEACAM5 gene, which plays an important role in the diagnosis of PBLC together with CYFRA21-1 (60). Related studies have shown that it has potential value in the diagnosis, treatment monitoring, and prognostic evaluation of PBLC (61). Additionally, CA125, a cleavage product of mucin MUC16, is also a marker for predicting poor prognosis in PBLC (62). Combining these three tumor markers for early PBLC prediction has clinical predictive value (63). Therefore, combining the detection of routine blood indicators with tumor markers (64) can help improve the early screening rate of PBLC. Moreover, age strongly affects PBLC: the risk in both men and women varies with age. This finding is useful for researchers who wish to investigate patient-age bias in greater detail. Statistical data also indicate that age has consistently been an essential indicator for early lung cancer screening. Presently, the National Comprehensive Cancer Network (NCCN) advises individuals with an elevated risk of developing PBLC to consider beginning screening at the age of 50 or 55 years (65).
Hence, in the early screening of PBLC, the combined use of routine blood tests and tumor markers is the most effective approach, and the impact of age on PBLC also deserves attention. In practical clinical application, doctors need to combine the model’s predictive results with clinical symptoms and other examination results to diagnose PBLC patients more accurately.

Limitations

Although this study has produced some results in terms of visualization and user-oriented system design (66-69), it still has significant limitations. First, the study was retrospective, and the dataset was limited to Dazu District. Data from other sources were not included, which, to some extent, limits the generalizability of the results. Second, real clinical data vary between hospitals owing to factors such as medical resource allocation, patient disease distribution, and data collection methods. As a result, a model trained in one hospital may not achieve equally effective early screening results in other hospitals. Finally, this study did not include clinical data such as the pathological classification, clinical stage, and metastatic status of PBLC. This makes it difficult to analyze different histological subtypes in detail, so the conclusions cannot accurately reflect the disease characteristics or the diagnosis and treatment differences of each patient subgroup. Therefore, future research should explore adding large datasets from multiple hospitals to enhance the robustness of the model.

Conclusions

In summary, an ML algorithm based on routine blood tests and tumor marker parameters offers practical value for the early screening of PBLC. This approach gives clinical practice a more accurate and efficient diagnostic tool, which helps detect PBLC at an early stage, thereby optimizing patient treatment plans and improving patient survival rates.

Acknowledgments

We would like to thank all participants involved in this study.

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-970/rc

Data Sharing Statement: Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-970/dss

Peer Review File: Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-970/prf

Funding: This study was supported by the Medical Image Intelligent Analysis and Application Innovation Team of Chongqing Medical University (No. ZSK0102 to B.H.), the National Natural Science Foundation of China (No. 72101040 to H.W.), and the Major Joint Science and Health Project of DaZu District (No. DZKJ2024JSYJ-KWXM1001 to J.L.).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-970/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethical Committee at The Affiliated Dazu’s Hospital of Chongqing Medical University (approval No. DZ2025-03-020). All participating hospitals were informed and agreed to the study. Given the retrospective design of the study, the ethics committee granted an exemption from obtaining informed consent.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Calvo V, Niazmand E, Carcereny E, et al. Family history of cancer and lung cancer: Utility of big data and artificial intelligence for exploring the role of genetic risk. Lung Cancer 2024;195:107920. [Crossref] [PubMed]
Huang J, Deng Y, Tin MS, et al. Distribution, Risk Factors, and Temporal Trends for Lung Cancer Incidence and Mortality: A Global Analysis. Chest 2022;161:1101-11. [Crossref] [PubMed]
Hutchings H, Wang A, Grady S, et al. Influence of air quality on lung cancer in people who have never smoked. J Thorac Cardiovasc Surg 2025;169:454-461.e2. [Crossref] [PubMed]
Kearney LE, Belancourt P, Katki HA, et al. The Development and Performance of Alternative Criteria for Lung Cancer Screening. Ann Intern Med 2024;177:1222-32. [Crossref] [PubMed]
Panunzio A, Sartori P. Lung Cancer and Radiological Imaging. Curr Radiopharm 2020;13:238-42. [Crossref] [PubMed]
Bach PB, Mirkin JN, Oliver TK, et al. Benefits and harms of CT screening for lung cancer: a systematic review. JAMA 2012;307:2418-29. [Crossref] [PubMed]
Xiang D, Zhang B, Doll D, et al. Lung cancer screening: from imaging to biomarker. Biomark Res 2013;1:4. [Crossref] [PubMed]
Wang Y, Zhou C, Ying L, et al. Leveraging Serial Low-Dose CT Scans in Radiomics-based Reinforcement Learning to Improve Early Diagnosis of Lung Cancer at Baseline Screening. Radiol Cardiothorac Imaging 2024;6:e230196. [Crossref] [PubMed]
Fujimoto D, Hayashi H, Murotani K, et al. Prediction of prognosis in lung cancer using machine learning with inter-institutional generalizability: A multicenter cohort study (WJOG15121L: REAL-WIND). Lung Cancer 2024;194:107896. [Crossref] [PubMed]
Cheng C, Li Y, Wu F. Application value of early lung cancer screening based on artificial intelligence. J Radiat Res Appl Sci 2024;17:100982.
Anjomrooz M, Mohammadian M, Joveini F, et al. A Comparative Analysis of Deep Learning Architectures for Classifying Malignant Lung Nodules in CT Scans. InfoScience Trends 2025;2:1-11.
Liu HS, Ye KW, Liu J, et al. Lung cancer diagnosis through extracellular vesicle analysis using label-free surface-enhanced Raman spectroscopy coupled with machine learning. Theranostics 2025;15:7545-66. [Crossref] [PubMed]
Meshkov IO, Koturgin AP, Ershov PV, et al. Diagnostics of lung cancer by fragmentated blood circulating cell-free DNA based on machine learning methods. Front Med (Lausanne) 2025;12:1435428. [Crossref] [PubMed]
Li X, Li X, Qin J, et al. Machine learning-derived peripheral blood transcriptomic biomarkers for early lung cancer diagnosis: Unveiling tumor-immune interaction mechanisms. Biofactors 2025;51:e2129. [Crossref] [PubMed]
Liu Y, Cai C, Xu W, et al. Interpretable Machine Learning-Aided Optical Deciphering of Serum Exosomes for Early Detection, Staging, and Subtyping of Lung Cancer. Anal Chem 2024;96:16227-35. [Crossref] [PubMed]
Wu J, Zan X, Gao L, et al. A Machine Learning Method for Identifying Lung Cancer Based on Routine Blood Indices: Qualitative Feasibility Study. JMIR Med Inform 2019;7:e13476. [Crossref] [PubMed]
Xie SH, Zhang WF, Wu Y, et al. Application of predictive model based on CT radiomics and machine learning in diagnosis for occult locally advanced esophageal squamous cell carcinoma before treatment: A two-center study. Transl Oncol 2024;47:102050. [Crossref] [PubMed]
Wang N, Qu S, Kong W, et al. Establishment and validation of novel predictive models to predict bone metastasis in newly diagnosed prostate adenocarcinoma based on single-photon emission computed tomography radiomics. Ann Nucl Med 2024;38:734-43. [Crossref] [PubMed]
Qiu S, Mu S, Tao Y, et al. Machine learning model predicts clotting risk during CRRT in ESKD patients: a SHAP-interpretable approach. Ren Fail 2025;47:2562448. [Crossref] [PubMed]
Mizukoshi R, Maruiwa R, Ito K, et al. Machine Learning Approaches for Early Detection of Ossification of Posterior Longitudinal Ligament in Health Screening Settings. Bioengineering (Basel) 2025;12:749. [Crossref] [PubMed]
Wang W, Zeng W, He S, et al. A new model for predicting the occurrence of polycystic ovary syndrome: Based on data of tongue and pulse. Digit Health 2023;9:20552076231160323. [Crossref] [PubMed]
Kuo WY, Huang CC, Liu CF, et al. Utilizing machine learning for predicting mortality in patients with heat-related illness who visited the emergency department. Int J Med Inform 2025;201:105951. [Crossref] [PubMed]
Su M, Guo J, Chen H, et al. Developing a machine learning prediction algorithm for early differentiation of urosepsis from urinary tract infection. Clin Chem Lab Med 2023;61:521-9. [Crossref] [PubMed]
Anand V, Khajuria A, Pachauri RK, et al. Optimized machine learning based comparative analysis of predictive models for classification of kidney tumors. Sci Rep 2025;15:30358. [Crossref] [PubMed]
Wang K, Adjeroh DA, Fang W, et al. Comparison of Deep Learning and Traditional Machine Learning Models for Predicting Mild Cognitive Impairment Using Plasma Proteomic Biomarkers. Int J Mol Sci 2025;26:2428. [Crossref] [PubMed]
Al-Husban SA, Idham MK, Padil KH, et al. Accident severity prediction on arterial roads via multilayer perceptron neural network. Int J Inj Contr Saf Promot 2025;32:376-95. [Crossref] [PubMed]
Yu Z, Kou F, Gao Y, et al. A machine learning model for predicting abnormal liver function induced by a Chinese herbal medicine preparation (Zhengqing Fengtongning) in patients with rheumatoid arthritis based on real-world study. J Integr Med 2025;23:25-35. [Crossref] [PubMed]
Elazab A, Wang C, Abdelaziz M, et al. Alzheimer’s disease diagnosis from single and multimodal data using machine and deep learning models: Achievements and future directions. Expert Syst Appl 2024;255:124780.
Noda R, Ichikawa D, Shibagaki Y. Machine learning-based diagnostic prediction of IgA nephropathy: model development and validation study. Sci Rep 2024;14:12426. [Crossref] [PubMed]
Li X, Xiong X, Liang Z, et al. A machine learning diagnostic model for Pneumocystis jirovecii pneumonia in patients with severe pneumonia. Intern Emerg Med 2023;18:1741-9. [Crossref] [PubMed]
Chen X, He L, Shi K, et al. Interpretable Machine Learning for Fall Prediction Among Older Adults in China. Am J Prev Med 2023;65:579-86. [Crossref] [PubMed]
Shu P, Huang L, Huo S, et al. Machine learning-based risk prediction model for arteriovenous fistula stenosis. Eur J Med Res 2025;30:217. [Crossref] [PubMed]
Chen L, Yuan L, Sun T, et al. The performance of VCS(volume, conductivity, light scatter) parameters in distinguishing latent tuberculosis and active tuberculosis by using machine learning algorithm. BMC Infect Dis 2023;23:881. [Crossref] [PubMed]
Prusty S, Patnaik S, Dash SK. SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Front Nanotechnol 2022;4:972421.
Peng X, Liu Y, Zhang B, et al. A preliminary prediction model of pediatric Mycoplasma pneumoniae pneumonia based on routine blood parameters by using machine learning method. BMC Infect Dis 2024;24:707. [Crossref] [PubMed]
Ye X, Zhao X, Lou Y, et al. Machine learning algorithms with body fluid parameters: an interpretable framework for malignant cell screening in cerebrospinal fluid. Clin Chem Lab Med 2025;63:2012-21. [Crossref] [PubMed]
Zhang F, Wang H, Liu L, et al. Machine learning model for the prediction of gram-positive and gram-negative bacterial bloodstream infection based on routine laboratory parameters. BMC Infect Dis 2023;23:675. [Crossref] [PubMed]
Sang H, Lee H, Lee M, et al. Prediction model for cardiovascular disease in patients with diabetes using machine learning derived and validated in two independent Korean cohorts. Sci Rep 2024;14:14966. [Crossref] [PubMed]
Qu X, Zhang C, Houser SH, et al. Prediction model for early childhood caries risk based on behavioral determinants using a machine learning algorithm. Comput Methods Programs Biomed 2022;227:107221. [Crossref] [PubMed]
Lu Y, Li Y, Chi S, et al. Comparison of machine learning and logistic regression models for predicting emergence delirium in elderly patients: A prospective study. Int J Med Inform 2025;199:105888. [Crossref] [PubMed]
Smith AH, Gray GM, Ashfaq A, et al. Using machine learning to predict five-year transplant-free survival among infants with hypoplastic left heart syndrome. Sci Rep 2024;14:4512. [Crossref] [PubMed]
Lundberg SM, Erion G, Chen H, et al. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat Mach Intell 2020;2:56-67. [Crossref] [PubMed]
Yin H, Wang K, Yang R, et al. A machine learning model for predicting acute exacerbation of in-home chronic obstructive pulmonary disease patients. Comput Methods Programs Biomed 2024;246:108005. [Crossref] [PubMed]
Zhou W, Yan Z, Zhang L. A comparative study of 11 non-linear regression models highlighting autoencoder, DBN, and SVR, enhanced by SHAP importance analysis in soybean branching prediction. Sci Rep 2024;14:5905. [Crossref] [PubMed]
Zhou N, Qin W, Zhang JJ, et al. Epidemiological exploration of the impact of bluetooth headset usage on thyroid nodules using Shapley additive explanations method. Sci Rep 2024;14:14354. [Crossref] [PubMed]
Moor M, Bennett N, Plečko D, et al. Predicting sepsis using deep learning across international sites: a retrospective development and validation study. EClinicalMedicine 2023;62:102124. [Crossref] [PubMed]
Verma S, Maerkisch L, Paderno A, et al. One scan, multiple insights: A review of AI-Driven biomarker imaging and composite measure detection in lung cancer screening. Meta Radiol 2025;3:100124.
Deng Z, Ma X, Zou S, et al. Innovative technologies and their clinical prospects for early lung cancer screening. Clin Exp Med 2025;25:212. [Crossref] [PubMed]
Ahn Y, Lee GD, Choi S, et al. Recurrence Risk Following Percutaneous Transthoracic Needle Biopsy in Patients Undergoing Sublobar Resection for Stage I Lung Cancer. Radiology 2025;317:e250415. [Crossref] [PubMed]
Lojo-Rodríguez I, Botana-Rial M, González-Piñeiro A, et al. Optimizing tissue sampling during medical pleuroscopy for diagnosis of malignant pleural effusion due to lung cancer. Sci Rep 2025;15:37409. [Crossref] [PubMed]
Ma X, Mai Y, Ma Y, et al. Constructing an early warning model for elderly sepsis patients based on machine learning. Sci Rep 2025;15:10580. [Crossref] [PubMed]
Zhu G, Song Y, Lu Z, et al. Machine learning models for predicting metabolic dysfunction-associated steatotic liver disease prevalence using basic demographic and clinical characteristics. J Transl Med 2025;23:381. [Crossref] [PubMed]
Zhao R, Lu H, Yuan H, et al. Plasma proteomic profiles for early detection and risk stratification of non-small cell lung carcinoma: A prospective cohort study with 52,913 participants. Int J Cancer 2025;157:1577-89. [Crossref] [PubMed]
Zeng T, Liang Y, Dai Q, et al. Application of machine learning algorithms to screen potential biomarkers under cadmium exposure based on human urine metabolic profiles. Chin Chem Lett 2022;33:5184-8.
Tian F, Lin Y, Wang L, et al. Construction of a risk screening and visualization system for pulmonary nodule in physical examination population based on feature self-recognition machine learning model. Front Med (Lausanne) 2024;11:1424750. [Crossref] [PubMed]
Ke X, Cai X, Bian B, et al. Predicting early gastric cancer risk using machine learning: A population-based retrospective study. Digit Health 2024;10:20552076241240905. [Crossref] [PubMed]
Crusz SM, Balkwill FR. Inflammation and cancer: advances and new agents. Nat Rev Clin Oncol 2015;12:584-96. [Crossref] [PubMed]
Li Z, Zhang W, Huang J, et al. Machine learning and discriminant analysis model for predicting benign and malignant pulmonary nodules. BMC Med Inform Decis Mak 2025;25:272. [Crossref] [PubMed]
McFarland DC, Jutagir DR, Miller AH, et al. Tumor Mutation Burden and Depression in Lung Cancer: Association With Inflammation. J Natl Compr Canc Netw 2020;18:434-42. [Crossref] [PubMed]
Zhou Y, Tao L, Qiu J, et al. Tumor biomarkers for diagnosis, prognosis and targeted therapy. Signal Transduct Target Ther 2024;9:132. [Crossref] [PubMed]
Xu CM, Luo YL, Li S, et al. Multifunctional neuron-specific enolase: its role in lung diseases. Biosci Rep 2019;39:BSR20192732. [Crossref] [PubMed]
Wang CF, Peng SJ, Liu RQ, et al. The Combination of CA125 and NSE Is Useful for Predicting Liver Metastasis of Lung Cancer. Dis Markers 2020;2020:8850873. [Crossref] [PubMed]
Wang J, Wang X, Mao Y, et al. Peripheral blood tumor marker levels can indicate the location of lung cancer metastasis. Oncol Lett 2025;30:545. [Crossref] [PubMed]
Yang Z, Zhao S, Cheng Z, et al. Combined inflammatory-lipid index and tumor markers for predicting the spatial localization of lesions in early-stage non-small cell lung cancer. Front Oncol 2025;15:1635315. [Crossref] [PubMed]
Wei W, Wang Y, Ouyang R, et al. Machine Learning for Early Discrimination Between Lung Cancer and Benign Nodules Using Routine Clinical and Laboratory Data. Ann Surg Oncol 2024;31:7738-49. [Crossref] [PubMed]
Daramola O, Kavu TD, Kotze MJ, et al. Detecting the most critical clinical variables of COVID-19 breakthrough infection in vaccinated persons using machine learning. Digit Health 2023;9:20552076231207593. [Crossref] [PubMed]
Huang S, Zhou Y, Liang Y, et al. Machine-learning-derived online prediction models of outcomes for patients with cholelithiasis-induced acute cholangitis: development and validation in two retrospective cohorts. EClinicalMedicine 2024;76:102820. [Crossref] [PubMed]
Guan X, Du Y, Ma R, et al. Construction of the XGBoost model for early lung cancer prediction based on metabolic indices. BMC Med Inform Decis Mak 2023;23:107. [Crossref] [PubMed]
Xu J, Zhang W, Bai W, et al. A multi-biomarker machine learning approach for early prediction of interstitial lung disease in rheumatoid arthritis. BMC Pulm Med 2025;25:394. [Crossref] [PubMed]

Cite this article as: Deng W, Pan L, Wang H, Liu Y, Peng X, Yang C, Li J, Han B. Development and validation of machine learning models based on blood routine tests and tumor markers in early screening of primary bronchogenic lung cancer. Transl Lung Cancer Res 2025;14(12):5431-5446. doi: 10.21037/tlcr-2025-970
