Deep learning in histopathology images for prediction of oncogenic driver molecular alterations in lung cancer: a systematic review and meta-analysis
Highlight box
Key findings
• This systematic review and meta-analysis assessed the predictive performance of deep learning (DL) models in identifying oncogenic driver molecular alterations in non-small cell lung cancer (NSCLC) using hematoxylin and eosin-stained whole-slide images (H&E WSIs). Convolutional neural networks (CNNs) were the most frequently employed architectures. Among the analyzed alterations, ALK demonstrated the best performance (sensitivity: 80%, specificity: 85%), followed by EGFR (sensitivity: 78%, specificity: 74%) and TP53 (sensitivity and specificity: 70%). These findings highlight the potential of DL models as screening tools for molecular biomarkers in NSCLC.
What is known and what is new?
• Oncogenic driver molecular alterations like EGFR, ALK, ROS1, and KRAS have transformed NSCLC treatment through targeted therapies. Advances in artificial intelligence (AI) have shown promise in predicting these biomarkers from H&E WSI.
• This study consolidates evidence on the utility of DL models, particularly CNNs, in predicting NSCLC molecular biomarkers, providing performance metrics for key alterations. It underscores their capability to enhance precision medicine.
What is the implication, and what should change now?
• The findings emphasize the potential of DL models to serve as cost-effective, rapid screening tools in NSCLC, complementing traditional molecular diagnostics. However, validation across diverse populations and clinical settings is crucial to ensure robustness and generalizability. Future research should focus on standardizing AI methodologies and integrating these models into clinical workflows to advance personalized treatment strategies in NSCLC.
Introduction
Lung cancer (LC) is the second most diagnosed neoplasm and the number one cause of cancer death worldwide. In 2022, LC accounted for 12.4% of all new cancer cases and 18.7% of all cancer deaths globally (1). Among the LC subtypes, non-small cell lung cancer (NSCLC) accounts for 85% of cases (2). The identification of oncogenic driver molecular alterations (mutations, rearrangements, amplifications) in NSCLC has led to targeted therapies that have significantly improved outcomes in recent years (3). Targeted molecular alterations include EGFR, ALK, ROS1, KRAS, BRAF, HER2, MET, RET, and NTRK (3).
The prevalence of oncogenic drivers varies by ethnicity, smoking status, and gender (4). EGFR mutations, which are the most frequent molecular alterations in NSCLC, differ significantly according to ethnicity, with a prevalence of 49.1% in the Asian population, 23% in the Latino population, and 12.8% in the Caucasian population (5,6). EGFR mutations occur at a significantly higher frequency in lung adenocarcinoma, women, younger patients, and never smokers (3). Deletion of exon 19 and the missense mutations in exon 21 (L858R) are the most common mutations of the EGFR gene (80–90%) and are most sensitive to treatment with tyrosine kinase inhibitors (TKIs) (7). KRAS is a well-known oncogenic driver with a variable prevalence depending on race, with higher rates observed in White (33.38%) and Black individuals (27.33%), and a lower prevalence in Latin American (15%) and Asian populations (11.75%) (3,4,6). KRAS mutations are more frequent in women and younger patients and are often linked to a poorer prognosis compared to KRAS wild-type tumors (8). Similarly, ALK fusions are predominantly found in younger individuals, never-smokers or light smokers, and those with advanced-stage disease. ALK rearrangements are most commonly observed in lung adenocarcinomas exhibiting acinar or solid patterns, or with signet-ring cell morphology (9,10). The prevalence of ALK fusions is around 5% (3,6). Likewise, the prevalence of ROS1 rearrangements is around 2% (3,6) and is more frequent in female patients, non-smokers, and those with adenocarcinoma at advanced stages. These tumors often present a solid architecture with cribriform features, psammoma-rich stroma, and signet-ring cells (11).
In recent years, multiple studies have demonstrated the potential of artificial intelligence (AI) in digital pathology to predict various molecular biomarkers in different tumors (lung, breast, thyroid, melanoma, glioma, liver, bladder, colorectal) using hematoxylin and eosin-stained whole-slide images (H&E WSIs) (12,13). In LC, most of the published literature centers on the distinction between malignant and benign tissue, differentiation among subtypes of LC (adenocarcinoma, squamous cell carcinoma, small cell lung carcinoma), identification of the architectural pattern of adenocarcinoma (lepidic, acinar, papillary, micropapillary, solid), prognosis prediction, mutational status characterization, and programmed death ligand-1 (PD-L1) expression status estimation (14).
Predicting gene expression from WSIs could significantly impact the clinical prognosis of cancer patients (15). This article aims to comprehensively review the published deep learning (DL) models predicting oncogenic driver molecular alterations in NSCLC, based on H&E WSIs, and their current diagnostic test accuracy. We present this article in accordance with the PRISMA reporting checklist (16) (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2024-1196/rc).
Methods
The protocol was submitted and registered in PROSPERO, the International Prospective Register of Systematic Reviews, under the number CRD42024573602.
Information sources and search strategy
Detailed individual search strategies were followed in Embase, LILACS, Medline, Web of Science and Cochrane. The final search date for all databases was August 31, 2024. A manual search was also performed, and the reference lists of the selected articles were carefully examined. In addition, we performed a snowball search on Google Scholar up to January 2025 to further ensure the inclusion of relevant studies. Terms considered included: non-small cell lung carcinoma, squamous cell carcinoma, EGFR, ALK, ROS1, KRAS, BRAF, HER2, MET, RET, NTRK, TP53, STK11, KEAP1, digital pathology, AI and DL. Appropriate truncations and word combinations were used for each database search. The full search strategy can be found in Table S1.
Inclusion criteria
Original studies with DL models utilizing histological images (H&E slides) to predict alterations in actionable genes (e.g., EGFR, ALK, ROS1, KRAS, BRAF, HER2, MET, RET, NTRK), non-actionable genes, and tumor mutational burden (TMB) status in LC were included. Eligible studies were those published in English or Spanish from any country.
Exclusion criteria
The exclusion criteria were as follows: (I) studies that did not specifically differentiate LC data from other tumor types; (II) studies lacking essential data required for meta-analysis (e.g., AUROC, sample size, sensitivity, specificity), even after attempts to contact the authors; (III) studies with inconsistencies between textual and tabular results.
Study selection
The eligibility of the selected articles was assessed in two phases. In phase 1, two authors (D.M.G., G.G.G.) independently screened the studies by title and abstract. In phase 2, the same authors reviewed the full text of the screened articles and excluded those that did not meet the inclusion criteria. Any disagreements were resolved by a third author (R.P.M.). References from relevant articles were manually searched to ensure comprehensive coverage. All data included was reviewed by the authors.
Data collection process and data extraction
The following data were extracted from each article when possible: author name, country of origin of the study, year of publication, true positive (TP), true negative (TN), false positive (FP), false negative (FN), positive predictive value, negative predictive value, N (sample size positive), one row for each gene, 95% confidence interval (CI), area under the receiver operating characteristic (AUROC) for each gene, average AUROC (for the entire model), sensitivity, specificity, studied structures, targeted mutations, magnification, number of WSIs, patch size, pre-processing, DL method, graphics processing units (GPUs) used, number of sources, public databases, molecular alterations in actionable and non-actionable genes. When comparing multiple models, we selected the one with the highest AUROC for inclusion in the meta-analysis. Similarly, when evaluating different AUROCs from the same model but across different groups (e.g., training vs. internal and external validation), we prioritized the performance data from the external validation cohort. Article discrepancies were discussed and solved with input from a third researcher. If the required data was not complete, attempts were made to contact the authors to obtain the missing information.
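The extracted 2×2 counts (TP, FP, TN, FN) determine sensitivity, specificity, and the predictive values directly. The following is a minimal sketch of that arithmetic; the `ConfusionCounts` container, the function names, and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts for one gene in one study (hypothetical container)."""
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Derive the standard diagnostic accuracy measures from 2x2 counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # mutated cases correctly flagged
        "specificity": _ratio(c.tn, c.tn + c.fp),  # wild-type cases correctly cleared
        "ppv": _ratio(c.tp, c.tp + c.fp),          # positive predictive value
        "npv": _ratio(c.tn, c.tn + c.fn),          # negative predictive value
    }


# Illustrative counts only (not from the review).
example = diagnostic_metrics(ConfusionCounts(tp=40, fp=5, tn=45, fn=10))
```

Returning NaN for an empty denominator keeps a study with no positive (or no negative) cases from raising an error during batch extraction; whether such a study should then be excluded is a separate decision.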
Risk of bias and applicability
To assess the methodological quality and applicability of the studies, we applied the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) 2024 update (17). Four reviewers (D.M.G., G.G.G., A.M.Z., J.G.) reviewed 42 sections/topics for artificial intelligence use. Each section/topic was rated as ‘present’ (P), ‘absent’ (A), or ‘not applicable’ (NA).
Data synthesis
From each article, genes with predicted alterations by the models and available performance data were extracted. Only genes reported in four or more studies were included in the meta-analysis.
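The four-study threshold can be expressed as a simple count over studies. This is a sketch only; the function name and the study labels in the example are hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, List


def genes_eligible_for_pooling(
    study_genes: Dict[str, Iterable[str]], min_studies: int = 4
) -> List[str]:
    """Return genes reported by at least `min_studies` distinct studies."""
    counts = Counter(
        gene
        for genes in study_genes.values()
        for gene in set(genes)  # count each gene once per study
    )
    return sorted(gene for gene, n in counts.items() if n >= min_studies)


# Hypothetical study labels, for illustration only.
example = genes_eligible_for_pooling({
    "study_A": ["EGFR", "KRAS"],
    "study_B": ["EGFR", "TP53"],
    "study_C": ["EGFR"],
    "study_D": ["EGFR", "KRAS"],
})
```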
Statistical analysis
For each gene, the data was synthesized to assess the predictive performance, focusing on key metrics like AUROC, sensitivity, and specificity. When necessary, missing data were requested from the authors to ensure completeness. The studies included varied in their methodologies, but consistency in reporting key genes allowed for meaningful aggregation of results across different models.
Summary measures
The primary outcome was the performance of DL models and diagnostic accuracy in H&E-stained lung images for predicting oncogenic molecular alterations in LC. The discriminatory capacity of each model was evaluated by the score of the C-Index or AUROC and summary receiver operating characteristic (SROC). The interpretation of the AUROC scores was made considering the following cut-off points: an AUROC of 0.50 indicates no discriminatory ability (random chance), 0.51–0.60 suggests minimal discrimination, 0.61–0.69 is poor, 0.7–0.8 is acceptable, >0.8 is excellent, and >0.9 is outstanding. Positive and negative predictive values were calculated when needed.
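The cut-off points above can be written as a small lookup. The stated bands leave gaps (for example between 0.60 and 0.61, and between 0.69 and 0.70) and do not cover values below 0.50, so this sketch assumes contiguous bands closed at their upper bounds; that boundary handling and the function name are assumptions, not part of the review protocol.

```python
def interpret_auroc(auroc: float) -> str:
    """Map an AUROC to the qualitative bands used in this review.

    Assumption: bands are made contiguous, and values at or below 0.50
    are all treated as "no discrimination".
    """
    if not 0.0 <= auroc <= 1.0:
        raise ValueError("AUROC must lie in [0, 1]")
    if auroc <= 0.50:
        return "no discrimination"
    if auroc <= 0.60:
        return "minimal"
    if auroc < 0.70:
        return "poor"
    if auroc <= 0.80:
        return "acceptable"
    if auroc <= 0.90:
        return "excellent"
    return "outstanding"
```

For instance, `interpret_auroc(0.75)` falls in the "acceptable" band and `interpret_auroc(0.92)` in the "outstanding" band.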
Results
Results of the search and screening
A total of 6,872 articles were retrieved through the search strategy, with an additional 7 studies identified through manual searches. After removing duplicates, 2,039 articles remained for screening by title and abstract. Following this screening process, 49 articles were selected for full-text review, of which 23 were ultimately included in the final analysis. The largest share of articles (31.82%) originated from China and the United States. Only one article each was identified from Poland, Germany, France, Canada, Japan, and Israel. The authors of two studies, Wang et al. (18) and Fu et al. (19), were affiliated with multiple countries: the United States and Japan, and Russia and the United Kingdom, respectively. Of the authors contacted to obtain the missing information, we were able to establish communication with Coudray et al. (20) and Morel et al. (21). The study selection process is illustrated in Figure 1, following the PRISMA guidelines.
DL models
The main characteristics of the studies included are summarized in Table 1 (15,18-39). Across the studies, a total of 33,268 H&E WSIs were analyzed. The 23 selected articles primarily reported data on lung adenocarcinoma. Of these, two studies included both adenocarcinoma and squamous cell carcinoma, while two others provided data on various cancer types. Tumor tissue was the most common structure studied, with some studies also including adjacent normal tissue, including parenchyma and tumor stroma.
Table 1
Authors and year (ref) | Molecular alteration | AUC (95% CI) | Databases | DL architecture | Additional pre-processing | xAI | Metadata | Total number of WSI | Patch size (pixel) | WSI: training set | WSI: validation set | WSI: test set | External validation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Zhao et al. 2025 (22) | ALK; EGFR; KRAS; LRP1B; ROS1; TP53 | 0.908 (0.78–1.00); 0.953 (0.90–0.99); 0.924 (0.81–1.00); 0.901 (0.77–1.00); 0.971 (0.93–1.00); 0.969 (0.93–1.00) | Institutional†, TCGA | Hybrid architecture with self-supervised learning, residual transformers and MIL | Automatic segmentation and creation of mosaics at 20×, eliminating those with little information | Yes | Yes | 1,716 | 1,120×1,120 | 1,200 | 258 | 258 | Yes |
Zhang et al. 2024 (23) | EGFR | 0.867 (N/A) | Institutional†, TCGA | ViT | Background regions were filtered using adaptive thresholding | Yes | N/A | 563 | 256×256 | 391 | 172 | N/A | Yes |
Morel et al. 2023 (21) | EGFR; KRAS; TP53 | 0.660 (0.62–0.70); 0.570 (0.53–0.61); 0.680 (0.68–0.68) | TCGA | CNN based on U-net, EfficientNetB7 | The foreground was extracted, patches were segmented and embeddings were generated | No | N/A | 719; 541; 459 | 600×600 | CV | CV | CV | No |
Pao et al. 2023 (24) | EGFR | 0.870 (0.86–0.88) | Foundation Medicine for genomic profiling | ResNet-50 | Resize 1,024×1,024 pixel patches to 224×224 pixel patches | Yes | No | 2,099 | 1,024×1,024 | CV | CV | CV | No |
Zhao et al. 2023 (25) | EGFR | 0.820 (N/A) | Institutional† | ResNet-50 | Data augmentation, including random rotation, Gaussian and motion blur, brightness, saturation, contrast and hue adjustment | No | No | 260 | 320×320 | 226 | N/A | 100 | No |
Dammak et al. 2023 (26) | TMB | 0.680 (N/A) | TCGA | VGG-16, Xception, NASNet-Large | Extraction of two tile sets: viable cancer and all cancers, balanced by random reduction and color normalization | No | No | 50 | 224×224 | 23 | 7 | 20 | No |
Mayer et al. 2022 (27) | ALK; ROS1 | 1 (N/A); 0.986 (N/A) | Institutional† | End-to-end CNN–GANs | Data augmentation using preprocessing algorithms and generative networks | No | No | 234 | 1,536×1,536 | 162 | 72 | N/A | Yes |
Chen et al. 2023 (15) | ALK; EGFR; KEAP1; KRAS; STK11; TP53 | 0.655 (N/A); 0.643 (N/A); 0.630 (N/A); 0.608 (N/A); 0.647 (N/A); 0.692 (N/A) | TCGA | Fine-tuned Xception | Background tiles removed, color normalization | Yes | Yes | 434 | 224×224 | CV | CV | CV | No |
Terada et al. 2022 (28) | ALK | 0.730 (0.65–0.82) | Institutional† | MXNet framework with DenseNet-121 backbone | N/A | Yes | No | 300 | 256×256 | 471 | 97 | 179 | Yes |
Rączkowska et al. 2022 (29) | ALK; BRAF; EGFR; KEAP1; KRAS; MET; ROS1; RET; STK11; TP53 | 0.626 (−2.43 to 3.67); 0.571 (−1.38 to 2.52); 0.651 (−0.85 to 2.15); 0.611 (−0.99 to 2.21); 0.624 (−0.88 to 2.12); 0.692 (−1.62 to 3.99); 0.615 (−3.93 to 5.15); 0.671 (−1.60 to 3.94); 0.629 (−1.89 to 3.13); 0.586 (−0.72 to 1.88) | Institutional† | ARA-CNN based on ResNet and DarkNet 19 | Color normalization and background tiles removed | Yes | No | 23 | 172×172 | CV | CV | CV | No |
Tomita et al. 2022 (30) | BRAF; EGFR; KRAS; STK11; TP53 | 0.451 (0.27–0.62); 0.686 (0.62–0.75); 0.629 (0.55–0.70); 0.484 (0.40–0.56); 0.677 (0.60–0.75) | Institutional†, CPTAC-3 | ResNet-18, EfficientNetB0 | Artifacts and non-tumoral patches removed | No | No | 747 | 224×224 | 471 | 97 | 179 | Yes |
Ishii et al. 2022 (31) | ALK; EGFR; KRAS | N/A (N/A) | Institutional† | MobileNet V2 | Laplacian filtering, data augmentation through horizontal inversion | Yes | Yes | 108 | 299×299 | 77 | 31 | N/A | No |
Niu et al. 2022 (32) | TMB | 0.641 (N/A) | TCGA | ResNet-18 | Color normalization and background tiles removed | No | N/A | 427 | 512×512 | CV | CV | CV | No |
Sadhwani et al. 2021 (33) | TMB | 0.710 (0.63–0.79) | TCGA | Inception V3, MobileNet | N/A | Yes | No | 472 | 512×512 | 295 | N/A | 177 | No |
Wang et al. 2021 (18) | EGFR; FAT1; KEAP1; KRAS; NF1; STK11; TP53 | 0.824 (N/A); 0.564 (N/A); 0.768 (N/A); 0.711 (N/A); 0.723 (N/A); 0.731 (N/A); 0.759 (N/A) | TCGA | ResNet, VGG networks, AlexNet, InceptionV3, ShuffleNetV2, MobileNetV2, GoogleNet, MNASNet | Image tiling, blur detection, color correction and data splitting | No | No | N/A | 512×512 | 72% | 8% | 20% | No |
Huang et al. 2021 (34) | MET | 0.860 (N/A) | TCGA | DeepIMLH: end-to-end CNN based on ResNet | DCGMM for color normalization | No | No | 180 | 512×512 | CV | CV | CV | No |
Yang et al. 2021 (35) | EGFR; STK11; TP53 | 0.840 (N/A); 0.710 (N/A); 0.870 (N/A) | TCGA, ICGC | DeepLRHE: end-to-end CNN based on ResNet | Color segmentation and normalization using a deep convolutional Gaussian mixture model | Yes | No | 180 | 512×512 | CV | CV | CV | No |
Noorbakhsh et al. 2020 (36) | TP53 | 0.640 (0.76–0.98) | TCGA, CPTAC | Inception V3 | Background and regions without tissue were removed, and regions with excess fat were limited | Yes | No | 595 | 512×512 | 396 | N/A | 169 | Yes |
Kather et al. 2020 (37) | ALK; BRAF; EGFR; FAT1; KEAP1; KRAS; MET; NF1; STK11; TP53 | 0.535 (N/A); 0.483 (N/A); 0.624 (N/A); 0.544 (N/A); 0.619 (N/A); 0.560 (N/A); 0.412 (N/A); 0.524 (N/A); 0.638 (N/A); 0.720 (N/A) | TCGA, DACHS | ResNet-18, AlexNet, Inception-V3, DenseNet-201 and ShuffleNet | Artifacts and non-tumoral patches removed | No | Yes | 457 | 512×512 | CV | CV | CV | Yes |
Fu et al. 2020 (19) | BRAF; EGFR; FAT1; KEAP1; KRAS; STK11; TP53 | 0.439 (0.31–0.56); 0.719 (0.60–0.83); 0.409 (0.24–0.57); 0.541 (0.43–0.65); 0.572 (0.48–0.66); 0.523 (0.37–0.67); 0.736 (0.66–0.80) | TCGA | Customized Inception V4 | Artifacts and non-tumoral patches removed | Yes | Yes | 182 | 512×512 | CV | CV | CV | Yes |
Jain et al. 2020 (38) | TMB | 0.920 (0.89–0.95) | TCGA | Inception V3 | Artifacts and non-tumoral patches removed | No | Yes | 760 | 512×512 | 534 | 109 | 117 | No |
Sha et al. 2019 (39) | PD-L1 | 0.800 (N/A) | Institutional† | ResNet-18 | Image augmentations, including random crop, random rotation, random flip, and color jitter | No | No | 130 | 466×466 | 82 | N/A | 48 | No |
Coudray et al. 2018 (20) | EGFR; FAT1; KEAP1; KRAS; NF1; STK11; TP53 | 0.754 (0.74–0.76); 0.739 (0.73–0.74); 0.684 (0.67–0.69); 0.814 (0.80–0.82); 0.714 (0.70–0.72); 0.845 (0.83–0.85); 0.674 (0.66–0.68) | TCGA | Fine-tuned Inception V3 | Removing images with more than 50% background, ×20 enlargement | No | No | 567 | 512×512 | 403 | 85 | 79 | No |
†, Institution’s own database. ARA, accurate, reliable and active; AUC, area under the curve; CI, confidence interval; CNN, convolutional neural network; CPTAC, clinical proteomic tumor analysis consortium; CV, cross validation; DACHS, darmkrebs: chancen der verhütung durch screening; DL, deep learning; GAN, generative adversarial network; ICGC, International Cancer Genome Consortium; N/A, not applicable; TCGA, The Cancer Genome Atlas; ViT, vision transformer; WSI, whole slide image; xAI, explainable artificial intelligence.
The most frequently targeted genes were EGFR (60.87%), TP53 (43.48%), and KRAS (43.48%), whereas actionable genes such as MET and ROS1 were less common (13.04%). Eleven studies evaluated multiple genes, while the remainder focused on a single gene. Most models were trained using data from The Cancer Genome Atlas (TCGA) database (65.22%), while five studies relied on proprietary databases. Of the 23 studies identified, only five performed external validation.
Convolutional neural networks (CNNs) were the most common architectures, with InceptionV3 used most often (26.09%), followed by ResNet-18, EfficientNet, and DenseNet (17.39%, 8.70% and 4.35%, respectively). While some of the highest-performing models used a single CNN to evaluate a single gene, most models employed multiple DL architectures for analyzing several genes. Transfer learning was prevalent, utilizing pre-trained architectures such as InceptionV3, ResNet, Xception, and EfficientNet, along with self-supervised models like Vision Transformer and BYOL. Additionally, multi-instance learning and attention mechanisms were used to identify mutations or highlight relevant areas in lung adenocarcinoma.
Various preprocessing techniques were applied with the aim of improving model accuracy, including background removal, artifact filtering, and exclusion of non-tumor areas through manual annotation or CNNs. Color normalization ensured sample standardization, and Mayer et al. (27) applied generative adversarial networks (GANs) for synthetic data generation to boost model robustness. Despite these efforts, image preprocessing showed no direct impact on model AUROC, likely due to its role in initial setup rather than performance. Finally, ten studies incorporated explainable AI methods to visualize tissue regions critical for mutation detection and cancer classification.
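Two of the preprocessing steps reported in Table 1, tiling a slide into fixed-size patches and discarding tiles that are mostly background, can be sketched in plain Python. This is not any study's pipeline: the function names are hypothetical, the tile is assumed to be a grayscale array of 0–255 values, and the `white_level` of 220 used to call a pixel "background" is an arbitrary assumption. The 512-pixel patch and the 50% background limit mirror values listed in Table 1.

```python
from typing import Iterator, Sequence, Tuple


def tile_origins(width: int, height: int, patch: int = 512) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) corners of non-overlapping, fully contained tiles."""
    for y in range(0, height - patch + 1, patch):
        for x in range(0, width - patch + 1, patch):
            yield x, y


def background_fraction(gray_tile: Sequence[Sequence[int]], white_level: int = 220) -> float:
    """Fraction of pixels at least as bright as `white_level` (assumed background)."""
    pixels = [p for row in gray_tile for p in row]
    if not pixels:
        raise ValueError("tile contains no pixels")
    return sum(p >= white_level for p in pixels) / len(pixels)


def keep_tile(
    gray_tile: Sequence[Sequence[int]],
    max_background: float = 0.5,
    white_level: int = 220,
) -> bool:
    """Keep a tile unless more than `max_background` of it is background."""
    return background_fraction(gray_tile, white_level) <= max_background
```

Real pipelines typically read tiles lazily from pyramidal slide files and often use more robust tissue detection than a fixed brightness threshold; the sketch only illustrates the filtering logic.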
Meta-analysis of molecular alterations
Meta-analyses of the performance of various DL models were conducted for different genes. ALK was evaluated in four articles and was the only gene with excellent performance (AUROC >0.8), achieving an overall sensitivity of 80% (95% CI: 53–94%) and specificity of 85% (95% CI: 39–98%). EGFR and TP53 demonstrated acceptable performance (AUROC 0.78 and 0.75, respectively). EGFR, assessed in 13 articles, showed a sensitivity of 80% (95% CI: 72–86%) and specificity of 77% (95% CI: 69–83%), while TP53, evaluated in 10 articles, had both sensitivity and specificity of 70% (95% CI: 65–75%) (Figure 2). Genes with poor performance (AUROC 0.61–0.69) included STK11, with sensitivity of 65% (95% CI: 56–73%) and specificity of 65% (95% CI: 57–72%) across eight articles; KRAS, with sensitivity of 63% (95% CI: 56–69%) and specificity of 62% (95% CI: 54–69%) in eight articles; FAT1, with sensitivity of 60% (95% CI: 48–71%) and specificity of 61% (95% CI: 52–69%) in four articles; and TMB, with sensitivity of 70% (95% CI: 60–78%) and specificity of 71% (95% CI: 53–84%) in four articles. KEAP1, evaluated in six articles, showed a sensitivity of 56% (95% CI: 38–73%) and specificity of 73% (95% CI: 58–84%). Finally, BRAF, assessed in four articles, demonstrated the lowest performance, with a sensitivity of 51% (95% CI: 40–61%) and a specificity of 48% (95% CI: 44–52%) (Figures S1-S14).

SROC analysis of DL models applied to histopathological images shows varied diagnostic performance for EGFR, ALK, and TP53 alterations. EGFR models exhibit moderate-to-high sensitivity and intermediate specificity, with moderate homogeneity among studies (Figure 2D). ALK models achieve high sensitivity but display notable inter-study variability. TP53 models demonstrate moderate diagnostic accuracy with consistent global estimates but some heterogeneity across studies.
Quality assessment
Each applicable CLAIM 2024 criterion (17) was assigned a value of 1 when present and 0 when absent, while ‘NA’ was recorded when a criterion did not apply. This scoring allowed us to calculate the percentage of criteria fulfilled by each study, as shown in Table S2. A percentage equal to or greater than 70% was considered to reflect good methodological quality. In this assessment, all studies achieved scores above 70%, indicating that they met a substantial proportion of the evaluated quality criteria. The dataset analysis by quartiles showed that Q1 ranged from 77.27 to 84.09, Q2 from 86.36 to 93.18, Q3 was concentrated at 93.18, and Q4 ranged from 93.18 to 95.45. The concentration of values in the upper quartiles indicates high quality.
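The scoring rule described above reduces to the share of applicable items rated as present. A minimal sketch follows; the function names and the example ratings are hypothetical, and the 70% threshold is the one stated in the text.

```python
from typing import Sequence

VALID_RATINGS = {"P", "A", "NA"}  # present, absent, not applicable


def claim_score(ratings: Sequence[str]) -> float:
    """Percentage of applicable checklist items rated 'P'; 'NA' items are ignored."""
    unknown = set(ratings) - VALID_RATINGS
    if unknown:
        raise ValueError(f"unexpected ratings: {sorted(unknown)}")
    applicable = [r for r in ratings if r != "NA"]
    if not applicable:
        raise ValueError("no applicable items to score")
    present = sum(1 for r in applicable if r == "P")
    return 100.0 * present / len(applicable)


def is_good_quality(ratings: Sequence[str], threshold: float = 70.0) -> bool:
    """Apply the review's cut-off: a score of at least 70% counts as good quality."""
    return claim_score(ratings) >= threshold
```

Excluding 'NA' items from the denominator means a study is not penalized for criteria that do not apply to its design.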
Sensitivity analysis
A sensitivity analysis was performed by systematically excluding the best-performing models for the genes (EGFR, TP53, KRAS, ALK) to evaluate their impact on the overall estimates. Excluding the best-performing models resulted in a decrease in both sensitivity and specificity [EGFR: pooled sensitivity 0.78 (95% CI: 0.70–0.84), specificity 0.75 (95% CI: 0.69–0.80); KRAS: pooled sensitivity 0.62 (95% CI: 0.54–0.70), specificity 0.61 (95% CI: 0.53–0.69); TP53: pooled sensitivity 0.68 (95% CI: 0.65–0.71), specificity 0.68 (95% CI: 0.64–0.71)]. Potential sources of heterogeneity were examined, including the type of DL model architecture, the characteristics of the WSI, and the methodological quality of the studies. These analyses supported the stability of the conclusions and the generalizability of the models across different clinical settings and H&E WSIs.
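The idea of re-pooling after dropping the best-performing model can be illustrated as follows. The text does not state which pooling model was used, and diagnostic accuracy meta-analyses often rely on bivariate models, so this sketch swaps in a simpler univariate DerSimonian-Laird random-effects pooling of logit-transformed proportions purely for illustration. All function names and study counts are hypothetical.

```python
import math
from typing import List, Tuple

Study = Tuple[int, int]  # (events, total), e.g. (true positives, mutated cases)


def _logit_and_variance(events: int, total: int) -> Tuple[float, float]:
    """Logit of a proportion and its variance, with a 0.5 continuity correction."""
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("need 0 <= events <= total and total > 0")
    a = events + 0.5
    b = total - events + 0.5
    return math.log(a / b), 1.0 / a + 1.0 / b


def _expit(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pool_random_effects(studies: List[Study]) -> Tuple[float, float, float]:
    """Return (pooled proportion, lower 95% limit, upper 95% limit)."""
    if not studies:
        raise ValueError("at least one study is required")
    ys, vs = zip(*(_logit_and_variance(x, n) for x, n in studies))
    w = [1.0 / v for v in vs]  # inverse-variance (fixed-effect) weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance (DerSimonian-Laird), truncated at zero.
    tau2 = max(0.0, (q - (len(studies) - 1)) / c) if len(studies) > 1 and c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in vs]
    y_pooled = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return _expit(y_pooled), _expit(y_pooled - 1.96 * se), _expit(y_pooled + 1.96 * se)


def pool_without_best(studies: List[Study]) -> Tuple[float, float, float]:
    """Re-pool after dropping the study with the highest observed proportion."""
    best = max(range(len(studies)), key=lambda i: studies[i][0] / studies[i][1])
    return pool_random_effects([s for i, s in enumerate(studies) if i != best])
```

Comparing `pool_random_effects(studies)` with `pool_without_best(studies)` shows how much a single high-performing model pulls the pooled estimate upward.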
Discussion
This study provides an analysis of DL models for predicting oncogenic driver molecular alterations in NSCLC using H&E WSIs, focusing on their diagnostic test accuracy. The meta-analysis identified EGFR as the most studied gene, demonstrating high sensitivity and specificity for mutation prediction. Key techniques included fine-tuned transfer learning, end-to-end CNN architectures, and ViT-based models. Zhao et al. [2025] (22) used a hybrid architecture with self-supervised learning, Yang et al. (40) used an end-to-end CNN with ResNet-50, while Coudray et al. (20) and Pao et al. (24) applied fine-tuned transfer learning. Among these, Zhao et al. (22) achieved the highest AUROC (0.95) with a self-supervised approach combining multiple instance learning (MIL) and a residual transformer, improving scalability and reducing reliance on manual labels. While approaches with end-to-end CNNs and fine-tuning using U-Net, EfficientNetB7, and ResNet-50 have demonstrated strong performance in image segmentation and classification, this hybrid model outperforms them by capturing long-range spatial relationships and generalizing better across datasets without detailed annotations (41). However, it is important to note that an AUROC of 0.80 does not imply that these models can replace conventional diagnostics; rather, they complement existing approaches, enhancing decision-making in clinical practice.
Other outstanding models are those designed for the prediction of ALK mutations. Among all the models that evaluated ALK mutations and all gene mutations, the best was that of Mayer et al. (27). This model also predicted ROS1 mutations, and for both genes discriminatory capacity was outstanding (AUROC 1.00 and 0.98, respectively). Still, the sensitivity and specificity confidence intervals were wide. Unlike the EGFR models and others, this model was the only one that used unsupervised learning. It employed an on-the-fly image mosaic selection method, randomly sampling regions without pre-creating patches. Additionally, it utilized data augmentation with preprocessing algorithms and generative adversarial networks (GANs), followed by fine-tuning on a different dataset (end-to-end CNN-GANs). In contrast, the other models relied on architectures such as deep CNNs (EfficientNetB7, EfficientNetB0, ResNet), depth-separable convolution networks (Xception, MobileNet, ShuffleNetV2), advanced hybrid networks (KAT-ViT, NASNet-Large, InceptionV3, InceptionV4), and transfer learning (Xception, DeepLRHE). The superiority of Mayer et al.’s model can be attributed to its integration of large-scale, diverse datasets (27), the use of GANs for effective data augmentation (42), and pretraining on heterogeneous cancer data followed by fine-tuning with LC images, which significantly enhanced adaptability and precision (43). Furthermore, its application of unsupervised learning enabled the discovery of complex, unlabeled patterns, underscoring the model’s robust performance.
The evolution of AI in pathology has progressed from basic pattern recognition to advanced DL models that not only analyze high-resolution digital images but also integrate genomic and clinical data for a more comprehensive diagnostic approach. DL has transformed histopathological image analysis, with early CNNs like Inception and ResNet giving way to sophisticated end-to-end architectures and hybrid models that incorporate self-supervised learning, transformers, and MIL, along with Vision Transformers for molecular classification. Improved preprocessing techniques—such as U-Net segmentation, Gaussian-based color normalization, and GANs—yield higher-quality data, paving the way for more accurate and interpretable diagnostics. These advancements, combined with future generative and multimodal approaches, are setting new standards in diagnostic performance and clinical workflow optimization, ultimately driving personalized healthcare solutions (44,45).
For prediction outcomes, WSI pre-processing stands out as one key feature that influences model performance. Most of the models used a patch size of 512×512 pixels and a WSI number of 180, while the outstanding models employed more pixels and a considerably higher number of WSI. Mayer et al. (27) utilized the largest number of WSI (21,299) and pixels (1,536×1,536), followed by Pao et al. (24) (2,099 WSI and 1,024×1,024 pixels) and Jain et al. (38) (760 WSI and 512×512 pixels). Model performance tended to increase with the number of images and pixels. Pao et al. (24) and Mayer et al. (27) also differed from most studies in the databases they used, which were Foundation Medicine genomic profiling and institutional, respectively, while the most used database was TCGA. The use of TCGA has limitations due to the lack of pixel-level or regional labels, leading to reliance on weak labels associated with the primary diagnosis of the entire image, which may include healthy tissue. The absence of detailed segmentation may impact the quality and accuracy of the analysis (46).
To extract patch-level morphological features, Zhang et al. (23) used a CNN with contrastive learning, paired with a vision transformer for whole-slide mutation prediction. Their model, trained using a self-supervised BYOL-based method and fine-tuned with contrastive divergence clustering, demonstrated excellent predictive performance for EGFR mutations. This patch-based learning approach focuses on smaller regions to reduce computational complexity while retaining critical tumor features and capturing spatial relationships for accurate and interpretable predictions. The algorithm identified EGFR-19del and EGFR-L858R mutations with strong internal and external validation. Tumors with 19del mutations displayed glandular and cribriform patterns with mucin secretion, while L858R-mutated tumors exhibited papillary and micropapillary structures, often associated with stromal reactions and lymphocytic infiltration. EGFR-mutated tumors showed reduced T-cell infiltration and lower immunogenicity compared to wild-type samples (47). Additionally, Pao et al. (24) highlighted that wild-type tumors with high prediction scores showed attention patches linked to increased immune infiltration. Despite a 0.003 AUROC difference favoring Pao’s larger dataset, Zhang’s model achieved comparable results due to its self-supervised design and vision transformer architecture, overcoming limitations of fewer and lower-resolution images, underscoring the potential of these methods to improve tumor microenvironment characterization (48).
In recent years, non-targetable mutations such as TP53, STK11, and KEAP1, as well as TMB, have been recognized for their significant impact on clinical prognosis (49,50). TP53 models achieved acceptable diagnostic accuracy, with sensitivity and specificity around 70%, particularly when using Inception-based architectures with preprocessing. For STK11, the best detection performance (AUROC 0.845) came from Inception V3, while simpler models underperformed. KEAP1 mutation detection showed AUROCs ranging from 0.541 to 0.684, again with Inception V3 leading. TMB, a crucial immunotherapy biomarker in NSCLC (51), also benefited from advanced architectures. Sadhwani et al. (33) reported an AUROC of 0.71 using Inception V3, while Niu et al. (32) achieved 0.641 with ResNet-18 and preprocessing. Jain et al. (38) attained the highest AUROC of 0.92 with Inception V3 and artifact removal, while Dammak et al. (26) obtained 0.68 with VGG16 and smaller images, revealing limitations. Overall, Inception V3 consistently outperformed other architectures, demonstrating its superior capability in detecting mutations and predicting biomarkers from histological images.
Building on these advancements, the integration of AI into routine pathological biomarker diagnostics marks a transformative step forward. As Mayer et al. (27) suggest, AI can complement traditional methods such as immunohistochemistry (IHC), fluorescence in situ hybridization (FISH), and next-generation sequencing (NGS) by rapidly analyzing WSIs and inferring molecular alterations within minutes. This not only accelerates the diagnostic process but also enhances its precision, as illustrated by a case report in which an EGFR mutation in NSCLC adenocarcinoma was identified within 48 hours using AI (52). Such innovations exemplify the potential of AI to bridge the gap between computational advances and clinical applicability, streamlining workflows and improving patient outcomes. Furthermore, pre-commercial digital pathology AI panel systems, including LungOI as well as others such as Paige.AI and Philips IntelliSite, have been comprehensively reviewed in recent studies (53,54). These systems, which integrate advanced image analysis algorithms, are steadily progressing toward clinical implementation (52).
The main limitation of this study is restricted data access, which limited the scope of the analysis. When authors did not provide additional information upon request, we estimated predictive performance from the data available in their publications. The use of internal hospital databases constrained the generalizability of findings to external populations. Not all studies used external data for validation; although most relied on public databases, this was not universal, which affects the number of WSIs used. In addition, most of the studies used TCGA databases for their analysis. TCGA data have been reported to contain quality issues, and some cases have been found to be misclassified (46). Because so many analyses draw on this single source, pooling them in a meta-analysis may be problematic, as similar results could be shaped by the same data quality concerns. Finally, the generalizability of the findings is limited because only 23 studies were ultimately included in this review, and only 5 of them had external testing cohorts, which constrains external validity.
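When a publication reports only summary metrics, one common way to estimate the underlying counts is to back-calculate a 2x2 table from the reported sensitivity, specificity, and the numbers of mutated and wild-type cases. The sketch below illustrates that general approach; it is an assumption about how such estimation can be done and is not a description of the exact procedure used in this review. The function name `reconstruct_2x2` and the input values are hypothetical.

```python
def reconstruct_2x2(sensitivity, specificity, n_positive, n_negative):
    """Estimate TP, FN, TN, FP counts from reported summary metrics.

    Counts are rounded to whole cases, so the metrics recomputed from the
    table can differ slightly from the reported values.
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    tp = round(sensitivity * n_positive)
    tn = round(specificity * n_negative)
    return {
        "tp": tp,
        "fn": n_positive - tp,
        "tn": tn,
        "fp": n_negative - tn,
    }


# Hypothetical cohort: 50 mutated and 200 wild-type cases.
table = reconstruct_2x2(0.80, 0.85, n_positive=50, n_negative=200)
```

Tables reconstructed in this way are estimates, so they inherit any rounding in the published metrics and should be flagged as derived rather than reported data.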
Reporting of cohort characteristics also varied across the included studies. Some of the 23 selected studies did not provide comprehensive demographic or clinical metadata, such as race, age, gender, or other relevant information. This lack of detailed population characterization limits our ability to create a consistent and complete summary of the cohorts and hampers meaningful cross-study comparisons. Future research should prioritize standardized and thorough reporting of cohort characteristics, particularly in studies utilizing WSI data, to enhance transparency and support more robust comparative analyses.
Conclusions
This systematic review and meta-analysis provides a comprehensive overview of DL models used to predict clinically relevant molecular alterations in LC. DL models show promise as initial screening tools and may also be valuable in cases where tissue samples for molecular testing are unavailable. Further research is needed across diverse populations to generate real-world data on the prediction of clinical mutations. Additionally, studies evaluating survival outcomes, treatment responses, and resistance to TKIs are essential to further validate these models. Moreover, recent advancements in genome editing and spatial transcriptomics have provided new insights into the association between genetic abnormalities and pathological morphological features. Future research should explore how these emerging technologies can complement DL models, improving predictive accuracy and enhancing personalized medicine approaches.
Finally, external validation across diverse datasets and populations is essential to confirm the generalizability of these models. Additionally, addressing racial variability helps mitigate biases and account for environmental factors influencing health outcomes. Attention to these aspects would enhance the fairness and effectiveness of the models, reinforcing their potential for practical applications.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the PRISMA reporting checklist. Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2024-1196/rc
Peer Review File: Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2024-1196/prf
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2024-1196/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Wéber A, Morgan E, Vignat J, et al. Lung cancer mortality in the wake of the changing smoking epidemic: a descriptive study of the global burden in 2020 and 2040. BMJ Open 2023;13:e065303. [Crossref] [PubMed]
- Khodabakhshi Z, Mostafaei S, Arabi H, et al. Non-small cell lung carcinoma histopathological subtype phenotyping using high-dimensional multinomial multiclass CT radiomics signature. Comput Biol Med 2021;136:104752. [Crossref] [PubMed]
- Sholl LM, Cooper WA, Kerr KM, et al. IASLC atlas of molecular testing for targeted therapy in lung cancer. Denver: IASLC; 2023.
- Shi H, Seegobin K, Heng F, et al. Genomic landscape of lung adenocarcinomas in different races. Front Oncol 2022;12:946625. [Crossref] [PubMed]
- Melosky B, Kambartel K, Häntschel M, et al. Worldwide Prevalence of Epidermal Growth Factor Receptor Mutations in Non-Small Cell Lung Cancer: A Meta-Analysis. Mol Diagn Ther 2022;26:7-18. [Crossref] [PubMed]
- Parra-Medina R, Castañeda-González JP, Montoya L, et al. Prevalence of oncogenic driver mutations in Hispanics/Latin patients with lung cancer. A systematic review and meta-analysis. Lung Cancer 2023;185:107378. [Crossref] [PubMed]
- Castañeda-González JP, Chaves JJ, Parra-Medina R. Multiple mutations in the EGFR gene in lung cancer: a systematic review. Transl Lung Cancer Res 2022;11:2148-63. [Crossref] [PubMed]
- Finn SP, Addeo A, Dafni U, et al. Prognostic Impact of KRAS G12C Mutation in Patients With NSCLC: Results From the European Thoracic Oncology Platform Lungscape Project. J Thorac Oncol 2021;16:990-1002. [Crossref] [PubMed]
- Fois SS, Paliogiannis P, Zinellu A, et al. Molecular Epidemiology of the Main Druggable Genetic Alterations in Non-Small Cell Lung Cancer. Int J Mol Sci 2021;22:612. [Crossref] [PubMed]
- Chaves JJ, Carvajal Fierro C, Parra-Medina R. Primary adenocarcinoma of the lung with signet-ring cells and ALK rearrangement. Two case reports. Revista Colombiana de Neumología 2022;30:80-5.
- Gendarme S, Bylicki O, Chouaid C, et al. ROS-1 Fusions in Non-Small-Cell Lung Cancer: Evidence to Date. Curr Oncol 2022;29:641-58. [Crossref] [PubMed]
- El Nahhas OSM, Loeffler CML, Carrero ZI, et al. Regression-based Deep-Learning predicts molecular biomarkers from pathology slides. Nat Commun 2024;15:1253. [Crossref] [PubMed]
- McGenity C, Clarke EL, Jennings C, et al. Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy. NPJ Digit Med 2024;7:114. [Crossref] [PubMed]
- Davri A, Birbas E, Kanavos T, et al. Deep Learning for Lung Cancer Diagnosis, Prognosis and Prediction Using Histological and Cytological Images: A Systematic Review. Cancers (Basel) 2023;15:3981. [Crossref] [PubMed]
- Chen Z, Li X, Yang M, et al. Optimization of deep learning models for the prediction of gene mutations using unsupervised clustering. J Pathol Clin Res 2023;9:3-17. [Crossref] [PubMed]
- Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372: [Crossref] [PubMed]
- Tejani AS, Klontzas ME, Gatti AA, et al. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 Update. Radiol Artif Intell 2024;6:e240300. [Crossref] [PubMed]
- Wang Y, Coudray N, Zhao Y, et al. HEAL: an automated deep learning framework for cancer histopathology image analysis. Bioinformatics 2021;37:4291-5. [Crossref] [PubMed]
- Fu Y, Jung AW, Torne RV, et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat Cancer 2020;1:800-10. [Crossref] [PubMed]
- Coudray N, Ocampo PS, Sakellaropoulos T, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med 2018;24:1559-67. [Crossref] [PubMed]
- Morel LO, Derangère V, Arnould L, et al. Preliminary evaluation of deep learning for first-line diagnostic prediction of tumor mutational status. Sci Rep 2023;13:6927. [Crossref] [PubMed]
- Zhao Y, Xiong S, Ren Q, et al. Deep learning using histological images for gene mutation prediction in lung cancer: a multicentre retrospective study. Lancet Oncol 2025;26:136-46. [Crossref] [PubMed]
- Zhang W, Wang W, Xu Y, et al. Prediction of Epidermal Growth Factor Receptor Mutation Subtypes in Non-Small Cell Lung Cancer From Hematoxylin and Eosin-Stained Slides Using Deep Learning. Lab Invest 2024;104:102094. [Crossref] [PubMed]
- Pao JJ, Biggs M, Duncan D, et al. Predicting EGFR mutational status from pathology images using a real-world dataset. Sci Rep 2023;13:4404. [Crossref] [PubMed]
- Zhao D, Zhao Y, He S, et al. High accuracy epidermal growth factor receptor mutation prediction via histopathological deep learning. BMC Pulm Med 2023;23:244. [Crossref] [PubMed]
- Dammak S, Cecchini MJ, Breadner D, et al. Using deep learning to predict tumor mutational burden from scans of H&E-stained multicenter slides of lung squamous cell carcinoma. J Med Imaging (Bellingham) 2023;10:017502. [Crossref] [PubMed]
- Mayer C, Ofek E, Fridrich DE, et al. Direct identification of ALK and ROS1 fusions in non-small cell lung cancer from hematoxylin and eosin-stained slides using deep learning algorithms. Mod Pathol 2022;35:1882-7. [Crossref] [PubMed]
- Terada Y, Takahashi T, Hayakawa T, et al. Artificial Intelligence–Powered Prediction of ALK Gene Rearrangement in Patients With Non–Small-Cell Lung Cancer. JCO Clin Cancer Inform 2022;6:e2200070. [Crossref] [PubMed]
- Rączkowska A, Paśnik I, Kukiełka M, et al. Deep learning-based tumor microenvironment segmentation is predictive of tumor mutations and patient survival in non-small-cell lung cancer. BMC Cancer 2022;22:1001. [Crossref] [PubMed]
- Tomita N, Tafe LJ, Suriawinata AA, et al. Predicting oncogene mutations of lung cancer using deep learning and histopathologic features on whole-slide images. Transl Oncol 2022;24:101494. [Crossref] [PubMed]
- Ishii S, Takamatsu M, Ninomiya H, et al. Machine learning-based gene alteration prediction model for primary lung cancer using cytologic images. Cancer Cytopathol 2022;130:812-23. [Crossref] [PubMed]
- Niu Y, Wang L, Zhang X, et al. Predicting Tumor Mutational Burden From Lung Adenocarcinoma Histopathological Images Using Deep Learning. Front Oncol 2022;12:927426. [Crossref] [PubMed]
- Sadhwani A, Chang HW, Behrooz A, et al. Comparative analysis of machine learning approaches to classify tumor mutation burden in lung adenocarcinoma using histopathology images. Sci Rep 2021;11:16605. [Crossref] [PubMed]
- Huang K, Mo Z, Zhu W, et al. Prediction of Target-Drug Therapy by Identifying Gene Mutations in Lung Cancer With Histopathological Stained Image and Deep Learning Techniques. Front Oncol 2021;11:642945. [Crossref] [PubMed]
- Yang Y, Yang J, Liang Y, et al. Identification and Validation of Efficacy of Immunological Therapy for Lung Cancer From Histopathological Images Based on Deep Learning. Front Genet 2021;12:642981. [Crossref] [PubMed]
- Noorbakhsh J, Farahmand S, Foroughi Pour A, et al. Deep learning-based cross-classifications reveal conserved spatial behaviors within tumor histological images. Nat Commun 2020;11:6367. [Crossref] [PubMed]
- Kather JN, Heij LR, Grabsch HI, et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Nat Cancer 2020;1:789-99. [Crossref] [PubMed]
- Jain MS, Massoud TF. Predicting tumour mutational burden from histopathological images using multiscale deep learning. Nat Mach Intell 2020;2:356-62.
- Sha L, Osinski BL, Ho IY, et al. Multi-Field-of-View Deep Learning Model Predicts Nonsmall Cell Lung Cancer Programmed Death-Ligand 1 Status from Whole-Slide Hematoxylin and Eosin Images. J Pathol Inform 2019;10:24. [Crossref] [PubMed]
- Yang Y, Yang J, Liang Y, et al. Identification and Validation of Efficacy of Immunological Therapy for Lung Cancer From Histopathological Images Based on Deep Learning. Front Genet 2021;12:642981. [Crossref] [PubMed]
- Jena B, Saxena S, Nayak GK, et al. Artificial intelligence-based hybrid deep learning models for image classification: The first narrative review. Comput Biol Med 2021;137:104803. [Crossref] [PubMed]
- Tran NT, Tran VH, Nguyen NB, et al. On Data Augmentation for GAN Training. IEEE Trans Image Process 2021;30:1882-97. [Crossref] [PubMed]
- Yong MP, Hum YC, Lai KW, et al. Histopathological Cancer Detection Using Intra-Domain Transfer Learning and Ensemble Learning. IEEE Access 2024;12:1434-57.
- Wu Y, Cheng M, Huang S, et al. Recent Advances of Deep Learning for Computational Histopathology: Principles and Applications. Cancers (Basel) 2022;14:1199. [Crossref] [PubMed]
- Perez-Lopez R, Ghaffari Laleh N, Mahmood F, et al. A guide to artificial intelligence for cancer researchers. Nat Rev Cancer 2024;24:427-41. [Crossref] [PubMed]
- Dehkharghanian T, Bidgoli AA, Riasatian A, et al. Biased data, biased AI: deep networks predict the acquisition site of TCGA images. Diagn Pathol 2023;18:67. [Crossref] [PubMed]
- Dong ZY, Zhang JT, Liu SY, et al. EGFR mutation correlates with uninflamed phenotype and weak immunogenicity, causing impaired response to PD-1 blockade in non-small cell lung cancer. Oncoimmunology 2017;6:e1356145. [Crossref] [PubMed]
- Dimitriou N, Arandjelović O, Caie PD. Deep Learning for Whole Slide Image Analysis: An Overview. Front Med (Lausanne) 2019;6:264. [Crossref] [PubMed]
- Ferrara MG, Belluomini L, Smimmo A, et al. Meta-analysis of the prognostic impact of TP53 co-mutations in EGFR-mutant advanced non-small-cell lung cancer treated with tyrosine kinase inhibitors. Crit Rev Oncol Hematol 2023;184:103929. [Crossref] [PubMed]
- Di Federico A, De Giglio A, Parisi C, et al. STK11/LKB1 and KEAP1 mutations in non-small cell lung cancer: Prognostic rather than predictive? Eur J Cancer 2021;157:108-13. [Crossref] [PubMed]
- Meri-Abad M, Moreno-Manuel A, García SG, et al. Clinical and technical insights of tumour mutational burden in non-small cell lung cancer. Crit Rev Oncol Hematol 2023;182:103891. [Crossref] [PubMed]
- Waissengrin B, Garasimov A, Bainhoren O, et al. Artificial intelligence (AI) molecular analysis tool assists in rapid treatment decision in lung cancer: a case report. J Clin Pathol 2023;76:790-2. [Crossref] [PubMed]
- Shafi S, Parwani AV. Artificial intelligence in diagnostic pathology. Diagn Pathol 2023;18:109. [Crossref] [PubMed]
- Campanella G, Hanna MG, Geneslaw L, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 2019;25:1301-9. [Crossref] [PubMed]