Review Article

Opportunities and challenges in lung cancer care in the era of large language models and vision language models

Yi Luo1, Hamed Hooshangnejad1,2, Wilfred Ngwa2, Kai Ding2

1Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA; 2Department of Radiation Oncology and Molecular Radiation Sciences, Johns Hopkins University, Baltimore, MD, USA

Contributions: (I) Conception and design: Y Luo, H Hooshangnejad, K Ding; (II) Administrative support: W Ngwa, K Ding; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: Y Luo, K Ding; (V) Data analysis and interpretation: Y Luo, K Ding; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Kai Ding, PhD, MS, MBA. Department of Radiation Oncology and Molecular Radiation Sciences, Johns Hopkins University, 401 N Broadway, Suite 1440, Baltimore, MD 21231, USA. Email: kai@jhu.edu.

Abstract: Lung cancer remains the leading cause of cancer-related deaths globally. Over the past decade, the development of artificial intelligence (AI) has significantly propelled lung cancer care, particularly in areas such as early diagnosis, survival prediction, recurrence prediction, medical image processing, medical image registration, medical visual question answering, clinical report writing, medical image generation, and multimodal integration. This review aims to provide a comprehensive summary of the various AI methods utilized in lung cancer care, with a particular emphasis on machine learning and deep learning techniques. Moreover, with the advent and widespread application of large language models (LLMs), vision language models (VLMs), and multimodal integration for downstream clinical tasks, we explore the current landscape these cutting-edge AI tools offer. This landscape, however, presents significant challenges alongside its opportunities, including data privacy risks, inherent biases that may exacerbate healthcare disparities, model hallucinations, ethical implications, implementation costs, and the lack of standardized evaluation metrics. Furthermore, the translation of these technologies from experimental research to clinical implementation demands comprehensive validation protocols and multidisciplinary collaboration to guarantee patient safety, therapeutic efficacy, and equitable healthcare delivery. This review emphasizes the critical role of AI in enhancing our understanding and management of lung cancer, ultimately striving for precision medicine and equitable healthcare worldwide.

Keywords: Large language model (LLM); vision language model (VLM); lung cancer care


Submitted Sep 12, 2024. Accepted for publication Apr 16, 2025. Published online May 23, 2025.

doi: 10.21037/tlcr-24-801


Introduction

Lung cancer, encompassing both small cell lung cancer and non-small cell lung cancer (NSCLC), ranks as the leading cause of cancer-related mortality worldwide (1) with over 238,000 new diagnoses annually in the US (2). Despite considerable advances in medical research that have expanded our understanding and treatment approaches, the overall five-year survival rate is less than 20% (3), posing a significant public health burden. Early diagnosis and effective treatment of lung cancer continue to face considerable hurdles. In recent years, integrating artificial intelligence (AI) into oncology research and clinical protocols has instigated profound transformations within the field. As AI technologies progress and demonstrate substantial efficacy, they are reshaping the research paradigms and methodologies in lung cancer, signaling promising avenues for future exploration in this critical area of healthcare.

The initial deployment of AI in lung cancer research has predominantly concentrated on enhancing diagnostic accuracy and therapeutic interventions (4-12), as well as refining lung cancer image analysis (13-19). Advanced machine learning algorithms, particularly deep learning, have been widely utilized in the analysis of complex biomedical data. These technologies possess the capability to identify subtle patterns within medical imaging scans, genetic information, and clinical datasets, patterns that might elude nonspecialized observers and are often labor-intensive and subjective even for skilled clinicians to detect (20). More recently, the application of Transformer models has been adapted for both textual and visual data in lung cancer research (21-28). These models have led to the development of sophisticated diagnostic tools that integrate textual clinical notes with imaging data, offering a more comprehensive view of patient information and improving the accuracy of clinical assessments. Additionally, new directions are being explored, such as generating lung cancer computed tomography (CT) diagnostic reports, which provide deeper and more user-friendly explanations, thereby enhancing the accuracy of diagnoses and treatment plans.

This article offers a comprehensive analysis of the key challenges encountered in lung cancer care, detailing the spectrum of AI methodologies employed, from conventional machine learning techniques to advanced deep learning frameworks, and further into the cutting-edge realms of large language models (LLMs) and vision language models (VLMs). Additionally, we comprehensively outline the significant opportunities and obstacles that contemporary AI presents in the field of lung cancer care.


Lung cancer diagnosis and prognosis

Early diagnosis is critical for improving the five-year survival rate, which remains low due to the asymptomatic nature of early-stage lung cancer (29). In recent years, the emergence of AI has sparked considerable interest in its potential role in lung cancer (30). Effective lung cancer management mainly focuses on three interconnected challenges: (I) early diagnosis, which is essential for detecting cancer at a treatable stage and enabling timely intervention; (II) survival prediction; and (III) recurrence prediction. The latter two directly inform decisions on treatment strategies. By predicting patient outcomes and assessing post-treatment relapse risks, these tasks enable clinicians to evaluate the effectiveness of therapies in shrinking or eliminating tumors, extending survival, and minimizing recurrence. This comprehensive approach drives the development of personalized therapeutic strategies, ultimately improving patient outcomes.

In recent years, various machine learning and statistical methods have been employed to enhance the diagnosis and prognosis of lung cancer. These studies demonstrate the effectiveness of different approaches, such as artificial neural networks (ANNs), support vector machines (SVMs), Bayesian networks (BNs), and decision trees (DTs), in improving diagnostic accuracy and providing personalized treatment recommendations. In addition, deep learning-based approaches, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have shown significant promise. The implementation of these advanced techniques not only facilitates early detection and precise diagnosis but also enables the generation of personalized prognostic predictions, ultimately enhancing patient outcomes and driving progress in the field of precision medicine.

Lung cancer early diagnosis

Wu et al. (4) and Feng et al. (6) identified valuable tumor markers and utilized ANNs for predictions. Similarly, Nasser and Abu-Naser (5) explored the potential relationship between lung cancer and other symptoms, such as chronic disease, coughing, and shortness of breath, using an ANN. Petousis et al. (9) employed dynamic Bayesian networks (DBNs) to predict lung cancer incidence in high-risk individuals using longitudinal data. Additionally, Guo et al. (31) applied an SVM to classify lung cancer using positron emission tomography (PET)/CT image features from 32 patients, accurately distinguishing benign from early malignant nodules and advanced malignant nodules. Sherafatian and Arjmand (10) used DT algorithms to identify lung cancer biomarkers and subtypes using miRNA expression data. Sun et al. (12) compared multiple machine learning classifiers on a large dataset of 5,984 regions of interest (ROIs) and 488 features. Krishnaiah et al. (11) compared various data mining classification techniques, including rule-based methods, DTs, Naive Bayes, and ANNs, to develop a highly accurate lung cancer prediction system. Massion et al. (32) trained the lung cancer prediction convolutional neural network (LCP-CNN) to differentiate between benign and malignant lung nodules using CT images from the National Lung Screening Trial. The model was validated internally through cross-validation and externally with data from two academic institutions. Further external validation (33,34) confirmed the effectiveness of LCP-CNN, demonstrating high area under the curve (AUC) and sensitivity, which ultimately helps reduce unnecessary follow-up scans.
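To make this classical pipeline concrete, the minimal sketch below (Python, scikit-learn) mirrors the typical setup of studies such as Guo et al. (31): standardized image-derived features feed an SVM classifier evaluated with cross-validated AUC. The feature matrix and labels are random placeholders, not real patient data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder stand-ins for radiomic features extracted from PET/CT nodules:
# X has one row per nodule, one column per feature; y is 0 = benign, 1 = malignant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = rng.integers(0, 2, size=200)

# Standardize features, then fit an RBF-kernel SVM; report cross-validated AUC.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {auc.mean():.2f} +/- {auc.std():.2f}")
```

Decision trees, random forests, or Bayesian-network classifiers slot into the same pipeline by swapping the final estimator.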

Lung cancer survival prediction

Sesen et al. (7) evaluated BNs for predicting survival and selecting treatment plans for lung cancer patients using data from the English Lung Cancer Database. Bartholomai et al. (35) developed a method to predict lung cancer survival times using regression and classification models, specifically employing random forests (RFs) for classification, and RFs, general linear regression, and gradient boosting machines (GBMs) for regression, with data from the Surveillance, Epidemiology, and End Results (SEER) program (36). Lynch et al. (37) applied a range of supervised learning techniques, including linear regression, DTs, GBMs, SVMs, and a custom ensemble model, to classify the survival times of lung cancer patients utilizing data from the SEER database. While these models demonstrated effectiveness in predicting short to moderate survival times, their performance diminished when predicting survival times beyond 35 months.

She et al. (38) validated DeepSurv (39), a deep feed-forward network (DFFN)-based survival model incorporating 127 clinical features, to predict lung cancer-specific survival in NSCLC patients using data from the SEER database and Shanghai Pulmonary Hospital. Doppalapudi et al. (40) utilized a 3-dimensional (3D) CNN with CT data to predict NSCLC patient survival periods, employing transfer learning for surgical and radiation therapy patients. The model effectively stratified patients into high- and low-risk groups based on mortality, emphasizing the importance of tumor-surrounding tissue in prognostication. Xu et al. (41) developed a deep learning model using CNNs and RNNs to effectively predict clinical outcomes by analyzing time-series CT images of NSCLC patients. The model, which analyzed scans at pretreatment and posttreatment intervals (1, 3, and 6 months), showed improved predictive performance for survival and cancer-specific outcomes by incorporating follow-up scans.
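At the core of DeepSurv-style models is the Cox negative partial log-likelihood, which trains a network to output a per-patient log-risk score. The PyTorch sketch below is a simplified illustration of that objective, not the authors' exact implementation; network sizes and the toy data are arbitrary.

```python
import torch
import torch.nn as nn

class DeepSurvLike(nn.Module):
    """Feed-forward risk network: clinical features -> scalar log-risk score."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def neg_cox_partial_log_likelihood(log_risk, time, event):
    """Negative Cox partial log-likelihood. Sorting by descending survival
    time makes the risk set of subject i exactly the rows 0..i."""
    order = torch.argsort(time, descending=True)
    log_risk, event = log_risk[order], event[order]
    log_cumsum = torch.logcumsumexp(log_risk, dim=0)  # log-sum-exp over each risk set
    # Only uncensored subjects (event == 1) contribute likelihood terms.
    return -((log_risk - log_cumsum) * event).sum() / event.sum().clamp(min=1)

# Toy usage with random tensors standing in for registry data:
x = torch.randn(128, 127)                # 127 clinical features, as in She et al.
time = torch.rand(128) * 60              # follow-up time in months
event = (torch.rand(128) > 0.4).float()  # 1 = death observed, 0 = censored
model = DeepSurvLike(127)
loss = neg_cox_partial_log_likelihood(model(x), time, event)
loss.backward()
```

In practice, the random tensors would be replaced by curated registry data such as SEER, and the learned log-risk scores can be thresholded to stratify patients into high- and low-risk groups.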

Lung cancer recurrence prediction

Luo et al. (8) used a BN approach to predict local control in NSCLC patients. Yang et al. (42) used machine learning algorithms, including DTs and SVMs, to predict recurrence and survivability in NSCLC patients using integrated genomic and clinical data from The Cancer Genome Atlas (TCGA). Mohamed et al. (43) evaluated various machine learning models, including RF, multi-layer perceptron (MLP), SVM, and logistic regression (LR), to predict recurrence or disease-free survival in early-stage NSCLC patients. Janik et al. (44) predicted recurrence risk in early-stage NSCLC patients using RFs and graph machine learning (GML). Kim et al. (45) designed an ensemble model to predict recurrence in early-stage NSCLC patients by integrating clinical data, handcrafted radiomic features, and deep learning-based radiomics. Lee et al. (46) introduced DeepBTS, a neural network for predicting recurrence-free survival post-surgery in NSCLC patients, which outperformed traditional Cox proportional hazards models.

Balancing model performance and interpretability

Overall, machine learning-based methods have achieved promising results in lung cancer diagnosis using various feature inputs, including gene expression, biomarkers, clinical data, and CT image features. The key advantage of classical machine learning methods is their explainability, making them accessible for both clinical and industrial applications. However, one potential drawback is that while their performance is acceptable, it is not yet optimal. With the advancement of deep learning methods, which offer higher accuracy, there is potential to further enhance the effectiveness of lung cancer diagnosis models to better meet clinical needs.

Despite these advancements, deep learning methods often suffer from a lack of interpretability due to their “black box” nature. The inherent complexity of these models presents significant challenges in interpreting their decision-making processes, which can limit their use in clinical decision-making where transparency and explainability are crucial. Clinicians need to understand how a diagnosis is reached to trust and effectively use these models in practice. Therefore, while deep learning methods hold great promise for improving diagnostic accuracy, their integration into clinical workflows requires addressing the challenges of interpretability.

Balancing interpretability and model performance is a critical trade-off in the clinical application of these technologies. While machine learning models provide a more transparent decision-making process, they may not achieve the same level of accuracy as deep learning models. Conversely, deep learning models can offer superior performance but at the cost of reduced explainability. This trade-off necessitates careful consideration and possibly the development of hybrid approaches that can leverage the strengths of both paradigms, ensuring that the models are both accurate and interpretable enough to be trusted and utilized effectively in clinical settings. Given the unparalleled performance of deep learning methods, a promising research direction lies in improving their interpretability. For instance, techniques like Grad-CAM (47) offer visual explanations for 2D CNNs, while Mukunda et al. (48) have introduced visual explanations for 3D CNNs in clinical ocular torsion detection, achieving high accuracy and AUC. These approaches provide new ways to enhance the transparency of deep learning models, making them more suitable for clinical decision-making. Exploring and refining these methods can significantly increase the interpretability of various deep learning techniques, ultimately leading to better support for clinical decisions. A summary of the AI methods applied in lung cancer diagnosis and prognosis is provided in Table 1. For studies that compare multiple methods, we have listed the one with the best performance.
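To make the Grad-CAM idea concrete before turning to Table 1, the sketch below computes a class-activation heatmap for a 2D CNN by weighting the target layer's feature maps with spatially pooled gradients. It is a simplified PyTorch rendering of the published method (47), not a drop-in clinical tool.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=None):
    """Grad-CAM sketch: pool the gradients of the class score over space to
    weight the target layer's feature maps, then ReLU and upsample."""
    feats, grads = {}, {}
    fh = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(x)                          # (B, n_classes)
    if class_idx is None:
        class_idx = logits.argmax(dim=1)       # explain the top prediction
    score = logits.gather(1, class_idx.view(-1, 1)).sum()
    model.zero_grad()
    score.backward()
    fh.remove(); bh.remove()

    w = grads["a"].mean(dim=(2, 3), keepdim=True)          # (B, C, 1, 1) channel weights
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True)).detach()
    return F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
```

For a torchvision ResNet, `target_layer` would typically be the final convolutional stage (e.g., `model.layer4`); extending the gradient pooling to three spatial dimensions yields the 3D variants discussed above.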

Table 1

Summary of AI methods in lung cancer diagnosis and prognosis

Author, year Input Method Contribution
Wu et al., 2011 (4) Lung cancer markers ANN Lung cancer diagnosis system with optimal marker group inputs
Feng et al., 2012 (6) Lung cancer markers ANN Classifying benign lung cancers, normal individuals, and lung cancers from three common gastrointestinal cancers
Nasser and Abu-Naser, 2019 (5) Symptoms ANN Explored potential relationship between lung cancer and clinical symptoms
Petousis et al., 2016 (9) Low-dose CT DBN Lung cancer screening tool that demonstrated high discriminatory and predictive power for both cancerous and non-cancerous cases
Guo et al., 2015 (31) PET/CT images SVM Classifying benign, early malignant and advanced malignant nodules
Sherafatian and Arjmand, 2019 (10) MiRNAs DT Classification models for lung cancer diagnosis and subtyping
Sun et al., 2013 (12) 488 multimodal features SVM Multiple machine learning models comparison for lung cancer diagnosis
Krishnaiah et al., 2013 (11) Symptoms BN Robust lung cancer classification dealing with small or incomplete data sets
Massion et al., 2020 (32) CT CNN Commercial tool LCP-CNN for distinguishing benign and malignant lung nodules
Heuvelmans et al., 2021 (33) and Baldwin et al., 2020 (34) CT CNN Further external validation for LCP-CNN
Sesen et al., 2013 (7) 13 patient and disease variables BN Lung cancer survival prediction and treatment planning suggestion
Bartholomai et al., 2018 (35) 13 selected patient parameters from SEER RF, LR, and GBM Lung cancer classification and survival prediction
Lynch et al., 2017 (37) 18 selected patient parameters from SEER SVM, DT, LR, GBM, and ensemble Lung cancer survival prediction with a custom ensemble model
She et al., 2020 (38) 127 clinical features DFFN Lung cancer survival prediction integrating Cox proportional hazards
Doppalapudi et al., 2021 (40) CT CNN Deep learning model for lung cancer survival prediction with interpretable feature importance
Xu et al., 2019 (41) CT CNN+RNN Lung cancer survival and cancer-specific outcomes prediction with time series scans
Luo et al., 2019 (8) 288 features from seven categories BN Local control prediction and dynamically optimized treatment planning
Yang et al., 2022 (42) Genomic and clinical data DT + SVM Recurrence and survival prediction
Mohamed et al., 2021 (43) 26 patient and disease features RF, MLP, SVM, and LR Multiple machine learning models comparison for lung cancer recurrence prediction
Janik et al., 2023 (44) 72 patient and disease features RF + GML Lung cancer recurrence prediction with tabular and graph machine learning models
Kim et al., 2022 (45) Multimodal features including CT, clinical variables, and handcrafted radiomic features CNN, and ensemble Multimodal integration for recurrence prediction
Lee et al., 2020 (46) Multimodal features including radiological, pathological imaging, and genomic features MLP Multimodal integration for recurrence prediction

AI, artificial intelligence; ANN, artificial neural network; BN, Bayesian network; CNN, convolutional neural network; CT, computed tomography; DBN, dynamic Bayesian networks; DFFN, deep feed-forward network; DT, decision tree; GBM, gradient boosting machine; GML, graph machine learning; LCP, lung cancer prediction; LR, logistic regression; MLP, multi-layer perceptron; PET, positron emission tomography; RF, random forest; RNN, recurrent neural network; SEER, Surveillance, Epidemiology, and End Results; SVM, support vector machine.


Medical image processing

Medical image processing is crucial at the intersection of AI and lung cancer, primarily divided into two key categories: medical image segmentation and medical image registration.

Medical image segmentation involves dividing an image into distinct regions or segments, with the goal of simplifying its representation and enhancing its interpretability for subsequent analysis. In the context of lung cancer, segmentation is crucial for identifying and delineating the boundaries of tumors, nodules, and other relevant anatomical structures. AI-powered segmentation algorithms enhance the accuracy and efficiency of this process, enabling precise localization and characterization of cancerous regions, which is essential for diagnosis, treatment planning, and monitoring disease progression.

Medical image registration, on the other hand, involves aligning images from different modalities, time points, or patients to a common coordinate system. This process is vital for comparing and integrating data from various imaging techniques such as CT, magnetic resonance imaging (MRI), and PET scans. In lung cancer, registration facilitates the fusion of anatomical and functional information, enhancing the ability to track tumor changes over time, assess treatment response, and plan radiation therapy with greater precision. AI-driven registration methods improve the robustness and speed of image alignment, addressing challenges such as patient movement and anatomical variability.

Medical image segmentation

The rapid advancements in medical imaging technology have positioned medical image segmentation as one of the central focuses within the field of computer vision. This process leverages sophisticated image processing techniques to analyze and manipulate 2D and 3D medical images, enabling the precise segmentation and extraction of various anatomical structures, including organs, soft tissues, and pathological regions. By facilitating both qualitative and quantitative assessments of lesions and areas of interest, medical image segmentation markedly enhances the accuracy and dependability of clinical diagnoses.

Traditional methods for medical image segmentation include threshold-based (49), region-based (50), and edge-detection (51) methods. However, these methods often perform poorly in handling details. In recent years, deep learning methods have made remarkable progress in the field of image segmentation, particularly through the use of CNNs, which excel in feature extraction and image representation. CNNs do not require manual extraction of image features or extensive preprocessing, making them highly effective for medical image segmentation tasks.

Fully convolutional networks (FCNs) (52) have revolutionized the field of semantic segmentation. Traditional CNNs typically end with fully connected layers, but FCNs replace these with convolutional layers, generating detailed segmentation maps. One of the key advantages of FCNs is their ability to process images of any size, thanks to their use of deconvolution layers which upsample feature maps back to the original image size, allowing for precise pixel-level classification. Building on the foundation of FCNs, models like SegNet, DeepLab, and Mask R-CNN have introduced significant innovations for semantic segmentation (13-15). SegNet utilizes an encoder-decoder architecture with pooling indices for precise upsampling, enhancing boundary detection. DeepLab employs atrous convolution and conditional random fields (CRFs) post-processing to capture multi-scale context and refine boundaries. Mask R-CNN builds upon Faster R-CNN by introducing an additional branch for predicting segmentation masks, enabling effective instance segmentation alongside object detection. These advancements have substantially improved segmentation accuracy and detail.

U-Net (16) has gained widespread adoption in medical imaging and even in larger computer vision fields due to its ability to effectively merge low-level and high-level features. This structure comprises an encoder-decoder framework with skip connections, preserving high resolution details. Enhanced variants like 3D U-Net (17) have extended U-Net’s capabilities, making it proficient in volumetric image segmentation and versatile in addressing diverse challenges across various medical applications.
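To illustrate the encoder-decoder-with-skip-connections design, the following is a deliberately tiny, single-level U-Net sketch in PyTorch; the original architecture stacks four such levels with many more channels, but the skip concatenation shown here is the defining idea.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """One-level U-Net: encoder, bottleneck, decoder with a skip connection."""
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc = conv_block(in_ch, 32)
        self.down = nn.MaxPool2d(2)
        self.mid = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)                # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, n_classes, 1)     # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        d = self.dec(torch.cat([self.up(m), e], dim=1))  # skip connection
        return self.head(d)

logits = TinyUNet()(torch.randn(1, 1, 128, 128))  # -> (1, 2, 128, 128)
```

The 3D variant replaces the 2D convolutions and pooling with their 3D counterparts, which is what enables volumetric CT segmentation.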

Generative adversarial networks (GANs) have been increasingly used to enhance image segmentation (53). GANs consist of a generator, which synthesizes realistic images, and a discriminator, which assesses their authenticity. Through adversarial training, these networks improve segmentation models’ accuracy and robustness. Specific techniques, such as SegAN (18) and SCAN (19), have been developed to tackle challenges like imbalanced pixel categories and limited data availability in medical image segmentation.
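The generic adversarial training objective behind such methods can be sketched as follows. This simplified binary cross-entropy formulation is illustrative only (SegAN, for instance, replaces it with a multi-scale L1 objective); `D` stands for any critic network over concatenated image-mask pairs, and the 0.1 adversarial weight is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def critic_loss(D, img, gt_mask, pred_mask):
    """Train the critic to separate (image, ground truth) from (image, prediction)."""
    real = D(torch.cat([img, gt_mask], dim=1))
    fake = D(torch.cat([img, pred_mask.detach()], dim=1))
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def segmentor_loss(D, img, gt_mask, pred_mask):
    """Train the segmentor to fit the mask while fooling the critic.
    pred_mask is assumed to be a sigmoid output in (0, 1)."""
    fake = D(torch.cat([img, pred_mask], dim=1))
    adv = F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
    seg = F.binary_cross_entropy(pred_mask, gt_mask)
    return seg + 0.1 * adv
```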

Medical image registration

Medical image registration is essential for various applications in medical imaging, including disease diagnosis, treatment planning, and prognosis evaluation. It aligns multiple images into a common coordinate system, enabling precise comparison and analysis, which is particularly crucial for lung cancer diagnosis and treatment. The highly deformable nature of the lungs poses a substantial challenge to achieving precise registration (54). In clinical practice, multiple sets of medical images are often acquired throughout the diagnostic and treatment processes, making image registration vital for harmonizing variations in spatial resolution and image properties—especially in lung cancer radiation therapy (55), managing lung geometric changes (56,57), and assessing regional pulmonary function (58-61). Advances in deep learning have significantly improved the performance and efficiency of registration algorithms, enabling faster and more accurate image alignment. This is particularly important in clinical settings, where timely and precise image registration can enhance diagnostic accuracy and improve treatment outcomes.

Three fundamental deep learning architectures are widely used for image registration: CNNs, spatial transformer networks (STNs), and GANs (62).

CNNs are instrumental in medical image registration, as they excel at learning hierarchical features from complex imaging data. Key variants include VoxelMorph (63), which uses a U-Net architecture to predict deformation fields for aligning medical images such as brain MRIs. HyperMorph (64) extends VoxelMorph by integrating hyperparameter optimization into the learning process, enhancing the accuracy and robustness of registrations. Another variant, CycleMorph (65), employs cycle consistency to ensure the registration process is reversible, making it particularly useful in unsupervised learning settings. More recently, Hooshangnejad et al. (66) developed DAART to perform diagnostic CT to planning CT image adaptation, omitting the need to acquire multiple scans before radiation treatment delivery and thereby reducing the cost and duration of the radiation therapy treatment pathway.
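The essence of this family can be sketched in a few lines of PyTorch: a small CNN (standing in for the full U-Net) maps a concatenated (moving, fixed) image pair to a dense displacement field, a spatial-transformer step warps the moving image, and training minimizes an image-similarity term plus a smoothness penalty. The example below is 2D and deliberately shallow; shapes, weights, and the loss weighting are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRegNet(nn.Module):
    """Predicts a dense 2D displacement field from a (moving, fixed) pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),          # 2 channels: (dx, dy)
        )

    def forward(self, moving, fixed):
        return self.net(torch.cat([moving, fixed], dim=1))

def warp(moving, flow):
    """Spatial-transformer warp: add the flow (in normalized [-1, 1] units)
    to an identity grid, then resample the moving image."""
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
    flow_n = flow.permute(0, 2, 3, 1)                # (B, H, W, 2)
    return F.grid_sample(moving, grid + flow_n, align_corners=True)

moving, fixed = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
flow = TinyRegNet()(moving, fixed)
warped = warp(moving, flow)
# Image-similarity term plus a crude finite-difference smoothness penalty:
loss = F.mse_loss(warped, fixed) + 0.01 * (flow.diff(dim=-1).abs().mean()
                                           + flow.diff(dim=-2).abs().mean())
loss.backward()
```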

STNs are designed to learn spatial transformations that optimize image alignment. They comprise a localization network that estimates transformation parameters, a grid generator that constructs a sampling grid, and a sampler that applies this grid to warp the input image. This architecture is used in models like TransMorph (67), which combines STNs with transformer networks to handle large deformations and capture long-range dependencies in medical images, thus improving registration accuracy. Another model, FAIM (68), integrates STNs with anti-folding regularization to ensure smooth and invertible transformations, enhancing the stability and reliability of the registration process.

GANs are also utilized in medical image registration to provide learnable similarity metrics and handle multi-modal image alignment. CycleGAN (69) is particularly effective in translating images between different modalities (e.g., CT to MRI) to simplify the registration process by transforming it into a mono-modal problem. Additionally, GAN-based adversarial similarity networks use a generator to predict deformation fields and a discriminator to evaluate the similarity between the warped and fixed images, ensuring that the registration process adapts to the specific characteristics of the medical images being aligned.

By leveraging advanced AI techniques, these two aspects of medical image processing significantly contribute to early detection, accurate diagnosis, and effective treatment of lung cancer, ultimately improving patient outcomes. The representative AI work in medical image segmentation and registration discussed in this paper is summarized in Table 2. In addition to the application and method, we also note the code availability of each of these leading works.

Table 2

Summary of representative artificial intelligence work in medical image segmentation and registration

Application Author Year Method Code availability
Medical image segmentation Zhao et al. (51) 2006 Edge detection-based segmentation ×
Cigla et al. (50) 2008 Region-based segmentation ×
Xu et al. (49) 2010 Threshold-based segmentation ×
Ronneberger et al. (16) 2015 U-Net ✓
Çiçek et al. (17) 2016 3D U-Net ✓
Badrinarayanan et al. (13) 2017 SegNet ✓
Chen et al. (14) 2017 DeepLab ✓
He et al. (15) 2017 Mask R-CNN ✓
Xue et al. (18) 2018 SegAN ✓
Dai et al. (19) 2018 SCAN ✓
Medical image registration Zhu et al. (69) 2017 CycleGAN ✓
Balakrishnan et al. (63) 2019 VoxelMorph ✓
Kuang et al. (68) 2019 FAIM ✓
Hoopes et al. (64) 2021 HyperMorph ✓
Kim et al. (65) 2021 CycleMorph ✓
Chen et al. (67) 2022 TransMorph ✓
Hooshangnejad et al. (66) 2023 DAART ✓

CNN, convolutional neural network; DAART, deeply accelerated adaptive radiation therapy; FAIM, FAst IMage registration; SCAN, Structure Correcting Adversarial Network.

Despite significant advancements in deep learning-based medical image segmentation and registration methods, challenges persist. One common issue is the need to train a new model for each specific medical imaging dataset and task. However, due to the scarcity of annotated medical data, training such models is not always practical. Consequently, there is a pressing demand for a versatile, robust model capable of handling various segmentation or registration tasks across different datasets without extensive retraining. Such a model would greatly enhance the efficiency and applicability of deep learning techniques in medical imaging, addressing the limitations posed by limited data availability and diverse medical imaging requirements.


LLM and VLM in lung cancer

Emergence of LLM and VLM

Since 2020, LLMs and VLMs have rapidly evolved. 2020 saw Google’s T5 model (70), treating all natural language processing (NLP) tasks as text-to-text problems with a unified framework. In June 2020, OpenAI released GPT-3 (71), featuring 175 billion parameters and a transformer architecture, pre-trained using unsupervised learning to perform various NLP tasks. In January 2021, OpenAI introduced CLIP (72), a VLM using contrastive learning to align visual and textual representations, pre-trained on 400 million text-image pairs. At the same time, OpenAI launched DALL-E (73), a model generating images from textual descriptions using a transformer variant. In March 2021, Microsoft introduced VinVL (74), enhancing visual understanding with richer annotations and transformer-based integration of visual and textual features. In April 2021, the ViLT model was proposed (75), focusing on vision-and-language transformer tasks by using minimal visual feature extraction and directly feeding raw image patches into a transformer for improved efficiency. In May 2021, Google released ALIGN (76), scaling up the text-image contrastive learning approach with over a billion image-text pairs. In November 2021, Microsoft’s Florence (77) employed a multi-modal transformer architecture for better visual and textual content understanding. December 2021 saw DeepMind’s Gopher (78), with 280 billion parameters, improving factual accuracy and reasoning.

March 2022 featured DeepMind’s Chinchilla (79), emphasizing a balance between model size and training data, using dense and sparse layers for efficient learning. April 2022 brought Google’s PaLM (80) with 540 billion parameters, using the Pathways system for efficient training and robust multilingual tasks. April 2022 also introduced Microsoft’s BEiT, adapting BERT pre-training for image transformers. In September 2022, OpenAI launched DALL-E 2 (81), improving image quality and diversity with diffusion models.

In February 2023, Meta launched LLaMA (82), an open-source language model focusing on accessibility and versatility in NLP applications. March 2023 saw OpenAI release GPT-4 (83), further enhancing conversational abilities with dynamic attention span adjustment. In July 2023, Meta introduced LLaMA 2 (84), building upon its predecessor with improved performance and capabilities. In August 2023, Anthropic developed Claude Instant, focusing on safety and reliability in language models. In October 2023, Salesforce introduced BLIP-2 (85), integrating large-scale multi-modal data to advance text and image understanding through self-supervised and supervised learning techniques. In November 2023, Anthropic unveiled Claude 2. The following month, Google released Gemini, a capable and versatile model designed to handle multi-modal tasks.

In March and June 2024, Anthropic released Claude 3 and Claude 3.5, further improving safety, reliability, and overall performance. In April 2024, Meta released LLaMA 3, continuing its commitment to advancing open-source NLP technologies, and in May 2024, OpenAI launched GPT-4o, an optimized version of GPT-4 for more robust performance in vision language tasks. Google’s Gemini series evolved with the introduction of Gemini 1.5 Pro in May 2024 (86), featuring an enhanced context window for improved processing of longer inputs. Meta contributed further to the open-source community with successive iterations of the LLaMA 3 family (LLaMA 3.1, 3.2, and 3.3) in the latter half of 2024.
In December 2024, DeepSeek released DeepSeek-V3 (87), a 671-billion-parameter open-source LLM with a 128,000-token context length, claiming performance exceeding that of GPT-4o. Alibaba’s Qwen series progressed with the release of Qwen 2.5-VL (88) in February 2025, which leveraged a Mixture of Experts (MoE) architecture, achieving a reported 30% reduction in computational costs compared to monolithic models by selectively activating relevant expert sub-models. Google further advanced its Gemini models with the release of Gemini Pro 2.0 in February 2025, highlighting improvements in coding capabilities and complex prompt handling. Figure 1 highlights the overall progression timeline of LLMs and VLMs over the past few years.

Figure 1 The development of LLM and VLM. LLM, large language model; VLM, vision language model.

Current state of LLM and VLM in lung cancer

LLMs and VLMs such as GPT-4 and other transformer-based models have demonstrated significant advancements in both natural language understanding and vision-language tasks. Their application in healthcare is expanding rapidly due to their scalability, adaptability to different modalities and contexts, and ability to handle complex vision-language tasks. In healthcare, LLMs are revolutionizing clinical decision support, medical record analysis, patient engagement, and responses to zero-shot or few-shot queries by exhibiting a nuanced understanding of professional medical knowledge.

State-of-the-art models such as GPT-4, Google’s Gemini, and Claude 3 excel in various healthcare applications, including named entity recognition (NER), relation extraction, natural language inference, and visual recognition. The deployment of these models has led to significant improvements in diagnostic accuracy, administrative efficiency, and personalized patient care.

Med-PaLM 2 (89) is specifically designed for medical question answering, leveraging large datasets of medical literature, clinical guidelines, and expert annotations to provide accurate and contextually relevant responses. This model excels in understanding and processing medical terminology and clinical concepts, making it invaluable for healthcare professionals and patients seeking reliable medical information. Other specialized models include BioBERT (90), tailored for processing biomedical texts, and ClinicalBERT (91), designed for handling clinical notes. Specifically, in areas closely related to lung cancer, such as radiology and oncology, significant efforts have been made to develop models addressing these specialized needs. Radiology-Llama2 (92) focuses on radiology tasks, such as interpreting medical images and generating detailed radiology reports, and is fine-tuned on extensive datasets of radiological images and accompanying reports. RadOnc-GPT (93) is an LLM tailored for radiation oncology. Fine-tuned on an extensive dataset of radiation oncology patient records from the Mayo Clinic, it demonstrates exceptional performance in generating radiotherapy treatment plans and offering detailed diagnostic descriptions. CancerLLM (94) is designed for the cancer domain, addressing unique challenges in cancer diagnosis and treatment planning; it is pre-trained on millions of clinical notes and pathology reports and fine-tuned for tasks such as cancer phenotype extraction and treatment plan generation. EXACT-Net (95) presents a promising new approach to reducing false positives in lung cancer nodule detection by using ChatGPT to accurately extract diagnostic nodule locations from clinical reports. PMC-LLaMA (96) integrates biomedical academic papers and medical textbooks, providing robust capabilities in medical question answering and conversational dialogues; despite its compact size, it surpasses well-known models like ChatGPT on various public medical QA benchmarks.
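To illustrate the prompt-based extraction pattern behind approaches such as EXACT-Net, the sketch below asks an LLM to return nodule locations as structured JSON. Here, `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording and JSON schema are illustrative rather than the published method.

```python
import json

PROMPT_TEMPLATE = (
    "You are assisting with lung nodule detection. From the radiology report "
    "below, list every nodule mention as a JSON array of objects with keys "
    "'lobe', 'slice', and 'size_mm'. Reply with JSON only.\n\nReport:\n"
)

def extract_nodule_locations(report: str, call_llm) -> list:
    """call_llm: hypothetical wrapper that takes a prompt string and returns
    the model's text response (e.g., around any chat-completion API)."""
    response = call_llm(PROMPT_TEMPLATE + report)
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # LLM output is not guaranteed to be valid JSON; fail safely.
        return []
```

The parsed locations can then be matched against candidate detections from a vision model to suppress false positives.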

The integration of these specialized LLMs into healthcare settings underscores their potential to significantly enhance various aspects of medical practice, from improving diagnostic accuracy to optimizing treatment plans and facilitating efficient information retrieval. In the context of lung cancer diagnosis and treatment, medical imaging is pivotal. VLMs that integrate medical images with clinical text can advance healthcare outcomes by enhancing tasks such as clinical report generation, medical visual question answering, and error detection in medical reports.

One of the pioneering approaches in this field is the development of models like MAIRA-1 and MAIRA-2 (22,23), which specialize in radiology report generation by combining chest X-ray images with associated clinical reports. These models leverage pre-trained vision encoders to extract visual features from the images and combine them with text prompts that guide the LLMs in generating detailed and accurate medical reports. In addition, models like Merlin and Dia-LLaMA are designed to handle the complex 3D nature of CT scans, making them particularly suitable for detailed and comprehensive report generation in lung cancer diagnostics (24,97). Merlin, an advanced multimodal model, integrates structured electronic health record (EHR) data with unstructured radiology report text to enable comprehensive and robust supervision. Utilizing a vast clinical dataset comprising over 6 million images from 15,331 CT scans, more than 1.8 million EHR diagnosis codes, and over 6 million tokens from radiology reports, Merlin demonstrates significant advancements across multiple domains. These include zero-shot classification for 31 distinct findings, phenotype classification across 692 distinct phenotypes, and zero-shot cross-modal retrieval tasks, including image-to-impression and image-to-findings. Furthermore, Merlin excels in specialized tasks, including predicting 5-year outcomes for six chronic diseases, generating detailed radiology reports, and conducting 3D semantic segmentation for 20 different organs. In contrast, Dia-LLaMA adapts the comparatively smaller LLaMA2-7B model for CT report generation, incorporating diagnostic information as guidance prompts and using a pre-trained ViT3D model to extract visual information from CT images. This framework emphasizes critical abnormal information by using disease-aware attention and a disease prototype memory bank to capture common disease representations, enhancing its ability to generate detailed and accurate medical reports.

Another notable model is LLM-RadJudge (21), which highlights the shortcomings of traditional report-generation evaluation metrics and instead uses an LLM to evaluate generated radiology reports against clinical standards. By comparing the performance of various LLMs, it achieves evaluation consistency close to that of radiologists, thus facilitating the development of more clinically relevant models. The SERPENT-VLM (27) model focuses on enhancing the interpretability of VLMs by integrating semantic and visual information, enabling more accurate and context-aware medical report generation. This model improves the overall quality of the generated reports by aligning visual features with medical terminologies. RAD-DINO (26) introduces a novel approach to integrating domain-specific knowledge into VLMs, enhancing their capability to generate accurate medical reports by incorporating domain-specific ontologies and visual features. This model scales well with increased dataset size and diversity, demonstrating strong correlations between imaging features and clinical information such as patient medical records. CheXpert Plus (98) builds upon existing radiology datasets to enhance the training and performance of VLMs, particularly in identifying and characterizing lung abnormalities. By leveraging an extensive dataset, CheXpert Plus contributes to the continuous improvement of VLM performance in clinical settings.
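The common architectural recipe these report-generation models share, in which a vision encoder's patch features are projected into the LLM's embedding space and prepended to the text prompt, can be sketched as follows. The dimensions are illustrative, and this simplified pattern should not be read as any single paper's implementation.

```python
import torch
import torch.nn as nn

class ImageToLLMPrefix(nn.Module):
    """Projects vision-encoder patch features into the LLM token-embedding
    space so they can be prepended to the prompt as image tokens."""
    def __init__(self, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):      # (B, n_patches, vis_dim)
        return self.proj(patch_feats)    # (B, n_patches, llm_dim)

# Conceptual assembly: [projected image tokens; embedded prompt tokens] is fed
# to the LLM, which is trained with next-token prediction on the report text.
prefix = ImageToLLMPrefix()(torch.randn(1, 49, 768))
prompt_embeds = torch.randn(1, 32, 4096)   # stand-in for embedded prompt text
llm_inputs = torch.cat([prefix, prompt_embeds], dim=1)
```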

Realistic synthetic medical image generation, encompassing modalities like X-ray and CT, is another rapidly developing approach in this area. It addresses the critical need for large, diverse datasets in medical imaging, particularly given data scarcity and patient privacy concerns. Bluethgen et al. (99) adapted a latent diffusion model, pre-trained on natural image and text descriptor pairs, to generate diverse and visually plausible synthetic chest X-ray images. Xu et al. (100) introduced MedSyn, a methodology for producing high-quality 3D lung CT images guided by textual input. MedSyn uses a hierarchical scheme with a modified UNet architecture: low-resolution images are first synthesized conditioned on textual input, serving as a basis for subsequent generators that produce complete volumetric data. To ensure anatomical plausibility, generation is then guided by the simultaneous creation of vascular, airway, and lobular segmentation masks alongside the CT images. Li et al. (101) proposed TextoMorph, which enables textual control over tumor characteristics, including texture, heterogeneity, boundaries, and pathology type, for more realistic and diverse tumor synthesis. Furthermore, Nvidia introduced MAISI (102), leveraging a foundational volume compression network and a latent diffusion model for high-resolution CT image generation.
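In practice, prompt-conditioned generation with a latent diffusion model follows the pattern below, shown here with the Hugging Face diffusers library. The checkpoint name is a hypothetical placeholder for a domain-adapted model; none of the cited works are implied to ship this exact interface.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical fine-tuned latent diffusion checkpoint (placeholder name).
MODEL_ID = "your-org/chest-xray-ldm"

pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Text prompt conditions the denoising process in latent space.
image = pipe(
    prompt="Frontal chest X-ray with a 2 cm left upper lobe nodule",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("synthetic_cxr.png")
```

Scaling this pattern to 3D CT, as MedSyn and MAISI do, requires volumetric backbones and compression networks rather than the 2D pipeline shown here.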

The application of LLMs and VLMs is also contributing significantly to multimodal integration in various clinical applications. Specifically, the strong textual encoding capabilities of these models offer potential improvements in areas such as medical image segmentation and lung cancer survival prediction. LLMSeg (103) pioneered the integration of textual information into the challenging task of 3-dimensional context-aware target volume delineation for radiation oncology. ConTEXTual Net (104) introduced a novel vision-language model that extracts language features from free-form radiology reports with a pre-trained language model. This model then employs cross-attention between the extracted language features and the intermediate embeddings of an encoder-decoder CNN, thereby enabling language-guided pneumothorax segmentation on chest radiographs. Kim et al. (105) proposed a framework that integrates the prognostic capabilities of both CT and pathology images with clinical information. This framework utilizes a multi-modal integration approach via multiple instance learning, leveraging an LLM, a CT encoder, and a pathology encoder to manage the complexities of multi-modal medical datasets for 5-year overall survival prediction in lung cancer.
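The language-guided fusion at the heart of models like ConTEXTual Net can be illustrated with a single cross-attention block in which flattened image features attend to report-text embeddings. The sketch below is a simplified stand-in for the published architecture, with arbitrary dimensions.

```python
import torch
import torch.nn as nn

class LanguageGuidedFusion(nn.Module):
    """Cross-attention from image features (queries) to report-text
    features (keys/values), followed by a residual connection."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, H*W, dim) flattened decoder features
        # txt_feats: (B, T, dim) projected language-model embeddings
        attended, _ = self.attn(query=img_feats, key=txt_feats, value=txt_feats)
        return self.norm(img_feats + attended)   # residual fusion

fused = LanguageGuidedFusion()(torch.randn(1, 1024, 256), torch.randn(1, 64, 256))
```

The fused features then flow back into the segmentation decoder, so the report text can steer which image regions the mask emphasizes.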

In summary, the integration of LLMs and VLMs in healthcare, particularly in lung cancer diagnostics, has demonstrated significant potential. Models like MAIRA-1, MAIRA-2, Merlin, Dia-LLaMA, and others have advanced the accuracy and efficiency of radiology report generation by combining medical images with clinical data. Separately, models like MAISI, MedSyn, and TextoMorph have contributed significantly to the development of synthetic medical image generation, addressing the need for diverse datasets. Furthermore, models like LLMSeg and ConTEXTual Net demonstrate the potential of multimodal integration for downstream clinical tasks. These models leverage sophisticated techniques such as pre-trained vision encoders, multimodal data integration, and innovative evaluation methods to enhance radiology report generation, synthetic medical image creation, and medical image analysis. This progress sets the stage for more advanced AI applications in healthcare, promising improved diagnostic and treatment outcomes. Table 3 highlights the outstanding work with LLMs and VLMs in different application fields.

Table 3

Summary of representative work in different specialized fields with LLM and VLM

Application Author Year Model
Medical LLM Huang et al. (91) 2019 ClinicalBERT
Lee et al. (90) 2020 BioBERT
Singhal et al. (89) 2023 Med-PaLM 2
Liu et al. (92) 2023 Radiology-Llama2
Liu et al. (93) 2023 RadOnc-GPT
Li et al. (94) 2024 CancerLLM
Wu et al. (96) 2024 PMC-LLaMA
Medical report generation Hyland et al. (22) 2024 MAIRA-1
Bannur et al. (23) 2024 MAIRA-2
Blankemeier et al. (24) 2024 Merlin
Chen et al. (97) 2024 Dia-LLaMA
Wang et al. (21) 2024 LLM-RadJudge
Kapadnis et al. (27) 2024 SERPENT-VLM
Pérez-García et al. (26) 2024 RAD-DINO
Chambon et al. (98) 2024 CheXpert Plus
Medical image generation Bluethgen et al. (99) 2024
Xu et al. (100) 2024 MedSyn
Li et al. (101) 2024 TextoMorph
Guo et al. (102) 2024 MAISI
Multimodal integration for downstream tasks Hooshangnejad et al. (95) 2024 EXACT-Net
Oh et al. (103) 2024 LLMSeg
Huemann et al. (104) 2024 ConTEXTual Net
Kim et al. (105) 2024

LLM, large language model; VLM, vision language model.

Challenges

Implementing LLMs and VLMs in healthcare presents several barriers and challenges in practical applications, ranging from privacy and security to bias, accuracy, and cost. Addressing these issues is critical to ensure the responsible and effective deployment of LLMs and VLMs in medical contexts, enhancing patient care while maintaining ethical standards.

First, security and privacy considerations are paramount when utilizing LLMs in healthcare due to the handling of highly sensitive patient data. The unintentional inclusion of personally identifiable information in training datasets can breach patient confidentiality. Furthermore, LLMs can deduce sensitive personal attributes from ostensibly non-specific data, amplifying privacy risks. To safeguard patient privacy and uphold research integrity, it is essential to apply comprehensive data anonymization methods, ensure secure data storage, and rigorously follow ethical guidelines.

Second, bias and fairness issues in training data can lead to biased outputs from LLMs, which is particularly concerning in healthcare, where such biases can exacerbate health disparities, especially for low- and middle-income countries (LMICs) (106). Addressing these biases requires careful curation and preprocessing of training data, continuous monitoring, and strategies to mitigate biases in model outputs. Collaborative efforts between domain experts, ethicists, and data scientists are necessary to develop guidelines and best practices for reducing biases and promoting fairness in LLM and VLM applications.

Third, hallucinations and fabricated information pose a significant risk, as LLMs are prone to generating plausible-sounding but incorrect information. This is especially dangerous in healthcare, where inaccurate information can lead to harmful clinical decisions. Developing methods to detect and mitigate hallucinations is critical to ensure their reliability in medical contexts.

Fourth, legal and ethical challenges arise with the deployment of LLMs and VLMs in healthcare. Establishing clear legal frameworks to govern the use of AI in healthcare is essential to ensure these technologies are used responsibly. Ethical considerations include ensuring patient autonomy, obtaining informed consent, and maintaining patient confidentiality. Addressing these legal and ethical challenges is vital to gaining public trust and ensuring the ethical deployment of LLMs and VLMs in healthcare.

Fifth, the high implementation costs represent a significant barrier to deploying LLMs in healthcare. These costs encompass not only the initial investment in hardware and software but also the ongoing expenses associated with maintenance, updates, and personnel training. High-performance computing resources are essential to support the complex computations required by LLMs and VLMs, and these resources can be expensive to acquire and maintain. This financial barrier is particularly pronounced in LMICs, which may struggle to afford such high costs, thereby exacerbating global healthcare disparities. Overcoming this challenge requires strategic planning, investment in scalable infrastructure, the development of models that require lower computing resources, and the creation of easily deployable local models. Additionally, potential collaboration with technology partners to share costs and resources is crucial to ensure more equitable access to advanced healthcare technologies worldwide.

Finally, clinical accuracy and data quality are critical for the effective deployment of LLMs and VLMs in medical applications. Ensuring clinical accuracy is a primary challenge, as models often generate hallucinated content that can be misleading and dangerous. The quality and heterogeneity of data used for training significantly impact performance, with issues like compression artifacts affecting the detection of subtle radiographic findings.

Moreover, evaluation metrics used for generated reports often fail to reflect clinical requirements adequately. Metrics like BLEU (107), ROUGE (108), and METEOR (109) primarily focus on grammatical and lexical similarities and might not fully capture the complexity and nuances of medical diagnoses. Developing more clinically relevant evaluation metrics, such as RadCliQ (110), RadEval (111), and LLM-RadJudge (21), which better align with radiologists’ judgment, is essential for assessing the performance of VLMs in a clinically meaningful way.
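A small example makes the problem concrete: two report fragments with opposite clinical meaning can still score highly on lexical overlap. The snippet below uses the rouge_score Python package; the sentences are invented for illustration.

```python
from rouge_score import rouge_scorer

# Opposite clinical meaning, nearly identical wording.
reference = "no evidence of pneumothorax or pleural effusion"
generated = "evidence of pneumothorax and pleural effusion"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, generated))  # high overlap despite the negation flip
```

Clinically oriented metrics such as RadCliQ and LLM-RadJudge aim to penalize exactly this kind of negation error that n-gram overlap rewards.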

Opportunities

Despite the challenges, implementing LLMs and VLMs in lung cancer presents numerous promising advancements and opportunities. Addressing key areas such as data integration, data anonymization, creation of paired datasets, and improvement of evaluation metrics can lead to comprehensive diagnostic capabilities, enhanced clinical performance, and better alignment with healthcare professionals’ needs.

First, integrating multiple data sources in VLMs can significantly boost diagnostic accuracy and comprehensiveness. Currently, most research focuses on aligning single-modality images and text. However, in clinical applications, especially for lung cancer localization, relying solely on CT images is inadequate. PET images offer another valuable modality that, when combined with CT images, can provide more precise tumor localization. Integrating PET and CT images, and aligning them with clinical reports, is a promising approach for more accurate lung cancer diagnosis and treatment. This integration provides a holistic view of a patient’s condition, facilitating more precise and effective treatment planning.

Second, enhancing patient data anonymization processes is crucial for ensuring privacy and regulatory compliance. At present, most data anonymization is done manually, which is time-consuming and resource-intensive, requiring highly specialized doctors. Moreover, many data anonymization tools are not deployable locally. Although there has been research on using LLMs for data anonymization, their performance often falls short of expectations (112), and local deployment is resource-demanding. Therefore, there is a high demand for low-cost, high-performance, locally deployable, lightweight anonymization models. Improved anonymization techniques can facilitate the broader adoption of LLMs in clinical research and practice, protecting patient privacy while maintaining the utility of clinical data and adhering to ethical standards.
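As a flavor of what lightweight, locally deployable scrubbing looks like, the sketch below applies simple regular-expression rules to a few common identifier patterns. This is purely illustrative: production de-identification requires validated NER models, institution-specific rules, and human review.

```python
import re

# Minimal rule-based scrubber for a few common identifier patterns.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Pt MRN: 1234567 seen on 05/23/2025, callback 410-555-0199."))
# -> "Pt [MRN] seen on [DATE], callback [PHONE]."
```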

Third, the creation of extensive, well-annotated datasets like CheXpert Plus is essential for training and evaluating VLMs. While general-purpose LLMs and VLMs have progressed rapidly, those tailored for clinical applications have lagged. The main reason is the abundance of high-quality data for general purposes, whereas high-quality, well-annotated clinical data is scarce and requires multiple doctors to label and review, which is highly labor-intensive. For VLMs to advance, there is a great need for large, high-quality data pairs, such as X-ray-report pairs, CT-report pairs, CT/PET-report pairs, and even multi-modality pairs. These datasets will enable the development of robust and generalizable models, improving their clinical applicability and performance. Large-scale datasets with diverse and representative samples ensure that VLMs can learn from a wide range of clinical scenarios, enhancing their ability to generalize to new and unseen cases.

Fourth, although there are general evaluation metrics for VLM and LLM tasks, clinical applications require more rigorous standards. Developing clinically relevant evaluation metrics will help ensure these models perform better in clinical tasks, meeting the needs of healthcare professionals and enhancing their decision-making capabilities.

Finally, expanding research into practical applications such as visual question answering and decision support will better align LLM and VLM development with the real needs of healthcare providers. By focusing on real-world clinical challenges, researchers can develop models that offer tangible benefits to healthcare professionals and patients, ensuring that advanced AI technologies are effectively integrated into clinical practice.


Conclusions

This study provides an extensive review of the application of AI in lung cancer research, with a special focus on the emerging roles of LLMs and VLMs. It aims to offer an accessible understanding of how technological iterations in AI are addressing critical challenges in lung cancer care. We began by discussing the complexities involved in the diagnosis and prognosis of lung cancer and highlighted the significant advancements made by traditional AI techniques, such as machine learning and deep learning, in predicting lung cancer prognosis, survival rates, and recurrence.

We then shifted our focus to medical image processing, which is crucial for early detection and precise treatment planning in lung cancer. We examined advancements in image segmentation and registration, demonstrating how AI techniques enhance the accuracy and efficiency of these processes. This leads to better localization and characterization of cancerous regions, thereby improving diagnostic and treatment outcomes.

Finally, we explored the specific applications of LLMs and VLMs, illustrating how these models enhance precision care in lung cancer and facilitate the comprehensive analysis of clinical documents. Despite the considerable promise these technologies hold, their implementation faces several practical challenges, including ensuring data privacy, reducing algorithmic biases, and managing high computational costs. Addressing these challenges is essential to facilitate the widespread adoption of these technologies in clinical settings. Future research should prioritize the development of more efficient data integration techniques and user-friendly, cost-effective deployment solutions to ensure the clinical applicability of LLMs and VLMs. Interdisciplinary collaboration across engineering, computer science, and clinical practice will be crucial to ensuring that technological advancements not only improve lung cancer treatment outcomes but also maintain patient safety, adhere to ethical standards, and promote healthcare equity worldwide.


Acknowledgments

None.


Footnote

Peer Review File: Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-24-801/prf

Funding: This work was supported by the National Cancer Institute of the National Institutes of Health (award number R25CA288263 and R37CA229417).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-24-801/coif). K.D. serves as an unpaid editorial board member of Translational Lung Cancer Research from February 2024 to January 2026. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Leiter A, Veluswamy RR, Wisnivesky JP. The global burden of lung cancer: current status and future trends. Nat Rev Clin Oncol 2023;20:624-39. [Crossref] [PubMed]
  2. Siegel RL, Miller KD, Wagle NS, et al. Cancer statistics, 2023. CA Cancer J Clin 2023;73:17-48. [Crossref] [PubMed]
  3. Global Burden of Disease Cancer Collaboration. Global, Regional, and National Cancer Incidence, Mortality, Years of Life Lost, Years Lived With Disability, and Disability-Adjusted Life-years for 32 Cancer Groups, 1990 to 2015: A Systematic Analysis for the Global Burden of Disease Study. JAMA Oncol 2017;3:524-48. [Crossref] [PubMed]
  4. Wu Y, Wu Y, Wang J, et al. An optimal tumor marker group-coupled artificial neural network for diagnosis of lung cancer. Expert Systems with Applications 2011;38:11329-34.
  5. Nasser IM, Abu-Naser SS. Lung cancer detection using artificial neural network. International Journal of Engineering and Information Systems (IJEAIS) 2019;3:17-23.
  6. Feng F, Wu Y, Wu Y, et al. The effect of artificial neural network model combined with six tumor markers in auxiliary diagnosis of lung cancer. J Med Syst 2012;36:2973-80. [Crossref] [PubMed]
  7. Sesen MB, Nicholson AE, Banares-Alcantara R, et al. Bayesian networks for clinical decision support in lung cancer care. PLoS One 2013;8:e82349. [Crossref] [PubMed]
  8. Luo Y, McShan D, Ray D, et al. Development of a Fully Cross-Validated Bayesian Network Approach for Local Control Prediction in Lung Cancer. IEEE Trans Radiat Plasma Med Sci 2019;3:232-41. [Crossref] [PubMed]
  9. Petousis P, Han SX, Aberle D, et al. Prediction of lung cancer incidence on the low-dose computed tomography arm of the National Lung Screening Trial: A dynamic Bayesian network. Artif Intell Med 2016;72:42-55. [Crossref] [PubMed]
  10. Sherafatian M, Arjmand F. Decision tree-based classifiers for lung cancer diagnosis and subtyping using TCGA miRNA expression data. Oncol Lett 2019;18:2125-31. [Crossref] [PubMed]
  11. Krishnaiah V, Narsimha G, Chandra NS. Diagnosis of lung cancer prediction system using data mining classification techniques. International Journal of Computer Science and Information Technologies 2013;4:39-45.
  12. Sun T, Wang J, Li X, et al. Comparative evaluation of support vector machines for computer aided diagnosis of lung cancer in CT based on a multi-dimensional data set. Comput Methods Programs Biomed 2013;111:519-24. [Crossref] [PubMed]
  13. Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:2481-95. [Crossref] [PubMed]
  14. Chen LC, Papandreou G, Kokkinos I, et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40:834-48. [Crossref] [PubMed]
  15. He K, Gkioxari G, Dollár P, et al., editors. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision; 2017.
  16. Ronneberger O, Fischer P, Brox T, editors. U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18; 2015: Springer.
  17. Çiçek Ö, Abdulkadir A, Lienkamp SS, et al., editors. 3D U-Net: learning dense volumetric segmentation from sparse annotation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19; 2016: Springer.
  18. Xue Y, Xu T, Zhang H, et al. SegAN: Adversarial Network with Multi-scale L(1) Loss for Medical Image Segmentation. Neuroinformatics 2018;16:383-92. [Crossref] [PubMed]
  19. Dai W, Dong N, Wang Z, et al., editors. SCAN: Structure correcting adversarial network for organ segmentation in chest x-rays. International Workshop on Deep Learning in Medical Image Analysis; 2018: Springer.
  20. Razzak MI, Naz S, Zaib A. Deep learning for medical image processing: Overview, challenges and the future. In: Dey N, Ashour A, Borra S. (eds) Classification in BioApps. Lecture Notes in Computational Vision and Biomechanics; Springer, Cham; 2017;26:323-50.
  21. Wang Z, Luo X, Jiang X, et al. LLM-RadJudge: Achieving radiologist-level evaluation for X-ray report generation. arXiv preprint arXiv:240400998. 2024.
  22. Hyland SL, Bannur S, Bouzid K, et al. MAIRA-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:231113668. 2023.
  23. Bannur S, Bouzid K, Castro DC, et al. MAIRA-2: Grounded radiology report generation. arXiv preprint arXiv:240604449. 2024.
  24. Blankemeier L, Cohen JP, Kumar A, et al. Merlin: A vision language foundation model for 3D computed tomography. Research Square [Preprint] 2024:rs.3.rs-4546309.
  25. Yildirim N, Richardson H, Wetscherek MT, et al., editors. Multimodal healthcare AI: identifying and designing clinically relevant vision-language applications for radiology. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems; 2024.
  26. Pérez-García F, Sharma H, Bond-Taylor S, et al. RAD-DINO: Exploring scalable medical image encoders beyond text supervision. arXiv preprint arXiv:240110815. 2024.
  27. Kapadnis MN, Patnaik S, Nandy A, et al. SERPENT-VLM: Self-Refining Radiology Report Generation Using Vision Language Models. In: Proceedings of the 6th Clinical Natural Language Processing Workshop; 2024, p. 283-91.
  28. Zhang X, Wu C, Zhang Y, et al. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat Commun 2023;14:4542. [Crossref] [PubMed]
  29. Huang S, Yang J, Shen N, et al. Artificial intelligence in lung cancer diagnosis and prognosis: Current application and future perspective. Semin Cancer Biol 2023;89:30-7. [Crossref] [PubMed]
  30. Gandhi Z, Gurram P, Amgai B, et al. Artificial Intelligence and Lung Cancer: Impact on Improving Patient Outcomes. Cancers (Basel) 2023;15:5236. [Crossref] [PubMed]
  31. Guo N, Yen RF, El Fakhri G, et al., editors. SVM based lung cancer diagnosis using multiple image features in PET/CT. 2015 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC); 2015: IEEE.
  32. Massion PP, Antic S, Ather S, et al. Assessing the Accuracy of a Deep Learning Method to Risk Stratify Indeterminate Pulmonary Nodules. Am J Respir Crit Care Med 2020;202:241-9. [Crossref] [PubMed]
  33. Heuvelmans MA, van Ooijen PMA, Ather S, et al. Lung cancer prediction by Deep Learning to identify benign lung nodules. Lung Cancer 2021;154:1-4. [Crossref] [PubMed]
  34. Baldwin DR, Gustafson J, Pickup L, et al. External validation of a convolutional neural network artificial intelligence tool to predict malignancy in pulmonary nodules. Thorax 2020;75:306-12. [Crossref] [PubMed]
  35. Bartholomai JA, Frieboes HB. Lung Cancer Survival Prediction via Machine Learning Regression, Classification, and Statistical Techniques. Proc IEEE Int Symp Signal Proc Inf Tech 2018;2018:632-7.
  36. Surveillance, Epidemiology, and End Results (SEER) Program. National Cancer Institute; 2012.
  37. Lynch CM, Abdollahi B, Fuqua JD, et al. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int J Med Inform 2017;108:1-8. [Crossref] [PubMed]
  38. She Y, Jin Z, Wu J, et al. Development and Validation of a Deep Learning Model for Non-Small Cell Lung Cancer Survival. JAMA Netw Open 2020;3:e205842. [Crossref] [PubMed]
  39. Katzman JL, Shaham U, Cloninger A, et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 2018;18:24. [Crossref] [PubMed]
  40. Doppalapudi S, Qiu RG, Badr Y. Lung cancer survival period prediction and understanding: Deep learning approaches. Int J Med Inform 2021;148:104371. [Crossref] [PubMed]
  41. Xu Y, Hosny A, Zeleznik R, et al. Deep Learning Predicts Lung Cancer Treatment Response from Serial Medical Imaging. Clin Cancer Res 2019;25:3266-75. [Crossref] [PubMed]
  42. Yang Y, Xu L, Sun L, et al. Machine learning application in personalised lung cancer recurrence and survivability prediction. Comput Struct Biotechnol J 2022;20:1811-20. [Crossref] [PubMed]
  43. Mohamed SK, Walsh B, Timilsina M, et al. On Predicting Recurrence in Early Stage Non-small Cell Lung Cancer. AMIA Annu Symp Proc 2021;2021:853-62.
  44. Janik A, Torrente M, Costabello L, et al. Machine learning–assisted recurrence prediction for patients with early-stage non–small-cell lung cancer. JCO Clinical Cancer Informatics 2023;7:e2200062. [Crossref] [PubMed]
  45. Kim G, Moon S, Choi JH. Deep Learning with Multimodal Integration for Predicting Recurrence in Patients with Non-Small Cell Lung Cancer. Sensors (Basel) 2022;22:6594. [Crossref] [PubMed]
  46. Lee B, Chun SH, Hong JH, et al. DeepBTS: Prediction of Recurrence-free Survival of Non-small Cell Lung Cancer Using a Time-binned Deep Neural Network. Sci Rep 2020;10:1952. [Crossref] [PubMed]
  47. Selvaraju RR, Cogswell M, Das A, et al., editors. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision; 2017.
  48. Mukunda K, Ye T, Luo Y, et al. Deep Learning Detection of Subtle Torsional Eye Movements: Preliminary Results. bioRxiv 2024:2024.05.26.595236.
  49. Xu A, Wang L, Feng S, et al., editors. Threshold-based level set method of image segmentation. 2010 Third International Conference on Intelligent Networks and Intelligent Systems; 2010: IEEE.
  50. Cigla C, Alatan AA, editors. Region-based image segmentation via graph cuts. 2008 15th IEEE International Conference on Image Processing; 2008: IEEE.
  51. Yu-Qian Z, Wei-Hua G, Zhen-Cheng C, et al. Medical images edge detection based on mathematical morphology. Conf Proc IEEE Eng Med Biol Soc 2005;2005:6492-5. [Crossref] [PubMed]
  52. Shelhamer E, Long J, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:640-51. [Crossref] [PubMed]
  53. Donahue J, Krähenbühl P, Darrell T. Adversarial feature learning. arXiv preprint arXiv:160509782. 2016.
  54. Murphy K, van Ginneken B, Reinhardt JM, et al. Evaluation of registration methods on thoracic CT: the EMPIRE10 challenge. IEEE Trans Med Imaging 2011;30:1901-20. [Crossref] [PubMed]
  55. Ding K, Bayouth JE, Buatti JM, et al. 4DCT-based measurement of changes in pulmonary function following a course of radiation therapy. Med Phys 2010;37:1261-72. [Crossref] [PubMed]
  56. Yin Y, Hoffman EA, Ding K, et al. A cubic B-spline-based hybrid registration of lung CT images for a dynamic airway geometric model with large deformation. Phys Med Biol 2011;56:203-18. [Crossref] [PubMed]
  57. Ding K, Cao K, Christensen GE, et al., editors. Registration-based regional lung mechanical analysis: Retrospectively reconstructed dynamic imaging versus static breath-hold image acquisition. Medical Imaging 2009: Biomedical Applications in Molecular, Structural, and Functional Imaging; 2009: SPIE.
  58. Du K, Ding K, Cao K, et al., editors. Registration-based measurement of regional expiration volume ratio using dynamic 4DCT imaging. 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro; 2011: IEEE.
  59. Ding K, Cao K, Christensen G, et al., editors. Registration-based lung tissue mechanics assessment during tidal breathing. First International Workshop on Pulmonary Image Analysis; 2008: Lulu New York.
  60. Ding K, Miller W, Cao K, et al., editors. Quantification of regional lung ventilation from tagged hyperpolarized helium-3 MRI. 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro; 2011: IEEE.
  61. Ding K, Du K, Cao K, et al., editors. Time-varying lung ventilation analysis of 4DCT using image registration. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2011: IEEE.
  62. Chen X, Diaz-Pinto A, Ravikumar N, et al. Deep learning in medical image registration. Progress in Biomedical Engineering 2021;3:012003.
  63. Balakrishnan G, Zhao A, Sabuncu MR, et al. VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Trans Med Imaging 2019; Epub ahead of print. [Crossref]
  64. Hoopes A, Hoffmann M, Fischl B, et al., editors. HyperMorph: Amortized hyperparameter learning for image registration. Information Processing in Medical Imaging: 27th International Conference, IPMI 2021, Virtual Event, June 28–June 30, 2021, Proceedings 27; 2021: Springer.
  65. Kim B, Kim DH, Park SH, et al. CycleMorph: Cycle consistent unsupervised deformable image registration. Med Image Anal 2021;71:102036. [Crossref] [PubMed]
  66. Hooshangnejad H, Chen Q, Feng X, et al. DAART: a deep learning platform for deeply accelerated adaptive radiation therapy for lung cancer. Front Oncol 2023;13:1201679. [Crossref] [PubMed]
  67. Chen J, Frey EC, He Y, et al. TransMorph: Transformer for unsupervised medical image registration. Med Image Anal 2022;82:102615. [Crossref] [PubMed]
  68. Kuang D, Schmah T, editors. FAIM–a convnet method for unsupervised 3d medical image registration. Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 10; 2019: Springer.
  69. Zhu JY, Park T, Isola P, et al., editors. Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision; 2017.
  70. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. 2020;21:1-67.
  71. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 2020;33:1877-901.
  72. Radford A, Kim JW, Hallacy C, et al., editors. Learning transferable visual models from natural language supervision. International Conference on Machine Learning; 2021: PMLR.
  73. Ramesh A, Pavlov M, Goh G, et al., editors. Zero-shot text-to-image generation. International Conference on Machine Learning; 2021: PMLR.
  74. Zhang P, Li X, Hu X, et al., editors. VinVL: Revisiting visual representations in vision-language models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021.
  75. Kim W, Son B, Kim I, editors. ViLT: Vision-and-language transformer without convolution or region supervision. International Conference on Machine Learning; 2021: PMLR.
  76. Jia C, Yang Y, Xia Y, et al., editors. Scaling up visual and vision-language representation learning with noisy text supervision. International Conference on Machine Learning; 2021: PMLR.
  77. Yuan L, Chen D, Chen Y-L, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:211111432. 2021.
  78. Rae JW, Borgeaud S, Cai T, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:211211446. 2021.
  79. Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models. arXiv preprint arXiv:220315556. 2022.
  80. Chowdhery A, Narang S, Devlin J, et al. PaLM: Scaling Language Modeling with Pathways. The Journal of Machine Learning Research 2023;24:11324-436.
  81. Ramesh A, Dhariwal P, Nichol A, et al. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:220406125. 2022.
  82. Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:230213971. 2023.
  83. Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv preprint arXiv:230308774. 2023.
  84. Touvron H, Martin L, Stone K, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:230709288. 2023.
  85. Li J, Li D, Savarese S, et al., editors. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning; 2023: PMLR.
  86. Gemini Team, Georgiev P, Lei VI, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:240305530. 2024.
  87. Liu A, Feng B, Xue B, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:241219437. 2024.
  88. Bai S, Chen K, Liu X, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:250213923. 2025.
  89. Singhal K, Tu T, Gottweis J, et al. Toward expert-level medical question answering with large language models. Nat Med 2025;31:943-50. [Crossref] [PubMed]
  90. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36:1234-40. [Crossref] [PubMed]
  91. Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:190405342. 2019.
  92. Liu Z, Li Y, Shu P, et al. Radiology-Llama2: Best-in-class large language model for radiology. arXiv preprint arXiv:230906419. 2023.
  93. Liu Z, Wang P, Li Y, et al. RadOnc-GPT: A large language model for radiation oncology. arXiv preprint arXiv:230910160. 2023.
  94. Li M, Huang J, Yeung J, et al. CancerLLM: A large language model in cancer domain. arXiv preprint arXiv:240610459. 2024.
  95. Hooshangnejad H, Huang G, Kelly K, et al. EXACT-Net: Framework for EHR-Guided Lung Tumor Auto-Segmentation for Non-Small Cell Lung Cancer Radiotherapy. Cancers (Basel) 2024;16:4097. [Crossref] [PubMed]
  96. Wu C, Lin W, Zhang X, et al. PMC-LLaMA: toward building open-source language models for medicine. J Am Med Inform Assoc 2024;31:1833-43. [Crossref] [PubMed]
  97. Chen Z, Luo L, Bie Y, et al. Dia-LLaMA: Towards large language model-driven CT report generation. arXiv preprint arXiv:240316386. 2024.
  98. Chambon P, Delbrouck JB, Sounack T, et al. CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats. arXiv preprint arXiv:240519538. 2024.
  99. Bluethgen C, Chambon P, Delbrouck JB, et al. A vision-language foundation model for the generation of realistic chest X-ray images. Nat Biomed Eng 2025;9:494-506. [Crossref] [PubMed]
  100. Xu Y, Sun L, Peng W, et al. MedSyn: Text-guided anatomy-aware synthesis of high-fidelity 3D CT images. IEEE Trans Med Imaging 2024.
  101. Li X, Shuai Y, Liu C, et al. Text-driven tumor synthesis. arXiv preprint arXiv:241218589. 2024.
  102. Guo P, Zhao C, Yang D, et al. MAISI: Medical AI for synthetic imaging. arXiv preprint arXiv:240911169. 2024.
  103. Oh Y, Park S, Byun HK, et al. LLM-driven multimodal target volume contouring in radiation oncology. Nat Commun 2024;15:9186. [Crossref] [PubMed]
  104. Huemann Z, Tie X, Hu J, et al. ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax. J Imaging Inform Med 2024;37:1652-63. [Crossref] [PubMed]
  105. Kim K, Lee Y, Park D, et al., editors. LLM-guided multi-modal multiple instance learning for 5-year overall survival prediction of lung cancer. International Conference on Medical Image Computing and Computer-Assisted Intervention; 2024: Springer.
  106. Restrepo D, Wu C, Tang Z, et al. Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs. Proceedings of the AAAI Conference on Artificial Intelligence 2025;39:28321-30.
  107. Papineni K, Roukos S, Ward T, et al., editors. Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; 2002.
  108. Lin CY. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics; 2004:74-81.
  109. Banerjee S, Lavie A, editors. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; 2005.
  110. Yu F, Endo M, Krishnan R, et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns (N Y) 2023;4:100802. [Crossref] [PubMed]
  111. Calamida A, Nooralahzadeh F, Rohanian M, et al. Radiology-Aware Model-Based Evaluation Metric for Report Generation. arXiv preprint arXiv:231116764. 2023.
  112. Wiest IC, Leßmann M-E, Wolf F, et al. Anonymizing medical documents with local, privacy preserving large language models: The LLM-Anonymizer. medRxiv 2024:2024.06.11.24308355.
Cite this article as: Luo Y, Hooshangnejad H, Ngwa W, Ding K. Opportunities and challenges in lung cancer care in the era of large language models and vision language models. Transl Lung Cancer Res 2025;14(5):1830-1847. doi: 10.21037/tlcr-24-801
