Liu Xiangrong (1, 2), Jiang Mingming (3), Liu Lu (2), Zhang Xinsheng (2), Liu Zhao (2), Yang Bo (4), Liu Yinghua (1, 2)
1 PLA Medical School, Beijing 10085, China
2 Department of Nutrition, First Medical Center, PLA General Hospital, Beijing 10085, China
3 Innovative Medical Research Department, the PLA General Hospital, Beijing 10085, China
4 Department of Medical Oncology, First Medical Center, PLA General Hospital, Beijing 10085, China
Abstract: Objective: This study aimed to establish an interpretable predictive model for assessing the risk of lung cancer
occurrence using human blood fatty acid levels, thereby promoting the prevention and early diagnosis of lung cancer. Methods: Clinical
and blood fatty acid data from 1,837 patients with lung cancer (1,194 cases) or non-tumorous conditions (643 cases) were
retrospectively collected from June 2023 to June 2025 at the First Medical Center (1,677 cases) and the Eighth Medical Center (160
cases) of the Chinese PLA General Hospital. Using a fixed random seed, the dataset from the First Medical Center was randomly
split into a training set and an internal validation set at a ratio of 7:3, while the dataset from the Eighth Medical Center was
designated as the external validation set. Five models (random forest, extreme gradient boosting, support vector machine [SVM],
classification and regression tree, and logistic regression) were developed to predict the risk of lung cancer. The predictive
performance of the five models was compared using the area under the receiver operating characteristic curve, decision curve
analysis, and calibration curves, and the optimal model was selected. The SHapley Additive exPlanations (SHAP) method was
used to interpret the SVM model. Results: A clinical prediction model was established incorporating 17 predictors, including age,
palmitic acid, palmitoleic acid, linoleic acid, γ-linolenic acid, α-linolenic acid, arachidonic acid (AA), eicosapentaenoic acid,
docosahexaenoic acid, total monounsaturated fatty acids, total polyunsaturated fatty acids (PUFAs), the ratio of AA to
eicosapentaenoic acid, and the ratio of ω-6 PUFAs to ω-3 PUFAs. In external validation, the areas under the receiver operating
characteristic curve of the random forest, extreme gradient boosting, SVM, classification and regression tree, and logistic
regression models were 0.927, 0.931, 0.934, 0.84, and 0.912, respectively. Decision curve analysis indicated
that, across most probability thresholds, the SVM model tended to yield greater net benefit than the other models. Conclusion:
The SVM-based lung cancer risk prediction model exhibited the best performance. It can effectively assist
medical personnel in the early identification of populations at high risk of lung cancer and facilitate precise nutritional interventions targeting
risk factors to reduce the incidence of the disease.
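
The two comparison metrics named in the abstract, the area under the ROC curve and the net benefit used in decision curve analysis, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `auc` and `net_benefit` and the toy labels and scores are hypothetical. The net-benefit expression is the standard decision-curve definition, NB = TP/n - (FP/n) * pt / (1 - pt), assumed here rather than taken from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: 1 for a case (e.g. lung cancer), 0 for a control.
    scores: predicted risk for each subject.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # Count case/control pairs where the case scores higher; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def net_benefit(labels, scores, threshold):
    """Net benefit at probability threshold pt (assumed standard definition).

    NB = TP/n - (FP/n) * pt / (1 - pt), where a subject is classified as
    high risk when the predicted score is >= pt.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical toy data: 4 cases and 4 controls with made-up risk scores.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.7, 0.6, 0.1]

print(auc(labels, scores))               # 0.9375
print(net_benefit(labels, scores, 0.5))  # 0.25
```

A decision curve repeats the `net_benefit` calculation over a grid of thresholds for each model and for the "treat all" and "treat none" reference strategies. A model is preferred at a given threshold when its net benefit is the highest of the curves.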