Construction of risk prediction model for lung cancer based on data mining technology
HUANG Pu-chao¹, YUAN Hui-jie², ZHANG Gui-fang¹
1. Xinxiang Central Hospital, the Fourth Clinical College of Xinxiang Medical University, Xinxiang, Henan 453000, China; 2. School of Public Health, Zhengzhou University, Zhengzhou, Henan 450001, China
Abstract: Objective: To establish risk prediction models for lung cancer from epidemiological characteristics and clinical symptoms using data mining technology, and to evaluate the performance of each model in order to identify the best predictive model. Methods: A total of 460 patients with lung cancer and 560 patients with benign lung disease were enrolled, and 16 independent variables comprising epidemiological characteristics and clinical symptoms were collected. The subjects were randomly divided into a training set and a test set at a ratio of 3:1. Using these variables, three risk prediction models for lung cancer were built with a support vector machine (SVM), a C5.0 decision tree and an artificial neural network (ANN), respectively, and their predictive performance was compared. Results: After feature extraction, nine variables, including blood in phlegm, fever and sweating, and smoking history, were retained as effective variables for building the risk prediction models. In the test set, the sensitivities of the SVM, C5.0 decision tree and ANN models were 74.1%, 62.5% and 92.9%, respectively; the specificities were 76.2%, 80.4% and 64.3%; the positive predictive values were 70.9%, 71.4% and 67.1%; the negative predictive values were 79.0%, 73.2% and 92.0%; and the accuracies were 75.3%, 72.5% and 76.9%. The areas under the curve were 0.752 (95% CI: 0.694-0.803), 0.715 (95% CI: 0.655-0.769) and 0.786 (95% CI: 0.730-0.835), respectively. Conclusion: The ANN model showed better overall performance than the SVM and C5.0 decision tree models, and it has potential application value in screening populations at high risk of lung cancer.
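The five threshold-based measures reported in the Results all follow from a single 2×2 confusion matrix, so a minimal Python sketch of that arithmetic may help readers check the figures. The class and function names here are hypothetical, and the counts are an assumption: the abstract does not report a confusion matrix, so the values below were chosen only because they are consistent with the percentages reported for the SVM model on a test set of 255 subjects (one quarter of 1020).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a 2x2 confusion matrix (lung cancer = positive class)."""
    tp: int  # lung cancer correctly predicted as lung cancer
    fn: int  # lung cancer predicted as benign
    tn: int  # benign disease correctly predicted as benign
    fp: int  # benign disease predicted as lung cancer


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, PPV, NPV and accuracy as fractions."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
    }


# ASSUMPTION: illustrative counts, not taken from the paper. They were picked
# to match the reported SVM percentages (112 cancer and 143 benign test cases).
svm_like = ConfusionCounts(tp=83, fn=29, tn=109, fp=34)
metrics = diagnostic_metrics(svm_like)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these assumed counts the sketch prints 74.1%, 76.2%, 70.9%, 79.0% and 75.3%, matching the SVM figures in the abstract. The area under the curve cannot be obtained this way, because it depends on the model's continuous scores rather than on one confusion matrix.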
HUANG Pu-chao, YUAN Hui-jie, ZHANG Gui-fang. Construction of risk prediction model for lung cancer based on data mining technology[J]. Practical Preventive Medicine (实用预防医学), 2022, 29(11): 1390-1394.