Abstract: Objective To compare the variable-selection and predictive performance of three methods applied to gene expression profile data: the smoothly clipped absolute deviation support vector machine (SCAD-SVM), the support vector machine (SVM), and the Elastic Net. Methods Simulated gene expression profile data were generated under different parameter settings, and a real colon cancer dataset was also analyzed. The false discovery rate (FDR), the consistency error rate, and the area under the ROC curve (AUC) were used to evaluate the variable-selection and predictive ability of the three methods. Results The simulations showed that, with the number of differential variables fixed, the variable-selection and predictive ability of all three methods improved as the correlation coefficient between differential variables increased. With the correlation coefficient fixed, as the number of differential variables increased, the variable-selection and predictive ability of SCAD-SVM and Elastic Net declined, whereas that of SVM improved. Conclusions SCAD-SVM remedies the inability of SVM to perform variable selection directly, while also improving the precision and prediction accuracy of the fitted model. Overall, SCAD-SVM is superior in variable selection and predictive ability; moreover, it yields higher prediction precision and more stable model estimates when handling gene expression profile data with highly correlated variables.
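The simulation design and AUC evaluation described above can be sketched in code. This is a minimal illustration, not the paper's actual protocol: the function names, the compound-symmetry correlation structure, and the parameter values (n, p, k, rho, delta) are all assumptions chosen for clarity; the paper's exact settings differ.

```python
import numpy as np

def simulate_expression(n=100, p=500, k=10, rho=0.5, delta=1.0, seed=0):
    """Simulate two-class gene expression data (illustrative setup).

    The first k genes are "differential": they share a pairwise
    correlation rho (compound symmetry) and their means are shifted
    by delta in the case group. The remaining p - k genes are
    independent standard-normal noise.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)          # balanced controls/cases
    cov = np.full((k, k), rho)             # correlation among
    np.fill_diagonal(cov, 1.0)             # differential genes
    X = rng.standard_normal((n, p))
    X[:, :k] = rng.multivariate_normal(np.zeros(k), cov, size=n)
    X[y == 1, :k] += delta                 # mean shift in cases
    return X, y

def auc(scores, y):
    """AUC via the rank-sum (Mann-Whitney) formula."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)
```

Raising `rho` makes the differential block more informative as a group, while raising `k` spreads the signal over more variables; varying these two knobs reproduces the kind of comparison scenarios the abstract describes.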
SHI Xiao-wen, XIAO Chun, LIU Yun-liang, LIU Yan. Comparison of three statistical analysis methods based on gene expression profile data[J]. Practical Preventive Medicine (实用预防医学), 2018, 25(2): 155-159.