Parallel Random Forest method and applicable condition

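Abstract: Objective To explore how parallel random forest computation can be implemented and under what conditions it is applicable, providing a scientific reference for genomic data analysis. Methods A parallel random forest program was written with the R foreach package, and its performance was examined on simulated SNP data. Results With 100, 500, or 1000 SNP loci, the speedup of the parallel method grew nonlinearly as the number of CPUs used on the workstation increased, and for the same number of loci the speedup differed across values of ntree. With 5000 SNP loci the speedup was poor: on 10 cores there was almost no speedup with ntree of 500 or 1000, and even with ntree of 5000 or 10000 the speedup did not exceed 2-fold. Conclusion The foreach-based parallel random forest method gives acceptable speedup when the number of SNP loci is modest (e.g., <1000). However, because of communication overhead arising from shared memory and related factors, the speedup is very poor when the number of loci is large (over 5000); in that case other analysis tools such as Random Jungle (RJ) may be considered.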

Keywords: big data, random forest, parallel computing, single nucleotide polymorphism

Abstract: Objective To explore how Parallel Random Forest can be implemented and under what conditions it is applicable, and to provide a scientific reference for genomics data analysis. Methods A Parallel Random Forest program was written with the R foreach package, and simulated SNP data were used to evaluate its performance. Results When the number of SNPs was 100, 500, or 1000, the performance gain did not scale linearly with the number of CPUs, and for the same amount of data the gain also differed across values of ntree. When the number of SNPs reached 5000, the method performed poorly: in the 10-CPU environment there was almost no speed gain with ntree of 500 or 1000, and even with ntree of 5000 or 10000 the run was less than 2 times faster than the sequential job. Conclusion When the number of SNPs is not large (less than 1000), the Parallel Random Forest program based on the R foreach package gives an acceptable speedup. However, when the number of SNPs is high (over 5000), communication overhead arising from shared memory and related factors makes the speedup very poor, and other analysis tools such as Random Jungle can be considered.
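The general idea behind this kind of parallel random forest is to divide the ntree trees among workers, grow each share independently, and merge the shares into one forest. The following is only a minimal sketch of that split-and-combine pattern, not the authors' program: the study used the R foreach package, whereas this illustration is written in Python with the standard library only. The names `split_ntree`, `grow_stumps`, `parallel_forest`, and `predict` are hypothetical, the "trees" are one-split stumps standing in for a real tree learner, and a thread pool is assumed purely for portability (CPython threads do not speed up CPU-bound work, so a process pool would be needed for a real speedup).

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_ntree(ntree, n_workers):
    """Divide ntree trees as evenly as possible among n_workers."""
    base, extra = divmod(ntree, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def grow_stumps(X, y, n_trees, seed):
    """Grow n_trees one-split stumps, each on a bootstrap sample and one
    randomly chosen SNP column (a toy stand-in for a real tree learner)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    stumps = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        j = rng.randrange(p)                         # random SNP column
        votes = {}
        for i in rows:
            votes.setdefault(X[i][j], Counter())[y[i]] += 1
        # Majority class of the bootstrap sample, used for unseen genotypes.
        default = Counter(y[i] for i in rows).most_common(1)[0][0]
        rule = {g: c.most_common(1)[0][0] for g, c in votes.items()}
        stumps.append((j, rule, default))
    return stumps


def parallel_forest(X, y, ntree, n_workers):
    """Grow ntree stumps in n_workers independent chunks, then combine them."""
    chunks = split_ntree(ntree, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(
            lambda a: grow_stumps(X, y, a[1], seed=a[0]), enumerate(chunks)
        )
        return [s for part in parts for s in part]  # combine step


def predict(forest, x):
    """Majority vote over all stumps in the combined forest."""
    votes = Counter(rule.get(x[j], default) for j, rule, default in forest)
    return votes.most_common(1)[0][0]
```

For example, `split_ntree(500, 10)` assigns 50 trees to each of 10 workers. In this sketch every worker receives the full data set `X`; moving or sharing that data grows with the number of SNP columns, which is the kind of overhead the abstract points to when the number of loci is large.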

Key words: Big Data, Random Forest, Parallel Computation, SNPs

Funding: National Natural Science Foundation of China (81172741, 30972537)

Corresponding author: LIU Yan, email: liuyan@ems.hrbmu.edu.cn; yanliu2005@163.com

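About the author: GU Xingbo (born 1988), male, from Harbin, Heilongjiang Province; master's student working mainly on the research and application of biostatistical methods; email: 740774209@qq.com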

Cite this article:

GU Xingbo, WEN Qi, SHI Xiaowen, LIU Yan. Parallel Random Forest method and applicable condition[J]. Practical Preventive Medicine (实用预防医学), 2016, 23(2): 129-132.

Link to this article:

https://www.syyfyx.com/CN/ or https://www.syyfyx.com/CN/Y2016/V23/I2/129