Abstract:
Objective: To explore how Parallel Random Forest can be implemented and under what conditions it is applicable, and to provide a scientific reference for genomics data analysis.
Methods: A Parallel Random Forest program was written with the R foreach package, and its performance was evaluated on simulated SNP data.
Results: When the number of SNPs is 100, 500, or 1000, performance gains do not increase linearly with the number of CPUs. For the same amount of data, the gains also differ with the number of trees (ntree). When the number of SNPs reaches 5000, the performance of this method is relatively low. With ntree set to 5000 or 10000 on 10 CPUs, the program is less than 2 times faster than the sequential job, and there is almost no speed gain.
Conclusion: When the number of SNPs is small (fewer than 1000), the Parallel Random Forest program based on the R foreach package performs better. However, when the number of SNPs is large (over 5000), the method performs poorly because shared memory introduces communication overhead. In that case, other analysis tools, such as Random Jungle, can be considered.
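The two ideas in the abstract, dividing the trees of a forest among workers and seeing speedup flatten as overhead grows, can be illustrated with a minimal sketch. It is written in Python rather than R so that it stays dependency-free, and it is not the authors' program. The function names `split_trees` and `modeled_speedup`, the Amdahl-style timing model, and the values 0.9 and 0.05 are all assumptions chosen for illustration; they are not measurements from this study.

```python
def split_trees(ntree, n_workers):
    """Divide ntree trees among n_workers as evenly as possible.

    Hypothetical helper: each worker would grow its share of trees,
    and the resulting sub-forests would then be combined into one forest.
    """
    base, extra = divmod(ntree, n_workers)
    return [base + 1 if i < extra else base for i in range(n_workers)]


def modeled_speedup(n_cpus, parallel_fraction, overhead_per_extra_cpu=0.0):
    """Amdahl-style speedup with an assumed linear communication cost.

    Times are relative to a sequential run time of 1.0. The overhead term
    is an illustrative assumption, not a quantity measured in the study.
    """
    serial_time = 1.0 - parallel_fraction
    parallel_time = parallel_fraction / n_cpus
    overhead_time = overhead_per_extra_cpu * (n_cpus - 1)
    return 1.0 / (serial_time + parallel_time + overhead_time)


if __name__ == "__main__":
    # Share of trees per worker for ntree = 5000 on 10 CPUs.
    print(split_trees(5000, 10))
    # Speedup under assumed values: 90% parallel work, 0.05 overhead per extra CPU.
    for cpus in (1, 2, 4, 10):
        print(cpus, round(modeled_speedup(cpus, 0.9, 0.05), 2))
```

Under this model, speedup grows more slowly than the number of CPUs even with zero overhead, and a sufficiently large overhead term can keep the speedup on 10 CPUs below 2, which is the qualitative pattern the results describe.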