A Massively Parallel SMC Sampler for Decision Trees
Bayesian approaches to decision trees (DTs) using Markov Chain Monte Carlo (MCMC) samplers have recently demonstrated state-of-the-art accuracy when training DTs to solve classification problems. Despite the competitive classification accuracy, MCMC requires a potentially long runtime to converge. A widely used approach to reducing an algorithm's runtime is to employ modern multi-core computer architectures, either with shared memory (SM) or distributed memory (DM), and use parallel computing to accelerate the algorithm. However, the inherently sequential nature of MCMC makes it unsuitable for parallel implementation unless accuracy is sacrificed. This issue is particularly evident in DM architectures, which normally provide access to larger numbers of cores than SM. Sequential Monte Carlo (SMC) samplers are a parallel alternative to MCMC that does not trade accuracy for parallelism. However, the performance of SMC samplers in the context of DTs is underexplored, and their parallelization is complicated by the challenge of parallelizing the bottleneck step, namely redistribution, especially for variable-size data types such as DTs. In this work, we study the problem of parallelizing SMC in the context of DTs on both SM and DM. On both memory architectures, we show that the proposed parallelization strategies achieve asymptotically optimal O(log₂ N) time complexity. Numerical results are presented for a 32-core SM machine and a 256-core DM cluster. For both computer architectures, the experimental results show that our approach has comparable or better accuracy than MCMC but runs up to 51 times faster on SM and 640 times faster on DM. We share a GitHub link to the source code.
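For orientation, the following is a minimal, single-core Python (NumPy) sketch of a generic likelihood-tempered SMC sampler. The 1-D Gaussian setting, function names, and parameters are illustrative assumptions, not the authors' decision-tree implementation, in which the particles are variable-size trees and the resampling (redistribution) step is the component parallelised to O(log₂ N) time.

```python
import numpy as np

def smc_sampler(log_prior, log_lik, n=1024, n_temps=20, step=0.5, seed=0):
    """Illustrative likelihood-tempered SMC sampler over a 1-D parameter.

    A single-core sketch of the generic scheme the abstract refers to, NOT
    the paper's decision-tree code: there, particles are trees and the
    resampling (redistribution) step runs in O(log2 N) parallel time.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_temps + 1)  # tempering schedule
    x = rng.normal(size=n)                      # particles drawn from the prior
    log_w = np.full(n, -np.log(n))              # uniform log weights

    for b_prev, b in zip(betas[:-1], betas[1:]):
        # 1. Reweight: incremental weight bridges consecutive tempered targets.
        log_w += (b - b_prev) * log_lik(x)
        log_w -= np.logaddexp.reduce(log_w)     # normalise in log space

        # 2. Resample (redistribution) when the effective sample size drops.
        #    This is the bottleneck the paper parallelises; here it is a
        #    plain O(N) systematic resampler.
        ess = np.exp(-np.logaddexp.reduce(2.0 * log_w))
        if ess < n / 2:
            u = (rng.uniform() + np.arange(n)) / n
            idx = np.searchsorted(np.cumsum(np.exp(log_w)), u)
            x = x[np.minimum(idx, n - 1)]
            log_w = np.full(n, -np.log(n))

        # 3. Move: one Metropolis random-walk step targeting the current
        #    tempered posterior. Moves are independent across particles,
        #    hence embarrassingly parallel on SM or DM architectures.
        prop = x + step * rng.normal(size=n)
        log_alpha = (log_prior(prop) + b * log_lik(prop)
                     - log_prior(x) - b * log_lik(x))
        x = np.where(np.log(rng.uniform(size=n)) < log_alpha, prop, x)

    return x, np.exp(log_w)

# Example: N(0,1) prior combined with a N(2, 0.25) likelihood term.
samples, weights = smc_sampler(lambda x: -0.5 * x**2,
                               lambda x: -2.0 * (x - 2.0)**2)
```

Resampling only when the effective sample size halves is a common heuristic; the paper's contribution is performing that redistribution step in O(log₂ N) parallel time rather than via the O(N) sequential loop above.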
Main Authors: | Efthyvoulos Drousiotis, Alessandro Varsi, Alexander M. Phillips, Simon Maskell, Paul G. Spirakis |
---|---|
Affiliations: | Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool L69 3GJ, UK (Drousiotis, Varsi, Phillips, Maskell); Department of Computer Science, University of Liverpool, Liverpool L69 3BX, UK (Spirakis) |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Algorithms |
ISSN: | 1999-4893 |
DOI: | 10.3390/a18010014 |
Subjects: | parallel algorithms; machine learning; Bayesian decision trees; sequential Monte Carlo samplers; Markov Chain Monte Carlo; shared memory |
Online Access: | https://www.mdpi.com/1999-4893/18/1/14 |