A Massively Parallel SMC Sampler for Decision Trees

Bayesian approaches to decision trees (DTs) using Markov Chain Monte Carlo (MCMC) samplers have recently demonstrated state-of-the-art accuracy when training DTs to solve classification problems. Despite this competitive classification accuracy, MCMC can require a long runtime to converge. A widely used way to reduce an algorithm's runtime is to employ modern multi-core computer architectures, with either shared memory (SM) or distributed memory (DM), and accelerate the algorithm through parallel computing. However, the inherently sequential nature of MCMC makes it unsuitable for parallel implementation unless accuracy is sacrificed. This issue is particularly evident on DM architectures, which normally provide access to larger numbers of cores than SM. Sequential Monte Carlo (SMC) samplers are a parallel alternative to MCMC that does not trade accuracy for parallelism. However, the performance of SMC samplers in the context of DTs is underexplored, and their parallelization is complicated by the difficulty of parallelizing their bottleneck, namely redistribution, especially for variable-size data types such as DTs. In this work, we study the problem of parallelizing SMC in the context of DTs on both SM and DM. On both memory architectures, we show that the proposed parallelization strategies achieve asymptotically optimal O(log₂ N) time complexity. Numerical results are presented for a 32-core SM machine and a 256-core DM cluster. For both computer architectures, the experimental results show that our approach has comparable or better accuracy than MCMC but runs up to 51 times faster on SM and 640 times faster on DM. In this paper, we share the GitHub link to the source code.
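The abstract's two key technical ingredients, an SMC sampler that evolves a population of N weighted particles and the resampling ("redistribution") step that dominates its cost, can be illustrated with a short sketch. The following is a minimal, sequential Python illustration under stated assumptions: init_particle, log_target, and propose are hypothetical placeholders, the weight update is a simplification, and the resampling shown is plain O(N) sequential code, not the paper's O(log₂ N) parallel redistribution for variable-size particles such as DTs.

```python
import numpy as np

# Minimal, sequential sketch of a generic SMC sampler loop, for
# illustration only. It is NOT the paper's parallel algorithm: the
# decision-tree proposal, the target density, and the O(log2 N)
# parallel redistribution are placeholders or simplifications here.

def systematic_resample(weights, rng):
    """Return indices of particles to copy (the 'redistribution' step).

    In the paper this step is the parallelization bottleneck; here it
    is an ordinary O(N) sequential implementation.
    """
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    return np.searchsorted(cumulative, positions)

def smc_sampler(init_particle, log_target, propose,
                n_particles=512, n_iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    particles = [init_particle() for _ in range(n_particles)]
    log_w = np.zeros(n_particles)
    for _ in range(n_iterations):
        # Propose a move for each particle (in the Bayesian-DT setting
        # this would, e.g., grow/prune/modify a tree node) and update
        # the log-weight by the change in log target density. This is
        # a simplification of a full SMC weight update.
        for i, p in enumerate(particles):
            new_p = propose(p, rng)
            log_w[i] += log_target(new_p) - log_target(p)
            particles[i] = new_p
        # Normalize weights and resample when the effective sample
        # size drops below half the population.
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < n_particles / 2:
            idx = systematic_resample(w, rng)
            particles = [particles[i] for i in idx]
            log_w[:] = 0.0
    return particles
```

The point of comparison is the resampling step: it is straightforwardly O(N) sequentially, and the paper's contribution is to perform the equivalent redistribution in asymptotically optimal O(log₂ N) parallel time on both SM and DM, even when each particle is a variable-size decision tree.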

Bibliographic Details
Main Authors: Efthyvoulos Drousiotis, Alessandro Varsi, Alexander M. Phillips, Simon Maskell (Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool L69 3GJ, UK); Paul G. Spirakis (Department of Computer Science, University of Liverpool, Liverpool L69 3BX, UK)
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Algorithms
ISSN: 1999-4893
DOI: 10.3390/a18010014
Subjects: parallel algorithms; machine learning; Bayesian decision trees; sequential Monte Carlo samplers; Markov Chain Monte Carlo; shared memory
Online Access: https://www.mdpi.com/1999-4893/18/1/14