PCMINN: A GPU-Accelerated Conditional Mutual Information-Based Feature Selection Method

In feature selection, it is crucial to identify features that are not only relevant to the target variable but also non-redundant. Conditional Mutual Information Nearest-Neighbor (CMINN) is an algorithm developed to address this challenge by using Conditional Mutual Information (CMI) to assess the r...

Full description

Saved in:
Bibliographic Details
Main Authors: Nikolaos Papaioannou, Georgios Myllis, Alkiviadis Tsimpiris, Stamatis Aggelopoulos, Vasiliki Vrana
Format: Article
Language:English
Published: MDPI AG 2025-05-01
Series:Information
Subjects:
Online Access:https://www.mdpi.com/2078-2489/16/6/445
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:In feature selection, it is crucial to identify features that are not only relevant to the target variable but also non-redundant. Conditional Mutual Information Nearest-Neighbor (CMINN) is an algorithm developed to address this challenge by using Conditional Mutual Information (CMI) to assess the relevance of individual features to the target variable, while identifying redundancy among similar features. Although effective, the original CMINN algorithm can be computationally intensive, particularly with large and high-dimensional datasets. In this study, we extend the CMINN algorithm by parallelizing it for execution on Graphics Processing Units (GPUs), significantly enhancing its efficiency and scalability for high-dimensional datasets. The parallelized CMINN (PCMINN) leverages the massive parallelism of modern GPUs to handle the computational complexity inherent in sequential feature selection, particularly when dealing with large-scale data. To evaluate the performance of PCMINN across various scenarios, we conduct both an extensive simulation study using datasets with combined feature effects and a case study using financial data. Our results show that PCMINN not only maintains the effectiveness of the original CMINN in selecting the optimal feature subset, but also achieves faster execution times. The parallelized approach allows for the efficient processing of large datasets, making PCMINN a valuable tool for high-dimensional feature selection tasks. We also provide a package that includes two Python implementations to support integration into future research workflows: a sequential version of CMINN and a parallel GPU-based version of PCMINN.
ISSN:2078-2489