Learning From High-Cardinality Categorical Features in Deep Neural Networks

Some machine learning algorithms expect the input variables and the output variables to be numeric. Therefore, in an early stage of modelling, feature engineering is required when categorical variables are present in the dataset. As a result, we must encode those attributes into an appropriate feature v...

Bibliographic Details
Main Author: Mustafa Murat Arat
Format: Article
Language: English
Published: Çanakkale Onsekiz Mart University 2022-06-01
Series: Journal of Advanced Research in Natural and Applied Sciences
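Subjects: deep neural networks; categorical variable; high cardinality; mean target encoding; one hot encoding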
Online Access: https://dergipark.org.tr/en/download/article-file/2045221
author Mustafa Murat Arat
collection DOAJ
description Some machine learning algorithms expect the input variables and the output variables to be numeric. Therefore, in an early stage of modelling, feature engineering is required when categorical variables are present in the dataset. As a result, we must encode those attributes into an appropriate feature vector. However, categorical variables having more than 100 unique values are considered to be high-cardinality, and there is no straightforward method to handle them. Besides, the majority of the work on categorical variable encoding in the literature assumes that the set of categories is limited, known beforehand, and made up of mutually exclusive elements, independently of the data, which is not necessarily true for real-world applications. Feature engineering practice typically tackles high-cardinality issues with data-cleaning techniques, which are time-consuming and often need human intervention and domain expertise, both major costs in data science projects. The most common methods for transforming categorical variables are one-hot encoding and target encoding. To address the issue of encoding categorical variables in environments with a high cardinality, we seek a general-purpose approach for statistical analysis of categorical entries that is capable of handling a very large number of categories while avoiding computational and statistical difficulties. Our proposed approach is low-dimensional; thus, it is very efficient in processing time and memory, and it can be computed in an online learning setting. Although in this paper we opt to use it in the input layer, dictionaries are typically architecture-independent and may be moved between different architectures or layers.
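The two baseline encodings named in the description can be illustrated with a minimal sketch. This is not the approach proposed in the article; it only shows plain one-hot encoding and naive mean target encoding on hypothetical toy data (the `cities` and `clicked` lists and both function names are assumptions made for illustration). The target encoder here uses no smoothing and no held-out folds, so it would leak target information if used as-is for training.

```python
from collections import defaultdict


def one_hot_encode(values):
    """Map each value to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return vectors, categories


def mean_target_encode(values, targets):
    """Replace each category with the mean target observed for it (naive, unsmoothed)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[v] for v in values], means


# Hypothetical toy data: a categorical feature and a binary target.
cities = ["ankara", "izmir", "ankara", "bursa", "izmir", "ankara"]
clicked = [1, 0, 1, 0, 1, 0]

one_hot, cats = one_hot_encode(cities)
encoded, means = mean_target_encode(cities, clicked)

# One-hot width grows with the number of distinct categories;
# mean target encoding always yields a single number per row.
print(len(one_hot[0]), "one-hot columns for", len(cats), "categories")
print(encoded)
```

With more than 100 distinct values, the one-hot vectors in this sketch would have more than 100 columns per feature, which is the dimensionality problem the description refers to, while the target-encoded column stays one-dimensional.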
id doaj-art-454e0c11c95246d19fefb48f9e199387
institution Kabale University
issn 2757-5195
spelling doaj-art-454e0c11c95246d19fefb48f9e199387; 2025-02-05T17:58:10Z; eng; Çanakkale Onsekiz Mart University; Journal of Advanced Research in Natural and Applied Sciences; 2757-5195; 2022-06-01; vol. 8, no. 2, pp. 222-236; 10.28979/jarnas.1014469; Learning From High-Cardinality Categorical Features in Deep Neural Networks; Mustafa Murat Arat (https://orcid.org/0000-0003-3740-5135), Hacettepe University; https://dergipark.org.tr/en/download/article-file/2045221; deep neural networks, categorical variable, high cardinality, mean target encoding, one hot encoding
topic deep neural networks
categorical variable
high cardinality
mean target encoding
one hot encoding