An Empirical Configuration Study of a Common Document Clustering Pipeline

Document clustering is frequently used in applications of natural language processing, e.g. to classify news articles or to create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction wi...


Bibliographic Details
Main Authors: Anton Eklund, Mona Forsman, Frank Drewes
Format: Article
Language: English
Published: Linköping University Electronic Press 2023-09-01
Series: Northern European Journal of Language Technology
Online Access: https://nejlt.ep.liu.se/article/view/4396
author Anton Eklund
Mona Forsman
Frank Drewes
collection DOAJ
description Document clustering is frequently used in applications of natural language processing, e.g. to classify news articles or to create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no fewer than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings showed little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.
format Article
id doaj-art-1361009b7b8f46a89635390778fa2319
institution Kabale University
issn 2000-1533
language English
publishDate 2023-09-01
publisher Linköping University Electronic Press
record_format Article
series Northern European Journal of Language Technology
spelling doaj-art-1361009b7b8f46a89635390778fa2319; 2025-01-22T15:25:15Z; eng; Linköping University Electronic Press; Northern European Journal of Language Technology; 2000-1533; 2023-09-01; 9; 1; 10.3384/nejlt.2000-1533.2023.4396; An Empirical Configuration Study of a Common Document Clustering Pipeline; Anton Eklund (0), Mona Forsman (1), Frank Drewes (2); Umeå University, Adlede AB, Umeå University; https://nejlt.ep.liu.se/article/view/4396
title An Empirical Configuration Study of a Common Document Clustering Pipeline
url https://nejlt.ep.liu.se/article/view/4396