Human-interpretable clustering of short text using large language models


Bibliographic Details
Main Authors:	Justin K. Miller, Tristram J. Alexander
Format:	Article
Language:	English
Published:	The Royal Society 2025-01-01
Series:	Royal Society Open Science
Subjects:	large language models; text clustering; clustering validation
Online Access:	https://royalsocietypublishing.org/doi/10.1098/rsos.241692

collection	DOAJ
description	Clustering short text is a difficult problem, owing to the low word co-occurrence between short text documents. This work shows that large language models (LLMs) can overcome the limitations of traditional clustering approaches by generating embeddings that capture the semantic nuances of short text. In this study, clusters are found in the embedding space using Gaussian mixture modelling. The resulting clusters are found to be more distinctive and more human-interpretable than clusters produced using the popular methods of doc2vec and latent Dirichlet allocation. The success of the clustering approach is quantified using human reviewers and through the use of a generative LLM. The generative LLM shows good agreement with the human reviewers and is suggested as a means to bridge the ‘validation gap’ which often exists between cluster production and cluster interpretation. The comparison between LLM coding and human coding reveals intrinsic biases in each, challenging the conventional reliance on human coding as the definitive standard for cluster validation.
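The clustering step named in the description (Gaussian mixture modelling in an embedding space) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic vectors below merely stand in for LLM embeddings, and the function name `cluster_embeddings`, the diagonal covariance, the number of clusters and the number of restarts are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def cluster_embeddings(embeddings, n_clusters, seed=0):
    """Fit a Gaussian mixture to embedding vectors (hypothetical helper).

    Returns a hard cluster label per document and the soft membership
    probabilities per document and cluster.
    """
    gmm = GaussianMixture(
        n_components=n_clusters,
        covariance_type="diag",  # assumption: diagonal covariances
        n_init=5,                # assumption: a few restarts for stability
        random_state=seed,
    )
    labels = gmm.fit_predict(embeddings)
    probs = gmm.predict_proba(embeddings)
    return labels, probs


# Synthetic stand-in for document embeddings: three well-separated
# groups of 50 eight-dimensional vectors. Real embeddings would come
# from an LLM embedding model and would be far higher-dimensional.
rng = np.random.default_rng(0)
centres = np.array([[0.0] * 8, [5.0] * 8, [-5.0] * 8])
embeddings = np.vstack(
    [centre + 0.3 * rng.standard_normal((50, 8)) for centre in centres]
)

labels, probs = cluster_embeddings(embeddings, n_clusters=3)
```

The soft memberships in `probs` are what distinguish a mixture model from a hard partitioning method: a short text lying between two topics receives non-negligible probability for both clusters.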
id	doaj-art-d0552615897e4049975c7df78259c11e
institution	Kabale University
issn	2054-5703
spelling	doaj-art-d0552615897e4049975c7df78259c11e; 2025-01-22T00:16:49Z; eng; The Royal Society; Royal Society Open Science; ISSN 2054-5703; 2025-01-01; vol. 12, no. 1; doi:10.1098/rsos.241692; Human-interpretable clustering of short text using large language models; Justin K. Miller (School of Physics, The University of Sydney, Sydney, Australia); Tristram J. Alexander (School of Physics, The University of Sydney, Sydney, Australia); https://royalsocietypublishing.org/doi/10.1098/rsos.241692; large language models; text clustering; clustering validation

