Explainable Identification of Similarities Between Entities for Discovery in Large Text

With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide pre...

Bibliographic Details
Main Authors: Akhil Joshi, Sai Teja Erukude, Lior Shamir
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Future Internet
Subjects: text analysis; text similarity; text content retrieval; explainable AI
Online Access: https://www.mdpi.com/1999-5903/17/4/135
author Akhil Joshi
Sai Teja Erukude
Lior Shamir
collection DOAJ
description With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases, they determine the similarity between documents as reflected by the text, rather than the similarities between the subjects discussed in these documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula assigns each n-gram a weight, which is higher when the n-gram is more frequent in both documents and is penalized when the n-gram is more frequent in the English language. Visualization tools such as word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, and historical texts. Code for the method is publicly available.
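The weighting idea in the description can be sketched in a few lines of Python. This is only an illustration: the record does not give the scoring formula, so the weight used here (product of the n-gram's relative frequencies in the two documents, divided by its background frequency in English) is an assumption, as are the function names `ngrams` and `shared_ngram_scores`, the `smoothing` constant, and the made-up `background` frequencies.

```python
from collections import Counter
import re


def ngrams(text, n):
    """Return a Counter of word n-grams (lowercased, punctuation stripped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def shared_ngram_scores(doc_a, doc_b, background, n=2, smoothing=1e-6):
    """Rank n-grams that occur in both documents.

    Assumed illustrative weight (not the paper's formula):
        rel_freq_a * rel_freq_b / (background_rel_freq + smoothing)
    so the weight grows with frequency in both documents and shrinks
    for n-grams that are common in general English.
    """
    counts_a, counts_b = ngrams(doc_a, n), ngrams(doc_b, n)
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    scores = {}
    for gram in counts_a.keys() & counts_b.keys():
        rel_a = counts_a[gram] / total_a
        rel_b = counts_b[gram] / total_b
        scores[gram] = rel_a * rel_b / (background.get(gram, 0.0) + smoothing)
    # Highest-weighted shared n-grams first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical relative frequencies of bigrams in general English text.
background = {("of", "the"): 0.005, ("the", "telescope"): 0.0001}

doc_a = "She studied radio astronomy and the history of the telescope."
doc_b = "He taught radio astronomy after the invention of the telescope."

ranked = shared_ngram_scores(doc_a, doc_b, background)
for gram, score in ranked:
    print(" ".join(gram), round(score, 2))
```

In this toy input the shared bigram "radio astronomy" ranks first because it is absent from the background table, while "of the" ranks last because it is common in English. The ranked list is also what a word cloud would be built from, and because nothing is trained or sampled, the same inputs always give the same ranking.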
format Article
id doaj-art-03ba0e1124514bfbb6a8d7a4e1e70e3e
institution OA Journals
issn 1999-5903
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Future Internet
spelling doaj-art-03ba0e1124514bfbb6a8d7a4e1e70e3e | 2025-08-20T02:18:11Z | eng | MDPI AG | Future Internet | 1999-5903 | 2025-03-01 | vol. 17, no. 4, art. 135 | doi:10.3390/fi17040135 | Explainable Identification of Similarities Between Entities for Discovery in Large Text | Akhil Joshi, Sai Teja Erukude, Lior Shamir (all: Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA) | https://www.mdpi.com/1999-5903/17/4/135 | text analysis; text similarity; text content retrieval; explainable AI
title Explainable Identification of Similarities Between Entities for Discovery in Large Text
topic text analysis
text similarity
text content retrieval
explainable AI
url https://www.mdpi.com/1999-5903/17/4/135