Explainable Identification of Similarities Between Entities for Discovery in Large Text

With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide pre...

Bibliographic Details
Main Authors:	Akhil Joshi, Sai Teja Erukude, Lior Shamir
Format:	Article
Language:	English
Published:	MDPI AG 2025-03-01
Series:	Future Internet
Subjects:	text analysis; text similarity; text content retrieval; explainable AI
Online Access:	https://www.mdpi.com/1999-5903/17/4/135

_version_	1850180359582384128
author	Akhil Joshi Sai Teja Erukude Lior Shamir
author_facet	Akhil Joshi Sai Teja Erukude Lior Shamir
author_sort	Akhil Joshi
collection	DOAJ
description	With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases, they determine the similarity between documents as reflected by the text, rather than the similarities between the subjects being discussed in these documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula assigns each n-gram a weight that is higher when the n-gram is more frequent in both documents and is penalized when the n-gram is more frequent in the English language. Visualization tools like word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, historical texts, and more. Code for the method is publicly available.
format	Article
id	doaj-art-03ba0e1124514bfbb6a8d7a4e1e70e3e
institution	OA Journals
issn	1999-5903
language	English
publishDate	2025-03-01
publisher	MDPI AG
record_format	Article
series	Future Internet
spelling	doaj-art-03ba0e1124514bfbb6a8d7a4e1e70e3e | 2025-08-20T02:18:11Z | eng | MDPI AG | Future Internet | 1999-5903 | 2025-03-01 | 17 | 4 | 135 | 10.3390/fi17040135 | Explainable Identification of Similarities Between Entities for Discovery in Large Text | Akhil Joshi (0), Sai Teja Erukude (1), Lior Shamir (2) | Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA | https://www.mdpi.com/1999-5903/17/4/135 | text analysis; text similarity; text content retrieval; explainable AI
spellingShingle	Akhil Joshi Sai Teja Erukude Lior Shamir Explainable Identification of Similarities Between Entities for Discovery in Large Text Future Internet text analysis text similarity text content retrieval explainable AI
title	Explainable Identification of Similarities Between Entities for Discovery in Large Text
title_full	Explainable Identification of Similarities Between Entities for Discovery in Large Text
title_fullStr	Explainable Identification of Similarities Between Entities for Discovery in Large Text
title_full_unstemmed	Explainable Identification of Similarities Between Entities for Discovery in Large Text
title_short	Explainable Identification of Similarities Between Entities for Discovery in Large Text
title_sort	explainable identification of similarities between entities for discovery in large text
topic	text analysis; text similarity; text content retrieval; explainable AI
url	https://www.mdpi.com/1999-5903/17/4/135
work_keys_str_mv	AT akhiljoshi explainableidentificationofsimilaritiesbetweenentitiesfordiscoveryinlargetext AT saitejaerukude explainableidentificationofsimilaritiesbetweenentitiesfordiscoveryinlargetext AT liorshamir explainableidentificationofsimilaritiesbetweenentitiesfordiscoveryinlargetext
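
The description above says each n-gram shared by two documents receives a weight that rises with its frequency in both documents and falls with its frequency in general English. This record does not give the actual formula, so the following is only a minimal sketch under stated assumptions: the function names, the `min(...) / background` form of the score, the `floor` parameter, and the toy background frequencies are all hypothetical illustrations and are not taken from the article or its published code.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the lowercase word n-grams of `text` as tuples."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_freq(text, n):
    """Map each n-gram to its share of all n-grams in `text`."""
    counts = Counter(ngrams(text, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def shared_ngram_scores(doc_a, doc_b, background, n=1, floor=1e-6):
    """Rank n-grams that occur in both documents.

    Assumed scoring (hypothetical, not the article's formula):
        score = min(freq_a, freq_b) / max(background_freq, floor)
    so an n-gram must be frequent in BOTH documents to score well,
    and n-grams that are common in general English are penalized.
    """
    fa = relative_freq(doc_a, n)
    fb = relative_freq(doc_b, n)
    scores = {}
    for g in fa.keys() & fb.keys():
        bg = max(background.get(g, 0.0), floor)
        scores[g] = min(fa[g], fb[g]) / bg
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage with made-up background frequencies (assumptions, not real data).
background = {("the",): 0.05, ("violin",): 0.00001}
ranked = shared_ngram_scores(
    "the physicist studied the violin",
    "the painter played the violin",
    background,
)
for gram, score in ranked:
    print(" ".join(gram), round(score, 2))
```

In this toy run both "the" and "violin" are shared, but "violin" ranks first because its assumed background frequency is far lower. The ranked list is also what makes the result explainable: each score traces back to three inspectable frequencies, and the same list could feed a word cloud.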
