Explainable Identification of Similarities Between Entities for Discovery in Large Text
With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide pre...
| Main Authors: | Akhil Joshi, Sai Teja Erukude, Lior Shamir |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Future Internet |
| Subjects: | text analysis, text similarity, text content retrieval, explainable AI |
| Online Access: | https://www.mdpi.com/1999-5903/17/4/135 |
| _version_ | 1850180359582384128 |
|---|---|
| author | Akhil Joshi; Sai Teja Erukude; Lior Shamir |
| author_facet | Akhil Joshi; Sai Teja Erukude; Lior Shamir |
| author_sort | Akhil Joshi |
| collection | DOAJ |
| description | With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases, they determine the similarity between documents as reflected by the text, rather than the similarities between the subjects discussed in these documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula assigns each n-gram a weight, which is higher when the n-gram is more frequent in both documents and is penalized when the n-gram is more frequent in the English language. Visualization tools such as word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, historical texts, and more. Code for the method is publicly available. |
| format | Article |
| id | doaj-art-03ba0e1124514bfbb6a8d7a4e1e70e3e |
| institution | OA Journals |
| issn | 1999-5903 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Future Internet |
| spelling | doaj-art-03ba0e1124514bfbb6a8d7a4e1e70e3e; 2025-08-20T02:18:11Z; eng; MDPI AG; Future Internet; ISSN 1999-5903; 2025-03-01; vol. 17, no. 4, art. 135; doi:10.3390/fi17040135; Explainable Identification of Similarities Between Entities for Discovery in Large Text; Akhil Joshi, Sai Teja Erukude, Lior Shamir (all: Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA); https://www.mdpi.com/1999-5903/17/4/135; text analysis, text similarity, text content retrieval, explainable AI |
| title | Explainable Identification of Similarities Between Entities for Discovery in Large Text |
| topic | text analysis; text similarity; text content retrieval; explainable AI |
| url | https://www.mdpi.com/1999-5903/17/4/135 |
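The description in this record says that each n-gram shared by two documents receives a weight that grows with its frequency in both documents and shrinks with its frequency in general English. The record does not give the formula itself, so the following is only a minimal sketch of that idea, not the authors' method. The ratio used here (product of the two document frequencies divided by a background frequency), the function names, the `floor` parameter, and the tiny `background` table are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Split text into lowercase word tokens and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_freq(text, n):
    """Map each n-gram to its share of all n-grams in the text."""
    grams = ngrams(text, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}


def shared_ngram_scores(doc_a, doc_b, background, n=2, floor=1e-6):
    """Score n-grams that occur in both documents.

    Assumed illustrative formula (not taken from the paper):
        score = freq_a * freq_b / max(background_freq, floor)
    `background` is a hypothetical table of general-English n-gram
    frequencies; `floor` avoids division by zero for unseen n-grams.
    """
    fa = relative_freq(doc_a, n)
    fb = relative_freq(doc_b, n)
    scores = {
        g: fa[g] * fb[g] / max(background.get(g, 0.0), floor)
        for g in fa.keys() & fb.keys()
    }
    # Highest-scoring (most distinctive shared) n-grams first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy inputs; the background frequencies are made up for the example.
doc_a = "he studied radio astronomy in the city of the north"
doc_b = "she taught radio astronomy in a school of the arts"
background = {"of the": 5e-3, "astronomy in": 1e-5, "radio astronomy": 1e-6}

ranked = shared_ngram_scores(doc_a, doc_b, background)
for gram, score in ranked:
    print(f"{gram}: {score:.2f}")
```

In this toy run the bigram "of the" is shared by both texts but ranks last because it is common in English, while "radio astronomy" ranks first. Each score can be traced back to three frequencies, which is the kind of explainability the description refers to.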