Set-Word Embeddings and Semantic Indices: A New Contextual Model for Empirical Language Analysis

We present a new word embedding technique in a (non-linear) metric space based on the shared membership of terms in a corpus of textual documents, where the metric is naturally defined by the Boolean algebra of all subsets of the corpus and a measure μ...

Bibliographic Details
Main Authors:	Pedro Fernández de Córdoba, Carlos A. Reyes Pérez, Claudia Sánchez Arnau, Enrique A. Sánchez Pérez
Format:	Article
Language:	English
Published:	MDPI AG 2025-01-01
Series:	Computers
Subjects:	word embedding, semantic projection, set metric, Lipschitz function, semantic index
Online Access:	https://www.mdpi.com/2073-431X/14/1/30

_version_	1832588774542934016
author	Pedro Fernández de Córdoba, Carlos A. Reyes Pérez, Claudia Sánchez Arnau, Enrique A. Sánchez Pérez
author_sort	Pedro Fernández de Córdoba
collection	DOAJ
description	We present a new word embedding technique in a (non-linear) metric space based on the shared membership of terms in a corpus of textual documents, where the metric is naturally defined by the Boolean algebra of all subsets of the corpus and a measure μ defined on it. Once the metric space is constructed, a new term (a noun, an adjective, a classification term) can be introduced into the model and analyzed by means of semantic projections, which in turn are defined as indexes using the measure μ and the word embedding tools. We formally define all necessary elements and prove the main results about the model, including a compatibility theorem for estimating the representability of semantically meaningful external terms in the model (which are written as real Lipschitz functions in the metric space), proving the relation between the semantic index and the metric of the space (Theorem 1). Our main result proves the universality of our word-set embedding, proving mathematically that every word embedding based on linear space can be written as a word-set embedding (Theorem 2). Since we adopt an empirical point of view for the semantic issues, we also provide the keys for the interpretation of the results using probabilistic arguments (to facilitate the subsequent integration of the model into Bayesian frameworks for the construction of inductive tools), as well as in fuzzy set-theoretic terms. We also show some illustrative examples, including a complete computational case using big-data-based computations. Thus, the main advantages of the proposed model are that the results on distances between terms are interpretable in semantic terms once the semantic index used is fixed and, although the calculations could be costly, it is possible to calculate the value of the distance between two terms without the need to calculate the whole distance matrix. “Wovon man nicht sprechen kann, darüber muss man schweigen”. Tractatus Logico-Philosophicus. L. Wittgenstein.
format	Article
id	doaj-art-70ac52e1455144bab2a35368df9f3fef
institution	Kabale University
issn	2073-431X
language	English
publishDate	2025-01-01
publisher	MDPI AG
record_format	Article
series	Computers
spelling	doaj-art-70ac52e1455144bab2a35368df9f3fef | 2025-01-24T13:27:55Z | eng | MDPI AG | Computers | 2073-431X | 2025-01-01 | 14(1):30 | 10.3390/computers14010030 | Set-Word Embeddings and Semantic Indices: A New Contextual Model for Empirical Language Analysis | Pedro Fernández de Córdoba (Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, 46022 València, Spain); Carlos A. Reyes Pérez (Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, 46022 València, Spain); Claudia Sánchez Arnau (E.T.S. Ingeniería, Universitat de València, 46100 València, Spain); Enrique A. Sánchez Pérez (Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, 46022 València, Spain) | https://www.mdpi.com/2073-431X/14/1/30 | word embedding, semantic projection, set metric, Lipschitz function, semantic index
title	Set-Word Embeddings and Semantic Indices: A New Contextual Model for Empirical Language Analysis
title_sort	set word embeddings and semantic indices a new contextual model for empirical language analysis
topic	word embedding, semantic projection, set metric, Lipschitz function, semantic index
url	https://www.mdpi.com/2073-431X/14/1/30
work_keys_str_mv	AT pedrofernandezdecordoba setwordembeddingsandsemanticindicesanewcontextualmodelforempiricallanguageanalysis AT carlosareyesperez setwordembeddingsandsemanticindicesanewcontextualmodelforempiricallanguageanalysis AT claudiasanchezarnau setwordembeddingsandsemanticindicesanewcontextualmodelforempiricallanguageanalysis AT enriqueasanchezperez setwordembeddingsandsemanticindicesanewcontextualmodelforempiricallanguageanalysis
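
The abstract in this record describes embedding each term as a set of corpus documents and measuring distances with a measure μ on subsets of the corpus. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes the counting measure for μ, assumes the distance between two terms is the measure of the symmetric difference of their document sets, and uses a hypothetical overlap ratio as a stand-in for a semantic index. The toy corpus and all function names are illustrative only.

```python
from typing import Dict, Optional, Set


def term_sets(corpus: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of document ids in which it occurs."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in corpus.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def mu(docs: Set[str], weights: Optional[Dict[str, float]] = None) -> float:
    """Measure of a set of documents (assumption: counting measure by default)."""
    if weights is None:
        return float(len(docs))
    return float(sum(weights[d] for d in docs))


def distance(a: Set[str], b: Set[str],
             weights: Optional[Dict[str, float]] = None) -> float:
    """Assumed set metric: measure of the symmetric difference of a and b."""
    return mu(a ^ b, weights)


def semantic_index(a: Set[str], b: Set[str],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Hypothetical index: fraction of the measure of a that lies inside b."""
    total = mu(a, weights)
    return mu(a & b, weights) / total if total > 0 else 0.0


# Toy corpus (illustrative data, not taken from the article).
corpus = {
    "d1": "metric space embedding",
    "d2": "metric measure",
    "d3": "embedding semantic index",
    "d4": "semantic metric",
}
index = term_sets(corpus)

# One pair of terms is compared directly; no full distance matrix is built.
d_metric_embedding = distance(index["metric"], index["embedding"])
share = semantic_index(index["embedding"], index["metric"])
print(d_metric_embedding, share)
```

Because each distance depends only on the two document sets involved, a single pair can be evaluated in isolation, which matches the abstract's remark that the whole distance matrix is not needed. Passing a `weights` dictionary replaces the counting measure with a weighted one.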

Similar Items