Evaluating text representations for unsupervised legal semantic textual similarity in Brazilian Portuguese
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer 2025-06-01 |
| Series: | Discover Data |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s44248-025-00052-4 |
| Summary: | Legal domain experts must deal with large amounts of legal data in textual form, such as lawsuits and laws. A common task for these experts, which can be tedious and error-prone, is identifying similarities between a specific lawsuit and previous ones that have already been decided. Several state-of-the-art computational methods have recently been used for text representation in similarity-based legal document retrieval. However, most studies focus on texts written in English, and their results are not directly applicable to other languages. We are specifically concerned with texts in Brazilian Portuguese (PT-BR). This article therefore evaluates 16 methods for calculating similarity to address Semantic Textual Similarity in Brazilian Portuguese legal data. Results show that each text representation may return a different set of documents for the same query (anchor) document. The results also show that less sophisticated text representations, such as TF-IDF and the BM25 scoring function, still produce relevant results. A further investigation shows that text characteristics directly affect the performance of Transformer-based models with different attention mechanisms. In addition, an analysis comparing the impact of fine-tuning BERT on legal domain data with that of changing the attention mechanism shows that the latter preserves the original BERT vector space more than the former. Finally, an experiment comparing heuristic labeling, labeling through text representation similarity, and expert labeling indicates that Sentence-BERT trained on general domain data, with a Pearson correlation of 50% with the expert labeling, is a good text representation for the Legal Semantic Textual Similarity task. |
| ISSN: | 2731-6955 |
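
The summary describes ranking documents by similarity to an anchor document and comparing similarity scores with expert labels through Pearson correlation. The following is a minimal sketch of that general idea using plain TF-IDF and cosine similarity. It is not the article's code: the toy corpus, the whitespace tokenization, the smoothed IDF formula, and the `expert_labels` values are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch): keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical toy corpus; the first document plays the role of the anchor.
corpus = [
    "recurso de apelação sobre contrato de locação".split(),
    "apelação em ação de despejo e contrato de locação".split(),
    "habeas corpus em processo penal".split(),
    "ação de cobrança de aluguel em contrato de locação".split(),
]

vectors = tfidf_vectors(corpus)
anchor, candidates = vectors[0], vectors[1:]

# Similarity of each candidate to the anchor, then candidates ranked best first.
scores = [cosine(anchor, c) for c in candidates]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical expert judgments (1.0 = similar, 0.0 = not similar), one per candidate.
expert_labels = [1.0, 0.0, 1.0]
agreement = pearson(scores, expert_labels)

print("scores:", scores)
print("ranking:", ranking)
print("Pearson correlation with expert labels:", agreement)
```

Swapping `tfidf_vectors` for another representation (for example, sentence embeddings) while keeping `cosine` and `pearson` fixed is one way to compare representations against the same expert labels.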