TurkMedNLI: a Turkish medical natural language inference dataset through large language model based translation

Natural language inference (NLI) is a subfield of natural language processing (NLP) that aims to identify the contextual relationship between premise and hypothesis sentences. While high-resource languages like English benefit from robust and rich NLI datasets, creating similar datasets for low-reso...

Bibliographic Details
Main Authors:	İskender Ülgen Oğul, Fatih Soygazi, Belgin Ergenç Bostanoğlu
Format:	Article
Language:	English
Published:	PeerJ Inc. 2025-01-01
Series:	PeerJ Computer Science
Subjects:	MedNLI NLLB BERT Natural language inference Natural language processing Language translation
Online Access:	https://peerj.com/articles/cs-2662.pdf

collection	DOAJ
description	Natural language inference (NLI) is a subfield of natural language processing (NLP) that aims to identify the contextual relationship between premise and hypothesis sentences. While high-resource languages like English benefit from robust and rich NLI datasets, creating similar datasets for low-resource languages is challenging due to the cost and complexity of manual annotation. Although translation of existing datasets offers a practical solution, direct translation of domain-specific datasets presents unique challenges, particularly in handling abbreviations, metric conversions, and cultural alignment. This study introduces a pipeline for translating a medical NLI dataset into Turkish, a low-resource language. Our approach fine-tunes the Llama-3.1 model on selected samples from the Medical Abbreviation dataset (MeDAL) to extract and resolve medical abbreviations. The NLI pairs are then refined with the resolved abbreviations and subjected to metric correction. Finally, the processed sentences are translated using Facebook’s No Language Left Behind (NLLB) translation model. To ensure quality, we conducted comprehensive evaluations using both machine learning models and medical expert review. Our results show that BERTurk achieved 75.17% accuracy on TurkMedNLI test data and 76.30% on the normalized test set, while BioBERTurk achieved 75.59% accuracy on test data and 72.29% on the normalized dataset. Medical experts further validated the translations through manual assessment of sampled sentences. This work demonstrates the effectiveness of large language models in adapting domain-specific datasets for low-resource languages, establishing a foundation for future research in multilingual biomedical NLP.
id	doaj-art-95b23af7b36e4e9a83462dc6168535ee
institution	Kabale University
issn	2376-5992
spelling	doaj-art-95b23af7b36e4e9a83462dc6168535ee; 2025-02-01T15:05:05Z; eng; PeerJ Inc.; PeerJ Computer Science; ISSN 2376-5992; 2025-01-01; vol. 11; e2662; doi:10.7717/peerj-cs.2662; TurkMedNLI: a Turkish medical natural language inference dataset through large language model based translation; İskender Ülgen Oğul (Computer Engineering, Izmir Institute of Technology, İzmir, Turkey); Fatih Soygazi (Computer Engineering, Adnan Menderes University, Aydın, Turkey); Belgin Ergenç Bostanoğlu (Computer Engineering, Izmir Institute of Technology, İzmir, Turkey); https://peerj.com/articles/cs-2662.pdf

Similar Items