Multifaceted Natural Language Processing Task–Based Evaluation of Bidirectional Encoder Representations From Transformers Models for Bilingual (Korean and English) Clinical Notes: Algorithm Development and Validation

Bibliographic Details
Main Authors: Kyungmo Kim, Seongkeun Park, Jeongwon Min, Sumin Park, Ju Yeon Kim, Jinsu Eun, Kyuha Jung, Yoobin Elyson Park, Esther Kim, Eun Young Lee, Joonhwan Lee, Jinwook Choi
Format: Article
Language:English
Published: JMIR Publications 2024-10-01
Series:JMIR Medical Informatics
Online Access:https://medinform.jmir.org/2024/1/e52897
collection DOAJ
description Abstract
Background: The bidirectional encoder representations from transformers (BERT) model has attracted considerable attention in clinical applications such as patient classification and disease prediction. However, studies have typically progressed to application development without a thorough assessment of the model's comprehension of clinical context. Furthermore, few comparative studies have examined BERT models on medical documents from non-English-speaking countries, so the applicability of BERT models trained on English clinical notes to non-English contexts has yet to be confirmed. To address these gaps in the literature, this study focused on identifying the most effective BERT model for non-English clinical notes.
Objective: In this study, we evaluated the contextual understanding abilities of various BERT models applied to mixed Korean and English clinical notes. The objective was to identify the BERT model that best understands the context of such documents.
Methods: Using data from 164,460 patients in a South Korean tertiary hospital, we pretrained BERT-base, BERT for Biomedical Text Mining (BioBERT), Korean BERT (KoBERT), and Multilingual BERT (M-BERT) to improve their contextual comprehension and subsequently compared their performance on 7 fine-tuning tasks.
Results: Model performance varied with the task and the token used. First, BERT-base and BioBERT excelled in tasks using classification ([CLS]) token embeddings, such as document classification; BioBERT achieved the highest F1-score.
Conclusions: This study highlighted the effectiveness of various BERT models in a multilingual clinical domain. The findings can serve as a reference for clinical and language-based applications.
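The abstract distinguishes tasks that use the embedding of the special [CLS] token (e.g., document classification) from tasks that use per-token embeddings. A minimal sketch of the two common pooling strategies over a BERT encoder's output is shown below; this is illustrative plain Python (the shapes, function names, and toy values are assumptions, not the authors' code):

```python
def cls_pooling(hidden_states):
    """Return the [CLS] embedding: the hidden state at position 0.

    hidden_states: list of per-token vectors (seq_len x hidden_dim),
    where position 0 corresponds to the prepended [CLS] token.
    """
    return hidden_states[0]

def mean_pooling(hidden_states, attention_mask):
    """Average token embeddings, ignoring padded positions (mask == 0)."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m]
    n = len(kept)
    return [sum(col) / n for col in zip(*kept)]

# Toy sequence: [CLS], two word tokens, one padding token (hidden size 3).
h = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0],
     [9.0, 9.0, 9.0]]   # padding row, excluded by the mask
mask = [1, 1, 1, 0]

print(cls_pooling(h))        # [1.0, 0.0, 0.0]
print(mean_pooling(h, mask)) # average of the first three rows
```

For classification fine-tuning, the [CLS] vector is typically fed to a small linear layer; mean pooling is a common alternative when a whole-sequence representation is needed without relying on the [CLS] position.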
issn 2291-9694
doi 10.2196/52897
author_orcids Kyungmo Kim (0000-0002-8974-5302), Seongkeun Park (0000-0002-4868-9404), Jeongwon Min (0000-0001-8412-5545), Sumin Park (0000-0002-9917-2579), Ju Yeon Kim (0000-0001-8982-6869), Jinsu Eun (0000-0003-3051-7193), Kyuha Jung (0000-0002-5442-391X), Yoobin Elyson Park (0000-0002-3844-1333), Esther Kim (0000-0002-9576-4411), Eun Young Lee (0000-0001-6975-8627), Joonhwan Lee (0000-0002-3115-4024), Jinwook Choi (0000-0002-9424-9944)