T-LLaMA: a Tibetan large language model based on LLaMA2

Abstract: The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts such as Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the lack of corpora, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from Meta AI with additional Tibetan tokens using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both the news text generation and text summarization tasks. To our knowledge, T-LLaMA is the first large-scale language model for Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also provides foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.

Bibliographic Details
Main Authors: Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen
Format: Article
Language: English
Published: Springer 2024-12-01
Series: Complex & Intelligent Systems
Subjects: Large language model; Tibetan; Low-resource languages; Text classification
Online Access: https://doi.org/10.1007/s40747-024-01641-7
author Hui Lv
Chi Pu
La Duo
Yan Li
Qingguo Zhou
Jun Shen
collection DOAJ
description Abstract: The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts such as Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the lack of corpora, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from Meta AI with additional Tibetan tokens using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both the news text generation and text summarization tasks. To our knowledge, T-LLaMA is the first large-scale language model for Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also provides foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.
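The vocabulary expansion described in the abstract (learning Tibetan subword pieces with SentencePiece and merging them into the LLaMA2 tokenizer) is only named in this record, not shown. The sketch below illustrates one common way such a merge is done, in the style of Chinese-LLaMA-type tokenizer merges; the corpus path, vocabulary size, and checkpoint name are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of SentencePiece-based vocabulary expansion for LLaMA2.
# Assumptions: "tibetan_corpus.txt", vocab_size=20000, and the
# "meta-llama/Llama-2-7b-hf" checkpoint are placeholders, not the paper's settings.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a Tibetan SentencePiece (BPE) model on the raw corpus.
spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",
    model_prefix="tibetan_sp",
    vocab_size=20000,
    character_coverage=0.9995,
    model_type="bpe",
)

# 2. Load the original LLaMA2 tokenizer and the newly trained Tibetan pieces.
llama_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tok.sp_model.serialized_model_proto())

tib_proto = sp_pb2.ModelProto()
with open("tibetan_sp.model", "rb") as f:
    tib_proto.ParseFromString(f.read())

# 3. Append Tibetan pieces that the LLaMA2 vocabulary does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in tib_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# 4. Save the merged tokenizer model. Before continued pre-training, the model's
#    embedding matrix must be resized to the new vocabulary size, e.g. with
#    model.resize_token_embeddings(len(merged_tokenizer)).
with open("merged_llama_tibetan.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
print("merged vocabulary size:", len(llama_proto.pieces))
```

Merging at the SentencePiece model level keeps the original LLaMA2 token ids unchanged, so the pretrained embeddings stay valid and only the newly appended Tibetan rows need to be initialized and trained.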
format Article
id doaj-art-2693417246c046dd9201b988300ece81
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2024-12-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
author_affiliations Hui Lv: School of Information Science and Engineering, Lanzhou University
Chi Pu: School of Information Science and Engineering, Lanzhou University
La Duo: The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University
Yan Li: School of Information Science and Engineering, Lanzhou University
Qingguo Zhou: School of Information Science and Engineering, Lanzhou University
Jun Shen: School of Computing and Information Technology, University of Wollongong
title T-LLaMA: a Tibetan large language model based on LLaMA2
topic Large language model
Tibetan
Low-resource languages
Text classification
url https://doi.org/10.1007/s40747-024-01641-7