T-LLaMA: a Tibetan large language model based on LLaMA2
Abstract: The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts like Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology for three downstream tasks: text classification, news text generation and automatic text summarization. To address the lack of corpus, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from META AI by expanding the Tibetan vocabulary using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both news text generation and text summarization tasks. To our knowledge, T-LLaMA stands as the first large-scale language model in Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also serves as foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.
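The vocabulary-expansion step mentioned in the abstract can be illustrated with a short sketch. This is not the authors' released code: the corpus file, tokenizer paths, vocabulary size, and training settings below are illustrative assumptions, showing one common way to merge newly learned Tibetan SentencePiece pieces into the LLaMA2 tokenizer vocabulary.

```python
# Hedged sketch of SentencePiece-based vocabulary expansion for LLaMA2.
# File names and hyperparameters are assumptions, not values from the paper.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# 1. Train a Tibetan SentencePiece model on raw Tibetan text.
spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",      # hypothetical corpus file
    model_prefix="tibetan_sp",
    vocab_size=20000,                # illustrative size
    character_coverage=0.9995,
    model_type="bpe",
)

# 2. Load the base LLaMA2 tokenizer model and the new Tibetan model.
base = sp_pb2.ModelProto()
base.ParseFromString(open("llama2_tokenizer.model", "rb").read())
tibetan = sp_pb2.ModelProto()
tibetan.ParseFromString(open("tibetan_sp.model", "rb").read())

# 3. Append Tibetan pieces that the base vocabulary does not already contain.
existing = {p.piece for p in base.pieces}
for p in tibetan.pieces:
    if p.piece not in existing:
        base.pieces.append(sp_pb2.ModelProto.SentencePiece(piece=p.piece, score=0.0))

# 4. Save the merged tokenizer model for continued pre-training.
with open("merged_tokenizer.model", "wb") as f:
    f.write(base.SerializeToString())
```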
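The abstract also points to the released checkpoint at https://huggingface.co/Pagewood/T-LLaMA. Below is a minimal loading sketch, assuming the repository is published in the standard Hugging Face Transformers format; the prompt and generation settings are placeholders, not values from the paper.

```python
# Minimal usage sketch (assumes Pagewood/T-LLaMA is in standard
# Transformers format; prompt and generation settings are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Pagewood/T-LLaMA"  # repository linked in this record
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "..."  # replace with Tibetan input text, e.g. a news headline
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```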
Main Authors: Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen
Format: Article
Language: English
Published: Springer, 2024-12-01
Series: Complex & Intelligent Systems
Subjects: Large language model; Tibetan; Low-resource languages; Text classification
Online Access: https://doi.org/10.1007/s40747-024-01641-7
_version_ | 1832571151216279552 |
author | Hui Lv; Chi Pu; La Duo; Yan Li; Qingguo Zhou; Jun Shen
author_facet | Hui Lv; Chi Pu; La Duo; Yan Li; Qingguo Zhou; Jun Shen
author_sort | Hui Lv |
collection | DOAJ |
description | Abstract The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts like Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology for three downstream tasks: text classification, news text generation and automatic text summarization. To address the lack of corpus, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from META AI by expanding the Tibetan vocabulary using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on a publicly available dataset Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both news text generation and text summarization tasks. To our knowledge, T-LLaMA stands as the first large-scale language model in Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also serves as foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA . |
format | Article |
id | doaj-art-2693417246c046dd9201b988300ece81 |
institution | Kabale University |
issn | 2199-4536; 2198-6053
language | English |
publishDate | 2024-12-01 |
publisher | Springer |
record_format | Article |
series | Complex & Intelligent Systems |
spelling | doaj-art-2693417246c046dd9201b988300ece81; 2025-02-02T12:48:57Z; eng; Springer; Complex & Intelligent Systems; 2199-4536; 2198-6053; 2024-12-01; 111111; 10.1007/s40747-024-01641-7; T-LLaMA: a Tibetan large language model based on LLaMA2; Hui Lv (School of Information Science and Engineering, Lanzhou University); Chi Pu (School of Information Science and Engineering, Lanzhou University); La Duo (The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University); Yan Li (School of Information Science and Engineering, Lanzhou University); Qingguo Zhou (School of Information Science and Engineering, Lanzhou University); Jun Shen (School of Computing and Information Technology, University of Wollongong); [abstract as in the description field above]; https://doi.org/10.1007/s40747-024-01641-7; Large language model; Tibetan; Low-resource languages; Text classification |
spellingShingle | Hui Lv; Chi Pu; La Duo; Yan Li; Qingguo Zhou; Jun Shen; T-LLaMA: a Tibetan large language model based on LLaMA2; Complex & Intelligent Systems; Large language model; Tibetan; Low-resource languages; Text classification
title | T-LLaMA: a Tibetan large language model based on LLaMA2 |
title_full | T-LLaMA: a Tibetan large language model based on LLaMA2 |
title_fullStr | T-LLaMA: a Tibetan large language model based on LLaMA2 |
title_full_unstemmed | T-LLaMA: a Tibetan large language model based on LLaMA2 |
title_short | T-LLaMA: a Tibetan large language model based on LLaMA2 |
title_sort | t llama a tibetan large language model based on llama2 |
topic | Large language model; Tibetan; Low-resource languages; Text classification
url | https://doi.org/10.1007/s40747-024-01641-7 |
work_keys_str_mv | AT huilv tllamaatibetanlargelanguagemodelbasedonllama2 AT chipu tllamaatibetanlargelanguagemodelbasedonllama2 AT laduo tllamaatibetanlargelanguagemodelbasedonllama2 AT yanli tllamaatibetanlargelanguagemodelbasedonllama2 AT qingguozhou tllamaatibetanlargelanguagemodelbasedonllama2 AT junshen tllamaatibetanlargelanguagemodelbasedonllama2 |