T-LLaMA: a Tibetan large language model based on LLaMA2

Abstract: The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts such as Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the lack of corpora, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from Meta AI with additional Tibetan tokens using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both the news text generation and text summarization tasks. To our knowledge, T-LLaMA is the first large-scale language model for Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also provides foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.

Bibliographic Details
Main Authors: Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen
Format: Article
Language: English
Published: Springer 2024-12-01
Series: Complex & Intelligent Systems
Subjects: Large language model; Tibetan; Low-resource languages; Text classification
Online Access: https://doi.org/10.1007/s40747-024-01641-7
author Hui Lv
Chi Pu
La Duo
Yan Li
Qingguo Zhou
Jun Shen
collection DOAJ
description Abstract: The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts such as Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the lack of corpora, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from Meta AI with additional Tibetan tokens using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both the news text generation and text summarization tasks. To our knowledge, T-LLaMA is the first large-scale language model for Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also provides foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.
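The vocabulary expansion described in the abstract (learning Tibetan subword pieces with SentencePiece and merging them into the LLaMA2 tokenizer) is only named in this record, not shown. The sketch below illustrates one common way such a merge is done, in the style of Chinese-LLaMA-type tokenizer merges; the corpus path, vocabulary size, and checkpoint name are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of SentencePiece-based vocabulary expansion for LLaMA2.
# Assumptions: "tibetan_corpus.txt", vocab_size=20000, and the
# "meta-llama/Llama-2-7b-hf" checkpoint are placeholders, not the paper's settings.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a Tibetan SentencePiece (BPE) model on the raw corpus.
spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",
    model_prefix="tibetan_sp",
    vocab_size=20000,
    character_coverage=0.9995,
    model_type="bpe",
)

# 2. Load the original LLaMA2 tokenizer and the newly trained Tibetan pieces.
llama_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tok.sp_model.serialized_model_proto())

tib_proto = sp_pb2.ModelProto()
with open("tibetan_sp.model", "rb") as f:
    tib_proto.ParseFromString(f.read())

# 3. Append Tibetan pieces that the LLaMA2 vocabulary does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in tib_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# 4. Save the merged tokenizer model. Before continued pre-training, the model's
#    embedding matrix must be resized to the new vocabulary size, e.g. with
#    model.resize_token_embeddings(len(merged_tokenizer)).
with open("merged_llama_tibetan.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
print("merged vocabulary size:", len(llama_proto.pieces))
```

Merging at the SentencePiece model level keeps the original LLaMA2 token ids unchanged, so the pretrained embeddings stay valid and only the newly appended Tibetan rows need to be initialized and trained.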
format Article
id doaj-art-2693417246c046dd9201b988300ece81
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2024-12-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
author_affiliations Hui Lv: School of Information Science and Engineering, Lanzhou University
Chi Pu: School of Information Science and Engineering, Lanzhou University
La Duo: The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University
Yan Li: School of Information Science and Engineering, Lanzhou University
Qingguo Zhou: School of Information Science and Engineering, Lanzhou University
Jun Shen: School of Computing and Information Technology, University of Wollongong
title T-LLaMA: a Tibetan large language model based on LLaMA2
topic Large language model
Tibetan
Low-resource languages
Text classification
url https://doi.org/10.1007/s40747-024-01641-7