Documenting Geographically and Contextually Diverse Language Data Sources

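Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLM). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.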

Bibliographic Details
Main Authors: Angelina McMillan-Major, Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Daniel van Strien, Zeerak Talat, Yacine Jernite
Format: Article
Language: English
Published: Linköping University Electronic Press 2025-01-01
Series: Northern European Journal of Language Technology
Online Access: https://nejlt.ep.liu.se/article/view/5217
author Angelina McMillan-Major
Francesco De Toni
Zaid Alyafeai
Stella Biderman
Kimbo Chen
Gérard Dupont
Hady Elsahar
Chris Emezue
Alham Fikri Aji
Suzana Ilić
Nurulaqilla Khamis
Colin Leong
Maraim Masoud
Aitor Soroa
Pedro Ortiz Suarez
Daniel van Strien
Zeerak Talat
Yacine Jernite
collection DOAJ
description Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLM). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.
format Article
id doaj-art-5086e21fd85644c6b616bc0f386706be
institution Kabale University
issn 2000-1533
language English
publishDate 2025-01-01
publisher Linköping University Electronic Press
record_format Article
series Northern European Journal of Language Technology
spelling doaj-art-5086e21fd85644c6b616bc0f386706be
2025-01-22T15:24:14Z
eng
Linköping University Electronic Press
Northern European Journal of Language Technology
2000-1533
2025-01-01
10
1
10.3384/nejlt.2000-1533.2024.5217
Documenting Geographically and Contextually Diverse Language Data Sources
Angelina McMillan-Major (University of Washington)
Francesco De Toni
Zaid Alyafeai
Stella Biderman
Kimbo Chen
Gérard Dupont
Hady Elsahar
Chris Emezue
Alham Fikri Aji
Suzana Ilić
Nurulaqilla Khamis
Colin Leong
Maraim Masoud
Aitor Soroa
Pedro Ortiz Suarez
Daniel van Strien
Zeerak Talat
Yacine Jernite
Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLM). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.
https://nejlt.ep.liu.se/article/view/5217
title Documenting Geographically and Contextually Diverse Language Data Sources
url https://nejlt.ep.liu.se/article/view/5217