Documenting Geographically and Contextually Diverse Language Data Sources
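Main Authors: | Angelina McMillan-Major, Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Daniel van Strien, Zeerak Talat, Yacine Jernite |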
---|---|
Format: | Article |
Language: | English |
Published: | Linköping University Electronic Press, 2025-01-01 |
Series: | Northern European Journal of Language Technology |
Online Access: | https://nejlt.ep.liu.se/article/view/5217 |
_version_ | 1832591229864378368 |
---|---|
author | Angelina McMillan-Major, Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Daniel van Strien, Zeerak Talat, Yacine Jernite |
author_sort | Angelina McMillan-Major |
collection | DOAJ |
description | Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has raised concerns about the rights of data subjects represented in data collections. These concerns are exacerbated by a lack of documentation and analysis tools, which makes it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned. |
format | Article |
id | doaj-art-5086e21fd85644c6b616bc0f386706be |
institution | Kabale University |
issn | 2000-1533 |
language | English |
publishDate | 2025-01-01 |
publisher | Linköping University Electronic Press |
record_format | Article |
series | Northern European Journal of Language Technology |
spelling | doaj-art-5086e21fd85644c6b616bc0f386706be; 2025-01-22T15:24:14Z; eng; Linköping University Electronic Press; Northern European Journal of Language Technology; 2000-1533; 2025-01-01; 10; 1; 10.3384/nejlt.2000-1533.2024.5217; Documenting Geographically and Contextually Diverse Language Data Sources; Angelina McMillan-Major (University of Washington), Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Daniel van Strien, Zeerak Talat, Yacine Jernite; https://nejlt.ep.liu.se/article/view/5217 |
title | Documenting Geographically and Contextually Diverse Language Data Sources |
url | https://nejlt.ep.liu.se/article/view/5217 |