Audio-Language Datasets of Scenes and Events: A Survey
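Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). The survey provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets such as AudioSet, with over two million samples, and community platforms such as Freesound, with over one million samples. The survey evaluates acoustic and linguistic variability across datasets through principal component analysis of audio and text embeddings. The survey also analyzes data leakage through CLAP embeddings, and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting specific areas requiring attention: multilingual dataset development, specialized domain coverage and improved dataset accessibility.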
Main Authors: | Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Audio-to-language learning; language-to-audio learning; audio-language datasets; review |
Online Access: | https://ieeexplore.ieee.org/document/10854210/ |
_version_ | 1832575624195080192 |
---|---|
author | Gijs Wijngaard; Elia Formisano; Michele Esposito; Michel Dumontier |
author_sort | Gijs Wijngaard |
collection | DOAJ |
description | Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). The survey provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets such as AudioSet, with over two million samples, and community platforms such as Freesound, with over one million samples. The survey evaluates acoustic and linguistic variability across datasets through principal component analysis of audio and text embeddings. The survey also analyzes data leakage through CLAP embeddings, and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting specific areas requiring attention: multilingual dataset development, specialized domain coverage and improved dataset accessibility. |
format | Article |
id | doaj-art-e3fb60443d5742309939d3aba5f9b4f8 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-e3fb60443d5742309939d3aba5f9b4f8; 2025-01-31T23:04:37Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 20328; 20360; 10.1109/ACCESS.2025.3534621; 10854210; Audio-Language Datasets of Scenes and Events: A Survey; Gijs Wijngaard (0) https://orcid.org/0009-0002-3875-1232; Elia Formisano (1) https://orcid.org/0000-0001-5008-2460; Michele Esposito (2) https://orcid.org/0000-0002-7659-6520; Michel Dumontier (3) https://orcid.org/0000-0003-4727-9435; (0) Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; (1) Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands; (2) Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands; (3) Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; https://ieeexplore.ieee.org/document/10854210/; Audio-to-language learning; language-to-audio learning; audio-language datasets; review |
title | Audio-Language Datasets of Scenes and Events: A Survey |
topic | Audio-to-language learning; language-to-audio learning; audio-language datasets; review |
url | https://ieeexplore.ieee.org/document/10854210/ |