Audio-Language Datasets of Scenes and Events: A Survey


Bibliographic Details
Main Authors: Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Audio-to-language learning, language-to-audio learning, audio-language datasets, review
Online Access: https://ieeexplore.ieee.org/document/10854210/
author Gijs Wijngaard
Elia Formisano
Michele Esposito
Michel Dumontier
author_facet Gijs Wijngaard
Elia Formisano
Michele Esposito
Michel Dumontier
author_sort Gijs Wijngaard
collection DOAJ
description Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). The survey provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets such as AudioSet, with over two million samples, and community platforms such as Freesound, with over one million samples. The survey evaluates acoustic and linguistic variability across datasets through principal component analysis of audio and text embeddings. The survey also analyzes data leakage through CLAP embeddings, and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting specific areas requiring attention: multilingual dataset development, specialized domain coverage and improved dataset accessibility.
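The description above mentions two analyses that can be approximated in a few lines: principal component analysis of audio/text embeddings to gauge variability, and a CLAP-embedding-based check for train/test data leakage. The sketch below is not taken from the surveyed paper; it is a minimal illustration that assumes embeddings have already been extracted (for example with a CLAP model) and saved as NumPy arrays, and the file names and the 0.95 similarity threshold are placeholder assumptions.

# Illustrative sketch (not the paper's code): PCA over precomputed
# audio/caption embeddings, plus a cosine-similarity check for train/test
# overlap. Assumes embeddings were extracted beforehand (e.g. with a CLAP
# model) and stored as (n_samples, embedding_dim) arrays.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

train_emb = np.load("train_audio_embeddings.npy")  # placeholder file name
test_emb = np.load("test_audio_embeddings.npy")    # placeholder file name

# Variability: project embeddings onto two principal components and report
# how much variance those components capture.
pca = PCA(n_components=2)
train_2d = pca.fit_transform(train_emb)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Leakage: flag test clips whose embedding is nearly identical to some
# training clip (the 0.95 threshold is an arbitrary illustrative choice).
sim = cosine_similarity(test_emb, train_emb)   # shape: (n_test, n_train)
max_sim = sim.max(axis=1)
suspects = np.where(max_sim > 0.95)[0]
print(f"{len(suspects)} test clips exceed 0.95 cosine similarity to a training clip")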
format Article
id doaj-art-e3fb60443d5742309939d3aba5f9b4f8
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-e3fb60443d5742309939d3aba5f9b4f8
Record timestamp: 2025-01-31T23:04:37Z
Language: eng
Publisher: IEEE
Series: IEEE Access (ISSN 2169-3536)
Published: 2025-01-01, Volume 13, pp. 20328-20360
DOI: 10.1109/ACCESS.2025.3534621
IEEE article number: 10854210
Title: Audio-Language Datasets of Scenes and Events: A Survey
Authors and affiliations:
Gijs Wijngaard (https://orcid.org/0009-0002-3875-1232), Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
Elia Formisano (https://orcid.org/0000-0001-5008-2460), Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands
Michele Esposito (https://orcid.org/0000-0002-7659-6520), Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands
Michel Dumontier (https://orcid.org/0000-0003-4727-9435), Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
Abstract: identical to the description field above.
Online access: https://ieeexplore.ieee.org/document/10854210/
Keywords: Audio-to-language learning; language-to-audio learning; audio-language datasets; review
spellingShingle Gijs Wijngaard
Elia Formisano
Michele Esposito
Michel Dumontier
Audio-Language Datasets of Scenes and Events: A Survey
IEEE Access
Audio-to-language learning
language-to-audio learning
audio-language datasets
review
title Audio-Language Datasets of Scenes and Events: A Survey
title_full Audio-Language Datasets of Scenes and Events: A Survey
title_fullStr Audio-Language Datasets of Scenes and Events: A Survey
title_full_unstemmed Audio-Language Datasets of Scenes and Events: A Survey
title_short Audio-Language Datasets of Scenes and Events: A Survey
title_sort audio language datasets of scenes and events a survey
topic Audio-to-language learning
language-to-audio learning
audio-language datasets
review
url https://ieeexplore.ieee.org/document/10854210/
work_keys_str_mv AT gijswijngaard audiolanguagedatasetsofscenesandeventsasurvey
AT eliaformisano audiolanguagedatasetsofscenesandeventsasurvey
AT micheleesposito audiolanguagedatasetsofscenesandeventsasurvey
AT micheldumontier audiolanguagedatasetsofscenesandeventsasurvey