Data

Databases used in the Interspeech Computational Paralinguistics Challenge (ComParE) series are usually owned by individual donors. End User License Agreements (EULAs) are usually granted for participation in the Challenge only. Usage of the databases outside of the Challenges always has to be negotiated with the data owners, not with the organisers of the Challenge. We aim to provide contact information per database; however, this requires the consent of the data owners, which we are currently collecting.

Below, a description of the 2022 data is given. All of these corpora provide realistic data recorded in challenging acoustic conditions. They feature further rich annotation, such as subject meta-data, transcripts, and segmentation, and are partitioned into training, development, and test sets, observing subject independence. As in previous years, reproducible benchmark results of the most popular approaches with open-source toolkits will be provided: we will supply scripts for computing the results with auDeep, DeepSpectrum, openSMILE, and openXBOW.

The Vocalisations corpus as used in the 2022 Vocalisations Sub-Challenge, provided by Natalie Holz, MPI, Frankfurt am Main, features vocalisations (affect bursts) such as laughter, cries, moans, or screams, with different affective intensities, indicating different emotions. The data from the female speakers have been made publicly available, see the publications given below; the male speakers' data are so far unseen. We partition these data into the female vocalisations for train (6 speakers, 625 items) and development (5 speakers, 460 items), and the male vocalisations (2 speakers, 276 items) for test, modelling a 6-class problem with the emotion classes achievement, anger, fear, pain, pleasure, and surprise. Chance level is 16.7% UAR (Unweighted Average Recall, also known as Balanced or Macro Average Recall). The best baseline result on test has been achieved with auDeep, with 32.7% UAR. More details are found in: N. Holz, P. Larrouy-Maestri, D. Poeppel: “The variably intense vocalizations of affect and emotion (VIVAE) corpus prompts new perspective on nonspeech perception,” Emotion, 22(1), 213-225, 2022, and N. Holz, P. Larrouy-Maestri, D. Poeppel: “The paradoxical role of emotional intensity in the perception of vocal affect,” Scientific Reports, 11, 9663, 2021.
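For reference, UAR is simply the macro average of the per-class recalls, so a random or constant classifier reaches a chance level of 1/C on a C-class problem. The following minimal sketch, assuming scikit-learn and placeholder label arrays (not Sub-Challenge data), shows how the metric and the 16.7% chance level for six classes are obtained:

```python
# Minimal sketch of the UAR metric (macro-averaged recall), assuming
# scikit-learn; the label arrays are placeholders, not Sub-Challenge data.
from sklearn.metrics import recall_score

CLASSES = ["achievement", "anger", "fear", "pain", "pleasure", "surprise"]

y_true = ["achievement", "anger", "fear", "pain", "pleasure", "surprise"]
y_pred = ["achievement", "anger", "pain", "pain", "pleasure", "fear"]

# UAR = mean of the per-class recalls, independent of class frequencies.
uar = recall_score(y_true, y_pred, labels=CLASSES, average="macro")
print(f"UAR: {uar:.1%}")                        # 66.7% in this toy example

# A random (or constant) classifier achieves a UAR of 1/C on a C-class task:
print(f"Chance level: {1 / len(CLASSES):.1%}")  # 16.7% for the six classes
```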

The Stuttering corpus as used in the 2022 Stuttering Sub-Challenge, provided by TH Nürnberg and the Kasseler Stottertherapie, is derived from the Kassel State of Fluency (KSoF) corpus. The original corpus features some 5,500 typical and non-typical (stuttering) 3-sec segments from 37 German speakers, with an overall duration of 4.6 hours. The segments contain speech of persons who stutter (PWS). The recordings from which the segments were extracted were made before, during, and after the PWS underwent stuttering therapy. Three annotators labelled each segment as one of seven classes (block, prolongation, sound repetition, word/phrase repetition, modified speech technique, interjection, no disfluency). They were also asked to provide additional information, e.g., about the recording quality, and were able to assign more than one label per 3-sec segment. For this Challenge, we removed all segments assigned more than one label, thus only featuring the 4,601 unambiguously labelled segments. The task proposed in this Challenge is the classification of speech segments into one of eight classes: the seven stuttering-related classes mentioned above and an eighth garbage class, denoting unintelligible segments, segments containing no speech, or segments negatively affected by loud background noise. The dataset is split by speaker (train: 23 speakers; devel: 6 speakers; test: 8 speakers). Using an SVM with linear kernel together with openSMILE features, we obtain an Unweighted Average Recall (UAR) of 40.4% on the test set. UAR is used as the primary metric to compare system performance. Please find additional info in: S. P. Bayerl, F. Hönig, J. Reister, K. Riedhammer: “Towards Automated Assessment of Stuttering and Stuttering Therapy,” in Text, Speech, and Dialogue, Lecture Notes in Computer Science, Cham: Springer International Publishing, 2020, and S. P. Bayerl, A. Wolff von Gudenberg, F. Hönig, E. Nöth, K. Riedhammer: “KSoF: The Kassel State of Fluency Dataset – A Therapy Centered Dataset of Stuttering,” arXiv:2203.05383, March 2022.
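As an illustration of this kind of baseline, the sketch below combines the opensmile Python package (ComParE 2016 functionals) with a linear SVM from scikit-learn; the file paths, labels, and the SVM complexity constant are placeholder assumptions, not the official baseline recipe:

```python
# Sketch of an openSMILE-functionals + linear-SVM pipeline, assuming the
# `opensmile` and `scikit-learn` Python packages; file paths, labels, and the
# SVM complexity constant C are placeholders, not the official baseline recipe.
import opensmile
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# 6,373 ComParE 2016 functionals per 3-sec segment.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

train_files = ["train_0001.wav", "train_0002.wav"]   # placeholder segment paths
train_labels = ["block", "no disfluency"]            # placeholder class labels
devel_files = ["devel_0001.wav"]
devel_labels = ["garbage"]

X_train = smile.process_files(train_files)           # one feature row per segment
X_devel = smile.process_files(devel_files)

clf = make_pipeline(StandardScaler(), LinearSVC(C=1e-4, max_iter=100_000))
clf.fit(X_train, train_labels)

uar = recall_score(devel_labels, clf.predict(X_devel), average="macro")
print(f"Devel UAR: {uar:.1%}")
```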

The harAGE corpus as used in the 2022 Activity Sub-Challenge, provided by the EU Horizon 2020 project sustAGE, is a multimodal dataset for Human Activity Recognition (HAR) collected using the Garmin Vivoactive 3 smartwatch. The dataset contains a total of 17 h 37 m 20 s of triaxial accelerometer, heart rate, and pedometer sensor measurements collected from 30 (14f, 16m) participants. The raw measurements are segmented using windows of 20 sec length and annotated according to the activity the participants were performing. Sensor measurements from the following 8 activities are included in the harAGE corpus: lying, sitting, standing, washing hands, walking, running, stairs climbing, and cycling. The harAGE corpus is split into three participant-independent and gender-balanced partitions. The training, development, and test partitions contain a total of 10 h 41 m 20 s, 2 h 16 m 0 s, and 4 h 40 m 0 s of data from 17 (8f, 9m), 6 (3f, 3m), and 7 (3f, 4m) participants, respectively. The task proposed in this Sub-Challenge consists in the development of unimodal and/or multimodal systems able to analyse 20 sec of sensor measurements and infer the corresponding activity. UAR is used as the official metric to compare system performances. The best baseline solution implements a multimodal system with dedicated CNNs to extract deep learnt representations from the heart rate, pedometer, and accelerometer modalities; these embedded representations are fused at an inner stage by concatenation. The best baseline system obtains a UAR of 72.17% on the test partition. More info is found in: A. Mallol-Ragolta, A. Semertzidou, M. Pateraki, B. Schuller: “harAGE: A Novel Multimodal Smartwatch-based Dataset for Human Activity Recognition,” in Proceedings of the 16th International Conference on Automatic Face and Gesture Recognition, (Jodhpur, India – Virtual Event), 2021, and A. Mallol-Ragolta, A. Semertzidou, M. Pateraki, B. Schuller: “Outer Product-Based Fusion of Smartwatch Sensor Data for Human Activity Recognition,” Frontiers in Computer Science, section Mobile and Ubiquitous Computing, 2022.
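To make the fusion scheme concrete, the following PyTorch sketch shows embedding-level fusion by concatenation with one dedicated encoder per modality; the channel counts, kernel sizes, embedding dimensions, and dummy input shapes are illustrative assumptions, not the official harAGE baseline configuration:

```python
# Minimal PyTorch sketch of embedding-level fusion by concatenation; channel
# counts, kernel sizes, and embedding sizes are illustrative assumptions, not
# the official harAGE baseline configuration.
import torch
import torch.nn as nn


class SensorEncoder(nn.Module):
    """1-D CNN that maps one sensor stream to a fixed-size embedding."""

    def __init__(self, in_channels: int, emb_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # pool over the 20-sec window
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, x):                      # x: (batch, channels, time)
        return self.proj(self.conv(x).squeeze(-1))


class HarFusionNet(nn.Module):
    """Dedicated encoders per modality, fused by concatenating embeddings."""

    def __init__(self, num_classes: int = 8, emb_dim: int = 64):
        super().__init__()
        self.heart_rate = SensorEncoder(1, emb_dim)
        self.pedometer = SensorEncoder(1, emb_dim)
        self.accelerometer = SensorEncoder(3, emb_dim)   # triaxial
        self.classifier = nn.Sequential(
            nn.Linear(3 * emb_dim, 128), nn.ReLU(), nn.Linear(128, num_classes)
        )

    def forward(self, hr, steps, acc):
        z = torch.cat(
            [self.heart_rate(hr), self.pedometer(steps), self.accelerometer(acc)],
            dim=-1,
        )
        return self.classifier(z)              # logits over the 8 activities


# Example forward pass on dummy 20-sec windows (sampling rates are assumptions).
model = HarFusionNet()
logits = model(torch.randn(4, 1, 20), torch.randn(4, 1, 20), torch.randn(4, 3, 500))
print(logits.shape)                            # torch.Size([4, 8])
```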

The Mosquito corpus as used in the 2022 Mosquito Sub-Challenge, provided by the HumBug Project, is a large-scale audio dataset consisting of over 20 hours of mosquito flight recordings, available from https://zenodo.org/record/4904800#.YixXSnpKhaR. It represents the culmination of five years of work with the goal of addressing the problem of automatic acoustic mosquito surveillance. Mosquitoes kill more humans than any other creature on the planet. Of the roughly 3,500 known species, only a handful are harmful. This creates a demand for methods to classify mosquito species, as it enables a smart response to fight mosquito-borne diseases.

The task is to detect timestamps for acoustic mosquito events: Mosquito Event Detection (MED). Our test and development sets are sourced from different experiments than the training data, to encourage the development of generalisable machine learning approaches. The test set is not included in the hosted Zenodo dataset, nor is it available publicly anywhere online, to preserve the fairness of the task. The baseline for the Challenge is a Bayesian Convolutional Neural Network. On the two MED development sets, the baseline achieves PR-AUCs of 0.981 (A: mosquito bednet recordings) and 0.707 (B: a more challenging low-SNR environment). More info is found in: I. Kiskin, M. Sinka, A. D. Cobb, W. Rafique, L. Wang, D. Zilli, B. Gutteridge, R. Dam, T. Marinos, Y. Li, D. Msaky: “HumBugDB: A Large-scale Acoustic Mosquito Dataset,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
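One common way to realise an approximately Bayesian CNN is Monte Carlo dropout, i.e., keeping dropout active at inference and averaging several stochastic forward passes per window. The sketch below, with a deliberately tiny CNN, random dummy windows, and scikit-learn's average_precision_score for the PR-AUC, is an illustrative assumption of this idea and not the official HumBug baseline:

```python
# Sketch of Monte-Carlo-dropout inference and PR-AUC scoring for per-window
# mosquito detection; the tiny CNN, the dummy data, and the number of MC
# samples are illustrative assumptions, not the official HumBug baseline.
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score


class TinyMosquitoCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, x):                 # x: (batch, 1, mel bands, frames)
        return self.net(x).squeeze(-1)    # one "mosquito present" logit per window


@torch.no_grad()
def mc_dropout_probs(model, x, n_samples=30):
    """Average sigmoid outputs with dropout kept active (MC dropout)."""
    model.train()                         # keep dropout layers stochastic
    probs = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return probs.mean(0)                  # predictive mean per window


# Dummy batch of log-mel windows and binary "mosquito present" labels.
features = torch.randn(8, 1, 64, 100)
labels = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])

model = TinyMosquitoCNN()
scores = mc_dropout_probs(model, features)
print("PR-AUC:", average_precision_score(labels.numpy(), scores.numpy()))
```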