Name | Institution | Short description | Number of speakers | Age groups | Recorded text | Size | Annotation categories | Emotion origin | Multimodal? | Transcription available? |
ABC (Airplane Behaviour Corpus) | Technische Universität München, Institute for Human-Machine Communication & Department of Informatics (Björn Schuller, Dejan Arsic, Gerhard Rigoll, Matthias Wimmer, Bernd Radig) | ABC contains roughly 11.5 hours of recorded and annotated video material for observing behaviour on public transport (airplanes). Further description: B. Schuller, M. Wimmer, D. Arsic, G. Rigoll, and B. Radig, “Audiovisual behaviour modeling by combined feature spaces,” in Proc. ICASSP, 2007, pp. 733–736. Available at: https://mediatum.ub.tum.de/doc/1138565/1138565.pdf | 8 (m:4 / f:4) | 25 to 48 years of age (∅ 32 years) | fixed | 431 recordings | aggressive, cheerful, intoxicated, nervous, neutral, tired | induced | audiovisual | (yes?) |
emoDB (Berlin Emotional Speech Database) | TU Berlin, Speech Communication (Felix Burkhardt, Astrid Paeschke, Miriam Rolfes, Walter Sendlmeier, Benjamin Weiss) | Database of acted German emotional speech. Further description: A Database of German Emotional Speech, Proceedings of Interspeech 2005, Lisbon, Portugal. Available at: http://database.syntheticspeech.de/databaseOfGermanEmotionalSpeech.pdf | 10 (m:5 / f:5) | 21 to 35 years of age (∅ 30 years) | fixed | 494 samples | anger, boredom, disgust, happiness, fear, sadness, neutral | acted | audio | |
SmartKom (SmartKom Multimodal Corpus) | University of Munich (Bavarian Archive for Speech Signals) | Multimodal human-computer interaction (in the form of a Wizard-of-Oz experiment), recorded to develop communication assistants that analyse speech, gestures, and facial expressions. Further description: Reithinger, N. & Blocher, A. (2003). SmartKom - Multimodale Mensch-Technik-Interaktion (SmartKom – Multimodal Human-Computer Interaction). In: Ziegler, J. (Ed.), i-com: Vol. 2, No. 1. Munich: Oldenbourg Wissenschaftsverlag GmbH. (pp. 4-10). Available at: https://doi.org/10.1524/icom.2.1.4.19034 | 224 | n/a | spontaneous | 448 samples, approx. 4-5 min long | anger, gratitude, helplessness, irritability, happiness, thoughtfulness, surprise, reflectiveness, neutral, unidentifiable episodes | | audiovisual | |
VAM (Vera am Mittag) | Karlsruhe Institute of Technology, Communications Engineering Lab and University of Southern California, Speech Analysis and Interpretation Lab | Recordings from German talk shows. Further description: M. Grimm, K. Kroschel and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," 2008 IEEE International Conference on Multimedia and Expo, 2008, pp. 865-868, doi: 10.1109/ICME.2008.4607572. Available at: https://sail.usc.edu/publications/files/grimmicme2008.pdf | 47 (m:15 / f:32) | 16 to 69 years of age (70% under 35 years) | free | 946 samples | valence (negative – positive), activation (calm – excited) and dominance (weak – strong) | natural | audiovisual | yes |
AD (Anger Detection) | University of Ulm, Institute of Communications Engineering | Telephone calls | 9 | n/a | free | 660 recordings | neutral and angry | natural | audio | |
EA-ACT | Technische Universität München, Institute for Human-Machine Communication (collected by Björn Schuller as part of his dissertation) | Further description: Schuller, B. (2005). Automatische Emotionserkennung aus sprachlicher und manueller Interaktion (Automatic emotion recognition from speech and manual interaction). Available at: https://d-nb.info/980554381/34 | 39 (m:34 / f:5; native language: 28x German, 1x English, 1x French, 1x Mandarin, 3x Serbian, 5x Turkish) | n/a | free | 2280 samples | anger, happiness, sadness, surprise, neutral | acted | | |
FAU Aibo (Aibo Emotion Corpus, AEC) | Universität Erlangen-Nürnberg | Speech recordings of a total of 51 children from two German schools while interacting with the Sony robot Aibo. Further description: Steidl, S. “Automatic classification of emotion related user states in spontaneous children's speech.” (2009). Available at: http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2009/Steidl09-ACO.pdf | 51 (m:21 / f:30) | 10 to 13 years of age | free | 17074 samples | neutral, anger, irritability, happiness, surprise, boredom, helplessness, baby talk, admonishing, emphatic, other | natural | audiovisual | yes |
PPMMK-EMO | University of Passau | PPMMK-EMO is a database of German emotional speech recorded at the University of Passau, covering the four basic classes angry, happy, neutral, and sad. It comprises a total of 3,154 samples averaging 2.5 seconds in length, recorded from 36 speakers. | 36 | n/a | | 3154 samples | anger, happiness, sadness, neutral | | | |
SIMIS (Speech in Minimal Invasive Surgery) | Technische Universität München, Institute for Human-Machine Communication | Recordings of surgeons in the operating room. Further description: Schuller, B., Eyben, F., Can, S., & Feußner, H. (2010). Speech in Minimal Invasive Surgery - Towards an Affective Language Resource of Real-life Medical Operations. Available at: https://mediatum.ub.tum.de/doc/1287421/1287421.pdf | 10 | 24 to 54 years of age | free | 9299 samples | anger, confusion, happiness, impatience, neutral | natural | audio | |
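The columns used in these tables map naturally onto a structured record when the corpora are handled in code. The following is a minimal sketch under the assumption that one wants such a registry; the class and field names (`EmotionCorpus`, `num_speakers`, etc.) are purely illustrative and not part of any corpus distribution:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record mirroring the table columns; field names are illustrative only.
@dataclass
class EmotionCorpus:
    name: str                        # e.g. "emoDB"
    institution: str
    num_speakers: Optional[int]      # None where the table says "n/a"
    age_range: Optional[str]
    recorded_text: str               # "fixed", "free", or "spontaneous"
    size: str                        # e.g. "494 samples"
    annotation_categories: list[str] = field(default_factory=list)
    emotion_origin: Optional[str] = None   # "acted", "induced", or "natural"
    multimodal: Optional[str] = None       # e.g. "audio", "audiovisual"
    transcription_available: bool = False

# Example entry taken from the first table (emoDB).
emodb = EmotionCorpus(
    name="emoDB",
    institution="TU Berlin, Speech Communication",
    num_speakers=10,
    age_range="21-35",
    recorded_text="fixed",
    size="494 samples",
    annotation_categories=["anger", "boredom", "disgust", "happiness",
                           "fear", "sadness", "neutral"],
    emotion_origin="acted",
    multimodal="audio",
)
```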
Name | Institution | Short description | Number of speakers | Age groups | Recorded text | Size | Annotation categories | Emotion origin | Multimodal? | Transcription available? |
eNTERFACE (eNTERFACE'05 Audio-Visual Emotion Database) | Université catholique de Louvain, Laboratoire de Télécommunications et de Télédétection and Aristotle University of Thessaloniki, Department of Informatics | Database for testing and evaluating video, audio, or joint audio-visual emotion recognition algorithms. Further description: O. Martin, I. Kotsia, B. Macq and I. Pitas, "The eNTERFACE'05 Audio-Visual Emotion Database," 22nd International Conference on Data Engineering Workshops (ICDEW'06), 2006, pp. 8-8, doi: 10.1109/ICDEW.2006.145. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.220.2113&rep=rep1&type=pdf | 42 (m:34 / f:8) | n/a | fixed | 1277 samples | anger, disgust, happiness, sadness, surprise | induced | audiovisual | |
SUSAS (Speech Under Simulated and Actual Stress) | University of Colorado-Boulder, Robust Speech Processing Laboratory | Further description: Hansen, J., & Bou-Ghazale, S.E. (1997). Getting started with SUSAS: a speech under simulated and actual stress database. EUROSPEECH. available at: https://www.isca-speech.org/archive/archive_papers/eurospeech_1997/e97_1743.pdf | m: 19 / f: 13 | 22 to 76 years of age | free and fixed | 3593 samples | high stress, medium stress, screaming, fear, neutral | natural | audio | |
SAL (Sensitive Artificial Listener) | Queen’s University Belfast, Tel Aviv University, University of Twente | Further description: Douglas-Cowie, Ellen & Cowie, Roddy & Cox, Cate & Amir, Noam & Heylen, Dirk. (2008). The Sensitive Artificial Listener: an induction technique for generating emotionally coloured conversation. Available at: http://www.lrec-conf.org/proceedings/lrec2008/workshops/W2_Proceedings.pdf | m:2 / f:2 | n/a | free | n/a | | natural | | |
AVIC (Audiovisual Interest Corpus) | Technische Universität München and Toyota Motor Corporation | Further description: Schuller, Björn & Müller, Ronald & Hörnler, Benedikt & Höthker, Anja & Konosu, Hitoshi & Rigoll, Gerhard. (2007). Audiovisual recognition of spontaneous interest within conversations. Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI'07. 30-37. 10.1145/1322192.1322201. Available at: https://www.researchgate.net/publication/221052336_Audiovisual_recognition_of_spontaneous_interest_within_conversations | m:11 / f:10 | 30 to >40 years of age (∅ 29 years) | spontaneous | 3,901 samples | | natural | audiovisual | |
EU-EV (EU-Emotion Voice Database) | University of Amsterdam and others | The EU-Emotion voice stimuli consist of 2159 audio recordings of 54 actors, each uttering sentences with the intention of conveying 20 different emotional states (plus neutral). The database is organized in three separate emotional voice stimulus sets in three different languages (British English, Swedish, and Hebrew). Further description: Lassalle, Amandine & Pigat, Delia & O'Reilly, Helen & Berggren, Steve & Fridenson-Hayo, Shimrit & Tal, Shahar & Elfström, Sigrid & Råde, Anna & Golan, Ofer & Bölte, Sven & Baron-Cohen, Simon & Lundqvist, Daniel. (2018). The EU-Emotion Voice Database. Behavior Research Methods. 51. 10.3758/s13428-018-1048-1. Available at: https://link.springer.com/content/pdf/10.3758/s13428-018-1048-1.pdf | 54 (18 in Hebrew and 18 in Swedish) | 10 to 72 years of age | fixed | 2,159 samples (695 in British English, 1,011 in Swedish, and 453 in Hebrew) | 20 different emotional states plus neutral (afraid, angry, ashamed, bored, disappointed, disgusted, excited, frustrated, happy, hurt, interested, jealous, joking, kind, proud, sad, sneaky, surprised, unfriendly, worried) | acted | | |
EmoFilm | University of Augsburg, University of Rome, Imperial College London | Emotional speech extracted from films; a multilingual database suitable for studying culture and measurement strategies when evaluating the perception of emotion in speech. Further description: Parada-Cabaleiro, E., Costantini, G., Batliner, A., Baird, A., & Schuller, B. (2018). Categorical vs Dimensional Perception of Italian Emotional Speech. INTERSPEECH. Available at: https://opus.bibliothek.uni-augsburg.de/opus4/frontdoor/deliver/index/docId/44177/file/0047.pdf | 207 (including Italian and Spanish) | n/a | fixed | 1115 samples | anger, sadness, happiness, fear | acted | audiovisual | |
IEMOCAP (Interactive Emotional Dyadic Motion Capture) | University of Southern California, Signal Analysis and Interpretation Laboratory | Consists of dyadic sessions in which actors perform improvisations or scripted scenarios specifically selected to elicit emotional expression. Further description: C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, December 2008. Available at: https://sail.usc.edu/iemocap/Busso_2008_iemocap.pdf | m:5 / f:5 | n/a | scripted and spontaneous sessions | 5531 samples | anger, happiness, sadness, excitement, frustration, fear, surprise, neutral, and others, plus dimensional (valence, activation, dominance) | acted | audiovisual (with motion capture) | yes |
MELD (Multimodal EmotionLines Dataset) | University of Michigan, Nanyang Technological University, Instituto Politécnico Nacional, Singapore University of Technology and Design, National University of Singapore | MELD contains about 13,000 utterances from 1,433 dialogues from the TV series Friends. Further description: Poria, Soujanya & Hazarika, Devamanyu & Majumder, Navonil & Naik, Gautam & Cambria, Erik & Mihalcea, Rada. (2018). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. Available at: https://arxiv.org/pdf/1810.02508.pdf | 6+ | n/a | fixed | 13707 samples | anger, disgust, sadness, happiness, neutral, surprise, fear | acted | audiovisual | yes |
HUMAINE (Human-Machine Interaction Network on Emotions) | University of Belfast, LIMSI-CNRS, Universität Erlangen-Nürnberg, Tel Aviv University, National Technical University Athens and several other partners | HUMAINE aims to lay the foundations for European development of systems that can register, model and/or influence human emotional and emotion-related states and processes ("emotion-oriented systems"). It contains 48 clips (naturalistic, induced, or acted data), selected from the following corpora: Belfast Naturalistic database (English, naturalistic, 10 clips); Castaway Reality Television dataset (English, naturalistic, 10 clips); Sensitive Artificial Listener (English, induced, 12 clips); Sensitive Artificial Listener (Hebrew, induced, 1 clip); Activity/Spaghetti dataset (English, induced, 7 clips); Green Persuasive dataset (English, induced, 4 clips); EmoTABOO (French, induced, 2 clips); DRIVAWORK corpus (German, induced, 1 clip); GEMEP corpus (French, acted, 1 clip). Further description: Douglas-Cowie, Ellen & Cox, Cate & Martin, Jean-Claude & Devillers, Laurence & Cowie, Roddy & Sneddon, Ian & McRorie, Margaret & Pelachaud, Catherine & Peters, Christopher & Lowry, Orla & Batliner, Anton & Hoenig, Florian. (2011). The HUMAINE database. 10.1007/978-3-642-15184-2_14. Available at: https://www.researchgate.net/publication/226191511_The_HUMAINE_database | n/a | n/a | n/a | 48 samples | annotated with >20 labels | natural, induced, and acted | audiovisual | |
CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) | University of Pennsylvania | An audio-visual data set uniquely suited for the study of multi-modal emotion expression and perception. Further description: Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014). CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing, 5(4), 377–390. https://doi.org/10.1109/TAFFC.2014.2336244 Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313618/ | 91 | n/a | fixed | 7,442 samples | happy, sad, anger, fear, disgust, and neutral (surprise was not considered by the acting directors to be sufficiently specific, as it could relate to any of the other emotions with rapid onset) | acted | audiovisual | |
MOCHA-TIMIT | University of Edinburgh, Centre for Speech Technology Research | Further description: https://data.cstr.ed.ac.uk/mocha/README_v1.2.txt | 2 (m:1 /f:1) | n/a | fixed | 460 samples | ||||
TORGO | University of Toronto (Department of Computer Science; Oral Dynamics Laboratory, Department of Speech-Language Pathology), The Speech and Stuttering Institute, and Holland Bloorview Kids Rehabilitation Hospital, Toronto | TORGO is one of the most popular dysarthric speech corpora. It consists of aligned acoustic and articulatory recordings from 15 speakers; seven are control speakers without any speech disorders, while the remaining eight present different levels of dysarthria. Further description: F. Rudzicz, A. K. Namasivayam, and T. Wolff, "The TORGO database of acoustic and articulatory speech from speakers with dysarthria," Language Resources & Evaluation, vol. 46, pp. 523–541, 2012. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.420.767&rep=rep1&type=pdf | 15 | n/a | fixed | n/a | reflex, respiration, lips, jaw, velum, laryngeal, tongue, intelligibility | n/a | | |
The Nemours Database of Dysarthric Speech | Applied Science & Engineering Laboratories (ASEL), A.I. duPont Institute, USA | The Nemours database is a collection of 814 short nonsense sentences; 74 sentences spoken by each of 11 male speakers with varying degrees of dysarthria. Further description: Menéndez-Pidal, Xavier / Polikoff, James B. / Peters, Shirley M. / Leonzio, Jennie E. / Bunnell, H. T. (1996): "The Nemours database of dysarthric speech," in ICSLP-1996, 1962-1965. Available at: https://www.isca-speech.org/archive/archive_papers/icslp_1996/i96_1962.pdf | 11 (m:11) | n/a | fixed | 814 samples | 39 segment labels derived from the ARPAbet symbol set | n/a | | broad phonemic transcription |
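Because the annotation categories differ from corpus to corpus (compare IEMOCAP's categorical-plus-dimensional scheme with CREMA-D's six classes), cross-corpus experiments usually map each label set onto a shared subset before training. The sketch below illustrates one possible harmonization; the target set and the per-corpus mappings are example choices made here for illustration, not definitions supplied by the databases themselves:

```python
from typing import Optional

# Illustrative label harmonization for cross-corpus experiments.
# The shared target set and the mappings below are assumptions for this sketch.
TARGET_LABELS = {"anger", "happiness", "sadness", "neutral"}

LABEL_MAP = {
    "IEMOCAP": {"anger": "anger", "happiness": "happiness", "excitement": "happiness",
                "sadness": "sadness", "neutral": "neutral"},   # remaining classes dropped
    "CREMA-D": {"anger": "anger", "happy": "happiness",
                "sad": "sadness", "neutral": "neutral"},
    "MELD":    {"anger": "anger", "happiness": "happiness",
                "sadness": "sadness", "neutral": "neutral"},
}

def harmonize(corpus: str, label: str) -> Optional[str]:
    """Return the shared label for a corpus-specific one, or None if it is excluded."""
    return LABEL_MAP.get(corpus, {}).get(label.lower())

# Example: MELD's "surprise" falls outside the shared set and is discarded,
# while IEMOCAP's "excitement" is merged into "happiness".
assert harmonize("MELD", "surprise") is None
assert harmonize("IEMOCAP", "excitement") == "happiness"
```

Whether merging classes such as excitement into happiness is appropriate depends on the study; the point is only that some explicit mapping is needed before results across these corpora can be compared.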
Name | Institution | Short description | Number of speakers | Age groups | Recorded text | Size | Annotation categories | Emotion origin | Multimodal | Transcription available |
GEMEP (Geneva Multimodal Emotion Portrayal) | Université de Genève | Further description: Bänziger, T., & Scherer, K. R. (2010). Introducing the Geneva Multimodal Emotion Portrayal (GEMEP) corpus. In K. R. Scherer, T. Bänziger, & E. B. Roesch (Eds.), Blueprint for affective computing: A sourcebook (pp. 271-294). Oxford, England: Oxford University Press. Available at: https://www.unige.ch/cisa/files/5814/6721/0641/Banziger__Scherer_-_2010_-_Introducing_the_Geneva_Multimodal_Emotion_Portrayal_GEMEP_Corpus.pdf | m: 5 / f: 5 | n/a | fixed | 1260 samples | admiration, amusement, tenderness, anger, disgust, pride, shame, worry, interest, irritation, elation (joy), contempt, anxiety (worry), pleasure, relief, surprise, sadness | acted | audiovisual | yes |
Name | Institution | Short description | Number of speakers | Age groups | Recorded text | Size | Annotation categories | Emotion origin | Multimodal | Transcription available |
EmoFilm | University of Augsburg | Further description: Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Alice Baird, and Björn Schuller (2018), Categorical vs Dimensional Perception of Italian Emotional Speech, in Proc. of Interspeech, Hyderabad, India, pp. 3638-3642. | 207 (including English and Italian) | | | 1115 recordings | anger, contempt, happiness, fear, and sadness | | | |
SES (Spanish Emotional Speech database) | | Further description: Montero, Juan & Gutierrez-Arriola, Juana M. & Colás, José & Macias-Guarasa, Javier & Enríquez, Emilia & Pardo, Juan. (1999). Development of an emotional speech synthesiser in Spanish. Available at: https://www.isca-speech.org/archive/archive_papers/eurospeech_1999/e99_2099.pdf | 1 (m:1) | | fixed | 30 words, 15 short sentences and 3 paragraphs | anger, happiness, sadness, surprise, neutral | acted | | |
| | Further description: Sanz, Ignasi & Guaus, Roger & Rodríguez, Angel & Lázaro Pernias, Patrícia & Vilar, Norminanda & Pont, Josep Maria & Bernadas, Dolors & Oliver, Josep & Tena, Daniel & Longhi, Ludovico. (2001). Validation Of An Acoustical Modelling Of Emotional Expression In Spanish Using Speech Synthesis Techniques. Available at: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.385.1165&rep=rep1&type=pdf | 8 actors (f:4 / m:4) | | | 336 utterances (three intensities) | | | | |
| Technical University of Madrid | 300 utterances with four different sentences as a synthetic data set (actors) and 80 utterances as a real data set (DVD movies); 15 non-professional speakers (female and male) in the synthetic data set. Available at: https://ieeexplore.ieee.org/abstract/document/1513750 | 15 | | | 380 utterances (300 synthetic, 80 real) | neutral, happiness, sadness, anger, and fear | | | |
Name | Institution | Short description | Number of speakers | Age groups | Recorded text | Size | Annotation categories | Emotion origin | Multimodal | Transcription available |
CASIA | | | 4 | | | 1200 samples | | | | |
CVE (Chinese Vocal Emotions) | | | 4 | | | 874 samples | | | | |
MES (Mandarin Emotional Speech) | | | 6 | | | 360 samples | | | | |
Name | Institution | Short description | Number of speakers | Age groups | Recorded text | Size | Annotation categories | Emotion origin | Multimodal | Transcription available |
BUEMODB (Bogazici University Emotion Database) | Bogazici University | Acted sentences to measure F0 | 11 (f:7 / m:4) | | fixed | 484 samples | anger, joy, neutrality, and sadness | acted | | |
TurES (TURkish Emotional Speech database) | | Utterances from 55 Turkish films | 582 (f:188 / m:394) | | fixed | 5304 samples | happy, surprised, sad, angry, fear, neutral, and other; plus 3-dimensional emotional space (valence, activation, and dominance) | acted | | |
EmoSTAR | | Utterances from film and TV | | | | >300 samples | | | | |
Voice Corpus | | | 50 (f:25 / m:25) | | | 3740 samples | afraid, angry, happy, sad, neutral | | | |
Turkish Emotion-Voice Database (TurEV-DB) | Cognitive Science Department, Middle East Technical University (METU) | Amateur actors | 6 (f:3 / m:3) | | | | angry, calm, happy, sad | acted | | |
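With the entries held in a structure such as the `EmotionCorpus` sketch given after the first table, survey-style questions about these tables, e.g. which corpora are audiovisual with naturally occurring emotions, reduce to a simple filter. Again, this is an illustrative helper built on that hypothetical structure, not part of any corpus tooling:

```python
def select(corpora, *, emotion_origin=None, multimodal=None):
    """Filter a list of EmotionCorpus records by emotion origin and modality (illustrative)."""
    hits = []
    for c in corpora:
        if emotion_origin and c.emotion_origin != emotion_origin:
            continue
        if multimodal and c.multimodal != multimodal:
            continue
        hits.append(c)
    return hits

# Example: select(all_corpora, emotion_origin="natural", multimodal="audiovisual")
# would return VAM and FAU Aibo from the German table above.
```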