Speech Communication


Overview of speech databases


Each entry lists: name, institution, short description, number of speakers, age groups, recorded text, size, annotation categories, emotion origin, whether the corpus is multimodal, and whether a transcription is available.
ABC (Airplane Behaviour Corpus)
Institution: Technische Universität München, Institute for Human-Machine Communication & Department of Informatics (Björn Schuller, Dejan Arsic, Gerhard Rigoll, Matthias Wimmer, Bernd Radig)
Short description: roughly 11.5 hours of recorded and annotated video material for observing passenger behaviour on public transport (airplanes).
Further description: B. Schuller, M. Wimmer, D. Arsic, G. Rigoll, and B. Radig, "Audiovisual behaviour modeling by combined feature spaces," in Proc. ICASSP, 2007, pp. 733–736. Available at: https://mediatum.ub.tum.de/doc/1138565/1138565.pdf
Number of speakers: 8 (m: 4 / f: 4)
Age groups: 25 to 48 years (∅ 32 years)
Recorded text: fixed
Size: 431 recordings
Annotation categories: aggressive, cheerful, intoxicated, nervous, neutral, tired
Emotion origin: induced
Multimodal: audiovisual
Transcription available: (yes?)

emoDB (Berlin Emotional Speech Database)
Institution: TU Berlin, Speech Communication (Felix Burkhardt, Astrid Paeschke, Miriam Rolfes, Walter Sendlmeier, Benjamin Weiss)
Further description: A Database of German Emotional Speech, Proceedings Interspeech 2005, Lisbon, Portugal. Available at:
Number of speakers: 10 (m: 5 / f: 5)
Age groups: 21 to 35 years (∅ 30 years)
Recorded text: fixed
Size: 494 samples
Annotation categories: anger, boredom, disgust, happiness, fear, sadness, neutral
Emotion origin: acted
Multimodal: audio

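The emoDB recordings are commonly distributed with the metadata encoded directly in the file names (e.g. 03a01Fa.wav: two-digit speaker id, three-character text code, one emotion letter, one version letter). As a minimal sketch, assuming that standard naming convention (the letter-to-emotion mapping below follows the German emotion words and is stated here as an assumption, not taken from this overview), the labels listed above can be recovered like this:

```python
# Assumed emoDB emotion letter codes (from the German words, e.g. W = Wut/anger).
EMOTION_CODES = {
    "W": "anger",      # Wut
    "L": "boredom",    # Langeweile
    "E": "disgust",    # Ekel
    "A": "fear",       # Angst
    "F": "happiness",  # Freude
    "T": "sadness",    # Trauer
    "N": "neutral",
}

def parse_emodb_name(filename: str) -> dict:
    """Split an emoDB wav file name into its metadata fields.

    Assumes names like '03a01Fa.wav': chars 1-2 speaker id,
    chars 3-5 text code, char 6 emotion letter, char 7 version.
    """
    stem = filename.rsplit(".", 1)[0]
    return {
        "speaker": stem[0:2],
        "text": stem[2:5],
        "emotion": EMOTION_CODES[stem[5]],
        "version": stem[6:],
    }
```

A parser like this is enough to group the 494 samples by speaker or emotion without any separate label file.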
SmartKom (SmartKom Multimodal Corpus)
Institution: University of Munich (Bavarian Archive for Speech Signals)
Short description: multimodal human-computer interaction (in the form of a Wizard-of-Oz experiment), aimed at developing communication assistants that analyze speech, gestures, and facial expressions.
Further description: Reithinger, N. & Blocher, A. (2003). SmartKom – Multimodale Mensch-Technik-Interaktion (SmartKom – Multimodal Human-Computer Interaction). In: Ziegler, J. (Ed.), i-com: Vol. 2, No. 1. Munich: Oldenbourg Wissenschaftsverlag GmbH, pp. 4-10. Available at: https://doi.org/10.1524/icom.
Number of speakers: 224
Age groups: n/a
Recorded text: spontaneous
Size: 448 samples, approx. 4-5 min each
Annotation categories: anger, gratitude, helplessness, irritability, happiness, thoughtfulness, surprise, reflectiveness, neutral, unidentifiable episodes
Multimodal: audiovisual

VAM (Vera am Mittag)
Institution: Karlsruhe Institute of Technology, Communications Engineering Lab, and University of Southern California, Speech Analysis and Interpretation Lab
Short description: recordings from the German talk show "Vera am Mittag".
Further description: M. Grimm, K. Kroschel and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," 2008 IEEE International Conference on Multimedia and Expo, 2008, pp. 865-868, doi: 10.1109/ICME.2008.4607572. Available at:
Number of speakers: 47 (m: 15 / f: 32)
Age groups: 16 to 69 years (70% under 35)
Recorded text: free
Size: 946 samples
Annotation categories: valence (negative – positive), activation (calm – excited), and dominance (weak – strong)
Emotion origin: natural
Multimodal: audiovisual
Transcription available: yes

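Unlike the databases annotated with discrete categories, VAM is labelled on continuous dimensions. A common, if coarse, way to relate such dimensional ratings to category-style labels is to split the valence-activation plane into quadrants. The sketch below assumes ratings rescaled to [-1, 1]; the quadrant names are illustrative shorthand, not part of the VAM annotation scheme:

```python
def vad_to_quadrant(valence: float, activation: float) -> str:
    """Map continuous valence/activation ratings (assumed in [-1, 1])
    to one of four coarse affect quadrants.

    This is an illustrative discretization, not the VAM labelling.
    """
    if valence >= 0:
        return "happy/excited" if activation >= 0 else "content/calm"
    return "angry/stressed" if activation >= 0 else "sad/bored"
```

Such a mapping is sometimes used to compare dimensional corpora like VAM against categorical ones, at the cost of discarding the dominance axis and rating intensity.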
AD (Anger Detection)
Institution: University of Ulm, Institute of Communications Engineering
Short description: telephone calls.
Number of speakers: 9
Age groups: n/a
Recorded text: free
Size: 660 recordings
Annotation categories: neutral, angry
Emotion origin: natural
Multimodal: audio

EA-ACT
Institution: Technische Universität München, Institute for Human-Machine Communication (collected by Björn Schuller as part of his dissertation)
Further description: Schuller, B. (2005). Automatische Emotionserkennung aus sprachlicher und manueller Interaktion (Automatic Emotion Recognition from Spoken and Manual Interaction). Available at:
Number of speakers: 39 (m: 34 / f: 5; native language: 28x German, 1x English, 1x French, 1x Mandarin, 3x Serbian, 5x Turkish)
Recorded text: free
Size: 2,280 samples
Annotation categories: anger, happiness, sadness, surprise, neutral
Emotion origin: acted

FAU Aibo (Aibo Emotion Corpus, AEC)
Institution: Universität Erlangen-Nürnberg
Short description: speech recordings of 51 children from two German schools while interacting with the Sony robot Aibo.
Further description: Steidl, S. (2009). Automatic classification of emotion-related user states in spontaneous children's speech. Available at: http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2009/Steidl09-ACO.pdf
Number of speakers: 51 (m: 21 / f: 30)
Age groups: 10 to 13 years
Recorded text: free
Size: 17,074 samples
Annotation categories: neutral, anger, irritability, happiness, surprise, boredom, helplessness, baby talk, admonishing, emphatic, other
Emotion origin: natural
Multimodal: audiovisual
Transcription available: yes

PPMMK-EMO
Institution: University of Passau
Short description: a database of German emotional speech recorded at the University of Passau, covering the four basic classes angry, happy, neutral, and sad; 3,154 samples averaging 2.5 seconds in length, recorded from 36 speakers.
Number of speakers: 36
Age groups: n/a
Size: 3,154 samples
Annotation categories: anger, happiness, sadness, neutral

SIMIS (Speech in Minimal Invasive Surgery)
Institution: Technische Universität München, Institute for Human-Machine Communication
Short description: recordings of surgeons in the operating room.
Further description: Schuller, B., Eyben, F., Can, S., & Feußner, H. (2010). Speech in Minimal Invasive Surgery – Towards an Affective Language Resource of Real-life Medical Operations. Available at:
Number of speakers: 10
Age groups: 24 to 54 years
Recorded text: free
Size: 9,299 samples
Annotation categories: anger, confusion, happiness, impatience, neutral
Emotion origin: natural
Multimodal: audio


eNTERFACE (eNTERFACE'05 Audio-Visual Emotion Database)
Institution: Université catholique de Louvain, Laboratoire de Télécommunications et de Télédétection, and Aristotle University of Thessaloniki, Department of Informatics
Short description: database for testing and evaluating video, audio, or joint audio-visual emotion recognition algorithms.
Further description: O. Martin, I. Kotsia, B. Macq and I. Pitas, "The eNTERFACE'05 Audio-Visual Emotion Database," 22nd International Conference on Data Engineering Workshops (ICDEW'06), 2006, pp. 8-8, doi: 10.1109/ICDEW.2006.145. Available at:
Number of speakers: 42 (m: 34 / f: 8)
Age groups: n/a
Recorded text: fixed
Size: 1,277 samples
Annotation categories: anger, disgust, happiness, sadness, surprise
Emotion origin: induced
Multimodal: audiovisual

SUSAS (Speech Under Simulated and Actual Stress)
Institution: University of Colorado Boulder, Robust Speech Processing Laboratory
Further description: Hansen, J., & Bou-Ghazale, S. E. (1997). Getting started with SUSAS: a speech under simulated and actual stress database. EUROSPEECH. Available at:
Number of speakers: 32 (m: 19 / f: 13)
Age groups: 22 to 76 years
Recorded text: free and fixed
Size: 3,593 samples
Annotation categories: high stress, medium stress, screaming, fear, neutral
Emotion origin: natural
Multimodal: audio

SAL (Sensitive Artificial Listener)
Institution: Queen's University Belfast, Tel Aviv University, University of Twente
Further description: Douglas-Cowie, Ellen & Cowie, Roddy & Cox, Cate & Amir, Noam & Heylen, Dirk. (2008). The Sensitive Artificial Listener: an induction technique for generating emotionally coloured conversation. Available at:
Number of speakers: 4 (m: 2 / f: 2)
Age groups: n/a
Recorded text: free
Size: n/a
Emotion origin: natural

AVIC (Audiovisual Interest Corpus)
Institution: Technische Universität München and Toyota Motor Corporation
Further description: Schuller, Björn & Müller, Ronald & Hörnler, Benedikt & Höthker, Anja & Konosu, Hitoshi & Rigoll, Gerhard. (2007). Audiovisual recognition of spontaneous interest within conversations. Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI'07, pp. 30-37. doi: 10.1145/1322192.1322201. Available at: https://www.researchgate.net/publication/221052336_
Number of speakers: 21 (m: 11 / f: 10)
Age groups: 30 to >40 years (∅ 29 years)
Recorded text: spontaneous
Size: 3,901 samples
Emotion origin: natural
Multimodal: audiovisual

EU-EV (EU-Emotion Voice Database)
Institution: University of Amsterdam and others
Short description: the EU-Emotion voice stimuli consist of 2,159 audio recordings of 54 actors, each uttering sentences with the intention of conveying 20 different emotional states (plus neutral). The database is organized in three separate emotional voice stimulus sets in three different languages (British English, Swedish, and Hebrew).
Further description: Lassalle, Amandine & Pigat, Delia & O'Reilly, Helen & Berggren, Steve & Fridenson-Hayo, Shimrit & Tal, Shahar & Elfström, Sigrid & Råde, Anna & Golan, Ofer & Bölte, Sven & Baron-Cohen, Simon & Lundqvist, Daniel. (2018). The EU-Emotion Voice Database. Behavior Research Methods, 51. doi: 10.3758/s13428-018-1048-1. Available at:
Number of speakers: 54 (18 per language)
Age groups: 10 to 72 years
Recorded text: fixed
Size: 2,159 samples (695 in British English, 1,011 in Swedish, and 453 in Hebrew)
Annotation categories: 20 emotional states plus neutral (afraid, angry, ashamed, bored, disappointed, disgusted, excited, frustrated, happy, hurt, interested, jealous, joking, kind, proud, sad, sneaky, surprised, unfriendly, worried)

EmoFilm
Institution: University of Augsburg, University of Rome, Imperial College London
Short description: emotional speech from films; a multilingual database suitable for studying culture and measurement strategies when evaluating the perception of emotion in speech.
Further description: Parada-Cabaleiro, E., Costantini, G., Batliner, A., Baird, A., & Schuller, B. (2018). Categorical vs Dimensional Perception of Italian Emotional Speech. INTERSPEECH. Available at:
Number of speakers: 207 (including Italian and Spanish)
Age groups: n/a
Recorded text: fixed
Size: 1,115 samples
Annotation categories: anger, sadness, happiness, fear
Emotion origin: acted
Multimodal: audiovisual

IEMOCAP (Interactive Emotional Dyadic Motion Capture)
Institution: University of Southern California, Signal Analysis and Interpretation Laboratory
Short description: dyadic sessions in which actors perform improvisations or scripted scenarios specifically selected to elicit emotional expression.
Further description: C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, December 2008. Available at:
Number of speakers: 10 (m: 5 / f: 5)
Age groups: n/a
Recorded text: scripted and spontaneous sessions
Size: 5,531 samples
Annotation categories: anger, happiness, sadness, excitement, frustration, fear, surprise, neutral, and others, plus dimensional ratings (valence, activation, dominance)
Emotion origin: acted
Multimodal: audiovisual (with motion capture)
Transcription available: yes

MELD (Multimodal EmotionLines Dataset)
Institution: University of Michigan, Nanyang Technological University, Instituto Politécnico Nacional, Singapore University of Technology and Design, National University of Singapore
Short description: about 13,000 utterances from 1,433 dialogues of the TV series Friends.
Further description: Poria, Soujanya & Hazarika, Devamanyu & Majumder, Navonil & Naik, Gautam & Cambria, Erik & Mihalcea, Rada. (2018). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. Available at: https://arxiv.org/pdf/1810.02508.pdf
Number of speakers: 6+
Age groups: n/a
Recorded text: fixed
Size: 13,707 samples
Annotation categories: anger, disgust, sadness, happiness, neutral, surprise, fear
Emotion origin: acted
Multimodal: audiovisual
Transcription available: yes

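MELD's transcriptions and labels are distributed as CSV metadata files alongside the clips. Assuming the column naming of the public release (an "Emotion" column per utterance row; adjust the parameter if your copy differs), the label distribution can be tallied with the standard library alone:

```python
import csv
from collections import Counter

def emotion_distribution(csv_path: str, label_column: str = "Emotion") -> Counter:
    """Count utterances per emotion label in a MELD-style CSV.

    The default column name 'Emotion' is an assumption based on the
    public MELD release, not something fixed by this overview.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[label_column] for row in csv.DictReader(f))
```

Running this over the train/dev/test splits is a quick way to check the pronounced class imbalance (neutral dominates) before training a classifier on the corpus.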
HUMAINE (Human-Machine Interaction Network on Emotions)
Institution: Queen's University Belfast, LIMSI-CNRS, Universität Erlangen-Nürnberg, Tel Aviv University, National Technical University of Athens, and several other partners
Short description: HUMAINE aims to lay the foundations for European development of systems that can register, model and/or influence human emotional and emotion-related states and processes: "emotion-oriented systems". It contains 48 clips (defined as naturalistic, induced, or acted data), selected from the following corpora:
- Belfast Naturalistic database (in English, naturalistic, 10 clips)
- Castaway Reality Television dataset (in English, naturalistic, 10 clips)
- Sensitive Artificial Listener (in English, induced, 12 clips)
- Sensitive Artificial Listener (in Hebrew, induced, 1 clip)
- Activity/Spaghetti dataset (in English, induced, 7 clips)
- Green Persuasive dataset (in English, induced, 4 clips)
- EmoTABOO (in French, induced, 2 clips)
- DRIVAWORK corpus (in German, induced, 1 clip)
- GEMEP corpus (in French, acted, 1 clip)
Further description: Douglas-Cowie, Ellen & Cox, Cate & Martin, Jean-Claude & Devillers, Laurence & Cowie, Roddy & Sneddon, Ian & McRorie, Margaret & Pelachaud, Catherine & Peters, Christopher & Lowry, Orla & Batliner, Anton & Hoenig, Florian. (2011). The HUMAINE database. doi: 10.1007/978-3-642-15184-2_14. Available at:
Number of speakers: n/a
Age groups: n/a
Recorded text: n/a
Size: 48 samples
Annotation categories: annotated with >20 labels
Emotion origin: natural, induced, and acted
Multimodal: audiovisual

CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
Institution: University of Pennsylvania
Short description: an audio-visual data set uniquely suited for the study of multimodal emotion expression and perception.
Further description: Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014). CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing, 5(4), 377–390. https://doi.org/10.1109/TAFFC.2014.2336244
Number of speakers: 91
Age groups: n/a
Recorded text: fixed
Size: 7,442 samples
Annotation categories: happy, sad, anger, fear, disgust, and neutral (surprise was not considered by the acting directors to be sufficiently specific, as it could relate to any of the other emotions with rapid onset)
Emotion origin: acted
Multimodal: audiovisual

MOCHA-TIMIT
Institution: University of Edinburgh, Centre for Speech Technology Research
Number of speakers: 2 (m: 1 / f: 1)
Age groups: n/a
Recorded text: fixed
Size: 460 samples

TORGO
Institution: University of Toronto (Department of Computer Science; Oral Dynamics Laboratory, Department of Speech-Language Pathology), The Speech and Stuttering Institute, and Holland Bloorview Kids Rehabilitation Hospital, Toronto
Short description: one of the most popular dysarthric speech corpora. It consists of aligned acoustic and articulatory recordings from 15 speakers; seven are control speakers without any speech disorders, while the remaining eight present different levels of dysarthria.
Further description: Rudzicz, F., Namasivayam, A. K., & Wolff, T. (2012). The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources & Evaluation, 46, 523–541. Available at:
Number of speakers: 15
Age groups: n/a
Recorded text: fixed
Annotation categories: reflex, respiration, lips, jaw, velum, laryngeal, tongue, intelligibility
Emotion origin: n/a

The Nemours Database of Dysarthric Speech
Institution: Applied Science & Engineering Laboratories (ASEL), A.I. duPont Institute, USA
Short description: a collection of 814 short nonsense sentences; 74 sentences spoken by each of 11 male speakers with varying degrees of dysarthria.
Further description: Menéndez-Pidal, Xavier / Polikoff, James B. / Peters, Shirley M. / Leonzio, Jennie E. / Bunnell, H. T. (1996): "The Nemours database of dysarthric speech," in ICSLP-1996, 1962-1965. Available at:
Number of speakers: 11 (m: 11)
Age groups: n/a
Recorded text: fixed
Size: 814 samples
Annotation categories: 39 segment labels derived from the ARPAbet symbol set
Emotion origin: n/a
Transcription available: broad phonemic transcription


GEMEP (Geneva Multimodal Emotion Portrayal)
Institution: Université de Genève
Further description: Bänziger, T., & Scherer, K. R. (2010). Introducing the Geneva Multimodal Emotion Portrayal (GEMEP) corpus. In K. R. Scherer, T. Bänziger, & E. B. Roesch (Eds.), Blueprint for affective computing: A sourcebook (pp. 271-294). Oxford, England: Oxford University Press. Available at:
Number of speakers: 10 (m: 5 / f: 5)
Age groups: n/a
Recorded text: fixed
Size: 1,260 samples
Annotation categories: admiration, amusement, tenderness, anger, disgust, pride, shame, worry, interest, irritation, elation (joy), contempt, anxiety (worry), pleasure, relief, surprise, sadness
Emotion origin: acted
Multimodal: audiovisual
Transcription available: yes


EmoFilm
Institution: University of Augsburg
Further description: Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Alice Baird, and Björn Schuller (2018), Categorical vs Dimensional Perception of Italian Emotional Speech, in Proc. of Interspeech, Hyderabad, India, pp. 3638-3642.
Number of speakers: 207 (including English and Italian)
Size: 1,115 recordings
Annotation categories: anger, contempt, happiness, fear, and sadness

SES (Spanish Emotional Speech database)
Further description: Montero, Juan & Gutierrez-Arriola, Juana M. & Colás, José & Macias-Guarasa, Javier & Enríquez, Emilia & Pardo, Juan. (1999). Development of an emotional speech synthesiser in Spanish. Available at:
Number of speakers: 1 (m: 1)
Recorded text: fixed
Size: 30 words, 15 short sentences, and 3 paragraphs
Annotation categories: anger, happiness, sadness, surprise, neutral
Emotion origin: acted

(Name not given)
Further description: Sanz, Ignasi & Guaus, Roger & Rodríguez, Angel & Lázaro Pernias, Patrícia & Vilar, Norminanda & Pont, Josep Maria & Bernadas, Dolors & Oliver, Josep & Tena, Daniel & Longhi, Ludovico. (2001). Validation of an Acoustical Modelling of Emotional Expression in Spanish Using Speech Synthesis Techniques. Available at:
Size: eight actors (four female, four male), three intensities, 336 utterances

(Name not given)
Institution: Technical University of Madrid
Short description: 300 utterances with four different sentences as a synthetic data set (actors), plus 80 utterances as a real data set (DVD movies); 15 non-professional speakers (female and male) in the synthetic data set. Available at:
Annotation categories: neutral, happiness, sadness, anger, and fear


CASIA
Number of speakers: 4
Size: 1,200 samples

CVE (Chinese Vocal Emotions)
Number of speakers: 4
Size: 874 samples

MES (Mandarin Emotional Speech)
Number of speakers: 6
Size: 360 samples


BUEMODB (Bogazici University Emotion Database)
Institution: Bogazici University
Short description: acted sentences used to measure F0.
Number of speakers: 11 (f: 7 / m: 4)
Recorded text: fixed
Size: 484 samples
Annotation categories: anger, joy, neutrality, and sadness
Emotion origin: acted

TurES (Turkish Emotional Speech database)
Short description: utterances from 55 Turkish films.
Number of speakers: 582 (f: 188 / m: 394)
Recorded text: fixed
Size: 5,304 samples
Annotation categories: categorical (happy, surprised, sad, angry, fear, neutral, and other) and 3-dimensional emotional space (valence, activation, and dominance)
Emotion origin: acted

EmoSTAR
Short description: utterances from film and TV.
Size: >300 samples

Voice Corpus
Number of speakers: 50 (f: 25 / m: 25)
Size: 3,740 samples
Annotation categories: afraid, angry, happy, sad, neutral

TurEV-DB (Turkish Emotion-Voice Database)
Institution: Cognitive Science Department, Middle East Technical University (METU)
Short description: amateur actors.
Number of speakers: 6 (f: 3 / m: 3)
Annotation categories: angry, calm, happy, sad
Emotion origin: acted


DES (Danish Emotional Speech)
Number of speakers: 4 (m: 2 / f: 2)
Recorded text: fixed
Emotion origin: induced