Emotion Detection from Speech to Enrich Multimedia Content




Feng Yu, Eric Chang, Ying-Qing Xu, Heung-Yeung Shum
Microsoft Research China

Abstract

This paper describes an experimental study on the detection of emotion from speech. As computer-based characters such as avatars and virtual chat faces become more common, the use of emotion to drive the expression of virtual characters becomes more important. The study uses a corpus of emotional speech containing 721 short utterances expressing four emotions: anger, happiness, sadness, and the neutral (unemotional) state, captured manually from movies and teleplays. We introduce a new concept for evaluating emotion in speech: emotions are so complex that most spoken utterances cannot be assigned precisely to a single emotion category, yet most emotional states can nevertheless be described as a mixture of several emotions. Based on this concept we have trained SVMs (support vector machines) to recognize utterances within these four categories and developed an agent that can recognize and express emotions.

1. Introduction

Nowadays, with the proliferation of the Internet and multimedia, many kinds of multimedia equipment are available, and even ordinary users can record or download video and audio data on their own. Can we determine the content of this multimedia data quickly with the computer's help? The ability to detect expressed emotions and to produce a facial expression for each given utterance would help improve the naturalness of the computer-human interface. Emotion is certainly an important factor in communication, and people express emotions not only verbally but also by non-verbal means. Non-verbal means comprise body gestures, facial expressions, modifications of prosodic parameters, and changes in the spectral energy distribution [12]. People can often judge a speaker's emotion from the voice alone, since the intonation of a person's speech can reveal emotion. At the same time, facial expressions also vary with emotion, so there is a great deal of mutual information between vocal and facial expressions. Our own research concentrates on forming a correspondence between emotional speech and expressions in a facial image sequence. We already have a controllable cartoon facial model that can generate various facial images from different emotional state inputs [14]. Such a system could be especially important in situations where speech is the primary mode of interaction with the machine. How can facial animation be produced using audio to drive a facial control model? Speech-driven facial animation is an effective technique for user interfaces and has been an active research topic for the past twenty years, and various audio-visual mapping models have been proposed for facial animation [1-3].

However, these methods only synchronize facial motions with speech and rarely animate facial expressions automatically. In addition, the complexity of the audio-visual mapping makes the synthesis process language-dependent and less effective. In the computer speech community, much attention has been given to "what was said" and "who said it", and to the associated tasks of speech recognition and speaker identification, whereas "how it was said" has received relatively little. Most importantly for our application, we need an effective tool with which we can easily tell "how it was said" for each utterance. Previous research on emotion, both in psychology and in speech, tells us that information associated with emotion can be found in a combination of prosodic, tonal and spectral cues; speaking rate and stress distribution also provide some clues about emotion [6, 7, 10, 12]. Prosodic features are multifunctional: they not only express emotion but also serve a variety of other functions, such as word and sentence stress or syntactic segmentation. The role of prosodic information in the communication of emotion has been studied extensively in psychology and psycholinguistics. Moreover, fundamental frequency and intensity in particular vary considerably across speakers and have to be normalized properly [12]. What kinds of features might carry more information about the emotional meaning of each utterance? Because of the diversity of languages and the different roles and significance of features in different languages, the features cannot be treated equally [13]. It is hard to determine which features carry more information, and how to combine them to obtain a better recognition rate. Research in the automatic detection of expressed emotion is quite limited. Recent work mostly focuses on classification, in other words, on ascertaining the emotion of each utterance. This, however, is insufficient for our applications. To describe the degree, combination and variety of emotions in speech more realistically and naturally, we present a novel criterion by which the emotion information contained in an utterance can be evaluated. We assume that there is an emotion space corresponding to our existing facial control model. In [11] Pereira reported research on the dimensions of emotional meaning in speech, but our emotion space is quite different from his. The facial control model contains sets of emotional facial templates of different degrees drawn by an artist. Within this emotion space, the special category "neutral" lies at the origin; the other categories are associated with the axis directions of the space. With this assumption we can link our cartoon facial control model to emotions. We also would like to determine the location in this emotion space of a given emotional utterance, unlike other methods that simply give a classification

result. This part of the investigation is confined to information within the utterance. Various classification algorithms have been used in recent studies of emotion in speech, such as Nearest Neighbor, NN (Neural Network), MLB (Maximum-Likelihood Bayes), KR (Kernel Regression), GMM (Gaussian Mixture Model), and HMM (Hidden Markov Model) [5, 6, 9, 12]. For our implementation we chose the SVM as our classification algorithm. In our investigation, we have captured a corpus containing emotional speech from movies and teleplays, with over 2000 utterances from several different speakers. Since we model only four kinds of basic emotions, "neutral", "anger", "happiness" and "sadness", we obtain good recognition accuracy. A total of 721 of the most characteristic short utterances in these four emotional categories were selected from the corpus.

2. Experimental Study

Because in practice only the emotions "neutral", "anger", "happiness" and "sadness" lead to good recognition accuracy, we deal with just these four representative categories in our application, even though this small set of emotions does not provide enough range to describe all types of emotions. Furthermore, some utterances can hardly be evaluated as one particular emotion. We can still find utterances that can be classified solely as one kind of emotion, which we call pure emotional utterances. We construct an emotion space in which the special category "neutral" is at the origin and the other categories are the axes; all pure emotions correspond to points lying directly on an axis (or, if we relax the restrictions, near an axis), and the distance from such a point to the origin denotes the degree of the emotional utterance. When the coordinates of a point have more than one nonzero value, the utterance contains more than one kind of emotion and cannot be ascribed to any single emotion category. We further consider utterances whose emotion type is undoubtedly "neutral" as corresponding to the region closely surrounding the origin of the emotion space. For each of the other three categories, take "anger" for example, utterances that are undoubtedly "anger" have a strong correspondence with the anger axis. Since people cannot express emotions to an infinite degree, we assume that each axis has an upper limit based on extraordinarily emotional utterances from the real world. We therefore choose extraordinary utterances for each emotion as our training data. Since people cannot measure the degree of emotions precisely, we simply choose utterances that are judged to portray a given emotion by almost 100% of the subjects in order to find the origin and the upper limits of the three axes. Our approach is considerably different from those of other researchers. Other methods can only perform classification to tell which emotional category an utterance belongs to. Our method can handle more complicated problems, such as utterances that contain multiple emotions and the degree of each emotion.
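To make the emotion-space representation concrete, the following minimal sketch encodes an utterance as a point whose coordinates lie on the anger, happiness and sadness axes, with "neutral" at the origin. The class name, the near-zero threshold, and the mixture percentages are illustrative assumptions, not part of the original system.

```python
# A minimal sketch of the emotion space described above: "neutral" at the origin,
# one axis per emotion, distance from the origin as the degree of the emotion.
from dataclasses import dataclass
import math

AXES = ("anger", "happiness", "sadness")  # "neutral" is the origin

@dataclass
class EmotionPoint:
    anger: float = 0.0
    happiness: float = 0.0
    sadness: float = 0.0

    def degree(self) -> float:
        # Distance from the origin denotes the strength of the emotion.
        return math.sqrt(self.anger**2 + self.happiness**2 + self.sadness**2)

    def describe(self, eps: float = 1e-3) -> str:
        # Coordinates close to zero on every axis fall in the neutral region.
        values = (self.anger, self.happiness, self.sadness)
        active = [(name, v) for name, v in zip(AXES, values) if v > eps]
        if not active:
            return "neutral"
        if len(active) == 1:
            return f"pure {active[0][0]} (degree {active[0][1]:.2f})"
        total = sum(v for _, v in active)
        mix = ", ".join(f"{name} {v / total:.0%}" for name, v in active)
        return f"mixture: {mix}"

print(EmotionPoint(anger=0.8, sadness=0.3).describe())  # mixture: anger 73%, sadness 27%
```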

2.1. Corpus of Emotional Data

We need a large amount of training data to estimate statistical models accurately, so speech segments from Chinese teleplays were chosen as our corpus. By using teleplays (one film is not long enough to satisfy our requirement), we were able to collect a large supply of emotional speech samples in a short amount of time. Previous experiments also indicate that the emotions in acted speech can be consistently decoded by humans and by automatic systems [6], which provided further motivation for their use. The teleplay files were taken from Video CDs, with audio data extracted at a sampling rate of 16 kHz and a resolution of 16 bits per sample. We employed three students to capture and segment these speech data files. A total of more than 2000 utterances were captured, segmented and pre-tagged from the teleplays. The chosen utterances are all preceded and followed by silence, with no background music or any other kind of background noise, and the expressed emotion within an utterance has to be constant. All of these utterances needed to be subjectively tagged as one of the four classes, since only pure emotional utterances are usable for accurately forming statistical models. One of these students and a researcher tagged the utterances. They listened to and tagged all of the utterances several times; each time, if the tag of an utterance differed from its previous designation, the utterance was removed from our corpus. The initial tags were those that the three students had pre-tagged for all of the more than 2000 utterances. Each tagging session was separated by several days. After several rounds of tagging, only 721 utterances remained. The number of waveforms belonging to each emotion category is shown in Table 1. All data files are 16 kHz, 16-bit waveforms.

Table 1: Data sets

Anger   Happiness   Neutral   Sadness
215     136         242       128
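The repeated-tagging procedure above amounts to keeping only utterances whose label never changes across sessions. A minimal sketch of that filtering step is shown below, assuming the tags from each session are stored as one dictionary per session; the data layout and function name are our own illustrative choices.

```python
# Hypothetical sketch of the consistency filter described in Section 2.1:
# an utterance survives only if every tagging session assigned it the same label.
from typing import Dict, List

def keep_consistent(tags_per_session: List[Dict[str, str]]) -> Dict[str, str]:
    """tags_per_session: one {utterance_id: label} mapping per tagging session."""
    first = tags_per_session[0]
    kept = {}
    for utt_id, label in first.items():
        if all(session.get(utt_id) == label for session in tags_per_session[1:]):
            kept[utt_id] = label
    return kept

sessions = [
    {"u1": "anger", "u2": "happiness", "u3": "neutral"},
    {"u1": "anger", "u2": "sadness",   "u3": "neutral"},
]
print(keep_consistent(sessions))  # {'u1': 'anger', 'u3': 'neutral'}
```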

2.2. Feature Extraction

Previous research has shown some statistics of the pitch (fundamental frequency F0) to be the main vocal cue for emotion recognition. The first and second formants, vocal energy, frequency spectral features and speaking rate also contribute to vocal emotion signaling [6]. In our study, the evaluation features of the voice are mainly extracted from pitch, and the features derived from pitch are sufficient for most of our needs. Our approach to choosing and extracting features follows the method of [6]. First, we obtained the pitch sequence using an internally developed pitch extractor [4]. Then we smoothed the pitch contour using smoothing cubic splines. The resulting approximation of the pitch is smooth and continuous, and it enables us to measure features of the pitch: the pitch derivative, pitch slopes, and the behavior of their minima and maxima over time.
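The sketch below illustrates this smoothing step and a few of the pitch statistics listed in the next subsection. It assumes the raw F0 contour is already available as a frame-level array with zeros marking unvoiced frames (the pitch extractor itself [4] is internal and not reproduced here), and the smoothing parameter is an illustrative choice rather than the one used in the paper.

```python
# Sketch: smooth an F0 contour with a cubic smoothing spline and compute a few
# of the pitch statistics from Section 2.2. Assumes frame-level F0 values with
# 0 marking unvoiced frames; the smoothing factor s is an illustrative choice.
import numpy as np
from scipy.interpolate import UnivariateSpline

def pitch_features(f0: np.ndarray, frame_shift_s: float = 0.01) -> dict:
    voiced = np.flatnonzero(f0 > 0)
    if voiced.size < 4:
        return {}
    # Restrict to the valid region between the first and last nonzero pitch point.
    start, end = voiced[0], voiced[-1]
    t = voiced * frame_shift_s
    spline = UnivariateSpline(t, f0[voiced], k=3, s=len(voiced))  # smoothing cubic spline
    grid = np.arange(start, end + 1) * frame_shift_s
    smooth = spline(grid)
    deriv = spline.derivative()(grid)
    return {
        "f0_min": float(smooth.min()),
        "f0_max": float(smooth.max()),
        "f0_median": float(np.median(smooth)),
        "f0_std": float(smooth.std()),
        "df0_min": float(deriv.min()),
        "df0_max": float(deriv.max()),
        "df0_median": float(np.median(deriv)),
        "df0_std": float(deriv.std()),
    }
```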

We have measured a total of sixteen features, grouped under the headings below:

• Statistics related to rhythm: Speaking rate, Average length between voiced regions, Number of maxima / Number of (minima + maxima), Number of upslopes / Number of slopes;
• Statistics on the smoothed pitch signal: Min, Max, Median, Standard deviation;
• Statistics on the derivative of the smoothed pitch: Min, Max, Median, Standard deviation;
• Statistics over the individual voiced parts: Mean min, Mean max;
• Statistics over the individual slopes: Mean positive derivative, Mean negative derivative.

All these features are calculated only within the valid region, which begins at the first nonzero pitch point and ends at the last nonzero pitch point of each utterance. The features in the first group are related to rhythm, which is represented by the shape of the pitch contour: we take the inverse of the average length of the voiced parts of the utterance as the speaking rate, and the average length between voiced regions as a measure of the pauses in an utterance. The features in the second and third groups are general statistics of the pitch signal and its derivative. In each individual voiced part we can easily find minima and maxima; we choose the mean of the minima and the mean of the maxima as our fourth group of features. Finally, we compute the derivative of each individual slope: if the slope is an upslope the derivative is positive, otherwise it is negative. The mean of the positive derivatives and the mean of the negative derivatives are the features in the fifth group.

2.3. Performance of the Emotion Evaluator

The classification algorithms used in previous research in this area are mostly based on k-nearest neighbors (KNN) or neural networks (NN). For our application, we need not only classification results but also the proportion of each emotion an utterance contains. After some experimentation, we chose the support vector machine (SVM) as our evaluation algorithm [8], because of its speed and because each SVM can give an evaluation for each emotion category. From the training data, we can find the origin and the three axes. Because different features are extracted from the audio data in different ways and the relationships between these features are complex, we choose a Gaussian kernel

K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²)

as our SVM kernel function. For each emotional state, a model is learned to separate its type of utterance from the others; we refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The scheme we adopted learns a different SVM model for each category that distinguishes that kind of emotion from the others. Our preliminary experimental results indicate that we can obtain satisfying results only when there are at least 200 different utterances in each emotion category. Since each SVM deals with just a two-class problem and the performance of an SVM classifier depends on just these two classes, the boundary will tend to favor the class that contains more data.

To avoid this kind of skewing, we balance the training data set of each SVM. Taking "anger" as an example, we choose about 150 utterances in the "anger" state and also choose about 150 utterances from the other emotion categories, with approximately the same number chosen from each of the other categories. In this way, the results are much better than those learned from imbalanced training data sets. Note that the training data can be replicated to balance the data sets. The training data and performance of each SVM are shown in Table 2. The remainder of the data set not used during learning for each individual SVM is used as its testing data.

Table 2: SVM training data sets and performance

SVM category   One   Rest   Accuracy on test set
Anger          162   147    77.16%
Happiness      102    94    65.64%
Neutral        194   193    83.73%
Sadness         96    96    70.59%
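A minimal sketch of this one-versus-rest training scheme with balanced classes is given below, using scikit-learn's SVC with an RBF kernel as a stand-in for the SVM implementation referenced in [8]; the feature matrix, label names, per-class sample count and kernel width are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of 1-v-r SVM training with a Gaussian (RBF) kernel on balanced data.
# scikit-learn's SVC stands in for the SVM package used in the paper; feature
# extraction is assumed to have produced one 16-dimensional vector per utterance.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "happiness", "neutral", "sadness"]

def train_one_vs_rest(X: np.ndarray, labels: np.ndarray, per_class: int = 150,
                      sigma: float = 1.0, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    models = {}
    for emo in EMOTIONS:
        pos = np.flatnonzero(labels == emo)
        pos = rng.choice(pos, size=min(per_class, pos.size), replace=False)
        # Draw roughly the same number of "rest" examples, spread evenly over the other classes.
        neg_parts = []
        for other in EMOTIONS:
            if other == emo:
                continue
            idx = np.flatnonzero(labels == other)
            take = min(per_class // (len(EMOTIONS) - 1), idx.size)
            neg_parts.append(rng.choice(idx, size=take, replace=False))
        neg = np.concatenate(neg_parts)
        X_train = np.vstack([X[pos], X[neg]])
        y_train = np.concatenate([np.ones(pos.size), -np.ones(neg.size)])
        # gamma = 1 / (2 sigma^2) matches the Gaussian kernel given above.
        models[emo] = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2)).fit(X_train, y_train)
    return models
```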

We obtain the given emotional utterance's feature vector, compute each SVM's evaluation of it, and collect the evaluations; together they form the emotional evaluation of the utterance. If only one evaluation is greater than 0 (f_i(x) > 0 for one i, 0 ≤ i ≤ 3, and f_j(x) < 0 for all j ≠ i), we label the utterance with that particular kind of emotion. If more than one evaluation is greater than 0 (f_i(x) > 0 and f_j(x) > 0 for some 0 ≤ i, j ≤ 3, i ≠ j), we label the emotion of the utterance as a mixture of several kinds of emotion, each in proportion to its SVM evaluation. If all evaluations are less than 0 (f_i(x) < 0 for i = 0..3), the emotion of the utterance is not defined in our system.

2.4. Comparison

We have also compared the effectiveness of the SVM classifier with the K-nearest-neighbor classifier and the neural network classifier. One can observe that the SVM classifier compares favorably to the other two types of classifiers.

Table 3: Comparison of NN, KNN and SVM

Method   Accuracy (%)
         A       H       N       S
NN       40.00   27.78   62.68   35.71
KNN      42.86   39.28   89.29   32.14
SVM      77.16   65.64   83.73   70.59

Note: For each category there are 100 learning utterances, and all remaining utterances are used for testing.
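The sketch below implements this decision rule, reusing the hypothetical models dictionary from the previous example. Treating SVC's decision_function output as the evaluation f_i(x) and normalizing the positive scores into proportions is our reading of the rule above, not the authors' exact code.

```python
# Sketch of the evaluation rule in Section 2.3: each SVM's signed decision value
# plays the role of f_i(x); positive values are turned into mixture proportions.
import numpy as np

def evaluate_utterance(models: dict, x: np.ndarray) -> dict:
    # One signed score per emotion category.
    scores = {emo: float(m.decision_function(x.reshape(1, -1))[0]) for emo, m in models.items()}
    positive = {emo: s for emo, s in scores.items() if s > 0}
    if not positive:
        return {"label": "undefined", "scores": scores}
    if len(positive) == 1:
        (emo, s), = positive.items()
        return {"label": emo, "degree": s, "scores": scores}
    total = sum(positive.values())
    mixture = {emo: s / total for emo, s in positive.items()}
    return {"label": "mixture", "proportions": mixture, "scores": scores}
```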

3. Conclusions and Discussion

Compared with KNN, training an SVM model gives a good classifier without needing much training time. Even if we do not know the exact relationships between the features, we can still obtain good results. Once the SVM model has been trained from the training data sets, the training data are no longer needed, since the SVM model contains all the useful information. Classification therefore does not take much time and can almost be run within real-time rendering. The KNN rule relies on a distance metric to perform classification, so changing this metric can be expected to yield different and possibly better results. Intuitively, one should weight each feature according to how well it correlates with the correct classification, but in our investigation the features are not independent of each other; the performance landscape in this metric space is quite rugged and optimization is likely to be expensive. The SVM handles this problem well: we do not need to know the relationships within each feature pair or the dimensionality of each feature. Compared with NNs, training an SVM model requires much less time than training an NN classifier, and the SVM is much more robust. In our application, the corpus comes from movies and teleplays, with many speakers from various backgrounds, and in this kind of situation NNs do not work well. The most important reason we chose SVMs is that SVMs give a magnitude for the recognition result, and we need this magnitude for synthesizing expressions of different degrees. For our future work, we plan to study the effectiveness of our current approach on data from different languages and cultures.

4. References

[1] Brand, M., "Voice Puppetry", Proceedings of SIGGRAPH, 21-28, 1999.
[2] Cassell, J., Bickmore, T., Campbell, L., Chang, K., Vilhjálmsson, H., and Yan, H., "Requirements for an architecture for embodied conversational characters", Proceedings of Computer Animation and Simulation, 109-120, 1999.
[3] Cassell, J., Pelachaud, C., Badler, N.I., Steedman, M., Achorn, B., Beckett, T., Douville, B., Prevost, S., and Stone, M., "Animated conversation: rule-based generation of facial display, gesture and spoken intonation for multiple conversational agents", Proceedings of SIGGRAPH, 28(4): 413-420, 1994.
[4] Chang, E., Zhou, J.-L., Di, S., Huang, C., and Lee, K.-F., "Large vocabulary Mandarin speech recognition with different approaches in modeling tones", Proceedings of the ICSLP, 2000.
[5] Roy, D., and Pentland, A., "Automatic spoken affect analysis and classification", Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 363-367, 1996.
[6] Dellaert, F., Polzin, T., and Waibel, A., "Recognizing Emotion in Speech", Proceedings of the ICSLP, 1996.
[7] Erickson, D., Abramson, A., Maekawa, K., and Kaburagi, T., "Articulatory Characteristics of Emotional Utterances in Spoken English", Proceedings of the ICSLP, 2000.
[8] Joachims, T., "Making Large-Scale SVM Training Practical", in Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.
[9] Kang, B.-S., Han, C.-H., Lee, S.-T., Youn, D.-H., and Lee, C.-Y., "Speaker Dependent Emotion Recognition using Speech Signals", Proceedings of the ICSLP, 2000.
[10] Paeschke, A., and Sendlmeier, W. F., "Prosodic Characteristics of Emotional Speech: Measurements of Fundamental Frequency Movements", Proceedings of the ISCA Workshop on Speech and Emotion, 2000.
[11] Pereira, C., "Dimensions of Emotional Meaning in Speech", Proceedings of the ISCA Workshop on Speech and Emotion, 2000.
[12] Polzin, T., and Waibel, A., "Emotion-Sensitive Human-Computer Interfaces", Proceedings of the ISCA Workshop on Speech and Emotion, 2000.
[13] Scherer, K.R., "A Cross-Cultural Investigation of Emotion Inferences from Voice and Speech: Implications for Speech", Proceedings of the ICSLP, 2000.
[14] Yu, F., Li, Y., Chang, E., Xu, Y.-Q., and Shum, H.-Y., "Speech-Driven Cartoon Animation with Emotions", submitted to ACM Multimedia 2001.


