Facial Action Units Recognition: A Comparative Study


"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

IEEE TRANSACTIONS ON MULTIMEDIA, SPECIAL ISSUE ON MULTIMODAL AFFECTIVE INTERACTION


Facial Action Units Recognition – A Comparative Study

M. C. Popa*, L. J. M. Rothkrantz, P. Wiggers, R. Braspenning, and C. Shan

Abstract — Many approaches to facial expression recognition focus on assessing the six basic emotions (anger, disgust, happiness, fear, sadness, and surprise). Real-life situations, however, produce many more subtle facial expressions. A reliable way of analyzing facial behavior is the Facial Action Coding System (FACS) developed by Ekman and Friesen, which decomposes the face into 46 action units (AUs) and is usually applied by a human observer. Each AU is related to the contraction of one or more specific facial muscles. In this study we present an approach towards automatic AU recognition, enabling recognition of an extensive palette of facial expressions. As distinctive features we used motion flow estimators between every two consecutive frames, calculated in special regions of interest (ROI). Even though a lot has been published on the facial expression recognition theme, it is still difficult to draw a conclusion regarding the best methodology, as there is no common basis for comparison. Therefore our main contributions reside in the comparison of different ROI selections proposed by us, different optical flow estimation methods, and also of two spatial-temporal classification methods: Hidden Markov Models (HMMs) and Dynamic Bayesian Networks (DBNs). The classifiers have been trained and tested on the Cohn-Kanade database. The experiments showed that under the same conditions regarding initialization, labeling, and sampling, both methods produced similar results, achieving the same recognition rate of 89% for the classification of facial AUs. Still, by enabling non-fixed sampling and using HTK, HMMs rendered a better performance of 93%, suggesting that they are better suited for the special task of AU recognition.

Index Terms — Facial Action Units recognition, Facial Action Coding System, Hidden Markov Models, Dynamic Bayesian Networks.

Manuscript received December 10, 2009. This work was supported by the Netherlands Organization for Scientific Research (NWO) under Grant 018.003.017. L. J. M. Rothkrantz, P. Wiggers, and M. C. Popa are members of the Man-Machine Interaction Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, the Netherlands. E-mail: {L.J.M.Rothkrantz, M.C.Popa, P.Wiggers}@tudelft.nl. R. Braspenning and C. Shan are with Philips Research, High Tech Campus 36, 5656 AE Eindhoven, the Netherlands. E-mail: {Ralph.Braspenning, Caifeng.Shan}@philips.com.

I. INTRODUCTION

It has been suggested that as much as 50% of what people communicate during face-to-face communication is through para-language [1], involving several modalities: voice tone and volume, body language, and facial expressions, which

represent by far the most important means of conveying emotional states. Recently much effort has been devoted to rendering human-computer interaction more similar to human-human communication and enhancing its naturalness [2], [3]. At this moment there are many surveillance systems dedicated to public spaces which could benefit from emotion detection [4]. Besides, many companies are interested in automatic assessment of consumers' appreciation of their products. That is why nowadays human-computer systems tend to rely even more on automatic approaches to ascertaining the user's affect as a powerful instrument for constructing efficient system responses.

One of the modalities which proved to be very powerful in detecting an individual's emotional state, and also his reaction to a situation or a product, is facial expressions. Efficient assessment of facial behavior can contribute to a number of application domains: industrial applications, e-learning, medicine, the security sector, video surveillance, interactive games, entertainment, telecommunication (e.g. call centers), but also behavioral science, anthropology, neurology, and psychiatry [2]. Several models proved their efficacy in analyzing facial behavior, among which we mention Facial Animation Parameters (FAPs) and the Facial Action Coding System (FACS) [5]. FACS employs as basic units action units (AUs), which model facial movements by associating each AU with the activation of one or more specific facial muscles. Moreover, FACS proved its applicability not only in facial expression recognition, but also in distinguishing a simulated from a genuine smile or pain [6]. Typically, trained human experts are needed for FACS labeling, a task which is very time-consuming, prone to errors, and dependent on the subjectivity of the human labelers. Therefore, in order to overcome these limitations, we aim at developing an automatic facial action unit recognition system.

Many studies tried to find a solution to the facial expression recognition problem. A lot of them focused on recognizing the six basic emotions defined by Ekman, but real-life situations showed that during human-human communication many more facial expressions can be observed, depicting pure or blended emotional states. As facial expressions can be decomposed into a sequence of AUs (based on the observations provided by Ekman), we can consider our system an important step in the facial expression recognition process. Moreover, even one emotion can be expressed in different ways, a fact that suggests the complexity of the analyzed problem.

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

> > IEEE TRANSACTIONS ON MULTIMEDIA, SPECIAL ISSUE ON MULTIMODAL AFFECTIVE INTERACTION <

2

Fig. 1 Flow diagram of the automatic facial AUs recognition system

In this paper, we propose a system for automatic recognition of AUs. Figure 1 illustrates the flow diagram of our proposed methodology. The system receives as input video sequences of individuals showing AUs (singly or in combination) or facial expressions corresponding to emotional states. The next step consists of processing the input data by detecting the face region and applying a normalization procedure, which means that all the images are scaled to the same dimensions. In order to select only the regions affected by the activation of AUs, several techniques can be applied: Active Appearance Model (AAM) extraction [7] or facial landmark detection [8]. This phase is known as ROI selection and is followed by the extraction of the relevant information from the designated areas, the feature extraction phase. Next, a classification step is performed which produces as output the recognized AUs or emotional states.
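To make the flow concrete, a minimal sketch of this pipeline is given below. The five callables are placeholders for the (replaceable) modules of Figure 1 and are not part of the paper; Python is used purely for illustration.

```python
def recognize_aus(frames, detect_face, normalize, select_rois, flow_features, classify):
    """Sketch of the Fig. 1 pipeline; every callable passed in stands for one of
    the modules described in Section III."""
    observations, prev_rois = [], None
    for frame in frames:
        face = normalize(detect_face(frame))    # face detection + scaling to fixed size
        rois = select_rois(face)                # AAM / facial-landmark based ROI selection
        if prev_rois is not None:
            # motion flow between two consecutive frames, summarized per ROI
            observations.append(flow_features(prev_rois, rois))
        prev_rois = rois
    return classify(observations)               # per-AU HMM or DBN scores the sequence
```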

A more detailed description of each module of the diagram will be provided in Section III. The flow diagram contains several modules, represented by dotted rectangles, indicating the replaceable modules or insertion points. Our aim is to provide a flexible and easy-to-use solution, which permits integration and comparison of different techniques, algorithms, and classification methods. Many authors present their approach towards a certain problem, in our case facial AUs recognition, but it is very difficult to draw a conclusion regarding the best one, as there is no common basis for comparison. The main contribution of this work consists of an analysis of the main steps involved in the recognition of facial AUs, by conducting a comparison of various selections of facial regions of interest, different feature extraction techniques, and different classification engines using a unified testing approach. In this way, considerations regarding the best methodology to be followed in the special case of AUs recognition will be formulated.

In the following section we present some of the most relevant studies on facial actions and emotion recognition. Section III of this study elaborates on our proposed model, while in Section IV we present and discuss the conducted experiments. In Section V we formulate our conclusions and provide some directions for future work.

II. RELATED WORK

Interest in facial action units and emotion recognition for human-computer interaction has grown over the last 20 years, with substantial efforts being made in this direction. The palette of approaches varies not only in the database selection, but also in the feature extraction techniques and in the classification methods. The number of databases labeled according to the FACS standard is limited, some of the widely used ones being the Cohn-Kanade Facial Expressions Database [9], the MMI database [10], and the ISL database [11].

Concerning the visual information representation used to distinguish facial movements, we can observe two main streams: local and holistic approaches. The local methods analyze only a set of points or special regions of the face. Several facial emotion recognition systems have employed geometrical features, which measure the displacement of special points on the face between the initial and the current frame [12], [13], [33]. Another local feature extraction technique is Facial Animation Parameters (FAPs) extraction [14], [15]. Holistic methods process the whole face; among the widely used ones we can mention optical flow estimation [16], [17], [18], [19], [23], [24]. A number of studies applied Gabor wavelet analysis [11], [20], [43], Principal Component Analysis (PCA) [29], [47], the Fast Fourier Transform [19], and also Local Binary Patterns [28], [41].

Deciding on the feature representation method is not the only important decision which needs to be taken; another essential decision refers to the classification approach. With respect to classification techniques we can divide them into spatial and spatial-temporal ones. Spatial classification approaches such as Support Vector Machines (SVM) [29], [30] analyze static images, while the

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

spatial-temporal ones (e.g. HMMs and DBNs) also take into account the dynamics of facial features over time. Some of the most relevant works are presented next, highlighting their contributions and also their drawbacks.

Recognizing AUs over time was investigated by Lien et al. [17], by employing an HMM for each AU and AU combination (3 AUs for the upper face region: AU4, AU1+4, AU1+2, and 6 AUs for the lower face region: AU12, AU6+12+25, AU20+25, AU9+17, AU15+17, and AU17+23+24). The recognition rate for both parts of the face was around 92%. However, the great number of HMMs required to identify potential AU combinations prevents its use in real-time applications.

Otsuka and Ohya [18] propose a method for recognizing intermediary states of the basic emotions (relaxed, contracting, apex, and relaxing). Feature extraction is performed using a gradient-based optical flow algorithm and classification is done using HMMs, employing a 5-state left-right model which also contains a transition from the final state to the initial one in order to permit the recognition of multiple sequences of facial expressions. Even though the training set was limited to one male subject for the off-line experiments and two subjects for the on-line experiments, this work represents an interesting attempt towards modeling intermediary states of emotions. The same authors refine their work in [19] by recognizing facial expressions that can abruptly change from one expression to another. The proposed HMM contains states corresponding to the simultaneous motion of two different facial expressions (e.g. a muscle relaxation of one expression and a muscle contraction of another). The recognition method implies calculation of optical flow, followed by a 2-dimensional Fast Fourier Transform (FFT); finally, the classification is performed using HMMs. The presented approach draws attention by going one step further than many other authors, who assume the transition from one emotion to another can be done only by passing first through the neutral state.

Ten years later, Koelstra and Pantic [22] investigate the same issue of recognizing AUs and their temporal segments. Non-rigid registration using free-form deformations is used to determine motion in the face region; the extracted motion fields are then used to derive motion histogram descriptors, and for the detection of AUs and their temporal segments a combination of ensemble learners and HMMs is used. This study is the first one which can detect all 27 AUs and their temporal segments, and it reports an average precision rate of 60% on the MMI database and 67% on the Cohn-Kanade database.

Some studies were based on a similarity with speech models: just as words are composed of phonemes, facial expressions are composed of AUs, and as HMMs were used successfully in speech recognition by employing an HMM for each phoneme, the same approach was proposed for facial expression recognition, by employing an HMM for each AU. An early study of Smith et al. [20], starting from the previously mentioned parallel between speech recognition and facial expression recognition, proposes interpolation


methods as a solution for the case when there is insufficient data to train all possible combinations of AUs. For the recognition of six upper face AUs (AU1, 2, 4, 5, 6, 7), a bank of Gabor filters is used, followed by a three-layer neural network classification, leading to a 92% recognition rate on the Cohn-Kanade and Ekman-Hager databases.

Another study which emphasizes the mentioned similarity is that of Lien [21], in which he presents three methods for facial feature extraction: facial feature point tracking using the coarse-to-fine pyramid method, optical flow together with PCA, and high-gradient component analysis in the spatio-temporal domain. The resulting motion vector sequences are fed to an HMM-based classifier and the overall recognition accuracy was around 87% for a set of three upper face and six lower face facial expressions. The optimal HMM topology for modeling the upper face facial expressions was a 2nd-order 3-state left-right HMM and for the lower face a 3rd-order 4-state left-right HMM.

An approach similar to the one followed by Lien is presented in [23] (Yeasin et al.), where optical flow is computed and projected into a lower dimensional space (using PCA); then a set of linear classifiers (k-NN) is applied to the projected optical flow in order to derive a characteristic temporal signature for every video sequence, a signature which is afterwards used to train discrete HMMs. The proposed approach achieved an average recognition rate of 90.9% on the Cohn-Kanade database. The novelty brought by this paper is that the approach was also tested on spontaneous data collected from TV broadcasts and on data obtained by showing movie clips meant to arouse spontaneous emotional reactions to 21 subjects. Even though the recognition rates obtained for spontaneous data were lower (70%), this can be explained by the variability in terms of lighting conditions, subjects, and expressions.

Another work which uses optical flow as a feature extraction method is that of Naghsh-Nilchi and Roshanzamir [24], in which a recognition rate of 94% for the six basic emotions on the Cohn-Kanade database is reported. They used the optical flow method developed by Gautama and VanHulle [25] and recognized emotions based on the estimated similarities between the source vectors and the extracted motion vectors. Their approach proved to be applicable even for a small number of frames (three frames), leading to a success rate of 83.3%.

Among other feature extraction techniques used in combination with HMMs we can mention the work of Zhu et al. [26], based on moment invariants (invariant under shifting, scaling, and rotation), which attained an accuracy of 93.75% by considering four emotions (anger, disgust, happiness, and surprise) on a database consisting of 31 image sequences. This study showed that moment invariants reflect the deformation of facial features but are not a good indicator of the displacement of these features.

Another study, by Landabaso et al. [15], proposes Facial Animation Parameters (FAPs) extraction in combination with a semi-continuous HMM for the recognition of the basic emotions. They claim that they can deal with unrestrained expression intervals in video sequences by using a new HMM topology. This approach could be useful in handling

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

the Cohn-Kanade database, which contains videos of emotions expressed from a neutral state until the apex, although the relaxation part, which is missing, could also be useful. Still, the authors base their work on the symmetry of emotions, an assumption which does not always hold.

Hidden Markov Models proved to be efficient not only by themselves, but also in combination with other classifiers. An interesting example in this respect is the work of Valstar and Pantic [27], where a hybrid SVM/HMM classifier is shown to increase the recognition accuracy by 7% in comparison with the case when only an SVM is used. The presented method takes into account the strengths of each classifier (while the HMM reflects the temporal dynamics of a facial action, the SVM discriminates extremely well between different facial expressions) and combines them in order to benefit from both. The experiments were performed on the MMI database; the reason for considering this database resides in the fact that it displays the full neutral-expressive-neutral pattern.

Aleksic et al. propose in [14] a Multistream Hidden Markov Model (MS-HMM) in combination with FAPs and prove that this approach outperforms the single-stream ones. FAPs were extracted for both the eyebrow and outer-lip regions, and the accuracies obtained for each single-stream HMM (SS-HMM) (58.80% for eyebrows and 87.32% for outer lips) enabled choosing the stream weights for the MS-HMM, leading to a recognition accuracy of 93.66%. This study proved that FAPs contain sufficient information about facial expressions. It also provided a model which can be used with any kind of facial features and with any number of facial regions.

A study similar to the one previously presented is that of Cao and Tong [28], in which the method for automatic facial expression recognition is based on Local Binary Patterns (LBP) and an Embedded Hidden Markov Model (EHMM). The HMM corresponding to each facial region (forehead, eyes, nose, mouth, and chin) is expanded to a super state in the EHMM. The argument for using LBPs is based on their ability to describe local image texture well enough. The performance of the presented method was around 61.80% on the JAFFE database. Other results on the JAFFE database are reported in [29], where Liejun et al. propose an approach based on improved Support Vector Machines (SVM) obtained by modifying kernels, while feature extraction is based on PCA. The classification accuracy was 94.8% before and 95.7% after modifying the kernel.

Not only have HMMs proved to be efficient in capturing the spatio-temporal dimension of emotions; Dynamic Bayesian Networks (DBNs) also appear to be a good alternative for understanding and modeling the temporal behavior of facial expressions. In [33] Zhang and Ji present their approach for facial expression recognition, which is based on the fusion not only of the current video


observations but also of previous visual evidence. Measurement of the facial motion is based on tracking facial features using active infra-red illumination and Kalman filtering, while the feature extraction method is based on geometrical features. The proposed BN causal model consists of three layers: a classification, a facial AU, and a sensory information layer. The added value of this study consists in taking advantage of DBNs in order to handle occluded facial expressions and also in recognizing image sequences containing multiple expressions which do not need to be temporally segmented by a neutral state. Another approach is proposed in [11] by Tong et al., where they train an AdaBoost classifier for 14 target AUs and learn a DBN in order to correctly model the relationships between AUs. The added value consists in the fact that AUs which are difficult to recognize directly can be inferred indirectly from other AUs. The databases used for this study are the Cohn-Kanade database and the ISL database. With only the AdaBoost classifier, the system achieves a recognition rate of 91.2%, while adding the DBN increases the rate to 93.33%.

In order to provide an overview of the previously presented studies, we analyze them by taking into account a number of criteria. Table 1 provides a summary of the described studies on emotion recognition from facial expressions with respect to the database(s) used, the extracted facial features, the employed classifier or combination of classifiers, the number of target classes (emotions or AUs), the performance, whether the study contains a comparison with other methods or other databases, and, last but not least, our opinion about the level of detail with which the followed approach was described (with sufficient details, in a comprehensive manner, containing diagrams and pictures), on a scale from (+) to (+++).

A thorough analysis of the presented studies shows that there are many methodologies applicable to the AU/facial expression recognition problem. Regarding the classification phase, spatial-temporal approaches are preferred to spatial ones; from the first group, HMMs are usually employed at a basic level, being suitable for AU classification, while DBNs are usually used at a higher level for modeling or inferring relationships between AUs. Still, we cannot draw a conclusion regarding the best approach to be used, due to the variability in dataset utilization, feature extraction techniques, testing approaches, and, last but not least, the level of detail provided by each presented study. In order to provide an answer to the research question ("Which is the best methodology to be followed?") we will present in the following sections our conclusions regarding the comparison of different selection, representation, and classification methods.

Table 1 Overview of studies on facial action units and emotion recognition
(Columns: Study, Data set, Facial features, Classifier, Classes, Performance, Methods comparison, Level of description of details. Studies covered: Huang and Lin [16], Lien et al. [17], Tong et al. [11], Otsuka and Ohya [18], [19], Smith et al. [20], Koelstra and Pantic [22], Yeasin et al. [23], Naghsh-Nilchi and Roshanzamir [24], Landabaso et al. [15], Aleksic et al. [14], Cao and Tong [28], and Liejun et al. [29].)

III. MODELING

Based on the observations regarding previous work on facial AUs recognition, we plan to investigate several research questions:
ROI Selection - Given a face region, which are the best ROIs for AUs detection?
Choice of optical flow algorithm - Which is the best algorithm for motion flow estimation between consecutive frames?
Classification method - Which classification method is most suitable for our recognition problem?
In order to provide an answer to the formulated questions

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

and also for developing an automatic facial AUs recognition system, we based our work on previous work done in our research group, consisting of the pre-processing phase of the facial input data. This phase implies face detection using a Viola&Jones detector [38] and face shape and special facial point detection using Active Appearance Model (AAM) extraction [7]. For more details please refer to [12].

A. Dataset

We used the Cohn-Kanade Facial Expression database [9] as input for training and testing our proposed approach. This database consists of 480 gray-scale recordings of 97 subjects displaying facial expressions, labeled according to the FACS standard. Regarding the variability of the data, the subjects ranged in age from 18 to 30 years, 65% were female, 15% were African-American, and 3% Asian or Latino. Collected under controlled illumination and background, the Cohn-Kanade database has been widely used for evaluating facial AU recognition systems, which motivated us to use it in our research.

We considered the following AUs in our research: AU 1, 2, 4, 5, 7, 9, 10, 12, 15, 16, 17, 20, 23, 24, 25, 26, and 27. The selection of the mentioned AUs was based on a practical reason; we selected the AUs for which more than twenty samples were available in the Cohn-Kanade database, except AU10, which was considered due to its importance. An indication regarding the number of videos and subjects corresponding to each AU is given in the table below.

Table 2 Distribution of the considered AUs in the Cohn-Kanade database

AU    Nr. of videos    Nr. of subjects
1     141              82
2     95               76
4     150              76
5     77               63
7     106              68
9     49               42
10    11               11
12    94               79
15    74               53
16    20               17
17    154              76
20    67               59
23    42               37
24    42               39
25    291              92
26    38               32
27    76               69

B. Region of Interest Selection

In comparison with other research topics, such as feature extraction techniques or classification approaches, which have benefited from a lot of interest and are described in great detail, ROI selection is a less researched topic, as we could not find experimental results on comparative studies regarding different ROI selections. Still, this step can have a big influence on the recognition process, as will be shown in Section IV. Therefore we set out to investigate which are the best ROIs for our specific case and how we can detect them automatically. In Figure 2, we propose several alternatives, each of them


having strong points but also drawbacks. The selection illustrated in (2a) does not require a lot of effort, the division of the face area into a fixed number of blocks being made automatically. We expect the blocks around special facial regions (eyes, nose, and mouth) to carry more information than the others; still, not knowing the exact correlation between blocks and the previously mentioned facial regions makes this selection difficult to apply for AU recognition. Figure 2b requires a preliminary step, the identification of the eye corners, nose center, and mouth corners. Given the location of these points, the face is divided into eight regions, each of them associated with a number of AUs. We used this type of selection in combination with AAM extraction, as this model already provides the required reference points. Even though a variation of this ROI selection (consisting of the same ROIs except the last two) produced very good results in [24], for the special case of AU recognition this solution proved to be less suitable, probably due to the large areas designated to each ROI. The last example (2c) proposes 13 ROIs and was inspired by the work of Gizatdinova and Surakka [8]. The location of the reference points is provided by the AAM extraction, while the size and location of each ROI were determined using the entropy maximum algorithm, which captures the regions with the most complex motion. In order to perform this search, we used the motion flow estimators calculated for the whole facial region between every two consecutive frames, using the Lucas-Kanade algorithm, which will be detailed in the following sub-section.

As described above, our automatic process of ROI selection evolved from a standard segmentation of the face region to a choice of ROIs learned from the dataset, suggesting that the final purpose should be considered when deciding on a type of selection; a sketch of both styles of selection is given below. In our case we aim at building an accurate system for facial AU detection, and considering 13 ROIs is the most suitable alternative. However, if the aim is to develop a simple and straightforward solution, then the first presented alternative is the appropriate one.
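As an illustration only (the paper does not specify grid dimensions or ROI sizes, so the values below are placeholders), the two extremes of these strategies, the fixed grid of (2a) and a landmark-centred box as in (2b)/(2c), could be sketched as follows:

```python
import numpy as np

def grid_rois(face, rows=4, cols=4):
    """Selection (2a): divide the normalized face image into a fixed grid of blocks.
    The 4x4 grid is only an example value."""
    h, w = face.shape[:2]
    ys = np.linspace(0, h, rows + 1, dtype=int)
    xs = np.linspace(0, w, cols + 1, dtype=int)
    return [(ys[i], xs[j], ys[i + 1], xs[j + 1])
            for i in range(rows) for j in range(cols)]

def landmark_roi(landmark_xy, box_wh):
    """Selections (2b)/(2c): a box centred on a detected landmark; in (2c) the box
    size and exact location would additionally be tuned on the training data."""
    (x, y), (bw, bh) = landmark_xy, box_wh
    return (int(y - bh / 2), int(x - bw / 2), int(y + bh / 2), int(x + bw / 2))
```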

Fig. 2 Automatic ROIs selection. (2a) Grid selection. (2b) Selection based on special points. (2c) Selection around special landmarks learned from the dataset.

C. Feature Extraction

Our goal consists of capturing facial movements over time, movements which can later be associated with

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

the activation of AUs. The optical flow method has proved its usefulness in measuring pixel displacements between two consecutive frames, constituting in this way a possible solution to our problem. Moreover, it is a widely used method in the facial expression recognition field, with many authors reporting results obtained by using it. An example can be observed in Figure 3. There are several different methods of estimating optical flow, e.g. Lucas-Kanade (LK) [39], Horn-Schunck [32], the phase-based method [25], and the 3-D Recursive Search (3DRS) algorithm [40]. We performed a comparison of the before-mentioned methods by visually inspecting their results on parts of the Cohn-Kanade database, and the 3DRS and LK algorithms seemed to provide the best results. A reason might be the localized search for each pixel or block of pixels, while the other methods perform a global estimation of the movement. Our finding is also supported by the work in [45], where a larger number of optical flow methods are compared and the same conclusion is reached regarding the Lucas-Kanade method.

Fig. 3 Optical flow estimation (pixel-wise) between two consecutive frames (frame n-1 and frame n) using the Lucas-Kanade algorithm

We performed experiments with the two mentioned methods for optical flow estimation, pixel-wise (LK) and block-wise (3DRS), and both methods provided the same overall accuracy, even though on particular AUs one method was better than the other. For every region of interest defined in the previous sub-section, we calculated the average displacement on the x and y axes, respectively Δx and Δy, and we further used these values as input features for the classification phase. Other authors consider as optical flow parameters the average angle and length of the motion vectors, which only represent a combination of the Δx and Δy displacements, e.g.

$\alpha = \arctan(\Delta y / \Delta x)$ and $l = \sqrt{\Delta x^2 + \Delta y^2}$.
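A minimal sketch of the per-ROI feature computation described above is given below; OpenCV's dense Farneback flow is used here only as an accessible stand-in for the pixel-wise Lucas-Kanade and block-wise 3DRS estimators compared in the paper.

```python
import cv2
import numpy as np

def roi_flow_features(prev_gray, curr_gray, rois):
    """Average (dx, dy) displacement per ROI between two consecutive grayscale frames;
    the concatenated values form the observation vector for one frame transition."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    feats = []
    for (y0, x0, y1, x1) in rois:
        feats.append(float(np.mean(flow[y0:y1, x0:x1, 0])))  # mean dx in the ROI
        feats.append(float(np.mean(flow[y0:y1, x0:x1, 1])))  # mean dy in the ROI
    return np.array(feats)
```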

We would also like to mention that even though this method of visual information representation satisfies the


temporal requirement (it captures the evolution in time of our facial features), it still poses some accuracy issues, the aperture problem being one of them. Even though the motion flow estimation method proved to be successful in detecting facial features in a sequence of images, it is worth mentioning that there are also other feature extraction techniques which are suitable for this task, such as the Gabor wavelet representations applied in [11], [20], and [42], or particle filtering [13], [44].

D. Classification

Based on the review of the most relevant work in facial AU recognition, we noticed that spatial-temporal approaches proved to perform better than spatial ones. Furthermore, within the spatial-temporal group, Hidden Markov Models and Dynamic Bayesian Networks seem to produce the best results. However, to the best of our knowledge, no author has conducted a comparison of the two methods applied to the special case of facial AU recognition. Therefore one of the main contributions of this research consists in a relative comparison of these two classification engines, in order to reveal which is the most suitable one and why. The experimental results are presented in the following section. Next we provide some details with respect to each method and how it was implemented in our study.

D.1 Hidden Markov Models

The Hidden Markov Model (HMM) is a very powerful mathematical tool for modeling time series. It provides efficient algorithms for state and parameter estimation and it automatically performs dynamic time warping for signals that are locally squashed and stretched. It has been used for many purposes, such as speech recognition, gesture recognition, handwriting recognition, and also facial AU recognition. We designed our system by building an HMM for each of the AUs considered in sub-section III.A. The performance of an HMM classifier is highly dependent on its topology. In order to determine the best topology for each HMM, we performed an extensive search, varying the number of states, the number of mixtures, and the network topology (e.g. left-to-right, ergodic model). In Figure 4 an example of a left-to-right HMM with 3 states is provided.
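A minimal sketch of this per-AU modeling is shown below, using the hmmlearn library purely as an accessible stand-in for the HTK setup described next; the 3 states and 20 mixtures are placeholders for the values searched per AU.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_left_right_hmm(n_states=3, n_mix=20):
    """One Gaussian-mixture HMM per AU with a left-to-right topology; zero entries
    in the transition matrix stay zero during Baum-Welch re-estimation."""
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=30,
                   init_params="mcw")              # keep our start/transition init
    model.startprob_ = np.zeros(n_states)
    model.startprob_[0] = 1.0                      # always start in the first state
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5                          # stay in the current state
        trans[i, min(i + 1, n_states - 1)] += 0.5  # or move one state to the right
    model.transmat_ = trans
    return model

# Usage sketch: one model per AU, the highest log-likelihood wins.
# models = {au: make_left_right_hmm().fit(np.vstack(seqs), [len(s) for s in seqs])
#           for au, seqs in training_sequences.items()}
# predicted_au = max(models, key=lambda au: models[au].score(test_sequence))
```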

Fig. 4 HMM first-order left-to-right model

For implementation we chose the Hidden Markov Model Toolkit (HTK) [31], which has proved to be very efficient in

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

speech recognition tasks. Initially we planned to use it in the same way as it works for speech, given the parallel between these two worlds: words are composed of phonemes, which represent the basic units, and facial expressions are composed of AUs. However, this similarity does not hold entirely, because while phonemes follow a sequential pattern, coming one after another, AUs can also appear simultaneously. We had to adapt the HTK toolkit in order to make it suitable for our recognition task, meaning that we had to define proper configuration, prototype, and grammar files. Furthermore, the Baum-Welch algorithm was used for training and the Viterbi algorithm for testing. The obtained results were promising and are discussed in the following section.

D.2 Dynamic Bayesian Networks

Dynamic Bayesian Networks offer a number of advantages for the representation and processing of knowledge and uncertainty, being useful in a wide range of areas: academics, biology, business and finance, scheduling, computer games, and also computer vision. A Dynamic Bayesian Network (DBN) is a way to extend a Bayesian Network (BN) to model probability distributions over semi-infinite collections of random variables. Typically the variables can be partitioned into $Z_t = (U_t, X_t, Y_t)$, representing the input, hidden, and

output variables of a state-space model. The t index is related to time, being increased each time a new observation is provided. For a more detailed description of the representation, inference, and learning with DBNs we refer to [34], [35].

It is interesting to mention that HMMs represent a special case of DBNs in which the state of the process is described by a single discrete random variable. Still, the particular characteristics of each method make them better suited for certain kinds of problems. HMMs can be used to model any distribution given infinite mixture components [46] and are consequently well suited to basic recognition tasks, while DBNs, having advantages such as extensibility, interpretability, and semantics, are better suited for modeling complex situations. In our approach, we modeled the relations between AUs and the corresponding parameters (displacement on the x and y axes) by employing a causal model in which the hidden variable AU causes the observations (Vx, Vy), as depicted in Figure 5.
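Assuming a first-order temporal link on the AU node between consecutive time slices (our reading of such a two-slice causal model; the paper does not spell out the factorization), the joint distribution represented by this network can be written as

\[
P(AU_{1:T}, V_{1:T}) \;=\; P(AU_1)\,P(V_1 \mid AU_1)\;\prod_{t=2}^{T} P(AU_t \mid AU_{t-1})\,P(V_t \mid AU_t),
\qquad V_t = (V_{x,t}, V_{y,t}).
\]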


In order to approximate the continuous distribution of the Vxi and Vyi variables we used a mixture of Gaussian distributions, depicted in the model by the node M and explained in (1), where the probability of node M being in each state i represents the weight associated with each mixture.
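In its generic form, such a mixture of Gaussians reads as follows, with the probability of node M being in state m acting as the weight of component m (the paper's own equation (1) may use a different notation):

\[
p(V_{x_i}, V_{y_i} \mid AU) \;=\; \sum_{m} P(M = m)\;\mathcal{N}\!\big((V_{x_i}, V_{y_i});\, \mu_m, \Sigma_m\big).
\]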

In the implementation process we employed an open-source Matlab toolbox: the Bayes Net Toolbox developed by Kevin Murphy [36]. Several decisions had to be taken concerning data modeling, data representation, and prior probability estimation. We used the Expectation-Maximization (EM) algorithm, presented in [37], to maximize the posterior probability of the $(V_{x_i}, V_{y_i})_{i=1,\dots,n}$ parameters, given the observed training data (AU, Vx, Vy), in the presence of the hidden parameters $(M_i)_{i=2,4,\dots,2n}$. The number of EM iterations had an impact on the recognition accuracy, the optimum number being specific to each AU. Furthermore, it is known that the EM algorithm depends on the initialization parameters and can get stuck in a local optimum after a number of iterations. In order to avoid this issue, we initialized the conditional probability distributions (CPDs) both randomly and by estimating the global mean and variance over the whole training set.

The ground truth provided for the considered database, consisting of AU labels, refers only to the last frame of each video, in which we know that all the mentioned AUs for that video are activated. For modeling HMMs we only used this information, by assigning an AU label to each video. Still, DBNs require frame-by-frame labeling for learning the CPDs. In order to provide a fair basis for comparison we created a DBN model that closely mimics the evaluated HMMs. This adaptation was accomplished by adding to the model presented in Figure 5 an extra layer between the AU and (Vx, Vy) variables, consisting of a state variable S which represents the equivalent of a state in an HMM, as can be seen in Figure 6. Moreover, this change enabled the use of video labeling.
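A minimal sketch of the non-random initialization, assuming per-frame feature vectors collected over the training set (this mirrors what HTK's HCompV tool does for the HMMs, as discussed in Section IV):

```python
import numpy as np

def global_gaussian_init(training_sequences):
    """Initialise Gaussian emission parameters from the global statistics of the
    training set, instead of using random values."""
    X = np.vstack(training_sequences)              # all frame-level feature vectors
    return X.mean(axis=0), X.var(axis=0) + 1e-6    # small floor avoids zero variances
```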

Fig. 5 DBN model for AUs
Fig. 6 Adapted DBN model which uses the same labeling procedure as the one considered in the HMM case

The results obtained for the two considered DBN models together with the results rendered by HMMs will be

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

discussed in the following section.

IV. RESULTS AND DISCUSSIONS

Different works on facial expression recognition present different testing approaches, varying from n-fold cross-validation to leave-one-subject-out (LOSO), person-dependent or person-independent, on the same or on several corpora. In order to assess the performance of our method, we used a ten-fold cross-validation, person-independent approach. We performed several experiments in order to provide a thorough comparison of the different ROI selection alternatives, the different optical flow estimation methods, and the two proposed spatial-temporal classifiers. We used an iterative testing approach, where at each step we determined the best alternative from the proposed ones.

First we conducted experiments in order to determine which method of ROI selection is better. The case with 13 ROIs achieved much better performance (93%) than the one with 8 ROIs (75%), using features extracted with the Lucas-Kanade algorithm and the HMM classifier. In order to assess the proposed ROI selection alternatives we tested them by employing a large number of HMM configurations, as detailed below. The obtained results suggested that the 13-ROI selection alternative is better suited for our recognition goal, as the regions of interest efficiently capture the AU activation/deactivation areas. Next, using the same HMM classifier and the 13-ROI selection, we investigated which optical flow estimation method is optimal. Both proposed methods (Lucas-Kanade and 3-D Recursive Search) achieved the same overall performance, as can be seen in Table 3, indicating that both are good alternatives. Regarding the HMM classification case, we initiated an extensive search in order to find the best configuration for each AU.
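A person-independent split of this kind can be obtained by grouping sequences by subject, for instance with scikit-learn's GroupKFold (the splitting tool is our assumption; the paper only states that the evaluation was ten-fold and person-independent):

```python
from sklearn.model_selection import GroupKFold

def person_independent_folds(sequences, labels, subject_ids, n_folds=10):
    """Ten-fold, person-independent splitting: all sequences of a given subject
    end up in the same fold, so no subject is shared between training and test."""
    splitter = GroupKFold(n_splits=n_folds)
    for train_idx, test_idx in splitter.split(sequences, labels, groups=subject_ids):
        yield train_idx, test_idx
```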

Table 3 HMMs experimental results, best model configuration and accuracy for each AU
(For each considered AU (AU 1, 2, 4, 5, 7, 9, 10, 12, 15, 16, 17, 20, 23, 24, 25, 26, 27), grouped by the facial ROI it affects (eyebrows, eyes, nose, upper lip, lip corners, lower lip, mouth, chin), the table lists the accuracy of the best HMM configuration for the Lucas-Kanade (LK) and 3DRS features, together with the corresponding DBN accuracy. Both feature extraction methods reach the same overall HMM accuracy of 93%, while the overall DBN accuracy is 89%.)

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

> > IEEE TRANSACTIONS ON MULTIMEDIA, SPECIAL ISSUE ON MULTIMODAL AFFECTIVE INTERACTION <

10

Fig. 7 Behavior of AU 1, 2, 12, and 25: accuracy as a function of the number of states, with the different numbers of mixtures represented by different lines

The performance achieved for each considered AU, for the two proposed optical flow estimation methods, is provided in Table 3 and corresponds to the best HMM configuration. In order to find the best configuration, we considered all models with 1 to 10 states, different topologies, Gaussian distributions with 1 to 50 mixtures for modeling the continuous input data, and a distinct number of features for each ROI considered. This experiment showed that the best HMM configuration for each AU was different for the two feature extraction methods, suggesting both the high dependency of the model on the input data and the absence of a general model. Regarding HMM topologies, we employed several models, from left-to-right to an ergodic model, and the best performance was obtained for the left-to-right topology. Another conclusion of this experiment refers to the representation of the continuous input data; Gaussian Mixture Models proved to be better than simple Gaussian models, the optimum number of mixtures being around or higher than 20. This conclusion can also be noticed in Figure 7, where the lower line corresponds to a simple Gaussian model. Due to space limitations we chose to present the graphs only for a number of AUs which we considered more relevant. Furthermore, employing a bigger number of features proved to be beneficial, e.g. for the mouth ROI using 8 features vs. 4 features contributed to an improvement of 2.4%. This was applied for all AUs: for the results presented in Table 3, the features were extracted from all the ROIs on which an AU has an impact.
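A minimal sketch of this configuration search, assuming a scoring callable that trains and cross-validates one AU model for a given number of states and mixtures (the mixture grid below is a coarse placeholder for the 1-50 range explored in the text):

```python
from itertools import product

def best_hmm_config(train_and_score, states=range(1, 11), mixtures=(1, 5, 10, 20, 50)):
    """Exhaustive search over HMM configurations for one AU; train_and_score is a
    placeholder callable returning cross-validated accuracy for a configuration."""
    scores = {(s, m): train_and_score(n_states=s, n_mix=m)
              for s, m in product(states, mixtures)}
    best = max(scores, key=scores.get)
    return best, scores[best]
```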

Based on the conclusions drawn from the experiments with HMMs, we carried out an experiment with DBNs, using the 13-ROI selection alternative and the features estimated with the Lucas-Kanade algorithm, as the other optical flow estimation method produced similar results. First we used the model described in sub-section III.D.2, Figure 5, and we obtained an overall recognition accuracy of 85%, which was lower than the recognition rate of 93% achieved for HMMs. In order to understand the differences in accuracy between the two considered classification methods (HMMs and DBNs), given that in theory they should be similar, we examined four hypotheses.

The first of them concerns labeling; in order to test it we employed a DBN model similar to an HMM model, which was depicted in Figure 6. We achieved an improvement in accuracy of 4%. This experiment showed that for our specific case, video labeling provides better results than frame-based labeling. Furthermore, this increase in performance between the two regarded DBN models is explained by the increased flexibility of the second model.

The second hypothesis regards the training procedure; we used the Expectation-Maximization (EM) algorithm for training DBNs, while for training HMMs the Baum-Welch algorithm, which represents an adaptation of the EM algorithm to the case of HMMs, was employed. The difference between the two considered approaches consisted in the initialization phase: for DBNs with random parameters, and for HMMs with the global mean and variance estimated over the training set, a task which is performed in HTK using the HCompV tool. Still, by incorporating this improvement in the DBN model, we did not obtain a significant change in accuracy.

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

A third hypothesis refers to the implementation aspect and was taken into consideration as the two regarded classification methods were implemented in different ways: HMMs in HTK and DBNs in the Bayes Net Toolbox. Consequently we carried out an experiment comparing HMMs and DBNs using an implementation of both methods in the same toolbox (the Bayes Net Toolbox), and we obtained the same overall performance of 89%. Still, as the performance achieved for HMMs using HTK (93%) was higher than the one obtained in the Bayes Net Toolbox, we continued our investigation with another hypothesis.

The fourth considered hypothesis concerns the sampling of the video data. While for HMMs all the frames of each video were used, for DBNs we had to use a fixed number of frames, a constraint imposed by the Bayes Net Toolbox. We obtained a fixed number of frames by choosing one frame from each cluster produced by a clustering algorithm based on the cumulative sum of the displacements between every two consecutive frames. In order to validate this hypothesis we conducted another experiment with HMMs in the HTK toolkit, using a fixed number of frames, and the overall recognition rate dropped by 4%, suggesting that sampling is not always beneficial. Furthermore, under these conditions (same initialization method, same labeling procedure, and same sampling rate), the same accuracy was achieved for both classification methods, HMMs and DBNs, confirming the theoretical assumption that HMMs represent a special case of DBNs. The experimental results for the DBN classification case under the conditions stated above, obtained by employing different parameter settings, are depicted in Table 3. Furthermore, we would like to mention that not only the recognition rate of the two classification methods was taken into consideration when comparing them, but also other characteristics, which are presented in the table below.

Table 4 Comparison of the two regarded classification methods (HMMs and DBNs)

                     HMM                             DBN
Performance          93% (89%)                       89%
Ease of use          Easy                            Complex
Attention points     Nr. of states and mixtures      Model design
Training time        ~ t min                         ~ 8t min
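One possible reading of the fixed-length sampling used for the DBNs is sketched below; the exact clustering algorithm is not specified in the text, so the equal-mass segmentation of the cumulative displacement is an assumption.

```python
import numpy as np

def sample_fixed_frames(displacements, n_frames):
    """Pick a fixed number of frame indices by cutting the cumulative inter-frame
    displacement into equal-mass segments and keeping one index per segment."""
    cum = np.cumsum(displacements)
    if cum[-1] == 0:                          # no motion at all: fall back to uniform sampling
        return np.linspace(0, len(displacements) - 1, n_frames).astype(int)
    edges = np.linspace(0, cum[-1], n_frames + 1)
    centres = (edges[:-1] + edges[1:]) / 2.0  # one representative per motion segment
    return np.searchsorted(cum, centres)
```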

The price to be paid for preferring DBNs to HMMs is increased algorithmic and computational complexity regarding training time, a fact explained by the different number of parameters that need to be estimated and also by the better tuning of the training algorithm for HMMs. Another argument in favor of HMMs is the ease of use of the model, whereas for DBNs we had to carefully design the model in order to achieve the same performance. Therefore we consider that HMMs are suitable for simple recognition actions, such as facial AUs, phonemes, or text characters, while DBNs are better suited for modeling complex temporal processes such as emotion recognition in a multimodal context.


V. SUMMARY AND CONCLUSIONS

In this paper, we proposed an approach towards automatic AU recognition, meant to provide an answer to the research questions stated in Section III. Regarding ROI selection, the experiments showed that using 13 ROIs yields better results than using 8 ROIs, a result explained by the better tuning of the ROIs around the considered landmarks (eyebrows, eyes, nose, mouth, and chin). Concerning the choice of optical flow algorithm, both proposed methods (Lucas-Kanade and 3-D Recursive Search) led to good recognition accuracies, both representing a good choice. Furthermore, in order to address the classification method issue, we considered two spatial-temporal classification methods, HMMs and DBNs. Each of them posed different challenges: for the HMMs we had to find the best configuration for each AU, whereas for the DBNs our focus was on improving the prior probability estimation.

By enabling the same conditions regarding initialization, labeling, and sampling, both classification methods achieved the same performance of 89%. Furthermore, by using non-fixed sampling for HMMs, the overall recognition rate was improved to 93%, showing that sampling is not always beneficial. Regarding the labeling procedure, the experiments showed that video-based labeling performs better than frame-based labeling, leading to an increase of 4% in the case of DBNs. This conclusion is important, as frame-based labeling is much more time-consuming and prone to errors than video-based labeling.

Even though both classification approaches have the advantage of detecting AUs in the case of partial evidence and under uncertainty, the differences between them, such as algorithmic complexity or ease of use, make them better suited for particular problems. Consequently we consider that HMMs (using HTK) are best suited for the specific task of facial AU recognition, while DBNs are appropriate in a complex context where their characteristics (interpretability, extensibility, and semantics) can be better applied.

We plan to extend our work to the recognition of different facial expressions, by employing a DBN model with an extra layer, where each facial expression represents a combination of specific AUs. Besides, we aim to take advantage of the strong characteristics of DBNs in a multimodal context where different modalities (facial expressions, speech, gestures) can be fused together, each of them having an associated weight.

ACKNOWLEDGMENT The authors would like to thank Dragos Datcu for providing us with pre-analyzed input data. We would also like to thank the authors of the Cohn-Kanade Facial Expressions database for providing it.

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

REFERENCES

[1] M. Paleari and S. Antipolis, "Toward multimodal fusion of affective cues," in Proc. 1st ACM International Workshop on Human-centered Multimedia, Santa Barbara, California, USA, 2006.
[2] M. Pantic and L. J. M. Rothkrantz, "Affect-sensitive multi-modal monitoring in ubiquitous computing: advances and challenges," AAAI/IEEE International Conference on Enterprise Information Systems, pp. 466-474, July 2001.
[3] P. Thomas, G. Silke, E. Martin, K. Andreas, T. Sunna, and V. Jurgen, "Smartkom – home – an advanced multi-modal interface to home entertainment," in Eurospeech-2003, pp. 1897-1900, 2003.
[4] C. Clavel, I. Vasilescu, G. Richard, and L. Devillers, "Voiced and unvoiced content of fear-type emotions in the SAFE corpus," in Speech Prosody, Dresden, Germany, May 2006.
[5] P. Ekman and W. Friesen, "Facial Action Coding System," Consulting Psychologists Press, Inc., Palo Alto, California, USA, 1978.
[6] U. Hess and R. E. Kleck, "The cues decoders use in attempting to differentiate emotion-elicited and posed facial expressions," European Journal of Social Psychology, 24: 367-381, 1994.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active Appearance Models," in H. Burkhardt and B. Neumann, editors, 5th European Conference on Computer Vision 1998, vol. 2, pp. 484-498, Springer, Berlin, 1998.
[8] Y. Gizatdinova and V. Surakka, "Automatic detection of facial landmarks from AU-coded expressive facial images," in 14th Int. Conf. on Image Analysis and Processing (ICIAP 2007), September 2007, pp. 419-424.
[9] T. Kanade, J. F. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Proc. 4th IEEE Int. Conf. Automatic Face and Gesture Recognition (FG'00), Grenoble, France, pp. 46-53.
[10] M. Pantic, M. F. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in Proc. IEEE Int'l Conf. Multimedia and Expo (ICME'05), Amsterdam, The Netherlands, July 2005.
[11] Y. Tong, W. Liao, and Q. Ji, "Facial action unit recognition by exploiting their dynamic and semantic relationships," IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10): 1683-1699, 2007.
[12] D. Datcu and L. J. M. Rothkrantz, "Facial expression recognition in still pictures and videos using Active Appearance Models. A comparison approach," CompSysTech'07, ISBN 978-954-9641-50-9, pp. VI.13-1 - VI.13-6, Rousse, Bulgaria, June 2007.
[13] M. Pantic and I. Patras, "Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences," IEEE Transactions on Systems, Man, and Cybernetics, 36(2): 433-449, 2006.
[14] P. S. Aleksic and A. K. Katsaggelos, "Automatic facial expression recognition using facial animation parameters and MultiStream HMMs," IEEE Trans. Inf. Forensics Security, vol. 1, no. 1, pp. 3-11, 2006.
[15] J. L. Landabaso, M. Pardas, and A. Bonafonte, "HMM recognition of expressions in unrestrained video intervals," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 03), Hong Kong, China, April 2003.
[16] X. Huang and Y. Lin, "A vision-based hybrid method for facial expression recognition," in Proc. 1st Int. Conf. on Ambient Media and Systems, Quebec, Canada, Feb. 11-14, 2008, pp. 1-7.
[17] J. J. Lien, T. Kanade, J. F. Cohn, and C. C. Li, "Detection, tracking, and classification of action units in facial expressions," J. Robotics and Autonomous Systems, vol. 31, pp. 131-146, 2000.
[18] T. Otsuka and J. Ohya, "Spotting segments displaying facial expression from image sequences using HMM," FG 1998, pp. 442-447.
[19] T. Otsuka and J. Ohya, "Recognizing abruptly changing facial expressions from time-sequential face images," CVPR 1998, pp. 808-813.
[20] E. Smith, M. S. Bartlett, and J. R. Movellan, "Computer recognition of facial actions: a study of co-articulation effects," in Proc. 8th Ann. Joint Symposium on Neural Computation, 2001.
[21] J. J. Lien, "Automatic recognition of facial expressions using Hidden Markov Models and estimation of expression intensity," doctoral dissertation, tech. report CMU-RI-TR-98-31, Robotics Institute, Carnegie Mellon University, April 1998.


[22] S. Koelstra and M. Pantic, "Non-rigid registration using free-form deformation for recognition of facial actions and their temporal dynamics," in Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG'08), 17-19 Sep. 2008, Amsterdam.
[23] M. Yeasin, B. Bullot, and R. Sharma, "Recognition of facial expressions and measurement of levels of interest from video," IEEE Transactions on Multimedia, vol. 8, no. 3, pp. 500-508, 2006.
[24] A. R. Naghsh-Nilchi and M. Roshanzamir, "An efficient algorithm for motion detection based facial expression recognition using optical flow," in Proc. World Academy of Science, Engineering and Technology, vol. 14, August 2006.
[25] T. Gautama and M. M. VanHulle, "A phase-based approach to the estimation of the optical flow field using spatial filtering," IEEE Trans. Neural Networks, vol. 13(5), September 2002.
[26] Y. Zhu, L. C. de Silva, and C. C. Ko, "Using moment invariants and HMM in facial expression recognition," in SSIAI, 4th IEEE Southwest Symposium on Image Analysis and Interpretation, 2000, p. 305.
[27] M. F. Valstar and M. Pantic, "Combined Support Vector Machines and Hidden Markov Models for modeling facial action temporal dynamics," in ICCV-HCI, 2007, pp. 118-127.
[28] J. Cao and C. Tong, "Facial expression recognition based on LBP-EHMM," CISP 2008 Congress on Image and Signal Processing, vol. 2, pp. 371-375.
[29] W. Liejun, Q. Xizhong, and Z. Taiyi, "Facial expression recognition using improved Support Vector Machine by modifying kernels," Information Technology Journal, Asian Network for Scientific Information, 2009.
[30] P. Michel and R. El Kaliouby, "Real time facial expression recognition in video using Support Vector Machines," in Int. Conf. on Multimodal Interfaces, 2003.
[31] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, "The HTK Book 3.2," Cambridge University, Cambridge, England, 2002.
[32] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, AI(17), no. 1-3, August 1981, pp. 185-203.
[33] Y. Zhang and Q. Ji, "Active and dynamic information fusion for facial expression understanding from image sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, May 2005.
[34] K. Murphy, "Dynamic Bayesian Networks: Representation, Inference, and Learning," PhD thesis, UC Berkeley, Computer Science Division, July 2002.
[35] P. Wiggers, "Modeling Context in Automatic Speech Recognition," PhD thesis, June 2008.
[36] K. P. Murphy, "Bayes Net Toolbox," available: http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html, last update October 2007.
[37] R. M. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," Learning in Graphical Models (Cambridge, MA: MIT Press): 355-368, 1999.
[38] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, 2002.
[39] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the 1981 DARPA Image Understanding Workshop, pp. 121-130.
[40] R. Braspenning and G. de Haan, "True-motion estimation using feature correspondence," Proceedings of SPIE VCIP 2004, pp. 396-407.
[41] C. Shan and T. Gritti, "Learning discriminative LBP-histogram bins for facial expression recognition," Proc. British Machine Vision Conference (BMVC'08), Leeds, UK, September 2008.
[42] Y. Zhan, J. Ye, D. Niu, and P. Cao, "Facial expression recognition based on Gabor wavelet transformation and elastic templates matching," Third International Conference on Image and Graphics (ICIG'04), 2004, pp. 254-257.
[43] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, "Recognizing facial expression: machine learning and application to spontaneous behavior," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[44] F. Dornaika and F. Davoine, "Simultaneous facial action tracking and expression recognition using a particle filter," in IEEE International Conference on Computer Vision (ICCV), 2005.

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible"

> > IEEE TRANSACTIONS ON MULTIMEDIA, SPECIAL ISSUE ON MULTIMODAL AFFECTIVE INTERACTION <

[45] J. L. Barron, D. J. Fleet, S. S. Beauchemin, and T. Burkitt, "Performance of optical flow techniques," Proc. Conf. Computer Vision and Pattern Recognition, Champaign, June 1992, pp. 236-242.
[46] Z. Ghahramani, "Learning Dynamic Bayesian Networks," Adaptive Processing of Temporal Information, Lecture Notes in Artificial Intelligence, Springer-Verlag, 1997.
[47] G. Donato, M. S. Bartlett, and J. C. Hager, "Classifying facial actions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, pp. 974-989, 1999.

