Robust Facial Expression Recognition Based on 3-d Supported Feature Extraction and SVM Classification

Robert Niese, Ayoub Al-Hamadi, Faisal Aziz and Bernd Michaelis
Institute for Electronics, Signal Processing and Communications (IESK)
Otto-von-Guericke-University Magdeburg, D-39016 Magdeburg, P.O. Box 4210, Germany
{Robert.Niese, Ayoub.Al-Hamadi}@ovgu.de

Abstract

Facial expression recognition is an important task in human-computer interaction systems. In this work we propose a new system for automatic expression recognition in video sequences. Our system uses color information to extract facial features. Additionally, it includes a camera model and a registration step in which a person-specific face model is built automatically from stereo. Photogrammetric techniques are applied to determine real-world geometric measures and to build the feature vector. The feature vector is normalized and a support vector machine (SVM) is trained on the normalized features. Using SVM classification we achieve minimal confusion between the different expression classes. Our framework achieves robust expression classification across a variety of head poses, with the resulting perspective foreshortening and changing face size. Moreover, the method has shown robustness across a variety of skin colors while reaching high performance.

1. Introduction

Over the last decade there has been a growing interest in advancing human-computer interaction (HCI). In this paper one core task in HCI is addressed, namely effective visual facial expression recognition. Its analysis provides information about emotions and person perception and gives insight into interpersonal behavior. It indexes physiological as well as psychological functioning and is essential in scenarios like patient state analysis or the evaluation of preverbal infants. Previously, human-observer based methods of facial expression analysis were labor intensive and difficult to reproduce across laboratories and over time. These factors motivate investigators to use generalized systems which are easy to adopt in any environment. To make valid, more accurate, quantitative measurements of facial expression in diverse applications, automated methods for recognition need to be developed. Generally, these systems include automatic feature extractors and trackers of changes in the facial features in static images and video sequences.


2. Related Work

The analysis of faces has received substantial attention in recent years. Applications include face recognition, facial expression analysis, face simulation, animation etc. [1]. Observing facial expressions is a natural way for humans to recognize emotions. Extensive studies on human expressions have laid a strong basis for the existence of universal facial expressions. Paul Ekman and his colleagues introduced the Facial Action Coding System (FACS) to cover all possible expressions in static images [2]. Many approaches have been described for facial expression analysis using still images and video sequences, the latter providing more information about expressive behavior. Black and Yacoob presented one of the earliest works, applying local parametric motion models to regions of the face and feeding the resulting parameters to a classifier [3]. Chang and colleagues used a low-dimensional manifold technique for modeling, tracking, and recognizing facial expressions [4]. They used the Lipschitz embedding to embed aligned facial features to build the manifold. Gabor wavelets were exploited for the detection of facial changes that occur during an expression; these changes were analyzed in the temporal domain for expression recognition by Pantic and colleagues [5]. Pose-invariant facial expression analysis was addressed by Kumano and colleagues, who used variable intensity templates. They showed that these templates differ between facial expressions and can be used for classification [6]. Bartlett et al. [7] presented user-independent recognition of facial actions in frontal faces in video streams using an Adaboost classifier and Support Vector Machines. Torre and colleagues used spectral graph techniques for facial shape and appearance features and analyzed them temporally for expressions [8]. Zeng and colleagues treated emotional facial expression as a one-class classification problem and separated it from non-emotional facial expressions [9]. Furthermore, they made a comparison between Gaussian-based classifiers and a support vector data description technique. Practically, the common hierarchy in all techniques is that feature extraction is done first from facial images or sequences, followed by the classification module.

The resultant class is one of the predefined expressions. In these systems, the variety usually lies in the feature extraction technique, which ranges from holistic to individual feature analysis. In the second step, the classifier plays an important role in discriminating the expression classes under observation. State-of-the-art techniques for facial expression classification are often limited to more or less frontal poses and small skin color variations. Further, they are often sensitive to global motion and the resulting perspective distortions. These restrictions limit the applicability, e.g. for human-machine interfaces. In our method, we overcome these restrictions by incorporating photogrammetric measuring techniques and feature normalization, which increase robustness.

3. Suggested System

In this paper, we present an automatic system for the recognition of facial expressions. The number of possible expressions is basically only limited by the available training data. Here we present the discrimination of five classes, i.e. four basic emotion expressions and the neutral expression. Our system is based on the analysis of color image sequences while using 3-d context information. In particular, photogrammetric techniques are applied for the determination of features which correspond to real-world measures. Therefore we apply camera calibration and use subject-specific surface model data, which is obtained with the help of a stereo camera system. In this way, we achieve independence from the current head pose and from varying face sizes due to head movements and perspective foreshortening, as in the case of out-of-plane rotations. In the following paragraphs, we introduce the components of our facial expression recognition system, starting with the detection of facial points in the image. Further, a brief explanation of the camera and surface model is given, which are required to determine the face pose and establish transformations between 3-d world and image data. On that basis a normalized feature vector is built and fed to an SVM classifier [10]. Finally, we present classification results under various test scenarios. Our analysis shows that the proposed representation of facial feature data leads to a superior classification of different expressions under real-world conditions, while reaching high performance.

3.1. Facial Points from Image Processing

Analogously to surveyors, we define fiducial points, which are used to establish correspondence between model and real-world data. However, unlike in surveying, it is not practical to place markers in the face. Thus, at least three points need to be found in the face, which firstly have to be visible across a wide range of perspectives and secondly must not change during an expression. Moreover, these points need to be well distributed in space and must be robustly detectable in the image. Practically, the only points that fulfill all of these requirements are the two center points of the eyes. Further, we involve the nose tip, which is determined through model support after pose estimation (see 3.4). For re-initialization of the system we also include the two mouth corners. This means we have two sets of fiducial image points in sub-pixel coordinates, I1 and I2 (Eq. 1, Fig. 1):

I1 = {ile, ire, in},   I2 = {ile, ire, ilm, irm},   ij ∈ ℝ².   (1)

Figure 1: Fiducial points

The initial Adaboost-based face detection is followed by the extraction of the eye center points ile/ire and the mouth corner points ilm/irm. Here we apply the well-established technique by Viola and Jones [11]; a color-based face detector could also be applied. The face detector is used to restrict the search space for the facial feature points and is only required for re-initialization of the system; afterwards the search space is limited by the previous face position. The eye and mouth detection is based on the evaluation of color information and the application of morphological operators. We use the HSV color model in order to exploit the variations of the facial feature areas in the different channels. In particular, the eyes are analyzed in the saturation channel, where they behave very differently from the surrounding skin pixels and show the lowest intensity values. Our method exploits this property. A clustering algorithm is applied to gather all low-saturated pixels, which gives the eye blob. Based on the eye contour, the centroid is easily calculated (Fig. 2).

Figure 2: Eye detection, a) RGB image, b) Saturation channel
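As an illustration of this eye localization step, the following sketch applies the saturation-channel clustering idea using OpenCV. It is a minimal reconstruction under our own assumptions (Otsu thresholding as the clustering step, an upper-half search region, illustrative function names), not the authors' exact implementation.

```python
import cv2
import numpy as np

def detect_eye_centers(face_bgr):
    """Sketch of eye-center extraction from the saturation channel.

    Assumes 'face_bgr' is a cropped face region (e.g., from a Viola-Jones
    detector). Thresholds and the upper-half restriction are illustrative
    choices, not the authors' exact parameters.
    """
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    saturation = hsv[:, :, 1]

    # Restrict the search to the upper half of the face, where the eyes lie.
    h, w = saturation.shape
    upper = saturation[: h // 2, :]

    # Eyes appear as low-saturation blobs compared to the surrounding skin.
    _, mask = cv2.threshold(upper, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

    # Cluster low-saturation pixels into blobs and keep the two largest.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:2]

    centers = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centers.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    # Sort left-to-right so the caller can label ile / ire consistently.
    return sorted(centers)
```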

Using facial symmetry, the mouth is detected in the lower part of the face. The analysis of the hue channel clearly depicts the difference between the mouth area and the surrounding facial area. Contrast enhancement augments the discrimination of mouth pixels in the hue channel. Mouth points are clustered and the corner points are detected as the outermost points on the mouth contour; the detection is restricted to an axis parallel to the eye axis (Fig. 3). This method has also been shown to work under changing conditions, such as the appearance of teeth.

Figure 3: Mouth, a) Contrast enhanced hue channel, b) Contour with corner points ilm/irm
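A corresponding sketch for the mouth corner detection is given below. It assumes the eye centers are already available and uses histogram equalization plus Otsu thresholding as stand-ins for the contrast enhancement and clustering steps described above; names and thresholds are illustrative.

```python
import cv2
import numpy as np

def detect_mouth_corners(face_bgr, eye_left, eye_right):
    """Sketch of mouth-corner detection in the hue channel.

    'eye_left'/'eye_right' are the previously found eye centers; they are
    used only to define the eye-axis direction along which the outermost
    contour points are taken. Thresholds are illustrative assumptions.
    """
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]

    h, w = hue.shape
    lower = hue[h // 2 :, :]

    # Contrast enhancement sharpens the mouth/skin separation in hue.
    enhanced = cv2.equalizeHist(lower)

    # Cluster mouth pixels; Otsu is a stand-in for the paper's clustering step.
    _, mask = cv2.threshold(enhanced, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    mouth = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(float)
    mouth[:, 1] += h // 2  # back to full-face coordinates

    # Corner points = outermost contour points along the eye-axis direction.
    axis = np.asarray(eye_right, float) - np.asarray(eye_left, float)
    axis /= np.linalg.norm(axis)
    proj = mouth @ axis
    ilm, irm = mouth[np.argmin(proj)], mouth[np.argmax(proj)]
    return tuple(ilm), tuple(irm)
```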

In the 2-d image processing part of our system we further detect two eyebrow points ileb/ireb. Eyebrow detection becomes easy with the information of the eye centroid locations. Using the axis through the eye centers, the eyebrows are searched in the normal direction of that axis. In the proposed approach, grey level information is acquired in the area above the eye center. Contrast is enhanced in order to avoid illumination problems, which results in a clear discrimination between the forehead and the eyebrow. Application of the vertical Sobel operator creates edges in this area, while region growing on the derivative image gives rise to certain contours. The eyebrows are selected under the following conditions:

• Eyebrows are the largest elliptical contours.
• The centroid of each eyebrow contour is one of the vertically highest centroids of all contours.

Application of these conditions helps to get rid of false contour selections. Results show that this simple technique works very well even in rather bad lighting conditions, with different eyebrow shapes and differently colored eyebrows (light brown, dark brown, black).

Figure 4: Eyebrow detection, a) Gradient, b) Color image

3.2. Camera Model

The suggested system applies a pinhole camera model to simulate a set of fundamental properties of the image capturing device. The camera model K (Eq. 2) is the basis for our world-to-image fitting technique, as well as for transformations between the world and image space. The camera parameters are gained in a calibration procedure in which we determine external and internal parameters. Calibration is a well known approach in photogrammetry and surveying [12]. In particular we use a calibration target with coded circles and the standard bundle block adjustment technique. The set of six external parameters PE describes the position and orientation of the camera in space, relative to a world coordinate system. Further, we include a set of six non-linear internal parameters PI that describes the geometrical camera properties (Fig. 5):

K = {PE, PI},
PE = {tcx, tcy, tcz, rcω, rcϕ, rcκ}, with tc as translation and rc as rotation,
PI = {c, sy, hx, hy, a1, a2}, with c as focal length, sy as pixel ratio, hx, hy as principal point in pixels, and a1, a2 as radial symmetrical distortion.   (2)

Figure 5: Parameters of the camera model

Using the camera parameters PI and PE, the transformation of 3-d world points to image points is well described through projective geometry. An intermediate conversion to the camera coordinate system is required here [13]. We refer to the total transformation of a world point pt to a sub-pixel image coordinate it as k():

it = k(pt, K),   pt ∈ ℝ³, it ∈ ℝ².   (3)

The inverse function k⁻¹ (Eq. 4) transforms an image point it to the world point pt. Since it has only two coordinates, an additional parameter dt is required, which is the distance on the viewing ray that goes through the image plane at it:

pt = k⁻¹(it, dt, K),   dt ∈ ℝ; pt, it, K see (3).   (4)
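To make the roles of k() and k⁻¹ concrete, the following sketch implements an idealized pinhole projection and its inverse. It omits the radial distortion parameters a1, a2 and assumes the external parameters are already given as a rotation matrix R and translation t; all function names are illustrative, not part of the paper.

```python
import numpy as np

def k_project(p_world, R, t, c, sy, hx, hy):
    """Forward projection k(): world point -> sub-pixel image coordinate.

    Minimal pinhole sketch: R, t are the external parameters (rotation
    matrix and translation built from PE), c is the focal length in pixels,
    sy the pixel aspect ratio, (hx, hy) the principal point. The radial
    distortion terms a1, a2 of the paper's model are omitted here.
    """
    p_cam = R @ p_world + t          # world -> camera coordinates
    x = c * p_cam[0] / p_cam[2] + hx
    y = c * sy * p_cam[1] / p_cam[2] + hy
    return np.array([x, y])

def k_inverse(i_img, d, R, t, c, sy, hx, hy):
    """Inverse function k^-1(): image point plus distance d -> world point.

    'd' is the distance along the viewing ray through the image plane at
    i_img, e.g. obtained from an intersection with the surface model.
    """
    # Viewing ray direction in camera coordinates.
    ray_cam = np.array([(i_img[0] - hx) / c,
                        (i_img[1] - hy) / (c * sy),
                        1.0])
    ray_cam /= np.linalg.norm(ray_cam)
    # Transform the ray to world coordinates; the camera center is -R^T t.
    center = -R.T @ t
    ray_world = R.T @ ray_cam
    return center + d * ray_world
```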

3.3. Surface Model

The transformation of image points to real-world coordinates plays an important role for the subsequent feature extraction. As shown in Eq. 4, a distance value that corresponds to the depth of the scene is required. This depth value is retrieved by measuring the distance from the camera plane to the surface model at the present world pose. For this purpose we introduce the person-specific mesh surface model M, which consists of a set of vertices vi and triangle indices wj (Eq. 5):

M = {v1, …, vn, w1, …, wm},   vi ∈ ℝ³,   wj ∈ ℕ³.   (5)

There are various techniques for creating face models, e.g. morphable models [14] or accurate striped lighting methods [15]. These approaches require additional light projection equipment or a high amount of manual interaction. Here we use an automatic method [16] that exploits depth and additional color image data taken with the passive stereo camera system. For this initial registration step the observed subject is captured one time in frontal pose and with neutral expression. After creation of model M, the nose tip an is determined through evaluation of the 3-d model shape. Further, the fiducial image points of set I2 (Eq. 1) are detected in the corresponding image and projected to model M using Eq. 4. From the resulting 3-d points we define two sets of model anchor points A1 and A2, which are subsequently used for model pose determination (Eq. 6, Fig. 6):

A1 = {ale, are, an},   A2 = {ale, are, alm, arm},   aj ∈ ℝ³.   (6)

Figure 6: Anchor points

3.4. Pose Estimation

The current face position and orientation is defined through the model pose parameter set Tv (Eq. 7). It contains three translation and three rotation components. From the current parameter set Tv we derive the corresponding pose matrix TM (Eq. 8), simply by multiplying the basis matrices for the current translations and rotations [13]:

Tv = {tx, ty, tz, rω, rϕ, rκ},   ti, rj ∈ ℝ,   (7)

TM = (mr,c)(4×4),   mr,c ∈ ℝ,   r, c ∈ {1..4}.   (8)

According to camera model K (Eq. 2) and the current pose transformation matrix TM, the image projection of the transformed anchor points aj is determined using Eq. 3. The goal is now to reduce the error measure e (Eq. 9) while the projected 3-d anchor points are approximated to their corresponding fiducial image points ij (Eq. 1) in an iterative least squares approach. Correspondence is known a priori from the preceding image processing part.

e = Σ_{j=1}^{N} || ij - k(TM · aj) ||²  →  min,   (9)

where ij ∈ ℝ², aj ∈ ℝ³, TM is the current homogeneous transformation matrix of the model pose, k is the world-to-image projection function (see Eq. 3) and N is the number of corresponding model and image points. Until convergence, the pose parameters are optimized with every iteration of the differential fitting approach. Afterwards, matrix TM is used to transform the surface model M to the current world pose. For initialization and re-initialization in longer image sequences, the fitting process is performed between the point sets I2 and A2, thus eyes and mouth points. After that we determine the world position of the surface model's 3-d nose point an, which is then projected to the image point in. Hereafter, we apply a tracking of the nose point. For that purpose we track a grid of points at the nose tip image point in using the LK tracker [17]. This tracking has proved to be stable also during rotation. After a specified number of frames the system is re-initialized. During the tracking we compute the face pose on the basis of the point sets I1 and A1 (Eq. 1, Eq. 6), thus eyes and nose point.
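A minimal sketch of the least-squares pose fit of Eq. 9 is given below. It uses SciPy's least_squares as a stand-in for the paper's differential fitting approach, an assumed XYZ Euler angle convention for the rotation components, and the hypothetical k_project() helper from the camera model sketch above.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def fit_pose(anchor_points, image_points, cam):
    """Minimize e = sum_j || ij - k(TM * aj) ||^2 over the six pose parameters.

    'anchor_points' are the model anchor points aj (N x 3), 'image_points'
    the corresponding fiducial image points ij (N x 2), and 'cam' a dict with
    the camera parameters expected by the hypothetical k_project() sketch.
    """
    def residuals(pose):
        tx, ty, tz, rw, rp, rk = pose
        # Euler convention is an assumption; the paper only names rω, rϕ, rκ.
        R_pose = Rotation.from_euler("xyz", [rw, rp, rk]).as_matrix()
        res = []
        for a, i in zip(anchor_points, image_points):
            a_world = R_pose @ a + np.array([tx, ty, tz])   # TM * aj
            proj = k_project(a_world, cam["R"], cam["t"],
                             cam["c"], cam["sy"], cam["hx"], cam["hy"])
            res.extend(proj - i)                            # k(TM*aj) - ij
        return np.asarray(res)

    # Start from the previous frame's pose (here: identity) and iterate.
    result = least_squares(residuals, x0=np.zeros(6))
    return result.x   # optimized pose parameters Tv
```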

3.5. Facial Feature Points in 3-d

Using context information of the current face pose, we can extract additional facial feature points in the image. This context information is included by projecting the vertical pose axis to the image; thus, the image search space is constrained. In particular, two points iul and ill on the upper and lower lip are found on the mouth contour (see 3.1) by searching from the center of the mouth along the projected pose axis. The total set of image feature points If (Eq. 10) is transformed to world coordinates using Eq. 4 (Fig. 7). This results in the facial feature point set Pf (Eq. 11). The depth values required by Eq. 4 are gained from an intersection test: a viewing ray is cast through the virtual camera image plane at pixel coordinate ij towards the scene, and the depth value is returned as the distance from the virtual image plane to the surface model, which lies at the current pose.

If = {ile, ire, ileb, ireb, ilm, irm, iul, ill},   ij ∈ ℝ²,   (10)

Pf = {ple, pre, pleb, preb, plm, prm, pul, pll},   pj ∈ ℝ³.   (11)

Figure 7: Facial feature points, a) Image, b) World projection
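The intersection test that supplies the depth value dt for Eq. 4 can be sketched with a standard ray-triangle test. The following code uses the Moeller-Trumbore algorithm over all mesh triangles; the brute-force loop and the function name are illustrative simplifications (a real system would use an acceleration structure).

```python
import numpy as np

def ray_mesh_depth(origin, direction, vertices, triangles, eps=1e-9):
    """Return the smallest distance along the viewing ray to the surface model.

    'origin' and 'direction' describe the viewing ray in world coordinates
    (direction normalized), 'vertices' is the model's vertex array (n x 3) and
    'triangles' the index triples (m x 3). Uses the Moeller-Trumbore test;
    returns None if the ray misses the mesh.
    """
    best = None
    for tri in triangles:
        v0, v1, v2 = vertices[tri[0]], vertices[tri[1]], vertices[tri[2]]
        e1, e2 = v1 - v0, v2 - v0
        pvec = np.cross(direction, e2)
        det = e1 @ pvec
        if abs(det) < eps:            # ray parallel to the triangle plane
            continue
        tvec = origin - v0
        u = (tvec @ pvec) / det
        if u < 0.0 or u > 1.0:
            continue
        qvec = np.cross(tvec, e1)
        v = (direction @ qvec) / det
        if v < 0.0 or u + v > 1.0:
            continue
        d = (e2 @ qvec) / det         # distance along the ray
        if d > eps and (best is None or d < best):
            best = d
    return best
```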

3.6. Feature Vector

Based on point set Pf (Eq. 11) we determine the ten-dimensional feature vector F (Eq. 12). The features comprise six Euclidean 3-d distances across the face and four angles, which describe the current mouth shape. The raising/lowering of the eyebrows is captured by distances d1 and d2. The distances between the mouth corners and the eye centers (d3 and d4) reveal the mouth movement. The widening and opening of the mouth are represented by d5 and d6. Further, four angles aj are determined between the mouth points.

F = (d1 … d6 a1 … a4)^T,   (12)

where:
d1 = || preb - pre ||,   d2 = || pleb - ple ||,
d3 = || pre - prm ||,    d4 = || ple - plm ||,
d5 = || prm - plm ||,    d6 = || pul - pll ||,

a1 = arccos( (v1 · v2) / (||v1|| ||v2||) ),   a2 = arccos( (v1 · v3) / (||v1|| ||v3||) ),
a3 = arccos( (-v1 · v4) / (||v1|| ||v4||) ),  a4 = arccos( (-v1 · v5) / (||v1|| ||v5||) ),

with:
v1 = plm - prm,   v2 = pul - prm,   v3 = pll - prm,   v4 = pul - plm,   v5 = pll - plm.

In the initial registration step of our system, we capture the face of the subject with neutral expression. After the surface model has been created (see 3.3), the feature vector Fneutral is determined. On the basis of this neutral configuration the current expression is analyzed. For this purpose we compute the ratios between the components of F and Fneutral, resulting in Fratio (Eq. 13):

Fratio = F / Fneutral.   (13)

Analysis has been carried out for numerous subjects and expressions. We determined statistical parameters for the mean and standard deviation of the feature ratios. Adapting Box-and-Whisker plotting [18] in terms of mean/std. dev., we obtained minimum and maximum values for each feature distribution (Eq. 14). These min/max values are used to normalize the data, which leads to the final feature vector Fnorm (Eq. 15):

Min = μ - 2σ,   Max = μ + 2σ,   xnorm = (xi - Min) / (Max - Min),   (14)

with xi as the feature vector of ratios and μ, σ as mean and standard deviation, and

Fnorm = norm(Fratio).   (15)
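The following sketch assembles the feature vector F of Eq. 12 and applies the normalization of Eqs. 13-15. The dictionary keys and the origin of the per-feature Min/Max arrays (assumed to come from the training statistics) are illustrative assumptions.

```python
import numpy as np

def angle(u, v):
    """Angle between two vectors via arccos of the normalized dot product."""
    return np.arccos(np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)),
                             -1.0, 1.0))

def feature_vector(P):
    """Build the ten-dimensional vector F = (d1..d6, a1..a4) from the 3-d
    feature points; P is a dict keyed by point name (e.g. 'pre', 'plm')."""
    d = [np.linalg.norm(P["preb"] - P["pre"]),   # d1: right eyebrow raise
         np.linalg.norm(P["pleb"] - P["ple"]),   # d2: left eyebrow raise
         np.linalg.norm(P["pre"] - P["prm"]),    # d3: right mouth corner to eye
         np.linalg.norm(P["ple"] - P["plm"]),    # d4: left mouth corner to eye
         np.linalg.norm(P["prm"] - P["plm"]),    # d5: mouth width
         np.linalg.norm(P["pul"] - P["pll"])]    # d6: mouth opening
    v1 = P["plm"] - P["prm"]
    v2 = P["pul"] - P["prm"]
    v3 = P["pll"] - P["prm"]
    v4 = P["pul"] - P["plm"]
    v5 = P["pll"] - P["plm"]
    a = [angle(v1, v2), angle(v1, v3), angle(-v1, v4), angle(-v1, v5)]
    return np.array(d + a)

def normalize_features(F, F_neutral, mins, maxs):
    """F_ratio = F / F_neutral, then min/max scaling with the per-feature
    Min = mu - 2*sigma and Max = mu + 2*sigma from the training statistics."""
    F_ratio = F / F_neutral
    return (F_ratio - mins) / (maxs - mins)
```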

3.7. Classification

After building the feature vector, classification of the expression is performed. In our problem we are dealing with five expressions, i.e. Happy, Surprise, Anger and Disgust along with the neutral state. For training and classification, the Support Vector Machine implementation of the package LIBSVM has been applied [19]. Support Vector Machines (SVMs) are known to model data in a very efficient way. The SVM is a well-suited classifier for the task of expression classification because it is robust to the curse of dimensionality. SVMs maximize the hyperplane margin between classes. They map the input space into a richer feature space in which the classes are more likely to be linearly separable. This mapping does not affect the training time because of the implicit dot product and the kernel trick [10, 20]. In principle, the SVM technique finds, among the candidate hyperplanes, the one with the maximum margin (Fig. 8). The margin is defined by the support vectors, which lie on the boundary of a class. The SVM's linearly learned decision function f(x) is described as

f(x) = sign(w · x + b),   (16)

with weight vector w, threshold b and input sample x.

Figure 8: SVM Classifier, maximization of margin

In our problem we use the RBF (Gaussian) kernel, which is well suited to the current number of features and gives the best results compared to other kernels. Further, the RBF kernel performs at least as well as a linear SVM can [21, 22]. Since the SVM can be sensitive to feature scaling, we train it on the normalized data. The SVM returns the class for an input sample according to Eq. 16. We further compute the probability pj for every class j based on the training model and the method of pairwise coupling [23].
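The paper trains with LIBSVM; as an illustration, the sketch below uses scikit-learn's SVC, which wraps LIBSVM, with an RBF kernel and probability estimates. File names, the class encoding and the C/gamma values are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: rows are normalized feature vectors Fnorm,
# labels are the five expression classes.
X_train = np.load("fnorm_train.npy")    # shape (n_samples, 10), placeholder file
y_train = np.load("labels_train.npy")   # e.g. 0=Neutral, 1=Happy, 2=Surprise, 3=Anger, 4=Disgust

# RBF-kernel SVM; probability=True enables class probabilities, which LIBSVM
# derives for the multi-class case via pairwise coupling. C and gamma are
# illustrative values, not the paper's settings.
clf = SVC(kernel="rbf", C=10.0, gamma="scale", probability=True)
clf.fit(X_train, y_train)

# Classify a new normalized feature vector and inspect the class probabilities.
f_norm = np.load("fnorm_test_sample.npy").reshape(1, -1)
predicted_class = clf.predict(f_norm)[0]
class_probabilities = clf.predict_proba(f_norm)[0]
print(predicted_class, class_probabilities)
```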

3.8. Training and Test Data

We trained the SVM classifier on our database, which contains about 3500 training samples (images) of the expressions from ten persons. Other databases cannot readily be used, since we require calibrated cameras and subject-specific surface models, which we obtain from an initial stereo scan. Training is done for five expressions including the neutral face. The expressions of fear and sadness were not included due to the lack of proper image data. The presented classification results are based on 1000 test samples, i.e. image sequences of about 50 frames in length, starting from the neutral expression. For testing we use data from different subjects than in the training phase. Our test scenarios contain pose variations including out-of-plane rotation.

4. Experimental Results

We analyzed our results in terms of feature robustness and classification accuracy. For that purpose we tested our feature vector Fnorm under in-plane and out-of-plane face rotations of up to ±25 degrees and backward-forward head motion leading to a variation in image size by a factor of 1.5. Further, our test data contains a great variety of skin colors. We also compared the feature vector F against the ground truth feature vector Fneutral during head motion with neutral expression. Here we found that there is only a small deviation d