Recognizing Continuous Grammatical Marker Facial Gestures in Sign Language Video


Tan Dat Nguyen¹ and Surendra Ranganath²

¹ Dept. of Electrical & Computer Engineering, National University of Singapore, Singapore 117576
[email protected]
² Indian Institute of Technology – Gandhinagar, India 382424
[email protected]

* This work is partially supported by project grant NRF2007IDM-IDM002-069 on "Life Spaces" from the IDM Project Office, Media Development Authority of Singapore.

Abstract. In American Sign Language (ASL) the structure of signed sentences is conveyed by grammatical markers which are represented by facial feature movements and head motions. Without recovering grammatical markers, a sign language recognition system cannot fully reconstruct a signed sentence. However, this problem has been largely neglected in the literature. In this paper, we propose to use a 2-layer Conditional Random Field model for recognizing continuously signed grammatical markers in ASL. This recognition requires identifying both facial feature movements and head motions while dealing with uncertainty introduced by movement epenthesis and other effects. We used videos of the signers’ faces, recorded while they signed simple sentences containing multiple grammatical markers. In our experiments, the proposed classifier yielded a precision rate of 93.76% and a recall rate of 85.54%.

1 Introduction

Interpreting sign language requires not only recognition of hand gestures/signs, but also of other non-manual signs. As pointed out in [1], non-manual signs convey important grammatical information. Without these grammatical markers, the same sequence of hand gestures can be interpreted differently. For example, with the hand signs for BOOK and WHERE, two sentences can be framed as:

– [BOOK]TP [WHERE]WH → Where is the book?
– [BOOK]TP [WHERE]RH → I know where the book is!

In this notation, the left hand side of each arrow represents signs in American Sign Language (ASL). The subscripts TP, WH, and RH on the words BOOK and WHERE indicate grammatical markers conveyed by facial feature movements and head motions. The facial gesture for Topic (TP) conveys that BOOK is the topic of the sentence. The word WHERE, accompanied by a WH facial gesture, signals a "where?" question. The hand sign for WHERE made concurrently with the facial gesture for RH indicates the rhetorical nature of the second sentence. Thus, recognition of non-manual signs is required for building a complete sign language understanding system.


However, a survey of sign language recognition [2] indicates that the dominant interest has been in hand gesture recognition; non-manual sign recognition has only recently started to receive attention [3] [4]. Previous works on recognizing facial expressions were reviewed in [5] and [6]. These surveys showed that many works focused on recognizing the six isolated universal expressions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise) with minimal head motion. The latter simplification makes these methods inapplicable for recognizing facial gestures in sign language, where facial expressions occur concurrently with head motion to define grammatical markers. There are also many works on analyzing head pose and head motion [7]. However, few works in the literature address recognizing facial expressions coupled with concurrent head motion.

Black and Yacoob's work [8] is a pioneering effort in recognizing continuous facial expressions with head motion, based on features from dense optical flow and rule-based discriminative models. They obtained average recognition rates of 88% and 73% on laboratory data and data from TV programs, respectively. De la Torre et al. [9] proposed to detect rare facial gestures made during an interview based on a Personalized Active Appearance Model [10]; however, a quantitative assessment of the detection was not reported. Cohen et al. [11] used a piecewise 3D wire-frame model-based approach for tracking 16 facial features and estimated their 3D motions. These were used in a multi-level HMM scheme for classifying the six universal expressions and the neutral expression in video sequences containing multiple expressions. They reported 82.46% and 58.63% accuracy for person-dependent and person-independent tests, respectively, on their database of 5 persons.

As generative models, HMMs suffer from two weaknesses: the statistical independence assumption of observations and the difficulty of modeling their complicated underlying distributions. On the other hand, the Conditional Random Field (CRF) proposed by Lafferty et al. [12] is a discriminative model which avoids these weaknesses. Kanaujia and Metaxas [13] used the CRF to recognize the six universal expressions and obtained promising results. Quattoni et al. [14] proposed Hidden-state CRF (HCRF) models and obtained an accuracy of 85.25% for recognizing head shakes and head nods. Chang et al. [15] proposed a modified HCRF called Partially-Observed HCRF (PO-HCRF), which achieved an accuracy of 80.1% with a 9.18% false alarm rate for recognizing the six "continuous" universal facial expressions in simulated sequences created by concatenating sequences of isolated expressions.

Neidle et al. [4] proposed to detect the presence of WH and NEG grammatical markers in ASL signed sentences. An ASM-based tracking scheme was used to track face and facial feature movements, and to provide head pose (pitch, yaw, and tilt) in each frame. Each video frame was classified as either WH or not-WH, and a video sequence was labeled based on majority voting of frames. A multiple-SVM classifier was used to label each frame. The recognition accuracies were 100% and 95% for WH and NEG, respectively.


In this paper, we consider recognizing continuous facial gestures in sign language, particularly grammatical markers in ASL. The six grammatical markers considered in this paper are summarized in Table 1 in terms of eye, eyebrow, and head movements. We propose to use a layered Conditional Random Field (CRF) model [12] for this purpose. The classifier includes two CRF layers: the first layer models head motions and the second models grammatical markers. The separate head motion layer helps to reduce the ambiguity in recognizing grammatical markers in the second layer. For each video sequence, probabilities of different head motions are evaluated by the first layer, and these are input to the second layer together with other features for labeling the grammatical marker in each frame. Manually annotated labels of head motions and grammatical markers were used for training the classifier and assessing performance. The classifier yielded precision and recall rates of 93.76% and 85.54%, respectively.

2 Recognizing Continuous Facial Gestures in Sign Language

2.1 Challenges

Facial gestures in ASL are identified from head motion and facial feature movement. In this paper we consider recognition of the six grammatical markers listed and described in Table 1, through their head gestures comprising eye, eyebrow, and head movements. In previous work [16], we considered recognition of isolated facial gestures. Here, we extend that work to recognition of continuous facial gestures as would occur in sign language discourse, and consider four types of facial gesture sequences (Table 2) composed of these grammatical markers. Examples of these facial gesture chains are shown in Table 3.

Table 1. Simplified description of the six ASL grammatical markers (Exp.) considered: Assertion (AS), Negation (NEG), Rhetorical (RH), Topic (TP), Wh question (WH), and Yes/No question (YN). Nil denotes unspecified facial feature movements.

Exp.  Brow   Eye     Head
AS    Raise  Nil     Nod
NEG   Knit   Nil     Shake
RH    Raise  Widen   Tilt (left/right)
TP    Raise  Widen   Move upward
WH    Knit   Squint  Move forward
YN    Raise  Widen   Move forward


There are several aspects of the continuous facial gesture recognition problem which make it challenging, more so than isolated recognition. Movement epenthesis is the extra motion required by the head (and facial features), due to physical constraints, to transit from the end of the previous gesture to the beginning of the next; this is difficult to model due to its variability. Coarticulation refers to the appearance of a head gesture being influenced by adjacent gestures. There can also be asynchronization between head motion and facial feature movement.

Movement epenthesis between grammatical markers is shown in Fig. 1. Table 3 shows examples of grammatical marker chains; any facial gesture video frame that does not contain one of the six grammatical marker classes is labeled as Unidentified. This is a generic class which includes gestures between two grammatical markers, and also the neutral expression, which is usually present at the beginning of a sequence.

Visually, the beginning and ending of an expression can be considered to coincide with the beginning and ending of the head motion corresponding to that expression. However, while signing, movements of facial features like brows and eyes are independent and may evolve asynchronously with the head motion. This asynchronization adds to the uncertainty in identifying a facial gesture from a combination of features from head motions and facial feature movements. An effective strategy to deal with this problem is to use multi-channel frameworks [17], where the classifier learns the correlations between the channels through supervised training.

Movement epenthesis between grammatical markers also introduces additional variability. This is manifested through the head tending to move back to the neutral position before comfortably starting the next motion. Besides, if expressions have similar eye/brow movements, some subjects tend to hold the state established in one expression into the next expression, while others do not. This phenomenon alters the temporal patterns of eye/brow movements and affects algorithm performance. The movements of the eyes and brows can be further affected by factors that are not related to the facial gestures of interest: natural eye blinks, and hand signs for adjectives such as HUNGRY or FAST that involve added facial expressions. Moreover, unidentified gestures between facial gestures of interest are highly varied due to combinations of movement epenthesis and other effects.

Thus, it would be ineffective to model the sequences using generative models like HMMs. A discriminative model may be better suited to this scenario, and we propose to use a 2-layer CRF model to handle head motion and facial expression towards recognizing continuous grammatical markers. The use of a 2-layer model is also motivated by the experimental data that we gathered, which showed that in spite of movement epentheses, head motions are more consistent than the corresponding facial feature movements.

2.2 Layered Conditional Random Field Model

The CRF is a discriminative probabilistic model proposed by Lafferty et al. [12] which can be trained to assign a sequence of predefined labels to a sequence of observations.


Table 2. Types of grammatical marker sequences considered.

Sequence   English sentence                               ASL signs
TP AS      I really want the book!                        [BOOK]TP [WANT]AS
TP NEG     I don't want the book.                         [BOOK]TP [WANT]NEG
TP RH AS   I know where the game is! It's in Singapore.   [GAME]TP [WHERE]RH [SINGAPORE]AS
TP WH YN   Where is the game? Is it in New York?          [GAME]TP [WHERE]WH [NEW YORK]YN

[Fig. 1: four frames labeled Unidentified (Neutral), Topic, Undefined, Rhetorical.]

Fig. 1. When the Rhetorical gesture is performed after a Topic gesture, the head moves from the backward position to the neutral position before tilting forward (movement epenthesis) while the brows are still held raised.

Its evaluation function is composed of weighted potential functions which can utilize not only features extracted from the observations but also their interactions and temporal dependencies. In the linear-chain model, the probability of a label sequence y given an observation sequence x is computed as

$$p(\mathbf{y}\,|\,\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \left[ \sum_{i=1}^{N} \lambda_i f_i(y_t, \mathbf{x}) + \sum_{j=1}^{M} \mu_j g_j(y_t, y_{t-1}, \mathbf{x}) \right] \right) \qquad (1)$$

where $f_i$ and $g_j$ are potential functions that evaluate the interaction and temporal dependencies among features, respectively, $\lambda_i$ and $\mu_j$ are weights estimated from training data, and $Z(\mathbf{x})$ is a normalization factor. It was shown in [12] that the right hand side of Eq. 1 is a convex function parameterized by $\lambda_i$ and $\mu_j$, whose global optimum can be obtained by using iterative scaling algorithms or gradient-based methods. CRFs, which avoid the assumption of statistical independence of observations, have shown better performance than HMMs in many applications [12] [14].

We use a layered model of the chain CRF (Fig. 2) to recognize continuous facial gestures in ASL. The probabilities of head motion labels are evaluated by a CRF in the first layer. These probabilities are passed to the second layer, where the other facial feature channels are also integrated. The second-layer CRF is trained on these integrated features to provide grammatical marker labels for the frames in the test video sequences.

Our observations show that the transition from one type of head motion to another can include movement epenthesis but not many other effects. Thus we choose to model movement epentheses explicitly, together with meaningful head motions. Currently, we use 16 head motion labels (both meaningful head motions and their movement epentheses), as described in Table 4, covering all combinations of head motions which occur in conjunction with the six grammatical markers of interest.
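To make Eq. 1 concrete, the following minimal NumPy sketch evaluates p(y|x) for a toy linear-chain CRF; either layer of the proposed model instantiates this same form. It is only an illustration of the formula, not the authors' implementation (training in this paper used the Matlab CRF Toolbox [19]). The feature functions, weights, and observations below are hypothetical, the frame index t is passed to the features purely for convenience, and Z(x) is computed by brute-force enumeration, which is adequate only for very short sequences.

```python
import itertools
import numpy as np

def sequence_score(y, x, state_feats, trans_feats, lam, mu):
    """Unnormalized log-score of Eq. 1: weighted state and transition features summed over t."""
    score = 0.0
    for t in range(len(x)):
        score += sum(l * f(y[t], x, t) for l, f in zip(lam, state_feats))
        if t > 0:
            score += sum(m * g(y[t], y[t - 1], x, t) for m, g in zip(mu, trans_feats))
    return score

def crf_probability(y, x, labels, state_feats, trans_feats, lam, mu):
    """p(y|x) from Eq. 1; Z(x) obtained by enumerating all label sequences (toy-sized only)."""
    numerator = np.exp(sequence_score(y, x, state_feats, trans_feats, lam, mu))
    Z = sum(np.exp(sequence_score(list(seq), x, state_feats, trans_feats, lam, mu))
            for seq in itertools.product(labels, repeat=len(x)))
    return numerator / Z

# Toy example with two labels, scalar observations, and hypothetical features/weights.
labels = (0, 1)
state_feats = [lambda yt, x, t: float(yt == 1 and x[t] > 0.5)]   # f_1(y_t, x)
trans_feats = [lambda yt, yp, x, t: float(yt == yp)]             # g_1(y_t, y_{t-1}, x)
lam, mu = [1.2], [0.8]
x = [0.9, 0.1, 0.7]
print(crf_probability([1, 0, 1], x, labels, state_feats, trans_feats, lam, mu))
```

In practice Z(x) is computed with the forward algorithm rather than by enumeration, and the weights are learned by maximizing the conditional log-likelihood, as described above.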


Table 3. Examples of four types of grammatical marker chains. The neutral expression shown in the first frame is considered to be an unidentified expression. An unidentified facial gesture can also be present between any two grammatical markers and can vary greatly depending on nearby grammatical markers.

[Table 3 image rows, with frame labels: (1) Unidentified, Topic, Unidentified, Assertion; (2) Unidentified, Topic, Unidentified, Negation; (3) Unidentified, Topic, Rhetorical, Assertion; (4) Unidentified, Topic, Wh question, Unidentified, Yes/No question.]

In manually annotating the frames, besides the head motion label, each video frame in the data set is also labeled with one of seven facial gesture labels: AS, NEG, RH, TP, WH, YN, and Und. The label Und is assigned to frames with unidentified expressions. As shown in Table 4, head motions with labels such as "Back from X" are defined to explicitly model movement epentheses. Exceptional cases are labels 7, 9, and 11, which are constituents of multi-part head motions: head shake and head nod. The Neutral label appears mostly at the beginning of the video sequences. During facial gestures, the head does move past the neutral position but does not stop; frames in which the head is temporarily at the neutral position are also annotated with the Neutral label. The label Still plays an important role in segmenting meaningful head motions and their movement epentheses (Back from X), because there is usually a short (or even long) pause between a meaningful head motion and its "Back from" movement.

Motion of the head and facial features is obtained from the tracked feature points (shown in Fig. 3) using an enhanced version of the robust tracking algorithm developed by the authors [16]. The feature points are placed at both rigid and non-rigid facial locations, and distances between them are extracted and used for recognition. These distances (shown in Fig. 4) are: (a) five eyebrow parameters: left inner brow height (B_IL), right inner brow height (B_IR), left middle brow height (B_ML), right middle brow height (B_MR), and distance between brows (B_B); and (b) two eye parameters: left eye height (the sum of E_BL and E_TL) and right eye height (the sum of E_BR and E_TR).


Table 4. Head motion labels used to train the CRF at the first layer.

No.  Label                            Meaning
1    Neutral (Neu)                    Head at normal position
2    Forward (Fw)                     Head moves forward
3    Back from Forward (BfF)          Head moves from forward position to neutral position
4    Backward (Bw)                    Head moves backward
5    Back from Backward (BfB)         Head moves from backward position to neutral position
6    Turn left (TL)                   Head turns left, usually a part of head shake
7    Back from Turn left (BfTL)       Head pose changes from leftward to frontal
8    Turn right (TR)                  Head turns right, usually a part of head shake
9    Back from Turn right (BfTR)      Head pose changes from rightward to frontal
10   Move down (MD)                   Head moves down, usually a part of head nod
11   Back from Move down (BfMD)       Head pose changes from downward to frontal, usually a part of head nod
12   Still                            Head is kept still
13   Forward left (FL)                Head moves forward and slightly turns left
14   Back from Forward left (BfFL)    Head pose changes from leftward to frontal and head moves from forward to neutral position
15   Forward right (FR)               Head moves forward and slightly turns right
16   Back from Forward right (BfFR)   Head pose changes from rightward to frontal and head moves from forward to neutral position

[Fig. 2 diagram: the first-layer CRF ("CRF model for head motions") takes the head motion observations x_H and produces head motion labels y_H(t-1), y_H(t), y_H(t+1); the probabilities of the head motion labels at each frame, together with the facial feature movement observations x_E, feed the second-layer CRF ("CRF model for facial expressions"), which outputs the sequence of expression labels {y_E1, y_E2, ..., y_En}, one per frame.]

Fig. 2. Layered CRF for recognizing continuously signed grammatical markers in sign language.

Fig. 3. Feature points of interest.

Fig. 4. Distance features used.


A reference line is defined as the line passing through the two inner eye corners, and the height parameters are the perpendicular distances of the feature points from this line. All distance parameters are normalized with respect to their corresponding values in the first frame to remove scaling effects across video sequences.

To recognize head motions, tracks of non-deformable facial feature locations, namely the two inner eye corners (E_L3, E_R3) and the middle of the nose (N_2), are used to define three features: S_M, the area of the triangle formed by the above three locations in each frame, and C_Mx, C_My, the components of the 2D motion vector C_M of the triangle's center of gravity (the frame-to-frame motion vector is $v_{t+1} = (x_{t+1}, y_{t+1}) - (x_t, y_t)$). S_M and C_M are normalized by the distance E_M0 between the two inner eye corners in the first frame: $C^t_{Mn} = C^t_M / E_{M0}$ and $S^t_{Mn} = S^t_M / E_{M0}^2$. These three features form the feature vector (at each frame) for the first CRF layer, which evaluates the probabilities of the different head motions. The feature vector (at each frame) of the second CRF layer, for recognizing continuous grammatical markers, thus has 23 elements: 16 probabilities of head motions and 7 distance ratios computed from the tracked eye and brow features.
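A sketch of how the per-frame geometric features described above could be assembled is given below. It assumes tracked 2D coordinates are available for the brow, eyelid, inner-eye-corner, and nose points; the point names, the dictionary layout, the interpretation of the eye heights as summed eyelid distances from the reference line, and the zero motion vector for the first frame are all assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def dist_to_line(p, a, b):
    """Perpendicular distance of 2D point p from the line through points a and b."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / np.linalg.norm(b - a)

def triangle_area(a, b, c):
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))

def face_features(frames):
    """frames: list of dicts mapping (hypothetical) point names to 2D np.array coordinates.
    Assumed names: brow points B_IL, B_IR, B_ML, B_MR; eyelid points E_TL, E_BL, E_TR, E_BR;
    inner eye corners EL3, ER3; nose point N2.
    Returns a (T, 10) array per frame: 7 brow/eye distance ratios, then S_M, C_Mx, C_My."""
    f0 = frames[0]
    EM0 = np.linalg.norm(f0["EL3"] - f0["ER3"])       # inner eye-corner distance in frame 1

    def raw_dists(f):
        a, b = f["EL3"], f["ER3"]                     # reference line through inner eye corners
        return np.array([
            dist_to_line(f["B_IL"], a, b),            # left inner brow height
            dist_to_line(f["B_IR"], a, b),            # right inner brow height
            dist_to_line(f["B_ML"], a, b),            # left middle brow height
            dist_to_line(f["B_MR"], a, b),            # right middle brow height
            np.linalg.norm(f["B_IL"] - f["B_IR"]),    # distance between brows
            dist_to_line(f["E_TL"], a, b) + dist_to_line(f["E_BL"], a, b),  # left eye height
            dist_to_line(f["E_TR"], a, b) + dist_to_line(f["E_BR"], a, b),  # right eye height
        ])

    d0 = raw_dists(f0)
    feats, prev_cg = [], None
    for f in frames:
        ratios = raw_dists(f) / d0                    # normalize by first-frame values
        tri = (f["EL3"], f["ER3"], f["N2"])
        S_M = triangle_area(*tri) / EM0 ** 2          # normalized triangle area
        cg = np.mean(np.stack(tri), axis=0)           # center of gravity of the triangle
        C_M = (cg - prev_cg) / EM0 if prev_cg is not None else np.zeros(2)
        prev_cg = cg
        feats.append(np.concatenate([ratios, [S_M], C_M]))
    return np.vstack(feats)
```

The last three columns would feed the first CRF layer, while the seven ratios would be concatenated with the 16 head-motion probabilities output by that layer to form the 23-element vectors for the second layer.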

3 Experiments and Results

Videos of natural sign language facial gestures of interest were recorded by providing deaf signers (from the Deaf and Hard-of-Hearing Foundation of Singapore) with appropriate signing scripts for sentences. Each English sentence in the script was signed in ASL with hand signs and the corresponding facial gestures. The sentences were created or adapted from ASL resources (e.g. [1]). A subject signed each sentence ten times. As mentioned in Section 2, the data includes the four types of grammatical marker chains described in Table 2. All six grammatical markers listed in Table 1 are present in the data set, together with the 16 types of head motion described in Table 4.

To evaluate the feasibility of our proposed recognition method, data from three subjects was used for the experiments. The data set included a total of 129 video sequences, divided into 93 video sequences for training (an average of seven sequences per subject for each of the four grammatical marker chains) and 36 for testing (about 3 sequences per subject per chain). Each video frame was manually transcribed with two labels, one for head motion and the other for grammatical marker, both identified from visual observation and the signing script. The training set was used to train both CRF layers of the model: the head motion layer and the grammatical marker layer.

Recognition accuracy for grammatical markers was measured by two methods: frame-based and label-aligned. In the frame-based method, the label assigned to each frame is compared with the corresponding human-annotated label. In the label-aligned method, the frame labels of each sequence are first reduced such that consecutive frames with the same label are replaced by a single label.
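As a small illustration of the label-reduction step just described, the snippet below collapses runs of identical frame labels into a single label; the label strings are hypothetical examples only.

```python
from itertools import groupby

def collapse_runs(frame_labels):
    """Replace consecutive frames carrying the same label by a single label."""
    return [label for label, _ in groupby(frame_labels)]

# e.g. a per-frame output of the classifier
print(collapse_runs(["Und", "Und", "TP", "TP", "TP", "Und", "RH", "RH", "AS"]))
# -> ['Und', 'TP', 'Und', 'RH', 'AS']
```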



Fig. 5. Frames in a test sequence containing the facial gesture chain TP RH AS. The frame index is shown below each image. Blue dots at facial features of interest are our tracking results.

The two reduced sequences of labels are aligned using the Needleman–Wunsch algorithm [18], and the numbers of matches, insertions, deletions, and changed labels are then obtained. Insertions are labels output by the classifier which do not appear in the corresponding annotated data. Deletions are labels which appear in the annotated data but are not recognized by the classifier.

An experiment was conducted to evaluate the performance of the proposed model. The first CRF layer, for head motion, was trained first. The head motion probabilities output by this trained CRF were used as part of the training vector for the CRF at the second layer. The two CRF layers were trained using the scaled conjugate gradient algorithm with the CRF Toolbox [19].

Frames from a video sequence in the test set are shown in Fig. 5, where the sequence of facial gestures corresponds to TP RH AS. Fig. 6 shows the probability output of the first layer for the 16 head motion labels described in Table 4. As mentioned in Section 2, the head tends to move past the neutral position before starting a new motion. In the last 10 frames in Fig. 6, there is confusion due to ambiguous head motions at the end of the signed sentence. Fig. 7 shows the probabilities of the grammatical markers output by the 2-layer CRF classifier; seven probabilities, six for the grammatical markers and one for the unidentified expression, are obtained at each frame. Fig. 7 shows that the second CRF layer, which is trained with output from the first layer, can tolerate the ambiguity of head motions in recognizing continuous grammatical markers.

The average frame-based grammatical marker recognition rate using the complete 2-layer CRF model was 80.82%. The corresponding confusion matrix is shown in Table 5, which shows that most of the confusions are between a grammatical marker and the unidentified expression. In particular, frame-based label confusions occur at the boundaries between facial gestures, where ambiguous head motions and asynchronous movements of facial features are present. This makes even manual annotation of consecutive frames into different facial gestures difficult.

The label-aligned method of computing accuracy reveals more about the capability of the layered CRF for recognizing continuous grammatical markers by discounting unavoidable confusions during transitions between facial gestures. Table 5 can be augmented with insertion and deletion entries to obtain the extended confusion matrix C, from which precision and recall rates are computed as

$$\mathrm{Precision} = \frac{\mathrm{Match}}{\mathrm{Match} + \mathrm{Change} + \mathrm{Insert}}, \qquad \mathrm{Recall} = \frac{\mathrm{Match}}{\mathrm{Match} + \mathrm{Change} + \mathrm{Delete}}$$

where, for marker i, Match rate $= C(i, i)$, Change rate $= \sum_{j \notin \{i,\,\mathrm{Insert},\,\mathrm{Delete}\}} C(i, j)$, Insertion rate $= C(i, \mathrm{Insert})$, Deletion rate $= C(i, \mathrm{Delete})$, and $C(i, j)$ is the value at row i and column j of the extended confusion matrix.
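The label-aligned scoring can be sketched as follows: the reduced label sequences are aligned with a simple Needleman–Wunsch dynamic program, matches, changes, insertions, and deletions are counted from the alignment, and precision and recall follow the formulas above. This is an illustrative reimplementation with arbitrary alignment scores and made-up example sequences, not the authors' exact evaluation code.

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Global alignment of two label sequences; returns (ref_label, hyp_label) pairs,
    with None marking a gap. The scores are illustrative choices."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == hyp[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if ref[i - 1] == hyp[j - 1] else mismatch):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None)); i -= 1      # deletion: annotated label missed
        else:
            pairs.append((None, hyp[j - 1])); j -= 1      # insertion: extra label output
    return pairs[::-1]

def precision_recall(alignments):
    """alignments: (reference, hypothesis) label pairs pooled over all test sequences."""
    match = sum(r == h for r, h in alignments if r is not None and h is not None)
    change = sum(r != h for r, h in alignments if r is not None and h is not None)
    insert = sum(r is None for r, _ in alignments)
    delete = sum(h is None for _, h in alignments)
    precision = match / (match + change + insert)
    recall = match / (match + change + delete)
    return precision, recall

ref = ["Und", "TP", "Und", "RH", "AS"]    # annotated labels after collapsing runs
hyp = ["Und", "TP", "RH", "AS", "AS"]     # classifier output after collapsing runs
print(precision_recall(needleman_wunsch(ref, hyp)))
```

Per-class figures such as those in Table 6 follow the same counting, restricted to alignment pairs whose reference label is the class of interest.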


Fig. 6. The probability outputs of the first-layer CRF trained to recognize 16 types of head motion. The color bar at the top shows the human-annotated head motion label for this video sequence; a curve and a bar with the same color are associated with the same head motion. Labels for the last 10 frames are ambiguous due to ambiguous head motions at the end of the signed sentence.


Fig. 7. The probabilities of the grammatical markers, output by the second CRF layer trained using head motion probability output (shown in Fig. 6) from the first layer.

Label-aligned results were 93.76% for precision and 84.54% for recall; the extended confusion matrix for this evaluation is shown in Table 6. The precision rate appears quite reasonable given the complexity of the problem. However, the lower recall rate suggests that the layered CRF is less sensitive to changes of facial gestures in video sequences; this may be improved with more descriptive features for head motion and facial feature movements.

As a comparison, the results obtained in this experiment were quite close to those we obtained in another experiment where the head motion labels were assumed known (the human-annotated labels) and were input to the second-layer CRF (rather than using the first-layer outputs). In that experiment, a precision rate of 94.54% and a recall rate of 90.78% were obtained for recognizing grammatical markers. Our recent results also show that the layered CRF model outperforms the linear-chain CRF and the layered HMM models.

4 Conclusion

In this paper, we addressed the problem of recognizing continuous facial gestures in sign language video. A 2-layer CRF was proposed for recognizing six common grammatical markers in ASL sentences. The first layer was trained to evaluate head motions, and the second layer was trained to segment and recognize facial gestures using the output from the first layer together with measurements of facial feature movements.


Table 5. Confusion matrix (%) for labeling grammatical markers with the proposed model. The average frame-based recognition rate is 80.82%.

       Und    AS     NEG    RH     TP     WH     YN
Und    59.62  7.60   3.09   6.65   9.26   12.11  1.66
AS     9.62   87.46  0      2.92   0      0      0
NEG    0.98   0      97.07  0      1.95   0      0
RH     10.78  0      0      89.22  0      0      0
TP     3.06   1.31   1.17   3.35   91.11  0      0
WH     5.61   9.35   0      0      0      84.58  0.46
YN     27.84  10.31  0      0      0      5.16   56.70

Table 6. Extended confusion matrix for the label-aligned facial gesture recognition results (%) using the 2-layer CRF.

          UN     AS     NEG    RH     TP     WH     YN     Insert  Delete  Precision  Recall
UN        68.97  0.00   0.00   0.00   0.00   0.00   0.00   3.45    27.59   95.24      71.43
AS        5.26   84.21  0.00   5.26   0.00   0.00   0.00   0.00    5.26    88.89      84.21
NEG       0.00   0.00   100    0.00   0.00   0.00   0.00   0.00    0.00    100        100
RH        0.00   0.00   0.00   100    0.00   0.00   0.00   0.00    0.00    100        100
TP        0.00   0.00   0.00   0.00   91.67  0.00   0.00   0.00    8.33    100        91.67
WH        0.00   11.11  0.00   0.00   0.00   88.89  0.00   0.00    0.00    88.89      88.89
YN        0.00   11.11  0.00   0.00   0.00   0.00   55.56  0.00    33.33   83.33      55.56
Average                                                                    93.76      84.54

Data was collected using an experimental setup for capturing natural facial gestures, without a forced "neutral" state between gestures. The complete 2-layer CRF model yielded a precision rate of 93.76% and a recall rate of 85.54% for recognizing the six types of continuously signed grammatical markers. These encouraging results show that the proposed 2-layer model is a viable scheme for recognizing facial gestures in sign language. In the near future, we propose to enhance the robustness of the model by incorporating more descriptive features for identifying head motions. We will also conduct more evaluations and comparisons with other methods. Other non-manual signals will be considered for further development of the system.

References

1. Baker, C., Cokely, D.: American Sign Language: A Teacher's Resource Text on Grammar and Culture. Clerc Books, Gallaudet University Press, Washington D.C. (1980)
2. Ong, S., Ranganath, S.: Automatic Sign Language Analysis: A Survey and the Future Beyond Lexical Meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 873–891
3. Vogler, C., Goldenstein, S.: Facial movement analysis in ASL. Journal on Universal Access in the Information Society 6 (2008) 363–374


4. Neidle, C., Nash, J., Michael, N., Metaxas, D.: A Method for Recognition of Grammatically Significant Head Movements and Facial Expressions, Developed Through Use of a Linguistically Annotated Video Corpus. In: Proceedings of the Language and Logic Workshop, Formal Approaches to Sign Languages, European Summer School in Logic, Language, and Information (ESSLLI '09), Bordeaux, France (2009)
5. Pantic, M., Rothkrantz, L.J.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1424–1445
6. Fasel, B., Luettin, J.: Automatic facial expression analysis: a survey. Pattern Recognition 36 (2003) 259–275
7. Murphy-Chutorian, E., Trivedi, M.M.: Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009)
8. Black, M., Yacoob, Y.: Recognizing facial expressions in image sequences using local parameterized models of image motion. International Journal of Computer Vision 25 (1997) 23–48
9. De la Torre, F., Campoy, J., Ambadar, Z., Cohn, J.F.: Temporal Segmentation of Facial Behavior. In: International Conference on Computer Vision (2007)
10. Matthews, I., Baker, S.: Active Appearance Models Revisited. International Journal of Computer Vision 60 (2004) 135–164
11. Cohen, I., Sebe, N., Garg, A., Chen, L.S., Huang, T.S.: Facial expression recognition from video sequences: temporal and static modeling. Computer Vision and Image Understanding 91 (2003) 160–187. Special issue on face recognition
12. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: International Conference on Machine Learning (2001)
13. Kanaujia, A., Metaxas, D.: Recognizing Facial Expressions by Tracking Feature Shapes. In: International Conference on Pattern Recognition, Hong Kong, China (2006)
14. Quattoni, A., Wang, S.B., Morency, L.P., Collins, M., Darrell, T.: Hidden Conditional Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 1848–1852
15. Chang, K.Y., Liu, T.L., Lai, S.H.: Learning partially-observed hidden conditional random fields for facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2009) 533–540
16. Nguyen, T.D., Ranganath, S.: Tracking facial features under occlusions and recognizing facial expressions in sign language. In: IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, Netherlands (2008) 1–7
17. Oliver, N., Horvitz, E., Garg, A.: Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding 96 (2004) 163–180
18. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48 (1970) 443–453
19. Schmidt, M., Swersky, K.: Conditional Random Field Toolbox for Matlab. http://www.cs.ubc.ca/~murphyk/Software/CRF/crf.html
