
A COMPUTER VISION SYSTEM FOR MONITORING MEDICATION INTAKE

David Batz, Michael Batz, Niels da Vitoria Lobo, Mubarak Shah
University of Central Florida
[email protected], [email protected], [email protected], [email protected]

Abstract

We propose a computer vision system to assist a human user in the monitoring of their medication habits. This task must be accomplished without the knowledge of any pill locations, as they are too small to track with a static camera and are usually occluded. At the core of this process is a mixture of low-level, high-level, and heuristic techniques such as skin segmentation, face detection, template matching, and a novel approach to hand localization and occlusion handling. We discuss the approach taken towards this goal, along with the results of our testing phase.

1. INTRODUCTION

Automatically detecting whether complex tasks are being performed by humans often requires using multiple recognition and tracking methods. In our case, we are tracking a user as they interact with medication bottles, using only one color camera. Specifically, we would like to know if the user opened a medicine bottle, placed their hand up to their mouth, and then closed the bottle. Our motivation for this project comes from the fact that some people on multiple medications have difficulty remembering which pills to take and when to take them. A vision system which can assist with the tracking of a specific person's medication habits would be useful. Such a system must check four requirements:
1. The right user is taking the medication.
2. The right medication is being taken.
3. The right dosage is being taken.
4. The medication is being taken at the right time.
Our system concentrates on problem (2) and provides a framework for (3). Problem (1) is outside the scope of this paper, and problem (4) is relatively easy to solve and will not be discussed here.

1.1. Assumptions

We assume there is one camera monitoring a medication area containing a number of medication bottles already in view. The inputs required by the system are the number of bottles in view, the bottle detection training data, and a skin color predicate, all of which are constructed offline. The caps of any medication bottles used must also require a twisting motion to be opened. Listed next are the assumptions made about the user. Only one user is close to the camera in the video sequence. There must be a short initial period of time when the user appears and their face is not occluded, so that it can be automatically initialized and tracked during future occlusions. For now, we assume the user places only one pill in their hand at a time, as tracking the pill is not possible. We can, however, look for some improper forms of usage, such as the pill bottle being brought up to the mouth, or the repetition of a hand moving between an open bottle and the mouth. In this paper we also present novel algorithms for hand localization and occlusion handling. Subsequent sections detail each step of the system, and then we present results.

1.2. Main Algorithm

Below we list the main steps of the system:

Load first frame of a sequence {
    Compute lighting correction values
    Automatically initialize bottle tracking templates
    Load skin color predicate
}
FOR (each consecutive frame) {
    Apply lighting correction and Gaussian filter
    Find skin regions using YCbCr predicate
    Apply morphological operations to skin regions
    Compute regional properties of skin regions
    Check for skin occlusions
    Repair any skin occlusions found
    Localize face
    Localize hands
    Track medication bottles
    Determine if any requirements are being met
}

2. SKIN SEGMENTATION

For each frame, two pre-processing steps are applied to maximize the effectiveness of subsequent operations. The first step is a lighting correction routine, which helps remove any color biasing. For this we adopted a histogram-based method similar to the one described in [5]. The RGB compensation values are calculated only from the first frame and then applied to all frames so that colors remain stable. The second step is a standard noise removal using a Gaussian convolution with σ = 0.5.
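The paper only outlines the lighting correction as a histogram-based method similar to [5]. As an illustration of the general idea (per-channel gains computed once from the first frame, followed by a small Gaussian blur), here is a simple gray-world style stand-in; it is not the authors' method, and the function names are our own.

```python
# Hypothetical stand-in for the lighting correction step: gray-world style
# per-channel gains computed from the first frame and reused for the sequence,
# followed by the sigma = 0.5 Gaussian noise removal described in the text.
import numpy as np
import cv2

def compute_rgb_gains(first_frame_bgr):
    means = first_frame_bgr.reshape(-1, 3).mean(axis=0)   # per-channel mean
    return means.mean() / np.maximum(means, 1e-6)         # gain per channel

def apply_lighting_correction(frame_bgr, gains, sigma=0.5):
    corrected = np.clip(frame_bgr.astype(np.float32) * gains, 0, 255).astype(np.uint8)
    return cv2.GaussianBlur(corrected, (3, 3), sigma)
```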


Much of the information utilized by the system is found in the hands and face of the user, which are first localized with a general extraction of all skin regions. To train the skin color predicate, three images of the user in the medication area are taken, and pixels containing skin colors are manually extracted. These pixels are transformed to the normalized YCbCr color space to make their chrominance more independent of their luminance, and to allow for a better distinction between skin and non-skin colors [1]. The normalized Cb and Cr values of pixels in the manually created skin masks are added to the color predicate using a 2D Gaussian weighting mask with σ = 1. Pixels not contained in the masks are subtracted from the predicate with a 2D Gaussian of σ = 0.5.

Once the predicate has been constructed, skin regions can be segmented from any frame using a connected components algorithm which starts at areas of high skin probability and terminates at areas of low skin probability. Morphological operations are then applied to the skin regions, in order: a median filter, a binary closing, and the filling of any small interior holes. The system then determines if any occlusions between regions have occurred, which is discussed in the next section. For each region R, we calculate various properties including the minimal bounding box (MBB), area, perimeter, centroid, axis of least moment of inertia (Rθ), and the lengths of the semi-major (Ra) and semi-minor (Rb) best-fit ellipse axes. These properties are used throughout the system, and serve as a preliminary test for head region detection [2] [3].
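The skin predicate above can be pictured as a 2D Cb-Cr lookup table. The following fragment is a simplified sketch, not the authors' implementation: it uses OpenCV's (non-normalized) YCrCb conversion, thresholds the table directly rather than running the probability-driven connected components segmentation, and every function name is hypothetical.

```python
# A minimal sketch of the skin color predicate: skin pixels vote into a
# 256x256 Cb-Cr table with a wide Gaussian (sigma = 1), non-skin pixels
# subtract with a narrower one (sigma = 0.5); a pixel is "skin" at run time
# if its table entry is positive.
import numpy as np
import cv2
from scipy.ndimage import gaussian_filter

def train_skin_predicate(frames_bgr, skin_masks):
    votes = np.zeros((256, 256), dtype=np.float64)
    for frame, mask in zip(frames_bgr, skin_masks):
        ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
        cr, cb = ycrcb[..., 1].ravel(), ycrcb[..., 2].ravel()
        m = mask.ravel() > 0
        skin = np.zeros_like(votes)
        np.add.at(skin, (cb[m], cr[m]), 1.0)
        nonskin = np.zeros_like(votes)
        np.add.at(nonskin, (cb[~m], cr[~m]), 1.0)
        votes += gaussian_filter(skin, sigma=1.0) - gaussian_filter(nonskin, sigma=0.5)
    return votes > 0

def segment_skin(frame_bgr, predicate):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    mask = predicate[cb, cr].astype(np.uint8) * 255
    # Morphological clean-up in the spirit of Section 2: median filter, closing.
    mask = cv2.medianBlur(mask, 5)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```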

3. OCCLUSION HANDLING

To ensure that the identity of skin regions persists throughout an occlusion, we have developed a novel approach to splitting hand/face occlusions (Fig. 1) and hand/hand occlusions (Fig. 2). This process exploits the fact that our 24-30 fps frame rate allows each region to move only a short distance between frames. Given three consecutive frames, let the current frame be Ft and its two previous frames be Ft−1 and Ft−2. Each skin region in Ft has its correspondence found in Ft−1, and each region in Ft−1 has its correspondence found in Ft−2. The correspondence is determined by the distance between a region's centroid in one frame and the regions' MBBs in an adjacent frame. Merging regions are detected by performing a forward correspondence, which checks each region's centroid in Ft−1 against each region's MBB in frame Ft. Conversely, we detect splitting regions using a backwards correspondence, which checks each region's centroid in Ft against each region's MBB in frame Ft−1. If a region in one frame lies within multiple MBBs in an adjacent frame, then the region pair with the shortest distance between them is chosen. The result is a tracking of which regions in a previous frame become which regions in the next. An occlusion occurs if merging correspondences between Ft and Ft−1 point to the same region in Ft.

If regions occlude in Ft, we first identify whether one of these regions was detected as a face in Ft−1. Our study has shown that the shape of the head region changes very little during normal motions, which allows us to match the head region's border in Ft−1 to its respective region in Ft. We use the Modified Hausdorff Distance (MHD) [8] as the measurement to match the borders. The MHD between two sets of points A and B is defined as:

MHD(A, B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \| a - b \|

where \| a - b \| is the Euclidean distance between points a and b. Point set A is the boundary of the head region before the occlusion sequence, and B is the boundary of the entire merged region in Ft. A is allowed to rotate and translate a small additional amount in each frame in order to find the best match into B (Fig. 1). To add stability to the motion of the head border, each possible position of A is scaled by

1 + \frac{\sqrt{x^2 + y^2 + \theta^2}}{\sqrt{x_m^2 + y_m^2 + \theta_m^2}}

where x, y, and θ are the quantized displacements from A's previous position, and x_m, y_m, and θ_m are the maximum allowed displacements of A in each dimension. This ensures that A will only move further away from its previous position if it finds a slightly better match there. The position of A that yields the minimum MHD is subtracted from the skin regions in Ft, and these regions are re-segmented to produce the fixed regions. If the head region's shape changes significantly during an occlusion, the minimum MHD score will be relatively high, and in this case we re-sample A from the head region's border after it has been fixed in Ft.
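For reference, the border matching measure above can be written compactly. The sketch below assumes the head border A and the merged-region border B are given as N × 2 arrays of boundary points; the helper names are our own.

```python
# Modified Hausdorff Distance between two point sets, plus the motion penalty
# applied to each candidate placement of the head border A.
import numpy as np
from scipy.spatial.distance import cdist

def modified_hausdorff(a_points, b_points):
    # For every point of A, distance to its nearest point in B, averaged over A.
    return cdist(a_points, b_points).min(axis=1).mean()

def motion_penalty(dx, dy, dtheta, max_dx, max_dy, max_dtheta):
    # Scale factor so A only drifts from its previous position for a clearly
    # better match.
    return 1.0 + np.sqrt(dx**2 + dy**2 + dtheta**2) / np.sqrt(
        max_dx**2 + max_dy**2 + max_dtheta**2)
```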


Fig. 1. Hand/face occlusion in sequence David lasting 30 frames. Frames 237, 243, 257, and 262 are shown. The new position of the white head template is found, and the black border is the area that is removed from the skin regions. The head border is automatically re-sampled when the user significantly changes pose.

If an occlusion occurs between two hands, the proposed border matching technique is not used because hands change contours more unpredictably. Instead, we subtract each region's centroid in Ft−2 from its respective centroid in Ft−1 to obtain a motion vector used to predict the intersection of occluding regions in Ft. Each region in Ft−1 is translated by these vectors, and a distance measurement is used to select all pixels where opposing regions are in close proximity. The occlusion regions generated by this method are dilated and smoothed with a median filter to ensure they cover the entire occlusion area. They are then skeletonized using the Guo-Hall algorithm [9] with the following modifications: a pixel must not be removed if it has between 1 and 5 non-skin neighboring pixels, inclusive, and it must be removed if it has 6 or more non-skin neighbors. These modifications simply ensure that occlusion pixels on the border of a skin region will not be removed. The remaining skeleton is subtracted from the skin regions in Ft and the regions are re-segmented. This subtraction performs well when splitting apart hand regions during smaller occlusions, but it may not work well with larger or deeper occlusions, as the proposed technique does not attempt to find the exact interior borders of the occluding skin regions.

Fig. 2. Hand/hand occlusions in sequence Jimmy on frames 200 and 217, and in David on frames 137 and 255. The white line is the hand occlusion skeleton, and the black line shows re-connections between split regions' centroids.

In some cases, skin regions will split when it is desirable that they remain connected. Two examples of this are when an arm region is split in two by the head border template subtraction, or when a hand is split from an arm region by the user's wrist watch. Although it is not necessary that this problem be corrected for future hand localization, we must still rejoin these regions so that they will not be falsely detected as occlusions if they re-merge. These split regions can be rejoined to their parent regions by performing a backwards correspondence between Ft and its occlusion-subtracted version. Regions belonging to the same correspondence are merged with each other if three conditions are met:
1. Both regions must not have been hand regions in Ft−1.
2. Neither region must have contained a face in Ft−1.
3. The larger of the two regions must have an Ra axis approximately twice as long as the smaller region's, and it must point in the general direction of the smaller region's centroid.
These criteria are chosen to specifically catch instances of hand-forearm splits.

4. FACE DETECTION

Face localization is a heavily researched area of computer vision, giving us numerous options, including eigenfaces, support vector machines, feature localization, and template matching. We employ a combination of the latter two techniques, as they require little offline training and can localize each facial feature position [4]. Given a region R from an image I, R will be considered a possible face region if:
1. 3π/4 > Rθ > π/4
2. Rheight > 10% of Iheight
3. Rwidth > 10% of Iwidth
4. Ra / Rb < 2.25
5. Rarea / (Rperimeter)² > 0.02

Test (1) is very effective since a user's head is generally upright when consuming medication. Tests (2) and (3) are simple size thresholds, and test (4) rejects regions that are too long and narrow to be a face, such as a forearm. The threshold value of 2.25 is chosen slightly larger than the golden ratio of a face [7], as the head region may also contain the neck. Test (5) is a measure of circularity which responds strongly to elliptical head regions.

A face mask, RFM, is made from R by bridging any small horizontal and vertical skin gaps and removing any areas of R which lie outside of its oriented MBB made by its elliptical axes. A vertical gradient map, Ygrad, of RFM is then produced by the convolution Ygrad = f ⊗ g, where f is the current frame's grayscale image and g is a horizontal Sobel mask. This gradient map will later be used to find strong negatively sloped gradients in the vertical direction, produced by the contrast between skin and either the eyebrows or eyes. Next, a template is built from three lines (Fig. 3) with the following criteria:
1. All three lines have equal slopes and lengths of 2Rb.
2. The topmost line should be placed right above the eye or eyebrow regions. The middle line should be placed above the mouth, and reside somewhere on the nose. The bottom line should be placed below the mouth, and reside somewhere on the chin.
3. All lines are evenly spaced from one another by the perpendicular distance Ra/2.
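Returning briefly to the geometric pre-tests (1)-(5) at the start of this section, they amount to a handful of threshold checks on the region properties from Section 2. The sketch below is illustrative only; the argument names are hypothetical stand-ins for those measurements, with Rθ in radians.

```python
# Geometric pre-tests for a candidate face region R inside image I.
import math

def is_possible_face(r_theta, r_height, r_width, r_a, r_b, r_area, r_perimeter,
                     image_height, image_width):
    upright = math.pi / 4 < r_theta < 3 * math.pi / 4          # test (1)
    big_enough = (r_height > 0.10 * image_height and            # tests (2), (3)
                  r_width > 0.10 * image_width)
    not_elongated = (r_a / r_b) < 2.25                           # test (4)
    circular = r_area / (r_perimeter ** 2) > 0.02                # test (5)
    return upright and big_enough and not_elongated and circular
```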


Placing these template lines perpendicular to Rθ will not always yield the best results, as the orientation of the face may be offset from that of the head. We therefore allow the lines to rotate by a maximum of 10° away from Rθ. For each of these orientations and positions, we calculate the average Ygrad and grayscale values of pixels on the top eye line, and the number of Canny edge pixels between the eye and chin lines. Multiple sets (one for each rotation) of n templates are created with eye line gradients G_0...G_{n−1}, eye line intensities I_0...I_{n−1}, and inter eye-chin Canny edge counts E_0...E_{n−1}. G, I, and E are normalized to the range 0 to 1 for each template set. Each template is then assigned a probability using the formula:

P_i = \alpha G_i + \beta (1.0 − I_{i+2}) + \gamma I_{i−2} + \delta E_i

where α, β, γ, and δ are weighting constants chosen to be 1.75, 1.5, 1.0, and 1.0. The greatest value of P_i will be a template with a high gradient on the eye line, bright intensities above and dark intensities below the eye line, and a high concentration of edge pixels between the eye and chin lines. Temporal stability is added by weighting each score with a Gaussian accumulation filter. Our tests have shown this template to be placed correctly under a wide range of facial orientations, and it is robust in the presence of eyeglasses and smaller facial occlusions. The angular offset of the chosen template is added to Rθ to create a corrected facial angle, RFθ.

We now search for exact facial features within this template area. The contrast between the eyes and skin provides strong edges, and the eye socket's concavity, combined with the darker iris, makes the eyes the darkest parts of the upper facial mask. Therefore, a combination of gradients and intensities can be used to localize facial features [5]. For all pixels between the eye and nose lines of RFM, we select those in the upper 35% of the edge magnitude histogram of RFM and in the lower 45% of the grayscale histogram of RFM. The average intensity of the selected pixels is calculated, and all pixels with intensities below this adaptive threshold are removed. Morphological operations are applied to the remaining pixels to enforce the detection of the two largest spherical regions. A connected components algorithm is then applied to find all eye regions. If the eye regions were merged into one elongated region across the nose, then only that region is accepted as the eyes. In order for R to be considered a possible face at this point, at least one eye region must have been found.

The most probable mouth region in RFM is then searched for between the nose and chin lines. We apply a lip color probability function developed in [6]:

MouthMap = Cr^2 \cdot (Cr^2 − \eta \cdot Cr/Cb)^2

where η is the ratio of the average Cr^2 to the average Cr/Cb. To remove any false positives found on the nose, a linear scaling of the lip probabilities is applied, with higher weights given to pixels closer to the chin line.
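As a small illustration, the MouthMap above can be computed directly from the chrominance of the face mask. The sketch below assumes Cr and Cb are already extracted as arrays over the mask pixels, and it omits the chin-weighted scaling and the Gaussian accumulation described next.

```python
# Lip-color probability map from [6]: MouthMap = Cr^2 * (Cr^2 - eta * Cr/Cb)^2,
# with eta the ratio of the average Cr^2 to the average Cr/Cb.
import numpy as np

def mouth_map(cr, cb):
    cr = cr.astype(np.float64)
    cb = np.maximum(cb.astype(np.float64), 1e-6)   # avoid division by zero
    cr2 = cr ** 2
    ratio = cr / cb
    eta = cr2.mean() / ratio.mean()
    return cr2 * (cr2 - eta * ratio) ** 2
```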

A 2D Gaussian accumulation filter with σ = 4 is also applied to the scaled lip scores in order to help suppress spurious false positives while still allowing some motion of the mouth. A box with height and width dimensions Ra/5 × Rb/1.25 is created, chosen to approximately contain a closed mouth. This box is aligned with RFθ and placed to encompass the maximum grouping of lip color probabilities, which is considered the mouth region.

An MBB aligned with RFθ is then placed around any eye and mouth features. As a final verification step, the average grayscale intensity of each scan-line perpendicular to RFθ (and within this MBB) is calculated. From these intensities, a facial projection curve [5] is created, smoothed, and normalized. We can verify the existence of a face in this curve with several simple tests. The curve's mean and variance can be compared to certain bounds to reject the more obvious non-faces. Most importantly, the upper portion of the curve should contain a dark minimum where the eyes occur, and below this there will be a bright maximum where the nose is. Every time a facial projection curve is verified, the region's face probability score is incremented. This score helps keep the continuity of the most probable face region, as some of the facial tests may fail during occlusions.

Fig. 3. Face localization results for frame 76 of Chris sequence, frame 145 of Alfred sequence, frame 184 of David sequence, and frame 90 of Jimmy sequence.

5. HAND TRACKING

Much work has been done on gesture recognition in cases where the only motion or prominent object in a sequence is the hand itself [10] [11]. These systems cannot be used here due to the lack of ideal background conditions in our medication areas, and the relatively small size of hands in our sequences. We therefore developed a novel approach to hand localization based on grayscale sharpness. Our goal is to place circles over the two most probable hand regions in an image. The circular hand model H is defined as:

H = [C, r, s, θ]

similar to [12], where C is the center of the circle, r is the radius, and s is the total grayscale sharpness of the pixels within that circle. Sharpness is a measure of local variation in pixel intensities; we use the Sum-Modulus-Difference sharpness method, which works by summing the differences between adjacent pixels [13]. θ is the average orientation of any fingers within the hand circle, and will be used to determine if the user is opening or closing a bottle.

A value for Hr must first be determined each frame. Given that the width of a hand is approximately half the width of a face, we set Hr to 50% of the facial region's Rb length. Hr also adjusts automatically with the user's distance from the camera, because the user's head region changes size as well. For each skin region R that is not a head region, we then create a mask containing all possible positions for HC. The structure of the human arm allows us to impose some constraints to limit the mask to only the most probable locations for HC:
1. Hands are positioned such that they must be close to the arm's major elliptical axis, no matter what pose the arm is in. Only skin pixels within a distance of Rb/2 from the line formed by Rθ and R's centroid are therefore added to the mask.
2. If the user's forearms are present (Ra > 2Hr), then the hands should only exist at the arm region's endpoints, so only skin pixels at a distance greater than Ra/4 from R's centroid are added to the mask.
The white pixels in figure 4 represent these region masks.

The exact position of HC for each candidate region is then determined through sharpness calculations in the grayscale image. The theory behind this method is that hands will have a higher sharpness than the rest of the arm region, due to the appearance of fingernails, knuckles, and edges along the fingers. To find the position of HC with the highest sharpness in a region R, HC is tested against every point in that region's mask. Any skin pixel within the hand circle then has its sharpness computed. The sharpness for a pixel in image I at position (x, y) is computed as:

\sum_{i=-1}^{1} \sum_{j=-1}^{1} |I(x, y) − I(x + i, y + j)|

The sum of all the pixel sharpness values within the circle is recorded as Hs, and the maximum value of Hs is where H is placed. Hs is then divided by the area of the hand circle to convert it into the average sharpness per pixel. Next, an adaptive threshold is applied to Hs to remove hands which may have been placed in skin regions that do not contain a real hand. Once this is completed, the two hand models with the highest sharpness are compared with the previously found hands to link each new hand to its previous hand. The black circles in figure 4 represent the final hand locations.

If a hand is occluding the head region, we allow the hand search to enter this region because the hand could be located over the face. However, this will often incorrectly place the hand over facial features, which have a higher sharpness than hands. To correct this, we subtract the change in the average sharpness per pixel between the current and previous hand circles' grayscale images from the average sharpness per pixel score for H, and use this altered score to track during hand/face occlusions. This effectively stops the movement of hand circles from the hand regions to the facial features. The bottom row of images in figure 4 all contain hand/face occlusions.

Finally, the orientation of the fingers, Hθ, must be determined. We create a 20-bin histogram of the directions of all the Canny edge pixels within H, the largest bin being the new value of Hθ. We preferred this method over oriented Gabor filters since it was faster and gave more stable results in frames where the fingers are not as prominent. The black lines in figure 4 represent Hθ.
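A compact sketch of the Sum-Modulus-Difference sharpness used to place the hand circles is shown below: per-pixel absolute differences to the 8 neighbours are summed over a grayscale image, then averaged inside a candidate circle. The function names and the wrap-around border handling are our own simplifications.

```python
# Sum-Modulus-Difference sharpness and its average over a candidate hand circle.
import numpy as np

def sharpness_map(gray):
    g = gray.astype(np.float64)
    s = np.zeros_like(g)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            shifted = np.roll(np.roll(g, di, axis=0), dj, axis=1)
            s += np.abs(g - shifted)               # |I(x, y) - I(x+i, y+j)|
    return s

def circle_sharpness(sharp, cx, cy, r):
    h, w = sharp.shape
    ys, xs = np.ogrid[:h, :w]
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
    return sharp[inside].sum() / inside.sum()      # average sharpness per pixel
```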

Fig. 4. Hand localization results for frames 47 and 90 of sequence Jimmy, 195 and 263 of Alfred, and 84 and 194 of Chris. The axis lines represent each region's elliptical axes.

6. MEDICATION BOTTLE TRACKING

6.1. Automatic Bottle Initialization

The location of any medication bottles in the medication area must first be determined before bottle tracking can occur. This begins by searching the Canny edge map at the start of each sequence for objects with rectangular shapes and height/width ratios of approximately 2:1. The medication bottles used in our test videos fit this shape constraint, but this could be modified to allow for bottles of other height/width ratios. The edge matching results are shown in figure 5.

Another process must then be used to eliminate the false positives found. For this we use a basic object recognition technique based on SSD matching. An offline bottle library is created from one of the sequences by bilinearly interpolating the size of each bottle to a 50 × 25 template and recording the RGB components of each pixel of the bottle. The library used for our sequences is shown in figure 6, and more bottles could easily be added to it. Each rectangle from the first step (Fig. 5) is SSD matched against this library, and non-maximal suppression is used to remove the weak matches. This entire process is repeated with an iterative rejection threshold, and terminates when the desired number of bottles is found. Using this technique, the system can match each medication bottle with its corresponding library image. Figure 7 shows the results of automatic bottle initialization, along with the bottle identification letter from the training set that each bottle was matched to.

Fig. 5. Results of edge matching on frame 0 of the Jimmy sequence.

Fig. 6. Bottle image library used in our test sequences. From left to right are bottles A, B, C and D.

Fig. 7. Automatic bottle initialization results for each test sequence, frame 0.
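The SSD verification step can be sketched as follows. This is an illustrative fragment, not the authors' code: candidate rectangles are assumed to be (x, y, w, h) tuples from the edge-matching stage, the library is a list of labelled 50 × 25 images, and the iterative rejection threshold and non-maximal suppression are left out.

```python
# SSD matching of candidate rectangles against the offline bottle library.
import numpy as np
import cv2

def match_candidates(frame_bgr, candidates, library):
    """library: list of (label, 50x25x3 uint8 image) pairs."""
    matches = []
    for (x, y, w, h) in candidates:
        patch = cv2.resize(frame_bgr[y:y + h, x:x + w], (25, 50),
                           interpolation=cv2.INTER_LINEAR)   # bilinear, as in the text
        scores = [(np.sum((patch.astype(np.float64) - img.astype(np.float64)) ** 2), label)
                  for label, img in library]
        ssd, label = min(scores)
        matches.append(((x, y, w, h), label, ssd))
    # Lower SSD = better match; weak matches can then be pruned by a threshold.
    return sorted(matches, key=lambda m: m[2])
```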

6.2. Bottle Tracking

Object tracking is another heavily researched area in computer vision; unfortunately, no single tracking technique or algorithm is ideally suited to our needs. Major occlusions are present in most sequences (when the user covers a bottle with their hand), so EigenTracking was not used because of its instability under occlusion [14]. Bottles can also twist in the horizontal plane of the camera, thus changing their color composition. It is therefore necessary to update a bottle's representation during run time to account for its change in appearance. Any technique such as [15], which pre-computes tracking information from a single initial view, cannot be used because of this limitation. For these reasons, we use a 2D template matching algorithm that tracks bottles by correlation color matching. This method can be modified to handle occlusions and account for bottle color changes because there is no precomputation. The bottle template takes the form of figure 8, similar to the template used in [15]. The template model, T, is defined as:

T = [l, w, P, θ, Mt]

where Mt is an M × N matrix of tracking points. Each tracking point stores its location relative to P, along with the two color components, Cb and Cr, corresponding to that tracking point in the normalized YCbCr color space of the image. The output of automatic bottle initialization is used to construct a template for each bottle found. The initialization routine places the matrix of tracking points inside the template, using every other pixel as a tracking point, and then records the normalized Cb and Cr color components of each point.

Fig. 8. Bottle tracking template.

Tracking is a two-step process of translating and rotating each bottle's template around a local neighborhood of the template's previous position, and then using correlation matching to find the location with the highest similarity to the previous bottle template. We define a bottle's previous template as Tt−1, and the new template used for the search as Tt. The first step quickly determines the general position of Tt based on Tt−1 without any rotations. Tt is translated across a range of 2Tl × 2Tl, taking each point as a candidate for P. For each possible P, the correlation between the current image Cb and Cr color components and those stored in Tt−1 is calculated. The maximum of these values is where Tt is translated. The direct point-correlation matching between Tt and Tt−1 proceeds by counting the number of tracking points in Tt−1 that match the color components of their corresponding points in Tt. To compensate for inconsistent lighting conditions, two corresponding tracking pixels between Tt and Tt−1 are considered the same as long as the Cb and Cr components change by no more than 2%.

The second step determines the exact translation and rotation for Tt. To judge how well the pixels in Tt correspond to Tt−1, we make use of direct point-correlation matching again. Only translations within a 5 × 5 pixel region and rotations within one radian from the previous template angle are tested as candidates for P. During this second step we must also account for possible bottle twisting. A standard medicine bottle is composed of a main base color and multiple other colors on the label. When a bottle twists, all the colors stored in Mt will shift either to the left or right, and one of the vertical edges will transform into the base color of the bottle. It is during this second step of the template matching algorithm (the finer search level) that three different templates are searched for. One template remains unchanged from the previous template, while the two others have their tracking points shifted to the left and right, with the remaining vertical edge filled with the bottle's base color. If twisting has occurred, one of the twisted templates should match better than the standard one.

After finding the best location for Tt, occlusion handling is performed if needed. We separate occlusions into two categories, minor and major, which are treated differently. Minor occlusions are short term and usually caused by one bottle moving behind another. Major occlusions are caused by the user grabbing the bottle with their hand and completely covering it, making it impossible to track by conventional methods. Testing for occlusions is done by calculating the ratio of tracking points which were matched. If this ratio changes significantly from the previous frame's ratio, or the match ratio drops below a threshold, then an occlusion has occurred. If less than 50% of the bottle is covered by skin, we assume a minor occlusion has occurred. Rather than update Tt to an incorrect occluded area, we use the previous trajectory of the bottle to place Tt instead. If the bottle is covered by over 50% skin, then we assume a major occlusion has occurred. Because these occlusions completely cover the bottle while it is moving, no form of tracking can be used to predict the bottle's location. Instead, Tt is placed within the nearest hand circle until a good match for the bottle can be found again. As long as hand detection remains accurate, the bottle template is always very close to its actual position. If no occlusions are found, then Tt is assumed to be the real bottle location, and the tracking points are updated if they pass the same color tolerances described above.
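The core of the direct point-correlation match can be sketched as below. It is a simplified illustration: the tolerance is taken as 2% of the 8-bit range (the paper specifies a 2% change of the normalized components), the tracking points are assumed to be absolute (x, y) pixel coordinates for a candidate placement, and rotation, twisting, and occlusion handling are omitted.

```python
# Fraction of template tracking points whose Cb/Cr values in the current frame
# stay within the tolerance of the values stored when the template was built.
import numpy as np
import cv2

def match_ratio(frame_bgr, points_xy, stored_cb, stored_cr, tol=0.02 * 255):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    xs, ys = points_xy[:, 0], points_xy[:, 1]
    cr, cb = ycrcb[ys, xs, 1], ycrcb[ys, xs, 2]
    ok = (np.abs(cb - stored_cb) <= tol) & (np.abs(cr - stored_cr) <= tol)
    return ok.mean()    # drops in this ratio signal an occlusion (Section 6.2)
```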

Fig. 9. Major bottle occlusion in the Alfred sequence on frames 47 and 48. No good match for the bottle template could be found in the image on the right, so it is placed within the nearest hand until a good match can be found again.

7. HIGH-LEVEL OPERATIONS

The opening and closing of a bottle is determined by looking at the orientation of the finger lines in both hand circles. We look for one hand to be on the side of a bottle while its fingers remain relatively still, while the other hand is near the bottle top and its fingers show a relatively large amount of twisting. After a bottle has been opened, we must detect if any hands have been placed up to the mouth. If a hand circle enters the face region and then contains a portion of the mouth box, we assume the user has placed the pill into their mouth. Looking for an actual opening of the mouth is not reliable, since the mouth need not open significantly for the placement of a pill, and the hand may occlude the mouth before it opens. We can only achieve requirement (3) at this point by counting the number of times a hand moves between an open bottle and the mouth.

8. RESULTS

Four sequences containing different persons and environments were filmed at a resolution of 352 × 240 pixels. We consider a successful single pill detection to be the sequence of a bottle opening, a hand-over-mouth motion, and finally a bottle closing. The system successfully accomplishes all of the required tasks on six of the eight pills, including detection of bottle openings/closings and pill placement into the mouth. Figure 10 shows four plots (one for each test sequence) of the results. They display the periods in which the three main events take place, and the frame recorded by the system in which each event was detected. The system was unable to work properly on the video sequence Chris because the white bottles had the same white base color as the table on which they were placed, making the bottle tracking fail (Fig. 11). However, automatic bottle initialization, face localization, and hand localization were still successful on this sequence.

Possible improvements concerning this project include:
1. Find the exact region border between hand/face occlusions.
2. Develop a more robust bottle tracking procedure which does not fail on test sequence Chris.
3. Use face recognition to determine if the correct person is taking the medication (requirement 1).


Fig. 10. Plot of results.

Fig. 11. Bottle tracking failure in the Chris sequence, frames 20, 23, and 25.

9. REFERENCES

[1] Jean-Christophe Terrillon and Shigeru Akamatsu. "Comparative Performance of Different Chrominance Spaces for Color Segmentation and Detection of Human Faces in Complex Scene Images." Fourth IEEE FG, pp. 54, March 2000.
[2] Son Lam Phung, Abdesselam Bouzerdoum, and Douglas Chai. "A Novel Skin Color Model in YCbCr Color Space and its Application to Human Face Detection." IEEE ICIP, pp. 289-292, 2002.
[3] Nariman Habili, Cheng-Chew Lim, and Alireza Moini. "Hand and Face Segmentation Using Motion and Color Cues in Digital Image Sequences." IEEE ICME, 2001.
[4] Ming-Hsuan Yang, David J. Kriegman, and Narendra Ahuja. "Detecting Faces in Images: A Survey." IEEE TPAMI, Vol. 24, No. 1, January 2002.
[5] Xingquan Zhu, Jianping Fan, and Ahmed K. Elmagarmid. "Towards Facial Feature Extraction and Verification for Omni-Face Detection in Video/Images." IEEE ICIP, pp. 113-116, 2002.
[6] Rein-Lien Hsu, Mohamed Abdel-Mottaleb, and Anil K. Jain. "Face Detection in Color Images." IEEE TPAMI, Vol. 24, No. 5, pp. 696-706, May 2002.
[7] L. G. Farkas and I. R. Munro. Anthropometric Facial Proportions in Medicine. Charles C. Thomas, Springfield, IL, 1987.
[8] M. P. Dubuisson and A. K. Jain. "A Modified Hausdorff Distance for Object Matching." IEEE ICPR-A, pp. 566-568, 1994.
[9] Zicheng Guo and Richard W. Hall. "Parallel Thinning with Two-Subiteration Algorithms." Communications of the ACM, Vol. 32, No. 3, pp. 359-373, March 1989.
[10] James Davis and Mubarak Shah. "Toward 3-D Gesture Recognition." ECCV, 1994.
[11] B. Stenger, P. R. S. Mendonça, and R. Cipolla. "Model-Based 3D Tracking of an Articulated Hand." IEEE CVPR, pp. 310-315, December 2001.
[12] G. McAllister, S. J. McKenna, and I. W. Ricketts. "Tracking a Driver's Hands Using Computer Vision." IEEE SMC, pp. 1388-1393, October 2000.
[13] Ng Kuang Chern, Poo Aun Neow, and Marcelo H. Ang Jr. "Practical Issues in Pixel-based Autofocusing for Machine Vision." IEEE ICRA, May 2001.
[14] Michael J. Black and Allan D. Jepson. "EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation." ECCV, pp. 329-342, 1996.
[15] Frédéric Jurie and Michel Dhome. "A Simple and Efficient Template Matching Algorithm." IEEE ICCV, July 2001.

