

International Journal of Image and Graphics
© World Scientific Publishing Company

Moving camera moving object segmentation in compressed video sequences

J. Wang, Department of Computer Science, Wayne State University, Detroit, MI 48202, United States, [email protected]

N.V. Patel*, Department of Computer Science and Engineering, Oakland University, Rochester, MI 48309, United States, [email protected]

W.I. Grosky, Department of Computer and Information Science, University of Michigan - Dearborn, Dearborn, MI 48128, United States, [email protected]

F. Fotouhi, Department of Computer Science, Wayne State University, Detroit, MI 48202, United States, [email protected]

In this paper, we address the problem of camera and object motion detection in the compressed domain. The estimation of camera motion and the segmentation of moving objects have been widely studied in a variety of contexts for video analysis, because they provide essential clues for interpreting the high-level semantics of video sequences. A novel compressed-domain motion estimation and segmentation scheme is presented. MPEG-2 compressed-domain information, namely Motion Vectors (MV) and Discrete Cosine Transform (DCT) coefficients, is filtered and manipulated to obtain a dense and reliable Motion Vector Field (MVF) over consecutive frames. An iterative segmentation scheme based upon the generalized affine transformation model is used to estimate global camera motion. The foreground spatiotemporal objects are then separated from the background by applying a temporal consistency check to the output of the iterative segmentation. This consistency check coalesces the resulting foreground blocks and weeds out unqualified blocks. Illustrative examples are provided to demonstrate the efficacy of the proposed approach.

Keywords: MPEG-2; Segmentation; Compressed domain processing; Motion vector; Affine motion model; Temporal tracking.

*Corresponding author.


1. Introduction

Segmentation of objects in video sequences is important in many multimedia applications, such as content-based video retrieval, video composition and video compression. Indeed, recent multimedia standards specify that a video is composed of meaningful video objects. Moving object segmentation, hence, plays a particularly important role in video processing, because the human visual system is very good not only at perceiving motion but also at tracking and predicting movements. Although many segmentation techniques have been proposed in the literature, fully automatic segmentation tools for general applications are currently not achievable.

The state-of-the-art approaches to video object segmentation can be largely categorized into two classes: (1) intraframe segmentation and (2) motion segmentation. In the first class, each frame is treated as a static image and segmented independently into regions of homogeneous intensity or texture. Traditional image segmentation techniques, such as Shi et al.1 and Salembier et al.2, are commonly used in this category. These methods, however, suffer from several defects, such as high computational complexity and over- or under-segmentation. In the second category, motion information is the primary cue: segmentation is achieved by coalescing homogeneous motion information according to its characteristics. Some of the related research operates in the uncompressed domain, such as Kompatsiaris et al.3, Sifakis et al.4 and O'Connor et al.5. These approaches can potentially estimate object boundaries with pixel-level accuracy, but they require video sequences to be fully decoded before segmentation can be performed; as a result, their use is often restricted to non-real-time applications.

In contrast to the uncompressed-domain approaches are the compressed-domain techniques, which operate on a macroblock basis. Widespread research activity has been dedicated to this discipline. In Eng et al.6 and Boulgouris et al.7, translational motion vectors, combined with DCT coefficients, are clustered to separate the foreground. In Jamrozik et al.8, motion vectors are accumulated over a number of adjacent frames; the magnitude of the displacement of each macroblock is calculated and uniformly quantized so that the blocks can be assigned to different regions. In Sukmarg et al.9 (FOD), segmentation is performed based only on DCT coefficients: thresholding the average temporal change of each region enables the foreground/background classification, while the inverse DCT is performed only on some parts of frames. In Favalli et al.10, human intervention is introduced to manually track identified moving objects in the compressed bitstreams based on macroblock motion vectors.

In this paper, we present a novel global camera motion estimation algorithm and its application to moving object segmentation under inconsistent camera motion. Our system concerns camera and object motion analysis in the compressed domain; motion information readily available from MPEG-2 videos is taken into account.


However, several limitations of encoded motion vectors for video processing should not be ignored. First, motion vectors from the compressed domain do not consistently correspond to real camera and object motion, inasmuch as their purpose is to minimize prediction errors in the encoding process, not to estimate motion accurately; in a nearly uniform region, motion vectors may therefore behave like random noise. A second limitation is the spatiotemporal granularity of motion vector fields. Temporally, not every macroblock is associated with a motion vector. Spatially, only one motion vector is decoded for each 16 × 16 macroblock, so the encoded motion vector cannot properly represent the motion of objects smaller than a macroblock that contribute distinct motion within that macroblock. To surmount these difficulties, our method shows that in certain structured settings it is possible to reliably estimate global camera motion and apply it to segment the moving foreground from the moving background. The suitability of the proposed algorithm for real-time object tracking, along with its applications in video information management, video compression and transmission, will be discussed.

The proposed technique implements a three-stage approach that achieves good segmentation results while reducing the computational load. In the first step, raw motion vectors are processed prior to estimation: the system incorporates motion vectors from a number of frames along the time axis and spatiotemporally interpolates new motion vectors to make the motion vector field denser and more reliable. In the second step, we segment foreground objects from the background: a 6-parameter generalized affine transformation model is iteratively estimated to compute global camera motion. In the third step, we apply an error analysis and elimination process to fine-tune the segmentation results. Because of the presence of random noise, some isolated background and foreground blocks are falsely included in the output of the second step; a temporal tracking filter therefore decides whether a given block belongs to the foreground or the background according to its neighboring block information. The block diagram of the proposed system is illustrated in Figure 1.

The remainder of this paper is organized as follows. In Section 2, we discuss the extraction of motion vectors from the compressed bitstreams and the spatiotemporal interpolation of motion vector fields. Section 3 states the iterative segmentation method. Section 4 deals with the temporal object tracking that checks motion consistency and refines the estimation results. Experimental results are presented in Section 5, and we draw conclusions in the last section.


Fig. 1. Block diagram of the proposed motion segmentation method

2. Spatiotemporal Motion Information Enhancement

Object motion in video arises from two phenomena: camera motion and object motion. In compressed video, this motion information is crudely represented by encoded motion vectors. This raw information cannot be reliably used as a means of determining object velocity and/or direction unless the encoder is specifically configured for that purpose. In general, the encoder generates motion vectors that minimize prediction errors and, consequently, they often do not represent true object motion from frame to frame. Moreover, motion vectors are not transmitted for all macroblocks in the compressed domain. In this section, we discuss the process that weakens these negative effects of encoded motion vectors.

2.1. Motion Information Extraction

As an improved version of the original MPEG-1 standard, the MPEG-2 standard allows higher-quality video to be encoded, transmitted and stored. It has been widely used by researchers in a variety of contexts for video analysis, owing to its popularity and specific coding characteristics. For the sake of completeness, we provide a brief discussion of the MPEG-2 standard; for full details on the MPEG-2 video coding standard, we refer the reader to Ref. 11.

MPEG-2 video defines three main types of coded video frames: intracoded I-frames, predicted P-frames and bidirectionally predicted B-frames. These frames are organized into a sequence of Groups of Pictures (GOP). An I-frame is intracoded by sub-sampling into non-overlapping macroblocks and blocks, each block undergoing a discrete cosine transformation. Each P-frame is predictively encoded with reference to its previous I- or P-frame: for each macroblock in a P-frame, a local region is searched for a match in terms of intensity, and the macroblock is then represented by a motion vector to the position of the match, along with the DCT encoding of the residue between the macroblock and its match. B-frames are bidirectionally predictively encoded with forward and backward motion compensation to their nearest I- and P-frames. Since B-frames are not used as a reference for coding other frames, they can accommodate more distortion and higher compression.

Following the footsteps of many previous researchers, we extract motion vectors from MPEG-2 compressed bitstreams for background/foreground object separation by making use of the MDC library.12


The prerequisite for a reliable and accurate separation result is a dense and smooth motion vector field. In MPEG-2, however, the MVF suffers from noise and lacks coherence: intracoded blocks in I- and P-frames have no associated motion vectors, whereas bidirectionally predicted blocks in B-frames often have more than one motion vector. Therefore, the following rules are stipulated to obtain a more consistent and reliable MVF from MPEG-2 bitstreams. These rules also remove the dependency on specific MPEG-2 characteristics and associate one motion vector with standardized magnitude to every macroblock.

• Intracoded macroblocks in P- and B-frames: Since only a small portion of the macroblocks in P- and B-frames are intracoded, these macroblocks are assumed to have the same movement as in the previous frame. The justification is that consecutive frames are strongly correlated, so there is a high probability that the motion field has not changed much.

• Intracoded macroblocks in I-frames: All macroblocks in I-frames are intracoded and have no motion information available. The motion information for the I-frame in the current GOP can be reasonably estimated from the last P-frame in the previous GOP without employing a Block Matching Algorithm (BMA). Specifically, let MV_I(m, n) represent the motion vector along the horizontal (x-axis) and vertical (y-axis) directions for the macroblock centered at (m, n) in the I-frame. Using the median of the vectors of neighboring blocks in the previous P-frame, MV_I(m, n) can be approximated as:

\[ MV_I(m, n) = \frac{\mathrm{Median}\left(\{ MV_P(i, j) \mid m-1 \le i \le m+1,\ n-1 \le j \le n+1 \}\right)}{N} \tag{1} \]

where N is the number of frames between the previous P-frame and the current I-frame in display order. Equation 1 gives an estimated motion vector with standardized magnitude for each intracoded macroblock in the current I-frame with respect to the prior B-frame.

• Bidirectionally predicted macroblocks in B-frames: These macroblocks have two motion vectors. The one pointing backward is reversed and added to the one pointing forward; the result is then multiplied by a factor so that its magnitude corresponds to a vector between the current frame and its previous frame.

• Skipped macroblocks: In P-frames, these have no movement, while in B-frames they have movement similar to the block in the previous frame. This rule is taken directly from the MPEG-2 standard.11
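To make these rules concrete, the following Python fragment sketches the two rules that involve computation. It assumes MVFs are stored as NumPy arrays of shape (H, W, 2) indexed by macroblock grid coordinates; the function names and array layout are our own illustration, not part of MPEG-2 or the MDC library, and the vector median of Equation 1 is approximated component-wise.

```python
import numpy as np

def estimate_iframe_mvf(prev_p_mvf, n_frames):
    """Approximate the I-frame MVF from the last P-frame of the previous
    GOP (Equation 1): a 3x3 neighborhood median, scaled by the display-order
    frame distance N so the magnitude refers to a one-frame interval."""
    h, w, _ = prev_p_mvf.shape
    mvf = np.zeros_like(prev_p_mvf, dtype=float)
    for m in range(h):
        for n in range(w):
            window = prev_p_mvf[max(m - 1, 0):m + 2, max(n - 1, 0):n + 2]
            # Component-wise median over the neighboring vectors.
            mvf[m, n] = np.median(window.reshape(-1, 2), axis=0) / n_frames
    return mvf

def combine_bidirectional(forward_mv, backward_mv, scale):
    """Bidirectionally predicted macroblocks in B-frames: reverse the
    backward vector, add it to the forward one, and rescale so the
    magnitude corresponds to one inter-frame displacement."""
    return (np.asarray(forward_mv) - np.asarray(backward_mv)) * scale
```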


2.2. Spatiotemporal Motion Vector Enhancement

After the temporal interpolation, the derived motion vector field still has low spatial granularity. This intrinsic flaw can be amended by a weighted spatial interpolation algorithm aimed at increasing both the spatial consistency and the spatial granularity of the MVF. We propose to use the weighted vector median filter, similarly defined in Ascenso et al.13 and Shen et al.14, to increase the spatial resolution of the derived motion vector field. The weighted vector median filter maintains the spatial coherence of the motion field by looking, at each block, for candidate motion vectors in neighboring blocks. The filter is adjustable by a set of weights controlling its smoothing strength depending on motion-compensated prediction residuals. Specifically, let {MV_1, MV_2, MV_3, MV_4} be the four adjacent motion vectors in a neighborhood. The candidate motion vector MV_c is derived as:

\[ MV_c = \Big\{ MV_j \;\Big|\; \arg\min_{MV_j} \sum_{i=1}^{4} w_j \, \lVert MV_j - MV_i \rVert_L \Big\} \tag{2} \]

Equation 2 states that, by weighting neighboring motion vectors according to prediction residuals, we skew the new motion vector toward the optimal one that minimizes the weighted sum of distances to the other three vectors in terms of the L-norm. The weights are determined from the residual errors as:

\[ w_j = \frac{MAD(B_c, MV_c)}{MAD(B_j, MV_j)} \tag{3} \]

where MAD(·) is the mean absolute difference of block B_j having MV_j as its motion vector. The weights have low values when the prediction error for the candidate vector is high. The decision of spatial motion vector interpolation with respect to neighboring vectors is therefore made by evaluating both the prediction errors and the spatial properties of the neighboring motion field.

Fig. 2. Motion vector spatial interpolation

The MVF extracted from the compressed domain suffers from a lack of coherence between the estimated motion and the actual motion.


The spatiotemporal motion vector field enhancement performed on a macroblock basis alleviates this drawback by smoothing and interpolating new vectors. Given an enhanced MVF, we apply an upscale-by-two operation to increase its spatial granularity. For a given block, with an extracted motion vector and an interpolated motion vector located at its diagonal corners, as shown in Figure 2, our straightforward approach is to average these two motion vectors and assign the result to the block as its derived motion vector. This method introduces no error when the parent motion vectors have the same horizontal and vertical motion; when the parent motion vectors represent different motion characteristics, it still reflects this incoherence at the block level. Figure 3 compares the spatiotemporally estimated motion vectors for a given I-frame with the motion vectors obtained by an exhaustive-search BMA; the estimation is very close to the optimal motion estimation, with lower computational complexity and higher spatial density.
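The upscale-by-two step described above can be sketched as follows. The layout convention, with each parent macroblock contributing one averaged vector that is replicated onto its four half-size sub-blocks, is our simplified reading of Figure 2, not a prescription from the paper.

```python
import numpy as np

def upscale_mvf_by_two(extracted_mvf, interpolated_mvf):
    """Double the spatial granularity of an enhanced MVF by averaging,
    per block, the extracted vector and the spatially interpolated vector
    at its diagonal corners, then assigning the average to the sub-blocks."""
    avg = 0.5 * (np.asarray(extracted_mvf) + np.asarray(interpolated_mvf))
    # Replicate each averaged vector onto the four half-size sub-blocks.
    return np.repeat(np.repeat(avg, 2, axis=0), 2, axis=1)
```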


Fig. 3. Comparison of spatiotemporally interpolated I-frame MVF to BMA: (a) Previous P-frame MVF; (b) I-frame MVF obtained by BMA; (c) Spatiotemporally interpolated I-frame MVF.

3. Estimating Camera Motion from MPEG Video

In this section, we present the algorithm for detecting and separating foreground moving objects under the influence of inconsistent camera motion. The camera is assumed to undergo rotation, zoom and translation. The change of intensities between frames can be modeled by a generalized affine transformation. The iterative segmentation is performed on a block basis using this motion model; as a result, certain blocks in the reference frames are marked as possibly belonging to the foreground.

3.1. Generalized Motion Model

Global motion is usually introduced by the camera's operation and movement, while local motion comes from the displacement of objects in the scene.


Separating foreground from background requires us to first segment video frames into objects that are subject only to camera motion and objects that are influenced by both camera and object motion. Algorithms for this separation differ in the models used to represent motion as well as in the techniques for estimating the model parameters.

Consider a point P in the background moving in 3-D space, and let its displacement from the origin at time t be D = (X_t, Y_t, Z_t) ∈ R^3. For any two time instances t and t′, we assume the camera undergoes a combined motion of rotation, zoom and translation. Hence, the point P at time t is moved to P′ at time t′ as follows:

\[ \begin{pmatrix} X' \\ Y' \\ Z' \end{pmatrix} = \begin{pmatrix} p_1 & p_2 & p_3 \\ -p_2 & p_1 & p_4 \\ p_5 & p_6 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{4} \]

An image acquisition system projects 3-D points onto a 2-D image plane with image coordinates d = (x_t, y_t) ∈ R^2. A generalized 2-D motion model resulting from this projection is then defined as:

\[ \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \dfrac{p_1 x + p_2 y + p_3}{p_5 x + p_6 y + 1} \\[2ex] \dfrac{-p_2 x + p_1 y + p_4}{p_5 x + p_6 y + 1} \end{pmatrix} \tag{5} \]

This model is a good approximation under the assumptions that lens distortion effects are minimal, the inter-frame camera motion is relatively small, and the camera undergoes only the aforementioned motions. Using the parametric transformation model from Tan et al.,15 the unknowns {p_1, ..., p_6} satisfy

\[ \begin{pmatrix} p_1 & p_2 & p_3 \\ -p_2 & p_1 & p_4 \\ p_5 & p_6 & 1 \end{pmatrix} = \begin{pmatrix} \frac{f'}{f} & \frac{f'}{f}\theta_z & -f'\theta_x \\ -\frac{f'}{f}\theta_z & \frac{f'}{f} & f'\theta_y \\ \frac{\theta_x}{f} & \frac{\theta_y}{f} & 1 \end{pmatrix} \tag{6} \]

where f and f′ are the camera focal lengths in the reference frames, f′/f is the inter-frame camera zoom factor, θ_z is the change in the angle of camera rotation about the lens axis, and θ_x and θ_y are the camera rotation angles about the x-axis and y-axis, respectively. Assuming that θ_x and θ_y are sufficiently small, fθ_y and fθ_x can approximately simulate the displacement resulting from camera panning and tilting between the reference frames.

Pixel-wise estimation of the unknowns is widely used in the literature; see Tan et al.16 and MPEG-4.17 This approach, however, is computationally expensive and not appropriate for real-time applications, owing to the large number of pixels involved in the iterative computation. To overcome this limitation, we propose the use of the dense block-based MVF instead. For the kth block B_t^k in the current frame at time t, let (x_t^k, y_t^k) be the coordinates of the center of the block. If the image acquisition system samples frames at a fairly high rate, the camera motion between two adjacent frames can reasonably be assumed small.


We can then express the block coordinates before camera motion in terms of the corresponding block coordinates after camera motion as:

\[ B_{t-1}^k = \begin{pmatrix} x_{t-1}^k \\ y_{t-1}^k \end{pmatrix} = \begin{pmatrix} x_t^k + \Delta x_t^k - x_t^0 \\ y_t^k + \Delta y_t^k - y_t^0 \end{pmatrix} \tag{7} \]

where (x_t^0, y_t^0) are the coordinates of the center of the frame and (Δx_t^k, Δy_t^k) is the motion vector of B_t^k. Hence, each pair B_{t-1}^k and B_t^k in the previous and current reference frames is considered a corresponding feature pair, and each such pair of matching blocks yields one possibly noisy sample. In theory, three pairs of independent samples would suffice to estimate the unknowns in Equation 5. Considering the measurement noise, however, it is essential to add more pairs of samples to the resulting over-determined system to improve robustness. Estimation with N pairs of samples is represented as:

\[ \mathbf{K} \cdot \mathbf{P} = \mathbf{U} \tag{8} \]

where

\[ \mathbf{K} = \begin{pmatrix} x_t^1 - x_t^0 & y_t^1 - y_t^0 & 1 & 0 & -(x_t^1 - x_t^0)(x_{t-1}^1 - x_{t-1}^0) & -(y_t^1 - y_t^0)(x_{t-1}^1 - x_{t-1}^0) \\ y_t^1 - y_t^0 & -(x_t^1 - x_t^0) & 0 & 1 & -(x_t^1 - x_t^0)(y_{t-1}^1 - y_{t-1}^0) & -(y_t^1 - y_t^0)(y_{t-1}^1 - y_{t-1}^0) \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_t^N - x_t^0 & y_t^N - y_t^0 & 1 & 0 & -(x_t^N - x_t^0)(x_{t-1}^N - x_{t-1}^0) & -(y_t^N - y_t^0)(x_{t-1}^N - x_{t-1}^0) \\ y_t^N - y_t^0 & -(x_t^N - x_t^0) & 0 & 1 & -(x_t^N - x_t^0)(y_{t-1}^N - y_{t-1}^0) & -(y_t^N - y_t^0)(y_{t-1}^N - y_{t-1}^0) \end{pmatrix} \]

\[ \mathbf{P} = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_5 \\ p_6 \end{pmatrix}, \qquad \mathbf{U} = \begin{pmatrix} x_{t-1}^1 - x_{t-1}^0 \\ y_{t-1}^1 - y_{t-1}^0 \\ \vdots \\ x_{t-1}^N - x_{t-1}^0 \\ y_{t-1}^N - y_{t-1}^0 \end{pmatrix} \]

The least-squares solution is given by

\[ \mathbf{P} = (\mathbf{K}^{\mathsf{T}} \mathbf{K})^{-1} (\mathbf{K}^{\mathsf{T}} \mathbf{U}) \tag{9} \]

It is worth noting that, in many cases, the ordinary inverse of K^T K does not exist because the matrix is singular. This can be handled by adopting a generalized matrix inverse method, such as the Singular Value Decomposition (SVD) method in Press et al.18
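A sketch of the block-based estimation follows. It stacks the two K-matrix rows of Equation 8 per block pair and solves the over-determined system with NumPy's SVD-based least-squares routine, which plays the role of the generalized inverse; the variable names are ours.

```python
import numpy as np

def estimate_global_motion(curr_centers, prev_centers, frame_center):
    """Estimate P = (p1, ..., p6) of Equation 8 from N matched block centers.

    curr_centers, prev_centers: (N, 2) block-center coordinates in the
    current and previous reference frames; frame_center: (x0, y0).
    """
    xy = np.asarray(curr_centers, dtype=float) - frame_center       # (x, y)
    xy_prev = np.asarray(prev_centers, dtype=float) - frame_center  # (x', y')
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(xy, xy_prev):
        rows.append([x, y, 1.0, 0.0, -x * xp, -y * xp])   # x'-equation row
        rows.append([y, -x, 0.0, 1.0, -x * yp, -y * yp])  # y'-equation row
        rhs.extend([xp, yp])
    K = np.asarray(rows)
    U = np.asarray(rhs)
    # K^T K is often singular, so solve with an SVD-based least-squares
    # routine instead of forming the normal-equation inverse explicitly.
    P, *_ = np.linalg.lstsq(K, U, rcond=None)
    return P
```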


3.2. Iterative Foreground Block Detection

Using the generalized global motion model, we introduce an iterative procedure that detects blocks with larger errors than the average error of the estimation. Since the global motion model is only an estimate of the camera motion, it cannot precisely represent the motion of foreground blocks. Exploiting this mismatch, the procedure coarsely categorizes the blocks in the current reference frame into background and foreground, and terminates when an iteration leaves the set of detected blocks unaltered.

In detail, all blocks in the current frame I_t are initially set to be background blocks, and are thus assumed to comply with the estimated motion model P. For each background block {B_t^k, k ∈ [1, ..., N]}, we obtain an approximated block B'^k_{t-1} that represents B_t^k in the previous reference frame under motion model P. Due to the presence of inconsistent foreground motion and the existence of noisy motion vectors, the residual intensity error between B'^k_{t-1} and B^k_{t-1} is defined as:

\[ e_k = \sqrt{\sum_i \left[ I\!\left(x'^{\,ki}_{t-1}, y'^{\,ki}_{t-1}\right) - I\!\left(x^{ki}_{t-1}, y^{ki}_{t-1}\right) \right]^2 }, \quad k \in [1, \ldots, N] \tag{10} \]

This residual error measures the accuracy of P in matching the data samples of the estimated motion model. From these errors, we compute the sample standard deviation σ_e. A portion of the sample block pairs are rejected as outliers and are not considered in the next iteration if their estimation errors exceed cσ_e, where c is an empirically predefined constant, usually set between 3 and 4. Note that the estimated position (x'^{ki}_{t-1}, y'^{ki}_{t-1}) generally does not fall on integer pixel coordinates; a bilinear interpolation of the intensity I(x'^{ki}_{t-1}, y'^{ki}_{t-1}) is therefore used to perform the re-sampling. The detailed steps are described in Algorithm 1.

Algorithm 1 Iterative foreground object detection algorithm
1: for k = 1 to Number of MBs do
2:   BgMask[k] ← 1
3: end for
4: while BgMask is changed do
5:   for all k with BgMask[k] = 1 do
6:     (x^k_{t-1}, y^k_{t-1}) ← (x^k_t, y^k_t) + (Δx^k_t, Δy^k_t)
7:   end for
8:   Generate K and U; P ← (K^T K)^{-1}(K^T U)
9:   (x'^k_{t-1}, y'^k_{t-1}) ← (x^k_t, y^k_t) × P
10:  Calculate the residual error e_k of Equation 10, k ∈ [1, ..., N]
11:  σ_e ← std(e_k)
12:  for all e_k > cσ_e do
13:    OutLierMap[k] ← 1
14:  end for
15:  for all BgMask[k] = 1 do
16:    if the kth MB is flagged in OutLierMap then
17:      BgMask[k] ← 0
18:    end if
19:  end for
20: end while
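Under the same assumptions, Algorithm 1 translates into the following sketch. It reuses `estimate_global_motion` from the previous sketch, approximates the per-block residual of Equation 10 by sampling at block centers rather than summing over all pixels of the block, and uses a small bilinear sampler for the non-integer predicted positions; this illustrates the loop structure, not the authors' exact implementation.

```python
import numpy as np

def bilinear(frame, pt):
    """Bilinear intensity sample at a non-integer, in-range (x, y) point."""
    x, y = pt
    x0 = int(np.clip(np.floor(x), 0, frame.shape[1] - 2))
    y0 = int(np.clip(np.floor(y), 0, frame.shape[0] - 2))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * frame[y0, x0] + dx * (1 - dy) * frame[y0, x0 + 1]
            + (1 - dx) * dy * frame[y0 + 1, x0] + dx * dy * frame[y0 + 1, x0 + 1])

def apply_model(P, xy):
    """Map center-relative block coordinates through the model of Equation 5."""
    p1, p2, p3, p4, p5, p6 = P
    x, y = xy[:, 0], xy[:, 1]
    denom = p5 * x + p6 * y + 1.0
    return np.stack([(p1 * x + p2 * y + p3) / denom,
                     (-p2 * x + p1 * y + p4) / denom], axis=1)

def detect_foreground(centers, mvs, frame_center, prev_frame, c=3.5):
    """Iterative foreground block detection (cf. Algorithm 1)."""
    centers = np.asarray(centers, dtype=float)
    bg = np.ones(len(centers), dtype=bool)   # all blocks start as background
    moved = centers + np.asarray(mvs)        # MV-compensated positions (Eq. 7)
    changed = True
    while changed:
        sel = np.where(bg)[0]
        P = estimate_global_motion(centers[sel], moved[sel], frame_center)
        pred = apply_model(P, centers[sel] - frame_center) + frame_center
        # Residual error of Equation 10, approximated at block centers.
        e = np.array([abs(bilinear(prev_frame, p) - bilinear(prev_frame, m))
                      for p, m in zip(pred, moved[sel])])
        outliers = sel[e > c * e.std()]
        changed = outliers.size > 0          # stop when the block set is stable
        bg[outliers] = False
    return ~bg                               # foreground mask
```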


The experiments show that this iterative segmentation process often converges within three or four iterations, depending on the motion complexity of the test video sequences. Although this method is faster and simpler than pixel-wise estimation techniques, some blocks may still be falsely classified, owing to inaccurate motion vectors from the compressed bitstreams and the inability of the motion model to fully capture the underlying global motion. In order to identify and discard falsely labeled blocks, the temporal consistency of the output of the iterative segmentation is examined over a temporal sliding window.

4. Temporal Tracking Refinement

The segmented foreground blocks are temporally tracked using the dense MVF to ensure the temporal consistency of the segmentation. The heuristic tracking method is based upon Favalli et al.,10 but requires no human intervention; it uses the coarse segmentation results as its input. For instance, assume that, in the current frame at time t, block B_t^{i,j} has moved up and left, falling into the area covered by blocks B_{t+1}^{i-1,j-1}, B_{t+1}^{i-1,j}, B_{t+1}^{i,j-1} and B_{t+1}^{i,j} in the next reference frame at time t+1, as shown in Figure 4. Since only discrete positions can be specified, a threshold is needed to determine whether B_t^{i,j} has moved to one of the four blocks in the next frame. If the visual information of B_t^{i,j} overlaps that of a block in a neighboring position by more than 25%, the new block is also considered part of the object, and both B_t^{i,j} and the new block are tracked in the sequel. If the overlap exceeds 75%, only the new block is considered in the next run.

Fig. 4. Block B_t^{i,j} in frame t moved up and left in frame t+1
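The 25%/75% overlap rule can be sketched as follows, assuming axis-aligned macroblocks of `block_size` pixels and a displacement given by the block's motion vector. The area-overlap computation and the return convention are our own simplification of the rule, not the authors' code.

```python
def track_block(i, j, mv, block_size=16):
    """Decide which grid positions a displaced foreground block propagates to.

    Returns a list of (row, col, keep_old) tuples: a neighboring position
    whose area overlap with the displaced block exceeds 25% becomes
    foreground too; above 75% only the new block is kept (keep_old=False).
    """
    dx, dy = mv
    # Integer grid displacement and fractional offsets within a block.
    base_i, base_j = i + int(dy // block_size), j + int(dx // block_size)
    fy = (dy % block_size) / block_size
    fx = (dx % block_size) / block_size
    decisions = []
    for di, wy in ((0, 1 - fy), (1, fy)):
        for dj, wx in ((0, 1 - fx), (1, fx)):
            overlap = wx * wy              # fraction of block area covered
            if overlap > 0.25:
                decisions.append((base_i + di, base_j + dj, overlap <= 0.75))
    return decisions
```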

More specifically, consider a sliding window of length T over the time axis, let {M_{t-i}, i = 0, 1, ..., T} denote the output of the iterative segmentation for frame I_{t-i}, and let τ(·) be the tracking operation. The refined segmentation result M_t for frame I_t is obtained as shown in Algorithm 2. The process terminates once the imposed restrictions are satisfied. These restrictions concern the size of valid moving objects and the temporal duration of their existence, because very small objects and objects of very short temporal duration are likely to have been falsely identified during the segmentation process. The removal of these outliers fine-tunes the segmentation output. The procedure is illustrated in Figure 5.


Algorithm 2 Foreground segmentation refinement via object tracking
1: for i = 1 to Number of MBs do
2:   M^i_temp ← 0
3: end for
4: for i = 1 to Number of foreground MBs do
5:   if M^i_{t-T} = 1 then
6:     M^i_temp ← 1
7:   end if
8: end for
9: for i = T to 1 step −1 do
10:  M ← τ(M_temp)
11:  for j = 1 to Number of MBs do
12:    if M^j = 1 and M^j_{t-i+1} = 1 then
13:      M^j_temp ← 1
14:    end if
15:  end for
16: end for
17: M_t ← M_temp
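Algorithm 2 reduces to a short loop over the sliding window. In the sketch below, masks are boolean NumPy arrays over the macroblock grid and `tau` stands for the tracking operation τ(·), which could be built on the overlap rule sketched above; this is one reading of the refinement logic, not the authors' exact code.

```python
import numpy as np

def refine_segmentation(masks, tau):
    """Temporal refinement of the current frame's segmentation (Algorithm 2).

    masks: list [M_{t-T}, ..., M_{t-1}, M_t] of boolean foreground masks from
    the iterative segmentation over a window of length T.
    tau: tracking operation propagating a mask one frame forward.
    """
    m_temp = masks[0].copy()            # seed with the oldest mask M_{t-T}
    for m_next in masks[1:]:
        # Track one frame forward, then keep only blocks that are also
        # foreground in that frame's iterative-segmentation output.
        m_temp = tau(m_temp) & m_next
    return m_temp                       # refined mask M_t

# Usage sketch: refined = refine_segmentation(window_masks, tau=my_tracker)
```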

Fig. 5. Temporal tracking result: (a) 13th frame of FootballPlayerTracking sequence; (b) Coarse output of iterative segmentation; (c) Refined result after temporal tracking.

5. Experimental Results

For performance evaluation, we have applied the proposed method to various sports video sequences, including car racing, football playing and sand sliding.


All test sequences are compressed by an MPEG-2 encoder using the frame pattern IBBPBBPBBPBB. They contain complicated, combined object motion and incoherent camera motion. In every simulation, we first extract motion vectors from the test video sequences. For each P- and B-frame, whose motion vectors are already encoded, only the spatial interpolation is carried out to enhance the spatial granularity of the MVF; every I-frame, meanwhile, is subject to both temporal and spatial interpolation to obtain a spatiotemporally dense and reliable MVF. The size of the temporal tracking window is set to 4 in all experiments.

5.1. Test sequences I, II: Football Playing

In these two cases, the test sequences are obtained from a football game. The frame size is 320 × 240 for both. For test case I, the camera and object motion are relatively simple: only fast camera panning is observed, and the moving foreground object is a single player running from right to left. Figure 6 (a)-(c) show the 11th, 13th and 15th frames extracted from the compressed sequence. The spatiotemporally enhanced MVFs of the corresponding frames used for the object segmentation are displayed in Figure 6 (d)-(f). Figure 6 (g)-(i) show the initial separation results after the iterative segmentation; the results contain some falsely identified blocks. Figure 6 (j)-(l) demonstrate the temporal tracking output: the outlying blocks, due to their short existence or small size, have been removed. Considering the rather simple motion in this case, the constant c in the iterative segmentation process is empirically set to 4.

Test case II is also part of the football game. In this case, the scene contains a complicated combination of camera motions, tracking and zooming, and several different object motions, i.e., football players moving in various directions. In addition, the background is comparatively complex; the parked vehicles are easily mistaken for moving foreground. Based on these observations, we set the constant c to 3.5. Figure 7 (a)-(h) show the original frames, the dense MVF, the rough outputs of the iterative segmentation, and the fine-tuned block-level segmentation results after temporal tracking.

5.2. Test sequence III: Sand Sliding

This test sequence contains a major camera-zooming motion and a slight panning motion. Both the foreground and the camera move relatively fast. The frame size is 352 × 288, and the constant c is set to 3.7. Figure 8 shows a series of object extraction results. The moving object is correctly extracted on a block basis. Because of occlusion around the object boundary, the extracted object mask is bigger than the real object region; accurate contours of moving objects can be obtained by pixel-wise techniques after decoding the extracted objects.



Fig. 6. Experimental result of FootballPlayingI sequence: (a) Original 11th frame; (b) Original 13th frame; (c) Original 15th frame; (d) Spatiotemporally interpolated MVF for the 11th frame; (e) Spatiotemporally interpolated MVF for the 13th frame; (f) Spatiotemporally interpolated MVF for the 15th frame; (g) Initial segmentation for the 11th frame; (h) Initial segmentation for the 13th frame; (i) Initial segmentation for the 15th frame; (j) Fine-tuned result for the 11th frame; (k) Fine-tuned result for the 13th frame; (l) Fine-tuned result for the 15th frame.

5.3. Test sequence IV: Car Racing

The test sequence in this case is obtained from a car racing video. The camera motion in this kind of video is relatively simple, as only camera panning and zooming are observed; however, the foreground object moves forward fast and makes hard turns. The frame size is 320 × 240. Taking the complicated foreground motion and the fast-moving background into account, the constant c is set to 3. Figure 9 shows the experimental results, which successfully identify and track the fast-moving object.

In summary, the proposed approach boosts the performance of detecting moving objects in the compressed domain while keeping the computational complexity low compared to similar algorithms. For each video shot, MVs are extracted directly from the MPEG compressed bitstream, avoiding the time-consuming block matching process.



Fig. 7. Experimental result of FootballPlayingII sequence: (a) Original 20th frame; (b) Original 22nd frame; (c) Spatiotemporally interpolated MVF for the 20th frame; (d) Spatiotemporally interpolated MVF for the 22nd frame; (e) Initial segmentation for the 20th frame; (f) Initial segmentation for the 22nd frame; (g) Fine-tuned result for the 20th frame; (h) Fine-tuned result for the 22nd frame.

Table 1. Comparison of average execution time.

Test Sequence         # of Frames   Movie Length (s)   Proposed Method*   FOD*   % Improvement
Football Playing I    167           6.7                0.91               1.16   27.7%
Football Playing II   230           9.2                0.95               1.15   20.6%
Sand Sliding          138           5.5                1.54               1.92   24.7%
Car Racing            215           8.6                1.42               1.85   30.3%

*: average execution time in seconds per frame.

Moving object detection is conducted only at the block level; pixel-wise motion detection is not strictly necessary, as it might yield better performance but at a huge computational cost. Although we add a temporal tracking refinement as post-processing, the extra computational complexity it brings to the entire process is small, because it is conducted only on a small number of candidate blocks, while its benefits, fine-tuning the segmentation output and making the motion consistent, should not be neglected. We compared the performance of our proposed approach to Sukmarg et al.9 (FOD) in terms of average execution time; the results can be found in Table 1.



Fig. 8. Experimental result of SandSliding sequence: (a) Original 31st frame; (b) Original 33rd frame; (c) Spatiotemporally interpolated MVF for the 31st frame; (d) Spatiotemporally interpolated MVF for the 33rd frame; (e) Initial segmentation for the 31st frame; (f) Initial segmentation for the 33rd frame; (g) Fine-tuned result for the 31st frame; (h) Fine-tuned result for the 33rd frame.

The proposed approach can successfully extract moving objects, which can be treated as moving regions or motion blobs. This is very useful for motion analysis when precise object shapes are not required. Moreover, this extraction offers a good initialization for subsequent accurate object segmentation.

6. Conclusions

In this paper, we have proposed a new scheme for motion-compensated background/foreground segmentation under inconsistent camera motion. The proposed algorithm operates fast and produces semantically meaningful spatiotemporal objects at the block level. Compressed-domain video bitstreams provide crude motion information for both background and foreground objects; because of its limitations and coarseness, this information is subjected to a spatiotemporal interpolation that enables the generation of reliable and dense motion vector fields. An iterative segmentation algorithm based upon this enhanced motion vector field yields the estimated background motion model and the foreground blocks. Following this process, a temporal tracking algorithm demonstrates its efficacy in removing noise and refining the segmentation results. Extending this technique to other domains of structured video, such as content-based video retrieval and frame rate up-conversion, is a promising direction for further research.



Fig. 9. Experimental result of CarRacing sequence: (a) Original 12th frame; (b) Original 14th frame; (c) Spatiotemporally interpolated MVF for the 12th frame; (d) Spatiotemporally interpolated MVF for the 14th frame; (e) Initial segmentation for the 12th frame; (f) Initial segmentation for the 14th frame; (g) Fine-tuned result for the 12th frame; (h) Fine-tuned result for the 14th frame.

References
1. J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
2. P. Salembier and F. Marques, "Region-based representations of image and video: segmentation tools for multimedia services," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, pp. 1147-1169, 1999.
3. I. Kompatsiaris, G. Mantzaras, and M. G. Strintzis, "Spatiotemporal segmentation and tracking of objects in color image sequences," in Proceedings of the 2000 IEEE International Symposium on Circuits and Systems, vol. 5, 2000, pp. 29-32.
4. E. Sifakis and G. Tziritas, "Moving object localization using a multilabel fast marching algorithm," Signal Processing: Image Communication, vol. 10, 2001, pp. 963-976.
5. N. O'Connor, S. Sav, T. Adamek, and V. Mezaris, "Region and object segmentation algorithms in the Qimera segmentation platform," in Proceedings of the 3rd International Workshop on Content-based Multimedia Indexing, 2003, pp. 381-388.
6. H. L. Eng and K. K. Ma, "Spatiotemporal segmentation of moving video objects over MPEG compressed domain," in Proceedings of the 2000 IEEE International Conference on Multimedia and Expo, vol. 3, 2000, pp. 1531-1534.
7. N. V. Boulgouris, E. Kokkinou, and M. G. Strintzis, "Fast compressed domain segmentation for video indexing and retrieval," in Proceedings of the 2002 Tyrrhenian International Workshop on Digital Communications, 2002, pp. 295-300.
8. M. L. Jamrozik and M. H. Hayes, "A compressed domain video object segmentation system," in Proceedings of the 2002 IEEE International Conference on Image Processing, vol. 1, 2002, pp. 113-116.
9. O. Sukmarg and K. R. Rao, "Fast object detection and segmentation in MPEG compressed domain," in Proceedings of the 2000 IEEE TENCON, vol. 3, 2000, pp. 364-368.


10. L. Favalli, A. Mecocci, and F. Moschetti, "Object tracking for retrieval applications in MPEG-2," IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 3, pp. 427-432, 2000.
11. "Generic coding of moving pictures and associated audio information (MPEG-2), ISO/IEC 13818," Tech. Rep., 1996.
12. D. Li and I. Sethi, "MDC: A software tool for developing MPEG applications," in Proceedings of the 1999 IEEE International Conference on Multimedia Computing and Systems, 1999, pp. 445-450.
13. J. Ascenso, C. Brites, and F. Pereira, "Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding," in Proceedings of the 2005 European Association for Signal, Speech and Image Processing, 2005.
14. B. Shen, I. K. Sethi, and B. Vasudev, "Adaptive motion-vector resampling for compressed video downscaling," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 6, pp. 929-936, 1999.
15. Y. P. Tan, S. R. Kulkarni, and P. J. Ramadge, "A new method for camera motion parameter estimation," in Proceedings of the 1995 IEEE International Conference on Image Processing, vol. 1, 1995, pp. 406-409.
16. Y. Tan, D. D. Saur, S. R. Kulkarni, and P. J. Ramadge, "Rapid estimation of camera motion from compressed video with application to video annotation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 1, pp. 133-146, 2000.
17. "MPEG-4 video verification model version 18.0, ISO/IEC JTC1/SC29/WG11," 2001.
18. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1993.

Photo and Biography

Jinsong Wang received his Bachelor of Science degree in computer science from the Beijing Institute of Technology, Beijing, China, in June 1998. He joined the Department of Computer Science of Wayne State University in August 2000. Working with Dr. Nilesh Patel, Dr. Farshad Fotouhi and Dr. William Grosky, he has pursued research in multimedia technology, especially in the field of motion-compensated video frame rate conversion. He received his Doctoral degree in Computer Science from Wayne State University, Detroit, MI in 2006.

Dr. Nilesh V. Patel is currently Assistant Professor in the Department of Computer Science and Engineering at Oakland University, MI. Dr. Patel received his MS and PhD in Computer Science from Wayne State University, Detroit, in 1993 and 1997, respectively. He completed his undergraduate degree in Control Engineering in 1989 at Gujarat University, Ahmadabad, India. Prior to joining Oakland University, Dr. Patel served as an Assistant Professor at the University of Michigan-Dearborn and worked at Ford Motor


Co. and Visteon Corp. His research interests are in the fields of multimedia information systems, pattern recognition, and distributed data mining.

William I. Grosky is currently professor and chair of the Department of Computer and Information Science at the University of Michigan-Dearborn. Before joining UMD in 2001, he was professor and chair of the Department of Computer Science at Wayne State University, as well as an assistant professor of Information and Computer Science at the Georgia Institute of Technology in Atlanta. His current research interests are in multimedia information systems, databases, and the semantic web. He is a founding member of Intelligent Media LLC, a Michigan-based company whose interests are in integrating new media into information technologies. Grosky received his B.S. in mathematics from MIT in 1965, his M.S. in applied mathematics from Brown University in 1968 and his Ph.D. from Yale University in 1971. He has given many short courses in the area of database management for local industries and has been invited to lecture on multimedia information systems worldwide. Serving also on many database and multimedia conference program committees, he was an Editor-in-Chief of IEEE Multimedia, and is currently on the editorial boards of IEEE Multimedia, International Journal of Information and Communication Technology Education, International Journal on Semantic Web and Information Systems, International Journal of Semantic Computing, Journal of Digital Information Management, and Multimedia Tools and Applications.

Farshad Fotouhi received his Ph.D. in computer science from Michigan State University in 1988. He joined the faculty of Computer Science at Wayne State University in August 1988, where he is currently Professor and Chair of the department. Dr. Fotouhi's major areas of research include XML databases, the semantic web, multimedia systems, and query optimization. He has published over 100 papers in refereed journals and conference proceedings and has served as a program committee member of various database-related conferences. Dr. Fotouhi is on the Editorial Boards of IEEE Multimedia Magazine and The International Journal on Semantic Web and Information Systems.
