Aligning endoluminal scene sequences in wireless capsule endoscopy


Michal Drozdzal 1,2, Laura Igual 1,2, Jordi Vitrià 1,2, Carolina Malagelada 3, Fernando Azpiroz 3, Petia Radeva 1,2

1 Computer Vision Center, Campus UAB, Edifici O, 08193 Bellaterra, Spain
2 Dept. Matemàtica Aplicada i Anàlisi, Universitat de Barcelona, Gran Via 585, 08007 Barcelona, Spain
3 Digestive System Research Unit, University Hospital Vall d'Hebron, Barcelona, Spain

Abstract

Intestinal motility analysis is an important examination for the detection of various intestinal malfunctions. One of the big challenges of automatic motility analysis is how to compare sequences of images and extract dynamic patterns, taking into account the high deformability of the intestine wall as well as the capsule motion. From a clinical point of view, the ability to align endoluminal scene sequences helps to find regions of similar intestinal activity and in this way provides valuable information on intestinal motility problems. This work addresses, for the first time, the problem of aligning endoluminal sequences taking into account the motion and structure of the intestine. To describe motility in a sequence, we propose different descriptors based on the Sift Flow algorithm, namely: (1) Histograms of Sift Flow Directions to describe the flow course, (2) Sift Descriptors to represent the intestinal structure of the image and (3) Sift Flow Magnitude to quantify intestine deformation. We show that merging all three descriptors provides robust information for sequence description in terms of motility. Moreover, we develop a novel methodology to rank the intestinal sequences based on expert feedback about the relevance of the results. The experimental results show that the selected descriptors are useful for alignment and similarity description, and that the proposed method allows the analysis of WCE.

1. Introduction

Wireless Capsule Endoscopy (WCE) [7] is a recent imaging technique for examination of the Gastrointestinal (GI) tract. The technique is based on a capsule, of dimensions 11 mm × 26 mm and weight 3.7 g, which contains a camera, a battery, LED lamps for illumination and a radio transmitter. The capsule is swallowed by the patient, captures frames at a rate of 2 fps and emits them via a radio frequency signal to an external device. This technique acquires an internal view of the GI tract in a non-invasive way compared with manometry and does not require patient hospitalization. The whole trip of the capsule takes approximately eight hours and provides around 60,000 images per study. Specialists need between four and eight hours for screening the video. This makes examination and interpretation of the capsule recordings time consuming and tedious, and thus motivates the need for automation of video analysis.

The endoluminal scene in WCE is composed of different elements which can be visualized in the inner gut (Fig. 1), namely: intestinal walls, intestinal contents and the intestinal lumen. Moreover, the intestine produces peristaltic movements and, as a result, the lumen size changes: it may be open, partly contracted or contracted (Fig. 1). Since the WCE capsule can move freely in the GI tract, a variety of orientations of the scene can be produced. In addition, intestinal contents, which can be seen as a turbid liquid or as a mass of bubbles, may appear and hinder the proper visualization of the scene.

An endoluminal scene sequence is defined as a consecutive series of frames which constitutes an intestinal action unit. Various types of intestinal actions which are diagnostically and clinically interesting can be distinguished: (1) Contraction sequence, defined in WCE video as an open-closed-open movement of the lumen; generally, when the intestinal wall is closed during the contraction, a star-wise pattern is present [14], and an example of a phasic occlusive contraction is shown in Fig. 1. (2) Tunnel sequence, defined as the absence of contractile activity in the relaxed intestine, visualized as a sequence of open lumens. (3) Static sequence, a part of the WCE video with no intestinal activity. (4) Intestinal content sequence, consecutive frames with intestinal content.

978-1-4244-7030-3/10/$26.00 ©2010 IEEE
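As a rough sanity check of the acquisition figures quoted above (2 fps over an approximately eight-hour transit), the implied frame count can be computed directly; the transit duration is an approximation, so the result is only indicative:

```python
# Approximate number of frames in one WCE study:
# the capsule records at 2 frames per second for roughly 8 hours.
FPS = 2
TRANSIT_HOURS = 8

n_frames = FPS * 60 * 60 * TRANSIT_HOURS
print(n_frames)  # 57600 frames, the same order as the ~60,000 images cited
```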
From an intestinal motility point of view, similar sequences

Figure 2. Scene sequence examples of intestinal contractions.

Figure 1. Illustration of endoluminal scene and sequences in WCE.

are those sequences with resembling movements. In this work, we focus on aligning endoluminal scene sequences based on dynamic patterns. Alignment of endoluminal scene sequences is an important challenge in small intestine examination, which allows comparing predefined patterns of intestinal performance in WCE and identifying regions of similarity. Intestinal dynamics has a high variability in the manner and duration of deformation, as can be seen in Fig. 1, where several sequence examples are shown. For this reason, alignment of endoluminal scene sequences should be flexible with respect to sequence length and has to be robust to variations due to capsule motion.

Several works have been developed on endoluminal scene analysis, most of them focused on the characterization of different lesions in the gut, such as polyps [13] or bleeding [6]. Others focus on reducing the time needed for visualization of the WCE video [5], [17]. Recently, works to assist in the analysis of intestinal motility disorders using WCE have been developed [15], [8], [16]. The approach in [15] copes with the problem of intestinal contraction detection. It is based on static information from lumen features of fixed-size sequences (nine frames) and does not exploit the motion structure of the capsule recordings. Histograms of gradients are used in [16]. In [8], the authors propose eigenmotion-based contraction detection, where the optical flow between consecutive frames is used in fixed-size sequences. The problem of detecting measurable contraction sequences in WCE video is presented in [2]. All of them (except [2]) assume a fixed sequence length and are based on optical flow or static features (e.g. lumen size). On the other hand, intestines are highly deformable. Moreover, due to the capsule motion, intestinal events can be captured from different view angles.

The standard way to extract motion is to use a scheme for optical flow estimation. Classical optical flow techniques are robust and precise when the deformation is smooth, but have difficulties in correctly estimating the visual deformations in the case of sudden (fast) motions. Fast motions are frequent in WCE, due to its low frame rate (2 fps). Therefore, the application of optical flow techniques to capture contraction movements is limited. In order to overcome this problem, we use the Sift Flow (SF) technique [11] to extract the motion information. SF estimates the motion between points using their Sift descriptors and has been shown to be more robust to abrupt variations between consecutive frames. However, since we are interested in describing the structure and deformation of the intestinal wall, we consider three different descriptors to estimate the similarity between sequences: the Histogram of Sift Flow Directions (HSFD) to describe the transformation direction; the Sift Descriptor (SD) to measure the similarity between the structures of the images; and the Sift Flow Magnitude (SFM) to quantify the intestine deformation. The union of these three descriptors provides a complete motion characterization and allows us to define tools to measure the similarity between intestinal sequences.

In this work, we propose a novel technique for aligning intestinal sequences based on three steps: (1) similarity feature extraction, (2) computation of the similarity between sequences and (3) a scheme for optimal weighting of the different features. In the first step, the SF technique is used to estimate the motion in the sequences. In the second step, we apply Dynamic Time Warping (DTW) in order to compare sequences of different lengths. Finally, in the last step, we define a scheme based on the relevance feedback algorithm [1] to learn the weights assigned to each descriptor.
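To make the flow-based descriptors concrete, the sketch below derives a direction histogram (HSFD-style) and a per-pixel magnitude map (SFM-style) from a toy flow field. The array shapes, helper names and bin count are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def flow_direction_histogram(u, v, n_bins=17):
    """Histogram of flow directions (an HSFD-style descriptor).

    u, v: horizontal/vertical displacement components of a flow field.
    Angles are binned over [-pi, pi]; counts are normalized to sum to 1.
    """
    angles = np.arctan2(v, u).ravel()
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return hist / hist.sum()

def flow_magnitude(u, v):
    """Per-pixel deformation magnitude (an SFM-style descriptor)."""
    return np.sqrt(u ** 2 + v ** 2)

# Toy flow field: every pixel displaced by one pixel to the right.
u = np.ones((80, 80))
v = np.zeros((80, 80))

hsfd = flow_direction_histogram(u, v)
sfm = flow_magnitude(u, v)
# Prints the dominant direction bin and the mean magnitude (1.0 here).
print(hsfd.argmax(), sfm.mean())
```

With a uniform rightward flow, all the mass falls into the single histogram bin containing angle 0, and the magnitude map is constant, which is the behavior the descriptors are meant to capture.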
Our methodology allows: (1) comparing dynamic events independently of the sequence length and (2) using motion and structure features that are not restricted to highly smooth deformations. To the authors' knowledge, the work presented here is the first one dealing with the problem of aligning endoluminal scene sequences.

The paper is organized as follows: Section 2 gives a short description of Sift Flow, DTW and the relevance feedback algorithm, including the mathematical background; Section 3 introduces the proposed methodology; the experiments and results are presented in Section 4; finally, in Section 5 we expose conclusions and future work.

2. Background

SIFT Flow. Because of the variety of endoluminal scenes, the free movement of the capsule and the diversity of intestinal sequences, we need robust descriptors of intestinal deformation. The descriptors should be flexible and able to manage scale changes, affine distortion, viewpoint changes and illumination variations. In SF, a SIFT [12] descriptor is extracted at each pixel to characterize local image structures and encode contextual information. Then, a discrete, discontinuity-preserving optical flow algorithm is used to match the SIFT descriptors between two images. The use of SIFT features allows robust matching across different scene/object appearances, and the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. SIFT features are invariant to image scale and rotation, and have been shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The SF is calculated by minimizing the following cost function:

E(w) = \sum_p \| s_1(p) - s_2(p + w(p)) \|_1 + \frac{1}{\sigma^2} \sum_p \left( u^2(p) + v^2(p) \right) + \sum_{(p,q) \in \epsilon} \left[ \min\left( \alpha |u(p) - u(q)|, \gamma \right) + \min\left( \alpha |v(p) - v(q)|, \gamma \right) \right],

where w(p) = (u(p), v(p)) is the displacement vector at pixel p, s_i(p) is the SIFT descriptor extracted at location p in the i-th image, and ϵ is the spatial neighborhood of a pixel; σ, α and γ are parameters. Fig. 3 presents some examples of SF performance on WCE images.

Figure 3. Sift flow results of six pairs of images (six rows). The first two columns are the Sift descriptors of the two images from the third and fourth columns. The fifth column displays the result of applying the inverse of the SF to the second image of the pair. The sixth column shows the SF field of each pair, where the color indicates the deformation direction and the color intensity the deformation magnitude.

Dynamic Time Warping. Besides describing sequence frames, we need a robust technique to compare sequences of different lengths. This technique should be flexible enough to allow matching of sequences even with different and not necessarily uniform capsule motion. DTW is an algorithm for measuring the similarity between two sequences which may vary in time or speed in different ways. The cleverness of DTW lies in the computation of the distance between the input stream and the template. A detailed description of the DTW approach can be found in [4], [10] and [9]. Rather than comparing the value of the input stream at time t to the value of the template stream at the same time t, the algorithm searches for an optimal mapping from the input stream to the template stream, so that the total distance between corresponding frames is minimized. In this way, DTW aligns the time axes, allowing many-to-one and one-to-many frame matching before calculating the Euclidean distance.

Relevance feedback algorithm. When looking for the best mapping between two sequences, we need to estimate the parameters that assign the right importance to the different terms of the sequence similarity. We base our approach on the relevance feedback algorithm. Relevance feedback is widely used in content-based image retrieval, where the problem of image relevance ranking is based on content without any additional textual information. As presented in [18], the relevance feedback algorithm follows these steps: (1) the machine provides initial retrieval results; (2) the user judges the currently displayed images as to whether, and to what degree, they are relevant or irrelevant to her/his request; (3) the machine learns and tries again. In [1], the authors present an algorithm for weight learning using relevance feedback. The basic idea is to minimize the distance to images marked as relevant and to maximize the distance to non-relevant results. Consider the distance:

D(Y_m, Y_q) = \sum_i w_i \, D_i(Y_m, Y_q),

where w_i are weights and Y_m and Y_q are the model sequence and the query sequence, respectively. The following term is minimized with respect to w_i:

\sum_{m \in Q^+} \; \sum_{q^+ \in Q^+ \setminus \{m\}} \; \sum_{q^- \in Q^- \setminus \{m\}} \frac{D(Y_m, Y_{q^+})}{D(Y_m, Y_{q^-})},

where q^+ are relevant results and q^- are non-relevant results. As a result, after gradient descent is applied to minimize this function, the number of relevant results among the top retrieval results increases.
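As a concrete illustration of the DTW step described in this section, the following sketch accumulates a precomputed pairwise frame-cost matrix and recovers the alignment cost normalized by the warping-path length. The cost values are toy numbers, and the monotone step pattern (match, insertion, deletion) is the standard textbook variant rather than a configuration confirmed by the paper:

```python
import numpy as np

def dtw(cost):
    """Dynamic Time Warping over a precomputed frame-to-frame cost matrix.

    Returns the total alignment cost and the warping path length, allowing
    many-to-one and one-to-many frame matchings.
    """
    x, y = cost.shape
    acc = np.full((x + 1, y + 1), np.inf)
    acc[0, 0] = 0.0
    for j in range(1, x + 1):
        for k in range(1, y + 1):
            acc[j, k] = cost[j - 1, k - 1] + min(
                acc[j - 1, k],      # one-to-many: repeat query frame
                acc[j, k - 1],      # many-to-one: repeat model frame
                acc[j - 1, k - 1],  # one-to-one match
            )
    # Backtrack to measure the path length used for normalization.
    j, k, length = x, y, 1
    while (j, k) != (1, 1):
        j, k = min(((j - 1, k), (j, k - 1), (j - 1, k - 1)),
                   key=lambda p: acc[p])
        length += 1
    return acc[x, y], length

# Toy cost matrix between a 3-frame model and a 4-frame query.
S = np.array([[0.1, 0.9, 0.8, 0.9],
              [0.9, 0.2, 0.3, 0.9],
              [0.9, 0.8, 0.2, 0.1]])
total, path_len = dtw(S)
print(total / path_len)  # normalized sequence distance
```

Normalizing by the path length makes distances comparable across sequences of different lengths, which is how the ranking scores are made comparable later in the paper.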


Figure 5. Vertical and horizontal flow definition.

Figure 4. Algorithm scheme for aligning scene sequences in WCE.

3. Methodology

Given a model and a set of query sequences, the objective is to measure the similarity of each query sequence with respect to the model. Based on this measure, a ranking of sequence similarities with respect to the model can be created. For that, we define a pairwise alignment method and a similarity measure between the model and the query sequences. A model is a sequence of frames with a predefined intestinal pattern which is used to explore the database of endoluminal scene sequences. A query sequence is an intestinal sequence that is compared with the model. The proposed methodology for aligning sequences is based on a three-step procedure: (1) similarity feature extraction, (2) sequence similarity estimation and (3) relevance feedback for optimal weight assignment. A scheme of the algorithm is presented in Fig. 4. In the first step, the descriptors which characterize the intestinal event in the sequences are extracted. The second step deals with the similarity calculation between sequences of different lengths and, finally, in the third step, relevance feedback is used to calculate the weights of the descriptors. In the following sections, we describe the three parts of the algorithm in detail.

3.1. Similarity characteristic extraction

Given two sequences of deforming intestines, we are interested in obtaining the optimal match between their frames, taking into account the object structure and motion. Note that calculating the flow between the frames of both sequences gives us a cost of image transformation that is lower the closer the structures of the images are. On the other hand, the flow from each frame to the next one within the same sequence gives us information about the intestine deformation. Since we expect similar sequences to contain intestine appearances with similar deformations, we look for similar horizontal flows. Thus, we define two ways of computing the flow between sequences: vertical and horizontal (see Fig. 5). Horizontal flow is calculated between two consecutive frames of the same sequence; vertical flow is calculated between frames of different sequences. In order to quantify the similarity level between sequences in a robust way, we introduce three types of descriptors:

- S1: the Sift Descriptors Difference (SDD), used to measure the similarity between the structures of the images.
- S2: the Histogram of Sift Flow Directions (HSFD), used to describe the transformation directions.
- S3: the Sift Flow Magnitude (SFM), used to quantify the amount of intestine deformation.

Descriptor S1 is calculated using the vertical flow between sequences; descriptors S2 and S3 are calculated using the horizontal flow (see Fig. 6).

Figure 6. Computation of S1, S2 and S3 descriptors.

Sift Descriptors Difference S1. The matrix S1(j, k), j = 1, ..., x, k = 1, ..., y, where x and y are the lengths of the model and the query sequence respectively, is computed as follows:

S1(j, k) = ||SD(Im_j) − SDAW(Iq_k)||,

where SD is the Sift Descriptor, SDAW is the Sift Descriptor After Warping, Im and Iq are intestinal images of the two sequences, and ||·|| denotes the Euclidean norm. Each element of S1(j, k) describes the structural similarity between frame j of the model and frame k of the query sequence. In order to obtain the warping, the vertical Sift flow is calculated between the frames of the model and of the query sequence. The process is illustrated in Fig. 7.

Figure 7. Illustration of the SDD computation process.

Histogram of Sift Flow Directions S2. The Histograms of Sift Flow Directions (HSFD) give information about the angles of transformation, which correspond to the directions of movement between frames, and are calculated using the horizontal flow. As shown in Fig. 8, for every pair of frames we obtain 4 histograms, describing the movement in every quarter of the flow field. The matrix S2 is defined as:

S2(j, k) = EMD(HSFD(Fm_j), HSFD(Fq_k)),

where Fm is the horizontal Sift flow field between frames of the model, Fq is the horizontal Sift flow field of the query sequence, and the Earth Mover's Distance (EMD) between the histograms is used as the similarity measure. Each element of S2(j, k) quantifies the similarity of the flow directions between frames j and j+1 of the model and frames k and k+1 of the query sequence.

Sift Flow Magnitude S3. To quantify the intestine deformation magnitude, the Sift Flow Magnitude (SFM) is used. It answers the question: how much does a frame have to be deformed to resemble the next one? Fig. 9 displays the SFM of a sequence example. The similarity in terms of SFM is defined as:

S3(j, k) = ||SFM(Fm_j) − SFM(Fq_k)||.

Thus, in terms of SFM, the flow between frames j and j+1 of the model and frames k and k+1 of the query sequence is similar when the magnitudes of the flows are alike.

3.2. Similarity calculation and Relevance feedback

For every pair of sequences (the model sequence and a query sequence), the three matrices S1, S2 and S3 are obtained, and the overall similarity between the frames of the sequences is defined as follows:

S(j, k) = w1 · S1(j, k) + w2 · S2(j, k) + w3 · S3(j, k),

where w1, w2 and w3 are the weights assigned to each descriptor, j = 1, ..., x, k = 1, ..., y, and x, y are the lengths of the model and the query sequence respectively. Initially, all weights are equal: w1 = w2 = w3 = 1. The similarity D between two sequences Ym and Yq is computed using the frame similarity matrix S(j, k): D(Ym, Yq) = DTW(S(j, k)). Each result d is normalized by the length of the path used in the DTW calculation; in this way, the results obtained by comparing the model with sequences of different sizes can be compared. The smaller the value of d, the more similar the query sequence is to the model. Relevance feedback is then applied to the initial sequence ranking. Expert intervention is necessary to flag the top sequences as {relevant, not relevant}. Based on this information, the algorithm learns the weights w1, w2 and w3, minimizing the distances to relevant sequences and maximizing the distances to non-relevant ones. The optimization is done using gradient descent, minimizing the following energy function with respect to w_i:

\sum_i \left( \sum_{q_j^+ \in Q^+} w_i \, D_i(Y_m, Y_{q_j^+}) - \sum_{q_k^- \in Q^-} w_i \, D_i(Y_m, Y_{q_k^-}) \right),

where Y_{q_j^+} are the sequences marked as relevant and Y_{q_k^-} are the sequences marked as non-relevant. With the new weight values, the algorithm recalculates the S(j, k) matrix and the similarity between the model and the query sequences (see Fig. 4).
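The weight-learning step can be sketched as follows: given per-descriptor distances D_i between the model and each labeled sequence, gradient steps push the combined distance down for relevant sequences and up for non-relevant ones. Since this energy is linear in the weights its gradient is constant, and the learning rate, iteration count, clipping to non-negative weights and final normalization are illustrative assumptions not specified in the paper:

```python
import numpy as np

def learn_weights(D_rel, D_irr, lr=0.01, n_iter=1000):
    """Gradient descent on w for sum_i w_i * (sum_rel D_i - sum_irr D_i).

    D_rel: (n_relevant, 3) per-descriptor distances to relevant sequences.
    D_irr: (n_irrelevant, 3) distances to non-relevant sequences.
    """
    w = np.ones(3)                                # initial weights w1=w2=w3=1
    grad = D_rel.sum(axis=0) - D_irr.sum(axis=0)  # energy is linear in w
    for _ in range(n_iter):
        w -= lr * grad
        w = np.clip(w, 0.0, None)  # keep weights non-negative (assumption)
    return w / w.sum()             # normalize for readability (assumption)

# Toy distances: descriptor 2 separates relevant from non-relevant best.
D_rel = np.array([[0.2, 0.1, 0.3],
                  [0.3, 0.1, 0.4]])
D_irr = np.array([[0.3, 0.9, 0.4],
                  [0.2, 0.8, 0.5]])
w = learn_weights(D_rel, D_irr)
print(w.round(2))  # largest weight goes to the most discriminative descriptor
```

The descriptor whose distances best separate relevant from non-relevant sequences receives the largest weight, mirroring the behavior the relevance feedback step is designed to produce.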

4. Results

The evaluation of the proposed approach is twofold: first, we validate the descriptors for sequence alignment and, second, we evaluate the functionality of the relevance feedback procedure. Two databases (DB1 and DB2) are used, containing


Figure 8. Computation of HSFD.

Index of the ranking of sequences Mean experts note

1 6.5

2 6.9

3 3.3

4 8.1

Table 1. Mean experts notes for the ranking of sequences.

222 and 247 contraction sequences, respectively. The sequences have variable lengths from 5 up to 25 frames, and each frame has a resolution of 256 × 256 pixels. In order to speed up the descriptor calculation, the images are resized to a resolution of 80 × 80 pixels. To represent the HSFD, a 17-bin histogram is used.

Descriptor validation for sequence alignment. In this experiment, we performed a validation of the descriptors for sequence alignment with the help of 2 experts. We computed 4 different rankings of sequences from DB1, using: (1) only the Sift Descriptors Difference, (2) only the Histograms of Sift Flow Directions, (3) only the Sift Flow Magnitude, (4) the combination of all three. Usually, a ranking is evaluated by displaying the obtained retrieval order to an expert, who evaluates it on a predefined scale (e.g. from 1 to 10) [3]. We presented the 15 most relevant sequences to the group of experts. They assigned values from 1 to 10 to each of the four sets of sequences, where 1 means that the sequences are not aligned and 10 means perfect alignment. The criteria for the sequence alignment evaluation were the following: (1) is the query sequence symmetrical/asymmetrical in the same way as the model, (2) are the lumen size changes similar, (3) do the images in the model and in the query sequences preserve a similar intestinal structure. The sequences are presented in Fig. 10 and Table 1 presents the mean results of the experts' answers. In all the mosaics presented below, the model sequence is displayed in the first row. As can be seen, the union of the three descriptors provides the best sequence alignment result, whereas applying only one descriptor gives poor results. As we expected, in the first ranking of sequences, where only the SDD is used, the similarity order is based on the structure of the images. In the second ranking, the directions of transformation are well preserved in all sequences. The third ranking, applying only the SFM difference, gives the worst result: the sequences seem to be unorganized.

Figure 9. Sift Flow Magnitude (SFM) of a sequence example.

Figure 10. Rankings of a set of sequences using as descriptors: (a) only the Sift Descriptors Difference, (b) only the Histograms of Sift Flow Directions, (c) only the Sift Flow Magnitude and (d) the combination of all three.

Relevance feedback validation. In this experiment, we validate the performance of the relevance feedback algorithm. Given a ranking of a set of sequences with respect to a model sequence, an expert was asked to assign a flag {relevant, not relevant} to each of the top 15 sequences of the ranking. Based on the expert labeling, the system learned the descriptor weights using the relevance feedback algorithm. Then, the set of reordered sequences was presented to the expert together with the initial set of sequences (Fig. 11). As can be seen in the image on the right, the majority of the sequences marked as relevant by the expert went up in the similarity ranking. Fig. 12 shows the difference between the mean values of the distances from the model sequence to the non-relevant sequences and to the relevant sequences. As can be seen, each iteration of the algorithm increases the distance to non-relevant sequences more than the distance to relevant ones. Moreover, it can be observed that after 1000 iterations of gradient descent,

Figure 11. (a) Initial ranking of the set of sequences and (b) reorganized set of sequences.

the algorithm stabilizes. Table 2 presents the weights assigned to the descriptors.

Descriptor:   SDD    HSFD   SFM
Weight:       0.33   0.54   0.11

Table 2. Descriptor weights calculated by relevance feedback.

The algorithm with the weights from Table 2 was applied to a new database (DB2) of contractions acquired during a different clinical study. The sequence alignment results are presented in Fig. 13. The mean note of the experts' evaluation for this sequence ranking was 8.9.

Figure 12. Mean difference between non-relevant and relevant sequences.

Figure 13. Alignment result for DB2.

5. Conclusions

In this paper, for the first time, an algorithm for aligning and measuring the similarity of endoluminal scene sequences is presented. We introduce three descriptors to characterize intestinal structure and deformation in consecutive frames: (1) Histograms of Sift Flow Directions to describe the flow course, (2) Sift Descriptors to represent the image structure and (3) Sift Flow Magnitude to quantify the deformation. To handle the different lengths of sequences, the Dynamic Time Warping algorithm is used. Moreover, a technique for weight learning based on an expert relevance feedback procedure is introduced to improve the performance of the method. The experimental results show that the proposed algorithm is able to create an endoluminal sequence ranking that fulfills the experts' expectations. The extracted data are feasible for further clinical, diagnostic and therapeutic applications. It is worth remarking that one of the main limitations of our algorithm is the time needed to calculate the Sift Flow between frames; reducing it would meaningfully speed up the alignment and ranking of endoluminal sequences. Another important step is to elaborate reliable validation techniques for intestinal sequence relevance ranking.

References

[1] T. Deselaers et al. Learning weighted distances for relevance feedback in image retrieval. In ICPR, 2008.
[2] M. Drozdzal, P. Radeva, et al. Towards detection of measurable contractions using WCE. In Proc. of the 4th CVC Workshop, pages 131–136, 2009.
[3] M. S. E. Horster et al. Unsupervised image ranking. In LS-MMRM, 2009.
[4] A. W. Fu, E. Keogh, et al. Scaling and time warping in time series querying. In Proc. of the 31st VLDB Conf., 2005.
[5] V. Hai, T. Echigo, et al. Adaptive control of video display for diagnostic assistance by analysis of capsule endoscopic images. In Proc. ICPR, pages 980–983, 2006.
[6] S. Hwang, J. Oh, et al. Blood detection in wireless capsule endoscopy using expectation maximization clustering. In Proc. of the SPIE, pages 577–587, 2006.
[7] G. Iddan, G. Meron, et al. Wireless capsule endoscopy. Nature, 405:417, 2000.
[8] L. Igual, S. Segui, et al. Eigenmotion-based detection of intestinal contractions. In Proc. CAIP, volume 4673, pages 293–300, 2007.
[9] M. W. Kadous. A general architecture for supervised classification of multivariate time series, 2009.
[10] E. Keogh, T. Palpanas, et al. Indexing large human motion databases. In Proc. of the 30th VLDB Conf., 2004.
[11] C. Liu, J. Yuen, et al. SIFT Flow: dense correspondence across different scenes. In ECCV, pages 28–42, 2008.
[12] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[13] M. Mackiewicz, J. Berens, et al. Colour and texture based gastrointestinal tissue discrimination. In Proc. of IEEE ICASSP, pages 597–600, 2006.
[14] F. Vilariño. A machine learning approach for intestinal motility assessment. PhD thesis, UAB, Barcelona, 2006.
[15] F. Vilariño, P. Spyridonos, et al. Cascade analysis for intestinal contraction detection. CARS, pages 9–10, 2006.
[16] H. Vu, T. Echigo, et al. Contraction detection in small bowel from an image sequence of wireless capsule endoscopy. In Proc. of MICCAI, pages 775–783, 2007.
[17] Y. Yagi et al. A diagnosis support system for capsule endoscopy. Inflammopharmacology, pages 78–83, 2007.
[18] X. S. Zhou et al. Relevance feedback in image retrieval: a comprehensive review. Multimedia Systems, 2003.

