
ConVeS: A Context Verification Framework for Object Recognition System

Mozaherul Hoque Abul Hasanat, Dhanesh Ramachandram, Mandava Rajeswari

Computer Vision Research Group
School of Computer Sciences
Universiti Sains Malaysia, Malaysia
Tel: (+604) 653 4393 / 653 4046 / 653 4641

[email protected], [email protected], [email protected]

ABSTRACT

Context is a vital element in both biological and synthetic vision systems, and it is essential for deriving a meaningful explanation of an image. Unfortunately, there is a lack of consensus in the computer vision community on what context is and how it should be represented. In this paper, context is defined generally as “any and all information that is not directly derived from the object of interest but helps in explaining it”. Furthermore, a description of context is provided in terms of its three major aspects, namely scope, source and type. As an application of context to improving object detection results, a Context Verification System (ConVeS) is proposed. ConVeS incorporates semantic and spatial context with an external knowledgebase to verify object detection results provided by state-of-the-art machine learning algorithms such as support vector machines or artificial neural networks. ConVeS is presented as a simple framework that can be effectively applied to a wide range of computer vision applications such as medical imagery, surveillance video, and natural imagery.

1. INTRODUCTION

The ability of humans to recognize objects is remarkably dynamic and robust. Human beings can recognize an object in a variety of poses, under varying illumination, and even under partial occlusion. A good reason behind this ability is the use of context in the visual perception process. In the real world, objects often co-appear with other objects and in particular environments, providing contextual associations to be learned by an intelligent vision system such as that of a human being [3][17], or perhaps by a synthetic vision system. These contextual associations prove useful for distinguishing visually similar objects, as exemplified in Figure 1. The importance of context in computer vision was emphasized by Moshe Bar, a neuroscientist, in [2]. He presented a detailed account of different aspects of context in the biological vision system and of how these can possibly be used in computer vision. Unfortunately, there is a lack of consensus among computer vision researchers on what context is and how it should be represented. A general definition of context is therefore important in order to effectively utilize this resource in computer vision applications. We attempt to address this issue in Section 2. The rest of the paper is organized as follows: Section 3 briefly reviews past research that used context in one form or another; Section 4 introduces ConVeS (Context Verification System), a system to verify object detection results provided by any machine learning algorithm; Section 5 discusses the advantages and implementation challenges of ConVeS; and Section 6 concludes the paper.

Figure 1: The two image regions in the setting sun scene (left) and the orange in the orange basket (right) are visually very similar, but can easily be distinguished as one representing a setting sun and the other representing an orange due to the context provided by their neighboring objects. (Photographs used with permission from FreeFoto.com)

2. WHAT IS CONTEXT?

The understanding of context in the area of computer vision is somewhat imprecise. Some researchers are of the opinion that context should not be defined at all and should instead be regarded as a primitive, as Hirst, who avoided giving any strict definition of context, explained in [13]:


“. . . context is what context does.”

Carneiro [7], He [12], and Ramstrom [25] assumed context to be statistical features of pixel groups. Kruppa et al. [18] described context as the spatial correlation between two image regions.


Oliva and Torralba [21] explained context as a global texture feature of the whole image. Papadopoulos [23] construed context as the relationship between objects, derived from an external knowledgebase based on the semantic labels of the objects in the image. Clearly, researchers have described context from restricted points of view, without attempting to provide a holistic view of what context means in computer vision or to explain the different types of contextual information that are available for use; the underlying problem is the lack of a common understanding or consensus on the meaning of context. So far, the most comprehensive definition of context is provided by Wolf and Bileschi in [28]. According to them, context is:

“. . . information relevant to the detection task but not directly due to the physical appearance of the object.”

But this definition restricts the utility of context to the object detection task alone, whereas context can be used to recognize and explain the image and its contents. The notion of context includes any and all information that can be derived from nearby objects or the whole scene. We therefore provide an improved definition of context:

“Context is any and all information that is not directly derived from the object of interest in the image but helps in explaining the object.”

Our definition of context implies that there exist several types of context based on the different sources of contextual information. Furthermore, any context has a scope within which its meaning is relevant. Hence, we describe context by its three important aspects: i) scope, ii) source and iii) type. Context derived from the neighboring areas has a local scope, whereas context derived from the whole image has a global scope. Local context sources include pixels, regions and labeled objects. Global context sources are the region encompassing the whole image and the background scene of the image. Based on these different sources, context can be categorized into several types: statistical feature context is computed on pixels, regions and the whole image; semantic, spatial, scale and orientation context is obtained from objects and the scene; and finally, knowledge-based context is derived from an external knowledgebase and metadata. Table 1 provides an overview of these three aspects and their relations. Interestingly, these different aspects of context can be combined with each other, and a synergy can be achieved in the generated contextual knowledge. For instance, semantic context can be linked with an external knowledgebase to explore the inter-object relationships that exist naturally among real-life objects. A review of the relevant research involving different types and sources of context is presented in the next section; an illustrative sketch of the taxonomy follows below.
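To make the taxonomy concrete, the three aspects can be expressed as a small data structure. The following Python sketch is purely illustrative; the enum members and the ContextCue record are our own naming, not part of any system described in this paper.

```python
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    LOCAL = "local"      # derived from neighboring areas
    GLOBAL = "global"    # derived from the whole image or scene

class Source(Enum):
    PIXEL = "pixel"
    REGION = "region"
    OBJECT = "object"
    SCENE = "scene"
    KNOWLEDGEBASE = "knowledgebase"
    METADATA = "metadata"

class ContextType(Enum):
    STATISTICAL = "statistical features"
    SEMANTIC = "semantic"
    SPATIAL = "spatial relations"
    SCALE = "scale"
    ORIENTATION = "orientation"
    KNOWLEDGE = "knowledge"

@dataclass
class ContextCue:
    ctx_type: ContextType
    scope: Scope
    source: Source
    value: object    # e.g. "car", "beside", an RGB histogram, ...

# A few cues mirroring the rows of Table 1:
cues = [
    ContextCue(ContextType.SEMANTIC, Scope.LOCAL, Source.OBJECT, "car"),
    ContextCue(ContextType.SPATIAL, Scope.LOCAL, Source.OBJECT, "beside"),
    ContextCue(ContextType.SEMANTIC, Scope.GLOBAL, Source.SCENE, "city"),
]
```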

3. RELATED WORKS

Existing context-based vision systems use the various types of context mentioned earlier for recognition tasks. Information derived from the statistical features of surrounding regions has been used by many researchers as context to determine the identity of an image segment; details of such systems can be found in [4], [10] and [25]. Kruppa et al. [18] used co-occurrence information of image regions to infer the location of an object; their system detects the human head to find the face region in the image. Pixel co-occurrence relationships have also been used by Millet et al. [19] to improve recognition results. However, such systems disregard the existence of any distant object that might have a strong influence in determining the identity of the object. Additionally, such systems do not require finding the identity of the neighboring objects and only seek correlations among nearby image regions to identify the object of interest. This leaves the system unable to discriminate between image patches of two different objects with similar visual attributes; for example, “sky” and “water” have very similar visual attributes but give different contextual cues. Systems using semantic context extracted from nearby regions do not suffer from this problem, but still lack knowledge about objects present at a distance. All these systems fail to utilize external knowledge sources, which can be a very good contextual cue, as exploited in [14].

Another approach to integrating context in a vision system is to derive semantic context from the scene of the image, as described by Oliva and Torralba in [22]. Murphy and Torralba [20][27] have shown that it is possible to consider the whole scene, instead of individual objects, as context. Such an approach does not require detection of surrounding object identities and thus avoids the difficulties associated with typical segmentation approaches. They suggested that learning statistical relations between objects can cause the detection of one object or scene to generate strong expectations about the probable presence and location of other objects. Although this scene-context approach is good for predicting the general content of the image or its domain, it does not explore the inter-class relationships among objects within the domain. For example, the likelihood of finding a car in an image increases if wheels can be detected; the relation between the wheel and the car can only be derived through an external knowledgebase.

A handful of researchers have attempted to incorporate knowledgebase-derived context in various manners. Strat and Fischler [26] encoded contextual knowledge as a set of control rules and used it to interpret and verify detection results. An interesting way to incorporate contextual knowledge is to structure the knowledge itself in a generic (not image-specific) manner that represents the real-world scenario. This can be achieved through a graph structure that encodes semantic relationships among objects. Aslandogan et al. [1] and Rabinovich et al. [24] used WordNet [9] and GoogleSets¹ respectively to create an ontology or ontology-like tree structure. But due to the undirected and acyclic nature of a tree structure, it is not able to model the dependency relations among the members of a domain that exist in real-life natural imagery [2]. Im and Cho [15] used a Bayesian-network-based model of objects and their locations to infer the probability of a location given a set of detected objects. The model did not incorporate any domain information or inter-object dependency relations. Furthermore, most previous works that employ any form of knowledge structure did not consider spatial relationships. Spatial relationships have proved to be an important cue in verifying the identity of an object both in the primate vision system [11] and in computer vision [6]. The influence of a specific domain on the likelihood of each object has also been largely ignored.

We introduce ConVeS to incorporate high-level semantic context and spatial relationships to verify object detection results generated by any typical machine learning algorithm, such as a support vector machine [16] or an artificial neural network [5]. Given an external knowledgebase and a set of detected objects, we seek to answer the question: “is this really the object that the machine learning algorithm thinks it to be?” Unlike the system proposed by Im, our system incorporates domain information by linking with an external knowledgebase, and is thus able to reason about object dependencies that exist in the real world. A detailed description of our proposed system is presented in the following section.

¹ http://labs.google.com/sets

Table 1: Relationship of various context types, scopes and sources

Type                 | Scope  | Source                   | Example
---------------------+--------+--------------------------+---------------------------
Statistical Features | Local  | Pixel                    | RGB values
                     | Local  | Region                   | Mean, median, circularity
                     | Global | Region (whole image)     | Histogram
Semantic             | Local  | Object                   | “car”, “building”
                     | Global | Scene                    | “city”, “beach”, “indoor”
Spatial Relations    | Local  | Object, Region           | “beside”, “above”
                     | Global | Scene                    | “top”, “bottom”
Scale                | Local  | Object, Region           |
Orientation          | Local  | Object, Region           |
Knowledge            | Local  | Knowledgebase, Metadata  |
                     | Global | Knowledgebase, Metadata  |

4. CONVES - CONTEXT VERIFICATION SYSTEM

4.1 Overview

The proposed system utilizes high-level semantic context derived from the image and verifies it against a knowledgebase to improve the agreement between each label and the segment to which the label is assigned. The system incorporates domain knowledge, object co-occurrence and spatial relationships for this purpose. The components of the proposed framework are illustrated in Figure 2.

Figure 2: Proposed Context Verification System - ConVeS demonstrating a hypothetical verification task. (Photographs used with permission from FreeFoto.com)
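Read end to end, the framework amounts to the short pipeline below. This is only a sketch of Figure 2; the helper functions (neighboring_segments, context_map, update_labels) are our own hypothetical names, defined in the sketches accompanying Sections 4.3 to 4.6.

```python
def conves_pipeline(label_map, segments, kbs, threshold=0.5):
    """ConVeS end to end: classifier output in, verified labels out.

    label_map -- 2-D array of segment ids (the segmentation)
    segments  -- classifier output, one record per segment (Section 4.2)
    kbs       -- one or more knowledgebases (Section 4.5)
    """
    nbrs = neighboring_segments(label_map)                 # Section 4.3
    by_id = {s.segment_id: s for s in segments}
    verified = {}
    for seg in segments:
        nbr_segs = [by_id[i] for i in nbrs[seg.segment_id]]
        ctx = context_map(seg, nbr_segs)                   # Section 4.4
        verified[seg.segment_id] = update_labels(seg, ctx, kbs, threshold)
    return verified                                        # Sections 4.5-4.6
```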

4.2 Input

In an object recognition system, a classifier typically generates labeled image segments along with corresponding detection scores or probabilities for each label. Our system accepts the output of such a classifier as its input. For example, given an image, the object detection algorithm detects a set of segments S = {s_1, s_2, ..., s_n}, where n is the total number of segments. Based on the algorithm's belief about the identity of each segment, every segment s_i is assigned a set of labels L = {l_1, l_2, ..., l_m}, where m is the number of labels per segment, along with the corresponding probabilities P(s_i l_j), where j = 1, ..., m.
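As a concrete reading of this input format, a segment record might look as follows. This is a minimal sketch; the Segment class and its field names are our assumptions, not an interface defined by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One image segment s_i with its candidate labels and probabilities."""
    segment_id: int
    pixels: list                                  # (y, x) coordinates of the segment
    labels: list = field(default_factory=list)    # L = [l_1, ..., l_m]
    probs: list = field(default_factory=list)     # P(s_i l_j) for j = 1..m

# Hypothetical classifier output for the scene of Figure 1:
segments = [
    Segment(1, pixels=[(10, 12), (10, 13)], labels=["sun", "orange"], probs=[0.55, 0.45]),
    Segment(2, pixels=[(2, 5), (2, 6)],     labels=["sky", "water"],  probs=[0.60, 0.40]),
]
```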

4.3 Identify neighboring objects

The objects adjacent to each segment are identified and listed along with their label sets and corresponding probabilities. Adjacency is determined using 8-connectivity. These neighboring objects form the semantic context for the object in question. The set of neighboring objects is denoted S^nbr = {s^nbr_1, s^nbr_2, ..., s^nbr_{n-1}}, where n is the total number of segments in the image.
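A sketch of this step, assuming the segmentation is available as a 2-D label map in which each pixel stores its segment id. The 8-connectivity rule is implemented directly; this is our own illustration, not the authors' code.

```python
import numpy as np

def neighboring_segments(label_map: np.ndarray) -> dict:
    """For every segment id in a 2-D label map, collect the ids of the
    segments that touch it under 8-connectivity."""
    h, w = label_map.shape
    neighbors = {int(s): set() for s in np.unique(label_map)}
    # Compare each pixel with its E, S, SE and SW neighbors; the remaining
    # four directions are covered by recording each pair symmetrically.
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        y0, y1 = max(dy, 0), h + min(dy, 0)
        x0, x1 = max(dx, 0), w + min(dx, 0)
        a = label_map[y0 - dy:y1 - dy, x0 - dx:x1 - dx]
        b = label_map[y0:y1, x0:x1]
        mask = a != b
        for u, v in zip(a[mask].ravel(), b[mask].ravel()):
            neighbors[int(u)].add(int(v))
            neighbors[int(v)].add(int(u))
    return neighbors
```

For the Segment records sketched in Section 4.2, neighbors[s.segment_id] then yields the index set behind S^nbr for segment s.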

4.4 Generate context map with spatial relations

The list of neighboring objects is augmented with spatial context information; we call this augmented list the context map. The neighboring objects and their spatial relations are considered as the context for the image segment in question. The spatial relationship between the segment and a neighboring object is determined from the position of the object relative to the segment. Figure 3 provides a simple schema of the possible spatial relationships.

Figure 3: Positions of adjacent objects and their spatial relations with respect to object S
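One plausible reading of this step in code, using centroid displacement to discretize the relation. The relation vocabulary ("above", "below", "beside") follows the examples in Table 1; the exact discretization is our assumption, since the paper defers the full schema to Figure 3.

```python
import numpy as np

def spatial_relation(seg_pixels, nbr_pixels) -> str:
    """Coarse spatial relation of a neighbor with respect to a segment,
    derived from the displacement between region centroids."""
    sy, sx = np.mean(seg_pixels, axis=0)
    ny, nx = np.mean(nbr_pixels, axis=0)
    dy, dx = ny - sy, nx - sx
    if abs(dy) >= abs(dx):           # predominantly vertical offset
        return "above" if dy < 0 else "below"   # image y grows downward
    return "beside"                  # predominantly horizontal offset

def context_map(segment, neighbor_segments):
    """Augment the neighbor list with spatial relations: the context map."""
    return [
        {"labels": n.labels,
         "probs": n.probs,
         "relation": spatial_relation(segment.pixels, n.pixels)}
        for n in neighbor_segments
    ]
```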

4.5 Verify contextual agreement

After creating the context map for each segment, we verify each object label based on the contextual agreement of the object with its neighbors and the scene, checked against an expert-constructed knowledgebase. Here we propose a knowledge structure based on a Bayesian network to model the real-world relationships among objects in any domain. The proposed knowledgebase models the domain membership and spatial relationships among objects that exist in natural imagery. A Bayesian network has the advantage of being a directed graph structure, as opposed to the tree-structured ontology used in [24], which allows us to model dependency relationships.

To verify an object label, the probability score of the candidate segment is iteratively updated by presenting the probabilities of all other neighboring segments to the knowledgebase as observed evidence. According to Bayes' rule, the probability of image segment s_i being labeled l_j, using the notation introduced in Sections 4.2 and 4.3, is:

P(s_i l_j | s^nbr_1 l_j, s^nbr_2 l_j, ..., s^nbr_{n-1} l_j) =
    P(s_i l_j) P(s^nbr_1 l_j, s^nbr_2 l_j, ..., s^nbr_{n-1} l_j | s_i l_j) / P(s^nbr_1 l_j, s^nbr_2 l_j, ..., s^nbr_{n-1} l_j)

4.6 Update segment labels

Segment labels verified through multiple knowledgebases with probability exceeding a predefined threshold are selected as final candidates. The process is illustrated in Figure 4.

Figure 4: Illustration of updating segment labels using a hypothetical knowledgebase and probabilities. (Photographs used with permission from FreeFoto.com)
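A minimal sketch of the verification and update steps, reusing the Segment records and context map from the previous sketches. We approximate the Bayesian network with a naive-Bayes factorization (neighbor evidence assumed conditionally independent given the candidate label), and we stub the knowledgebase as a lookup table of P(neighbor label, relation | label); both simplifications are ours, not the paper's.

```python
def contextual_score(segment, ctx, kb):
    """Posterior over the candidate labels of a segment, renormalized over
    that segment's own label set (a naive-Bayes reading of Section 4.5)."""
    scores = {}
    for label, prior in zip(segment.labels, segment.probs):
        likelihood = 1.0
        for entry in ctx:
            # Take the neighbor's most probable label, together with the
            # spatial relation, as the observed evidence.
            best = entry["probs"].index(max(entry["probs"]))
            nbr_label = entry["labels"][best]
            likelihood *= kb.get((nbr_label, entry["relation"], label), 1e-6)
        scores[label] = prior * likelihood
    total = sum(scores.values()) or 1.0
    return {l: s / total for l, s in scores.items()}

def update_labels(segment, ctx, kbs, threshold=0.5):
    """Section 4.6: keep only labels whose verified probability exceeds the
    predefined threshold in every available knowledgebase."""
    return [label for label in segment.labels
            if all(contextual_score(segment, ctx, kb)[label] > threshold
                   for kb in kbs)]

# Hypothetical knowledgebase entries, P(neighbor label, relation | label):
kb = {("sky", "above", "sun"): 0.90,
      ("sky", "above", "orange"): 0.05,
      ("basket", "below", "orange"): 0.80}
```

With the Figure 1 example, a "sky" neighbor above the candidate region pushes the posterior toward "sun" and away from "orange", which is exactly the disambiguation the introduction motivates.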

5. DISCUSSION

ConVeS integrates semantic context and a knowledgebase to verify the object detection results of an image. The system is designed to be generic: given an appropriate knowledgebase, ConVeS can be applied to natural imagery, medical images, surveillance videos or any other problem domain in computer vision. A difficulty in implementing ConVeS may come from the construction of a proper knowledgebase. Ideally, the underlying Bayesian network model of a knowledgebase should encode all the object dependencies and the related conditional probabilities; in a real-world scenario this requires a considerable amount of time and expertise. Fortunately, as Charniak suggests in [8], a Bayesian network does not require all conditional probabilities to be specified in order to achieve reasonably good accuracy. Domain experts can subjectively determine the probabilities based on a representative set of examples.



6. CONCLUSION

Understanding the meaning and the different aspects of context is vital in order to fully utilize this rich resource in computer vision tasks. In this paper we provided a functional definition of context, with a brief discussion of its aspects. The proposed ConVeS (Context Verification System) incorporates contextual information into a generic object recognition system. ConVeS illustrates that semantic context and spatial relationships can be effectively combined with an external knowledgebase to improve the agreement between a label and the image segment to which it is assigned. It can be generalized to any image and video domain, such as natural imagery, medical imaging, and surveillance. To this end, we further propose the construction of a knowledgebase that models the real-world relationships that exist among objects. We believe ConVeS will make vision applications more reliable and applicable to wider fields where accuracy is vital for success.

7. ACKNOWLEDGMENTS

This research has been supported by the Ministry of Higher Education, Government of Malaysia under the Fundamental Research Grant Scheme.

8. REFERENCES

[1] Y. Aslandogan, C. Thier, C. Yu, J. Zou, and N. Rishe. Using semantic contents and WordNet in image retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 286–295, 1997.
[2] M. Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8):617–629, 2004.
[3] M. Bar and S. Ullman. Spatial context in recognition. Perception, 25(3):343–352, 1996.
[4] N. Bergboer, E. Postma, and H. van den Herik. Context-based object detection in still images. Image and Vision Computing, 24(9):987–1000, Sep. 2006.
[5] K. Boehnke, M. Otesteanu, P. Roebrock, W. Winkler, and W. Neddermeyer. Neural network based object recognition using color block matching. In Proceedings of the Fourth IASTED International Conference on Signal Processing, Pattern Recognition, and Applications, pages 122–125. ACTA Press, Anaheim, CA, USA, 2007.
[6] P. Carbonetto, N. de Freitas, and K. Barnard. A statistical model for general contextual object recognition. In Computer Vision - ECCV 2004, pages 350–362, 2004.
[7] G. Carneiro and A. Jepson. Pruning local feature correspondences using shape context. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), 3:16–19, 23-26 Aug. 2004.
[8] E. Charniak. Bayesian networks without tears. AI Magazine, 12(4):50–63, 1991.
[9] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[10] H. Feng and T.-S. Chua. A learning-based approach for annotating large on-line image collection. In Proceedings of the 10th International Multimedia Modelling Conference, pages 249–256, 5-7 Jan. 2004.
[11] I. Gauthier and M. Tarr. Unraveling mechanisms for expert object recognition: Bridging brain activity and behavior. Journal of Experimental Psychology: Human Perception and Performance, 28(2):431–446, 2002.
[12] X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2, pages II-695–II-702, 27 June-2 July 2004.
[13] G. Hirst. Context as a spurious concept. In Proceedings, Conference on Intelligent Text Processing and Computational Linguistics, pages 273–287, Mexico City, 2000.
[14] A. Hoogs and R. Collins. Object boundary detection in images using a semantic ontology. In IEEE CVPR Workshops (CVPRW), page 111, 2006.
[15] S.-B. Im and S.-B. Cho. Context-based scene recognition using Bayesian networks with scale-invariant feature transform. In Advanced Concepts for Intelligent Vision Systems, volume 4179 of Lecture Notes in Computer Science, pages 1080–1087. Springer Berlin / Heidelberg, 2006.
[16] W. Kienzle, G. Bakir, M. Franz, and B. Schölkopf. Efficient approximations for support vector machines in object detection. In Proc. DAGM'04, pages 54–61, 2005.
[17] A. Kleinschmidt, C. Büchel, S. Zeki, and R. S. J. Frackowiak. Human brain activity during spontaneously reversing perception of ambiguous figures. Proceedings of the Royal Society B: Biological Sciences, 265(1413):2427–2427, Dec. 1998.
[18] H. Kruppa and B. Schiele. Using local context to improve face detection. In Proceedings of the British Machine Vision Conference (BMVC), Norwich, England, pages 3–12, 2003.
[19] C. Millet, I. Bloch, P. Hede, and P. Moellic. Using relative spatial relationships to improve individual region recognition. In Proc. 2nd European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, pages 119–126, 2005.
[20] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: A graphical model relating features, objects and scenes. In Advances in Neural Information Processing Systems, 16, 2003.
[21] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[22] A. Oliva and A. Torralba. The role of context in object recognition. Trends in Cognitive Sciences, 11(12):520–527, Dec. 2007.
[23] G. Papadopoulos, P. Mylonas, V. Mezaris, Y. Avrithis, and I. Kompatsiaris. Knowledge-assisted image analysis based on context and spatial optimization. International Journal on Semantic Web and Information Systems, 2(3):17–36, July-September 2006.
[24] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 14-21 Oct. 2007.
[25] O. Ramstrom and H. Christensen. Object detection using background context. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), 3:45–48, 23-26 Aug. 2004.
[26] T. Strat and M. Fischler. Context-based vision: Recognition of natural scenes. In Twenty-Third Asilomar Conference on Signals, Systems and Computers, volume 1, pages 532–536, 1989.
[27] A. Torralba. Contextual priming for object detection. International Journal of Computer Vision, 53(2):169–191, 2003.
[28] L. Wolf and S. Bileschi. A critical view of context. International Journal of Computer Vision, 69(2):251–261, Aug. 2006.

