A semi-interactive panorama based 3D reconstruction framework for indoor scenes

June 23, 2017 | Author: Marcel Worring | Category: Cognitive Science, Computer Vision, 3-D Reconstruction, Field of View


Computer Vision and Image Understanding 115 (2011) 1516–1524

Contents lists available at SciVerse ScienceDirect

Computer Vision and Image Understanding journal homepage: www.elsevier.com/locate/cviu

A semi-interactive panorama based 3D reconstruction framework for indoor scenes

Trung Kien Dang a,⇑, Marcel Worring a, The Duy Bui b

a Intelligent Systems Lab Amsterdam, Informatics Institute, University of Amsterdam, The Netherlands
b Human Machine Interaction Laboratory, University of Engineering and Technology, Vietnam National University, Hanoi, Viet Nam

Article info

Article history:
Received 20 May 2010
Accepted 13 July 2011
Available online 27 July 2011

Keywords: 3D reconstruction, Panorama, Interactive

Abstract

We present a semi-interactive method for 3D reconstruction specialized for indoor scenes which combines computer vision techniques with efficient interaction. We use panoramas, popularly used for visualization of indoor scenes, but clearly not able to show depth, for their great field of view, as the starting point. Exploiting user defined knowledge, in terms of a rough sketch of orthogonality and parallelism in scenes, we design smart interaction techniques to semi-automatically reconstruct a scene from coarse to fine level. The framework is flexible and efficient. Users can build a coarse walls-and-floor textured model in five mouse clicks, or a detailed model showing all furniture in a couple of minutes of interaction. We show results of reconstruction on four different scenes. The accuracy of the reconstructed models is quite high, around 1% error at full room scale. Thus, our framework is a good choice for applications requiring accuracy as well as applications requiring a 3D impression of the scene. © 2011 Elsevier Inc. All rights reserved.

1. Introduction

A realistic 3D model of a scene and the objects it contains is ideal for applications such as giving an impression of a room in a house for sale, reconstruction of bullet trajectories in crime scene investigation, or building realistic settings for virtual training [1]. It gives good spatial perception and enables functionalities such as measurement, manipulation, and annotation. One broad categorization of scenes is outdoor versus indoor. Outdoor scenes have been popular in many modeling applications [2,3], especially creating models of urban scenes [4,5]. Indoor scenes are prevalent in applications like real estate management, home decoration, or crime scene investigation (CSI), but research on them is limited with some notable exceptions [6–8].

In this paper we consider the 3D reconstruction of indoor scenes. While in applications like real estate management, a coarse model of a room is sufficient, other applications need more complete models. For instance, in CSI the model should be complete and show all the details in the crime scene as any object is potentially evidence. Each application also requires a different level of accuracy. Home decoration, for example, does not need extreme accuracy, for its purpose is merely to give an impression of the scene. For the CSI application, the model should be as accurate as possible in order to make measurements and hypothesis validation reliable. Here we are seeking a framework that can create complete and accurate models in highly demanding applications such as CSI, as well as coarse models for less demanding applications.

⇑ Corresponding author. Address: 42A – 144/4, Quan Nhan, Thanh Xuan, Hanoi, Viet Nam. Fax: +84 437547460. E-mail address: [email protected] (T.K. Dang).
1077-3142/$ - see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.cviu.2011.07.001

3D models are often built manually from measurements and images using the background map technique. Modelers take images of the object from orthogonal views (top, side and front), and try to create a model matching those images. A measurement is required to scale the model of the object to the right size. Modeling from measurements and images is only suitable for simple scenes, as complex scenes with many objects require a lot of measurements, images, and interaction. Even with measurements, accurately modeling objects is difficult since the assumption that the line of view is orthogonal to the object is hard to meet in practice.

Since manual reconstruction is cumbersome and time consuming [9], automatic or semi-interactive reconstruction is preferred. Automatic methods do exist and have shown good results for isolated objects and outdoor scenes [10,3,11–13]. Those methods require a camera moving around and looking towards the scene to capture it from multiple viewpoints [14–17]. Such moves maintain a large difference between viewpoints, giving accurately estimated 3D coordinates [18]. Unfortunately, in practice people tend not to follow such moves, making these methods inaccurate and unreliable. Indeed, in the well-known PhotoSynth system it has been observed that quality suffers when users do not follow the appropriate moves [12]. In simple cases, when modeling single-object scenes, automatic methods give results of 2–5% relative error [19]. This is sufficient for visualization, but not accurate enough for measurements such as in CSI applications. In indoor scenes where the space is limited, the situation is even worse as it is difficult, if not impossible, to perform the capturing moves suitable for automatic reconstruction. So, automatic reconstruction methods in their current state are not sufficient for accurate indoor scene reconstruction.

T.K. Dang et al. / Computer Vision and Image Understanding 115 (2011) 1516–1524

Semi-interactive methods are potential solutions [20–22]. A small amount of interaction helping computers in identifying important features makes reconstruction more reliable. A few mouse clicks are enough to build a coarse model [8]. Recent work, such as the VideoTrace system [13], shows that interaction can be made smart and efficient by exploiting automatically estimated geometric information.

While interaction helps to efficiently improve the reliability, there is still the problem of having limited space to move around in indoor scenes. Using panoramas is a potential solution. Panoramas give a broad field of view, so a few panoramas are enough to completely capture a scene, and moving around the scene is no longer a problem. Furthermore, building panoramas is reliable, thus using panoramas contributes to the reliability of the overall solution. The advantages of interaction on the one hand and panoramas on the other suggest that a combination of them would be a good solution for indoor scene reconstruction.

Following the above observations, we propose a multi-stage, semi-interactive, panorama based framework for indoor scenes. In the first stage, a coarse model is built. This stage extends the technique in [8]. We make the interaction more efficient by providing a smart interaction technique, and rectify panoramas to guarantee that the accuracy meets our targeted quality. Furthermore, we give a reconstructability analysis and, based on that, present a capture assistant to guide the placement of the camera. Results of the first stage, a coarse model and geometric constraints, facilitate efficient interaction to build a detailed model in the second stage. This framework overcomes the problems mentioned and makes it easier to create accurate and complete models.

In the next section we summarize related work. Section 3 gives an overview of our framework. Section 4 describes how to turn panoramas into a floor-plan and how to build a coarse 3D model.
Section 5 describes the interaction to add details to the coarse model. Then we evaluate the accuracy and show how efﬁcient the framework is. We close the paper with a discussion on how to further automate the framework.

2. Related work

2.1. Reconstruction from panoramas

A panorama is a wide-angle image, typically generated by stitching images from the same viewpoint [23]. Since panoramas cover a wide view, they must be mapped onto a cylinder or sphere to view. Accordingly, they are called cylindric or spherical panoramas. Being wide-angle, panoramas give a good overview of a scene, especially in indoor scenes where the field of view is limited. On the other hand, they do not give a good spatial perception since the viewpoint is fixed at one point. There is work on creating panoramas using multiple viewpoints, called multi-perspective panoramas [7,24,25]. However, multi-perspective panoramas only yield a 3D impression from the original viewpoints. Other methods are needed to make real 3D models.

3D reconstruction from panoramas is found in [6,7,22]. In [6], a scene is modeled from geometric primitives, which are manually selected in panoramas of the scene. Reconstruction is done separately for each panorama, and then results of different panoramas are merged together. In [7] a dense 3D point cloud is estimated from multi-perspective panoramas. It, however, requires a special rig for capturing the panoramas. In [22] a method to do reconstruction from a cylindric panorama is proposed. It assumes that the scene, e.g. a room, is composed of a set of connected rectangles. This method requires that all corners of the room are visible, which is not often the case in practice. In [8], a method to reconstruct an indoor scene from normal single-perspective panoramas is described. The result is a coarse 3D model including walls onto which panoramas are projected. Such a model is not sufficient for some applications such as CSI, but this simple and flexible method gives good intermediate results towards building a detailed model.

2.2. Interaction in reconstruction

There are many types of interaction in reconstruction. In the simplest case users define geometric primitives, such as points, lines, or pyramids and match these to the image data [20]. In [21], quadric surfaces are used to support more complex objects. VideoTrace [13] lets users draw and correct vertices of a model in an image sequence.

The efficiency of interaction can be improved by exploiting what is already known about the scene. The guiding principle is to get as many geometric constraints as possible, and use them to assist interaction. These constraints can come from domain knowledge, the user interacting with the model, or automatic estimation by the system. We now briefly describe each of them.

Domain knowledge in the form of prior knowledge about the type of scenes to be reconstructed is helpful in designing efficient interaction. For example, when modeling man-made scenes we can assume that there are many parallel lines. Thus, vanishing points are helpful in constraining the interaction [6,26,27]. In urban scenes there are often repeated components such as windows. Hence instead of modeling them separately, the user can copy them [21]. In a man-made scene, objects are stacked on each other, e.g. a table is on the floor and books are on the table. We can exploit these relations to reduce the interaction and improve accuracy [9].

Scene specific geometric constraints can be provided by users. In [9], users define how an object should be bound to another one, to reduce the degrees of freedom in the interaction to reconstruct that object. In [8], after roughly defining a room by a sketch, users can build a coarse model with a few mouse clicks.

Some geometric constraints can be reliably estimated by computers. In some cases, coarse 3D structure and camera motion information can be estimated. State-of-the-art interactive reconstruction systems including [13,12] take advantage of such information sources to create intuitive and efficient interaction. For example, in the VideoTrace system [13], vertices drawn in one frame by the user are tracked and rendered in other frames by the system. Users browse forward or backward in the video sequence to correct those vertices until satisfied. For the user it is like refining a model rather than creating it from scratch.

In practice, those three sources of constraints are often mixed in the modeling flow, which is also what we will do in this paper.

3. Framework overview

Our framework is an A-to-Z solution, from capturing an indoor scene to modeling it, which is summarized in Fig. 2. The framework takes as input a sketch of the floor-plan, a top-down design drawing of a room (e.g. Fig. 1a) that describes its walls and their relative positions, drawn by the user. The capture planning module analyzes the sketch to tell the user how many panoramas are needed to completely capture the scene, and suggests camera placement, i.e. the appropriate viewpoints. Either calibrated or uncalibrated cameras can be used, but to guarantee good accuracy, we advise pre-calibrating the camera and correcting the lens distortion of the images before stitching them into panoramas. Users can use a software package of their own choice to estimate the camera motion and stitch corrected images together into panoramas, for example using Hugin¹ (Fig. 1b).

To build a coarse model of a room, the user picks the corners, i.e. the intersections of walls, in the panoramas. The framework provides a smart corner picking method to make the interaction comfortable. The location of the corners on the panoramas and the sketch are enough to estimate the correct floor-plan and build a coarse model of the scene [8] (Fig. 1c). More expressively, we call this coarse model, which includes textured walls and floor, a walls-and-floor model. A typical rectangular room needs only one panorama to build such a model, whereas irregular rooms may need more than one panorama depending on the shape of the room and the viewpoints of the panoramas. This stage is discussed in detail in Section 4.

In order to add more detail efficiently, we exploit the geometric constraint resulting from the observation that indoor scenes contain many flat objects aligned to walls. We iteratively use known surfaces to guide an interaction type that we call perspective extrusion to add objects. This technique helps to quickly build a detailed model (Fig. 1d). Details of this stage are given in Section 5.

Fig. 1. Illustration of input and (intermediate) results of the reconstruction process. A simple rectangular room is used as example. (a) A rectangular room. (b) (Unwrapped) panorama of the room. (c) The walls-and-floor model. (d) Adding more detail to the model.

4. Building a walls-and-floor model

In this section we discuss methods for building a walls-and-floor model. For easier comprehension, we present the floor-plan estimation and other elements prior to the capture planning. For the moment, we assume that the set of panoramas given is sufficient for floor-plan estimation.

We let the user draw a sketch of the floor-plan indicating orthogonality and parallelism of walls, and use a method built upon the method in [8] to estimate an accurate floor-plan. This method is based on the observation that the horizontal dimension of the panoramic image is proportional to the horizontal view angle of the panorama. Thus a set of corners divides the panorama into horizontal view angles of known ratio. If we ensure that every panorama looks all around a room, the total horizontal view angle is 360 degrees without any measurement. Hence we know each horizontal view angle. This observation is valid when the corners are perfectly aligned to the vertical dimension. Thus, to make a more accurate floor-plan estimation than in [8], we rectify the panoramas to meet that condition first.

Building 360-degree panoramas is well studied [23], thus we do not discuss it here. For the next step, indicating corners in panoramas, we provide smart corner picking. Rectifying panoramas and estimating the floor-plan are subsequently discussed below. Then we present the reconstructability analysis and the capture assistant.

¹ http://hugin.sourceforge.net

4.1. Smart corner picking

In order to estimate the floor-plan, coordinates of the top-down projections of corners are needed. As panoramas may not be well aligned, getting one point on a corner is not enough. Instead we need to identify a corner by a line segment. One way to do that is to ask a user to manually draw a line onto a panorama. To make it even simpler, we provide a utility that lets users just casually pick a point in a panorama, after which the system automatically identifies the corner line.

Since the straightness of lines is not preserved in the coordinate system of a panorama, here a cylindric one, we must project a user picked point into one of the images from which the panorama is created, to work in the image coordinate system. We assume that the best image is the one whose image plane is most orthogonal to the projection ray of the picked point. In other words, the angle between the ray from the viewpoint to the image center and the projection ray r_c of the picked point is smallest:

$$ i_f = \arg\min_i \angle\bigl(r(i), r_c\bigr) \tag{1} $$

where r(i) is the principal ray of image i. Since panoramas are usually approximately aligned, we limit the detection to a vertical image band around the picked point. We detect vertical edges around that point, and fit a line through the picked point and edge points using RANSAC [28]. The picked point is used here as an anchor to avoid the auto-detected line moving to a wrong location. Since the picked point is not exactly at the right position, we afterwards relax the condition, optimizing the line without constraining it to go through the picked point to yield the final line. The process is summarized in Table 1 and two examples are given in Fig. 3.
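As an illustration of the selection rule in Eq. (1), the following minimal sketch picks the source image whose principal ray makes the smallest angle with the ray of the picked point. The function names and the helper `cylinder_ray`, which converts a pixel column of a full 360-degree cylindrical panorama into a viewing direction, are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def angle_between(u, v):
    """Angle in radians between two 3D direction vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def best_image(principal_rays, picked_ray):
    """Index i_f of the image whose principal ray r(i) is closest in angle to r_c."""
    angles = [angle_between(r, picked_ray) for r in principal_rays]
    return int(np.argmin(angles))

def cylinder_ray(column, width):
    """Hypothetical helper: horizontal viewing direction of a pixel column
    in a 360-degree cylindrical panorama that is `width` pixels wide."""
    theta = 2.0 * np.pi * column / width
    return np.array([np.cos(theta), np.sin(theta), 0.0])

# Four source images looking at 0, 90, 180 and 270 degrees (assumed setup).
principal_rays = [cylinder_ray(c, 3600) for c in (0, 900, 1800, 2700)]
picked = cylinder_ray(1000, 3600)  # the user clicked at 100 degrees
chosen = best_image(principal_rays, picked)
```

In this toy setup the click at 100 degrees is closest to the image looking at 90 degrees, so that image would be used for edge detection.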


Fig. 2. Overview of the proposed framework.

4.2. Rectifying panoramas

To accurately estimate the floor-plan, we first rectify the panoramas so that corners are aligned to the vertical dimension for a cylindrical panorama. Each corner together with the viewpoint defines a plane. These planes remain unchanged no matter how we move the coordinate system, since they are defined by the scene and viewpoint. To align the panorama cylinder we need to find the rotation R that makes those planes parallel to the vertical direction. In other words, after transforming by R, the normals of the planes are orthogonal to w = (0, 0, 1)^T, i.e.

$$ u_i^T R^{-1} w = 0 \tag{2} $$

where u_i are the planes' normals. Using this constraint, given at least three corners, we can compute the last column of R^{-1}, or equivalently the last row of R, by finding the least-squares solution. If the last row of R is r_3 = (a, b, c), then from the constraint that R is orthogonal, we choose its other rows as:

$$ r_1 \cong (-b, a, 0), \qquad r_2 \cong (-ac, -bc, a^2 + b^2) \tag{3} $$

where ≅ means equal up to a scale, and |r_1| = |r_2| = |r_3| = 1. Having computed R, we resample the panoramic image to finish the rectification.
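The computation of R in Section 4.2 can be sketched as follows. This is a minimal sketch under stated assumptions: the least-squares solution for r_3 is taken as the right singular vector of the stacked normals with the smallest singular value, the sign of r_3 is chosen so that it points upwards, and the signs of r_1 and r_2 are chosen so that R is a proper rotation. The function name and the synthetic test data are illustrative only.

```python
import numpy as np

def rectifying_rotation(normals):
    """Rotation R whose last row r3 is (in the least-squares sense)
    orthogonal to all corner-plane normals u_i, as required by Eq. (2)."""
    U = np.asarray(normals, dtype=float)
    # Minimise |U r3| subject to |r3| = 1: last right singular vector.
    _, _, vt = np.linalg.svd(U)
    r3 = vt[-1]
    if r3[2] < 0:            # assumed convention: keep the vertical axis pointing up
        r3 = -r3
    a, b, c = r3
    if a * a + b * b < 1e-12:
        return np.eye(3)     # panorama is already aligned
    # Rows orthogonal to r3; this sign choice gives det(R) = +1.
    r1 = np.array([-b, a, 0.0])
    r2 = np.array([-a * c, -b * c, a * a + b * b])
    r1 /= np.linalg.norm(r1)
    r2 /= np.linalg.norm(r2)
    return np.vstack([r1, r2, r3])

# Synthetic check: normals of vertical planes, tilted by 5 degrees about the x axis.
tilt = np.deg2rad(5.0)
T = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(tilt), -np.sin(tilt)],
              [0.0, np.sin(tilt), np.cos(tilt)]])
horizontal = np.array([[np.cos(t), np.sin(t), 0.0] for t in (0.3, 1.2, 2.5, 4.0)])
normals = horizontal @ T.T
R = rectifying_rotation(normals)
rectified = normals @ R.T   # normals after transforming by R
```

After the transformation, the third component of every normal is zero up to rounding, which means the corner planes are vertical again.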

Table 1
Smart corner picking process.
1. Let the user pick a point in/near a corner from the panorama
2. Find the best image, according to Eq. (1)
3. Perform Canny edge detection in a vertical band, one tenth of the image width wide, around the picked point
4. Fit a line through the picked point and the edges using RANSAC, where the line must go through the picked point
5. Optimize the line without constraining it to the picked point
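Steps 4 and 5 of Table 1 might be sketched as below. The sketch assumes that edge detection has already produced a list of 2D edge points; the function name, the inlier threshold, the iteration count and the fixed random seed are assumptions for illustration. The anchored RANSAC step hypothesises lines through the picked point and one random edge point, and the final step refits a free line to the inliers by total least squares.

```python
import numpy as np

def fit_corner_line(edge_points, picked, threshold=1.5, iterations=200, seed=0):
    """Return (point_on_line, unit_direction) of the corner line."""
    pts = np.asarray(edge_points, dtype=float)
    picked = np.asarray(picked, dtype=float)
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iterations):
        # Step 4: candidate line through the picked point and one edge point.
        sample = pts[rng.integers(len(pts))]
        direction = sample - picked
        norm = np.linalg.norm(direction)
        if norm < 1e-9:
            continue
        direction /= norm
        normal = np.array([-direction[1], direction[0]])
        distances = np.abs((pts - picked) @ normal)
        inliers = distances < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None:
        raise ValueError("no edge point distinct from the picked point")
    # Step 5: refit without the anchor (total least squares on the inliers).
    chosen = pts[best_inliers]
    centroid = chosen.mean(axis=0)
    _, _, vt = np.linalg.svd(chosen - centroid)
    return centroid, vt[0]

# Toy data: the wanted corner edge at x = 100 and a longer edge at x = 120.
wall_edge = [(100.0, float(y)) for y in range(51)]
other_edge = [(120.0, float(y)) for y in range(81)]
point, direction = fit_corner_line(wall_edge + other_edge, picked=(101.0, 25.0))
```

Because every candidate line must pass through the picked point, the longer edge at x = 120 cannot attract the fit, which mirrors the role of the anchor described above.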

Fig. 3. Two examples of smart corner picking. (a) The user picks a point. (b) Edges are detected in a vertical image band; a line is fitted through the picked point and edges. Note that there is another (even longer) vertical line but the algorithm smartly takes the edge close to the picked point. (c) The final result.

4.3. Estimating the floor-plan

The locations of corners in panoramas, identified in the previous step, give sets of horizontal angles between the corners when viewed from the panorama viewpoint. If we have a way to represent those angles in terms of coordinates of projections of corners and viewpoints in the floor-plan, we have a set of constraints to estimate the floor-plan and the viewpoints. Here we briefly review such a method presented in [8], discuss its applicability, and show how we extend it for our work.

A sketch is a model of the floor-plan. We force users to draw rectilinear lines parallel to the axes by providing them with a drawing grid. Of course, this alignment can be done automatically, but drawing in such a way helps users to correctly define parallelism and orthogonality. Note that, as only parallelism and orthogonality are important in the parameterization, a sketch of a rectangular room can be any arbitrary rectangle.

Assuming that the room has n corners, we need at most 2n parameters to represent it. A viewpoint, whose coordinates both have to be estimated, is represented by a pair of separate parameters. Suppose that we have v panoramas, then the total number of parameters is 2n + 2v. For each wall drawn in the sketch that is parallel to an axis, since the two corners of a wall share a horizontal or vertical coordinate, the number of parameters is reduced by one (Fig. 4a). Hence the number of parameters is reduced by the number of those walls, m. To further reduce the number of parameters, the origin of the coordinate system is set at one corner, and the length of a wall is set to one, as the reconstruction is up to a scale anyway. These settings reduce the number of parameters by 3. In summary, the number of parameters to be estimated is:

$$ 2n + 2v - m - 3 \tag{4} $$

From the model of the ﬂoor-plan that contains the coordinates of corners and viewpoints, we can estimate the angle between two corners as seen from a viewpoint (Fig. 4b). These angles are equal to the set of angles deﬁned by user-picked corners in the panoramas. This set of constraints can be used to estimate the parameters of the ﬂoor-plan model and the viewpoints.
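The two sides of these angle constraints, and the parameter count of Eq. (4), can be illustrated with a short sketch. It assumes a full 360-degree, rectified panorama in which pixel columns map linearly to horizontal angle; the function names are chosen for this example only, and at least two corners are assumed.

```python
import math

def angles_from_panorama(corner_columns, width):
    """Horizontal view angles (radians) between consecutive picked corners,
    for a 360-degree panorama that is `width` pixels wide."""
    cols = sorted(corner_columns)
    k = len(cols)
    return [((cols[(j + 1) % k] - cols[j]) % width) * 2.0 * math.pi / width
            for j in range(k)]

def angles_from_floor_plan(corners, viewpoint):
    """Angles between consecutive corners as seen from a viewpoint in the plan."""
    vx, vy = viewpoint
    bearings = sorted(math.atan2(y - vy, x - vx) % (2.0 * math.pi)
                      for x, y in corners)
    k = len(bearings)
    return [(bearings[(j + 1) % k] - bearings[j]) % (2.0 * math.pi)
            for j in range(k)]

def num_parameters(n, v, m):
    """Eq. (4): n corners, v viewpoints, m axis-parallel walls."""
    return 2 * n + 2 * v - m - 3

# A square room seen from its centre: every wall spans 90 degrees.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
model_angles = angles_from_floor_plan(square, (0.5, 0.5))
image_angles = angles_from_panorama([450, 1350, 2250, 3150], 3600)
unknowns = num_parameters(n=4, v=1, m=4)
```

An estimator would adjust the floor-plan and viewpoint parameters until `model_angles` matches `image_angles`; for the rectangular room with one panorama only three unknowns remain (the aspect ratio and the two viewpoint coordinates).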


Fig. 4. Parameterization of the floor-plan model given a sketch, simplified from Fig. 2 in [8]. (a) To reduce the number of parameters, corners are represented by shared parameters. (b) Each viewpoint is parameterized separately. Locations of corners in a panorama at the viewpoint give a set of angles between corners as viewed from the viewpoint.

At this point, the coordinates of top-down projections of viewpoints are estimated, but the viewpoints' heights are missing. Complete viewpoint coordinates are required to add more details to the model in the later stage. Since we already know the floor and the projection of the viewpoint on the floor, we only need one point to compute the relative distance from the viewpoint to the floor. To get that point, we ask the user to pick any floor point in each panorama to compute its viewpoint height.

4.4. Reconstructability analysis

We now give an analysis of the floor-plan estimation method. To estimate the floor-plan and the viewpoint coordinates, the number of constraints must be greater than or equal to the number of unknowns given in Eq. (4) of the previous sub-section. Suppose that viewpoint i sees c_i corners; since the sum of the angles is 360 degrees, we have c_i − 1 independent constraints. Since the viewpoints are different, constraints of one viewpoint are independent of constraints of other viewpoints. The problem is solvable when the number of constraints is greater than or equal to the number of parameters:

$$ \sum_{i=1}^{v} c_i \ge 2n + 3v - m - 3 \tag{5} $$

Common rooms have all walls parallel to an axis, i.e. the floor-plan is a rectilinear polygon, thus m is equal to n. Eq. (5) then simplifies to:

$$ \sum_{i=1}^{v} c_i \ge n + 3v - 3 \tag{6} $$

Suppose that we can find a point from which all corners are visible, i.e. c_i = n; Eq. (6) is then further simplified to v ≥ 1. So indeed, given a rectilinear floor-plan, one panorama that sees all corners might be enough to estimate it. A special, yet the most common, case is a rectangular room. Since we see all four corners from any viewpoint, one panorama might be enough to reconstruct the walls-and-floor model. We need more panoramas when the floor-plan is not a rectilinear polygon, and when from the chosen viewpoint we cannot see all corners. Fig. 5 shows examples.

Fig. 5. When the floor-plan is not rectilinear (a), or if from the viewpoint we cannot see all corners (b), we may need more than one panorama to estimate it.

4.5. The capture assistant

The capture assistant helps users in planning viewpoints in the room so that the reconstruction is possible and the model covers all of the room. To that end, it must know the number of unknowns given a sketch, the number of constraints produced by viewpoints and the area they cover. Furthermore, it is preferred that the number of viewpoints is minimal.

The number of unknowns is computed easily using Eqs. (5) and (6). In a convex polygon, a line segment from any point within it to any of its vertices does not leave the polygon. Hence if the floor-plan is convex, counting the constraints is trivial since from any viewpoint we see all the corners. When the floor-plan is concave, the problem is nontrivial. Since we keep the sketching simple, only asking users to align rectilinear lines of the sketch parallel to axes, the sketch may be stretched unevenly along the axes. Our solution is to decompose the sketch into tiles and compute the minimal number of observable corners from each tile, invariant to how it is stretched along axes. The algorithm is described in Algorithm 1.

Algorithm 1. Decomposing a sketch into invariant observable areas
Step 1: Cut the sketch into tiles using all distinct x and y coordinates. The sketch is turned into a set of rectangles and triangles (Fig. 6a), each of which is called a tile (Fig. 6b).
Step 2: For each tile, find its invariant observable area (IOA) by the following steps:
– Initialize the area to contain only the tile itself.
– Iteratively add a tile if it, together with some tiles already added, forms a convex polygon containing the initial tile.

Lemma 4.1. If the sketch differs from the real floor plan by an uneven scaling, the IOAs are invariant to that scaling.

Proof. Since the sketch differs from the real floor plan by an uneven scaling, the coordinates of corners are transformed by a monotonic function, thus the order between any pair of x or y coordinates is preserved. That means that if x_a > x_b in the floor-plan, or in one sketch, it still holds in another sketch. Consequently, the order of tiles, as decomposed in the algorithm above, is horizontally and vertically unchanged in any sketch. Consequently the IOAs, sets of tiles built following Step 2 in Algorithm 1, are unchanged. □

Lemma 4.2. Any point in an IOA is observable from any point in the initial tile.

Proof. Any point is observable from another point within a convex polygon. Since the extending scheme only adds a new tile if it is part of a convex polygon with the initial tile, all points in the IOA are observable from any point in the initial tile. □
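The counting condition of Eq. (5) and the decomposition of Algorithm 1 can be illustrated for the common rectilinear case. The sketch below rests on two assumptions that are not stated in the paper: the sketch is a rectilinear polygon, so all tiles are grid rectangles, and a convex union of such tiles is itself a rectangle, so the IOA of a tile is taken as the union of all tile rectangles that lie inside the sketch and contain that tile. All function names are illustrative.

```python
from itertools import product

def point_in_polygon(px, py, polygon):
    """Even-odd ray casting test for a point against a simple polygon."""
    inside = False
    for k in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % len(polygon)]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def decompose_into_tiles(polygon):
    """Step 1 for a rectilinear sketch: grid cells (i, j) lying inside it."""
    xs = sorted({x for x, _ in polygon})
    ys = sorted({y for _, y in polygon})
    tiles = set()
    for i, j in product(range(len(xs) - 1), range(len(ys) - 1)):
        cx, cy = (xs[i] + xs[i + 1]) / 2, (ys[j] + ys[j + 1]) / 2
        if point_in_polygon(cx, cy, polygon):
            tiles.add((i, j))
    return tiles

def invariant_observable_area(tile, tiles):
    """Step 2 under the rectangle assumption described in the lead-in."""
    i0, j0 = tile
    max_i = max(i for i, _ in tiles)
    max_j = max(j for _, j in tiles)
    area = set()
    for i_lo, i_hi in product(range(i0 + 1), range(i0, max_i + 1)):
        for j_lo, j_hi in product(range(j0 + 1), range(j0, max_j + 1)):
            rect = set(product(range(i_lo, i_hi + 1), range(j_lo, j_hi + 1)))
            if rect <= tiles:
                area |= rect
    return area

def is_reconstructable(corners_seen, n, m):
    """Eq. (5): corners_seen[i] = c_i for each of the v viewpoints."""
    v = len(corners_seen)
    return sum(corners_seen) >= 2 * n + 3 * v - m - 3

# An L-shaped room and an unevenly stretched version of the same sketch.
l_room = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
stretched = [(3 * x, 0.5 * y) for x, y in l_room]
tiles = decompose_into_tiles(l_room)
ioa_corner = invariant_observable_area((0, 0), tiles)
ioa_arm = invariant_observable_area((0, 1), tiles)
```

The stretched sketch yields the same tile indices, in line with Lemma 4.1, and only the tile at the inner corner of the L is guaranteed to observe the whole room.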

Fig. 6. Illustration of the sketch decomposition algorithm. (a) The sketch is cut into rectangles and triangles using all distinct x and y coordinates. (b) The tile graph indicates possibilities of traveling among tiles. (c) For each tile the initial observable area is itself (black); then tiles reached by traveling parallel to axes are iteratively added (gray); finally tiles reached from two ways are added (diagonal pattern). (d) The number of corners contained in the observable area is the minimal number of observable corners from the tile.

Having IOAs, we check whether the planned viewpoints cover the whole room and provide enough constraints to estimate the real floor-plan. The IOA of a viewpoint is the IOA of the tile containing it. By checking whether the union of the planned viewpoints' IOAs contains every tile, we can make sure that the set of viewpoints covers the whole scene. Checking whether the floor-plan is solvable is done by summing the number of corners observed by each IOA, and then comparing it to the condition in (5).

Given the IOAs of a sketch, finding an optimal set of viewpoints, i.e. the smallest number of viewpoints that covers the scene completely and satisfies the reconstructability condition (5), is a hard problem. Let us construct a graph representing the problem. Each tile is a node in the graph. For each tile, we have edges connecting it to all tiles in its IOA. Since, if a tile is observable from another one, then from it we can also observe the other tile, the edges are undirected. Putting aside the reconstructability condition, our problem is finding the minimal set of nodes from which we have edges connecting to the rest of the nodes. This is the minimal dominating set problem, one of the known NP-complete problems [29]. With an additional condition, our problem is arguably of the same complexity. To suggest a solution to users in interactive time, we propose the following greedy algorithm, Algorithm 2.

Algorithm 2. Suggesting viewpoints, the greedy algorithm
Step 1. Find a dominating set. Initialize an empty dominating set of tiles. While the scene is not covered by the union of the IOAs of tiles in the set, add a tile whose IOA contains the most uncovered tiles.
Step 2. Satisfy the reconstructability condition. While the condition of (5) is not satisfied, add a tile whose IOA contains the most corners, i.e. provides the largest number of constraints.

In practice, since there are objects in the room, we might not be able to put the camera at the suggested positions, or see all the corners we should see according to the analysis. Should an object, e.g. a tall wardrobe, completely block one or more corners, it must be considered as part of the walls. The procedure to suggest viewpoints is the same. If a suggested tile is inappropriate for placing the camera, users can mark it so that Algorithm 2 can ignore that tile when recomputing the suggested viewpoints. This procedure has proven to give good results in practical cases.

Viewpoints also affect the accuracy of the floor-plan and the texture quality. In practice, since the panorama is built from high resolution images, the texture quality should not be a problem. To estimate the floor plan accurately, intuitively one should place the camera in the center of the room to balance the constraints.

After this stage, we have a textured walls-and-floor model. In this model, objects are projected on the walls and on the floor. It gives a good overview of the scene. As indicated, for applications such as real estate management it should be satisfactory. However, for an application such as CSI, the object localization is not detailed enough. Thus, we need the second stage to add more detail.

5. Adding details using perspective extrusion

The model now contains planes of walls, the floor, and viewpoint locations. We design interactive methods to add detail to the model in the spirit of the whole framework: flexibly reconstructing objects from coarse to fine. For example, a table is reconstructed first and then the stack of books on it. Characteristics of indoor scenes are utilized in designing interaction methods meeting that idea. In indoor scenes, many objects are composed of planes. Since objects are often aligned to walls, those planes are likely parallel to at least one wall or the floor. As indicated earlier, this gives a constraint to reconstruct objects. This action is similar to an extrusion, a popular standard technique in manual 3D modeling. In a normal extrusion, the orthogonal projection of the object's boundary on a reference plane is orthogonally popped up by a known distance, creating a new planar object surface. In our situation we do not see the object in orthogonal views, but from a panorama viewpoint. So, instead of moving the object's boundary on lines orthogonal to the reference plane, we move it on rays from the viewpoint to their original locations in the reference plane (Fig. 1d). Because of this constraint, we call it a perspective extrusion. Our aim is to reconstruct an object surface S that is parallel to an already reconstructed plane (Fig. 7). S is reconstructed from a set of three parameters. The reference plane l is a reconstructed plane to which the plane of S is parallel. The distance from S to l is denoted by d; and b is a projection of the boundary of S in a panorama. The reconstruction procedure includes shifting the parallel plane l by distance d to get the object plane p, and cutting p by the pyramid of b and the viewpoint from which we see b. Once we have S, users can choose whether the object is a solid box or just a planar surface. The perspective extrusion process is summarized in Table 2.
In related work such as [9], object parameters are defined indirectly in terms of geometric objects, e.g. a rectangular box. In pictures of indoor scenes, objects are frequently occluded, making the use of geometric objects difficult. To give more options in reconstructing an object, we let users define those parameters directly and separately. For example, a box is defined by one of its faces and the distance to the plane the face is parallel to. The distance can be defined by a line orthogonal to any reconstructed plane. The parallel plane l is picked from the current model. We provide two ways to define d, namely using one or two viewpoints.
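For the two-viewpoint case, d follows from a triangulated point and the reference plane. The sketch below is one possible way to compute it, assuming the panoramas are calibrated so that each picked projection yields a viewing ray in scene coordinates. The midpoint method used here is a simple stand-in, not necessarily the triangulation used in the framework, and all function and parameter names are hypothetical.

```python
import numpy as np


def triangulate_midpoint(c1, r1, c2, r2):
    """Midpoint of the shortest segment between two viewing rays.

    c1, c2 -- viewpoint positions; r1, r2 -- ray directions (need not be unit).
    Assumes the two rays are not parallel.
    """
    c1, r1, c2, r2 = (np.asarray(v, dtype=float) for v in (c1, r1, c2, r2))
    # Least-squares solution of c1 + t1*r1 = c2 + t2*r2 for (t1, t2).
    A = np.stack([r1, -r2], axis=1)  # 3x2 system matrix
    t, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1 = c1 + t[0] * r1  # closest point on the first ray
    p2 = c2 + t[1] * r2  # closest point on the second ray
    return 0.5 * (p1 + p2)


def distance_to_plane(point, normal, offset):
    """Distance from a point to the plane {x : normal . x = offset}."""
    point = np.asarray(point, dtype=float)
    normal = np.asarray(normal, dtype=float)
    return abs(normal @ point - offset) / np.linalg.norm(normal)


if __name__ == "__main__":
    # Reference plane: the floor z = 0. Two viewpoints at 1.5 units height
    # observe the same point on a table top (illustrative numbers).
    p = triangulate_midpoint([0, 0, 1.5], [1, 1, -0.75],
                             [2, 0, 1.5], [-1, 1, -0.75])
    d = distance_to_plane(p, normal=[0, 0, 1], offset=0.0)
    print(p, d)
```

With noisy picks the two rays do not intersect exactly, which is why the sketch takes the midpoint of their closest points instead of solving for an exact intersection.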


Table 2
Perspective extrusion process.
1. The user picks the reference plane l.
2. The user defines the distance d from l to the object plane p, from either one or two viewpoints.
3. Compute the object plane p by shifting l by d.
4. The user defines the boundary through its projection b onto a panorama.
5. Compute the initial S by cutting the object plane p by the pyramid of b and the panorama viewpoint.
6. The user chooses the object type, either a solid box or a planar surface.

Fig. 7. A perspective extrusion pops up an object from an already reconstructed plane.
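Steps 3 and 5 of Table 2 reduce to intersecting viewing rays with a shifted plane. The following sketch illustrates this under stated assumptions: the reference plane l is given as {x : n·x = offset} with n pointing toward the side on which the object lies, and the drawn boundary b has already been converted to ray directions from the panorama viewpoint. The function name, the argument layout, and the way the box base is derived are illustrative choices, not taken from the framework.

```python
import numpy as np


def perspective_extrusion(viewpoint, boundary_rays, normal, offset, d):
    """Return the object surface S and an assumed box base.

    viewpoint     -- panorama centre, shape (3,)
    boundary_rays -- ray directions through the vertices of b, shape (k, 3)
    normal/offset -- reference plane l as {x : normal . x = offset}
    d             -- distance from l to the object plane p
    """
    c = np.asarray(viewpoint, dtype=float)
    rays = np.asarray(boundary_rays, dtype=float)
    n = np.asarray(normal, dtype=float)
    scale = np.linalg.norm(n)
    n, offset = n / scale, offset / scale  # unit normal, so d is a true distance

    # Step 3: object plane p is l shifted by d along n.
    target = offset + d

    # Step 5: cut p with the pyramid spanned by the viewpoint and b,
    # i.e. intersect every boundary ray c + t * ray with p.
    denom = rays @ n
    if np.any(np.isclose(denom, 0.0)):
        raise ValueError("a boundary ray is parallel to the object plane")
    t = (target - c @ n) / denom
    if np.any(t <= 0):
        raise ValueError("object plane lies behind the viewpoint for some ray")
    surface = c + t[:, None] * rays

    # Assumed box construction: drop the surface orthogonally onto l.
    base = surface - d * n
    return surface, base


if __name__ == "__main__":
    # A table top 0.75 units above the floor z = 0, seen from (0, 0, 1.5).
    corners = np.array([[1, 1, 0.75], [2, 1, 0.75], [2, 2, 0.75], [1, 2, 0.75]])
    vp = np.array([0.0, 0.0, 1.5])
    surface, base = perspective_extrusion(vp, corners - vp, [0, 0, 1], 0.0, 0.75)
    print(surface)
    print(base)
```

Because only ray directions enter the computation, the result does not depend on how far along each ray the user's click is placed, which matches the idea of moving the boundary along rays from the viewpoint.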

To define d from a single viewpoint, the user draws a line from the object surface orthogonally to a reconstructed plane. To define d from two viewpoints, the user picks the projections of a point on the object surface in two panoramas. We then triangulate these two projections to estimate the 3D coordinates of that point; its distance to l, which is already reconstructed, is the distance d. This strategy is useful when there is no physical clue to guide the drawing of a line from the object's surface orthogonally to a reconstructed plane. For example, for a chair with bent legs standing in the middle of the room, there would be no physical clue for drawing d from a single viewpoint.

The boundary b is a polygon drawn by users from the viewpoint. To assist the drawing of b, we assume by default that the boundary of S has right angles and is symmetric, as long as the drawing of b does not break this assumption. Using those assumptions, we predict the boundary and render it. This helps to define b accurately, especially when a vertex is occluded. For flexibility and accuracy, we let users define any parameter (l, d, or b) from any available panorama viewpoint. A possible way to further increase flexibility and accuracy is to let users adjust the boundary b from different viewpoints as in VideoTrace [13]. However, that is only effective if we have many viewpoints, i.e. many observations of the boundary. To keep the framework simple and the number of input panoramas small, we have decided not to use that technique.

To be reconstructible, objects must be seen and the parameters for perspective extrusion must be definable. The capture assistant described in Section 4.5 handles part of this by ensuring that all of the floor and walls will be seen. Of course objects can be occluded completely by other objects, but that is rarely the case for the main objects in the scene. For l and b, if objects are complex or curvy, we can only approximate them (Fig. 11c and d). For a "floating" object, like the chair in Fig. 10a, since there is no solid connection from its surface to another surface, one should use two viewpoints to define d. In general, if an object has a sufficiently different appearance in two panoramas, then it is reconstructible.

6. Results

We now present results showing that the proposed framework overcomes difficulties in indoor scene reconstruction to efficiently produce complete and accurate models.

6.1. Datasets

Four scenes are used in our evaluation (Fig. 8). Three are rooms in a house, captured by ourselves. The last one is a fake crime scene captured by The Netherlands Forensic Institute. The ground truth is defined by measurements made on objects in the scenes. All scenes are typical indoor scenes: rather complex, with limited space. For every scene, the minimal number of panoramas required, as computed using our capture assistant, is one. Because of obstacles (furniture) there was no good position for capturing all corners, so we had to use two panoramas for each of the three rooms. For the fake crime scene, we use one panorama.

Table 3
Floor-plan relative errors (in percent, mean ± standard deviation). To achieve the best accuracy, lens distortion correction should be applied before panorama stitching, and panorama rectification (Section 4.2) should be used. The floor-plan error of the fake crime scene is not available because ground truth is lacking.

              Without rectification   Uncalibrated images   Calibrated & rectification
Bedroom       0.48 ± 1.45             0.49 ± 0.16           0.38 ± 0.14
Dining room   7.50 ± 3.20             7.48 ± 3.17           1.18 ± 0.49
Kitchen       9.88 ± 3.24             0.48 ± 0.23           0.28 ± 0.05

Fig. 8. Evaluated scenes, their sketches, and number of panoramas used: (a) bedroom, 2 panoramas; (b) dining room, 2 panoramas; (c) kitchen, 2 panoramas; (d) fake crime scene, 1 panorama.

Table 4
Average object errors (mean ± standard deviation).

                   Absolute (cm)   Relative (%)
Bedroom            2.4 ± 1.9       2.38 ± 1.41
Dining room        1.6 ± 1.2       1.84 ± 2.00
Kitchen            1.1 ± 1.0       1.17 ± 1.06
Fake crime scene   6.2 ± 2.6       1.84 ± 0.89

Fig. 9. Resulting models as a function of the time and amount of interaction spent, for the fake crime scene: (a) walls-and-floor model, 0 min, 6 mouse clicks; (b) all-furniture model, 5 min, 10 extrusions; (c) final model and (d) final textured model, 10 min, 19 extrusions.

6.2. Accuracy

Since the reconstructed model is determined only up to a scale and a rotation, we have to eliminate that ambiguity in order to evaluate the accuracy. To do so we estimate a transformation from the estimated floor-plan to the ground truth floor-plan. We apply this transformation to the model, and then evaluate the model at two levels: at room scale (i.e. floor-plan error), and at object scale (i.e. object measurements).

Table 3 shows floor-plan errors with and without rectifying panoramas. In two out of three datasets the improvement is quite significant. In one dataset, the Bedroom, the error without rectification is almost the same as with rectification, since the angles of the original panoramas were almost perfect. Using uncalibrated images (calibration done during stitching) is possible, though the results are not as good as using pre-calibrated images. The errors, with pre-calibrated images and panorama rectification, are a few centimeters in a room of about ten square meters. The relative errors, computed by dividing the absolute error by the length of the diagonal of the rectangular bounding box of the true floor-plan, are about 1%. The estimated floor-plan of the dining room is less accurate since it was hard to identify some of its corners in the panoramas.

Our accuracy is higher than in [8], where the error is about 4%. Two differences are responsible for the improvement: the floor-plan estimation strategy we used, and our panorama rectification. In [8], a sketch of several rooms is used to parameterize and estimate the floor-plan of multiple rooms. It was noted that doing so, and thus ignoring the thickness of walls, might reduce the accuracy [8]. To achieve high accuracy, we have estimated the floor-plan of each room separately. More importantly, our rectification eliminates the inaccurate alignment in the input panoramas (see Table 4).

For objects, since the angles between geometric primitives (lines and planes) are already enforced during the reconstruction, we only evaluate the length errors, absolute and relative to the ground truth lengths. The accuracy of our framework is quite high, e.g. compared to [8,19]. Object accuracy is slightly lower than scene accuracy in terms of relative error, but our examination shows that the absolute errors are about the same.

6.3. Efficiency and completeness

Our framework is efficient. A scene can be modeled in about a dozen minutes. Fig. 9 shows the model of a rather complex scene, namely the fake crime scene. The walls-and-floor model is built in seconds. All furniture is modeled in about 5 min. The time taken to build the final model, which includes small objects such as cups on tables, is 10 min. Furthermore, users do not need to measure objects for modeling at capture time. Fig. 10 shows models of some scenes built using our framework. Close-ups of objects picked from reconstructed models are given in Fig. 11. Objects composed of planar surfaces are well reconstructed, while complex curvy objects can only be approximated using perspective extrusions.

Fig. 10. Models reconstructed using the proposed framework: (a) bedroom, (b) dining room, (c) kitchen.

Fig. 11. Models of objects picked from the models in Figs. 9 and 10: (a) stove, (b) table, (c) couch, (d) fake body. It takes less than a minute to model an object. Objects composed of planar surfaces (the stove and the table) are well reconstructed using our method, while complex objects like a fake body are hard to approximate using perspective extrusions alone.

7. Conclusion

We have proposed a panorama-based semi-interactive 3D reconstruction framework for indoor scenes. The framework overcomes the problem of limited field of view in indoor scenes and has the desired properties: robustness, efficiency, and accuracy. Those properties make it suitable for a broad range of applications, from a coarse model created in a few seconds for a presentation to a detailed model for measurement in crime scene investigation. Models inexpensively created using our framework are an intuitive medium to manage and retrieve digitized information about scenes and to use it in interactive applications.

A limitation of the framework is that it lacks the ability to model complex objects. This could be counteracted by other, more expensive techniques. For example, the VideoTrace technique [13] lets users model objects from video sequences. The ortho-image technique [30] creates background maps from image sequences to assist artists in modeling objects in 3D authoring software. As objects are complex, both techniques require images from many different angles and more interaction. Since our panoramic images are calibrated, we can integrate those techniques into our framework as plugins. Once an object is reconstructed using those techniques, we can automatically integrate it back into our model by matching the panoramic images to the image sequence used to model the object and then estimating the pose of the object. Thus the framework is a useful tool both for quickly building coarse models and for efficiently building accurate models. In the accompanying video the system is demonstrated on a number of realistic scenes.

Acknowledgments

This work is supported by the BSIK project MultimediaN and the Research Grant from Vietnam National University, Hanoi No. QG.10.23.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.cviu.2011.07.001.

References

[1] T.L.J. Howard, A.D. Murta, S. Gibson, Virtual environments for scene of crime reconstruction and analysis, in: SPIE – Visual Data Exploration and Analysis VII, vol. 3960, 2000, pp. 1–8.
[2] M. Pollefeys, L.J.V. Gool, M. Vergauwen, K. Cornelis, F. Verbiest, J. Tops, Image-based 3D acquisition of archaeological heritage and applications, in: Virtual Reality, Archeology, and Cultural Heritage, 2001, pp. 255–262.
[3] N. Snavely, S.M. Seitz, R. Szeliski, Modeling the world from internet photo collections, International Journal of Computer Vision 80 (2) (2008) 189–210.
[4] M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.J. Kim, P. Merrell, C. Salmi, S.N. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, Detailed real-time urban 3D reconstruction from video, International Journal of Computer Vision 78 (2–3) (2008) 143–167.
[5] N. Cornelis, B. Leibe, K. Cornelis, L.V. Gool, 3D urban scene modeling integrating recognition and reconstruction, International Journal of Computer Vision 78 (2–3) (2008) 121–141.
[6] H.-Y. Shum, M. Han, R. Szeliski, Interactive construction of 3D models from panoramic mosaics, in: Computer Vision and Pattern Recognition, 1998, pp. 427–433.
[7] Y. Li, H.-Y. Shum, C.-K. Tang, R. Szeliski, Stereo reconstruction from multiperspective panoramas, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (1) (2004) 45–62.
[8] D. Farin, W. Effelsberg, P.H.N. de With, Floor-plan reconstruction from panoramic images, in: ACM Multimedia, 2007, pp. 823–826.
[9] S. Gibson, R.J. Hubbold, J. Cook, T.L.J. Howard, Interactive reconstruction of virtual environments from video sequences, Computers & Graphics 27 (2) (2003) 293–301.
[10] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, R. Koch, Visual modeling with a hand-held camera, International Journal of Computer Vision 59 (2004) 207–232.
[11] M. Chandraker, S. Agarwal, F. Kahl, D. Nistér, D. Kriegman, Autocalibration via rank-constrained estimation of the absolute quadric, in: IEEE Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[12] S.N. Sinha, D. Steedly, R. Szeliski, M. Agrawala, M. Pollefeys, Interactive 3D architectural modeling from unordered photo collections, ACM Transactions on Graphics 27 (5) (2008) 159.
[13] A. van den Hengel, A. Dick, T. Thormählen, B. Ward, P.H.S. Torr, VideoTrace: rapid interactive scene modelling from video, ACM Transactions on Graphics 26 (3) (2007) 86.
[14] A. Fitzgibbon, A. Zisserman, Automatic 3D model acquisition and generation of new images from video sequences, in: European Signal Processing Conference, 1998, pp. 1261–1269.
[15] M. Pollefeys, R. Koch, L. Van Gool, Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters, in: IEEE International Conference on Computer Vision, 1998, pp. 90–95.
[16] M. Pollefeys, F. Verbiest, L. Van Gool, Surviving dominant planes in uncalibrated structure and motion recovery, in: European Conference on Computer Vision, 2002, pp. 837–851.
[17] J. Repko, M. Pollefeys, 3D model from extended uncalibrated video sequences: addressing key-frame selection and projective drift, in: International Conference on 3-D Digital Imaging and Modeling, 2005, pp. 150–157.
[18] R.I. Hartley, P. Sturm, Triangulation, Computer Vision and Image Understanding 68 (1998) 146–157.
[19] M. Pollefeys, R. Koch, L. Van Gool, Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters, International Journal of Computer Vision 32 (1999) 7–25.
[20] P.E. Debevec, C.J. Taylor, J. Malik, Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach, in: SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, 1996, pp. 11–20.
[21] S. El-Hakim, E. Whiting, L. Gonzo, 3D modeling with reusable and integrated building blocks, in: The 7th Conference on Optical 3-D Measurement Techniques, 2005, pp. 3–5.
[22] R. Haeusler, R. Klette, F. Huang, Monocular 3D reconstruction of objects based on cylindrical panoramas, in: 3rd Pacific Rim Symposium on Advances in Image and Video Technology, 2008, pp. 60–70.
[23] R. Szeliski, Image alignment and stitching: a tutorial, Foundations and Trends in Computer Graphics and Vision 2 (1) (2006) 1.
[24] Z. Zhu, A.R. Hanson, LAMP: 3D layered, adaptive-resolution, and multiperspective panorama – a new scene representation, Computer Vision and Image Understanding 96 (3) (2004) 294–326.
[25] W. Wei, G. Hui, Z. Maojun, X. ZhiHui, Multi-perspective panorama based on the improved pushbroom model, in: Workshop on Digital Media and its Application in Museum & Heritage, 2007, pp. 85–90.
[26] R. Cipolla, D. Robertson, 3D models of architectural scenes from uncalibrated images and vanishing points, in: International Conference on Image Analysis and Processing, 1999, pp. 824–829.
[27] M. Wilczkowiak, P. Sturm, E. Boyer, Using geometric constraints through parallelepipeds for calibration and 3D modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2) (2005) 194–207.
[28] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 24 (1981) 381–395.
[29] B. Korte, J. Vygen, Combinatorial Optimization: Theory and Algorithms, third ed., Algorithms and Combinatorics, Springer, 2005.
[30] T. Thormählen, H.-P. Seidel, 3D-modeling by ortho-image generation from image sequences, in: ACM SIGGRAPH, 2008, pp. 1–5.
