An imaging camera and a depth camera are configured to perform a 3d scan of an interior space. A processor is configured to generate voxels in a three-dimensional (3d) grid based on the 3d scan. The voxels represent portions of the volume of the interior space. The processor is also configured to project the voxels onto tiles in a two-dimensional (2d) floor plan of the interior space. The processor is further configured to generate, based on the tiles, a 2d distance grid that represents features in the interior space. In some cases, the 2d distance grid is generated in real-time concurrently with performing the 3d scan of the interior space. The processor is further configured to generate, based on the 2d distance grid, a set of polygons representing elements of the floor plan in real-time. The processor is further configured to generate a simplified set of primitives representing the floor plan.
1. A method comprising:
performing a three-dimensional (3d) scan of an interior space;
accessing voxels in a 3d grid that is generated from the 3d scan, wherein the voxels represent portions of the volume of the interior space and wherein the voxels have associated weights that indicate a number of observations that include the corresponding portion of the volume of the interior space and signed distances relative to surfaces associated with the voxels;
projecting the voxels onto tiles in a two-dimensional (2d) floor plan of the interior space by combining the weights associated with the voxels to determine a set of 2d weights for the tiles; and
generating, based on the tiles, a 2d distance grid that represents features in the interior space.
20. An electronic device comprising:
an imaging camera and a depth camera configured to perform a 3d scan of an interior space; and
a processor configured to
generate voxels in a three-dimensional (3d) grid based on the 3d scan, wherein the voxels represent portions of the volume of the interior space and wherein the voxels have values of weights that indicate a number of observations that include the corresponding portion of the volume of the interior space and values of signed distances relative to surfaces associated with the voxels;
project the voxels onto tiles in a two-dimensional (2d) floor plan of the interior space by combining the weights associated with the voxels to determine a set of 2d weights for the tiles; and
generate, based on the tiles, a 2d distance grid that represents features in the interior space.
38. A method comprising:
projecting voxels in a three-dimensional (3d) grid onto tiles in a two-dimensional (2d) floor plan by combining weights associated with the voxels to determine a set of weights for the tiles, wherein the voxels are generated from a 3d scan of an interior space, wherein projecting the voxels is performed concurrently with performing the 3d scan, and wherein the voxels represent portions of the volume of the interior space and have values of the weights that indicate a number of observations that include the corresponding portion of the volume of the interior space and values of signed distances relative to surfaces associated with the voxels;
generating, based on the tiles, a 2d distance grid that represents features in the interior space; and
identifying a set of primitives that represent the 2d distance grid.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
7. The method of
8. The method of
9. The method of
extracting, from the 3d grid, at least one of a height of a highest surface that is visible from above, a height of a lowest surface that is visible from below, and a ratio of free space to occupied space in a vertical band.
10. The method of
11. The method of
12. The method of
13. The method of
identifying a set of primitives that represent the features of the 2d distance grid.
14. The method of
15. The method of
16. The method of
assigning labels to portions of the interior space based on features in the 3d scan, wherein the labels indicate room types.
17. The method of
training a convolutional neural network (CNN) to assign the labels based on different types of furniture that are represented in the 3d scan of the portions of the interior space, wherein the CNN is trained using a set of training images including color information and depth information for each pixel.
18. The method of
19. The method of
generating a labeled 2d distance grid by associating the labels with portions of the 2d distance grid that correspond to the portions of the interior space that were assigned the labels based on the features in the 3d scan.
21. The electronic device of
22. The electronic device of
23. The electronic device of
24. The electronic device of
25. The electronic device of
26. The electronic device of
27. The electronic device of
28. The electronic device of
29. The electronic device of
30. The electronic device of
31. The electronic device of
32. The electronic device of
33. The electronic device of
34. The electronic device of
35. The electronic device of
36. The electronic device of
37. The electronic device of
39. The method of
40. The method of
41. The method of
42. The method of
labeling subsets of the sets of primitives based on features in the 3d scan, wherein the labels indicate room types.
This application claims priority to U.S. Provisional Patent Application 62/491,988 entitled “Floor Plan Mapping and Simplification System Overview,” which was filed on Apr. 28, 2017 and is incorporated herein by reference in its entirety.
A two-dimensional (2D) floor plan of a building, house, or apartment is a valuable representation of the corresponding structure. For example, the 2D floor plan is used to illustrate a room layout for a potential buyer of a home or building, a potential tenant of an apartment, an interior designer that is planning a redesign of the interior space, an architect involved in a renovation of the structure, and the like. Conventional processes of generating a 2D floor plan require human intervention, even though systems are available to scan the three-dimensional (3D) geometry of an interior space. For example, a draftsperson is typically required to draw a 2D architectural floor plan based on the scanned 3D geometry. Furthermore, commercially available systems for performing 3D scanning are relatively expensive and the scanning process is time and labor intensive. For example, a conventional 3D scanning system uses detectors mounted on a tripod, which must be moved to several acquisition locations within the structure that is being scanned. The scanning time at each acquisition location is typically several minutes or more. Mobile 3D scanning can be implemented by adding a depth camera to a mobile phone to capture the 3D geometry of a structure. However, this approach still requires manual extraction of the 2D floor plan from the 3D geometry. Consequently, up-to-date floor plans are not available for most buildings, houses, and apartments.
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
A 2D floor plan of a structure is generated using a vertical projection of voxels in a 3D grid that represents a volume enclosed by the structure. In some embodiments, the 3D grid is acquired using a 3D scanning application in a mobile phone that implements a 2D camera and a depth camera. The voxels store information including their spatial position, the spatial extent of the voxel, the number of observations of the voxel (or weight of the information), and the estimated signed distance from the voxel to the nearest surface. A 3D mesh of triangles that represents the scanned volume is generated (or extracted) based on the values of the voxels. Vertical projection of voxels onto a 2D tile in the 2D floor plan includes summing weights of voxels along a vertical direction to determine a 2D weight for the tile and determining a weighted sum of the signed distances along the vertical direction to determine a 2D signed distance for the tile. In some embodiments, additional features are extracted from the 3D representation such as a height of the highest surface that is visible from above, a height of the lowest surface that is visible from below, a ratio of free space to occupied space in a vertical band, and the like. The 2D weights and the 2D signed distances of the tiles are used to generate 2D distance grids that represent features in the structure, such as walls, free space, furniture, doors, windows, and the like. In some cases, the 2D distance grids are converted into polygons that represent the features in the floor plan.
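To make the per-voxel record concrete, the following sketch shows one possible in-memory layout in Python; the field names, the dataclass representation, and the running weighted-average update in integrate() are illustrative assumptions rather than details taken from the system described here.

```python
from dataclasses import dataclass

@dataclass
class TsdfVoxel:
    """One cell of a 3D truncated signed distance function (TSDF) grid.

    Field names are illustrative; a real implementation may pack these
    values into flat arrays for performance.
    """
    ix: int             # integer grid position along x
    iy: int             # integer grid position along y
    iz: int             # integer grid position along z
    side_length: float  # spatial extent of the voxel, e.g., 0.02 m
    weight: float       # number of observations that included this voxel
    distance: float     # estimated signed distance to the nearest surface

    def integrate(self, observed_distance: float, obs_weight: float = 1.0) -> None:
        """Fold a new depth observation into a running weighted average
        (one common TSDF fusion rule, assumed here for illustration)."""
        total = self.weight + obs_weight
        self.distance = (self.distance * self.weight + observed_distance * obs_weight) / total
        self.weight = total
```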
Real-time generation of the 2D floor plan is supported by subdividing the 3D grid into a set of volumes containing a predetermined number of voxels, e.g., a volume of 16×16×16 voxels. The system identifies the volumes that are affected in response to acquisition of a new depth image of the structure. Only the values of the voxels in the affected volumes are updated and the corresponding 3D mesh is re-extracted. Subsets of the tiles in the 2D floor plan are associated with corresponding volumes that include voxels that are vertically projected onto the subset of tiles. The 2D floor plan is updated by determining whether one or more volumes in each vertical column associated with a subset of tiles has been updated. If so, a portion of the 2D floor plan that includes the subset of tiles is recomputed and corresponding polygons (or partial polygons) are extracted. Noise in the 2D floor plan is reduced by representing the floor plan as a series of primitives such as lines, rectangles, triangles, circles, polygons, and the like. The primitives are associated with semantic classes such as free space, walls, unknown, as well as other classes that can include furniture, doors, windows, and the like. The primitives can be oriented in any direction. An iterative process is used to find a sequence of primitives and corresponding orientations that approximate the 2D floor plan by minimizing a cost function over a set of primitives and orientations.
The electronic device 110 is configured to support location-based functionality, such as simultaneous localization and mapping (SLAM) or augmented reality (AR), using image and non-image sensor data in accordance with at least one embodiment of the present disclosure. The electronic device 110 can include a portable user device, such as a tablet computer, computing-enabled cellular phone (e.g., a “smartphone”), a notebook computer, a personal digital assistant (PDA), a gaming system remote, a television remote, an AR/VR headset, and the like. In other embodiments, the electronic device 110 includes a fixture device, such as a personal service robot (e.g., a vacuum cleaning robot), medical imaging equipment, a security imaging camera system, an industrial robot control system, a drone control system, a 3D scanning apparatus, and the like. For ease of illustration, the electronic device 110 is generally described herein in the example context of a portable user device, such as a tablet computer or a smartphone; however, the electronic device 110 is not limited to these example implementations.
The electronic device 110 includes a plurality of sensors to obtain information regarding the interior space 100. The electronic device 110 obtains visual information (imagery) for the interior space 100 via imaging cameras and a depth sensor disposed at a forward-facing surface and, in some embodiments, an imaging camera disposed at a user-facing surface. As discussed herein, the imaging cameras and the depth sensor are used to perform 3D scanning of the environment of the interior space 100. In some embodiments, a user holding the electronic device 110 moves through the interior space 100, as indicated by the arrows 115, 120. The user orients the electronic device 110 so that the imaging cameras and the depth sensor are able to capture images and sense a depth of a portion of the interior space 100, as indicated by the dotted oval 125. The captured images and the corresponding depths are then stored by the electronic device 110 for later use in generating a 3D grid representation of the interior space 100 and a 2D floor plan of the interior space 100.
Some embodiments of the electronic device 110 rely on non-image information for position/orientation detection. This non-image information can be obtained by the electronic device 110 via one or more non-image sensors (not shown), such as a gyroscope, a magnetometer, or an ambient light sensor.
In operation, the electronic device 110 uses the image sensor data and the non-image sensor data to determine a relative position/orientation of the electronic device 110, that is, a position/orientation relative to the interior space 100. In at least one embodiment, the determination of the relative position/orientation is based on the detection of spatial features in image data captured by one or more of the imaging cameras and the determination of the position/orientation of the electronic device 110 relative to the detected spatial features. Non-image sensor data, such as readings from a gyroscope, a magnetometer, an ambient light sensor, a keypad, a microphone, and the like, also is collected by the electronic device 110 in its current position/orientation.
Some embodiments of the electronic device 110 combine the relative position/orientation of the electronic device 110, the pose of the electronic device 110, the image sensor data, and the depth sensor data to generate a 3D grid of voxels that represent the interior space 100 and features within the interior space 100 including the bookcase 101, the walls 102, 103, the door 104, and the window 105. Each voxel represents a portion of the volume enclosed by the interior space 100. The voxels include values of weights that indicate a number of observations that include the corresponding portion of the volume of the interior space 100 and signed distances relative to surfaces associated with the voxels.
The electronic device 110 is able to generate a 2D distance grid that represents the interior space 100 by vertically projecting the 3D grid into the plane of the floor of the interior space 100. In the illustrated embodiment, the 2D distance grid indicates locations of the bookcase 101 and the walls 102, 103 in the floor plan of the interior space 100. The 2D distance grid can also include information indicating locations of the door 104, the window 105, and other objects or features in the interior space 100. Some embodiments of the electronic device 110 generate the 2D distance grid concurrently with performing the 3D scan of the interior space 100, e.g., by accessing subsets of the voxels that were modified in a previous time interval while the electronic device 110 was performing the 3D scan. Noise in the 2D grid is reduced by representing the 2D floor plan of the interior space 100 as a set of primitives, such as lines, circles, triangles, rectangles, or other polygons. Some embodiments of the electronic device 110 are therefore able to iteratively select a primitive that minimizes a cost function when the primitive is added to the set of primitives that are used to represent the interior space 100. As discussed herein, the cost function indicates how well the set of primitives matches the 2D distance grid.
As illustrated by the front plan view 200 of the electronic device 110, the user-facing surface of the electronic device 110 includes the user-facing imaging camera 212.
As illustrated by the back plan view 300 of the electronic device 110, the forward-facing surface 310 of the electronic device 110 includes the imaging cameras 302, 304 and the modulated light projector 306.
In one embodiment, the imaging camera 302 is implemented as a wide-angle imaging camera having a fish-eye lens or other wide-angle lens to provide a wider angle view of the local environment facing the surface 310. The imaging camera 304 is implemented as a narrow-angle imaging camera having a typical angle of view lens to provide a narrower angle view of the local environment facing the surface 310. Accordingly, the imaging camera 302 and the imaging camera 304 are also referred to herein as the “wide-angle imaging camera 302” and the “narrow-angle imaging camera 304,” respectively. As described in greater detail below, the wide-angle imaging camera 302 and the narrow-angle imaging camera 304 can be positioned and oriented on the forward-facing surface 310 such that their fields of view overlap starting at a specified distance from the electronic device 110, thereby enabling depth sensing of objects in the local environment that are positioned in the region of overlapping fields of view via multiview image analysis.
Some embodiments of a depth sensor implemented in the electronic device 110 use the modulated light projector 306 to project modulated light patterns from the forward-facing surface 310 into the local environment, and use one or both of imaging cameras 302, 304 to capture reflections of the modulated light patterns as they reflect back from objects in the local environment. These modulated light patterns can be either spatially-modulated light patterns or temporally-modulated light patterns. The captured reflections of the modulated light patterns are referred to herein as “depth imagery.” The depth sensor calculates the depths of the objects, that is, the distances of the objects from the electronic device 110, based on the analysis of the depth imagery. The resulting depth data obtained from the depth sensor may be used to calibrate or otherwise augment depth information obtained from multiview analysis (e.g., stereoscopic analysis) of the image data captured by the imaging cameras 302, 304. Alternatively, the depth data from the depth sensor may be used in place of depth information obtained from multiview analysis. To illustrate, multiview analysis typically is more suited for bright lighting conditions and when the objects are relatively distant, whereas modulated light-based depth sensing is better suited for lower light conditions or when the observed objects are relatively close (e.g., within 4-5 meters). Thus, when the electronic device 110 senses that it is outdoors or otherwise in relatively good lighting conditions, the electronic device 110 may elect to use multiview analysis to determine object depths. Conversely, when the electronic device 110 senses that it is indoors or otherwise in relatively poor lighting conditions, the electronic device 110 may switch to using modulated light-based depth sensing via the depth sensor.
The type of lens implemented for each imaging camera depends on the intended function of the imaging camera. Because the forward-facing imaging camera 302, in one embodiment, is intended for machine vision-specific imagery for analyzing the local environment, the lens 410 may be implemented as a wide-angle lens or a fish-eye lens having, for example, an angle of view between 160-180 degrees with a known high distortion. The forward-facing imaging camera 304, in one embodiment, supports user-initiated image capture, and thus the lens 414 of the forward-facing imaging camera 304 may be implemented as a narrow-angle lens having, for example, an angle of view between 80-90 degrees horizontally. Note that these angles of view are exemplary only. The user-facing imaging camera 212 likewise may have other uses in addition to supporting local environment imaging or head tracking. For example, the user-facing imaging camera 212 also may be used to support video conferencing functionality for the electronic device 110. Accordingly, depending on the application the lens 418 of the user-facing imaging camera 212 can be implemented as a narrow-angle lens, a wide-angle lens, or a fish-eye lens.
The image sensors 408, 412, and 416 of the imaging cameras 212, 302, and 304, respectively, can be implemented as charge coupled device (CCD)-based sensors, complementary metal-oxide-semiconductor (CMOS) active pixel sensors, and the like. In a CMOS-based implementation, the image sensor may include a rolling shutter sensor whereby a group of one or more rows of pixel sensors of the image sensor is read out while all other rows on the sensor continue to be exposed. This approach has the benefit of providing increased sensitivity due to the longer exposure times or more usable light sensitive area, but with the drawback of being subject to distortion due to high-speed objects being captured in the frame. The effect of distortion can be minimized by implementing a global reset mechanism in the rolling shutter so that all of the pixels on the sensor begin collecting charge simultaneously, rather than on a row-by-row basis. In a CCD-based implementation, the image sensor can be implemented as a global shutter sensor whereby all pixels of the sensor are exposed at the same time and then transferred to a shielded area that can then be read out while the next image frame is being exposed. This approach has the benefit of being less susceptible to distortion, with the downside of generally decreased sensitivity due to the additional electronics required per pixel.
In some embodiments, the fields of view of the wide-angle imaging camera 302 and the narrow-angle imaging camera 304 overlap in a region 420 so that objects in the local environment in the region 420 are represented both in the image frame captured by the wide-angle imaging camera 302 and in the image frame concurrently captured by the narrow-angle imaging camera 304, thereby allowing the depth of the objects in the region 420 to be determined by the electronic device 110 through a multiview analysis of the two concurrent image frames. As such, the forward-facing imaging cameras 302 and 304 are positioned at the forward-facing surface 310 so that the region 420 covers an intended distance range and sweep relative to the electronic device 110. Moreover, as the multiview analysis relies on the parallax phenomenon, the forward-facing imaging cameras 302 and 304 are sufficiently separated to provide adequate parallax for the multiview analysis.
Also illustrated in the cross-section view 400 are various example positions of the modulated light projector 306. The modulated light projector 306 projects an infrared modulated light pattern 424 in a direction generally perpendicular to the surface 310, and one or both of the forward-facing imaging cameras 302 and 304 are utilized to capture reflection of the projected light pattern 424. In the depicted example, the modulated light projector 306 is disposed at the forward-facing surface 310 at a location between the imaging cameras 302 and 304. In other embodiments, the modulated light projector 306 can be disposed at a location between one of the imaging cameras and an edge of a housing, such as at a location 422 between the wide-angle imaging camera 302 and the side of the housing, or at a location (not shown) between the narrow-angle imaging camera 304 and the side of the housing.
The user equipment 505 generates a 3D grid representative of the interior space using images and depth values gathered during the 3D scan of the interior space. Some embodiments of the user equipment generate a 3D truncated signed distance function (TSDF) grid to represent the interior space. For example, the user equipment 505 can estimate camera poses in real-time using visual-inertial odometry (VIO) or concurrent odometry and mapping (COM). Depth images are then used to build the 3D volumetric TSDF grid that represents features of the interior space. Techniques for generating 3D TSDF grids are known in the art and in the interest of clarity are not discussed further herein. In some embodiments, the 3D TSDF grid is updated in response to each depth image that is acquired by the user equipment 505. Alternatively, the 3D TSDF grid is updated in response to acquiring a predetermined number of depth images or in response to a predetermined time interval elapsing. A 3D triangle mesh is extracted from the 3D TSDF grid, e.g., using a marching cubes algorithm.
The 3D TSDF grid is composed of equally sized voxels having a particular side length, e.g., 2 centimeters (cm). Each voxel stores two values: the number of observations (weight) acquired by the user equipment 505 that include the volume represented by the voxel and an estimated signed distance to a corresponding surface, such as a surface of the wall 515. A 2D distance grid is generated from the 3D TSDF grid by projecting the voxels in a vertical direction. In some embodiments, the vertical projection is done by combining the weights and the signed distances in a vertical direction. For example, the projected values can be computed as:
weight2d(x,y) = Σz weight3d(x,y,z)  (1)
distance2d(x,y) = Σz weight3d(x,y,z)·distance3d(x,y,z) / weight2d(x,y)  (2)
In this example, the 2D floor plan is in the x-y plane and the vertical projection is along the z-direction. The 2D signed distance is normalized by dividing the weighted sum of the 3D signed distances extracted from the 3D TSDF grid by the 2D weight in the illustrated embodiment.
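The vertical projection described above can be written compactly with NumPy. The sketch below assumes the 3D grid is stored as two dense arrays indexed as [x, y, z], which is an assumption about the storage layout rather than something specified in the description; it sums the per-voxel weights along z and normalizes the weighted sum of signed distances by the resulting 2D weight.

```python
import numpy as np

def project_tsdf_to_2d(weight3d: np.ndarray, distance3d: np.ndarray, eps: float = 1e-9):
    """Vertically project a 3D TSDF grid onto a 2D distance grid.

    weight3d, distance3d: arrays of shape (nx, ny, nz) holding per-voxel
    observation weights and signed distances. Returns per-tile 2D weights
    and weight-normalized 2D signed distances.
    """
    # Sum the observation weights along the vertical (z) axis.
    weight2d = weight3d.sum(axis=2)

    # Weighted sum of signed distances, normalized by the 2D weight.
    weighted_dist = (weight3d * distance3d).sum(axis=2)
    distance2d = weighted_dist / np.maximum(weight2d, eps)

    # Tiles that were never observed keep a zero weight and can be treated
    # as "unknown" by later processing stages.
    return weight2d, distance2d

# Example usage with a small random grid (for illustration only).
rng = np.random.default_rng(0)
w3 = rng.random((64, 64, 32))
d3 = rng.uniform(-0.1, 0.1, size=(64, 64, 32))
w2, d2 = project_tsdf_to_2d(w3, d3)
```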
Some embodiments of the user equipment 505 are configured to extract other features from the 3D TSDF grid. For example, the user equipment 505 can extract a value of a height of the highest surface that was visible from above:
surface_from_above(x,y) = max{z | distance3d(x,y,z) > 0 and distance3d(x,y,z−1) < 0}  (3)
Other features that can be extracted include a height of a lowest surface that is viewed from below, a ratio of free space to occupied space in a vertical band, and the like. The ratio of free/occupied space can be used to distinguish walls from windows, doors, and other openings.
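The additional features can be extracted from the same dense arrays. The sketch below is illustrative only: it assumes the sign convention implied by equation (3) (positive signed distance in free space) and treats the vertical band for the free/occupied ratio as a pair of hypothetical z indices.

```python
import numpy as np

def highest_surface_from_above(distance3d: np.ndarray) -> np.ndarray:
    """Per-tile height index of the highest surface visible from above (cf. equation (3)).

    A surface is detected where the signed distance changes from negative at z-1
    to positive at z. Returns -1 for tiles where no such crossing exists.
    """
    crossing = (distance3d[:, :, 1:] > 0) & (distance3d[:, :, :-1] < 0)
    z_idx = np.arange(1, distance3d.shape[2])   # candidate z indices
    heights = np.where(crossing, z_idx, -1)     # keep z where a crossing occurs
    return heights.max(axis=2)                  # highest crossing per (x, y)

def free_to_occupied_ratio(distance3d: np.ndarray, z_lo: int, z_hi: int, eps: float = 1e-9) -> np.ndarray:
    """Ratio of free to occupied voxels in the vertical band [z_lo, z_hi)."""
    band = distance3d[:, :, z_lo:z_hi]
    free = (band > 0).sum(axis=2)
    occupied = (band <= 0).sum(axis=2)
    return free / np.maximum(occupied, eps)
```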
Pixels in the 2D distance grid 700 are categorized in different semantic classes. In the illustrated embodiment, the semantic classes are walls, free space, and furniture. For example, pixels that represent walls 705 (only one indicated by a reference numeral in the interest of clarity) are encoded in black, pixels that represent free space 710 are encoded in white, and pixels that represent furniture 715 are encoded using grey. The 2D distance grid 700 is relatively noisy and includes artifacts, such as the region 720 that is produced by images and depths acquired from objects that are outside of a window. Thus, in some embodiments, the 2D distance grid 700 is converted into a set of polygons by extracting a zero iso-surface, for example, using the marching squares algorithm. The set of polygons generated using the marching squares algorithm is referred to as a raw version of a polygonised floor plan that represents the interior space.
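As a sketch of this step, the marching squares implementation in scikit-image (skimage.measure.find_contours) can trace the zero iso-surface of the 2D signed distance grid; the choice of this particular library, and the minimum-vertex filter, are assumptions made for illustration.

```python
import numpy as np
from skimage import measure

def extract_raw_floor_plan_polygons(distance2d: np.ndarray, level: float = 0.0):
    """Extract the zero iso-surface of a 2D signed distance grid as polylines.

    Returns a list of (N, 2) arrays of (row, col) vertices. Closed contours
    correspond to polygons of the raw (un-simplified) floor plan.
    """
    # Marching squares: traces iso-valued contours of a 2D scalar field.
    contours = measure.find_contours(distance2d, level)
    # Keep only contours with enough vertices to form a meaningful polygon.
    return [c for c in contours if len(c) >= 3]

# Example: a synthetic signed distance field for a 20x20 "room" inside a 32x32 grid.
grid = -np.ones((32, 32))
grid[6:26, 6:26] = 1.0
polygons = extract_raw_floor_plan_polygons(grid)
```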
The method described below reduces noise in the 2D distance grid by representing the floor plan as a sequence of primitives:
P = p1, p2, …, pN  (4)
The primitives are represented as:
pi = (xi, yi, φi, widthi, heighti, classi)  (5)
where (x, y) are coordinates of a reference point on the primitive in the plane of the floor plan, φ is the orientation of the primitive in the plane, width is the width of the rectangular primitive, height is the height of the rectangular primitive, and class is the class of the primitive. Different sets of parameters are used to characterize other types of primitives.
The search space over possible orientations is reduced by determining a set of primary orientations of the floor plan. In some embodiments, as discussed below, a histogram is constructed to accumulate weighted gradient orientations from the 2D TSDF distance grid. Maxima in the histogram that are larger than a predetermined percentage of the global maximum are selected as primary orientations. The size of the search space for a single primitive is on the order of:
O(n² · Norientations · Nclasses)  (6)
where n is the number of pixels, Norientations is the number of possible orientations, and Nclasses is the number of possible classes. A predicted label image is generated by rasterizing the primitives in order.
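A minimal sketch of the rasterization, assuming oriented rectangular primitives of the form in equation (5) with the reference point treated as the rectangle center and integer class codes; both of these choices are illustrative assumptions.

```python
import numpy as np
from typing import NamedTuple, Sequence

class RectPrimitive(NamedTuple):
    """Oriented rectangle primitive (cf. equation (5)). The reference point
    (x, y) is treated here as the rectangle center, which is an assumption."""
    x: float
    y: float
    phi: float      # orientation in radians
    width: float
    height: float
    label: int      # semantic class code, e.g., 0=unknown, 1=free space, 2=wall

def rasterize_primitives(primitives: Sequence[RectPrimitive], shape, background: int = 0) -> np.ndarray:
    """Rasterize primitives in order into a predicted label image.

    Later primitives overwrite earlier ones, so the order of the sequence matters.
    """
    predicted = np.full(shape, background, dtype=np.int32)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    for p in primitives:
        # Transform pixel centers into the rectangle's local coordinate frame.
        dx, dy = xs - p.x, ys - p.y
        u = np.cos(p.phi) * dx + np.sin(p.phi) * dy
        v = -np.sin(p.phi) * dx + np.cos(p.phi) * dy
        inside = (np.abs(u) <= p.width / 2) & (np.abs(v) <= p.height / 2)
        predicted[inside] = p.label
    return predicted

# Example: one free-space rectangle with a wall strip drawn on top of it.
prims = [RectPrimitive(32, 32, 0.0, 40, 30, label=1),
         RectPrimitive(32, 18, 0.0, 40, 2, label=2)]
label_image = rasterize_primitives(prims, (64, 64))
```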
The sequence of primitives is selected so that the predicted image generated by the sequence best matches the target label image. This criterion is represented as:
Pbest = argmaxP Σ(x,y) cost(predicted(P,x,y), target(x,y))  (7)
where the cost function indicates how well the predicted image matches the target label image. The cost function is also used to assign weights to different semantic classes. For example, a higher cost can be assigned to missing walls than the cost that is assigned to missing free space. In practice, finding the optimal sequence of unknown length is an NP-hard problem. Some embodiments of the method therefore utilize an iterative algorithm that searches for a single primitive that minimizes the cost function at each iteration of the method.
A cost map is precomputed for every primitive and orientation over the full space of possible primitives. To evaluate the cost of adding a primitive to the set of primitives that represents the interior space at each iteration, a cost function is computed by iterating over pixels of the primitive. The cost function is represented as:
cost(p) = Σ(x,y)∈p cost(predicted(P,x,y), target(x,y))  (8)
The cost maps are converted into integral images, which are also known as summed area tables. Techniques for computing summed area tables are known in the art and are therefore not discussed in detail herein. Utilizing integral images or summed area tables allows the cost to be determined in constant time, e.g., using four lookups to the summed area tables. Some embodiments of the cost function include a cost term for matching or mismatching pixels, as described above, a cost term for label changes to discourage combining large overlapping rectangles or other primitives, and a cost term for vertices to discourage including walls that lie within other walls or parallel walls that touch each other. The cost term for vertices enhances the ability of the algorithm to distinguish between horizontal walls, vertical walls, and walls having other orientations to avoid including walls that have the same orientation. Parameters that characterize the cost functions are determined empirically.
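The four-lookup evaluation is the standard summed-area-table identity. The sketch below shows it for an axis-aligned region of a per-pixel cost map; in the described method a separate cost map would exist per class and orientation, and that precomputation is assumed to happen elsewhere.

```python
import numpy as np

def build_summed_area_table(cost_map: np.ndarray) -> np.ndarray:
    """Integral image: sat[i, j] = sum of cost_map[:i, :j] (note the padding row/column)."""
    sat = np.zeros((cost_map.shape[0] + 1, cost_map.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = cost_map.cumsum(axis=0).cumsum(axis=1)
    return sat

def region_cost(sat: np.ndarray, r0: int, c0: int, r1: int, c1: int) -> float:
    """Sum of the cost map over rows [r0, r1) and columns [c0, c1) using four lookups."""
    return sat[r1, c1] - sat[r0, c1] - sat[r1, c0] + sat[r0, c0]

# Example: the cost of placing an axis-aligned 10x20 region at (row=5, col=8).
rng = np.random.default_rng(0)
cost_map = rng.random((64, 64))
sat = build_summed_area_table(cost_map)
cost = region_cost(sat, 5, 8, 15, 28)
assert np.isclose(cost, cost_map[5:15, 8:28].sum())
```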
At each iteration, the method evaluates all possible primitives and selects the primitive that has the lowest cost. The algorithm converges after the difference in cost from one iteration to the next falls below a predetermined threshold.
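A minimal sketch of the greedy loop described above; the candidate generator, the cost evaluator, the iteration cap, and the convergence threshold are placeholders rather than values from the described system.

```python
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

Primitive = TypeVar("Primitive")

def greedy_fit_primitives(
    candidates: Callable[[], Iterable[Primitive]],
    cost_of: Callable[[List[Primitive]], float],
    convergence_threshold: float = 1e-3,
    max_iterations: int = 100,
) -> List[Primitive]:
    """Greedily build a sequence of primitives that approximates the floor plan.

    candidates(): yields all primitives to consider this iteration
                  (positions x orientations x sizes x classes).
    cost_of(seq): cost of the label image produced by rasterizing `seq`
                  against the target label image (lower is better).
    """
    selected: List[Primitive] = []
    current_cost = cost_of(selected)
    for _ in range(max_iterations):
        best: Optional[Tuple[float, Primitive]] = None
        for p in candidates():
            c = cost_of(selected + [p])
            if best is None or c < best[0]:
                best = (c, p)
        if best is None or current_cost - best[0] < convergence_threshold:
            break  # no candidate improves the cost enough: converged
        selected.append(best[1])
        current_cost = best[0]
    return selected
```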
In the illustrated embodiment, the set 805 of polygons that represent the interior space includes three polygons 801, 802, 803 after a third iteration is complete. The polygons 801, 802, 803 are in the free space class, as indicated by the white fill. After the fifth iteration, the set 810 of polygons that represent the interior space includes two additional polygons 811, 812 that are both in the unknown class, as indicated by the crosshatching. After several more iterations, the algorithm has added polygons 815 (only one indicated by a reference numeral in the interest of clarity) to the set 820 of polygons. The polygons 815 represent walls in the interior space, as indicated by the black fill. The method continues until a convergence criterion is satisfied.
The tiles 920, 940 are selectively updated depending upon whether the corresponding voxels 905, 910, 915, 925, 930, 935 were updated during the previous time interval of the 3D scan. In the illustrated embodiment, the tile 920 is not updated (as indicated by the light dashed lines) because none of the voxels 905, 910, 915 were modified during the previous time interval. The tile 940 is updated (as indicated by the solid line) because the values of the voxel 930 were modified during the previous time interval. The voxels 925, 930, 935 are therefore vertically projected onto the tile 940 by averaging or summing the values of the voxels 925, 930, 935 to update the values of the tile 940. Selectively updating the values of the tiles 920, 940 based on whether the corresponding voxels 905, 910, 915, 925, 930, 935 were previously updated allows some embodiments of the techniques for generating 2D floor plans disclosed herein to be performed in real-time.
In some embodiments, the voxels 905, 910, 915, 925, 930, 935 are grouped into volumes that include predetermined numbers of voxels. For example, a volume can include a 16×16×16 set of voxels that include one or more of the voxels 905, 910, 915, 925, 930, 935. In that case, the tiles are grouped into corresponding sets of tiles, e.g., 16×16 sets of tiles that are associated with vertical columns of volumes including 16×16×16 sets of voxels. The sets of tiles are selectively updated based upon whether at least one voxel within the corresponding vertical column of volumes was or was not updated during a previous time interval of the 3D scan. Hashing is used in some embodiments to identify the volumes that include voxels that have been updated, e.g., by marking the corresponding volume as a “dirty” volume.
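One way to implement this bookkeeping is to hash the integer coordinates of the dirty 16×16×16 volumes and recompute only the 16×16 tile blocks whose vertical column contains at least one dirty volume; the data structures below are illustrative assumptions.

```python
from typing import Iterable, Set, Tuple

VOLUME_SIZE = 16  # voxels per volume edge; a 16x16x16 block as described above

VolumeKey = Tuple[int, int, int]   # integer coordinates of a voxel volume
TileBlockKey = Tuple[int, int]     # integer coordinates of a 16x16 block of 2D tiles

def mark_dirty_volumes(updated_voxels: Iterable[Tuple[int, int, int]]) -> Set[VolumeKey]:
    """Hash the volumes that contain voxels modified by the latest depth image."""
    return {(vx // VOLUME_SIZE, vy // VOLUME_SIZE, vz // VOLUME_SIZE)
            for (vx, vy, vz) in updated_voxels}

def tile_blocks_to_recompute(dirty_volumes: Set[VolumeKey]) -> Set[TileBlockKey]:
    """A tile block is recomputed if any volume in its vertical column is dirty."""
    return {(x, y) for (x, y, _z) in dirty_volumes}

# Example: two modified voxels in the same vertical column map to one tile block.
dirty = mark_dirty_volumes([(3, 20, 5), (7, 21, 40)])
blocks = tile_blocks_to_recompute(dirty)   # {(0, 1)}
```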
The weight associated with a pixel is determined by estimating a direction of a gradient in values of pixels. For example, a gradient of a pixel near a surface of a wall is approximately perpendicular to the wall because values of pixels within the wall (e.g., black pixels in a 2D TSDF grid) differ from values of pixels in the free space outside the wall (e.g., white pixels in a 2D TSDF grid). The weighted gradient orientations from the 2D TSDF distance grid are accumulated in bins 1105 (only one indicated by a reference numeral in the interest of clarity) corresponding to different orientations. Maxima in the histogram that are larger than a predetermined percentage of the global maximum are selected as primary orientations. The predetermined percentage is indicated by the line 1115.
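A sketch of the orientation histogram, assuming the gradients are weighted by their magnitude and folded into [0, 180) degrees; both choices, along with the bin count and peak threshold, are assumptions made for illustration.

```python
import numpy as np

def primary_orientations(distance2d: np.ndarray, num_bins: int = 180, threshold_frac: float = 0.5):
    """Histogram of weighted gradient orientations and its dominant peaks.

    Gradients are weighted by their magnitude (an assumption), orientations are
    folded into [0, 180) degrees, and local maxima whose count exceeds
    `threshold_frac` of the global maximum are returned in degrees.
    """
    gy, gx = np.gradient(distance2d)
    magnitude = np.hypot(gx, gy)
    angles = np.degrees(np.arctan2(gy, gx)) % 180.0   # wall direction is ambiguous by 180 deg

    hist, edges = np.histogram(angles, bins=num_bins, range=(0.0, 180.0), weights=magnitude)

    # Keep local maxima that exceed a fraction of the global maximum
    # (np.roll treats the orientation histogram as circular).
    threshold = threshold_frac * hist.max()
    is_peak = (hist >= np.roll(hist, 1)) & (hist >= np.roll(hist, -1)) & (hist > threshold)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return hist, centers[is_peak]
```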
At block 1305, a 3D grid of voxels that represent an interior space is acquired. As discussed herein, in some embodiments, the 3D grid of voxels is acquired by an electronic device or user equipment that is held by a user as the user moves through the interior space. However, in other embodiments, the 3D grid of voxels can be acquired using other image acquisition and depth sensing equipment, which may or may not be implemented in a single device that can be carried by a user. Furthermore, the 3D grid of voxels can be acquired by one system and then processed according to the method 1300 by another system.
At block 1310, the user equipment determines 2D weights of 2D tiles in a floor plan by projecting the voxels in the 3D grid into a plane of the floor plan. At block 1315, the user equipment determines 2D signed distances of the 2D tiles by projecting values of the 3D signed distances for voxels in the 3D grid. At block 1320, the user equipment generates a 2D distance grid that represents the 2D floor plan. For example, the user equipment can generate a 2D distance grid such as the 2D distance grid 700 described above.
At block 1325, the user equipment reduces or removes noise in the 2D distance grid by generating a set of primitives that represent the 2D distance grid. For example, the user equipment can generate the set of primitives using an iterative process that selects primitives based on a cost function, as discussed herein.
At block 1405, the user equipment generates a 2D distance grid that represents the floor plan. As discussed herein, the 2D distance grid is generated by projecting values of 3D voxels into a plane of the floor plan. At block 1410, the pixels in the 2D distance grid are assigned to semantic classes such as walls, free space, and unknown. Other semantic classes include furniture, doors, windows, and the like.
At block 1415, the user equipment determines primary orientations of the floor plan. Some embodiments of the user equipment determine the primary orientations by constructing a histogram of weighted gradients that are determined based on the 2D distance grid. The primary orientations are determined by identifying peaks in the histogram that correspond to different orientations.
At block 1420, the user equipment finds a primitive that minimizes a cost function. During the first iteration, the user equipment selects a single primitive to represent the floor plan. During subsequent iterations, the user equipment selects a primitive that minimizes the cost function for the currently selected primitive when combined with the previously selected primitives. At block 1425, the selected primitive is added to the set of primitives that represent the floor plan.
At decision block 1430, the user equipment determines whether the iterative selection process has converged. Some embodiments of the iterative selection process converge in response to a difference in a value of the cost function from one iteration to the next falling below a predetermined threshold. If the iterative selection process has converged, the method 1400 flows to termination block 1435 and the method 1400 ends. If the iterative selection process has not converged, the method 1400 flows back to block 1420 and the user equipment selects another primitive that minimizes the cost function.
The electronic device 1500 includes a transceiver 1505 that is used to support communication with other devices. Some embodiments of the electronic device 1500 are implemented in user equipment, in which case the transceiver 1505 is configured to support communication over an air interface. The electronic device 1500 also includes a processor 1510 and a memory 1515. The processor 1510 is configured to execute instructions such as instructions stored in the memory 1515 and the memory 1515 is configured to store instructions, data that is to be operated upon by the instructions, or the results of instructions performed by the processor 1510. The electronic device 1500 is therefore able to implement some embodiments of the method 1300 described herein.
The rooms 1601-1606 include pieces of furniture such as a table 1610. The rooms 1601-1606 also include other pieces of furniture such as chairs, beds, dressers, bookshelves, toilets, sinks, showers, washing machines, refrigerators, ovens, stoves, dishwashers, and the like. In the interest of clarity, the other pieces of furniture are not indicated by reference numerals. As used herein, the term “furniture” refers to any object located within the interior space 1600 including the specific pieces of furniture disclosed herein, as well as other objects that are located within one or more of the rooms 1601-1606.
As discussed herein, a 3D scan of the interior space 1600 is acquired using an electronic device such as the electronic device 110 described above.
Labels are assigned to portions of the interior space 1600 based on the 3D scan. In some embodiments, the labels are selected using a trained convolutional neural network (CNN) that analyzes color and depth images of the interior space 1600. The labels are selected from a set that includes labels indicating a bathroom, a bedroom, a living room, a kitchen, an office, and an unknown label for portions of the 3D scan that the CNN is unable to identify. For example, the CNN labels the room 1604 with the label “dining room” based on the presence of the table 1610 and the chairs surrounding the table. For another example, the CNN labels the room 1601 with the label “kitchen” because the room 1601 includes a washing machine, a dishwasher, a refrigerator, a kitchen sink, and a stove.
Input 1705 to the CNN 1700 includes a 2D color image representing the interior space. The input 1705 also includes a corresponding 2D depth image to indicate depths of each location within the interior space. In the illustrated embodiment, the 2D color and depth images are generated from a textured 3D mesh that is generated based on a 3D scan of the interior space. For example, the input 1705 can include an RGB-D image of the interior space generated from a 3D scan captured by an electronic device such as the electronic device 110 described above.
Convolutional layer 1710 receives the input 1705. The convolutional layer 1710 implements a convolutional function that is defined by a set of parameters, which are trained on the basis of one or more training datasets. The parameters include a set of learnable filters (or kernels) that have a small receptive field and extend through a full depth of an input volume of the convolutional layer 1710. The parameters can also include a depth parameter, a stride parameter, and a zero-padding parameter that control the size of the output volume of the convolutional layer 1710. The convolutional layer 1710 applies a convolution operation to the input 1705 and provides the results of the convolution operation to a subsequent convolutional layer 1715. The CNN 1700 also includes an identity shortcut connection 1720 that allows an identity portion of the input 1705 to bypass the convolutional layers 1710. In the illustrated embodiment, the CNN 1700 includes additional convolutional layers 1725 and an additional identity shortcut connection 1730. Some embodiments of the CNN 1700 include more or fewer convolutional layers or identity shortcut connections.
Results of the convolution operations performed by the convolutional layers 1710, 1715, 1725 are provided to fully connected layers 1735, 1740 and DO layer 1745. The neurons in the fully connected layers 1735, 1740 are connected to every neuron in another layer, such as the convolutional layers 1725 or the other fully connected layers. The fully connected layers 1735, 1740 typically implement functionality that represents the high-level reasoning that produces an output 1750 that represents the labels generated by the CNN 1700. For example, if the CNN 1700 is trained to perform image recognition, the fully connected layers 1735, 1740 implement the functionality that labels portions of the image that have been “recognized” by the CNN 1700. For example, the fully connected layers 1735, 1740 can recognize portions of an interior space as rooms that have a particular function, in which case the fully connected layers 1735, 1740 label the portions using the corresponding room labels. The functions implemented in the fully connected layers 1735, 1740 are represented by values of parameters that are determined using a training dataset, as discussed herein.
The output 1750 of the CNN 1700 is a vector that represents probabilities that a portion of the interior space is labeled as one of a set of labels indicating a room type, such as a bathroom, a bedroom, a living room, a kitchen, an office, and the like. The CNN 1700 is able to label the portions of the interior space because the presence or absence of certain objects (or combinations thereof) provides a constraint on the type of the room that includes the objects. For example, sinks are typically found in both kitchens and bathrooms, chairs are found in all types of room but seldom in bathrooms, and beds are found in bedrooms. Thus, identifying a sink and a chair in a room makes it more likely that the room is a kitchen than a bathroom. An unknown label is assigned to portions of the interior space that the CNN 1700 is unable to identify. Including the unknown label is useful because the problem of identifying a room type for a portion of an interior space is ill-posed. For example, if a 2 m×2 m portion of a 2D color and depth image only shows a tiled white floor, the CNN 1700 will not have any information with which to identify the room type. Labeling some portions of the input 1705 as “unknown” effectively allows the CNN 1700 to avoid choosing a particular room type in cases where it is difficult or impossible to identify the room type based on the available information.
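The labeling pipeline described above maps naturally onto a small residual CNN over four-channel RGB-D input. The PyTorch sketch below is not the network described here: the layer sizes, the number of residual blocks, the dropout placement, and the six-way label set are all illustrative choices.

```python
import torch
import torch.nn as nn

ROOM_LABELS = ["bathroom", "bedroom", "living room", "kitchen", "office", "unknown"]

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # shortcut: bypass the convolutional layers
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + identity)

class RoomTypeCnn(nn.Module):
    """Classifies an RGB-D patch of the interior space into one of the room labels."""
    def __init__(self, num_labels: int = len(ROOM_LABELS)):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, 64), nn.ReLU(inplace=True),
                                  nn.Dropout(0.5), nn.Linear(64, num_labels))

    def forward(self, rgbd):              # rgbd: (batch, 4, H, W) color + depth
        x = self.stem(rgbd)
        x = self.blocks(x)
        x = self.pool(x)
        logits = self.head(x)
        return logits.softmax(dim=-1)     # per-label probabilities, cf. the output vector above

# Example: classify a single 64x64 RGB-D patch.
model = RoomTypeCnn()
probs = model(torch.randn(1, 4, 64, 64))
```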
In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software comprises one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer readable storage medium can include, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM) or other non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
A computer readable storage medium may include any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media can include, but is not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.
Inventors: Jürgen Sturm, Christoph Schütte