A process and apparatus are provided to characterize low-resolution partial point clouds for object recognition or query. A partial point cloud representation of an object is received. Zero and first order geometric moments of the partial point cloud are computed. A location of the center of the point cloud mass is computed using the geometric moments. A cubic bounding box is generated centered at the location of the mass center of the point cloud, with one side of the box bounding the point cloud at its longest semi-axis. The bounding box is divided into a three dimensional grid. A normalized voxel mass distribution is generated over the three dimensional grid. Tchebichef moments of different orders are calculated with respect to the voxel mass distribution in the grid. Low-order moments are collected to form Tchebichef Moment Shape Descriptors (TMSDs). The TMSD of the point cloud is then compared for similarity with the TMSDs of other point clouds.
1. A process for characterizing the global shape pattern of low-resolution, partial point clouds, the process comprising:
receiving a partial point cloud representation of an object from a sensor;
computing zero and first order geometric moments of the partial point cloud;
computing a location of a center of a point cloud mass using the zero and first order geometric moments;
generating a bounding box;
dividing the bounding box into a three dimensional grid;
generating a normalized voxel mass distribution over the three dimensional grid;
calculating Tchebichef moments of different orders with respect to the voxel mass distribution in the grid; and
collecting low-order moments to form a one-dimensional numerical vector containing a 3D Tchebichef Moment Shape Descriptor (TMSD).
17. A program product, comprising:
a non-transitory computer recordable type medium; and
a program code configured to be executed by a hardware based processor to characterize partial point clouds, the program code further configured to retrieve the partial point cloud representation of an object from a sensor, compute zero and first order geometric moments of the partial point cloud, compute a location of a center of a point cloud mass using the zero and first order geometric moments, generate a bounding box, divide the bounding box into a three dimensional grid, generate a normalized voxel mass distribution over the three dimensional grid, calculate Tchebichef moments of different orders with respect to the voxel mass distribution in the grid, and collect low-order moments to form a one-dimensional numerical vector containing a 3D Tchebichef Moment Shape Descriptor (TMSD).
9. An apparatus, comprising:
a sensor configured to generate a partial point cloud representation of an object;
a memory in electrical communication with the sensor and configured to store the partial point cloud generated by the sensor;
a processor in electrical communication with the memory; and
program code resident in the memory and configured to be executed by the processor to characterize partial point clouds, the program code further configured to retrieve the partial point cloud representation of an object stored in the memory, compute zero and first order geometric moments of the partial point cloud, compute a location of a center of a point cloud mass using the zero and first order geometric moments, generate a bounding box, divide the bounding box into a three dimensional grid, generate a normalized voxel mass distribution over the three dimensional grid, calculate Tchebichef moments of different orders with respect to the voxel mass distribution in the grid, and collect low-order moments to form a one-dimensional numerical vector containing a 3D Tchebichef Moment Shape Descriptor (TMSD).
2. The method of claim 1, further comprising:
comparing the similarity between the TMSD of the point cloud and TMSDs of other point clouds of known classes of shapes for partial point cloud based object recognition or query.
3. The method of claim 2, wherein comparing the similarity includes:
a multi-scale nearest neighbor (NN) query.
4. The method of claim 1, wherein generating the bounding box comprises:
generating a bounding box centered at the location of the center of the point cloud mass.
5. The method of
8. The method of
10. The apparatus of claim 9, wherein the program code is further configured to:
compare the similarity between the TMSD of the point cloud and TMSDs of other point clouds of known classes of shapes for partial point cloud based object recognition or query.
11. The apparatus of claim 10, wherein the comparison includes:
a multi-scale nearest neighbor (NN) query.
12. The apparatus of claim 9, wherein the program code is configured to generate the bounding box by:
generating a bounding box centered at the location of the center of the point cloud mass.
13. The apparatus of
16. The apparatus of
18. The program product of claim 17, wherein the program code is further configured to:
compare the similarity between the TMSD of the point cloud and TMSDs of other point clouds of known classes of shapes for partial point cloud based object recognition or query.
19. The program product of claim 18, wherein the comparison includes:
a multi-scale nearest neighbor (NN) query.
20. The program product of claim 17, wherein the program code is configured to generate the bounding box by:
generating a cubic bounding box centered at the location of the center of the point cloud mass,
wherein one side of the box bounding the point cloud is at its longest semi-axis.
This application is a continuation of U.S. application Ser. No. 15/190,772, entitled “Tchebichef Moment Shape Descriptor for Partial Point Cloud Characterization,” filed on Jun. 23, 2016, which claims the benefit of and priority to U.S. Provisional Application Ser. No. 62/184,289, entitled “Tchebichef Moment Shape Descriptor for Partial Point Cloud Characterization,” filed on Jun. 25, 2015, the entirety of which is incorporated by reference herein.
The invention described herein may be manufactured and used by or for the Government of the United States for all governmental purposes without the payment of any royalty.
The present invention generally relates to feature recognition and extraction from point cloud data.
During the last decade, some advanced three-dimensional (3D) sensors, such as light detection and ranging (LIDAR) sensors, started appearing in various applications. Even though individual devices can have different designs, they usually provide 3D point clouds or gray/color scaled depth images of objects from a distance. With a quick accumulation of such data, there is a need in the art to study compact and robust shape description models for content-based information retrieval (CBIR) applications. However, these sensor outputs are generally not as good and complete as traditional 3D shape data of dense point clouds or watertight meshes generated by full-body laser scanners or graphics software. Instead, the sensor outputs provide partial views of 3D objects at a specific viewing angle. When human targets are involved, there are often self-occlusions that break a body's point cloud into random and disjoint patches. Low-resolution settings typically seen in standoff sensing systems further degrade meaningful point connectivity. These problems pose significant challenges for CBIR systems because such shape degeneracy and sparsity make feature extraction and representation difficult. Many existing 3D descriptors may not be applicable or suitable under this circumstance. For example, without a smooth dense point cloud, it would be difficult to acquire stable first order (surface normal) and second order (surface curvature) geometric properties.
Accordingly, there is a need in the art for feature identification and extraction from low-resolution, partial point cloud data generated by mobile and/or standoff sensors.
When three dimensional (3D) sensors such as light detection and ranging (LIDAR) are employed in targeting and recognition of human action from both ground and aerial platforms, the corresponding point clouds of body shape often comprise low-resolution, disjoint, and irregular patches of points resulting from self-occlusions and viewing angle variations. Many existing 3D shape descriptors designed for shape query and retrieval are unable to work effectively with these degenerated point clouds because of their dependency on dense and smooth full-body scans. Embodiments of this invention provide a new degeneracy-tolerant, multi-scale 3D shape descriptor based on the discrete orthogonal Tchebichef moment as an alternative for low-resolution, partial point cloud representation and characterization.
Embodiments of the invention utilize a Tchebichef moment shape descriptor (TMSD) in human shape retrieval. These embodiments were verified using a multi-subject pose shape baseline, which provided simulated LIDAR captures at different viewing angles. Some embodiments additionally utilized a voxelization scheme that is capable of achieving translation, scale, and resolution invariance, which is less of a concern in traditional full-body shape models but is a desirable requirement for meaningful partial point cloud retrievals.
Validation experimentation demonstrated that TMSD performs better than contemporary methods such as the 3D discrete Fourier transform (DFT) and is at least comparable to other contemporary methods such as 3D discrete wavelet transforms (DWT). TMSD proved to be more flexible in multi-scale construction than 3D DWT because it does not have the restriction of dyadic sampling. The validation experiments were designed as single-view nearest neighbor (NN) queries of human pose shape using a newly constructed baseline of partial 3D point clouds, captured through biofidelic human avatars of individual human volunteers performing three activities—jogging, throwing, and digging. The NN query measures the similarity between the query pose shape's descriptor and the descriptors of other shapes in the pose shape baseline. The baseline provides a geometric simulation of LIDAR data at multiple viewing angles, organized into two subsets of horizontal (0 degree) and vertically-slant (45 degrees) elevation angles. Each subset consisted of more than 5,500 frames of point cloud patches obtained at different azimuth angles, grouped into more than 200 pose shape classes according to the action pose segmentation and azimuth angle. The construction of this baseline offered the unique advantage of performance evaluation at a full range of viewing angles. The validation experimentation demonstrated that TMSD maintains consistent performance under different elevation angles, which may have particular significance for aerial platforms.
Complementary to TMSD, a new voxelization scheme was also designed to assist in providing translation, scale, and resolution invariance. The inclusion of scale and resolution normalization in embodiments of the invention distinguishes these embodiments from many contemporary 3D shape search methods. The majority of contemporary methods deal only with full-body models, in which a complete surface, rather than individual patches and their spatial relationships, defines shape similarity. Therefore, rotational invariance is the main concern of these models. In the case of partial point clouds, however, rotational invariance is meaningless because the point clouds are viewing angle dependent; instead, scale and resolution differences are the important variations.
Embodiments of the invention employ a method of characterizing low-resolution partial point clouds for object query or recognition. A partial point cloud representation of an object is received. Zero and first order geometric moments of the partial point cloud are computed. A location of the center of the point cloud mass is computed using the zero and first order geometric moments. A cubic bounding box is generated centered at the location of the center of the point cloud mass, with one side of the box bounding the point cloud at its longest semi-axis. The bounding box is divided into a three dimensional grid. A normalized voxel mass distribution is generated over the three dimensional grid. Tchebichef moments of different orders are calculated with respect to the voxel mass distribution in the grid. The low-order moments are collected to form the 3D Tchebichef Moment Shape Descriptor (TMSD), a compact, one-dimensional numerical vector that characterizes the three-dimensional global shape pattern of the point cloud. Object query or recognition may then be performed by comparing the similarity between the TMSD of the point cloud and the TMSDs of other point clouds of known classes of shapes.
Additional objects, advantages, and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with a general description of the invention given above, and the detailed description given below, serve to explain the invention.
It should be understood that the appended drawings are not necessarily to scale, presenting a somewhat simplified representation of various features illustrative of the basic principles of the invention. The specific design features of the sequence of operations as disclosed herein, including, for example, specific dimensions, orientations, locations, and shapes of various illustrated components, will be determined in part by the particular intended application and use environment. Certain features of the illustrated embodiments have been enlarged or distorted relative to others to facilitate visualization and clear understanding. In particular, thin features may be thickened, for example, for clarity or illustration.
Feature-based descriptors 10 may be built upon three-dimensional (3D) spatial relationships, surface geometry, and transform coefficients as illustrated in
Spatial relationship descriptors usually take the form of a density distribution of pairwise spatial relationships among surface sampling points, unrelated to any global partition. One contemporary approach is the 3D shape distribution, which can be constructed from pairwise Euclidean distances or angles among surface sample points. Isometric invariant geodesic distance has also been introduced to produce a probabilistic shape description.
These contemporary descriptors of spatial relationships usually tolerate degeneracy and sparsity, and hence may be applicable to point cloud patches. However, while applicable, they are not well suited for feature recognition and extraction from point cloud patch data. Since every bin is equally important, a tradeoff must be made between descriptor size and performance. Moreover, a large number of histogram bins or a refined partition scheme can result in undesirably high dimensionality. Even though data reduction techniques such as Principal Component Analysis (PCA) have been used to reduce the dimensionality, it is very difficult to acquire datasets that are sufficiently large to achieve consistency. Thus, the data reduction outcome is tied to a specific dataset and is not scalable.
Surface geometry 20 descriptors are generated from local geometric properties, such as radial distance (zero order), surface normal (first order), and surface curvature (second order). These local geometric properties can form both global 12 and local 14 shape descriptors, depending on whether they are aggregated into a global partition framework or collected as a bag of features.
The richness of surface geometry has given rise to many feature representation methods. The extended Gaussian image (EGI) records the variation of surface normal orientation and maps it to a unit Gaussian sphere. A shape index histogram is a more generalized form of surface curvature representation in which each bin of the histogram represents the aggregation of a specific type of elementary shape for approximating local surfaces over a Gaussian sphere. A contemporary local descriptor is the spin image, which defines a local surface around a key point by the distances from the points in the key point's neighborhood to a tangent plane and normal vector at the key point. Other contemporary methodologies include a probabilistic description of surface geometry around uniformly sampled surface points, made from a nonparametric Gaussian kernel density estimate (KDE). Diffusion geometry may be utilized in non-rigid shape retrieval due to the isometry invariance of the diffusion distance and the robustness of the heat kernel to small surface perturbations. Finally, the histogram of oriented gradients (HOG) has been extended to a 3D spatial grid for shape representation.
A common constraint or deficiency in these surface geometry based descriptors is that they generally require a stable and smooth local surface approximation, which is difficult to obtain from degenerated and sparse point cloud patches. Moreover, it is hard to identify any meaningful spatial extremity, maxima of curvature, and inflection point to use as a key point. Sampling may not always work well, either.
Transform coefficient based descriptors are created by decomposing (projecting) a global image or shape function onto a set of new, usually orthogonal, basis functions. The low-order projection coefficients are usually collected to form a descriptor because the basis functions are designed to pack the main pattern or energy into a low-order subspace.
In contrast to the heuristic nature of many shape descriptors, orthogonal transform-based descriptors are mathematically sound and tight, because of orthogonality, completeness, and consistency. These properties provide some significant advantages that are otherwise not available in the aforementioned descriptors: 1) no redundancy in shape features, 2) capability of exact reconstruction or approximation with known cutoff error, 3) distance preservation in an embedded subspace, which is critical for multi-scale nearest neighbor (NN) query, and 4) better scalability and feature alignment due to fixed basis functions, even though this benefit is not clear-cut because the fixed basis may limit expressiveness.
The orthogonal transform descriptors may be further divided into three subgroups of Fourier 22, wavelet 24, and moment 26, according to the family of basis functions used. In the Fourier group, the spherical harmonics descriptor (SHD) is the most representative. It is essentially a 3D Fourier transform of a shape function defined over a set of concentric spheres in spherical coordinates. The appeal of SHD is its rotational invariance for watertight shapes. However, this is irrelevant to viewing angle dependent point cloud patches. Moreover, realization of SHD on point clouds may encounter issues such as discretization error and a non-uniform spherical surface grid. Therefore, a more applicable choice is the 3D discrete Fourier transform (DFT) descriptor 22, resulting from the sampled transform of a shape function over a discrete finite 3D grid. Results from embodiments of the invention are compared with 3D DFT below.
Compared to the 3D Fourier transform 22, there are fewer applications of the wavelet transform 24 in 3D shape retrieval, probably because many wavelets are not rotation invariant. A few exceptions are a rotation-invariant spherical wavelet transform applied to a sampled spherical shape function and an isometry-invariant wavelet shape signature based on a spectral graph wavelet defined over the Eigenspace of the Laplace-Beltrami (LB) operator.
Moments 26 were first introduced to 2D image analysis in the form of geometric moment invariants. The majority of follow-on research on moments focused on various classical orthogonal polynomial families. These polynomial families are generally divided into a continuous group and a discrete group. The continuous orthogonal moments are mostly 2D radial kernel based (rotation-invariant), including 2D Zernike moments, pseudo-Zernike moments, and Fourier-Mellin moments. Even though the best-performing Zernike moment descriptor has been extended to 3D shape analysis, it is not a preferable method for point cloud analysis because of its use of a spherical domain and two potential errors—a reconstruction cutoff error and an approximation error. The former is due to the infinite number of moments and the latter is due to the discrete nature of point clouds. The approximation error tends to accumulate as the moment order increases.
Discrete orthogonal moments assist in eliminating these errors. Among them, Tchebichef moments have demonstrated superior 2D image reconstruction performance compared with Zernike moments. However, Tchebichef moments have not been applied in the 3D domain, likely because Tchebichef moments are not rotation-invariant and may not be numerically stable at higher orders with a refined 3D grid. Point cloud patch data, however, is generally low-resolution and therefore does not present such issues. Embodiments of the invention utilize a new Tchebichef moment shape descriptor (TMSD) for multi-scale 3D feature representation of point cloud patches. The TMSD in some embodiments is generated from low-order 3D Tchebichef moments, which compact information on shape patterns to more easily enable a shape search in an embedded subspace. This reduced-dimension search is made possible by TMSD's property of distance preservation in the subspace, which prevents false negatives in a nearest neighbor search.
Finally, another alternative way of analyzing partial point clouds is to convert them into 2D depth images in which intensity or color scales are used to represent the z (depth) dimension. However, the validation experimentation presented below supports the proposition that the embodiments of the invention utilizing 3D-based TMSD outperform contemporary 2D-based depth image analysis for point cloud shape characterization and query.
A shape query process 30 consistent with the embodiments of the invention comprises three stages as illustrated in
Human activity analysis is one area of applicability for embodiments of the invention. Since there is little publicly available human pose shape data with sufficient anthropometric and viewing angle variations, a hybrid experimental/modeling approach was used to generate dynamic partial 3D point clouds from orthographical ray-tracing of animations of biofidelic human avatars in order to validate the embodiments of the invention in this field. The avatars and animations of actions were made from actual body scans and motion capture data of individual volunteers—5 males and 4 females.
The degeneracy of point cloud patches, such as those illustrated in
In an uncontrolled setting, the shapes of raw point cloud data are usually not translation and scale normalized. Even for the full-scale baseline data, the scale may not be controlled exactly due to the initial uncalibrated rough positioning of the simulated detector array during the data capturing process. There are also body size differences among human subjects. In addition, there is another resolution type variation in the form of varying global density among different sets of point cloud captures because of different sensor or mesh resolutions. All three variations are also present in real-world 3D sensor data. In order to test the voxelization and normalization scheme in the embodiments of this invention, four subjects, two for each gender, were selected from nine baseline subjects to produce similar types of point clouds at different scales of 75%, 50%, 25%, and 6% of the original detector size, as shown in
In moment-based 2D image analysis, there are two general approaches to handling translation and scale invariance issues. The first approach is a direct normalization of the data, and the second is a development of translation and scale invariants of moments. The direct normalization typically uses the zero and first order geometric moments to move the object origin to its center of mass and readjust its size (mass) to a fixed value. The concept of moment invariants was first introduced for 2D geometric moments in the form of ratios of central moments. They were later utilized in the derivation of invariants for other orthogonal moments. The main advantage of direct data normalization is that it is a preprocessing step unrelated to the descriptors; thus, the descriptors are not altered to achieve invariance. However, direct data normalization introduces a small scaling approximation error. Alternatively, moment invariants can avoid scaling errors, but they no longer possess the properties of orthogonality and finite completeness.
Considering the extra computational burden of invariants and the need for resolution normalization, a new 3D voxelization and direct normalization scheme utilized by embodiments of the invention is proposed and referred to as Proportional Grid of Voxelization and Normalization (PGVN). PGVN is a direct normalization but not a 3D extension of the aforementioned 2D methods. PGVN consists of a voxelization with a one-side bounding box originated at the center of mass and a normalization of the total point cloud mass to a fixed value. Denoting a point cloud as $\{pt_i \mid 1 \le i \le N_{pt}, N_{pt} \in \mathbb{N}\}$ and a grid of $N \times N \times N$ cubes as $\{C_{x,y,z} \mid 1 \le x, y, z \le N\}$, where $C_{x,y,z}$ represents the collection of points within the cube at $(x,y,z)$, the voxelization and normalization with respect to the simulated sensor reference system is presented in flowchart 40 in
The zero and first order geometric moments are computed in block 42 by setting a unit mass for each point at $(x_i^{cam}, y_i^{cam}, z_i^{cam})$. Here, the superscript 'cam' represents the simulated sensor reference system. The location of the center of the point cloud mass, $(x_c^{cam}, y_c^{cam}, z_c^{cam})$, is computed in block 44 using the results from block 42. The semi-axis length $b_x$ is found in block 46 with respect to the origin at $(x_c^{cam}, y_c^{cam}, z_c^{cam})$: $b_x = \max\{|x_i^{cam} - x_c^{cam}|, |y_i^{cam} - y_c^{cam}|, |z_i^{cam} - z_c^{cam}|\}$, $1 \le i \le N_{pt}$. A bounding box of size $2b_x \times 2b_x \times 2b_x$ is created in block 48, centered at $(x_c^{cam}, y_c^{cam}, z_c^{cam})$, and divided into an $N \times N \times N$ grid; $N$ is usually an even number. Finally, a normalized voxel mass distribution $f(x,y,z)$ is created in block 50 over the grid, with the total mass set to a constant $\beta$:
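(The equation images from the original document are not reproduced in this text; here and below, the displayed equations are reconstructions based on the surrounding definitions and the standard Tchebichef moment literature.) With unit point mass and the total mass normalized to $\beta$, Equation (1) takes the form

$$f(x,y,z) = \beta\,\frac{|C_{x,y,z}|}{N_{pt}}, \qquad \sum_{x,y,z} f(x,y,z) = \beta. \tag{1}$$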
The moments computed with respect to $f(x,y,z)$ and the PGVN grid are translation, scale, and resolution invariant. The translation invariance is achieved by co-centering point clouds at their mass centers. The one-side bounding box set in blocks 46 and 48 normalizes the size of point clouds relative to the common PGVN grid reference system. Coupled with the scale normalization, block 50 accomplishes the resolution invariance by introducing a relative voxel mass distribution, $f(x,y,z)$, against a constant total mass value of $\beta$. In an illustrative embodiment, $\beta$ values (e.g., 20,000 for a $64 \times 64 \times 64$ grid) were chosen to make $f(x,y,z)$ fall into the range of the MATLAB color map for easy visualization purposes.
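For concreteness, blocks 42 through 50 translate directly into a few lines of array code. The following NumPy sketch is an illustrative reading of the PGVN scheme, not code from the patent, and all names are hypothetical:

```python
import numpy as np

def pgvn_voxelize(points, N=64, beta=20000.0):
    """PGVN sketch: center at the mass center, bound with a one-side
    cubic box, and normalize the total voxel mass to a constant beta."""
    # Blocks 42-44: zero/first order geometric moments -> center of mass
    center = points.mean(axis=0)              # unit mass per point
    # Block 46: longest semi-axis relative to the mass center
    b = np.abs(points - center).max()
    # Block 48: a 2b x 2b x 2b box divided into an N x N x N grid
    idx = np.floor((points - center + b) / (2.0 * b) * N).astype(int)
    idx = np.clip(idx, 0, N - 1)              # points on the far faces
    # Block 50: normalized voxel mass distribution with total mass beta
    f = np.zeros((N, N, N))
    np.add.at(f, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return f * beta / len(points)
```

Because the total mass is renormalized to $\beta$ regardless of the point count, captures of the same shape at different scales and densities map to comparable grids, which is what makes the subsequent moments scale and resolution invariant.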
Additionally, for the purpose of performance comparison between the illustrated embodiment of 3D-based TMSD and contemporary 2D-based depth image analysis, block 50 of flowchart 40 was changed to mark the voxel occupancy distribution $f_B(x,y,z)$ over the grid:
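$$f_B(x,y,z) = \begin{cases} 1, & |C_{x,y,z}| > 0, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$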
where $|\cdot|$ represents the cardinality of a set, i.e., the number of points in cube $C_{x,y,z}$. Equation (2) allows for a direct performance comparison between the 3D descriptors of a binary voxelization and the 2D descriptors of a depth image that is converted directly from the same binary voxelization.
A moment can be defined as a projection of a real function $f$ onto a set of basis (kernel) functions $\psi = \{\psi_i \mid i \in \mathbb{N}\}$ as:

$$\mu_i = \mathcal{M}[f, \psi_i] = \langle f, \psi_i \rangle, \tag{3}$$

where $\mu_i$ is the i-th order moment and $\mathcal{M}$ is the moment functional defined by the inner product $\langle f, \psi_i \rangle$. In some embodiments, the basis function set $\psi$ may span the inner product space, i.e., $\psi$ forms a complete basis. Another desirable property of basis functions is orthonormality.
Discrete Tchebichef polynomials belong to the Hahn class of discrete orthogonal polynomials. The n-th order discrete Tchebichef polynomial, $t_n(x)$, can be expressed in the form of a generalized hypergeometric function ${}_3F_2(\cdot)$ as:
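$$t_n(x) = (1-N)_n \; {}_3F_2(-n, -x, 1+n;\, 1, 1-N;\, 1), \tag{4}$$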
where $n, x = 0, 1, \ldots, N-1$, and $(a)_k$ is the Pochhammer symbol given by
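$$(a)_k = a(a+1)\cdots(a+k-1) = \frac{\Gamma(a+k)}{\Gamma(a)}, \tag{5}$$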
and $\Gamma(a) = (a-1)!$ is the Gamma function. In the illustrated embodiment, $N$ is the size of either a 2D ($N \times N$) depth image or a 3D ($N \times N \times N$) voxelization grid, and $x$ corresponds to one of the grid coordinate variables. The basis function set $\{t_n(x)\}$ satisfies a finite orthogonality relation over the discrete points of $x$:
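$$\sum_{x=0}^{N-1} t_n(x)\, t_m(x) = \rho(n,N)\,\delta_{nm}, \tag{6}$$

where $\delta_{nm}$ denotes the Kronecker delta.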
Here $\rho(n,N)$ is a normalization function that can be used to create orthonormality as:
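$$\rho(n,N) = \frac{N(N^2-1)(N^2-2^2)\cdots(N^2-n^2)}{2n+1}. \tag{7}$$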
Dividing $t_n(x)$ by $\beta(n,N) = \sqrt{\rho(n,N)}$, the order-scale normalized Tchebichef polynomials may be obtained as:
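$$\tilde{t}_n(x) = \frac{t_n(x)}{\beta(n,N)} = \frac{t_n(x)}{\sqrt{\rho(n,N)}}. \tag{8}$$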
The orthonormality resulting from Equation (8) removes large-scale fluctuations at different orders of the Tchebichef polynomials, and $\{\tilde{t}_n(x)\}$ can be computed efficiently using the recurrence relationships associated with orthogonal polynomials. Taking $\{\tilde{t}_n(x)\}$ as the basis set and applying the discrete form of Equation (3), an individual discrete 3D Tchebichef moment of order $(n+m+l)$ for the voxel mass distribution $f(x,y,z)$ over an $N \times N \times N$ grid can be defined as:
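$$T_{nml} = \sum_{x=0}^{N-1}\sum_{y=0}^{N-1}\sum_{z=0}^{N-1} \tilde{t}_n(x)\,\tilde{t}_m(y)\,\tilde{t}_l(z)\, f(x,y,z). \tag{9}$$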
In Equation (9), the grid reference origin is the grid's back, bottom-left corner. There are in total $N^3$ moments $T_{nml}$, with a maximum order of $3 \times (N-1)$. Among them, a small subset consisting of the first $R$-th order moments, $R \ll N^3$, is used to form the 3D Tchebichef Moment Shape Descriptor (TMSD):
$$\mathrm{TMSD} = [T_{001}, T_{010}, T_{100}, \ldots, T_{nml}, \ldots, T_{R00}]^T, \quad 0 < n+m+l \le R. \tag{10}$$
Excluding the constant zero-order term, if $R < N$, the dimension of TMSD is
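$$\binom{R+3}{3} - 1 = \frac{(R+1)(R+2)(R+3)}{6} - 1,$$

since there are $\binom{R+3}{3}$ non-negative integer triplets $(n,m,l)$ with $n+m+l \le R$. For $R = 16$ this gives $969 - 1 = 968$, matching the moment count cited below.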
The reverse process of Equation (9) reconstructs the original point cloud voxelization from its moments:
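$$f(x,y,z) = \sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\sum_{l=0}^{N-1} T_{nml}\,\tilde{t}_n(x)\,\tilde{t}_m(y)\,\tilde{t}_l(z). \tag{11}$$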
The low-order descriptor TMSD is an approximation of the general pattern of point cloud patches in an embedded subspace of lower dimension. The extent of dimension reduction brought by the descriptor can be very significant. For example, a voxel-based model of point clouds with N=64 could have as many as 262,144 voxels, whereas an approximation using a TMSD of R=16 requires only 968 moment terms. More importantly, this orthogonal approximation decouples and compacts the spatially correlated point distribution into the low-order modes determined solely by the polynomial basis {{tilde over (t)}n(x)}. The process of decoupling, alignment, and compacting of pattern information assists in overcoming the exponential increase in resource requirements related to dimensionality. It enables pose shape queries through the embedded orthogonal domain, which would be otherwise unrealistic or ineffective in the original voxel domain.
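To make the construction concrete, the following Python fragment sketches Equations (8) through (10). It evaluates the orthonormal polynomials with the standard three-term recurrence from the Tchebichef moment literature (the recurrence itself is not reproduced in the patent text) and projects the voxel grid onto the separable 3D basis; the element ordering of the output vector is illustrative rather than exactly that of Equation (10):

```python
import numpy as np

def tchebichef_basis(N, R):
    """Orthonormal discrete Tchebichef polynomials t~_n(x) for n = 0..R
    on x = 0..N-1, built with the classical three-term recurrence
    (assumes R < N)."""
    x = np.arange(N, dtype=float)
    T = np.zeros((R + 1, N))
    T[0] = 1.0 / np.sqrt(N)
    if R >= 1:
        T[1] = (2.0 * x + 1.0 - N) * np.sqrt(3.0 / (N * (N**2 - 1)))
    for n in range(2, R + 1):
        a = np.sqrt((4.0 * n**2 - 1.0) / (N**2 - n**2))
        a1, a2 = (2.0 / n) * a, ((1.0 - N) / n) * a
        a3 = -((n - 1.0) / n) * np.sqrt((2.0 * n + 1.0) / (2.0 * n - 3.0)) \
             * np.sqrt((N**2 - (n - 1)**2) / (N**2 - n**2))
        T[n] = (a1 * x + a2) * T[n - 1] + a3 * T[n - 2]
    return T

def tmsd(f, R=16):
    """Collect the low-order 3D Tchebichef moments T_nml, 0 < n+m+l <= R,
    of a voxel mass grid f (Equations (9) and (10))."""
    N = f.shape[0]
    T = tchebichef_basis(N, R)
    # Separable projection along each axis, as in Equation (9)
    M = np.einsum('nx,my,lz,xyz->nml', T, T, T, f)
    return np.array([M[n, m, l]
                     for n in range(R + 1)
                     for m in range(R + 1)
                     for l in range(R + 1)
                     if 0 < n + m + l <= R])
```

For the $N = 64$, $R = 16$ configuration used below, this sketch returns a 968-component vector.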
For those PGVN 2D depth images, the corresponding 2D Tchebichef moments are defined over the $N \times N$ grid as:
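$$T_{nm} = \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} \tilde{t}_n(x)\,\tilde{t}_m(y)\, I(x,y), \tag{12}$$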
where $I(x,y)$ is given by an orthographical projection and the grayscale conversion of the binary voxelization in Equation (2) onto the grid's $(x,y)$ plane. The corresponding 2D Tchebichef Moment Image Descriptor (TMID) is formed in a similar way to Equation (10) by collecting the first $R$-th order moments.
Replacing $\tilde{t}_n(x)$ in Equation (9) with the familiar DFT basis of $e^{-i 2\pi n x / N}$ (up to a normalization convention), the 3D DFT can be expressed as:
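$$F_{nml} = \sum_{x=0}^{N-1}\sum_{y=0}^{N-1}\sum_{z=0}^{N-1} f(x,y,z)\, e^{-i 2\pi (nx + my + lz)/N}. \tag{13}$$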
The shape descriptor is formed similarly to TMSD using the low-order transform coefficients. However, in this case the norms of the coefficients, $\|F_{nml}\|$, are used in place of the actual complex numbers.
Unlike the single closed-form basis set of Tchebichef moments and that of DFT, there are many basis families for the wavelet transform. Even though most of them do not have analytical representations, each can be characterized generally as a set of basis functions generated by scaling and translating its basic mother wavelet $\psi(x)$ as
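$$\psi_{a,\tau}(x) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{x - \tau}{a}\right), \tag{14}$$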
where $a$ and $\tau$ are scaling and translation factors, respectively. Three types of wavelet filters have been explored for embodiments of the invention: Haar (db1), Daubechies (db4), and symlet (sym4). They were chosen mainly for their fast band-pass filter bank implementations, which are desirable for efficient analysis. Among the three, Haar plays the role of a performance baseline. Daubechies is the most widely used wavelet family but is asymmetric. Since symmetry is a relevant pattern in shape analysis, the near-symmetric symlet family was also included. Although some other families have similar properties, the three selected wavelet families are representative and sufficient for evaluating the performance of the wavelet-based approach.
More specifically, for the efficient filter bank implementation with dyadic sampling, denoting the level index as $j \in \mathbb{Z}$, $0 \le j < \log_2 N$, and the spatial index at level $j$ as $k \in \mathbb{Z}$, $0 \le k < 2^j$, there is a set of orthogonal scaling function bases, $\varphi_{j,k}(x) = 2^{j/2}\,\varphi(2^j x - k)$, which spans the approximation subspace $V_j = \mathrm{span}\{\varphi_{j,k}(x)\}$, and a set of orthogonal wavelet function bases, $\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k)$, which spans the details subspace $W_j = \mathrm{span}\{\psi_{j,k}(x)\}$. Therefore, at a specific approximation level $j_0$, the entire domain space is spanned by $V_{j_0} \oplus W_{j_0} \oplus W_{j_0+1} \oplus \cdots$.
For the grid size $N = 16$, 32, or 64, the value of $j_0$ is set accordingly to obtain an $8 \times 8 \times 8$ approximation array, which is slightly larger than the size of TMSD at $R = 12$.
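As an illustration of this descriptor construction (not code from the patent), a 3-level sym4 decomposition of a $64^3$ grid can be obtained with the PyWavelets package, whose wavedecn routine returns the approximation array as its first element:

```python
import numpy as np
import pywt

f = np.random.rand(64, 64, 64)  # stand-in for a PGVN voxel mass grid

# 3-level 3D decomposition; 'periodization' halves each axis per level,
# so the approximation array is 8 x 8 x 8 = 512 components.
coeffs = pywt.wavedecn(f, wavelet='sym4', level=3, mode='periodization')
descriptor = coeffs[0].ravel()  # the 3D DWT shape descriptor, shape (512,)
```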
The single-view, multi-scale NN query of a pose shape is implemented in embodiments of the invention as a k-NN query, which returns the query shape's top k ranked nearest neighbors in the aforementioned pose shape baseline. More specifically, the ranking is based on the similarity (i.e., distance) between the query pose shape's descriptor and the descriptors of other shapes in the pose shape baseline. It is conducted in an embedded lower-order subspace of the pose shapes, because the full order descriptors represent the complete pose shapes in the form of PGVN voxel models of point cloud patches. In a general-purpose CBIR system, a content data depository takes the place of the pose shape baseline and is populated with the point clouds of the objects of concern.
This subspace k-NN query strategy grows out of the necessity of working around the high-dimensionality issue. The high dimensionality not only makes the distance very expensive to compute but also may render the distance in the original $N \times N \times N$ voxel space meaningless under some circumstances, for example, if the data points are drawn from independent and identical distributions. Thus, the k-NN query becomes more challenging than typical class-based pattern recognition. The latter relies on inter-class distances, which often have better pair-wise stability than the intra-class distances that the former has to deal with.
Moreover, unlike pattern recognition, where users may expect a certain level of false positives and false negatives, users of a CBIR system, for example Google® Search, have a much lower tolerance for false negatives than for false positives. They expect, at the least, to find the nearest neighbors in the search returns. Therefore, it is not desirable to overestimate distance in the embedded space such that a potentially qualified NN shape is falsely dismissed. This requirement can be met if a descriptor and its distance measure satisfy the lower bounding distance condition.
Let $d_F(\cdot,\cdot)$ be the distance function in an embedded descriptor space and $d_O(\cdot,\cdot)$ be the distance function in the original pose shape space. If $s_1$ and $s_2$ denote the shape descriptors of pose shapes $o_1$ and $o_2$, respectively, and $s_1^l$ and $s_2^l$ denote the truncated, lower-order versions of $s_1$ and $s_2$, respectively, then the semantics of the multi-scale lower bounding distance condition can be expressed as:
$$d_F(s_1^l, s_2^l) \le d_F(s_1, s_2) \le d_O(o_1, o_2). \tag{15}$$
For 3D TMSD, Equation (15) can be proven with the Euclidean distance based on the orthonormality property. In the illustrated embodiment, the Manhattan distance was used because it is more efficient and behaves better than the Euclidean distance under high dimensionality. It also lower-bounds the Euclidean distance.
Equation (15) cannot prevent false positives. However, this is much less of a concern in practice because TMSD has excellent energy compacting power and its lower-order terms appear to capture most of the intrinsic dimensions, as illustrated below in the experimental results.
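A minimal sketch of such a subspace query, assuming the descriptors are stored as rows of a NumPy array (all names hypothetical):

```python
import numpy as np

def subspace_knn(query_desc, baseline_descs, k=5, d=968):
    """Rank baseline shapes by Manhattan (city block) distance over the
    first d descriptor components (d = 968 corresponds to R = 16)."""
    diffs = np.abs(baseline_descs[:, :d] - query_desc[:d])
    dists = diffs.sum(axis=1)
    return np.argsort(dists)[:k]  # indices of the k nearest pose shapes
```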
Six types of benchmark performance experiments were conducted using the subspace k-NN query with various sets of descriptors computed from PGVN baseline pose shapes, except for experiment 1. They are: 1) reconstruction of PGVN pose shapes from 3D TMSD, 2) evaluation of the effect of different descriptor orders and grid sizes on the retrieval performance of 3D TMSD, 3) a test of the 3D-outperforms-2D hypothesis using 2D TMID and binary 3D TMSD, 4) a performance comparison between the 3D TMSD, 3D DFT, and 3D DWT descriptors, 5) an evaluation of the effect of viewing angle, and 6) an evaluation of scale and resolution normalization. To make the charts less crowded, only the results for the zero elevation angle are presented for experiments 2, 3, and 4; the results for the 45 degree elevation angle are similar for these three experiments.
Table 1 lists the configuration parameters for the descriptor sets used in the experiments, except for the viewing angle already specified above. There are four common parameters: shape descriptor type (SD), descriptor order (R), grid size (N), and elevation angle (EL). Another special parameter is wavelet type (WL). A single set of descriptors has a unique combination of these configuration values.
TABLE 1
Configuration Matrix of Descriptor Dataset

| Parameters            | Values                                      |
|-----------------------|---------------------------------------------|
| Descriptor Type (SD)  | 2D TMID, 3D TMSD, 3D B-TMSD, 3D DFT, 3D DWT |
| Descriptor Order (R)* | 4, 6, 8, 12, 16, 20, 24                     |
| Grid Size (N)         | 16, 32, 64                                  |
| Elevation Angle (EL)  | 0, 45                                       |
| Wavelet Type (WL)†    | db1, db4, sym4                              |

*Not applicable to wavelet analysis.
†Only applicable to wavelet analysis.
To evaluate retrieval performance, the pose shapes are categorized into key pose shape classes. This is done by segmenting an action into several predefined consecutive phases. For example, a throwing action consists of three phases—wind (hand holds backward), swing (hand swings over the head), and throw (hand stretches out forward). A class of pose shapes is defined as the collection of the pose shapes of all subjects within the same action phase at a specific viewing angle. The resulting dataset and class labels are treated as the ground truth for performance evaluation. The retrieval performance measure used in these experiments is the averaged precision-recall (PR) curve, in which the precision is the average interpolated precision over all the assessed queries at 11 recall values evenly spaced from 0% to 100%. In all the following experiments except for the scale invariance tests, the pose shape being queried is always present in the descriptor dataset and hence is the first-ranked return. Thus, the interpolated precision is 1 at the recall value of 0.
The PR curve is widely used in the performance evaluation of CBIR systems, where each class size is very small compared to the size of the search domain. In the embodiments of the invention, and more specifically the illustrated embodiment, the ratio is on the order of less than one hundredth. The ideal case of the PR curve is when all classes have similar class sizes. Even though this ideal condition cannot be satisfied strictly due to the difference in the number of frames of individual actions, the action segmentation has reduced the discrepancy in class size. Overall, class sizes are between 10 and 64, and the PR curve is a suitable and valid performance measure.
A few PGVN pose shape reconstructions using 3D TMSDs were conducted to visualize the ability of 3D TMSD on representing degenerated point cloud patches, to confirm the soundness of our TMSD implementation, and to identify the range of moment orders for an effective pose shape query.
Another observation is the refinement of the reconstruction, progressing through the gradual concentration of voxel mass from the lower-order to the higher-order approximation. This explains the gradual reduction of some polynomial fitting errors that appear as residual voxels around the edges and corners of the PGVN grid. The mass values of the residual voxels keep decreasing as the order increases. At the maximum order $R = 189$, the mass of any residual voxel is less than $10^{-14}$, effectively zero.
For effective pose shape query, descriptors of order between 10 and 24 seem to have sufficient discriminative power. Their dimensions are below 3,000, comparable to the size of many other image or shape feature vectors. Therefore, it is the range of descriptor orders explored in later experiments. This reconstruction example demonstrated surprisingly good and robust shape compacting capability of 3D TMSD, considering the significant degeneracy in the point cloud patches.
This next set of experiments assesses the effect of different moment orders and grid sizes on the retrieval performance. It served as a learning process to find the optimal setting for those parameters.
The diminishing performance gain as the moment order increases above R=16 may indicate the loss of effectiveness of distance measures, even though increasing the moment order brings in a better shape representation as evidenced by the previous reconstruction example. Another possible cause is that the 968 dimensions corresponding to R=16 may constitute the majority of the intrinsic dimensions.
The next experiment involves pairs of a 64×64×64 binary voxelized pose shape and its corresponding depth image to validate the proposition that the embodiment of 3D-based TMSD is superior to contemporary 2D-based depth image analysis for shape representation. The latter is made by orthographically projecting the former along the z axis to an image plane parallel to the x-y plane. The intensity is proportional to $(z - z_{min})$, where $z_{min}$ is the minimum $z$ coordinate. The Tchebichef moment descriptors of the binary-voxelized shape (3D B-TMSD) and the depth image (2D TMID) are computed for the pairs of 3D and 2D objects, respectively. The performance comparison was made between 3D B-TMSDs and 2D TMIDs of similar dimensions. This procedure ensures comparable datasets and configurations for testing whether 3D outperforms 2D, by limiting the varying factor to the different z direction representation models only. The matching orders and dimensions between the 3D ($R_3$ and $D_3$, respectively) and 2D ($R_2$ and $D_2$, respectively) descriptors are shown in Table 2.
TABLE 2
Matching of Descriptor Sets between Binary Voxelization and Depth Image for Grid Size N = 64

|                              | Match Set 1 | Match Set 2 | Match Set 3 | Match Set 4 |
|------------------------------|-------------|-------------|-------------|-------------|
| 3D Order (R3)/Dimension (D3) | 6/84        | 8/165       | 12/455      | 16/969      |
| 2D Order (R2)/Dimension (D2) | 12/91       | 17/171      | 29/465      | 42/946      |
The comparisons of pose shape retrieval performance for Match Sets 1 and 4 are shown in
This next experiment was designed as a benchmark test of 3D TMSD. Among the descriptor sets of different configurations, the results of those with a grid size of N=64 are presented here. For 3D DWT, this grid size allows at least 3 levels (L=3) of wavelet decomposition to produce an approximation array of size 8×8×8=512 as the shape descriptor. Among the aforementioned three wavelets, the experiments indicate that symlet (sym4) outperforms Haar and db4. Therefore, sym4 was selected as the test wavelet for this illustrated embodiment, though other wavelets may be selected for other embodiments.
The comparable order for 3D TMSD and 3D DFT with the closest descriptor size to 3D DWT is 12; both have 50 fewer components than that of 3D DWT. The results with other combinations of configuration parameters, including 1 or 2 levels of wavelet decomposition, are similar to those presented here.
3D TMSD and 3D DWT have similar retrieval performance. However, if 3D TMSD is not limited to the descriptor size comparable to the 3D DWT approximation, the optimal TMSD order $R = 16$ may be used to obtain better performance than that of 3D DWT. This highlights that 3D TMSD is easier to scale up than 3D DWT. The dyadic sampling of 3D DWT means that the descriptor size changes by a factor of $2^3$ for each level change. For example, the number of components in 3D DWT would increase from 512, at the current decomposition level, to 4,096 at the next refined level, which is probably too large for an effective distance measurement. Therefore, 3D TMSD is a better alternative to 3D DWT for pose shape query if one prefers this flexibility in order-scaling. If the basis functions for TMSD are pre-computed and saved, the time complexities of TMSD and DWT are also comparable.
In real-world applications, sensors typically do not know and cannot control the orientation of the targets. So, the assessment of retrieval performance should look into the effect of viewing angles. A descriptor is useless for sensor data exploitation if it can perform only under certain viewing angles. Unfortunately, this issue has not received much attention before. By leveraging the full range of viewing angles in the baseline, a detailed examination was conducted on 3D TMSD's performance consistency with respect to both azimuth and elevation angles.
Based on the previous experimental results, only the results for N=64 and R=16 are presented here. The other configurations have similar patterns.
Finally, the effect of action type was considered by comparing the PR curves of three action types per elevation angle, as shown in
TABLE 3
"Perfect match %/One-offset match %" in 1-NN query returns of four different scale tests (SD = 3D TMSD, EL = 0)

| Size N | Order R | 75% Scale   | 50% Scale   | 25% Scale   | 6% Scale    |
|--------|---------|-------------|-------------|-------------|-------------|
| 16     | 8       | 94.3%/99.1% | 93.2%/99.0% | 91.0%/97.6% | 64.7%/86.0% |
| 16     | 16      | 96.5%/99.4% | 95.8%/99.5% | 94.7%/98.7% | 77.2%/92.4% |
| 16     | 24      | 96.9%/99.5% | 96.2%/99.5% | 95.1%/98.9% | 79.6%/93.2% |
| 64     | 8       | 95.3%/99.2% | 94.1%/99.1% | 92.0%/98.0% | 68.5%/88.9% |
| 64     | 16      | 97.3%/99.5% | 96.6%/99.5% | 95.0%/98.8% | 80.9%/93.7% |
| 64     | 24      | 97.6%/99.6% | 97.2%/99.6% | 95.8%/99.0% | 83.9%/94.7% |
The results demonstrate that the approach can achieve almost perfect scale invariance and resolution invariance down to 25% of the full-scale point clouds. At the extremely small scale of 6%, the point clouds are roughly equivalent to a body height of 20 pixels, at which level the pose shape is hard to distinguish even by the human eye. In this case, even though the perfect match scores drop considerably, the one-offset match scores could be close to 94%, which is impressive considering the very coarse-grain nature of the point clouds at this scale/resolution. This means that pose shape searching and recognition could be performed at a very low resolution, at which the performance of existing 2D methods may degrade significantly. Therefore, these results not only demonstrate the scale and resolution invariance of the approach but also further support the proposition that the embodiment of 3D-based TMSD is superior to contemporary 2D-based depth image analysis for shape representation.
Computer 60 typically includes at least one processor 62 coupled to a memory 64. Processor 62 may represent one or more processors (e.g., microprocessors), and memory 64 may represent random access memory (RAM) devices comprising the main storage of computer 60, as well as any supplemental levels of memory, e.g., cache memories, non-volatile or backup memories (e.g., programmable or flash memories), read-only memories, etc. In addition, memory 64 may be considered to include memory storage physically located elsewhere in the computer, e.g., any cache memory in a processor, as well as any storage capacity used as virtual memory, e.g., as stored on a mass storage device 66 or another computer 68 coupled to computer 60 via a network 70.
Computer 60 also typically receives a number of inputs and outputs for communicating information externally. For interfacing with a user or operator, the computer typically includes one or more user input devices 72 (e.g., a keyboard, a mouse, a trackball, a joystick, a touchpad, a keypad, a stylus, and/or a microphone, among others). Additionally, one or more sensors 74a, 74b may be connected to computer 60, which may generate point cloud data as set out above. Computer 60 may also include a display 76 (e.g., a CRT monitor, an LCD display panel or other projection device, and/or a speaker, among others). The interface to computer 60 may also be through an external device connected directly or remotely to computer 60, or through another computer 68 communicating with computer 60 via a network 70, modem, or other type of communications device.
Computer 60 operates under the control of an operating system 78, and executes or otherwise relies upon various computer software applications, components, programs, objects, modules, data structures, etc. (e.g. 3D TMSD, PGVN voxelizations) 80. Computer 60 communicates on the network 70 through a network interface 82.
In general, the routines executed to implement the embodiments of the above described invention, whether implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions will be referred to herein as “computer program code”, or simply “program code”. The computer program code typically comprises one or more instructions that are resident at various times in various memory and storage devices in a computer, and that, when read and executed by one or more processors in a computer, causes that computer to perform the steps necessary to execute steps or elements embodying the various aspects of the invention. Moreover, while the invention has been described in the context of an application that could be implemented on fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and that the invention applies equally regardless of the particular type of computer readable media used to actually carry out the distribution. Examples of computer readable media include but are not limited to non-transitory physical, recordable type media such as volatile and non-volatile memory devices, floppy and other removable disks, hard disk drives, optical disks (e.g., CD-ROM's, DVD's, etc.), among others; and transmission type media such as digital and analog communication links.
In addition, various program code described may be identified based upon the application or software component within which it is implemented in specific embodiments of the invention. However, it should be appreciated that any particular program nomenclature is merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature. Furthermore, given the typically endless number of manners in which computer programs may be organized into routines, procedures, methods, modules, objects, and the like, as well as the various manners in which program functionality may be allocated among various software layers that are resident within a typical computer (e.g., operating systems, libraries, APIs, applications, applets, etc.), it should be appreciated that the invention is not limited to the specific organization and allocation of program functionality described herein.
Those skilled in the art will recognize that the exemplary environment illustrated in
While the present invention has been illustrated by a description of one or more embodiments thereof and while these embodiments have been described in considerable detail, they are not intended to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. The invention in its broader aspects is therefore not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the scope of the general inventive concept.