A method determines a pose of an object in a scene by determining a set of scene features from data acquired of the scene and matching the scene features to model features. When a scene feature matches one of the model features, a weighted candidate pose is generated, wherein the weight of the candidate pose is proportional to the weight of the matching model feature. Then, the pose of the object is determined from the candidate poses based on the weights.
1. A method for determining a pose of an object in a scene, comprising the steps of:
determining, from a model of the object, model features and a weight associated with each model feature, wherein the model features and the weights are learned using training data by maximizing a difference between a number of votes received by a true pose and a number of votes received by an incorrect pose, and wherein the weights are learned by solving a regularized soft constraint optimization problem;
determining, from scene data acquired of the scene, scene features;
matching the scene features to the model features to obtain a matching scene and matching model features;
generating candidate poses from the matching scene and the matching model features, wherein a weight of each candidate pose is proportional to the weight associated with the matching model feature; and
determining the pose of the object from the candidate poses based on the weights.
17. An apparatus for determining a pose of an object in a scene, comprising:
a robot arm;
a sensor arranged on the robot arm, wherein the sensor is configured to acquire scene data; and
a processor configured to
determine scene features from the scene data,
match the scene features to model features to obtain matching scene and model features, wherein each model feature is associated with a weight, wherein the model features and the weights are learned using training data by maximizing a difference between a number of votes received by a true pose and a number of votes received by an incorrect pose, and wherein the weights are learned by solving a regularized soft constraint optimization problem,
generate candidate poses from the matching scene and model features, wherein a weight of each candidate pose is proportional to the weight associated with the matching model feature, and
determine the pose of the object from the candidate poses based on the weights.
2. The method of
3. The method of
where m_r is a reference point, m_i is a feature point, n_r is the orientation at the reference point, n_i is the orientation at the feature point, d = m_i − m_r is a displacement vector between the reference point and the feature point, f_1 = ‖d‖ is a distance between the reference point and the feature point, f_2 is an angle between the displacement vector and the orientation of the reference point, f_3 is an angle between the displacement vector and the orientation of the feature point, and f_4 is an angle between the orientations of the reference point and the feature point.
4. The method of
6. The method of
8. The method of
9. The method of
picking the object out of a bin according to the pose using a robot arm.
10. The method of
11. The method of
12. The method of
13. The method of
14. The method of
15. The method of
16. The method of
18. The apparatus of
a gripper mounted on the robot arm for picking the object according to the pose.
This invention relates generally to estimating poses of 3D objects, and more particularly to estimating poses from data acquired by 3D sensors.
A frequent problem in computer vision applications is to determine poses of objects in 3D scenes from scene data acquired by 3D sensors based on structured light or time of flight. Pose estimation methods typically require identification and matching of scene measurements with a known model of the object.
Some methods are based on selecting relevant points in a 3D point cloud and using feature representations that can invariantly describe regions near the points. Those methods produce successful results when the shape of the object is detailed, and the scene measurements have a high resolution and little noise. However, under less ideal conditions, the accuracy of those methods decreases rapidly. The 3D measurements can include many hidden surfaces due to imaging from a single viewpoint with the sensor, which makes a detailed region representation unavailable. Noise and background clutter further affect the accuracy of those methods.
A set of pair features can be used for detection and pose estimation. Pairs of oriented points on a surface of an object are used in a voting framework for pose estimation, see, e.g., U.S. Publication 20110273442. Even though the descriptor associated with a pair feature is not very discriminative, that method can produce accurate results even under moderate occlusion and background clutter by accumulating measurements for a large number of pairs. That framework can benefit from a hashing and Hough voting scheme.
Other methods model 3D shapes globally by using 2D and 3D contours, shape templates, and feature histograms. In general, global methods require the object to be isolated because those methods are sensitive to occlusion. Also, changes in appearance due to pose variations necessitate the use of a large number of shape templates, which has the drawback of increased memory and processing time. A learning-based keypoint detector that uses range data to decrease processing time is described in U.S. Publication 20100278384.
The embodiments of the invention provide a method for determining poses of three-dimensional (3D) objects in a scene using weighted model features.
During online operation, scene features are determined from 3D data acquired of the scene by a 3D sensor. The scene features are matched to model features acquired during offline training. The matching generates weighted candidate poses. The weights of candidate poses are proportional to the weights of the model features that match the scene features. Then, the candidate poses are merged by clustering the poses. The pose of the object is determined according to the weights of the merged poses.
In some embodiments, the model and scene features are oriented point pair features. The invention is based on the realization that not all the pairs of points have similar discriminative power, or repeatability. In fact, certain features do not carry any relevant information for pose estimation, hence the model features can be selected sparsely and weighted according to their importance.
One example 3D sensor uses structured light generated by a projector. Other sensors, such as stereo cameras and time-of-flight range sensors are also possible. The sensor acquires 3D scene data 100, e.g., a point cloud. The 3D sensor is calibrated with respect to the robot arm. Thus, the poses of the objects estimated in a coordinate system of the 3D sensor can be transformed to a coordinate system of the robotic arm, allowing grasping and picking of the objects by controlling 80 the robotic arm according to the poses. The scene data are processed by a method 71 performed in a processor 70. The processor can include memory and input/output interfaces as known in the art.
Point Pair Features
In some embodiments, the pose estimation method uses point pair features. Each point pair feature can be defined using two points on a surface of the object, and normal directions of the surface at the points, which is called a surface-to-surface (S2S) pair feature. The feature can also be defined using a point on the surface and its normal and another point on an object boundary and direction of this boundary, which is called a surface-to-boundary (S2B) pair feature. The feature can also be defined using two points on the object boundary and their directions, which is called a boundary-to-boundary (B2B) pair feature.
As shown in the figure, the descriptor of an oriented point pair feature is F = (f_1, f_2, f_3, f_4),
where d = m_i − m_r is the displacement vector between the two points. The 4D descriptor is defined by the distance f_1 = ‖d‖ between the points m_r and m_i and the angles (f_2, f_3, f_4). The angle f_2 is between the displacement vector d and the normal vector n_r, the angle f_3 is between d and the normal vector n_i, and the angle f_4 is between the normal vectors n_r and n_i. This descriptor is pose invariant.
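As one possible illustration, a minimal sketch of computing the S2S descriptor from two oriented points follows; the function name and the use of NumPy are assumptions for illustration, not part of this disclosure.

```python
import numpy as np

def s2s_descriptor(m_r, n_r, m_i, n_i):
    """Compute the 4D surface-to-surface pair feature descriptor.

    m_r, m_i: 3D points on the object surface.
    n_r, n_i: unit surface normals at those points.
    Returns (f1, f2, f3, f4): a distance and three angles.
    """
    d = m_i - m_r                          # displacement vector
    f1 = np.linalg.norm(d)                 # distance between the points
    d_unit = d / f1 if f1 > 0 else d
    angle = lambda a, b: np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    f2 = angle(d_unit, n_r)                # angle between displacement and reference normal
    f3 = angle(d_unit, n_i)                # angle between displacement and feature normal
    f4 = angle(n_r, n_i)                   # angle between the two normals
    return f1, f2, f3, f4
```

Because the descriptor only depends on the relative geometry of the two oriented points, applying the same rigid motion to both points leaves it unchanged, which is the pose invariance noted above.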
In other embodiments, the method uses other 3D features, such as spin images or 3D scale-invariant feature transform (SIFT) features. The spin image is a surface representation that can be used for surface matching and object recognition in 3D scenes. The spin image encodes global properties of the surface in an object-oriented coordinate system, rather than in a viewer-oriented coordinate system. The SIFT features provide a set of features that are not affected by object scaling and rotation.
Pose Determination
As shown in
For the purpose of this description, the features used during online processing are referred to as scene features 120. The features learned during offline training are referred to as model features 180.
As shown in
As shown in
The candidate poses are merged 160 by clustering. If two candidate poses are closer than a clustering threshold, then the candidate poses are combined into a single candidate pose by taking a weighted sum of the two candidate poses. Clustering is repeated until no two candidate poses are closer than the clustering threshold. Then, the object pose 170 is the candidate pose that has a weight larger than a threshold.
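A minimal sketch of this weighted merging step is shown below; the pose representation (a flat parameter vector) and the Euclidean distance measure are simplifying assumptions for illustration.

```python
import numpy as np

def merge_candidate_poses(poses, weights, cluster_threshold):
    """Greedily merge candidate poses that are closer than a threshold.

    poses: list of pose parameter vectors (e.g., translation + rotation parameters).
    weights: list of candidate pose weights.
    Two close poses are combined into their weighted mean, and their weights are summed.
    """
    poses = [np.asarray(p, dtype=float) for p in poses]
    weights = list(weights)
    merged = True
    while merged:
        merged = False
        for i in range(len(poses)):
            for j in range(i + 1, len(poses)):
                if np.linalg.norm(poses[i] - poses[j]) < cluster_threshold:
                    w = weights[i] + weights[j]
                    poses[i] = (weights[i] * poses[i] + weights[j] * poses[j]) / w
                    weights[i] = w
                    del poses[j], weights[j]
                    merged = True
                    break
            if merged:
                break
    return poses, weights
```

Averaging rotation parameters directly is a simplification; in practice the rotation components would be combined on the rotation manifold, for example by quaternion averaging.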
Training
Determining Features and Weights
The pose of the object is the candidate pose with a maximal weight among all the candidate poses. The goal of the learning is to select and weight the model features to ensure that the correct pose receives more weight than other poses.
Weighting Schemes
There are several different ways of weighting the features. The simplest form assigns each model point pair a different weight, and any scene point pair feature that matches a given model point pair feature generates a candidate pose whose weight is equal to that of the matched model pair. Although this weighting scheme is very general, learning is very underdetermined due to the high-dimensional weight space.
Alternatively, we can group sets of features to have the same weights. One such strategy is weighting using quantization of the model feature descriptors. A single weight is defined for all the model point pair features that have the same quantized descriptor m. Any scene point pair feature that is matched to the same quantized descriptor generates a candidate pose whose weight is proportional to this weight. Because the point pairs are grouped into clusters that map to the same quantized descriptor, the dimension of the weight space is reduced to the number of quantization levels M.
An important advantage of this method is that it is possible to learn a weight vector that is sparse. As used herein, sparse is not a relative term. In the art of numerical analysis, sparsity refers to data where most elements are zero. In such cases, the method immediately removes any scene features mapping to a quantized descriptor with zero weight. This reduces processing time significantly.
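As a concrete illustration, a minimal sketch of descriptor quantization is shown below; the bin sizes, the flattening into a single index, and the resulting value of M are assumptions for illustration, not values taken from this disclosure.

```python
import numpy as np

def quantize_descriptor(f, dist_step=0.01, angle_step=np.deg2rad(12), n_dist_bins=40):
    """Map a pair feature descriptor F = (f1, f2, f3, f4) to an integer index m.

    f1 is a distance; f2, f3, f4 are angles in [0, pi].
    The step sizes are illustrative assumptions.
    """
    n_angle_bins = int(np.ceil(np.pi / angle_step))
    b1 = min(int(f[0] / dist_step), n_dist_bins - 1)
    b2 = min(int(f[1] / angle_step), n_angle_bins - 1)
    b3 = min(int(f[2] / angle_step), n_angle_bins - 1)
    b4 = min(int(f[3] / angle_step), n_angle_bins - 1)
    # Flatten the 4D bin index into a single integer; M = n_dist_bins * n_angle_bins**3.
    return ((b1 * n_angle_bins + b2) * n_angle_bins + b3) * n_angle_bins + b4
```

A weight vector w of length M then assigns one weight per quantized descriptor, and all model pair features mapping to the same index m share the weight w[m]; entries with zero weight can be skipped entirely during voting.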
A second grouping strategy is based on weighting model points. The weight is defined for each model point. Any scene pair that maps to either the first or the second point of the pair generates candidate poses with this weight. This approach significantly reduces the dimension of the weight space, allowing efficient learning, and directly identifies important points on the model surface.
Given data in the form of a 3D point cloud of a scene containing the object, let S be the set of all scene pair features that are determined from this scene. Without loss of generality, we use the weighting-using-quantization scheme described above. Let y ∈ SE(3) be a candidate pose. The corresponding training vector x_y ∈ R^M is given by a mapping Φ of all the scene pair features that generate the pose y, x_y = Φ_y(S). When the correspondence is clear from the context, we do not use the subscript and write x instead of x_y.
For weighting using quantization, this mapping is given by the number of times a quantized descriptor m generates the pose y:
x_m = Σ_{s ∈ S} I(h(s) = m) · I(y ∈ y(s)),
where x_m is the mth dimension of the training vector x, I ∈ {0, 1} is an indicator function, y(s) is the set of poses that the pair s generates, and h(s) is the quantized descriptor of the feature s. Note that a pair s can generate multiple candidate poses.
The weight that a pose receives from all scene pair features can be written using the linear map w^T x, where w is an M-dimensional weight vector. When w = 1_M, this function is equivalent to a uniform weighting function.
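A minimal sketch of building the training vector x_y and evaluating the linear score w^T x follows; the callback interfaces (quantize, candidate_poses_for, same_pose) are illustrative assumptions rather than functions defined in this disclosure.

```python
import numpy as np

def training_vector(scene_pairs, target_pose, quantize, candidate_poses_for, same_pose, M):
    """Build the training vector x_y for a candidate pose y (weighting by quantization).

    scene_pairs: iterable of scene pair features s.
    quantize(s): returns the quantized descriptor index h(s) in {0, ..., M-1}.
    candidate_poses_for(s): returns the set of poses y(s) generated by pair s.
    same_pose(a, b): indicator for whether two poses coincide (up to a tolerance).
    """
    x = np.zeros(M)
    for s in scene_pairs:
        m = quantize(s)
        # x_m counts how many times descriptor m generates the target pose.
        x[m] += sum(1 for y in candidate_poses_for(s) if same_pose(y, target_pose))
    return x

def pose_score(w, x):
    """Weight received by a pose: the linear map w^T x (uniform voting when w = 1_M)."""
    return float(w @ x)
```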
Learning Weights
Let I be a 3D scene and y* be the pose of the object in the scene. The goal of learning weights is to determine a non-negative weight vector that satisfies the constraints
w^T x* > w^T x,  ∀ y ≠ y*, ∀ I.  (3)
This means that for all the 3D scenes I, the true pose of the object y* should have more weight than any other pose.
To solve this problem, we use machine learning because a closed-form solution is generally unavailable. Let {(I_i, y_i^*)}_{i=1,...,N} be the training data 200 including N 3D scenes and ground truth poses of the object. For each training scene, the features are determined. The constraints given in Equation (3) might not define a feasible set. Instead, we reformulate learning the weights as a regularized soft constraint optimization problem, where R(w) is a convex regularizer on the weight vector, λ is a tradeoff between the margin and the training error, and a loss function Δ(y_i^*, y) gives a larger penalty to larger pose deviations. We also use an explicit upper bound w_u on the maximal weight of a feature dimension.
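An explicit form consistent with these terms and with the margin constraint in Equation (7), written here as a plausible reconstruction (the slack variables ξ_i and the exact indexing are assumptions), is:

```latex
\begin{aligned}
\min_{w,\;\xi} \quad & R(w) + \lambda \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & w^\top \bigl(x_i^* - x_y\bigr) \;\ge\; \Delta(y_i^*, y) - \xi_i,
  \qquad \forall\, y \ne y_i^*,\; i = 1,\dots,N, \\
& 0 \le w \le w_u \mathbf{1}_M, \qquad \xi_i \ge 0 .
\end{aligned}
```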
We use a cutting plane method to find the weight vector. The cutting-plane method iteratively refines a feasible set or objective function by means of linear inequalities, called cuts.
At each iteration k of the learning process 230, we use the previous set of weights w^{(k−1)} and solve the pose estimation problem for each scene I_i using the pose estimation method. In addition to the best pose, this method provides the set of candidate poses, which are sorted according to their weights. For all candidate poses, we evaluate a margin constraint
(w^{(k−1)})^T (x_i^* − x) ≥ Δ(y_i^*, y),  (7)
and add the most violated constraint to the selected constraint list. Let y^(1:k) and x^(1:k) be the sets of all selected poses and constraints up to iteration k. Then, the optimization problem at iteration k is the regularized soft constraint optimization above with the margin constraints restricted to the selected sets y^(1:k) and x^(1:k).
This optimization problem is convex and has a finite number of constraints, which we solve optimally. Note that the dimensionality of the training vectors can be large, e.g., M ≫ 10^5. Fortunately, the training vectors are sparse. Therefore, the optimization can be solved efficiently using convex programming solvers and utilizing sparse matrices and sparse linear algebra.
In general, the method requires fewer iterations if multiple violated constraints are added to the supporting constraint set at a given time, leading to faster operation. We usually solve three or four iterations of the cutting plane method. We initialize with a uniform voting w^{(0)} = 1_M, which significantly speeds up the convergence of the method.
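A minimal sketch of the cutting-plane loop described above is shown below; the pose estimation and solver callbacks (estimate_poses, solve_qp, loss) are illustrative assumptions, and any convex programming solver that exploits sparse constraints could be substituted.

```python
import numpy as np

def learn_weights(scenes, true_vectors, loss, estimate_poses, solve_qp, M, w_u, n_iters=4):
    """Cutting-plane learning of the feature weight vector.

    scenes: list of training scenes; true_vectors[i] is x_i^* for the ground-truth pose.
    loss(i, y): the loss Delta(y_i^*, y) for candidate pose y of scene i.
    estimate_poses(scene, w): runs weighted voting, returns a list of (pose y, vector x).
    solve_qp(constraints, w_u, M): solves the restricted convex problem, returns a new w.
    Each constraint is a tuple (x_star, x, delta) encoding w^T(x_star - x) >= delta - xi.
    """
    w = np.ones(M)                                  # start from uniform voting, w^(0) = 1_M
    constraints = []
    for _ in range(n_iters):
        for i, scene in enumerate(scenes):
            candidates = estimate_poses(scene, w)   # candidate poses under current weights
            def violation(c):
                y, x = c
                return loss(i, y) - w @ (true_vectors[i] - x)
            y_bad, x_bad = max(candidates, key=violation)
            if violation((y_bad, x_bad)) > 0:       # add the most violated margin constraint
                constraints.append((true_vectors[i], x_bad, loss(i, y_bad)))
        w = solve_qp(constraints, w_u, M)           # re-solve the restricted problem
    return w
```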
Pseudocode for the training method is shown in
Implementation Details
Training Data
Pair features are invariant to the action of the group of 3D rigid motions SE(3). However, because we only observe the scene from a single viewpoint, self-occlusions, hidden surfaces, and measurement noise play a major role in the variability of the training vectors x. This variation is largely independent of the 3D translation and of the rotation of the object along the viewing direction, whereas in general the variation is very sensitive to out-of-plane rotation angles.
Therefore, we sample a set of 3D poses {y_i^*}_{i=1,...,N} by regularly sampling along two axes of out-of-plane rotation angles on a unit sphere, and appending a random in-plane rotation angle and a translation vector. In addition, we add a few random objects to the scene to generate background clutter, and render the scene to generate the 3D point cloud.
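A minimal sketch of this pose sampling is shown below; the sampling resolution, angle parameterization, and translation range are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def sample_training_poses(n_azimuth=12, n_elevation=6, translation_range=0.1, rng=None):
    """Sample poses: regular out-of-plane rotations plus random in-plane rotation
    and translation (resolution and ranges are illustrative assumptions)."""
    rng = np.random.default_rng() if rng is None else rng
    poses = []
    for az in np.linspace(0, 2 * np.pi, n_azimuth, endpoint=False):
        for el in np.linspace(-np.pi / 2, np.pi / 2, n_elevation):
            in_plane = rng.uniform(0, 2 * np.pi)           # random in-plane rotation
            rot = R.from_euler('zyz', [az, el, in_plane])  # two out-of-plane axes + in-plane
            t = rng.uniform(-translation_range, translation_range, size=3)
            poses.append((rot.as_matrix(), t))
    return poses
```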
Alternatively, training data can be collected by scanning a real scene containing the target object using a range sensor. The true 6-degrees-of-freedom (DoF) pose of the object can be labeled manually.
Loss Function
We use a loss function
Δ(y, y_i) = 1 + λ_θ θ(y, y_i),  (11)
where θ(y, y_i) = ‖log(R_y^{−1} R_{y_i})‖ measures the rotation difference between the poses y and y_i.
Regularization
The form of the regularization function plays an important role in the accuracy and efficiency of the voting method. In some embodiments, we use a quadratic regularization function R(w) = w^T w. In other embodiments, we use an L1-norm regularizer ‖w‖_1, which sparsifies the weight vector, leading to sparse feature selection and fast operation.
Optimization of Pose Estimation Using Weighted Voting
Pose estimation is optimized using a weighted voting scheme.
As shown in
The model pair features 180 are stored in the hash table in an offline process for efficiency. The quantized descriptors serve as the hash keys. During an online process, a scene reference point is selected and paired with another scene point to form a scene point pair feature 120. Its descriptor is then used to retrieve matching model point pair features 180 using the hash table. Each of the retrieved model point pair features in the hash bin votes for an entry in the 2D accumulator space.
Each entry in the accumulator space corresponds to a particular object pose. After votes are accumulated for the sampled reference scene point by pairing it with multiple other sampled scene points, poses supported by a certain number of votes are retrieved. The process is repeated for different reference scene points. Finally, clustering is performed on the retrieved poses to collect support from several reference scene points.
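A minimal sketch of the hash-table construction and voting steps is shown below; the accumulator parameterization (model reference point index and a discretized in-plane rotation bin, as in standard pair-feature voting) and the interfaces make_pair, quantize, descriptor, reference_index, and alignment_angle are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def build_hash_table(model_pairs, quantize):
    """Offline: index model point pair features by their quantized descriptors."""
    table = defaultdict(list)
    for pair in model_pairs:
        table[quantize(pair.descriptor)].append(pair)
    return table

def vote_for_reference_point(scene_ref, scene_points, make_pair, quantize, table,
                             weights, n_model_points, n_angle_bins):
    """Online: accumulate weighted votes in a 2D accumulator for one scene reference point.

    Each accumulator entry (model point index, in-plane rotation bin) determines a
    candidate pose; weights[m] is the learned weight of quantized descriptor m.
    """
    accumulator = np.zeros((n_model_points, n_angle_bins))
    for p in scene_points:
        scene_pair = make_pair(scene_ref, p)
        m = quantize(scene_pair.descriptor)
        for model_pair in table.get(m, []):
            # alignment_angle is assumed to return the discretized in-plane rotation bin
            # that aligns the model pair with the scene pair.
            a = model_pair.alignment_angle(scene_pair) % n_angle_bins
            accumulator[model_pair.reference_index, a] += weights[m]
    return accumulator
```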
Online Learning of Weights
In one embodiment, training of the model features and their weights is performed online using the robot arm 10. The scene data including the object are acquired using a 3D range sensor. The pose estimation procedure starts with uniform weights. After estimating the pose of the object in the scene, the pose is verified by picking the object from the bin. If the object is successfully picked from the bin, then the scene is added to the training set with the estimated pose as the true pose. If the estimated pose is incorrect, then the scene data are discarded. The scene is altered, e.g., by moving objects with the robot arm, and the process is repeated. The weights of the model features are learned using the generated training set during robot operation.
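A short sketch of this self-supervised data collection loop is shown below; the robot and sensor interfaces (acquire_scene, estimate_pose, try_pick, perturb_scene) are hypothetical placeholders, not an API from this disclosure.

```python
def collect_training_data_online(acquire_scene, estimate_pose, try_pick, perturb_scene,
                                 n_scenes, uniform_weights):
    """Collect (scene, true pose) pairs by verifying estimated poses with grasps."""
    weights = uniform_weights                          # start from uniform voting
    training_set = []
    for _ in range(n_scenes):
        scene = acquire_scene()                        # 3D range scan of the bin
        pose = estimate_pose(scene, weights)           # weighted-voting pose estimate
        if try_pick(pose):                             # grasp succeeded: pose treated as true
            training_set.append((scene, pose))
        # If the pick fails, the scene data are discarded.
        perturb_scene()                                # move objects and repeat
    return training_set
```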
Application
In one application, as shown in
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Inventors: Taguchi, Yuichi; Liu, Ming-Yu; Tuzel, Oncel; Raghunathan, Arvind U.
References Cited
U.S. Pat. No. 6,173,070, Dec. 30, 1997, Cognex Corporation, "Machine vision method using search models to find features in three dimensional images."
U.S. Pat. No. 8,774,504, Oct. 26, 2011, HRL Laboratories, LLC, "System for three-dimensional object recognition and foreground extraction."
U.S. Publication 20100278384.
U.S. Publication 20110273442.
U.S. Publication 20120230592.
U.S. Publication 20130156262.
U.S. Publication 20130245828.
JPO2012023593.
WO 2010019925.