An object detection method includes an image acquisition step of acquiring an image including a target object, a layer image generation step of generating a plurality of layer images by one or both of enlarging and reducing the image at a plurality of different scales, a first detection step of detecting a region of at least a part of the target object as a first detected region from each of the layer images, a selection step of selecting at least one of the layer images based on the detected first detected region and learning data learned in advance, a second detection step of detecting a region of at least a part of the target object in the selected layer image as a second detected region, and an integration step of integrating a detection result detected in the first detection step and a detection result detected in the second detection step.
1. An object detection method comprising:
acquiring an image;
generating a plurality of layer images by enlarging or reducing the image;
detecting a first object from at least one of the plurality of layer images;
estimating a specific object based on the detected first object;
selecting at least one of the plurality of layer images based on the estimated specific object;
detecting a second object larger than the detected first object in the selected layer image;
estimating the specific object based on the detected second object; and
determining the specific object based on the specific object estimated based on the first object and the specific object estimated based on the second object.
13. An object detection apparatus comprising:
a memory; and
a processor in communication with the memory, the processor functions as:
an acquisition unit configured to acquire an image;
a generation unit configured to generate a plurality of layer images by enlarging or reducing the image;
a first detection unit configured to detect a first object from at least one of the plurality of layer images;
a first estimation unit configured to estimate a specific object based on the detected first object;
a layer limitation unit configured to limit at least one of the plurality of layer images on which detection is to be performed based on the estimated specific object;
a second detection unit configured to detect a second object larger than the first object in the limited layer image;
a second estimation unit configured to estimate the specific object based on the detected second object; and
a determination unit configured to determine the specific object based on the specific object estimated by the first estimation unit and the specific object estimated by the second estimation unit.
2. The object detection method according to
the estimating the specific object with use of the second object includes estimating a plurality of the specific objects; and
a specific object having a highest overlap ratio with the one specific object is determined, from among the plurality of the specific objects estimated with use of the second object, as the specific object.
3. The object detection method according to
4. The object detection method according to
5. The object detection method according to
6. The object detection method according to
7. The object detection method according to
8. The object detection method according to
9. The object detection method according to
10. The object detection method according to
11. The object detection method according to
12. The object detection method according to
14. A non-transitory computer readable storage medium storing a program that causes a computer to perform the object detection method according to
This application is a Continuation of U.S. application Ser. No. 13/851,837, filed Mar. 27, 2013, which claims priority from Japanese Patent Application No. 2012-082379 filed Mar. 30, 2012, which are hereby incorporated by reference herein in their entireties.
Field of the Invention
The present invention relates to a method for performing detection processing at a high speed while maintaining accuracy, and an object detection apparatus.
Description of the Related Art
As one of conventional methods to detect a target from an image, there is a method including performing detection processing with use of a model learned in advance, limiting a range of layers in which a target is searched for based on the detection result, and performing detection processing based on a more highly accurate model.
Japanese Patent No. 4498296 discusses a method including performing first detection on layer images, and performing second detection only on the same layer image detected from the first detection for a next input image.
However, the method discussed in Japanese Patent No. 4498296 limits the searched layer to the same layer, and that layer does not always have the highest possibility of detection for the next input image. Further, in a case where different models are used for the first detection and the second detection, the layer from which a target is most likely to be detected is not always the same for both detections, which leads to a reduction in detection accuracy as a whole.
According to an aspect of the present invention, an object detection method includes an image acquisition step of acquiring an image including a target object, a layer image generation step of generating a plurality of layer images by one or both of enlarging and reducing the image at a plurality of different scales, a first detection step of detecting a region of at least a part of the target object as a first detected region from each of the layer images, a selection step of selecting at least one of the layer images based on the detected first detected region and learning data learned in advance, a second detection step of detecting a region of at least a part of the target object in the layer image selected in the selection step as a second detected region, and an integration step of integrating a detection result detected in the first detection step and a detection result detected in the second detection step.
According to exemplary embodiments of the present invention, it is possible to speed up entire processing while maintaining detection accuracy.
Further features and aspects of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the attached drawings.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the invention and, together with the description, serve to explain the principles of the invention.
Various exemplary embodiments, features, and aspects of the invention will be described in detail below with reference to the drawings.
An object detection method according to a first exemplary embodiment of the present invention is a method for stably detecting a target that exists in an image. An image acquisition unit may use an image captured with use of a camera, a video camera, or a network camera, or may use an image captured and stored in advance.
The present exemplary embodiment will be described based on an example that captures an image including a person, and detects a person that a user wants to detect from the acquired image. In the present exemplary embodiment, a detection target is a human, but is not limited thereto. For example, the present invention can also be applied to detection of an animal or a plant.
As illustrated in
The image acquisition unit 101 acquires an image from a camera or among images captured in advance. The acquired image is transmitted to the feature amount generation unit 102.
The feature amount generation unit 102 generates layer images 201 by enlarging/reducing the image acquired by the image acquisition unit 101 at predetermined scales as illustrated in
The first detection unit 103 performs detection processing on the feature amount of each of the layer images 201, which is generated by the feature amount generation unit 102.
As illustrated in
As a method for detecting an object, the detection processing is performed by using a known technique such as HOG+Support Vector Machine (SVM) (reference: “Histograms of Oriented Gradients for Human Detection” written by N. Dalal and presented in Computer Vision and Pattern Recognition (CVPR) 2005), Implicit Shape Model (ISM) (“Combined Object Categorization and Segmentation with an Implicit Shape Model” written by B. Leibe and presented in European Conference on Computer Vision (ECCV) 2004), or Latent-SVM (“Object Detection with Discriminatively Trained Part Based Models” written by P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, and presented in Institute of Electrical and Electronics Engineers (IEEE) Pattern Analysis and Machine Intelligence (PAMI) volume 32, number 9, 2010). The first detection unit 103 detects a result 303 from the feature amount of the layer image 302, which is one of the layer images 201.
Further, the first detection unit 103 may also perform the detection processing for detecting a region of the target object on the image acquired by the image acquisition unit 101. In this case, the first detection unit 103 performs the detection processing with use of a known technique such as pattern matching.
The detection result 303 of the head portion surrounding region, which is detected by the first detection unit 103, is transmitted to the first estimation unit 104.
The first estimation unit 104 estimates a specific part region with use of the detection result 303 acquired from the first detection unit 103. In the present exemplary embodiment, the first estimation unit 104 estimates a head portion region as the specific part region. As used herein, the term “head portion region” indicates only a head portion, while the above-described head portion surrounding region means the region including not only a head portion but also a shoulder. In the present exemplary embodiment, the first detection unit 103 detects the head portion surrounding region, and the first estimation unit 104 estimates only the head portion. However, in a case where the first detection unit 103 detects a region other than the head portion surrounding region, the region estimated by the first estimation unit 104 is also a region other than the head portion.
In the following description, a method for estimating the head portion region will be described. As a method for estimating the head portion region, the head portion region can be calculated according to the following equation (1) with use of positional coordinates of the detection result 303 of the head portion surrounding region.
In equation (1), x1 and y1 represent upper left coordinates of the detection result 303, and x2 and y2 represent lower right coordinates of the detection result 303.
In equation (1), “A” represents a detection result expressed in the form of a matrix, which is constituted by a root filter including a region from the head portion to the shoulder, and a plurality of part filters 3031 to 3034 each expressing a part of the root filter. Further, when the detection result is converted into the matrix form, a difference is calculated between central coordinates of each of the part filters 3031 to 3034 and central coordinates of the detection result 303 at the detected position.
An x coordinate of the coordinates as the difference is normalized by a width w of the detection result region 303, and a y coordinate is normalized by a height h of the detection result 303. The normalized central coordinates x, y of the respective part filters 3031 to 3034 are expressed in the form of a matrix (a row includes the normalized coordinates of each part filter of one detection result, and a column includes each detection result), which is “A” in equation (1).
In equation (1), “p” represents a vector constituted by coefficients for linear prediction of a size of the head portion region based on a detection result from execution of detection processing on learning data, and an actual size of a head portion (an array of coefficients by which the normalized central coordinates of the respective part filters are multiplied). The term “learning data” refers to a group of images each showing a person prepared in advance.
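The construction of the matrix “A” described above can be illustrated with a minimal Python sketch. It builds one row of “A” for a single detection result by taking the difference between each part filter center and the detection center, then dividing the x difference by the width w and the y difference by the height h. The function names, the box format (x1, y1, x2, y2), and the sample coordinates are assumptions made for illustration only. The sketch does not reproduce equation (1) itself; `linear_predict` only shows the dot product of one row with a coefficient vector, and how the result maps to head portion coordinates is left open.

```python
def box_center(box):
    """Center (cx, cy) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def normalized_part_offsets(detection, part_boxes):
    """One row of A: part-filter center offsets, x divided by w and y by h."""
    x1, y1, x2, y2 = detection
    w, h = x2 - x1, y2 - y1
    cx, cy = box_center(detection)
    row = []
    for part in part_boxes:
        px, py = box_center(part)
        row.extend([(px - cx) / w, (py - cy) / h])
    return row


def linear_predict(row, p):
    """Dot product of one row of A with a coefficient vector p."""
    return sum(a * c for a, c in zip(row, p))


# Hypothetical detection result and four part-filter boxes (not from the source).
detection = (100, 50, 200, 150)
parts = [(100, 50, 150, 100), (150, 50, 200, 100),
         (100, 100, 150, 150), (150, 100, 200, 150)]
row = normalized_part_offsets(detection, parts)
print(row)
```

Stacking one such row per detection result gives the row and column layout described for “A”.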
As illustrated in
The layer limitation unit 105 determines a layer where the second detection unit 106 will perform detection processing based on the feature amount generated by the feature amount generation unit 102 and the head portion region 304 estimated by the first estimation unit 104. As a determination method, the layer limitation unit 105 calculates the layer with use of equation (2).
Further,
As illustrated in
In equation (2), “A” represents a matrix constituted by results from taking logarithms of head portion region sizes in the learning data, and “B” represents a matrix constituted by layers in the detection results of the second detection unit 106.
The layer is determined according to equation (3) with use of the calculated coefficients coeff.
Layer=coeff1*log(width)+coeff2*log(height)+coeff3 (3)
In equation (3), “width” represents a width of the head portion region, and “height” represents a height of the head portion region.
Detection processing is performed only on the feature amount of a layer image 507, which coincides with or is the closest to the layer acquired from equation (3), and detection processing is not performed on the feature amounts of layers 509, which are other layers than the layer image 507.
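As an illustration of equation (3) and the subsequent layer selection, the following Python sketch computes the predicted layer from the width and height of the head portion region and then picks the available layer that coincides with or is closest to it. The coefficient values, the layer indexing from 0 to 5, and the function names are hypothetical; actual coefficients would be obtained from the learning data.

```python
import math


def predict_layer(width, height, coeff):
    """Equation (3): coeff1*log(width) + coeff2*log(height) + coeff3."""
    c1, c2, c3 = coeff
    return c1 * math.log(width) + c2 * math.log(height) + c3


def closest_layer(predicted, layer_indices):
    """Available layer index that coincides with or is nearest to the prediction."""
    return min(layer_indices, key=lambda idx: abs(idx - predicted))


# Hypothetical coefficients and head portion size (not from the source).
coeff = (1.5, 1.5, -8.0)
predicted = predict_layer(width=32, height=32, coeff=coeff)
selected = closest_layer(predicted, layer_indices=range(6))
print(predicted, selected)
```

Only the feature amount of the layer with index `selected` would then be passed on for the second detection.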
The limited layer image 507 is output to the second detection unit 106.
The second detection unit 106 performs the detection processing only on the feature amount of the limited layer image 507 based on the outputs of the feature amount generation unit 102 and the layer limitation unit 105. The present exemplary embodiment detects the entire body of a person. Needless to say, a target detected by the second detection unit 106 is not limited to the entire body of a person.
As a detection processing method, the detection processing is performed at each position by sliding a window of a model 508, which is learned in advance with use of a learning method such as SVM or Boosting.
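The sliding-window procedure can be sketched in Python as follows. The sketch assumes a two-dimensional feature map stored as a list of lists and uses a stand-in scoring function (the mean of the window values) in place of a model learned with SVM or Boosting; the threshold, stride, and all names are illustrative assumptions.

```python
def sliding_window_detect(feature_map, window_h, window_w, score_fn,
                          threshold, stride=1):
    """Score every window position; return (x1, y1, x2, y2, score) above threshold."""
    rows, cols = len(feature_map), len(feature_map[0])
    detections = []
    for y in range(0, rows - window_h + 1, stride):
        for x in range(0, cols - window_w + 1, stride):
            window = [r[x:x + window_w] for r in feature_map[y:y + window_h]]
            score = score_fn(window)
            if score >= threshold:
                detections.append((x, y, x + window_w, y + window_h, score))
    return detections


def mean_score(window):
    """Stand-in scorer; a real model would be a learned classifier."""
    values = [v for r in window for v in r]
    return sum(values) / len(values)


# Hypothetical 4x4 feature map with a 2x2 response in the middle.
feature_map = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
hits = sliding_window_detect(feature_map, 2, 2, mean_score, threshold=1.0)
print(hits)
```

Because the search runs on a single limited layer, the cost is that of one pass over one feature map rather than one pass per layer.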
As illustrated in
The second estimation unit 107 estimates the specific part region from each of the detection results 610 to 612 expressed as rectangular regions acquired by the second detection unit 106. In the present exemplary embodiment, the second estimation unit 107 estimates the head portion region as the specific part. Needless to say, the specific part estimated by the second estimation unit 107 is not limited to the head portion region, but should be the same as the region estimated by the first estimation unit 104. This is because the integration unit 108, which will be described below, integrates the region estimated by the first estimation unit 104 and the region estimated by the second estimation unit 107.
As an estimation method, the specific part region can be calculated with use of equation (1) used by the first estimation unit 104. As illustrated in
The integration unit 108 integrates a head portion region 203 acquired by the first estimation unit 104, and the head portion regions 620 to 622 acquired by the second estimation unit 107, and outputs a final detection result.
As an integration method, the integration unit 108 calculates an overlap ratio between the head portion region 203 and each of the head portion regions 620 to 622, and selects the head portion region having the highest overlap ratio as a frame from which the same target is detected.
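A minimal Python sketch of this integration step is shown here. The description does not define how the overlap ratio is computed, so the sketch assumes intersection over union of axis-aligned boxes given as (x1, y1, x2, y2); the box values and function names are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2); an assumed definition."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def best_match(reference, candidates):
    """Index and ratio of the candidate that overlaps the reference box the most."""
    ratios = [overlap_ratio(reference, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: ratios[i])
    return best, ratios[best]


# Hypothetical head portion regions from the first and second estimation.
first_head = (10, 10, 30, 30)
second_heads = [(50, 50, 70, 70), (12, 12, 32, 32), (20, 20, 40, 40)]
index, ratio = best_match(first_head, second_heads)
print(index, ratio)
```

The candidate at `index` is the one treated as belonging to the same target as the first estimate.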
As illustrated in
The present exemplary embodiment is configured in this way.
Subsequently, processing performed by the object detection apparatus 100 according to the present exemplary embodiment will be described with reference to a flowchart illustrated in
In step S100, the entire processing is started. First, the processing proceeds to step S101, in which the image acquisition unit 101 acquires an image from, for example, a camera or an image file. The acquired image is transmitted to the feature amount generation unit 102. Then, the processing proceeds to step S102.
In step S102, the feature amount generation unit 102 generates the layer images 201 by performing image enlargement/reduction processing on the image acquired by the image acquisition unit 101, and generates a feature amount for each layer image.
The generated feature amount may be an HOG feature amount, a Haar-Like feature amount, or a color feature amount.
As a result of this processing, a layer feature amount can be acquired. The generated feature amount is output to the first detection unit 103 and the layer limitation unit 105. Then, the processing proceeds to step S103.
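The layer image generation in step S102 can be sketched in Python as follows. The sketch assumes a grayscale image stored as a list of lists and uses nearest-neighbor sampling for simplicity; the scale set and function names are hypothetical, and the feature amount computation (for example HOG) is omitted.

```python
def resize_nearest(image, scale):
    """Resize a 2-D list-of-lists image by `scale` using nearest-neighbor sampling."""
    src_h, src_w = len(image), len(image[0])
    dst_h = max(1, round(src_h * scale))
    dst_w = max(1, round(src_w * scale))
    return [
        [image[min(src_h - 1, int(y / scale))][min(src_w - 1, int(x / scale))]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


def generate_layer_images(image, scales):
    """One resized copy of `image` per scale; scales > 1 enlarge, scales < 1 reduce."""
    return [resize_nearest(image, s) for s in scales]


# Hypothetical 4x4 image and scale set (not from the source).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
layers = generate_layer_images(image, scales=[2.0, 1.0, 0.5])
print([(len(layer), len(layer[0])) for layer in layers])
```

A feature amount would then be computed once per entry of `layers`, so that later stages can refer to a layer by its index.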
In step S103, the first detection unit 103 performs the detection processing. The first detection unit 103 performs the detection processing on the feature amount of the layer image 302, which is one of the feature amounts of the layer images 201.
Alternatively, the first detection unit 103 may perform the detection processing on the generated feature amounts of the layer images 201, or may perform the detection processing on the image acquired by the image acquisition unit 101 by, for example, pattern matching.
As a detection method, the first detection unit 103 uses a known technique such as HOG+SVM or ISM. Further, in the present exemplary embodiment, the first detection unit 103 detects the head portion surrounding region of a person. However, the present invention is not limited thereby. The detection result 303 acquired from the detection processing is output to the first estimation unit 104. Then, the processing proceeds to step S104.
In step S104, it is determined whether the first detection result can be acquired. If there is no detection result (NO in step S104), the processing ends. If there is the detection result 303 (YES in step S104), the processing proceeds to step S105.
In step S105, the first estimation unit 104 estimates the specific part region from the detection result 303 acquired from the first detection unit 103.
In the present exemplary embodiment, the first estimation unit 104 estimates the head portion region as the specific part. However, in the present invention, the specific part region is not limited to the head portion region. The first estimation unit 104 estimates the head portion region 304 from the detection result 303 with use of the equation (1).
After completion of the entire head region estimation processing, the acquired head portion region 304 is output to the layer limitation unit 105 and the integration unit 108. Then, the processing proceeds to step S106.
In step S106, the layer limitation unit 105 limits a layer where the second detection unit 106 will perform the detection processing on the feature amount generated by the feature amount generation unit 102, with use of the head portion region 304 estimated by the first estimation unit 104.
As a method for limiting the layer, the layer limitation unit 105 calculates coefficients with use of equation (2) while setting a final result generated by integrating the results of the first estimation unit 104 and the second estimation unit 107 as the learning data, and calculates a layer from linear prediction of the head portion frame and the coefficients according to equation (3).
As a result, it is possible to determine a feature amount of a layer most suitable for the detection processing by the second detection unit 106. As illustrated in
In step S107, the second detection unit 106 performs the detection processing.
As a detection method, the detection processing is performed at each position by sliding the window of the model 508, which is learned in advance with use of a learning method such as SVM or Boosting. Further, in the present exemplary embodiment, the model detects the entire body of a person, but the present invention is not limited thereby.
As illustrated in
In step S108, the second estimation unit 107 estimates a specific part from each of the detection results 610 to 612 acquired from the second detection unit 106.
In the present exemplary embodiment, the second estimation unit 107 estimates the head portion region as the specific part, but the present invention is not limited thereby. As a method for estimating the head portion region, the second estimation unit 107 performs the estimation with use of the equation (1) used by the first estimation unit 104.
As illustrated in
In step S109, the integration unit 108 integrates the head portion region 203 estimated by the first estimation unit 104, and the head portion regions 620 to 622 estimated by the second estimation unit 107.
As an integration processing method, the integration unit 108 calculates an overlap ratio between the head portion region 203 and each of the head portion regions 620 to 622, and selects the result of the combination achieving the highest overlap ratio as the head portion frame from which the same target is detected.
As illustrated in
Then, the entire processing ends.
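The order of steps S102 to S109, including the early exit in step S104, can be summarized with the following Python sketch. Each stage is passed in as a callable so that the sketch shows only the control flow; the function name, the parameter names, and the string placeholders in the example are hypothetical and stand in for the units described above.

```python
def detect_specific_part(image, generate_layers, detect_first, estimate_first,
                         limit_layer, detect_second, estimate_second, integrate):
    """Run steps S102 to S109 in order; each stage is supplied as a callable."""
    layers = generate_layers(image)                       # S102
    first_detection = detect_first(layers)                # S103
    if first_detection is None:                           # S104: no result, stop
        return None
    first_region = estimate_first(first_detection)        # S105
    layer = limit_layer(layers, first_region)             # S106
    second_detections = detect_second(layer)              # S107
    second_regions = [estimate_second(d) for d in second_detections]  # S108
    return integrate(first_region, second_regions)        # S109


# Hypothetical stand-in stages, only to show the control flow.
result = detect_specific_part(
    image="frame",
    generate_layers=lambda img: [img + "@0.5", img + "@1.0"],
    detect_first=lambda layers: "head-and-shoulders",
    estimate_first=lambda det: "head(first)",
    limit_layer=lambda layers, region: layers[0],
    detect_second=lambda layer: ["body-a", "body-b"],
    estimate_second=lambda det: "head(" + det + ")",
    integrate=lambda first, seconds: (first, seconds[0]),
)
print(result)
```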
As illustrated in
The image acquisition unit 801 acquires an image 901 from a camera or among images captured in advance. The acquired image 901 is transmitted to the feature amount generation unit 802.
The feature amount generation unit 802 generates layer images 1002 by performing enlargement/reduction processing on the image 901 acquired by the image acquisition unit 801 at predetermined scales as illustrated in
The generated layer feature amount is output to the first detection unit 803 and the layer limitation unit 805.
The first detection unit 803 performs detection processing on the feature amount. As illustrated in
The head portion surrounding region is detected with use of the known technique described in the description of the first exemplary embodiment.
The first detection unit 803 detects results 1104 to 1106 in the feature amount of the layer image 1001. The detection results 1104 to 1106 are transmitted to the first estimation unit 804.
The first estimation unit 804 estimates a specific part region based on the detection results 1104 to 1106 of the head portion surrounding region, which are acquired by the first detection unit 803. In the present exemplary embodiment, the first estimation unit 804 estimates a head portion region, but the present invention is not limited thereby.
As a method for estimating the head portion region, the head portion region can be acquired according to the following equation (4) with use of positional coordinates of the detection result of the first detection unit 803.
In equation (4), x1 and y1 represent upper left coordinates of the detection result, and x2 and y2 represent lower right coordinates of the detection result.
In equation (4), “A” represents a matrix converted from values resulting from normalization of upper left coordinates and lower right coordinates of part filters in a single detection result based on the central coordinates and size of a root filter. Further, “pm” represents a vector constituted by coefficients (“m” represents a model number) acquired from learning. At this time, parameters of pm are calculated for each model by performing the least-square method on learning data having a head portion frame as a correct answer and a frame of the second detection unit 806 as a set.
As illustrated in
The layer limitation unit 805 determines layers where the second detection unit 806 will perform detection processing on the feature amounts of the layer images 1002 based on the head portion regions 1114 to 1116 estimated by the first estimation unit 804.
The distribution 1201 indicates the entire body of a person in an upright position. The distribution 1202 indicates the entire body of a person in a forward tilting position. The distribution 1203 indicates the entire body of a person in a squatting position. In a case where postures are different in this way, different layers are suitable for estimation from the size of the head portion region. Therefore, coefficients are learned by the least-square method for each model, and are calculated according to equation (5).
In equation (5), “Am” represents a matrix constituted by results from taking logarithms of sizes of the head portion frame in the learning data for each model, and “Bm” represents a matrix constituted by layers in the detection results of the second detection unit 806 for each model.
The layer where detection will be performed is calculated according to equation (6) with use of the calculated coefficients.
Layerm=coeff1m*log(width)+coeff2m*log(height)+coeff3m (6)
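The per-model learning of the coefficients in equation (6) can be sketched in Python as follows. The sketch assumes that each learning sample is a (width, height, layer) triple, that a constant column of ones is appended to the logarithms so that coeff3m can be fitted, and that the least-square solution is obtained from the normal equations; these choices, the function names, and the synthetic samples are illustrative assumptions rather than the procedure of equation (5).

```python
import math


def solve_linear(m, v):
    """Solve m @ x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def fit_layer_coefficients(samples):
    """Least-square fit of (coeff1, coeff2, coeff3) from (width, height, layer) samples."""
    rows = [[math.log(w), math.log(h), 1.0] for w, h, _ in samples]
    targets = [layer for _, _, layer in samples]
    # Normal equations: (A^T A) coeff = A^T B.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve_linear(ata, atb)


def make_samples(sizes, c):
    """Synthetic samples generated from known coefficients, for demonstration only."""
    return [(w, h, c[0] * math.log(w) + c[1] * math.log(h) + c[2]) for w, h in sizes]


# Hypothetical head portion sizes and per-model ground-truth coefficients.
sizes = [(16, 20), (24, 24), (32, 40), (48, 40), (64, 80)]
training = {
    "upright": make_samples(sizes, (2.0, 1.0, -5.0)),
    "squatting": make_samples(sizes, (1.0, 2.0, -4.0)),
}
coeff_by_model = {name: fit_layer_coefficients(s) for name, s in training.items()}
print(coeff_by_model)
```

Keeping one coefficient triple per model lets the same head portion size map to a different layer for each posture.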
It is possible to determine a layer most suitable for detection processing that the second detection unit 806 will perform by calculating a weighted sum of coefficients for this head portion region. As illustrated in
The second detection unit 806 performs the detection processing only on the limited layer images 1307 based on the feature amount generation unit 802 and the layer limitation unit 805.
As illustrated in
Further, in the present exemplary embodiment, for example, the second detection unit 806 detects a person in an upright position with use of some model, and detects a person in a squatting position with use of another model. In this way, the second detection unit 806 can detect a person's body corresponding to a posture change using different models. As illustrated in
The second estimation unit 807 estimates a specific part region from the detection results 1410 to 1415 acquired from the second detection unit 806. In the present exemplary embodiment, the second estimation unit 807 estimates a head portion region, but the present invention is not limited thereby. Needless to say, the specific part estimated by the second estimation unit 807 is not limited to the head portion region, but should be the same as the region estimated by the first estimation unit 804. This is because the integration unit 808, which will be described below, integrates the region estimated by the first estimation unit 804 and the region estimated by the second estimation unit 807.
As an estimation method, the second estimation unit 807 estimates the head portion region for each model with use of equation (4), which is used by the first estimation unit 804. As a result, acquired head portion regions 1420 to 1425 are output to the integration unit 808.
The integration unit 808 integrates the head portion regions 1114 to 1116 of the first estimation unit 804 and the head portion regions 1420 to 1425 of the second estimation unit 807, in a similar manner to the first exemplary embodiment. Finally, in the present exemplary embodiment, the regions 1410, 1411, and 1412 are output as a final detection result.
The present exemplary embodiment is configured in this way.
As illustrated in
In the present exemplary embodiment, after the layer/range limitation unit 1505 limits a layer and a detection processing range for each of models to be used by the second detection unit 1506, the second detection unit 1506 performs detection processing according to this limitation. The layer/range limitation unit 1505 determines a feature amount of a layer image where detection will be performed for each model based on the estimated size and the position of the head portion acquired from the first estimation unit 804.
First, the layer/range limitation unit 1505 calculates a layer with use of the equation (6) to calculate a layer image 1601 for a person in an upright position.
Next, the layer/range limitation unit 1505 determines a detection processing range 1604 from a position 1602 of a head portion estimated region and a filter size of a model 1603 so as to allow a thorough search in a range including the filter of the model 1603 above, below, to the right of, and to the left of the head portion estimated position 1602 with the head portion estimated position 1602 set as a center. At this time, the detection processing range 1604 may be stored in a memory as a region, or may be held as a map in which the interior of the detection processing range 1604 is labeled.
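The determination of the detection processing range can be sketched in Python as follows. The sketch assumes that the range extends one filter width to the left and right and one filter height above and below the estimated head position, clipped to the layer image, and it shows both storage options mentioned above (a region and a labeled map). The exact extent of the range, the function names, and the sample values are assumptions for illustration.

```python
def detection_range(head_center, filter_size, layer_size):
    """Box (x1, y1, x2, y2) centered on the head, one filter size each way, clipped."""
    cx, cy = head_center
    fw, fh = filter_size
    lw, lh = layer_size
    return (max(0, cx - fw), max(0, cy - fh), min(lw, cx + fw), min(lh, cy + fh))


def range_to_map(box, layer_size):
    """Label map: 1 inside the detection processing range, 0 elsewhere."""
    x1, y1, x2, y2 = box
    lw, lh = layer_size
    return [[1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(lw)]
            for y in range(lh)]


# Hypothetical head position, model filter size, and layer size (width, height).
box = detection_range(head_center=(5, 3), filter_size=(2, 4), layer_size=(10, 8))
label_map = range_to_map(box, (10, 8))
print(box)
```

The second detection would then evaluate the model only at positions where the map holds 1.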
The second detection unit 1506 performs detection processing with use of only the model corresponding to the determined detection processing range 1604.
Similarly, for a person in a squatting position, the layer/range limitation unit 1505 focuses on a feature amount of a layer image 1605, and determines a detection processing range 1608 from a position of a head portion region 1606 and a size of a model 1607. For a person in a forward tilting position, the layer/range limitation unit 1505 also focuses on a feature amount of a layer image 1609, and determines a detection processing range 1612 from a position of a head portion region 1610 and a size of a model 1611.
According to this configuration, it is possible to further speed up the entire detection processing.
Various exemplary embodiments have been described above as the first to third exemplary embodiments, but each of them is only one example of the following configuration. Other embodiments based on the following configuration are also within the scope of the present invention.
An image including a target object is acquired (an image acquisition step). Then, the image is enlarged/reduced at a plurality of different magnifications to generate layer images (a layer image generation step). Then, a region of at least a part of the target object is detected based on the layer images (a first detection step). Then, a first specific part region is estimated based on a first detected region detected in the first detection step (a first estimation step). Then, a layer of the layer images is limited based on the first specific part region and learning data learned in advance (a layer limitation step). Then, a region of at least a part of the target object is detected in the layer image of the layer limited in the layer limitation step (a second detection step). Then, a second specific part region is estimated based on a second detected region detected in the second detection step (a second estimation step). Then, an estimated result estimated in the first estimation step and an estimated result estimated in the second estimation step are integrated to determine an integration result as a specific part region of the target object (an integration step).
The exemplary embodiments have been described in detail above. The present invention can be embodied in the form of, for example, a system, an apparatus, a method, a program, or a storage medium. In particular, the present invention may be applied to a system constituted by a plurality of devices, or may be applied to an apparatus constituted by a single device.
Embodiments of the present invention can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions recorded on a storage medium (e.g., non-transitory computer-readable storage medium) to perform the functions of one or more of the above-described embodiment(s) of the present invention, and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more of a central processing unit (CPU), micro processing unit (MPU), or other circuitry, and may include a network of separate computers or separate computer processors. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all modifications, equivalent structures, and functions.
Matsugu, Masakazu, Tsukamoto, Kenji, Torii, Kan
Patent | Priority | Assignee | Title |
9213909, | Mar 30 2012 | Canon Kabushiki Kaisha | Object detection method, object detection apparatus, and program |
9224068, | Dec 04 2013 | GOOGLE LLC | Identifying objects in images |
9547908, | Sep 28 2015 | GOOGLE LLC | Feature mask determination for images |
20040153229, | |||
20090290791, | |||
20100067742, | |||
20110010317, | |||
20110019920, | |||
20110164149, | |||
20120148118, | |||
20120274634, | |||
20130064425, | |||
20130249916, | |||
20140294293, | |||
20150049906, | |||
20150220768, | |||
20150262330, | |||
20150262368, | |||
20170124410, | |||
CN101593268, | |||
CN101673342, | |||
CN101714214, | |||
CN102236899, | |||
JP2008102611, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Dec 01 2015 | Canon Kabushiki Kaisha | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Jan 21 2023 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Date | Maintenance Schedule |
Aug 27 2022 | 4 years fee payment window open |
Feb 27 2023 | 6 months grace period start (w surcharge) |
Aug 27 2023 | patent expiry (for year 4) |
Aug 27 2025 | 2 years to revive unintentionally abandoned end. (for year 4) |
Aug 27 2026 | 8 years fee payment window open |
Feb 27 2027 | 6 months grace period start (w surcharge) |
Aug 27 2027 | patent expiry (for year 8) |
Aug 27 2029 | 2 years to revive unintentionally abandoned end. (for year 8) |
Aug 27 2030 | 12 years fee payment window open |
Feb 27 2031 | 6 months grace period start (w surcharge) |
Aug 27 2031 | patent expiry (for year 12) |
Aug 27 2033 | 2 years to revive unintentionally abandoned end. (for year 12) |