An object detector includes an input interface to accept a sequence of video frames, a memory to store a neural network trained to detect objects in the video frames, a processor to process each video frame sequentially with the neural network to detect objects in the sequence of video frames, and an output interface to output the object detection information. The neural network includes a first subnetwork, a second subnetwork, and a third subnetwork. The first subnetwork receives as an input a video frame and outputs a feature map of the video frame. The second subnetwork is a recurrent neural network that takes the feature map as an input and outputs a temporal feature map. The third subnetwork takes the temporal feature map as an input and outputs object detection information.
15. A non-transitory computer readable storage medium embodied thereon a program executable by a processor for performing a method, the method comprising:
accepting a sequence of video frames;
processing each video frame sequentially with a neural network to detect objects in the sequence of video frames, wherein the neural network includes a first subnetwork, a second subnetwork, and a third subnetwork, wherein the first subnetwork receives as an input a current video frame and outputs a current feature map of the current video frame, wherein the second subnetwork is a recursive neural network that combines the current feature map of the current video frame with a previous temporal feature map produced for a previous video frame in the sequence of video frames to output a current temporal feature map of the current video frame, and wherein the third subnetwork takes the current temporal feature map of the current video frame as an input and outputs object detection information; and
outputting the object detection information.
1. An object detector, comprising:
an input interface configured to accept a sequence of video frames;
a memory configured to store a neural network trained to detect objects in the video frames, the neural network includes a first subnetwork, a second subnetwork, and a third subnetwork, wherein the first subnetwork receives as an input a current video frame and outputs a current feature map of the current video frame, wherein the second subnetwork is a recursive neural network that combines the current feature map of the current video frame with a previous temporal feature map produced by the second subnetwork for a previous video frame in the sequence of video frames to output a current temporal feature map of the current video frame, and wherein the third subnetwork takes the current temporal feature map of the current video frame as an input and outputs object detection information;
a processor configured to process each video frame sequentially with the neural network to detect objects in the sequence of video frames; and
an output interface configured to output the object detection information.
13. A method for detecting at least one object in a sequence of video frames, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out at least some steps of the method comprising:
accepting a sequence of video frames;
processing each video frame sequentially with a neural network to detect objects in the sequence of video frames, wherein the neural network includes a first subnetwork, a second subnetwork, and a third subnetwork, wherein the first subnetwork receives as an input a current video frame and outputs a current feature map of the current video frame, wherein the second subnetwork is a recursive neural network that combines the current feature map of the current video frame with a previous temporal feature map produced for a previous video frame in the sequence of video frames to output a current temporal feature map of the current video frame, and wherein the third subnetwork takes the current temporal feature map of the current video frame as an input and outputs object detection information; and
outputting the object detection information.
2. The object detector of
4. The object detector of
5. The object detector of
6. The object detector of
7. The object detector of
8. The object detector of
9. The object detector of
10. The object detector of
11. The object detector of
12. The object detector of
14. The method of
This invention relates generally to computer vision, and more particularly to detecting objects in video sequences.
Object detection is one of the most fundamental problems in computer vision. This is partially due to its inherent complexity as well as its potential for wide-ranging applications. One of the goals of object detection is to detect and localize the instances of pre-defined object classes in the form of bounding boxes within the input image with confidence values for each detection. An object detection problem can be converted to an object classification problem by a scanning window technique. However, the scanning window technique is inefficient because classification steps are performed for all potential image regions of various locations, scales, and aspect ratios.
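The cost of the scanning window technique can be made concrete with a short count of classifier evaluations. The following is a minimal sketch; the image size, window scales, aspect ratios, and stride are hypothetical values chosen for illustration only.

```python
def count_windows(image_w, image_h, window_sizes, stride):
    """Count classifier evaluations for an exhaustive scanning-window search."""
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window does not fit inside the image
        positions_x = (image_w - win_w) // stride + 1
        positions_y = (image_h - win_h) // stride + 1
        total += positions_x * positions_y
    return total


# Hypothetical setup: 640x480 image, three scales, three aspect ratios, stride 8.
sizes = [(int(s * r), s) for s in (64, 128, 256) for r in (0.5, 1.0, 2.0)]
n_windows = count_windows(640, 480, sizes, stride=8)
print(n_windows)
```

Each counted window requires a separate classification step, so the count grows with every added scale, aspect ratio, or finer stride.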
The region-based convolutional neural network (R-CNN) performs a two-stage approach, in which a set of object proposals is generated as regions of interest (ROI) using a proposal generator, and the existence of an object and the classes in the ROI are determined using a deep neural network. However, the detection accuracy of the R-CNN is insufficient for some cases.
A single-shot object detector is another neural network architecture that is used for object detection. In this class of networks, there is no region-proposal stage. Instead, the input image is automatically divided into many different overlapping regions, and many convolutional and pooling layers directly output a probability for each region. One or more bounding boxes are also output for each region, which are ignored if none of the classes has a high probability. Neural networks of this type tend to be faster than the region-proposal type architectures. However, their accuracy is also insufficient for some cases.
This problem is even more apparent in multi-class detection. The phrase “multi-class” refers to the fact that object detectors can detect multiple different object classes using a single detector. The vast majority of this work has focused on using a single image as input. Convolutional neural networks (CNN) have dominated recent progress.
However, for many applications, the natural input to an object detector is a video. Standard practice is to simply process video sequences one frame at a time, treating each frame independently of the others. Although there have been past approaches that attempt to use multiple frames to improve object detection accuracy, these approaches use multiple frames in a pre-processing or post-processing phase. See, e.g., a method described in U.S. Pat. No. 7,391,907, which uses the video sequences to track the object from one frame into another to assist the object detection.
Accordingly, there is a need for multi-class detectors that take multiple frames of video as input.
It is an object of some embodiments to provide a multi-class object detector that takes multiple frames of video as an input to detect and/or classify the objects in the sequence of video frames. It is another object of some embodiments to provide such a multi-class detector that can concurrently locate and classify one or multiple objects in the multiple frames of the video.
Some embodiments are based on the recognition that a multi-class detector can use box-level techniques to operate on the final bounding-box output of object detectors applied to multiple sequential frames. However, the box-level techniques assist in locating the object, not in classifying the object. To that end, some embodiments are based on the recognition that it is desired for a multi-class object detector to use feature-level techniques that consider image features from multiple frames to concurrently locate and classify the object.
However, it is challenging to take advantage of multiple frames together to concurrently locate and classify the object. For example, one approach is to use the multiple frames directly as input to a convolutional neural network. However, some embodiments recognize that this approach does not work well. It is too difficult for the network to learn how to relate raw pixel information across multiple frames.
However, some embodiments are based on the realization that after a few convolutional network layers have processed the input video frame, the resulting feature maps represent higher level image information (such as object parts) which are easier to associate across frames. This insight led to the idea of adding a recurrent neural network layer to a network after a first stage of convolutional neural network layers, because it allows the recurrent units to process higher level information (feature maps) from the current frame as well as previous frames. This architecture led to significant accuracy gains over single-frame object detection networks.
To that end, some embodiments provide a Recurrent Multi-frame Single-Shot Detector (Recurrent Mf-SSD) neural network architecture. This architecture uses multiple sequential frames to improve accuracy without sacrificing the speed of modern object detectors. The Recurrent Mf-SSD network takes a multi-frame video sequence as input and is adapted to handle the change in the input data. The Recurrent Mf-SSD uses a data fusion layer directly after the feature extractor to integrate information from the sequence of input images. The data fusion layer is a recurrent layer. The output of the data fusion layer is then fed into the detection head, which produces the final bounding boxes and classes for the most recent time-stamped image.
For example, the Recurrent Mf-SSD can be implemented as a neural network including a first subnetwork, a second subnetwork, and a third subnetwork. The first subnetwork receives as an input a video frame and outputs a feature map of the video frame. The second subnetwork takes the feature map as an input and outputs a temporal feature map, and the third subnetwork takes the temporal feature map as an input and outputs object detection information.
In various embodiments, the second sub-network is a recurrent neural network having the ability to incorporate temporal information in many domains. Examples of the recurrent neural networks include LSTM and GRU units. The recurrent neural network formed by the second subnetwork combines recursively the inputted feature map with the temporal feature map produced for a previous video frame in the sequence of video frames. In such a manner, the detection head, i.e., the third subnetwork that produces the final bounding boxes and classes for the most recent time-stamped image, can use the higher level information (feature maps) from the current frame as well as previous frames.
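The recursive combination of the current feature map with the previous temporal feature map can be sketched with a GRU-style gated update. This is a minimal illustration under stated assumptions, not the trained second subnetwork: the maps have a single channel, the weights in `p` are hypothetical scalars (equivalent to a 1x1 convolution), and the function name `gated_update` is introduced here for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_update(feature_map, prev_temporal, p):
    """Combine the current feature map with the previous temporal feature map.

    Both maps are H x W nested lists of floats. `p` holds scalar weights,
    an illustrative simplification of a convolutional recurrent unit.
    """
    new_temporal = []
    for x_row, h_row in zip(feature_map, prev_temporal):
        out_row = []
        for x, h in zip(x_row, h_row):
            z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
            r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
            candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
            # Blend the previous temporal value with the new candidate value.
            out_row.append((1.0 - z) * h + z * candidate)
        new_temporal.append(out_row)
    return new_temporal


# Hypothetical weights and a two-frame sequence of 2x2 feature maps.
params = dict(wz=1.0, uz=1.0, bz=0.0, wr=1.0, ur=1.0, br=0.0,
              wh=1.0, uh=1.0, bh=0.0)
state = [[0.0, 0.0], [0.0, 0.0]]
for fmap in ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]):
    state = gated_update(fmap, state, params)
```

The update gate `z` decides, per location, how much of the previous temporal feature map is retained and how much is replaced by information from the current frame.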
In various embodiments, the first and/or the third subnetworks are convolutional networks formed by a combination of convolutional and pooling layers. Additionally, or alternatively, in some embodiments, the Recurrent Mf-SSD uses convolutional recurrent units, instead of fully connected recurrent units, to maintain the fully convolutional structure of the object detection architecture. Some embodiments are based on the recognition that convolutional recurrent units combine the benefits of standard convolutional layers (i.e., sparsity of connections and suitability to spatial information) with the benefits of standard recurrent layers (i.e., learning temporal features).
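The sparsity benefit of convolutional recurrent units can be illustrated by counting weights. The sketch below compares a fully connected recurrent unit acting on a flattened feature map with a convolutional recurrent unit; the feature-map size, channel counts, and kernel size are hypothetical, and `num_gates` is an assumed multiplier (for example, 3 for a GRU-style unit or 4 for an LSTM-style unit).

```python
def fully_connected_recurrent_params(channels_in, channels_hidden,
                                     height, width, num_gates=1):
    """Weights and biases when every input and hidden value is connected
    to every hidden value of the flattened feature map."""
    n_in = channels_in * height * width
    n_hidden = channels_hidden * height * width
    return num_gates * ((n_in + n_hidden) * n_hidden + n_hidden)


def convolutional_recurrent_params(channels_in, channels_hidden,
                                   kernel_size, num_gates=1):
    """Weights and biases when the same k x k kernels are shared across
    all spatial locations of the feature map."""
    weights = (kernel_size * kernel_size
               * (channels_in + channels_hidden) * channels_hidden)
    return num_gates * (weights + channels_hidden)


# Hypothetical 19x19 feature map with 256 channels and 3x3 kernels.
fc = fully_connected_recurrent_params(256, 256, 19, 19, num_gates=3)
conv = convolutional_recurrent_params(256, 256, 3, num_gates=3)
print(fc, conv, fc // conv)
```

The convolutional count does not depend on the spatial size of the feature map, which is what allows the detector to remain fully convolutional.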
Accordingly, one embodiment discloses an object detector including an input interface to accept a sequence of video frames; a memory to store a neural network trained to detect objects in the video frames, the neural network includes a first subnetwork, a second subnetwork, and a third subnetwork, wherein the first subnetwork receives as an input a video frame and outputs a feature map of the video frame, wherein the second subnetwork is a recurrent neural network that takes the feature map as an input and outputs a temporal feature map, and wherein the third subnetwork takes the temporal feature map as an input and outputs object detection information; a processor to process each video frame sequentially with the neural network to detect objects in the sequence of video frames; and an output interface to output the object detection information.
Another embodiment discloses a method for detecting at least one object in a sequence of video frames, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out at least some steps of the method including accepting a sequence of video frames; processing each video frame sequentially with a neural network to detect objects in the sequence of video frames, wherein the neural network includes a first subnetwork, a second subnetwork, and a third subnetwork, wherein the first subnetwork receives as an input a video frame and outputs a feature map of the video frame, wherein the second subnetwork is a recurrent neural network that takes the feature map as an input and outputs a temporal feature map, and wherein the third subnetwork takes the temporal feature map as an input and outputs object detection information; and outputting the object detection information.
Another embodiment discloses a non-transitory computer readable storage medium embodied thereon a program executable by a processor for performing a method, the method includes accepting a sequence of video frames; processing each video frame sequentially with a neural network to detect objects in the sequence of video frames, wherein the neural network includes a first subnetwork, a second subnetwork, and a third subnetwork, wherein the first subnetwork receives as an input a video frame and outputs a feature map of the video frame, wherein the second subnetwork is a recurrent neural network that takes the feature map as an input and outputs a temporal feature map, and wherein the third subnetwork takes the temporal feature map as an input and outputs object detection information; and outputting the object detection information.
These instructions implement a method for detecting objects in a video sequence. In various embodiments, the object detection produces a set of bounding boxes indicating the locations and sizes of objects in each video frame along with a vector of probabilities for each bounding box indicating the likelihood that each output bounding box contains each particular object class.
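One possible in-memory representation of this output is sketched below. The class name `Detection`, the corner-based box layout, and the confidence threshold are assumptions made for illustration; the source specifies only that each bounding box carries a location, a size, and a vector of class probabilities.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    # Assumed layout: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[float, float, float, float]
    # One probability per object class.
    class_probs: List[float]

    def top_class(self) -> Tuple[int, float]:
        """Return the index and probability of the most likely class."""
        idx = max(range(len(self.class_probs)),
                  key=self.class_probs.__getitem__)
        return idx, self.class_probs[idx]


def keep_confident(detections, threshold):
    """Discard boxes for which no class reaches the given probability."""
    return [d for d in detections if d.top_class()[1] >= threshold]


frame_detections = [
    Detection((10.0, 20.0, 110.0, 220.0), [0.05, 0.90, 0.05]),
    Detection((300.0, 40.0, 340.0, 90.0), [0.40, 0.30, 0.30]),
]
confident = keep_confident(frame_detections, threshold=0.5)
```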
The image processing system 100 is configured to detect objects in a video using a neural network including three subnetworks. Such a neural network is referred to herein as a Multi-frame Single Shot neural network. To that end, the image processing system 100 can also include a storage device 130 adapted to store the video frames 134 and the three subnetworks 131, 132, 133 that make up the Multi-frame Single Shot Detector network. The storage device 130 can be implemented using a hard drive, an optical drive, a thumb drive, an array of drives, or any combinations thereof.
In some implementations, a human machine interface 110 within the image processing system 100 connects the system to a keyboard 111 and pointing device 112, wherein the pointing device 112 can include a mouse, trackball, touchpad, joystick, pointing stick, stylus, or touchscreen, among others. The image processing system 100 can be linked through the bus 106 to a display interface 160 adapted to connect the image processing system 100 to a display device 565, wherein the display device 565 can include a computer monitor, camera, television, projector, or mobile device, among others.
The image processing system 100 can also be connected to an imaging interface 170 adapted to connect the system to an imaging device 175. In one embodiment, the frames of video on which the object detector is run are received from the imaging device. The imaging device 175 can include a video camera, computer, mobile device, webcam, or any combination thereof.
A network interface controller 150 is adapted to connect the image processing system 100 through the bus 106 to a network 190. Through the network 190, the video frames 134 or subnetworks 131, 132, 133 can be downloaded and stored within the computer's storage system 130 for storage and/or further processing.
In some embodiments, the image processing system 100 is connected to an application interface 180 through the bus 106 adapted to connect the image processing system 100 to an application device 585 that can operate based on results of object detection. For example, the device 585 is a car navigation system that uses the locations of detected objects to decide how to steer the car.
The second subnetwork 132 is a recurrent network and uses the feature maps 220 computed in step S2 as well as the temporal feature maps 235 computed in the previous iteration of step S3 to compute a new set of temporal feature maps 230. The feature maps 230 are referred to as temporal feature maps because they represent features computed over many frames. Step S4 takes the temporal feature maps 230 and applies a third subnetwork 133, which outputs a set of bounding boxes and class probabilities that encode spatial locations and likely object classes for each detected object in the current video frame.
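The per-frame iteration over steps S2 to S4 can be sketched as a loop that carries the temporal feature maps from one frame to the next. This is a minimal sketch: the function `detect_in_video` and the three toy subnetworks are hypothetical placeholders for the trained subnetworks 131, 132, 133, and passing `None` as the previous temporal feature map for the first frame is an assumption.

```python
def detect_in_video(frames, first_subnetwork, second_subnetwork,
                    third_subnetwork):
    """Process frames sequentially, reusing the temporal feature map."""
    previous_temporal = None  # nothing has been computed before frame one
    detections_per_frame = []
    for frame in frames:
        feature_map = first_subnetwork(frame)                         # step S2
        temporal = second_subnetwork(feature_map, previous_temporal)  # step S3
        detections_per_frame.append(third_subnetwork(temporal))       # step S4
        previous_temporal = temporal  # used by step S3 of the next iteration
    return detections_per_frame


# Placeholder subnetworks (hypothetical stand-ins, not trained networks).
def toy_features(frame):
    return [2.0 * pixel for pixel in frame]


def toy_recurrent(feature_map, previous_temporal):
    if previous_temporal is None:
        return list(feature_map)
    return [0.5 * f + 0.5 * h for f, h in zip(feature_map, previous_temporal)]


def toy_head(temporal):
    return max(temporal)


outputs = detect_in_video([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
                          toy_features, toy_recurrent, toy_head)
```

Because only the most recent temporal feature map is kept, the memory needed per frame stays constant regardless of the length of the video.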
The neural network with three subnetworks can include many parameters. These parameters are optimized during a training phase from many example videos for which the ground truth object bounding boxes and classes are known. The training phase uses an optimization algorithm, such as stochastic gradient descent, to optimize the weights of the network.
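The principle of stochastic gradient descent can be illustrated on a toy problem. In the sketch below, a two-parameter linear model stands in for the network weights and a squared-error loss stands in for the detection loss; both are assumptions for illustration, as are the learning rate and the number of epochs.

```python
import random


def sgd_fit(samples, learning_rate=0.1, epochs=100, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the training examples in random order
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Toy ground truth generated from y = 2x + 1.
training_data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = sgd_fit(training_data)
```

Training the detector follows the same pattern at a much larger scale: each update uses a small subset of the example videos, and the known bounding boxes and classes play the role of the target values.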
The vehicle also includes a processor 702 to run an object detector. For example, the neural network 705 can detect the objects 726 in the sequence of images 725 and output a set of object bounding boxes and object classes 740. The processor 702 can be configured to perform other applications 750 that take advantage of the object detector 705. Examples of the applications 750 include control application for moving the vehicle 701 and/or various computer vision applications.
In other words, one embodiment uses joint calibration and fusion 730 to augment both sensors, i.e., to increase resolution of the LIDAR output 735 and to incorporate high-resolution depth information into the camera output. The result of the fusion can be rendered on a display 740 or submitted to different applications 750, e.g., an object tracking application.
The above-described embodiments of the present invention can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component. A processor may be implemented using circuitry in any suitable format.
Also, the embodiments of the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
Use of ordinal terms such as “first,” “second,” in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention.
Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Jones, Michael, Broad, Alexander
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Feb 06 2018 | Mitsubishi Electric Research Laboratories, Inc. | (assignment on the face of the patent) | / |