A fully automatic, computationally efficient method segments video by sequential clustering of sparse image features. Both edge and corner features of a video scene are employed to capture the outline of foreground objects, and the feature clustering is built on motion models that work on any type of object and with moving or static cameras. Two motion layers are assumed, arising from camera and/or foreground motion and the depth difference between the foreground and background. Sequential linear regression is applied to the sequences of instantaneous displacements of image features in order to compute affine motion parameters for the foreground and background layers while simultaneously considering temporal smoothness. The foreground layer is then extracted based upon the sparse feature clustering, which is time efficient, and refined incrementally using Kalman filtering.
1. For a video image including both a foreground layer and a background layer a method of segmenting the foreground layer from the background layer, said method comprising the computer implemented steps of:
extracting sparse features from a series of image frames thereby producing a sparse feature set for each of the individual images in the series;
performing a sequential linear regression on the sparse feature sets thereby producing a sequential feature clustering set;
extracting the foreground layer from the background layer using the sequential feature clustering set;
refining the extracted layer;
determining optical flows of the sparse features between consecutive frames;
determining a set of features including both edge features and corner features;
computing a covariance matrix for each individual feature, wherein the covariance matrix is used to determine whether the feature is an edge feature or a corner feature;
computing, for each edge feature, its normal direction (dx, dy) from the covariance matrix; and
projecting its optical flow to this normal direction.
2. The method according to
comparing two sets of affine parameters, and
classifying features to each set.
3. The method of
randomly clustering the features into two sets;
determining least-squares solutions of the affine parameters for each set of features, using normal optical flow for edge features;
fitting each feature into both affine motion models and comparing residuals;
classifying each feature to the affine model depending upon the residual;
repeating the determining, fitting and classifying steps above until the clustering process converges.
4. The method of
extending the feature clustering from two frames to several frames.
This application claims the benefit of U.S. Provisional Patent Application No. 60/730,730, filed Oct. 27, 2005, the entire contents and file wrapper of which are incorporated by reference as if set forth at length herein.
This invention relates generally to the field of video processing and in particular relates to a method for segmenting videos into foreground and background layers using motion-based sequential feature clustering.
The ability to segment or separate foreground objects from background objects in video images is useful in a number of applications including video compression, human-computer interaction, and object tracking—to name a few. In order to generate such segmentation—in both a reliable and visually pleasing manner—the fusion of both spatial and temporal information is required. As can be appreciated, this fusion requires that large amounts of information be processed, thereby imposing a heavy computational cost and/or requiring substantial manual interaction. This heavy computational cost unfortunately limits the applicability of such approaches.
Video matting is a classic inverse problem in computer vision research that involves the extraction of foreground objects, and the alpha mattes that describe their opacity, from image sequences. Chuang et al. proposed a video matting method based upon Bayesian matting performed on each individual frame. (See, e.g., Y. Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin and R. Szeliski, “Video Matting of Complex Scenes”, ACM SIGGRAPH 2002, pp. II:243-248, 2002, and Y. Y. Chuang, B. Curless, D. H. Salesin, and R. Szeliski, “A Bayesian Approach To Digital Matting”, CVPR01, pp. II:264-271, 2001). Such methods require accurate user-labeled “trimaps” that segment each image into foreground, background, and unknown regions. Computationally, it is quite burdensome to periodically provide such trimap labels for long video sequences.
Apostoloff and Fitzgibbon presented a matting approach for natural scenes assuming the camera capturing the scene is static and the background is known. (See, e.g., N. Apostoloff and A. W. Fitzgibbon, “Bayesian Video Matting Using Learnt Image Priors”, CVPR04, pp. I:407-414, 2004).
Li et al. used a 3D graph-cut-based segmentation followed by a tracking-based local refinement to obtain a binary segmentation of video objects, then adopted coherent matting as a prior to produce the alpha matte of the object. (See, e.g., J. Shum, J. Sun, S. Yamazaki, Y. Li and C. Tang, “Pop-Up Light Field: An Interactive Image-Based Modeling and Rendering System”, ACM Transactions on Graphics, 23(2):143-162, 2004). This method too suffers from high computational cost and the possible need for user input to fine-tune the results.
Motion-based segmentation methods perform motion estimation and cluster pixels or color segments into regions of coherent motion. (See, e.g., R. Vidal and R. Hartley, “Motion Segmentation With Missing Data Using Powerfactorization and GPCA”, CVPR04, pp. II:310-316, 2004). Layered approaches represent multiple objects in a scene with a collection of layers (See, e.g., J. Xiao and M. Shah, “Motion Layer Extraction In the Presence Of Occlusion Using Graph Cuts”, CVPR04, pp. II:972-979, 2004; N. Jojic and B. J. Frey, “Learning Flexible Sprites in Video Layers”, CVPR01, pp. I:255-262, 2001; J. Y. A. Wang and E. H. Adelson, “Representing Moving Images With Layers”, IP, 3(5):625-638, September 1994). Wang and Ji described a dynamic conditional random field model to combine both intensity and motion cues to achieve segmentation. (See, e.g., Y. Wang and Q. Ji, “A Dynamic Conditional Random Field Model For Object Segmentation In Image Sequences”, CVPR05, pp. I:264-270, 2005). Finally, Ke and Kanade described a factorization method to perform rigid layer segmentation in a subspace because all of the layers share the same camera motion. (See, e.g., Q. Ke and T. Kanade, “A Subspace Approach To Layer Extraction”, CVPR01, pp. I:255-262, 2001). Unfortunately, many of these methods assume that objects are rigid and/or the camera is not moving.
An advance is made in the art in accordance with the principles of the present invention directed to a fully automatic, computationally efficient segmentation method employing sequential clustering of sparse image features.
Advantageously, both edge and corner features of a video scene are employed to capture the outline of foreground objects. The feature clustering is built on motion models which work on any type of object and with either moving or static cameras.
According to an embodiment of the present invention, two motion layers are assumed, arising from camera and/or foreground motion and the depth difference between the foreground and background. Sequential linear regression is applied to the sequences of instantaneous displacements of image features in order to compute affine motion parameters for the foreground and background layers while simultaneously considering temporal smoothness. The foreground layer is then extracted based upon the sparse feature clustering, which is time efficient, and refined incrementally using Kalman filtering.
Further features and aspects of the present invention may be understood with reference to the accompanying drawing in which:
The following merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope.
Furthermore, all examples and conditional language recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the diagrams herein represent conceptual views of illustrative structures embodying the principles of the invention.
Sequential Feature Clustering
According to the present invention, foreground segmentation is determined using sparse features, thereby reducing the computational cost. For a method that operates according to the present invention, we assume that there are only two layers, namely a foreground layer and a background layer. In addition, sparse features are clustered into two classes based upon their motion information.
Operationally, we compute optical flows of the sparse features between consecutive frames and then apply linear regression techniques to compute affine parameters of the two layers. To take advantage of the temporal information, we perform sequential linear regression on sequences of optical flow values to achieve more reliable and temporally smoother clustering results.
Sparse Features
Both corner and edge features are extracted to cover those areas which do not have good textures but have clear outlines—such as human faces. As may be appreciated by those skilled in the art, edge features provide information about the outline of an object, but their optical flows have the foreshortening problem, which we must deal with in the linear regression computation.
In the feature selection criterion, eig1 and eig2 are the eigenvalues of a feature's covariance matrix, and α and β are threshold parameters. Conveniently, Lucas and Kanade have described a method to compute the optical flow values of the features (See, e.g., B. D. Lucas and T. Kanade, “An Iterative Image Registration Technique With An Application To Stereo Vision”, IJCAI81, pp. 674-679, 1981).
For each edge feature—according to an embodiment of the present invention—we compute its normal direction (dx, dy) from the covariance matrix and project its optical flow onto this direction; i.e., we keep only the normal optical flow in the affine parameter computation.
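As a rough illustration, the corner/edge decision and the normal-flow projection might be sketched as follows. The exact thresholding rule involving α and β is not reproduced in this text, so the eigenvalue tests below are only an assumption, and all names are illustrative.

```python
import numpy as np

def classify_feature(cov, alpha, beta):
    """Classify a feature from the eigenvalues of its 2x2 gradient covariance
    matrix. The thresholds alpha and beta are assumed tuning parameters; the
    exact rule used by the described method may differ."""
    eig1, eig2 = np.linalg.eigvalsh(cov)        # ascending: eig1 <= eig2
    if eig1 > alpha:                            # both eigenvalues large -> corner
        return "corner"
    if eig2 > beta:                             # one dominant eigenvalue -> edge
        return "edge"
    return "reject"                             # poorly textured region

def normal_flow(cov, flow):
    """Project an edge feature's optical flow onto its normal direction,
    taken as the eigenvector of the dominant eigenvalue of the covariance."""
    w, v = np.linalg.eigh(cov)
    normal = v[:, np.argmax(w)]                 # direction of largest gradient energy
    return np.dot(flow, normal) * normal        # keep only the normal component
```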
Linear Regression
Given a set of features and their optical flow values between two frames, (δxi, δyi), i=1, . . . , n, where n is the number of features, we apply a linear regression technique to compute two sets of affine parameters and classify each feature to one of the two sets. An embodiment of our method may be summarized as follows: randomly cluster the features into two sets; determine the least-squares solutions of the affine parameters for each set, using normal optical flow for edge features; fit each feature to both affine motion models and compare the residuals; classify each feature to the affine model with the smaller residual; and repeat the determining, fitting, and classifying steps until the clustering converges.
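A minimal sketch of this two-frame clustering loop is given below, assuming full optical flow for every feature; the normal-flow constraint for edge features described above is omitted for brevity, and the function and parameter names are illustrative only.

```python
import numpy as np

def fit_affine(pts, flows):
    """Least-squares affine model mapping a point (x, y) to its optical flow:
    dx = a*x + b*y + c, dy = d*x + e*y + f."""
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    px = np.linalg.lstsq(A, flows[:, 0], rcond=None)[0]
    py = np.linalg.lstsq(A, flows[:, 1], rcond=None)[0]
    return np.concatenate([px, py])             # (a, b, c, d, e, f)

def residuals(params, pts, flows):
    """Fitting error of each feature under one affine model."""
    a, b, c, d, e, f = params
    pred = np.column_stack([a * pts[:, 0] + b * pts[:, 1] + c,
                            d * pts[:, 0] + e * pts[:, 1] + f])
    return np.linalg.norm(pred - flows, axis=1)

def cluster_two_layers(pts, flows, n_iter=20, seed=0):
    """Randomly split the features into two sets, then alternate between
    fitting an affine model to each set and reassigning every feature to the
    model with the smaller residual, until the labels stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=len(pts))
    models = [None, None]
    for _ in range(n_iter):
        for l in (0, 1):
            sel = labels == l
            if sel.sum() < 3:                   # degenerate cluster: reseed it randomly
                sel = rng.random(len(pts)) < 0.5
            models[l] = fit_affine(pts[sel], flows[sel])
        r = np.column_stack([residuals(m, pts, flows) for m in models])
        new_labels = np.argmin(r, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, models
```

In this sketch the two resulting models correspond to the two motion layers; which of them is the foreground layer is decided later by the heuristics in the Foreground Extraction section.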
Advantageously, and according to the principles of the present invention, we extend the feature clustering by linear regression from two frames to a few frames so that we can take advantage of the temporal consistency and achieve smoother and more reliable results. Our feature clustering is based upon affine motion models, which work best when the camera is moving and/or the foreground objects and the background objects have independent motion. While this is not always true between two frames, a few frames (such as 5-7 frames when the video frame rate is 6 frames per second) will usually provide enough motion information to distinguish the foreground and background layers.
We incorporate the temporal information by performing linear regression on a few consecutive frames jointly. Given m consecutive frames, we may solve for 2(m−1) sets of affine parameters together, where a pair of affine parameter sets (akl, bkl, ckl, dkl, ekl, fkl), k=1, . . . , m−1, l∈{1,2}, is solved between each pair of consecutive frames: k indexes the affine motion between frame k and frame k+1, and l denotes one of the two layers.
The connection between the sets of parameters is built upon the feature correspondences, which can be established through optical flow computation. When a new frame k is available, corner/edge features (xi, yi), i=1, . . . , n are detected first, then the optical flow (δxi, δyi) between frame k and k−1 is computed for each feature. For each feature i, the features detected in frame k−1 are searched for the one closest to the warped feature point (xi+δxi, yi+δyi); if the distance between the closest feature and the warped one is below some threshold, the correspondence is established. Otherwise, feature i is labeled as having “no match”. A connection is built between corresponding feature points so that they share the same layer label.
The initialization label for feature i is copied from the label of its corresponding point in frame k−1. For features with “no match”, the initialization label is taken from the nearest neighbor in frame k−1.
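A rough sketch of this correspondence search and label initialization is given below. The brute-force nearest-neighbor search, the distance threshold value, and the choice of matching the unwarped point for “no match” features are assumptions made for illustration.

```python
import numpy as np

def propagate_labels(prev_pts, prev_labels, cur_pts, cur_flows, max_dist=3.0):
    """Warp each current feature by its flow toward frame k-1, find the
    nearest previous feature, and copy its layer label if it lies within
    max_dist pixels; otherwise fall back to the spatially nearest previous
    feature ("no match" case)."""
    labels = np.empty(len(cur_pts), dtype=int)
    matched = np.zeros(len(cur_pts), dtype=bool)
    for i, (p, f) in enumerate(zip(cur_pts, cur_flows)):
        warped = p + f                                   # predicted location in frame k-1
        d = np.linalg.norm(prev_pts - warped, axis=1)
        j = int(np.argmin(d))
        if d[j] < max_dist:                              # correspondence established
            labels[i] = prev_labels[j]
            matched[i] = True
        else:                                            # "no match": nearest neighbor's label
            labels[i] = prev_labels[int(np.argmin(np.linalg.norm(prev_pts - p, axis=1)))]
    return labels, matched
```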
During the iterations of linear regression for each pair of frames, a joint residual is computed for each set of corresponding features across the frames, giving residuals r1i and r2i for feature i under the two affine models. Comparison of r1i and r2i determines which layer feature i belongs to. For “no match” points, the clustering is the same as in the two-frame method.
The joint solution of the sequence of linear regression problems naturally takes the temporal consistency into account, which makes the clustering results more reliable and smoother.
Foreground Refinement
Based upon the clustering results of sparse features, we first extract the foreground layer by a simple two-way scanning method, and then refine the layer extraction incrementally through Kalman filtering.
Foreground Extraction
Foreground extraction produces the dense output, i.e., a layer label for each pixel, given the sparse feature clustering. Accordingly, we first determine which layer is the foreground layer based on the following observations:
1. The foreground layer is closer to the camera; therefore, in most cases the affine parameters of the foreground layer have larger values. In a preferred embodiment, we check only the absolute values of the translation parameters |cl|+|fl|: the larger this value, the greater the likelihood that the layer is the foreground layer (see the sketch following this list). However, special cases exist when the camera is following the foreground object and the foreground barely moves. Advantageously, we could either compensate by calculating the camera motion—which is typically time consuming—or let the other characteristics weigh in on the determination.
2. The foreground layer is rarely cut into pieces; that is, the foreground layer is one or a few connected areas.
3. The background layer is scattered around the boundaries of the image.
4. If a human is present in the foreground, the foreground most likely has more skin-color pixels.
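A minimal sketch of observation 1's heuristic is shown below: it simply picks the layer whose affine translation magnitude |cl|+|fl| is larger. The tie-breaking by connectedness, boundary coverage, and skin color (observations 2-4) is not implemented here, and the names are illustrative.

```python
def pick_foreground(models):
    """Return the index (0 or 1) of the layer assumed to be the foreground,
    based only on the translation magnitude |c| + |f| of each affine model."""
    def translation(params):
        a, b, c, d, e, f = params
        return abs(c) + abs(f)
    return 0 if translation(models[0]) >= translation(models[1]) else 1
```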
As can be appreciated, we could build the foreground layer extraction upon color segmentation results. For each segment, the features covered by that segment would vote on which layer the segment belongs to. Advantageously, this approach provides smooth foreground outlines but exhibits two main drawbacks. First, there are some segments without enough feature coverage whose labels could not be determined. Second, the color segmentation itself is quite computationally intensive.
According to the present invention, we employ a two-way scan method to assign each pixel to one of the two layers. This two-way scan comprises both an x-scan and a y-scan, whereby the x-scan works over each row of the image to determine the cutting points between layers in the x dimension. That is, the method locates the shift points between the background layer and the foreground layer in order to generate a few foreground line segments for each row of the image. The same process is performed for the y-scan, except the cutting points are determined for layers in the y dimension.
The two scan images are combined in an aggressive way to grow the foreground layer: if a pixel is labeled “foreground” in either the x-scan image or the y-scan image, it is labeled “foreground” in the final result. We then use a flood-fill algorithm to generate the dense output, followed by a few rounds of image morphing operations to denoise.
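The scan combination might be sketched roughly as below. Each scan is simplified here to marking everything between the outermost foreground-labeled feature pixels of a row (or column), which is only an approximation of the cutting-point logic described above; the flood fill and morphological denoising are omitted.

```python
import numpy as np

def scan_rows(label_img):
    """For each row, mark everything between the first and last
    foreground-labeled feature pixel as foreground (simplified x-scan)."""
    out = np.zeros_like(label_img, dtype=bool)
    for r in range(label_img.shape[0]):
        cols = np.flatnonzero(label_img[r])
        if cols.size:
            out[r, cols[0]:cols[-1] + 1] = True
    return out

def two_way_scan(label_img):
    """Combine the x-scan and y-scan aggressively: a pixel is foreground if
    either scan marks it as foreground."""
    x_scan = scan_rows(label_img)                 # row-wise (x dimension)
    y_scan = scan_rows(label_img.T).T             # column-wise (y dimension)
    return x_scan | y_scan
```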
Refinement by Kalman Filtering
Although we have incorporated temporal information in the sequential feature clustering, some errors in feature labeling still exist, which could make the dense output appear “jumpy,” as depicted in the accompanying drawing.
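The text does not spell out the state model used for this Kalman refinement, so the following is only a sketch under the assumption that scalar parameters of the extracted foreground region (for example, bounding-box coordinates) are smoothed independently with a constant-position Kalman filter to suppress frame-to-frame jitter.

```python
class ScalarKalman:
    """One-dimensional constant-position Kalman filter. Each tracked scalar
    (an assumed stand-in for a foreground-region parameter) is predicted to
    stay constant and then corrected by the new measurement."""
    def __init__(self, x0, p0=1.0, q=0.1, r=1.0):
        self.x, self.p, self.q, self.r = x0, p0, q, r   # state, variance, process/measurement noise

    def update(self, z):
        self.p += self.q                 # predict: only the uncertainty grows
        k = self.p / (self.p + self.r)   # Kalman gain
        self.x += k * (z - self.x)       # correct with measurement z
        self.p *= (1.0 - k)
        return self.x
```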
Experimental Results
An exemplary implementation of a segmentation method according to the present invention was tested and simulated on real videos taken under different lighting conditions and camera motions. In particular, we show two examples captured by a lightweight Creative webcam. The resolution of the images is 640×480 pixels and the frame rate is 6 frames per second. As can be readily appreciated by those skilled in the art, the quality of webcam images is close to that exhibited by cell phone video cameras. Finally, for the purposes of these tests, we allow the webcam to move while capturing video images and make no initial assumption about whether the foreground or background is static, nor about its composition.
A first sequence was taken of a rigid scene while the camera was moving. The scene is composed of a box of tapes positioned closer to the camera as the foreground object, and a flat background. Due to the low quality and limited viewing angle of the webcam, the object was very close to the camera when the video was taken. Therefore, there existed some distortions, as shown in the accompanying drawing.
A second sequence was taken of a person moving and talking in front of the camera while holding the camera himself. The camera was shaking randomly with the person's movement, and most of the facial features were undergoing non-rigid motions. In addition, there were blurred areas in the video where feature tracking exhibited large errors. Since the method works on sequential feature clustering and incremental refinement by Kalman filtering, such temporally local blurring could be corrected over time.
As now apparent to those skilled in the art and in accordance with an aspect of the present invention, we have described a segmentation method to extract foreground objects from background objects in a video scene. Advantageously, the method may be applied to Television (TV), telephone images, and video conference images to—for example—hide background information for privacy, or hallucinate a new background for entertainment. Compared with image matting methods which require large amounts of manual (human) input, the method of the present invention is fully automatic.
In sharp contrast with motion layer methods which assume that objects are rigid, the method of the present invention assumes that there are two motion layers due to camera and/or foreground motion and the depth difference between foreground and background. The computational cost of the method of the present invention is modest since it is based upon sequential clustering of sparse image features, whereas prior art methods typically work on pixels or color segments. In addition to corner features, the present invention also uses edge features to capture the outline of the foreground objects. The foreground layer is then extracted based on the sparse feature clustering which—as we have noted—is quite computationally and time efficient.
Significantly, the method of the present invention takes advantage of the temporal information by applying a sequential linear regression approach to the sequences of instantaneous displacements of image features in order to compute the affine motion parameters for the foreground and background layers. The foreground layers are also refined incrementally using Kalman filtering.
The experimental results on the webcam are promising. And while the present invention has been described with these applications in mind, those skilled in the art will of course recognize that the present invention is not limited to those examples shown and described. Any video composition—particularly those where computation power is limited—is a candidate for the method of the present invention. Accordingly, our invention should be only limited by the scope of the claims attached hereto.
Inventors: Wei Xu, Mei Han, Yihong Gong