Disclosed herein are techniques and systems for computing geodesic saliency of images using background priors. An input image may be segmented into a plurality of patches, and a graph associated with the image may be generated, the graph comprising nodes and edges. The nodes of the graph include nodes that correspond to the plurality of patches of the image plus an additional virtual background node that is added to the graph. The graph further includes edges that connect the nodes to each other, including internal edges between adjacent patches and boundary edges between those patches at the boundary of the image and the virtual background node. Using this graph, a saliency value, called the “geodesic” saliency, for each patch of the image is determined as a length of a shortest path from a respective patch to the virtual background node.
1. A method comprising:
segmenting, by one or more processors, an image having an array of image pixels into a plurality of patches, each patch including one or more of the image pixels;
generating a graph of the image, the graph comprising a set of patch nodes and a set of internal edges connecting the patch nodes to each other, the set of patch nodes corresponding to the plurality of patches and including a subset of patch nodes corresponding to boundary patches at a boundary of the image;
adding a virtual background node to the set of patch nodes of the graph;
connecting the subset of patch nodes corresponding to the boundary patches to the virtual background node by a set of boundary edges;
computing, by the one or more processors, a length of a shortest path from each patch node to the virtual background node; and
designating respective lengths as a saliency value for each patch to create a saliency map of the image.
2. The method of
3. The method of
4. The method of
5. The method of
for each of the plurality of patches:
determining appearance distances between the patch and each patch neighboring the patch; and
selecting a smallest appearance distance among the determined appearance distances;
from the smallest appearance distances selected for the plurality of patches, designating a median value of the smallest appearance distances as a threshold; and
setting internal edge weights to zero for any of the internal edge weights that are associated with appearance distances that are below the threshold.
6. The method of
8. The method of
9. The method of
10. A system comprising:
one or more processors; and
one or more memories comprising:
an image segmentation module maintained in the one or more memories and executable by the one or more processors to segment an image into a plurality of patches;
a graph generator maintained in the one or more memories and executable by the one or more processors to generate a graph of the image, the graph comprising a set of patch nodes corresponding to the plurality of patches, and to add a virtual background node to the set of patch nodes of the graph; and
a saliency computation module maintained in the one or more memories and executable by the one or more processors to compute, for each of the plurality of patches, a saliency value as a length of a shortest path from an individual patch node to the virtual background node, and to create a saliency map of the image based at least in part on the saliency values.
11. The system of
12. The system of
13. The system of
14. The system of
15. The system of
16. One or more computer storage media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
segmenting an image into a plurality of patches;
generating a graph of the image, the graph comprising a set of patch nodes corresponding to the plurality of patches;
adding a virtual background node to the set of patch nodes of the graph;
for each of the plurality of patches, computing a saliency value as a length of a shortest path from an individual patch node to the virtual background node; and
creating a saliency map of the image based at least in part on the saliency values.
17. The one or more computer storage media of
18. The one or more computer storage media of
19. The one or more computer storage media of
comparing the internal edge weights to a threshold, and,
for any of the internal edge weights that are below the threshold, setting the internal edge weights to zero.
20. The one or more computer storage media of
This application is a non-provisional of, and claims priority to PCT Application No. PCT/CN2013/080491, filed on Jul. 31, 2013, which is incorporated by reference herein in its entirety.
The human vision system can rapidly and accurately identify important regions in its visual field. In order to replicate this capability in computer vision, various saliency detection methods have been developed to find pixels or regions in an input image that are of the highest visual interest or importance. Often the “important” pixels/regions carry some semantic meaning, such as being part of an object (e.g., person, animal, structure, etc.) in the foreground of the image that stands out from the background of the image. Object level saliency detection can be used for various computer vision tasks, such as image summarization and retargeting, image thumbnail generation, image cropping, object segmentation for image editing, object matching and retrieval, object detection and recognition, to name a few.
Although the general concept of computing saliency of an input image seems logical and straightforward, saliency detection is actually quite difficult in the field of computer vision due to the inherent subjectivity of the term “saliency.” That is, the answer to the question of what makes a pixel/region of an image more or less salient can be highly subjective, poorly defined, and application dependent, making the task of saliency detection quite challenging.
Current techniques for detecting saliency in an image have tried to tackle the problem by using various “bottom-up” computational models that predominantly rely on assumptions (or priors) of the image relating to the contrast between pixels/regions of the image. That is, current saliency detection algorithms rely on the assumption that appearance contrast between objects in the foreground and the background of the image will be relatively high. Thus, a salient image pixel/patch will present high contrast within a certain context (e.g., in a local neighborhood of the pixel/patch, globally, etc.). This known assumption is sometimes referred to herein as the “contrast prior.”
However, detecting saliency in an image using the contrast prior alone is insufficient for accurate saliency detection because the resulting saliency maps tend to be very different and inconsistent across implementations. In some cases, the interiors of objects are attenuated or are not highlighted uniformly. A common definition of “what saliency is” is still lacking in the field of computer vision, and the contrast prior alone is unlikely to generate accurate saliency maps of images.
Described herein are techniques and systems for computing geodesic saliency of images using background priors. Embodiments disclosed herein focus on the background, as opposed to focusing on the object, by exploiting assumptions (or priors) about what common backgrounds should look like in natural images simultaneously with the contrast prior. These background priors naturally provide more clues as to the salient regions of an image.
In some embodiments, systems, computer-readable media and processes for creating a saliency map of an input image are disclosed where the process includes segmenting the input image into a plurality of patches, and generating a graph associated with the image comprised of nodes and edges. In some embodiments, the patches correspond to regions of the image comprised of multiple pixels, but the process may be implemented with single-pixel segmentation, or patches of a single image pixel. The nodes of the graph include nodes that correspond to the plurality of patches of the image plus an additional virtual background node that is added to the set of nodes of the graph. The graph further includes edges that connect the nodes to each other, including internal edges between adjacent patches and boundary edges between those patches at the boundary of the image and the virtual background node. Using this graph, a saliency value for each patch of the image is determined as a length of a shortest path (i.e., geodesic distance) from a respective patch to the virtual background node. Thus, the saliency measure disclosed herein is sometimes called the “geodesic saliency.”
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same reference numbers in different figures indicate similar or identical items.
Embodiments of the present disclosure are directed to, among other things, techniques and systems for saliency detection in images, and more particularly to determining object-level saliency of an input image using a geodesic saliency measure that is based in part on background priors. Embodiments disclosed herein find particular application for computer vision applications that benefit from object detection, although the applications described herein are provided merely as examples and not as a limitation. As those skilled in the art will appreciate, the techniques and systems disclosed herein are suitable for application in a variety of different types of computer vision and image processing systems. In addition, although input images are discussed primarily in terms of natural photographs or digital images, it is to be appreciated that the input images may include various types of images, such as video images/frames, medical images, infrared images, X-ray images, or any other suitable type of image.
The techniques and systems described herein may be implemented in a number of ways. Example implementations are provided below with reference to the following figures.
Example Architecture
The geodesic saliency computation system 202 may be configured to receive images 204, compute saliency of those images 204, and output saliency maps 206 for those images 204 reflecting the computed saliency. The input images 204 may be provided by an image capture device of any suitable type, such as a camera, medical imaging device, or video camera, that may be part of, or separate from, the geodesic saliency computation system 202. In some instances, the input images 204 may be received via a communications link, disk drive, Universal Serial Bus (USB) connection, or other suitable input means to input previously obtained images 204 or images 204 obtained in real-time.
The output saliency maps 206 are generally the same size as the input images 204, and they present a visual representation of the saliency (i.e., importance or visual interest) of each image element (e.g., pixel, or group of pixels) of the input image 204 by showing an intensity value at each image element. That is, each point in the saliency map 206 is represented by a number (e.g., a real number from 0 to 1) that is indicative of the saliency of the corresponding image element in the image 204. For example, a saliency value of 1 (e.g., object element) indicates that the image element is of significant interest, and it may be visually represented as a white image element with a maximum intensity, whereas a saliency value of 0 (e.g., background element) indicates that the image element is of no interest, and it may be visually represented as a black image element with a minimum intensity. Saliency values between 0 and 1 are gradients on a spectrum of saliency values between maximum and minimum intensities that may be indicative of image elements that are of some importance. The ideal saliency map reflects the ground truth mask (e.g., the ground truth masks shown in column 102 of FIG. 1).
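By way of illustration and not limitation, this value-to-intensity mapping can be sketched in a few lines of Python. NumPy, the function name, and the 8-bit grayscale output format are assumptions of this sketch rather than details of the disclosed system:

```python
import numpy as np

def saliency_to_grayscale(saliency_map: np.ndarray) -> np.ndarray:
    """Render a saliency map of real values in [0, 1] as an 8-bit intensity image.

    A value of 1 (object element) maps to white (255), a value of 0
    (background element) maps to black (0), and intermediate values fall
    on the grayscale gradient in between.
    """
    clipped = np.clip(saliency_map, 0.0, 1.0)
    return (clipped * 255.0).round().astype(np.uint8)

# Example: a 2x2 map with one highly salient element.
demo = np.array([[0.0, 0.25], [0.5, 1.0]])
print(saliency_to_grayscale(demo))  # [[  0  64]
                                    #  [128 255]]
```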
In a typical scenario, the input image 204 received by the geodesic saliency computation system 202 is a natural image that includes one or more objects in the field of view of the image 204 that may be of high visual interest.
For saliency detection, instead of asking what the salient object(s) is, the approach of the embodiments disclosed herein asks the opposite question; namely, what part of the image is not salient (i.e., what is the background)? To answer this question, the disclosed embodiments utilize two priors (or common knowledge) about common backgrounds in natural images: (1) a boundary prior, and (2) a connectivity prior.
The boundary prior comes from a basic rule of photographic composition: most photographers will not crop salient objects along the view frame. In other words, the image boundary is most often substantially background; hence the name “boundary” prior. For example, the regions along the boundary of the input image 300 of FIG. 3 are substantially background.
The connectivity prior comes from the appearance characteristics of real-world background images, and is based on the notion that background regions are usually large, continuous, and homogeneous. In other words, most image regions in the background can be easily connected to each other. Additionally, connectivity is piecewise. For example, sky and grass regions in the background are each homogeneous, but an inter-region connection between the sky and grass regions is more difficult. Furthermore, homogeneity of background appearance is to be interpreted in terms of human perception. For example, regions in the grass all look visually similar to humans, although their pixel-wise intensities might be quite different. The connectivity prior is not to be confused with the connectivity prior commonly used in object segmentation, which is an assumption about the spatial continuity of the object. Instead, the connectivity prior disclosed herein is based on common knowledge of the background, not the object. In some cases, background regions of natural images are out of focus, which supports the connectivity prior to an even greater degree, since out-of-focus backgrounds tend to be more blurred and homogeneous by nature.
With these two background priors in mind, and in light of known contrast priors used in saliency detection methods, it can be observed that most background regions can be easily connected to image boundaries. This cannot be said for object regions, which tend to be more difficult to connect to the image boundaries. Accordingly, the saliency of an image region may be defined, at least in some cases, as a length of a shortest path to the image boundary.
With the patches 304 defined, a graph G may be generated and associated with the image 300, where the graph G is comprised of nodes (or vertices) V and edges E. The nodes V of the graph G include nodes that correspond to the plurality of patches 304 of the image 300. The graph G further includes edges E that connect adjacent ones of the nodes V to each other. Using this graph G, a saliency value for each patch 304 of the image 300 may be determined as a length of a shortest path from a respective patch 304 to the image boundary, such as the paths 302(1)-(3) shown in FIG. 3.
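For illustration only, the following Python sketch enumerates a regular grid of patches and the internal edges between 4-adjacent patches. The grid layout, the patch-size parameter, and the function name are assumptions of this sketch, not details prescribed by the disclosure:

```python
import itertools
import numpy as np

def make_patch_grid(image: np.ndarray, patch_size: int):
    """Split an image into a grid of square patches and enumerate the
    internal edges between 4-adjacent patches.

    Each patch is identified by its (row, col) position in the grid; any
    pixels left over when the image dimensions are not multiples of
    patch_size are ignored for simplicity in this sketch.
    """
    rows = image.shape[0] // patch_size
    cols = image.shape[1] // patch_size
    internal_edges = []
    for r, c in itertools.product(range(rows), range(cols)):
        if r + 1 < rows:
            internal_edges.append(((r, c), (r + 1, c)))  # vertical neighbor
        if c + 1 < cols:
            internal_edges.append(((r, c), (r, c + 1)))  # horizontal neighbor
    return rows, cols, internal_edges
```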
However, the technique illustrated by FIG. 3 relies on the boundary prior strictly holding for every boundary patch, which is not robust in all cases; a salient object may touch part of the image boundary, in which case patches of that object can be connected to the image boundary at low cost and may receive erroneously low saliency values.
Accordingly, the saliency measure of the embodiments disclosed herein can be made more robust by adding a virtual background node (or vertex) 406 to the nodes V of the graph G. The virtual background node 406 may be connected to all of those nodes that correspond to patches 404 at the boundary of the image 400.
The graph G that is generated to represent the image 400 of FIG. 4 is an undirected, weighted graph in which the nodes V include a node for each of the patches 404 plus the virtual background node 406, and in which the edges E include internal edges between the nodes of adjacent patches 404 and boundary edges between the nodes of the boundary patches and the virtual background node 406.
Two nodes are adjacent when they are both incident to a common edge. The edges E are also associated with weights (sometimes called “labels” or “costs,” and sometimes abbreviated as “wt.”) that may be real numbers. In some embodiments, the weights of the edges E may be restricted to rational numbers or integers. In yet further embodiments, edge weights may be restricted to positive weights. Whatever their form, edge weights act as a measure of distance between any two nodes in the graph G. That is, determining a geodesic distance (i.e., a shortest path) includes determining a path between a node V corresponding to a given patch 404 and the virtual background node 406 such that the sum of the weights of its constituent edges E is minimized.
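A minimal adjacency-list sketch of such a graph, assuming the grid-of-patches layout from the previous sketch, might look as follows. The `internal_weight` and `boundary_weight` callables are placeholders for the appearance-based weights discussed below, and all names are illustrative:

```python
from collections import defaultdict

BACKGROUND = "B"  # label for the single virtual background node

def build_graph(rows, cols, internal_weight, boundary_weight):
    """Build an undirected weighted graph over a rows x cols grid of patches.

    Patch nodes are (row, col) tuples. Every patch on the image boundary is
    additionally connected to the virtual background node by a boundary edge.
    The two callables supply edge weights and stand in for the
    appearance-based weights discussed in the text.
    """
    graph = defaultdict(list)  # node -> list of (neighbor, weight) pairs

    def add_edge(u, v, w):
        graph[u].append((v, w))
        graph[v].append((u, w))  # undirected: record both directions

    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:  # internal edge to the patch below
                add_edge((r, c), (r + 1, c), internal_weight((r, c), (r + 1, c)))
            if c + 1 < cols:  # internal edge to the patch to the right
                add_edge((r, c), (r, c + 1), internal_weight((r, c), (r, c + 1)))
            if r in (0, rows - 1) or c in (0, cols - 1):  # boundary patch
                add_edge((r, c), BACKGROUND, boundary_weight((r, c)))
    return graph
```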
Accordingly, the geodesic saliency of a patch P may be computed according to Equation (1) as the accumulated edge weights along the shortest path from P to the virtual background node B on the graph G:
$$\mathrm{saliency}(P) \;=\; \min_{P_1 = P,\, P_2,\, \ldots,\, P_n} \left( \sum_{i=1}^{n-1} \mathrm{weight}(P_i, P_{i+1}) \;+\; \mathrm{weight}(P_n, B) \right) \tag{1}$$
Here, $P_i$ is adjacent to $P_{i+1}$, and $P_n$ is connected by a boundary edge to $B$, the virtual background node 406. Equation (1) can be generalized as a “single-pair shortest path problem” where, given the edge weights of the undirected graph $G$, the shortest path from patch $P$ in Equation (1) to the virtual background node $B$ is the path $(P_1, P_2, \ldots, P_n, B)$ that, over all possible $n$, minimizes the sum of the edge weights of edges incident to adjacent nodes along the path from $P$ to $B$, where $P_1 = P$. The minimized sum of the edge weights is the geodesic distance between the patch $P$ and the virtual background node $B$, and the geodesic distance is said to be the length of this shortest path.
It is to be appreciated that various algorithms may be utilized to solve the single-pair shortest path problem, and Equation (1) is but one example algorithm to find the length of the shortest path from a node corresponding to a given patch 404 to the virtual background node 406. Some example alternative algorithms include, but are not limited to, the approximate shortest path algorithm described in P. J. Toivanen: “New geodesic distance transforms for gray-scale images,” Pattern Recognition Letters 17 (1996) 437-450, Dijkstra's algorithm, and the A* search algorithm. Such algorithms are known to a person having ordinary skill in the art and are not explained further herein for conciseness.
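Of the options listed above, Dijkstra's algorithm is perhaps the most direct to sketch. Because the graph is undirected and the edge weights are non-negative, a single run seeded at the virtual background node yields the geodesic saliency of every patch at once. The following is an illustrative sketch over the adjacency-list graph of the previous example, not a reference implementation:

```python
import heapq
import itertools

def geodesic_saliency(graph, background="B"):
    """Length of the shortest path from every patch node to the virtual
    background node, computed with Dijkstra's algorithm (non-negative
    edge weights assumed).

    Seeding the search at the background node exploits the fact that the
    graph is undirected: one run returns the geodesic saliency of all
    patches simultaneously.
    """
    tiebreak = itertools.count()  # keeps the heap from comparing node labels
    dist = {background: 0.0}
    heap = [(0.0, next(tiebreak), background)]
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry left over from an earlier relaxation
        for v, w in graph[u]:
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, next(tiebreak), v))
    dist.pop(background, None)  # report patches only
    return dist  # patch node -> geodesic saliency value
```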
In some embodiments, internal edge weights (i.e., weights of edges incident to adjacent nodes corresponding to two adjacent patches 404 of the image 400) may be computed as the appearance distance between adjacent patches 404 of the image 400. This distance measure should be consistent with human perception of how similar two patches are from a visual perspective; the more similar the adjacent patches, the smaller the internal edge weight of the edge incident on the adjacent patch nodes. Conversely, the more dissimilar the adjacent patches, the larger the internal edge weight of the edge between them. For example, a background patch can be smoothly/easily connected to the virtual background node 406 without too much cost. By contrast, a foreground patch is more difficult to connect to the virtual background node 406 because the visual dissimilarity between the foreground and the background is usually very high. Thus, any path from inside an object in the image 400 is likely to go through a very “high cost” edge, which will make the shortest path from a patch inside the object to the virtual background node 406 more costly. In some embodiments, the patch appearance distance is taken as the difference (normalized to [0,1]) between the mean colors of two patches (e.g., in LAB color space), or as the color histogram distance. However, any suitable patch appearance distance measure may be utilized without changing the basic characteristics of the system.
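As one illustration of such a distance measure, the sketch below computes the normalized Euclidean distance between the mean colors of two patches. The LAB conversion is assumed to have been performed upstream, and the `max_distance` normalization constant is an assumption of this sketch supplied by the caller:

```python
import numpy as np

def patch_appearance_distance(patch_a: np.ndarray, patch_b: np.ndarray,
                              max_distance: float) -> float:
    """Appearance distance between two patches as the Euclidean distance
    between their mean colors, normalized to [0, 1].

    The patches are (height, width, 3) arrays, assumed to have been
    converted to a perceptual color space such as LAB upstream;
    max_distance is the largest possible mean-color distance in that
    space and is supplied by the caller as a normalization constant.
    """
    mean_a = patch_a.reshape(-1, patch_a.shape[-1]).mean(axis=0)
    mean_b = patch_b.reshape(-1, patch_b.shape[-1]).mean(axis=0)
    return float(np.linalg.norm(mean_a - mean_b) / max_distance)
```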
Even for homogeneous backgrounds, simple appearance distances, such as color histogram distances, although usually small, are non-zero values. This causes a “small-weight-accumulation problem” where many internal edges with small weights can accumulate along a relatively long path from a patch at the center of the image 400 to the virtual background node 406. This may cause undesirably high saliency values in the center of the background.
To address the small-weight-accumulation problem illustrated in the geodesic saliency map 602, a “weight-clipping technique” can be utilized whereby internal edge weights are clipped, or otherwise set, to zero if they are smaller than a threshold. The weight-clipping technique disclosed herein includes determining the internal edge weights between each pair of adjacent patches of the image, such as the image 600 of FIG. 6, deriving a threshold from those weights, and setting to zero any internal edge weights that fall below the threshold, as described in further detail with reference to steps 902-908 below.
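The weight-clipping technique can be sketched as follows, mirroring the steps spelled out at 902-908 below: determine each patch's smallest appearance distance to its neighbors, take the median of those per-patch minima as the threshold, and zero out every below-threshold weight. The data structures and names are illustrative:

```python
import statistics

def clip_internal_weights(weights, neighbors):
    """Zero out insignificant internal edge weights.

    For each patch, find the smallest appearance distance to any of its
    neighbors; take the median of those per-patch minima as the threshold;
    then set to zero every internal edge weight that falls below the
    threshold. `weights` maps an edge (u, v) to its appearance distance;
    `neighbors` maps each patch to its adjacent patches.
    """
    def edge(u, v):  # edges are stored in one orientation only
        return (u, v) if (u, v) in weights else (v, u)

    per_patch_min = [min(weights[edge(u, v)] for v in neighbors[u])
                     for u in neighbors]
    threshold = statistics.median(per_patch_min)
    return {e: (0.0 if w < threshold else w) for e, w in weights.items()}

# Example: patches A and B look alike; B-C and C-D do not.
w = {("A", "B"): 0.05, ("B", "C"): 0.8, ("C", "D"): 0.9}
nbrs = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(clip_internal_weights(w, nbrs))
# {('A', 'B'): 0.0, ('B', 'C'): 0.8, ('C', 'D'): 0.9}  (threshold 0.425)
```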
Measuring geodesic saliency using regular-shaped patches (e.g., rectangular, square, triangle, etc.), such as the patches 404 of FIG. 4, is computationally efficient, although the boundaries of regular-shaped patches do not always align well with the boundaries of objects in the image; irregularly-shaped patches, such as superpixels that adapt to the underlying image structure, may be used instead.
Measuring geodesic saliency using the aforementioned superpixels tends to produce saliency maps that follow object boundaries more closely, at the cost of additional computation during segmentation.
Example Processes
At 802, the geodesic saliency computation system 202 receives an input image, such as the input image 400 of FIG. 4. At 804, the input image may be segmented into a plurality of patches, such as the patches 404 of FIG. 4.
At 806, a virtual background node B, such as the virtual background node 406, may be added to the graph G, wherein the virtual background node B is connected via boundary edges to the nodes V that correspond to the patches at the image boundary, such as the boundary patches 500 of FIG. 5.
At 808, the saliency of each patch may be computed, by the geodesic saliency computation system 202, as a length of a shortest path to the virtual background node B. Any suitable algorithm may be used for the shortest path determination, and the algorithm of Equation (1) is exemplary.
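Pulling the foregoing sketches together, a hypothetical end-to-end run of the process might look like the following. The synthetic image, the uniform placeholder weights, and the zero-cost boundary edges are assumptions of this sketch, not parameters taught by the disclosure:

```python
import numpy as np

# Reuses make_patch_grid, build_graph, and geodesic_saliency from the
# sketches above; a synthetic image stands in for a received input (802).
image = np.random.rand(120, 160, 3)

rows, cols, _ = make_patch_grid(image, patch_size=20)   # segmentation (804)

graph = build_graph(rows, cols,
                    internal_weight=lambda u, v: 1.0,   # placeholder weights
                    boundary_weight=lambda u: 0.0)      # boundary edges (806)

saliency = geodesic_saliency(graph)                     # shortest paths to B (808)
print(saliency[(rows // 2, cols // 2)])                 # center patch -> 2.0 here
```

With these placeholder weights, a patch's saliency reduces to its grid distance from the nearest boundary patch, which is why the 6x8 grid's center patch reports 2.0; appearance-based weights from the earlier sketches would yield the content-dependent saliency values described above.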
At 902, the geodesic saliency computation system 202 may determine appearance distances between each patch of a segmented image and its neighboring patches. For example, the segmented image generated after step 804 of FIG. 8 may serve as the segmented image whose appearance distances are determined at 902.
At 904, the geodesic saliency computation system 202 may select a smallest appearance distance among the appearance distances determined at 902 for each patch. That is, for a given patch, the smallest appearance distance among the given patch and each of its neighbors is selected at 904.
At 906, all of the smallest appearance distances that were selected at 904 are collected to determine a median value of the smallest appearance distances from all of the patches. This median value is then set as an insignificance distance threshold. At 908, the appearance distances determined at 902 are compared to the threshold determined at 906, and any appearance distances that are below the threshold are clipped, or otherwise set, to zero.
Example Computing Device
In at least one configuration, the computing device 1002 comprises the one or more processors 1004 and computer-readable media 1006. The computing device 1002 may also include one or more input devices 1008 and one or more output devices 1010. The input devices 1008 may be a camera, keyboard, mouse, pen, voice input device, touch input device, etc., and the output devices 1010 may be a display, speakers, printer, etc. coupled communicatively to the processor(s) 1004 and the computer-readable media 1006. The output devices 1010 may be configured to facilitate output or otherwise rendering the saliency map(s) 206 of
The computing device 1002 may have additional features and/or functionality. For example, the computing device 1002 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage may include removable storage 1016 and/or non-removable storage 1018. Computer-readable media 1006 may include at least two types of computer-readable media 1006, namely computer storage media and communication media. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. The computer readable media 1006, the removable storage 1016 and the non-removable storage 1018 are all examples of computer storage media. Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store the desired information and which can be accessed by the computing device 1002. Any such computer storage media may be part of the computing device 1002. Moreover, the computer-readable media 1006 may include computer-executable instructions that, when executed by the processor(s) 1004, perform various functions and/or operations described herein.
In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer storage media does not include communication media.
The computer-readable media 1006 of the computing device 1002 may store an operating system 1020, a geodesic saliency computation engine 1022 with its various modules and components, and may include program data 1024. The geodesic saliency computation engine 1022 may include an image segmentation module 1026 to segment input images into a plurality of patches, as described herein, a graph generator 1028 to generate a graph G, with patch nodes V and a virtual boundary node B and edges E therebetween, as described herein, a weight clipping module 1030 to clip internal edge weights below a threshold to zero, as described herein, and a saliency computation module 1032 to compute saliency values for each of the plurality of patches as a shortest path to the virtual background node B.
The environment and individual elements described herein may of course include many other logical, programmatic, and physical components, of which those shown in the accompanying figures are merely examples that are related to the discussion herein.
The various techniques described herein are assumed in the given examples to be implemented in the general context of computer-executable instructions or software, such as program modules, that are stored in computer-readable storage and executed by the processor(s) of one or more computers or other devices such as those illustrated in the figures. Generally, program modules include routines, programs, objects, components, data structures, etc., and define operating logic for performing particular tasks or implement particular abstract data types.
Other architectures may be used to implement the described functionality, and are intended to be within the scope of this disclosure. Furthermore, although specific distributions of responsibilities are defined above for purposes of discussion, the various functions and responsibilities might be distributed and divided in different ways, depending on circumstances.
Similarly, software may be stored and distributed in various ways and using different means, and the particular software storage and execution configurations described above may be varied in many different ways. Thus, software implementing the techniques described above may be distributed on various types of computer-readable media, not limited to the forms of memory that are specifically described.
In closing, although the various embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter.
Inventors: Jian Sun; Fang Wen; Yichen Wei