A process recomputes zones for a scene. The process is performed at a computing device having one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. The process receives a first image of a scene taken by an array of image sensors of a camera system at a first time and receives designation from a user of a zone within the first image. The process also receives a second image of the scene taken by the array of image sensors at a second time that is after the first time. The process compares the first and second images to identify movement of the camera and notifies the user about a change to the zone when the camera has moved.
|
1. A method of recomputing zones for a scene, comprising:
at a computing device having one or more processors, and memory storing one or more programs configured for execution by the one or more processors:
receiving a first plurality of images of a scene captured by an array of image sensors of a camera system at a first time, wherein each of the first plurality of images is captured when a different subset of illuminators of the camera system are emitting light;
receiving designation from a user of a zone within a first image of the first plurality of images;
receiving a second plurality of images of the scene captured by the array of image sensors at a second time that is after the first time, wherein each of the second plurality of images is captured when a different subset of illuminators of the camera system are emitting light;
building a first depth map of the scene using the first plurality of images;
building a second depth map of the scene using the second plurality of images;
comparing points in the first depth map to points in the second depth map to identify movement of the camera; and
notifying the user about a change to the zone when the camera has moved.
12. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computing device having one or more processors and memory, the one or more programs comprising instructions for:
receiving a first plurality of images of a scene captured by an array of image sensors of a camera system at a first time, wherein each of the first plurality of images is captured when a different subset of illuminators of the camera system are emitting light;
receiving designation from a user of a zone within a first image of the first plurality of images;
receiving a second plurality of images of the scene captured by the array of image sensors at a second time that is after the first time, wherein each of the second plurality of images is captured when a different subset of illuminators of the camera system are emitting light;
building a first depth map of the scene using the first plurality of images;
building a second depth map of the scene using the second plurality of images;
comparing points in the first depth map to points in the second depth map to identify movement of the camera; and
notifying the user about a change to the zone when the camera has moved.
9. A computing device, comprising:
one or more processors;
memory; and
one or more programs stored in the memory configured for execution by the one or more processors, the one or more programs comprising instructions for:
receiving a first plurality of images of a scene captured by an array of image sensors of a camera system at a first time, wherein each of the first plurality of images is captured when a different subset of illuminators of the camera system are emitting light;
receiving designation from a user of a zone within a first image of the first plurality of images;
receiving a second plurality of images of the scene captured by the array of image sensors at a second time that is after the first time, wherein each of the second plurality of images is captured when a different subset of illuminators of the camera system are emitting light;
building a first depth map of the scene using the first plurality of images;
building a second depth map of the scene using the second plurality of images;
comparing points in the first depth map to points in the second depth map to identify movement of the camera; and
notifying the user about a change to the zone when the camera has moved.
2. The method of
when movement is identified and there is substantial overlap between portions of the scene as captured in the first plurality of images and the second plurality of images, notifying the user includes identifying an adjusted zone in a second image of the second plurality of images, wherein the adjusted zone corresponds to the zone in the first image, and recommending to the user to replace the zone with the adjusted zone.
3. The method of
when movement is identified and there is not substantial overlap between portions of the scene as captured in the first plurality of images and the second plurality of images, notifying the user includes recommending removing the zone.
4. The method of
5. The method of
6. The method of
partitioning the image sensors into a plurality of pixels; and
for each pixel:
forming a respective vector of the received first plurality of images at the respective pixel; and
estimating a depth in the scene at the respective pixel by looking up the respective vector in a respective lookup table.
7. The method of
forming a first point cloud using a first plurality of points from the first depth map;
forming a second point cloud using a second plurality of points from the second depth map; and
computing a minimal transformation that aligns the first point cloud with the second point cloud.
10. The computing device of
when movement is identified and there is substantial overlap between portions of the scene as captured in the first plurality of images and the second plurality of images, notifying the user includes identifying an adjusted zone in a second image of the second plurality of images, wherein the adjusted zone corresponds to the zone in the first image, and recommending to the user to replace the zone with the adjusted zone.
11. The computing device of
when movement is identified and there is not substantial overlap between portions of the scene as captured in the first plurality of images and the second plurality of images, notifying the user includes recommending removing the zone.
13. The computing device of
partitioning the image sensors into a plurality of pixels; and
for each pixel:
forming a respective vector of the received first plurality of images at the respective pixel; and
estimating a depth in the scene at the respective pixel by looking up the respective vector in a respective lookup table.
14. The computing device of
forming a first point cloud using a first plurality of points from the first depth map;
forming a second point cloud using a second plurality of points from the second depth map; and
computing a minimal transformation that aligns the first point cloud with the second point cloud.
16. The computer readable storage medium of
when movement is identified and there is substantial overlap between portions of the scene as captured in the first plurality of images and the second plurality of images, notifying the user includes identifying an adjusted zone in a second image of the second plurality of images, wherein the adjusted zone corresponds to the zone in the first image, and recommending to the user to replace the zone with the adjusted zone.
17. The computer readable storage medium of
when movement is identified and there is not substantial overlap between portions of the scene as captured in the first plurality of images and the second plurality of images, notifying the user includes recommending removing the zone.
18. The computer readable storage medium of
19. The computer readable storage medium of
20. The computer readable storage medium of
partitioning the image sensors into a plurality of pixels; and
for each pixel:
forming a respective vector of the received first plurality of images at the respective pixel; and
estimating a depth in the scene at the respective pixel by looking up the respective vector in a respective lookup table.
|
This application is related to U.S. Provisional Application Ser. No. 62/021,620, filed Jul. 7, 2014, entitled “Activity Recognition and Video Filtering,” which is incorporated by reference herein in its entirety.
This application is related to U.S. patent application Ser. No. 14/723,276, filed May 27, 2015, entitled “Multi-Mode LED Illumination System,” which is incorporated by reference herein in its entirety.
This application is related to U.S. patent application Ser. No. 14/738,803, filed Jun. 12, 2015, entitled “Simulating an Infrared Emitter Array in a Video Monitoring Camera to Construct a Lookup Table for Depth Determination”, which is incorporated by reference herein in its entirety.
This application is related to U.S. patent application Ser. No. 14/738,818, filed Jun. 12, 2015, entitled “Using a Scene Illuminating Infrared Emitter Array in a Video Monitoring Camera for Depth Determination”, which is incorporated by reference herein in its entirety.
This application is related to U.S. patent application Ser. No. 14/738,806, filed Jun. 12, 2015, entitled “Using Infrared Images of a Monitored Scene to Identify Windows”, which is incorporated by reference herein in its entirety.
This application is related to U.S. patent application Ser. No. 14/738,817, filed Jun. 12, 2015, entitled “Using a Depth Map of a Monitored Scene to Identify Floors, Walls, and Ceilings”, which is incorporated by reference herein in its entirety.
This application is related to U.S. patent application Ser. No. 14/738,811, filed Jun. 12, 2015, entitled “Using a Scene Illuminating Infrared Emitter Array in a Video Monitoring Camera to Estimate the Position of the Camera”, which is incorporated by reference herein in its entirety.
This application is related to U.S. patent application Ser. No. 14/738,816, filed Jun. 12, 2015, entitled “Using a Scene Information from a Security Camera to Reduce False Security Alerts”, which is incorporated by reference herein in its entirety.
The disclosed implementations relate generally to video cameras, and more specifically to using illumination emitters from a video camera to identify properties of the scene monitored by the camera or to identify properties of the camera itself.
Video surveillance cameras are used extensively. Usage of video cameras in residential environments has increased substantially, in part due to lower prices and simplicity of deployment. In many cases, surveillance cameras include infrared emitters in order to illuminate a scene when light from other sources is limited or absent.
Some video cameras enable a user to identify “zones” within the scene that is visible to the camera. This can be useful to identify movement or changes within those zones.
Because a surveillance camera can capture a very large amount of data (e.g., running 24 hours a day, 7 days a week), some cameras enable a user to set up alerts based on specific criteria. The criteria can include movement within a scene, movement of a specific type, or movement within a certain time range.
Accordingly, there is a need for camera systems that provide simpler usage and better utilization. In various implementations, the disclosed functionality complements or replaces the functionality of existing camera systems.
In accordance with some implementations, a process generates lookup tables for use in estimating spatial depth in a visual scene. The process is performed at a server having one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. The process identifies a plurality of distinct subsets of IR illuminators of a camera system. The camera system has a 2-dimensional array of image sensors (e.g., photodiodes) and a plurality of IR illuminators in fixed locations relative to the array of image sensors. The process partitions the image sensors into a plurality of pixels. In some implementations, each pixel comprises a single image sensor. In some implementations, each pixel comprises a plurality of image sensors, which can be 50 or more. For each pixel and for each of m distinct depths from the respective pixel, the process simulates a virtual surface at the respective depth. In some implementations, the simulated virtual surfaces are planar, but in other implementations the simulated surfaces are spherical, parabolic, or cubic. For each of the distinct subsets of IR illuminators, the process determines an expected IR light intensity at the respective pixel based on the respective depth and based on only the respective subset of IR illuminators emitting IR light. The process then forms an intensity vector using the expected IR light intensities for each of the distinct subsets, and normalizes the intensity vector. For each pixel, the process constructs a lookup table comprising the normalized vectors corresponding to the pixel. The lookup table associates each respective normalized vector with the respective depth of the respective simulated surface.
In some implementations, the expected IR light intensity at the respective pixel is based on characteristics of the IR illuminators of the camera system. In some implementations, the characteristics include lux, orientation of the IR illuminators relative to the sensor array, and/or location of the IR illuminators relative to the sensor array.
In some implementations, the process normalizes each intensity vector by computing a respective magnitude of the intensity vector and dividing each component of the intensity vector by the respective magnitude.
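By way of non-limiting illustration, the lookup-table construction for a single pixel can be sketched as follows. The sketch assumes a hypothetical illumination model in which each illuminator contributes its power divided by the squared distance to the simulated surface point; the disclosure does not prescribe this model. The names `expected_intensity`, `normalize`, and `build_lookup_table`, and the representation of an illuminator as an `(x, y, z, power)` tuple, are likewise assumptions made only for this example.

```python
import math

def expected_intensity(point, illuminators):
    """Hypothetical model: each illuminator contributes power / distance^2."""
    total = 0.0
    for ix, iy, iz, power in illuminators:
        dist_sq = (point[0] - ix) ** 2 + (point[1] - iy) ** 2 + (point[2] - iz) ** 2
        total += power / dist_sq
    return total

def normalize(vec):
    """Divide each component by the Euclidean magnitude of the vector."""
    magnitude = math.sqrt(sum(c * c for c in vec))
    return [c / magnitude for c in vec]

def build_lookup_table(pixel_ray, subsets, depths):
    """Return a list of (normalized intensity vector, depth) records for one pixel.

    pixel_ray is the (x, y) slope of the pixel's viewing ray, so the simulated
    planar surface at depth d is hit at (x * d, y * d, d).
    """
    table = []
    for depth in depths:
        point = (pixel_ray[0] * depth, pixel_ray[1] * depth, depth)
        # One component per illuminator subset, with only that subset emitting.
        vec = [expected_intensity(point, subset) for subset in subsets]
        table.append((normalize(vec), depth))
    return table
```

Because every record is normalized, records at different depths differ only in the relative contribution of each subset, not in overall brightness.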
In some implementations, the array of image sensors comprises more than one million image sensors. In some implementations, the array of image sensors is downsampled to a smaller number of pixels. For example, an array of image sensors with one million individual sensors may be downsampled to 10,000 pixels. The downsampling used (if any) may depend on available resources, such as memory, bandwidth, processor speed, and/or number of processors.
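As an illustrative sketch only, downsampling can be pictured as grouping the sensors into square blocks and treating each block as one pixel. Averaging the sensor readings within a block is an assumption of this example; the disclosure does not state how readings within a pixel are combined. The function name `downsample` is hypothetical.

```python
def downsample(image, block):
    """Average non-overlapping block x block groups of sensor readings.

    Rows or columns that do not fill a complete block are dropped
    (an assumption made for this sketch).
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - rows % block, block):
        out_row = []
        for c in range(0, cols - cols % block, block):
            total = sum(image[r + dr][c + dc]
                        for dr in range(block) for dc in range(block))
            out_row.append(total / (block * block))
        out.append(out_row)
    return out
```

With a 1000 by 1000 sensor array and `block=10`, this yields the 10,000 pixels mentioned in the example above.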
In accordance with some implementations, a process creates a depth map of a scene. The process is performed at a computing device having one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. For each of a plurality of distinct subsets of IR illuminators of a camera system, the process receives a captured IR image of a first scene taken by a 2-dimensional array of image sensors of the camera system while the respective subset of IR illuminators are emitting IR light and the IR illuminators not in the respective subset are not emitting IR light. The image sensors are partitioned into a plurality of pixels. In some implementations, each pixel comprises a single image sensor, but in other implementations, each pixel comprises a plurality of image sensors. In some implementations, the computing device is a server, and the captured images are received from a remotely located camera. In some implementations, the computing device is included in a camera, and the images are processed locally at the camera. For each pixel of the plurality of pixels, the process uses the captured IR images to form a respective vector of light intensity at the respective pixel. The process then estimates a depth in the first scene at the respective pixel by looking up the respective vector in a respective lookup table. In some implementations, the lookup table is stored at the camera system during a calibration process.
In some implementations, looking up the respective vector in the respective lookup table includes computing an inner product of the respective vector with records in the lookup table. In some implementations, the inner product is computed for each record in the lookup table. The process computes the depth in the first scene at the pixel as a depth corresponding to a record in the lookup table whose inner product with the respective vector is greatest among the computed inner products for the respective vector.
In some implementations, each respective vector for a respective pixel comprises a plurality of components, with each of the components corresponding to a respective IR light intensity for the respective pixel for a respective captured IR image. In some implementations, computing an inner product comprises computing a dot product.
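The lookup described above can be sketched, under stated assumptions, as a search for the record with the greatest dot product. The sketch assumes that each lookup table is a list of `(normalized vector, depth)` records; this layout and the function name `estimate_depth` are hypothetical.

```python
def estimate_depth(vector, lookup_table):
    """Return the depth of the record whose dot product with `vector` is greatest."""
    best_depth, best_score = None, float("-inf")
    for record, depth in lookup_table:
        score = sum(a * b for a, b in zip(vector, record))
        if score > best_score:
            best_depth, best_score = depth, score
    return best_depth
```

Because the records are normalized, the overall brightness of the measured vector does not affect which record wins; only the direction of the vector matters.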
In some implementations, the IR illuminators are oriented at a plurality of distinct angles relative to the array of image sensors.
In some implementations, the depth map of the first scene is created in response to detecting a trigger event. In some implementations, the trigger event is detecting movement of a first object in the first scene from a first location to a second location. In some implementations, the trigger event is a power interruption event.
In some implementations, a respective lookup table is generated during the calibration process. In some implementations, the calibration process includes simulating a virtual planar surface at a plurality of respective depths in the first scene and determining, for each pixel and each respective depth, an expected IR light intensity.
Implementations select the distinct subsets of IR illuminators in various ways. In some implementations, each of the distinct subsets of IR illuminators comprises two adjacent IR illuminators, and the distinct subsets of IR illuminators are non-overlapping.
In some implementations, each respective lookup table includes a plurality of normalized IR light intensity vectors, and each normalized light intensity vector corresponds to a respective depth in the first scene.
In some implementations, the respective lookup tables are downloaded to the camera system from a remote server during an initialization process prior to creating the depth map.
In some implementations, prior to capturing the IR images, the process switches from a first mode of the camera system to a second mode of the camera system, including deactivating the first mode and activating the second mode. In some implementations, the array of image sensors has an associated first pixel gain curve while the first mode is activated, and the array of image sensors has an associated second pixel gain curve while the second mode is activated.
In some implementations, the process receives a baseline IR image of the scene captured by the array of image sensors while none of the IR illuminators are emitting IR light. In these implementations, forming each respective vector of light intensity at a respective pixel comprises subtracting the light intensity at the pixel of the baseline IR image from the light intensity at the pixel of each of the captured IR images.
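A minimal sketch of the baseline subtraction follows, assuming each image is a list of rows of intensity values. The function name `intensity_vector` is hypothetical.

```python
def intensity_vector(captured_images, baseline, row, col):
    """Form the per-pixel vector with the ambient (baseline) intensity removed.

    One component is produced per captured IR image, in the order given.
    """
    ambient = baseline[row][col]
    return [image[row][col] - ambient for image in captured_images]
```

Subtracting the baseline removes IR light that does not originate from the camera's own illuminators, so the remaining components reflect only the contribution of each illuminator subset.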
In accordance with some implementations, a process classifies objects in a scene. The process is performed at a computing device having one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. In some implementations, the computing device is included in a camera system. In some implementations, the computing device is a server distinct from the camera system. The process receives a captured IR image of a scene taken by a 2-dimensional image sensor array of the camera system while one or more IR illuminators of the camera system are emitting IR light. In this way, the process forms an IR intensity map of the scene with a respective intensity value determined for each pixel of the IR image. The process uses the IR intensity map to identify a plurality of pixels whose corresponding intensity values are within a predefined intensity range (e.g., all intensity values between 0 and a positive finite value or all values between two positive finite values). The process then clusters the identified plurality of pixels into one or more regions that are substantially contiguous. The process determines that a first region of the one or more regions corresponds to a specific material based, at least in part, on the intensity values of the pixels in the first region, and stores information in the memory that identifies the first region.
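For illustration only, the thresholding and clustering steps can be sketched with a breadth-first flood fill. The sketch treats two pixels as contiguous only when they share an edge (4-connectivity), which is a stricter assumption than the "substantially contiguous" regions described above. The function name `cluster_pixels` is hypothetical.

```python
from collections import deque

def cluster_pixels(intensity_map, low, high):
    """Group pixels with intensity in [low, high] into 4-connected regions."""
    rows, cols = len(intensity_map), len(intensity_map[0])
    selected = {(r, c) for r in range(rows) for c in range(cols)
                if low <= intensity_map[r][c] <= high}
    regions, seen = [], set()
    for start in sorted(selected):
        if start in seen:
            continue
        region, queue = set(), deque([start])
        seen.add(start)
        while queue:
            r, c = queue.popleft()
            region.add((r, c))
            # Visit the four edge-sharing neighbors.
            for neighbor in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbor in selected and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        regions.append(region)
    return regions
```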
In some implementations, each pixel of the IR image corresponds to a unique respective image sensor in the image sensor array. In some implementations, the pixels of the IR image form a partition of the image sensors in the image sensor array and at least one pixel corresponds to a plurality of image sensors in the image sensor array.
In some implementations, the camera system has a plurality of IR illuminators, and forming an IR intensity map of the scene includes receiving a respective IR sub-image of the scene for each of a plurality of distinct subsets of IR illuminators. Each IR sub-image is captured while the respective subset of IR illuminators are emitting IR light and the IR illuminators not in the respective subset are not emitting IR light. The respective intensity value for a respective pixel is the average of intensity values at the pixel in each of the sub-images.
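The averaging of sub-images can be sketched as follows, assuming each sub-image is a list of rows of intensity values with identical dimensions. The function name `average_intensity_map` is hypothetical.

```python
def average_intensity_map(sub_images):
    """Average the intensity at each pixel across all IR sub-images."""
    count = len(sub_images)
    rows, cols = len(sub_images[0]), len(sub_images[0][0])
    return [[sum(image[r][c] for image in sub_images) / count
             for c in range(cols)]
            for r in range(rows)]
```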
In some implementations, clustering the identified plurality of pixels into one or more regions further comprises using a depth map that was constructed using the image sensor array.
In some implementations, clustering the identified plurality of pixels into one or more regions further comprises using an RGB image of the scene captured using the image sensor array.
In some implementations, determining that a first region of the one or more regions corresponds to a specific material comprises determining that the first region is substantially a quadrilateral. In some implementations, the first region is substantially a quadrilateral when a total absolute difference in area between the first region and the quadrilateral is less than a threshold percentage of the quadrilateral's area (e.g., 5%, 10%, or 20%).
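One way to picture the area test is sketched below. It assumes that a candidate convex quadrilateral has already been obtained by some fitting step not shown here, that the region is a set of integer `(x, y)` pixel coordinates, and that area is measured by counting pixels. The function names are hypothetical.

```python
import math

def inside_convex_quad(x, y, corners):
    """True if (x, y) is inside or on a convex quadrilateral.

    Corners must be listed in a consistent (clockwise or counterclockwise) order.
    """
    crosses = []
    for i in range(4):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % 4]
        crosses.append((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)

def is_substantially_quadrilateral(region, corners, threshold=0.10):
    """Compare the pixel-count difference to a fraction of the quadrilateral's area."""
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    quad = {(x, y)
            for x in range(math.floor(min(xs)), math.ceil(max(xs)) + 1)
            for y in range(math.floor(min(ys)), math.ceil(max(ys)) + 1)
            if inside_convex_quad(x, y, corners)}
    # Pixels in exactly one of the two shapes make up the absolute area difference.
    difference = len(region ^ quad)
    return difference < threshold * len(quad)
```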
In some implementations, the predefined intensity range includes all intensity values below a threshold value, and the specific material is glass. The process thereby determines that the first region corresponds to a window in the scene.
In some implementations, the process receives a video stream of the scene from the camera system and reviews the video stream to detect movement in the scene. The first region is excluded from movement detection. The process generates a motion alert when there is motion detected at the scene outside of the first region.
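As a hedged illustration, excluding a region from movement detection can be sketched with simple frame differencing. Frame differencing, the threshold values, and the function name `motion_alert` are assumptions of this example; the disclosure does not specify how movement is detected.

```python
def motion_alert(prev_frame, curr_frame, excluded, diff_threshold=25, min_changed=1):
    """Return True when enough pixels outside `excluded` changed between frames.

    `excluded` is a set of (row, col) pixels, for example a region identified
    as a window.
    """
    changed = 0
    for r, row in enumerate(curr_frame):
        for c, value in enumerate(row):
            if (r, c) in excluded:
                continue  # changes seen through the window are ignored
            if abs(value - prev_frame[r][c]) > diff_threshold:
                changed += 1
    return changed >= min_changed
```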
In accordance with some implementations, a process identifies large planar surfaces in scenes, such as floors, walls, and ceilings. The process is performed at a computing device having one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. The process receives a plurality of captured IR images of a scene taken by a 2-dimensional array of image sensors of a camera system. Each IR image is captured while a distinct subset of IR illuminators of the camera system is emitting IR light. The process constructs a depth map of the scene using the plurality of IR images, and uses the depth map to compute a binary depth edge map for the scene. The binary depth edge map identifies which points in the depth map lie on depth discontinuities. The process identifies a plurality of contiguous components based on the binary depth edge map. The process determines that a first component of the plurality of contiguous components represents a large planar surface in the scene by fitting a plane to points in the first component, determining the orientation of the plane, and determining that the plane fitting residual error is less than a predefined threshold.
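The binary depth edge map can be sketched, for illustration only, by marking pixels whose depth differs sharply from that of an adjacent pixel. The jump threshold, the comparison against only the right and lower neighbors, and the function name `depth_edge_map` are assumptions of this example.

```python
def depth_edge_map(depth, jump=0.3):
    """Mark both pixels of any adjacent pair whose depths differ by more than `jump`."""
    rows, cols = len(depth), len(depth[0])
    edges = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Checking the lower and right neighbors covers every adjacent pair once.
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if nr < rows and nc < cols and abs(depth[r][c] - depth[nr][nc]) > jump:
                    edges[r][c] = True
                    edges[nr][nc] = True
    return edges
```

Contiguous components can then be taken as connected groups of pixels that are not marked as edges.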
In some implementations, the nature of the large plane is determined by its orientation, that is, by the direction in which the plane faces. When the plane faces upwards, the plane is determined to be a floor. When the plane faces downwards, the plane is determined to be a ceiling. When the plane faces in a horizontal direction, the plane is determined to be a wall.
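A minimal sketch of the orientation test follows. It assumes that the fitted plane is represented by a normal vector pointing away from the surface into the room, that the upward direction is known, and that a 20 degree tolerance is acceptable; none of these details is taken from the disclosure. The function name `classify_plane` is hypothetical.

```python
import math

def classify_plane(normal, up=(0.0, 1.0, 0.0), tolerance_deg=20.0):
    """Classify a plane as floor, ceiling, or wall from the angle between its normal and `up`."""
    dot = sum(n * u for n, u in zip(normal, up))
    lengths = math.sqrt(sum(n * n for n in normal)) * math.sqrt(sum(u * u for u in up))
    # Clamp to [-1, 1] to guard against rounding before taking the arccosine.
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / lengths))))
    if angle <= tolerance_deg:
        return "floor"      # faces upwards
    if angle >= 180.0 - tolerance_deg:
        return "ceiling"    # faces downwards
    if abs(angle - 90.0) <= tolerance_deg:
        return "wall"       # faces horizontally
    return "unknown"
```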
In some implementations, the computing device is a server distinct from the camera system. In other implementations, the computing device is included in the camera system.
In some implementations, the image sensors are partitioned into a plurality of pixels. For each pixel, the process uses the captured IR images to form a respective vector of light intensity at the respective pixel and estimates a depth in the scene at the respective pixel using the respective vector and a respective lookup table. In this way, the process constructs the depth map.
In accordance with some implementations, a process recomputes zones for a scene. The process is performed at a computing device that has one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. The process receives a first RGB image of a scene taken by a 2-dimensional array of image sensors of a camera system at a first time. The process also receives a first plurality of distinct IR images of the scene taken by the array of image sensors temporally proximate to the first time. Each of the IR images is taken while a different subset of IR illuminators of the camera system is emitting light. Using the first plurality of IR images, the process constructs a first depth map of the scene. The first depth map indicates a respective depth in the scene at a plurality of pixels, where each pixel corresponds to one or more of the image sensors. The process receives designation from a user of a zone within the first RGB image. The zone corresponds to a contiguous plurality of pixels. At a second time that is after the first time, the process receives a second plurality of distinct IR images of the scene taken by the array of image sensors. Each of the IR images in the second plurality is taken while a different subset of IR illuminators of the camera system is emitting light. Using the second plurality of IR images, the process constructs a second depth map of the scene. The process then determines physical movement of the camera system based on the first and second depth maps. Based on the determined physical movement, the process translates the zone in the first RGB image into an adjusted zone.
In some instances, the determined physical movement is an angular rotation. In some instances, the determined physical movement is a lateral displacement. In some instances, the determined physical movement includes both an angular rotation and a lateral displacement. Lateral displacements are commonly horizontal, but they can be vertical as well. As used herein, a lateral displacement is any movement in which the camera continues to point in the same direction. This includes any combination of left/right, up/down, and/or forward/backward.
In some implementations, determining the physical movement of the camera system includes identifying a plurality of points in the first depth map and a corresponding plurality of points in the second depth map, and determining a respective displacement for each of the points between the first and second depth maps.
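For illustration, the per-point displacements can be sketched as below, assuming the corresponding points are already paired and expressed as `(x, y, z)` coordinates. Summarizing the displacements by their mean is an assumption that suits a purely lateral displacement; it is not described in the disclosure. The function names are hypothetical.

```python
def point_displacements(first_points, second_points):
    """Return the displacement of each point between the first and second depth maps."""
    return [tuple(b - a for a, b in zip(p, q))
            for p, q in zip(first_points, second_points)]

def mean_displacement(displacements):
    """Average the displacements component by component."""
    count = len(displacements)
    return tuple(sum(d[i] for d in displacements) / count for i in range(3))
```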
In some instances, the zone is a first quadrilateral. In some instances, the adjusted zone is a second quadrilateral, and a first edge of the first quadrilateral has a length that is different from a corresponding second edge of the second quadrilateral.
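The following sketch suggests why corresponding edges can differ in length after the camera moves. It assumes a pinhole camera model with a known focal length (in pixels) and image center, a camera motion limited to a rotation about the vertical axis plus a translation, and a particular sign convention for that rotation; these are assumptions of the example rather than features of the disclosure. The function name `adjust_zone` is hypothetical.

```python
import math

def adjust_zone(corners, depths, focal, center, yaw=0.0, translation=(0.0, 0.0, 0.0)):
    """Re-project zone corners from the first image into the moved camera's image.

    corners: (u, v) pixel coordinates of the zone corners in the first image.
    depths:  depth at each corner, taken from the first depth map.
    Assumes every corner remains in front of the moved camera.
    """
    cx, cy = center
    tx, ty, tz = translation
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    adjusted = []
    for (u, v), z in zip(corners, depths):
        # Back-project the pixel into 3-D using its depth.
        x = (u - cx) * z / focal
        y = (v - cy) * z / focal
        # Express the point relative to the moved camera.
        dx, dy, dz = x - tx, y - ty, z - tz
        x2 = cos_y * dx - sin_y * dz
        z2 = sin_y * dx + cos_y * dz
        # Project into the image of the moved camera.
        adjusted.append((focal * x2 / z2 + cx, focal * dy / z2 + cy))
    return adjusted
```

In this model, a sideways translation shifts nearby corners by more pixels than distant ones, so an edge joining corners at different depths changes length.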
In some implementations, the process creates the first depth map of the scene by partitioning the image sensors into a plurality of pixels. For each pixel, the process forms a respective vector of the received IR images at the respective pixel and estimates a depth in the scene at the respective pixel by looking up the respective vector in a respective lookup table.
In some implementations, the computing device is a server distinct from the camera system. In other implementations, the computing device is included in the camera system.
In some implementations, the process receives a second RGB image of the scene taken by the image sensor array of the camera system temporally proximate to the second time and correlates the adjusted zone to a set of pixels from the second RGB image.
In some implementations, the process determines the physical movement of the camera system using point clouds. The process forms a first point cloud using a first plurality of points from the first depth map and forms a second point cloud using a second plurality of points from the second depth map. The process then computes a minimal transformation that aligns the first point cloud with the second point cloud. This process is referred to as “registration.”
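A simplified sketch of registration follows. To stay short it makes two assumptions that the disclosure does not: the point correspondences are already known, and the motion is confined to a single plane (a rotation about one axis plus an in-plane translation), so each point is given as a 2-D coordinate. Under those assumptions the least-squares rigid transformation has a closed form. The function name `register_2d` is hypothetical.

```python
import math

def register_2d(first, second):
    """Return (theta, (tx, ty)) minimizing the squared error of R(theta) * p + t against q."""
    count = len(first)
    ax = sum(p[0] for p in first) / count
    ay = sum(p[1] for p in first) / count
    bx = sum(q[0] for q in second) / count
    by = sum(q[1] for q in second) / count
    cos_sum = sin_sum = 0.0
    for (px, py), (qx, qy) in zip(first, second):
        # Work with coordinates relative to each cloud's centroid.
        px, py, qx, qy = px - ax, py - ay, qx - bx, qy - by
        cos_sum += px * qx + py * qy
        sin_sum += px * qy - py * qx
    theta = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated first centroid onto the second centroid.
    return theta, (bx - (c * ax - s * ay), by - (s * ax + c * ay))
```

A full 3-D registration without known correspondences would need a more general method; this sketch only illustrates the idea of finding the transformation that best aligns the two clouds.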
In accordance with some implementations, a process estimates the height and tilt angle of a camera system. The camera system has a 2-dimensional array of image sensors and a plurality of IR illuminators in fixed locations relative to the array of image sensors. The process is performed at a computing device having one or more processors and memory. The memory stores one or more programs configured for execution by the one or more processors. In some implementations, the computing device is included in the camera system. In some implementations, the computing device is a server distinct from the camera system. The process identifies a plurality of distinct subsets of the IR illuminators. In some implementations, each of the distinct subsets of the IR illuminators comprises two adjacent IR illuminators, and the distinct subsets of the IR illuminators are non-overlapping. In some implementations, one or more of the subsets of IR illuminators comprises a single IR illuminator. The process partitions the image sensors into a plurality of pixels. In some implementations, each pixel corresponds to a single image sensor. In some implementations, some of the pixels correspond to multiple image sensors (e.g., by downsampling).
In accordance with some implementations, for each of a plurality of heights and tilt angles, the process constructs a dictionary entry that corresponds to the camera system having the respective height and tilt angle above a floor. The respective dictionary entry includes respective IR light intensity values for pixels in images corresponding to activating individually each of the distinct subsets of the IR illuminators.
In some implementations, the constructed dictionary entries are based on simulating the camera, the floor, and the images, and computing expected IR light intensity values for pixels in the simulated images. In some implementations, each expected IR light intensity value is based on characteristics of the IR illuminators, including one or more characteristics selected from the group consisting of lux, orientation of the IR illuminators relative to the array of image sensors, and location of the IR illuminators relative to the array of image sensors. In some implementations, a respective dictionary entry for a respective height and respective tilt angle is based on measuring IR light intensity values of actual images captured by the camera having the respective height and respective tilt angle with respect to an actual floor.
In accordance with some implementations, for each of the plurality of distinct subsets of the IR illuminators, the process receives a captured IR image of a scene taken by the array of image sensors while the respective subset of the IR illuminators are emitting IR light and the IR illuminators not in the respective subset are not emitting IR light. Using at least one of the captured IR images, the process identifies a floor region corresponding to a floor in the scene. In some implementations, identifying the floor region includes constructing a depth map of the scene using the captured IR images, identifying a region bounded by depth discontinuities, and determining that the region is substantially planar and facing upwards.
In accordance with some implementations, the process forms a vector (sometimes referred to as a feature vector) including pixels from the captured IR images in the identified floor region and estimates the camera height and camera tilt angle relative to the floor by comparing the feature vector to the dictionary entries.
In some implementations, the respective expected IR light intensity is based on characteristics of the IR illuminators. In some implementations, these characteristics include one or more of: illuminator lux; orientation of the IR illuminators relative to the array of image sensors; and location of the IR illuminators relative to the array of image sensors.
In some implementations, constructing a dictionary entry includes normalizing the dictionary entry. In some implementations, normalizing a dictionary entry includes determining a respective total magnitude of the light intensity features in the dictionary entry and dividing each component of the dictionary entry by the respective total magnitude. In some implementations, the dictionary entries are downloaded to the camera system from the computing device during an initialization process.
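The normalization step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `normalize_entry` is hypothetical, and the "total magnitude" is assumed here to be the Euclidean (L2) norm of the entry.

```python
import math

def normalize_entry(entry):
    """Return a copy of a dictionary entry scaled to unit magnitude.

    Assumption: "total magnitude" is taken to be the Euclidean norm.
    """
    magnitude = math.sqrt(sum(v * v for v in entry))
    if magnitude == 0.0:
        # An all-zero entry cannot be scaled; return it unchanged.
        return list(entry)
    return [v / magnitude for v in entry]
```

Normalizing in this way makes entries comparable even when overall brightness differs, for example because of floor reflectivity.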
In some implementations, the process receives a baseline IR image of the scene captured by the array of image sensors while none of the IR illuminators are emitting IR light and subtracts the light intensity at each pixel of the baseline IR image from the light intensity at the corresponding pixel of each of the other captured IR images.
In some implementations, estimating the camera height and camera tilt angle relative to the floor includes computing a respective distance between the feature vector and respective dictionary entries. The process selects a first dictionary entry whose corresponding computed distance is less than the other computed distances and estimates the camera height and tilt angle to be the height and tilt angle associated with the first dictionary entry. In some implementations, computing a respective distance between the feature vector and respective dictionary entries comprises computing a Euclidean distance that uses only vector components corresponding to pixels in the identified floor region. In some implementations, the process normalizes the feature vector and the dictionary entries prior to computing the distances.
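One possible reading of this nearest-entry search is sketched below. The names `estimate_pose`, `dictionary`, and `floor_idx` are hypothetical, and the sketch assumes that dictionary entries are keyed by (height, tilt angle) and that normalization is applied to the floor-region components only.

```python
import math

def _restrict(vector, floor_idx):
    """Keep only the components that belong to the identified floor region."""
    return [vector[i] for i in floor_idx]

def _unit(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

def estimate_pose(feature, dictionary, floor_idx):
    """Return the (height, tilt) key of the entry closest to the feature vector."""
    f = _unit(_restrict(feature, floor_idx))
    best_pose, best_dist = None, math.inf
    for pose, entry in dictionary.items():
        e = _unit(_restrict(entry, floor_idx))
        dist = math.dist(f, e)  # Euclidean distance over floor components only
        if dist < best_dist:
            best_pose, best_dist = pose, dist
    return best_pose
```

Components outside `floor_idx` never enter the distance, so bright non-floor pixels do not influence the estimate.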
In accordance with some implementations, a process reduces false positive security alerts. The process is performed at a computing device having one or more processors, and memory storing one or more programs configured for execution by the one or more processors. In some implementations, the computing device is a server distinct from a video camera. In some implementations, the computing device is included in the video camera. The process computes a depth map for a scene monitored by a video camera using a plurality of IR images captured by the video camera and uses the depth map to identify a first region within the scene having historically above average false positive detected motion events. The process monitors a video stream provided by the video camera to identify motion events. The monitored area excludes the first region. The process generates a motion alert when there is detected motion in the scene outside of the first region and the detected motion satisfies threshold criteria. In some implementations, satisfying the threshold criteria includes detecting movement of an object in the scene, and the detected movement exceeds a predefined distance within a predefined period of time. In some implementations, satisfying the threshold criteria includes detecting movement for an object that exceeds a predefined size. In some implementations, satisfying the threshold criteria includes detecting simultaneous movement of two or more objects in the scene.
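As an illustration only, the alert decision could be organized as in the sketch below. The `Track` type, the rectangular excluded region, and all threshold values are assumptions made for the example; only two of the threshold criteria (distance within a time period, and object size) are shown.

```python
import math
from dataclasses import dataclass

@dataclass
class Track:
    points: list   # [(time_s, x, y), ...] for one moving object
    size: float    # apparent object area in pixels

def inside(x, y, rect):
    """True if (x, y) lies in the axis-aligned rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    return x0 <= x <= x1 and y0 <= y <= y1

def moved_far_enough(track, min_dist, window_s):
    """True if the object moved more than min_dist within window_s seconds."""
    pts = track.points
    for i, (t0, x0, y0) in enumerate(pts):
        for t1, x1, y1 in pts[i + 1:]:
            if t1 - t0 <= window_s and math.hypot(x1 - x0, y1 - y0) > min_dist:
                return True
    return False

def should_alert(tracks, excluded, min_dist=20.0, window_s=2.0, min_size=400.0):
    """Alert on motion outside the excluded region that meets a threshold."""
    for track in tracks:
        # Skip motion that lies entirely inside the excluded (first) region.
        if all(inside(x, y, excluded) for _, x, y in track.points):
            continue
        if moved_far_enough(track, min_dist, window_s) or track.size > min_size:
            return True
    return False
```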
In some implementations, the video camera has a plurality of IR illuminators and each of the plurality of IR images captured by the video camera is taken when a different subset of the illuminators is emitting light.
In some instances, the first region is identified as a ceiling. In some implementations, identifying the first region as a ceiling includes using the depth map to compute a binary depth edge map for the scene. The binary depth edge map identifies which points in the depth map comprise depth discontinuities. In some implementations, identifying the first region as a ceiling also includes identifying a contiguous component based on the binary depth edge map. In some implementations, identifying the first region as a ceiling also includes fitting a plane to points in the contiguous component, determining that the plane fitting residual error is less than a predefined threshold, and determining that the plane is oriented downward.
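The binary depth edge map mentioned above might be computed as in the following sketch. The neighbor comparison and the `jump` threshold are assumptions for illustration; the description does not specify how discontinuities are detected.

```python
def depth_edge_map(depth, jump=0.5):
    """Mark points that sit on a depth discontinuity.

    depth is a list of rows of depth values. A point is marked when its
    depth differs from the right or lower neighbor by more than `jump`
    (an assumed threshold, in the same units as the depth map).
    """
    rows, cols = len(depth), len(depth[0])
    edges = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and abs(depth[r][c] - depth[rr][cc]) > jump:
                    edges[r][c] = True
                    edges[rr][cc] = True
    return edges
```

Contiguous components can then be taken as connected groups of unmarked points, each of which is a candidate for the plane fitting step.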
In some instances, the first region is identified as a window. In some implementations, identifying the first region as a window includes identifying the first region as a region of low light intensity within a captured IR image of the scene, fitting the first region with a quadrilateral, and determining that the absolute difference between the first region and the quadrilateral is less than a threshold percentage of the area of the quadrilateral.
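A simplified version of the shape test is sketched below. As a loudly stated simplification, the quadrilateral is approximated by the axis-aligned bounding rectangle of the low-intensity region rather than a general fitted quadrilateral, and the name `looks_like_window` and the 10% threshold are hypothetical.

```python
def looks_like_window(mask, max_diff_fraction=0.1):
    """Check whether a low-intensity region is close to a rectangle.

    mask is a list of rows of truthy/falsy values marking the region.
    Assumption: the bounding rectangle stands in for the fitted quadrilateral.
    """
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return False
    r0, r1 = min(r for r, _ in pts), max(r for r, _ in pts)
    c0, c1 = min(c for _, c in pts), max(c for _, c in pts)
    box_area = (r1 - r0 + 1) * (c1 - c0 + 1)
    # Every region pixel lies inside the box, so the absolute difference
    # between region and box is the number of box pixels not in the region.
    diff = box_area - len(pts)
    return diff < max_diff_fraction * box_area
```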
In some instances, the first region is identified as a television.
In accordance with some implementations, a process for generating depth maps is performed by a camera having a plurality of illuminators, a lens assembly, an image sensing element, a processor, and memory. The illuminators are configured to operate in a first mode to provide illumination using all of the illuminators, the lens assembly is configured to focus incident light on the image sensing element, the memory is configured to store image data from the image sensing element, and the processor is configured to execute programs to control operation of the camera. The process reconfigures the plurality of illuminators to operate in a second mode, where each of a plurality of subsets of the plurality of illuminators provides illumination separately from other subsets of the plurality of illuminators. The process sequentially activates each of the subsets of the illuminators to illuminate a scene and receives reflected illumination from the illuminated scene incident on the lens assembly and focused onto the image sensing element. The process measures light intensity values of the received reflected illumination at the image sensing element and stores to the memory the measured light intensity values associated with activation of each of the subsets.
In some implementations, each of the subsets of illuminators is configured at a different angle relative to the image sensing element.
In some implementations, each of the subsets of illuminators highlights a different portion of the scene.
In some implementations, the process transmits the stored light intensity values to a depth mapping module configured to estimate spatial depths of objects in the scene based on the stored light intensity values, predetermined illumination specifications of the illuminators, and response specifications of the image sensors.
In some implementations, the illuminators are IR illuminators.
In some implementations, the illuminators comprise 8 IR illuminators and each of the subsets of the illuminators comprises 2 adjacent IR illuminators.
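The sequential activation in the second mode can be sketched as follows. The callbacks `set_active` and `capture` are hypothetical stand-ins for camera hardware control, and the sketch assumes that the subsets are non-overlapping pairs of adjacent illuminators and that the illuminator count is even.

```python
def adjacent_pairs(count=8):
    """Group illuminators 0..count-1 into non-overlapping adjacent pairs."""
    return [(i, i + 1) for i in range(0, count, 2)]

def capture_sequence(set_active, capture, count=8):
    """Activate each subset in turn and store one frame per subset."""
    frames = {}
    for subset in adjacent_pairs(count):
        set_active(subset)          # only this subset emits light
        frames[subset] = capture()  # measured intensities for this subset
    set_active(())                  # turn all illuminators off afterwards
    return frames

# Stand-in hardware for demonstration: a "frame" just records which
# illuminators were on when it was captured.
active = []

def set_active(subset):
    active[:] = subset

def capture():
    return list(active)

frames = capture_sequence(set_active, capture)
```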
In some implementations, the image sensing element is a 2-dimensional array of image sensors.
In some implementations, differences in the stored light intensity values associated with activation of each of the subsets for a respective image sensor correlate with spatial depth of an object in the scene from which reflected light was received at the respective image sensor.
In some implementations, the process captures a baseline image while none of the illuminators are emitting light. The captured baseline image measures ambient light intensity of the scene at each of the image sensors. The process stores the captured baseline image to the memory and for each image sensor, the process subtracts the baseline intensity value from the stored intensity values for the respective image sensor to correct the stored intensity values for ambient light at the scene.
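The ambient light correction can be illustrated with the short sketch below. The function name is hypothetical, and clamping negative results to zero is an assumption added here to absorb sensor noise; the description only specifies the subtraction.

```python
def subtract_baseline(frames, baseline):
    """Subtract the ambient (baseline) intensity from every captured frame.

    frames is a list of images and baseline is one image of the same
    shape, each stored as a list of rows of intensity values.
    """
    corrected = []
    for frame in frames:
        corrected.append([
            # Clamp at zero (assumption) so noise cannot yield negative light.
            [max(value - ambient, 0) for value, ambient in zip(row, base_row)]
            for row, base_row in zip(frame, baseline)
        ])
    return corrected
```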
In some implementations, the image sensors are partitioned into a plurality of pixels and for each pixel of the plurality of pixels the process uses the captured IR images to form a respective vector of light intensity at the respective pixel. For each pixel, the process also estimates a depth in the first scene at the respective pixel by looking up the respective vector in a respective lookup table. In some implementations, looking up the respective vector in the respective lookup table includes computing an inner product of the respective vector with records in the lookup table and determining the depth in the first scene at the pixel as a depth corresponding to a record in the lookup table whose inner product with the respective vector is greatest among the computed inner products for the respective vector. In some implementations, computing an inner product of the respective vector with records in the lookup table includes computing an inner product of the respective vector and the respective record for each record in the respective lookup table. In some implementations, the respective vector for a respective pixel has a plurality of components, each of the components corresponds to a respective IR light intensity for the respective pixel for a respective captured IR image, and computing an inner product comprises computing a dot product.
In some implementations, each respective lookup table includes a plurality of normalized IR light intensity vectors, and each normalized light intensity vector corresponds to a respective depth in the first scene.
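The per-pixel lookup described above amounts to a maximum inner product search. A minimal sketch follows, assuming the lookup table is stored as a list of (depth, normalized vector) pairs; that layout and the function name are illustrative choices, not part of the description.

```python
import math

def estimate_depth(pixel_vector, lookup_table):
    """Return the depth whose record has the greatest dot product with the pixel.

    lookup_table is a list of (depth, normalized_intensity_vector) records.
    """
    best_depth, best_score = None, -math.inf
    for depth, record in lookup_table:
        score = sum(a * b for a, b in zip(pixel_vector, record))
        if score > best_score:
            best_depth, best_score = depth, score
    return best_depth
```

Because the records are normalized, the largest dot product picks the record whose direction best matches the pixel vector, independent of how bright the pixel is overall.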
In some implementations, the respective lookup table is downloaded to the camera system from a remote server during an initialization process.
In accordance with some implementations, a computing device has one or more processors, memory, and one or more programs stored in the memory. The programs are configured for execution by the one or more processors. The one or more programs include instructions for performing any of the processes described herein. In some implementations, the computing device is a server, which is distinct from a camera system. In other implementations, the computing device includes a camera.
In accordance with some implementations, a non-transitory computer readable storage medium stores one or more programs configured for execution by a computing device having one or more processors and memory. The one or more programs include instructions for performing any of the processes described herein. In some implementations, the computing device is a server, which is distinct from a camera system. In other implementations, the computing device includes a camera.
Thus, computing devices, server systems, and camera systems are provided with more efficient methods for utilizing IR emitters and a sensor array to classify objects in a scene or simplify creation of alerts. These disclosed camera systems thereby increase the effectiveness, efficiency, and user satisfaction with such systems. Such methods may complement or replace conventional methods.
For a better understanding of the various described implementations, reference should be made to the Description of Implementations below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
Security cameras typically include illuminators so that video capture is possible even in low light conditions or in complete darkness. Many such cameras use infrared (IR) illuminators, which allow video capture without illuminating a scene with visible light. Typically, when illumination is needed, all of the illuminators are turned on.
Disclosed implementations utilize existing illuminators in different ways so that the camera can provide more information about a scene. One step in some implementations is to control the illuminators individually or in small groups rather than turning them all on or off together. Because the illuminators are in different locations with respect to the image sensor array, captured images are slightly different depending on which illuminators are on, as illustrated below in
As described below, some implementations build a depth map of a scene using the differences in captured images when different illuminators are on. A depth map estimates the distance between the image sensor array of the camera and the nearest object for each pixel in the field of vision of the camera. In some implementations, the depth map is implemented as an m×n matrix of depths, where m×n is the arrangement of pixels corresponding to the image sensor array.
In some implementations, there is a one-to-one correspondence between pixels and individual image sensors in the array, but in many implementations the images are downsampled to create a more manageable set of pixels (e.g., 10,000 pixels instead of 1,000,000 pixels).
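One simple way to downsample is block averaging, sketched below. The description does not say which downsampling method is used, so block averaging and the function name are assumptions. A factor of 10 in each dimension would reduce a 1,000 × 1,000 image (1,000,000 pixels) to 100 × 100 (10,000 pixels).

```python
def downsample(image, factor):
    """Average each factor x factor block of sensor values into one pixel.

    image is a list of rows; trailing rows or columns that do not fill a
    complete block are dropped.
    """
    rows, cols = len(image) // factor, len(image[0]) // factor
    block = factor * factor
    return [
        [
            sum(image[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / block
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```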
A depth map can be used in various ways to determine information about a scene. In some implementations, the depth map is used to help identify floors, walls, and ceilings. In some implementations, the depth map helps to identify when a camera has moved slightly, enabling automatic zone correction for previously defined zones in the scene. In some implementations, the depth map helps to identify the position of the camera (e.g., height above the floor and angle). These features provide useful information, and also allow for more accurate alerts. For example, if a region is identified as a ceiling, perceiving “movement” in that region is likely to be light reflections instead of an intruder. As another example, automatic zone correction can ensure that the proper region is monitored (e.g., a doorway) even if the zone is in a different location relative to a new camera position (e.g., because the camera was bumped).
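Detecting that a camera has moved by comparing two depth maps could look like the sketch below. The per-point tolerance `tol`, the `min_fraction` of changed points, and the function name are all assumptions made for illustration; the description does not specify a comparison rule.

```python
def camera_moved(depth_a, depth_b, tol=0.1, min_fraction=0.5):
    """Compare two depth maps point by point.

    Returns True when at least min_fraction of the points differ by more
    than tol. A single moving object changes few points, while a bumped
    camera changes most of them.
    """
    changed = total = 0
    for row_a, row_b in zip(depth_a, depth_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > tol:
                changed += 1
    return total > 0 and changed / total >= min_fraction
```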
Some implementations also enable detection of windows using characteristics of windows that are different from other objects. For example, whereas light incident on most objects scatters in all directions, light incident on a window either passes through the window or reflects off like a mirror. Identifying windows can be useful in various ways, including the prevention of false alerts. For example, movement of leaves on a tree outside of a window does not constitute an intruder inside a monitored room with the window.
These features may be implemented for an independent camera, but in some implementations, the camera is part of a smart home environment 100, as described below in
Video-based surveillance and security monitoring of a premises generates a continuous video feed that may last hours, days, and even months. Although motion-based recording triggers can help trim down the amount of video data that is actually recorded, there are a number of drawbacks associated with video recording triggers based on simple motion detection in the live video feed. For example, when motion detection is used as a trigger for recording a video segment, the threshold of motion detection must be set appropriately for the scene of the video; otherwise, the recorded video may include many video segments containing trivial movements (e.g., lighting change, leaves moving in the wind, shifting of shadows due to changes in sunlight exposure, etc.) that are of no significance to a reviewer. On the other hand, if the motion detection threshold is set too high, video data on important movements that are too small to trigger the recording may be irreversibly lost. Furthermore, at a location with many routine movements (e.g., cars passing through in front of a window) or constant movements (e.g., a scene with a running fountain, a river, etc.), recording triggers based on motion detection are rendered ineffective, because motion detection can no longer accurately select out portions of the live video feed that are of special significance. As a result, a human reviewer has to sift through a large amount of recorded video data to identify a small number of motion events after rejecting a large number of routine movements, trivial movements, and movements that are of no interest for a present purpose.
Due to at least the challenges described above, it is desirable to have a method that maintains a continuous recording of a live video feed such that irreversible loss of video data is avoided and, at the same time, augments simple motion detection with false positive suppression and motion event categorization. The false positive suppression techniques help to downgrade motion events associated with trivial movements and constant movements. The motion event categorization techniques help to create category-based filters for selecting only the types of motion events that are of interest for a present purpose. As a result, the reviewing burden on the reviewer may be reduced. In addition, as the present purpose of the reviewer changes in the future, the reviewer can simply choose to review other types of motion events by selecting the appropriate motion categories as event filters.
In addition, in some implementations, event categories can also be used as filters for real-time notifications and alerts. For example, when a new motion event is detected in a live video feed, the new motion event is immediately categorized, and if the event category of the newly detected motion event is a category of interest selected by a reviewer, a real-time notification or alert can be sent to the reviewer regarding the newly detected motion event. In addition, if the new event is detected in the live video feed as the reviewer is viewing a timeline of the video feed, the event indicator and the notification of the new event will have an appearance or display characteristic associated with the event category.
Furthermore, the types of motion events occurring at different locations and settings can vary greatly, and there are many event categories for all motion events collected at the video server system (e.g., the video server system 508). Therefore, it may be undesirable to have a set of fixed event categories from the outset to categorize motion events detected in all video feeds from all camera locations for all users. In some implementations, the motion event categories for the video stream from each camera are gradually established through machine learning, and are thus tailored to the particular setting and use of the video camera.
In addition, in some implementations, as new event categories are gradually discovered based on clustering of past motion events, the event indicators for the past events in a newly discovered event category are refreshed to reflect the newly discovered event category. In some implementations, a clustering algorithm automatically phases out old, inactive, and/or sparse categories when categorizing motion events. As a camera changes location, event categories that are no longer active are gradually retired without manual input to keep the motion event categorization model current. In some implementations, user input to edit the assignment of past motion events into respective event categories is also taken into account for future event category assignment and new category creation.
In some circumstances, there are multiple objects moving simultaneously within the scene of a video feed. In some implementations, the motion track associated with each moving object corresponds to a respective motion event candidate, such that the movement of the different objects in the same scene may be assigned to different motion event categories.
In general, motion events may occur in different regions of a scene at different times. Out of all the motion events detected within a scene of a video stream over time, a reviewer may only be interested in motion events that occur within or enter a particular zone of interest in the scene. In addition, the zones of interest may not be known to the reviewer and/or the video server system until long after one or more motion events of interest have occurred within the zones of interest. For example, a parent may not be interested in activities centered around a cookie jar until after some cookies have mysteriously disappeared. Furthermore, the zones of interest in the scene of a video feed can vary for a reviewer over time depending on the present purpose of the reviewer. For example, the parent may be interested in seeing all activities that occurred around the cookie jar one day when some cookies are missing, and the parent may be interested in seeing all activities that occurred around a mailbox the next day when some expected mail is missing. Accordingly, in some implementations, the techniques disclosed herein allow a reviewer to define and create one or more zones of interest within a static scene of a video feed, and then use the created zones of interest to retroactively identify all past motion events (or all motion events within a particular past time window) that have touched or entered the zones of interest. In some implementations, the identified motion events are presented to the user in a timeline or in a list. In some implementations, real-time alerts for any new motion events that touch or enter the zones of interest are sent to the reviewer. The ability to quickly identify and retrieve past motion events that are associated with a newly created zone of interest addresses the drawbacks of conventional zone monitoring techniques. 
Conventionally, the zones of interest must be defined first based on a certain degree of guessing and anticipation that may later prove to be inadequate or wrong. Also, in conventional systems, only future events (as opposed to both past and future events) within the zones of interest can be identified.
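Retroactive identification of past motion events for a newly created zone can be sketched as a filter over stored motion tracks. The event record layout (`id`, `time`, `track`), the rectangular zone, and the function names are hypothetical; they only illustrate the idea of applying a zone defined later to events recorded earlier.

```python
def touches_zone(track, zone):
    """True if any (x, y) point of the track lies inside the zone rectangle."""
    x0, y0, x1, y1 = zone
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in track)

def events_in_zone(events, zone, start=None, end=None):
    """Return ids of stored events that touched or entered the zone.

    start and end optionally restrict the search to a past time window.
    """
    hits = []
    for event in events:
        if start is not None and event["time"] < start:
            continue
        if end is not None and event["time"] > end:
            continue
        if touches_zone(event["track"], zone):
            hits.append(event["id"])
    return hits
```

Because the tracks are stored for the entire scene, the zone can be defined long after the events occurred.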
In some implementations, when detecting new motion events that have touched or entered some zone(s) of interest, the event detection is based on the motion information collected from the entire scene, rather than just within the zone(s) of interest. In particular, aspects of motion detection, motion object definition, motion track identification, false positive suppression, and event categorization are all based on image information collected from the entire scene, rather than just within each zone of interest. As a result, context around the zones of interest is taken into account when monitoring events within the zones of interest. Thus, the accuracy of event detection and categorization may be improved as compared to conventional zone monitoring techniques that perform all calculations with image data collected only within the zones of interest.
Reference will now be made in detail to implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the various described implementations. However, it will be apparent to one of ordinary skill in the art that the various described implementations may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the implementations.
The depicted structure 150 includes a plurality of rooms 152, separated at least partly from each other via walls 154. The walls 154 may include interior walls or exterior walls. Each room may further include a floor 156 and a ceiling 158. Devices may be mounted on, integrated with, and/or supported by a wall 154, a floor 156, or a ceiling 158.
In some implementations, the smart home environment 100 includes a plurality of devices, including intelligent, multi-sensing, network-connected devices, that integrate seamlessly with each other in a smart home network 202 and/or with a central server or a cloud-computing system to provide a variety of useful smart home functions. The smart home environment 100 may include one or more intelligent, multi-sensing, network-connected thermostats 102 (“smart thermostats”), one or more intelligent, network-connected, multi-sensing hazard detection units 104 (“smart hazard detectors”), and one or more intelligent, multi-sensing, network-connected entryway interface devices 106 (“smart doorbells”). In some implementations, the smart thermostat 102 detects ambient climate characteristics (e.g., temperature and/or humidity) and controls an HVAC system 103 accordingly. The smart hazard detector 104 may detect the presence of a hazardous substance or a substance indicative of a hazardous substance (e.g., smoke, fire, and/or carbon monoxide). The smart doorbell 106 may detect a person's approach to or departure from a location (e.g., an outer door), control doorbell functionality, announce a person's approach or departure via audio or visual means, and/or control settings on a security system (e.g., to activate or deactivate the security system when occupants go and come).
In some implementations, the smart home environment 100 includes one or more intelligent, multi-sensing, network-connected wall switches 108 (“smart wall switches”), along with one or more intelligent, multi-sensing, network-connected wall plug interfaces 110 (“smart wall plugs”). The smart wall switches 108 may detect ambient lighting conditions, detect room-occupancy states, and control a power and/or dim state of one or more lights. In some instances, smart wall switches 108 may also control a power state or speed of a fan, such as a ceiling fan. The smart wall plugs 110 may detect occupancy of a room or enclosure and control supply of power to one or more wall plugs (e.g., such that power is not supplied to the plug if nobody is at home).
In some implementations, the smart home environment 100 includes a plurality of intelligent, multi-sensing, network-connected appliances 112 (“smart appliances”), such as refrigerators, stoves, ovens, televisions, washers, dryers, lights, stereos, intercom systems, garage-door openers, floor fans, ceiling fans, wall air conditioners, pool heaters, irrigation systems, security systems, space heaters, window AC units, motorized duct vents, and so forth. In some implementations, when plugged in, an appliance may announce itself to the smart home network, such as by indicating what type of appliance it is, and it may automatically integrate with the controls of the smart home. Such communication by the appliance to the smart home may be facilitated by either a wired or wireless communication protocol. The smart home may also include a variety of non-communicating legacy appliances 140, such as old conventional washer/dryers, refrigerators, and the like, which may be controlled by smart wall plugs 110. The smart home environment 100 may further include a variety of partially communicating legacy appliances 142, such as infrared (“IR”) controlled wall air conditioners or other IR-controlled devices, which may be controlled by IR signals provided by the smart hazard detectors 104 or the smart wall switches 108.
In some implementations, the smart home environment 100 includes one or more network-connected cameras 118 that are configured to provide video monitoring and security in the smart home environment 100.
The smart home environment 100 may also include communication with devices outside of the physical home but within a proximate geographical range of the home. For example, the smart home environment 100 may include a pool heater monitor 114 that communicates a current pool temperature to other devices within the smart home environment 100 and/or receives commands for controlling the pool temperature. Similarly, the smart home environment 100 may include an irrigation monitor 116 that communicates information regarding irrigation systems within the smart home environment 100 and/or receives control information for controlling such irrigation systems.
By virtue of network connectivity, one or more of the smart home devices may further allow a user to interact with the device even if the user is not proximate to the device. For example, a user may communicate with a device using a computer (e.g., a desktop computer, laptop computer, or tablet) or other portable electronic device (e.g., a smartphone) 166. A webpage or application may be configured to receive communications from the user and control the device based on the communications and/or to present information about the device's operation to the user. For example, the user may view a current set point temperature for a device and adjust it using a computer. The user may be in the structure during this remote communication or outside the structure.
As discussed above, users may control the smart thermostat and other smart devices in the smart home environment 100 using a network-connected computer or portable electronic device 166. In some examples, some or all of the occupants (e.g., individuals who live in the home) may register their devices 166 with the smart home environment 100. Such registration may be made at a central server to authenticate the occupant and/or the device as being associated with the home and to give permission to the occupant to use the device to control the smart devices in the home. Occupants may use their registered devices 166 to remotely control the smart devices of the home, such as when an occupant is at work or on vacation. The occupant may also use a registered device to control the smart devices when the occupant is actually located inside the home, such as when the occupant is sitting on a couch inside the home. It should be appreciated that instead of or in addition to registering the devices 166, the smart home environment 100 may make inferences about which individuals live in the home and are therefore occupants and which devices 166 are associated with those individuals. As such, the smart home environment may “learn” who is an occupant and permit the devices 166 associated with those individuals to control the smart devices of the home.
In some implementations, in addition to containing processing and sensing capabilities, the devices 102, 104, 106, 108, 110, 112, 114, 116, and/or 118 (“the smart devices”) are capable of data communications and information sharing with other smart devices, a central server or cloud-computing system, and/or other devices that are network-connected. The required data communications may be carried out using any of a variety of custom or standard wireless protocols (IEEE 802.15.4, Wi-Fi, ZigBee, 6LoWPAN, Thread, Z-Wave, Bluetooth Smart, ISA100.11a, WirelessHART, MiWi, etc.) and/or any of a variety of custom or standard wired protocols (CAT6 Ethernet, HomePlug, etc.), or any other suitable communication protocol.
In some implementations, the smart devices serve as wireless or wired repeaters. For example, a first one of the smart devices communicates with a second one of the smart devices via a wireless router. The smart devices may further communicate with each other via a connection to one or more networks 162 such as the Internet. Through the one or more networks 162, the smart devices may communicate with a smart home provider server system 164 (also called a central server system and/or a cloud-computing system herein). In some implementations, the smart home provider server system 164 may include multiple server systems, each dedicated to data processing associated with a respective subset of the smart devices (e.g., a video server system may be dedicated to data processing associated with camera(s) 118). The smart home provider server system 164 may be associated with a manufacturer, support entity, or service provider associated with the smart device. In some implementations, a user is able to contact customer support using a smart device itself rather than needing to use other communication means, such as a telephone or Internet-connected computer. In some implementations, software updates are automatically sent from the smart home provider server system 164 to smart devices (e.g., when available, when purchased, or at routine intervals).
In some implementations, some low-power nodes are incapable of bidirectional communication. These low-power nodes send messages, but they are unable to “listen”. Thus, other devices in the smart home environment 100, such as the spokesman nodes, cannot send information to these low-power nodes.
As described, the spokesman nodes and some of the low-power nodes are capable of “listening.” Accordingly, users, other devices, and/or the central server or cloud-computing system 164 may communicate control commands to the low-power nodes. For example, a user may use the portable electronic device 166 (e.g., a smartphone) to send commands over the Internet to the central server or cloud-computing system 164, which then relays the commands to one or more spokesman nodes in the smart home network 202. The spokesman nodes drop down to a low-power protocol to communicate the commands to the low-power nodes throughout the smart home network 202, as well as to other spokesman nodes that did not receive the commands directly from the central server or cloud-computing system 164.
In some implementations, a smart nightlight 170 is a low-power node. In addition to housing a light source, the smart nightlight 170 houses an occupancy sensor, such as an ultrasonic or passive IR sensor, and an ambient light sensor, such as a photo resistor or a single-pixel sensor that measures light in the room. In some implementations, the smart nightlight 170 is configured to activate the light source when its ambient light sensor detects that the room is dark and when its occupancy sensor detects that someone is in the room. In other implementations, the smart nightlight 170 is simply configured to activate the light source when its ambient light sensor detects that the room is dark. Further, in some implementations, the smart nightlight 170 includes a low-power wireless communication chip (e.g., a ZigBee chip) that regularly sends out messages regarding the occupancy of the room and the amount of light in the room, including instantaneous messages coincident with the occupancy sensor detecting the presence of a person in the room. As mentioned above, these messages may be sent wirelessly, using the mesh network, from node to node (i.e., smart device to smart device) within the smart home network 202 as well as over the one or more networks 162 to the central server or cloud-computing system 164.
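The two activation behaviors described for the smart nightlight 170 can be sketched as a single decision function. This is only an illustrative sketch: the function name and the `require_occupancy` flag are hypothetical and are not part of the described device.

```python
def should_activate_light(room_is_dark: bool,
                          occupied: bool,
                          require_occupancy: bool = True) -> bool:
    """Decide whether the nightlight should turn its light source on.

    require_occupancy=True models the implementations that need both a dark
    room and a detected occupant; require_occupancy=False models the
    implementations that react to darkness alone. (Hypothetical flag.)
    """
    if not room_is_dark:
        # The light is never activated in a bright room.
        return False
    return occupied if require_occupancy else True


if __name__ == "__main__":
    print(should_activate_light(room_is_dark=True, occupied=False))
    print(should_activate_light(room_is_dark=True, occupied=False,
                                require_occupancy=False))
```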
Other examples of low-power nodes include battery-operated versions of the smart hazard detectors 104. These smart hazard detectors 104 are often located in an area without access to constant and reliable power and may include any number and type of sensors, such as smoke/fire/heat sensors, carbon monoxide/dioxide sensors, occupancy/motion sensors, ambient light sensors, temperature sensors, humidity sensors, and the like. Furthermore, the smart hazard detectors 104 may send messages that correspond to each of the respective sensors to the other devices and/or the central server or cloud-computing system 164, such as by using the mesh network as described above.
Examples of spokesman nodes include smart doorbells 106, smart thermostats 102, smart wall switches 108, and smart wall plugs 110. These devices 102, 106, 108, and 110 are often located near and connected to a reliable power source, and therefore may include more power-consuming components, such as one or more communication chips capable of bidirectional communication in a variety of protocols.
In some implementations, the smart home environment 100 includes service robots 168 that are configured to carry out, in an autonomous manner, any of a variety of household tasks.
In some implementations, the devices and services platform 300 communicates with and collects data from the smart devices of the smart home environment 100. In addition, in some implementations, the devices and services platform 300 communicates with and collects data from a plurality of smart home environments across the world. For example, the smart home provider server system 164 collects home data 302 from the devices of one or more smart home environments, where the devices may routinely transmit home data or may transmit home data in specific instances (e.g., when a device queries the home data 302). Example collected home data 302 includes, without limitation, power consumption data, occupancy data, HVAC settings and usage data, carbon monoxide levels data, carbon dioxide levels data, volatile organic compounds levels data, sleeping schedule data, cooking schedule data, inside and outside temperature and humidity data, television viewership data, inside and outside noise level data, pressure data, video data, etc.
In some implementations, the smart home provider server system 164 provides one or more services 304 to smart homes. Example services 304 include, without limitation, software updates, customer support, sensor data collection/logging, remote access, remote or distributed control, and/or use suggestions (e.g., based on the collected home data 302) to improve performance, reduce utility cost, increase safety, etc. In some implementations, data associated with the services 304 is stored at the smart home provider server system 164, and the smart home provider server system 164 retrieves and transmits the data at appropriate times (e.g., at regular intervals, upon receiving a request from a user, etc.).
In some implementations, the extensible devices and services platform 300 includes a processing engine 306, which may be concentrated at a single server or distributed among several different computing entities. In some implementations, the processing engine 306 includes engines configured to receive data from the devices of smart home environments (e.g., via the Internet and/or a network interface), to index the data, to analyze the data and/or to generate statistics based on the analysis or as part of the analysis. In some implementations, the analyzed data is stored as derived home data 308.
Results of the analysis or statistics may thereafter be transmitted back to the device that provided home data used to derive the results, to other devices, to a server providing a webpage to a user of the device, or to other non-smart device entities. In some implementations, use statistics, use statistics relative to use of other devices, use patterns, and/or statistics summarizing sensor readings are generated by the processing engine 306 and transmitted. The results or statistics may be provided via the one or more networks 162. In this manner, the processing engine 306 may be configured and programmed to derive a variety of useful information from the home data 302. A single server may include one or more processing engines.
The derived home data 308 may be used at different granularities for a variety of useful purposes, ranging from explicit programmed control of the devices on a per-home, per-neighborhood, or per-region basis (for example, demand-response programs for electrical utilities), to the generation of inferential abstractions that may assist on a per-home basis (for example, an inference may be drawn that the homeowner has left for vacation and so security detection equipment may be put on heightened sensitivity), to the generation of statistics and associated inferential abstractions that may be used for government or charitable purposes. For example, processing engine 306 may generate statistics about device usage across a population of devices and send the statistics to device users, service providers or other entities (e.g., entities that have requested the statistics and/or entities that have provided monetary compensation for the statistics).
In some implementations, to encourage innovation and research and to increase products and services available to users, the devices and services platform 300 exposes a range of application programming interfaces (APIs) 310 to third parties, such as charities 314, governmental entities 316 (e.g., the Food and Drug Administration or the Environmental Protection Agency), academic institutions 318 (e.g., university researchers), businesses 320 (e.g., providing device warranties or service to related equipment, targeting advertisements based on home data), utility companies 324, and other third parties. The APIs 310 are coupled to and permit third-party systems to communicate with the smart home provider server system 164, including the services 304, the processing engine 306, the home data 302, and the derived home data 308. In some implementations, the APIs 310 allow applications executed by the third parties to initiate specific data processing tasks that are executed by the smart home provider server system 164, as well as to receive dynamic updates to the home data 302 and the derived home data 308.
For example, third parties may develop programs and/or applications, such as web applications or mobile applications, that integrate with the smart home provider server system 164 to provide services and information to users. Such programs and applications may be, for example, designed to help users reduce energy consumption, to preemptively service faulty equipment, to prepare for high service demands, to track past service performance, etc., and/or to perform other beneficial functions or tasks.
In some implementations, the processing engine 306 includes a challenges/rules/compliance/rewards paradigm 410d that informs a user of challenges, competitions, rules, compliance regulations and/or rewards and/or that uses operation data to determine whether a challenge has been met, a rule or regulation has been complied with and/or a reward has been earned. The challenges, rules, and/or regulations may relate to efforts to conserve energy, to live safely (e.g., reducing exposure to toxins or carcinogens), to conserve money and/or equipment life, to improve health, etc. For example, one challenge may involve participants turning down their thermostat by one degree for one week. Those participants that successfully complete the challenge are rewarded, such as with coupons, virtual currency, status, etc. Regarding compliance, an example involves a rental-property owner making a rule that no renters are permitted to access certain owner's rooms. The devices in the room having occupancy sensors may send updates to the owner when the room is accessed.
In some implementations, the processing engine 306 integrates or otherwise uses extrinsic information 412 from extrinsic sources to improve the functioning of one or more processing paradigms. The extrinsic information 412 may be used to interpret data received from a device, to determine a characteristic of the environment near the device (e.g., outside a structure that the device is enclosed in), to determine services or products available to the user, to identify a social network or social-network information, to determine contact information of entities (e.g., public-service entities such as an emergency-response team, the police or a hospital) near the device, to identify statistical or environmental conditions, trends or other information associated with a home or neighborhood, and so forth.
In some implementations, the smart home provider server system 164 or a component thereof serves as the video server system 508. In some implementations, the video server system 508 is a dedicated video processing server that provides video processing services to video sources and client devices 504 independent of other services provided by the video server system 508.
In some implementations, each of the video sources 522 includes one or more video cameras 118 that capture video and send the captured video to the video server system 508 substantially in real-time. In some implementations, each of the video sources 522 includes a controller device (not shown) that serves as an intermediary between the one or more cameras 118 and the video server system 508. The controller device receives the video data from the one or more cameras 118, optionally performs some preliminary processing on the video data, and sends the video data to the video server system 508 on behalf of the one or more cameras 118 substantially in real-time. In some implementations, each camera has its own on-board processing capabilities to perform some preliminary processing on the captured video data before sending the processed video data (along with metadata obtained through the preliminary processing) to the controller device and/or the video server system 508.
As shown in
In some implementations, the server-side module 506 includes one or more processors 512, a video storage database 514, an account database 516, an I/O interface to one or more client devices 518, and an I/O interface to one or more video sources 520. The I/O interface to one or more clients 518 facilitates the client-facing input and output processing for the server-side module 506. The account database 516 stores a plurality of profiles for reviewer accounts registered with the video processing server, where a respective user profile includes account credentials for a respective reviewer account, and one or more video sources linked to the respective reviewer account. The I/O interface to one or more video sources 520 facilitates communications with one or more video sources 522 (e.g., groups of one or more cameras 118 and associated controller devices). The video storage database 514 stores raw video data received from the video sources 522, as well as various types of metadata, such as motion events, event categories, event category models, event filters, and event masks, for use in data processing for event monitoring and review for each reviewer account.
Examples of a representative client device 504 include a handheld computer, a wearable computing device, a personal digital assistant (PDA), a tablet computer, a laptop computer, a desktop computer, a cellular telephone, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, a game console, a television, a remote control, a point-of-sale (POS) terminal, a vehicle-mounted computer, an ebook reader, or a combination of any two or more of these data processing devices or other data processing devices.
Examples of the one or more networks 162 include local area networks (LAN) and wide area networks (WAN) such as the Internet. The one or more networks 162 are implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
In some implementations, the video server system 508 is implemented on one or more standalone data processing apparatuses or a distributed network of computers. In some implementations, the video server system 508 also employs various virtual devices and/or services of third party service providers (e.g., third-party cloud service providers) to provide the underlying computing resources and/or infrastructure resources of the video server system 508. In some implementations, the video server system 508 includes, but is not limited to, a handheld computer, a tablet computer, a laptop computer, a desktop computer, or a combination of any two or more of these data processing devices or other data processing devices.
The server-client environment 500 shown in
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, the memory 606 stores a subset of the modules and data structures identified above. In some implementations, the memory 606 stores additional modules and data structures not described above.
The memory 706 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices. In some implementations, the memory 706 includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. In some implementations, the memory 706 includes one or more storage devices remotely located from the one or more processing units 702. The memory 706, or alternatively the non-volatile memory within the memory 706, comprises a non-transitory computer readable storage medium. In some implementations, the memory 706, or the non-transitory computer readable storage medium of memory 706, stores the following programs, modules, and data structures, or a subset or superset thereof:
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, the memory 706 stores a subset of the modules and data structures identified above. In some implementations, the memory 706 stores additional modules and data structures not described above.
In some implementations, at least some of the functions of the video server system 508 are performed by the client device 504, and the corresponding sub-modules of these functions may be located within the client device 504 rather than the video server system 508. In some implementations, at least some of the functions of the client device 504 are performed by the video server system 508, and the corresponding sub-modules of these functions may be located within the video server system 508 rather than the client device 504. The client device 504 and the video server system 508 shown in
As illustrated in
In some implementations, the camera includes one or more radios 850. The radios 850 enable radio communication networks in the smart home environment and allow the camera 118 to communicate wirelessly with smart devices using one or more of the communication interfaces 804. In some implementations, the radios 850 are capable of data communications using any of a variety of custom or standard wireless protocols (e.g., IEEE 802.15.4, Wi-Fi, ZigBee, 6LoWPAN, Thread, Z-Wave, Bluetooth Smart, ISA100.11a, WirelessHART, MiWi, etc.), custom or standard wired protocols (e.g., Ethernet, HomePlug, etc.), and/or any other suitable communication protocol.
The communication interfaces 804 include, for example, hardware capable of data communications (e.g., with home computing devices, network servers, etc.), using any of a variety of custom or standard wireless protocols (e.g., IEEE 802.15.4, Wi-Fi, ZigBee, 6LoWPAN, Thread, Z-Wave, Bluetooth Smart, ISA100.11a, WirelessHART, MiWi, etc.) and/or any of a variety of custom or standard wired protocols (e.g., Ethernet, HomePlug, USB, etc.), or any other suitable communication protocol.
The memory 806 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices. In some implementations, the memory 806 includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. The memory 806, or alternatively the non-volatile memory within the memory 806, comprises a non-transitory computer readable storage medium. In some implementations, the memory 806, or the non-transitory computer readable storage medium of the memory 806, stores the following programs, modules, and data structures, or a subset or superset thereof:
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, the memory 806 stores a subset of the modules and data structures identified above. In some implementations, the memory 806 stores additional modules and data structures not described above.
In some implementations, at least some of the functions of the camera 118 are performed by a client device 504, the server system 508, and/or one or more smart devices 204, and the corresponding sub-modules of these functions may be located within the client device 504, the server system 508, and/or smart devices 204, rather than the camera 118. Similarly, in some implementations, at least some of the functions of the client device, the server system, and/or smart devices are performed by the camera 118, and the corresponding sub-modules of these functions may be located within the camera 118. For example, in some implementations, a camera 118 captures an IR image of an illuminated scene (e.g., using the illumination module 860 and the image capture module 862), while a server system 508 stores the captured images (e.g., in the video storage database 514) and creates a depth map 876 based on the captured images (e.g., performed by a depth mapping module 878 stored in the memory 606). The server system 508, the client device 504, and the camera 118, shown in
A scene understanding server 900 typically includes one or more processing units (CPUs) 902 for executing modules, programs, or instructions stored in the memory 914 and thereby performing processing operations; one or more network or other communications interfaces 904; memory 914; and one or more communication buses 912 for interconnecting these components. The communication buses 912 may include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. In some implementations, the server 900 includes a user interface 906, which may include a display device 908 and one or more input devices 910, such as a keyboard and a mouse.
In some implementations, the memory 914 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices. In some implementations, the memory 914 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In some implementations, the memory 914 includes one or more storage devices remotely located from the CPU(s) 902. The memory 914, or alternatively the non-volatile memory device(s) within the memory 914, comprises a non-transitory computer readable storage medium. In some implementations, the memory 914, or the computer readable storage medium of memory 914, stores the following programs, modules, and data structures, or a subset thereof:
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, the memory 914 stores a subset of the modules and data structures identified above. In some implementations, the memory 914 stores additional modules and data structures not described above.
In some implementations, at least some of the functions of the scene understanding server 900 are performed by a client device 504, the camera 118, or other servers in the video server system 508. Similarly, in some implementations, at least some of the functions of the client device 504, the video server system 508, and the camera 118 are performed by the scene understanding server 900. For example, in some implementations, a camera 118 captures an IR image of an illuminated scene (e.g., using the illumination module 860 and the image capture module 862), while a scene understanding server 900 stores the captured images 872 and creates one or more depth maps 876 based on the captured images (e.g., performed by a depth mapping module 878).
As described in greater detail below, the illuminators 856 are activated to illuminate a scene by emitting streams of light (e.g., infrared (IR) light). During illumination, light rays are scattered by and reflect off of object surfaces in the scene (e.g., walls, furniture, humans, etc.). Reflected light rays are then detected by the sensor array 852, which captures an image of the scene (e.g., an IR image or an RGB image). The captured image digitally measures the intensity of the reflected IR light for each of the pixels in the sensor array 852.
In some implementations, the illuminators 856 are light emitting diodes (LEDs). In some implementations, the illuminators 856 are semiconductor lasers or other semiconductor light sources. In some implementations, the illuminators 856 are configured to emit light spanning a broad range of the electromagnetic spectrum, including light in the IR range (e.g., 700 nm to 1 mm), the visible light range (e.g., 400 nm-700 nm), and/or the ultraviolet range (e.g., 10 nm-400 nm). In some implementations, a portion of the illuminators 856 are configured to emit light in a first range (e.g., IR range), while other illuminators 856 are configured to emit light in a second range (e.g., visible light range). In some implementations, the illuminators 856 are configured to emit light in accordance with one or more predefined illumination patterns. For example, in some implementations, the illumination pattern is circular round-robin in a clockwise order. In some of these implementations, the round-robin pattern activates two illuminators at a time, as illustrated in
The sensor array 852 converts an optical image (e.g., reflected light rays) into an electric signal. In some implementations, the sensor array 852 is a CCD image sensor, a CMOS sensor, or another type of light sensor device (e.g., a hybrid of CCD and CMOS). The sensor array 852 includes a plurality of individual light-sensitive sensors. In some implementations, the sensors of the sensor array 852 are arranged in a rectangular grid pattern as illustrated in
In some implementations, the camera 118 includes additional camera components, such as one or more lenses, image processors, shutters, and/or other components known to those skilled in the art of digital photography.
In some implementations, the camera 118 also includes camera circuitry for coordinating various image capture functionality of the camera 118. In some implementations, the camera circuitry is coupled to the illuminators 856, to the sensor array 852, and/or to other camera components, and coordinates the operational timing of the various camera device components. In some implementations, when capturing an IR image of a scene, the camera circuitry activates a subset of the illuminators 856, activates the sensor array 852 to capture the image, and determines an appropriate shutter speed to manage the image exposure. In some implementations, the camera circuitry performs basic image processing of raw images captured by the sensor array 852 during the exposure. The image processing includes filtering and conversion of a produced voltage or current at the sensor array 852 into a digital value.
In some implementations, one or more illuminators 856 are angled relative to the planar axis of the sensor array, such as illuminator 856-1 in
To generate a lookup table for a pixel, the lookup table generation module 868 determines an expected reflected light intensity at the pixel based on the simulated surfaces 1304 being at various fixed distances 1302 from the pixel. This is illustrated in
For each depth 1302, the illuminators 856 of the camera 118 are simulated to activate in accordance with a pre-defined illumination pattern. An illumination pattern specifies the grouping of illuminators 856 (if any), specifies the order the groups of illuminators are activated, and may specify other parameters related to the operation of the illuminators.
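An illumination pattern of this kind (a grouping of illuminators plus an activation order) can be represented with a small data structure. The sketch below is illustrative only: the class and function names are hypothetical, illuminators are assumed to be indexed 0..N-1 in clockwise order, and the choice of eight illuminators in groups of two is an assumed example that happens to yield four groups.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IlluminationPattern:
    """Groups of illuminator indices; the tuple order is the activation order."""
    groups: Tuple[Tuple[int, ...], ...]


def round_robin_pattern(num_illuminators: int,
                        group_size: int) -> IlluminationPattern:
    """Build a round-robin pattern of adjacent, non-overlapping groups.

    Assumes illuminators are numbered clockwise starting at 0 (an assumption
    made for this sketch, not something the description specifies).
    """
    if group_size <= 0 or num_illuminators % group_size != 0:
        raise ValueError("group_size must evenly divide num_illuminators")
    groups = tuple(
        tuple(range(start, start + group_size))
        for start in range(0, num_illuminators, group_size)
    )
    return IlluminationPattern(groups)


if __name__ == "__main__":
    pattern = round_robin_pattern(num_illuminators=8, group_size=2)
    for step, group in enumerate(pattern.groups):
        print(f"step {step}: activate illuminators {group}")
```

Each step in such a pattern would correspond to one simulated (or captured) image, so a pattern with four groups produces four intensity values per pixel per depth.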
In some implementations, the estimated light intensity values are placed into an intensity matrix Yi,j 1506, as illustrated in
The kth column 1500-k in the intensity matrix Yi,j 1506 has four light intensity estimates 1501-k, 1502-k, 1503-k, and 1504-k, corresponding to the same four illumination groups in the illumination pattern. Finally, the mth column 1500-m has four light intensity estimates corresponding to the same four illumination groups in the illumination pattern. Note that the matrix Yi,j 1506 is for a single pixel i,j (e.g., as downsampled from the sensor array 852).
As currently computed, the entries in the intensity matrix Yi,j 1506 depend on the reflectivity ρ of the simulated surface. Because different actual surfaces have varying reflectivities, it would be useful to “normalize” the matrix in a way that eliminates the reflectivity constant ρ. In some implementations, the columns of the intensity matrix Yi,j 1506 are normalized by dividing the elements of each column by the length (e.g., L2 norm) of the column.
Performing the same normalization process for each column in the intensity matrix Yi,j 1506 creates a normalized lookup table $\tilde{Y}_{i,j}$.
Note that after normalization, each column of the lookup table $\tilde{Y}_{i,j}$ has the same normalized length, even though each column corresponds to a different distance from the sensor array. However, the distribution of values across the elements (corresponding to the illumination groups) is different for different depths (e.g., the normalized first column is different from the normalized kth column).
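The column normalization described above can be sketched in a few lines. This is a minimal illustration, not the actual lookup table generation module 868: the function name, the nested-list matrix layout (rows are illumination groups, columns are simulated depths), and the sample values are all assumptions. The example also shows why normalization removes the reflectivity constant: scaling every entry by the same factor leaves the normalized columns unchanged.

```python
import math
from typing import List

# matrix[row][col]: rows are illumination groups, columns are simulated depths.
Matrix = List[List[float]]


def normalize_columns(matrix: Matrix) -> Matrix:
    """Divide each column by its L2 norm so every column has unit length."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        norm = math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
        if norm == 0.0:
            raise ValueError("cannot normalize an all-zero column")
        for r in range(n_rows):
            result[r][c] = matrix[r][c] / norm
    return result


if __name__ == "__main__":
    # Made-up intensities for four illumination groups at two depths.
    intensities = [[3.0, 1.0], [4.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    table = normalize_columns(intensities)

    # The same surface with a different (hypothetical) reflectivity of 0.3.
    dimmer = [[0.3 * v for v in row] for row in intensities]
    dimmer_table = normalize_columns(dimmer)

    same = all(
        math.isclose(a, b, abs_tol=1e-12)
        for row_a, row_b in zip(table, dimmer_table)
        for a, b in zip(row_a, row_b)
    )
    print("normalized tables match:", same)
```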
Some implementations take advantage of symmetry to reduce the number of lookup tables. For example, using the illumination pattern illustrated in
As illustrated in
For each individual pixel there is a separate lookup table, which is generated as described above by simulating virtual surfaces at different depths. The actual depth in the scene at the pixel is determined by finding the closest matching record in the lookup table for the pixel. In this example, the vector $\vec{b}_{i,j}$ 1706 and the records in the lookup table (e.g., column $\tilde{Y}_{i,j}(k)$ 1510) are four dimensional vectors. In some implementations, the closest match is computed by finding the lookup table record whose “direction” in $\mathbb{R}^4$ most closely aligns with the sample vector $\vec{b}_{i,j}$ 1706. This can be determined by computing the inner product (e.g., dot product) of the vector $\vec{b}_{i,j}$ 1706 with each of the records in the lookup table. In some implementations, the inner product of the vector $\vec{b}_{i,j}$ 1706 with the record $\tilde{Y}_{i,j}(k)$ 1510 is $\langle \vec{b}_{i,j}, \tilde{Y}_{i,j}(k) \rangle = y_{1k}(b_1 - b_0) + y_{2k}(b_2 - b_0) + y_{3k}(b_3 - b_0) + y_{4k}(b_4 - b_0)$. The record in the lookup table whose inner product with the sample vector 1706 is the greatest has an associated depth (i.e., the simulated depth for which the lookup table record was created), and this is the estimated depth for the pixel. Typically, the inner product used is just the dot product, as illustrated in this example.
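The per-pixel lookup can be sketched as a search for the record with the largest dot product. This is a hedged illustration rather than the depth mapping module 878 itself: the function name, the argument layout, and the numeric values are hypothetical. The baseline value b0 is simply subtracted from each sample, mirroring the (b1 - b0) through (b4 - b0) terms in the inner product; what b0 physically represents is not assumed here.

```python
from typing import Sequence


def estimate_depth(samples: Sequence[float],
                   baseline: float,
                   records: Sequence[Sequence[float]],
                   depths: Sequence[float]) -> float:
    """Return the depth whose normalized record best aligns with the sample.

    samples  -- measured intensities b1..bn for one pixel, one per group
    baseline -- the b0 value subtracted from every sample
    records  -- normalized lookup table columns, one record per depth
    depths   -- simulated depth associated with each record
    """
    if not records or len(records) != len(depths):
        raise ValueError("records and depths must be non-empty and equal length")
    b = [s - baseline for s in samples]

    def dot(record: Sequence[float]) -> float:
        return sum(y * x for y, x in zip(record, b))

    best_index = max(range(len(records)), key=lambda k: dot(records[k]))
    return depths[best_index]


if __name__ == "__main__":
    # Two made-up normalized records for simulated depths of 1.0 and 2.0.
    records = [(0.6, 0.8, 0.0, 0.0), (0.5, 0.5, 0.5, 0.5)]
    depths = [1.0, 2.0]
    print(estimate_depth([9.0, 11.0, 3.0, 3.0], 3.0, records, depths))
    print(estimate_depth([13.0, 13.5, 12.8, 13.2], 3.0, records, depths))
```

Because every record has unit length, the largest dot product corresponds to the smallest angle between the sample vector and the record, which is why the sample vector itself does not need to be normalized.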
The process just described is shown concisely by the formula in
In the example illustrated in
In some implementations, the camera 118 has infrared illuminators 856, which illuminate the scene (typically at night), and the camera captures one or more IR images to form an IR intensity image 1802, as illustrated in
Using size and/or quadrilateral analysis of the low intensity regions in
The same techniques described with respect to windows can identify other types of objects as well. For example, the same analysis used for windows can be applied to identify mirrors or television screens. In some implementations, a sufficiently large quadrilateral region with low intensity of reflected IR light is identified as a television rather than a window based on other information, such as frequent movement within the region. Certain materials have reflectivities that are intermediate between a specular surface and a surface with highly diffused reflections. In some implementations, these materials are identified by a range of expected image intensity from reflecting the IR light.
In some implementations, quadrilateral fitting measures the absolute difference between the quadrilateral and the region, and determines that there is a good fit when the absolute difference is less than a threshold percentage of the area of the quadrilateral (e.g., less than 5%, less than 10%, or less than 20%). In some implementations, the process uses more general polygons rather than quadrilaterals.
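The fit test can be sketched as follows, under the assumption that both the low intensity region and the candidate quadrilateral are available as sets of pixel coordinates, and that the "absolute difference" is the number of pixels belonging to exactly one of the two sets. The function name and the example shapes are hypothetical.

```python
def is_good_quadrilateral_fit(region_pixels, quad_pixels, max_fraction=0.10):
    """Compare a low-intensity region against a candidate quadrilateral.

    Both arguments are sets of (row, col) pixel coordinates. The mismatch is
    the number of pixels in exactly one of the two sets (an assumption about
    how the absolute difference is measured).
    """
    if not quad_pixels:
        return False
    mismatch = len(region_pixels ^ quad_pixels)
    return mismatch / len(quad_pixels) < max_fraction

# Hypothetical example: a 10x10 rectangle with three pixels missing.
quad = {(r, c) for r in range(10) for c in range(10)}
region = quad - {(0, 0), (0, 1), (9, 9)}
fits_at_5_percent = is_good_quadrilateral_fit(region, quad, 0.05)
fits_at_2_percent = is_good_quadrilateral_fit(region, quad, 0.02)
```

In this example the mismatch is 3% of the quadrilateral area, so the region passes a 5% threshold and fails a 2% threshold.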
Some implementations use motion discontinuity as a factor in determining whether a low intensity region is a window. For example, motion of an object on an opposite side of a window will show up as discontinuous both as the object enters the field of the window and when the object exits the field of the window. In some implementations, the presence of motion discontinuity within a region is used as evidence that the region is a window, but the absence of motion discontinuity is not used as evidence that the region is not a window.
Some implementations build (1982) a depth map based on IR images captured while the camera 118 is in the first position 1988. In some implementations, the IR images are captured temporally proximate to the time the zone is defined in order to ensure that the depth map is built based on the same field of vision. In some implementations, temporal proximity is defined to be within 12 hours or within 24 hours. At some point later, the camera moves (1984). For example, a person may bump the camera or a person may choose to move the camera slightly to get better coverage of a room. Later, some implementations build (1986) a second depth map based on IR images captured while the camera 118 is in a second position 1990. Note that the zone correction module 928 does not necessarily know the camera has moved. In some implementations, depth maps are created on a periodic basis (e.g., once each night, every two days, or once each week).
In some implementations, the zone correction module 928 computes point clouds 930 corresponding to each of the depth maps, where each point in a point cloud 930 is a three dimensional position in the scene monitored by the camera, as illustrated below in
The process of comparing two point clouds is sometimes referred to as “registration” by those of skill in the art. A registration process determines how to transform one point cloud into another point cloud. Some implementations use one or more iterated closest point (ICP) methods to determine the transformation. When one of the point clouds can be transformed to match the other point cloud, the iterative process builds the transformation as a sequence of steps that converge on the final transformation. When the two point clouds are fundamentally different (e.g., from IR images captured from different scenes), the iterative process is generally unable to converge.
After the transformation is determined, the transformation is applied to the zone defined by the user, thereby creating an adjusted zone that corresponds to the defined zone. This is illustrated below in
In the scene 1900-B of
As illustrated in
When the camera is at the first location 1940, the field of vision of the camera is illustrated by the dotted lines 1942 on the left and 1944 on the right. When the camera is at the second location 1950, the field of vision of the camera is illustrated by the dotted lines 1952 on the left and 1954 on the right. A first depth map is created based on images captured while the camera 118 is at the first position 1940, and a second depth map is created based on images captured while the camera 118 is at the second position 1950. For each of the depth maps, a point cloud is created that contains a plurality of points.
In this illustration, the points 1946-1 and 1946-2 are in the field of vision of the camera at the first position 1940 but not in the field of vision from the second location 1950. Conversely, the points 1956-1, 1956-2, and 1956-3 are in the field of vision of the camera 118 at the second position 1950 but not in the field of vision from the first location 1940. The other points in this illustration are in the shared region 1960.
A first point in this region is identified both as point 1946-3 and as point 1956-4. The two labels for the same point are due to the presence of the point in both the first and second depth maps. With respect to the camera 118, the 3-dimensional coordinates of the point 1946-3 are different from the 3-dimensional coordinates of the point 1956-4, even though the point has not moved. For example, the depth and horizontal position of the point 1946-3 (as measured from the first camera location 1940) are different from the depth and horizontal position of the point 1956-4 (as measured from the second camera location 1950). If the height of the camera above the floor is the same at the first and second locations, then the measured height of the point 1946-3 is the same as the height of the point 1956-4. The same analysis applies to the second labeled point in the region 1960, which is labeled as both 1946-4 and 1956-5. They are the same physical point in the scene, but have different 3-dimensional coordinates based on the two views. The same analysis applies to the third labeled point in the region 1960, which is labeled as both 1946-5 (from the first depth map) and 1956-6 (from the second depth map).
The first point cloud (containing the points 1946-1 through 1946-5) is correlated to the second point cloud (containing the points 1956-1 through 1956-6), based on points in the overlap region 1960. In practice, the points are not literally identical as they are in this example. As indicated above, an iterative algorithm determines how to map one of the point clouds to the other.
As shown in
When the camera has moved slightly, the process computes an output 1972, which is an adjusted zone. The adjusted zone corresponds to the original zone, but accounts for the camera movement. This is illustrated above
In some implementations, computing the adjusted zone includes: (1) converting (1974) the original depth map to a point cloud with 3D coordinates. In some implementations, the constructed point cloud has at least 100 points. In some implementations, the point cloud has fewer or more points. For example, in some implementations, the point cloud has 50 points or 500 points. In some implementations, the points for the point cloud are randomly or pseudo-randomly selected from the depth map. In some implementations, the points in the point cloud are selected in a regular pattern, such as every tenth pixel horizontally and vertically. In some implementations, the points in the point cloud are selected based on specific characteristics, such as proximity to the camera or locations where there is significant depth discontinuity (see
The process builds (1976) a second point cloud from the second map, which corresponds to the current location of the camera. The points in the second point cloud are generally selected in the same way as for the first point cloud.
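One way to build such a point cloud is sketched here, using the regular-pattern selection (every tenth pixel). The conversion to 3D coordinates assumes a simple pinhole camera model with a hypothetical focal length and a principal point at the image center; none of these camera parameters come from the description above.

```python
def depth_map_to_point_cloud(depth_map, step=10, focal_px=500.0):
    """Sample a depth map on a regular grid and back-project to 3D points.

    depth_map: list of rows, each a list of depths (None where unknown)
    step: take every `step`-th pixel horizontally and vertically
    focal_px: hypothetical focal length in pixels (pinhole model assumption)
    """
    rows, cols = len(depth_map), len(depth_map[0])
    cx, cy = (cols - 1) / 2.0, (rows - 1) / 2.0  # assumed principal point
    points = []
    for v in range(0, rows, step):
        for u in range(0, cols, step):
            z = depth_map[v][u]
            if z is None:
                continue  # no depth estimate at this pixel
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            points.append((x, y, z))
    return points

# Hypothetical 100x100 depth map of a flat surface 2 meters away.
flat = [[2.0] * 100 for _ in range(100)]
cloud = depth_map_to_point_cloud(flat)
```

With a 100x100 depth map and a step of ten pixels, the sketch produces 100 points. Using the same sampling for both depth maps keeps the two point clouds comparable.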
The process then compares (1978) the two point clouds. This process is sometimes referred to as point cloud registration. Some implementations use an iterative process to perform point cloud registration. In some implementations, the process uses an iterated closest point (“ICP”) method. The registration process determines a transformation that maps the first point cloud to the second point cloud.
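The registration step can be illustrated with a deliberately simplified sketch. It works in two dimensions (a top-down view of the scene) rather than three, pairs points by brute-force nearest neighbor, and solves for the rigid transformation in closed form at each iteration. All function names and sample points are hypothetical, and a production system would more likely rely on an established 3D registration library.

```python
import math

def best_rigid_transform(src, dst):
    """Closed-form 2D rotation angle and translation mapping paired src to dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def apply_transform(points, theta, tx, ty):
    """Rotate each point by theta about the origin, then translate."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(source, target, max_iters=20, tol=1e-9):
    """Estimate (theta, tx, ty) so that the transformed source matches target."""
    theta, tx, ty = 0.0, 0.0, 0.0
    prev_error = float("inf")
    error = float("inf")
    for _ in range(max_iters):
        moved = apply_transform(source, theta, tx, ty)
        # Pair each moved source point with its nearest target point.
        matches = [min(target, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
                   for p in moved]
        error = sum(math.dist(p, q) for p, q in zip(moved, matches)) / len(moved)
        if abs(prev_error - error) < tol:
            break  # the mean residual has stopped changing
        prev_error = error
        # Re-estimate the full transform from the original source points.
        theta, tx, ty = best_rigid_transform(source, matches)
    return theta, tx, ty, error

# Hypothetical scene points seen before and after a small camera movement.
source = [(0.0, 0.0), (2.0, 0.0), (0.0, 3.0), (3.0, 3.0), (1.0, -2.0), (-2.0, 1.0)]
target = apply_transform(source, 0.05, 0.1, -0.05)[::-1]
theta, tx, ty, err = icp_2d(source, target)
```

The returned mean residual gives a caller a way to judge the result: a residual that stays large suggests that the two point clouds do not describe the same scene, so no adjusted zone should be proposed.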
Finally, the process applies (1980) the identified transformation to the user-selected zone to identify an adjusted zone based on the new camera location. In some implementations, the new zone is used immediately. In some implementations, the user is prompted to confirm the adjusted zone, and the user may tweak the adjusted zone further.
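Applying the transformation to a zone can be sketched as follows. The sketch assumes that the corners of the user-selected zone have already been converted to 3D coordinates (for example, by reading their depths from the first depth map) and that the registration result is expressed as a 3x3 rotation matrix plus a translation vector. The function name and the numeric values are hypothetical.

```python
def transform_zone(corners_3d, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to zone corners."""
    adjusted = []
    for p in corners_3d:
        adjusted.append(tuple(
            sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
            for i in range(3)
        ))
    return adjusted

# Hypothetical transform: 90 degree rotation about the vertical (y) axis
# followed by a 0.5 meter shift along x.
rotation = [[0.0, 0.0, 1.0],
            [0.0, 1.0, 0.0],
            [-1.0, 0.0, 0.0]]
translation = (0.5, 0.0, 0.0)
zone_corners = [(1.0, 2.0, 3.0), (0.0, 0.0, 2.0)]
adjusted_zone = transform_zone(zone_corners, rotation, translation)
```

The adjusted corners would then be projected back into the current image so that the adjusted zone can be shown to the user for confirmation.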
In some implementations, the floor/wall/ceiling module 926 uses a depth map 876 of the scene, which is constructed as illustrated in
Once the depth discontinuities are identified in the binary depth edge map 944, the floor/wall/ceiling module 926 identifies the closed components 946 in the image (i.e., regions that are enclosed by the edges). These closed components 946 represent the candidates for floors, walls, and ceilings.
For each of the closed components 946 that is evaluated, the floor/wall/ceiling module 926 fits a plane to the points in the component. In some implementations, the fitted plane has an equation of the form wxx+wyy+wzz=1, where wx, wy, and wz are constants to be determined, as illustrated in
Once a best plane 948 is identified for a component, the floor/wall/ceiling module 926 evaluates the plane in two ways. First, is the total error sufficiently small so that the plane is a good fit? Second, does the orientation of the plane correspond to floor, wall, or ceiling? Some implementations specify an error threshold, and designate a closed component as a probable floor, wall, or ceiling only when the actual error is less than the threshold. In some implementations, the total error is normalized based on the number of points in the sample.
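A minimal sketch of the plane fit and the normalized error check is given here. It assumes a least squares fit of the form wxx+wyy+wzz=1, solved through the 3x3 normal equations with Cramer's rule, and it uses the mean squared residual as the normalized error. The function names and sample points are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane(points):
    """Least-squares fit of w so that w[0]*x + w[1]*y + w[2]*z is close to 1."""
    # Normal equations: (sum of p p^T) w = (sum of p).
    a = [[sum(p[i] * p[j] for p in points) for j in range(3)] for i in range(3)]
    b = [sum(p[i] for p in points) for i in range(3)]
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("degenerate sample; cannot fit a plane")
    w = []
    for col in range(3):
        # Cramer's rule: replace one column of a with b.
        m = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        w.append(det3(m) / d)
    return w

def mean_squared_error(points, w):
    """Total squared residual normalized by the number of sample points."""
    return sum((w[0] * x + w[1] * y + w[2] * z - 1.0) ** 2
               for x, y, z in points) / len(points)

# Hypothetical sample from a flat surface lying in the plane y = 2.
sample = [(1.0, 2.0, 1.0), (-1.0, 2.0, 3.0), (2.0, 2.0, 5.0),
          (0.0, 2.0, 2.0), (3.0, 2.0, 1.0)]
w = fit_plane(sample)
error = mean_squared_error(sample, w)
```

For this sample the fitted coefficients are approximately (0, 0.5, 0), so only the y coefficient is non-zero, and the error is essentially zero. A component whose error exceeds the chosen threshold would not be designated as a probable floor, wall, or ceiling.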
As illustrated in
so the expression
should be positive for a floor. Similarly, for a ceiling, the expression
should be negative. Some implementations also evaluate the magnitude of the expression
to determine whether it is consistent with data expected for a floor or ceiling. For walls, the expressions are similar, but use the x-dimension rather than the y-dimension.
In
The illustrations in
The dictionary includes a height 2154 and a tilt 2156 for each entry, and includes data for one or more images captured based on different sets of IR illuminators emitting light. In some implementations, a single image is captured while all of the IR emitters are on. In some implementations, a separate image is captured for each individual IR emitter, taken while that IR emitter is on and the remaining IR emitters are off. In some implementations, the emitters are grouped into pairs, as illustrated above with respect to
In this example dictionary 2150, the second dictionary entry 2152-2 corresponds to a height of 0.6 meters and a tilt angle of 10°. In some implementations, positive tilt angles indicate the camera is pointing downward. For this second entry 2152-2, the process simulates or captures four images I2,1, I2,2, I2,3, and I2,4, corresponding to each of the four subsets of IR illuminators. In some implementations, abbreviated images are stored. For example, some implementations store only pixels corresponding to the simulated floor. Note that the pixels in the images are typically downsampled from the image sensor array. For example, the image sensor array may include 4 million individual image sensors, whereas the saved images may include only 10,000 pixels.
In this example dictionary 2150, there are 250 dictionary entries 2152, corresponding to heights ranging from 0.6 meters to 3.0 meters (in 0.1 meter increments) and angles ranging from 0 degrees to 90 degrees (in 10 degree increments). In some implementations, there are fewer or more dictionary entries 2152, depending on the desired granularity, available storage space, required processing speed, and/or other considerations.
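The entry count follows from the grid: 25 heights (0.6 through 3.0 meters) times 10 tilt angles (0 through 90 degrees) gives 250 entries. A short sketch of enumerating the grid is shown here; the function name and the ordering of the entries (by height, then by tilt) are assumptions.

```python
def build_dictionary_grid():
    """Enumerate (height_m, tilt_deg) pairs: 0.6-3.0 m by 0.1 m, 0-90 deg by 10 deg."""
    heights = [h / 10 for h in range(6, 31)]  # 0.6, 0.7, ..., 3.0
    tilts = list(range(0, 100, 10))           # 0, 10, ..., 90
    return [(h, t) for h in heights for t in tilts]

entries = build_dictionary_grid()
```

With this ordering, the second element of the list is (0.6, 10), which matches the height and tilt given for the second dictionary entry 2152-2.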
Whereas a dictionary 2150 is typically created one time for a given camera model, the dictionary 2150 can be used many times to estimate the heights and tilt angles of many cameras at many different times.
Using the adjusted intensity images, the process identifies (2164) at least one possible floor region. In some implementations, identifying a possible floor region uses techniques illustrated in
Some implementations use an iterative algorithm for identifying a floor region. In some of these implementations, the entire set of pixels is used as a starting point for the first iteration, and in each iteration some of the pixels are removed. In some implementations, the pixels identified for removal in each iteration are selected based on overall contribution to the computed distances between the adjusted IR intensity images and entries in the dictionary. In some implementations, the process combines floor selection (2164) and classification (2166) into an iterative loop.
Once a floor region is identified, a classifier estimates (2166) the (height, tilt) 2168 using the adjusted IR intensity images, the previously computed dictionary 2150, and limiting the analysis to pixels in the identified floor region. The operation of the classifier is described in more detail in
The classifier identifies a “closest” dictionary entry 2152 to the adjusted IR intensity images, and estimates the height and tilt of the camera based on that closest dictionary entry. When the number of dictionary entries is small (e.g., 100), some implementations compare the adjusted IR intensity images to each of the dictionary entries to find the closest one. In some implementations, the process is able to prune some of the dictionary entries, thereby comparing the adjusted IR intensity images to a smaller list of dictionary entries.
To identify a closest dictionary entry 2152, some implementations compute distances between vectors, as illustrated in
To compute the distance between the feature vector 2178 and a dictionary entry vector 2180, some implementations use Euclidean distance based on the relevant vector components. The relevant components are the ones associated with the pixels in the identified floor region. For example, in this case, the rth pixel is part of the identified floor region, so the four components corresponding to r are included in the calculation of the distance, as illustrated in formula 2176-2. If there are four illuminator subsets and 100 pixels in the identified floor region, then the distance calculation will use 400 components of the vectors. In some implementations, alternative distance metrics are used, such as the total absolute difference between vector components |a1r−b1r|+ . . . or the maximum absolute difference between vector components.
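The distance calculation restricted to floor pixels, together with the selection of the closest dictionary entry, can be sketched as follows. The data layout (a mapping from pixel index to a list of intensities, one per illuminator subset) and all names and values are hypothetical.

```python
import math

def floor_distance(feature, entry, floor_pixels, metric="euclidean"):
    """Distance between two per-pixel intensity vectors over floor pixels only.

    feature, entry: dict mapping pixel index -> list of intensities,
                    one per illuminator subset
    floor_pixels: pixel indices in the identified floor region
    """
    diffs = [abs(a - b)
             for r in floor_pixels
             for a, b in zip(feature[r], entry[r])]
    if metric == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "manhattan":  # total absolute difference
        return sum(diffs)
    if metric == "max":        # maximum absolute difference
        return max(diffs)
    raise ValueError(f"unknown metric: {metric}")

def closest_entry(feature, dictionary, floor_pixels):
    """Return the (height, tilt) key of the nearest dictionary entry."""
    return min(dictionary,
               key=lambda k: floor_distance(feature, dictionary[k], floor_pixels))

# Hypothetical data: three pixels, four illuminator subsets, pixels 0 and 1 on the floor.
feature = {0: [1, 2, 3, 4], 1: [5, 5, 5, 5], 2: [9, 9, 9, 9]}
dictionary = {
    (0.6, 10): {0: [1, 2, 3, 4], 1: [5, 5, 5, 6], 2: [0, 0, 0, 0]},
    (0.7, 10): {0: [4, 2, 3, 0], 1: [5, 5, 5, 5], 2: [9, 9, 9, 9]},
}
floor = [0, 1]
best = closest_entry(feature, dictionary, floor)
```

Pixel 2 is outside the floor region, so its values do not affect the result even though they differ greatly for the first entry. With two floor pixels and four illuminator subsets, each distance uses eight vector components.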
In some implementations, the single closest dictionary entry is used to estimate the camera position. For example, if the second dictionary entry 2152-2 above is determined to be closer than all of the other dictionary entries, then the camera is estimated to be at a height of 0.6 meters and at an angle of 10 degrees (see
The process identifies (2206) a plurality of distinct subsets of IR illuminators 856 of a camera system 118. One example is illustrated above in
The camera also has (2208) a 2-dimensional array 852 of image sensors. The 2-dimensional array 852 is typically laid out in a rectangular pattern, as illustrated above in
The process partitions (2214) the image sensors into a plurality of pixels. In some implementations, each pixel includes (2216) a respective single image sensor. In some implementations, each pixel includes (2218) a respective plurality of image sensors. In some implementations, each pixel includes (2220) more than 50 respective image sensors. These are a few ways that implementations partition the individual image sensors into pixels. Typically the array of image sensors has a high resolution, but sensors are downsampled to create a more manageable number of pixels (e.g., 10,000 pixels).
A separate lookup table is constructed for each pixel. Each record in a lookup table corresponds to a depth in front of the pixel. The accuracy of subsequent depth estimation depends on the number of depths used to build each lookup table. For example, if depth data is created for each inch in front of the pixel, then subsequent depth estimation may be accurate within an inch. However, if there are only two depth data points, the accuracy for subsequent estimation will be limited.
For each pixel, and for each of m distinct depths from the pixel, the process performs (2222) the following operations. The process simulates (2224) a virtual surface at the respective depth. Implementations use various shapes for the virtual surfaces, such as planar (2226), spherical (2228), parabolic (2230), or cubic (2232).
For each pixel and for each of the depths (2222), the process also determines (2234) an expected IR light intensity at the respective pixel based on the respective depth, the shape of the virtual surface, and which subset of IR illuminators is emitting IR light. In some implementations, the expected IR light intensity at the respective pixel is (2236) based on other characteristics of the IR illuminators of the camera system as well. For example, in some implementations, the characteristics include (2238) the lux of the IR illuminators 856. In some implementations, the characteristics include (2240) orientation of the IR illuminators relative to the sensor array. This is illustrated above in
For each pixel and for each of the depths (2222), the process also forms (2244) an intensity vector using the expected IR light intensity for each of the distinct subsets. This is illustrated in
The process constructs (2250) a lookup table for each pixel using the normalized vectors corresponding to the pixel. Each lookup table associates (2252) each respective normalized vector in the table with the respective depth of the respective simulated surface. Some implementations use this lookup table as described below with respect to the process 2300 illustrated in
In some implementations, the process 2300 detects (2310) a trigger event. In some implementations, creating the depth map of the first scene is (2310) in response to detecting the trigger event. In some implementations, the first scene includes (2312) a first object positioned at a first location within the first scene and the process 2300 detects (2314) the first object positioned at a second location within the first scene, where the second location is distinct from the first location. The movement of the first object triggers the building of the depth map. In some implementations, the trigger event is (2316) a power outage (e.g., build or rebuild the depth map when the computing device reboots).
In some implementations, the process 2300 switches (2318) the mode of operation of the camera system when building the depth map. For example, some implementations switch (2318) from a first mode of the camera system to a second mode of the camera system, including deactivating the first mode and activating the second mode. In some implementations, the array of image sensors has (2320) an associated first pixel gain curve when the first mode is activated, and the array of image sensors has (2320) an associated second pixel gain curve when the second mode is activated.
For each of a plurality of distinct subsets of IR illuminators of the camera system, the process 2300 performs (2322) a set of operations. In some implementations, one or more of the subsets of the IR illuminators consists (2324) of a single IR illuminator. In some implementations, the plurality of IR illuminators are orientated (2326) at a plurality of distinct angles relative to the array of image sensors. In some implementations, each of the distinct subsets of IR illuminators comprises (2328) two adjacent IR illuminators, and the distinct subsets of IR illuminators are (2328) non-overlapping. One of skill in the art recognizes that various groupings, arrangements, and/or configurations may be used for the IR illuminators.
The process 2300 receives (2330) a captured IR image of a first scene taken by a 2-dimensional array of image sensors of the camera system while the respective subset of IR illuminators are emitting IR light and the IR illuminators not in the respective subset are not emitting IR light. This occurs for each distinct subset of IR illuminators. The image sensors are partitioned (2332) into a plurality of pixels. As noted above with respect to the process 2200 in
For each of the pixels, the process 2300 performs (2336) several operations, including using (2338) the captured IR images to form a respective vector of light intensity at the respective pixel. In some implementations, the respective vector for each pixel has (2340) a plurality of components. Each of the components corresponds (2340) to a respective IR light intensity for the respective pixel for a respective captured IR image. This is illustrated above in
For each pixel (2336), the process 2300 then estimates (2344) a depth in the first scene at the respective pixel by looking up the respective vector in a respective lookup table. In some implementations, the process looks up (2346) the respective vector in the respective lookup table by computing (2346) an inner product of the respective vector with records in the lookup table. One of skill in the art recognizes that in a vector space an inner product can be used to measure the extent to which a pair of vectors are pointing in the same direction. In some instances, the inner product is (2350) an ordinary dot product. In some implementations, the process 2300 computes (2348) the inner product of the respective vector with each respective record in the respective lookup table. In some implementations, fewer than all of the inner products are computed for the lookup table (e.g., based on optimization techniques, such as recognizing that certain records in the lookup table would produce smaller inner products than some inner products that are already computed).
In some implementations, the process 2300 determines (2352) the depth in the first scene at the pixel as the depth corresponding to a record in the lookup table whose inner product with the respective vector is greatest among the computed inner products for the respective vector. This is illustrated above with respect to
In some implementations, the respective lookup table is generated (2354) during a calibration process at the camera 118. In some implementations, the calibration process includes (2356) simulating a virtual planar surface at a plurality of respective depths in the first scene. In some implementations, the calibration process includes (2358), for each pixel and each respective depth, determining an expected reflected light intensity. In some implementations, each respective lookup table is downloaded (2362) to the camera system 118 from a remote server during an initialization process prior to creating the depth map.
In some implementations, each respective lookup table includes (2360) a plurality of normalized light intensity vectors, where each normalized light intensity vector corresponds to a respective depth in the first scene. This is illustrated above in
Although lookup tables have been identified separately for each pixel, one of skill in the art recognizes that the separate logical lookup tables are not necessarily stored as separate files or databases. For example, some implementations store all of the lookup tables as a single physical table in a relational database or as a single physical file on a file server. In some implementations, the totality of lookup tables is stored as a small number of distinct files. As described above, implementations generate and use the lookup tables on various devices depending on the capabilities of the camera system 118, available network bandwidth, and other resources. For example, for camera systems with limited processing power and/or storage, some implementations build and use the lookup tables at a scene understanding server 900. The camera system 118 captures the IR images (e.g., baseline image plus additional images with different sets of illuminators on), and transmits them to the server 900. The server then constructs the depth map. In some implementations, the lookup tables are constructed at the server 900 based on the depth simulations and knowledge of the camera configuration, and then downloaded to the camera. In some of these implementations, the camera 118 uses the lookup tables itself to build a depth map.
The process receives (2410) a captured IR image of a scene taken by a 2-dimensional image sensor array of a camera system while one or more IR illuminators of the camera system are emitting IR light, thereby forming an IR intensity map of the scene with a respective intensity value determined for each pixel of the IR image. Typically, the IR image is captured at night, so most of the intensity is based on reflection of the light from the IR illuminators. Typical surfaces disperse light in all directions, so some of the emitted light is reflected back to the image sensor array. For a specular surface, however, such as a window, mirror, or some television screens, the incoming light at a surface is reflected off primarily in one direction, with the angle of incidence equal to the angle of reflection. A specular region therefore typically has low intensity in the IR intensity map.
The pixels in the IR intensity map can correspond to the image sensors in the array 852 in various ways, as previously illustrated with respect to
Typically, the camera system 118 includes (2418) a plurality of IR illuminators, as illustrated above in
The process uses (2424) the IR intensity map to identify a plurality of pixels whose corresponding intensity values are within a predefined intensity range. In some implementations, the predefined intensity range is (2426) all intensity values below a threshold value. This is the intensity range typically used when the goal is to identify windows. Some implementations use other ranges to identify other specific materials.
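A minimal sketch of the thresholding step is shown here, combined with one possible way to group the selected pixels into contiguous regions. The grouping uses a 4-connected flood fill, which is an assumption rather than a method stated in this description. The function name, the intensity values, and the threshold are hypothetical.

```python
def low_intensity_regions(intensity, threshold):
    """Group pixels with intensity below `threshold` into 4-connected regions."""
    rows, cols = len(intensity), len(intensity[0])
    selected = {(r, c) for r in range(rows) for c in range(cols)
                if intensity[r][c] < threshold}
    regions = []
    while selected:
        seed = selected.pop()
        region, stack = {seed}, [seed]
        while stack:
            r, c = stack.pop()
            # Visit the four direct neighbors that are also below the threshold.
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in selected:
                    selected.remove(nb)
                    region.add(nb)
                    stack.append(nb)
        regions.append(region)
    return regions

# Hypothetical IR intensity map with two dark areas.
ir_map = [
    [90, 90, 90, 90, 90],
    [90,  5,  6, 90,  4],
    [90,  7,  5, 90,  3],
    [90, 90, 90, 90, 90],
]
regions = low_intensity_regions(ir_map, 20)
```

In this example the six pixels below the threshold form two separate regions, one with four pixels and one with two pixels.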
The process 2400 clusters (2428) the identified plurality of pixels (i.e., the pixels identified based on the intensity range) into one or more regions that are substantially contiguous. This is illustrated above with respect to
The process 2400 determines (2434) that a first region of the one or more regions corresponds to a specific material based, at least in part, on the intensity values of the pixels in the first region. In some implementations, determining that a first region of the one or more regions corresponds to a specific material includes (2436) determining that the first region is substantially a quadrilateral. This is illustrated by the quadrilaterals 1812 and 1814 in
Once a region has been classified, the process 2400 stores (2442) information in the memory that identifies the region. The information can be stored in various ways. In some implementations, the process 2400 stores coordinates for the region, such as coordinates of a centroid, or coordinates of a subset of points along the boundary. In some implementations, the process 2400 creates a two-dimensional scene map corresponding to the pixels, and specifies a value (e.g., a number or a character) to identify the object/material/function for each pixel. For example, in some implementations, a value of 0 indicates no information, a value of 1 indicates a probable window, 2 indicates a probable floor, 3 indicates a probable wall, and 4 indicates a probable ceiling. Usage of a scene map is illustrated in
In some implementations, the process 2400 receives (2444) a video stream of the scene from the camera system and reviews (2446) the video stream to detect movement in the scene. Movement in the scene can be used to identify possible intruders in a home or other potential problems. In some implementations, the first region is excluded (2446) from movement detection. For example, if the first region is identified as a window, movement in the window region may be movement on the other side of the window (e.g., outside), and thus not suitable for a motion alert. In another example, the first region is a television set, and thus “motion” in the region is typically based on displayed television images rather than real motion at the scene. In some implementations, the process 2400 generates (2448) a motion alert when there is motion detected at the scene outside of the first region.
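Excluding a classified region from movement detection can be sketched as follows. The sketch assumes a per-pixel motion mask produced by some separate motion detector and a scene map that uses the value 1 for a probable window, as in the example values given above. The function name and the sample data are hypothetical.

```python
WINDOW = 1  # scene-map value for a probable window, per the example values above

def motion_alert(motion_mask, scene_map, excluded_values=(WINDOW,)):
    """Return True when motion is detected at any pixel outside excluded regions."""
    for r, row in enumerate(motion_mask):
        for c, moving in enumerate(row):
            if moving and scene_map[r][c] not in excluded_values:
                return True
    return False

# Hypothetical 2x3 scene: the right-hand column is a probable window.
scene_map = [[0, 0, 1],
             [0, 0, 1]]
motion_in_window = [[False, False, True],
                    [False, False, False]]
motion_in_room = [[False, True, False],
                  [False, False, False]]
alert_window = motion_alert(motion_in_window, scene_map)
alert_room = motion_alert(motion_in_room, scene_map)
```

Motion confined to the window region produces no alert, whereas motion elsewhere in the scene does.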
The process 2500 receives (2510) a plurality of captured IR images of a scene taken by a 2-dimensional array of image sensors of a camera system. Each IR image is captured (2512) when illuminators in a distinct subset of IR illuminators of the camera system 118 are emitting light. In some implementations, the image sensors are partitioned (2514) into a plurality of pixels. As described above with respect to
The process 2500 constructs (2516) a depth map of a scene using the plurality of IR images. Some implementations use a process as described in
The process 2500 uses (2524) the depth map to compute a binary depth edge map 944 for the scene. The binary depth edge map 944 identifies (2524) which points in the depth map comprise depth discontinuities. This is illustrated in
The process then determines (2528) that a first component of the plurality of contiguous components represents a large planar surface in the scene. This determination involves a few steps. A first step is to fit (2530) a plane to the points in the first component. In some implementations, the fitting uses least squares to find the best plane for the data in the component. Some implementations use other techniques to identify a “best” plane for the data, such as minimizing the sum of absolute differences between a hypothetical plane and the points in the component. Implementations typically use a sampling of data points from a component to fit the best plane. For example, some implementations use 50 or 100 sample data points from a component.
In making the determination that the first component represents a large planar surface, the process also confirms that the “best” plane is actually a good plane for the data. In some implementations, the process 2500 determines (2540) that the plane fitting residual error is less than a predefined threshold. In some implementations, the plane fitting residual error is the sum of the absolute differences between the plane and the sample points in the component. In some implementations, the plane fitting residual error is the sum of the squares of the differences between the sample points and the plane, or the square root of the sum of the squares. In some implementations, the plane fitting residual error is the maximum absolute difference between the sample points and the plane. Some implementations use two or more techniques to confirm that the residual error is small (e.g., the maximum absolute error is less than a first threshold and the sum of the absolute errors is less than a second threshold).
Once the plane is fitted and it is determined that the residual error is sufficiently small, the first component is identified as a large planar surface. The process 2500 then analyzes the plane to determine whether the surface is likely to be a floor, a ceiling, or a wall. To make this determination, some implementations determine (2532) the orientation of the plane. This is illustrated above with respect to
Some implementations use other criteria as well in making the determination that a component represents a large planar surface. For example, some implementations require the component to have a minimum threshold area to be classified as a probable floor, wall, or ceiling.
The process 2600 receives (2610) a first RGB image of a scene taken by a 2-dimensional array of image sensors of a camera system at a first time. The RGB image identifies what is in the field of vision of the camera. The process also receives (2612) a first plurality of distinct IR images of the scene taken by the array of image sensors temporally proximate to the first time. In general, the temporal proximity ensures that the field of vision of the camera while capturing the IR images is substantially the same as the field of vision of the camera while capturing the RGB image. Commonly, the RGB image is captured during daylight hours, whereas the IR images are captured at night. In some implementations, temporal proximity means within 24 hours or 12 hours. Each of the IR images is taken (2614) while a different subset of IR illuminators of the camera system is emitting light.
The process 2600 uses (2616) the first plurality of IR images to construct a first depth map of the scene, where the first depth map indicates a respective depth in the scene at a plurality of pixels. Some implementations use a process like the depth mapping process 2300 described with respect to
A user designates (2626) a zone within the RGB image. In some implementations, the designated zone is a region of interest, such as a region with special monitoring. In some implementations, the special monitoring consists of excluding the region from monitoring movement. In some implementations, an alert is triggered when there is movement in a designated zone. In some implementations, the zone corresponds (2626) to a contiguous plurality of pixels. In some implementations, the zone is (2628) a quadrilateral. In some implementations, the zone is a polygon. In alternative implementations, the user designates a zone within an IR image instead of within an RGB image.
The process 2600 receives (2630) a second plurality of distinct IR images of the scene taken by the array of image sensors at a second time that is after the first time. In some implementations, each of the IR images in the second plurality is captured (2632) while a different subset of IR illuminators of the camera system is emitting light. Typically, the subsets of IR illuminators used to capture the second plurality of IR images are the same as the subsets of IR illuminators used to capture the first plurality of IR images.
The process 2600 then uses (2634) the second plurality of IR images to construct a second depth map of the scene. The process 2600 typically uses the same steps for building the second depth map as used for building the first depth map, which was described above with respect to boxes 2618-2624 in
The process 2600 then determines (2636) physical movement of the camera system based on the first and second depth maps. In many cases, if there has been no movement of the camera, the second depth map is substantially the same as the first depth map. However, in some cases, the scene itself changes, such as when a new item of furniture is placed in the monitored area, new artwork is hung on a wall, or clutter accumulates on a floor.
In some instances, the determined physical movement is (2638) an angular rotation. In some implementations, the determined physical movement is (2640) a lateral displacement. For example, the camera may be bumped a little to the left or the right on a shelf. Note that lateral displacement can be a horizontal movement, a vertical movement, and/or a movement forward or backward. In some implementations, a “lateral displacement” is defined as any movement of the camera 118 in which the camera continues to point in the same direction (e.g., due east). In many cases, if the camera 118 is bumped or nudged, the physical movement includes (2642) both an angular rotation and a lateral displacement.
In some implementations, the process 2600 identifies (2644) a plurality of points in the first depth map and a corresponding plurality of points in the second depth map. The process 2600 then determines (2646) a respective displacement for each of the identified points between the first and second depth maps. By combining the displacements for a plurality of distinct points, the process 2600 determines the overall movement of the camera 118.
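As a non-limiting illustration, combining per-point displacements into an overall movement can be sketched as follows. The sketch assumes the corresponding points are already matched and expressed as (x, y, z) coordinates, and it treats the movement as a pure translation. Using the median is an assumption of this sketch, chosen because a few points on objects that moved within the scene then have little influence. The function names are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def point_displacements(first: Sequence[Point],
                        second: Sequence[Point]) -> List[Point]:
    """Displacement of each matched point between the two depth maps."""
    return [(x2 - x1, y2 - y1, z2 - z1)
            for (x1, y1, z1), (x2, y2, z2) in zip(first, second)]


def overall_displacement(displacements: Sequence[Point]) -> Point:
    """Per-axis median of the individual displacements."""
    xs, ys, zs = zip(*displacements)
    return (median(xs), median(ys), median(zs))
```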
In some implementations, determining the movement of the camera uses point clouds. The process 2600 forms (2648) a first point cloud using a first plurality of points from the first depth map, and forms (2650) a second point cloud using a second plurality of points from the second depth map. The process then computes (2652) a minimal transformation that aligns the first point cloud with the second point cloud. One of skill in the art recognizes that correlating two point clouds can be performed in various ways. Based on the point cloud transformation, the process 2600 identifies the motion of the camera 118 that would produce the point cloud transformation.
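As a non-limiting illustration of aligning two point clouds, the following sketch computes a least-squares rigid transformation (a rotation followed by a translation) in two dimensions, for points whose correspondences are already known. This is a deliberate simplification: the described process operates on three-dimensional points and does not specify a particular alignment technique. The function name is hypothetical. The returned transformation describes how scene points appear to move; the motion of the camera is the inverse of that transformation.

```python
import math
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def align_2d(source: Sequence[Point2D],
             target: Sequence[Point2D]) -> Tuple[float, Point2D]:
    """Return (theta, (tx, ty)) such that rotating each source point by
    theta and then translating by (tx, ty) best matches the target points
    in the least-squares sense."""
    n = len(source)
    scx = sum(x for x, _ in source) / n
    scy = sum(y for _, y in source) / n
    tcx = sum(x for x, _ in target) / n
    tcy = sum(y for _, y in target) / n
    s_dot = 0.0
    s_cross = 0.0
    for (sx, sy), (tx, ty) in zip(source, target):
        ax, ay = sx - scx, sy - scy  # centered source point
        bx, by = tx - tcx, ty - tcy  # centered target point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    shift = (tcx - (c * scx - s * scy), tcy - (s * scx + c * scy))
    return theta, shift
```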
Based on the determined physical movement of the camera system 118, the process 2600 translates (2654) the zone in the first RGB image into an adjusted zone. When the zone originally designated by the user is a quadrilateral, the adjusted zone is (2656) also a quadrilateral. However, because of the transformation, in some instances, a first edge of the original quadrilateral has (2658) a length that is different from the length of the corresponding edge of the adjusted quadrilateral.
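As a non-limiting illustration, translating a quadrilateral zone can be sketched by back-projecting each corner to a three-dimensional point using its depth, applying the movement, and projecting the result back to pixel coordinates. The sketch assumes an ideal pinhole camera; the focal length, the principal point, and the restriction of the rotation to a yaw about the vertical axis are assumptions of the sketch. The yaw and shift parameters describe the apparent motion of scene points relative to the camera, which is the inverse of the camera's own motion. The function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float]


def adjust_zone(corners: Sequence[Pixel],
                depths: Sequence[float],
                focal: float,
                center: Pixel,
                yaw: float,
                shift: Tuple[float, float, float]) -> List[Pixel]:
    """Map zone corners from the first image into the second image."""
    cx, cy = center
    c, s = math.cos(yaw), math.sin(yaw)
    adjusted = []
    for (u, v), z in zip(corners, depths):
        # Back-project the pixel to a 3-D point using its depth.
        x = (u - cx) * z / focal
        y = (v - cy) * z / focal
        # Rotate about the vertical axis, then translate.
        xr = c * x + s * z + shift[0]
        yr = y + shift[1]
        zr = -s * x + c * z + shift[2]
        # Project back onto the image plane.
        adjusted.append((focal * xr / zr + cx, focal * yr / zr + cy))
    return adjusted
```

Corners at different depths shift by different pixel amounts under the same movement, which is why an edge of the adjusted quadrilateral can differ in length from the corresponding original edge.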
In some implementations, the process 2600 receives (2660) a second RGB image of the scene taken by the array of image sensors of the camera system temporally proximate to the second time. In some implementations, the process 2600 correlates (2662) the adjusted zone to a set of pixels from the second RGB image. This can be helpful to a user who wants to view the zones.
The process 2700 identifies (2710) a plurality of distinct subsets of the IR illuminators. Subsequently, the distinct subsets of illuminators are activated one subset at a time, and the images captured under the different illuminations enable determination of the camera height and tilt angle. In some implementations, each of the distinct subsets of the IR illuminators comprises (2712) two adjacent IR illuminators, and the distinct subsets of the IR illuminators are non-overlapping. In some implementations, each individual illuminator is one of the distinct subsets. For example, if a camera system has eight illuminators, some implementations have eight distinct subsets, consisting of each individual illuminator. In some implementations there is overlap between the distinct subsets. For example, in a camera system with eight illuminators, some implementations have eight distinct subsets corresponding to each possible pair of adjacent illuminators. One of skill in the art recognizes that many other selections of subsets of IR illuminators are possible.
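As a non-limiting illustration, the three subset selections described above can be enumerated as follows. The sketch assumes the illuminators are indexed 0 through n-1 around a ring, so that the last illuminator is adjacent to the first; the function names are hypothetical.

```python
from typing import List, Tuple


def individual_subsets(n: int) -> List[Tuple[int, ...]]:
    """Each illuminator forms its own subset."""
    return [(i,) for i in range(n)]


def disjoint_adjacent_pairs(n: int) -> List[Tuple[int, ...]]:
    """Non-overlapping pairs of adjacent illuminators."""
    return [(i, i + 1) for i in range(0, n - 1, 2)]


def overlapping_adjacent_pairs(n: int) -> List[Tuple[int, ...]]:
    """Every pair of adjacent illuminators on the ring."""
    return [(i, (i + 1) % n) for i in range(n)]
```

For eight illuminators, these functions yield eight, four, and eight subsets, respectively.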
The process 2700 also partitions (2714) the image sensors in the array into a plurality of pixels. In some implementations, each pixel comprises (2716) a single image sensor. In other implementations, each pixel comprises (2718) a plurality of image sensors. Typically, the image sensor array 852 has a large number of image sensors (e.g., a million or more). Implementations commonly downsample the images, combining multiple sensors into a single virtual pixel. In some implementations, each pixel includes about 100 image sensors (e.g., a 10×10 contiguous square). In some implementations, each pixel corresponds to the same number of image sensors.
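As a non-limiting illustration, combining contiguous square blocks of image sensors into virtual pixels can be sketched as follows. Averaging the sensor values within each block is an assumption of this sketch, as is discarding any partial block at the right or bottom edge; the function name is hypothetical.

```python
from typing import List, Sequence


def downsample(image: Sequence[Sequence[float]], block: int) -> List[List[float]]:
    """Average each block-by-block square of sensor values into one pixel."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - rows % block, block):
        out_row = []
        for c in range(0, cols - cols % block, block):
            total = sum(image[r + dr][c + dc]
                        for dr in range(block) for dc in range(block))
            out_row.append(total / (block * block))
        out.append(out_row)
    return out
```

With a block size of 10, each virtual pixel combines 100 image sensors, matching the 10×10 example above.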
Before computing an actual camera position, implementations build a dictionary (also referred to as a training set). An example dictionary 2150 is provided in
For each of a plurality of heights and tilt angles, the process 2700 constructs (2720) a dictionary entry that corresponds to the camera system 118 having the respective height and tilt angle above a floor. The respective dictionary entry includes (2722) respective IR light intensity values for pixels in images corresponding to activating individually each of the distinct subsets of the IR illuminators. For example, in some implementations with 15,000 pixels and four subsets of illuminators, each dictionary entry has a light intensity value for each of the 60,000 pixel/subset combinations plus the height and tilt angle (e.g., a vector with 60,002 entries). In some implementations, the dictionary entries only include pixels that correspond to the simulated floor. For example, if there are 15,000 pixels for the entire sensor array, the simulated floor may occupy 3000 pixels, thus creating dictionary entries with 12,002 components (12,000 components corresponding to the pixel/subset combinations, and two components for the height and tilt angle). Some implementations have about 100 dictionary entries (e.g., with height values of 0.0 meters, 0.3 m, 0.6 m, . . . , and tilt angles of −40°, −30°, −20°, . . . ). Some implementations include more entries to provide greater accuracy (e.g., height values every 0.1 meter and angles every 5 degrees).
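As a non-limiting illustration, the size of a dictionary entry and the construction of a dictionary over a grid of heights and tilt angles can be sketched as follows. The `simulate` callable is a hypothetical placeholder for whatever produces the light intensity values for a given height and tilt angle; it is not defined by the described implementations.

```python
from typing import Callable, List, Sequence


def entry_length(num_pixels: int, num_subsets: int) -> int:
    """One intensity per pixel/subset combination, plus height and tilt."""
    return num_pixels * num_subsets + 2


def build_dictionary(heights: Sequence[float],
                     tilts: Sequence[float],
                     simulate: Callable[[float, float], Sequence[float]]
                     ) -> List[List[float]]:
    """One entry per (height, tilt) pair: intensities, then height, then tilt."""
    entries = []
    for height in heights:
        for tilt in tilts:
            entries.append(list(simulate(height, tilt)) + [height, tilt])
    return entries
```

For example, `entry_length(15000, 4)` gives 60,002 and `entry_length(3000, 4)` gives 12,002, matching the figures above. Ten heights and ten tilt angles yield 100 entries.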
In some implementations, the constructed dictionary entries are (2723) based on simulating the camera, the floor, and the images, and computing expected IR light intensity values for pixels in the simulated images. In some implementations, each expected IR light intensity value is (2724) based on characteristics of the IR illuminators. As noted previously, the characteristics may include (2724) one or more of: lux, orientation of the IR illuminators relative to the array of image sensors, and location of the IR illuminators relative to the array of image sensors. In some implementations, a respective dictionary entry for a respective height and respective tilt angle is (2725) based on measuring IR light intensity values of actual images captured by the camera having the respective height and respective tilt angle with respect to an actual floor.
In some implementations, the process 2700 normalizes (2726) each of the dictionary entries. In some implementations, this accounts for different surface reflectivity. In some implementations, the process 2700 normalizes (2728) each dictionary entry by determining (2728) a respective total magnitude of the light intensity features in the respective dictionary entry and dividing (2728) each component of the respective dictionary entry by the respective total magnitude. For example, with a dictionary entry having 12,002 elements, compute the total magnitude of the first 12,000 entries (corresponding to light intensity at pixels) and divide each of those 12,000 entries by the total magnitude. If the light intensity features are labeled x1, x2, . . . , x12000, then in some implementations the total magnitude is
In some implementations, the dictionary entries are constructed at a computing device that is distinct from the camera system, then downloaded (2730) to the camera system from the computing device during an initialization process. In some implementations, the subsequent determination of height and tilt angle is calculated at the camera system 118, even when the building of the dictionary is performed at a separate computing device (e.g., a scene understanding server 900).
For each of the plurality of distinct subsets of the IR illuminators, the process 2700 receives (2732) a captured IR image of a scene taken by the array of image sensors while the respective subset of the IR illuminators are emitting IR light and the IR illuminators not in the respective subset are not emitting IR light. In some implementations, the process 2700 receives (2734) a baseline IR image of the scene captured by the array of image sensors while none of the IR illuminators are emitting IR light, and subtracts (2736) a light intensity at each pixel of the baseline IR image from the light intensity at the corresponding pixel of each of the other captured IR images. This can provide a better estimate of the light intensity due to the IR illuminators.
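As a non-limiting illustration, subtracting the baseline IR image from each captured IR image can be sketched as follows. Clamping negative differences to zero is an assumption of this sketch, as are the function names.

```python
from typing import List, Sequence

Image = Sequence[Sequence[float]]


def subtract_baseline(image: Image, baseline: Image) -> List[List[float]]:
    """Per-pixel difference between a captured image and the baseline."""
    return [[max(p - b, 0) for p, b in zip(row, base_row)]
            for row, base_row in zip(image, baseline)]


def subtract_from_all(images: Sequence[Image], baseline: Image) -> List[List[List[float]]]:
    """Apply the baseline subtraction to every captured image."""
    return [subtract_baseline(image, baseline) for image in images]
```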
The process uses (2738) at least one of the captured IR images to identify a floor region corresponding to a floor in the scene. Some implementations use the techniques illustrated above in
The process 2700 then forms (2746) a feature vector including pixels from the captured IR images in the identified floor region. This is illustrated in
The process then estimates (2748) a camera height and camera tilt angle relative to the floor by comparing (2748) the feature vector to the dictionary entries. In some implementations, the process 2700 normalizes (2750) the feature vector and the dictionary entries prior to computing the distances.
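As a non-limiting illustration, the comparison of a feature vector to the dictionary entries can be sketched as follows. The sketch assumes that each vector is normalized by its Euclidean norm and that each dictionary entry is stored as a (height, tilt, intensities) tuple whose intensities already cover only the floor-region pixels. Both are assumptions of the sketch, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[float, float, Sequence[float]]


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (unchanged if all zero)."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    if magnitude == 0:
        return list(vector)
    return [x / magnitude for x in vector]


def nearest_entry(feature: Sequence[float],
                  dictionary: Sequence[Entry]) -> Tuple[float, float]:
    """Return the (height, tilt) of the closest dictionary entry."""
    unit_feature = normalize(feature)
    best = None
    best_distance = math.inf
    for height, tilt, intensities in dictionary:
        distance = math.dist(unit_feature, normalize(intensities))
        if distance < best_distance:
            best_distance = distance
            best = (height, tilt)
    if best is None:
        raise ValueError("dictionary is empty")
    return best
```

Because both vectors are normalized, a floor that reflects twice as much light overall still matches the same entry.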
In some implementations, the process 2700 computes (2752) a respective distance between the feature vector and respective dictionary entries, and selects (2756) a first dictionary entry whose corresponding computed distance is less than the other computed distances. In some implementations, computing the distance between a feature vector and respective dictionary entries comprises (2754) computing a Euclidean distance that uses only vector components corresponding to pixels in the identified floor region. This is illustrated in
Some implementations expand or modify this basic process in various ways. In some implementations, the process 2700 identifies a ceiling rather than a floor, and measures the “height” and tilt angle relative to the ceiling. As noted above in
As noted above, the data for the dictionary entries can be constructed by simulation or by experiments with an actual camera. When formed by experimentation, some implementations capture a baseline image for each camera position, and subtract the baseline from the other captured images with each of the subsets of illuminators activated. Alternatively, the experiments are performed in a room with no ambient light so that each captured image represents only light coming originally from the activated illuminators. The size of the dictionary can be selected based on the desired accuracy.
In some instances, multiple “floor” regions are identified. In some of these instances, the multiple regions are different portions of the same floor. In other instances, one or more of the regions may be tables and one or more regions may be an actual floor. Some implementations estimate the height and tilt angle based on each of the identified regions, then compare the multiple results. If they are all approximately the same, some implementations estimate the height and tilt based on all of them (e.g., by averaging the values, taking the values associated with the largest region, or choosing the first one). When the heights are substantially different, some implementations take the larger estimate, guessing that the smaller height estimate is based on a table or other planar object above the floor. Note that the process produces only an estimate. If the camera is sitting on a table and the floor is not in the field of vision of the camera, the estimated height will be the height above the table.
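As a non-limiting illustration, reconciling the estimates from multiple candidate floor regions can be sketched as follows. The sketch averages the estimates when the heights agree and otherwise keeps the estimate with the largest height. The tolerance of 0.2 meters is an arbitrary value chosen for the sketch, and the function name is hypothetical.

```python
from typing import Sequence, Tuple

Estimate = Tuple[float, float]  # (height in meters, tilt angle in degrees)


def combine_estimates(estimates: Sequence[Estimate],
                      tolerance: float = 0.2) -> Estimate:
    """Average agreeing estimates; otherwise keep the largest height."""
    heights = [height for height, _ in estimates]
    if max(heights) - min(heights) <= tolerance:
        count = len(estimates)
        return (sum(heights) / count,
                sum(tilt for _, tilt in estimates) / count)
    # A much smaller height likely comes from a table or similar surface.
    return max(estimates, key=lambda estimate: estimate[0])
```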
Some implementations use interpolation to provide a finer estimate. For example, in some instances the feature vector has equally small distances from two dictionary entries. In some implementations, the estimated height and tilt angle are based on averaging these two closest entries. In some implementations, finding the matching dictionary entry uses a nearest neighbor algorithm. In some implementations, only the single nearest neighbor is used. In some implementations, the k nearest neighbors are used for a fixed small positive integer k, and a weighted average of these neighbors is used to compute the height and tilt angle of the camera. For example, in some implementations, the k nearest entries are selected, and each is weighted based on the inverse of its distance from the feature vector.
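As a non-limiting illustration, the k-nearest-neighbor estimate with inverse-distance weighting can be sketched as follows. The sketch assumes each dictionary entry is stored as a (height, tilt, intensities) tuple and that the vectors have already been normalized. The small constant `eps`, which avoids division by zero on an exact match, is an assumption of the sketch. The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Entry = Tuple[float, float, Sequence[float]]


def knn_estimate(feature: Sequence[float],
                 dictionary: Sequence[Entry],
                 k: int = 3,
                 eps: float = 1e-9) -> Tuple[float, float]:
    """Weighted average of the height and tilt of the k closest entries."""
    scored = sorted((math.dist(feature, vector), height, tilt)
                    for height, tilt, vector in dictionary)[:k]
    # Closer entries receive larger weights.
    weights = [1.0 / (distance + eps) for distance, _, _ in scored]
    total = sum(weights)
    height = sum(w * h for w, (_, h, _) in zip(weights, scored)) / total
    tilt = sum(w * t for w, (_, _, t) in zip(weights, scored)) / total
    return height, tilt
```

When the feature vector is equally distant from two entries and k is 2, the result is the plain average of those two entries.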
In the data acquisition phase 2802, the camera 118 captures (2806) IR images while controlling which IR illuminators are on. In some implementations, the images are captured at night, and the capture may occur multiple times each night (e.g., every hour). In some implementations, the camera 118 receives a command from the video server system 508 or scene understanding server 900 to collect the images. Before taking the images, the camera typically locks auto exposure so that all of the captured images are taken with the same parameter settings.
For cameras with substantial processing power and memory, subsequent processing may be performed at the camera. However, the data is commonly transmitted to a separate server for the data processing phase 2804, which commonly occurs at a video server system 508 or a scene understanding server 900. In some implementations, the data is transmitted from the camera to an external computing device in a native format (e.g., five IR images). In some implementations, some processing occurs on the camera before it is transmitted. For example, in some implementations, the images are downsampled at the camera, which reduces the amount of data transmitted. In some implementations, the captured background image is subtracted from the other images, so the data transmitted corresponds to light from the IR illuminators, and the background light is already canceled out. In some implementations, the data is transmitted as a single long array of data, such as the feature vector 2178 in
In some implementations, the scene understanding server 900 includes a depth mapping module 878, which computes (2808) a 3-D depth map of the scene in the field of vision of the camera. Constructing a depth map is described above with respect to
The second process 2904 identifies large planar regions, such as floors, walls, and ceilings. This process is described above with respect to
The third process 2906 performs zone correction, as described above with respect to
The fourth process 2908 identifies specular regions in a scene, which generally correspond to windows, televisions, or sliding glass doors. This process is described above with respect to
The information provided by the scene understanding server can be used in various ways to reduce false motion alerts. For example, an identified specular region (identified as a possible window), may be a television set. In some implementations, a rectangular specular region that includes lots of motion is identified as a probable television. When a television is identified, “movement” within the television region that would otherwise create a false motion alert can be avoided. In some implementations, false motion alerts from ceilings can be avoided as well. Typically, “motion” on a ceiling is caused by lights, such as headlights from cars, and should not trigger a motion alert.
Some implementations are able to identify other characteristics of the camera location as well. For example, some implementations determine whether the camera is inside or outside (e.g., based on the presence of a ceiling). When a camera is inside, some implementations determine whether the room is a small room or a large room. These characteristics can help determine when to create motion alerts. For example, when a camera is outside, there are many regions where motion would be expected (e.g., plants or trees swaying in the wind). Therefore, motion detection may be limited to very specific areas and/or set at a high threshold for what triggers a motion event. In some implementations, the information about the camera environment (e.g., floors and windows) is used to make recommendations on where to place the camera and/or to recommend zones for more detailed monitoring. For example, in
As illustrated in
Similarly, a group of cells including the cell 3006-3 is encoded with a “W,” indicating that the cells are part of a probable window. The region 3010 includes these cells. Also, on the left is a group 3012 of cells that includes the cell 3006-4, which is identified as a probable wall. In some implementations, an individual cell can be labeled with at most one object type, but in other implementations, each cell can have two or more designations. For example, the dark region 1822 in
Although the grid 3002 in
Some implementations provide zone correction, as illustrated in
In situations in which the systems discussed here collect personal information about users, or may make use of personal information, the users may be provided with an opportunity to control whether programs or features collect user information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location), or to control whether and/or how to receive content from the content server that may be more relevant to the user. In addition, certain data may be treated in one or more ways before it is stored or used, so that Personally Identifiable Information (“PII”) is removed. For example, a user's identity may be treated so that no PII can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined. Thus, the user may have control over how information is collected about the user and used by a content server.
It is to be appreciated that one or more implementations disclosed hereinabove is particularly advantageous for application in the home monitoring context, for which there are particular combinations of desirable goals including low cost hardware, very low device power (especially for battery-only devices), low device heating, nonintrusive device operation, ease of device installation and configuration, tolerance to intermittent network connectivity, low-maintenance or maintenance-free device operation, long device lifetimes, the ability to operate in a variety of different lighting conditions, and so forth, the home monitoring context further involving particular sets of expected target characteristics and/or constraints for which the preferred implementations may be particularly effective, such as the statistically prominent presence of certain target types (humans, pets, houseplants, ceilings, floors, furniture, doors, windows, household fixtures, various household items, etc.), the fact that the monitoring device is usually stationary relative to the monitored space, the fact that certain target types have certain expected ranges of sizes and characteristics (e.g., humans and pets have certain sizes and any movement is usually parallel to a floor or stairway; floors-ceilings-walls are also usually of certain size or height ranges and are stationary; doors-windows rotate or slide within expected ranges; furniture is usually stationary and has certain expected sizes), and so forth. However, it is to be appreciated that the scope of the present teachings is not so limited, with other implementations being applicable for the monitoring of other types of structures (e.g., multi-unit apartment buildings, hotels, retail stores, office buildings, industrial buildings) and/or to the monitoring of any other indoor or outdoor facility or space. 
It is to be still further appreciated that, while facility or space monitoring represents one particular advantageous application, the scope of the present teachings can further be applicable to any field in which automated machine characterizations of stationary or moving objects, facilities, environments, persons, animals, or vessels, are desired based on optical, ultraviolet, or infrared electromagnetic reflection or emission characteristics.
It will also be understood that, although the terms first, second, etc. are, in some instances, used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first user interface could be termed a second user interface, and, similarly, a second user interface could be termed a first user interface, without departing from the scope of the various described implementations. The first user interface and the second user interface are both user interfaces, but they are not the same user interface.
The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Although some of the various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages could be implemented in hardware, firmware, software or any combination thereof.
The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the scope of the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen in order to best explain the principles underlying the claims and their practical applications, to thereby enable others skilled in the art to best use the implementations with various modifications as are suited to the particular uses contemplated.
Heitz, III, George Alban, Shin, Dongeek
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jun 12 2015 | Google Inc. | (assignment on the face of the patent) | / |