Implementations provide for use of spherical random features for polynomial kernels and large-scale learning. An example method includes receiving a polynomial kernel, approximating the polynomial kernel by generating a nonlinear randomized feature map, and storing the nonlinear feature map. Generating the nonlinear randomized feature map includes determining optimal coefficient values and standard deviation values for the polynomial kernel, determining an optimal probability distribution of vector values for the polynomial kernel based on a sum of gaussian kernels that use the optimal coefficient values, selecting a sample of the vectors, and determining the nonlinear randomized feature map using the sampled vectors. Another example method includes normalizing a first feature vector for a data item, transforming the first feature vector into a second feature vector using a feature map that approximates a polynomial kernel with an explicit nonlinear feature map, and providing the second feature vector to a support vector machine.
|
1. A method comprising:
ℓ2-normalizing a first feature vector for an image, where the first feature vector represents pixels in the image;
transforming the first feature vector into a second feature vector using a feature map that approximates a polynomial kernel as a Fourier transform of a combination of Gaussians; and
using the second feature vector as input to a support vector machine for inference or training.
14. A system comprising:
at least one processor;
a memory storing a feature map that approximates a polynomial kernel as a positive projection of a sum of Gaussians;
a database of images, each image being represented by a nonlinear approximation of an input vector for the image, the nonlinear approximation being generated by applying the feature map to the input vector; and
instructions that, when executed by the at least one processor, cause the system to perform operations including:
receiving a query image represented as a query vector,
ℓ2-normalizing the query vector,
applying the feature map to the normalized query vector to generate a nonlinear approximation of the query vector,
identifying data items responsive to the query image using dot product similarity between the nonlinear approximation of the query vector and the nonlinear approximation of the data items, and
providing the data items responsive to the query image as a query result.
2. The method of
where c_i represents optimal coefficient values, σ_i represents optimal standard deviation values, N represents the number of Gaussians in the combination, w represents a vector, and the first feature vector has a dimension of d.
3. The method of
4. The method of
5. The method of
where K(z) is the polynomial kernel, K̂(z) is the approximation, and z is the variable of the polynomial kernel.
6. The method of
8. The method of
10. The method of
11. The method of
12. The method of
13. The method of
16. The system of
17. The system of
18. The system of
19. The system of
20. The system of
|
This application is a continuation of, and claims priority to, U.S. application Ser. No. 14/968,293, filed Dec. 14, 2015, the disclosure of which is incorporated herein by reference in its entirety.
Many systems use large-scale machine learning to accomplish challenging problems such as speech recognition, computer vision, image and sound file searching and categorization, etc. Deep learning of multi-layer neural networks is an effective large-scale approach. Kernel methods, e.g., Gaussian and polynomial kernels, have also been used on smaller-scale problems, but scaling kernel methods has proven challenging.
Implementations provide a kernel approximation method that is compact, fast, and accurate for polynomial kernels. The method generates nonlinear features for polynomial kernels applied to data on the unit sphere. It approximates the Fourier transform of kernel functions as the positive projection of an indefinite combination of Gaussians and achieves more compact maps compared to previous approaches, especially for higher-order polynomials. The approximation method, also referred to as spherical random Fourier (SRF) features, can be applied to any shift-invariant radial kernel function, whether positive definite or not.
According to one general aspect, a method for generating input for a kernel-based machine learning system includes receiving a polynomial kernel, approximating the polynomial kernel by generating a nonlinear randomized feature map, and storing the nonlinear feature map. Generating the nonlinear randomized feature map includes determining optimal coefficient values and standard deviation values for the polynomial kernel, determining an optimal probability distribution of vector values p(w) for the polynomial kernel based on a sum of N Gaussian kernels that use the optimal coefficient values, selecting a sample of the vectors, and determining the nonlinear randomized feature map using the sample of the vectors. The method may also include generating a vector for a data item in a data source using the nonlinear feature map and providing the vector to the kernel-based machine learning system.
According to one aspect, a computing system includes at least one processor and memory storing instructions that, when executed by the at least one processor, cause the computing system to perform operations. The operations may include generating an approximation of a polynomial kernel as a sum of Gaussian kernels and storing the sample of the vector values as a nonlinear randomized feature map. Generating the approximation of the polynomial kernel as the sum of Gaussian kernels includes limiting the variable of the approximation to [0,2], determining optimal coefficient values for the approximation by determining coefficient values that minimize the difference between the polynomial kernel and the approximation, determining an optimal probability distribution of vector values for the approximation based on the optimal coefficient values, and selecting a sample of the vector values. The operations may also include generating input vectors for a kernel-based machine learning system using the nonlinear randomized feature map and training the machine learning system using the input vectors.
According to one aspect, a method includes normalizing a first feature vector for a data item, transforming the first feature vector into a second feature vector using a feature map that approximates a polynomial kernel with an explicit nonlinear feature map, and providing the second feature vector to a support vector machine for use as a training example.
In one general aspect, a computer program product embodied on a computer-readable storage device includes instructions that, when executed by at least one processor formed in a substrate, cause a computing device to perform any of the disclosed methods, operations, or processes. Another general aspect includes a system and/or a method for approximating a Fourier transform of a polynomial kernel function, substantially as shown in and/or described in connection with at least one of the figures, and as set forth more completely in the claims.
One or more of the implementations of the subject matter described herein can be implemented so as to realize one or more of the following advantages. As one example, implementations provide a scalable, non-linear version of features extracted from a data item that give high accuracy for a given task. The features generated using the described subject matter are less rank-deficient, more compact, and achieve better kernel approximation, especially for higher-order polynomials. The resulting predictions made using the SRF features have lower variance and yield better classification accuracy. As another example, the system provides an analytical bound for the SRF approximation paradigm, proving the approximation does not have an adverse effect on performance, especially for large polynomial orders. As another example, the disclosed approximation method reduces model training time, testing time, and memory requirements. As another example, implementations show less feature redundancy, leading to lower kernel approximation error, and more stable performance due to reduced variance.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
The large-scale learning system 100 may be a computing device or devices that take the form of a number of different devices, for example a standard server, a group of such servers, or a rack server system, such as server 110. In addition, system 100 may be implemented in a personal computer, for example a laptop computer. The server 110 may be an example of computer device 600, as depicted in
Although not shown in
The modules may include a spherical random feature engine 126 and a machine learning engine 120. The spherical random feature engine 126 may use feature vectors extracted from data items 130 and generate a randomized feature map 136 that produces an approximation of the features, e.g., via a polynomial kernel. A feature vector may be thought of as an array of floating point numbers with a dimensionality of d, or in other words an array with d positions. The data items 130 may be a database, for example of files or search items. For instance, the data items 130 may be any kind of file, such as documents, images, sound files, video files, etc., and the feature vectors may be extracted from the file. The data items 130 may also be database records and the features may be extracted from data related to an item in the database. The system 100 may use a machine learning engine 120 to perform image searches, speech recognition, etc., on the data items 130. The system 100 may use conventional methods to extract the vectors from the data items 130 or may be provided the extracted feature vectors. As some examples, the extracted feature vector may be pixels from an image file in the data items 130 or speech waveforms.
Kernel methods, such as nonlinear support vector machines (SVMs), provide a powerful framework for nonlinear learning, but they come with significant computational costs. Their training complexity varies from O(n^2) to O(n^3), which becomes prohibitive when the number of training examples n becomes large (e.g., in the millions). Furthermore, the number of support vectors increases linearly with the size of the training data. This slows prediction as well, which has an O(nd) complexity with d-dimensional vectors. Explicit kernel maps are an alternative for large-scale learning because they rely on properties of linear SVMs, which can be trained in O(n) time and applied in O(d) time. With explicit kernel maps, the idea is to determine an explicit nonlinear feature map F(·) such that K(x,y)≈(F(x)·F(y)), where x and y are vectors in the input space (i.e., feature vectors from data items) and F(x) produces a vector that is a nonlinear version of x that gives high accuracy for a given task. One solution for performing this mapping of x to F(x) for Gaussian kernels can be expressed by

$$F(x) = \sqrt{\frac{2}{D}}\,\bigl[\cos(w_1^{\top}x + b_1),\ \ldots,\ \cos(w_D^{\top}x + b_D)\bigr]^{\top} \qquad \text{(Equation 1)}$$

where each b_j is a random shift and D is the dimension of the new feature map F(x).
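As an illustration of the cosine feature map described above, the following is a minimal sketch for the Gaussian kernel exp(−∥x − y∥^2/(2σ^2)), assuming the standard random Fourier feature construction in which each w_j is drawn from a normal distribution with standard deviation 1/σ and each b_j is uniform on [0, 2π). The function names and the parameter values (d, D, sigma) are hypothetical and chosen only for the example.

```python
import numpy as np

def gaussian_rff_map(d, D, sigma, rng):
    """Draw the random parameters (W, b) of a random Fourier feature map.

    Assumption: the target is the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)),
    for which the frequencies follow N(0, I / sigma^2).
    """
    W = rng.normal(0.0, 1.0 / sigma, size=(D, d))  # one frequency vector per row
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)      # random shifts
    return W, b

def apply_feature_map(X, W, b):
    """F(x) = sqrt(2 / D) * cos(W x + b), applied to each row of X."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, sigma = 8, 20000, 1.5                        # hypothetical sizes
X = rng.normal(size=(2, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
W, b = gaussian_rff_map(d, D, sigma, rng)
Fx, Fy = apply_feature_map(X, W, b)

approx = float(Fx @ Fy)                            # F(x) . F(y)
exact = float(np.exp(-np.sum((X[0] - X[1]) ** 2) / (2.0 * sigma ** 2)))
print(approx, exact)
```

The dot product of the two D-dimensional features is a Monte Carlo estimate of the kernel value, so the gap between `approx` and `exact` shrinks as D grows.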
The problem with using this expression for polynomial kernels is finding the proper values for w that work for a polynomial kernel, where w represents vectors from some distribution. Polynomial kernels are expressed as K(x, y) = (⟨x, y⟩ + q)^p, where ⟨x, y⟩ is the dot product of two input vectors x and y, q is the bias, and p is the degree of the polynomial. The bias is a parameter that trades off the influence of higher-order versus lower-order terms in the polynomial. Approximating polynomial kernels with explicit nonlinear maps is challenging for several reasons. Polynomial kernels conventionally need high-dimensional mappings and do not scale well to higher-degree polynomials. Moreover, there are some assumptions built into the Gaussian kernel that do not hold true for polynomial kernels.
Approximation for other types of kernels (e.g., Gaussian kernels) has been accomplished with Bochner's theorem. Bochner's theorem applies to kernels that are shift-invariant (i.e., K(x, y) = K(z), where z is the distance between vectors x and y) and for which K(z) is a positive definite function on ℝ^d. But Bochner's theorem cannot be applied to polynomial kernels because polynomial kernels do not satisfy the positive-definiteness prerequisite of the theorem.
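For reference, the standard statement of Bochner's theorem and the random-feature identity that follows from it can be written as below. This is a sketch under the assumption that the kernel is real-valued and normalized so that K(0) = 1, with p(w) denoting its normalized Fourier transform and b a shift drawn uniformly from [0, 2π].

$$K(x - y) = \int_{\mathbb{R}^{d}} p(w)\, e^{\,i\, w^{\top}(x - y)}\, dw = \mathbb{E}_{w \sim p,\ b \sim U[0, 2\pi]}\bigl[\,2\cos(w^{\top}x + b)\cos(w^{\top}y + b)\,\bigr]$$

The theorem guarantees that p(w) is a non-negative density exactly when K is positive definite on ℝ^d, which is what allows w to be sampled from p(w). Averaging the bracketed term over D samples gives a dot product of two D-dimensional cosine features.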
The spherical random feature engine 126 approximates a special case of polynomial kernels, where the input data has been ℓ2-normalized. In other words, the input vectors have been normalized to unit ℓ2 norm, which ensures the polynomial kernel is bounded. Put another way, the input vector x may be normalized so that the sum of the squares of the floating point values equals 1. In some implementations the normalized input vectors may be provided to the spherical random feature engine 126 and in some implementations the spherical random feature engine 126 may perform the normalization. With input normalized, in some implementations, the spherical random feature engine 126 approximates the polynomial kernel defined on S^(d−1)×S^(d−1) as

$$K(x, y) = \left(1 - \frac{\lVert x - y\rVert^{2}}{a^{2}}\right)^{p} = \alpha\,\bigl(q + \langle x, y\rangle\bigr)^{p} \qquad \text{(Equation 2)}$$

with

$$q = \frac{a^{2}}{2} - 1, \qquad \alpha = \left(\frac{2}{a^{2}}\right)^{p}.$$
In this equation, x and y are the input vectors, q is the bias, p is the degree of the polynomial, and a and α are scaling constants. The kernel K(x,y) is a shift-invariant radial function of the single variable z = ∥x − y∥, so it can be written as K(x,y) = K(z). The Fourier transform of K(z) is not a non-negative function, so a straightforward application of Bochner's theorem to produce Random Fourier Features is impossible. Because

$$z = \lVert x - y\rVert = \sqrt{2 - 2\langle x, y\rangle} \le 2$$

for unit-norm inputs, the behavior of K(z) for z>2 is undefined and arbitrary. A Fourier transform requires an integration over all values of z; therefore, the spherical random feature engine 126 may map K(z) to 0 where z is greater than 2, thus limiting the range over which the system calculates the approximation for K(z) to [0,2].
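The normalization step and the two equivalent forms of the kernel on the unit sphere can be checked numerically. The sketch below assumes the relations q = a^2/2 − 1 and α = (2/a^2)^p, which follow from ∥x − y∥^2 = 2 − 2⟨x, y⟩ for unit-norm vectors. The function names and the values of a, p, and the dimension are hypothetical.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row of X to unit l2 norm."""
    norms = np.linalg.norm(X, axis=-1, keepdims=True)
    return X / np.maximum(norms, eps)

def radial_poly_kernel(x, y, a, p):
    """K(x, y) = (1 - ||x - y||^2 / a^2)^p for unit-norm x and y."""
    z2 = np.sum((x - y) ** 2)
    return (1.0 - z2 / a ** 2) ** p

def biased_poly_kernel(x, y, a, p):
    """Equivalent form alpha * (q + <x, y>)^p with q = a^2/2 - 1, alpha = (2/a^2)^p."""
    q = a ** 2 / 2.0 - 1.0
    alpha = (2.0 / a ** 2) ** p
    return alpha * (q + x @ y) ** p

rng = np.random.default_rng(0)
x, y = l2_normalize(rng.normal(size=(2, 16)))  # two unit vectors in R^16
a, p = 2.0, 10                                 # hypothetical kernel parameters

z = float(np.linalg.norm(x - y))               # distance on the sphere, at most 2
k_radial = float(radial_poly_kernel(x, y, a, p))
k_biased = float(biased_poly_kernel(x, y, a, p))
print(z, k_radial, k_biased)
```

Because z never exceeds 2 for normalized inputs, only the values of K(z) on [0,2] affect the result.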
However, it is impossible for the system to construct a positive integrable k̂(w) whose inverse Fourier transform K̂(z) equals K(z) exactly on [0,2]. Rather, the spherical random feature engine 126 finds an inverse Fourier transform K̂(z) that is a good approximation of K(z) on [0,2], which is sufficient because the system approximates the inverse Fourier transform K̂(z) by Monte Carlo integration. The spherical random feature engine 126 approximates K(z) as a series of N Gaussians (e.g., Σ_{i=1}^{N} c_i e^(−σ_i^2 z^2)) and takes the positive projection of the Fourier transform of that series, which may be represented as

$$\hat{k}(w) = \max\!\left(0,\ \sum_{i=1}^{N} c_i \left(\frac{1}{\sqrt{2}\,\sigma_i}\right)^{d} e^{-w^{2}/(4\sigma_i^{2})}\right) \qquad \text{(Equation 3)}$$

where N is the number of Gaussians (e.g., 10), c_i represent coefficient values, σ_i are standard deviation values, and e is Euler's number. k̂(w) may also be referred to as the Fourier transform of the approximate kernel function.
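A minimal sketch of evaluating the positive projection k̂(w) follows. It assumes the Gaussians are parameterized as c_i e^(−σ_i^2 z^2), whose d-dimensional Fourier transform under the convention p(w) = (2π)^(−d/2) k̂(w) is c_i (√2 σ_i)^(−d) e^(−w^2/(4σ_i^2)). The function name and the example coefficients are hypothetical and are not optimized values.

```python
import numpy as np

def k_hat(w, c, sigma, d):
    """Positive projection of the Fourier transform of sum_i c_i exp(-sigma_i^2 z^2).

    w     : radial frequencies |w| (scalar or 1-D array)
    c     : coefficients c_i, which may be negative
    sigma : Gaussian parameters sigma_i (positive)
    d     : dimension of the input feature vectors
    """
    w = np.atleast_1d(np.asarray(w, dtype=float))[:, None]
    c = np.asarray(c, dtype=float)[None, :]
    sigma = np.asarray(sigma, dtype=float)[None, :]
    terms = c * (np.sqrt(2.0) * sigma) ** (-d) * np.exp(-w ** 2 / (4.0 * sigma ** 2))
    return np.maximum(0.0, terms.sum(axis=1))  # negative values are mapped to zero

# Hypothetical indefinite combination of two Gaussians in d = 4 dimensions.
w_grid = np.linspace(0.0, 10.0, 101)
vals = k_hat(w_grid, c=[1.5, -0.5], sigma=[1.0, 2.0], d=4)
print(vals[0], vals[60])
```

With these example coefficients the unprojected sum turns negative at larger |w| (for instance at w = 6), and the projection clips it to zero there.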
The spherical random feature engine 126 may determine the coefficient values and standard deviation values by optimizing the mean squared error between K̂(z) and K(z) given a polynomial kernel K(x,y) = K(z), where z = ∥x − y∥_2, ∥x∥_2 = 1, and ∥y∥_2 = 1. The polynomial kernel K(x,y) is parameterized by a scaling constant a≥2 and an order p≥1. The scaling constant and order define the polynomial kernel. The input feature vectors x and y may have a dimensionality of d (e.g., the dimensionality of the feature vectors from data items 130). Put another way, the spherical random feature engine 126 may solve

$$\underset{\hat{K}}{\operatorname{argmin}} \int_{0}^{2} dz\,\bigl[K(z) - \hat{K}(z)\bigr]^{2} \qquad \text{(Equation 4)}$$
where K̂(z) is the inverse Fourier transform of k̂(w), which is represented by Equation 3. In other words, the spherical random feature engine 126 minimizes the integral of Equation 4 in order to obtain optimal coefficient values c_i and standard deviation values σ_i. With the optimal coefficient values and standard deviation values (i.e., c_i and σ_i) identified, the spherical random feature engine 126 may use them to determine a probability distribution p(w), using the relation p(w) = (2π)^(−d/2) k̂(w) and Equation 3. The spherical random feature engine 126 may sample D vector values w from the probability distribution p(w). D represents the number of dimensions in the approximated feature vector (e.g., F(x)) and can be adjusted to find a balance between result quality and computation time. For example, the larger D is, the better the results will be, but the longer it will take to compute them. Thus, D may be considered a parameter that an administrator can adjust to achieve a desired balance between cost and result quality.
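The following is a deliberately simplified sketch of the fitting step, not the optimization described above. It fixes the σ_i on a hypothetical grid and solves only for the coefficients c_i by linear least squares against K(z) = (1 − z^2/a^2)^p on [0,2], and it omits the positive projection of the Fourier transform. It is meant only to show that an indefinite sum of a few Gaussians can track the kernel closely on [0,2]. All names and parameter values are assumptions made for the example.

```python
import numpy as np

def fit_gaussian_coefficients(a, p, sigmas, num_z=400):
    """Least-squares fit of sum_i c_i exp(-sigma_i^2 z^2) to (1 - z^2/a^2)^p on [0, 2].

    Simplification: the sigma_i are held fixed and the positive projection in
    Fourier space is ignored, so this is a linear problem in c.
    """
    z = np.linspace(0.0, 2.0, num_z)
    target = (1.0 - z ** 2 / a ** 2) ** p
    basis = np.exp(-np.outer(z ** 2, sigmas ** 2))  # shape (num_z, N)
    c, *_ = np.linalg.lstsq(basis, target, rcond=None)
    mse = float(np.mean((target - basis @ c) ** 2))
    return c, mse

a, p = 4.0, 10                                  # hypothetical kernel parameters
sigmas = np.sqrt(np.linspace(0.1, 3.0, 10))     # hypothetical fixed grid, N = 10
c, mse = fit_gaussian_coefficients(a, p, sigmas)
print(mse, bool(np.any(c < 0)))
```

A fuller implementation would also optimize the σ_i with a nonlinear optimizer and evaluate the error on the inverse Fourier transform of the projected k̂(w).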
The spherical random feature engine 126 can use the randomly-selected vectors w to solve Equation 1 given a particular input vector x. Put another way, once the values for vectors w are determined, the spherical random feature engine 126 may use the values of w in Equation 1 to determine F(x), i.e., a non-linear approximation of the input vector x. In other words, the spherical random feature engine 126 determines the values for w that enable the system to generate the randomized feature map F(·) (i.e., feature map 136) such that K(x,y)≈(F(x)·F(y)). Accordingly, the system may store the optimal values of w as part of the spherical randomized feature map 136.
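The sampling step can be sketched as follows. Because k̂(w) depends only on the length of w, one way to draw from p(w) is to sample the radius from a density proportional to r^(d−1) k̂(r) by inverse-CDF lookup on a grid and to sample the direction uniformly on the sphere. This particular scheme is an assumption of the sketch, as are the function names and grid sizes. For a check with a known answer, the example uses a single Gaussian with c = 1 and σ = 1 in d = 4, for which K̂(z) = e^(−z^2).

```python
import numpy as np

def sample_frequencies(k_hat_radial, d, D, rng, r_max=30.0, grid=4096):
    """Draw D vectors w in R^d from a radial density proportional to k_hat(|w|)."""
    r = np.linspace(0.0, r_max, grid)
    radial_density = r ** (d - 1) * k_hat_radial(r)  # density of the radius |w|
    cdf = np.cumsum(radial_density)
    cdf /= cdf[-1]
    radii = np.interp(rng.uniform(size=D), cdf, r)   # inverse-CDF sampling
    directions = rng.normal(size=(D, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return radii[:, None] * directions

def feature_map(X, W, b):
    """F(x) = sqrt(2 / D) * cos(W x + b) for each row of X."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

d, D = 4, 20000                                      # hypothetical sizes
rng = np.random.default_rng(0)

def k_hat_one_gaussian(r):
    # Single Gaussian, c = 1 and sigma = 1: (sqrt(2))^(-4) * exp(-r^2 / 4)
    return 0.25 * np.exp(-r ** 2 / 4.0)

W = sample_frequencies(k_hat_one_gaussian, d, D, rng)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

X = rng.normal(size=(2, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm inputs
F = feature_map(X, W, b)
approx = float(F[0] @ F[1])
exact = float(np.exp(-np.sum((X[0] - X[1]) ** 2)))
print(approx, exact)
```

For large d the factor r^(d−1) can overflow, so a practical version would work with logarithms of the radial density.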
In some implementations, once the system 100 has determined the values of w that make up the randomized feature map F(·), the system may use the spherical random feature engine 126 to generate data item approximations 134. The data item approximations 134 represent non-linear approximations of input vectors for data items 130. In other words, the data item approximations 134 may be the result of applying the feature map 136 to an input vector x, e.g., the result of F(x) for a particular data item. In some implementations, the system 100 may calculate a nonlinear approximation for each data item in data items 130. This enables the machine learning engine 120 to access the data item approximations 134 for comparison with a query item quickly. In other implementations, the spherical random feature engine 126 may generate the data item approximations 134 in response to a query. The query item is also a data item and the system may use the spherical random feature engine 126 to generate a data item approximation 134 for the query item.
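A minimal sketch of precomputing data item approximations and comparing a query against them by dot product follows. The frequency matrix W is drawn from a standard normal distribution purely as a stand-in for vectors sampled from p(w). The function names, sizes, and the choice of item 7 as the query source are hypothetical.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row (or a single vector) to unit l2 norm."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def feature_map(X, W, b):
    """F(x) = sqrt(2 / D) * cos(W x + b) for each row of X."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

def top_k(query_feature, item_features, k):
    """Rank stored approximations by dot product with the query approximation."""
    scores = item_features @ query_feature
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(0)
d, D, n_items = 32, 2000, 100
W = rng.normal(size=(D, d))                     # stand-in for samples from p(w)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

items = l2_normalize(rng.normal(size=(n_items, d)))
item_features = feature_map(items, W, b)        # computed once and stored

# The query is a slightly perturbed copy of item 7, normalized like the data items.
query = l2_normalize(items[7] + 0.01 * rng.normal(size=d))
query_feature = feature_map(query[None, :], W, b)[0]

order, scores = top_k(query_feature, item_features, k=5)
print(order)
```

Storing `item_features` ahead of time means a query costs one feature-map evaluation and one matrix-vector product.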
The system 100 may also include machine learning engine 120. The machine learning engine 120 may be any type of kernel-based machine-learning system, such as a long short-term memory (LSTM) neural network, a feed-forward neural network, a support vector machine (SVM) classifier, etc., that can make a prediction given the data item approximations 134 as input. For example, the machine learning engine 120 may take as input a data item and may use the feature map 136 to generate a transformation of the data item that is used to provide, as output, a classification for the data item. The data item can be an image and the classification may be a label for the image or a description of something identified in the image. The data item can also be a sound file and the classification may be a word or words recognized in the sound file. In some implementations, the machine learning engine 120 may use dot product similarity between data item approximations to determine the label. These are given as examples only and implementations are not limited to classification of input. The output from the machine learning engine 120 can include other tasks such as clustering, regression analysis, anomaly detection, prediction, etc. The vectors generated using feature map 136 can be used as input to any machine learning problem, whether for training or for inference. When the machine learning engine 120 is in a training mode, input vectors may be positive training examples (i.e., examples of a correct inference) or negative training examples (e.g., examples of an incorrect inference). When the machine learning engine 120 is in an inference mode, the machine learning engine 120 provides a prediction for the input vector. For example, the output of the machine learning engine 120 may be one or more classifications, one or more cluster assignments, the absence or presence of an anomaly, etc., for the data item for which the vector was generated.
The machine learning engine 120 may use any input that can be classified, clustered, or otherwise analyzed.
The server 110 may include or be in communication with a search engine (not shown). For example, the search engine may be configured to use the machine learning engine 120 to identify data items 130 that are responsive to a query, for example provided by client 170, and to provide a search result in response to the query.
Large-scale learning system 100 may be in communication with client(s) 170 over network 160. Clients 170 may allow a user to provide a query to the machine learning engine 120 (e.g., via a search engine) and to receive a corresponding search result. Client 170 may also be used to tune the parameters of the spherical random feature engine 126, such as the dimensionality of the features generated by the feature map 136 and the polynomial kernel parameters (e.g., bias and degree). Network 160 may be, for example, the Internet, or the network 160 can be a wired or wireless local area network (LAN), wide area network (WAN), etc., implemented using, for example, gateway devices, bridges, switches, and/or so forth. Via the network 160, the server 110 may communicate with and transmit data to/from clients 170.
Large-scale learning system 100 represents one example configuration and other configurations are possible. In addition, components of system 100 may be combined or distributed in a manner differently than illustrated. For example, in some implementations one or more of the machine learning engine 120 and the spherical random feature engine 126 may be combined into a single module or engine. In addition, components or features of the machine learning engine 120, the spherical random feature engine 126, or a search engine may be distributed between two or more modules or engines, or even distributed across multiple computing devices.
In other words, the approximation for the polynomial kernel is based on the inverse Fourier transform of a sum of N Gaussians, where any negative Fourier transform values are mapped to zero. The system may optimize the coefficient values by solving

$$\underset{\hat{K}}{\operatorname{argmin}} \int_{0}^{2} dz\,\bigl[K(z) - \hat{K}(z)\bigr]^{2},$$

where K̂(z) is the inverse Fourier transform of Equation 3 and dz refers to the standard mathematical notation defining the integral. The system may evaluate the inverse Fourier transform K̂(z) numerically by performing a one-dimensional numerical integral expressed as

$$\hat{K}(z) = \int_{0}^{\infty} dw\; w\,\hat{k}(w)\left(\frac{w}{z}\right)^{d/2-1} J_{d/2-1}(wz)$$

where z is the distance between two input vectors x and y and J_(d/2−1) is the Bessel function of the first kind of order d/2−1.
The one-dimensional numerical integral may be well approximated using a fixed-width grid in w and z and can be computed using a single matrix multiplication. In determining the optimal coefficient values, the system may optimize the mean squared error between K(z) and its approximation K̂(z). The mean squared error may be represented as
which defines an optimal probability distribution p(w) through Equation 3 and the relation p(w) = (2π)^(−d/2) k̂(w). The upper bound for the error approximating function is
To find the optimal probability distribution of vector values p(w), the system may use the optimal coefficient values and the standard deviation values (220), e.g., by substituting them into Equation 3. Put another way, the system may use the coefficient values and standard deviation values to extract the optimal probability distribution p(w) using Equation 3. The system may select D vector values w from the optimal probability distribution via random sampling (225), where D is a parameter that represents the dimensions in the resulting approximation of the input feature vector (i.e., F(x)). The system may store the selected vector values as the randomized feature map (230). The sampled vector values (e.g., w) are used to determine the explicit mapping, i.e.,

$$F(x) = \sqrt{\frac{2}{D}}\,\bigl[\cos(w_1^{\top}x + b_1),\ \ldots,\ \cos(w_D^{\top}x + b_D)\bigr]^{\top}$$

which is a representation of the spherical randomized feature map and produces the non-linear approximation of the vector x. This non-linear approximation is less rank-deficient, more compact, and achieves better kernel approximation, especially for higher-order polynomials. Process 200 then ends.
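The one-dimensional numerical integral used in process 200 to evaluate K̂(z) can be sketched as a single matrix multiplication on a fixed-width grid. The sketch assumes the radial inverse Fourier transform takes the Hankel form K̂(z) = ∫ dw w k̂(w) (w/z)^(d/2−1) J_(d/2−1)(wz) and uses `scipy.special.jv` for the Bessel function. The grid sizes and function name are hypothetical. The check uses a single Gaussian with c = 1 and σ = 1 in d = 4, for which the result should match e^(−z^2).

```python
import numpy as np
from scipy.special import jv

def inverse_fourier_radial(k_hat_vals, w, z, d):
    """Evaluate K_hat(z) on a z grid from k_hat sampled on a fixed-width w grid.

    Uses a rectangle rule, so the whole integral is one matrix-vector product.
    """
    nu = d / 2.0 - 1.0
    dw = w[1] - w[0]
    # M[j, i] = w_i * (w_i / z_j)^nu * J_nu(w_i * z_j) * dw
    M = w[None, :] * (w[None, :] / z[:, None]) ** nu * jv(nu, np.outer(z, w)) * dw
    return M @ k_hat_vals

d = 4
dw = 0.01
w = dw * np.arange(1, 2001)                  # fixed-width grid on (0, 20]
z = np.linspace(0.05, 2.0, 40)               # distances of interest, within (0, 2]
k_hat_vals = 0.25 * np.exp(-w ** 2 / 4.0)    # single Gaussian: c = 1, sigma = 1

K_hat = inverse_fourier_radial(k_hat_vals, w, z, d)
max_err = float(np.max(np.abs(K_hat - np.exp(-z ** 2))))
print(max_err)
```

The matrix M depends only on the grids and on d, so it can be built once and reused for every candidate set of coefficients during the optimization.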
The system may provide the approximated feature vector as input to a classifier (425). The classifier may have access to a large store of data items. The data items may already have corresponding approximated feature vectors (e.g., approximated data items 134 of
The process of
Computing device 600 includes a processor 602, memory 604, a storage device 606, and expansion ports 610 connected via an interface 608. In some implementations, computing device 600 may include transceiver 646, communication interface 644, and a GPS (Global Positioning System) receiver module 648, among other components, connected via interface 608. Device 600 may communicate wirelessly through communication interface 644, which may include digital signal processing circuitry where necessary. Each of the components 602, 604, 606, 608, 610, 640, 644, 646, and 648 may be mounted on a common motherboard or in other manners as appropriate.
The processor 602 can process instructions for execution within the computing device 600, including instructions stored in the memory 604 or on the storage device 606 to display graphical information for a GUI on an external input/output device, such as display 616. Display 616 may be a monitor or a flat touchscreen display. In some implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 604 stores information within the computing device 600. In one implementation, the memory 604 is a volatile memory unit or units. In another implementation, the memory 604 is a non-volatile memory unit or units. The memory 604 may also be another form of computer-readable medium, such as a magnetic or optical disk. In some implementations, the memory 604 may include expansion memory provided through an expansion interface.
The storage device 606 is capable of providing mass storage for the computing device 600. In one implementation, the storage device 606 may be or include a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in such a computer-readable medium. The computer program product may also include instructions that, when executed, perform one or more methods, such as those described above. The computer- or machine-readable medium is a storage device such as the memory 604, the storage device 606, or memory on processor 602.
The interface 608 may be a high speed controller that manages bandwidth-intensive operations for the computing device 600 or a low speed controller that manages lower bandwidth-intensive operations, or a combination of such controllers. An external interface 640 may be provided so as to enable near area communication of device 600 with other devices. In some implementations, controller 608 may be coupled to storage device 606 and expansion port 614. The expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 630, or multiple times in a group of such servers. It may also be implemented as part of a rack server system. In addition, it may be implemented in a personal computer such as a laptop computer 622, or smart phone 636. An entire system may be made up of multiple computing devices 600 communicating with each other. Other configurations are possible.
Distributed computing system 700 may include any number of computing devices 780. Computing devices 780 may include a server or rack servers, mainframes, etc. communicating over a local or wide-area network, dedicated optical links, modems, bridges, routers, switches, wired or wireless networks, etc.
In some implementations, each computing device may include multiple racks. For example, computing device 780a includes multiple racks 758a-758n. Each rack may include one or more processors, such as processors 752a-752n and 762a-762n. The processors may include data processors, network attached storage devices, and other computer controlled devices. In some implementations, one processor may operate as a master processor and control the scheduling and data distribution tasks. Processors may be interconnected through one or more rack switches 758, and one or more racks may be connected through switch 778. Switch 778 may handle communications between multiple connected computing devices 700.
Each rack may include memory, such as memory 754 and memory 764, and storage, such as 756 and 766. Storage 756 and 766 may provide mass storage and may include volatile or non-volatile storage, such as network-attached disks, floppy disks, hard disks, optical disks, tapes, flash memory or other similar solid state memory devices, or an array of devices, including devices in a storage area network or other configurations. Storage 756 or 766 may be shared between multiple processors, multiple racks, or multiple computing devices and may include a computer-readable medium storing instructions executable by one or more of the processors. Memory 754 and 764 may include, e.g., a volatile memory unit or units, a non-volatile memory unit or units, and/or other forms of computer-readable media, such as magnetic or optical disks, flash memory, cache, Random Access Memory (RAM), Read Only Memory (ROM), and combinations thereof. Memory, such as memory 754, may also be shared between processors 752a-752n. Data structures, such as an index, may be stored, for example, across storage 756 and memory 754. Computing device 700 may include other components not shown, such as controllers, buses, input/output devices, communications modules, etc.
An entire system, such as system 100, may be made up of multiple computing devices 700 communicating with each other. For example, device 780a may communicate with devices 780b, 780c, and 780d, and these may collectively be known as system 100. As another example, system 100 of
According to one aspect, a method for generating input for a kernel-based machine learning system includes receiving a polynomial kernel, approximating the polynomial kernel by generating a nonlinear randomized feature map, and storing the nonlinear feature map. Generating the nonlinear randomized feature map includes determining optimal coefficient values and standard deviation values for the polynomial kernel, determining an optimal probability distribution of vector values p(w) for the polynomial kernel based on a sum of N Gaussian kernels that use the optimal coefficient values, selecting a sample of the vectors, and determining the nonlinear randomized feature map using the sample of the vectors. The method may also include generating a vector for a data item in a data source using the nonlinear feature map and providing the vector to the kernel-based machine learning system.
These and other aspects can include one or more of the following features. For example, generating the vector for the data item can include extracting a set of features from the data item and normalizing the set of features, wherein the method further includes receiving a predicted label for the data item from the machine learning system. As another example, the data item includes a first data item and the method also includes using the nonlinear feature map to generate a second vector for a second data item in the data source and using respective vectors to compute a dot product similarity between the first data item and the second data item. As another example, the data item may be an image, a speech recording, or a video file.
As another example, determining optimal coefficient values can include solving
where $K(z)$ is the polynomial kernel, $\hat{K}(z)$ is the approximation of $K(z)$, and $z$ is the variable of the polynomial kernel. In some implementations, the polynomial kernel $K(z)$ is expressed as
where $\alpha$ is $(2/a^2)^p$, $q$ is the bias, $p$ is the order of the polynomial, and $a$ is a scaling parameter. In some implementations, $\hat{K}(z)$ is the inverse Fourier transform of a positive integrable function of the vector $w$,
parameterized by coefficient values $c_i$ and standard deviation values $\sigma_i$ such that $\hat{K}(z)$ is a good approximation of $K(z)$ on [0,2].
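The relations referred to in this passage can be written out as follows. This is a sketch under stated assumptions rather than the authoritative form: the squared-error objective, the closed form $q = a^2/2 - 1$, and the normalization $(1/(\sqrt{2}\sigma_i))^d$ (with $d$ the input dimension) are assumptions chosen to agree with $\alpha = (2/a^2)^p$, with the interval [0,2], and with a unitary Fourier-transform convention. The middle line uses only the fact that unit-norm vectors $x$ and $y$ at distance $z = \|x - y\|$ satisfy $z^2 = 2 - 2\langle x, y\rangle$.

```latex
% Assumed objective: squared error between the kernel and its approximation on [0,2]
\min_{\{c_i,\,\sigma_i\}} \int_0^2 \bigl[K(z) - \hat{K}(z)\bigr]^2 \, dz

% Polynomial kernel for unit-norm x, y (assumes q = a^2/2 - 1)
K(z) = \Bigl(1 - \frac{z^2}{a^2}\Bigr)^{p}
     = \alpha \bigl(q + \langle x, y\rangle\bigr)^{p},
\qquad \alpha = \Bigl(\frac{2}{a^2}\Bigr)^{p},
\quad q = \frac{a^2}{2} - 1

% Assumed positive function of w: a sum of N Gaussians with negative values mapped to zero
\hat{k}(w) = \max\Bigl(0,\ \sum_{i=1}^{N} c_i
   \Bigl(\frac{1}{\sqrt{2}\,\sigma_i}\Bigr)^{d} e^{-\|w\|^2/(4\sigma_i^2)}\Bigr)
```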
As another example, an approximation error of the nonlinear randomized feature map may decay at a rate of $O(p^{-2.5})$, where $p$ is the order of the polynomial kernel. As another example, determining the optimal probability distribution $p(w)$ through the relation $p(w) = (2\pi)^{-d/2}\hat{k}(w)$ can include using the optimal coefficient values $c_i$ and standard deviation values $\sigma_i$ to obtain
where $\hat{k}(w)$ is a positive integrable function of the vector $w$ whose inverse Fourier transform $\hat{K}(z)$ is a good approximation of $K(z)$ on [0,2]. As another example, computing the nonlinear randomized feature map using the samples may include using the optimal probability distribution of vector values $p(w)$
where $F(x)$ is the nonlinear feature map, $w_i$ are $D$ random vectors sampled from $p(w)$, and $b_i$ are $D$ random biases. As another example, in the weighted sum of $N$ Gaussian functions, negative Fourier transform values may be mapped to zero.
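The sampling and feature-map steps described above can be sketched in a few lines of standard-library Python. Several things here are assumptions rather than statements from the text: the clipped sum-of-Gaussians form of $\hat{k}(w)$ and its normalization, the cosine form $F(x) = \sqrt{2/D}\,[\cos(w_i^\top x + b_i)]$ with biases uniform on $[0, 2\pi]$, the grid-based inverse-CDF sampler for the radius, and all names (`k_hat`, `sample_vectors`, `feature_map`). The coefficient and standard deviation values are hypothetical placeholders, not fitted values.

```python
import bisect
import math
import random


def k_hat(r, coeffs, sigmas, d):
    """Assumed form: sum of Gaussians at radius r = ||w||, clipped at zero."""
    total = sum(
        c * (1.0 / (math.sqrt(2.0) * s)) ** d * math.exp(-r * r / (4.0 * s * s))
        for c, s in zip(coeffs, sigmas)
    )
    return max(0.0, total)  # negative values are mapped to zero


def sample_vectors(num, coeffs, sigmas, d, rng, grid=2000):
    """Draw `num` vectors in R^d with density proportional to k_hat(||w||)."""
    # The radial density is proportional to r^(d-1) * k_hat(r); tabulate its CDF.
    r_max = math.sqrt(2.0) * max(sigmas) * (math.sqrt(d) + 6.0)
    step = r_max / grid
    mids = [(j + 0.5) * step for j in range(grid)]
    cdf, acc = [], 0.0
    for r in mids:
        acc += r ** (d - 1) * k_hat(r, coeffs, sigmas, d)
        cdf.append(acc)
    vectors = []
    for _ in range(num):
        # Inverse-CDF lookup for the radius; uniform direction from a Gaussian draw.
        radius = mids[bisect.bisect_left(cdf, rng.random() * acc)]
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        vectors.append([radius * v / norm for v in g])
    return vectors


def feature_map(x, vectors, biases):
    """Assumed map: F(x) = sqrt(2/D) * [cos(w_i . x + b_i)] for i = 1..D."""
    scale = math.sqrt(2.0 / len(vectors))
    return [
        scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(vectors, biases)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
d, D = 3, 4000
coeffs, sigmas = [1.0], [0.7]             # hypothetical values, not fitted
W = sample_vectors(D, coeffs, sigmas, d, rng)
B = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
x, y = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]   # unit-norm inputs, so z^2 = 2
approx = dot(feature_map(x, W, B), feature_map(y, W, B))
```

With a single Gaussian term, the kernel implied by this normalization is $e^{-\sigma^2 z^2}$, so `approx` should land near `math.exp(-0.49 * 2)` up to Monte Carlo error that shrinks as `D` grows. A grid-based sampler is used only to keep the sketch dependency-free; it becomes numerically awkward for large `d`.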
According to one aspect, a computing system includes at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the computing system to perform operations. The operations may include generating an approximation of a polynomial kernel as a sum of Gaussian kernels and storing the sample of the vector values as a nonlinear randomized feature map. Generating the approximation of the polynomial kernel as the sum of Gaussian kernels includes limiting the variable of the approximation to [0,2], determining optimal coefficient values for the approximation by determining coefficient values that minimize the difference between the polynomial kernel and the approximation, determining an optimal probability distribution of vector values for the approximation based on the optimal coefficient values, and selecting a sample of the vector values. The operations may also include generating input vectors for a kernel-based machine learning system using the nonlinear randomized feature map and training the machine learning system using the input vectors.
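The coefficient-fitting operation can be illustrated with a deliberately simplified sketch. It assumes the approximation takes the form $\sum_i c_i e^{-\sigma_i^2 z^2}$ in the $z$ domain, holds the $\sigma_i$ fixed at hypothetical values, ignores the mapping of negative Fourier transform values to zero, and minimizes squared error on a uniform grid over [0,2] by solving the normal equations. The text does not say which solver or error measure is used, so every function name and parameter value below is illustrative only.

```python
import math


def poly_kernel(z, a, p):
    """Polynomial kernel in the distance variable z (assumed form (1 - z^2/a^2)^p)."""
    return (1.0 - z * z / (a * a)) ** p


def solve_linear(matrix, rhs):
    """Solve matrix @ sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * sol[k] for k in range(i + 1, n))
        sol[i] = (aug[i][n] - tail) / aug[i][i]
    return sol


def fit_coefficients(target, sigmas, grid=201):
    """Least-squares coefficients c_i for sum_i c_i * exp(-(sigma_i * z)^2) on [0, 2]."""
    zs = [2.0 * j / (grid - 1) for j in range(grid)]
    basis = [[math.exp(-((s * z) ** 2)) for s in sigmas] for z in zs]
    n = len(sigmas)
    gram = [[sum(row[i] * row[k] for row in basis) for k in range(n)] for i in range(n)]
    rhs = [sum(row[i] * target(z) for row, z in zip(basis, zs)) for i in range(n)]
    return solve_linear(gram, rhs)


# Hypothetical kernel parameters and widths, chosen only for illustration.
coeffs = fit_coefficients(lambda z: poly_kernel(z, a=4.0, p=10), sigmas=[0.5, 1.0, 2.0])
```

A fuller treatment would also optimize the $\sigma_i$ and account for the clipping at zero, which makes the problem nonlinear.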
These and other aspects can include one or more of the following features. For example, the sum of Gaussian kernels may be expressed as
where $c_i$ represents the optimal coefficient values, $\sigma_i$ represents the optimal standard deviation values, $N$ represents the number of Gaussian kernels in the sum, $w$ represents the sampled vector values, and $d$ is the dimension of an input vector for the polynomial kernel. In some implementations, the approximation is an inverse Fourier transform of the sum of Gaussian kernels and is a good approximation of the polynomial kernel on [0,2]. As another example, as part of generating the approximation, the operations can also include mapping negative Fourier transform values to zero in the sum of Gaussian kernels. As another example, minimizing the difference between the polynomial kernel and the approximation may be represented as
where $K(z)$ is the polynomial kernel, $\hat{K}(z)$ is the approximation, and $z$ is the variable of the polynomial kernel. In some implementations, the approximation can be evaluated as
where $J_{d/2-1}$ is the Bessel function of the first kind of order $d/2 - 1$ and $\hat{k}(w)$ is the Fourier transform of the kernel function.
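For reference, the evaluation mentioned above matches the standard radial form of a $d$-dimensional inverse Fourier transform. The expression below is a sketch that assumes the unitary Fourier convention and a radially symmetric $\hat{k}$; it is offered as one consistent way to write the integral, not as the exact displayed formula.

```latex
% Radial inverse Fourier transform in d dimensions (unitary convention assumed)
\hat{K}(z) = \int_0^{\infty} w \Bigl(\frac{w}{z}\Bigr)^{d/2 - 1}
             J_{d/2-1}(w z)\, \hat{k}(w)\, dw

% Check for d = 3, where J_{1/2}(t) = \sqrt{2/(\pi t)}\,\sin t:
\hat{K}(z) = \sqrt{\frac{2}{\pi}} \int_0^{\infty} \frac{w}{z}\, \sin(w z)\, \hat{k}(w)\, dw
```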
According to one aspect, a method includes normalizing a first feature vector for a data item, transforming the first feature vector into a second feature vector using a feature map that approximates a polynomial kernel with an explicit nonlinear feature map, and providing the second feature vector to a support vector machine for use as a training example.
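The normalization step is what confines the kernel variable to [0,2]: for unit-norm vectors the distance $z = \|x - y\|$ lies between 0 and 2, and the inner product equals $1 - z^2/2$. The short sketch below checks this numerically. The helper names and the sample vectors are hypothetical, and the downstream feature map and support vector machine are not shown.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in v]


def distance(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


# Hypothetical raw feature vectors for two data items.
raw_x = [3.0, 4.0, 0.0]
raw_y = [-1.0, 2.0, 2.0]
x, y = l2_normalize(raw_x), l2_normalize(raw_y)
z = distance(x, y)

# On the unit sphere, the inner product is a function of the distance alone.
inner_from_z = 1.0 - z * z / 2.0
```

Because the inner product depends only on $z$ after normalization, a polynomial kernel in $\langle x, y\rangle$ can be treated as a function of $z$ on [0,2], which is the interval on which the approximation is fitted.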
These and other aspects can include one or more of the following features. For example, the explicit nonlinear feature map may approximate a Fourier transform of the polynomial kernel as a positive projection of a combination of Gaussians. As another example, the combination of Gaussians is expressed as
where $c_i$ represents the optimal coefficient values, $\sigma_i$ represents the optimal standard deviation values, $N$ represents the number of Gaussians in the combination, $w$ represents a vector, and $d$ is the dimension of the first feature vector. In some implementations, the method includes determining the optimal coefficient and standard deviation values by determining values that minimize differences between the polynomial kernel and an inverse Fourier transform of the combination of Gaussians for values of the polynomial variable ranging from zero to two.
Various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any non-transitory computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory (including Random Access Memory), Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor.
The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
A number of implementations have been described. Nevertheless, various modifications may be made without departing from the spirit and scope of the invention. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.
Kumar, Sanjiv, Pennington, Jeffrey
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Dec 14 2015 | PENNINGTON, JEFFREY | Google Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 050924 | /0505 | |
Dec 14 2015 | KUMAR, SANJIV | Google Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 050924 | /0505 | |
Sep 29 2017 | Google Inc | GOOGLE LLC | ENTITY CONVERSION | 050939 | /0571 | |
Oct 07 2019 | GOOGLE LLC | (assignment on the face of the patent) | / |