The technology disclosed relates to a system. The system comprises a trained multi-label support vector machine running a one-vs-the-rest classifier. The trained multi-label support vector machine running a one-vs-the-rest classifier is configured with trained parameters. The trained parameters are learned from training the trained multi-label support vector machine running the one-vs-the-rest classifier on document features of documents belonging to a plurality of label classes, and hyperplane determinations on label classes in the plurality of label classes. The trained parameters include distributions of distances between the label classes and the hyperplanes.
|
1. A system, comprising:
a multi-label document classifier, comprising:
a trained multi-label support vector machine (SVM) running a one-vs-the-rest classifier, wherein the trained multi-label SVM running the one-vs-the-rest classifier is trained to label input documents with one or more labels of a plurality of labels comprising at least fifty (50) labels, and the trained multi-label SVM is configured with trained parameters that are learned from training the trained multi-label SVM running the one-vs-the-rest classifier on:
document features of training documents each belonging to one or more of the plurality of labels, and
hyperplane determinations on labels in the plurality of labels, wherein the trained parameters include distributions of distances between the at least fifty labels and the hyperplanes, and the trained parameters are stored on a memory of the multi-label document classifier for use in applying the trained multi-label SVM running the one-vs-the-rest classifier;
a feature generator that creates the document features representing features of words in the input documents and the training documents; and
a harvester that:
harvests labels of the plurality of labels based on distances between the hyperplanes for the plurality of labels and the document features determined based on applying the trained multi-label SVM to the input documents, and
assigns the harvested labels to the input documents.
11. A system, comprising:
one or more processors;
a first memory, functionally coupled with the one or more processors, the first memory storing:
a trained multi-label support vector machine (SVM) running a one-vs-the-rest classifier, wherein the trained multi-label SVM running the one-vs-the-rest classifier is trained to label input documents with one or more labels of a plurality of labels comprising at least fifty (50) labels, and the trained multi-label SVM is configured with trained parameters that are learned from training the trained multi-label SVM running the one-vs-the-rest classifier on:
document features of first documents each belonging to one or more of the plurality of labels, and
hyperplane determinations on label classes in the plurality of labels, wherein the trained parameters include distributions of distances between the at least fifty labels and the hyperplanes, and the trained parameters are stored in the first memory for use in applying the trained multi-label SVM running the one-vs-the-rest classifier; and
a second memory, having stored thereon instructions that, upon execution by the one or more processors, cause the one or more processors to label second documents with one or more of the plurality of labels based on applying the trained multi-label SVM running the one-vs-the-rest classifier to the second documents, wherein the instructions to label the second documents comprises instructions that, upon execution by the one or more processors, cause the one or more processors to:
create the document features of the second documents;
determine distances between the hyperplanes for the plurality of labels and the document features using the trained multi-label SVM; and
harvest labels of the plurality of labels based on the distances between the hyperplanes for the plurality of labels and the document features.
2. The system of
a document database comprising the input documents, wherein the multi-label document classifier is configured to apply the trained multi-label support vector machine running the one-vs-the-rest classifier to the input documents in the document database to label the input documents.
3. The system of
4. The system of
5. The system of
6. The system of
7. The system of
a hyperparameter tuning component configured to select hyperparameters of the trained multi-label SVM running the one-vs-the-rest classifier across regularization, class weight, and loss function in a predetermined search range such that an at-least-one score is at or within ten percent of maximum attainable over the predetermined search range.
8. The system of
9. The system of
a website identifier configured to identify parked domains by:
crawling uniform resource locators (URLs) that are within a predetermined edit distance of selected URL names; and
a document crawler configured to collect documents posted on the parked domains by:
determining for at least some of the crawled URLs that URL resolution is referred to an authoritative nameserver that appears in a list of parked domain nameservers identified as dedicated to parked domains,
collecting the documents posted on the crawled URLs that are referred to the parked domain nameservers,
labeling the collected documents as collected from the parked domains, and
storing the documents and parked domain labels for use in training.
10. The system of
label a first document with first tier labels based on applying the trained multi-label SVM running the one-vs-the-rest classifier to the first document, wherein applying the trained multi-label SVM running the one-vs-the-rest classifier to the first document comprises:
creating, by the feature generator, document features representing features of words in the first document;
applying classification parameters of the trained multi-label SVM for the plurality of labels to the document features to determine distances between the hyperplanes for the plurality of labels and the document features;
harvesting, by the harvester, positive labels of the plurality of labels having positive distances as first tier labels;
harvesting, by the harvester, negative labels of the plurality of labels having negative distances and a strong separation from a distribution of the negative distances as first tier labels; and
applying, by the harvester, the first tier labels to the first document.
12. The system of
13. The system of
14. The system of
15. The system of
16. The system of
17. The system of
18. The system of
identify parked domains and collect third documents posted on the parked domains by:
crawling uniform resource locators (URLs) that are within a predetermined edit distance of selected URL names,
determining for at least some of the crawled URLs that URL resolution is referred to an authoritative nameserver that appears in a list of parked domain nameservers identified as dedicated to parked domains, and
collecting the third documents posted on the crawled URLs that are referred to the parked domain nameservers;
label the third documents as collected from the parked domains; and
store the documents and parked domain labels for use in training.
19. The system of
label each document of the second documents with first tier labels based on applying the trained multi-label SVM running the one-vs-the-rest classifier to the respective document, wherein applying the trained multi-label SVM running the one-vs-the-rest classifier to the respective document comprises:
creating the document features representing features of words in the respective document;
applying classification parameters of the trained multi-label SVM for the plurality of labels to the document features to determine distances between the hyperplanes for the plurality of labels and the document features;
harvesting positive labels of the plurality of labels having positive distances as first tier labels;
harvesting negative labels of the plurality of labels having negative distances and a strong separation from a distribution of the negative distances as first tier labels; and
applying the first tier labels to the respective document.
|
This application is a continuation of U.S. patent application Ser. No. 16/226,394, entitled “MULTI-LABEL CLASSIFICATION OF TEXT DOCUMENTS,” filed Dec. 19, 2018. The non-provisional application is incorporated by reference for all purposes.
The following materials are incorporated by reference as if fully set forth herein:
The technology disclosed relates to multi-label classification of documents obtained from a wide variety of website classes for implementing fine grained enterprise policies.
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Access to information via efficient search engines makes the World Wide Web (WWW) the first choice of enterprise users for many types of information. However, many websites contain content that can be offensive in a workplace or can be infected with a virus or malware. Enterprises attempt to filter out websites that contain inappropriate content. Websites can contain information related to multiple topics, e.g., finance, education, politics, etc. A large number of label classes are required to classify hundreds of millions of websites on the World Wide Web (WWW). One challenge faced by enterprises is to identify websites that meet the criteria for filtering. Another challenge is to apply enterprise policies when each website contains content related to multiple topics.
Therefore, an opportunity arises to automatically assign multiple class labels to a website for efficient implementation of enterprise policies to filter out inappropriate websites.
In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Enterprise users access Internet-based services on a daily basis to search a wide variety of website classes. Efficient search engines make the World Wide Web (WWW) the first choice for access to many types of information. The websites found can be offensive in a workplace or can be infected with a virus or malware. Both of these problems correlate with certain classes of websites, such as gambling, adult, tobacco and violence oriented websites. The risk faced by organizations can be mitigated by fine grained enterprise policies that can be applied by website class.
Websites often contain content that belongs to more than one category, even when the number of categories is limited to a practical number for policy application, such as 50 to 250 categories for fine grained policies. It is useful to classify a website with multiple categories that are relevant. For instance, the landing page for mil.com is a federal government webpage for the Department of Defense. It includes information about filing tax returns in a right side panel that would be a banner ad if this were a commercial website. This page should receive labels for both military and financial/accounting.
One technology for assigning multiple labels to a document is a support vector machine (SVM) running a one-vs-the-rest (OvR) classifier. This SVM classifier positions a hyperplane between the feature vectors that support a ground truth label and the feature vectors that support the rest of the available labels. This hyperplane is traditionally used to distinguish between the most applicable label for the document and the rest of the available labels. For a small number of labels, such as 3 to 5, this SVM classifier can be expected to apply multiple labels to some documents when applying a default labeling threshold value. For a large number of labels, over 50, this SVM classifier in practice leaves a large proportion of sample documents unlabeled and is unlikely to apply multiple labels.
The technology disclosed modifies the traditional one-vs-the-rest classifier and directly uses distances (typically calculated as dot products) between sample documents and hyperplanes for each of the available labels. The technology optionally can perform labeling in tiers, thereby increasing the likelihood that at least one label will be applied, despite the difficulty of coaxing a label from a one-vs-the-rest classifier when there are more than 50 labels. During training, the one-vs-the-rest classifier is trained N times for N labels, producing N hyperplanes. The training results, including hyperplane positions, are made available for inference. During inference, a feature vector for a sample document is analyzed N times using hyperplanes derived by the N trained SVMs to determine positive or negative distances between the feature vector and the hyperplanes for the respective labels.
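The inference step described above can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: the function name `signed_distances`, the three labels, and all weight and bias values are hypothetical, and the "distance" is taken to be the raw dot product plus bias, as the text suggests.

```python
def signed_distances(features, hyperplanes):
    """Return {label: w.x + b} for each label's hyperplane.

    `hyperplanes` maps a label to a (weights, bias) pair. A positive value
    means the document falls on the label's side of that hyperplane.
    """
    distances = {}
    for label, (weights, bias) in hyperplanes.items():
        dot = sum(w * x for w, x in zip(weights, features))
        distances[label] = dot + bias
    return distances


# Hypothetical hyperplanes for N = 3 labels over a 3-term feature vector.
hyperplanes = {
    "military": ([2.0, 0.0, -1.0], -0.5),
    "finance": ([0.0, 1.5, 0.0], -1.0),
    "shopping": ([-1.0, -1.0, 2.0], -0.5),
}
features = [1.0, 0.5, 0.0]

distances = signed_distances(features, hyperplanes)
print(distances)
```

Each label is scored independently, so one document can land on the positive side of several hyperplanes or of none.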
The technology disclosed calculates the distance results (positive and negative) between the feature vector and the SVM hyperplanes for the available labels to harvest multiple labels. Many documents may receive more than one label. It is allowable for a document to remain unlabeled, but it is preferred for documents to receive at least one label.
The tier 1 labels include all labels with a positive distance. The labels with negative distances to hyperplanes follow a Gaussian distribution characterized by a mean and a standard deviation. The tier 1 labels further include labels with negative distances that are strongly separated from the distribution. In one implementation, the tier 1 labels include harvested labels with a negative distance between the mean negative distance and zero that are separated from the mean negative distance by at least 3 standard deviations. If harvesting for tier 1 labels does not result in any class labels, the technology disclosed harvests tier 2 labels. Tier 2 class labels include class labels with a negative distance that lies between the mean negative distance and 3 standard deviations from that mean and that is separated from the mean negative distance by at least 2.5 standard deviations.
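The tiered harvesting rule can be sketched as follows. This is one reading of the paragraph above, under stated assumptions: the mean and standard deviation of the negative distances are supplied by the caller (for example, from training), separation is measured from the mean toward zero, and the function name `harvest` and all numeric values are hypothetical.

```python
def harvest(distances, neg_mean, neg_std, tier1_sigmas=3.0, tier2_sigmas=2.5):
    """Return (tier, labels); tier is 1, 2, or None if nothing is harvested."""
    tier1_cut = neg_mean + tier1_sigmas * neg_std
    tier2_cut = neg_mean + tier2_sigmas * neg_std

    # Tier 1: every positive distance, plus negative distances that sit at
    # least `tier1_sigmas` standard deviations above the negative mean.
    tier1 = [label for label, d in distances.items()
             if d > 0 or tier1_cut <= d < 0]
    if tier1:
        return 1, tier1

    # Tier 2 (only when tier 1 is empty): negative distances between
    # `tier2_sigmas` and `tier1_sigmas` standard deviations above the mean.
    tier2 = [label for label, d in distances.items()
             if tier2_cut <= d < min(tier1_cut, 0)]
    if tier2:
        return 2, tier2
    return None, []


# Assumed distribution of negative distances: mean -1.0, std 0.25,
# so the tier 1 cut is -0.25 and the tier 2 cut is -0.375.
doc_a = {"military": 1.5, "finance": -0.2, "shopping": -2.0}
doc_b = {"education": -0.3, "news": -1.2}
doc_c = {"education": -0.9, "news": -1.2}

print(harvest(doc_a, -1.0, 0.25))
print(harvest(doc_b, -1.0, 0.25))
print(harvest(doc_c, -1.0, 0.25))
```

In this sketch a document such as `doc_c`, whose distances all lie close to the negative mean, stays unlabeled, which the text allows.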
System Overview
We describe a system and various implementations for multi-label classification of a website hosted on a network, typically the Internet.
User endpoints 111 such as computers 121a-n, tablets 131a-n, and cell phones 141a-n access and interact with data stored on the Internet-based services 117. This access and interaction is modulated by an inline proxy 151 that is interposed between the user endpoints and the Internet-based services 117. The inline proxy 151 monitors network traffic between user endpoints 111 and the Internet-based services 117 to implement fine grained enterprise policies that can be applied by website class. The inline proxy 151 can be an Internet-based proxy or a proxy appliance located on premise.
In a “managed device” implementation, user endpoints 111 are configured with routing agents (not shown) which ensure that requests for the Internet-based services 117 originating from the user endpoints 111 and response to the requests are routed through the inline proxy 151 for policy enforcement. Once the user endpoints 111 are configured with the routing agents, they are under the ambit or purview of the inline proxy 151, regardless of their location (on premise or off premise).
In an “unmanaged device” implementation, certain user endpoints that are not configured with the routing agents can still be under the purview of the inline proxy 151 when they are operating in an on premise network monitored by the inline proxy 151.
The interconnection of the elements of system 100 will now be described. The network(s) 155 couples the computers 121a-n, the tablets 131a-n, the cell phones 141a-n, the Internet-based services 117, the trained multi-label document classifier 161, the label classes database 159, the raw document database 173, the document features database 175, the labeled document database 179, and the inline proxy 151, all in communication with each other (indicated by solid double-arrowed lines). The actual communication path can be point-to-point over public and/or private networks. The communications can occur over a variety of networks, e.g., private networks, VPN, MPLS circuit, or Internet, and can use appropriate application programming interfaces (APIs) and data interchange formats, e.g., Representational State Transfer (REST), JavaScript Object Notation (JSON), Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), Java Message Service (JMS), and/or Java Platform Module System. All of the communications can be encrypted. The communication is generally over a network such as the LAN (local area network), WAN (wide area network), telephone network (Public Switched Telephone Network (PSTN)), Session Initiation Protocol (SIP), wireless network, point-to-point network, star network, token ring network, hub network, Internet, inclusive of the mobile Internet, via protocols such as EDGE, 3G, 4G LTE, Wi-Fi and WiMAX. The engines or system components of
The Internet-based services 117 can include Internet hosted services such as news websites, blogs, video streaming websites, social media websites, hosted services, cloud applications, cloud stores, cloud collaboration and messaging platforms, and/or cloud customer relationship management (CRM) platforms. Internet-based services 117 can be accessed using a browser (e.g., via a URL) or a native application (e.g., a sync client). The websites hosted by the Internet-based services 117 and exposed via URLs/APIs can fit in more than one of the classes assigned by the multi-label document classifier 161.
Enterprise users access tens or hundreds of websites on a daily basis to access many types of information. The technology disclosed organizes websites in classes. For example, the websites providing information about education belong to the education class. Examples include websites of universities, colleges, schools and online education websites. The websites providing such information are labeled with the “education” class label. However, almost all websites contain information that can be classified in multiple classes. For example, an online education website “ryrob.com/online-business-courses/” provides a list of business courses with a brief introduction to each course and the instructor. This website can be labeled with at least two class labels, “education” and “business”. The website also offers forums for users to post their questions and comments, therefore the website can be assigned a third label, “forums”. As the website is created and maintained by an individual, it can be labeled as belonging to the “personal sites & blogs” class. A website most likely has multiple labels based on its content. Enterprises can classify websites in tens or hundreds of classes. More than fifty classes of websites have been identified. The examples include education, business, military, science, finance/accounting, shopping, news & media, personal sites & blogs, entertainment, food & drink, government & legal, health & nutrition, insurance, lifestyle, etc. The number of classes can range from 50 to 250 label classes. In some working examples, data sets have had 70 and 108 label classes, both of which fall within the range of 50 to 250 label classes. The technology described can be applied to 50 to 500 label classes or to 50 to 1,000 label classes, as the classifiers described can be adapted to choosing among labels in those sizes of label sets.
A person skilled in the art will appreciate that additional labels for classes of website can be applied to other present or future-developed websites without departing from the spirit and scope of the technology disclosed.
The system 100 stores raw document data for websites in raw document database 173. The raw document data is converted to document features for input to the multi-label document classifier 161. An example of document features is frequency features based on term frequency-inverse document frequency (TF-IDF). Other examples of document features include semantic features based on embedding in a multi-dimensional vector space using techniques such as Word2Vec or global vectors for word representations (GloVe). The system 100 stores class labels in the label classes database 159. The trained multi-label document classifier 161 takes document features data of a website and assigns one or more class labels to the website. The websites with their respective class labels are stored in labeled document database 179.
The Internet-based services 117 provide information to the users of the organization that is implementing enterprise policies directed to access, security and the like. When a user sends a request to an Internet-based service via an endpoint 121a, the inline proxy 151 intercepts the request message. The inline proxy 151 queries the labeled document database 179 to identify the website being accessed via a uniform resource locator (URL) or an application programming interface (API). In one implementation, the inline proxy 151 uses the URL in the request message to identify the website being accessed. The inline proxy 151 then queries the labeled document database 179 to identify class labels for the website. The class labels are used to implement the enterprise policy directed to manage website access. If the class labels for the website are among the classes allowed by the enterprise, the user endpoint 121a is allowed to access the website. Otherwise if at least one class label of the website matches one of the class labels not allowed by the enterprise policy, the connection request from user endpoint 121a to the website is blocked, logged, aborted or otherwise handled.
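The policy check performed by the inline proxy can be sketched as a simple lookup. This is a minimal illustration under assumptions: the labeled document database is modeled as an in-memory dictionary, the function name `decide_access`, the second URL, and the blocked class list are hypothetical, and the handling of a URL with no stored labels is an assumption because the text does not specify it.

```python
def decide_access(url, labeled_documents, blocked_classes):
    """Return "allow", "block", or "unclassified" for a requested URL."""
    labels = labeled_documents.get(url)
    if labels is None:
        # Assumption: the source does not say how unknown URLs are handled.
        return "unclassified"
    # Block when at least one class label is not allowed by the policy.
    if blocked_classes.intersection(labels):
        return "block"
    return "allow"


# Stand-in for the labeled document database 179.
labeled_documents = {
    "ryrob.com/online-business-courses/": {
        "education", "business", "forums", "personal sites & blogs",
    },
    "casino.example/": {"gambling", "entertainment"},  # hypothetical site
}
blocked_classes = {"gambling", "adult"}  # hypothetical enterprise policy

for url in ("ryrob.com/online-business-courses/", "casino.example/", "unknown.example/"):
    print(url, decide_access(url, labeled_documents, blocked_classes))
```

Because a website carries several labels, a single disallowed label is enough to trigger the block branch even when the other labels are permitted.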
An example of frequency features is the term frequency-inverse document frequency (tf-idf) metric. The tf-idf metric (also referred to as tf-idf weight) is often used to measure how important a word (also referred to as “term”) is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a tool in scoring and ranking a document's relevance given a user query.
The tf-idf weight is a product of two terms: term frequency (tf) and inverse document or domain frequency (idf). There are several variations on calculating tf-idf weights, any of which can be used with our method. Term frequency (tf) measures how frequently a term occurs in a document. This can either be as a count or as a proportion of the words in a document. When the proportion approach is used, the count for a term is divided by the total count of words in the document. Inverse document frequency (idf) measures the discriminating power of a term. Certain terms such as “is”, “of”, “that” appear in virtually every document, so they have little discriminating power for classification. Thus, “idf” down scales the weight given to frequent terms and up scales the rare ones. The “idf” for a term can be logarithmically scaled by taking a log of the total number of documents (in the universe being considered) divided by the number of documents with term “t” in them, such as a natural log or log to the base of 10. Sometimes the count of a term in the document population is increased by a pre-determined number, to avoid a rare divide-by-zero error. This variation on tf-idf calculation is within the scope of our disclosure and actually used by the scikit-learn library under some circumstances to calculate tf-idf.
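The calculation above can be sketched with the Python standard library. This is a minimal sketch of one variation: term frequency as a proportion of the words in the document, and idf as the natural log of the document count divided by the number of documents containing the term, without the smoothing variation mentioned above. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed tf-idf)."""
    n_docs = len(documents)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))

    weights = []
    for words in documents:
        counts = Counter(words)
        total = len(words)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of tokenized documents.
corpus = [
    ["the", "tax", "return", "tax"],
    ["the", "military", "news"],
    ["the", "tax", "news"],
]
weights = tf_idf(corpus)
print(weights[0])
```

The term "the" appears in every document, so its idf is log(1) = 0 and it contributes nothing to any feature vector, which is the down-scaling of frequent terms that the text describes.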
In another implementation, the feature generator 235 uses semantic features based on word embedding in a multi-dimensional vector space using techniques such as Word2Vec or global vectors for word representation (GloVe). In semantic similarity, the idea of distance between terms is based on likeness of their meaning or semantic content as opposed to similarity regarding their syntactical representation (for example, their string format). For example, a “cat” is similar to a “dog” in that both are animals, are four-legged, pets, etc. Document features based on frequency features do not capture this semantic information. Document features based on semantic features represent words in a vector space where semantically similar words are mapped to nearby points or in other words are embedded nearby each other. Word2Vec and GloVe are two examples of mappings generated by machine learning that embed words in a vector space.
The SVM classifier 265 includes a supervised learning technique called support vector machine (SVM) for classification of documents. In one implementation, the SVM classifier 265 uses the scikit-learn based linear support vector classification (LinearSVC) technique (http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines. Given labeled training data, an SVM outputs a hyperplane which classifies new examples. In a two dimensional space, this hyperplane is a line dividing a plane in two parts, with one class on either side. In a multi-label classification of documents, an SVM running a one-vs-the-rest classifier (OvR) is run as many times as the number of label classes to generate hyperplane determinations that separate each label class from the rest. During inference, stored parameters of the SVM trained on the label classes are used to determine positive or negative distances between SVM hyperplanes for the labels and the feature vector representing the document.
The harvester 275 assigns multiple class labels to a document by harvesting the labels with a positive distance to the SVM hyperplanes. Consider that the distribution of class labels with negative distances is characterized by a mean and a standard deviation. The harvester 275 also assigns class labels to the document by harvesting labels with a negative distance to the SVM hyperplanes using the following scheme. The harvester harvests labels with negative distances that lie between the mean negative distance and zero and are separated from the mean negative distance by a predetermined first number of standard deviations. In one implementation, when the above harvesting does not result in any labels for the document, the harvester further harvests labels with a negative distance that lies between the mean negative distance and the first number of standard deviations and is separated from the mean negative distance by a predetermined second number of standard deviations.
Training the Multi-Label Document Classifier
During training (box 161), the output labels of an SVM (OvR) with a selected combination of hyper parameter values are compared with ground truth labels of the training data 331 using a linear support vector machine classifier (Linear SVC or Linear SVM). For Linear SVC, the value of the “kernel” parameter is set as “linear”. The Linear SVC constrains growth of dimensionality when generating feature vectors using words in documents, thus allowing scalability of the model. A non-linear kernel (e.g., radial basis function (RBF) kernel) can be used, but may require substantial computing resources when presented with a large number of samples. This is because in text categorization, the number of words is large, causing the dimensionality of feature vectors to be very large. Details of the LinearSVC model and kernel parameter are provided by scikit-learn at http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html.
The goal of Linear SVC is to identify the position of a hyperplane which maximizes the margin between two classes of the training data. The distance between the nearest points of the two classes is referred to as the margin. In two dimensional space, the hyperplane is a line represented as f(x)=wx+b, where w is the weight vector which is normal to the line f(x) and b is the bias. In a three dimensional space, the hyperplane is a plane and in n-dimensional space it is a hyperplane. During training, the weights in the weight vector are updated using a pre-determined learning rate. The algorithm converges when the margin computed for the training samples is maximized. Sometimes, the data in the classes is not separable using lines and planes as described above and requires a non-linear hyper-plane to separate the classes. In such cases, the regularization hyper-parameter can be used in combination with a kernel method (also referred to as a kernel trick).
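For reference, the hyperplane and margin described above can be written in standard SVM notation. This is the textbook soft-margin formulation with hinge loss, not a formula taken from the disclosure. Here y_i in {-1, +1} is the one-vs-the-rest target for sample x_i, and C is the regularization hyper parameter.

```latex
% Decision function, geometric (signed) distance, and margin width
f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b, \qquad
d(\mathbf{x}) = \frac{f(\mathbf{x})}{\lVert \mathbf{w} \rVert}, \qquad
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
\quad \text{(when the nearest points satisfy } \lvert f(\mathbf{x}) \rvert = 1\text{)}

% Soft-margin objective with hinge loss
\min_{\mathbf{w},\, b} \;\; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  + C \sum_{i} \max\bigl(0,\; 1 - y_i\, f(\mathbf{x}_i)\bigr)
```

The unnormalized value f(x) is what a dot product computes directly. It has the same sign as the geometric distance d(x) and differs from it only by the factor 1/||w||.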
At-Least-One (ALO) Score
The ALO score (or metric) is used to determine performance of the SVM model. A model is scored by considering how many of the ground truth set of labels per document are assigned to the document by the model. During hyper parameter tuning, suppose we want to tune one hyper parameter “h1”. Now let us suppose we want to try the following values for h1: [1, 10, 100]. We select a value of “h1” (say 10) and train the classifier to obtain a trained classifier (with N SVM hyperplane positions for N class labels). The ALO score is calculated for this trained classifier (or SVM model). We repeat the above process to train classifiers using the remaining values of the hyper parameter “h1” (i.e., 1 and 100) and calculate the respective ALO scores for the trained classifiers. We select the classifier (with N SVM hyperplane positions) which gives us the best ALO score. Details of the hyper parameters used in hyper parameter tuning are presented below in the section on hyper parameters. The ALO scorer 365 calculates a ratio of the documents with at least one pairwise match between inferred labels and ground truth labels to the total number of documents with at least one ground truth label and stores the result as model evaluations 367. Consider a simple example, consisting of three documents D1, D2, and D3, to illustrate the calculation of the ALO score. Suppose the ground truth label classes for the three documents are:
Now further consider a trained SVM (OvR) model predicts the following label classes for these documents:
The ALO score for the above example is ⅔ (or 66.67%) as D1 and D2 have at least one correct label predicted. In one implementation of the technology disclosed, the ALO scores of the SVM (OvR) models range from 45% (minimum) to 85% (maximum). In another implementation the ALO scores range from 45% to 95%. It is understood that in other implementations, the values of the ALO scores can be greater than 95% and can range up to 100%. In one implementation, a trained SVM (OvR) model is selected for use in production such that the ALO score of the model is within 10% of the maximum ALO score using a pre-determined hyper-parameter search range. One of the reasons for using ALO score to determine performance of a model is that it does not consider documents to which a label is not assigned by a model. This is because while searching for content on the World Wide Web, content may not be available for a particular URL. Such documents may not have any labels assigned to them and may cause bias in model performance. In another implementation, each document always contains content, therefore, the above restriction is removed and all documents are considered when calculating the ALO score. In another implementation, the performance score of a model is a weighted average of the ALO score calculated using equation (1) and a fraction of documents not assigned any label classes by the model.
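The ALO ratio can be sketched as follows, using the definition given for the ALO scorer 365: documents with at least one match between inferred and ground truth labels, divided by documents with at least one ground truth label. The function name `alo_score` and the label sets for D1, D2, and D3 are illustrative assumptions chosen to reproduce the ⅔ outcome; they are not the values from the original example.

```python
def alo_score(ground_truth, predicted):
    """Fraction of documents (with ground truth) that get >= 1 correct label."""
    # Only documents with at least one ground truth label are counted.
    eligible = [doc for doc, labels in ground_truth.items() if labels]
    if not eligible:
        return 0.0
    matched = sum(
        1 for doc in eligible
        if ground_truth[doc] & predicted.get(doc, set())
    )
    return matched / len(eligible)


# Illustrative label sets (assumed for this sketch).
ground_truth = {
    "D1": {"education", "business"},
    "D2": {"military"},
    "D3": {"shopping", "news & media"},
}
predicted = {
    "D1": {"education"},
    "D2": {"military", "government & legal"},
    "D3": {"entertainment"},
}

score = alo_score(ground_truth, predicted)
print(f"ALO = {score:.2%}")
```

D1 and D2 each have one correct label and D3 has none, so two of the three eligible documents count toward the score.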
One-vs-the-Rest Classifier
Hyper Parameters
Hyper parameters are parameters that are not directly learnt during training. As described above, in one implementation, three hyper parameters: loss function, regularization and class weight are used in hyper parameter tuning. The “loss” hyper parameter value specifies the loss function: hinge or squared_hinge. Hinge loss is based on the idea of margin maximization when positioning a hyperplane. Hinge is the standard SVM loss function while squared_hinge is the square of the hinge loss. See www.scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html for further details.
The regularization parameter (often referred to as the “C” parameter in the SKLearn library, http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) informs the SVM classifier how much it needs to avoid misclassifying. The regularization parameter controls the trade-off between misclassifications and the width of the margin when positioning a hyperplane. For large values of “C”, a smaller margin hyperplane will be selected if that hyperplane does a better job of getting all the training data points classified correctly. For example, a graph 476 in
One factor in the misclassification of data points is imbalanced classes. The class weight hyper parameter is used to adjust the regularization “C” parameter in a data set in which some classes are small (minority classes) while others are large (majority classes). One method to handle this imbalance is to multiply the “C” parameter by the class weight “wj” of class “j”, which is inversely proportional to the frequency of class j.
Cj=C*wj (2)
This results in “Cj”, which is then used as the regularization parameter when determining the hyperplane for class “j”. The general idea is to increase the penalty for misclassifying minority classes to prevent them from being overwhelmed by the majority class. In the scikit-learn library, the values of “Cj” are automatically calculated for each class when class_weight is set to “balanced”. In balanced mode, the weight of each class “j” is calculated as:
wj=n/(k*nj)
Where wj is the weight of class j, n is total number of data points in the data set, nj is the number of observations in class j, and k is the total number of classes. Combinations of values of the above three hyper parameters are used in hyper parameter tuning using the Scikit-Learn utility GridSearchCV 362 described above with reference to
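As an illustration of the class weight adjustment, the following minimal sketch computes balanced weights wj as n divided by the product of k and nj, and then the per-class regularization values Cj of equation (2). The function names and the example class counts are hypothetical. For simplicity, the sketch assumes that each data point carries a single class label.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight of class j is n / (k * nj): inversely proportional to frequency."""
    counts = Counter(labels)
    n = sum(counts.values())   # total number of data points
    k = len(counts)            # total number of classes
    return {cls: n / (k * n_j) for cls, n_j in counts.items()}


def per_class_c(c: float, weights: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Cj = C * wj, the regularization value used for class j."""
    return {cls: c * w for cls, w in weights.items()}


# Hypothetical imbalanced data set: 8 majority and 2 minority data points.
labels = ["news"] * 8 + ["parked"] * 2
weights = balanced_class_weights(labels)
c_values = per_class_c(10.0, weights)
```

With these counts, the minority class receives the larger weight (2.5 versus 0.625), so misclassifying its data points is penalized more heavily.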
Document Collection from Parked Domains
The website name variation generator 561 generates URLs that are within a predetermined edit distance of a selected URL. In one implementation, an open source utility “dnstwist” (https://github.com/elceef/dnstwist) is used to generate such similar-looking domain names. For example, Bank of America's website URL is “www.bankofamerica.com”. The “dnstwist” utility generates multiple variations of the website name, e.g., “www.bnakofamerica.com”, “www.bankfoamerica.com”, “www.bankafamerica.com”, etc. Each variant URL is passed to the document crawler 512 to collect the contents of the website. The document crawler 512 determines whether the requested URL is hosted by one of the parked domains name servers 559 or one of the active website name servers 579. In one implementation, contents from secondary webpages of a website are not collected for generating document features. For example, for the “www.espn.com” website, secondary level webpages such as “baseball.espn.com” are not collected by the document crawler 512.
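One kind of name variation can be sketched as follows. This is not the algorithm used by “dnstwist”; it is a simplified, hypothetical generator that only swaps adjacent characters in the name immediately preceding the top-level domain, which yields names within a small edit distance of the selected URL.

```python
from typing import List


def transposition_variants(url: str) -> List[str]:
    """Return variants of a host name formed by swapping adjacent characters
    in the label that precedes the top-level domain."""
    parts = url.split(".")
    if len(parts) < 2:
        raise ValueError("expected a host name such as 'www.example.com'")
    name = parts[-2]
    variants = []
    for i in range(len(name) - 1):
        if name[i] == name[i + 1]:
            continue  # swapping identical characters reproduces the original
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        variants.append(".".join(parts[:-2] + [swapped] + parts[-1:]))
    return variants


variants = transposition_variants("www.bankofamerica.com")
```

In this sketch, each variant would then be handed to a crawler for collection. Replacement variants such as “www.bankafamerica.com” are outside the scope of this simplified generator.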
In one implementation, the system 500 maintains a list of parked domains name servers 559, for example “sedoparking.com”, “parkingcrew.com”, etc. If the nameserver of the requested URL appears in the list of the parked domain name servers, the document crawler 512 labels the document as collected from a parked domain. If the URL of the requested document is redirected, the technology disclosed determines that URL resolution is referred to an authoritative nameserver that appears in the list of parked domain nameservers 519 and labels the document as obtained from a parked domain.
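The nameserver check can be sketched as follows, assuming the authoritative nameserver host names for a URL have already been resolved by some other component. The function name is_parked and the nameserver host names used in the example (such as “ns1.sedoparking.com”) are hypothetical.

```python
from typing import Iterable

# Parked domain nameserver domains named in the description above.
PARKED_NAMESERVER_DOMAINS = frozenset({"sedoparking.com", "parkingcrew.com"})


def is_parked(nameservers: Iterable[str],
              parked_domains: Iterable[str] = PARKED_NAMESERVER_DOMAINS) -> bool:
    """Return True if any nameserver belongs to a listed parked domain."""
    parked = list(parked_domains)
    for ns in nameservers:
        host = ns.lower().rstrip(".")  # tolerate a trailing root dot
        for domain in parked:
            # Match the domain itself or any host beneath it.
            if host == domain or host.endswith("." + domain):
                return True
    return False


# Hypothetical resolved nameservers for two crawled URLs.
label_a = "parked domain" if is_parked(["ns1.sedoparking.com."]) else "active"
label_b = "parked domain" if is_parked(["ns1.example.net"]) else "active"
```

The suffix comparison includes the leading dot so that an unrelated host such as “notsedoparking.com” is not matched.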
Tier 1 labels also include label classes with negative distances between the mean negative distance (μ) and zero and separated from the mean negative distance (μ) by a predetermined first number of standard deviations (σ). In one implementation, the first number of standard deviations is between 2.8 and 3.2. The example shown in
In one implementation, if harvesting of tier 1 labels does not result in any label classes for a document, the multi-label document classifier harvests tier 2 labels. The tier 2 labels include labels with negative distances between the mean negative distance (μ) and the first number of standard deviations and separated from the mean negative distance (μ) by a predetermined second number of standard deviations (σ). In one implementation, the second number of standard deviations is between 2.4 and 2.6. The value for the second number of standard deviations is selected to balance between not collecting too many labels and getting at least one label for the document. In the graph 700, the first number of standard deviations is “3” and the second number of standard deviations is selected as “2.5”. Therefore, tier 2 labels have negative distances between μ+2.5σ and μ+3σ. The technology disclosed can be applied with a value of the second number of standard deviations between 2 and 3 or between 1.6 and 3.3, as the harvester can be adapted to select labels when the second number of standard deviations is selected in those ranges.
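The two-tier harvesting can be sketched as follows. The function name harvest_labels, the label names, and the distance values are hypothetical. The sketch assumes that the mean (μ) and standard deviation (σ) of the negative distances are supplied by the caller, and it uses 3 and 2.5 as the first and second numbers of standard deviations.

```python
from typing import Dict, List, Tuple


def harvest_labels(distances: Dict[str, float], mu: float, sigma: float,
                   first_k: float = 3.0,
                   second_k: float = 2.5) -> Tuple[str, List[str]]:
    """Harvest tier 1 labels; fall back to tier 2 only if tier 1 is empty."""
    tier1_floor = mu + first_k * sigma    # e.g. mu + 3 sigma
    tier2_floor = mu + second_k * sigma   # e.g. mu + 2.5 sigma
    # Tier 1: positive distances, plus negative distances lying between
    # mu + first_k * sigma and zero.
    tier1 = [label for label, d in distances.items()
             if d > 0 or tier1_floor <= d <= 0]
    if tier1:
        return "tier1", tier1
    # Tier 2: negative distances between mu + second_k * sigma and
    # mu + first_k * sigma.
    tier2 = [label for label, d in distances.items()
             if tier2_floor <= d < tier1_floor]
    if tier2:
        return "tier2", tier2
    return "none", []


# Hypothetical distances between one document and four label hyperplanes.
MU, SIGMA = -1.0, 0.25   # tier 1 floor is -0.25, tier 2 floor is -0.375
result_a = harvest_labels(
    {"sports": 0.4, "news": -0.1, "finance": -0.3, "games": -1.2}, MU, SIGMA)
result_b = harvest_labels({"finance": -0.3, "games": -1.2}, MU, SIGMA)
```

In the first call, “sports” is harvested for its positive distance and “news” for its strong separation from the mean negative distance. In the second call, no tier 1 label exists, so “finance” is harvested as a tier 2 label.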
Computer System
In one implementation, the multi-label document classifier 161 of
User interface input devices 1038 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1000.
User interface output devices 1076 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1000 to the user or to another machine or computer system.
Storage subsystem 1010 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. Subsystem 1078 can be graphics processing units (GPUs) or field-programmable gate arrays (FPGAs).
Memory subsystem 1022 used in the storage subsystem 1010 can include a number of memories including a main random access memory (RAM) 1032 for storage of instructions and data during program execution and a read only memory (ROM) 1034 in which fixed instructions are stored. A file storage subsystem 1036 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 1036 in the storage subsystem 1010, or in other machines accessible by the processor.
Bus subsystem 1055 provides a mechanism for letting the various components and subsystems of computer system 1000 communicate with each other as intended. Although bus subsystem 1055 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 1000 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1000 depicted in
Particular Implementations
The technology disclosed relates to multi-label classification of documents obtained from a wide variety of website classes for implementing fine grained enterprise policies.
The technology disclosed can be practiced as a system, method, device, product, computer readable media, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
A first system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to train a multi-label support vector machine (abbreviated SVM) running a one-vs-the-rest (abbreviated OVR) classifier. The system accesses training examples for documents belonging to 50 to 250 label classes. The system trains an SVM using the document features for one-vs-the-rest training and hyperplane determinations on the label classes. The system stores parameters of the trained SVM on the label classes for use in production of multi-label classifications of documents.
The first system implementation and other systems disclosed optionally include one or more of the following features. System can also include features described in connection with methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
The document features include frequency features based on term frequency-inverse document frequency (abbreviated TF-IDF). The document features include semantic features based on embedding in a multi-dimensional vector space using Word2Vec. The document features include semantic features based on embedding in a multi-dimensional vector space using global vectors for word representation (abbreviated GloVe).
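Frequency features of this kind can be illustrated with a minimal sketch. The weighting used here, term frequency multiplied by the natural logarithm of the number of documents divided by the document frequency, is one common TF-IDF variant and is an assumption; library implementations may apply smoothing or normalization. The function name tf_idf and the tokenized documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute one TF-IDF feature dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))
    features = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        features.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return features


# Hypothetical tokenized documents.
docs = [["bank", "loan", "bank"], ["bank", "score"]]
features = tf_idf(docs)
```

In this sketch, a term that occurs in every document, such as “bank”, receives a weight of zero, while terms that are specific to one document receive positive weights.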
The system selects the SVM hyper parameters across regularization, class weight, and loss function in a predetermined search range such that an at-least-one (abbreviated ALO) score is at or within ten percent of maximum attainable over the predetermined search range. In such an implementation, the ALO score calculates a ratio of count of the documents with at least one pairwise match between inferred labels and ground truth labels to the total number of documents with at least one ground truth label.
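The selection over a predetermined search range can be sketched as follows. The grid values, the ALO scores, and the function names are hypothetical, and the sketch reads “within ten percent of maximum” as a relative tolerance, which is an assumption. In practice, each score would come from training and evaluating a classifier with the corresponding hyper parameter combination.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Tuple

Params = Tuple[str, float, Optional[str]]  # (loss, C, class_weight)


def candidate_grid(losses: Sequence[str], c_values: Sequence[float],
                   class_weights: Sequence[Optional[str]]) -> List[Params]:
    """Enumerate every combination in the predetermined search range."""
    return list(product(losses, c_values, class_weights))


def within_tolerance(alo_scores: Dict[Params, float],
                     tolerance: float = 0.10) -> List[Params]:
    """Keep combinations whose ALO score is within the tolerance of the best."""
    best = max(alo_scores.values())
    return [params for params, score in alo_scores.items()
            if score >= best * (1.0 - tolerance)]


grid = candidate_grid(["hinge", "squared_hinge"], [1, 10, 100],
                      [None, "balanced"])

# Hypothetical ALO scores for three of the grid points.
alo_scores = {
    ("hinge", 1, None): 0.62,
    ("squared_hinge", 10, "balanced"): 0.80,
    ("squared_hinge", 100, "balanced"): 0.75,
}
acceptable = within_tolerance(alo_scores)
```

With these hypothetical scores, the two “squared_hinge” combinations fall within ten percent of the best score of 0.80 and remain candidates for production use.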
In one implementation, one of the label classes is parked domain. For the documents posted on parked domains, the system identifies parked domains and collects documents posted on the parked domains. In such an implementation, the system crawls websites accessible by uniform resource locators (abbreviated URLs) that are within a predetermined edit distance of selected URL names. The system determines for at least some of the crawled URLs that URL resolution is referred to an authoritative nameserver that appears in a list of parked domain nameservers identified as dedicated to parked domains. The system collects the documents posted on the crawled URLs that are referred to the parked domain nameservers. The system labels the collected documents as collected from the parked domains and stores the documents and parked domain labels for use in training.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the system described above. Yet another implementation may include a method performing the functions of the system described above.
A first method implementation of the technology disclosed includes training a multi-label support vector machine (abbreviated SVM) running a one-vs-the-rest (abbreviated OVR) classifier. The method includes accessing training examples for documents belonging to 50 to 250 label classes. Following this, the method includes training an SVM using the document features for one-vs-the-rest training and hyperplane determinations on the label classes. The method includes storing parameters of the trained SVM on the label classes for use in production of multi-label classifications of documents.
Each of the features discussed in this particular implementation section for the first system implementation applies equally to this method implementation. As indicated above, the system features are not all repeated here and should be considered repeated by reference.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the first method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the first method described above.
Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
Each of the features discussed in this particular implementation section for the first system implementation applies equally to the CRM implementation. As indicated above, the system features are not all repeated here and should be considered repeated by reference.
A second system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to perform multi-label SVM classification of a document. The system creates document features representing frequencies or semantics of words in the document. The system applies trained SVM classification parameters for a plurality of labels to the document features for the document and determines positive or negative distances between SVM hyperplanes for the labels and the feature vector. The system harvests the labels with a positive distance. The system further harvests the labels with a negative distance and a strong separation from a distribution of the negative distances. When the distribution of negative distances is characterized by a mean and standard deviation, the strong separation is defined such that the harvested labels include the labels with a negative distance between the mean negative distance and zero and separated from the mean negative distance by a predetermined first number of standard deviations. Finally, the system outputs a list of harvested tier 1 labels.
The second system implementation and other systems disclosed optionally include one or more of the following features. System can also include features described in connection with methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
In one implementation, the first number of standard deviations is between 3.0 and 4.0. The system harvests as tier 2 labels the labels with a negative distance between the mean negative distance and the first number of standard deviations and separated from the mean negative distance by a predetermined second number of standard deviations. Following this, the system outputs the tier 2 labels with the list. In such an implementation, the second number of standard deviations is between 2.0 and 3.0.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the system described above. Yet another implementation may include a method performing the functions of the system described above.
A second method implementation of the technology disclosed includes performing multi-label SVM classification of a document. The method includes creating document features representing frequencies or semantics of words in the document. The method includes applying trained SVM classification parameters for a plurality of labels to the document features for the document and determining positive or negative distances between SVM hyperplanes for the labels and the feature vector. Following this, the method includes harvesting the labels with a positive distance. The method includes further harvesting the labels with a negative distance and a strong separation from a distribution of the negative distances. When the distribution of negative distances is characterized by a mean and standard deviation, the strong separation is defined such that the harvested labels include the labels with a negative distance between the mean negative distance and zero and separated from the mean negative distance by a predetermined first number of standard deviations. Finally, the method includes outputting a list of harvested tier 1 labels.
Each of the features discussed in this particular implementation section for the second system implementation applies equally to this method implementation. As indicated above, the system features are not all repeated here and should be considered repeated by reference.
Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the second method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the second method described above.
Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
Each of the features discussed in this particular implementation section for the second system implementation applies equally to the CRM implementation. As indicated above, the system features are not all repeated here and should be considered repeated by reference.
Yadav, Sandeep, Balupari, Ravindra K.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Dec 17 2018 | BALUPARI, RAVINDRA K | NETSKOPE, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 058942 | /0586 | |
Dec 18 2018 | YADAV, SANDEEP | NETSKOPE, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 058942 | /0586 | |
Aug 06 2021 | Netskope, Inc. | (assignment on the face of the patent) | / |