The concept involves using machine learning to efficiently identify potentially fraudulent small business loan and credit applications by automatically flagging applications that meet certain criteria. In one preferred implementation, the tool compares a business description to the NAICS code selected in a loan application to assess the potential for fraud. Specifically, an algorithm can match the leftmost two digits of the selected code against the category indicated by the applicant's description. An engine calculates a fraud probability score based on the matching and attaches it to the application. Because the tool detects fraud proactively rather than reactively, it substantially reduces computational costs and resources and reduces the biases associated with highly intensive manual work.
11. A computer-implemented method capable of processing loan applications comprising:
creating a classification model by training a neural network with a plurality of layers using a training data set;
weighting classes within the classification model to facilitate learning across the classes;
receiving a description of a business applying for a loan from an applicant;
generating, using the classification model, classification options for the business using the description;
comparing a selection of one of the classification options by the applicant to the generated options;
determining if a fraud threshold is met based upon the comparing;
when the fraud threshold is met, flagging the loan as problematic; and
when the fraud threshold is not met, identifying the loan as nonproblematic.
1. A computer system for processing loan applications, comprising:
one or more processors; and
non-transitory computer-readable storage media encoding instructions which, when executed by the one or more processors, cause the computer system to:
create a classification model by training a neural network with a plurality of layers using a training data set;
weight classes within the classification model to facilitate learning across the classes;
receive a business description of a business applying for a loan from an applicant;
generate, using the classification model, classification options for the business using the business description;
compare a selection of one of the classification options by the applicant to the classification options;
determine if a fraud threshold is met based upon the comparison;
when the fraud threshold is met, identify the loan as problematic; and
when the fraud threshold is not met, identify the loan as nonproblematic.
19. A computer system capable of detecting fraud in payroll protection program loans submitted to a financial institution, comprising:
one or more processors; and
non-transitory computer-readable storage media encoding instructions which, when executed by the one or more processors, cause the computer system to:
create a classification model by training a neural network with a plurality of layers using a training data set;
weight classes within the classification model to facilitate learning across the classes;
receive text input from a customer associated with an application for a payroll protection program loan, wherein the text input is a description of a business;
generate, using the classification model, classification options for the text input using the description of the business, wherein the classification options are presented to the customer on a graphical user interface;
compare a selection of one of the classification options by the customer to the generated options, wherein the customer manually elects a classification option which is not one of the generated options; and
based on a rating score, flag the application for review by the financial institution.
3. The computer system of
4. The computer system of
5. The computer system of
8. The computer system of
9. The computer system of
10. The computer system of
12. The method of
13. The method of
14. The method of
16. The method of
17. The method of
Financial institutions process thousands of requests for loans each year. Information associated with the applicant is gathered as part of the loan application process. This information is used to determine whether the applicant qualifies for the requested loan. It can be a significant challenge to process this information and mitigate risks of fraud associated with these loan processes.
Embodiments of the disclosure are directed to detecting potential fraud in loan applications.
According to aspects of the present disclosure, a system comprises: one or more processors; and non-transitory computer-readable storage media encoding instructions which, when executed by the one or more processors, cause the system to: receive a business description of a business applying for a loan from an applicant; generate classification options for the business by querying a database using the business description; compare a selection of one of the classification options by the applicant to the classification options; determine if a fraud threshold is met based upon the comparison; and when the fraud threshold is met, identify the loan as problematic.
In another aspect, a computer-implemented method of processing loan applications comprises: receiving a description of a business applying for a loan from an applicant; generating classification options for the business by querying a database using the description; comparing a selection of one of the classification options by the applicant to the generated options; determining if a fraud threshold is met based upon the comparing; and when the fraud threshold is met, flagging the loan as problematic.
In yet another aspect, a computer system capable of detecting fraud in Payroll Protection Program loans submitted to a financial institution comprises: one or more processors; and non-transitory computer-readable storage media encoding instructions which, when executed by the one or more processors, cause the computer system to: receive text input from a customer associated with an application for a Payroll Protection Program loan, wherein the text input is a description of a business; generate classification options for the text input by querying a database using the description of the business, wherein the classification options are presented to the customer on a graphical user interface; compare a selection of one of the classification options by the customer to the generated options, wherein the customer manually elects a classification option which is not one of the generated options; and based on a rating score, flag the application for review by the financial institution.
The details of one or more techniques are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of these techniques will be apparent from the description, drawings, and claims.
This disclosure relates to detecting fraud in business applications for any credit product.
Financial institutions process millions of requests for credit products each year. Examples of such credit products include loans and credit cards. Each request typically involves information that the applicant provides. This information is used by the financial institution to determine whether the applicant qualifies for the requested credit product. A certain percentage of applications involve fraudulent information, and these applications are identified as early in the application process as possible to mitigate the impact of this activity.
The concepts described herein can provide an early warning that detects fraud proactively rather than reactively, substantially reducing computational costs and resources and reducing the biases associated with highly intensive manual work. The concept can include a machine learning classification tool. In some examples, the tool takes a business description provided by the applicant and returns suggested codes from which the applicant may choose. A machine learning model is trained on this data to recognize certain patterns, providing an algorithm that can reason over and learn from the data. In the examples provided, the tool implements a machine learning approach to industry classification, which promises efficiency, scalability, and adaptability.
One possible implementation is detecting fraud in a Payroll Protection Program (PPP) loan, a Small Business Administration-backed loan provided by the US Federal Government that helps businesses keep their workforce employed during the COVID-19 crisis. An applicant applying for a PPP loan must provide a description of the business and select the appropriate NAICS code. Given a business description from the applicant, the NAICS engine provides five potential NAICS codes from which the customer can choose.
The preeminent taxonomy for industry classification is the NAICS, which is the standard used by, among other organizations, the United States Census Bureau. The 2017 NAICS taxonomy arrays the North American business economy into 1057 industries, each corresponding to a six-digit code. Each industry belongs to an industry group, represented by the first four digits of the code, which in turn belongs to a subsector, represented by the first three digits, which in turn belongs to a sector, represented by the first two digits. In addition to the 1057 industries, NAICS comprises 20 sectors, 99 subsectors, and 311 industry groups.
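For illustration, the hierarchy described above can be read directly off the digits of a six-digit code. The following minimal sketch (the function name and example values are illustrative and not part of the disclosure) derives the sector, subsector, and industry-group prefixes:

```python
def naics_levels(code: str) -> dict:
    """Split a six-digit NAICS code into its hierarchical prefixes.

    The first two digits identify the sector, the first three the
    subsector, the first four the industry group, and all six digits
    the industry itself.
    """
    code = code.strip()
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit NAICS code, got {code!r}")
    return {
        "sector": code[:2],
        "subsector": code[:3],
        "industry_group": code[:4],
        "industry": code,
    }

# Example: 722511 (full-service restaurants) belongs to sector 72,
# subsector 722, and industry group 7225.
print(naics_levels("722511"))
```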
This concept is a novel approach to industry classification utilizing a multilayer perceptron. Because the classifier relies on machine learning rather than manual labor, the approach provides a highly efficient solution for classifying companies that are not already contained within an extant database. Moreover, by thresholding the predictions of the classifier based on confidence scores, corporations can be classified into six-digit NAICS industries with greater precision than the classifications provided by premier databases. Finally, the framework of the model can be used to label companies according to any industry classification schema, not only NAICS. As a result, the algorithm can rapidly adapt to changing industries in a way that classification systems tied to the static NAICS taxonomy cannot.
Leveraging information (e.g., through an Application Programming Interface (API) provided by ZoomInfo Technologies LLC of Vancouver, Wash. (formerly EverString Technology)) to construct a database of companies labeled with the industries to which they belong, deep neural networks are trained to predict the industries of novel companies. The model's capacity to predict six-digit NAICS codes and the ability of the model architecture to adapt to other industry segmentation schemas are examined. Additionally, the ability of the model to generalize despite the presence of noise in the labels of the training set is investigated. Finally, predictive precision is increased by thresholding based on the confidence scores that the model outputs along with its predictions.
In one implementation, review of PPP loan applications shows that when an application is flagged as suspicious, the majority of the time the selected NAICS code is incorrect and possibly associated with an attempt to mislead the system. For example, "hair, nails or beauty salons" fall under NAICS code 81, which is "other services". Applications described as "hair, nails or beauty salons" have been observed listing NAICS code 72, which stands for "accommodation and food services". All industries not qualified under code 72 are entitled to a loan of up to 2.5 times the average monthly payroll; industry 72 is entitled to up to 3.5 times the average monthly payroll, making it more attractive to fraudsters.
More specifically, the concept involves the automated flagging of PPP loan applications that meet certain criteria. Specifically, an algorithm can match the leftmost two digits of the selected NAICS code with the description of the industry from the customer. An engine calculates a probability of fraud based on the matching and attaches it to the loan application.
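As a sketch of this two-digit comparison (the function and variable names below are assumptions for illustration, not part of the disclosure), the check reduces to testing whether the sector prefix of the manually selected code appears among the sector prefixes of the generated options:

```python
def flag_sector_mismatch(selected_code: str, generated_codes: list[str]) -> bool:
    """Return True (flag as problematic) when the sector prefix of the
    manually selected NAICS code does not appear among the sector
    prefixes of the automatically generated options."""
    selected_sector = selected_code[:2]
    generated_sectors = {code[:2] for code in generated_codes}
    return selected_sector not in generated_sectors

# A salon description would typically yield options in sector 81
# ("other services"); a manual selection in sector 72 ("accommodation
# and food services") is flagged.
options = ["812112", "812113", "812199", "812111", "812990"]
print(flag_sector_mismatch("722511", options))  # True -> flag for review
print(flag_sector_mismatch("812112", options))  # False -> not flagged
```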
The client devices 102, 104, 106 may be one or more computing devices that can include a mobile computer, desktop computer, or other computing device used by a customer to generate or receive data.
In one non-limiting example, a client device 102 is used by an applicant to submit application data regarding a loan application with the server device 112, such as business information.
The client devices 102, 104, 106 can communicate with the server device 112 through the network 110 to transfer data. The server device 112 can also obtain data via other input devices, which can correspond to any electronic data acquisition processes (e.g., from third parties through an application programming interface—API).
The server device 112 can be managed by, or otherwise associated with, an enterprise (e.g., a financial institution such as a bank, brokerage firm, mortgage company, or any other money-handling enterprise) that uses the system 100 for data management and/or deep learning processes. The server device 112 receives data from one or more of the client devices 102, 104, 106.
The graphical user interface module 202, rendered on the client devices 102, 104, 106, provides an interface for displaying and navigating the results of the classification engine 204. In some examples, the graphical user interface module 202 can render interfaces that allow an applicant to access a survey, submit data to the classification engine 204, store results associated with generated classifications, and otherwise manipulate the classification results, as described further below.
The classification engine 204 is programmed to manage the transport and storage of classification codes based upon the text provided by the applicant, such as a business description. Additional details of the classification engine 204 are provided below.
The code generation engine 302 establishes pre-selected codes based on the input provided by the applicant in the business description text field box. Training sets are constructed from a database, such as EverString's proprietary database, an index of over 18 million companies tying each entity to a detailed set of attributes. This massive database is compiled by combining data purchased from private vendors with data extracted from the Internet by internally developed web-crawling technologies, and its size calls for storage on a distributed file system, such as HDFS (Hadoop Distributed File System).
The model utilizes a standard multilayer perceptron architecture. Specifically, a neural network with four fully-connected layers is used. After each of the first three layers, batch normalization, tanh activation, and dropout with a keep probability of 0.5 are performed. The first fully-connected layer has a hidden dimension of 640; the second and third layers have a hidden dimension of 4096. The output dimension of the final layer is the number of industries into which companies are being classified. In one preferred implementation, where classification occurs according to six-digit NAICS codes, the output dimension of this layer is 1057. The dimension of each training example that is input to the neural network, which corresponds to the number of keywords in the dense matrix loaded from the sparse feature vectors in a minibatch, is 350,000. As a result, for six-digit NAICS classification, the model uses around 250 million parameters (350,000×640 + 640×4,096 + 4,096×4,096 + 4,096×1,057 ≈ 248 million).
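A sketch of this architecture is shown below using PyTorch. The framework choice is an assumption for illustration only; the disclosure does not name one, and details such as the exact ordering of batch normalization, activation, and dropout are inferred from the description above.

```python
import torch
import torch.nn as nn

class NaicsClassifier(nn.Module):
    """Multilayer perceptron for six-digit NAICS classification.

    Layer sizes follow the description: a 350,000-dimensional keyword
    input, hidden dimensions of 640, 4096, and 4096, and an output
    dimension equal to the number of industries (1057 for six-digit
    NAICS). Batch normalization, tanh, and dropout (keep probability
    0.5) follow each of the first three fully connected layers.
    """

    def __init__(self, input_dim: int = 350_000, num_classes: int = 1057):
        super().__init__()
        dims = [input_dim, 640, 4096, 4096]
        blocks = []
        for in_dim, out_dim in zip(dims[:-1], dims[1:]):
            blocks += [
                nn.Linear(in_dim, out_dim),
                nn.BatchNorm1d(out_dim),
                nn.Tanh(),
                nn.Dropout(p=0.5),  # keep probability of 0.5
            ]
        self.hidden = nn.Sequential(*blocks)
        self.output = nn.Linear(dims[-1], num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.hidden(x))

# Instantiating the full model allocates roughly 1 GB of float32 weights;
# the parameter count matches the figure in the text (~248 million plus biases).
model = NaicsClassifier()
print(sum(p.numel() for p in model.parameters()))
```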
Class imbalance is addressed with a weighted loss function using a scheme of differential inter- and intra-class weighting. The classes are weighted according to the ratio of the total number of training examples to the number of training examples for that class. If there are C classes, N examples in the training set, and c examples in a particular class, the weight for that class is set according to the following Equation 1.
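Equation 1 itself is not reproduced in this text. A form consistent with the description (the ratio of total training examples to per-class examples, optionally normalized by the number of classes so that weights average to one) would be, as a hypothetical reconstruction:

```latex
% Hypothetical reconstruction of Equation 1; the exact form is not shown in this text.
w_{c} \;=\; \frac{N}{C \cdot c}
\qquad \text{or, without the $C$ normalization, } w_{c} = \frac{N}{c}
```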
This weighting scheme up-weights the classes with fewer examples and down-weights the classes with more examples so that the model learns robustly across all classes, rather than learning in a skewed fashion in which it only predicts the most well-represented classes. There is evidence that such an inter-class weighting schema also leads to a loss function that is robust to noisy labels in the training set. However, the noisy label problem is further addressed using intra-class weighting. For six-digit NAICS classes with particularly noisy labels, EverString's HIT system is used to manually verify the labels for a small number of training examples (around 200). The verified examples are then up-weighted while the unverified examples are down-weighted. If a particular class contains N examples, V of which are verified and U of which are unverified, the weight for a verified example is shown in Equation 2.
The weight for an unverified example is shown in Equation 3.
This weighting scheme allows the model to prioritize the verified examples in such a way that the model skews more heavily toward the verified examples as the number of verified examples increases, without affecting the distribution of interclass weights.
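Equations 2 and 3 are likewise not reproduced in this text. The sketch below shows one plausible scheme consistent with the description: verified examples are up-weighted and unverified examples down-weighted while the class's total weight mass, and hence the inter-class weight distribution, is preserved. The specific formulas are assumptions for illustration only, not the formulas of the disclosure.

```python
def example_weights(n_total: int, n_classes: int, n_class: int,
                    n_verified: int) -> tuple[float, float]:
    """Return (verified_weight, unverified_weight) for one class.

    Illustrative scheme only: the base inter-class weight is
    n_total / (n_classes * n_class); verified examples are boosted by
    the unverified fraction and unverified examples are scaled down by
    it, so that
    n_verified * w_verified + n_unverified * w_unverified
        == n_class * w_class,
    leaving the inter-class weight distribution unchanged.
    """
    w_class = n_total / (n_classes * n_class)
    n_unverified = n_class - n_verified
    if n_verified == 0:
        return w_class, w_class
    unverified_fraction = n_unverified / n_class
    w_verified = w_class * (1.0 + unverified_fraction)
    w_unverified = w_class * unverified_fraction
    return w_verified, w_unverified

# Example: 1,000,000 training examples, 1057 classes, and a class with
# 2,000 examples of which about 200 labels were manually verified.
print(example_weights(1_000_000, 1057, 2_000, 200))
```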
Once the code options are automatically generated and presented to the applicant, if the applicant instead decides to manually select a code option from a drop-down list, the tool can detect that the application is suspicious immediately upon the applicant's submission using the fraud score engine 304. By thresholding the model's predictions based on a fraud score (or a confidence score, which may be used interchangeably), six-digit NAICS codes can be predicted precisely, even for difficult-to-classify industries. The fraud score can flag potential fraud in a binary manner, while the confidence score is output as a value between two bounds that measures possible fraud.
The fraud score engine 304 can be adjusted to a preferred tolerance score for flagging records that may contain fraudulent data. The fraud score engine 304 is engaged when the applicant manually selects a code rather than choosing from the generated code options provided by the code generation engine 302.
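As a minimal sketch of such thresholding (the scoring rule, threshold value, and function names are assumptions for illustration; the disclosure does not specify them), a fraud score can be derived from the probability mass the model assigns to the applicant's selected sector and compared against an adjustable tolerance:

```python
def fraud_score(selected_code: str, code_probs: dict[str, float]) -> float:
    """Score possible fraud as 1 minus the model's probability mass on
    the sector the applicant manually selected; higher means more
    suspicious."""
    selected_sector = selected_code[:2]
    sector_mass = sum(p for code, p in code_probs.items()
                      if code[:2] == selected_sector)
    return 1.0 - sector_mass

def is_problematic(selected_code: str, code_probs: dict[str, float],
                   tolerance: float = 0.9) -> bool:
    """Binary flag: the record is problematic when the fraud score
    exceeds the configured tolerance."""
    return fraud_score(selected_code, code_probs) > tolerance

# Example: model probabilities concentrated in sector 81, while the
# applicant manually selects a sector 72 code.
probs = {"812112": 0.52, "812113": 0.23, "812199": 0.11,
         "812111": 0.06, "812990": 0.03}
print(is_problematic("722511", probs))  # True -> flag for review
```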
The interface 400 enables the user to use a secure channel to connect to the application survey that contains the classification engine 204. In one non-limiting example, the applicant can access the application survey directly from an institution's web application.
Upon selecting the requested application type option,
The interface in
In one non-limiting example, an applicant may desire a loan with a NAICS code that has a sector of 72 because sector 72 may provide for a bigger loan with more lenient repayment policies than other sectors, such as 81, 61, and 32. An applicant with a business that falls under "hair salon services" would be presented with pre-selected codes not including 72 because 72 is for "food and accommodations". The applicant may attempt to defraud the system by manually selecting sector 72 because qualifying as such a business would result in a greater net gain.
Once the applicant attempts to enter a NAICS code starting with 72, the two leftmost digits are compared to those found in any of the generated NAICS codes provided to the applicant in the drop-down list. If the two leftmost digits of the manually selected option are not found in any of the automatically generated options based on the business description input, the application is immediately flagged as a problematic record, whether or not fraud has occurred. Once flagged, the institution will proceed to review the application with notice that the application may be fraudulent.
The fraud score engine 304 can be adjusted to "loosen" the tolerance of the tool to the desired threshold. For example, the fraud score engine 304 may be modified to flag applications even when one of the generated sectors matches. Conversely, if the two leftmost digits of the applicant's choice are found in all of the generated options, there is no possibility of fraud or a problematic record; that is, if all generated options start with the same two digits and those two digits match the applicant's manual selection, the application is deemed legitimate.
At step 1002, the business description text is received from the applicant in the application survey. This can be accomplished in various ways, such as through the graphical user interfaces described herein.
Next, at operation 1004, options are generated based on the business description text from the applicant in the business description text field box. The options are generated based on a machine learning model that is dynamically engaged to relearn and grow its library of data. The generated options are displayed to the applicant in the form of a drop-down list for the applicant to select an option.
Next, at operation 1006, if the applicant decides not to use the generated options, the applicant manually selects a choice from a drop-down list that is not featured in the generated options because the model did not deem it relevant to the business description.
Finally, at operation 1008, the tool compares the option manually selected by the applicant against the auto-generated options to determine if the fraud score threshold has been met. If the fraud score threshold has been met, then the application is flagged as problematic.
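Tying operations 1002 through 1008 together, a simplified orchestration might look like the sketch below. The generator callable stands in for the classification model behind the code generation engine 302; all names and the simple sector test are illustrative assumptions, not the disclosed implementation.

```python
from typing import Callable, Optional

def process_application(description: str,
                        manual_code: Optional[str],
                        generate_codes: Callable[[str], list[str]]) -> dict:
    """Run the survey flow: generate code options from the business
    description (operation 1004); if the applicant bypasses them with a
    manual selection (operation 1006), compare the two-digit sector
    prefixes and flag the record accordingly (operation 1008)."""
    options = generate_codes(description)  # e.g., the five suggested NAICS codes
    if manual_code is None or manual_code in options:
        return {"options": options, "flagged": False}
    sectors = {code[:2] for code in options}
    return {"options": options,
            "selected": manual_code,
            "flagged": manual_code[:2] not in sectors}

# Stand-in generator for demonstration only; in the system this role is
# played by the trained classification model.
fake_generator = lambda text: ["812112", "812113", "812199", "812111", "812990"]
print(process_application("hair and nail salon", "722511", fake_generator))
```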
Although the examples described above relate to loans requested through the PPP, the concepts described herein are equally applicable to other types of loans and credit products. For instance, other classification systems may be applicable, such as the Merchant Category Code (MCC), a four-digit number used by the credit card industry to classify businesses into market segments, and the Standard Industrial Classification (SIC), another four-digit code.
As illustrated in the example of
The mass storage device 1114 is connected to the CPU 1102 through a mass storage controller (not shown) connected to the system bus 1122. The mass storage device 1114 and its associated computer-readable data storage media provide non-volatile, non-transitory storage for the server device 112. Although the description of computer-readable data storage media contained herein refers to a mass storage device, such as a hard disk or solid-state disk, it should be appreciated by those skilled in the art that computer-readable data storage media can be any available non-transitory physical device or article of manufacture from which the server device 112 can read data and/or instructions.
Computer-readable data storage media include volatile and non-volatile, removable, and non-removable media implemented in any method or technology for storage of information such as computer-readable software instructions, data structures, program modules, or other data. Example types of computer-readable data storage media include, but are not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid-state memory technology, CD-ROMs, digital versatile discs (“DVDs”), other optical storage media, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the server device 112.
According to various embodiments of the invention, the server device 112 may operate in a networked environment using logical connections to remote network devices through network 110, such as a wireless network, the Internet, or another type of network. The server device 112 may connect to network 110 through a network interface unit 1104 connected to the system bus 1122. It should be appreciated that the network interface unit 1104 may also be utilized to connect to other types of networks and remote computing systems. The server device 112 also includes an input/output controller 1106 for receiving and processing input from a number of other devices, including a touch user interface display screen or another type of input device. Similarly, the input/output controller 1106 may provide output to a touch user interface display screen or other output devices.
As mentioned briefly above, the mass storage device 1114 and the RAM 1110 of the server device 112 can store software instructions and data. The software instructions include an operating system 1118 suitable for controlling the operation of the server device 112. The mass storage device 1114 and/or the RAM 1110 also store software instructions and applications 1124, that when executed by the CPU 1102, cause the server device 112 to provide the functionality of the server device 112 discussed in this document. For example, the mass storage device 1114 and/or the RAM 1110 can store the graphical user interface engine 202, and the classification engine 204.
Although various embodiments are described herein, those of ordinary skill in the art will understand that many modifications may be made thereto within the scope of the present disclosure. Accordingly, it is not intended that the scope of the disclosure in any way be limited by the examples provided.
Inventors: Chen, Jie; Pandey, Manish; Nadav, Carmel