Techniques are described for employing a crowdsourcing framework to analyze data related to the performance or operations of computing systems, or to analyze other types of data. A question is analyzed to determine data that is relevant to the question. The relevant data may be decontextualized to remove or alter contextual information included in the data, such as sensitive, personal, or business-related data. The question and the decontextualized data may then be presented to workers in a crowdsourcing framework, and the workers may determine an answer to the question based on an analysis or an examination of the decontextualized data. The answers may be combined, correlated, or otherwise processed to determine a processed answer to the question. Machine learning techniques are employed to adjust and refine the decontextualization.
|
14. One or more computer-readable media storing instructions which, when executed by at least one processor, instruct the at least one processor to perform actions comprising:
accessing at least one dataset associated with a question, the at least one dataset including data and contextual information about the data;
determining at least one decontextualization operation that at least partly alters the contextual information about the data included in the at least one dataset with the data;
applying the at least one decontextualization operation to the at least one dataset to determine at least one modified dataset that includes the data and the contextual information about the data included in the at least one modified dataset is at least partly altered;
receiving, from a plurality of worker devices associated with a plurality of workers in a crowdsourcing framework, a plurality of answers to the question, the plurality of answers generated by the plurality of workers analyzing the at least one modified dataset having data and at least partly altered contextual information about the data in view of the question;
incorporating, into training data, the plurality of answers and information describing the at least one decontextualization operation; and
employing the training data in machine learning to train a decontextualizer to be used in subsequent data decontextualization to answer subsequent questions using the crowdsourcing framework.
5. A system, comprising:
at least one computing device configured to implement one or more services, wherein the one or more services are configured to:
access at least one dataset associated with a question, the at least one dataset including data and contextual information about the data and the question having a predetermined answer;
determine at least one decontextualization operation that at least partly alters the contextual information about the data included in the at least one dataset;
apply the at least one decontextualization operation to the contextual information about the data of the at least one dataset to determine at least one modified dataset in which the contextual information about the data included in the at least one dataset is at least partly altered;
send the at least one modified dataset that includes the data and the at least partly altered contextual information about the data and the question to a plurality of worker devices associated with a plurality of workers in a crowdsourcing framework;
receive, from the plurality of worker devices, a plurality of answers to the question, the plurality of answers generated by the plurality of workers analyzing the at least one modified dataset that includes the data and the at least partly altered contextual information about the data in view of the question;
incorporate, into training data, the plurality of answers and information describing the at least one decontextualization operation; and
employ the training data in machine learning to train a decontextualizer to be used in subsequent data decontextualization to answer subsequent questions using the crowdsourcing framework.
1. A computer-implemented method, comprising:
accessing a question having a predetermined answer, the question being associated with operations of at least one computing system;
accessing at least one dataset associated with the question, the at least one dataset including data describing the operations of the at least one computing system and contextual information about the data;
selecting at least one decontextualization operation from a plurality of decontextualization operations, wherein the at least one decontextualization operation at least partly alters the contextual information about the data included in the at least one dataset;
applying the at least one decontextualization operation to the at least one dataset to determine at least one modified dataset in which the contextual information about the data included in the at least one dataset is at least partly altered;
sending the at least one modified dataset including the data and the altered contextual information and the question to a plurality of worker devices associated with a plurality of workers in a crowdsourcing framework;
receiving, from the plurality of worker devices, a plurality of answers to the question, the plurality of answers generated by the plurality of workers analyzing the at least one modified dataset having the contextual information about the data at least partly altered and the data in view of the question;
incorporating, into training data, the plurality of answers and information describing the at least one decontextualization operation; and
employing the training data in machine learning to train a decontextualizer to be used in subsequent data decontextualization to answer subsequent questions using the crowdsourcing framework.
2. The method of
an attribute of the at least one computing system;
a characteristic of the data included in the at least one dataset;
information regarding an organization associated with the at least one computing system; or
information regarding an individual associated with the at least one computing system.
4. The method of
6. The system of
the question is associated with operations of at least one computing system; and
the at least one dataset includes data describing the operations of the at least one computing system.
7. The system of
an attribute of the at least one computing system;
information regarding an organization associated with the at least one computing system; or
information regarding an individual associated with the at least one computing system.
8. The system of
a characteristic of data included in the at least one dataset; or
an identification of data included in the at least one dataset.
9. The system of
a rate of requests to a web site;
an amount of time spent processing the requests;
a delay time in processing the requests;
a rate of errors in processing the requests;
a size of requests to the web site; or
a size of responses to the requests.
11. The system of
12. The system of
13. The system of
15. The one or more computer-readable media of
the question is associated with operations of at least one computing system; and
the at least one dataset includes data describing the operations of the at least one computing system.
16. The one or more computer-readable media of
17. The one or more computer-readable media of
18. The one or more computer-readable media of
a characteristic of data included in the at least one dataset; or
an identification of data included in the at least one dataset.
19. The one or more computer-readable media of
modifying the question to at least partly alter the contextual information included in the question, prior to sending the question to the plurality of worker devices.
20. The one or more computer-readable media of
|
To identify errors, security breaches, or other anomalous behavior in computing systems, system administrators may implement automated checks. Such automated checks may be performed by processes that periodically execute to analyze performance data or other types of data captured during the operation of the computing system being checked. In many cases, the processes to implement automated checks may not be adequate to identify anomalous behavior in computing systems, particularly when the anomalous behavior is subtle or is indicated by patterns or correlations in the data.
Certain implementations and embodiments will now be described more fully below with reference to the accompanying figures, in which various aspects are shown. However, various aspects may be implemented in many different forms and should not be construed as limited to the implementations set forth herein. Like numbers refer to like elements throughout.
This disclosure describes implementations of systems, devices, methods, and computer-readable media for crowdsourcing the analysis of data associated with computing system operations, to identify errors, security problems, performance issues, or other types of anomalous behavior in the computing systems. In many scenarios, one or more sets of computing systems operations data may be monitored to identify anomalies in the performance of computing systems that may include, but are not limited to, computing devices, clusters of computing devices, communications networks, network infrastructure devices, data storage systems, and so forth. Although some such monitoring may be automated, the effectiveness of such automation may not achieve parity with the information processing and pattern recognition capabilities of the human brain. Accordingly, implementations provide for the distribution of one or more datasets to a plurality of workers in a crowdsourcing framework to enable the workers to answer one or more questions by analyzing the dataset(s).
Crowdsourcing may refer to a method for soliciting contributions of services, ideas, content, answers, or other information from a plurality of workers, where workers may include individuals or groups of individuals. Such contributions may be provided for free by the workers, without any value or consideration provided in return. Alternatively, the contributions may be provided in return for any type of value or consideration, including but not limited to monetary payment, products, or services. Contributions may also be provided in return for coupons or discounts for products or services, or in return for points, credits, or tokens that are redeemable for products, services, or money. In some cases, the contributions may be provided in return for recognition in the form of the publication, promotion, or enhancement of a worker's skills, knowledge, business, status, or credentials. Crowdsourcing may take place within a crowdsourcing framework in which one or more requestors pose a question to be answered by one or more workers based on an analysis or examination of one or more datasets. The requestors may include one or more individuals, devices, or processes that formulate a question to be answered through crowdsourcing. Within the crowdsourcing framework, the question and one or more datasets relevant to the question may be provided to the workers. The workers may formulate an answer to the question based on their analysis or examination of the provided datasets. The answers received from the workers may be correlated, analyzed, combined, or otherwise processed to determine a processed answer that is provided to the requestor(s).
In implementations, prior to sending the dataset(s) to the workers for analysis, the datasets may be at least partly decontextualized to remove or otherwise alter contextual information that may provide a context of the data being analyzed. Contextual information may include any information that identifies or otherwise describes practices, impacts, performance, products, services, personnel, customers, infrastructure, or other operational details regarding an organization such as a business. In some cases, contextual information may include sensitive, personal, or private information associated with an individual, a group of individuals, a corporation, or an organization. For example, in cases where the analyzed data is related to the performance or operations of computing systems for a business, contextual information may include but is not limited to one or more of the following: information describing corporate infrastructure or organization; computing system topology, architecture, capabilities, or identifications; communication network configurations; data storage structures; processes, modules, or applications executed by computing systems; geographic location of computing systems, support infrastructure, or personnel; and so forth. Contextual information may also include trends or projections related to financial performance or business performance. In some cases, contextual information may include any information that identifies a type of communication described by the data (e.g., a web page request or response, a sale transaction, a refund transaction, a user login, and so forth). Contextual information may also include data that indicates a source or destination of a communication described by the data. Contextual information may also include any information that at least partly enables inference, deduction, or reverse engineering of any of the types of contextual information described above.
For example, a dataset showing a rate of requests to a web site over time may be decontextualized to remove axis labels (e.g., request rate vs. time), descriptive text, or other indications that the data is associated with request rates or time. Decontextualization may also include altering a scale of the displayed data along one or more coordinate axes, inverting the data, normalizing the data to a baseline set of data, representing the data in alternate ways (e.g., as shapes, objects, colors, sounds, and so forth), combining multiple datasets, transforming the data, or otherwise obscuring or obfuscating an original source, meaning, or significance of the data. Decontextualizing may include removing, altering, obscuring, obfuscating, or hiding the contextual information included in the dataset(s) to be analyzed by workers. In some cases, the question to be answered by the workers may also be decontextualized in any of the ways that the data may be decontextualized. Because the workers may be members of the general public, outside of the business or organization whose operations are described by the data, the dataset(s) to be analyzed and the question to be answered may be decontextualized prior to sending the dataset(s) and the question to the workers. By decontextualizing the dataset(s) and the question, implementations may provide for the crowdsourced analysis of data while avoiding exposure of sensitive, personal, private, confidential, or business-related information to the public.
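As a purely illustrative sketch of the decontextualization operations listed above (removing labels, altering scale, inverting, and normalizing to a baseline), the following Python fragment may be considered. The `Dataset` container, the function names, and the choice to invert about the maximum value are assumptions made for illustration and are not drawn from the implementations described herein.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Dataset:
    """Hypothetical container: data values plus contextual information."""
    values: List[float]
    context: Dict[str, str] = field(default_factory=dict)  # e.g., axis labels


def strip_labels(ds: Dataset) -> Dataset:
    """Remove axis labels and descriptive text; the data is unchanged."""
    return Dataset(list(ds.values), {})


def rescale(ds: Dataset, factor: float) -> Dataset:
    """Alter the scale of the data along the value axis."""
    return Dataset([v * factor for v in ds.values], dict(ds.context))


def invert(ds: Dataset) -> Dataset:
    """Invert the data about its maximum, so peaks become troughs (assumed convention)."""
    top = max(ds.values)
    return Dataset([top - v for v in ds.values], dict(ds.context))


def normalize(ds: Dataset, baseline: List[float]) -> Dataset:
    """Express each value as a ratio to a baseline dataset of equal length."""
    if len(baseline) != len(ds.values):
        raise ValueError("baseline must match the dataset length")
    return Dataset([v / b for v, b in zip(ds.values, baseline)], dict(ds.context))


def decontextualize(ds: Dataset, operations: List[Callable[[Dataset], Dataset]]) -> Dataset:
    """Apply the selected operations in order and return the modified dataset."""
    for operation in operations:
        ds = operation(ds)
    return ds


# Example: a request-rate series with its labels removed, inverted, and rescaled.
requests = Dataset([120.0, 130.0, 480.0, 125.0],
                   {"x": "time", "y": "requests per second"})
hidden = decontextualize(requests, [strip_labels, invert, lambda d: rescale(d, 0.5)])
```

In this sketch each operation returns a new `Dataset` rather than modifying its input, so the original dataset remains available and the list of applied operations can be recorded separately.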
In some implementations, the requestor(s) associated with the requestor device(s) 102 may be system administrators, engineers, developers, testers, managers, or other personnel affiliated with a business or other organization that maintains and operates any number of computing systems. For example, the requestor(s) may be affiliated with an online business or other organization that maintains and operates computing systems to provide one or more web sites. The question(s) 104 may include questions related to the performance or operations of the computing systems that are maintained and operated by the business or other organization. For example, the question(s) 104 may include but are not limited to the following:
Are there any anomalies that appear in one or more datasets during a period of time? An anomaly may include data that is higher than, lower than, or otherwise outside a range that characterizes other data in the dataset relative to one or more coordinate axes. In this example question, and the other example questions described herein, the dataset(s) may include data describing one or more of the following: a rate of requests to a web site over time; a rate of logins to a web site over time; a latency or delay in processing requests or logins; a size of requests over time; a size of responses to requests over time; a distance from the request origin to a data center that processes the request; an error rate in processing the requests or the logins over time; a rate of orders, purchases, or refunds at a web site (e.g., an e-commerce web site) over time; an origin location of requests (e.g., determined via geolocation based on Internet Protocol (IP) address); and so forth.
Are there any anomalies that appear in one or more datasets during a period of time following a particular event? The particular event may include a user login, a request to a web site, an order submitted to a web site, a boot or start of a computing system, a failure or shutdown of a computing system, and so forth.
Is there a correlation or relationship between multiple datasets? The correlation may be a direct correlation, an inverse correlation, a linear correlation, a quadratic correlation, an exponential correlation, a logarithmic correlation, and so forth. In some cases, the multiple datasets may include different types of data. For example, workers may be asked to compare multiple datasets that may include the rate of requests, the rate of errors in processing requests, the average compute time or latency to process a request, and so forth, and identify any correlations exhibited between the datasets. In such cases, the datasets may be normalized to a substantially similar scale in at least one coordinate axis prior to the presentation of the datasets to the workers.
Is a particular dataset substantially similar to one or more other datasets? In this case, the other datasets may be datasets that have been previously characterized as known good datasets that exhibit expected, nominal, typical, or optimal behavior of the computing systems.
Are one or more datasets different in some way than other datasets included in a plurality of datasets? Such differences may include different trends in the dataset(s) or differing features of the dataset(s), such as maxima, minima, periodicity, or other features.
Do multiple datasets exhibit a similar pattern of behavior? In such cases, the similarity may include identical patterns of behavior, or patterns of behavior that are substantially similar.
Is a particular dataset characterized by periodic behavior? In such cases, the workers may be asked to estimate a periodicity or frequency of the data by identifying peaks or troughs in the data, by estimating a width of a repeating pattern in the data, or by identifying any other pattern in the data.
Is a dataset characterized by a curve? In this case, the workers may be asked to identify a curve that substantially fits the dataset. For example, the workers may be asked to choose a particular type of curve that substantially fits the dataset, such as a linear, polynomial, exponential, logarithmic, or hyperbolic curve, and so forth, and to choose one or more parameters that characterize the curve. The workers may also be asked to draw a curve or compose a spline curve that substantially fits the dataset. Other types of questions are also supported by implementations.
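One of the example questions above asks workers to estimate a periodicity by identifying peaks in the data. The arithmetic behind such an estimate may be sketched as follows, for instance to compute a predetermined answer for a question. The function names and the definition of a peak as a strict local maximum are assumptions made for illustration only.

```python
from typing import List, Optional


def find_peaks(values: List[float]) -> List[int]:
    """Return indices of strict local maxima (higher than both neighbors)."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


def estimate_period(values: List[float]) -> Optional[float]:
    """Average spacing between consecutive peaks, in samples.

    Returns None when fewer than two peaks are found, since no spacing
    can be measured in that case.
    """
    peaks = find_peaks(values)
    if len(peaks) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(peaks, peaks[1:])]
    return sum(gaps) / len(gaps)
```

For a series such as `[0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0]`, the peaks fall at indices 1, 5, and 9, so the estimated period is 4 samples. Noisy data would likely need smoothing before such a simple peak test is meaningful.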
The requestor device(s) 102 may communicate with one or more server device(s) 106 over one or more networks. The server device(s) 106 may include any type of computing device, including but not limited to a server computer, personal computer, network computer, cloud computing or distributed computing device, any of the types of computing devices described with reference to the requestor device(s) 102, or other types of computing devices. An example of the server device(s) 106 is described further with reference to
The server device(s) 106 may include various types of server devices 106. In implementations illustrated by
The server device(s) 106(1) may execute a crowdsourcing module 108, operations of which are described further with reference to
The server device(s) 106(2) may access data storage 110, which may store one or more dataset(s) 112. A decontextualization module 114 executing on the server device(s) 106(2) may retrieve or otherwise access the one or more datasets 112 that are relevant to the question(s) 104. In cases where the question(s) 104 are related to the performance or operations of one or more computing systems or networks that are maintained or operated by the requestor(s), the dataset(s) 112 may include data related to the performance or operations of the computing systems or networks.
The data storage 110 may comprise any number of data storage systems that employ any type of data storage technology, including relational databases, non-relational databases, or both relational and non-relational databases. Although the data storage 110 is depicted as external to the server device(s) 106, implementations are not so limited. In some implementations, the data storage 110 may be at least partly incorporated into the server device(s) 106 as local storage.
In some implementations, the decontextualization module 114 may perform operations to decontextualize the dataset(s) 112 and generate decontextualized dataset(s) 116. In some cases, the decontextualization module 114 may also perform operations to decontextualize the question(s) 104 and generate decontextualized question(s) 118. The decontextualization is described further with reference to
In some implementations, the server device(s) 106(1) may execute an answer processing module 120 to perform operations to process the answers received from the workers. The processing of the answer(s) is described further with reference to
The server device(s) 106(1) may send the decontextualized dataset(s) 116 to one or more worker devices 122. In some cases, the decontextualized question(s) 118 may also be sent to the worker device(s) 122. Alternatively, in cases where the question(s) 104 may not include contextual information, the question(s) 104 may be sent to the worker device(s) 122 without having been decontextualized. The worker device(s) 122 may be owned by, operated by, or otherwise associated with one or more workers 124. The worker(s) 124 may include any number of individuals or groups of individuals. The worker device(s) 122 may include any type of computing device, including but not limited to those types of devices described above with reference to the requestor device(s) 102 and the server device(s) 106. An example of the worker device(s) 122 is described further with reference to
In some implementations, the worker device(s) 122 execute a data analysis module 126. The data analysis module 126 may receive decontextualized question(s) 118 to be answered based on analysis or examination of the decontextualized dataset(s) 116. In some implementations, the data analysis module 126 may include a data analysis user interface 128 that presents the decontextualized question(s) 118 and the decontextualized dataset(s) 116 to the worker(s) 124. The data analysis user interface 128 may enable the worker(s) 124 to analyze or otherwise examine the decontextualized dataset(s) 116 in view of the decontextualized question(s) 118, and may enable the worker(s) 124 to formulate one or more answers 130 to the decontextualized question(s) 118. The answers 130 may include binary answers (e.g., yes or no, 1 or 0, and so forth), multiple choice answers (e.g., choose from A, B, C, or D, and so forth), or other types of answers. An example of the data analysis user interface 128 is described further with reference to
The answer(s) 130 may be sent from the worker device(s) 122 to the server device(s) 106(1), and may be received by the crowdsourcing module 108. The crowdsourcing module 108, or the answer processing module 120, may analyze, correlate, combine, or otherwise process the answers to determine one or more processed answers 132. The processed answer(s) 132 may then be sent to the requestor device(s) 102 or otherwise provided to the requestors. The requestors may then examine the processed answer(s) 132 and determine if further investigation of the question(s) 104 is merited. Such further investigation may be performed manually by the requestors, or may be performed through further crowdsourcing to answer one or more other questions 104.
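The combining of answers into a processed answer may take many forms. A minimal sketch, assuming a simple majority vote over binary or multiple choice answers, is shown below. The function name `process_answers` and the returned agreement fraction are hypothetical and are not part of the described answer processing module 120.

```python
from collections import Counter
from typing import Hashable, List, Tuple


def process_answers(answers: List[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common answer and the fraction of workers who gave it.

    When two answers are tied, Counter.most_common returns the one that
    was encountered first in the input list.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Example: four workers answer a yes/no question.
processed, agreement = process_answers(["yes", "yes", "no", "yes"])
```

The agreement fraction gives the requestor a rough indication of how strongly the workers concurred, which may inform whether further investigation is merited.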
In some cases, the questions 104 may include control questions, or placebo questions, for which a correct answer is predetermined by the requestors. Such control questions may be incorporated into the processes described herein, as a check on or a measure of the accuracy of the workers 124 in answering questions. Implementations that employ control questions are described further with reference to
Although
The various devices of the environment 100 may communicate with one another using one or more networks. Such networks may include public networks such as the Internet, private networks such as an institutional or personal intranet, or some combination of private and public networks. The networks may include any type of wired or wireless network, including but not limited to local area networks (LANs), wide area networks (WANs), wireless WANs (WWANs), wireless LANs (WLANs), mobile communications networks (e.g., 3G, 4G, etc.), and so forth. In some implementations, communications between the various devices in the environment 100 may be encrypted or otherwise secured. For example, such communications may employ one or more public or private cryptographic keys, digital certificates, or other credentials supported by a security protocol such as any version of the Secure Sockets Layer (SSL) or the Transport Layer Security (TLS) protocol.
The server device 106 may include one or more input/output (I/O) devices 204. The I/O device(s) 204 may include input devices such as a keyboard, a mouse, a pen, a game controller, a touch input device, an audio input device (e.g., a microphone), a gestural input device, a haptic input device, an image or video capture device (e.g., a camera), or other devices. In some cases, the I/O device(s) 204 may also include output devices such as a display, an audio output device (e.g., a speaker), a printer, a haptic output device, and so forth. The I/O device(s) 204 may be physically incorporated with the server device 106, or may be externally placed.
The server device 106 may include one or more I/O interfaces 206 to enable components or modules of the server device 106 to control, interface with, or otherwise communicate with the I/O device(s) 204. The I/O interface(s) 206 may enable information to be transferred in or out of the server device 106, or between components of the server device 106, through serial communication, parallel communication, or other types of communication. For example, the I/O interface(s) 206 may comply with a version of the RS-232 standard for serial ports, or with a version of the Institute of Electrical and Electronics Engineers (IEEE) 1284 standard for parallel ports. As another example, the I/O interface(s) 206 may be configured to provide a connection over Universal Serial Bus (USB) or Ethernet. In some cases, the I/O interface(s) 206 may be configured to provide a serial connection that is compliant with a version of the IEEE 1394 standard. The server device 106 may also include one or more busses or other internal communications hardware or software that allow for the transfer of data between the various modules and components of the server device 106.
The server device 106 may include one or more network interfaces 208 that enable communications between the server device 106 and other networked devices, such as the requestor device(s) 102, the worker device(s) 122, or the data storage 110. The network interface(s) 208 may include one or more network interface controllers (NICs) or other types of transceiver devices configured to send and receive communications over a network.
The server device 106 may include one or more memories, described herein as memory 210. The memory 210 comprises one or more computer-readable storage media (CRSM). The CRSM may include one or more of an electronic storage medium, a magnetic storage medium, an optical storage medium, a quantum storage medium, a mechanical computer storage medium, and so forth. The memory 210 provides storage of computer-readable instructions that may describe data structures, program modules, processes, or applications, and other data for the operation of the server device 106.
The memory 210 may include an operating system (OS) module 212. The OS module 212 is configured to manage hardware resources such as the I/O device(s) 204, the I/O interface(s) 206, and the network interface(s) 208, and to provide various services to applications, processes, or modules executing on the processor(s) 202. The OS module 212 may include one or more of the following: any version of the Linux® operating system originally released by Linus Torvalds; any version of iOS® from Apple Corp.® of Cupertino, Calif., USA; any version of Windows® or Windows Mobile® from Microsoft Corp.® of Redmond, Wash., USA; any version of Android® from Google Corp.® of Mountain View, Calif., USA and its derivatives from various sources; any version of Palm OS® from Palm Computing, Inc.® of Sunnyvale, Calif., USA and its derivatives from various sources; any version of BlackBerry OS® from Research In Motion Ltd.® of Waterloo, Ontario, Canada; any version of VxWorks® from Wind River Systems® of Alameda, Calif., USA; or other operating systems.
The memory 210 may include the crowdsourcing module 108, the decontextualization module 114, and the answer processing module 120 as described above with reference to
The memory 210 may include data storage 218 to store data for operations of the server device 106. The data storage 218 may comprise a database, array, structured list, tree, or other data structure, and may be a relational or a non-relational datastore. The data storage 218 may store one or more of the following: the question(s) 104, the dataset(s) 112 (e.g., prior to decontextualization), the decontextualized dataset(s) 116, the decontextualized question(s) 118, the answer(s) 130, or the processed answer(s) 132. In some implementations, the data storage 218 may store worker description data 220. The worker description data 220 may include information regarding qualifications, skills, credentials, or other characteristics of one or more workers 124 for use by the worker selection module 214 in determining which workers 124 to employ for analyzing the decontextualized dataset(s) 116. The worker description data 220 may also store results of the evaluation question(s) that may be employed to determine worker suitability as described above.
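By way of illustration only, the use of worker description data to determine which workers to employ may be sketched as a filter over stored worker records. The `WorkerDescription` fields, the `select_workers` function, and the score threshold below are assumptions and do not describe the actual worker selection module 214.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class WorkerDescription:
    """Hypothetical record of a worker's characteristics."""
    worker_id: str
    skills: Set[str] = field(default_factory=set)
    # Assumed: fraction of evaluation questions the worker answered correctly.
    evaluation_score: float = 0.0


def select_workers(workers: List[WorkerDescription],
                   required_skills: Set[str],
                   min_score: float) -> List[str]:
    """Return IDs of workers who have every required skill and meet the score threshold."""
    return [
        w.worker_id for w in workers
        if required_skills <= w.skills and w.evaluation_score >= min_score
    ]


# Example: only workers skilled in curve comparison with a score of 0.8 or more.
pool = [
    WorkerDescription("w1", {"curve comparison", "anomaly spotting"}, 0.9),
    WorkerDescription("w2", {"curve comparison"}, 0.6),
    WorkerDescription("w3", {"anomaly spotting"}, 0.95),
]
chosen = select_workers(pool, {"curve comparison"}, 0.8)
```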
The data storage 218 may also store other data 222, such as user authentication information or access control data. In some cases, the other data 222 may include training data that is employed to train a decontextualizer, as described further with reference to
The worker device 122 may include one or more memories, described herein as memory 310. The memory 310 comprises one or more CRSM. The CRSM may include one or more of an electronic storage medium, a magnetic storage medium, an optical storage medium, a quantum storage medium, a mechanical computer storage medium, and so forth. The memory 310 provides storage of computer-readable instructions, data structures, program modules, and other data for the operation of the worker device 122. The memory 310 may include an OS module 312. The OS module 312 is configured to manage hardware resources such as the I/O device(s) 304, the I/O interface(s) 306, and the network interface(s) 308, and to provide various services to applications, processes, or modules executing on the processor(s) 302. The OS module 312 may include one or more of the operating systems described above with reference to OS module 212.
The memory 310 may include the data analysis module 126 and the data analysis user interface 128 as described above with reference to
The memory 310 may include data storage 316 to store data for operations of the worker device 122. The data storage 316 may comprise a database, array, structured list, tree, or other data structure, and may be a relational or a non-relational datastore. The data storage 316 may store one or more of the decontextualized dataset(s) 116, the decontextualized question(s) 118, or the answer(s) 130. In some cases, where the question(s) 104 are sent to the worker device(s) 122 without having been decontextualized, the data storage 316 may store the question(s) 104. The data storage 316 may also store other data 318, such as user authentication information or access control data. In some implementations, at least a portion of the information stored in the data storage 316 may be stored externally to the worker device 122, on other devices that are accessible to the worker device 122 via the I/O interface(s) 306 or via the network interface(s) 308.
In cases where the dataset(s) 112 are decontextualized through a normalization operation as shown in the example of
Moreover, implementations also support non-visual presentations of the decontextualized dataset 116 to the workers 124. For example, the decontextualization operation(s) 402 may include transforming the data of the dataset 112 into audio data such as any number of simultaneous or non-simultaneous sequences of tones that audibly describe one or more characteristics of the data. Such audio data is represented in
In cases where the dataset(s) 112 are decontextualized through a normalization operation as shown in the example of
The various examples of the decontextualization operation(s) 402 illustrated in
As shown in the example, the data analysis user interface 128 may present the decontextualized question 118 to the worker 124. In this example, the worker 124 is asked to determine which of three curves is “most similar” to a curve (e.g., “curve 1”) that is a depiction of the decontextualized dataset 116. The data analysis user interface 128 displays three comparison datasets 902 showing three comparison curves. Alternatively, multiple decontextualized dataset(s) 116 may be presented to the worker 124, and the worker 124 may be asked which of the decontextualized dataset(s) 116 (e.g., dataset A or dataset B) is more similar to one or more other datasets. In such cases, at least one of the decontextualized dataset(s) 116 presented may be a control dataset and the question 104 may be a control question for which the control dataset is the predetermined, correct answer.
The data analysis user interface 128 also includes one or more answer input controls 904, enabling the worker 124 to input his or her answer 130 to the decontextualized question 118. In this example, the data analysis user interface 128 includes one or more send answer controls 906, enabling the worker 124 to send the answer 130 to the server device(s) 106(1). The data analysis user interface 128 may also include one or more request question controls 908, enabling the worker 124 to request another decontextualized question 118 to answer.
Although
At 1102, the question 104 is received from at least one requestor, the question 104 to be answered at least partly through a crowdsourcing framework such as that described above. In some implementations, the question 104 may be received from one or more requestor devices 102 associated with the at least one requestor. In some cases, the question 104 may be associated with or related to operations of one or more computing systems which may be maintained, operated, or monitored by the requestor(s).
At 1104, at least one dataset 112 associated with or relevant to the question 104 may be accessed from the data storage 110 or elsewhere. In some cases, the dataset(s) 112 may include data that describes the performance or operations of the computing system(s) that are operated, maintained, or monitored by the requestor(s).
At 1106, the dataset(s) 112 may be modified to generate or otherwise determine one or more modified datasets. Such modification may include a decontextualization of the data included in the dataset(s) 112 to generate the decontextualized dataset(s) 116 as described above. The decontextualization is described further with reference to
At 1108, the question 104 may be modified to generate or otherwise determine a modified question. Such modification may include a decontextualization of the information included in the question 104 to generate the decontextualized question 118 as described above. The decontextualization of the question 104 may proceed similarly to the decontextualization of the dataset(s) 112, such that contextual information in the question 104 may be removed, obfuscated, or otherwise altered.
In some cases, the question 104 may be further modified based on one or more particular decontextualization operations employed to decontextualize the dataset(s) 112. For example, in cases where the decontextualization includes inverting a dataset 112 (e.g., as shown in
At 1110, the modified dataset(s) and the modified question generated at 1106 and 1108 may be sent to one or more worker devices 122 associated with one or more workers 124 in the crowdsourcing framework. The data analysis module 126 and the data analysis user interface 128 may then enable the worker(s) 124 to formulate the answer(s) 130 to the question as described above. At 1112, the answer(s) 130 are received from the worker device(s) 122.
At 1114, the answer(s) 130 may be combined, correlated, analyzed, or otherwise processed to determine the processed answer 132 to the question. Such processing of the answer(s) 130 to determine the processed answer 132 is described further with reference to
At 1202, the dataset(s) 112 may be modified to remove any descriptions, labels, titles, or other information that may provide a context for the data included in the dataset(s) 112, as described above with reference to
At 1204, the dataset(s) 112 may be altered to modify at least one scale of at least one coordinate axis, vector component, data component, or unit of measure associated with the data included in the dataset(s) 112. In some implementations, this may proceed as described above with reference to
At 1206, the dataset(s) 112 may be modified to invert the data included in the dataset(s) 112 with respect to at least one coordinate axis or data component. In some implementations, this may proceed as described above with reference to
At 1208, the dataset(s) 112 may be altered to normalize or otherwise modify the data included in the dataset(s) 112 relative to at least one baseline dataset. As described above with reference to
At 1210, the dataset(s) 112 may be modified to at least partly represent the data included in the dataset(s) 112 as shapes, objects, color, designs, sounds, or other types of representations. In some implementations, this may proceed as described above with reference to
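The operations at 1202 through 1208 can be illustrated with a minimal sketch. The dictionary layout of the dataset and the function names `strip_context`, `rescale`, `invert`, and `normalize` are assumptions made for illustration only; the description does not prescribe a data format or any particular implementation of these operations.

```python
from typing import Dict, List

# Hypothetical dataset layout: a dict with optional contextual keys
# ("title", "labels", ...) and a "values" list holding the data itself.
Dataset = Dict[str, object]


def strip_context(ds: Dataset) -> Dataset:
    """1202: drop titles, labels, and descriptions; keep only the data."""
    return {"values": list(ds["values"])}


def rescale(ds: Dataset, factor: float) -> Dataset:
    """1204: modify the scale of the data component by a constant factor."""
    return {**ds, "values": [v * factor for v in ds["values"]]}


def invert(ds: Dataset) -> Dataset:
    """1206: invert the data so that peaks become troughs."""
    top = max(ds["values"])
    return {**ds, "values": [top - v for v in ds["values"]]}


def normalize(ds: Dataset, baseline: List[float]) -> Dataset:
    """1208: express each value relative to a baseline of equal length."""
    return {**ds, "values": [v - b for v, b in zip(ds["values"], baseline)]}


raw = {"title": "CPU load, host db-7", "values": [1.0, 3.0, 2.0]}
modified = invert(rescale(strip_context(raw), 2.0))
```

Each function returns a new dataset, so operations can be chained in any order and the original dataset 112 is left unmodified.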
At 1302, the answer(s) 130 may be received from the worker devices 122 as described above. At 1304, a determination may be made that at least some of the answer(s) 130 are a common answer, and that the answer(s) 130 that are the common answer form a larger subset of the answer(s) 130 than any other subset of matching answers. For example, ten answers may be received from the workers. Four of the answers may be 42, three of the answers may be 44, two of the answers may be 40, and one of the answers may be 45. In this case, 42 may be designated as the common answer, given that it is the most frequently occurring answer among the answers 130. In some implementations, the common answer may be an identical answer among the answers 130. Alternatively, the common answer may be a substantially similar answer that is within a threshold range of being the same answer. Continuing with the above example, the answers 42.1, 41.8, and 42.4 may be determined to be similar enough to 42 such that the four answers are determined to share a common answer.
At 1306, a determination may be made whether the question 104 is associated with a threshold proportion. In some cases, the requestor(s) may specify one or more parameters or criteria, such as a threshold proportion, when specifying the question 104. If the question 104 is not associated with a threshold proportion, the process may proceed to 1308 and designate the common answer determined at 1304 as the processed answer 132 that is sent to the requestor(s).
If the question 104 is associated with a threshold proportion, the process may proceed to 1310. At 1310, a determination is made whether the common answer is shared by at least the threshold proportion of the answers 130. If not, the process may proceed to 1312, where the requestor(s) may be notified that no sufficiently common or consensus answer was reached by the workers 124. The requestor(s) may then choose to reformulate the question 104, adjust the method by which the dataset(s) 112 are decontextualized, or otherwise modify the way in which the question 104 or the dataset(s) 112 are presented to the workers.
If a determination is made at 1310 that the common answer is shared by at least the threshold proportion of the answers 130, the process may proceed to 1314. At 1314, the common answer determined at 1304 may be designated as the processed answer 132 to be provided to the requestor(s).
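The flow from 1304 through 1314 can be sketched as follows. This is an illustrative sketch only: the function names `common_answer` and `processed_answer`, the numeric answer type, and the `tolerance` parameter used to model the "threshold range" are assumptions, and returning `None` stands in for notifying the requestor(s) at 1312.

```python
from typing import List, Optional, Tuple


def common_answer(answers: List[float], tolerance: float = 0.0) -> Tuple[float, int]:
    """1304: find the answer with the most other answers within `tolerance`.

    Ties are broken in favor of the answer that appears first.
    """
    best_value, best_count = answers[0], 0
    for candidate in answers:
        count = sum(1 for a in answers if abs(a - candidate) <= tolerance)
        if count > best_count:
            best_value, best_count = candidate, count
    return best_value, best_count


def processed_answer(
    answers: List[float],
    threshold: Optional[float] = None,
    tolerance: float = 0.0,
) -> Optional[float]:
    """1306-1314: return the common answer, or None if no consensus."""
    if not answers:
        return None
    value, count = common_answer(answers, tolerance)
    if threshold is None:
        return value            # 1308: no threshold proportion specified
    if count / len(answers) >= threshold:
        return value            # 1314: consensus reached
    return None                 # 1312: no sufficiently common answer


# The ten answers from the example above.
answers = [42, 42, 42, 42, 44, 44, 44, 40, 40, 45]
no_threshold = processed_answer(answers)
majority_required = processed_answer(answers, threshold=0.5)
```

With the ten example answers, 42 is shared by four of ten answers, so it is returned when no threshold proportion is specified but not when a proportion of one half is required.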
In some implementations, the particular decontextualization operation(s) employed to generate one or both of the decontextualized dataset(s) 116 and the decontextualized question(s) 118 may be based on the operations of a decontextualizer or another machine (e.g., software module) that is trained through one or more machine learning operations. In some implementations, the decontextualizer may comprise a model, a set of decontextualization operations, or a set of decontextualization transforms that have been determined as suitable (e.g., refined, or optimal) for decontextualizing the dataset(s) 112 to be sent for analysis by the workers 124.
Although not shown in
The question(s) 104 may be provided to the decontextualization module 114, which may determine one or more decontextualization operations to apply to the dataset(s) 112 and the question(s) 104. The decontextualization operation(s) may include, but are not limited to, one or more of the operations described above with reference to
The crowdsourcing module 108 may send the decontextualized dataset(s) 116 and the decontextualized question(s) 118 to the worker device(s) 122, and receive the answer(s) 130 from the worker device(s) 122. The answer(s) 130 may then be provided to the decontextualization module 114. In some implementations, the answer(s) 130 may be stored in training data in memory on the server device(s) 106(2). In some cases, information describing the particular decontextualization operation(s) employed may also be stored in the training data and associated with the answer(s) 130 that were generated based on the particular decontextualization operation(s) applied to the dataset(s) 112 analyzed by the workers 124.
A learning module 1402 executing on the server device(s) 106(2) may employ the training data in one or more machine learning operations to train a machine such as a decontextualizer 1404. In some cases, multiple iterations 1406 may be performed. Each iteration may include the application of one or more selected decontextualization operations to the dataset(s) 112 and the question(s) 104, and the receipt of the answer(s) 130 generated by the workers 124. The results of each such iteration may be incorporated into the training data. During the learning operations, the learning module 1402 may receive the answer(s) 130 from the workers 124 during multiple iterations, and may compare the received answer(s) 130 to the predetermined answer for the control question. In some cases, the learning module 1402 may apply a greater weight to one or more decontextualization operation(s) that generated those answer(s) 130 that are substantially similar to the predetermined answer to the control question. In this way, machine learning operations may train the decontextualizer 1404 to preferentially apply those decontextualization operation(s) that have been demonstrated to generate more accurate answer(s) 130.
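One simple way to picture the weighting described above is to score each decontextualization operation by the fraction of its answers that match the predetermined answer to the control question. The sketch below is a stand-in under that assumption, not the learning module 1402 itself; the function name `operation_weights`, the record layout, and the `tolerance` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def operation_weights(
    records: Iterable[Tuple[str, float]],
    control_answer: float,
    tolerance: float = 0.0,
) -> Dict[str, float]:
    """Weight each operation by the share of its answers near the control answer.

    `records` pairs the name of the operation applied with one worker answer.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for operation, answer in records:
        totals[operation] += 1
        if abs(answer - control_answer) <= tolerance:
            hits[operation] += 1
    return {op: hits[op] / totals[op] for op in totals}


records = [
    ("invert", 42), ("invert", 42), ("invert", 40),
    ("rescale", 44), ("rescale", 42),
]
weights = operation_weights(records, control_answer=42)
preferred = max(weights, key=weights.get)
```

An operation with a higher weight produced answers closer to the control answer more often, so a decontextualizer trained on such weights would tend to prefer it.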
Having trained the decontextualizer 1404 using machine learning, the decontextualization module 114 (or other software) may then employ the decontextualizer 1404 to decontextualize the dataset(s) 112 to be sent to workers in answering the question(s) 104 for which an answer is not predetermined, as described above with reference to
The decontextualized dataset(s) 116 may be decontextualized using one or more different decontextualization operations during each of the multiple iterations 1406. In some cases, different decontextualization operations may be employed during a same iteration. In some implementations, different worker devices 122 may receive differently decontextualized datasets 116 during a same iteration or during different iterations. In some implementations, a same set of worker devices 122 may be employed during multiple iterations. Alternatively, different worker devices 122 may be employed during different iterations.
At 1502, the question 104 may be received. As described above with reference to
At 1504, one or more datasets 112 may be accessed, the dataset(s) 112 being associated with the control question. The dataset(s) may be accessed as described with reference to 1104.
At 1506, one or more decontextualization operations may be selected or otherwise determined. As described above, the decontextualization operation(s) may be selected using a random or a pseudo-random selection process. The decontextualization operation(s) may be selected using a simulated annealing technique to identify one or more (e.g., substantially optimal) decontextualization operation(s) from a plurality of available decontextualization operations. The decontextualization operation(s) may be selected using an exhaustive enumeration method that traverses all or a portion of a collection of available decontextualization operation(s) over multiple iterations of the process. One or more fitness functions may also be employed to identify the decontextualization operation(s) to be employed. In some cases, the selection of the decontextualization operation(s) may be determined at least partly by the requestor(s) or other personnel.
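Three of the selection strategies named at 1506 (pseudo-random selection, exhaustive enumeration, and selection by a fitness function) can be sketched briefly. The operation names in `OPERATIONS` and the function names are hypothetical, and simulated annealing is omitted for brevity.

```python
import itertools
import random
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical names for the available decontextualization operations.
OPERATIONS = ["strip_labels", "rescale", "invert", "normalize"]

Selection = Tuple[str, ...]


def random_selection(rng: random.Random, k: int) -> Selection:
    """Pseudo-random choice of k distinct operations."""
    return tuple(rng.sample(OPERATIONS, k))


def exhaustive_selections(max_size: int) -> Iterator[Selection]:
    """Enumerate every combination of up to max_size operations."""
    for size in range(1, max_size + 1):
        yield from itertools.combinations(OPERATIONS, size)


def best_by_fitness(
    candidates: Iterable[Selection],
    fitness: Callable[[Selection], float],
) -> Selection:
    """Pick the candidate selection that maximizes a fitness function."""
    return max(candidates, key=fitness)


picked = random_selection(random.Random(0), 2)
all_selections = list(exhaustive_selections(2))
```

Seeding the generator makes a pseudo-random selection reproducible across iterations, which can help when comparing the answers produced by different selections.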
At 1508, the one or more datasets 112 accessed at 1504 may be decontextualized based on the application of the one or more decontextualization operations determined at 1506. In some cases, the question 104 may also be decontextualized at 1508.
At 1510, the decontextualized dataset(s) 116 and the question 104 (in some cases the decontextualized question 118) may be sent to one or more worker devices 122. In some implementations, this information may be sent to the worker device(s) 122 through the server device(s) 106(1) as described above.
At 1512, the answer(s) 130 may be received. As described above, the answer(s) 130 may be generated by one or more workers 124 operating the worker device(s) 122, and analyzing the question 104 (or the decontextualized question 118) in view of the decontextualized dataset(s) 116.
At 1514, the answer(s) 130 and information describing the decontextualization operation(s) employed during this iteration may be stored or otherwise incorporated into the training data.
At 1516, a determination is made whether there are any additional decontextualization operation(s) to be employed to generate additional training data (e.g., whether additional iterations are to be performed). In some cases, as described above, the decontextualization operation(s) may be selected from a list. If there are additional operations available for selection from the list, the determination may be made that there are additional decontextualization operation(s) that may be applied. In such cases, the process may return to 1506 and perform another iteration of the process to generate additional training data based on one or more different decontextualization operations. If there are no additional decontextualization operations to be applied, then the process may proceed to 1518. At 1518, the training data accumulated over the multiple iterations 1406 may be employed in one or more machine learning operations to train the decontextualizer 1404.
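The iteration from 1506 through 1516 can be summarized as a loop that applies each selected operation, gathers the workers' answers, and records both in the training data. In this sketch the callables `apply_operation` and `ask_workers` are hypothetical stand-ins for the decontextualization module 114 and the round trip to the worker device(s) 122; the record layout is likewise an assumption.

```python
from typing import Callable, Dict, List, Sequence


def collect_training_data(
    dataset: List[float],
    operations: Sequence[str],
    apply_operation: Callable[[str, List[float]], object],
    ask_workers: Callable[[object], List[float]],
) -> List[Dict[str, object]]:
    """Run one iteration per operation and accumulate training records."""
    training_data: List[Dict[str, object]] = []
    for operation in operations:                      # 1506 and 1516
        modified = apply_operation(operation, dataset)  # 1508
        answers = ask_workers(modified)               # 1510 and 1512
        training_data.append(                         # 1514
            {"operation": operation, "answers": answers}
        )
    return training_data                              # input to 1518


# Stubbed usage: the lambdas below only imitate the real components.
data = collect_training_data(
    [1.0, 2.0],
    ["invert", "rescale"],
    apply_operation=lambda op, ds: (op, ds),
    ask_workers=lambda modified: [42.0, 44.0],
)
```

The loop ends when the list of operations is exhausted, which corresponds to the determination at 1516 that no additional decontextualization operations remain.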
Implementations support any number and type of machine learning techniques to train the decontextualizer 1404. In some cases, supervised machine learning may be employed, using training data that is labeled based on a degree to which the answer(s) 130 approximate the predetermined answer to the control question. Implementations may also employ unsupervised or semi-supervised machine learning techniques in which labeled data may be unavailable or partly available during the training of the decontextualizer 1404. Implementations support machine learning algorithms that may include, but are not limited to, one or more of the following: artificial neural networks, inductive logic programming, support vector machines (SVMs), clustering, Bayesian networks, decision tree learning, association rule learning, reinforcement learning, representation learning, similarity learning, metric learning, sparse dictionary learning, simulated annealing methods, fitness functions, identification of local or global maxima or minima, and so forth.
Following its training, the decontextualizer 1404 may describe a plurality of decontextualization operations that have been shown to generate substantially accurate answers from the workers 124 analyzing the question(s) 104 in the crowdsourcing framework. Accordingly, the decontextualizer 1404 may be employed in decontextualization operations to decontextualize the dataset(s) 112 that are sent to the workers 124 in answering the question(s) 104 for which an answer is not predetermined, as described above with reference to
Those having ordinary skill in the art will readily recognize that certain steps or operations illustrated in the figures above can be eliminated, combined, subdivided, executed in parallel, or taken in an alternate order. Moreover, the methods described above may be implemented as one or more software programs for a computer system and may be encoded in one or more computer-readable storage media as instructions executable on one or more processors.
Separate instances of these programs may be executed on or distributed across separate computer systems. Thus, although certain steps have been described as being performed by certain devices, software programs, processes, or entities, this need not be the case and a variety of alternative implementations will be understood by those having ordinary skill in the art.
Additionally, those having ordinary skill in the art will readily recognize that the techniques described above can be utilized in a variety of devices, environments, and situations. For example, although the examples herein describe the use of a crowdsourcing framework to analyze data associated with the performance or operations of computing systems, implementations are not so limited. Implementations may also enable the crowdsourced analysis of other types of data following their decontextualization. Such other types of data may include but are not limited to one or more of the following: scientific data; engineering or architectural design data; polling data associated with political campaigns or other types of public opinion data; data related to the operation of search algorithms; demographic data; and so forth. Although the present disclosure is written with respect to specific embodiments and implementations, various changes and modifications may be suggested to one skilled in the art, and it is intended that the present disclosure encompass such changes and modifications that fall within the scope of the appended claims.
Brezinski, Dominique Imjya, McClintock, Jon Arron, Stathakopoulos, George Nikolaos
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Sep 16 2013 | Amazon Technologies, Inc. | (assignment on the face of the patent) | / | |||
Sep 25 2013 | MCCLINTOCK, JON ARRON | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 034110 | /0939 | |
Oct 08 2013 | STATHAKOPOULOS, GEORGE NIKOLAOS | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 034110 | /0939 | |
Sep 16 2014 | BREZINSKI, DOMINIQUE IMJYA | Amazon Technologies, Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 034110 | /0939 |