In one example, a method performed by a compiler comprises: receiving a dataflow graph of a neural network, the neural network comprising a neural network operator; receiving information of computation resources and memory resources of a neural network hardware accelerator intended to execute the neural network operator; determining, based on the dataflow graph, iterations of an operation on elements of a tensor included in the neural network operator; determining, based on the information, a mapping between the elements of the tensor and addresses in a portion of a local memory of the neural network hardware accelerator, and a number of the iterations of the operation to be included in a batch, wherein the number of the iterations in the batch are to be executed in parallel by the neural network hardware accelerator; and generating a schedule of execution of the batches of the iterations of the operation.
6. A method, the method being performed by a compiler and comprising:
receiving a dataflow graph of a neural network, the neural network comprising a neural network operator;
receiving first information of computation resources of a neural network hardware accelerator intended to execute the neural network operator;
receiving second information of a portion of a local memory of the neural network hardware accelerator intended to execute the neural network operator;
determining, based on the dataflow graph, iterations of an operation on elements of a tensor included in the neural network operator;
determining, based on the first information and the second information, a mapping between the elements of the tensor and addresses in the portion of the local memory, and a number of the iterations of the operation to be included in a batch, wherein the number of the iterations in the batch are to be executed in parallel by the neural network hardware accelerator;
generating one or more loop representations of the iterations of the operation, the loop representations including:
one or more loops that increment one or more induction variables; and
an instruction in a body of the loop to perform the operation on an element of the tensor, the element being associated with an index determined by the one or more induction variables;
generating a schedule of execution of the batches of the iterations of the operations; and
generating executable instructions to be executed by the neural network hardware accelerator based on the schedule of execution.
19. A non-transitory computer-readable medium having stored therein instructions that, when executed by one or more processors, cause the one or more processors to execute a compiler, the compiler performing operations including:
receiving a dataflow graph of a neural network, the neural network comprising a neural network operator;
receiving first information of computation resources of a neural network hardware accelerator intended to execute the neural network operator;
receiving second information of a portion of a local memory of the neural network hardware accelerator intended to execute the neural network operator;
determining, based on the dataflow graph, iterations of an operation on elements of a tensor included in the neural network operator;
determining, based on the first information and the second information, a mapping between the elements of the tensor and addresses in the portion of the local memory, and a number of the iterations of the operation to be included in a batch, the number of the iterations in the batch to be executed in parallel by the neural network hardware accelerator;
generating one or more loop representations of the iterations of the operation, the loop representations including:
one or more loops that increment one or more induction variables; and
an instruction in a body of the loop to perform the operation on an element of the tensor, the element being associated with an index determined by the one or more induction variables;
generating a schedule of execution of the batches of the iterations of the operations; and
generating executable instructions to be executed by the neural network hardware accelerator based on the schedule of execution.
1. A method of accelerating an execution of loops in a neural network at a neural network hardware accelerator, the method being performed by a compiler and comprising:
receiving input codes of a neural network, the neural network comprising a neural network operator;
receiving first information of a first quantity of computation resources of the neural network hardware accelerator assigned to execute the neural network operator;
receiving second information of a second quantity of memory resources of the neural network hardware accelerator assigned to execute the neural network operator;
compiling the input codes to generate an input data set representing a dataflow graph of the neural network;
generating, based on the input data set, a loop-nest representation of the neural network, the loop-nest including a parent loop and a child loop nested within the parent loop, elements of a first tensor being associated with first indices determined by a first induction variable of the parent loop, and elements of a second tensor being associated with second indices determined by a second induction variable of the child loop;
determining, based on the first information and the second information, a first modulo operator to map the first indices to first remainders and a second modulo operator to map the second indices to second remainders;
determining, based on the first modulo operator and the second modulo operator, a first number of iterations of the loop-nest to be included in a batch to access the first tensor and the second tensor, wherein the first number of iterations in the batch are to be executed in parallel by the neural network hardware accelerator;
determining a schedule of execution of the batches of iterations of the loop-nest; and
generating executable instructions to be executed by the neural network hardware accelerator based on the schedule of execution.
2. The method of
the memory resources are provided by a local memory of the neural network hardware accelerator;
the first modulo operator maps elements of the first tensor to a first number of memory addresses in the local memory represented by the first remainders;
the second modulo operator maps elements of the second tensor to a second number of memory addresses in the local memory represented by the second remainders; and
a sum of the first number and the second number is equal to or below the first quantity.
3. The method of
the first information indicates that the computation resources comprise a second number of computation engines, each computation engine to execute an iteration of the loop-nest;
a product of the first modulo operator and the second modulo operator is equal to the first number; and
the first number is less than or equal to the second number.
4. The method of
the first tensor and the second tensor are part of a multi-dimensional tensor;
the first indices are associated with a first dimension;
the second indices are associated with a second dimension;
a product of the first modulo operator and the second modulo operator equals a second number of memory addresses mapped to elements of the multi-dimensional tensor; and
each iteration of the loop-nest includes an instruction to access an element of the multi-dimensional tensor at the mapped memory address.
5. The method of
the input codes include a direct memory access (DMA) instruction to access a third tensor in the child loop, elements of the third tensor being associated with the second indices; and
the method further comprises determining a third modulo operator to map the second indices of the third tensor to third remainders, the third remainders being different from the second remainders, such that a larger number of iterations of the DMA instruction to access the third tensor is included in the batch than iterations of an instruction to access the first tensor or the second tensor.
7. The method of
adding a modulo operator to at least one of the one or more induction variables of the index of the element of the tensor in the loop representation.
8. The method of
wherein a value of the modulo operator indicates a number of iterations of the loop to be executed in parallel in the neural network hardware accelerator; and
wherein the value of the modulo operator is determined based on the first information and the second information.
9. The method of
determining, based on the modulo operator, a first batch and a second batch of the iterations of the loop to be executed in parallel; and
determining a schedule of execution indicating that the first batch is to be executed at a first time and the second batch is to be executed at a second time after the first time.
10. The method of
the tensor is a first tensor;
the index is a first index;
the operation is a first operation;
the instruction is a first instruction;
the set of memory addresses is a first set of memory addresses;
the neural network operator further comprises iterations of a second operation on a second tensor;
the modulo operator is a first modulo operator;
the method further includes:
generating the loop representation to include a second instruction in the body of the loop to perform the second operation on an element of the second tensor, the element of the second tensor being associated with a second index; and
adding a second modulo operator to the index of the element of the second tensor in the loop representation to map elements of the second tensor to a second set of memory addresses in the portion of the local memory.
11. The method of
wherein each of the first batch and the second batch includes iterations of the first operation to be executed in parallel and iterations of the second operation to be executed in parallel; and
wherein the schedule of execution indicates that iterations of the second operation are to be executed in parallel after the parallel execution of the iterations of the first operation completes.
12. The method of
13. The method of
wherein each of the first batch and the second batch includes a first number of iterations of the DMA operation to be executed in parallel and a second number of iterations of the second operation to be executed in parallel, the first number being larger than the second number.
14. The method of
start the parallel execution of the first number of iterations of the DMA operation at a first time;
start the parallel execution of a first group of the second number of iterations of the second operation at a second time, after a first subset of the first number of iterations of the DMA operation completes; and
start the parallel execution of a second group of the second number of iterations of the second operation at a third time, after a second subset of the first number of iterations of the DMA operation completes.
15. The method of
the loop representation includes a loop-nest;
the induction variable is a first induction variable;
the loop-nest includes a parent loop and a child loop nested within the parent loop, the loop being the parent loop;
the indices include first indices and second indices;
the first indices are determined by the first induction variable of the parent loop;
the second indices are determined by a second induction variable of the child loop; and
a product of the first modulo operator and the second modulo operator is determined based on the first information and the second information.
16. The method of
determining that the tensor has no loop-carried dependency across loops or between iterations of a loop; and
adding the modulo operator to the index of the element of the tensor in the loop representation based on the determination of no loop-carried dependency.
17. The method of
assigning a first initial modulo operator to the first indices of the first tensor;
assigning a second initial modulo operator to the second indices of the second tensor;
determining, based on the second information, that the second tensor having the second initial modulo operator causes a total memory footprint of the first tensor and the second tensor to exceed a total size of the portion of the local memory;
determining that the second operation included in the child loop contains a first write instruction and a last read instruction of an element of the second tensor;
determining that the parent loop is a closest ancestor loop of the child loop; and
reducing the first initial modulo operator.
18. The method of
performing a topology sort on the dataflow graph to generate a linear graph;
translating, based on accessing loop templates in a compute definition library, each neural network operator in the linear graph into a loop representation; and
generating a program of the loop representations following an order of the corresponding neural network operators in the linear graph,
wherein the schedule of execution and the executable instructions are determined based on the program.
Artificial neural networks are computing systems with an architecture based on biological neural networks. Artificial neural networks can be trained using training data to learn how to perform a certain task, such as identifying or classifying physical objects, activities, characters, etc., from images or videos. An artificial neural network, such as a deep neural network, may include multiple layers of processing nodes. Each processing node in a layer can perform computations on input data generated by processing nodes in the preceding layer to generate output data. For example, a processing node may perform a set of arithmetic operations such as multiplications and additions to generate an intermediate output, or perform post-processing operations on the intermediate output to generate a final output. An artificial neural network may include thousands of processing nodes and millions of parameters.
The architecture of a neural network may include an input layer, an output layer, and a number of intermediate layers, often referred to as hidden layers. Each layer executes a computation on the outputs of the previous layer, with the last layer (the output layer) providing a final result. With more layers, a neural network can, theoretically, perform more complex tasks, such as language translations and identifying (or classifying) the contents of an image. A neural network with more than three hidden layers is sometimes referred to as a deep neural network. Deep neural networks can have many hidden layers, such as, for example, between five and more than a thousand layers.
Neural networks can be implemented using a central processing unit (CPU) to perform the computations. CPUs, however, tend to be optimized for sequential rather than parallel computations, and thus can suffer from poor response times. Graphics processing units (GPUs) are optimized for parallel computations, but not necessarily optimized to provide the result from one computation unit directly to another computation unit. Often, the result must first be written to a memory and then read back. Although GPUs can have better response times than CPUs, it would still be desirable to improve the execution time of a neural network. Recently, special-purpose integrated circuit devices, such as a hardware neural network accelerator, have been developed to execute neural networks more efficiently than either CPUs or GPUs.
Various embodiments in accordance with the present disclosure will be described with reference to the drawings.
Neural networks can include many interconnected operators of several different operator types. Operators of the same type may perform similar operations on the input data. For example, one type of operator may be an addition operator that adds two tensors together. Another type of operator may be a convolution operator that convolves an input tensor with a filter, which is characterized by a set of weights. An addition operator and a convolution operator can include a set of operations on input tensors to generate an output tensor. The set of operations can be repetitive in that the same type of operations are performed on different elements of the input tensors to generate corresponding elements of the output tensor. For example, an addition operator between two input tensors can include a set of repetitive addition operations, with each addition operation performed between two elements from the two input tensors to generate an element of the output tensor. The addition operation is then repeated over other elements of the two input tensors to generate other elements of the output tensor. As another example, a convolution operator between two input tensors (e.g., an image tensor and a weight tensor) can include a set of repetitive multiply-and-accumulation operations, with each operation involving multiplying two elements from the two input tensors to generate a product, and adding the product to an accumulator of products of other elements of the two input tensors. As yet another example, an activation function operation (e.g., ReLU) can be performed on each element in an input tensor to generate a corresponding element of an output tensor.
The repetitive operations in a neural network operator can be represented in the form of an affine loop. An affine loop is a loop with a canonical induction variable that starts at zero and increments by one for each iteration. The upper bound of the variable does not change during program execution. The induction variable incremented by the loop can be used to index a particular element of the tensor for an operation by the loop. In a case where the repetitive operations involve a multi-dimensional tensor that includes multiple tensors defined along different dimensions, the repetitive operations can be represented in a loop-nest, which may be manifested at certain intermediate representations generated by the compiler. In a simple example, a loop-nest includes an outer loop and an inner loop within the body of the outer loop. The outer loop and the inner loop may each iterate across a different range of induction variable values. Each range of values can correspond to a range of a dimension of the multi-dimensional tensor. For example, in a case of a two-dimensional tensor, the first iteration of the outer loop triggers the inner loop, which executes across its entire range of values to index the elements of a first tensor along a first column in multiple iterations. Upon completion of the inner loop, the outer loop moves to a second value within its range of values and again triggers the inner loop, which again executes across its entire range of values to index the elements of a second tensor along a second column in multiple iterations. A loop-nest can also be represented in a hierarchy (e.g., in the form of a hierarchical tree), in which the outer loop is a parent loop and the inner loop(s) nested within the outer loop are child loops. An inner loop may further include a nested inner loop as a child loop of the inner loop.
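For illustration only (the tensor shape, the operator, and the values below are hypothetical and not drawn from any particular embodiment), such an affine loop-nest for an elementwise addition over a two-dimensional tensor could be sketched as follows, with the induction variables of the parent and child loops indexing the two dimensions:

    # Hypothetical 4x8 input tensors A and B and output tensor C for an
    # elementwise addition operator, represented as an affine loop-nest.
    ROWS, COLS = 4, 8
    A = [[float(i * COLS + j) for j in range(COLS)] for i in range(ROWS)]
    B = [[1.0] * COLS for _ in range(ROWS)]
    C = [[0.0] * COLS for _ in range(ROWS)]

    # Parent (outer) loop: induction variable i starts at zero, increments by
    # one per iteration, and indexes the first dimension.
    for i in range(ROWS):
        # Child (inner) loop: induction variable j indexes the second dimension.
        for j in range(COLS):
            # Body: the same operation repeated on different tensor elements,
            # with the element selected by the induction variables (i, j).
            C[i][j] = A[i][j] + B[i][j]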
A compiler can compile input codes representing a neural network operator into executable instructions that can be in the form of binary codes. As part of the compiling process, the compiler can generate a loop-nest representation of the neural network operator. The generation of the loop-nest representation allows the compiler to determine the schedule of execution of each loop and, based on the schedule, generate the executable binary codes to control the order of execution of the loops. Specifically, a loop of the neural network operator may have no loop-carried dependency between iterations, such that the operation in one iteration does not depend on an output generated by the operation in another iteration. This allows the compiler to schedule the different iterations of operations to be executed in parallel, instead of executing each iteration sequentially. For example, in the aforementioned addition, multiply-and-accumulation, and activation function operators, each iteration of the addition, multiplication, and/or activation function operations may be performed on different elements of the input and output tensors and can be performed independently from other iterations. The compiler can control the different iterations of an operation to access different memory locations for different elements of the input/output tensor, to enable each iteration to be executed independently and in parallel with each other.
A hardware neural network accelerator typically includes computation resources, such as multiple computation engines, to support parallel execution of the different iterations of a neural network operator, as well as an on-chip memory to provide intermediate storage for the input and output of the neural network operator, all of which can speed up the execution of the neural network operator. But the level of parallelism supported by the hardware neural network accelerator can be limited by the amount of computation resources and memory space assigned (or intended) to the execution of the neural network operator. For example, the number of computation engines assigned to the execution of the neural network operator can limit a number of loop iterations that can be executed in parallel at a given time. Moreover, the memory space may limit a number of elements of a tensor stored in the memory at a given time. As each iteration indexes/accesses a different element of a tensor from a different memory address, the size of the memory space can also limit the number of iterations executing in parallel that can access the different elements at a given time.
The computation resources available to support parallel execution of the loop iterations, as well as the memory space available to support the parallel execution, typically vary. For example, different hardware neural network accelerators may have different numbers of computation engines and different memory sizes. Moreover, in a case where the neural network hardware accelerator is shared by multiple tenants, the computation resources and memory spaces assigned to each tenant may also vary. If a compiler does not take into account the computation resources and memory space assigned to the execution of the neural network operator when scheduling the parallel execution of the different iterations of a neural network operator, the compiler may generate instructions that may either underutilize or overutilize the computation and memory resources, which may lead to inefficient execution of the neural network operator or may affect other operations being performed by the neural network hardware accelerator.
Examples described herein provide methods, systems, and other techniques to improve the scheduling of repetitive operations of a neural network operator. The compiler can determine a number of iterations of the operations to be included in a batch, where operations within a batch can be executed in parallel and can access different memory addresses, while different batches are executed sequentially. Moreover, the compiler can determine an address mapping scheme in which the different batches of operations reuse the same set of memory addresses, to reduce the total memory footprint by the neural network operator. The compiler can determine the address mapping scheme and assign the iterations into batches based on the computation and memory resources assigned to the neural network operator. The assignment of the computation and memory resources to the neural network operator can be made by, for example, administration/management software of the neural network hardware accelerator, and the information about the computation and memory resources (e.g., a number of iterations of operations that can be executed in parallel, a size of a local memory space assigned to the particular neural network operator or to the entire neural network, etc.) can be part of configuration parameters of the compiler to configure the compilation operation. After determining the address mapping scheme and the batches, the compiler can determine a schedule of execution of the batches, as well as the addresses of the memory accessed by the iterations of operations within each batch, and generate binary codes based on the schedule of execution of the batches and the addresses of the memory accessed by the batches.
Specifically, the compiler can receive input codes involving neural network computations and compile the input codes to generate a dataset representing a dataflow graph of a neural network. The dataflow graph may include a plurality of neural network operators, such as an addition operator, a convolution operator, an activation function (e.g., ReLU) operator, etc. Each neural network operator can be represented as a node in the dataflow graph. The compiler can generate a linear graph from the dataflow graph by performing, for example, a topological sort to assign each node (and the associated neural network operator) to the linear graph. The compiler can then generate a program representing the linear graph based on translating each neural network operator represented in the linear graph into a loop including instructions to access a tensor. In a case where the tensor is multi-dimensional and includes multiple tensors defined along different dimensions, the compiler can translate the neural network operator that accesses the multi-dimensional tensor into a loop-nest, with a parent outer-loop and one or more child inner-loops. The parent outer-loop and the child inner-loops can be associated with different induction variables associated with different dimensions. The loops can update the induction variables to select different tensors in different iterations. In some examples, the translation can be based on accessing loop-nest templates in a compute definition library that associates different loop-nest templates with different neural network operators. In some examples, additional processing, such as a loop fusion operation to fuse two or more loop-nests together while preserving the original behaviors of the loop-nests, can also be performed.
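As a non-limiting sketch of this front end (the operator names, dependencies, and loop templates below are invented placeholders rather than the contents of any actual compute definition library), the linearization and template-based translation might proceed as follows:

    from graphlib import TopologicalSorter

    # Hypothetical dataflow graph: each operator node maps to the set of
    # nodes it depends on.
    dataflow_graph = {
        "conv0": set(),
        "relu0": {"conv0"},
        "add0": {"relu0"},
    }

    # Illustrative stand-ins for loop-nest templates in a compute definition
    # library, keyed by operator.
    loop_templates = {
        "conv0": "for n: for c: for h: for w: out[n,c,h,w] = conv(in, weights)",
        "relu0": "for i: for j: out[i,j] = max(0, in[i,j])",
        "add0":  "for i: for j: out[i,j] = a[i,j] + b[i,j]",
    }

    # A topological sort turns the dataflow graph into a linear graph that
    # preserves producer-before-consumer ordering.
    linear_graph = list(TopologicalSorter(dataflow_graph).static_order())

    # Translate each operator in the linear graph into its loop
    # representation, producing a program that follows the linearized order.
    program = [loop_templates[op] for op in linear_graph]
    for op, loops in zip(linear_graph, program):
        print(op, "->", loops)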
After generating the program including the loops and/or loop-nests, the compiler can identify a tensor indexed by one or more loops that does not create loop-carried dependency in the tensor. The tensor may include a plurality of elements, with each element associated with an index, and a loop's induction variable can set the index of the element to be accessed in an iteration of the loop. In a case where the tensor is multi-dimensional and has multiple tensors, each element can be associated with multiple indices in multiple dimensions set by multiple loops, where each index can be set by the induction variable of a different loop. A lack of loop-carried data dependency allows each loop iteration that accesses a tensor (of a multi-dimensional tensor) or a tensor element to be executed independently.
The compiler can carry out a two-step test to determine whether there is loop-carried dependency in the tensor. As a first step, the compiler can determine whether the indices along one dimension are set by two different loops, and whether the tensor is written in one loop and read in another loop. If both are true, the compiler may determine there is loop-carried dependency between the two different loops. On the other hand, for tensors/elements of which the indices of one dimension are set by a single loop, the compiler can carry out a second step of the two-step test to determine whether there is loop-carried dependency.
As a second step of the test, the compiler can determine whether there is a loop-carried dependency between elements of the tensor. A loop-carried dependency may exist when, for example, a first element of the tensor accessed by a first iteration of the loop has data dependency on a second element of the tensor accessed by a second iteration of the loop. In some examples, the compiler can determine a live interval of each element of the tensor, which is defined by logical timestamps of when the tensor element is first written and when it is last read, and determine the loop-carried dependency (if any) between the elements of the tensor based on whether the live intervals of the elements overlap. The logical timestamps can be defined by the induction variables of the loop for the first write and the last read of a tensor element. In some examples, a loop-nest may index multiple tensors, and the compiler can determine the live interval of each element of the tensors to determine loop-carried dependency (if any) for each tensor indexed by the loop-nest.
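A minimal sketch of this second step, assuming the first-write and last-read logical timestamps of each tensor element have already been collected (the access patterns below are hypothetical), could look like the following:

    # Each element's accesses are recorded as logical timestamps
    # (loop iteration numbers).
    def live_interval(writes, reads):
        # The live interval spans from the first write to the last read.
        return (min(writes), max(reads))

    def has_loop_carried_dependency(per_element_accesses):
        # If the live intervals of two different elements overlap, one
        # iteration may read a value produced by another iteration, i.e.,
        # a loop-carried dependency between iterations of the loop.
        intervals = [live_interval(w, r) for (w, r) in per_element_accesses]
        intervals.sort()
        for (s0, e0), (s1, e1) in zip(intervals, intervals[1:]):
            if s1 <= e0:
                return True
        return False

    # Element k written at iteration k and read at iteration k: independent.
    independent = [([k], [k]) for k in range(8)]
    # Element k written at iteration k and read at iteration k + 1: dependent.
    dependent = [([k], [k + 1]) for k in range(8)]

    print(has_loop_carried_dependency(independent))  # False
    print(has_loop_carried_dependency(dependent))    # True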
After identifying tensor(s) that have no loop-carried dependency, the compiler can identify the loop that indexes the tensor using the loop's induction variable, and determine an initial modulo operator for that loop as part of a global modulo allocation operation. The modulo operator can operate on the original indices (e.g., directly from the induction variables) of elements of the tensor in the program to generate remainder values. The remainder values can represent memory addresses. Through the modulo operation, elements of the tensor having different original indices can be mapped to a range of remainder values each representing a different address in the memory. The modulo operator can indicate how many elements of the tensor are mapped to different addresses in the memory. For example, for a modulo operator of m, m elements of the tensor are mapped to m different addresses, and m iterations of the loop can be included in a batch to be executed in parallel to access the m different addresses. Different groups of m elements are accessed in different batches, and the different groups are all mapped to the same set of m addresses. In a case of a multi-dimensional tensor including multiple tensors defined along multiple dimensions and associated with a loop-nest, the compiler can determine an initial modulo operator for each loop that indexes the different tensors along different dimensions.
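For illustration (the iteration count and the modulo operator value below are hypothetical), the mapping of original indices to reusable addresses and the grouping of iterations into batches could be sketched as:

    # 12 loop iterations and a modulo operator m = 4: m elements map to m
    # distinct local-memory addresses.
    NUM_ITERATIONS = 12
    m = 4

    # Each original index i is mapped to the remainder i % m, which stands
    # for one of m reusable addresses in the assigned portion of local memory.
    address_of = {i: i % m for i in range(NUM_ITERATIONS)}

    # Iterations whose indices map to distinct addresses form one batch of m
    # iterations that can run in parallel; successive batches reuse the same
    # m addresses.
    batches = [list(range(start, start + m))
               for start in range(0, NUM_ITERATIONS, m)]

    for b, batch in enumerate(batches):
        print(f"batch {b}: iterations {batch} -> "
              f"addresses {[address_of[i] for i in batch]}")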
The compiler may determine not to assign a modulo operator to indices of a dimension of a tensor that has a loop-carried dependency. Specifically, if the tensor fails the first step of the two-step test because the tensor is written in one loop and read in another loop, the compiler may determine that the first loop is to write all tensor elements (or tensors) of that dimension to separate addresses in the memory, instead of mapping multiple tensor elements/tensors to a single set of addresses, so that the second loop can access all of the tensor elements (or tensors) of the tensor. Moreover, in a case where there is a loop-carried dependency within a loop between tensor elements of the same dimension, the compiler may also determine not to assign a modulo operator to indices of that dimension, to avoid violating the data dependency between iterations of the loop.
The initial modulo operators for each loop can be determined based on a maximum degree of parallel execution of the neural network operator supported by the neural network hardware accelerator, as well as the size of memory space assigned to the neural network operator. Specifically, the number of iterations made available for parallel execution may be equal to the product of the initial modulo operators of each loop in a loop-nest. For example, assuming that a loop-nest includes a parent outer loop that indexes elements of a first tensor and a child inner loop that indexes elements of a second tensor, and that the first tensor is assigned a first initial modulo operator of m and the second tensor is assigned a second initial modulo operator of n, the product m×n can determine the number of iterations made available for parallel execution. That product is typically smaller than or equal to a number of iterations the neural network hardware accelerator can execute in parallel (e.g., 8, 16, etc.) for the neural network operator, which in turn is based on the computation resources assigned to the neural network operator. In addition, as described above, the initial modulo operator can define how many different elements/tensors are to be mapped to different addresses. With a larger modulo operator on the original indices of the elements/tensors, a larger number of the elements/tensors can be mapped to a larger number of different addresses, and thereby a larger memory space is used to store the elements/tensors, and vice versa. The compiler can therefore determine the initial modulo operator based on the size of the memory space assigned to the elements/tensors. In some examples, the compiler can determine the initial modulo operators of the tensors accessed by a loop-nest based on a topological order traversal, in which the compiler assigns the initial modulo operators of the tensors indexed by the parent loop first based on the assigned memory spaces for the tensors/elements indexed by the parent loop. The compiler can then assign the initial modulo operators of the tensors indexed by the child loop, under the constraint that the product of the initial modulo operators across the loops remains equal to or below the maximum degree of parallel execution supported by the neural network hardware accelerator. In some examples, the compiler can also assign the initial modulo operators for the child loops first, followed by the parent loops.
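A rough sketch of such an initial allocation, under the simplifying and hypothetical assumptions that the assigned memory is split evenly between the parent and child tensors and that modulo operators are powers of two, might look like:

    # Hypothetical resource parameters supplied to the compiler as
    # configuration: the maximum number of loop iterations the accelerator
    # can run in parallel for the operator, and the local-memory bytes
    # assigned to the operator (split evenly here, for illustration,
    # between the two tensors).
    MAX_PARALLEL_ITERATIONS = 16
    ASSIGNED_MEMORY_BYTES = 4096
    PARENT_ELEMENT_BYTES = 256   # bytes per element of the parent-loop tensor
    CHILD_ELEMENT_BYTES = 64     # bytes per element of the child-loop tensor

    def initial_modulo_operators():
        # Topological-order traversal: pick the parent-loop modulo operator m
        # first from the memory assigned to its tensor, then pick the
        # child-loop modulo operator n so that m * n stays within the maximum
        # parallelism and the child tensor's m * n addresses fit in its share
        # of the assigned memory.
        m = 1
        while (m * 2) * PARENT_ELEMENT_BYTES <= ASSIGNED_MEMORY_BYTES // 2:
            m *= 2
        n = 1
        while (m * (n * 2) <= MAX_PARALLEL_ITERATIONS
               and m * (n * 2) * CHILD_ELEMENT_BYTES <= ASSIGNED_MEMORY_BYTES // 2):
            n *= 2
        return m, n

    m, n = initial_modulo_operators()
    print(f"parent modulo m={m}, child modulo n={n}, parallel iterations={m * n}")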
The assignment order of initial modulo operators (parent loop followed by child loops, or vice versa) can be based on the architecture of the system that executes the neural network operators. For example, for a system that has multiple hardware processors, the compiler may preferentially set the initial modulo operators for the parent loop first, followed by the child loops, to manage parallel execution of the parent loops across the multiple hardware processors. On the other hand, for a system that has a single hardware processor with multiple execution engines, the compiler may preferentially set the initial modulo operators for the child loops first, followed by the parent loop, to manage parallel execution of the child loops across the execution engines for each iteration of the parent loop.
After determining the initial modulo operators, as part of the global modulo allocation operation, the compiler can reduce some or all of the initial modulo operators based on whether the total memory footprint by the tensors exceeds the available memory space. Specifically, the compiler can determine the live interval of each tensor for which an initial modulo operator is assigned, as well as the size of memory used by the tensor during the live interval. Tensors having overlapping live intervals can indicate that the memory needs to store the tensors simultaneously, whereas tensors that do not have overlapping live intervals need not be stored simultaneously. The compiler can determine the total memory footprint by the tensors based on identifying tensors having overlapping live intervals, as well as their memory footprints. If the total memory footprint of the tensors with the initial modulo operators is below the available memory space, the compiler can stop the global modulo allocation operation.
On the other hand, if the total memory footprint is above the available memory space, the compiler can determine an overflowing tensor that cannot fit into the available memory space. To reduce the total memory footprint, the compiler identifies the loop that includes a first write instruction and a last read instruction of that tensor, and reduces the initial modulo operator of the closest parent loop of that loop in the hierarchy (e.g., by reducing it by half) if the initial modulo operator of the closest parent loop is bigger than one. Such arrangements can be more effective in reducing the total memory footprint, since reducing the initial modulo operator of the parent loop can reduce the memory footprint both by the parent tensor indexed by the parent loop and by the child tensor (the overflowing tensor) indexed by the loop. The reduction of the memory footprint by the parent tensor can be due to mapping the elements of the parent tensor to fewer addresses. Moreover, the reduction of the memory footprint by the child tensor can be due to reducing the number of parallel iterations of the child loop (given by the product of the modulo operators of both loops), which can also reduce the number of addresses mapped to the elements of the child tensor.
After reducing the initial modulo operator of the closest parent loop of the loop, the compiler can update the memory footprint estimate based on the new modulo operator of the closest parent loop, and determine whether the child tensor can fit into the available memory space. If the child tensor can fit, the compiler can stop the global modulo allocation operation. If the child tensor still cannot fit, the compiler can further reduce the modulo operator of the closest parent loop if the modulo operator is still above one. If the modulo operator of the closest parent loop equals one (e.g., after multiple rounds of reduction), such that there is no parallel execution of iterations of the closest parent loop, the compiler can proceed to reduce the modulo operator of the child loop (e.g., by half) that indexes the child tensor. The compiler can repeat the reduction of the modulo operator of the child loop until the total memory footprint is below the available memory space.
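A condensed sketch of this reduction strategy, using hypothetical element sizes and an assumed halving step, is shown below; the parent-loop modulo operator is reduced first, and the child-loop modulo operator only once the parent-loop operator reaches one:

    # parent_modulo addresses hold parent-tensor elements; parent_modulo *
    # child_modulo addresses hold child-tensor (overflowing tensor) elements.
    PARENT_ELEMENT_BYTES = 512
    CHILD_ELEMENT_BYTES = 256
    AVAILABLE_MEMORY_BYTES = 3072

    def footprint(parent_modulo, child_modulo):
        parent_bytes = parent_modulo * PARENT_ELEMENT_BYTES
        child_bytes = parent_modulo * child_modulo * CHILD_ELEMENT_BYTES
        return parent_bytes + child_bytes

    def reduce_modulos(parent_modulo, child_modulo):
        # Prefer reducing the closest parent loop: halving it shrinks both
        # the parent tensor's addresses and the child tensor's parallel
        # iterations.
        while footprint(parent_modulo, child_modulo) > AVAILABLE_MEMORY_BYTES:
            if parent_modulo > 1:
                parent_modulo //= 2
            elif child_modulo > 1:
                child_modulo //= 2
            else:
                break  # both modulo operators are 1; nothing left to reduce
        return parent_modulo, child_modulo

    print(reduce_modulos(parent_modulo=8, child_modulo=2))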
In some examples, the neural network operator can also include one or more tensors accessed by direct memory access (DMA) instructions ("DMA tensors") to transfer data between an external memory and the local memory of the neural network hardware accelerator. The DMA instructions can also be included in the same loop as other instructions of a neural network operator (e.g., additions, multiplications, activation function processing, etc.), to provide memory data transfer to support those instructions. The DMA tensors can be stored in the same local memory of the neural network hardware accelerator as the other tensors ("local tensors") accessed by other neural network operations.
As part of the global modulo allocation operation, the compiler can also determine the modulo operator that maps the elements of the DMA tensor to the addresses of the local memory. The DMA instructions may be in the same loop as other neural network operation instructions, and the DMA tensor can be indexed by the induction variable of that loop. The modulo operator can determine a local memory footprint by the DMA tensor as well as a number of DMA instructions to be executed in parallel in accessing the DMA tensor. The compiler may estimate the local memory footprint of the DMA tensor and the local tensor based on determining the live intervals of the DMA tensor and local tensor, and summing the footprints of tensors having overlapping live intervals, as described above.
If the total memory footprint exceeds the available memory space, the compiler may preferentially reduce the modulo operator for the local tensors before reducing the modulo operator of the DMA tensors. This can lead to more DMA instructions being executed in parallel than the other neural network operations, even though the DMA instruction and the neural network operation are in the same loop and the DMA tensor and the local tensor are indexed by the same induction variable. Such arrangements can improve the performance of the neural network hardware accelerator, especially in a case where the DMA operations present a substantial bottleneck. Moreover, while the DMA instructions start execution in parallel, they typically do not complete at the same time due to the sequential access of the external memory. As a result, other neural network operations that depend on the DMA operations need not have the same parallelism and can be performed sequentially after the DMA operations complete. The parallelism of the neural network operations can therefore be reduced with minimal effect on the execution speed of these operations, while at the same time reducing the memory footprint.
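This preference order could be sketched as follows (the tensor names, sizes, and modulo values are hypothetical, and the reduction step is again assumed to be a halving); note that the DMA tensor retains its larger modulo operator, and hence its higher degree of parallelism, after the reduction:

    # Each tensor is described by its kind ("local" or "dma"), its current
    # modulo operator, and the bytes occupied per mapped address.
    def reduce_preferring_local(tensors, available_bytes):
        def footprint():
            return sum(t["modulo"] * t["bytes"] for t in tensors.values())

        # Reduce local tensors first; DMA tensors only if still overflowing.
        order = [n for n, t in tensors.items() if t["kind"] == "local"]
        order += [n for n, t in tensors.items() if t["kind"] == "dma"]
        for name in order:
            while footprint() > available_bytes and tensors[name]["modulo"] > 1:
                tensors[name]["modulo"] //= 2
        return tensors

    tensors = {
        "activation": {"kind": "local", "modulo": 8, "bytes": 256},
        "dma_input":  {"kind": "dma",   "modulo": 8, "bytes": 256},
    }
    print(reduce_preferring_local(tensors, available_bytes=2560))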
After the global modulo allocation operation completes and the modulo operators for the tensors of the program are determined, the compiler can determine a schedule of execution of the different iterations of the loops in the program and the mapping of the tensors to the memory addresses based on the modulo operators. The compiler can perform the scheduling based on estimating the total completion time of the DMA operations (which can include the memory access delay and the memory data transfer delay over the interconnect), as well as on the data dependency between the tensors. The compiler can then generate executable instructions that reflect the schedule of execution of the different iterations of the loops in the program.
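A minimal scheduling sketch under assumed, hypothetical delay figures (actual completion-time estimates would depend on the memory system and interconnect of a given accelerator) might be:

    # Batches run sequentially, and a batch of compute that consumes a DMA
    # tensor starts only after the estimated DMA completion time.
    DMA_ACCESS_DELAY = 5      # assumed external-memory access latency (cycles)
    DMA_TRANSFER_DELAY = 3    # assumed interconnect transfer time per batch (cycles)
    COMPUTE_DELAY = 2         # assumed compute time per batch (cycles)

    def schedule(num_batches):
        events = []
        time = 0
        for batch in range(num_batches):
            dma_done = time + DMA_ACCESS_DELAY + DMA_TRANSFER_DELAY
            events.append((time, f"batch {batch}: start DMA loads"))
            events.append((dma_done, f"batch {batch}: start dependent compute"))
            time = dma_done + COMPUTE_DELAY
        return events

    for t, what in schedule(num_batches=3):
        print(f"t={t}: {what}")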
With the disclosed examples, a compiler can schedule the repetitive operations of a neural network operator based on the available computation and memory resources to maximize the parallel execution of the operations of the neural network operator allowed by the available computation resources, while ensuring that there are sufficient memory resources to support the parallel execution. Such arrangements can reduce underutilization or overuse of the available computation and memory resources and improve the performance of the neural network hardware accelerator that executes the neural network operator.
In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiments being described.
A synapse can scale the signal crossing the synapse. The scaling factor is referred to as a weight, and is thought of as the way a brain is able to learn: different weights result from different responses to input. Learning can change the weights, but the organization of the neurons and synapses need not change to obtain the learning. The static structure of the brain can thus be used as a model for a program, and the weights can reflect tasks that the program has learned to perform.
Neural networks operate on the notion that a neuron's computation involves a weighted sum of input values. These weighted sums correspond to the value scaling performed by the synapses and the combining of those values in the neuron. A functional operation is performed in the neuron on the combined inputs. In the brain model, the operation appears to be a non-linear function that causes the neuron to generate an output only when the inputs cross some threshold. Thus, by analogy, the nodes of a neural network can apply a non-linear function to the weighted sum of the values input into the nodes.
In the illustrated example, the model 100 includes an input layer 104, a middle layer that is often referred to as a hidden layer 106, and an output layer 108. Each layer includes some number of nodes 102. In this example, the nodes 102 of the input layer 104 are connected to each node 102 of the hidden layer 106. The connections, which would be referred to as synapses in the brain model, are referred to as weights 110. Also in this example, each node 102 of the hidden layer 106 has a connection or weight 110 with each node 102 of the output layer. The input layer 104 can receive inputs and can propagate the inputs to the hidden layer 106. A neural network implementation can include multiple hidden layers. Weighted sums computed by the hidden layer 106 (or multiple hidden layers) are propagated to the output layer 108, which can present final outputs to a user. The outputs of the nodes 102 can be referred to as activations, in keeping with the brain model.
An example of a computation that can occur at each layer in the example model 100 is as follows:
y_j = ƒ(Σ_{i=1}^{3} W_{ij} × x_i + b)   (Equation 1)
In the above equation, W_{ij} is a weight, x_i is an input activation, y_j is an output activation, ƒ( ) is a non-linear function, and b is a bias term. Various non-linear functions can be used to achieve different purposes.
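For a concrete but hypothetical instance of Equation 1 (the weights, inputs, and bias below are arbitrary, and ReLU is used as one possible non-linear function ƒ):

    # Equation 1 computed for each output activation y_j, using hypothetical
    # weights, inputs, and bias, with ReLU as the non-linear function f.
    W = [[0.2, -0.5], [0.8, 0.1], [-0.3, 0.7]]   # W[i][j]: 3 inputs x 2 outputs
    x = [1.0, 2.0, 3.0]                          # input activations x_i
    b = 0.5                                      # bias term

    def relu(v):
        return max(0.0, v)

    # Weighted sum of the inputs plus the bias, passed through f, per output j.
    y = [relu(sum(W[i][j] * x[i] for i in range(3)) + b) for j in range(2)]
    print(y)  # [1.4, 2.3]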
The model 100 can be referred to as a directed, weighted graph. In a directed graph, each connection to or from a node indicates a direction (e.g., into the node or away from the node). In a weighted graph, each connection can have a weight. Tools for developing neural networks can visualize the neural network as a directed, weighted graph, for ease of understanding and debuggability. In some cases, these tools can also be used to train the neural network and output trained weight values. Executing the neural network is then a matter of using the weights to conduct computations on input data.
Neural networks with many layers can be capable of learning high-level features having more complexity and abstraction than shallower networks. As an example, a neural network can be taught to recognize images. In this example, pixels of an image can be fed into the input layer of the neural network, and the outputs of the first layer can indicate the presence of low-level features in the image, such as lines and edges. At subsequent layers, these features can be combined to measure the likely presence of higher level features: the lines can be combined into shapes, which can be further combined into sets of shapes. Given all this information, the neural network can output a probability that the high-level features represent a particular object or scene. For example, the neural network can output whether an image contains a cat or does not contain a cat.
The learning phase of a neural network is referred to as training the neural network. During training, the neural network is taught to perform a task. In learning the task, values for the weights (and possibly also the bias) are determined. The underlying program for the neural network (e.g., the organization of nodes into layers, the connections between the nodes of each layer, and the computation executed by each node), does not need to change during training. Once trained, the neural network can perform the task by computing a result using the weight values that were determined during training. For example, the neural network can output the probability that an image contains a particular object, can output the probability that an audio sequence contains a particular word, can generate a bounding box around an object in an image, or can propose an action that should be taken, etc. Running the program for the neural network is referred to as inference.
There are multiple ways in which weights can be trained. One method is called supervised learning. In supervised learning, all training samples are labeled, so that inputting each training sample into a neural network produces a known result. Another method is called unsupervised learning, where the training samples are not labeled and training aims to find a structure in the data or clusters in the data. Semi-supervised learning falls between supervised and unsupervised learning. In semi-supervised learning, a subset of training data is labeled. The unlabeled data can be used to define cluster boundaries and the labeled data can be used to label the clusters.
A neural network, such as the neural network represented in
In various implementations, the memory subsystem 204 can include multiple memory banks 214. In these implementations, each memory bank 214 can be independently accessible, meaning that the read of one memory bank is not dependent on the read of another memory bank. Similarly, writing to one memory bank does not affect or limit writing to a different memory bank. In some cases, each memory bank can be read and written at the same time. Various techniques can be used to have independently accessible memory banks 214. For example, each memory bank can be a physically separate memory component that has an address space that is separate and independent of the address spaces of each other memory bank. In this example, each memory bank may have at least one read channel and may have at least one separate write channel that can be used at the same time. In these examples, the memory subsystem 204 can permit simultaneous access to the read or write channels of multiple memory banks. As another example, the memory subsystem 204 can include arbitration logic such that arbitration between, for example, the outputs of multiple memory banks 214 can result in more than one memory bank's output being used. In these and other examples, though globally managed by the memory subsystem 204, each memory bank can be operated independently of any other.
Having the memory banks 214 be independently accessible can increase the efficiency of the accelerator 202. For example, values can be simultaneously read and provided to each row of the processing engine array 210, so that the entire processing engine array 210 can be in use in one clock cycle. As another example, the memory banks 214 can be read at the same time that results computed by the processing engine array 210 are written to the memory subsystem 204. In contrast, a single memory may be able to service only one read or write at a time. With a single memory, multiple clock cycles can be required, for example, to read input data for each row of the processing engine array 210 before the processing engine array 210 can be started.
In various implementations, the memory subsystem 204 can be configured to simultaneously service multiple clients, including the processing engine array 210, the activation engine 216, the pooling engine 218, and any external clients that access the memory subsystem 204 over a communication fabric 220. In some implementations, being able to service multiple clients can mean that the memory subsystem 204 has at least as many memory banks as there are clients. In some cases, each row of the processing engine array 210 can count as a separate client. In some cases, each column of the processing engine array 210 can output a result, such that each column can count as a separate write client. In some cases, output from the processing engine array 210 can be written into the memory banks 214 that can then subsequently provide input data for the processing engine array 210. As another example, the activation engine 216 and the pooling engine 218 can include multiple execution channels, each of which can be separate memory clients. The memory banks 214 can be implemented, for example, using static random access memory (SRAM).
In various implementations, the memory subsystem 204 can include control logic. The control logic can, for example, keep track of the address spaces of each of the memory banks 214, identify memory banks 214 to read from or write to, and/or move data between the memory banks 214. In some implementations, memory banks 214 can be hardwired to particular clients. For example, a set of memory banks 214 can be hardwired to provide values to the rows of the processing engine array 210, with one memory bank servicing each row. As another example, a set of memory banks can be hard wired to receive values from columns of the processing engine array 210, with one memory bank receiving data for each column.
The processing engine array 210 is the computation matrix of the example accelerator 202. The processing engine array 210 can, for example, execute parallel integration, convolution, correlation, and/or matrix multiplication, among other things. The processing engine array 210 includes multiple processing engines 211, arranged in rows and columns, such that results output by one processing engine 211 can be input directly into another processing engine 211. Processing engines 211 that are not on the outside edges of the processing engine array 210 thus can receive data to operate on from other processing engines 211, rather than from the memory subsystem 204.
In various examples, the processing engine array 210 uses systolic execution, in which data arrives at each processing engine 211 from different directions at regular intervals. In some examples, input data can flow into the processing engine array 210 from the left and weight values can be loaded at the top. In some examples weights and input data can flow from the left and partial sums can flow from top to bottom. In these and other examples, a multiply-and-accumulate operation moves through the processing engine array 210 as a diagonal wave front, with data moving to the right and down across the array. Control signals can be input at the left at the same time as weights, and can flow across and down along with the computation.
In various implementations, the number of columns in the processing engine array 210 determines the computational capacity of the processing engine array 210, and the number of rows determines the required memory bandwidth for achieving maximum utilization of the processing engine array 210. The processing engine array 210 can have, for example, 64 columns and 428 rows, or some other number of columns and rows.
An example of a processing engine 211 is illustrated in
In the illustrated example, an input from above can include a partial sum, p_in, provided either from another processing engine 211 or from a previous round of computation by the processing engine array 210. When starting a computation for a new set of input data, the top row of the processing engine array 210 can receive a fixed value for p_in, such as zero. As illustrated by this example, i and w are multiplied together and the result is summed with p_in to produce a new partial sum, p_out, which can be input into another processing engine 211. Various other implementations of the processing engine 211 are possible.
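As an illustrative sketch (the weights, inputs, and column length below are hypothetical), the per-engine computation and the flow of partial sums down one column can be expressed as:

    # One processing engine: multiply the input i by the stored weight w and
    # add the incoming partial sum p_in to produce p_out, which is passed to
    # the processing engine below.
    def processing_engine(i, w, p_in):
        return i * w + p_in

    # A column of three processing engines accumulating a dot product; the
    # top of the column receives a fixed p_in of zero.
    weights = [0.5, -1.0, 2.0]
    inputs = [4.0, 3.0, 1.0]
    p = 0.0
    for i_val, w_val in zip(inputs, weights):
        p = processing_engine(i_val, w_val, p)
    print(p)  # 4*0.5 + 3*(-1.0) + 1*2.0 = 1.0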
Outputs from the last row in the processing engine array 210 can be temporarily stored in the results buffer 212. The results can be intermediate results, which can be written to the memory banks 214 to be provided to the processing engine array 210 for additional computation. Alternatively, the results can be final results, which, once written to the memory banks 214, can be read from the memory subsystem 204 over the communication fabric 220, to be output by the system.
In some implementations, the accelerator 202 includes an activation engine 216. In these implementations, the activation engine 216 can combine the results from the processing engine array 210 into one or more output activations. For example, for a convolutional neural network, convolutions from multiple channels can be summed to produce an output activation for a single channel. In other examples, accumulating results from one or more columns in the processing engine array 210 may be needed to produce an output activation for a single node in the neural network. In some examples, activation engine 216 can be bypassed.
In various examples, the activation engine 216 can include multiple separate execution channels. In these examples, the execution channels can correspond to the columns of the processing engine array 210, and can perform an operation on the outputs of a column, the result of which can be stored in the memory subsystem 204. In these examples, the activation engine 216 may be able to perform between 1 and n parallel computations, where n is equal to the number of columns in the processing engine array 210. In some cases, one or more of the computations can be performed simultaneously. Examples of computations that each execution channel can perform include exponentials, squares, square roots, identities, binary steps, bipolar steps, sigmoidals, and ramps, among other examples.
In some implementations, the accelerator 202 can include a pooling engine 218. Pooling is the combining of outputs of the columns of the processing engine array 210. Combining can include, for example, computing a maximum value, a minimum value, an average value, a median value, a summation, a multiplication, or another logical or mathematical combination. In various examples, the pooling engine 218 can include multiple execution channels that can operate on values from corresponding columns of the processing engine array 210. In these examples, the pooling engine 218 may be able to perform between 1 and n parallel computations, where n is equal to the number of columns in the processing engine array 210. In various examples, execution channels of the pooling engine 218 can operate in parallel and/or simultaneously. In some examples, the pooling engine 218 can be bypassed.
Herein, the activation engine 216 and the pooling engine 218 may be referred to collectively as execution engines. The processing engine array 210 is another example of an execution engine. Another example of an execution engine is a Direct Memory Access (DMA) engine, which may be located outside the accelerator 202.
Input data 250 can arrive over the communication fabric 220. The communication fabric 220 can connect the accelerator 202 to other components of a processor, such as a DMA engine that can obtain input data 250 from an Input/Output (I/O) device, a storage drive, or a network interface. The input data 250 can be, for example, one-dimensional data, such as a character string or numerical sequence, or two-dimensional data, such as an array of pixel values for an image or frequency and amplitude values over time for an audio signal. In some examples, the input data 250 can be three-dimensional, as may be the case with, for example, the situational information used by a self-driving car or virtual reality data. In some implementations, the memory subsystem 204 can include a separate buffer for the input data 250. In some implementations, the input data 250 can be stored in the memory banks 214 when the accelerator 202 receives the input data 250.
In some examples, the accelerator 202 can implement a neural network processing engine. In these examples, the accelerator 202, for a set of input data 250, can execute a neural network to perform a task for which the neural network was trained. Executing a neural network on a set of input data can be referred to as inference or performing inference.
The weights for the neural network can be stored in the memory subsystem 204, along with input data 250 on which the neural network will operate. The neural network can also include instructions, which can program the processing engine array 210 to perform various computations on the weights and the input data. The instructions can also be stored in the memory subsystem 204, in the memory banks 214 or in a separate instruction buffer. The processing engine array 210 can output intermediate results, which represent the outputs of individual layers of the neural network. In some cases, the activation engine 216 and/or pooling engine 218 may be enabled for computations called for by certain layers of the neural network. The accelerator 202 can store the intermediate results in the memory subsystem 204 for inputting into the processing engine array 210 to compute results for the next layer of the neural network. The processing engine array 210 can further output final results from a last layer of the neural network. The final results can be stored in the memory subsystem 204 and then be copied out to host processor memory or to another location.
As described above, accelerator 202 may execute a set of instructions that reflects, for example, computational flow model 100 of
The processor 302 is an integrated circuit device that can execute program code, in the form of instructions. The program code can be used for various software applications or tools, such as an operating system 320 or the illustrated compiler 330. While the processor 302 is executing a program, the instructions for the program can be stored in the processor memory 304. The instructions can also be stored elsewhere, such as on the storage device 306, and can be loaded into the processor memory 304 when needed by the processor 302. The processor 302 can also use the processor memory 304 for temporary storage of other data on which the processor 302 is operating. In various examples, the processor memory 304 is a volatile memory type, such as a type of Random Access Memory, though non-volatile memory types can, alternatively or additionally, be used for the processor memory 304.
The storage device 306 is an example of a device that can include non-volatile memory. For example, the storage device 306 can be a magnetic disk drive, a solid state drive, or an optical drive, among other examples. The storage device 306 can further be non-transitory, such that program code and other data stored on the storage device 306 remains present when the storage device 306 is not powered on.
The storage device 306 is one example of a peripheral device; peripheral devices are components that can be coupled to the host system 300 to add functionality to the host system 300. Other examples of peripheral devices include the Input/Output devices 308 and the network interface 310. The Input/Output devices 308 can include user input and output devices, such as keyboards, mice, touch screens, microphones, display screens, speakers, printers, and scanners, among other examples. The network interface 310, which can be implemented using a network interface card, can provide access to one or more networks. The network interface 310 can include, for example, a physical port for connecting a network cable and/or wireless antennas for communicating with Wi-Fi and/or cellular networks. The network interface 310 can also be described as an I/O device.
The acceleration engine 312 is another type of peripheral device or I/O device. The acceleration engine 312 is a device that is purpose-built to perform certain operations that can be performed by the processor 302, but can be performed faster by the acceleration engine 312. For example, the acceleration engine 312 can be a neural network accelerator, and, as such, may be able to perform the large scale, parallel computations of a neural network more efficiently than when the computations are performed by the processor 302. As another example, the acceleration engine 312 can be a GPU, and may be optimized to perform the computations needed for graphics rendering. Other examples of devices that can be implemented by the acceleration engine 312 include cryptographic accelerators, compression and decompression accelerators, 3-D accelerators, regular expression accelerators, security accelerators, and others.
In various examples, the acceleration engine 312 can execute program code to perform certain operations. For example, when the acceleration engine 312 is a neural network accelerator, the acceleration engine 312 can be programmed to execute a particular neural network, such as one that performs image recognition or one that performs machine translation. As a further example, to support the execution of a neural network, the acceleration engine 312 can be programmed to perform operations such as copying data for the neural network from processor memory 304 (for example) into the acceleration engine 312, copying input data for the neural network from processor memory 304 into the acceleration engine 312, and/or copying results from the acceleration engine 312 into the processor memory 304, among other examples.
To generate program code for the acceleration engine 312, in various examples, the host system 300 can execute the compiler 330. Compilers, in general, are software programs that translate program code written in a human-readable language into a format (e.g., machine instructions) that can be read and processed by an integrated circuit device. In the example of
The compiler 330 can be activated, for example, when the operating system 320 receives keyboard, mouse, touchscreen, voice commands, or other inputs from the Input/Output devices 308. The inputs can further include parameters for the compiler 330, such as the input code 342 to compile and configuration options for the compilation process. Once the compiler 330 is activated, the processor 302 can load the instructions for the compiler 330 into the processor memory 304, and can execute the instructions.
In the example of
The first stage 332 can receive and process input code 342. The input code 342 can describe a program in a high-level programming language, such as Java, C++, or Tensorflow, among many other examples. The input code 342 can describe, for example, steps to perform image recognition, speech recognition, machine translation, or other operations. The input code 342 can be obtained, for example, from the storage device 306. Alternatively, though not illustrated here, the input code 342 may be located in the processor memory 304 or can be obtained from a network location, using the network interface 310. Processing of the input code 342 can include sorting the operations described in the input code 342 into layers, where the outputs of one layer provide the inputs to a next layer. Processing can also include identifying steps to be performed by the processor 302, rather than by the acceleration engine 312. For example, the processor 302, through the execution of a driver 322, may need to perform steps such as configuring Direct Memory Access (DMA) descriptors for moving data into or out of the acceleration engine 312, among other examples.
The output 334 of the first stage 332 can be organized, for example, in the layers, nodes, and connections between nodes of a neural network. The second stage 336 can perform intermediate processing on this output 334. For example, the operations performed in any one layer, or at any one node in a layer, may be too many for the acceleration engine 312 to perform at the same time. The acceleration engine 312 may, for example, have a limited amount of local storage space for the data needed for a computation, or the computations may be more than the acceleration engine 312 can perform at one time. In this example, the second stage 336 can break the operations of the layer or node down into smaller operations, which can fit into the acceleration engine's local memory and/or can fit into the computing capacity of the acceleration engine 312. Processing of the output 334 of the first stage 332 can include other steps, such as scheduling, or determining the order in which the acceleration engine 312 and/or processor 302 will perform operations, among other examples.
In various examples, the output 338 of the second stage 336 includes the various steps to be performed by components of the acceleration engine 312, in the order that the steps are to be performed. The output 338 can be represented, for example, as a data flow graph, where the nodes in the graph represent memory operations, computations, and other operations, and the edges or connections between the nodes represent dependencies between the nodes, such as data dependencies, memory dependencies, or operational dependencies, among other examples.
The third stage 340 can operate on the output 338 of the second stage 336, and perform various steps before producing the instructions that are to be executed by the acceleration engine 312. These steps can include, for example, removing redundant dependencies, resolving or handling dependencies between nodes by inserting synchronization instructions into the code, identifying possible optimizations in memory footprint or memory bandwidth usage, and other operations.
The output of the third stage 340 is compiled code 344, which may include machine instructions in binary format. In some examples, the compiled code 344 can be stored in the processor memory 304. Alternatively or additionally, the compiled code 344 can be copied to the storage device 306 or to a network location. As noted above, the acceleration engine 312 may be located at a different host system, in which case the compiled code 344 can be sent over the network interface 310 to the other host system.
In the example of
In the illustrated example, an input tensor 401 is received by FCL operator 402-1, which also receives a weight tensor 403. FCL operator 402-1 can generate a first intermediate output tensor 410. Intermediate output tensor 410 is then processed by addition operator 402-2, which can add a bias from bias tensor 405 to each element of first intermediate output tensor 410 to generate a second intermediate output tensor 414. Activation function operator 402-3 can then apply an activation function (e.g., ReLU) to each element of second intermediate output tensor 414 to generate output tensor 416. Output tensor 416 can represent, for example, a classification output of input tensor 401.
Each tensor in
Each of FCL operator 402-1, addition operator 402-2, and activation function operator 402-3 can perform a set of repetitive operations on each element of the tensors. For example, as described above, FCL operator 402-1 can represent the behavior of a node in a fully connected layer, in which FCL operator 402-1 can perform a set of repetitive multiplication operations between each element of input tensor 401 and a weight from weight tensor 403 to generate a product. In addition, addition operator 402-2 can perform a set of repetitive addition operations between each element of first intermediate output tensor 410 and a bias from bias tensor 405 to generate a corresponding element of second intermediate output tensor 414. Further, activation function operator 402-3 can perform a set of repetitive activation function operations on each element of second intermediate output tensor 414 to generate a corresponding element of output tensor 416.
The repetitive operations in a neural network operator can be represented in the form of a loop. A loop has a canonical induction variable that starts at zero and increments by one for each iteration, where the upper bound of the variable does not change during program execution. The induction variable incremented by the loop can be used to index a particular element of the tensor for an operation by the loop. In a case where the repetitive operations involve a multi-dimensional tensor that includes multiple tensors defined along different dimensions, the repetitive operations can be represented in a loop-nest, which may be manifested in certain intermediate representations generated by the compiler. In a simple example, a loop-nest includes an outer loop and an inner loop within the body of the outer loop. The outer loop and the inner loop may each iterate across a different range of values. Each range of values can correspond to a range of a dimension of the multi-dimensional tensor. For example, in a case of a two-dimensional tensor, the first iteration of the outer loop triggers the inner loop, which executes across its entire range of values to index the elements of a first tensor along a first column in multiple iterations. Upon completion of the inner loop, the outer loop moves to a second value within its range of values and again triggers the inner loop, which again executes across its entire range of values to index the elements of a second tensor along a second column in multiple iterations.
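For illustration, the following simplified Python sketch shows a loop-nest of the kind described above operating on a two-dimensional tensor; the tensor contents, the weight values, and the multiply operation are hypothetical and only illustrate how the induction variables index the tensor elements.

M, N = 4, 4
inp = [[float(i * N + j) for j in range(N)] for i in range(M)]   # hypothetical input tensor
weight = [[0.5] * N for _ in range(M)]                           # hypothetical weight tensor
out = [[0.0] * N for _ in range(M)]

for i in range(M):          # outer loop: canonical induction variable i starts at 0, increments by 1
    for j in range(N):      # inner loop: induction variable j indexes the other dimension
        out[i][j] = inp[i][j] * weight[i][j]    # operation on the element at index (i, j)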
Loop representation 500 can be part of a program generated by a compiler, such as compiler 330, as part of a compilation operation to generate executable instructions for accelerator 202. As to be described below, based on loop representation 500, compiler 330 can determine a schedule of the operations for each of FCL operator 402-1, addition operator 402-2, and activation function operator 402-3 to be performed at accelerator 202.
The compiler can perform a topological sort operation 520 on dataflow graph 510 to generate a linear graph 530 comprising nodes 514a-514d. The topological sort can be performed based on the data dependency among the nodes indicated by edges 516a-b and 518a-b. For example, the compiler can traverse through dataflow graph 510 starting from node 514a and following the direction of the edges, and assign a number to each node based on the order by which the node appears in the traversal path, and the number can represent a position of the node in the topology of dataflow graph 510. The compiler can then sort the nodes based on the numbers assigned to the nodes, and then generate linear graph 530 that reflects the sorted order of the numbers, which in turn reflects the order of the neural operators in the program. For example, in
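As a simplified illustration of such a sort, the following sketch linearizes a small, hypothetical dataflow graph using Kahn's algorithm, one common way to produce a linear order that respects data dependencies; the compiler may use a different traversal to assign the position numbers.

from collections import deque

graph = {"node_a": ["node_b", "node_c"],    # node -> nodes that consume its output
         "node_b": ["node_d"],
         "node_c": ["node_d"],
         "node_d": []}

indegree = {n: 0 for n in graph}
for n in graph:
    for succ in graph[n]:
        indegree[succ] += 1

ready = deque(n for n, d in indegree.items() if d == 0)
linear_order = []
while ready:
    n = ready.popleft()
    linear_order.append(n)                  # position in linear_order reflects the sorted order
    for succ in graph[n]:
        indegree[succ] -= 1
        if indegree[succ] == 0:
            ready.append(succ)

print(linear_order)                         # ['node_a', 'node_b', 'node_c', 'node_d']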
After generating linear graph 530, the compiler can perform a program construction operation 400, in which the compiler traverses through linear graph 530 and translates the neural network operators represented by nodes 514a-d into loop-nests 542a-d in program 544. Each neural network operator can be represented by a loop-nest such as loop-nests 502-506 of
After including the loop-nest templates in program 544, the compiler may perform additional processing on program 544, such as loop fusion, to fuse multiple loop-nests together into a single loop-nest while preserving the original behavior of the multiple loop-nests. Loop fusion operation can be performed based on identifying loops (or loop-nests) that have a common induction variable range, and putting the instructions of those loops under a single loop.
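A minimal sketch of such a fusion, using hypothetical element-wise operations, is shown below; the two original loops share the same induction-variable range, so their instructions can be placed under a single loop without changing the result.

N = 8
a = [0.0] * N
b = [0.0] * N
for j in range(N):              # first loop
    a[j] = j * 2.0
for j in range(N):              # second loop with the same induction-variable range
    b[j] = a[j] + 1.0

a2 = [0.0] * N
b2 = [0.0] * N
for j in range(N):              # fused loop preserving the original behavior
    a2[j] = j * 2.0
    b2[j] = a2[j] + 1.0

assert a == a2 and b == b2      # fusion did not change the results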
The loop-nests shown in
In some examples, a loop of the aforementioned neural network operator may have no loop-carried dependency between iterations, where the operation in one iteration does not depend on output generated by the operation in another iteration. This allows the compiler to schedule the different iterations to be executed in parallel, instead of executing each iteration sequentially.
Since there is no loop-carried dependency within loop 502a or loop 502b, each iteration of loop 502a and loop 502b can be executed in parallel for computation of a particular element indexed by the induction variables i and j. In the example of
To reduce the usage of the memory resources, a compiler (e.g., compiler 330) can perform an array contraction operation on loop-nest 502, where a single memory element is provided to store some of the tensors, such as I1.
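A minimal sketch of the effect of array contraction is shown below, with hypothetical data and with multiply, add, and max operations standing in for the S0, S1, and S2 instructions; because each intermediate value is consumed in the same iteration that produces it, a single memory element can be reused for each intermediate tensor instead of a full array.

M, N = 2, 4
inp = [[float(i * N + j) for j in range(N)] for i in range(M)]
out = [[0.0] * N for _ in range(M)]

i1 = 0.0                             # single memory element for tensor I1
i2 = 0.0                             # single memory element for tensor I2
for i in range(M):
    for j in range(N):
        i1 = inp[i][j] * 2.0         # S0: produces I1
        i2 = i1 + 1.0                # S1: consumes I1, produces I2
        out[i][j] = max(i2, 0.0)     # S2: consumes I2, produces the output element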
On the other hand, as the content of I1(0, 0) may need to be read before a new iteration can start, each iteration of the loop-nest is executed sequentially, with one iteration of each of S0, S1, and S2 instructions being executed at a given time. Schedule 612 is an example schedule of execution generated by the compiler based on the memory constraint imposed by the array contraction operation, and based on data dependencies. For example, as shown in schedule 612, at time T0 the compiler can schedule a first iteration of S0 instruction corresponding to i=0 and j=0 (represented by S0(0, 0)) to be executed to generate I1(0, 0). Moreover, based on the data dependency of S1 instruction on S0 instruction, the compiler can schedule the first iteration of S1 instruction corresponding to i=0 and j=0 (represented by S1(0, 0)) to be executed at time T1 to consume I1(0, 0) and generate I2(0, 0). Further, based on the data dependency of S2 on S1, the compiler can schedule the first iteration of S2 instruction corresponding to i=0 and j=0 (represented by S2(0, 0)) to be executed at time T2 to consume I2(0, 0) and generate I3(0, 0).
In addition, the compiler can schedule the execution of a second iteration of S0 instruction corresponding to i=0 and j=1 (represented by S0(0, 1)) to generate a second version of I1(0, 0), after the first version of I1(0, 0) is consumed by the first iteration of S1, within time T1. The compiler can also schedule the execution of a second iteration of S1 instruction corresponding to i=0 and j=1 (represented by S1(0, 1)) at time T2 following time T1, and the execution of a second iteration of S2 instruction corresponding to i=0 and j=1 (represented by S2(0, 1)) at time T3 following time T2, based on the data dependency.
Further, the compiler can also schedule the execution of a third iteration of S0 instruction corresponding to i=0 and j=2 (represented by S0(0, 2)) at time T2 after time T1, a third iteration of S1 instruction corresponding to i=0 and j=2 (represented by S1(0, 2)) at time T3 after time T2, and a fourth iteration of S0 instruction corresponding to i=0 and j=3 (represented by S0(0, 3)) at time T3 after S0(0, 2) at time T2.
In addition, from schedule 612, the compiler can also determine that the maximum memory footprint starts at time T2, at which point the memory being used by loop-nest 502 includes three memory addresses to store the outputs of the S0, S1, and S2 instructions. The compiler can also perform memory allocation to allocate memory addresses to be used by the S0, S1, and S2 instructions.
The compiler can perform the array contraction operation as part of the compilation operation to generate executable instructions for the neural network hardware accelerator. For example, the compiler can perform the array contraction operation after generating program 544 of
Prior to performing an array contraction operation on a tensor, the compiler can determine whether the tensor has any loop-carried dependency. As described above, the tensor may include a plurality of elements, with each element associated with an index, and a loop's induction variable can set the index of the element to be accessed in an iteration of the loop. In a case where the tensor is multi-dimensional and has multiple tensors, each element can be associated with multiple indices (e.g., i and j in
The compiler can carry out a two-step test to determine whether there is loop-carried dependency in the tensor. As a first step, the compiler can determine whether the indices along one dimension are set by two different loops, and whether the tensor is written in one loop and read in another loop. If both are true, the compiler may determine there is loop-carried dependency between the two different loops.
The first step of the two-step test can also be performed on a multi-dimensional tensor. For example, in program 630, tensor X has a first dimension with an index k set by parent loop 632. Tensor X also has a second dimension which is indexed by an index i set by child loop 634 and indexed by an index j set by child loop 636. In such an example, the compiler may determine not to carry out array contraction along the second dimension. As such, tensors associated with different k indices along the first dimension are stored in different locations of the memory, while at each memory location tensor elements associated with different i/j indices along the second dimension (and a particular k index along the first dimension) can be mapped to a set of memory addresses.
On the other hand, for tensors/elements of which the indices of one dimension are set by a single loop, the compiler can carry out a second step of the two-step test to determine whether there is loop-carried dependency based on determining the live interval of each element of the tensor. The live interval of a tensor element can be defined by the logical timestamps of when the tensor element is first written and when it is last read. A determination of whether the tensor has loop-carried dependency can be made if there is overlap between the live intervals of the tensor elements. The logical timestamps can be defined based on the induction variables of the loops when a tensor element is first written and when the tensor element is last read.
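As an illustration of the second step, the following sketch derives a live interval (first-write timestamp, last-read timestamp) for each tensor element from a hypothetical, logically timestamped access trace, and reports a loop-carried dependency if any two live intervals overlap; the trace format and element keys are assumptions for illustration.

accesses = [                                 # (logical timestamp, access kind, element index)
    (0, "write", (0, 0)), (1, "read", (0, 0)),
    (2, "write", (0, 1)), (3, "read", (0, 1)),
]

intervals = {}
for t, kind, elem in accesses:
    first_write, last_read = intervals.get(elem, (None, None))
    if kind == "write" and first_write is None:
        first_write = t                      # timestamp of the first write
    if kind == "read":
        last_read = t if last_read is None else max(last_read, t)   # timestamp of the last read
    intervals[elem] = (first_write, last_read)

def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

elems = list(intervals.values())
loop_carried = any(overlap(elems[i], elems[j])
                   for i in range(len(elems)) for j in range(i + 1, len(elems)))
print(intervals, loop_carried)               # intervals [0,1] and [2,3] do not overlap -> False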
On the other hand, the right of
Although the array contraction operation of
One way to relax the stringent sequential execution of iterations is by applying a modulo operator on the indexing of the tensor. A compiler (e.g., compiler 330) can apply the modulo operator on the original indexing of the tensor in a program to generate a remainder of the index, after determining that the tensor has no loop-carried dependency based on the two-step test illustrated in
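A minimal sketch of the effect of a modulo operator of two on the indexing of intermediate tensor I1 is shown below; the data values and the multiply and add operations standing in for the S0 and S1 instructions are hypothetical.

N = 8
inp = [float(j) for j in range(N)]
out = [0.0] * N
i1 = [0.0, 0.0]                      # two addresses for I1 instead of one (or of N)

for j0 in range(0, N, 2):            # one batch of two iterations per step
    for j in (j0, j0 + 1):           # the two S0 iterations of the batch can run in
        i1[j % 2] = inp[j] * 2.0     # parallel: they write to different addresses j % 2
    for j in (j0, j0 + 1):           # the dependent S1 iterations form the next batch
        out[j] = i1[j % 2] + 1.0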
The provision of two memory addresses to store tensor I1 allows two iterations of the S0 instruction to be executed in parallel in a batch, as both iterations can write the output to different addresses simultaneously. Likewise, the provision of two memory addresses to store tensor I2 also allows two iterations of S1 (and S2 instructions, which depend on S1) to be executed in parallel in a batch, followed by another batch including another two iterations of S1 (and S2 instructions).
Schedule 712 is an example schedule of execution generated by the compiler based on the relaxed memory constraint provided by the modulo operator, and based on data dependency. As shown in schedule 712, at time T0, the compiler can schedule first and second iterations of the S0 instructions (represented by S0(0, 0) and S0(0, 1)) to be executed in parallel in a first batch to generate a first version of I1(0, 0) and I1(0, 1). Moreover, based on the data dependency between the S1 instruction and the S0 instruction, the compiler can schedule the first and second iterations of the S1 instruction (represented by S1(0, 0) and S1(0, 1)) to be executed in parallel in a second batch at time T1. Further, based on the data dependency of S2 on S1, the compiler can schedule the first and second iterations of the S2 instruction (represented by S2(0, 0) and S2(0, 1)) to be executed in parallel in a third batch at time T2.
In addition, to allow the overwriting of the first version of I1(0, 0) and I1(0, 1), the compiler can schedule the parallel execution of third and fourth iterations of the S0 instruction (represented by S0(0, 2) and S0(0, 3)) in the second batch at time T1 to generate a second version of I1(0, 0) and I1(0, 1), after the parallel execution of S1(0, 0) and S1(0, 1) completes and consumes the first version of I1(0, 0) and I1(0, 1). The compiler can also schedule the parallel execution of third and fourth iterations of the S1 instruction (represented by S1(0, 2) and S1(0, 3)) at time T2, followed by the parallel execution of third and fourth iterations of the S2 instruction (represented by S2(0, 2) and S2(0, 3)) at time T3, based on the data dependency.
Further, to allow overwriting of the second version of I1(0, 0) and I1(0, 1), the compiler can schedule the parallel execution of fifth and sixth iterations of S0 instruction (represented by S0(0, 4) and S0(0, 5)) in the third batch at time T2 to generate a third version of I1(0, 0) and I1(0, 1), after the parallel execution of S1(0, 2) and S1(0, 3) completes and consumes the second version of I1(0, 0) and I1(0, 1). The compiler can also schedule the parallel execution of fifth and sixth iterations of the S1 instruction (represented by S1(0, 4) and S1(0, 5)) at time T3, followed by the parallel execution of fifth and sixth iterations of the S2 instruction (represented by S2(0, 4) and S2(0, 5)) after time T3 (not shown in schedule 712). Furthermore, the compiler can also schedule the parallel execution of seventh and eighth iterations of the S0 instruction (represented by S0(0, 6) and S0(0, 7)) at time T3.
In addition, from schedule 712, the compiler can also determine that the maximum memory footprint starts at time T2, at which point the memory being used by loop-nest 502 includes six memory addresses to store the outputs of the S0, S1, and S2 instructions. The compiler can also perform memory allocation to allocate memory addresses to be used by the S0, S1, and S2 instructions.
Although
As described above, a compiler (e.g., compiler 330) can insert a modulo operator in the indexing of a tensor in program 544 of
As described above, the value of the modulo operator can determine a number of iterations of an instruction that can be executed in parallel, as well as the resulting memory footprint used to support the parallel execution of the instruction. For example, for a modulo operator of m, m elements of the tensors are mapped to m different addresses, and m iterations of the loop can be included in a batch to be executed in parallel to access the m different addresses. To improve utilization of the computation and memory resources available for execution of a loop (or loop-nest), the compiler can perform a global modulo allocation operation, in which the compiler can determine the modulo operators for the indexing of tensors in the loops in program 544 of
From program 800, the compiler can identify tensors that have no loop-carried dependency based on the two-step test described in
In addition, a modulo operator of 1 is assigned to loop 808 (loop3) to change the indexing of tensor L3, whereas a modulo operator of 8 is assigned to loop 810 (loop4) to change the indexing of tensor L4. As a result of the assignment of these initial modulo operators, each iteration of loop3 can be executed sequentially, while within each iteration of loop3, eight iterations of loop4 can be executed in parallel. Moreover, one memory address is allocated to store tensor L3, and eight memory addresses are allocated to store tensor L4.
The initial modulo operators for each of loops 802-810 can be determined based on a maximum degree of parallel execution supported by the neural network hardware accelerator, as well as the size of memory space allocated for each tensor. Specifically, the number of iterations made available for parallel execution is equal to the product of the initial modulo operators of each loop in a loop-nest. For example, assuming that a loop-nest includes a parent outer loop that indexes elements of a first tensor and a child inner loop that indexes elements of a second tensor, where the first tensor is assigned a first initial modulo operator of m and the second tensor is assigned a second initial modulo operator of n, the product m×n can determine the number of iterations made available for parallel execution. That product is typically smaller than or equal to the number of iterations the neural network hardware accelerator can execute in parallel. In the example of
In some examples, the compiler can determine the initial modulo operators of the tensors accessed by a loop-nest based on a topology traversal operation similar to
The assignment order of initial modulo operators (parent loop followed by child loops, or vice versa) can be based on the architecture of the system that executes the neural network operators. For example, for a system that has multiple hardware processors, the compiler may preferentially set the initial modulo operators for the parent loop first, followed by the child loops, to manage parallel execution of the parent loops across the multiple hardware processors. On the other hand, for a system that has a single hardware processor with multiple execution engines, the compiler may preferentially set the initial modulo operators for the child loops first, followed by the parent loop, to manage parallel execution of the child loops across the execution engines for each iteration of the parent loop.
The following illustrates excerpts of pseudocode of the compiler for determining the initial modulo operators:

max_accum_modulo_alloc = max(product(m[l_i] for l_i in path p) for each path p)

for each child loop of l_i:
    ...
if not eligible_for_modulo_allocation(l_i):
    ...
max_accum_alloc_size = \
    ...
m_l_i = max(MaxParallelism, max_accum_alloc_size)
m[l_i] = min(l_i.tripcount, m_l_i)

def enumerate_accum_alloc_size_on_each_path(l_i, alloc_size=1):
    if not self.children:
        ...
    for child in self.children:
        ...
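Read together with the description above, one simplified and hypothetical reading of the initial assignment is sketched below in self-contained Python; the Loop class, the MAX_PARALLELISM value, the trip counts, and the child-first traversal are assumptions for illustration, not the compiler's actual code. Each loop receives the largest modulo operator such that the product of the modulo operators along any path through the loop-nest does not exceed the maximum parallelism, capped by the loop's trip count.

class Loop:
    def __init__(self, name, tripcount, children=None, eligible=True):
        self.name = name
        self.tripcount = tripcount
        self.children = children or []
        self.eligible = eligible
        self.m = 1                       # modulo operator; 1 means the loop runs sequentially

MAX_PARALLELISM = 8                      # assumed maximum number of parallel iterations

def max_path_product(loop, skip):
    # Largest product of already-assigned modulo operators on any path from
    # this loop down to a leaf, ignoring the loop currently being assigned.
    own = 1 if loop is skip else loop.m
    if not loop.children:
        return own
    return own * max(max_path_product(c, skip) for c in loop.children)

def assign_initial_modulos(loop):
    for child in loop.children:          # child loops are assigned first
        assign_initial_modulos(child)
    if not loop.eligible:                # e.g., the tensor has loop-carried dependency
        return
    budget = MAX_PARALLELISM // max(1, max_path_product(loop, skip=loop))
    loop.m = min(loop.tripcount, max(1, budget))

inner = Loop("loop_inner", tripcount=16)
outer = Loop("loop_outer", tripcount=4, children=[inner])
assign_initial_modulos(outer)
print(outer.m, inner.m)                  # 1 8: eight inner iterations per batch, outer sequential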
After determining the initial modulo operators, as part of the global modulo allocation operation, the compiler can reduce some or all of the initial modulo operators based on whether the total memory footprint by the tensors exceeds the available memory space. In one example, the compiler can determine the live intervals of elements of tensors, and determine whether there are overlaps in the live intervals. If there are overlaps, the compiler may allocate separate memory addresses for the elements of the tensors, and determine the total memory footprint based on the allocated memory addresses.
On the other hand, in a case where L1 and L2 are tensors of a multi-dimensional tensor, the compiler can determine the total number of memory addresses to be mapped to the elements of the multi-dimensional tensor based on a product of the modulo operators, as described above in
If total memory footprint 820 at this point is below the available memory space, the compiler can stop the global modulo allocation operation. On the other hand, if total memory footprint 820 is above the available memory space, the compiler can identify an overflowing tensor that cannot fit into the available memory space. In the example of
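A minimal sketch of this reduction step is shown below; the tensor names, element size, initial modulo operators, memory budget, and the halve-the-largest-first strategy are hypothetical assumptions for illustration, and the compiler may select and reduce the overflowing tensor differently.

AVAILABLE_BYTES = 64
ELEM_BYTES = 4

modulos = {"L1": 8, "L2": 8, "L3": 1, "L4": 8}   # modulo operator per tensor (hypothetical)

def total_footprint(mods):
    # Each tensor occupies one address of ELEM_BYTES per mapped element.
    return sum(m * ELEM_BYTES for m in mods.values())

while total_footprint(modulos) > AVAILABLE_BYTES and max(modulos.values()) > 1:
    overflowing = max(modulos, key=modulos.get)   # pick the tensor with the largest operator
    modulos[overflowing] = max(1, modulos[overflowing] // 2)

print(modulos, total_footprint(modulos))  # {'L1': 4, 'L2': 4, 'L3': 1, 'L4': 4} 52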
Referring to
Referring to
In some examples, the neural network operator can also include one or more DMA tensors accessed by direct memory access (DMA) instructions to transfer data between an external memory and the local memory (e.g., memory subsystem 204) of the neural network hardware accelerator. The DMA instructions can also be included in the same loop as other neural network operations (e.g., additions, multiply-and-accumulation, activation function processing, etc.). The DMA tensors can be stored in the local memory of the neural network hardware accelerator in the same way as the local tensors accessed by other neural network operations.
In some examples, as part of the global modulo allocation operation, the compiler can determine the modulo operators for the indexing of tensors T0 and T1 to determine the memory footprints of tensors T0 and T1, as well as a number of iterations of DMA operation 904 and addition operation 906 to be executed in parallel in a batch. The compiler can determine the modulo operators for the indexing of tensors T0 and T1 based on the degree of parallelism supported by the neural network accelerator as well as available memory space. The compiler can also estimate the memory footprint of tensors T0 and T1 for a given modulo operator, and adjust the modulo operators until tensors T0 and T1 can fit into the available memory space.
The compiler can determine the memory footprint of tensors T0 and T1 for a given modulo operator based on estimating live intervals of tensors T0 and T1 in the local memory, and adding the footprints of tensor elements having overlapping live intervals, as described above. If the total memory footprint exceeds the available memory space, the compiler may preferentially reduce the modulo operator for tensor L1, before reducing the modulo operator of tensor L0. This can allow more DMA instructions to be executed in parallel than the number of iterations of other neural network operations, even though the DMA instruction and the neural network operation are defined within the same loop and the DMA tensor and the data tensor are indexed by the same induction variable. Such arrangements can improve the performance of the neural network hardware accelerator, especially in a case where the DMA operations present a substantial bottleneck. Moreover, while the DMA instructions start execution in parallel, they typically do not complete at the same time due to the sequential access of the external memory. As a result, other neural network operations that depend on the DMA operations need not have the same parallelism and can be performed sequentially after the DMA operations complete. Accordingly, parallelism of the neural network operations can be reduced with minimal effect on the execution speed of these operations, while at the same time reducing the memory footprint of tensors T0 and T1.
After the global modulo allocation operation completes and the modulo operators for the tensors of the program are determined, the compiler can determine a schedule of execution of the different iterations of the loops in the program and the mapping of the tensors to the memory addresses based on the modulo operators. The compiler can perform the scheduling based on data dependency between the tensors, and based on predicting the total completion time of the DMA operations using a delay model, which can account for various sources of delay such as memory access delay, memory data transfer delay over the interconnect, etc. The compiler can then generate executable binary codes that reflect the schedule of execution of the different iterations of the loops in the program.
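As a simplified illustration of such scheduling, the sketch below assigns start times in an as-soon-as-possible fashion using a hypothetical per-operation delay model and a hypothetical dependency list; the compiler's actual delay model and scheduling heuristics may differ.

delays = {"dma": 2, "add": 1}             # assumed delay per operation type
ops = {                                   # op -> (type, operations it depends on), in dependency order
    "dma_0": ("dma", []),
    "dma_1": ("dma", []),
    "add_0": ("add", ["dma_0"]),
    "add_1": ("add", ["dma_1"]),
}

finish = {}
schedule = {}
for op, (kind, deps) in ops.items():
    start = max((finish[d] for d in deps), default=0)   # start when all dependencies complete
    schedule[op] = start
    finish[op] = start + delays[kind]

print(schedule)   # DMA operations start at time 0; the dependent additions start at time 2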
Based on the data dependency of addition operations 906a and 906b on DMA operations 904a and 904b, where addition operations 906a and 906b consume tensor elements L0(0) and L0(1) generated by DMA operations 904a and 904b, the compiler can schedule parallel execution of a first group of addition operations, including addition operations 906a and 906b, at time T1. The compiler can also have addition operations 906a and 906b reuse the memory addresses allocated to DMA operations 904a and 904b such that the compiler does not need to allocate additional memory addresses to addition operations 906a and 906b, and the total memory footprint remains within total memory footprint 920 when addition operations 906a and 906b are executed in parallel with DMA operations 904c and 904d at time T1.
In addition, the compiler can predict that a second subset of DMA operations being executed in parallel, including DMA operations 904c and 904d, completes by time T2. Based on the data dependency of addition operations 906c and 906d on DMA operations 904c and 904d, where addition operations 906c and 906d consume tensor elements L0(2) and L0(3) generated by DMA operations 904c and 904d, the compiler can schedule parallel execution of a second group of addition operations, including addition operations 906c and 906d, at time T2. The compiler can also have addition operations 906c and 906d reuse the memory addresses allocated to addition operations 906a and 906b, so that the compiler does not need to allocate additional memory addresses to addition operations 906c and 906d.
Schedule 912 on the right of
Method 1000 starts with step 1002, in which compiler 330 receives information representing a dataflow graph of a neural network, the neural network comprising a neural network operator. In some examples, compiler 330 can receive input codes involving neural network computations and compile the input codes to generate a dataset representing the dataflow graph of the neural network. An example of the dataflow graph is shown in
In step 1004, compiler 330 receives first information of computation resources of the neural network hardware accelerator assigned (or intended) to execute the neural network operator. Moreover, in step 1006, compiler 330 receives second information of a portion of a local memory of the neural network hardware accelerator assigned to execute the neural network operator. The first information may indicate, for example, a number of parallel executions of the neural network operator supported by the neural network hardware accelerator. The second information may indicate a size of the portion of the local memory, which can represent the memory space available to support the parallel execution.
In step 1008, compiler 330 determines, based on the dataflow graph, iterations of an operation on elements of a tensor included in the neural network operator. Specifically, the compiler can translate each neural network operator represented in the linear graph into a loop that includes instructions to access a tensor. In a case where the tensor is multi-dimensional and includes multiple tensors defined along different dimensions, the compiler can translate the neural network operator that accesses the multi-dimensional tensor into a loop-nest, with a parent outer-loop and one or more child inner-loops. The parent outer-loop and the child inner-loops can be associated with different induction variables associated with different dimensions. The loops can update the induction variables to select different tensors in different iterations. In some examples, the translation can be based on accessing loop-nest templates in a compute definition library that associates different loop-nest templates with different neural network operators. In some examples, additional processing, such as a loop fusion operation to fuse two or more loop-nests together while preserving the original behaviors of the loop-nests, can also be performed.
In step 1010, the compiler can determine, based on the first information and the second information, a mapping between the elements of the tensor to addresses in the portion of the local memory, and a number of the iterations of the operation to be included in a batch, wherein the number of the iterations in the batch are to be executed in parallel by the neural network hardware accelerator.
Specifically, the mapping can be based on an array contraction operation, in which the compiler can identify one or more loops that index the tensor using the loop's induction variables, and determine an initial modulo operator for each such loop as part of a global modulo allocation operation. The modulo operator can operate on the original indices (e.g., directly from the induction variables) of elements of the tensor in the program to generate remainder values. The remainder values can represent memory addresses. Through the modulo operation, elements of the tensor having different original indices can be mapped to a range of remainder values each representing a different address in the memory. The modulo operator can indicate how many elements of the tensor are mapped to different addresses in the memory. For example, for a modulo operator of m, m elements of the tensors are mapped to m different addresses, and m iterations of the loop can be included in a batch to be executed in parallel to access the m different addresses. Different groups of m elements are accessed in different batches, and the different groups are all mapped to the same set of m addresses. In a case of a multi-dimensional tensor including multiple tensors defined along multiple dimensions and associated with a loop-nest, the compiler can determine an initial modulo operator for each loop that indexes the different tensors along different dimensions.
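The following small sketch, with hypothetical values of m and N, checks this property of the mapping: each batch of m consecutive iterations touches m distinct addresses, and all batches are mapped to the same set of addresses.

m, N = 4, 16
mapping = {j: j % m for j in range(N)}          # element index -> memory address

batches = [list(range(b, b + m)) for b in range(0, N, m)]
for batch in batches:
    addresses = {mapping[j] for j in batch}
    assert addresses == set(range(m))           # every batch reuses the same m addresses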
Prior to determining the mapping, compiler 330 can determine that the tensor has no loop-carried dependency using the two-step test described in
After determining the initial modulo operators, as part of the global modulo allocation operation, compiler 330 can reduce some or all of the initial modulo operators based on whether the total memory footprint of the tensors exceeds the available memory space. Specifically, referring to
On the other hand, if the total memory footprint is above the available memory space, the compiler can determine an overflowing tensor that cannot fit into the available memory space. Referring to
In some examples, referring to
In step 1012, compiler 330 generates a schedule of execution of the batches of the iterations of the operations. Compiler 330 can determine a schedule of execution of the different iterations of the loops in the program and the mapping of the tensors to the memory addresses based on the modulo operators. The compiler can perform the scheduling based on data dependency between the tensors and based on estimating the total completion time of the DMA operations, which can include the memory access delay as well as the memory data transfer delay over the interconnect. The compiler can then generate executable instructions that reflect the schedule of execution of the different iterations of the loops in the program, based on techniques shown in
The modules described herein may be software modules, hardware modules or a suitable combination thereof. If the modules are software modules, the modules can be embodied on a non-transitory computer readable medium and processed by a processor in any of the computer systems described herein. It should be noted that the described processes and architectures can be performed either in real-time or in an asynchronous mode prior to any user interaction. The modules may be configured in the manner suggested in the preceding figures, and/or functions described herein can be provided by one or more modules that exist as separate modules and/or module functions described herein can be spread over multiple modules. Any of the methods described herein can be implemented as a computer-readable medium or computer program product comprising instructions which, when the program is executed by one or more computers, cause the one or more computers to carry out the steps of the method. Such computer program products can be transmitted, over a wired or wireless network, in a data carrier signal carrying the computer program product.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the disclosure as set forth in the claims.
Other variations are within the spirit of the present disclosure. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated examples thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the disclosure to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the disclosure, as defined in the appended claims.
The use of the terms “a” and “an” and “the” and similar referents in the context of describing the disclosed examples (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. The term “connected” is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate examples of the disclosure and does not pose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.
Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is intended to be understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain examples require at least one of X, at least one of Y, or at least one of Z to each be present.
Various examples of this disclosure are described herein, including the best mode known to the inventors for carrying out the disclosure. Variations of those examples may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate and the inventors intend for the disclosure to be practiced otherwise than as specifically described herein. Accordingly, this disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.