Apparatuses and methods for prefetch generation are disclosed. Prefetching circuitry receives addresses specified by load instructions and can cause retrieval of a data value from an address before that address is received. Stride determination circuitry determines stride values as a difference between a current address and a previously received address. Plural stride values corresponding to a sequence of received addresses are determined. Multiple stride storage circuitry stores the plurality of stride values determined by the stride determination circuitry. New address comparison circuitry determines whether a current address corresponds to a matching stride value based on the plurality of stride values stored in the multiple stride storage circuitry. Prefetch initiation circuitry can cause a data value to be retrieved from a further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values. By the use of multiple stride values, more complex load address patterns can be prefetched.

Patent: 10769070
Priority: Sep 25 2018
Filed: Sep 25 2018
Issued: Sep 08 2020
Expiry: Nov 24 2038
Extension: 60 days
Assg.orig Entity: Large
Status: currently ok
18. A method of prefetching data values comprising:
retrieving data values from addresses specified by load instructions for storage in a storage component;
receiving the addresses specified by the load instructions;
causing retrieval of a further data value from a further address before the further address is received, wherein causing the retrieval comprises:
determining a stride value as a difference between a current address and a previously received address;
determining a stride sequence comprising a plurality of stride values corresponding to a sequence of received addresses;
storing the plurality of stride values determined;
determining at least one cumulative stride value as a sum of at least two of the plurality of stride values stored;
performing a comparison to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored;
in response to receiving the current address, determining whether the at least one cumulative stride value is the matching stride value; and
initiating a prefetch procedure to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.
19. Apparatus comprising:
means for retrieving data values from addresses specified by load instructions for storage in a storage component;
means for receiving the addresses specified by the load instructions;
means for causing retrieval of a further data value from a further address before the further address is received, wherein the means for causing the retrieval comprises:
means for determining a stride value as a difference between a current address and a previously received address;
means for determining a stride sequence comprising a plurality of stride values corresponding to a sequence of received addresses;
means for storing the plurality of stride values determined;
means for determining at least one cumulative stride value as a sum of at least two of the plurality of stride values stored;
means for performing a comparison to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored, wherein the means for performing a comparison is responsive to reception of the current address to determine whether the at least one cumulative stride value is the matching stride value; and
means for initiating a prefetch procedure to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.
1. Apparatus comprising:
data loading circuitry to retrieve data values from addresses specified by load instructions for storage in a storage component;
prefetching circuitry to receive the addresses specified by the load instructions and to cause the data loading circuitry to retrieve a further data value from a further address before the further address is received, wherein the prefetching circuitry comprises:
stride determination circuitry to determine a stride value as a difference between a current address and a previously received address, the stride determination circuitry comprising stride sequence determination circuitry to determine a plurality of stride values corresponding to a sequence of received addresses;
multiple stride storage circuitry to store the plurality of stride values determined by the stride determination circuitry;
cumulative stride determination circuitry to determine at least one cumulative stride value as a sum of at least two of the plurality of stride values stored in the multiple stride storage circuitry;
new address comparison circuitry to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored in the multiple stride storage circuitry, wherein the new address comparison circuitry is responsive to reception of the current address to determine whether the at least one cumulative stride value is the matching stride value; and
prefetch initiation circuitry to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.
2. The apparatus as claimed in claim 1, wherein the multiple stride storage circuitry is arranged to store a table data structure, wherein entries of the table data structure comprise each of the multiple stride values stored in association with respective predicted addresses, and
wherein the multiple stride storage circuitry is arranged to determine a predicted address for storage in an entry of the table data structure as the current address plus the stride value.
3. The apparatus as claimed in claim 2, wherein the prefetching circuitry comprises sample address storage to store the previously received address for reference by the stride determination circuitry when determining the plurality of stride values for the sequence of received addresses.
4. The apparatus as claimed in claim 2, wherein the prefetching circuitry is responsive to receipt of the current address to cause the multiple stride storage circuitry to perform a lookup for the current address in the propagation table data structure,
and the multiple stride storage circuitry is responsive to a match between the current address and a hit predicted address in a hit entry in the table data structure to send a correct prediction signal to the prefetch initiation circuitry,
wherein the prefetch initiation circuitry is responsive to the correct prediction signal to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the stride value in the hit entry.
5. The apparatus as claimed in claim 4, wherein the further address is the current address with addition of an address offset, wherein the address offset is the stride value in the hit entry multiplied by a prefetch distance factor.
6. The apparatus as claimed in claim 4, wherein the multiple stride storage circuitry is further responsive to the match between the current address and the hit predicted address in the hit entry in the table data structure to modify the hit predicted address in the hit entry by addition of the stride value.
7. The apparatus as claimed in claim 4, wherein entries of the table data structure further comprise a counter value for each entry,
wherein the multiple stride storage circuitry is responsive to a non-match between the current address and any of the predicted addresses in the entries of the table data structure to select a victim entry in the table data structure in dependence on the counter values of the entries of the table data structure and to allocate a new entry to replace the victim entry comprising the predicted address and the stride value.
8. The apparatus as claimed in claim 7, wherein the multiple stride storage circuitry is responsive to the match between the current address and the hit predicted address in the hit entry in the table data structure to set the counter value for the hit entry to a predetermined maximum value.
9. The apparatus as claimed in claim 7, wherein the multiple stride storage circuitry is responsive to the non-match between the current address and any of the predicted addresses in the entries of the table data structure to select the victim entry in the table data structure as a victim entry for which the counter value has a predetermined minimum value.
10. The apparatus as claimed in claim 7, wherein the multiple stride storage circuitry is arranged to store an entry pointer,
and is responsive to the non-match between the current address and any of the predicted addresses in the entries of the table data structure to increment the entry pointer to indicate a next entry in the table data structure and to select the victim entry in the table data structure in further dependence on the entry pointer.
11. The apparatus as claimed in claim 10, wherein the multiple stride storage circuitry is responsive to an increment of the entry pointer causing the entry pointer to reach a predetermined entry of the table data structure to decrease the counter values of the entries of the table data structure.
12. The apparatus as claimed in claim 1, wherein the multiple stride storage circuitry further comprises an incoming stride buffer to store the plurality of stride values corresponding to a sequence of received addresses,
wherein the incoming stride buffer has a FIFO structure.
13. The apparatus as claimed in claim 1, wherein the multiple stride storage circuitry comprises cumulative stride value storage to store the at least one cumulative stride value.
14. The apparatus as claimed in claim 13, wherein the cumulative stride value storage further comprises score value storage to store at least one score value in association with the at least one cumulative stride value,
wherein the prefetch initiation circuitry is responsive to the at least one score value meeting at least a threshold value to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the at least one cumulative stride value.
15. The apparatus as claimed in claim 14, wherein the cumulative stride value storage is responsive to there being a match between a cumulative stride value stored in the cumulative stride value storage and at least one sum of at least two of the plurality of stride values stored in the multiple stride storage circuitry, to increase a score value stored in association with the cumulative stride value.
16. The apparatus as claimed in claim 14, wherein the cumulative stride value storage is responsive to there being no match between a cumulative stride value stored in the cumulative stride value storage and any sum of at least two of the plurality of stride values stored in the multiple stride storage circuitry, to decrease a score value stored in association with the cumulative stride value.
17. The apparatus as claimed in claim 16, wherein the cumulative stride value storage is responsive to the score value stored in association with the cumulative stride value reaching a predetermined minimum value to cause the cumulative stride value to be replaced by a replacement cumulative stride value calculated as a sum of at least two of the plurality of stride values currently stored in the multiple stride storage circuitry.

The present disclosure relates to data processing. In particular, the present disclosure relates to prefetching.

A data processing apparatus which performs data processing operations on data values which are retrieved from specified addresses in memory may be provided with prefetching capability. The apparatus is arranged to retrieve data values from addresses specified by load instructions and the prefetching capability enables the apparatus to monitor the pattern of addresses being specified and to predict addresses which will be specified in the near future on the basis of that pattern. Data values can then be caused to be retrieved from those predicted addresses, before such addresses are actually specified by executed load instructions. This is a technique by which the latency associated with the retrieval of a data value from memory may be mitigated. For example, if a sequence of addresses specified by load instructions is observed in which the addresses have a regular spacing (sometimes referred to as a “stride”), then further addresses may be generated for the purpose of prefetching by extending that “stride” sequence.
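By way of illustration only, the single-stride extrapolation described above may be sketched in software as follows. The function name `next_stride_addresses` and the requirement of at least three observed addresses are assumptions made for this sketch and are not taken from the disclosure.

```python
def next_stride_addresses(addresses, count=2):
    """Extend a regularly spaced address sequence by `count` addresses.

    Returns an empty list when the observed spacing is not regular.
    The minimum of three observations is an assumption of this sketch.
    """
    if len(addresses) < 3:
        return []
    # Differences between consecutive addresses.
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    if len(set(strides)) != 1:
        return []  # no single stride describes the sequence
    stride = strides[0]
    last = addresses[-1]
    return [last + stride * k for k in range(1, count + 1)]


predicted = next_stride_addresses([0x1000, 0x1040, 0x1080])
# predicted == [0x10C0, 0x1100]
```

A sequence with irregular spacing, such as `[0, 8, 24]`, yields no prediction in this sketch.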

In one example embodiment described herein there is an apparatus comprising: data loading circuitry to retrieve data values from addresses specified by load instructions for storage in a storage component; prefetching circuitry to receive the addresses specified by the load instructions and to cause the data loading circuitry to retrieve a further data value from a further address before the further address is received, wherein the prefetching circuitry comprises: stride determination circuitry to determine a stride value as a difference between a current address and a previously received address, the stride determination circuitry comprising stride sequence determination circuitry to determine a plurality of stride values corresponding to a sequence of received addresses; multiple stride storage circuitry to store the plurality of stride values determined by the stride determination circuitry; new address comparison circuitry to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored in the multiple stride storage circuitry; and prefetch initiation circuitry to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.

In one example embodiment described herein there is a method of prefetching data values comprising: retrieving data values from addresses specified by load instructions for storage in a storage component; receiving the addresses specified by the load instructions; causing retrieval of a further data value from a further address before the further address is received, wherein causing the retrieval comprises: determining a stride value as a difference between a current address and a previously received address; determining a stride sequence comprising a plurality of stride values corresponding to a sequence of received addresses; storing the plurality of stride values determined; performing a comparison to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored; and initiating a prefetch procedure to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.

In one example embodiment described herein there is an apparatus comprising: means for retrieving data values from addresses specified by load instructions for storage in a storage component; means for receiving the addresses specified by the load instructions; means for causing retrieval of a further data value from a further address before the further address is received, wherein the means for causing the retrieval comprises: means for determining a stride value as a difference between a current address and a previously received address; means for determining a stride sequence comprising a plurality of stride values corresponding to a sequence of received addresses; means for storing the plurality of stride values determined; means for performing a comparison to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored; and means for initiating a prefetch procedure to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.

The present invention will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which:

FIG. 1 schematically illustrates an apparatus in one example embodiment;

FIG. 2 schematically illustrates an apparatus in one example embodiment;

FIG. 3 schematically illustrates an apparatus in one example embodiment;

FIG. 4 is a flow diagram showing a sequence of steps which are taken in accordance with the method of one example embodiment;

FIG. 5 schematically illustrates an apparatus in one example embodiment; and

FIG. 6 is a flow diagram showing a sequence of steps which are taken according to the method of one example embodiment.

Before discussing the embodiments with reference to the accompanying figures, the following description of embodiments is provided.

In accordance with one example configuration there is provided an apparatus comprising: data loading circuitry to retrieve data values from addresses specified by load instructions for storage in a storage component; prefetching circuitry to receive the addresses specified by the load instructions and to cause the data loading circuitry to retrieve a further data value from a further address before the further address is received, wherein the prefetching circuitry comprises: stride determination circuitry to determine a stride value as a difference between a current address and a previously received address, the stride determination circuitry comprising stride sequence determination circuitry to determine a plurality of stride values corresponding to a sequence of received addresses; multiple stride storage circuitry to store the plurality of stride values determined by the stride determination circuitry; new address comparison circuitry to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored in the multiple stride storage circuitry; and prefetch initiation circuitry to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.

The present techniques recognise that there are circumstances under which an approach of merely determining a prefetch stride on the basis of the difference between successively specified addresses may not be capable of accurately predicting the addresses from which data values will be retrieved, and therefore may not allow correspondingly successful prefetching to be carried out. For example, where a program includes a loop which increments load addresses, this can result in a regular pattern of load addresses from which a simple stride may be determined. However, when a program includes a nested loop, the load addresses can exhibit a repeating stride pattern which incorporates more than one stride value. One stride value may be associated with an inner loop, whilst a possibly entirely different stride value may be associated with an outer loop. If the inner loop iteration count is sufficiently large then a simple single stride value may be appropriate and be able to support beneficial prefetching, yet if the inner loop iteration count is relatively small then the resulting pattern of load addresses can become more changeable and a single stride approach may not be able to support efficient prefetching. These problems will also become more pronounced with further levels of loop nesting, or other types of iterative load address evolution.
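The multi-stride pattern produced by a nested loop may be illustrated with a short sketch. The loop bounds, the 8-byte inner step and the 64-byte row pitch are hypothetical values chosen only for this example.

```python
def nested_loop_addresses(base, outer_count, inner_count, inner_step, row_pitch):
    """Addresses loaded by a two-level loop over a row-major layout."""
    return [base + i * row_pitch + j * inner_step
            for i in range(outer_count)
            for j in range(inner_count)]


addrs = nested_loop_addresses(base=0, outer_count=3, inner_count=3,
                              inner_step=8, row_pitch=64)
# Differences between consecutive load addresses.
strides = [b - a for a, b in zip(addrs, addrs[1:])]
# strides == [8, 8, 48, 8, 8, 48, 8, 8]
```

With an inner iteration count of only three, the inner-loop stride of 8 is interrupted every third load by a stride of 48, so a predictor that tracks a single stride would repeatedly mispredict at the row boundary.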

In this context the present techniques propose that a sequence of received addresses is monitored and a plurality of stride values corresponding to that sequence is determined. The multiple stride values so determined are stored and can then be made use of to identify more complex patterns in the load instructions received. The plurality of stride values can be made use of in a variety of ways to identify such a multi-stride pattern in the received load instructions.

In some embodiments the multiple stride storage circuitry is arranged to store a table data structure, wherein entries of the table data structure comprise each of the multiple stride values stored in association with respective predicted addresses, and wherein the multiple stride storage circuitry is arranged to determine a predicted address for storage in an entry of the table data structure as the current address plus the stride value. Accordingly, the multiple stride storage circuitry stores multiple predicted addresses (each with an associated stride value), enabling the apparatus to cover a range of patterns which may result from the multiple stride values as extrapolations from a current address. Operation of the apparatus may for example include a populating phase in which the table data structure is populated by observation of the sequence of received addresses, in order to build up a set of predicted addresses in association with the multiple stride values.
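A populating phase of this kind might be modelled as below. This is a sketch only: the `Entry` type, the `populate` function and the choice to measure every stride against one fixed reference address are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    stride: int
    predicted_address: int


def populate(reference_address, received):
    """Build one table entry per received address.

    Assumption of this sketch: each stride is the difference between the
    received address and a single fixed reference address.
    """
    table = []
    for current in received:
        stride = current - reference_address
        # Predicted address is the current address plus the stride value.
        table.append(Entry(stride=stride, predicted_address=current + stride))
    return table


table = populate(0x100, [0x108, 0x110, 0x140])
```

Here the three received addresses give strides of 0x8, 0x10 and 0x40, stored against predicted addresses 0x110, 0x120 and 0x180 respectively.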

The previously received address against which the stride determination circuitry compares the current address in order to determine the stride value may be variously defined. In some embodiments the prefetching circuitry comprises sample address storage to store the previously received address for reference by the stride determination circuitry when determining the plurality of stride values for the sequence of received addresses. Accordingly, this sample address can then be used as a base with respect to which the plurality of stride values are calculated, and hence with reference to which the predicted addresses are also generated. Thus populated, the table data structure can then be used for reference when a new address is received and in some embodiments the prefetching circuitry is responsive to receipt of the current address to cause the multiple stride storage circuitry to perform a lookup for the current address in the propagation table data structure, and the multiple stride storage circuitry is responsive to a match between the current address and a hit predicted address in a hit entry in the table data structure to send a correct prediction signal to the prefetch initiation circuitry, wherein the prefetch initiation circuitry is responsive to the correct prediction signal to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the stride value in the hit entry. Thus a match between the current address and a predicted address stored in the table data structure indicates that the mechanism by which the predicted address was generated (i.e. its associated stride value with respect to the sample address) has been validated and on that basis a prefetch may be initiated making use of the associated stride value. This stride value can be applied to the current address in various ways.
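The lookup and correct-prediction behaviour described above may be sketched as follows. Representing the table as a Python dictionary keyed by predicted address, and the name `lookup`, are simplifying assumptions of this sketch; a hardware table could hold several entries with the same predicted address.

```python
def lookup(table, current):
    """Return the further (prefetch) address on a hit, otherwise None.

    `table` maps a predicted address to its associated stride value.
    """
    stride = table.get(current)
    if stride is None:
        return None  # no entry predicted this address
    # Hit: the stride that produced the prediction is reused for the prefetch.
    return current + stride


table = {0x110: 0x8, 0x120: 0x10}
hit = lookup(table, 0x110)
miss = lookup(table, 0x130)
```

In this example `hit` is 0x118 and `miss` is `None`.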

For example in some embodiments the further address is the current address with addition of an address offset, wherein the address offset is the stride value in the hit entry multiplied by a prefetch distance factor. The prefetch distance factor can thus be used to “scale up” the stride value as appropriate, in particular in dependence on the memory latency, i.e. how long in advance of a data value being required it is necessary to initiate the prefetch, such that it is present when required. In some embodiments the multiple stride storage circuitry is further responsive to the match between the current address and the hit predicted address in the hit entry in the table data structure to modify the hit predicted address in the hit entry by addition of the stride value. Thus this updates the entry in the table data structure which has correctly predicted an address, such that at the next iteration of that stride it may (potentially) predict another address.
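The two behaviours described above, namely scaling the stride by a prefetch distance factor and advancing the predicted address of the hit entry, can be shown together in a minimal sketch. The dictionary layout of the entry and the name `on_hit` are hypothetical.

```python
def on_hit(entry, current, distance):
    """Handle a hit: compute the further address and advance the entry."""
    # Address offset = stride in the hit entry * prefetch distance factor.
    further = current + entry["stride"] * distance
    # The hit entry is updated so that it can predict the next address.
    entry["predicted"] += entry["stride"]
    return further


entry = {"stride": 64, "predicted": 0x2000}
further = on_hit(entry, current=0x2000, distance=4)
```

With a stride of 64 and a distance factor of 4, the prefetch targets 0x2100 (four strides ahead), while the entry's predicted address advances by a single stride to 0x2040.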

In some embodiments entries of the table data structure further comprise a counter value for each entry, wherein the multiple stride storage circuitry is responsive to a non-match between the current address and any of the predicted addresses in the entries of the table data structure to select a victim entry in the table data structure in dependence on the counter values of the entries of the table data structure and to allocate a new entry to replace the victim entry comprising the predicted address and the stride value. Accordingly, an associated counter value for each entry may be used to set relative priorities between the entries in the table data structure such that when an entry in the table data structure should be replaced, i.e. an existing entry is to be selected as a victim to allow allocation of a new entry, the victim selection may make use of these relative priorities. The counter values may be administered in a number of ways in order to set up such relative priorities, but in some embodiments the multiple stride storage circuitry is responsive to the match between the current address and the hit predicted address in the hit entry in the table data structure to set the counter value for the hit entry to a predetermined maximum value. Similarly in some embodiments the multiple stride storage circuitry is responsive to the non-match between the current address and any of the predicted addresses in the entries of the table data structure to select the victim entry in the table data structure as a victim entry for which the counter value has a predetermined minimum value.

The particular values employed for such counter values may be variously defined, but in some embodiments these can be relatively limited values, in order to limit the storage space that they require. For example a 2-bit counter may suffice.
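One possible software model of the counter-based replacement is sketched below using a 2-bit counter. The initial counter value given to a newly allocated entry, and the decision to drop the allocation when no entry holds the minimum value, are assumptions of this sketch and are not specified in the passage above.

```python
COUNTER_MAX = 3      # largest value of a 2-bit counter
COUNTER_MIN = 0
INITIAL_COUNTER = 1  # assumption: starting value for a new entry


def access(table, current, stride):
    """Look up `current`; on a miss try to replace a minimum-counter entry.

    `table` is a list of dicts with keys "predicted", "stride", "counter".
    Returns "hit", "allocated" or "dropped".
    """
    for entry in table:
        if entry["predicted"] == current:
            entry["counter"] = COUNTER_MAX  # a hit restores full priority
            return "hit"
    for index, entry in enumerate(table):
        if entry["counter"] == COUNTER_MIN:
            # Victim found: allocate a new entry in its place.
            table[index] = {"predicted": current + stride,
                            "stride": stride,
                            "counter": INITIAL_COUNTER}
            return "allocated"
    return "dropped"  # assumption: no victim available, nothing is replaced


table = [{"predicted": 0x100, "stride": 8, "counter": 2},
         {"predicted": 0x200, "stride": 16, "counter": 0}]
first = access(table, 0x100, 8)
second = access(table, 0x300, 32)
third = access(table, 0x999, 4)
```

In this run the first access hits, the second replaces the entry whose counter is at the minimum, and the third finds no entry at the minimum value.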

In some embodiments the multiple stride storage circuitry is arranged to store an entry pointer, and is responsive to the non-match between the current address and any of the predicted addresses in the entries of the table data structure to increment the entry pointer to indicate a next entry in the table data structure and to select the victim entry in the table data structure in further dependence on the entry pointer. Accordingly, the victim entry (absent any other determining factor such as the above mentioned counter value) is selected on a round-robin basis.

Furthermore, in some embodiments the multiple stride storage circuitry is responsive to an increment of the entry pointer causing the entry pointer to reach a predetermined entry of the table data structure to decrease the counter values of the entries of the table data structure. The predetermined entry of the table data structure may be variously defined, but may for example be a first entry in the table, it therefore being understood that the entry pointer wraps around once it reaches the end of the table and this may be the point chosen at which to decrease the counter values of each entry in the table data structure. In other words this provides a mechanism by which over time the confidence associated with a given entry in the table decreases if it has not been maintained through hits in the lookup procedure.
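The wrap-around of the entry pointer and the associated decrease of the counter values may be modelled as in the following sketch. The class name `EntryPointer`, the choice of entry 0 as the predetermined entry and the clamping of counters at zero are assumptions made for illustration; how the pointer is combined with the counter values to pick the victim is left out.

```python
class EntryPointer:
    """Round-robin pointer over table entries with counter decay on wrap."""

    def __init__(self, counters):
        self.counters = list(counters)
        self.pointer = 0

    def on_miss(self):
        """Advance the pointer; decrease all counters when it reaches entry 0."""
        self.pointer = (self.pointer + 1) % len(self.counters)
        if self.pointer == 0:
            # Assumption: counters are clamped at zero.
            self.counters = [max(c - 1, 0) for c in self.counters]
        return self.pointer


ep = EntryPointer([3, 1, 0])
visited = [ep.on_miss(), ep.on_miss(), ep.on_miss()]
# visited == [1, 2, 0]; ep.counters == [2, 0, 0] after the wrap
```

Entries which are not refreshed by hits therefore lose priority each time the pointer completes a pass over the table.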

In some embodiments the apparatus further comprises cumulative stride determination circuitry to determine at least one cumulative stride value as a sum of at least two of the plurality of stride values stored in the multiple stride storage circuitry, and wherein the new address comparison circuitry is responsive to reception of the current address to determine whether the at least one cumulative stride value is the matching stride value. Accordingly, not only are multiple stride values stored in the apparatus, but at least one combination of these multiple stride values is thus determined to generate a cumulative stride value. There may indeed be more than one cumulative stride value determined, and at least two of the multiple stride values may then be combined to generate this. In this manner a greater variety of potential load address patterns can be accommodated and predicted.
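As a sketch of how cumulative stride values might be formed, the function below sums the most recent two, three and so on stored stride values. Restricting the sums to contiguous runs ending at the newest stride is an assumption of this sketch; the text above only requires a sum of at least two stored stride values.

```python
def cumulative_strides(strides):
    """Sums of the most recent 2, 3, ... stored stride values."""
    if not strides:
        return []
    sums = []
    total = strides[-1]
    # Walk backwards from the second-newest stride, extending the window.
    for stride in reversed(strides[:-1]):
        total += stride
        sums.append(total)
    return sums


sums = cumulative_strides([8, 8, 48])
# sums == [56, 64]
```

For a stride sequence that repeats 8, 8, 48, the three-stride sum of 64 stays the same wherever in the repeating pattern the window ends, which is why a cumulative stride can capture a pattern that no individual stride value describes.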

In order to support such cumulative stride determination, the apparatus may be variously configured, but in some embodiments the multiple stride storage circuitry further comprises an incoming stride buffer to store the plurality of stride values corresponding to a sequence of received addresses, wherein the incoming stride buffer has a FIFO structure. Thus a sequence of stride values is temporarily held in the FIFO buffer corresponding to the sequence of received addresses. This “snapshot” of the load addresses which are received can therefore be used as a sample from which the patterns of load addresses can be determined. It should be appreciated that the incoming stride buffer may have a range of different sizes in different implementations of the present techniques, depending on the size of pattern (or “depth”) which should be covered.
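A FIFO of recent stride values can be modelled with a bounded double-ended queue, as in the sketch below. The class name `IncomingStrideBuffer` and the depth of three are hypothetical; `collections.deque` with `maxlen` discards the oldest item when a new one is appended to a full queue.

```python
from collections import deque


class IncomingStrideBuffer:
    """Holds the most recent `depth` strides between consecutive addresses."""

    def __init__(self, depth):
        self.strides = deque(maxlen=depth)  # oldest stride drops out when full
        self.previous = None

    def receive(self, address):
        if self.previous is not None:
            self.strides.append(address - self.previous)
        self.previous = address


buffer = IncomingStrideBuffer(depth=3)
for address in [0, 8, 16, 64, 72]:
    buffer.receive(address)
# list(buffer.strides) == [8, 48, 8]
```

The five addresses produce four strides (8, 8, 48, 8), of which the buffer retains only the newest three.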

In some embodiments the multiple stride storage circuitry comprises cumulative stride value storage to store the at least one cumulative stride value.

This at least one cumulative stride value thus stored may then be used in a variety of ways in order to identify load address patterns, but in some embodiments the cumulative stride value storage further comprises score value storage to store at least one score value in association with the at least one cumulative stride value, wherein the prefetch initiation circuitry is responsive to the at least one score value meeting at least a threshold value to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the at least one cumulative stride value. Hence control over the triggering of the prefetch may be exerted by the setting of the threshold value, corresponding to a confidence in the at least one cumulative stride value.

The score values stored in this manner may be administered in a variety of ways, but in some embodiments the cumulative stride value storage is responsive to there being a match between a cumulative stride value stored in the cumulative stride value storage and at least one sum of at least two of the plurality of stride values stored in the multiple stride storage circuitry, to increase a score value stored in association with the cumulative stride value.

Similarly in some embodiments the cumulative stride value storage is responsive to there being no match between a cumulative stride value stored in the cumulative stride value storage and any sum of at least two of the plurality of stride values stored in the multiple stride storage circuitry, to decrease a score value stored in association with the cumulative stride value.

Moreover, this mechanism may be used to replace entries in the cumulative stride value storage once their associated “confidence” has sufficiently decreased. Thus in some embodiments the cumulative stride value storage is responsive to the score value stored in association with the cumulative stride value reaching a predetermined minimum value to cause the cumulative stride value to be replaced by a replacement cumulative stride value calculated as a sum of at least two of the plurality of stride values currently stored in the multiple stride storage circuitry.
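The score handling described in the preceding paragraphs (increase on a match, decrease on no match, replacement at a minimum value, and prefetching once a threshold is met) may be sketched as follows. The class name, the numeric limits, the saturation at a maximum score and the choice of which candidate sum becomes the replacement are all assumptions of this sketch.

```python
class CumulativeStrideEntry:
    """One stored cumulative stride value with an associated score."""

    def __init__(self, stride, score=1, threshold=2, minimum=0, maximum=3):
        self.stride = stride
        self.score = score
        self.threshold = threshold
        self.minimum = minimum
        self.maximum = maximum

    def update(self, candidate_sums):
        """`candidate_sums`: sums of at least two currently stored strides."""
        if self.stride in candidate_sums:
            self.score = min(self.score + 1, self.maximum)
        else:
            self.score = max(self.score - 1, self.minimum)
            if self.score == self.minimum and candidate_sums:
                # Assumption: the last candidate sum is taken as replacement.
                self.stride = candidate_sums[-1]

    def further_address(self, current):
        """Prefetch address if the score meets the threshold, else None."""
        if self.score >= self.threshold:
            return current + self.stride
        return None


entry = CumulativeStrideEntry(stride=64)
before = entry.further_address(0x100)   # score 1 is below the threshold
entry.update([56, 64])                  # match: score rises to 2
after = entry.further_address(0x100)    # prefetch at 0x100 + 64
entry.update([16, 24])                  # no match: score falls to 1
entry.update([16, 24])                  # score reaches 0, stride replaced
```

In this run `before` is `None`, `after` is 0x140, and after two consecutive non-matches the stored cumulative stride has been replaced by 24.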

In accordance with one example configuration there is provided a method of prefetching data values comprising: retrieving data values from addresses specified by load instructions for storage in a storage component; receiving the addresses specified by the load instructions; causing retrieval of a further data value from a further address before the further address is received, wherein causing the retrieval comprises: determining a stride value as a difference between a current address and a previously received address; determining a stride sequence comprising a plurality of stride values corresponding to a sequence of received addresses; storing the plurality of stride values determined; performing a comparison to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored; and initiating a prefetch procedure to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.

In accordance with one example configuration there is provided an apparatus comprising: means for retrieving data values from addresses specified by load instructions for storage in a storage component; means for receiving the addresses specified by the load instructions; means for causing retrieval of a further data value from a further address before the further address is received, wherein the means for causing the retrieval comprises: means for determining a stride value as a difference between a current address and a previously received address; means for determining a stride sequence comprising a plurality of stride values corresponding to a sequence of received addresses; means for storing the plurality of stride values determined; means for performing a comparison to determine whether the current address corresponds to a matching stride value based on the plurality of stride values stored; and means for initiating a prefetch procedure to cause the further data value to be retrieved from the further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values.

Particular embodiments will now be described with reference to the figures.

FIG. 1 illustrates a data processing apparatus 100 in accordance with one embodiment. The apparatus comprises execution circuitry 102, a load/store unit 104, a data cache 106, and a memory 108. The execution circuitry 102 is arranged to perform data processing operations making use of data values which are retrieved from storage at addresses in the memory 108. Retrieval of such data values is performed in response to load instructions which specify the address from which the data value should be retrieved, as well as the location into which that data value should be locally stored (e.g. in a storage component such as a register). The load/store unit 104 is provided to support the interaction of the execution circuitry 102 with the memory system such that these operations may effectively be delegated to it, leaving the execution circuitry 102 to continue with other operations whilst the interaction with the memory system continues. As part of the configuration of the load/store unit 104, it is provided with a prefetcher 110. This prefetcher 110 monitors the addresses specified by load instructions executed by the execution circuitry 102 and seeks to identify a pattern within them. For this purpose the load/store unit 104 also comprises address prediction circuitry 112 which forms part of this address prediction mechanism and provides the prefetcher with addresses from which it should retrieve data values from memory, even though these addresses have not yet been specified by a load instruction. The memory system is schematically illustrated in FIG. 1 by the data cache 106 and the memory 108, but it should be appreciated that the memory system could be considerably more complex than this, for example involving multiple levels of cache hierarchies.
The prefetching actions of the prefetcher 110 cause the prefetched data values to be brought into the data cache 106, and it is this that enables the data processing apparatus to avoid the full latency associated with retrieving a data value from the memory 108, since when a required data value (in response to an address specified by a load instruction) is already present in the data cache 106 it may be returned much more quickly to the execution circuitry 102 than if it needed to be retrieved from the memory 108. The address prediction circuitry 112 of the load/store unit shown in FIG. 1 has a configuration such that it determines multiple stride values corresponding to a sequence of received addresses, i.e. a sequence of addresses specified by load instructions executed by the execution circuitry 102, and these multiple stride values are stored in multiple stride storage circuitry of the address prediction circuitry 112. Further detail of this approach and of various embodiments thereof is described with reference to the figures which follow.

FIG. 2 schematically illustrates an apparatus 200 in one example embodiment. The illustrated components are provided to monitor a stream of incoming addresses specified by load instructions. Accordingly, the components shown in FIG. 2 may be considered to be an example of the address prediction circuitry 112 and the prefetcher 110 of FIG. 1. An incoming address 202, under the control of control gate 203, can be caused to be stored in the previous address storage 204. As will be discussed with reference to the figures that follow, depending on the particular embodiment this may take place as an ongoing process for each incoming address that is received, or only periodically in dependence on other factors. The incoming address 202 is also received by the addition circuitry 205. The addition circuitry 205 is in fact arranged to perform a subtraction: it receives a negative input from the previous address storage 204 and thus subtracts the previous address from the incoming address 202. This generates a stride value 206. The apparatus 200 comprises multiple stride storage circuitry 208 which is arranged to hold multiple stride values simultaneously. There are a number of ways in which the multiple stride storage 208 may be populated with values, as will be described in more detail with reference to the example embodiments which follow. The apparatus 200 also comprises cumulative stride determination circuitry 210. On the basis of the multiple stride values stored in the multiple stride storage 208 the cumulative stride determination circuitry 210 can determine a number of different strides and/or stride combinations, and hence in general the apparatus is capable of detecting cumulative stride patterns in the load address stream. The output of the cumulative stride determination circuitry 210 is passed to the prefetch address calculation circuitry 212 which also receives the incoming address 202.
On the basis of these inputs the prefetch address calculation circuitry 212 can then generate an address for prefetching which is passed to the prefetcher 214 to implement.
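The data path of FIG. 2 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions rather than a description of the circuitry: the class name `StrideTracker`, the capacity of four stored strides, the update of the previous address on every incoming address, and the rule that a cumulative stride is the sum of the most recent strides are all choices made for this example.

```python
from collections import deque


class StrideTracker:
    """Hypothetical software model of the FIG. 2 data path (names are illustrative)."""

    def __init__(self, capacity=4):
        self.previous_address = None            # models previous address storage 204
        self.strides = deque(maxlen=capacity)   # models multiple stride storage 208

    def receive(self, incoming_address):
        """Compute the stride to the previous address and store it."""
        stride = None
        if self.previous_address is not None:
            # Addition circuitry 205 with a negated input, i.e. a subtraction.
            stride = incoming_address - self.previous_address
            self.strides.append(stride)
        # Assumption: the previous address is updated for every incoming address.
        self.previous_address = incoming_address
        return stride

    def cumulative_stride(self, depth):
        """Assumed combination rule: sum of the `depth` most recent strides."""
        return sum(list(self.strides)[-depth:])


def prefetch_address(incoming_address, stride):
    """Models prefetch address calculation 212: incoming address plus a stride."""
    return incoming_address + stride


tracker = StrideTracker()
for address in (0, 2, 12, 14):
    tracker.receive(address)
```

With the four addresses shown, the stored strides are 2, 10 and 2, so `tracker.cumulative_stride(2)` is 12 and `prefetch_address(14, 12)` gives 26.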

As an example of a context in which the present techniques may be used, when the load address stream relates to a section of program code which includes nested loops, and where more than one of the nested loops comprises a load address generation, a repeating stride pattern of load addresses may result which incorporates more than one stride. For example in the case of just two loops (an inner loop and an outer loop) the inner loop may exhibit one stride, whilst the outer loop may have its own quite different stride. In such examples, when the inner loop iteration count is large enough, memory latency can be hidden by simply detecting the stride of the inner loop and prefetching sufficiently far ahead. However, if the inner loop iteration count is relatively small, it is no longer possible to prefetch sufficiently far ahead without knowing the outer loop stride.
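To make the nested-loop case concrete, the following Python sketch generates the load addresses of a hypothetical two-level loop nest and derives the strides between successive addresses. The function name and the particular values (inner stride 2, outer stride 12, two inner iterations) are assumptions chosen only so that a short inner loop yields a repeating two-stride pattern.

```python
def nested_loop_addresses(base, outer_count, inner_count, inner_stride, outer_stride):
    """Yield the addresses loaded by an illustrative two-level loop nest."""
    for i in range(outer_count):
        for j in range(inner_count):
            yield base + i * outer_stride + j * inner_stride


addresses = list(nested_loop_addresses(base=0, outer_count=3, inner_count=2,
                                       inner_stride=2, outer_stride=12))
# Stride between each address and the one received before it.
strides = [b - a for a, b in zip(addresses, addresses[1:])]
```

Here `addresses` is 0, 2, 12, 14, 24, 26 and `strides` alternates between 2 and 10. A prefetcher tracking only the inner stride of 2 would mispredict at every outer-loop step, whereas the sum 2 + 10 = 12 predicts every second address exactly.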

FIG. 3 schematically illustrates an apparatus in one example embodiment. Generally, the apparatus 300 comprises address and stride calculation circuitry 304, a propagation table 306, and a prefetcher 308. An incoming address 302 in the load address stream is received by the address and stride calculation circuitry 304 and held in new address storage 310. In an initial set up phase the new address 310 is stored as a sample address in the sample address storage 311. The sample address 311 can also later be updated, as will be described in more detail below. Once a sample address 311 has been stored then further incoming addresses 302 held by the new address storage 310 are compared, in that each is received by the addition circuitry 312, which receives a negative input from the sample address storage 311, thus subtracting the sample address 311 from the new address 310. This generates a stride value 313. The stride value 313 is added to the new address 310 by the addition circuitry 314 to form a predicted address 315. The stride value 313 and the predicted address 315 together form the content of an entry which is stored in the propagation table 306. The propagation table 306 is provided as a content-addressable-memory (CAM). In this manner, the propagation table 306 is populated with entries as an initial sequence of the load address stream is monitored.

As shown in the figure, the new address 310 is also used to perform a lookup in the propagation table 306. When the new address 310 matches a predicted address, this indicates that the propagation table entry has correctly identified a true cumulative stride in the load address stream. The corresponding predicted address 320 and its associated stride value 321 are read out. Addition circuitry 323 receives both of these values and adds them together, this forming an updated predicted address entry 324 which replaces the predicted address in the entry. Also the stride value 321 is multiplied by a predetermined fetch distance 325 and then added (by means of addition circuitry 326) to the new address 310 and the address thus generated is passed to the prefetch circuitry 308 to initiate a prefetch for the data value at this predicted address.
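The entry creation and hit handling described for FIG. 3 can be sketched in Python as follows. This is a simplified software model under stated assumptions: the names `PropagationEntry`, `make_entry`, `lookup` and `on_hit` are hypothetical, and the CAM lookup is modelled as a linear search over a list.

```python
class PropagationEntry:
    """One propagation table entry (field names are illustrative)."""

    def __init__(self, stride, predicted_address):
        self.stride = stride
        self.predicted_address = predicted_address


def make_entry(sample_address, new_address):
    """Build an entry from a sample address and a later incoming address."""
    stride = new_address - sample_address                  # addition circuitry 312
    return PropagationEntry(stride, new_address + stride)  # addition circuitry 314


def lookup(table, new_address):
    """Model of the CAM lookup: return the entry predicting this address, if any."""
    for entry in table:
        if entry.predicted_address == new_address:
            return entry
    return None


def on_hit(entry, new_address, fetch_distance):
    """Advance the entry's prediction and return the address to prefetch."""
    entry.predicted_address += entry.stride                # addition circuitry 323
    # Stride multiplied by the fetch distance 325, added by addition circuitry 326.
    return new_address + entry.stride * fetch_distance


table = [make_entry(sample_address=0, new_address=12)]
hit = lookup(table, 24)
prefetch = on_hit(hit, 24, fetch_distance=4)
```

In this example the entry holds a stride of 12 and initially predicts address 24. When address 24 arrives, the prediction advances to 36 and the prefetch is issued for 24 + 12 × 4 = 72.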

Accordingly, on an ongoing usage basis, new addresses are used to look up in the propagation table 306 and the hit entry is used to generate a prefetch and the stored predicted address in that entry is updated. However, when a miss occurs for the lookup in the propagation table 306, use is made (by means of pointer control 330) of an entry pointer 331. The entry pointer 331 is controlled by the pointer control 330 to point to an entry of the propagation table 306 and (under circumstances described below) to be incremented sequentially through entries of the propagation table, wrapping round from the last entry to the first entry, in a round-robin fashion. Further, when a miss occurs for the look up in the propagation table 306 reference may be made to a counter value which also forms part of each entry. In the example of FIG. 3, this is a 2-bit counter value. All counters are initialised to 0 (i.e. when a new entry is allocated into the propagation table) and only propagation table entries with a count of 0 can be evicted. When a miss occurs in the propagation table 306, if the entry pointer 331 is at the beginning of the table this causes a new sample address to be taken, i.e. the next new address 310 is then stored in the sample address storage 311. A new propagation table entry of this new address is not generated. Generally, when a miss occurs in the propagation table 306 and the pointer is not at the start of the table, then it is determined if the pointer currently indicates an entry for which the counter value is 0. If it is, then this entry can be replaced and a new stride 313 and predicted address 315 (with counter set to 0) form the new entry allocated in the propagation table 306. If however the pointer is not currently at an entry with a counter value of 0, then the pointer is incremented. If the pointer reaches the end of the table and wraps to the beginning then all counters are decremented. 
Conversely note that when a hit occurs in an entry of the propagation table 306 the counter value is set to its maximum, in this case this being a value of 3. Updating of the respective counter values of the entries of the propagation table 306 takes place under control of the counter control circuitry 302.

FIG. 4 is a flow diagram showing a sequence of steps which are taken in the method of one example embodiment, according to which the apparatus 300 schematically illustrated in FIG. 3 may be operated. The flow begins at step 400. Then at step 401 the next incoming address is stored as the sample address. Then the flow proceeds to step 402 where it is determined if a new incoming address has been received, and the flow waits here until this happens. Then at step 404 the new incoming address is used to look up in the propagation table. If this hits on a predicted address of an entry in the propagation table the flow proceeds to step 406 where the predicted address and stride value in that entry are read out and prefetches are generated using the stride value (multiplied by a predetermined prefetch distance) added to the incoming address. The stride value is then used at step 408 to update (by addition) the predicted address in this propagation table entry. At step 410 the counter value for this table entry is set to its maximum value. The flow then returns to step 402.

If however at step 404 it is found that the incoming address has missed in the propagation table lookup then the flow proceeds to step 412, where it is determined if the propagation table pointer is currently at the start of the table. If it is then the flow returns to step 401 for a new incoming address to be stored as the sample address. If the pointer is not at the start of the table then at step 414 it is determined if the pointer is indicating an entry for which the counter value is 0. If this is the case then at step 416 a stride value and predicted address are calculated, and at step 418 these are used to replace the entry and the counter value of 0 is maintained. At step 420 the pointer is then incremented. Following this at step 422 it is determined if the pointer (through the increment of step 420) has now wrapped round to the beginning of the table. If it has not then the flow returns to step 402 and the next incoming address is received. If however the pointer has reached the beginning of the table then all counter values are decremented at step 424 and thereafter the flow returns to step 401 for a new incoming address to be stored as the sample address.

Returning to a consideration of step 414 if it is found (following the propagation table miss) that the pointer is not at the start of the table and is currently indicating an entry with a non-zero counter value then the flow proceeds to step 426 where the pointer is incremented. If this has not caused the pointer to wrap to the start of the table (step 428) then the flow returns to step 414 to determine if this is an entry which can be replaced (counter=0). Otherwise if the pointer is found to have wrapped to the start of the table at 428 then at step 424 all counters are decremented and the flow returns to step 401 for a new incoming address to be stored as the sample address.
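The miss handling of steps 412 to 428, together with the counter update of step 410, can be sketched as below. The sketch models only the decision taken for a given pointer position and the counter values; the function names are hypothetical, counters are held in a plain list, and clamping the decrement at 0 is an assumption (a 2-bit counter is taken not to wrap below zero).

```python
MAX_COUNT = 3  # maximum of the 2-bit counter used in the FIG. 3 example


def decrement_all(counters):
    """Step 424: decrement every counter (assumption: saturating at 0)."""
    for i, count in enumerate(counters):
        counters[i] = max(0, count - 1)


def handle_hit(counters, index):
    """Step 410: a hit sets the entry's counter to its maximum."""
    counters[index] = MAX_COUNT


def handle_miss(counters, pointer):
    """Return (victim, pointer, resample) for a propagation table miss.

    victim   -- index of the entry to overwrite, or None
    pointer  -- updated entry pointer
    resample -- True if a new sample address should be taken (return to step 401)
    """
    size = len(counters)
    if pointer == 0:                        # step 412: pointer at start of table
        return None, pointer, True
    while counters[pointer] != 0:           # step 414: entry not replaceable
        pointer = (pointer + 1) % size      # step 426
        if pointer == 0:                    # step 428: wrapped to the start
            decrement_all(counters)         # step 424
            return None, pointer, True
    victim = pointer                        # steps 416/418: caller writes the new
                                            # entry here, keeping its counter at 0
    pointer = (pointer + 1) % size          # step 420
    if pointer == 0:                        # step 422: wrapped to the start
        decrement_all(counters)             # step 424
        return victim, pointer, True
    return victim, pointer, False
```

For instance, with counters `[3, 2, 0, 1]` and the pointer at entry 1, the scan skips entry 1, selects entry 2 for replacement and leaves the pointer at entry 3. A further miss then wraps the pointer to the start, decrements all counters to `[2, 1, 0, 0]` and requests a new sample address.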

FIG. 5 schematically illustrates an apparatus 500 in one example embodiment. The apparatus 500 has both incoming address storage 501 and previous address storage 503, these being connected by the buffer 502, such that on an ongoing basis the latest address received in the load address stream is held by the incoming address storage 501 and the previous address received in the load address stream is held by the previous address storage 503. These are compared by means of the addition circuitry 505, which has a negative input from the previous address storage 503, such that the previous address 503 is subtracted from the new incoming address 501, giving a stride value between the two. Stride values determined in this manner are pushed into the FIFO buffer 506. In the example of FIG. 5 the FIFO buffer 506 is shown to comprise four entries, but this is an implementation choice depending on the system in which this apparatus is to be used. In particular, the number of entries in the shifting buffer 506 is based on the depth of the load address stream pattern which should be covered. On the basis of the content of the buffer 506, cumulative stride determination circuitry 508 calculates cumulative stride candidates at each depth. Each individual calculated stride held in the FIFO 506 can itself be a candidate cumulative stride, and the sum of sequential stride values held in the FIFO 506 can also be a candidate cumulative stride. Whenever a new stride value is inserted into the FIFO 506 a new cumulative stride can be calculated for each depth and compared against the candidate cumulative strides stored in the cumulative stride table 510. On a match with the existing content of the table 510 a corresponding score value is incremented.
When any of the scores in the table reaches a predetermined threshold value 512, as determined by comparison circuitry 514, then this candidate cumulative stride can be chosen as a prefetch stride 516 which, by addition to the incoming address 501 (modified if necessary as described above by a prefetch distance), can be used in prefetch generation 518.
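The per-depth calculation performed on the FIFO content can be illustrated with a short Python sketch. It assumes that the cumulative stride at a given depth is the sum of that many most recently pushed strides, and that the prefetch distance is applied as a multiplier on the chosen stride; the function names are hypothetical.

```python
from collections import deque


def cumulative_strides(fifo):
    """Return {depth: sum of the `depth` most recent strides in the FIFO}."""
    newest_first = list(fifo)[::-1]
    sums, total = {}, 0
    for depth, stride in enumerate(newest_first, start=1):
        total += stride
        sums[depth] = total
    return sums


def prefetch_address(incoming_address, prefetch_stride, prefetch_distance=1):
    """Incoming address plus the chosen stride, scaled by a prefetch distance."""
    return incoming_address + prefetch_stride * prefetch_distance


fifo = deque(maxlen=4)          # four entries, as drawn in FIG. 5
for stride in (2, 10, 2, 10):
    fifo.append(stride)
sums = cumulative_strides(fifo)
```

With strides 2, 10, 2, 10 in the FIFO, the candidates are 10 at depth 1, 12 at depth 2, 22 at depth 3 and 24 at depth 4. If the depth-2 candidate were chosen as the prefetch stride for an incoming address of 24, `prefetch_address(24, 12)` would give 36.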

As a first illustrative example, the load address stream might generate a sequence of strides such as 2, 10, 2, 10, 2, 10, 2, 10, 2, 10, etc. In this example, allowing depths of 1 or 2 strides, candidate cumulative strides of 2, 10, and 12 will be generated. Table 1 below shows the load address stream in the left-hand column, the sequential address offsets (strides) in the next column, and the candidate cumulative strides (2, 10, 12) which populate the table at respective depths of 1 (single stride) and 2 (sum of two successive strides) respectively. The evolution of the table scores as the load address stream continues is shown.

TABLE 1
Depth                                   1    1    2
Cumulative stride                       2   10   12
Address  Sequential address offset  Evolution of table scores
0
2        2                              1
12       10                             0    1    1
14       2                              1    0    2
24       10                             0    1    3
26       2                              1    0    4
36       10                             0    1    5
38       2                              1    0    6
48       10                             0    1    7
50       2                              1    0    8
60       10                             0    1    9

First example load address stream and corresponding table content evolution

As a second illustrative example, the load address stream might generate a sequence of strides such as 2, 2, 10, 2, 2, 10, 2, 2, 10, etc. In this example, allowing depths of 1, 2, or 3 strides, candidate cumulative strides of 2, 10, 4, 12, and 14 will be generated. Table 2 below shows the load address stream in the left-hand column, the sequential address offsets (strides) in the next column, and the candidate cumulative strides (2, 10, 4, 12, 14) which populate the table at respective depths of 1 (single stride), 2 (sum of two successive strides) and 3 (sum of three successive strides). The evolution of the table scores as the load address stream continues is shown.

TABLE 2
Depth                                   1    1    2    2    3
Cumulative stride                       2   10    4   12   14
Address  Sequential address offset  Evolution of table scores
0        2                              1
2        2                              2         1
12       10                             0    1    0    1    1
14       2                              1    0    0    2    2
16       2                              2    0    1    0    3
26       10                             0    1    0    1    4
28       2                              1    0    0    2    5
30       2                              2    0    1    0    6
40       10                             0    1    0    1    7
42       2                              1    0    0    2    8
44       2                              2    0    1    0    9
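The sets of candidate cumulative strides quoted for the two example stride sequences can be checked with a few lines of Python. The helper below is a hypothetical illustration: it collects every sum of between one and `max_depth` successive strides, in order of first appearance.

```python
def candidate_cumulative_strides(strides, max_depth):
    """Collect each distinct sum of 1..max_depth successive strides."""
    seen = []
    for end in range(1, len(strides) + 1):
        # Only depths for which enough strides have been observed so far.
        for depth in range(1, min(max_depth, end) + 1):
            total = sum(strides[end - depth:end])
            if total not in seen:
                seen.append(total)
    return seen


first = candidate_cumulative_strides([2, 10, 2, 10, 2, 10], max_depth=2)
second = candidate_cumulative_strides([2, 2, 10, 2, 2, 10], max_depth=3)
```

For the first sequence with depths of 1 or 2 this yields 2, 10 and 12. For the second sequence with depths up to 3 it yields 2, 4, 10, 12 and 14, of which only 14 (the full period 2 + 2 + 10) recurs at every step once the pattern is established.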

FIG. 6 is a flow diagram showing a sequence of steps which are taken in the method of one example embodiment, which may be used to operate the apparatus schematically illustrated in FIG. 5. The flow can be considered to begin at step 600 at which a new incoming address is waited for and then received. Thereafter at step 602 a stride value is calculated from the difference between this new address and the previously received address. This stride value is then pushed into the FIFO at step 604. Then, on the basis of the content of the FIFO, cumulative strides for each depth in the FIFO are calculated at step 606. At step 608 it is firstly determined if there are any candidate cumulative strides with a score of 0. If there are then any such entries may, at step 610, be replaced by new candidate cumulative strides for that depth. Then at step 612 it is determined if there are any candidate cumulative strides stored in the cumulative stride table which are not matched by the current content of the FIFO. If so, then at step 614 the score for these entries is decremented. Then at step 616 it is determined if there are any matches between the content of the FIFO and the cumulative stride table. If there are not, the flow simply returns to step 600 for the next new incoming address to be received and processed. However, if at least one match is found at step 616 then at step 618 the score for any candidate cumulative strides which match is incremented and at step 620 it is determined if any scores in the cumulative stride table now meet the predefined threshold for prefetch generation. If they do not the flow simply returns to step 600. However, if at least one score has met this threshold then at step 622 one or more prefetches are generated using the candidate cumulative strides for which the score met the threshold.
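The flow of FIG. 6 can be approximated in software as follows. This Python sketch is a simplification under several assumptions: the class name `CumulativeStrideTable` is hypothetical, the table holds exactly one candidate per depth, the threshold of 4 is arbitrary, and the prefetch address is the incoming address plus the matching stride with no additional prefetch distance.

```python
from collections import deque


class CumulativeStrideTable:
    """Hypothetical model of the FIG. 6 flow with one candidate slot per depth."""

    def __init__(self, depth=4, threshold=4):
        self.fifo = deque(maxlen=depth)
        self.previous_address = None
        self.threshold = threshold
        # Each slot is [candidate cumulative stride, score]; empty slots score 0.
        self.entries = {d: [None, 0] for d in range(1, depth + 1)}

    def step(self, address):
        """Process one incoming address and return any prefetch addresses."""
        if self.previous_address is None:
            self.previous_address = address
            return []
        stride = address - self.previous_address         # step 602
        self.previous_address = address
        self.fifo.append(stride)                         # step 604
        prefetches, total = [], 0
        for d, s in enumerate(list(self.fifo)[::-1], start=1):
            total += s                                   # step 606: sum at depth d
            entry = self.entries[d]
            if entry[1] == 0:                            # steps 608/610: replace
                entry[0] = total
            if entry[0] != total:                        # steps 612/614: no match
                entry[1] -= 1
            else:                                        # steps 616/618: match
                entry[1] += 1
                if entry[1] >= self.threshold:           # steps 620/622
                    prefetches.append(address + total)
        return prefetches


table = CumulativeStrideTable(depth=4, threshold=4)
results = [table.step(a) for a in (0, 2, 12, 14, 24, 26, 36)]
```

Run on the first example address stream, the depth-1 score oscillates between 1 and 0 while the depth-2 candidate of 12 gains one point per address, in line with Table 1. Under the assumed threshold of 4 the first prefetch is generated at address 26, for address 38, and the next at address 36, for address 48.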

In brief overall summary, apparatuses and methods for prefetch generation are disclosed. Prefetching circuitry receives addresses specified by load instructions and can cause retrieval of a data value from an address before that address is received. Stride determination circuitry determines stride values as a difference between a current address and a previously received address. Plural stride values corresponding to a sequence of received addresses are determined. Multiple stride storage circuitry stores the plurality of stride values determined by the stride determination circuitry. New address comparison circuitry determines whether a current address corresponds to a matching stride value based on the plurality of stride values stored in the multiple stride storage circuitry. Prefetch initiation circuitry can cause a data value to be retrieved from a further address, wherein the further address is the current address modified by the matching stride value of the plurality of stride values. By the use of multiple stride values, more complex load address patterns can be prefetched.

In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes, additions and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims. For example, various combinations of the features of the dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.

Dooley, Miles Robert, Pusdesris, Joseph Michael, Shulyak, Alexander Cole, Nathella, Krishnendra, Sunwoo, Dam

Executed on   Assignor   Assignee   Conveyance   Frame Reel Doc
Sep 19 2018   NATHELLA, KRISHNENDRA   ARM Limited   ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS)   0469570475
Sep 20 2018   PUSDESRIS, JOSEPH MICHAEL   ARM Limited   ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS)   0469570475
Sep 20 2018   DOOLEY, MILES ROBERT   ARM Limited   ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS)   0469570475
Sep 20 2018   SHULYAK, ALEXANDER COLE   ARM Limited   ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS)   0469570475
Sep 21 2018   SUNWOO, DAM   ARM Limited   ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS)   0469570475
Sep 25 2018   ARM Limited   (assignment on the face of the patent)