A loop predictor and a method for instruction fetching using a loop predictor. A processor may include a loop predictor in addition to a primary branch predictor. A relatively common scenario in program execution is that a set of branches repeats over and over, forming a loop. The loop may be detected based on a repeated pattern of access to a data structure used for branch prediction. Once a loop is detected, it may be determined whether the instructions would stay in the loop for at least a duration sufficient to disable the branch prediction. On a determination that the detected loop is locked, a sequence of instruction addresses in one iteration of the detected loop may be captured in a buffer, the branch predictor may be turned off, and the sequence of fetch addresses may be replayed from the buffer.
1. A method for instruction fetching in a processor using a loop predictor, the method comprising:
detecting a loop based on a repeated pattern of access to a data structure used for branch prediction;
capturing a sequence of instruction addresses in one iteration of the detected loop in a buffer on a determination that the detected loop is locked,
wherein the detected loop is determined to be locked if it is determined based on the repeated pattern of access to the data structure that instructions stay in the loop for at least a duration sufficient to disable the branch prediction; and
generating a sequence of instruction addresses for instruction fetching from the buffer.
21. A non-transitory computer-readable storage medium storing a code for describing a structure and/or a behavior of a circuit configured to detect a loop based on a repeated pattern of access to a data structure used for branch prediction, capture a sequence of instruction addresses in one iteration of the detected loop in a buffer on a determination that the detected loop is locked, and generate a sequence of instruction addresses for instruction fetching from the buffer, wherein the detected loop is determined to be locked if it is determined based on the repeated pattern of access to the data structure that instructions stay in the loop for at least a duration sufficient to disable the branch prediction.
10. A processor comprising:
a branch predictor configured to predict a branch direction and a branch target; and
a loop predictor configured to:
detect a loop based on a repeated pattern of access to a data structure used for branch prediction,
capture a sequence of instruction addresses in one iteration of the detected loop in a buffer on a determination that the detected loop is locked, wherein the detected loop is determined to be locked if it is determined based on the repeated pattern of access to the data structure that instructions stay in the loop for at least a duration sufficient to disable the branch prediction, and
generate a sequence of instruction addresses for instruction fetching from the buffer.
19. A non-transitory computer-readable storage medium storing a set of instructions for execution by a processor to fetch an instruction using a loop predictor, the set of instructions comprising:
a detecting code segment for detecting a loop based on a repeated pattern of access to a data structure used for branch prediction;
a determining code segment for determining whether the detected loop is locked;
a capturing code segment for capturing a sequence of instruction addresses in one iteration of the detected loop in a buffer on a determination that the detected loop is locked, wherein the detected loop is determined to be locked if it is determined based on the repeated pattern of access to the data structure that instructions stay in the loop for at least a duration sufficient to disable the branch prediction; and
a generating code segment for generating a sequence of instruction addresses for instruction fetching from the buffer.
2. The method of
disabling the branch prediction on a determination that the detected loop is locked.
3. The method of
determining whether there is a corresponding entry in a branch cache when there is a predicted branch by the branch prediction, the branch cache including an index to the data structure used for the branch prediction; and
determining whether a branch cache identity, a branch direction, and a branch type of the predicted branch match valid entries in a recent branch history register, wherein the loop is detected by identifying matching entries in the recent branch history register.
4. The method of
5. The method of
6. The method of
counting conditional jumps or variable target indirect jumps in the loop; and
checking a global history register for a repeating pattern with a frequency of the conditional jumps or the variable-target indirect jumps, wherein the detected loop is determined to be locked on a condition that a matching pattern is found in the global history register.
7. The method of
stalling a thread and waiting for a predetermined number of cycles to allow remaining fetches in a branch predictor pipe to drain on a condition that the sequence of instruction addresses is generated from the buffer.
8. The method of
collapsing non-taken branches in the sequence of instruction addresses in the detected loop.
9. The method of
disabling an instruction cache unit for fetching the instructions on a determination that the detected loop is locked.
11. The processor of
12. The processor of
a recent branch history register for recording a history of recent branches including at least one of: a branch identity (ID), whether the branch ID is valid, a predicted branch direction, or whether a branch is a conditional jump, a call or return instruction, or has a variable indirect target;
a loop watcher for detecting a loop by identifying a matching entry in the recent branch history register to a predicted branch, and detecting a repeating pattern of entries in the recent branch history register;
a capture controller for capturing a sequence of instruction addresses in one iteration of the loop on a determination that a detected loop is locked; and
a loop buffer for storing the sequence of instruction addresses such that the sequence of the instruction addresses is replayed from the loop buffer.
13. The processor of
a branch cache for storing an index to the data structure used for the branch prediction corresponding to a predicted branch, wherein an index to the branch cache is used as the branch ID.
14. The processor of
15. The processor of
16. The processor of
a counter for counting conditional jumps or variable target indirect jumps in the loop, wherein a locked loop is detected on a condition that a repeating pattern with a frequency of the conditional jumps or the variable-target indirect jumps is found in the global history register.
17. The processor of
18. The processor of
20. The non-transitory computer-readable storage medium of
22. The non-transitory computer-readable storage medium of
This application is related to microprocessors, including central processing units (CPUs) and graphical processing units (GPUs).
Processing units are utilized in a multitude of applications. A standard configuration couples a processor with a storage unit, such as a cache, a system memory, or the like. Processors may execute a fetch operation to fetch instructions from the storage unit as needed.
In order to speed up the operation of the processor which is performing a fetch operation, a branch predictor may be used. A branch predictor predicts the direction of the branch instruction, (i.e., taken or not-taken), and the branch target address before the branch instruction reaches the execution stage in the pipeline.
This is known as “pre-fetching.” Although pre-fetching and speculatively executing instructions without knowing the actual direction of the branch instruction may speed up the processing of an instruction, it may have the opposite effect and stall the pipeline if the branch direction is mis-predicted. If a branch mis-prediction occurs, the pipeline needs to be flushed and the instructions re-executed. This may severely impact the performance of the system.
Several different types of branch predictors have been used. A bimodal predictor may make a prediction based on recent history of a particular branch's execution, and give a prediction of taken or not-taken. A global predictor may make a prediction based upon recent history of all the branches' execution, not just the particular branch of interest. A two-level adaptive predictor with a globally shared history buffer, a pattern history table, and an additional local saturating counter may also be used such that the outputs of the local and the global predictors are XORed with each other to provide a final prediction. More than one prediction mechanism may be used simultaneously and a final prediction may be made based either on a meta-predictor that remembers which of the predictors has made the best predictions in the past, or a majority vote function based on an odd number of different predictors. However, branch predictors are typically large and complex. As a result, they consume a lot of power and incur a latency penalty for predicting branches.
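For illustration only (not part of the disclosed embodiments), the bimodal predictor described above can be sketched as a table of two-bit saturating counters. The table size, counter encoding, and indexing are assumptions made for the sketch:

```python
# Illustrative sketch of a bimodal branch predictor: a table of 2-bit
# saturating counters indexed by the low bits of the branch address.
# TABLE_SIZE and the initial counter state are assumed values.

TABLE_SIZE = 1024  # number of 2-bit counters (assumption)

class BimodalPredictor:
    def __init__(self):
        # Counters range 0..3; start at 1 ("weakly not-taken").
        self.counters = [1] * TABLE_SIZE

    def _index(self, branch_addr):
        return branch_addr % TABLE_SIZE

    def predict(self, branch_addr):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        # Saturating increment/decrement on the actual branch outcome.
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A global predictor would replace the address index with (a hash of) a global history register; the two-bit counter update is the same.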
An improved loop predictor and a method for instruction fetching using a loop predictor are disclosed. A processor may include a loop predictor in addition to a branch predictor. A relatively common scenario in program execution is that a set of branches repeats over and over, thereby forming a loop. The loop may be detected based upon a repeated pattern of access to a data structure used for branch prediction, (e.g., a branch target buffer (BTB) in a branch predictor). Once a loop is detected, it may be determined whether the instructions will stay in the loop for at least a duration sufficient to disable the branch prediction, (i.e., the loop is “locked”). On a determination that the detected loop is locked, a sequence of instruction addresses in one iteration of the detected loop may be captured in a buffer. The branch predictor may then be turned off and the stored sequence of fetch addresses may be replayed from the buffer.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
The embodiments will be described with reference to the drawing figures wherein like numerals represent like elements throughout.
A relatively common scenario in program execution is that a set of branches repeats over and over, thereby forming a loop. In accordance with one embodiment, the power and latency penalty of using a primary branch predictor may be avoided by using a smaller structure, (i.e., a loop predictor), which uses less power and has lower latency, and turning off the primary branch predictor.
The branch predictor 302 may include a branch target buffer (BTB) 306, a branch history table 308, and/or a global history table 310. The BTB 306 is a cache containing a branch target address, a predicted branch direction, and a tag for the branch instruction. For each instruction to be fetched, the BTB 306 is accessed, and if the branch instruction is found in the BTB 306, (i.e., a BTB hit), the corresponding branch target address may be output depending on the predicted branch direction.
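As an illustrative sketch of the BTB lookup described above (the direct-mapped organization, entry count, and tag scheme are assumptions, not the disclosed design):

```python
# Illustrative sketch of a direct-mapped BTB: each entry holds a tag,
# a predicted direction, and a target address. On a tagged hit with a
# taken prediction, the target is returned; otherwise fetch proceeds
# sequentially. BTB_ENTRIES and the tag derivation are assumed.

BTB_ENTRIES = 128

class BTB:
    def __init__(self):
        self.entries = [None] * BTB_ENTRIES  # each: (tag, taken, target)

    def install(self, fetch_addr, taken, target):
        idx = fetch_addr % BTB_ENTRIES
        self.entries[idx] = (fetch_addr // BTB_ENTRIES, taken, target)

    def lookup(self, fetch_addr):
        idx = fetch_addr % BTB_ENTRIES
        tag = fetch_addr // BTB_ENTRIES
        entry = self.entries[idx]
        if entry is not None and entry[0] == tag:
            stored_tag, taken, target = entry
            # BTB hit: redirect fetch to the target if predicted taken.
            return target if taken else None
        return None  # BTB miss (or not-taken): fall through sequentially
```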
The branch history table 308 includes a recent history of the predicted branch directions of a specific branch instruction. The global history table 310 includes the recent history of the predicted branch directions of all branch instructions.
In accordance with one embodiment, as shown in
When there is a predicted branch, (e.g., a BTB hit), the physical index to the BTB 306 may be stored in the branch cache 402, (i.e., each entry in the branch cache 402 contains the BTB location). For example, for a 4-way, 128-entry level 1 (L1) BTB, the branch cache 402 may include a nine-bit value to uniquely identify the L1 BTB entry, (i.e., each entry of the branch cache 402 is a concatenation of the two-bit way and the seven-bit BTB index). The branch cache 402 may use a replacement algorithm such as ideal least recently used (LRU), pseudo-LRU, or the like. The branch cache 402 may be shared between multiple threads, or a separate branch cache 402 may be provided for each thread. The index into the branch cache 402 may be used as a branch identifier, which is referred to as the “branch cache ID” hereinafter. For example, if the branch cache 402 has eight entries, a three-bit index to the branch cache 402 may be used as the branch cache ID.
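The nine-bit BTB location in the example above is a simple bit concatenation. As an illustrative sketch (the helper names are invented for this example; field widths follow the 4-way, 128-entry L1 BTB described in the text):

```python
# Illustrative packing of a 2-bit BTB way and a 7-bit BTB index into
# one 9-bit branch cache entry, per the 4-way, 128-entry L1 BTB example.

def btb_location(way: int, index: int) -> int:
    """Pack a 2-bit way and a 7-bit index into one 9-bit value."""
    assert 0 <= way < 4 and 0 <= index < 128
    return (way << 7) | index

def unpack_btb_location(loc: int) -> tuple:
    """Recover (way, index) from the packed 9-bit value."""
    return ((loc >> 7) & 0x3, loc & 0x7F)
```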
The recent branch history register 404, (e.g., a shift register), records at least one of the following pieces of information about the last N predicted branches (where N is a predetermined number): the branch cache ID, if used, (alternatively, if the branch cache 402 is not used, any type of branch identifier may be used, (e.g., the BTB index, a hash of the instruction address, etc.)); whether the branch cache ID is valid; the predicted branch direction; and whether the branch was a conditional jump, a call or return instruction, or has a variable indirect target. Whenever there is a predicted branch, (e.g., a BTB hit), the recent branch history register 404 is updated, (i.e., an old record is shifted out and a new record is shifted in). By using the index to the branch cache 402 instead of other references, such as references to the branch predictor (e.g., the BTB), which are much larger than the branch cache ID, the amount of information that needs to be moved in the recent branch history register 404 may be minimized and processing power may be saved.
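For illustration, the recent branch history register can be sketched as a fixed-depth shift register of small records. The depth N, the record fields, and the deque-based implementation are assumptions made for this sketch, not the disclosed hardware:

```python
# Illustrative sketch of the recent branch history register: a shift
# register of small per-branch records, newest first. Depth and fields
# are assumed for the example.

from collections import deque
from dataclasses import dataclass

N = 6  # last N predicted branches (assumed depth)

@dataclass
class BranchRecord:
    branch_cache_id: int   # small index into the branch cache
    valid: bool            # whether the branch cache ID is valid
    taken: bool            # predicted branch direction
    conditional: bool      # conditional jump (vs. unconditional)

class RecentBranchHistory:
    def __init__(self, depth=N):
        self.entries = deque(maxlen=depth)  # newest at index 0

    def push(self, record):
        # Shifting in a new record implicitly shifts the oldest one out.
        self.entries.appendleft(record)

    def entry(self, k):
        # k = 1 is the most recent entry, k = 2 the one before it, etc.
        return self.entries[k - 1] if k <= len(self.entries) else None
```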
A plurality of loop watchers 406 (state machines) are provided, one for each loop size (measured by the branch count in the loop) supported by the loop predictor 304. For example, six loop watchers may be provided for loops of size one to six, measured by the number of branch instructions in the loop. Each loop watcher 406 may be updated as a new entry enters the recent branch history register 404, (i.e., each time there is a predicted branch). Each loop watcher 406 compares the branch prediction at hand with its own assigned position in the recent branch history register 404. For example, a loop watcher finite state machine for identifying a loop of size one, (i.e., a loop including one branch instruction), compares the branch prediction at hand with the first entry, (i.e., the most recent entry), in the recent branch history register 404; a loop watcher finite state machine for identifying a loop of size two compares the branch prediction at hand with the second entry; and so on. Therefore, the loop watcher 406 for a loop of size one compares two consecutive branch predictions, and if there is a match, (i.e., the BTB indices of the two consecutive branch predictions are the same), a loop of size one is detected. The loop watcher 406 for a loop of size two compares the branch prediction at hand with the second most recent branch prediction, and if the BTB indices match, a loop of size two is detected.
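The per-size comparison above can be sketched as follows. This is a simplified illustration: it compares bare branch IDs only, and the match-count threshold for declaring a detection is an assumption, not the disclosed state machine:

```python
# Illustrative sketch of size-k loop watchers: each new predicted branch
# is compared against the k-th most recent history entry; a run of k
# consecutive matches flags a candidate loop of k branches.

class LoopWatcher:
    def __init__(self, size):
        self.size = size       # loop size (in branches) this watcher tracks
        self.matches = 0       # consecutive matches observed so far

    def observe(self, new_id, history):
        """history is a list of branch IDs, most recent first."""
        if len(history) >= self.size and history[self.size - 1] == new_id:
            self.matches += 1
        else:
            self.matches = 0
        # Declare a detection once every branch position in the loop matched.
        return self.matches >= self.size

def detect_loop(id_stream, max_size=6):
    """Return the size of the first loop detected, or None."""
    watchers = [LoopWatcher(s) for s in range(1, max_size + 1)]
    history = []
    for bid in id_stream:
        for w in watchers:
            if w.observe(bid, history):
                return w.size
        history.insert(0, bid)  # shift the new ID into the history
    return None
```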
The capture controller 408 captures a sequence of fetch addresses in one iteration of the loop in the loop buffer 410 based on the detected loop size by the loop watcher(s) 406. The capture controller 408 may include additional information for each branch in the loop, for example the number of fetches between each branch in the loop. The capture controller 408 may also collapse non-taken branches so they are recorded as part of the sequential fetch window information.
The loop buffer 410 may include a branch target address, a branch end pointer, a count of how many sequential fetch windows to predict before predicting the branch, and/or a branch type field. Once one iteration of the loop has been captured, the branch predictor 302 may be powered down and the fetch address sequence may be replayed from the loop buffer 410. This may continue speculatively until one of the conditional branches mis-predicts, at which time the branch predictor 302 will be powered up to its normal mode.
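The replay phase described above can be sketched as cycling through the captured iteration. The generator form is an illustrative simplification; in hardware, replay would be cut short by a mis-predicting conditional branch:

```python
# Illustrative sketch of loop buffer replay: once one iteration's fetch
# addresses are captured, the branch predictor is powered down and fetch
# addresses are generated by cycling through the buffer.

from itertools import islice

def replay_fetch_addresses(loop_buffer):
    """Yield fetch addresses by cycling through one captured iteration."""
    while True:                 # replay continues until a mis-predict
        for addr in loop_buffer:
            yield addr
```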
The recent branch history register 404, the loop watchers 406, the capture controller 408, and the loop buffer 410 may be provided per thread.
It should be noted that the structure of the loop predictor 304 in
If a loop is detected, it may be further determined whether the loop is locked, (i.e., the instructions would stay in the loop for at least a duration sufficient to disable the branch prediction) (step 508). If the loop comprises only unconditional branches or static-target indirect branches, a locked loop may be declared the first time a match is found with its entry in the recent branch history register 404.
If the loop contains a conditional jump(s) or a variable-target indirect jump(s), the locked loop may be declared based on, for example, a global history of branches. The branch predictor 302 keeps the recent history of all predicted branches in the global history register. For example, the conditional jumps or variable-target indirect jumps in the detected loop may be counted, and the global history register checked for a repeating pattern whose period equals that count. Once the repeating pattern is observed in the global history register, a locked loop may be declared. For example, if the loop has two conditional branches or variable-target indirect jumps in it and the global history register has a matching pattern repeating every two bits, a locked loop may be declared.
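The periodicity check on the global history can be sketched as follows. The history length examined and the number of repetitions required before declaring a lock are assumptions for the sketch:

```python
# Illustrative sketch of the locked-loop check for loops containing k
# conditional or variable-target branches: scan the global history
# (most recent outcome first) for a pattern repeating with period k.

def has_repeating_pattern(global_history, k, repeats=3):
    """True if the last repeats*k outcomes repeat with period k."""
    needed = repeats * k
    if len(global_history) < needed:
        return False
    window = global_history[:needed]
    # Every outcome must match the outcome k positions earlier.
    return all(window[i] == window[i + k] for i in range(needed - k))
```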
Once any of the loop watchers 406 identifies a locked loop, a capture controller 408 captures the sequence of fetch addresses in one iteration of the loop in a loop buffer 410 (step 510). The capture controller 408 may include additional information for each branch in the loop, for example the number of fetches between each branch in the loop. The capture controller 408 may also collapse non-taken branches so they are recorded as part of the sequential fetch window information. Once the capture phase is complete, the capture controller 408 may stall the thread and wait for a predetermined number of cycles to allow the remaining fetches in the branch predictor pipe to drain.
Once one iteration of the loop has been captured and synchronization is complete, the branch predictor 302 may be powered down and the fetch address sequence may be replayed from the loop buffer 410 (step 512). The instruction cache unit in the fetch unit 202 for retrieving the predicted instructions from the cache or the memory may also be powered down if the instruction data is captured and replayed for one iteration of the loop. This may continue speculatively until one of the conditional branches mis-predicts, at which time the branch predictor 302 will be powered up to its normal mode.
In accordance with the embodiments, power may be saved by capturing and playing back loops with a dedicated structure, allowing clocking to be disabled for the main branch predictor including L1 and L2 BTB, the PHT, the perceptron tables, the return-stack, and/or the indirect target array. In accordance with the embodiments, fragmented fetch windows for non-taken branches may be collapsed, and the non-taken branch bubbles may be squashed so that the instruction byte buffer (IBB) utilization in the decoder may be improved.
Currently, the vast majority of electronic circuits are designed and manufactured by using software, (e.g., hardware description language (HDL)). HDL is a computer language for describing structure, operation, and/or behavior of electronic circuits. The loop predictor 304 (i.e., the electronic circuit) may be designed and manufactured by using software (e.g., HDL). HDL may be any one of the conventional HDLs that are currently being used or will be developed in the future. A set of instructions are generated with the HDL to describe the structure, operation, and/or behavior of the loop predictor 304. The set of instructions may be stored in any kind of computer-readable storage medium.
The set of instructions may comprise a detecting code segment for detecting a loop based on a repeated pattern of access to a data structure used for branch prediction, (e.g., a BTB), a determining code segment for determining whether the detected loop is locked, a capturing code segment for capturing a sequence of instruction addresses in one iteration of the detected loop in a buffer on a determination that the detected loop is locked, and a generating code segment for generating a sequence of instruction addresses for instruction fetching from the buffer. The set of instructions may further comprise a disabling code segment for disabling the branch prediction on a determination that the detected loop is locked.
Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements. The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).
Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Nov 15 2010 | JARVIS, ANTHONY | Advanced Micro Devices, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 025372 | /0038 | |
Nov 16 2010 | Advanced Micro Devices, Inc. | (assignment on the face of the patent) | / |