A method and system for branch prediction are provided herein. The method includes executing a program, wherein the program comprises multiple procedures, and setting bits in a taken branch history register to indicate whether a branch is taken or not taken during execution of instructions in the program. The method further includes the steps of calling a procedure in the program and overwriting, responsive to calling the procedure, the contents of the taken branch history register to a start address for the procedure.
|
21. An apparatus, comprising:
a taken branch history register configured to store bits that indicate whether a branch is taken during execution of instructions in a program, wherein the program includes a procedure;
a preset circuit configured to overwrite contents of the taken branch history register to a start address for the procedure;
a first branch history table including a first prediction of whether a current branch will be taken or not taken;
a second branch history table including a second prediction of whether the current branch will be taken or not taken;
a selector table including a selection value in a selection entry; and
a hash block configured to determine whether the first prediction or the second prediction is used based on whether the selection value is equal to or greater than a threshold value.
11. An apparatus, comprising:
an instruction cache; and
a branch prediction unit configured to supply a speculative address to the instruction cache, the branch prediction unit comprising:
a taken branch history register configured to store bits that indicate whether a branch is taken during execution of instructions in a program, wherein the program includes a procedure;
a preset circuit configured to overwrite contents of the taken branch history register to a start address for the procedure;
a first branch history table including a first prediction of whether a current branch will be taken or not taken;
a second branch history table including a second prediction of whether the current branch will be taken or not taken;
a selector table including a selection value in a selection entry; and
a hash block configured to determine whether the first prediction or the second prediction is used based on whether the selection value is equal to or greater than a threshold value.
1. A method, comprising:
providing a program to be executed in accordance with a speculative address;
executing the program, the program comprising a procedure;
setting bits in a taken branch history register to indicate whether a branch is taken or not taken during execution of instructions in the program;
calling the procedure in the program; and
overwriting, responsive to calling the procedure, contents of the taken branch history register to a start address for the procedure;
accessing, using a hash block, a first branch history table based on contents of the taken branch history register for a first prediction of whether a current branch will be taken or not taken;
accessing, using the hash block, a second branch history table based on the contents of the taken branch history register for a second prediction of whether the current branch will be taken or not taken; and
determining whether the first prediction or the second prediction is used based on whether a selection value in a selection entry of a selector table is equal to or greater than a threshold value.
2. The method of
accessing the first branch history table based on the contents of the taken branch history register and a program counter register.
3. The method of
generating an index based on the contents of the taken branch history register and the program counter register to access an entry in the first branch history table, wherein the entry includes the first prediction.
4. The method of
hashing the contents of the taken branch history register to produce a result having a number of bits that is less than a number of bits in the taken branch history register.
5. The method of
accessing the second branch history table based on the contents of the taken branch history register and the program counter register, wherein the first branch history table has more entries than the second branch history table.
6. The method of
generating an index based on the contents of the taken branch history register and the program counter register to access an entry in the second branch history table, wherein the entry includes the second prediction.
7. The method of
hashing contents of the program counter register to produce a result having a number of bits that is less than a number of bits in the program counter register.
8. The method of
using the index to access the selection entry in the selection table.
9. The method of
using the index to access an update entry in an update table that stores an update value that indicates whether the first branch history table or the second branch history table is more accurate in a prediction of whether the branch will be taken, and
suppressing updating a prediction value, corresponding to the branch, in the first branch history table based on the update value indicating that the second branch history table is more accurate than the first branch history table in the prediction of whether the branch is taken.
10. The method of
a section of code within the program that is accessed upon execution of a call instruction and has an instruction that returns program execution to a next instruction after the call instruction.
12. The apparatus of
13. The apparatus of
14. The apparatus of
15. The apparatus of
16. The apparatus of
17. The apparatus of
18. The apparatus of
19. The apparatus of
an update table, wherein the hash block is configured to access an update entry in an update table that stores an update value that indicates whether the first branch history table or the second branch history table is more accurate in a prediction of whether the current branch will be taken, and
logic circuits configured to suppress an update of a prediction value, corresponding to the current branch, in the first branch history table based on the update value indicating that the second branch history table is more accurate than the first branch history table in the prediction of whether the current branch is taken.
20. The apparatus of
a section of code within the program that is accessed upon execution of a call instruction; and
an instruction that returns program execution to a next instruction after the call instruction.
|
Field of the Invention
This application generally relates to branch predictors.
Background Art
Branch predictors are used to predict whether a branch will be taken or not taken. Accuracy of the prediction improves performance of a processor. Methods and systems are provided herein to improve the accuracy of a branch predictor.
Embodiments of the disclosure are described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the leftmost digit(s) of a reference number identifies the drawing in which the reference number first appears.
The figures illustrate various components, their arrangements, and interconnections. Unless expressly stated to the contrary, the figures are not necessarily drawn to scale.
The following Detailed Description refers to accompanying drawings to illustrate exemplary embodiments consistent with the disclosure herein. References in the Detailed Description to “one exemplary embodiment,” “an illustrative embodiment,” “an example embodiment,” and so on, indicate that the exemplary embodiment described may include a particular feature, structure, or characteristic, but every exemplary embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same exemplary embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an exemplary embodiment, it is within the knowledge of those skilled in the relevant art(s) to effect such feature, structure, or characteristic in connection with other exemplary embodiments whether or not explicitly described.
The exemplary embodiments described herein are provided for illustrative purposes, and are not limiting. Other exemplary embodiments are possible, and modifications may be made to the exemplary embodiments within the spirit and scope of the disclosure herein. Therefore, the Detailed Description is not meant to limit the disclosure. Rather, the scope of the disclosure is defined only in accordance with the following claims and their equivalents.
The following Detailed Description of the exemplary embodiments will so fully reveal the general nature of the disclosure that others can, by applying knowledge of those skilled in relevant art(s), readily modify and/or adapt for various applications such exemplary embodiments, without undue experimentation, without departing from the spirit and scope of the invention. Therefore, such adaptations and modifications are intended to be within the meaning and plurality of equivalents of the exemplary embodiments based upon the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by those skilled in relevant art(s) in light of the teachings herein.
Terminology
The terms chip, die, integrated circuit, semiconductor device, and microelectronic device are often used interchangeably in the field of electronics.
FET, as used herein, refers to metal-oxide-semiconductor field effect transistors (MOSFETs). An n-channel FET is referred to herein as an NFET. A p-channel FET is referred to herein as a PFET.
CMOS is an acronym that stands for Complementary Metal Oxide Semiconductor, and refers to a semiconductor manufacturing process in which both NFETs and PFETs are formed in the same chip.
CMOS circuit refers to a circuit in which both NFETs and PFETs are used together.
SoC is an acronym that stands for System on a Chip, and refers to a chip that includes two or more circuit blocks, typically interconnected by a bus, where those circuit blocks provide such high levels of functionality that these blocks would have been considered system-level components in the past. By way of example, circuit blocks having the requisite level of functionality as of this date include scalar, superscalar, and very long instruction word processors; DRAM controllers (e.g., DDR3, DDR4 and DDR5); flash memory controllers; Universal Serial Bus (USB) controllers; and the like. This list is intended to be illustrative and not limiting. Another common way of describing an SoC is a chip that includes all the components that would be needed to implement an electronic system such as, for example, a computer system or a computer-based system.
VLIW is an acronym for Very Long Instruction Word.
VLIW instruction, as used in the description of exemplary embodiments herein, refers to a set of instructions grouped together for presentation to the instruction decoder. The individual instructions in the set of instructions are assigned to one of a plurality of execution pipes for execution.
IC0 refers to a pseudo-stage which is on the input to the instruction cache.
IC1 refers to the instruction cache stage. Fetch requests to the instruction cache are made in this cycle, along with calculations to determine which PC to fetch next. VLIW instructions previously requested are supplied in this stage.
DE1 refers to the first stage of the instruction decoder.
DE1_operation refers to a logical operation performed by the first stage of the instruction decoder.
DE1_time refers to a cycle in which a DE1_operation occurs.
DE2 refers to the second stage of the instruction decoder.
DE2_operation refers to a logical operation performed by the second stage of the instruction decoder.
DE2_time refers to the cycle in which the reading and renaming of the general register file (GRF) and predicate register file (PREG) occurs.
RS refers to a reservation station. There are several different reservation stations that can be enqueued to. In the best case this is a single cycle stage; however, operations may end up queuing here for many cycles.
EXn refers to an nth stage of an execution pipe. Examples of execution pipes include ALU short and long pipes, BRANCH and the Load Store Unit.
SHP refers to a short execution pipe. A short execution pipe is used to perform single cycle operations.
LOP refers to a long execution pipe. A long execution pipe is used to execute instructions that take 2-8 cycles to complete.
LSU refers to the load store unit.
DTCM refers to a data tightly coupled memory.
PBUS refers to a bus that connects to a peripheral memory.
DCACHE refers to the data cache used to cache accesses to peripheral memory.
Enqueue refers to the action in which a VLIW instruction in DE2 is split into its component operations, which then move forward down the pipe into the reservation stations.
Issue refers to moving an operation from the reservation station to an execution unit. An operation is referred to as being issued when it is moved from the reservation station to an execution unit. An operation is a component part of a VLIW instruction.
Current PC refers to the value of the program counter (PC) for the instruction currently in a given stage. Each stage of the pipe will have its own version of the current PC.
Next PC refers to the next PC to fetch from the Icache. For straight line code this will be current PC + current instruction width; for redirected code it will be the new target PC.
Loop start address refers to the address of the first instruction in a loop body, i.e., the address to branch to for starting a new loop iteration.
Loop end address refers to the address of the first instruction after a loop body, i.e., the address to branch to for naturally exiting the loop.
Loop body refers to the instructions beginning with the loop start address and ending with the loop match address.
Loop match address refers to the address of the last instruction in a loop body.
Loop count refers to the number of iterations of the loop that should be executed. This comes from either an immediate field for LOOP operations, or a general register for ZLOOP and ZLOOPS operations.
SIN refers to the Speculation Index Number, which is used to identify instructions enqueued speculatively in the shadow of a branch.
SIN resolution refers to determining whether a branch was correctly speculated or not. SIN resolution is performed in EX1.
SIN validation refers to a branch in EX1 that was correctly speculated, which in turn will validate the SIN associated with the operations in the shadow of the correctly speculated branch. A validated operation is one which will update the architectural state.
SIN cancellation refers to a branch in EX1 that was incorrectly speculated, which in turn will cancel all outstanding SINs, and perform an EX1 redirect, effectively removing all operations that were in the shadow of the branch from the execution pipe. In one embodiment, removing the operations that were in the shadow of the incorrectly speculated branch includes changing the state of a bit associated with each of those instructions in the execution pipe.
State coherency enforcement (SCE) refers to actions performed by an internal mechanism to prevent future operations from seeing an incoherent machine state.
Trap events refers to the set of synchronous, asynchronous and fault events.
Synchronous trap events relate to a specific instruction and are detected in time to prevent the instruction causing the event from being enqueued. The Supervisor Call (SVC) instruction fits into this category. These are precise as they occur in an architecturally defined place in the instruction stream.
Asynchronous trap events (interrupts) occur independently from the current instruction sequence. Asynchronous exceptions fit into this category.
Fault trap events prevent program flow from recovering. Examples of fault trap events are a misaligned PC and a data abort. Faulting operations with a register destination must complete a register value.
A processor architecture is disclosed that includes a register file having a plurality of registers, is configured for out-of-order instruction execution, and further includes a renamer unit that produces generation numbers that are associated with register file addresses to provide a renamed version of a register that is temporally offset from an existing version of that register rather than assigning a non-programmer-visible physical register as the renamed register. The processor architecture may include a reset dual history length (DHL) Gshare branch prediction unit coupled to an instruction cache and configured to provide speculative addresses to the instruction cache. The processor architecture is suitable for implementation in an integrated circuit. Such an integrated circuit is typically implemented with CMOS circuitry.
In typical embodiments a processor in accordance with this disclosure is implemented in an integrated circuit as an embedded processor.
Instruction cache 102 holds VLIW instructions that have been previously fetched by an instruction fetch unit (not shown). The VLIW instructions are typically fetched from a memory disposed external to the processor itself. Branch prediction unit 104 is shown coupled to instruction cache 102. Branch prediction unit 104 provides the address of the VLIW instruction to fetch. If the requested VLIW instruction is present in instruction cache 102 then it is provided to an instruction decoder 106. If the requested VLIW instruction is not present in instruction cache 102 then a cache miss has occurred and the requested instruction is fetched from a memory that is disposed outside of the processor.
Branch prediction unit 104 has several functions, including providing the program counter value needed by instruction cache 102, and the program counter value needed by different stages and logic blocks throughout the processor. For sequentially executing program code, the program counter value simply changes by the length of the instruction just fetched. But when a branch instruction is detected, then branch prediction unit 104 determines from what address the next instruction should be fetched. In this exemplary processor, branch prediction unit 104 uses a small reset DHL Gshare branch prediction mechanism to determine the next instruction address.
Instruction decoder 106 decodes the content of the VLIW instructions and provides control information to various other blocks of the processor.
Register file 108 contains a predetermined number of programmer-visible registers. These registers hold values that are used during the execution of a program.
Individual instructions obtained from the VLIW instruction are enqueued into a selected reservation queue. When the operands needed for execution of an enqueued instruction become available, that instruction is issued to the execution pipe associated with the selected reservation queue.
Generation renamer 110 is used to assign generation numbers to register instances in instructions when those register instances would conventionally be reassigned to a different non-programmer-visible physical register.
The reservation queues hold instructions that are waiting to be issued.
Stunt box 124 provides a mechanism for receiving and distributing the outputs of the execution pipes. Stunt box 124 provides data to an operand copy network. The operand copy network allows all the results of the execution pipes to be made available to other blocks within the processor. In this way, an instruction waiting for an operand to be produced from the execution of another instruction does not have to wait for that operand to be written back to the register file and then read out of the register file. Rather the required operand is made available, via the operand copy network, to all the locations throughout the processor that are waiting for that particular result.
System bus 126 provides a mechanism for the embedded processor to communicate with other logic blocks on the integrated circuit that are external to the processor itself.
Branches and Branch Prediction
Branch instructions are used to choose which path to follow through a program. Branches can be used to jump to a procedure in different places in a program. They can also be used to allow a loop body to be executed repeatedly, and they can be used to execute a piece of code only if some condition is met.
Branches cause problems for processors for two reasons. Branches can change the flow through the program, so the next instruction is not always the instruction following sequentially after the branch. Branches can also be conditional, so it is not known until the branch is executed whether the next instruction to be fetched is the next sequential instruction or the instruction at the branch target address.
In early processor designs, instructions were fetched and executed one at a time. By the time the fetch of a new instruction started, the target address and condition of a previous branch were already known. The processor always knew which instruction to fetch next. However, in pipelined processors, the execution of several instructions is overlapped. In a pipelined processor, the instruction following the branch needs to be fetched before the branch is executed. However, the address of the next instruction to fetch is not yet known. This problem may be referred to as the branch problem. Since the target address and condition of the branch are not known until after the branch is executed, all pipeline stages before the execute stage will be filled with bubbles or no-operations by the time the branch is ready to execute. If an instruction executes in an nth stage of a pipeline, there will be (n−1) bubbles or no-operations per branch. Each of the bubbles or no-operations represents the lost opportunity to execute an instruction.
In superscalar processors, the branch problem is more serious as there are two or more pipelines. For a superscalar processor capable of executing k instructions per cycle, the number of bubbles or no-operations is (n−1)×k. Each bubble still represents the lost opportunity to execute an instruction. The number of cycles lost due to each branch is the same in the pipelined and superscalar processors, but the superscalar processor can do much more in that period of time. For example, consider a 4-issue superscalar (i.e., k=4) processor where branches are executed in the nth pipeline stage (with n=6). If every fifth instruction is a branch instruction, there will be 20 bubbles for every 5 useful instructions executed. Due to the branch problem, only 20% of the execution bandwidth is used to execute instructions. The trend in processor design is towards wider issue and deeper pipelines, which further aggravates the branch problem.
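The arithmetic in the 4-issue example above can be checked with a short sketch. The function names are illustrative and not part of the disclosure, and the model assumes that every branch incurs the full (n−1)×k penalty, i.e., that no prediction is performed.

```python
def bubbles_per_branch(n: int, k: int) -> int:
    """Issue slots lost per branch when branches execute in pipeline
    stage n of a processor that can execute k instructions per cycle."""
    return (n - 1) * k


def execution_utilization(n: int, k: int, instructions_per_branch: int) -> float:
    """Fraction of issue slots doing useful work when one out of every
    `instructions_per_branch` instructions is a branch that pays the
    full penalty (assumption: no branch prediction at all)."""
    useful = instructions_per_branch
    wasted = bubbles_per_branch(n, k)
    return useful / (useful + wasted)


# Values from the text: k = 4, n = 6, every fifth instruction is a branch.
print(bubbles_per_branch(6, 4))        # 20
print(execution_utilization(6, 4, 5))  # 0.2
```

With these inputs, 5 useful instructions share 25 issue slots with 20 bubbles, which gives the 20% figure quoted above.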
Branch prediction is one way of dealing with the branch problem. A branch predictor predicts whether a branch will be taken or not taken. The predictor uses the prediction to decide what address to fetch the next instruction from in the next cycle. If the branch is predicted as taken, then an instruction at the branch target address will be fetched. If the branch is predicted as not taken, then the next sequential instruction after the branch instruction will be fetched. When a branch predictor is used, a branch penalty is only seen if the branch is mispredicted. A highly accurate branch predictor is therefore an important mechanism for reducing the branch penalty in a processor.
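The fetch decision described above reduces to a single selection. The following is a minimal sketch; the function and parameter names are hypothetical, and the instruction width is passed in explicitly because the disclosure does not fix one.

```python
def next_fetch_address(current_pc: int, instruction_width: int,
                       predicted_taken: bool, branch_target: int) -> int:
    """Return the address to fetch in the next cycle.

    A branch predicted as taken redirects fetch to the branch target;
    a branch predicted as not taken falls through to the next
    sequential instruction.
    """
    if predicted_taken:
        return branch_target
    return current_pc + instruction_width


# Illustrative addresses only.
print(hex(next_fetch_address(0x100, 4, True, 0x200)))   # 0x200
print(hex(next_fetch_address(0x100, 4, False, 0x200)))  # 0x104
```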
Global branch history register 306 stores bits that indicate whether a branch was taken during execution of instructions in a program. Hash block 304 generates addresses to access entries in the large branch history table 308, small branch history table 310, hybrid selector table 312, and update counter table 314. Generation of addresses using hash block 304 to access entries in the large branch history table 308 is further described below with respect to
A conventional branch predictor may use only one branch history table. Embodiments presented herein use two branch history tables, the large branch history table 308 and the small branch history table 310. Both the small branch history table 310 and the large branch history table 308 store values that predict a branch direction for a branch in a program code being executed. The small branch history table 310 has fewer entries than the large branch history table 308, and is therefore a shorter history that is better at capturing correlation between branches for which only the most recent branch outcomes are needed. The large branch history table 308 has more entries than the small branch history table and the longer history captures more complex correlations between branches. The state machine to update values in large branch history table 308 and small branch history table 310 is described below with respect to
Mux 316 selects between a branch direction read from the large branch history table 308 and the small branch history table 310 based on a selection value read from an entry in the hybrid selector table 312. Each fetched branch is mapped to an entry in the large branch history table 308, the small branch history table 310, and a selection entry in the hybrid selector table 312 using the hash block 304. If the selection entry in the hybrid selector table 312 has a value greater than or equal to 2, then the prediction from the large branch history table 308 is used to predict the direction of the branch; otherwise, the prediction from the small branch history table 310 is used. A value in a selection entry in hybrid selector table 312 corresponding to a branch is incremented if only the large branch history table 308 was correct in predicting that branch. If only the small branch history table 310 was correct in predicting that branch, the value in the selection entry in the hybrid selector table 312 corresponding to that branch is decremented. If both the large branch history table 308 and the small branch history table 310 made the same prediction for the branch, the value in the selection entry is not changed.
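The selection and update rules for a selection entry can be sketched as follows. This is an illustration under stated assumptions, not the disclosed circuit: the threshold of 2 comes from the text, while the counter range of 0 to 3 (a 2-bit saturating counter) and all names are assumptions.

```python
SELECT_THRESHOLD = 2  # from the text: a value >= 2 selects the large table
SELECT_MAX = 3        # assumption: 2-bit saturating counter (0..3)


def choose_prediction(selection_value: int, large_pred: bool, small_pred: bool) -> bool:
    """Model of the mux: pick the large-table prediction when the
    selection value reaches the threshold, else the small-table one."""
    return large_pred if selection_value >= SELECT_THRESHOLD else small_pred


def update_selection(selection_value: int, large_pred: bool,
                     small_pred: bool, taken: bool) -> int:
    """Return the new selection value once the branch outcome is known."""
    large_ok = large_pred == taken
    small_ok = small_pred == taken
    if large_ok and not small_ok:
        return min(selection_value + 1, SELECT_MAX)  # only large was right
    if small_ok and not large_ok:
        return max(selection_value - 1, 0)           # only small was right
    return selection_value                           # same prediction: unchanged


# A branch that only the large table predicts correctly moves the
# selector from the small table (value 1) to the large table (value 2).
value = update_selection(1, large_pred=True, small_pred=False, taken=True)
print(value, choose_prediction(value, True, False))  # 2 True
```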
The update counter table 314 is used to determine whether to inhibit an update of an entry in the large branch history table 308. Update counter table 314 stores an update value in each entry. The update value indicates whether the large branch history table 308 or the small branch history table 310 is more accurate in a prediction of a particular branch. According to an embodiment of the disclosure, the value in a large branch history table 308 corresponding to a branch instruction is not updated if the corresponding update value in update counter table 314 indicates that the small branch history table 310 is more accurate than the large branch history table 308 in a prediction of a branch direction for the branch. If an update value corresponding to a particular branch in the update counter table 314 is zero, then update of the large branch history table 308 is inhibited regardless of whether the particular branch is correctly predicted by the large branch history table 308 or the small branch history table 310; otherwise the update is allowed. When the small branch history table 310 mispredicts a particular branch, the update value corresponding to the particular branch in the update counter table 314 is set to 3. Every time thereafter, the update value corresponding to that particular branch is decremented if the large branch history table 308 mispredicts the particular branch. In this manner, the large branch history table 308 is only updated with the correct prediction for the particular branch when the small branch history table 310 has recently mispredicted the particular branch. This prevents over-updating of the large branch history table 308, leading to better training of the large branch history table 308 with regard to the particular branch.
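The update-inhibit rule can be sketched as a small state function. The reset value of 3 and the inhibit-at-zero rule come from the text; the function names, the floor at zero, and the choice to give a small-table misprediction priority when both tables mispredict are assumptions made for illustration.

```python
UPDATE_RESET = 3  # from the text: set when the small table mispredicts


def large_table_update_allowed(update_value: int) -> bool:
    """An update value of zero inhibits updates of the large table."""
    return update_value != 0


def next_update_value(update_value: int, small_correct: bool, large_correct: bool) -> int:
    """Return the new update value after a branch resolves.

    Assumptions: a small-table misprediction takes priority over a
    large-table misprediction, and the counter does not go below zero.
    """
    if not small_correct:
        return UPDATE_RESET
    if not large_correct:
        return max(update_value - 1, 0)
    return update_value


# The small table mispredicts once, then the large table mispredicts
# three times in a row: updates are allowed until the counter reaches 0.
value = next_update_value(0, small_correct=False, large_correct=True)
history = [value]
for _ in range(3):
    value = next_update_value(value, small_correct=True, large_correct=False)
    history.append(value)
print(history)                            # [3, 2, 1, 0]
print(large_table_update_allowed(value))  # False
```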
Referring back to
Upon fetching the DIV instruction 412, global branch history register 306 is again not updated since the DIV instruction 412 is not a conditional branch instruction. Upon receiving the BRNEZ instruction 414, the global branch history register 306 will be updated since the BRNEZ instruction 414 is a conditional branch instruction. Assuming the BRNEZ instruction 414 is predicted as taken, the global branch history register 306 is updated by shifting a bit “1” into the least significant bit position of the global branch history register 306 as shown in
A program may include multiple procedures. A procedure is a section of code within the program that is accessed upon execution of a “call” instruction. The call instruction may include an instruction that returns program execution to a next instruction after the call instruction. An example of a call instruction is a “branch with link” instruction that is further described with reference to the example program code provided below.
PROGRAM CODE:
0x001 ADD
0x002 SUB
0x003 BR
0x004 BRNEZ
0x005 MUL
0x006 ADD
0x007 BRANCH WITH LINK TO PROCEDURE 1
0x008 BRLEZ
0x009 BRGTZ
0x010 ADD
0x011 BRANCH WITH LINK TO PROCEDURE 2
0x012 ADD
PROCEDURE 1
0x014 ADD
0x015 SUB
0x016 BREQZ
0x017 BRNEZ
0x018 MUL
0x019 DIV
END PROCEDURE 1
0x021 ADD
0x022 MUL
PROCEDURE 2
0x024 SUB
0x025 MUL
0x026 ADD
0x027 BREQZ
0x028 MUL
0x030 BRGTZ
END PROCEDURE 2
In the example program code above, 0xXXX represents the address at which an instruction is stored in instruction cache 102. A branch with link instruction is an instruction that transfers program execution to a particular procedure in the program code. Executing the branch with link instruction that transfers program execution to a procedure is referred to as “calling a procedure” herein. The branch with link instruction includes an instruction (not shown) that returns program execution to a next instruction after the branch with link instruction.
Global branch history such as that stored in global branch history register 306 is used as an index to access prediction entries in large branch history table 308 and small branch history table 310 because branches often correlate with previously executed branches. Longer branch histories enable predictors to view a larger window of previously executed branches and learn based on correlations with those branches. For branches highly correlated with recent branch history, global history can provide key prediction information. Conventional branch predictors may rely only on a global branch history to produce branch predictions. However, not all branches in the program are correlated with recently executed branches. For these branches that are not correlated with recently executed branches, the extra information encoded in the global history may do more harm than good when predicting branches. It also increases the time to train the branch predictor and it significantly expands the level of aliasing in branch prediction tables, thereby reducing the accuracy of prediction of a current branch and of other branches. A longer global branch history register 306 enables correlation between more distant branches, but also increases the number of uncorrelated branches that are included in the branch history. Those uncorrelated branches can generate significant noise when predicting branches. Consider a 15-bit global branch history register 306. A branch that is highly correlated with 3 prior branches will make good use of a correlating predictor, but even in this scenario, the history contains 12 bits of useless noise. This means that in a worst case, 2^12 times more entries may be needed to predict a branch, greatly increasing the training period of a branch predictor along with aliasing with other branches. For a branch uncorrelated with prior branches, the entire 15 bits are noise. Procedure calls often represent breaks in program flow.
Branches preceding a procedure call tend to be less correlated with branches inside the procedure call. Accordingly, an architecture that allows some branches to benefit from large histories, but eliminates or reduces the history noise in those regions where the noise is not useful is provided.
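The arithmetic behind the 15-bit example can be checked with a short sketch. It is illustrative only: the function names are hypothetical, and placing the 3 correlated branches in the low-order bits of the history is an assumption, since the description does not fix their position.

```python
# Illustrative only: how many history values share one useful pattern when
# most of the global history is uncorrelated with the branch being predicted.
HISTORY_BITS = 15     # width of the global branch history register in the example
CORRELATED_BITS = 3   # prior branches the current branch actually depends on

noise_bits = HISTORY_BITS - CORRELATED_BITS


def worst_case_expansion(history_bits, correlated_bits):
    """Worst-case number of table entries used per useful history pattern."""
    return 2 ** (history_bits - correlated_bits)


def histories_sharing_pattern(pattern):
    """Enumerate every 15-bit history whose low 3 bits equal `pattern`.

    Assumption: the correlated branches occupy the low-order bits.
    """
    return {(noise << CORRELATED_BITS) | pattern
            for noise in range(2 ** noise_bits)}


expansion = worst_case_expansion(HISTORY_BITS, CORRELATED_BITS)
aliases = histories_sharing_pattern(0b101)
```

Each of the values in `aliases` would index a different prediction entry even though they carry the same useful information, which is the training and aliasing cost described above.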
To provide better prediction of branches using the global branch history register 306, a value in the global branch history register 306 is overwritten with a start address of a first instruction in a procedure when a branch to that procedure is made. Overwriting a value in the global branch history register 306 with the start address of the procedure that is called is referred to as “presetting” herein. If the branch to the procedure was speculative and incorrectly predicted, then the value in the global branch history register 306 that was overwritten is restored to the global branch history register 306. Using a start address of a first instruction in a procedure provides a unique history for each point at which the global branch history register 306 is preset, thereby eliminating aliasing between the different preset points in the program code, and ensuring that when program execution calls the procedure again, the global branch history register 306 will be preset to the same value. Since the global branch history register 306 is used as an index into the large branch history table 308 and the small branch history table 310 to determine the direction of branches, presetting the global branch history register 306 to the same value (i.e., the start address of a first instruction in a procedure) ensures that branch predictions retrieved from the large branch history table 308 and the small branch history table 310 will be local to the procedure that is called and will be more accurate.
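Presetting and restoring can be sketched in software as follows. This is a minimal model under stated assumptions, not the disclosed circuit: the class and method names are hypothetical, the 15-bit width is borrowed from the earlier example, truncating the start address to the register width is an assumption, and the checkpoint stack is just one possible way to keep the overwritten value available for restoration.

```python
HISTORY_BITS = 15                    # assumed width, taken from the earlier example
MASK = (1 << HISTORY_BITS) - 1


class GlobalHistory:
    """Hypothetical software model of a global branch history register."""

    def __init__(self):
        self.value = 0
        self._checkpoints = []       # overwritten values, kept for restore

    def record(self, taken):
        """Shift in one branch outcome (1 = taken, 0 = not taken)."""
        self.value = ((self.value << 1) | int(taken)) & MASK

    def preset(self, start_address):
        """Overwrite the history with the called procedure's start address."""
        self._checkpoints.append(self.value)
        self.value = start_address & MASK   # assumption: keep the low bits

    def restore(self):
        """Undo a preset made for a call that was incorrectly predicted."""
        self.value = self._checkpoints.pop()


PROC_START = 0x0A4C                  # hypothetical procedure start address

first_caller = GlobalHistory()
for outcome in (True, False, True):
    first_caller.record(outcome)
first_caller.preset(PROC_START)

second_caller = GlobalHistory()
for outcome in (False, False, True, True):
    second_caller.record(outcome)
second_caller.preset(PROC_START)
```

Both callers reach the procedure with different prior histories, yet after the preset they hold the same value, so branches inside the procedure index the same table entries on every call.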
In the example in
Global branch history register 306 and program counter 202 are coupled to hash block 304. Hash block 304 is coupled to the small branch history table 310, the hybrid selector 312 and update counter 314. According to an embodiment of the disclosure, hash block 304 includes XOR function 700. XOR function 700 hashes a 32-bit program counter value in program counter 202 into a 10-bit value. The 10 bits generated by XOR function 700 are combined with the least significant bit of the global branch history register 306 to form an 11-bit index. This 11-bit index is used to access an entry in the small branch history table 310, the hybrid selector 312, and the update counter 314.
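One way to form such an index is sketched below. The description states only that an XOR function reduces the 32-bit program counter to 10 bits and that the result is combined with the least significant history bit. Folding the program counter in 10-bit chunks and placing the history bit in the lowest index position are both assumptions made for illustration, and the function names are hypothetical.

```python
def xor_fold(pc, out_bits=10):
    """XOR-fold a 32-bit program counter into `out_bits` bits.

    Assumption: successive `out_bits`-wide chunks are XORed together.
    """
    pc &= 0xFFFFFFFF
    mask = (1 << out_bits) - 1
    folded = 0
    while pc:
        folded ^= pc & mask
        pc >>= out_bits
    return folded


def small_table_index(pc, global_history):
    """Build an 11-bit index from the folded PC and one history bit.

    Assumption: the least significant history bit becomes index bit 0.
    """
    return (xor_fold(pc) << 1) | (global_history & 1)


index = small_table_index(0x00000400, 0b11)
```

With 11 index bits, each of the three structures addressed this way would hold 2,048 entries.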
During initialization, a random value may be stored in entries of the large branch history table 308 and the small branch history table 310. If a branch is taken the first time it is executed, the entry corresponding to the branch is associated with a “weakly taken” state 802 and is updated with bits 00. If the entry for a branch is currently in the weakly taken state 802, and if a next time the branch is executed it is taken again, then the entry is associated with the “strongly taken” state 804 and is updated with bits 01. If the current state for a branch is the weakly taken state 802, then if the branch is not taken the next time it is executed, it transitions to the “weakly not taken” state 806 and the entry is updated with bits 10. If a branch is currently associated with the weakly not taken state 806 and it is taken the next time it is executed, then the state for that branch transitions to the weakly taken state 802 and its corresponding entry is updated with bits 00. If a branch is in the weakly not taken state 806, and then it is again not taken the next time it is executed, then the state transitions to a “strongly not taken” state 808 and the entry is updated with bits 11. If a branch is in the strongly not taken state 808, and then it is taken, then it transitions to the weakly not taken state 806 and the entry is updated with bits 10. If a branch is in the strongly taken state 804 and is then taken again the next time it is executed, then it will stay in the strongly taken state 804. If the branch is in the strongly taken state 804, and then it is not taken the next time it is executed, then it transitions to the weakly taken state 802 and the entry is updated with bits 00.
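The transitions described above can be captured in a small lookup table. The sketch below uses the bit encodings given in the text (00, 01, 10, 11). Two details are assumptions rather than statements from the description: that a branch in the strongly not taken state stays there when it is again not taken, and that the two “taken” states are the ones that predict taken. All identifiers are hypothetical.

```python
# Bit encodings as given in the description.
WEAKLY_TAKEN = 0b00        # state 802
STRONGLY_TAKEN = 0b01      # state 804
WEAKLY_NOT_TAKEN = 0b10    # state 806
STRONGLY_NOT_TAKEN = 0b11  # state 808

# (current state, branch taken?) -> next state
TRANSITIONS = {
    (WEAKLY_TAKEN, True): STRONGLY_TAKEN,
    (WEAKLY_TAKEN, False): WEAKLY_NOT_TAKEN,
    (STRONGLY_TAKEN, True): STRONGLY_TAKEN,
    (STRONGLY_TAKEN, False): WEAKLY_TAKEN,
    (WEAKLY_NOT_TAKEN, True): WEAKLY_TAKEN,
    (WEAKLY_NOT_TAKEN, False): STRONGLY_NOT_TAKEN,
    (STRONGLY_NOT_TAKEN, True): WEAKLY_NOT_TAKEN,
    # Assumption: saturates here; the text does not state this case.
    (STRONGLY_NOT_TAKEN, False): STRONGLY_NOT_TAKEN,
}


def update(state, taken):
    """Return the next 2-bit entry value after one resolved branch."""
    return TRANSITIONS[(state, taken)]


def predicts_taken(state):
    """Assumption: either 'taken' state yields a taken prediction."""
    return state in (WEAKLY_TAKEN, STRONGLY_TAKEN)


# Walk one entry through a short outcome sequence.
state = WEAKLY_TAKEN
trace = []
for outcome in (True, False, False, False, True):
    state = update(state, outcome)
    trace.append(state)
```

With this encoding the upper bit alone distinguishes the two taken states (upper bit 0) from the two not taken states (upper bit 1).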
Unspeculated Branch Instruction
Branch prediction unit 104 optimizes the frequent path that code takes through a set of branches. As described above, the branch prediction unit 104 determines which path a branch should take based on the global branch history register 306, the large branch history table 308, and the small branch history table 310. However, sometimes the most frequently taken path as learned by the branch prediction unit 104 is not the path that a programmer may want to optimize. For example, a programmer may want to optimize a branch path that takes the greatest number of clock cycles to execute instead of a branch path that is most frequently taken. For example, consider the pseudo-code:
If (branch_instruction (branch condition))
Execute code A
else
Execute code B
In the above example, code A is the branch path whose execution takes more clock cycles than that of code B. The branch prediction unit 104 may determine, based on the global branch history register 306, the large branch history table 308, and the small branch history table 310, that code B is the path of the branch that is taken most often. Thus, whenever branch prediction unit 104 encounters the branch instruction above, it will predict that code B should be executed instead of code A. However, when branch_instruction is later resolved by executing it in branch execution unit 118 and it is determined that code A was to be executed based on the branch condition above, penalty cycles will be incurred because code B, which was executed based on the prediction by branch prediction unit 104 that outputs branch direction signal 204, has to be invalidated and then code A will have to be fetched and executed. Thus code A, which takes more clock cycles to execute than code B, will have additional cycles added to its execution due to the prior execution of code B based on the prediction by branch prediction unit 104. To avoid this scenario, embodiments presented herein provide for an unspeculated branch instruction called builtin_expect(branch condition). Pseudo-code for the builtin_expect instruction is provided below:
If (builtin_expect(branch condition))
Execute code A
else
Execute code B
According to an embodiment of the disclosure, the branch prediction unit 104 does not treat the builtin_expect unspeculated branch instruction like other branch instructions in that branch prediction unit 104 doesn't predict the branch direction by providing a branch direction signal 204 to determine whether code A or code B is to be executed. Instead, the builtin_expect instruction causes code A, which takes more clock cycles than code B to execute, to be executed every time. In an example, the instruction fetch unit (not shown) upon receiving the unspeculated branch instruction builtin_expect(branch condition) from the instruction cache 102 always fetches instructions for code A which is the branch path of the unspeculated branch instruction that takes more clock cycles to execute than code B. In an embodiment, code A is in the “forward path” of the unspeculated branch instruction. Forward path as referred to herein is the code that is the next sequential instruction after the unspeculated branch instruction. In an embodiment, a programmer or compiler may place code A in the forward path of the unspeculated branch instruction.
In addition, branch prediction unit 104 does not update any of the branch history structures, such as the global branch history register 306, the large branch history table 308, and the small branch history table 310, with the branch history of the builtin_expect instruction. Branch execution unit 118 later resolves the branch condition in builtin_expect(branch condition) to determine whether code A or code B was to be executed. If it is determined in branch execution unit 118 that code B should have been executed, then code B can be fetched, and the instructions executed in code A can be invalidated. While fetching instructions for code B, which may be the most frequent path taken, and invalidating instructions executed in code A will result in extra cycles, the worst case scenario, in which penalty cycles are added to the already longer execution of code A, will be avoided by use of the builtin_expect(branch condition) instruction.
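The difference in handling between an ordinary branch and the unspeculated branch can be summarized in a short behavioral sketch. It is a simplified software model under stated assumptions, not the disclosed hardware: the function names are hypothetical, the paths are represented by the labels "A" (forward path) and "B", and pipeline timing is not modeled.

```python
def fetch_path(unspeculated, predictor_chooses_a):
    """Return the path whose instructions are fetched first.

    An unspeculated branch always fetches code A (the forward path) and
    ignores the predictor; an ordinary branch follows the prediction.
    """
    if unspeculated:
        return "A"
    return "A" if predictor_chooses_a else "B"


def updates_history(unspeculated):
    """The history register and tables are not updated for unspeculated branches."""
    return not unspeculated


def resolve(fetched, actual):
    """Recovery actions once the branch execution unit resolves the condition."""
    if fetched == actual:
        return []
    return ["invalidate " + fetched, "fetch " + actual]


# The predictor has learned that code B is the frequent path.
ordinary = fetch_path(unspeculated=False, predictor_chooses_a=False)
hinted = fetch_path(unspeculated=True, predictor_chooses_a=False)

# The condition resolves to the long path, code A.
ordinary_recovery = resolve(ordinary, "A")
hinted_recovery = resolve(hinted, "A")
```

In this model the ordinary branch pays a recovery penalty exactly when the long path A is needed, while the unspeculated branch pays it only when the shorter path B turns out to be correct.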
Conclusion
The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
The claims in the instant application are different than those of the parent application or other related applications. The Applicant therefore rescinds any disclaimer of claim scope made in the parent application or any predecessor application in relation to the instant application. The Examiner is therefore advised that any such previous disclaimer and the cited references that it was made to avoid, may need to be revisited. Further, the Examiner is also reminded that any disclaimer made in the instant application should not be read into or against the parent application.
Wilson, Sophie, Barrett, Geoffrey
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Oct 30 2014 | WILSON, SOPHIE | Broadcom Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 034083 | /0336 | |
Oct 30 2014 | BARRETT, GEOFFREY | Broadcom Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 034083 | /0336 | |
Oct 31 2014 | Avago Technologies International Sales Pte. Limited | (assignment on the face of the patent) | / | |||
Feb 01 2016 | Broadcom Corporation | BANK OF AMERICA, N A , AS COLLATERAL AGENT | PATENT SECURITY AGREEMENT | 037806 | /0001 | |
Jan 19 2017 | BANK OF AMERICA, N A , AS COLLATERAL AGENT | Broadcom Corporation | TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS | 041712 | /0001 | |
Jan 20 2017 | Broadcom Corporation | AVAGO TECHNOLOGIES GENERAL IP SINGAPORE PTE LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 041706 | /0001 | |
May 09 2018 | AVAGO TECHNOLOGIES GENERAL IP SINGAPORE PTE LTD | AVAGO TECHNOLOGIES INTERNATIONAL SALES PTE LIMITED | MERGER SEE DOCUMENT FOR DETAILS | 047231 | /0369 | |
Sep 05 2018 | AVAGO TECHNOLOGIES GENERAL IP SINGAPORE PTE LTD | AVAGO TECHNOLOGIES INTERNATIONAL SALES PTE LIMITED | CORRECTIVE ASSIGNMENT TO CORRECT THE EXECUTION DATE OF THE MERGER AND APPLICATION NOS 13 237,550 AND 16 103,107 FROM THE MERGER PREVIOUSLY RECORDED ON REEL 047231 FRAME 0369 ASSIGNOR S HEREBY CONFIRMS THE MERGER | 048549 | /0113 |
Date | Maintenance Fee Events |
Aug 16 2022 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Date | Maintenance Schedule |
Feb 19 2022 | 4 years fee payment window open |
Aug 19 2022 | 6 months grace period start (w surcharge) |
Feb 19 2023 | patent expiry (for year 4) |
Feb 19 2025 | 2 years to revive unintentionally abandoned end. (for year 4) |
Feb 19 2026 | 8 years fee payment window open |
Aug 19 2026 | 6 months grace period start (w surcharge) |
Feb 19 2027 | patent expiry (for year 8) |
Feb 19 2029 | 2 years to revive unintentionally abandoned end. (for year 8) |
Feb 19 2030 | 12 years fee payment window open |
Aug 19 2030 | 6 months grace period start (w surcharge) |
Feb 19 2031 | patent expiry (for year 12) |
Feb 19 2033 | 2 years to revive unintentionally abandoned end. (for year 12) |