This invention provides a cache system and method based on an instruction read buffer (IRB). When applied to the field of processors, it is capable of filling instructions into the instruction read buffer, which can be directly accessed by the processor core; the instruction read buffer autonomously outputs instructions to the processor core for execution, thereby achieving a high cache hit rate.
|
1. A method for facilitating operation of a processor system, comprising:
naming possible execution paths of a section of sequential instructions starting from an initial address, naming sequential instructions starting from the branch target instructions of the branch instructions within the section of sequential instructions as ways, and naming the initial address and the branch target addresses as way starting addresses;
issuing a plurality of instructions in the ways in parallel to processor pipelines for processing;
checking dependency among the plurality of instructions, and generating an address increment amount for each way based on the dependency;
making branch decisions independently for other branch instructions among the plurality of instructions of each way;
generating a way selecting signal based on the plurality of branch decisions and positions of corresponding branch instructions in the execution paths;
executing only instructions having no dependency prior to and in a selected way;
updating the initial address by adding the starting address of the selected way and the address increment amount of the selected way.
16. A system for facilitating operation of a processor, comprising:
a plurality of instruction read buffers;
a plurality of execution units;
a plurality of dependency check units; and
a plurality of priority encoders,
wherein:
naming possible execution paths of a section of sequential instructions starting from an initial address, naming sequential instructions starting from the branch target instructions of the branch instructions within the section of sequential instructions as ways, and naming the initial address and the branch target addresses as way starting addresses;
the instruction read buffers and the execution units have one to one correspondence; each instruction read buffer issues an instruction to a corresponding execution unit; and the instruction read buffers issue a plurality of instructions in the ways in parallel to the execution units in a corresponding way;
each dependency check unit corresponds to a way, checks dependency among the plurality of instructions in the way, and generates an address increment amount for the way based on the dependency;
the execution units are configured to execute instructions issued by the instruction read buffers;
when instructions being executed are branch instructions, the execution unit makes branch decisions independently for other branch instructions among the plurality of instructions of each way;
each priority encoder corresponds to a way, and generates a way selecting signal based on the plurality of branch decisions and positions of corresponding branch instructions in the execution paths; and the initial address is updated by adding the starting address of the selected way and the address increment amount of the selected way; and
the execution units execute only instructions having no dependency prior to and in the selected way.
2. The method according to
provided that m and w are positive integers, m sequential instructions starting from the initial address, together with possible branch target instructions and subsequent instructions of the branch instructions within the sequential instructions, are organized into w ways based on the possible execution paths, and into m instruction issue slots based on a program sequence of the instructions, wherein m is a total number of issue slots in a way, and w is a total number of ways;
a plurality of instructions of each way are in a plurality of slots; and
each slot processes the plurality of instructions in different ways.
3. The method according to
naming pipeline stages up to a branch decision stage of a processing pipeline as a front-end pipeline, and pipeline stages after the branch decision stage as a rear-end pipeline, wherein each issue slot has a processor pipeline structure consisting of at least one rear-end pipeline and up to w front-end pipelines, where each front-end pipeline corresponds to an instruction of a way.
4. The method according to
based on dependency among issued instructions in each way together with all prior instructions and positions of dependent instructions, generating an address increment amount for each way.
5. The method according to
branch decisions are made independently when processing branch instructions in the issued plurality of instructions;
a way with each of the branch decisions is selected based on a priority of corresponding instruction node positions on an instruction execution path binary tree;
each branch decision controls a 2-to-1 selection which selects a fall-through way if the branch decision is not taking-branch and selects a target way if the branch decision is taking-branch;
a plurality of the selections are configured according to positions of corresponding plurality of branch instructions in the execution path binary tree, with a selection result corresponding to a later branch instruction in execution sequence as an input to a selection corresponding to an earlier branch instruction in sequence; and
a last selection corresponds to a first branch instruction, and a result of the last selection designates the selected way.
6. The method according to
selecting a plurality of front-end pipeline outputs of instructions in the selected way and instructions prior to the selected way to the rear-end pipeline for further execution;
completing and retiring non-dependent instructions in the selected way and the instructions prior to the way, based on the dependency of the selected way; and
replacing the initial address by a summation of the starting address and the address increment amount of the selected way.
7. The method according to
adjusting the number of instructions issued in parallel by adjusting a configuration of dependency check, wherein setting the dependency check of a slot to having-dependence disables the front-end pipeline and the rear-end pipeline of the slot.
8. The method according to
an instruction read buffer (IRB) stores a plurality of instructions;
there are a plurality of IRB read-ports for each instruction stored in the IRB; each of the read-ports corresponding to a same instruction issues the instruction to a front-end pipeline via a separate set of bit-lines; and
the plurality of IRB read-ports are organized as a two-dimensional matrix, with read-ports corresponding to different instructions as one dimension and read-ports corresponding to different sets of bit-lines as another dimension.
9. The method according to
using diagonal word lines to control the IRB read-port matrix to output a plurality of sequential instructions to a plurality of front-end pipelines;
IRB read-ports connected to a single set of bit-lines correspond to a slot; and
IRB read-ports controlled by a single diagonal word-line correspond to a way.
10. The method according to
extracting instruction information from instruction blocks being filled into an instruction cache;
building tracks corresponding to the instruction blocks based on the extracted information and storing the tracks in a track table, wherein a sequential next block address is stored in a multi-port end address table.
11. The method according to
the initial address is used to address the track table and the multi-port end address table;
the track table outputs target addresses which also address the multi-port end address table;
the multi-port end address table outputs sequential next block addresses of the initial address and target addresses; and
the initial address, the target addresses, and the next addresses are sent to the IRB.
12. The method according to
the IRB stores a plurality of instruction blocks and their corresponding block addresses;
the IRB matches the initial address, the target addresses, and the next addresses with the block addresses stored in the IRB to identify the instruction blocks; and
offset addresses of matched incoming addresses are decoded to enable zig-zag word lines in the identified block to issue a plurality of instructions controlled by the word lines to the front-end pipelines for processing.
13. The method according to
the slot corresponding to the first instruction in a way is named as the starting slot of the way;
the initial address and the target addresses enable the word-lines originating in the starting slots of the corresponding ways;
the next addresses enable the word-lines starting in the slot after the slot where the last instruction of the corresponding prior block is issued; and
the IRB sends out a plurality of instructions from the read-ports controlled by the enabled word-lines to the corresponding front-end pipelines for processing.
14. The method according to
n sequential instructions starting from an initial address, and the possible branch target instructions from the branch instructions within the n instructions, and the branch target instructions of the branch targets, are divided into different ways based on each instruction's position on the instruction binary tree, and issued simultaneously;
each of the simultaneously issued instructions is independently executed;
dependency among the instructions is checked, and a way address increment amount is generated for each way based on whether there is dependency among the instructions and a location of a dependent instruction;
branch decision is made independently by executing each branch instruction;
a way of execution is determined based on each of the independent branch decisions and branch priority based on the branch instruction sequence order;
based on the way determined, up to n instructions are selected from the simultaneously issued instructions for normal execution and retirement, and the rest of the instructions are terminated;
based on the determined way, the current address of the way is added with the address increment amount of the way to be the next initial address; and
each branch target address addresses a separate track table or a multi-port track table to read out branch target addresses of the branch target instruction block.
15. The method according to
the instructions and extracted instruction information are stored in a joint buffer, the instruction information corresponds one-to-one to the instructions stored in the joint buffer;
based on the initial address, the joint buffer provides corresponding instructions of each way, the branch target instruction address, and the next block address of each way;
based on the said branch target instruction address of each way, the joint buffer further provides corresponding instructions of each way, the branch target instruction address and the next block address of the said branch target for each way;
the initial address, branch target address, and the next block address respectively enable the corresponding diagonal word-lines to control the IRB to output the instructions to the corresponding front-end pipelines.
17. The system according to
provided that m and w are positive integers, m sequential instructions starting from the initial address, together with possible branch target instructions and subsequent instructions of the branch instructions within the sequential instructions, are organized into w ways based on the possible execution paths, and into m instruction issue slots based on a program sequence of the instructions, wherein m is a total number of issue slots in a way, and w is a total number of ways.
18. The system according to
a plurality of instruction read buffers store a same plurality of instructions;
each instruction read buffer has a plurality of read-ports, and each read-port corresponds to a set of bit-lines and to an instruction stored in the instruction read buffer;
for each instruction read buffer, the read-ports form a two-dimensional matrix, with read-ports corresponding to different instructions as one dimension and read-ports corresponding to different sets of bit-lines as another dimension;
word lines of the instruction read buffer are in a diagonal form, and control the read-port matrix outputting a plurality of sequential instructions to a plurality of execution units;
IRB read-ports connected to a single set of bit-lines correspond to a slot; and
IRB read-ports controlled by a single diagonal word-line correspond to a way.
19. The system according to
the processing pipeline of each execution unit is further divided into a front-end pipeline up to the branch decision stage, and a rear-end pipeline including the rest of the pipeline stages;
each issue slot has a processor pipeline structure consisting of at least one rear-end pipeline and up to w front-end pipelines, where each front-end pipeline corresponds to an instruction of a way;
based on the dependency check result, the outputs of the plurality of front-end pipelines in the selected way are sent to the rear-end pipelines for completion; and
the initial address is replaced by the summation of the starting address and the way address increment amount of the selected way.
20. The system according to
a scanner, a track table, an end address table and a tracker,
wherein:
the scanner is configured to extract instruction information from instruction blocks being filled into an instruction cache and to build tracks corresponding to the instruction blocks based on the extracted information;
the track table is configured to store the tracks;
the end address table is configured to store sequential next block addresses;
the tracker is configured to address the track table and the end address table, to output target addresses which also address the end address table and sequential next block addresses of the initial address and the target addresses, and to send the initial address, the target addresses, and the next addresses to the IRB for instruction issuing.
|
This application is a national phase entry under 35 U.S.C. § 371 of International Application No. PCT/CN2014/084616, filed on Aug. 18, 2014, which claims priority of Chinese Patent Application No. 201310362689.8, filed on Aug. 19, 2013, the entire contents of which are incorporated by reference herein.
The present invention generally relates to the fields of computer, communication and integrated circuit.
The function of a cache, in general, is to copy part of the contents of a lower-level memory so that those contents can be accessed quickly by a higher-level memory or by the processor core, in order to sustain pipeline operations.
The addressing of existing caches is all based on the following method: match the tag section of an address with the tag read out from the tag memory addressed by the index section of the address, and read out the cache content addressed by the index section and the offset section of the address. If the tag read out from the tag memory matches the tag section of the address, then the content read out from the cache is valid, which is called a cache hit. Otherwise, if the tag read out from the tag memory does not match the tag section of the address, then the content read out from the cache is invalid, which is called a cache miss. In the case of a multi-way set-associative cache, the said operation is performed on all the ways in parallel to detect which way hits. The content read out corresponding to the hit way is valid content. If all of the ways miss, then all of the contents read out are invalid. The cache control logic fills the content from lower-level storage media into the cache after a cache miss.
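For clarity, the following is a minimal C sketch of the conventional tag-matching lookup described above, assuming a simple set-associative organization; the sizes (NUM_WAYS, NUM_SETS, LINE_BYTES) and names are illustrative and not taken from this disclosure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative parameters, not taken from the disclosure. */
#define NUM_WAYS   4
#define NUM_SETS   256
#define LINE_BYTES 64

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_SETS][NUM_WAYS];

/* Conventional lookup: split the address into tag, index, and offset,
 * read every way of the indexed set, and compare tags to detect a hit. */
bool cache_lookup(uint32_t addr, uint8_t *out_byte)
{
    uint32_t offset = addr % LINE_BYTES;
    uint32_t index  = (addr / LINE_BYTES) % NUM_SETS;
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);

    for (int way = 0; way < NUM_WAYS; way++) {
        cache_line_t *line = &cache[index][way];
        if (line->valid && line->tag == tag) {   /* cache hit */
            *out_byte = line->data[offset];
            return true;
        }
    }
    return false;                                /* cache miss: fill from lower memory */
}
```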
Cache misses can be divided into three categories: compulsory misses, conflict misses, and capacity misses. Compulsory misses are inevitable in the existing cache structure, except for the small portion of content that is successfully pre-fetched. However, the existing pre-fetch operation has a sizable cost. In addition, even though a multi-way set-associative cache is able to reduce conflict misses, there is a limit on the number of ways due to power consumption and speed restrictions (for example, a multi-way set-associative cache requires reading out and comparing the tags of all of the ways, and all of the content addressed by the same index, at the same time).
A modern cache system usually consists of multiple levels of multi-way set-associative caches. New cache structures, such as victim caches, trace caches, and pre-fetching, are all improvements based on the existing cache structure. Nevertheless, with the widening processor/memory speed gap, the existing architecture, and particularly the various categories of cache misses, has become the most serious bottleneck hindering the performance improvement of modern processors.
The disclosed methods and systems are directed to solve one or more problems set forth above and other problems.
An instruction cache system is provided herein, comprising: a processor core, wherein the said processor core is used to execute instructions; an instruction memory, wherein the said instruction memory is used to store instructions; and an instruction read buffer (IRB), wherein the said instruction read buffer autonomously outputs instructions to the processor core for execution.
Optionally, the said instruction read buffer autonomously outputs instructions to processor core to execute based on the execution results of the instructions executed by processor core.
Optionally, each instruction in the IRB corresponds to a token passer, and the said token passer passes a token; the said IRB autonomously outputs the instruction corresponding to the token passer that holds the token to the processor core for execution.
Optionally, when executing the instructions in the same instruction block in sequential order, the said token passes from the current token passer to the next token passer in address sequence.
Optionally, when executing instructions in different instruction blocks, the said token is passed from the current token passer to the token passer corresponding to the next instruction through global bus.
Optionally, when executing instructions of different instruction blocks, reset all token passers, and insert token into the token passer corresponding to branch target instruction.
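The token-passing behavior described in the preceding paragraphs can be illustrated with the following minimal C sketch; the IRB size and the function names are assumptions for illustration only, not a description of the disclosed hardware.

```c
#include <stdbool.h>

#define IRB_ENTRIES 64                    /* illustrative IRB size */

static bool token[IRB_ENTRIES];           /* one token passer per IRB entry */

/* Sequential execution inside an instruction block: the token moves from
 * the current token passer to the next one in address order. */
void pass_token_sequential(int current)
{
    token[current] = false;
    token[(current + 1) % IRB_ENTRIES] = true;
}

/* Control transfer to a different instruction block: all token passers are
 * reset and the token is inserted at the entry of the branch target. */
void pass_token_branch(int target_entry)
{
    for (int i = 0; i < IRB_ENTRIES; i++)
        token[i] = false;
    token[target_entry] = true;
}

/* The IRB autonomously issues the instruction whose token passer holds the token. */
int entry_to_issue(void)
{
    for (int i = 0; i < IRB_ENTRIES; i++)
        if (token[i])
            return i;
    return -1;                            /* no token present */
}
```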
Optionally, the said IRB autonomously outputs a plural number of instructions including the instruction corresponding to the token passer that contains the token to processor core to execute in parallel.
Optionally, the said plural number of instructions are in the same instruction block.
Optionally, the said plural number of instructions are in different instruction blocks.
Optionally, dependency check is performed on the said plural number of instructions; based on the dependency check result, the token is passed to the corresponding token passer, and based on the dependency check result the processor core executes a portion or all of the instructions in the said plural number of instructions.
Optionally, the said instruction cache system further includes: tracker, the said tracker moves forward to the first branch instruction after the instruction currently being executed in processor core, and outputs the fall-through instruction address and target instruction's address of the branch instruction; and when the said fall-through instruction or target instruction has not yet been stored in IRB, control instruction memory to fill IRB with the said fall-through instruction or target instruction.
Optionally, the said tracker moves forward to a certain number of branch instructions after the instruction currently being executed in the processor core, and outputs all of the fall-through instruction addresses and target instruction addresses of the said certain number of branch instructions; and when the instructions corresponding to the said fall-through or target instruction addresses have not yet been stored in IRB, control instruction memory to fill the said fall-through instructions or target instructions into IRB.
Optionally, the said processor core has two front-end pipelines and one rear-end pipeline; the said IRB outputs the fall-through instruction and target instruction of the said branch instruction at the same time to the said two front-end pipelines to execute at the same time; and based on the branch instruction execution result selects one of the execution results of the two front-end pipelines to continue executing in rear-end pipeline.
This disclosure also discloses an instruction cache method, wherein: the instructions that the processor core may execute are stored into the IRB beforehand, and the said instruction read buffer autonomously outputs instructions to the processor core for execution based on the execution results of the instructions executed by the processor core.
Optionally, the said IRB autonomously outputs the instruction corresponding to the token passer that holds the token to the processor core for execution.
Optionally, the token is passed based on the execution result of an instruction, and the plural number of instructions containing the instruction corresponding to the said token is output to the processor core for execution.
Optionally, dependency check is performed on the said plural number of instructions; based on the dependency check result, the token is passed to the corresponding token passer, and based on the dependency check result the processor core executes a portion or all of the instructions in the said plural number of instructions.
Optionally, fill the fall-through instruction and the target instruction of a said branch instruction into IRB before processor core executes the branch instruction.
Optionally, the said processor core has two front-end pipelines and one rear-end pipeline; the said IRB outputs the fall-through instruction and target instruction of the said branch instruction at the same time to the said two front-end pipelines to execute at the same time; and based on the branch instruction execution result select one of the execution results of the two front-end pipelines to continue executing in rear-end pipeline.
Optionally, the said system further includes: first tracker, the read pointer of the said first tracker moves to the first instruction after the instruction currently being executed by the processor, and outputs the branch target addresses of the branch instructions in a plural number of instructions starting with the said first instruction; when the said first instruction or the said target instruction has not yet been stored into IRB, control instruction memory to fill the said first instruction or the said target instruction into IRB; and control IRB to output the plural number of instructions starting from the first instruction.
Optionally, in the said system, dependency check unit performs dependency check on the said plural number of instructions, and based on the dependency check result determine the increment amount of the read pointer of the first tracker to update the read pointer, and based on the dependency check result processor core executes part or all of the said plural number of instructions.
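The dependency check that determines how many of the issued instructions may execute, and hence the read pointer increment, can be sketched as follows. This is a minimal model assuming that only register read-after-write and write-after-write hazards are checked; the structure and function names are illustrative.

```c
#include <stdbool.h>

typedef struct {
    int dest;         /* destination register, -1 if none */
    int src1, src2;   /* source registers, -1 if unused   */
} issued_inst_t;

/* Check dependency among n instructions issued in parallel and return the
 * read-pointer increment: the number of leading instructions that can be
 * executed this cycle (instructions at and after the first dependent one
 * are held back and re-issued after the pointer is updated). */
int dependency_check(const issued_inst_t *inst, int n)
{
    for (int i = 1; i < n; i++) {
        for (int j = 0; j < i; j++) {
            bool raw = inst[j].dest >= 0 &&
                       (inst[i].src1 == inst[j].dest ||
                        inst[i].src2 == inst[j].dest);
            bool waw = inst[j].dest >= 0 && inst[i].dest == inst[j].dest;
            if (raw || waw)
                return i;   /* increment stops at the first dependent instruction */
        }
    }
    return n;               /* no dependency: all n instructions issue */
}
```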
Optionally, in the said system, the said first tracker outputs the said first instruction address and the next block instruction address to IRB, to control IRB to output the plural number of instructions with sequential addresses starting from the said first instruction.
Optionally, in the said system, based on the received said first instruction address, IRB sets the corresponding zigzag word line to valid, thus enabling the read ports controlled by the zigzag word line to output the said plural number of instructions.
Optionally, in the said system, when the valid signal on the said zigzag word line arrives at the boundary of an instruction block, it is passed onto a bus, through which it is received by another zigzag bus in the instruction block determined by the next block instruction address, enabling the read ports controlled by the other zigzag bus to output the corresponding instructions.
Optionally, in the said system, the first tracker outputs the said first instruction address and its next block instruction address, target instruction address and its next block instruction address to IRB, to control IRB to output plural number of instructions starting from the said first instruction to the first branch instruction, and instructions of contiguous address starting from the branch target instruction.
Optionally, in the said system, based on the said first instruction address received, IRB sets the corresponding zigzag word line to valid, thus enabling the read ports controlled by the zigzag word line to output instructions starting from the said first instruction up to the first branch instruction; the valid signal is passed to a target word line when it reaches the said first branch instruction, and is received by a second zigzag word line in the instruction block determined by the branch target address, and the second zigzag word line controls its corresponding read ports to output the corresponding instructions; and when the valid signal on the said zigzag word lines arrives at the boundary of an instruction block, it is passed onto a bus, through which it is received by other zigzag buses in the instruction blocks determined by the next block instruction addresses, enabling the read ports controlled by the other zigzag buses to output the corresponding instructions.
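The zigzag (diagonal) word-line issuance described above can be approximated by the following sketch, which computes, for each issue slot, the IRB row and entry whose read-port would be enabled; the block size, the slot count, and the modeling of the block-boundary hand-off as a simple switch to the next-block row are assumptions made for illustration.

```c
#define BLOCK_SIZE 8        /* instructions per instruction block (illustrative) */
#define NUM_SLOTS  4        /* issue slots, m                                    */

typedef struct {
    int row;                /* IRB row (instruction block)                       */
    int entry;              /* offset of the instruction within the row          */
} read_port_t;

/* A zigzag (diagonal) word line starting at 'offset' in row 'cur_row' enables
 * one read-port per slot for sequential instructions.  When the valid signal
 * reaches the block boundary, it is handed to the row selected by the next
 * block address ('next_row') so that issuance continues without a gap. */
void enabled_read_ports(int cur_row, int next_row, int offset,
                        read_port_t ports[NUM_SLOTS])
{
    for (int slot = 0; slot < NUM_SLOTS; slot++) {
        int pos = offset + slot;
        ports[slot].row   = (pos < BLOCK_SIZE) ? cur_row : next_row;
        ports[slot].entry = pos % BLOCK_SIZE;
    }
}
```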
Optionally, in the said system, the said processor core has two sets of front-end pipelines and one set of rear-end pipeline; the said first tracker outputs the said first instruction address and its next block instruction address, target instruction address and its next block instruction address to IRB, to control the IRB to output the plural number of instructions of contiguous address starting from the said first instruction to one set of the front-end pipelines to execute; to control the IRB to output the plural number of instructions of contiguous address starting from the branch target address of the said first branch instruction to another set to execute; and based on the execution result of the branch instruction select the execution result of one of the two sets of said front-end pipelines to continue executing in rear-end pipeline.
Optionally, in the said system, the said processor core has two sets of front-end pipelines and one set of rear-end pipeline; the said system also includes a second tracker; the said first tracker outputs the said first instruction address and its next block instruction address, and the target instruction address to IRB, to control IRB to output the plural number of instructions with contiguous address starting from the said first instruction to a set of front-end pipelines to execute; the said second tracker outputs the next block instruction address of the said target instruction to IRB, to control the IRB to output the plural number of instructions with contiguous address starting from the branch target instruction of the said first branch instruction to another set of front-end pipelines to execute; and based on the execution result of the branch instruction select the execution result of one of the two sets of said front-end pipelines to continue executing in rear-end pipeline.
Optionally, in the said system, the said processor core has plural sets of front-end pipelines and one set of rear-end pipeline; the said first tracker outputs the said first instruction address and its next block instruction address to IRB, to control IRB to output the plural number of instructions with contiguous address starting from the said first instruction to a set of front-end pipelines to execute; the said first tracker outputs the branch target instruction addresses and their next block addresses of all of the branch instructions in the plural number of instructions with contiguous addresses starting from the said first address to IRB, each of those addresses controls IRB to output a plural number of instructions with contiguous addresses starting from each of the branch target instructions to the other front-end pipelines to execute; and the total number of branch instructions is less than the number of sets of front-end pipelines.
Optionally, in the said system, the said processor core has plural sets of front-end pipelines and one set of rear-end pipeline; the said first tracker outputs the said first instruction address and its next block instruction address to IRB, to control IRB to output the plural number of instructions with contiguous address starting from the said first instruction to a set of front-end pipelines to execute; the said first tracker outputs the branch target instruction addresses and their next block addresses of every layer of the branch instructions in the plural number of instructions with contiguous addresses starting from the said first address to IRB, each of those addresses controls IRB to output a plural number of instructions with contiguous addresses starting from each of the branch target instructions in every layer of branches to the other front-end pipelines to execute; and the total number of branch instructions in the said every layer is less than the number of sets of front-end pipelines.
Optionally, in the said system, each set of front-end pipelines constitutes a Way, and the corresponding execution unit in each set of front-end pipelines constitutes a slot; a dependency check module performs dependency check on each Way starting from the said first instruction, and based on the dependency check result of each Way produces a read pointer increment for each Way and controls the execution units of the corresponding Way to execute part or all of the corresponding instructions; based on the execution results of the branch instructions in each Way, the execution units in one of the Ways are selected to complete execution in the corresponding rear-end pipelines, while the execution in the execution units of the other Ways is terminated; and the instruction address and read pointer increment of one of the Ways are selected to update the tracker read pointer based on the execution results of the branch instructions in each Way.
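The way selection based on independent branch decisions and branch priority can be sketched as a chain of 2-to-1 selections, as also recited in the claims. In this minimal model the branches are assumed to be ordered from the earliest to the latest in program order, and the way numbering is illustrative.

```c
#include <stdbool.h>

/* Way selection: each branch decision controls a 2-to-1 selection.
 * Branch decisions are evaluated from the latest branch in program order
 * back to the earliest; the result of a later selection is the fall-through
 * input of the selection for the preceding branch, and the result of the
 * selection for the first branch designates the selected way. */
typedef struct {
    int  target_way;   /* way starting at this branch's target         */
    bool taken;        /* independent branch decision for this branch  */
} branch_t;

int select_way(const branch_t *branch, int n_branches, int fallthrough_way)
{
    int selected = fallthrough_way;            /* way used if no branch is taken  */
    for (int i = n_branches - 1; i >= 0; i--)  /* from latest to earliest branch  */
        selected = branch[i].taken ? branch[i].target_way : selected;
    return selected;
}
```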
Optionally, in the said system, organize IRB by Ways; or organize IRB by slots.
Optionally, in the said system, the said dependency check module is configurable, and can be configured to decrease the system's maximum number of instructions issued.
Optionally, the said system also includes a data read buffer and a data engine; the said data engine fills the data read buffer in advance with the data that may be used by the load instructions in the instruction read buffer.
Optionally, in the said system, the said data read buffer's table entries and the instruction read buffer's table entries are in one-to-one correspondence, and the data corresponding to a data fetch instruction can be directly found in the data read buffer through the position of the said data fetch instruction in the instruction read buffer; or the said data read buffer has fewer table entries than the instruction read buffer and each of the instruction read buffer entries contains a pointer, and the data corresponding to a data fetch instruction can be found by decoding the said pointer in the instruction read buffer entry of the data fetch instruction.
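The two entry-mapping schemes just described can be sketched as follows; the buffer sizes and names are illustrative assumptions.

```c
#define IRB_ENTRIES 64
#define DRB_ENTRIES 16                       /* smaller data read buffer (illustrative)      */

typedef struct { long value; } drb_entry_t;

static drb_entry_t drb_full[IRB_ENTRIES];    /* scheme 1: one entry per IRB entry            */
static drb_entry_t drb_small[DRB_ENTRIES];   /* scheme 2: fewer entries than the IRB         */
static int         drb_ptr[IRB_ENTRIES];     /* scheme 2: pointer kept with each IRB entry   */

/* Scheme 1: the position of the load instruction in the IRB directly locates its data. */
long load_data_direct(int irb_pos)  { return drb_full[irb_pos].value; }

/* Scheme 2: the pointer stored in the IRB entry is decoded to locate the data. */
long load_data_pointer(int irb_pos) { return drb_small[drb_ptr[irb_pos]].value; }
```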
Optionally, the said method further includes: the read pointer of the said first tracker moves to the first instruction after the instruction currently being executed by the processor, and outputs the branch target addresses of the branch instructions in a plural number of instructions starting with the said first instruction; when the said first instruction or the said target instruction has not yet been stored into IRB, control instruction memory to fill the said first instruction or the said target instruction into IRB; and control IRB to output the plural number of instructions starting from the first instruction.
Optionally, in the said method, dependency check unit performs dependency check on the said plural number of instructions, and based on the dependency check result determine the increment amount of the read pointer of the first tracker to update the read pointer, and based on the dependency check result processor core executes part or all of the said plural number of instructions.
Optionally, in the said method, the said first tracker pointer outputs the said first instruction address and the next block instruction address to IRB, to control IRB to output the plural number of instructions with sequential addresses starting from the said first instruction.
Optionally, in the said method, based on the received said first instruction address, IRB sets the corresponding zigzag word line to valid, thus enabling the read ports controlled by the zigzag word line to output the said plural number of instructions.
Optionally, in the said method, when the valid signal on the said zigzag word line arrives at the boundary of an instruction block, it is passed onto a bus, through which it is received by another zigzag bus in the instruction block determined by the next block instruction address, enabling the read ports controlled by the other zigzag bus to output the corresponding instructions.
Optionally, in the said method, the first tracker pointer outputs the said first instruction address and its next block instruction address, target instruction address and its next block instruction address to IRB, to control IRB to output plural number of instructions starting from the said first instruction to the first branch instruction, and instructions of contiguous address starting from the branch target instruction.
Optionally, in the said method, based on the said first instruction address received, IRB sets the corresponding zigzag word line to valid, thus enabling the read ports controlled by the zigzag word line to output instructions starting from the said first instruction up to the first branch instruction; the valid signal is passed to a target word line when it reaches the said first branch instruction, and is received by a second zigzag word line in the instruction block determined by the branch target address, and the second zigzag word line controls its corresponding read ports to output the corresponding instructions; and when the valid signal on the said zigzag word lines arrives at the boundary of an instruction block, it is passed onto a bus, through which it is received by other zigzag buses in the instruction blocks determined by the next block instruction addresses, enabling the read ports controlled by the other zigzag buses to output the corresponding instructions.
Optionally, in the said method, the said processor core has two sets of front-end pipelines and one set of rear-end pipeline; the said first tracker pointer outputs the said first instruction address and its next block instruction address, target instruction address and its next block instruction address to IRB, to control the IRB to output the plural number of instructions of contiguous address starting from the said first instruction to one set of the front-end pipelines to execute; to control the IRB to output the plural number of instructions of contiguous address starting from the branch target address of the said first branch instruction to another set to execute; and based on the execution result of the branch instruction select the execution result of one of the two sets of said front-end pipelines to continue executing in rear-end pipeline.
Optionally, in the said method, the said processor core has two sets of front-end pipelines and one set of rear-end pipeline; the said first tracker pointer outputs the said first instruction address and its next block instruction address, and the target instruction address to IRB, to control IRB to output the plural number of instructions with contiguous address starting from the said first instruction to a set of front-end pipelines to execute; the said second tracker outputs the next block instruction address of the said target instruction to IRB, to control the IRB to output the plural number of instructions with contiguous address starting from the branch target instruction of the said first branch instruction to another set of front-end pipelines to execute; and based on the execution result of the branch instruction select the execution result of one of the two sets of said front-end pipelines to continue executing in rear-end pipeline.
Optionally, in the said method, the said processor core has plural sets of front-end pipelines and one set of rear-end pipeline; the said first tracker pointer outputs the said first instruction address and its next block instruction address to IRB, to control IRB to output the plural number of instructions with contiguous address starting from the said first instruction to a set of front-end pipelines to execute; the said first tracker pointer outputs the branch target instruction addresses and their next block addresses of all of the branch instructions in the plural number of instructions with contiguous addresses starting from the said first address to IRB, each of those addresses controls IRB to output a plural number of instructions with contiguous addresses starting from each of the branch target instructions to the other front-end pipelines to execute; and the total number of branch instructions is less than the number of sets of front-end pipelines.
Optionally, in the said method, the said processor core has plural sets of front-end pipelines and one set of rear-end pipeline; the said first tracker pointer outputs the said first instruction address and its next block instruction address to IRB, to control IRB to output the plural number of instructions with contiguous address starting from the said first instruction to a set of front-end pipelines to execute; the said first tracker pointer outputs the branch target instruction addresses and their next block addresses of every layer of the branch instructions in the plural number of instructions with contiguous addresses starting from the said first address to IRB, each of those addresses controls IRB to output a plural number of instructions with contiguous addresses starting from each of the branch target instructions in every layer of branches to the other front-end pipelines to execute; and the total number of branch instructions in the said every layer is less than the number of sets of front-end pipelines.
Optionally, in the said method, each set of front-end pipelines constitutes a Way, and the corresponding execution unit in each set of front-end pipelines constitutes a slot; a dependency check module performs dependency check on each Way starting from the said first instruction, and based on the dependency check result of each Way produces a read pointer increment for each Way and controls the execution units of the corresponding Way to execute part or all of the corresponding instructions; based on the execution results of the branch instructions in each Way, the execution units in one of the Ways are selected to complete execution in the corresponding rear-end pipelines, while the execution in the execution units of the other Ways is terminated; and the instruction address and read pointer increment of one of the Ways are selected to update the tracker read pointer based on the execution results of the branch instructions in each Way.
Optionally, in the said method, organize IRB by Ways; or organize IRB by slots.
Optionally, in the said method, the said dependency check module is configurable, and can be configured to decrease the system's maximum number of instructions issued.
Optionally, in the said method, the data read buffer is filled in advance with the data that may be used by load instruction in the instruction read buffer.
Optionally, in the said method, the said data read buffer's table entries and the instruction read buffer's table entries are in one-to-one correspondence, and the data corresponding to a data fetch instruction can be directly found in the data read buffer through the position of the said data fetch instruction in the instruction read buffer; or the said data read buffer has fewer table entries than the instruction read buffer and each of the instruction read buffer entries contains a pointer, and the data corresponding to a data fetch instruction can be found by decoding the said pointer in the instruction read buffer entry of the data fetch instruction.
Other aspects of the present disclosure may be understood by those skilled in the art in light of the description, the claims, and the drawings of the present disclosure.
The disclosed system and method are capable of providing a fundamental solution for the cache structures used in digital systems. The conventional mechanism fills instructions only after a cache miss. The system and method of this disclosure fill the instruction read buffer in the instruction cache system before the processor executes the said instructions, and thus can prevent or sufficiently hide compulsory misses. The system and method of this disclosure provide a fully associative cache structure, thus preventing or sufficiently hiding conflict misses and capacity misses. In addition, the system and method of this disclosure enable the IRB to autonomously provide instructions for processor core execution, avoiding the tag matching on the time-critical path of a cache read. Therefore, the system may run at a higher clock frequency, and its power consumption is significantly lower than that of a conventional cache system. Other advantages and applications are obvious to those skilled in the art.
Reference will now be made in detail to exemplary embodiments of the invention, which are illustrated in the accompanying drawings in connection with the exemplary embodiments. By referring to the description and claims, features and merits of the present invention will be clearer to understand. It should be noted that all the accompanying drawings use very simplified forms and use non-precise proportions, only for the purpose of conveniently and clearly explaining the embodiments of this disclosure.
It is noted that, in order to clearly illustrate the contents of the present disclosure, multiple embodiments are provided to further interpret different implementations of this disclosure, where the multiple embodiments are enumerated rather than listing all possible implementations. In addition, for the sake of simplicity, contents mentioned in the previous embodiments are often omitted in the following embodiments. Therefore, the contents that are not mentioned in the following embodiments can be referred to in the previous embodiments.
Although this disclosure may be expanded using various forms of modifications and alterations, the specification also lists a number of specific embodiments to explain in detail. It should be understood that the purpose of the inventor is not to limit the disclosure to the specific embodiments described herein. On the contrary, the purpose of the inventor is to protect all the improvements, equivalent conversions, and modifications based on spirit or scope defined by the claims in the disclosure. The same reference numbers may be used throughout the drawings to refer to the same or like parts.
Although a CPU is used as an example for the cache system in this disclosure, this invention can be applied to the cache system of any proper processor system, such as a general purpose processor, CPU, MCU, DSP, GPU, SOC, or ASIC.
In this disclosure, the instruction and data addresses mean the main memory addresses of the instructions and data. For the sake of simplicity, it is assumed in this disclosure that the virtual address is the same as the physical address. However, the method disclosed by this invention can also be applied in cases where address translation is required. In this disclosure, the current instruction means the instruction currently being executed or acquired by the processor core; the current instruction block means the block containing the instruction currently being executed by the processor core.
Please refer to
When the processor core (CPU Core) 111 executes an instruction, it first reads the instruction from a higher-level memory. Here, the memory hierarchy level indicates the distance from the processor core 111: the closer to the processor core 111 a memory is, the higher its level. A higher-level memory in general is faster but has less capacity than a lower-level memory.
This embodiment differs from the conventional cache-based processor system in that there is an instruction read buffer (IRB) 107 and its corresponding address tag storage matcher 109. Here, the capacity of the instruction read buffer 107 is smaller than that of instruction memory 206, and its access latency is shorter. Instruction memory 103 and instruction read buffer 107 can be any suitable memories, such as a register, register file, SRAM, DRAM, flash memory, hard disk, solid state disk, or any other suitable existing or future memory. Instruction memory 103 can function as a memory of the system, or as a level 1 cache when other cache levels exist. The portion of memory that stores the content the processor core 111 will fetch, such as the instructions in an instruction block, can be subdivided into memory blocks.
Specifically, processor core 111 sends the address of the current instruction to address tag storage matcher 109 for matching. If matched, it indicates the current instruction is already in IRB 107 and can be obtained from IRB 107 with a shorter latency. Otherwise, it indicates the current instruction has not yet been stored in IRB 107; therefore, address tag storage matcher 109 sends the instruction address of the current instruction to tag memory 105 for matching. If it is matched in tag memory 105, then the instruction block containing the current instruction may be fetched from instruction memory 103 and filled into IRB 107; at the same time the current instruction is sent to processor core 111. If it is not matched in tag memory 105, then tag memory 105 sends the address of the current instruction to an even lower-level memory to fetch the instruction block containing the current instruction, fills instruction memory 103 and IRB 107 with the instruction block, and sends the current instruction to processor core 111.
In this process, the least time is taken when processor core 111 can directly fetch the current instruction from IRB 107. Therefore, it is desirable to fill as many as possible of the instructions that will likely be used into IRB 107 beforehand, in preparation for fetching by processor core 111.
Specifically, in addition to filling the instruction block containing the current instruction into IRB 107, one or a plural number of the following sequential instruction blocks can also be filled into IRB 107. This way, when processor core 111 completes fetching the last instruction in the current instruction block, it can fetch the next instruction (which is in the next instruction block after the said current instruction block) right away, thereby reducing the wait time for instruction fetch.
In addition, the instruction blocks of the branch target instructions of part of or all of the branch instructions in IRB 107 can also be filled into IRB 107. For example, the instruction block of the branch target instruction of a branch instruction in the current instruction block can be filled into IRB 107, and the instruction block of a branch target instruction of a branch instruction in an instruction block that is at least one block after the current block in sequence can also be filled into IRB 107, ready to be fetched by processor core 111. In this disclosure, a branch instruction or branch point means any proper instruction form that causes processor core 116 to change its execution flow (such as executing an instruction that is not in sequential order). A branch instruction or branch source means an instruction that performs a branch operation, and the branch source address may be the instruction address of the branch instruction itself; the branch target means the target instruction to which a branch instruction branches; the branch target address means the address to which the program transfers when the branch of a branch instruction is successfully taken, that is, the instruction address of the branch target instruction.
In this embodiment, existing techniques can be used to determine the branch target address of a branch instruction, so that the branch target instruction block can be found and filled into IRB 107. For example, processor core 111 calculates the branch target address by executing the branch instruction, and the corresponding branch target instruction block is then stored into IRB 107. Alternatively, the branch target instruction block can be filled into IRB based on the branch target address recorded in a branch target buffer. This way, when a branch of a branch instruction in the current instruction block is determined as taken by processor core 111, the corresponding branch target instruction can be obtained from IRB 107, reducing the wait time in acquiring the instruction.
In addition, the program counter in the processor core can be further improved so that, besides acquiring instructions from the IRB in program execution order, it can skip certain instructions and fetch only the remaining instructions, thereby acquiring instructions selectively. Please refer to
In
If the branch is successfully taken, then multiplexer 165 selects the output of adder 153, which is the branch target address. Otherwise, multiplexer 165 selects the output of adder 155 or adder 157 based on the comparison result of comparator 159. Specifically, when the instruction address stored in register 151 is different from the current instruction address, the fall-through instruction after the current instruction is not the instruction to be skipped. Therefore, the output of comparator 159 controls multiplexer 165 to select the output of adder 155, which is the instruction address of the fall-through instruction, so the processor core acquires the instruction after the current instruction. When the instruction address stored in register 151 is the same as the current instruction address, the fall-through instruction after the current instruction is the instruction to be skipped. Therefore, the output of comparator 159 controls multiplexer 165 to select the output of adder 157, which is the instruction address of the second instruction after the current instruction, so the processor core skips the fall-through instruction after the current instruction and directly acquires the second instruction after the current instruction. In this way, the instruction skip function is implemented.
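The next-address selection just described may be modeled roughly as follows. The sketch assumes that adder 153 forms the branch target as the current address plus a branch offset and that instructions have a fixed width; both are illustrative assumptions rather than statements from this disclosure.

```c
#include <stdbool.h>
#include <stdint.h>

#define INST_BYTES 4   /* assumed instruction width */

/* Model of the next-address selection around multiplexer 165. */
uint32_t next_address(uint32_t current_pc,
                      uint32_t branch_offset,   /* input to adder 153             */
                      uint32_t skip_reg_151,    /* address stored in register 151 */
                      bool     branch_taken)
{
    uint32_t target    = current_pc + branch_offset;     /* adder 153 */
    uint32_t fall_thru = current_pc + INST_BYTES;         /* adder 155 */
    uint32_t skip_one  = current_pc + 2 * INST_BYTES;     /* adder 157 */

    if (branch_taken)
        return target;                                    /* mux 165 selects adder 153 */

    /* Comparator 159: if register 151 holds the current address, the
     * fall-through instruction is the one to be skipped. */
    if (skip_reg_151 == current_pc)
        return skip_one;                                  /* mux 165 selects adder 157 */
    return fall_thru;                                     /* mux 165 selects adder 155 */
}
```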
In addition, the branch target address of a branch instruction may be calculated before processor core 111 executes the branch instruction, and the branch target instruction block may be filled into IRB 107 beforehand. Please refer to
Please refer to
Filler 202 fetches instructions or instruction blocks from a lower-level memory and fills them into instruction memory 206 based on the address provided by active list 204. Then, the instruction block is filled into instruction read buffer 107 from instruction memory 206, ready to be read by processor core 111. Here, fill means moving instructions from a lower-level memory to a higher-level memory, and memory access means processor core 111 reading instructions from memory or from instruction read buffer 107.
The table entries in active list 204 and the memory blocks in instruction memory 206 correspond to each other one-to-one. Each entry of active list 204 holds a pair consisting of the memory block address of an instruction block and its block number (BNX) in instruction memory 206. The block number in this invention indicates the location of the storage block in instruction memory 206. The branch target instruction address generated by scanner 208 can be matched with the instruction block memory addresses stored in active list 204 to determine whether the branch target is already stored in instruction memory 206. If the target instruction block is not yet in instruction memory 206, then it is filled into instruction memory 206, and at the same time a corresponding pair of instruction block address and block number (BNX) is established in active list 204. The match referred to in this disclosure means comparing two values: when the two values are equal, the match is successful; otherwise, it is not a match.
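The match-or-allocate behavior of active list 204 can be sketched as follows; the number of entries, the round-robin replacement stand-in, and the fill stub are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACTIVE_ENTRIES 64                 /* illustrative number of active list entries */

typedef struct {
    bool     valid;
    uint32_t block_addr;                  /* memory block address of the instruction block */
} active_entry_t;

static active_entry_t active_list[ACTIVE_ENTRIES];   /* entry index serves as BNX          */
static int next_victim;                               /* round-robin replacement stand-in  */

static void fill_block_from_lower_memory(uint32_t block_addr, int bnx)
{
    (void)block_addr; (void)bnx;          /* the actual fetch by filler 202 is not modeled */
}

/* Match a block address against active list 204 and return its BNX; on a miss,
 * assign a block number and request the block from the lower-level memory. */
int active_list_match(uint32_t block_addr)
{
    for (int bnx = 0; bnx < ACTIVE_ENTRIES; bnx++)
        if (active_list[bnx].valid && active_list[bnx].block_addr == block_addr)
            return bnx;                   /* match: the block is already present */

    int bnx = next_victim;                /* miss: pick an entry to replace */
    next_victim = (next_victim + 1) % ACTIVE_ENTRIES;
    active_list[bnx].valid = true;
    active_list[bnx].block_addr = block_addr;
    fill_block_from_lower_memory(block_addr, bnx);
    return bnx;
}
```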
Scanner 208 scans the instructions from the lower-level memory that are being filled into instruction memory 206, extracts information such as instruction type, instruction source address, and branch offset, and based on this information calculates the branch target address. In this invention, a branch instruction or branch point is any appropriate instruction that can cause processor core 116 to change its execution flow (such as executing an instruction that is not in sequential order). The branch source means a branch instruction; the branch source address is the instruction address of the branch instruction; the branch target instruction is the instruction executed after a successful branch; the branch target address is the address a successful branch transfers to, which is also the instruction address of the branch target instruction. For example, instruction types can include conditional branch instructions, unconditional branch instructions, and other instruction types. Instruction types can further include subcategories of conditional branch instructions, such as branch on unequal, branch on greater, etc. An unconditional branch instruction can be viewed as a type of conditional branch instruction with an always-taken condition. Other information can also be included. Scanner 208 sends the above information and addresses to other modules, such as active list 204 and track table 210.
Instruction read buffer 107 contains at least one instruction block, including the current instruction block. Every row in the instruction read buffer can contain the same number of, or fewer, instructions than an instruction block in instruction memory 206. When each row of the IRB and an instruction block have the same number of instructions, the corresponding instruction block number can identify the IRB row. If the rows in IRB 107 have fewer instructions than a memory instruction block, multiple rows are equivalent to one instruction block, and a less significant address bit can be added to the block number to identify the IRB row. For example, if there is an instruction block whose BNX is '111', its corresponding rows in IRB 107 will be identified as '1110' and '1111'.
For ease of following explanation, the rows in IRB 107 are assumed to have the same number of instructions as the number of instructions in instruction blocks in instruction memory 206.
In the present disclosure, instruction read buffer 107 may actively provide instructions to processor core 111 for execution according to the current instruction execution situation of the processor core 111.
Track table 210 has a plural number of track points. A track point is a table element of the track table. It can hold at least one instruction's information, such as instruction type, branch target address, etc. In this invention, an instruction in the instruction memory is addressed by the same track table address as its corresponding track table entry. The track table entry corresponding to a branch instruction contains the track table address of its branch target instruction. A track is a plural number of track entries (track points) corresponding to one instruction block in instruction memory 206. The same block number indexes a track and its corresponding instruction block. The track table includes at least one track. The number of track points in a track can be the same as the number of entries in a row of track table 210. Track table 210 can also be organized in other forms.
The first address (BNX) and the second address (BNY) can be employed to index a track point (i.e., an instruction) in the track table (instruction memory). The first address represents the instruction block number of the track point; the second address represents the position (address offset) of the track point (and its corresponding instruction) within the track (memory block). If the track point has a branch type, the address content of the track point denotes its branch target: the first address in the track point identifies the target track and the second address identifies the target instruction on the target track. Therefore, the track table is a table whose own addresses correspond to branch source instructions and whose contents correspond to branch target addresses.
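The (BNX, BNY) addressing of track points can be sketched as follows; the table dimensions and field names are illustrative assumptions.

```c
#include <stdint.h>

#define NUM_TRACKS   64    /* illustrative                                                    */
#define TRACK_POINTS 8     /* track points per track (an End point is added beyond the last
                              instruction point in practice)                                  */

typedef struct {
    uint8_t type;          /* e.g. non-branch, conditional branch, unconditional branch       */
    int     target_bnx;    /* first address of the branch target (valid for branch types)     */
    int     target_bny;    /* second address of the branch target                             */
} track_point_t;

static track_point_t track_table[NUM_TRACKS][TRACK_POINTS];

/* A track point (and the corresponding instruction) is indexed by the pair
 * (BNX, BNY): BNX selects the track / memory block, BNY the offset within it. */
track_point_t *track_point(int bnx, int bny)
{
    return &track_table[bnx][bny];
}
```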
Scanner 208 extracts the information of the instructions being stored into instruction memory 206, and then stores the extracted information in the corresponding entries in track table 210. If the instruction is a branch instruction, the branch instruction's branch target instruction address is calculated and sent to active list 204 to be matched. When it is matched, the block number (BNX) of the branch target instruction is obtained. If the branch target address is not yet in active list 204, the branch target address is sent to filler 202, which reads the instruction block from the lower-level memory. At the same time, the replacement logic in the active list assigns a block number (BNX) for the instruction block; the more significant part of the target address is stored in the active list 204 entry, and the instruction block fetched by filler 202 is filled into the memory block indicated by the block number. Then the BNX and the lower part of the target address are stored in the corresponding track table entry as the first and second addresses.
The tracks in Track Table 210 and the memory blocks in instruction memory 206 correspond one-to-one, and both use the same pointer. The instructions to be executed by Processor Core 111 can all be filled into instruction memory 206 and IRB 107. To preserve the program order relationship between tracks, there is an End track point beyond the track point corresponding to the last instruction on every track, which stores the first address of the sequential next track's instruction block. If instruction memory 206 stores multiple instruction blocks, then while an instruction block is being executed, the sequential next instruction block is stored into instruction memory 206 and IRB 107, ready to be executed by processor core 111. The address of the next instruction block is the sum of the address of the previous instruction block and the block size. This address is also sent to Active List 204 for matching; the instruction block obtained is filled into instruction memory 206, and the BNX is filled into the End track point of the current track. The instructions in this new block being filled into instruction memory 206 are also scanned by scanner 208, and the extracted information fills the corresponding track as described before.
The read pointer of tracker 214 points to the track point in track table 210 which corresponds to the first branch instruction after the current entry in the track table. The read pointer of tracker 214 is comprised of a first address pointer and a second address pointer. The first address pointer points to the track currently being executed in track table 210. The second address pointer points to the first branch track point after the track point corresponding to the instruction currently being executed, or to the End point if there is no branch track point remaining on the track. The first address pointer indexes instruction memory 206, fetching the target or next instruction block to be filled into IRB 107, in preparation for Core 111 to execute if it successfully takes a branch.
If tracker 214 points to a branch instruction but the branch is not taken, the read pointer of tracker 214 points to the next branch track point, or to the End track point if there are no more branch track points remaining on the track. IRB 107 provides the fall-through instructions following the not-taken branch instruction for Core 111 to execute.
If the branch instruction pointed to by tracker 214 takes a branch, the first address and the second address of the branch target become the new address pointer of the tracker, pointing to the track point corresponding to the branch target in the track table. The new tracker address pointer also points to the recently filled branch target instruction block, making it the new current instruction block. Instruction read buffer 107 provides the branch target instruction and its subsequent sequential instructions to processor core 111 for execution. Then, the read pointer of tracker 214 points to the first branch instruction track point after the current instruction in the track corresponding to the new instruction block, or to the End track point if no more branch track points remain on the track.
If tracker 214 points to the End track point of the track, the read pointer of tracker 214 is updated with the content of the End track point; that is, the read pointer points to the first track point of the next track, thereby pointing to the new current instruction block. Then, the read pointer of tracker 214 points to the first branch instruction track point after the current instruction in the track containing the current instruction in track table 210, or to the End track point when there are no more branch track points in the remainder of the track. The said sequence repeats. Instructions may be filled into instruction memory 206 and IRB 107 before they are executed by processor core 111, so Core 111 may fetch instructions with minimum latency, therefore improving the performance of the processor.
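The read pointer behavior described in the preceding paragraphs can be summarized, purely as an illustrative software model under assumed data structures (a list of entries per track, with branch points and an End point), by the sketch below; it is not a definitive implementation of tracker 214.

    # Assumed track encoding: ('br', tgt_bnx, tgt_bny) for branch points,
    # ('end', next_bnx) for the End track point, None for other instructions.
    tracks = {
        0: [None, ('br', 2, 0), None, ('end', 1)],
        1: [None, None, None, ('end', 2)],
        2: [None, None, None, ('end', 0)],
    }

    def next_branch_point(bnx, bny):
        """Move the read pointer to the first branch point (or End point) at or after bny."""
        track = tracks[bnx]
        for y in range(bny, len(track)):
            if track[y] is not None:
                return bnx, y
        return bnx, len(track) - 1             # End point is always last

    def step(bnx, bny, taken):
        """Update the read pointer after the pointed-to track point is resolved."""
        entry = tracks[bnx][bny]
        if entry[0] == 'end':                   # End point: go to the next track
            return next_branch_point(entry[1], 0)
        if taken:                               # branch taken: jump to the target
            return next_branch_point(entry[1], entry[2])
        return next_branch_point(bnx, bny + 1)  # not taken: next branch point on same track

    ptr = next_branch_point(0, 0)               # stops at branch point (0, 1)
    ptr = step(*ptr, taken=True)                # taken: moves onto track 2, stops at its End point
    print(ptr)                                  # -> (2, 3)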
The data/address bidirectional addressing unit 302 may include a plurality of entries 304. Each entry 304 includes a register, a flag bit 320 (i.e., V bit), a flag bit 322 (i.e., A bit), a flag bit 324 (i.e., U bit), and a comparator. Results from the comparators may be provided to encoder 306 to generate a matching entry number, that is, a block number. Control 314 may be used to control the read/write state. The V (valid) bit 320 of each entry may be initialized as ‘0’, and the A (Active) bit 322 of each entry may be written by an active signal on input line 328.
A write pointer 310 may point to an entry in the data/address bidirectional addressing unit, and the pointer is generated by a wrap-around increment unit 318. The maximum number generated by wrap-around increment unit 318 is the same as the total number of entries. After reaching the maximum number, wrap-around increment unit 318 wraps around to ‘0’ and continues incrementing until reaching the maximum number again. When the write pointer 310 points to the current entry, the V bit and A bit of the current entry may be checked. If both the V bit and the A bit are ‘0’, the current entry is available for writing. After the write operation is completed, wrap-around increment unit 318 may increase the pointer by one (1) to point to the next entry. However, if either the V bit or the A bit is not ‘0’, the current entry is not available for writing; wrap-around increment unit 318 may increase the pointer by one (1) to point to the next entry, and the next entry is checked for availability for writing.
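A minimal sketch of the wrap-around write pointer behavior described above is given below, assuming eight entries and Python dictionaries standing in for the V/A flag bits; names and sizes are illustrative assumptions, not the disclosed circuit.

    # Behavioral sketch: advance a wrap-around write pointer, writing only into
    # entries whose V bit and A bit are both '0'.
    NUM_ENTRIES = 8
    entries = [{'v': 0, 'a': 0, 'data': None} for _ in range(NUM_ENTRIES)]
    write_ptr = 0

    def wrap_inc(p):
        """Wrap-around increment: after the maximum entry number, restart from 0."""
        return (p + 1) % NUM_ENTRIES

    def allocate(data):
        """Advance the write pointer until an entry with V == 0 and A == 0 is found."""
        global write_ptr
        for _ in range(NUM_ENTRIES):
            e = entries[write_ptr]
            if e['v'] == 0 and e['a'] == 0:
                e['data'] = data
                allocated = write_ptr
                write_ptr = wrap_inc(write_ptr)   # point to the next entry after writing
                return allocated
            write_ptr = wrap_inc(write_ptr)       # skip entries still in use
        return None                               # no replaceable entry found

    print(allocate(0x1234))   # -> 0 on an empty unit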
During writing, the data inputted through block address data input 308 is compared with the content of the register of each entry. If there is a match, the entry number is outputted on matched address output 316, and the write operation is not performed. If there is no match, the inputted data is written into the entry pointed to by the address pointer 310, and the V bit of the same entry is set to ‘0’. The entry number is provided on matched address output 316, and the address pointer 310 points to the next entry. For reading, the content of the entry pointed to by the read address 312 is read out on data output 330. The entry number is outputted on matched address output 316, and the V bit of the selected entry is set to ‘1’.
The U bit 324 of an entry may be used to indicate usage status. When write pointer 310 points to an entry 304, the U bit 324 of the pointed entry is set to ‘0’. When an entry 304 is read, the U bit 324 of the read entry is set to ‘1’. Further, when the write pointer 310 generated by wrap-around increment unit 318 points to a new entry, the U bit of the new entry is checked first. If the U bit is ‘0’, the new entry is available for replacement, and write pointer 310 stays on the new entry for possible data to be written. However, if the U bit is ‘1’, write pointer 310 further points to the next entry. Optionally, a window pointer 326 may be used to set the U bit of the pointed entry to ‘0’. The entry pointed to by the window pointer 326 is N entries ahead of write pointer 310 (N is an integer). The value of window pointer 326 may be obtained by adding the value N to the write pointer 310. The N entries between write pointer 310 and window pointer 326 are considered a window. The unused entries may be replaced when write pointer 310 moves through those N entries. The replacement rate of the entries can be changed by changing the size of the window (i.e., changing the value of N). Alternatively, the U bit may include more than one bit, thus becoming the U bits. The U bits may be cleared by write pointer 310 or window (clear) pointer 326, and the U bits are increased by ‘1’ after each reading. Before a write operation, the U bits of the current entry are compared with a predetermined number. If the value of the U bits is less than the predetermined value, the current entry is available for replacement. If the value of the U bits is greater than or equal to the predetermined value, write pointer 310 moves to the next entry.
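The single-bit U window policy may be modeled, for illustration only and under assumed parameters (eight entries, window size N = 2), roughly as follows; this is a behavioral sketch, not the circuit.

    # Behavioral sketch: a window pointer N entries ahead clears U bits, and the
    # write pointer only stops on entries whose U bit is '0' (not recently read).
    NUM_ENTRIES = 8
    N = 2                                  # window size (assumed)
    u_bits = [0] * NUM_ENTRIES
    write_ptr = 0

    def on_read(i):
        u_bits[i] = 1                      # reading an entry marks it as recently used

    def advance_write_ptr():
        """Move the write pointer to the next replaceable entry, clearing the window ahead."""
        global write_ptr
        while True:
            u_bits[(write_ptr + N) % NUM_ENTRIES] = 0   # window pointer clears U ahead
            if u_bits[write_ptr] == 0:
                return write_ptr           # this entry has not been read recently: replaceable
            write_ptr = (write_ptr + 1) % NUM_ENTRIES   # recently read entry: skip it

    on_read(0)
    print(advance_write_ptr())             # -> 1, because entry 0 was read recently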
Back to
The scanner 208 may examine every instruction from the instruction memory 206, extract instruction type, and calculate branch target instruction address. The said branch target address may be calculated as the sum of branch address and the branch offset. The more significant part of the branch target address is matched with the content of Active List 204 to obtain the corresponding block number, which is the first address. The less significant part of branch target address, the offset address within the block, is the second address.
For the End track point, the sum of instruction block address and the instruction block length is the block address of the next sequential instruction block. Then the block address can be matched as a branch target address to obtain its block number, which is stored in the End point.
If the more significant part of the target address is matched in Active List 204, then Active List 204 outputs the corresponding block number to track table 210. If it is not matched, then Active List 204 sends this address to filler 202 via bus 244 to fill the corresponding instruction block into instruction memory, while assigning a block number to this address and outputting this block number to track table 210.
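As an illustrative sketch only, the matching-or-filling flow described above can be modeled in software as below; the block size, the dictionary standing in for Active List 204, and the placeholder fill routine are assumptions for the example.

    # Behavioral sketch: convert a calculated branch target address into a BN
    # (BNX + BNY), filling on a miss.
    BLOCK_SIZE = 8                              # assumed instructions per block

    active_list = {}                            # block address -> BNX
    next_free_bnx = 0

    def fill_from_lower_memory(block_addr):
        pass                                    # stands in for the filler fetching the block

    def target_to_bn(branch_addr, offset):
        """Split the target into block address (more significant) and BNY (less significant)."""
        global next_free_bnx
        target = branch_addr + offset
        block_addr, bny = divmod(target, BLOCK_SIZE)
        if block_addr not in active_list:       # miss: assign a BNX and fill the block
            active_list[block_addr] = next_free_bnx
            next_free_bnx += 1
            fill_from_lower_memory(block_addr)
        return active_list[block_addr], bny     # (BNX, BNY) stored in the track point

    print(target_to_bn(branch_addr=20, offset=5))   # -> (0, 1): block address 3 gets BNX 0, BNY 1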
A new track can be placed into a replaceable row in track table 210. If there is a branch instruction in the instruction block corresponding to the said new track, a branch track point is built in the corresponding track entry. The said branch track point can be located by the address of branch source instruction. For example, the more significant part of branch source address can be mapped into a track number (block number) and index a track; the less significant part (offset) of the source address indexes an entry on the track.
Each track point or track table entry in the track table row may have a format including type field, first address (XADDR) field, and second address (YADDR) field. Other fields may also be included. The type field represents the instruction type of the corresponding instruction. Type field can represent the type of the instruction corresponding to the track point, such as conditional branch, unconditional branch, and other instructions. XADDR field is also known as first dimension address, or first address. YADDR field is also known as second dimension address, or second address.
The content of a new track point can correspond to a branch target address. That is, the branch track point stores the address information of a branch target instruction. For example, the block number of the target track in track table 210 is stored in the said branch track point as first address. The offset address of the branch target instruction is the second address stored in the said branch track point.
The End track point of a track is a special track point. Because the End track point points to the first instruction of the Next block, the End track point is formatted as an unconditional branch type carrying the first address of the sequential next block in program order, but without a second address. Alternatively, a constant ‘0’ can be placed in the second address field.
As shown in
As used herein, the second address stored in the track point of each branch instruction is an offset of the instruction block containing the branch target instruction of the branch instruction.
The described various embodiments above use a direct addressing mode to calculate the branch target address and implement an instruction pre-fetching operation. However, an indirect addressing mode may also be used. In the indirect addressing mode, at the beginning, the register value (e.g., a base register value) is determined, thereby calculating the branch target address. The register value is changed based on the result of instruction execution. Therefore, when a new value is calculated for the base register of an indirect branch but is not yet written into the base register, the new value can be bypassed to perform the target address calculation and subsequent operations.
As shown in
As used herein, when the branch instruction corresponding to track point 366 does not take a branch, the second address of the read pointer in tracker 214 points to track point 370. The content of track point 370 is read out, including the number of interval instructions ‘2’. Thus, when the position value of the instruction currently executed by the processor in the track (i.e., the low address offset of the program counter) is less than or equal to ‘2’, the second address of the read pointer in tracker 214 and the base register value are updated. At this time, the base register value BP1 may be obtained from the processor core 111, enabling the branch target address calculation and the subsequent operations.
As used herein, the base register value may be obtained through a variety of methods, such as an additional read port of the register in the processor core 111, or the time multiplex mode from the register in the processor core 111, or the bypass path in the processor core 111, or an extra register file for data pre-fetching.
The entry representing the instruction pointed to by the second address 396 (block offset, BNY) in a track pointed to by the first address 394 (block number, BNX) in the memory 210 may be read out at any time. A plurality of entries, even all entries representing instruction types in a track indexed by the first address 394 in the memory 210, may be read out at the same time.
On the right of the entry corresponding to the instruction with the largest offset address in each row of the memory 210, an end entry is added to store the address of the next instruction block to be executed in sequence. The instruction type of the end entry is always set to ‘1’. The first address of the instruction information in the end entry is the instruction block number of the next instruction block. The second address (BNY) is always set to zero and points to the first entry of the instruction track. The end entry is defined as an equivalent unconditional branch instruction. When the tracker points to an end entry, an internal control signal is always generated to make multiplexer 388 select the output 380 of the track table (TT) 210; another control signal is also generated to update the value of register 390. The internal signal may be triggered by the special bit in the end entry of TT 210, or when the second address 396 points to the End entry.
In
When the second address points to an entry representing an instruction, the shifter controlled by the second address shifts the plurality of instruction types outputted by the TT 210 to the left. At this moment, the instruction type representing the instruction read out by the TT 210 is shifted to the left-most step bit of the instruction type 399. The shifted instruction type 399 is sent to the leading zero counter to count the number of instructions before the next branch instruction. The output 395 of the leading zero counter 384 is the forward step of the tracker. This step is added to the second address 396 by the adder 386. The result of the addition operation is the next branch instruction address 397.
When the step bit signal of the shifted instruction type 399 is ‘0’, which indicates that the entry of the TT 210 pointed to by the second address 396 is a non-branch instruction, the step bit signal controls the update of the register 390; the multiplexer 388 selects the next branch source address 397 as the second address 396 while the first address 394 remains unchanged, under the control of a ‘0’ TAKEN signal 392. The new first and second addresses point to the next branch instruction in the same track; non-branch instructions before the branch instruction are skipped. The new second address controls the shifter 396 to shift the instruction type 398, and the instruction type representing the branch instruction is placed in step bit 399 for the next operation.
When the step bit signal of the shifted instruction type 399 is ‘1’, it indicates that the entry in the TT 210 pointed to by the second address represents branch instruction. The step bit signal does not affect the update of the register 390, while BRANCH signal 393 from the processor core controls the update of the register 390. The output 397 of the adder is the next branch instruction address of the current branch instruction in the same track, while the output 380 of memory is the target address of the current branch instruction.
When the BRANCH signal is ‘1’, the output 391 of the multiplexer 388 updates the register 390. If TAKEN signal 392 from the processor core is ‘0’, it indicates that the processor core has determined to execute operations in sequence at this branch point. The multiplexer 388 selects the source address 397 of the next branch. The first address 394 outputted by the register 390 remains unchanged, and the next branch source address 397 becomes the new second address 396. The new first address and the new second address point to the next branch instruction in the same track. The new second address controls the shifter 396 to shift the instruction type 398, and the instruction type representing the branch instruction bit is placed in step bit 399 for the next operation.
If the TAKEN signal 392 from the processor core is ‘1’, it indicates that the processor core has determined to jump to the branch target at this branch point. The multiplexer selects the branch target address 380 read out from the TT 210 to become the first address 394 and the second address 396 outputted by the register 390. In this case, the BRANCH signal 393 controls the register 390 to respectively latch the first address and the second address as the new first address and the new second address. The new first address and the new second address may point to a branch target that is not in the same track. The new second address controls the shifter 396 to shift the instruction type 398, and the instruction type bit representing the branch instruction is placed in step bit 399 for the next operation.
When the second address points to the end entry of the track table (the next line entry), as previously described, the internal control signal controls the multiplexer 388 to select the output 380 of the TT 210 and update the register 390. In this case, the new first address 394 is the first address of the next track recorded in the end entry of the TT 210, and the second address is zero. The second address controls the shifter 396 to shift the instruction type 398 by zero bits to start the next operation. The operation is performed repeatedly; therefore, the tracker 214 may work together with the track table 210 to skip non-branch instructions in the track table and always point to a branch instruction.
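The shift-and-count operation of the tracker can be illustrated with the following behavioral sketch, assuming the instruction type bits of one track are available as a list (1 for branch or End entries, 0 otherwise); the function and variable names are assumptions made for the example.

    # Behavioral sketch: shift the type bits by the current BNY, count leading
    # zeros (non-branch run length), and add the count to BNY to reach the next
    # branch point on the same track.
    def next_branch_bny(type_bits, bny):
        """type_bits is a list of 0/1 per track point, from low BNY to high BNY."""
        shifted = type_bits[bny:]              # shifter aligns the current entry leftmost
        step = 0
        for bit in shifted:                    # leading-zero count = forward step
            if bit == 1:
                break
            step += 1
        return bny + step                      # adder: current BNY + step

    # Track with branch points at BNY 3 and 6, End point at BNY 7 (type always '1').
    type_bits = [0, 0, 0, 1, 0, 0, 1, 1]
    print(next_branch_bny(type_bits, 0))       # -> 3: skips the three non-branch entries
    print(next_branch_bny(type_bits, 4))       # -> 6: the following branch point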
As used herein, Active List 204 needs replacement when it is full and a new block address/block number pair is created. A correlation table, which records the status of each block as a target of a branch, is employed to prevent a track table entry from branching to a block that has already been replaced. Only blocks in instruction memory that are not branch targets, together with their corresponding Active List entries, are candidates for replacement.
As used herein, Token controls the instruction issue. In
As used herein, the Token is passed to the token passer of the branch target and stops passing to the next instruction of the branch instruction, when the branch instruction has successfully taken branch. The token passer in
When executing the last instruction in an instruction block, the Token is passed to the token passer corresponding to the first instruction in the sequential next instruction block. This requires a mechanism to indicate the last instruction in an instruction block; it also needs a decoder to decode the Next block instruction address provided by the tracker to designate this block. Thus, the token passers of the first instruction and of the last instruction are modified accordingly.
In
Control unit 401 stores the corresponding block numbers of the instruction blocks in IRB 107. In this embodiment, each IRB block of the IRB stores one instruction block of memory 206. Control unit 401 matches the first address (BNX) of the received branch source BN, branch target BN, and End Point with its content. Because the Current instruction block is already in IRB 107, the branch source BNX is matched, and the IRB block corresponding to the matched entry holds the Current instruction block. If the matching of the target BNX or the Next block BNX is successful, then the corresponding instruction blocks are already in IRB. An unmatched BNX is sent to memory 206 to fetch the needed instruction block to fill a replaceable block in IRB 107. The replaceable block is determined in a similar manner as the replacement of Active List 204.
Further, the second address (BNY) in the branch source or branch target BN is used to index the corresponding branch instruction or branch target instruction from IRB 107.
The Next instruction block comparator 507 compares the Next instruction block BNX on bus 235 with the content of 505. If matched, the matched output of 507 points to the first instruction in 501 (top instruction in
Branch source Comparator 509 compares BNX on bus 231 with the content of 505. If matched, the matched output of 509 enables branch source address decoder 513 to decode the BNY address on bus 231. Output 523 of decoder 513 points at one of the instructions in 501: the branch source instruction. If not matched, output of 509 disables source decoder 513, so all word line outputs 523 of decoder 513 are ‘0’, not enabling read ports of any instructions.
Branch target comparator 511 compares the BNX on bus 233 with the content of 505. If matched, the matched output of 511 enables branch target address decoder 515 to decode the BNY address on bus 233. Output 525 of decoder 515 points to one of the instructions in 501, the branch target instruction. If not matched, the output of 511 disables branch target decoder 515, so all word line outputs 525 of decoder 515 are ‘0’, not enabling the read ports of any instructions.
Back to
Read pointer 231 of Tracker 214 moves and stops at the first branch point after the track point corresponding to the instruction currently being executed as previously described. As used herein, the branch source and branch target addresses are sent to control unit 401 and compared as described in
Thus, locations of branch source, branch target, and first instruction of the next sequential block are found through matching in control unit 401.
As used herein, the clock received by control unit 401 depends on the system clock and the pipeline status of processor core 111. Control unit 401 receives a valid clock when Core 111 needs an instruction. Control unit 401 receives no clock signal when Core 111 does not need new instructions, for example, during a pipeline stall. Token passers are included in control unit 401, and each passer corresponds to an instruction. The passers pass an active Token signal, which denotes the instruction the CPU needs. Control unit 401 updates the token passers on every valid clock cycle, and passes the Token to the token passer corresponding to the instruction Core 111 needs next. Thus, the control unit controls IRB 107 to output the right instruction to Core 111 based on the Token signal.
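A minimal software sketch of the Token mechanism described above is given below, under the assumption of one token passer per IRB entry and a single active Token that advances only on a valid clock; the class and method names are illustrative, not part of the disclosure.

    # Behavioral sketch: the Token marks the instruction the IRB will output, and
    # a "valid clock" moves it either to the next passer or to a branch target.
    class TokenPassers:
        def __init__(self, num_instructions):
            self.token = [False] * num_instructions
            self.token[0] = True                    # Token starts at the first instruction

        def current_instruction(self):
            return self.token.index(True)           # IRB outputs this instruction

        def valid_clock(self, target=None):
            """Pass the Token: to 'target' on a taken branch, else to the next passer."""
            pos = self.current_instruction()
            self.token[pos] = False
            nxt = target if target is not None else pos + 1
            self.token[nxt % len(self.token)] = True

    tp = TokenPassers(8)
    tp.valid_clock()                  # core consumed instruction 0, Token moves to 1
    tp.valid_clock(target=5)          # taken branch: Token inserted at the target passer
    print(tp.current_instruction())   # -> 5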
This embodiment is only an example of how control unit 401 takes the initiative in serving instructions to Core 111 based on its needs. Other handshake signals or communication protocols that allow control unit 401 to take the initiative in sending needed instructions to processor core 111 are also under the protection of this disclosure.
More particularly, based on the depth of instruction look-ahead, a plurality of tracks can be established at the same time to fill more instruction segments to cover the response time when fetching instructions from the lower level memory.
The multi-pointer addressing device 7001 may include incrementers 5003 and 7005, pointer registers 5005, 5007, 5009, and 5011, multiplexer 7015, and branch decision logic 5015. The pointer registers 5005, 5007, 5009, and 5011 are used to store four branch instructions corresponding to the second level branch points of the current instruction being executed.
The incrementers 5003 and 7005 may perform the increment-by-one operation on one set of the pointer registers from the two sets of pointer registers (i.e., 5005 and 5007, 5009 and 5011) to increase the second address (BNY) by one to reach the next branch point in the same track. Further, multiplexers 7015 may respectively select one pointer from each pointer register pair 5005 and 5007, and 5009 and 5011 for addressing the track table 7126. The branch decision logic 5015 may process or decode the branch taken signal from the processor core to generate simultaneous write-enable signals for the four pointer registers and select signals for the multiplexers 7013 and 7015.
Further, when the bus 7009 carries the BN of the target track point read out from the track table 210, the multiplexer 5025 or 7017 selects the input from the bus 7009, and the BN is directly stored in the pointer register 5011. If the bus 5021 does not carry BN of the target track point read out from the track table 210, the active list may be matched, filled, and the corresponding BN may be outputted to the selectors 5025 and 7017 via bus 7011 and to be stored in the pointer register 5011.
This way, the BNs of the two layers of subsequent instruction blocks (a total of four instruction blocks including the sequential execution block and branch target block) pointed to by the read pointer of tracker 214 are stored in pointer registers 5005, 5007, 5009, and 5011. The BNXs outputted by this pointer register may be sent in turn through bus 5021 to the control unit of IRB 107 for matching. If successfully matched, that means the instruction block corresponding to the BNX is already stored in IRB 107. If not matched, the BNX is sent to instruction memory 206 through bus 7013 to fetch the corresponding instruction block and fill in a storage block designated by the replacement algorithm in IRB 107. Thus, IRB 107 contains the instruction block of the branch point pointed to by tracker 214 read pointer (such as the instruction block corresponding to track W in
The above embodiment only describes the prefetching of instruction blocks corresponding to two layers of branch points. However, those skilled in the art shall be able to add similar parts, apparatuses, or software to expand this method to the prefetching of instruction blocks corresponding to more layers based on this disclosure and embodiment. Those are also within the scope of this invention.
The organization of IRB can consist of the Current instruction block, the Next instruction block, and the branch target instruction block. Each of those blocks is in a fixed location; therefore, the successfully taken branch target instruction block is copied into the Current instruction block location, as the branch target before a successful branch is now the Current instruction block. A new branch target instruction block is then written into the branch target instruction block location. For the same reason, the Next instruction block is copied into the Current instruction block location when the Next instruction block starts executing, and a new Next instruction block is filled into the Next instruction block location.
The organization of IRB can also consist of multiple instruction blocks. Using the decoders in control unit 401, the Current instruction block is determined based on the branch source address 231, and the Next instruction block is determined based on the Next instruction block address 235. The replacement of the instruction blocks may be through the same method as Active List replacement or through LRU.
As used herein, with the passing of the token signal, IRB 107 provides corresponding instructions to processor core 111 for execution. The read pointer 231 of tracker 214 looks ahead and stops at the next branch point after the current instruction. Then, it sends the BNX of this track point to the multiple branch source comparators 509 in control unit 401 through bus 231. The result of the comparators indicates that the current instruction block is the instruction block where instructions 701 and 703 are located. Then, it sends the BNY on bus 231 to branch source decoder 513; the result of the decoder indicates that this branch instruction is stored in register 701. The decoded word line 751 controls the pass-gate in token passer 711 to drive token bus 721. At the same time, word line 751 also blocks the input path of the next stage token passer.
The branch target in track table 210 pointed to by the read pointer 231 of tracker 214 is read out, and is sent to the multiple branch target comparators 511 in control unit 401 through bus 233. The result of the comparators indicates that the branch target instruction block is the instruction block where instructions 705 and 707 are located. Then the BNY on bus 233 is sent to the branch target decoder 515 of that instruction block; the result of the decoder indicates that this branch target instruction is stored in register 707. The decoded word line 767 controls the token passer to receive the Token through bus 721.
When Token is passed to token passer 711, the Token controls the branch instruction stored in memory 701 to be sent to CPU 111 for execution through instruction bus 431. At the same time, the token is put on token bus 721 through the pass-gate controlled by word line 751. Of all the token passers connected to 721, only 717, under the control of 767, is able to receive input. At this time, the CPU core 111 decodes that the received instruction is a branch instruction, and controls the clock of control unit 401 to pause the token passing.
If the execution result of the branch instruction is taken, the CPU core resumes giving the clock to the control unit, and the Token is passed into token passer 717. The branch target instruction in register 707 is sent to CPU core 111 under the control of the Token through instruction bus 431. At the same time, the read pointer 231 of tracker 214 also points to the next branch point of the track in track table 210 corresponding to register 707. If the branch target instruction stored in register 707 is not a branch instruction, the Token may pass from token passer 717 to the next token passer in the next clock cycle.
If the execution result of the branch instruction is not taken, the CPU core uses a disable signal to control the decoders 513 and 515 in control unit 401 to output ‘0’. At this time, the pass-gate in token passer 711 does not drive token bus 721, and the input circuit of token passer 713 enables Token passing. Then CPU core 111 resumes giving the clock to control unit 401, and the Token is passed from token passer 711 to token passer 713. Instruction 703, the next instruction after branch source instruction 701, is sent to CPU core 111 under the control of the Token.
Instruction 703 is the last instruction of an instruction block, thus token passer 713 automatically puts the Token on token bus 721. If instruction 703 is a branch instruction, the pointer 231 of the tracker stops at this branch point; for the detailed process, refer to the above embodiments. If instruction 703 is not a branch instruction, the pointer 231 of the tracker does not stop at this point. According to the next-track information at the end of the current track, the pointer points to the first branch point in the next track. In this situation, the tracker issues an enable signal to index the next instruction block. Under the control of this signal, the next instruction block address is sent to the multiple next instruction block comparators 507 in control unit 401 through bus 235. The result of the comparators indicates that the next instruction block is the block where instructions 705 and 707 are located. It controls the input of token passer 715 to receive the Token on token bus 721. Then the Token is passed to token passer 715, and instruction 705 is sent to CPU core 111 through instruction bus 431 in the next clock cycle.
To improve CPU performance, it is not necessary to wait for the execution result of a branch instruction; rather, branch prediction may be used to provide either the fall-through instruction or the branch target instruction to CPU core 111 for speculative execution while the branch decision is not yet generated. If the speculation is incorrect, the execution results or the intermediate results of the incorrectly predicted instructions are cleared, and then the correct instructions are provided to CPU core 111 for execution.
Static branch prediction predicts whether the branch is taken or not taken according to the characteristics of branch instructions (such as jumping forward or backward). Regarding a type of embodiment of the static prediction herein, please refer to
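One common static heuristic is “backward taken, forward not taken”; it is shown below only as an illustrative example of predicting from branch characteristics, and the disclosure does not mandate this particular rule.

    # Illustrative static prediction by branch direction (an assumption, not the
    # disclosed predictor): backward branches (typical loops) are predicted taken.
    def static_predict(branch_addr, target_addr):
        """Predict taken for backward branches, not taken for forward branches."""
        return target_addr <= branch_addr

    print(static_predict(branch_addr=100, target_addr=40))    # backward -> predicted taken
    print(static_predict(branch_addr=100, target_addr=140))   # forward  -> predicted not taken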
In
Back to
Unlike the previous example, in this embodiment the branch is predicted as not taken, so control unit 401 does not match the branch source BN and thereby does not prevent token passer 711, which corresponds to the branch source address, from passing the Token signal to the next token passer 713. Thus, when the branch instruction has been provided to processor core 111 to execute but the result is not yet known, the succeeding instruction of the branch instruction may be provided to processor core 111 for continued execution, to fulfill the not-taken static branch prediction. As described in the previous example, when the last instruction of the Current instruction block (instruction 703 here) is provided to processor core 111 to execute, the Token is passed to the first token passer (715 here) of the Next instruction block, to ensure providing instructions to processor core 111 continuously.
If the branch decision of the branch instruction executed by processor core 111 is not taken, the prediction is correct; read pointer 231 of tracker 214 then moves ahead and stops at the next branch point, and the token passers in IRB 107 continue passing the Token step by step, providing instructions to processor core 111 for execution.
If the branch decision of the branch instruction executed by processor core 111 is taken, the prediction is incorrect; the ALL signal is received by the source decoder of the instruction block where the Token is, so all outputs of that source decoder are ‘1’. The Token is then blocked from passing, no matter which instruction it is on, and is passed onto the global bus 721. At the same time, the target decoder corresponding to the branch target instruction block decodes the branch target BNY and controls the token passer corresponding to the branch target instruction to receive the Token from bus 721. Thus, the Token is passed to the token passer corresponding to the branch target instruction, which therefore outputs the branch target instruction to processor core 111 for execution. Processor core 111 clears the execution results or intermediate results of the wrong instructions based on prior art technology.
Now, the prediction of branch taken is used as an example. As shown in the above embodiments, along with the Token signal passing, IRB 107 provides the corresponding instructions to CPU core 111 for execution. The read pointer 231 of tracker 214 looks ahead and stops at the next branch point after the current instruction. It sends the branch source address and branch target address to control unit 401 to compare with the addresses stored in the control unit. Control unit 401 matches the branch target BN and finds the branch target instruction in IRB 107 based on the corresponding target decoder output.
Control unit 401 matches the branch source BNX, and the source address decoder in the matched entry decodes the branch source BNY; the decoded result prevents the token passer corresponding to the branch source instruction from passing the Token to the next token passer, and instead passes the Token signal to global bus 721.
At the same time, control unit 401 matches the branch target BNX, and the matched target decoder decodes the branch target BNY, controlling the token passer corresponding to the branch target instruction to receive the Token on global bus 721. Then, the Token is passed to the succeeding token passers after the branch target instruction, and the corresponding instructions are outputted, to fulfill the taken static branch prediction. In the same way, if the last instruction of the instruction block of the branch target instruction is provided to processor core 111 to execute, then the Token is passed to the Next instruction block to ensure providing instructions to processor core 111 continuously.
When the execution result of the said branch instruction is branch taken, the prediction is correct. The read pointer 231 of tracker 214 is updated with the value of the said branch target BN; that is, it moves to the track of the branch target track point, uses this track as the Current track, and continues moving until it stops on the next branch point of that track. The token passers in IRB 107 continue passing the Token step by step, providing instructions to processor core 111 for execution.
If the branch decision of the branch instruction executed by processor core 111 is not taken, the prediction is incorrect; the ALL signal received by all of the source decoders is valid, so all outputs of the source decoders are ‘1’. The Token is then blocked from passing, no matter which instruction block it is in, and is passed onto the global bus 721. At this time, only token passer 717, under the control of branch target decoder word line 767, may receive the Token on token bus 721. Therefore, the Token is passed into token passer 717, and instruction 707 is issued to processor core 111 through bus 431 under the control of the Token. Processor core 111 clears the execution results or intermediate results of the wrong instructions.
No matter whether the branch prediction is correct or not, the read pointer of the tracker stops at the predicted branch point until the branch execution result is generated, and then moves to the next branch point. Thus, IRB 107 provides the correct instructions to CPU core 111.
As used herein, along with the Token signal passing, IRB 107 provides instructions to CPU core 111. Read pointer 231 of tracker 214 moves ahead and stops on the next branch point, and reads out the branch target BN 233 and the corresponding branch prediction information 811. Control unit 801 matches the branch target BN, decodes it with the corresponding target decoder, and finds the branch target instruction in IRB 107.
When the prediction information is not taken (‘0’), the said prediction signal is inverted to ‘1’ through inverter 809 and this signal enables AND gate 807. The other input port of AND gate 807 comes from the content of the end point 235 of the current track. The AND gate 807 sends the Next instruction block number stored in the end point to control unit 801 to match and find the corresponding Next instruction block in IRB 107.
The AND gates 803 and 805 receive the “not taken” (‘0’) prediction information and block the passing of the branch source address 231 and branch target address 233, making all of the outputs of the source decoders and target decoders in control unit 801 ‘0’. Thus, the token passer corresponding to the branch source instruction passes the Token signal to its next token passer, so the succeeding instructions of the branch source instruction are provided to CPU core 111 to continue execution, thereby implementing the effect of a ‘not taken’ dynamic prediction. When the last instruction of the Current instruction block is provided to CPU core 111, the Token signal is passed to the Next instruction block, which continues to provide instructions to the CPU core.
At this time, if the branch decision by processor core 111 for the said branch instruction is not taken, the branch prediction is correct; the read pointer 231 of tracker 214 moves and stops at the next branch point, and the token passers in IRB 107 continue passing the Token step by step, providing instructions to processor core 111 for execution.
If the branch decision of the branch instruction executed by processor core 111 is taken, the prediction is incorrect; the ALL signal received by all of the source decoders is valid, so all outputs of the source decoders are ‘1’. The Token is then blocked from passing, no matter which instruction block it is in, and is passed onto the global bus 721. At this time, AND gate 805 is enabled, and the BNY of the branch target address 233 is sent to the BNY decoder of the branch target instruction block in control unit 801, which outputs the corresponding word line and controls the token passer corresponding to the branch target instruction to receive the Token from global bus 721. Thus, the Token is passed to the token passer corresponding to the branch target instruction, which outputs the branch target instruction to processor core 111 for execution. Processor core 111 clears the execution results or intermediate results of the wrong instructions.
On the other hand, when the prediction information states the branch is taken (‘1’), the said taken signal is inverted to ‘0’ through inverter 809, which disables AND gate 807. Thus the Next instruction block number stored in the End point is not sent to control unit 801, which prevents the Token signal from being passed to the token passer corresponding to the first instruction of the Next block of the instruction block containing the branch source instruction, while the Token signal is passed to the token passer corresponding to the branch target instruction.
At the same time, AND gates 803 and 805 receive the prediction taken (‘1’) signal and separately pass valid BNX and BNY signals to the source decoder and target decoder. At this time, the branch source BNX is matched by control unit 801; according to the branch source BNY, the source decoder corresponding to the matched BNX outputs a signal that prevents the Token signal from passing to the next token passer and instead passes the Token signal to global bus 721.
The branch target BNX is matched against each of the BNXs stored in control unit 801, and the target decoder corresponding to the matched BNX decodes the branch target BNY and outputs the decoded result. The result controls the Token passer corresponding to the branch target instruction to receive the Token signal from global bus 721. Thus, the Token signal is sent to the Token passer corresponding to the branch target instruction, and the branch target instruction is provided to CPU core 111 for execution. Hereafter, the Token signal is passed to the token passers corresponding to the instructions following the branch target instruction, which output the corresponding instructions in sequence. The effect is equivalent to a static branch prediction.
AND gate 807 resumes receiving and passing the End track point content 235 (the Next instruction block address) when the Token signal is sent to the token passer corresponding to the branch target instruction. The content of the End point of the branch target's instruction block is sent to control unit 801 for matching to find the corresponding Next instruction block in IRB 107. This way, once the last instruction of the branch target block is provided to CPU core 111 to execute, the Token signal is sent to the Next instruction block, which may continue to provide instructions to the CPU.
When CPU core 111 executes the said branch instruction and the branch is taken, the prediction is correct. The read pointer 231 of tracker 214 is updated to the value of the branch target BN, and the pointer moves to the new track on which the said branch target track point is located, stopping at the next branch point of this now current track. The token passers in IRB 107 pass the Token signal step by step and continue to provide instructions to CPU core 111.
When the execution result of the said branch instruction in CPU core 111 is branch not taken, the prediction is incorrect. The source decoder corresponding to the instruction block where the Token signal is located is enabled, and all of its outputs are ‘1’. At this time, regardless of which token passer holds the Token signal, the Token signal is blocked from passing and is put onto the global bus 721. The branch source address 231 is incremented by ‘1’ and put onto the target address bus 233; the address is decoded in control unit 801 and generates a target word line. This word line, which corresponds to the token passer of the next instruction after the branch source instruction, controls that token passer to receive the Token signal from global bus 721. Thus, the Token signal is sent to the token passer corresponding to the next instruction after the branch source instruction, and that instruction is provided to CPU core 111. The instructions after the branch instruction already in the pipeline, and their intermediate results, must be cleared.
As used herein, the said token passer may be improved, so certain instructions are not to be issued out through instruction bus 431 to implement instruction folding, such as branch folding.
The token passer in
The pre-processor 990 performs simple decoding on the instructions that flow out ahead of time. If it finds that instruction 965 is a branch instruction, it sets instruction 965 as skippable. To do so, it sends a clock signal to update the flag registers in all token passers, such as those of token passers 981, 983, 985 and 987. Herein, only the flag register of token passer 985 latches token state 971 as a ‘valid’ flag, which indicates that instruction 965 in the instruction storage may be skipped. Because the Token is not at the token state bits 973, 975 and 977, their corresponding flag registers 991, 993 and 997 latch the ‘invalid’ flag. When the Token signal is sent to token passer 983, the token state 973 is ‘1’, and the pass-gate in token passer 983 sends the Token signal to bus 999 under the control of flag 995. Based on the pre-decode result of the branch stored in the pre-processor, the branch prediction mechanism decides the direction of Token movement. If the branch prediction is taken, the Token on token bus 999 is inserted into the token passer designated by the branch target decoder, and all registers in the instruction block where token passer 983 is located are reset to ‘0’. Thus, the Token is sent to the branch target instruction. If the branch prediction is not taken, the Token signal is not inserted at the branch target, and the token passers in the instruction block which contains 983 are not set to ‘0’. In this situation, the two multiplexers in token passer 985, under the control of flag 995, send the Token to token passer 987, and ‘0’ is inserted into token passer 985. Thus, the Token skips branch instruction 965. Regardless of whether the branch prediction is correct, the branch instruction is not executed, so that it does not take up execution time.
The pre-processor 990 performs simple decoding on the instructions that flow out ahead of time. If it finds that instruction 965 is a load/store instruction, it sets instruction 965 as skippable. To do so, it sends a clock signal to update the flag registers in all token passers, such as those of token passers 981, 983, 985 and 987. Herein, only the flag register of token passer 985 latches token state 971 as a ‘valid’ flag, which indicates that instruction 965 in the instruction storage may be skipped. Because the Token is not at the token state bits 973, 975 and 977, their corresponding flag registers 991, 993 and 997 latch the ‘invalid’ flag. When the Token signal is sent to token passer 983, the token state 973 is ‘1’, and the pass-gate in token passer 983 sends the Token signal to bus 999 under the control of flag 995. Based on the pre-decode result of the load/store stored in the pre-processor, the pre-processor ignores the Token on bus 999. In this situation, the two multiplexers in token passer 985, under the control of flag 995, send the Token to token passer 987, and ‘0’ is inserted into token passer 985. Thus, the Token skips load/store instruction 965, so that it does not take up execution time.
Instruction folding may also be performed by track table, tracker together with token passer in
In addition, the function of repeatedly providing the same instruction to the CPU core may be implemented. Specifically, the clock signals to all token signal registers 601 in the token passers may be shut off, pausing the passing of Token signals. This way, IRB outputs the instruction corresponding to the current Token signal in every clock cycle, repeatedly providing the same instruction to the CPU core.
The branch source pointer 231 of tracker 214 points into track table 1010 and reads out the branch target address 233. Herein, the target BNX 1043 portion is sent to the branch target comparators (such as comparators 1031 and 1033) in control unit 1001 and compared with the BNX address of each instruction block. The corresponding branch target decoder (1021 or 1023) is enabled if the target BNX matches one of the BNXs stored in the registers. The enabled decoder receives the BNY of the branch target address and inserts the Token signal into the token passer corresponding to the branch target instruction. In control unit 1001, once the branch is taken, the whole set of token passers is reset to ‘0’, clearing the Token signal in the branch source instruction block, and the target decoder generates a Token signal and inserts it into the token passer corresponding to the branch target instruction. In this embodiment, the input of the token signal register of the first token passer of each instruction block comes from an AND gate. One input of the AND gate comes from the global bus 1021; the other input comes from the output of the next instruction block BNX comparator.
The tracker 214 also reads out the next instruction block address 235 through branch source pointer 231 from track table 1010 and then sends the address to each next instruction block comparator (such as comparator 1033 and 1035) in control unit 1001 and compares with each instruction block BNX (such as the BNX stored in registers 1025 and 1027). The matched result is sent to AND gate (such as 1019) of the first token passer of the corresponding instruction. The global bus 1021 in this embodiment replaces the global bus 721 in
Further, the output of OR gate 1007 is sent to the AND gate (such as AND gate 1019) that corresponds to each instruction block. The other input of the said AND gate couples with the output of the Next comparator (such as comparators 1033 and 1035), which is used to determine the next instruction block. Its output is sent to the first token passer (such as token passer 1015) of an instruction block in IRB 107. The Next block BNX is read out from the End track point of the current track in track table 1010 and is sent to the next BNX comparators in control unit 1001 through bus 235, and this BNX is compared with the BNX of the corresponding instruction block. Here, the instruction block of instructions 705 and 707 is the next instruction block, so only the result of next BNX comparator 1035 is ‘1’; the results of the other next BNX comparators are ‘0’. Thus AND gate 1019 outputs a ‘1’, and this value is written into token signal register 1009. The values of the token signal registers in the other token passers are ‘0’, thus the Token signal may pass to the token passer that corresponds to the first instruction of the next instruction block pointed to by the End point in the track table. It outputs the correct instruction to the CPU core for execution, and the Token is passed from the current instruction block to the next instruction block.
On the other hand, when the branch instruction is taken, it needs to pass a token from the token passer that corresponds to the current instruction to the token passer that corresponds to the branch target instruction. Let's assume that the token passer 1017 in
In addition, it may adopt static branch prediction in
If the branch predictor predicts taken, the IRB 107 resets the token passers to ‘0’ without waiting for the execution result of the branch instruction, and inserts a Token in the token passer of the branch target to provide the branch target instruction to processor core 111 as described before. The branch target token passer is designated through decoding the branch target address provided by the tracker. A mechanism is needed to designate the position of the branch source, for example, by instruction decode (e.g., the decoding of the normal instructions sent to the processor core, or the decoding of instructions outputted from IRB 107 ahead of time shown in
As used herein, the processor pipeline may be partitioned into a front-end pipeline and a back-end pipeline at the location of the TAKEN signal. A duplicated front-end pipeline may be added to the CPU core so that the IRB may provide both the fall-through instruction and the branch target instruction to the CPU core after a branch instruction. The two front-end pipelines in the CPU core execute the instructions after the branch instruction; when the TAKEN signal 1037 is generated, it selects one of the two front-end pipeline execution results to be further executed by the back-end pipeline. This ensures penalty-less branching no matter whether the branch is taken or not.
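Purely as an illustrative sketch of the selection step, the following assumes each front-end pipeline produces a list of speculative results and the TAKEN signal picks one of them for the back-end pipeline; all names are assumptions and the hardware details are omitted.

    # Behavioral sketch: both paths after the branch are processed speculatively,
    # and the resolved TAKEN signal selects which front-end result continues.
    def resolve(front_end_fallthrough, front_end_target, taken):
        """Select one speculative front-end result for the back-end pipeline."""
        chosen = front_end_target if taken else front_end_fallthrough
        discarded = front_end_fallthrough if taken else front_end_target
        discarded.clear()                      # squash the mispredicted-path work
        return chosen

    back_end_input = resolve(['i1', 'i2'], ['t1', 't2'], taken=True)
    print(back_end_input)                      # -> ['t1', 't2']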
The difference between
In this embodiment, the Token signal is passed over the global bus when the two instructions outputted are not in the same instruction block, that is, when the current instruction is located in the current instruction block but the instruction after the next instruction is located in the Next instruction block. Specifically, the last two token passers of an instruction block may each output the value of its token signal register and send the value to OR gates 1057 and 1059 through buses 1053 and 1055. When the Token signal is at the token passer before the last token passer of the current instruction block, IRB outputs the corresponding instruction; the Token signal is also sent to OR gate 1057 through bus 1053, and the output of OR gate 1057 is sent to AND gate 1065 through global bus 1061. Here, it is assumed that the token passer coupled to AND gate 1065 belongs to the Next instruction block. The output of AND gate 1065 is ‘1’ because the other input of AND gate 1065 couples with the output of the Next BNX comparator, whose output is ‘1’. Therefore, the first instruction of the said Next instruction block may be outputted at the same time as the instruction before the last instruction of the current instruction block. On the other hand, when the Token signal is at the last token passer of the current instruction block, IRB outputs the corresponding instruction; the Token signal is also sent to OR gate 1059 through bus 1055, and the result of OR gate 1059 is sent to AND gate 1067 through global bus 1063. The output of AND gate 1067 is ‘1’ because the other input port of AND gate 1067 couples with the output of the Next BNX comparator, which is ‘1’; thus the second instruction of the said Next instruction block may be outputted at the same time as the last instruction of the current instruction block.
For the token passers on the right in this figure, the detailed process may refer to the above description, and will not be repeated here.
In addition, according to the TAKEN signal 1098, the toggle counter 1081 is used to keep track of which front-end pipeline corresponds to the current instruction block and which one corresponds to the target instruction block. Specifically, assume the left front-end pipeline and the left token passers correspond to the current instruction block, that is, the current Token is passing through the left token passers and the output 1083 of toggle counter 1081 is ‘1’; therefore, it disables the AND gate 1085 that corresponds to the left target decoder and enables the AND gate 1087 that corresponds to the right target decoder. The multiplexer 1084 selects the intermediate result of the left front-end pipeline under the control of signal 1083 and sends the result to back-end pipeline 1086 for execution. According to the branch target address 233 of tracker 214, a target Token is inserted into the right token passers at the branch target instruction, which are controlled to send the target instruction and its following instructions to CPU core 1051. These instructions are executed in the right front-end pipeline until the branch target instruction reaches the last stage of pipeline 1082. The CPU core 1051 then pauses the clock of the right token passers, which stops outputting more instructions from bus 1092, and waits for the result of the branch decision. When the branch instruction on the left instruction bus 1090 is executed, the corresponding output 1083 is generated according to the result of the branch decision to control the whole system. If the branch is not taken, that is, the TAKEN signal is ‘0’, the output 1083 of toggle counter 1081 remains ‘1’; it keeps selecting the execution result of left front-end pipeline 1080 and sends the result to back-end pipeline 1086, and the left front-end pipeline still corresponds to the current instruction block. According to the track address of tracker 214, the right token passers output the next branch target instruction and its subsequent instructions to the CPU core. If the branch is taken, that is, the TAKEN signal is ‘1’, the output 1083 of toggle counter 1081 becomes ‘0’. It controls the multiplexer 1084 to select the execution result of right front-end pipeline 1082 and sends the result to back-end pipeline 1086. At this time, the CPU core resumes providing the clock to the right token passers, the target Token becomes the current Token, and the instructions from the right instruction bus 1092 become the current execution instructions provided to right front-end pipeline 1082. The signal 1083 resets the enable signal of the right branch target decoder to ‘0’ through AND gate 1087 so that the right target decoder does not generate a Token signal. The signal 1083 enables the left branch target decoder through AND gate 1085, which inserts the Token into the left token passer group. The instructions from the left instruction bus 1090 are sent to left front-end pipeline 1080 until the branch target instruction reaches the last stage of pipeline 1080. At this time, the CPU core holds the clock of the left token passer group; it stops passing the branch target Token and outputting instructions, and waits for the result of the branch decision. At this time, the instruction bus 1090 of the left token passers and the left front-end pipeline 1080 correspond to the branch target instruction, and the instruction bus 1092 of the right token passers and the right front-end pipeline 1082 correspond to the current instruction.
Each taken branch sets the TAKEN signal 1098 to ‘1’ and triggers the toggle counter 1081, so the module responsible for the current instruction and the module responsible for the branch target instruction are exchanged with each other.
Similarly, when the right front-end pipeline and token passer correspond to the current instruction block, the detailed process may refer to the situation that the left front-end pipeline and token passer correspond to the current instruction block, which is not repeated herein.
Regardless of whether the branch is taken or not, the CPU core may receive instructions from the IRB and execute these instructions continuously, thus the performance loss caused by branch instructions may be eliminated.
The token signal may control the simultaneous output of four sequential instructions. For example, the token signal 1444 stored in register 1443 may control the output of instruction 1431 through bus 1461, instruction 1433 through bus 1463, instruction 1435 through bus 1465, and instruction 1437 through bus 1467 in the same clock cycle. When token signal 1444 is passed to the next token passer, the token signal 1448 may control the output of instruction 1433 through bus 1461, instruction 1435 through bus 1463, instruction 1437 through bus 1465, and instruction 1439 through bus 1467 in the same clock cycle. The token signal may also be passed from token passer 1431 to token passer 1439, so that instruction 1439 and the three instructions following it may be outputted in the same clock cycle. Token passing is implemented by the four-input multiplexer in each token passer. For example, token signal 1444 couples to input A (the left most) of the four-input multiplexer in token passer 1433, to input B (the second input from the left) of the four-input multiplexer in token passer 1435, to input C (the third input from the left) of the four-input multiplexer in token passer 1437, and to input D (the right most input) of the four-input multiplexer in token passer 1439. All of the four-input multiplexers are controlled by the Dependency Check Unit. If the multiplexers select input A, the Token is sent to the next instruction; if the multiplexers select input B, the Token is sent to the instruction after the next instruction; if the multiplexers select input C, the Token is sent to the third instruction after the current instruction; if the multiplexers select input D, the Token is sent to the fourth instruction after the current instruction.
The Dependency Check Unit checks the Read After Write (RAW) hazards of the four instructions outputted in parallel from the IRB. If a source (operand) register address of an instruction is the same as the destination register address of a prior instruction, a RAW dependency occurs, so these two instructions may not be executed at the same time. The Dependency Check Unit also checks the dependency between a branch instruction and an instruction which may affect the branch condition; that is, an instruction that may affect the branch decision condition and the branch instruction may not be issued (outputted from the IRB) at the same time.
The four instructions read out from the IRB are in sequence from left to right. The instruction outputted from bus 1461 is the first instruction; the instruction outputted from bus 1463 is the next instruction of the first instruction and is called the second instruction; the instruction outputted from bus 1465 is the next instruction of the second instruction and is called the third instruction; the instruction outputted from bus 1467 is the next instruction of the third instruction and is called the fourth instruction. In this embodiment, an example in which each instruction uses at most two source registers and one destination register is used for illustration; other situations may be deduced from this one. In the checking process, the destination register addresses (1481, 1482 and 1485) extracted from the first three instructions are compared with the source register addresses (1483, 1484, 1486, 1487, 1488 and 1489) extracted from the instructions after the first instruction. In
Branch instruction dependency checking is similar to RAW hazard checking. The branch condition is updated by a certain instruction executed before the branch instruction. A common practice is that an instruction updates a register which is used by the branch instruction to make the branch decision, such as a condition flag register or a register which is used for comparison by the branch instruction. If the instruction set uses a condition flag register, the first three instructions are decoded to indicate whether each updates the flag register, and the results are sent to the last three instructions and compared with a signal indicating whether each of those instructions is a branch instruction or not. For example, if the first instruction updates the flag register and the fourth instruction is a branch instruction, the fourth instruction may not be issued in this clock cycle; it must wait until the flag register is updated and then execute in the next clock cycle. If the instruction set adopts a condition destination register, the method is the same as the RAW hazard process, that is, the source register addresses of the last three instructions are compared with the condition destination register addresses of the first three instructions. This comparison is already included in RAW hazard detection, so no additional comparison logic is needed.
OR operations are performed on all of the comparison results for each of the second, third, and fourth instructions. The output of the OR gate indicates that the corresponding instruction has a hazard with a prior instruction, so that this instruction may not be executed in the same clock cycle but rather needs to be outputted from the IRB in the next clock cycle. If the second instruction has a hazard with the first instruction, only the first of the four instructions outputted may be executed in this clock cycle, and four instructions starting from the second instruction will be issued in the next cycle. If the second instruction has no hazard but the third instruction has a hazard, only the first two of the four instructions outputted may be executed in this clock cycle, and four instructions starting from the third instruction will be issued in the next cycle. If the second and third instructions have no hazard but the fourth instruction has a hazard, only the first three of the four instructions outputted may be executed in this clock cycle, and four instructions starting from the fourth instruction will be issued in the next cycle. If the second, third, and fourth instructions all have no hazard, then all four of the instructions outputted may be executed in this clock cycle, and four instructions starting from the first instruction after the four instructions outputted are issued in the next cycle for processor execution. The Token passing must abide by the rules described above.
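The rule above can be restated compactly: the position of the leftmost hazard determines both how many of the four instructions execute in the current cycle and how far the Token advances. The following Python sketch models only this behavior, under the assumption of a simplified instruction format with one destination register and up to two source registers; the function and field names are illustrative and not part of this disclosure.

    # A minimal sketch (not the hardware itself) of the issue rule described above.

    def hazard_bits(instrs):
        """Return hazard flags for the 2nd, 3rd and 4th of four instructions.

        An instruction is flagged when one of its source registers equals the
        destination register of any earlier instruction in the same group (RAW).
        """
        flags = []
        for i in range(1, 4):
            dests = {ins["dst"] for ins in instrs[:i] if ins["dst"] is not None}
            flags.append(any(src in dests for src in instrs[i]["src"]))
        return flags  # [hazard2, hazard3, hazard4]

    def issue_and_token(instrs):
        """Number of instructions issued this cycle and token step (1..4)."""
        flags = hazard_bits(instrs)
        for pos, flagged in enumerate(flags, start=1):
            if flagged:               # leftmost hazard wins
                return pos, pos       # issue 'pos' instructions, token moves by 'pos'
        return 4, 4                   # no hazard: all four issue, token moves by 4

    # Example: the 4th instruction reads r3, which is written by the 1st instruction.
    group = [
        {"dst": "r3", "src": ["r1", "r2"]},
        {"dst": "r4", "src": ["r5", "r6"]},
        {"dst": "r7", "src": ["r8", "r9"]},
        {"dst": "r0", "src": ["r3", "r1"]},
    ]
    print(issue_and_token(group))     # -> (3, 3): first three issue, token moves by 3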
Which instruction may be executed in the next clock cycle depends on the location where the hazard occurred and the priority among multiple hazards: a hazard on an instruction to the left has priority over a hazard on an instruction to the right. In this embodiment, this function is implemented by a priority encoder. The priority encoder has a similarly structured shift block logic corresponding to each instruction. When an instruction has a hazard, its corresponding shift block logic blocks the ‘hazard’ signal propagated from the shift block logic to its right and produces its own hazard signal corresponding to the instruction. When an instruction has no hazard, the shift block logic downshifts the ‘hazard’ position signal from its right and passes it to the shift block logic to its left.
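Functionally, the chain of shift block logics behaves like a leftmost-hazard priority encoder whose rightmost input is tied to ‘1’. The Python sketch below models that functional behavior only, not the gate-level shift structure; the names are hypothetical.

    def priority_select(hazard2, hazard3, hazard4):
        """Return a one-hot list [sel_plus1, sel_plus2, sel_plus3, sel_plus4].

        Each stage either emits its own hazard bit (blocking anything from its
        right) or lets the signal arriving from its right shift one position left.
        """
        bits = [hazard2, hazard3, hazard4, True]   # rightmost input is fixed to '1'
        out = [False, False, False, False]
        for i, b in enumerate(bits):
            if b:
                out[i] = True
                break                              # leftmost asserted bit blocks the rest
        return out

    print(priority_select(False, False, True))     # -> [False, False, True, False] ('0010')
    print(priority_select(True, False, True))      # -> [True, False, False, False] ('1000')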
Assume the Token is in token passer 1431, that is, the control line 1444 is ‘1’; it controls the issue of the instruction stored in memory 1431 (simply called instruction 1431 in the following) through bus 1461, instruction 1433 through bus 1463, instruction 1435 through bus 1465, and instruction 1437 through bus 1467, all at the same time. If there is no hazard among the four instructions, the shift block logics (1452, 1453, and 1454) do not block the signal passing. The signals on wires (1471, 1472 and 1473), which correspond respectively to the shifted second-instruction hazard bit 1491, the shifted third-instruction hazard bit 1492, and the shifted fourth-instruction hazard bit 1493, are all ‘0’ (there is no hazard detected on the second, third, and fourth instructions). Because the wire 1494 is fixed to ‘1’, the signal of wire 1494 passes to signal 1474 through four stages of shift logic. Thus, the control signal of the four-input multiplexer in each token passer is ‘0001’, which selects the fourth input of each four-input multiplexer. Except for the four-input multiplexer in token passer 1439, the fourth input of each four-input multiplexer is ‘0’. The fourth input of the multiplexer in token passer 1439, which couples with the control line 1444, is ‘1’. Thus, the Token is sent to token passer 1439. In the next clock cycle, the IRB outputs four instructions in sequence starting from instruction 1439 to the CPU core for execution and also to the dependency check unit to perform dependency checking.
Assume the instruction 1431 and its next three instructions are issued at the same time again. If only the fourth instruction 1437 has a hazard with one of its prior instructions, the hazard bit 1493 of the fourth instruction is ‘1’; thus the output of the AND gate in shift block logic 1454 is ‘0’, which blocks the passing of signal 1494, and the signal 1493 reaches signal 1473 through three stages of shift logic. The control signal of the four-input multiplexer in each token passer is ‘0010’, which selects the third input of each multiplexer. Except for the four-input multiplexer in token passer 1437, the third input of each four-input multiplexer is ‘0’. The third input of the multiplexer in token passer 1437, which couples with the control line 1444, is ‘1’. Thus, the Token is sent to token passer 1437. In the next clock cycle, the IRB outputs four instructions in sequence starting from instruction 1437 to the CPU core for execution and also to the dependency check logic to perform dependency checking. The instruction 1437 is outputted from bus 1461 of the first lane in that cycle (it was outputted from bus 1467 of the fourth lane in the previous clock cycle).
Assume the instruction 1431 and its next three instructions are issued at the same time again. If the second instruction 1433 and the fourth instruction 1437 both have hazards with instructions before them, the hazard bit signal 1493 of the fourth instruction is ‘1’; thus the output of the AND gate in shift block logic 1454 is ‘0’, which blocks the passing of signal 1494, and the signal 1493 shifts left to its left shift block unit. However, at this time the hazard bit signal 1491 of the second instruction is ‘1’, so the outputs of the three AND gates in the shift block logic are ‘0’, which blocks the passing of signal 1493. The signal 1491 reaches signal 1471, so the control signal of the four-input multiplexer in each token passer is ‘1000’, which selects the first input of each multiplexer. Except for the four-input multiplexer in token passer 1433, the first input of each four-input multiplexer is ‘0’. The first input of the multiplexer in token passer 1433, which couples with the control line 1444, is ‘1’. Thus, the Token is sent to token passer 1433. In the next clock cycle, the IRB outputs four instructions in sequence starting from instruction 1433 to the CPU core for execution and also to the dependency check logic to perform dependency checking. This time, the instruction 1433 is outputted from bus 1461 (it was outputted from bus 1463 in the previous clock cycle).
When the destination register address 1481 is the same as one of the source register addresses (1483 and 1484) corresponding to the second instruction, the output signal 1471 of OR gate 1491 is ‘1’, and it forces the signals (1472, 1473 and 1474) to output ‘0’; otherwise, the output signal 1471 of OR gate 1491 is ‘0’.
When the destination register address (1481 or 1482) is the same as one of the source register addresses (1486 and 1487) corresponding to the third instruction, the output signal 1472 of OR gate 1492 is ‘1’, and it forces the signals (1473 and 1474) to output ‘0’; otherwise, the output signal 1472 of OR gate 1492 is ‘0’.
When the destination register address (1481, 1482 or 1485) is the same as one of the source register addresses (1488 and 1489) corresponding to the fourth instruction and the outputs of OR gates (1491 and 1492) are ‘0’, the output signal 1473 of OR gate 1493 is ‘1’, and it forces the signal 1474 to output ‘0’; otherwise, the output signal 1473 of OR gate 1493 is ‘0’.
Only when the output signals of OR gates (1491, 1492 and 1493) are ‘0’, the output signal 1474 is ‘1’; otherwise, the output signal 1474 is ‘0’.
Thus, the output signals generated on 1471, 1472, 1473 and 1474 combine to form a selecting signal 1479 which controls all of the multiplexers in the token passers. Taking token passer 1431 as an example, the output signals 1471, 1472, 1473 and 1474 each correspond to one of the four inputs, from left to right (inputs A, B, C and D), of multiplexer 1447.
Let's assume the Token signal is in token passer 1431, the four instructions corresponding to token passers 1431, 1433, 1435 and 1437 are sent to dependency check unit and execution unit at the same time each through buses 1461, 1463, 1465 and 1467. The result of dependency checking determines which instructions are to be executed in parallel. At the same time, the control signal 1479 outputted by dependency check unit is sent to each token passer to control the Token passing.
If the Dependency Check Unit finds that the second instruction of the said four instructions has a RAW hazard with the first instruction, then each multiplexer in all token passers selects input A. Since the Token signal is in token passer 1431 at this time, among the four-input multiplexers in token passers 1433, 1435, 1437 and 1439, only input A of the multiplexer in token passer 1433 is ‘1’; the corresponding inputs of the other three multiplexers are ‘0’. Thus, only the output of the multiplexer in token passer 1433 is ‘1’, and the outputs of the other three token passers are ‘0’. The Token signal is sent to token passer 1433, which indicates that the instruction corresponding to token passer 1433 will be sent to the execution unit through bus 1461. In the next clock cycle, the IRB sends four instructions starting with the instruction corresponding to token passer 1433 to the execution units and the Dependency Check Unit.
If the Dependency Check Unit finds that the first two of the said four instructions have no RAW hazard with each other but at least one of the first two instructions has a RAW hazard with the third instruction, then each multiplexer in all token passers selects input B. Since the Token signal is in token passer 1431 at this time, among the four-input multiplexers in token passers 1433, 1435, 1437 and 1439, only input B of the multiplexer in token passer 1435 is ‘1’; the corresponding inputs of the other three multiplexers are ‘0’. Thus, only the output of the multiplexer in token passer 1435 is ‘1’, and the outputs of the other three token passers are ‘0’. The Token signal is sent to token passer 1435. It means that now only the instructions corresponding to token passers 1431 and 1433 may be sent to the execution units through buses 1461 and 1463. In the next clock cycle, the IRB sends four instructions starting with the instruction corresponding to token passer 1435 to the execution units and the Dependency Check Unit. Other situations may be deduced by analogy; therefore, the Token signal is passed to a certain token passer based on the dependency of the four sequential instructions, enabling the IRB to output the right instructions.
In this embodiment, the input clock or power supply of an execution unit corresponding to the said instructions which are not able to be executed in parallel may be shut off to stop the execution of the said instructions; clearing the execution results of the said instructions has the same effect.
As used herein, modifying the blocking scheme in the dependency check unit may support a lower number of instructions issued in parallel, while increasing the number of inputs of the multiplexers in the token passers and making the corresponding modification to the blocking scheme in the dependency check unit may support a higher parallel issue rate. For example, in
The ILP multi-issue structure in
As used herein, a branch source instruction, the branch target instruction, and its following instructions may be issued in the same clock cycle if the branch prediction of a branch instruction is taken, which may implement penalty-less ILP branching.
In addition, the token passer also includes 4 pass-gates and 4 AND gates. For example, in token passer 1513, under the control of the branch source decoder, the Token of the first lane is passed to the token bus of the first lane 1541 through pass-gate 1530; the Token of the second lane is passed to the token bus of the second lane 1543 through pass-gate 1531; the Token of the third lane is passed to the token bus of the third lane 1545 through pass-gate 1532; and the Token of the fourth lane is passed to the token bus of the fourth lane 1547 through pass-gate 1534. Under the control of the branch target decoder, each AND gate (1536, 1537, 1538 and 1539) may block the passing of the Token of token passer 1503 to its next token passer. The operation is similar to that of the embodiment in
In
In this embodiment, because the branch is predicted as taken, the control signal 1535 of token passer 1513 corresponding to the output of source decoder 513 is ‘1’. The signal 1535 passes through an inverter 1533 to become ‘0’, and this inverted signal couples with one input of each AND gate (1536, 1537, 1538 and 1539); thus the outputs of the above four AND gates are ‘0’, which blocks the Token signal passing. At the same time, under the control of signal 1535, the pass-gates (1530, 1531, 1532 and 1534) are opened. Only the input of pass-gate 1531 is ‘1’, so the bus 1543 is ‘1’, while the other buses (1541, 1545 and 1547) are all ‘0’. Therefore, in the instruction block where the branch instruction 1503 is located, only the branch instruction 1503 and its previous instruction 1501 are outputted to the execution units and the Dependency Check Unit.
In
Similarly, if instruction 1551 is not a branch instruction, the control signal 1575 of token passer 1561 outputted by source decoder 513 is ‘0’. The signal 1575 passes through an inverter 1573 to become ‘1’, and this inverted signal couples with one input of each AND gate (1576, 1577, 1578 and 1579). At this time, only the other input of AND gate 1578 is ‘1’, coming from AO gate 1566 in token passer 1561, while the other inputs of the other three AND gates are all ‘0’. Thus, the output of AND gate 1578 is ‘1’ and the outputs of AND gates (1576, 1577 and 1579) are ‘0’. Therefore, the output of AO gate 1567 in token passer 1563 is ‘1’ and the outputs of AO gates (1564, 1565 and 1566) are ‘0’; the output of AO gate 1567 controls instruction 1553 to be outputted from bus 1557.
Based on the method described above, the branch source instruction and branch target instruction and its fall-through instructions may be issued in the same clock cycle. In addition, based on the above embodiments, the branch source instruction and its fall-through instructions may be issued when the branch prediction is not taken. Therefore, using the said structure and methods consistent with the disclosed embodiments, penalty-less branching for ILP may be implemented.
As used herein, the parts and components in the prior embodiments may be combined to form processor systems in more variety to implement the same functions.
In
The rows of tag memory 2305 correspond one to one to the rows of instruction memory 2306; every row is used to store the block address of the corresponding instruction block in instruction memory 2306.
The structures and functions of instruction memory 2306 and IRB 2307 are similar to those of the instruction memory and IRB of the previous embodiment. The difference lies in that the memory blocks of instruction memory 2306 correspond one to one to the rows of tag memory 2305. Therefore, the BNX obtained from matched block addresses in tag memory 2305 may be used to find the corresponding micro-op block in instruction memory 2306. Instead of the BNX of the block, the register in the control unit of IRB 2307 now stores the block address of the current block. In this embodiment, the end mark representing the last instruction of the instruction block is stored in the last token passer in IRB 2307. This way, when the token signal is passed to the last instruction of the instruction block, IRB 2307 not only outputs the corresponding instruction to be executed by processor core 2311, but also outputs the said end mark to update the instruction block address.
Processor core 2311 is a modified processor core, in which the address generation module only produces instruction block address. The said instruction block address represents the block address of instruction block. After obtaining BNX from successfully matching tag memory 2305, the position in instruction memory 2306 of the instruction block represented by the current block address may be found.
As used herein, the branch target address may be calculated by the address generation module using the instruction block address directly produced by the instruction address generation module and the revised branch offset value. Here, the revised branch offset value may be obtained as the sum of the instruction block offset address of the branch instruction and the branch offset, and is stored in the storage unit corresponding to the said branch instruction in instruction memory 2306. Because the branch target address is equal to the sum of the branch instruction address and the branch offset value, and the branch instruction address is equal to the sum of the branch instruction block address and the offset value within the branch instruction block, in this disclosure the branch target address is equal to the sum of the branch instruction block address and the revised branch offset value.
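As a purely numeric illustration of this address arithmetic, the following sketch assumes a block of BLOCK_SIZE instructions and instruction-granular offsets; none of the names below are taken from the figures.

    BLOCK_SIZE = 8                                     # assumed instructions per block

    def revised_offset(branch_offset_in_block, branch_offset):
        # Stored with the branch instruction when the block is filled.
        return branch_offset_in_block + branch_offset

    def branch_target(branch_block_addr, rev_offset):
        # target = branch block address + revised offset; the upper part adjusts
        # the block address, the lower part selects the instruction within the block.
        target_block = branch_block_addr + (rev_offset // BLOCK_SIZE)
        target_offset = rev_offset % BLOCK_SIZE
        return target_block, target_offset

    # Example: a branch at offset 5 of block 12 with branch offset +6 lands at
    # offset 3 of block 13.
    print(branch_target(12, revised_offset(5, 6)))     # -> (13, 3)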
As shown in
Specifically, when the CPU core 2311 executes the sequential instructions, if an instruction currently executed by CPU core 2311 is not the last instruction in the instruction block, multiplexer 2417 selects the value outputted from register 2401 to feed back to register 2401. Thus, the value of register 2401 is kept unchanged (that is, the instruction block address outputted from register 2401 is unchanged). That is, the value outputted from bus 2321 is the original instruction block address.
If an instruction currently executed by CPU core 2311 is the last instruction in the instruction block, multiplexer 2417 selects the value outputted from register 2401 as one input of adder 2423. The other input of adder 2423 is signal 2421 (‘1’) from IRB 2307 representing that the current instruction is the last instruction in the instruction block, such that the instruction block address stored in register 2401 is incremented by 1 to obtain a new instruction block address. The new instruction block address is written back into register 2401. The value outputted from bus 2321 is the next instruction block address.
If CPU core 2311 executes a branch instruction and the branch is taken, adder 2425 obtains the address of a new instruction block by adding the current instruction block address sent from register 2401 to the upper bit portion of the compensated branch offset sent from IRB 2307. The value outputted from bus 2325 is the branch target instruction block address.
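The behavior of this block-address generation logic may be sketched as follows; this is an illustrative model under assumed names, not the circuit itself.

    def next_block_address(current_block, is_last_in_block, branch_taken,
                           compensated_offset_upper):
        """Return the block address driven onto the address buses for the next step."""
        if branch_taken:
            # Adder 2425 path: current block address + upper bits of the
            # compensated branch offset gives the target instruction block.
            return current_block + compensated_offset_upper
        if is_last_in_block:
            # Adder 2423 path: sequential flow crosses a block boundary, so the
            # block address is incremented by one.
            return current_block + 1
        # Multiplexer 2417 feeds the register back to itself: address unchanged.
        return current_block

    print(next_block_address(40, False, False, 0))   # -> 40 (stay in block)
    print(next_block_address(40, True,  False, 0))   # -> 41 (next sequential block)
    print(next_block_address(40, False, True,  3))   # -> 43 (branch target block)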
It should be noted that the instruction block address generation module is inside CPU core 2311, and the instruction block addresses respectively outputted by the instruction block address generation module via bus 2321 and bus 2325 are selected to perform a matching operation in tag memory 105. However, the instruction block address generation module may also exist separately outside CPU core 2311. The operating process of the instruction block address generation module outside CPU core 2311 is the same as that of the instruction block address generation module inside CPU core 2311, and is not repeated here.
Returning to
Specifically, when the CPU core 2311 executes the instructions according to the order of the addresses and the last instruction in the current instruction block has not yet been executed, because the instruction block address is unchanged, the instruction block address does not need to be matched in IRB 2307 or tag memory 105. The token signal in IRB 2307 is passed through every token transmitter corresponding to the current instruction block in order, providing the corresponding instructions for CPU core 2311 to execute.
When the next sequential instruction block is to be executed, multiplexer 2319 selects the instruction block address (i.e., the address of the next instruction block) from bus 2321. A matching operation is performed on the instruction block address in IRB 2307.
If the instruction block address is matched successfully in the control unit in IRB 2307, the corresponding instruction block is the next instruction block.
If the instruction block address is matched unsuccessfully in the control unit in IRB 2307, the instruction block address is sent to tag memory 105 to perform a matching operation. In this case, if the matching operation is successful, BNX is obtained. The instruction block pointed to by the BNX in instruction memory 2306 is filled into the memory block determined by the replacement algorithm in IRB 2307, such that IRB 2307 contains the next instruction block.
If the instruction block address is matched unsuccessfully in tag memory 105, the low bits of the instruction block address are filled with ‘0’ to form a complete instruction address (that is, the instruction address of the first instruction corresponding to the instruction block address). Based on the previous method, the instruction address is sent to the lower level memory to obtain the corresponding instruction block. The obtained instruction block is converted via converter 109, and the converted instruction block is filled into the memory block pointed to by the BNX determined by the replacement algorithm in instruction memory 2306. The mapping relationship obtained by the conversion operation is stored in the row pointed to by the BNX in the mapping module. At the same time, the instruction block in instruction memory 2306 is filled into the memory block determined by the replacement algorithm in IRB 2307, such that IRB 2307 contains the next instruction block.
Thus, when the token signal is passed to the token transmitter corresponding to the last instruction in the current instruction block (that is, when CPU core 2311 executes the last instruction), the token signal is passed from the token transmitter corresponding to the last instruction in the current instruction block to the token transmitter corresponding to the first instruction in the next instruction block under the control of the ending flag. Then, as the Token signal is passed, IRB 2307 outputs the corresponding instruction in order for CPU core 2311 execution.
When IRB 2307 outputs the branch instruction to CPU core 2311 for execution, the address of the branch target instruction block may be calculated by adding the upper bit portion of the compensated offset address to the block address of the branch instruction as shown in
If the address of the branch target instruction block is matched successfully in the control unit in IRB 2307, the instruction block that is matched successfully is the instruction block where the branch target instruction is located. At this time, because instruction memory 2306 contains all the instruction blocks in IRB 2307, BNX may be obtained successfully by performing a matching operation on the instruction block address in tag memory 105. Then, the low bit portion 2331 of the compensated branch offset is used as the instruction block offset. The instruction block offset is sent to mapping module 107. Based on the mapping relationship included in the row pointed to by the BNX, the instruction block offset is converted to the instruction offset address 2333. Based on the instruction offset address 2333, the branch target instruction may be found in the instruction block that is matched successfully in IRB 2307.
If the address of branch target instruction block is matched unsuccessfully in the control unit in IRB 2307, the instruction block address is sent to tag memory 105 to perform a matching operation. In this case, if the matching operation is successful, BNX is obtained. The instruction block pointed to by the BNX in instruction memory 2306 is filled into the memory block determined by the replacement algorithm in IRB 2307, such that IRB 2307 contains the branch target instruction block. At the same time, the low bit portion 2331 of the compensated branch offset is used as the instruction block offset. The instruction block offset is sent to mapping module 107. Based on the mapping relationship included in the row pointed to by the BNX, the instruction block offset is converted to instruction offset address 2333. Based on instruction offset address 2333, the branch target instruction may be found in the branch target instruction block in IRB 2307.
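The matching cascade described in the preceding paragraphs (IRB control unit first, then tag memory, then the lower level memory) may be summarized by the following sketch; the dictionaries and helper names are stand-ins chosen for illustration and are not structures named in this disclosure.

    def obtain_block(block_addr, irb, tag_mem, instr_mem, fetch_and_convert):
        """Ensure the IRB holds the instruction block for block_addr and return it."""
        if block_addr in irb:                      # match in the IRB control unit
            return irb[block_addr]

        if block_addr in tag_mem:                  # match in tag memory -> BNX
            bnx = tag_mem[block_addr]
        else:                                      # miss: fetch from lower memory,
            bnx = len(instr_mem)                   # convert, and fill instruction memory
            instr_mem[bnx] = fetch_and_convert(block_addr)
            tag_mem[block_addr] = bnx

        irb[block_addr] = instr_mem[bnx]           # fill the IRB (replacement policy elided)
        return irb[block_addr]

    # Example with a trivial lower-level "memory".
    irb, tags, imem = {}, {}, {}
    block = obtain_block(7, irb, tags, imem,
                         fetch_and_convert=lambda a: [f"insn_{a}_{i}" for i in range(4)])
    print(block[0])                                # -> insn_7_0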
Thus, when the execution result of the branch instruction is not yet generated by CPU core 2311, according to the order of the addresses, the token signal continues to be passed in order and the corresponding instructions are outputted to CPU core 2311 for execution. When CPU core 2311 executes the branch instruction and generates the execution result of the branch instruction, if the branch is not taken, the token signal continues to be passed in order and the corresponding instruction is outputted to CPU core 2311 for execution; if the branch is taken, CPU core 2311 clears the execution results or the intermediate results of the executed instructions following the branch instruction. At the same time, according to the previously described method in
The said IRB-equipped processor may be expanded into a multi-core processor, and may support more than one instruction set.
Further, there are two instruction set converter mechanisms in scan converter 209, which respectively convert instruction set B and instruction set C to instruction set A. Under this circumstance, it is equivalent to different lanes and different threads in a multi-lane processor executing instructions of different instruction sets.
As used herein, all the methods and implementations of this disclosure may be expanded to cache systems with more layers of memory hierarchy.
As used herein, the IRB may be further improved by directly controlling the IRB with the tracker and outputting multiple instructions to the processor without any token registers, implementing the same functionality as the embodiment in
In this embodiment, track table 210 not only outputs the target track point BN through bus 1633, after selection by target selection module 1649 based on the addressing of the read pointer outputted by tracker 1607, but also outputs on bus 1635 the next instruction block address stored in the track end point pointed to by read pointer 1631. The above addresses are sent to IRB 107 and multiplexer 1609 through buses 1633 and 1635. For ease of display and explanation, IRB in
Each instruction storage unit in IRB 107 may accept instructions from an outer memory (for example, instruction cache 206) through bus 1667. A block in instruction cache 206 is placed into an instruction block from top down in program order. There are 3 read ports in each instruction storage unit, and each read port provides instructions to an execution unit.
Each instruction block in IRB 107 contains one decoder module. As is the case for decoder module 1617 in instruction block 1601, its first address memory 505, branch target comparator 511, current first address comparator 509, and current second address decoder 513 are the same as the corresponding components stated in previous embodiments (also called the branch source comparator and the branch source address decoder). The first address memory 505, which is written along with the instructions, stores the BNX of the instruction block. The BNX coming from tracker 1607 through read pointer 1631 is compared by the first address comparator 509 with the BNX stored in first address memory 505, and the instruction block is the current instruction block if the result matches. It then enables the second address decoder 513, which decodes the BNY in read pointer 1631, and there is one and only one ‘1’ in its output signals 1641, 1643, 1645 and 1647. However, the outputs of the second address decoder are all ‘0’ if the comparison in first address comparator 509 does not match.
The instruction blocks form an array, in which the instructions are arranged from top down in program order and each row stores one instruction, whereas each column corresponds to an execution unit and contains a read port in each row. The outputs of second address decoder 513 in IRB 107 control the read ports of all columns through a word-line extending from top left to bottom right. It issues 3 sequential instructions, in order from left to right, through buses 1661, 1663 and 1665 to dependency check module 1627 and execution units 1621, 1623 and 1625, so that consecutive instructions can be issued to multiple execution units in the same clock cycle. An instruction may be issued to an execution unit through the read port of any column as needed.
Similarly, the BNX of the branch target coming from track table 210 and selected by module 1649 is compared by branch target comparator 511 with the BNX stored in first address memory 505. If they match, it indicates that the instruction block is the one where the branch target is located, and the result is only used to judge whether the branch target is already stored in IRB 107.
The BNX of the next instruction block outputted by track table 210 is compared by next block address comparator 1619 with the BNX stored in first address memory 505. If the inputs of the comparator match, it indicates that the corresponding instruction block is the next instruction block. The result of the comparator controls one input of all AND gates in row No. 1 of the IRB (except for the leftmost column, in which the read ports on all rows are directly driven by the second address decoder 513), such as AND gates 1637 and 1639; another input of all these AND gates connects to the token bus (bus 1667 for example) to receive the position where the last instruction is issued in another IRB block, so that the remaining columns are filled with the instructions following the current instruction block, which makes the most of the execution units. The read port control lines in the last row of all these IRB blocks connect to an OR gate, such as OR gate 1647 or 1649, whose output is the token bus such as 1667 or 1669 and is also an input of the AND gates 1637 and 1639. The output of next block address comparator 1619, whose purpose resembles that of branch target comparator 511, is also used to judge whether the next instruction block is already in IRB 107.
As used herein, track table 210 consists of three components in this embodiment: instruction type field 1671, branch target track point field 1673 and next instruction block number 1675. The instruction type field 1671 contains the type information of all instructions on the track; for instance, the instruction type is ‘1’ if it is a branch instruction, otherwise it is ‘0’. Each item in branch target track point field 1673 corresponds to a track point on the track. If a track point is a branch point, its branch target track point field 1673 contains the information of the target track point of the branch instruction. The track, which is addressed by the BNX in the read pointer 1631 of tracker 1607 in this embodiment, outputs its next instruction block number 1675 to bus 1635 as the BNX of the next instruction block, and outputs all contents of instruction type field 1671 and branch target track point field 1673 to branch target selection module 1649.
As used herein, an embodiment of branch target selection module stated in this invention is illustrated in
As shown in
The tracker 1607 consists of two registers, four multiplexers, and one adder. Registers 1651 and 1653 respectively store the BNX and BNY of the read pointer. Multiplexer 1656 passes a fixed value ‘1’, ‘2’ or ‘3’ to the adder as the address increment according to the dependency check result among instructions generated by dependency check module 1627. This value, added to the BNY of the read pointer sent by register 1653, is the new BNY of the read pointer. For example, multiplexer 1656 passes ‘3’ to adder 1655 if there is no dependency among the 3 instructions provided by IRB 107, so that after the addition the BNY corresponds to the 3rd instruction behind the current BNY.
The multiplexer 1658 selects between the output of adder 1655 and the BNY outputted by branch target selection module 1649 under the control of branch decision signal 1657 sent by the execution units. There are independent branch judgment logics in execution units 1621, 1623, and 1625, and there are independent instruction decoders corresponding to the execution units in dependency check module 1627. In the case of a certain class of branch instructions which generate the branch condition and check the branch type at the same time, as well as in the case of issuing multiple branch instructions in one cycle, only the first branch instruction takes effect. The signal 1657 derives from encoding, by a priority encoder, the branch decision of the first branch instruction (i.e., the first branch instruction in program order) among the execution units, utilizing the branch types decoded by the instruction decoders. The functionality of the priority encoder resembles 1687 in
The multiplexer 1658 passes the branch target track point's BNY outputted by branch target selection module 1649 to register 1653 in order to update the read pointer's BNY in the case that the branch is taken. If the branch is not taken, multiplexer 1658 passes the BNY outputted by adder 1655 to register 1653.
The multiplexer 1652 selects between the current read pointer BNX (namely the current instruction block BNX) and the next instruction block BNX derived from track table 210 under the control of the carry bit generated by adder 1655. It passes the next instruction block's BNX outputted by track table 210 to multiplexer 1654 when adder 1655 generates the carry bit, indicating that all instructions in the current instruction block have been sent to the execution units. However, it passes the current instruction block's BNX outputted by register 1651 to multiplexer 1654 when adder 1655 does not generate the carry bit, indicating that there are instructions in the current instruction block that have not yet been sent to the execution units.
The multiplexer 1654 selects between the output of multiplexer 1652 and the branch target BNX outputted by branch target selection module 1649, also under the control of branch decision signal 1657. When the branch is taken, multiplexer 1654 passes the branch target track point's BNX outputted by branch target selection module 1649 to register 1651 in order to update the BNX of the read pointer, whereas it passes the BNX outputted by multiplexer 1652 to update register 1651 if the branch is not taken. Registers 1651 and 1653 are updated at each cycle unless there is an exception, for example, execution unit 1621 stalls the pipeline, a cache miss occurs, and so on. The update of registers 1651 and 1653 is terminated through control line 1626 once the exception happens.
Besides, the multiplexer 1652 can be omitted by sending the next instruction block's BNX, which is directly outputted by track table 210 on bus 1635, to the multiplexer 1654, and by controlling the update of register 1651 with the branch TAKEN signal and the carry output of adder 1655. If the branch is taken, the multiplexer 1654 passes the branch target BNX, which is outputted by the branch target selection module, to register 1651 under the control of the TAKEN signal. If the branch is not taken and adder 1655 generates the carry bit, the multiplexer 1654 passes the BNX of the next instruction block, which is outputted by track table 210, to register 1651 under the control of the carry signal. However, register 1651 is not updated and preserves the original BNX in the case that the branch is not taken and there is no carry bit generated by adder 1655.
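Putting the above together, the per-cycle update of the read pointer may be modeled as in the following sketch, which assumes a block length of BLOCK_LEN instructions; the names are illustrative only.

    BLOCK_LEN = 8                                   # assumed instructions per IRB block

    def update_read_pointer(bnx, bny, increment, branch_taken,
                            target_bn, next_block_bnx):
        """Return the new (BNX, BNY) read pointer.

        increment      : 1, 2 or 3, from the dependency check module
        target_bn      : (BNX, BNY) of the branch target track point
        next_block_bnx : BNX of the next sequential block from the track table
        """
        if branch_taken:                            # multiplexers 1654/1658 pick the target
            return target_bn
        new_bny = bny + increment
        if new_bny >= BLOCK_LEN:                    # adder carry: cross into the next block
            return next_block_bnx, new_bny - BLOCK_LEN
        return bnx, new_bny                         # stay in the current block

    print(update_read_pointer(5, 3, 3, False, (9, 2), 6))   # -> (5, 6)
    print(update_read_pointer(5, 6, 3, False, (9, 2), 6))   # -> (6, 1)
    print(update_read_pointer(5, 6, 3, True,  (9, 2), 6))   # -> (9, 2)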
Thus, tracker 1607 generates a read pointer 1631 and sends it to the control modules corresponding to all IRB blocks at each clock cycle. As is the case of module 1617, if its corresponding instruction block is the current instruction block, the current second address decoder 513 decodes BNY in read pointer 1631 under the control of the match signal generated by the current first comparator 509, setting corresponding outputs to ‘1’ and others to ‘0’. The situation that the branch is not taken is elaborated herein first. For example, the current second address decoder 513's output control line 1641 is ‘1’, and 1643 as well as 1645 are ‘0’ if the BNY of read pointer 1631 is ‘0’. As illustrated in
For instance, the control line 1647 of the second address decoder 513 is ‘1’, control line 1641, 1643 and 1645 are all ‘0’ if the BNY of read pointer is ‘3’. As illustrated in
One input of AND gates 1637 and 1639 is the output of comparator 1619, which in this case is ‘1’ in the control module corresponding to the next instruction block, and the other inputs are buses 1667 and 1669 respectively. The value of control line 1638 is ‘1’ because both inputs of AND gate 1637 are ‘1’, making memory units 1611 and 1613 output instructions through buses 1663 and 1665 respectively. Thus, the instruction on bus 1661 is the last instruction of the current instruction block, while the instructions on buses 1663 and 1665 are respectively the first and second instructions of the next instruction block; i.e., it outputs 3 continuous instructions which are sent to execution units 1621, 1623, 1625 and dependency check module 1627. If we regard the IRB blocks as an array, the next block address selects the first row of a certain block (the first instruction in this block), and the column information of the last instruction in the previous instruction block (namely the column to the right of the one occupied) is passed to all IRB blocks through the token bus. Instructions are issued from the read ports at the selected row and columns until all columns/execution units are utilized in the same clock cycle. Tracker 1607 is responsible for adding ‘1’, ‘2’ or ‘3’ to the BNY in the read pointer based on the output of dependency check module 1627. In the embodiment herein, the carry bit generated by adder 1655 is definitely ‘1’ because at least ‘1’ is added to a BNY already at the end of the block. As a result, the next instruction block's BNX derived from the output of track table 210 is stored in register 1651 and the sum of adder 1655 is stored in register 1653. The newly obtained read pointer points to the first of the 3 continuous instructions in the next instruction block to be outputted in parallel next time.
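The cross-block issue described above may be pictured with a small sketch: when the diagonal word-line runs past the end of the current block, the token bus lets the next block supply the remaining issue slots. The list-based model below is illustrative only.

    def issue_slots(cur_block, next_block, bny, width=3):
        """Return the 'width' instructions placed on the issue buses this cycle."""
        slots = cur_block[bny:bny + width]          # read ports enabled by the word-line
        missing = width - len(slots)
        if missing:                                 # token bus hands over to the next block
            slots += next_block[:missing]
        return slots

    cur = [f"cur_{i}" for i in range(4)]
    nxt = [f"next_{i}" for i in range(4)]
    print(issue_slots(cur, nxt, 3))                 # -> ['cur_3', 'next_0', 'next_1']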
The following paragraphs focus on the case in which the branch is taken. If there is a branch instruction among the issued instructions, after decoding by dependency check module 1627, the execution units judge the first branch instruction in program order under the control of the priority judgment logic, and its result controls multiplexers 1654 and 1658. Multiplexer 1654 passes the output of multiplexer 1652 and multiplexer 1658 passes the output of adder 1655 if the branch is not taken, and the execution procedure in the next cycle is exactly the same as that of the above non-branch instructions.
If the branch is taken, the results of the execution units after the one corresponding to the branch instruction are not written back to registers such as register file 1629 (whereas if the branch is not taken, the situation is the same as executing non-branch instructions). Meanwhile, both multiplexers 1654 and 1658 pass the branch target track point derived from the track table and transferred on bus 1633 under the control of the valid branch decision 1622. Registers 1651 and 1653 in tracker 1607 respectively update their contents to the BNX and BNY of the branch target track point, which constitute the new current read pointer. The decoder in the IRB enables the corresponding word-line to control the read port of the instruction memory unit where the branch target instruction is located. As a result, the instruction in the leftmost column is sent to execution unit 1621 through bus 1661, and its succeeding instructions are sent to the execution units from left to right. Besides, the newly obtained read pointer is sent to track table 210 through bus 1631 to read out the corresponding track. The information on the track of the branch target, which is provided by the track table, is utilized by tracker 1607 and the decoders in IRB 107.
If the IRB block boundary is crossed in the procedure stated above, as is the case if the last instruction in IRB block is branch target, the instruction is issued to execution unit 1621 through bus 1661 and the token bus 1667 is validated. The IRB block, which matches the address of branch target instruction's next instruction block derived from track table 210 through bus 1635, issues the first instruction to execution unit 1623 through bus 1663 and the second instruction to execution unit 1625 through bus 1665. The following operations resemble the non-branch instruction. A new read pointer is achieved by adding the branch target to increment 1659 determined by dependency check module 1627 in next cycle, and it is decoded by the decoder in IRB 107 to locate the position of instructions to be issued.
As used herein, the embodiment in
As used herein, the instruction cache 206, dependency check module 1627, execution units 1621, 1623 and 1625, register file 1629, tracker 1607, and multiplexer 1609 are the same as the corresponding components in embodiment of
Specifically, the first address storage 505, branch source comparator 509, branch target comparator 511, next address comparator 1619, and branch source decoder 513 in control module 1617 are the same as corresponding components in embodiment of
In the embodiment herein, the output of branch source comparator 509 not only controls the enabling of branch source decoder 513, but also the enabling of end decoder 1717. The predictor 1709 generates the corresponding control signal according to the branch prediction information stored in the current track of track table 1710 and the BNY of read pointer 1631, and then sends the signal to the end decoder 1717, producing clear signal 1741, 1743 or 1745 for the corresponding instruction memory unit. The default value of the end decoder 1717's clear signals is ‘1’, indicating that it does not terminate the passing of control signal 1641, 1643 or 1645. Once a clear signal is ‘0’, the output of the AND gate or complex gate in the corresponding instruction memory unit is ‘0’ and the tri-state gate is enabled, terminating the passing of the corresponding control signal with value ‘1’ to the next instruction memory unit. The control signal is then sent to all instruction blocks through bus 1763 or 1765.
Multiplexers 1711 and 1713 in each control module pass the inputs correlated to the branch target (i.e., the output of branch target comparator 511 and the branch target BNY on bus 1633), enabling the branch target decoder 1715 in the control module corresponding to the branch target instruction block. The control signals for the instruction memory unit which the branch target instruction corresponds to are generated by the branch target decoder and control the AND gate or complex gate in that instruction memory unit in order to pass the value ‘1’ on bus 1763 or 1765 to this instruction memory unit. The corresponding instructions are read out subsequently. This way, IRB 107 can provide the branch instruction and its target instruction at the same time.
For the purpose of facilitating explanation, suppose in the following example that the second instruction in an instruction block is a branch instruction predicted as taken and its branch target is the zeroth instruction in this block. As stated before, the control signal 1645 generated by branch source decoder 513 is ‘1’ if the read pointer 1631 points to this instruction block and the BNY is ‘2’, and the instruction is read out from instruction memory unit 1615 to bus 1661. The predictor 1709 sends the BNY (i.e., ‘2’) of the branch instruction to the end decoder 1717 because the branch instruction is predicted as taken. The end decoder 1717 generates clear signals 1741, 1743, and 1745 with respective values ‘1’, ‘1’ and ‘0’ under the enabling of branch source comparator 509's output. Tri-state gate 1775 is then enabled, and the value ‘1’ of control signal 1645 is passed to bus 1763 (the value on bus 1765 is ‘0’).
Meanwhile, the branch target decoder 1715 is enabled by the result of multiplexer 1711, which is derived from the output of comparator 511; it takes the branch target BNY on bus 1633 as its input and outputs control signals 1751, 1753, and 1755 with respective values ‘1’, ‘0’ and ‘0’. The outputs of AND gates 1721 and 1723 in instruction memory unit 1711 are ‘1’ and ‘0’, and the outputs of complex gates 1731 and 1733 in instruction memory unit 1713 are respectively ‘0’ and ‘1’. Thus, instruction memory units 1613 and 1615 put the corresponding instructions on buses 1663 and 1665 under the control of AND gate 1721's output and complex gate 1733's output respectively.
As used herein, in the case that a branch among the 3 continuous instructions pointed to by tracker 1607's read pointer is predicted as taken, IRB 107 issues the instructions from where read pointer 1631 points up to the branch instruction, together with the branch target and its succeeding instructions, to execution units 1621, 1623, 1625 and dependency check module 1627. However, if the branch instruction is predicted as not taken, as illustrated in the embodiment of
In the embodiment stated above, the IRB may issue multiple instructions to the execution units in each cycle. Because these instructions might contain multiple data access instructions, a data read buffer (DRB), which is used to store the data needed by these data access instructions for the execution unit, may be added into the system. In this way, the pipeline stall time spent waiting for data may be decreased or even eliminated.
As used herein, another embodiment of processor system including DRB is illustrated in
The contents of stride memory 1836 and DRB 1818 which are addressed by the DRBA on bus 1815 are read out when a data read instruction is issued for the first time by IRB 1814 through bus 1805. The valid bit of the DRB entry is ‘0’ at this time, directing the execution unit to stall the pipeline and wait for data, whereas the status bit 1839 of the stride memory entry is ‘0’, directing the data engine 1930 to wait for the data address 1831 to be generated by execution unit 1806 (or computed by the data engine itself, such as obtaining the data address by adding the data base address in the data read instruction to the data offset). The data from cache 1822, which is indexed by address 1831 selected by multiplexer 1842 and sent through bus 1843, is filled into the corresponding entry in DRB 1818 through bus 1823, causing the valid bit of this entry and the status bit 1839 in the corresponding stride memory entry to be set to ‘1’. The execution unit reads out the data from the DRB through bus 1807 and completes the pipeline operations if the valid bit of the wanted DRB entry is ‘1’. The valid bit is then reset to ‘0’, and the data address on bus 1843 is filled into the corresponding entry's data address field 1835 in stride memory 1836.
If the data read instruction is issued again, the ‘0’ valid bit of the corresponding entry in DRB 1818 directs the pipeline in the execution unit to be stalled and wait for the data to be filled into DRB 1818. The ‘1’ status bit 1839 of the corresponding entry in stride memory 1836 directs the data engine to wait again for the data address on bus 1831 generated by the execution unit, based on which the data is read out from data cache 1822 and filled into the corresponding entry in DRB 1818, and its valid bit is then set to ‘1’. Thus, execution unit 1806 may read out the data needed from bus 1807 and proceed with execution as stated before. Then the ‘1’ valid bit and ‘1’ status bit control multiplexer 1838 in the data engine to select this time's data address 1831 and send it to adder 1832. The adder 1832 subtracts the old data address 1835 stored in stride memory 1836 from data address 1831, and the result (the difference, namely the data stride) is stored in stride field 1837 in the entry of stride memory 1836.
Furthermore, the result 1833 of adding the stride value in stride field 1837 to the current data address on bus 1831 selected by multiplexer 1838 is the possible data address when the data load instruction is executed the next time. The resulting address is sent to bus 1843 after being selected by multiplexer 1842 and is stored in data address field 1835 in the corresponding entry in stride memory 1836. The data engine reads out data from data cache 1822 in advance according to the data address on bus 1843, and then stores it in DRB 1818. The corresponding status bit is set to ‘2’ and the valid bit is set to ‘1’. It is worth noticing that the corresponding entry in stride memory 1836 stores the pre-calculated next data address and the data stride value while the corresponding entry in DRB 1818 stores the pre-fetched next data, and that both the entry in DRB 1818 and the entry in stride memory 1836 are pointed to by the DRBA in the entry of IRB 1814 which corresponds to the data load instruction.
As a result, data needed by the data load instruction is already stored in DRB 1818 once the instruction is executed again, which is pointed to by DRBA in the entry of IRB corresponding to the instruction, and could be sent to bus 1807 at a proper time. Thus, execution unit 1806 does not have to wait to fetch data from data cache. Because the value of status bit 1839 is ‘2’, the data engine 1836 again calculates the next data address for next time by adding data address 1835 to data stride 1837 to fetch data. It also updates the corresponding entries in stride memory 1836 and DRB 1818 and sets the valid bit to ‘1’.
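The behavior of the data engine over the first few executions of a load may be modeled as in the following sketch, which tracks one DRB/stride-memory entry and omits the valid-bit handshaking and the verification discussed below; all names are assumptions for illustration.

    class StrideEntry:
        def __init__(self):
            self.status = 0          # 0: no address yet, 1: one address seen, 2: striding
            self.last_addr = None    # data address field (e.g. field 1835)
            self.stride = None       # stride field (e.g. field 1837)
            self.prefetched = None   # stands in for the pre-fetched DRB data entry

        def on_execute(self, actual_addr, read_mem):
            """Called when the load executes with its real data address."""
            if self.status == 0:                     # first execution: remember the address
                self.status, self.last_addr = 1, actual_addr
            elif self.status == 1:                   # second execution: learn the stride
                self.stride = actual_addr - self.last_addr
                self.status, self.last_addr = 2, actual_addr
                self._prefetch(read_mem)
            else:                                    # steady state: keep prefetching
                self.last_addr = actual_addr
                self._prefetch(read_mem)
            return read_mem(actual_addr)

        def _prefetch(self, read_mem):
            next_addr = self.last_addr + self.stride # predicted next data address
            self.prefetched = (next_addr, read_mem(next_addr))

    mem = {a: a * 10 for a in range(0, 64, 4)}
    entry = StrideEntry()
    for addr in (0, 4, 8, 12):
        entry.on_execute(addr, mem.get)
    print(entry.prefetched)                          # -> (16, 160)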
The above methods and devices may improve the efficiency of data loading in a loop. However, because data is pre-fetched from a possible (predicted) data address, verification is necessary. The embodiment in
Embodiment with structure in
The stated replacement logic is essentially a storage pool, storing available addresses of DRB entries. An available DRBA is filled into field 1816 once a new data read instruction is filled into IRB. If the existing entry in IRB is replaced by another data read instruction, its corresponding DRBA is sent back to storage pool.
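A minimal sketch of such a storage pool, assuming a simple free list of DRB entry addresses (the class and method names are illustrative only):

    class DrbaPool:
        def __init__(self, num_entries):
            self.free = list(range(num_entries))     # all DRB entry addresses start free

        def allocate(self):
            return self.free.pop(0)                  # hand out a DRBA for a new load in the IRB

        def release(self, drba):
            self.free.append(drba)                   # load replaced in the IRB: recycle its DRBA

    pool = DrbaPool(4)
    a = pool.allocate()
    print(a, pool.free)                              # -> 0 [1, 2, 3]
    pool.release(a)
    print(pool.free)                                 # -> [1, 2, 3, 0]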
As used herein, another embodiment of processor system including DRB is illustrated in
Please refer to
In a general memory, the word-line is a straight line parallel or perpendicular to the address-line, so that it can read the content of a group of memory cells, for example a series of bits in an instruction. However, in this embodiment the word-line is placed in a diagonal or zigzag direction, which enables reading content from multiple memory cells in a specific sequence, such as reading multiple instructions in natural program order. The IRB shown in
IRB shown in
However, all instructions issued after the branch instruction will be interrupted once the execution unit makes the decision to take the branch. In this case, instructions 4 and 5 in the second and third columns will be interrupted and will no longer write to registers or memories. The execution unit will only complete all operations of instructions 2 and 3 in the pipeline. Target address ‘9’ will be transmitted to the control module as the CU address in order to enable block 1905 and validate word-line 1929 in it. Meanwhile, the start address of block 1905's next instruction block will be ‘12’, leading to the validation of next-instruction-block control line 1938 of IRB 1907. Block 1905 will respectively output instructions 9, 10 and 11 through 3 read ports under the enablement of word-line 1929. As formulated before, the token will be issued to token bus 1933 when word-line 1929 arrives at the END control line of block 1905. When token bus 1933 intersects with next instruction control line 1938, the token will be transmitted to word-line 1939 to enable the output of instruction 12. Instructions 9, 10, 11 and 12 will be issued to the execution units at the same time in this way.
IRB in
Furthermore, the structure of embodiment elaborated in
However, it does need another two addresses besides the CU address and the NX address once the branch is predicted as taken. One is the branch source address, hereafter referred to as the SO address, i.e. the address of the branch instruction itself. If the control line corresponding to the SO address intersects with the word-line on which the token is located, the token is passed to the succeeding instruction in sequential address order and issued to the token bus. The other is the branch target address, hereafter referred to as the TG address. If the control line corresponding to the TG address intersects with the token bus on which the token is located, it can receive the token from the token bus and pass it to the corresponding word-line, such as through the complex gate 1731 shown in
There are two different handling methods, depending on how the next instruction block of the branch target instruction block is addressed. In the first method, the IRB issues only the branch instruction, the branch target instruction, and its succeeding instructions until the instruction block terminates, rather than also issuing instructions in the next sequential instruction block of the block where the branch target is located; this avoids a conflict on the token bus when the TG control line and the NX control line are valid at the same time. The NX control line of the IRB is completely invalid because the NX address is not transmitted to the control module, as formulated in the former embodiment in
Word-line 1925 is valid from the IRB's left boundary because the CU address is ‘5’. Meanwhile, SO control line 1942, TG control line 1946 and NX control line 1948 are all valid according to their corresponding addresses. The enablement of the corresponding read ports by word-line 1925 leads to the zero column outputting instruction 5 and the first column outputting instruction 6 (namely the branch instruction denoted by a circle in
Furthermore, the structure of IRB in
For the purpose of adapting the processor system to use the IRB to issue instructions, some minor changes can be made to the structure in this embodiment in terms of the technical scheme stated in this invention. The proposed processor system is referred to as a lane processor for brevity in the following specification. Each lane consists of an IRB, an execution unit, and a dependency check module between adjoining lanes, which resembles the column in the previous embodiment. This is different from
In order to change the direction in which the word-line passes, a token multiplexer can be added at the read port of the IRB to select from three token sources as per the technical scheme of this invention. The stated token sources comprise a token coming from the control module corresponding to the current lane (for token insertion), a token coming from the left lane's read port at the same position as the current lane (which causes the current lane and its left lane to output the same instruction), and a token coming from the left lane's read port one position above (which causes the current lane to output the instruction following the one at its left lane's read port). Accordingly, these three token sources respectively correspond to the MIMD flow processing mode, the SIMD flow processing mode and the ILP mode.
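The token source selection can be pictured with the following illustrative Python sketch, which is not from the original disclosure; the mode names follow the text, while the function and argument names are assumptions.

def select_token(mode, from_control_module, from_left_same_pos, from_left_upper_pos):
    """Return the token driven onto this lane's read port for the given mode."""
    if mode == "MIMD":    # each lane follows its own control module (token insertion)
        return from_control_module
    if mode == "SIMD":    # output the same instruction as the left lane
        return from_left_same_pos
    if mode == "ILP":     # output the instruction following the left lane's
        return from_left_upper_pos
    raise ValueError(mode)

# In ILP mode a lane issues the next instruction after its left neighbour's.
print(select_token("ILP", None, "instr_4", "instr_5"))   # -> instr_5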
Please refer to
Please refer to
When processing MIMD flow, the token multiplexer is configured for column correlation (namely, it selects the token from the control module), the dependency check module is configured as not in use, and the inter-lane bus is configured as disconnected. The IRBs of the four lanes store different programs, and the control module of each lane provides the corresponding addresses to its lane under the control of an independent tracker, making it possible for the four lanes to issue and execute different instructions in parallel. The register file of each lane loads data from or writes data to the data cache through the corresponding load/store unit. Each lane can execute a different program on a different data source at the same time, because both the inter-lane bus and the dependency check modules are disabled in this mode; that is, the programs and data of different lanes are not correlated with each other, thereby implementing the functionality of an MIMD flow processor.
As used herein, an embodiment of the lane processor running in SIMD flow mode refers to
An embodiment of the lane processor running in ILP mode as stated herein is illustrated in
The following instruction or the branch target is issued together with the branch instruction itself, based on the branch prediction information, in the embodiment of
An IRB that implements branch processing without performance loss, as stated herein, is illustrated in
The processing method is the same as that of
As used herein, the CU address is ‘3’ (corresponding to the branch instruction, denoted by a circle in
Meanwhile, the token on the TG token bus is issued to word-line 2012 where TG token bus 2030 intersects with TG control line 2042, making the first and second columns respectively output instructions 2 and 3. Word-line 2012 intersects with the corresponding END control line when it reaches the bottom boundary of block 2001, and the token on it is issued to NXT token bus 2053 (denoted by a solid arrow in
The back-end pipeline proceeds with execution using the output of front-end pipeline P and discards the result of front-end pipeline Q when the branch instruction is not taken, while it proceeds with the output of front-end pipeline Q and discards the result of front-end pipeline P once the branch is taken (the branch instruction itself has already been executed, so it needs no further processing in the back-end pipeline).
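A minimal sketch of this selection between the two front-end pipelines, with hypothetical names, is:

def select_backend_input(branch_taken, result_p, result_q):
    """Forward one front-end pipeline's result to the back-end; discard the other."""
    return result_q if branch_taken else result_p

print(select_backend_input(False, "P-result", "Q-result"))   # branch not taken: keep P
print(select_backend_input(True, "P-result", "Q-result"))    # branch taken: keep Q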
As used herein,
As used herein, the structure and functionality of track table 210 and branch target selection module 1649 is the same as
As used herein, front-end pipeline P corresponds to the continuous instructions starting from the current instruction. Front-end pipeline Q corresponds to the branch target instruction if the first of these instructions is a branch; otherwise front-end pipeline Q does not work. It is worth noting that front-end pipeline Q is used only when the first instruction is a branch, because the maximum issue count of IRB 2201 is ‘2’ in this embodiment. In other circumstances where the maximum issue count is ‘4’, for example, front-end pipeline Q is used if there is a branch instruction among the first three instructions. The detailed operation procedure resembles the embodiment described hereto. The following operation is the same as
Registers 1651 and 1653 of tracker 2207 respectively store the BNX and BNY of the current instruction address (namely the CU address). The branch target address (namely the TG address), which also comprises a BNX and a BNY, is stored in register 2252. After selection by multiplexer 2213, the BNX of this TG address is transferred to track table 210 through bus 2214 for addressing, to find the corresponding row and read out its NXT address, which is sent to IRB 2201 through bus 2232. Track read buffer 2210 can output the NXS address directly and send it to IRB 2201 through bus 2231. The control module of IRB 2201 is responsible for checking whether the instruction blocks corresponding to the TG address, the NXS address, and the NXT address exist; if not, multiplexer 2209 selects the address of the instruction block that has not yet been stored and sends it to instruction cache 206, which reads out the required instruction block and fills it into IRB 2201. Thus, IRB 2201 can output the succeeding instructions of the branch instruction and the branch target at the same time while also outputting the branch itself, as stated in
As used herein, dependency check module 2227 makes a judgment between the two instructions sent to front-end pipeline P and outputs control signal 2226, which determines the increment of the CU address at the next clock cycle. If there is a branch among the instructions sent to front-end pipeline P, dependency check module 2229 judges the correlation between the first branch and the instructions before it (the first instruction in this example) as well as the correlation among the instructions sent to front-end pipeline Q, and outputs control signal 2228 to multiplexer 2211, which determines the increment of the TG address at the next clock cycle.
When the branch is not taken in front-end pipeline P, multiplexer 2211 chooses the output of dependency check module 2227 on bus 2226 as the control signal of multiplexer 2256 in order to select the correct CU address increment and send it to adder 1655. The possible CU address increment is ‘1’ or ‘2’: the increment is ‘2’ when the two instructions in front-end pipeline P do not depend on each other, and ‘1’ otherwise. The CU address chosen by multiplexer 2213 is sent to adder 1655 to compute the CU address of the next clock cycle, which is then written to register 1653 to update the BNY of the CU address after being chosen by multiplexer 2658. The BNX of the NXS address (i.e. the next instruction block's BNX) on bus 2231 is chosen by multiplexer 2258 and sent to register 1651. As illustrated before, if adder 1655 outputs a carry bit, the carry bit enables register 1651, updating its value to the next instruction block's BNX; otherwise register 1651 holds its value. Thus, tracker 2207 generates the new CU address. Besides, the output of adder 1655 is also sent to branch target select module 1649 in order to read out the address of the first branch target starting from the new CU address, i.e. the new TG address, and the above operations are then repeated.
When the branch is taken in front-end pipeline P, multiplexer 2211 chooses the output of dependency check module 2229 on bus 2228 as the control signal of multiplexer 2256, so that the correct TG address increment is provided to adder 1655. The increment is ‘1’ if the branch instruction in front-end pipeline P does not depend on the instructions in front-end pipeline Q; otherwise the increment is ‘0’. The BNY of the TG address output by register 2252 is selected by multiplexer 2213 and sent to adder 1655. The new BNY of the next clock cycle's TG address is then computed and written to register 1653, so that the BNY of the CU address is updated to the stated TG address of the next cycle, which is used as the current address in the next cycle to provide instructions from there. Multiplexer 2258 operates based on whether adder 1655 outputs a carry bit: the BNX of the NXT address on bus 2232 is sent to register 1651 if there is a carry output, and otherwise the BNX of the TG address in register 2252 is sent. The write enable of register 1651 is valid when the branch is taken, and the output of multiplexer 2258 is written to register 1651, updating the BNX of the CU address; the above operations are then repeated.
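The address update performed by the tracker in the two preceding paragraphs can be sketched as follows; this is an illustrative simplification, and the assumed block size of 8 instructions and all function and argument names are not from the original disclosure.

BLOCK_SIZE = 8   # assumed number of instructions per IRB block

def update_tracker(branch_taken, cu, tg, inc_cu, inc_tg, nxs_bnx, nxt_bnx):
    """Return the (BNX, BNY) current address for the next clock cycle."""
    base_bnx, base_bny = tg if branch_taken else cu   # start from TG or CU address
    inc = inc_tg if branch_taken else inc_cu          # increment from the dependency check
    total = base_bny + inc
    carry = total >= BLOCK_SIZE
    bny = total % BLOCK_SIZE
    # On a carry the address crosses into the next block (NXT if taken, NXS otherwise);
    # without a carry the first address of the chosen way is kept.
    bnx = (nxt_bnx if branch_taken else nxs_bnx) if carry else base_bnx
    return bnx, bny

# Fall-through with no dependency between the two issued instructions (increment 2):
print(update_tracker(False, cu=(68, 6), tg=(23, 0),
                     inc_cu=2, inc_tg=1, nxs_bnx=69, nxt_bnx=23))   # -> (69, 0)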
Another embodiment of branch processing without performance loss, in a system containing an IRB, is illustrated in
The structures of trackers 2307 and 2309 are exactly the same; tracker 2307 corresponds to front-end pipeline P and dependency check module 2227, and tracker 2309 corresponds to front-end pipeline Q and dependency check module 2229. These two trackers consist of registers 1651, 1653, multiplexers 1656, 1658, 2358 and adder 1655. One of the front-end pipelines P and Q provides the succeeding instructions (two instructions as used herein) starting from the NX address, and if there is a branch instruction, the other front-end pipeline provides the succeeding instructions starting from the branch target address (the TG address). Once the branch is taken, the roles of these two front-end pipelines are exchanged. For ease of explanation, the tracker and front-end pipeline corresponding to the CU address are hereafter referred to as the CU tracker and the CU front-end pipeline, while those corresponding to the TG address are hereafter referred to as the TG tracker and the TG front-end pipeline.
As used herein, the TG front-end pipeline does not work, while the CU front-end pipeline performs the same as front-end pipeline P in the embodiment of
Controller 2305 is responsible for selecting the execution result between front-end pipelines P and Q, and toggles the select signal every time a branch is taken. Specifically, controller 2305 controls the multiplexers 2331 and 2333 between the front-end and back-end pipelines based on the branch decision of the CU front-end pipeline: the execution result of the CU front-end pipeline is issued to the back-end pipeline for further processing if the branch is not taken, and otherwise the execution result of the TG front-end pipeline is issued. Besides, each branch-taken signal generated by the CU front-end pipeline exchanges the two front-end pipelines, i.e. the original TG front-end pipeline becomes the new CU front-end pipeline and the original CU front-end pipeline becomes the new TG front-end pipeline. Meanwhile, the original TG tracker becomes the new CU tracker and the original CU tracker becomes the new TG tracker. Controller 2305 also changes its status once the branch is taken, and controls multiplexers 2331 and 2333 between the front-end and back-end pipelines based on the result of the branch instruction in the new CU front-end pipeline.
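A minimal sketch of this role swapping, assuming a simple two-state software controller stands in for controller 2305, is shown below; the class and method names are hypothetical.

class PipelineController:
    """Stand-in for controller 2305: swaps the CU and TG roles on a taken branch."""
    def __init__(self):
        self.cu, self.tg = "P", "Q"     # front-end pipeline P starts out as the CU pipeline

    def on_branch(self, taken):
        """Select which front-end result the back-end keeps, then swap roles if taken."""
        selected = self.tg if taken else self.cu
        if taken:
            self.cu, self.tg = self.tg, self.cu
        return selected

ctrl = PipelineController()
print(ctrl.on_branch(taken=True))    # back-end keeps Q; Q becomes the new CU pipeline
print(ctrl.on_branch(taken=False))   # back-end keeps Q, which is still the CU pipeline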
Suppose that the current CU front-end pipeline is P, and that registers 1651 and 1653 of CU tracker 2307 store the BNX and BNY of the CU address. Multiplexer 2305 chooses this BNY and sends it to branch target select module 1649 in order to read out the address of the first branch starting from the CU address, i.e. the TG address, which is then sent to IRB 2201 and TG tracker 2309, so that registers 1651 and 1653 of TG tracker 2309 respectively store the BNX and BNY of the TG address. Meanwhile, the BNX of the TG address is sent to track table 210 to read out the NXT address, and the NXS address can be read out from track read buffer 2210. Thus, IRB 2201 can output the succeeding instructions of the branch and of its target in one clock cycle when it receives the above correlated addresses. Specifically, IRB 2201 outputs the branch and its succeeding instructions to front-end pipeline P and dependency check module 2227 under the control of the address sent by CU tracker 2307, and outputs the branch target and its succeeding instructions to front-end pipeline Q and dependency check module 2229 under the control of the address sent by TG tracker 2309. Because these two trackers respectively store the CU address (tracker 2307 in this case) and the TG address (tracker 2309 in this case), they can decide the increments of the CU address and the TG address based on the control signals 2226 and 2228 sent by the corresponding dependency check modules before the branch decision is definite, and they output the succeeding instructions corresponding to the updated CU address and TG address to front-end pipelines P and Q for parallel execution until the branch decision is made.
If the branch is taken in front-end pipeline P, controller 2305 selects this result and controls multiplexers 2330 and 2331 between the front-end and back-end pipelines to send the result of front-end pipeline Q to the back-end pipeline for further execution. Controller 2305 then toggles its state and selects the result of the branch instruction in front-end pipeline Q as its output until the next branch is taken. Meanwhile, front-end pipeline Q becomes the CU front-end pipeline, front-end pipeline P becomes the TG front-end pipeline, tracker 2309 becomes the CU tracker, and tracker 2307 becomes the TG tracker (TG tracker 2307 and TG front-end pipeline P do not work when there is no branch instruction in CU front-end pipeline Q).
If the branch in CU front-end pipeline Q is not taken, controller 2305 controls multiplexers 2330 and 2331 between the front-end and back-end pipelines to send the result of front-end pipeline Q to the back-end pipeline for further execution. Because the branch is not taken, controller 2305 does not toggle its state and still selects the result of the branch instruction in front-end pipeline Q as its output. Front-end pipeline Q remains the CU front-end pipeline, front-end pipeline P remains the TG front-end pipeline, tracker 2309 remains the CU tracker, tracker 2307 remains the TG tracker, and execution then continues.
In subsequent operation, regardless of whether the branch in the CU front-end pipeline is taken, the system operates in the same way as described above, so the description is not repeated here.
In processors containing multiple front-end pipelines, the IRB of each front-end pipeline has its own dedicated read ports and bit-lines. The same functionality implemented by the token bus, such as in the embodiments in
Another structure of IRB without token bus is illustrated in embodiment of
An instruction segment being executed is illustrated in
The execution of these 4 instructions starts from instruction 3, and there are 4 possible program execution paths in this cycle based on the different branch decisions of the first 3 branch instructions. The result of the 4th branch instruction influences the next clock cycle and will be discussed later. The execution path will be branch instruction 3, branch target 0 and its succeeding instructions 1 and 2 if branch instruction 3 is taken, i.e. instructions 3, 0, 1, 2; instruction 3's branch target instructions 0, 1, 2 are hereafter referred to as the O way for ease of description. In a similar way, the execution path will be instructions 3, 4, 7, 8 if branch instruction 3 is not taken but branch instruction 4 is taken, and instruction 4's branch target instructions 7, 8 are hereafter referred to as the P way. By the same reasoning, the execution path will be instructions 3, 4, 5, 1 if branch instructions 3 and 4 are not taken but branch instruction 5 is taken, and instruction 5's branch target instruction 1 is hereafter referred to as the Q way. Finally, the execution path will be instructions 3, 4, 5, 6, hereafter referred to as the N way, if all three of these branch instructions are not taken. The succeeding instructions 7, 8, 9, 10 will be executed in the next cycle if instruction 6 is not taken, which is hereafter also referred to as the N way, and otherwise the succeeding instructions 2, 3, 4, 5, hereafter referred to as the J way, will be executed. The N way and J way are different execution paths in the next clock cycle, but their difference does not affect the instructions executed in the current cycle. As long as sufficient execution units and corresponding IRB read ports and bit-lines are provided for each possible execution path during one cycle, all instructions that may possibly be executed can be issued to multiple front-end pipelines at the same time; the branch decisions then select only part of these instructions to be sent to the back-end pipelines for further execution.
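The way selection described above can be sketched as follows; the function is an illustrative simplification that uses the instruction numbers of this example, with earlier branches taking priority.

def choose_way(b3_taken, b4_taken, b5_taken):
    """The earliest taken branch decides the path; later decisions are ignored."""
    if b3_taken:
        return "O", [3, 0, 1, 2]
    if b4_taken:
        return "P", [3, 4, 7, 8]
    if b5_taken:
        return "Q", [3, 4, 5, 1]
    return "N", [3, 4, 5, 6]

print(choose_way(False, True, True))   # -> ('P', [3, 4, 7, 8]); branch 4 has priority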
Please refer to
The track table 2501 in
Tracker 2504 differs from the previous trackers in that it can provide the current instruction address and, at the same time, all branch targets of the branch instructions within the 4 instructions starting from the current instruction. Specifically, registers 2525 and 2526 respectively store the current instruction's first address BNX and second address BNY; registers 2521, 2522, 2523 and 2524 store the branch target addresses (BNX and BNY) 2511, 2512, 2513, 2514 of the current instruction segment (4 instructions in this embodiment) output by the track table. In this example, the 4 BNXs are all ‘68’, and the four BNYs are respectively ‘0’, ‘7’, ‘1’, ‘2’. The outputs of register 2525 (BNX) and register 2526 (BNY) are joined together onto bus 2520 (in the figure, a circle and arrow represent two buses joined together). The outputs of registers 2521, 2522 and 2523 are sent to all of the first address comparators 509 and the current second address decoders 513 in the IRBs, and the enabled decoders 513 drive multiple zigzag word lines.
The outputs of bus 2520 and registers 2521, 2522, 2523, 2524 are selected by multiplexer 2585, which is controlled by the branch decision. The first address (BNX) portion 2535 of multiplexer 2585's output is sent to one input of multiplexer 2529; the second address (BNY) portion 2536 is sent to adder 2528 and added to the increment amount provided by multiplexer 2527, which is controlled by the dependency checker's detection result 2565. The sum of the adder is used as the new current second address BNY and stored into register 2526. The adder's carry output signal 2538 controls multiplexer 2529: when there is no carry out, multiplexer 2529 selects the current first address 2535; when there is a carry out, multiplexer 2529 selects the Next block's first address 2539. The output of multiplexer 2529 is the new current first address BNX and is stored into register 2525.
Multiplexer 2529's output and adder 2528's output are also joined to become read pointer 2510 to control the reading of the track table 2501. Read pointer 2510 (the current address of next cycle) and track table 2501's outputs 2511, 2512, 2513 (branch targets of instructions 1, 2, 3 in next cycle) are sent to End track point memory 2502 to read out the Next block address of each address; and are also sent to column address generator 2503. Column address generator generates the corresponding column address. The current address' Next block address is stored in register 2530, and its corresponding column address is stored in register 2540. The Next block addresses of 2511, 2512, and 2513, which are the branch target address of the first, second, and third instructions of the current instruction segment, are stored into registers 2531, 2532, and 2533, and their corresponding column addresses are stored in registers 2541, 2542 and 2543.
The column address generator generates the corresponding column address based on the Current address or branch target address input in the following way. Define the number of rows in every IRB block (the number of storage entries) as n; the block offset address (second address) as BNY, with values 0 to n−1, the top row being row 0; and let there be m columns in total, with BNZ being the column address, taking values 0 to m−1, the leftmost column being column 0. The column address can then be calculated by the formula BNZ = n − BNY, where BNZ is invalid if it is larger than or equal to m. For example, when n=8, m=4, w=4 and BNY=6, BNZ = 8 − 6 = 2; since BNZ < 4, BNZ is valid. The meaning is that address ‘6’ is decoded and drives a zigzag word line: the instruction corresponding to address ‘6’ is issued from column ‘0’ and the instruction corresponding to address ‘7’ is issued from column ‘1’, at which point the zigzag word line terminates as it reaches the IRB block's lower boundary. At the same time, decoding of the Next address points to the first instruction of the Next instruction block; the only thing that needs to be known is from which column that instruction should be issued to fully utilize processor resources and avoid collision with instructions issued by the current IRB. In this case, the column decoder 2411 in the Next IRB block decodes column address BNZ=2 and drives the zigzag bus starting in the second column, so the first instruction in that IRB block (BNY=0) is issued from the second column and the second instruction (BNY=1) is issued from the third column. If BNZ is larger than or equal to m, the generator generates an invalid signal which controls all column decoders 2411 so that they do not drive any zigzag word lines, because under those circumstances the current IRB block already issues instructions to all columns at the same time. The result of the above calculation can also be placed in a reference table to replace the calculation; taking the same conditions as an example, when BNY=0 to 4, BNZ is invalid, and when BNY=5, 6, 7, BNZ=3, 2, 1 respectively. The said method is valid when n is greater than or equal to m; operation under other conditions can be deduced by analogy.
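A minimal sketch of this column address calculation, using the example values n=8 and m=4, is shown below; the function name is illustrative.

def column_address(bny, n=8, m=4):
    """Column from which the Next block's first instruction should issue."""
    bnz = n - bny
    return bnz if bnz < m else None    # None stands for 'invalid': drive no word line

# BNY 0..4 -> invalid (the current block already fills every column);
# BNY 5, 6, 7 -> columns 3, 2, 1, matching the reference table in the text.
print([column_address(bny) for bny in range(8)])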
IRBs 2550, 2551, 2552, and 2553 are 4 groups of IRBs with a structure like that of
There are in total 10 front-end pipelines, because there are common paths that can be shared by the said 4 execution paths determined by the branch decision. For example, all 4 paths need to execute the first instruction in the instruction segment (instruction 3 in this example); therefore, the first instruction in the segment needs only 1 front-end pipeline, not 4. The second, third, and fourth instructions in the segment respectively need 2, 3, and 4 front-end pipelines. The 4 instructions processed in the same cycle are in the same instruction execution slot. For ease of explanation, the instruction execution slots that the 4 sequential instructions issued in the same clock cycle would occupy are respectively named slots A, B, C, and D in the order of instruction execution. Slot A has only one choice, instruction 3 in the example in
Because there may be multiple instructions issued in an instruction slot, for ease of explanation a way is defined as one of the possible program execution paths due to branch instructions. First, define the N way as the execution path in which the instructions in slots A, B, and C are either non-branch instructions or branch instructions that do not take their branches; there are 4 such instructions in this cycle. Presuming the instruction in slot A takes its branch, all the instructions needed from that point on are named the O way; in this cycle there are 3 such instructions. Presuming the slot A instruction does not branch but the slot B instruction branches, the instructions needed thereafter are named the P way; there are two such instructions in this cycle. If the instructions in slots A and B do not branch but the instruction in slot C does branch, the instruction needed thereafter is named the Q way; in this cycle there is one such instruction. If the instructions in slots A, B, and C do not branch, the instructions needed when the slot D instruction branches are named the J way; in this cycle there are 0 such instructions. Please note that the track table outputs 2511, 2512, 2513, 2514 are the track table entries corresponding to the slot A, B, C, D instructions in the N way; their content is each instruction's branch target, which is also the starting point of the O, P, Q, and J ways.
The third address BNZ can be marked with the letter of the instruction slot, to distinguish it from the numbers used for the first and second addresses. In addition, the algorithm for generating the said third address should be revised slightly to meet the definition of an instruction slot in this embodiment. The third address obtained by the original formula is based on the number of the column from which the instruction is issued. If the instruction pointed to by the Current address is not issued in column zero, then the calculated third address should be compensated. The formula is BNZ = n − BNY + Z, where Z is the column number (column address) from which the Current address issues. Here, for ease of calculation, define the column number of slot A as ‘0’ and the column numbers of slots B, C, D as ‘1’, ‘2’, ‘3’; in this embodiment, however, the third address is marked with a letter. Each input of the column address generator 2503 occupies a specific issue slot, so the column address Z can be determined from the specific input. For example, the address on input 2510 is the Current instruction address of the Next cycle; it belongs to the N Way and issues from slot A, so the Z for this input is ‘0’. The address on input 2511 is the branch target of the slot A instruction in the Next cycle; it belongs to the O Way and issues from slot B, so the Z for this input is ‘1’. By the same reasoning, the address on input 2512 belongs to the P Way and issues from slot C, so the Z for this input is ‘2’.
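The revised formula with the slot compensation Z can likewise be sketched as follows (illustrative only; the slot-to-Z mapping follows the text):

def column_address_with_slot(bny, slot, n=8, m=4):
    """BNZ = n - BNY + Z, where Z is the issue slot of the corresponding input."""
    z = {"A": 0, "B": 1, "C": 2, "D": 3}[slot]
    bnz = n - bny + z
    return bnz if bnz < m else None    # invalid once it reaches or exceeds m

# The P Way target of this example enters at slot C, so Z = 2 and BNZ = 8 - 7 + 2 = 3,
# i.e. the Next block's first instruction issues from slot D.
print(column_address_with_slot(bny=7, slot="C"))   # -> 3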
Dependency checkers 2560˜2563 each have a structure similar to the dependency checker in
The IRB in
When executing the instruction segment in
After the clock signal updates the tracker registers and the Next block address registers, the value ‘68.3’ on bus 2520, which is the joined output of registers 2525 and 2526, is sent to N Way IRB 2550 in the current clock cycle. The value is matched by the decoder's first address comparator and decoded by the second address decoder, which drives zigzag word line 2555 to issue instructions 3, 4, 5, 6 into slots A, B, C, D along the N Way. The Next block address of the N Way, that is, register 2530's output ‘23’ together with register 2540's output ‘x’, is invalid; therefore, after decoding, column decoder 2411 does not drive any word line. At the same time, register 2521's output ‘68.0’ is sent to the O Way's IRB 2551; after being matched and decoded by the decoder, it drives zigzag word line 2556 and issues instructions 0, 1, 2 along the O Way into slots B, C, and D. The Next block address of the O Way, register 2531's output ‘23’ together with register 2541's output ‘x’, is invalid; therefore, no word lines are driven after decoding by the O Way decoder. At the same time, register 2522's output ‘68.7’ is sent to P Way IRB 2552; after being matched and decoded by the decoder, it drives zigzag word line 2557. After issuing instruction 7 into slot C of the P Way, the word line terminates when it reaches the IRB block's lower boundary. The Next block address of the P Way, register 2532's output ‘23’ together with register 2542's output, drives word line 2558 after decoding in the P Way decoder, issuing instruction 8 from row ‘0’ of the Next IRB block into slot D of the P Way. At the same time, register 2523's output ‘68.1’ is sent to the Q Way's IRB 2553; after matching and decoding, the decoder drives word line 2559 and issues instruction ‘1’ into slot D of the Q Way. The Q Way has only one issue slot, D, and has no possibility of crossing the IRB block boundary; therefore, it does not accept a Next block address or a column address.
Each branch decision is independently made in the front-end pipelines of slots A, B, C, D for instructions 3, 4, 5, 6 in N way. The branch decision outputted by a front-end pipeline is ‘taken’ only when the instruction being processed by the front-end pipeline is a branch instruction, and the branch is decided as taken and the instruction does not have dependence. Under other circumstances the branch decision would be ‘not taken’.
The N way branch decision results of 4 slots are sent to priority encoder 2596 and encoded as way priority code 2598. Priority encoder 2596 sets the priority of branch decisions based on the address order of their corresponding instructions. If slot A way N branch decision is ‘taken’, then in this case the way priority code 2598 outputted by the encoder means to select way O, no matter the branch decision result of the instructions of N way of slot B, C and D. If the instruction in slot A way N is determined as ‘not taken’ and slot B way N branch decision is ‘taken’, then the way priority code 2598 outputted by the encoder means to select way P, no matter the branch decision result of the instructions of N way of slot C and D. If instruction in slot A, B way N is determined as ‘not taken’ and the instruction in slot C way N is determined as ‘taken’, the way priority code 2598 outputted by the encoder means to select way Q, no matter the branch decision result of the instructions of N way of slot D. If the instructions in N way in slots A, B, and C are determined as ‘not taken’ and the instruction in N way D slot is determined as ‘taken’, then the way priority code 2598 outputted by the encoder means to select way E, which will be explained later. Lastly when N way in slots A, B, C, and D are all determined as ‘not taken’, then the way priority code 2598 outputted by the encoder means to select way N.
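The priority encoding described above can be sketched as follows; this is an illustrative simplification in which earlier slots take priority, exactly as in the text.

def way_priority(taken_a, taken_b, taken_c, taken_d):
    """Earlier slots have priority; a slot D branch only affects the next cycle."""
    if taken_a:
        return "O"
    if taken_b:
        return "P"
    if taken_c:
        return "Q"
    if taken_d:
        return "E"
    return "N"

print(way_priority(False, False, True, True))   # -> 'Q'; the slot C decision wins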
Way priority code 2598 controls multiplexers 2581, 2582, 2583, 2584, 2585 and 2586. Not every multiplexer needs to respond to every value of the way priority code; for example, way priority code E does not control multiplexer 2586. First, consider the selection of front-end pipeline outputs to be provided to the rear-end pipelines. Multiplexers 2581, 2582 and 2583 select the operation control signals decoded by the front-end pipelines and the data from the DRBs. As shown in the embodiments in
If way priority code means select O way, then multiplexers 2581, 2582, and 2583 select their O inputs, that is to select the outputs of the 3 O way front-end pipelines to rear-end pipeline 2591, 2592 and 2593 to continue processing. The output of front-end pipeline 2470 is sent to rear-end pipeline 2590 which is not affected by branch decision. Here the instruction of front-end pipeline of Slot A N way is instruction 3. The instructions in the front-end pipeline of O way in slots B, C, and D are the instructions that presume instruction 3 is a ‘taken’ branch instruction, that is, the branch target 0 of branch instruction 3, and the two instructions following the target (instructions 1 and 2). So instructions 3, 0, 1, and 2 are sent to rear-end pipeline 2590, 2591, 2592 and 2593 to process.
By the same reason, when way priority code means P way, multiplexers 2581, 2582, and 2583 all select their P inputs, that is, the output of the front-end pipeline of N way slots A and B and the output of the front-end pipeline of P way slots C and D are used as the output of multiplexers and provided to rear-end pipeline to continue processing. So instructions 3, 4, 7, and 8 are sent to rear-end pipeline 2590, 2591, 2592 and 2593 to be processed. By the same reason when way priority code means Q way, multiplexers 2581, 2582, and 2583 all select their Q inputs, front-end pipeline of N way slots A, B, and C outputs, the output of the front-end pipeline of Q way slot D is the multiplexer output provided to rear-end pipeline to continue processing. So instructions 3, 4, 1, and 2 are sent to rear-end pipelines 2590, 2591, 2592 and 2593 to be processed. By the same reason, when way priority code means N way, then multiplexers 2581, 2582, and 2583 all select their N inputs, and N way slots A, B, C, and D front-end pipeline outputs are provided to rear-end pipeline to continue processing. So instructions 3, 4, 5, and 6 are sent to rear-end pipeline 2590, 2591, 2592 and 2593 to be processed. When way priority code means E way, multiplexers 2581, 2582, and 2583 all select E input, and so output instructions 3, 4, 5, and 6. The selection of E way and N way are the same in the current clock cycle, the difference is only significant in the next cycle.
Way priority code 2598 also decides which segment of instructions is to be executed in the next cycle. Way priority code 2598, which is generated from the branch decisions, controls multiplexers 2584, 2585 and 2586 to decide the program's course. Tracker 2504 calculates the initial instruction address of the instruction segment to be executed in the next clock cycle based on the output of multiplexer 2584, which selects the address increment amount of a certain way; the output of multiplexer 2585, which selects the initial address of the same way in the current cycle; and the output of multiplexer 2586, which selects the Next block address of the same way in the current cycle. Specifically, adder 2528 adds the BNY address 2536 of the initial address of this way in the current cycle, selected by multiplexer 2585 (from registers 2525, 2526, or from registers 2521, 2522, 2523, 2524), to the address increment amount selected by multiplexer 2527 under the control of the same-way output selected by multiplexer 2584 (from dependency checker 2560, 2561, 2562 or 2563); the sum is the second address BNY of the initial instruction of the next cycle. The carry-out output 2538 of adder 2528 controls multiplexer 2529: if the carry out is ‘0’, it selects the first address BNX 2535 of this way's initial address in the current cycle; if the carry out is ‘1’, it selects the output 2539 of multiplexer 2586, which is the first address BNX of this way's Next block address in the current cycle (from registers 2530, 2531, 2532 or 2533). The output of multiplexer 2529 is the first address BNX of the next cycle. The BNX and BNY join together to become read pointer 2510, which points to track table 2501 and reads out the entry 2511 being pointed to, as well as the next 3 entries 2512, 2513, and 2514, in the same manner as described before. Read pointer 2510 is also sent to End address memory 2502 and column address generator 2503 to obtain the corresponding Next block addresses and column addresses. Therefore, when the clock signal arrives, the BNX and BNY on read pointer 2510 are respectively stored into registers 2525 and 2526 as the current address; track table outputs 2511˜2514 are respectively latched into registers 2521˜2524 as branch target addresses; End address memory 2502's outputs are latched into registers 2530˜2533 as the Next block addresses; and the outputs of column address generator 2503 are latched into registers 2540˜2543 as the corresponding column addresses. The processor then starts a new cycle of operation, as said before.
When way priority code 2598 is N way, multiplexer 2584 selects instruction dependency check unit 2560's output as increment control signal 2597. When way priority code 2598 is O way, P way or Q way, multiplexer 2584 correspondingly selects the output of instruction dependency check unit 2561, 2562, or 2563 as increment control signal 2597. When way priority code 2598 is E way, multiplexer 2584 always selects ‘0’ as increment control signal 2597; this control signal selects increment value ‘0’ at multiplexer 2527.
Here are a few actual examples. Presume the branch decision is that N way instructions 3, 4, 5, and 6 do not branch, and dependency check unit 2560 judges that there is no dependence between instructions 3, 4, 5, 6. Then, branch priority encoder 2596 outputs the way priority code as N way, and multiplexers 2581, 2582, 2583 select the N way's front-end pipeline outputs to send to rear-end pipelines 2591, 2592, 2593 for execution. Therefore, instructions 3, 4, 5, and 6 execute in the rear-end pipelines, and the execution results are written back into the shared register file 2595. Multiplexer 2584 selects the output ‘4’ of dependency checker 2560 as the increment amount and sends it through 2597 to adder 2528, which sums it with register 2526's content ‘3’ selected by multiplexer 2585. The sum is ‘7’ and the carry out is ‘0’; therefore, multiplexer 2529 selects register 2525's content ‘68’ through multiplexer 2585. The read pointer is therefore ‘68.7’, and the next cycle executes instructions 7, 8, 9, 10 (8, 9, and 10 are in the Next IRB block) in the N way. The other ways O, P, and Q start execution from the branch targets recorded in the track table entries of instructions 7, 8, and 9. If an instruction is not a branch, the IRB of the corresponding way will not issue instructions, and that way will also not be selected by the final branch decision.
Presume the branch decision has concluded that instructions 3, 4, and 5 do not take their branches but instruction 6 does take its branch, and dependency check unit 2560 concludes there is no correlation between the four instructions. Here, branch priority encoder 2596 outputs the way priority code as E way, and multiplexers 2581, 2582, 2583 select the N way's front-end pipeline outputs and send them to rear-end pipelines 2591, 2592, 2593 for execution; therefore, instructions 3, 4, 5, 6 execute in the rear-end pipelines. Multiplexer 2584 selects the J way's increment control ‘0’ and sends it to adder 2528 through 2597. The adder sums the said increment with the content ‘2’ of register 2524 selected by multiplexer 2585; the sum is ‘2’ and the carry out is ‘0’. Therefore, multiplexer 2529 selects the first address ‘68’ of register 2524, which is selected by multiplexer 2585. The read pointer is therefore ‘68.2’, and in the next cycle instructions 2, 3, 4, 5 are issued in the N way. The other ways O, P, and Q start execution from the branch targets recorded in the track table entries of instructions 2, 3, and 4.
Presume the branch decision is that slot A instruction 3 does not take its branch but slot B instruction 4 does take its branch, and dependency check unit 2562 concludes there is no dependence among the four instructions. Then, branch priority encoder 2596 outputs the way priority code as P way. Multiplexer 2581 selects the N way slot B front-end pipeline output and sends it to rear-end pipeline 2591 for execution, and multiplexers 2582, 2583 select the front-end pipeline outputs of P way slots C and D to be executed by rear-end pipelines 2592, 2593. Therefore, instructions 3, 4, 7, 8 are executed in the rear-end pipelines. Multiplexer 2584 selects the increment control ‘2’ output by dependency check unit 2562 and sends it to adder 2528 through 2597. Adder 2528 sums the increment with the content ‘7’ of register 2522 selected by multiplexer 2585; the sum is ‘1’ and the carry is ‘1’. Therefore, multiplexer 2529 selects register 2532's first address ‘23’, which is selected by multiplexer 2586. The read pointer is therefore ‘23.1’, and instructions 9, 10, 11, 12 (the 4 contiguous instructions starting with the one at address ‘1’ in the Next instruction block) are issued in the N way in the next cycle. The other ways O, P, and Q start execution from the branch targets recorded in the track table entries of instructions 9, 10, and 11.
Presume the branch decision is that slot A instruction 3 does take its branch, and dependency check unit 2561 concludes that O way slot B instruction 0 has a dependency on N way slot A instruction 3. Then, branch priority encoder 2596 outputs the way priority code as O way, and multiplexers 2581, 2582, 2583 select the front-end pipeline outputs of O way slots B, C, D to be executed by rear-end pipelines 2591, 2592, 2593. Therefore, instructions 3, 0, 1, and 2 are executed in the rear-end pipelines; but instructions 0, 1, 2 in slots B, C, D are then aborted due to the dependency, so only instruction 3 in slot A is completed and retired, and its result is written back to the shared register file 2595. Multiplexer 2584 selects the increment control ‘0’ output by dependency check unit 2561 and sends it to adder 2528 through 2597. Adder 2528 sums the increment with the second address content ‘0’ of register 2521, which is selected by multiplexer 2585; the sum is ‘0’ and the carry is ‘0’. Therefore, multiplexer 2529 selects register 2521's first address ‘68’, selected by multiplexer 2585. The read pointer is therefore ‘68.0’, and instructions 0, 1, 2, 3 are issued in the N way in the next cycle. The other ways O, P, and Q start execution from the branch targets recorded in the track table entries of instructions 0, 1, and 2.
This embodiment uses IRBs controlled by zigzag buses, which are capable of issuing a plural number of instructions in order. It fully utilizes the branch target information and the Next block address information, both stored in the track table, of the instructions about to be executed, to control a plural number of the said IRBs and pre-process instructions in parallel on the multiple execution paths created by branch instructions. This embodiment makes a branch decision on each of the branch instructions being processed, and then produces a final branch decision result taking into account the sequential order of the branch instructions. The branch decision result selects the intermediate pre-processing results of one of the multiple execution paths for further processing, and the dependency check result on the instructions of the selected path decides whether a portion of or all of those instructions complete execution, while the others are aborted. It also adds the initial second address of the selected path to the address increment amount of the same path. If the sum does not overflow the IRB block boundary, the sum will be the second address of the next cycle and the initial first address will be the first address of the next cycle. If the sum overflows the IRB block boundary, the sum within the IRB block boundary will be the second address of the next cycle and the Next block address of the selected path will be the first address.
This disclosure discloses a method of instruction multi-issue. The method is to issue n sequential instructions in parallel starting with the instruction at address a, use dependency check modules to detect the dependencies among the said plural number of instructions, and feed back an address increment amount i based on the dependency and the position of the dependent instruction; then issue n instructions starting from instruction address a = a + i. Here, the issue positions of the instructions are numbered 0, 1, 2, . . . , n−1; i = p, where p is the position of the first dependent instruction in the instruction sequence; and p is defined as n if no dependency is found among the issued instructions. The later of two instructions that have a dependency between them is defined as the dependent instruction.
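A minimal software sketch of this multi-issue method is given below; the representation of instructions as (destination, source) register sets and all function names are illustrative assumptions, not part of the disclosure.

def first_dependency(window):
    """Position of the first instruction that depends on an earlier one, or len(window)."""
    written = set()
    for p, (dests, srcs) in enumerate(window):
        if p > 0 and written & srcs:
            return p
        written |= dests
    return len(window)

def issue_stream(program, n=4):
    """Issue n instructions from address a, advance a by the increment i, repeat."""
    a, issued = 0, []
    while a < len(program):
        i = first_dependency(program[a:a + n])   # address increment amount
        issued.append(list(range(a, a + i)))
        a += i
    return issued

# Each instruction is modelled as (destination registers, source registers).
prog = [({1}, set()), ({2}, {1}), ({3}, set()), ({4}, {3}), ({5}, set())]
print(issue_stream(prog))   # -> [[0], [1, 2], [3, 4]]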
This disclosure may employ special data read buffers (DRBs). The data engine pre-fetches data corresponding to instructions and fills it into the DRB in advance. When an instruction is issued, the corresponding data is automatically extracted from the DRB for execution.
The processor's pipeline may not start from the usual instruction fetch pipe stage, but rather from the instruction decode pipe stage; it also does not contain a memory (data) access stage. Instructions are pushed to the processor core by an instruction engine containing the track table, the tracker, and the IRB. Data is pre-fetched by data engines and filled into DRBs, and the data is pushed to the core following the issuing of the corresponding instructions.
The specific implementation of this method is to use diagonal, or so-called zigzag, word lines to control a memory which has a plural number of read ports for a single instruction (or data in general), with a plural number of independent bit lines connecting those read ports, so that a segment of sequential instructions (or sequential data) can be sent over the plural number of bit lines to the connected plural number of processing units for parallel processing.
The said multi-bit-line memory controlled by oblique word lines is called an instruction read buffer (IRB). As used herein, the IRB may be divided into instruction read buffer blocks (IRB blocks) of the same capacity for ease of instruction or data replacement. The issued instruction segment may be located in different IRB blocks. In this disclosure, the oblique word lines are distinguished as the Current instruction word line and the Next block instruction word line, which are driven by separate addresses. The Current word line is driven by the Current read pointer of the tracker or by a branch target address on the tracks in the track table; it issues instructions starting from the specific instruction in the specific IRB block designated by the said address, from the first instruction issue slot of the Way of the designated instruction to the last slot of the same Way, or to the last instruction in the IRB block. The Next block instruction word line is driven by the Next block address in the End track point of the track and, together with the Z address, issues a segment of contiguous instructions starting from the first instruction of the Next instruction block of the said Current or target instruction block, from the instruction issue slot designated by the Z address to the last instruction issue slot. Here, the Z address is obtained from the block offset address of the said Current or target instruction block and the number of instruction issue slots in the Way of the instruction.
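How a Current word line and a Next block word line together fill the issue slots can be sketched as follows; the block size n, slot count m, and the list-based model of IRB blocks are illustrative assumptions.

def issue_slots(cur_block, next_block, bny, n=8, m=4):
    """Instructions placed in the m issue slots, crossing the block boundary if needed."""
    slots = []
    for offset in range(m):
        if bny + offset < n:
            slots.append(cur_block[bny + offset])          # Current zigzag word line
        else:
            slots.append(next_block[bny + offset - n])     # Next block word line (Z = n - bny)
    return slots

cur = ["I%d" % i for i in range(8)]    # instructions 0..7 of the current block
nxt = ["J%d" % i for i in range(8)]    # instructions of the Next block
print(issue_slots(cur, nxt, bny=6))    # -> ['I6', 'I7', 'J0', 'J1']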
A slight modification to the multi-issue processor disclosed in
Instructions 2603, 2604, 2605, and 2606 in are the four instructions (instruction addresses 3, 4, 5, 6) in row 2481 of
The N Way in
Please refer to
When each of the instructions in slot D is a branch instruction, its branch decision does not affect instruction execution in this cycle, but it may affect the program execution course in the next cycle. If in this cycle the tracker/track table prepares the branch targets of the branch instructions in slot D ready to be selected by branch decisions, then based on the branch decision the processor of this embodiment can, in the next cycle, select the right instructions along the correct path under any combination of branch instructions. Then, as long as the branch targets are in the track table and the IRB, this processor has no performance degradation due to branching. To achieve this, it is necessary to define the branch paths for the instruction in slot D. Please refer to
The branch decision of this cycle selects one address from the 16 addresses of the first instructions of the 16 paths of the next cycle provided by tracker 2504. The instruction segment starting from this address, and any branch within the segment, can be executed in the N Way in the next cycle (not necessarily on the same path as the N Way of this cycle); the following ways are explained in combination with
Each way in
Each of the N, O, P, Q, S, T, U, V Ways may encounter the zigzag word line reaching the lower boundary of an IRB block. Therefore, a Next block address should be provided for each Way. The reading of the Next block address of each of the N, O, P, Q Ways is the same as in the embodiment in
Priority code 2598 can select among the 8 fall-through paths N, O, P, Q, S, T, U, V and the 8 branch target paths E, F, G, H, I, J, K, L, a total of 16 inputs, as the 16 current and branch target addresses selected by multiplexer 2585. However, the Next block addresses are sequential instruction addresses; therefore, the Next address multiplexer 2586 only selects the Next block address of one of the N, O, P, Q, S, T, U, V Ways and sends it through bus 2539 to multiplexer 2529 to be selected by the carry out of adder 2528. Specifically, when the Way priority code 2598 is N Way (the fall-through instruction of the instruction in N Way slot D) or E Way (the branch target instruction of the instruction in N Way slot D), the Next address multiplexer 2586 selects the N Way Next block address stored in register 2530. By the same reasoning, when the Way priority code 2598 is O Way or I Way, the Next address multiplexer 2586 selects the O Way Next block address stored in register 2531; when the Way priority code 2598 is P Way or G Way, it selects the P Way Next block address stored in register 2532; when the Way priority code 2598 is Q Way or F Way, it selects the Q Way Next block address stored in register 2638; when the Way priority code 2598 is S Way or K Way, it selects the S Way Next block address stored in register 2634; when the Way priority code 2598 is T Way or J Way, it selects the T Way Next block address stored in register 2638; when the Way priority code 2598 is U Way or H Way, it selects the U Way Next block address stored in register 2638; when the Way priority code 2598 is V Way or L Way, it selects the V Way Next block address stored in register 2638.
The instruction addresses of N, O, P, S Ways are also sent through bus 2641 to column address generator 2503. The column address generator has the same structure as the column address generator in
In this embodiment, the dependency check module is configurable; that is, which slots' instructions are checked against each other for dependency depends on the configuration. Refer to
For example, in
Similarly, output 1492 can be configured as ‘1’ to conduct 2-issue, checking dependency only on the instructions issued from slots A and B at the same time, and selecting address increment amount ‘2’ or ‘1’ based on the ‘no dependency’ or ‘has dependency’ detection result. The detailed operation is the same as stated before and is not repeated here. In addition, output 1491 can be configured as ‘1’ to conduct single issue, blocking all the AND gates in slot A, so that the values on control lines 1471, 1472, 1473, and 1474 are ‘1’, ‘0’, ‘0’, ‘0’ respectively, fixing the address increment amount at ‘1’. In this example, at least one instruction has to be issued; therefore, no matter what the dependency check result is, the slot A instruction is always issued.
The issue width can be adjusted to meet each program's requirements on performance and power consumption, by combining the configuration of the dependency check modules with disabling of the corresponding front-end and back-end pipelines. For example, the resources for all Ways other than the N Way may be disabled, and front-end pipelines 2571, 2572, 2573 are disabled while the N Way front-end pipeline 2570, N Way rear-end pipeline 2590, and IRB 2550 are enabled; instruction dependency check module 2560 is configured as single issue as aforementioned. The processor is then configured as single instruction issue. Adding front-end pipelines 2571, 2574, P Way rear-end pipeline 2591 and the corresponding IRB 2551, and configuring instruction dependency check modules 2560 and 2561 as two-issue, configures the processor as two-issue. Three-issue can be deduced by analogy. Embodiments in and following
Besides, the dependency checker's result is also used to produce abort signals that clear the results of instructions which are issued in parallel but cannot be executed in parallel. Specifically, when output 1491 is ‘1’, there is a dependency between slot B's instruction and slot A's instruction; therefore, abort signal 2810 is ‘1’, and the front-end pipeline execution results of slot B are all cleared. At the same time, through OR gates 2811 and 2813, output 1491 makes abort signals 2812 and 2814 both ‘1’, thus clearing the front-end pipeline execution results of slots C and D. This is equivalent to issuing and executing only slot A's instruction; the instructions of slots B, C, and D are not executed, and slots B, C, and D will not produce a branch taken signal.
When output 1491 is ‘0’ and output 1492 is ‘1’, slot C's instruction has a dependency on slot A's or B's instruction, and the instructions of slots A and B have no dependency. Therefore, slot B's abort signal is ‘0’, allowing slot B's front-end pipeline to execute normally. At the same time, the abort signals of slots C and D are ‘1’, so the execution results of slot C's and slot D's front-end pipelines are cleared. This is equivalent to issuing and executing only the instructions of slots A and B, not executing the instructions of slots C and D; slots C and D will not produce a branch taken signal.
When outputs 1491 and 1492 are both ‘0’ and output 1493 is ‘1’, slot D's instruction has a dependency on an instruction in slot A, B, or C, and the instructions in slots A, B, and C have no dependency on each other. Therefore, abort signals 2810 and 2812 are ‘0’, allowing the front-end pipelines of slots B and C to execute normally. At the same time, the ‘1’ value on output 1493, through OR gate 2813, sets abort signal 2814 to ‘1’, clearing the execution result of slot D's front-end pipeline. This is equivalent to issuing and executing only the instructions of slots A, B, and C, and not executing slot D's instruction this cycle; slot D will not produce a branch taken signal. Lastly, when outputs 1491, 1492, and 1493 are all ‘0’, there is no dependency among the instructions of slots A, B, C, and D. Therefore, abort signals 2810, 2812 and 2814 are all ‘0’, allowing the front-end pipelines of slots B, C, and D to execute normally, which is equivalent to issuing and executing the instructions of slots A, B, C, and D during this issue.
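The cascaded abort-signal generation described in the three preceding paragraphs can be sketched as follows; the function is an illustrative simplification using Boolean values in place of the ‘0’/‘1’ signal levels.

def abort_signals(dep_b, dep_c, dep_d):
    """dep_b/c/d: True if the slot B/C/D instruction depends on an earlier slot."""
    abort_b = dep_b                    # signal 2810
    abort_c = abort_b or dep_c         # signal 2812, via OR gate 2811
    abort_d = abort_c or dep_d         # signal 2814, via OR gate 2813
    return abort_b, abort_c, abort_d

print(abort_signals(False, True, False))   # -> (False, True, True): only slots A and B retire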
The instruction address increment amount produced by the dependency check module for instructions in the same issue slot may be different depending on how many instructions are in the Way. For example, when there are four instructions (only N Way) the increment amount is ‘4’ if the dependency check module 2560 determines no dependency between the four instructions; when slot D instruction has dependency, the increment amount is ‘3’; when slot C instruction has dependency, the increment amount is ‘2’; when slot B instruction has dependency, the increment amount is ‘1’. For example, when there are three instructions (only O Way) the increment amount is ‘3’ if the dependency check module 2561 determines no dependency between the three instructions; when slot D instruction has dependency, the increment amount is ‘2’; when slot C instruction has dependency, the increment amount is ‘1’; when slot B instruction has dependency, the increment amount is ‘0’. By the same reason, when there are two instructions (P Way and S Way) the increment amount is ‘2’ if the dependency check module 2562, 2563 determine no dependency between the two instructions; when slot D instruction has dependency, the increment amount is ‘1’; when slot C instruction has dependency, the increment amount is ‘0’. When there is one instruction (Q, T, U, V Ways) the increment amount is ‘1’ if the dependency check module 2564˜2567 determine no dependency on slot D instruction. When slot D instruction has dependency, the increment amount is ‘0’.
The removal of dependency check logic also reduces the number of abort signals such as 2810 and 2812, so that one abort signal controls one front-end pipeline. Specifically, the slot B, C, D abort signals (2810, 2812, 2814 in
Back to
The intermediate processing results of front-end pipeline 2570 are further processed by rear-end pipeline 2590, as in the embodiment of
In summary, depending on whether the instructions are branch instructions, the processor illustrated in
Each of the front-end pipelines makes a judgment on whether or not to take the branch when executing a branch instruction. This judgment, together with the corresponding abort signal produced by the instruction dependency checker of the same way and the same slot as the front-end pipeline, constitutes the branch decision signal of the front-end pipeline. When the abort signal's meaning is 'dependent', the branch decision signal the front-end pipeline produces is 'not take branch'; when the abort signal's meaning is 'no dependency', the branch decision signal the front-end pipeline produces depends on its internal branch decision logic. Another implementation is to let the abort signal directly terminate the instruction processing in the corresponding front-end pipeline; the branch decision output of each terminated front-end pipeline is set to 'not take branch'. All of the branch decisions produced by all 15 front-end pipelines are sent through bus 2689 to priority encoder 2596 to produce way priority code 2598.
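A one-line sketch of this gating, assuming the first implementation described above, in which the abort signal forces the decision to 'not take branch':

```python
def branch_decision(internal_taken, abort):
    """Branch decision signal of a front-end pipeline: the internal branch
    judgment is overridden to 'not taken' when the same way/slot dependency
    checker asserts its abort signal."""
    return internal_taken and not abort
```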
Way priority code 2598 is produced from the branch decisions of all the front-end pipelines, based on the instruction slot priority of each decision's corresponding instruction node position on the instruction path binary tree.
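One way to model this prioritization is a behavioral sketch that walks the tree slot by slot from slot A, switching to a branch's target way whenever that slot's gated decision is 'taken'; the edge table and way labels below are illustrative, not the encoder's actual implementation:

```python
def select_way(branch_taken, target_of, num_slots=4, start_way="N"):
    """Walk the instruction-path binary tree slot by slot.

    branch_taken : (way, slot) -> gated branch decision of that front-end
                   pipeline (False when the slot holds no branch or was aborted)
    target_of    : (way, slot) -> the way starting at that branch's target
                   (an edge of the binary tree)
    Returns the execution path (one way per slot) and the finally selected
    way, which plays the role of way priority code 2598.
    """
    way, path = start_way, []
    for slot in range(num_slots):
        path.append(way)
        if branch_taken.get((way, slot), False):
            way = target_of[(way, slot)]
    return path, way

# For instance, if only the N Way's slot A branch is taken (the edge label
# here is hypothetical), the path is N, O, O, O and the O Way is selected.
path, code = select_way({("N", 0): True}, {("N", 0): "O"})
assert path == ["N", "O", "O", "O"] and code == "O"
```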
Each two-input multiplexer in
Although the number of inputs of multiplexers 2581˜2586 of
Then, under the control of the S Way way priority code 2598, multiplexers 2581, 2582, and 2583 select the outputs of front-end pipelines 2574, 2579, and 2680, respectively, to be further processed by rear-end pipelines 2591, 2592, and 2593. In total, four instructions along the NOSS path are processed in the rear-end pipelines in slots A, B, C, and D, including the output of front-end pipeline 2570 processed by rear-end pipeline 2590. The S Way way priority code 2598 controls the rear-end pipeline dependency selector (not shown in
The S Way way priority code 2598 controls multiplexer 2584 to select the increment amount output of the S Way instruction dependency checker 2664. The S Way way priority code also controls multiplexer 2585 to select the S Way address from S Way address register 2664. The selected S Way increment amount and the second address (BNY) 2536 of the selected S Way address are added together by adder 2528. The sum of adder 2528 is '0', which will be the N Way second address in the next cycle. The carry out of adder 2528 is "carry", which controls multiplexer 2529 not to select the S Way branch first address 2535 (that is, the first address of the S Way slot C instruction in the current cycle), but to select the address on Next block bus 2539, which is the S Way Next block address in S Way Next block address register 2634 selected by multiplexer 2586 under the control of way priority code 2598. The output of multiplexer 2529 will be the N Way first address in the next cycle, and it also indexes track table 2501 to read out the OPQE Way branch target addresses of the next cycle. The OPQ Way addresses further index track tables 2682, 2683, and 2684 to read out the branch targets of the rest of the Ways, as described before. Each of those instruction addresses is stored into branch target register 2521, etc., as mentioned before. Their corresponding next block addresses and Z addresses are also stored in Next block address register 2530, etc. and Z address register 2540, etc., as mentioned before. In this way, in the next cycle, the processor in
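A minimal sketch of this address update, assuming the block offset (BNY) simply wraps within the instruction block and the adder's carry-out selects between the way's own block address and its Next block address (the argument names are illustrative):

```python
def next_initial_address(way_bnx, way_bny, increment, next_block_bnx,
                         block_size=8):
    """Model of adder 2528 and multiplexer 2529: compute next cycle's
    N Way initial address (BNX, BNY) from the selected way's address,
    its increment amount, and its Next block address."""
    total = way_bny + increment
    carry = total >= block_size                  # carry out of adder 2528
    bny = total % block_size                     # sum of adder 2528
    bnx = next_block_bnx if carry else way_bnx   # multiplexer 2529
    return bnx, bny
```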
Another example of instruction execution contains the same four branch instructions as the previous example, but the branch decisions of all four instructions are "branch taken" in this example. The output of the S Way instruction dependency checker 2664 indicates that slots A, B, C, and D have no dependence, and the increment amount is '2'. Under these conditions, the way priority code 2598 is K Way. This time, multiplexer 2581, etc., selects the NOSS Way instructions to execute normally and retire in the rear-end pipelines, as in the previous example. The K Way way priority code 2598 controls multiplexer 2584 to select the increment amount '0'. The K Way way priority code also controls multiplexer 2585 to select the K Way address from K Way address register 2624. The selected K Way increment amount '0' and the second address (BNY) 2536 of the selected K Way address are added together by adder 2528. The sum of adder 2528 will be the N Way second address in the next cycle. The carry out of adder 2528 is "no carry", which controls multiplexer 2529 to select the K Way branch first address on bus 2535 as the N Way first address in the next cycle. The other operations are the same as in the previous example.
Another example of instruction execution contains the same four branch instructions as the previous example, and the branch decisions of all four instructions are "branch taken" in this example. The output of the O Way instruction dependency checker 2561 indicates that slot B has a dependence on slot A, so the corresponding dependence signals of slots B, C, and D are all "dependent", and the increment amount is '0'. Under these conditions, the slot B, C, D dependency signals set the branch decisions of the corresponding front-end pipelines to "branch not taken". That is, the N Way slot A branch decision is "branch taken", and the O Way slot B, C, D branch decisions are "branch not taken". In priority encoder 2596, these branch decisions select the NOOO Way, which makes the code representing the O Way the way priority code 2598. The branch decisions from the other front-end pipelines are not selected, and their corresponding codes (such as the S Way code) are filtered out.
Under this condition, the way priority code 2598 is O Way. This time, multiplexer 2581, etc., selects the NOOO Way instructions to execute in the rear-end pipelines as in the previous example. The O Way way priority code 2598 selects the slot B, C, D dependency signals (all "dependent") from the O Way instruction dependency checker 2561 to control the rear-end pipelines to complete the instruction execution only in slot A, and to abort the intermediate execution results of slots B, C, and D. The O Way way priority code 2598 controls multiplexer 2584 to select the increment amount output '0' of instruction dependency checker 2561 as address increment amount 2597. The O Way way priority code also controls multiplexer 2585 to select the O Way address from O Way address register 2521. The selected O Way increment amount '0' and the second address (BNY) 2536 of the selected O Way address are added together by adder 2528. The sum of adder 2528 will be the N Way second address in the next cycle. The carry out of adder 2528 is "no carry", which controls multiplexer 2529 to select the O Way first address on bus 2535 as the N Way first address in the next cycle. The other operations are the same as in the previous example.
This embodiment is capable of handling the instruction binary tree on which every instruction is a branch instruction as shown in
This disclosure further discloses another method of multi-instruction issue. The method divides n sequential instructions starting from an initial address, together with the possible branch target instructions of the branch instructions within the n instructions, and the branch target instructions of those branch targets, into different Ways based on each instruction's position on the instruction binary tree, and issues them at the same time. Each of the said plurality of simultaneously issued instructions is executed independently. The dependency among the instructions is checked; the execution of an instruction with a dependency, and of the follow-up instructions in the same way, is aborted; and a way address increment amount is fed back for each way based on whether there is dependence among the instructions and on the location of the dependent instruction. A branch decision is made independently for each branch instruction, regardless of the other branch decisions. The way of execution in the current cycle and the next cycle is determined based on each of the independent branch decisions and on a branch priority that follows the branch instruction sequence order. Based on the determined way, n instructions are selected from the said simultaneously issued instructions for normal execution and retirement, and the rest of the instructions are terminated. Based on the determined way of the next cycle, the current cycle address of that way is summed with the address increment amount of that way; the result is the block offset address BNY of the initial address of the next cycle. The current cycle block address of the determined way is taken as the initial block address BNX for the next cycle if the above sum does not overflow the block address boundary; the current cycle Next block address of the determined way is taken as the initial block address BNX for the next cycle if the above sum does overflow the block address boundary. Then n sequential instructions starting from this initial address, together with the possible branch target instructions of the branch instructions within the n instructions, are issued at the same time. This process is performed repeatedly.
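The following behavioral sketch ties these steps together for one cycle. It is a simplified model under several assumptions (a fixed fall-through way "N", four issue slots, and illustrative dictionary layouts), not a description of the disclosed hardware:

```python
def issue_cycle(ways, decisions, deps, block_size=8, num_slots=4):
    """One cycle of the multi-issue method, as a simplified behavioral model.

    ways      : way -> {"bnx", "bny", "next_bnx", "start_slot", "targets"}
                for every way reachable this cycle; next-cycle-only target
                ways issue nothing and may omit "start_slot".
    decisions : (way, slot) -> raw branch judgment (True means taken)
    deps      : way -> first dependent issue slot in that way, or None
    Returns next cycle's initial address (BNX, BNY) and the selected way.
    """
    # 1. Gate branch decisions: an aborted (dependent) slot reports 'not taken'.
    def taken(way, slot):
        dep = deps.get(way)
        aborted = dep is not None and slot >= dep
        return decisions.get((way, slot), False) and not aborted

    # 2. Walk the instruction-path binary tree, slot by slot, to pick the way.
    way = "N"
    for slot in range(num_slots):
        if taken(way, slot):
            way = ways[way]["targets"][slot]

    # 3. Address increment of the selected way: how many of its own
    #    instructions execute before its first dependency (if any).
    dep = deps.get(way)
    start = ways[way].get("start_slot", num_slots)
    inc = (num_slots - start) if dep is None else max(dep - start, 0)

    # 4. Next initial address: the block offset wraps into the Next block
    #    of the selected way when the sum overflows the block boundary.
    total = ways[way]["bny"] + inc
    bny = total % block_size
    bnx = ways[way]["next_bnx"] if total >= block_size else ways[way]["bnx"]
    return bnx, bny, way
```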
As used herein, the IRB may also be organized by execution slots. Therefore, the IRB and decoder structure organized this way is different from the IRB in the embodiment in
The zigzag word line 2920, etc., does not connect two neighboring columns of read ports in the same slot. Rather, it connects a read port of one Way in one slot to the read port of the same Way in the next row of the slot to the right, thus enabling the sequential instructions in the same Way to be issued, one in each instruction issue slot, at the same time. Therefore, the word line driving the read port of row 2961, N column 2903 comes from the read port of row 2960, N column (slot B). For the same reason, the word line driving the read port of row 2961, O column 2905 comes from the read port of row 2960, O column (slot B). All of the read ports in the C slot, with the exception of the read ports in the first row 2960 or in the P column 2907, are controlled by the zigzag word line from the read port of the same way in the previous row of the B slot. The read ports in the first row do not have a previous row. Therefore, the P column read ports are controlled by the word line 2920, etc., generated by decoder 513 (the first row as well as the other rows in the P column), and the first row 2960 read ports on N column 2903 and on O column 2905 are each directly controlled by Next block address comparators 2973 and 2975.
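A schematic sketch of this connectivity (an inferred model, with placeholder row and slot indices rather than the reference numerals of the figures):

```python
def word_line_source(row, slot, way_start_slot):
    """Return which signal drives the read port at (row, slot) for a way
    whose first (left-most) issue slot is way_start_slot.

    'decoder'    : the slot's own address decoder output (every row of the
                   way's first slot)
    'next_block' : the Next block address comparator (first row of the
                   slots to the right of the way's first slot)
    otherwise    : the zigzag word line from (row - 1, slot - 1) of the
                   same way, so consecutive rows issue in consecutive slots.
    """
    if slot == way_start_slot:
        return "decoder"
    if row == 0:
        return "next_block"
    return ("zigzag", row - 1, slot - 1)
```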
There is no Z address decoder, such as the one in the decoder in
In
Please refer to
As described in
As used herein, the Next block addresses of the current instruction or a branch target may be generated based on the following method, and the generated Next block addresses may be shifted into the appropriate registers. Define the number of rows in every IRB block (the number of storage entries) as n; the block offset address (second address) as BNY, which takes values 0˜n−1, the top row being row 0; the total number of slots as m, numbered 0˜m−1, the left-most slot being slot 0; and the total number of ways as w, numbered 0˜w−1, the left-most way being way 0. Then, for the way numbered k, the Next block address is valid if ((m−k)−(n−BNY)) is greater than 0, and invalid otherwise. In this embodiment, the Next block addresses for the N, O, and P ways can be shifted to the appropriate slots based on the corresponding initial BNY addresses; the Q way does not need the said Next block address. Specifically, the value of ((m−k)−(n−BNY)−1) is used as the shift amount to right-shift the Next block address.
In this example, n=8, m=4, w=4; N corresponds to way 0, O corresponds to way 1, P corresponds to way 2, and Q corresponds to way 3. When the N way's BNY=6, ((m−k)−(n−BNY))=((4−0)−(8−6))=2, which is greater than 0; therefore the Next block address is valid. The meaning is that address '6' is decoded and drives a zigzag word line, the instruction that corresponds to address '6' is issued from N way slot A, and the instruction that corresponds to address '7' is issued from N way slot B; at this point the zigzag word line terminates as it reaches the IRB block's lower boundary. The decoding of the Next block address then points to the first instruction of the Next instruction block, and the only thing that needs to be determined is from which slot of the N way that instruction should be issued, to fully utilize processor resources and avoid collision with instructions issued by the current IRB block. Here, the shift amount is ((m−k)−(n−BNY)−1)=1, so shifter 2546 shifts the valid Next block address, which is the N way Next block address stored in register 2530, right by one position, and stores the shifted result to register 2541 of N way slot C (the values of the N way's other corresponding registers 2540 and 2542 are invalid). This address is decoded by the column decoder 2411 of the Next instruction block in column 2 to issue the first instruction (BNY=0) from N way slot C, and the second instruction (BNY=1) from N way slot D. If ((m−k)−(n−BNY)) is not greater than 0, the Next block address is invalid, and the corresponding registers 2540, 2541, and 2542 of the N way are all invalid; the controller controls all column decoders 2411 so that they do not drive any zigzag word lines, because under these circumstances the current IRB block issues instructions to all columns at the same time. The result of the above calculation can also be placed in a lookup table to replace the calculation.
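A small sketch of this check and of the shift-amount computation, reproducing the worked numbers above (the function name is illustrative; k, n, m, and BNY follow the definitions just given):

```python
def next_block_usage(n, m, k, bny):
    """Decide whether a way needs its Next block address this cycle and,
    if so, how far it must be right-shifted to land in the correct slot.

    n   : rows per IRB block, m : number of issue slots,
    k   : the way's number (left-most way is 0), bny : block offset.
    """
    margin = (m - k) - (n - bny)
    if margin <= 0:
        return {"valid": False}
    return {"valid": True, "shift": margin - 1}

# Worked example from the text: n=8, m=4, N way (k=0), BNY=6.
assert next_block_usage(8, 4, 0, 6) == {"valid": True, "shift": 1}
# N way with BNY=3 fills slots A..D from the current block alone.
assert next_block_usage(8, 4, 0, 3) == {"valid": False}
```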
Dependency checker 2560, etc., has the same structure as the dependency checker in the
This embodiment and the embodiment in
After the clock signal updates the tracker registers and the Next block address registers, the value '68.3' on bus 2520, which is the outputs of registers 2525 and 2526 joined together, is sent to slot A IRB 2550 in the current clock cycle. The value is matched by the decoder's first address comparator and decoded by the second address decoder, which drives zigzag word line 2555 to issue instructions 3, 4, 5, and 6 along slots A, B, C, and D; the N way Next block addresses in registers 2540, 2541, and 2542 are all invalid, so after decoding, column decoder 2411 does not drive any word line in the N way in slots B, C, and D. At the same time, register 2521's output '68.0' is sent to slot B's IRB 2551. After being matched and decoded by the decoder, it drives zigzag word line 2556 and issues instructions 0, 1, and 2 along the O way of slots B, C, and D; the O way Next block addresses in registers 2543 and 2544 are invalid, so no word lines are driven by column decoder 2411 in slots C and D. At the same time, register 2522's output '68.7' is sent to P way IRB 2552; after being matched and decoded by the decoder, it drives zigzag word line 2557. After issuing instruction 7 along P way slot C, the word line terminates when it reaches the IRB block's lower boundary; register 2545's P way Next block address is valid, so slot D's decoder decodes it to drive word line 2558, and in slot D's P way IRB of the Next instruction block, its row '0' issues instruction 8. At the same time, register 2523's output '68.1' is sent to Q way's IRB 2553, and after matching and decoding, the decoder drives word line 2559 and issues instruction '1' along Q way's slot D. The rest of the operations are the same as in the
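The issue pattern of this walkthrough can be reproduced with a small model of the zigzag issue. It is a sketch under the assumption that instructions are numbered by their offset within instruction block '68', with offsets past the block boundary coming from the Next instruction block; the names are illustrative:

```python
def issue_pattern(bny, start_slot, num_slots=4, block_size=8):
    """List (slot, instruction, source) triples issued by one way whose
    first occupied slot is start_slot and whose address has offset bny.
    Instructions past the block boundary come from the Next block."""
    issued = []
    for i, slot in enumerate(range(start_slot, num_slots)):
        offset = bny + i
        source = "current" if offset < block_size else "next"
        issued.append((slot, offset, source))
    return issued

# N way, '68.3', slots A..D: instructions 3, 4, 5, 6 from the current block.
assert issue_pattern(3, 0) == [(0, 3, "current"), (1, 4, "current"),
                               (2, 5, "current"), (3, 6, "current")]
# P way, '68.7', slots C..D: instruction 7, then instruction 8, which is
# row 0 of the Next instruction block.
assert issue_pattern(7, 2) == [(2, 7, "current"), (3, 8, "next")]
```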
The multi-layer track tables 2501, 2682, 2683, 2684, 2685, 2686, 2687, and 2688 in the embodiment in
The first address register 505, first address comparator 509, second address decoder 513, Next block address comparator 1619, and Z address decoder 2411 in decoder 2751 have the same functions as the functional blocks with the same numbers in decoder 2417 in
In the Current IRB block, the outputs 2761 of second address decoder 513 are latched by register 2756 and drive the Current word lines, such as 2785, in the Current IRB block to control the read ports issuing instructions in the next cycle. In the IRB block of the Next block, the outputs 2763 of Z address decoder 2411 are latched by register 2756 and drive the Next block word lines, such as 2795, in the Next block IRB block to control the read ports issuing instructions in the next cycle. Following the description in the
The joint track table/IRB in this embodiment (hereafter the joint buffer; each block in it is named a joint block) can be applied to the embodiment in
The first address outputted by multiplexer 2529 within tracker 2504 is sent to register 2525, and the second address outputted by adder 2528 in tracker 2504 is sent to register 2526. The instruction address 2510, formed by joining this first address and second address, is sent to N Way joint buffer 2550 (in
The O Way branch target address on bus 2511 addresses the O Way joint buffer 2551; the O Way Next block address outputted by O Way joint buffer 2551 is sent to O Way Next block address register 2531; and the S, T, and I Way branch target addresses are each sent through buses 2663, 2661, etc. to Current address registers 2624, 2625, and 2729. The P Way branch target address on bus 2512 addresses the P Way joint buffer 2552; the P Way Next block address outputted by P Way joint buffer 2552 is sent to P Way Next block address register 2532; and the U and G Way branch target addresses are each sent through bus 2662, etc. to Current address registers 2626 and 2720. The Q Way branch target address on bus 2513 addresses the Q Way joint buffer 2553; the Q Way Next block address outputted by Q Way joint buffer 2553 is sent to Q Way Next block address register 2638; and the K Way branch target address is sent to Current K Way address register 2722.
In the same way, the S Way branch target address on bus 2663 outputted by O Way joint buffer 2551 addresses the S Way joint buffer 2654; the S Way Next block address outputted by joint buffer 2654 is sent to the S Way Next block address register 2634; and the V Way and K Way branch target addresses are sent to the Current V Way and K Way address registers 2627 and 2721. The T Way branch target address on bus 2661 outputted by O Way joint buffer 2551 addresses the T Way joint buffer 2655; the T Way Next block address outputted by joint buffer 2655 is sent to the T Way Next block address register 2726; and the J Way branch target address is sent to the Current J Way address register 2723. The U Way branch target address on bus 2662 outputted by P Way joint buffer 2552 addresses the U Way joint buffer 2656; the U Way Next block address outputted by joint buffer 2656 is sent to the U Way Next block address register 2727; and the H Way branch target address is sent to the Current H Way address register 2724. The V Way branch target address on bus 2664 outputted by S Way joint buffer 2654 addresses the V Way joint buffer 2657; the V Way Next block address outputted by joint buffer 2657 is sent to the V Way Next block address register 2728; and the L Way branch target address is sent to the Current L Way address register 2725.
In the next cycle, all 16 of the said branch targets are latched in the corresponding 17 Current address registers 2521, etc. (for ease of explanation, the first address and second address of the N Way are stored by registers 2525 and 2526, respectively), waiting for the selection of multiplexer 2585. When multiplexer 2585 selects the N Way, it selects the joint output of registers 2525 and 2526 as the N Way input. All 8 of the said Next block addresses are stored in the corresponding 8 Next block address registers 2530, etc., waiting for the selection of multiplexer 2586. Each of the second address decoder outputs 2761 and each of the Z address decoder outputs in each of the decoders 2751 in each of the joint buffers, such as 2550, are latched in register 2756 to drive the Current word lines, such as word line 2785, etc., and also to drive the Next block word lines in IRB 2701, such as word line 2795, etc. (please see
An IRB can contain its corresponding micro track table; together they are called the joint buffer. Because a track corresponds to an instruction block, the filling of the micro track table and the IRB block in a joint buffer can be done at the same time. The two also share the same set of decoders.
Please refer to
When the said processor system includes multiple columns (that is, Ways or slots), each column has a set consisting of execution unit(s), IRBs, and DRBs. Dependency checker module 3311 can be configured to detect the dependency between instructions issued within a column, within certain columns of a plurality of columns, or within all columns of a plurality of columns. Tracker module 3303 indexes track table module 3301, fetches the branch target instruction from first level instruction cache 3307, and fills it into IRB 3309 before the processor executes the branch instruction, if the target is not already in IRB 3309.
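A behavioral sketch of this ahead-of-time fill, assuming the track table entry for a branch yields its target block address and the IRB records which blocks it currently holds (the dictionary-based interfaces are illustrative, not the disclosed hardware structures):

```python
def prefetch_branch_target(track_table, irb, l1_icache, branch_entry):
    """Before the branch executes, look up its target in the track table
    and fill the target instruction block into the IRB if it is missing."""
    target_block = track_table[branch_entry]         # track table module 3301
    if target_block not in irb:                      # IRB 3309 miss check
        irb[target_block] = l1_icache[target_block]  # fill from L1 cache 3307
    return irb[target_block]
```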
In addition, as described in the embodiments of
Data engine module 3305 is similar to the one in
In addition, in this embodiment's processor system, write buffer 3317 temporarily stores the data that execution unit 3313 intends to write back to first level data cache 3319, and writes the temporarily stored data back to first level data cache 3319 when the cache is not filling data into DRB 3315 (that is, when first level data cache 3319's port is not busy). This reduces read/write collisions in first level data cache 3319 and ensures that data that may be used by execution unit 3313 is filled into DRB 3315 as soon as possible.
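A minimal sketch of this arbitration, assuming a single shared data-cache port on which DRB fills have priority over write-buffer drains (the class and parameter names are illustrative):

```python
from collections import deque

class WriteBuffer:
    """Holds pending stores from the execution unit (3313) and drains
    them to the L1 data cache (3319) only in cycles when the cache port
    is not busy filling the DRB (3315)."""
    def __init__(self):
        self.pending = deque()

    def store(self, addr, data):
        self.pending.append((addr, data))    # execution unit writes here first

    def cycle(self, l1_dcache, port_busy_with_drb_fill):
        if port_busy_with_drb_fill or not self.pending:
            return                           # DRB fill has priority
        addr, data = self.pending.popleft()
        l1_dcache[addr] = data               # drain one store per free cycle
```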
As described in the previous embodiments, under the guidance of tracker module 3303, the processor system of this embodiment can control the IRBs to provide the correct instructions to be executed by the execution units along the program flow without interruption and, based on the information stored in the IRBs, find the corresponding data in the DRBs, regardless of whether the branch instructions take their branches or not. Because each column used in the processor system has its own IRB and DRB to provide instructions and corresponding data, instructions and corresponding data can be provided to different columns (that is, different issue slots or Ways) at the same time, which improves processor system efficiency.
It is understood by one skilled in the art that many variations of the embodiments described herein are contemplated. While the invention has been described in terms of an exemplary embodiment, it is contemplated that it may be practiced as outlined above with modifications within the spirit and scope of the appended claims.
The apparatuses and methods of this disclosure may be applied to various applications related to cache, and may enhance efficiency of the cache.