A method of, and apparatus for, interfacing the hardware of a processor capable of processing instructions from more than one type of instruction set. More particularly, an engine responsible for fetching native instructions from a memory subsystem (such as an EM fetch engine) is interfaced with an engine that processes emulated instructions (such as an x86 engine). This is achieved using a handshake protocol, whereby the x86 engine sends an explicit fetch request signal to the EM fetch engine along with a fetch address. The EM fetch engine then accesses the memory subsystem and retrieves a line of instructions for subsequent decode and execution. The EM fetch engine sends this line of instructions to the x86 engine along with an explicit fetch complete signal. The EM fetch engine also includes a fetch address queue capable of holding the fetch addresses before they are processed by the EM fetch engine. The fetch requests are processed such that more than one fetch request may be pending at the same time. If a pending fetch request is canceled due to a pipeline flush, then the fetch address queue is cleared and the pending fetch requests are canceled. The system also prevents macroinstruction queue (MIQ)-related stalls by using a speculative write pointer to control the issuance of fetch requests, thereby preventing the MIQ from becoming oversubscribed.
1. A method implementing a native instruction set architecture (ISA) having an emulation engine and an emulated ISA, wherein a fetch engine fetches instructions of the emulated ISA from a memory subsystem, the method comprising:
sending a fetch request signal to the fetch engine;
sending a fetch address to the fetch engine;
retrieving a line of instructions from the memory subsystem, wherein the line of instructions is associated with the fetch address;
storing the fetch address in a fetch address queue;
sending the line of instructions to the emulation engine;
sending a fetch complete signal from the fetch engine to the emulation engine when the line of instructions is sent to the emulation engine, the fetch complete signal being a signal separate from the line of instructions, wherein the step of storing comprises storing the fetch address in the fetch address queue until the fetch complete signal is sent;
storing the fetch address in a queue of the emulation engine, wherein the queue of the emulation engine mirrors the fetch address queue;
progressing the fetch address through pipeline stages of the fetch engine and the emulation engine synchronously using pipeline advance logic;
receiving, at the pipeline advance logic, a delayed version of the fetch complete signal; and
advancing the fetch address through the fetch engine and the emulation engine based on the delayed version of the fetch complete signal.
11. A multi-architecture computer system capable of implementing a native instruction set architecture (ISA) and an emulated ISA, wherein instructions of the native ISA are processed in a native ISA pipeline and instructions of the emulated ISA are processed in an emulated ISA pipeline, the system comprising:
a memory subsystem of the native ISA;
a fetch engine of the native ISA, said fetch engine being electrically connected to the memory subsystem of the native ISA, wherein the fetch engine accesses the memory subsystem to retrieve a line of instructions from the memory subsystem;
an engine of an emulated ISA, wherein the engine of the emulated ISA is electrically connected to the fetch engine and interfaces with the fetch engine using a handshake protocol, wherein the engine of the emulated ISA receives a line of instructions and a fetch complete signal from the fetch engine;
a fetch address queue that stores a fetch address for the line of instructions retrieved from the memory subsystem when the native ISA pipeline is stalled, wherein the fetch address queue is controlled by the fetch complete signal such that the fetch address is stored in the fetch address queue until the fetch complete signal is received; and
pipeline advance logic, wherein the pipeline advance logic receives a delayed version of the fetch complete signal and advances the fetch address through the fetch engine and the engine of the emulated ISA based on the delayed version of the fetch complete signal.
7. A method for fetching a line of instructions from a memory subsystem of a mixed architecture CPU into a macroinstruction queue of an emulation engine, comprising:
determining whether the macroinstruction queue is full or will become full with one or more lines of instructions returning from one or more pending fetch requests; and
if the macroinstruction queue is not full and will not become full with one or more lines of instructions returning from one or more pending fetch requests,
sending a fetch address to a fetch engine;
retrieving a line of instructions from a memory subsystem into the fetch engine;
sending the line of instructions to the emulation engine;
sending a fetch complete signal from the fetch engine to the emulation engine along with the line of instructions;
if the macroinstruction queue is full or will become full with one or more lines of instructions returning from one or more pending fetch requests, waiting until the macroinstruction queue is not full and will not become full before sending a fetch address to the fetch engine;
storing the fetch address in a queue of the emulation engine, wherein the queue in the emulation engine mirrors the fetch address queue;
progressing the fetch address through pipeline stages of the fetch engine and the emulation engine synchronously using pipeline advance logic;
receiving at the pipeline advance logic a delayed version of the fetch complete signal; and
advancing the fetch address through the fetch engine and the emulation engine based on the delayed version of the fetch complete signal.
3. The method of
4. The method of
determining whether the macroinstruction queue is full or will become full with one or more lines of instructions returning from one or more pending fetch requests; and
if the macroinstruction queue is full or will become full with one or more lines of instructions returning from one or more pending fetch requests, waiting until the macroinstruction queue is not full and will not become full to send a new fetch request signal.
5. The method of
6. The method of
canceling a pending fetch request; and
clearing the fetch address queue.
8. The method of
9. The method of
10. The method of
12. The computer system of
13. The computer system of
a pending fetch request signal is canceled; and
the fetch address queue is cleared.
14. The computer system of
15. The computer system of
This application is a continuation of application Ser. No. 09/510,010, filed Feb. 22, 2000, now U.S. Pat. No. 6,678,817, entitled “METHOD AND APPARATUS FOR FETCHING INSTRUCTIONS FROM THE MEMORY SUBSYSTEM OF A MIXED ARCHITECTURE PROCESSOR INTO A HARDWARE EMULATION ENGINE,” which is incorporated herein by reference.
The technical field relates to digital computer systems and fetching instructions. More particularly, it relates to a method and apparatus for fetching instructions from a computer memory in a mixed-architecture processor.
In the field of computer architecture, a single chip may process instructions from multiple instruction sets. In such mixed architectures, the processor hardware is designed and optimized for executing instructions from one instruction set, generally referred to as the native instruction set, while emulating other instruction sets by translating the emulated instructions into operations understood by the native hardware. For example, the IA-64 architecture supports two instruction sets: the IA-32 (or x86) variable-length instruction set and the fixed-length enhanced mode (EM) instruction set. When executing the IA-32 instruction set, the central processing unit (CPU) is said to be in IA-32 mode. When executing EM instructions, the CPU is said to be in EM mode. Native EM instructions are executed by the main execution hardware of the CPU in EM mode. However, the variable-length IA-32 instructions are processed by the IA-32 (or x86) engine and broken down into native EM mode instructions for execution in the core pipeline of the machine. In x86 mode, it is desirable to retrieve instructions from the IA-64 memory subsystem into an x86 engine. To accomplish this, the x86 execution engine must interface with the EM pipeline, because the memory subsystem is tightly coupled to the EM pipeline. The x86 hardware exists primarily to support legacy software. For this reason, it is desirable that the x86 engine not slow the processing of native instructions in the EM pipeline.
Existing methods of fetching instructions, such as those previously implemented in the IA-64 architecture, use dual pipelines (the EM pipeline and the x86 pipeline) to process instructions. In these methods, the x86 engine simply sends a fetch address to the EM fetch engine, which accesses the memory subsystem and returns a line of instructions for deposit into a macroinstruction queue (MIQ) in the x86 engine. While both pipelines are synchronized to process the same set of addresses, they operate independently, such that the x86 engine sends a new fetch address in each clock cycle and the EM fetch engine retrieves a new line of instructions in each clock cycle.
In the presence of pipeline stalls (for example, due to a cache miss), the pipelines could go out of synchronization. This is because, given the physical separation of the x86 engine and the EM fetch engine, it takes one complete clock cycle to transmit information between these pipelines. In the case of a stall, it is not possible to report the stall to the x86 engine in the same cycle that the fetch engine sees it. That is, the x86 engine would not notice the stall in the EM pipeline until at least one clock cycle after it occurred. Meanwhile, the x86 pipeline continues to advance the fetch address as though no stall had occurred. The x86 pipeline and the EM pipeline become unsynchronized and process different instructions in corresponding pipeline stages. This requires a complicated stall-recovery mechanism to get the pipelines back into synchronization.
Another stall-related problem with existing methods of processing instructions is that there may not be enough room to write a line of returning instructions to the MIQ. That is, existing methods and apparatuses may try to write a new line of instructions to the MIQ even though the MIQ is full of unprocessed entries. One prior art method introduces a new stall to recover from this oversubscription of the MIQ. The detection and signaling of this new stall is cumbersome and, combined with the earlier fetch-related stalls, requires complicated hardware to handle.
What is needed is a means of interfacing the hardware of a CPU that processes both native instructions and emulated instructions. In particular, what is needed is a method for retrieving instructions of one instruction set architecture (ISA) from the memory of a different, native ISA, while avoiding the problems associated with pipeline stalls and the complexities inherent to the dual, synchronous pipeline system.
Disclosed is a method for implementing a native instruction set architecture (ISA), having an emulation engine, and an emulated ISA, where the native ISA includes a fetch engine responsible for fetching native instructions from a memory subsystem. The fetch engine is interfaced with the emulation engine. This is achieved using a handshake protocol, whereby the emulation engine sends an explicit fetch request signal to the fetch engine along with a fetch address. The fetch engine then accesses the memory subsystem and retrieves a line of instructions for subsequent decode and execution. The fetch engine sends this line of instructions to the emulation engine along with an explicit fetch complete signal. The fetch engine also includes a fetch address queue capable of holding the fetch addresses before they are processed by the fetch engine. The fetch requests are processed such that more than one fetch request may be pending at the same time. If a pending fetch request is canceled due to a pipeline flush, then the fetch address queue is cleared and the pending fetch requests are canceled. The system also prevents macroinstruction queue (MIQ)-related stalls by using a speculative write pointer to control the issuance of fetch requests, thereby preventing the MIQ from becoming oversubscribed.
Also disclosed is a computer system capable of processing instructions from more than one instruction set, including an engine that fetches native instructions from a memory subsystem (such as an EM fetch engine) and an engine that processes emulated instructions (such as an x86 engine). The EM fetch engine has a fetch address queue. The EM fetch engine interfaces with the memory subsystem and the x86 engine by using a handshake protocol. The x86 engine sends an explicit fetch request signal to the EM fetch engine along with a fetch address. The EM fetch engine then accesses the memory subsystem and retrieves a line of instructions. The EM fetch engine sends this line of instructions to the x86 engine along with an explicit fetch complete signal. The fetch address queue is capable of holding the fetch addresses before they are processed by the EM fetch engine. The fetch requests are processed such that more than one fetch request may be pending at the same time. If a pending fetch request is canceled due to a pipeline flush, then the fetch address queue is cleared and the pending fetch requests are canceled. The system also prevents macroinstruction queue (MIQ)-related stalls by using a speculative write pointer to control the issuance of fetch requests, thereby preventing the MIQ from becoming oversubscribed.
The detailed description refers to the following drawings, in which like numbers refer to like items.
The system improves interfacing between hardware in a processor that implements both a native instruction set and an emulated instruction set by replacing the synchronous, stall-controlled mechanism with a handshake-based fetch protocol. It will be recognized by one skilled in the art that the system may be used by any engine that attempts to emulate one instruction set architecture (ISA) using another ISA. By way of illustration only, and not by way of limitation, the embodiment of the system is shown to interface an x86 engine and an IA-64 memory subsystem. The memory subsystem includes any apparatus that may be used to store instruction bytes, including a cache system, a main memory, and any other memory used by the system.
The handshake method is explained above and illustrated in the accompanying figures.
In one embodiment, 16 sequential bytes of instructions are requested from the fetch engine 40 by sending a fetch request signal 110. In this embodiment, the x86 engine 30 sends a 16-byte-aligned 28-bit fetch address 120 to the fetch engine 40 at the same time as it sends the fetch request signal 110. In this embodiment, the fetch engine 40 accesses the memory subsystem 20 and retrieves the requested line of instructions 130.
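By way of illustration only, the handshake signals of this embodiment might be modeled in C as follows. The structure and field names are hypothetical and chosen for exposition; they are not taken from any actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the signals driven by the x86 engine toward the
 * EM fetch engine. */
typedef struct {
    bool     fetch_request;  /* explicit fetch request signal 110           */
    uint32_t fetch_address;  /* 28-bit, 16-byte-aligned fetch address 120   */
} X86ToFetchEngine;

/* Hypothetical model of the signals returned by the EM fetch engine. */
typedef struct {
    bool    fetch_complete;  /* explicit fetch complete signal 140          */
    uint8_t line[16];        /* 16 sequential instruction bytes (line 150)  */
} FetchEngineToX86;
```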
A group of logic functions, referred to as the pipeline-advance logic 90, is applied to the addresses 120 in the respective stages of the x86 address queue 60 to advance the fetch addresses 120 along the x86 pipeline 30. The pipeline-advance logic 90 is designed to move the oldest outstanding fetch address 120 toward the BT3 stage 370. Once each of the x86 pipeline stages BT1 350, BT2 360, and BT3 370 has a valid address 120 corresponding to one of the three pending fetch requests in the EM pipeline 40, the addresses 120 are advanced only after a line of instructions 150 has been returned for the oldest outstanding request. To accomplish this, the pipeline-advance logic 90 uses a delayed version of the fetch complete indication 142 to advance the addresses 120 along the x86 pipeline 30. As shown, the fetch complete indication 140 is sent from the BT3 stage 370 to a latch 72 in the align (ALN) stage 380 of the x86 pipeline 30.
The output of the latch 72 is the delayed fetch complete signal 142, which is then used by the pipeline-advance logic 90 and which controls fetch request signals 110. At each stage 320, 330, 340 of the EM pipeline 40, EM logic functions 80 work to fetch a line of instructions 150 from the memory subsystem 20. When a line of instructions 150 is returned to the x86 engine 30, it is for the address 120 in the BT3 stage 370, which represents the oldest unsatisfied fetch request. The instruction information is buffered and placed into an MIQ 70 one clock cycle later in the ALN stage 380 of the x86 engine 30.
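A minimal sketch of this behavior, assuming a simple software model of the latch 72 and a per-stage advance of the mirrored address queue, is given below. The function names and the exact advance conditions are illustrative assumptions, not a definitive implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of latch 72: the pipeline-advance logic 90 observes the
 * fetch complete indication one clock cycle after it is asserted. */
typedef struct { bool state; } Latch;

static bool latch_clock(Latch *l, bool fetch_complete_now) {
    bool delayed = l->state;        /* delayed fetch complete signal 142 */
    l->state = fetch_complete_now;  /* capture fetch complete signal 140 */
    return delayed;
}

/* Advance the mirrored x86 address queue (index 0 = BT1, 1 = BT2, 2 = BT3).
 * The BT3 entry is freed only when the delayed fetch complete shows that the
 * oldest outstanding request has been satisfied; younger entries then move
 * toward BT3 to fill any empty stage. */
static void advance_x86_queue(uint32_t addr[3], bool valid[3],
                              bool delayed_complete) {
    if (delayed_complete)                                  /* oldest satisfied */
        valid[2] = false;
    if (!valid[2] && valid[1]) {                           /* BT2 -> BT3 */
        addr[2] = addr[1]; valid[2] = true; valid[1] = false;
    }
    if (!valid[1] && valid[0]) {                           /* BT1 -> BT2 */
        addr[1] = addr[0]; valid[1] = true; valid[0] = false;
    }
}
```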
It should be appreciated that multiple fetch requests may be pending or “in-flight” at the same time.
For example, at the first clock cycle t, a fetch request signal 110A is sent for the first fetch address A. Fetch address A is in the IIP stage 320 of the EM pipeline 40, where the EM pipeline 40 receives the fetch address 120 along with the fetch request signal 110. At the second clock cycle t+1, a fetch request signal 110B is sent for the second fetch address B, while the memory subsystem 20 is prepared for fetching the first address A. A is in the IPG stage 330, and B is in the IIP stage 320. At the third clock cycle t+2 511, a fetch request signal 110C is sent for the third fetch address C, and the memory subsystem 20 is prepared for the second fetch address B. For the first fetch address A, the memory subsystem 20 is accessed, the line of instructions 130 is received by the fetch engine 40, and the line of instructions 150 is delivered to the x86 engine 30. In the EM pipeline 40, A is in the ROT stage 340, B is in the IPG stage 330, and C is in the IIP stage 320. In the next clock cycle t+3 512, the line of instruction bytes 150 for address A is written into the MIQ 70 in the ALN stage 380 of the x86 pipeline 30.
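The stall-free case just described can be traced with the following self-contained sketch, offered only as an illustration of the timing; the stage model is an assumption and omits stalls, flushes, and the handshake signals themselves.

```c
#include <stdio.h>

/* Trace of fetch addresses A, B, C, ... moving through the EM pipeline
 * stages IIP -> IPG -> ROT, with the returned line reaching the MIQ in the
 * x86 ALN stage one clock cycle after the ROT stage. Stall-free case only. */
int main(void) {
    const char *request[] = { "A", "B", "C", "D", "E", "F" };
    const char *iip = "-", *ipg = "-", *rot = "-", *aln = "-";

    printf("cycle  IIP  IPG  ROT  ALN (MIQ write)\n");
    for (int t = 0; t < 6; t++) {
        aln = rot;          /* line for the ROT-stage address reaches the MIQ */
        rot = ipg;          /* memory accessed; line of instructions returned */
        ipg = iip;          /* memory subsystem prepared for the fetch        */
        iip = request[t];   /* a new fetch request accepted every clock cycle */
        printf("t+%d    %-4s %-4s %-4s %s\n", t, iip, ipg, rot, aln);
    }
    return 0;
}
```

At cycle t+2 the trace shows A in ROT, B in IPG, and C in IIP, and at t+3 the line for A is written to the MIQ, matching the example above.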
Keys accompanying the figures trace, clock cycle by clock cycle, the progression of the fetch addresses through the stages of the EM pipeline 40 and the x86 pipeline 30 under the handshake method. Corresponding keys for existing methods illustrate how, following a stall, the EM pipeline and the x86 pipeline fall out of synchronization and process different addresses in corresponding pipeline stages.
The handshake protocol described above alleviates the need for the complex recovery mechanisms that previous methods required to keep two independent, interfacing pipelines in sync. The handshake is, by definition, independent of the latency between a fetch request and its completion. This makes the implementation fairly straightforward and relatively easy to verify.
The x86 engine 30 will issue up to three fetch requests 110 before expecting the line of instructions for the first fetch request 110₁ to be returned. In the absence of front-end fetch-related stalls (e.g., due to a cache miss or a TLB miss), data for the request of the first fetch address 120₁ is returned in the same cycle as the fetch request for the third fetch address 120₃ is being made. Thus, new requests can continue to be pipelined, and a new fetch request made in every clock cycle. In the event of a front-end fetch stall, the fetch complete indication 140 will not be asserted until the stall condition is resolved and the data becomes available. The fetch engine 40 is able to buffer up to three fetch addresses in the fetch address queue 50 and process the corresponding data in a first-in, first-out (FIFO) fashion. Thus, a fetch request for the fourth fetch address 120₄ will not be made by the x86 engine 30 until a fetch complete indication 140 is known to have been received for the oldest outstanding request in the previous cycle. By buffering up to three pending requests, the fetch request logic can use a clock-cycle-delayed version of the fetch complete indication 142 from the fetch engine to initiate the new request. This alleviates timing pressure on the fetch complete signal 140 coming from the fetch engine 40 while still maintaining the two pipelines 30, 40 in synchronization. In the event of pipeline flushes, the fetch queues 50 are emptied, and all in-flight, or pending, fetch requests 110 are canceled. Because the pipelines 30, 40 are in sync, there is no need to selectively flush the different stages of the pipeline as was necessary in earlier implementations. In addition, because the x86 engine 30 is designed to ensure that there are never more than three pending or “in-flight” (i.e., requested, but not yet returned) fetch addresses 120 at any given time, the fetch address queue 50 will never be oversubscribed.
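A compact sketch of this issue rule follows, under the assumption of a simple counter model of the three-deep fetch address queue 50; the names and the counter abstraction are illustrative only.

```c
#include <stdbool.h>

/* Hypothetical model of the x86 engine's fetch-request issue rule: at most
 * three fetches may be in flight, the decision uses the one-cycle-delayed
 * fetch complete signal, and a pipeline flush cancels everything. */
typedef struct {
    int  in_flight;         /* requested but not yet returned: 0..3          */
    bool delayed_complete;  /* latched fetch complete from the previous cycle */
} FetchRequestCtl;

/* Called once per clock cycle before deciding whether to issue. */
static void fetch_ctl_clock(FetchRequestCtl *c, bool fetch_complete_last_cycle) {
    c->delayed_complete = fetch_complete_last_cycle;
    if (c->delayed_complete && c->in_flight > 0)
        c->in_flight--;     /* the oldest outstanding request was satisfied */
}

static bool may_issue_fetch(const FetchRequestCtl *c) {
    return c->in_flight < 3;  /* fetch address queue 50 is three entries deep */
}

static void issue_fetch(FetchRequestCtl *c) { c->in_flight++; }

/* On a pipeline flush the fetch queues are emptied and all in-flight
 * fetch requests are canceled. */
static void flush(FetchRequestCtl *c) {
    c->in_flight = 0;
    c->delayed_complete = false;
}
```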
The MIQ 70 includes a write pointer 410, which indicates the entry into which the next line of instructions will be written, and a read pointer 420, which indicates the entry from which the next line of instructions will be read.
The execution of instructions in the x86 engine 30 goes through several stages before the instruction is eventually retired. That is, an instruction may have been read from the MIQ 70, but remains in the x86 engine 30 for some period before it is retired. It may be desirable that the entries in the MIQ 70 not be overwritten until an instruction has been retired from the x86 engine 30. The read pointer 420 may advance before the instruction has been retired. Therefore, in one embodiment, the MIQ 70 includes a retire pointer 430 to indicate that an instruction has been retired by the x86 engine 30 and may safely be overwritten. The retire pointer 430 will lag the read pointer 420.
In one embodiment, the system prevents new fetch requests 110 from being issued by the x86 engine 30 if the MIQ 70 is full. This is done by comparing the MIQ pointers to ensure that no entry is overwritten before the desired time. The write pointer 410 must not write to entries that have not been read, as indicated by the read pointer 420. Also, it may be desirable to prevent overwriting entries that have been read but not retired, as indicated by the retire pointer 430.
As noted, however, the system may include multiple pending request stages (e.g., IIP, IPG, ROT), and the fetch address queue 50 may have multiple entries in it. To account for the lines of instructions that will return for these pending fetch requests, one embodiment includes a speculative write pointer 440 that leads the write pointer 410 by the number of pending fetch requests.
For example, if the MIQ 70 has 8 entries (0-7), the write pointer 410 may be pointing at entry 3, as the entry into which the next line of instructions will be written. The read pointer 420 may point to entry 2, the entry from which the next line of instructions will be read. The retire pointer 430 may point to entry 1, the entry from which the most recent instruction was retired in the x86 engine 30. If three fetch requests 110 are already in the fetch engine 40, then the speculative write pointer 440 will point to entry 6, leaving room on the MIQ 70 for the lines of instructions 130 that are returned for those requests 110. Because the speculative write pointer 440 cannot pass the retire pointer 430, the system can be configured such that no fetch request 110 is issued when the speculative write pointer 440 catches up to the retire pointer 430.
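The pointer arithmetic in this example can be sketched as follows for an eight-entry circular MIQ. The full-detection convention shown (stopping one entry short of the retire pointer) is an assumption for illustration, not necessarily the convention of any particular implementation.

```c
#include <stdbool.h>

#define MIQ_ENTRIES 8u   /* eight-entry MIQ, as in the example above */

typedef struct {
    unsigned write;    /* write pointer 410: next entry to be written       */
    unsigned read;     /* read pointer 420: next entry to be read           */
    unsigned retire;   /* retire pointer 430: oldest entry not yet retired  */
    unsigned pending;  /* fetch requests in flight, at most three           */
} MiqPointers;

/* Speculative write pointer 440: where the write pointer will stand once
 * the lines for all pending fetch requests have been written. */
static unsigned speculative_write(const MiqPointers *m) {
    return (m->write + m->pending) % MIQ_ENTRIES;
}

/* Allow a new fetch request only if the resulting speculative write pointer
 * would not catch up to the retire pointer, so no unretired entry can be
 * overwritten by a returning line of instructions. */
static bool miq_allows_new_fetch(const MiqPointers *m) {
    unsigned next = (speculative_write(m) + 1u) % MIQ_ENTRIES;
    return next != m->retire;
}
```

With the pointers of the example (write at 3, three requests pending, retire at 1), the speculative write pointer computes to entry 6 and a new request is still allowed.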
Although the system and method have been described in detail with reference to certain embodiments thereof, variations are possible. For example, although the values of certain data, sizes of the pipelines, number of pending fetch requests, clock cycles, and other certain specific information were given as examples, these examples were by way of illustration only, and not by way of limitation. The system and method may be embodied in other specific forms without departing from the essential spirit or attributes thereof. Although examples shown refer specifically to the IA-64 architecture and to the EM fetch engine and the x86 engine as the native and emulation systems, these are by way of illustration only and not by way of limitation. The method may be implemented on any type of architecture capable of using more than one type of ISA. It is desired that the embodiments described herein be considered in all respects as illustrative, not restrictive, and that reference be made to the appended claims for determining the scope of the invention.
McCormick, Jr., James E., Brockmann, Russell C., Undy, Stephen R., Arnold, Barry J., Dua, Anuj, Kubicek, David Carl, Stout, James Curtis