A method of performing a store operation in a computer processor is disclosed. The method issues a store operation that is divided into a pre-fetch micro-operation, which loads a needed cache line into a cache memory, and a subsequent store micro-operation, which stores a data value into the needed cache line that was pre-fetched into the cache memory.
7. A method comprising:
handling cache and memory operations via a memory subsystem; decoding a store operation into a pre-fetch micro-operation and a store operation; the pre-fetch micro-operation loading a needed cache line into a cache; the store micro-operation storing a data value into the needed cache line into the cache; and an out-of-order scheduler scheduling the pre-fetch micro-operation and the store micro-operation out of an original program order.
1. An apparatus for processing computer instructions, said apparatus comprising:
a memory subsystem, said memory subsystem for handling cache and memory operations; a decoder, the decoder decoding a store operation into a pre-fetch micro-operation and a store operation; the pre-fetch micro-operation loading a needed cache line into a cache, and the subsequent store micro-operation storing a data value into the needed cache line into the cache; at least one execution unit, said execution unit for processing operations; and an out-of-order scheduler, said scheduler for scheduling the pre-fetch micro-operation and the store micro-operation out of an original program order.
2. The apparatus as claimed in
3. The apparatus as claimed in
4. The apparatus as claimed in
5. The apparatus as claimed in
6. The apparatus as claimed in
8. The method as claimed in
9. The method as claimed in
10. The method as claimed in
11. The method as claimed in
The present invention relates to the field of computer processor architecture. In particular, the present invention discloses a method and apparatus for improving memory operation efficiency by dividing memory write operations into a prefetch stage and a store stage.
Computer processor designers continually attempt to improve the performance of computer processors. To improve processor performance, many novel processor design approaches have been created such as pipeline execution, register renaming, out-of-order instruction execution, and branch prediction with speculative execution of instructions fetched after a predicted branch. However, the speed of computer memories has not increased proportionally with the speed increases of computer processors. To alleviate any speed bottleneck that may be caused by the relatively slow main memory, most processors use a local high-speed cache memory.
The speed of computer processors now often stretches the limitations of high-speed cache memories. In order to utilize a local high-speed cache memory system most efficiently, a processor must be carefully integrated with the cache memory system using read buffers and write buffers. The read buffers and write buffers provide a conduit between the processor execution units and the memory subsystem. If the design of the read buffers, write buffers, and the associated control logic is optimized, then the computer processor will not be slowed down by the memory system. It would therefore be desirable to have an improved memory interface within a computer processor.
A method of performing memory write operations in a computer processor is disclosed. The method issues a pre-fetch operation that loads a needed cache line into a cache memory. Then, a subsequent store operation is issued. The subsequent store operation stores a data value into the cache line that was pre-fetched into the cache memory.
Other objects, features, and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
The objects, features, and advantages of the present invention will be apparent to one skilled in the art in view of the following detailed description in which:
A method and apparatus for dividing memory write operations into a prefetch stage and store stage is disclosed. In the following description, for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that these specific details are not required in order to practice the present invention. For example, the present invention has been described with reference to a computer processor that has large instructions that are decoded into smaller micro-operations. However, the same techniques can easily be applied to other types of processors.
The fetched instructions are passed to a decoder 120 that decodes the fetched instructions. In one embodiment, the decoder decodes the computer processor instructions into one or more small, simple micro-operations (micro-ops). The decoded micro-operations are then passed to a scheduling unit 140 that allocates resources, performs register renaming, and schedules micro-ops for issue into execution units. Register renaming is performed using a register file and register map. The micro-ops are stored in reservation stations until execution. The scheduling unit 140 selects the micro-ops that will be executed at any given cycle of the processor.
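The selection step performed by the scheduling unit can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the circuit described here: the names `MicroOp` and `schedule`, the one-cycle result latency, and the issue width of two are all hypothetical choices made for illustration. Each cycle, the model issues the waiting micro-ops whose source registers are ready, which can reorder micro-ops relative to program order.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    sources: tuple  # registers that must be ready before issue
    dest: str       # register written on completion ("" if none)


def schedule(micro_ops, ready_registers, issue_width=2):
    """Return a list of cycles, each a list of the micro-op names issued."""
    waiting = list(micro_ops)      # models the reservation stations
    ready = set(ready_registers)
    cycles = []
    while waiting:
        # Pick the oldest micro-ops whose operands are all available.
        issued = [u for u in waiting
                  if all(s in ready for s in u.sources)][:issue_width]
        if not issued:
            raise RuntimeError("no micro-op has all of its operands ready")
        for u in issued:
            waiting.remove(u)
        # Assumption: results become visible at the end of the cycle.
        ready.update(u.dest for u in issued if u.dest)
        cycles.append([u.name for u in issued])
    return cycles


program = [
    MicroOp("load r1", ("r0",), "r1"),
    MicroOp("add r2", ("r1",), "r2"),   # must wait for the load
    MicroOp("mov r3", ("r0",), "r3"),   # independent, may issue early
]
order = schedule(program, {"r0"})
print(order)  # [['load r1', 'mov r3'], ['add r2']]
```

In this run, `mov r3` issues ahead of `add r2` even though it comes later in program order, because its only operand is already available.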
The scheduled micro-ops are dispatched from the reservation stations along with operands from the register file to execution units for execution. In the processor embodiment of
The speed of modern computer processors greatly exceeds the speed of standard Dynamic Random Access Memory (DRAM). To compensate for this speed disparity, interleaved memory systems have been created. Interleaved memory systems divide memory operations among a number of different DRAM units in a manner that improves memory response times. However, the most commonly used technique to improve memory subsystem performance is to implement high-speed cache memory systems.
Cache memory systems are small high-speed memories that are tightly integrated with a computer processor. Cache memory systems replicate the function of a normal main memory except that cache memories respond much faster. Therefore, to improve processor performance a section of main memory that the processor is currently using is copied into a high-speed cache memory.
When the processor needs information that is currently stored within the high-speed cache memory (a cache "hit"), the high-speed cache memory quickly responds with the needed information. Due to the locality of most programs, a well designed cache memory system will greatly reduce the number of times that a processor needs to access the slower main memory. Thus, the overall memory system performance is greatly improved.
When a processor needs to access information in a memory location that is not currently replicated in the cache memory (a cache "miss"), the processor may need to wait for the main memory to respond with the desired information. Cache memory systems are therefore designed to minimize the number of cache misses, since cache misses greatly reduce processor performance.
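The hit and miss behavior described above can be illustrated with a toy model. The following is a minimal sketch that assumes a direct-mapped cache with four 64-byte lines; the class name `DirectMappedCache`, the line size, and the line count are illustrative assumptions and are not taken from this description. The model tracks only which lines are resident, not their contents.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that only tracks which lines are resident."""

    def __init__(self, num_lines=4, line_size=64):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # tag held in each slot, None = empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        line_number = address // self.line_size
        index = line_number % self.num_lines
        tag = line_number // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag          # models fetching the line from memory
        self.misses += 1
        return False


cache = DirectMappedCache()
# Sequential accesses 16 bytes apart: the first touch of each 64-byte line
# misses, and the remaining three accesses to that line hit.
results = [cache.access(addr) for addr in range(0, 256, 16)]
print(cache.hits, cache.misses)  # 12 4
```

The access pattern has good locality, so three of every four accesses hit; a pattern that touched a new line on every access would miss every time.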
As illustrated in
The first two instructions of the code in
In many processors, the processor instructions that are fetched are broken down into smaller, simpler instructions. The smaller, simpler instructions are then executed to perform the operations of the larger, more complex processor instructions. For example, referring back to
A processor "store" instruction (store data value into a memory location) would normally be translated into a single micro-op since the task to be performed by the processor store instruction is relatively straightforward. However, the present invention contemplates dividing a processor store instruction into two separate micro-ops: a pre-fetch micro-op and a store micro-op. The pre-fetch micro-op alerts the memory subsystem that a particular cache line will be needed. In this manner, the memory subsystem will fetch the needed cache line. The store micro-op will perform the actual memory store operation. If the needed cache line is already in the cache, then the pre-fetch micro-op does not need to be issued.
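The division of a store into two micro-ops can be sketched in software. The following is only an illustration under stated assumptions: the names `MicroOp`, `decode_store`, `LINE_SIZE`, and `resident_lines` are hypothetical, the 64-byte line size is assumed, and the residency check is placed in the decode function purely to keep the sketch short. Where a real design would make that check is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

LINE_SIZE = 64  # assumed cache line size in bytes


@dataclass(frozen=True)
class MicroOp:
    kind: str                    # "PREFETCH" or "STORE"
    address: int
    value: Optional[int] = None  # only used by STORE


def decode_store(address, value, resident_lines):
    """Split one store instruction into a pre-fetch and a store micro-op.

    resident_lines is the set of cache line numbers assumed to be cached.
    """
    line = address // LINE_SIZE
    micro_ops = []
    if line not in resident_lines:
        # Alert the memory subsystem that this cache line will be needed.
        micro_ops.append(MicroOp("PREFETCH", line * LINE_SIZE))
    # The store micro-op performs the actual memory store operation.
    micro_ops.append(MicroOp("STORE", address, value))
    return micro_ops


resident = {0}                              # line 0 is already cached
cold = decode_store(0x1234, 7, resident)    # line 72 is not cached
warm = decode_store(0x10, 7, resident)      # line 0 is cached
print([u.kind for u in cold])  # ['PREFETCH', 'STORE']
print([u.kind for u in warm])  # ['STORE']
```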
The store address micro-op (STA) performs several preliminary functions required for the memory store operation. For example, the store address micro-op (STA) may perform memory-addressing calculations such as calculating the needed physical memory address from a virtual memory address. The store address micro-op (STA) may also perform memory system violation checks, such as segment violation checks. This implementation is desirable in processor architectures with complex memory addressing systems that require difficult memory address calculations and memory system violation determinations. The primary goal of the store address micro-op (STA) is to prefetch the cache line that contains the needed memory address.
After the store address micro-op (STA) has issued, the store data micro-op (STD) may issue. The store data micro-op (STD) stores the data in the memory address that was identified in the previously issued store address (STA) micro-op.
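The STA and STD sequence can be modeled as follows. This is a hedged sketch rather than the disclosed hardware: the class `MemorySubsystem`, the exception `SegmentViolation`, the single-level page table, the 4096-byte page size, the 64-byte line size, and the `store_id` used to pair the two micro-ops are all assumptions introduced for illustration.

```python
LINE_SIZE = 64    # assumed cache line size in bytes
PAGE_SIZE = 4096  # assumed page size in bytes


class SegmentViolation(Exception):
    """Raised when an address lies outside the assumed segment limit."""


class MemorySubsystem:
    def __init__(self, page_table, segment_limit):
        self.page_table = page_table    # virtual page -> physical page
        self.segment_limit = segment_limit
        self.cache = {}                 # line base address -> {offset: value}
        self.pending = {}               # store id -> physical address

    def sta(self, store_id, virtual_address):
        """Store address: check, translate, and prefetch the cache line."""
        if virtual_address >= self.segment_limit:
            raise SegmentViolation(hex(virtual_address))
        page, offset = divmod(virtual_address, PAGE_SIZE)
        physical = self.page_table[page] * PAGE_SIZE + offset
        line_base = physical - physical % LINE_SIZE
        self.cache.setdefault(line_base, {})  # models the line prefetch
        self.pending[store_id] = physical
        return physical

    def std(self, store_id, value):
        """Store data: write to the address identified by the earlier STA."""
        physical = self.pending.pop(store_id)
        line_base = physical - physical % LINE_SIZE
        hit = line_base in self.cache         # True if the prefetch landed
        self.cache.setdefault(line_base, {})[physical - line_base] = value
        return hit


mem = MemorySubsystem(page_table={0: 5}, segment_limit=4096)
phys = mem.sta(store_id=1, virtual_address=0x104)
hit = mem.std(store_id=1, value=0xAB)
print(phys, hit)  # 20740 True
```

Because the STA micro-op has already brought the line into the modeled cache, the later STD micro-op finds it resident and does not wait on main memory.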
Referring to
The foregoing has described a method and apparatus for dividing memory write operations into a prefetch stage and store stage. It is contemplated that changes and modifications may be made by one of ordinary skill in the art, to the materials and arrangements of elements of the present invention without departing from the scope of the invention.
Patent | Priority | Assignee | Title |
10019263, | Jun 15 2012 | Intel Corporation | Reordered speculative instruction sequences with a disambiguation-free out of order load store queue |
10048964, | Jun 15 2012 | Intel Corporation | Disambiguation-free out of order load store queue |
10185561, | Jul 09 2015 | CENTIPEDE SEMI LTD. | Processor with efficient memory access |
10282300, | Dec 02 2004 | Intel Corporation | Accessing physical memory from a CPU or processing element in a high performance manner |
10592300, | Jun 15 2012 | Intel Corporation | Method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization |
11593115, | Mar 27 2019 | Alibaba Group Holding Limited | Processor, device, and method for executing instructions |
7080210, | Feb 12 2002 | IP-FIRST, LLC | Microprocessor apparatus and method for exclusive prefetch of a cache line from memory |
7080211, | Feb 12 2002 | IP-First, LLC | Microprocessor apparatus and method for prefetch, allocation, and initialization of a cache line from memory |
7089368, | Feb 12 2002 | IP-First, LLC | Microprocessor apparatus and method for exclusively prefetching a block of cache lines from memory |
7089371, | Feb 12 2002 | IP-First, LLC | Microprocessor apparatus and method for prefetch, allocation, and initialization of a block of cache lines from memory |
7111125, | Apr 02 2002 | IP-First, LLC | Apparatus and method for renaming a data block within a cache |
7188215, | Jun 19 2003 | IP-First, LLC | Apparatus and method for renaming a cache line |
9280473, | Dec 02 2004 | Intel Corporation | Method and apparatus for accessing physical memory from a CPU or processing element in a high performance manner |
9575897, | Jul 09 2015 | CENTIPEDE SEMI LTD. | Processor with efficient processing of recurring load instructions from nearby memory addresses |
9710385, | Dec 02 2004 | Intel Corporation | Method and apparatus for accessing physical memory from a CPU or processing element in a high performance manner |
9904552, | Jun 15 2012 | Intel Corporation | Virtual load store queue having a dynamic dispatch window with a distributed structure |
9928121, | Jun 15 2012 | Intel Corporation | Method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization |
9965277, | Jun 15 2012 | Intel Corporation | Virtual load store queue having a dynamic dispatch window with a unified structure |
9990198, | Jun 15 2012 | Intel Corporation | Instruction definition to implement load store reordering and optimization |
Patent | Priority | Assignee | Title |
5170476, | Jan 22 1990 | Motorola, Inc.; MOTOROLA, INC , SCHAUMBURG, IL A CORP OF DE | Data processor having a deferred cache load |
5615386, | May 06 1993 | SAMSUNG ELECTRONICS CO , LTD | Computer architecture for reducing delays due to branch instructions |
5778423, | Jun 29 1990 | HEWLETT-PACKARD DEVELOPMENT COMPANY, L P | Prefetch instruction for improving performance in reduced instruction set processor |
5845101, | May 13 1997 | GLOBALFOUNDRIES Inc | Prefetch buffer for storing instructions prior to placing the instructions in an instruction cache |
5931945, | Apr 29 1994 | Sun Microsystems, Inc | Graphic system for masking multiple non-contiguous bytes having decode logic to selectively activate each of the control lines based on the mask register bits |
5944815, | Jan 12 1998 | Advanced Micro Devices, Inc. | Microprocessor configured to execute a prefetch instruction including an access count field defining an expected number of access |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jun 07 1999 | SHEAFFER, GAD S | Intel Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 012977 | /0837 | |
Jun 16 1999 | Intel Corporation | (assignment on the face of the patent) | / |
Date | Maintenance Fee Events |
Apr 21 2006 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Apr 14 2010 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
Mar 20 2014 | M1553: Payment of Maintenance Fee, 12th Year, Large Entity. |
Date | Maintenance Schedule |
Oct 22 2005 | 4 years fee payment window open |
Apr 22 2006 | 6 months grace period start (w surcharge) |
Oct 22 2006 | patent expiry (for year 4) |
Oct 22 2008 | 2 years to revive unintentionally abandoned end. (for year 4) |
Oct 22 2009 | 8 years fee payment window open |
Apr 22 2010 | 6 months grace period start (w surcharge) |
Oct 22 2010 | patent expiry (for year 8) |
Oct 22 2012 | 2 years to revive unintentionally abandoned end. (for year 8) |
Oct 22 2013 | 12 years fee payment window open |
Apr 22 2014 | 6 months grace period start (w surcharge) |
Oct 22 2014 | patent expiry (for year 12) |
Oct 22 2016 | 2 years to revive unintentionally abandoned end. (for year 12) |