A method of performing a store operation in a computer processor is disclosed. The store operation is divided into a pre-fetch micro-operation, which loads a needed cache line into a cache memory, and a subsequent store micro-operation, which stores a data value into the needed cache line that was pre-fetched into the cache memory.

Patent
   6470444
Priority
Jun 16 1999
Filed
Jun 16 1999
Issued
Oct 22 2002
Expiry
Jun 16 2019
Assignee
Intel Corporation
Entity
Large
1. An apparatus for processing computer instructions, said apparatus comprising:
a memory subsystem, said memory subsystem for handling cache and memory operations;
a decoder, the decoder decoding a store operation into a pre-fetch micro-operation and a store micro-operation; the pre-fetch micro-operation loading a needed cache line into a cache, and the subsequent store micro-operation storing a data value into the needed cache line in the cache;
at least one execution unit, said execution unit for processing operations; and
an out-of-order scheduler, said scheduler for scheduling the pre-fetch micro-operation and the store micro-operation out of an original program order.
2. The apparatus as claimed in claim 1 wherein said pre-fetch micro-operation designates a desired memory address, said desired memory address within an address space of said needed cache line.
3. The apparatus as claimed in claim 2 wherein said desired memory address comprises a virtual memory address.
4. The apparatus as claimed in claim 3 wherein said pre-fetch micro-operation calculates a physical memory address from said virtual memory address.
5. The apparatus as claimed in claim 4 wherein said pre-fetch micro-operation performs memory system violation checks.
6. The apparatus as claimed in claim 4 wherein said out-of-order scheduler schedules intermediate instructions between said pre-fetch micro-operation and said store micro-operation.
7. A method comprising:
handling cache and memory operations via a memory subsystem;
decoding a store operation into a pre-fetch micro-operation and a store micro-operation;
the pre-fetch micro-operation loading a needed cache line into a cache;
the store micro-operation storing a data value into the needed cache line in the cache; and
an out-of-order scheduler scheduling the pre-fetch micro-operation and the store micro-operation out of an original program order.
8. The method as claimed in claim 7 wherein said desired memory address comprises a virtual memory address.
9. The method as claimed in claim 8 wherein the pre-fetch micro-operation includes calculating a physical memory address from the virtual memory address.
10. The method as claimed in claim 9 wherein the pre-fetch micro-operation includes performing memory system violation checks.
11. The method as claimed in claim 7 wherein the pre-fetch micro-operation includes designating a desired memory address, said desired memory address within an address space of the needed cache line.

The present invention relates to the field of computer processor architecture. In particular, the present invention discloses a method and apparatus for improving memory operation efficiency by dividing memory write operations into a prefetch stage and a store stage.

Computer processor designers continually attempt to improve the performance of computer processors. To improve processor performance, many novel processor design approaches have been created such as pipeline execution, register renaming, out-of-order instruction execution, and branch prediction with speculative execution of instructions fetched after a predicted branch. However, the speed of computer memories has not increased proportionally with the speed increases of computer processors. To alleviate any speed bottleneck that may be caused by the relatively slow main memory, most processors use a local high-speed cache memory.

The speed of computer processors now often stretches the limitations of high-speed cache memories. In order to most efficiently utilize a local high-speed cache memory system, a processor must be carefully integrated with the cache memory system using read buffers and write buffers. The read buffers and write buffers provide a conduit between the processor execution units and the memory subsystems. If the design of the read buffers, write buffers, and the associated control logic is optimized, then the computer processor will not be slowed down by the memory system. It would therefore be desirable to have an improved memory interface within a computer processor.

A method of performing memory write operations in a computer processor is disclosed. The method issues a pre-fetch operation that loads a needed cache line into a cache memory. Then, a subsequent store operation is issued. The subsequent store operation stores a data value into the cache line that was pre-fetched into the cache memory.

Other objects, features, and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.

The objects, features, and advantages of the present invention will be apparent to one skilled in the art in view of the following detailed description in which:

FIG. 1 illustrates a pipelined superscalar computer processor.

FIG. 2 illustrates a short section of assembly code that may cause a bubble in a computer processor pipeline.

FIG. 3 illustrates a list of micro-operations generated by a first embodiment of the present invention.

FIG. 4 illustrates a list of micro-operations generated by a second embodiment of the present invention.

A method and apparatus for dividing memory write operations into a prefetch stage and store stage is disclosed. In the following description, for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that these specific details are not required in order to practice the present invention. For example, the present invention has been described with reference to a computer processor that has large instructions that are decoded into smaller micro-operations. However, the same techniques can easily be applied to other types of processors.

FIG. 1 illustrates a prior art out-of-order superscalar computer processor. In the computer processor of FIG. 1, an instruction fetch unit 110 fetches instructions from a first level local instruction cache 105 or a main memory unit 103. If the desired instruction is in the first level local instruction cache 105, then the instruction fetch unit 110 fetches from that first level local instruction cache 105. Otherwise, the instruction fetch unit 110 must fetch the desired instruction from the slower main memory unit 103. (In some embodiments, a second level cache may be present to further improve memory access times.)

The fetched instructions are passed to a decoder 120 that decodes the fetched instructions. In one embodiment, the decoder decodes the computer processor instructions into one or more small, simple micro-operations (micro-ops). The decoded micro-operations are then passed to a scheduling unit 140 that allocates resources, performs register renaming, and schedules micro-ops for issue to execution units. Register renaming is performed using a register file and a register map. The micro-ops are stored in a reservation station until execution. The scheduling unit 140 selects the micro-ops that will be executed at any given cycle of the processor.

The scheduled micro-ops are dispatched from the reservation stations, along with operands from the register file, to execution units for execution. In the processor embodiment of FIG. 1, there are four different execution units: a memory operation execution unit 150, execution unit 2, execution unit 3, and execution unit 4. The scheduling unit 140 dispatches micro-ops to the execution units by selecting data-ready micro-ops for which an appropriate execution unit is available. In most processors, the individual execution units are not identical; each execution unit can process only certain types of instructions. For example, in FIG. 1, the memory operation execution unit 150 is configured only for performing memory operations.
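
The following short Python sketch illustrates, purely behaviorally, the kind of dispatch decision described above: data-ready micro-ops are matched to execution units that accept their type. All names (MicroOp, Unit, dispatch) and the two-unit configuration are assumptions made for illustration; they are not details of the hardware described in FIG. 1.

    from dataclasses import dataclass

    @dataclass
    class MicroOp:
        name: str
        kind: str                        # "mem" or "alu"

    @dataclass
    class Unit:
        name: str
        accepts: frozenset               # micro-op kinds this unit can execute

    def dispatch(ready_ops, units):
        """Assign at most one data-ready micro-op to each free, type-compatible unit."""
        free, issued = list(units), []
        for op in ready_ops:
            unit = next((u for u in free if op.kind in u.accepts), None)
            if unit is not None:
                free.remove(unit)        # the unit is busy for this cycle
                issued.append((op.name, unit.name))
        return issued

    units = [Unit("MEM", frozenset({"mem"})), Unit("ALU1", frozenset({"alu"}))]
    ready = [MicroOp("STO [Address1],R2", "mem"), MicroOp("ADD R2,R1", "alu")]
    print(dispatch(ready, units))        # each micro-op goes to a unit of the right type

The point of the sketch is only that a memory micro-op can be issued solely to the memory operation execution unit, which is why memory micro-ops that stall that unit are costly.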

The speed of modern computer processors greatly exceeds the speed of standard Dynamic Random Access Memory (DRAM). To compensate for this speed disparity, interleaved memory systems have been created. Interleaved memory systems divide memory operations among a number of different DRAM units in a manner that improves memory response times. However, the most commonly used technique to improve memory subsystem performance is to implement high-speed cache memory systems.

Cache memory systems are small high-speed memories that are tightly integrated with a computer processor. Cache memory systems replicate the function of a normal main memory except that cache memories respond much faster. Therefore, to improve processor performance a section of main memory that the processor is currently using is copied into a high-speed cache memory.

When the processor needs information that is currently stored within the high-speed cache memory (a cache "hit"), the high-speed cache memory quickly responds with the needed information. Due to the locality of most programs, a well designed cache memory system will greatly reduce the number of times that a processor needs to access the slower main memory. Thus, the overall memory system performance is greatly improved.

When a processor needs to access information in a memory location that is not currently replicated in the cache memory (a cache "miss"), the processor may need to wait for the main memory to respond with the desired information. Therefore, cache memory systems are designed to minimize the number of cache misses, since cache misses greatly reduce processor performance.
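
As a rough illustration of the hit/miss behavior described above, the following Python sketch models a cache as a set of resident line addresses. The latency numbers, line size, and class name are arbitrary assumptions for the example, not figures from this disclosure.

    CACHE_HIT_CYCLES = 2                 # assumed latency when the line is resident
    CACHE_MISS_CYCLES = 50               # assumed latency when main memory must respond
    LINE_SIZE = 64                       # assumed bytes per cache line

    class ToyCache:
        def __init__(self):
            self.lines = set()           # addresses of cache lines currently resident

        def access(self, address):
            """Return the access latency and load the line on a miss."""
            line = address // LINE_SIZE
            if line in self.lines:
                return CACHE_HIT_CYCLES  # cache "hit"
            self.lines.add(line)         # fetch the line from main memory
            return CACHE_MISS_CYCLES     # cache "miss"

    cache = ToyCache()
    print(cache.access(0x1000))          # first access misses: 50 cycles
    print(cache.access(0x1008))          # same line now resident: 2 cycles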

As illustrated in FIG. 1, a typical high-speed processor includes an integrated data cache memory unit 180. To obtain maximum performance from the data cache memory unit 180, the integrated data cache memory unit 180 must be loaded with the most relevant memory locations such that the best cache hit rate is achieved. To help achieve this goal, the present invention proposes dividing memory write operations into two stages: a cache line prefetch instruction and a memory write instruction. The cache line prefetch instruction and the memory write instruction do not need to be (and probably should not be) consecutive. The cache line prefetch instruction alerts the memory operation subsystem of an upcoming memory write instruction. In this manner, the processor will fetch the proper cache line into the integrated data cache memory unit 180. Thus, no cache "miss" will occur when the memory write instruction is later encountered.

FIG. 2 illustrates an example section of code wherein a cache miss can cause a stall in a processor pipeline. The code of FIG. 2 represents generic, fictional processor machine code using instructions similar to those found in most processors. Referring to the code of FIG. 2, the first instruction loads a first register (REG1) with an immediate data value. The next instruction adds REG1 to a second register (REG2) and stores the result in REG2. In the third instruction, the result in REG2 is stored into a memory location at an arbitrary address (Address1). The fourth instruction loads another register (REG0) with the data value that was stored at Address1. The fifth instruction adds the value at Address1 to yet another register (REG4).

The first two instructions of the code in FIG. 2 are easily executed without a processor stall since they access only internal registers and an immediately available data value. However, the third instruction (the STORE instruction) accesses external memory. To quickly access a memory location, a cache memory will likely be used. But if the cache line containing Address1 is not currently in the cache memory, then the STORE instruction will stall the processor while that cache line is fetched. Even a processor capable of concurrently executing multiple instructions out of the original program order will stall on the code of FIG. 2, because the two instructions following the memory STORE instruction depend on the value stored into Address1 and therefore cannot execute before the STORE completes. Thus, the processor will be stalled until the cache line containing Address1 is fetched such that the memory STORE instruction can complete.
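
The stall described above can be illustrated with a small Python sketch (the latencies, line size, and address are assumed values, not details from this disclosure): because the line holding Address1 is not resident, the STORE pays the full miss penalty, and the dependent LOAD and ADD must wait behind it.

    LINE, HIT, MISS = 64, 2, 50          # assumed line size and latencies in cycles
    resident_lines = set()               # the line holding Address1 is not cached yet
    ADDRESS1 = 0x1000

    def store_latency(addr):
        """Latency the STORE sees: a hit if the line is resident, otherwise a miss."""
        return HIT if addr // LINE in resident_lines else MISS

    bubble = store_latency(ADDRESS1)     # 50-cycle pipeline bubble before the store completes
    print("STORE stalls for", bubble, "cycles; the dependent LOAD and ADD wait with it")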

In many processors, the instructions fetched by the processor are broken down into smaller, simpler instructions. These smaller, simpler instructions are then executed to perform the operations of the larger, more complex processor instructions. For example, referring back to FIG. 1, the decoder 120 decodes the fetched processor instructions and outputs one or more small, simple micro-operation (micro-op) codes for each processor instruction. The remainder of the instruction processing is performed using the micro-ops.

A processor "store" instruction (store data value into a memory location) would normally be translated into a single micro-op since the task to be performed by the processor store instruction is relatively straightforward. However, the present invention contemplates dividing a processor store instruction into two separate micro-ops: a pre-fetch micro-op and a store micro-op. The pre-fetch micro-op alerts the memory subsystem that a particular cache line will be needed. In this manner, the memory subsystem will fetch the needed cache line. The store micro-op will perform the actual memory store operation. If the needed cache line is already in the cache, then the pre-fetch micro-op does not need to be issued.

FIG. 3 illustrates an example of how the divided memory write operation can improve processor performance. FIG. 3 illustrates an example of micro-ops issued in a super-scalar processor capable of concurrently executing instructions out of the original program order to perform the processor instructions of FIG. 2. Referring to FIG. 3, the first micro-op issued is a prefetch (PRF) micro-op that alerts the memory subsystem of an upcoming memory operation that will affect Address1. The next two micro-ops (LOD and ADD) perform the functions of the first two processor instructions (LOAD and ADD) of FIG. 2. The fourth micro-op is a memory store (STO) micro-op that stores the value of the second register (R2) into Address1. Since the pre-fetch micro-op had already issued, the cache line containing Address1 will be available such that the memory store micro-op will not stall the processor pipeline. The last two micro-ops of FIG. 3 perform the function of the last two computer processor instructions of FIG. 2.

FIG. 4 illustrates another example micro-op flow using a second embodiment of a divided memory write operation. In the embodiment of FIG. 4, the memory write operation is divided into a store address (STA) micro-op and a store data (STD) micro-op.

The store address micro-op (STA) performs several preliminary functions required for the memory store operation. For example, the store address micro-op (STA) may perform memory-addressing calculations, such as calculating the needed physical memory address from a virtual memory address. The store address micro-op (STA) may also perform memory system violation checks (such as segment violation checks). This implementation is desirable in processor architectures with complex memory addressing systems that require difficult memory address calculations and memory system violation determinations. The primary goal of the store address micro-op (STA) is to prefetch the cache line that contains the needed memory address.
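
As an illustration only, the following Python sketch shows the kind of work an STA micro-op might perform: a segment limit check followed by a virtual-to-physical translation whose result identifies the cache line to prefetch. The page table, segment limit, and page size are invented for the example and are not taken from this disclosure.

    PAGE_SIZE = 4096                           # assumed page size
    SEGMENT_LIMIT = 0x0010_0000                # assumed top of the data segment
    page_table = {0x1000 // PAGE_SIZE: 0x0080_0000}   # assumed virtual page -> physical frame

    class SegmentViolation(Exception):
        pass

    def sta(virtual_address):
        """Check the address, translate it, and return the physical address to prefetch."""
        if virtual_address >= SEGMENT_LIMIT:
            raise SegmentViolation(hex(virtual_address))
        frame = page_table[virtual_address // PAGE_SIZE]
        physical = frame + (virtual_address % PAGE_SIZE)
        # ...the memory subsystem would now prefetch the cache line holding 'physical'
        return physical

    print(hex(sta(0x1040)))                    # translated address whose line gets fetched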

After the store address micro-op (STA) has issued, the store data micro-op (STD) may issue. The store data micro-op (STD) stores the data at the memory address that was identified by the previously issued store address (STA) micro-op.

Referring to FIG. 4, the first micro-op issued in the second embodiment example is a store address micro-op (STA) that calculates a desired memory address, performs memory violation checks, and alerts the memory subsystem of an upcoming memory operation that will affect the desired memory address. The next two micro-ops (LOD and ADD) perform the functions of the first two computer processor instructions (LOAD and ADD) in the code of FIG. 2. The fourth micro-op is a store data (STD) micro-op that stores the value of the second register (R2) into the address specified by the previously issued store address (STA) micro-op. In this example, the previously issued store address (STA) micro-op generated the address Address1. Since the store address micro-op (STA) had already issued, the cache line containing the generated address Address1 will be available such that the store data (STD) micro-op will not stall the processor pipeline. The last two micro-ops of FIG. 4 perform the function of the last two computer processor instructions of FIG. 2.
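
The STA/STD pairing can be sketched in Python as follows (illustrative only; the names, line size, and toy memory model are assumptions): the STA micro-op resolves the address and brings its cache line in, and the later STD micro-op stores the data to the address the STA produced.

    LINE = 64                                  # assumed bytes per cache line
    resident_lines = set()                     # cache lines currently resident
    memory = {}                                # toy backing store

    def sta(address):
        resident_lines.add(address // LINE)    # prefetch the line for the coming store
        return address                         # pass the resolved address to the STD

    def std(address, value):
        assert address // LINE in resident_lines, "STA should already have fetched this line"
        memory[address] = value                # the actual store now hits in the cache

    target = sta(0x1000)                       # STA Address1
    # ...independent LOD and ADD micro-ops execute here while the line is fetched...
    std(target, 7)                             # STD stores R2 into Address1 without stalling
    print(memory)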

The foregoing has described a method and apparatus for dividing memory write operations into a prefetch stage and a store stage. It is contemplated that changes and modifications may be made by one of ordinary skill in the art to the materials and arrangements of elements of the present invention without departing from the scope of the invention.

Inventor
Sheaffer, Gad S.
