An improved manifold array (ManArray) architecture addresses the problem of configurable application-specific instruction set optimization and instruction memory reduction using an instruction abbreviation process, thereby further optimizing the general ManArray architecture for application to high-volume and portable battery-powered types of products.
In the ManArray abbreviation process a standard 32-bit ManArray instruction is reduced to a smaller length instruction format, such as 14 bits. An application is first programmed with the full ManArray instruction set using the native 32-bit instructions. After the application program is completed and verified, an instruction-abbreviation tool analyzes the 32-bit application program and generates the abbreviated program using the abbreviated instructions. This instruction abbreviation process allows different program-reduction optimizations tailored for each application program. This process develops an optimized instruction set for the intended application. The abbreviated program, now located in a significantly smaller instruction memory, is functionally equivalent to the original native 32-bit application program. The abbreviated instructions are fetched from this smaller memory and then dynamically translated into native ManArray instruction form in a sequence processor controller. Since the instruction set is now determined for the specific application, an optimized processor design can be easily produced. The system and process can be applied to native instructions having other numbers of bits and to other processing architectures.
18. A method for generating an abbreviated instruction set corresponding to a set of native manifold array (ManArray) instructions all used in an application specific program comprising the steps of:
separating the set of native ManArray instructions into groups of instructions;
identifying the unique instructions within each group of instructions;
analyzing the unique instructions for common instruction characteristics;
determining at least one style pattern of bits which is defined as a specific pattern of bits that are constant; and
generating the abbreviated instruction set utilizing the at least one style by encoding the at least one style pattern of bits into a reduced number of bits utilizing a processor.
1. A method for generating an abbreviated application specific program utilizing an abbreviated instruction set comprising the steps of:
generating a native program for an application utilizing a set of native instructions having a first fixed number of bits;
debugging the native program;
processing the debugged native program by analyzing the set of native instructions at a sub-instruction level with a processor to determine specific patterns of bits that do not change within groups of instructions and utilizing the results of said analysis to determine an abbreviated instruction set having a second fixed number of bits less than the first fixed number of bits and corresponding to the set of native instructions; and
converting the native program to the abbreviated application specific program by replacing the set of native instructions with the abbreviated instruction set.
0. 46. A system for controlling a translation process wherein a B-bit abbreviated instruction having B bits is translated into a native instruction format having C bits, where the value C is greater than the value B, the system comprising:
a B-bit instruction register for holding the B-bit abbreviated instruction;
a base register;
an adder; and
a native instruction register, wherein the base register output and a field of the B-bit abbreviated instruction in the B-bit instruction register are added by the adder to produce an output which selects native instruction bits for loading into the native instruction register, the selected native instruction bits are not found in the B-bit abbreviated instruction, the selected native instruction bits having been previously determined by analyzing a set of native instructions for specific patterns of bits that do not change within the set of native instructions.
0. 51. A method for operating a processor utilizing an abbreviated instruction having a first number of bits, the method comprising:
retrieving the abbreviated instruction;
generating an address reference for a native instruction pattern from combining a first bit field in the abbreviated instruction and a base address register;
retrieving the native instruction pattern of bits to be combined with the abbreviated instruction using the address reference, the native instruction pattern of bits being based on a previous analysis of a set of native instructions on a sub-instruction level to determine patterns of bits that do not change within groups of instructions;
combining the native instruction pattern of bits with the abbreviated instruction to create a native instruction, the native instruction having a second number of bits, the second number of bits is greater than the first number of bits; and
dispatching the native instruction to a processor for execution.
0. 50. A system for controlling a translation process wherein a B-bit abbreviated instruction having B bits is translated into a native instruction format having C bits, where the value C is greater than the value B, the system comprising:
a B-bit instruction register for holding the B-bit abbreviated instruction;
two base registers, the two base register outputs and two fields of the B-bit abbreviated instruction are concatenated respectively to form at least two addresses to select at least two patterns of native instruction bits, the selected at least two patterns of native instruction bits having been previously determined by analyzing a set of native instructions for specific patterns of bits that do not change within the set of native instructions; and
a native instruction register for loading the native instruction, wherein the selected at least two patterns of native instruction bits are combined as specified by a style set of bits stored in the processor to form the native instruction.
43. A system for translating abbreviated instructions into a native instruction format comprising:
a memory storing an abbreviated instruction having a first fixed number of bits;
means for fetching the abbreviated instruction from the memory;
means for dynamically translating the abbreviated instruction into a native instruction using two translation memories each storing at least one sub-native instruction pattern of bits, each of said sub-native instruction patterns being based on a previous analysis of a set of native instructions on a sub-instruction level to determine patterns of bits that do not change within groups of instructions;
two addressing mechanisms each using a bit field in the abbreviated instruction as an address reference to one of the two translation memories for the at least one sub-native instruction pattern of bits;
means for fetching the sub-native instruction patterns from each translation memory; and
means for combining the sub-native instruction patterns to create the native instruction in the native instruction format having a second fixed number of bits greater than said first fixed number.
39. A system for translating abbreviated instructions into a native instruction format comprising:
a memory storing an abbreviated instruction having a first fixed number of bits;
means for fetching the abbreviated instruction from the memory;
means for dynamically translating the abbreviated instruction into a native instruction using a translation memory storing a sub-native instruction pattern of bits, said sub-native instruction pattern being based on a previous analysis of a set of native instructions on a sub-instruction level to determine patterns of bits that do not change within groups of instructions;
an addressing mechanism using a first bit field in the abbreviated instruction as an address reference to the translation memory for the sub-native instruction pattern;
means for fetching the sub-native instruction pattern from the translation memory utilizing the address reference; and
means for combining the sub-native instruction pattern with bits from the abbreviated instruction to create the native instruction in the native instruction format having a second fixed number of bits greater than said first fixed number.
0. 49. A system for controlling a translation process wherein a B-bit abbreviated instruction having B bits is translated into a native instruction format having C bits, where the value C is greater than the value B, the system comprising:
a B-bit instruction register for holding the B-bit abbreviated instruction;
a base register;
an adder;
a native instruction register, wherein the native instruction register receives a plurality of direct load bits from a direct load field of the B-bit abbreviated instruction in the B-bit instruction register; and
a base register output and a field of the B-bit abbreviated instruction are added by the adder to produce an output which selects native instruction bits for loading in combination with the direct load bits into the native instruction register, the selected native instruction bits are not found in the B-bit abbreviated instruction, the selected native instruction bits having been previously determined by analyzing a set of native instructions for specific patterns of bits that do not change within the set of native instructions.
26. A method for translating abbreviated instructions into a native instruction format comprising the steps of:
fetching an abbreviated instruction having a first fixed number of bits from a memory tailored to storage of abbreviated instructions;
dynamically translating the abbreviated instruction into the format of a native instruction by using a first bit field in the abbreviated instruction as an address reference to a first translation memory containing at least one sub-native instruction pattern of bits;
fetching the sub-native instruction pattern from the translation memory using said address reference, said sub-native instruction pattern being based on a previous analysis of a set of native instructions on a sub-instruction level to determine patterns of bits that do not change within groups of instructions;
combining the sub-native instruction pattern with bits from the abbreviated instruction to create the native instruction in a sequence processor (SP) array controller, said native instruction having a second fixed number of bits greater than said first fixed number; and
dispatching the native instruction to the sequence processor array controller or a processing element for execution.
42. A method for translating abbreviated instructions into a native instruction format comprising the steps of:
fetching an abbreviated instruction having a first fixed number of bits from a memory tailored to storage of abbreviated instructions;
dynamically translating the abbreviated instruction into the format of a native instruction by using a first and a second bit field in the abbreviated instruction as address references to a first and a second translation memory, each containing at least one sub-native instruction pattern of bits;
fetching a sub-native instruction pattern from each translation memory using said address references, each of said sub-native instruction patterns being based on a previous analysis of a set of native instructions on a sub-instruction level to determine patterns of bits that do not change within groups of instructions;
combining the at least two sub-native instruction patterns to create the native instruction in a sequence processor (SP) array controller, said native instruction having a second fixed number of bits greater than said first fixed number; and
dispatching the native instruction to the sequence processor array controller or a processing element for execution.
2. The method of
analyzing the set of native instructions to identify a first group of native instructions having a style pattern of bits which is defined as a specific pattern of bits that are constant for said group.
3. The method of
storing the identified style pattern of bits in a translation memory.
4. The method of
utilizing the identified style pattern of bits stored in said translation memory to recreate native instructions from the first group of native instructions by combining bits from corresponding abbreviated instructions with the identified style pattern of bits.
5. The method of
analyzing the set of native instructions to identify multiple groups of native instructions, each group having a style pattern of bits which is defined as a specific pattern of bits that are constant.
6. The method of
storing the identified style patterns of bits in a translation memory.
7. The method of
utilizing an identified style pattern of bits selected from said translation memory to recreate native instructions from one of said multiple groups of native instructions by combining bits from corresponding abbreviated instructions with the identified style pattern of bits.
8. The method of
creating a one-to-one mapping between a program's native instruction and an abbreviated instruction by using a translation memory addressing mechanism to identify the style pattern of bits stored in said translation memory.
9. The method of
creating a one-to-one mapping between a program's native instruction and an abbreviated instruction by using a translation memory addressing mechanism to identify the style pattern of bits stored in said translation memory.
10. The method of
11. The method of
12. The method of
executing the application specific program on a simulator to verify its functional equivalence to the native program.
13. The method of
determining a processor core specification tailored for use in implementing the application specific program utilizing the abbreviated instruction set.
14. The method of
15. The method of
16. The method of
17. The method of
19. The method of
20. The method of
21. The method of
22. The method of
storing the at least one style pattern of bits in a translation memory.
23. The method of
24. The method of
25. The method of
utilizing the identified style stored in the translation memory to recreate native instructions from a first group of native instructions by combining bits from corresponding abbreviated instructions with the at least one style pattern of bits.
27. The method of claim 59 wherein the abbreviated instruction includes at least one S/P bit, a multi-bit opcode field and a multi-bit translation memory address offset for use in the address reference to the first translation memory.
28. The method of
29. The method of
selecting a plurality of native instruction bits from a location in the translation memory corresponding to the formed translation memory address.
30. The method of
31. The method of
32. The method of
33. The method of
selecting a plurality of native instruction bits from a location in the translation memory corresponding to the formed translation memory address.
34. The method of
35. The method of
36. The method of
37. The method of
selecting a first multi-bit portion of the native instruction from a first translation memory address utilizing the first multi-bit translation memory offset field;
selecting a second multi-bit portion of the native instruction from a second translation memory address utilizing the second multi-bit translation memory offset field; and
combining both multi-bit portions into a native instruction format.
38. The method of
40. The system of
41. The system of
0. 44. The method of
0. 45. The system of
a translation memory for storing said sub-native instruction pattern of bits.
0. 47. The system of
0. 48. The system of
0. 52. The method of
0. 53. The method of
converting a set of native instructions defining an application program into the abbreviated instruction set; and
storing the abbreviated instruction set.
0. 54. The method of
applying a style to determine how to map bit positions in the abbreviated instruction to bit positions in the native instruction and how to map bit positions in the translation memory entry to bit positions in the native instruction.
0. 55. The method of
0. 56. The method of
0. 57. The method of
decoding the multi-bit opcode field.
More than one reissue application has been filed for the reissue of U.S. Pat. No. 6,408,382. The reissue applications are application Ser. Nos. 10/848,615 which is the present application and 12/144,046 which is a divisional reissue application filed Jun. 23, 2008.
The present invention relates generally to improved methods and apparatus for providing abbreviated instructions, mechanisms for translating abbreviated instructions, and configurable processor architectures for system-on-silicon embedded processors.
An emerging class of embedded systems, especially those for portable systems, is required to achieve extremely high performance for the intended application, to have a small silicon area with a concomitant low price, and to operate with very low power requirements. Meeting these sometimes opposing requirements is a difficult task, especially when it is also desirable to maintain a common single architecture and common tools across multiple application domains. This is especially true in a scalable array processor environment. The difficulty of the task has prevented a general solution resulting in a multitude of designs being developed, each optimized for a particular application or specialized tasks within an application. For example, high performance 3D graphics for desktop personal computers or AC-powered game machines are not concerned with limiting power, nor necessarily maintaining a common architecture and set of tools across multiple diverse products. In other examples, such as portable battery powered products, great emphasis is placed on power reduction and providing only enough hardware performance to meet the basic competitive requirements. The presently prevailing view is that it is not clear that these seemingly opposing requirements can be met in a single architecture with a common set of tools.
In order to meet these opposing requirements, it is necessary to develop a processor architecture and apparatus that can be configured in more optimal ways to meet the requirements of the intended task. One prior art approach for configurable processor designs uses field programmable gate array (FPGA) technology to allow software-based processor optimizations of specific functions. A critical problem with this FPGA approach is that standard designs for high performance execution units require ten times the chip area or more to implement in a FPGA than would be utilized in a typical standard application specific integrated circuit (ASIC) design. Rather than use a costly FPGA approach for a configurable processor design, the present invention uses a standard ASIC process to provide software-configurable processor designs optimized for an application. The present invention allows for a dynamically configurable processor for low volume and development evaluations while also allowing optimized configurations to be developed for high volume applications with low cost and low power using a single common architecture and tool set.
Another aspect of low cost and low power embedded cores is the characteristic code density a processor achieves in an application. The greater the code density the smaller the instruction memory can be and consequently the lower the cost and power. A standard prior art approach to achieving greater code density is to use two instruction formats with one format half the size of the other format. Both of these different format types of instructions can be executed in the processor, though many times a mode bit is used to indicate which format type instruction can be executed. With this prior art approach, there typically is a limitation placed upon the reduced instructions which is caused by the reduced format size. For example, the number of registers visible to the programmer using a reduced instruction format is frequently restricted to only 8 or 16 registers when the full instruction format supports up to 32 or more registers. These and other compromises of a reduced instruction format are eliminated with this present invention as addressed further below.
Thus, it is recognized that it will be highly advantageous to have a scalable processor family of embedded cores based on a single architecture model that uses common tools to support software-configurable processor designs optimized for performance, power, and price across multiple types of applications using standard ASIC processes as discussed further below.
In one embodiment of the present invention, a manifold array (ManArray) architecture is adapted to employ various aspects of the present invention to solve the problem of configurable application-specific instruction set optimization and program size reduction, thereby increasing code density and making the general ManArray architecture even more desirable for high-volume and portable battery-powered types of products. The present invention extends the pluggable instruction set capability of the ManArray architecture described in U.S. application Ser. No. 09/215,081 filed Dec. 18, 1998, now U.S. Pat. No. 6,101,592, entitled “Methods and Apparatus for Scalable Instruction Set Architecture with Dynamic Compact Instructions” with new approaches to program code reduction and stand-alone operation using only abbreviated instructions in a manner not previously described.
In the ManArray instruction abbreviation process in accordance with the present invention, a program is analyzed and the standard 32-bit ManArray instructions are replaced with abbreviated instructions using a smaller length instruction format, such as 14-bits, custom tailored to the analyzed program. Specifically, this process begins with programming an application with the full ManArray architecture using the native 32-bit instructions and standard tools. After the application program is completed and verified, or in an iterative development process, an instruction-abbreviation tool analyzes the 32-bit ManArray application program and generates the application program using abbreviated instructions. This instruction-abbreviation process creates different program code size optimizations tailored for each application program. Also, the process develops an optimized abbreviated instruction set for the intended application. Since all the ManArray instructions can be abbreviated, instruction memory can be reduced, and smaller custom tailored cores produced. Consequently, it is not necessary to choose a fixed subset of the full ManArray instruction set architecture for a reduced instruction format size, with attendant compromises, to improve code density.
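The analysis step described above can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the exact method of the invention: it treats abbreviation as mapping each unique 32-bit instruction word to a small index, retaining the unique words in a table for later translation. The function name, data layout, and index-based encoding are illustrative assumptions only.

```python
def abbreviate_program(native_program):
    """Map each unique 32-bit native instruction word to a short index.

    native_program: list of 32-bit instruction words (ints).
    Returns (abbreviated_program, translation_table), where the
    abbreviated program stores only small indices and the table
    holds each unique native instruction once for later translation.
    """
    translation_table = []   # unique native words, in first-seen order
    index_of = {}            # native word -> abbreviated index
    abbreviated_program = []
    for word in native_program:
        if word not in index_of:
            index_of[word] = len(translation_table)
            translation_table.append(word)
        abbreviated_program.append(index_of[word])
    return abbreviated_program, translation_table

# A program with repeated instructions compresses well: each unique
# word is stored only once in the translation table.
prog = [0xDEADBEEF, 0x12345678, 0xDEADBEEF, 0x12345678, 0xCAFEF00D]
abbrev, table = abbreviate_program(prog)   # abbrev -> [0, 1, 0, 1, 2]
```

In a full abbreviation tool the indices would then be packed into a fixed short format, such as 14 bits, and the table organized by instruction group rather than kept as a single flat list.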
Depending upon the application requirements, certain rules may be specified to guide the initial full 32-bit code development to better optimize the abbreviation process, and the performance, size, and power of the resultant embedded processor. Using these rules, the reduced abbreviated-instruction program, now located in a significantly smaller instruction memory, is functionally equivalent to the original application program developed with the 32-bit instruction set architecture. In the ManArray array processor, the abbreviated instructions are fetched from this smaller memory and then dynamically translated into native ManArray instruction form in a sequence processor array controller. If after translation the instruction is determined to be a processing element (PE) instruction, it is dispatched to the PEs for execution. The PEs do not require a translation mechanism.
For each application, the abbreviation process reduces the instruction memory size and allows reduced-size execution units, reduced-size register files, and other reductions to be evaluated and, if determined to be effective, adopted, thereby specifying a uniquely optimized processor design for each application. Consequently, the resultant processor designs have been configured for their application.
A number of abbreviated-instruction translation techniques are demonstrated for the present invention where translation, in this context, means to change from one instruction format into another. The translation mechanisms are based upon a number of observations of instruction usage in programs. One of these observations is that in a static analysis of many programs not all instructions used in the program are unique. There is some repetition of instruction usage that varies from program to program. Using this knowledge, a translation mechanism for the unique instructions in a program is provided to reduce the redundant usage of the common instructions. Another observation is that in a static analysis of a program's instructions it is noticed that for large groups of instructions many of the bits in the instruction format do not change. One method of classifying the groups is by opcode, for example, arithmetic logic unit (ALU) and load instructions represent two opcode groupings of instructions. It is further recognized that within opcode groups there are often patterns of bits that do not change within the group of instructions. Using this knowledge, the concept of instruction styles is created. An instruction style as utilized herein represents a specific pattern of bits of the instruction format that is constant for a group of instructions in a specific program, but that can be different for any program analyzed. A number of interesting approaches and variations for translation emerge from these understandings. In one approach, a translation memory is used with a particular style pattern of bits encoded directly into the abbreviated-instruction format. In another approach, all the style bit patterns, or style-fields, are stored in translation memories and the abbreviated-instruction format provides the mechanism to access the style bit patterns.
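The style concept can be illustrated with a short sketch. Assuming the instructions of a group are available as 32-bit integer words, the function below computes the bit positions that remain constant across the group, which is the essence of a style pattern; the function name and representation are illustrative assumptions, not the patent's own notation.

```python
def find_style(group, width=32):
    """Return (constant_mask, constant_bits) for a group of instruction words.

    A set bit in constant_mask marks a bit position whose value never
    changes within the group; constant_bits holds those fixed values.
    """
    constant_mask = (1 << width) - 1
    reference = group[0]
    for word in group[1:]:
        # Positions where a word differs from the reference are not constant.
        constant_mask &= ~(word ^ reference) & ((1 << width) - 1)
    return constant_mask, reference & constant_mask

# Three ALU-type words sharing the same upper bits; only the low two
# bits vary, so the style covers the remaining 30 positions.
group = [0xAB000001, 0xAB000002, 0xAB000003]
mask, bits = find_style(group)   # mask -> 0xFFFFFFFC, bits -> 0xAB000000
```

An abbreviated instruction for this group then needs to carry only the varying positions, with the constant style bits restored at translation time.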
With the style patterns stored in memory, the translation process actually consists of constructing the native instruction format from one or more stored patterns. It was found in a number of exemplary cases that the program stored in main instruction memory can be reduced by more than 50% using these advantageous new techniques.
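The construction step can be sketched as follows. The 14-bit field layout used here (a translation-memory offset in the high bits, eight directly carried bits in the low bits) and the base-register addressing are simplified assumptions for illustration, not the exact hardware mechanism.

```python
def translate(abbrev, translation_memory, base):
    """Rebuild a 32-bit native instruction from a 14-bit abbreviated one.

    Hypothetical layout: the high bits of `abbrev` form an offset into
    the translation memory (relative to a base register value) and the
    low 8 bits are carried directly into the native instruction.
    """
    tm_offset = abbrev >> 8              # address-reference field
    variable_bits = abbrev & 0xFF        # bits carried directly
    constant_mask, constant_bits = translation_memory[base + tm_offset]
    # Style bits come from the stored pattern; the varying positions
    # are filled from the abbreviated instruction itself.
    return constant_bits | (variable_bits & (~constant_mask & 0xFFFFFFFF))

# One stored style entry: the high 24 bits are fixed at 0xAB0000__.
tm = [(0xFFFFFF00, 0xAB000000)]
native = translate((0 << 8) | 0x5A, tm, base=0)   # -> 0xAB00005A
```

The same merge can be realized in hardware as a translation-memory read followed by simple bit selection, which is what keeps the dynamic translation fast enough to sit in the fetch path.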
It is noted that the ManArray instruction set architecture while presently preferred is used herein only as illustrative as the present invention is applicable to other instruction set architectures.
These and other advantages of the present invention will be apparent from the drawings and the Detailed Description which follows.
Further details of a presently preferred ManArray architecture for use in conjunction with the present invention are found in U.S. Pat. No. 6,023,753, U.S. Pat. No. 6,167,502, U.S. patent application Ser. No. 09/169,255 filed Oct. 9, 1998, U.S. Pat. No. 6,167,501, U.S. Pat. No. 6,219,776, U.S. Pat. No. 6,151,668, U.S. Pat. No. 6,173,389, U.S. Pat. No. 6,101,592, U.S. Pat. No. 6,216,223, U.S. patent application Ser. No. 09/238,446 filed Jan. 28, 1999, U.S. patent application Ser. No. 09/267,570 filed Mar. 12, 1999, as well as, Provisional Application Serial No. 60/092,130 entitled “Methods and Apparatus for Instruction Addressing in Indirect VLIW Processors” filed Jul. 9, 1998, Provisional Application Serial No. 60/103,712 entitled “Efficient Complex Multiplication and Fast Fourier Transform (FFT) Implementation on the ManArray” filed Oct. 9, 1998, Provisional Application Serial No. 60/106,867 entitled “Methods and Apparatus for Improved Motion Estimation for Video Encoding” filed Nov. 3, 1998, Provisional Application Serial No. 60/113,637 entitled “Methods and Apparatus for Providing Direct Memory Access (DMA) Engine” filed Dec. 23, 1998, Provisional Application Serial No. 60/113,555 entitled “Methods and Apparatus Providing Transfer Control” filed Dec. 23, 1998, Provisional Application Serial No. 60/139,946 entitled “Methods and Apparatus for Data Dependent Address Operations and Efficient Variable Length Code Decoding in a VLIW Processor” filed Jun. 18, 1999, Provisional Application Serial No. 60/140,245 entitled “Methods and Apparatus for Generalized Event Detection and Action Specification in a Processor” filed Jun. 21, 1999, Provisional Application Serial No. 60/140,163 entitled “Methods and Apparatus for Improved Efficiency in Pipeline Simulation and Emulation” filed Jun. 21, 1999, Provisional Application Serial No. 60/140,162 entitled “Methods and Apparatus for Initiating and Re-Synchronizing Multi-Cycle SIMD Instructions” filed Jun. 
21, 1999, Provisional Application Serial No. 60/140,244 entitled “Methods and Apparatus for Providing One-By-One Manifold Array (1×1 ManArray) Program Context Control” filed Jun. 21, 1999, Provisional Application Serial No. 60/140,325 entitled “Methods and Apparatus for Establishing Port Priority Function in a VLIW Processor” filed Jun. 21, 1999, and Provisional Application Serial No. 60/140,425 entitled “Methods and Apparatus for Parallel Processing Utilizing a Manifold Array (ManArray) Architecture and Instruction Syntax” filed Jun. 22, 1999 respectively, all of which are assigned to the assignee of the present invention and incorporated by reference herein in their entirety.
In a presently preferred embodiment of the present invention, a ManArray 2×2 iVLIW single instruction multiple data stream (SIMD) processor 100 as shown in
In this exemplary system 100 of
The basic concept of loading the iVLIWs is described in further detail in co-pending U.S. patent application Ser. No. 09/187,539, now U.S. Pat. No. 6,151,668, entitled “Methods and Apparatus for Efficient Synchronous MIMD Operations with iVLIW PE-to-PE Communications” and filed Nov. 6, 1998. Also contained in the SP/PE0 and the other PEs is a common PE configurable register file (CRF) 127 which is described in further detail in co-pending U.S. patent application Ser. No. 09/169,255 entitled “Methods and Apparatus for Dynamic Instruction Controlled Reconfiguration Register File with Extended Precision” filed Oct. 9, 1998. Due to the combined nature of the SP/PE0, the data memory interface controller 125 must handle the data processing needs of both the SP controller, with SP data in memory 121, and PE0, with PE0 data in memory 123. The SP/PE0 controller 125 also is the controlling point of the data that is sent over the 32-bit or 64-bit broadcast data bus 126. The other PEs, 151, 153, and 155 contain common physical data memory units 123′, 123″, and 123′″ though the data stored in them is generally different as required by the local processing done on each PE. The interface to these PE data memories is also a common design in PEs 1, 2, and 3 and indicated by PE local memory and data bus interface logic 157, 157′ and 157″. Interconnecting the PEs for data transfer communications is a cluster switch 171 which is more completely described in co-pending U.S. patent application Ser. Nos. 08/885,310 entitled “Manifold Array Processor” filed Jun. 30, 1997, now U.S. Pat. Nos. 6,023,753, 08/949,122 entitled “Methods and Apparatus for Manifold Array Processing” filed Oct. 10, 1997, and 09/169,256 entitled “Methods and Apparatus for ManArray PE-to-PE Switch Control” filed Oct. 9, 1998, now U.S. Pat. No. 6,167,501. The interface to a host processor, other peripheral devices, and/or external memory can be done in many ways. 
For completeness, a primary interface mechanism is contained in a direct memory access (DMA) control unit 181 that provides a scalable ManArray data bus 183 that connects to devices and interface units external to the ManArray core. The DMA control unit 181 provides the data flow and bus arbitration mechanisms needed for these external devices to interface to the ManArray core memories via the multiplexed bus interface symbolically represented by line 185. A high level view of a ManArray control bus (MCB) 191 is also shown in FIG. 1A.
In the instruction format 12B, bit 99 is the S/P bit. Two other bits 14 are hierarchy bits. Suitable instruction type-2-A,B,C formats 98 are described in further detail in U.S. patent application Ser. No. 09/215,081 entitled “Methods and Apparatus for Scalable Instruction Set Architecture with Dynamic Compact Instructions” and filed Dec. 18, 1998.
The ManArray abbreviated-instruction architecture of the present invention allows a programmer to write application code using the full ManArray architecture based upon the native instruction format 12B of
Thus, the ManArray abbreviated-instruction architecture allows maximum flexibility during development while providing an optimized-to-an-application core in final production. This multiple application focusing process 200 is illustrated in
The ManArray instruction format 12B of
In this present implementation, when a non-iVLIW SP instruction is executed on the control processor, no PE instruction is executed. When a non-iVLIW PE instruction is executed, no SP control processor instruction is executed. This separation provides an easy logic-design control strategy for implementation and an intuitive programming model. For those instances where additional performance is required, the SP array controller merged with an array iVLIW PE such as merged unit 101 of
Further aspects of the present invention are discussed in greater detail below. While 32-bit and now 64-bit architectures have dominated the field of high-performance computing in recent years, this domination has occurred at the expense of the size of the instruction memory subsystem. With the movement of digital signal processing (DSP) technology into multimedia and embedded systems markets, the cost of the processing subsystem, in many cases, has come to be dominated by the cost of memory and performance is often constrained by the access time of the local instruction memory associated with the DSP. Real-time issues impose further constraints, making it desirable to have time-critical applications in instruction memory with deterministic access time. This memory is preferably located on-chip. In a high volume embedded application, the full application code is embedded and many times stored in a read only memory (ROM) to further reduce costs. Since application code has been growing to accommodate more features and capabilities, the on-chip memory has been growing, further increasing its cost and affecting memory access timing. Consequently, the issue of code density becomes important to processor implementations.
The Manifold Array processor architecture and instruction set are adapted to address the code density and configurable processor optimization problem by utilizing the stream-flow process and abbreviated-instruction apparatus and tools in accordance with the present invention. The stream-flow process 300 is shown in FIG. 3A. In the development of a specific application, the standard ManArray software development kit (SDK) is used in step 301 with the application of some optional programmer/tool-supported rules as programming constraints listed in steps 302 and 320. These rules are chosen to improve the probability of creating smaller abbreviated programs than if no rules were used in the program development process. The rules are also chosen to aid in determining which instruction set choices are best suited to the intended application. For example, in a portable voice-only cell phone type of application, where power is of extreme importance and the performance requirements are low relative to the full ManArray capabilities, sample rules such as those indicated in step 302 might be used. One of these rules specifies a restricted use of the configurable register file (CRF), allowing the register file to be cut in half, providing a 16×32 or an 8×64 configurable register file for a lower cost optimized processor core. Selected instructions can be eliminated from a programmer's choice, such as those specifically intended for MPEG video type processing. Each of the rules describes a subset of the full ManArray architecture to be used and verified with tools that support this sub-setting.
After the application code is written using native instructions, an instruction-abbreviation tool is used in step 303 to analyze the ManArray native application code for common characteristic features of the code. These common characteristic features are specific bit-patterns within the instructions that are termed style-fields. These style-fields are used in conjunction with the abbreviated-instruction translation hardware to translate instructions as described herein. After the tool creates the application code in abbreviated-instruction form, the code can be run in step 304 on Manta-2 hardware capable of executing B-bit abbreviated instructions for evaluation purposes. In step 321 of
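The bit-pattern analysis performed by the instruction-abbreviation tool can be pictured with a minimal sketch. This is a hypothetical rendering, not the actual tool: it counts, for each bit position, how many instructions deviate from the majority value at that position, then selects the least frequently changing positions as a candidate style-field, following the style-field definition given later in the text. Instruction values and widths are illustrative.

```python
# Hypothetical sketch of the style-field search: bits that change
# infrequently across the program are candidate style-field bits.

def bit_change_counts(instructions, width=32):
    """Minority count per bit position across the program: the number
    of instructions whose bit differs from the majority value there."""
    n = len(instructions)
    counts = []
    for pos in range(width):
        ones = sum((ins >> pos) & 1 for ins in instructions)
        counts.append(min(ones, n - ones))
    return counts

def pick_style_field(instructions, max_bits=10, width=32):
    """Select the positions that change least often as the style-field."""
    counts = bit_change_counts(instructions, width)
    ranked = sorted(range(width), key=lambda p: counts[p])
    return sorted(ranked[:max_bits])
```

A real tool would run this per opcode group (VLIW, FLOW, LOAD, and so on) and weigh the resulting translation-memory sizes, but the counting step itself is this simple.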
A Manta-1 chip implementation 360 of the ManArray architecture is shown in FIG. 3D. As presently defined, this implementation contains a 2×2 Manta DSP Core 361, including DMA and on-chip bus interfaces 363, a PCI controller 365, an input/output (I/O) unit 367, a 64-bit memory interface 369, and a ManArray peripheral bus (MPB) and host interface 371. This DSP is designed to be utilized as a coprocessor working alongside an X86, MIPS, ARM, or other host processor. The 2×2 ManArray core contains an I fetch unit 379 that interfaces with a 32-bit instruction memory 381. The 2×2 core attaches to the two main on-chip busses, the 32-bit ManArray control bus (MCB) 375 and the 64-bit ManArray data bus (MDB) 373, which is a scalable bus allowing wider bus widths in different implementations depending upon a product's needs. The memory interface block 369 provides bulk memory (SDRAM) and non-volatile memory (FLASH read only memory) service via two busses, namely the MDB 373 and the private host memory port 377 from the host processor interface block 371. The ManArray peripheral bus is an off-chip version of the internal ManArray busses and provides an interface to an ARM host processor. It is noted that the ManArray peripheral bus, in the present implementation, is shared with a host processor interface that is designed specifically to interface with a MIPS processor. The PCI controller 365 provides a standard X86 personal computer interface. The I/O block 367 internally contains a rudimentary I/O system for an embedded system, including, for example, a debug UART interface, as well as MIPS host interface I/Os. These host I/Os include three 32-bit timers and an interrupt controller for the external host. Other chip support hardware such as debug and oscillator functions are not shown for clarity.
A Manta-2 chip implementation 385 of the ManArray architecture including instruction abbreviation in accordance with the present invention is shown in FIG. 3E. This implementation 385 contains equivalent functional units to those in the Manta-1 system of
The next step as shown by the examples of
In a similar manner, a subset of the full ManArray architecture can also be employed without using the abbreviated-instruction tool to produce optimized 32-bit processor cores. This path is indicated by step 320. For example, this process may be advantageous in connection with the removal of MPEG video instructions from a communications only application core. The resultant code can be verified in the Manta-1 hardware evaluation vehicle as in step 321, and an optimized silicon core produced for the intended application as indicated in optimized subset 32-bit processor step 322.
Instruction Abbreviation
The approaches described in this invention for abbreviating instructions, hardware to execute the abbreviated instructions, and supporting configurations of the core processor have a number of unique and advantageous differences with respect to the approach used in the previously mentioned U.S. patent application Ser. No. 09/215,081. In the present invention, a program, using the full ManArray native instruction set, is used as input to the instruction-abbreviation tool and a new stand-alone abbreviated representation of the program is uniquely produced dependent upon the common characteristics of the initial program. In this present invention, all instructions including control flow and 32-bit iVLIW instructions, such as Load VLIW (LV) and execute VLIW (XV) instructions, can be abbreviated, allowing the abbreviated program to stand-alone without any use of the original 32-bit instruction types in the program flow. The abbreviated-instruction program, stored in a reduced-size instruction memory, is fetched instruction-by-instruction and each abbreviated instruction is translated into a native form that then executes on the ManArray processor. The abbreviated-instruction translation hardware may use one or more styles of translation formats if it is determined by the instruction-abbreviation tool that a smaller abbreviated-instruction memory can be obtained through the use of multiple styles. Note that the preferred approach is to do the translation of abbreviated instructions in the SP and only dispatch PE instructions in native form to the array of PEs. By using the SP to dispatch PE instructions, the array power can be reduced during SP-only operations, a feature not previously described in the ManArray architecture. Further, even though each program will have a different abbreviated form resulting in a potentially different configuration of the resultant processor core, in each case, all the abbreviated instructions are subsets of the ManArray architecture. 
These aspects of the present invention are explained further below.
The ManArray architecture uses an indirect VLIW design which translates a 32-bit execute VLIW instruction (XV) into a VLIW, for example, a VLIW consisting of Store (S), Load (L), ALU (A), MAU (M), and DSU (D) instructions as in memory 109 of
It is also possible to create an abbreviated B-bit instruction that can be translated into a native C-bit form. For example, a 32-bit instruction abbreviated into a 13-bit instruction would use a separate memory, or translation memory (TM), to contain the necessary bits of the original 32-bit instruction that are not represented in the 13-bit form. The TM is used in the process to translate the 13-bit abbreviated form back into a form containing all the information of the original native instruction necessary for execution, though not necessarily in the same format as the documented native format. For implementation reasons, the internal processor version of the native format can vary. The important point is that all the information context of the native format is maintained. It is also noted that each Store, Load, ALU, MAU, DSU, and control opcode type may use its own translation-memory (TM). Two related but distinctly different uses of VIMs, individually associated with execution units, are described in further detail in U.S. patent application Ser. Nos. 09/215,081 and 09/205,558, respectively.
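One way to picture this B-bit-to-native translation is the sketch below. The field layout is an assumption for illustration, not the patent's actual 13-bit encoding: the abbreviated word carries a few "y" bits directly plus a TM offset, and the TM supplies the remaining bits of the original 32-bit instruction.

```python
# Hedged sketch of TM-based translation. Assumed layout: the low 4 bits
# of the abbreviated word are directly carried "y" bits; the bits above
# them index the translation memory (TM), which holds the bits of the
# native instruction not present in the abbreviated form.

Y_BITS = [0, 1, 2, 3]   # assumed native positions of the carried bits

def expand(bits, positions):
    """Scatter the low-order bits of `bits` into the given native bit
    positions; all other positions are left zero."""
    word = 0
    for i, pos in enumerate(positions):
        word |= ((bits >> i) & 1) << pos
    return word

def translate(abbrev, tm):
    """Merge the TM entry (stable bits) with the carried y-bits."""
    y = abbrev & 0xF
    offset = abbrev >> 4
    return tm[offset] | expand(y, Y_BITS)
```

As the text notes, the result need not match the documented native format bit-for-bit; what matters is that the full information content reaches the execution units.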
In the present invention, a TM is directly used in the translation process for every instruction. The TM does not contain VLIWs, but the TM does contain partial bit-patterns as defined by a selected style. One of the aspects of the present invention is the mechanism for translating the abbreviated instructions back into a native form necessary for execution. By translating back into a native form, the full capabilities of the ManArray architecture remain intact at the execution units. In other words, the abbreviation process does not restrict the programmer in any way. The only restrictions are determined by the programmer in selecting rules to govern the program creation based on characteristics of the application and the desired performance, size, and power of the configurable processor core to be built at the end of the development process. This invention also provides a mechanism so that after the functionality of an application program is stabilized, or at some point in the development process at the discretion of the product developer, the execution units can be made into subsets of the full ManArray architecture definition optimized for the intended application.
A style-field is a specific set of bits, identified by the instruction-abbreviation tool's analysis of a particular program or by human perception, that, for the specific analyzed program, change infrequently with respect to the other bits in the instruction stream. Note that multiple style-fields can be identified depending upon the characteristics of the application code. There may be a different style-field for each opcode in the abbreviated-instruction format, multiple style-fields within an opcode, or common style-fields for multiple opcodes. In the hardware, a style is defined as a logical mechanism, operative for at least one instruction but more typically operative on a group of instructions, that specifies how the translation is to occur. The style is indicated in hardware by a set of bits, such as the four bits (15-12) loaded in 4-bit style register 351 of FIG. 3C. These 4-bits can be loaded in the same programmer-visible control register associated with a Vb TM base address register 353 also shown in FIG. 3C. For the specific example shown in
It is anticipated that the TMs will usually require only a small address range and the number of styles needed will also usually be small. For example, an implementation may use only two styles and use TMs of only 64 addresses. Depending upon the analysis of the program to be reduced in size, it may turn out that the number of bits in the different style-fields is constant, allowing a single TM to be implemented where the different styles relate to different address ranges in the single TM. The distribution of the style-field bits can be different for each style and is specified by the definition of each style. Alternatively, physically separate TMs, associated with each style in the abbreviated-instruction format, can be provided. A combination of separate TMs and address-range selectable TM sections can be used dependent upon the style-fields chosen, as discussed in further detail in the following sections. Note that for a TM which holds multiple style bit-patterns, the style can be indirectly inferred by the address range within the TM accessed as part of the translation mechanism. Also note that depending upon the characteristics of the program being reduced, there can be a common style associated with a common TM base address register, individual styles with a common TM base address register, a common style with individual TM base address registers, and individual styles with individual TM base address registers among the different opcodes. The choice of which approach to use is dependent upon the characteristics of the program being reduced and product needs.
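The single-TM, multiple-address-range arrangement described above can be sketched as follows. The two-style layout and 64-entry regions come from the example in the text; the lookup helpers, and adding the Vb base register to the region start, are illustrative assumptions.

```python
# Sketch of one physical TM shared by two styles via address ranges.
# Each style owns a 64-entry region; the style can be inferred back
# from the address range, as the text notes.

TM_REGIONS = {1: (0, 64), 2: (64, 128)}   # style -> [start, end)

def tm_address(style, vb_base, offset):
    """Form a TM address from the style region, the Vb base register
    value, and the offset carried in the abbreviated instruction."""
    start, end = TM_REGIONS[style]
    addr = start + vb_base + offset
    if not (start <= addr < end):
        raise ValueError("TM offset out of range for this style")
    return addr

def style_of_address(addr):
    """Indirectly infer the style from the address range."""
    for style, (start, end) in TM_REGIONS.items():
        if start <= addr < end:
            return style
    raise ValueError("address outside all style regions")
```

Physically separate TMs per style would replace `TM_REGIONS` with distinct memories, but the addressing logic stays the same.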
It is noted that alternatives to the encoding shown in
Type 1 Translation
Type 2 Translation
Where only certain bits within the C-bit (32-bit) native instruction format tend to change frequently in an application program, it is conceivable to divide the C-bit instruction into two or more portions which are not necessarily contiguous bit field portions, and analyze the pattern of bit changes in these two portions across the application program. Using the information obtained from this analysis, it is then possible to determine a number of strategies to abbreviate the instructions and to handle the instruction translation mechanism. Three examples of this further approach are shown in
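The two-portion analysis just described might look like the following sketch, where the upper/lower 16-bit masks are purely illustrative stand-ins for the (not necessarily contiguous) portions an actual analysis would choose.

```python
# Sketch of the two-portion analysis: split each native instruction
# into an assumed x-portion and y-portion with bit masks, then count
# the unique patterns in each. Few unique patterns in a portion means
# a small TM can represent that portion.

X_MASK = 0xFFFF0000   # assumed "stable" portion
Y_MASK = 0x0000FFFF   # assumed "volatile" portion

def unique_patterns(instructions, mask):
    return {ins & mask for ins in instructions}

def tm_slots_needed(instructions):
    """(x-TM entries, y-TM entries) required for this program."""
    return (len(unique_patterns(instructions, X_MASK)),
            len(unique_patterns(instructions, Y_MASK)))
```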
Type 2A Translation
As shown in
In a similar fashion, eight styles are shown for the MAU and ALU instructions in FIG. 5C. Only three of these styles 550, 552 and 554 have been numbered, as have their corresponding style-field bit patterns 551, 553 and 555. The remaining unnumbered styles correspond to bit patterns which are presently reserved. Exemplary styles for the DSU instruction are shown in
Type 2B Translation
Type 2C Translation
Another approach to TM accessing and abbreviated-instruction translation is illustrated in FIG. 6C. Mechanism 670 of
Type 2 Translation Extension
It will be recognized that there exist instruction set architectures employing more than 32 bits, such as 40-bit, 48-bit, and 64-bit architectures, among others. The instruction abbreviation process and translation approaches of the present invention would work equally well for these architectures. Further, the concept of splitting the native instruction format into two sections can be generalized to splitting the instruction format into three or more sections. In these cases, the style would cover the three or more sections with separate bit-patterns that would be analyzed in a program's instructions. For each section, there would be a translation memory (TM), and the abbreviated instruction would be translated into the larger native format. For example, a 48-bit instruction could be split into three sections, with each section represented in a TM. The abbreviated-instruction format for this 48-bit case might contain three 5-bit fields, a 3-bit opcode, and a single S/P-bit, totaling 19 bits instead of the 48-bit instruction. It is noted that the 32-bit instruction format may also be split into more than two segments for abbreviation purposes, but present analysis indicates the split into two segments is a better match to presently anticipated needs.
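A sketch of the three-section case under stated assumptions: three 16-bit sections of a hypothetical 48-bit format, each with its own TM indexed by a 5-bit offset. The 19-bit arithmetic matches the example above; the field layout is an assumption.

```python
# Translating a 19-bit abbreviated word back into an assumed 48-bit
# native format using three TMs, one per 16-bit section.

SECTION_WIDTH = 16                      # assumed: 48 bits = 3 x 16

def translate48(abbrev, tms):
    """Assumed abbrev layout: bit 18 S/P, bits 17-15 opcode,
    bits 14-10 / 9-5 / 4-0 are TM offsets for sections 2, 1, 0."""
    word = 0
    for section in range(3):
        offset = (abbrev >> (5 * section)) & 0x1F
        word |= tms[section][offset] << (SECTION_WIDTH * section)
    return word

ABBREV_WIDTH = 3 * 5 + 3 + 1            # offsets + opcode + S/P = 19
```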
As technology processes continue to improve providing greater density of logic and memory implementations, it becomes desirable to expand the scope of an architecture to take advantage of the greater on-chip density. Instruction abbreviation allows the expansion of the instruction set format while still minimizing the instruction memory size external to the core processor. For example, the ManArray architecture register file operand specification can be expanded from the present 5-bits per operand to 7-bits per operand. Since the ManArray architecture is a three operand specification architecture, this expansion adds 6 bits to the instruction format size. Assuming 2 additional bits are added to expand the opcode field or other field specifiers, the 32-bit ManArray architecture could be expanded to 40-bits.
With instruction abbreviation, the 40-bit instructions could be abbreviated to a B-bit format, where B might be 15, 16, 17, or a different number of bits less than 40 depending upon the application. Since instruction abbreviation decouples the instruction format used by the core processor from the instruction format size stored in instruction memory, the core processor has more freedom to grow in capability and performance, while still minimizing external memory size and access time requirements.
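The widening arithmetic above can be checked directly; the helper below simply encodes the counts given in the text (three operand specifiers widened from 5 to 7 bits, plus 2 extra opcode/field bits).

```python
# Arithmetic check for the 32-bit -> 40-bit expansion described above.

def expanded_width(base=32, operands=3, old_op=5, new_op=7, extra=2):
    """Native format width after widening the operand specifiers and
    adding extra opcode/field bits."""
    return base + operands * (new_op - old_op) + extra

EXPANDED = expanded_width()   # 40-bit native format
```

Each candidate abbreviated width mentioned in the text (15, 16, or 17 bits) remains well under the 40-bit native format, which is what keeps the external instruction memory small.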
ManArray SP/PE0 Translation
The adaptation of the presently preferred dual TM using mechanism 670 of
An example of unequal fields is shown in
For illustrative purposes,
Also, not shown in
In each cycle, the S/P-bit 705 and opcode bits 703 are sent to the Opcode, Group, L/S, and Unit PreDecode logic 755 over signal lines 739. In addition, the abbreviated-instruction Y-TM offset field 701 is sent to the iVLIW address generation function unit 730 over lines 737. For execute VLIW (XV) instructions in abbreviated form, the dual TM translation occurs in parallel with the XV VLIW 735 access. For XV iVLIW abbreviated instructions of the form shown in
ManArray 1×2 Translation
Dual Abbreviated-Instruction Fetching
The dual abbreviated-instruction format 12A of
In some applications, it is noted that the abbreviated-instruction program and/or individual tasks of a product program may be stored in a system's storage device whose data types may be based on 32-bits, due to other system needs. In this case, it is noted that two abbreviated instructions can be fit into a 32-bit format with bits to spare. For example, using the format 912 of
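Packing two abbreviated instructions into one 32-bit storage word, as suggested above, might be sketched as follows. Assuming the 14-bit abbreviated width used elsewhere in the document, two instructions leave four spare bits; placing the first instruction in the low half is an assumption, not the layout of format 912.

```python
# Sketch of packing/unpacking two 14-bit abbreviated instructions in a
# 32-bit word; bits 31-28 remain spare.

WIDTH = 14
MASK = (1 << WIDTH) - 1

def pack(first, second):
    return (second << WIDTH) | first

def unpack(word):
    return word & MASK, (word >> WIDTH) & MASK
```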
In
Pipeline Description
In the next cycle “i+1” shown in row 1020, the SP fetches the abbreviated B-bit XV.S instruction and loads it into the register IR1 702. While the fetch operation is occurring, the ADD.S is in the Xpand and Dispatch pipeline stage in which a number of operations occur. The S/P-bit 705 indicates this is an SP-only operation. The local dual TM fetches occur and a native form of the ADD.S instruction is loaded into the register IR2 771 at the end of the cycle. The S/P-bit and 3-bit abbreviated opcode are sent to the Opcode, Group, L/S, and Unit PreDecode logic 755 and are decoded in the SP with control latches set at the end of this stage as required to control the next stages of the pipeline.
In cycle “i+2” shown in row 1030, the SP fetches the abbreviated B-bit COPY.S instruction and loads it into the register IR1 702 at the end of the fetch cycle. While the fetch operation is occurring, the XV.S instruction is in the Xpand and Dispatch pipeline stage in which a number of operations occur. The S/P-bit and opcode indicate an SP XV operation. The local TM fetches occur and a native form of the XV.S instruction is loaded into register IR2 at the end of this cycle. The S/P-bit and 3-bit opcode are decoded in the SP and appropriate latches are set at the end of this stage. In parallel, the VIM address is calculated by address generation function unit 730 of FIG. 7 and the iVLIW is fetched from the VIM 735. Also, in cycle “i+2”, the ALU decodes the ADD.S instruction.
In cycle “i+3” shown in row 1040, the SP fetches the next abbreviated B-bit instruction, which in this example is an ADD.S instruction, and loads it into the register IR1 at the end of the fetch cycle. In the Xpand and Dispatch stage, the COPY.S abbreviated instruction is being translated into a native form suitable for continued processing. In the decode pipeline stage, the VLIW fetched from the VIM representing up to 5 native ManArray instructions is in unit 1-n decoder 779 of FIG. 7. The ADD.S has entered the execute pipeline stage and the results of the ADD.S will be available by the end of this stage.
In cycle “i+4” shown in row 1050, the SP fetches the next abbreviated B-bit instruction, Instr(I+3). The fetched ADD.S abbreviated instruction enters the Xpand and Dispatch stage where it is translated into a native form suitable for decoding and execution. The COPY.S instruction is decoded in the DSU in the decode pipeline stage and the fetched VLIW of up to 5 native instructions enters the execute stage of the pipeline with the results from the up to 5 executions available at the end of this stage. The ADD.S first fetched in cycle “i” enters the condition return stage where any side effects of its execution are stored in programmer visible flag registers, Arithmetic Scalar Flags (ASFs) and the Arithmetic Condition Flags (ACFs).
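The cycle-by-cycle walk above follows an ideal five-stage flow: Fetch, Xpand and Dispatch, Decode, Execute, Condition Return. A toy model (ignoring the XV's parallel VIM access and any stalls) reproduces the timing, for example the ADD.S fetched in cycle "i" reaching condition return in cycle "i+4".

```python
# Illustrative model of the five-stage pipeline described above. Each
# cycle, every in-flight instruction advances one stage; the program
# shown in the tests mirrors the ADD.S / XV.S / COPY.S sequence.

STAGES = ["Fetch", "Xpand/Dispatch", "Decode", "Execute", "CondRet"]

def pipeline_trace(program):
    """Return {cycle: {stage: instruction}} for an ideal pipeline."""
    trace = {}
    cycles = len(program) + len(STAGES) - 1
    for c in range(cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            idx = c - s
            if 0 <= idx < len(program):
                row[stage] = program[idx]
        trace[c] = row
    return trace
```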
Other Processor Architectures
As an example of the generality of the instruction-abbreviation process, consistent with the teachings of the present invention, other processor architectures containing one or more execution units can have their opcode space partitioned into one or more separate groups and the instruction format partitioned into one or more bit-pattern style-fields. Based upon a program coded with this other processor architecture, B-bit abbreviated instructions can be formed that can then be stored in a reduced size memory. These abbreviated instructions can then be fetched and translated into a native form internal to the other processor suitable for execution on the other processor.
Since there is a standard B-bit format for this other processor's abbreviated instructions, and a one-to-one mapping between the B-bit instruction and a processor's native instruction, there is no problem storing the abbreviated instruction in a cache, branching to an abbreviated instruction, or taking interrupts as would normally occur in a native processor program.
Sample Analysis
The sample analysis described below is based on a ManArray MPEG decoder application program containing 5050 total native 32-bit instructions. The analysis tool reads instructions as data from an input file <mpeg.dump> where the MPEG decoder program is located. The analysis tool also reads style-fields from input file <style7.dat> where different style-fields can be placed for analysis. For this example, the following styles were used in the analysis program. The dual-TM translation apparatus of
Group    Style-field bit positions          Style
VLIW     0 1 2 3 4 5                        Style-1
FLOW     0 1 2 3 4 5 6 7                    Style-2
LOAD     0 1 2 3 4 5 16 17 18 19            Style-3
STORE    0 1 2 3 4 5 16 17 18 19            Style-3
ALU      6 7 8 11 12 13 16 17 18 19         Style-4
MAU      6 7 8 11 12 13 16 17 18 19         Style-4
DSU      6 7 8 11 12 13 16 17 18 19         Style-4
An example from the sample analysis program for MAU instructions using style-4 is as follows:
The instruction format given by 100011-0yyyy-00yyy-10yyy-000-101 indicates the Y-TM style-field bit pattern covering y1 (bits 19-16), y2 (bits 13-11), and y3 (bits 8-6). The x-field covers bits x1 (bits 26-20), x2 (bits 15, 14), x3 (bits 10, 9), and x4 (bits 5-0). The group bits (bits 31 and 30), the S/P bit (bit 29), and the unit field bits (bits 28 and 27) have been excluded from the analysis since the group, S/P, and unit information is covered by the abbreviated instruction format's S/P-bit and opcode bits. In the reported analysis, 12 MAU instructions were found for which the X-field was x1=100110, x2=00, x3=10, and x4=000101, which did not change across the 12 instructions; only bits within the y fields changed.
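The style-4 x/y split described above can be exercised in a small sketch. The bit positions are taken from the text (y: 19-16, 13-11, 8-6; x: 26-20, 15-14, 10-9, 5-0, with bits 31-27 excluded); the gather helper, which concatenates low position first, is an implementation assumption. Note that the ten y-bits plus a 3-bit opcode and the S/P-bit account exactly for a 14-bit abbreviated format.

```python
# Splitting a 32-bit MAU instruction into its style-4 x- and y-fields.

Y_POS = list(range(16, 20)) + list(range(11, 14)) + list(range(6, 9))
X_POS = list(range(20, 27)) + [14, 15] + [9, 10] + list(range(0, 6))

def gather(word, positions):
    """Concatenate the named bit positions (lowest position first)."""
    out = 0
    for i, pos in enumerate(sorted(positions)):
        out |= ((word >> pos) & 1) << i
    return out

def split_fields(word):
    """Return (x-field, y-field) for a 32-bit instruction word."""
    return gather(word, X_POS), gather(word, Y_POS)
```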
The MPEG application code was analyzed by a software static analysis tool which:
In this particular example, a 14-bit abbreviated-instruction format was used and the total number of bits was determined for the main instruction memory and compared to the native form as used in the actual MPEG program. A memory savings was then reported. In the following summary reports, a slot is an addressable location in the TM.
VLIW instructions:
Total native 32-bit instructions=5050
Total UNIQUE instructions=1540
Overall 14-bit Dual-TM Analysis
Instruction memory savings (14-bit) = (161600−(5050*14))/161600 = 56.25%. It is noted that the addition of Vb and style register management instructions will reduce this percentage slightly. It is further noted that there are additional analysis mechanisms not addressed in this exemplary summary report but which can further reduce instruction memory requirements. For example, for those opcodes with common styles, a search may be done to find the common X-TM and Y-TM entries.
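The savings figure can be recomputed directly from the totals above.

```python
# Recomputing the reported memory saving: 5050 native 32-bit
# instructions occupy 161,600 bits; at 14 bits per abbreviated
# instruction the same program occupies 70,700 bits, a 56.25% saving
# (before the small cost of Vb/style register management instructions).

native_bits = 5050 * 32
abbrev_bits = 5050 * 14
savings = (native_bits - abbrev_bits) / native_bits   # = 0.5625
```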
Also, this analysis report did not try more than one style per group. It is quite feasible that an additional style or styles can be determined for each style grouping, with steps 3 and 4 of the previously identified analysis tool steps repeated to determine whether additional styles further reduce memory requirements.
Guidelines to Develop Application Code for Abbreviated-Instructions:
Exemplary ManArray Abbreviated-Instruction Set guide-line rules are:
While the present invention has been described in a presently preferred embodiment, it will be recognized that a number of variations will be readily apparent and that the present teachings may be widely applied. By way of example, while instructions with specific numbers of bits and formats are addressed herein, the present invention will be applicable to instructions having other numbers of bits and different formats. Further, while described in the presently preferred context of the ManArray architecture, the invention will also be applicable to other processing architectures.
Inventors: Larsen, Larry D.; Pechanek, Gerald George; Kurak, Jr., Charles W.
References Cited:

U.S. Pat. No. 4,722,050 (priority Mar. 27, 1986; Hewlett-Packard Company): Method and apparatus for facilitating instruction processing of a digital computer.
U.S. Pat. No. 4,965,771 (priority Aug. 18, 1986; Minolta Camera Kabushiki Kaisha): Printer controller for connecting a printer to an information processor having a different protocol from that of a printer.
U.S. Pat. No. 5,632,028 (priority Mar. 3, 1995; Fujitsu, Ltd.): Hardware support for fast software emulation of unimplemented instructions.
U.S. Pat. No. 5,784,585 (priority Apr. 5, 1994; Motorola, Inc.): Computer system for executing instruction stream containing mixed compressed and uncompressed instructions by automatically detecting and expanding compressed instructions.
U.S. Pat. No. 5,790,874 (priority Sep. 30, 1994; Kabushiki Kaisha Toshiba): Information processing apparatus for reducing power consumption by minimizing hamming distance between consecutive instruction.
U.S. Pat. No. 5,819,058 (priority Feb. 28, 1997; Hanger Solutions, LLC): Instruction compression and decompression system and method for a processor.
U.S. Pat. No. 5,835,746 (priority Apr. 21, 1997; Freescale Semiconductor, Inc.): Method and apparatus for fetching and issuing dual-word or multiple instructions in a data processing system.
U.S. Pat. No. 5,896,519 (priority Jun. 10, 1996; Avago Technologies General IP Singapore Pte Ltd): Apparatus for detecting instructions from a variable-length compressed instruction set having extended and non-extended instructions.
U.S. Pat. No. 5,898,883 (priority Jan. 25, 1994; Hitachi, Ltd.): Memory access mechanism for a parallel processing computer system with distributed shared memory.
U.S. Pat. No. 5,909,587 (priority Oct. 24, 1997; GlobalFoundries Inc.): Multi-chip superscalar microprocessor module.
U.S. Pat. No. 6,044,450 (priority Mar. 29, 1996; Hitachi, Ltd.): Processor for VLIW instruction.
U.S. Pat. No. 6,049,862 (priority Jul. 19, 1996; III Holdings 6, LLC): Signal processor executing compressed instructions that are decoded using either a programmable or hardwired decoder based on a category bit in the instruction.
U.S. Pat. No. 6,101,592 (priority Dec. 18, 1997; Altera Corporation): Methods and apparatus for scalable instruction set architecture with dynamic compact instructions.
U.S. Pat. No. 6,199,126 (priority Sep. 23, 1997; International Business Machines Corporation): Processor transparent on-the-fly instruction stream decompression.
U.S. Pat. No. 6,317,867 (priority Jan. 29, 1999; International Business Machines Corporation): Method and system for clustering instructions within executable code for compression.
U.S. Pat. No. 6,801,995 (priority Aug. 4, 1998; Avago Technologies General IP Singapore Pte Ltd): Method for optimally encoding a set of instruction codes for a digital processor having a plurality of instruction selectable resource types and an associated optimized set of instruction codes.
EP0820006.
JP9265397.
Assignments: May 18, 2004, assignment on the face of the patent to Altera Corporation; Aug. 24, 2006, assignment of assignors' interest from PTS Corporation to Altera Corporation (Reel 018184, Frame 0423).