A virtual parallel multiplier-accumulator (vmac) that can execute more than or less than one MAC operation in a single system clock cycle. The inventive vmac advantageously employs a resource/time-sharing methodology with multiple sequential computational stages.
|
1. A virtually parallel multiplier-accumulator (vmac) responsive to a vmac clock (vmck) derived from a master clock (mck), said vmac being adapted for performing more than one multiplier-accumulator (MAC) operation within a mck cycle, said vmac comprising:
a control-wave generator (CWG) adapted for generating a plurality of control signals within a vmck cycle; and a sequential-computational stage MAC (SCS-MAC) adapted for receiving data from a source register and for receiving said plurality of control signals from said CWG, said SCS-MAC performing an operation on said data upon receipt of each of said plurality of control signals from said CWG.
18. A virtually parallel multiplier-accumulator (vmac) responsive to a vmac clock (vmck) derived from a master clock (mck), said vmac being adapted for performing one of more than one multiplier-accumulator (MAC) operation within a mck cycle, and less than one multiplier-accumulator (MAC) operation within a mck cycle, said vmac comprising:
a control-wave generator (CWG) adapted for generating a plurality of control signals in relation to a vmck cycle; and a sequential-computational stage MAC (SCS-MAC) adapted for receiving data from a source register and for receiving said plurality of control signals from said CWG, said SCS-MAC performing an operation on said data upon receipt of each of said plurality of control signals from said CWG.
10. An integrated circuit including a virtually parallel multiplier-accumulator (vmac) responsive to a vmac clock (vmck) derived from a master clock (mck), said vmac being adapted for performing more than one multiplier-accumulator (MAC) operation within a mck cycle, said integrated circuit comprising:
a source register for providing data to said vmac; and a result register for receiving data from said vmac and for providing data to said source register; wherein said source register provides parallel data to said vmac and wherein said vmac provides serial data output, said integrated circuit further comprising an output data demultiplexer and a register for receiving said serial data output from said vmac, for converting said serial data to parallel data, and for communicating said parallel data to said result register.
2. A vmac as recited by
a partial product generator (PPG) adapted for receiving a first data and a second data from a source register and for generating an output that is a product of said first and said second data; and a multi-stage partial product adder (PPA) adapted for receiving said PPG output and for receiving a third data, said PPA generating an output that is the sum of said PPG output and said third data.
3. A vmac as recited by
4. A vmac as recited by
5. A vmac as recited by
6. A vmac as recited by
7. A vmac as recited by
8. A vmac as recited by
9. A vmac as recited by
11. An integrated circuit as recited by
12. An integrated circuit as recited by
a control-wave generator (CWG) adapted for generating a plurality of control signals within a vmck cycle; and a sequential-computational stage MAC (SCS-MAC) adapted for receiving data from a source register and for receiving said plurality of control signals from said CWG, said SCS-MAC performing an operation on said data upon receipt of each of said plurality of control signals from said CWG.
13. An integrated circuit as recited by
a partial product generator (PPG) adapted for receiving a first data and a second data from a source register and for generating an output that is a product of said first and said second data; and a multi-stage partial product adder (PPA) adapted for receiving said PPG output and for receiving a third data, said PPA generating an output that is the sum of said PPG output and said third data.
14. An integrated circuit as recited by
15. An integrated circuit as recited by
16. An integrated circuit as recited by
17. An integrated circuit as recited by
|
The present invention is directed to a multiplier-accumulator (MAC) and, more particularly, to a virtual parallel multiplier-accumulator (VMAC) that processes more than or less than one MAC operation within a single system clock cycle.
A multiply-accumulate (MAC) operation is a common operation performed in signal processing and other algorithms. Because of its frequency of occurrence in such algorithms, many prior art microprocessors and digital signal processors (DSPs) include some form of direct instruction support for the multiply-accumulate operation. Typically, the CPU's instruction set includes a multiply-accumulate instruction or multiply and add instructions that, together, can execute a MAC operation in a single system clock cycle. These instructions are executed by hardware circuits such as separate multiplier and adder circuits, or a combined multiply-add circuit.
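By way of illustration only, a MAC operation and the kind of loop that relies on it may be sketched in software as follows. The sketch assumes the form (x)*(y)+(z) for a single MAC step; the function names `mac` and `dot_product` are hypothetical and do not correspond to any instruction or circuit described herein.

```python
def mac(x, y, z):
    """One multiply-accumulate step: multiply x by y and add z."""
    return x * y + z


def dot_product(xs, ys):
    """Sum of products, using one MAC operation per loop iteration."""
    acc = 0
    for x, y in zip(xs, ys):
        # The running sum is fed back as the addend of the next MAC step.
        acc = mac(x, y, acc)
    return acc
```

For example, `dot_product([1, 2, 3], [4, 5, 6])` performs three MAC steps and returns 32.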
Algorithms that use MAC operations typically consist of a loop over many iterations. The algorithm's performance can be improved by executing the MAC operations of multiple loop iterations at once. This property has motivated CPU designers to include instructions that execute multiple MAC operations per system clock cycle. An instruction executing multiple MAC operations per system clock cycle may be implemented in a number of ways. For example, hardware may be provided to execute multiple MAC operations per cycle consisting of a number of multipliers and adders or a number of multiply-add circuits. By providing multiple arithmetic circuits, the CPU can execute the simultaneous multiplies and adds needed to support multiple MAC operations in parallel.
Microprocessor integrated circuits may include a plurality of multiplier-accumulator (MAC) units connected in parallel with each other. While this configuration provides the ability to perform multiple MAC operations within a single system clock cycle, it also consumes more real estate within the integrated circuit, and adversely affects the performance and power consumption of the integrated circuit due to the relatively long bus connections between multi-port memories, registers, and the multiple MAC units.
An example of a prior art CPU data path executing two MAC operations per cycle is depicted in FIG. 1. Each MAC unit defines a data path which consists of a register file comprised of sixteen, 40-bit registers, each having a multiplier and a load/store/arithmetic unit attached thereto. The multipliers each multiply two 16-bit operands to produce a 32-bit product. The multipliers can accept a new operand and produce a new product every system clock cycle, but have a latency of two system clock cycles. The load/store/arithmetic units can perform a 40-bit accumulate (i.e., addition/subtraction) in a single system clock cycle. The multiple MAC units are identical to each other, and provide an effective throughput of two multiply-accumulates per system clock cycle. Performing a complete multiply-accumulate operation requires passing the operands through a multiplier by issuing a multiply instruction, and then through a load/store/arithmetic unit by issuing an add instruction. The multiply and add instructions are scheduled for execution so that the product of the multiply operation is not used by the add operation until the multiplier has finished generating the product.
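The operand and accumulator widths described above may be modeled, purely as a sketch, as follows. The text does not state whether the operands are signed or how accumulator overflow is handled; signed two's-complement operands and wrap-around on overflow are assumptions made here for illustration, and the function names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def multiply_16x16(a, b):
    """Multiply two 16-bit operands (assumed signed); the product fits in 32 bits."""
    return to_signed(a, 16) * to_signed(b, 16)


def accumulate_40(acc, product):
    """Add a product to a 40-bit accumulator, wrapping on overflow (assumed)."""
    return to_signed(acc + product, 40)
```

The 40-bit accumulator leaves eight bits above the 32-bit product, so a sequence of products can be summed before the accumulator range is exceeded.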
A prior art dual MAC data path is depicted in
It is common in CPU designs to increase the CPU clock frequency by processing instruction execution in a pipeline. The flow of instructions and their operands and results through the pipeline is controlled by the CPU's pipeline control logic. For CPUs that do not support a MAC operation, the duration of a pipeline stage (and therefore the clock frequency) is typically determined by the adder circuit or the delay to access memory. For CPUs that support MAC operations, the duration of a pipeline stage is often determined by the multiplier/adder/multiply-add circuit, i.e. by the hardware provided to perform the MAC operation. To overcome this limitation, prior art CPUs extend the pipeline by pipelining the multiplier/adder/multiply-add arithmetic circuits. Although the arithmetic circuits are pipelined with a fixed number of stages, pipelining still introduces significant complexity both in the design of the pipeline control logic and in writing a sequence of instructions to handle the latency of the pipeline. Ideally, the MAC operation should be executed with an arithmetic circuit that does not constrain the CPU's clock frequency and does not introduce complex latencies for the programmer to manage.
The prior art dual MAC data path has a number of disadvantages. Firstly, two multipliers and two adders are required. Secondly, the clock frequency of the dual MAC data path is restricted by the multiplier's delay; the multiplier already being pipelined once in an attempt to deal with its impact on the system frequency. However, this pipeline then requires extra circuit area, power and latency if the product is immediately re-used in a subsequent multiplication. Finally, the prior art dual MAC data path does not produce a single sum of all four products and the data-path has to be partitioned into mirror components to reduce the pressure on register file ports and bus loading. However, this means that the data path does not directly sum a sequence of products in half the number of cycles, and an additional cycle is needed to add the final sums.
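The last disadvantage may be illustrated with a simplified software model. The sketch below assumes an idealized dual data path in which each half accumulates one product per cycle and multiplier latency is ignored; the name `dual_mac_sum` and the cycle counting are illustrative assumptions, not a description of the circuit of FIG. 1.

```python
def dual_mac_sum(xs, ys):
    """Sum products on two mirrored accumulators, then merge them.

    Returns (total, cycles) under an idealized one-MAC-per-half-per-cycle model.
    """
    acc_a = 0  # accumulator of the first half of the data path
    acc_b = 0  # accumulator of the mirrored half
    cycles = 0
    for i in range(0, len(xs), 2):
        acc_a += xs[i] * ys[i]
        if i + 1 < len(xs):
            acc_b += xs[i + 1] * ys[i + 1]
        cycles += 1
    # Neither half holds the complete sum, so one more cycle adds the two.
    total = acc_a + acc_b
    cycles += 1
    return total, cycles
```

Under these assumptions, four products are summed in three cycles rather than two, because the two partial sums must still be added together.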
It is desirable to provide a MAC unit that overcomes the shortcomings of the prior art.
The present invention is directed to a virtual parallel multiplier-accumulator (VMAC) that can process N MAC operations within M system clock cycles, where N may or may not be equal to M and where a MAC operation is generally defined by the equation (x)*(y)+(z). The present invention also reduces the physical size of integrated circuits and electronic devices since one VMAC constructed in accordance with the present invention replaces N prior art MAC units.
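As a small worked example of the relationship between N and M, the rate of MAC operations per system clock cycle can be expressed as the ratio N/M. The helper names below are hypothetical and the sketch is only an arithmetic illustration.

```python
from fractions import Fraction


def macs_per_cycle(n_ops, m_cycles):
    """Exact ratio of N MAC operations to M system clock cycles."""
    if n_ops < 0 or m_cycles <= 0:
        raise ValueError("n_ops must be >= 0 and m_cycles must be > 0")
    return Fraction(n_ops, m_cycles)


def describe_rate(n_ops, m_cycles):
    """Classify the rate relative to one MAC operation per cycle."""
    rate = macs_per_cycle(n_ops, m_cycles)
    if rate > 1:
        return "more than one MAC per cycle"
    if rate < 1:
        return "less than one MAC per cycle"
    return "exactly one MAC per cycle"
```

With N = 4 and M = 1 the rate is four MAC operations per cycle; with N = 1 and M = 2 it is one half.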
The VMAC of the present invention consists of a Control-Wave Generator (CWG) and a Sequential-Computational-Stage MAC (SCS-MAC) comprised of a plurality of sequentially (i.e., serially) arranged computational stages. The CWG produces multiple sets of consecutive control signals within a single VMAC clock (VMCK) cycle, and with each rising edge of VMCK. The frequency of the VMCK may be different from or the same as the system or main clock (MCK), as a matter of design choice. The control signals generated by the CWG control the flow of data or operands through the SCS-MAC (i.e., through the VMAC), and are also used to clock output or result registers that may be connected to the inventive VMAC. A source register may be connected to the VMAC to provide input operand data to the SCS-MAC. The SCS-MAC performs a MAC operation as the operand data propagates through each computational stage of the SCS-MAC. The output from the VMAC may be latched into an output or result register for communication to the source register of the VMAC or to another electronic device or circuit.
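The behavior of the CWG may be pictured, as a rough software analogy only, as a burst of control pulses launched at each rising edge of VMCK. In the sketch below the even spacing of the pulses, the integer time units, and the names `control_wave`, `pulses_per_cycle` and `stage_delay` are all assumptions introduced for illustration.

```python
def control_wave(vmck_period, pulses_per_cycle, stage_delay, cycles):
    """List (cycle, pulse_index, time) for pulses launched on each VMCK rising edge.

    Times are integers in arbitrary units. Pulses are assumed to be evenly
    spaced by stage_delay and must all fall within one VMCK period.
    """
    if pulses_per_cycle * stage_delay > vmck_period:
        raise ValueError("pulses do not fit within one VMCK cycle")
    events = []
    for cycle in range(cycles):
        edge = cycle * vmck_period  # time of the rising edge of this cycle
        for k in range(pulses_per_cycle):
            events.append((cycle, k, edge + k * stage_delay))
    return events
```

For instance, `control_wave(10, 4, 2, 2)` lists four pulses at times 0, 2, 4 and 6 in the first cycle and four more at times 10, 12, 14 and 16 in the second.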
While prior art MAC units accept a maximum of one set of operands per clock cycle, the VMAC of the present invention can accept a new set of operand data within a single clock cycle. In fact, the VMAC of the present invention permits many new MAC operations to start within a single clock cycle, and permits many operands to be present in the sequential computational stages, with each computational stage executing a different phase of a MAC operation (e.g., partial sums, products, etc.). Thus, the VMAC of the present invention simultaneously performs different phases of a MAC operation on different sets of operand data, and produces a MAC result per a time period which is approximately equivalent to the propagation delay through a single computational stage.
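The overlap of different phases on different operand sets can be sketched with a small simulation. The division into three stages (forming partial products, summing them, adding the third operand), the use of non-negative integer operands, and all names below are assumptions chosen for illustration; they are not the stage boundaries of the SCS-MAC itself.

```python
def stage_partial_products(item):
    """Stage 0: form one shifted copy of x per set bit of y (y >= 0 assumed)."""
    x, y, z = item
    parts = [x << bit for bit in range(y.bit_length()) if (y >> bit) & 1]
    return parts, z


def stage_sum_products(item):
    """Stage 1: add the partial products to obtain x * y."""
    parts, z = item
    return sum(parts), z


def stage_accumulate(item):
    """Stage 2: add the third operand to the product."""
    product, z = item
    return product + z


STAGES = [stage_partial_products, stage_sum_products, stage_accumulate]


def run_pipeline(operand_sets):
    """Advance operand sets one stage per tick; return (tick, result) pairs."""
    slots = [None] * len(STAGES)  # slots[i] holds the output of stage i
    pending = list(operand_sets)
    results = []
    tick = 0
    while pending or any(slot is not None for slot in slots):
        tick += 1
        # Walk from the last stage back so each item moves exactly one stage.
        for i in range(len(STAGES) - 1, 0, -1):
            slots[i] = STAGES[i](slots[i - 1]) if slots[i - 1] is not None else None
        slots[0] = STAGES[0](pending.pop(0)) if pending else None
        if slots[-1] is not None:
            results.append((tick, slots[-1]))
            slots[-1] = None
    return results
```

In this model the first result appears after three ticks, and each later result follows one tick after the previous one, which mirrors the statement that a result is produced per approximately one stage delay once the stages are full.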
The present invention is directed to a virtual parallel multiplier-accumulator (VMAC) responsive to a VMAC clock (VMCK) derived from a master clock (MCK). The VMAC is adapted for performing more than or less than one multiplier-accumulator (MAC) operation within a MCK cycle.
The present invention is also directed to a virtual parallel multiplier-accumulator (VMAC) responsive to a VMAC clock (VMCK) derived from a master clock (MCK), where the VMAC is adapted for performing more than or less than one multiplier-accumulator (MAC) operation within a MCK cycle. The VMAC of this embodiment comprises a control-wave generator (CWG) adapted for generating a plurality of control signals within a VMCK cycle. The VMAC further comprises a sequential-computational stage MAC (SCS-MAC) adapted for receiving data from a source register and for receiving said plurality of control signals from the CWG. The SCS-MAC performs an operation on the data upon receipt of each of the plurality of control signals from the CWG.
The present invention is also directed to an integrated circuit including a virtual parallel multiplier-accumulator (VMAC) responsive to a VMAC clock (VMCK) derived from a master clock (MCK). The integrated circuit includes a VMAC that is adapted for performing more than one multiplier-accumulator (MAC) operation within a MCK cycle.
Other objects and features of the present invention will become apparent from the following detailed description, considered in conjunction with the accompanying drawing figures. It is to be understood, however, that the drawings, which are not to scale, are designed solely for the purpose of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims.
In the drawing figures, which are not to scale, and which are merely illustrative, and wherein like reference characters denote similar elements throughout the several views:
The present invention is directed to a virtual parallel multiplier-accumulator (VMAC) that could execute multiple MAC operations in a single system clock cycle. The inventive VMAC advantageously employs a resource/time-sharing methodology with multiple sequential computational stages.
Referring to
With continued reference to
The VMAC 100 provides time-multiplexed, serial output data (DQ) communicated over a single output data bus 172 to an output data demultiplexer 174, which demultiplexes the output data DQ and communicates the demultiplexed output data to a register 176. Data output (DQA, DQM) from the register 176 is provided on a plurality of parallel output data lines 170, 178 to a result register 190. A control signal (QSEL) 186 from the CWG 110 controls the flow of data output from the register 176. Upon receipt of a control signal QSEL, output data DQA, DQM is clocked out of the register 176 and into a result register 190. Intermediate data output from the VMAC 100 presented as data output (RA, RM) of the result register 190 is communicated back into the source register 160, where it may be utilized again by the VMAC 100.
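The conversion of the time-multiplexed serial output into parallel data may be sketched as follows. The round-robin assignment of successive serial words to the DQA and DQM lines is an assumption made for illustration (the ordering is not specified here), and the function name `demultiplex` is hypothetical.

```python
def demultiplex(serial_words, lanes=("DQA", "DQM")):
    """Distribute a serial word stream round-robin over the named lanes.

    Returns one dict per complete group of lanes. Appending a group stands in
    for the register contents being clocked into the result register.
    """
    groups = []
    current = {}
    for index, word in enumerate(serial_words):
        lane = lanes[index % len(lanes)]
        current[lane] = word
        if len(current) == len(lanes):
            groups.append(current)
            current = {}
    # A trailing, incomplete group is not emitted in this sketch.
    return groups
```

Under this assumption, the serial stream 10, 20, 30, 40 yields two parallel words: DQA = 10 with DQM = 20, followed by DQA = 30 with DQM = 40.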
A VMAC clock (VMCK) 102 controls the CWG 110 and is derived from a master clock (MCK). The VMCK 102 also controls the result register 190. The VMCK 102 may be the same (frequency, duty cycle, etc.) as the master clock 90, or the VMCK 102 may be different from the MCK 90, as a routine matter of design choice.
Control of the SCS-MAC 130 is provided by a control signal 112 communicated from the CWG 110. The control signal 112 (described in more detail below and with reference to
With reference next to
Flow of an operand (i.e., flow of data) through the SCS-MAC 130, and through the VMAC 100, of the present invention will now be discussed in detail with reference to
A plurality of precharge control signals, namely PREPP, PREA, PREB, PREC and PRED, are generated by the CWG 110 and communicated to the PPG 140 and PPA 150 to control the flow of data therethrough. The relationship between and among these signals, control signals DN-PPG and DNA:D, and the flow of data through the VMAC 100 is depicted in FIG. 6. These precharge control signals are also communicated to the delays 116-124 of the delay data path 114. Initially, and as depicted by dotted line 210 in
The timing diagram depicted in
The VMAC 100 of the present invention is operable in various different modes, four of which are represented by the timing diagrams depicted in
The timing diagram of
The timing diagram of
The timing diagram of
With reference next to
The present invention permits dynamic variation of the number of MAC operations to be performed in a MCK cycle. It is thus desirable to provide a signal from the VMAC 100 to the CPU pipeline control logic when the VMAC has completed the requested number of MAC operations.
The present invention is directed to a VMAC 100 that performs a variable number of MAC operations within a single system clock cycle. The VMAC 100 of the present invention may perform a fixed or variable number of MAC operations, as a routine matter of design choice. Thus, the present invention eliminates the duplicative, parallel MAC circuits and components required in prior art configurations and, as a result, achieves improvements in integrated circuit design and performance. A more compact layout of integrated circuits including a VMAC may be provided, accompanied by a reduction in power consumption.
Thus, while there have been shown and described and pointed out fundamental novel features of the invention as applied to preferred embodiments thereof, it will be understood that various omissions and substitutions and changes in the form and details of the disclosed invention may be made by those skilled in the art without departing from the spirit of the invention. It is the intention, therefore, to be limited only as indicated by the scope of the claims appended hereto.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Jun 30 1999 | LEE, HYUN | Lucent Technologies Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 010089 | /0447 | |
Jun 30 1999 | WHALEN, SHAUN P | Lucent Technologies Inc | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 010089 | /0447 | |
Jul 07 1999 | Agere Systems, Inc. | (assignment on the face of the patent) | / | |||
May 06 2014 | LSI Corporation | DEUTSCHE BANK AG NEW YORK BRANCH, AS COLLATERAL AGENT | PATENT SECURITY AGREEMENT | 032856 | /0031 | |
May 06 2014 | Agere Systems LLC | DEUTSCHE BANK AG NEW YORK BRANCH, AS COLLATERAL AGENT | PATENT SECURITY AGREEMENT | 032856 | /0031 | |
Aug 04 2014 | Agere Systems LLC | AVAGO TECHNOLOGIES GENERAL IP SINGAPORE PTE LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 035365 | /0634 | |
Feb 01 2016 | DEUTSCHE BANK AG NEW YORK BRANCH, AS COLLATERAL AGENT | LSI Corporation | TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS RELEASES RF 032856-0031 | 037684 | /0039 | |
Feb 01 2016 | DEUTSCHE BANK AG NEW YORK BRANCH, AS COLLATERAL AGENT | Agere Systems LLC | TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS RELEASES RF 032856-0031 | 037684 | /0039 |
Date | Maintenance Fee Events |
Mar 08 2007 | M1551: Payment of Maintenance Fee, 4th Year, Large Entity. |
Mar 14 2011 | M1552: Payment of Maintenance Fee, 8th Year, Large Entity. |
Apr 24 2015 | REM: Maintenance Fee Reminder Mailed. |
Sep 16 2015 | EXP: Patent Expired for Failure to Pay Maintenance Fees. |
Date | Maintenance Schedule |
Sep 16 2006 | 4 years fee payment window open |
Mar 16 2007 | 6 months grace period start (w surcharge) |
Sep 16 2007 | patent expiry (for year 4) |
Sep 16 2009 | 2 years to revive unintentionally abandoned end. (for year 4) |
Sep 16 2010 | 8 years fee payment window open |
Mar 16 2011 | 6 months grace period start (w surcharge) |
Sep 16 2011 | patent expiry (for year 8) |
Sep 16 2013 | 2 years to revive unintentionally abandoned end. (for year 8) |
Sep 16 2014 | 12 years fee payment window open |
Mar 16 2015 | 6 months grace period start (w surcharge) |
Sep 16 2015 | patent expiry (for year 12) |
Sep 16 2017 | 2 years to revive unintentionally abandoned end. (for year 12) |