A first number is multiplied by a second number, by representing the first number as a first set of one or more w-bit wide numbers, and representing the second number as a second set of one or more w-bit wide numbers. Each of the w-bit wide numbers from the first set is paired with each of the w-bit wide numbers from the second set. For each pair of w-bit wide numbers, a set of sub-partial products is generated. Combinations of the sub-partial products are formed such that each combination is representable by a w-bit wide lower partial product and a carry out term that has fewer than w bits. The w-bit wide lower partial products and the carry out terms are combined to form the product of the first number and the second number. The carry out term is advantageously representable by (w/2)+1 bits.

Patent: 7318080
Priority: Nov 06 2003
Filed: Nov 06 2003
Issued: Jan 08 2008
Expiry: Oct 18 2025
Extension: 712 days
Assg.orig Entity: Large
8
15
all paid
7. A partial product generator for use in multiplying a first number by a second number, wherein the first number is represented as a first set of two or more w-bit wide numbers, and the second number is represented as a second set of one or more w-bit wide numbers, the apparatus comprising:
first receiving circuitry that receives a first w-bit wide number from the first set of two or more w-bit wide numbers, wherein w is an integer greater than or equal to 2;
second receiving circuitry that receives a second w-bit wide number from the second set of one or more w-bit wide numbers;
sub-partial product generation circuitry that uses the first w-bit wide number and the second w-bit wide number to generate a set of sub-partial products; and
combination forming circuitry that forms combinations of the sub-partial products such that each combination is representable by a w-bit wide lower partial product and a carry out term that has fewer than w bits.
13. An apparatus for performing Montgomery multiplication between a first number and a second number, wherein the first number is represented as a first set of two or more w-bit wide numbers, and the second number is represented as a second set of one or more w-bit wide numbers, the apparatus comprising:
first number input circuitry that receives a first w-bit wide number from the first set of two or more w-bit wide numbers, wherein w is an integer greater than or equal to 2;
second number input circuitry that receives a second w-bit wide number from the second set of one or more w-bit wide numbers;
sub-partial product generation circuitry that uses the first w-bit wide number and the second w-bit wide number to generate a set of sub-partial products; and
combination forming circuitry that forms combinations of the sub-partial products such that each combination is representable by a w-bit wide lower partial product and a carry out term that has fewer than w bits.
1. An apparatus for multiplying a first number by a second number, the apparatus comprising:
first representation circuitry that represents the first number as a first set of two or more w-bit wide numbers, wherein w is an integer greater than or equal to 2;
second representation circuitry that represents the second number as a second set of one or more w-bit wide numbers;
pairing circuitry that pairs each of the w-bit wide numbers from the first set with each of the w-bit wide numbers from the second set;
sub-partial product generation circuitry that generates a set of sub-partial products for each pair of w-bit wide numbers;
combination forming circuitry that forms combinations of the sub-partial products such that each combination is representable by a w-bit wide lower partial product and a carry out term that has fewer than w bits; and
combining circuitry that combines the w-bit wide lower partial products and the carry out terms to form the product of the first number and the second number.
2. The apparatus of claim 1, wherein each of the carry out terms is representable by (w/2)+1 bits, wherein w is an even number.
3. The apparatus of claim 2, wherein:
the first number is representable by N bits, N≧2;
the second number is representable by M bits, M≧2;
for each pair of w-bit wide numbers, $(a_i, b_j)$,

$$a_i = a_i^H 2^{w/2} + a_i^L \quad\text{and}\quad b_j = b_j^H 2^{w/2} + b_j^L,$$
where $a_i^H$, $a_i^L$, $b_j^H$, $b_j^L$ are each w/2-bit wide numbers, $0 \le i \le N/w - 1$, and $0 \le j \le M/w - 1$; and
the combination forming circuitry forms each of the combinations of the sub-partial products in accordance with:

$$2^w c_{\mathrm{OUT}} + p_i = a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{w/2}\left(a_i^H b_j^L + a_i^L b_j^H\right) + c_{\mathrm{IN}},$$
where $p_i$ is a w-bit wide lower partial product, $c_{\mathrm{IN}}$ is a (w/2)+1 bit wide carry-in term, $c_{\mathrm{OUT}}$ is a (w/2)+1 bit wide carry-out term, and $a_{i-1}^H = 0$ when i=0.
4. The apparatus of claim 3, comprising:
logic that uses offset binary coding to represent ai;
logic that represents bj in accordance with
$$\begin{cases} b_j^{\sigma} = b_j^L + b_j^H \\ b_j^{\delta} = b_j^L - b_j^H \end{cases};$$
and
logic that uses logic functions of bits of ai to select either bjσ or bjδ in forming each of the combinations of the sub-partial products.
5. The apparatus of claim 2, wherein:
the first number is representable by N bits, N≧2;
the second number is representable by M bits, M≧2;
for each pair of w-bit wide numbers, $(a_i, b_j)$,

$$a_i = a_i^H 2^{w/2} + a_i^L \quad\text{and}\quad b_j = b_j^H 2^{w/2} + b_j^L,$$
where $a_i^H$, $a_i^L$, $b_j^H$, $b_j^L$ are each w/2-bit wide numbers, $0 \le i \le N/w - 1$, and $0 \le j \le M/w - 1$; and
the combination forming circuitry forms each of the combinations of the sub-partial products in accordance with:

$$2^w c_{\mathrm{OUT}} + t_{\mathrm{NEW}} = t_{\mathrm{OLD}} + a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{w/2}\left(a_i^H b_j^L + a_i^L b_j^H\right) + c_{\mathrm{IN}}$$
where $t_{\mathrm{OLD}}$ is a w-bit wide term representing an accumulation of previously generated lower partial products and carry terms, $t_{\mathrm{NEW}}$ is a w-bit wide term representing a new accumulation of previously and presently generated lower partial products and carry terms, $c_{\mathrm{IN}}$ is a (w/2)+1 bit wide carry-in term, $c_{\mathrm{OUT}}$ is a (w/2)+1 bit wide carry-out term, and $a_{i-1}^H = 0$ when i=0.
6. The apparatus of claim 5, comprising:
logic that uses offset binary coding to represent ai;
logic that represents bj in accordance with
$$\begin{cases} b_j^{\sigma} = b_j^L + b_j^H \\ b_j^{\delta} = b_j^L - b_j^H \end{cases};$$
and
logic that uses logic functions of bits of ai to select either bjσ or bjδ in forming each of the combinations of the sub-partial products.
8. The partial product generator of claim 7, wherein each of the carry out terms is representable by (w/2)+1 bits, wherein w is an even number.
9. The partial product generator of claim 8, wherein:
the first number is representable by N bits, N≧2;
the second number is representable by M bits, M≧2;
for each pair of w-bit wide numbers, $(a_i, b_j)$, that are input into the partial product generator,

$$a_i = a_i^H 2^{w/2} + a_i^L \quad\text{and}\quad b_j = b_j^H 2^{w/2} + b_j^L,$$
where $a_i^H$, $a_i^L$, $b_j^H$, $b_j^L$ are each w/2-bit wide numbers, $0 \le i \le N/w - 1$, and $0 \le j \le M/w - 1$; and
the combination forming circuitry forms each of the combinations of the sub-partial products in accordance with:

$$2^w c_{\mathrm{OUT}} + p_i = a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{w/2}\left(a_i^H b_j^L + a_i^L b_j^H\right) + c_{\mathrm{IN}},$$
where $p_i$ is a w-bit wide lower partial product, $c_{\mathrm{IN}}$ is a (w/2)+1 bit wide carry-in term, $c_{\mathrm{OUT}}$ is a (w/2)+1 bit wide carry-out term, and $a_{i-1}^H = 0$ when i=0.
10. The partial product generator of claim 9, comprising:
logic that uses offset binary coding to represent ai;
logic that represents bj in accordance with
$$\begin{cases} b_j^{\sigma} = b_j^L + b_j^H \\ b_j^{\delta} = b_j^L - b_j^H \end{cases};$$
and
logic that uses logic functions of bits of ai to select either bjσ or bjδ in forming each of the combinations of the sub-partial products.
11. The partial product generator of claim 8, wherein:
the first number is representable by N bits, N≧2;
the second number is representable by M bits, M≧2;
for each pair of w-bit wide numbers, $(a_i, b_j)$, that are input into the partial product generator,

$$a_i = a_i^H 2^{w/2} + a_i^L \quad\text{and}\quad b_j = b_j^H 2^{w/2} + b_j^L,$$
where $a_i^H$, $a_i^L$, $b_j^H$, $b_j^L$ are each w/2-bit wide numbers, $0 \le i \le N/w - 1$, and $0 \le j \le M/w - 1$; and
the combination forming circuitry forms each of the combinations of the sub-partial products in accordance with:

$$2^w c_{\mathrm{OUT}} + t_{\mathrm{NEW}} = t_{\mathrm{OLD}} + a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{w/2}\left(a_i^H b_j^L + a_i^L b_j^H\right) + c_{\mathrm{IN}}$$
where $t_{\mathrm{OLD}}$ is a w-bit wide term representing an accumulation of previously generated lower partial products and carry terms, $t_{\mathrm{NEW}}$ is a w-bit wide term representing a new accumulation of previously and presently generated lower partial products and carry terms, $c_{\mathrm{IN}}$ is a (w/2)+1 bit wide carry-in term, $c_{\mathrm{OUT}}$ is a (w/2)+1 bit wide carry-out term, and $a_{i-1}^H = 0$ when i=0.
12. The partial product generator of claim 11, comprising:
logic that uses offset binary coding to represent ai;
logic that represents bj in accordance with
$$\begin{cases} b_j^{\sigma} = b_j^L + b_j^H \\ b_j^{\delta} = b_j^L - b_j^H \end{cases};$$
and
logic that uses logic functions of bits of ai to select either bjσ or bjδ in forming each of the combinations of the sub-partial products.

The present invention relates to automated multiplication, and more particularly to efficient automated multiplication that is especially well suited for multiplication of large numbers.

Hardware-implemented techniques for multiplying two numbers together are well known. In many processing system architectures, it is adequate to accomplish multiplication by iteratively instructing generic logic, such as an arithmetic logic unit (ALU), to perform suitable add and shift operations to generate the final product. However, it is often desirable to make available very fast multiplication operations, and to this end specialized multiplication logic is often provided. Such logic is often separate and apart from the central processing unit (CPU).

Hardware-mapped multiplier units are very useful so long as the size (i.e., word length) of the input operands is comparable to the size of the computational data paths for communicating those operands. However, in many applications (e.g., cryptographic algorithms) it is necessary to multiply together operands that are much larger than the size of the computational data path. In such cases, it is impractical to implement the desired multiplication using a hardware-mapped multiplication unit. Instead, one or both of the operands are broken up into parts, and the hardware data path is conventionally reused in a time-multiplexed fashion, operating on the parts, or words, of the input numbers. Hardware reuse is also the case for software implementation on standard microprocessors having a fixed word length data path.

The operation of carrying out a part of the multiplication for each word is denoted “partial product generation.” In order to have a fast execution time, the number of iterations is minimized by using a large word length (also denoted “high radix”) for the partial product generation. Unfortunately, higher radices imply longer carry chains and intermediate carry signals with larger word length, thereby slowing down operation and increasing power consumption. This can be seen from the following analysis:

A positive integer N-bit number a can be written as a sequence of W-bit words $a_i$ as

$$a = \sum_{i=0}^{N/W-1} a_i 2^{Wi}.$$
The generalization to negative and fractional numbers is straightforward, but is not included in the calculations for the sake of simplicity. The multiplication of two numbers, x=ab, may be calculated by generating partial products from the W-bit words $a_i$ and $b_j$, and combining the partial products. More specifically, the product x may be calculated according to

$$x = \sum_{i=0}^{N/W-1} \sum_{j=0}^{N/W-1} a_i b_j 2^{W(i+j)},$$
where the partial product $x_{i,j}$ is generated from two W-bit numbers, $a_i$ and $b_j$, as
$$x_{i,j} = a_i b_j.$$
For a word length W, the partial products are simply calculated as W×W multiplications as indicated in the equation above. To calculate the complete product x=ab, all partial products are generated and added together according to their significance.
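The word-wise sum above can be illustrated with a minimal Python sketch. The function names `split_words` and `multiply_by_partial_products` are hypothetical, chosen only for this illustration, and Python's arbitrary-precision integers stand in for the accumulation hardware.

```python
def split_words(value, word_bits, n_words):
    """Split a non-negative integer into n_words little-endian words of word_bits bits."""
    mask = (1 << word_bits) - 1
    return [(value >> (word_bits * k)) & mask for k in range(n_words)]


def multiply_by_partial_products(a, b, word_bits, n_words):
    """Sum all partial products a_i * b_j, each shifted to its significance W*(i+j)."""
    a_words = split_words(a, word_bits, n_words)
    b_words = split_words(b, word_bits, n_words)
    product = 0
    for j, b_j in enumerate(b_words):
        for i, a_i in enumerate(a_words):
            x_ij = a_i * b_j  # one W x W partial product, at most 2W bits wide
            product += x_ij << (word_bits * (i + j))
    return product
```

For example, with W=8 and four words per operand, `multiply_by_partial_products(0xDEADBEEF, 0x12345678, 8, 4)` reproduces the ordinary integer product of the two operands.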

One partial product slice 101 is shown in FIG. 1. The rhombic shape is due to the significance of each of the partial product bits; significance increases when going from right to left in the figure. FIG. 2 is a diagram depicting how all of the required partial products are mathematically combined to generate the complete product. It is apparent from the figure that the computed result from one slice should be combined with the result from the neighboring slices to the left and right, and that these combination results are also accumulated with the values generated by the slices above and below.

The partial product $x_{i,j}$ is 2W bits wide, and is conventionally divided into two W-bit wide words, herein denoted carry ($c_{i,j}$) and lower partial product ($p_{i,j}$), as
$$x_{i,j} = 2^W c_{i,j} + p_{i,j},$$
or
$$c_{i,j} = \operatorname{int}\left(x_{i,j}/2^W\right) \quad\text{and}\quad p_{i,j} = x_{i,j} \bmod 2^W,$$
where “int( )” is a function that generates the integer part of a number, and “mod n” indicates modulo n arithmetic.

Assume that it is desired to multiply two large numbers A and B, each of word length N, stored in a memory of word length W. Then, each number consists of N/W words (assuming that W is a factor of N). Let T be a storage area of 2N bits, or equivalently 2N/W words denoted t0, t1, . . . , t(2N/W)−1. T is used as a working storage area in which carry and lower partial product terms are accumulated until the final product, x, is generated. The final values for t0, t1, . . . , t(2N/W)−1 are efficiently generated from carry terms, previously-generated lower partial products, and interim values of t0, t1, . . . , t(2N/W)−1 as follows (where the symbol “: =” denotes a processing operation whereby already-existing (“old”) values of terms are combined as indicated on the right side of the symbol, with the result being assigned to the indicated “new” term on the left side of the symbol):

$$\begin{array}{lll}
t_0 := p_{0,0} & t_1 := c_{0,0} + p_{1,0} & t_2 := c_{1,0} + p_{2,0} \\
t_1 := t_1 + p_{0,1} & t_2 := t_2 + c_{0,1} + p_{1,1} & t_3 := t_3 + c_{1,1} + p_{2,1}
\end{array}$$
That is, operations take place in a right-to-left, top to bottom order, starting with the horizontal direction first as illustrated in FIG. 3. For example, in FIG. 3 it can be seen that the word $b_0$ is first applied against each of the words $a_0 \ldots a_{(N/W)-1}$ to generate corresponding lower partial products $p_{0,0} \ldots p_{(N/W)-1,(N/W)-1}$, carry terms $c_{0,0} \ldots c_{(N/W)-1,(N/W)-1}$, and words $t_0 \ldots t_{(N/W)-1}$ before the next word $b_1$ is applied against the words $a_0 \ldots a_{(N/W)-1}$, and so on.

FIG. 4 is a logic diagram illustrating conventional logic of an exemplary row 301 for implementing multiplication as illustrated in FIG. 3. The first row 303 can be considered a special case in which the values of t0 . . . t(N/W)−1 have each been initialized to zero. If the first row 303 is physically implemented by logic as depicted in FIG. 4, it can be efficiently realized by merely omitting the tk inputs from each indicated adder (k=0 . . . (N/W)−1).

FIG. 5 is a logic diagram of a generic one of the conventional partial product generators illustrated in FIG. 4. Mathematically, the outputs from the partial product generator are related to the inputs as follows:
$$2^W c_{\mathrm{OUT}} + t_{\mathrm{NEW}} = t_{\mathrm{OLD}} + p_{i,j} + c_{\mathrm{IN}},$$
where $p_{i,j}$ is a lower partial product. (It will be noted that, in order to ease the notational burden in this description, the carry term supplied to a partial product generator is henceforth referred to as “$c_{\mathrm{IN}}$”, and the carry term provided as an output from the partial product generator is henceforth referred to as “$c_{\mathrm{OUT}}$”.) While $t_{\mathrm{NEW}}$ and $t_{\mathrm{OLD}}$ may be maintained separately, in practical embodiments it is often most efficient to maintain a single value of t in storage, with $t_{\mathrm{OLD}}$ being the value read out of storage, and $t_{\mathrm{NEW}}$ being the value to be written back.
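The input/output relationship of the conventional partial product generator, and its row-by-row use, can be sketched in Python as follows. This is only an illustrative software model, not the logic of FIG. 4 or FIG. 5: the names `conventional_ppg` and `multiply_conventional` are hypothetical, words are stored least-significant first, and the term $p_{i,j}$ is taken to be the full product $a_i b_j$, as the bound derivation in this description does.

```python
def conventional_ppg(a_i, b_j, t_old, c_in, word_bits):
    """Return (t_new, c_out) such that 2^W * c_out + t_new = t_old + a_i*b_j + c_in."""
    total = t_old + a_i * b_j + c_in
    return total & ((1 << word_bits) - 1), total >> word_bits


def multiply_conventional(a_words, b_words, word_bits):
    """Multiply two little-endian word lists, accumulating each row into T."""
    t = [0] * (len(a_words) + len(b_words))
    for j, b_j in enumerate(b_words):        # one row per word of b
        carry = 0
        for i, a_i in enumerate(a_words):    # least significant word first
            t[i + j], carry = conventional_ppg(a_i, b_j, t[i + j], carry, word_bits)
        t[j + len(a_words)] = carry          # last carry of the row lands in a fresh word
    return t
```

Reassembling the returned words as `sum(w << (word_bits * k) for k, w in enumerate(t))` gives the full product.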

It will now be shown how this expression can be used to derive the minimum word length of the carry signal. Since $p_{i,j} = a_i b_j$, and $a_i, b_j \le 2^W - 1$, it follows that:
$$p_{i,j} \le (2^W - 1)^2.$$
Furthermore, the word length of t is W bits, and thus $t \le 2^W - 1$. Thus, if we collect the carry terms on the left side of the relationship, and collect the t terms on the right side of the relationship, we find that
$$2^W c_{\mathrm{OUT}} - c_{\mathrm{IN}} = t_{\mathrm{OLD}} - t_{\mathrm{NEW}} + p_{i,j}.$$
The right side of the equation can be set to its maximum value by letting $t_{\mathrm{NEW}}$ be set to zero (i.e., its minimum value), and by letting $t_{\mathrm{OLD}}$ and $p_{i,j}$ each be set to their respective maximum values. This yields the following relationship:

$$2^W c_{\mathrm{OUT}} - c_{\mathrm{IN}} \le (2^W - 1) + (2^W - 1)^2 = 2^W (2^W - 1)$$
Since $c_{\mathrm{IN}}$ is, by definition, greater than or equal to zero, and since the relationship must be true for all values of $c_{\mathrm{IN}}$ (i.e., including $c_{\mathrm{IN}} = 0$), it can be concluded that $c_{\mathrm{OUT}} \le 2^W - 1$. Furthermore, the word length of the carry in signal is the same as the carry out signal. Therefore,
$$c_{\mathrm{IN}}, c_{\mathrm{OUT}} \le 2^W - 1.$$
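The bound can also be checked numerically for small word lengths. The following Python sketch is an illustration only: the function name `worst_case_carry` is hypothetical, and it simply enumerates every combination of W-bit inputs for the conventional relationship, again taking $p_{i,j} = a_i b_j$.

```python
from itertools import product


def worst_case_carry(word_bits):
    """Exhaustively find the largest c_out when every input, c_in included, fits in W bits."""
    values = range(1 << word_bits)
    return max((t_old + a_i * b_j + c_in) >> word_bits
               for a_i, b_j, t_old, c_in in product(values, repeat=4))
```

For W = 2, 3 and 4 the largest carry out found this way equals $2^W - 1$, so a carry as wide as the data word is both sufficient and necessary for the conventional generator.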

From the previous discussion, two statements can be made regarding the shown radix-$2^W$ approach:

1. All data words, including carry signals, are W bits wide.

2. The carry propagate chain for the radix-$2^W$ partial product generator approach is 2W bits long.

The length of the carry propagate chain sets the upper limit on the speed of a partial product generator implementation, and the size of the propagated carry sets the limit on the maximum required word length of the data path.

It is common to increase multiplication speed by using modified Booth encoding, Wallace adders to compress the number of partial products, and faster addition schemes for carry propagation summation. Booth encoding is discussed in A. D. Booth, “A signed binary multiplication technique,” Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, pp. 236-240, 1951; and in L. P. Rubinfield, “A proof of the modified Booth algorithm for multiplication,” IEEE Transactions on Computers, October 1975, both of which are hereby incorporated herein by reference. Wallace adders are discussed in C. Wallace, “A suggestion for a fast multiplier,” IEEE Transactions on Electronic Computers, vol. EC-13, February 1964, which is hereby incorporated herein by reference.

The choice of radix for the partial product generation implementation depends on a number of factors, mainly including constraints on clock frequency, area, available data word length, and latency. To have a fast and area-efficient partial product generation, the word length, or radix, has to be limited. A restricted word length results in a larger number of partial products, which takes more time to add together when producing the full word length product. Thus, the choice of a radix for the partial product generator results in a sub-optimal solution.

The use of Booth encoding, or other means to speed up partial product generation, may speed up calculation of the actual partial product, but the word length of the intermediate carry signal remains the same, thus not improving the time required for addition of the partial products.

It is therefore desirable to provide improved methods and apparatuses for multiplying large numbers together.

It should be emphasized that the terms “comprises” and “comprising”, when used in this specification, are taken to specify the presence of stated features, integers, steps or components; but the use of these terms does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.

In accordance with one aspect of the present invention, the foregoing and other objects are achieved in methods, apparatuses and computer readable storage media for multiplying a first number by a second number, where the first number is represented as a first set of one or more W-bit wide numbers, and the second number is represented as a second set of one or more W-bit wide numbers. In accordance with an aspect of the invention, each of the W-bit wide numbers from the first set is paired with each of the W-bit wide numbers from the second set. For each pair of W-bit wide numbers, a set of sub-partial products is generated. Combinations of the sub-partial products are formed such that each combination is representable by a W-bit wide lower partial product and a carry out term that has fewer than W bits. The W-bit wide lower partial products and the carry out terms are combined to form the product of the first number and the second number. In some embodiments, each of the carry out terms is representable by (W/2)+1 bits.

In one aspect, the first number is representable by N bits, N≧0; the second number is representable by M bits, M≧0; for each pair of W-bit wide numbers, $(a_i, b_j)$,
$$a_i = a_i^H 2^{W/2} + a_i^L \quad\text{and}\quad b_j = b_j^H 2^{W/2} + b_j^L,$$
where $a_i^H$, $a_i^L$, $b_j^H$, $b_j^L$ are each W/2-bit wide numbers, $0 \le i \le N/W - 1$, and $0 \le j \le M/W - 1$. In some embodiments, each of the combinations of the sub-partial products is formed in accordance with:
$$2^W c_{\mathrm{OUT}} + p_i = a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{W/2}\left(a_i^H b_j^L + a_i^L b_j^H\right) + c_{\mathrm{IN}},$$
where $p_i$ is a W-bit wide lower partial product, $c_{\mathrm{IN}}$ is a (W/2)+1 bit wide carry-in term, $c_{\mathrm{OUT}}$ is a (W/2)+1 bit wide carry-out term, and $a_{i-1}^H = 0$ when i=0.

In alternative embodiments, each of the combinations of the sub-partial products is formed in accordance with:
$$2^W c_{\mathrm{OUT}} + t_{\mathrm{NEW}} = t_{\mathrm{OLD}} + a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{W/2}\left(a_i^H b_j^L + a_i^L b_j^H\right) + c_{\mathrm{IN}}$$
where $t_{\mathrm{OLD}}$ is a W-bit wide term representing an accumulation of previously generated lower partial products and carry terms, $t_{\mathrm{NEW}}$ is a W-bit wide term representing a new accumulation of previously and presently generated lower partial products and carry terms, $c_{\mathrm{IN}}$ is a (W/2)+1 bit wide carry-in term, $c_{\mathrm{OUT}}$ is a (W/2)+1 bit wide carry-out term, and $a_{i-1}^H = 0$ when i=0.

In yet another aspect of the invention, in either of these embodiments, offset binary coding is used to represent ai; bj is represented in accordance with

$$\begin{cases} b_j^{\sigma} = b_j^L + b_j^H \\ b_j^{\delta} = b_j^L - b_j^H \end{cases};$$
and
the logic functions of bits of ai are used to select either bjσ or bjδ in forming each of the combinations of the sub-partial products.

In still other aspects of the invention, methods, apparatuses and computer readable storage media are for generating a partial product that is for use in multiplying a first number by a second number, wherein the first number is represented as a first set of one or more W-bit wide numbers, and the second number is represented as a second set of one or more W-bit wide numbers. Such methods, apparatuses and computer readable storage media are based on receiving a first W-bit wide number from the first set of one or more W-bit wide numbers; and receiving a second W-bit wide number from the second set of one or more W-bit wide numbers. The first W-bit wide number and the second W-bit wide number are used to generate a set of sub-partial products. Combinations of the sub-partial products are formed such that each combination is representable by a W-bit wide lower partial product and a carry out term that has fewer than W bits.

In some of these embodiments, the carry out term is representable by (W/2)+1 bits.

The various embodiments of the invention may be used whenever multiplication is to be performed between two or more numbers. Because of the many advantageous properties presented by the invention (including those properties expressly stated herein as well as others that are immediately apparent to those skilled in the art), the invention is especially useful whenever large numbers are to be multiplied together. Of course, what constitutes a “large” number will vary from one application to another. In some applications, a number may be considered “large” if it is about 4 times the computational data path (e.g., 4096 bits used in cryptography). Thus, the invention may advantageously be applied in the field of cryptography, as well as many other fields. When used in cryptography, the invention may be used as part of a Montgomery multiplication process, or alternatively as a “stand-alone” process for multiplying two or more numbers together.

The objects and advantages of the invention will be understood by reading the following detailed description in conjunction with the drawings in which:

FIG. 1 is a diagram showing one partial product slice.

FIG. 2 is a diagram depicting how all of the required partial products are combined to generate a complete product.

FIG. 3 is a diagram illustrating how outputs from partial product slices are combined in a right-to-left, top to bottom order, to generate a final product.

FIG. 4 is a logic diagram illustrating conventional logic of an exemplary row 301 for implementing multiplication as illustrated in FIG. 3.

FIG. 5 is a logic diagram of a generic one of the conventional partial product generators illustrated in FIG. 4.

FIG. 6 illustrates a technique for multiplying two numbers together by breaking them up into numbers that are representable by fewer bits than the original two numbers, multiplying these numbers together to generate partial products, and then combining these partial products to generate the desired product.

FIG. 7 is a diagram that illustrates a technique for generating the partial products of FIG. 6 by breaking up the numbers into other numbers representable by even fewer bits, multiplying these numbers together to generate other partial products, and then combining these other partial products together to generate the desired partial product.

FIG. 8 is a diagram that illustrates, for the case of W/2-bit by W/2-bit multiplications, how the sub-partial products take the place of the partial products shown in FIG. 6.

FIG. 9 is a flow diagram that illustrates the general case of the split-radix multiplication techniques described herein.

FIG. 10 illustrates how the sub-partial products generated from a W-bit by W-bit multiply may be advantageously grouped for the cases in which for each pair of W-bit wide numbers, a set of sub-partial products is generated by performing W/2 by W/2-bit multiplications.

FIG. 11 is a block diagram of a new partial product generator based on the above-described groupings.

FIG. 12 is a diagram depicting how all of the required partial products are mathematically combined to generate a complete product in accordance with an aspect of the invention.

FIG. 13 is a logic diagram of an embodiment of a new partial product generator that also accumulates sub-partial products of like significance from earlier operations.

FIG. 14 illustrates the division of the input words into W×W blocks that the algorithm works on, in accordance with an aspect of the invention.

FIG. 15 illustrates how a split radix Montgomery algorithm is created, in accordance with an aspect of the invention, by dividing all computing blocks of FIG. 14 into sub-blocks containing three or four radix-$2^{W/2}$ blocks.

The various features of the invention will now be described with reference to the figures, in which like parts are identified with the same reference characters.

The various aspects of the invention will now be described in greater detail in connection with a number of exemplary embodiments. To facilitate an understanding of the invention, many aspects of the invention are described in terms of sequences of actions to be performed by elements of a computer system. It will be recognized that in each of the embodiments, the various actions could be performed by specialized circuits (e.g., discrete logic gates interconnected to perform a specialized function), by program instructions being executed by one or more processors, or by a combination of both. Moreover, the invention can additionally be considered to be embodied entirely within any form of computer readable carrier, such as solid-state memory, magnetic disk, optical disk or carrier wave (such as radio frequency, audio frequency or optical frequency carrier waves) containing an appropriate set of computer instructions that would cause a processor to carry out the techniques described herein. Thus, the various aspects of the invention may be embodied in many different forms, and all such forms are contemplated to be within the scope of the invention. For each of the various aspects of the invention, any such form of embodiments may be referred to herein as “logic configured to” perform a described action, or alternatively as “logic that” performs a described action.

To overcome the problem of the sub-optimal radix selection, the techniques described herein use dual radices: one for the word length of the computational data path, and one for the calculations of the partial products. This is referred to herein as “split radix”. Using dual radices helps reach a more optimal implementation: data is addressed with a larger word length, and arithmetic operations in the partial product generator are made with a smaller word length. As a result, the data bandwidth remains the same while arithmetic calculations become faster and more efficient.

To facilitate an understanding of the basis for the split radix design, consider FIG. 6, which illustrates partial products that are generated when multiplying together two operands, a and b. Assume that the lengths of a and b are each integer multiples of a word length W. (Even if they are not initially of such a length, their length could always be extended as necessary—for example by padding with zeroes or by sign extension—to force their lengths to be integer multiples of W.) As explained in the Background section, the N-bit number a can then be written as a sequence of W-bit words ai as

$$a = \sum_{i=0}^{N/W-1} a_i 2^{Wi}.$$
Similarly, the M-bit number b can be written as a sequence of W-bit words $b_j$ as

$$b = \sum_{j=0}^{M/W-1} b_j 2^{Wj}.$$
The multiplication of the two numbers, x=ab, may be calculated by generating partial products from the W-bit words $a_i$ and $b_j$, and combining the partial products. More specifically, the product x may be calculated according to

$$x = \sum_{i=0}^{N/W-1} \sum_{j=0}^{M/W-1} a_i b_j 2^{W(i+j)},$$
where the partial product $x_{i,j}$ is generated from two W-bit numbers, $a_i$ and $b_j$, as
$$x_{i,j} = a_i b_j.$$

Each of the partial products $x_{i,j}$ will be 2W bits long. In practical implementations, the multiplication by $2^{W(i+j)}$ is performed by aligning the partial product $x_{i,j}$ so that it has the proper significance when combined with other partial products. For example, the partial product $x_{i,j}$ may be left-shifted by W(i+j) bits (inserting “zeroes” as the fill bits) in order to accomplish the multiplication by $2^{W(i+j)}$.

This technique for multiplying two numbers together is illustrated in FIG. 6. Here it can be seen that a first partial product, $a_0 b_0$, is generated by multiplying the two words $a_0$ and $b_0$ together. A next partial product $a_1 b_0$ is generated by multiplying the two words $a_1$ and $b_0$ together. To effect multiplying the partial product $a_1 b_0$ by $2^{W(0+1)} = 2^W$, the partial product $a_1 b_0$ is aligned W bits to the left of the first partial product $a_0 b_0$. The remaining partial products are similarly generated and aligned, so that they may be summed to form the final product ab, as shown in the figure.

In accordance with one aspect of the invention, it is recognized that each of the W-bit by W-bit multiplies can itself be broken up into a series of W/2-bit by W/2-bit multiplies. Mathematically, what we have is $a_i = a_i^H 2^{W/2} + a_i^L$ and $b_j = b_j^H 2^{W/2} + b_j^L$. The product $a_i b_j$ is then generated as

$$\begin{aligned} a_i \times b_j &= \left(a_i^H 2^{W/2} + a_i^L\right) \times \left(b_j^H 2^{W/2} + b_j^L\right) \\ &= a_i^H b_j^H 2^W + \left(a_i^H b_j^L + a_i^L b_j^H\right) 2^{W/2} + a_i^L b_j^L \end{aligned}$$
FIG. 7 is a diagram that illustrates these partial products being generated and properly aligned to permit their sum to represent the product aibj.
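The decomposition of one W-bit by W-bit multiply into four W/2-bit by W/2-bit multiplies can be sketched in Python as follows. The function names and the dictionary keys ("HH", "HL", "LH", "LL") are hypothetical labels introduced only for this illustration, and W is assumed to be even.

```python
def sub_partial_products(a_i, b_j, word_bits):
    """Return the four W/2 x W/2 sub-partial products of a_i * b_j."""
    half = word_bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a_i & mask, a_i >> half
    b_lo, b_hi = b_j & mask, b_j >> half
    return {"HH": a_hi * b_hi, "HL": a_hi * b_lo,
            "LH": a_lo * b_hi, "LL": a_lo * b_lo}


def recombine(spp, word_bits):
    """Align the sub-partial products by significance and sum them."""
    half = word_bits // 2
    return (spp["HH"] << word_bits) + ((spp["HL"] + spp["LH"]) << half) + spp["LL"]
```

For instance, with W=8, `recombine(sub_partial_products(0xAB, 0xCD, 8), 8)` equals `0xAB * 0xCD`.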

In accordance with another aspect of the invention, each of the W-bit by W-bit multiplies illustrated in FIG. 6 can itself be broken up into a series of smaller multiplications, such as a series of W/2-bit by W/2-bit multiplies as illustrated in FIG. 7. This results in the generation of what are herein referred to as sub-partial products. For the case of W/2-bit by W/2-bit multiplies, FIG. 8 illustrates how the sub-partial products take the place of the partial products shown in FIG. 6.

In the general case, the split-radix multiplication technique includes the steps illustrated in FIG. 9. To multiply a first number by a second number, the first number is represented as a first set of one or more W-bit wide numbers (step 901), and the second number is represented as a second set of one or more W-bit wide numbers (step 903). Each of the W-bit wide numbers from the first set is paired with each of the W-bit wide numbers from the second set (step 905). For each pair of W-bit wide numbers, a set of sub-partial products is generated (step 907). Combinations of the sub-partial products are formed such that each combination is representable by a W-bit wide lower partial product and a carry out term that has fewer than W bits (step 909). For example, when the sub-partial products are the result of performing W/2 by W/2-bit multiplications, the carry out terms are representable by (W/2)+1 bits. The W-bit wide lower partial products and the carry out terms are combined to form the product of the first number and the second number (step 911).
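The carry width stated for W/2 by W/2-bit sub-partial products can be checked by brute force for small W. The Python sketch below is an illustration under stated assumptions: it uses the accumulating form of the combination (with a W-bit $t_{\mathrm{OLD}}$ term and a (W/2)+1 bit carry-in), and the function names `group_sum` and `worst_case_split_carry` are hypothetical.

```python
from itertools import product


def group_sum(a_lo, a_hi, a_prev_hi, b_lo, b_hi, t_old, c_in, word_bits):
    """Right-hand side of the combination: four sub-partial products plus t_old and c_in."""
    half = word_bits // 2
    return (t_old + a_lo * b_lo + a_prev_hi * b_hi
            + ((a_hi * b_lo + a_lo * b_hi) << half) + c_in)


def worst_case_split_carry(word_bits):
    """Largest carry out over all W/2-bit halves, W-bit t_old and (W/2)+1-bit c_in."""
    half = word_bits // 2
    halves = range(1 << half)
    words = range(1 << word_bits)
    carries = range(1 << (half + 1))
    return max(group_sum(al, ah, ap, bl, bh, t, c, word_bits) >> word_bits
               for al, ah, ap, bl, bh in product(halves, repeat=5)
               for t in words for c in carries)
```

For W = 2 and W = 4 the largest carry out found is $2^{(W/2)+1} - 1$, i.e. it fits in (W/2)+1 bits even when the carry in is itself (W/2)+1 bits wide, so the carry width does not grow from one generator to the next.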

In accordance with yet another aspect of the invention in which for each pair of W-bit wide numbers, a set of sub-partial products is generated by performing W/2 by W/2-bit multiplications, the sub-partial products generated from the W-bit by W-bit multiply may be advantageously grouped as illustrated in FIG. 10. Two types of groupings are illustrated: a first grouping type 1001 that involves the addition of four sub-partial products, and a second grouping type 1003 that involves the addition of only three sub-partial products. The second grouping type 1003 can be considered a special case of the first grouping type 1001, in which one of the four sub-partial products is always zero.

In some embodiments, a third grouping can also be constructed for use in the most-significant position of the partial product generation (e.g., the left-most position of a row of sub-partial product generators) in that fewer than four sub-partial products will need to be combined in this position. In many embodiments, it is advantageous simply to use enough ones of the first grouping type 1001 for the size of the product that it is intended to generate.

In accordance with an aspect of the invention, a new type of partial product generator (PPG) generates its outputs based on these groupings. In the general case, the new partial product generator receives $a_i$, $b_j$, $a_{i-1}^H$, a value t, and a carry-in value ($c_{in}$) as its input parameters (where i=0, . . . , L−1; and L=the number of W-bit words that make up the number a). The final values for $t_0, t_1, \ldots, t_{(2N/W)-1}$ may be efficiently generated from carry terms, previously-generated lower partial products, and interim values of $t_0, t_1, \ldots, t_{(2N/W)-1}$ in the manner previously described with respect to the conventional PPG, except that it will be recognized that different values will be obtained because the carry terms themselves will take on different values.

Like the conventional type PPG, the new PPG uses W-bit wide input operands (except for the carry-in operand, which is only $\frac{W}{2}+1$ bits wide). However, unlike the conventional type PPG, the new PPG generates an output that is, in total, $\frac{3W}{2}+1$ bits wide.

A new PPG 1101 based on the above-described groupings can be schematically depicted as in FIG. 11. Here it can be seen that the $a_i$ and $b_j$ inputs are only W-bits wide, and that the output from this new PPG 1101 is $\frac{3W}{2}+1$ bits wide. The shape depicted in FIG. 11 is representational of the significance of the four sub-partial products that are summed within the new PPG 1101.

In another aspect of the invention, the $\frac{3W}{2}+1$ bit wide output from the new PPG is divided into a W-bit wide lower partial product $p_{i,j}$ that is propagated in a vertical direction (i.e., combined with previously-generated and later-generated terms of like significance), and a $\frac{W}{2}+1$ bit wide carry part $c_{i,j}$ that is propagated in the horizontal direction (i.e., to a term of higher significance).

Operations using the new PPG take place in a right-to-left, top to bottom order, starting with the horizontal direction first as earlier-illustrated in FIG. 3.

FIG. 12 is a diagram depicting how all of the required partial products are mathematically combined to generate the complete product. It is apparent from the figure that the computed result from one slice should be combined with the result from the neighboring slices to the left and right, and that these combination results are also accumulated with the values generated by the slices above and below. It can be seen in the figure that in each row, the right-most (i.e., least significant) PPG is depicted in the shape of the number “7”, to indicate that only three sub-partial products are generated and accumulated within the PPG.

FIG. 13 is a logic diagram of an embodiment of a new PPG 1301 that also accumulates sub-partial products of like significance from earlier operations. Consequently, the output of the new PPG 1301 includes a t term rather than a p term as the lower partial product. As can be seen, this is a very compact design that utilizes only four size W/2×W/2 multipliers 1303, 1305, 1307, 1309. These multipliers 1303, 1305, 1307, 1309 are arranged, along with three adders 1311, 1313, 1315 to generate carry-out ($c_{OUT}$) and $t_{NEW}$ as follows:
$$2^{W} c_{OUT} + t_{NEW} = t_{OLD} + a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{W/2}(a_i^H b_j^L + a_i^L b_j^H) + c_{IN},$$
where the symbols $a_i^L$, $a_i^H$ denote the lower and higher W/2 bits of $a_i$, respectively, and similar notation is used for the b variable. Additional logic 1317 is illustrated in the figure to represent scaling the output of the adder 1313 by a factor of $2^{W/2}$. However, because this is multiplication by a power of 2, it can advantageously be implemented merely by aligning the output of the adder 1313 by W/2 bits to the left (i.e., to a position of higher significance), or by performing a comparable left shift operation. Thus, it is unnecessary to use actual multiplication circuitry to accomplish this function. Also, while $t_{NEW}$ and $t_{OLD}$ may be maintained separately, in practical embodiments it is often most efficient to maintain a single value of t in storage, with $t_{OLD}$ being the value read out of storage, and $t_{NEW}$ being the value to be written back; such an embodiment is illustrated in FIG. 13.
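The relationship between the inputs and outputs of the accumulating PPG can be modelled behaviourally as in the following sketch. The function name `new_ppg` and its argument names are illustrative assumptions, W is assumed to be even, and the code makes no claim about which product is formed by which of the multipliers 1303 through 1309.

```python
def new_ppg(a_i, a_prev_high, b_j, t_old, c_in, W):
    """Behavioural model of the accumulating PPG.

    Returns (c_out, t_new) such that
    2**W * c_out + t_new == t_old + aL*bL + a_prev_high*bH + 2**(W/2)*(aH*bL + aL*bH) + c_in.
    """
    h = W // 2
    hmask = (1 << h) - 1
    a_lo, a_hi = a_i & hmask, a_i >> h
    b_lo, b_hi = b_j & hmask, b_j >> h
    # the two sub-partial products of lowest significance
    low_sum = a_lo * b_lo + a_prev_high * b_hi
    # the two cross products, which carry a weight of 2**(W/2)
    cross_sum = a_hi * b_lo + a_lo * b_hi
    # scaling by 2**(W/2) is just a left shift by W/2 bits
    total = t_old + low_sum + (cross_sum << h) + c_in
    return total >> W, total & ((1 << W) - 1)
```

With W=8 and every input at its maximum (`new_ppg(0xFF, 0xF, 0xFF, 0xFF, 0x1F, 8)`), the sum is 7936 = 31 × 256, so the model returns a carry of 31 and a lower part of 0; the carry still fits in (W/2)+1 = 5 bits.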

It is possible to use the embodiment of FIG. 13 throughout the entire multiplication process if the value “0” is used for $a_{i-1}^H$ whenever i=0. Alternatively, the design depicted in FIG. 13 may be modified to effect a special-case version of the PPG for use whenever i=0 (i.e., for use as the right-most, least-significant PPGs depicted in FIG. 12). This modification involves the removal of the multiplier 1305 and the adder 1311, and supplying the output of the multiplier 1303 directly to the adder 1315.

It will now be shown how the expression for $c_{OUT}$ and $t_{NEW}$ given above can be used to derive the maximum word length of the carry signal. Since $0 \le a_i^H, a_i^L, a_{i-1}^H, b_j^H, b_j^L \le 2^{W/2}-1$, it follows that:
$$a_i^H b_j^H,\ a_i^L b_j^H,\ a_i^H b_j^L,\ a_i^L b_j^L,\ a_{i-1}^H b_j^H \le (2^{W/2}-1)^2.$$
Furthermore, the word length of t is W bits, and thus $t \le 2^{W}-1$. Thus, if we collect the carry terms on the left side of the relationship, and collect the t terms on the right side of the relationship, we find that
$$2^{W} c_{OUT} - c_{IN} = t_{OLD} - t_{NEW} + a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{W/2}(a_i^H b_j^L + a_i^L b_j^H).$$
The right side of the equation can be set to its maximum value by letting tNEW be set to zero (i.e., its minimum value), and by letting tOLD and each of the sub-partial products be set to their respective maximum values. This yields the following relationship:

$$\begin{aligned}
2^{W} c_{OUT} - c_{IN} &\le (2^{W}-1) + 2(2^{W/2}-1)^2 + 2^{W/2} \times 2(2^{W/2}-1)^2 \\
&= (2^{W}-1) + 2(2^{W/2}-1)^2 (1 + 2^{W/2}) \\
&= (2^{W}-1) + (2^{W/2}-1)^2 \times 2(1 + 2^{W/2}) \\
&= (2^{W}-1) + (2^{W} - 2^{(W/2)+1} + 1) \times 2(1 + 2^{W/2}) \\
&= (2^{W}-1) + (2^{W} - 2^{(W/2)+1} + 1)(2 + 2^{(W/2)+1}) \\
&= (2^{W}-1) + (2^{(3W/2)+1} - 2^{W+2} + 2^{(W/2)+1} + 2^{W+1} - 2^{(W/2)+2} + 2) \\
&= 2^{(3W/2)+1} - 2^{W+2} + 2^{W+1} + 2^{W} - 2^{(W/2)+2} + 2^{(W/2)+1} + 2 - 1 \\
&< 2^{(3W/2)+1}
\end{aligned}$$
Since $c_{IN}$ is, by definition, greater than or equal to zero, and since the relationship must be true for all values of $c_{IN}$ (i.e., including $c_{IN}=0$), it can be concluded that $c_{OUT} < 2^{(W/2)+1}$. Furthermore, the word length of the carry in signal is the same as the carry out signal. Therefore,
$$c_{IN},\ c_{OUT} < 2^{(W/2)+1}.$$
That is, the intermediate right-to-left propagating carry for the partial product generator 201 is only half the size of the conventional approach plus one bit.
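The bound on the carry can also be checked numerically for small word lengths. The following sketch is only an illustration: the function name `max_carry_out` is hypothetical, W is assumed to be even, and the search enumerates every combination of the five W/2-bit halves while fixing the accumulated value and the carry-in at their maxima, which is sufficient because the sum grows monotonically in both.

```python
from itertools import product


def max_carry_out(W):
    """Largest carry-out over all W/2-bit operand halves, for even W."""
    h = W // 2
    half_values = range(1 << h)
    t_old_max = (1 << W) - 1          # t is W bits wide
    c_in_max = (1 << (h + 1)) - 1     # carry-in assumed to fit in (W/2)+1 bits
    worst = 0
    for a_lo, a_hi, a_prev_hi, b_lo, b_hi in product(half_values, repeat=5):
        total = (t_old_max + a_lo * b_lo + a_prev_hi * b_hi
                 + ((a_hi * b_lo + a_lo * b_hi) << h) + c_in_max)
        worst = max(worst, total >> W)
    return worst
```

Under these assumptions the worst case for W=4 is a carry of 7 and for W=6 a carry of 15, each one less than $2^{(W/2)+1}$, so a carry-in of (W/2)+1 bits never produces a wider carry-out.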

The gain from using the new split-radix multiplication method compared to the conventional technique can also be computed. Assume the word length is W bits. The conventional approach to partial product generation utilizes W×W multipliers. This results in a carry propagate chain of 2W adder cells. For the new split radix multiplier, the corresponding number of cells is W+W/2+1. Thus, the carry chain is reduced by the fraction

$$\text{carry chain reduction} = 1 - \frac{W + W/2 + 1}{2W} = \frac{1}{4} - \frac{1}{2W},$$
that is, approaching 25% for large values of W. Two examples of realistic scenarios are:

Example 1: 16×16 bit partial product. Reduction is 21.9%.

Example 2: 32×32 bit partial product. Reduction is 23.4%.
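The two example figures follow directly from the cell counts stated above, as this short sketch shows. The function name `carry_chain_reduction` is an assumption made for illustration, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction


def carry_chain_reduction(W):
    """Fractional reduction of the carry propagate chain for word length W."""
    conventional_cells = 2 * W            # W x W multiplier
    split_radix_cells = W + W // 2 + 1    # new split radix multiplier
    return 1 - Fraction(split_radix_cells, conventional_cells)
```

For W=16 this gives 7/32 (about 21.9%), and for W=32 it gives 15/64 (about 23.4%).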

For multipliers and adders, the carry propagate chain is equal to the critical path. Therefore, a shorter carry chain implies an implementation of lower cost in terms of delay, area, and power consumption.

It is not possible to make a general statement about the exact magnitude of the gain because of the wide range of different multiplication and addition schemes. However, for the common ripple-carry addition scheme, often used in multipliers, the reduced carry chain implies a linear reduction in delay and area, and a linear or even polynomial improvement in energy per operation.

The new split radix multiplication scheme can be further optimized by using distributed arithmetic and offset binary coding, as described in A. Croisier et al., U.S. Pat. No. 3,777,130, entitled “Digital filter for PCM encoded signals” (issued December 1973); A. Peled and B. Liu, “A new hardware realization of digital filter,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-22, no. 6, December 1974; and A. Berkeman et al., “A low logic depth complex multiplier using distributed arithmetic,” IEEE Journal of Solid-State Circuits, vol. 35, no. 4, pp. 656-659, April 2000, each of which is hereby incorporated herein by reference. Where the ai signals are offset binary coded, the bj signals may be recoded as follows:

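$$\begin{cases} b_j^{\sigma} = b_j^L + b_j^H \\ b_j^{\delta} = b_j^L - b_j^H. \end{cases}$$
The following identities are also utilized:
$$a_i^L b_j^L + a_i^H b_j^L = F(a_i^L, a_i^H, b_j^{\sigma}, b_j^{\delta}),$$
$$a_{i-1}^H b_j^H + a_i^L b_j^H = F(a_{i-1}^H, a_i^L, b_j^{\sigma}, b_j^{\delta})$$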
where the F( )-function denotes a second order distributed arithmetic multiplier with offset binary encoded input signals. Further information about this may be found in A. Berkeman and V. Öwall, “Co-Optimization of FFT and FIR in a Delayless Acoustic Echo Canceller Implementation,” Proceedings of ISCAS 2000, Geneva, Switzerland, May 2000. With this construction, the number of partial product bits for the four W/2-sized multipliers is reduced to approximately one half, with a corresponding decrease in the number of required adders in the multipliers. The recoding of the b-signals can be implemented either prior to the multiplication operation, or in the right-most partial product generator stage. A further advantage of the distributed arithmetic multipliers is that the adder 1315 with four inputs as shown in FIG. 13 is replaced by a more efficient three word input adder.

In yet another aspect of the invention, Montgomery's method of operating on data divided into words of W bits can be implemented in an algorithm that computes the product
$$p = a b r^{-1} \bmod n,$$
where $r = 2^{DW}$ is constant. The algorithm is as follows:

1: $n'_0 \leftarrow (-n^{-1}) \bmod 2^{W}$
2: $t \leftarrow 0$
3: for i = 0 to D − 1 do
4:  $c \leftarrow 0$
5:  for j = 0 to D − 1 do
6:   if j = 0 then
7:    $m_i \leftarrow (t_0 + a_0 b_i)\, n'_0 \bmod 2^{W}$
8:   end if
9:   $2^{W} c + t_j \leftarrow t_j + a_j b_i + n_j m_i + c$
10:  end for
11:  $t \leftarrow t / 2^{W}$
12: end for
13: if t ≥ n then
14:   p ← t − n
15: else
16:   p ← t
17: end if

For each iteration of the loop variable i, the least significant W bits are set to zero and the partial result is shifted W bits to the right.
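A word-level software sketch of the listing above is given below for illustration. It makes several assumptions that are not in the listing: the names `to_words` and `montgomery_multiply` are hypothetical, n is odd with a, b < n < $2^{DW}$, the carry left over after the inner loop is explicitly added into two extra top words of t, and the modular inverse uses the three-argument `pow` with a negative exponent (Python 3.8 or later).

```python
def to_words(x, W, D):
    """Split x into D words of W bits, least significant first."""
    mask = (1 << W) - 1
    return [(x >> (W * k)) & mask for k in range(D)]


def montgomery_multiply(a, b, n, W, D):
    """Return a*b*r^-1 mod n with r = 2**(D*W), for odd n and a, b < n < r."""
    mask = (1 << W) - 1
    a_w, b_w, n_w = (to_words(x, W, D) for x in (a, b, n))
    n0_prime = (-pow(n, -1, 1 << W)) & mask          # line 1
    t = [0] * (D + 2)                                # line 2, with two spare top words
    for i in range(D):                               # line 3
        c = 0                                        # line 4
        for j in range(D):                           # line 5
            if j == 0:                               # lines 6-8
                m_i = ((t[0] + a_w[0] * b_w[i]) * n0_prime) & mask
            s = t[j] + a_w[j] * b_w[i] + n_w[j] * m_i + c   # line 9
            t[j], c = s & mask, s >> W
        # assumption: fold the final carry into the spare top words
        s = t[D] + c
        t[D], t[D + 1] = s & mask, t[D + 1] + (s >> W)
        # t[0] is now zero, so dividing by 2**W drops it (line 11)
        t = t[1:] + [0]
    p = sum(w << (W * k) for k, w in enumerate(t))
    return p - n if p >= n else p                    # lines 13-17
```

A result can be checked against the defining formula with `(a * b * pow(1 << (D * W), -1, n)) % n`.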

FIG. 14 illustrates the division of the input words into W×W blocks that the algorithm works on. There are two different kinds of blocks: right-most (i.e., least significant) preprocessing blocks (identified by “pre-proc” in the figure) that perform the zeroing of the least significant word according to lines 6-9 of the algorithm; and “standard” Montgomery blocks (identified by “MPPG” in the figure) that compute only line 9 of the algorithm. The standard Montgomery blocks dominate the complexity of the algorithm.

The split radix Montgomery algorithm is created by dividing all computing blocks of FIG. 14 into sub-blocks, containing three or four radix-$2^{W/2}$ blocks (identified as “MsPPG” in the figure) as illustrated in FIG. 15. In the figure, the square brackets indicate that the designated data has to be read from or written to a memory. Other variables are passed between the consecutive steps and stored intermediately.

Using the notation $\tilde{n} = n'_0 \bmod 2^{W/2}$, with $a_j^L$, $a_j^H$ denoting the least and most significant W/2 bits of $a_j$, and so forth, the rightmost blocks of FIG. 15 compute the following operations:
MemRead: $t=t_0$, $a=a_0$, $n=n_0$, $b=b_i$
1: $(2^{W/2}c_0+s_0) \leftarrow t^L + a^L b^L$
2: $m^L \leftarrow s_0 \tilde{n} \bmod 2^{W/2}$
3: $(2^{W/2}c_0+s_0) \leftarrow (2^{W/2}c_0+s_0) + n^L m^L$
4: $(2^{W/2}c_1+s_1) \leftarrow t^H + a^H b^L + a^L b^H + n^H m^L + c_0$
5: $m^H \leftarrow s_1 \tilde{n} \bmod 2^{W/2}$
6: $(2^{W/2}c_1+s_1) \leftarrow (2^{W/2}c_1+s_1) + n^L m^H$
thereby zeroing the least W bits (corresponding to $t^H$ and $t^L$) and feeding a carry $c=c_1$, and the signals b and m to the next step. The remaining blocks then compute:
MemRead: $t=t_j$, $a=a_j$, $n=n_j$, c
1: $(2^{W/2}c_0+s_0) \leftarrow t^L + a^L b^L + {a'}^H b^H + n^L m^L + {n'}^H m^H + c$
2: $(2^{W/2}c_1+s_1) \leftarrow t^H + a^H b^L + a^L b^H + n^H m^L + n^L m^H + c_0$
MemWrite: $t_{j-1} = 2^{W/2} s_1 + s_0$
Between the blocks, a carry $c=c_1$ is propagated, as well as the values that are shared and propagated: ${a'}^H = a_{j-1}^H$, ${n'}^H = n_{j-1}^H$, $b_i$, and $m_i$.
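The block operations listed above can be exercised in software with the following sketch. It is an illustrative model only: the function name `split_radix_montgomery` and its helpers are hypothetical, W is assumed even, n is assumed odd with a, b < n < $2^{DW}$, and the step that folds the last upper-half products and carry into the top words of t after the final block is an assumption, since the text does not spell it out.

```python
def split_radix_montgomery(a, b, n, W, D):
    """Return a*b*2**(-D*W) mod n using W/2 x W/2 products only."""
    h = W // 2
    hmask, wmask = (1 << h) - 1, (1 << W) - 1

    def words(x):
        return [(x >> (W * k)) & wmask for k in range(D)]

    def halves(x):
        return x & hmask, x >> h                 # (low half, high half)

    a_w, b_w, n_w = words(a), words(b), words(n)
    n_tilde = (-pow(n, -1, 1 << h)) & hmask      # n'_0 mod 2^(W/2), Python 3.8+
    t = [0] * (D + 1)
    for i in range(D):
        b_lo, b_hi = halves(b_w[i])
        # rightmost block: zero the least significant word and derive m
        t_lo, t_hi = halves(t[0])
        a_lo, a_hi = halves(a_w[0])
        n_lo, n_hi = halves(n_w[0])
        low = t_lo + a_lo * b_lo                           # step 1
        m_lo = ((low & hmask) * n_tilde) & hmask           # step 2
        low += n_lo * m_lo                                 # step 3
        high = (t_hi + a_hi * b_lo + a_lo * b_hi
                + n_hi * m_lo + (low >> h))                # step 4
        m_hi = ((high & hmask) * n_tilde) & hmask          # step 5
        high += n_lo * m_hi                                # step 6
        c = high >> h
        prev_a_hi, prev_n_hi = a_hi, n_hi
        # remaining blocks: each result is written one word to the right
        for j in range(1, D):
            t_lo, t_hi = halves(t[j])
            a_lo, a_hi = halves(a_w[j])
            n_lo, n_hi = halves(n_w[j])
            low = (t_lo + a_lo * b_lo + prev_a_hi * b_hi
                   + n_lo * m_lo + prev_n_hi * m_hi + c)
            high = (t_hi + a_hi * b_lo + a_lo * b_hi
                    + n_hi * m_lo + n_lo * m_hi + (low >> h))
            t[j - 1] = ((high & hmask) << h) | (low & hmask)
            c = high >> h
            prev_a_hi, prev_n_hi = a_hi, n_hi
        # assumption: fold the leftover products and carry into the top words
        top = t[D] + prev_a_hi * b_hi + prev_n_hi * m_hi + c
        t[D - 1], t[D] = top & wmask, top >> W
    p = sum(w << (W * k) for k, w in enumerate(t))
    return p - n if p >= n else p
```

The result should agree with `(a * b * pow(1 << (D * W), -1, n)) % n` for the same operands, which gives a simple way to compare this model with a word-level Montgomery multiplication.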

Looking again at the operations performed in the non-rightmost blocks in FIG. 14, it can be seen that eight W/2×W/2 multiplications are required. One W×W multiplier is equivalent to four W/2×W/2 multipliers in terms of the number of partial product bits, but as seen before in the general case, the reordering shortens the length of the carry chain, and reduces the word length of the intermediate carry signal.

Applying distributed arithmetic and offset binary coding, denote

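$$\begin{cases} b_i^{\sigma} = b_i^H + b_i^L \\ b_i^{\delta} = b_i^H - b_i^L \end{cases} \quad \text{and} \quad \begin{cases} m_i^{\sigma} = m_i^H + m_i^L \\ m_i^{\delta} = m_i^H - m_i^L. \end{cases}$$
Then, the scalar products can be written more efficiently as
$$a^L b^L + {a'}^H b^H = F(a^L, {a'}^H, b^{\sigma}, b^{\delta})$$
$$n^L m^L + {n'}^H m^H = F(n^L, {n'}^H, m^{\sigma}, m^{\delta})$$
$$a^H b^L + a^L b^H = F(a^H, a^L, b^{\sigma}, b^{\delta})$$
$$n^H m^L + n^L m^H = F(n^H, n^L, m^{\sigma}, m^{\delta})$$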
where the F-function denotes a distributed arithmetic multiplier as before.

A further optimization step is found by generating all of the sums and differences of $b^{\sigma}$, $b^{\delta}$, $m^{\sigma}$, and $m^{\delta}$; in total 16 combinations. Actually, only half of them need be generated because the other half have the same absolute value but opposite sign. Now, only hardware corresponding to one size W/2×W/2 multiplier plus some multiplexers is required for the implementation, and the memory addressing scheme still remains the same.

The above-described strategy for multiplying numbers together has advantageous properties, including but not limited to the following:

1. The same addressing and word lengths are used as in the conventional approach. That is, since the input/output data word length is kept at W bits, the same number of partial products is generated as in the conventional approach, and accessing of data remains the same.

2. The carry propagate chain is shortened from 2W bits to 3W/2+2 bits. Having a shorter carry propagate chain means that the calculation and addition of a partial product will be faster than in the conventional approach. Furthermore, switching activity is reduced, and energy consumption is lower.

3. The intermediate carry signal is shortened from W bits to W/2+1 bits. Thus, fewer bits have to be propagated and added from one slice to a following slice. This helps increase speed and reduce energy consumption due to less switching activity.

4. The only additional requirement is intermediate forwarding of the W/2 bits of $a_{i-1}^H$.

5. The required hardware area for implementation of the new multiplication strategy will be the same or less than that required for the conventional approach due to the reduced word length of the carry signals.

6. Modified Booth encoding and similar devices are still applicable to the radix-$2^{W/2}$ multipliers, further reducing delay and area.

7. Distributed arithmetic and offset binary coding are an efficient means to reduce the delay and area of the split radix partial product generator.

8. Applications include efficient calculation of Montgomery multiplication, for use in, for example, cryptographic applications.

9. Applications, including cryptographic applications, may use the split-radix multiplication techniques “as is” (i.e., outside the context of Montgomery multiplication).

The invention has been described with reference to a particular embodiment. However, it will be readily apparent to those skilled in the art that it is possible to embody the invention in specific forms other than those of the preferred embodiment described above. This may be done without departing from the spirit of the invention.

For example, an embodiment was described above with respect to FIG. 13 in which a new PPG 1301 also accumulates sub-partial products of like significance from earlier operations. However, in alternative embodiments, it is possible for each PPG to merely generate
$$2^{W} c_{OUT} + p_i = a_i^L b_j^L + a_{i-1}^H b_j^H + 2^{W/2}(a_i^H b_j^L + a_i^L b_j^H) + c_{IN},$$
where $p_i$ is a W-bit wide lower partial product. In this embodiment, the accumulation is performed separately in one or more subsequent steps.

Also, the invention has been described in the context of multiplying two numbers together. However, it will be readily apparent that the same principles of split-radix multiplication may be applied to perform multiplication between more than two numbers. In such cases, sub-partial products are formed from the various operands and combined in the manner described above.

Thus, the preferred embodiment is merely illustrative and should not be considered restrictive in any way. The scope of the invention is given by the appended claims, rather than the preceding description, and all variations and equivalents that fall within the range of the claims are intended to be embraced therein.

Berkeman, Anders

Patent Priority Assignee Title
3777130,
4965762, Sep 15 1989 Motorola Inc. Mixed size radix recoded multiplier
5200912, Nov 19 1991 RPX Corporation Apparatus for providing power to selected portions of a multiplying device
5442799, Dec 16 1988 Mitsubishi Denki Kabushiki Kaisha Digital signal processor with high speed multiplier means for double data input
5586070, Aug 03 1994 ATI Technologies ULC Structure and method for embedding two small multipliers in a larger multiplier
5751622, Oct 10 1995 ATI Technologies ULC Structure and method for signed multiplication using large multiplier having two embedded signed multipliers
5764558, Aug 25 1995 International Business Machines Corporation Method and system for efficiently multiplying signed and unsigned variable width operands
5999959, Feb 18 1998 Maxtor Corporation Galois field multiplier
6286024, Sep 18 1997 Kabushiki Kaisha Toshiba High-efficiency multiplier and multiplying method
6523055, Jan 20 1999 LSI Logic Corporation Circuit and method for multiplying and accumulating the sum of two products in a single cycle
6915322, Nov 04 1998 DSP GROUP, LTD Multiplier capable of multiplication of large multiplicands and parallel multiplications of small multiplicands
7062526, Feb 18 2000 Texas Instruments Incorporated Microprocessor with rounding multiply instructions
7111155, May 12 1999 Analog Devices, Inc Digital signal processor computation core with input operand selection from operand bus for dual operations
20030018678,
WO176131,
Executed on | Assignor | Assignee | Conveyance | Frame/Reel/Doc
Nov 05 2003 | BERKEMAN, ANDERS | TELEFONAKTIEBOLAGET LM ERICSSON PUBL | ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS) | 0146850738 pdf
Nov 06 2003 | Telefonaktiebolaget L M Ericsson (publ) | (assignment on the face of the patent)
Date Maintenance Fee Events
Jul 08 2011M1551: Payment of Maintenance Fee, 4th Year, Large Entity.
Jul 08 2015M1552: Payment of Maintenance Fee, 8th Year, Large Entity.
Jul 08 2019M1553: Payment of Maintenance Fee, 12th Year, Large Entity.

