SIMD operation method and SIMD apparatus that implement SIMD operations without a large increase in the number of instructions

RE46277

An operation method has processing for applying a same type of operation in parallel to n m-bit operands to obtain n m-bit operation results executed on a computer. Here, n is an integer equal to or greater than 2 and m is an integer equal to or greater than 1. The operation method includes: an operation step of applying the type of operation to an n*m-bit provisional operand that is formed by concatenating the n m-bit operands, to obtain one n*m-bit provisional operation result, and generating correction information based on an effect had, by applying the operation, on each m bits of the provisional operation result from a bit that neighbors the m bits; and a correction step of correcting the provisional operation result in m-bit units with use of the correction information, to obtain the n m-bit operation results.
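
The two steps described above can be sketched in software. This is a minimal illustration rather than the patented hardware: it assumes dyadic addition as the operation and n = 4 lanes of m = 8 bits by default, and the name `simd_add` is hypothetical. It relies on a standard property of binary addition: bit k of `a ^ b ^ sum` is the carry that entered bit k.

```python
def simd_add(a: int, b: int, n: int = 4, m: int = 8) -> int:
    """Lane-wise add of n m-bit lanes packed into n*m-bit integers."""
    width_mask = (1 << (n * m)) - 1
    lane_mask = (1 << m) - 1

    # Operation step: one ordinary n*m-bit addition (provisional result).
    provisional = (a + b) & width_mask

    # Correction information: the carry that entered each bit position.
    # Only the lowest bit of lanes 1..n-1 is inspected below.
    carry_in = a ^ b ^ provisional

    result = 0
    for lane in range(n):
        shift = lane * m
        value = (provisional >> shift) & lane_mask
        if lane > 0 and (carry_in >> shift) & 1:
            # Correction step: remove the stray carry inside this lane only.
            value = (value - 1) & lane_mask
        result |= value << shift
    return result


# Lanes (low to high) FF+01, 80+80, FF+01, 01+00 each wrap independently.
print(hex(simd_add(0x01FF80FF, 0x00018001)))  # 0x1000000
```

The correction is applied per lane, modulo 2^m, so fixing one lane never disturbs its neighbour.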


Patent RE46277
Priority Nov 28 2001
Filed Jun 24 2009
Issued Jan 17 2017
Expiry Nov 26 2022
Inventors Suzuki, Ma…
Assg.orig SOCIONEXT …
Assg.curr SOCIONEXT …
Entity Large
Referenced by 0
References 31
Maint.: all paid

EXAMPLE 2

39. An SIMD (Single Instruction Multiple Data) processor that executes n operations for applying a same type of operation in parallel to n m-bit operands to obtain n operation results, n being an integer equal to or greater than 2 and m being an integer equal to or greater than 1,

the SIMD processor implementing:

an instruction set which includes an SIMD incrementing instruction for adding an m-bit-one to each of the n m-bit operands,

wherein the m-bit-one is a value of one expressed in an m-bit number, and

wherein n m-bit-ones are stored in a register.
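
As an illustration of a register holding n m-bit-ones, the sketch below builds that constant and adds it lane by lane. The values n = 4 and m = 8 and the names `splat`, `lanewise_add` and `simd_increment` are assumptions made for the example. The masking trick used to keep carries inside each lane is a common software idiom swapped in here for clarity; the claim itself does not prescribe it.

```python
def splat(value: int, n: int, m: int) -> int:
    """Replicate an m-bit value into each of n lanes."""
    out = 0
    for lane in range(n):
        out |= (value & ((1 << m) - 1)) << (lane * m)
    return out


def lanewise_add(x: int, y: int, n: int, m: int) -> int:
    """Add packed lanes modulo 2**m with no carry between lanes."""
    high = splat(1 << (m - 1), n, m)       # top bit of every lane
    low_sum = (x & ~high) + (y & ~high)    # cannot overflow out of a lane
    return low_sum ^ ((x ^ y) & high)      # fold the top bits back in


N, M = 4, 8                                # assumed lane layout
ONES = splat(1, N, M)                      # register of n m-bit-ones


def simd_increment(x: int) -> int:
    """Add the stored m-bit-one to every lane of x."""
    return lanewise_add(x, ONES, N, M)


print(hex(ONES))                           # 0x1010101
print(hex(simd_increment(0x00FF7F00)))     # 0x1008001
```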

46. An SIMD (Single Instruction Multiple Data) processor that executes n operations for applying a same type of operation in parallel to n m-bit operands to obtain n operation results, n being an integer equal to or greater than 2 and m being an integer equal to or greater than 1,

the SIMD processor implementing:

an instruction set which includes an SIMD decrementing instruction for subtracting an m-bit-one from each of the n m-bit operands,

wherein the m-bit-one is a value of one expressed in an m-bit number, and

wherein n m-bit-ones are stored in a register.

54. An SIMD (Single Instruction Multiple Data) processor that executes

(i) n operations for applying a same type of operation in parallel to n m-bit operands to obtain n operation results, and

(ii) n/2 operations for applying a same type of operation in parallel to n/2 m*2-bit operands to obtain n/2 operation results,

n being an integer equal to or greater than 4 and m being an integer equal to or greater than 1,

the SIMD processor implementing an instruction set which includes:

(a) a first SIMD incrementing instruction for adding a value of one to each of the n m-bit operands; and

(b) a second SIMD incrementing instruction for adding a value of one to each of the n/2 m*2-bit operands,

wherein the value of one is not designated by any operand in the first SIMD incrementing instruction and the second SIMD incrementing instruction.
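
A toy decoder can make the point about the implicit one concrete. In this sketch the 64-bit register width (n = 8, m = 8) and the mnemonics `vinc.b` and `vinc.h` are hypothetical, chosen only for illustration. Each instruction takes a single register value, and the lane width comes from the mnemonic alone.

```python
REGISTER_BITS = 64  # assumed register width: n = 8 lanes of m = 8 bits

# Hypothetical mnemonics mapped to lane widths (m bits and m*2 bits).
INCREMENT_LANE_BITS = {"vinc.b": 8, "vinc.h": 16}


def execute(mnemonic: str, register_value: int) -> int:
    """Add an implicit one to every lane selected by the mnemonic."""
    lane_bits = INCREMENT_LANE_BITS[mnemonic]
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, REGISTER_BITS, lane_bits):
        lane = (register_value >> shift) & lane_mask
        # The constant 1 is built into the instruction, not an operand.
        result |= ((lane + 1) & lane_mask) << shift
    return result


print(hex(execute("vinc.b", 0xFF)))  # 0x101010101010100
print(hex(execute("vinc.h", 0xFF)))  # 0x1000100010100
```

The same input produces different results under the two mnemonics because the lowest lane wraps at 8 bits in one case and carries into bit 8 in the other.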

56. An SIMD (Single Instruction Multiple Data) processor that executes

(i) n operations for applying a same type of operation in parallel to n m-bit operands to obtain n operation results, and

(ii) n/2 operations for applying a same type of operation in parallel to n/2 m*2-bit operands to obtain n/2 operation results,

n being an integer equal to or greater than 4 and m being an integer equal to or greater than 1,

the SIMD processor implementing an instruction set which includes:

(a) a first SIMD decrementing instruction for subtracting a value of one from each of the n m-bit operands; and

(b) a second SIMD decrementing instruction for subtracting a value of one from each of the n/2 m*2-bit operands,

wherein the value of one is not designated by any operand in the first SIMD decrementing instruction and the second SIMD decrementing instruction.

32. An SIMD (Single Instruction Multiple Data) processor that executes n operations for applying a same type of operation in parallel to n m-bit operands to obtain n operation results, n being an integer equal to or greater than 2 and m being an integer equal to or greater than 1,

the SIMD operation apparatus implementing:

an instruction set which includes an SIMD decrementing instruction for subtracting a value of one from each of the n m-bit operands regardless of values of the n m-bit operands,

wherein the value of one is not designated by any operand in the SIMD decrementing instruction,

wherein in subtracting the value of one from each of the n m-bit operands by the SIMD decrementing instruction, no carry is propagated from each m*L-th bit to corresponding m*L+1-th bit, L being an integer from 1 to n−1 and an LSB (least significant bit) being considered to be a first bit position, and

wherein subtracting the value of one from each of the n m-bit operands is performed by adding a value of minus one to each of the n m-bit operands.
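
The equivalence between subtracting one and adding minus one, with carries confined to each lane, can be checked with a short sketch. It assumes n = 4 and m = 8, and the name `simd_decrement` is hypothetical.

```python
def simd_decrement(x: int, n: int = 4, m: int = 8) -> int:
    """Subtract one from every m-bit lane by adding the m-bit value -1."""
    lane_mask = (1 << m) - 1
    minus_one = lane_mask  # m-bit two's-complement encoding of -1
    result = 0
    for lane in range(n):
        shift = lane * m
        value = (x >> shift) & lane_mask
        # Masking discards the carry out of the lane, so it never
        # reaches the lowest bit of the next lane.
        result |= ((value + minus_one) & lane_mask) << shift
    return result


x = 0x01000080
print(hex(simd_decrement(x)))        # 0xffff7f
print(hex((x - 1) & 0xFFFFFFFF))     # 0x100007f, the ordinary 32-bit decrement
```

The zero lanes wrap to 0xFF on their own, whereas the ordinary 32-bit decrement only touches the lowest lane here.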

27. An operation apparatus that executes (a) an existing operation that applies a predetermined type of operation to a first-bit-length operand, to obtain one first-bit-length operation result, and (b) an simd (Single instruction Multiple Data) operation used for applying n parallel operations that applies a predetermined type of operation in parallel to n second-bit-length operands, to obtain n second-bit-length operation results, n being an integer equal to or greater than 2,

the operation apparatus implementing:

an operation instruction for instructing application of the predetermined type of operation on one of (c) the first-bit-length operand, and (d) the plurality of second-bit-length operands concatenated and considered to be a first-bit-length operand; and

an simd correction instruction for instructing correction of an operation result of the operation instruction to an operation result of the simd operation, and

the operation apparatus comprising:

a storage unit storing the first-bit-length operation result, and correction information that is used in the correction;

a decoding unit decoding the operation instruction and the simd correction instruction used for applying n parallel operations; and

an execution unit,

(e) when the operation instruction is decoded, applying the predetermined type of operation to one of (i) the first-bit-length operand, and (ii) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain a first first-bit-length operation result, store the obtained first first-bit-length operation result in the storage unit, and generate correction information that corresponds to a difference between the first first-bit-length operation result and a second first-bit-length operation result that is the n second-bit-length operation results concatenated, and store the generated correction information in the storage unit, and

(f) when the simd correction instruction for applying n parallel operations is decoded, to correct the stored first first-bit-length operation result using the stored correction information, to obtain n simd operation second bit-length operation results.
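
The split between the operation instruction and the SIMD correction instruction can be modelled with a small class. This is a sketch under assumptions: a 32-bit datapath with n = 4 and m = 8, addition as the predetermined operation, and hypothetical names (`OperationApparatus`, `add`, `simd_correct`). The two attributes stand in for the storage unit.

```python
class OperationApparatus:
    """Toy model: n = 4 lanes of m = 8 bits in a 32-bit datapath (assumed)."""

    N, M = 4, 8

    def __init__(self) -> None:
        self.result = 0      # storage unit: first-bit-length operation result
        self.correction = 0  # storage unit: correction information

    def add(self, a: int, b: int) -> int:
        """Operation instruction: ordinary 32-bit add, plus bookkeeping."""
        width_mask = (1 << (self.N * self.M)) - 1
        self.result = (a + b) & width_mask
        carries = a ^ b ^ self.result  # carry that entered each bit
        boundaries = sum(1 << (lane * self.M) for lane in range(1, self.N))
        self.correction = carries & boundaries
        return self.result

    def simd_correct(self) -> int:
        """SIMD correction instruction: fix each lane independently."""
        lane_mask = (1 << self.M) - 1
        out = 0
        for lane in range(self.N):
            shift = lane * self.M
            value = (self.result >> shift) & lane_mask
            stray = (self.correction >> shift) & 1
            out |= ((value - stray) & lane_mask) << shift
        return out


cpu = OperationApparatus()
print(hex(cpu.add(0x000000FF, 0x00000001)))  # 0x100, usable as a plain 32-bit sum
print(hex(cpu.simd_correct()))               # 0x0, the four 8-bit sums
```

Code that wants the ordinary wide result simply never issues the correction, which is how one add instruction can serve both uses.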

15. An operation apparatus that when n is an integer equal to or greater than 2, and m is an integer equal to or greater than 1,

executes (a) an existing operation that applies a predetermined type of operation to one n*m-bit first-bit-length operand, to obtain one n*m-bit first-bit-length operation result, and (b) an SIMD (Single Instruction Multiple Data) operation that applies a predetermined type of operation in parallel to n m-bit second-bit-length operands, to obtain n m-bit second-bit-length operation results,

the operation apparatus implementing:

an simd correction instruction for instructing correction of an operation result of the operation instruction to an operation result of the simd operation, and

the operation apparatus comprising:

a storage unit storing the first-bit-length operation result, and correction information that is used in the correction;

a decoding unit decoding the operation instruction and the simd correction instruction used for applying n parallel operations; and

an execution unit,

(e) when the operation instruction is decoded, applies the predetermined type of operation to one of (i) the first-bit-length operand, and (ii) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain one n*m-bit first-bit-length operation result, store the obtained n*m-bit first-bit-length operation result in the storage unit, and generate correction information based on an effect had, by applying the predetermined type of operation, on each m bits of the n*m-bit first-bit-length operation result from a bit that neighbors the m bits, and store the generated correction information in the storage unit, and

(f) when the simd correction instruction used for applying n parallel operations is decoded, to correct the stored first-bit-length operation result in m-bit units using the stored correction information, to obtain the n second-bit-length operation results.

11. An operation method for having an operation apparatus execute (a) an existing operation that applies a predetermined type of operation to a first-bit-length operand, to obtain one first-bit-length operation result, and (b) an simd (Single instruction Multiple Data) operation used for applying n parallel operations that applies a predetermined type of operation in parallel to n second-bit-length operands to obtain n second-bit-length operation results, n being an integer equal to or greater than 2, the operation apparatus implementing:

an operation instruction for instructing application of the predetermined type of operation on one of (c) the first-bit-length operand, and (d) the plurality of second-bit-length operands concatenated and considered to be a first-bit-length operand; and

an simd correction instruction for instructing correction of an operation result of the operation instruction to an operation result of the simd operation, and

the operation apparatus comprising:

a storage unit storing the first-bit-length operation result, and correction information that is used in the correction, and

the operation method comprising:

a decoding step of decoding the operation instruction and the simd correction instruction used for applying n parallel operations; and

an execution step of,

(e) when the operation instruction is decoded, applying the predetermined type of operation to one of (i) the first-bit-length operand, and (ii) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain one first-bit-length operation result, storing the obtained first-bit-length operation result in the storage unit, and generating correction information that corresponds to a difference between the first-bit-length operation result and a first-bit-length operation result that is the n second-bit-length operation results concatenated, and storing the generated correction information in the storage unit, and

(f) when the simd correction instruction used for applying n parallel operations is decoded, correcting the stored first-bit-length operation result using the stored correction information, to obtain the n second-bit-length operation results,

wherein when executing the existing instruction, the operation instruction is decoded and an obtained first-bit-length operation result is considered to be an operation result of the existing operation, and

when executing the simd operation, the operation instruction is decoded, an obtained first-bit-length operation result is considered to be a provisional operation result, the simd operation is then decoded, and n second-bit-length operation results obtained by correcting the provisional operation result are considered to be an operation result of the simd operation.

1. An operation method for having an operation apparatus execute (a) an existing operation that applies a predetermined type of operation to one n*m-bit first-bit-length operand, to obtain one n*m-bit first-bit-length operation result, and (b) an SIMD (Single Instruction Multiple Data) operation used for applying n parallel operations that applies the predetermined type of operation in parallel to n m-bit second-bit-length operands to obtain n m-bit second-bit-length operation results, n being an integer equal to or greater than 2 and m being an integer equal to or greater than 1,

the operation apparatus implementing:

an simd correction instruction for instructing correction of an operation result of the operation instruction to an operation result of the simd operation,

the operation apparatus comprising:

a storage unit storing the first-bit-length operation result, and correction information that is used in the correction,

the operation method comprising:

a decoding step of decoding the operation instruction and the simd correction instruction used for applying n parallel operations; and

an execution step of,

(e) when the operation instruction is decoded, applying the predetermined type of operation to one of (i) the first-bit-length operand, and (ii) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain one first-bit-length operation result, storing the obtained first-bit-length operation result in the storage unit, and generating correction information based on an effect had, by applying the predetermined type of operation, on each m bits of the first-bit-length operation result from a bit that neighbors the m bits, and storing the generated correction information in the storage unit, and

(f) when the SIMD correction instruction used for applying n parallel operations is decoded, correcting the stored first-bit-length operation result in m-bit units using the stored correction information, to obtain the n second-bit-length operation results.

2. The operation method of claim 1,

wherein in the execution step, when the SIMD correction instruction used for applying n parallel operations is decoded, m least significant bits of the first-bit-length operation result are excluded from being corrected.

3. The operation method of claim 1, wherein the execution apparatus further executes the simd operation used for applying n/P parallel operations that applies the same predetermined type of operation in parallel to n/P m*P-bit third-bit-length operands to obtain n/P m*P-bit third-bit-length operation results, P being an integer equal to or greater than 2 and equal to or less than n/2,

the decoding step further decodes the simd correction instruction used for applying n/P parallel operations, and

the execution step,

(a) when the operation instruction is decoded, applies the predetermined type of operation to the first-bit-length operand, the first-bit-length operand being one of (i) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, and (ii) the n/P third-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain a first-bit-length operation result, stores the obtained first-bit-length operation result in the storage unit, generates the correction information based on an effect had, by applying the predetermined type of operation, on each m bits of the first-bit-length operation result from a bit that neighbors the m bits, and stores the generated correction information in the storage unit, and

(b) when the simd operation used for applying n/P parallel operations is decoded, corrects the stored first-bit-length operation result in m*P-bit units, using only parts of the stored correction information that correspond to an effect on each m*P-bit unit.

4. The operation method of claim 3,

wherein respective values of n, m and P are one of (a) n=8, m=8 and P=one of (i) 2, (ii) 4, and (iii) 2 and 4, and (b) n=4, m=16 and P=2.
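
How one set of stored correction information can serve several lane widths is easiest to see with the values n = 8, m = 8 and P = 2 or 4. The sketch below is illustrative only: it assumes addition as the operation, stores one carry flag per 8-bit boundary, and uses the hypothetical names `add_with_carry_flags` and `correct`. For wider lanes it reads only every P-th flag.

```python
N, M = 8, 8
WIDTH = N * M


def add_with_carry_flags(a: int, b: int):
    """Wide add; flag L (1..N-1) records a carry into lane L's lowest bit."""
    provisional = (a + b) & ((1 << WIDTH) - 1)
    carries = a ^ b ^ provisional
    flags = [0] + [(carries >> (lane * M)) & 1 for lane in range(1, N)]
    return provisional, flags


def correct(provisional: int, flags, p: int = 1) -> int:
    """Correct in M*p-bit units, reading only every p-th stored flag."""
    unit = M * p
    unit_mask = (1 << unit) - 1
    out = 0
    for index in range(N // p):
        shift = index * unit
        value = (provisional >> shift) & unit_mask
        # Flags at boundaries inside a wider lane are ignored, because
        # those carries are legitimate for that lane width.
        out |= ((value - flags[index * p]) & unit_mask) << shift
    return out


provisional, flags = add_with_carry_flags(0xFFFF, 0x0001)
print(hex(correct(provisional, flags, p=1)))  # 0xff00   (eight 8-bit lanes)
print(hex(correct(provisional, flags, p=2)))  # 0x0      (four 16-bit lanes)
print(hex(correct(provisional, flags, p=4)))  # 0x10000  (two 32-bit lanes)
```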

5. The operation method of claim 1,

wherein the predetermined type of operation is any one of a plurality of types of operations,

the execution step, when a least significant bit is considered to be a first bit,

(a) when the operation instruction is decoded, generates the correction information, in m-bit units, based on the predetermined type of operation and a carry from an m*L-th bit to an m*L+1-th bit according to the predetermined type of operation, the m*L+1-th bit having a value of one of (a) 0 or 1 and (b) 0 or −1, L being n integers from 0 to n−1, and

(b) when the SIMD correction instruction is decoded, performs, regardless of the predetermined type of the operation, one of (a) adding the stored correction information to the first-bit-length operation result in m-bit units, and (b) subtracting the correction information from the first-bit-length operation result in m-bit units, to obtain the n m-bit operation results.

6. The operation method of claim 5,

wherein the plurality of types of operations includes at least one of increment, decrement, dyadic add, and dyadic subtract,

the execution step, (a) when the operation instruction is decoded and the predetermined type is increment, increments the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value −1 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 0 in the correction information,

(b) when the operation instruction is decoded and the predetermined type is decrement, decrements the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 1 in the correction information,

(c) when the operation instruction is decoded and the predetermined type is dyadic add, adds a first first-bit-length operand and a second first-bit-length operand to obtain first-bit-length operation result, the first first-bit-length operand being formed by concatenating n second-bit-length operands, and the second first-bit-length operand being formed by concatenating n second-bit-length operands, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 1 in the correction information,

(d) when the operation instruction is decoded and the predetermined type is dyadic subtract, subtracts a second first-bit-length operand from a first first-bit-length operand to obtain a first-bit-length operation result, the first first-bit-length operand being formed by concatenating n second-bit-length operands, and the second first-bit-length operand being formed by concatenating n second-bit-length operands, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value −1 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 0 in the correction information, and

(e) when the simd correction instruction is decoded, subtracts, in m-bit units, the stored correction information from the first-bit-length operation result, to obtain the n second-bit-length operation results.
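
The encoding in (a) through (e) can be exercised with a small model. This sketch assumes n = 4 and m = 8 and reads "carry" as the carry of a full-width adder, with decrement performed as adding all ones and dyadic subtract as adding the complement plus one. That reading is an interpretation adopted for the example, and the names `adder`, `operate` and `simd_correct` are hypothetical.

```python
N, M = 4, 8
WIDTH_MASK = (1 << (N * M)) - 1
LANE_MASK = (1 << M) - 1


def adder(x: int, y: int, carry_in: int):
    """Full-width adder; also reports the carry entering lanes 1..N-1."""
    total = (x + y + carry_in) & WIDTH_MASK
    entered = x ^ y ^ total
    return total, [(entered >> (lane * M)) & 1 for lane in range(1, N)]


def operate(kind: str, a: int, b: int = 0):
    """Operation instruction: provisional result plus correction values."""
    if kind == "inc":
        total, carry = adder(a, 0, 1)
        info = [0 if c else -1 for c in carry]
    elif kind == "dec":
        total, carry = adder(a, WIDTH_MASK, 0)       # a + (-1)
        info = [1 if c else 0 for c in carry]
    elif kind == "add":
        total, carry = adder(a, b, 0)
        info = [1 if c else 0 for c in carry]
    elif kind == "sub":
        total, carry = adder(a, ~b & WIDTH_MASK, 1)  # a + ~b + 1
        info = [0 if c else -1 for c in carry]
    else:
        raise ValueError(kind)
    return total, info


def simd_correct(total: int, info) -> int:
    """Correction instruction: always subtract, whatever the operation was."""
    out = total & LANE_MASK  # the lowest lane needs no correction
    for lane, delta in enumerate(info, start=1):
        value = (total >> (lane * M)) & LANE_MASK
        out |= ((value - delta) & LANE_MASK) << (lane * M)
    return out


print(hex(simd_correct(*operate("inc", 0x000000FF))))              # 0x1010100
print(hex(simd_correct(*operate("dec", 0x01000080))))              # 0xffff7f
print(hex(simd_correct(*operate("add", 0x01FF80FF, 0x00018001))))  # 0x1000000
print(hex(simd_correct(*operate("sub", 0x01000000, 0x00000001))))  # 0x10000ff
```

Because the sign of each correction value already encodes the operation type, the correction step needs no knowledge of which operation produced the provisional result.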

7. The operation method of claim 5,

wherein the plurality of types of operations includes at least one of increment, decrement, dyadic add, and dyadic subtract,

the execution step, (a) when the operation instruction is decoded and the predetermined type is increment, increments the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 1 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 0 in the correction information,

(b) when the operation instruction is decoded and the predetermined type is decrement, decrements the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and each m*L+1-th bit position with a carry according to the operation being represented by a value −1 in the correction information,

(c) when the operation instruction is decoded and the predetermined type is dyadic add, adds a first first-bit-length operand and a second first-bit-length operand to obtain a first-bit-length operation result, the first first-bit-length operand being formed by concatenating n second-bit-length operands, and the second first-bit-length operand being formed by concatenating n second-bit-length operands, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value −1 in the correction information, and

(d) when the operation instruction is decoded and the predetermined type is dyadic subtract, subtracts the second first-bit-length operand from the first first-bit-length operand to obtain a first-bit-length operation result, the first first-bit-length operand being formed by concatenating n first-bit-length operands, and the second first-bit-length operand being formed by concatenating n second-bit-length operands, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 1 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 0 in the correction information, and

when the simd correction instruction for n parallel operations is decoded, subtracts, in m-bit units, the stored correction information from the stored first-bit-length operation result, to obtain the n second-bit-length operation results.

8. The operation method of claim 1,

wherein the predetermined type of operation is any one of a plurality of types of operations,

the execution step further stores type information showing the type, and

when the simd correction instruction used for applying n parallel operations is decoded, corrects the first-bit-length operation result according to the stored type information.

9. The operation method of claim 8,

wherein when the operation instruction is decoded, the execution step generates, as the correction information, information showing whether there is a carry from lower bits to corresponding higher bits.

10. The operation method of claim 9,

wherein the plurality of types of operations includes at least one of increment, decrement, dyadic add, and dyadic subtract,

the execution step, in order to obtain the n second-bit-length operation results, where L is n integers from 0 to n−1, and when a least significant bit is considered to be a first bit,

(a) when the stored type information shows one of increment and dyadic add, adds 1 to each m*L+1-th bit without a carry in the provisional operation result, based on the generated correction information, and

(b) when the stored type information shows one of decrement and dyadic subtract, subtracts 1 from each m*L+1-th bit with a carry in the provisional calculation result, based on the generated correction information.

12. The operation method of claim 11, wherein

when m is an integer equal to or greater than 1, the first-bit-length operand is n*m bits in length, each second-bit-length operand is m bits in length, and each second-bit-length operation result is m*2 bits in length,

the execution step applies the predetermined type of operation to one of (a) the first-bit-length operand and (b) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain the n*m*2-bit first-bit-length operation result, stores the obtained first-bit-length operation result in the storage unit, generates correction information based on an effect had, by applying the operation, on each m*2 bits of the first-bit-length operation result from another m*2 bits, and stores the generated correction information in the storage unit.

13. The operation method of claim 12, further having the execution apparatus execute the SIMD operation used for applying n/P parallel operations that applies the same predetermined type of operation in parallel to n/P m*P-bit third-bit-length operands, to obtain n/P m*P*2-bit third-bit-length operation results, P being an integer equal to or greater than 2 and equal to or less than n/2,

wherein the decoding step further decodes the simd correction instruction used for applying n/P parallel operations,

the execution step, (a) when the operation instruction is decoded, applies the predetermined type of operation to the first-bit-length operand, the first-bit-length operand being one of (i) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, and (ii) the n/P third-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain one n*m*2-bit first-bit-length operation result, stores the obtained first-bit-length operation result in the storage unit,

where L is n−1 integers from 0 to n−1 and when a least significant bit is considered to be a first bit, generates first correction information, for each m*2 bits, based on n−1 effects by the type of operation between (iii) the m*2*L-th bit and lower bits and (iv) the m*2*L+1 bit position and higher bits, and generates second correction information, for each m*2*P bits, based on n/P−1 effects by the type of operation between (v) the m*2*P*L-th bit and lower bits and (vi) the m*2*L+1-th bit and higher bits, and stores the first correction information and second correction information in the storage unit,

(b) when the simd correction instruction used for applying n parallel operations is decoded, corrects the stored first-bit-length operation result with use of the stored first correction information, and

(c) when the simd correction instruction used for applying n/P parallel operations is decoded, corrects the first-bit-length operation result with use of the stored second correction information.

14. The operation method of claim 13,

wherein n=8, m=4, P=2, and the predetermined type of operation is multiply.

16. The operation apparatus of claim 15, wherein

the execution unit, when the simd correction instruction used for applying n parallel operations is decoded, excludes m least significant bits of the first-bit-length operation result from being corrected.

17. The operation apparatus of claim 15, further executing the simd operation used for applying n/P parallel operations that applies the same predetermined type of operation in parallel to n/P m*P-bit third-bit-length operands, to obtain n/P m*P-bit third-bit-length operation results, P being an integer equal to or greater than 2 and equal to or less than n/2, wherein the decoding unit further decodes the simd correction instruction used for applying n/P parallel operations, and

the execution unit,

(a) when the operation instruction is decoded, applies the predetermined type of operation to the first-bit-length operand, the first-bit-length operand being one of (i) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, and (ii) the n/P third-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain a first-bit-length operation result, stores the obtained first-bit-length operation result in the storage unit, generates the correction information based on an effect had, by applying the operation, on each m bits of the first-bit-length operation result from a bit that neighbors the m bits, and stores the generated correction information in the storage unit, and

18. The operation apparatus of claim 17,

wherein respective values of n, m and P are one of (a) n=8, m=8 and P=one of (i) 2, (ii) 4, and (iii) 2 and 4, and (b) n=4, m=16 and P=2.

19. The operation apparatus of claim 15,

wherein the type of operation is any one of a plurality of types of operations, and

the execution unit, where L is n integers from 0 to n−1, when a least significant bit is considered to be a first bit, and when the operation instruction is decoded, generates correction information, in m-bit units, based on the type of operation and a carry from an m*L-th bit to an m*L+1-th bit according to the operation, the correction information showing for m bits the value of the m*L+1-th bit as one of (a) 0 or 1 and (b) 0 or −1, and when the SIMD correction instruction is decoded, performs, regardless of the type of the operation, one of (a) adding the stored correction information to the first-bit-length operation result in m-bit units, and (b) subtracting the correction information from the first-bit-length operation result in m-bit units.

20. The operation apparatus of claim 19,

wherein the plurality of types of operations includes at least one of increment, decrement, dyadic add, and dyadic subtract, and

the execution unit (a) when the operation instruction is decoded and the predetermined type is increment, increments the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value −1 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 0 in the correction information,

(b) when the operation instruction is decoded and the predetermined type is decrement, decrements the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 1 in the correction information,

(c) when the operation instruction is decoded and the predetermined type is dyadic add, adds a first first-bit-length operand and a second first-bit-length operand to obtain a first-bit-length operation result, the first first-bit-length operand being formed by concatenating n m-bit operands, and the second first-bit-length operand being formed by concatenating n m-bit operands, and generates correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 1 in the correction information,

(d) when the operation instruction is decoded and the predetermined type is dyadic subtract, subtracts a second first-bit-length operand from a first first-bit-length operand to obtain a first-bit-length operation result, the first first-bit-length operand being formed by concatenating n m-bit operands, and the second first-bit-length operand being formed by concatenating n m-bit operands, and generates correction information, each m*L+1-th bit without a carry according to the operation being represented by a value −1 in the correction information, and each m*L+1-th bit position with a carry according to the operation being represented by a value 0 in the correction information, and

(e) when the simd correction instruction is decoded, subtracts, in m-bit units, the stored correction information from the stored first-bit-length operation result, to obtain the n second-bit-length operation results.

21. The operation apparatus of claim 19,

wherein the plurality of types of operations includes at least one of increment, decrement, dyadic add, and dyadic subtract, and

the execution unit (a) when the operation instruction is decoded and the predetermined type is increment, increments the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 1 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value 0 in the correction information,

(b) when the operation instruction is decoded and the predetermined type is decrement, decrements the first-bit-length operand, to obtain a first-bit-length operation result, and generates the correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and each m*L+1-th bit with a carry according to the operation being represented by a value −1 in the correction information,

(c) when the operation instruction is decoded and the predetermined type is dyadic add, adds a first first-bit-length operand and a second first-bit-length operand to obtain a first-bit-length operation result, the first first-bit-length operand being formed by concatenating n m-bit operands, and the second first-bit-length operand being formed by concatenating n m-bit operands, and generates correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 0 in the correction information, and m*L+1-th bit with a carry according to the operation being represented by a value −1 in the correction information,

(d) when the operation instruction is decoded and the predetermined type is dyadic subtract, subtracts a second first-bit-length operand from a first first-bit-length operand to obtain a first-bit-length operation result, the first first-bit-length operand being formed by concatenating n m-bit operands, and the second first bit-length operand being formed by concatenating n m-bit operands, and generates correction information, each m*L+1-th bit without a carry according to the operation being represented by a value 1, and each m*L+1-th bit position with a carry according to the operation being represented by a value 0 in the correction information, and

(e) when the simd correction instruction for n parallel operations is decoded, subtracts, in m-bit units, the stored correction information from the stored first-bit-length operation result, to obtain the n second-bit-length operation results.

22. The operation apparatus of claim 15,

wherein the type of operation is any one of a plurality of types of operations,

the execution unit further stores type information showing the predetermined type, and, when the simd correction instruction used for applying n parallel operations is decoded, corrects the stored first-bit-length operation result according to the stored type.

23. The operation apparatus of claim 22,

wherein, when the operation instruction is decoded, the execution unit generates, as the correction information, information showing whether there is a carry from lower bits to corresponding higher bits.

24. The operation apparatus of claim 23,

wherein the plurality of types of operations includes at least one of increment, decrement, dyadic add, and dyadic subtract,

the execution unit, in order to obtain the n m-bit operation results, where L is n integers from 0 to N−1, and when a least significant bit is considered to be a first bit,

25. The operation apparatus of claim 22, further comprising:

a saving unit when an interrupt is received or when switching to another context, saving contents stored in the storage unit to a storage apparatus that is external to the operation apparatus; and

a restoration unit when returning from the interrupt or switching back to an original context, restoring the saved contents to the storage unit.

26. The operation apparatus of claim 15, further comprising:

a saving unit when an interrupt is received or when switching to another context, saving contents stored in the storage unit to a storage apparatus that is external to the operation apparatus; and

a restoration unit when returning from the interrupt or switching back to an original context, restoring the saved contents to the storage unit.

28. The operation apparatus of claim 27,

wherein, when m is an integer equal to or greater than 1, the first-bit-length operand is n*m bits in length, each second-bit-length operand is m bits, the first first-bit-length operation result is n*m*2 bits in length, and each second-bit-length operation result is m*2 bits in length, and

the execution unit applies the predetermined type of operation to one of (a) the first-bit-length operand and (b) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain an n*m*2-bit first-bit-length operation result, stores the obtained first-bit-length operation result in the storage unit, generates correction information based on an effect had, by applying the predetermined type of operation, on each m*2 bits of the first-bit-length operation result, from other m*2 bits, and stores the correction information in the storage unit.

29. The operation apparatus of claim 28, further executing the simd operation used for applying n/P parallel operations that applies the same predetermined type of operation in parallel to n/P, m*P-bit third-bit-length operands, to obtain n/P m*P*2-bit third-bit-length operation results, P being an integer equal to or greater than 2 and equal to or less than n/2,

wherein the decoding unit further decodes the simd correction instruction used for applying n/P parallel operations,

the execution unit, (a) when the operation instruction is decoded, applies the predetermined type of operation to the first-bit-length operand, the first-bit-length operand being one of (i) the n second-bit-length operands concatenated and considered to be a first-bit-length operand, and (ii) the n/P third-bit-length operands concatenated and considered to be a first-bit-length operand, to obtain one n*m*2-bit first-bit-length operation result, stores the obtained first-bit-length operation result in the storage unit,

where L is N−1 integers from 0 to N−1, and when a least significant bit is considered to be a first bit, generates first correction information, for each m*2 bits, based on N−1 effects by the predetermined type of operation between (iii) the m*2*L-th bit and lower bits and (iv) the m*2*L+1-th bit position and higher bits, and generates second correction information, for each m*2*P bits, based on n/P−1 effects by the predetermined type of operation between (v) the m*2*P*L-th bit and lower bits and (vi) the m*2*L+1-th bit and higher bits, and stores the first correction information and second correction information in the storage unit,

(c) when the simd correction instruction used for applying n/P parallel operations is decoded, corrects the stored first-bit-length operation result with use of the stored second correction information.

30. The operation apparatus of claim 29,

wherein N=8, M=4, P=2, and the type of operation is multiply.

31. The operation apparatus of claim 27, further comprising:

a saving unit when an interrupt is received or when switching to another context, saving contents stored in the storage unit to a storage apparatus that is external to the operation apparatus; and

a restoration unit when returning from the interrupt or switching back to an original context, restoring the saved contents to the storage unit.

0. 33. The simd processor of claim 32, comprising:

a decoding unit for decoding the simd decrementing instruction; and

an execution unit, when the decoding unit decodes the simd decrementing instruction, for subtracting the value of one from each of n m-bit operands designated by the simd decrementing instruction.

0. 34. The simd processor of claim 33, wherein the simd decrementing instruction is designated in a software program.

0. 35. The simd processor of claim 32, wherein the simd decrementing instruction is designated in a software program.

0. 36. The simd processor of claim 32, wherein the simd decrementing instruction is:

(a) an instruction for decrementing each value of n pieces of 8-bit data in the operands in parallel,

(b) an instruction for decrementing each value of n pieces of 16-bit data in the operands in parallel,

(d) an instruction for decrementing each value of n pieces of 64-bit data in the operands in parallel.

0. 37. The simd processor of claim 32, wherein:

the instruction set further includes an simd subtraction instruction, and

the simd decrementing instruction and the simd subtraction instruction are different from each other.

0. 38. The simd processor of claim 32, wherein the simd processor is a microprocessor composed on a single semiconductor chip.

0. 40. The simd processor of claim 39, comprising:

a decoding unit for decoding the simd incrementing instruction; and

an execution unit, when the decoding unit decodes the simd incrementing instruction, for adding the m-bit-one to each of the n m-bit operands designated by the simd incrementing instruction.

0. 41. The simd processor of claim 40, wherein the simd incrementing instruction is designated in a software program.

0. 42. The simd processor of claim 39, wherein in adding the m-bit-one to each of the n m-bit operands by the simd incrementing instruction, no carry is propagated from each m*L-th bit to corresponding m*L+1-th bit, L being an integer from 1 to N−1 and an lsb (least significant bit) being considered to be a first bit position.

0. 43. The simd processor of claim 39, wherein the simd incrementing instruction is designated in a software program.

0. 44. The simd processor of claim 39, wherein the simd incrementing instruction is:

(a) an instruction for incrementing each value of n pieces of 8-bit data in the operands in parallel,

(b) an instruction for incrementing each value of n pieces of 16-bit data in the operands in parallel,

(d) an instruction for incrementing each value of n pieces of 64-bit data in the operands in parallel.

0. 45. The simd processor of claim 39, wherein:

the instruction set further includes an simd add instruction, and

the simd incrementing instruction and the simd add instruction are different from each other.

0. 47. The simd processor of claim 46, comprising:

a decoding unit for decoding the simd decrementing instruction; and

an execution unit, when the decoding unit decodes the simd decrementing instruction, for subtracting the m-bit-one from each of n m-bit operands designated by the simd decrementing instruction.

0. 48. The simd processor of claim 47, wherein the simd decrementing instruction is designated in a software program.

0. 49. The simd processor of claim 46, wherein in subtracting the m-bit-one from each of the n m-bit operands by the simd decrementing instruction, no carry is propagated from each m*L-th bit to corresponding m*L+1-th bit, L being an integer from 1 to N−1 and an lsb (least significant bit) being considered to be a first bit position.

0. 50. The simd processor of claim 49, wherein subtracting the m-bit-one from each of the n m-bit operands is performed by adding a value of minus one to each of the n m-bit operands.

0. 51. The simd processor of claim 46, wherein the simd decrementing instruction is designated in a software program.

0. 52. The simd processor of claim 46, wherein the simd decrementing instruction is:

(a) an instruction for decrementing each value of n pieces of 8-bit data in the operands in parallel,

(b) an instruction for decrementing each value of n pieces of 16-bit data in the operands in parallel,

(d) an instruction for decrementing each value of n pieces of 64-bit data in the operands in parallel.

0. 53. The simd processor of claim 46, wherein:

the instruction set further includes an simd subtraction instruction, and

the simd decrementing instruction and the simd subtraction instruction are different from each other.

0. 55. The simd processor of claim 54, wherein in adding the value of one to each of the n/2 m*2-bit operands by the second simd incrementing instruction, no carry is propagated from each m*2*K-th bit to corresponding m*2*K+1-th bit, K being an integer from 1 to n/2−1, whereas a carry propagation from each m*J-th bit to corresponding m*J+1-th bit is performed, J being an odd integer from 1 to N−1.

0. 57. The simd processor of claim 56, wherein in subtracting the value of one from each of the n/2 m*2-bit operands by the second simd decrementing instruction, no carry is propagated from each m*2*K-th bit to corresponding m*2*K+1-th bit, K being an integer from 1 to n/2−1, whereas a carry propagation from each m*J-th bit to corresponding m*J+1-th bit is performed, J being an odd integer from 1 to N−1.

0. 58. The simd processor of claim 57, wherein subtracting the value of one from each of the n m-bit operands is performed by adding a value of minus one to each of the n m-bit operands and subtracting the value of one from each of the n/2 m*2-bit operands is performed by adding the value of minus one to each of the n/2 m*2-bit operands.

The ALU 34 masks bits other than the least significant 16 bits of the contents “0x0000000088888888” of the AR 32:
“0x0000000088888888” and “0x000000000000FFFF”=“0x0000000000008888” <<2>>.

The ALU 34 then multiplies <<1>> and <<2>>:

   0x0000000012340000
*) 0x0000000000008888
- - - - - - - - - - -
   0x0000000091A00000 ← “0x12340000” * “0x00000008”
   0x000000091A000000 ← “0x12340000” * “0x00000080”
   0x00000091A0000000 ← “0x12340000” * “0x00000800”
+) 0x0000091A00000000 ← “0x12340000” * “0x00008000”
- - - - - - - - - - -
   0x000009B54BA00000 <<3>>.

The ALU 34 then masks bits other than the least significant 16 bits of the contents “0x0000000012345678” of the BR 33:
“0x0000000012345678” and “0x000000000000FFFF”=“0x0000000000005678” <<4>>.

The ALU 34 then masks the least significant 16 bits of the contents “0x0000000088888888” of the AR 32:
“0x0000000088888888” and “0xFFFFFFFFFFFF0000”=“0x0000000088880000” <<5>>.

The ALU 34 then multiplies <<4>> and <<5>>:

   0x0000000000005678
*) 0x0000000088880000
- - - - - - - - - - -
   0x00000002B3C00000 ← “0x00005678” * “0x00080000”
   0x0000002B3C000000 ← “0x00005678” * “0x00800000”
   0x000002B3C0000000 ← “0x00005678” * “0x08000000”
+) 0x00002B3C00000000 ← “0x00005678” * “0x80000000”
- - - - - - - - - - -
   0x00002E1DAFC00000 <<6>>.

The ALU 34 then adds <<3>> and <<6>>:

   0x000009B54BA00000
+) 0x00002E1DAFC00000
- - - - - - - - - - -
   0x000037D2FB600000 <<7>>.

<<7>> is the 16-bit correction data. The ALU 34 stores the 16-bit correction data in the CR 36.

The ALU 34 includes an operation device for executing calculations such as those described above.

The following shows an example of a method for calculating the 8-bit correction data.

The ALU 34 masks the least significant 8 bits of the contents “0x0000000012345678” of the BR 33:
“0x0000000012345678” and “0xFFFFFFFFFFFFFF00”=“0x0000000012345600” <<8>>.

The ALU 34 masks bits other than the least significant 8 bits of the contents “0x0000000088888888” of the AR 32:
“0x0000000088888888” and “0x00000000000000FF”=“0x0000000000000088” <<9>>.

The ALU 34 then multiplies <<8>> and <<9>>:

   0x0000000012345600
*) 0x0000000000000088
- - - - - - - - - - -
   0x0000000091A2B000 ← “0x12345600” * “0x00000008”
+) 0x000000091A2B0000 ← “0x12345600” * “0x00000080”
- - - - - - - - - - -
   0x00000009ABCDB000 <<10>>.

The ALU 34 then masks the 8th to 15th least significant bits of the contents “0x0000000012345678” of the BR 33:
“0x0000000012345678” and “0xFFFFFFFFFFFF00FF”=“0x0000000012340078” <<11>>.

The ALU 34 then masks bits other than the 8th to 15th least significant bits of the contents “0x0000000088888888” of the AR 32:
“0x0000000088888888” and “0x000000000000FF00”=“0x0000000000008800” <<12>>.

The ALU 34 then multiplies <<11>> and <<12>>:

(BR)    0x0000000012340078
(AR) *) 0x0000000000008800
- - - - - - - - - - -
   0x00000091A003C000 ← “0x12340078” * “0x00000800”
+) 0x0000091A003C0000 ← “0x12340078” * “0x00008000”
- - - - - - - - - - -
   0x000009ABA03FC000 <<13>>.

The ALU 34 then masks the 16th to 23rd least significant bits of the contents “0x0000000012345678” of the BR 33:
“0x0000000012345678” and “0xFFFFFFFFFF00FFFF”=“0x0000000012005678” <<14>>.

The ALU 34 then masks bits other than the 16th to 23rd least significant bits of the contents “0x0000000088888888” of the AR 32:
“0x0000000088888888” and “0x0000000000FF0000”=“0x0000000000880000” <<15>>.

The ALU 34 then multiplies <<14>> and <<15>>:

(BR)    0x0000000012005678
(AR) *) 0x0000000000880000
- - - - - - - - - - -
   0x00009002B3C00000 ← “0x12005678” * “0x00080000”
+) 0x0009002B3C000000 ← “0x12005678” * “0x00800000”
- - - - - - - - - - -
   0x0009902DEFC00000 <<16>>.

The ALU 34 then masks the 24th to 31st least significant bits of the contents “0x0000000012345678” of the BR 33:
“0x0000000012345678” and “0xFFFFFFFF00FFFFFF”=“0x0000000000345678” <<17>>.

The ALU 34 then masks bits other than the 24th to 31st least significant bits of the contents “0x0000000088888888” of the AR 32:
“0x0000000088888888” and “0x00000000FF000000”=“0x0000000088000000” <<18>>.

The ALU 34 then multiplies <<17>> and <<18>>:

(BR)    0x0000000000345678
(AR) *) 0x0000000088000000
- - - - - - - - - - -
   0x0001A2B3C0000000 ← “0x00345678” * “0x08000000”
+) 0x001A2B3C00000000 ← “0x00345678” * “0x80000000”
- - - - - - - - - - -
   0x001BCDEFC0000000 <<19>>.

The ALU 34 adds <<10>>, <<13>>, <<16>> and <<19>>:

   0x00000009ABCDB000
   0x000009ABA03FC000
   0x0009902DEFC00000
+) 0x001BCDEFC0000000
- - - - - - - - - - -
   0x002567D2FBCD7000 <<20>>.

Here, <<20>> is the 8-bit correction data. The 8-bit correction data is stored in the CR 35.
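The mask-and-multiply steps <<8>> to <<20>> follow one pattern: for each lane, the AR 32 contents restricted to that lane are multiplied by the BR 33 contents with that lane removed. A minimal sketch of that pattern, assuming a hypothetical helper `correction` parameterised by lane width, might look as follows. It is a software illustration, not a description of the operation device in the ALU 34.

```python
# Illustrative sketch only: `correction` is a hypothetical helper name.
def correction(ar: int, br: int, lane_bits: int, lanes: int) -> int:
    """Sum, over all lanes, of (BR without lane i) * (AR restricted to lane i)."""
    total = 0
    for i in range(lanes):
        lane_mask = ((1 << lane_bits) - 1) << (i * lane_bits)
        # br & ~lane_mask clears lane i of BR; ar & lane_mask keeps only lane i of AR.
        total += (br & ~lane_mask) * (ar & lane_mask)
    return total

AR = 0x0000000088888888
BR = 0x0000000012345678

# Four 8-bit lanes give <<20>>; two 16-bit lanes give <<7>>.
assert correction(AR, BR, 8, 4) == 0x002567D2FBCD7000
assert correction(AR, BR, 16, 2) == 0x000037D2FB600000
```

The same loop yields both the 8-bit and the 16-bit correction data, which matches the text's statement that both are generated during the one multiply.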

The ALU 34 includes an operation device for executing calculations such as those described above.

DEC stage: SIMD8 instruction.

The SIMD correction instruction “SIMD8 D1” stored in the IR 2 is decoded by the DEC 31. The result of decoding shows that four parallel 8-bit SIMD correction operations are to be executed. Based on the decoding, the contents “0x09B58373297DAFC0” written to the D1 register in the EX stage are read and stored in the AR 32, and the 8-bit correction data “0x002567D2FBCD7000”, which is the contents of the CR 35 written in the EX stage, is read and stored in the BR 33.

IF stage: not relevant.

(4) Operation Timing 4 (Operation timing 4+α when the previous operation timing is operation timing 3 to 3+α.)

EX stage: SIMD8 instruction.

Based on the result of decoding by the DEC 31 in operation timing 3 (or operation timing 3+α), the ALU 34 performs a subtraction operation to subtract B input from A input, using the contents of the AR 32 as the A input and the contents of the BR 33 as the B input, and stores the operation result “0x09901BA02DB03FC0” in the D1 register.

   0x09B58373297DAFC0
−) 0x002567D2FBCD7000
- - - - - - - - - - -
   0x09901BA02DB03FC0

When divided into four pieces of 16-bit data: “0x0990”, “0x1BA0”, “0x2DB0” and “0x3FC0”, this operation result “0x09901BA02DB03FC0” is the SIMD operation result obtained if the contents pre-stored in the D0 register are considered to be four pieces of 8-bit data, and the pieces of 8-bit data are dyadic multiplied with corresponding unsigned pieces of 8-bit data.

The following shows this SIMD operation result.

(D0)    0x0012    0x0034    0x0056    0x0078
(D1) *) 0x0088 *) 0x0088 *) 0x0088 *) 0x0088
- - - - - - - - - - - - - - - - - - - - - -
        0x0090    0x01A0    0x02B0    0x03C0
     +) 0x0900 +) 0x1A00 +) 0x2B00 +) 0x3C00
- - - - - - - - - - - - - - - - - - - - - -
(D1)    0x0990    0x1BA0    0x2DB0    0x3FC0

DEC stage: not relevant.

IF stage: not relevant.
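The result of Example 1 can be checked numerically. The sketch below takes the product and the 8-bit correction data quoted in the text, performs the subtraction of the SIMD8 instruction, and compares each 16-bit piece with an ordinary byte-by-byte multiplication. The helper names `bytes_of` and `words_of` are assumptions made for this illustration.

```python
# Illustrative check only: helper names are hypothetical.
PRODUCT = 0x09B58373297DAFC0      # result of "MUL D0, D1" quoted in the text
CORRECTION8 = 0x002567D2FBCD7000  # 8-bit correction data <<20>> quoted in the text

def bytes_of(value: int, count: int) -> list:
    """Split value into `count` 8-bit pieces, most significant first."""
    return [(value >> (8 * i)) & 0xFF for i in reversed(range(count))]

def words_of(value: int, count: int) -> list:
    """Split value into `count` 16-bit pieces, most significant first."""
    return [(value >> (16 * i)) & 0xFFFF for i in reversed(range(count))]

# SIMD8: subtract the correction data from the full product.
corrected = (PRODUCT - CORRECTION8) & 0xFFFFFFFFFFFFFFFF

# Reference: multiply each byte of D0 by the corresponding byte of D1.
lanewise = [a * b for a, b in zip(bytes_of(0x12345678, 4), bytes_of(0x88888888, 4))]

assert corrected == 0x09901BA02DB03FC0
assert words_of(corrected, 4) == lanewise
assert lanewise == [0x0990, 0x1BA0, 0x2DB0, 0x3FC0]
```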

EXAMPLE 2

The following describes an operation example of a 16*2 SIMD dyadic multiply operation for dyadic multiplying two pieces of 16-bit data respectively with another two pieces of 16-bit data.

If the first two pieces of 16-bit data to be multiplied are stored concatenated in the lower part of the D0 register, and the second two pieces of 16-bit data to be multiplied are stored in the lower part of the D1 register, the 16*2 SIMD dyadic multiply operation is realized by the following two instructions.

MUL D0, D1

SIMD16 D1

Here, supposing that the first two pieces of 16-bit data are “0x1234” and “0x5678”, and that the second two pieces of 16-bit data are both “0x8888”, “0x0000000012345678” is pre-stored in the D0 register, and “0x0000000088888888” is pre-stored in the D1 register.

FIG. 15 shows the contents of the registers in the 16*2 SIMD dyadic multiply operation.

(1) Operation Timing 1

EX stage: not relevant.

DEC stage: not relevant.

IF stage: MUL instruction.

The dyadic multiply instruction “MUL D0, D1” is fetched from the ROM 1, and stored in the IR 2.

(2) Operation Timing 2

EX stage: not relevant.

DEC stage: MUL instruction.

The dyadic multiply instruction “MUL D0, D1” stored in the IR 2 is decoded by the DEC 3. The result of decoding shows that a 32-bit data dyadic multiply operation is to be executed. Based on the decoding, the contents “0x0000000012345678” of the D0 register are read and stored in the BR 33, and the contents “0x0000000088888888” of the D1 register are read and stored in the AR 32.

IF stage: SIMD instruction.

The SIMD correction instruction “SIMD16 D1” is fetched from the ROM 1, and stored in the IR 2.

(3) Operation Timing 3 (Referred to as operation timing 3 to 3+α in cases in which several clocks are necessary).

EX stage: MUL instruction.

Based on the result of the decoding by the DEC 3 in the operation timing 2, the ALU 34 performs an unsigned 64-bit multiplication to multiply A input by B input, using the contents of the BR 33 as the A input and the contents of the AR 32 as the B input, and stores an operation result “0x09B58373297DAFC0” in the D1 register of the register file 4.

Furthermore, the ALU 34 generates 8-bit correction data and 16-bit correction data.

Note that the ALU 34 generates the 8-bit correction data and the 16-bit correction data using the same method as in Example 1 in the present embodiment, therefore a detailed description is omitted here.

DEC stage: SIMD16 instruction.

The SIMD correction instruction “SIMD16 D1” stored in the IR 2 is decoded by the DEC 31. The result of decoding shows that two parallel 16-bit SIMD correction operations are to be executed. Based on the decoding, the contents “0x09B58373297DAFC0” written to the D1 register in the EX stage are read and stored in the AR 32, and the 16-bit correction data “0x000037D2FB600000”, which is the contents of the CR 36 written in the EX stage, is read and stored in the BR 33.

IF stage: not relevant.

(4) Operation Timing 4 (Operation timing 4+α when the previous operation timing is operation timing 3 to 3+α.)

EX stage: SIMD16 instruction.

   0x09B58373297DAFC0
−) 0x000037D2FB600000
- - - - - - - - - - -
   0x09B54BA02E1DAFC0

When divided into two pieces of 32-bit data: “0x09B54BA0” and “0x2E1DAFC0”, this operation result “0x09B54BA02E1DAFC0” is the SIMD operation result obtained if the contents pre-stored in the D0 register are considered to be two pieces of 16-bit data, and the pieces of 16-bit data are dyadic multiplied with corresponding unsigned pieces of 16-bit data.

The following shows this SIMD operation result.

(D0)    0x00001234    0x00005678
(D1) *) 0x00008888 *) 0x00008888
- - - - - - - - - - - - - - - -
        0x000091A0    0x0002B3C0
        0x00091A00    0x002B3C00
        0x0091A000    0x02B3C000
     +) 0x091A0000 +) 0x2B3C0000
- - - - - - - - - - - - - - - -
(D1)    0x09B54BA0    0x2E1DAFC0

DEC stage: not relevant.

IF stage: not relevant.

As described, the processor of the third embodiment of the present invention is able to execute SIMD operations for a plurality of types of operations by simply implementing instructions SIMD8 and SIMD16 in addition to conventional instructions. These additional instructions are not related to the types of operations, but instead to sizes of operations. As a result of this construction, a dramatic increase in the number of instructions is avoided.
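The two-instruction sequences of Examples 1 and 2 can be modelled with a small software sketch. The class `Processor`, its method names, and the helper `cross_terms` are hypothetical names chosen for this illustration; the model only mirrors the register-level behaviour described above (one ordinary multiply that also fills the CR 35 and the CR 36, followed by a size-specific correction) and says nothing about the actual hardware structure.

```python
# Illustrative model only: class, method and helper names are hypothetical.
MASK64 = (1 << 64) - 1

def cross_terms(a: int, b: int, lane_bits: int, lanes: int) -> int:
    """Part of a*b that does not belong to any lane-by-lane product."""
    total = 0
    for i in range(lanes):
        lane = ((1 << lane_bits) - 1) << (i * lane_bits)
        total += (b & ~lane) * (a & lane)
    return total & MASK64

class Processor:
    def __init__(self) -> None:
        self.reg = {"D0": 0, "D1": 0}
        self.cr35 = 0  # 8-bit correction data
        self.cr36 = 0  # 16-bit correction data

    def mul(self, src: str, dst: str) -> None:
        """MUL src, dst: ordinary multiply; both kinds of correction data are recorded."""
        a, b = self.reg[dst], self.reg[src]
        self.reg[dst] = (a * b) & MASK64
        self.cr35 = cross_terms(a, b, 8, 4)
        self.cr36 = cross_terms(a, b, 16, 2)

    def simd8(self, dst: str) -> None:
        """SIMD8 dst: subtract the 8-bit correction data."""
        self.reg[dst] = (self.reg[dst] - self.cr35) & MASK64

    def simd16(self, dst: str) -> None:
        """SIMD16 dst: subtract the 16-bit correction data."""
        self.reg[dst] = (self.reg[dst] - self.cr36) & MASK64

def run(correction: str) -> int:
    p = Processor()
    p.reg["D0"], p.reg["D1"] = 0x12345678, 0x88888888
    p.mul("D0", "D1")
    getattr(p, correction)("D1")
    return p.reg["D1"]

assert run("simd8") == 0x09901BA02DB03FC0   # Example 1
assert run("simd16") == 0x09B54BA02E1DAFC0  # Example 2
```

In this model the same `mul` serves both examples, and only the correction step depends on the data size.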

Note that in the first and second embodiments, when the ALU 8 or the ALU 21 executes an operation instruction, carry information is made for each type of operation according to whether there is a carry from seven bit positions in the operation result, and the carry information is stored in the CR 9. In the third embodiment, correction information is generated and stored in the CR 35 and the CR 36, but it is not always necessary to generate carry information and correction information when the ALU 34 executes an operation instruction. A possible structure in the first and second embodiments is a structure in which, when the ALU 8 or the ALU 21 executes an operation instruction, the carry result from seven bit positions (C7, C15, C23, C31, C39, C47, C55) and the type of operation (ADD, SUB, INC or DEC) are recorded, and carry information is generated based on the contents of the recorded information. A possible structure in the third embodiment is one in which, when the ALU 34 executes an operation instruction, the data for generating correction data, and the type of operation are recorded, and correction data is generated based on the recorded contents when the SIMD correction instruction is executed.

FIG. 16 shows the structure of an SIMD operation apparatus in which the carry result and the operation type are recorded when an operation instruction is executed, and carry information is generated when an SIMD correction instruction is executed.

An SIMD operation apparatus 40 shown in FIG. 16 includes an ALU 41 instead of the ALU 8 in the SIMD operation apparatus 10 shown in FIG. 1, an EXT 42 instead of the EXT 5 in the SIMD operation apparatus 10, a CR 43 and an OPR 44 instead of the CR 9 in the SIMD operation apparatus 10.

The ALU 41 is a 64-bit adder/subtractor that, when an operation instruction is being executed, performs either an addition A+B or a subtraction A−B, and stores the result of the operation in the registers. Here, A input and B input are the respective contents of the AR 6 and the BR 7. In addition, the ALU 41 stores a carry result in the CR 43 and the operation type in the OPR 44, and corrects the operation result to the SIMD operation results based on carry information generated by the EXT 42.

The EXT 42 is a carry information generator/extender. The EXT 42 generates 8-bit-use carry information from the carry result stored in the CR 43 as a result of an 8-bit SIMD correction instruction being decoded by the IR 2 and the operation type stored by the OPR 44. The EXT 42 generates 16-bit-use carry information from the carry result stored in the CR 43 as a result of a 16-bit SIMD correction instruction being decoded by the IR 2 and the operation type stored by the OPR 44. The EXT 42 generates 32-bit-use carry information from the carry result stored in the CR 43 as a result of a 32-bit SIMD correction instruction being decoded by the IR 2 and the operation type stored by the OPR 44. The EXT 42 stores the generated carry information in the BR 7.

The CR 43 is a register of at least seven bits that stores the carry result during execution of an operation instruction.

The OPR 44 is a register that stores the operation type during execution of an operation instruction.
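The deferred scheme of FIG. 16 can be sketched for the ADD case. The sketch below is an assumption-laden illustration: it recovers the carry into each 8-bit boundary as the corresponding bit of A xor B xor sum (a standard identity for binary addition, not something stated in the text), records the seven carries and the operation type, and only builds the size-specific carry information when the correction is requested. The function names are hypothetical, and SUB, INC and DEC are deliberately left out.

```python
# Illustrative sketch only: function names are hypothetical; only ADD is modelled.
MASK64 = (1 << 64) - 1

def add_and_record(a: int, b: int):
    """Plain 64-bit add. Returns (sum, carry result [C7, C15, ..., C55], operation type)."""
    s = (a + b) & MASK64
    carries = a ^ b ^ s  # bit k of this value is the carry into bit k
    carry_result = [(carries >> k) & 1 for k in range(8, 64, 8)]
    return s, carry_result, "ADD"

def simd_correct(s: int, carry_result: list, op_type: str, lane_bits: int) -> int:
    """Generate carry information for the requested size and subtract it lane by lane."""
    if op_type != "ADD":
        raise NotImplementedError("this sketch covers ADD only")
    step = lane_bits // 8
    lane_mask = (1 << lane_bits) - 1
    out = 0
    for lane in range(64 // lane_bits):
        value = (s >> (lane * lane_bits)) & lane_mask
        # Lane 0 has no carry-in; lane L uses the recorded carry at its lower boundary.
        carry_in = carry_result[lane * step - 1] if lane > 0 else 0
        out |= ((value - carry_in) & lane_mask) << (lane * lane_bits)
    return out

a, b = 0x00FF00FF00FF00FF, 0x0001000100010001
s, cr, opr = add_and_record(a, b)
assert s == 0x0100010001000100
assert simd_correct(s, cr, opr, 8) == 0x0000000000000000   # each 0xFF + 0x01 wraps to 0x00
assert simd_correct(s, cr, opr, 16) == 0x0100010001000100  # no carry crosses a 16-bit boundary
```

Because only the seven raw carries and the operation type are kept, the same recorded state serves the 8-bit, 16-bit and 32-bit corrections.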

Furthermore, it is possible to include an additional feature by which the CR 9, the CR 35, the CR 36, the CR 43 and the OPR 44 store their contents to a memory or the like on receiving an interrupt or a switch to another context, and the contents are restored when returning from the interrupt or switching back to the original context. This means that interrupts can be received without inconsistencies between the operation instruction and the SIMD correction instruction, and the SIMD operation apparatus 40 is able to perform multitasking without a time lag.

Furthermore, although in each embodiment SIMD correction instructions are employed along with ADD, SUB, INC, DEC, MUL and DIV operation instructions and so on to achieve SIMD operations, it is possible to have SIMD operations performed according to both SIMD correction instructions and SIMD-specific instructions. For example, the two instructions ADD and SUB may be used together with an SIMD correction instruction, while increment and decrement may be implemented with an SIMD-specific instruction.

Here, the increment SIMD-specific instructions may be INCS8 (for processing eight pieces of 8-bit data in parallel), INCS16 (for processing four pieces of 16-bit data in parallel), and INCS32 (for processing two pieces of 32-bit data in parallel). The decrement SIMD-specific instructions may be DECS8 (for processing eight pieces of 8-bit data in parallel), DECS16 (for processing four pieces of 16-bit data in parallel), and DECS32 (for processing two pieces of 32-bit data in parallel).

The total number of instructions in such a case is three fewer than when SIMD-specific instructions are implemented for all four operations.
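One way to read this count is shown below. It assumes three data sizes (8, 16 and 32 bits), one SIMD correction instruction per size, and that the ordinary ADD and SUB instructions already exist and so are not counted as additions. The mnemonics for the add and subtract SIMD-specific instructions and for the 32-bit correction instruction are hypothetical; the text names only INCSn, DECSn, SIMD8 and SIMD16.

```python
# Illustrative count only: ADDSn/SUBSn and SIMD32 are hypothetical mnemonics.
SIZES = (8, 16, 32)

# SIMD-specific instructions for all four operations: 4 operations x 3 sizes.
all_specific = [f"{op}S{size}" for op in ("ADD", "SUB", "INC", "DEC") for size in SIZES]

# Mixed scheme: one correction instruction per size, plus INCSn and DECSn.
mixed = [f"SIMD{size}" for size in SIZES]
mixed += [f"{op}S{size}" for op in ("INC", "DEC") for size in SIZES]

assert len(all_specific) == 12
assert len(mixed) == 9
assert len(all_specific) - len(mixed) == 3
```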

The following describes an example of operations of the INCS8 instruction with use of the compositional elements shown in FIG. 1.

(1) On the INCS8 instruction being decoded by the DEC 3, the register operand designated in the instruction is read from the register file 4, and stored in the AR 6. In addition, a value 0x0101010101010101 is stored in the BR 7.

(2) Next, the ALU 8 adds the contents of the AR 6 to the contents of the BR 7. Here, propagation of the carry from bit position 7 to bit position 8, the carry from bit position 15 to bit position 16, the carry from bit position 23 to bit position 24, the carry from bit position 31 to bit position 32, the carry from bit position 39 to bit position 40, the carry from bit position 47 to bit position 48, and the carry from bit position 55 to bit position 56 is not performed, and the operation result from the ALU 8 is stored in a register designated by the instruction language.

The values stored in the BR 7 for SIMD-specific instructions other than INCS8 are: 0x0101010101010101 for DECS8 (the same as for INCS8), 0x0001000100010001 for INCS16 and DECS16, and 0x0000000100000001 for INCS32 and DECS32. The operation performed by the ALU 8 for INCS16 and INCS32 is the same as for INCS8. The ALU 8 performs a subtraction for DECS8, DECS16 and DECS32. The places in which carry propagation is not performed are the same for DECS8 as for INCS8. Carry propagation is not performed from bit position 15 to bit position 16, from bit position 31 to bit position 32, or from bit position 47 to bit position 48 for INCS16 and DECS16. Carry propagation is not performed from bit position 31 to bit position 32 for INCS32 and DECS32. Other operations are the same as for INCS8.
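The behaviour of the INCSn and DECSn instructions can be emulated in software as follows. This sketch does not model how the ALU 8 suppresses carries; it simply processes each lane separately, which has the same effect as cutting the carry chain at the lane boundaries. The function names `br_constant`, `lanewise`, `incs` and `decs` are assumptions made for the illustration.

```python
# Illustrative emulation only: function names are hypothetical.
def br_constant(lane_bits: int) -> int:
    """Value loaded into the BR 7: a one in the lowest bit of every lane."""
    return sum(1 << shift for shift in range(0, 64, lane_bits))

def lanewise(x: int, y: int, lane_bits: int, subtract: bool = False) -> int:
    """Add or subtract lane by lane, with no carry or borrow between lanes."""
    mask = (1 << lane_bits) - 1
    out = 0
    for shift in range(0, 64, lane_bits):
        a = (x >> shift) & mask
        b = (y >> shift) & mask
        r = a - b if subtract else a + b
        out |= (r & mask) << shift  # wrap inside the lane
    return out

def incs(x: int, lane_bits: int) -> int:
    return lanewise(x, br_constant(lane_bits), lane_bits)

def decs(x: int, lane_bits: int) -> int:
    return lanewise(x, br_constant(lane_bits), lane_bits, subtract=True)

# Constants quoted in the text.
assert br_constant(8) == 0x0101010101010101
assert br_constant(16) == 0x0001000100010001
assert br_constant(32) == 0x0000000100000001

# 0xFF wraps to 0x00 without disturbing the neighbouring lane.
assert incs(0x0000FFFF0000FFFF, 8) == 0x0101000001010000
assert incs(0x0000FFFF0000FFFF, 16) == 0x0001000000010000
assert incs(0x00000000FFFFFFFF, 32) == 0x0000000100000000
assert decs(0x0000000000000000, 8) == 0xFFFFFFFFFFFFFFFF
```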

In this way, by implementing increment and decrement SIMD-specific instructions, it is possible to increase or decrease a plurality of addresses at once and control the brightness, color or the like of a plurality of pieces of image data at high speed.

Furthermore, although the SIMD operation apparatuses in the embodiments use an operation device such as a 64-bit adder/subtractor to implement three types of SIMD operations: eight parallel operations on 8-bit data, four parallel operations on 16-bit data, and two parallel operations on 32-bit data, the SIMD operation apparatuses may implement more or fewer types of SIMD operations. For example, the operation device may be 32 bits, and may implement four parallel 8-bit data operations and two parallel 16-bit data operations. Alternatively, the operation device may be 128 bits, and may implement all or some of the following types of operations: sixteen parallel 8-bit data operations, eight parallel 16-bit data operations, four parallel 32-bit data operations, and two parallel 64-bit data operations.

In any of the above-described cases, carry information that corresponds to the smallest data size of the implemented SIMD operations is recorded during execution of an operation. For example, when the smallest data size is 16 bits, the carry information is generated based on carries C15, C31, C47, . . . C(16n−1).
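The carry positions that need to be recorded for a given operation device width and smallest data size can be listed with a one-line helper. The name `carry_positions` is an assumption for this sketch; bit positions are counted from 0, as in the C7, C15, ... notation used above.

```python
# Illustrative helper only: `carry_positions` is a hypothetical name.
def carry_positions(device_bits: int, smallest_lane_bits: int) -> list:
    """Bit positions whose carry-out crosses a boundary between the smallest lanes."""
    lanes = device_bits // smallest_lane_bits
    return [k * smallest_lane_bits - 1 for k in range(1, lanes)]

# 64-bit device, 8-bit smallest size: the seven carries C7 ... C55.
assert carry_positions(64, 8) == [7, 15, 23, 31, 39, 47, 55]
# 64-bit device, 16-bit smallest size: C15, C31, C47.
assert carry_positions(64, 16) == [15, 31, 47]
# 32-bit device with 8-bit data.
assert carry_positions(32, 8) == [7, 15, 23]
# The positions for a larger size are a subset of those for a smaller size.
assert set(carry_positions(64, 16)) <= set(carry_positions(64, 8))
```

The last assertion reflects why recording at the smallest size is enough: the boundaries of every larger size are among those already recorded.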

Furthermore, in the embodiments, it is not necessary to store the eight least significant bits of the carry information since they are always 0 and are not used in correction.

Furthermore, although the SIMD operation apparatuses in the embodiments employ a single scalar architecture method that processes one instruction per machine cycle, it is possible to use an architecture method that processes a plurality of instructions per machine cycle, such as a superscalar architecture method or a VLIW (very long instruction word) architecture method.

Furthermore, although the processors in the embodiments are composed of a three-stage pipeline, specifically instruction fetch, decode and execute, the pipeline may have any number of stages. Alternatively, it is possible to not use a pipeline structure.

Fourth Embodiment

The fourth embodiment of the present invention is a compiler that generates machine language instruction programs for the processors of the first to third embodiments. In these programs, an SIMD operation instruction is realized by an operation instruction for a non-parallel operation together with a correction instruction that corrects the result of the non-parallel operation instruction to the result of the SIMD operation instruction.

FIG. 17 shows the structure of the compiler of the fourth embodiment.

A compiler 100 shown in FIG. 17 is composed of a file reading unit 101, a read buffer 102, a syntax analysis unit 103, an intermediate code buffer 104, a machine language instruction generation unit 105, an output buffer 106 and a file output unit 107.

The file reading unit 101 reads a C language program file from an external recording medium such as a hard disk, to the read buffer 102.

FIG. 18 shows an example of the C language program read to the read buffer 102.

The C language program shown in FIG. 18 is a loop that finds the sum of an array variable a[i] and an array variable b[i], and stores the result in an array variable c[i]. Here, i takes the values 0 to 63, so 64 array operations are performed.
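
Since the figure itself is not reproduced here, the following is a sketch of a loop that matches the description, not a copy of FIG. 18. The function name add_arrays and the use of unsigned char (chosen so that 8-bit wraparound is well defined in C) are assumptions made for this example.

```c
enum { ARRAY_LEN = 64 };

/* Element-by-element addition of two char-type arrays, one element per
   loop iteration, with i running from 0 to 63. */
void add_arrays(const unsigned char a[ARRAY_LEN],
                const unsigned char b[ARRAY_LEN],
                unsigned char c[ARRAY_LEN])
{
    for (int i = 0; i < ARRAY_LEN; i++) {
        c[i] = (unsigned char)(a[i] + b[i]);  /* result wraps at 8 bits */
    }
}
```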

The syntax analysis unit 103 analyzes the syntax of the C language program read to the read buffer 102, to generate an intermediate code program which it writes to the intermediate code buffer 104. Here, the intermediate code program is in a processor-independent format, and does not include intermediate code that shows an SIMD operation.

FIG. 19 shows an example of the intermediate code program generated from the C language program in FIG. 18.

The following describes the intermediate codes in the intermediate code program shown in FIG. 19.

(Intermediate code 1) A value 0 is assigned to the variable i.

(Intermediate code 2) The i-th element of the char-type array variable a and the i-th element of the char-type array variable b are added together, and the result is stored in the i-th element of the char-type array variable c.

(Intermediate code 3) The value of the variable i is increased by one.

(Intermediate code 4) The value of each flag is updated according to a result of subtracting 64 from the variable i.

(Intermediate code 5) When the updated value of the flags shows “0 or less”, in other words, when “i−64≦0” is fulfilled in intermediate code 4, the intermediate code program branches to intermediate code 2.

The machine language instruction generation unit 105 generates a machine language instruction program that includes a machine language instruction showing an SIMD operation, using the intermediate code program stored in the intermediate code buffer 104 as input, and writes the generated machine language instruction program to the output buffer 106. Here, this machine language instruction program is composed of machine language instructions and is in a processor-dependent format. The machine language instructions include those that show an SIMD operation.

The file output unit 107 outputs the machine language instruction program stored in the output buffer 106 to an external recording medium such as a hard disk.

FIG. 20 shows the structure of the machine language instruction generation unit 105 in detail.

The machine language instruction generation unit 105 shown in FIG. 20 is composed of an SIMD operation extraction unit 110, an SIMD intermediate code generation unit 111 and a machine language instruction output unit 112.

The SIMD operation extraction unit 110 scans the intermediate code program input from the intermediate code buffer 104 to find intermediate codes for which an array operation is to be performed, and generates a modified intermediate code program. The modified intermediate code program is obtained by converting the intermediate code in the intermediate code program that is for performing data array operations to modified intermediate code that shows an SIMD operation for a predetermined number of operations at once according to the data array type.

FIG. 21 shows an example of a modified intermediate code program generated from the intermediate code program shown in FIG. 19.

The following describes the modified intermediate codes in the modified intermediate code program shown in FIG. 21.

(Modified intermediate code 1) A value 0 is assigned to the variable i (same as intermediate code 1).

(Modified intermediate code 2) The eight array elements from the i-th to the (i+7)-th array element of the char-type array variable a, and the eight array elements from the i-th to the (i+7)-th array element of the char-type array variable b are respectively added, and the results are stored in the respective eight array elements from the i-th to the (i+7)-th array element of the char-type array variable c.

(Modified intermediate code 3) The value of the variable i is increased by 8.

(Modified intermediate code 4) The value of each flag is updated according to a result of subtracting 64 from the variable i (same as intermediate code 4).

(Modified intermediate code 5) When the value of each flag shows “0 or less”, in other words, when “i−64≦0” is fulfilled in modified intermediate code 4, the modified intermediate code program branches to modified intermediate code 2.
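
Expressed back in C, the effect of this conversion resembles the loop below, which advances i by 8 and treats elements i to i+7 as one unit. This is only a sketch for illustration: the function name is hypothetical, unsigned char is assumed for well-defined wraparound, and the loop bound follows the statement that i covers 0 to 63.

```c
enum { ARRAY_LEN = 64, GROUP = 8 };

/* Same result as the element-by-element loop, but organized so that each
   outer iteration covers the eight elements i .. i+7. */
void add_arrays_by_eights(const unsigned char *a,
                          const unsigned char *b,
                          unsigned char *c)
{
    for (int i = 0; i < ARRAY_LEN; i += GROUP) {
        /* One group of eight additions, to be replaced later by a single
           SIMD operation on 64 bits. */
        for (int k = 0; k < GROUP; k++) {
            c[i + k] = (unsigned char)(a[i + k] + b[i + k]);
        }
    }
}
```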

The SIMD intermediate code generation unit 111 uses the modified intermediate code program generated by the SIMD operation extraction unit 110 to generate an SIMD intermediate code program. Here, the SIMD intermediate code program includes intermediate codes showing SIMD operations.

FIG. 22 shows an example of an SIMD intermediate code program generated from the modified intermediate code program shown in FIG. 21.

The following describes the intermediate codes in the SIMD intermediate code program shown in FIG. 22.

(SIMD intermediate code 1) A value 0 is assigned to the variable i (same as intermediate code 1 and modified intermediate code 1).

(SIMD intermediate code 2) This corresponds to reading the eight elements of the char-type array variable a in modified intermediate code 2. 64 bits' worth of data are read from a memory area shown by a pointer A, and stored in the variable a.

(SIMD intermediate code 3) This corresponds to reading the eight elements of the char-type array variable b in modified intermediate code 2. 64 bits' worth of data are read from a memory area shown by a pointer B, and stored in the variable b.

(SIMD intermediate code 4) This corresponds to adding the eight elements of the char-type array variable a and the eight array elements of the char-type array variable b respectively in the modified intermediate code 2. The variable a and the variable b are subjected to SIMD addition 8 bits at a time, and the result is stored in the variable c.

(SIMD intermediate code 5) This corresponds to writing the eight elements of the char-type array variable c in modified intermediate code 2. The variable c is written to the 64 bits' worth of the memory area shown by a pointer C.

(SIMD intermediate code 6) This corresponds to increasing the pointer A for the array variable a following increasing the value of the variable i in the modified intermediate code 3. The value of the pointer A is increased by 8.

(SIMD intermediate code 7) This corresponds to increasing the pointer B for the array variable b following increasing the value of the variable i in the modified intermediate code 3. The value of the pointer B is increased by 8.

(SIMD intermediate code 8) This corresponds to increasing the pointer C for the array variable c following increasing the value of the variable i in the modified intermediate code 3. The value of the pointer C is increased by 8.

(SIMD intermediate code 9) The value of the variable i is increased by 8 (same as modified intermediate code 3).

(SIMD intermediate code 10) The value of each flag is updated according to a result of subtracting 64 from the variable i (same as intermediate code 4 and modified intermediate code 4).

(SIMD intermediate code 11) When the updated value of each flag shows “0 or less”, in other words, when “i−64≦0” is fulfilled in SIMD intermediate code 10, the SIMD intermediate code program branches to SIMD intermediate code 2.
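
The structure of this SIMD intermediate code program can be sketched in C as below. The sketch assumes that a 64-bit read or write through a pointer can be modeled with memcpy, and it defines the meaning of "SIMD addition 8 bits at a time" with a simple byte-by-byte reference function. The names simd_add8 and simd_loop are hypothetical, and the loop bound follows the statement that i covers 0 to 63.

```c
#include <stdint.h>
#include <string.h>

/* Reference meaning of an 8-bit SIMD addition: every byte of the 64-bit
   word is added independently and wraps inside its own byte. */
uint64_t simd_add8(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int k = 0; k < 8; k++) {
        uint64_t byte = ((a >> (8 * k)) + (b >> (8 * k))) & 0xFF;
        r |= byte << (8 * k);
    }
    return r;
}

void simd_loop(const unsigned char *A, const unsigned char *B,
               unsigned char *C)
{
    int i = 0;                      /* variable i set to 0 */
    while (i < 64) {                /* flag update and branch back */
        uint64_t a, b, c;
        memcpy(&a, A, sizeof a);    /* read 64 bits shown by pointer A */
        memcpy(&b, B, sizeof b);    /* read 64 bits shown by pointer B */
        c = simd_add8(a, b);        /* SIMD addition 8 bits at a time */
        memcpy(C, &c, sizeof c);    /* write 64 bits shown by pointer C */
        A += 8;                     /* pointers A, B and C increased by 8 */
        B += 8;
        C += 8;
        i += 8;                     /* variable i increased by 8 */
    }
}
```

Eight iterations of this loop replace the 64 iterations of the element-by-element loop.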

The machine language instruction output unit 112 generates a machine language instruction program that includes machine language instructions showing an SIMD operation, using the SIMD intermediate code program generated by the SIMD intermediate code generation unit 111.

FIG. 23 shows an example of the machine language instruction program generated from the SIMD intermediate code program shown in FIG. 22.

The following describes the machine language instructions in the machine language instruction program shown in FIG. 23.

(Machine language instruction 1) This corresponds to SIMD intermediate code 1. The variable i in the SIMD intermediate code 1 is assigned to the register D0, and the contents of the register D0 are cleared by subtracting the contents of the register D0 from the contents of the register D0.

(Machine language instruction 2) This corresponds to SIMD intermediate code 2. The pointer A in the SIMD intermediate code 2 is assigned to the register D1 and the variable a is assigned to the register D2. Data is loaded from the 64-bit memory area shown by the contents of the register D1 and stored in the register D2.

(Machine language instruction 3) This corresponds to SIMD intermediate code 3. The pointer B in the SIMD intermediate code 3 is assigned to the register D3 and the variable b is assigned to the register D4. Data is loaded from the 64-bit memory area shown by the contents of the register D3 and stored in the register D4.

Here, the SIMD intermediate code 4 is broken down into a machine language instruction 4 that is an ordinary operation instruction, and a machine language instruction 5 that is an SIMD correction instruction for correcting an ordinary operation instruction result to an SIMD operation result.

(Machine language instruction 4) This corresponds to the former half of the SIMD intermediate code 4. The variable c in the SIMD intermediate code 4 is assigned to the register D4. The contents of the register D2 and the contents of the register D4 are added, and the result is stored in the register D4. In addition, the carry information for each eight bits is stored in an implicitly designated register (hereinafter “implicit register”).

(Machine language instruction 5) This corresponds to the latter half of the SIMD intermediate code 4. Each eight bits of the contents of the register D4 are corrected using the carry information stored in the implicit register by machine language instruction 4, to obtain an 8-bit SIMD addition result, which is then stored in the register D4.
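
The pairing of an ordinary addition with a following correction can be modeled in C as below. This is a software sketch under assumptions rather than the processor of the embodiments: the Cpu structure, the function names, the layout of the carry information (carry into bits 8, 16, ..., 56) and the masking used to subtract the carries byte by byte are all introduced here for illustration.

```c
#include <stdint.h>

/* Hypothetical machine state: data registers D0 to D7 plus the implicit
   register that receives the carry information. */
typedef struct {
    uint64_t d[8];
    uint64_t carry_info;
} Cpu;

/* Ordinary 64-bit addition, Dn <- Dm + Dn. As a side effect, the carry
   into bits 8, 16, ..., 56 is recorded in the implicit register. */
void exec_add(Cpu *cpu, int dm, int dn)
{
    uint64_t a = cpu->d[dm], b = cpu->d[dn];
    uint64_t sum = a + b;
    cpu->carry_info = (a ^ b ^ sum) & 0x0101010101010100ULL;
    cpu->d[dn] = sum;
}

/* SIMD correction for 8-bit data: remove from each byte the carry it
   received from the byte below, without letting a borrow cross bytes. */
void exec_simd8_correct(Cpu *cpu, int dn)
{
    const uint64_t msb = 0x8080808080808080ULL;
    uint64_t x = cpu->d[dn], k = cpu->carry_info;
    uint64_t low = (x | msb) - (k & ~msb);
    cpu->d[dn] = low ^ ((x ^ ~k) & msb);
}
```

With D2 = 0x00FF00FF00FF00FF and D4 = 0x0001000100010001, exec_add leaves 0x0100010001000100 in D4, and exec_simd8_correct then turns it into 0, which is the byte-wise sum.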

(Machine language instruction 6) This corresponds to the SIMD intermediate code 5. The pointer C in the SIMD intermediate code 5 is assigned to the register D5, and the contents of the register D4 are stored in the 64-bit memory area shown by the contents of the register D5.

(Machine language instruction 7) This corresponds to the SIMD intermediate code 6. The contents of the register D1 are increased by 8.

(Machine language instruction 8) This corresponds to the SIMD intermediate code 7. The contents of the register D3 are increased by 8.

(Machine language instruction 9) This corresponds to the SIMD intermediate code 8. The contents of the register D5 are increased by 8.

(Machine language instruction 10) This corresponds to the SIMD intermediate code 9. The contents of the register D0 are increased by 8.

(Machine language instruction 11) This corresponds to the SIMD intermediate code 10. The value of each flag is updated according to a result of subtracting 64 from the contents of the register D0.

(Machine language instruction 12) This corresponds to the SIMD intermediate code 11. When the values of the flags show that the result is “0 or less”, in other words, when a zero flag (Z), an overflow flag (V) and a negative flag (N) fulfill the relationship “Z or (V xor N)=1”, the machine language instruction program branches to the machine language instruction 2, which is ten instructions earlier.
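
The flag relationship “Z or (V xor N)=1” can be checked with a small C model of a compare operation. The 64-bit register width, the Flags structure and the function names are assumptions for this sketch; the flag definitions used are the conventional ones for a two's-complement subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool z, v, n; } Flags;

/* Flags produced by the subtraction x - y on 64-bit two's-complement
   values. The difference itself is discarded. */
Flags compare(int64_t x, int64_t y)
{
    uint64_t ux = (uint64_t)x, uy = (uint64_t)y;
    uint64_t r = ux - uy;
    Flags f;
    f.z = (r == 0);                 /* zero flag */
    f.n = (r >> 63) != 0;           /* negative flag: sign bit of result */
    /* Overflow: operands differ in sign and the result's sign differs
       from the sign of x. */
    f.v = (((ux ^ uy) & (ux ^ r)) >> 63) != 0;
    return f;
}

/* "0 or less": Z or (V xor N) = 1. */
bool zero_or_less(Flags f)
{
    return f.z || (f.v != f.n);
}
```

Combining V with N keeps the test correct even when the subtraction overflows, a case in which the sign bit alone would give the wrong answer.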

Note that although the above-described example of the program shows a case in which the data of the array subject to operation is char-type 8-bit data, the type of the array is not limited to being char-type. For example, when the data is short-type 16-bit data, four pieces of data may be treated at a time as 64-bit data. Furthermore, when the data is int-type 32-bit data, two pieces of data may be treated at a time as 64-bit data.

FIG. 24 shows an outline of operations of processing by the SIMD operation extraction unit 110 for generating a modified intermediate code program.

In this processing, eight, four or two data operations are performed at a time, according to whether the array type subject to operation is char-type, short-type or int-type. Here, it is supposed that the processor performs an operation for 64 bits of data at once with one operation instruction, and that char-type data is 8 bits, short-type data is 16 bits, and int-type data is 32 bits.

The following describes an outline of processing for generating a modified intermediate code program, with use of FIG. 24.

(1) The SIMD operation extraction unit 110 judges whether there are any unprocessed intermediate codes in the intermediate code program stored in the intermediate code buffer 104 (step S1). When there are no unprocessed intermediate codes, the processing ends.

(2) When there are one or more unprocessed intermediate codes, the SIMD operation extraction unit 110 treats one line of unprocessed code as a target for processing (hereinafter “target code”), and judges whether the target code is for performing an array operation (step S2). When the target code is not for performing an array operation, the SIMD operation extraction unit 110 returns to step S1 to process remaining intermediate codes.

(3) When the target code is for performing an array operation, the SIMD operation extraction unit 110 judges whether the target code is for performing a char-type array operation (step S3).

(4) When the target code is for performing a char-type array operation, the SIMD operation extraction unit 110 finds other codes to be processed that are for performing char-type array operations, converts the codes in groups of eight to char-type modified intermediate codes, and then returns to step S1 to process remaining intermediate codes (step S4).

(5) When the target code is not for performing a char-type array operation, the SIMD operation extraction unit 110 judges whether the target code is for performing a short-type array operation (step S5).

(6) When the target code is for performing a short-type array operation, the SIMD operation extraction unit 110 finds other codes to be processed that are for performing short-type array operations, converts the codes in groups of four to short-type-use intermediate codes, and then returns to step S1 to process remaining intermediate codes (step S6).

(7) When the target code is not for performing a short-type array operation, the SIMD operation extraction unit 110 judges whether the target code is for performing an int-type array operation (step S7).

(8) When the target code is for performing an int-type array operation, the SIMD operation extraction unit 110 finds other codes to be processed that are for performing int-type array operations, converts the codes in groups of two to int-type-use intermediate codes, and then returns to step S1 to process remaining intermediate codes (step S8).

(9) When the target code is not an int-type array operation, the SIMD operation extraction unit 110 treats the target code as a long-type data array operation, and then returns to step S1 to process remaining intermediate codes without further processing the target code (step S9).
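
The decision made in steps S2 to S9 amounts to choosing a group size from the element type. A minimal sketch in C follows; the enumeration, the function names and the treatment of long-type data as 64 bits are assumptions for this example, and the search for other codes of the same type is not modeled.

```c
/* Classification of a target code (hypothetical names). */
typedef enum {
    KIND_NOT_ARRAY,  /* step S2: not an array operation */
    KIND_CHAR,       /* steps S3 and S4 */
    KIND_SHORT,      /* steps S5 and S6 */
    KIND_INT,        /* steps S7 and S8 */
    KIND_LONG        /* step S9 */
} CodeKind;

enum { OPERATION_BITS = 64 };  /* data handled by one operation instruction */

/* Element widths supposed in the text: char 8, short 16, int 32 bits.
   Anything else is assumed to fill the whole 64 bits. */
static int element_bits(CodeKind kind)
{
    switch (kind) {
    case KIND_CHAR:  return 8;
    case KIND_SHORT: return 16;
    case KIND_INT:   return 32;
    default:         return OPERATION_BITS;
    }
}

/* Number of intermediate codes merged into one modified intermediate code.
   A result of 1 means the target code is left as it is. */
int group_size(CodeKind kind)
{
    if (kind == KIND_NOT_ARRAY) {
        return 1;
    }
    return OPERATION_BITS / element_bits(kind);
}
```

Here group_size returns 8, 4 and 2 for char-type, short-type and int-type array operations respectively, matching steps S4, S6 and S8.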

In this way, the compiler of the fourth embodiment of the present invention breaks a conventional SIMD operation into an ordinary operation instruction and an SIMD correction instruction, and is therefore able to generate machine language instruction programs corresponding to each of the processors in the first to third embodiments. Furthermore, the compiler of the fourth embodiment is able to generate a program by which SIMD operations are executed for a plurality of types of operations with only SIMD correction instructions in addition to conventional instructions. Avoiding a dramatic increase in the number of instructions means that the machine language instructions can remain relatively short, which reduces the program code size.

Although the fourth embodiment discloses a compiler that translates a C language program to a machine language instruction program, the program that is translated is not limited to being a C language program, but instead may be any high-order language program. Furthermore, the program that is generated as a result of translating is not limited to being a machine language program, but may be any kind of program that is not of a higher order than the high-order language program. Furthermore, since in the present invention it is sufficient to translate the parts of the program that correspond to an SIMD operation to an ordinary operation instruction and an SIMD correction instruction, it is not always necessary to translate the whole program. For example, the present invention may be a program conversion apparatus that converts a high-order language program that includes a syntax that corresponds to an SIMD operation to a same or different high-order language program that includes, instead of the syntax, an operation instruction and an SIMD correction instruction. Alternatively, the present invention may be a program conversion apparatus that converts a machine language program that includes an SIMD operation instruction to a same or different machine language program that includes, instead of the SIMD operation instruction, an operation instruction and an SIMD correction instruction.

Note that any of the programs of the fourth embodiment may be recorded on a computer-readable recording medium and the recording medium traded, or the programs may be traded by being directly transferred over a network.

Furthermore, a program that has operations such as those of any of the first to fourth embodiments of the present invention executed on a computer may be recorded on a computer-readable recording medium and the recording medium traded, or the programs may be traded by being directly transferred over a network.

Here, the computer-readable recording medium may be, but is not limited to being, a detachable recording medium such as a floppy disk, a CD (compact disk), an MO (magneto-optical) disk, a DVD (digital versatile disk) or a memory card, or a fixed recording medium such as a hard disk or a semiconductor memory.

Although the present invention has been fully described by way of examples with reference to the accompanying drawings, it is to be noted that various changes and modifications will be apparent to those skilled in the art. Therefore, unless such changes and modifications depart from the scope of the present invention, they should be construed as being included therein.

INVENTORS:

Suzuki, Masato

ASSIGNMENT RECORDS Assignment records on the USPTO

Executed on	Assignor	Assignee	Conveyance	Frame	Reel	Doc
Jun 24 2009		SOCIONEXT INC.	(assignment on the face of the patent)
Mar 02 2015	Panasonic Corporation	SOCIONEXT INC	ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS	035294	0942	pdf

MAINTENANCE FEES AND DATES: Maintenance records on the USPTO

Date	Maintenance Fee Events
Dec 13 2018	M1553: Payment of Maintenance Fee, 12th Year, Large Entity.

Date	Maintenance Schedule
Jan 17 2020	4 years fee payment window open
Jul 17 2020	6 months grace period start (w surcharge)
Jan 17 2021	patent expiry (for year 4)
Jan 17 2023	2 years to revive unintentionally abandoned end. (for year 4)
Jan 17 2024	8 years fee payment window open
Jul 17 2024	6 months grace period start (w surcharge)
Jan 17 2025	patent expiry (for year 8)
Jan 17 2027	2 years to revive unintentionally abandoned end. (for year 8)
Jan 17 2028	12 years fee payment window open
Jul 17 2028	6 months grace period start (w surcharge)
Jan 17 2029	patent expiry (for year 12)
Jan 17 2031	2 years to revive unintentionally abandoned end. (for year 12)