A low-complexity combinatorial RNS multiplier

Share Embed


Descripción

A LOW-COMPLEXITY RNS MULTIPLIER V. Paliouras, K. Karagianni and T. Stouraitis Electrical and Computer Engineering Department University of Patras, Greece

Abstract - Novel VLSI architectures and a design methodology for adderbased Residue Number System (RNS) multipliers are presented in this paper. In the proposed approach, the exploitation of the non-occuring combinations of input bits reduces the number of 1-bit full adders (FAs) required to compose a multiplier. In particular, couples and triplets of input bits assigned to particular FAs are identi ed, which contain bits that cannot be simultaneously asserted for any valid input combination. It is shown that the particular couples or triplets can be assigned to OR gates instead of 1-bit adders, therefore reducing multiplier complexity. By comparing the performance and hardware complexity of the proposed multiplier to previously reported designs, it is found that the introduced architecture is more ecient in the areatime product sense. In fact, it is shown that more than 80% performance improvement can be achieved in certain cases.

INTRODUCTION

Among alternative number representations, RNS can o er ecient solutions for digital signal processing (DSP) applications [1], as it provides carry-free addition and multiplication and borrow-free subtraction. Various VLSI architectures have been proposed for RNS-based arithmetic operations; they rely on memory table lookup, combinatorial logic, or combination of both [2]. The simplest method to perform residue arithmetic is the utilization of lookup tables, usually implemented as ROMs [3]. In the case of residue multiplication, by restricting the allowed moduli to prime numbers, index transform [4] or the submodular index transform [5] can be exploited to reduce the complexity of the required ROM tables. A variety of adder-based multiplier architectures have been proposed in the literature [6],[7]. DiClaudio et al. introduced the pseudo-RNS representation, which enables building reprogrammable modulus multipliers, allows for systolization, and simpli es the computation of DSP algorithms such as FIR ltering, correlations, and DFT [6]. Stouraitis et al. [7] presented FA-based single-modulus architectures for RNS multiply-and-accumulate operations. While maintaining the organization of the residue multiplier as a series of cascaded stages [7],[8],[9], in the proposed approach the organization of the stages is reconsidered. In fact a new organization is proposed for each stage, which occurs through an introduced simpli cation procedure, leading to

complexity reduction. Hardware simpli cation is achieved by reducing certain 1-bit FAs to logic OR gates. The reduction is made possible by organizing the input bits into sets, the maximum sum of the elements of which, cannot exceed unity for any input combination. The proposed procedure leads to architectures demonstrating an area-time performance improvement of more than 80%, in certain cases, when compared to previously reported residue multipliers, as detailed later in the performance evaluation section. The organization of the remainder of the paper is as follows. The following section revisits basics of RNS and of adder-based residue multiplier architectures. Subsequently, the proposed multiplier architecture is described. The achieved performance is then compared to results previously reported in the literature. Finally, the conclusions are discussed.

RNS AND ADDER-BASED RNS ARCHITECTURES

In RNS, each operand is encoded as an ordered set of residues with respect to the members of the base, a set of relatively prime integers. In this way an integer is transformed to an ordered set of smaller integers that can be processed in parallel [3]. Let A, B be integers and let the set fm1 ; m2 ; : : : ; mM g be a base. An QM integer A, 0  A < i=1 mi is uniquely represented in RNS as follows

A RNS ! fA1 ; A2 ; : : : ; AM g (1) where Ai = hAim , for i = 1; 2; : : :; M , and hxim denotes the residue of x modulo mi . The RNS representation of product, sum or di erence C of A and B is denoted by the tuple i

i

C RNS ! fC1 ; C2 ; : : : ; CM g;

(2)

where Ci = hAi  Bi im and  denotes multiplication, addition or subtraction. RNS is a carry-free number system in the sense that Ci can be computed without requiring any information from the tuples (Aj ; Bj ), i 6= j: Recovering integer C from the tuple fC1 ; C2 ; : : : ; CM g is performed by means of the Chinese Remainder Theorem (CRT) or mixed-radix conversion [3]. The three-stage residue multiplier organization [7],[8],[9] exploited by the proposed architecture, is explained by means of an example. Assume a modulo-5 multiplier, which computes residue C as follows i

C = hAB i5 =

*n;1 n;1 XX

i=0 j =0



i+j

a i bj 2

+

5

5

;

(3)

where A, B , and C are n-bit residues modulo 5, hence n = 3, and ai , bj 2 f0; 1g denote the ith and j th bit of A and B respectively. According to the architecture by Stouraitis et al. [7], the rst stage of the multiplier architecture implements the nested summations of (3), while the second and

third one perform the modulo operation of the right-hand side of (3). In particular, the second stage iteratively transforms its input to a number having the same residue modulo m = 5, but with a word length, n, equal to that of the selected modulus, n = 3. Finally, the third stage maps its input, Y 0 , to the sought residue through a conditional addition: Since Y 0 is less than 2m, ;m is added to it if Y 0  m. Hence the third stage implements the operation  0 0 m Y = YY 0 ;; m; ifif YY 0 < (4) m :

Each stage can be described by the sets Zk;i of input bits that contribute to the ith output bit of a stage or of the cascaded parts of a stage|called recursions [9]. The index k = 0 denotes that the corresponding set Zk=0;i refers to the rst stage, while 0 < k  r refers to the kth recursion of the second stage. The output bits to which the bit product ai bj contributes|and hence the Zk=0;i sets to which it belongs|can be identi ed by writing h2i+j i5 in binary form as implied by (3). For example, for i = 1 and j = 2, it is







2i+j 5 = 23 5 = (011)2 = 21 + 20 ;

(5)

where (x)2 denotes that x is written in binary form. Therefore, due to (5), the bit product a1 b2 contributes to the output bits of weight 2 and 1, due to the term

a1 b2 21+2 5 = a1 b2 21 + a1 b2 20 (6) of the inner summation in (3); hence a1 b2 belongs to both Z0;0 and Z0;1 . The described technique can be applied to determine Zk;i for each of the r recursions of the second stage [9]. An equivalent way to describe each stage or recursion in multiplication is o ered by the Flag Bit Sequence (FBS) fqi;j;k g, qi;j;k 2 f0; 1g, where each qi;j;k denotes whether the j th input bit contributes or not to the ith result bit during the kth recursion [8]. A Zk;i based description is equivalent to an FBS-based one, since

Zk;i = f yk;j j qi;j;k = 1g ; (7)

j where yk;j denotes an input bit of the kth recursion of weight 2 m . In the de nition of the rst stage of a multiplier, the bit product ai bj of (3) that has a weight of 2i+j m , replaces yk;j in (7). Having determined the input bits

that contribute to a particular digital position, an architecture consisting of 1-bit FAs and half adders (HAs) can be derived using existing methodologies [9][10, pp. 108-109].

THE PROPOSED MULTIPLIER ARCHITECTURE

The proposed architecture is organized into three stages similarly to the multiplier described in the previous section. The organization of each stage resembles the carry-save array [10]. In particular, each stage consists of columns of FAs or HAs and|departing from the conventional approach|OR

gates. Carry bits produced by the FAs or HAs are processed in the subsequent more signi cant column in accordance to the carry-save paradigm. The derivation of the proposed multiplier architecture is a hardware complexity reduction procedure perceived as the minimization of the number of bits required to be added by the ith column of 1-bit adders. The particular hardware reduction procedure exploits the fact that the set of values assumed by the operands of multiplication is incomplete. In particular, as the operands are residues modulo m, they may assume values in the range f0; 1; : : : ; m ; 1g and not in the complete set of values f0; 1; : : :; 2blog2 mc+1 ; 1g, assumed by a (blog2 mc + 1)-bit binary number. Similarly, the recursions of the second stage assume input values in the range f0; 1; : : :; ymax;k ; 1g, which is not complete in the particular sense. Conventionally, bits of the input words form the sets of bits which are assigned to the 1-bit adders in a column. According to the proposed approach, 1-bit adders can be reduced to OR gates, when the 2- or 3-bit word assigned to a HA or FA respectively assumes an incomplete set of values; therefore hardware simpli cation can be achieved by replacing FAs or HAs with|simpler|OR gates. Let Yk given as

Yk =

nX k ;1 j =0

yk;j 2j ;

(8)

be an input to the kth recursion. There exist couples (yk;j1 ; yk;j2 ) or triplets (yk;j1 ; yk;j2 ; yk;j3 ) of input bits yk;j 2 f0; 1g, 0  j1 ; j2 ; j3 < nk , assigned to the ith column, i.e., qj1 ;i;k = qj2 ;i;k = 1 or qj1 ;i;k = qj2 ;i;k = qj3 ;i;k = 1, the summation of which does not produce a carry for any valid input word combination. In other words, the particular triplets or couples contain bits, only one of which can be asserted for any valid input. Equivalently, for the particular triplets it holds that yk;j1 + yk;j2 + yk;j3  1; 8 Yk 2 I; (9) while for the couples it is yk;j1 + yk;j2  1; 8 Yk 2 I; (10) where Yk denotes an input word to the kth recursion and I is the set of all legitimate input values. The particular couples or triplets are important in the derivation of the proposed architecture because a carry bit is not required to represent the sum of their elements; therefore a simpler circuit than an FA or a HA suces to process them. When the rst stage of a multiplier is considered, an input Yk=0 is a couple of integers Y0 = (A; B ) 2 I = f0; 1; : : :; m ; 1gf0; 1; : : :; m ; 1g and the input bits to the rst stage are the Pn;1 Pn;1 i bit products ai bj , ai ; bj 2 f0; 1g, where A = i=0 ai 2 and B = j=0 bj 2j : When the kth recursion of the second stage is considered, Yk is given by (8) and it belongs to the set Ik , where Ik = f0; 1; : : :; Ymax;k;1 g ; and Ymax;k;1 is the maximum output of the (k ; 1)st recursion or of the rst stage.

In order to minimize hardware complexity, a three-step procedure is suggested. The objective of the proposed procedure is to identify the largest possible number of couples or triplets, formed by bits contained in each of the Zk;i sets, which satisfy conditions (9) and (10) respectively. 1. Let Zk;i denote the set of input bits required to be added at the ith column. According to (9) and (10), the triplets or couples of bits x; y, and z contained in Zk;i , are organized into the sets c(it) and c(id) de ned as c(it) = f(x; y; z ) jx; y; z 2 Zk;i ; x + y + z  1; 8 Y 2 I g (11) c(id) = f(x; y) jx; y 2 Zk;i ; x + y  1; 8 Y 2 I g : (12) 2. For each column i, i = 0; 1; : : :; nk ; 1, the bits of Zk;i are distinguished into sets of disjoint triplets or couples and a set which collects the remainder of the bits not included in any of the triplets or couples. Speci cally, several sets c0i can be de ned, which contain disjoint triplets and couples of c(it) and c(id), n

\

o

c0i = (x; y) 2 c(id) ; (x; y; z ) 2 c(it) j (x; y) (x; y; z ) = ; ; (13) while the remainder of the Zk;i bits, which are not included in the triplets and couples of c0i , form the set cfi given as cfi = fx 2 Zk;i jx 2= A; 8 A 2 c0i g : (14)

3. From the several possible c0i and cfi sets that can be formed, c0i;min and cfi;min, which minimize the number Sic of bits added in the ith column, given as the sum Sic = kc0i k + kcfi k; are adopted. Following the three-step procedure described above, the actual construction of a stage in the proposed architecture, is described. A stage is composed of a sequence of columns. The organization of each column exploits the sets c0i;min and cfi;min derived by the procedure above. The bits added at the ith column are the carry bits from the (i ; 1)st column according to the carry-save paradigm, input bits which belong to cfi;min sets, and bits produced by 2- or 3-input OR gates, each operating on a couple or triplet of c0i;min . An OR gate suces to add the bits of the particular couples or triplets. Therefore hardware is simpli ed in the ith column as an OR gate replaces a more complex FA or HA, as shown in Fig. 1. The replacement is feasible because, as two or more bits cannot be simultaneously asserted due to the de nition of c(it) and c(id) , carry generation logic is not required and it is omitted. Similarly the XOR gates conventionally employed to generate a sum digit [10], are reduced to OR gates since only one of the input bits can be asserted. In addition, hardware is simpli ed in the subsequent more signi cant columns, since the number of carries generated at the ith column is reduced and, therefore, fewer bits need to be processed.

Figure 1: The principle of reducing the number of adders in a column by exploiting the property (10). The column architecture i. is replaced by architecture ii., if (10) holds for the bit pairs (x; y), (c; d), and (a; b). The white box represents an FA, while the shaded box represents an HA.

Area Complexity

The numbers of FAs and HAs which form the ith column are given by   b + r ; 1 i i FA (15) Ni = 2 NiHA = hbi + ri ; 1i2 ; (16) where ri denotes the number of input carries. Since each FA and HA produces a carry bit, the number of carries, ri+1 , to the next more signi cant digital position, i.e., the (i + 1)st column, is computed as ri+1 = NiFA + NiHA with r0 = 0. The numbers of two- and three-input OR gates placed at the ith column are

\ NiOR2 =

c(id) c0i

(17)

NiOR3 =



c(it)

\

c0i

:

(18)

The organization of a column is shown in Fig. 1.ii. The total hardware complexity of a recursion is obtained through

Carea;k =

nX l ;1 ; i=0



NiFAaFA + NiHA aHA + NiOR2 aOR2 + NiOR3 aOR3 ;

(19)

where nl is the word length of the largest number produced at the particular recursion, and aFA , aHA , aOR2 , and aOR3 denote the complexity of an FA, an HA, and an OR gate of two and three inputs, respectively. To obtain the complete Phardware complexity, Ctot , the complexity of all recursions is added Ctot = rk=0 Carea;k ; where each Carea;k is computed as described by (19). To quantify the area complexity of the architecture, it is assumed that aFA = 28, aHA = 12, aOR2 = 6, and aOR3 = 8, since an FA requires 28 transistors, an HA is implemented with 12 transistors, while a 2- and a 3-input OR gate require 6 and 8 transistors respectively [11].

Time Complexity

Let Yk = ni=0;1 yk;i 2i , yk;i 2 f0; 1g, be an inputPword to the kth recursion. The kth recursion computes Yk+1 as Y k+1 = ni=0;1 yk;i 2i m = Pn ;1 Pn;1 Pn;1 l l i l=0 i=0 yk;i ql;i;k 2 ; where l=0 ql;i;k 2 = 2 m and ql;i;k 2 f0; 1g compose the related FBS fql;i;k g. Assume that the delays dki after which each bit yk;i is available, form the delay sequence Dk , Dk =  k input dn ;1 ; dkn ;2 ; : : : ; dk0 : In the following the derivation of the delay sequence Dk+1 of the output bits of the kth recursion from sequence Dk is described. The delays of the bits processed by the ith column of the adder array, are organized into a sequence Dik for modulo m operation. Sequence Dik is de ned as Dik = fdxjx 2 cfi ; x 2 c0i g; where P

k

k

k

l

l

8 <

x = yk;l 2 cfi dx = : maxfdkl1 ; dkl2 g + tOR ; : (20) x = (yk;l1 ; yk;l2 ) 2 c0i k k k maxfdl1 ; dl2 ; dl3 g + tOR ; x = (yk;l1 ; yk;l2 ; yk;l3 ) 2 c0i The delay of the carry and sum output of the j th adder in the ith column, is computed from ti;j = tFA + max fta ; tb ; tc g ; where tFA is the delay of an 1-bit FA, 0  j < NiFA + NiHA and ta ; tb , and tc are the delays of the inputs to the j th adder, which can be any of the sum bit of the adder above with a delay ti;j;1 , if any, carry bits from the (i ; 1)st column of delay ti;1;j;h , 0  h, and any of the input bits the delays of which are contained in Dik . The constraint h  0 re ects the carry-save organization of the array. It should dkl ;

be stressed that the column delay is minimized by assigning the input bits of longer delay to adders located low in the ith column. The delay of the ith column is dki +1 = ti;0 and the delays of all columns which compose the knth recursion form the output o delay sequence Dk+1 , (k+1) ; : : : ; d(k+1) : The delay C described as Dk+1 = d(nk+1) ; d time of the 0 +1 ;1 n +1 ;2 complete multiplier architecture is the maximum value in the delay sequence of the nal (rth) recursion, Ctime = d(nr);1 . k

k

r

PERFORMANCE COMPARISON

The complexity of the proposed multiplier is compared to the complexity of RNS multiplier architectures previously reported in the literature. The performance of the proposed multiplier is compared to the performance of an FA-based multiplier architecture similar to the multiply-add architecture proposed by Stouraitis et al. [7], as shown in Table I. It is found that, in certain cases, the proposed architecture demonstrates up to 70% area complexity reduction, when compared to the FA-based architecture, without imposing any additional delay. The performance comparison to the pseudo-RNS multiplier [6] shown in Table I, reveals that the proposed architecture is more ecient in terms of area requirements, yet it is slower. However, the performance in terms of the AreaTime product criterion is in favor of the proposed

Figure 2: The proposed architecture for a modulo-5 multiplier. architecture. In particular, the architectures are compared in terms of the ratio p, given by p = (Cprop ; Cpseudo )=Cpseudo  100%; where Cprop denotes areatime complexity of the proposed architecture, while Cpseudo denotes the corresponding complexity of the pseudo-RNS multiplier. In particular, in certain cases, the proposed architecture is more than 74% smaller when compared to the pseudo-RNS multiplier. However, while in some cases the proposed architecture is faster, it can be up to 65% slower. As it can be seen in Table I, the areatime complexity of the proposed architecture is generally smaller and it can be up to 76% reduced compared to the corresponding complexity of a pseudo-RNS multiplier. It is noted that the pseudo-RNS multiplier can operate for more than one moduli. The proposed architecture is also compared to a direct ROM lookup table implementation, as shown in Table I. ROM complexity is quanti ed by adopting the ROM complexity model used by DiClaudio et al. [6], according to which a 2n W table has a complexity of 2n W gates and a delay of n 1-bit FAs. A gate is assumed to have a complexity of 6 transistors. It is found that the area complexity reduction is more than 99%, while the architecture can be up to ve times slower. However, the AreaTime complexity reduction exceeds 99%. The proposed multiplier is nally compared to the sub-modular index transform-based multiplier of Radhakrishnan and Yuan [5], as shown in Table II. It is found that the proposed multiplier is smaller but of longer delay. However, in most cases of 5- and 6-bit prime moduli, the introduced technique produces multipliers of signi cantly lower area-time complexity.

CONCLUSIONS

A new architecture for a residue multiplier modulo m has been introduced in this paper. The proposed multiplier is derived by means of an introduced design methodology which eliminates unnecessary hardware complexity. The

m

Proposed A T tr. tg 260 24 310 28 556 54 506 42 496 30 450 44 556 34 416 28 688 64 720 52 672 54 622 52 888 58 612 42 692 42 1360 86 1246 78 1040 62 956 60

FA based [7] ROM pseudo-RNS [6] A T AT A T AT A T AT tr. tg % tr. tg % tr. tg % 11 360 24 27.78 6144 8 83.2 1064 28 72.3 13 416 28 25.48 6144 8 76.3 1064 28 60.9 17 1456 64 67.78 30720 10 78.0 1680 36 ;12.0 19 1120 50 62.05 30720 10 87.4 1680 36 35.8 21 1008 36 58.99 30720 10 91.3 1680 36 55.6 23 924 46 53.42 30720 10 88.3 1680 36 40.3 29 700 34 20.57 30720 10 92.3 1680 36 60.7 31 516 28 19.38 30720 10 95.3 1680 36 76.1 37 2144 76 72.98 147456 12 97.5 2436 44 58.9 41 1408 60 55.68 147456 12 97.9 2436 44 65.1 43 1660 60 63.57 147456 12 97.9 2436 44 66.1 47 1380 58 59.59 147456 12 98.2 2436 44 69.8 53 1648 58 46.12 147456 12 97.1 2436 44 52.9 59 768 42 20.31 147456 12 98.6 2436 44 76.0 61 888 42 22.07 147456 12 98.4 2436 44 72.9 67 4604 96 73.54 688128 14 98.8 3332 52 32.5 71 3968 96 74.49 688128 14 99.0 3332 52 43.9 73 3380 70 72.75 688128 14 99.3 3332 52 62.8 79 2532 70 67.64 688128 14 99.4 3332 52 66.9 Table I: Performance comparison to FA-based, ROM, and pseudo-RNS multiplier for various moduli m. A is the area complexity as the number of required transistors, T is the delay in gate delays, and AT is the areatime complexity savings percentage (%) achieved. elimination becomes possible by suitably grouping the input bits into couples or triplets the maximum sum of the elements of which, does not exceed unity. Therefore, OR gates can be used for their processing, instead of more complex FAs and HAs. It has also been shown that the application of the introduced methodology leads to signi cant complexity savings over residue multiplier architectures published in the literature, in the areatime complexity sense. Hence, the proposed residue multiplier can provide an ecient solution for the DSP systems which employ residue arithmetic.

References [1] F. Taylor, \Residue arithmetic: A tutorial with examples," IEEE Computer, pp. 50{62, May 1984. [2] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, \A VLSI implementation of residue adders," IEEE Transactions on Circuits and Systems, vol. 34, pp. 284{288, Mar. 1987.

m Yuan-Radhakrishnan Proposed Savings (%) Area Time Area Time Area Time Areatime 17 2072 26 1254 54 65.23 ;51.85 ;25.70 19 1932 36 924 42 109.09 ;14.29 44.20 23 1932 36 820 44 135.61 ;18.18 48.13 29 1904 32 700 24 172.00 33.33 72.43 31 1932 36 516 28 274.42 28.57 79.23 37 4312 40 688 64 526.74 ;37.50 74.47 41 4284 36 720 52 495.00 ;30.77 75.72 43 4340 44 672 54 545.83 ;18.52 81.00 47 4340 44 622 52 597.75 ;15.38 83.06 53 4312 40 888 58 385.59 ;31.03 70.14 59 4340 44 612 42 609.15 4.76 86.54 61 4312 40 692 42 523.12 ;4.76 83.15 Table II: Performance comparison to Yuan-Radhakrishnan [5] multiplier. [3] N. Szabo and R. Tanaka, Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill, 1967. [4] G. A. Jullien, \Implementation of multiplication, modulo a prime number, with applications to number theoretic transforms," IEEE Transactions on Computers, vol. C-29, pp. 899{905, Oct. 1980. [5] D. Radhakrishnan and Y. Yuan, \Novel approaches to the design og VLSI RNS multipliers," IEEE Transactions on Circuits and Systems { Part II, vol. 39, pp. 52{57, Jan. 1992. [6] E. D. DiClaudio, F. Piazza, and G. Orlandi, \Fast combinatorial RNS processors for DSP applications," IEEE Transactions on Computers, vol. 44, pp. 624{633, May 1995. [7] T. Stouraitis, S. Kim, and A. Skavantzos, \Full adder-based arithmetic units for nite integer rings," IEEE Transactions on Circuits and Systems-II, vol. 40, pp. 740{745, Nov. 1993. [8] V. Paliouras and T. Stouraitis, \Multifunction architectures for RNS processors," IEEE Transactions on Circuits and Systems { Part II, vol. 46, pp. 1041{1054, August 1999. [9] D. J. Soudris, V. Paliouras, T. Stouraitis, and C. Goutis, \A VLSI design methodology for RNS full adder-based inner product architectures," IEEE Transactions on Circuits and Systems { Part II, Apr. 1997. [10] I. Koren, Computer Arithmetic Algorithms. Prentice-Hall, 1993. [11] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A systems perspective. Reading, MA: Addison Wesley, second ed., 1992.

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.