Optimal Final Carry Propagate Adder Design for Parallel Multipliers

July 7, 2017 | Autor: Harish Kittur | Categoría: Hardware architecture

Descripción

Optimal Final Carry Propagate Adder Design for Parallel Multipliers Ramkumar B and Harish M Kittur Abstract- Based on the ASIC layout level simulation of 7 types of adder structures each of four different sizes, i.e. a total of 28 adders, we propose expressions for the width of each of the three regions of the final Carry Propagate Adder (CPA) to be used in parallel multipliers. We also propose the types of adders to be used in each region that would lead to the optimal performance of the hybrid final adders in parallel multipliers. This work evaluates the complete performance of the analyzed designs in terms of delay, area, power through custom design and layout in 0.18 um CMOS process technology. Index terms – ASIC (Application Specific Integrated Circuit), optimal hybrid CPA, Parallel multiplier, low power, area efficient.

and BEC (Binary to Excess1) based adders called here as BEC Carry Select Adder (BCSLA), BEC Carry Save Adder (BCSA) and BEC Carry Look Ahead adder (BCLA)[4]-[10]. This paper is structured as follows; Section II deals with the design of the PPST based on Dadda algorithm and analysis of the signal arrival profile from the PPST. The analysis of the performance of various adders in terms of area, delay and power is in Section III. The equations for efficient partitioning of the multiplier region are developed in Section IV. The final adder design and ASIC implementation details are provided in Section V and VI respectively. Finally the work is concluded in Section VII.

I. INTRODUCTION

The critical signal path in a parallel multiplier is divided into three domains: AND gate array, PPST (Partial Product Summation Tree) and the final CPA. The delay introduced by the AND gate is relatively small compared to the other two components, especially for the large size multiplier. This delay component is also relatively independent of the size of the multiplier. The delay introduced by the PPST and the final CPA constitutes a dominant component of the delay in the multiplier [1]. Hybrid CPA have been proposed earlier with detailed investigations on the final addition of parallel multipliers [1]-[3]. It is well known that the signals applied to the inputs of the CPA arrive first at the ends of the CPA and the last ones are those in the middle of the CPA. So the determination of the exact arrival time to final adder is of prime importance in the design of the optimal final adder. We have therefore analyzed the arrival time from the PPST through layout implementation and based on those arrival times, the inputs has been applied to the 7 type of adders for 4 different bit sizes ( total of 28 adders). The analysis is done by using industry standard tool and based on the post layout simulation results we have designed the optimal final structure. The investigation includes 8 by 8, 16 by 16, 32 by 32 and 64 by 64 Dadda multiplier with the final adders being 16, 32, 64 and 128-bit Ripple Carry Adder (RCA), Carry Save Adder (CSA), Carry Select Adder (CSLA), Carry Look Ahead Adder (CLA) This work was supported in part by the Integrated Circuit Design Laboratories, VIT University, Vellore, India. 1 2 B.Ramkumar and Harish Kittur are with the VLSI division, School of Electronics Engineering, VIT University, Vellore, India. (email: 1 2 [email protected] ; [email protected] )

II.

ANALYSIS OF PPST SIGNAL ARRIVAL PROFILE

The basic top-level implementation for N by N unsigned parallel multiplier without CPA is shown in Fig. 1. To analyze the exact arrival time from the PPST, the multiplier is implemented without CPA. Signal Buffering In order to determine typical signal arrival profile and drive strengths, D flip-flops are used on the primary inputs & outputs. D flip-flops drive multiple buffers to distribute input signals to N2 AND gates. Delay simulations were performed for each cell library to resolve, a) The maximum number of buffers that a single D flipflop can drive. b) The maximum number of AND gate inputs that a single buffer can drive. Multiplicand (A)

Multiplier (B) N

N D flip-flops

D flip-flops

CLK N

N Buffers

AND Gate Array

Partial Product Reduction

Ps……P1

P0

D flip-flops + load caps

Fig. 1. Top-level implementation of N by N multiplier without CPA

III. FINAL ADDER ANALYSIS

Once the partial product matrix has been reduced to a height of two, the final stage CPA length is determined in the Dadda approach as below, CPA length = 2N – 2 Fig.1 details the connection of a AND gate array (p0) and compression strategy (p1 to ps) to a D flip-flop and capacitive load which finds the arrival time from the inputs to the final CPA. The evaluated arrival profile is fixed as input delay to the CPA using timing constraint file as shown in Fig. 2.

The performance comparisons, of the 7-types of 16-bit adders for the 8 by 8 multiplier, in terms of output timing are shown in Fig. 3(b) and Fig. 3(c). The area and power comparison is shown in Fig. 3(d) and Fig. 3(e) respectively. These values are predicted from the post layout simulation of each adder. In each of the 3 region, we note that more than one adder is performing well with only slight differences. In the positive slope region of the comparison graph, the RCA and CSA are performing faster. The difference between the arrival times of each successive point in positive slope is greater than the carry propagation time of one full adder of an RCA. i.e., consider a full adder takes 0.24 ns to propagate the carry. But the difference between the arrival times of successive points in positive slope is more than 0.24 ns. So the RCA is sufficient in this region. Also due to the greater area and power requirement of CSA, the RCA is the best choice for the positive slope region. In the second region the arrival profile of the multiplier signals are relatively constant. So these bits are waiting for the carry inputs from the LSB side. Thus a faster adder is needed in this region. The BCSA and BCLA perform faster in this region, shown in Fig 3b. On comparing BCSA with BCLA, the BCLA is slightly faster than the BCSA and also the area and power of the

m u l_ 8 b it 5 4 Delay(ns)

Partial Product Reduction Tree The Wallace and Dadda methods are the popular partial product reduction algorithm for fast multipliers [11]-[12], but a closer examination of the delays and area within these two multiplier shows that the Dadda multiplier is slightly faster and area efficient than the Wallace multiplier [13]-[14]. For the reduction of the N by N partial matrix, Dadda proposes a sequence of matrix heights that are determined by working back from the final two-row matrix. In order to implement the minimum number of reduction stages, the height of each intermediate matrix is limited to the largest integer that is no more than 1.5 times the height of its successor. In this paper, the PPST is implemented based on the Dadda algorithm.

2 1

A. 8 by 8 Dadda multiplier The input arrival profile to the final CPA of 8 by 8 Dadda multiplier is shown in Fig. 3a. Based on the arrival profile it is divided into 3 regions, in which the first region has a positive slope from the point 0 to 5 and the second region has a constant region from point 6 to 14 and the third region has a negative slope which is point 15. Since the negative slope has only one point we can include this region within the second region.

0 1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

B it P o sitio n s

(a) a d d _ 1 6 b it

Delay(ns)

CLK

3

Setting the delay of Arrival Profile from partial products

9 8 7 6 5 4 3 2 1 0

RCA

C SA

C SLA

CLA

1

2

3

4

5

6

Ps……P1

7

8

9 10 11 12 13 14 15 16 17

Bit P o sitio n s

(b) D filp-flops a d d _ 1 6 b it

Ps……P1

P2n-2……P1 D filp-flops

Delay(ns)

Final Adder

8 7 6 5 4 3 2 1 0

BCSA

BCSLA

BCLA

1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17

B it P o sitio n s

Fig. 2. Arrival profile evaluation to the CPA for N by N multiplier

(c)

m u l_ 1 6 b it 7

7000

6

6000

5

Delay(ns)

area(um2)

a d d _ 1 6 b it 8000

5000 4000 3000

4 3 2

2000

1

1000

0 1

0 RCA

CS A

C S LA

C LA

B C S LA

B CS A

3

5

7

9

13

15

17

19

21

23

25

27

29

31

B it P o sitio n s

(d)

(a) add_32bit

a d d _ 1 6 b it 14

0.9 0.8 0.7 0.6 0.5

12

RCA

10

Delay(ns)

power(mW)

11

B C LA

0.4 0.3 0.2 0.1 0

8

CSA

6 4

CSLA

2 0 1 RCA

CS A

C S LA

C LA

B C S LA

B CS A

3

5

7

9

11 13 15 17 19 21 23 25 27 29 31 33

B C LA

Bit Positions

(e)

(b) add_32bit 10 8

Delay(ns)

Fig. 3. Final adder analysis for 8 by 8 Dadda multiplier: (a) arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis.

CLA

BCSA

6 4

BCSLA

2 0

BCLA are lesser than the BCSA. So we conclude that for the 8 by 8 multiplier the hybrid adder should have RCA for the region 1 and BCLA for the region 2.

BCLA

1

3

5

7

9

11 13 15 17 19 21 23 25 27 29 31 33 Bit Positions

(c) a d d _ 3 2 b it 16000 14000

area(um2)

12000 10000 8000 6000 4000 2000 0 RCA

CS A

C S LA

C LA

B C S LA

B CS A

B C LA

B CS A

B C LA

(d) a d d _ 3 2 b it 2.5 2

power(mW)

B. 16 by 16 Dadda multiplier The input arrival profile for the final adder of 16 by 16 Dadda multiplier is shown in Fig. 4(a). Based on the arrival profile, here also we can divide this in to 3 regions, in which the first region has a positive slope from the point 0 to 6 and the second region has a constant region from point 7 to 25 and the third region has a negative slope from the point 26 to 31. Since the negative slope region has more than a few bits we consider this a separate region. The performance comparisons, of 7-types of 32-bit adders for the 16 by 16 multiplier, in terms of output timing are shown in Fig. 4(b) and Fig. 4(c). The area and power comparisons are shown in Fig. 4(d) and Fig. 4(e) respectively. In the positive slope region, here also the carry propagation time of one full adder is smaller than the arrival inputs between successive points of the positive slope region. Thus the RCA is again the best choice in this positive slope region. In the second region, we find three adders are working faster with only slight difference. They are CSLA, BCSA and BCSLA. The BCSA works faster than the CSLA and BCSLA in this region. But due to increase in the area and power of the BCSA, we can choose one of the adders from the CSLA and BCSLA for the second region. On comparing these two adders, the CSLA is slightly faster than the BCSLA

1.5 1 0.5 0 RCA

CS A

C S LA

C LA

B C S LA

(e)

Fig. 4. Final adder analysis for 16 by 16 Dadda multiplier: (a) arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis.

Delay(ns)

18 16 14 12 10 8 6 4 2 0

BCSA

BCSLA

BCLA

1 4

7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 Bot Positions

(c) a d d _ 6 4 b it 350 00 300 00 25000

area(um2)

C. 32 by 32 Dadda multiplier The input arrival profile to the final adder of 32 by 32 Dadda multiplier is shown in Fig. 5(a). Based on the arrival profile, here also we can divide this in to 3 regions, in which the first region has a positive slope from the point 0 to 15 and the second region has a constant region from point 16 to 52 and the third region has a negative slope from the point 53 to 63. The performance comparison of 7-types of 64-bit adders for the 32 by 32 multiplier in terms of output timing is shown in Fig. 5(b) and Fig. 5(c). The area and power comparison is shown in Fig. 5(d) and Fig. 5(e) respectively. In the positive slope region, here also the carry propagation time of one full adder is smaller than the arrival inputs between each successive point in the positive slope region. Thus the RCA is again best suited in this positive slope region. In the second region, we find four adders are working fast with only slight difference. They are CLA, CSLA, BCSA and BCSLA. The CLA is faster only up to middle of the second region (approximately 20 bits of 40 bits). The others perform faster with slight difference in the entire second region. Since the CLA is faster only half part of the second region, we can omit CLA in this region. So we have to choose one of the remaining adders. The BCSA performs faster than the CSLA and BCSLA in this region. But it leads to larger power and area. So we need to choose either CSLA or BCSLA. As explained earlier, the BCSLA is slightly slower than the CSLA, but it saves more area and power. So we suggest that the BCSLA is best in the entire region2. In the third

add_64bit

20000 15000 10000 5000 0 R CA

CS A

CS LA

C LA

B CS LA

B CS A

B CLA

(d)

ad d _32b it 4 3.5 3 power(mW)

and it leads to more area and power than the BCSLA. So we suggest that the BCSLA be used for the second region. In the third region, the BCSA and BCLA are working faster. The BCLA saves more area and power than the BCSA; we can therefore choose the BCLA for the entire third region.

2.5 2 1.5 1 0.5 0 RCA

CS A

CS LA

CLA

B CS LA

B CS A

B CLA

(e) Fig. 5. Final adder analysis for 32 by 32 Dadda multiplier: (a). arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis.

m u l_ 3 2 b it

Delay(ns)

8 7 6 5 4 3 2 1 0

region, the BCSA and BCLA perform faster with only slight differences. But the BCLA is faster than the BCSA in this region. Also the BCLA requires lesser area and power than the BCSA. So we can suggest BCLA be used for the third region. 1

4

7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 B it P o sitio n s

(a) add_64bit 25

Delay(ns)

20

RCA

15 CSA

10 CSLA

5 0 1 4

7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 Bot Positions

(b)

CLA

D. 64 by 64 Dadda multiplier The input arrival profile to the final adder of 64 by 64 Dadda multiplier is shown in Fig. 6(a). Based on the arrival profile, here also we can divide this into 3 regions, in which the first region has a positive slope from the point 0 to 29 and the second region has a constant region from point 30 to 112 and the third region has a negative slope from the point 113 to 127. The performance comparisons, of 7-types of 128-bit adders for the 64 by 64 multiplier, in terms of output timing are shown in Fig. 6(b) and Fig. 6(c). The area and power comparison is shown in Fig. 6(d) and Fig. 6(e) respectively. In the positive slope region, here also the

a d d _ 1 2 8 b it 8 7 6 power(mW)

carry propagation time of one full adder is smaller than the arrival inputs between each successive point in the positive slope region. Thus the RCA is again the best in this positive slope region. In the second region, we can find three adders are working fast with only slight difference. They are CSLA, BCSA and BCSLA. The CLA is faster only up to the quarter part of the second region (approximately 20 bits of 80 bits). The BCLA is faster from the last quarter part of the second region to the entire remaining third region. Since the effect of CLA and BCLA is on only certain parts of the second region, we have omitted these two adders in this region. The BCSA works faster than the CSLA and BCSLA in this region. However due to more area and power requirement of BCSLA, we have to choose one of the adders from CSLA and BCSLA.

5 4 3 2 1 0 R CA

CS A

C S LA

C LA

B C S LA

B CS A

B CLA

(e) Fig. 6. Final adder analysis for 64 by 64 Dadda multiplier: (a). arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis.

mul_64bit 10

Delay(ns)

8 6 4 2 0 1

7

13 19 25 31 37 43 49 55 61 67 73 79 85 91 97 103 109 115 121 127 Bit Positions

As explained earlier, the BCSLA is slightly slower than the CSLA but it saves more area and power. So we can conclude that the BCSLA is suitable for this region. In the third region, the BCSLA and BCLA perform faster with slight difference. But the BCLA is faster than the BCSLA in this region. Also the BCLA requires lesser area and power than the BCSA. So we choose BCLA for the third region.

(a) add_128bit

IV.

EFFECTIVE PARTITIONING OF MULTIPLIER REGIONS

35

Delay(ns)

30

RCA

25 20

CSA

15 10

CSLA

5 0 1

8

15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127

CLA

Bit Positions

(b) add_128bit 30

BCSA

Delay(ns)

25 20 BCSLA

15 10 5

BCLA

0 1

8

From the analysis of four operand size multipliers, first we can propose some simple equations to partition the 3 regions of the multiplier. The Table I (exact) is derived from the exact arrival bit positions to final adder. By shifting one or few bits from each region, we can develop equations for partitioning of the 3 regions. The Table II (approx) shows the contents of Table I (Exact) after modification of the bit width of each region. From the Table I and Table II, we can conclude for Nbit multiplier (n must be >=4), the first region has ~n/2 bits, the second region has ~n+2x , where x = 0 for 4 to 7 (i.e., n to 2n-1), x = 1 for 8 to 15, etc. and the third region has ~n/4 bits. These expressions for the width of the three regions will reduce the design time in findings the three regions of the final adder.

15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127 Bit Positions

(c) TABLE I NUMBER OF BITS IN 3 REGIONS (EXACT)

ad d _128b it 70000

area(um2)

60000 50000 40000

Mul size

Region1 Bit Width

Region2 Bit width

Region3 Bit width

8

0-5

6-14

15

16

0-6

7-25

26-31

32

0-15

16-52

53-63

64

0-29

30-112

113-127

30000 20000 10000 0 RCA

CS A

CS LA

(d)

CLA

B CS LA

B CS A

B CLA

TABLE II NUMBER OF BITS IN 3 REGIONS (APPROX) Mul size

Region1 Bit width

Region2 Bit width

Region3 Bit width

8

0-3=4

4-13 = 10

14-15=2

16

0-7=8

8-27 = 20

28-31=4

32

0-15=16

16-55 = 40

56-63=8

64

0-31=32

32-111 = 80

112-127=16

0.18um technology. The synthesized Verilog netlist and their respective design constraints file are imported to Cadence SoC Encounter and are used to generate automated layout from standard cells and placement & routing [16]. Parasitic extraction is performed using Encounter’s Native RC extraction tool. The extracted parasitic RC (SPEF format) is back annotated to Common Timing Engine in Encounter Platform and analyzed for static timing delay. The power analysis is done using Virtuso Ultrasim [17]. VII.

CONCLUSION

` V.

FINAL ADDER DESIGN

From the analysis of 28 adders, the RCA is the best choice in the positive slope region. In the second region, there are three adders giving good performance. They are BCSA, BCSLA and BCLA. The overall adder structure will uses a hybrid adder structure, because each region uses a unique adder for giving better performance. If we make a hybrid structure in one region, the layout structure becomes more complex. So to avoid complexity, we can use a unique type of adder structure for each region. From the above analysis, the BCSLA gives optimal performance than the BCSA and BCLA. So the BCSLA is the best choice in the second region. In the third region, both the BCSA and BCLA perform faster. Due to the large area and power requirement of BCSA, the BCLA is the best adder in this region. So we can choose the BCLA for the third region. Thus we arrive at the optimal final adder structure for parallel multiplier as shown in Fig. 7. The variable block BCSLA and BCLA is designed based on square-root method [15].

Inputs from the third region

Inputs from the second region

Inputs from the first region

n+2x

n/4

In this work we have obtained simple equations to obtain the bit size of the three different regions (positive slope, constant, negative slope) of the input arrival profile to the final CPA and analysis has been made on each region in terms of area, delay and power with various standard adders to suggest the structure of the optimal final adder for parallel multipliers. From the observed analysis results, the RCA, BCSLA & BCLA provide the optimal performance for positive slope region (width n/2), constant region (width 5n/4) and negative region (width n/4) respectively. ACKNOWLEDGMENT

This work was supported in part by the Integrated Circuit Design Laboratories, VIT University, Vellore, India. REFERENCES [1]

[2]

n/2

[3] Variable Block BCLA

Variable Block BCSLA

RCA

[4] Variable Block CLA

0

BEC with multiplexer

Variable Block RCA

0

[5]

BEC with multiplexer

n+2x

(n/4)+1

[6]

n/2

[7] Fig. 7. Optimal final adder for all the three regions. [8] VI.

ASIC IMPLEMENTATION

The designs proposed in this paper have been developed using Verilog-HDL and synthesized in Cadence RTL compiler using typical libraries of TSMC

[9]

Vojin G. Oklobdzija, Improving Multiplier Design by Using Improved Column Compression Tree and Optimized Final Adder in CMOS Technology, IEEE transactions on Very Large Scale Integration (VLSI) systems, vol. 3, no. 2, June 1995. P.F. Stelling and V.G. Oklobdzija, "Design strategies for optimal hybrid final adders in a parallel multiplier," Journal of VLSI Signal Processing, Vol. 14, no.3, pp.321-31, 1996. Wen-Chang Yeh and Chein-Wei Jen, “High-Speed Booth Encoded Parallel Multiplier Design”, IEEE Transactions on Computers, Vol. 49, No. 7, July 2000. B. Parhami, Computer Arithmetic, Algorithm and Hardware Design, Oxford University Press, New York, pp. 91-119, 2000. Jaehong Park,Hung C.Ngo, Joel A. Silberman and Sang H. Dhong, "470ps 64bit Parallel Binary Adder," Symposium on VLS[ Circuits Digest of Technical Papers, pp. 192-193, 2000. Y. He, C.H. Chang, and J. Gu, "An area efficient 64-bit square root carry-select adder for low power applications", IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 4082 - 4085, 2005. Jin-Fu Li, Jiunn-Der Yu, and Yu-Jen Huang, "A Design Methodology for Hybrid Carry-Lookahead/Carry-Select Adders with Reconfigurability", IEEE International Symposium on Circuits and Systems (ISCAS 2005), 23-26, May 2005, R.P.P. Singh, Parveen Kumar and Balwinder Singh, "Performance Analysis Of Fast Adders Using VHDL",IEEE International Conference on Advances in Recent Technologies in Communication and Computing,189-193, 2009. B.Ramkumar and Harish M Kittur, “ Low Power and Area Efficient Carry Select Adder", IEEE Transactions on Very Large Scale Integration (VLSI) systems, accepted for publication DOI:10.1109/TVLSI.2010.2101621

[10] B.Ramkumar, Harish M Kittur and P.Mahesh Kannan, “ ASIC Implementation of Modified Faster Carry Save Adder”, European Journal of Scientific Research, Vol 42, Issue 1, 2010. [11] C. S. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Transactions on Electronic Computers, vol. EC-13, pp. 14-17, 1964. [12] Luigi Dadda, “Some Schemes for Parallel Multipliers,” Alta Frequenza, vol. 34, pp. 349-356, August 1965. [13] K.C. Bickerstaff, E.E. Swartzlander, M.J. Schulte, Analysis of column compression multipliers, Proceedings of 15th IEEE Symposium on Computer Arithmeitc,2001. [14] Townsend.Whitney J, Swartzlander Earl E. Jr and Abraham.Jacob A, "A comparison of Dadda and Wallace multiplier delays", Advanced Signal Processing Algorithms, Architectures, and Implementations, Proceedings of the SPIE, Volume 5205, pp. 552-560, 2003 [15] Y. He, C. H. Chang, and J. Gu, “An area efficient 64-bit square root carry-select adder for lowpower applications,” in Proc. IEEE Int. Symp.Circuits Syst., 2005, vol. 4, pp. 4082–4085. [16] Cadence, “Encounter user guide”, Version 6.2.4, March 2008. [17] Virtuoso® Ultrasim User Guide, Version 6.1, November, 2006.

Lihat lebih banyak...

Optimal Final Carry Propagate Adder Design for Parallel Multipliers

Descripción

Comentarios