Energy efficient comparators for superscalar datapaths

June 19, 2017 | Author: Dmitry Ponomarev | Category: Distributed Computing, Computer Hardware, Computer Software, Low Power, Energy Dissipation, Energy efficient

IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 7, JULY 2004

Energy Efficient Comparators for Superscalar Datapaths

Dmitry V. Ponomarev, Member, IEEE, Gurhan Kucuk, Student Member, IEEE, Oguz Ergin, Student Member, IEEE, and Kanad Ghose, Member, IEEE

Abstract—Modern superscalar datapaths use aggressive execution reordering to exploit instruction-level parallelism. Comparators, either explicit or embedded into content-addressable logic, are used extensively throughout such designs to implement several key out-of-order execution mechanisms and support the memory hierarchy. The traditional comparator designs dissipate energy on a mismatch in any bit position. As mismatches occur with a much higher frequency than matches in many situations, considerable improvements in energy dissipation are to be gained by using comparators that dissipate energy predominantly on a full match and little or no energy on partial or complete mismatches. This paper makes two contributions. First, we introduce a series of dissipate-on-match comparator designs, including designs for comparing long arguments. Second, we show how comparators, used in modern datapaths, can be chosen and organized judiciously based on the microarchitectural-level statistics to minimize the energy dissipation. We use the actual layout data and the realistic bit patterns of the comparands (obtained from the simulated execution of SPEC 2000 benchmarks) to show the energy impact from the use of the new comparator designs. For the same delay, the proposed 8-bit comparators dissipate 70 percent less energy than the traditional designs if used within issue queues and 73 percent less energy if used within load-store queues. The use of the proposed 6-bit comparators within the dependency checking logic is shown to increase the energy dissipation by 65 percent on the average compared to the traditional designs. We also find that the use of a hybrid 32-bit comparator, comprised of three traditional 8-bit blocks and one proposed 8-bit block, is the most energy-efficient solution for the use in the load-store queue, resulting in 19 percent energy reduction compared to the use of four traditional 8-bit blocks used to implement a 32-bit comparator.

Index Terms—Energy-efficient comparators, low-power datapath.

The authors are with the Department of Computer Science, State University of New York at Binghamton, Binghamton, NY 13902-9000. E-mail: {dima, gurhan, oguz, ghose}@cs.binghamton.edu. Manuscript received 22 Nov. 2002; revised 23 June 2003; accepted 15 Oct. 2003. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 117820. 0018-9340/04/$20.00 © 2004 IEEE

1 INTRODUCTION

Today's superscalar microprocessors make extensive use of associative matching logic and comparators to support out-of-order execution and virtual memory mechanisms. Comparators, either explicit or embedded into content-addressable logic, are used in at least the following ways:

1. Within the wakeup logic of instruction issue queues (IQs) to allow results generated by function units to be picked up by instructions waiting in the IQ. The wakeup mechanism relies on associative comparisons of the source tags of the instructions in the issue queue against the result tags broadcast by the instructions selected for execution. A result tag may be either a physical register address of the result or the corresponding reorder buffer slot, depending on the implementation [11].

2. Within associatively addressed storage components, like the translation lookaside buffer (TLB), to quickly translate virtual page numbers to physical page numbers in parallel with the access of a physical address-tagged cache.

3. Within load-store queues (LSQ) to match the addresses of the pending load instructions against the addresses of previously dispatched store instructions and enable the data forwarding between the matching entries. In addition, comparators are also used within the LSQ to support data forwarding to the store instructions.

4. Within banks of instruction and data caches for some embedded CPUs like the Strong ARM SA 1100 (which uses a fully associative 256-entry cache bank).

5. Within the instruction decoding/dispatching/renaming logic to compare source and sink architectural register addresses to detect true dependencies within the group of codispatched instructions. The dependency information is used to maintain data flow based on the renamed registers (i.e., based on physical register addresses).

6. Within reorder buffers (ROBs), for associatively locating the most recent entry established for a physical register within the ROB. Here, physical registers are integrated into the ROB entries and register renaming is used to implement interinstruction data dependencies. The architectural register address is used as a key for associative matching. Such a mechanism avoids the need to maintain an explicit rename table.

7. Within CAM-based rename tables to associatively locate the most recently established physical register for a given architectural register, as used in the Alpha 21264 implementation [9].
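As an illustration of the wakeup comparisons described in item 1, the following minimal Python sketch models an issue queue in software. The names `IQEntry` and `wakeup`, the queue contents, and the broadcast tags are hypothetical and chosen only for illustration; the sketch assumes every source tag that is not yet ready is compared against every broadcast result tag, as parallel comparators would do.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IQEntry:
    """One waiting instruction: its source tags and a ready bit per source."""
    src_tags: List[int]
    ready: List[bool] = field(default_factory=list)

    def __post_init__(self):
        if not self.ready:
            self.ready = [False] * len(self.src_tags)


def wakeup(queue, result_tags):
    """Compare each not-yet-ready source tag with each broadcast result tag.

    Returns (matches, mismatches), the number of tag comparisons of each kind.
    A source that is already ready at the start of the call is skipped.
    """
    matches = mismatches = 0
    for entry in queue:
        for i, tag in enumerate(entry.src_tags):
            if entry.ready[i]:
                continue
            for result in result_tags:
                if tag == result:
                    entry.ready[i] = True   # operand becomes available
                    matches += 1
                else:
                    mismatches += 1
    return matches, mismatches


# Hypothetical queue of three instructions and two broadcast result tags.
queue = [IQEntry([5, 9]), IQEntry([3, 5]), IQEntry([7, 8])]
matches, mismatches = wakeup(queue, result_tags=[5, 12])
```

In this toy case, 2 of the 12 comparisons match and 10 mismatch, which is the kind of imbalance that motivates a comparator that dissipates energy on a match rather than on a mismatch.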

Published by the IEEE Computer Society

PONOMAREV ET AL.: ENERGY EFFICIENT COMPARATORS FOR SUPERSCALAR DATAPATHS

Fig. 1. Traditional pull-down comparator.

The traditional equality comparator circuit used for implementing associative logic in modern datapaths (or, for that matter, any digital comparison) is shown in Fig. 1 [5]. These so-called pull-down comparators pull down a precharged output, out, on a mismatch in any bit position when the evaluation signal (eval) goes high. The precharged output remains high on a match. Energy is thus dissipated on a mismatch. Little (due to leakage) or no energy dissipation occurs on a full match. In some situations where the circuit of Fig. 1 is used in superscalar datapaths, as exemplified above, mismatches occur with a much higher frequency than full matches. Consequently, significant energy savings can be realized in such cases if comparators and associative logic can be designed to dissipate energy only on a full match and little or no energy on a mismatch.

In this paper, we propose a comparator design that dissipates energy predominantly on a full match and little energy on mismatches or partial matches. This comparator, useful for comparing up to 8 bits, can be used in digital comparisons or integrated within RAM bitcells for implementing associative storage. We use SPICE measurements from actual layouts of the comparator in a 0.18 micron, 6-metal layer CMOS process, along with transition counts obtained from the simulated execution of the SPEC 2000 benchmarks, to show the impact of the new comparator in terms of energy dissipation in some of the key datapath components discussed earlier.

The comparator circuitry that we propose in this work was first introduced in [1]. In this paper, we make two contributions that significantly extend the work of [1]. First, we evaluate the use of this comparator in various places within the superscalar datapath, taking into account the actual bit patterns in the comparands as obtained from the cycle-accurate microarchitectural simulations. Second, we show how the new comparator can be used by itself or in conjunction with traditional pull-down comparators to compare longer operands ("comparands"). Again, layout data and data obtained from the simulated execution of SPEC 2000 benchmarks are used to show how energy savings can be realized in comparing longer arguments typical of load-store queues (where memory addresses have to be compared), caches, and TLBs.

The need for a power-efficient comparator was first mentioned in [3], but no specific circuit solution was proposed. To the best of our knowledge (including patent searches), the comparator design of [8] and the one given here are the only comparators that dissipate energy predominantly on a match. The problem with the domino-style comparator design of [8] is its high delay in a match condition. In [1], we demonstrated that the comparator design described in this paper is better, both in terms of power dissipation and delay, than the improved version of the comparator of [8]. In this paper, we avoid further comparisons and refer the reader to [1] for more discussions.

The rest of the paper is organized as follows: Section 2 reviews the traditional comparator design. In Section 3, we describe the new 8-bit dissipate-on-match comparator. Section 4 explores various ways to use 8-bit comparators to compare 32-bit long arguments. Our simulation methodology is presented in Section 5. We evaluate the energy impact of the proposed design for use within the issue queue, the load/store queue, and the dependency checking logic in Section 6, and our concluding remarks appear in Section 7.
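The motivation given in this section can be made concrete with a small event count. The Python sketch below is an illustration only: the function name `count_discharges` and the tag values are hypothetical, and it counts output discharge events without modeling how much energy each event costs, which in the paper comes from SPICE measurements.

```python
def count_discharges(pairs, width=8):
    """Count output discharge events for the two comparator styles.

    pairs: iterable of (a, b) comparand pairs, each a `width`-bit integer.
    Returns (on_mismatch, on_match): how often a dissipate-on-mismatch
    (pull-down) output would discharge versus a dissipate-on-match output.
    """
    mask = (1 << width) - 1
    on_mismatch = on_match = 0
    for a, b in pairs:
        if (a ^ b) & mask:
            on_mismatch += 1    # some bit differs: pull-down output discharges
        else:
            on_match += 1       # all bits equal: only the on-match output discharges
    return on_mismatch, on_match


# Hypothetical example: one key compared against eight stored tags.
key = 0x2A
tags = [0x2A, 0x2B, 0x10, 0xFF, 0x00, 0x2A, 0x7F, 0x3C]
on_mismatch, on_match = count_discharges((key, t) for t in tags)
```

For these eight comparisons, the pull-down style discharges six times and the dissipate-on-match style twice. The actual energy ratio depends on the per-event energies of the two circuits, which this count deliberately leaves out.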

2 TRADITIONAL COMPARATOR

Fig. 1 depicts a traditional 8-bit dissipate-on-mismatch comparator. The output is precharged and, in the evaluate phase, the output is pulled down on a mismatch in any bit position, causing energy dissipation. The n-transistor pulldown logic consisting of two parallel branches of three n-transistors in series is used to detect a mismatch in each bit position. A simpler variation of this structure is used within bitcells to implement content-addressable memories (CAMs), but the basic characteristics are still the same: energy dissipation occurs on a mismatch in any bit position. Notice that the effective output loading of traditional comparators is high, amounting to the diffusion capacitances of 2C n-transistors (where C = 8 is the number of bits compared). This results in a long response time and considerable power dissipation in the case of a mismatch. In the applications of these comparators to superscalar datapaths, where mismatches occur with a much higher frequency than matches, a great deal of energy is wasted.

Content-addressable memories (CAMs) also employ traditional dissipate-on-mismatch comparators that are embedded into the bitcells. Recent works have addressed the problem of minimizing energy dissipation in CAMs [10], [12]. In [12], the CAM words are effectively subbanked and searches proceed in subbank-by-subbank order. If a word-slice within a subbank does not match the corresponding bits in the search key, comparisons in slices of the same word in the following subbanks are disabled, thereby saving energy in extraneous comparisons. The approach of [10] extends this technique further. The common feature in both approaches is still reliance on dissipate-on-mismatch comparators. Our approach, in some respects, is similar to these two solutions but differs fundamentally from them in its use of dissipate-on-match comparators.
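The subbanked search attributed to [12] above can be sketched in software as follows. This is a minimal illustration based only on the description in this section, not the circuit of [12]: the function name `subbanked_search`, the 8-bit slice width, the choice to start from the least significant slice, and the example words are all assumptions.

```python
def subbanked_search(words, key, width=32, slice_bits=8):
    """Search `words` for `key`, comparing one slice (subbank) at a time.

    Returns (matching_indices, slice_comparisons). Once a slice of a word
    mismatches, the remaining slices of that word are not compared.
    """
    n_slices = width // slice_bits
    slice_mask = (1 << slice_bits) - 1
    matching = []
    comparisons = 0
    for index, word in enumerate(words):
        for s in range(n_slices):
            shift = s * slice_bits          # slice 0 is the least significant
            comparisons += 1
            if ((word >> shift) & slice_mask) != ((key >> shift) & slice_mask):
                break                       # later slices of this word are disabled
        else:
            matching.append(index)          # every slice matched
    return matching, comparisons


# Hypothetical 32-bit words searched with a hypothetical key.
words = [0x12345678, 0x12345600, 0xAB345678, 0x00000000]
matching, comparisons = subbanked_search(words, key=0x12345678)
```

Here only word 0 matches, and 10 slice comparisons are performed instead of the 16 needed if every slice of every word were compared. Each slice comparison that does run is still a dissipate-on-mismatch comparison, which is the point on which the approach of this paper differs.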

3 A DISSIPATE-ON-MATCH COMPARATOR

Fig. 2. An 8-bit dissipate-on-match comparator.

The proposed comparator design is shown in Fig. 2. In contrast to the approach proposed in [8], this design avoids the use of domino-style logic, thus drastically reducing delays, but still ensures that energy dissipation occurs mostly on a full match.

The circuit of Fig. 2 compares two 8-bit comparands, A7A6...A0 and B7B6...B0. P-transistor pass logic blocks (such as P in Fig. 2) compare two bits of the comparands at a time. A high voltage level Vs is passed on to the right by each of these P-transistor blocks when both input bits that they compare match. Each P-transistor block drives the gate of an n-transistor (such as Q1) that is part of a discharge path. The pass transistor logic shown within the grayed box in Fig. 2 passes a high logic level to the gate of the n-transistor Q1 when bits A7 and B7, as well as bits A6 and B6 of the comparands, match. The series pulldown structure consisting of the devices Q1, Q2, Q3, and Q4 thus conducts when all 8 bits of the comparands are equal. The output of this comparator, precharged to Vdd by Q0, is thus discharged when all bits of the comparands are equal and when the evaluate device, Q5, is on. The n-transistors Q6, Q7, Q8, and Q9 discharge any accumulated charges from the gates of Q1, Q2, Q3, and Q4, respectively, during the precharge phase and prevent the series structure from discharging the output due to the presence of accumulated charges at the gates of the series devices from past matches, possibly partial. The energy dissipated in discharging these gates is small compared to the energy dissipated on a full match.

The value of Vs can be set lower than the value of Vdd to reduce the charge stored at the gates of Q1, Q2, Q3, and Q4 in the case of partial matches. Since these gates have to be discharged during the precharge phase of the next cycle, lower Vs results in reduced energy dissipation in the process of such discharging. On the flip side, applying a voltage lower than Vdd to the gates of n-devices Q1, Q2, Q3, and Q4 increases the circuit delay on a match due to the longer time required to propagate the high voltage level through the P-structures to the gates of the cascaded n-transistors. Also, the control and usage of an extra voltage level (which can be either supplied from outside the chip or generated locally from Vdd) creates additional design complications. Analyzing the effects of various values of Vs on the delay of the circuit and its power dissipation is beyond the scope of this paper and we only consider the comparator operated at Vs equal to Vdd in the rest of this work.

The total number of devices used to implement the circuit of Fig. 2 (33 p-devices and nine n-devices) is essentially identical to what is needed to implement a traditional comparator (40 n-devices and one p-device). The data inputs to the comparator of Fig. 2 (bits A0 through A7, B0 through B7, and their complements) are driven after buffering at the beginning of a clock cycle. Because of this, together with the sizing of the transistors within the P-blocks and the use of a Vs value close or equal to Vdd, circuit malfunction due to charge sharing is avoided. Since the outputs of all enabled comparators are latched at the end of every cycle, keeper devices are not needed within the comparator.

A pleasant side effect of the use of the comparator of Fig. 2 within an issue queue has to do with current surges (di/dt). With the use of traditional comparators, most of the comparators draw a significant amount of energy from the power supply in every cycle as very few comparators match. With the use of the comparators of Fig. 2, where only the few matching comparators draw energy from the power supply in each cycle, the di/dt spike is drastically reduced.

Fig. 3. Timings of the traditional and the proposed comparators.

To better understand the operation of the proposed comparator, as well as its timing and power issues, we show the timing diagrams of the two comparator designs in Fig. 3. The bottom part of Fig. 3 depicts the operation of the traditional comparator, which operates in two phases: precharge and evaluate. For simplicity, the same signal can be used to drive the precharge and evaluate transistors and this is the situation shown in Fig. 3. The delay component that lies on the critical path in a typical cycle is the evaluation delay (teval). This is because this delay cannot be overlapped with any other useful activity in a cycle. The precharge time, on the other hand, can be used to provide and stabilize the new input data such that the inputs are ready when the evaluation signal rises. For example, if the comparators are used within the issue queue, then the source operand tags (which are the data inputs to the comparators) can be driven across the issue queue while the comparator is precharged. Therefore, teval defines the delay of the traditional comparator for all practical purposes.

Fig. 4. Variation of response time with Vs.

The top part of Fig. 3 shows the timings of the proposed comparator of Fig. 2. A separate discharge signal (dis in Fig. 2) is used to discharge the gates of the n-transistors to prevent false matches. After the discharge signal falls, the high voltage level Vs must be allowed to propagate to the gates of the n-devices before the evaluation signal rises. The new inputs must become ready and stable just before the discharge signal falls and they are not allowed to change until the evaluation signal goes down. Thus, the critical delay of the proposed comparator is the sum of teval and tprop. Notice that the discharge signal is not on the critical path since the inputs can still be changing when the discharge signal is high. Even though the propagation delay tprop is an additional overhead that is not present in the traditional design, the significantly lower evaluation delay of the proposed comparator (due to its much lower output capacitance compared to the traditional design) makes it possible to maintain the same comparator delay as that of the circuit of Fig. 1 (by ensuring that tprop + teval
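The logical behavior of the comparator of Fig. 2, as described in this section, can be sketched at the bit level. The following Python model is an assumption-laden illustration, not a circuit simulation: the function names `p_block_gates` and `evaluate` are hypothetical, the text states only that Q1 is driven by bit positions 7 and 6, and the assignment of positions (5, 4), (3, 2), and (1, 0) to Q2, Q3, and Q4 is assumed here. Voltages, delays, and charge sharing are not modeled.

```python
def p_block_gates(a, b):
    """Return the gate levels of Q1..Q4 for 8-bit comparands a and b.

    Each P-block compares two bit positions and passes a high level only
    if both positions match. Q1 is driven by bits 7 and 6; the pairing
    for Q2..Q4 (bits 5-4, 3-2, 1-0) is an assumption of this sketch.
    """
    diff = (a ^ b) & 0xFF
    gates = []
    for hi in (7, 5, 3, 1):
        pair_mask = 0b11 << (hi - 1)        # selects bit positions hi and hi-1
        gates.append((diff & pair_mask) == 0)
    return gates


def evaluate(a, b, eval_on=True):
    """Return (out, gates_charged) after the evaluate phase.

    out is the precharged output: it stays high (True) unless Q1..Q4 all
    conduct while the evaluate device Q5 is on. gates_charged is the number
    of gates among Q1..Q4 left high, which Q6..Q9 must discharge during the
    next precharge phase.
    """
    gates = p_block_gates(a, b)
    discharged = eval_on and all(gates)
    return (not discharged), sum(gates)


full_match = evaluate(0xA5, 0xA5)       # all four gates high, output discharges
partial_match = evaluate(0xA5, 0xA4)    # only bit 0 differs
full_mismatch = evaluate(0x00, 0xFF)    # no gate is charged
```

A full match returns `(False, 4)`, meaning the output is pulled low. The partial match returns `(True, 3)`: the output keeps its precharged level, but three gates hold charge that is removed in the next precharge phase, which is the small dissipation on partial matches noted above.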
