An energy-efficient heterogeneous CMP based on hybrid TFET-CMOS cores

August 17, 2017 | Author: Asit Kumar Mishra | Category: Field effect transistors, Tunneling, Energy efficient, Logic Gates, Integrated Circuit, Multiprocessor System on Chip (MPSoC), Low voltage


Vinay Saripalli†, Asit K. Mishra†, Suman Datta§ and Vijaykrishnan Narayanan†
The Pennsylvania State University, University Park, PA 16802, USA
† {vxs924,amishra,vijay}@cse.psu.edu, § [email protected]

ABSTRACT
The steep sub-threshold characteristics of inter-band tunneling FETs (TFETs) make them an attractive choice for low-voltage operation. In this work, we propose a hybrid TFET-CMOS chip multiprocessor (CMP) that uses CMOS cores for higher voltages and TFET cores for lower voltages by exploiting differences in application characteristics. Building from device characterization to the design and simulation of TFET-based circuits, our work culminates in a workload evaluation of various single- and multi-threaded applications. Our evaluation shows the promise of a new dimension to heterogeneous CMPs to achieve significant energy efficiencies (up to 50% energy benefit and 25% ED benefit with single-threaded applications, and 55% ED benefit with multi-threaded applications).

Categories and Subject Descriptors: C.1.3 [Processor Architectures]: Other Architecture Styles - Heterogeneous (hybrid) systems; B.7.1 [Integrated Circuits]: Types and Design Styles - Advanced technologies, Memory technologies
General Terms: Design, Experimentation, Performance.
Keywords: Tunnel FETs, Heterogeneous Multi-Core.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2011, June 5-10, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0636-2/11/06 ...$10.00.

1. INTRODUCTION
Power consumption is a critical constraint hampering progress towards more sophisticated and powerful processors. A key challenge in reducing power consumption has been lowering the supply voltage, because doing so either reduces performance (due to reduced drive currents) or increases leakage (when the threshold voltage is reduced simultaneously). The sub-threshold slope of the transistor is a key factor influencing leakage power consumption. With a steep sub-threshold device it is possible to obtain high drive currents (IOn) at lower voltages without increasing the off-state current (IOff). In this work, we propose the use of Inter-band Tunneling Field Effect Transistors (TFETs) [11] that exhibit sub-threshold slopes steeper than the theoretical limit of 60 mV/decade found in CMOS devices. Consequently, TFETs can provide higher performance than CMOS-based designs at lower voltages. However, at higher voltages, the IOn of MOSFETs is much larger than can be accomplished by the tunneling mechanism employed in existing TFET devices. This trade-off enables architectural innovations through the use of heterogeneous systems that employ both TFET and CMOS based circuit elements.

Heterogeneous chip-multiprocessors that incorporate cores with different frequencies, micro-architectural resources, and instruction-set architectures [6, 7] are already emerging. In all these works, the energy-performance optimizations are performed by appropriately mapping the application to a preferred core. In this work, we add a new technology dimension to this heterogeneity by using a mix of TFET and CMOS based cores. The feasibility of TFET cores is analyzed through design and circuit simulations of logic and memory components that utilize TFET-based device structure characterizations.

Dynamic voltage and frequency scaling (DVFS) is widely used to reduce power consumption. Our heterogeneous architecture extends the range of possible operating voltages by supporting TFET cores that are efficient at low voltages and CMOS cores that are efficient at high voltages. For an application that is constrained by factors such as I/O or memory latencies, low-voltage operation is possible while sacrificing little performance. In such cases a TFET core may be preferable. However, for compute-intensive, performance-critical applications, MOSFETs operating at higher voltages are necessary. Our study using two DVFS schemes shows that the choice of TFET or CMOS for executing an application varies based on the intrinsic characteristics of the application.
In a multi-programmed environment, which is common on platforms ranging from cell phones to high-performance processors, our heterogeneous architecture can improve energy efficiency by matching the varied characteristics of different applications. The emerging multi-threaded workloads provide an additional dimension to this TFET-CMOS choice. Multi-threaded applications with good performance scalability can achieve much better energy efficiency by utilizing multiple cores operating at lower voltages. While achieving energy efficiency through parallelism is not in itself new, our choice of TFET vs. CMOS for the application will change based on the actual voltage at which the cores operate and the degree of parallelism (number of cores). Our exploration shows that TFET-based cores become the preferred choice for multi-threaded applications from both an energy and a performance perspective.

[Figure 1 plot residue: panels (A) and (B) plot Ids (uA/um) on a logarithmic axis against Vgs [V] at Vds 0.05V and Vds 0.75V. Legend: Experiment [EOT 4.5nm], Simulation [EOT 4.5nm], Projected [EOT 0.5nm].]

[Figure 2 plot residue: panel (A) shows the small-signal model with Gate, Source and Drain terminals, the current source I(d,s,g), and the capacitances Cgs and Cgd, with $I_{gs} = d(C_{gs} V_{gs})/dt$ and $I_{gd} = d(C_{gd} V_{gd})/dt$; a further panel plots VOut [V] against VIn [V].]

Figure 1: (A) Comparison of experimental and simulated characteristics of single-gate In0.53Ga0.47As homojunction TFET (EOT 4.5nm) [11]. (B) Comparison of simulated characteristics of single-gate In0.53Ga0.47As homojunction TFET (EOT 4.5nm) and projected double-gate In0.53Ga0.47As homojunction TFET (EOT 0.5nm).

The rest of this paper is organized as follows. In Section 2, we introduce Tunnel FET device operation and modeling, and discuss III-V semiconductor-based TFETs. By comparing the transistor-level characteristics of TFETs with state-of-the-art MOSFETs, we identify the potential impact of III-V semiconductor-based TFETs at the architecture level. In Section 3, we demonstrate circuit modeling using TFETs, and compare the energy-delay performance of logic and memory elements for MOSFETs and HTFETs. In Section 4, we show the benefits of our heterogeneous multi-core. Finally, we conclude in Section 5.

2. TUNNEL FET CHARACTERISTICS

2.1 Device Modelling of Tunnel FETs
Since compact models for the transfer characteristics of TFETs have not been fully developed, we use the device simulator TCAD Sentaurus [15] to model the Id-Vg characteristics of TFETs. Fig 1(A) compares the experimental and simulated characteristics for a single-gate homojunction In0.53Ga0.47As TFET from [11], and shows a good match between the experimental and simulated curves. The parameters used for simulating the single-gate homojunction TFET are from [11]. By changing the gate oxide to high-κ (εox 21, TOx 2.5nm, EOT 0.5nm) and by using a double-gated structure (TBody 5nm), we obtain the projected characteristics of a homojunction TFET shown in Fig 1(B). To perform circuit simulations, we capture the transfer characteristics of the TFET obtained through device simulation across a range of voltages in a Verilog-A lookup table. The Ids(Vgs, Vds), Cgd(Vgs, Vds) and Cgs(Vgs, Vds) characteristics are captured in two-dimensional look-up tables for modeling TFETs. Fig 2(A) shows the Verilog-A small-signal model for TFETs, which uses the look-up tables for circuit simulation. Fig 2(B) and 2(C) show the Voltage Transfer Characteristics (VTC) and the transient output characteristic of an In0.53Ga0.47As homojunction TFET inverter (VCC 0.5V), which demonstrate the validity of the Verilog-A lookup-table-based method.
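The two-dimensional look-up of Ids(Vgs, Vds) described above can be illustrated with a minimal sketch. This is not the Verilog-A model used in the paper: the function names, the bilinear interpolation scheme, the clamping at the table edges, and the tiny grid of placeholder values are all assumptions made for illustration.

```python
from bisect import bisect_right


def _cell(axis, x):
    """Return (lower index, fractional position) of x within a sorted axis."""
    x = min(max(x, axis[0]), axis[-1])          # clamp to the tabulated range
    i = min(bisect_right(axis, x) - 1, len(axis) - 2)
    return i, (x - axis[i]) / (axis[i + 1] - axis[i])


def lookup(vgs_axis, vds_axis, table, vgs, vds):
    """Bilinearly interpolate table[i][j], sampled at (vgs_axis[i], vds_axis[j])."""
    i, tx = _cell(vgs_axis, vgs)
    j, ty = _cell(vds_axis, vds)
    low = table[i][j] * (1 - ty) + table[i][j + 1] * ty
    high = table[i + 1][j] * (1 - ty) + table[i + 1][j + 1] * ty
    return low * (1 - tx) + high * tx


# Placeholder grid and values, NOT device data from the paper.
VGS = [0.0, 0.5]
VDS = [0.0, 0.5]
IDS = [[0.0, 1.0],
       [2.0, 3.0]]

ids_mid = lookup(VGS, VDS, IDS, 0.25, 0.25)     # centre of the single grid cell
```

The same routine would serve the Cgd and Cgs tables. Because a drain current spans many decades across the gate-voltage range, a practical table might store the logarithm of Ids rather than Ids itself; that choice is left open here.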

Figure 2: Verilog-A small signal model used for TFET simulation. [Plot residue: Drain terminal label; a panel plots VIn and VOut [V] against Time [sec].]

2.2 Heterojunction Tunnel FETs
We consider a GaAs0.1Sb0.9/InAs heterojunction Tunnel FET (HTFET), and use the modeling technique described in Section 2.1 to obtain the transfer characteristics of the HTFET. A comparison of the homojunction TFET and the heterojunction TFET is shown in Fig 3. By using the HTFET, a higher IOn can be obtained because (1) InAs is a smaller band-gap material, and (2) the staggered P-N heterojunction provides a higher critical-field strength for efficient inter-band tunneling.

To understand the circuit-level implications of using HTFETs, we compare the IOn versus IOn/IOff characteristics of the transistor candidates by considering different operating points along the Id-Vg curve for a given VCC window, as shown in Fig 4. Fig 4(A) shows that at VCC 0.8V, the highest IOn and IOn/IOff ratio are provided by 22nm CMOS, making it the preferred device for operation at high VCC. However, at VCC 0.3V (Fig 4(B)), the CMOS device cannot provide both a good IOn and a good IOn/IOff ratio because of the 60 mV/decade limit on the sub-threshold slope. The In0.53Ga0.47As homojunction TFET can provide a good IOn/IOff but cannot provide a high IOn, since the homojunction does not allow a strong tunneling current. In contrast, the heterojunction TFET can provide a good IOn (due to the staggered P-N junction and the lower EG material) as well as a good IOn/IOff (due to the sub-60 mV/decade sub-threshold slope), making it the preferred device for operation at low VCC.
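The effect of the 60 mV/decade limit at a 0.3V window can be checked with a short calculation. The sketch below is an idealization and not part of the paper's methodology: it assumes the entire gate swing is spent at one fixed sub-threshold slope, which gives an upper bound on the achievable IOn/IOff. The function names are illustrative.

```python
import math


def max_on_off_ratio(swing_mv, slope_mv_per_decade):
    """Upper bound on IOn/IOff when the whole swing is spent at a fixed slope."""
    decades = swing_mv / slope_mv_per_decade
    return 10.0 ** decades


def average_slope_needed(swing_mv, on_off_ratio):
    """Average slope (mV/decade) needed to reach on_off_ratio within swing_mv."""
    return swing_mv / math.log10(on_off_ratio)


# A 300 mV window at the 60 mV/decade limit spans at most five decades.
cmos_bound = max_on_off_ratio(300, 60)

# Average slope implied by the 2 x 10^5 ratio quoted for the HTFET at VCC 0.3V.
htfet_slope = average_slope_needed(300, 2e5)
```

Under this idealization, a device limited to 60 mV/decade cannot exceed an IOn/IOff of about 10^5 within a 0.3V window, and reaching that bound would leave no gate overdrive for a high IOn. The 2 × 10^5 ratio quoted for the HTFET in Section 3.1 corresponds to an average slope of roughly 57 mV/decade, which is below the 60 mV/decade limit.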

Figure 3: Comparison of heterojunction and homojunction TFET (band gap includes the quantization effect due to the double-gate structure with 5nm TBody). [Band-diagram labels: In0.53Ga0.47As P++ source and i channel, EG 0.88 eV; GaAs0.1Sb0.9 P++ source, EG 0.83 eV; InAs i channel, EG 0.53 eV.]

Figure 4: Comparison of IOn versus IOn/IOff ratio for different operating points on the Id-Vg curve for (A) a VCC window of 0.8V and (B) a VCC window of 0.3V. [Plot axes: IOn (uA/um) and IOn/IOff, both logarithmic; the preferred corner is marked in each panel. Legend: bulk-planar CMOS LG 22nm; GaAs0.1Sb0.9-InAs double-gate heterojunction TFET LG 22nm; In0.53Ga0.47As double-gate homojunction TFET LG 22nm.]

3. CHARACTERIZATION OF HTFET BASED LOGIC AND MEMORY

3.1 Tunnel FET Logic
We illustrate the energy-performance characteristics of logic gates constructed using CMOS transistors and HTFETs. We use a predictive BSIM model [18] for 22nm CMOS (VT 0.2V), which provides an IOn of 1.4 uA/um and an IOn/IOff of 3 × 10^3 when operating at its nominal VCC of 0.8V. We also use a GaAs0.1Sb0.9/InAs HTFET, which provides an IOn of 100 uA/um and an IOn/IOff of 2 × 10^5 at VCC 0.3V. To build logic gates, a pull-up device is also required. A p-HTFET can be constructed using a heterojunction with InAs as the source, as shown in Fig 5(C). When a positive gate and drain voltage are applied to the n-HTFET (Fig 5(A)), electrons tunnel from the GaAs0.1Sb0.9 source into the InAs channel (Fig 5(B)). In contrast, when a negative gate and drain voltage are applied to the p-HTFET (Fig 5(C)), holes tunnel into the GaAs0.1Sb0.9 channel from the InAs source.

By using the modelling techniques described in Section 2.1, we obtain the energy-delay characteristics of HTFET logic gates. The TCAD parameters used for simulating the HTFETs are shown in Table 1. The energy-delay performance curve of an HTFET 40-stage ring-oscillator, when compared to that of a CMOS ring-oscillator in Fig 6(A), shows a cross-over in the energy-delay characteristics. The CMOS ring-oscillator has a better energy-delay trade-off than the HTFET ring-oscillator at VCC > 0.65V, and the HTFET ring-oscillator has a better energy-delay trade-off at VCC < 0.55V. Other logic gates, such as Or, Not and Xor (not shown here), show a similar cross-over. This trend is consistent with the discussion in Section 2.2, where it was shown that CMOS devices provide better operation at high VCC and HTFETs are preferred at low VCC. Fig 6(B) shows that the energy-delay performance of a 32-bit prefix-tree based Han-Carlson adder has a similar cross-over behavior for CMOS and HTFETs.

Table 1: HTFET simulation parameters.

Parameter                                      GaAs0.1Sb0.9    InAs
Band-Gap, EG [eV] (Including Quantization)     0.84            0.53
Electron Affinity, χ0 [eV]                     3.96            4.78
Tunneling Parameter, APath                     1.58 × 10^20    1.48 × 10^20
Tunneling Parameter, BPath                     6.45 × 10^6     2.4 × 10^6
Tunneling Parameter, RPath                     1.12            1.13
Electron DoS, NC [cm^-3]                       4.59 × 10^19    8.75 × 10^16
Hole DoS, NV [cm^-3]                           6.84 × 10^18    6.69 × 10^18

The n-HTFET and p-HTFET are simulated separately using the Dynamic Non-local Tunneling model implemented in TCAD Sentaurus (Ver. C-2009-SP2). The work function used for the n-HTFET is 4.9 eV and for the p-HTFET is 4.66 eV. The temperature is set to 300 K and High-Field Velocity-Saturation is enabled.

Figure 5: (A-B) Double-Gate H-NTFET device structure and operation (C-D) Double-Gate HPTFET device structure and operation. [Structure labels: 22 nm intrinsic channel, 40 nm source and drain regions, 2.5 nm HiK gate dielectric, 5 nm body; electron and hole tunneling paths are marked. Doping labels as extracted: GaAs0.1Sb0.9 P++ Source 4x10^19 cm^-3; InAs N+ Drain 4x10^17 cm^-3; InAs N+ Source 1x10^18 cm^-3; GaAs0.1Sb0.9 P+ Source 1x10^19 cm^-3.]

Figure 6: Energy-Delay performance comparison for (A) a CMOS and an HTFET ring-oscillator and (B) a CMOS 32-bit and a TFET 32-bit Adder. [Panel (A) plots Energy [fJ] against Frequency [MHz] for the CMOS and HTFET ring oscillators; panel (B) plots Energy [fJ] against Delay [ns] for the 32-bit CMOS and HTFET adders at activity factors of 1 and 0.1. Points are labeled with VCC from 0.2V to 0.8V, and a leakage-energy region is marked.]

3.2 Tunnel FET Pass-Transistor Logic
Due to the asymmetric source-drain architecture (Fig 7(A-B)), HTFETs cannot function as bi-directional pass transistors. Though this may seem to limit the utility of TFETs in SRAM-cell design, several SRAM designs have been proposed to circumvent this problem [5, 13]. It is also important to consider a solution for logic, because of the ubiquitous usage of pass-transistors in logic design. We propose using a pass-transistor stack made of N-HTFETs, with a P-HTFET for precharging the output (Fig 7(C)). All the N-HTFET transistors in the pass-transistor stack are oriented toward the output, which allows them to drive the On current when the input signals are enabled. During the precharge phase, the P-HTFET precharges the output to VCC, and during the evaluate phase, the N-HTFET stack evaluates the output based on the inputs to the pass-transistor stack.
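The precharge and evaluate phases of the proposed pass-transistor mux can be pictured with a small behavioral sketch. It is a logic-level idealization under stated assumptions, not a circuit model from the paper: each selected N-HTFET is assumed to conduct in one direction only, so it can pull the precharged output down toward a low input but can never pull the output up. The function name and the 0/1 encoding are illustrative.

```python
def mux_output(bits, selects, precharged=True):
    """Logic-level model of a precharge-then-evaluate pass-transistor mux.

    bits       -- input values BIT0..BIT(n-1), each 0 or 1
    selects    -- select lines SEL0..SEL(n-1), each 0 or 1
    precharged -- whether the output was precharged to VCC before evaluation
    """
    # Precharge phase: the P-HTFET sets the output high (modeled as 1).
    out = 1 if precharged else 0
    # Evaluate phase: a selected device can only discharge the output
    # (assumption: one-way conduction from the output toward a low input).
    for bit, sel in zip(bits, selects):
        if sel and bit == 0:
            out = 0
    return out


selected_high = mux_output([1, 0, 1], [1, 0, 0])                # selects BIT0 = 1
selected_low = mux_output([1, 0, 1], [0, 1, 0])                 # selects BIT1 = 0
no_precharge = mux_output([1, 0, 1], [1, 0, 0], precharged=False)
```

In this model the output follows the selected input only when it has been precharged. Without the precharge step, a selected high input cannot raise the output, which is the limitation that motivates the precharging device in the text.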

Figure 7: (A) Asymmetric source-drain architecture for a heterojunction NTFET and (B) Asymmetric ID-VD characteristics resulting from source-drain asymmetry. [Panel (B) plots IDS (uA/um) against Vds [V] for Vgs of 0.2V, 0.3V, 0.4V and 0.5V. Panel (C) shows the precharge-based pass-transistor mux with inputs BIT0 to BIT(n-1), select lines SEL0 to SEL(n-1), a Precharge device connected to Vcc, and a shared Output.]

3.3 Tunnel FET SRAM Cache
In order to model TFET-based processor architectures, it is important to consider the characteristics of the L1 cache, which is an integral on-chip component of a processor. We use the analytical method implemented in the cache analysis tool CACTI [16] to evaluate the energy-delay performance of a TFET-based cache. As discussed in Section 3.2, to overcome the problem of asymmetric conduction in TFETs, a precharge-based pass-transistor mux (Fig 7(C)) is implemented in CACTI, and we also assume a 6-T SRAM cell with virtual ground [13]. To compute the gate delay, CACTI uses the Horowitz approximation [3], given by

$$\tau_{Delay} = \tau_f \sqrt{(\log V_s)^2 + 2\,(\tau_{in}/\tau_f)\, b\,(1 - V_s)}$$

where τf = REff × [CLoad + CEff] is the time constant and τin is the input ramp time. REff and CEff are estimated using simulations as described in [10], which take into account the enhanced Miller capacitance in TFETs resulting from the presence of a tunnel junction between the source and the channel. To validate the Horowitz model for TFETs, we compare the TFET-inverter delay from the Horowitz analytical expression with the delay estimated using the Verilog-A table-lookup model for different input ramp times (τin), and obtain a good match, as shown in Fig 8.

We modified CACTI to implement the 6-T TFET SRAM cell design proposed in [13] and evaluated the energy-delay performance of a 32KB L1 cache with a 32-byte block size and an associativity of 2. Fig 9 shows that a cross-over point similar to that in logic exists for CMOS and TFET-based SRAM L1 caches. Due to the higher IOn/IOff ratio of TFETs, the TFET L1 cache has lower leakage power than the low-VT CMOS L1 cache.

Figure 9: (A) Energy-Delay performance comparison and (B) Leakage Power comparison for CMOS and H-TFET based L1 Cache. [Panel (A) plots Energy [pJ] against Delay [ns]; panel (B) plots Leakage [nW] against Vcc [V]. Points are labeled with VCC from 0.3V to 0.8V.]
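The Horowitz expression in Section 3.3 can be evaluated directly, as in the sketch below. Several things here are assumptions rather than statements from the paper: the logarithm is taken as the natural logarithm, the values chosen for Vs and b are placeholders (the text does not define them), and the REff and CEff numbers are hypothetical. Only the 2 fF load and the 2, 35 and 70 ps ramp times echo labels that appear in Fig 8.

```python
import math


def time_constant(r_eff, c_load, c_eff):
    """tau_f = R_eff * (C_load + C_eff), in seconds for ohms and farads."""
    return r_eff * (c_load + c_eff)


def horowitz_delay(tau_f, tau_in, v_s, b):
    """tau_f * sqrt(log(v_s)^2 + 2 * (tau_in / tau_f) * b * (1 - v_s)).

    The natural logarithm is assumed here.
    """
    ramp_ratio = tau_in / tau_f
    return tau_f * math.sqrt(math.log(v_s) ** 2 + 2.0 * ramp_ratio * b * (1.0 - v_s))


# Hypothetical device values, chosen only to exercise the formula.
tau_f = time_constant(r_eff=10e3, c_load=2e-15, c_eff=1e-15)   # 30 ps

# Delay for three input ramp times, with placeholder v_s and b.
delays = [horowitz_delay(tau_f, ramp, v_s=0.5, b=0.5)
          for ramp in (2e-12, 35e-12, 70e-12)]
```

With a zero input ramp the expression reduces to τf·|log Vs|, the familiar step-response delay of a single RC stage, and the delay grows as the input ramp slows.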

Figure 8: Validation of Horowitz approximation for TFET inverters, (A) Delay calculated using Horowitz Analytical Expression and (B) Delay calculated using Verilog-A Simulation. [Both panels plot Delay [ps] against VCC [V] from 0.25 to 0.75 for CLoad of 2 fF and 0.1 fF, each with 70 ps, 35 ps and 2 ps input ramps.]

4. ARCHITECTURAL ANALYSIS OF CMOS AND H-TFET CORES
The detailed processor and cache parameters for simulating single-core processors using Simics [9] are shown in Table 2. The delay and power numbers for each voltage/frequency pair obtained using circuit simulations are incorporated into our simulator. We evaluate both single-threaded (SPEC 2006) and multi-threaded (SPLASH) applications. For power analysis, we use a utilization-based approach. The utilization is monitored by tracking the execution and stall cycles of the processor using Simics. For the execution cycles, the dynamic energy is modeled assuming that 10% of the overall 20M gates in our core switch (typical switching activity in logic-based data paths ranges from 10% to 15% [17], and the variations across instructions in commercial low-power cores are minimal [14]). Leakage power is consumed during both execution and stall cycles, and no power-gating is assumed. The cache power models are based on CACTI [16], which incorporates our modifications described in Section 3.3. For clarity, we highlight the results from 8 SPEC 2006 and 4 SPLASH benchmarks that capture the major trends observed across the suite.

Table 2: Simulation parameters.

Parameter                       Value
Processor Pipeline              Sun's SPARC based core
Issue width                     1
Fetch Queue                     32
L1 cache                        32KB, 2-way, 32B block
L2 cache                        2MB, 8-way, 64B block
Mem. Lat / Baseline Freq.       70 cycles / 2 GHz
Technology / Voltages           22 nm / VCC = 0.7V - 0.3V
DVFS Interval Period            200,000 Instructions

Figure 10 shows the different voltage-frequency coordinates that can be achieved for an H-TFET and a CMOS based processor respectively (with a minimum frequency of 500 MHz and frequency increasing in steps of 125 MHz). It is clear from this figure that H-TFETs are the preferred device when operating below 1250 MHz.

Figure 10: Voltage Frequency Operating Points for H-TFET and CMOS processors. [Plot axes: VCC [V] from 0.2 to 0.8 and Frequency [MHz]; series: CMOS, TFET.]

We consider a heterogeneous-technology asymmetric multi-core processor, with a TFET processor operating in the 1250-500 MHz frequency range, and a CMOS processor operating in the 0.7V to 0.5V range (frequency 1375-500 MHz). We then execute various benchmarks (SPLASH benchmarks are executed using a single thread) using (1) an energy-aware DVFS policy which seeks to minimize the ED^2 [8], and (2) a purely IPC-aware DVFS algorithm [12]. The energy-aware DVFS policy monitors whether the ED^2 in a DVFS interval (using the energy and delay incurred in executing 200,000 instructions) is better than in the previous interval. If so, it continues the voltage-frequency (VF) change in the same direction (either continuing to increase or to decrease); otherwise, the direction of the VF change is reversed. We find that, when using the energy-aware DVFS policy on TFETs, most of the applications spend a significant amount of time (close to 60%) in the 1000 MHz to 750 MHz range, whereas when using CMOS most of the applications execute in the 1375 MHz to 1250 MHz range (Figure 11(A)). As Figure 10 shows, the relationship between E and D^2 for TFET processors is non-linear, and the energy-aware DVFS algorithm sees a significant energy benefit when operating in these frequency ranges (1000-750 MHz) with TFETs. Consequently, there is a significant energy-delay benefit (average 50%) when using TFETs over the baseline CMOS based design (Figure 11(C)), but with a 40% cost in performance (Figure 11(B)).

Figure 11: (A) Frequency distribution (B) Normalized Delay and (C) Normalized EDP for Energy-Aware DVFS on benchmarks. [Panel (A) shows the share of time, 0% to 100%, spent at 1375, 1250, 1125, 1000, 875, 750, 625 and 500 MHz for CMOS and TFET. Benchmarks: BZIP, GCC, MCF, PERL, XALAN, OMNT, SJENG, GBMK, LU, OCEAN, FFT, RADIX. Legend entry: CMOS ED2 DVFS [1375-500 MHz].]

The IPC-aware DVFS algorithm, on the other hand, monitors the change in the IPC of the processor and ramps the frequency up or down by 125 MHz when it detects a 5% change in IPC. Figure 12(A) shows that the degradation in performance is less than 12% compared to baseline CMOS when using IPC-aware DVFS on TFET.
Figure 12(B) shows that the energy reduction is significant when using DVFS on TFETs due to the lower energy of lower-frequency modes in TFETs (energy reduction of 26% and ED reduction of 18% over baseline CMOS). Further, Figure 12(D) shows that there is a significant ED^2 reduction over baseline CMOS (up to 9% ED^2 benefit) for applications such as bzip, mcf and ocean. These applications have significant L2 miss-rates (shown in Figure 13) and consequently, the processors spend a lot of time stalling. Thus, using energy-oriented DVFS scaling on TFETs during these stall cycles gives a significant energy advantage compared to a CMOS based design. We therefore conclude that in a heterogeneous-technology asymmetric-performance multi-core processor, single-threaded applications with higher miss-rates are better suited for execution on TFETs with IPC-aware DVFS, since this results in a significant ED^2 advantage. Pure energy conservation is best achieved by executing the applications on the TFET processor with energy-aware DVFS. Applications such as sjeng, perl and radix, which have low cache miss-rates, are best executed on the CMOS processor with higher performance.

Figure 12: (A) Normalized Delay (B) Normalized Energy (C) Normalized ED and (D) Normalized ED2 for IPC-Aware DVFS on benchmarks. [Benchmarks: BZIP, GCC, MCF, PERL, XALAN, OMNT, SJENG, GBMK, LU, OCEAN, FFT, RADIX. Legend entries recoverable from the plot residue: CMOS @ 1375 MHz, CMOS @ 1250 MHz, TFET @ 1250 MHz, TFET AD DVFS [1250-500 MHz], CMOS AD DVFS [1375-500 MHz], TFET ED DVFS [1250-500 MHz].]
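The two DVFS policies described in this section can be summarized as per-interval decision rules. The sketch below is a reading of the prose, not the implementation from [8] or [12]. The function names, the index-based frequency table and the clamping at the ends of the range are assumptions, and so is the direction of the IPC-aware rule (raising the frequency when IPC rises and lowering it when IPC falls), which the text does not spell out.

```python
# TFET operating points from the text: 500 MHz to 1250 MHz in 125 MHz steps.
FREQS_MHZ = list(range(500, 1251, 125))


def clamp(index, n_levels):
    """Keep a frequency index inside the table."""
    return min(max(index, 0), n_levels - 1)


def energy_aware_step(index, direction, ed2_now, ed2_prev, n_levels):
    """ED^2 hill climbing: keep the direction if ED^2 improved, else reverse it.

    direction is +1 (raise frequency) or -1 (lower frequency).
    """
    if ed2_now >= ed2_prev:          # not better than the previous interval
        direction = -direction
    return clamp(index + direction, n_levels), direction


def ipc_aware_step(index, ipc_now, ipc_prev, n_levels, threshold=0.05):
    """Move one 125 MHz step when IPC changes by at least the threshold.

    Assumption: higher IPC raises the frequency, lower IPC reduces it.
    """
    change = (ipc_now - ipc_prev) / ipc_prev
    if change >= threshold:
        index += 1
    elif change <= -threshold:
        index -= 1
    return clamp(index, n_levels)


n = len(FREQS_MHZ)
# ED^2 improved while stepping down, so the policy keeps stepping down.
idx, direction = energy_aware_step(3, -1, ed2_now=0.9, ed2_prev=1.0, n_levels=n)
next_freq_mhz = FREQS_MHZ[idx]
```

Each function would be called once per DVFS interval (200,000 instructions in Table 2), with the energy, delay and IPC of the interval supplied by the simulator.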

Figure 13: Miss Rates for various benchmarks. [Plot: Misses-Per-Kilo-Instructions (0 to 50) for each benchmark on CMOS and TFET; series: L1 IC MPKI, L1 DC MPKI, L2 MPKI.]

Multi-core processors can be used to minimize energy consumption by scaling down the operating frequency and increasing thread-level parallelism in order to regain iso-performance with baseline CMOS (1-Core @ 1375 MHz), as shown in Figure 14(A). Figure 14(B) shows the energy consumption compared to baseline CMOS for parallel program execution on 2-Core CMOS and 2-Core TFET, at iso-performance with baseline CMOS. When moving from 1 to 2 cores, we observe almost linear performance scaling with the number of cores, which drops the required operating frequency for iso-performance below the CMOS-TFET cross-over point. Due to the energy advantage of TFET processors at lower frequencies, TFET processors have a distinct energy advantage in iso-performance multi-core execution, giving energy savings of 70% against single-core CMOS and 55% against 2-core CMOS. Our hybrid architecture provides additional energy efficiency for multi-threaded applications by scheduling performance-critical threads [1] on high-performance CMOS cores and non-critical threads on energy-efficient TFET cores. Such scheduling can exploit imbalance across threads due to application behavior [4].

Figure 14: (A) Illustration of normalized energy-delay for iso-performance for OCEAN benchmark and (B) Normalized multi-core execution energy for iso-performance to CMOS @ 1375 MHz. [Panel (A) plots Normalized Energy against Normalized Delay for 1-Core TFET, 1-Core CMOS, 2-Core TFET and 2-Core CMOS, with the 1375 MHz points, the direction of decreasing processor frequency and the iso-performance line marked. Panel (B) plots Normalized Energy for LU, FFT, OCEAN and RADIX on 1-Core CMOS, 2-Core CMOS, 1-Core TFET and 2-Core TFET.]
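The iso-performance argument above can be made concrete with a small calculation. The sketch assumes that throughput scales with the product of frequency and core count, which is the idealized "almost linear scaling" case and ignores memory-bound behavior. The function names and the rounding to the 125 MHz grid are illustrative; the 1375 MHz baseline, the 500 MHz minimum and the 1250 MHz preference threshold are taken from the text.

```python
import math


def iso_performance_frequency(base_freq_mhz, cores, parallel_efficiency=1.0):
    """Per-core frequency needed to match one core running at base_freq_mhz.

    Assumes throughput is proportional to frequency * cores * efficiency.
    """
    return base_freq_mhz / (cores * parallel_efficiency)


def next_operating_point(freq_mhz, min_mhz=500, step_mhz=125):
    """Round a required frequency up to the next available DVFS step."""
    steps = max(0, math.ceil((freq_mhz - min_mhz) / step_mhz))
    return min_mhz + steps * step_mhz


required = iso_performance_frequency(1375, cores=2)      # 687.5 MHz per core
operating_point = next_operating_point(required)          # next step on the grid
below_crossover = operating_point < 1250                  # TFET-preferred region
```

With two cores and ideal scaling, each core needs about half the baseline frequency, which rounds up to the 750 MHz step and falls well inside the range where Figure 10 indicates the H-TFET is preferred.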

5. CONCLUSIONS
In this work we show the effectiveness of a hybrid TFET-CMOS core for exploiting inter-application characteristics in multi-programmed workloads. Our proposal can also be used to exploit intra-application characteristics. This can be done by detecting phases in applications that would benefit from being scheduled on a CMOS core and phases that would benefit from being scheduled on a TFET core (through OS support). We also show that TFET cores become preferable in multi-threaded applications. Our future work will explore iso-performance scenarios to achieve the performance of multiple CMOS cores. Our initial results from this architectural study indicate promise in all these directions.

In order to evaluate the benefits of a heterogeneous architecture, nominal TFET devices without any variation at the device level were assumed in this paper. However, TFETs for a sub-22nm technology node operating at low VCC can be susceptible to process-induced variations. A study describing the impact of process-induced variations on the characteristics of TFET-based circuits needs to be conducted. Further, the demonstration of such a heterogeneous system requires monolithic fabrication of GaAs0.1Sb0.9/InAs HTFETs and Si CMOS on a single chip on a silicon substrate, which requires further developments in process technology before such a system can be experimentally demonstrated. At present, such material/process integration has already been demonstrated through experimental fabrication of Indium Gallium Arsenide-based Quantum-Well FETs on a silicon substrate [2]. The same integration scheme can be applied to fabricate heterojunction TFETs monolithically on a silicon substrate.

6. ACKNOWLEDGMENTS
This work is supported in part by the Semiconductor Research Corporation Nanoelectronics Research Initiative, the National Institute of Standards and Technology through the Midwest Institute for Nanoelectronics Discovery (MIND), and by NSF Grants 0829926, 0903432 and 0916887.

7. REFERENCES
[1] M. Aater Suleman, O. Mutlu, M. Qureshi, and Y. Patt. Accelerating Critical Section Execution with Asymmetric Multicore Architectures. IEEE Micro, 2010.
[2] S. Datta, G. Dewey, J. Fastenau, M. Hudait, D. Loubychev, W. Liu, M. Radosavljevic, W. Rachmady, and R. Chau. Ultrahigh-Speed 0.5V Supply Voltage In0.7Ga0.3As Quantum-Well Transistors on Silicon Substrate. IEEE Electron Device Letters, 2007.
[3] M. A. Horowitz. Timing Models For MOS Circuits. Technical report, US Army Research Office, 1994.
[4] I. Kadayif, M. Kandemir, and I. Kolcu. Exploiting Processor Workload Heterogeneity for Reducing Energy Consumption in Chip Multiprocessors. In Proc. of the Design, Automation and Test in Europe Conference and Exhibition, 2004.
[5] D. Kim, Y. Lee, J. Cai, I. Lauer, L. Chang, S. J. Koester, D. Sylvester, and D. Blaauw. Low Power Circuit Design Based on Heterojunction Tunneling Transistors (HETTs). In Proc. of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED, 2009.
[6] R. Kumar, K. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures. Computer Architecture Letters, 2, 2003.
[7] R. Kumar, D. M. Tullsen, and N. P. Jouppi. Core Architecture Optimization for Heterogeneous Chip Multiprocessors. In Proc. of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT '06, 2006.
[8] G. Magklis, P. Chaparro, J. Gonzalez, and A. Gonzalez. Independent Front-end and Back-end Dynamic Voltage Scaling for a GALS Microarchitecture. In Proc. Int. Symp. on Low Power Electronics and Design, 2006.
[9] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. Computer, 35:50–58, February 2002.
[10] S. Mookerjea, R. Krishnan, S. Datta, and V. Narayanan. Effective Capacitance and Drive Current for Tunnel FET (TFET) CV/I Estimation. IEEE Transactions on Electron Devices, 2009.
[11] S. Mookerjea, D. Mohata, R. Krishnan, J. Singh, A. Vallett, A. Ali, T. Mayer, V. Narayanan, D. Schlom, A. Liu, and S. Datta. Experimental Demonstration of 100nm Channel Length In0.53Ga0.47As-based Vertical Inter-Band Tunnel Field Effect Transistors (TFETs) for Ultra Low-Power Logic and SRAM Applications. In Proc. IEEE Int. Electron Devices Meeting (IEDM), 2009.
[12] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, and M. L. Scott. Energy-Efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling. In HPCA, 2002.
[13] J. Singh, K. Ramakrishnan, S. Mookerjea, S. Datta, N. Vijaykrishnan, and D. Pradhan. A Novel Si-Tunnel FET Based SRAM Design for Ultra Low-Power 0.3V VDD Applications. In Proc. 15th Asia and South Pacific Design Automation Conf. (ASP-DAC), 2010.
[14] A. Sinha and A. P. Chandrakasan. JouleTrack - a Web Based Tool for Software Energy Profiling. In Proc. Design Automation Conf., 2001.
[15] Synopsys. TCAD Sentaurus Device Manual, Release: C-2009.06, 2009.
[16] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi. CACTI 5.1. Technical report, HP Labs, 2008.
[17] Xilinx. Xilinx Power Tools Tutorial (http://www.xilinx.com/support/documentation/sw_manuals/xilinx12_2/ug733.pdf), 2010.
[18] W. Zhao and Y. Cao. New Generation of Predictive Technology Model for Sub-45nm Design Exploration. In Proc. 7th Int. Symp. Quality Electronic Design, 2006.
