Custom Processor Core Construction from C Code

Share Embed


Descripción

Custom Processor Core Construction from C Code Jelena Trajkovic and Daniel D. Gajski Center for Embedded Computer Systems UC Irvine

Outline p

Introduction

p

Initial Data path Extraction

p

Data path Optimization

p

Experimental results

p

Conclusion and Future Directions

J. Trajkovic and D. Gajski Copyright © CECS 2008

Introduction p 1.

Application implementation options Chose processor, compile application C C D

2.

Design HW for the application C D D

p

Cheaper, shorter time-to-market Scalable with code size May not have resources for optimal execution

Optimal components and structures Longer development time Not scalable

Our goal: benefits from 1 & quality of 2

J. Trajkovic and D. Gajski Copyright © CECS 2008

Problem Definition p

What is the best data path to execute given C code? n

“The best” architecture is one that meets given constraints

n

Manual design p

n

Not scalable for large application

Solution: Automatically extract data path from C code

J. Trajkovic and D. Gajski Copyright © CECS 2008

Approach (1/2) p

Traditional n

Data path and controller are generated simultaneously p

C

C-to-RTL

Constraints

Applicable to small code size Controller + DP Designer

p

Proposed n n

C

Separate Data path and controller generation Derive Data path from C code

Data path Generator

Constraints

DP Designer Controller Generator

Controller

J. Trajkovic and D. Gajski Copyright © CECS 2008

Approach (2/2) p

Our design flow n

C

Extract Initial Data path (IDp) Data path Generator

p

n

For max performance

Component Library

IDp Extraction

IDp

Constraints

Data path Optimization

Controller Generation & Profiling

Optimize Data path p

Iteratively based on designer constraints

Final Data path

p

Benefits n n n n n

Automate data path extraction Standard input – C reference code Scalability (~10000 lines of code) Controllability of data path design Quality J. Trajkovic and D. Gajski Copyright © CECS 2008

Outline p

Introduction

p

Initial Data path Extraction

p

Data path Optimization

p

Experimental results

p

Conclusion and Future Directions

J. Trajkovic and D. Gajski Copyright © CECS 2008

Initial Data path Extraction p

Set of C code properties n

p

Set of HW components and templates n n

p

Data types, operators, parallelism, loops, dependences

Functional units, storage elements, interconnect Connectivity scheme, pipelining

Matching heuristics n n

Date types and representation → bit-width Parallelism → number of FUs, number of registers in RF, pipelining

J. Trajkovic and D. Gajski Copyright © CECS 2008

Example: C Code Properties → HW p

Set of operations {+, -, shr, *, >}

p

Max concurrent occurrence m+= 3, m- = 2, mshr = 1, m* = 2, m> = 1

p

Mapping operations → FU + → ALU, - → ALU, shr → ALU, * → mul, > → comp Allocation

Mapping

p

Max concurrent transfers of source and destination operands n n

p

Source and destination busses Greedy interconnect

Create Initial Data path (IDp) J. Trajkovic and D. Gajski Copyright © CECS 2008

Outline p

Introduction

p

Initial Data path Extraction

p

Data path Optimization

p

Experimental results

p

Conclusion and Future Directions

J. Trajkovic and D. Gajski Copyright © CECS 2008

Main steps in Data path Optimization p

Compile and profile IDp

p

Select Critical code to be optimized

Critical Code Extraction Usage Plot Creation Resource Constraints

p

Create Usage Plot for components Estimation

p

Estimate Timing Overhead from Resource constraints

Timing Overhead (TO)

Design Creation and Evaluation

p

Balance out the remaining components of Dp

N

TO satisfied? Y

Net-list Creation

J. Trajkovic and D. Gajski Copyright © CECS 2008

Critical Code Extraction p

Critical Code n n

p

Selects based on relative length (l) and frequency (f) n n n

p

Set of Basic Blocs (BB) Contributes the most to the execution time

High length High frequency High frequency-length product

Designer specifies Critical Code n

By selecting parameters Pl, Pf, Pfl

J. Trajkovic and D. Gajski Copyright © CECS 2008

Usage Plot and Timing Overhead p

For selected BBs create usage plot n n

p

Usage per cycle Annotate with execution frequency

Resource constraint n

in use available 4 3 2 1

e.g. adder = 2

p

Model delayed execution n n

p

number of instances

0 4 3 2 1

1

number of instances

0

1

Extra operations “executed” in available cycles Estimate dc – number of extra cycles

The smallest dc becomes Timing Overhead

J. Trajkovic and D. Gajski Copyright © CECS 2008

estimated delayed execution operation but no available units adder

2

3

4

resource constraint

2

3

4

5

time

adder

5

6

dc

time

Balancing Unrestricted Components p

Select designs to be evaluated n n n

Set number of components with resource constraints Set other resources to values in IDp Select a component type to be varied 4 number of 3 2 1

0

p

E.g. multiplier: two designs With one or with two multipliers n Having one multiplier satisfies Timing Overhead n

4 3 2 1

J. Trajkovic and D. Gajski Copyright © CECS 2008

multiplier

instances

1

2

number of instances

0

1

3

4

5

time

multiplier selected number

2

3

4

5

Timing Overhead

6

time

Outline p

Introduction

p

Initial Data path Extraction

p

Data path Optimization

p

Experimental results

p

Conclusion and Future Directions

J. Trajkovic and D. Gajski Copyright © CECS 2008

Experimental Results Bench

(Pl,Pf, Pfl) [%]

LoC

bdist2

(60, 50, 45)

Sort

Gen. Time [sec] Non-pipe

Pipe

61

0.2

0.8

(80, 60,45)

33

0.1

0.1

dct32

(18, 65, 50)

1006

1.3

2.3

Mp3

(30, 55, 50)

13898

15.6

42.6

p

Two set of experiments to be shown n n

p

Interactive design exploration Design refinement quality

Benchmarks n

bdist2 (from MPEG2 encoder), Sort (bubble sort), dct32 (from MP3 decoder) and Mp3 (decoder) J. Trajkovic and D. Gajski Copyright © CECS 2008

Interactive Design Exploration p

Designer specifies n n n

p

Partial resource constraint Parameters for critical code extraction Design choice (pipelined/non-pipelined)

Baseline: n n n

MIPS-style: ALU, multiplier, 128-entry RF2x1 and 2 source and 1 destination bus A divider for Mp3 baseline implementation Memory size customized per application

J. Trajkovic and D. Gajski Copyright © CECS 2008

Description of Generated Architectures p

Difference from baseline

Bench

Pipe

ALU1

ALU2

ALU3

RF2x1

RF4x2

RF6x3

RF8x4

RF16x8

N

#R=64

#R=64

#R=64

#R=64

#R=64

#R=64

#R=64

#R=64

Y

#R=32

#R=32

#R=32

#R=32

#R=32

#R=32

#R=32

#R=32

N

#R=16

#R=16

#R=16

#R=16

#R=16

#R=16

#R=16

#R=16

Y

#R=16

#R=16

#R=16

#R=16

#R=16

#R=16

#R=16

#R=16

N

Rf4x2

Rf6x3

Rf8x4

-

2 Alu

3 Alu

3 Alu

3 Alu

Y

Rf4x2

Rf4x2

Rf6x3

-

2 Alu

2 Alu

2 Alu

3Alu

N

-

Rf6x3

Rf8x4

-

2 Alu

3 Alu

3 Alu

3 Alu

Y

Rf4x2

Rf6x3

Rf6x3

-

2 Alu

2 Alu

2 Alu

2 Alu

Bdist2

Sort

dct32

Mp3

J. Trajkovic and D. Gajski Copyright © CECS 2008

IDp Rf 16x8, 3 Alu, 2 Mul Rf 16x8, 2 Alu Rf 16x8, 3 Alu, 2 Mul Rf 16x8, 3 Alu, 2 Mul

Example:Mp3 p

Baseline n

Bench

MIPS-style: ALU, multiplier, 128-entry RF2x1 and 2 source and 1 destination bus, divider Pipe

ALU1

ALU2

ALU3

RF2x1

RF4x2

RF6x3

RF8x4

RF16x8

N

-

Rf6x3

Rf8x4

-

2 Alu

3 Alu

3 Alu

3 Alu

Y

Rf4x2

Rf6x3

Rf6x3

-

2 Alu

2 Alu

2 Alu

2 Alu

Mp3

Mp3 Non-pipelined

1.10

Pipelined

1.00 0.90 0.80 0.70 0.60 0.50 0.40

ALU 1

ALU 2

ALU 3

RF2x1

RF4x2

RF6x3

J. Trajkovic and D. Gajski Copyright © CECS 2008

RF8x4

RF16x8

IDp

IDp Rf 16x8, 3 Alu, 2 Mul

Number of Execution Cycles p

Non-pipelined n n n n

Non-pipelined

bdist2 - biggest saving for IDp Sort – smallest improvement dct32 – ALU3 Mp3 – ALU2 optimum

bdist2

1.10

Sort

dct32

Mp3

1.00 0.90 0.80 0.70 0.60 0.50 0.40

ALU 1

p

ALU 2

ALU 3

RF2x1

RF4x2

RF6x3

RF8x4

RF16x8

IDp

Pipelined n

less overall saving

Pipelined 1.10

bdist2

Sort

dct32

Mp3

1.00

0.90

0.80

0.70

ALU 1

ALU 2

J. Trajkovic and D. Gajski Copyright © CECS 2008

ALU 3

RF2x1

RF4x2

RF6x3

RF8x4

RF16x8

IDp

Total Execution Time p

p

Post-synthesis

bdist2

Sort

Texe [s] 9.00E-05

Texe [s] 2.50E-03

8.00E-05

2.25E-03

7.00E-05

2.00E-03 1.75E-03

6.00E-05

bdist2

1.50E-03

5.00E-05

1.25E-03

4.00E-05

n

pipelined RF4x2

1.00E-03

3.00E-05

7.50E-04

2.00E-05

p

Sort n

Non-pipelined ALU1 p

p

p

Large Tclk

Mp3 n

Non-pipelined ALU2 p

0.00E+00 Non-pipelined

Baseline

ALU 1

ALU 2

ALU 3

RF2x1

Pipelined RF4x2

RF6x3

Non-pipelined RF8x4

Large Tclk

Baseline

ALU 1

ALU 2

ALU 3

dct32

2.50E-03

Texe [s] 2.50E-02

2.25E-03

2.25E-02

2.00E-03

2.00E-02

1.75E-03

1.75E-02

1.50E-03

1.50E-02

1.25E-03

1.25E-02

1.00E-03

1.00E-02

7.50E-04

7.50E-03

5.00E-04

5.00E-03

2.50E-04

2.50E-03

0.00E+00

0.00E+00 Non-pipelined

Baseline

ALU 1

ALU 2

ALU 3

RF2x1

J. Trajkovic and D. Gajski Copyright © CECS 2008

RF2x1

Pipelined RF4x2

RF6x3

RF8x4

Mp3

Texe [s]

Non-pipelined ALU2 p

2.50E-04

0.00E+00

Reduced resources

dct32 n

5.00E-04

1.00E-05

Pipelined RF4x2

RF6x3

Non-pipelined RF8x4

Baseline

ALU 1

ALU 2

ALU 3

RF2x1

Pipelined RF4x2

RF6x3

RF8x4

Design Refinement Quality p

n

p

dct32

n n

n

BRAM

2 1.5

23% - 80% extra

1 0.5

29% speedup – 20% slowdown RF4x2: 23% extra

Generated vs. academic HLS n

Slice

2.5

Best Texe: p

Texec

3

Tclk: p

Tclk

3.5

No.cycles: p

p

No.cycle 4

Generated vs. manual n

p

4.5

Relative to manual design

0

Baseline

ALU1-N

RF4x2-N

RF4x2-P

HLS

No.cycle

2.51

1.80

1.23

1.52

1.83

Tclk

1.03

1.20

1.20

0.81

1.67

Texec

2.59

2.16

1.47

1.23

3.06

Slice

2.79

3.49

3.49

3.40

2.32

BRAM

4.00

4.00

4.00

4.00

0.25

Generated has better performance HLS has better area

Design time: Generated → 2.3 sec vs. Manual → 3-man week J. Trajkovic and D. Gajski Copyright © CECS 2008

Conclusion p

Contributions n n n

p

Benefits n n n n

p

Algorithm for basic block analysis and core generation Iterative algorithm for optimizing resource utilization Demonstration on large C examples (>13K LOC) Automatic C input to core datapath construction Scalable to any size of C code Datapath construction controllable using designer constraints Generated core quality comparable to manual design

Future work n n

Automatic pipeline configuration from C code Forwarding based on static C code analysis J. Trajkovic and D. Gajski Copyright © CECS 2008

Acknowledgments p

The authors wish to thank Pramod Chandraiah for providing the Mp3 source code

p

Many thanks to Pramod Chandraiah, Weiwei Chen, Gunar Schirner, Lin Yang and Bin Zhang for manual designs and Roger Ang, Hansu Cho and Dongwan Shin for manual and HLS designs for dct32

p

We would also like to thank Mehrdad Reshadi and Bita Gorjiara for compiler support and Verilog generator

J. Trajkovic and D. Gajski Copyright © CECS 2008

Thank You

p

Questions?

J. Trajkovic and D. Gajski Copyright © CECS 2008

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.