Custom Processor Core Construction from C Code
Descripción
Custom Processor Core Construction from C Code Jelena Trajkovic and Daniel D. Gajski Center for Embedded Computer Systems UC Irvine
Outline p
Introduction
p
Initial Data path Extraction
p
Data path Optimization
p
Experimental results
p
Conclusion and Future Directions
J. Trajkovic and D. Gajski Copyright © CECS 2008
Introduction p 1.
Application implementation options Chose processor, compile application C C D
2.
Design HW for the application C D D
p
Cheaper, shorter time-to-market Scalable with code size May not have resources for optimal execution
Optimal components and structures Longer development time Not scalable
Our goal: benefits from 1 & quality of 2
J. Trajkovic and D. Gajski Copyright © CECS 2008
Problem Definition p
What is the best data path to execute given C code? n
“The best” architecture is one that meets given constraints
n
Manual design p
n
Not scalable for large application
Solution: Automatically extract data path from C code
J. Trajkovic and D. Gajski Copyright © CECS 2008
Approach (1/2) p
Traditional n
Data path and controller are generated simultaneously p
C
C-to-RTL
Constraints
Applicable to small code size Controller + DP Designer
p
Proposed n n
C
Separate Data path and controller generation Derive Data path from C code
Data path Generator
Constraints
DP Designer Controller Generator
Controller
J. Trajkovic and D. Gajski Copyright © CECS 2008
Approach (2/2) p
Our design flow n
C
Extract Initial Data path (IDp) Data path Generator
p
n
For max performance
Component Library
IDp Extraction
IDp
Constraints
Data path Optimization
Controller Generation & Profiling
Optimize Data path p
Iteratively based on designer constraints
Final Data path
p
Benefits n n n n n
Automate data path extraction Standard input – C reference code Scalability (~10000 lines of code) Controllability of data path design Quality J. Trajkovic and D. Gajski Copyright © CECS 2008
Outline p
Introduction
p
Initial Data path Extraction
p
Data path Optimization
p
Experimental results
p
Conclusion and Future Directions
J. Trajkovic and D. Gajski Copyright © CECS 2008
Initial Data path Extraction p
Set of C code properties n
p
Set of HW components and templates n n
p
Data types, operators, parallelism, loops, dependences
Functional units, storage elements, interconnect Connectivity scheme, pipelining
Matching heuristics n n
Date types and representation → bit-width Parallelism → number of FUs, number of registers in RF, pipelining
J. Trajkovic and D. Gajski Copyright © CECS 2008
Example: C Code Properties → HW p
Set of operations {+, -, shr, *, >}
p
Max concurrent occurrence m+= 3, m- = 2, mshr = 1, m* = 2, m> = 1
p
Mapping operations → FU + → ALU, - → ALU, shr → ALU, * → mul, > → comp Allocation
Mapping
p
Max concurrent transfers of source and destination operands n n
p
Source and destination busses Greedy interconnect
Create Initial Data path (IDp) J. Trajkovic and D. Gajski Copyright © CECS 2008
Outline p
Introduction
p
Initial Data path Extraction
p
Data path Optimization
p
Experimental results
p
Conclusion and Future Directions
J. Trajkovic and D. Gajski Copyright © CECS 2008
Main steps in Data path Optimization p
Compile and profile IDp
p
Select Critical code to be optimized
Critical Code Extraction Usage Plot Creation Resource Constraints
p
Create Usage Plot for components Estimation
p
Estimate Timing Overhead from Resource constraints
Timing Overhead (TO)
Design Creation and Evaluation
p
Balance out the remaining components of Dp
N
TO satisfied? Y
Net-list Creation
J. Trajkovic and D. Gajski Copyright © CECS 2008
Critical Code Extraction p
Critical Code n n
p
Selects based on relative length (l) and frequency (f) n n n
p
Set of Basic Blocs (BB) Contributes the most to the execution time
High length High frequency High frequency-length product
Designer specifies Critical Code n
By selecting parameters Pl, Pf, Pfl
J. Trajkovic and D. Gajski Copyright © CECS 2008
Usage Plot and Timing Overhead p
For selected BBs create usage plot n n
p
Usage per cycle Annotate with execution frequency
Resource constraint n
in use available 4 3 2 1
e.g. adder = 2
p
Model delayed execution n n
p
number of instances
0 4 3 2 1
1
number of instances
0
1
Extra operations “executed” in available cycles Estimate dc – number of extra cycles
The smallest dc becomes Timing Overhead
J. Trajkovic and D. Gajski Copyright © CECS 2008
estimated delayed execution operation but no available units adder
2
3
4
resource constraint
2
3
4
5
time
adder
5
6
dc
time
Balancing Unrestricted Components p
Select designs to be evaluated n n n
Set number of components with resource constraints Set other resources to values in IDp Select a component type to be varied 4 number of 3 2 1
0
p
E.g. multiplier: two designs With one or with two multipliers n Having one multiplier satisfies Timing Overhead n
4 3 2 1
J. Trajkovic and D. Gajski Copyright © CECS 2008
multiplier
instances
1
2
number of instances
0
1
3
4
5
time
multiplier selected number
2
3
4
5
Timing Overhead
6
time
Outline p
Introduction
p
Initial Data path Extraction
p
Data path Optimization
p
Experimental results
p
Conclusion and Future Directions
J. Trajkovic and D. Gajski Copyright © CECS 2008
Experimental Results Bench
(Pl,Pf, Pfl) [%]
LoC
bdist2
(60, 50, 45)
Sort
Gen. Time [sec] Non-pipe
Pipe
61
0.2
0.8
(80, 60,45)
33
0.1
0.1
dct32
(18, 65, 50)
1006
1.3
2.3
Mp3
(30, 55, 50)
13898
15.6
42.6
p
Two set of experiments to be shown n n
p
Interactive design exploration Design refinement quality
Benchmarks n
bdist2 (from MPEG2 encoder), Sort (bubble sort), dct32 (from MP3 decoder) and Mp3 (decoder) J. Trajkovic and D. Gajski Copyright © CECS 2008
Interactive Design Exploration p
Designer specifies n n n
p
Partial resource constraint Parameters for critical code extraction Design choice (pipelined/non-pipelined)
Baseline: n n n
MIPS-style: ALU, multiplier, 128-entry RF2x1 and 2 source and 1 destination bus A divider for Mp3 baseline implementation Memory size customized per application
J. Trajkovic and D. Gajski Copyright © CECS 2008
Description of Generated Architectures p
Difference from baseline
Bench
Pipe
ALU1
ALU2
ALU3
RF2x1
RF4x2
RF6x3
RF8x4
RF16x8
N
#R=64
#R=64
#R=64
#R=64
#R=64
#R=64
#R=64
#R=64
Y
#R=32
#R=32
#R=32
#R=32
#R=32
#R=32
#R=32
#R=32
N
#R=16
#R=16
#R=16
#R=16
#R=16
#R=16
#R=16
#R=16
Y
#R=16
#R=16
#R=16
#R=16
#R=16
#R=16
#R=16
#R=16
N
Rf4x2
Rf6x3
Rf8x4
-
2 Alu
3 Alu
3 Alu
3 Alu
Y
Rf4x2
Rf4x2
Rf6x3
-
2 Alu
2 Alu
2 Alu
3Alu
N
-
Rf6x3
Rf8x4
-
2 Alu
3 Alu
3 Alu
3 Alu
Y
Rf4x2
Rf6x3
Rf6x3
-
2 Alu
2 Alu
2 Alu
2 Alu
Bdist2
Sort
dct32
Mp3
J. Trajkovic and D. Gajski Copyright © CECS 2008
IDp Rf 16x8, 3 Alu, 2 Mul Rf 16x8, 2 Alu Rf 16x8, 3 Alu, 2 Mul Rf 16x8, 3 Alu, 2 Mul
Example:Mp3 p
Baseline n
Bench
MIPS-style: ALU, multiplier, 128-entry RF2x1 and 2 source and 1 destination bus, divider Pipe
ALU1
ALU2
ALU3
RF2x1
RF4x2
RF6x3
RF8x4
RF16x8
N
-
Rf6x3
Rf8x4
-
2 Alu
3 Alu
3 Alu
3 Alu
Y
Rf4x2
Rf6x3
Rf6x3
-
2 Alu
2 Alu
2 Alu
2 Alu
Mp3
Mp3 Non-pipelined
1.10
Pipelined
1.00 0.90 0.80 0.70 0.60 0.50 0.40
ALU 1
ALU 2
ALU 3
RF2x1
RF4x2
RF6x3
J. Trajkovic and D. Gajski Copyright © CECS 2008
RF8x4
RF16x8
IDp
IDp Rf 16x8, 3 Alu, 2 Mul
Number of Execution Cycles p
Non-pipelined n n n n
Non-pipelined
bdist2 - biggest saving for IDp Sort – smallest improvement dct32 – ALU3 Mp3 – ALU2 optimum
bdist2
1.10
Sort
dct32
Mp3
1.00 0.90 0.80 0.70 0.60 0.50 0.40
ALU 1
p
ALU 2
ALU 3
RF2x1
RF4x2
RF6x3
RF8x4
RF16x8
IDp
Pipelined n
less overall saving
Pipelined 1.10
bdist2
Sort
dct32
Mp3
1.00
0.90
0.80
0.70
ALU 1
ALU 2
J. Trajkovic and D. Gajski Copyright © CECS 2008
ALU 3
RF2x1
RF4x2
RF6x3
RF8x4
RF16x8
IDp
Total Execution Time p
p
Post-synthesis
bdist2
Sort
Texe [s] 9.00E-05
Texe [s] 2.50E-03
8.00E-05
2.25E-03
7.00E-05
2.00E-03 1.75E-03
6.00E-05
bdist2
1.50E-03
5.00E-05
1.25E-03
4.00E-05
n
pipelined RF4x2
1.00E-03
3.00E-05
7.50E-04
2.00E-05
p
Sort n
Non-pipelined ALU1 p
p
p
Large Tclk
Mp3 n
Non-pipelined ALU2 p
0.00E+00 Non-pipelined
Baseline
ALU 1
ALU 2
ALU 3
RF2x1
Pipelined RF4x2
RF6x3
Non-pipelined RF8x4
Large Tclk
Baseline
ALU 1
ALU 2
ALU 3
dct32
2.50E-03
Texe [s] 2.50E-02
2.25E-03
2.25E-02
2.00E-03
2.00E-02
1.75E-03
1.75E-02
1.50E-03
1.50E-02
1.25E-03
1.25E-02
1.00E-03
1.00E-02
7.50E-04
7.50E-03
5.00E-04
5.00E-03
2.50E-04
2.50E-03
0.00E+00
0.00E+00 Non-pipelined
Baseline
ALU 1
ALU 2
ALU 3
RF2x1
J. Trajkovic and D. Gajski Copyright © CECS 2008
RF2x1
Pipelined RF4x2
RF6x3
RF8x4
Mp3
Texe [s]
Non-pipelined ALU2 p
2.50E-04
0.00E+00
Reduced resources
dct32 n
5.00E-04
1.00E-05
Pipelined RF4x2
RF6x3
Non-pipelined RF8x4
Baseline
ALU 1
ALU 2
ALU 3
RF2x1
Pipelined RF4x2
RF6x3
RF8x4
Design Refinement Quality p
n
p
dct32
n n
n
BRAM
2 1.5
23% - 80% extra
1 0.5
29% speedup – 20% slowdown RF4x2: 23% extra
Generated vs. academic HLS n
Slice
2.5
Best Texe: p
Texec
3
Tclk: p
Tclk
3.5
No.cycles: p
p
No.cycle 4
Generated vs. manual n
p
4.5
Relative to manual design
0
Baseline
ALU1-N
RF4x2-N
RF4x2-P
HLS
No.cycle
2.51
1.80
1.23
1.52
1.83
Tclk
1.03
1.20
1.20
0.81
1.67
Texec
2.59
2.16
1.47
1.23
3.06
Slice
2.79
3.49
3.49
3.40
2.32
BRAM
4.00
4.00
4.00
4.00
0.25
Generated has better performance HLS has better area
Design time: Generated → 2.3 sec vs. Manual → 3-man week J. Trajkovic and D. Gajski Copyright © CECS 2008
Conclusion p
Contributions n n n
p
Benefits n n n n
p
Algorithm for basic block analysis and core generation Iterative algorithm for optimizing resource utilization Demonstration on large C examples (>13K LOC) Automatic C input to core datapath construction Scalable to any size of C code Datapath construction controllable using designer constraints Generated core quality comparable to manual design
Future work n n
Automatic pipeline configuration from C code Forwarding based on static C code analysis J. Trajkovic and D. Gajski Copyright © CECS 2008
Acknowledgments p
The authors wish to thank Pramod Chandraiah for providing the Mp3 source code
p
Many thanks to Pramod Chandraiah, Weiwei Chen, Gunar Schirner, Lin Yang and Bin Zhang for manual designs and Roger Ang, Hansu Cho and Dongwan Shin for manual and HLS designs for dct32
p
We would also like to thank Mehrdad Reshadi and Bita Gorjiara for compiler support and Verilog generator
J. Trajkovic and D. Gajski Copyright © CECS 2008
Thank You
p
Questions?
J. Trajkovic and D. Gajski Copyright © CECS 2008
Lihat lebih banyak...
Comentarios