
Design Style Case Study for Embedded Multi Media Compute Nodes

Andy Lambrechts 1,2, Anthony Leroy 1,4, Francky Catthoor 1,2, Tom Vander Aa 2, Murali Jayapala 2, Guillermo Talavera 3, Adelina Shickova 1,2, Francisco Barat 2, Bingfeng Mei 1,2, Diederik Verkest 1,2,5, Geert Deconinck 2, Henk Corporaal 6, Frédéric Robert 4, Jordi Carrabina Bordoll 3

Email: [email protected]

1 IMEC vzw, Kapeldreef 75, Leuven, B-3001, Belgium
2 Department of Electrical Engineering, Katholieke Universiteit Leuven, Belgium
3 Department of Electrical Engineering, UAB, Spain
4 MiEL, Université Libre de Bruxelles, Belgium
5 Department of Electrical Engineering, Vrije Universiteit Brussel, Belgium
6 Department of Electrical Engineering, TU/e Eindhoven, The Netherlands

Abstract

Users expect future handheld devices to provide extended multimedia functionality and have long battery life. This type of application imposes heavy constraints on both (real-time) performance and energy consumption and forces designers to optimise all parts of their platform. In this experiment we focus on the different processor core design options for future embedded platforms, including the effect of the instruction memory hierarchy on the energy consumption. The results show that significant improvements in energy efficiency and/or performance over currently used RISC or VLIW processors can be achieved. We conclude, based on concrete data for a realistic application, that different styles, including both configurable hardware and instruction set processors, will find their way into future heterogeneous platforms, and that designers will need to be aware of the trade-offs. Secondly, we show for the same application task that a heavily optimised instruction/configuration memory hierarchy can significantly reduce the energy consumption of this part, making it a crucial element of every energy-aware design.

1. Context and motivation

The merging of mobile phones, electronic agendas, multimedia computers and (broadband) communication networks gives rise to fast-growing markets for handheld communication and entertainment devices. Technology advances lead to platforms with enormous processing capacity. However, for handheld terminals the energy consumption is currently the limiting factor. At the same time,

achieving real-time performance for the processing kernels is still a challenge. Major innovative solutions are needed to merge energy efficiency and high performance into a single embedded processor. A first step in this direction is the evolution from RISC to VLIW. VLIW processors provide more computing resources and rely heavily on the compiler to figure out how to use these resources efficiently. This means that energy-inefficient hardware resources (e.g. the dispatch unit) that are extensively used in superscalar processors can be avoided. Because even higher performance and higher energy efficiency are needed, different, mostly domain-specific, VLIW descendants are currently being developed. However, no clear overview of these different design styles and their advantages and disadvantages was available in the public literature up to now. To be fair and valid, such an overview needs to be based on concrete data for the same realistic application, separately optimised for each of the styles. Because different processors tend to be better suited to different applications or modes of operation, most of the emerging SoC platforms are heterogeneous in nature. They contain several types of compute nodes and memory nodes. This work fits into a bigger research activity that estimates the energy consumption of all parts of such a platform. The energy consumption of this global platform can be considered to consist of three main parts: the data-path logic of the compute nodes (computational circuits, combined with local register files that are usually quite dominant), the data memory hierarchy (from level 1 upwards) and the instruction/configuration memory hierarchy (which we will abbreviate as ICMH in the rest of the paper). In this paper, we will present the data obtained from our case study covering different design options for the

compute nodes in this context. They include both reconfigurable hardware and more traditional instruction set processor (software style) options. We have estimated the performance (cycle and operation count) of different processor cores running an MPEG2 decoder, as part of a realistic video processing chain. It is well known that the dominant part of the power breakdown in future processors will not be the data-path but the memory hierarchies [1, 2]. An extensive literature exists on how to reduce the energy consumption by optimising accesses to the data memory hierarchy [3, 4]. We assume that when this stage has been performed well, the data memory hierarchy energy consumption will be relatively low compared to the other, unoptimised contributions. Moreover, the optimised organisation can be assumed to be quite similar (certainly in terms of energy consumption) for all covered processor styles, because we map the same application with the same throughput requirements. So except for the local scalar variables in the foreground register files, which will vary heavily between the different styles, the accesses (and the related energy) to the complex data types (e.g. arrays) in layer 1 and higher should be very comparable. Because of this, the data memory hierarchy is not discussed any further in this work. When the data memory hierarchy is optimised, the ICMH will consume a significant part of the total energy [5]. Moreover, the ICMH exploration is tightly coupled to the architecture style of the compute node, and the link can therefore not be neglected here. In this paper, an ICMH exploration for most of the considered design styles is detailed to show its big impact on the total energy consumed by the processing node. We also show that advanced architectural and compiler optimisations can significantly reduce this energy consumption. The rest of the paper is organised as follows. After a short description of the related work in section 2, we point out the drawbacks of the commonly used RISC and VLIW processors and cover different design options that try to improve the performance and/or energy efficiency of the compute nodes (section 3). A separate section discusses the impact of an ICMH on the energy consumption of the processor (section 4). In section 5 the results for the compute nodes and the ICMH are presented. In section 6 we discuss our results and finally we conclude in section 7.

2. Related work

To our knowledge, no comprehensive comparison of different embedded processor styles has been published for a realistic application that was separately optimised for the different styles. Still, such a study is crucial to indicate the main research problems for such platforms in the context sketched in the previous section. In general, research groups focus on one processor

style, like a RISC, a RISC-based ASIP or a VLIW, with only limited comparison based on quantitative data. Recently, coarse-grained reconfigurable architectures have also entered the picture. Because researchers tend to specialise, comparisons found in the literature mostly compare one optimised style with a totally unoptimised reference case, leading to incomplete or even unfair conclusions. In this paper, we use different processor-specific optimisations of the same realistic application to compare possible design styles for embedded processors. Many techniques have been proposed to reduce energy consumption in different aspects of the ICMH [6]. Several bus encoding schemes [7, 8] have been applied to reduce the effective switching on the (instruction and address) buses. Code size reduction techniques, both hardware [9] and software [10, 11], in addition to saving energy in buses (due to smaller widths and less traffic), reduce the size of the program memory and thus reduce energy. On the other hand, software transformations [12] aiming to utilise the underlying memory hierarchy efficiently have also been applied in the context of instruction memory. Few or no studies have been published that compare different styles, especially in terms of energy. Our ICMH experiment, presented in section 4, focuses on exactly that.

3. Compute nodes

The computation intensive nature of future embedded applications has moved the designers' choice from sequential RISC processors towards more parallel architectures, like the VLIW (Fig. 1). In a VLIW, multiple (e.g. four to eight) ALU-like Functional Units (FU) operate in parallel, relying on the compiler to extract sufficient instruction level parallelism (ILP) to keep the FUs busy. Because a VLIW compiler decides which instructions can be executed in parallel, the processor does not need the energy-hungry hardware that performs this job at run-time in a superscalar processor. This leads to a more energy efficient and less complex processor, in which more resources can be dedicated to doing the actual computations.

Figure 1. Homogeneous VLIW (baseline style)

The traditional VLIW architecture (called the VLIW baseline further on) relies on mature compiler support, but suffers from two major drawbacks. First, the multi-port centralised

data register file and the very wide instruction hierarchy of a straightforward implementation consume too much energy. Second, multimedia applications contain very regular and computation intensive loops (kernels) that can provide more parallel operations than a standard VLIW can exploit. The quest for high performance and lower energy consumption for the same application task has led to the development of a whole range of VLIW descendants, with several types of extensions, covering both compiler techniques and architectural optimisations. We will summarise some approaches that try to improve the energy efficiency, the performance or both.

3.1. Software pipelining (e.g. modulo scheduling)

Goal: performance improvement and, indirectly, (instruction + data-path) energy reduction. A VLIW processor can only achieve the promised speed-up over a RISC if the compiler succeeds in keeping the extra resources busy as often as possible. Traditional VLIW compilers have been concentrating on finding independent instructions in the sequential instruction stream. Reported data show that compilers, on average, can keep only 3 to 8 FUs busy by using only instruction level parallelism [13]. Modulo Scheduling (MS) is a software pipelining technique used by the compiler to extract more parallelism by executing multiple iterations of a loop at the same time. This type of parallelism (loop level parallelism) is abundantly available in multimedia applications. The most compute intensive parts of these applications, called kernels, consist mostly of nested loops. MS allows different loop iterations to be executed in an overlapping manner, and can keep tens of FUs busy during the execution of the kernels. This results in a higher performance, if the architecture has sufficient FUs to exploit this kind of parallelism. For very regular applications that can be software pipelined, this means that more parallelism is available than can be exploited by a traditional VLIW.
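To make the idea concrete, the following C sketch shows a loop body split into load/compute/store stages whose execution is overlapped across iterations. This is a hand-made illustration with hypothetical arrays, not the output of any of the compilers used in this study; on a real VLIW the three statements of the kernel loop would be issued in the same instruction cycle on different FUs.

#include <stdio.h>

#define N 8

/* Original loop: each iteration performs load, compute, store in sequence. */
static void plain_loop(const int *in, int *out) {
    for (int i = 0; i < N; i++) {
        int t = in[i];        /* stage 1: load    */
        t = t * 3 + 1;        /* stage 2: compute */
        out[i] = t;           /* stage 3: store   */
    }
}

/* Hand software-pipelined version: in the steady state, the store of
 * iteration i, the compute of iteration i+1 and the load of iteration
 * i+2 would be issued in one VLIW instruction (expressed sequentially
 * here, since plain C cannot express the parallel issue). */
static void pipelined_loop(const int *in, int *out) {
    int load = in[0];                 /* prologue: fill the pipeline  */
    int comp = load * 3 + 1;
    load = in[1];
    for (int i = 0; i < N - 2; i++) { /* kernel: three overlapped stages */
        out[i] = comp;                /* store for iteration i        */
        comp   = load * 3 + 1;        /* compute for iteration i + 1  */
        load   = in[i + 2];           /* load for iteration i + 2     */
    }
    out[N - 2] = comp;                /* epilogue: drain the pipeline */
    out[N - 1] = load * 3 + 1;
}

int main(void) {
    int in[N] = {1, 2, 3, 4, 5, 6, 7, 8}, a[N], b[N];
    plain_loop(in, a);
    pipelined_loop(in, b);
    for (int i = 0; i < N; i++)
        printf("%d %d\n", a[i], b[i]);  /* both columns should match */
    return 0;
}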

3.2. Clustering (Clustered VLIW)

Goal: energy efficiency improvement. The very expensive multi-ported centralised data register file of the baseline can be avoided by grouping the FUs in separate clusters, each with their own register file. By using a separate, smaller register file with fewer ports for every cluster, the total area and energy consumption of the register files can be reduced. When the clusters become too small, the number of extra copies that are needed to communicate the intermediate results between different clusters will start to have a negative influence on performance. This type of clustering can, however, also be successfully applied to the instruction hierarchy. In this paper we will show how this can be achieved

and the large impact the latter has on a realistic application (Section 4).

3.3. 2D Coarse-grained reconfigurable matrix

Goal: performance improvement and possibly also energy efficiency, compared to the baseline VLIW. A second possible way to increase the performance of a VLIW processor even more is to add extra resources, e.g. more rows of FUs (see Figure 2). By doing this, a regular array of FUs, distributed small register files and reconfigurable interconnect can be created. This reconfigurable array of FUs, also called a coarse-grained reconfigurable matrix, can be seen as a second baseline style but can also be considered a VLIW extension [14]. A compiler for this architecture can exploit the high parallelism provided by software pipelining. We have used the advanced DRESC compiler [15], which combines a modulo scheduler, a 2D router and a register allocator. This approach makes it possible to boost the performance of a regular VLIW. Although the total number of cycles required for execution decreases, there is an increase in the number of executed operations (see the results in section 5.1). Part of the extra operations is due to the configuration and the routing of intermediate results through the array. The other part results from the use of predication to allow the mapping of kernels that contain a limited amount of control-flow onto the matrix. The energy consumed for a task, compared to the baseline, is difficult to predict without a very detailed analysis and energy models for all the architectural components. It is the subject of ongoing work.
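The predication mentioned above can be illustrated with a small if-conversion example; this is a minimal sketch of the general technique with made-up data, not DRESC output. The branch is replaced by executing both alternatives and selecting the result with a predicate, which is how limited control flow inside a kernel can be mapped onto the matrix, at the cost of extra executed operations.

#include <stdio.h>

/* With a branch: only one of the two assignments executes per call. */
static int with_branch(int x) {
    int y;
    if (x & 1) y = x * 3;
    else       y = x >> 1;
    return y;
}

/* If-converted: both operations execute every time; the predicate p
 * selects which result is kept. On the array both would run on
 * different FUs in the same cycle, so the branch disappears but the
 * operation count goes up. */
static int predicated(int x) {
    int p  = x & 1;              /* predicate                     */
    int t1 = x * 3;              /* executed under predicate true  */
    int t2 = x >> 1;             /* executed under predicate false */
    return p ? t1 : t2;          /* select/merge operation         */
}

int main(void) {
    for (int x = 0; x < 8; x++)
        printf("%d %d\n", with_branch(x), predicated(x)); /* must match */
    return 0;
}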

Figure 2. VLIW with tightly coupled coarse-grained array

3.4. Sub-word parallelism

Goal: performance improvement and potentially a large impact on energy, because almost no extra operations are needed if the mapping is performed effectively [16]. Another way to execute more operations in parallel is to use a sub-word parallel approach. Instead of adding extra FUs, the existing ones can be made sub-word parallel. This means that a 32-bit wide FU can, for example, perform a single 32-bit, two 16-bit or four 8-bit operations in one cycle. In multimedia applications, word sizes of 8 and 16 bit are very common. Again, the success of this approach depends on how well the compiler can take advantage of the sub-word parallel capabilities of the hardware.
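A minimal sketch of the principle in portable C, assuming two 16-bit media samples packed into one 32-bit word; a real sub-word parallel FU would provide this as a single instruction (often with saturation), whereas here it is emulated with masking:

#include <stdint.h>
#include <stdio.h>

/* Add two pairs of 16-bit values packed in 32-bit words with a single
 * 32-bit addition. Masking out the lane MSBs keeps a carry from the
 * low half-word from spilling into the high half-word; the MSBs are
 * then restored with an XOR. Modular (wrap-around) semantics only. */
static uint32_t add16x2(uint32_t a, uint32_t b) {
    uint32_t sum  = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    uint32_t msbs = (a ^ b) & 0x80008000u;
    return sum ^ msbs;
}

int main(void) {
    uint32_t a = (7u << 16) | 40000u;   /* lanes: 7 and 40000 */
    uint32_t b = (9u << 16) | 30000u;   /* lanes: 9 and 30000 */
    uint32_t r = add16x2(a, b);
    /* prints 16 and 4464 (70000 mod 65536) */
    printf("hi: %u  lo: %u\n", r >> 16, r & 0xFFFFu);
    return 0;
}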

3.5. Custom instructions and/or FUs

Goal: performance and energy efficiency improvement. If a VLIW is targeted towards a certain application or application domain, specialised instructions can be added to the instruction set. These custom instructions combine a number of frequently used operations into one instruction and are executed on specialised FUs that are more efficient (in time and energy). This technique can have a very big positive influence on performance and energy, but the disadvantage is the loss of generality (the result can only be used as an ASIP).
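As an illustration of the kind of operation such an instruction could fuse, consider a sum-of-absolute-differences kernel, common in video codecs; this is a hypothetical example of ours, not one of the custom instructions evaluated in this study:

#include <stdio.h>
#include <stdlib.h>

/* On a plain VLIW this loop body costs several operations per pixel:
 * two loads, a subtract, an absolute value and an accumulate. */
static int sad8(const unsigned char *a, const unsigned char *b) {
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += abs((int)a[i] - (int)b[i]);
    return sum;
}

/* A hypothetical custom FU could consume two packed 8-pixel words and
 * return the whole sum as one instruction; here we merely model its
 * semantics with a function call. */
static int sad8_custom(const unsigned char *a, const unsigned char *b) {
    return sad8(a, b); /* stand-in for a single fused operation */
}

int main(void) {
    unsigned char x[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    unsigned char y[8] = {12, 18, 33, 37, 55, 60, 65, 90};
    printf("%d %d\n", sad8(x, y), sad8_custom(x, y));
    return 0;
}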

3.6. Memory hierarchy

Goal: energy efficiency improvement and performance improvement by reducing stalls. By adding a data/instruction memory hierarchy, the locality of data/instructions can be exploited. Data (or instructions) that are moved to small memories, closer to where they are needed, can be accessed with a smaller delay and a lower energy penalty. This results in better performance and lower energy consumption. A detailed exploration of the instruction side of the hierarchy is given in section 4.
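A back-of-the-envelope model shows why this pays off; the per-access energies below are illustrative placeholders, not figures from this study. The model assumes every access first probes the small memory and only goes to the big one on a miss:

#include <stdio.h>

/* Average energy per access for a two-level hierarchy:
 *   E_avg = h * E_small + (1 - h) * (E_small + E_big)
 * where h is the hit rate of the small, close-by memory. */
static double avg_energy(double h, double e_small, double e_big) {
    return h * e_small + (1.0 - h) * (e_small + e_big);
}

int main(void) {
    double e_l0 = 5.0, e_l1 = 50.0;   /* pJ/access, hypothetical */
    for (double h = 0.0; h <= 1.0; h += 0.25)
        printf("hit rate %.2f -> %.1f pJ/access (flat memory: %.1f)\n",
               h, avg_energy(h, e_l0, e_l1), e_l1);
    return 0;
}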

3.7. Focus of the experiments

Most of the concepts presented in the previous paragraphs can be considered to be decoupled. This means that a specific processor design can implement combinations of these approaches, of which only some influence each other. The concepts can be categorised in two different ways. On the one hand, a distinction can be made between techniques improving the energy efficiency and those improving performance, as was already indicated by the Goals in the individual paragraphs. On the other hand, these concepts can be separated into extensions of the baseline styles and supporting techniques. We consider (1) the clustered VLIW, (2) the sub-word parallel VLIW, (3) the VLIW with custom FUs and (4)

the 2D coarse-grained array to be extensions of the VLIW baseline design style, because they are descendants of the centralised VLIW. Software pipelining and the usage of a memory hierarchy can be applied to these (extended) baseline styles, and are considered to be supporting techniques. In this paper we focus on these different styles, by comparing the performance of a RISC (a separate baseline style, used as a reference), a standard centralised VLIW, a clustered VLIW and a VLIW with closely coupled 2D coarse-grained array. Comparing the sub-word parallel VLIW and the custom instruction approach is the subject of future work. To make a good comparison possible, supporting techniques are used. Both the compiler of the clustered VLIW and that of the VLIW with closely coupled 2D coarse-grained array use modulo scheduling to extract sufficient parallelism to exploit these architectures. Our compilers are built on top of the Trimaran framework with the Impact front-end and the Elcor back-end. As was mentioned in section 1, we assume that an important part of the total energy is consumed by the ICMH. The choice of the compute node design style has a strong influence on this part of the system. Moving from RISC processors to more parallel architectures, like VLIWs and coarse-grained arrays, means fetching more instructions in parallel. In these parallel architectures some architecture-specific new techniques can be used to reduce the energy consumption of the ICMH. A separate section will now first show the importance of these optimisations for VLIWs and clustered VLIWs.

4. Instruction/configuration memory hierarchy

Instruction memories in programmable processor based systems consume a large fraction of the total power: up to 30% in VLIW [17] and RISC [18] processors. Also in configurable processors the energy in the configuration memory hierarchy can be high. We treat both hierarchies in a unified way because the solutions to reduce their energy are very similar. Loop buffering or L0 buffering is an effective scheme to reduce energy consumption in the ICMH [19]. In a typical multimedia application, a significant amount of execution time is spent in small program segments. Hence, by storing them in a small L0 buffer instead of the big instruction cache, energy can be reduced. By shifting much of the instruction fetch activity from the larger memory blocks of an L1 cache to the small L0 buffers, up to 50% of the energy consumption in the ICMH can be saved [19, 20]. While this reduction is substantial, orthogonal optimisations on different parts of the processor, like the data-path, register files and data memory hierarchy, are still necessary to ensure high energy efficiency in future processors [21].
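A minimal sketch of the fetch-path decision behind loop buffering, with a hypothetical interface (the real architectures are described in [19, 22]): once a loop has been copied into the L0 buffer, fetches whose program counter falls inside that loop are served by the small buffer instead of the L1 instruction cache, and the access counters show where the fetch energy goes.

#include <stdio.h>

/* Hypothetical fetch model: a loop [loop_start, loop_end) has been
 * preloaded into the L0 buffer; fetches inside that range hit the
 * buffer, all others go to the L1 instruction cache. We only count
 * accesses here; multiplying the counts by per-access energies would
 * give the ICMH energy split. */
struct fetch_unit {
    unsigned loop_start, loop_end;       /* current L0 contents      */
    unsigned long l0_fetches, l1_fetches;
};

static unsigned fetch(struct fetch_unit *f, unsigned pc,
                      const unsigned *imem) {
    if (pc >= f->loop_start && pc < f->loop_end)
        f->l0_fetches++;                 /* cheap: small L0 buffer   */
    else
        f->l1_fetches++;                 /* expensive: big L1 cache  */
    return imem[pc];
}

int main(void) {
    unsigned imem[256];
    for (unsigned i = 0; i < 256; i++) imem[i] = i; /* dummy encoding */
    struct fetch_unit f = { .loop_start = 100, .loop_end = 110 };
    for (unsigned pc = 90; pc < 100; pc++) fetch(&f, pc, imem);
    for (int it = 0; it < 1000; it++)    /* hot 10-instruction loop  */
        for (unsigned pc = 100; pc < 110; pc++) fetch(&f, pc, imem);
    printf("L0 fetches: %lu, L1 fetches: %lu\n", f.l0_fetches, f.l1_fetches);
    return 0;
}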

Figure 3. Clustered Loop Buffer Organization

Figure 4. Optimisations at different abstraction levels

If unchecked, the instruction cache energy, including the L0 buffer, increases by up to 50%. Of the two main contributors to energy consumption in the ICMH, the L0 buffers are the main bottleneck; they consume up to 70% of the ICMH energy even for a good mapping. To alleviate the L0 buffer energy bottleneck, a clustered L0 buffer organisation has been proposed in the recent past [22]. Essentially, in a clustered L0 buffer organisation the L0 buffers are partitioned and grouped with certain functional units in the data-path to form an instruction cluster or L0 cluster. In each cluster the buffers store only the operations of a certain loop mapped to the functional units in that cluster (see also Figure 3). The notion of L0 clusters is different from that of conventional clustered VLIWs. In a data-path cluster, the functional units derive 'data' from a single register file. In an L0 cluster the functional units derive 'instructions' from a single L0 buffer. Even though in both cases the main aim of partitioning is to reduce energy (power) consumption, the principle of partitioning is different, and the decisions can be taken independently. For further architectural details we refer to [19, 22]. With such a clustered organisation, optimisations at three abstraction levels can be conceived, namely the micro-architectural, compiler (mapping) and software levels. Coupled together, these optimisations can reduce the instruction related energy consumption by up to 60%. In this paper we show the reductions due to optimisations at the micro-architectural and compiler levels.

Micro-architectural level
At the architectural level we can exploit certain features of the application, namely the access pattern to the memories. The basic observation which aids in clustering is that typically not all the FUs are active in an instruction cycle. Furthermore, the FUs active in different instruction cycles may or may not be the same. Hence, a direct correlation exists between the access pattern to the instruction memory

and the functional unit activation. Instead of clustering randomly, the access patterns can be exploited to generate clusters more efficiently. Given a certain mapping onto the L0 buffers and a certain schedule, L0 clusters can be generated by the algorithms presented in [22].

Compiler level
The cluster generation described in the previous section is sensitive to the given set of basic blocks to be mapped onto the L0 buffers and to the schedule of those basic blocks. By influencing the basic block selection and the basic block scheduling, the power consumption can be reduced even more. Basic block selection performs an energy-aware mapping of basic blocks, by deciding which ones to map onto the L0 buffers. Basic block scheduling shuffles operations within a VLIW instruction cycle, or even moves operations to different cycles, to arrive at the most energy-efficient schedule. These steps are integrated in an iterative methodology, shown in Figure 4. For our drivers it is also implemented on top of Trimaran.
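A greedy sketch of the basic block selection step, simplified by us from the iterative methodology of Figure 4 and using hypothetical sizes, execution counts and energies: a block is worth mapping onto the L0 buffer only if the fetch energy it saves exceeds the cost of copying it in, and only as long as the buffer capacity allows.

#include <stdio.h>

#define L0_CAPACITY 64            /* instructions, illustrative      */

struct bblock {
    const char *name;
    int size;                     /* instructions in the block       */
    long execs;                   /* profiled execution count        */
};

/* Energy gain of mapping a block to the L0 buffer: every fetch saved
 * from L1 saves (e_l1 - e_l0); copying it in once costs e_fill per
 * instruction. All energies are hypothetical placeholders. */
static double gain(const struct bblock *b,
                   double e_l1, double e_l0, double e_fill) {
    return (double)b->execs * b->size * (e_l1 - e_l0) - b->size * e_fill;
}

int main(void) {
    struct bblock blocks[] = {    /* assumed profit-sorted order     */
        {"idct_inner", 20, 500000}, {"mc_inner", 30, 300000},
        {"vlc_decode", 40,  20000}, {"header",   10,     40},
    };
    double e_l1 = 50.0, e_l0 = 5.0, e_fill = 60.0; /* pJ, hypothetical */
    int used = 0;
    /* Greedy: take blocks while they fit and their gain is positive. */
    for (int i = 0; i < 4; i++) {
        double g = gain(&blocks[i], e_l1, e_l0, e_fill);
        if (g > 0 && used + blocks[i].size <= L0_CAPACITY) {
            used += blocks[i].size;
            printf("map   %-10s (gain %.0f pJ)\n", blocks[i].name, g);
        } else {
            printf("skip  %-10s (gain %.0f pJ)\n", blocks[i].name, g);
        }
    }
    return 0;
}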

5. Experimental results

5.1. Case study for the compute nodes

The results for the different processor styles presented in the following sections were obtained by simulation of an MPEG2 decoder, taken from the Mediabench benchmark suite. The figures report the simulated behaviour when decoding a 4-frame MPEG2 sequence (IPPP) with a frame size of 4-CIF. The performance estimates are based on cycle and instruction counts given by a cycle-accurate simulator, including the effects of the memory hierarchy. Because of differences in the architectural assumptions of the Impact back-end, used for the coarse-grained reconfigurable architecture, and the Elcor back-end, used for the VLIW experiments, the semantics of some operations are different. Elcor decomposes some instructions into multiple operations, where Impact does not. To make a fair comparison possible, we have recomputed the relevant results into Impact semantics by removing the extra operations.

This turned out to be quite hard in this case, because of many small differences in semantics between the compilers and the underlying instruction sets. Removing the extra operations would intuitively lead to a smaller IPC, because in some cycles fewer independent instructions can be found. This leads to two re-computations. An optimistic recomputation is done by assuming that the IPC stays the same for the reduced number of operations, and is reported as "best case". From this IPC and the reduced number of operations, a reduced cycle count is computed. The pessimistic, or "worst case", result assumes that only cycles that are empty after the extra operations have been removed can be left out. The lower IPC is then calculated based on this lower cycle count. The recomputed results are mentioned between brackets (worst and best case separated by /) and summarised in Table 1.
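The recomputation itself is simple arithmetic. The sketch below reproduces, up to rounding, the best/worst-case numbers of configuration b) in Table 1 from the Elcor-semantics measurements quoted later in this section (88 million cycles and 369 million operations, reduced to 252 million operations in Impact semantics; the 86 million worst-case cycle count comes directly from the simulation):

#include <stdio.h>

int main(void) {
    double ops_elcor  = 369e6, cyc_elcor = 88e6; /* measured, Elcor   */
    double ops_impact = 252e6;                   /* decomposed ops    */
                                                 /* removed           */
    double ipc_elcor = ops_elcor / cyc_elcor;    /* ~4.19 (paper: 4.17) */

    /* Best case: IPC is assumed unchanged, so the reduced operation
     * count directly yields a reduced cycle count. */
    double cyc_best = ops_impact / ipc_elcor;    /* ~60e6             */

    /* Worst case: only cycles that became empty after removing the
     * extra operations are dropped (86e6 from the simulation). */
    double cyc_worst = 86e6;
    double ipc_worst = ops_impact / cyc_worst;   /* ~2.93 (paper: 2.92) */

    /* Frames per second at 600 MHz for the 4-frame sequence. */
    printf("best : %.0f Mcycles, IPC %.2f, %.1f fr/s\n",
           cyc_best / 1e6, ops_impact / cyc_best, 600e6 * 4 / cyc_best);
    printf("worst: %.0f Mcycles, IPC %.2f, %.1f fr/s\n",
           cyc_worst / 1e6, ipc_worst, 600e6 * 4 / cyc_worst);
    return 0;
}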

                      a)      b)          c)          d)          e)
# ops (×10^6)         243     252         311         337         331
# cycles (×10^6)      336     86/60       95/72       84/64       26
IPC                   0.72    2.92/4      3.24/4.31   3.98/5.26   9.33
fr/sec @ 600 MHz      6.9     27.9/39.9   25.2/33.3   28.5/37.5   92.4
ms/fr                 145     36/25       40/30       35/27       11

a). RISC (ARM920T), b). Centralised VLIW (8 FU), c). Clustered VLIW (2x4 FU), d). Clustered VLIW (3x4 FU), e). ADRES (8x8 matrix)

Table 1. Compute node simulation results for the different design styles: all reported figures are for Impact semantics; recomputations from Elcor to Impact semantics are reported as worst case/best case. The coarse-grained array ADRES result does not include the effects of a memory hierarchy.

RISC
As a first reference case, the decoding of the four frames was simulated with ARMulator for an ARM920T. This resulted in a total of 243 million operations and 336 million cycles. This means that at 600 MHz this processor can decode about 6.9 frames per second. For the sake of comparison, we provide the number of decoded frames per second for all processors at 600 MHz. The number of executed Instructions Per Cycle (IPC) in this case is 0.72. This is clearly far from the performance that is needed in this context (15 to 30 frames per second), and even much higher clock frequencies would not close the gap.

Centralised VLIW
For the second reference case, called the VLIW baseline, we have modelled a standard centralised VLIW with 8 FUs using the CRISP processor architecture exploration environment [24]. This retargetable research compiler and simulation framework for VLIW-like architectures was built on top of Trimaran. The CRISP simulator reports the total number of cycles and the total number of operations needed to execute the MPEG2 decoder for four frames, and also reports the number of reads and writes to the different levels of the memory hierarchy. The results of this experiment are highly influenced by the performance of the compiler. The first straightforward result, obtained by compiling the unoptimised Mediabench MPEG2 decoder code, was a cycle count of 484 million and an operation count of 824 million, or an IPC of only 1.7. This shows that on average fewer than 2 of the 8 FUs are used. This very bad performance (even worse than the RISC) was the result of a mismatch between the compiler front-end, Impact, and the Elcor back-end of Trimaran. An aggressive loop unrolling in Impact prevented Elcor from performing modulo scheduling in the kernels and generated a lot of spill code. This explains the extra operations and the degraded performance. It shows, however, that simply using public domain compiler software to compare different solutions is not sufficient to draw valid conclusions. It took a lot of effort to find out all the issues involved and to modify the compilation settings accordingly to obtain a correct comparison. Changing the parameter settings of the front-end so that the aggressive unrolling was turned off had the biggest impact; with this change the result improved to 182 million cycles and 392 million operations, or an IPC of 2.16. This means that at 600 MHz this processor can decode 13.2 frames per second, which is still not impressive. Another important issue that is often overlooked in comparisons is the impact of source code transformations. The way the code was initially written heavily influences the outcome of the experiments. By performing very processor-specific manual optimisations to the MPEG2 decoder code, the performance was increased to 88 (86/60) million cycles and 369 (252) million operations, or an IPC of 4.17 (2.92/4). In this case, on average, more than half of the FUs are used every cycle. At 600 MHz, this processor can decode 27 (27.9/39.9) frames per second. The manual optimisation consists of well-known code transformations, but these have to be applied with the specific architecture in mind. To obtain a fair comparison, we have also spent much effort here to apply this to all the investigated styles. This experiment shows that mapping an application to a certain platform is highly influenced by the quality of the compiler, and that big improvements can still be made by (up to now) manual source code transformations.

Clustered VLIW
The CRISP simulation framework extends Trimaran by adding support for data-path clustering. This includes (1) assigning operations to a data-path cluster, (2) scheduling inside the clusters (assigning operations to a functional unit, in a certain cycle), using modulo scheduling for the kernels, and (3) adding the necessary copy operations to communicate between clusters. To compare with the 8 FU centralised VLIW of the previous paragraph, we have modelled a clustered VLIW with 2 clusters of 4 FUs. The simulations show that the clustered processor needs 95 (95/72) million cycles and 415 (311) million operations, or an IPC of 4.33 (3.24/4.31). This means it could decode 25 (25.2/33.3) frames per second. The decreased performance (8% more cycles needed) and the significant increase in the number of operations (12% more) are the result of the communication overhead between different clusters: the compiler has to add extra inter-cluster copies. In this case the overall IPC, including these extra operations, gives the wrong impression; in fact, fewer instructions belonging to the original algorithm are executed per cycle. As was explained in Section 3.2, the goal of clustering is to save energy. Because the energy consumption of a centralised register file quickly dominates the energy budget of a processor as the number of ports grows [23], clustered register files allow designers to reduce the energy consumption of their VLIWs or to keep the energy consumption of large VLIW processors under control. We have also modelled a clustered VLIW with 3 clusters, each containing 4 FUs, to see how adding more clusters influences performance. The decoding takes 84 million cycles and 445 (337) million operations, with an IPC of 5.26 (3.98/5.26). The number of frames that can be decoded at 600 MHz would be 28 (28.5/37.5). In this case, the very small performance improvement cannot justify the increase in resources, and the higher operation count will lead to a higher energy consumption for the same task. The positive effect on performance of adding more resources (more clusters) is quickly countered by the performance degradation due to the communication overhead between the clusters. The number of clusters that can efficiently be used depends on the amount of parallelism available in the application, and on the ability of the compiler to exploit it.
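The quoted overheads follow directly from the reported counts (Elcor semantics); a quick check, where attributing all extra operations to inter-cluster copies is our simplifying assumption:

#include <stdio.h>

int main(void) {
    double cyc_central = 88e6, ops_central = 369e6;  /* 1x8 FU, from text */
    double cyc_2x4     = 95e6, ops_2x4     = 415e6;  /* 2x4 FU, from text */
    /* Extra cycles and operations stem from inter-cluster communication:
     * the compiler must insert explicit copy operations. */
    printf("cycle overhead: %.0f%%\n", 100 * (cyc_2x4 / cyc_central - 1));
    printf("op overhead:    %.0f%%\n", 100 * (ops_2x4 / ops_central - 1));
    printf("extra (copy) ops: %.0f million\n",
           (ops_2x4 - ops_central) / 1e6);
    return 0;
}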

Coarse-Grained Array
Experiments for the coarse-grained array have been performed with the retargetable DRESC compiler [15]. This compiler can map complete applications, written in C, to the ADRES architecture [14]: a tightly coupled VLIW and coarse-grained array. DRESC uses modulo scheduling to map the regular and compute intensive kernels of the application onto the 2D coarse-grained reconfigurable matrix, while the more control dominated parts are mapped to the VLIW part of the architecture. The DRESC compiler uses the Impact research compiler (front-end and back-end) to compile to this VLIW part. For the moment DRESC still requires a limited manual effort from the designer. The most compute intensive parts of the code, identified by profiling, should be marked so DRESC will consider mapping these kernels to the array. Other architecture-specific code transformations greatly improve the obtained results (the transformations used here are not necessarily the same ones as are needed for the VLIW baseline styles discussed previously). The code that is used here is the result of less than one designer-week of manual transformations. The simulated architecture consists of an 8x8 array of FUs, of which the first row is a normal VLIW. The other FUs are ALU-like resources or multipliers, with support for predication and routing capabilities. A reconfigurable interconnect architecture allows results from an operation performed on one FU to be fed into other FUs in the array, without passing through the big centralised register file. The simulation results for this case do not include the performance effect of a data memory hierarchy or ICMH, and assume all needed data is available in one big level 1 data cache. This means that a realistic implementation including the necessary memory hierarchies would add extra stalls because of cache misses; the reported figures are therefore an underestimate of the cycle count. The decoding of the four MPEG2 frames on ADRES takes 26 million cycles and 331 million operations, with an IPC between 15 and 35 for the kernels that are mapped to the array and an overall IPC of 9.33. Counting only the valid operations (with predicate equal to true) and excluding the routing operations, the overall IPC is 8.8. Running at 600 MHz, this architecture would be able to decode 90 frames per second.

5.2. ICMH optimisations

Figure 5 shows the results when the instruction memory optimisation techniques discussed in section 4 are applied. These experiments were again carried out using the CRISP environment [24], for three different architectural configurations: (a) a VLIW data-path with 8 functional units and a single centralised register file (1x8FU); (b) a VLIW data-path with 8 functional units and two data clusters (two register files) with 4 functional units per data cluster; (c) a VLIW data-path with 12 functional units and three data clusters with 4 functional units per cluster. The number of words in the instruction cache is the same in all three configurations, but the block sizes are increased proportionately (#FUs * 4 bytes). The energy and power figures were estimated for a 0.10 um technology with a Vdd of 1.9 V and scaled to a 0.13 um technology with a Vdd of 1.2 V, which are the technological parameters used as the basis for our case

study. In all cases, we have fixed a frequency of 600 MHz. The L1 instruction cache is modelled in Cacti and the L0 buffers are modelled in Wattch as register files with 1 read and 1 write port. Different steps have been applied:

1. No L0: no loop buffer. All instructions are loaded from the IL1 cache (8 kB). This is the reference for the other steps (100%).

2. Inner Only: a simple, non-clustered loop buffer that is capable of storing 64 instructions. Most of the inner loops that fit, i.e. that have fewer than 64 instructions, are executed from the loop buffer. No compiler optimisations are applied to the application. This case represents how loop buffers are implemented in hardware in current commercial processors. Here the instruction memory energy is reduced by 35%.

3. Centralised: a centralised (i.e. non-clustered) loop buffer that has the optimal size for this application. It is software controlled, meaning that the compiler has optimised the mapping of loops such that energy is minimised. We can infer from Figure 5 that such a software-controlled loop buffer performs better than the previous step, because by selecting the appropriate basic blocks to be mapped onto the loop buffer, energy can be saved.

4. Clustered: a loop buffer that is clustered into loop buffer partitions of optimal size. Clearly, clustering the loop buffer has the desired effect: the loop buffer energy is reduced by at least 50% in all three machine configurations. This corresponds to an additional energy reduction of 15% compared to the centralised loop buffer.

5. Scheduling: a loop buffer that is clustered into loop buffer partitions of optimal size, where in addition the compiler has adapted the schedule of the program to minimise the loop buffer energy even further. Another 4% of energy is squeezed out.

6. Discussion

As seen in the previous sections, VLIW processors exploit their extra resources to increase performance, but the cost of multi-ported register files causes an important energy penalty. In Table 2 we show the energy consumption of the register file for the different VLIW configurations under study. The power model used is linear in the number of simultaneous read/write accesses and in the number of read and write ports, as reported in [25]. Compared to a centralised VLIW, a clustered VLIW can bring the energy consumption of the VLIW architecture down, at the cost of a (small) decrease in performance. At first, the performance of a clustered VLIW increases as more resources are added. After a certain point, the performance deteriorates due to the inter-cluster communication and the limited ability of the compiler to use all the clusters. In that case, due to the bigger number of ports and the extra operations, the power consumption increases significantly.

                          a)     b)    c)
#Reg. file clusters       1      2     3
#read ports (/cluster)    16     8     8
#write ports (/cluster)   8      4     4
Power (mW)                66     57    81
Energy (mJ)               15.6   7.9   10.1

a). Centralised VLIW (8 FU), b). Clustered VLIW (2x4 FU), c). Clustered VLIW (3x4 FU)

Table 2. Register file power and energy consumption for the different implementations of the VLIW processor.
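The following sketch shows the shape of such a linear register file power model; the coefficients are hypothetical placeholders, not the fitted values of [25]:

#include <stdio.h>

/* Hypothetical linear register file power model in the spirit of [25]:
 *   P = (c_r * read_ports + c_w * write_ports) * accesses_per_cycle
 * The coefficients below are illustrative, not measured. */
static double rf_power(int rd_ports, int wr_ports, double acc_per_cycle) {
    const double c_r = 0.30, c_w = 0.45;   /* mW per port per access */
    return (c_r * rd_ports + c_w * wr_ports) * acc_per_cycle;
}

int main(void) {
    /* Centralised 8-FU VLIW vs. two 4-FU clusters: the same total
     * access rate is split over two files with half the ports each. */
    double p_central = rf_power(16, 8, 8.0);
    double p_cluster = 2.0 * rf_power(8, 4, 4.0);
    printf("centralised: %.1f mW, 2 clusters: %.1f mW\n",
           p_central, p_cluster);
    return 0;
}

Under a purely linear model, splitting the file exactly halves the power; the measured reduction in Table 2 (66 mW to 57 mW) is smaller because the clustered code performs extra copy accesses, and a third cluster (81 mW) already adds more ports and copies than it saves.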

The VLIW with tightly coupled coarse-grained array can really boost performance for applications which are regular enough (sufficient loop level parallelism with limited control flow) to keep the vast amount of resources busy, but as in the other cases, the price that has to be paid is an increase in the number of operations. These extra operations

are partly needed to configure the interconnect at the start of each loop that is mapped to the coarse-grained array. This, however, eliminates the need for the expensive inter-cluster copy operations that are needed in the clustered VLIW. Because of the high number of resources, the area of this design style is significantly bigger than that of the other options. We should also note that the extra performance gained by adding more resources gets smaller and smaller as the number of resources keeps growing. The impact on the energy consumption of this architecture for the execution of the same task is not clear at this moment, and is the subject of future work. Because of the different trade-offs involved here (energy efficiency vs. performance vs. area), designers will have to consider different design styles, to be able to pick the one that is best suited to the given requirements. We expect that different styles will be integrated together in future heterogeneous platforms, and that the activated components will be decided at mapping time or even at run-time, depending on the dynamic application characteristics. In all cases the energy cost of the ICMH will remain very important, and architecture and compiler optimisations are crucial to at least partially alleviate it, as shown in section 5.

7. Conclusions

The design style case study for embedded compute nodes shows that improvements in energy efficiency and/or performance over currently used RISC or VLIW processors can be achieved. From the results presented in Table 1 we can conclude that a RISC processor can do the job with the smallest number of operations. However, it is clear that a RISC, even at higher clock frequencies, does not provide the performance that is needed for this real-time application. Next to the case study for the compute nodes, the instruction memory optimisations for the same MPEG2 decoder application show that we were able to reduce the instruction memory energy by 60%, from 38 mJ to 15 mJ. This has a large impact on the overall energy consumption of the compute node: while originally the instruction memory power could be more than 50% of the total power, it has now been reduced to only 25% of the total.

References

[1] M. Viredaz, D. Wallach, "Power Evaluation of a Handheld Computer", In IEEE Micro, Vol. 1, pp. 66-74, January 2003.

[2] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P.G. Kjeldsberg, T. Van Achteren, T. Omnes, "Data Access and Storage Management for Embedded Programmable Processors", Kluwer Academic Publishers, 2002.

[3] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P.G. Kjeldsberg, T. Van Achteren, T. Omnes, "Data Access and Storage Management for Embedded Programmable Processors", ISBN 0-7923-7689-7, Kluwer Academic Publishers, Boston, 2002.

[4] P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, P.G. Kjeldsberg, "Data and Memory Optimizations for Embedded Systems", In ACM Trans. on Design Automation of Electronic Systems (TODAES), Vol. 6, No. 2, pp. 142-206, April 2001.

[5] N. Vijaykrishnan, M. Kandemir, M.J. Irwin, H.S. Kim, D. Duarte, "Evaluating Integrated Hardware-Software Optimizations Using a Unified Energy Estimation Framework", In IEEE Transactions on Computers, Vol. 52, No. 1, pp. 59-75, January 2003.

[6] S.V. Adve, D. Burger, R. Eigenmann, A. Rawsthorne, M.D. Smith, C.H. Gebotys, M.T. Kandemir, D.J. Lilja, A.N. Choudhary, J.Z. Fang, P-C. Yew, "Changing Interaction of Compiler and Architecture", In IEEE Computer Magazine, 30(12):51-58, 1997.

[7] C. Lee, J. Lee, T.T. Hwang, "Compiler Optimization on Instruction Scheduling for Low Power", In Proc. of the International Symposium on System Synthesis (ISSS), 2000.

[8] W-C. Cheng, M. Pedram, "Power-aware Bus Encoding Techniques for I/O and Data Busses in an Embedded System", In Journal of Circuits, Systems, and Computers, 11(4):351-364, August 2002.

[9] H. Lekatsas, J. Henkel, W. Wolf, "Code Compression for Low Power Embedded System Design", In Proc. of the Design Automation Conference (DAC), June 2000.

[10] S. Debray, W. Evans, R. Muth, B. De Sutter, "Compiler Techniques for Code Compaction", In ACM Transactions on Programming Languages and Systems (TOPLAS), 22(2):378-415, March 2000.

[11] A. Halambi, A. Shrivastava, P. Biswas, N. Dutt, A. Nicolau, "An Efficient Compiler Technique for Code Size Reduction Using Reduced Bit-width ISAs", In Proc. of the Design Automation Conference (DAC), March 2002.

[12] S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, "Assigning Program and Data Objects to Scratchpad for Energy Reduction", In Proc. of Design, Automation and Test in Europe (DATE), March 2002.

[13] B. Mei, S. Vernalde, D. Verkest, H. De Man, R. Lauwereins, "Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling", In Proc. of Design, Automation and Test in Europe (DATE), March 2003.

[14] B. Mei, S. Vernalde, D. Verkest, H. De Man, R. Lauwereins, "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix", In Proc. of the International Conference on Field Programmable Logic and Applications (FPL), September 2003.

[15] B. Mei, S. Vernalde, D. Verkest, H. De Man, R. Lauwereins, "DRESC: A Retargetable Compiler for Coarse-Grained Reconfigurable Architectures", In Proc. of the International Conference on Field Programmable Technology (FPT), September 2002.

[16] P. Op de Beeck, C. Ghez, E. Brockmeyer, M. Miranda, F. Catthoor, G. Deconinck, "Background Data Organisation for the Low-Power Implementation in Real-Time of a Digital Audio Broadcast Receiver on a SIMD Processor", In Proc. of Design, Automation and Test in Europe (DATE), pp. 1144-1145, March 2003.

[17] L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, "A Power Modeling and Estimation Framework for VLIW-based Embedded Systems", In ST Journal of System Research, 3(1):110-118, April 2002.

[18] W. Tang, R. Gupta, A. Nicolau, "Power Savings in Embedded Processors through Decode Filter Cache", In Proc. of Design, Automation and Test in Europe (DATE), March 2002.

[19] R. Bajwa, M. Hiraki, H. Kojima, D. Gorny, K. Nitta, A. Shridhar, K. Seki, K. Sasaki, "Instruction Buffering to Reduce Power in Processors for Signal Processing", In IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(4):417-424, December 1997.

[20] N. Bellas, I. Hajj, C. Polychronopoulos, G. Stamoulis, "Architectural and Compiler Support for Energy Reduction in the Memory Hierarchy of High Performance Microprocessors", In Proc. of the International Symposium on Low Power Electronics and Design (ISLPED), August 1998.

[21] T. Burd, R.W. Brodersen, "Energy Efficient Microprocessor Design", Kluwer Academic Publishers, 2002 (1st Edition).

[22] M. Jayapala, F. Barat, T. Vander Aa, F. Catthoor, G. Deconinck, H. Corporaal, "Clustered L0 Buffer Organization for Low Energy Embedded Processors", In Proc. of the 1st Workshop on Application Specific Processors (WASP), November 2002.

[23] F. Barat, M. Jayapala, T. Vander Aa, R. Lauwereins, G. Deconinck, H. Corporaal, "Low Power Coarse-Grained Reconfigurable Instruction Set Processor", In Proc. of the International Conference on Field Programmable Logic and Applications (FPL), September 2003.

[24] V. Zyuban, P. Kogge, "The Energy Complexity of Register Files", In Proc. of the International Symposium on Low Power Electronics and Design (ISLPED), Monterey, CA, pp. 305-310, August 1998.

[25] L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, R. Zafalon, "A Power Modeling and Estimation Framework for VLIW-based Embedded Systems", In Proc. of the International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), September 2001.
