A Generic Model of Embedded System to Enable Dynamic Self-Reconfigurable Applications

July 4, 2017 | Author: Ewerson Carvalho | Category: Case Study, Embedded System, Generic model


XIX SIM - South Symposium on Microelectronics


Eduardo Brião, Ewerson Carvalho, Leandro Möller, Daniel Camozzato, Ney Calazans, Fernando Moraes
[email protected], {ecarvalho, moller, camozzato, calazans, moraes}@inf.pucrs.br

Abstract

ASIPs and reconfigurable processors are architectural choices to extend the capabilities of a given processor. ASIPs suffer from hardware that is fixed after design, while reconfigurable processors suffer from the lack of a pre-established instruction set, which makes them difficult to program. An intermediate solution, reconfigurable coprocessor systems (RCSs), contains dedicated hardware (coprocessors) coupled to a standard processor core to accelerate specific tasks, making it possible to insert or substitute hardware functionalities at execution time. This paper proposes a generic model for RCSs, targeted to reconfigurable devices with self-reconfiguration capabilities. A proof-of-concept case study, with performance and prototyping results, is also presented.

1. Introduction

Processors employed to implement embedded systems have to cope with widely varying sets of requirements. For example, an embedded multimedia application may demand that a processor compute FFT operations, while a network application could require real-time comparison of IP packets. Features such as silicon area, performance and flexibility further constrain the design of these embedded processors. A single processor may meet the requirements of several embedded system scenarios if it is somehow parameterizable.

Application-specific instruction set processors (ASIPs) and reconfigurable instruction set processors (RISPs) are two opposite forms of implementing parameterizable processors with regard to design versus runtime parameterization trade-offs. ASIPs provide flexibility and performance at the cost of extra silicon area for each new function supported directly in hardware. If the application requires a new specific functionality, the ASIP must be redesigned. RISPs such as the DISC [WIR 95] are processors where some or all instructions are implemented as dedicated hardware and loaded on demand, according to the software execution flow. Here, the highest degree of flexibility is achieved. However, the lack of a pre-established fixed instruction set makes it harder to generate the object code for new applications, because each new function must be supported at the same time by the dedicated hardware and by the compiler.

An intermediate solution, named reconfigurable coprocessor systems (RCSs), is addressed in this paper. Examples of RCSs are the Garp architecture [CAL 00] and the approach described herein, based on the FiPRe infrastructure. Like ASIPs and RISPs, RCSs contain dedicated hardware (coprocessors) to accelerate specific tasks. However, these are not fixed at design time as in ASIPs. In RISPs and RCSs it is possible to insert or substitute hardware functionalities at execution time without having to redesign the processor.
Contrary to what happens in RISPs, RCSs contain a standard processor core with a fixed instruction set, enabling the use of standard compilers. The processor and the parameterizable parts are loosely coupled in RCSs and tightly coupled in RISPs and ASIPs. The RCS processor core communicates with reconfigurable hardware using software instructions and a standard bus interface. Techniques such as blocking or non-blocking communication can be directly employed for interfacing the core processor with the reconfigurable parts. RISPs and ASIPs, on the other hand, rely upon the specialization of the datapath and/or the control unit hardware to extend the native Instruction Set Architecture (ISA). Loosely coupled approaches favor software reuse, at the cost of added latency in the communication with the reconfigurable coprocessors. Additionally, ASIPs and RISPs are inherently sequential approaches, while RCSs may benefit from the parallel execution of the processor software and dedicated computations in each coprocessor. Communication between the processor and the coprocessors can be achieved in this case through the use of interrupts.

A potential performance bottleneck faced by embedded applications in RISPs and RCSs is the latency of hardware reconfiguration, which can be orders of magnitude longer than the time to perform atomic application operations. To reduce or eliminate this problem, RISPs and RCSs assume the existence of an infrastructure to control the storage and the dynamic loading of hardware configurations, usually called a configuration controller [ROB 99].

Consider the current trend to increase the number of embedded processors in SoCs, leading to the concept of "sea of processors" systems [HEN 03], and add to this the above discussion on implementation alternatives for parameterizable embedded processors. Together, these justify the objective of this paper, which is to propose a generic implementation model for RCSs and to introduce a case study used to evaluate the ideas behind the model.

The rest of this paper is organized as follows. Section 2 presents a model to implement RCSs, called FiPRe. Section 3 presents an implementation case study based on the FiPRe model, where a simple 16-bit core processor is adapted to support dedicated instructions and connected to a configuration controller and two reconfigurable areas. Section 4 presents area and performance results of the implemented case study, while Section 5 presents conclusions and directions for future work.

2. The FiPRe Implementation Model

The FiPRe implementation model takes its name from its most important characteristics: a Fixed core Processor connected to Reconfigurable Coprocessors. Fig. 1 presents the block diagram of the model, intended to allow self-reconfigurable applications implemented as RCSs. First, there is a Fixed Region, which has an embedded processor to execute applications and to trigger reconfiguration actions. This region also contains a configuration controller, which manages the details of the reconfiguration process. External devices that provide input/output capabilities for the embedded system are a natural part of the model. In addition, a dedicated memory is needed to store the reconfigurable coprocessor bitstreams, a block named Configuration Memory in Fig. 1. Finally, the model assumes the existence of a Reconfigurable Region that contains a subset of configured coprocessors (C). This region presents data exchange and configuration interfaces to the rest of the system.

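[Fig. 1 diagram: a Fixed Region containing the Embedded Processor Subsystem and the Configuration Controller; External Devices; the Configuration Memory; the Configuration Interface; and a Reconfigurable Region holding coprocessors C1 ... Cn.]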

Fig. 1. Block diagram of the FiPRe implementation model.

To employ an embedded processor in a system that follows the FiPRe model, the processor must meet the following requirements: (i) availability of control and data channels to provide communication between the processor and the coprocessors; (ii) processor support to select and dismiss coprocessors; (iii) processor support to initialize and communicate with coprocessors. Processors used in embedded applications can easily support the first requirement. The second and third requirements imply communication between the processor and the configuration controller, and between the processor and the reconfigurable region, respectively. These requirements can be met by adapting the Instruction Set Architecture (ISA) of the processor to support them explicitly, as described in the case study implementation of Section 3. Alternatively, the same requirements can be met by simply designing adequate memory mapped I/O interfaces for the configuration controller and the reconfigurable coprocessors. In this way, hard core processors as well as soft core processors can be used to implement RCSs following the FiPRe model. Of course, system performance is potentially easier to increase if the RCS designer is able to tune the embedded processor architecture to work with the reconfigurable part of the system.

As stated before, a fixed instruction set processor provides an advantage in terms of code and hardware reuse, because neither the core processor nor the compiler needs to be changed when developing the reconfigurable coprocessors needed to achieve system performance and functionality goals. The configuration controller is responsible for handling the coprocessor selection and dismiss procedures issued by the processor.
When the selection procedure is executed, the configuration memory is accessed and the respective coprocessor bitstream is sent to the configuration interface, but only if this coprocessor is not already configured. The processor is notified at the end of the procedure, after which the coprocessor is ready to be used. The dismiss procedure tells the configuration controller that it need not retain the named configured coprocessor any longer in the system, which effectively frees a portion of the reconfigurable region. For further details about the structure of the configuration controller the reader should refer to 0.
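The select and dismiss procedures described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the hardware design of the paper: the class name, the method names and in particular the area replacement policy (use an empty area first, then one holding a dismissed coprocessor) are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ConfigurationController:
    """Toy model of the select/dismiss procedures (hypothetical, not the real CC)."""
    config_memory: Dict[int, bytes]          # coprocessor id -> partial bitstream
    num_areas: int = 2                       # reconfigurable areas available
    areas: List[Optional[int]] = field(default_factory=list)
    dismissed: Set[int] = field(default_factory=set)
    reconfigurations: int = 0                # how many bitstreams were loaded
    bytes_sent: int = 0                      # bytes pushed to the configuration interface

    def __post_init__(self):
        self.areas = [None] * self.num_areas

    def select(self, cop_id: int) -> int:
        """Return the area holding cop_id, reconfiguring only if it is absent."""
        if cop_id in self.areas:
            self.dismissed.discard(cop_id)   # still wanted, keep it
            return self.areas.index(cop_id)
        area = self._pick_area()
        self.bytes_sent += len(self.config_memory[cop_id])  # stands in for the transfer
        self.areas[area] = cop_id
        self.reconfigurations += 1
        return area                          # "notification" that the coprocessor is ready

    def dismiss(self, cop_id: int) -> None:
        """Mark a configured coprocessor as removable; its area is reused lazily."""
        if cop_id in self.areas:
            self.dismissed.add(cop_id)

    def _pick_area(self) -> int:
        # Assumed policy: empty areas first, then areas holding dismissed coprocessors.
        for index, loaded in enumerate(self.areas):
            if loaded is None:
                return index
        for index, loaded in enumerate(self.areas):
            if loaded in self.dismissed:
                self.dismissed.discard(loaded)
                return index
        raise RuntimeError("no free reconfigurable area; dismiss a coprocessor first")
```

In this model a second `select` of an already configured coprocessor returns immediately without touching the configuration memory, which mirrors the "only if this coprocessor is not already configured" condition in the text.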

3. The R82R System Case Study

In order to evaluate the FiPRe model as a way to implement RCS embedded systems, an example case study, named R82R, was designed and implemented.

XIV Microelectronics Seminar


A set of choices was made to produce the case study. First, the system, except the configuration memory, was implemented in a single device. A Xilinx Virtex-II FPGA was used to prototype the case study. The goal was to evaluate the implementation of SoC RCSs. Second, the case study employed a soft core processor, customized to facilitate its use with the FiPRe model. The changes made to the original processor were the addition of specific instructions to deal with the reconfiguration process, and the addition of a specific external interface to the reconfigurable region and to the configuration controller. The objective of this interface is to potentially increase performance by allowing the processor-memory interface to proceed in parallel with processor-coprocessor communication. For example, the processor may contain a specialized DMA module programmed to interact with reconfigurable coprocessors while the processor executes software, with the DMA using interrupts to signal the processor that data generated by the coprocessors is ready to be processed. Another simplifying choice was to allow coprocessors to act only as slaves of the processor, making the processor-coprocessor interface simple to design and use. The remainder of this Section describes the structure and the operation of the R82R case study.

Fig. 2 displays the organization of the R82R system. The system is composed of three main modules: (i) a host computer, providing an interface to the system user; (ii) a configuration memory, containing all partial bitstreams used during the system execution; (iii) the FPGA, containing the fixed and reconfigurable regions of the R82R system. The fixed part in the FPGA is a complete computational system, comprising the R8R processor, its local memory containing instructions and data, a system bus controlled by an arbiter, and peripherals (the serial interface and the configuration controller). The serial interface peripheral provides communication with the host computer (an RS232 serial interface). The Configuration Controller (CC) peripheral is specialized hardware, designed to act as a slave of the R8R processor or of the host computer. The host computer typically fills the configuration memory before system execution starts.

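[Fig. 2 diagram: the Host Computer and the Configuration Memory connect to an FPGA. The Fixed Region of the FPGA holds the R8R processor, its Local Memory, the arbiter, the Serial Interface and the CC, which drives the Physical Configuration Interface (ICAP). Signals shown: reconf, remove and ack between the R8R and the CC; IOce, IOrw, IOreset, IOack, IOdata_in, IOdata_out and IOaddress towards Reconfigurable Coprocessors 1 to N, which sit in Reconfigurable Areas 1 to N behind bus macros.]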

Fig. 2. General structure of the R8NR system. In the implementation discussed here, the number of reconfigurable areas is 2 (N=2). The reconfigurable region corresponds to the set of reconfigurable areas.

The R8R processor is based on the R8 processor, a 16-bit load-store 40-instruction RISC-like processor [MOR 03]. R8 is a Von Neumann machine, whose logic and arithmetic operations are executed among registers only. The register bank contains 16 general-purpose registers. Each instruction requires at most 4 clock cycles to execute. The original R8 processor was transformed into the R8R processor by the addition of five new instructions intended to support the use of partially reconfigurable coprocessors. The added instructions are defined and described in Tab. 1. The R8R processor was wrapped to provide communication with (i) the local memory; (ii) the system bus; (iii) the CC; (iv) the reconfigurable region. The interface to the reconfigurable areas comprises three identical sets of signals interconnected through special components furnished by the FPGA vendor, called bus macros.

The coprocessors are configured on demand, under control of the software that executes on the R8R processor. During the execution of the system, the R8R selects, at each moment, one specific coprocessor with which it operates. This selection is sent to the CC, which checks the allocation state of the reconfigurable areas to verify whether the coprocessor is already present in the hardware. If necessary, the CC configures the coprocessor in some unoccupied reconfigurable area. After this, the CC notifies the processor that the selected coprocessor is ready. From then on, the software can request coprocessor services. The next paragraphs detail the inner workings of this protocol.

Tab. 1. Instructions added to the R8 processor in order to produce the R8R processor 0.

Reconfigurable instruction   Semantics description
SELR address                 Selects the coprocessor identified by address for communication with the processor, using the reconf signal. If the coprocessor is not currently loaded into the FPGA, the CC automatically reconfigures some area of the FPGA with it.
DISR address                 Informs the CC, using the remove signal, that the coprocessor specified by address is dismissed and can be removed if needed.
INITR address                Resets the coprocessor specified by address, using the IOreset signal. The coprocessor must have been previously configured.
WRR RS1 RS2                  Sends the data stored in RS1 and RS2 to the coprocessor selected by the last SELR instruction. RS1 can be used for passing a command while RS2 passes data. The coprocessor must have been previously configured.
RDR RS RT                    Sends the data stored in RS to the coprocessor (typically a command or an address). Next, reads data from the coprocessor, storing it into the RT register. The coprocessor must have been previously configured.

The normal operation of the CC module is to wait for the R8R processor to issue coprocessor reconfiguration requests using the reconf signal, with the specific coprocessor identifier placed on the IOaddress lines. If the coprocessor is already in place (information stored internally in the CC), the ack signal is immediately asserted, which releases the processor to resume instruction execution. If the coprocessor is not configured in any reconfigurable area, the reconfiguration process is triggered. The CC is responsible for locating the configuration memory area that holds the coprocessor bitstream corresponding to the identifier. This bitstream is read from memory and sent, word by word, to the Physical Configuration Interface. For Virtex-II devices this corresponds to the ICAP module. In this case, the ack signal is asserted only after the reconfiguration process is over. The remove signal exists to allow the R8R processor to invalidate some reconfigurable coprocessor. This helps the CC choose the most adequate area to reconfigure next.

The processor-coprocessor communication protocol is based on read/write operations. Write operations are used by the processor to send data to some coprocessor. A write operation uses the IOce, IOrw, IOdata_out and IOaddress signals, broadcasting these to all reconfigurable areas. The coprocessor whose address matches the one on the IOaddress signal gets the data. The read operation works similarly, with the difference that data flows from a coprocessor to the processor. The IOreset signal is asserted by the processor to initialize one of the coprocessors. The communication between the processor and the coprocessors is achieved by using bus macros. In this communication scheme, the processor is the system master, starting all read/write operations. A new version of this system is currently under implementation to enable interrupt-controlled communication. This will allow implementing non-blocking operations and coprocessor-initiated communication.
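As an illustration of how software might use the five added instructions in sequence, the sketch below models them as Python methods and drives a made-up multiplier coprocessor. Everything here is an assumption for illustration: the command codes (0 and 1 to load operands, 2 and 3 to read the low and high halves of the product), the class names and the coprocessor address are hypothetical and are not taken from the R8R specification.

```python
class MultiplierCoprocessor:
    """Hypothetical slave coprocessor; command codes are invented for this sketch."""

    def __init__(self):
        self.reset()

    def reset(self):                       # effect of IOreset (INITR)
        self.a = 0
        self.b = 0

    def write(self, command, data):        # WRR: command in RS1, data in RS2
        if command == 0:
            self.a = data & 0xFFFF
        elif command == 1:
            self.b = data & 0xFFFF

    def read(self, command):               # RDR: command in RS, result goes to RT
        product = self.a * self.b
        if command == 2:
            return product & 0xFFFF        # low 16 bits
        return (product >> 16) & 0xFFFF    # high 16 bits


class R8RModel:
    """Toy model of the processor side; the processor is always the master."""

    def __init__(self, library):
        self.library = library             # address -> class standing in for a bitstream
        self.configured = {}               # address -> configured coprocessor
        self.removable = set()             # addresses dismissed with DISR
        self.selected = None

    def selr(self, address):               # blocks until the CC would assert ack
        if address not in self.configured:
            self.configured[address] = self.library[address]()  # "reconfiguration"
        self.removable.discard(address)
        self.selected = address

    def disr(self, address):               # coprocessor may be removed if needed
        self.removable.add(address)

    def initr(self, address):
        self.configured[address].reset()

    def wrr(self, rs1, rs2):
        self.configured[self.selected].write(rs1, rs2)

    def rdr(self, rs):
        return self.configured[self.selected].read(rs)


def multiply(cpu, address, x, y):
    """Select, initialize, write two operands, read the 32-bit result, dismiss."""
    cpu.selr(address)
    cpu.initr(address)
    cpu.wrr(0, x)
    cpu.wrr(1, y)
    low = cpu.rdr(2)
    high = cpu.rdr(3)
    cpu.disr(address)
    return (high << 16) | low
```

A call such as `multiply(R8RModel({5: MultiplierCoprocessor}), 5, 300, 400)` walks through the whole SELR, INITR, WRR, RDR, DISR sequence; note that dismissing the coprocessor only marks it as removable, in line with the "can be removed if needed" wording of Tab. 1.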

4. Results

The R8NR system described in Section 3 has been prototyped and is operational in two versions, with one and two reconfigurable areas, respectively (R81R and R82R). A V2MB1000 prototyping platform from InsightMemec was used. This platform contains a million-gate XC2V1000 Xilinx FPGA, memory and I/O resources.

In order to evaluate the relative performance of the implemented reconfigurable system, a set of experiments was conducted. For each experiment, the functionality was implemented in both software and hardware (as a coprocessor). Each implementation was executed in a system version with one reconfigurable area, and their execution times were compared. The execution time for the hardware version in each experiment includes not only the coprocessor execution time, but also the time to load the coprocessor into the FPGA by reconfiguration.

Fig. 3 compares the time to execute hardware and software versions of three 16/32-bit arithmetic modules (multiplication, division and square root) as a function of the number of operations executed. Note that for each module, the execution time grows linearly but with different slopes for software and hardware implementations. Also, the reconfiguration time adds a fixed latency (10 ms) to the hardware implementations. The fixed latency is an approximation of the time needed to configure one FPGA area dedicated to hold one coprocessor. The break-even point for each functionality determines when a given hardware implementation starts to be advantageous with regard to a plain software implementation, based on the number of times this hardware is employed before it is reconfigured. From the graph, it can be seen that the multiplication, division and square root coprocessors become advantageous starting from 750, 260 and 200 executions, respectively, without an intervening reconfiguration step. Consider the application of a filter (e.g. edge or smooth) over an image with 800x600 pixels.
If only one operation is applied per pixel, 480000 operations are executed, easily justifying the use of a hardware coprocessor. This simple experiment highlights how in practice it is possible to take advantage of RCSs, achieving performance gains, flexibility and system area savings.
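The break-even reasoning above can be made explicit with a simple linear cost model: software costs n times its per-operation time, while hardware costs the reconfiguration latency plus n times its per-operation time. The sketch below is an illustration under that assumption; the function name is hypothetical, and the per-operation times are not reported in the text, so only the per-operation saving implied by the 10 ms latency and the reported break-even counts is derived.

```python
def break_even_executions(reconfig_ns, sw_ns_per_op, hw_ns_per_op):
    """Smallest n with reconfig + n*hw <= n*sw, or None if hardware never wins.

    Times are integers in nanoseconds to avoid floating-point rounding.
    """
    saving = sw_ns_per_op - hw_ns_per_op
    if saving <= 0:
        return None
    return -(-reconfig_ns // saving)       # ceiling division


RECONFIG_NS = 10_000_000                   # 10 ms fixed latency, as stated in the text

# Break-even counts read from Fig. 3 in the text.
reported_break_even = {"multiplication": 750, "division": 260, "square root": 200}

# Per-operation saving (software minus hardware) implied by the linear model.
implied_saving_ns = {
    name: RECONFIG_NS / count for name, count in reported_break_even.items()
}

# The 800x600 image filter example: one operation per pixel.
pixel_operations = 800 * 600
```

Under this model the reported break-even points correspond to savings of roughly 13.3, 38.5 and 50 microseconds per operation for multiplication, division and square root, and the 480000 operations of the image example exceed every break-even count by more than two orders of magnitude.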

Fig. 3. Execution time versus the number of operations for three arithmetic modules, multiplication, division and square root, implemented in hardware (hw suffix) and software (sw suffix).

The hardware reconfiguration latency, 10 ms, depends on the size of the partial bitstream of the reconfigurable area and on the performance of the CC module. This graph was plotted for a CC working at a frequency of 24 MHz and for reconfigurable bitstreams of 46 Kbytes, corresponding to a reconfigurable coprocessor with an area of roughly one tenth of the employed million-gate device. With more design effort it is estimated that this latency can be reduced by at least one order of magnitude, for the same bitstream size 0.

The R82R case study was synthesized using Leonardo Spectrum. The area report for the fixed modules is presented in Tab. 2, while the area report for the reconfigurable coprocessors is presented in Tab. 3. Configuration controllers (CC) found in the literature are software implementations [CAR 04]. The CC proposed here was implemented in hardware, has a small area footprint (around 3,000 gates) and is expected to present superior performance over software versions in terms of reconfiguration speed. This could not be verified because no other approach reveals reconfiguration speed data. Another important advantage of implementing the CC in hardware is that the embedded processor is free to execute tasks in parallel during the reconfiguration process.

Tab. 2. R82R fixed modules area report for a XC2V1000 FPGA and for 0.35 µm CMOS technology ASIC. LUTs represent combinational logic (10240 LUTs available). FFs means Flip Flops (11212 FFs available). Gates represent the number of equivalent 2-input NAND gates.

Module             ASIC Gates   LUTs    FFs   %LUTs
R82R                     6331   1020    555    9.96
Memory                   3139    307    366    2.99
Serial Interface         5430    616    607    6.01
CC                       2790    493    218    4.81
Arbiter                   157     27     15    0.26
Total                   17847   2443   1761   23.85

The Modular Design software tool 0, used for generating partial bitstreams, limits the size of a reconfigurable area to a minimum of 4 CLB columns. A XC2V1000 FPGA has 32 CLB columns, so it is possible to implement up to 8 distinct reconfigurable areas with this approach. One minimum size reconfigurable area contains 1280 LUTs (each column contains 320 LUTs). Nevertheless, the implemented coprocessors use on average 140 LUTs. Therefore, it is possible to implement much larger coprocessors in these areas. Examples are simple dedicated processors, FFT operators, and image filters.

Tab. 3. Reconfigurable coprocessors area report for a XC2V1000 FPGA and for 0.35 µm CMOS technology ASIC. LUTs represent combinational logic (10240 LUTs available). FFs are Flip Flops (11212 FFs available). Gates mean the number of equivalent 2-input NAND gates.

Module           ASIC Gates   LUTs   FFs   %LUTs
Division               1688    162   188    1.58
Multiplication         1074    127   124    1.24
Square Root            1125    144   129    1.40
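The area figures quoted above follow from simple arithmetic on the device geometry. The short sketch below reproduces it using only numbers stated in the text and in Tab. 3; the variable names are hypothetical.

```python
# Device geometry and tool constraint, as stated in the text.
CLB_COLUMNS = 32              # CLB columns in a XC2V1000
MIN_COLUMNS_PER_AREA = 4      # minimum width of a reconfigurable area
LUTS_PER_COLUMN = 320

max_areas = CLB_COLUMNS // MIN_COLUMNS_PER_AREA              # distinct areas that fit
luts_per_min_area = MIN_COLUMNS_PER_AREA * LUTS_PER_COLUMN   # LUTs in one minimum area
total_luts = CLB_COLUMNS * LUTS_PER_COLUMN                   # LUTs in the whole device

# LUT usage of each coprocessor, taken from Tab. 3.
coprocessor_luts = {"Division": 162, "Multiplication": 127, "Square Root": 144}

# Share of a minimum-size area that each coprocessor actually occupies.
occupancy_percent = {
    name: 100 * luts / luts_per_min_area for name, luts in coprocessor_luts.items()
}
```

Each of the three coprocessors fills less than 13% of a minimum-size area, which is the basis for the remark that much larger coprocessors would fit.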


5. Conclusions

The major contribution of the present work is the RCS model. Even though the case study described in the present work uses a processor with specific instructions to communicate with coprocessors, any processor with memory mapped input/output functions could be employed. This is possible because the model assumes the use of a standard protocol based on read/write operations to provide communication among the processor, the configuration controller and the reconfigurable region. A second advantage of the model is the parallelism between processor and coprocessors, enabling the use of non-blocking operations. Third, the compiler does not need to be changed when a new coprocessor is added. On the other hand, an increased latency in communication may be observed, since the system parts are loosely coupled.

Also, since RCSs are reconfigurable systems, they potentially reduce the final system cost, as the user can employ smaller configurable devices, downloading partial bitstreams on demand. In addition, partial system reconfiguration makes it possible to benefit from a virtual hardware approach, in the same manner that computers benefit from the use of virtual memory.

Application areas for RCSs are another crucial point. Unless sound applications are developed to show a real advantage of an RCS design solution over conventional solutions, RCSs will remain no more than an academic exercise on an interesting subject area. Ongoing work includes performance measurement of benchmarks and improvements to the configuration controller to reduce the time spent in partial reconfiguration.

6. References

[WIR 95] WIRTHLIN, Michael; HUTCHINGS, Brad. A Dynamic Instruction Set Computer. In: Field-Programmable Custom Computing Machines (FCCM'95), pp. 99-107, 1995.

[CAL 00] CALLAHAN, Tim; HAUSER, John; WAWRZYNEK, John. The Garp Architecture and C Compiler. IEEE Computer, vol 33(4), pp. 62-69, 2000.

[ROB 99] ROBINSON, David; LYSAGHT, Patrick. Modeling and Synthesis of Configuration Controllers for Dynamically Reconfigurable Logic Systems using the DCS CAD Framework. In: 9th Field-Programmable Logic and Applications (FPL'99), 1999.

[HEN 03] HENKEL, Jörg. Closing the SoC Design Gap. IEEE Computer, vol 36(9), pp. 119-121, 2003.

[CAR 04] CARVALHO, Ewerson; BRIÃO, Eduardo; MÖLLER, Leandro; MÖLLER, Frederico; MORAES, Fernando; CALAZANS, Ney. RSCM – A Configuration Controller for Reconfigurable Systems. 2004. Submitted to FPL 2004.

[MOR 03] MORAES, Fernando; CALAZANS, Ney. R8 Processor Architecture and Organization Specification and Design Guidelines. 2003. http://www.inf.pucrs.br/~gaph/Projects/R8/public/R8_arq_spec_eng.pdf

[BRI 04] BRIÃO, Eduardo. Partial and Dynamic Reconfiguration to Intellectual Property Cores. Master Degree Dissertation (in Portuguese), Pontifícia Universidade Católica do Rio Grande do Sul, 2004.

[XIL 03] XILINX, Inc. Two Flows for Partial Reconfiguration: Module Based or Difference Based. Application Note XAPP290, Version 1.1. http://www.xilinx.com/xapp/xapp290.pdf, 2003.
