Reconfigurable Computing Machines: An Empirical Analysis


Kris Gaj 1, Tarek El-Ghazawi 2, Allen Michalski 1, Miaoqing Huang 2, Chang Shu 1, Esam El-Araby 2, Mohamed Taher 2, Proshanta Saha 2

1 ECE Department, George Mason University, 4400 University Drive, Fairfax, VA 22030, U.S.A.
2 ECE Department, The George Washington University, 801 22nd Street NW, Washington DC 20052, U.S.A.
{kgaj, emichals, cshu}@gmu.edu, {tarek, mqhuang, esam, mtaher, sahap}@gwu.edu

Abstract. Reconfigurable supercomputers are parallel systems that are designed around multiple general-purpose processors and multiple field programmable gate array (FPGA) chips. These systems can leverage the synergism between conventional processors and FPGAs to provide low-level hardware functionality at the same level of programmability as general-purpose computers. In this work we conduct an experimental study using one of the state-of-the-art reconfigurable computers and a representative set of applications to assess the field, uncover the challenges, propose solutions, and conceive a realistic evolution path. We consider issues of concern including performance/cost. We also consider productivity in the sense of development, compiling, running, and system reliability. It will be shown that for some applications the performance/cost can be orders of magnitude better than that of conventional computers. It will also be shown that programming such machines may still require some hardware knowledge, similar to the hardware knowledge computer programmers must acquire to write scalable programs.

Keywords: FPGA, high performance computing, scientific computing, cryptography, image processing

1. Introduction

The synergistic advances in high-performance general computing systems and in reconfigurable computing based on field programmable gate arrays (FPGAs) form the basis for a new paradigm shift in supercomputing, namely reconfigurable supercomputing. This can be achieved through hybrid systems of microprocessors and FPGA modules that leverage the system-level concepts from high-performance computing and extend them to accommodate reconfiguration. A typical reconfigurable computer node (RCN) can be thought of as a module of microprocessors, FPGA devices, and memories. Such modules can then be interconnected via some fixed (or perhaps even reconfigurable) interconnection network. While early systems include only a few microprocessors and FPGA chips, they can be very powerful, and they are expected to grow in size rapidly over time.

On the architecture side, such reconfigurable supercomputing systems inherently support fine-grain and coarse-grain parallelism, due to the presence of FPGAs and microprocessors, respectively. Moreover, they can dynamically tune their architecture to fit the application, thanks to the in-circuit programmability of the FPGAs; applications can be partitioned between hardware and software based on suitability and requirements. Programming such systems can be quite challenging, as programming FPGA devices can essentially involve hardware design. Moreover, programming tools should also distinguish among parts of the application that can execute on single or multiple FPGAs, parts that can execute on one or more microprocessors, and how the two will interact. The jobs of the operating system and of compiler run-time support become more demanding in order to handle reconfigurable resources. For example, dynamic resource monitoring and allocation in this case goes beyond memory partitioning and processor allocation to include configuring and reconfiguring FPGA resources at run-time.

In order to facilitate rapid deployment of applications and standardize the system operation across those applications, it is important to implement and integrate into such a system a number of hardware and software libraries. Hardware libraries are modules that can be instantiated in FPGAs at start-up time or on an as-needed basis. They should include I/O libraries for I/O between FPGAs and the outside world. They should also include libraries for interfacing with microprocessors and for inter-node communication. Finally, they should support efficient arithmetic in various domains and the most popular applications, such as image processing.


This paper focuses on fully integrated, stand-alone, reconfigurable computer systems. It does not address reconfigurable boards that can be used with existing desktops or workstations, or clusters built around such boards and workstations [1, 2]. Through a case study that includes one of the most advanced reconfigurable computers available on the market, the SRC-6 [3, 4], a set of representative applications, and a number of experiments, we intend to address the progress and challenges of this domain. We then derive from our measurements a projection for the growth path of this technology.

2. Reconfigurable Computer Architecture

2.1 Abstract Architecture Issues and Considerations

A generic architecture of a typical present-day stand-alone reconfigurable computer is shown in Fig. 1. A reconfigurable computer consists of a microprocessor system and a reconfigurable processor system closely coupled with each other. The microprocessor system includes all major components of a traditional computer system: general-purpose microprocessors, microprocessor memories, and I/O interfaces. The reconfigurable processor system consists of an array of FPGAs, FPGA memories, and an I/O interface. Both systems are connected through a communication interface.

Within each of the two systems, each component's functionality is influenced by its interconnection topology. Within the reconfigurable processor system, every FPGA can be connected to every other FPGA, or FPGAs can be grouped into clusters. FPGA memories can be shared by a group of FPGAs or dedicated to a single FPGA. Additional hardware, such as a crossbar switch, might be necessary to make connections as economical and flexible as possible. The communication interface may lead to an FPGA memory, a single FPGA, or all FPGAs.

For historical and economic reasons, in the first generation of high-performance reconfigurable computers (including the SRC-6E [3] and the Starbridge HC-36m [5]), the microprocessor system was based on commodity PC boards (such as Pentium boards), and the reconfigurable processor system resembled general-purpose FPGA accelerator boards, such as WildStar II [2]. As a result, the corresponding communication interface had to be based on traditional I/O interfaces offered by the current generation of PC boards, such as PCI and PCI-X, or on an external memory interface, such as one based on DIMM sockets. These interfaces limit the amount of physical coupling and the communication speed between the two systems. In the newer generation of high-performance reconfigurable computers (including the Cray XD1 [6] and the SGI Altix [7]), FPGAs are placed on the same board as microprocessors (as in the Cray XD1) or connected to microprocessors using a fast interface, such as NUMAlink [7]. Other solutions, such as HyperTransport, might be considered in the future [9].

Reconfigurable processors (FPGAs) may have their own memory systems, in which an on-chip memory (such as the block RAMs in Xilinx Virtex devices) can be treated as a first-level cache, and an FPGA memory (often referred to as On-Board Memory (OBM)) is treated as a second-level cache. However, compared to today's microprocessors, current FPGAs have no built-in support for memory management and dynamic memory refreshing, so both mechanisms would need to be programmed into the FPGAs during initialization. Because of the existence of memories in both systems, communication between the microprocessor system and the reconfigurable processor system can be easily implemented using DMA transfers.

2.2 Challenges in the Development of Efficient Architectures

There exist multiple challenges that must be resolved to develop an efficient and economically viable architecture for reconfigurable computers. Efficient communication must be available between the microprocessor and reconfigurable processor systems of a reconfigurable computer, in order to assure efficient data and control transfer in both directions. Fast and wide interfacing should exist among the FPGAs constituting the reconfigurable processor system. This interface should preferably allow the exchange of long data words within one clock cycle of an internal FPGA clock. An efficient memory system should exist that allows similar high-bandwidth communication between FPGAs and their associated memories, as well as the possible exchange of information and control among FPGAs through a shared memory. Efficient synchronization mechanisms necessary to transfer control between microprocessors and FPGAs, and between FPGAs themselves, must be developed. These mechanisms should allow running related processes in parallel on multiple microprocessors and multiple FPGAs.

Another major issue associated with the reconfigurable processor architecture is hardware support for fast reconfiguration. This support is necessary to avoid long initialization times, as well as to make run-time reconfiguration feasible. Solving this issue might require support for the parallel reconfiguration of multiple FPGAs, as well as the reconfiguration of selected FPGAs without interrupting computations performed by other FPGAs.

3. Programming and Execution Models

Functions of the software environment for reconfigurable computers can be distributed among three major tools: Program Entry, Compiler, and Run-Time Environment. These tools can be implemented separately, or integrated into a single software environment. Below we describe major features and challenges associated with each of these tools.

3.1 Program Entry

Program entry is responsible for specifying the sequence of instructions executed by each microprocessor (software), the operations executed on each reconfigurable processor (hardware), as well as the synchronization and exchange of data between these two types of processors (interface). Additionally, if run-time reconfiguration is used, the program entry must either specify explicitly, or at least imply, a sequence of reconfigurations.

The unique problems of reconfigurable computers come from the fact that hardware and software are traditionally described using different languages and tools. The standard way of describing software is using high-level languages (HLLs), such as C, C++, or Fortran. The standard way of describing hardware is using hardware description languages (HDLs), such as VHDL and Verilog. Describing hardware using HLLs is possible, and has been tried in several commercial products such as Xilinx Forge, Celoxica Handel-C, Impulse C, and most recently Mitrion-C [8]. These languages offer a trade-off between a shorter development time and the performance overhead imposed by any high-level language. A dataflow program entry, based on a graphical user interface, seems to offer an interesting compromise between HLLs and HDLs. Nevertheless, it suffers from the lack of standards, and from difficulties with scaling designs and porting them among different graphical design environments. Describing hardware in HLLs, or at least using dataflow diagrams, seems to be a major and distinctive feature of high-performance reconfigurable computers, since it would allow mathematicians and computer scientists to develop entire applications without relying on hardware designers. It would also substantially increase the productivity of the design process. The major challenge in this area is to describe parallelism and non-standard data types using traditional programming languages, such as C.

Independently of the mode of description chosen, the existence of libraries suitable for multiple types of applications is a major challenge affecting the productivity of the design process. This is especially true in the case of describing hardware component objects to be implemented on FPGAs. The libraries of such components are currently at an early stage of development, and their availability is not guaranteed. Hardware libraries should allow easy access to the on-chip memories and special arithmetic units available on FPGAs. Library components should support multiple input and output sizes, or at least a small discrete set of sizes. Finally, multiple versions of the same functional unit might be required to support optimization trade-offs among area, throughput, and latency.
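As a concrete illustration of the trade-off discussed above, the sketch below shows how a small hardware kernel might be expressed in a C-like HLL. It is a minimal, hypothetical example: the pragma and the assumption that a hardware-oriented compiler pipelines the fixed-bound loop are illustrative, and the code does not reproduce the syntax of any of the products named above.

```c
#include <stdint.h>

/* Hypothetical HLL description of a hardware kernel: a fixed-point
 * multiply-accumulate over a block of data.  A hardware-oriented compiler
 * could map the loop body to a pipelined datapath (one multiplier, one
 * adder), while an ordinary C compiler can still execute the same code
 * for functional verification on the microprocessor. */
void mac_kernel(const int16_t a[], const int16_t b[], int n, int32_t *result)
{
    int32_t acc = 0;
    /* #pragma pipeline -- illustrative hint only, not a real directive */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    *result = acc;
}
```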

3.2 Compiler

A compiler for a reconfigurable computer must combine the capabilities of tools for traditional microprocessor compilation with those of tools for computer-aided design with FPGAs. It must also extend these two separate sets of tools with capabilities for mutual synchronization and data transfer between the microprocessor and reconfigurable processor systems [10].

The microprocessor compilation is likely to be based on standard programming language compilers, such as the Intel C/C++ compiler or the GNU C/C++ compiler. The hardware compilation process depends on the choice of hardware design entry. If an HDL is used for design entry, the processing flow is based on standard CAD tools used for logic synthesis (e.g., Synplify Pro or Synopsys) and for placing and routing of FPGAs (e.g., Xilinx ISE). If an HLL or a dataflow diagram is used for hardware design entry, a specialized compiler module must be developed. This module can either produce an HDL output, or include synthesis and directly produce netlist files used as inputs for placing and routing. Since placing and routing tools are based on proprietary information available only to the vendors of FPGA devices, vendor tools are likely to always be used in this phase. In the case of synthesis tools, it also seems advisable to make use of the years of development behind industry-standard tools such as Synopsys or Synplify Pro. Nevertheless, the designers of reconfigurable computers may also be inclined to try to develop their own synthesis tools, hoping that they can more fully exploit additional information provided during program entry.

Any information specific to a reconfigurable computer's operation must be processed by a special compiler module provided by the vendor of the reconfigurable computer. This module can have the form of a preprocessor that produces inputs to standard third-party tools. A few examples of information that is specific to reconfigurable computers and cannot be easily handled by third-party tools include:
1) design partitioning between the microprocessor system and the reconfigurable processor system (i.e., between software and hardware),
2) communication and synchronization between both systems,
3) design partitioning and communication within the reconfigurable processor system (i.e., among multiple FPGAs), and
4) the definition of the sequence of FPGA reconfigurations.
A large number of these characteristics can be either defined manually during program entry or derived automatically by the compiler. In the current state of maturity of reconfigurable computers, the majority of such information must be provided manually. Nevertheless, the natural future tendency will be to free the programmer of these details and to make the compiler responsible for the majority of the design decisions, based on more general preferences and constraints provided by the programmer during program entry.

3.3 Run-Time Environment

The run-time environment is the software necessary to execute an application on a reconfigurable computer. In the simplest case, it might consist of a standard operating system (e.g., Linux) running on the microprocessor system, a software driver necessary to communicate with the reconfigurable processor system, and an interface portion of the reconfigurable processor system implemented on the same FPGA as, or a different FPGA from, the reconfigurable user logic. In other cases, it might be an integrated graphical user environment.

The major challenge facing the run-time environment of reconfigurable computers is support for multiuser/multitasking operation. To make this possible, mechanisms must be developed to perform resource monitoring and resource allocation, under the condition that FPGAs are treated as dynamically allocated resources. In the current state of technology, the entire reconfigurable processor system might need to be treated as a single, indivisible resource. In the most advanced systems, single FPGAs can be treated as separate resources that might be allocated to different applications, and even different users. Finally, in the future, with the further increase in the size of FPGAs and increased support for partial reconfiguration, it might become feasible to allocate different portions of the same FPGA to different tasks and users.

Another challenge regarding the run-time environment is its support for program profiling, i.e., determining the amount of time spent in particular phases of the program execution. This task is not easy, taking into account the need to measure separately the time spent in the microprocessor vs. the reconfigurable processor system, on the interface between both systems, as well as during FPGA reconfigurations. The difficulties come from the different mechanisms used to measure time in software and hardware, the delay contribution of other processes (e.g., operating system processes) running on the same platform, the timing overlap between various operations, and the extra delays introduced by the timers themselves.

One of the major challenges is support for debugging. At this point, the two primary means of such support are software emulation of functions to be executed on FPGAs and simulation of the intermediate hardware netlists. These mechanisms allow verifying a top-level design of the application without the need for expensive hardware compilations that include placing and routing. To detect more subtle errors, and to speed up verification of complex applications, one can imagine a run-time debugger with functions such as conditional breakpoints or step-by-step execution of applications running on the actual hardware. Although certain progress has been accomplished in this area [11, 12], this functionality may be difficult to implement in the current state of technology, because of the limited access to the internal state of FPGA nodes during device operation.
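One common way to organize the software-emulation path mentioned above is a compile-time switch that substitutes an ordinary C model for each hardware macro. The sketch below is hypothetical: the macro name, the EMULATE switch, and the placeholder body are illustrative and do not correspond to any specific vendor mechanism.

```c
#include <stdint.h>

#ifdef EMULATE
/* Debug build: the hardware macro is replaced by a bit-accurate C model
 * that can be run and single-stepped on the microprocessor.  The body
 * below is a placeholder, not a real cipher round. */
static void cipher_round(uint64_t in, uint64_t subkey, uint64_t *out)
{
    *out = in ^ subkey;   /* placeholder logic for illustration only */
}
#else
/* Hardware build: the same call is resolved to a user macro described
 * in VHDL or Verilog and instantiated on the user FPGA. */
void cipher_round(uint64_t in, uint64_t subkey, uint64_t *out);
#endif
```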


4. Case Study: SRC-6 Reconfigurable Computer

At the time of writing, there exist at least four companies offering hardware solutions in the area of high-performance reconfigurable computing: SRC, Cray, SGI, and Star Bridge Systems [3, 4, 5, 6, 7]. Multiple other companies and university groups currently work on developing new languages, software tools, libraries, and applications for these four machines. Although the exact hardware architecture and software tools vary considerably among the products of these four companies, they are all based on the same generation and brand of FPGA devices, and therefore they are likely to offer similar performance improvements over classical microprocessor-based systems. Additionally, at the time of writing, only one of the aforementioned companies, SRC, offers a mature, fully integrated software/hardware environment, while the software environments necessary to fully utilize the capabilities of the remaining platforms are still under development. Therefore, in this paper, we focus on the one currently most mature platform and treat it as a case study demonstrating the potential and the possible challenges of the entire field of reconfigurable high-performance computing.

4.1 SRC Architecture

The SRC-6 is a stand-alone reconfigurable computer composed of three types of boards: microprocessor boards, reconfigurable processor boards (referred to as MAP boards), and Common (Global) Memory boards. Multiple boards can be connected to one another using the Hi-Bar switch, sustaining 1.4 GBytes/s (see Fig. 2). The microprocessor board is based on a commodity PC board with two Pentium 4 Xeon 2.8 GHz processors. The reconfigurable processor board, referred to as MAP®, is based on three Xilinx Virtex II FPGAs, XC2V6000. Two of these FPGAs are available as user chips and are used to perform application-specific operations. The third FPGA is a control FPGA. Its configuration is fixed, and it supports fast transfer of data and control between the microprocessor system and the reconfigurable processor system. On the microprocessor side, the interface is supported by the SRC-specific fast DMA communication module, called SNAP®, which supports a sustained throughput of 1.4 GB/s in each direction.

4.2 SRC Programming Model and Tools

The SRC programming model allows implementations using traditional HLL programming languages and tools, as well as HDL specifications. Portions of a program allocated to the microprocessor can be implemented using an HLL, and portions of the program allocated to the reconfigurable processor can be implemented using either an HLL or an HDL.

The SRC-6 has a compilation process similar to that of a conventional microprocessor-based computing system, but it needs to support additional tasks in order to produce logic for the MAP reconfigurable processor, as shown in Fig. 3. There are two types of application source files to be compiled. Source files of the first type are compiled targeting execution on the Intel platform. Source files of the second type are compiled targeting execution on the MAP. A file that contains a program to be executed on the Intel processor is compiled using the microprocessor compiler to produce a relocatable object (.o) file. All files containing functions that call hardware macros, and thus execute on the MAP, are compiled by the MAP C compiler, mcc, or the MAP FORTRAN compiler, mftn. These compilers produce several relocatable object files (.o), corresponding to the respective subroutines. Object files resulting from both the Intel and MAP compilation steps are then linked with the MAP libraries into a single executable file. The resulting binary file may then be executed on the SRC-6E.

MAP source files contain MAP functions composed of macro calls. Here, a macro is defined as a piece of hardware logic designed to implement a certain function. Since users often wish to extend the built-in set of operators, the compiler allows users to integrate their own macros, encoded in VHDL or Verilog, into the compilation process. A macro is invoked from within the C or FORTRAN subroutine by means of a subroutine call. In Fig. 4, we demonstrate the mapping between macro calls and the corresponding contents of a MAP FPGA. Note that Macro 2, called twice in Subroutine 1, results in two instantiations of the logic block representing Macro 2. The values of the arguments in the macro calls determine the interconnects between macro instantiations in hardware. The contents of each MAP function in software determine the configuration of the entire FPGA device in hardware. Each time a new MAP function is called, the contents of the entire FPGA change. In this way, the SRC-6 implements run-time reconfiguration. The current version of the SRC environment relies on Synplify Pro for logic synthesis of macros, Xilinx ISE for place and route, and the Intel compiler for microprocessor compilation.
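The sketch below restates the call-based instantiation model of Fig. 4 in C. It is illustrative rather than exact MAP C syntax: the macro prototypes, the use of pointer arguments for macro outputs, and the variable types are assumptions made for the sake of a self-contained example.

```c
/* Illustrative prototypes for user macros implemented in VHDL or Verilog
 * and integrated into the MAP compilation flow. */
void Macro_1(int a, int *b, int *c);
void Macro_2(int in, int *out);

/* MAP function in the spirit of Fig. 4.  Each macro call instantiates one
 * block of user logic on the MAP FPGA; the shared variable names (b, c)
 * define the wiring between the instances.  Because Macro_2 is called
 * twice, two independent copies of its logic are instantiated. */
void Function_1(int a, int *d, int *e)
{
    int b, c;

    Macro_1(a, &b, &c);   /* one instance of Macro_1                 */
    Macro_2(b, d);        /* first instance of Macro_2               */
    Macro_2(c, e);        /* second, independently placed instance   */
}
```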

4.3 Applications

Applications considered in our study belong to two major application areas: cryptography and image processing. Within cryptography, we considered three distinct classes of applications:
a. bulk data encryption using modern secret-key (symmetric) ciphers: DES, IDEA, and RC5 [13, 14];
b. cipher breaking for secret-key ciphers, based on the method of exhaustive key search, implemented for three ciphers: DES, IDEA, and RC5 [13, 15];
c. key exchange based on the public-key encryption schemes called Elliptic Curve Cryptosystems, defined over the Galois field GF(2^m), with two different versions based on the Polynomial Basis (PB) and the Optimal Normal Basis (ONB) representations of the field elements [17, 18, 19, 20].
In the area of image processing, we considered three applications: Wavelet Hyperspectral Dimension Reduction [16], Sobel Edge Detection, and Median Filter.

In all our implementations, the majority of the computations have been assigned to the reconfigurable processor system. The microprocessor system was responsible primarily for input/output operations. Thus, the partitioning of the applications between the two systems was trivial. The computations within the reconfigurable processor system were performed using either a single FPGA or two FPGAs. Two FPGAs were used for computationally intensive applications, such as cipher breaking, where using two FPGAs effectively doubles the performance. For I/O-intensive applications there was no point in using two FPGAs, as their use does not improve the overall performance, which is limited by the I/O bandwidth.

In terms of the programming method, several macros implementing elementary operations of all applications were first described in VHDL or Verilog. These macros were then used to implement the computational portion of each application in C for execution on the MAP reconfigurable processor. The basic user macros used in the implementation of bulk data encryption and cipher breaking represent the operation of a single cipher round and a single round of key scheduling. For the key exchange scheme, these macros implement basic operations in the Galois field GF(2^m), such as addition, multiplication, and inversion. Finally, for the image processing applications, no user macros were required, and the entire code was developed in C, based on the standard SRC library functions for fixed-point operations on real numbers.

In terms of the hardware architecture, the bulk data encryption and image processing applications have been implemented using a fully pipelined architecture, producing new output and accepting new input every clock cycle, at a clock frequency of 100 MHz. For cipher breaking, pipelining and multiple processing units working in parallel have been used to speed up data processing. The number of processing units was 20 for DES, 10 for IDEA, and 2 for RC5. The difference was caused by the different amount of FPGA resources required by each of the algorithms. In the key exchange applications, an effort was made to perform all mutually independent arithmetic operations in parallel.

For each application, input data were read from a file or generated by the microprocessor system itself, transferred from the microprocessor memory to the reconfigurable processor (on-board) memory using DMA (Data Transfer In), processed by one of the user FPGAs (FPGA Computations), and then transferred back from the reconfigurable processor (on-board) memory to the microprocessor memory using DMA (Data Transfer Out).
Whenever possible, an effort was made to overlap FPGA Computations and Data Transfer In. This overlapping was performed using a mechanism called streaming. The current generation of the SRC-6 did not allow overlapping of all three phases (Data Transfer In, FPGA Computations, and Data Transfer Out).

All our applications can be divided into three major categories:
1. Input/output-intensive applications, such as bulk data encryption (DES, IDEA, and RC5 encryption) and image processing (Wavelet Hyperspectral Dimension Reduction, Sobel Edge Detection, and Median Filter). In these applications the end-to-end execution time is dominated by the time required for transferring data between the microprocessor system and the reconfigurable processor system. The major performance parameter, the End-to-End Throughput, is limited by the Data Transfer In and Data Transfer Out throughputs.
2. Computationally intensive applications, such as cipher breaking based on exhaustive key search (DES, IDEA, and RC5 cipher breaking). In these applications, a very small amount of input data is used for massive computations by the reconfigurable processor, generating a very small amount of output data. Therefore, the End-to-End Throughput is limited by the Computational Throughput of the reconfigurable processor and its FPGAs.
3. Latency-critical applications, where the primary requirement is a small data latency (rather than a large data throughput), and only one input is processed at a time, such as cipher key exchange using Elliptic Curve Cryptosystems. Since in reconfigurable computers there exists a substantial timing overhead associated with transferring control and data to and from the reconfigurable processor, latency-critical applications can benefit from the use of a reconfigurable processor only if the computations in hardware using user FPGAs are sufficiently complex and substantially less time-consuming than the same computations performed using microprocessors.
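The structure shared by all of these applications, and assumed throughout the measurements in Section 4.4, is the three-phase MAP function sketched below. The sketch is hypothetical: dma_in(), compute(), and dma_out() stand in for the actual SRC transfer calls and user macros, whose names and signatures are not given here, and the on-board memory sizing is a placeholder. With streaming, the first two phases would overlap instead of executing back to back.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholders for the real DMA calls and the user-macro pipeline. */
void dma_in(const uint64_t *src, uint64_t *obm, size_t n);          /* Data Transfer In   */
void compute(const uint64_t *obm_in, uint64_t *obm_out, size_t n);  /* FPGA Computations  */
void dma_out(const uint64_t *obm, uint64_t *dst, size_t n);         /* Data Transfer Out  */

/* Three-phase structure of a MAP function: data is DMAed from the
 * microprocessor memory to on-board memory, processed by the user FPGA,
 * and DMAed back.  For I/O-intensive applications the end-to-end
 * throughput is bounded by the two transfer phases. */
void map_function(const uint64_t *host_in, uint64_t *host_out, size_t n)
{
    static uint64_t obm_a[1 << 16], obm_b[1 << 16];  /* on-board memory banks (placeholder size) */

    dma_in(host_in, obm_a, n);
    compute(obm_a, obm_b, n);
    dma_out(obm_b, host_out, n);
}
```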

4.4 Timing Measurements

Our procedure for performing timing measurements of data throughputs is illustrated in Fig. 5. In the test program, the reconfigurable processor is first allocated to the given application. Next, the MAP function is called for the first time in order to force reconfiguration of an FPGA with a bitstream corresponding to the computational (hardware) portion of the given application. This is followed by n iterations of the loop. The number of iterations is chosen in such a way that the entire program runs for about 15 minutes, in order to eliminate the influence of any one-time effects. The execution time of the MAP function is measured in software, in time units, using the C timer function of the Linux operating system, gettimeofday(). This time is referred to as the End-to-End Time. Three independent components of the End-to-End Time (Data Transfer In Time, Computation Time, and Data Transfer Out Time) are measured within the MAP function using hardware counters. The execution time of these operations is measured in clock cycles by calling the standard SRC macro, read_timer().

The measurement procedure for latency is similar, with the difference that only one iteration of the loop is performed, and the minimum amount of data that can be processed by the given application is used. For encryption and cipher breaking this minimum amount of data is equal to one data block, or 64 bits. For image processing this minimum block corresponds to the minimum number of points necessary to produce a single output.
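A host-side sketch of this procedure is given below, assuming a map_function() placeholder for the application's actual MAP call. The gettimeofday() usage follows the text above; the hardware-side phase times read with read_timer() inside the MAP function are not shown, and the way those counter values are returned to the host is left out as an implementation detail.

```c
#include <sys/time.h>

extern void map_function(void);   /* placeholder for the application's MAP call */

/* Measures the average End-to-End Time per call, following Fig. 5:
 * the first call forces FPGA configuration and is excluded from the
 * measurement; the next n calls are timed with gettimeofday(). */
double average_end_to_end_time(long n)
{
    struct timeval t0, t1;

    map_function();                       /* FPGA Configure (not timed)   */

    gettimeofday(&t0, NULL);
    for (long i = 0; i < n; i++)
        map_function();                   /* timed loop, n iterations     */
    gettimeofday(&t1, NULL);

    return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6) / n;
}
```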

4.5 Results

In Tables 1, 2, and 3, we summarize the timing results obtained for the applications described in Section 4.3, using the measurement procedure described in Section 4.4. Table 1 contains the throughput measurements for the I/O-intensive applications, Table 2 the throughput measurements for the computationally intensive applications, and Table 3 the latency measurements for the latency-critical applications. Throughput is understood as the amount of data processed in a unit of time; latency is the amount of time necessary to process the smallest possible amount of data accepted by a given application. An additional assumption is that all measurements, for both throughput and latency, start at the moment when the input data is already present in the microprocessor memory, and finish when the output data is stored in the microprocessor memory. This assumption corresponds to the common practical scenario, and allows us to characterize the performance of the applications without the file I/O overhead, which is a characteristic of the microprocessor system and is not influenced by the existence and operation of a reconfigurable processor system.

For each application, we list the End-to-End Throughput, T_E, obtained using the SRC-6, and the corresponding throughput obtained using a single microprocessor, a Pentium IV Xeon 2.8 GHz, included in the microprocessor portion of the SRC-6. Additionally, for each application running on the SRC-6E, we measure
a. the Computational Throughput, T_C, representing the speed of computations performed on the User FPGA,
b. the Data Transfer In Throughput, T_IN, representing the speed of the data transfer from the microprocessor memory to the reconfigurable processor (on-board) memory, and
c. the Data Transfer Out Throughput, T_OUT, representing the speed of the data transfer in the opposite direction.
For all I/O-intensive applications, streaming is used to overlap FPGA Computations and Data Transfer In. In these cases only the joint throughput covering both Computations and Data Transfer In can be measured, and this value is reported in Table 1.

Software implementations running on the Pentium IV are based on the libraries and software modules listed in Table 4. In particular, the highly optimized OpenSSL library [22] is used for the software implementation of secret-key encryption and cipher breaking, and the open-source LiDIA library [21] for the software implementation of Elliptic Curve Cryptosystems. In-house-developed C code is used for all image processing applications.
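For reference, the quantities in Tables 1-3 can be written compactly as follows; the definitions restate the text above, and the numerical check uses the DES encryption row of Table 1.

```latex
T_X = \frac{\text{amount of data processed}}{\text{time spent in phase } X},
\qquad
\text{Speed-up} = \frac{T_E^{\,\mathrm{SRC\text{-}6}}}{T_E^{\,\mathrm{Xeon}}},
\qquad
\text{e.g. } \frac{675\ \mathrm{MB/s}}{11.6\ \mathrm{MB/s}} \approx 58 \ \ \text{(DES encryption)}.
```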


One can see from Tables 1-3 that the speed-up is the highest (from 282 to 1130) for the computationally intensive applications, such as cipher breaking, and the smallest (from 4 to 58) for the input/output-intensive applications, such as image processing and bulk data encryption. The SRC End-to-End Throughput for bulk data encryption is limited by the speed of Data Transfer In and Data Transfer Out, and is identical for all three ciphers, because all of them are implemented using the same fully unrolled architecture, capable of producing one 64-bit block of output every clock cycle. The SRC End-to-End Throughput for the cipher breakers is the same as the Maximum Computational Throughput, and is proportional to the number of cipher breaking units working in parallel on the User FPGA (20 for DES, 10 for IDEA, and 2 for RC5). ECC Key Exchange is the only application that provides any speed-up in the case of latency. This speed-up is particularly large (over 1000) for the ECC key exchange scheme with normal basis.

In Table 5, we summarize the amount of FPGA resources, look-up tables (LUTs) and configurable logic block (CLB) slices, used by the hardware portion of each application. It should be stressed that at this point of the SRC system development, FPGAs cannot be shared between multiple applications, and therefore it is in the best interest of the programmer to make full use of the entire FPGA. This goal has been best accomplished in the case of the DES, IDEA, and RC5 breakers, which all use over 85% of the CLB slices available on the user FPGAs. On the other hand, increasing the number of processing units in the case of the encryption and image processing applications would have a very small (and possibly even negative) impact on the End-to-End Throughput, because this throughput is already limited by the Data Transfer In and Data Transfer Out rates.

Additionally, it is clear from Table 5 that the obtained speed-ups come at the cost of substantially increased compilation time. The compilation time on the SRC machine is dominated by the time of placing and routing performed by the Xilinx implementation tools. This time reached almost five hours for one of the investigated applications (Wavelet Hyperspectral Dimension Reduction). The primary way of reducing this time would be the introduction of incremental compilation by FPGA vendors.
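The cipher-breaking throughputs in Table 2 are consistent with each processing unit testing one key per cycle of the 100 MHz clock; this per-unit rate is inferred from the reported numbers rather than stated explicitly. For DES breaking:

```latex
T_C \approx N_{\mathrm{units}} \times f_{\mathrm{clk}}
    = 20 \times 100\,\mathrm{MHz}
    = 2000\ \text{million keys/s},
\qquad
\text{Speed-up} = \frac{2000}{1.77} \approx 1130 .
```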

5. Conclusions

Reconfigurable computers offer great promise for solving complex, computationally intensive problems with the speed of specialized hardware and the flexibility and productivity of software implementations. The most important features of applications that can truly benefit from execution on reconfigurable computers are a large amount of parallelism and limited input/output. A perfect example of such an application is an exhaustive key-search cipher breaker, such as the DES, IDEA, and RC5 breakers investigated in this paper. Additionally, it is preferable if the resource requirements of the application do not exceed the physical resources of the reconfigurable computer, so that no run-time reconfiguration is required. If this condition is not fulfilled, the reconfiguration should at least be semi-static, i.e., the average time between reconfigurations should be large compared to the reconfiguration time. Taking into account the large overhead associated with the transmission of data between the microprocessor system and the FPGA system, it is also preferable for the application to be optimized for maximum throughput rather than for minimum latency.

The second major area of concern is productivity, which is typically measured using the total time to solution, including the development time. At present, the development time is influenced primarily by the need to optimize the design for a specific clock frequency, the need for manual entry of information that could potentially be derived by the compiler, and the compilation time. In particular, the compilation time is not very likely to decrease dramatically in the future, because compilation for hardware, including logic synthesis, mapping, placing, and routing, is likely to remain much slower than compilation for software. On the other hand, productivity is likely to improve substantially because of the gradual shifting of tasks from manual entry by the programmer to automatic execution by the compiler. Examples of such tasks include design partitioning between the microprocessor system and the reconfigurable processor system, partitioning among multiple FPGAs, data transfer among multiple parts of both systems (including communication among FPGAs), synchronization, and overhead minimization techniques. An especially powerful advantage comes from the capability of defining multiple versions of the same component in a library, and then letting the compiler choose an optimal implementation of this component that best matches the available resources and target requirements.

Another interesting trend will be an increase in the utilization of system resources. At the present moment, the smallest amount of reconfigurable hardware that can be allocated to a given user process is an FPGA board. In the future, the allocation of a single FPGA device, or a small portion of a device, will become possible. A new generation of operating systems will be developed to enable this flexibility.


In general, less and less knowledge of the underlying computer system is likely to be required from the programmer in the future. The programmer will be able to use high-level programming languages, such as C, to describe both software and hardware. The restrictions on the subset of an HLL that can be used to describe hardware will be gradually relaxed. Using commonly known high-level programming languages and an automated design process will create a direct path between a scientist posing a problem or hypothesis and a machine capable of generating a solution and verifying the hypothesis. In this way, the time to solution will be dramatically reduced. Additionally, a reconfigurable computer will encourage innovation and experimentation. Since there is no cost and only a very small time penalty for changes in the algorithms or logic architectures, a scientist can easily evaluate multiple hypotheses and multiple ideas, leading to the optimal solution in the minimum possible time. In this way, reconfigurable computers are likely to become enablers of innovation, experimentation, and scientific progress.


References

1. K. Compton and S. Hauck, Reconfigurable Computing: A Survey of Systems and Software, ACM Computing Surveys 34 (2) (2002) 171-210.
2. Annapolis Micro Systems web site, http://www.annapmicro.com/
3. SRC Inc. web site, http://www.srccomp.com/
4. K. Morris, SRC Code, J. FPGA and Programmable Logic, available at http://www.fpgajournal.com/articles_2005/20050712_src.htm
5. Star Bridge Systems Inc. web site, http://www.starbridgesystems.com/
6. Cray web site, available at http://www.cray.com
7. SGI Altix, available at http://www.sgi.com/servers/altix/
8. Mitrion web site, available at http://www.mitrion.com/index.shtml
9. P. Abusaidi, SystemIO Solution Expands with Bandwidth Demands, Xilinx Xcell Journal Online, (6) (2002).
10. W. Luk, N. Shirazi, and P.Y.K. Cheung, Compilation Tools for Run-time Reconfigurable Designs, IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 1997, 56-65.
11. Xilinx, Board Level Verification, http://www.xilinx.com/products/design_resources/design_tool/grouping/board_level_verif.htm
12. T. Wheeler, P. Graham, B. Nelson, and B. Hutchings, Using Design-Level Scan to Improve Design Observability and Controllability for Functional Verification of FPGAs, FPL 2001.
13. W. Stallings, Cryptography and Network Security, Prentice Hall, 1999.
14. P. Chodowiec, P. Khuon, and K. Gaj, Fast Implementations of Secret-Key Block Ciphers Using Mixed Inner- and Outer-Round Pipelining, FPGA 2001, 94-102.
15. O.D. Fidanci, D. Poznanovic, K. Gaj, T. El-Ghazawi, and N. Alexandridis, Performance and Overhead in a Hybrid Reconfigurable Computer, RAW 2003.
16. E. El-Araby, M. Taher, K. Gaj, T. El-Ghazawi, D. Caliga, and N. Alexandridis, System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications, RAW 2004.
17. D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer, 2004.
18. M. Rosing, Implementing Elliptic Curve Cryptography, Manning, Greenwich, 1999.
19. S. Bajracharya, C. Shu, K. Gaj, and T. El-Ghazawi, Implementation of Elliptic Curve Cryptosystems over GF(2^n) in Optimal Normal Basis on a Reconfigurable Computer, FPL 2004, 1001-1005.
20. C. Shu, K. Gaj, and T. El-Ghazawi, Low Latency Elliptic Curve Cryptography Accelerators for NIST Curves on Binary Fields, FPT 2005 (in print).
21. LiDIA: A C++ Library for Computational Number Theory, available at http://www.informatik.tu-darmstadt.de/TI/LiDIA/
22. OpenSSL: The Open Source Toolkit for SSL/TLS, available at http://www.openssl.org/


[Figure omitted. Recoverable labels: Microprocessor system (µP ... µP, µP memory ... µP memory, Interface, I/O) coupled with Reconfigurable processor system (FPGA ... FPGA, FPGA memory ... FPGA memory, Interface, I/O).]

Fig. 1. General diagram of a typical present-day reconfigurable computer

[Figure omitted. Recoverable labels: SRC Hi-Bar Switch; microprocessor boards (µP, memory, SNAP™, PCI-X, Gig Ethernet, disk); MAP® reconfigurable processor boards with chaining GPIO; Common Memory boards; connections to storage area, local area, and wide area networks (customers' existing networks).]

Fig. 2. Hardware architecture of the SRC-6


[Figure omitted. Recoverable labels: application sources (.c or .f files) compiled by the µP compiler into object (.o) files; MAP sources compiled by the MAP compiler; macro sources (.vhd or .v files) processed by logic synthesis into netlists (.ngo files), then place & route into configuration bitstreams (.bin files); the linker combines the object files into the application executable.]

Fig. 3. Compilation process of SRC-6E

[Figure omitted. Recoverable labels: a program in C or Fortran with a main program and Function_1(a, d, e) containing the macro calls Macro_1(a, b, c), Macro_2(b, d), and Macro_2(c, e); the FPGA contents after the Function_1 call show one instance of Macro_1 and two instances of Macro_2 wired through signals a, b, c, d, e. Function_2(d, e, f) contains calls such as Macro_3(s, t), Macro_1(n, b), and Macro_4(t, k).]

Fig. 4. Programming model of SRC-6E


[Figure omitted. Recoverable labels: MAP Allocate, FPGA Configure, then a loop executed n times over the MAP function, consisting of Read Data (DMA from common memory to OBM, Transfer In Time), Compute Data (Computation Time), and Write Data (DMA from OBM to common memory, Transfer Out Time), followed by MAP Free; the End-to-End Time covers the MAP function, the Configuration Time + End-to-End Time adds FPGA configuration, and the End-to-End Time with File I/O covers the whole sequence.]

Fig. 5. Measurement procedure used for the measurement of data throughputs


Table 1. Throughput of input/output-intensive applications optimized for maximum throughput, running on the SRC-6 machine and a Pentium IV Xeon 2.8 GHz

| Application | Theoretical Maximum Computational Throughput (MBytes/s), SRC-6 | Measured Computational Throughput (MBytes/s), SRC-6 | Measured Data Transfer In and Data Transfer Out Throughput, assuming streaming (MBytes/s), SRC-6 | Measured End-to-End Throughput (MBytes/s), SRC-6 | Measured End-to-End Throughput (MBytes/s), Xeon 2.8 GHz | Speed-up |
|---|---|---|---|---|---|---|
| DES Encryption (2 units working in parallel) | 1,600 | 1,419 | 1,345 | 675 | 11.6 | 58.1 |
| IDEA Encryption (2 units working in parallel) | 1,600 | 1,419 | 1,345 | 675 | 14.1 | 47.5 |
| RC5 Encryption (2 units working in parallel) | 1,600 | 1,419 | 1,345 | 675 | 70 | 9.5 |
| Wavelet Hyperspectral Dimension Reduction (1 unit) | 800 | 792 | 388 | 271 | 20 | 13.5 |
| Sobel Edge Detection (1 unit) | 800 | 797 | 797 | 398 | 102 | 3.9 |
| Median Filter (1 unit) | 800 | 797 | 797 | 398 | 63 | 6.3 |

Table 2. Throughput of computationally intensive applications optimized for maximum throughput, running on the SRC-6 machine and a Pentium IV Xeon 2.8 GHz

| Application | Theoretical Maximum Computational Throughput (million keys/s), SRC-6 | Measured End-to-End Throughput (million keys/s), SRC-6 | Measured End-to-End Throughput (million keys/s), Xeon 2.8 GHz | Speed-up |
|---|---|---|---|---|
| DES Cipher Breaking (20 units working in parallel) | 2000 | 2000 | 1.77 | 1130 |
| IDEA Cipher Breaking (10 units working in parallel) | 1000 | 1000 | 2.19 | 457 |
| RC5 Cipher Breaking (2 units working in parallel) | 200 | 200 | 0.71 | 282 |


Table 3. Latency of selected applications optimized for minimum latency, running on the SRC-6 machine and a Pentium IV Xeon 2.8 GHz

| Application | Measured Computational Latency (µs), SRC-6 | Measured Data Transfer In Latency (µs), SRC-6 | Measured Data Transfer Out Latency (µs), SRC-6 | Measured End-to-End Latency (µs), SRC-6 | Measured End-to-End Latency (µs), Xeon 2.8 GHz | Speed-up |
|---|---|---|---|---|---|---|
| ECC key exchange, polynomial basis, key size = 163 bits | 33.3 | 11.2 | 5.7 | 300.2 | 5,340 | 17.8 |
| ECC key exchange, polynomial basis, key size = 233 bits | 56.9 | 14.7 | 7.6 | 322.2 | 9,710 | 30.1 |
| ECC key exchange, normal basis, key size = 233 bits | 89.7 | 14.5 | 7.5 | 351.7 | 394,000 | 1,120.2 |

Table 4. Software libraries or software modules used in the performance comparison

| Application | Software library or module used in performance comparison |
|---|---|
| DES, IDEA, RC5 Encryption | OpenSSL [22] |
| DES, IDEA, RC5 Cipher Breaking | OpenSSL [22] |
| Wavelet Hyperspectral Dimension Reduction | C code developed by GWU & NASA |
| Sobel Edge Detection | C code developed by GWU |
| Median Filter | C code developed by GWU |
| ECC key exchange, polynomial basis | LiDIA [21] |
| ECC key exchange, normal basis | Rosing [18] |
| RSA | OpenSSL [22] |


Table 5. Resource utilization and compilation time for selected applications implemented on the SRC-6 machine

| Application | LUT Area (#) | LUT Area (%) | CLB Slice Area (#) | CLB Slice Area (%) | Synthesis Time (hr:min:sec) | Xilinx Implementation Time (hr:min:sec) |
|---|---|---|---|---|---|---|
| DES Encryption (2 units) | 3,533 | 5 | 8,283 | 24 | 00:00:15 | 00:04:52 |
| IDEA Encryption (2 units) | 7,394 | 10 | 9,398 | 27 | 00:00:53 | 00:25:07 |
| RC5 Encryption (2 units) | 11,674 | 17 | 11,311 | 33 | 00:00:04 | 00:08:31 |
| Wavelet Hyperspectral Dimension Reduction (1 unit) | 33,231 | 49 | 22,044 | 65 | 00:00:27 | 04:56:29 |
| Sobel Edge Detection (1 unit) | 5,332 | 7 | 5,234 | 15 | 00:00:00 | 00:06:55 |
| Median Filter (1 unit) | 5,967 | 8 | 5,735 | 16 | 00:00:00 | 00:06:11 |
| DES Cipher Breaking (10 units per chip) | 19,315 | 28 | 33,790 | 99 | 00:00:15 | 01:39:01 |
| IDEA Cipher Breaking (5 units per chip) | 27,065 | 40 | 30,108 | 89 | 00:22:34 | 00:55:25 |
| RC5 Cipher Breaking (1 unit per chip) | 31,887 | 47 | 33,790 | 99 | 00:00:05 | 00:55:10 |
| ECC key exchange, polynomial basis, key size = 163 bits | 27,072 | 40 | 15,923 | 47 | 00:00:31 | 00:44:04 |
| ECC key exchange, polynomial basis, key size = 233 bits | 36,656 | 54 | 21,025 | 62 | 00:00:45 | 03:03:21 |
| ECC key exchange, normal basis, key size = 233 bits | 35,257 | 52 | 20,638 | 61 | 00:00:39 | 03:38:49 |

