Energy-Efficient Synthetic-Aperture Radar Processing on a Manycore Architecture


Zain-ul-Abdin∗, Anders Åhlander†, and Bertil Svensson∗
∗Centre for Research on Embedded Systems (CERES), Halmstad University, Halmstad, Sweden
†Saab AB, Gothenburg, Sweden

Abstract—Next generation radar systems place high performance demands on the signal processing chain. Examples include advanced image-creating sensor systems in which complex calculations are to be performed on huge sets of data in real time. Manycore architectures are gaining attention as a means to meet the computational requirements of complex radar signal processing by exploiting, in an energy-efficient manner, the massive parallelism inherent in the algorithms. In this paper, we evaluate a manycore architecture, namely a 16-core Epiphany processor, by implementing two significantly large case studies, viz. an autofocus criterion calculation and the fast factorized back-projection algorithm, both key components in modern synthetic aperture radar systems. The implementation results from the two case studies are compared on the basis of achieved performance and programmability. One of the Epiphany implementations demonstrates the usefulness of the architecture for a streaming-based algorithm (the autofocus criterion calculation) by achieving a speedup of 8.9x over a sequential implementation on a state-of-the-art general-purpose processor of a later silicon technology generation operating at a 2.7x higher clock speed. On the other case study, a highly memory-intensive algorithm (fast factorized back-projection), the Epiphany architecture shows a speedup of 4.25x. For embedded signal processing, low power dissipation is as important as computational performance. In our case studies, the Epiphany implementations of the two algorithms are, respectively, 78x and 38x more energy efficient.

I. INTRODUCTION AND MOTIVATION

Synthetic-Aperture Radar (SAR) [1] systems produce high resolution images of the ground. In SAR, a long antenna aperture is simulated by combining data from multiple pulses, which are transmitted while the aircraft is moving. The amount of raw radar data depends on the size of the area, the resolution, and the processing mode. SAR signal processing can be performed in the frequency domain by using the Fast Fourier Transform (FFT), which is computationally efficient but requires that the flight trajectory be linear with constant speed. The back-projection integration technique has been applied in SAR systems, enabling the processing of the image in the time domain. An advantage of time-domain processing over the commonly used frequency-domain techniques is that it is possible to compensate for non-linear flight tracks. However, the cost is typically a higher computational burden. The Fast Factorized Back-projection (FFBP) [2] is a computationally efficient algorithm for image forming in the time domain. It reduces the performance requirements significantly relative to those of the conventional Global Back-projection (GBP)

technique. However, the large data sets that need to be processed by the SAR signal processor still make it hard to meet the high performance required for real-time image creation, i.e., when the images are created during the flight. Another related challenge is to cope with the increased computational demands within a limited power budget. Thus there is a pressing need for on-board computing hardware that can meet the performance and energy requirements of real-time image creation in next generation radar systems.

Traditionally, advanced architectural techniques in CPUs, such as branch prediction, out-of-order execution, and superscalar execution, in addition to frequency scaling, have been used to meet the computational demands of these challenging applications. However, these advances increase complexity/area and power consumption. A common technique for reducing the overall power consumption is to lower the supply voltage and frequency and to compensate for the loss of processing speed by providing concurrent execution of computations. Manycore processor arrays consisting of tens or hundreds of simple processing cores offer the possibility of meeting the growing performance demands in an energy-efficient way by exploiting parallelism instead of scaling the clock frequency of a single powerful processor.

In this paper we evaluate one such manycore architecture, namely a 16-core Epiphany processor [3] from Adapteva Inc., by implementing one compute-intensive algorithm and one compute- and memory-intensive algorithm of SAR systems. The evaluation is performed by analyzing whether the selected architecture is suitable for meeting the performance and energy demands of the SAR algorithms and by comparing the implementation results on the Epiphany architecture with a state-of-the-art uni-processor implementation.
We analyze the characteristics of the two algorithms in terms of the availability of parallelism, the computation and memory requirements, the memory access pattern, and the synchronization needs. We provide insights into why one algorithm performs better on Epiphany than the other and highlight the key architectural features of Epiphany that achieve energy conservation.

The rest of the paper is organized as follows: Section II describes SAR image forming using fast factorized back-projection and the autofocus criterion calculation algorithms. Section III provides an overview of the selected manycore architecture. Section IV reviews the literature on SAR signal processing implementations. Section V presents the implementation methodology of the two algorithms on the Epiphany architecture. Section VI discusses the implementation results of the case studies, and the paper is concluded with some remarks in Section VII.

Fig. 1. A SAR signal processing block diagram.

II. SAR IMAGE FORMING USING FFBP AND AUTOFOCUS

The history of SAR systems dates back to the 1950s, when a number of image forming algorithms were proposed [1]. A SAR system produces a map of the ground while the platform is flying past it. The radar transmits a relatively wide beam to the ground, illuminating each resolution cell over a long period of time. The effect of the movement is that the distance between a point on the ground and the antenna varies over the data collection interval. This variation in distance is unique for each point in the area of interest. The points thus create unique paths in the collected radar data.

A SAR signal processing chain is shown in Figure 1. Since we, in this paper, are focusing on the time-domain image forming algorithms, we will limit the description to the back-projection block highlighted in Figure 1. The task for the signal processor is to integrate, for each resolution cell in the output image, the responses along the corresponding path. This is illustrated in Figure 2, where the area to be mapped is represented by MxN resolution cells. The flight path is assumed linear. The integration time may be several minutes, which means that the memory requirement for the data set ranges from 10 GBytes up to 1 TByte. The computational performance demands are between 10 GFLOPS and 50 GFLOPS [4]. Not only the large data sets represent a challenge, but also the complicated memory addressing scheme caused by the various unique paths in memory. The latter requires additional geometrical calculations during the processing in order to determine the locations of the contributing elements. The exact computational requirements depend on the details of the chosen algorithms and the radar system parameters.
The formation of the image can be done in the time domain with back-projection techniques [5]. In FFBP, the whole aperture initially consists of a large number of small subapertures with low angular resolution, as shown in the leftmost part of Figure 3(a).

Fig. 2. Simplified illustration of stripmap SAR.

These subapertures are iteratively merged into larger ones with higher angular resolution, until the full aperture with full angular resolution is obtained, as shown in Figure 3(a) after the nth iteration. The geometry of the contributing data elements is illustrated in Figure 3(b), which shows a single subaperture at the projected position (r, θ) and its corresponding back-projected image positions. The cosine theorem is used to calculate the range values r1, r2 and angle values θ1, θ2 corresponding to the contributing image positions, as given in the following equations:

r_1 = \sqrt{r^2 + \left(\frac{l}{2}\right)^2 - 2r\frac{l}{2}\cos(\pi - \theta)}   (1)

r_2 = \sqrt{r^2 + \left(\frac{l}{2}\right)^2 - 2r\frac{l}{2}\cos\theta}   (2)

\theta_1 = \cos^{-1}\left(\frac{r_1^2 + (l/2)^2 - r^2}{r_1 l}\right)   (3)

\theta_2 = \pi - \cos^{-1}\left(\frac{r_2^2 + (l/2)^2 - r^2}{r_2 l}\right)   (4)

Fig. 3. Simplified illustration of (a) Fast factorized back-projection, (b) Element combining.

where l is the subaperture length for the resulting image pixel. Interpolation is used to map the range and angle values obtained from the above equations onto the pixel data matrix. The element combining is done according to the following mathematical equation:

a_{\lambda+1,j}(r, \theta) = a_{\lambda,i}(r_1, \theta_1) + a_{\lambda,i}(r_2, \theta_2)   (5)

A. Autofocus

In reality, the flight path is not perfectly linear. This can, however, be compensated for in the processing. In the FFBP, the compensations are typically based on positioning information from GPS. If this information is insufficient or even missing, autofocus can be used instead. The autofocus calculations use the image data itself and are done before each subaperture merge. One autofocus method, which assumes a merge base of two, relies on finding the flight path compensation that results in the best possible match between the images of the contributing subapertures in a merge. Several different flight path compensations are thus tested before a merge. The image match is checked according to a selected focus criterion, as shown in Figure 4. The criterion assumed in this study is maximization of the correlation of image data and is given as:

\text{focus criterion} \approx \sum |f_-(r, f_i)|^2 \times |f_+(r, f_i)|^2   (6)

where f− and f+ represent the subapertures from the corresponding images. As the criterion calculations are carried out many times for each merge, it is important that these are done efficiently. The calculations include interpolations and correlation calculations. The interpolation kernels are swept along tilted paths in memory. Here, the images to correlate with each other are assumed to be only small subimages. The effect of a path error is therefore approximated to a linear shift in the data set. Thus a number of correlations between subimages that are slightly shifted in data are to be carried out. Autofocus in FFBP for SAR is further discussed in [6].
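A minimal C sketch of the criterion in Eq. (6), assuming the two subimages are stored as interleaved real/imaginary floats (function and parameter names are illustrative, not from the paper's code):

```c
/* Focus criterion of Eq. (6): sum over the subimage samples of
   |f-|^2 * |f+|^2. Each complex sample is two floats (re, im),
   so |z|^2 = re^2 + im^2 and no square root is needed. */
float focus_criterion(const float *f_minus, const float *f_plus, int n_samples)
{
    float sum = 0.0f;
    for (int i = 0; i < n_samples; i++) {
        float m2 = f_minus[2*i]*f_minus[2*i] + f_minus[2*i+1]*f_minus[2*i+1];
        float p2 = f_plus[2*i]*f_plus[2*i] + f_plus[2*i+1]*f_plus[2*i+1];
        sum += m2 * p2;
    }
    return sum;
}
```

In practice this sum is evaluated once per candidate flight-path compensation, and the compensation that maximizes the criterion is kept.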

Fig. 4. Illustration of autofocus and the focus criterion.

III. EPIPHANY ARCHITECTURE

Epiphany is a scalable manycore architecture comprising individual Epiphany cores that are based on a dual-issue architecture with a floating-point unit (FPU) and an integer ALU [3]. The floating-point unit performs one 32-bit single-precision floating-point operation per clock cycle and supports a fused multiply-add instruction. A 64-entry register file is used to feed operands to the ALU and the FPU. As shown in Figure 5, each core contains a DMA engine that allows it to efficiently transfer data to and from on-chip and off-chip resources. The DMA engine can transfer a double data word per clock cycle and works at the same clock frequency as the core.

The Epiphany is based on a shared memory model which is physically distributed on the chip. It has a 32-bit global address map shared by all cores in the system, and both code and data can be placed anywhere within this global memory space. All registers and memory locations are memory mapped and accessible by the host and by all cores. The overall memory is divided into blocks of 32 KB that are co-located with the cores in the multiprocessor system, and each memory block is mapped into the private memory map of its co-located core. The E16G3 device that we have used in this study has a total of 16 cores and 512 KB of on-chip memory.
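As an illustration of this flat address map: according to the published Epiphany architecture reference, the upper 12 bits of a 32-bit address encode the core's mesh ID (6-bit row, 6-bit column) and the lower 20 bits the offset within that core's local memory. The helper below is our sketch of that composition, not code from the paper:

```c
#include <stdint.h>

/* Compose a flat 32-bit Epiphany address from a core's mesh
   coordinates and a local-memory offset. Layout per the published
   architecture reference; treat the helper itself as illustrative. */
static inline uint32_t global_addr(uint32_t row, uint32_t col, uint32_t offset)
{
    uint32_t coreid = ((row & 0x3F) << 6) | (col & 0x3F);  /* 12-bit mesh ID */
    return (coreid << 20) | (offset & 0xFFFFF);            /* 20-bit offset  */
}
```

Because any core (or the host) can form such an address for any other core, reads and writes to remote local memories need no special API, which is what makes the streaming and prefetching schemes described later straightforward to express in plain C.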

Fig. 5. Epiphany architecture block diagram.

The eGrid Network-on-Chip (NoC) from Adapteva comprises a scalable 2D mesh network with four duplex links at every node. The NoC consists of three separate mesh structures: one for writes to on-chip nodes, one for writes to off-chip resources, and one for read transactions from on-chip and off-chip locations. Each routing node consists of a round-robin, five-direction arbiter and a single-stage FIFO. The routing mechanism is based on distributed address-based routing with a single-cycle wait approach, meaning that there is a single cycle of routing latency per node. Operating at a frequency of 1 GHz with a throughput of one transaction per clock cycle, the eGrid NoC provides a cross-section bandwidth of 64 GB/sec and a total on-chip bandwidth of 512 GB/sec, whereas the total off-chip bandwidth is 8 GB/sec [3].

The Epiphany architecture is programmed using the standard ANSI-C language, and each core in the architecture runs a separate program. These multiple programs are built independently and then loaded onto the chip using a common loader. The programs are compiled using the standard GNU GCC compiler [7]. An Eclipse-based integrated design environment manages the details of configuring and building the project for the manycore architecture.

IV. RELATED WORK

There have been a number of solutions in the past to implement SAR signal processing, some of the earliest dating back to the 1980s [8] [9]. In this section we review some of the significant ones related to SAR implementation on customized hardware. Friedlander et al. [10] have proposed two techniques for azimuth compression computing, which can then be developed in VLSI. They conclude that the technique of correlating the radar return with the reference signal is much more compute-intensive than using the FFT, but the simplicity of the correlation-based design makes it much more feasible to implement in VLSI. Przytula et al.
[11] developed a mesh-connected 64x64 SIMD processor array to perform real-time SAR imaging and showed the feasibility of the approach by implementing sequences of kernels for both stripmap and spotlight modes of operation. Although they were able to achieve high throughput with the technology available at that time, they were not able to meet the low-power and hardware size requirements for air-borne radars. Since hardware technology has evolved considerably since these reported implementations, one would expect that the latest hardware can cope with the performance requirements of those times. However, the computational requirements have also scaled because of the use of more complex signal processing algorithms, such as the FFBP and autofocus algorithms, i.e., the techniques that we are considering in our implementations. Our parallelization of interpolation kernels in the case of the autofocus criterion calculation has been inspired by the work of Przytula et al.

Other implementations of SAR image forming based on the back-projection technique include the work of Calderon et al. [12], who implemented cache optimization techniques for the back-projection algorithm on a digital signal processor, and the parallel implementations of back-projection on FPGA platforms carried out by Conti et al. [13] and by Hast et al. [14]. The approach taken by Hast et al. is to develop a system-on-chip, where the index generation and the interpolation are done by dedicated hardware in the FPGA and the subsequent element combining is done on a PowerPC core, also implemented on the FPGA. In contrast, the work by Conti et al. implements the complete back-projection algorithm in dedicated hardware. More recently, Lidberg et al. [15] have implemented a parallel version of fast factorized back-projection using OpenMP and vectorization instructions on a general-purpose multi-core platform. The results achieved by implementing the FFBP algorithm in stripmap mode on a machine consisting of two Intel Xeon X5675 hexa-core processors with a 3-level cache hierarchy running at 3.06 GHz indicate that they were able to achieve real-time performance for computing raw radar data of 240 MB and 7.8 GB collected within integration intervals of 74 and 149 seconds, respectively.
The provision of 128-bit vector registers and the use of Streaming SIMD Extensions (SSE) for vectorization, besides the use of all twelve available cores, resulted in a significant reduction of the execution time. Our implementation of the FFBP algorithm is more closely related to the work of Lidberg et al. than to the FPGA implementations. The major differences lie in the parallelization technique, which makes use of coarse-grained data-level parallelism instead of vectorized instructions, and does so without the OpenMP library, which is currently not supported on our target manycore platform.

V. IMPLEMENTATION METHODOLOGY

A. General Design Considerations

In order to realize the two algorithms, FFBP and autofocus, on the Epiphany platform in such a way as to achieve the required performance and meet the timing constraints, the first step is to determine what kinds of parallelization are applicable to the algorithms. Parallelization can be applied at different levels of granularity, and the task is not trivial because of the data dependencies. Since the two algorithms have different workload and dataflow characteristics, we have considered and chosen two different approaches for parallelizing them.

For the FFBP algorithm, we have chosen to exploit coarse-grained data-level parallelism, meaning that the same computations are applied to different sets of data. Rather than dividing the input data, we prefer to divide the resulting image into several independent data slices, which are computed in parallel by the processing cores, as shown in Figure 6. The data partitioning takes the data dependencies within the data set into consideration, so some redundancy in accessing the data slices is required. This kind of coarse-grained parallelization gives natural scalability: the number of compute nodes can grow with the image size. At the fine-grained level, we exploit the instruction-level parallelism capabilities of the target architecture when performing the complicated index calculations and data interpolation.

Fig. 6. Coarse-grained data partitioning.

The autofocus criterion calculation algorithm is characterized by a regular memory access pattern. We have relied on a task-level parallelism technique in this case, meaning that the overall computation is subdivided into several independent tasks, which can be assigned to individual processing cores executing in parallel. The intermediate processed data is passed in a streaming manner between the compute nodes.

B. FFBP Implementations

The input stimulus, corresponding to the pulse-compressed radar data, for the fast factorized back-projection implementations consists of 1001 range bins for each of the 1024 pulses. We have used a merge base of 2, which means that it takes ten iterations of element-wise subaperture combining to compute the full aperture with full angular resolution for the given image size of 1024x1001 pixels. Each pixel comprises two 32-bit floating-point numbers corresponding to the real and imaginary components. We have implemented two sequential and one parallel version of the FFBP algorithm.
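The coarse-grained division of the output image into per-core slices can be sketched as a simple block distribution of image rows. The names are illustrative, and the boundary redundancy mentioned in Section V-A is omitted for brevity:

```c
/* One core's slice of the output image: a contiguous block of rows. */
typedef struct { int first_row, num_rows; } slice_t;

/* Block-distribute image_rows rows over n_cores cores; when the row
   count does not divide evenly, the first (image_rows % n_cores)
   cores each take one extra row. */
slice_t slice_for_core(int core, int n_cores, int image_rows)
{
    slice_t s;
    int base  = image_rows / n_cores;
    int extra = image_rows % n_cores;
    s.first_row = core * base + (core < extra ? core : extra);
    s.num_rows  = base + (core < extra ? 1 : 0);
    return s;
}
```

For the 1024x1001-pixel image on 16 cores, each core is responsible for 64 of the 1024 pulse rows; scaling the image up simply increases the slice sizes or the core count.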
In the sequential versions, the complete algorithm is executed on a single core of Epiphany and, respectively, as a single-threaded application on an Intel Core i7 processor. The parallel implementation of the FFBP algorithm is based on the Single Program Multiple Data (SPMD) technique, meaning that the same source code is used for every core. It makes use of data-level parallelism, where the whole data set is split among the processing cores. We use the two upper data banks of the co-located memories

Fig. 8. Dataflow diagram of the autofocus criterion calculation algorithm.

with each Epiphany core to store the subaperture data corresponding to two pulses, which equals 16,016 bytes. The data for the contributing subapertures is prefetched into the local memories of the individual cores. The indexes of the contributing subaperture data in memory are calculated on the basis of Eqs. 1-4, using simplified (nearest-neighbor) interpolation for both the range and the angle calculations. Each Epiphany core then performs the subaperture combining on its designated set of contributing subapertures and stores the resulting subaperture data back to the external SDRAM. After the completion of one iteration of subaperture combining, the resulting subaperture data of the previous iteration is loaded into the local memories of the cores for the computation of the next iteration. This process continues until all ten iterations of the FFBP algorithm have been performed to obtain the full aperture with the highest angular resolution.

For validation we have used a test scenario of six target points. Figure 7(a) shows the image corresponding to the raw radar data, consisting of curved paths for the six target points. The resulting image using global back-projection (GBP) is shown in Figure 7(b). The FFBP-processed images shown in Figure 7(c) and Figure 7(d) have a lower quality than the GBP-processed image due to the noise introduced by the simplified interpolation performed in the successive iterations. On the other hand, the FFBP algorithm is much faster than the GBP algorithm. Note that the quality of the FFBP-processed images could be considerably improved by using more complex interpolation kernels such as cubic interpolation. As seen in Figure 7(c) and Figure 7(d), the qualities of the resultant images on the Intel and Epiphany architectures are similar.

C. Autofocus Implementations

For the parallelized autofocus implementation, the partitioning of the computational workload is done by determining the dataflow patterns of the algorithm. We have used the dataflow pattern shown in Figure 8 and mapped it according to the available resources on the target architectures. The implementations take, as input, two 6x6 blocks of image pixels from the area of interest of the contributing

Fig. 7. (a) Pulse compressed radar data, (b) GBP processed image, (c) FFBP processed image on Intel i7, (d) FFBP processed image on Epiphany.

image. Cubic interpolation based on Neville's algorithm [16] is performed in the range direction, followed by the beam direction, to estimate the values of the contributing pixels along the tilted lines, and the resulting subimages are correlated according to the autofocus criterion. Three iterations of the range interpolation, beam interpolation, correlation, and summation stages are performed in order to compute the autofocus criterion for the entire 6x6 image block.

Similar to FFBP, we have implemented two sequential and one parallel version of the autofocus algorithm. However, unlike the FFBP parallel implementation, we use different source codes for the different Epiphany cores in the parallel version of the autofocus criterion calculation, i.e., the Multiple Program Multiple Data (MPMD) style of parallelization. The parallel version takes as input two 6x6 blocks of image pixels from the area of interest of the contributing image and performs the autofocus criterion calculation in a streaming manner, in which the overall algorithm is partitioned into several tasks, each of which is then implemented on an individual core, as shown in Figure 9. For each of the two pixel blocks we have used three cores for computing the range interpolation and three cores for the beam interpolation. Finally, the correlation is calculated by a common

Fig. 9. Mapping diagram of the parallel autofocus implementation.

core, which makes the total number of used cores equal to 13. Thus, the proposed mapping implements task-level parallelism and pipelined execution based on the dataflow characteristics of the algorithm, and it utilizes the routing mechanisms of the Epiphany architecture in an optimized manner. The three spare cores can then be used to execute subsequent stages of SAR signal processing.

The range interpolators perform the same operation on different rows and the first four columns of pixel data. The input pixel data is also copied to the local memory of the next adjacent core, which computes the range interpolation by including another column of pixels instead of the first column. Each beam interpolator in the next stage receives input data values corresponding to four range-interpolated pixels. The resulting data values of the beam interpolation stage are passed to the correlation and summation stage. A similar set of six cores is used for the second image block for calculating the range and beam interpolations. The correlation and summation stage is computed by a single core, which provides the final autofocus criterion calculation result to be written to the off-chip SDRAM.

VI. RESULTS EVALUATION AND DISCUSSION

The Epiphany results are obtained from the implementations executing on a 16-core Epiphany E16G3 chip mounted on an experimental board that limits the clock speed to 400 MHz. We measure the total number of cycles on Epiphany and calculate the execution time for execution at 1 GHz, which is the maximum specified clock frequency for the Epiphany architecture. We measure the results of the sequential versions of the same algorithms on an Intel platform by executing them as single-threaded applications on an Intel Core i7-M620 CPU operating at 2.67 GHz. We wanted a reference to a sequential execution on a state-of-the-art processor running at maximum speed, and thus chose not to use the obtainable 2-core parallelism of the Intel CPU. Thus we use the single-core Intel i7 implementations as a reference for results comparison.
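This timing methodology amounts to a simple scaling of measured cycle counts by the target frequency. The sketch below reproduces, for illustration only, the 305 ms and 4.25x figures reported in Table I; the cycle count is back-derived from the table, not a measured value:

```c
#include <math.h>

/* Convert a cycle count measured on the evaluation board into the
   execution time at a chosen clock frequency (here 1 GHz). */
double exec_time_ms(unsigned long long cycles, double freq_hz)
{
    return 1000.0 * (double)cycles / freq_hz;
}

/* Speedup over a reference implementation is just a time ratio. */
double speedup(double t_ref_ms, double t_ms)
{
    return t_ref_ms / t_ms;
}
```

Because execution time scales linearly with clock period, cycle counts measured at the board-limited 400 MHz translate directly to the 1 GHz figures reported in the paper.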
Table I summarizes the number of processing cores, the execution time in milliseconds, the throughput in terms of the number of pixels per second on which the given autofocus criterion is computed, the speedup figures for the implementations realized on the Epiphany platform compared to a sequential implementation executed on a single core of the Intel CPU, and the estimated power consumed by the two architectures, based on the figures obtained from the Intel i7-M620 processor [17] and Epiphany E16G3 [3] data sheets. The estimated power for Epiphany is based on a clock frequency of 1 GHz. The power for a single core of the Intel processor is taken as half the total dissipated power of the chip.

The execution time of the FFBP algorithm for the 1024x1001 image size on a single core of Epiphany is almost three times longer than that of the reference sequential implementation, mainly because the image data is stored in the off-chip SDRAM, whose access time is much longer than the memory access time of the Intel processor. The Epiphany architecture does not have caches; instead, it uses the large on-chip register file to reduce memory read and write operations. The modern Intel architectures, in contrast, provide prefetching mechanisms combined with three levels of caches to hide the memory latencies. The Intel Core i7 processor also contains an on-die memory controller that connects to three channels of DDR memory to increase the memory bandwidth, and it employs an out-of-order superscalar architecture to exploit instruction-level parallelism. It should also be noted that the clock frequency of the Epiphany architecture is 2.7x lower than that of the Intel architecture.

The parallel implementation of the FFBP algorithm utilizing all 16 cores of the Epiphany chip is 11.7x faster than the sequential Epiphany implementation and 4.25x faster than the sequential implementation on Intel. The obvious reason is the effective use of a larger number of processors, but it is also due to the fact that we prefetch the contributing subaperture data and store it in the local memories co-located with each core in the Epiphany architecture. Since the size of the local memories is limited, we use only the two upper memory banks (each of 8 KB) to store data corresponding to two pulses of the contributing subapertures. During the first merge iteration the prefetched data is sufficient for computing the resulting subaperture data, but the later iterations still require contributing data to be read from the external memory. The resulting subaperture data is still written to the external memory, but its effect is less pronounced because in the Epiphany architecture the write operation is performed without stalling. Thus, writing has a single-cycle throughput, whereas the memory read operation is more expensive due to stalling.

We also made use of the fused multiply-add instruction in the geometrical calculation of indices for the element combining. Other optimizations include the use of scalar variables to maximise the use of the register file, skipping the additions with zero when the indices are out of range, and using a struct data type for representing complex numbers, which explicitly forces the compiler to generate one 64-bit MOV instruction instead of two 32-bit MOV instructions.
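The struct-based complex representation and the zero-skipping optimization can be sketched as follows. The names are illustrative; the paper does not show its actual source:

```c
/* Packing the real and imaginary parts into one 8-byte struct lets the
   compiler copy a pixel with a single 64-bit move instead of two
   32-bit moves. */
typedef struct { float re, im; } cpx_t;

static inline cpx_t cpx_add(cpx_t a, cpx_t b)
{
    cpx_t r = { a.re + b.re, a.im + b.im };
    return r;
}

/* Element combining with out-of-range contributions skipped entirely,
   mirroring the "skip additions with zero" optimization above. */
cpx_t combine(const cpx_t *sub, int i1, int i2, int n)
{
    cpx_t acc = { 0.0f, 0.0f };
    if (i1 >= 0 && i1 < n) acc = cpx_add(acc, sub[i1]);
    if (i2 >= 0 && i2 < n) acc = cpx_add(acc, sub[i2]);
    return acc;
}
```

Skipping the out-of-range contributions avoids both the wasted additions and the loads that would otherwise feed them.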
The said optimizations are applied on both architectures under consideration. The resulting images from the FFBP algorithm on the two architectures are similar in quality, but compared with the image computed by the GBP algorithm there is a degradation in quality. The main reasons are the approximations made in the simplified interpolations performed in each iteration and the less compute-intensive implementation of the square root operation; both techniques are used to reduce the overall computational complexity of the algorithm.

Looking at the execution results of the autofocus algorithm, we see that the throughput measures of the two sequential implementations are comparable. This is due to the fact that the amount of computation to be performed on each input pixel block within the area of interest is significant. Since the working data set of the kernel fits completely in the on-die storage of Epiphany, the effects of memory latency are not very visible in the throughput results of the autofocus case study. Again, the presence of an on-chip register file also helps in avoiding memory references. The effect of a slower clock than the Intel processor's is compensated by the faster execution of floating-point operations. The throughput of the parallel implementation using 13 processors is 10.9x higher

TABLE I
RESOURCES, PERFORMANCE, AND ESTIMATED POWER RESULTS OF FFBP AND AUTOFOCUS CRITERION CALCULATIONS.

FFBP Implementations              | No. of Cores | Execution Time (msec.) | Speedup | Estimated Power (Watts)
Sequential on Intel i7 @ 2.67 GHz | 1            | 1295                   | 1       | 17.5
Sequential on Epiphany @ 1 GHz    | 1            | 3582                   | 0.36    | 2
Parallel on Epiphany @ 1 GHz      | 16           | 305                    | 4.25    | 2

Autofocus Implementations         | No. of Cores | Throughput (pixels/sec.) | Speedup | Estimated Power (Watts)
Sequential on Intel i7 @ 2.67 GHz | 1            | 21,600                   | 1       | 17.5
Sequential on Epiphany @ 1 GHz    | 1            | 17,668                   | 0.8     | 2
Parallel on Epiphany @ 1 GHz      | 13           | 192,857                  | 8.93    | 2

than the sequential implementation on a single Epiphany core and 8.9x higher than the sequential implementation on the Intel processor. The efficiency comes from the use of a streaming approach, where the intermediate results are streamed between the neighboring cores to produce the final output instead of being written to the off-chip memory. We have also managed to achieve minimal delay in the communication between cores in Epiphany because of the custom mapping of the parallel implementation, which avoids transactions with distant cores. It may appear that the mapping would introduce some congestion at the correlation block, since it receives input data from six beam interpolators. However, the fact that the on-chip bandwidth is 64 times higher than the off-chip bandwidth helps to avoid any impact of this bottleneck on the throughput of the implementation.

A. Energy Efficiency

The implementations on Epiphany show excellent energy efficiency, measured as throughput per watt. The throughput-per-watt figure for the parallel autofocus implementation on Epiphany is 78x higher than the figure for the sequential implementation on the Intel processor, and the parallel FFBP implementation is 38x more energy-efficient than the FFBP implementation on the Intel processor. It is understood that the Intel processor is not primarily intended for low-power embedded applications, but it has frequently been used as the CPU in laptop computers. It was fabricated in a 32 nm silicon technology, whereas the Epiphany device we used was fabricated in a 65 nm technology.

The Epiphany architecture incorporates several optimizations at the individual core micro-architecture level, such as the provision of a fused multiply-add instruction, variable instruction lengths, and a large register file that enables optimization of the code density and register allocation that minimizes interaction with the memory. The mesh network also reduces power, since all signals travel from one tile to its immediate neighbor, minimizing signal length and thus the drive current. These short signals also enable the network to operate at the same clock speed as the Epiphany cores. Epiphany also saves power in clock distribution and by using extensive, fine-grained clock gating, thus shutting off the clock

to unused function units and entire cores on a cycle-by-cycle basis [18].

Relating this to the work mentioned above, the limited on-chip and on-board memory of the Epiphany platform restricts us to smaller radar data sizes for the FFBP algorithm than those reported by Lidberg et al. [15], but our implementation outperforms theirs in terms of energy efficiency.

B. Programmability

Regarding programmer productivity in exploiting parallelism, the memory access patterns of the application play a critical role. Where data access is regular and offers good opportunities for optimizations based on data-level parallelism, the programmer can use the SPMD approach, which requires relatively little effort. However, explicit management of synchronization between the cores, as we find in the autofocus case study, must be done manually and increases the burden on the programmer, in addition to the requirement of writing a separate C program for each individual core. The synchronization is needed so that each processing core can indicate to the subsequent core in the dataflow that it has completed its task, allowing that core to proceed with processing the intermediate data.

VII. CONCLUSIONS AND FUTURE DIRECTIONS

We have described two significantly complex algorithms used in SAR systems. Two application case studies have been performed, and the results of sequential and parallel implementations of the two algorithms, obtained by execution on real hardware, have been presented and analyzed. The parallel implementation of the FFBP algorithm on Epiphany yields a speedup of 4.25x over the reference implementation on a single core of an Intel Core i7 processor. This is achieved at a clock frequency of 1 GHz, compared to 2.67 GHz in the reference implementation.
Similarly, the throughput of the parallel autofocus criterion calculation on Epiphany outperforms the sequential implementation on the Intel processor by a factor of 8.9. The achieved speedup is of course mainly attributed to the use of a larger number of processors. The observed difference in the speedup figures for the two case studies is due to the different inherent natures of the two algorithms. For instance, the ratio of the amount of computation performed on the input data to the number of memory operations is much higher in the autofocus algorithm than in the FFBP. This is reflected in the much higher speedup figure for the autofocus case study, despite its use of a slightly smaller number of processors. The frequent off-chip memory accesses in the parallel FFBP implementation limit its speedup compared to that of the autofocus implementation. The highly energy-efficient design of the Epiphany architecture also results in much better energy efficiency figures for the two parallel implementations, on the order of 38x and 78x, respectively, compared to the reference implementations.

From the programmability point of view, we observe that the data access patterns of an application strongly influence the implementation approach. With regular data access, the SPMD approach significantly reduces the programming effort, leading to a significant reduction in implementation turnaround time. On the other hand, the MPMD approach, where each core performs a different task, requires implementing separate C programs, which, together with the added work of managing synchronization between the cores, reduces productivity. This will become even more significant as new, much more parallel versions of the Epiphany and other architectures appear (a 64-core Epiphany chip is now available). For manycore architectures like the one used in this study, high-level language support that raises the abstraction level for the programmer without compromising the performance benefits is essential for the widespread adoption of such architectures in the mainstream computing industry.
We believe that some of the building blocks of such a solution can be found in our earlier work on using the occam-pi language to program a variety of reconfigurable manycore platforms [19][20]. Advancing the methods and tools for this will be the main focus of our future work on manycore architectures.

ACKNOWLEDGMENT

The authors would like to thank Adapteva Inc. for providing access to their software development suite and hardware board. This research was done in the JUMP (JUmp to Manycore Platforms) project within the CERES research program supported by the Knowledge Foundation, in cooperation with Saab AB

and Free2move AB.

REFERENCES

[1] W.G. Carrara, R.S. Goodman, and R.M. Majewski: Spotlight Synthetic Aperture Radar: Signal Processing Algorithms. Boston: Artech House. (1995)
[2] L.M.H. Ulander, H. Hellsten, and G. Stenstrom: Synthetic-aperture radar processing using fast factorized back-projection. IEEE Transactions on Aerospace and Electronic Systems, Vol. 39(3), pp. 760-776. (2003)
[3] Epiphany E16G3 datasheet, Revision 1.0, Adapteva Inc. (2010)
[4] A. Åhlander, H. Hellsten, K. Lind, J. Lindgren, and B. Svensson: Architectural challenges in memory-intensive, real-time image forming. Proceedings of the International Conference on Parallel Processing (ICPP'07). (2007)
[5] A. Olofsson: Real-time signal processing in airborne ultra-wideband low frequency SAR. Master's thesis, Chalmers University of Technology, Gothenburg. (2003) (In Swedish)
[6] H. Hellsten, P. Dammert, and A. Åhlander: Autofocus in fast factorized back-projection for processing of SAR images when geometry parameters are unknown. Proceedings of the 2010 IEEE International Radar Conference. (2010)
[7] GCC, The GNU Compiler Collection. http://gcc.gnu.org/ (2013)
[8] R. Fabrizio: High speed digital processor for real-time synthetic aperture radar imaging. Proceedings of the IGARSS Symposium. (1987)
[9] D.G. Appelby and J.J. Soraghan: Fast SAR signal processing on the DAP. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). (1986)
[10] B. Friedlander and J. Newkirk: A comparison of two SAR processing architectures for VLSI implementation. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). (1983)
[11] K.W. Przytula and J.G. Nash: Parallel implementation of synthetic-aperture radar algorithms. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, Vol. 1(1), pp. 45-56. (1989)
[12] R.A. Neri-Calderon, S. Alcaraz-Corona, and R.M. Rodriguez-Dagnino: Cache-optimized implementation of the filtered back-projection algorithm on a digital signal processor. Journal of Electronic Imaging, 16(4). (2007)
[13] A. Conti, B. Cordes, M. Leeser, E. Miller, and R. Linderman: Adapting parallel back-projection to an FPGA enhanced distributed computing environment. Proceedings of the Workshop on High-Performance Embedded Computing (HPEC). (2005)
[14] A. Hast and L. Johansson: Fast factorized back-projection in an FPGA. Master's thesis, Halmstad University, Halmstad, Sweden. (2006)
[15] C. Lidberg and J. Olin: Optimization of fast factorized back-projection execution performance. Master's thesis, Chalmers University of Technology, Gothenburg, Sweden. (2012)
[16] E.H. Neville: Iterative interpolation. Journal of the Indian Mathematical Society, Vol. 20, pp. 87-120. (1934)
[17] Intel® Core™ i7-600, i5-500, i5-400, and i3-300 Mobile Processor Series datasheet, Intel Corporation. (2010)
[18] L. Gwennap: Adapteva: More flops, less watts. Microprocessor Report, 6/13/11-02, www.MPRonline.com (2011)
[19] Zain-ul-Abdin and B. Svensson: Occam-pi as a high-level language for coarse-grained reconfigurable architectures. Proceedings of the Reconfigurable Architectures Workshop (RAW). (2011)
[20] Zain-ul-Abdin and B. Svensson: Occam-pi for programming of massively parallel reconfigurable architectures. International Journal of Reconfigurable Computing, Vol. 2012, Article ID 504815. (2012)
