Portable LDPC Decoding on Multicores using OpenCL




Gabriel Falcao, Vitor Silva, Leonel Sousa, and Joao Andrade


This article addresses, in a tutorial style, the benefits of using the Open Computing Language [1] (OpenCL) as a quick way for programmers to express and exploit parallelism in signal processing algorithms, such as those used in error-correcting code systems. In particular, we show how multiplatform kernels can be developed straightforwardly in OpenCL to perform computationally intensive low-density parity-check (LDPC) decoding, targeting a large set of widely disseminated multicore architectures, such as x86 general-purpose multicore central processing units (CPUs) and graphics processing units (GPUs). Moreover, devices with different architectures can be orchestrated to cooperatively execute these signal processing applications programmed in OpenCL. Experimental evaluation of the parallel kernels programmed with the OpenCL framework shows that high performance can be achieved on distinct parallel computing architectures with low programming effort.

CODE PORTABILITY OF DSP APPLICATIONS ON MULTICORES

The recent paradigm shift in computer architectures has been driving toward two or more cores on a single chip to provide augmented computational power. Multicores have gone mainstream over the last few years due to the power and memory walls, and a vast set of architectures [2] and programming models [3] have been made available to the community. They range

from general-purpose homogeneous multicores to heterogeneous multicore systems, or even more aggressive manycore architectures originally designed, for example, for the video and graphics markets [2]. These architectures present new challenges, but also new opportunities to accelerate intensive digital signal processing (DSP) applications [4] through parallel programming. However, given the large number of multicore architectures and the different parallel programming models available, efficient source code for parallel signal processing algorithms still has to be developed for each specific architecture. To overcome this difficulty, OpenCL defines a programming model and provides a framework that allows programmers to develop code once and execute it on different multicore architectures.

In this article, we discuss how to introduce changes in signal processing algorithms to make them suitable and portable across different multicore platforms. The purpose of the article is to provide a high-level overview of OpenCL based on simple signal processing operations used to perform LDPC decoding, a class of well-known error-correcting codes that requires intensive processing and irregular memory accesses, to show the ease of implementation and effectiveness of OpenCL. Although we focus on this particular class of algorithms, similar principles can be applied to the development of more complex parallel signal processing applications.

OpenCL

Historically, developing or adapting an algorithm for execution on a multicore

computer has been considered nontrivial. Multicore platforms vary from homogeneous to heterogeneous and can have a shared or a distributed memory structure, or a combination of both [2]. Distributing the computational workload of a DSP algorithm over two or more cores, eventually with multiple threads launched per core, while dealing efficiently with communications and the memory hierarchy of the system, has been performed differently on distinct platforms. Manufacturers of multicore systems support different programming models with variable complexity and a low level of portability. Exploiting the potential of multicore-based processing requires parallel programmers with diverse programming skills. Gathering all these competences consumes nonnegligible effort, and this becomes even more significant considering that multicore architectures are continuously evolving at an increasingly rapid pace.

Since the very beginning, one of the key challenges in parallel computing has been the development of a broadly accepted programming model truly supported by a large set of architectures. Originally conceived with these concerns in mind by a group of manufacturers including Apple, Intel, AMD, and NVIDIA, OpenCL defines an application programming interface (API) for cross-platform modern multiprocessors through an industry standard supported by the Khronos Group [1]. Using a C/C++ language environment, OpenCL (currently in version 1.2) is supported by the major computer manufacturers and operating systems. The reader should consult the official OpenCL Web site [1] for full documentation and the complete list of specifications.

Based on our experience, the level of complexity imposed by OpenCL is similar to that of other dedicated programming models [5], [6]. Although it is not the only API supporting the development of software for multicore systems, alongside the Compute Unified Device Architecture (CUDA) [5] and OpenMP [6], it is certainly the one covering the largest set of multicore platforms. They range from the x86 family of general-purpose CPUs to NVIDIA and ATI GPUs and Cell-based architectures, while support for field-programmable gate arrays (FPGAs) and other parallel computing machines, such as digital signal processors, is expected soon [2], [3]. Broadly supported, the potential of OpenCL is high, and this article shows that it is simple to use, suggesting a quick learning curve for DSP developers, who can consider this portable alternative for their projects and algorithms.

THE OpenCL CONTEXT AND FRAMEWORK

OpenCL allows one to define a kernel, or group of kernels, that exploit multithread-based parallelism and are loaded and executed on a multicore platform. The programmer instructs the compiler how a certain code section should be parallelized or multithreaded, typically by exploiting the data parallelism found in DSP applications. A kernel programmed for a multicore system generally adapts its processing depending on the thread identifier (thread-ID) it is dealing with at the moment. But to do so, the program must inspect how many cores compose the system, into how many blocks the algorithm should be partitioned, and how many threads should be assigned to each block. To automatically distribute a balanced workload across its computational resources, the system must be able to query the environment, or context, of the multicore platform. Also, the compilation of the parallel kernel to run on the multicore must be performed at runtime.
The OpenCL framework supports a set of functions that perform all the necessary steps: querying the system's context and resources, loading data and kernels, compiling, launching

execution, and gathering the computed data at the end of processing. In the last section of this article, we show C++ code examples that were compiled with g++ 4.4 and OpenCL 1.1, running on a Linux 3.0 distribution.
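These steps can be sketched in host-side C code. The following is a minimal sketch, not taken from the article: it assumes a single device of the default type and embeds a trivial kernel source string, and it omits the error checking that production code must perform on every cl* call (each one returns or reports a cl_int status code). It only runs on a system with an OpenCL driver and device installed.

```c
#include <CL/cl.h>
#include <stdio.h>

/* Minimal host-side OpenCL 1.1 setup sketch (no error checking). */
int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    /* 1) Query the platform and its devices. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* 2) Create a context and a command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* 3) Build the program from source at runtime; create the kernel object. */
    const char *src =
        "__kernel void k_parAdd(__global int *a, __global int *b,"
        "                       __global int *c) {"
        "    int tid = get_global_id(0);"
        "    c[tid] = a[tid] + b[tid];"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "k_parAdd", &err);

    /* 4) Allocate device buffers, move data, launch, and read back. */
    enum { N = 64 };
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(a), NULL, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(b), NULL, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);
    clEnqueueWriteBuffer(queue, da, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, db, CL_TRUE, 0, sizeof(b), b, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dc);

    size_t global = N; /* one work item per array element */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[1] = %d\n", c[1]); /* a[1] + b[1] on a working device */
    return 0;
}
```

Note that, unlike statically compiled host code, the kernel source is handed to clBuildProgram at runtime, which is what makes the same program portable across CPU and GPU devices.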

An OpenCL program starts with a sequential main function launched on a host system (e.g., a CPU) and performs a single-thread-based execution until it reaches the parallel section of the program. At that point, the parallel kernel is launched on the OpenCL parallel computing device (e.g., a GPU), where the processing flows according to a multithreaded execution model. When the parallel kernel terminates, the host system resumes sequential execution.

THE OpenCL RUNTIME

In the initial section of the program, the system is queried to define the appropriate operating context, specifically the characteristics of the available OpenCL devices, such as the number of processing units and threads available for computation. Consequently, the OpenCL task scheduler can conveniently split the workload and perform a balanced computation across the system's resources. As shown in Figure 1, the programmer can set appropriate OpenCL functions to perform this step, which ultimately allows her/him to manually refine optimized working sets and define optimal compilation options, features, and arguments. The OpenCL runtime is defined by a set of functions that can be globally grouped into Query Platform Info, Contexts, Query Devices, and Runtime API.

■ The Query Platform Info group includes the functions clGetPlatformIDs and clGetPlatformInfo to obtain the list of available platforms and to get

specific information regarding the OpenCL platform, respectively.

■ The Contexts group contains a more extensive set of functions, including clCreateContextFromType, which is used to create an OpenCL context, and clGetContextInfo, which is used to query information about a certain context.

■ The Query Devices group is also composed of two functions, specifically clGetDeviceIDs, used to get the list of devices available on a platform, and clGetDeviceInfo, used to obtain information regarding an OpenCL device.

■ The Runtime API defines the most extensive group of functions used in the OpenCL runtime, dedicated to: 1) launching and controlling command queues, e.g., the clCreateCommandQueue function creates a command queue on a specific device associated with a valid OpenCL context; 2) controlling memory objects, e.g., the clCreateBuffer function creates a buffer object, and the clEnqueueWriteBuffer function enqueues commands to write to that buffer object from host memory; 3) creating a program from source, e.g., clCreateProgramWithSource creates a program object for a context and loads the source code specified by the text strings in the strings array into the program object, while clBuildProgram builds (compiles and links) a program executable from the program source; and 4) launching kernels on a device, e.g., the clCreateKernel function creates a kernel object from a program object with a successfully built executable, and clEnqueueNDRangeKernel enqueues a command to execute a kernel on a device.

Figure 1 presents a subset of the main runtime functions described in this section, illustrating a typical utilization scenario. The complete set of OpenCL functions and enumerated

[Figure 1 diagram: an OpenCL context containing the host, a platform, and OpenCL devices running DSP kernels, supported by runtime functions grouped into platform and device management (clGetPlatformInfo, clGetPlatformIDs, clGetDeviceInfo, clGetDeviceIDs), context management (clCreateContextFromType, clGetContextInfo), memory and queueing management (clCreateBuffer, clEnqueueWriteBuffer, clEnqueueReadBuffer, clEnqueueNDRangeKernel), and program compilation (clCreateProgramWithSource, clBuildProgram, clCreateKernel).]

[FIG1] OpenCL runtime functions supporting the parallel execution of DSP kernels using threaded blocks.

types supported are available from [1]. Also, as indicated in the last section of the article, the source code example presented herein illustrates all the sections of an OpenCL program and is available online.

MEMORY ALLOCATION RULES

Data can be passed to the kernels as function arguments. To do so, data buffers have to be previously allocated in the OpenCL device's memory when running the context queries in the initial section of the program on the host side. Afterward, data can be transferred to the device's memory using the clEnqueueWriteBuffer function. After all the data is processed by the parallel kernels, it can be similarly transferred back to the host using the clEnqueueReadBuffer function.

MULTITHREAD-BASED PARALLELISM

As mentioned above, after a kernel is compiled and launched on the OpenCL

device, the parallel part of the program starts. The host system can launch kernels on the device following an approach similar to

/* This kernel computes the current work-item ID and calls the
   respective numerical routine k_parAdd() */
__kernel void g_parAdd(__global int *a, __global int *b, __global int *c)
{
    int tid = get_global_id(0);
    k_parAdd(a, b, c, tid);
}

Kernels can also be called from the OpenCL device side. The tid variable defines the ID of the finest-grain unit of processing, i.e., a thread or work item, which is obtained by querying the OpenCL device using the function get_global_id(). At each instant, the execution unit automatically defines

which group of threads, or work group, to launch and execute concurrently, and the corresponding processing and memory accesses are automatically calculated based on the tid of that group. It is the programmer's responsibility to associate variables with the tid of each work item in the code. To exemplify this association, one can directly use tid to parallelize the sum of two integer arrays in a very straightforward manner

/* This kernel executes several work items simultaneously to perform
   the parallel sum of arrays a and b of dimension N */
__kernel void k_parAdd(__global int *a, __global int *b,
                       __global int *c, int tid)
{
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

