Benchmarking parallel compilers: A UPC case study




Future Generation Computer Systems 22 (2006) 764–775. doi:10.1016/j.future.2006.02.002

Tarek A. El-Ghazawi, François Cantonnet, Yiyi Yao, Smita Annareddy, Ahmed S. Mohamed
Department of Electrical and Computer Engineering, The George Washington University, Washington, DC 20052, United States
Available online 4 April 2006

Abstract

Unified Parallel C (UPC) is an explicit parallel extension to ISO C which follows the Partitioned Global Address Space (PGAS) programming model. UPC therefore combines the ability to express parallelism with the ability to exploit locality. To do so, compilers must embody effective UPC-specific optimizations. In this paper we present a strategy for evaluating the performance of PGAS compilers. It is based on emulating possible optimizations and comparing the performance to the raw compiler performance. It will be shown that this technique uncovers missed optimization opportunities. The results also demonstrate that, with such automatic optimizations, UPC performance compares favorably with other paradigms.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Distributed shared memory; Unified Parallel C; Compilers; Non-Uniform Memory Access architecture; Benchmark; NAS parallel benchmark

1. Introduction

While the performance evaluation of parallel architectures has been well studied, and many benchmarks and evaluation methodologies exist, the notion of evaluating compiler implementations of a given language has not received similar attention. In this work, we define a methodology for the evaluation of parallel compilers. As a case study, we consider UPC, also known as Unified Parallel C, a new language with many promising characteristics. In such a case, an insightful evaluation of UPC compilers is critical: it is easy to confuse performance behaviors due to the programming model and the underlying language concepts with the quality of a particular implementation. Our methodology therefore examines the possible UPC-specific optimizations and provides emulations of them that can be implemented by hand. This way, the presence and quality of implementation of each of these optimizations can be examined, and the potential performance improvements for an existing implementation can be assessed.

∗ Corresponding address: The George Washington University, Department of Electrical and Computer Engineering, 725 23rd St. NW Room 409, Washington, DC 22202, United States. E-mail addresses: [email protected] (T.A. El-Ghazawi), [email protected] (F. Cantonnet), [email protected] (Y. Yao), [email protected] (S. Annareddy), [email protected] (A.S. Mohamed).


UPC is an explicit parallel programming language based on the ISO C standard and the distributed shared memory programming model. It capitalizes on the experience gained from its predecessor distributed shared memory C languages, such as Split-C [1], AC [2] and PCP [3]. The language keeps the C philosophy of making programs concise, while giving the programmer the ability to get closer to the hardware to gain more performance. UPC is therefore becoming more and more widely accepted in the high-performance computing community.

UPC resulted from the combined effort of a consortium of government, industry and academia. Consortium participants include GWU, IDA, the US Department of Defense (DoD), UCB, MTU, UFL, ANL, ARSC, HP, Cray, Etnus, IBM, Intrepid Technologies, LBNL, LLNL, and the US Department of Energy (DoE). This has spurred efforts at many of the vendors to develop and commercialize UPC compilers. UPC compilers are now available for the SGI O2k and O3k, Cray T3E and X1, HP AlphaServer SC, and Beowulf clusters, to name a few; more implementations are available or under way. The UPC Specifications v1.0 document was produced in February 2001 [4], a new version (v1.1) was released in May 2003 [5], and the latest UPC specification (v1.2) [6] was published in June 2005 by the UPC consortium. A TotalView debugger is now available from Etnus for most implementations.

UPC has been developed on the basis of lessons learned from the message passing and pure shared memory programming paradigms [7]. The shared memory model brings ease of use.


Under this programming model, remote memory accesses do not need to be expressed explicitly. However, performance loss may occur due to unintended remote accesses, as programmers have no control over how data is laid out in the address space. Message passing offers explicit distribution of data and work, such that programmers can avoid unnecessary messages. On the other hand, message passing is characterized by its significant overhead for small messages and by the difficulty of using it. The Partitioned Global Address Space (PGAS) model (or distributed shared memory programming model), which is followed by UPC, brings together the best of these two worlds: it allows the exploitation of data locality by distributing data and processing under the control of the programmer, while maintaining ease of use.

In this case study, we consider the performance of recent UPC compiler implementations on the SGI Origin family of computers. The SGI Origin family features a Non-Uniform Memory Access (NUMA) architecture, which offers a global address space distributed among the nodes. In this study, it will be shown that there are still significant optimizations for the current UPC compiler implementations to take advantage of the capabilities offered by this hardware. In general, it is noticed that many of the current practices in developing distributed shared memory compilers do not necessarily take advantage of the benefits of the PGAS model. Furthermore, it will be shown that, in order to achieve high portability, implementations also often ignore some of the helpful features of the underlying system that can be particularly beneficial for UPC. Some of these facts are revealed in our work by using a number of explicit work-arounds and hand-optimizations that can mimic the effects of automatic compiler optimizations [8]. Such hand-coded optimizations help to reveal the potential performance improvements for the compiler being evaluated.

This paper is organized as follows. Section 2 gives a brief description of the UPC language, Section 3 presents our evaluation strategy, and Section 4 introduces the experimental testbed and workloads used. Section 5 presents the performance measurements, followed by a summary and conclusions in Section 6.

2. Unified Parallel C (UPC)

UPC model and basic concepts

Fig. 1. The UPC memory and execution model.

Fig. 1 illustrates the memory and execution model as viewed by UPC codes and programmers. Under UPC, memory is composed of a shared memory space and a private memory space. A number of threads work independently, and each of them can reference any address in the shared space, as well as its own private space. The total number of threads is THREADS, and each thread can identify itself using MYTHREAD, where THREADS and MYTHREAD can be seen as special constants. The shared space, however, is logically divided into portions, each with a special association (affinity) to a given thread. Thus, UPC allows programmers, with proper declarations, to keep shared data that will be processed predominantly by a given thread (and occasionally accessed by others) associated with that thread. Thus, a thread



and the data that has an affinity to it can likely be mapped by the system into the same physical node. This can clearly help in exploiting the inherent data locality in applications.

UPC is an explicit parallel extension of ISO C. Thus, all language features of C are already embodied in UPC. In addition, UPC declarations give the programmer control over how the data structures are distributed across the threads, by arbitrary sized blocks, in the shared space. Among the interesting and complementary UPC features is a work-sharing iteration statement known as upc_forall. This statement can spread independent loop iterations across threads based on the affinity of the data to be processed. This is complemented by the presence of rich private and shared UPC pointers and a pointer casting ability that offer the sophisticated programmer substantial power. In addition, UPC supports dynamic memory allocation in the shared space. On the synchronization and memory consistency control side, UPC offers many powerful options. A fact worth mentioning here is that the memory consistency model can be set at the level of a single object, a statement, a block of statements, or the whole program. Many other rich synchronization concepts are also introduced, including non-blocking barriers that can overlap synchronization with local processing to hide synchronization latencies.

A first UPC example

Fig. 2 shows a short example of a UPC program: a vector addition. In this example, arrays v1, v2 and v1plusv2 are declared as shared integer arrays. By default, they are distributed across the threads by blocks of one element each, in a round-robin fashion. Under UPC, this default distribution can be altered and arbitrary block sizes can be utilized. The UPC work-sharing construct upc_forall explicitly distributes the iterations among the threads. upc_forall has four fields. The first three are similar to those found in the for loop of the C programming language. The fourth field is called "affinity" and in this case it indicates that the thread which has the v1[i] element should execute the ith iteration. This guarantees that iterations are executed without causing any unnecessary remote accesses. The fourth field of a upc_forall can also be an integer expression, e.g. i. In such a case, thread (i mod THREADS) executes the ith iteration.

Fig. 2. A UPC vector addition.
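Fig. 2 itself is not reproduced in this text, so the following is a minimal sketch of the vector addition it describes, assuming the default block size of one; the size N is a placeholder:

    #include <upc_relaxed.h>
    #define N 100

    shared int v1[N], v2[N], v1plusv2[N]; /* cyclic layout: element i on thread i % THREADS */

    int main(void) {
        int i;
        /* each thread executes only the iterations whose v1[i] it owns */
        upc_forall (i = 0; i < N; i++; &v1[i])
            v1plusv2[i] = v1[i] + v2[i];
        return 0;
    }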

Data and distribution

In UPC, an object can be declared as shared or private. A private object has several instances, one at each thread. A scalar shared object declaration results in one instance of such an object, with affinity to thread 0. The following example shows scalar private and shared declarations:

    int x;        // x is private: one x in the private space of each thread
    shared int y; // y is shared: only one y, at thread 0, in the shared space

The first declaration declares x as a private integer. This means that each thread will have its own independent instance of variable x. The second declaration declares y to be a shared integer, which means that only one instance of y will exist, on thread 0, and it will be shared by all threads.

UPC distributes shared arrays by blocks across the different threads. The general declaration is:

    shared [block-size] array [number-of-elements]

When there is no specified blocking factor, UPC automatically assumes a blocking factor of 1. For example, if the user declares an array x as shared int x[12], UPC will use a blocking factor of 1 and distribute the x array in a round-robin fashion, as shown in Fig. 3(a). Alternatively, if this declaration were shared [3] int x[12], UPC would distribute the array with a blocking factor of 3. In this case, x will be distributed across the shared memory space in three-element blocks across the threads, in a round-robin fashion; see Fig. 3(b).

Fig. 3. Data layout in UPC: (a) default blocking factor; (b) user-defined blocking factor.

Pointers in UPC

Pointer declarations in UPC are very similar to those of C. In UPC there are four distinct possibilities: private pointers pointing to the private space, private pointers pointing to the shared space, shared pointers pointing to the shared space, and lastly shared pointers pointing into the private space; see Fig. 4. Consider the following pointer declarations:

    int *p1;               // private to private
    shared int *p2;        // private to shared
    int *shared p3;        // shared to private
    shared int *shared p4; // shared to shared

The first statement declares a private pointer p1, which points to the private space and resides in the private space; p1 is clearly a typical C pointer. p2 is a private pointer that points to the shared space; therefore, each thread has an instance of p2. On the other hand, p4 is a shared pointer pointing to the shared space; thus, it has one instance, with affinity to thread 0. p3, however, is a shared pointer to the private space, and should therefore be avoided. Fig. 5 demonstrates where the pointers of the previous example declarations are located and where they point.

Fig. 4. Pointer table.

Fig. 5. UPC pointer scenarios.


Fig. 6. A shared pointer format.

In order to keep track of shared data, a UPC shared pointer (a pointer to a shared object) is composed of three fields: thread, phase, and virtual address (see Fig. 6). The thread information indicates the thread affinity of the data item, i.e. to which thread the datum has its affinity. Vadd indicates the block address, and phase indicates the location of the data item within that block. UPC allows the casting of one pointer type to another. Casting a shared pointer to a private pointer results in the loss of the thread information. On the other hand, casting a private pointer to a shared pointer is not advised and would produce unknown results. Shared pointers can also have a blocking factor, to traverse blocked arrays as needed.

Dynamic memory allocation

As an extension of ISO C, UPC already supports allocating private memory using the standard C syntax. Shared memory allocation, however, is supported using the UPC extensions. UPC provides three ways to dynamically allocate shared memory: upc_global_alloc(), upc_all_alloc(), and upc_alloc().

The upc_global_alloc() routine allocates memory across all the threads and returns one pointer to the calling thread. The syntax for upc_global_alloc() is as follows:

    shared void *upc_global_alloc(size_t nblocks, size_t nbytes)

upc_global_alloc() is not a collective call. Be aware that, if upc_global_alloc() is called by multiple threads, each thread will allocate the amount of memory requested. If the intent is to give all the threads access to the same block of memory, upc_all_alloc() should be used instead. The upc_all_alloc() routine is called by all the threads collectively, and each thread receives the same pointer to memory. The syntax for upc_all_alloc() is:

    shared void *upc_all_alloc(size_t nblocks, size_t nbytes)

This routine allows the threads easy access to a common block of memory, which can be very useful when performing a collective operation. Finally, the upc_alloc() routine is provided as a way to allocate local shared memory at each thread. upc_alloc() is not a collective function. The syntax for upc_alloc() is:

    shared void *upc_alloc(size_t nbytes)

upc_alloc() returns a pointer to local shared memory. The upc_free() function is used to free dynamically allocated shared memory space. This call is not a collective routine. In the case of a global shared buffer created by upc_all_alloc(), the freeing is only effective when all the threads have completed the upc_free() call.
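For illustration, here is a minimal usage sketch of the collective allocator (ours, not the paper's code; the buffer size and names are placeholders):

    #include <upc_relaxed.h>

    shared int *data; /* private pointer to shared space, one instance per thread */

    int main(void) {
        /* collective call: all threads receive the same pointer to a buffer of
           THREADS blocks of 10 integers, one block with affinity to each thread */
        data = (shared int *) upc_all_alloc(THREADS, 10 * sizeof(int));

        data[MYTHREAD] = MYTHREAD; /* each thread writes to the common buffer */
        upc_barrier;               /* make the writes visible before any reads */
        return 0;
    }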


Memory consistency

The adopted memory consistency model determines the ordering of data accesses in the shared space. Under a strict memory consistency, accesses take place in the same order that would have resulted from sequential execution. Under a relaxed policy, however, the compiler can order such accesses in any way that can optimize the execution time. UPC provides memory consistency control at the single object level, statement block level, and full program level. Either relaxed or strict mode can be selected. In the relaxed mode, the compiler makes the decisions on the ordering of operations; in the strict case, sequential consistency is enforced. Statements including implicit or explicit barriers and fences force all prior accesses to complete. The programmer defines the memory consistency behavior as follows.

At the global level:

    #include <upc_relaxed.h> /* for relaxed consistency */
    #include <upc_strict.h>  /* for strict consistency */

At the statement level:

    #pragma upc strict  /* following statements have strict consistency */
    #pragma upc relaxed /* following statements have relaxed consistency */

At the object level: the keyword strict or relaxed can be used when a variable needs to be treated in a strict or relaxed manner. In the strict case, changes to that variable are visible immediately. For example, a possible declaration could be:

    strict shared int x;

Thread synchronization

Thread synchronization in UPC is facilitated through locks and barriers. Critical sections of code can be executed in a mutually exclusive fashion using locks. Example lock primitives are:

    void upc_lock (upc_lock_t *l)        /* grabs the lock */
    void upc_unlock (upc_lock_t *l)      /* releases the lock */
    int upc_lock_attempt (upc_lock_t *l) /* returns 1 on successful lock and 0 otherwise */

In addition to the previous primitives, locks can also be dynamically allocated and initialized. Another commonly used synchronization mechanism is barrier synchronization. A barrier is a point in the code at which all threads must arrive, before any of them can proceed any further.



Two versions of barrier synchronization are provided. In addition to the commonly used barriers, which simply ensure that all threads have reached a given point before they can proceed, UPC provides a split-phase (non-blocking) barrier, designed to overlap synchronization with computation. The syntax for the regular barrier is:

    upc_barrier; // regular barrier

For the split-phase barrier, the following pair of statements is used:

    upc_notify; // notify others and start local processing
    ...         // local processing
    upc_wait;   // do not proceed until others notify

3. Strategy

In order to evaluate the performance of UPC compilers, we have identified several key points to be studied. Each of these points corresponds to a particular optimization that should be present in any "mature" UPC compiler. For each workload, several implementations were created: a sequential C implementation, a parallel implementation in UPC with no hand optimizations, and several UPC implementations with hand tunings, each of which emulates a potential compiler optimization. The evaluation methodology thus proceeds as follows: for each workload, we compare the performance of the GCC-UPC compiler (with all automatic optimizations enabled) to the performance and scalability of each of the implementations mentioned previously. This examines the presence and the quality of implementation of the various compiler optimizations. The sequential code is also compared to the single-thread UPC performance, to understand and quantify the parallel overhead imposed by the compiler implementation. The UPC parallel code performance is also compared to that of other implementations, such as MPI and SHMEM. This helps to assess, in case some parallel overhead is observed, how the parallel overhead of the UPC implementation compares with other paradigms.

The first optimization to be evaluated is privatization. This quantifies the compiler's ability to recognize local-shared memory accesses and perform them with the same low overhead associated with private accesses. Local-shared memory accesses can be divided into two categories: thread local-shared accesses, and SMP local-shared accesses, in which a thread accesses data that is not local to the thread but is local to the SMP node. The latter is best exploited using run-time systems, while the former can be exploited by compilers. The hand-tuning that emulates privatization is accomplished by casting any local-shared reference to a regular C pointer, and using that C pointer to access the data as if it were private. This is clearly possible, since local-shared data resides in the same physical place as private data.
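The following minimal sketch (ours, not the paper's code) contrasts a raw loop with its privatized counterpart, assuming N is divisible by THREADS and the usual contiguous layout of a thread's local elements:

    #include <upc_relaxed.h>
    #define N 4096

    shared int a[N]; /* block size 1: a[i] has affinity to thread i % THREADS */

    void scale_naive(int k) {
        int i;
        /* raw version: every access pays the shared-pointer arithmetic overhead,
           even though each thread only touches elements it owns */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] *= k;
    }

    void scale_privatized(int k) {
        int i;
        int *pa = (int *) &a[MYTHREAD]; /* cast local-shared to a plain C pointer */
        /* emulated privatization: ordinary private loads and stores */
        for (i = 0; i < N / THREADS; i++)
            pa[i] *= k;
    }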

The second optimization is the aggregation of remote accesses, to amortize the overhead and the associated latencies. This is emulated using the UPC block transfer functions. Due to the UPC thread–memory affinity concept, UPC compilers can recognize the need for remote data at compile time, thus providing a unique opportunity for prefetching. Thus, in addition to using block transfers to mimic aggregation, we also emulate prefetching. What makes these optimizations important in this case is the fact that they can be implemented very easily in UPC. This is because UPC, unlike MPI, OpenMP and even SHMem, is locality conscious. Privatization opportunities can easily be discovered when the thread portion of a shared address matches the local thread number, MYTHREAD; the shared address is then cast to private and used through a private pointer. In the UPC language, this can easily be discovered and integrated at compile time. Likewise, due to the concept of affinity in UPC, it is also possible to recognize when a for loop, for example, is accessing remote data. At that point, the compiler can replace this construct with a UPC block operation, such as upc_memget or upc_memput. In cases where the for loop is doing both access and processing, the loop can be broken into two: the access loop can then be replaced with the block transfer and placed ahead of the processing loop. Other forms of transformation can be considered to accommodate different scenarios; a sketch of the basic loop-splitting transformation follows.
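This sketch is ours, not the paper's code; the array shapes and sizes are hypothetical placeholders:

    #include <upc_relaxed.h>
    #define COLS 512

    /* one row per thread: row t has affinity to thread t */
    shared [COLS] int grid[THREADS][COLS];

    int sum_next_row_naive(void) {
        int j, sum = 0;
        int next = (MYTHREAD + 1) % THREADS;
        /* fine-grain version: one remote shared read per iteration */
        for (j = 0; j < COLS; j++)
            sum += grid[next][j];
        return sum;
    }

    int sum_next_row_aggregated(void) {
        int j, sum = 0;
        int buf[COLS]; /* private landing buffer */
        int next = (MYTHREAD + 1) % THREADS;
        /* emulated aggregation: the access loop becomes one block transfer,
           placed ahead of the (now purely private) processing loop */
        upc_memget(buf, &grid[next][0], COLS * sizeof(int));
        for (j = 0; j < COLS; j++)
            sum += buf[j];
        return sum;
    }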

4. Testbed and applications

4.1. Testbed

The experimental testbed consists of an SGI Origin 2000 running the GCC-UPC v3.2.0.4 compiler from Intrepid [9]. The Origin 2000 has 32 MIPS R10000 processors, each running at 195 MHz, with two processors per node. The total memory available is 16 GB, with a bandwidth of 780 MB/s. The system runs the Irix 6.5 operating system. The GCC-UPC compiler v3.2.0.4 [9] is based on the standard GCC compiler v3.2, with added parsing modules for UPC. Low-level performance monitoring to gain insight into the observed performance is quite important, and in this case was accomplished using the SGI Speedshop software suite [10]. In particular, we used Perfex, a tool included in the Speedshop tool-kit, which is capable of retrieving the values of selected CPU monitoring registers. Altogether, there are up to 32 different monitoring registers, counting CPU events such as the number of CPU cycles spent, the number of loads or stores issued, the number of instructions issued, the number of TLB misses, and many others.

4.2. Applications

In order to analyze the performance of the UPC implementation, we examined the results of a UPC implementation of the STREAM benchmark. We also used the Sobel Edge Detector and N-Queens kernels, and workloads from the NAS Parallel Benchmark (NPB) [11] are also considered. In addition, MPI [12] and SHMem [13] implementations are given for many of the workloads for reference.

The STREAM benchmark is a simple synthetic benchmark that measures sustainable memory bandwidth (in MB/s) for single and block memory copies, as well as the computation rate for simple vector kernels. The UPC STREAM benchmark has been extended from the original STREAM benchmark [14] to cover the UPC shared (local and remote) and private spaces. For these benchmarking experiments, the memory access rates are reported in MB/s.

GUPS [15], or Giga Updates Per Second, is a simple program performing updates to random locations in a large shared array. In the UPC GUPS implementation, each thread first computes its own set of random indices and stores them in a private array called indices. After that, each thread updates (increments) the shared array at the random locations given by indices. A counterpart has been developed using MPI/2, emphasizing the performance of single remote accesses following a random memory access pattern.
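The paper does not include the GUPS source; the following is a minimal sketch of the update phase as described above, with TABLE_SIZE, UPDATES and the C library RNG as placeholder choices:

    #include <upc_relaxed.h>
    #include <stdlib.h>
    #define TABLE_SIZE 4096
    #define UPDATES    1024

    shared long table[TABLE_SIZE]; /* cyclic layout across all threads */

    int main(void) {
        int i;
        int indices[UPDATES]; /* private array of random indices */
        srand(MYTHREAD + 1);  /* per-thread seed */
        for (i = 0; i < UPDATES; i++)
            indices[i] = rand() % TABLE_SIZE;
        /* each update is a fine-grain read-modify-write on a (likely remote)
           shared element; concurrent updates may collide, as in GUPS */
        for (i = 0; i < UPDATES; i++)
            table[indices[i]]++;
        return 0;
    }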

The Sobel edge detection kernel performs edge detection using the well-known Sobel operator. The computational process involves two 3 × 3 masks, or filters, for detecting the vertical and horizontal edges. The west mask is placed on top of the image, with its center over the currently considered image pixel. Each of the underlying image pixels is multiplied by the corresponding mask pixel, the results are added up, and the sum is squared. The same process is applied with the north mask, and the square root of the two squared sums becomes the pixel value in the edge image. This program is parallelized in UPC by partitioning the source image across the threads into equal contiguous chunks of rows, as blocks of a shared array. A upc_forall() is then used to get each thread to work mainly on the data that it has. With such a data distribution, each thread can work independently on all of its block of lines except for the last line (the ghost zone), for which remote accesses to the next thread are performed.

The N-Queens problem computes the number of different solutions to the problem of placing N queens on an N × N chessboard such that no queen can attack another; in other words, no two queens can be present in the same row, column or diagonal of the board. The algorithm uses depth-first search with backtracking. For each row, we try to add a queen by checking the occupied columns and diagonals. When a queen is placed, we do the same for the next row. If no queen can be placed in a row, we go back and move the preceding queen. If N queens have been placed, we have obtained a solution; we save this solution and continue the algorithm. If we try to go back from the first row, it means that we have reached the end. The parallel solution to this problem is very straightforward, because the branches can be distributed across the threads with no thread interaction being needed. A job is described as searching along the subtree that corresponds to a given row-position combination for the first L rows. All threads proceed to perform the sequential search along their own subtrees. No attempt has been made to balance the workload dynamically. The remote accesses associated with this algorithm are only the initial broadcast of the parallel

job description parameters (three integers) and the reduction of the total number of solutions found. Thus, this problem is embarrassingly parallel.

The NAS Parallel Benchmark (NPB) was developed by the Numerical Aerodynamic Simulation (NAS) program at NASA Ames Research Center for the performance evaluation of parallel supercomputers. The NPB mimics the computation and data movement characteristics of large-scale computational fluid dynamics (CFD) applications. The NPB comes in two flavors: NPB 1 and NPB 2. NPB 1 is the original "paper and pencil" benchmark: vendors and others implement the detailed specifications of the NPB 1 report, using algorithms and programming models appropriate for their machines. NPB 2 workloads, on the other hand, are MPI-based source-code implementations written and distributed by NAS; they are intended to be run with little or no tuning. Another implementation of NPB 2 is NPB 2-serial: single-processor (serial) source-code implementations derived from NPB 2 by removing all parallelism [11]. We have therefore used the official NPB 2 from NAS for our MPI performance measurements. NPB 2-serial was used to provide the uniprocessor performance when reporting on the scalability of MPI.

The NPB suite consists of five kernels (EP, MG, FT, CG, IS) and three pseudo-applications (LU, SP, BT). The bulk of the computation is integer arithmetic in IS; the other benchmarks are floating-point-computation intensive. We will be considering only three of them:

SP (Scalar Penta-diagonal) is a simulated computational fluid dynamics (CFD) application that uses an implicit algorithm to solve the three-dimensional compressible Navier–Stokes equations. The finite-difference solution to the problem is based on a Beam–Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar penta-diagonal bands of linear equations that are solved sequentially along each dimension. SP uses coarse-grain communications.

CG (Conjugate Gradient) is a kernel that computes an approximation to the smallest eigenvalue of a symmetric positive definite matrix. It features unstructured grid computations requiring irregular long-range communications.

EP (Embarrassingly Parallel) is a kernel that can run on any number of processors with little communication. It estimates the upper achievable limits for the floating-point performance of a parallel computer. It generates pairs of Gaussian random deviates according to a specific scheme and tabulates the number of pairs in successive annuli.

There are different versions/classes of the NPB, such as Sample, Class A, Class B, Class C and even Class D, introduced in the NPB 2.4 release. These classes differ mainly in the size of the problem. Table 1 gives the problem sizes and performance rates (measured in Mflop/s) for the benchmarks considered, for the Class A and Class B problem sets, on a single-processor Cray Y-MP. Our performance analyses considered not only UPC but also MPI and SHMem implementations.

Table 1
NPB problem sizes

(a) Class A workloads (smaller version)

Benchmark   Size     Operations (×10^9)   Mflop/s (Cray Y-MP, NP = 1)
EP          2^28     26.68                211
CG          14,000   1.508                127
SP          64^3     102                  216

(b) Class B workloads (bigger version)

Benchmark   Size     Operations (×10^9)   Mflop/s (Cray Y-MP, NP = 1)
EP          2^30     100.9                –
CG          75,000   54.89                447
SP          102^3    447.1                627

The Message Passing Interface (MPI) is the most widely used message passing paradigm. In this paradigm, parallel processing is derived from the use of a set of concurrent sequential processes cooperating on the same task. Each process has its own private address space; thus, two-sided communication, in the form of send and receive operations, is needed. This involves a significant overhead in the case of a large number of small messages. The difficulty of programming is another problem with this paradigm.

The SHMem library is another programming paradigm, providing low-level, high-performance, one-sided communication for machines with shared memory support. It optimizes performance at the cost of programmability. In this study, SHMem serves as a reference point for nearly the best performance that a programming paradigm can provide on the architecture used. MPI, on the other hand, serves as another reference point, representing the most widely used parallel programming paradigm. The target here is to show whether UPC can approach the performance of SHMem while enjoying high-level programmability.

5. Experimental measurements and observations

5.1. STREAM benchmark

Table 2 presents the results of the STREAM microbenchmark. The measured memory access rates are reported in MB/s (1,000,000 bytes transferred per second). Attention is focused here on the transfer rates for memcpy() in the private space, compared to those for upc_memget() and upc_memput() between the local-shared and private spaces. This is because both private data and local-shared data are kept in the same physical space; thus, the ability of a compiler to distinguish local-shared accesses from remote-shared accesses is very important. The compiler should be able to access local-shared data in the same way that it accesses private data, to avoid the overhead associated with shared-pointer arithmetic.

Table 2
UPC STREAM measurements

MB/s                        Memcpy   Array copy    Scale
                                     c[i] = a[i]   c[i] = K*a[i]
GCC                         127      106           108
UPC private                 127      106           107
UPC shared (thread local)   139      14            13
UPC shared (SMP local)      130      13            13
UPC shared (remote)         112      12            13

MB/s                        Sum       upc_memget            upc_memput
                            += a[i]   (shared to private)   (private to shared)
GCC                         223       N/A                   N/A
UPC private                 215       N/A                   N/A
UPC shared (thread local)   31        140                   136
UPC shared (SMP local)      30        129                   136
UPC shared (remote)         28        117                   136

Table 2 shows that UPC local-shared accesses are several times slower than private accesses. This indicates that the compiler has substantial room for improving local-shared access speeds if it can recognize that such accesses are indeed

local at compile time. This is very possible under the UPC programming model, which makes thread–memory affinity explicit. For the same reason, programmers use affinity to minimize remote accesses in general; thus, improving the local-shared cases can have a great impact. The compiler run-time system should also extend this benefit beyond thread-local accesses, to include shared accesses within the same symmetric multiprocessor node (referred to here as the SMP-local case). Table 2 shows, however, that UPC remote shared accesses are only a little slower than UPC SMP-node local-shared accesses. This is mostly a property of the NUMA architecture itself, and the compiler may not be able to help in this case, other than by prefetching.

In order to gain a better understanding of the overhead caused by a local-shared memory access, we monitored the STREAM benchmark array-copy test using Perfex for two UPC single-thread applications. The first reads a local-shared array and writes the result into a private table. The second performs a private array read and stores the result into another private table. The results, shown in Fig. 7, demonstrate the overhead introduced by the compiler to complete a single local-shared read. In both cases, the arrays for reads and writes are declared statically and their size is kept constant (100M 32-bit integers); the number of elements to be copied, however, was varied. Fig. 7 shows that the ratio of issued loads and stores between local-shared and private accesses is almost constant. The hardware monitoring register only considers "decoded loads"; it does not take into consideration any prefetching, cache operations or synchronization operations. The "decoded stores" register works in a similar manner, i.e. it is incremented when a store instruction was decoded in the previous cycle. These ratios, therefore, express the amount of additional work associated with a single local-shared memory get operation, as imposed by the compiler. The ratios of issued decoded loads and decoded stores from Fig. 7 for the cases



Fig. 7. Issued memory access instructions for local shared versus private: (a) issued loads; (b) issued stores.

Fig. 8. GUPS performance (MPI/2 and UPC).

of private-to-private against local-shared-to-private accesses are 20 and 6, respectively. This shows that a local-shared read of one data element leads the processor to perform 20 private reads and 6 private writes. This overhead is quite high, since the NUMA architecture already has a global address space that could be leveraged if the compiler improved the way it represents the UPC memory view. This UPC compiler thus still has significant room for improving its performance.

5.2. The GUPS microkernel

Fig. 8 considers the performance of the GUPS microkernel in both UPC and MPI/2. This experiment illustrates the fact that UPC has low overhead associated with its small accesses, as compared with MPI/2. This becomes clearer for a larger number of processors, since a greater number of processors implies a greater probability of remote accesses. Therefore, while the UPC performance continues to improve beyond 16 processors, the performance of the MPI/2 implementation starts to degrade. This shows that this implementation of the UPC compiler does in fact fulfill the promise that short remote accesses are efficient, as compared to the one-sided communication features of MPI/2.

5.3. The N-Queens and Sobel edge problems

As new UPC compilers mature, our interest lies in discovering which potential compiler optimizations are still missing, and in assessing the expected performance benefit

from implementing such optimizations. To do so, we emulate such compiler optimizations and compare the resulting performance with the raw performance of the compiler. An important emulation of effective access to local-shared data is privatization, which makes local-shared objects appear as if they were private. This emulation is implemented by casting a local-shared pointer to a private pointer, and using the private pointer to access the local-shared objects as if they were private. This is made possible by the fact that UPC, like C itself, allows casting, and by the fact that a private pointer in UPC can access not only the private space of a thread but also its local-shared space. Another optimization is the aggregation and prefetching of remote shared accesses, which can be established through block moves using upc_memget and upc_memput. Finally, one can overlap local computations with remote accesses, which can be emulated using split-phase barriers, as in the sketch below. All these emulations can be applied at the application level, as hand-tuning of the UPC applications, until compilers become more sophisticated by incorporating and automating these optimizations. The impact of these optimizations is greatly application dependent.
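A minimal sketch of the split-phase overlap emulation (ours, not the paper's code; the two helper functions are empty placeholders):

    #include <upc_relaxed.h>

    void compute_local_part(void)     { /* ... work that touches no remote data ... */ }
    void consume_shared_results(void) { /* ... work that reads other threads' data ... */ }

    void step(void) {
        upc_notify;               /* signal arrival without blocking */
        compute_local_part();     /* overlap: purely local computation */
        upc_wait;                 /* block until all threads have notified */
        consume_shared_results(); /* remote data is now safe to read */
    }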

As shown in Fig. 9, the N-Queens problem does not get any benefit from such optimizations. This is due to the inherent nature of N-Queens as an embarrassingly parallel application that does not require significant shared data. However, it is important to note that N-Queens, and embarrassingly parallel applications in general that do not make extensive use of shared data, will not be adversely affected by such optimizations.

Fig. 9. Impact of UPC optimizations on performance for N-Queens (N = 14).

Table 3
Sequential execution times for NPB (s)

                CG      EP       SP
F               43.56   191.87   1916.27
CC              53.86   269.49   1723.81
GCC             66.01   304.57   2192.53
MPI (F)         43.02   192.74   1976.92
SHMem (CC)      46.05   269.31   1891.82
UPC-O0 (GCC)    61.2    305.37   33571.71
UPC-O1 (GCC)    58.89   N/A      2550.43
UPC-O3 (GCC)    56.66   N/A      2394.14

Fig. 10. Impact of UPC optimizations on performance for Sobel edge detection: (a) execution time; (b) scalability.

The Sobel edge detection performance is presented in Fig. 10(a). The image size is set to 2048 × 2048. This case illustrates the importance of these optimizations: the execution time is almost nine times faster after privatizing all local-shared memory accesses and prefetching the remote data lines being accessed. Fig. 10(b) shows that the optimized UPC Sobel edge implementation has a nearly linear speed-up. Thus, the incorporation of these optimizations into the compiler can improve the execution time by almost one order of magnitude in some cases.

5.4. The NPB benchmark workloads

All the following experiments on the NPB workloads use Class A. This class has moderate problem sizes, emphasizing both computational and communication aspects. Table 3 demonstrates how variable the uniprocessor execution times can be [16]. This variability was found to be due to the specific language used, C or Fortran for example, the compiler front-end, and the overhead associated with a given machine implementation. Table 3 shows three variant implementations of UPC, denoted O0, O1 and O3. The O0 variant is the plain UPC implementation, without any optimizations. The O1 variant is one in which the space-privatization compiler optimization is emulated. The O3 variant emulates privatization as well as the prefetching/aggregation optimizations [8]. All serial compiler optimizations were enabled for these measurements. For the

three workloads used, the Fortran 77 compiler is the fastest, followed by the C compiler, with GCC the slowest. In the case of EP, this difference, particularly between Fortran and both C versions, is very significant. This large difference for the EP kernel is due to a fast random number generator which uses 64-bit integer logic in the Fortran version; in the case of UPC and SHMem, the built-in random number generator, involving floating-point arithmetic, is used. GCC, however, is roughly 15% slower than the corresponding CC version, the native C compiler on the Origin machine. Since the Origin UPC compiler is built around GCC, while MPI and SHMem use the native SGI MIPS compiler, MPI as well as SHMem always start with a better uniprocessor time than UPC.

Table 3 reveals a number of other important facts. In cases such as EP, where the workload does not require much shared data or many remote accesses, UPC compiler optimizations are not necessary, as they cannot provide much improvement. This is easily seen from the fact that EP performance under GCC and under the single-threaded UPC-O0 is essentially identical. Meanwhile, in the case of SP, O1 shows more than one order of magnitude improvement over O0, because SP requires extensive use of shared data. In the case of CG, only about a 3% improvement is observed, since the shared data is used only during reduction operations and almost all processing is done on private data.

Figs. 11–13 report the performance measurements for the selected NPB workloads in MPI, SHMem and UPC, as well as the impact on performance of each UPC variant, where applicable. In each figure, part (a) illustrates the execution time of all UPC variants, whereas part (b) shows the execution time of the three programming paradigms for the same target workload. The UPC codes used in part (b) are those in which all possible compiler optimizations are emulated. There are only two exceptions. The EP workload, Fig. 12, is presented with the O0 variant, because EP was found not to benefit from any of the optimizations, due to the fact that it makes little use of the shared space. In the case of SP, O0 was at least an order of magnitude worse than the optimized versions; the figure was therefore broken into three parts, in order to zoom in on the performance differences between O1 and O3.

The performance measurements show that UPC can achieve levels of performance similar to MPI and SHMem, should the cited optimizations be implemented. Note that this is the case in spite of the fact that UPC does not enjoy the same low-level

Fig. 11. Parallel performance of CG class A: (a) impact of optimizations; (b) comparison with other paradigms.

Fig. 12. Parallel performance of EP class A.

Fig. 13. Parallel performance of SP class A: (a) impact of optimizations; (b) impact of optimizations, zoom in on O1 and O3; (c) parallel performance.

optimization as the two other implementations. In addition, it was observed that not all applications are likely to benefit from all optimizations. Applications that make substantial use of the shared space can greatly benefit from the O3 optimization, which can lead to an order of magnitude improvement, as was the case with SP. Applications that make modest use of the shared space, such as CG, see only modest improvements, while applications such as EP, with little or no use of the shared space, may not benefit from this optimization at all. The benefits from aggregation and prefetching (O3 over O1) are generally in the range of a few percent to about 25%. In both SP and CG,

it is noticed that this improvement increases as the number of processors increases, because the cost of remote accesses becomes higher for a larger machine size. Expressed as a percentage, however, the benefit from O1 remains constant as we increase the number of processors. This is because both the number of local accesses and the overall execution time are reduced at a similar rate as the number of processors increases.

Finally, whereas the discussed UPC optimizations were emulated at the high-level-language level, which has an associated cost, the MPI and SHMem implementations take advantage of collective



operations libraries that are optimized at a very low level. With those optimizations incorporated, programmers will be able to enjoy both the performance benefits and the ease of use featured in UPC [16].

6. Summary and conclusion

In this paper, several micro-benchmarks as well as application benchmarks were used to study a new UPC language implementation on the SGI NUMA architecture, the Origin 2000. The strategy that we adopted included emulating potential UPC compiler optimizations and comparing with un-optimized versions of the codes, as well as with MPI and SHMem implementations. This evaluation methodology identifies each possible automatic compiler optimization for the target parallel programming paradigm, along with a way of emulating it. Comparing the raw performance against the performance with the emulated optimizations allowed us to uncover whether each optimization was implemented in current compilers, and the potential performance impact of including it or improving the quality of its implementation.

The STREAM benchmark has made it clear that, under UPC, the programmer's awareness of local versus remote shared data can help in improving performance by exploiting data locality. This benchmark has also demonstrated that the UPC compiler should reduce the high overhead of local-shared accesses. This is a potential improvement, as both local-shared and private memories are available on the same node, and the thread–memory affinity can be determined under UPC. Thus, accessing local-shared data can be made as efficient as accessing private data, or at least of similar performance. Considering the low-level counters available under SpeedShop, it was shown that a local-shared access could generate 20 loads and 6 stores under this benchmark. Such a costly overhead should be alleviated via compiler optimizations and/or exploitation of the global address space capabilities of the underlying architecture.

The GUPS micro-kernel has demonstrated that, even with the current implementations, UPC random remote accesses incur less overhead than MPI/2, and for a large number of processors the scalability of UPC is better. The N-Queens application has attested to the efficiency of UPC: because N-Queens is embarrassingly parallel and has little shared memory activity, the UPC N-Queens implementation outperformed the MPI implementation even with no optimizations. Sobel edge detection demonstrated a linear speed-up when all UPC compiler optimizations were emulated.

Finally, the NAS parallel benchmark workloads have shown that, while emulating compiler optimizations can bring significant improvements to UPC performance, the front-end compiler, for example native CC or GCC, is also an important factor in the overall performance. It was further shown that privatization can produce up to one order of magnitude improvement in applications that make extensive use of the shared space, such as SP; such a benefit remains constant as the machine size is scaled up. On the other hand, a more modest improvement from aggregation and prefetching can be achieved on small machine sizes for some applications.

This benefit grows as the machine size grows and the cost of remote accesses becomes more expensive. Finally, by having UPC compiler developers incorporate more automatic optimizations (such as those mentioned here) and low-level optimized collective libraries, UPC will likely fulfill its promise of improved performance and better programmability.

References

[1] D.E. Culler, A. Dusseau, S.C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, K. Yelick, Introduction to Split-C, University of California, Berkeley, 1993.
[2] W.W. Carlson, J.M. Draper, Distributed data access in AC, in: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, Santa Barbara, CA, July 19–21, 1995, pp. 39–47.
[3] E. Brooks, K. Warren, Development and Evaluation of an Efficient Parallel Programming Methodology, Spanning Uniprocessor, Symmetric Shared-memory Multi-processor, and Distributed-memory Massively Parallel Architectures, Poster, SuperComputing'95, San Diego, CA, 1995.
[4] T.A. El-Ghazawi, W.W. Carlson, J.M. Draper, UPC Language Specifications V1.0, http://upc.gwu.edu, February 2001.
[5] T.A. El-Ghazawi, W.W. Carlson, J.M. Draper, UPC Language Specifications V1.1, http://upc.gwu.edu, September 2004.
[6] UPC Consortium, UPC Language Specifications V1.2, http://upc.gwu.edu, June 2005.
[7] T.A. El-Ghazawi, W. Carlson, T. Sterling, K. Yelick, UPC: Distributed Shared Memory Programming, John Wiley and Sons, ISBN 0-471-22048-5, May 2005.
[8] T.A. El-Ghazawi, S. Chauvin, UPC benchmarking issues, in: Proceedings of the International Conference on Parallel Processing, ICPP'01, IEEE CS Press, Valencia, Spain, September 2001.
[9] Intrepid, The GCC UPC Compiler for the SGI Origin Family, http://www.intrepid.com/upc/.
[10] Silicon Graphics Inc., Speedshop User Guide.
[11] D. Bailey, E. Barszcz, J. Barton, The NAS Parallel Benchmarks, RNR Technical Report RNR-94-007, March 1994.
[12] M. Snir, S. Otto, S. Huss, D. Walker, J. Dongarra, MPI: The Complete Reference, MIT Press, Boston, 1996.
[13] K. Hwang, C.L. Wang, Z. Xu, Resource scaling effects on MPP performance: The STAP benchmark implications, IEEE Transactions on Parallel and Distributed Systems 10 (5) (1999).
[14] J.D. McCalpin, Sustainable Memory Bandwidth in Current High Performance Computers, Silicon Graphics Inc.
[15] B.R. Gaeke, K. Yelick, GUPS (Giga-Updates per Second) Benchmark, Berkeley.
[16] T.A. El-Ghazawi, F. Cantonnet, UPC performance and potential: A NPB experimental study, in: SuperComputing 2002 (SC2002), IEEE, Baltimore, MD, USA, 2002.

Tarek A. El-Ghazawi is a Professor in the Department of Electrical and Computer Engineering at The George Washington University, where he also directs the High Performance Computing Laboratory (HPCL). He received his Ph.D. degree in Electrical and Computer Engineering from New Mexico State University in 1988. Tarek El-Ghazawi is one of the principal co-authors of the UPC and the UPC-IO specifications. His research interests include high-performance computing and architectures, reconfigurable computing, parallel I/O, and performance evaluation. He has published over 100 refereed research papers and book chapters in these areas, and his research has been supported by the DoD/DARPA, NASA, the US National Science Foundation (NSF), and industry. He is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), and a member of the Association for Computing Machinery (ACM) and the Phi Kappa Phi National Honor Society.


Smita Annareddy received her M.S. in Computer Engineering in summer 2004, from The George Washington University. She is working in the High Performance Computing Laboratory of The George Washington University as a research assistant. Her research interests are in high-performance computing, with an emphasis on load balancing and benchmarking.

François Cantonnet received his M.S. in Computer Science from the Université de Technologie de Belfort-Montbéliard in 2002. He received his M.S. in Computer Engineering from The George Washington University in 2003. He is a senior research assistant at the High Performance Computing Laboratory of The George Washington University. His current research interests include parallel computing, reconfigurable platforms and hyperspectral imaging. He has published more than 10 research papers in the proceedings of major international conferences, specifications and manuals.

Yiyi Yao is currently a doctoral student in the Department of Electrical and Computer Engineering at The George Washington University. He received his M.S. in Computer Engineering from The George Washington University in 2003. He is a research assistant in the High Performance Computing Laboratory of The George Washington University, and his research interests are high-performance computing and reconfigurable devices and applications.


Ahmed S. Mohamed is a Professor in the Department of Computer Science at the American University in Cairo. He received his B.Sc., M.Sc., and Ph.D. degrees all in computer science from the University of Alberta, Canada, in 1979, 1985, and 1989, respectively. His research areas are high-performance computing, artificial intelligence, biomechanics, modeling, and hardware design. During 2002–2003, Mohamed was a visiting scientist at the HPCL of The George Washington University.


