
IEEE TRANSACTIONS ON COMPUTERS, VOL. 38, NO. 9, SEPTEMBER 1989

Utilizing Multidimensional Loop Parallelism on Large-Scale Parallel Processor Systems

Abstract-Parallel processor systems that have been built so far can execute in parallel only singly nested parallel loops. However, it is crucial to be able to exploit the multidimensional parallelism that occurs in multiply nested parallel loops. Developing schemes for efficiently executing arbitrarily nested loops in parallel will allow us to exploit (and therefore develop) computer systems with hundreds or thousands of processors. In this paper, we discuss issues of program parallelism and processor allocation for parallel processor systems. Optimal processor assignment algorithms are presented for simple and complex nested parallel loops. These processor assignment schemes can be used by the compiler to perform static processor allocation for multiply nested parallel loops. Speedup measurements for EISPACK and IEEE DSP subroutines that result from the optimal assignment of processors to parallel loops are also presented. These measurements indicate that optimal processor assignments result in almost linear speedups on parallel processor machines with a few tens of processors, and significantly high speedups for machines with hundreds or thousands of processors.

Index Terms-Parallel loops, parallel processor system, processor allocation, restructuring compilers, speedup.

I. INTRODUCTION

THE main driving force behind the development of parallel processor systems is the need to exploit parallelism in algorithms and programs, and to solve problems whose timing requirements make them impossible to tackle on conventional systems. The current trend leaves no doubt that the shared-memory parallel processor model is, and will be, one of the dominant architectures for the near-future supercomputers. The flexibility, scalability, and high potential performance offered by parallel processor machines are vital characteristics for any high-performance system. The flexibility of these machines is indeed greater than that of single vector/array processor computers [8], and they can execute a larger class of programs more efficiently. A key problem in designing and using large-scale parallel processor systems is determining how to schedule independent processors to execute a parallel program as efficiently as possible. We know little about coordinating large numbers of processors to execute multiply nested parallel loops, and no solution exists for the general problem [14]. Very efficient solutions, however, are possible for specific cases [1], [9], [11], [13], [15].

This paper discusses optimal static processor allocation schemes for arbitrarily nested parallel loops. A dynamic programming approach can be used by the compiler to decide the number of processors that need to be allocated to each individual loop, such that the parallel execution time of the entire loop construct is minimized. The terms processor allocation and assignment are used interchangeably in this paper. A compile-time algorithm was implemented in the Parafrase restructuring compiler and experiments were conducted using subroutines from the EISPACK and the IEEE/DSP packages [4].

An alternative to static processor allocation is dynamic scheduling of parallel loops. While static allocation minimizes the run-time overhead associated with scheduling, dynamic approaches may involve considerable overhead that can make the parallel execution of small loops impractical. On the other hand, dynamic scheduling offers more flexibility and may actually be more practical on average. If the execution time of loop bodies is not constant, or if the loop bounds are unknown at compile time, then dynamic loop scheduling seems to be the more attractive approach. This is also the case when the exact number of processors allocated to a loop is decided at run time. An efficient dynamic scheduling scheme that minimizes the run-time overhead associated with loop scheduling was proposed in [13] and [11]. Static scheduling may be more efficient, however, in the absence of branches and when the loop iterations and the number of processors are known. In the remainder of this paper, we concentrate only on static processor allocation schemes. An interesting extension of this work would involve the comparison of the performance of dynamic and static scheduling algorithms on realistic loops.

The rest of the paper is organized as follows. Section II gives some background information and defines general concepts. Section III discusses the problem of optimal static processor allocation to parallel loops in detail, gives the formulation of the problem, and presents an optimal algorithm in Section III-B. Experimental results obtained using EISPACK and IEEE/DSP programs are presented in Section IV. Section V gives the conclusions of the paper.

Manuscript received February 11, 1987; revised December 16, 1987. This work was supported in part by the National Science Foundation under Grant US NSF DCR84-10110, the U.S. Department of Energy under Grant US DOE DE-FG02-85ER25001, and an IBM donation. The authors are with the Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801. IEEE Log Number 8928034.

II. BACKGROUND AND BASIC CONCEPTS

A parallel program can be written using a parallel language or it can be the result of automatic restructuring. For our purposes, we use output generated by the Parafrase restructurer [7]. Parafrase receives Fortran programs as input and applies to them a series of architecture-independent and architecture-dependent transformations. The front-end of the compiler consists of a set of architecture-independent transformations. The back-end consists of a series of architecture-dependent optimizations that can be applied to a given program. Depending on the architecture of the machine we intend to use, we choose the appropriate set of passes to perform transformations targeted to the underlying architecture. Parafrase can be used to transform programs for execution on four types of machines: single execution scalar or SES (uniprocessor), single execution array or SEA (array/pipeline), multiple execution scalar or MES (multiprocessor), and multiple execution array or MEA (multiprocessor with vector processors) architectures [8].

In a restructured Fortran program, we observe several types of parallelism, and all of them can be potentially utilized by an MEA machine. We can roughly classify the different types of parallelism into three categories: fine-grain, medium-grain, and coarse-grain parallelism. Fine-grain parallelism includes the parallel execution of different statements of the program on different processors, or even different operations of the same statement on different processors or functional units. Medium-grain parallelism includes tasks of the size of basic blocks or small loops. Coarse-grain parallelism arises from the parallel execution of independent disjoint modules of the program (i.e., subroutines), or from larger parallel loops.

To expose program parallelism, Parafrase builds for each program the data dependence graph [6], which indicates data and control dependences between the statements of the program [5]. This graph is a central data structure for transformations.
During parallel execution of a program, data and control dependences must be observed in order to preserve the semantics of the program. One of the most vital optimizations involves the recognition of several types of parallel loops. In the transformed programs, loops can be of one of the following four types: serial or DO loops, recurrence loops (e.g., vector sum/product), do all or DOALL loops (all iterations of the loop can execute in parallel), and do across or DOACROSS loops (successive iterations can be partially overlapped) [10], [2]. The restructured programs are to be executed on a multiprocessor such that DOALLs and DOACROSSes can be distributed across different processors. Furthermore, the execution of many loops of different kinds may also be overlapped.

Given a Fortran program that executes on a P-processor multiprocessor system, we define the program speedup S_P as S_P = T_1/T_P, where T_1 is the serial execution time of the program and T_P is the execution time when it runs on P processors. The efficiency of the execution of a parallel program that achieves a speedup of S_P on P processors is then defined as E_P = S_P/P, and obviously 0 < E_P <= 1. We assume that any (serial or parallel) DO loop has the following form:

DO I = 1, N
  {B}
ENDO
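As a concrete illustration of the speedup and efficiency definitions above, the following sketch computes S_P and E_P (the timings are made-up example values, not measurements from the paper):

```python
def speedup(t_serial, t_parallel):
    # S_P = T_1 / T_P
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    # E_P = S_P / P, so 0 < E_P <= 1
    return speedup(t_serial, t_parallel) / p

# Example: a loop of 100 unit-time iterations on P = 8 processors
# needs ceil(100/8) = 13 steps under static block scheduling.
t1, t8 = 100.0, 13.0
print(speedup(t1, t8))        # about 7.69
print(efficiency(t1, t8, 8))  # about 0.96
```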

where N is the upper bound of the loop index and B is a set of statements with constant (independent of I) execution time, or another loop. In a nested loop, the nest depth is the maximum number of loops nested in each other. A perfectly nested or one-way nested loop is a loop of the form

DO I1 = 1, N1
  DO I2 = 1, N2
    ...
      DO Im = 1, Nm
        {B}
      ENDO
    ...
  ENDO
ENDO

where m is the nest depth and B is as defined above. To simplify our notation, each loop is assumed to be normalized (i.e., its iteration space is of the form [1, N], N ∈ Z+), and is denoted by the upper bound of its iteration space. Thus, N_j denotes a DO loop whose body is executed N_j times. In a perfectly (or one-way) nested loop, there is exactly one loop at nest level i (i = 1, 2, ..., m). A loop is k-way nested if there exist k disjoint loops at the same nest level. A perfectly nested loop of nest depth m is denoted by L_{1,m} = (N_1, N_2, ..., N_m). For 1 <= i <= j <= m, L_{i,j} = (N_i, N_{i+1}, ..., N_j) is a subset of L_{1,m} that includes loops i through j.

A number of processors P is said to be useful with respect to a parallel loop L_{1,m} if there exists an assignment which can allocate exactly P processors to the components (loops) of L_{1,m}. Let T_P be the parallel execution time of a loop L_{1,m} on P processors. Then P is the maximum number of useful processors for L_{1,m} iff it is the smallest integer such that for any q > P and any processor allocation scheme, T_q = T_P.
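The notion of the maximum number of useful processors can be made concrete for a perfect nest in which every loop is a DOALL, where the parallel time is B · ∏⌈N_i/p_i⌉ (this product form is stated later, in Lemma 1). A small sketch, with unit body time B = 1 as my simplification:

```python
import math

def parallel_time(bounds, procs):
    # T_P = B * prod(ceil(N_i / p_i)), with body time B = 1
    return math.prod(math.ceil(n / p) for n, p in zip(bounds, procs))

# For L_{1,2} = (4, 5), assigning one processor per iteration
# (4 * 5 = 20 processors) already gives the minimum time B;
# any q > 20 cannot improve it, so 20 is the maximum number
# of useful processors for this nest.
bounds = (4, 5)
print(parallel_time(bounds, (4, 5)))  # 1
print(parallel_time(bounds, (2, 5)))  # 2
```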


III. OPTIMAL PROCESSOR ASSIGNMENT TO PARALLEL LOOPS

The processor assignment problem becomes especially important when we deal with nested parallel loops, where inefficient assignment algorithms may result in an execution time far worse than the optimal. In programs with several nested parallel loops, the efficiency may then drop to unacceptable levels. We can informally define the limited processor assignment problem as follows. Given an arbitrary multiply nested loop which contains serial and parallel (DOACROSS, DOALL) loops and a number of processors P, find the (optimal) way of assigning the P processors to the loops so that the parallel execution time of the entire module is minimized. For loops with very few nest levels and systems with a small number of processors, exhaustive search might be affordable at compile time. But as the number of processors increases, the number of processor-loop combinations grows exponentially. Moreover, loops with large nest levels are not very uncommon in scientific computations. As an example, 10-17 deeply nested parallel loops were observed in several subroutines of the restructured (by Parafrase) IEEE Digital Signal Processing package.

POLYCHRONOPOULOS et al.: MULTIDIMENSIONAL LOOP PARALLELISM

In the case of a perfectly nested (one-way) loop, where all loops are DOALLs, processor assignment becomes an easy problem to solve: if N is the product of the loop bounds, then each processor (but possibly the last) is assigned ⌈N/P⌉ iterations of the loop. Then for each loop the range of its index values corresponding to each processor is computed directly. However, in the general case of arbitrarily nested loops that contain combinations of DO, DOALL, and DOACROSS loops, the above method does not work. The algorithm described in this paper solves the general problem optimally in the context of static scheduling. This means that optimality is obtained assuming that the structure of the target loop is to be preserved. The solutions obtained are not necessarily optimal when compared to loops containing branches which are scheduled dynamically (even though they may still be more efficient in certain cases).
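For the all-DOALL perfectly nested case just described, the ⌈N/P⌉ blocking and the recovery of per-loop index values can be sketched as follows (the function names are mine, not the paper's):

```python
import math

def block_ranges(n_total, p):
    # Each processor (except possibly the last ones) receives
    # ceil(N/P) consecutive iterations of the flattened space.
    chunk = math.ceil(n_total / p)
    return [(lo, min(lo + chunk, n_total)) for lo in range(0, n_total, chunk)]

def unflatten(idx, bounds):
    # Recover the per-loop index values from a flattened iteration number.
    coords = []
    for b in reversed(bounds):
        coords.append(idx % b)
        idx //= b
    return tuple(reversed(coords))

bounds = (3, 4)                  # N = 3 * 4 = 12 iterations in total
print(block_ranges(12, 5))       # [(0, 3), (3, 6), (6, 9), (9, 12)]
print(unflatten(7, bounds))      # (1, 3)
```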

A. Optimal Simple Processor Assignment to DOALLs

The techniques described below partition a P-processor machine hierarchically into sets of processors, and assign different sets of processors to different loops in the nest. To simplify our terminology we invariably refer to partitions of any size as "processors." Consider for example two nested DOALL loops L_{1,2} = (N_1, N_2) and a six-processor machine. One possible allocation of the six processors to these loops would be one that assigns two (clusters of) processors to the outer loop and three processors to the inner loop. More precisely, this means that the machine is partitioned into two halves, each half consisting of three physical processors. Then each iteration of the outer loop N_1 is assigned a one-half partition, and the corresponding iterations of the inner loop are allocated three physical processors. Therefore, the term "processor" is used generically in this paper and refers to clusters of physical processors of (possibly) different sizes. In what follows, the number of available processors P is always assumed to be "useful," that is, less than or equal to the maximum number of processors that a loop L can fully utilize.

In this section and in Section III-B, we discuss the case of allocating only factors of P to the individual loops of a nested loop. This will henceforth be called exact allocation. Under this restriction, computing the "optimal" assignment of P is done by determining the optimal factorization of P and then the optimal assignment of these factors to the individual loops. These solutions are optimal assuming that all P processors must be used. However, a better solution may exist for the same loop and for a number of processors q < P. Consider for example the loop

DOALL I = 1, 3
  DOALL J = 1, 3
    DOALL K = 1, 14
      {B}
    ENDOALL
  ENDOALL
ENDOALL

to be executed on 64 processors. The optimum solution will allocate seven processors to the innermost (K) loop, three clusters of seven processors to the middle (J) loop, and three clusters of 21 processors to the outermost (I) loop. The execution time in this case will be twice the execution time of B. If the use of exactly 64 processors were required, then an optimal solution would allocate 16 processors to the K loop, four clusters of 16 processors each to the J loop, and one cluster of 64 processors to the I loop. This gives an execution time of three times B. Notice that this problem is not as artificial as it looks, since for some hierarchical parallel processor organizations the exact factorization may be the simplest way to partition processors between loops.

A metric called the efficiency index is used throughout this section. The usefulness of this metric is twofold. First, it makes it easier to formulate the processor assignment problem; second, it allows us to observe several interesting properties of the problem that are otherwise hidden in modular arithmetic. The optimal processor assignment is guided by the use of a function called the assignment function, which can easily be defined to measure parallel execution time. Processor assignment in an arbitrarily nested parallel loop is performed by allocating (possibly) different numbers of processors to different loops in the nest. For a DOALL with N_i iterations that has been assigned p_i processors, we define E_i^{p_i}, the efficiency index or EI of N_i, as follows:

    E_i^{p_i} = (N_i / p_i) / ⌈N_i / p_i⌉.    (1)

The efficiency index is an indicator of how efficiently a loop is statically scheduled on a given number of processors. The higher the EI, the higher the efficiency (as defined in Section II). Some other properties of the efficiency index that will be used directly or indirectly in the following sections are the following. For any DOALL N_i and any number of processors p we have 0 < E_i^p <= 1. For any N_i, E_i^1 = 1. For N_i >= p, E_i^p > 1/2. It should also be noted that p ≠ q does not necessarily imply E_i^p ≠ E_i^q. It is clear that always p_i >= 1. If during allocation a loop N_i is not assigned processors explicitly, it is implied that p_i = 1 and thus its corresponding E_i^{p_i} = 1.

For a nested DOALL L_{1,m} = (N_1, N_2, ..., N_m), a number of processors P = p_1 p_2 ... p_m, and a particular assignment of P to L, we define the efficiency index vector w = (E_1^{p_1}, E_2^{p_2}, ..., E_m^{p_m}) of L, where E_i^{p_i} is the EI for loop N_i when it is allocated p_i processors. In what follows, the terms "assignment of P" and "decomposition of P" are used interchangeably. For the moment we can assume that any assignment of P to L implicitly defines a decomposition of P into factors P = p_1 p_2 ... p_m, where each of the m different loops receives p_i (i = 1, 2, ..., m) processors. A processor assignment profile (p_1, p_2, ..., p_m) (where loop N_i receives p_i processors) can also be described by its efficiency index vector as defined above. For an assignment of P = p_1 p_2 ... p_m processors to L described by w = (E_1^{p_1}, ..., E_m^{p_m}), we define E_L, the compound efficiency index (CEI) of L, as

    E_L = ∏_{i=1}^{m} E_i^{p_i}.    (2)
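The 64-processor example above can be checked by brute force over all per-loop allocations. The enumeration below is my own sketch, not the paper's assignment algorithm:

```python
import math
from itertools import product

def ei(n, p):
    # Efficiency index E_i^p = (N_i / p) / ceil(N_i / p)
    return (n / p) / math.ceil(n / p)

def time(bounds, procs):
    # Parallel time in units of B: prod(ceil(N_i / p_i))
    return math.prod(math.ceil(n / p) for n, p in zip(bounds, procs))

def best(bounds, p_max, exact=False):
    # Try every allocation (p_1, ..., p_m) whose product is == P
    # (exact allocation) or <= P (any useful number of processors).
    ok = (lambda q: q == p_max) if exact else (lambda q: q <= p_max)
    cands = (pr for pr in product(range(1, p_max + 1), repeat=len(bounds))
             if ok(math.prod(pr)))
    return min(cands, key=lambda pr: time(bounds, pr))

bounds = (3, 3, 14)
print(best(bounds, 64))               # (3, 3, 7): time 2B on 63 processors
print(best(bounds, 64, exact=True))   # time 3B when all 64 must be used
```

This reproduces the text's numbers: allowing q < P processors yields the (3, 3, 7) allocation with time 2B, while insisting on exactly 64 processors can do no better than 3B.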


For any L and P, we also have 0 < E_L <= 1. Let T_1 be the serial execution time of a perfectly nested DOALL L. Next, suppose that L is executed on P processors and let T_P and T'_P denote the parallel execution times for two different assignments w = (ε_1, ε_2, ..., ε_m) and w' = (ε'_1, ε'_2, ..., ε'_m) of P to L, where P = p_1 p_2 ... p_m = p'_1 p'_2 ... p'_m. We can express the parallel execution time T_P of L in terms of its CEI as shown in the following lemma.

Lemma 1: If B is the execution time of one iteration of the loop and N = ∏_{i=1}^{m} N_i is the total number of iterations, then

    T_P = NB / (E_L P).    (3)

Proof: By definition, T_P is given by

    T_P = B ∏_{i=1}^{m} ⌈N_i / p_i⌉ = B ∏_{i=1}^{m} (N_i / p_i) / E_i^{p_i} = NB / (E_L P).

The following lemma is a direct application of (3).

Lemma 2: T_P < T'_P iff E_L > E'_L.

According to Lemma 2, the optimal processor assignment of P to L is the one that maximizes E_L. Each assignment defines a decomposition of P into a number of factors less than or equal to the number of loops in L. As P grows, the number of different decompositions of P into factors grows very rapidly.

Given a perfectly nested loop L and a number of processors P, we call a simple processor assignment one that assigns all P processors to a single loop N_i of L. A complex processor assignment, on the other hand, is one that assigns two or more factors of P to two or more loops of L.

Theorem 3: The optimal simple processor assignment over all simple assignments of P processors to loop L is achieved by assigning P to the loop with ε = max_{1<=i<=m} {E_i^P}.

Proof: Without loss of generality, assume that ε = ε_1. Then ε_i = 1 for i = 2, 3, ..., m and E_L = ε_1. For any other simple allocation of P to N_j, j ≠ 1, with a CEI of E'_L, we have ε_1 = E_L >= E'_L = ε_j. The optimality follows from Lemma 2.

The following two lemmas are used in the subsequent discussion.

Lemma 4:

    ∏_{i=1}^{m} ⌈N_i / p_i⌉ >= ⌈N / P⌉.    (4)

Proof: By definition we have that ∏_{i=1}^{m} ⌈N_i / p_i⌉ >= ∏_{i=1}^{m} N_i / p_i = N / P, and since the left-hand side of (4) is an integer it follows directly that it is at least ⌈N / P⌉.

The next lemma follows directly from Lemma 4.

Lemma 5: For any integer n we have

    ⌈⌈...⌈n / p_1⌉ / p_2⌉ ... / p_m⌉ >= ⌈n / (p_1 p_2 ... p_m)⌉.

For the next lemma and most of what follows, we assume that processors are assigned in units that are equal to products of the prime factors of P, unless explicitly stated otherwise. Therefore, each loop is assigned a divisor of P, including one.

Lemma 6: If N is a (single) DOALL loop, P = p_1 p_2 ... p_m is the number of available processors, and ε_i (i = 1, 2, ..., m) are the efficiency indexes for assigning p_1, p_1 p_2, p_1 p_2 p_3, ..., p_1 p_2 ... p_m processors, respectively, then

    ε_1 >= ε_2 >= ε_3 >= ... >= ε_m.

Proof: Follows from Lemma 5.

As was indicated by the previous lemma, the optimality criterion used in this paper is based on the assumption that the total number of processors P remains fixed, and that at each time all P processors are utilized by a particular loop. However, as will be shown later, in the course of computing the optimal allocation of P processors to the individual loops in a loop nest, we also compute the minimum number of processors q < P for which an equivalent optimal solution is achievable (i.e., one that results in the same execution time). In the next few sections, we show how the efficiency index can be used to direct the efficient assignment of processors to perfectly nested parallel loops.

From number theory we know that each integer is uniquely represented as a product of prime factors. Theorem 7 below can be used to prune (eliminate from consideration) several decompositions of P (i.e., several assignment profiles of P to L that are not close to optimal). From several hand-generated tests we observed that the use of Theorem 7 in a branch-and-bound algorithm for determining the optimal assignment of processors eliminated the majority of all possible assignments. In some instances all but the optimal assignments were pruned by the test of Theorem 7.

Again let L_{1,m} = (N_1, ..., N_m) be a perfectly nested DOALL that executes on P processors, and let P = p_1 p_2 ... p_k be any decomposition of P, where k <= m. Now let ε = max_{1<=i<=m} {E_i^P} be the maximum efficiency index over all simple assignments of P to L, and let ε_i = max_{1<=j<=m} {E_j^{p_i}} (i = 1, 2, ..., k) be the maximum efficiency indexes (over all loops of L) for the factors p_1, p_2, ..., p_k of P, respectively (i.e., E_j^{p_i} = (N_j / p_i) / ⌈N_j / p_i⌉). Note that here we do not perform any actual assignment of processors to loops, but simply compute the maximum efficiency index for each factor p_i of P over all loops of L, excluding the loop that corresponds to ε. If T_s and T_c are the parallel execution times for L corresponding to the optimal simple assignment of P and the optimal complex assignment of the factors of P, respectively, and S_s and S_c their respective speedups, we have the following theorem (using the notation of this paragraph).

Theorem 7: If there exists i ∈ {1, 2, ..., k} for which ε >= ε_i, then T_s <= T_c and thus S_s >= S_c. Equivalently, if one of the factors of P has a maximum efficiency index equal to or less than the maximum efficiency index of P, then there is a simple assignment that is at least as good as any exact assignment of P that uses the above factors.

Proof: Without loss of generality, we can assume that the optimal complex allocation assigns more than one processor to the first k loops (k <= m), and implicitly one processor to the remaining m - k loops. Therefore, the corresponding efficiency index vector for the optimal complex allocation is w_c = (E_1^{p_1}, ..., E_k^{p_k}, 1, ..., 1), and for the optimal simple allocation of P it is w_s = (1, ..., 1, ε, 1, ..., 1), where ε corresponds to the jth position. Then the parallel execution times of the optimal simple and complex allocations are

    T_s = B ⌈N_j / P⌉ ∏_{i≠j} N_i   and   T_c = B ∏_{i=1}^{k} ⌈N_i / p_i⌉ ∏_{i=k+1}^{m} N_i.

Suppose now that for some i ∈ {1, 2, ..., k} we have ε >= ε_i; by definition, ε_i >= E_j^{p_i} for all j ≠ i and for the same p_i.
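Lemma 1's identity and the bound of Lemma 4 can be verified numerically. The sketch below uses arbitrary test values of my own choosing, with body time B = 1:

```python
import math

def ceil_time(bounds, procs):
    # T_P in units of B: prod(ceil(N_i / p_i))
    return math.prod(math.ceil(n / p) for n, p in zip(bounds, procs))

def cei(bounds, procs):
    # Compound efficiency index E_L = prod (N_i/p_i) / ceil(N_i/p_i)
    return math.prod((n / p) / math.ceil(n / p) for n, p in zip(bounds, procs))

bounds, procs = (5, 7, 9), (2, 3, 4)
N, P = math.prod(bounds), math.prod(procs)

# Lemma 1: T_P = N * B / (E_L * P)
assert abs(ceil_time(bounds, procs) - N / (cei(bounds, procs) * P)) < 1e-9

# Lemma 4: prod(ceil(N_i / p_i)) >= ceil(N / P)
assert ceil_time(bounds, procs) >= math.ceil(N / P)
print(ceil_time(bounds, procs), math.ceil(N / P))  # 27 14
```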