Time and Parallel Processor Bounds for Fortran-Like Loops


IEEE TRANSACTIONS ON COMPUTERS, VOL. C-28, NO. 9, SEPTEMBER 1979

UTPAL BANERJEE, SHYH-CHING CHEN, MEMBER, IEEE, DAVID J. KUCK, MEMBER, IEEE, AND ROSS A. TOWLE

Abstract: The main goal of this paper is to show that a large number of processors can be used effectively to speed up simple Fortran-like loops consisting of assignment statements. A practical method is given by which one can check whether or not a statement is dependent upon another. The dependence structure of the whole loop may be of different types. For each type, a set of time and processor upper bounds is given. We also show how a loop can sometimes be transformed to change its dependence structure. Finally, we give a result on the possible splitting up of a given recurrence system into a number of smaller subsystems. These results can be used to modify and sometimes improve the bounds for the loops as demanded by special circumstances.

Index Terms: Analysis of programs, data dependence, Fortran-like loops, parallel computation, processor bounds, program speedup, recurrence relations, time bounds.

Manuscript received March 1, 1975; revised August 1, 1976 and March 1, 1978. This work was supported in part by the National Science Foundation under Grant MCS73-07980. U. Banerjee and D. J. Kuck are with the Department of Computer Science, University of Illinois, Urbana, IL 61801. S. C. Chen was with the Burroughs Corporation, Paoli, PA 19301. He is now with Cray Research Inc., Chippewa Falls, WI 54729. R. A. Towle is with the Burroughs Corporation, Paoli, PA 19301.

I. INTRODUCTION

TWO MAIN forces have led to faster and faster computers over the last 30 years. One has been faster hardware components and circuits. The other has been increased simultaneity of hardware operation through advances in machine organization. Whether or not these two forces continue indefinitely, at any given time machine designers have only a fixed set of circuit types to deal with. So, for the purposes of this paper, we will assume that circuit speeds are fixed, and we will study the problem of designing faster computer organizations subject to this constraint.

Since arithmetic algorithms now operate near their theoretical maximum speeds, we must look beyond arithmetic. For numerical programs, a fruitful option to consider is multioperation central processors (MCPU's) of some kind. Indeed, the fastest computers available for the past 10 years have had some kind of multioperation CPU. Examples are the Burroughs B5500, B6700, and Illiac IV; the Control Data 6600, 7600, and Star; the Cray-1; the Goodyear Aerospace Staran IV; the IBM 360/91 and 370/195; the Texas Instruments ASC; and the Univac 1108 and 1110. This list includes traditional multiprocessors, multifunction arithmetic units, parallel processors, and pipeline processors with and without vector instruction sets: a very wide range of organizations indeed. And yet, if viewed as black boxes, all of these machines may be regarded as having some kind of multioperation CPU.

One of the goals of this paper is to show that much larger numbers of processors than have been used traditionally may in fact be used efficiently in processing ordinary programs. This leads to much greater program speedups. Once we understand how to analyze and transform programs for such multioperation machines, we can discuss how the machine should be organized.

Coupled with the emergence of multioperation CPU's has been a wide range of attempts to include some kind of simultaneity in software. These have ranged from explicit language features to attempts to extract simultaneity from programs written in ordinary languages. The latter approach was first taken some 10 years ago [21], [2]. More recently, comprehensive algorithms have been developed [23] and used to measure the parallelism in a number of ordinary programs [17], [14]. Also, compilers for multioperation machines have been implemented which use a limited class of these algorithms [25], [19], [8]. It has recently become possible to prove bounds on the time and number of processors required to evaluate various parts of programs [15], [5], [3]. A second goal of this paper is to generalize and integrate some of these ideas in an effort to outline some compiler algorithms for machines with large numbers of processors.

In this paper we will give time and processor upper bounds at a program level that is more comprehensive than has been considered before. We discuss entire Fortran DO loops consisting of assignment statements. Since loops usually tend to dominate programs, such bounds should be fairly representative of whole-program bounds. Some of our results are based on rather long proofs which appear elsewhere. But our point is to show that program analysis algorithms exist which are capable of transforming most ordinary programs into a highly parallel form.

What is the purpose of studying such fast computation? Many answers may be given. First, computer speeds have been increasing for 30 years and will probably not stop until 1) hardware speed increases saturate and 2) organizational speed increases saturate. We are attempting to explain how the latter will occur. In [27] and [28] it was shown that real computer arithmetic operation times are close to their theoretical minimum. If one can show how to transform whole programs into a form which can be computed in a time close to the fastest possible time for these programs, and machine organizations are available to execute the resulting programs, little further speedup will be possible through different machine organizations. While we do not discuss lower bounds here at all, it may be seen that a number of our upper bounds on processing time are quite low in absolute terms.



Another motivation for this work is that technology is taking us into a new era, where 8- or 16-bit processors can presently be used as components. In the future, longer word length floating-point processors may be available as components. How should high-speed machine organization respond to the wide availability of LSI? We believe that if hundreds or thousands of microprocessors could be organized into a single CPU, methods of program transformation like those we discuss should be available to aid in machine design and compiler design.

The feasibility or reasonableness of multioperation CPU's can also be challenged on efficiency grounds, since not all of the arithmetic elements are operating at all times. However, if all of the arithmetic elements are regarded as part of one big processor, this is really a question about gate-level efficiency. Some 20 years ago, most computers went from bit-serial to bit-parallel arithmetic operations at a greatly reduced gate-level efficiency. We are suggesting a transition from word-serial to word-parallel processing of ordinary programs with another concomitant reduction in gate-level efficiency. But two important factors should be noted. First, if we have processor-level components, then gate-level efficiency may not be an important measure. Second, in terms of speedup and cost effectiveness, it may be observed [12] that, on the average, the program transformations we are advocating introduce no more inefficiency at the processor level than the transition from bit-serial to bit-parallel addition and multiplication algorithms did at the gate level. Thus, even if we want to measure gate-level efficiency, in many cases higher overall efficiencies may be obtained by using slower, less parallel arithmetic operations together with faster algorithms that perform a number of operations at once.

Finally, we point out that many practical improvements can be made to the ideas discussed in this paper. We are necessarily discussing upper bounds on time and processors using rather general arguments. In practice, most programs will achieve better results than those stated in our bounds, and we illustrate this in some of the examples.

The paper is organized as follows. Section II prepares the ground by explaining the key terms and assumptions. The theorems on the time and processor bounds for loops are given in Section III. Section IV describes a dependence testing method. An overall algorithm for handling a given loop is presented in Section V, and a brief summary in Section VI.

II. BASIC CONCEPTS

Starting with some fundamental definitions, in this section we build the basic framework within which our results will be stated.

By a variable, we mean either a scalar variable or an element of an array. A constant in a given context is either a universal constant (e.g., 2, log2 5, etc.) or a variable whose value is fixed in that context. An atom is either a variable or a constant. An arithmetic expression is a well-formed string consisting of +, -, *, /, (, ), and atoms. (A precise recursive definition can easily be constructed.) A general arithmetic expression with e atoms will be denoted by E<e>. An assignment statement is a program statement of the form x = E, where x is a variable and E is an arithmetic expression. For an assignment statement x = E, the length is the number of atoms in E, the output variable is x, and the input variables are all the variables in E. Such a statement is linear if E is of the form

c + a1*x1 + a2*x2 + ··· + ak*xk,

where c and the a's evaluate to constants, and the x's are variables.

Our main object of study is a Fortran-like DO loop L of the form:

DO 2 I1 = 0, u1
  DO 2 I2 = 0, u2
    ···
      DO 2 Id = 0, ud
        S1(I)
        S2(I)
        ···
        SN(I)
2 CONTINUE

where I (to be called the index vector) stands for (I1, I2, ···, Id), and every Sr represents an assignment statement. We write Sr(I) to denote that Sr is a function of I, and Sr(i) when I takes on a particular value i. (If, for some k, Ik runs through the sequence {ak, ak + bk, ···, ak + uk*bk}, then we can always use a different variable I'k = (Ik - ak)/bk that runs through the sequence {0, 1, ···, uk}. This particular choice of the form of the ranges of I1, I2, ···, Id keeps our formulas simple without any loss of generality.) The loop L is linear if the assignment statements Sr are all linear.

Consider two statements Sr and St, not necessarily distinct. We say that St is dependent on Sr, and write Sr γ St, if there exist two values i and j of I such that

1) the statement Sr(i) is executed before the statement St(j) in the proper serial execution of the loop, and
2) one of the following holds:
   a) the output variable of Sr(i) is an input variable of St(j);
   b) the output variables of Sr(i) and St(j) are the same;
   c) the output variable of St(j) is an input variable of Sr(i).

Thus, there are three kinds of dependence. The defining conditions, the name, and the notation for each kind are shown below:

1) and 2a): data dependence, Sr δ St;
1) and 2b): data output dependence, Sr δ° St;
1) and 2c): data antidependence, Sr δ̄ St.

In order to have a visual description of the dependence structure of the set {S1, S2, ···, SN}, we define a directed graph, the dependence graph (G, say), of the loop L as follows. The nodes of G correspond to the statements S1, S2, ···, SN, and there is an edge in G directed from Sr to St iff Sr γ St.
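These three tests can be applied mechanically by enumerating statement instances in serial execution order. The sketch below does exactly that, in Python rather than Fortran, for a small hypothetical two-statement, single-index loop; the loop, its bounds, and all names here are illustrative assumptions, not taken from the paper.

from itertools import product

# Hypothetical single-index loop (an illustration, not from the paper):
#   S1: A(I+1) = B(I) + 1
#   S2: B(I)   = A(I) * 2       for I = 0, 1, ..., 10
# A statement is an output reference plus input references; each reference
# is a pair (array name, subscript as a function of the loop index I).
ITERS = range(0, 11)
stmts = [
    {"out": ("A", lambda i: i + 1), "ins": [("B", lambda i: i)]},   # S1
    {"out": ("B", lambda i: i),     "ins": [("A", lambda i: i)]},   # S2
]

def before(i, r, j, t):
    """True if instance Sr(i) executes before St(j) in the serial loop."""
    return i < j or (i == j and r < t)

found = set()
for (r, Sr), (t, St) in product(enumerate(stmts), repeat=2):
    out_r, f_r = Sr["out"]
    out_t, f_t = St["out"]
    for i, j in product(ITERS, repeat=2):
        if not before(i, r, j, t):
            continue
        if any(out_r == name and f_r(i) == g(j) for name, g in St["ins"]):
            found.add((r + 1, t + 1, "delta: data dependence"))      # 1) and 2a)
        if out_r == out_t and f_r(i) == f_t(j):
            found.add((r + 1, t + 1, "delta0: output dependence"))   # 1) and 2b)
        if any(out_t == name and f_t(j) == g(i) for name, g in Sr["ins"]):
            found.add((r + 1, t + 1, "delta-bar: antidependence"))   # 1) and 2c)

for r, t, kind in sorted(found):
    print(f"S{r} -> S{t}: {kind}")
# S1 delta S2 (S1 writes A(I+1); S2 reads it one iteration later) and
# S1 delta-bar S2 (S1 reads B(I) before S2 overwrites it in the same iteration).

Enumeration works only for toy loops; the test described in Section IV reasons about the subscript expressions themselves instead.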


A chain in G is a sequence of nodes {Sq1, Sq2, ···, Sqk} such that Sq1 γ Sq2, Sq2 γ Sq3, ···, Sq(k-1) γ Sqk. Such a chain becomes a cycle if, in addition, we have Sqk γ Sq1. The graph G is cyclic if it has at least one cycle. The opposite of cyclic is acyclic. A maximal cycle is a cycle that is not a proper subset of another cycle. An isolated point is a node that does not belong to any cycle. We define a π-block to be either a maximal cycle or the set consisting of an isolated point. The set of all π-blocks is a partition of the set {S1, S2, ···, SN} into pairwise disjoint subsets, and it will be denoted by π.

A partial ordering Γ is defined on the set π as follows. For any two π-blocks πr and πt, we have πr Γ πt if either πr = πt or there is a chain {Sq1, Sq2, ···, Sqk} of statements such that Sq1 is in πr and Sqk is in πt. If πr Γ πt, we say that πt depends on πr. A sequence of π-blocks {π1, π2, ···, πk} forms a chain if π1 Γ π2, π2 Γ π3, ···, π(k-1) Γ πk. Note, however, that two or more distinct π-blocks can never form a cycle. A set of π-blocks forms an antichain if no two members πr, πt of the set exist such that πr Γ πt.

Let h denote the maximum number of elements in any chain of π-blocks. Then all the π-blocks can be arranged in a sequence of h pairwise disjoint rows (1st, 2nd, ···, hth) such that

1) each row is an antichain;
2) a block in the kth row never depends on a block in the tth row, if k < t;
3) each block in the kth row depends on at least one block in the (k - 1)st row, if k > 1.

(This result is a slight variation of the dual form of Dilworth's theorem, and follows directly from a proof of the same (see [20]). Simply stated, the idea is as follows. Start with the initial set of π-blocks. The minimal elements of this set constitute the first row. Subtract this row from the original set. The minimal elements of the remainder form the second row, and so on. This procedure lasts through exactly h steps.)

The distributed loop Lr of a π-block πr is defined to be the loop obtained from L by deleting all of the assignment statements not in πr. The fact that we can arrange the π-blocks in the way described above leads naturally to the following principle.

The Principle of Loop Distribution: The same results will be obtained from a computation if, instead of executing a given loop L serially, we execute in parallel all the distributed loops for the π-blocks in the first row in the first step, then execute in parallel all the distributed loops for the π-blocks in the second row in the next step, and so on; the steps can overlap.

This principle was first introduced by Muraoka [23]. It is the basic tool by which we introduce parallelism in the execution of the loop L. We close this section with our first example.

Example 1: Consider the following loop L:

DO 2 I = 0, 100
S1: A(I) = D(I) ** 2
S2: B(I + 1) = A(I) * C(I + 1)
S3: C(I + 4) = B(I) + A(I + 1)
S4: X(I + 2) = X(I) + X(I + 1)
S5: C(I + 3) = D(I) - 5
2  CONTINUE


It is easily seen that S1 γ S2, S2 γ S3, S3 γ S1, S3 γ S2, S3 γ S5, and S4 γ S4. To be more exact, we have S1 δ S2, S2 δ S3, S3 δ̄ S1, S3 δ S2, S3 δ° S5, and S4 δ S4. The dependence graph G of this loop is shown below.

[Figure: the dependence graph G, with edges S1 → S2, S2 → S3, S3 → S1, S3 → S2, S3 → S5, and a self-loop on S4.]

There is only one isolated point, S5; three cycles, {S4}, {S2, S3}, and {S1, S2, S3}; and two maximal cycles, {S4} and {S1, S2, S3}. There are three π-blocks: π1 = {S1, S2, S3}, π2 = {S4}, and π3 = {S5}. And we have π1 Γ π3. These blocks can be arranged in two rows, where the first row consists of π1 and π2, and the second of π3. The distributed loops L1 of π1 and L2 of π2 should be executed simultaneously in the first step, and then the loop L3 of π3.
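For five statements the π-blocks and rows can be read off by eye; for larger loops they can be computed. Here is a minimal Python sketch (Kosaraju's strongly-connected-components algorithm, with the edge list taken from the figure above) that recovers exactly the blocks and rows just described.

from collections import defaultdict

# Edge list of the dependence graph G above (an edge Sr -> St means Sr gamma St).
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 2), (3, 5), (4, 4)]

succ, pred = defaultdict(list), defaultdict(list)
for a, b in edges:
    succ[a].append(b)
    pred[b].append(a)

# Kosaraju's algorithm: the strongly connected components are exactly the
# pi-blocks (maximal cycles, with isolated points as singleton components).
finish, seen = [], set()
def dfs(v):
    seen.add(v)
    for w in succ[v]:
        if w not in seen:
            dfs(w)
    finish.append(v)

for v in nodes:
    if v not in seen:
        dfs(v)

comp = {}
def assign(v, c):
    comp[v] = c
    for w in pred[v]:
        if w not in comp:
            assign(w, c)

for v in reversed(finish):
    if v not in comp:
        assign(v, v)

blocks = defaultdict(set)
for v, c in comp.items():
    blocks[c].add(v)

# A block's row is one more than the deepest row among the blocks it depends
# on (the Dilworth-style leveling described in Section II).
rows = {}
def row(c):
    if c not in rows:
        preds = {comp[a] for a, b in edges if comp[b] == c and comp[a] != c}
        rows[c] = 1 + max((row(p) for p in preds), default=0)
    return rows[c]

for c in blocks:
    print(sorted(blocks[c]), "-> row", row(c))
# Output: {S1, S2, S3} and {S4} land in row 1, and {S5} in row 2, as above.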

III. TIME AND PROCESSOR BOUNDS FOR L

Here we show how to speed up the execution of a given loop L. We assume that any number of processors may be used simultaneously, and that each processor can add, subtract, multiply, and divide, although all active processors perform the same operation on each step. The smallest interval in which any of these operations can be done by any of the processors is taken to be the unit of time. (Different operation times for the arithmetic operators are explicitly considered in [10].) We will ignore memory activity, data alignment, and control unit times.

Let T[L] denote the number of time units and P[L] the number of processors needed to execute L according to some algorithm. Our aim is to minimize T[L] without putting any restrictions on P[L]. A typical theorem in this section shows that, under certain conditions, an algorithm exists by which L can be computed in time T[L] less than or equal to a certain time bound, and the number of processors P[L] needed for this does not exceed a certain processor bound. These upper bounds are expressed in terms of T[E], P[E], T[R], P[R], and some parameters of L.

T[E<e>] is the time taken to evaluate an arithmetic expression E with e atoms, and P[E<e>] is the corresponding number of processors. Efficient high-speedup results for T[E] and P[E] are summarized in Lemma 1.

Lemma 1:
1) T[E<e>] ≤ 4 log e, with P[E<e>] ≤ 3(e - 1).
2) If d is the depth of parenthesis nesting in E, then T[E] ≤ log e + 2d + 1, with P[E] ≤ ⌈(e - 2d)/2⌉.


3) If

E = a1 + a2 + ··· + ae,  E = b1 * b2 * ··· * be,  or  E = a1*b1 + a2*b2 + ··· + a(e/2)*b(e/2),

then T[E] = ⌈log e⌉, with P[E] = ⌈e/2⌉.

Proof: See [3] and [16]. (Note: Throughout this paper, log x means log2 x.)
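Part 3) is the familiar fan-in tree: pair off the atoms, combine all pairs simultaneously, and repeat. A small Python simulation (with arbitrary atom values) confirms the step count.

import math

# Sum e atoms by pairwise fan-in: each sweep halves the count, so the value
# is ready after ceil(log2 e) sweeps, and no sweep ever needs more than
# ceil(e/2) simultaneous additions (one processor per pair).
def tree_sum(atoms):
    vals, sweeps = list(atoms), 0
    while len(vals) > 1:
        pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            pairs.append(vals[-1])      # the odd atom waits for the next sweep
        vals, sweeps = pairs, sweeps + 1
    return vals[0], sweeps

e = 13
total, sweeps = tree_sum(range(1, e + 1))
assert total == sum(range(1, e + 1))
assert sweeps == math.ceil(math.log2(e))      # T[E] = ceil(log e)
print(f"e = {e}: {sweeps} steps, {(e + 1) // 2} processors suffice")   # P[E] = ceil(e/2)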

T[R<n, m>] is the time and P[R<n, m>] the processor count needed to compute a linear recurrence system R<n, m>, 1 ≤ m ≤ n, according to some algorithm. Such a system is simply a set of linear equations of the following type:

xk = 0  (k ≤ 0)

xk = ck + Σ (t = k-m to k-1) akt*xt  (1 ≤ k ≤ n)

where the c's and the a's are constants, and the x's are variables. The highest speedup results for T[R] and P[R] are given in Lemma 2.

Lemma 2:

T[R<n, m>] ≤ (2 + log m) log n - (1/2)(log² m + log m)

with

P[R<n, m>] ≤ nm²/2 + O(mn), and P[R<n, m>] ≤ n³/68 + O(n²) for m ≤ n.

Proof: See [4], [5], and [24].
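The algorithms behind Lemma 2 are given in [4], [5], and [24] and are not reproduced here. To make the flavor concrete, the following Python sketch handles only the simplest case, R<n, 1>, by recursive doubling: each xk is an affine function of x0, and composing adjacent affine maps pairwise halves the remaining chain at every sweep, giving ⌈log n⌉ parallel sweeps with O(n) processors. The general bandwidth-m algorithms are considerably more involved.

# Recursive doubling for R<n, 1>: x_k = c_k + a_k * x_{k-1}, with x_0 = 0.
# maps[k] is a pair (p, q) representing the affine map x -> p*x + q that
# carries some earlier x_j to x_{k+1}; after each sweep the anchor index j
# moves twice as far back, so ceil(log2 n) sweeps anchor every map at x_0.
# Within a sweep all compositions are independent: n processors, one step.
def solve_r_n1(a, c):
    n = len(c)
    maps = [(a[k], c[k]) for k in range(n)]          # x_{k+1} = a_k*x_k + c_k
    span = 1
    while span < n:
        new = list(maps)
        for k in range(span, n):                     # conceptually in parallel
            p2, q2 = maps[k]                         # x_{k+1-span}  -> x_{k+1}
            p1, q1 = maps[k - span]                  # x_{k+1-2span} -> x_{k+1-span}
            new[k] = (p2 * p1, p2 * q1 + q2)         # compose the affine maps
        maps, span = new, 2 * span
    return [q for _, q in maps]                      # x_0 = 0 leaves the constants

a = [2.0, 0.5, 1.0, 3.0, 1.5, 2.5]
c = [1.0, 2.0, 0.5, 1.0, 2.0, 0.5]
xs = solve_r_n1(a, c)

x = 0.0                                              # serial check
for ak, ck, xk in zip(a, c, xs):
    x = ck + ak * x
    assert abs(x - xk) < 1e-9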

Let us now list the parameters of the loop L which will appear in our theorems. The last entry is not needed for Theorems 1-3, and its definition is postponed.

A List of Parameters for the Loop L

1) M, the total number of iterations of L (M = (u1 + 1)(u2 + 1) ··· (ud + 1));
2) N, the number of statements in L;
3) e, the maximum length of any (assignment) statement in L;
4) c, the maximum number of nodes (of the graph G) in a π-block;
5) h, the maximum number of elements in any chain of π-blocks;
6) z, the total number of maximal cycles in G;
7) Q, the maximum number of iterations over which a data dependence (δ) holds; and
8) y, to be defined in Theorem 4.

Theorem 1: Let the dependence graph G of loop L be acyclic. Then

T[L] ≤ h·T[E]  with  P[L] ≤ (N - h + 1)M·P[E].

Proof: Since there is no cycle, every π-block consists of an isolated point. There are h rows (antichains) of π-blocks. Since each row must have at least one of the N blocks, there cannot be more than (N - h + 1) blocks in any given row. All the M instances of the single statement in a given distributed loop can be executed in parallel. And all the distributed loops for the blocks in a given row can be executed in parallel. Thus, for any row, the maximum time and processor count needed are T[E] and (N - h + 1)M·P[E], respectively. The theorem now follows. Q.E.D.

Corollary 1: Let the loop L be such that no two statements Sr, St can be found with Sr γ St. Then

T[L] ≤ T[E]  with  P[L] ≤ MN·P[E].

Proof: This follows immediately from Theorem 1, if we observe that h = 1.

Theorem 2: Let the loop L be linear and such that its dependence graph G is cyclic. Then

T[L] ≤ T[R<n, m>] + T[E]  with  P[L] ≤ max {P[R<n, m>], MN·P[E]},

where n = m = MN.

Proof: We consider the worst case, namely, when there is one π-block. This means that the loop L cannot be broken up at all. If we write down all the instances of the N statements of L corresponding to its M iterations, then the resulting set of equations can be treated as an R<n, m> system with n = MN. To make the situation the worst possible, we also assume that m = MN. The bounds now follow, if we realize that some or all of the MN statements may need to be preprocessed, and that this preprocessing may be done in parallel in time T[E] with MN·P[E] processors. Q.E.D.

If more parameters are known about a linear loop with a cyclic dependence graph, we can prove the following.

Theorem 3: Let the loop L be linear and such that its dependence graph G is cyclic. Then

T[L] ≤ min {h, z}·T[R<n, m>] + h·T[E]

with

P[L] ≤ max {P1, P2},

where

P1 = (N - h + 1)M·P[E],
P2 = max {(z - h + 1), 1}·P[R<n, m>],
n = cM, and m = cQ + c - 1.

Proof: We show first that the distributed loop for the worst maximal cycle can be treated as an R<n, m> system. Such a loop has c statements. Suppose that the instances of these c statements corresponding to all the M iterations have been written down in the proper order. Then we have a system of n = cM linear equations; for each iteration there is a group of c equations, and there are M such groups.


Take the output variable x1 of a statement in the kth group. In order to compute x1 we might need to know the value of the output variable x2 of an earlier statement. In the worst possible case, x1 is the output variable of the cth statement of the kth group, and x2 is that of the first statement of the (k - Q)th group. Thus, we need never go back more than cQ + (c - 1) equations. Hence, the worst (i.e., the largest) value of m is cQ + c - 1.

Consider now the distributed loop of any π-block. If the block is a maximal cycle, then the time to execute its loop does not exceed T[R] + T[E], where the second term gives the maximum preprocessing time. If the block consists of an isolated point, then this time bound is simply T[E]. There are h rows of π-blocks and z maximal cycles. For each row we first need a time of T[E] units, which may be used to preprocess a maximal cycle, or to execute the distributed loop of an isolated point. The time bound for L now follows from the fact that we never have to compute serially the loops for more than min {h, z} maximal cycles.

Since there are N statements in the loop L and h rows of π-blocks, we will never get more than (N - h + 1) statements if all the π-blocks in any row are combined. The total number of iterations being M, the number of processors needed to preprocess all the instances of all the statements in any row never exceeds P1 = (N - h + 1)M·P[E]. After preprocessing, we need only worry about the distributed loops for those π-blocks that are maximal cycles. In the worst case, all z of these cycles are in one row. However, the time bound for T[L] will remain unchanged if we compute the z loops as follows. If z ≤ h, then compute the loops serially. If z > h, then compute (z - h + 1) loops in parallel and the other (h - 1) loops serially. This way we will never need more than P2 = max {(z - h + 1), 1}·P[R] processors in the second stage of the game. The processor bound is thus max {P1, P2}. This completes our proof. Q.E.D.

In the four previous results of this section, we accepted the loop L in its given form and derived certain bounds based on its properties. Now we will show how one can sometimes get better bounds by transforming L into a loop with a simpler dependence structure.

Suppose that the dependence graph G of L is acyclic and that Sr γ St always means Sr δ St. Every atom on the right-hand side of St that causes the dependence Sr δ St is said to define a δ-arc from Sr to St. There may be more than one δ-arc directed from a given Sr to a given St. A fixed input variable of St can define several such δ-arcs, one for each atom corresponding to that variable. Thus, when we talk about δ-arcs, we are splitting the edges of G and looking at the dependence relationship a little more closely. The meaning of a chain of δ-arcs is clear.


We say that G is tree-connected if there exists at most one chain of δ-arcs between any two of its nodes.

Let us now define statement substitution for acyclic graphs. Take any two statements Sr and St such that the π-block {Sr} is in the first row, the π-block {St} is in the second, and Sr δ St. Let Srk and Stk denote the instances of Sr and St, respectively, corresponding to the kth iteration of the loop (1 ≤ k ≤ M). It will happen at least once that an input variable (x1, say) of some Stq is the same as the output variable (x2, say) of some Srk, where k ≤ q for r < t and k < q otherwise. (Recall the definition of data dependence given in Section II.) Replace x1 by the right-hand-side expression of Srk. When all such replacements that can be made have been made, we have obtained statement substitution for the pair Sr, St. By treating all such pairs Sr, St we achieve statement substitution for the first two rows of π-blocks. Now treat all other pairs of rows in a similar way in the following order: (1st, 3rd), (1st, 4th), ···, (1st, hth), (2nd, 3rd), ···, ((h - 1)st, hth). (h is the number of rows.) When this process ends, we have achieved complete statement substitution for the loop L.

We illustrate statement substitution by using two very simple loops.

Example 2: We see that S1 δ S2 in the loop

DO 2 I = 0, 100
S1: A(I) = B(I) * C(I)
S2: D(I) = A(I) + A(I) * A(I - 1)
2  CONTINUE

In fact, there are three δ-arcs from S1 to S2. Statement substitution transforms this loop into the following:

DO 2 I = 0, 100
S'1: A(I) = B(I) * C(I)
S'2: D(I) = B(I) * C(I) + B(I) * C(I) * B(I - 1) * C(I - 1)
2  CONTINUE

And now S'1 and S'2 are independent of each other. Consider next the loop

DO 2 I = 0, 100
S1: A(2 * I) = B(I) * C(I)
S2: D(I) = A(I) + A(I) * A(I - 1)
2  CONTINUE

As before, we have S1 δ S2 and there are three δ-arcs from S1 to S2. After statement substitution we get a loop that can be broken into the following two independent loops:

DO 2 I = 0, 100, 2
S'1:  A(2 * I) = B(I) * C(I)
S'21: D(I) = B(I/2) * C(I/2) + B(I/2) * C(I/2) * A(I - 1)
2   CONTINUE

and

DO 2 I = 1, 99, 2
S'1:  A(2 * I) = B(I) * C(I)
S'22: D(I) = A(I) + A(I) * B((I - 1)/2) * C((I - 1)/2)
2   CONTINUE

The statements S'21 and S'22 together constitute the statement S'2, the transformed version of S2. And, as before, in each loop the two statements are independent of each other.
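Statement substitution is easy to validate numerically. The Python sketch below (standing in for Fortran; the array contents, including a hypothetical initial array A0 for elements the loop never assigns, are arbitrary assumptions) runs the second original loop serially and then runs the two independent loops, checking that every D(I) agrees.

import random

N = 101                                        # I = 0, 1, ..., 100
B = [random.random() for _ in range(N)]
C = [random.random() for _ in range(N)]
A0 = [random.random() for _ in range(2 * N)]   # assumed initial contents of A

def read(arr, i):                              # out-of-range reads (e.g. A(-1))
    return arr[i] if 0 <= i < len(arr) else 0.0   # are taken to be 0.0 here

# Serial execution of the original loop (S1 then S2 in every iteration):
A, D_serial = A0[:], [0.0] * N
for I in range(N):
    A[2 * I] = B[I] * C[I]                                     # S1
    D_serial[I] = read(A, I) + read(A, I) * read(A, I - 1)     # S2

# The two independent loops obtained by statement substitution.  Odd-indexed
# elements of A are never assigned by S1, so they keep their initial values.
A, D_split = A0[:], [0.0] * N
for I in range(0, 101, 2):      # even I: A(I) was produced at iteration I/2
    A[2 * I] = B[I] * C[I]
    D_split[I] = B[I // 2] * C[I // 2] \
               + B[I // 2] * C[I // 2] * read(A0, I - 1)
for I in range(1, 100, 2):      # odd I: A(I-1) was produced at iteration (I-1)/2
    A[2 * I] = B[I] * C[I]
    D_split[I] = read(A0, I) + read(A0, I) * (B[(I - 1) // 2] * C[(I - 1) // 2])

assert all(abs(x - y) < 1e-12 for x, y in zip(D_serial, D_split))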


Theorem 4: Let the loop L have an acyclic graph G and be such that the only possible dependence relationship between any pair of statements is that of data dependence (δ). Let y denote the maximum number of δ-arcs from any statement. By statement substitution, L can be transformed into a loop L' that can be computed in time T[L'] with P[L'] processors, where

T[L'] ≤ T[E<e'>]  and  P[L'] ≤ MN·P[E<e'>],

with

e' ≤ N·e if L is tree-connected,
e' ≤ (N - h + 2)·y^(h-1)·e otherwise.

Proof: For any given r such that 1 ≤ r ≤ N, let er denote the length of the statement Sr in L; S'r, the statement in L' that corresponds to Sr; and e'r, the length of S'r. Let e' denote the maximum length of an assignment statement in L'. Since no two statements S'r, S't of L' can be found such that S'r γ S't, the theorem follows from Corollary 1 once we show that e' satisfies the stated bounds. The important thing to remember is that if a statement St is at the end of a chain of δ-arcs from a statement Sr, then S't has (er - 1) atoms in e't due to Sr and this chain alone.

Suppose first that L is tree-connected. Then there is at most one chain of δ-arcs from any Sr to any St. Hence, we have

e't ≤ et + Σ (r = 1, r ≠ t, to N) (er - 1) ≤ N·e  (1 ≤ t ≤ N),

so that e' ≤ N·e.

Now assume that L is not tree-connected. Since there is no cycle in G, every π-block consists of an isolated point. There cannot be more than h elements in a chain of π-blocks. Hence, there cannot be more than h statements in a chain of δ-arcs. And at most y chains of δ-arcs can come out of a statement Sr. Thus, for a given statement St, in the worst possible case there are (N - h + 1) statements, each generating y^(h-1) chains to St, one statement generating y^(h-2) chains, one y^(h-3) chains, and so on. This means that

e't ≤ et + [(N - h + 1)·y^(h-1) + y^(h-2) + y^(h-3) + ··· + y](e - 1)
    ≤ e[1 + (N - h + 1)·y^(h-1) + y^(h-2) + ··· + y]
    ≤ e·(N - h + 2)·y^(h-1)  (1 ≤ t ≤ N),

so that e' ≤ (N - h + 2)·y^(h-1)·e. Q.E.D.

So far, we have assumed that an unlimited number of processors is available. This, of course, is never true in practice. We will now show how one can sometimes reduce the number of processors by a technique called loop freezing. Given a loop L, let us consider the following sets of loops for k = 2, 3, ···, d:

DO 2 Ik = 0, uk
  ···
    DO 2 Id = 0, ud
      S1(I)
      ···
      SN(I)
2 CONTINUE

For a fixed k, there are (u1 + 1) ··· (u(k-1) + 1) loops in the set, which we denote by L(k)(I1, ···, I(k-1)), each corresponding to a definite set of values of I1, ···, I(k-1). Let G(k)(I1, ···, I(k-1)) denote the dependence graph of L(k)(I1, ···, I(k-1)). And suppose that for a fixed k, each of the loops L(k)(I1, ···, I(k-1)) can be executed in time T(k)[L] with P(k)[L] processors. The next lemma follows immediately, if we observe that the original loop L can be executed serially with respect to any number of consecutive index variables starting with I1.

Lemma 3: Take any fixed k such that 1 < k ≤ d. Then the loop L can be executed in time

T[L] ≤ (u1 + 1) ··· (u(k-1) + 1)·T(k)[L]

with

P[L] = P(k)[L] processors.

The usefulness of this lemma for some loops lies in the fact that by not doing everything in parallel we are saving on the number of processors, and at the same time we may reduce the total time of execution. The following example illustrates loop freezing and also the application of some of the earlier theorems of this section.

In order that we may effectively compare the time bounds corresponding to different numbers of processors, we introduce a term called the speedup Sp of the p-processor calculation over a uniprocessor. Sp (Sp[L], to be more exact) is defined by

Sp = T1/Tp,

where, for any p, Tp (= Tp[L]) is the time bound for the execution of L using p processors. Clearly, Sp ≥ 1.

Example 3: Consider the following loop L:

DO 2 I1 = 0, 9
  DO 2 I2 = 0, 9
    DO 2 I3 = 0, 9
S1:   A(I1, I2 + 1, I3) = B(I1, I2, I3 + 2) * C(I1, I2) + U * V
S2:   B(I1 + 1, I2, I3) = A(I1, I2, I3 + 6) * D(I2, I3)
2   CONTINUE

The dependence graphs of all the loops L(3) are the same. The dependence graphs of all the loops L(2) are also the same.


The three graphs G, G(2), and G(3) are shown below:


[Figure: the three dependence graphs. G contains the two-statement cycle S1 → S2 → S1; G(2) contains the single edge S1 → S2; G(3) contains no edges.]

We will execute the original loop L in four different ways.

Case 1: Execute L serially with respect to I1, I2, I3, using only one processor. There are four operations for each of the 1000 iterations, so that the time needed is T1[L] = 4000.

Case 2: Execute L serially with respect to I1, I2 and in parallel with respect to I3. Applying Corollary 1 to the loops L(3) with graph G(3), we get T(3)[L] ≤ T[E] and P(3)[L] ≤ 20·P[E], since N = 2 and M = 10. Lemma 1 then yields T(3)[L] ≤ 2 and P(3)[L] ≤ 40, so by Lemma 3, T[L] ≤ 100 · 2 = 200 with P[L] ≤ 40 processors, for a speedup S40 = T1[L]/T40[L] = 20.

Case 3: Execute L serially with respect to I1, and in parallel with respect to I2, I3. Applying Theorem 1 to the loops L(2) with graph G(2), we get T(2)[L] ≤ 2T[E] and P(2)[L] ≤ 100·P[E], since h = 2, e = 4, N = 2, and M = 100. Lemma 1 then yields T(2)[L] ≤ 4 and P(2)[L] ≤ 200. Hence, by Lemma 3, T[L] ≤ 40 with P[L] ≤ 200 processors, and

S200 = T1[L]/T200[L] = 100.

Case 4: Execute L in parallel with respect to I1, I2, I3. Applying Theorem 3 to the loop L with graph G, we get

T[L] ≤ T[R<2000, 197>] + T[E]

and

P[L] ≤ max {2000·P[E], P[R<2000, 197>]},

since h = 1, z = 1, N = 2, M = 1000, c = 2, and Q = 98. (To find Q, consider the relation S2 δ S1 and look at the values (0, 0, 2) and (1, 0, 0) of I.) By Lemmas 1 and 2, we have

T[L] ≤ 76

and

P[L] ≤ P,

where P is approximately 4 · 10^7. The speedup in this case is given by

Sp = T1[L]/Tp[L] ≈ 52.
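The four cases can be tabulated mechanically. The short Python script below hard-codes the parameter values just derived, follows the paper in taking ceilings of the logarithms, and applies Lemma 3's multiplication by the frozen iteration counts.

import math

clog = lambda x: math.ceil(math.log2(x))   # the paper's log, rounded up

T_E, P_E = 2, 2                  # Lemma 1, part 3, with e = 4 atoms per statement
T1 = 4 * 1000                    # Case 1: four operations, 1000 iterations

# Case 2: freeze I1, I2; Corollary 1 on L(3) (M = 10, N = 2), then Lemma 3.
T2, P2 = 100 * T_E, 10 * 2 * P_E
# Case 3: freeze I1; Theorem 1 on L(2) (h = 2, M = 100, N = 2), then Lemma 3.
T3, P3 = 10 * 2 * T_E, (2 - 2 + 1) * 100 * P_E
# Case 4: Theorem 3 on L itself, with n = cM = 2000 and m = cQ + c - 1 = 197.
n, m = 2000, 197
T_R = (2 + clog(m)) * clog(n) - (clog(m) ** 2 + clog(m)) // 2    # Lemma 2
T4, P4 = T_R + T_E, max(2000 * P_E, n * m * m // 2)

for case, (T, P) in enumerate([(T1, 1), (T2, P2), (T3, P3), (T4, P4)], 1):
    print(f"Case {case}: T[L] <= {T:5d}   P[L] <= {P:8d}   speedup ~ {T1 // T}")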

Clearly, we should choose Case 3 for our computation.

IV. TESTING FOR DEPENDENCE

There are a number of results, given in Section III, that provide time and processor bounds for a loop once its dependence graph is known. We now describe a practical method for testing whether one statement depends upon another. In what follows we write q+ = max {q, 0} and q- = max {-q, 0}, where q is any integer.

Theorem 5: Consider any two statements Sr and St of the loop L. Let A(f(I)) be a variable of Sr and A(g(I)) a variable of St, such that at least one of the variables is the output variable of the corresponding statement, A is a one-dimensional array, and

f(I) = a0 + Σ (k = 1 to d) ak·Ik,  g(I) = b0 + Σ (k = 1 to d) bk·Ik,

with integer coefficients.
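Setting f(i) = g(j) yields a linear Diophantine equation in the index values, which has integer solutions only if the gcd of the index coefficients divides the difference of the constant terms. The Python sketch below implements that screening step; it is a common form of such a test rather than the full statement of Theorem 5, whose remaining conditions (where the q+ and q- notation comes into play) further constrain the solutions by the index ranges.

from functools import reduce
from math import gcd

def gcd_screen(a, b):
    """a = [a0, a1, ..., ad] and b = [b0, b1, ..., bd]: subscript coefficients
    of A(f(I)) in Sr and A(g(I)) in St.  Returns False when f(i) = g(j) has
    no integer solution at all, so no dependence can exist through A."""
    g = reduce(gcd, [abs(x) for x in a[1:] + b[1:]], 0)
    diff = b[0] - a[0]
    if g == 0:                       # both subscripts are constants
        return diff == 0
    return diff % g == 0             # True: a dependence is still possible

# A(2*I + 1) written versus A(2*I) read: gcd 2 does not divide 1, never dependent.
print(gcd_screen([1, 2], [0, 2]))    # False
# A(I + 1) versus A(I): the screen cannot rule a dependence out.
print(gcd_screen([1, 1], [0, 1]))    # True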

A related result concerns the splitting of recurrence systems. Consider a linear recurrence system R<n; m1, m2, ···, mk> in which each equation reaches back over the lags m1, m2, ···, mk, with n > m1 > m2 > ··· > mk ≥ 1. If g = gcd(m1, m2, ···, mk), then this system is equivalent to g independent systems R<n1>, R<n2>, ···, R<ng>, where each ni is ⌊n/g⌋ or ⌈n/g⌉.
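The argument behind this splitting is a residue-class one: every lag is a multiple of g, so an equation for xk only ever reaches back to indices congruent to k modulo g. A tiny Python check with made-up sizes:

from functools import reduce
from math import gcd

n, lags = 12, [4, 2]                 # hypothetical system: n = 12, m1 = 4, m2 = 2
g = reduce(gcd, lags)                # g = 2

for r in range(g):
    ks = [k for k in range(1, n + 1) if k % g == r]
    # x_k reaches back only to x_{k-m}, and k - m is congruent to k mod g for
    # every lag m, so each residue class forms a self-contained subsystem
    # with the reduced lags m/g.
    assert all((k - m) % g == r for k in ks for m in lags)
    print(f"class {r}: {len(ks)} equations, reduced lags {[m // g for m in lags]}")
# Two independent systems of 6 equations each replace the original 12.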