Portable, parallelizing Pascal compiler


This compiler produces efficient parallel code for a variety of multiprocessors from the same serial program. Its portability is the result of separating machine-specific details from the parallelization.

Eran Gabber, Amir Averbuch, and Amiram Yehudai, Tel-Aviv University

An effective tool for developing efficient parallel software should not require any more effort than it takes to develop programs for a uniprocessor. It should be able to exploit the power of multiprocessors while hiding the machine-specific details from the programmer. Unfortunately, the current method of programming parallel computers - using explicitly parallel languages - meets neither of these goals. Both the algorithm and the data-distribution and synchronization operations must be specified manually, and the resulting program is often error prone and hard to optimize. Also, the source program must usually be extensively modified for each new parallel computer because architectural details about the machine are explicit in the program.

As parallel computers become more widely used, programmers will need tools that can free them of low-level details, enabling them to write source code only once for many target machines. In response to this growing need, we developed P3C, a fully automatic, portable, parallelizing Pascal compiler for scientific code, which is characterized by loops operating on regular data structures. P3C lets you compile the same source code to all target machines without modification, yet it generates efficient parallel programs for multiple-instruction, multiple-data multiprocessors. Because it is fully automatic, it does not require additional declarations to assist parallelization.

We achieved speedups of up to 24 on a Sequent Symmetry with 25 active processors, and the estimated speedup was within eight percent of the actual figure. Our compiler also won honorable mention in the 1990 Gordon Bell Prize competition.

STRUCTURE


Figure 1 shows the detailed structure of P3C, which consists of the parallelizer and the VMMP.


Parallelizer. The parallelizer translates programs in TAU-Pascal - a Pascal-based language with minor extensions developed at Tel-Aviv University - into explicitly parallel, single-program, multiple-data C code. Its front end translates the serial program into an internal representation.


Figure 1. Structure of P3C. The compiler consists of a parallelizer, which extracts parallelism from TAU-Pascal code and generates C code, and the VMMP, which contains specific details about each machine. The parallelizer consists of a front end, which translates the serial program into an internal representation; a middle part, which detects parallelism and estimates speedup time; and a code generator, which produces explicitly parallel C code. The VMMP provides the runtime services for the generated code on the target machine.

The crux of P3C is its separation into two parts. The upper part, called the parallelizer, performs the machine-independent tasks of extracting parallelism and distributing data; the lower part, called the VMMP (Virtual Machine for Multiprocessors), hides the specific implementation details of the target machine. P3C also gives you an accurate estimate of whether the parallel code will actually reduce execution time over serial code, taking into account the associated overhead. It derives the estimate by statically analyzing the program at compile time, referring to a table of the target machine's parameters.

To port P3C, you simply port the VMMP and generate a parameters table describing the new machine, which the parallelizer uses. In our effort, porting to a new multiprocessor took four staff-months, most of which was devoted to enhancing the performance of key services in the VMMP. We have implemented P3C on two shared-memory multiprocessors, as well as on a distributed-memory network of eight transputers.
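The article describes the parameters table only in general terms; one plausible shape for it is sketched below in C, with operation costs expressed in units of one integer addition and per-service coefficient sets for the overhead formula used later by the estimator. Every field name and value here is an illustrative assumption, not the actual table format.

#include <stdio.h>

/* Hypothetical machine-parameters table consulted by the parallelizer.
   Costs are in units of one integer addition; the four coefficients feed
   the overhead formula alpha + beta*p + (gamma + delta*p)*data_size used
   by the execution-time estimator.  All names and values are invented. */
struct service_cost {
    double alpha, beta, gamma, delta;
};

struct machine_params {
    const char *name;
    int max_processors;
    double fp_add_cost;            /* cost of a floating-point addition */
    double array_index_cost;       /* cost of one array-indexing step   */
    struct service_cost broadcast; /* coefficients for a broadcast      */
    struct service_cost combine;   /* coefficients for a combine        */
};

static const struct machine_params example_params = {
    "Sequent Symmetry", 25,
    2.0, 1.5,
    { 120.0, 8.0, 0.1, 0.02 },
    { 150.0, 10.0, 0.1, 0.03 },
};

int main(void)
{
    printf("%s: a floating-point add costs %.1f integer additions\n",
           example_params.name, example_params.fp_add_cost);
    return 0;
}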

+ Middle part. The middle part of the parallelizer detects parallel loops and estimates execution time, referring to a table of the target machine's parameters, including the cost of synchronization operations on that machine. It is implemented by a 3,600-line Prolog program.

+ Code generator. The code generator translates the internal representation of the program into explicitly parallel C code.


VMMP. The VMMP is a runtime library that implements a common set of services on diverse machines, including shared-memory multiprocessors, distributed-memory (message-passing) multiprocessors, and uniprocessors.2 At present, it supports only C programs. There is a version of the VMMP for each supported target machine. The C code generated by the parallelizer is linked with the VMMP library to produce executable code. Although any VMMP application program will run on every machine supported by the VMMP, its execution efficiency may vary because each machine requires a different implementation of VMMP services.
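One way to picture this portability layer is a fixed service interface with a separate implementation linked in for each target machine. The sketch below assumes that structure; the function names are hypothetical and are not the VMMP's actual service names, and only a trivial uniprocessor implementation is shown.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical portability-layer interface.  The generated SPMD code is
   written against these calls; each target machine links a different
   implementation.  Only the trivial uniprocessor version appears here. */
static void v_barrier(void)
{
    /* single process: nothing to wait for */
}

static void v_broadcast(void *buf, size_t nbytes, int root)
{
    (void)buf; (void)nbytes; (void)root;   /* data is already local */
}

int main(void)
{
    double x = 3.14;
    v_broadcast(&x, sizeof x, 0);          /* no-op on a uniprocessor */
    v_barrier();
    printf("x = %g\n", x);
    return 0;
}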

COMPARISON

P3C differs from existing compilers that generate parallel MIMD code, such as PTran,3 Parafrase-2,4 and ParaScope,5 in a variety of ways:

+ It emphasizes portability. Current compilers extensively analyze control and data dependencies to identify potential areas for parallel execution, translating serial Fortran programs to parallel code. However, these compilers are primarily for shared-memory architectures. Although the trade-off for P3C's emphasis on portability is less extensive parallelization (its dataflow analysis is relatively simple), it still produces efficient code for many application programs.

+ It is fully automatic. In many compilers, such as Fortran D,6 data and process partitioning are only semiautomatic. You must declare the parallel loops and the requested partitioning strategy, which the compilers use to generate the optimized code that will carry out the computation on distributed-memory multiprocessors. Other methods use data-parallel languages like C* to support the data-parallel execution model, in which each data element in an array may compute its value concurrently with the other elements.7 However, although data-parallel languages do not require you to supply low-level details, such as data and process partitioning, you must still rewrite the algorithm in data-parallel form. P3C requires no additional declarations in or modifications to the code.

+ It accurately estimates execution time. Several tools use runtime estimation to enhance code generation. PTran, for example, uses a cost function and an estimate of parallel and serial execution times to determine the most effective compile-time partitioning of the program graph. A recent addition to ParaScope incorporates a runtime estimate to help you choose the most efficient data partitioning. P3C is concerned with estimating exact execution time and takes into account many architecture and compiler details. The ability to estimate execution time accurately is essential to producing efficient parallel code. This estimation helps avoid generating parallel code when the synchronization and data-distribution overhead is larger than what you would save in computation speed. The estimation is also helpful in generating runtime tests to choose between parallel and serial execution of loops with variable bounds.

It is difficult to compare the efficiency of P3C-generated code with other parallelizing compilers because the target machines and applications are very different. The 1990 Gordon Bell Prize winners in compiler parallelization, for example, achieved a speedup of 1,900 on a Connection Machine with 2,048 floating-point processors, using a KAP Fortran-77 to Fortran-90 preprocessor to generate efficient single-instruction, multiple-data vector code from nested loops.8 Code generated by P3C achieved a speedup of 24 on a 25-processor Sequent Symmetry for different application programs.

IMPLEMENTATION

We implemented both the VMMP and the parallelizer on the following multiprocessors:

+ Sequent Symmetry. A 26-processor multiprocessor with 32 Mbytes of shared memory. Each processor has a private cache and no local memory. Only 25 processors were used for running the programs.

+ MOS. An experimental shared-memory multiprocessor with eight DB532 processor cards, each with an NS32532 CPU, an FPU, a 64-Kbyte write-through cache, and 4 Mbytes of local memory. The processors are connected to the shared memory by a VME bus. The multiprocessor runs Mosix distributed Unix, developed at the Hebrew University, Jerusalem.9

+ Transputer network. A distributed-memory multiprocessor containing a network of eight T800 transputers running the Helios distributed operating system. Each node in the network contains a processor, local memory, and four high-speed links to its neighbors.

We also implemented the VMMP on several serial processors, including Sun Microsystems' Sun 3 and Sun 4 workstations, IBM's RT, and National Semiconductor's VME532.

The VMMP is implemented as a subroutine library. The application program is compiled by an existing C compiler for the target machine (which saves effort) and linked with the VMMP library. Because the VMMP exploits the capabilities of specific target hardware to reduce service overhead, users do not have to be concerned with machine-dependent details. For example, most global control structures in the MOS and Sequent implementations are allocated in separate 16-byte areas so that they can reside in separate cache lines. The Sequent implementation uses queue locks to reduce shared-memory contention. For the transputer network, P3C uses a spanning tree for efficient broadcasts and for global synchronization.
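As a concrete illustration of the cache-line separation mentioned above, the sketch below pads each shared control structure to a 16-byte boundary so that two structures never share a cache line. The structure layout, field names, and the GCC alignment attribute are assumptions for illustration, not the VMMP's actual definitions.

#include <stdio.h>

/* Illustrative only: pad each shared control block to a 16-byte cache line
   so that two concurrently updated blocks never share a line.  Assumes a
   4-byte int and a compiler that accepts the GCC alignment attribute. */
#define CACHE_LINE 16

struct control_block {
    volatile int lock;                           /* lock word             */
    volatile int counter;                        /* e.g., barrier counter */
    char pad[CACHE_LINE - 2 * sizeof(int)];      /* fill the line         */
};

static struct control_block blocks[25] __attribute__((aligned(CACHE_LINE)));

int main(void)
{
    blocks[0].counter = 1;                       /* touch one block */
    printf("block size = %zu bytes\n", sizeof(struct control_block));
    return 0;
}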


CODE GENERATION

P3C creates explicitly parallel target code that executes in SPMD mode (also known as same-code, multiple-data mode).1 SPMD is a restricted form of parallel execution in which the same program runs on all processors. The parallel parts of the SPMD program are executed by all processors running the same code but manipulating distinct parts of the common workload. The initialization and termination parts of the program (outside all parallel loops) are executed on a single processor. P3C produces SPMD code because it can be generated from parallel loops relatively simply. Although SPMD code is less efficient than general parallel code, its performance is good for numerical algorithms, which manipulate large data arrays.

P3C creates one process for each physical processor in the multiprocessor. This process operates only on the data mapped to the processor by data partitioning. The process may also access some private variables. P3C reduces parallel-code overhead by replicating the execution of an outer serial loop on all processors and distributing the work of the inner parallel loops among processors. In this way, the outer replicated loop does not require any synchronization or data movement, because all processors execute the same outer-loop code on the same data and compute the same results. You can use P3C in combination with a vectorizing C compiler, because P3C parallelizes outer loops while the vectorizer optimizes inner loops. Synchronization is required only at the end of the parallel loops to exchange data among processors and to combine partial results. Synchronization ensures that the same results are distributed to all processors.

Figure 2 shows the structure of the generated code. Parallel loops are divided among the processors in a way that makes each processor execute approximately the same amount of work. Each processor executes a distinct subset of the loop iterations, which is chosen according to the data-partitioning strategy of the loop variables. The data exchange at the end of the parallel loop is a potential bottleneck. However, in most numerical computations, the time needed to exchange data is small compared with loop-computation time. P3C will not generate parallel code for a loop if the estimated time to execute the parallel code, including data exchange, is larger than the time to execute the serial code. Also, P3C may not scale up to more than 100 processors because of the increased overhead from data exchange. This many processors may also cause an uneven work distribution, which makes the computation less efficient.
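The following sketch illustrates the SPMD pattern just described: every process runs the same code, computes the bounds of its own slice of a parallel loop, works only on that slice, and then synchronizes to combine results. The bound arithmetic and function names are assumptions for illustration; the code P3C actually generates calls VMMP services instead.

#include <stdio.h>

/* SPMD sketch: each of nproc processes, identified by proc_id, executes
   the same function and claims a contiguous block of iterations. */
static void spmd_loop(int proc_id, int nproc, int n, double *v)
{
    /* Block partition of iterations 0..n-1 among nproc processes. */
    int chunk = (n + nproc - 1) / nproc;         /* ceiling division */
    int low   = proc_id * chunk;
    int high  = low + chunk - 1;
    if (high > n - 1) high = n - 1;

    for (int i = low; i <= high; i++)            /* only this process's slice */
        v[i] = 2.0 * v[i];

    /* In generated code, a VMMP synchronization/combine call would follow
       here so that all processes see the updated vector before the next loop. */
}

int main(void)
{
    double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (int p = 0; p < 4; p++)                  /* simulate 4 SPMD processes */
        spmd_loop(p, 4, 8, v);
    printf("%g %g\n", v[0], v[7]);
    return 0;
}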


KEY TASKS


The parallelizer has two main tasks: detect parallel loops and estimate execution time.

Detect parallel loops. P3C's parallelization is based on data partitioning. It attempts to distribute the variables accessed inside the program loops among the processors, so that each processor will have an equal amount of work. Each processor works only on the data allocated to it. The parallelizer inserts special code at the end of the loops to distribute common results. It attempts to parallelize the outermost loops that are both eligible for parallelization (checked by the parallelization algorithm) and reduce total execution time (checked by execution-time estimation). The parallelizer also recognizes a parallel loop nested inside a serial loop. The body of the parallel loop is modified to access the data allocated to the processor. It recognizes mainly loops with independent iterations, but it can also recognize loop-carried dependences caused by data reduction, in which some result is computed by application of an associative and commutative operation, such as addition or maximal value.

The parallelizer implements several data-partitioning strategies. Figure 3 shows strategies for four processors (P0 through P3), which are chosen according to the data-access pattern in the loop. The strategy of a variable inside the replicated loop must be the same for all enclosed parallel loops. Otherwise, the variable's data must be redistributed repeatedly inside the replicated loop, which causes a significant performance penalty when the data is large.

The parallelizer recognizes a parallel loop only if all the variables accessed inside the loop body belong to the following access categories. These categories do not introduce loop dependencies and are eligible for efficient data partitioning. Each access category is implemented in the parallel code by a corresponding data-partitioning strategy:

+ Read only. Read-only variables are used inside the loop but not changed. They are implemented by replication among the processors, as Figure 3a shows.

+ Index. Loop-index variables are changed only by the loop-control statement and not inside the loop body. They are implemented by replication, but each processor executes only a subset of the loop iterations.

+ Data reduction. Figure 3b shows the replication-for-data-reduction strategy. Data-reduction variables are updated inside the loop by an associative and commutative operation like +, *, and, or, min, and max. The value of the data-reduction variable is not used elsewhere inside the loop except in its update operation. For example:

  for i := 1 to N do
    sum := sum + expression;

The final value of a data-reduction variable is computed at the end of all loop iterations by combining the partial values supplied by the participating processors. The operation must also be commutative to allow the processors to combine the partial results in any order to produce the final result. We ignore the effect of rounding errors, which may produce different answers depending on the evaluation order. (Most computations do not produce these errors - especially those in which a result is computed incrementally by a data-reduction operation. In many cases, the result is an integer or Boolean value, which has no rounding errors.)

+ Distributed. Distributed variables are arrays that are changed in the loop, so that each iteration changes a different array element. This array element may be used later in the same iteration. For example:

  for i := 1 to N do
  begin
    v[i] := expression;
    .....
    v[i] {used here}
  end

The current implementation identifies distributed arrays only by their first index, which must be a loop-index variable. Distributed variables are either partitioned by rows, as in Figure 3c, or by interleaved rows, as in Figure 3d, according to the estimated time to execute the loop.

+ Grids. Each iteration of the loop computes a unique element of the grid using a combination of corresponding elements from a similar grid. The newly computed grid is later assigned into the old grid. The grid is recognized only when the first index of all grid accesses in the loop is of the form i ± c, where i is the loop index and c is a constant. Grids are partitioned by rows with overlapped boundaries, as Figure 3e shows. Each processor computes its local slice and exchanges the boundaries with its neighbors at the end of the loop computing the new grid values. The amount of overlap between the processors is max(c1, c2, ..., cn), where c1, c2, ..., cn are the offsets from the loop index used in the grid-element computation.
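A minimal sketch of how the data-reduction category can be parallelized: each processor accumulates a private partial sum over its own iterations, and the partial results are combined at the end with the associative, commutative operation. The combining step is written here as a plain loop for illustration; the generated code would call a VMMP combine service.

#include <stdio.h>

#define N      1000
#define NPROC  4

int main(void)
{
    double partial[NPROC] = {0.0};

    /* Each "processor" p accumulates into its own partial sum (interleaved
       iterations are shown; a block split would work equally well). */
    for (int p = 0; p < NPROC; p++)
        for (int i = p; i < N; i += NPROC)
            partial[p] += (double)i;        /* stands in for "expression" */

    /* Combine partial results; addition is associative and commutative,
       so the combining order does not matter. */
    double sum = 0.0;
    for (int p = 0; p < NPROC; p++)
        sum += partial[p];

    printf("sum = %g (expected %g)\n", sum, (N - 1) * N / 2.0);
    return 0;
}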


Figure 3. P3C's parallelization is based on data partitioning. It chooses from a number of strategies according to the data-access pattern in the loop: (A) replication, (B) replication for data reduction, (C) row distribution, (D) interleaved row distribution, (E) overlapped row distribution, and (F) interleaved column distribution.
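The row-oriented strategies in Figure 3 reduce to simple ownership arithmetic. The sketch below shows two assumed mappings for a distributed array: contiguous row blocks (as in Figure 3c) and interleaved rows (as in Figure 3d). The helper names are illustrative; the generated code embeds equivalent index expressions directly.

#include <stdio.h>

/* Owner of row i under a contiguous block partition of n rows over nproc. */
static int block_owner(int i, int n, int nproc)
{
    int chunk = (n + nproc - 1) / nproc;   /* rows per processor, rounded up */
    return i / chunk;
}

/* Owner of row i under an interleaved (cyclic) partition. */
static int interleaved_owner(int i, int nproc)
{
    return i % nproc;
}

int main(void)
{
    int n = 10, nproc = 4;
    for (int i = 0; i < n; i++)
        printf("row %d: block->P%d  interleaved->P%d\n",
               i, block_owner(i, n, nproc), interleaved_owner(i, nproc));
    return 0;
}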


Estimate execution time. The parallelizer estimates the execution time of all primitive constructs appearing in the serial program, such as arithmetic operations, array indexing, data movement, and various overhead (if, loop test, and so on). Execution time is expressed relative to the time needed for an integer addition. The parameters table also contains the coefficients of the time functions of VMMP services.

Execution time is computed by a bottom-up traversal of the program parse tree. The execution time of a straight code sequence is the sum of the execution times of all primitive operations. The execution time of an if statement is the condition execution time plus the weighted average of the execution times of the then and else parts. The probability of executing the then and else parts is determined by a heuristic function that recognizes certain comparisons involving the loop indices. Other conditions are assumed to produce true at 50-percent probability.

The execution time of a serial loop is the time it takes to execute the loop body multiplied by the number of iterations. You can compute the number of iterations exactly in loops with constant bounds and in a few other cases, such as when the bound is a simple expression containing the index of an enclosing loop with constant bounds. Otherwise we assume that the loop executes for a default number of iterations, in this case 50. This value represents iterative computations with convergence tests.

The execution time of a parallel for loop is the time to execute the loop body multiplied by the average number of iterations executed on a single processor. We add data-distribution and -collection time to get the actual loop time. Loop-synchronization time and the execution time of all other VMMP services are computed by

  α + β × p + (γ + δ × p) × (size of data)

where α, β, γ, and δ characterize the specific operation, p is the number of processors, and data size is in bytes. Time functions have several sets of coefficients for different data-size ranges because the actual execution time resembles a step function. Sudden time changes are typical in message-passing multiprocessors, for example. You can compute the exact parallel-loop overhead using this formula because you know the loop's data size at compile time, and you can accurately measure the formula's coefficients.

We used this formula to model the actual execution time quite closely on our target machines. More complex architectures, like a hierarchical memory or a larger message-passing multiprocessor, may require a more elaborate formula. We also developed a program to automatically produce the time-function coefficients from the measured execution times. The program measures the execution time of VMMP services for different data sizes and computes the time-function coefficients using a variant of the chi-squared fit method.

The estimated speedup is computed by dividing the estimated serial-execution time by the parallel-execution time. The estimated parallel-execution time is of little use on its own, because it is expressed in units of a single integer addition.
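The overhead formula and the speedup estimate translate directly into code. The sketch below evaluates the formula for one hypothetical VMMP service and derives an estimated speedup; the coefficients and loop costs are invented for illustration.

#include <stdio.h>

/* Overhead of one VMMP service: alpha + beta*p + (gamma + delta*p)*size,
   where p is the number of processors and size is the data size in bytes.
   Times are in units of one integer addition; coefficients are invented. */
static double service_time(double alpha, double beta, double gamma,
                           double delta, int p, double size_bytes)
{
    return alpha + beta * p + (gamma + delta * p) * size_bytes;
}

int main(void)
{
    int    p          = 8;
    double iterations = 400.0;
    double body_time  = 50.0;       /* estimated time of one loop body */

    double serial   = iterations * body_time;
    double parallel = (iterations / p) * body_time
                    + service_time(120.0, 8.0, 0.1, 0.02, p, 400 * 8.0);

    printf("estimated speedup = %.2f\n", serial / parallel);
    /* The parallelizer emits parallel code only when parallel < serial. */
    return 0;
}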


CODE EXAMPLES

Figures 4 through 7 give several examples of the code the parallelizer generates for various partitioning strategies. The figures illustrate the style of parallel code generated for a wide range of scientific programs. The serial code fragments are taken from actual input to the parallelizer. The parallel code contains calls to VMMP services.

Data partitioning. Figure 4a shows code taken from a program for solving a system of linear equations Ax = b by Jordan iterations. Each iteration computes a better approximation to the solution vector x by computing

  new_x[i] = (b[i] - Σj≠i a[i,j]·x[j]) / a[i,i]

The nested loop structure is typical of scientific code. The outer repeat loop is inherently serial, while the inner for loops are parallel. The for loop at line 12 is eligible for parallelization according to the parallelization rules. All variables accessed in the loop belong to one of the permitted access categories: i is an index, sum and j are private, a and b are read only, new_x is distributed, and error is a data-reduction variable. The parallelizer generated the parallel C code in Figure 4b. We have omitted initialization, data distribution, and some variable declarations. Because the lower bound of C arrays is zero, we modified the loop bounds to start at zero to simplify array-indexing expressions. The Vcrowd_part routine computes the bounds of the slice allocated to the process and assigns the bounds into the _low and _high variables. The Vcombine routine combines the values of a distributed vector and computes an associative operation on the value stored in the last vector element. The combined vector is returned to the participating processes. In this case, we used the maximum-value function fmax_d to compute the maximum error.

Figure 4. Sample code to illustrate data partitioning: (A) code taken from a program for solving a system of linear equations Ax = b by Jordan iterations and (B) the parallel code generated by P3C.

Overlapped row distribution. Figure 5 shows the serial code fragment and the parallel code generated for this example. The parallelizer partitions grids by overlapping slices, where the overlap size is computed from the grid-update stencil. The processor is assigned the slice containing the elements _low+1 through _high+1. The boundary elements are _low and _high+2.

Figure 5. Sample code to illustrate overlapped row distribution: (A) a serial code fragment and (B) the parallel code generated by P3C.

Interleaved column distribution. The serial code fragment in Figure 6a is taken from Gaussian elimination:

  var a: array[1..N, 1..N] of real;

  for i := 1 to N do
  begin
    { find pivot row - serial loop }
    pivot := i;
    for j := i+1 to N do
      if abs(a[j,i]) > abs(a[pivot,i]) then pivot := j;

    { row exchange }
    for k := i to N do
    begin
      t := a[i,k]; a[i,k] := a[pivot,k]; a[pivot,k] := t
    end;

    { elimination }
    for j := i+1 to N do
    begin
      scale := a[j,i];
      { row operation in a parallel loop }
      for k := i+1 to N do
        a[j,k] := a[j,k] - scale * a[i,k];
    end
  end

P3C recognizes that array a should be allocated by interleaved columns to allow parallel execution: a is changed only by permitted row operations or accessed across a single column, and this column must be the index of an enclosing loop. The pivot-row access can be executed in parallel when the ith column is replicated among the processors. The row-exchange loop and the elimination loop can be parallelized because in each iteration they access a pair of elements from the same column.

In the code in Figure 6b, the interleaved column allocation assigns column numbers _proc_id + j x _crowd_size to processor number _proc_id. The column indexes start at zero, _crowd_size is the number of processors, 0 ≤ _proc_id < _crowd_size, and j ≥ 0. The ith column is broadcast from the processor holding it to the other processors at the beginning of each iteration of the enclosing loop. The pivot search is executed by all processors. The other loops are modified to access only the columns assigned to the local processor. The start1 and start2 expressions compute the index of the first column allocated to the processor that is greater than or equal to i and i+1, respectively. The array-indexing expressions are also modified, because the lower index of C arrays is 0, not 1 as in the serial code.

Figure 6. Sample code to illustrate interleaved column distribution: (A) serial code for Gaussian elimination and (B) the parallel code generated by P3C.
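The start1 and start2 expressions just described amount to a small calculation: given a processor's ID and the interleaved mapping _proc_id + j x _crowd_size, find the first locally owned column that is at least some lower bound. The helper below is an illustrative reconstruction of that calculation, not the parallelizer's actual output.

#include <stdio.h>

/* Columns are dealt out cyclically: processor proc_id owns columns
   proc_id, proc_id + crowd_size, proc_id + 2*crowd_size, ...  (0-based). */
static int first_local_col(int proc_id, int crowd_size, int min_col)
{
    if (min_col <= proc_id)
        return proc_id;
    /* Smallest j with proc_id + j*crowd_size >= min_col. */
    int j = (min_col - proc_id + crowd_size - 1) / crowd_size;
    return proc_id + j * crowd_size;
}

int main(void)
{
    int crowd_size = 4;
    for (int p = 0; p < crowd_size; p++)
        printf("P%d: first column >= 5 is %d\n",
               p, first_local_col(p, crowd_size, 5));
    return 0;
}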


Loops with variable bounds. When the parallel loop bounds are expressions, the parallelizer generates a runtime check to choose between parallel and serial execution. Parallel execution is chosen when the number of iterations is greater than a threshold, which is the minimal number of iterations for which the parallel-code execution time is less than the serial execution time. This threshold is computed at compile time, and it takes into account the data-distribution and -collection overhead. When the parallel code is chosen, it performs data distribution and collection as needed. When the serial code is chosen, the same code is replicated on all processors. Figure 7a shows the serial code before it is translated into the parallel code in Figure 7b.
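A sketch of how such a threshold can be derived from the cost model: find the smallest iteration count for which the estimated parallel time, including distribution and collection overhead, beats the estimated serial time. The cost expressions here are placeholders standing in for the compiler's detailed estimates.

#include <stdio.h>

/* Placeholder cost model, in units of one integer addition. */
static double serial_time(int n)   { return 50.0 * n; }
static double parallel_time(int n, int p)
{
    double overhead = 120.0 + 8.0 * p + 0.1 * 8.0 * n;  /* distribute + collect */
    return 50.0 * ((double)n / p) + overhead;
}

/* Smallest n for which parallel execution is estimated to be faster. */
static int loop_threshold(int p, int max_n)
{
    for (int n = 1; n <= max_n; n++)
        if (parallel_time(n, p) < serial_time(n))
            return n;
    return max_n + 1;   /* never profitable within the searched range */
}

int main(void)
{
    printf("threshold for 8 processors: %d iterations\n",
           loop_threshold(8, 10000));
    return 0;
}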

PERFORMANCE MEASUREMENTS

Tables 1 through 3 contain the performance results of the following application programs, which we ran on all three target machines.

Tables 1, 2, and 3 report, for each program and processor count, the elapsed time (in seconds), speedup, efficiency (in percent), and estimated speedup on the Sequent Symmetry, the MOS multiprocessor, and the transputer network, respectively.

The applications are taken from several important domains. They are mainly algorithm kernels of several hundred lines. In all the programs except Mandel, processors exchange data after each iteration. All floating-point computations are double precision. Data size is limited by the machine with the smallest memory, the transputer network.

+ CG. Solve a system of 400 linear equations in 400 unknowns with a dense random-coefficient matrix (not positive definite) using the conjugate-gradient method.

+ Fluid. Compute a small-scale fluid flow by a cellular automaton on a 200 x 200 hexagonal grid for 300 time steps.

+ Invert. Invert a 250 x 250 matrix using Gauss-Jordan reduction with partial pivoting. This algorithm is similar to Gaussian elimination and LU factorization.

(A) Serial code:

  for i := low to high do
    body

(B) Parallel code:

  if ((high - low) > threshold) {
    /* parallel code */
  } else {
    /* serial code */
    for (i = low; i <= high; i++)
      body;
  }

Figure 7. Sample code to illustrate loops with variable bounds: (A) serial code and (B) the parallel code generated by P3C.


+ Laplace. Solve the Laplace partial differential equation on a 200 x 200 grid by simultaneous overrelaxation with Chebyshev acceleration and red-black point ordering.

+ Linear. Solve a system of 400 linear equations in 400 unknowns by Jordan iterations using an even/odd update order of unknowns to accelerate convergence.

+ Mandel. Compute the Mandelbrot set on a 150 x 150 grid with the lowest left point at [-2,-1.25] and the highest right point at [0.5,1.25]. This calculation is balanced by interleaved allocation of grid rows to the processors.


+ Wave. Calculate a two-dimensional surface wave through a deflector with periodic boundary conditions using a 192 x 192 grid for 500 time steps. The domain is mapped on the unit square. This algorithm and data were used by John Gustafson, Gary Montry, and Robert Benner, the 1988 Gordon Bell Prize winners.10

The tables contain the following information for each application program: elapsed time, speedup, efficiency, and estimated speedup. Elapsed time is in seconds and includes all data-distribution and -collection overhead. To compute speedup, we divided the serial execution time by the parallel execution time. We obtained the serial execution time by running a serial program, which we got by running the parallelizer without the middle part. This serial C program is similar to a handwritten C program implementing the same algorithm, because the serial program is the direct translation of an efficient serial algorithm written in Pascal. We computed efficiency as

  (speedup / number of processors) x 100

Both the serial programs and the explicitly parallel programs were compiled by the local optimizing C compiler. The tables show that the automatically generated code achieved an efficiency of 75 to 95 percent for all machines. Moreover, the speedup estimation was within 2.0 to 6.5 percent of the measured speedup on MOS and the transputer network. It was within 8.5 percent on the Sequent because of the larger number of processors. These results indicate that P3C generates high-quality code for a wide range of practical scientific computations, some of which are not trivial to parallelize manually.

The compilation speed of the application programs was also acceptable. For example, the compilation of the 180-line CG program took 22.3 seconds on a Sun 4/370 SPARCserver. We expect the compilation speed to improve significantly when we replace the current Prolog interpreter with a Prolog compiler.
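For concreteness, the speedup and efficiency definitions above translate directly into code; the timings below are invented to show the arithmetic, not measured values.

#include <stdio.h>

int main(void)
{
    /* Invented timings, just to show the arithmetic. */
    double serial_sec   = 100.0;
    double parallel_sec = 5.0;
    int    processors   = 25;

    double speedup    = serial_sec / parallel_sec;       /* 20.0   */
    double efficiency = speedup / processors * 100.0;    /* 80.0 % */

    printf("speedup %.2f, efficiency %.1f%%\n", speedup, efficiency);
    return 0;
}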

Our compiler uses relatively simple parallelization techniques to produce highly efficient code for both shared-memory and distributed-memory MIMD multiprocessors. The speedup estimation computed at compile time was close to the actual speedup.

P3C is not limited to Pascal. You can apply most of its algorithms to languages like Fortran, since parallel-loop detection and runtime estimation are essentially language independent. You can also convert P3C itself to accept Fortran (instead of TAU-Pascal) by changing the parallelizer's front end.

P3C's results support our belief that automatic tools will eventually replace most of the hand coding of parallel programs by generating efficient and portable parallel code from serial programs.

ACKNOWLEDGMENTS

We thank the referees for their helpful comments. This work was supported in part by the Basic Research Foundation, administered by the Israel Academy of Sciences.

REFERENCES

1. F. Darema et al., "A Single-Program, Multiple-Data Computational Model for EPEX/Fortran," Parallel Computing, Apr. 1988, pp. 11-24.
2. E. Gabber, "VMMP: A Practical Tool for the Development of Portable and Efficient Programs for Multiprocessors," IEEE Trans. Parallel and Distributed Systems, July 1990, pp. 304-317.
3. F. Allen et al., "An Overview of the PTRAN Analysis System for Multiprocessing," Proc. Int'l Conf. on Supercomputing, Lecture Notes in Computer Science, No. 297, Springer-Verlag, New York, 1987, pp. 194-211.
4. C. Polychronopoulos et al., "Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors," Proc. Int'l Conf. Parallel Processing, Vol. II, E. Plachy and P. Kogge, eds., Pennsylvania State University Press, University Park, Penn., 1989, pp. 39-48.
5. K. Kennedy, K. McKinley, and C.-W. Tseng, "Interactive Parallel Programming Using the ParaScope Editor," IEEE Trans. Parallel and Distributed Systems, July 1991, pp. 329-341.
6. S. Hiranandani, K. Kennedy, and C.-W. Tseng, "Compiling Fortran D for MIMD Distributed-Memory Machines," Comm. ACM, Aug. 1992, pp. 66-80.
7. M. Quinn and P. Hatcher, "Data-Parallel Programming on Multicomputers," IEEE Software, Sept. 1990, pp. 69-76.
8. J. Dongarra et al., "Special Report: 1990 Gordon Bell Prize Winners," IEEE Software, May 1991, pp. 92-97, 102.
9. A. Barak and A. Litman, "MOS: A Multicomputer Distributed Operating System," Software - Practice and Experience, Aug. 1985, pp. 725-737.
10. J. Gustafson, G. Montry, and R. Benner, "Development of Parallel Methods for a 1,024-Processor Hypercube," SIAM J. Scientific and Statistical Computing, July 1988, pp. 609-638.

Eran Gabber is chief programmer of the parallel computing laboratory of the computer science department at Tel-Aviv University. His main research interests are software tools for parallel computers, computer architecture, operating systems, and software reuse. Gabber received a BSc in mathematics and an MSc in computer sciences from Tel-Aviv University, and has completed work on a PhD in computer sciences at TAU. He is a member of the IEEE Computer Society, ACM, and the Israel Data-Processing Association.

Amir Averbuch is a senior lecturer in the computer science department at Tel-Aviv University. His research interests are parallel processing (software and algorithms) and signal and image processing. Averbuch received a BSc and an MSc in mathematics from Hebrew University, Jerusalem, and a PhD in computer science from Columbia University. He is a member of the IEEE and the Society for Industrial and Applied Mathematics.

Amiram Yehudai is an associate professor of computer science at Tel-Aviv University. His research interests are software engineering and programming languages. Yehudai received a BSc in electrical engineering from the Technion, Israel Institute of Technology, and an MS and a PhD in computer science from the University of California, Berkeley. He is a member of the IEEE, the IEEE Computer Society, and ACM.


Address questions about this article to Gabber at CS Dept., School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv 69978 Israel; Internet eran@math.tau.ac.il.
