Parallel Gaussian elimination on an MIMD computer




Parallel Computing 6 (1988) 275-296, North-Holland


Parallel Gaussian elimination on an MIMD computer *

M. COSNARD, M. MARRAKCHI and Y. ROBERT
C.N.R.S. - Laboratoire TIM3-INPG, F-38031 Grenoble Cedex, France

D. TRYSTRAM
Ecole Centrale Paris, F-92293 Chatenay-Malabry Cedex, France

Received February 1987

Abstract. This paper introduces a graph-theoretic approach to analyse the performances of several parallel Gaussian-like triangularization algorithms on an MIMD computer. We show that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDMt, LDLt, Doolittle and Cholesky algorithms, can be classified into four task graph models. We derive new complexity results and compare the asymptotic performances of these parallel versions.

Keywords. Parallel algorithms, task graphs, linear algebra, Gaussian elimination, computational complexity.

1. Introduction

The well-known Gaussian elimination method is the most commonly used algorithm to solve linear systems of equations on sequential computers. Six different versions for a vector pipeline machine have been considered in [3]. Three implementations where data are accessed columnwise have been discussed in detail, each of them corresponding to a given permutation of the loop indices i, j, k of the sequential algorithm: namely the SAXPY version (form kji), the GAXPY version (form jki) and the DOT version (form ijk). In this paper, we deal with the design of MIMD versions of these algorithms (see [5,6,9,10] for a classification of parallel computers). A parallel MIMD version of the Gaussian elimination algorithm with partial pivoting has been discussed in [11,13], and an implementation of the LDLt decomposition algorithm in [12]. The performance analysis is based on the task graph model presented in [11]. Informally, algorithms are split into elementary tasks, whose execution ordering is directed by precedence constraints. The task graph model, which can be constructed directly from these precedence constraints, is the basic tool of our theoretical analysis.

Together with MIMD versions of the algorithms of [3], we analyse several Gaussian-like elimination algorithms: parallel versions of the LDMt, LDLt, Doolittle and Cholesky algorithms. In Section 3, we present sequential versions of the algorithms, define the tasks together with their precedence constraints, and construct the associated task graph or precedence graph. We remark that these graphs can be classified into four categories, which we name greedy, 2-steps, double greedy and double 2-steps graphs.

* This work has been supported by the Centre National de la Recherche Scientifique through the GRECO C3.

0167-8191/88/$3.50 © 1988, Elsevier Science Publishers B.V. (North-Holland)
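The kji, jki and ijk variants differ only in the nesting of the three loops, yet access the matrix in very different patterns. The following NumPy sketch (ours, not taken from [3]; pivoting omitted, as the paper also sets it aside) shows the three orderings computing the same in-place LU factorization of a dense matrix:

```python
import numpy as np

def lu_kji(A):
    """SAXPY (kji) form: at step k, scale column k by the pivot, then
    update each trailing column j > k with one SAXPY per column."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]              # multipliers: column k of L
        for j in range(k + 1, n):
            for i in range(k + 1, n):      # SAXPY update of column j
                A[i, j] -= A[i, k] * A[k, j]
    return A

def lu_jki(A):
    """GAXPY (jki) form: column j receives all updates from the
    previous columns k < j, then is scaled by its pivot."""
    A = A.copy()
    n = A.shape[0]
    for j in range(n):
        for k in range(j):
            for i in range(k + 1, n):
                A[i, j] -= A[i, k] * A[k, j]
        A[j+1:, j] /= A[j, j]
    return A

def lu_ijk(A):
    """DOT (ijk) form (Doolittle): row i is computed from dot products
    with the rows already processed."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n):
        for j in range(i):                 # entries of L in row i
            for k in range(j):
                A[i, j] -= A[i, k] * A[k, j]
            A[i, j] /= A[j, j]
        for j in range(i, n):              # entries of U in row i
            for k in range(i):
                A[i, j] -= A[i, k] * A[k, j]
    return A
```

All three overwrite A with the strictly lower part of the unit lower triangular factor L and with the upper triangular factor U, so their outputs agree up to rounding.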


Section 4 is devoted to the theoretical analysis of these graphs. Assuming that the number of processors is proportional to the problem size, we design parallel algorithms, assigning tasks to processors according to the precedence constraints. We then establish complexity results. In Section 5, we compute the execution times of the parallel algorithms corresponding to the ten sequential algorithms, and we compare their performances. A modified version of the KJI-SAXPY of [3] appears to be the most suitable for parallelization.

2. Restriction to the model

Similarly to [12], we state that:
(i) Precise details of the underlying architecture are unimportant, but we assume a system which is capable of supporting multiple instruction streams executing independently and in parallel on multiple data streams [6,9,10,17]. We assume that there are means to synchronize the solution process, i.e. to enforce the temporal precedence constraints which are imposed by the nature of the implemented algorithms.
(ii) Throughout the paper, p denotes the number of processors, and Ep is the efficiency of the parallel algorithm considered [4,8,16,17].
(iii) The cost analysis of the derived algorithms is based on the following assumptions:
- each processor can perform any of the four arithmetic operations in a unit of time,
- there are no memory conflicts or data communication delays.
To be more realistic, we add the following hypotheses when triangularizing a dense n × n matrix:
(iv) Elementary tasks should be of length O(n), so that synchronization and data communication do not predominate over arithmetic. Hence we do not allow tasks of constant length (independent of n), contrary to [12].
(v) The number of processors is limited to O(n). It is shown in [15] that processor communication costs overcome arithmetic when O(n²) processors are used. More precisely, we set p = αn, with 0 < α < 1.

Most often, pivoting of rows or columns (or both) is used for stability reasons. However, we shall not consider here the overhead due to pivoting. Rather, we concentrate on evaluating the amount of arithmetic operations which are required by the different algorithms, and we neglect the time needed for data accessing, retrieving and exchanging. Moreover, we shall assume in the following that we can access the elements of the matrix A by columns, to be more realistic and close to a FORTRAN programming style environment. The two allowed transfer operations are then loading and storing a column.
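Under hypothesis (iii), running times and efficiencies reduce to operation counts. As a small illustration (the function names are ours, not the paper's; the step-by-step count is the standard one for LU without pivoting), the sequential cost T_1 and the efficiency Ep = T_1/(p·Tp) can be computed as:

```python
def seq_ops(n):
    """Arithmetic operation count of sequential Gaussian elimination on a
    dense n x n matrix, one time unit per operation (hypothesis (iii)):
    step k performs n-k divisions, then (n-k)^2 multiplications and
    (n-k)^2 subtractions on the trailing submatrix."""
    return sum((n - k) + 2 * (n - k) ** 2 for k in range(1, n))

def efficiency(t_seq, p, t_par):
    """E_p = T_1 / (p * T_p): the fraction of the p processors' time
    spent on useful arithmetic."""
    return t_seq / (p * t_par)
```

For large n, seq_ops(n) behaves like 2n³/3, the familiar leading term; any parallel time Tp plugged into efficiency() here is an assumption of the caller, since the paper derives the actual Tp values algorithm by algorithm.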
The duplication of a column has zero cost, i.e. we assume that it is possible to transfer the same data simultaneously to various processors. In this case, no processor can modify these data. Conversely, a processor can modify a datum if it is the only one to possess it. Remark that the preceding constraints and the associated mechanisms lead to an unambiguous execution of the algorithms.

Once the constraints are defined, the first step in the parallelization of a method is the definition of the elementary tasks and their precedence graph. This graph shows the temporal dependency of the operations of the algorithm. This warrants the absence of conflicts, the temporal integrity of the data and a perfect correspondence between the method and its parallel version. The tasks are then assigned to the available processors according to the precedence graph. Throughout the paper, the relation T [...] tasks T_ik (n − k arithmetic operations, 1 ≤ [...] k) and U_ik (for i > k) can be processed in b·t_k units, and S_{k+1,k+1} can be processed in a·t_k time units, where a and b are integers,
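The assignment rule just described, start a task as soon as all its predecessors in the precedence graph have completed and a processor is free, can be sketched as a greedy list scheduler. The code below is a toy model of ours with a hypothetical four-task graph, not one of the paper's task graphs (those are defined in Section 3):

```python
import heapq

def list_schedule(durations, preds, p):
    """Greedy list scheduling on p identical processors: whenever a
    processor is idle, start any task whose predecessors are finished.
    durations: {task: time}; preds: {task: set of predecessor tasks}.
    Returns (start times, makespan)."""
    indeg = {t: len(preds.get(t, ())) for t in durations}
    succs = {}
    for t, ps in preds.items():
        for q in ps:
            succs.setdefault(q, []).append(t)
    ready = sorted(t for t in durations if indeg[t] == 0)
    busy = []                       # heap of (finish_time, task)
    now, schedule = 0, {}
    while ready or busy:
        while ready and len(busy) < p:          # fill idle processors
            task = ready.pop(0)
            schedule[task] = now
            heapq.heappush(busy, (now + durations[task], task))
        now, done = heapq.heappop(busy)         # advance to next completion
        for s in succs.get(done, []):           # release successors
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return schedule, now
```

With two processors and the toy graph where C depends on A and B, and D depends on A, tasks A and B start at time 0 and C and D start as soon as A completes, which is exactly the behaviour the precedence-graph mechanism guarantees.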


[Figure omitted: the double 2-steps task graph, whose nodes are the tasks S_kk, T_kj and U_ik.]

Fig. 13. Double 2-steps graph.

and t_k = k for all k, or t_k = n − k for all k. The precedence constraints are [...]

