Parallel Computing 27 (2001) 963–976
www.elsevier.com/locate/parco
Parallel preconditioning of a sparse eigensolver

Luca Bergamaschi a,*, Giorgio Pini a, Flavio Sartoretto b

a Dipartimento di Metodi e Modelli Matematici per le Scienze Applicate, Università di Padova, Via Belzoni 7, 35131 Padova, Italy
b Dipartimento di Informatica, Università di Venezia, Via Torino 155, 30171 Mestre VE, Italy

Received 3 July 2000; received in revised form 3 October 2000; accepted 12 October 2000
Abstract

We exploit an optimization method, called deflation-accelerated conjugate gradient (DACG), which sequentially computes the smallest eigenpairs of a symmetric, positive definite, generalized eigenproblem by conjugate gradient (CG) minimizations of the Rayleigh quotient over deflated subspaces. We analyze the effectiveness of the AINV and FSAI approximate inverse preconditioners in accelerating DACG for the solution of finite element and finite difference eigenproblems. Deflation is accomplished via CGS and MGS orthogonalization strategies, whose accuracy and efficiency are tested. Numerical tests on a Cray T3E Supercomputer were performed, showing the high degree of parallelism attainable by the code. We found that for our DACG algorithm, AINV and FSAI are both effective preconditioners. They are more efficient than Block-Jacobi. © 2001 Published by Elsevier Science B.V.

Keywords: Generalized eigenproblems; Sparse approximate inverses; Parallel algorithms
1. Introduction

An important task in many scientific applications is the computation of a small number of the leftmost eigenpairs (the smallest eigenvalues and corresponding eigenvectors) of the problem Ax = λBx, where A and B are large, sparse, symmetric positive definite matrices. Several techniques for solving this problem have been proposed: subspace iteration [1,15], the Lanczos method [7,11,14], and, more recently,
* Corresponding author.
E-mail addresses: [email protected] (L. Bergamaschi), [email protected] (G. Pini), [email protected] (F. Sartoretto).
the restarted Arnoldi–Lanczos algorithm [12], the Jacobi–Davidson method [17], and optimization methods based on conjugate gradient (CG) schemes [3,9,16]. In this paper we analyze the performance of two preconditioning techniques when applied to an optimization method called deflation-accelerated conjugate gradient (DACG) [8]. DACG sequentially computes a number of eigenpairs by CG minimizations of the Rayleigh quotient over subspaces of decreasing size. When effectively preconditioned, we found [4] that the efficiency of DACG compares well with that of established packages, like ARPACK [13]. In a recent work [5], the performance of DACG has also been numerically compared with that of the Jacobi–Davidson method, showing that their efficiency is comparable when a small number of eigenpairs is to be computed.

We exploit three preconditioners: Block-Jacobi, FSAI [10], and AINV [2], the latter two falling into the class of approximate inverse preconditioners. FSAI and AINV explicitly compute an approximation, say M, to A^{-1}, based on a sparse factorization of A^{-1}. Preconditioning by a product of triangular factors performs better than other techniques, mainly because the fill-in of the preconditioner is reduced. Unlike many other approximate inverse techniques, AINV and FSAI preserve the positive definiteness of the problem, which is essential in our application. The FSAI algorithm requires an a priori sparsity pattern of the approximate factor, which is not easy to provide for unstructured sparse problems. We generated the FSAI preconditioner using the same pattern as the matrix A. On the other hand, AINV is based upon a drop tolerance, ε, which in principle is more convenient for our unstructured problems. The influence of the drop tolerance has been tested. Deflation is accomplished via B-orthogonalization of the search directions. We analyzed both classical (CGS) and modified (MGS) Gram–Schmidt, and tested the accuracy and efficiency of both strategies.
We have exploited the parallel Block-Jacobi-DACG, AINV-DACG and FSAI-DACG algorithms in the solution of finite element, mixed finite element, and finite difference eigenproblems, both in two and three dimensions. A parallel implementation of the DACG algorithm has been coded^1 via a data-parallel approach, allowing preconditioning by any given approximate inverse. Ad hoc data-distribution techniques reduce the amount of communication among the processors, which could otherwise spoil the parallel performance of the ensuing code. An efficient routine for performing matrix-vector products was designed and implemented. Numerical tests on a Cray T3E Supercomputer show the high degree of parallelism attainable by the code, and its good scalability.
2. AINV and FSAI preconditioners

Let A be a symmetric positive definite N × N matrix.

^1 A release of our parallel DACG code is available, as of 29 September 2000, at URL http://www.dmsa.unipd.it/sartoret/Pdacg/pdacg.htm.
The approximate inverse preconditioner AINV, which was developed in [2] for linear systems, relies upon the following ideas. One can evaluate A^{-1} by a biconjugation process applied to an arbitrary set of linearly independent vectors. A convenient choice is the canonical basis e_1, ..., e_N. This process produces a unit upper triangular matrix Z̃ and a diagonal matrix D̃ such that A^{-1} = Z̃ D̃^{-1} Z̃^t. Actually, even with sparse A, the factor Z̃ is usually dense. The AINV preconditioner is based on the idea of building a sparse approximation to Z̃, by reducing the amount of fill-in occurring in the computation. This is obtained by annihilating either all the elements outside selected positions in Z̃, or those whose absolute value is smaller than a drop tolerance ε. We preferred the latter strategy, since in principle it is better suited to unstructured sparse matrices. If the process is completed successfully, which is guaranteed, for example, when A is an H-matrix (see [2]), a unit upper triangular matrix Z and a diagonal matrix D with positive diagonal entries are obtained. Thus, a sparse approximate inverse of A, M = Z D^{-1} Z^t = Z̄ Z̄^t, is produced, where Z̄ = Z D^{-1/2}.

Let A = L L^t be the Cholesky factorization of A; the FSAI preconditioner [10] is computed by minimizing the Frobenius norm ||I - GL|| over a set of lower triangular matrices G displaying a prescribed pattern S. The minimization problem amounts to computing a lower triangular matrix Ĝ which solves the matrix equation

(Ĝ A)_{ij} = δ_{ij},   (i, j) ∈ S.

Here S is a prescribed lower triangular sparsity pattern, including the main diagonal. We set S = {(i, j) : A_{ij} ≠ 0, i ≥ j}, i.e., the same pattern as the lower triangle of A. The final approximate inverse M = G^t G comes from the preconditioning factor G = D Ĝ, where the diagonal matrix D is computed so that (G A G^t)_{ii} = 1.
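As an illustration only, the two constructions can be sketched in dense, serial Python as follows. This is our own reconstruction from the definitions above, with hypothetical function names; the authors' code is parallel and fully sparse.

```python
import numpy as np

def ainv_factor(A, eps):
    """Biconjugation sketch: A-orthogonalize e_1,...,e_N, dropping entries < eps.

    Returns a unit upper triangular Z and a positive vector d such that
    A^{-1} is approximated by Z diag(1/d) Z^T (exactly, when eps = 0).
    """
    n = A.shape[0]
    Z = np.eye(n)
    d = np.zeros(n)
    for i in range(n):
        z = np.zeros(n)
        z[i] = 1.0                       # start from the canonical basis vector e_i
        for j in range(i):
            # make z A-conjugate to the previously computed columns
            z -= (Z[:, j] @ (A @ z)) / d[j] * Z[:, j]
        z[np.abs(z) < eps] = 0.0         # drop tolerance keeps the factor sparse
        z[i] = 1.0                       # unit diagonal is preserved
        d[i] = z @ (A @ z)               # > 0 for SPD A, so the preconditioner stays SPD
        Z[:, i] = z
    return Z, d

def fsai_factor(A):
    """FSAI on the lower triangular pattern of A: solve (G A)_{ij} = delta_{ij}
    row by row over the pattern, then rescale so that (G A G^T)_{ii} = 1."""
    n = A.shape[0]
    G = np.zeros((n, n))
    for i in range(n):
        J = [j for j in range(i + 1) if A[i, j] != 0]   # row pattern, incl. diagonal
        rhs = np.zeros(len(J))
        rhs[-1] = 1.0                                   # delta_{ii}
        G[i, J] = np.linalg.solve(A[np.ix_(J, J)], rhs)
    scale = np.sqrt(np.einsum('ij,jk,ik->i', G, A, G))  # sqrt of (G A G^T)_{ii}
    return G / scale[:, None]
```

With eps = 0 the biconjugation is an exact A-orthogonal Gram–Schmidt process, so Z diag(1/d) Z^T reproduces A^{-1}; a positive drop tolerance trades accuracy for sparsity, while d stays positive because A is SPD, consistent with the positive-definiteness claim above.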
3. Parallel DACG algorithm

Our DACG algorithm sequentially computes the eigenpairs, starting from the leftmost one, (λ_1, u_1). To evaluate the jth eigenpair, j > 1, DACG minimizes the Rayleigh quotient in a subspace orthogonal to the j - 1 eigenvectors previously computed. More precisely, DACG minimizes

q(z) = (z^t A z)/(z^t B z),  where z = z(x) = (I - U_j U_j^t B) x,  U_j = [u_1, ..., u_{j-1}],  x ∈ R^N.   (1)

The first eigenpair (λ_1, u_1) is obtained by minimization of (1) with z = x (U_1 = ∅). Let M be a preconditioning matrix. The s leftmost eigenpairs are computed by the following conjugate gradient procedure [4]:
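To make the deflated minimization (1) concrete, the following serial Python sketch computes the s leftmost eigenpairs sequentially on small dense matrices. It is a deliberate simplification of the algorithm below (preconditioned steepest descent with exact line search, instead of the full CG recurrence); all names are ours, not the authors' code.

```python
import numpy as np

def leftmost_eigenpairs(A, B, s, M=None, tol=1e-10, maxit=5000):
    """Sequentially minimize q(z) = z'Az / z'Bz over subspaces B-orthogonal
    to the previously computed eigenvectors (deflation)."""
    n = A.shape[0]
    M = np.eye(n) if M is None else M
    U, lams = [], []
    rng = np.random.default_rng(0)
    for _ in range(s):
        x = rng.standard_normal(n)
        for _ in range(maxit):
            for u in U:                       # deflation: keep x B-orthogonal to U
                x -= (u @ (B @ x)) * u
            Ax, Bx = A @ x, B @ x
            gamma, eta = x @ Ax, x @ Bx
            q = gamma / eta
            r = Ax - q * Bx                   # the gradient of q is (2/eta) * r
            if np.linalg.norm(r) <= tol * np.linalg.norm(Ax):
                break
            p = M @ r                         # preconditioned (steepest-descent) direction
            for u in U:                       # keep the search direction deflated too
                p -= (u @ (B @ p)) * u
            a, b = p @ Ax, p @ (A @ p)
            c, d = p @ Bx, p @ (B @ p)
            # stationary points of q(x + t p) solve the quadratic
            # (bc - ad) t^2 + (eta*b - gamma*d) t + (eta*a - gamma*c) = 0
            qa, qb, qc = b * c - a * d, eta * b - gamma * d, eta * a - gamma * c
            if abs(qa) < 1e-30:
                t = -qc / qb
            else:
                disc = np.sqrt(max(qb * qb - 4.0 * qa * qc, 0.0))
                roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
                qline = lambda t: (gamma + 2*t*a + t*t*b) / (eta + 2*t*c + t*t*d)
                t = min(roots, key=qline)     # take the root minimizing q on the line
            x = x + t * p
        eta = x @ (B @ x)
        lams.append((x @ (A @ x)) / eta)
        U.append(x / np.sqrt(eta))            # store the B-normalized eigenvector
    return np.array(lams), np.column_stack(U)
```

The exact line search solves a quadratic because q(x + t p) is a ratio of two quadratics in t; the same observation underlies step 3.7 of the algorithm below.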
DACG Algorithm. Choose tolerances ε_1 and ε_2.
DO j = 1, s
  1. Choose x_0 such that U_j^t B x_0 = 0; set k = 0, β_0 = 0;
  2. x_0^A = A x_0; x_0^B = B x_0; γ = x_0^t x_0^A; η = x_0^t x_0^B; q_0 = q(x_0) = γ/η;
  3. REPEAT
     3.1. g_k = ∇q(x_k) = (2/η)(x_k^A - q_k x_k^B);
     3.2. g_k^M = M g_k;
     3.3. IF k > 0 THEN β_k = g_k^t (g_k^M - g_{k-1}^M)/(g_{k-1}^t g_{k-1}^M);
     3.4. p̃_k = g_k^M + β_k p_{k-1};
     3.5. p_k = p̃_k - Σ_{i=1}^{j-1} u_i (u_i^t B p̃_k);
     3.6. p_k^A = A p_k; p_k^B = B p_k;
     3.7. α_k = arg min_t q(x_k + t p_k) = (γd - ηb + √Δ)/(2(bc - ad)),
          with a = p_k^t x_k^A, b = p_k^t p_k^A, c = p_k^t x_k^B, d = p_k^t p_k^B,
          and Δ = (ηb - γd)^2 - 4(bc - ad)(ηa - γc);
     3.8. x_{k+1} = x_k + α_k p_k; x_{k+1}^A = x_k^A + α_k p_k^A; x_{k+1}^B = x_k^B + α_k p_k^B;
     3.9. γ = γ + 2a α_k + α_k^2 b; η = η + 2c α_k + α_k^2 d;
     3.10. q_{k+1} = q(x_{k+1}) = γ/η;
     3.11. k = k + 1;
     UNTIL |q_k - q_{k-1}|/q_k < ε_1 or ||x_k^A - q_k x_k^B||_2/√γ < ε_2;
  4. λ_j = q_k; u_j = x_k/√η;
END DO

Our DACG algorithm can be decomposed into a number of scalar products, daxpy-like linear combinations of vectors, a·v + b·w, and matrix-vector multiplications. We focused on parallelizing these tasks, assuming that the code is to be run on a machine with p identical, powerful processors, our model machine being a Cray T3E Supercomputer. Scalar products, v·w, are distributed among the p processors by a uniform block mapping, and the results are combined via a global reduce operation. For any vector combination z = a·v + b·w, each processor computes only a subblock consisting of ⌊N/p⌋ elements, where ⌊·⌋ denotes the integer part of a real number.

Each N × N matrix is uniformly row-partitioned among the p processors, so that ⌊N/p⌋ rows of A, B, Z and Z^t are stored on each processor (except at most the last one, when N/p is not a whole number). Large problems can be solved using a small amount of memory per processor, provided that a suitably large number of processors is activated. In principle, some memory saving could be attained by avoiding the storage of Z^t; to reduce data communication, however, we prefer to store Z^t explicitly. For the same reason, the whole A and B are stored, instead of only their upper (or lower) triangular parts. Partitioning the matrices by taking an equal number of entries on each processor could, in principle, produce a better load balance. This strategy requires more complex data structures, thus burdening the code and hampering the implementation of our efficient strategy for performing matrix-vector products, which is described in the sequel. The matrices A and B have quite uniformly distributed entries in our problems, thus uniform row partitioning yields uniform distribution of
nonzero elements on different processors. The number of nonzeros of the upper triangular factor Z on processor j decreases with j, but an equivalent increase is observed in the number of nonzeros of Z^t, thus balancing the overall number of elements stored on each processor.

The implementation of the matrix-vector products Av, Bv, Zv, Z^t v is tailored for application to sparse matrices, by minimizing data communication between processors [6]. Let h = ⌊N/p⌋. Define the sets R_j = {h(j-1)+1, ..., hj}, for each j = 1, ..., p-1, together with R_p = {h(p-1)+1, ..., N}. R_i encompasses the indices of the rows belonging to processor i. Let P^i = {c_{rj} ≠ 0 : r ∈ R_i} be the set of nonzero elements belonging to processor i, and define P_1^i = {c_{rj} ∈ P^i : j ∉ R_i}. Let C_k^i = {j : c_{rj} ∈ P_1^i, j ∈ R_k} and G_k^i = {j : c_{rj} ∈ P_1^k, j ∈ R_i}. Recall that processor i has in its local memory the components v(j), j ∈ R_i. One can see that, to complete a matrix-vector product, the elements v(j), j ∈ C_k^i, must be received from processor k; at the same time, the elements v(j), j ∈ G_k^i, must be sent to processor k. Our matrix-vector routine exchanges precisely these elements and no more, thus providing a large saving in communication time.

The B-orthogonalization of the current search vector against the previously computed eigenvectors is accomplished in step 3.5 of our DACG algorithm by a Gram–Schmidt process. This operation is computationally expensive, its relative cost depending on the sparsity of the matrices and on the eigenpair level. For large s (s ≥ 40), this step can require more than half the time of the whole algorithm. To perform this task, we used both CGS and MGS. In principle, MGS is more stable than CGS, preventing the loss of orthogonality among the vectors. By numerical testing, we found that in our problems CGS provides orthogonality which is accurate to machine precision, as MGS does. On the other hand, the CGS algorithm can be implemented exploiting BLAS2 routines (while MGS cannot), thus obtaining faster code. Moreover, our CGS implementation reduces the number of communications with respect to MGS, as shown below.

Let C be an arbitrary matrix and v an arbitrary vector; let v^{(q)} = {v(j), j ∈ R_q} be the subarray whose components are stored on processor q, and let S_q(C) be the set of indices i such that v(i) is needed by processor q to perform a Cv product. Processor q performs the following operations for B-orthogonalizing x with respect to U_j.

CGS Algorithm
1. x := gather(x(i), i ∈ S_q(B));
2. y^{(q)} := (Bx)^{(q)};
3. a := U_j^t y^{(q)};  {dgemv}
4. a^{(q)} := reduce(a);
5. x^{(q)} := x^{(q)} - (U_j a)^{(q)};  {dgemv}

MGS Algorithm
DO k = 1, j-1
  1'. x := gather(x(i), i ∈ S_q(B));
  2'. y^{(q)} := (Bx)^{(q)};
  3'. a := parallel(u_k^t y);
  4'. x^{(q)} := x^{(q)} - a u_k;
END DO
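In serial form (ignoring the gather/reduce communication), the two variants can be sketched as follows. Note how CGS reduces to two dense matrix-vector products (the BLAS2 dgemv pattern), while MGS must loop over the columns of U_j one at a time; the function names are ours, for illustration.

```python
import numpy as np

def cgs_b_orth(x, U, B):
    # classical Gram-Schmidt: two block operations, each a dense
    # matrix-vector product (the dgemv pattern exploited in the paper)
    alpha = U.T @ (B @ x)
    return x - U @ alpha

def mgs_b_orth(x, U, B):
    # modified Gram-Schmidt: one column at a time, recomputing B x each step
    x = x.copy()
    for k in range(U.shape[1]):
        u = U[:, k]
        x = x - (u @ (B @ x)) * u
    return x
```

For U with B-orthonormal columns, a single CGS pass already yields U^t B x = 0 in exact arithmetic; the paper's observation is that, for these problems, one pass is also accurate to machine precision in floating point, while costing far fewer communications in parallel.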
The statement parallel(v^t w) denotes the data-parallel execution of the scalar product, while reduce(a) performs the broadcasting and summation on each processor of the elements of a. Recall that dgemv is the BLAS2 routine for efficiently
performing dense matrix-vector products in double precision arithmetic. It was exploited to implement CGS steps 3 and 5, which we labeled with the side comment ``{dgemv}''. The amount of data to be broadcast is larger in MGS: j - 1 gather operations are needed, in contrast to only one in CGS. This circumstance largely reduces the parallel efficiency of MGS.

4. Numerical tests

We now report clarifying numerical results obtained by applying the DACG procedure to a number of finite element, mixed finite element, and finite difference problems. The computations were performed on the T3E 1200 machine of the CINECA computing center, located in Bologna, Italy. The machine is a stand-alone system made up of DEC-Alpha 21164 processing elements (PEs), performing at a peak rate of 1200 Mflop/s. The PEs are interconnected by a 3D toroidal network having a 480 MByte/s payload bandwidth along each of the 6 torus directions. In the present configuration there are 256 PEs; half of them have 256 MByte of RAM, the others 128 MByte.

Table 1 lists the main features of our selected test problems, which arise from the discretization of parabolic and elliptic equations in two and three space dimensions. They are representative of a larger set of tests that we performed. Table 2 shows the number of nonzero elements in each preconditioning factor, together with the number of iterations performed to compute the smallest s = 10 eigenpairs. In all but one problem we set ε_1 = 10^{-8}, ε_2 = 10^{-3}. When N = 268,515, to get an accuracy comparable to the other cases, we set ε_1 = 10^{-15}, leaving ε_2 = 10^{-3}. T(M) is the CPU time (in seconds) spent to compute the preconditioning factors. Table 2 reports the number of iterations (iter.) performed when using p = 1, 2, 4, 8, 16 processors. In some cases the number of iterations changes with p, due to the ensuing changes in the order of floating point computations. When a change was recorded, we report the minimum and maximum number of iterations.
Note that AINV is more robust than Jacobi and FSAI, since the change in the number of iterations occurs only in one case (N = 216,000, ε = 0.025), while the variations with Jacobi and FSAI are more frequent. IC(0) is the standard incomplete Cholesky preconditioning factor, with the same sparsity pattern as A. Note that the FSAI

Table 1
Main characteristics of our sample problems^a

#   Problem           N         Nnz
1   2d FE             10,593    72,737
2   2d MFE (B = I)    28,600    142,204
3   3d FE (B = I)     42,189    602,085
4   3d FD (B = I)     64,000    438,400
5   3d FD (B = I)     216,000   1,490,400
6   3d FE (B = I)     268,515   3,926,823

^a N = matrix size, Nnz = nonzero elements in matrix A; FE = finite elements, FD = finite differences, MFE = mixed finite elements.
Table 2
Number of nonzero entries, Nnz(M), in the preconditioning factors^a

N                  IC(0)       Jacobi       FSAI         AINV          AINV          AINV
                                                         (ε = 0.1)     (ε = 0.05)    (ε = 0.025)
10,593   Nnz(M)    41,665      10,593       41,665       34,468        97,051        219,818
         iter.     1285        3411-3533    1676         2234          1407          1019
         T(M)      0.01        -            0.08         0.16          0.20          0.37
28,600   Nnz(M)    85,402      28,600       85,402       201,765       464,705       790,632
         iter.     1978        7621         4056         2276          1614          1381
         T(M)      0.02        -            0.18         0.31          0.54          0.90
42,189   Nnz(M)    322,137     42,189       322,137      141,956       341,523       601,360
         iter.     913         2981         1414         1527          1024          824
         T(M)      0.11        -            0.81         0.53          0.69          1.08
64,000   Nnz(M)    251,200     64,000       251,200      251,200       251,200       798,640
         iter.     751         2005-2164    1116-1271    1145          1145          945
         T(M)      0.07        -            0.51         0.65          0.65          1.61
216,000  Nnz(M)    853,200     216,000      853,200      853,200       853,200       2,732,760
         iter.     1133        3002-3526    1681-1925    1672          1672          1308-1321
         T(M)      0.26        -            1.74         1.94          1.94          5.89
268,515  Nnz(M)    2,097,669   268,515      2,097,669    985,304       1,626,471     4,421,002
         iter.     1605        8325-8330    4431-4435    3649          3493          2229
         T(M)      4.88        -            5.07         3.34          4.55          13.42

^a The number of iterations needed to achieve convergence when computing the s = 10 leftmost eigenpairs of the sample problems is summarized for runs with p = 1, 2, 4, 8, 16 processors. When the number of iterations changes with p, the minimum and maximum values are reported. T(M) gives the CPU seconds spent to sequentially compute each preconditioner.
factor has the same pattern as IC(0); yet, inspecting Table 2, we see that FSAI-DACG needs more iterations than IC(0)-DACG to converge. With respect to IC(0), a larger number of nonzero elements in AINV is needed to promote DACG convergence within the same number of iterations. On the other hand, IC(0) is not suitable for parallel computation, so it was run only on a single processor. Jacobi preconditioning needs a larger number of iterations to converge. Incidentally, note that un-preconditioned DACG is not efficient, requiring many more iterations than Jacobi-DACG. When N = 268,515, un-preconditioned DACG does not converge to the smallest eigenpair within 2000 iterations.

Concerning the efficiency of our matrix-vector routine, Table 3 reports the average number of elements sent (received) by one processor to perform the products Av and ZZ^t v, when N = 28,600. The amount of data exchanged by each processor for one matrix-vector product with our routine is far smaller than the ⌊N/p⌋ values sent to (received from) each other processor, which applies to the unoptimized strategy. Note that when ε = 0.1 the AINV factor has 201,765 nonzero elements, whereas when ε = 0.025 it has 790,632. Despite this high fill-in, the number of values exchanged in order to perform the product ZZ^t v remains far smaller than in the unoptimized approach. Indeed, when the pattern of the matrix is not much
Table 3
Average number of floating point values (#data) sent (received) by a single processor, in a p-processor run, when performing the products Av and ZZ^t v^a

         Av                     ZZ^t v, ε = 0.1        ZZ^t v, ε = 0.025
p        Opt.     Unopt.        Opt.     Unopt.        Opt.     Unopt.
2        179      14,300        240      28,600        910      28,600
4        269      21,450        359      42,900        1355     42,900
8        314      25,025        418      50,050        1579     50,050
16       336      26,812        448      53,624        1691     53,624
32       348      27,706        463      55,412        1718     55,412

^a ``Opt.'' stands for our optimized matrix-vector routine, compared with the unoptimized (``Unopt.'') product. Z is the AINV factor for ε = 0.1, 0.025 (N = 28,600).
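The index sets C_k^i and G_k^i of Section 3, which drive these exchange counts, can be computed once from the sparsity pattern alone. A small Python sketch of this bookkeeping (our illustration; the actual routine runs distributed on the T3E):

```python
def comm_sets(nonzeros, N, p):
    """recv[i][k] plays the role of C_k^i: indices processor i receives from k.
    send[i][k] plays the role of G_k^i: indices processor i sends to k.
    nonzeros: iterable of (r, j) positions of structural nonzeros (0-based)."""
    h = max(N // p, 1)                          # uniform block size, integer part of N/p
    owner = lambda idx: min(idx // h, p - 1)    # last processor takes the remainder rows
    recv = [[set() for _ in range(p)] for _ in range(p)]
    for r, j in nonzeros:
        i, k = owner(r), owner(j)
        if i != k:                              # off-block column: v(j) lives on processor k
            recv[i][k].add(j)
    # what i must send to k is exactly what k receives from i
    send = [[recv[k][i] for k in range(p)] for i in range(p)]
    return recv, send
```

For a pattern concentrated near the diagonal (as in these discretizations), only block-boundary entries cross processors, which is why the ``Opt.'' counts in Table 3 stay small and grow only mildly with p.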
dispersed away from the diagonal, with our approach each processor sends/receives a small amount of data to/from a small number of processors.

Table 4 reports the CPU time, T_p, spent for computing s = 10 eigenpairs on p = 1, 2, 4, 8, 16, 32 processors, together with the speedups S_p = T_1/T_p. The time for performing data input is not included, nor is the time for computing the preconditioners, which were stored into files. When N = 268,515 the one-processor run with AINV(0.025) was not performed, due to memory overflow; the corresponding speedups cannot be computed. Due to its slower convergence, Jacobi-DACG usually requires larger CPU times than AINV-DACG and FSAI-DACG in multiprocessor runs, despite its good parallel performance. This result confirms that AINV and FSAI are effective preconditioners for our DACG algorithm. One can see that the time spent evaluating the AINV and FSAI preconditioners is a small fraction of the overall CPU time; thus, on this ground, neither can FSAI be preferred to AINV, nor does the cost of their computation make Jacobi preconditioning far superior.

Inspecting Table 4, one can see that good speedups are obtained for the larger problems, scoring up to S_32 = 24.48 with FSAI-DACG when N = 268,515, which is very satisfactory. The speedup values increase with N, showing that our approach to parallelization is well suited to large problems. This is also evidenced by Fig. 1, which reports the efficiency E_p = S_p/p of preconditioned DACG when solving the problems with N = 28,600 and N = 268,515. When dealing with the larger matrix, the efficiency can be larger than 70%, even when p = 32. The efficiency is smaller when attacking the smallest problem, N = 28,600, as one would expect. In that case, AINV-DACG displays the best parallel performance. Looking at Table 4, we see that sporadically S_2 > 2 was found. Such a counterintuitive result is merely due to a smaller number of iterations sometimes performed when p = 2, by chance.

Good speedups are obtained exploiting Jacobi, which confirms its good parallel performance, while AINV and FSAI allow for similar speedups. Comparing AINV-DACG efficiency with that of FSAI-DACG, we observe that the ε value allowing the smallest AINV-DACG CPU time is problem- and p-dependent. However, inspecting Table 4, we see that a convenient choice is
Table 4
CPU time, T_p, spent for computing s = 10 eigenpairs on p processors^a

N = 10,593
Precond.       T_1      T_2      T_4      T_8      T_16     T_32     S_2     S_4     S_8     S_16    S_32
Jacobi         75.64    39.18    23.14    14.28    10.83    10.42    1.93    3.26    5.29    6.98    7.25
AINV(0.1)      45.96    25.74    14.98    9.44     7.06     6.20     1.79    3.07    4.87    6.51    7.41
AINV(0.05)     35.18    19.14    11.20    7.36     6.10     4.50     1.84    3.14    4.78    5.77    7.82
AINV(0.025)    36.64    18.84    10.49    6.41     5.00     4.45     1.94    3.49    5.72    7.33    8.23
FSAI           35.01    19.52    11.18    7.00     5.13     3.88     1.79    3.13    5.00    6.82    9.02

N = 28,600
Jacobi         255.12   135.04   79.18    46.62    29.29    21.31    1.89    3.22    5.47    8.71    11.97
AINV(0.1)      122.74   66.55    36.98    21.52    13.56    10.64    1.84    3.32    5.70    9.05    11.54
AINV(0.05)     126.42   64.48    34.96    20.55    13.16    10.31    1.96    3.62    6.15    9.61    12.26
AINV(0.025)    147.89   76.93    40.81    22.88    14.09    10.21    1.92    3.62    6.46    10.50   14.48
FSAI           177.29   92.39    52.41    31.32    20.06    15.37    1.92    3.38    5.66    8.84    11.53

N = 42,189
Jacobi         229.03   109.40   60.79    34.40    20.66    13.76    2.09    3.76    6.66    11.08   16.65
AINV(0.1)      129.73   67.94    37.59    21.89    13.97    10.27    1.91    3.45    5.93    9.29    12.63
AINV(0.05)     106.23   55.99    29.55    17.64    10.53    8.50     1.90    3.59    6.02    10.09   12.50
AINV(0.025)    99.31    52.68    29.09    16.69    10.73    7.98     1.89    3.41    5.95    9.26    12.44
FSAI           141.45   74.25    40.94    23.35    14.48    10.71    1.91    3.46    6.06    9.77    13.21

N = 64,000
Jacobi             191.52   95.93    48.71    27.84    16.91    11.01    1.99    3.93    6.87    11.32   17.39
AINV(0.1), (0.05)  133.58   69.09    38.03    21.54    13.31    9.28     1.93    3.51    6.20    10.04   14.39
AINV(0.025)        154.86   80.52    43.74    24.32    14.92    11.25    1.92    3.54    6.37    10.38   13.77
FSAI               135.35   69.84    36.65    23.43    13.37    8.41     1.94    3.69    5.78    10.12   16.09

N = 216,000
Jacobi             855.81   496.62   250.34   136.04   71.45    42.90    1.72    3.42    6.29    11.98   19.95
AINV(0.1), (0.05)  644.71   339.43   176.38   95.24    53.45    31.49    1.90    3.66    6.77    12.06   20.47
AINV(0.025)        750.03   372.77   201.75   103.69   55.29    34.54    2.01    3.72    7.23    13.57   21.71
FSAI               654.73   384.80   179.52   91.46    54.55    31.12    1.70    3.65    7.16    12.00   21.04

N = 268,515
Jacobi         3678.90  1896.71  961.27   503.28   269.41   155.99   1.94    3.83    7.31    13.66   23.58
AINV(0.1)      2276.42  1196.96  627.63   349.51   202.20   132.97   1.90    3.63    6.51    11.26   17.12
AINV(0.05)     2567.57  1311.78  689.65   398.58   235.32   158.12   1.96    3.72    6.44    10.91   16.24
AINV(0.025)    *        1146.68  624.08   364.06   263.13   180.56   *       *       *       *       *
FSAI           2994.49  1453.11  798.01   388.93   210.49   122.33   2.06    3.75    7.70    14.23   24.48

^a The corresponding speedups, S_p, are also reported. When N = 64,000 and N = 216,000, the AINV(0.1) and AINV(0.05) factors are equal. An asterisk means that the value could not be computed, due to memory overflow.
Fig. 1. Efficiency of the parallel DACG algorithm for the problems with N = 28,600 and N = 268,515.
ε = 0.05. For this setting, when p = 1 and N > 10,593, AINV-DACG runs faster than FSAI-DACG. On the other hand, when massively parallel (p = 32), large (N ≥ 64,000) problems are considered, FSAI-DACG runs faster than AINV-DACG, due to a slightly better parallel performance. Summarizing, AINV and FSAI can be considered equally effective preconditioners for our DACG algorithm.

Note that when a large number of eigenpairs is sought (say s > 40), the cost of each DACG iteration is dominated by the orthogonalization process, rather than by the preconditioner. The CPU time is driven by the number of iterations performed; thus Jacobi preconditioning, which requires far more iterations than the other techniques, is far more CPU demanding. In a number of problems (see for instance the case N = 216,000 in Table 4), Jacobi-DACG spent not much more CPU time than AINV-DACG and FSAI-DACG, especially when a large number of processors is exploited.

Let Jacobi(k) denote k-block Jacobi preconditioning. Table 5 refers to Jacobi(k)-DACG runs, compared in each case with the best factorized approximate inverse preconditioner; three of our test problems are considered. When k ≥ 4, the growing cost of a single iteration is not balanced by the (slight) decrease in the iteration count: the total CPU time is larger on all our test problems, thus it is not reported in Table 5, where k = 1, 2, 3 is considered. Note that, discarding the case N = 42,189, p = 8, the choice k = 3 allows for the smallest CPU time in all the problems considered. In the largest problem arising from FD discretization (i.e. N = 216,000), when p = 1 the CPU time needed by Jacobi(3)-DACG is close to that required by AINV-DACG. Solving this problem, which yields structured matrices, the former is even faster than the latter for p = 2, 4, 16, 32, thanks to its
Table 5
CPU seconds, T_p, and speedups, S_p, in the computation of s = 10 eigenpairs by Jacobi(k)-DACG and AINV-DACG

                           p = 1    p = 2    p = 4    p = 8    p = 16   p = 32
N = 28,600
Jacobi (k = 1)    T_p      255.1    135.0    79.2     46.6     29.3     21.3
                  S_p      -        1.9      3.2      5.5      8.7      12.0
Jacobi (k = 2)    T_p      237.4    124.4    69.9     40.1     24.5     17.0
                  S_p      -        1.9      3.4      5.9      9.7      14.0
Jacobi (k = 3)    T_p      215.0    113.8    63.6     36.1     22.1     15.8
                  S_p      -        1.9      3.4      6.0      9.7      13.6
AINV (ε = 0.1)    T_p      122.7    66.6     37.0     21.5     13.6     10.6
                  S_p      -        1.8      3.3      5.7      9.1      11.5

N = 42,189
Jacobi (k = 1)    T_p      229.0    109.4    60.8     34.4     20.7     13.8
                  S_p      -        2.1      3.8      6.7      11.1     16.7
Jacobi (k = 2)    T_p      192.7    104.1    56.8     29.9     19.1     13.1
                  S_p      -        1.9      3.4      6.4      10.1     14.8
Jacobi (k = 3)    T_p      180.7    95.5     53.0     30.5     18.3     12.6
                  S_p      -        1.9      3.4      5.9      9.9      14.4
AINV (ε = 0.025)  T_p      99.3     52.7     29.1     16.7     10.7     8.0
                  S_p      -        1.9      3.4      6.0      9.3      12.4

N = 216,000
Jacobi (k = 1)    T_p      855.8    496.6    250.3    136.0    71.5     42.9
                  S_p      -        1.7      3.4      6.3      12.0     19.9
Jacobi (k = 2)    T_p      732.6    398.4    182.5    95.7     52.7     31.4
                  S_p      -        1.8      4.0      7.7      13.9     23.3
Jacobi (k = 3)    T_p      683.3    307.6    158.1    99.4     51.5     26.5
                  S_p      -        2.2      4.3      6.9      13.3     25.8
AINV (ε = 0.1)    T_p      644.7    339.4    176.4    95.2     53.5     31.5
                  S_p      -        1.9      3.7      6.8      12.1     20.5
higher degree of parallelism. When analyzing our unstructured matrices, arising from FE or MFE discretizations, the AINV and FSAI preconditioners are more efficient than Block-Jacobi, for each p.

Table 6 reports the CPU times and speedups displayed by AINV-DACG in the computation of s = 40 eigenpairs, when using either CGS or MGS B-orthogonalization. Recall that the cost of the re-orthogonalization step increases with s; thus the performance of the preconditioned algorithm changes with s. First, by comparing the sequential (p = 1) times in Table 6, note that the CGS strategy leads to 10-15% smaller CPU times. With respect to MGS, higher speedup values are observed when using CGS. Table 7 gives the average CPU times required by the re-orthogonalization procedure in a p = 32 run, when s = 40. The percentage of orthogonalization time over the AINV-DACG time is also shown. This percentage ranges from 45% to 49% when using MGS, while it is at most 36% when CGS is exploited. Such an advantage is due both to a faster performance on each processor, and to a smaller amount of data to broadcast. Moreover, note that using CGS, approximately 80% or less of the total CPU time spent using MGS suffices.
Table 6
CPU times, T_p, spent by AINV-DACG, and ensuing speedups, S_p, for evaluating s = 40 eigenpairs, using either the CGS or the MGS strategy

CGS
N (ε)                      p = 1     p = 2     p = 4     p = 8     p = 16    p = 32
28,600 (ε = 0.1)    T_p    636.24    338.19    172.85    101.79    64.72     38.62
                    S_p    -         1.88      3.68      6.25      9.83      16.47
42,189 (ε = 0.05)   T_p    570.11    290.96    157.46    86.70     52.33     33.51
                    S_p    -         1.96      3.62      6.58      10.89     17.01
64,000 (ε = 0.1)    T_p    708.90    375.12    196.44    104.99    61.63     40.94
                    S_p    -         1.89      3.61      6.75      11.59     17.37
216,000 (ε = 0.1)   T_p    3826.63   1889.21   1000.28   515.18    286.39    164.27
                    S_p    -         2.03      3.83      7.43      13.36     23.29

MGS
28,600 (ε = 0.1)    T_p    713.42    378.76    208.27    118.70    75.80     53.85
                    S_p    -         1.88      3.43      6.01      9.41      13.25
42,189 (ε = 0.05)   T_p    635.28    329.89    180.93    103.68    61.09     41.09
                    S_p    -         1.93      3.51      6.13      10.40     15.46
64,000 (ε = 0.1)    T_p    829.06    430.58    229.99    132.92    80.90     51.06
                    S_p    -         1.93      3.60      6.24      10.25     16.24
216,000 (ε = 0.1)   T_p    4328.06   2197.26   1123.12   593.94    344.97    205.97
                    S_p    -         1.97      3.85      7.29      12.55     21.01
Table 7
AINV-DACG exploiting either the CGS or the MGS procedure^a

                                         Time                  Ortho/total   CGS/MGS
N         Precond.      Orthog. strategy Total      Ortho      (%)           (%)
28,600    AINV(0.1)     CGS              38.62      11.16      28            72
                        MGS              53.85      26.45      49
42,189    AINV(0.05)    CGS              33.51      12.32      36            82
                        MGS              41.09      20.47      49
64,000    AINV(0.1)     CGS              40.94      14.37      35            80
                        MGS              51.06      25.05      49
216,000   AINV(0.1)     CGS              164.27     55.30      33            80
                        MGS              205.97     94.20      45

^a Average values in a 32-processor computation of the orthogonalization time (Ortho) and the whole computing time (Total), when s = 40 eigenpairs are evaluated. The label CGS/MGS denotes the ratio Total(CGS)/Total(MGS) between the total CPU times spent using either CGS or MGS.
5. Conclusions

The following points are worth emphasizing.
· Choosing the nonzero pattern of A when computing FSAI, and setting ε = 0.05 when evaluating the AINV factor, allowed for obtaining equally satisfactory preconditioners for our DACG procedure.
· AINV-DACG and FSAI-DACG displayed comparable speedups. For p = 32 processors, FSAI-DACG usually showed a slightly better parallelization level,
in some cases better than with Jacobi-DACG, which however confirmed its good parallel performance.
· The AINV and FSAI techniques were found to be effective preconditioners for our DACG algorithm. They display better parallel performance than the simpler Jacobi preconditioning, even when massively parallel computations (p ≥ 16) are performed. Jacobi(3)-DACG can spend less CPU time than AINV-DACG only on FD problems, when a large number of processors (p = 32) is used.
· In our numerical tests, the CGS procedure provides orthogonality up to machine precision, as MGS does. Moreover, when massively parallel runs are considered (p = 32), optimal AINV-DACG using CGS generally spent 80% of the time spent using MGS.

Acknowledgements

This work has been supported in part by the Italian MURST Project ``Analisi Numerica: Metodi e Software Matematico'', and by CNR contract 98.01022.CT01. Free accounting units on the T3E Supercomputer were given by CINECA, inside a frame research grant. We thank Rich Lehoucq for providing useful suggestions.

References

[1] K.J. Bathe, E. Wilson, Solution methods for eigenvalue problems in structural dynamics, Int. J. Numer. Methods Eng. 6 (1973) 213–226.
[2] M. Benzi, C.D. Meyer, M. Tůma, A sparse approximate inverse preconditioner for the conjugate gradient method, SIAM J. Sci. Comput. 17 (5) (1996) 1135–1149.
[3] L. Bergamaschi, G. Gambolati, G. Pini, Asymptotic convergence of conjugate gradient methods for the partial symmetric eigenproblem, Numer. Lin. Alg. Appl. 4 (2) (1997) 69–84.
[4] L. Bergamaschi, G. Pini, F. Sartoretto, Approximate inverse preconditioning in the parallel solution of sparse eigenproblems, Numer. Lin. Alg. Appl. 7 (3) (2000) 99–116.
[5] L. Bergamaschi, M. Putti, Numerical comparison of iterative methods for the eigensolution of large sparse symmetric matrices, SIAM J. Sci. Comput., June 2000 (preprint).
[6] L. Bergamaschi, M. Putti, Efficient parallelization of preconditioned conjugate gradient schemes for matrices arising from discretizations of diffusion equations, in: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, March 1999 (CD-ROM).
[7] T. Ericsson, A. Ruhe, The spectral transformation Lanczos method for the numerical solution of large sparse generalized symmetric eigenvalue problems, Math. Comp. 35 (1980) 1251–1268.
[8] G. Gambolati, F. Sartoretto, P. Florian, An orthogonal accelerated deflation technique for large symmetric eigenproblems, Comp. Methods Appl. Mech. Eng. 94 (1992) 13–23.
[9] A.V. Knyazev, A.L. Skorokhodov, The preconditioned gradient-type iterative methods in a subspace for partial generalized symmetric eigenvalue problems, SIAM J. Numer. Anal. 31 (4) (1994) 1226–1239.
[10] L.Y. Kolotilina, A.Y. Yeremin, Factorized sparse approximate inverse preconditionings. I. Theory, SIAM J. Matrix Anal. Appl. 14 (1993) 45–58.
[11] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, J. Res. Nat. Bur. Standards 45 (1950) 255–282.
[12] R.B. Lehoucq, D.C. Sorensen, Deflation techniques for an implicitly restarted Arnoldi iteration, SIAM J. Matrix Anal. Appl. 17 (4) (1996) 789–821.
[13] R.B. Lehoucq, D.C. Sorensen, C. Yang, ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods, SIAM, Philadelphia, 1998.
[14] C.C. Paige, Computational variants of the Lanczos method for the eigenproblem, J. Inst. Math. Applics. 10 (1972) 373–381.
[15] B.N. Parlett, The Symmetric Eigenvalue Problem, Prentice-Hall, Englewood Cliffs, NJ, 1980.
[16] F. Sartoretto, G. Pini, G. Gambolati, Accelerated simultaneous iterations for large finite element eigenproblems, J. Comp. Phys. 81 (1989) 53–69.
[17] G.L.G. Sleijpen, H.A. van der Vorst, A Jacobi–Davidson iteration method for linear eigenvalue problems, SIAM J. Matrix Anal. Appl. 17 (2) (1996) 401–425.