Optimization of a kinetic laser–plasma interaction code for large parallel systems


Parallel Computing 29 (2003) 1175–1189 www.elsevier.com/locate/parco

Olivier Coulaud a, Michaël Dussere a,*, Pascal Henon a, Erik Lefebvre b, Jean Roman a

a ScAlApplix Project, INRIA-Futurs, LaBRI UMR CNRS 5800, Université Bordeaux 1 & ENSEIRB, F-33405 Talence, France
b Département de Physique Théorique et Appliquée, CEA/DIF, BP 12, 91680 Bruyères-le-Châtel, France

Received 17 January 2003; received in revised form 13 June 2003; accepted 1 July 2003

Abstract

In this work, we simulate the interaction between intense laser radiation and a fully ionized plasma by solving the Vlasov–Maxwell system with the "Particle-In-Cell" (PIC) method. This method provides a very detailed description of the plasma dynamics, but at the expense of large computer resources. Our SPMD 3D PIC code, CALDER, which is fully relativistic, is based on a spatial domain decomposition: each processor is assigned one subdomain and is in charge of updating its field values and particle coordinates. This paper presents optimizations that make large simulations feasible, such as communication overlapping, cache-friendly data management, and the use of a parallel sparse preconditioned conjugate gradient (PCG) solver for Poisson's equation. Finally, we present the benefits of these optimizations on the IBM SP3 and the physical results of a large simulation obtained on the CEA/DIF Teraflops parallel computer.
© 2003 Elsevier B.V. All rights reserved.

Keywords: High-performance computing; PIC method; Parallel sparse solver

* Corresponding author.
E-mail addresses: [email protected] (O. Coulaud), [email protected] (M. Dussere), [email protected] (P. Henon), [email protected] (E. Lefebvre), [email protected] (J. Roman).

0167-8191/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0167-8191(03)00098-X


1. Introduction

The continuing development of new, powerful laser sources able to deliver very intense light beams [1] motivates a large effort to model and simulate the interaction between these light beams and matter. Laser sources are highly diverse in terms of duration (from continuous to femtosecond, i.e. 10^-15 s, pulses), wavelength, peak intensity and power (from a few megawatts to petawatts). The phenomena taking place during the interaction of these light pulses with matter depend strongly on these parameters, and a variety of models exists to describe them. When the incident laser intensity is high enough, the gas or solid target is ionized, and a plasma composed of electrons and ions is produced. The evolution of this plasma on the nanosecond time scale, for an incident laser pulse of around 10^15 W/cm², is adequately described by hydrodynamic equations, which characterize the system with a small number of space- and time-dependent parameters (density, flow, internal energy). At higher laser intensity, new phenomena invalidate this relatively simple description. For incident intensities of 10^18 W/cm² and above, the motion of electrons in the laser electromagnetic fields becomes highly non-linear, and the plasma can no longer be described by a small number of hydrodynamic parameters. Instead, a kinetic formalism must be used, in which the density of particles in phase space (i.e. the momentum–position (p, x) space) characterizes each particle species in the system. The evolution of this density function is governed, for each particle species a, by the collision-less Vlasov equation:

\frac{\partial f_a}{\partial t} + v \cdot \frac{\partial f_a}{\partial x} + q_a (E + v \times B) \cdot \frac{\partial f_a}{\partial p} = 0        (1)

where f_a is the density of particles a, E is the electric field, and B is the magnetic field. The electromagnetic (laser and plasma) fields, on the other hand, obey the system of Maxwell's equations:

(a) \operatorname{div} E = \rho/\varepsilon_0
(b) \operatorname{div} B = 0
(c) \frac{\partial E}{\partial t} = c^2 \operatorname{rot} B - j/\varepsilon_0
(d) \frac{\partial B}{\partial t} = -\operatorname{rot} E        (2)

These two systems of equations are coupled: in the Vlasov equation the distribution function evolves under the action of the electromagnetic fields, and in Maxwell's equations the fields are influenced, through the charge and current densities ρ and j, by the dynamics of the particles. The Vlasov equation can be solved with a characteristics method [2]: if we sample the initial distribution, in momentum and position space, with a large number of "macro-particles", Eq. (1) expresses the conservation of f_a along the trajectories of these macro-particles, given by:

\frac{dx}{dt} = v, \qquad \frac{dp}{dt} = q_a (E + v \times B), \qquad p = \frac{m_a v}{\sqrt{1 - (v/c)^2}}        (3)
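The paper does not show the particle pusher itself; as a hedged illustration, Eq. (3) is commonly advanced with a relativistic Boris-type scheme. The sketch below is a stand-alone C routine in which the Particle structure, the variable u = γv, and the assumption that E and B have already been interpolated to the particle position are ours, not CALDER's.

/* Sketch of a relativistic Boris-type push for Eq. (3).  Illustrative only:
 * the Particle structure and field values are hypothetical, not CALDER's. */
#include <math.h>

typedef struct { double x[3]; double u[3]; /* u = gamma*v (momentum per unit mass) */ } Particle;

static void cross(const double a[3], const double b[3], double r[3]) {
    r[0] = a[1]*b[2] - a[2]*b[1];
    r[1] = a[2]*b[0] - a[0]*b[2];
    r[2] = a[0]*b[1] - a[1]*b[0];
}

/* Advance one particle by dt under fields E, B interpolated at its position. */
static void boris_push(Particle *p, const double E[3], const double B[3],
                       double q, double m, double c, double dt)
{
    double qmdt2 = 0.5 * q * dt / m;
    double um[3], t[3], s[3], up[3], tmp[3];
    int k;

    for (k = 0; k < 3; k++) um[k] = p->u[k] + qmdt2 * E[k];     /* half electric kick */

    double gam = sqrt(1.0 + (um[0]*um[0] + um[1]*um[1] + um[2]*um[2]) / (c*c));
    for (k = 0; k < 3; k++) t[k] = qmdt2 * B[k] / gam;          /* magnetic rotation vector */
    double t2 = t[0]*t[0] + t[1]*t[1] + t[2]*t[2];
    for (k = 0; k < 3; k++) s[k] = 2.0 * t[k] / (1.0 + t2);

    cross(um, t, tmp);
    for (k = 0; k < 3; k++) up[k] = um[k] + tmp[k];             /* u' = u- + u- x t  */
    cross(up, s, tmp);
    for (k = 0; k < 3; k++) um[k] += tmp[k];                    /* u+ = u- + u' x s  */

    for (k = 0; k < 3; k++) p->u[k] = um[k] + qmdt2 * E[k];     /* second half electric kick */

    gam = sqrt(1.0 + (p->u[0]*p->u[0] + p->u[1]*p->u[1] + p->u[2]*p->u[2]) / (c*c));
    for (k = 0; k < 3; k++) p->x[k] += dt * p->u[k] / gam;      /* dx/dt = v = u/gamma */
}

The two half electric kicks around a pure magnetic rotation keep the update second-order accurate, consistent with the explicit second-order time scheme used for the fields.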


This leads us to simulate, under the effect of the electromagnetic field, the dynamics of a large collection of "macro-particles" which obey the same equations of motion as physical particles with mass m_a and electric charge q_a. In fact, each "macro-particle" represents a large number of physical particles. The field evolution, on the other hand, can be computed by a time-explicit leapfrog method on a fixed Cartesian mesh, with source terms provided by the particle motion. The resulting method, known as "Particle-In-Cell" (PIC), is probably the most widely used method to solve the coupled Vlasov–Maxwell system [2,3]. This detailed description requires very small time and space steps, and hence a very large amount of computation and memory to model realistic situations. To provide experimentally relevant information, simulations must cover plasma volumes of a few thousand cubic micrometers, and the spatial grid must at least correctly sample the laser wavelength (usually 1 μm); this leads to mesh sizes of many million cells or more, with one order of magnitude more particles (usually we consider two particle species: electrons and ions). Similarly, the time step must at least resolve the laser period, usually 3 fs, while the phenomena of interest take place on a picosecond time scale; simulations therefore usually feature tens of thousands of time steps. With a typical computing cost of 5 μs per particle and per time step, this leads to a few minutes per time step, and a prohibitive total of thousands of hours for the whole simulation. To handle this large amount of data and operations, and to bring the execution time down to the tens of hours range, efficient algorithms relying on data parallelism have been proposed [4,5]. In our code, CALDER, the simulation domain is partitioned evenly between the processors at the beginning of the computation. Each processor is in charge of computing the fields and particle dynamics in its own subdomain, which remains fixed during the whole simulation. As the particles move in the course of the computation, they may have to be transferred between neighboring processors. These and other communications are handled explicitly in the program using the MPI library [6].
This paper is organized as follows. The next section is devoted to the presentation of CALDER and of its algorithms. Section 3 describes the improvements that we implemented to increase computational efficiency. The results of these modifications are presented in the last section, together with a typical large-scale calculation performed with the code.

2. Overview of the PIC simulation process

The time discretization of Eqs. (2) and (3) is based on an explicit second-order time scheme. Maxwell's equations (2) are discretized in space by finite differences on regular staggered grids (Yee's lattice [2]). At each time step, the particles are moved according to Eqs. (3) and the evolution of the electromagnetic fields is governed by Maxwell's equations. Since we sample the density function f_a with macro-particles while Maxwell's equations are solved on grids, we need two projection operators.


The first one interpolates the fields onto the particles, and the second one interpolates the density values carried by the particles onto the grid. In the following subsections, we first detail the simulation process and its different steps with their computational cost, and then introduce our parallelization scheme.

2.1. Simulation scheme

In CALDER, the densities of ions and electrons are sampled by large numbers of particles defined by their position, speed and charge. The electromagnetic fields, the charge density and the current density are discretized on a Yee lattice [2]. These two different discretizations are related by two interpolation steps. The first one computes the electromagnetic fields at the particle positions using the values of the neighboring grid nodes. The second one computes the contributions of the particles to the source terms (charge density and current density) of Maxwell's equations. As shown in Fig. 1, the simulation variables computed during these different steps are defined at different time steps. Fig. 2 summarizes the succession of algorithmic steps within one time step.

• Field interpolation. The iteration begins with the interpolation of the field values on the nodes onto the particle positions (an illustrative gather is sketched after this list). This step requires between 25% and 35% of the global computation time.
• Particle motion. This step consists in updating the particle positions according to Eq. (3) and detecting the outgoing particles (depending on the simulation, they are re-injected or simply removed). Some simulations also require particle collision management. The whole step requires 30–45% of the total time, and its main part is spent on the treatment of outgoing particles.
• Field source computation. In Maxwell's equations, the source terms (charge and current densities) are computed on a given node from the positions and speeds of the particles in the cells which contain this node. The field source computation represents about 20–30% of the global time.
• Electromagnetic field computation. The electromagnetic fields are computed using Maxwell's equations. Ampère's equation (2c) gives E(t_n) from the values of E(t_{n-1}), B(t_{n-1/2}) and j(t_{n-1/2}). We have to add a correction to the electric field in order to satisfy Eq. (2a); this correction comes from Poisson's equation and is computed with a conjugate gradient method. With Faraday's law (2d), we compute B(t_{n+1/2}) from E(t_n) and B(t_{n-1/2}) (a 1D sketch of this leapfrog update is given after Fig. 2). Finally, we apply the boundary conditions, which model the laser pulse and the physical domain limits, to the electromagnetic field. Because of the Poisson's equation resolution, the time spent in this step can vary from 10% to 50% of the whole simulation time.

This scheme is iterated for thousands of time steps. The resulting simulations need too much memory and too many operations to be handled by one processor; we must therefore use parallelization techniques.
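As a hedged illustration of the field interpolation (gather) step, a minimal trilinear, CIC-like gather of one field component to a particle position could look as follows; the array layout, cell sizes and the neglect of the Yee staggering are simplifications of ours, not the actual CALDER routines.

#include <math.h>
#include <stddef.h>

/* Illustrative trilinear gather of one field component, stored as
 * f[i + nx*(j + ny*k)] on a regular grid of spacing dx, dy, dz with origin 0.
 * The caller must guarantee that the particle lies strictly inside the grid;
 * the staggering of the Yee lattice is ignored in this sketch. */
double gather_trilinear(const double *f, int nx, int ny,
                        double dx, double dy, double dz,
                        double xp, double yp, double zp)
{
    int i = (int)floor(xp / dx), j = (int)floor(yp / dy), k = (int)floor(zp / dz);
    double wx = xp / dx - i, wy = yp / dy - j, wz = zp / dz - k;   /* weights in [0,1) */
    double v = 0.0;

    for (int dk = 0; dk <= 1; dk++)
        for (int dj = 0; dj <= 1; dj++)
            for (int di = 0; di <= 1; di++) {
                double w = (di ? wx : 1.0 - wx) * (dj ? wy : 1.0 - wy) * (dk ? wz : 1.0 - wz);
                v += w * f[(size_t)(i + di) + (size_t)nx * ((size_t)(j + dj) + (size_t)ny * (size_t)(k + dk))];
            }
    return v;    /* field value seen by the particle */
}

Sorting the particles by cell (Section 3.1) makes consecutive calls touch the same eight grid values, which is precisely what reduces the cache misses discussed later.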


Fig. 1. Time discretization and computation sequence.

Fig. 2. Computation and communication steps for one iteration.
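The leapfrog field update of the last step can be illustrated, under strong simplifications (one dimension, normalized units c = 1 and ε0 = 1, boundary nodes left to the boundary-condition treatment), by the following Yee-type sketch; it is not the CALDER 3D update.

/* Illustrative 1D Yee-type leapfrog update of Eqs. (2c)-(2d), normalized
 * units (c = 1, eps0 = 1).  Ez lives on integer nodes, By on half-integer
 * nodes; jz is the current density deposited by the particles.  Boundary
 * nodes (i = 0 for Ez, i = nx-1 for By) are left to the boundary conditions. */
void yee_step_1d(int nx, double dt, double dx,
                 double *ez, double *by, const double *jz)
{
    /* Ampere (2c): E(t_n) from E(t_{n-1}), B(t_{n-1/2}) and j(t_{n-1/2}) */
    for (int i = 1; i < nx; i++)
        ez[i] += dt * ((by[i] - by[i - 1]) / dx - jz[i]);

    /* Faraday (2d): B(t_{n+1/2}) from E(t_n) and B(t_{n-1/2}) */
    for (int i = 0; i < nx - 1; i++)
        by[i] += dt * (ez[i + 1] - ez[i]) / dx;
}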

2.2. Parallelization overview

The CALDER code is parallelized using a spatial domain decomposition, which is orthogonal and regular. In the case of 3D domains, we generally use a 2D decomposition perpendicular to the laser beam propagation (a (y, z) decomposition). Each processor involved in the simulation is in charge of one subdomain, of the particles located in this subdomain and of the corresponding fields. The information shared between subdomains is transferred by message passing. As shown in Fig. 2, we have introduced several communication steps to maintain the coherence between the subdomains.


In the first communication step, the particles moving to another subdomain are sent to their new owner processor. The particles moving out of the discretized physical domain are treated according to the simulation boundary conditions. After the field source computation, the border source contributions are communicated between neighboring subdomains and locally summed. The last step, the field computation, implies different kinds of communications. The border fields are communicated between neighbors and buffered into "ghost" cells; these are standard point-to-point communications like those previously described. There are also global reductions to compute the dot products in the conjugate gradient solver. All these steps may synchronize the processes and reduce the performance of the parallel code if they are not implemented carefully.
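As a hedged sketch of the second communication step (exchanging and summing border source contributions), the routine below uses a Cartesian communicator built over the (y, z) process grid; the buffer layout, function name and the use of MPI_Sendrecv are illustrative choices, not the actual CALDER communication code.

#include <mpi.h>

/* Illustrative exchange-and-sum of border source contributions along one
 * dimension of a 2D Cartesian communicator (e.g. created by MPI_Cart_create).
 * ghost_lo/ghost_hi hold the contributions that this subdomain's particles
 * deposited beyond its lower/upper border; real_lo/real_hi are the matching
 * border cells owned by this subdomain; recv is scratch of size 'count'. */
static void exchange_and_sum(MPI_Comm cart, int dim, int count,
                             double *ghost_lo, double *ghost_hi,
                             double *real_lo, double *real_hi, double *recv)
{
    int lo, hi;
    MPI_Cart_shift(cart, dim, 1, &lo, &hi);   /* ranks of lower and upper neighbors */

    /* my lower ghosts belong to the lower neighbor; the upper neighbor's
       lower ghosts belong to my upper border cells                        */
    MPI_Sendrecv(ghost_lo, count, MPI_DOUBLE, lo, 0,
                 recv,     count, MPI_DOUBLE, hi, 0, cart, MPI_STATUS_IGNORE);
    for (int i = 0; i < count; i++) real_hi[i] += recv[i];

    /* and symmetrically for the other border */
    MPI_Sendrecv(ghost_hi, count, MPI_DOUBLE, hi, 1,
                 recv,     count, MPI_DOUBLE, lo, 1, cart, MPI_STATUS_IGNORE);
    for (int i = 0; i < count; i++) real_lo[i] += recv[i];
}

On a non-periodic Cartesian topology, MPI_Cart_shift returns MPI_PROC_NULL at the domain edges, so the same call handles the physical boundaries without special cases.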

3. High-performance implementation

Since there is no low-cost step in the simulation scheme, we have to reduce the cost of every computation step and of the associated communications. Firstly, to optimize the interpolation steps, we use a particle sort that greatly improves the memory accesses. Secondly, for the particle motion, the optimization consists in dividing the cells into two categories, interior cells and boundary cells; this classification allows better performance in the treatment of outgoing particles and in the communications involved. Finally, we have efficiently preconditioned the parallel conjugate gradient to improve the performance of the field computation step.

3.1. Particle sort and cache-friendly routines

During the interpolation steps, the algorithm traverses the particle array, finds the surrounding nodes and performs the appropriate numerical operations. But when the particles move within the subdomain, their storage order no longer matches their spatial position, and the accesses to the field values in memory cause numerous cache misses. To maintain the coherence between the particle storage and the field storage, and to reduce the number of cache misses, we introduce a particle sort based on the spatial positions. A simple way to sort the particles is to compute each particle's location, put the particle list in a buffer and order it. Given the large number of particles treated, it is difficult to allocate enough memory for this buffer. To avoid this limitation, we use a buffer-free algorithm based on a mechanism of bi-referencing and particle exchange, in which the reordering is done in a single traversal of the particle array (an illustrative implementation is sketched after Fig. 3). As shown in Fig. 3, an unsorted particle A carries the reference of the position where it must be placed (i2) and the reference of the particle B which must substitute for it (i1). When the particles have exchanged their places, the references of A are updated. In the same way, for each exchange in the array, one particle reaches its final place and the references are updated. The number of exchanges is therefore bounded by the number of particles.

Fig. 3. In place bi-referenced sort.
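One way to realize this single-traversal, in-place reordering is an in-place bucket sort by cell index, sketched below; the Particle layout and the precomputed cell field are assumptions of this sketch, not CALDER's data structures. Each exchange places one particle in its final slot, so at most N exchanges are performed for N particles, as stated above.

#include <stdlib.h>

typedef struct { double x, y, z, px, py, pz, w; int cell; } Particle;

/* In-place reordering of the particle array so that particles are grouped by
 * cell index, without any auxiliary particle buffer. */
void sort_by_cell(Particle *p, int n, int ncells)
{
    int *head = calloc((size_t)ncells, sizeof(int));   /* counts, then next free slot */
    int *end  = malloc((size_t)ncells * sizeof(int));  /* one past each cell's range  */

    for (int i = 0; i < n; i++) head[p[i].cell]++;     /* counting pass */
    for (int c = 0, offset = 0; c < ncells; c++) {     /* prefix sums -> cell ranges  */
        int cnt = head[c];
        head[c] = offset;
        offset += cnt;
        end[c] = offset;
    }

    for (int c = 0; c < ncells; c++) {
        while (head[c] < end[c]) {
            int tc = p[head[c]].cell;          /* cell of the particle sitting here */
            if (tc == c) {
                head[c]++;                     /* already where it belongs */
            } else {                           /* exchange it with its destination slot */
                Particle tmp = p[head[tc]];
                p[head[tc]] = p[head[c]];
                p[head[c]] = tmp;
                head[tc]++;                    /* one more particle placed for good */
            }
        }
    }
    free(head); free(end);
}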

The use of a particle sort at each iteration allows another optimization of the interpolation steps: when the particles are unsorted, one has to compute, for each particle, the indices of the nodes involved in the interpolation; many of these index computations are avoided when the particles are sorted by cell location.

3.2. Boundary cell isolation

The treatment of the domain border appears in different steps of the code. For the field values, the access to the border information is quite simple through the grid. To obtain the same simplicity for the particles, we introduce a distinction between boundary cells and interior cells. It consists of a formal isolation of the bordering cells, based on a reordering which does not affect the field storage, and of a physical isolation of the bordering particles at the beginning of the data array. This particle isolation step is done at each iteration, whereas the global sort phase can be performed from time to time. The particle isolation phase allows us to limit the treatment of outgoing particles to the only particles concerned. At the same time, the particle order induced by this sort allows the communications necessary for the bordering particle treatment to be overlapped with the computation of the interior particle motion. This optimization reduces the cost of the particle motion step, which is the most expensive one, by a factor of three (see Section 4.1).
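A hedged sketch of this physical isolation: a single pass that swaps the particles located in boundary cells to the front of the array. The Particle type and the is_boundary_cell predicate are hypothetical, not the CALDER implementation.

#include <stdbool.h>

typedef struct { double x, y, z, px, py, pz, w; int cell; } Particle;

/* Gather the particles located in boundary cells at the beginning of the
 * array; returns their number, now stored in p[0..nb-1]. */
int isolate_boundary_particles(Particle *p, int n,
                               bool (*is_boundary_cell)(int cell))
{
    int nb = 0;                                   /* next free slot in the boundary block */
    for (int i = 0; i < n; i++) {
        if (is_boundary_cell(p[i].cell)) {
            Particle tmp = p[nb]; p[nb] = p[i]; p[i] = tmp;   /* swap to the front */
            nb++;
        }
    }
    return nb;
}

Once the boundary particles are grouped at the front of the array, their exchange can be posted with non-blocking MPI calls and completed after the interior particles have been pushed, which realizes the overlap described above.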


3.3. Poisson solver preconditioning

The Poisson correction consists in solving a sparse linear system resulting from the discretization of the Laplacian operator. To solve this system we use a preconditioned conjugate gradient (PCG). The PCG has been preferred to other iterative solvers (especially multi-grid ones) to allow irregular domain discretizations and, if necessary, a future implementation of dynamic load balancing. Moreover, a PCG solver is easy to use, easy to parallelize and, in our case, it does not require actually building the matrix. Our inexpensive preconditioner is based on an incomplete LU factorization [7].

3.3.1. Approximate ILU(0) preconditioner with constant coefficients

The Laplacian discretization gives an m-diagonal matrix, with m = 5 in 2D and m = 7 in 3D. Since the Laplacian is discretized on regular grids, the matrix has constant diagonals, except for the unknowns located on the subdomain borders. We build the preconditioner as follows. First, we approximate the matrix of Poisson's system by an m-constant-diagonal matrix. Then, we analytically compute an LU factorization of this matrix with no fill-in (ILU(0)). This double approximation avoids storing L and U and avoids all the corresponding memory accesses during the forward and backward substitutions (a sequential sketch of this substitution is given after Fig. 4).

3.3.2. Implementation framework

An efficient preconditioner has to be global, in order to minimize the number of iterations, and parallel, in order to achieve high performance. However, the forward and backward substitutions are sequential processes. In order to exhibit parallelism in these substitutions, we must define a global reordering of the subdomains. To do this, we perform a standard red–black coloring of the subdomains: diagonal neighbors have the same color and face-to-face neighbors have different colors. The unknowns corresponding to red subdomains are ordered with the lowest numbers and those corresponding to black subdomains with the highest numbers. Since they are not connected to each other, all the subdomains with the same color can be eliminated independently in the forward and backward substitutions. A second level of parallelism can be exhibited between red and black subdomains: the red boundary nodes are numbered with the lowest numbers and the black boundary nodes with the highest ones. Thanks to this second level of ordering, the black center nodes can be eliminated independently from the red subdomain nodes, and the forward and backward substitutions can be efficiently parallelized. As shown in Fig. 4, the black center nodes are not related to any node with a lower number, so they can be treated in parallel with the red nodes. After their substitution, the black boundary nodes can be treated without any delay because the substitution of the red boundary nodes and the communication of the resulting vector have already completed. Since the preconditioner is not stored, the reordering is applied directly on the fly. This solution is simple, it can be easily optimized by the compiler and it is not expensive in terms of memory accesses. Finally, this preconditioner is inexpensive: a preconditioned iteration costs about twice as much as an unpreconditioned one, while it reduces the number of iterations needed by a factor of about three (see Section 4).


Fig. 4. Double ordering of the global preconditioner.

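To make Section 3.3.1 concrete, here is a sequential, single-subdomain sketch of applying an approximate constant-coefficient ILU(0) for the 3D 7-point Laplacian. The stencil normalization (diagonal 6, off-diagonals -1) and the constant pivot d = 3 + sqrt(6) (the fixed point of d = 6 - 3/d) are illustrative choices of ours; the paper gives neither the actual coefficients nor the red–black parallel variant of Section 3.3.2, which is omitted here.

#include <math.h>
#include <stddef.h>

/* z <- M^{-1} r with M = L U, where L is unit lower triangular with constant
 * off-diagonal entries -1/d and U is upper triangular with constant diagonal d
 * and off-diagonal entries -1; y is scratch of size nx*ny*nz.  Nothing is
 * stored for L and U: only the constant d is used. */
void ilu0_const_apply(int nx, int ny, int nz, const double *r, double *y, double *z)
{
    const double d = 3.0 + sqrt(6.0);    /* fixed point of d = 6 - 3/d */
    #define IDX(ix, iy, iz) ((size_t)(ix) + (size_t)nx * ((size_t)(iy) + (size_t)ny * (size_t)(iz)))

    /* forward sweep: solve L y = r in lexicographic order */
    for (int iz = 0; iz < nz; iz++)
        for (int iy = 0; iy < ny; iy++)
            for (int ix = 0; ix < nx; ix++) {
                double acc = r[IDX(ix, iy, iz)];
                if (ix > 0) acc += y[IDX(ix - 1, iy, iz)] / d;
                if (iy > 0) acc += y[IDX(ix, iy - 1, iz)] / d;
                if (iz > 0) acc += y[IDX(ix, iy, iz - 1)] / d;
                y[IDX(ix, iy, iz)] = acc;
            }

    /* backward sweep: solve U z = y in reverse order */
    for (int iz = nz - 1; iz >= 0; iz--)
        for (int iy = ny - 1; iy >= 0; iy--)
            for (int ix = nx - 1; ix >= 0; ix--) {
                double acc = y[IDX(ix, iy, iz)];
                if (ix < nx - 1) acc += z[IDX(ix + 1, iy, iz)];
                if (iy < ny - 1) acc += z[IDX(ix, iy + 1, iz)];
                if (iz < nz - 1) acc += z[IDX(ix, iy, iz + 1)];
                z[IDX(ix, iy, iz)] = acc / d;
            }
    #undef IDX
}

Within each PCG iteration this routine would play the role of z <- M^{-1} r; because L and U reduce to constants, no preconditioner matrix is ever stored, which is the point of the double approximation above.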

4. Numerical experiments

In this section, we first present a performance analysis of the different optimizations discussed in this paper. This analysis is based on results obtained at CINES (Montpellier, France) on an IBM SP3 with 28 16-way NH2 SMP nodes (16 × 1.5 Gflops peak per node), each with 16 GB of memory, connected by a Colony switch. Secondly, we present a typical physical simulation performed with CALDER on the TERA supercomputer at CEA/DIF (4-way Compaq ES45 SMP nodes connected by a Quadrics switch).

4.1. Performance analysis of the proposed optimizations

As shown in the previous section, several optimizations are applied to different steps of the code and they concern different aspects: communications, floating-point operations, memory accesses and solver convergence. As a result, the impact of these optimizations can differ greatly from one simulation to another. For that reason, let us consider each one separately. The isolation of boundary cells (see Section 3.2) has led to an important reduction of the treatment of outgoing particles. This reduction depends on the ratio between the number of boundary cells and the number of center cells.


Smaller values of the ratio correspond to a more efficient optimization. We generally achieve a factor of 5 to 10 on the outgoing particle treatment time; on large 2D cases the gain reaches 20% of the global cost. The benefits obtained at each iteration from the particle sort (see Section 3.1) and the use of cache-friendly routines depend on the parameters of the simulation, essentially the number of cells and the number of particles per cell. These benefits are illustrated in Table 1, which shows the average time in microseconds needed to interpolate one particle with or without the sort step and with or without the cache-friendly routines. The pool of cases covers different numbers of cells and different numbers of particles per cell. One can see that the sort step provides better performance on both interpolation steps; it also prevents the performance from degrading when the number of cells increases. Moreover, the cache-friendly routines amplify the effect of the sort step, and the more the cells are loaded with particles, the more useful the cache-friendly routines are. For 3D cases, the preconditioning of the conjugate gradient for the Poisson correction divides the number of iterations by three and reduces the time spent by 30–50%. This performance is illustrated in Fig. 5 and Table 2. The results presented in Fig. 5 show the numbers of iterations needed by the conjugate gradient and by the preconditioned conjugate gradient to achieve the Poisson correction.

Table 1
Average sequential time (μs) of interpolation steps per particle and per iteration

                  Field interpolation                Field source computation
No. of cells:     2.5·10^5   5.1·10^5   1.0·10^6     2.5·10^5   5.1·10^5   1.0·10^6

Simulation without sort step and without cache-friendly routines
10 part./cell     2.14       2.49       2.98         1.22       1.43       1.87
20 part./cell     2.06       2.38       –            1.15       1.35       –
30 part./cell     2.03       2.34       –            1.12       1.31       –
40 part./cell     2.00       –          –            1.11       –          –
60 part./cell     2.01       –          –            1.14       –          –

Simulation with sort step and without cache-friendly routines
10 part./cell     1.75       1.81       1.85         1.04       1.07       1.13
20 part./cell     1.71       1.75       –            0.98       1.00       –
30 part./cell     1.70       1.74       –            0.96       0.98       –
40 part./cell     1.74       –          –            0.97       –          –
60 part./cell     1.68       –          –            0.94       –          –

Simulation with sort step and with cache-friendly routines
10 part./cell     1.33       1.35       1.38         0.88       0.90       0.95
20 part./cell     1.16       1.18       –            0.68       0.71       –
30 part./cell     1.14       1.13       –            0.63       0.63       –
40 part./cell     1.12       –          –            0.59       –          –
60 part./cell     1.10       –          –            0.56       –          –

– means out-of-memory cases.


Fig. 5. Comparative efficiency (nb. of iterations) of CG (left) and PCG (right) solver over different discretizations of the same simulation case.

Table 2
Scalability of the preconditioned Poisson solver (the infinite norm of the residual is used as stopping criterion). CG = conjugate gradient solver; PCG = preconditioned CG solver.

3D domain with 440 × 120 × 120 cells ((y,z) decomposition)

Precision of Poisson              5·10^-2                             5·10^-3
Solver                            CG       PCG      PCG      PCG      CG       PCG      PCG      PCG
Processor number                  32       32       64       144      32       32       64       144
Average iteration number          7.20     1.68     1.64     1.71     32.79    9.31     9.52     9.82
Average time/it./cell (μs)        0.56     1.52     1.51     2.38     0.38     0.63     0.64     0.96
Global iteration number           14,390   3365     3287     3419     65,577   18,616   19,047   19,646
Wall clock time (s)               1589     1012     491      358      4970     2333     1200     826

Larger 3D domain with 440 × 240 × 240 cells ((y,z) decomposition)

Precision of Poisson              5·10^-2                             5·10^-3
Solver                            CG       PCG      PCG      PCG      CG       PCG      PCG      PCG
Processor number                  64       64       128      144      64       64       128      144
Average iteration number          7.17     1.52     1.51     1.61     31.32    9.26     9.18     9.82
Average time/it./cell (μs)        0.69     1.86     1.91     1.81     0.59     0.84     0.85     0.85
Global iteration number           14,340   3031     3017     3232     62,646   18,525   18,354   18,778
Wall clock time (s)               3942     2228     1141     1027     14,710   6191     3097     2794


Both solvers are compared over different discretizations of the same physical case and for different precisions (the infinite norm of the residual is used as the stopping criterion). These simulations were performed on the 16 processors of one SMP node. Fig. 5 shows that the performance of the preconditioner is stable and that, in all cases, it reduces the number of iterations by a factor of three. Considering the different approximations used to build the preconditioner, especially on the subdomain borders, one could expect it to be sensitive to the subdomain size. Table 2 presents a scalability study for two cases. The first case is a 3D domain with 440 × 120 × 120 cells and a 2D decomposition ((y, z) decomposition) running for 2000 time steps. We compare the performance of our preconditioner on 32, 64 and 144 processors with the performance of a standard conjugate gradient solver on 32 processors. These measurements are performed for two correction precisions. We perform the same comparison on a larger case, 440 × 240 × 240 cells, with the same discretization steps and the same decomposition, on 64, 128 and 144 processors. The comparison of the average number of iterations per time step reflects the parallelism overhead in terms of convergence, while the comparison of the average time spent per cell and per iteration (in microseconds) reflects the parallelism overhead due to communications. For the first case presented in Table 2, the gain of the preconditioner in terms of convergence remains stable up to 144 processors, but we face a classical granularity problem on the average time per cell when the number of processors increases. On the larger case, one can see that the benefit of the preconditioner increases with the requested precision, and that our preconditioner achieves good scalability, up to 144 processors, for the whole Poisson step. Finally, Fig. 6 presents some global results obtained on the larger case described in the previous paragraph. It shows that the share of the wall clock time spent in the different optimized computation steps of the simulation depends only weakly on the granularity of the decomposition.

Fig. 6. Ratio between the wall clock time spent in the different optimized computation steps.


4.2. Physical illustration

A representative use of CALDER is provided by the following study of Forced Laser Wakefield electron acceleration [8]. When a high-intensity, short laser pulse propagates in a low-density plasma, its radiation pressure can displace the electrons on its path, leaving in its wake an electric charge imbalance. The resulting electrostatic wave, called "wakefield", has a longitudinal field and a phase velocity close to the speed of light in vacuum, so that it can efficiently accelerate plasma electrons to relativistic kinetic energies [9]. Kinetic simulations permit us to investigate the detailed mechanism of this electron source. To model recent experiments [8], we simulated a volume of 65 μm longitudinally and (32 μm)² transversely of a 2 × 10^19 cm^-3 electron and helium plasma. The longitudinal dimension was meshed at λ0/12, i.e. 0.07 μm, with λ0 = 0.82 μm the laser wavelength in vacuum, and the transverse dimensions at 0.08 μm, resulting in a 154 million cell domain. The plasma was modeled with 615 million macro-particles for the electrons, and as many for the ions. The computation was distributed over 500 processors by dividing the domain into 5 slices longitudinally and 10 slices in each transverse direction. A 30 fs, 0.4 J laser pulse is propagated into this plasma. The density is sufficiently low for the pulse to propagate almost at the speed of light in vacuum. In order to compute the laser–plasma coupling over a longer time and distance, we used the "moving window" capability of the code, translating the simulation box longitudinally so as to keep up with the pulse as it propagates. At the end of the computation, after 24,000 time steps or 3.1 ps of simulated time, the pulse had propagated over roughly 0.9 mm of plasma. On 500 processors of the CEA/DIF TERA supercomputer, the whole computation time is 1.9 × 10^5 s. The laser pulse can propagate over a relatively long distance in the lower density plasma: 61% of the incident energy is transmitted up to 675 μm. This guided propagation takes place over a distance much larger than the characteristic diffraction length of the pulse in vacuum. It is made possible by a nonlinear effect in the plasma, which acts as a focusing lens on the beam and permits guided energy transport over long distances. A similar computation was carried out with a slightly less energetic laser pulse (0.3 J) on a denser plasma at 5 × 10^19 cm^-3. Even though it also shows evidence of guided propagation, the laser energy is more rapidly absorbed or scattered in the higher density target, so that only 15% of the incident energy is transmitted through the first 675 μm of plasma. During its propagation, the pulse displaces the plasma electrons it encounters, exciting a strong density modulation which is clearly apparent in Fig. 7. The electrostatic field associated with this density modulation can reach a very high amplitude, peaking at an accelerating (for electrons) value close to 1.4 TV/m just behind the laser pulse. This is four orders of magnitude larger than the fields achieved in conventional accelerators. In such a large electric field, some plasma electrons are rapidly accelerated to relativistic energies. Even though the effective acceleration distance is very small, less than 1 mm, efficient acceleration is observed in the computation, up to 200 MeV. Similar energies have been observed experimentally [8]. The characteristics of the electron acceleration are related to the wakefield amplitude and hence to the strength of the laser pulse that drives it. Self-focusing of the pulse is stronger when it is described in three-dimensional geometry, and for this reason we believe that 3D modeling is necessary to make predictive computations of this electron acceleration scheme.


Fig. 7. Electron density map in a Forced Laser Wakefield electron acceleration.

Overall, the results of our simulations are in very good qualitative and quantitative agreement with experimental observations made under similar conditions [8,10].

5. Conclusion

The predictive capability of CALDER will be most helpful to further understand the physics underlying these electron sources, and to optimize them for a variety of applications in basic and applied physics, chemistry, or medicine [11,12]. The optimizations performed on CALDER led to a time reduction of the most expensive computation steps. As a perspective, we intend to use dynamic load balancing for the particles in the subdomains, in simulations with highly irregular behavior.

References

[1] M.D. Perry, G. Mourou, Terawatt to petawatt subpicosecond lasers, Science 264 (1994) 917.
[2] C.K. Birdsall, A.B. Langdon, Plasma Physics via Particle Simulation, McGraw-Hill, New York, 1985.
[3] A. Pukhov, Three-dimensional electromagnetic relativistic particle-in-cell code VLPL (Virtual Laser Plasma Lab), J. Plasma Phys. 61 (1999) 425.
[4] P.C. Liewer, V.K. Decyk, A general concurrent algorithm for plasma particle-in-cell simulation codes, J. Comput. Phys. 85 (1989) 302.
[5] P. Liewer, J. Wang, V. Decyk, 3D electromagnetic plasma particle simulations on a MIMD parallel computer, Comput. Phys. Commun. 87 (1995) 35.
[6] MPI: a message passing interface standard, Int. J. Supercomput. Appl. (1994).
[7] Y. Saad, Iterative Methods for Sparse Linear Systems, second ed., January 2000.
[8] E. Lefebvre, V. Malka, S. Fritzler, et al., Wave-breaking of a laser wakefield forced by an intense ultrashort laser pulse, Science 298 (2002) 1596.
[9] T. Tajima, J.M. Dawson, Laser electron accelerator, Phys. Rev. Lett. 43 (1979) 267.


[10] J.-R. Marques, J. Faure, V. Malka, et al., Characterization of electron beams produced by ultrashort (30 fs) laser pulses, Phys. Plasmas 8 (2001) 2605.
[11] V. Malka, A. Rousse, S. Fritzler, K. Ta Phuoc, E. Lefebvre, Ultra-short electron bunches generated with high-intensity lasers for injectors and X-ray sources, Applied Physics Letters, to be published.
[12] E. Fourkal, B. Shahine, M. Ding, J.S. Li, T. Tajima, C.-M. Ma, Particle in cell simulation of laser-accelerated proton beams for radiation therapy, Med. Phys. 29 (2002) 2788.
