MODTRAN on supercomputers and parallel computers


Parallel Computing 28 (2002) 53–64 www.elsevier.com/locate/parco

Applications

MODTRAN on supercomputers and parallel computers P. Wang *, Karen Y. Liu, Tom Cwik, Robert Green Jet Propulsion Laboratory, California Institute of Technology, MS 168-522, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA Received 9 March 2000; received in revised form 14 December 2000; accepted 21 May 2001

Abstract

To enable efficient reduction of large data sets, such as is done in the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) project at the Jet Propulsion Laboratory (JPL), a high-performance version of MODTRAN is essential. One means to accomplish this is to apply the computational resources of parallel computer systems. In our present work, a flexible, parallel version of MODTRAN has been implemented on the Cray T3E, the HP SPP2000, and a Beowulf-class cluster computer using domain decomposition techniques and the Message Passing Interface (MPI) library. In this paper, porting the sequential MODTRAN to various platforms is discussed; strategies for designing a parallel version of MODTRAN are developed; the detailed implementation of parallel MODTRAN is reported; and performance data for the parallel code on various computers are presented. Near-linear scaling of parallel MODTRAN has been obtained, and comparisons of wallclock time are made among various supercomputers and parallel computers. The parallel version of MODTRAN gives excellent speedup, which dramatically reduces total data processing time for many applications such as the AVIRIS project at JPL. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Domain decomposition; Parallelization; MODTRAN; Radiance; Transmission

1. Introduction

MODTRAN, the moderate resolution transmittance code used in this study, is one of the most widely used radiative transfer codes in the community. This

Corresponding author. Tel.: +818-393-1941; fax: +818-393-3134. E-mail address: [email protected] (P. Wang).

0167-8191/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-8191(01)00128-4


software, developed by the Geophysics Division of the Air Force Phillips Laboratory, is designed to determine atmospheric transmission and radiance. Beginning in the early 1970s, the Air Force Cambridge Research Laboratory initiated a program to develop computer-based atmospheric radiative transfer algorithms. The first attempts were translations of graphical procedures, based on empirical transmission functions and effective absorption coefficients derived primarily from controlled laboratory transmittance measurements. In 1972, LOWTRAN, the Low Resolution Transmittance band model code, was released as the first AF-developed transmittance algorithm. This limited initial effort has progressed to a set of codes and related algorithms (including line-of-sight spherical geometry, direct and scattered radiance and irradiance, non-local thermodynamic equilibrium, etc.) comprising thousands of lines of code and hundreds of subroutines, with improved accuracy, efficiency, and accessibility. These studies have continued over the past 30 years, all for the purpose of solving the atmospheric radiative transfer equations as accurately and efficiently as possible.

In the late 1980s, MODTRAN, the moderate resolution transmittance code, was implemented [1]. Since then, several upgraded versions of MODTRAN have been released [2]. MODTRAN calculates atmospheric transmittance and radiance for frequencies from 0 to 50,000 cm^-1 at moderate spectral resolution, primarily 2 cm^-1 (20 cm^-1 in the UV). MODTRAN was driven by a need for higher spectral resolution than LOWTRAN. Except for its molecular band model parameterization, MODTRAN adopts all the LOWTRAN 7 capabilities, including spherical refractive geometry, solar and lunar source functions, scattering (Rayleigh, Mie, single and multiple), and default profiles (gases, aerosols, clouds, fogs, and rain).
The current version of MODTRAN has been used by thousands of users for various studies, including atmospheric science, environmental hazards, military applications, ecology, geology, remote sensing, and energy deposition. It is one of the most successful radiative transport models in the field.

Recent advances in computing hardware have dramatically affected the prospects of computational science. Massively parallel supercomputers have become powerful tools for numerical simulation, data processing, and other fields. However, the use of parallel systems to perform radiative transfer modeling is still in a rudimentary stage [3] because no parallel application software is available. Currently, the MODTRAN code runs on several computing platforms, including those by Sun Microsystems, Silicon Graphics, HP, IBM-compatible PCs, and others, but it has not run on a parallel system because of the sequential nature of its code. It can take a day to run MODTRAN on a Sun workstation for a medium-size application, since MODTRAN may be called many times for one case study. To take advantage of parallel systems, designing and implementing a well-optimized parallel MODTRAN will significantly improve computational performance and reduce the total research time needed to complete these studies.

In this paper, we report on our experiences running one of the most widely used radiative transfer models, MODTRAN, on a variety of parallel computer systems. One of our objectives is to design a parallel version of MODTRAN that improves the computational efficiency of the code such that one can reduce the time in


conducting scientific studies. We also want to emphasize the portability of the code across a variety of parallel platforms, ranging from the most powerful supercomputers to the affordable desktop parallel PC cluster (Beowulf-class system). Our experiences designing such a parallel version of MODTRAN on a variety of parallel systems are described.

2. Implementation of MODTRAN on supercomputers and parallel systems

2.1. Parallel computing systems

To design an efficient and general parallel code, it is necessary to understand current advanced parallel systems, ranging from the most powerful supercomputers to the affordable desktop parallel PC cluster (Beowulf-class system). Here three typical parallel systems, the Cray T3E, the HP/Convex SPP2000, and the Beowulf cluster system, are considered. A brief description of these systems, which are the major computing resources used for the present study, is given here.

The Cray T3E at the Goddard Space Flight Center, currently one of the most powerful MIMD computers available, has 1024 compute nodes with 600 MFLOPS peak performance and 128 MBytes of memory per node. It is a scalable parallel system with a distributed-memory structure. This machine improves application performance by three to four times over that of the previous-generation Cray T3D.

The 256-processor HP/Convex SPP2000 (Exemplar) at the Jet Propulsion Laboratory and California Institute of Technology has 16 hypernodes, with each hypernode consisting of 16 PA-8000 processors and a single pool of 4 GBytes of shared physical memory. The overall architecture of the Exemplar is a hierarchical Scalable Parallel Processor (SPP). The topology of the 16 hypernodes connected by CTI (Coherent Toroidal Interconnect) is a 4 × 4 toroidal mesh. The Exemplar supports a variety of programming models, including a global shared-memory programming model and an explicit message passing model.
With the growing power and shrinking cost of personal computers (PCs), the availability of fast Ethernet interconnections, and public-domain software packages, it is now possible to combine them to build desktop parallel computers (named Beowulf or PC clusters) at a fraction of what it would cost to buy systems of comparable power from supercomputer companies. A typical Beowulf system, such as Hyglac, the first-generation Beowulf system at JPL, has 16 nodes interconnected by 100Base-T Fast Ethernet. Each node may include a single Intel Pentium Pro 200 MHz microprocessor with a peak speed of 200 MFLOPS, 128 MBytes of DRAM, 2.5 GBytes of IDE disk, a PCI bus backplane, and an assortment of other devices. It is a loosely coupled, distributed-memory system, running message-passing parallel programs that do not assume a shared memory space across processors. Such a system runs the Linux operating system, freely available over the net or in low-cost and convenient CD-ROM distributions. In addition, publicly available parallel processing libraries such as MPI and PVM are used to harness the power of parallelism for large application programs. The Beowulf approach represents a


new business model for acquiring computational capabilities, particularly for small- to medium-sized applications. It complements rather than competes with the more conventional vendor-centric systems-supplier approach. Recently, JPL has also built its second and third generations of Beowulf systems. The newest one is the Pluto system, which has 21 dual Intel Pentium III 800 MHz microprocessors with 2 GBytes of memory for each dual node. With a front-end computer, it has 43 compute nodes interconnected by 100Base-T Fast Ethernet (to be upgraded to Gbit network cards in the future). This system delivers excellent performance at a moderate cost. Nimrod, the second-generation Beowulf at JPL, has 32 Intel Pentium II compute nodes with a 400 MHz clock speed for each node. These Beowulf systems have been used extensively for parallel code development and production runs.

2.2. Porting MODTRAN to various computing systems

All the systems described above support explicit message passing with the Message Passing Interface (MPI) software. Besides these systems, several other supercomputing systems were also considered for the present study, such as the Cray YMP and the Cray J90. To design an efficient parallel version of MODTRAN that runs on many different parallel systems, it is necessary to understand the data structures, the I/O, and the algorithms of MODTRAN. Finding independent data structures and subroutines that can be executed in parallel is essential. Since the sequential code has been used for decades and by thousands of users, maintaining the original coding style and keeping changes minimal are also helpful for existing MODTRAN users.

The first step was porting sequential MODTRAN to various supercomputers and parallel computers using a single computing node. The sequential code has about 40,000 lines of Fortran, and it has been designed and improved by multiple authors since the early 1970s.
It runs well on several platforms, such as those by Sun Microsystems and Silicon Graphics, but it was not a simple matter to compile and run MODTRAN on a single processor of current advanced computing systems. Recently developed compilers for each parallel system have much stricter rules for source code than earlier compilers. Running sequential MODTRAN on recently developed computing systems produced many challenges; it was certainly "non-trivial". Many problems were encountered during the process of porting MODTRAN to the computing systems described earlier. Some of the major problems are reported here.

The compilation of MODTRAN on the various supercomputers and parallel systems was the first problem encountered. Many error messages were generated at the initial compilation on a new system. One of the common problems was uninitialized variables in the MODTRAN code. This was solved through a compiler option: when the code was ready to be compiled, an option dealing with variable initialization had to be used in the compile command if the compiler offers such an option. Besides this initialization problem, there were also other coding problems, such as subroutine names conflicting with those of the system, data-structure inconsistencies, I/O errors, file unit numbers outside the legal system range for some computers, and


some other coding issues. All these problems were solved through minor changes to the code, and finally, after these non-trivial changes, the program compiled successfully. No doubt, besides understanding the original code, some knowledge of new compilers and new computers saves plenty of time when porting an existing large code to a new computing platform.

2.3. Running updated MODTRAN on different systems

After successful compilation of the updated MODTRAN on each system, execution of the code was also a problem. The major problem in executing MODTRAN on different systems was handling the binary input file for each system. To run the code on the Cray T3E, conversion statements were used so that the Sun-format binary input data could be accessed by the Cray T3E. On the Beowulf system, more work was needed to convert the Sun binary input data to the Beowulf binary format, as the Beowulf system reads the Sun binary input data in reverse byte order. A conversion program was designed to convert binary input data from one system to another. On the Exemplar, no problem in using the Sun binary input data was encountered. After all these changes and minor recoding, the updated sequential MODTRAN ran well on the Cray YMP, the Cray J90, the Cray T3D and T3E (one node), the Intel Paragon (one node), the Exemplar (one node), and the Beowulf system (one node).

2.4. Parallel implementation

The next major step was designing an efficient parallel version of MODTRAN that runs on multiple processors with significant speedup. A brief summary of the MODTRAN model is given here. The model calculates up to four types of results for an atmospheric path: atmospheric transmission; atmospheric radiance (self-emission and scattering); scattering of sunlight and moonlight into the path; and transmitted solar irradiance to an observer. A band model (pressure, temperature, and a line width) is used for molecular absorption.
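As an aside on the binary-input conversion of Section 2.3: reading Sun (big-endian) data on an Intel-based (little-endian) Beowulf node amounts to swapping the byte order of each record. A minimal Python sketch of such a converter; the 4-byte float record layout is an illustrative assumption, not MODTRAN's actual file format:

```python
import struct

def swap_float32_records(data: bytes) -> bytes:
    """Reinterpret a buffer of big-endian 4-byte floats as little-endian bytes."""
    n = len(data) // 4
    values = struct.unpack(f">{n}f", data)   # decode assuming big-endian (Sun) order
    return struct.pack(f"<{n}f", *values)    # re-encode in little-endian (Intel) order

# Round trip: a big-endian buffer becomes readable on a little-endian host.
big = struct.pack(">3f", 1.0, 2.5, -4.0)
little = swap_float32_records(big)
assert struct.unpack("<3f", little) == (1.0, 2.5, -4.0)
```

A standalone converter of this kind only needs to be run once per input file, after which the ported code reads the data natively.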
Since the entire package covers various areas and involves several numerical algorithms for different applications, the details of these algorithms are omitted here; they can be found in [1]. Our objective was to design a parallel version of MODTRAN while keeping modifications to the code to a minimum, so that the original numerical algorithms remained unchanged and any MODTRAN user could easily use the parallel version without any specific training in parallel computing. To achieve this objective, we focused on the data structures of the code to discover all possible data dependencies. After investigating the entire package, we chose the spectral variable as our candidate for parallelization. It covers a wide data range, from 0 to 50,000 cm^-1, and extensive computations of atmospheric transmission and atmospheric radiance are required across these spectral values. To achieve load balance and to exploit parallelism as much as possible, a general and portable 1D parallel partitioner based on domain decomposition techniques was designed for the parallel version of MODTRAN. MPI software was used for message


passing. The spectral data array (frequency) in MODTRAN was chosen as the domain decomposition array, so that message passing was minimized and more than 95% of the code executed in parallel on multiple processors. Using this 1D partitioner involved careful calculation of the spectral subrange for each node and consideration of the spectral steps for the redistribution. By distributing the spectral data evenly among the processors (the number of which can be chosen by users according to the total available compute nodes or their application sizes), load balance was achieved. Assume that the spectral range runs from A1 to A2 and that N processors of a parallel system are available. In theory, the computing task for each processor is (A2 - A1)/N of the original computing task, so the total computing time for MODTRAN on a parallel system will be reduced by a factor of N compared to the sequential case.

The entire computation of MODTRAN is carried out by executing the following sequence on a parallel platform:
1. Read the beginning and ending frequencies, the number of processors, and other input data.
2. Use the MPI library to generate the necessary system information, including the total number of processors, each processor's ID number, timing, and other data.
3. Compute the subrange of frequencies for each processor with the 1D partitioner.
4. For each subrange of frequencies, execute the sequential MODTRAN.
5. Execute parallel I/O for each subrange of frequencies.
6. Collect the final results from all processors.

Sequential MODTRAN is an excellent candidate for parallel computing because there is no dependence between frequency data, and only minimal communication is required for the entire computation. Performance data are given in the next section.

3. Results and discussion

3.1. Performance of sequential MODTRAN

Various code performance tests were carried out on several systems for both the sequential and parallel codes.
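The 1D partitioner of step 3 above reduces to splitting the spectral interval [A1, A2] as evenly as possible over N processors. A sketch in Python; the step granularity and all names here are illustrative, not taken from the MODTRAN source:

```python
def partition_spectrum(a1: float, a2: float, n: int, step: float = 1.0):
    """Split [a1, a2] into n contiguous subranges aligned to the spectral step.

    Each processor's share differs by at most one step, which is what
    gives the near-perfect load balance described in the text.
    """
    total_steps = int(round((a2 - a1) / step))
    base, extra = divmod(total_steps, n)  # first `extra` ranks get one extra step
    ranges, start = [], a1
    for rank in range(n):
        count = base + (1 if rank < extra else 0)
        end = start + count * step
        ranges.append((start, end))
        start = end
    return ranges

# 16 processors over the 4000-28,500 cm^-1 test case used in Section 3:
subranges = partition_spectrum(4000.0, 28500.0, 16)
assert subranges[0][0] == 4000.0 and subranges[-1][1] == 28500.0
assert all(r0[1] == r1[0] for r0, r1 in zip(subranges, subranges[1:]))
```

Each rank then runs the unmodified sequential code over its own (start, end) pair, which is why the original numerical algorithms need no changes.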
A model with frequencies ranging from 4000 to 28,500 cm^-1 and some other given input data was tested on these machines. Table 1 lists the wallclock time (CPU time for some cases) of sequential MODTRAN on different systems. A Beowulf system (both Hyglac and Pluto) is a single-user system, while the rest of the platforms are multi-user systems. The timing may vary from case to case in a shared environment, since sequential MODTRAN has heavy I/O requirements that appear in many places in the code and are affected when multiple users are using the system.

Table 1
Sequential MODTRAN on various computing systems

System                               CPU time   Wallclock time
Beowulf (one node) 800 MHz              -             23
Beowulf (one node) 400 MHz              -             38
Beowulf (one node) 200 MHz              -             90
HP SPP2000 (one node) 720 MHz           -            149
Cray T3E (one node) 600 MHz             -            973
Cray J90 220 MHz                       327            407
Cray YMP 330 MHz                       195            851
Sun Sparc 5 Workstation 50 MHz         421            583

From Table 1, it is easy to see that the Cray YMP and Cray J90 give good CPU-time performance, but the overall times are not as good as those of single-user computers such as the Beowulf systems. This is due to the nature of the MODTRAN code: the coupling of computation and I/O in many places, which limits the total performance on supercomputers and parallel systems with shared I/O. If the code executes on supercomputers and parallel systems under a single user, the timing performance would no doubt be improved.

3.2. Performance of parallel MODTRAN

For parallel MODTRAN, timing tests of the same model were performed on the three parallel systems described earlier. Sixteen processors were used for the parallel code, matching the current maximum number of processors on the JPL Hyglac Beowulf system. Fig. 1 shows the real-time comparison among the Cray T3E, the HP SPP2000, and the Beowulf system with different numbers of processors, together with the result of a Sun Sparc 20 Workstation (sequential code) for comparison. The total wallclock time is significantly reduced by using multiple processors. For the real time

Fig. 1. Comparison of time for running MODTRAN on various parallel systems using different numbers of processors.


comparison, the Beowulf system gives the best performance among these three machines, and the Cray T3E ranks worst. This is because each node of the Beowulf system has its own I/O system to handle the heavy I/O of the code, whereas the Cray T3E has limited I/O and also serves multiple users. The results on the HP SPP2000 are very good, but still slightly behind the Beowulf system, since the HP SPP2000 has an I/O interface shared among the processors on each hypernode, which limits performance when the code has heavy I/O. The numerical accuracy of the parallel code was also compared with that of the sequential code, and the results of the parallel version are identical to the sequential numerical results. These are illustrated in Fig. 2.

Fig. 3 shows the speedup of the parallel MODTRAN code on the Cray T3E, the HP SPP2000, and the Beowulf system. It gives excellent speedup versus the number of processors. From the view of scalability, the Cray T3E, known as one of the best scalable parallel systems, gives the best scaling results (nearly linear), and the scaling results on the HP SPP2000 are also excellent. These machines remain the best scalable parallel systems, suitable for running large-scale scientific applications with a large number of processors. The speedup curve for the Beowulf system has a slight non-linear bend when more processors are applied. This is because the communication speed on the Beowulf system is slower than that of the Cray T3E and the HP

Fig. 2. Comparison of the numerical results between the parallel code and the sequential code.


Fig. 3. Speedup of the parallel MODTRAN code on different parallel systems.

SPP2000. Once more processors are used, communication starts playing a role and the speedup is reduced; for the current application, this communication cost is still small. From the above results, a 16-node Beowulf system gives excellent real-time performance for parallel MODTRAN, since the code itself needs only a little communication. On the other hand, the Cray T3E, one of the most powerful parallel systems available, might not be the best choice for this kind of application unless the entire MODTRAN I/O segment is rewritten to use its compiler and hardware to improve performance.

Since the Beowulf system gives the best performance results for MODTRAN, it is also interesting to examine its performance on different cluster systems. Extensive comparison tests were carried out on three Beowulf systems with 200, 400, and 800 MHz clock speeds. Fig. 4 gives a real-time comparison among these machines, while Fig. 5 shows the speedup curves on each system. In Fig. 4, the differences in wallclock time are very consistent with the differences in hardware. In Fig. 5, the three systems give similar performance curves. They all scaled very well up to 32 processors (except that Hyglac has only 16 nodes). With a small number of processors, they all give excellent speedup. When the number of processors is larger than 16, the I/O part plays a significant role in the code. On Hyglac, parallel MODTRAN used a local file system for each node, but on Nimrod and Pluto, a central file system was used. In this case, the I/O part performed multiple file open and close operations, and it also transferred all output data from each node to the master node. At the same time, the computation time became a small portion of the total wallclock time for the entire code because of the large number of processors used.


Fig. 4. Comparison of time for running MODTRAN on various Beowulf systems using different numbers of processors.

Fig. 5. Speedup of the parallel MODTRAN code on different Beowulf systems.

Hence, no significant additional speedup was gained in overall performance. This might be improved by using a local file system to store output data on each node, as on Hyglac, though some post-processing of the entire output data would then be required.
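The flattening described above is the familiar Amdahl effect: once the parallel compute time shrinks, the serial I/O and result-gathering fraction caps the achievable speedup. A rough model in Python; the 5% serial fraction echoes the "more than 95% in parallel" figure of Section 2.4 and is illustrative, not measured:

```python
def amdahl_speedup(p: int, serial_fraction: float) -> float:
    """Predicted speedup on p processors when a fixed fraction stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

s = 0.05  # assumed serial (I/O) share, mirroring the ">95% parallel" estimate
for p in (4, 16, 32):
    print(p, round(amdahl_speedup(p, s), 1))  # 4 -> 3.5, 16 -> 9.1, 32 -> 12.5

# Speedup is capped at 1/s = 20 no matter how many processors are added.
assert amdahl_speedup(32, s) < 1.0 / s
```

Doubling the processor count from 16 to 32 buys only about a third more speedup under this assumption, which is consistent with the observed bend in the curves once I/O dominates.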


4. Conclusion

In the present study, we have successfully ported sequential MODTRAN to various supercomputers and to distributed-memory and shared-memory systems. An efficient, flexible, and portable parallel MODTRAN code has been designed. This is the first parallel version of MODTRAN, building on the sequential MODTRAN that has existed for many years. It gives significant speedup, which will dramatically reduce the total data processing time for the AVIRIS project at JPL and for other applications as well. The comparison of wallclock times for fixed problems among these systems gives very useful information on the speedup performance of these advanced hardware systems. The discrepancies in time among the systems are due to differences in the hardware of each system, the network connection used, and the user environment on each system. The code scales very well on different systems and achieves excellent speedup as the number of processors increases. It can be easily ported to any parallel system that supports a Fortran 77 or 90 compiler and MPI software. In particular, the Beowulf system (a pile of PCs) makes high-performance computing of atmospheric transmission models a reality for the low-cost parallel supercomputing community; more importantly, the performance of parallel MODTRAN on the Beowulf system provides one of the best examples of cluster computing applications. The parallel code has been tested with various input data sets, and the results presented here clearly demonstrate the great potential of applying this approach to various applications. A full implementation of the AVIRIS data reduction chain on a Beowulf-class cluster computer is under investigation.

Acknowledgements

The research described in this paper was performed at the Jet Propulsion Laboratory (JPL), California Institute of Technology, under contract to the National Aeronautics and Space Administration. The Cray supercomputers, the HP SPP2000, and the Beowulf system used to produce the results in this paper were provided with funding from the NASA offices of Mission to Planet Earth, Aeronautics, and Space Science. The authors wish to acknowledge Edith Huang of JPL for help with coding, and Dr. Robert D. Ferraro and Dr. Ray J. Wall of JPL for many valuable comments and suggestions.

References

[1] A. Berk, L.S. Bernstein, D.C. Robertson, MODTRAN: A moderate resolution model for LOWTRAN7, Report GL-TR-89-0122, Air Force Geophys. Lab., Bedford, MA, 1989.
[2] L.S. Bernstein, A. Berk, D.C. Robertson, P.K. Acharya, G.P. Anderson, J.H. Chetwynd, Addition of a correlated-k capability to MODTRAN, in: Proceedings of the 19th Annual Conference on Atmospheric Transmission Models, Phillips Laboratory/Geophysics Directorate, 1996.


[3] P. Wang, K.Y. Liu, T. Cwik, R.O. Green, MODTRAN on parallel systems, in: Proceedings of the 21st Annual Review Conference of Atmospheric Transmission Models, Phillips Laboratory/Geophysics Directorate, 1998, p. 29.
