Understanding architectural tradeoffs necessary to increase climate model intercomparison efficiency


Chris A. Mattmann, Amy J. Braverman, and Daniel J. Crichton
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
e-mail: [email protected]

Abstract

NASA's Jet Propulsion Laboratory, in partnership with Lawrence Livermore National Laboratory, has been leading an effort to allow remote sensing data available from NASA satellites to be easily compared with climate model outputs available from the DOE-funded Earth System Grid, a national asset in climate science. This partnership is timely, with the Intergovernmental Panel on Climate Change (IPCC) 5th Assessment Report (AR5) in active discussion and the metrics to better understand Earth's climate under formulation. JPL's project, titled the Climate Data eXchange (CDX), provides an easy-to-use software framework for climate scientists to rapidly integrate and evaluate the efficacy of observational data as applied to climate models.

Keywords: CDX, OODT, Climate Research, PCMDI, Software Architecture.

Introduction

The Jet Propulsion Laboratory (JPL) has within the last year initiated an effort to increase the use of its observational data in the improvement and analysis of climate model outputs. This effort, known as the Climate Data eXchange (CDX), is a multi-institutional collaboration involving representatives from JPL and from the Program for Climate Model Diagnosis and Intercomparison (PCMDI) at Lawrence Livermore National Laboratory (LLNL).

Our early focus in the context of CDX has been on NASA Level 2 observational data products. These products vary in a number of ways, including: (1) format – many of the products are stored in the Hierarchical Data Format (HDF) [For98], others in netCDF [RD90], with variation even between software versions that generated these output files within the same format; (2) geographic distribution – most observational data products are co-located with their scientific discipline expertise, to increase the yield of promising scientific results and to cut down on the effort for a science user to make progress; (3) data access mechanism – some data products are available from sophisticated web service interfaces, e.g., OPeNDAP, while others are not, requiring a user to fill an online web ordering "cart" and await an email notification indicating availability at a later date; and (4) size – depending on the frequency of the instrument's orbit and the characteristics of the mission, including the way that the instrument "sees" the Earth, the sheer volume of the Level 2 data can vary widely, ranging from megabytes (MB) per product to gigabytes (GB). These four dimensions are just a sampling of the characteristics of Level 2 observational data.

The goal of CDX is to deliver an open source software toolkit that alleviates as much of the complexity of dealing with Level 2 observational data as possible, and facilitates its comparison to model outputs. To this end, there are two fundamental subsystems within CDX: (1) a Client Toolkit – an easy-to-install software package providing a set of CDX-enabled familiar commands (ls, get, subset, find, mask, locate) for science users to integrate effectively into their Python, IDL and Matlab environments; and (2) a CDX Gateway web service package that is deployed at each data "node" made part of the CDX system. Gateway services implement the server portion of the functionality needed by the client toolkit commands.

To date, we have successfully leveraged our early prototypes in subsetting, accessing and using NCAR CCSM model output data from PCMDI in time series generations involving AIRS Level 2 and 3 observational data sets. CDX has increased our scientists' efficiency more than a hundred-fold, decreasing data access times from nearly 9 hours to 5 minutes in the generation of a time series comparison of a month's worth of AIRS data to its NCAR CCSM model output counterpart. In the remainder of this paper, we provide background on CDX in the form of its lineage. We follow by discussing in detail the architectural tradeoffs faced during time series generations of AIRS data and NCAR CCSM model output. The paper concludes by pointing the reader to future work.
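The four dimensions above can be captured in a simple product descriptor. The sketch below is illustrative only; the field names and the two example products are our assumptions, not part of CDX:

```python
from dataclasses import dataclass

@dataclass
class Level2Product:
    """Illustrative descriptor for a Level 2 observational data product."""
    fmt: str        # storage format, e.g. "HDF4", "HDF5", "netCDF"
    archive: str    # hosting data center (products sit with discipline expertise)
    access: str     # access mechanism, e.g. "OPeNDAP" or "web-cart"
    size_mb: float  # per-product volume, ranging from megabytes to gigabytes

# Two hypothetical products illustrating the spread across the four dimensions.
airs_granule = Level2Product(fmt="HDF4", archive="NASA DAAC", access="web-cart", size_mb=60.0)
model_slice = Level2Product(fmt="netCDF", archive="PCMDI", access="OPeNDAP", size_mb=2048.0)
```

A toolkit that hides this heterogeneity must normalize all four fields behind a common interface, which is exactly the role the CDX gateways play below.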

Background and Related Work

In this section we describe some of the seminal work in the area of infrastructures to support distributed data processing and integration, on which the CDX software is built. Specifically, we focus on the Object Oriented Data Technology software, or OODT [MCMH06]. We also point the reader to software for parsing and extracting information from scientific data, specifically the netCDF [RD90] and HDF suites of software. The section culminates by discussing OPeNDAP [CGS03], a marriage of web services and scientific software that provides capabilities (e.g., subsetting) that CDX leverages in its architecture.

The OODT software packages [MCMH06] comprise a framework that enables dataset resources and products to be exchanged through distributed systems. These distributed systems allow users to request and retrieve data products without a priori knowledge of the dataset's physical location (i.e., its originating repository and registry). The framework also facilitates the creation of hierarchical views into related data products, enabling the creation of collections of data products based on, for example, sub-discipline. The OODT software framework comprises two families of software components: (1) an information integration family, which consists of three major services – a query system, a product server and a profile server – designed to enable the creation of a standard data dictionary based on metadata to describe data resources; and (2) a distributed data processing and ingestion family, consisting of a catalog and archive server (CAS) [MFC+09] and its set of services. All OODT components are architecturally focused on enabling a data delivery and processing system that is scalable and extensible. Each component in the framework exchanges data based upon metadata definitions, defining how data is required as well as how data is encoded at each individual node. The nature of the OODT components is to create a method for the homogeneous display and ingestion/processing of disparate and dissimilar data products based upon standardized metadata.

As a result, the OODT components must be configured to work with a given data center's unique file system layout and data resources. OODT provides ample documentation and source code to create such configurations, utilizing XML files to drive the majority of configuration parameters, with more in-depth customization possible via the configurable OODT Java code. CDX builds on top of the OODT framework to offer specific deployments of OODT profile and product services (called "gateways" and elaborated upon in Section 3) for climate data available from the AIRS, MLS, MISR and CloudSAT missions within NASA. Further, CDX contributes a specific set of client command line tools (also elaborated in Section 3) that leverage the OODT profile/product client services to remotely communicate with CDX gateways available in the network, turning command line calls for subsetting or metadata retrieval into a sequence of calls to gateway services.

When dealing with scientific data, the de facto standards for representation and format are the Hierarchical Data Format (HDF) standard [For98], versions 4 and 5, and the network common data format (netCDF) standard [RD90]. Both HDF and netCDF abstract away the underlying data representation and format, allowing it to be independent of the low-level storage architecture. HDF and netCDF both provide application programming interfaces (APIs) allowing users to retrieve observational and model output data in the form of named sequences, n-dimensional arrays, grids, and scalar values. In short, the goal of both HDF and netCDF is to provide a single uniform package covering a set of observations or model output for a particular space/time boundary. Whereas HDF has been used extensively in the observational data community, the format of choice in the climate modeling community has been netCDF. CDX builds on OODT components to allow users to retrieve either HDF or netCDF data, independent of its location (using the product server component), and to extract metadata (using the profile server component) for use in search and discovery of the scientific data.

The OPeNDAP software [CGS03] is a web service and associated backend infrastructure for delivering scientific data, independent of its underlying format. OPeNDAP provides a common data model containing a subset of scientific data structures, including arrays, grids, and sequences, and a query language for issuing constraints on those data structures. OPeNDAP reduces the difficulty of moving scientific data across the network by allowing users to isolate only the portions of particular data files in which they are interested (typically via constraints on the space or time of the data within). OPeNDAP's backend server allows arbitrary data formats to be "plugged in" via a common handler interface, affording easy extension of the software as new data formats are obtained and the desire to make them available via OPeNDAP is expressed. CDX leverages OPeNDAP as the subsetting mechanism of the underlying CDX gateway product services, if OPeNDAP is available at a particular node.
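As an illustration of the OPeNDAP constraint style a gateway can delegate to, the snippet below builds a subsetting URL using hyperslab projections in OPeNDAP's `[start:stride:stop]` array syntax. The server host is taken from the deployment names in Fig. 2; the dataset path and variable name are hypothetical:

```python
def opendap_subset_url(base, dataset, var, ranges):
    """Build a DAP2-style constraint URL selecting a hyperslab of `var`.

    `ranges` is a list of (start, stride, stop) index tuples, one per array
    dimension, rendered as OPeNDAP [start:stride:stop] projections.
    """
    proj = "".join(f"[{a}:{b}:{c}]" for a, b, c in ranges)
    return f"{base}/{dataset}.dods?{var}{proj}"

# Hypothetical AIRS-like request: first time step, a 10x10 lat/lon index window.
url = opendap_subset_url(
    "http://airscdx.jpl.nasa.gov/opendap",  # host from Fig. 2; path below is assumed
    "airs/2009/01/granule042",              # made-up dataset path
    "water_vapor",                          # made-up variable name
    [(0, 1, 0), (0, 1, 9), (0, 1, 9)],
)
```

Only the cells named by the projection cross the network, which is the property CDX exploits in the subsetting architectures discussed later.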

The CDX Framework

CDX is a distributed client/server based software environment with two major components. CDX Gateways (shown in the lower portion of Fig. 1) are web services that are co-located with each climate data source that is brought online in the CDX virtual environment. Each gateway has two major components, built on top of NASA's Object Oriented Data Technology (OODT) grid middleware [MCMH06]: a product server, responsible for getting data, subsetting data, listing data, and check-summing it; and a profile server, responsible for providing free-text, facet and forms-based query of the underlying metadata, and for getting and subsetting the metadata. CDX gateways can plug into existing services that provide these functionalities (such as OPeNDAP [CGS03] for subsetting, or Apache Lucene/SOLR for searching), or the gateway can provide the services itself if necessary.

The other major element of CDX is the CDX Client Toolkit (shown in the upper portion of Fig. 1). The client toolkit is an easily installable set of client commands (and an associated application programming interface, or API) for interacting with the data made available by the gateways. The client tools available include:

cdxls – allows for virtual listing of data from all gateways. cdxls communicates with the remote gateway's product server component and remotely calls its list data functionality.

cdxget – allows for remote data downloading independent of gateway location. cdxget communicates with the remote gateway's product server component and remotely calls its get data functionality, which allows for full data transfers and partial subsetting.

cdxsubset – allows for spatial/temporal subsetting of data. This is typically provided via some underlying service, e.g., OPeNDAP, or a custom-developed portion of the gateway software itself. cdxsubset interacts with the remote gateway's product server component and calls its subset functionality.

cdxfind – allows for free-text searches for data, based on metadata made available via the remote gateway's profile server component and its free-text functionality. This functionality is typically provided by an existing free-text engine, such as Apache Lucene or Apache SOLR.

cdxdiscover – allows for facet-based searching for data.

cdxlocate – allows users to discover observational data matching a given spatial/temporal and variable constraint.

cdxrangequery – allows for subsetting of data based on a variable range (e.g., pressure < 20).

cdxmask – responsible for identifying cells in a space-time model grid where observational data intersects.

While the gateway services unify the underlying information model used by disparate observational dataset providers, the client toolkit provides climate modelers with command-line tools that can be easily incorporated into existing modeling efforts. In particular, we have recently leveraged CDX as a means of making AIRS Level 3 observational data available on the Earth System Grid (ESG) [BBB+07], in an effort to expose observational data to the broader climate modeling community, focusing on the upcoming AR5 runs led by the IPCC. Leveraging publishing software provided by the ESG, we have created a demonstration system in which AIRS data is selected and obtained via CDX client toolkit commands, specifically cdxls and a combination of cdxsubset and cdxget. After the data is obtained, it is then sent via the ESG publishing software running locally at JPL to the ESG gateway running at Lawrence Livermore National Laboratory (LLNL).

Figure 1: The CDX architecture. The system draws from both distributed client-server and layered architectural styles. Client software, localized to a scientist's machine, provides the set of eight functions, e.g., cdxget, cdxlocate, etc., shown in the upper region of the figure. To the user, these functions are provided as command line programs. Internally, the command-line programs communicate over the network with CDX gateway deployments, localized near the mission datasets (AIRS, MLS, MISR, CloudSAT) which they expose. The datasets are exposed via two canonical gateway web services: a product server provides the capability to get, subset, list and checksum the data, as shown in the bottom left portion of the figure. The profile server is focused on mission metadata, allowing users to search for it based on free text, facet, and forms-based queries, and then further to subset that metadata as it is returned to the user.

Case Study: Time Series Architecture Deployments

As one means of evaluating the architectural benefits of CDX, we constructed a time series grid statistical simulation spatially and temporally comparing and scoring climate model output from the DOE-funded Earth System Grid project [BBB+07]. The ESG runs large-scale climate models and produces model output data that can be compared with observational data sets taken from the same spatial region over the same temporal constraints to perform model scoring and intercomparison. For the purposes of our prototype, data sets from NASA's Atmospheric Infrared Sounder (AIRS) mission were leveraged. AIRS data includes climate-related measured variables such as water vapor and standard air pressure.

Fig. 2 compares the three architectures (labeled A, B and C in the diagram) used to drive the implementation of our simulation software. In all three variations, the CDX Gridded Time Series was the driver program responsible for orchestrating the AIRS and ESG data assimilation (from the CDX Gateway component), and then evaluating the acquired datasets for spatial and temporal resolution using a legacy function wrapped using CDX (the CDX Spatial Web Service component in Fig. 2). Fig. 2-A demonstrates the first architectural variation, in which all remote AIRS climate datasets within a time range are pulled down to the user machine using CDX client software and spatially evaluated using a remotely accessible service. Fig. 2-B demonstrates the same functional capability, with the exception of extending the spatial service for remote dataset evaluation using OPeNDAP [CGS03]. Fig. 2-C demonstrates the final architecture, integrating the wrapped spatial evaluation functionality into the gridded time series driver program on the client machine.

The three architectural variations, while seemingly the same set of components and connections shifted around, exhibit decidedly different functional properties and implementations. In the first option (A), all the data is pulled down to the local client machine (using the CDX Client Tools cdxget and cdxls, which pull data and remotely list it, respectively). This is by far the slowest and most cumbersome implementation of our time series calculation. The second option (B) provides an architecture that reduces the amount of data that needs to be pulled down, by allowing remote spatial evaluation to access AIRS datasets remotely without pulling them down to the client machine using CDX Client Tools. The final option (C) provides an architecture in which spatial evaluation occurs on the client machine and data access occurs remotely, with only subsets of data sent over the network, as required by the wrapped spatial evaluation function. Architectural option C reduces the amount of time to perform the time series grid calculation and model comparison/scoring from 9 hours (in the case of option A) to a few minutes, improving the efficiency and effectiveness of computing the time series grid.
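To make the client/gateway division concrete, the sketch below mocks the call pattern in plain Python: a toy in-memory "gateway" stands in for the product and profile servers, and thin client functions stand in for cdxls, cdxget and cdxfind. Everything here (the function names, the catalog contents) is our invention for illustration; the real toolkit issues these calls over the network to OODT-based gateway services.

```python
# Toy stand-in for a CDX gateway: a product server over an in-memory catalog
# and a profile server over its metadata. Purely illustrative.
CATALOG = {
    "AIRS.2009.01.granule042": {"mission": "AIRS", "variable": "water_vapor", "size_mb": 60},
    "MLS.2009.01.granule007": {"mission": "MLS", "variable": "temperature", "size_mb": 45},
}

def gateway_list():
    """Product server 'list data' operation: enumerate product identifiers."""
    return sorted(CATALOG)

def gateway_get(product_id):
    """Product server 'get data' operation: return the (mock) product payload."""
    return {"id": product_id, **CATALOG[product_id]}

def profile_find(text):
    """Profile server free-text query over product metadata."""
    return [pid for pid, meta in CATALOG.items()
            if text in pid or text in str(list(meta.values()))]

# Client-side analogues of the toolkit commands, each one a remote call
# to the corresponding gateway service in the real system.
cdxls = gateway_list      # cdxls   -> product server list data
cdxget = gateway_get      # cdxget  -> product server get data
cdxfind = profile_find    # cdxfind -> profile server free-text query

matches = cdxfind("AIRS")
granule = cdxget(matches[0])
```

The point of the pattern is that the client never needs to know where a product physically lives; the gateway co-located with the data answers list, get, and find uniformly.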

Figure 2: Understanding the tradeoffs of CDX deployments for climate scenarios. Each of the three deployment diagrams (A, B, C) shows the CDX Gridded Time Series science code on the user's machine (your.user.machine), the CDX Gateway software and CDX Testbed Data (AIRS, MLS) hosted at airscdx.jpl.nasa.gov and the CDX/IPP Testbed at jpl-esg.jpl.nasa.gov, and the CDX Spatial Web Service: remote in option A, extended with OPeNDAP for remote evaluation in option B, and run locally within the driver program (with OPeNDAP-based remote subsetting) in option C.
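The tradeoff among the three options is essentially one of data movement. The back-of-the-envelope sketch below illustrates it; the granule count, granule size, subset fraction, and link bandwidth are assumed values for illustration, not measurements from this work:

```python
# Illustrative data-movement comparison of options A and C.
# All numbers are hypothetical; only the structure of the tradeoff
# (full-granule transfer vs. subset transfer) reflects the architectures.
GRANULES = 240           # granules of AIRS data in the time range (assumed)
GRANULE_MB = 60.0        # per-granule size in MB (assumed)
SUBSET_FRACTION = 0.002  # fraction of each granule the spatial function needs (assumed)
LINK_MBPS = 4.0          # effective client download rate, MB/s (assumed)

def transfer_seconds(total_mb, rate=LINK_MBPS):
    """Time to move `total_mb` megabytes over the client link."""
    return total_mb / rate

# Option A: every granule is pulled to the client in full.
full_mb = GRANULES * GRANULE_MB
# Option C: only the spatially subset cells cross the network.
subset_mb = full_mb * SUBSET_FRACTION

speedup = transfer_seconds(full_mb) / transfer_seconds(subset_mb)
```

With these assumptions the subset path moves 1/500th of the data, mirroring the factor-of-500 improvement reported for option C: the computation moved to the client, and only the data it actually needs moved at all.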

The use of the CDX approach in this scenario enabled our CDX software to re-locate the CDX spatial computation service, across the architectural options, from a remote service (in which a subset of observation data is sent from the client machine over the network to the spatial service) all the way to the client machine itself, which then requests only subsets of remote data, limiting the bandwidth and time required to compute our output time series and improving upon our initial architecture and implementation by a factor of 500.

Conclusion and Future Work

We reported on the current work to date for the Climate Data eXchange (CDX) effort at NASA's Jet Propulsion Laboratory. Background information on CDX was described, as well as two current prototype efforts: one in the area of disseminating NASA observational data to the climate modeling community, and one in the area of making climate model data easily available to those researchers who desire to compare it with observations. Results and architectural tradeoffs of approaches to deploying and configuring the CDX software in the area of time series generation were presented as an early example of the benefits of the CDX approach, including increased efficiency (via configuration C in Fig. 2) and a reduction in overall program completion time from 9 hours to a matter of minutes.

Leveraging component-based technologies, with little effort we were able to explore the architectural tradeoffs of different CDX service configurations in the context of a real-world climate research use case (time series generation). It is our position that further research be conducted in this area: though our results suggest tremendous benefit from understanding the architectural configuration and deployment of climate web services, the exact tuning parameters defining those optimal configurations have yet to be pinpointed precisely. To this end, we have begun researching the intersection of data movement, software architecture, and statistical methods as a means of identifying the critical dimensions needed to truly improve the overall user experience of climate research. In the end, it is our goal to demonstrably reduce the overall time and effort that climate researchers require to make new observations and discoveries.

Our future work includes precisely identifying the architectural load-bearing walls and tuning parameters to further increase the efficiency of climate research and reduce the barrier to observational-data-to-model intercomparison. To this end, we have begun two efforts. The first is operationalizing our prototype that delivers observational data to the ESG, sending datasets from 10 NASA missions, including parameters such as water vapor, sea surface temperature and other critical variables, to the ESG for dissemination as part of AR5. The second is empirical research in the area of data movement, software architecture and statistics, for the purpose of determining successful architectural configurations, successful methodologies for querying data, and ultimately advanced techniques for remote data analyses that push computations close to the data on which they operate.

Acknowledgments

This effort was supported by the Jet Propulsion Laboratory, managed by the California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

References

[BBB+07] David E. Bernholdt, Shishir Bharathi, David Brown, Kasidit Chanchio, Meili Chen, Ann L. Chervenak, Luca Cinquini, Bob Drach, Ian T. Foster, Peter Fox, Jose Garcia, Carl Kesselman, Rob S. Markel, Don Middleton, Veronika Nefedova, Line Pouchard, Arie Shoshani, Alex Sim, Gary Strand, and Dean Williams. The Earth System Grid: Supporting the next generation of climate modeling research. CoRR, abs/0712.2262, 2007.

[CGS03] Peter Cornillon, James Gallagher, and Tom Sgouros. OPeNDAP: Accessing data in a distributed, heterogeneous environment. Data Science Journal, 2:164–174, 2003.

[For98] B. Fortner. HDF: The Hierarchical Data Format. Dr. Dobb's J. Software Tools and Professional Programming, 1998.

[MCMH06] Chris Mattmann, Daniel J. Crichton, Nenad Medvidovic, and Steve Hughes. A software architecture-based framework for highly distributed and data intensive scientific applications. In ICSE, pages 721–730, 2006.

[MFC+09] C. Mattmann, D. Freeborn, D. Crichton, B. Foster, A. Hart, D. Woollard, S. Hardman, P. Ramirez, S. Kelly, A. Y. Chang, and C. E. Miller. A reusable process control system framework for the Orbiting Carbon Observatory and NPP Sounder PEATE missions. In 3rd IEEE Intl. Conference on Space Mission Challenges for Information Technology (SMC-IT 2009), pages 165–172, 2009.

[RD90] R. K. Rew and G. P. Davis. NetCDF: An interface for scientific data access. IEEE Computer Graphics and Applications, 10(4):76–82, 1990.
