Servicing Range Queries on Multidimensional Datasets with Partial Replicas∗

Li Weng∗, Umit Catalyurek†, Tahsin Kurc†, Gagan Agrawal∗, Joel Saltz†
∗Department of Computer Science and Engineering, †Department of Biomedical Informatics
Ohio State University, Columbus, OH 43210

Abstract

Partial replication is one type of optimization to speed up the execution of queries submitted to large datasets. In partial replication, a portion of the dataset is extracted, re-organized, and re-distributed across the storage system. The objective is to reduce the volume of I/O and increase I/O parallelism for different types of queries and for the portions of the dataset that are likely to be accessed frequently. When multiple partial replicas of a dataset exist, a query execution plan should be generated so as to use the best combination of subsets of partial replicas (and possibly the original dataset) to minimize query execution time. In this paper, we present a compiler and runtime approach for range queries submitted against distributed scientific datasets. A heuristic algorithm is proposed to choose the set of replicas that reduces query execution time. We demonstrate the efficiency of the proposed method using datasets and queries from oil reservoir simulation studies on a cluster machine.

1 Introduction

A number of challenges must be overcome to carry out on-demand, interactive analysis of scientific datasets. The very large size of data in many scientific applications is a major problem; in many situations, large dataset sizes necessitate the use of distributed storage. Another challenge is to efficiently extract subsets of data for processing and analysis. Data subsets can be specified by spatio-temporal range queries or by queries on other attributes such as pressure, permeability, and speed. Several optimizations can be employed to efficiently search for and extract the data of interest for analysis. It is a common optimization to partition a dataset into chunks, each of which holds a subset of data elements, and to use chunks as the unit of I/O and data management. When data is stored and retrieved in chunks, the number of disk seeks, and hence I/O latency, is reduced. In addition, managing a dataset as a set of chunks decreases the cost of metadata management, indexing, and data declustering. Indexing is another optimization technique to speed up query execution [11]. It enables rapid searching and extraction of metadata for data elements that intersect a given query. Efficient access to data also depends on how well the data has been distributed across multiple storage nodes. The goal of declustering [9, 14] is to distribute the data across as many storage units as possible so that data elements that satisfy a query can be retrieved from many sources in parallel. Caching is yet another optimization, one that targets multiple-query workloads [1, 10, 19, 21]. Caching attempts to reduce the execution time of queries through data and computation reuse by maintaining the results of queries previously executed in the system.

The performance of these techniques depends on the data access and processing patterns of an application. If there is only one type of query, the dataset can be declustered and indexed to optimize the execution of that type of query. However, when there are multiple types of queries, a single indexing and declustering scheme may not give good performance for all query types. Partial replication is an optimization technique that can be employed when multiple queries and/or multiple types of queries are expected. The objective is to decrease the volume of I/O and increase I/O parallelism by re-organizing and re-distributing a portion of the dataset across the storage system. For example, if the elements of a dataset are organized into chunks and indexed based on a single attribute or a subset of attributes (e.g., atomic species or 3D coordinates), queries that involve only those attributes can be processed efficiently. Queries that involve another attribute, on the other hand, may not be answered efficiently. If a secondary index is created for that attribute, a full scan of the dataset can be avoided. However, this may still yield inefficient execution due to random seeks to different locations in data files to access tuples that match the query criteria. When partial replication optimizations are employed, a query can be answered from data subsets and replicas of datasets.

Since partial replication can reorganize and redistribute a selected portion of a dataset, there may not be a one-to-one mapping between the chunks in the original dataset and those in a replica. In addition, it is possible that no single replica can fully answer the query. Thus, it becomes challenging to figure out how to piece together the requested data from multiple replicas and possibly from the original dataset. In this paper, we present a compiler and runtime approach for efficient execution of multidimensional range queries when partial replicas of a dataset exist. We present a compile-time query planning strategy to select the best combination of replicas in order to minimize query execution time in a distributed environment. The efficiency of the proposed strategy is demonstrated on a parallel machine using queries for analysis of data from oil reservoir simulations.

∗ This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408), #EIA-0121177, #EIA-0203846, #ANI-0330612, #ACI-0130437, #ACI-9982087, #CCF-0342615, #CNS-0406386, #CNS-0426241, by Lawrence Livermore National Laboratory under Grants #B500288 and #B517095 (UC Subcontract #10184497), by NIH NIBIB BISTI #P20EB000591, and by the Ohio Board of Regents BRTTC #BRTT02-0003.



2 Related Work

Several research projects have looked at improving I/O performance using different declustering techniques [9, 14]. Parallel file systems and I/O libraries have also been a widely studied research topic, and many such systems and libraries have been developed [3, 7, 13]. These systems mainly focus on supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. In the context of replication, most of the previous work targets issues like data availability during disk and/or network failures, as well as speeding up I/O performance by creating exact replicas of the datasets [4, 5, 6, 16, 18, 20]. File-level and dataset-level replication and replica management have been studied [6, 16, 12]. In the area of partial replication [4, 5, 20], the focus has been on creating exact replicas of portions of a dataset to achieve better I/O performance. Our goal is different in that we investigate strategies for evaluating queries when (partial) replicas are stored in the system.

In the data caching context, Dahlin et al. [8] have evaluated algorithms that make use of remote memory in a cluster of workstations. Franklin et al. [10] have studied ways of augmenting the server memory by utilizing client resources. Shrira and Yoder [19] have proposed a technique for managing cooperative caches in untrusted environments. Venkataraman et al. [21] have proposed a new buffer management policy, which approximates a global LRU page replacement policy. Andrade et al. [1] implement an "active semantic cache" framework for data reuse among multiple data analysis queries submitted against scientific multi-dimensional datasets. While these approaches can be employed for the management and replacement of replicas, our work focuses on improving query performance by re-organizing portions of input datasets.

3 Overview

3.1 Motivating Application

In simulation-based oil reservoir management studies [17], the goal is to investigate changes in reservoir characteristics and assess how injection and production wells should be placed to maximize oil production and at the same time minimize effects on the environment. An oil reservoir is a 3-dimensional subsurface volume composed of different types of rocks, soil, holes, and underground water ways. A complex numerical model simulates 17 different variables (such as gas pressure, oil saturation, and oil velocity) at each point of the reservoir mesh over many time steps. Due to computational and storage requirements, large-scale simulation runs require the use of distributed computers and storage systems.

In order to improve I/O performance, a simulation dataset can be partitioned into a set of chunks, i.e., data elements are organized into application-specific groups and stored on disk. A chunk is the unit of I/O: data is retrieved from disk in chunks, even if only a subset of the data elements in a chunk is needed. One possible grouping of data elements divides each grid along the x, y, and z dimensions as well as the time dimension into chunks (i.e., 4-dimensional rectangles). The chunks are stored on disk in data files. Each chunk contains a subset of grid points and time steps and the values of the variables computed at those grid points and time steps. The metadata associated with a chunk consists of the minimum bounding box of the chunk, the name of the file that contains the chunk, the offset of the chunk in the file, and the size of the chunk. An R-Tree index [11] can be built on the space and time dimensions to store the bounding boxes of chunks. The index enables rapid search for chunks that intersect a given query. On a system with multiple storage devices, the chunks can be distributed across the devices to achieve parallelism when a query accesses them.
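To make the chunk abstraction concrete, the following sketch (in Python; the field names are hypothetical, since the paper does not prescribe a metadata format) shows per-chunk metadata and a brute-force intersection test of the kind an R-Tree index accelerates:

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (low corner, high corner)

@dataclass
class ChunkMeta:
    bbox: Box       # minimum bounding box in (x, y, z, t)
    file_name: str  # data file that contains the chunk
    offset: int     # byte offset of the chunk within the file
    size: int       # chunk size in bytes

def intersects(a: Box, b: Box) -> bool:
    # Two boxes overlap iff they overlap in every dimension.
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))

def chunks_for_query(index: List[ChunkMeta], query: Box) -> List[ChunkMeta]:
    # Linear scan for clarity; an R-Tree over the bounding boxes prunes this search.
    return [c for c in index if intersects(c.bbox, query)]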

3.2 Runtime Middleware

We employed a middleware framework, called STORM, to implement the runtime support for the execution of queries. STORM [15] is a suite of services that collectively provide basic database support for 1) selecting the data of interest from scientific datasets stored in files and 2) transferring the selected data from storage nodes to compute nodes for processing. The data of interest can be selected based on attribute values, ranges of values, and user-defined filters. STORM supports user-defined distribution of data elements across destination nodes and the transfer of a selected data subset from distributed source nodes to destination nodes according to that distribution. Scientific datasets are usually stored in a set of flat files with application-specific file formats. In order to provide common support for different applications, STORM employs virtual tables and select query abstractions, which are based on the object-relational database model.
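STORM's actual programming interface is not reproduced here; the following minimal sketch only illustrates the virtual-table idea, in which the dataset is exposed as rows with named attributes and a range query is a predicate over those attributes (all names are hypothetical):

from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, float]  # one virtual-table row: attribute name -> value

def range_filter(rows: Iterable[Row],
                 bounds: Dict[str, Tuple[float, float]]) -> Iterator[Row]:
    # Keep rows whose attributes all fall inside the requested ranges, e.g.
    # bounds = {"x": (0, 100), "time": (10, 20), "pressure": (0.5, 1.0)}.
    for row in rows:
        if all(lo <= row[attr] <= hi for attr, (lo, hi) in bounds.items()):
            yield row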

STORM has been developed using a component-based framework, called DataCutter [2], which enables the execution of application data processing components in a distributed environment. Using the DataCutter runtime system, STORM implements several optimizations to reduce the execution time of queries. Distributed Execution of Filtering Operations: Both data and task parallelism can be employed to execute filtering operations in a distributed manner. If a select expression contains multiple (user-defined) filters, a network of filters can be formed and executed on a distributed collection of machines. Parallel Data Transfer: Data can be transferred from multiple data sources to multiple destination processors by STORM data mover components. Data movers can be instantiated on multiple storage units and destination processors to achieve parallelism during data transfer.

4 Query Execution with Partial Replicas

In this section we present an algorithm for the efficient execution of range queries when partial replicas of a dataset exist. We first describe the type of partial replicas we consider in our system and then give an overview of the overall system.

4.1 Partial Replicas Considered

Our main focus is on answering queries in the presence of partial replicas. Our system assumes that partial replicas have already been created and that information about them is stored in a replica information file. A variety of approaches could be taken for selecting which partial replicas to create. One possible approach is to use a group of representative queries to identify the portions of the dataset to be replicated. We assume that a hot region in a dataset is defined by a multi-dimensional bounding box. A partial replica of the dataset is then created from the data elements whose coordinates in the underlying multidimensional space fall into this bounding box. The contents of this bounding box are partitioned into chunks, and these chunks are distributed across the system. We allow flexible chunking of partial replicas, i.e., different replicas can have different chunk shapes and sizes, even if they cover the same multi-dimensional region. Moreover, the layout of chunks on disk can follow a different dimension order. This is important, as the dimension order influences the file seek time for retrieving a set of chunks. Once a partial replica is chunked and a dimension order is chosen, the chunk layout is determined by traversing the chunks in that dimension order. To maximize I/O parallelism when retrieving data, we assume that each chunk of a replica is partitioned across all available data source nodes. However, it should be noted that if the replicas have been created with different chunk shapes and sizes, there will not be a one-to-one mapping between data chunks in the original dataset and those in the replicas.
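As an illustration of dimension-order layout (a sketch under our own naming, not the paper's implementation), the chunks of a replica can be linearized by sorting their grid coordinates with a chosen dimension priority; changing the order changes which chunks are adjacent on disk and therefore the seek pattern for a given query:

from itertools import product
from typing import List, Tuple

def layout_order(grid: Tuple[int, ...],
                 dim_order: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    # grid: number of chunks along each dimension, e.g. (4, 4, 2) in (x, y, t).
    # dim_order: traversal priority, slowest-varying dimension first.
    coords = list(product(*(range(n) for n in grid)))
    coords.sort(key=lambda c: tuple(c[d] for d in dim_order))
    return coords

# Two replicas covering the same region can use different orders:
# layout_order((4, 4, 2), (2, 0, 1)) keeps the chunks of one time step together,
# while layout_order((4, 4, 2), (0, 1, 2)) walks along x first.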

4.2 System Overview

Our support for servicing range queries in the presence of partial replicas has been built on top of our prior work on supporting SQL Select queries on multidimensional datasets in a cluster environment [22]. A high-level overview of our system is shown in Figure 1. The underlying runtime system we use, STORM, was briefly described in the previous section. The STORM system requires the users to provide an index function and an extractor function. The index function chooses the file chunks that need to be processed for a given query. The extractor is responsible for reading the file(s) and creating the rows of the virtual table. In our previous work, we described how a code generation tool can use metadata about the layout of the datasets to generate indexing and extractor functions [22].

Figure 1. Overview of Our System

The information about the replicas available in the system is stored in the replica information file. This file captures the multi-dimensional range covered by each (partial) replica, the size and shape of its chunks, and the dimension order used for the layout of the chunks. Given a range query, the replica selection algorithm uses the replica information file and indices to determine which partial replicas, and which chunks in those replicas, can be used to answer the query. Based on this information, it chooses a strategy, or execution plan, for answering the query. This module then interacts with our modules for code generation. It is possible that no single replica can fully answer the query. In that case, it is necessary to generate subqueries for each selected replica that partially answers the query. A subquery corresponds to the use of one replica (or the original dataset) for answering a part of the query. The code generation modules generate indexing and extraction services for each subquery.
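The paper specifies what the replica information file records but not its concrete syntax; a minimal sketch of one plausible encoding (all field names hypothetical) is:

# One entry per partial replica; ranges are (low, high) pairs per dimension.
replica_info = [
    {
        "replica_id": 1,
        "range": {"x": (0, 512), "y": (0, 512), "z": (0, 64), "t": (0, 100)},
        "chunk_shape": (64, 64, 16, 10),    # chunk extent in each dimension
        "dim_order": ("t", "x", "y", "z"),  # slowest- to fastest-varying on disk
    },
]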

4.3 Query Execution

4.3.1 Computing Goodness Value

To execute a range query efficiently, we should minimize the total cost associated with executing the query on the required data. To that end, we need to establish a "goodness value" for replicas in order to choose the ones that achieve the best query execution performance. Because the chunk is the basic unit of data access from disk, the goodness value is calculated on chunks. We should note that 1) the I/O cost we pay to retrieve a chunk might be too high when the chunk is only partially intersected by the query bounding box, and 2) the cost of a chunk is proportional to its size. Thus, we need to take into account both the cost per chunk and the amount of useful data the chunk contains to answer the query. Our goodness value is defined as the total amount of useful data in a chunk normalized by the estimated retrieval cost of the chunk:

goodness = useful_data_per_chunk / cost_per_chunk

The retrieval cost of a chunk is computed as:

cost = k1 * C_read_operation + k2 * C_seek_operation
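A direct transcription of these two formulas in Python (k1, k2 and the page/seek counts are supplied by the caller; the defaults below are placeholders, not measured values):

def chunk_cost(pages_read: int, seeks: int,
               k1: float = 1.0, k2: float = 10.0) -> float:
    # k1: average read time per page; k2: average seek time per read operation.
    return k1 * pages_read + k2 * seeks

def chunk_goodness(useful_bytes: int, pages_read: int, seeks: int) -> float:
    # Useful data in the chunk normalized by its estimated retrieval cost.
    return useful_bytes / chunk_cost(pages_read, seeks)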

We measure the cost of the read operation, C_read_operation, as the number of pages fetched from disk when answering a query; in the formula, k1 is the average read time for a page. The cost of the seek operation, C_seek_operation, is measured as the number of seeks used to locate the start point of a chunk during query execution; k2 is the average seek time for one read operation. This cost metric is relatively simplistic. However, the experiments presented in Section 5 demonstrate that it captures the dominant cost factor in query execution.

As illustrated in Figure 2(a), Replica 1 has two kinds of chunks that are needed to answer the query. The chunks whose boundaries intersect the query boundary are called partial chunks, since they contain redundant tuples, which are not used to answer the query and which incur extra filtering operations. The chunks whose bounding boxes fall completely within the query bounding box are called full chunks. In order to compute the contribution of full and partial chunks when estimating the goodness value, we introduce an intermediate unit, called a Fragment, between a replica and its chunks. A fragment is defined as a group of partial or full chunks in a replica that have the same retrieval goodness value. Consequently, all full chunks of one replica are grouped as one fragment. In Figure 2(b), Replica 1 has three fragments and Replica 2 has only one fragment. In our implementation, each fragment has a goodness value associated with it:

goodness = useful_data_per_fragment / cost_per_fragment

Figure 2. Using Partial Replicas for Answering a Query: (a) Query and intersecting datasets, (b) 4 Fragments from 2 Replicas, (c) Chunks retrieved.
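Under these definitions, fragment formation amounts to grouping a replica's intersecting chunks by their goodness value (a sketch reusing chunk_goodness from above; grouping by exact value is safe here because identically shaped full chunks yield identical goodness):

from collections import defaultdict
from typing import Dict, List

def build_fragments(chunks: List[dict]) -> Dict[float, List[dict]]:
    # All full chunks of a replica share one goodness value and therefore
    # form a single fragment; partial chunks group by their own values.
    fragments = defaultdict(list)
    for c in chunks:
        g = chunk_goodness(c["useful_bytes"], c["pages_read"], c["seeks"])
        fragments[g].append(c)
    return fragments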

Using this cost model, we estimate the execution cost of different combinations of replicas for answering the query and choose the "best", i.e., the least expensive, combination of replicas. A simple exhaustive search algorithm needs to explore a large search space to find the best combination. Suppose we are given r replicas with n chunks in total. In the worst case, the exhaustive algorithm needs to check up to 2^n potential combinations to find the most efficient way to answer an issued query. In order to avoid this exponential search cost and find an efficient combination quickly, we use a heuristic algorithm to recommend a set of candidate replicas.
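For intuition, an exhaustive planner would enumerate every subset of the n intersecting chunks, which is infeasible beyond small n (a sketch of the search space, not of our algorithm):

from itertools import chain, combinations

def all_chunk_subsets(chunks):
    # 2^n candidate combinations for n chunks -- the search space that the
    # heuristic in Section 4.3.2 avoids exploring.
    return chain.from_iterable(combinations(chunks, k)
                               for k in range(len(chunks) + 1))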

4.3.2 Replica Selection Algorithm

Our approach has two main steps: a greedy heuristic (step 1) and an optimization extension (step 2). In the first step we obtain a set of candidate replicas, i.e., a set of candidate fragments, which are chosen based on their goodness values in a greedy manner. If the union of the fragments does not cover the query range completely, a remainder query range is generated to be directed to the original dataset as well. The candidate replicas, with or without the original dataset, form our initial solution. This initial solution may contain chunks (from different replicas) that intersect each other in the underlying multi-dimensional space of the dataset. This results in redundant I/O volume, as two or more chunks may contain the same data element. In the second step, we attempt to eliminate redundant I/O by checking the chunks in each candidate replica against the others. When multiple partial replicas exist, our algorithm finds all the replicas that can be used to answer the query, selects a combination of replicas with the minimum cost value, and performs an efficient set union operation among the data elements selected from these replicas to answer the query. The set union operation is needed since two replicas may contain overlapping sets of data elements selected by the query.

INPUT: Query Q; a given set of partial replicas R; the original dataset D.
OUTPUT: A list of candidate replicas (fragments), with or without the original dataset D, for answering the issued query.

// Let F be the set of all fragments intersecting the query boundary
1.  F ← ∅
2.  Foreach Ri ∈ R with Ri ∩ Q ≠ ∅
3.      Classify the chunks that intersect the query into different fragments
4.      Calculate the goodness value of each fragment
5.      Insert the fragments of Ri into F
6.  End Foreach
// Let S be the ordered list of candidate fragments in decreasing order of their goodness values
7.  S ← ∅
8.  While F ≠ ∅
9.      Find the fragment Fi with the maximum goodness value
10.     F ← F − Fi
11.     Append Fi to S
12.     Q ← Q − (Fi ∩ Q)
13.     Foreach Fj ∈ F
14.         If Fj intersects with Fi
15.             Subtract the region of overlap from Fj
16.             Re-compute the goodness value of Fj
17.     End Foreach
18. End While
19. If Q ≠ ∅
20.     Use D to answer the subquery Q
21.     Append D, for the range Q, to S

Figure 3. Greedy Algorithm

The first step of the algorithm is shown in Figure 3. The chunks in each replica that intersect the query are categorized as partial or full chunks and grouped into fragments, and the respective goodness values of the fragments are calculated (steps 2-6). For a given query Q, let us denote the set of all fragments by F and the list of chosen fragments, in decreasing order of goodness value, by S. We then apply a greedy search over F (the while loop of steps 8-18). We choose the fragment with the largest goodness value, move it from F to S, and modify Q by subtracting the range covered by this fragment. Note that if any other fragment in F overlaps the chosen fragment, the area of overlap must be subtracted from it and its goodness value re-computed. To simplify the implementation, we re-compute the goodness value using the per-fragment formula instead of the per-chunk formula. As shown in Figure 2(a), if we choose the only fragment of Replica 2 first, we need to subtract the portion of Fragment 2 in Replica 1 that overlaps with the fragment of Replica 2 and modify its goodness value. The while loop iterates until F is empty. At that point, the chosen fragment list S is the initial candidate solution. We may need to include the original dataset in the initial solution and direct a subquery to it if the union of the ranges in S cannot completely answer the query (steps 19-21).
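A compact Python rendering of Figure 3's greedy loop; the region arithmetic is abstracted behind assumed helpers (subtract, intersects, is_empty, recompute_goodness), which in a real implementation would be multidimensional box operations:

def greedy_select(query, fragments, original_dataset):
    F = list(fragments)
    S = []
    remainder = query
    while F:
        best = max(F, key=lambda f: f.goodness)       # step 9
        F.remove(best)                                # step 10
        S.append(best)                                # step 11
        remainder = remainder.subtract(best.region)   # step 12
        for f in F:                                   # steps 13-17
            if f.region.intersects(best.region):
                f.subtract(best.region)               # drop the overlap
                f.recompute_goodness()                # per-fragment formula
    if not remainder.is_empty():                      # steps 19-21
        S.append((original_dataset, remainder))       # subquery to D
    return S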

INPUT: A list of candidate replicas (fragments) S in decreasing order of goodness values.
OUTPUT: A set of replicas, with or without the original dataset D, for answering the issued query.

// Let Intersect indicate whether the candidate solution can be improved
1.  Intersect ← 0
2.  Foreach Fi ∈ S
3.      Foreach Fj ∈ S with Fj ≠ Fi
4.          If Fi ∩ Fj ≠ ∅
5.              Intersect ← 1 and break
6.      End Foreach
7.  End Foreach
8.  If Intersect = 1
9.      Foreach Fi ∈ S, starting from the head of S
10.         Foreach chunk C in Fi
            // Let r be the union of the ranges contained by the filtered areas of all other fragments
11.             If chunk C is completely within r
12.                 Drop C from Fi
13.             Foreach Fk after Fi in list S
14.                 If C ∩ Fk ≠ ∅
15.                     Modify Fk to retrieve the intersection
16.                     Subtract the intersection from C
17.                     If C = ∅ break
18.             End Foreach
19.         End Foreach
20.     End Foreach

Figure 4. Extension to the Greedy Algorithm

The second step, shown in Figure 4, attempts to improve the initial solution obtained in the first step. Consider the example in Figure 2(a). Suppose that after the first step, the 4 candidate fragments in Figure 2(b), together with the original dataset, form the initial solution. Fragment 2 in Replica 1 requires additional filtering operations, because the data in its regions of overlap can also be retrieved from Fragment 4 in Replica 2. Clearly, the better solution is to delete the two chunks that fall in the region of overlap from Fragment 4 and use Fragment 2 for that region, as illustrated in Figure 2(c). This solution results in fewer I/O operations and less filtering computation. In the second step, a check is done chunk by chunk for each fragment in S against the other fragments. If the data contained in a chunk is also contained in the data area marked to be filtered by other fragments, the chunk is deleted from its fragment.

The running time of categorizing chunks and calculating the goodness values is O(n), where n is the number of chunks intersecting the query bounding box. The first step loops O(m) times, where m is the total number of fragments containing those n chunks. Within the while loop, the running time is O(m) for finding the fragment with the largest goodness value and O(m) for modifying the other fragments that overlap the chosen fragment. Thus, the total running time of the first step is O(m^2). The second step executes in O(n^2). Overall, the algorithm takes O(m^2 + n^2) time to compute the set of replicas to be used. Since generally m ≤ n, the overall cost is dominated by the O(n^2) term.
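A sketch of the chunk-elimination test at the core of the second pass, under the same assumed region helpers as the greedy sketch above (it covers steps 9-12 of Figure 4; redistributing partially covered chunks, steps 13-18, is omitted):

def eliminate_redundant_chunks(S):
    # Walk the fragments from the head of S; drop any chunk whose data is
    # fully covered by the areas the other fragments already filter.
    for frag in S:
        others = [f for f in S if f is not frag]
        covered = union_of_filtered_areas(others)  # assumed helper: region union
        frag.chunks = [c for c in frag.chunks
                       if not covered.contains(c.bbox)]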