A Comparative Study of Replica Placement Strategies in Data Grids


Qaisar Rasool¹, Jianzhong Li¹,², George S. Oreku¹, Ehsan Ullah Munir¹, and Donghua Yang¹

¹ School of Computer Science and Technology, Harbin Institute of Technology, China
² School of Computer Science and Technology, Heilongjiang University, China

Abstract. Data Grids are today's emerging infrastructure providing specialized services for handling large datasets that need to be transferred and replicated among different grid sites. Data replication is an important technique for data availability and fast access. In this paper we present a comparison of various replication models and techniques employed by some major topologies used in data grid environments. We focus on dynamic strategies for replica placement in tree, Peer-to-Peer (P2P), and hybrid architectures. Besides the tree model, which is being implemented for many Data Grid applications, hybrid and P2P replication models are also emerging to provide scientific communities with better availability and more efficient access to massive data.

1 Introduction

Grid computing is a wide-area distributed computing environment that involves large-scale resource sharing among collaborations, referred to as Virtual Organizations, of individuals or institutes located in geographically dispersed areas. A Data Grid [1] provides services that help users discover, transfer, and manipulate large datasets stored in distributed repositories, and also create and manage copies of these datasets. Examples of scientific applications dealing with huge amounts of data, and thus potential beneficiaries of Data Grid technology, are high energy physics, astronomy, bioinformatics, and the earth sciences.

Replication of data means the creation and maintenance of copies of data at multiple computers. Replication is key to the effectiveness of distributed systems in that it can provide enhanced performance, high availability, and fault tolerance. The Grid, being a specialized form of distributed system, stands to benefit from replication, and the possibilities are continually explored by the research community.

In this paper we present a brief review of current and past research on replication techniques. Specifically, we focus on data replica placement policies proposed for use in the data grid environment. For each replica placement technique, we consider its methodology, objective (or performance metric), and results. In the presence of the diverse and varying characteristics of tree and other architectures, it is difficult to create a common ground for comparing different replication strategies. We, therefore, group topologies into tree and hybrid/P2P architectures and analyze the impact of replica placement policies in each one. A hybrid topology can carry features of both tree and P2P architectures and can thus improve the performance of a replication strategy.

The paper is organized as follows. Section 2 describes some widely used grid topologies and their implications. In Section 3, we take a closer look at data replication in the grid environment and how different replica placement strategies work under analogous topologies. Section 4 concludes the paper.

K.C. Chang et al. (Eds.): APWeb/WAIM 2007 Ws, LNCS 4537, pp. 135–143, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Grid Topologies

In this section we present an overview of major grid topologies. We use the terms tree, hierarchical, and multi-tier to refer to the same topology. The other topologies considered are P2P and hybrid.

A hierarchical or tree model is used where there is a single source for data and the data has to be distributed among collaborations worldwide. For instance, the data grid envisioned by the GriPhyN project is hierarchical in nature and is organized in tiers [2]. The source where the data is produced is denoted as Tier 0 (e.g. CERN). Next come the Tier 1 national centers, the Tier 2 regional centers (RC), the Tier 3 workgroups, and finally Tier 4, which consists of thousands of desktops.

Such a multi-tier Data Grid has many advantages [3]. Firstly, it allows the scientific community to access resources in a common and efficient way. Secondly, datasets can be distributed to appropriate resources and accessed by multiple sites. Network bandwidth is used efficiently because most data transfers involve local or national network resources, alleviating the workload on international network links. Thirdly, with the support of Grid middleware, the resources located in different centers can be utilized to support data-intensive computing. The multi-tier Data Grid model is being implemented for many scientific applications, High Energy Physics (HEP) being the most prominent one.
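As a rough illustration of the multi-tier structure, the sketch below models a branch of such a tree and resolves a file request by walking toward the root; the node names, tiers, and file name are invented for the example, not taken from GriPhyN.

```python
# Sketch of one branch of a multi-tier (tree) Data Grid. A request walks
# toward the root until a node holding the file is found; the hop count
# shows how far a request travels for different replica placements.

class GridNode:
    def __init__(self, name, tier, parent=None):
        self.name, self.tier, self.parent = name, tier, parent
        self.files = set()          # replicas held locally

    def locate(self, filename):
        """Walk toward the root; return (holder, hops) for the file."""
        node, hops = self, 0
        while node is not None:
            if filename in node.files:
                return node, hops
            node, hops = node.parent, hops + 1
        raise LookupError(filename)

root = GridNode("Tier0-source", 0)
national = GridNode("Tier1-national", 1, parent=root)
regional = GridNode("Tier2-RC", 2, parent=national)
client = GridNode("Tier4-client", 4, parent=regional)  # Tier 3 omitted for brevity

root.files.add("dataset.root")
holder, hops = client.locate("dataset.root")
print(holder.name, hops)            # request travels all the way to the source

regional.files.add("dataset.root")  # replicate at the regional centre
holder, hops = client.locate("dataset.root")
print(holder.name, hops)            # now served one hop away
```

This also illustrates why most transfers stay on local or national links once replicas exist at the lower tiers.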

Fig. 1. Data grid topologies: (a) tree, (b) P2P, and (c) hybrid

A tree topology also has shortcomings. The tree structure of the grid means that there are only specific paths along which messages and files can travel to reach a destination. Further, data transfer is not possible among sibling nodes or nodes situated on the same tier. Peer-to-Peer (P2P) systems overcome these limitations and offer flexibility in communication among components. A P2P system is characterized by applications that employ distributed resources to perform functions in a decentralized manner. From the viewpoint of resource sharing, a P2P system overlaps a grid system. The key characteristic that distinguishes a P2P system from other resource sharing systems is its symmetric communication model between peers, each of which acts as both a server and a client. The unique characteristics of P2P make the Data Grid more scalable and flexible in dealing with distributed data and replication.

With the need for researchers to collaborate and share the products of their analysis, new hybrid models are emerging for Data Grids which combine the benefits of trees and other architectures such as rings and P2P. A hybrid model of a hierarchical Data Grid with peer linkages at the edges is shown in Fig. 1(c).

3 Data Replication in Grid

Replication is used in distributed systems to improve data access efficiency and fault tolerance. Fig. 2 shows the pros and cons of data replication. The general idea is to store copies of data in different locations so that data can easily be recovered if one copy is lost. Also, placing data replicas closer to users can improve data access performance significantly.

A replica is an exact copy of a file that is linked to the original file through some well-defined mechanisms [4]. In a Grid environment, replica management services do not enforce any semantics on the (multiple) replicas of a file, and thus a replica takes the form of a user-asserted correspondence between two physical files [1]. Data replication is characterized as an important optimization technique in the Grid for promoting high data availability, low bandwidth consumption, increased fault tolerance, and improved scalability.

Fig. 2. Pros and cons of data replication in a generic distributed system

The goal of replica optimization is to minimize file access times by pointing access requests to appropriate replicas and by pro-actively replicating frequently used files based on gathered access statistics. In general, a replication mechanism determines which files should be replicated, when to create new replicas, and where the new replicas should be placed.

Replication schemes can be classified as static and dynamic. In static replication, a replica persists until it is deleted by users or its duration expires. The drawback of static replication is evident: when client access patterns change greatly in the Data Grid, the benefits brought by a replica decrease sharply. In contrast, dynamic replication automatically creates and deletes replicas according to changing access patterns, and thus ensures that the benefits of replication continue even if user behavior changes. Dynamic replication strategies tend to reduce hotspots created by popular data and facilitate load sharing. The benefits achieved, however, incur extra overhead in the form of real-time assessment of network traffic and the maintenance of system logs to handle the dynamicity of the grid environment.

3.1 Replica Placement in Tree Topology Grids

The manner in which replicas are placed on grid nodes follows two approaches. The first is to distribute data top-down, from the root node to lower-tier nodes. The other is to place replicas in a bottom-up fashion. Strategies also vary in whether read/write costs are considered before making a replication decision. We take both approaches into consideration in our discussion. As most of the data in Data Grids are read-only [5, 6], we ignore replica update and consistency issues.

We start with the initial work [7] on replication strategies proposed and simulated for a hierarchical Data Grid similar to Fig. 1(a). These techniques are briefly described as follows:

1) No Replication or Caching: all data is held at the root (the base case against which other strategies are evaluated);
2) Best Client: the access history of files is maintained, and a replica of a file is created at the client node that has generated the most requests for that file;
3) Cascading Replication: when file accesses exceed a certain threshold, a replica of the file is created at the immediate lower-tier node on the path to the best client;
4) Plain Caching: the client that requests a file stores a copy locally;
5) Fast Spread: when a client requests a file, a copy (replica) is stored at each node on the path to the client.
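The contrast between Cascading and Fast Spread can be sketched with a toy simulation on a single root-to-client path; the threshold value, node names, and request count below are illustrative choices, not parameters from [7].

```python
# Toy simulation of two placement strategies on a root->client path,
# assuming a simple per-client request counter and a fixed threshold.

THRESHOLD = 3

def fast_spread(path, replicas):
    """On every request, store a copy at each node on the path."""
    replicas.update(path)

def cascading(path, replicas, counts):
    """When requests exceed the threshold, push one replica one tier
    down from the highest node on the path that already holds the file."""
    client = path[-1]
    counts[client] = counts.get(client, 0) + 1
    if counts[client] >= THRESHOLD:
        for i, node in enumerate(path[:-1]):
            if node in replicas and path[i + 1] not in replicas:
                replicas.add(path[i + 1])
                break
        counts[client] = 0  # popularity log cleared after replication

path = ["root", "tier1", "tier2", "client"]
fs, cas, counts = {"root"}, {"root"}, {}
for _ in range(6):                  # six requests from the same client
    fast_spread(path, fs)
    cascading(path, cas, counts)

print(sorted(fs))    # every node on the path holds a copy
print(sorted(cas))   # the replica has trickled down two tiers
```

The trade-off visible here matches the comparison in the text: Fast Spread fills every node (high storage and I/O cost), while Cascading moves replicas downward only as popularity accumulates.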
Another policy, named Cascading plus Caching, was also proposed, in which a client node can act as a server for its siblings. We discuss this strategy in the next section.

It is important to note the difference between replication and caching. In the case of replication, the decision to make a replica and send it to another node is taken solely by the server. In the case of caching, a client requests a file and the server sends a copy of the file to be stored in the client's local cache. We can say that caching is a client-side phenomenon whereas replication is a server-side phenomenon.

The above algorithms consider three different patterns of data access: 1) random access (no locality in the data); 2) an access pattern with the assumption that files popular at one site are popular at others (temporal locality); 3) an access pattern with little temporal locality and the assumption that files recently accessed by a client are likely to be accessed by nearby clients (geographical locality).

In order to capture the dynamics of the access pattern, the history statistics (popularity logs) are cleared at specified time intervals. The effectiveness of a replication strategy is related to how well this time interval is tuned to the access behavior (i.e. interval checking); another parameter is the threshold (a file is replicated only if its number of requests exceeds this threshold). Among these strategies, Fast Spread shows a relatively consistent performance and is best in the case of a random access pattern. When locality is introduced, Cascading offers good results. In contrast, Best Client performs worst of all. For locating the nearest replica, each model/algorithm selects the replica server site that is the least number of hops from the requesting client.

An improvement of the Cascading technique is proposed, namely the Proportional Share Replica policy [8]. The method is a heuristic that places replicas at optimal locations by assuming that the number of sites and the total number of replicas to be distributed are known in advance. First an ideal load distribution is calculated, and then replicas are placed on candidate sites that can service a number of replica requests slightly greater than or equal to that ideal load. The formula for calculating the ideal load is:

Load = TotalRequests / (OriginalCopies + Replicas)

Table 1. Merits and demerits of replica placement techniques in tree topology Grids

| Authors | Technique | Methodology/Realization | TD/BU* | Objective/Performance metric | Results/Comments |
|---|---|---|---|---|---|
| Ranganathan and Foster [7], 2001 | Best Client | A replica is created at the node that generates the most requests | TD | Mean response time | - worst performance; - not suitable for grid |
| | Cascading | A replica trickles down to a lower tier when the no. of requests exceeds a threshold | TD | Mean response time | + better for a small degree of locality; - bad for random access patterns |
| | Caching | A requesting client receives and stores a copy locally | TD | Mean response time | + performance almost similar to Cascading; - relatively high response time |
| | Fast Spread | A replica is stored at each node along the path toward the client | TD | Mean response time | + consistent performance; + best for random access patterns; - high storage, I/O, and CPU load |
| Abawajy [8], 2004 | Proportional Share Replication | An ideal workload is calculated and replicas distributed accordingly | TD | Faster avg response time and bandwidth conservation | + better results than Cascading; + load sharing among replica servers |
| Tang et al. [3], 2005 | Dynamic Replication Algorithms | Simple Bottom-Up (SBU), Aggregate Bottom-Up (ABU) | BU | Response time, bandwidth cost, replication frequency | + better results than the Fast Spread strategy |
| Rahman et al. [9, 10], 2005–2006 | Multi-objective approach | Solved as a p-facility problem to find sites for replica placement | TD | Request-weighted average response time | + considers user requests and current network status; + dynamic maintainability when the performance metric degrades |

* TD = Top-down, BU = Bottom-up.
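The proportional-share heuristic of [8] can be sketched as follows; the per-site request counts and the selection rule (prefer sites closest above the ideal load) are illustrative assumptions, not the paper's exact algorithm.

```python
# Rough sketch of the Proportional Share idea: compute the ideal
# per-copy load, then pick candidate sites that can service slightly
# more than that load. The demand numbers are made up.

def ideal_load(total_requests, replicas):
    # one original copy plus the replicas share the total load
    return total_requests / (1 + replicas)

def place_replicas(site_requests, replicas):
    """Pick sites whose serviceable requests are closest to, but not
    below, the ideal per-copy load."""
    load = ideal_load(sum(site_requests.values()), replicas)
    eligible = {s: r for s, r in site_requests.items() if r >= load}
    return sorted(eligible, key=eligible.get)[:replicas], load

demand = {"siteA": 120, "siteB": 60, "siteC": 105, "siteD": 15}
chosen, load = place_replicas(demand, replicas=2)
print(load)     # ideal requests per copy
print(chosen)   # sites just able to carry the ideal load
```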

Intensive work in the form of dynamic replication algorithms is presented by Tang et al. [3]. Nodes at the intermediate tiers (i.e. tiers between the root and the client nodes) act as replica servers, each running a Local Replica Manager (LRM). A Dynamic Replication Scheduler (DRS) makes the replication decision after gathering file access statistics. The DRS uses two algorithms, Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU). With SBU, replicas are created as close as possible to the clients whose request rates exceed the predefined threshold. ABU takes into account the access history of files used by sibling nodes and aggregates the access records of similar files so that frequently accessed files are replicated first. The performance of the algorithms was evaluated, and improvements were shown against the Fast Spread dynamic replication strategy. The values for interval checking and the threshold were based on the data access arrival rate, the data access distribution, and the capacity of the replica servers.

A method exploiting Operations Research techniques is proposed in [9, 10] for replica placement. In this method, named the multi-objective approach, the replica placement decision is made considering both the current network status and the data request pattern. The problem is formulated using the p-median and p-center models to find the p replica placement sites. The p-center problem aims to minimize the maximum response time between a user site and a replica server, whereas the p-median model focuses on minimizing the total response time between the requesting sites and the replication sites. Dynamic maintainability is achieved by considering the replica relocation cost: relocation is triggered when the performance metric degrades significantly over the last K time periods. The threshold value is varied in proportion to the response time in each time interval. Table 1 summarizes the major research work done on replica placement in tree topology Data Grids.
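The ABU-style aggregation step can be sketched as below; the record layout (per-child access counters merged at the parent replica server) and the numbers are illustrative assumptions, not the data structures of [3].

```python
# Sketch of bottom-up aggregation: access histories of sibling nodes
# are merged at their parent server, then files are ranked so the most
# frequently accessed ones are replicated first.

from collections import Counter

# per-child access counts, grouped by parent replica server (made up)
siblings = {
    "server1": [Counter({"f1": 4, "f2": 1}), Counter({"f1": 3})],
    "server2": [Counter({"f3": 2}), Counter({"f2": 2, "f3": 1})],
}

def aggregate(siblings):
    """Merge sibling histories at each parent, then rank files by count."""
    merged = {srv: sum(hists, Counter()) for srv, hists in siblings.items()}
    ranked = {srv: [f for f, _ in c.most_common()] for srv, c in merged.items()}
    return merged, ranked

merged, ranked = aggregate(siblings)
print(merged["server1"])   # f1's counts from both children combined
print(ranked["server1"])   # f1 is replicated before f2
```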
3.2 Peer-to-Peer and Hybrid Solutions for Replica Placement

In large decentralized systems where the reliability of storage elements is low and bandwidth is limited, data availability is a serious challenge. Replication can be an effective technique for improving data availability in such unreliable systems. A strategy for creating replicas automatically in a generic decentralized peer-to-peer network is proposed in [11]. The goal of this model-driven approach is to maintain a specific level of availability at all times. To achieve this goal, the optimal number of replicas for each file is calculated, and then the best nodes are identified to host these replicas. This is done for files that have no replica or whose replicas fall below an availability threshold. The approach relies on a resource discovery service to find available storage and on the Network Weather Service [14] for network status information. The availability of a file depends on the node failure rate and the accuracy of the Replica Location Service (RLS). The amount of replica availability needed for a given file is modeled as:

RLacc · (1 − (1 − p)^r) ≥ Avail

where r is the total number of replicas of a given file, p is the probability of a node being up, RLacc is the accuracy of the replica location service, and Avail is the required availability of the file. For a given availability threshold, the value of r can be calculated from this inequality. For example, if the probability of a node being up is 30% and the RLS accuracy is 80%, then for an availability threshold of 75% the model recommends a minimum of 8 replicas.

After calculating the ideal number of replicas r, the system consults the RLS to learn how many replicas M are actually present. If M is less than r, the system creates (r − M) copies of the file and distributes them to the best candidate nodes. The best nodes are those which maximize the difference replicationBenefits − replicationCosts. The benefit is the reduction in transfer time to the potential users. The replication costs are the storage cost at the remote site and the transfer time from the current location to the new location. Let F be a file stored at location N1; then the cost of creating a replica of it at a new location N2 is:

replicationCosts = trans(F, N1, N2) + s(F, N2)

where the function trans() gives the transfer cost of F from N1 to N2 and the function s() gives the storage cost of F at N2. The benefit of creating a replica of F at N2 is given by:

replicationBenefits = trans(F, N1, User) − trans(F, N2, User)

where User is the location from which the most future requests are expected. In this way, the net benefit of replicating F at N2 is calculated and the best candidate is identified for replica placement.

Table 2. Merits and demerits of replica placement techniques in P2P & hybrid topology Grids

| Authors | Technique | Methodology/Realization | Objective/Performance metric | Results/Comments |
|---|---|---|---|---|
| Ranganathan et al. [11], 2002 | Model-driven P2P Replication | Computing the ideal no. of replicas for a file; placing replicas at the best nodes via cost/benefit analysis | Replica availability | + no single point of failure; - possibility of excessive replicas, as each peer takes its own decision |
| Jin et al. [12], 2006 | Decentralized P2P Strategy | Modeling costs between peers using an adjacency graph; 0/1 vector for replica placement | Data transfer cost | + better performance than the static technique |
| Ranganathan and Foster [7], 2001 | Cascading plus Caching | Joining the cascading and caching methodologies | Response time and bandwidth conservation | + a client can act as a server for its siblings; + better performance than Cascading or Caching alone |
| Lamehamedi et al. [13], 2002 | Hybrid Replica Connection Service | A cost model evaluates the access costs and performance gains of creating and placing replicas | Response time | + favorable results when replicas are placed closer to users |
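The model-driven calculation of [11] can be sketched directly from the formulas above: solve the availability inequality for the minimum r, then score a candidate node by net benefit. The transfer and storage costs below are invented placeholder numbers.

```python
# Sketch of the model-driven approach: minimum replica count for a
# target availability, then a cost/benefit check for a candidate host.

import math

def min_replicas(p_up, rl_acc, avail):
    """Smallest r satisfying rl_acc * (1 - (1 - p_up)**r) >= avail."""
    return math.ceil(math.log(1 - avail / rl_acc) / math.log(1 - p_up))

r = min_replicas(p_up=0.30, rl_acc=0.80, avail=0.75)
print(r)   # the worked example in the text: 8 replicas

def net_benefit(trans, store, src, cand, user):
    # benefit: transfer time saved for the expected user;
    # cost: moving the file to the candidate plus storing it there
    benefit = trans[(src, user)] - trans[(cand, user)]
    cost = trans[(src, cand)] + store[cand]
    return benefit - cost

trans = {("N1", "User"): 10.0, ("N2", "User"): 2.0, ("N1", "N2"): 4.0}
store = {"N2": 1.5}
print(net_benefit(trans, store, "N1", "N2", "User"))  # positive => replicate at N2
```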

In [12], a P2P network cost model is constructed along with a decentralized algorithm to dynamically optimize replica placement. The optimization strategy is based on the status of the grid system. Although all status information is obtained dynamically in a decentralized way and all decision-making about replica movement is independent among the peer nodes, data locating and storage resource management are centralized. The system scale is controlled by deploying super peers or brokers. To make effective use of grid storage resources, replicas are given lifetimes: a replica is deleted from the system if it is not accessed for a certain time period.

Among the replication techniques for tree topologies discussed in the previous section, the authors of [7] also presented a Cascading plus Caching strategy that combines Plain Caching and Cascading. A copy of the requested file is saved in the client's local storage, and replicas of popular files are periodically sent down to lower tiers. Any node in the hierarchy can be a server; specifically, a client can act as a server to its siblings. This constitutes a hybrid topology in which client nodes sharing the same parent are linked in a P2P-like structure. The strategy gives better performance than Cascading or Caching alone.

A hybrid topology is used in [13], where ring and fat-tree replica organizations are combined into multi-level hierarchies. Each site or node maintains an index of the replicas it hosts and of the other replica locations it knows about. Replication of a dataset is triggered when the requests for it at a site exceed some threshold. A cost model evaluates the data access costs and the performance gains of creating replicas. The replication strategy places a replica at the site that minimizes the total access cost, including both read and write costs for the datasets. Table 2 highlights some major work on replica placement in P2P and hybrid Data Grid topologies.
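Two of the housekeeping rules mentioned above can be sketched together: the lifetime rule of [12] (drop replicas unused for too long) and the threshold trigger of [13] (replicate when local requests exceed a limit). The time units, limits, and data are illustrative, not taken from either paper.

```python
# Toy sketch of replica housekeeping rules, assuming integer timestamps
# and request counters; all values here are invented.

LIFETIME = 100          # drop a replica idle longer than this
REQ_THRESHOLD = 5       # replicate a dataset after this many requests

def expire(replicas, now):
    """Keep only replicas accessed within the lifetime window."""
    return {f: t for f, t in replicas.items() if now - t <= LIFETIME}

def should_replicate(request_counts, dataset):
    """Trigger replication when local demand exceeds the threshold."""
    return request_counts.get(dataset, 0) > REQ_THRESHOLD

replicas = {"f1": 10, "f2": 180}          # file -> last access time
print(sorted(expire(replicas, now=200)))  # f1 has expired, f2 is kept

counts = {"d1": 7, "d2": 3}               # dataset -> local requests
print(should_replicate(counts, "d1"))     # demand above threshold
```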

4 Conclusions

Comparing different replication policies working under different, or even similar, topologies is complicated due to the diverse nature of the assumptions made regarding topology, data access patterns, and policy objectives. The hierarchical Data Grid model, used in many high energy physics projects, has motivated the development of tree-structured replication mechanisms. The replica placement decision is very important for the effective performance of any replication scheme. For our study, we considered tree and P2P/hybrid structures separately and presented the merits and demerits of different replica placement strategies in each. A Data Grid tree provides a relatively simple top-down structure for data propagation, albeit one in which data movement remains confined to specific paths. In contrast, Peer-to-Peer systems present a more flexible layout, but at the cost of added complexity in the replication schemes. P2P and hybrid replication solutions have been proposed in research but have not yet found a place in practical Data Grids. We foresee that, like the tree structure, P2P and other replication frameworks will become prevalent to fulfill the needs of scientific communities for accessing huge amounts of data efficiently and securely.

Acknowledgements

We would like to thank the National Natural Science Foundation of China under Grants No. 60533110 and 60473075, the Program for New Century Excellent Talents in University under Grant No. NCET-05-0333, the Key National Science Foundation of Heilongjiang Province under Grant zjg03-05, and the National Science Foundation of Heilongjiang Province under Grant F0208 for their support.

References

1. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., Tuecke, S.: The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications 23, 187–200 (2001)
2. The GriPhyN Project, http://www.griphyn.org
3. Tang, M., Lee, B.-S., Yeo, C.-K., Tang, X.: Dynamic Replication Algorithms for the Multi-tier Data Grid. Future Generation Computer Systems 21, 775–790 (2005)
4. Guy, L., Kunszt, P., Laure, E., Stockinger, H., Stockinger, K.: Replica Management in Data Grids. Technical Report, Global Grid Forum Information Document, GGF5, Edinburgh, Scotland (July 2002)
5. Stockinger, H., Samar, A., Allcock, B., Foster, I., Holtman, K., Tierney, B.: File and Object Replication in Data Grids. In: 10th IEEE Symposium on High Performance Distributed Computing, pp. 305–314 (2001)
6. Hoschek, W., Jaen-Martinez, F.J., Samar, A., Stockinger, H., Stockinger, K.: Data Management in an International Data Grid Project. In: Proceedings of the GRID Workshop, pp. 75–86 (2001)
7. Ranganathan, K., Foster, I.: Identifying Dynamic Replication Strategies for a High Performance Data Grid. In: Proceedings of the Int'l Grid Computing Workshop, Denver, Colorado, USA (November 2001)
8. Abawajy, J.H.: Placement of File Replicas in Data Grid Environments. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J.J. (eds.) ICCS 2004. LNCS, vol. 3038, pp. 66–73. Springer, Heidelberg (2004)
9. Rahman, R.M., Barker, K., Alhajj, R.: Replica Placement in Data Grid: A Multi-objective Approach. In: Zhuge, H., Fox, G.C. (eds.) GCC 2005. LNCS, vol. 3795, pp. 645–656. Springer, Heidelberg (2005)
10. Rahman, R.M., Barker, K., Alhajj, R.: Replica Placement Design with Static Optimality and Dynamic Maintainability. In: Proceedings of the Sixth IEEE Int'l Symposium on Cluster Computing and the Grid (CCGrid'06) (2006)
11. Ranganathan, K., Iamnitchi, A., Foster, I.: Improving Data Availability through Model-Driven Replication for Large Peer-to-Peer Communities. In: Proceedings of the Global and Peer-to-Peer Computing on Large-Scale Distributed Systems Workshop, Berlin, Germany (May 2002)
12. Jin, H., Mao, F., Chen, H., Wu, S., Zou, D.: P2P Based Decentralized Dynamic Replica Placement Strategy in Data Grid Environment. Downloaded October 2006 from http://grid.hust.edu.cn/fmao/papers/new/replica%20placement-0224IE%20fianl%20.pdf
13. Lamehamedi, H., Szymanski, B., Shentu, Z., Deelman, E.: Data Replication Strategies in Grid Environments. In: Proceedings of the 5th Int'l Conference on Algorithms and Architectures for Parallel Processing, pp. 378–383 (2002)
14. Wolski, R.: Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service. In: Proc. 6th IEEE Symposium on High Performance Distributed Computing, Portland, Oregon (1997)
