Effective load balancing on highly parallel multicomputers based on superconcentrators


Gene Eu Jan
Department of Navigation, National Taiwan Ocean University, Keelung, Taiwan

Ming-Bo Lin
Electronic Engineering Department, National Taiwan Institute of Technology, Taipei, Taiwan

Abstract: Tree and mesh architectures have been considered two of the most highly scalable parallel multicomputers because their scalability is superior to that of hypercubes. However, load balancing on these two multicomputer systems is not as effective as one might expect. In the worst case, redistributing the workload over a tree architecture requires O(M x p x log p) routing time, and over a mesh architecture O(M x sqrt(p)), when a pipelined packet routing scheme is used. In this paper, we propose an approach based on superconcentrators that reduces the above bounds to O(M log p) in both cases at only an additional O(p) cost. Furthermore, under this scheme the underlying system can leave the load balancing problem entirely to the superconcentrator, so load balancing adds no workload to the system itself. In addition, the scheme adds extra communication paths between the processors, so it not only increases the communication capacity among the processors but can also tolerate edge faults of the system.

1

Introduction

It is well understood that distributed computer systems have the potential for improved performance and resource sharing [3]. However, one major problem with such systems is that some processors may be heavily loaded while others are lightly loaded or even idle. To maximize the performance of such systems, it is necessary to keep every processor busy whenever tasks are waiting for service. Two distinct strategies have been proposed [9, 11, 12] for this purpose. Load balancing algorithms explore the possibility of equalizing the workload among the processors, while load sharing algorithms simply attempt to ensure that no processor is idle while some tasks wait for service.

In general, load balancing algorithms require much more resources from the system than load sharing algorithms [7]. Therefore, the resource requirement may outweigh the potential benefits unless the load balancing algorithm is good enough. Most recent highly parallel computer systems use a wormhole routing scheme. In such systems the communication overhead depends largely on link contention, and the variation due to distance between two processors is negligible [1]. Hence, effective load balancing comes down to finding a matching between heavily loaded and lightly loaded processors such that link contention can be avoided. Bokhari [1] proposed a network flow model for load balancing in such systems; however, the resulting time bounds are O(p^2) for meshes and O(p^2 log^2 p) for hypercubes, where p is the number of processors. To improve the performance of load balancing, many dynamic load balancing strategies have been proposed [10, 13, 14]. These include sender-initiated diffusion [2], which uses the load information of near-neighbor processors to migrate surplus load from heavily loaded processors to lightly loaded neighbors; receiver-initiated diffusion, which is the converse of sender-initiated diffusion; the hierarchical balancing method, an asynchronous global approach that organizes the system into a hierarchy of subsystems; the gradient model [8], which employs a gradient map of the proximities of underloaded processors to guide task migration between overloaded and underloaded processors [13]; and the dimension exchange method [2], a global and fully synchronous approach. The major problem with all of these strategies is that they are built on the underlying machine architecture. Thus, the load balancing algorithm increases not only the system workload but also the link contention.
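To make the near-neighbor flavor of these diffusion schemes concrete, the following is a minimal Python sketch of one round of a sender-initiated-diffusion-style exchange. It is our illustration, not code from the paper or from [2, 13]; the ring topology, the tie-breaking rule, and the one-task-per-round policy are simplifying assumptions.

```python
# One round of a sender-initiated-diffusion-style exchange on a ring of p
# processors. Each overloaded node pushes one task to its least-loaded
# neighbor, using only local (near-neighbor) load information.
def sid_round(load):
    p = len(load)
    avg = sum(load) // p
    moves = []
    for i, n in enumerate(load):
        if n > avg:                      # node i considers itself overloaded
            left, right = (i - 1) % p, (i + 1) % p
            j = left if load[left] <= load[right] else right
            if load[j] < n:              # migrate one task toward the lighter side
                moves.append((i, j))
    for i, j in moves:                   # apply all migrations after the scan
        load[i] -= 1
        load[j] += 1
    return load

print(sid_round([6, 0, 0, 6]))  # -> [5, 1, 1, 5]
```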




In this paper, we will propose a self-routing superconcentrator based load balancing algorithm. Using this scheme, the underlying system can leave the load balancing problem entirely to the superconcentrator. Hence, the performance improvement is twofold. First, the superconcentrator takes care of all the work required for load balancing, so the process is transparent to the system. Second, the superconcentrator provides all the communication paths for the task migration required by the load balancing process; hence it does not add any traffic on the links between processors of the underlying system. The only price to be paid is an O(p) cost superconcentrator, plus two additional I/O ports per computer, one for input and the other for output, which are connected to the superconcentrator.

Load balancing can be considered as a problem that searches for appropriate pairings between processors that are heavily loaded and those that are lightly loaded. Three issues are intimately related to load balancing: load difference evaluation, that is, classifying the processors as overloaded or underloaded; mapping between overloaded and underloaded processors; and redistribution of the load among the processors. Of course, the communication overhead associated with load transfers depends on the communication mechanisms supported by the underlying parallel computer and must be minimized. For convenience, in this paper we assume that the basic workload unit is the task and that all tasks are independent, that is, they can be assigned to any processor and produce the same result.

Tree and mesh architectures are considered two of the most promising candidates for highly scalable parallel multicomputer systems [4]. A tree architecture consists of p = 2^d leaf processors with a d-level balanced binary tree as its communication mechanism. For the load balancing problem on a p-node tree architecture, the worst case occurs when the left half of p/2 processors, each with exactly M tasks, must transfer a total of (M/2)(p/2) tasks to the right half of p/2 empty processors. The resulting communication overhead is O(Mp log p) when a pipelined packet routing scheme is used. A p-node mesh architecture consists of a sqrt(p) x sqrt(p) two-dimensional array. Each node has degree 2, 3, or 4, depending on whether it is a corner, boundary, or interior node. For the load balancing problem on a p-node mesh architecture, the worst case again occurs when the left half of p/2 processors, each with exactly M tasks, must transfer a total of (M/2)(p/2) tasks to the right half of p/2 empty processors. The resulting communication overhead is O(M sqrt(p)) when a pipelined packet routing scheme is used. In this paper, we propose an approach based on a self-routing superconcentrator to improve the above bounds to O(M log p) with only linear cost added. A rough numerical comparison of these bounds is sketched below.
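As a back-of-envelope illustration (ours, not from the paper), the asymptotic bounds just quoted can be compared numerically; the values of M and p are arbitrary example parameters, and all constant factors are ignored.

```python
# Compare the quoted worst-case routing-time bounds for redistributing load
# from p/2 processors holding M tasks each to p/2 empty processors.
import math

def bounds(M, p):
    return {
        "tree, O(M p log p)":             M * p * math.log2(p),
        "mesh, O(M sqrt(p))":             M * math.sqrt(p),
        "superconcentrator, O(M log p)":  M * math.log2(p),
    }

for name, t in bounds(M=64, p=1024).items():
    print(f"{name:32s} ~ {t:,.0f} time units (constants ignored)")
```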

The rest of the paper is organized as follows. Section 2 reviews and summarizes the features of concentrators and superconcentrators. Section 3 describes the load balancing algorithm based on self-routing superconcentrators. Section 4 proves the correctness and estimates the performance of the load balancing algorithm. Section 5 concludes the paper.

2

Concentrators and Superconcentrators

Loosely speaking, a (p, q) packet-switching concentrator, where 1 ≤ q < p, is a network with p inputs and q outputs that can map any k packets from its inputs to some fixed k of its outputs, without the capability to distinguish between those outputs. A (p, q) packet-switching superconcentrator, where 1 ≤ q < p, is a network with p inputs and q outputs that can map any k packets from its p inputs, where 1 ≤ k < q, to any k of its q outputs. The main difference between a packet-switching concentrator and a packet-switching superconcentrator is that the input packets of the latter can be mapped onto any subset of outputs on a one-to-one basis, while the input packets of the former can only be mapped onto some fixed, but not arbitrary, subset of outputs.

Figure 1: A self-routing concentrator.

Concentrator: Figure 1 shows a self-routing concentrator, which consists of a ranking tree, p/log p selection trees and distribution trees, and a (p/log p)-input cube interconnection network.


The concentration on the concentrator proceeds in two phases. In the first phase, the k packets at the inputs are ranked by the ranking tree, which can be done in O(log p) time [5]. In the second phase, the ranks of the k inputs are used to route the k packets to k consecutive outputs starting from the topmost output. Using the fully pipelined routing scheme described in [5], any k packets can be routed from the inputs to the k consecutive outputs starting from the topmost output in O(log p) time, where k ≤ p. It is easy to show that the cost of the self-routing concentrator is O(p). For the details of the concentrator, consult [5].

Figure 2: A self-routing superconcentrator.

Superconcentrator: A self-routing superconcentrator can be obtained by attaching two concentrators back to back, as shown in Figure 2. Note that the p inputs of the reversed concentrator also serve as the outputs of the superconcentrator. The operation of the superconcentrator proceeds in two phases. In the first phase, any k packets, where k ≤ p, of the concentrator and an equal number of "dummy packets" of the reversed concentrator are concentrated simultaneously to the outputs of the concentrator network and the reversed concentrator network, in both cases starting from the topmost output. These outputs must meet on a one-to-one basis exactly at the interface between the concentrator and the reversed concentrator, because the same number of packets are concentrated from both sides. To memorize the paths of the dummy packets of the reversed concentrator during the concentrating process, a routing tag is used to record the switching information along each path. The information in the routing tags is then used to guide the packets from the concentrator on their way to the outputs of the superconcentrator. The switching states of the selection trees are defined as: lower child is 1 and upper child is 0; the routing tag values of the reversed cube network are defined as: the upper input of a 2 x 2 switching box is 0 and the lower input is 1. In general, a routing tag has (log log p + log(p/log p)) bits for the reversed concentrator: log log p bits to record the routing status in the selection tree and log(p/log p) bits for the reversed cube network. In the second phase, we apply the upper child priority scheme [5] to avoid contention in the distribution trees of the reversed concentrator and select sets of monotonic packets for the inputs of the reversed cube network. The sets of monotonic packets are then routed through the reversed cube network, using the routing tags, in a set-by-set fashion without contention [6]. It is easy to show that a p-input superconcentrator can superconcentrate any pattern of inputs to an equal number of outputs on a self-routing basis in O(log p) time with O(p) cost. The ranking idea of the first phase is sketched below.
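The following Python sketch (ours, as a sequential stand-in for the hardware) illustrates the two-phase concentration idea: phase 1 assigns each active input its rank, which the ranking tree computes in O(log p) on the actual network, and phase 2 delivers the packet of rank r to the r-th topmost output, so the k packets land on k consecutive outputs.

```python
# Functional model of packet concentration: rank the active inputs, then
# route each packet with rank r to output r (the topmost k outputs).
def concentrate(inputs):
    """inputs: list of length p; None marks an input with no packet."""
    p = len(inputs)
    outputs = [None] * p
    rank = 0
    for pkt in inputs:        # sequential stand-in for the ranking tree
        if pkt is not None:
            outputs[rank] = pkt   # rank r -> r-th topmost output
            rank += 1
    return outputs

print(concentrate([None, 'a', None, 'b', 'c', None, None, 'd']))
# -> ['a', 'b', 'c', 'd', None, None, None, None]
```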

3

Load Balancing Based on Superconcentrators

For our purpose, load balancing can be defined as follows: N tasks are distributed over p processors with no more than M tasks assigned to any single processor, where N/p ≤ M ≤ N. To make the most of the system's processor resources, the workload difference between any two processors must be at most one at any time. In this section, a load balancing algorithm based on the self-routing superconcentrator described in the previous section is explored. This algorithm can be used with tree-based, mesh-based, or other parallel computer architectures. The basic idea is as follows. Before the algorithm proceeds, all p processors are connected to the superconcentrator, with the ith input and the ith output of the superconcentrator connected to the ith processor, where 0 ≤ i ≤ p − 1. An algorithm is then performed to distribute the tasks of the system so that the load is balanced over the entire system. Here, balanced means that the load difference between any two processors is at most one unit.

Assume that the ith processor Pi has ni tasks, where 0 ≤ i ≤ p − 1. The algorithm is composed of three steps: computation of the load differences, distribution, and redistribution. The load balancing algorithm begins with step 1 by calculating the sum of the loads, the average, and the difference between the average and ni for each processor. Step 2 distributes the surplus tasks from overloaded processors to underloaded processors. In this step, the processors are classified by the difference between their actual load and the average. If the load difference DIF(Pi) = ni − AVG, where AVG = ⌊N/p⌋, of processor Pi is positive, the processor is called an overloaded processor, Pi^OL; if the load difference is negative, the processor is called an underloaded processor, Pi^UL. In this step, each overloaded processor sends at most DIF(Pi^OL) tasks to some underloaded processors, and each underloaded processor receives at least |DIF(Pi^UL)| tasks from some overloaded processors. Let B(t) denote the minimum of the numbers of overloaded and underloaded processors at time t. B(t) is the maximum number of tasks that can be superconcentrated from overloaded processors to underloaded processors over the superconcentrator at time t. Tasks are superconcentrated from the top B(t) overloaded processors to the top B(t) underloaded processors.


A processor Pi is said to be filled, or balanced, when its load difference DIF(Pi) is zero. Unless specified otherwise, a processor with zero load difference remains neutral and does not participate in the superconcentration process. The superconcentration of tasks from overloaded processors to underloaded processors is repeated until B(t) = 0. It is possible that after all underloaded processors are filled, surplus tasks remain in some overloaded processors, since AVG' = ⌊N/p⌋; therefore, one or more processors may still be overloaded by more than one task. In order to achieve load balancing in which the loads of any two processors differ by at most one task, the third step, redistribution, is required. In the redistribution step, we reassign overloaded and underloaded processors according to a new specification. First, each processor Pi with DIF(Pi) = 0 is reassigned as an underloaded processor, Pi^UL, which will accept at most one task from some overloaded processor. Each processor Pi with DIF(Pi) = 1 is considered neutral and remains idle in the redistribution step. Each processor Pi with DIF(Pi) > 1 is considered an overloaded processor and will send DIF(Pi) − 1 tasks to underloaded processors. This step is repeated through the superconcentration process until load balancing is reached. The step 1 quantities are sketched below, followed by the superconcentrator load balancing algorithm.
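As an illustration (ours, not the authors' code), the step 1 quantities and B(t) can be computed as follows; the load vector is an arbitrary example.

```python
# Step 1 of the algorithm: compute AVG' = floor(N/p), the load differences
# DIF(P_i) = n_i - AVG', and B(t), the number of tasks one superconcentration
# pass can move from overloaded to underloaded processors.
def classify(loads):
    p = len(loads)
    avg = sum(loads) // p                      # AVG' = floor(N / p)
    dif = [n - avg for n in loads]             # DIF(P_i) = n_i - AVG'
    overloaded  = [i for i, d in enumerate(dif) if d > 0]
    underloaded = [i for i, d in enumerate(dif) if d < 0]
    B = min(len(overloaded), len(underloaded)) # tasks movable in one pass
    return avg, dif, overloaded, underloaded, B

print(classify([9, 9, 0, 0]))
# -> (4, [5, 5, -4, -4], [0, 1], [2, 3], 2)
```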

Algorithm: Load Balancing

Step 1: Computation of the load differences. Let AVG' = ⌊N/p⌋ and DIF(Pi) = ni − AVG'.
ParDo
Step 1.1: Compute SUM by summing all the tasks on the leaf processors through the ranking tree to the root node.
Step 1.2: Compute AVG' at the root node of the ranking tree and broadcast AVG' down to the leaf processors through the ranking tree.
Step 1.3: Compute the DIF(Pi)'s at the leaf processors of the ranking tree.
ParEnd

Step 2: Distribution
ParDo
Step 2.1: Identify processors with positive, negative, or zero DIF(Pi) as overloaded, underloaded, or balanced processors, respectively.
Step 2.2: Superconcentrate one task from each of the top B(t) overloaded processors to the top B(t) underloaded processors at time t through the superconcentrator.
Step 2.3: Update the load differences. The load differences of the top B(t) underloaded processors and the top B(t) overloaded processors increase and decrease by one, respectively, after the superconcentration.
Step 2.4: Go to Step 2.2 until B(t) = 0 (that is, until the DIFs of all underloaded processors are equal to zero).
ParEnd


Step 3: Redistribution
If DIF(Pi) > 1 for some i, 0 ≤ i ≤ p − 1, then
ParDo
Step 3.1: Reassign processors with DIF greater than one, equal to zero, or equal to one as overloaded, underloaded, or balanced (idle) processors, respectively.
Step 3.2: Repeat Step 2.2 and Step 2.3 until all the load differences of overloaded processors are equal to one.
ParEnd
END {Algorithm}
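The following is a minimal end-to-end simulation of the distribution and redistribution steps above. It is our sketch, not the authors' implementation: selecting the "top" processors by index order and the example load vector are simplifying assumptions, and each loop iteration stands in for one O(log p) superconcentration pass.

```python
# Simulate Steps 2 and 3: each pass moves one task from each of the top B(t)
# overloaded processors to the top B(t) underloaded processors.
def balance(loads):
    p, passes = len(loads), 0
    avg = sum(loads) // p                              # AVG' = floor(N/p)

    def run(is_over, is_under):
        nonlocal passes
        while True:
            over  = [i for i in range(p) if is_over(loads[i] - avg)]
            under = [i for i in range(p) if is_under(loads[i] - avg)]
            B = min(len(over), len(under))              # B(t)
            if B == 0:
                return
            for s, d in zip(over[:B], under[:B]):       # one superconcentration
                loads[s] -= 1                           # pass, O(log p) each
                loads[d] += 1
            passes += 1

    # Step 2: overloaded (DIF > 0) feed underloaded (DIF < 0).
    run(lambda d: d > 0, lambda d: d < 0)
    # Step 3: surplus (DIF > 1) spreads to processors with DIF = 0.
    run(lambda d: d > 1, lambda d: d == 0)
    return loads, passes

print(balance([9, 9, 0, 1]))  # -> ([5, 5, 5, 4], 5); max-min load is 1
```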

4

Correctness and Performance

Some lemmas are required for proving the correctness and estimating the time complexity of the load balancing algorithm.

Lemma 1: The absolute value |DIF(Pi)| of the load difference of any processor is at most M − 1.

Proof: If AVG' = M, every processor must already hold exactly M tasks, so all load differences are zero. Otherwise, assuming N ≥ p so that AVG' ≥ 1, we have 1 ≤ AVG' ≤ M − 1 and 0 ≤ ni ≤ M, so DIF(Pi) = ni − AVG' satisfies −(M − 1) ≤ DIF(Pi) ≤ M − 1.

Lemma 2: The superconcentration of surplus tasks in the distribution step, by which the underloaded processors are filled, can be completed in O(M log p) time.

Proof: Let DIF_max(Pi^OL) and DIF_max(Pj^UL) denote the maximum load difference of an overloaded processor Pi and of an underloaded processor Pj, respectively, where 0 ≤ i, j ≤ p − 1. By setting B(t) to the minimum of the numbers of overloaded and underloaded processors at time t, each superconcentrating pass decreases DIF_max(Pi^OL) or increases DIF_max(Pj^UL) by one, or both. In addition, the maximum absolute load difference of any processor is M − 1 by Lemma 1, and routing through the superconcentrator requires O(log p) time. Therefore, the total time required to superconcentrate surplus tasks from overloaded to underloaded processors until B(t) reaches zero is at most 2(M − 1) log p = O(M log p) time units.

Lemma 3: The sum of the load differences of the overloaded processors is at most p − 1 after the distribution step.

Proof: Let E be the sum of the load differences of the overloaded processors after the distribution step. Since AVG' = ⌊N/p⌋, we can write N = p × AVG' + E with 0 ≤ E ≤ p − 1. After the distribution step every underloaded processor has been filled to AVG', so the total surplus remaining in the overloaded processors is exactly E. Therefore 0 ≤ E ≤ p − 1, and the proof is complete. For example, N = 13 tasks on p = 4 processors give AVG' = 3 and E = 13 − 12 = 1 ≤ p − 1.

Lemma 4: The sum of the surplus tasks of the overloaded processors is less than the number of underloaded processors after the completion of step 3.1.

Proof: In step 3.1, the processors with DIF greater than one are reassigned as overloaded processors, the processors with DIF equal to zero as underloaded processors, and the processors with DIF equal to one as balanced processors. Furthermore, by Lemma 3, the sum of the load differences of the overloaded processors is at most p − 1 after the distribution step (that is, step 2); hence the sum of the surplus tasks of the overloaded processors is less than the number of underloaded processors after the completion of step 3.1. Otherwise, the sum of the DIF(Pi)'s over all processors would be greater than p − 1, which contradicts Lemma 3.

Lemma 5: The redistribution step needs at most (M − 1) log p time units.

Proof: By Lemma 3, after the distribution step there are at most p − 1 surplus tasks in one or more overloaded processors, with no more than M tasks assigned to each processor. In the redistribution step, B(t) is the number of overloaded processors by Lemma 4. Since 0 ≤ |DIF(Pi)| ≤ M − 1 and log p time units are required to route the surplus tasks through the superconcentrator, the time required for the redistribution step is at most (M − 1) log p.

Theorem 1: Load balancing can be accomplished in O(M log p) communication and computation time based on the superconcentrator, where each processor has at most M tasks.

Proof: First, we prove the correctness of the algorithm. By Lemma 3, there are at most p − 1 surplus tasks in the overloaded processors after the distribution step; hence the distribution step alone is not sufficient to achieve load balancing, and the redistribution step is required. By Lemma 4, at the beginning of step 3.2 there are more underloaded processors than surplus tasks in the overloaded processors. Therefore, load balancing is accomplished after the redistribution step, with the numbers of tasks of any two processors differing by at most one. Next, we estimate the time complexity of the algorithm. It takes O(log p) time units to compute the sum N and the load differences, and constant time to compute the average for each processor. Therefore, step 1 requires only O(log p) time units for both communication and computation.



By Lemma 2, the superconcentration process between overloaded and underloaded processors takes O(M log p) communication and computation time to complete. By Lemma 5, the redistribution step takes at most an additional (M − 1) log p time units. Combining the above results, the theorem is proved.
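Restating this accounting in a single expression (a summary of the proof above, with the O(log p) routing factor written explicitly):

```latex
T_{\mathrm{total}} \;=\; \underbrace{O(\log p)}_{\text{step 1}}
  \;+\; \underbrace{2(M-1)\log p}_{\text{step 2 (Lemma 2)}}
  \;+\; \underbrace{(M-1)\log p}_{\text{step 3 (Lemma 5)}}
  \;=\; O(M \log p).
```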


5

Conclusion


In this paper, a load balancing algorithm based on a linear cost self-routing superconcentrator has been proposed. Using this scheme, the underlying system can leave the load balancing problem entirely to the superconcentrator. That is, the superconcentrator takes care of all the work required for load balancing and provides all the communication paths for the task migration that the load balancing process requires. Hence, the scheme is transparent to the system and does not add any traffic on the links between processors of the underlying system. The only extra cost is an O(p) superconcentrator, plus two additional I/O ports per computer, one for input and one for output, connected to the corresponding superconcentrator output and input, respectively. The results show that, with this algorithm, load balancing on mesh- and tree-based highly parallel multicomputers can be done in O(M log p) time with an additional O(p) cost due to the added superconcentrator. In addition, the scheme adds extra communication paths between the processors, so it can increase the communication capacity among the processors and tolerate edge faults of the system.


References

[1] Shahid H. Bokhari, "A network flow model for load balancing in circuit-switched multicomputers," IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 6, pp. 649-657, Jun. 1993.

[2] G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors," J. Parallel Distributed Comput., Vol. 7, pp. 279-301, Oct. 1989.

[3] D. Gerogiannis and S. C. Orphanoudakis, "Load balancing requirements in parallel implementations of image feature extraction tasks," IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 9, pp. 994-1013, Sept. 1993.

[4] Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, New York: McGraw-Hill, 1993.

[5] Ching Yuh (Gene E.) Jan and A. Yavuz Oruç, "Fast self-routing permutation switching on an asymptotically minimum cost network," IEEE Trans. on Computers, Vol. C-42, No. 12, pp. 1469-1479, Dec. 1993.

[6] Gene Eu Jan, "Fast self-routing circuit-switching and packet-switching concentrators, hyperconcentrators, and superconcentrators," manuscript.

[7] Orly Kremien and Jeff Kramer, "Methodical analysis of adaptive load sharing algorithms," IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 6, pp. 747-760, Nov. 1992.

[8] F. C. H. Lin and R. M. Keller, "The gradient model load balancing method," IEEE Trans. Software Eng., Vol. SE-13, No. 1, pp. 32-38, Jan. 1987.

[9] R. Mirchandaney, D. Towsley, and J. A. Stankovic, "Analysis of the effects of delays on load sharing," IEEE Trans. Comput., Vol. C-38, No. 11, pp. 1513-1525, Nov. 1989.

[10] D. Nicol and J. Saltz, "Dynamic remapping of parallel computations with varying resource demands," IEEE Trans. Comput., Vol. C-37, No. 9, pp. 1073-1087, Sept. 1988.

[11] E. Shamir and E. Upfal, "A probabilistic approach to the load-sharing problem in distributed systems," J. Parallel Distributed Comput., Vol. 4, pp. 521-530, 1987.

[12] K. G. Shin and Y. C. Chang, "Load sharing in distributed real-time systems with state-change broadcasts," IEEE Trans. Comput., Vol. C-38, No. 8, pp. 1124-1142, Aug. 1989.

[13] Marc H. Willebeek-LeMair and Anthony P. Reeves, "Strategies for dynamic load balancing on highly parallel computers," IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 9, pp. 979-993, Sept. 1993.

[14] S. Zhou, "A trace-driven simulation study of dynamic load balancing," IEEE Trans. Software Eng., Vol. SE-14, No. 9, pp. 1327-1341, Sept. 1988.

