
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 20:75–97 Published online 17 August 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1222

A comprehensive analytical model of interconnection networks in large-scale cluster systems

Bahman Javadi1, Jemal H. Abawajy2,∗,† and Mohammad K. Akbari1

1 Computer Engineering and Information Technology Department, Amirkabir University of Technology, 424 Hafez Ave., Tehran, Iran
2 School of Engineering and Information Technology, Deakin University, Geelong, Vic. 3217, Australia

∗ Correspondence to: Jemal H. Abawajy, School of Engineering and Information Technology, Deakin University, Geelong, Vic. 3217, Australia.
† E-mail: [email protected]

SUMMARY

The trends in parallel processing system design and deployment have been toward networked distributed systems such as cluster computing systems. Since the overall performance of such distributed systems often depends on the efficiency of their communication networks, performance analysis of the interconnection networks for such distributed systems is paramount. In this paper, we develop an analytical model, under non-uniform traffic and in the presence of communication locality, for the m-port n-tree family of interconnection networks commonly employed in large-scale cluster computing systems. We use the proposed model to study two widely used interconnection network flow control mechanisms, namely wormhole and store&forward. The proposed analytical model is validated through comprehensive simulation. The results of the simulation demonstrate that the proposed model exhibits a good degree of accuracy for various system organizations and under different working conditions. Copyright © 2007 John Wiley & Sons, Ltd.

Received 13 June 2006; Revised 13 March 2007; Accepted 2 April 2007

KEY WORDS: analytical modeling; super-cluster; interconnection networks; flow control mechanisms; latency; throughput

1. INTRODUCTION

Advances in computational and communication technologies have made it economically feasible to develop large-scale high-performance computing systems such as super-cluster systems [1].

This paper addresses the interconnection network performance analysis problem with emphasis on large-scale cluster computing systems similar to that described in [1]. The motivation for our work is that large-scale cluster computing systems are gaining importance in both the academic and commercial sectors [2–4], and a wide variety of parallel applications are being hosted on such systems as well [2,5–10]. Examples of production-level super-cluster systems include Huinalu (an IBM Linux super-cluster) [9] and the AIST super-cluster system [10].

Having a fast communication network does not necessarily guarantee good performance. This can be attributed to contention in host nodes, network links, and network switches [11]. Node contention happens when multiple data packets compete for a receive channel of a node, whereas link contention occurs when two or more packets share a communication link. Switch contention is due to unbalanced traffic flow through the switch, resulting in an overflow of the switch buffer. Moreover, the topology, routing algorithm, and flow control mechanism all have an impact on contention and determine the performance of an interconnection network. The network topology defines how the network is physically connected, the routing algorithm establishes the path between the source and the destination of a message, and the flow control mechanism manages the allocation of resources (e.g. channels and buffers) to messages as they progress along their route.

Since the interconnection network plays a central role in determining the overall performance of network-based distributed systems, evaluation models and tools for interconnection networks are very important. Simulation has been commonly used to investigate the performance of large-scale cluster computing systems [6,7]. In contrast, we are interested in analytical modeling of interconnection networks. The significant advantage of analytical models over simulation is that they provide quick performance estimates, whereas simulation demands excessive computation. Several analytical performance models for multi-computer systems have been proposed in the literature (e.g., [3–5,12,13]), mostly under the uniform traffic pattern. However, analytical performance models for the systems of interest are rare or confined to small, single-cluster systems [14–17]. For example, a general model based on queuing networks was proposed for a single cluster system in [14]. That model assumes that the message length is exponentially distributed, and the extensive numerical calculation it requires renders it too complicated. In contrast, our model has been simplified based on reasonable assumptions for super-cluster systems.

The contribution of our work is as follows. First, we propose an analytical model for interconnection networks widely used in super-cluster systems under non-uniform traffic in the presence of communication locality. Second, the proposed model is applied to the wormhole and store&forward flow control mechanisms. Third, the proposed model is validated through comprehensive simulation, which demonstrates that it exhibits a good degree of accuracy for various system sizes and under different working conditions.

The motivation for considering non-uniform traffic is that the uniform traffic pattern commonly assumed in analytical modeling is not always appropriate in practice; there are many real-world applications that exhibit non-uniform traffic patterns. Communication locality has also been shown to be important in cluster systems [2,18]. For example, parallel applications such as computational fluid dynamics, image processing, N-body simulation, and finite element computation exhibit destination locality [2,18]. Last but not least, wormhole and store&forward are widely used flow control mechanisms.


The rest of the paper is organized as follows. In Section 2, a brief background and an overview of the cluster computing systems of interest are discussed. In Section 3, a detailed description of the proposed analytical model is presented. The model validation and the performance analysis of both flow control mechanisms are discussed in Section 4. We summarize our findings and conclude the paper in Section 5.

2. BACKGROUND

Figure 1 shows the general architecture of the super-cluster computing system of interest. The system is made up of C clusters. Each cluster is composed of N_0 independent processing nodes with associated memory modules, {Pn_0, Pn_1, ..., Pn_{N_0−1}}. Also, each cluster has two communication networks, an Intra-Communication Network (ICN1) and an intEr-Communication Network (ECN1). ICN1 is used for message passing between processors, whereas ECN1 is used to transmit messages between clusters as well as messages used for the management of the system. A set of Concentrators/Dispatchers [19] is used to interconnect ECN1 and ICN2. Since high-performance computing clusters typically utilize Constant Bisectional Bandwidth networks [10,20,21], we adopted the m-port n-tree [22], built from fixed-arity switches, as the topology for each cluster system. An m-port n-tree topology consists of N_pn processing nodes and N_sw network switches, which can be calculated as follows:

N_pn = 2 (m/2)^n    (1)

N_sw = (2n − 1) (m/2)^{n−1}    (2)
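As a quick check of Equations (1) and (2), the following short Python sketch (our own illustration, not part of the original model) computes the node and switch counts; the 4-port 3-tree of Figure 2 and the 8-port 2-tree used later in the validation are the examples.

    def m_port_n_tree_size(m, n):
        """Eqs. (1)-(2): processing node and switch counts of an m-port n-tree."""
        half = m // 2
        n_pn = 2 * half ** n                   # Eq. (1)
        n_sw = (2 * n - 1) * half ** (n - 1)   # Eq. (2)
        return n_pn, n_sw

    print(m_port_n_tree_size(4, 3))   # 4-port 3-tree of Figure 2 -> (16, 20)
    print(m_port_n_tree_size(8, 2))   # 8-port 2-tree used in Section 4 -> (32, 12)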

Figure 1. Super-cluster computing system architecture (C clusters, each with N_0 processing nodes attached to an ICN1 and an ECN1, interconnected through concentrators/dispatchers and the ICN2).


Figure 2. An example of m-port n-tree topology; 4-port 3-tree.

In this network, a processing node is labeled as an n-tuple, S = (S_0 S_1 ... S_{n−1}), such that S ∈ {0, 1, ..., m − 1} × {0, 1, ..., (m/2) − 1}^{n−1}. Also, each network switch has m communication ports, {0, 1, 2, ..., m − 1}, that are attached to other switches or to processing nodes. Every switch, except the root switches, connects with its descendants and ancestors through ports {0, 1, 2, ..., (m/2) − 1} and {(m/2), (m/2) + 1, ..., m − 1}, respectively. This means that the internal switches have m/2 Up Links (UL) to their ancestors and m/2 Down Links (DL) to their descendants. An example of a 4-port 3-tree network topology is shown in Figure 2.

Note that the m-port n-tree is a full bisection bandwidth topology [23], which means the link contention problem does not occur in such a network. In this paper, we used a deterministic routing scheme, based on the Up∗/Down∗ routing algorithm [24], which is proposed in [23]. In this algorithm, each message experiences two phases, an ascending phase to reach the nearest common ancestor (NCA) of the source and destination, followed by a descending phase. Furthermore, since this algorithm produces a balanced traffic distribution, the switch contention problem is not present [23].

Flow control manages the allocation of resources to messages as they progress along their route. The key resources in most interconnection networks are the channels and the buffers; buffers allow nodes to hold packets temporarily. The two most popular flow control mechanisms are store&forward and wormhole flow control, which are widely used in commercial switches [19]. With store&forward flow control, each node along a route waits until a packet has been completely received (stored) and then forwards the packet to the next node. The major drawback of this mechanism is its very high latency. Wormhole flow control overcomes the latency penalty of store&forward by dividing a packet into a sequence of fixed-size units, called flits, and allocating channels and buffers to flits rather than packets. The header flit (containing the routing information) establishes the path through the network while the remaining data flits follow it in a pipelined fashion. If a channel transmits the header of a message, it must transmit all the remaining flits of the same message before transmitting flits of another message. When the header is blocked (i.e. the requested next channel is busy), the data flits stop advancing and remain spread across the channels that they have already acquired. Fast Ethernet and Gigabit Ethernet switches use store&forward packet switching. In contrast, dedicated cluster network technologies (e.g. Myrinet, InfiniBand and QsNet) use wormhole flow control. In this paper, we propose an analytical model for both flow control mechanisms.
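To make the distance structure concrete, the sketch below enumerates destination distances under the labeling above. It assumes, consistently with the node counts used in Section 3.1 (m/2 − 1 nodes at distance 2, (m/2)(m/2 − 1) at distance 4, and so on), that two processing nodes whose labels share a longest common prefix of length p are 2(n − p) links apart under the NCA routing; the helper names are ours.

    from collections import Counter
    from itertools import product

    def nodes(m, n):
        """All node labels (S0, ..., S_{n-1}) of an m-port n-tree."""
        return product(range(m), *[range(m // 2)] * (n - 1))

    def distance(src, dst, n):
        """Assumed NCA distance: 2 * (n - longest common label prefix)."""
        p = 0
        while p < n and src[p] == dst[p]:
            p += 1
        return 2 * (n - p)

    m, n = 4, 3                      # the 4-port 3-tree of Figure 2
    src = next(iter(nodes(m, n)))
    hist = Counter(distance(src, d, n) for d in nodes(m, n) if d != src)
    print(sorted(hist.items()))      # -> [(2, 1), (4, 2), (6, 12)]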


3. THE PROPOSED ANALYTICAL MODEL

In this section, we develop an analytical model for the super-cluster system described in Section 2. First, the model for wormhole flow control is discussed. We then modify the model to capture the store&forward flow control. A summary of the notations used throughout this paper is presented in Table I.

Table I. Notations in the analytical model.

λ_g: Message generation rate at a node
λ_I1: Arrival rate of messages in the ICN1
λ_E1: Arrival rate of messages in the ECN1
λ_I2: Arrival rate of messages in the ICN2
η_I1: Arrival rate of messages on a channel in the ICN1
η_E1: Arrival rate of messages on a channel in the ECN1
η_I2: Arrival rate of messages on a channel in the ICN2
M: Length of generated messages in flits
L_m: Length of each flit in bytes
K: Number of stages in the network
k: Stage number, where 0 ≤ k ≤ K − 1
t_n2s: Time to transfer a flit between a switch and a node or vice versa
t_s2s: Time to transfer a flit between two adjacent switches
P_Bk: Channel blocking probability at stage k
P_o: Probability of a message exiting a cluster (degree of inter-cluster locality)
P_o(uniform): Probability of a message exiting a cluster with uniform distribution
ζ: Degree of intra-cluster communication locality
P_j,n: Probability of a message crossing 2j links in the m-port n-tree network
P_ζ,j,n: Probability of a message crossing 2j links in the m-port n-tree network with locality ζ
ϑ(ζ, m, n): Normalization factor for P_ζ,j,n
C: Number of clusters in the system
N_0: Number of processors in each cluster (cluster size)
N: Number of nodes in the system
n_c: Number of trees in ICN2 (m-port n_c-tree topology)
τ_net: Network technology latency
B_net: Network technology bandwidth
τ_sw: Network switch fabric latency
m, n: Parameters of the m-port n-tree topology
d_avg: Average distance in the network topology with uniform traffic
d_avg(ζ): Average distance in the network topology with locality traffic
W_k: Average amount of time spent waiting to acquire a channel at stage k
T_k: Average service time of a message at stage k
W_I1: Average waiting time at the source queue in the intra-cluster network
W_E1,I2: Average waiting time at the source queue in the inter-cluster networks
T_I1: Average intra-cluster network latency
T_E1,I2: Average inter-cluster network latency
W_dis: Average waiting time at the concentrator/dispatcher
R_I1: Average time for the tail flit to reach the destination in the intra-cluster network
R_E1,I2: Average time for the tail flit to reach the destination in the inter-cluster networks
L_I1: Average message latency in the intra-cluster network
L_E1,I2: Average message latency in the inter-cluster networks
ℓ: Average message latency

The proposed model is built on the basis of the following assumptions, which are commonly accepted in the literature [3–5,12–14]:


1. To cover a wider traffic range and to have a more general and applicable model, we assume that there are two types of communication locality in the traffic pattern of the system: 'inter-cluster locality' and 'intra-cluster locality'. When a message is generated, it has a finite probability P_o of being an external message and probability 1 − P_o of being an internal message; this probability is defined as the degree of inter-cluster locality. External messages are destined to any other cluster in the system with equal probability. Internal messages are destined to a node within the cluster with intra-cluster locality ζ, which means that the likelihood of communication to different nodes decreases with distance.
2. Nodes generate traffic independently of each other, following a Poisson process with a mean rate of λ_g messages per time unit. It has been shown in [2,17] that the temporal behavior of message sending appears to be exponentially distributed in real cluster systems, so the Poisson process is a reasonable approximation.
3. Network heterogeneity is present between the inter-cluster and intra-cluster communication networks.
4. The network switches are input buffered and each channel is associated with a single flit buffer.
5. Message length is fixed (M flits per packet).
6. The source queue at the injection channel in the source node has infinite capacity. Moreover, messages are transferred to the node once they arrive at their destinations.

We have two types of connections in this topology, node to switch (or switch to node) and switch to switch. In the first and the last stage, we have node-to-switch and switch-to-node connections, respectively. In the middle stages, the switch-to-switch connection is employed. Each type of connection has a service time, which is approximated as follows:

t_n2s = 0.5 τ_net + L_m / B_net    (3)

t_s2s = τ_sw + L_m / B_net    (4)

where t_n2s and t_s2s represent the times to transmit a flit over a node-to-switch (or switch-to-node) and a switch-to-switch connection, respectively, τ_net and τ_sw are the network and switch latencies, 1/B_net is the transmission time of one byte (the inverse of the network bandwidth), and L_m is the length of each flit in bytes. In the presence of network heterogeneity, we have two values for each transmission time: the pair (t_n2s(I), t_s2s(I)) is adopted for the intra-cluster network and the pair (t_n2s(E), t_s2s(E)) for the inter-cluster networks.

3.1. Traffic pattern analysis

The traffic pattern in the system has two types of communication locality, and it mainly affects the average message distance, which is the expected number of links that a message traverses to reach its destination. For a newly generated message, the average number of links that the message traverses to reach its destination (i.e. d_avg) is given by the following equation:

d_avg = Σ_{j=1}^{n} 2j P_{j,n}    (5)


In Equation (5), P_{j,n} is the probability of a message crossing 2j links (j links in the ascending and j links in the descending phase) to reach its destination. Different choices of P_{j,n} lead to different distributions of message destinations, and consequently to different average message distances.

As mentioned in assumption 1, the probability P_o is the degree of inter-cluster communication locality and is defined as the ratio of the number of external messages to the total number of messages. Note that, in the inter-cluster traffic pattern, an external message is destined to any other node in the system with equal probability.

In the m-port n-tree, we observe that each node has (m/2 − 1) nodes at distance 2, (m/2)(m/2 − 1) nodes at distance 4, (m/2)^2 (m/2 − 1) nodes at distance 6, and so on, and (m/2)^{n−1} (m − 1) nodes at distance 2n. Hence, recalling that a node cannot send a message to itself, the probability that a newly generated message crosses 2j links under the uniform distribution, P_{j,n}, can be defined as

P_{j,n} = (m/2)^{j−1} (m/2 − 1) / [2 (m/2)^n − 1],   j = 1, 2, ..., n − 1
P_{j,n} = (m/2)^{j−1} (m − 1) / [2 (m/2)^n − 1],   j = n    (6)

By substituting Equation (6) into Equation (5), the average message distance with uniform distribution is obtained as follows:

d_avg = [(m/2)^n (nm − 2n − 1) + 1] / [(m/2 − 1) ((m/2)^n − 1/2)]    (7)

For internal messages, we consider the case where the traffic contains communication locality. There are several models for communication locality [25]. The following analysis uses a communication locality model in which the likelihood of communication to different nodes decreases with distance, and all nodes within a cluster share the same degree of locality. Let ζ denote the degree of locality, which is the fraction of nodes that are potential candidates to receive a message from a source node. For a given source node, the destination node is chosen randomly among the cluster's nodes in a subset of ζN_0 nodes centered at the source node. For a given degree of locality ζ, destination nodes for messages originating at a source node with address S = (S_0 S_1 ... S_{n−1}) are randomly chosen from the set of nodes D = (D_0 D_1 ... D_{n−1}), where each index of the destination is bounded as follows:

S_i − (1/2)(ζN_0/2)^{1/n} − 1 ≤ D_i ≤ S_i + (1/2)(ζN_0/2)^{1/n}   (mod m/2),   1 ≤ i ≤ n − 1
S_i − (ζN_0/2)^{1/n} − 1 ≤ D_i ≤ S_i + (ζN_0/2)^{1/n}   (mod m),   i = 0    (8)
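Before turning to the locality distribution, the closed form (7) can be checked against the direct summation of Equations (5) and (6); the short sketch below (function names are ours) does this for a few (m, n) pairs.

    def p_uniform(j, m, n):
        """Eq. (6): probability of a 2j-link journey under uniform traffic."""
        half = m / 2
        num = half ** (j - 1) * ((half - 1) if j < n else (m - 1))
        return num / (2 * half ** n - 1)

    def d_avg_sum(m, n):
        """Eq. (5) with Eq. (6) substituted."""
        return sum(2 * j * p_uniform(j, m, n) for j in range(1, n + 1))

    def d_avg_closed(m, n):
        """Eq. (7): closed form of the average distance under uniform traffic."""
        half = m / 2
        return (half ** n * (n * m - 2 * n - 1) + 1) / ((half - 1) * (half ** n - 0.5))

    for m, n in [(4, 3), (8, 2), (4, 4)]:
        assert abs(d_avg_sum(m, n) - d_avg_closed(m, n)) < 1e-9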


With such a locality model, the probability of an internal message crossing 2j links to reach its destination can be derived, based on Equation (6), as

P_{ζ,j,n} = (m/2)^{j−1} (m/2 − 1) ζ^j / ϑ(ζ, m, n),   j = 1, 2, ..., n − 1
P_{ζ,j,n} = (m/2)^{j−1} (m − 1) ζ^j / ϑ(ζ, m, n),   j = n    (9)

where ϑ(ζ, m, n) is a normalizing factor, chosen such that Σ_{j=1}^{n} P_{ζ,j,n} = 1. This factor can therefore be determined as follows:

ϑ(ζ, m, n) = [(ζm/2)^n (m − 1 − 1/ζ) − m/2 + 1] / (m/2 − 1/ζ)    (10)

When ζ is small, the probability of exchanging messages between nodes decreases dramatically as their distance increases; in other words, the message transmission length decreases. On the other hand, as ζ approaches 1, the traffic pattern approaches the uniform distribution. Substituting Equation (9) for P_{j,n} in Equation (5), the average message distance for internal messages in the presence of communication locality is obtained as

d_avg(ζ) = 2 [(ζm/2)^n ((ζ(m − 1) − 1)(nζm/2 − n − 1) + ζm/2 − 1) + ζ(m/2 − 1)] / {ζ [(m/2 − 1)(1 − ζm/2) + (ζm/2)^n (m − 1 − 1/ζ)(ζm/2 − 1)]}    (11)

It is worth noting that when ζ = 1 and the external probability P_o is computed by the following equation, the traffic pattern is purely uniform:

P_o(uniform) = (N − N_0) / (N − 1) = (C − 1) N_0 / (C N_0 − 1)    (12)
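A small sketch of Equations (9) and (10), again with names of our own choosing; it also checks that the locality distribution collapses to the uniform one as ζ approaches 1.

    def theta(zeta, m, n):
        """Eq. (10): normalization factor of the locality distribution."""
        a = zeta * m / 2
        return (a ** n * (m - 1 - 1 / zeta) - m / 2 + 1) / (m / 2 - 1 / zeta)

    def p_locality(j, zeta, m, n):
        """Eq. (9): probability of a 2j-link journey with locality degree zeta."""
        half = m / 2
        count = half ** (j - 1) * ((half - 1) if j < n else (m - 1))
        return count * zeta ** j / theta(zeta, m, n)

    def d_avg_locality(zeta, m, n):
        """Average internal-message distance, Eq. (5) with Eq. (9)."""
        return sum(2 * j * p_locality(j, zeta, m, n) for j in range(1, n + 1))

    m, n = 8, 2
    print(d_avg_locality(0.4, m, n))   # shorter average distance under locality
    print(d_avg_locality(1.0, m, n))   # -> 3.806..., the uniform value of Eq. (7)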

3.2. Node level analysis

The message flow model of the system is shown in Figure 3, where the path of a flit from a source node (S) to a destination node (D_in or D_out) through the various communication networks is illustrated. As shown in this figure, the average message latency can be found from the following equation:

ℓ = (1 − P_o) L_I1 + P_o L_E1,I2    (13)

where L_I1 and L_E1,I2 are the average message latency in the intra-cluster network and the average message latency in the inter-cluster networks, respectively. In the flow model, the source node (S) sends its requests to the ICN1 and ECN1 with probabilities 1 − P_o and P_o, respectively.
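As a trivial sketch, Equation (13) combines the two latency components (the function name is ours):

    def average_message_latency(p_o, l_intra, l_inter):
        """Eq. (13): overall average message latency."""
        return (1 - p_o) * l_intra + p_o * l_inter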


Figure 3. Message flow model in each communication network (a source node S reaches an intra-cluster destination D_in through the ICN1, or an inter-cluster destination D_out through the ECN1, the ICN2 (m-port n_c-tree) and the destination cluster's ECN1, via the concentrators/dispatchers).

The message path is depicted with arrows and its sequence is shown by the numbers. The request rate of a processor is λ_g, so the input rates of ICN1 and ECN1 fed by the same processor are λ_g (1 − P_o) and λ_g P_o, respectively. An internal message traverses the ICN1 to reach its destination (D_in). An external message goes through the ECN1 with probability P_o and then through the ICN2. In particular, the message leaves the ECN1 from the NCA switch of the source and destination. On the return path, it again accesses the ECN1 through the same NCA switch in the destination cluster to get to the destination node (D_out); that is, external messages traverse the whole of the ECN1 in each journey. The concentrators/dispatchers work as simple buffers that interface the two external networks (i.e. ECN1 and ICN2) and thus combine the message traffic from/to one cluster to/from the other clusters. Therefore, the total message rates received by ICN1 and ECN1 can be calculated as follows:

λ_I1 = N_0 (1 − P_o) λ_g    (14)

λ_E1 = N_0 P_o λ_g + N_0 P_o λ_g = 2 N_0 P_o λ_g    (15)

In the second stage, the total message rate of ICN2 can be computed by the following equation:

λ_I2 = C N_0 P_o λ_g    (16)

The rate of messages received by a given channel in each network is found by simply dividing the total message rate by the number of channels in the network. Since each message in the m-port n-tree topology traverses, on average, d_avg links to cross the network, and the number of bi-directional channels in the network is 4nN_0, the rate of messages received by each channel can be written as

η_I1 = λ_I1 d_avg(ζ) / (4 n N_0) = (1 − P_o) λ_g d_avg(ζ) / (4n)    (17)

η_E1 = λ_E1 d_avg(E1) / (4 n N_0) = 2 P_o λ_g d_avg(E1) / (4n)    (18)

η_I2 = λ_I2 d_avg(I2) / (4 n_c C) = N_0 P_o λ_g d_avg(I2) / (4 n_c)    (19)
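The rates of Equations (14)-(19) are straightforward to evaluate; a small sketch follows, using the N = 2^9 configuration of the experiments (C = 16 clusters of 4-port 4-trees, so N_0 = 32, with n_c = 3 from Equation (20) below). The average-distance arguments are illustrative placeholders that would in practice come from Equations (7) and (11), and all names are ours.

    def channel_rates(lambda_g, p_o, c, n0, n, n_c, d_intra, d_ecn1, d_icn2):
        """Eqs. (14)-(19): total network rates and per-channel rates."""
        lam_i1 = n0 * (1 - p_o) * lambda_g            # Eq. (14)
        lam_e1 = 2 * n0 * p_o * lambda_g              # Eq. (15)
        lam_i2 = c * n0 * p_o * lambda_g              # Eq. (16)
        eta_i1 = lam_i1 * d_intra / (4 * n * n0)      # Eq. (17)
        eta_e1 = lam_e1 * d_ecn1 / (4 * n * n0)       # Eq. (18)
        eta_i2 = lam_i2 * d_icn2 / (4 * n_c * c)      # Eq. (19)
        return eta_i1, eta_e1, eta_i2

    # d_* values below are illustrative placeholders for Eqs. (7) and (11).
    print(channel_rates(0.0004, 0.7, c=16, n0=32, n=4, n_c=3,
                        d_intra=5.0, d_ecn1=7.29, d_icn2=5.47))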

In Equations (17)-(19), d_avg(ζ) is the average distance in the ICN1 and is given by Equation (11), while d_avg(E1) and d_avg(I2) are the average distances in the ECN1 and ICN2, respectively, and are given by Equation (7). Since ICN2 has an m-port n_c-tree topology, n_c is the number of trees in the ICN2 and can be computed as follows:

n_c = ⌈(log_2 C − 1) / (log_2 m − 1)⌉    (20)

3.3. Average message latency in the intra-cluster network

The analysis is done in a top-down manner: we begin at the last stage and continue backwards to the first stage. The computed average message latency contains three factors. First, the average message service time that takes place across the channel. Next, the average time for the tail flit to reach its destination. Finally, the waiting time at the source queue is added to get the overall latency.

3.3.1. Average network latency

In what follows, we find the average latency of the intra-cluster communication network. Since each message may cross a different number of links to reach its destination, we denote the network latency of a 2j-link message by T_j, and averaging over all the possible destinations of a message yields the average network latency as

T_I1 = Σ_{j=1}^{n} P_{ζ,j,n} T_j    (21)

Our analysis begins at the last stage and continues backward to the first stage. The network stage numbering is based on the location of switches between the source and the destination nodes: the numbering starts from the stage next to the source node (stage 0) and goes up as we get closer to the destination node. The number of stages in the m-port n-tree topology is K = 2j − 1. The destination, at stage K − 1, is always able to receive a message, so the service time given to a message at the final stage is t_n2s(I). The service time at internal stages might be longer, because a channel is idled when the channel of the subsequent stage is busy. The average amount of time that a message waits to acquire a channel at stage k, W_k,j, is given by the product of the channel blocking probability at stage k, P_Bk,j, and the average service time of a channel at stage k, T_k,j/2 [19]:


W_k,j = (1/2) T_k,j P_Bk,j,   0 ≤ k ≤ K − 1    (22)

The value of P_Bk,j is determined using a birth–death Markov chain [26]. As can be seen in Figure 4, the rates of transition out of and into the first state are η_I1 and 1/T_k,j − η_I1, respectively. State 0 means that the channel is free and state 1 corresponds to the channel being blocked. Solving this chain for the steady-state probabilities gives

P_Bk,j = η_I1 T_k,j    (23)

Figure 4. Markov chain used to calculate the blocking probabilities.

The average service time of a message at stage k is equal to the message transfer time plus the waiting times at subsequent stages to acquire a channel, so

T_k,j = Σ_{l=k+1}^{K−1} W_l,j + M t_s2s(I),   0 ≤ k ≤ K − 2
T_k,j = M t_n2s(I),   k = K − 1    (24)

According to this equation, the average network latency for a message with a 2j-link journey is equal to the average service time of a message at stage 0, i.e. T_0,j = T_j.

3.3.2. Average waiting time at the source node

An intra-cluster message originating from a given source node in a cluster sees a network latency of T_I1 (given by Equation (21)). Due to the blocking that takes place in the network, the distribution function of the message latency becomes general. Therefore, a channel at the source node is modeled as an M/G/1 queue. The average waiting time for an M/G/1 queue is given by [26]

W_I1 = ρ x̄ (1 + C_x^2) / [2 (1 − ρ)]    (25)

ρ = λ x̄    (26)

C_x^2 = σ_x^2 / x̄^2    (27)

where λ is the average message rate on the network, x̄ is the average service time, and σ_x^2 is the variance of the service time distribution. Since the minimum service time of a message at the first stage is equal to M t_n2s(I), the variance of the service time distribution is approximated, based on a method proposed by Draper and Ghosh [12], as follows:

σ_x^2 = (T_I1 − M t_n2s(I))^2    (28)

As a result, the average waiting time in the source queue becomes

W_I1 = η_I1 (T_I1)^2 [1 + (T_I1 − M t_n2s(I))^2 / (T_I1)^2] / [2 (1 − η_I1 T_I1)]    (29)

Finally, the average message latency, L_I1, seen by a message crossing from the source node (S) to its destination (D_in) consists of three parts: the average waiting time at the source queue (W_I1), the average network latency (T_I1), and the average time for the tail flit to reach the destination (R_I1). Therefore,

L_I1 = W_I1 + T_I1 + R_I1    (30)

where

R_I1 = Σ_{j=1}^{n} P_{ζ,j,n} (Σ_{k=1}^{K−1} t_s2s(I) + t_n2s(I))    (31)
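The intra-cluster wormhole model of Equations (21)-(31) reduces to a short backward recursion per 2j-link journey followed by the M/G/1 correction at the source. The sketch below is a minimal implementation under those equations; all names are ours, the probability list is assumed to come from Equation (9), and the per-hop times from Equations (3) and (4).

    def network_latency_2j(j, eta, t_s2s, t_n2s, M):
        """T_j = T_{0,j} from Eqs. (22)-(24) for a 2j-link wormhole journey."""
        K = 2 * j - 1
        t_k = M * t_n2s                      # last stage (k = K-1), Eq. (24)
        waits = [0.5 * eta * t_k ** 2]       # W = (T/2)*P_B with P_B = eta*T, Eqs. (22)-(23)
        for _ in range(K - 1):               # stages K-2 down to 0
            t_k = sum(waits) + M * t_s2s     # Eq. (24)
            waits.append(0.5 * eta * t_k ** 2)
        return t_k                           # T_{0,j}

    def intra_cluster_latency(probs, eta, t_s2s, t_n2s, M):
        """L_I1 of Eq. (30); probs[j-1] = P_{zeta,j,n} from Eq. (9)."""
        n = len(probs)
        t_i1 = sum(p * network_latency_2j(j, eta, t_s2s, t_n2s, M)
                   for j, p in zip(range(1, n + 1), probs))        # Eq. (21)
        r_i1 = sum(p * ((2 * j - 2) * t_s2s + t_n2s)
                   for j, p in zip(range(1, n + 1), probs))        # Eq. (31)
        var = (t_i1 - M * t_n2s) ** 2                              # Eq. (28)
        w_i1 = eta * t_i1 ** 2 * (1 + var / t_i1 ** 2) \
               / (2 * (1 - eta * t_i1))                            # Eq. (29), valid for eta*T_I1 < 1
        return w_i1 + t_i1 + r_i1                                  # Eq. (30)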

3.4. Average message latency in the inter-cluster networks

As mentioned before, external messages cross through both networks, ECN1 and ICN2, to get to their destinations in other clusters. Since the flow control mechanism is wormhole, the latencies of these networks are calculated as one merged network. From this, and based on Equation (21), we can write

T_E1,I2 = Σ_{j=1}^{n} Σ_{h=1}^{n_c} P_{j+h,n+n_c} T_{j+h}    (32)

P_{j+h,n+n_c} = P_{j,n} P_{h,n_c}    (33)

That is, each external message crosses 2j links through the ECN1 (j links in the source cluster and j links in the destination cluster) and 2h links in the ICN2 to reach its destination, so the analysis is done over K = 2(j + h) − 1 stages. Based on Equations (22) and (23), the average amount of time that a message waits to acquire a channel at stage k in the inter-cluster networks is as follows:

W_k,j+h = (1/2) η_k (T_k,j+h)^2,   0 ≤ k ≤ K − 1    (34)

where the channel rate is given by the following equation:

η_k = η_I2,   j ≤ k < j + 2h − 1
η_k = η_E1,   otherwise    (35)

Similar to the intra-cluster network, the network latency for an inter-cluster message equals the average service time of a given channel at stage 0, with the following equation:

T_k,j+h = Σ_{l=k+1}^{K−1} W_l,j+h + M t_s2s(E),   0 ≤ k ≤ K − 2
T_k,j+h = M t_n2s(E),   k = K − 1    (36)

As before, the source queue is modeled as an M/G/1 queue and the same method is used to approximate the variance of the service time. Thus, the average waiting time at the source queue in the inter-cluster networks can be calculated as

W_E1,I2 = η_E1 (T_E1,I2)^2 [1 + (T_E1,I2 − M t_n2s(E))^2 / (T_E1,I2)^2] / [2 (1 − η_E1 T_E1,I2)]    (37)

The average time for the tail flit to reach the destination can be obtained as follows:

R_E1,I2 = Σ_{j=1}^{n} Σ_{h=1}^{n_c} P_{j+h,n+n_c} (Σ_{k=1}^{K−1} t_s2s(E) + t_n2s(E))    (38)

The concentrator/dispatcher works as a pair of simple bi-directional buffers that interface the two external networks (i.e. ECN1 and ICN2). The average waiting time at the concentrator/dispatcher is calculated in a similar manner to that for the source queue (Equation (25)). The service time of the queue is M t_s2s(E) and there is no variance in the service time, since the message length is fixed. So, the average waiting time is given by the following equation:

W_dis = η_I2 (M t_s2s(E))^2 / [2 (1 − η_I2 M t_s2s(E))]    (39)

Also, we model the dispatcher buffers in the concentrator/dispatcher as an M/G/1 queue with the same rate as the concentrator buffers, so their average waiting time is likewise given by Equation (39). Finally, the average message latency in the inter-cluster networks for a message from a source node (S) to a destination node (D_out) can be obtained as follows:

L_E1,I2 = W_E1,I2 + T_E1,I2 + R_E1,I2 + 2 W_dis    (40)
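The inter-cluster network latency follows the same backward recursion, with the channel rate of Equation (35) switching between η_E1 and η_I2 along the path; a minimal sketch under those equations (names ours):

    def inter_network_latency(j, h, eta_e1, eta_i2, t_s2s_e, t_n2s_e, M):
        """T_{j+h} = T_{0,j+h} from Eqs. (34)-(36) for an external message."""
        K = 2 * (j + h) - 1

        def eta(k):
            # Eq. (35): ICN2 channels in the middle of the path, ECN1 elsewhere
            return eta_i2 if j <= k < j + 2 * h - 1 else eta_e1

        t_k = M * t_n2s_e                        # stage K-1, Eq. (36)
        waits = [0.5 * eta(K - 1) * t_k ** 2]    # Eq. (34)
        for k in range(K - 2, -1, -1):
            t_k = sum(waits) + M * t_s2s_e       # Eq. (36)
            waits.append(0.5 * eta(k) * t_k ** 2)
        return t_k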

3.5. Store and forward model

The model is modified slightly to capture the store&forward flow control mechanism. With this mechanism, each node along a route waits until a packet has been completely received (stored) and then forwards the packet to the next node. So, we need to adjust Equations (3) and (4) to include the properties of store&forward flow control, as follows:

t_n2s = 0.5 τ_net + M L_m / B_net    (41)

t_s2s = τ_sw + M L_m / B_net    (42)

Since blocking does not happen in this mechanism, the blocking probability is zero (P_Bk,j = 0). It follows that the average network latency is equal to the transmission time of a packet to the next switch, so for the intra-cluster network the average network latency is T_I1 = t_n2s(I). The average service time at the source queue is t_n2s, which is the time to transmit a packet to the next node. Thus, the variance of the service time becomes zero and the average waiting time at the source queue for the intra-cluster network can be approximated as follows:

W_I1 = η_I1 (t_n2s(I))^2 / [2 (1 − η_I1 t_n2s(I))]    (43)

The above analysis is also applicable to the inter-cluster networks, so their average network latency is T_E1,I2 = t_n2s(E) and the average waiting time at the source queue for the inter-cluster networks can be approximated as follows:

W_E1,I2 = η_E1 (t_n2s(E))^2 / [2 (1 − η_E1 t_n2s(E))]    (44)

The average time for a packet to reach the destination, which is the main latency factor in this mechanism, can be found from Equations (31) and (38) for the intra- and inter-cluster networks, respectively. The concentrator/dispatcher function in the store&forward mechanism remains unchanged, so the average waiting times are given by the following equation:

W_dis = η_I2 (t_s2s(E))^2 / [2 (1 − η_I2 t_s2s(E))]    (45)

Finally, the average message latencies for the store&forward mechanism in the intra-cluster and inter-cluster networks are computed with the above parameters through Equations (30) and (40), respectively. Subsequently, the average message latency for the super-cluster system with the store&forward mechanism is given by Equation (13).

3.6. Network throughput

Throughput is the rate at which packets are delivered by a network for a particular traffic pattern. At traffic levels below the saturation point, the throughput equals the offered traffic. As the offered traffic is increased beyond the saturation point, the network will not be able to deliver messages as fast as they are being generated. Thus, the maximum network throughput is observed at the point of saturation.

We can also determine the maximum throughput of the network by using the proposed analytical model for the network latency. The latency is given as a function of the offered traffic for both flow control mechanisms (i.e. wormhole and store&forward), based on Equation (13).
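A simple way to extract this maximum throughput from the model is to sweep the offered traffic and record the last rate at which the model still yields a finite, positive latency (i.e. all the M/G/1 utilization terms such as η_I1 T_I1 stay below 1). The sketch below illustrates the idea for either flow control mechanism; latency_model stands for any function implementing Equation (13) and is an assumption of ours, not part of the paper.

    def max_throughput(latency_model, rate_step=1e-6, rate_max=1e-2):
        """Return the largest generation rate for which the model stays stable."""
        last_stable = 0.0
        rate = rate_step
        while rate <= rate_max:
            try:
                latency = latency_model(rate)
            except (ZeroDivisionError, ValueError):
                break                                 # model blew up: past saturation
            if latency < 0 or latency != latency:     # negative or NaN => unstable
                break
            last_stable = rate
            rate += rate_step
        return last_stable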

Figure 5. Maximum network throughput at the saturation point (average message latency and throughput plotted against the offered traffic, in messages per time unit).

The latency versus offered traffic curves share a distinctive shape: they start at the horizontal asymptote of the zero-load latency and slope upward to the vertical asymptote of the saturation throughput, as shown in Figure 5. The last operating point of Equation (13) before the latency becomes unbounded is the maximum network throughput, and it is shown in Figure 5 as the intersection of the latency and throughput curves.

4. MODEL VALIDATION

In this section, we validate the proposed model through comprehensive simulation experiments under different working conditions. The validated model is then used to analyze the relative performance merits of the two flow control mechanisms (i.e. wormhole and store&forward) in the system under study.

In order to validate the proposed model and justify the applied approximations, the model was simulated with a discrete event-driven simulator. This simulator is developed on top of the OMNeT++ (Objective Modular Network Test-bed in C++) simulation environment [27]. OMNeT++ allows the design of modular simulation models and offers an extensive simulation library that includes support for data collection, statistics, random number generators, and so on.


Table II. Network configurations for the model validation.

Network parameter                               Net.1    Net.2
Network technology bandwidth (per time unit)      500      300
Network latency (time unit)                      0.02     0.05
Switch latency (time unit)                       0.01     0.02

Extensive validation experiments have been performed for several combinations of cluster sizes, network sizes, and message lengths. Messages are generated at each node according to a Poisson process with an average rate of λ_g. The destination node is determined by using a uniform random number generator. Each packet is time stamped after its generation. The request completion time is checked in every 'sink' module at each node to compute the message latency. For each simulation experiment, statistics were gathered for a total of 100 000 messages. The first 10 000 messages were discarded to avoid distortions due to the warm-up phase. Also, there is a drain phase at the end of the simulation in which 10 000 generated messages were not counted in the statistics gathering, to provide enough time for all packets to reach their destinations. In our experiments, the simulation results are reported with a 95% confidence interval.

Two different network configurations are used in the validation experiments; they are listed in Table II. In all cases, the intra-cluster network uses the Net.1 configuration and the inter-cluster networks adopt the Net.2 configuration. For each system, two different traffic patterns, uniform traffic (P_o = P_o(uniform), ζ = 1) and locality traffic (P_o = 0.7, ζ = 0.4), have been adopted. The default system parameters used in the experiments are as follows. We set M = 32 and 64 flits as in [3,4,12–14], and the flit length L_m = 256 and 512 bytes. Most cluster computing systems have a size in the range N = 2^9–2^10 nodes, with the number of clusters usually between C = 2^4 and 2^5 [28]. Last but not least, the switch size is set to m = 4 and 8 ports as in [21].

4.1. Results and discussions

The results of simulation and analysis for the two systems are depicted in Figures 6–9 for the store&forward mechanism and in Figures 10–13 for the wormhole mechanism, under uniform traffic and under locality traffic; the average message latencies are plotted against the offered traffic. These figures show that the analytical model predicts the average message latency with a good degree of accuracy when the system is in the steady-state region, that is, when it has not reached the saturation point. However, there are discrepancies between the results provided by the model and the simulation when the system is under heavy traffic and approaches the saturation point. This is due to the approximations that have been made in the analysis to ease the model development. For instance, in this region the traffic on the links is not completely independent, as we assume in our analytical model. Also, one of the most significant terms in the model under a heavily loaded system is the average waiting time at the source queue; the approximation made to compute the variance of the service time received by a message at a given channel (Equation (28)) is one source of the model's inaccuracy. Since most evaluation studies focus on network performance in the steady-state region, we can conclude that the proposed model can be a practical evaluation tool that can help a system designer to explore the design space and examine various design parameters.

Figure 6. Average message latency in a system with N = 2^10, M = 32, 64, and store&forward flow control under uniform traffic (C = 32, 8-port 2-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

Figure 7. Average message latency in a system with N = 2^9, M = 32, 64, and store&forward flow control under uniform traffic (C = 16, 4-port 4-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

4.2. Performance analysis of flow control mechanisms

In this section, we study the performance of the two flow control mechanisms (i.e. wormhole and store&forward) using the proposed analytical model. We use the average message latency and the maximum network throughput as performance metrics. Without loss of generality, we fixed the system size to N = 2^9; changing the system size does not affect the conclusions.

Figure 8. Average message latency in a system with N = 2^10, M = 32, 64, and store&forward flow control under locality traffic (C = 32, 8-port 2-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

Figure 9. Average message latency in a system with N = 2^9, M = 32, 64, and store&forward flow control under locality traffic (C = 16, 4-port 4-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

Figure 14 shows the average message latency for the store&forward and wormhole flow control mechanisms as a function of the offered traffic under both uniform traffic (P_o = P_o(uniform), ζ = 1) and locality traffic (P_o = 0.7, ζ = 0.4). In this analysis, we set M = 32 and L_m = 256. As expected, and as implicitly shown in the validation section, store&forward has a much higher latency than wormhole. We can observe that the latency of store&forward is about five times that of wormhole at low traffic rates under the uniform traffic pattern, and about 5.5 times greater under locality traffic. These results show that, under locality traffic, the wormhole flow control mechanism performs even better relative to the store&forward mechanism. However, as the offered traffic is increased, the two curves intersect and both mechanisms have the same latency. Beyond this point, store&forward marginally outperforms wormhole under both traffic patterns.

Figure 10. Average message latency in a system with N = 2^10, M = 32, 64, and wormhole flow control under uniform traffic (C = 32, 8-port 2-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

Figure 11. Average message latency in a system with N = 2^9, M = 32, 64, and wormhole flow control under uniform traffic (C = 16, 4-port 4-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

This is because of the longer blocking times of the wormhole mechanism in this region and, thus, its higher latency. We also observe from the graphs that the message latency of the wormhole mechanism under locality traffic is on average about 77% better than under uniform traffic. Similarly, the message latency of the store&forward flow control mechanism under locality traffic is on average about 65% better than under uniform traffic. This is because the reduction of the average message distance in the presence of communication locality has a greater effect on the wormhole mechanism than on store&forward.

Figure 12. Average message latency in a system with N = 2^10, M = 32, 64, and wormhole flow control under locality traffic (C = 32, 8-port 2-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

Figure 13. Average message latency in a system with N = 2^9, M = 32, 64, and wormhole flow control under locality traffic (C = 16, 4-port 4-tree; analysis and simulation for L_m = 256 and 512 bytes, plotted against the offered traffic).

The maximum network throughput for the store&forward and wormhole flow control mechanisms as a function of the message length (M) under both uniform traffic (P_o = P_o(uniform), ζ = 1) and locality traffic (P_o = 0.7, ζ = 0.4) is shown in Figure 15. In this experiment, we set L_m = 256 and used two different network configurations:

(a) Basic: ICN1 is Net.1, and ECN1 and ICN2 are Net.2 (see Table II).
(b) Increased bandwidth: ICN1 is Net.1 with the bandwidth set to 600, and ECN1 and ICN2 are Net.2 with the bandwidth set to 360.

Figure 14. Comparison of average message latency for store&forward and wormhole flow control mechanisms versus offered traffic under uniform traffic (P_o = P_o(uniform), ζ = 1) and locality traffic (P_o = 0.7, ζ = 0.4), for C = 16, 4-port 4-tree, M = 32 and L_m = 256.

Figure 15. Comparison of maximum network throughput for store&forward and wormhole flow control mechanisms versus message length under uniform traffic (P_o = P_o(uniform), ζ = 1) and locality traffic (P_o = 0.7, ζ = 0.4), for C = 16, 4-port 4-tree and L_m = 256, with the basic and increased-bandwidth configurations.

The results in Figure 15 show that, for all message lengths, the store&forward mechanism outperforms the wormhole mechanism in terms of maximum throughput, especially for small message lengths, under both traffic patterns. Based on this analysis, the maximum network throughput for both flow control mechanisms under locality traffic is about 25% higher than under uniform traffic. To consider the effect of the network bandwidth on the system performance, the second network configuration has been analyzed. As can be seen in the figure, this 20% increase in the network bandwidths yields similar behavior; however, the maximum network throughput increases by about 18–20% at all points under both traffic patterns.


Also, store&forward shows a slightly larger improvement than wormhole in most cases when the network bandwidths are increased.

5. CONCLUSIONS AND FUTURE DIRECTIONS

In this paper, we proposed an analytical model of interconnection networks for super-cluster computing systems. We showed that the model predicts the average message latency and the throughput of super-cluster systems for the wormhole and store&forward flow control mechanisms under uniform and locality traffic patterns. Simulation experiments have shown that the model predicts the message latency with a good degree of accuracy for both flow control mechanisms. The validated model was then used to analyze the relative performance merits of the two flow control mechanisms in the system under study. We believe that the simplicity and reasonable accuracy of the model make it an attractive tool for predicting the performance behavior of typical cluster and super-cluster systems under different working conditions. We plan to validate the proposed model against measurements from a real system as well as against mixed flow control mechanisms (i.e. wormhole intra-cluster, store&forward inter-cluster) in the future.

REFERENCES

1. Gilfeather F, Kovatch PA. Superclusters: TeraClass computing. Proceedings of the International Conference on High Performance Computing in the Asia-Pacific Region, Beijing, China, 2000; 874–883.
2. Kim J, Lilja DJ. Characterization of Communication Patterns in Message-Passing Parallel Scientific Application Programs (Lecture Notes in Computer Science, vol. 1362). Springer: Berlin, 1998; 202–216.
3. Sarbazi-Azad H, Khonsari A, Ould-Khaoua M. Performance analysis of deterministic routing in wormhole k-ary n-cubes with virtual channels. Journal of Interconnection Networks 2002; 3(1&2):67–83.
4. Ould-Khaoua M. A performance model for Duato's fully-adaptive routing algorithm in k-ary n-cubes. IEEE Transactions on Computers 1999; 42(12):1–8.
5. Agarwal A. Limits on interconnection network performance. IEEE Transactions on Parallel and Distributed Systems 1991; 2(4):398–412.
6. Abawajy JH, Dandamudi SP. Scheduling parallel jobs with CPU and I/O resource requirements in cluster computing systems. Proceedings of 11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer Telecommunications Systems (MASCOTS 2003), FL, U.S.A., 2003; 336–343.
7. Bucur AID, Epema DHJ. The Influence of the Structure and Sizes of the Jobs on the Performance of Co-allocation (Lecture Notes in Computer Science, vol. 1911). Springer: Berlin, 2000; 154–173.
8. Guest MF, Sherwood P. Computational chemistry applications: Performance on high-end and commodity-class computers. Proceedings of 16th Annual International Symposium on High Performance Computing Systems and Applications, NB, Canada, 2002; 290–301.
9. Maui High Performance Computing Center. http://www.mhpcc.edu, 2002.
10. Kudoh T. AIST super cluster: A large scale heterogeneous cluster with 14.6 TFLOPS peak performance. Proceedings of the 7th International Conference on High Performance Computing and Grids, Tokyo, Japan, 20–22 July 2004.
11. Chun ATT, Wang CL. Contention-free complete exchange algorithm on clusters. Proceedings of the IEEE International Conference on Cluster Computing, Saxony, Germany, 28 November–1 December 2000; 57–64.
12. Draper JT, Ghosh J. A comprehensive analytical model for wormhole routing in multi-computer systems. Journal of Parallel and Distributed Computing 1994; 23(2):202–214.
13. Boura YM, Das CR. Performance analysis of buffering schemes in wormhole routers. IEEE Transactions on Computers 1997; 46(6):687–694.
14. Hu PC, Kleinrock L. A queuing model for wormhole routing with timeout. Proceedings of the 4th International Conference on Computer Communications and Networks, Las Vegas, NV, 20–23 September 1995; 584–593.
15. Du X, Zhang X, Zhu Z. Memory hierarchy consideration for cost-effective cluster computing. IEEE Transactions on Computers 2000; 49(5):915–933.


16. Javadi B, Khorsandi S, Akbari MK. Queuing network modeling of a cluster-based parallel systems. Proceedings of the 7th International Conference on High Performance Computing and Grids, Tokyo, Japan, 20–22 July 2004; 304–307.
17. Javadi B, Khorsandi S, Akbari MK. Study of Cluster-based Parallel Systems using Analytical Modeling and Simulation (Lecture Notes in Computer Science, vol. 3483). Springer: Berlin, 2005; 1262–1271.
18. Vetter JS, Mueller F. Communication characteristic of large-scale scientific applications for contemporary cluster architectures. Journal of Parallel and Distributed Computing 2003; 63:853–865.
19. Dally W, Towles B. Principles and Practices of Interconnection Networks. Morgan Kaufmann: San Francisco, CA, 2004.
20. InfiniBand Clustering, Delivering Better Price/Performance than Ethernet. White Paper, Mellanox Technologies Inc., Santa Clara, CA, 2005.
21. Building Scalable, High Performance Cluster/Grid Networks: The Role of Ethernet. White Paper, Force10 Networks Inc., Milpitas, CA, 2004.
22. Lin X. An efficient communication scheme for fat-tree topology on infiniband networks. MSc Thesis, Department of Information Engineering and Computer Science, Feng Chia University, Taiwan, 2003.
23. Javadi B, Abawajy JH, Akbari MK. Modeling and analysis of heterogeneous loosely-coupled distributed systems. Technical Report TR C06/1, School of Information Technology, Deakin University, Australia, January 2006.
24. Schroeder MD et al. Autonet: A high-speed, self configuring local area network using point-to-point links. SRC Research Report 59, Digital Equipment Corporation, April 1990.
25. Reed DA, Fujimoto RM. Multicomputer Networks: Message-based Parallel Processing. MIT Press: Cambridge, MA, 1987; 33.
26. Kleinrock L. Queuing System: Computer Applications, vol. 2. Wiley: New York, 1975.
27. Varga A. The OMNeT++ discrete event simulation system. Proceedings of the European Simulation Multiconference, 2001.
28. Top500 Supercomputing site: www.top500.org.
