An Efficient Network-on-Chip (NoC) based Multicore Platform for Hierarchical Parallel Genetic Algorithms

Yuankun Xue∗, Zhiliang Qian†, Guopeng Wei‡, Paul Bogdan∗, Chi-Ying Tsui†, Radu Marculescu‡
∗ University of Southern California, Los Angeles, CA, USA, {yuankunx,pbogdan}@usc.edu
† HKUST, Hong Kong, China, {qianzl,eetsui}@ust.hk
‡ Carnegie Mellon University, Pittsburgh, PA, USA, {danielwei,radum}@cmu.edu

Abstract—In this work, we propose a new Network-on-Chip (NoC) architecture for implementing the hierarchical parallel genetic algorithm (HPGA) on a multi-core System-on-Chip (SoC) platform. We first derive the speedup metric of an NoC architecture which directly maps the HPGA onto NoC in order to identify the main sources of performance bottlenecks. Specifically, it is observed that the speedup is mostly affected by the fixed bandwidth that a master processor can use and the low utilization of slave processor cores. Motivated by the theoretical analysis, we propose a new architecture with two multiplexing schemes, namely dynamic injection bandwidth multiplexing (DIBM) and time-division based island multiplexing (TDIM), to improve the speedup and reduce the hardware requirements. Moreover, a task-aware adaptive routing algorithm is designed for the proposed architecture, which can take advantage of the proposed multiplexing schemes to further reduce the hardware overhead. We demonstrate the benefits of our approach using the problem of protein folding prediction, which is a process of importance in biology. Our experimental results show that the proposed NoC architecture achieves up to 240X speedup compared to a single island design. The hardware cost is also reduced by 50% compared to a direct NoC-based HPGA implementation.

I. INTRODUCTION

Evolutionary algorithms (EA) mimic the biological evolution and selection processes found in nature and can solve complex optimization problems efficiently [4]. As a particular type of EA, Genetic Algorithms (GA) are widely applied to search for optimized solutions in applications such as wireless communication, physics and bioinformatics [7]. In a GA, the potential candidate solutions (i.e., feasible individuals) are encoded into arrays of bits and evaluated using a problem-specific fitness function [7]. Genetic operations such as mutation and crossover are then applied to generate new populations of individuals [4]. The newly generated individuals, together with the stochastically selected elites from the current population, form a new generation [13]. This evolution process repeats until the best-ever fitness score converges to an optimal value or a certain termination criterion is met [13].
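To make this loop concrete, the following minimal sketch (in C++, the language of the simulator used later in the paper) shows one possible generational GA over bit-string individuals; the bit-count fitness function, rates and sizes are illustrative placeholders, not the settings used in this work.

#include <algorithm>
#include <cstdlib>
#include <vector>

// Minimal generational GA over bit-string individuals (illustrative only).
struct Individual {
    std::vector<int> genes;   // encoded candidate solution (array of bits)
    double fitness = 0.0;
};

// Placeholder fitness: number of 1-bits. A real application would plug in a
// problem-specific function, e.g. the protein-conformation energy used later.
double evaluate(const Individual& ind) {
    double f = 0.0;
    for (int g : ind.genes) f += g;
    return f;
}

Individual crossover(const Individual& a, const Individual& b) {
    Individual child = a;
    std::size_t cut = std::rand() % a.genes.size();         // single-point crossover
    for (std::size_t i = cut; i < b.genes.size(); ++i) child.genes[i] = b.genes[i];
    return child;
}

void mutate(Individual& ind, double rate) {
    for (int& g : ind.genes)
        if (std::rand() / (double)RAND_MAX < rate) g ^= 1;  // flip bit
}

int main() {
    const int pop_size = 100, generations = 200, gene_count = 64, elites = 10;
    const double mutation_rate = 0.01;

    std::vector<Individual> pop(pop_size);
    for (auto& ind : pop) {
        ind.genes.resize(gene_count);
        for (int& g : ind.genes) g = std::rand() % 2;
    }

    for (int gen = 0; gen < generations; ++gen) {
        for (auto& ind : pop) ind.fitness = evaluate(ind);   // fitness evaluation
        std::sort(pop.begin(), pop.end(),
                  [](const Individual& x, const Individual& y) {
                      return x.fitness > y.fitness;
                  });

        // Elites survive unchanged; the rest of the new generation is produced
        // by crossover and mutation of parents drawn from the better half.
        std::vector<Individual> next(pop.begin(), pop.begin() + elites);
        while ((int)next.size() < pop_size) {
            const Individual& p1 = pop[std::rand() % (pop_size / 2)];
            const Individual& p2 = pop[std::rand() % (pop_size / 2)];
            Individual child = crossover(p1, p2);
            mutate(child, mutation_rate);
            next.push_back(child);
        }
        pop.swap(next);
        // Termination could also check convergence of the best-ever fitness.
    }
    return 0;
}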

In general, GAs can solve optimization problems over small search spaces within an affordable time [3]. However, for more complicated optimization problems, such as protein folding prediction, which requires finding the best protein conformation from numerous potential solutions [1], the computation time increases dramatically. For these problems, parallel Genetic Algorithms (PGA) have been proposed in order to take advantage of the multiple computing resources available in the system. The most straightforward implementation of PGA uses massively parallel processors to calculate the fitness value of each individual in a single population (i.e., global single-population master-slave GAs) [3]. As shown in Fig. 1-a, in a master-slave based GA platform [3], the master processor distributes the individuals among the slave processors. Although simple in nature, in this architecture the master processor becomes the performance bottleneck for the entire system. Moreover, the overall speedup is limited, since only the master collects the fitness values of the current generation and performs the genetic operations that produce a new generation. Consequently, a more efficient PGA implementation consists of multiple master islands that perform the evolution separately and occasionally migrate individuals among the islands (i.e., distributed PGA) [3]. In Fig. 1-b, we show the architecture of a typical hierarchical distributed PGA (HPGA). In the upper level, multiple islands perform the computations for different populations. Migrations also occur between any two islands to avoid the best-ever fitness score getting stuck in a local optimum. Within each island, master-slave level parallelization is implemented [3]. In this way, the HPGA can take advantage of existing parallel computing platforms, such as the NoC-based Multi-processor System-on-Chip (MPSoC) [9], to perform the computations separately and accelerate the overall optimization process across the design space.

In this work, we tackle the problem of efficiently implementing the PGA on an NoC-based MPSoC. We first propose a formal method to analyze the speedup gain of directly mapping the HPGA onto the NoC. Then, we identify two speedup bottlenecks of this architecture and propose two multiplexing schemes to mitigate these limitations. More specifically, a dynamic injection bandwidth multiplexing (DIBM) scheme is proposed to increase the effective channel resources that can be used by the master processor. A time-division based island multiplexing (TDIM) scheme is designed to map and share several islands of the proposed NoC platform dynamically; this improves the processor utilization and reduces the hardware requirements.

Fig. 1: a) The master-slave based parallelization of GA [3]; b) the hierarchical parallel algorithm combining multiple-population coarse-grained GAs in the upper level with master-slave parallelization within each island [3].
Based on the newly proposed NoC architecture, a task-aware adaptive routing algorithm is used to further increase the resource utilization and reduce the latency. Finally, we use the protein folding prediction problem [14] as a case study to evaluate the performance of the proposed HPGA platform. Our experimental results show an up to 240X speedup in protein folding prediction compared to a single-master-single-slave implementation. Moreover, the proposed multiplexing schemes significantly reduce the hardware overhead compared to a direct HPGA mapping.

The remainder of the paper is organized as follows. In Section II, we review related work. In Section III, we analyze the speedup performance of the NoC-based HPGA platform, discuss the bottlenecks of this architecture, and then present the proposed multiplexing-based NoC platform as well as the dynamic routing algorithm. The experimental results are discussed in Section IV. Finally, Section V concludes this work by summarizing our main contributions.

II. RELATED WORK

Most of the platforms proposed for PGA implementation are based on GPUs or computer clusters. In [1], a computer-cluster system is used for a single-population master-slave based PGA. However, as there is only one master processor in the system, the speedup tends to saturate as the number of processors increases. In [10], an island-based genetic algorithm is implemented on an NVIDIA GPU-based architecture. In order to be compatible with the GPU architecture, a simplified unidirectional ring migration is used in the CUDA software model; this compromises the speed of migration among populations. Recently, motivated by the high scalability of NoCs, PGAs can also be implemented efficiently on a single chip for many emerging applications in embedded computing. For instance, in [5], an NoC-based MPSoC platform is developed for an island-based parallel genetic algorithm. In this architecture, each processor corresponds to an island and evolves independently. However, the adopted topology allows only migrations between neighboring processors in the NoC, and for each population the evolution process occurs within a single processor. Hence, the computation efficiency can be further improved by using several slave processors for each population, i.e., master-slave level parallelization within each island.

Fig. 2: The island-based NoC architecture for HPGA: a) the island associated with the master and slave processes in the algorithm; b) an example which maps the island onto a 4 × 4 region of the NoC.
In NoC-based multicore systems, several time-division multiplexing (TDM) schemes have been proposed for hard real-time systems. For instance, in [8], TDM virtual circuits (VCs) are proposed to meet the real-time requirements of a given application. In order to reduce the maximum delay, the TDM scheme allows two or more connections to share the usage of buffers and links in dedicated time slots. In [12], a TDM-based circuit-switched NoC is presented. In order to achieve predictable worst-case execution times, the authors propose an algorithm that determines the schedule for setting up all-to-all virtual circuits on top of NoC topologies. In contrast to these previous efforts on improving the quality-of-service (QoS) of specific flows, we borrow the idea of TDM and propose two multiplexing schemes for an NoC-based HPGA platform which target higher bandwidth and processor utilization.

In this work, instead of running HPGA programs on GPUs or CPU-based clusters, we propose an NoC-based customized platform for embedded applications. The speedup is enhanced by improving the bandwidth utilization through two new multiplexing schemes. A task-aware routing algorithm is also designed for the proposed architecture. Towards this end, we make the following contributions:
1) We provide a quantitative analysis of the speedup performance of mapping the PGA onto an NoC. This analysis identifies two bottlenecks for the overall performance.
2) Based on our analysis, we propose a new NoC architecture with two multiplexing schemes to improve the speedup and reduce the hardware requirements.
3) A task-aware routing algorithm is designed for efficient communication among the processors under the proposed NoC architecture.
4) We demonstrate the benefits of our approach via the protein folding problem and show that our solution achieves a significant speedup while reducing the hardware overhead.

III. PROPOSED NOC-BASED HPGA PLATFORM

In this section, we present the proposed HPGA platform based on a mesh NoC. The mesh topology is chosen due to its simplicity, layout and electrical efficiency. We start with a direct island-based architecture. Based on the analysis of the achievable speedup gain, we then propose a new architecture with two multiplexing schemes to enhance the processor and bandwidth utilization. Finally, a task-aware routing algorithm is presented to further reduce the delivery latency between the processors at run time.

A. Island-based HPGA-NoC and motivation for the proposed new NoC architecture

The HPGA presented in Fig. 1-b can be implemented directly by mapping each population island onto a mesh NoC consisting of a master and a set of slave processors. As shown in Fig. 2, within each island, the messages exchanged between the master and slave processors (i.e., the distribution flow and the fitness return flow) are sent through the routers. In the upper level, the islands are interconnected as in Fig. 1-b to support the communication required among the master processors of different islands (i.e., the population migrations in HPGA). However, due to network traffic conditions [2], this brute-force architecture has two drawbacks:

1) Limited injection bandwidth of the master processor: In HPGA, only one master can perform genetic operations on the entire population, so the workloads and traffic flows between the master and slave processors are quite uneven. Most of the time, the master processor needs to produce a new generation, divide the individuals and distribute them to the slaves. For the island-based architecture presented in Fig. 2, the injection/ejection bandwidth between the processors and the network is fixed (e.g., one flit width per cycle). The master processor can only send one flit to a specific slave processor every cycle, so the remaining slaves need to wait for a certain interval Twait before they receive their chromosome information. Increasing the number of slave processors in the island therefore reduces the overall fitness calculation time of the population on the one hand, but significantly increases the average Twait and reduces the slave processor utilization on the other. To explore the tradeoff between the computation and waiting time, we define the network capacity C as the maximal number of slave processors that can be used in a single island, which is given by:

C = arg max{∆G(N)/∆N > 0}, N ∈ N    (1)

where N is the number of slave processors and G(N) is the overall speedup compared to the one-master-one-slave design. From Eqn. 1, the speedup saturates beyond C slave processors, as there is no further improvement in ∆G(N) = G(N + 1) − G(N). Experimental results show that the island tends to saturate when the average waiting time required to receive two individuals in a slave processor (Twait) becomes comparable to Tsdelay. Here, Tsdelay refers to the average time seen by a master processor from sending out an individual to receiving its fitness. For a network with n slave processors, Twait can be represented as n ∗ Tinterval, where Tinterval is the time for the master processor to transfer one individual and is given by:

Tinterval = Tturnaround + Lchm    (2)

In Eqn. 2, Tinterval consists of two parts: Tturnaround is the average inter-arrival time of two chromosomes, which is determined by the master processor, while Lchm is the length of the encoded chromosome packet in terms of flits. Based on Eqn. 2, the network capacity C is derived as:

C ≈ n = Tsdelay / Tinterval    (3)

Next, we consider the effect of the network size on the fitness distribution time Tdis. Here, Tdis is the time from sending the whole population out to receiving its fitness values from the slave processors; 1/Tdis represents the speed of the fitness calculation process. We further assume that Tcom is the average communication time between a master and a slave processor and that Tcalc is the average time required to calculate the fitness value in a slave processor. Based on the population size N, the network capacity C and the number of available slave processors NC in the network, Tdis can be computed as in Eqns. 4-6, respectively. If N < NC, Tdis is calculated as:

Tdis = N ∗ (Lchm + Tturnaround) + 2 ∗ Tcom + Tcalc − Tturnaround    (4)

If NC < C and N ≥ NC, the calculation time in the slaves dominates the distribution process and Tdis equals:

Tdis = (N/NC + 1) ∗ Tcalc + (N mod NC) ∗ Tinterval    (5)

Otherwise, if the network size NC is larger than the capacity C, Tdis is determined by the transfer time of the master processor:

Tdis = (N/NC) ∗ NC ∗ Tinterval + 2 ∗ Tcom − Tturnaround + Tcalc    (6)

The population size N is assumed to be large enough in this analysis. For the single-master-single-slave design, NC equals 1 and Eqn. 5 is used to obtain Tdis. On the other hand, the maximum distribution speed of the architecture in Fig. 2 is achieved when NC is larger than the network capacity C (Eqn. 6). We assume Tcom is much smaller than Tcalc and consequently Tsdelay = Tcalc + 2 ∗ Tcom ≈ Tcalc. Then the maximum speedup gain over the single-slave architecture is the ratio of the two Tdis values:

GN ≈ Tcalc / Tinterval ≈ C    (7)

From Eqn. 7, GN is bounded by the network capacity C. In the direct island-based HPGA platform, Tcalc and Tinterval are fixed. As Tinterval is dictated by the bandwidth between the processors and the network, the maximum speedup gain is also limited for this architecture.
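To illustrate how Eqns. 4-7 predict the saturation of the speedup, the following sketch evaluates the Tdis model numerically; all timing constants (Tturnaround, Lchm, Tcom, Tcalc) are illustrative placeholders rather than values measured in this work.

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Illustrative evaluation of the island-based Tdis model (Eqns. 4-6) and the
// resulting speedup bound (Eqn. 7). All timing constants are placeholders.
struct IslandModel {
    double t_turnaround;  // inter-arrival time of two chromosomes in the master
    double l_chm;         // chromosome packet length (flits)
    double t_com;         // average master<->slave communication time
    double t_calc;        // average fitness calculation time in a slave

    double t_interval() const { return t_turnaround + l_chm; }           // Eqn. 2
    double capacity() const {                                            // Eqn. 3
        return (t_calc + 2.0 * t_com) / t_interval();                    // Tsdelay / Tinterval
    }
    // Distribution time for a population of N individuals on Nc slave cores.
    double t_dis(double N, double Nc) const {
        if (N < Nc)                                                      // Eqn. 4
            return N * (l_chm + t_turnaround) + 2.0 * t_com + t_calc - t_turnaround;
        if (Nc < capacity())                                             // Eqn. 5
            return (N / Nc + 1.0) * t_calc + std::fmod(N, Nc) * t_interval();
        return (N / Nc) * Nc * t_interval()                              // Eqn. 6
               + 2.0 * t_com - t_turnaround + t_calc;
    }
};

int main() {
    IslandModel m{5.0, 20.0, 10.0, 600.0};   // placeholder timings (cycles)
    const double N = 2400.0;                 // population size
    const double base = m.t_dis(N, 1.0);     // single-master-single-slave baseline
    for (double Nc : {1.0, 8.0, 16.0, 32.0, 64.0, 128.0}) {
        std::printf("Nc = %5.0f   speedup G(Nc) = %6.1f   (capacity C ~ %.1f)\n",
                    Nc, base / m.t_dis(N, Nc), m.capacity());
    }
    return 0;
}

With these placeholder numbers the computed gain grows almost linearly with Nc until it approaches the capacity C and then flattens, which is the saturation behavior discussed above.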

2) Low slave processor utilization: In HPGA, the overall algorithm can be divided into two phases. During the first phase, the master distributes the tasks to the slaves and waits for the returned fitness values (the DIS phase). After all the fitness values are returned to the master, in the second phase, genetic operations are performed in the master to select individuals for the next generation (the GA phase). Our profiling results show that most of the slave processors remain idle during the second phase. Fig. 3 shows the ratio between the time that an island stays in these two phases versus the number of slave processors. It can be observed that when the number of slaves approaches the network capacity, the time spent in the DIS and GA phases becomes comparable, which means most slave processors are idle around 50% of the time.

Fig. 3: The ratio of time that an island stays in the DIS and GA phases.

B. Dynamic Injection Bandwidth Multiplexing (DIBM)

In an island-based NoC architecture, the speedup gain is mainly determined by the injection channel bandwidth of the given topology. Under a fixed channel bandwidth between the master and the network, the maximum packet injection rate is always smaller than the reciprocal of Tinterval packets/cycle, which results in an increased Twait and limits the network capacity. In order to address this bottleneck, we propose to expand the injection bandwidth at run time. The DIBM scheme is motivated by the observation that, in HPGA, the utilization of the injection and ejection channels is greatly unbalanced. In the DIS phase, the master injects chromosome packets at a much higher rate than the slaves can send back the fitness flits, because Tinterval is much smaller than the fitness calculation time in the slave processor. Moreover, the one-flit fitness packet from a slave processor is much shorter than a chromosome packet sent by the master. Compared to the injection channel used by the master processor, the injection channels of the slaves remain idle most of the time. Thus, it is possible to take advantage of these injection channels and increase the effective injection bandwidth in a multiplexed manner.

Fig. 4: a) Proposed DIBM router; b) concentration degree.

Fig. 4-a shows the proposed router architecture that supports the DIBM scheme. An input queue is used to maintain the information of the multiplexing flow from the master; it is controlled by a local finite state machine (FSM). A built-in latch is used to store the injection data temporarily if the current injection channel is used by packets from the other processors. After the occupied channel is released, the data stored in the latch is pushed back into the queue. At run time, when the FIFO queue is not empty, the injection channel is used exclusively by the input queue. Otherwise, this channel is released to the local injection flow.
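A behavioral sketch of the injection-channel sharing just described is given below; it only captures the priority rule (the FIFO holding the master's multiplexed flow owns the channel whenever it is non-empty, otherwise the channel is released to the local flow), and the flit format and class names are assumptions made for illustration.

#include <queue>

// Behavioral sketch of DIBM injection-channel sharing at one slave router.
// Only the priority rule described above is modeled; the latch behavior for
// flits that arrive while the channel is occupied is simplified to "stay queued".
struct Flit {
    int  payload = 0;
    bool from_master = false;
};

class DibmInjectionPort {
public:
    void push_master_flit(const Flit& f) { master_fifo_.push(f); }
    void push_local_flit(const Flit& f)  { local_fifo_.push(f); }

    // One arbitration decision per cycle: returns true and fills `granted`
    // if a flit enters the network through this injection channel.
    bool cycle(Flit& granted) {
        if (!master_fifo_.empty()) {       // master's multiplexed flow has priority
            granted = master_fifo_.front();
            master_fifo_.pop();
            return true;
        }
        if (!local_fifo_.empty()) {        // channel released to the local flow
            granted = local_fifo_.front();
            local_fifo_.pop();
            return true;
        }
        return false;                      // channel idle this cycle
    }

private:
    std::queue<Flit> master_fifo_;   // multiplexed flow from the master (FSM-controlled)
    std::queue<Flit> local_fifo_;    // local slave's own injection traffic
};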

In order to evaluate the effect of bandwidth multiplexing, we use a parameter P to represent the concentration degree of the NoC architecture; this equals the number of adjacent routers whose injection channels are multiplexed with the current node. Fig. 4-b shows how the routers are organized under different P values. Similar to the Tdis analysis for the island-based NoC, when N < NC, Tdis is calculated as:

Tdis = [N ∗ (Lchm + Tturnaround) + 2 ∗ Tcom] / P + Tcalc − Tturnaround    (8)

On the other hand, if N > NC and NC < Cdim, Tdis is still dominated by the slave processors. Hence, Tdis can be calculated as in Eqn. 5. Different from Eqn. 1, in the derivation of the network capacity Cdim for DIBM, the waiting time Twait_dim needs to account for a group of P slave processors rather than a single one. Thus, Cdim can be approximated as:

Cdim ≈ n ∗ P = P ∗ Tcalc / Tinterval    (9)

Finally, when NC ≥ Cdim, Tdis is calculated as:

Tdis = (N/P) ∗ Tinterval + 2 ∗ Tcom − Tturnaround + Tcalc    (10)

Similar to Eqn. 7, the overall speedup can be calculated as:

Gfit ≈ P ∗ Tcalc / Tinterval ≈ P ∗ C    (11)

Based on Eqns. 9 and 11, by multiplexing the injection bandwidth among the processors with a degree of P, the achievable speedup gain for fitness calculation also increases by P times.
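The scaling predicted by Eqns. 9 and 11 can be checked with a few lines of code; as in the earlier sketch, the timing constants are placeholders.

#include <cstdio>

// Quick numerical check of Eqns. 9 and 11: with concentration degree P, both
// the network capacity Cdim and the fitness-calculation speedup bound Gfit
// scale by P relative to the naive mesh.
int main() {
    const double t_turnaround = 5.0, l_chm = 20.0, t_calc = 600.0;
    const double t_interval = t_turnaround + l_chm;      // Eqn. 2
    const double c_baseline = t_calc / t_interval;       // Eqn. 3 (assuming Tcom << Tcalc)

    for (int P : {1, 3, 5, 9}) {
        const double c_dim = P * t_calc / t_interval;    // Eqn. 9
        const double g_fit = P * t_calc / t_interval;    // Eqn. 11 (= P * C)
        std::printf("P = %d   Cdim = %6.1f   Gfit = %6.1f   (x%.0f over the naive mesh)\n",
                    P, c_dim, g_fit, g_fit / c_baseline);
    }
    return 0;
}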

C. Time-division Island Multiplexing (TDIM)

The TDIM scheme is proposed to address the low slave processor utilization observed in the island-based NoC architecture. In particular, since the slave processors remain idle during a large portion of time, these free time slots can be used by other islands. We employ a time-division based task mapping algorithm which allows multiple islands to share the same resources and run independently without interrupting each other. Multiple masters are located next to each other and, as shown in Fig. 5, they are scheduled to take turns entering the DIS phase. In Fig. 5, three islands are mapped onto

the same network, where the ratio between Tdis and Tproc is assumed to be 1/2. Here, Tproc refers to the time needed to produce a new generation during the GA phase. In this case, after each island leaves the DIS stage, the hardware resources are released to the next scheduled neighboring island. In this way, the idle time of the slaves in the island-based architecture is shared by the other two islands at run time. Compared to a 3-island NoC design, this requires only 1/3 of the hardware. In addition, since the slave processors are occupied exclusively every time an island enters the DIS phase, no extra control or computation overhead is introduced by mutual interruption.

Fig. 5: Example of the TDIM schedule.

In the TDIM scheme, we can derive the maximum number of multiplexed islands as a function of the population size N, the concentration degree P and the network size NC as:

K(N, P, NC) = ⌊Tproc(N) / Tdis(N, P, NC)⌋ + 1    (12)

In order to evaluate the hardware requirements of TDIM, we define Ni as the total number of islands that are required in the HPGA. Ntdim and Nmul represent the number of slave processors per island for the proposed TDIM and island-based NoC architectures, respectively. We evaluate the hardware cost by comparing the total number of slave processors Nall in these different architectures. The hardware reduction factor RH, representing the ratio of Nall in the TDIM architecture to that in the island-based architecture, is given by:

RH(N, P, NC) = (1 / K(N, P, NC)) ∗ (Ntdim / Nmul)    (13)

Here we distinguish two cases: i) If Ntdim = Nmul, then Eqn. 13 yields:

RH = 1 / K(N, P, NC)    (14)

which means the total number of processor cores required in the proposed TDIM scheme is K(N, P, NC) times smaller. ii) If Ntdim = M ∗ Nmul = M ∗ C = M ∗ P^−1 ∗ Cdim, where 1 < M ≤ P, then from Eqn. 11 the proposed TDIM scheme combined with DIBM can achieve an M times speedup in the fitness calculation. In this case, RH is given by:

RH = M / K(N, P, NC)    (15)

Eqn. 15 indicates that as long as M is smaller than K, the proposed TDIM scheme can achieve both a performance improvement and an area reduction compared to the island-based baseline design. Our simulation results in Section IV consistently demonstrate these benefits by combining the TDIM and DIBM schemes together.
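A small sketch of Eqns. 12-15, using placeholder values for Tproc and Tdis, shows how the multiplexing degree K and the hardware reduction factor RH interact:

#include <cmath>
#include <cstdio>

// Illustrative evaluation of the TDIM multiplexing degree K (Eqn. 12) and the
// hardware reduction factor RH (Eqns. 14-15). Tproc and Tdis are placeholder
// values; in the paper they depend on N, P and the network size NC.
int main() {
    const double t_proc = 4.0e5;   // GA phase: time to produce a new generation
    const double t_dis  = 6.0e4;   // DIS phase: distribution + fitness collection

    const int K = (int)std::floor(t_proc / t_dis) + 1;              // Eqn. 12
    std::printf("maximal number of multiplexed islands K = %d\n", K);

    // Case i): each TDIM island uses as many slaves as the island-based
    // baseline (Ntdim == Nmul), so the core count shrinks by K (Eqn. 14).
    std::printf("RH (Ntdim = Nmul)     = %.3f\n", 1.0 / K);

    // Case ii): each TDIM island is M times larger (1 < M <= P), trading part
    // of the area saving for an M-fold fitness speedup (Eqn. 15).
    for (int M : {2, 3}) {
        std::printf("RH (Ntdim = %d*Nmul)  = %.3f\n", M, (double)M / K);
    }
    return 0;
}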

D. Task-Aware Adaptive Routing

In the proposed TDIM scheme, in order to further improve the utilization of the slave processors, the DIS phases of two different islands can be overlapped such that more islands can share the same physical resources. This overlapped time is represented as ∆tovp. Then, Eqn. 12 can be re-written as:

K = ⌊Tproc / (Tdis(∆tovp) − ∆tovp)⌋ + 1    (16)

Fig. 6: Routing interception and diversion.
To avoid the extra delays introduced when two multiplexed masters distribute their individuals simultaneously, a task-aware routing algorithm is proposed which allows packets (individuals) to change their predefined destinations dynamically. A chromosome packet generated by the master processor is first assigned a destination and is then sent using a deadlock-free XY routing algorithm. Upon arriving at each router, the header flit checks the router status to determine whether the slave processor associated with the current router is available. If it is free, the header flit sends a forwarding request to the router. Because several packets may contend for the same slave processor, an arbiter is used to resolve the conflicts. After being granted access to the current slave processor, the packet enters the current slave directly, flushing its original destination. In the router, an occupation flag is asserted to indicate that the status has changed from "available" to "busy". On the other hand, if the current slave processor is not free, the packet proceeds towards its original destination. As shown in Fig. 6, this dynamic dropping policy may lead to the case where a head flit arrives at its original destination while that slave processor has already been occupied by another individual. In this situation, a deflection routing decision is made based on the evaluation of the current slave status as well as the network traffic conditions (Eqn. 17):

Twait(s,t) > 2 ∗ E(l_{s,t}),  where  E(l_{s,t}) = Σ_{k=1}^{|Xdim − s|} Pk ∗ k    (17)

Pk = P{n = k} = (Π_{j=1}^{k−1} (1 − P)) ∗ P    (18)

In Eqn. 17, Twait(s,t) is the time needed to wait for the current slave at (s, t) to become available, which is indicated by a counter in each router. E(l_{s,t}) is the average distance to find a free slave from (s, t) along the X/Y dimension. Pk is the probability that the current individual moves to a node k hops away from the current node that has a free slave processor. In Eqn. 18, P is the probability for a flit at (s, t) to find a free slave at a node which is j hops away. In this work, we use the longest Manhattan distance within the mesh to estimate E(l_{s,t}) empirically. If Eqn. 17 is satisfied, the destination of this flit is changed to an end-node along a preferred deflection dimension. Otherwise, it repeatedly checks Eqn. 17 while requesting the current router. In short, compared to other task-aware algorithms such as [11] with unchangeable destinations, the proposed routing algorithm can adaptively access available resources based on the current network status.
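The deflection test of Eqns. 17-18 can be sketched as follows; the hop limit, waiting time and per-node free probability are illustrative inputs, and Eqn. 18 is used in its geometric form.

#include <cstdio>

// Sketch of the diversion test of Eqns. 17-18: a head flit blocked at slave
// (s,t) estimates the expected distance E(l) to the nearest free slave along
// the preferred dimension and diverts only if its waiting time exceeds 2*E(l).
double expected_free_slave_distance(int max_hops, double p_free) {
    double expectation = 0.0;
    double p_none_closer = 1.0;                      // no free slave within k-1 hops so far
    for (int k = 1; k <= max_hops; ++k) {
        const double p_k = p_none_closer * p_free;   // Eqn. 18: Pk
        expectation += p_k * k;                      // Eqn. 17: sum of Pk * k
        p_none_closer *= (1.0 - p_free);
    }
    return expectation;
}

bool should_divert(double t_wait, int max_hops, double p_free) {
    return t_wait > 2.0 * expected_free_slave_distance(max_hops, p_free);  // Eqn. 17
}

int main() {
    // Example: current slave busy for another 40 cycles, mesh edge 6 hops away,
    // and each node holds a free slave with probability 0.3.
    std::printf("divert? %s\n", should_divert(40.0, 6, 0.3) ? "yes" : "no");
    return 0;
}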


Algorithm 1 Task-aware Routing Algorithm
Input: current node cur; destination node dst; set of candidate output directions Pc; occupation status of the current slave Scur; diversion count Tdivert; predefined maximal number of diversions Tmax_div; remaining calculation time Trm_calc; calculation time for a single individual Tcalc; metric function M(cur, Trm_calc): Boolean decision for diversion; preferred diversion dimension D
Output: output direction Po ∈ Pc; updated destination dst
1:  if Scur == 0 then
2:      Po = Select_out_port(cur, cur);
3:      Scur = 1;
4:      Trm_calc += Tcalc;
5:  else
6:      Ptemp = Select_out_port(cur, dst);
7:  end if
8:  if Ptemp != Select_out_port(cur, cur) then
9:      Po = Ptemp;
10: else
11:     if M(cur, Trm_calc) == true ∧ (Tdivert < Tmax_div) then
12:         dst = Credit_based_select_node(cur, D);
13:         Po = Select_out_port(cur, dst);
14:         Tdivert++;
15:         D = (D == 0 ? 1 : 0);
16:     else
17:         Po = Select_out_port(cur, cur);
18:     end if
19: end if
20: return Po, dst

IV. SIMULATION AND RESULTS

A. Simulation Setup

We implement a cycle-accurate NoC simulator in C++ and apply the proposed platform to solve the protein folding prediction problem based on the 3D Hydrophobic-polar side-chain (HPSC) protein model [1]. We use a fitness function similar to [1] to evaluate the energy of each individual's conformation. In the GA simulations for protein folding, 2000 generations are run with a population size of 2400. The crossover and mutation rates are 80% and 10%, respectively. Moreover, 10% of the individuals are kept directly as elites for the next generation. The migration among the master processors happens every 40 generations. We compare the proposed NoC platform with the island-based HPGA architecture described in Section III.A (referred to as the naive mesh NoC) under different network sizes ranging from 2 × 2 to 16 × 16. Throughout the experiments, we assume that each input channel has a buffer depth of 4 flits and 4 virtual channels.
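For reference, the simulation parameters listed above can be collected into a single configuration record, as they might appear in the cycle-accurate simulator; the field names are illustrative, while the values are the ones stated in the text.

// GA and NoC parameters used in the experiments (Section IV.A).
struct HpgaSimConfig {
    // Genetic algorithm parameters
    int    generations       = 2000;
    int    population_size   = 2400;
    double crossover_rate    = 0.80;
    double mutation_rate     = 0.10;
    double elite_fraction    = 0.10;
    int    migration_period  = 40;   // generations between inter-island migrations

    // NoC parameters
    int    mesh_size_min     = 2;    // networks ranging from 2x2 ...
    int    mesh_size_max     = 16;   // ... up to 16x16
    int    buffer_depth      = 4;    // flits per input channel
    int    virtual_channels  = 4;
};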

B. Comparison of network capacity extension and speedup

In this experiment, we compare the speedup gain, the network capacity and the speedup efficiency of both the conventional NoC [14] and the proposed architecture using DIBM. The concentration degree changes from P = 3 to P = 9 to evaluate the effect of the multiplexing degree on the overall performance. The predictions obtained from our analysis (in Section III) are compared against the experimental results in Fig. 7. From Eqn. 9, it is expected that the capacity of the proposed architecture with DIBM is P times higher than that of a naive mesh design; moreover, in the proposed architecture, the speedup saturates in a network with a P times larger size. Fig. 7 shows the simulation results for the fitness calculation speedup. The black line indicates the linear speedup with an increasing number of slave processors, and the dashed lines are the predicted upper bounds of the speedup gain in fitness calculation. From Fig. 7, it can be observed that saturation appears in the island-based mesh NoC when the number of slave processors exceeds 25. On the other hand, for the NoC with DIBM, the maximum number of slave cores increases to 75, 120, and 225, respectively, which yields ceiling speedups of 75X, 107X and 206X.

Fig. 7: Comparisons of speedup in fitness calculation.

The comparison of core efficiency is shown in Fig. 9. The green line with triangle markers shows the efficiency in terms of fitness calculation. For the conventional architecture, the efficiency drops dramatically after 25 cores. On the other hand, the efficiency for DIBM with P = 3, 5, 9 remains almost 1 until the corresponding network capacity of 75, 120 and 225 is reached. Moreover, to evaluate the realistic hardware speedup, we have prototyped the proposed DIBM-based platform in fully synthesizable Verilog using P = 3 as an example. Simulations are done with a Xilinx Virtex-6 LX760 FPGA simulator under different network sizes ranging from 2 × 2 to 7 × 7 due to hardware limitations. The results shown in Fig. 8 are consistent with the experiments done with the NoC simulator, whereas some performance degradation can be observed due to the latency introduced by the extra control logic.

Fig. 8: Comparisons of hardware speedup.

Fig. 9: Comparisons of the slave core utilization.

C. Evaluation of time-division island multiplexing

Next, the hardware overhead reduction is evaluated. We first measure the maximal multiplexing degree K on the network. Fig. 10 shows the maximal number of islands (i.e., K) under different mesh and population sizes. As shown in the figure, it can be observed that with an increasing concentration degree P, the maximal number of islands also increases for different slave and population sizes.

Fig. 10: Maximal number of islands that can be multiplexed under TDIM.

We also evaluate the hardware reduction when building a 24-island HPGA on the NoC. We normalize the hardware overhead of the proposed architecture to that of the naive mesh NoC architecture described in Section III.A. As shown in Fig. 13, the area cost of the conventional mesh NoC grows linearly because each island occupies intra-island resources exclusively.

In contrast, if TDIM is adopted, the blue line shows a 50% hardware reduction. Moreover, the TDIM scheme can be combined with DIBM, which allows more logical islands to be mapped onto the same physical network.

Fig. 13: Comparisons of normalized hardware overhead.

Finally, the possible speedup degradation due to the introduction of TDIM is evaluated. To illustrate the degradation, we adopt an 8 × 8 mesh NoC example. The concentration degree P is set to 5 and TDIM is employed in both designs. The number of islands mapped to this 64-core NoC varies from 2 to 7. Fig. 11 shows that the speedup degrades as more islands are mapped to the same physical network. For the naive island-based mesh NoC, the speedup drops earlier since it offers less fitness calculation speedup gain compared to the proposed architecture with DIBM.

Fig. 11: Speedup for different numbers of multiplexed islands.


D. Routing Evaluation

We compare the proposed routing algorithm with dimension-ordered XY routing and the minimal adaptive routing of [6] under different task overlap ratios. The overlap ratio is the percentage of time during which the DIS phases of two logical islands overlap with each other because they are mapped onto the same physical region. As shown in Fig. 12, the adaptive routing achieves better delay performance compared to the static routing algorithms.

Fig. 12: Comparisons of fitness calculation delay.


The proposed task-aware routing further reduces the fitness calculation delay by 10-15% under 5-30% overlap ratios.

E. Case study: Protein Folding Analysis

To evaluate the performance of the proposed architecture for HPGA, we adopt 7 real-world benchmarks for the protein folding problem from [1]. We compare the proposed architecture with a single-master-single-slave (1M1S) design and report the speedup as well as the quality of the final solution. For the proposed architecture, the workload is split into 24 islands which are mapped onto 3 physical regions; each region is mapped onto a 15 × 15 mesh NoC. For each benchmark, 2000 generations are produced and the solutions are averaged over 10 different runs. Table I summarizes the results for the folding problem. The equivalent number of H-H side-chain contacts (HH) [1], which is defined as a standard measure of goodness, is adopted to evaluate the solution quality of the two platforms. As a reference, the maximal HH contacts reported in [1] are also listed. As shown in Table I, it is observed that the quality of the solution provided by the proposed platform is improved due to the hierarchical multiple-population architecture. Moreover, the proposed platform achieves an overall speedup of up to 240X compared to the 1M1S architecture.

TABLE I: Comparisons of protein folding results

Benchmark      HH ref   Single-Slave          24-island             Speedup
                        Max HH   Avg Tgen     Max HH   Avg Tgen
Unger273d.6    14       12       1147769      14       4878          235.3
Unger273d.7    16       12       1148167      15       4607          249.2
Unger273d.8    6        6        1147792      6        4763          241
Unger273d.9    10       6        1147006      9        4907          233.7
Unger273d.10   14       12       1147671      14       5165          222.2
Dill.2         19       16       1146764      18       4804          238.7
Dill.3         23       20       1147940      22       5061          226.8

V. CONCLUSION

In this work, we have proposed a new NoC architecture for the efficient implementation of HPGA. Two multiplexing schemes have been developed to improve the speedup and reduce the hardware overhead. We have also proposed a task-aware routing algorithm to further mitigate the hardware cost. We implement HPGA for protein folding analysis on the proposed platform. Simulation results show that our proposed architecture, together with the multiplexing schemes and routing algorithm, achieves a fitness calculation speedup of 206X, an overall speedup of 240X, and an improved quality of solution. Compared to the direct island-based HPGA implementation, over 50% of the hardware overhead is eliminated thanks to the proposed multiplexing schemes.

VI. ACKNOWLEDGEMENT

The authors are thankful to the anonymous reviewers for their valuable feedback. RM acknowledges the support for this work by the US National Science Foundation (NSF) under Grant CNS-1128624. PB acknowledges the support by the US National Science Foundation (NSF) under Grant 1331610.

REFERENCES

[1] C. Benitez and H. Lopes. A parallel genetic algorithm for protein folding prediction using the 3D-HP side chain model. In IEEE Congress on Evolutionary Computation, pages 1297-1304, 2009.
[2] P. Bogdan and R. Marculescu. Workload characterization and its impact on multicore platform design. In 8th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2010.
[3] E. Cantu-Paz. A survey of parallel genetic algorithms. Calculateurs Paralleles, 10, 1998.
[4] A. E. Eiben and J. E. Smith. Introduction to Evolutionary Computing. Springer-Verlag, 2003.
[5] R. Ferreira, L. Macedo Mourelle, and N. Nedjah. A parallel genetic algorithm on a multi-processor system-on-chip. In Trends in Applied Intelligent Systems, volume 6097 of Lecture Notes in Computer Science, pages 164-172. Springer Berlin Heidelberg, 2010.
[6] N. Jiang et al. A detailed and flexible cycle-accurate network-on-chip simulator. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 86-96, 2013.
[7] C. Y. Jiao and D. Li. Microarray image converted database - genetic algorithm application in bioinformatics. In International Conference on BioMedical Engineering and Informatics (BMEI 2008), volume 1, pages 302-305, May 2008.
[8] Z. Lu and A. Jantsch. TDM virtual-circuit configuration for network-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(8):1021-1034, Aug 2008.
[9] R. Marculescu and P. Bogdan. The chip is the network: Toward a science of network-on-chip design. Foundations and Trends in Electronic Design Automation, 2(4):371-461, 2009.
[10] P. Pospichal, J. Jaros, and J. Schwarz. Parallel genetic algorithm on the CUDA architecture. In Applications of Evolutionary Computation, Lecture Notes in Computer Science, pages 442-451.
[11] Z. L. Qian, P. Bogdan, G. Wei, C. Tsui, and R. Marculescu. A traffic-aware adaptive routing algorithm on a highly reconfigurable network-on-chip architecture. In 11th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2012.
[12] M. Schoeberl, F. Brandner, J. Sparso, and E. Kasapaki. A statically scheduled time-division-multiplexed network-on-chip for real-time systems. In Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), pages 152-160, May 2012.
[13] D. Whitley. A genetic algorithm tutorial. Statistics and Computing, 4(2):65-85, 1994.
[14] Y. K. Xue et al. Disease diagnosis-on-a-chip: Large scale networks-on-chip based multicore platform for protein folding analysis. In Design Automation Conference, June 2014.
