Divide-and-Conquer and parallel graph reduction


Parallel Computing 17 (1991) 189-205 North-Holland


Divide-and-Conquer and parallel graph reduction

Fethi A. Rabhi a and Gordon A. Manson b

a Computer Science Department, Allegheny College, Meadville, PA 16335, USA
b Department of Computer Science, University of Sheffield, Portobello Centre, Sheffield S10 2TN, UK

Received 19 June 1990 Revised 24 October 1990

Abstract

Rabhi, F.A. and G.A. Manson, Divide-and-Conquer and parallel graph reduction, Parallel Computing 17 (1991) 189-205.

This paper is concerned with the design of a multiprocessor system supporting the parallel execution of functional programs. Parallelism in such systems is automatically derived by the compiler, but this parallelism is unlikely to meet the physical constraints of the target machine. In this paper, these problems are identified for the class of Divide-and-Conquer algorithms and a solution which consists of reducing the depth of the computational tree is proposed. A parallel graph reduction machine simulated on a network of transputers, developed for testing the proposed solutions, is described. Experiments have been conducted on some simple Divide-and-Conquer programs and the results are presented. Lastly, some proposals are made for an automatic system that would efficiently execute any problem belonging to the same class, taking into account the nature of the problem as well as the physical characteristics of the implementation.

Keywords. Functional languages; parallel graph reduction; Divide-and-Conquer; dynamic partitioning.

1. Introduction

Although there are many prospects of evaluating functional programs in parallel, few practical implementations have been realised. The main reason is that the partitioning into concurrent units of computation (or tasks) is done assuming an architecture with unlimited resources, such as an infinite number of processors and no communication overhead. Designing an efficient model that matches any problem on any machine would be extremely complex because of the diversity of parallel programs, varying in size and shape and having different requirements in processing time, storage occupation and communication. Much attention has been devoted recently to grouping parallel programs into categories in which parallelism is no longer 'accidental' but obeys a well-known paradigm [6,22,14]. Examples of paradigms include Divide-and-Conquer and pipelining. If the parallel algorithmic structure is clearly identified, it is possible to analyse the corresponding pattern of execution, providing compile-time and run-time solutions.

This paper examines the problems of implementing the class of Divide-and-Conquer algorithms expressed in a functional language on a parallel reduction machine with a distributed memory. A scheme for dynamic partitioning is proposed and tested on an experimental system, which consists of a graph reduction engine simulated on a network of transputers. The results obtained are discussed, and proposals are made for an automatic system that provides the best performance depending on both the problem and the target machine.

2. The Divide-and-Conquer approach to parallelism

2.1. Specification

The well-known Divide-and-Conquer paradigm has proved useful in deriving parallel algorithms for many problems. In such a model, the solution to some instance of a problem is specified as a combination of solutions to simpler subinstances of the same problem. In a functional program, we can express a Divide-and-Conquer algorithm by the following function:

F(x) = if indivisible(x) then solve(x)
       else let [x1, x2, ..., xk] = divide(x)
            in combine (F(x1), F(x2), ..., F(xk))

where:
- x is the description of a problem
- indivisible(x) returns true if the solution of x is trivial
- solve(x) returns the trivial solution
- divide(x) splits the original problem into k smaller subinstances [x1, x2, ..., xk]
- combine([s1, s2, ..., sk]) combines the k subsolutions into a single solution.
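To make the shape of this scheme concrete, the following Standard ML sketch gives a generic Divide-and-Conquer combinator together with a Dsum-style instantiation (see Section 4.1). The names divconq and dsumExample are ours and purely illustrative; the sketch is sequential and ignores parallelism entirely, since in the paper the parallel behaviour comes from the compiler and run-time system rather than from the source program.

fun divconq indivisible solve divide combine x =
  if indivisible x then solve x
  else combine (map (divconq indivisible solve divide combine) (divide x));

(* Illustrative instantiation: summing the integers lo..hi by halving the
   interval, in the style of Dsum (degree of recursion k = 2) *)
fun dsumExample n =
  divconq (fn (lo, hi) => lo = hi)                 (* indivisible *)
          (fn (_, hi) => hi)                       (* solve: the trivial solution *)
          (fn (lo, hi) =>                          (* divide into two halves *)
             let val mid = (lo + hi) div 2
             in [(lo, mid), (mid + 1, hi)] end)
          (fn [a, b] => a + b | _ => 0)            (* combine the subsolutions *)
          (1, n);

For example, dsumExample 10 evaluates to 55.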

2.2. Parallel implementation

In a parallel implementation, each instance of the problem F(xi) will be solved through the action of a task. As the execution grows as a tree of tasks, an obvious scheduling strategy would be to allocate one task to each processor in the system, requiring a tree machine topology (e.g. as in [12]). This approach presents many drawbacks: first, processors executing interior nodes in the tree are inactive most of the time; then, the computational tree can be much larger in shape and depth than the physical machine; and lastly, there is a fine granularity at the leaves causing excessive communication overhead. In ZAPP [17], the first two shortcomings are overcome by using a physical machine simulating a virtual tree, onto which a tree of any size can be mapped. A dynamic load balancing strategy ensures that work is diffused to idle processors. A similar approach is adopted in serial combinators [9], but in addition, one child task is always evaluated locally, increasing the granularity of the parent task and avoiding the situation where a parent task is suspended immediately after spawning its child tasks, having to wait for them to complete before continuing. However, this approach still suffers from too fine a granularity at the leaves of the tree and the possibility of having an overwhelming number of tasks.

In this paper, we investigate the possibility of stopping the creation of tasks at a certain depth in the computational tree. In other words, the execution proceeds in parallel (breadth first) until a certain depth is reached, then the execution proceeds sequentially (depth first). The consequence is a coarser grain and better flexibility, because the size and the number of tasks can be decided at run-time. We refer to such a scheme as dynamic partitioning.
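As a sketch of the depth cutoff just described, the plain (sequential) ML function below, which reuses the shape of the divconq combinator above, simply counts where tasks would be created: children generated while the current depth is below the limit would be sparked as parallel tasks, while the first child and everything below the limit stays inside the parent task. The name divconqDepth is ours; in the real system this decision is taken through annotations interpreted at run-time (Section 3.3), not through ordinary code.

fun divconqDepth indivisible solve divide combine depthLimit x =
  let
    (* go returns the result paired with the number of tasks that would be
       sparked below this point *)
    fun go d y =
      if indivisible y then (solve y, 0)
      else
        let val subs = divide y
            val sparkedHere =
              if d < depthLimit then Int.max (length subs - 1, 0) else 0
            val (results, sparkedBelow) = ListPair.unzip (map (go (d + 1)) subs)
        in (combine results, sparkedHere + foldl (op +) 0 sparkedBelow) end
  in go 0 x end;

For a sufficiently large balanced problem of degree k = 2, a depth limit of 4 reports 15 sparked tasks, i.e. a partition into 16 tasks: one per processor on the 16-transputer mesh used later in the paper.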


3. Experimenting with dynamic partitioning

3.1. Motivation

By controlling the depth of the computational tree, dynamic partitioning offers a choice of several partitions for the same problem. The next question is: which partition offers the best performance? The aim behind these experiments is to study some important factors, such as the number of processors, the load balancing strategy, and the nature and complexity of the problem, and their effect in determining the optimum partition. In the next sections, we briefly describe the implementation, which consists of a parallel graph reduction machine simulated using transputers.

3.2. Review of graph reduction and parallel implementations

The graph reduction model [18] is a simple demand-driven model of computation suitable for functional languages. In graph reduction, a program is represented by a graph, and the execution of this program consists of reducing the corresponding graph until the normal form, i.e. the result, is reached. This process may be carried out in parallel, since any subgraph can be reduced independently from the others by a parallel task. A task is executed by an evaluator; as a task may create parallel subtasks, the run-time system is in charge of dynamically allocating tasks to evaluators. It also ensures the cooperation between tasks and provides equal access to the computational graph by all the evaluators. There are techniques for generating code that will perform sequential graph reduction, most of them based around the G-machine [1]. The abstract machine used in this implementation is a version of the Spineless G-machine [4], extended with parallelism constructs for the explicit creation of tasks and synchronisation between these tasks; an executable specification is described in [16]. We used a conservative approach to parallelism, in which an expression is evaluated in parallel only if its result will certainly be needed; strictness analysis is the common technique used to detect such cases. The experimental system has much in common with recent parallel reduction machine projects, examples of which include PAM [19], ALFALFA [10], the (ν,G)-machine [2] and the HDG-machine [15].
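As a toy illustration of the sharing that makes this a graph (rather than tree) reduction model, the following ML sketch represents a program graph with mutable nodes and reduces it to normal form, overwriting each node with its value so that a shared subgraph is reduced only once. The datatype and names are ours; this is not a sketch of the Spineless G-machine, which executes compiled G-code rather than interpreting such a structure directly.

datatype expr = Num of int
              | Add of expr ref * expr ref;   (* ref cells give the sharing of a graph *)

(* Reduce a node to normal form and overwrite it in place. *)
fun reduce (n : expr ref) : int =
  case !n of
      Num i      => i
    | Add (a, b) =>
        let val v = reduce a + reduce b
        in n := Num v; v end;

val shared = ref (Add (ref (Num 1), ref (Num 2)));
val root   = ref (Add (shared, shared));
val result = reduce root;   (* 6: the second use of shared finds the value Num 3 *)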

3.3. The compile-time system

The role of the compiler is to convert a functional program into parallel G-code. We used the compilation rules in [4], but with some differences. In this system, annotations are supplied in order to determine whether an expression is to be evaluated lazily, eagerly or in parallel. A special annotation called the conditional sparking annotation has been introduced to implement dynamic partitioning. This annotation has the following form:

(exp) [CONDSPK = (expp)]

This means that the value of the predicate (expp) determines whether the corresponding expression (exp) is evaluated in parallel or not. In the experiments described, the creation of tasks should stop once a certain depth has been reached. This is achieved by adding the depth as an extra parameter to the Divide-and-Conquer specification, which can be rewritten as

F(x) = F'(x, 0)
F'(x, d) = if indivisible(x) then solve(x)
           else let [x1, x2, ..., xk] = divide(x)
                in combine (F'(x1, (d+1)),
                            F'(x2, (d+1)) [CONDSPK = (d < DepthLimit)],
                            ...,
                            F'(xk, (d+1)) [CONDSPK = (d < DepthLimit)])

where DepthLimit is a system variable containing the maximum depth. By changing DepthLimit from 0 to ∞, the same problem is executed in sequence, then in k tasks, then k² tasks, etc., until the maximum partition is reached. Note that the first call is always evaluated locally, in accordance with our partitioning strategy. Generating G-code from annotated programs is described in [20].
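The progression of partition sizes is easy to compute: for a balanced problem of recursion degree k, stopping task creation at depth d gives a partition into k^d tasks. A small ML illustration (names ours):

fun tasksAtDepth k d =
  let fun pow (_, 0) = 1
        | pow (b, e) = b * pow (b, e - 1)
  in pow (k, d) end;

val partitions = List.tabulate (5, fn d => tasksAtDepth 2 d);
(* [1, 2, 4, 8, 16]: with k = 2, a DepthLimit of 4 already yields one task
   per processor on a 16-processor configuration *)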

3.4. Run-time system

The system comprises a Host and a network of Reduction Agents. The Host provides the user interface, file access and input/output. It also supports the compile-time system that produces the code, which is then distributed to the network. The network is in charge of executing the code and sending back the result to the Host. The system is designed to work on any topology. In the experiments described, we used a wrap-around mesh of 16 transputers. Each transputer in the network supports a Reduction Agent. A Reduction Agent consists of an Evaluator, a Graph Manager and a Router, all programmed using occam and running in parallel. The role of the Evaluator is to interpret the code produced by the compiler. The Graph Manager maintains the local portion of the graph, responds to data requests from other processors, and also stores suspension records and reactivates tasks. The Router is the only module which 'knows' about the topology of the network. It determines the appropriate route depending on the destination of a packet. The Router runs as a high-priority occam process. The run-time system is further described in [20].
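The paper does not give the Router's routing algorithm. Purely as an assumed illustration, one simple way of choosing the next hop on a 4 x 4 wrap-around mesh is dimension-ordered routing, sketched below in ML; the names meshSize, stepTowards and nextHop are ours and the policy is not claimed to be the one actually implemented.

val meshSize = 4;   (* a 4 x 4 wrap-around mesh, as in the experiments *)

(* One hop along a ring of length meshSize, taking the shorter direction. *)
fun stepTowards (from, dest) =
  if from = dest then from
  else
    let val forward = (dest - from + meshSize) mod meshSize
    in if forward <= meshSize - forward
       then (from + 1) mod meshSize
       else (from - 1 + meshSize) mod meshSize
    end;

(* Route along the first dimension, then along the second. *)
fun nextHop ((x, y), (dx, dy)) =
  if x <> dx then (stepTowards (x, dx), y)
  else (x, stepTowards (y, dy));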

3.5. Performance of the system

There are no standard measures for the performance of parallel graph reduction machines, because execution times depend considerably on the example programs, the abstract machine and many implementation details such as data structure representation and load balancing. In this section, two simple measures will be considered:
- Number of function calls/second: gives an idea of the sequential execution speed. The function used is the Nfib benchmark.
- Number of nodes transmitted/second: gives an idea of the communication speed.
These measures are summarized in Table 1 for the different types of transputers used. The figures show a rather slow computation time when compared with other published figures, shown in Table 2.
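The paper does not list the source of the benchmark, but Nfib is conventionally written as the following doubly recursive function, whose result equals the number of calls it performs, so that calls/second is simply nfib n divided by the elapsed time:

fun nfib n = if n < 2 then 1 else nfib (n - 1) + nfib (n - 2) + 1;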

Table 1
Processing and communication speed in the system

Transputer type   Nfib calls/sec   Nodes transmitted/sec   CPU utilisation in communication (%)
T4 at 15 MHz      447              3230                    76
T4 at 20 MHz      609              4170                    73
T8 at 20 MHz      640              5690                    73


Table 2
Number of Nfib calls/second for other systems

System          Nfib calls/sec      Reference
ZAPP            24000 calls/sec     [17, p. 256]
(ν,G)-machine   63000 calls/sec     [2, p. 211]
HDG-machine     27000 calls/sec     [26]
PAM             1300 calls/sec      [19, p. 155]
Alfalfa         ~300 calls/sec      [10, p. 192]

All these numbers are approximate, depending on various conditions expressed in the corresponding references. We can see that the fastest implementations are those where the code is directly executed. Surprisingly, the use of an abstract machine, as in the HDG-machine and in the (ν,G)-machine, proves to be faster than a direct implementation in occam as in ZAPP. The (ν,G)-machine's high performance seems to be due to the multitude of optimisations performed by the compiler. The current system's performance is close to the PAM machine, where the code is interpreted, and to Alfalfa, where the code is executed but uses system call routines which must carry heavy overheads. Because the Nfib function has not been tested on Alfalfa, the time shown in Table 2 is the number of Pfac function calls/second (Pfac(1000) executed 2000 calls in 6 seconds).

The communication overhead can be divided into two parts:
- The latency: the time during which data is moved across the link.
- The processing or CPU time involved in moving data from/to the buffers. For example, a graph that is copied is first 'flattened', put into a buffer, and is reconstructed at the other end (a toy sketch of this flattening is given at the end of this section). The processing time overhead also comprises the time to transfer a packet from the Graph Manager to the Router.
According to the measures obtained, the CPU time in any communication is much greater than the latency, in a proportion as high as 70%.

In conclusion, the system suffers a high overhead in performing sequential computations because the abstract code is interpreted. This makes it about 20 times slower than a fully compiled implementation. Communication is relatively fast but could be improved by reducing the processing overhead involved in moving data to and from the buffers. This will have an effect on the experimental results presented in the next section.
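A toy ML sketch of such flattening and reconstruction is given below. The actual node and packet formats are not described in the paper, so the datatypes and names here are assumptions made only to show where the per-node processing cost arises; this naive version also loses sharing.

datatype cell = Leaf of int
              | Node of cell * cell;
datatype tag  = L of int | N;     (* entries of the transmission buffer *)

(* Pre-order flattening into a buffer, and the matching reconstruction. *)
fun flatten (Leaf i)      buf = L i :: buf
  | flatten (Node (a, b)) buf = N :: flatten a (flatten b buf);

fun rebuild (L i :: rest) = (Leaf i, rest)
  | rebuild (N :: rest)   =
      let val (a, r1) = rebuild rest
          val (b, r2) = rebuild r1
      in (Node (a, b), r2) end
  | rebuild []            = raise Empty;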

4. Experimental results

4.1. Example programs

The test programs consisted of some simple Divide-and-Conquer problems, with different shapes and complexities. They are summarized below:
- Dsum(n): Computes the sum 1 + 2 + ... + n in a Divide-and-Conquer manner. Similar to parallel factorial except that the multiplication is replaced by an addition. This is an example of a regular balanced problem because it generates equal subproblems with a constantly divided size. Its degree of recursion k is 2.
- Msort(n): Sorts a list of n numbers using the merge-sort algorithm. This is also an example of a regular balanced problem with k = 2.
- Fib(n): Computes the nth Fibonacci number, using a double recursive call. This is a regular problem because the size of the subinstances can be predicted. The degree of recursion k is 2, but it is an unbalanced problem because the two subinstances generated are of unequal size.
- Qsort(n): Sorts a list of n (randomly determined) numbers using the quick-sort algorithm. This is also an unbalanced problem with k = 2, but it is irregular in that the size of the subinstances generated cannot be predicted.
- Mult3 and Mult4: decompose the multiplication of two n-digit integers (represented as lists of digits) into respectively 3 and 4 multiplications of n/2-digit integers [3]. Both are regular balanced problems with k respectively equal to 3 and 4.
The ML definitions of these functions are given in the Appendix.

4.2. Expected results

When executing a problem with every possible partition, the general aspect of the figures obtained should be as shown in Fig. 1. In this graph, the partition into tasks is plotted on the horizontal axis (in logarithmic progression) and the execution time on the vertical axis. Starting from the 1-task partition (sequential time), the execution time should decrease as the problem is further divided. In most cases, it should increase again as the maximum partition is reached. In the rest of the experiments, we studied the factors that have an effect on determining this optimum partition; these factors are the load-balancing system, the nature of the problem and the system configuration. Some preliminary results are also found in [21].

4.3. Defining a measure for load-balancing

We defined a measure that indicates how well the tasks have been distributed. This measure is defined such that D = 0 if the tasks are equally distributed amongst processors, and D has a large positive value if one processor alone executed all the tasks. This measure is not completely accurate because it is only based on the number of tasks executed by one processor, without referring to task sizes or to the maximum number of tasks present in the task queue at a specific time. However, in this context, it gave us a good idea of the processors' global utilisation. By convention, D = 0 if there is only 1 task generated and allocated to one processor (it is the best and worst case at the same time).
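The paper does not give the exact formula for D, so the ML sketch below is only one plausible measure with the stated properties: it is 0 when the tasks are spread evenly and reaches P - 1 (15 on a 16-processor mesh) when a single processor executes everything. It is an assumption for illustration, not the authors' definition.

fun distMeasure (tasksPerProcessor : int list) =
  let val total = foldl (op +) 0 tasksPerProcessor
      val p     = length tasksPerProcessor
      val mean  = real total / real p
      val mx    = real (foldl Int.max 0 tasksPerProcessor)
  in if total = 0 then 0.0 else mx / mean - 1.0 end;

(* an even distribution over 16 processors gives 0.0;
   one processor doing all the work gives 15.0 *)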

Fig. 1. General aspect of the execution times vs the partition (execution time in ms against Log2(partition into tasks); marked points: Sequential Time (1 task), Best Time (optimum partition into tasks), Parallel Time (maximum partition into tasks)).


Fig. 2. Execution times depending on the partition for Dsum (execution time in ms against Log2(partition into tasks), for Dsum(2048), Dsum(4096) and Dsum(8192)).


4.4. Results obtained for different problems

This section presents the results obtained for the example programs, with different sizes, on a mesh of 16 transputers. The processor type used is a T4 clocked at 15 MHz. The results are shown together with the corresponding task distributions.

Tasks distribution D depending on the partition for Dsum(2048) (figure).

F'(xi, (d+1)) [CONDSPK = (size(xi) > Threshold)]

where size(xi) represents the size of an instance xi, which must be higher than a specific threshold for xi to be worth evaluating in parallel. This is not needed in the implementation described, because of the low communication overhead, but it would matter much more in an implementation where the code is directly executed.

6. Comparison with other work

Extensive work is being carried out in designing parallel architectures for functional programming languages. Previous effort has mainly concentrated on architectural issues such as topology [5] and load balancing [13]. The remaining problems have to do with efficient partitioning and task allocation. Most existing systems use a maximum partitioning strategy and rely on a run-time mechanism to switch to sequential evaluation when the system is fully loaded; this is called a throttle in [23]. For distributed systems, the major problem in using a throttle is detecting when the system is saturated, which requires a load information exchange scheme or extra hardware.


In contrast, we believe that many predictions can be made at compile-time, considering the advantage that functional programs can easily be analysed and transformed. In our case, it became possible to predict the appropriate partition at compile-time and reach that partition at run-time. However, the analysis only applies to a specific pattern of execution and assumes that only one Divide-and-Conquer function is executed. Divide-and-Conquer parallelism is also studied in the ZAPP project [17], but not in a functional context. Only maximum partitions have been experimented with and, in general, a better speedup than the one reported here has been obtained because the application programs were directly programmed in occam. Other patterns of parallel execution, mainly pipelined parallelism, are analysed by Kelly [14] and Cole [8]. Other related work includes serial combinators [9], which we view as a form of restricting the partitioning of a problem by removing unnecessary parallelism. The method however does not apply to recursive functions, hence not to Divide-and-Conquer problems, which when executed lead to a maximum partition [10]. Known examples of dynamic partitioning in functional languages are the Stardust system [11] and the Dutch Parallel Reduction Machine [25], where partitioning is decided upon the run-time value of some parameters supplied by the user. The drawback is that partitioning depends on the size of the problems, and this notion of size is particularly difficult to determine automatically. Run-time testing on the size is not efficient because tasks may be created even if there are no free processors to execute them. In our method, a control on the number of tasks (through the depth) proves to be simpler to implement and does not require any complexity analysis of the problem.

Appendix: ML definitions of test programs

(*************** DOUBLE RECURSIVE SUM *******)
fun dsum' lo hi =
  if (lo = hi) then hi
  else (let val mid = ((hi + lo) div 2)
        in (dsum' lo mid) + (dsum' (mid + 1) hi)
        end);
fun dsum (n) = (dsum' 1 n);

(******** FIBONACCI NUMBERS *******)
(* double recursive definition, as described in Section 4.1 *)
fun fib n = if n <= 1 then 1 else fib (n - 1) + fib (n - 2);

(******** MERGE SORT *******)
fun merge (x : int list, y) =
  if (x = nil) then y
  else if (y = nil) then x
  else if ((hd y) > (hd x)) then (cons (hd x) (merge ((tl x), y)))
  else (cons (hd y) (merge (x, (tl y))));
fun msort (z) =
  if (null z) then nil
  else if ((length z) = 1) then z
  else (let val n = ((length z) div 2)
        in merge ((msort (take n z)), (msort (drop n z)))
        end);

(************* QUICK SORT ***********)

fun lef a x =
  if (null x) then nil
  else let val y = (hd x)
       in if (y < a) then (cons y (lef a (tl x)))
          else (lef a (tl x))
       end;
fun gtf a x =
  if (null x) then nil
  else let val y = (hd x)
       in if (y >= a) then (cons y (gtf a (tl x)))
          else (gtf a (tl x))
       end;
fun qsort z =
  if (null z) then nil
  else let val a = (hd z)
       in let val x = (tl z)
          in (append (qsort (lef a x)) (cons a (qsort (gtf a x))))
          end
       end;

(***** LONG INTEGERS MULTIPLICATION ********)
fun mult4 x y n =
  if n = 1 then (mult x y)
  else let val m = (n div 2)
       in let val x1 = (drop x m)
              and x0 = (take x m)
              and y1 = (drop y m)
              and y0 = (take y m)
          in (combine (addvec (mult4 x1 y0 m) (mult4 x0 y1 m) 0)
                      (mult4 x1 y1 m)
                      (mult4 x0 y0 m)
                      m)
          end
       end;

fun combine3 z2 z11 z0 m =
  (combine (subvec (subvec z11 z2 0) z0 0) z2 z0 m);

fun mult3 x y n =
  if n = 1 then (mult x y)
  else let val m = (n div 2)
       in let val x1 = (drop x m)
              and x0 = (take x m)
              and y1 = (drop y m)
              and y0 = (take y m)
          in (combine3 (mult3 x1 y1 m)
                       (mult3 (addvec x1 x0 0) (addvec y1 y0 0) m)
                       (mult3 x0 y0 m)
                       m)
          end
       end;

where
  mult : multiplies 2 digits, the result is a list of digits
  addvec, subvec : addition and difference of two numbers represented by lists of digits
  combine : combines 3 numbers z2, z1 and z0 into a number z such that z = z2 * 10^n + z1 * 10^(n/2) + z0
All numbers are represented by a list of digits.

References

[1] L. Augustsson, Compiling lazy functional languages, Part II, PhD thesis, Department of Computer Science, Chalmers University of Technology, Göteborg, 1987.
[2] L. Augustsson and T. Johnsson, Parallel graph reduction with the (ν,G)-machine, in: Proc. Conf. on Functional Programming Languages and Computer Architecture (ACM Press, London, September 1989).
[3] R. Bird and P. Wadler, Introduction to Functional Programming (Prentice Hall, Englewood Cliffs, NJ, 1988).
[4] G.L. Burn, A shared-memory parallel G-machine based on the evaluation transformer model of computation, in: Workshop on the Implementation of Lazy Functional Languages (Aspenäs, Sweden, September 1988).
[5] F.W. Burton and M.R. Sleep, Executing functional programs on a virtual tree of processors, in: Proc. Functional Programming Languages and Computer Architecture Conf. (Portsmouth, 1981) 187-194.
[6] M. Cole, Higher-order functions for parallel evaluation, in: Proc. Glasgow Workshop on Functional Programming (Glasgow, August 1988).
[7] M. Cole, Algorithmic skeletons: a structural approach to the management of parallel computation, PhD thesis, University of Edinburgh, October 1988.
[8] M. Cole, Towards fully local multicomputer implementations of functional programs, Report CSC 90/R7, Department of Computing Science, University of Glasgow, January 1990.
[9] B.F. Goldberg and P. Hudak, Serial combinators: optimal grain of parallelism, in: J-P. Jouannaud, ed., Proc. Functional Programming Languages and Computer Architecture Conf., Nancy, France, September 1985, LNCS 201 (Springer, Berlin, 1986).
[10] B.F. Goldberg, Multiprocessor execution of functional programs, PhD thesis YALEU/DCS/RR-618, Department of Computer Science, Yale University, April 1988.
[11] D.A. Hornig, Automatic partitioning and scheduling on a network of personal computers, PhD thesis, Department of Computer Science, Carnegie-Mellon University, November 1984.
[12] E. Horowitz and A. Zorat, Divide-and-Conquer for parallel processing, IEEE Trans. Comput. TC-32 (1983) 582-585.
[13] R.M. Keller, F.C.H. Lin and J. Tanaka, Rediflow multiprocessing, in: Proc. IEEE COMPCON 28th Internat. Conf. (1984) 410-417.
[14] P. Kelly, Functional Programming for Loosely-coupled Multiprocessors (Pitman, London, 1989).
[15] H. Kingdon, D.R. Lester and G.L. Burn, The HDG-machine: A highly distributed graph reducer for a transputer network, Esprit 415 report, March 1989.
[16] D.R. Lester and G.L. Burn, An executable specification of the HDG-machine, in: Workshop on Massive Parallelism: Hardware, Programming and Applications, Italy (October 1989).
[17] D. McBurney and M.R. Sleep, Transputer-based experiments with the ZAPP architecture, in: de Bakker et al., eds., Proc. PARLE Parallel Architectures and Languages Europe, Eindhoven, Netherlands, June 1987, LNCS 258 (Springer, Berlin, 1987).
[18] S.L. Peyton Jones, The Implementation of Functional Programming Languages (Prentice Hall, Englewood Cliffs, NJ, 1987).
[19] R. Loogen, H. Kuchen, K. Indermark and W. Damm, Distributed implementation of programmed graph reduction, in: E. Odijk et al., eds., Proc. PARLE '89 Parallel Architectures and Languages Europe, June 1989, LNCS 365 (Springer, Berlin, 1989).
[20] F.A. Rabhi and G.A. Manson, Divide-and-Conquer algorithms and parallel graph reduction, Report CS-90-2, Department of Computer Science, University of Sheffield, May 1990.
[21] F.A. Rabhi and G.A. Manson, Experimenting with Divide-and-Conquer algorithms on a parallel graph reduction machine, in: D.J. Pritchard and C.J. Scott, eds., Proc. 2nd Internat. Conf. on Applications of Transputers, Southampton (July 1990).
[22] P. Roe, Some ideas on parallel functional programming, in: Proc. Workshop on Functional Programming, Glasgow (August 1989).
[23] A.A. Ruggiero and J. Sargeant, Control of parallelism in the Manchester data-flow machine, in: G. Kahn, ed., Proc. Conf. on Functional Programming Languages and Computer Architectures, LNCS 274 (Springer, Berlin, 1987).
[24] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessing (Pitman, London, 1989).
[25] W.G. Vree and P.H. Hartel, Parallel graph reduction for divide-and-conquer applications, Part I, PRM Project Internal Report D-15, Department of Computer Systems, University of Amsterdam, December 1987.
[26] D.A. Lester, Private communication, 1989.
