Limited width parallel prefix circuits

September 30, 2017 | Autor: Binay Sugla | Categoría: Distributed Computing, Lower Bound, Upper Bound

Descripción

The Journal of Supercomputing, 4, 107-129 (1990) 9 1990 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Limited Width Parallel Prefix Circuits DAVID A. CARLSON Supercomputing Research Center, 17100 Science Drive, Bowie, MD 20715-4300

BINAY SUGLA AT&TBell Laboratories, Crawford Corner Road, Holmdel, NJ 07733-1988

(Received May 1989; final version accepted February 1990.)

Abstract. In this paper, we present lower and upper bounds on the size of limited width, bounded and unbounded fan-out parallel prefix circuits. The lower bounds on the sizes of such circuits are a function of the depth, width, and number of inputs. The size requirement of an N input bounded fan-out parallel prefix circuit having limited width W and extra depth k (the difference between allowed and minimum possible depth) is shown to be fl(N log2W/2k + N) for k < log2W.This implies that insisting on minimum depth causes the circuit size to be nonlinear, while as little as log21og2Wof extra depth can possibly reduce the size to linear. Also, we show that there is a clear difference between the two cases of bounded and unbounded fan-out by proving the size of a limited width, unbounded fan-out parallel prefix circuit lies between a lower bound of f~((2 + 21-k/3)N) and an upper bound of 0((2 + 21-~)N). Uniform, systolicconstructions of limited width parallel prefix circuits are provided here and shown to be asymptotically optimal. By associating the width of the circuit with the number of processors and the fan-out capabilities of the circuit with the interconnection structure of a multiprocessor, time- and processor-efficient algorithms may be developed.

1. Introduction A parallel prefix circuit is a combinational circuit that takes N inputs x I . . . . . x N and produces the N outputs xlo . . . oxi, where 1 _< i _< N, and o is an arbitrary associative binary operation. Parallel prefix circuits (PPC) have found m a n y applications, for example, in the fast addition of two b i n a r y n u m b e r s [Ofman 1963], regular layouts for parallel adders [Brent and Kung 1982], enumeration of consecutive terms of a linear recurrence [Greenberg et al. 1982], simultaneous evaluations of a p o l y n o m i a l at consecutive points of a lattice, and the construction of F F T circuits with reduced latency [Despain et al. 1982]. Ladner and Fischer [1980] gave a general construction for a P P C w h e n gates have u n b o u n d e d fanout, and Fich [1983] provided new lower and upper b o u n d s for circuits with gates having both u n b o u n d e d and bounded fan-out. These constructions [Fich 1983; Ladner and Fischer 1980] possess the property that the m a x i m u m n u m b e r of nodes at any depth of the circuit (the width of the circuit) is proportional to the n u m b e r of inputs N. Since there is no constraint placed o n the m a x i m u m n u m b e r of nodes at any given depth, we characterize the circuits [Fich 1983; L a d n e r and Fischer 1980] as having unlimited width. Some immediate consequences follow from the fact that the width of the circuit is as large as the p r o b l e m size N. A circuit designed to compute all the prefixes of N inputs may not be useful in computing the prefixes of N ' inputs given that N ' > N. A simple

108

D.A. CARLSON AND B. SUGLA

repetitious use of the small circuit does not yield the best time possible and also is not optimal in terms of the size of the large circuit for that depth (time of computation). Moreover, an external mechanism that links the various stages of computation may be required. Since in practice an integrated circuit can only have a limited number of inputs, it is advantageous to obtain a design that not only works in the fastest possible time for problems of size P (the maximum parallelism allowed in the chip), but also achieves the shortest time attainable for problems of any size N ' using the available limited resources. Further interest in limited width circuits is generated when we attempt to carry out parallel prefix computation on a set of P processors. It is known that the parallel computation of a particular problem of arbitrary size on a fixed sized set of processors presents new difficulties of data communication, recombination, and storage. These difficulties are easily illustrated when attempting a straightforward solution for the general case of P processors in the absence of shared memory. The simulation of step i, requiring N processors, by using P processors FN/P7 times, may require each processor to store FN/P7 data values for use in step i + 1. In addition to this, a compiling algorithm may have to restructure the computation by first constructing the prefix circuit for that particular value of N and then assigning processors to different nodes in the circuit graph. Thus, the design of parallel algorithms and circuits for a problem of size N that runs on any set of P processors and makes efficient use of the available resources turns out to be a theoretically interesting problem with practical applications. Several authors have previously considered the problem of designing parallel algorithms for prefix and related computations using a limited number of processors. For polynomial evaluation (a tree computation), Munro and Paterson [1973] gave algorithms that run in time O(N/P + logEP) using P processors connected to a shared memory. Algorithms for a linear recurrence (which corresponds to prefix computation when all the terms have to be evaluated) were also developed for both the shuffle-exchange and shared memory models of parallel computation. Chen and Kuck [1975] and Gajski [1981] presented algorithms for recurrence systems that run on P processors, all having access to a shared memory. Kruskal et al. [1985] also gave solutions to the prefix problem when only P processors connected to a shared memory (EREW model) are available. Carlson and Sugla [1984, 1989] extended the recursive doubling method of Kogge and Stone [1973] to work with P < N processors connected by multiple shUffle-exchanges. An interesting observation they made was that their algorithms, when optimized to generate the last prefix in time N/P + log2P, required I 9 P logEP processors. An equivalent circuit would require fl(N log2P) nodes. Ladner and Fischer [1980] initiated work on circuits for computing prefixes. They showed that fast, linear sized circuits can be built, and that if a small amount of extra depth is available, circuit size can be reduced significantly. Fich [1983] extended Ladner and Fischer's results in a number of ways. She proved lower bounds on the size of both bounded and unbounded fan-out PPCs, and showed that bounded fan-out PPCs require nonlinear size when no extra depth is available. Trade-offs between size and depth for unbounded fan-out PPCs were derived by Snir [1986] and Lakshmivarahan et al. [1987]. Their results have the form size + depth _> 2n - 2, and they constructed PPCs whose size + depth exactly matches this lower bound. Snir also considered limited width PPCs, and showed depth = 2n/(w + 1) + O(1). Bilardi and Preparata [1989] studied boolean networks (bounded fanout circuits composed of gates and storage devices) for prefix computation, and derived

LIMITED WIDTH PARALLEL PREFIX CIRCUITS

109

the interesting result size = O((n/time)log2(n/time)) when the underlying group has a certain property. With the above motivations in mind, we present complexity results for limited width PPCs in this paper. Specifically, we give upper and lower bounds on the size of limited width, bounded and unbounded fan-out PPCs. We show that when gates have bounded fan-out 2, parallel prefix circuits of minimum depth require size = fl(Nlog2W), where W is the width of, and N is the number of inputs to, the circuit. For prefix circuits with extra depth k < log2W we show that size = fl(N logzW/2 ~ + N). Thus, insisting on minimum depth imposes a log2Wpenalty on size, whereas a small amount of extra depth (log21og2W)drops the lower bound to ~(N). Unbounded fan-out PPCs with limited width do not exhibit this same behavior, as we are only able to prove a much weaker lower bound of size > (2 + 21-k/3)N- o(N). For bounded fan-out PPCs, these lower bounds generalize the results of Fich [1983] to the case when width is limited. They also show that the algorithms given in [Carlson and Sugla 1984, 1989] are optimal to within a constant factor; that is, a penalty in terms of number of processors must be paid in order to compute prefixes in minimum parallel time (assuming each processor has the equivalent of bounded fan-out). Bilardi and Preparata's [1989] lower bound is similar to the one here in that it includes an extra logarithmic term under certain conditions, but their computational model is different (storage devices are added to the circuit model) and the logarithmic term arises for a different reason (an underlying group property vs. extra depth). For unbounded fan-out PPCs, our lower bounds are not nearly as strong, only slightly above linear. They again extend Fich's work to the case when width is limited, and reconfirm Snir's [1986] lower bound of size > 2 N - o(N). To complement our lower bounds, we also give uniform constructions of PPCs having width W for both cases of bounded and unbounded fan-out. These circuits solve problems having an arbitrary number of inputs and attain the best computation times possible by working in a systolic fashion. The constructions also imply algorithms that run on a fixed number of processors communicating via an interconnection network or shared memory. The organization of this paper is as follows. Section 2 describes the model of the circuit. In Section 3.1, we obtain a lower bound on the size of bounded fan-out, unlimited width PPCs. Lower bounds on the size of bounded fan-out, limited width PPCs are derived two different ways in Sections 3.2 and 3.3. The lower bounds on unbounded fan-out, limited width PPCs are presented in Section 3.4. The derivation of the lower bounds is followed by a discussion of some of their interesting implications. In Section 4, we provide constructions for limited width, bounded and unbounded fan-out PPCs. Section 5 briefly mentions some applications of the techniques employed in the preceding analyses.

2. The Model

As in [Fich 1983], we model a circuit as a directed acyclic graph, each vertex representing one of two types of gates--an operation or a duplication gate. An operation gate takes two inputs and produces their composition (product), while a duplication gate takes one input and produces multiple copies of it (Figure 1). Thus, in the directed acyclic graph, every input is a vertex of indegree 0, and every operation gate is a vertex of indegree 2 (all circuits

110

D.A. CARLSONAND B. SUGLA

\,/

\o/

/\

T

An operation vertex

1

1-2

A duplication vertex

1-3

1-4

1-5

1-6

1-7

1-8

5

6

7

8

l

j~ 1

q 2

3

4

An unbounded fan-out parallel prefix circuit of eight inputs

~1~~~ 1

1

I-2

2

1-3

3

1-4

4

1-5

1-6

1-7

9 q ~ol 5

6

7

A bounded fan-out parallel prefix circuit of eight inputs with communication between halves illustrated Figure 1. The parallel prefix model.

1-8

8

LIMITED WIDTH PARALLELPREFIX CIRCUITS

111

considered in this paper have fan-in 2). The vertex connected to the left input of an operation gate will be called its left parent and the other parent will be called its right parent. A vertex v will be called an ancestor of u if it is one of the parents of u or a parent of one of the ancestors of u. The depth of a vertex is the length of the longest path from any output (taken over all outputs) to that vertex. To maintain consistency, we also define the depth of a vertex as depth(v) = min {depth(left child), depth(right child)} - 1. The width of the circuit is the maximum number of nonleaf vertices at any depth (taken over all values of depth). The depth o f a PPC is the maximum depth of all vertices in the circuit and the size is the number of vertices in the graph minus the number of input vertices. We shall denote a bounded fan-out PPC P by Pd(W, k, N) where d is the bound on fan-out, W is the maximum width permissible, N is the number of inputs, and k is the extra depth over the minimum possible depth of a circuit with width W and N inputs. The minimum possible depth of Pa(W, k, N) is N / W + loga W - 1. Correspondingly, size will be denoted by S[Pa(W, k, N)]. If unlimited fan-out is allowed or there is no restriction on width, we may represent the circuit by p=(oo, k, N). For elucidation of the model and examples of bounded and unbounded fan-out PPCs, see Figure 1.

3. Lower Bounds 3.1. Fan-out 2, Unlimited Width Parallel Prefix Circuits In this section, lower bounds are derived on the size of fan-out 2, unlimited width parallel prefix circuits. It should be noted first that Fich [1983] has already obtained lower bounds for such circuits. Our purpose in proving similar lower bounds is not only to provide a simpler method for their derivation, but also to present a reference to which the lower bounds on limited width PPCs (obtained in the next section) may be compared. Fich's lower bounds on the size of a PPC (of fan-out 2 and no restrictions on width) are a function of extra depth k, and can be stated as follows. S[P2(oo, 0, N)] _ N(log2N + 1)/2 - O(log~N), S[P2(oo, k, N)] > N(log2N - 2)/2 k+l + N - O(log2N (log2N + k)/2 k) k > 1. New lower bounds on the size of fan-out 2, unlimited width PPCs can be obtained by considering the size of a circuit H needed to communicate inputs xl . . . . , XN/2 to outputs XlO oxN/2+I . . . . . X l o . . . OXN. This technique was first introduced in [Sugla and Carlson 1990] for the purpose of deriving area-time trade-offs for carry-look-ahead addition. Such a circuit has the property that every one of the outputs X l O . . . OXN/2+I, . . . , XlO . . . OXN is a function of all the inputs x~, . . . , XN/2 (Figure 2). First, we derive a lower bound on size when H has depth log2N (i.e., no extra depth). The lower bound on size when H has extra depth k is then obtained by induction. Since a circuit H having extra depth less than or equal to k + 1 must be contained in a PPC of extra depth k, a lower bound on S[H2(oo, k + 1, N/2)] gives a lower bound on S[P2(ao, k, N)]. 9

112

D.A. CARLSON AND B. SUGLA

1

1..N/2

1

N/2

1..N/2+ 1

1..N

N/2+l

N

The function of H

].~

): N/2

k >

log N - k

N T h e structure o f H

Figure 2. The function and structure of the graph H.

LIMITED WIDTH PARALLELPREFIX CIRCUITS

113

THEOREM 1. S[H2(co, k, N)] _> N(log2N - k - 4)/2 k + 4N. P r o o f When k = 0, each of the outputs Yl . . . . , YN is the root of a complete binary tree with Xl, . . . , XN as its leaves. Thus, there are two edges going into every output. Since fan-out is at most 2, it follows that at a distance of 1 from the outputs there have to-be at least N nodes. Similarly, it can be shown that there have to be at least N nodes at every level. When depth is log2N, this means that S[H2(oo, O, N)] >_ N log2N. Consider now the case when k > 0. It easily follows by induction on k that the graph H consists of a subgraph H2(oo, 0, N/2 k) in the middle and two subgraphs H ' such that the number of nodes at successive levels is correspondingly N, N/2, . . . , N/2 k+l (Figure 2). Hence,

S[H2(~, k, N)] _> 2(N + N/2 + . . . + N/2 k - l ) + (N/2k)log2(N/2 k) = N(log2N - k - 4)/2 k + 4N. COROLLARY 1. S[P2(oo, k, N)] > N(log2N - k - 6)/2 k+2 + 3N/2. P r o o f S[P2(oo, k, N)] > S[H=(oo, k + 1, N/2)] - N/2 because any circuit P2(oo, k, N) must contain the circuit H2(Qo, k + 1, N/2), giving rise to the first term on the right-hand side. The N/2 term is subtracted since N / 2 input vertices were counted in the size of H. A simple substitution of values then gives the desired result.

3. 2. Fan-out 2, Limited Width

We now derive lower bounds for parallel prefix circuits whose gates have fan-out 2 and whose width is at most W. The basic technique consists of the following key steps. First, we define the contribution of an internal vertex v to an output u as 2 -distance(u'v). The sum total of the contributions of all the internal vertices to all the ouptuts provides a lower bound on the size of the circuit. Thus, calculation of a lower bound to this sum gives us the desired lower bound. The differences in our techniques from those of [Fich 1983] arise from the fact that when width is constrained, several theorems and constructions become void. To see that the sum total of the contributions provides a lower bound on size, consider an output u = x~o . . . oxj that is the root of a (binary) tree Tu of width at most W whose leaves are Xl, 9 9 xj. Now, an internal vertex v of Tu may not only contribute to u but also to other outputs of the PPC. Hence, consider the contribution of v to u to be 2-distance(u'v) where distance(u, v) is the length of the directed path from u to v. For each output vertex u, define a quantity count(u) that measures the total contribution of all internal vertices v to u. More precisely, let count(u) = ~[2-distance(u'v)lv is an operation or duplication vertex that is an ancestor of u]. Consider now the sum of the counts: ~count(u) = ~ U

U

2 -distance(u'v) = ~ V

1~

2 -distance(u'v). U

114

D.A. CARLSONAND B. SUGLA

This sum is a lower bound on size if it can be shown that for any v, ~-~j 2 -distance(u'v) 1. Take a leaf that is a child of u (a node at height 1) and move it so that it becomes a child of v, thus obtaining a tree T ' whose weight is less than the weight of T, a contradiction. Proposition 1. The constructions given above are light. Proof Start with construction a: D = Fn/W7 + Flog2W7 - 1, n > 2W. We must show that the weight of the light tree L(W, D, n) constructed in this manner is less than or equal to the weight of any other tree T with the same (W, D, n). I f T can be transformed to L in such a way that the weight of Tis reduced or stays the same at each step, then the proposition would hold true. A subtree of T(W, D, n) can be transformed to a W-complete tree (a subtree of L(W, D, n)) as follows. Assume w.l.o.g, that the depth of the left subtree of T is greater than o r equal to the depth of its right subtree. If the left subtree is a W-complete tree of depth D - 1, then there is nothing to do. If not, take a leaf from the right subtree and pack it in the left subtree at a greater depth starting from depth D, D - 1, . . . , reducing the weight at each step. From Lemmas 1 and 2, this can be done until the left subtree is a W-complete tree, at which point the tree is the same as our construction of L(W, D, n). For the other constructions, similar arguments suffice. The following two recurrence relations correspond directly to construction procedures a and b outlined earlier. Their solutions give the value of any minweight(W, D, n). a: D = [nlW 7 + l o g 2 W - 1, n > 2W. minweight(W, D, n) = 2-1(minweight(W, D - 1, W(D - log2W)) + 2-1(minweight(W/2, log2W, or) + 1 for some a, 2 -< c~ _ W.

L I M I T E D WIDTH P A R A L L E L P R E F I X CIRCUITS

117

I

V

Case 1

V

j

Lihat lebih banyak...

Limited width parallel prefix circuits

Descripción

Comentarios