Composite Look-Up Table Gaussian Pseudo-Random Number Generator


2009 International Conference on Reconfigurable Computing and FPGAs

Leonard Colavito and Dennis Silage
Department of Electrical and Computer Engineering, Temple University

Abstract

Simulation of digital communication systems for the evaluation of bit error rate (BER) and other performance characteristics can be accelerated if the processing is implemented in programmable gate array (PGA) hardware. These simulations often require one or more Gaussian distributed pseudorandom number sources. Although uniformly distributed pseudorandom number sources can be readily implemented, Gaussian sources are not as easily configured. A typical method is to build a uniform source and transform the distribution to Gaussian. The inversion method accomplishes the transformation by the application of the inverse Gaussian cumulative distribution function (IGCDF). The IGCDF is easily obtained by the use of a look-up table (LUT). However, the memory required for the LUT can become large if it is to accurately represent the IGCDF. In this paper we demonstrate a method that can reduce the size of this LUT while allowing for control of the accuracy.

1 Introduction

Simulation is commonly used to evaluate the bit error rate (BER) and other performance characteristics of digital communication systems. These evaluations, however, may require a very substantial number of iterations. For example, evaluating the BER of a QPSK communication system to 10^-6 would typically require at least 10^8 bits to be processed. The time required to perform such an evaluation in a software-based model can be significant, but it can be substantially reduced by implementing the model in programmable gate array (PGA) hardware [1][2]. Gaussian pseudorandom number generators (PNG) are a critical part of the digital communication system simulation model. Gaussian PNG are used to simulate noise and channel characteristics that affect the performance of the simulation. Consequently, the representation of the PNG as a valid Gaussian distribution, especially in the tail regions, is important to obtaining accurate results from the simulation [3].

Many methods for generating Gaussian random numbers have been studied and employed [4]. These methods have various tradeoffs in terms of complexity, output rate and accuracy. The cumulative distribution function (CDF) inversion method is well suited for PGA hardware implementation. Using this method, a uniform PNG provides values that are then transformed using the inverse of the Gaussian CDF [5]. A basic hardware implementation utilizes a well-designed linear feedback shift register (LFSR) as the uniform PNG and a look-up table (LUT) to perform the transform. The major disadvantage of this method is the memory storage required for a sufficiently accurate LUT [3].

For our research, we have built a simulation model that utilizes 10 Gaussian PNG. Consequently, the following characteristics are important to our implementation:

1. minimum hardware resource requirements
2. consistent output rate
3. IGCDF accuracy and range equivalent to a LUT with 2^15 values and a precision of 2^-10

With the exception of the required LUT memory, the CDF inversion method is well suited for our application. In the remainder of this paper we describe the method used to reduce the storage requirements of the LUT. We illustrate the method with an example of a Gaussian PNG utilized in our simulation model; the principles may be readily adapted to other implementations. The tradeoff for the reduction in the size of the LUT is a loss of fidelity to a Gaussian distribution. However, as will be discussed, this loss of fidelity is confined to the part of the distribution curve that least affects the simulation results, and the contribution of the inaccuracy is demonstrated to be minimal.


2 Basic architecture

The basic hardware architecture for a Gaussian PNG employing the inverse CDF method is illustrated in Fig. 1.


Fig. 1. Basic hardware architecture for a Gaussian PNG using the CDF inversion method.

In our simulation the uniform PNG is a 55-bit maximal-length LFSR from which 16-bit words are taken to obtain a uniformly distributed random number sequence. From each 16-bit word, 15 bits form an unsigned random integer x and the remaining bit s is used for the sign. A LUT provides a positive value y of the inverse Gaussian CDF (IGCDF) from x. The value y is then multiplied by either +1 or -1, as determined by s, to yield the output value n, which consists of positive and negative Gaussian distributed values. The symmetry of the IGCDF curve is exploited by storing only the positive half of the transform in the LUT [1][3]. Thus, the unsigned integer x is transformed to a positive fixed-point value y by the LUT, and s provides the sign value independent of the LUT. Since s is uniformly distributed, the output value n is evenly distributed between positive and negative values.

This implementation has the advantage of requiring only minimal logic and is able to produce one new pseudorandom value every clock cycle. In our simulation, the LUT contains fixed-point values that can be made arbitrarily precise by allocating an appropriate number of bits to the LUT output width. The disadvantage of this implementation is that the LUT must hold 2^15 values, one for each value of x.
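To make the data path concrete, the following Python sketch models the Fig. 1 pipeline in software. It is an illustration rather than the authors' implementation: the LFSR feedback taps (lags 55 and 24) and the seed are assumptions made for this sketch, since the paper does not specify the feedback polynomial, and statistics.NormalDist stands in for the precomputed IGCDF table contents.

    # Software model of the basic CDF-inversion Gaussian PNG of Fig. 1 (a sketch;
    # the LFSR taps and seed below are assumptions, not taken from the paper).
    from statistics import NormalDist

    B = 15                                    # LUT address width: 2**15 entries
    DX = 0.5 / 2**B                           # IGCDF sampled over [0.5, 1.0) at Dx
    inv_cdf = NormalDist().inv_cdf
    lut = [inv_cdf(0.5 + x * DX) for x in range(2**B)]

    def lfsr_bits(state, width=55, taps=(54, 23)):
        # Fibonacci LFSR; the taps correspond to a known primitive trinomial
        # (x^55 + x^24 + 1), an assumption since the paper gives no polynomial.
        mask = (1 << width) - 1
        while True:
            bit = ((state >> taps[0]) ^ (state >> taps[1])) & 1
            state = ((state << 1) | bit) & mask
            yield bit

    def gaussian_png(seed=0x123456789ABCD, count=8):
        bits = lfsr_bits(seed)
        for _ in range(count):
            word = 0
            for _ in range(16):               # take one 16-bit word from the LFSR
                word = (word << 1) | next(bits)
            s, x = word >> 15, word & 0x7FFF  # sign bit s, 15-bit unsigned index x
            y = lut[x]                        # positive half of the IGCDF
            yield -y if s else y              # n = +y or -y

    print([round(n, 3) for n in gaussian_png()])

The paper does not state how the 16-bit words are extracted from the shift register, so the sketch simply shifts serially; a hardware version would likely tap register bits in parallel to keep the one-value-per-clock rate.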

3 Development

Can we reduce the required LUT size without significantly increasing the complexity of the logic, while also allowing for control of any loss in fidelity of the resulting Gaussian distribution? First, consider the positive half of the IGCDF for a normal distribution of zero mean and unit variance, N(0,1), as shown in Fig. 2. This curve represents the function to be implemented by the LUT. Note that the curve is approximately linear below a cumulative probability of 0.9 and asymptotic near 1.0, but highly nonlinear in between. Next, consider that since the IGCDF curve is monotonic, there is a unique relationship between the LUT input x and output y. If we map all Nx possible values of x such that they are evenly spaced over [0.5, 1.0), we can then consider the IGCDF curve to be effectively sampled by x at intervals of Δx = 0.5 / Nx.

One manner in which the LUT size could be readily reduced is by increasing the size of the sampling interval, or equivalently by decimating the LUT. However, if we are to produce a value of y for each value of x, then we must map more than one x to each value of y. Let w be the input to a LUT that is decimated by a factor of m; then Δw = mΔx and Nw = Nx / m. Now let all values of x that fall within a single interval Δw yield the same value of y, as given in (1), where F^-1 denotes the IGCDF. The result of such a decimation procedure is that the LUT now represents a stair-step approximation of the IGCDF curve.

$y = F_x^{-1}(w_1 \mid w_1 \le x \le w_2) \equiv F_w^{-1}(x)$    (1)

Reducing the size of the LUT in this manner introduces two factors for a loss in the fidelity of the output Gaussian distribution. The first loss factor, εx, results from the stair-step approximation of the IGCDF. This error is expressed by (2) for any value of x.

$\varepsilon_x = F_w^{-1}(x) - F_x^{-1}(x)$    (2)

This loss factor will introduce the least error in the flattest portion of the IGCDF curve, the region below a cumulative probability of 0.9, and the most error in the steepest part of the curve, near 1.0. The second loss factor introduced by the decimation of the LUT is due to truncation of the IGCDF range. Since the smallest value of x is mapped to a cumulative probability of 0.5, the largest value of x maps to a point that is essentially one sampling interval short of 1.0 as given in (3), where the symbol → is to be interpreted as “maps to”.


$x_{\max} \rightarrow 0.5 + \Delta_w(N_w - 1) = 1 - \Delta_w = 1 - m\Delta_x$    (3)

The result is to limit the extent of the tails of the Gaussian distribution. This effect is magnified by the decimation factor m, as seen in (3). In addition, the high rate of change in the asymptotic portion of the IGCDF occurring near a cumulative probability of 1.0 implies that the amount of the Gaussian tail range that is unavailable is sensitive to the size of Δw and can be significant.
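The decimation of (1) and both loss factors can be checked numerically. The Python sketch below is our illustration (not the authors' code): it builds the full LUT, applies an example decimation factor m, and reports the worst-case stair-step error of (2) and the range truncation of (3).

    # Numerical sketch of Section 3: decimate the IGCDF LUT by m and measure the
    # stair-step error (2) and the range truncation (3). Parameters are examples.
    from statistics import NormalDist

    B = 15
    NX = 2**B
    DX = 0.5 / NX
    inv_cdf = NormalDist().inv_cdf
    full_lut = [inv_cdf(0.5 + x * DX) for x in range(NX)]     # F_x^-1 sampled on x

    def decimated(x, m):
        # F_w^-1(x): all x within one interval Dw = m*Dx share the value taken
        # at the start of that interval, as in (1).
        return full_lut[(x // m) * m]

    m = 8
    eps_max = max(abs(decimated(x, m) - full_lut[x]) for x in range(NX))   # (2)
    print("worst-case |eps_x| for m =", m, ":", eps_max)
    # (3): the largest x now maps to 1 - m*Dx instead of 1 - Dx, shortening the tail.
    print("x_max maps to", 1 - m * DX, "instead of", 1 - DX)

As the paper notes, the worst-case error lands in the steep region of the curve near a cumulative probability of 1.0, which is exactly what motivates the composite architecture of the next section.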

4 Composite LUT architecture


It can now be stated that there are two extremes to the LUT design of Fig. 1. A high-resolution (small m) LUT provides high fidelity but has greater memory requirements, while a low-resolution (large m) LUT requires less memory but has lower fidelity. Can a compromise solution be designed that utilizes the advantages of both while allowing control of the associated disadvantages? The architecture of Fig. 3 combines a low-resolution LUT with a high-resolution LUT in a single PNG. The design follows the essential principles of Fig. 1 but utilizes two LUTs. A multiplexer selects between yLo, the output of the low-resolution LUT, and yHi, the output of the high-resolution LUT. A decoder controls the multiplexer selection so that every value of x is mapped either to the value of yLo or to the value of yHi. Moreover, the decoder decision is such that the range of x is effectively split into two ranges at a point xsplit.

4.1 Mapping

The low-range x values (x < xsplit) are mapped to yLo, and the high-range x values (x ≥ xsplit) to yHi. This design maps the most nonlinear portions of the IGCDF curve to the high-resolution LUT, while the more linear region of the curve is represented with lower resolution. It is desirable to represent the nonlinear portions of the IGCDF curve with high resolution since this region has the highest rate of change and is responsible for the critical tail region of the output Gaussian distribution. The flatter portion of the IGCDF curve will introduce less εx when resolution is lowered. This portion of the IGCDF curve is responsible for the less critical central region of the output Gaussian distribution. A reduction in resolution for this portion of the IGCDF curve will result in the least loss of fidelity, and in the least critical region of the resulting output distribution.

4.2 Decoder

Since one of our objectives is to minimize complexity as well as hardware resource requirements, an efficient decoder scheme is desired. Consider that x is a b-bit binary word as expressed by (4), where x(i) represents the value of the bit of x in the 2^i place.

$x = x_{(b-1)}2^{b-1} + \cdots + x_{(1)}2 + x_{(0)}$    (4)

Now consider that a full high-resolution LUT holds one value y for every value of x and would therefore require an input word of b bits. Next, consider that a full low-resolution LUT is a decimated version of a full high-resolution LUT. The size of a full low-resolution LUT is equal to the size of a full high-resolution LUT divided by m. If m is constrained to a power of 2, then the number of bits p required for the input word to a full low-resolution LUT is given by (5).

$p = b - \log_2(m)$    (5)

Moreover, the input to a full low-resolution LUT would consist of exactly the p highest-order bits of x, as in (6).

$x_{Lo} = x_{(b-1)}2^{p-1} + \cdots + x_{(b-p+1)}2 + x_{(b-p)}$    (6)

In the composite LUT design of Fig. 3, the value xsplit defines a value on the range of x below which all values are taken from the low-resolution LUT and above which the high-resolution LUT is utilized. Therefore, the low-resolution LUT in the composite design need only hold the values of yLo for x < xsplit, while the high-resolution LUT need only hold values of yHi for x ≥ xsplit. Now, in order to facilitate the seamless transition between the two tables, xsplit is constrained to a value of xLo. Furthermore, xsplit can also be constrained so as to require that its binary value consists of all 1's for the r highest-order bits and 0's for all other bits. The decoder can then simply be designed to select yHi whenever the r highest-order bits of x are all ones. Thus the decoder can be implemented as a simple r-input AND gate. A fortuitous consequence of this decoding scheme is that all of the values in the high-resolution LUT correspond to values of x for which the r highest-order bits are 1's. Thus, the width of the input word to the high-resolution LUT, xHi, need be only q = b − r bits, and the value xHi is given by the q lowest-order bits of x as in (7).

$x_{Hi} = x_{(q-1)}2^{q-1} + \cdots + x_{(1)}2 + x_{(0)}$    (7)

One further consequence of this design is that the effective composite LUT output F^-1(x) retains the range of the high-resolution LUT. Note that F^-1(x) follows yLo with interval Δw up to xsplit. Beyond xsplit, each value of yLo is replaced by m values of yHi with interval Δx, such that xmax → 1 − Δx.
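The bit-slicing and decoding described above reduce to a few integer operations. The following Python sketch is our rendering of the Fig. 3 lookup path, not code from the paper; lut_lo and lut_hi are assumed to be populated as summarized in Section 4.3.

    # Composite-LUT lookup following (4)-(7): an r-input AND over the top bits of x
    # selects the high-resolution table; otherwise the top p bits address the
    # low-resolution table. Illustrative sketch, not the authors' implementation.
    def composite_lookup(x, b, m, r, lut_lo, lut_hi):
        p = b - (m.bit_length() - 1)       # (5): p = b - log2(m), m a power of 2
        q = b - r
        if (x >> q) == (1 << r) - 1:       # decoder: r highest-order bits all 1's
            x_hi = x & ((1 << q) - 1)      # (7): q lowest-order bits of x
            return lut_hi[x_hi]
        x_lo = x >> (b - p)                # (6): p highest-order bits of x
        return lut_lo[x_lo]                # only indices below xsplit // m occur here

In hardware the conditional becomes the r-input AND gate driving the output multiplexer, so the decode logic added beyond the two memories is negligible.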


4.3 Summary of Composite LUT architecture

The characteristics of the composite LUT architecture are summarized as follows (a brief construction sketch follows the list):

1. The decoder consists of an r-input AND gate whose inputs are the r highest-order bits of x.
2. The low-resolution LUT is a subset of a decimated full high-resolution LUT with reduction factor m, which is a power of 2.
3. The low-resolution LUT holds only those values of y corresponding to x < xsplit and has no more than Nw = Nx / m elements.
4. The input to the low-resolution LUT, xLo, consists of the p highest-order bits of x, where p is determined by (5).
5. The value of xsplit always corresponds to a value of xLo such that its r highest-order bits are all 1's and all other bits are 0's.
6. The high-resolution LUT holds only those values corresponding to x ≥ xsplit and its size is 2^q.
7. The composite LUT produces values of yLo for values of x < xsplit according to (1), and values of yHi for every value of x ≥ xsplit.
8. The range of the composite LUT is equal to the range of a full high-resolution LUT.
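A minimal construction sketch under our reading of items 2-6 (the paper provides no code): build the two tables for given b, m and r and count the storage elements M they occupy.

    # Build the low/high-resolution tables of the composite LUT and count storage.
    # Illustrative only; b, m, r are example parameters, and the element count
    # follows items 3 and 6 above (low-res entries below xsplit plus 2**q entries).
    from statistics import NormalDist

    def build_composite(b=15, m=8, r=3):
        inv_cdf = NormalDist().inv_cdf
        dx = 0.5 / 2**b
        q = b - r
        x_split = ((1 << r) - 1) << q                      # r leading 1's, 0's below
        lut_lo = [inv_cdf(0.5 + i * m * dx) for i in range(x_split // m)]
        lut_hi = [inv_cdf(0.5 + (x_split + j) * dx) for j in range(1 << q)]
        return lut_lo, lut_hi, len(lut_lo) + len(lut_hi)   # M in storage elements

    lut_lo, lut_hi, M = build_composite()
    print("M =", M)    # 7680 for b = 15, m = 8, r = 3, the design cited in Table 1

Passing these tables to the composite_lookup sketch above reproduces the stair-step output below xsplit and the full-resolution output in the tail.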

4.4 Minimizing cost

Values for m and r must be chosen such that the required hardware resources are minimized while the loss of fidelity to the output Gaussian distribution is not severely impacted. Note that the hardware resource requirements for the composite LUT architecture are primarily due to the memory needed. Let M be the total memory required for both the high- and the low-resolution LUT, so that M may be used as a cost factor. If the interval size Δx of the high-resolution LUT is chosen so that the architecture of Fig. 1 provides the necessary fidelity for the intended application, then the yHi can be used as a reference for loss of fidelity. Deviation from the reference by the yLo values can then be quantified as εx and corresponds to the loss in fidelity due to the LUT size reduction. More specifically, we are generally interested in the worst-case error εmax occurring anywhere over the range of x. Therefore, the squared worst-case error ε²max is taken as a second cost factor for the composite LUT.

Since there is a limited number of combinations of m and r for any given word size of x, a complete search can be performed for values of M and ε²max. Since m must be a power of 2 less than 2^b, the number of values of r for any value of m is limited to p. Therefore, the total number of combinations Q is given by (8) and the table can be quickly computed. The search space for the example presented here consisted of only 105 combinations and was computed in 2.5 seconds using MATLAB®.

$Q = \sum_{m=1}^{2^b} p$    (8)

Fig. 4. Memory requirement, M, and error, ε²max, for combinations of m and r when b = 15. Points are sorted in order of increasing ε²max.

Fig. 4 illustrates how the two cost factors vary with the possible selections of m and r. The points have been sorted in order of increasing ε²max in Fig. 4 and the combinations are numbered sequentially, so that a cross reference is required to determine the values of m and r for each case. It can be seen from Fig. 4 that the memory cost M varies widely across the range. Therefore, the approach is taken of first choosing an acceptable value for ε²max and then finding a combination of m and r that yields the minimum M while not exceeding the desired ε²max. In the example here, the LUT holds fixed-point binary values with a precision of 2^-10. Therefore, we choose ε²max = (2^-10)² = 2^-20 = 9.54 × 10^-7, so that the maximum error will be less than the quantization error of the LUT. Table 1 lists combinations of m and r that yield ε²max in the vicinity of the desired value. Table 1 illustrates that the combination m = 8 and r = 3 provides for the smallest required memory of M = 7680 storage elements while still satisfying the error requirement. In the example here, the equivalent single-LUT architecture would have required M = 32768 storage elements, where one storage element is the hardware resources required to hold one value of y. Thus the requisite memory for the composite LUT is less than 25% of that of the single-LUT implementation, while requiring only one additional 3-input AND gate for decoding.

Table 1. Costs of composite LUT designs with ε²max near 2^-20

     m    r    ε²max          M
     8    2    2.69 × 10^-7   11264
     4    4    4.22 × 10^-7    9728
     2    6    5.03 × 10^-7   16640
    16    1    5.18 × 10^-7   17408
     8    3    7.53 × 10^-7    7680
    16    2    12.3 × 10^-7    9728
     4    5    13.6 × 10^-7    8960
     2    7    17.1 × 10^-7   16512
    32    1    22.1 × 10^-7   16896
     8    4    22.9 × 10^-7    5888
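The exhaustive search is easy to re-create. The sketch below is our own Python version of the cost enumeration (the authors used MATLAB; this is not their script), using the full high-resolution LUT as the fidelity reference as described above.

    # Exhaustive (m, r) cost search of Section 4.4: memory M versus worst-case
    # squared error against the full high-resolution reference LUT. Our sketch.
    from statistics import NormalDist

    B = 15
    NX = 1 << B
    DX = 0.5 / NX
    inv_cdf = NormalDist().inv_cdf
    full = [inv_cdf(0.5 + x * DX) for x in range(NX)]        # reference y values

    def cost(m, r):
        q = B - r
        x_split = ((1 << r) - 1) << q
        M = x_split // m + (1 << q)                          # storage elements
        # eps_x is nonzero only below x_split, where stair-step values are used.
        e = max(abs(full[(x // m) * m] - full[x]) for x in range(x_split))
        return M, e * e

    best = None
    for k in range(1, B):                                    # m = 2, 4, ..., 2**(B-1)
        for r in range(1, B - k + 1):                        # at most p choices of r
            M, e2 = cost(1 << k, r)
            if e2 <= 2**-20 and (best is None or M < best[0]):
                best = (M, 1 << k, r, e2)
    print("minimum M meeting the error bound (M, m, r, eps^2):", best)
    # For b = 15 this should recover the m = 8, r = 3, M = 7680 design of Table 1.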

5 PGA implementation and results

A simple model was created employing a single instance of the Gaussian PNG described in the previous section and having a 13-bit output. Two versions of this simple model were implemented on a Xilinx Virtex-4 device running at 66.7 MHz using Xilinx System Generator for DSP™. The first version of the model utilized the single-LUT design while the second utilized the composite design. The model required 58 RAMB16 blocks, or 30% of the memory resources, when utilizing the single-LUT design, compared to 40 blocks, or 20% of the memory resources, when utilizing the composite LUT design. When the memory overhead of 34 blocks is considered, the resource requirement for the composite LUT design is found to drop from 24 to 6 blocks over the single-LUT design. Thus the memory resource requirements for the composite LUT design, in this case, are 25% of those for the single-LUT design. The optimization processes of System Generator result in variations in hardware resource utilization for any change in design; therefore, we were unable to find any measurable change in logic resource utilization between the two designs. Both designs were able to produce one new random value every clock cycle.

In our research model, the Gaussian PNG were initially implemented using the architecture of Fig. 1. The final implementation for the Gaussian PNG used the improved composite LUT architecture of Fig. 3 with the same width and precision as our original single-LUT design. Our simulation performance utilizing the composite LUT for the Gaussian PNG remained equivalent to that obtained with the original single-LUT design while lowering our overall memory utilization by 56% of total memory resources.

6 Conclusion


We have demonstrated a practical technique that can be used to reduce hardware resource requirements when implementing the ubiquitous Gaussian distributed PNG in programmable hardware. The technique enhances the basic inverse CDF look-up table method by allowing the memory requirements to be reduced while not adding significant logic. The technique also retains the high output rate of a LUT design and allows for a tradeoff between LUT size and output fidelity. We have found no degradation in performance when using the composite LUT architecture in our own simulations.

We note here that the technique may be extended to utilize three or more LUTs, with the possible advantage of further reduction in memory requirements. The technique could be applied recursively to the high-resolution LUT. With iteration, the resulting high-resolution LUT would become successively smaller. Each successive low-resolution LUT generated would be of a higher resolution (smaller m) than the previous. Further, the range of x would be subdivided by a series of split values {x1, x2, …, xN} such that x1 < x2 < … < xN, and each intermediate low-resolution LUT would need to hold only the values corresponding to [xi, xi+1).
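As a rough sketch of this recursive extension (hypothetical; the paper does not implement it), the chain of split values could be generated as follows, where each entry of rs reserves the top 2^-r fraction of the remaining range for the next, finer table.

    # Hypothetical helper for the recursive extension: compute the chain of split
    # points x1 < x2 < ... for a b-bit index, one split width r per additional LUT.
    def split_points(b, rs):
        splits, base, width = [], 0, 1 << b
        for r in rs:
            base += width - (width >> r)    # keep only the top 2**-r of the range
            width >>= r
            splits.append(base)
        return splits

    print(split_points(15, [3, 3]))         # e.g. two levels of splitting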

References

[1] J.-L. Danger, A. Ghazel, E. Boutillon, and H. Laamari, "Efficient FPGA Implementation of Gaussian Noise Generator for Communications Channel Emulation", 7th IEEE Int. Conf. Electronics, Circuits and Systems, 2000, vol. 1, pp. 366-369.
[2] A. Alimohammad, S. Fard, B. Cockburn, and C. Schlegel, "A Compact and Accurate Gaussian Variate Generator", IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 5, pp. 517-527, May 2008.
[3] D. Lee, R. Cheung, J. Villasenor, and W. Luk, "Inversion-Based Hardware Gaussian Random Number Generator: A Case Study of Function Evaluation via Hierarchical Segmentation", IEEE Int. Conf. Field-Programmable Tech., 2006, pp. 33-40.
[4] D. Thomas, W. Luk, P. Leong, and J. Villasenor, "Gaussian Random Number Generators", ACM Computing Surveys, vol. 39, no. 4, pp. 11:1-11:38, Oct. 2007.
[5] CRC Standard Mathematical Tables and Formulae, 30th ed., CRC Press, New York, 1996, pp. 593-598.
[6] A. Ghazel, E. Boutillon, J.-L. Danger, G. Gulak, and H. Laamari, "Design and Performance Analysis of High Speed AWGN Communication Channel Emulator", IEEE Pacific Rim Conf. Commun., Comput. and Signal Process., 2001, vol. 2, pp. 347-377.
[7] D. Lee, A. Gaffar, O. Mencer, and W. Luk, "MiniBit: Bit-Width Optimization via Affine Arithmetic", Proc. 42nd Design Automation Conf., 2005, pp. 837-840.
[8] D. Lee, J. Villasenor, and P. Cheung, "Hierarchical Segmentation Schemes for Function Evaluations", Proc. IEEE Int. Conf. Field-Programmable Tech., 2003, pp. 92-99.
[9] D. Lee, J. Villasenor, W. Luk, and P. Leong, "A Hardware Gaussian Noise Generator Using the Box-Muller Method and Its Error Analysis", IEEE Trans. Comput., vol. 55, no. 6, pp. 659-671, Jun. 2006.
[10] D. Lee, W. Luk, J. Villasenor, and P. Cheung, "A Gaussian Noise Generator for Hardware-Based Simulations", IEEE Trans. Comput., vol. 53, no. 12, pp. 1523-1533, Dec. 2004.
[11] J. McCollum, J. Lancaster, D. Bouldin, and G. Peterson, "Hardware Acceleration of Pseudo-Random Number Generation for Simulation Applications", Proc. 35th Southeastern Symp. Syst. Theory, 2003, pp. 299-303.

