Parallel distributed kernel estimation

Computational Statistics & Data Analysis 40 (2002) 293–302
www.elsevier.com/locate/csda

Jeff Racine
Department of Economics, University of South Florida, 4202 East Fowler Avenue, Tampa, FL 33620, USA
E-mail address: [email protected] (J. Racine).

Received 1 June 2001; received in revised form 1 November 2001

Abstract

Non-parametric kernel methods are becoming more commonplace for data analysis, modeling, and inference. Unfortunately, these methods are known to be computationally burdensome. The burden increases as the amount of available data rises and can quickly overwhelm the computational resources present in modern desktop workstations. Approximation-based approaches exist which can dramatically reduce execution time; however, these approaches remain just that: approximations, however good they may be. In addition to their approximate nature, such approaches do not admit multivariate kernel estimation with general bandwidths (fixed, variable, and adaptive). In this paper, I consider a parallel implementation of a number of popular kernel methods based on the MPI standard. MPI is a freely available parallel distributed library that runs on 'commodity hardware' such as a network of workstations typically found in many office environments. A simple demonstration indicates how one can dramatically reduce the computational burden often associated with kernel methods, achieving an almost 'ideal' parallel speed-up, while the approach remains valid for multivariate kernel estimation with general bandwidths and does not rely on approximations. Some straightforward applications illustrate just how disarmingly simple the MPI library can be to use. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Non-parametric methods; Distributed processing; Portable; Message-passing interface

1. Background

Kernel methods are widely acknowledged to be a powerful tool for applied data analysis. They have been successfully applied to a large number of problem domains spanning a range of fields including economics, hydrology, and biostatistics, to name but a few. The list of books devoted to the subject continues to grow, and I direct the interested reader to some of these works and the references and examples contained therein (see, for example, Prakasa Rao, 1983; Silverman, 1986; Härdle, 1990; Hastie and Tibshirani, 1990; Scott, 1992; Wand and Jones, 1995; Hart, 1997; Azzalini and Bowman, 1997; Horowitz, 1998; Efromovich, 1999; Pagan and Ullah, 1999, and a forthcoming book by Ruppert et al., 2002).

Unlike many flexible modeling tools such as spline smoothers and neural networks, which are typically of computational order O(nk) where n is the number of observations and k the number of variables, kernel methods are of order O(n²k). Data-driven methods of bandwidth selection such as cross-validation (Stone, 1974), while widely acknowledged to be an indispensable component of applied data analysis, can add an additional order of computational magnitude to kernel methods. It is not uncommon to encounter researchers for whom the computational burden of applying kernel methods even to moderately sized data sets is perceived to be too costly to justify their use.

Existing approximation-based approaches for reducing the computational burden of kernel estimation include the use of Fast Fourier Transforms (FFTs) (see, for example, Silverman, 1982) and other computationally efficient algorithms (see, for example, Racine, 1993; Egecioglu and Srinivasan, 2000). These approaches have proven to be extremely valuable in single-processor settings, and many are highly innovative. However, it is known that FFTs are of limited utility in general multivariate settings. Furthermore, to the best of my knowledge, existing approximation-based approaches have only been developed for fixed-bandwidth estimators, while variable and adaptive bandwidth estimates are frequently desired in practice. Ideally, practitioners simply want to be able to apply any manner of kernel method without having to suffer intolerably long execution times.

Against this backdrop parallel implementations would be most welcome, though it is widely perceived that parallel implementations would impose a heavy programming burden on the researcher in addition to requiring additional specialized hardware. Fortunately, recent developments in distributed computing are not only capable of dramatically reducing the computational burden associated with kernel methods, but are ideally suited to handling large high-dimensional data sets and are surprisingly easy to implement from a programming standpoint. In this paper, I outline one such implementation based on the Message-Passing Interface standard, an internationally recognized and freely available parallel computing paradigm that runs on 'commodity hardware' such as a network of inexpensive office desktop computers. Pseudo-code is provided and simulation benchmarks are presented, both of which illustrate the benefits and simplicity of a parallel kernel implementation using this well-designed, parsimonious, and computationally efficient library.

2. MPI and distributed processing

A number of distributed processing paradigms exist, perhaps the most well-known being 'Parallel Virtual Machine' (PVM) [1], while Sun's Grid Engine [2] is one of the more recent entrants to the field.

[1] See www.epm.ornl.gov/pvm/pvm_home.html for more information.
[2] See www.sun.com/software/linux/resources/gridengine.html for more information.

However, one of the most promising and freely available approaches is embodied in the 'Message-Passing Interface' (MPI), which has been an internationally recognized standard since version 1 was released in the summer of 1994, while MPI-2, released in the summer of 1997, is poised to continue this trend. MPI is a parallel distributed computing standard consisting of a library of functions and macros for use in programs that exploit the existence of multiple processors. [3] The library exists for both the Fortran 90 and ANSI C (Kernighan and Ritchie, 1988) programming languages. In order to benefit from this environment, however, some care must be exercised when designing the implementation. In particular, the level of 'granularity' determines the ratio of message-passing (overhead) to computation; too high a ratio and the benefits of having multiple processors can quickly vanish. Indeed, there are numerous instances of distributed neural network implementations that experience little or no speed improvement for exactly this reason: an inappropriate level of granularity, which is related to their O(nk) computational order. Using an appropriate level of granularity for the O(n²k) kernel methods, one can experience close to the 'ideal' computational speed-up, namely one that is linear in the number of available nodes (processors). Given that MPI is a freely available, internationally recognized standard that can run on existing networks of workstations, such features may be extremely valuable for practitioners.

[3] See www-unix.mcs.anl.gov/mpi/index.html and the reference books by Gropp et al. (1999a, b) for more information.

3. Kernel estimation and distributed processing

For what follows, I will restrict attention to an implementation written in ANSI C, and I begin with the simplest kernel estimator, a univariate density estimator.

3.1. Univariate kernel density estimation

By way of example, consider the univariate kernel density estimator originally proposed by Rosenblatt (1956), which is defined as

$$\hat f(x) = n^{-1} h_n^{-1} \sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h_n}\right), \qquad x = x_1, x_2, \ldots, x_j, \tag{1}$$

where i indexes the 'estimation data' and j the 'evaluation data' (typically they are the same), the kernel function K(z) satisfies ∫K(z) dz = 1 and some other regularity conditions depending on its order, and h_n is a bandwidth satisfying h_n → 0 as n → ∞. Loftsgaarden and Quesenberry (1965) considered replacing the 'fixed bandwidth' found in Rosenblatt's (1956) estimator with nearest neighbor weights around the point x ('variable bandwidth'), while the adaptive kernel estimator with weights depending on x_i rather than x was considered by Breiman et al. (1977) and Abramson (1984) ('adaptive bandwidth'). This latter estimator is obtained by replacing h_n with h_{n,i} in Eq. (1).
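The pseudo-code below calls a generic routine kernel(). As a concrete illustration (my own sketch, not code from the paper), the second-order Gaussian kernel used for the benchmarks reported in Section 3.1 could be written as follows; the name kernel and its one-argument signature are simply chosen to match the pseudo-code.

#include <math.h>

/* Second-order Gaussian kernel, K(z) = exp(-z*z/2)/sqrt(2*pi).
   Any kernel integrating to one could be substituted. */
double kernel(double z)
{
  const double SQRT_2PI = 2.5066282746310002;
  return exp(-0.5 * z * z) / SQRT_2PI;
}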

These three density estimators (fixed, variable, and adaptive bandwidth) are perhaps the most popular of the kernel density estimation methods. Each of these estimators shares a common feature: they involve O(n²k) computations, in contrast to, say, parametric density functions that involve only O(nk) computations (in this instance k = 1; however, we shall relax this in a moment). Consider by way of example the following ANSI C pseudo-code implementation for a univariate adaptive bandwidth density estimator (Abramson, 1984).

/* Compute a univariate kernel density estimate */
for (j = 0; j < num_obs_eval; j++) {
  sum_ker = 0.0;
  for (i = 0; i < num_obs_train; i++) {
    sum_ker += kernel((x_eval[j] - x_train[i]) / bandwidth[i]) / bandwidth[i];
  }
  pdf[j] = sum_ker / num_obs_train;
}

Without loss of generality, I shall assume for illustrative purposes that i = 1, ..., n and j = 1, ..., n. It is evident from this simple example that, without resorting to approximations, there is no way to avoid the O(n²k) computations if one wishes to compute a kernel density estimate at each of the sample realizations, as is often desired.

Now consider the following MPI-enabled pseudo-code, which demonstrates how the MPI libraries can be disarmingly simple to implement: first one initializes the interface and obtains the 'rank' of each node and their total number, next one defines the sequence to be computed by each processor (the 'stride'), and then one makes use of a few of the valuable library functions, thereby converting the serial code into parallel code.

/* Initialize the interface, obtain the rank of each process, and the number of processors */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_processors);

/* Compute the 'stride', the sequence of estimates to be computed by each processor */
stride = ceil((double) num_obs_eval / (double) num_processors);

/* Each processor computes the appropriate sequence */
for (j = my_rank * stride; (j < num_obs_eval) && (j < (my_rank + 1) * stride); j++) {
  sum_ker = 0.0;
  for (i = 0; i < num_obs_train; i++) {
    sum_ker += kernel((x_eval[j] - x_train[i]) / bandwidth[i]) / bandwidth[i];
  }
  pdf[j] = sum_ker / num_obs_train;
}

/* Now that each processor has computed its sequence, we gather this information
   into one vector on processor '0', the 'root' processor */
MPI_Gather(pdf, stride, MPI_DOUBLE, pdf, stride, MPI_DOUBLE, 0, MPI_COMM_WORLD);

First, note the simplicity of the implementation: four lines of straightforward MPI library function calls, one line to define the 'stride', and a modification of the indices defined in the outer loop. Next, note that if the number of nodes equals the number of evaluation observations then we have reduced this computation from an O(n²k) computation to an O(nk) one. In general this will not be the case. Regardless, if the overhead involved with the interface is negligible, we achieve a linear speed-up, which can have rather dramatic real-time implications. That is, we have converted the burden of O(n²k) computations borne by a single processor into one of O(n²k/p) borne simultaneously by each of p processors, where p is under our control.

Next I consider a simple implementation of this example deployed on a 24-node Beowulf cluster running Redhat Linux 6.2 using the MPICH implementation. [4] Code was compiled with gcc version egcs-2.91.66, and the second-order Gaussian kernel function was used. I report the ratio of the execution time for an x-node cluster (x = 2, ..., 24) relative to that for a single-processor implementation (x = 1) using sample sizes of n = 500 through n = 3000. The ratio is based upon execution times averaged over 1000 program invocations. We can see from Table 1 that, for the smaller sample size of n = 500, there is some slight overhead which becomes more marked as the number of processors increases. However, as the sample size increases this quickly disappears. Moreover, we observe that we indeed obtain a speed-up that is linear in the number of available nodes: a doubling of the number of nodes leads to a halving of the execution time. For example, a perfect speed-up would result in a ratio of 1/24, or 0.042, for the execution time of a 24-node cluster relative to a 1-node one, and it is clear that we quickly approach this ratio as n increases.

[4] Information on this implementation can be found at www-unix.mcs.anl.gov/mpi/mpich/indexold.html.
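A few practical notes of my own (these details are not spelled out in the paper): with the MPICH implementation mentioned above, a program of this kind would typically be compiled with the mpicc wrapper and launched with mpirun -np followed by the desired number of processes, and a complete program would call MPI_Finalize() before exiting. Note also that MPI_Gather assembles one stride-length block per rank on the root, so in a full implementation one would normally size the output vector accordingly and pass the block each rank actually computed as the send buffer. A minimal sketch, reusing the variable names from the pseudo-code above, might look as follows.

/* Sketch only: pad the output so that every rank contributes exactly `stride`
   doubles, have each rank send the block it computed, and shut MPI down. */
double *pdf     = (double *) malloc(num_processors * stride * sizeof(double));
double *pdf_all = NULL;
if (my_rank == 0)
  pdf_all = (double *) malloc(num_processors * stride * sizeof(double));
/* ... each rank fills pdf[j] for its own range of j, as in the loop above ... */
MPI_Gather(pdf + my_rank * stride, stride, MPI_DOUBLE,
           pdf_all, stride, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Finalize();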

Table 1
Ratio of execution time for kernel density estimation on an x-node cluster relative to a 1-node cluster (x = 2, 4, ..., 24)

Sample size n   Number of nodes
                2      4      6      8      10     12     14     16     18     20     22     24
500             0.505  0.261  0.180  0.144  0.108  0.099  0.081  0.072  0.072  0.063  0.063  0.063
1000            0.502  0.251  0.174  0.130  0.104  0.088  0.079  0.070  0.062  0.055  0.051  0.048
1500            0.502  0.251  0.169  0.128  0.103  0.086  0.076  0.065  0.059  0.053  0.049  0.046
2000            0.502  0.253  0.169  0.127  0.102  0.086  0.074  0.066  0.059  0.054  0.049  0.045
2500            0.502  0.251  0.168  0.127  0.102  0.086  0.074  0.065  0.058  0.052  0.048  0.045
3000            0.501  0.251  0.168  0.127  0.102  0.085  0.074  0.065  0.058  0.052  0.049  0.044
A few words on diminishing returns to adding more processors are in order. A careful examination of Table 1 reveals that, as additional processors are added, the marginal reduction in computing time dissipates (i.e., one obtains a larger reduction when going from two to four processors than when going from 22 to 24 processors). In fact, it is possible for run-time to actually deteriorate (increase) when the additional communications overhead suffered by adding more processors outweighs the potential reduction in run-time. An obvious solution [5] is to briefly test one's application on a small sample to determine whether this is a problem. This ought only to become a problem when the sample size is small relative to the number of processors, in which case one may not require a parallel environment to begin with; however, one would be well-advised to conduct a simple check when dealing with moderate amounts of data.

[5] I am grateful to an anonymous referee for pointing this out.

It should be apparent that, even for the simplest type of univariate kernel estimation with small samples of, say, 500 observations, a parallel implementation is not only faster but can yield close to a 'frictionless' speed-up when some care is exercised to obtain the appropriate level of granularity. That is, if rather than implementing the parallelism over the evaluation observations we had instead implemented it over the estimation observations, or possibly even over the variables, the overhead associated with message-passing would totally negate the benefits of parallel computation.

3.2. Multivariate kernel density estimation and regression

Extension of this approach to multivariate density estimation and, say, local constant (Nadaraya, 1965; Watson, 1964) or local linear (Fan and Gijbels, 1996) kernel regression is straightforward.
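To fix ideas (this is a standard definition from the literature, not restated in the paper), the local constant (Nadaraya–Watson) estimator of the conditional mean takes the form

$$\hat g(x) = \frac{\sum_{i=1}^{n} y_i \, K\!\left(\frac{x_i - x}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h_n}\right)},$$

so its dominant cost is the same double loop over evaluation and estimation points as in the density case, and it distributes over the evaluation observations in exactly the same way.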

By way of example, consider the following pseudo-code for a multivariate kernel density implementation involving k variables using a product kernel.

/* Initialize the interface, obtain the rank of each process, and the number of processors */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_processors);

/* Compute the 'stride', the sequence of estimates to be computed by each processor */
stride = ceil((double) num_obs_eval / (double) num_processors);

/* Each processor computes the appropriate sequence */
for (j = my_rank * stride; (j < num_obs_eval) && (j < (my_rank + 1) * stride); j++) {
  sum_ker = 0.0;
  for (i = 0; i < num_obs_train; i++) {
    prod_ker = 1.0;
    for (k = 0; k < num_var; k++) {
      prod_ker *= kernel((x_eval[k][j] - x_train[k][i]) / bandwidth[k][i]) / bandwidth[k][i];
    }
    sum_ker += prod_ker;
  }
  pdf[j] = sum_ker / num_obs_train;
}

/* Now that each processor has computed its sequence, we gather this information
   into one vector on processor '0', the 'root' processor */
MPI_Gather(pdf, stride, MPI_DOUBLE, pdf, stride, MPI_DOUBLE, 0, MPI_COMM_WORLD);

It should be apparent that this will benefit from a parallel environment to the same degree as the univariate case. In fact, the relative overhead will be smaller in this case owing to the presence of the third, inner loop. Furthermore, since kernel regression involves an analogous computation of the kernel function, it too will benefit to the same degree. The implementations of local constant regression, local linear regression, and conditional density estimation have all displayed virtually identical properties when deployed in this environment.

3.3. Data-driven bandwidth selection

Many popular methods of bandwidth selection, such as cross-validation, require repeated computation of a 'leave-one-out' kernel estimate. If one has successfully implemented a kernel estimator, its leave-one-out counterpart is straightforward. In addition, the leave-one-out counterpart will benefit as much from the existence of a parallel implementation as did the estimator itself. It is perhaps here that a parallel environment can be of most assistance, since searching for appropriate data-driven bandwidths is one of the most computationally demanding aspects of applied kernel analysis, yet also the most indispensable. For example, an application requiring 5 min of computation time on a 24-node cluster would require roughly 2 h on a 1-node one. Furthermore, clusters and networks having hundreds of nodes are becoming commonplace, hence the same application requiring 2 h in a 1-node environment would require roughly 30 s on a 256-node one. I am sure most users would agree that this is a rather dramatic improvement, and one which further broadens the appeal of kernel methods.
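Since the paper describes the leave-one-out idea but does not show code for it, the following sketch (mine, not the author's) illustrates how one data-driven criterion, likelihood cross-validation for the fixed-bandwidth univariate estimator, might be distributed in the same manner as the estimators above. The function name cv_objective, the choice of the likelihood criterion, and the use of MPI_Allreduce to combine the per-rank partial sums are illustrative assumptions; the routine reuses the kernel() function sketched earlier and assumes mpi.h and math.h have been included.

/* Leave-one-out likelihood cross-validation objective, CV(h) = sum_j log f_{-j}(x_j).
   Each rank evaluates its own block of j's; MPI_Allreduce then sums the partial
   objectives so that every process returns the same value of CV(h). */
double cv_objective(const double *x, int n, double h, int my_rank, int num_processors)
{
  int i, j;
  int stride = (int) ceil((double) n / (double) num_processors);
  double partial = 0.0, total = 0.0;

  for (j = my_rank * stride; (j < n) && (j < (my_rank + 1) * stride); j++) {
    double sum_ker = 0.0;
    for (i = 0; i < n; i++) {
      if (i == j) continue;                      /* leave observation j out */
      sum_ker += kernel((x[i] - x[j]) / h) / h;
    }
    partial += log(sum_ker / (double) (n - 1));
  }

  MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return total;   /* to be maximized over h by one's preferred search routine */
}

One could then wrap this objective in any one-dimensional search over h; the cross-validated bandwidth is the maximizer.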

4. Example: cross-validated adaptive bandwidth local linear kernel regression

By way of illustration I consider the kernel estimation of an unknown conditional mean having two covariates. I apply Fan and Gijbels' (1996) local linear kernel estimator using cross-validated adaptive bandwidths as in Abramson (1984). This application was chosen because it involves multivariate kernel estimation with non-fixed bandwidths, a situation that cannot be tackled using existing computationally efficient approximate methods, yet it is a fairly straightforward exercise to implement directly using the parallel framework outlined above. I consider 1024 estimation observations; cross-validation is conducted using a conjugate gradient search algorithm with 100 restarts from different random initial values, and the conditional mean is then constructed using the cross-validated bandwidths, evaluated on a grid of 625 evenly spaced observations. Fig. 1 presents graphs of the data and the resulting kernel estimate. This example is one for which, again, an almost-ideal speed-up is experienced: the ratio of combined execution time for cross-validation and estimation on a 24-node cluster relative to that using 1 node was only slightly greater than 1/24.
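For readers who want the estimator spelled out (this is standard material from Fan and Gijbels (1996), not restated in the paper), the local linear estimator of the conditional mean at a point x is the intercept $\hat a$ from the locally weighted least-squares problem

$$\min_{a,\,b}\; \sum_{i=1}^{n} \left\{ y_i - a - b^{\top}(x_i - x) \right\}^2 K_h(x_i - x),$$

where K_h denotes the (here, product) kernel with the adaptive bandwidths. Each evaluation point requires its own small weighted regression, so the computation again distributes naturally over the evaluation grid, exactly as in the density pseudo-code.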

[Fig. 1: two surface plots over the covariates X1 and X2 (vertical axis Y); left panel 'Data', right panel 'Kernel Estimate of E(Y|x)'.]

Fig. 1. Cross-validated local linear regression with adaptive bandwidths.

5. Remarks

This paper outlines a parallel distributed implementation for multivariate kernel density estimation and regression that is applicable to the fixed, variable, and adaptive kernel estimators. The method achieves an almost ideal speed-up, is simple to implement, relies only on freely available supporting software such as compilers and the MPI libraries, and can be deployed in many existing network environments. Many academic and non-academic sites have networks consisting of hundreds of modern computers which are idle in the evening, and some sites are starting to use these networks, originally purchased mainly for word processing chores, as SuperComputers At Night (SCANs). This is achieved by installing, say, one of the publicly available variants of MPI such as MPICH on, say, a network of Windows NT/2000 workstations [6] and by adding the MPI 'hooks' in one's code and compiling and linking with the appropriate libraries. MPI supports networks of heterogeneous computers, thus the presence of machines with differing processor speeds and even operating systems is largely transparent from the user's perspective. Moreover, the MPI standard is supported on specialized machines sold by more traditional 'big-iron' vendors, examples being the Intel Paragon or Thinking Machines CM-5. That is, ANSI C (or Fortran 90) MPI-enabled code is likely to be extremely portable and ought to compile 'out-of-the-box' on a number of architectures and operating systems. It is my sincere hope that this outline of MPI-enabled kernel estimators helps facilitate the migration of more advanced kernel methods towards the mainstream of applied statistical methods.

[6] See www-unix.mcs.anl.gov/~ashton/mpich.nt for more information.

Acknowledgements

Many thanks to Dan Majchrzak for his valuable comments and suggestions.

References

Abramson, I.S., 1984. Adaptive density flattening – a metric distortion principle for combating bias in nearest neighbour methods. Ann. Statist. 12, 880–886.
Azzalini, A., Bowman, A.W., 1997. Applied Smoothing Techniques for Data Analysis: the Kernel Approach with S-Plus Illustrations. Oxford University Press, New York.
Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel estimates of multivariate densities. Technometrics 19, 135–144.
Efromovich, S., 1999. Nonparametric Curve Estimation: Methods, Theory and Applications. Springer, New York.
Egecioglu, O., Srinivasan, A., 2000. Efficient non-parametric density estimation on the sphere with applications in fluid mechanics. SIAM J. Sci. Comput. 22, 152–176.
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and Its Applications. Chapman & Hall, London.
Gropp, W., Lusk, E., Skjellum, A., 1999a. Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd Edition. Massachusetts Institute of Technology, Massachusetts.
Gropp, W., Lusk, E., Thakur, R., 1999b. Using MPI-2: Advanced Features of the Message-Passing Interface. Massachusetts Institute of Technology, Massachusetts.
Hart, J.D., 1997. Nonparametric Smoothing and Lack-of-Fit Tests. Springer, New York.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman & Hall, London.
Härdle, W., 1990. Applied Nonparametric Regression. Cambridge University Press, New Rochelle.
Horowitz, J.L., 1998. Semiparametric Methods in Econometrics. Lecture Notes in Statistics. Springer, New York.
Kernighan, B.W., Ritchie, D.M., 1988. The C Programming Language, 2nd Edition. Prentice-Hall, Englewood Cliffs, NJ.
Loftsgaarden, D.O., Quesenberry, C.P., 1965. A nonparametric estimate of a multivariate density function. Ann. Math. Statist. 36, 1049–1051.
Nadaraya, E.A., 1965. On nonparametric estimates of density functions and regression curves. Theory Appl. Probab. 10, 186–190.
Pagan, A., Ullah, A., 1999. Nonparametric Econometrics. Cambridge University Press, Cambridge.
Prakasa Rao, B., 1983. Nonparametric Functional Estimation. Academic Press, New York.
Racine, J.S., 1993. An efficient cross-validation algorithm for window width selection for nonparametric kernel regression. Comm. Statist. 22 (4), 1107–1114.
Rosenblatt, M., 1956. Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27, 832–837.
Ruppert, D., Carroll, R., Wand, M., 2002. Semiparametric Regression Modeling. Cambridge University Press.
Scott, D.W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Silverman, B.W., 1982. Algorithm AS 176: Kernel density estimation using the fast Fourier transform. Appl. Statist. 31, 166–172.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, New York.
Stone, C.J., 1974. Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. 36, 111–147.
Wand, M., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, London.
Watson, G.S., 1964. Smooth regression analysis. Sankhya 26 (15), 175–184.
