Asymptotic efficiency of kernel support vector machines (SVM)

Cybernetics and Systems Analysis, Vol. 45, No. 4, 2009

ASYMPTOTIC EFFICIENCY OF KERNEL SUPPORT VECTOR MACHINES (SVM)

V. I. Norkin^a and M. A. Keyzer^b

UDC 519:234:24:85

The paper analyzes the asymptotic properties of Vapnik's SVM-estimates of a regression function as the size of the training sample tends to infinity. The estimation problem is considered as infinite-dimensional minimization of a regularized empirical risk functional in a reproducing kernel Hilbert space. The rate of convergence of the risk functional on SVM-estimates to its minimum value is established. Sufficient conditions for the uniform convergence of SVM-estimates to the true regression function with unit probability are given.

Keywords: machine learning, estimation of dependences, recognition, kernel estimate, support vector machine (SVM), ill-posed problems, regularization, consistency, rate of convergence.

INTRODUCTION

The support vector method/machine (SVM) plays an important role in modern pattern recognition theory since it successfully competes with the best-developed multilevel neural network recognition systems [1, 2]. Its nonlinear version is called the kernel SVM. It belongs to the statistical learning theory founded in [3–5]. The essence of the theory is as follows [1–11]. Consider a problem of learning, i.e., estimating a dependence (model) from "input–output" observations. The statistical learning theory assumes that the "input" is generated randomly and the "output" is observed with errors; thus, the observations are assumed to be generated independently from a joint probability distribution on the "input–output" space. Recognition and classification problems are important special cases of the estimation of a dependence with discrete "output."

In the statistical learning theory, estimation of dependences, pattern recognition, and classification are reduced to two-criteria optimization problems on some set of feasible dependences, one criterion (the empirical risk functional) being responsible for the adequacy of the model to the observed data, and the other (complexity) for its generalizing ability, i.e., for the adequacy of the model to new data. The norm of the model as an element of the model space is used as a measure of its complexity. The main tasks of this approach are to set up the space and the subset of feasible models, to efficiently enumerate models, and to find a compromise between the degree of adequacy of the model to the data and the model complexity. In the statistical learning theory, the model space is linear and consists of linear or nonlinear (kernel) functions of the vector of "inputs," each observed vector of "inputs" being associated with a model from this space. The support vector machine assumes that the unknown model is sought as a linear combination of only those models of the space that correspond to the observed "inputs." Computationally, it reduces to finite-dimensional quadratic optimization problems with the number of variables equal to the number of observations. The mathematical basis of the statistical learning theory and the support vector machine comprises the theories of reproducing kernel Hilbert spaces, convex optimization in Hilbert spaces, and ill-posed optimization problems, as well as generalized laws of large numbers such as generalizations of the Glivenko–Cantelli theorem and exponential measure concentration theorems. Applied to the estimation of functions, the SVM can be considered as a method of mathematical statistics: in the linear version it is close to the ridge regression method [12], and in the nonlinear version, to nonparametric kernel estimation methods [13].

^a V. M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, Ukraine, [email protected]. ^b Centre for World Food Studies, Vrije Universiteit, Amsterdam, the Netherlands, [email protected]. Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 81–97, July–August 2009. Original article submitted September 16, 2008.

1060-0396/09/4504-0575 © 2009 Springer Science+Business Media, Inc.

The present paper analyzes the asymptotic properties of the SVM-estimates of an unknown dependence obtained by the SVM as the number of observations (the training sample size) increases without bound, as is done in mathematical statistics. The literature on statistical learning analyzes the convergence of estimates mainly with respect to a functional [1–11, 14–16]. In the case of quadratic functionals, this also yields convergence in probability to the root-mean-square regression function in one norm or another [8, 13, 16, 17–19]. Note that the SVM usually employs nonquadratic and even nonsmooth quality functionals. The paper estimates the rate of mean convergence (proportional to $1/\sqrt[4]{m}$, where $m$ is the number of observations) of the values of an arbitrary convex quality functional on SVM-estimates to its theoretical minimum; such an estimate of the rate of convergence for the confidence bound of a quadratic risk functional is presented in [16, Sec. 4]. In the case of binary classification problems, these results assess the rate of convergence of the Bayesian risk (the probability of erroneous classification) to its theoretical minimum. Note that, under the strong assumption that the components of the input random vector are independent, an unimprovable estimate of the rate of convergence, proportional to $1/m$, is found in [20, 21] for the Bayesian classification method.

An analysis of convergence with respect to a functional is justified for classification problems; however, it is insufficient for regression problems such as median and quantile regression [15, 22, 23]. Therefore, the present paper provides sufficient conditions for the uniform convergence of SVM-estimates of a regression to the unknown function with unit probability; namely, it establishes a rule for changing the regularization parameter in the SVM as the number of observations increases. Measures of the capacity of a class of models (such as the VC-dimension [1, 4, 5], which can be infinite in the case under consideration) are not used; instead, the robustness (stability) of the SVM with respect to individual observations is exploited [14], and theorems on the exponential concentration of the distribution of averaged random variables around their expectation are applied [24]. For iterative learning algorithms, convergence results are available in [3, 25].

The studies [26, 27] give an alternative approach to proving the convergence, with unit probability, of estimates obtained by empirical risk minimization, provided that the feasible domain is compact and the solution is unique; they also consider the cases of periodic, randomly distributed, and dependent observations. Note that a solution of stochastic programming and classification problems is usually not unique; therefore, the solution-uniqueness assumption means that a fixed (not vanishing as the number of observations grows) regularization of the problem is applied. Such a regularization results in asymptotically biased estimates. Moreover, it provides only weak compactness of the level sets of the objective functional. In contrast to [26, 27], we consider here the case of multiple solutions and use a gradually vanishing regularization. The results of this study were partially presented in [28, 29].

The paper is structured as follows. The first section briefly reviews results on reproducing kernel Hilbert spaces. The second section relates the quantile regression problem and binary classification to the minimization of convex risk functionals. The third section analyzes the convergence of Tikhonov's regularization method for minimizing integral risk functionals in reproducing kernel Hilbert spaces. The fourth section presents a computational scheme, and the fifth analyzes the convergence of the SVM. In the conclusions, the main results are summarized.

1. REPRODUCING HILBERT SPACES

The modern SVM is based on the theory of reproducing kernel Hilbert spaces (RKHS), which emerged at the beginning of the 20th century to meet the needs of the theory of integral equations and has since come to be used in many branches of mathematics [2, 8, 30]. Such spaces are a special case of so-called equipped Hilbert spaces [31].

Definition 1. A Hilbert space $H_k(X)$ of functions defined on a closed set $X \subseteq R^n$ is called a reproducing Hilbert space (RHS) if there exists a function of two vector variables $k(\cdot,\cdot)$, defined on the Cartesian product $X \times X$, with the following properties: (a) $k(\cdot, x) \in H_k(X)$ for all $x \in X$; (b) $f(x) = \langle f, k(\cdot, x)\rangle_k$ for any function $f \in H_k(X)$ and any $x \in X$ (the reproducing property of the kernel).

Property (b) shows that the kernel $k(\cdot, x)$ plays, in scalar products, a role similar to that of the Dirac delta function. In the statistical learning theory, the value $k(x, x')$ of the kernel is interpreted as a measure of similarity of the objects $x$ and $x'$. An RHS with kernel $k$ is denoted by $H_k(X)$, or $H_k$ for brevity. The corresponding scalar product and norm in the RHS are denoted by $\langle\cdot,\cdot\rangle_k$ and $\|\cdot\|_k = \langle\cdot,\cdot\rangle_k^{1/2}$; $R^n$ is the $n$-dimensional vector space.


Proposition 1. The set of functions $\{f(x) = \sum_i \alpha_i k(x_i, x)\}$ from the RHS $H_k(X)$ (where $\{x_i\}$ is an arbitrary finite set of points from $X$ and $\{\alpha_i\}$ is an arbitrary set of real numbers) is dense in $H_k(X)$.

The reproducing property of the kernel makes it possible to represent the scalar product of functions $f(x) = \sum_i \alpha_i k(x_i, x)$ and $g(x) = \sum_j \beta_j k(x_j, x)$ from $H_k(X)$ as $\langle f, g\rangle_k = \sum_{i,j}\alpha_i\beta_j k(x_i, x_j)$; for example, $\|f\|_k^2 = \langle f, f\rangle_k = \sum_{i,j}\alpha_i\alpha_j k(x_i, x_j)$.

Proposition 2. A Hilbert space $H(X)$ of functions defined on the set $X$ is an RHS if and only if, for each $x \in X$, the functional $f_x: f \to f(x)$ is continuous in the norm $\|f\|_k = \langle f, f\rangle_k^{1/2}$, i.e., there exists a constant $K_x$ such that $|f(x)| \le K_x\|f\|_k$ for all $f \in H(X)$.

The statements below relate the norms $\|f\|_k$ and $\|f\|_\infty = \sup_{x\in X}|f(x)|$.

Proposition 3. If $\sup_{x\in X}|k(x, x)| \le K^2 < \infty$, then $\|f\|_\infty \le K\|f\|_k$; therefore, strong convergence $f_m \to f$ (with respect to the $H_k$-norm) implies uniform convergence $f_m \rightrightarrows f$ on $X$ as $m \to \infty$.

Proposition 4. If a kernel satisfies the condition $0 < \varepsilon \le k(x, x') \le K^2$ for all $x, x' \in X$ and $f = \lim_{n\to\infty} f_n$, where $f_n(x) = \sum_{i=1}^{n}\alpha_{ni} k(x, x_{ni})$, $\alpha_{ni} \ge 0$, then $\|f\|_k \le (K/\varepsilon)\|f\|_\infty$.

Proposition 5. If $\sup_{x\in X'}|k(x, x)| \le K < \infty$, then strong convergence $f_n \to f$ with respect to the norm in the RHS $H_k(X)$ implies uniform convergence on $X' \subseteq X$.

Proposition 6. If a kernel $k(x, x')$ is continuous in $(x, x') \in X \times X$, then the corresponding RHS consists of continuous functions.

Definition 2. A function $k(x, x')$ is called a Mercer kernel if the kernel $k(\cdot,\cdot)$ is continuous and symmetric on the compact set $X \times X$ and the matrix $\{k(x_i, x_j)\}$ is nonnegative definite for any finite set of points $\{x_i \in X\}$. For example, the functions $k(x, x') = \exp(-\|x - x'\|^2/\sigma^2)$, $k(x, x') = (\sigma^2 + \|x - x'\|^2)^{-\alpha}$, and $k(x, x') = (1 + \langle x, x'\rangle)^{l}$, where $l$ is a positive integer power, are Mercer kernels.

Proposition 7 (Mercer theorem). A Mercer kernel can be expanded into a uniformly convergent series $k(x, y) = \sum_{i=1}^{\infty}\lambda_i\varphi_i(x)\varphi_i(y)$ of (continuous) eigenfunctions $\{\varphi_i(x)\}$ and eigenvalues $\{\lambda_i > 0\}$ of the integral operator

$$A_k f(x) = \int_X f(x')\,k(x, x')\,dx' : L_2(X) \to L_2(X),$$

where $L_2(X)$ is the Hilbert space of functions square integrable on $X$.

Proposition 8. For any Mercer kernel $k(x, x')$, $x, x' \in X$, there exists a reproducing Hilbert space $H_k(X)$ specified by this kernel according to Definition 1. Given a Mercer kernel $k(x, y)$, $x, y \in X$, the RHS can be set up in two ways:

(i) $H_k(X) = \mathrm{cl}\{f(x) = \sum_{i=1}^{N}\alpha_i k(x_i, x) : N \ge 1,\ \alpha_i \in R,\ x_i \in X\}$ with the scalar product $\langle f, g\rangle_k = \sum_{i,j=1}^{N}\alpha_i\beta_j k(x_i, x_j)$ of functions $f(x) = \sum_{i=1}^{N}\alpha_i k(x, x_i)$ and $g(x) = \sum_{i=1}^{N}\beta_i k(x, x_i)$ from $H_k(X)$;

(ii) $H_k(X) = \{f(x) = \sum_{i=1}^{\infty} a_i\varphi_i(x) : \sum_{i=1}^{\infty} a_i^2/\lambda_i < +\infty\}$ with the scalar product $\langle f, g\rangle_k = \sum_{i=1}^{\infty} a_i b_i/\lambda_i$ of functions $f(x) = \sum_{i=1}^{\infty} a_i\varphi_i(x)$ and $g(x) = \sum_{i=1}^{\infty} b_i\varphi_i(x)$, where $\{\varphi_i(x)\}$, $\{\lambda_i\}$ are the eigenfunctions and eigenvalues of the integral operator $A_k$.

In the statistical learning theory, the set $\{\varphi_i(x)\}$ is called the vector of derived characteristics of the object $x$.

Proposition 9. A finite-dimensional Hilbert space of functions $H(X)$ with basis $\{\varphi_i(x)\}_{i=1}^N$ and scalar product $\langle\cdot,\cdot\rangle$ specified by the Gramian matrix $G = \{g_{ij} = \langle\varphi_i, \varphi_j\rangle\}$ can be considered as an RHS with the kernel $k(x, y) = \sum_{i,j=1}^{N} d_{ij}\varphi_i(x)\varphi_j(y)$, where $\{d_{ij}\} = G^{-1}$. In this RHS, the scalar product of the functions $f(x) = \sum_{i=1}^{N} a_i\varphi_i(x)$ and $g(x) = \sum_{i=1}^{N} b_i\varphi_i(x)$ is defined as $\langle f, g\rangle_k = \sum_{i,j=1}^{N} g_{ij} a_i b_j$.
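To make the finite-expansion formulas above concrete, the following sketch (an illustrative Python/NumPy fragment added by the editor, not part of the original paper; the function names are ours) evaluates $\langle f, g\rangle_k = \sum_{i,j}\alpha_i\beta_j k(x_i, x_j)$ and the RKHS norm for finite kernel expansions, using the Gaussian Mercer kernel from Definition 2.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Mercer kernel k(x, x') = exp(-||x - x'||^2 / sigma^2), cf. Definition 2
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / sigma ** 2)

def rkhs_inner(alpha, Xa, beta, Xb, kernel=gaussian_kernel):
    # <f, g>_k = sum_{i,j} alpha_i beta_j k(x_i, x_j) for finite expansions
    # f(x) = sum_i alpha_i k(x_i, x),  g(x) = sum_j beta_j k(x_j, x)
    K = np.array([[kernel(xi, xj) for xj in Xb] for xi in Xa])
    return np.asarray(alpha) @ K @ np.asarray(beta)

# ||f||_k^2 = <f, f>_k for a three-term expansion
Xa = [[0.0], [1.0], [2.5]]
alpha = [0.5, -1.0, 2.0]
print(rkhs_inner(alpha, Xa, alpha, Xa))  # squared RKHS norm of f
```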

2. PROBLEMS OF ROOT-MEAN-SQUARE, MEDIAN, AND QUANTILE REGRESSIONS

By a regression problem (estimating the relationship between $y$ and $x$) we mean finding certain (conditional, for fixed $x$) characteristics of the joint distribution of the variables $(x, y)$, for example, the conditional mean, median, or quantiles. The SVM can be applied to problems of estimating functions from observation data. Consider a set of points $S_m = \{(x_i \in X \subset R^n,\ y_i \in R^1)\}_{i=1}^m$ obtained by observations, which presumably satisfy the relation $y = f(x)$, where $f$ is an unknown function. The statistical learning theory deals with the problem of finding this function from the available observations. However, this requires additional information on the class of functions $F$ that contains the unknown function $f$. For example, linear regression theory assumes that $f$ belongs to the linear span of some basis functions $\{\phi_h(x),\ h = 1, 2, \ldots, r\}$. Another approach requires that $f$ be represented in the simplest possible form, for example, with the minimum number of nonzero parameters in a linear representation. This approach is used in kernel learning [1, 2], which is a part of statistical learning theory and tends to combine flexibility with a simple representation of the function $f$. The quality of data approximation is measured by the empirical risk, i.e., by the sum of the absolute values of the deviations of the values predicted by the model $f$ from the observed ones, and the complexity of the model is estimated by a measure of complexity, for example, by the norm of $f$ in the space of functions under study. An optimal choice of $f$ is based on these two criteria, by minimizing a weighted sum of the empirical risk and a measure of complexity. The criterion responsible for complexity is called regularizing since it eliminates the ill-posedness of the problem, provides uniqueness, and minimizes the number of nonzero components in the representation of the function.

Moreover, the statistical learning theory linearizes and thus simplifies the problem by mapping the initial data $\{x_i\}$ from the space $R^n$ of initial attributes into another Hilbert space of secondary attributes, possibly of very high or even infinite dimension, such that the relationship between the dependent variable $y$ and the secondary attributes becomes linear and can be expressed via a scalar product. A similar device is used in the Sobolev–Schwartz theory of distributions [32], which considers nonlinear functions as linear functionals on some space of test functions. Thus, the unknown function is sought in a class of functions that are linear in the extended attribute space and can be specified in it by a vector. Computational complications are avoided by constructing the expanded attribute space in such a way that computing the scalar product of two vectors of very high dimension from this space reduces to computing the values of the kernel, i.e., of a symmetric function of two variables $k(x, y)$. This is a reproducing Hilbert space. A key result [33] on representing the solution of the regularized problem in an RHS is that the solution of the minimization problem for the regularized empirical risk is representable as a finite linear combination of kernel functions, $f(x) = \sum_{i=1}^{m}\alpha_i k(x_i, x)$, associated with the observation points $x_i$. Hence, the estimation (or learning) problem becomes a finite-dimensional optimization problem with respect to the unknown expansion coefficients $\{\alpha_i\}_{i=1}^m$. If the loss function is piecewise-linear, the problem reduces to a quadratic optimization problem whose dual is especially convenient for solution (see Sec. 4).

The most general probabilistic model of the relationship between $y \in Y \subseteq R^1$ and $x \in X \subseteq R^n$ is the joint probability distribution on $X \times Y$, which can be estimated from the observations $S_m = \{(x_i \in X \subset R^n,\ y_i \in R^1)\}_{i=1}^m$. However, even in this general formulation, only some characteristics of the distribution are of interest, such as the conditional (for given $x$) mean value, median, or quantiles (the inverse of the conditional distribution function), which can be calculated if the distribution is known. The conditional mean can be calculated approximately using the direct nonparametric Nadaraya–Watson formulas [34]. At the same time, the conditional mean can be obtained by minimizing the quadratic risk functional over all measurable functions. Moreover, nonquadratic functionals are considered, such as the expectation of the absolute deviation, with linear-quadratic [35] and $\varepsilon$-insensitive loss functions [1], which lead to robust estimates that are less sensitive to the spread of the observations.

Let $z = (x, y)$ be a random vector variable with distribution $P(dz)$ concentrated on a set $Z$, let $c(z, y)$ be a loss function convex in the second argument, and let $f: X \to R^1$ be a function from some class $F$. Let us define the risk functional as the expectation of losses, $R(f) = E_z c(z, f(x)) = \int_Z c(z, f(x))\,P(dz)$. The risk minimization problem has the form

$$R(f) = E_z c(z, f(x)) = \int_Z c(z, f(x))\,P(dz) \to \inf_{f\in F}. \qquad (1)$$

Denote the set of solutions of this problem by $F^*$. For a quadratic loss function $c(z, f) = (y - f)^2$, if the conditional mean $m(x) = \int_{Y_x} y\,p(z)\,dy$ exists for each fixed $x$ and $m(x) \in F$, then $f^* = m$ [13].
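The Nadaraya–Watson formula mentioned above admits a very short implementation. The sketch below is an illustration of ours (not from the original paper), using a Gaussian smoothing kernel and an arbitrarily chosen bandwidth h, to estimate the conditional mean m(x) from observations {(x_i, y_i)}.

```python
import numpy as np

def nadaraya_watson(x, x_obs, y_obs, h=0.5):
    # Kernel-weighted average: m_hat(x) = sum_i w_i(x) * y_i,
    # w_i(x) = K((x - x_i)/h) / sum_j K((x - x_j)/h), Gaussian K, bandwidth h
    x_obs, y_obs = np.asarray(x_obs, float), np.asarray(y_obs, float)
    w = np.exp(-0.5 * ((x - x_obs) / h) ** 2)
    return np.sum(w * y_obs) / np.sum(w)

# toy sample assumed to follow y = sin(x) + noise
rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 3, size=200)
y_obs = np.sin(x_obs) + 0.1 * rng.standard_normal(200)
print(nadaraya_watson(1.5, x_obs, y_obs))  # approximately sin(1.5)
```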

For nonquadratic risk functionals, the correspondence of their minima to particular characteristics of the distribution is less obvious; however, for the functional of the mean absolute deviation, often used in the statistical learning theory, such a correspondence can be established. Indeed, for a loss function of the form $c(z, f) = |y - f| = \max\{(f - y), (y - f)\}$, the unconditional minimum in (1) of the functional $R(f)$ over all measurable functions $f(x)$ is attained on the conditional median (if it exists) of the distribution $P(dz)$. In a more general case, the risk minimization problem involves the loss function

$$c(z, f) = \max\{\mu(f - y),\ \nu(y - f)\}, \quad \mu, \nu \ge 0. \qquad (2)$$

Then the exact minimum in (1) is attained at the conditional quantile of the distribution $P(dz)$ of level $\alpha = \nu/(\mu + \nu)$ (see [22, 23]; in the context of stochastic minimax problems, this fact is established in [36, p. 95; 37] and discussed in detail in [38]). For the median, $\mu = \nu = 1$. We now formulate conditions under which the conditional $\alpha$-quantile is the unique minimum of the corresponding risk functional.

Assumption 1 (model identifiability conditions):
(a) the distribution $P(dz)$ has a density $p(z) = p(x, y)$, $z \in Z = \{z = (x, y): x \in X,\ y \in Y_x\}$, and the mapping $x \to Y_x$ is locally bounded;
(b) the $x$-density $p(x) = \int_{Y_x} p(x, y)\,dy$ is nondegenerate, i.e., $p(x) > 0$ for all $x \in X$;
(c) the conditional distribution function $F_x(a) = \int_{-\infty}^{a} p_x(y)\,dy$, where $p_x(y) = p(x, y)/p(x)$, is continuous in $(x, a)$ and strictly monotonic in $a$ for $a$ such that $0 < F_x(a) < 1$.

Condition 1a ensures that the moments of the distribution exist; 1b requires that the $x$-values fill the whole set $X$; and 1c guarantees that the quantiles of the distribution function $F_x(\cdot)$ and the conditional mean $m(x) = \int_{Y_x} y\,p(x, y)\,dy$ are continuous functions of $x$. Now we can prove the following statement.

THEOREM 1. Let the conditions of Assumption 1 be satisfied. Consider the risk minimization problem

$$R(f) = \int_Z \max\{\mu(f(x) - y),\ \nu(y - f(x))\}\,p(z)\,dz \to \inf_f \qquad (3)$$

on the class of functions that are continuous and integrable with respect to the measure with density $p(x)$ on the set $X$. Then, for each given $x$, the minimum in (3) is attained on the quantile $q_\alpha(x)$ of level $\alpha = \nu/(\mu + \nu)$, $\mu, \nu > 0$, and this minimum is unique.

Proof. Let us consider the conditional risk function

$$\varphi(x, a) = E_{y|x}\max\{\mu(a - y),\ \nu(y - a)\} = \int_{Y_x}\max\{\mu(a - y),\ \nu(y - a)\}\,dF_x(y). \qquad (4)$$

The unconditional minimum of the function $\varphi(x, a)$ with respect to $a$ is known to be attained at the conditional $\alpha$-quantile of the distribution $P(dz)$ [15, 22, 23, 36, 37, 38]. The proof shows that, under the assumptions made, this quantile is a continuous function of $x$ and the unique continuous minimum of functional (3). By virtue of Assumption 1a, the function $\varphi(x, a)$ is defined for all $a \in R$ and $x \in X$. Note that $\varphi(x, a)$ is uniformly continuous (even uniformly Lipschitz) in $a$ and, by virtue of Assumptions 1a and 1c, is continuous in $x$ (by Helly's theorem on passage to the limit under the Lebesgue–Stieltjes integral sign [32]); hence, it is continuous in $(x, a) \in X \times R$. For any measurable function $f$, we have

$$R(f) = E_{x,y}\max\{\mu(f(x) - y),\ \nu(y - f(x))\} = E_x E_{y|x}\max\{\mu(f(x) - y),\ \nu(y - f(x))\} = E_x\varphi(x, f(x)) \ge E_x\inf_a\varphi(x, a).$$

Following [36, 37], let us represent the conditional risk function as

$$\varphi(x, a) = (\mu + \nu)\int_{-\infty}^{a} F_x(y)\,dy - \nu a + \nu\int_{-\infty}^{+\infty} y\,dF_x(y).$$

By virtue of Assumption 1c, it follows from this representation that the function $\varphi(x, a)$ is differentiable with respect to $a$, with derivative $\varphi_a(x, a) = (\mu + \nu)F_x(a) - \nu$. Since $\mu, \nu > 0$ by assumption, the quantile $q(x)$ of level $\alpha = \nu/(\mu + \nu)$ of the distribution function $F_x(a)$ is the unique root (with respect to the variable $a$) of the equation $\varphi_a(x, a) = 0$. Note also that, by virtue of Assumption 1c, the function $\varphi_a(x, a) = (\mu + \nu)F_x(a) - \nu$ is continuous in $(x, a)$ and strictly monotonic in $a$; hence, by the classical implicit-function theorem, the root $q(x)$ of the equation $\varphi_a(x, a) = 0$ is a continuous function of $x$. Since the derivative $\varphi_a(x, a)$ is nonpositive for $a < q(x)$ and positive for $a > q(x)$, and $F_x(\cdot)$ is strictly monotonic, the function $\varphi(x, \cdot)$ attains its minimum at the point $q(x)$ for each fixed $x$. Thus, for any measurable function $f$,

$$\inf_f R(f) \ge E_x\inf_a\varphi(x, a) = E_x\varphi(x, q(x)) = E_{x,y}\max\{\mu(q(x) - y),\ \nu(y - q(x))\} = V(\mu, \nu),$$

and the infimum is attained for $f(x) = q(x)$. However, whether the minimizing function is unique is still to be shown. Since the function $\varphi(x, a)$ is continuous in $(x, a)$, the superpositions $\varphi(x, q(x))$ and $\Delta(x) = \varphi(x, f(x))$ are continuous in $x$ for continuous $f(x)$. Since $q(x)$ is optimal, we have $\varphi(x, f(x)) \ge \varphi(x, q(x))$ for any $x \in X$. Now assume that $f(x') \ne q(x')$ at some point $x' \in X$. Then, since the minimum $q(x)$ of the function $\varphi(x, a)$ with respect to $a$ is unique, we have $\varphi(x', f(x')) > \varphi(x', q(x'))$. Due to the continuity of $\Delta(x)$ and $\varphi(x, q(x))$, this inequality holds not only at the point $x'$ but also in some neighborhood of the point $x'$, and since the density $p(x) = \int_{Y_x} p(x, y)\,dy$ is nondegenerate, we obtain

$$R(f) = E_x\varphi(x, f(x)) > E_x\varphi(x, q(x)) = R(q).$$

This proves that $q(x)$ is the unique continuous minimum of the functional $R(\cdot)$.

COROLLARY 1. Let the conditions of Assumption 1 be satisfied. Consider the risk minimization problem

$$R(f) = E_{x,y}|f(x) - y| \to \inf_f \qquad (5)$$

on the set of functions continuous on $X$. Then, for each given $x$, the minimum is attained on the median $m(x) = q_{1/2}(x)$, and this solution is unique.

A formulation of the quantile estimation problem, more general than, for example, the estimation of a median or a mean, may be useful because, along with the average tendency, it is also of interest to find a significance domain into which predicted values fall with prescribed probability. In other words, the researcher also wants to estimate the errors $\Delta(x) = E_{y|x}|y - m(x)|$, which are components of the expected risk

$$R(m) = E_{x,y}|y - m(x)| = E_x E_{y|x}|y - m(x)| = E_x\Delta(x) = \int_X\Delta(x)\,p(x)\,dx.$$

To this end, a regression $\Delta(x)$ is set up based on the observations $(x_i, \eta_i)$, where $\eta_i = |y_i - f(x_i)|$. Another way of describing errors is to construct (sequentially or simultaneously) a series of conditional quantile functions $q_\alpha(x)$ for levels $\alpha$ varying from 0 to 1, the median being associated with the level 0.5; this approach is discussed in [15, 22]. Thus, estimating quantiles for different confidence levels makes it possible to form statistics such as a median and a confidence interval, e.g., the 5% quantile on the left and the 95% quantile on the right, which together give a 90% confidence domain for the forecast when the forecast is based, say, on the median. For levels close to zero or unity, estimating quantiles reduces to estimating the upper and lower bounds of the distribution.

Note that the risk functionals (3) and (5) penalize all errors, both large and small. Statistical learning methods and the SVM for regression and classification often do not penalize small deviations and solve a perturbed risk minimization problem, i.e., $\varepsilon$-insensitive loss functions such as

$$c_\varepsilon(z, f(x)) = \max\{0,\ \mu(f(x) - y) - \varepsilon,\ \nu(y - f(x)) - \varepsilon\} \qquad (6)$$

and the associated $\varepsilon$-insensitive risk functional $R_\varepsilon(f) = E_{x,y}\,c_\varepsilon(z, f(x))$ are used. Obviously, $R_\varepsilon(f)$ uniformly converges to $R(f)$ as $\varepsilon \to 0$; however, this is still insufficient to prove that the minima $f_\varepsilon \in \arg\min_f R_\varepsilon(f)$ uniformly converge to $q = \arg\min_f R(f)$, because the risk, being an integral, is insensitive to variations of the integrand on an arbitrarily small domain. Indeed, for such perturbations it is only possible to establish convergence, with respect to the measure $P_x$, to this quantile as $\varepsilon \to 0$.

THEOREM 2 (identifying a quantile by means of $\varepsilon$-insensitive solutions as $\varepsilon \to 0$). Let $f_\varepsilon$ be any minimum of $R_\varepsilon(\cdot)$ on the set of measurable functions defined on a closed set $X \subseteq R^n$. Then, under the conditions of Assumption 1, $f_\varepsilon$ converges to the quantile $q(x)$ with respect to the measure $P_x$ generated by the density $p(x) = \int_{Y_x} p(x, y)\,dy$, i.e., for any $\delta > 0$ we have $P_x\{|q(x) - f_\varepsilon(x)| \ge \delta\} \to 0$ as $\varepsilon \to 0$.
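For reference, the quantile loss (2) and its ε-insensitive version (6) are one-liners. The sketch below is an illustration of ours (not from the original paper); the final lines numerically confirm the claim of Theorem 1 that the constant minimizer of the expected loss is the ν/(μ+ν)-quantile.

```python
import numpy as np

def quantile_loss(y, f, mu, nu):
    # c(z, f) = max{mu*(f - y), nu*(y - f)}, cf. (2); mu = nu = 1 gives |y - f|
    return np.maximum(mu * (f - y), nu * (y - f))

def eps_insensitive_quantile_loss(y, f, mu, nu, eps):
    # c_eps(z, f) = max{0, mu*(f - y) - eps, nu*(y - f) - eps}, cf. (6)
    return np.maximum(0.0, np.maximum(mu * (f - y) - eps, nu * (y - f) - eps))

# the minimizer of E c(z, f) over constants is the nu/(mu+nu)-quantile of y
y = np.random.default_rng(1).standard_normal(10000)
grid = np.linspace(-3, 3, 601)
risks = [quantile_loss(y, a, mu=1.0, nu=3.0).mean() for a in grid]
print(grid[int(np.argmin(risks))], np.quantile(y, 3.0 / 4.0))  # both near the 0.75-quantile
```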

Proof. Obviously, $c(z, f(x)) - \varepsilon \le c_\varepsilon(z, f(x)) \le c(z, f(x))$; therefore, $R(f) - \varepsilon \le R_\varepsilon(f) \le R(f)$ uniformly with respect to all measurable functions $f$. Hence, $R(q) - \varepsilon \le R(f_\varepsilon) - \varepsilon \le R_\varepsilon(f_\varepsilon) \le R_\varepsilon(q) \le R(q)$, i.e., $R(q) \le R(f_\varepsilon) \le R(q) + \varepsilon$ and $\lim_{\varepsilon\to 0} R(f_\varepsilon) = R(q)$.

We argue by contradiction. Assume that there exist constants $\delta > 0$ and $\gamma > 0$ and a sequence $\varepsilon_m \to 0$ such that $P_x\{|q(x) - f_{\varepsilon_m}(x)| \ge \delta\} \ge \gamma > 0$. The probability measure $P_x$ on the closed set $X \subseteq R^n$ is tight; therefore, there exists a compact subset $X' \subseteq X$ such that $P_x\{X'\} \ge 1 - \gamma/2$. Then $P_x\{x \in X' : |q(x) - f_{\varepsilon_m}(x)| \ge \delta\} \ge \gamma/2 > 0$. Consider the function $\varphi(x, a)$ defined by (4). As was shown in the proof of Theorem 1, under Assumption 1 the function $\varphi(x, a)$ is continuous in $(x, a)$ and has a unique minimum $q(x)$ in $a$, which is a continuous function of the variable $x$. Therefore, the superposition $\Delta(x) = \varphi(x, q(x))$ is also continuous. Denote by $\bar y(x) = \int_{Y_x} y\,dF_x(y)$ the root-mean-square regression function. Under Assumptions 1a and 1c, the function $\bar y(x)$ is continuous in $x \in X$ by virtue of Helly's theorem (passage to the limit under the Lebesgue–Stieltjes integral sign) [32]. By Jensen's inequality, $\varphi(x, a) \ge \max\{\mu(a - \bar y(x)),\ \nu(\bar y(x) - a)\}$. Denote $y_{\min} = \min_{x\in X'}\bar y(x)$ and $y_{\max} = \max_{x\in X'}\bar y(x)$. Obviously, $\varphi(x, a) \ge \max\{\mu(a - y_{\max}),\ \nu(y_{\min} - a)\}$ for all $x \in X'$ and $a \in R^1$. Then the continuous function $\varphi(x, a)$ attains its minimum on the set $\{(x, a): x \in X',\ a \in R^1,\ |q(x) - a| \ge \delta\}$, and since the minimum $q(x) = \arg\min_{a\in R^1}\varphi(x, a)$ is unique,

$$\inf\{\varphi(x, a) - \varphi(x, q(x)) : x \in X',\ a \in R^1,\ |q(x) - a| \ge \delta\} \ge \nu > 0.$$

Hence, $\varphi(x, f_{\varepsilon_m}(x)) \ge \varphi(x, q(x))$ for all $x \in X'$ and $\varphi(x, f_{\varepsilon_m}(x)) \ge \varphi(x, q(x)) + \nu$ on a subset of positive measure $\gamma/2$; therefore, $R(f_{\varepsilon_m}) = E_x\varphi(x, f_{\varepsilon_m}(x)) \ge E_x\varphi(x, q(x)) + \nu\gamma/2$. However, this contradicts the property $\lim_{m\to\infty} R(f_{\varepsilon_m}) = R(q)$. The theorem is proved.

The theorem thus allows that, even for vanishing perturbations ($\varepsilon \to 0$), the function $f_\varepsilon$ may deviate from the exact minimum $f^*$ on an infinite set of points, and not by a small amount, though only on a set of small measure. Thus, $f_\varepsilon$ may be unstable as $\varepsilon$ varies and need not converge to $f^*$ uniformly.

Classification problems are closely related to the optimization of risk functionals. By a classification problem we mean constructing a classifier that minimizes the Bayesian risk (the misclassification probability). For any measurable function $f(x)$, the binary classification rule is determined by the formula

$$I_{1/2}(f(x)) = \begin{cases} 1, & f(x) > 1/2, \\ 0 & \text{otherwise}. \end{cases} \qquad (7)$$

The quality of the classification rule $I_{1/2}(f(\cdot))$ is measured by the Bayesian risk, i.e., by the misclassification probability $P\{I_{1/2}(f(x)) \ne y\}$, where $y \in \{0, 1\}$. Note that the Bayesian risk attains its minimum ($P^*$) on the decision rule $g_h$ specified by the conditional probability function $h(x) = P\{y = 1 \mid x\}$ [39, p. 10], which is unknown. Thus, a possible way of constructing optimal classifiers is to approximate the conditional probability $h(x) = P\{y = 1 \mid x\}$. For example, the conditional probability $h(x)$ is approximated in [20, 21] by the Bayesian method under the (strong) assumption that the components of the random input vector of observations are independent. According to [39, p. 16] and the references therein,

$$P\{I_{1/2}(f(x)) \ne y\} - \min_{f\in F} P\{I_{1/2}(f(x)) \ne y\} \le 2E|f(x) - h(x)|$$

and $h(x) = \arg\min_f E(y - f(x))^2$. If $\tilde f(x)$ is an approximate solution of the quadratic risk minimization problem $R(f) = E(f(x) - y)^2$, then the classification rule is determined by (7). The following estimate [39, 20] is true:

$$P\{I_{1/2}(\tilde f(x)) \ne y\} - \min_{f\in F} P\{I_{1/2}(f(x)) \ne y\} \le 2\bigl(R(\tilde f) - \min_{f\in F} R(f)\bigr), \qquad (8)$$

where the minima are taken over the set of measurable functions.
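A direct reading of rule (7): any estimate f of the conditional probability (or of a regression with values in [0, 1]) is thresholded at 1/2. The sketch below is an illustration of ours; the numerical values plugged in are placeholders.

```python
import numpy as np

def plug_in_classifier(f_values):
    # Rule (7): I_{1/2}(f(x)) = 1 if f(x) > 1/2, else 0
    return (np.asarray(f_values) > 0.5).astype(int)

# Example: f_hat approximates h(x) = P{y = 1 | x}; labels follow by thresholding
f_hat = np.array([0.1, 0.45, 0.55, 0.9])
print(plug_in_classifier(f_hat))  # [0 0 1 1]
```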

Thus, by virtue of (8), minimization of the functional $R(f) = E|f(x) - y|$ over the set of measurable functions $F$ automatically minimizes the Bayesian risk functional. As follows from Corollary 1, the functional $R(f) = E|f(x) - y|$ is minimized on the conditional medians of the distribution $P(\cdot)$. If there is reason to believe that the conditional medians of the distribution $P(\cdot)$ belong to a class of functions such as a reproducing Hilbert space $H_k$, we may put $F = H_k$ in (8).

3. REGULARIZATION OF RISK OPTIMIZATION PROBLEMS IN A REPRODUCING HILBERT SPACE

Optimization of the integral risk functional (1) is complicated because: (i) the problem may be ill-posed, i.e., may not have a unique solution, and its approximate solutions may be numerically unstable; (ii) the problem may be infinite-dimensional and thus has to be approximated; (iii) estimating values of the functional involves repeated computation of integrals.

Let us consider the first difficulty, which is usually resolved by the Tikhonov regularization method. The regularization method is detailed in [40] for the solution of operator equations, in [41] for infinite-dimensional optimization problems, and in [1, 5] for statistical estimation. It is necessary to adapt the theory of the regularization method to optimization problems for convex integral risk functionals in an RHS. The second and third difficulties will be considered in detail in the next section, where the method of empirical approximations is applied to approximate the problem and the integrals.

Along with (1), let us consider a regularized problem

$$R_\lambda(f) = R(f) + \lambda\Omega(\|f\|) = E_{x,y}\,c(z, f(x)) + \lambda\Omega(\|f\|) \to \inf_{f\in F}, \qquad (9)$$

where $R(f) = E_{x,y}\,c(z, f(x))$, $c(z, f): Z\times R^1 \to R^1$, $\lambda > 0$ is the regularization parameter, $\Omega(\cdot)$ is a nonnegative function of one variable, and $F$ is a set in a Hilbert space $H$ with norm $\|\cdot\|$. Along with (9), let us consider a perturbed problem

$$\tilde R_\lambda(f) = \tilde R(f) + \lambda\Omega(\|f\|) \to \inf_{f\in F}, \qquad (10)$$

where $\tilde R(f)$ is an approximation or perturbation of the functional $R(f)$. Let us derive conditions under which solutions $f_\lambda$ of problem (9) and solutions $\tilde f_\lambda$ of problem (10) converge to a solution $f^*$ of the original problem (1) as the regularization parameter tends to zero, $\lambda \to 0$.

Assumption 2 (on the original, perturbed, and regularized problems):
(a) the functional $R(f)$ is bounded from below, convex, and continuous on $F \subseteq H$;
(b) $F$ is a closed convex set in a Hilbert space $H$;
(c) the set $F^* = \{f \in H: R(f) = \inf_{f\in F} R(f)\}$ is not empty;
(d) $\Omega(f) = \|f\|^2$;
(e) $\sup_{f\in F}|\tilde R(f) - R(f)| \le \delta_\lambda$, $\lim_{\lambda\to+0}\delta_\lambda/\lambda = 0$.

Assumptions 2 are general and are not related to specific features of learning problems. Let us recall one general result on the convergence of the family of perturbed problems (10).

THEOREM 3 [40, Ch. VIII, § 1; 41, Ch. 2, § 5, Theorem 2]. Under the conditions of Assumption 2, the solutions (if they exist) $\tilde f_\lambda$ of the perturbed regularized problem converge, as $\lambda \to +0$, to a normal (minimum-norm) solution $f^*$ of the original problem in the strong topology of the space $H$ (with respect to the norm $\|\cdot\|$).

Let us formulate the corresponding convergence result of the regularization method for the risk minimization problem (1).

Assumption 3 (on the risk minimization problem). The loss function $c(z, f(x))$ in (1) satisfies the following conditions:
(a) $c(z, a) \ge c_0 > -\infty$ for all $z \in Z$, $a \in R^1$;
(b) $c(z, a)$ is convex in $a$;
(c) $|c(z, a)| \le c_1(z) + c_2(z)\rho(|a|)$, $\int_Z c_1(z)\,p(z)\,dz < +\infty$, $\int_Z c_2(z)\,p(z)\,dz < +\infty$, where $\rho(\cdot)$ is a monotonically increasing function;
(d) $H(X) = H_k(X)$ is an RHS with a kernel bounded on the diagonal, $\sup_{x\in X}|k(x, x)| = K^2 < +\infty$.

THEOREM 4 (properties of the risk functional). Under Assumptions 3, the risk functional $R(f) = E_z c(z, f(x))$ defined on the reproducing Hilbert space $H_k(X)$ with a kernel bounded on the diagonal is convex and continuous with respect to the norm $\|\cdot\|_k$ of this space.

Proof. Assumptions 3a and 3b yield the lower boundedness and convexity of the risk functional. In an infinite-dimensional space, the convexity of a functional yields only its lower semicontinuity [42]. Under the above assumptions, let us prove its continuity using the Lebesgue theorem (on passing to the limit under the integral sign). Let $\lim_{s\to\infty}\|f_s - f\|_k = 0$, and hence $\lim_{s\to\infty}\|f_s\|_k = \|f\|_k$. Then $\{f_s\}$ converges uniformly to $f$. Indeed, for any $x \in X$, by the reproducing property of the kernel (see Definition 1b) and the Cauchy–Schwarz inequality, $|f_s(x) - f(x)| \le \sqrt{k(x, x)}\,\|f_s - f\|_k \le K\|f_s - f\|_k \to 0$, $s \to \infty$. Since the function $c(z, \cdot)$ is convex, it is continuous; therefore, $\lim_{s\to\infty} c(z, f_s(x)) = c(z, f(x))$ for any $z \in Z$. Assumptions 3c and 3d yield $|c(z, f_s(x))| \le c_1(z) + c_2(z)\rho(|f_s(x)|) \le c_1(z) + c_2(z)\rho(\sqrt{k(x, x)}\,\|f_s\|_k) \le c_1(z) + c_2(z)\rho(K\sup_s\|f_s\|_k) = U(z)$. By virtue of Assumption 3c, the majorant $U(z)$ is integrable. Now the continuity of the functional $R(f) = E_z c(z, f(x))$ follows from the Lebesgue theorem.

THEOREM 5 (uniform convergence of regularized solutions as $\lambda \to 0$). Under Assumptions 2b–2e and 3, the solutions (if they exist) $f_\lambda$ and $\tilde f_\lambda$ of the regularized problems (9), (10) with the risk functional $R(f) = E_z c(z, f(x))$ and $F \subseteq H_k(X)$ converge uniformly (as $\lambda \to 0$) to a normal solution $f^*$ (of minimum norm $\|\cdot\|_k$ within the set of solutions $F^*$ of the original problem (1)).

Proof. Under Assumptions 3, the risk functional $R(f) = E_z c(z, f(x))$ satisfies Condition 2a by virtue of Theorem 4. Then, by Theorem 3, the regularized solutions converge strongly (with respect to the norm) to the normal solution of the original problem, $\lim_{\lambda\to+0}\|\tilde f_\lambda - f^*\| = 0$; such convergence might, in principle, be nonuniform. However, in a reproducing Hilbert space $H_k(X)$ with a kernel bounded on the diagonal (Assumption 3d), strong convergence implies uniform convergence, $\lim_{\lambda\to+0}\sup_{x\in X}|\tilde f_\lambda(x) - f^*(x)| = 0$. The result for $f_\lambda$ follows from the one just proved with $\delta_\lambda = 0$. The theorem is proved.

Regularization makes solutions more stable. Note that instead of $\|f\|^2$, other regularizing functionals such as $\Omega(\|f\|) = \|f - f_0\|^2$ can be used, where $f_0$ is a reference function. The translation $\bar f = f - f_0$ reduces the problem to the homogeneous form (9), (10), and thus the convergence results proved above are applicable to it. Moreover, the reference points may vary from one computation to another (in passing from one $\lambda$ to another), as in sequential regularization (see, for example, [43]).

4. OPTIMIZATION OF REGULARIZED EMPIRICAL RISK FUNCTIONALS: SUPPORT VECTOR MACHINE

Let us consider possibilities for approximating the infinite-dimensional optimization problem for risk functionals and the integrals appearing in them with respect to the probability measure $P(\cdot)$. In practice, the distribution $P(\cdot)$ is usually not known completely, and there is a set of independent observations $\{z_i,\ i = 1, \ldots, m\}$ of a random vector variable $z$ with distribution $P(\cdot)$, which is called a training sample in the statistical learning theory. This allows approximating the unknown distribution $P(\cdot)$ by the empirical distribution $P_m(\cdot)$, and the risk functional $R(f) = Ec(z, f(x))$ by the empirical mean (empirical risk) $\tilde R_m(f) = \frac{1}{m}\sum_{i=1}^{m} c(z_i, f(x_i))$. This is how the problem of estimating integrals is solved.
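The empirical-mean approximation just described is an ordinary Monte Carlo average. The following sketch (our illustration; the loss and trial model are arbitrary placeholders) evaluates $\tilde R_m(f) = \frac{1}{m}\sum_i c(z_i, f(x_i))$ for a given loss and candidate function.

```python
import numpy as np

def empirical_risk(loss, f, x_obs, y_obs):
    # R_m(f) = (1/m) * sum_i c(z_i, f(x_i)) with z_i = (x_i, y_i)
    preds = np.array([f(x) for x in x_obs])
    return np.mean(loss(np.asarray(y_obs), preds))

# absolute-deviation loss (median regression) and a trial model f(x) = 0.5*x
abs_loss = lambda y, f: np.abs(y - f)
x_obs = np.linspace(0.0, 2.0, 50)
y_obs = 0.5 * x_obs + 0.05 * np.random.default_rng(2).standard_normal(50)
print(empirical_risk(abs_loss, lambda x: 0.5 * x, x_obs, y_obs))
```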

However, the minimization problem for the empirical risk $\tilde R_m(f)$ is still infinite-dimensional, since the set $F$ of feasible functions is infinite-dimensional, for example, a subset of a Hilbert or another linear space of functions $H$. Classical regression analysis assumes that $H$ is the linear span of a finite set of functions $\{\varphi_i(x),\ i = 1, \ldots, n\}$; the more such functions there are, the richer the class $H$ and the more possibilities there are to approximate the unknown function well. Nonparametric statistics uses kernel generating functions such as $\{k((x - x_i)/\theta)\}$, where $k$ is a fixed function, $\{x_i\}$ are arbitrary points of the set $X$, and $\theta$ is a numerical parameter. The flexibility is maximal for an infinite generating set of functions. However, there should be a compromise between the degree of approximation of the observation data (measured by the value of the empirical risk $\tilde R_m(f)$) and the complexity of the approximating model $f(x)$: because of a too fine adjustment of the model to the random training sample $\{z_i,\ i = 1, \ldots, m\}$ at hand, it may cease to generalize, i.e., may perform badly on other data sets. Thus, the model should provide a good approximation of the data and be as simple as possible. A measure of the complexity of $f$ may be, for example, the number of nonzero elements in the expansion of $f$ into a series of basis functions of the space, the sum of the absolute values of the expansion coefficients, or the sum of their squares. For an orthonormal basis, the sum of the squares of the expansion coefficients is the norm of the element $f$ of the Hilbert space $H$; therefore, a generalized measure of the complexity of the function $f$ may be the norm $\|f\|$ or a function $\Omega(\|f\|)$ of the norm. Thus, the problem of choosing a model is a two-criteria problem with the criteria $\tilde R_m(f)$ and $\|f\|$. These criteria can be combined into $\tilde R_m(f) + \lambda\Omega(\|f\|)$ with a coefficient $\lambda$, the choice of its value being the major problem.

Another, formally mathematical, view of the problem of estimating the function $f$ from experimental data $\{z_i,\ i = 1, \ldots, m\}$ is possible in the context of the theory of stochastic ill-posed problems. The original minimization problem for the risk functional may, generally speaking, be ill-posed, i.e., may have nonunique solutions and be unstable with respect to perturbations of the functional. The original risk functional $R(f)$ is replaced with a random approximation $\tilde R_m(f)$, i.e., its stochastic perturbation of the form $R(f) + \delta_m(f)$ is considered, where $\delta_m(f) = \tilde R_m(f) - R(f)$. To find approximate solutions, Tikhonov's regularization method in the functional (Hilbert) space $H$ is applied. The regularization method under deterministic perturbations is well understood [40, 41]. Regularization under random perturbations is poorly understood; some results are available in [1, 5] for linear operator equations (quadratic optimization problems) in a reflexive Banach space and in [44] for equipped Hilbert spaces with general-form random perturbations (see also the references therein). Let us consider regularization in an RHS for specific (empirical) random perturbations of the functional and for general convex (not only quadratic) risk functionals, which leads to a family of minimization problems for the regularized empirical risk

$$\tilde R_m(f) + \lambda\Omega_k(f) = \frac{1}{m}\sum_{i=1}^{m} c(z_i, f(x_i)) + \lambda\|f\|_k^2 \to \inf_{f\in H_k}, \qquad (11)$$

where $H_k$ is an RHS generated by the kernel $k$.
It appears that the regularized problem (11) in an RHS reduces to finite-dimensional optimization and, for piecewise-linear loss functions, to quadratic optimization with linear constraints. By virtue of the so-called solution-representation (representer) theorem [2, Theorem 4.2, p. 90; 33], a solution of problem (11) in an RHS exists and can be represented as

$$f_m^\lambda(x) = \sum_{i=1}^{m}\alpha_i k(x_i, x), \qquad (12)$$

where $\alpha^m = \{\alpha_i\}$ is a set of real numbers. Substituting expression (12) into (11) and using the reproducing property of the kernel (see Definition 1b), we arrive at the following finite-dimensional optimization problem:

$$R_m(\alpha^m) = \frac{1}{m}\sum_{i=1}^{m} c\Bigl(z_i,\ \sum_{j=1}^{m}\alpha_j k(x_i, x_j)\Bigr) + \lambda\sum_{i,j=1}^{m}\alpha_i\alpha_j k(x_i, x_j) \to \min_{\alpha^m}. \qquad (13)$$
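Written with the Gram matrix $K = \{k(x_i, x_j)\}$, objective (13) is a few lines of NumPy. The sketch below (our illustration; data and kernel are placeholders) evaluates it for an arbitrary loss; for the quadratic loss the minimizer can even be taken in closed form, $\alpha = (K + \lambda m I)^{-1}y$, which is the kernel ridge regression special case.

```python
import numpy as np

def gram_matrix(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def objective_13(alpha, K, y, lam, loss):
    # (1/m) * sum_i c(z_i, (K alpha)_i) + lam * alpha^T K alpha, cf. (13)
    preds = K @ alpha
    return np.mean(loss(y, preds)) + lam * alpha @ K @ alpha

def kernel_ridge_alpha(K, y, lam):
    # closed-form minimizer of (13) for the quadratic loss c(z, f) = (y - f)^2
    m = len(y)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

rbf = lambda a, b: np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2))
X = [[0.0], [0.5], [1.0], [1.5]]
y = np.array([0.0, 0.45, 0.85, 1.0])
K = gram_matrix(X, rbf)
alpha = kernel_ridge_alpha(K, y, lam=0.1)
print(objective_13(alpha, K, y, 0.1, lambda y, f: (y - f) ** 2))
```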

If the loss function $c(z, f)$ is continuous and nonnegative and the matrix $\{k(x_i, x_j)\}$ is positive definite, the problem has a unique solution $f_m^\lambda$. Moreover, for the quantile loss functions (2) and (6), problem (13) is convex and nonsmooth; however, using additional variables $\xi^m = \{\xi_i\}_{i=1}^m$ (the regression errors on the observations), it can be reduced to a quadratic programming problem with linear constraints:

$$R_m(\alpha^m, \xi^m) = \frac{1}{m}\sum_{i=1}^{m}\xi_i + \lambda\sum_{i,j=1}^{m}\alpha_i\alpha_j k(x_i, x_j) \to \min_{\xi^m\ge 0,\ \alpha^m},$$
$$\xi_i \ge \nu y_i - \nu\sum_{j=1}^{m}\alpha_j k(x_j, x_i) - \varepsilon, \qquad (14)$$
$$\xi_i \ge \mu\sum_{j=1}^{m}\alpha_j k(x_j, x_i) - \mu y_i - \varepsilon, \quad i = 1, \ldots, m.$$
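Problem (14) can be handed directly to any convex QP solver. The sketch below is an illustration of ours, assuming the third-party cvxpy package is available; the function name and the small jitter added for the Cholesky factorization are our choices, not part of the paper.

```python
import numpy as np
import cvxpy as cp

def svm_quantile_qp(K, y, lam, mu, nu, eps):
    # Problem (14): minimize (1/m) sum_i xi_i + lam * alpha^T K alpha
    # subject to xi_i >= nu*y_i - nu*(K alpha)_i - eps,
    #            xi_i >= mu*(K alpha)_i - mu*y_i - eps,   xi_i >= 0.
    m = len(y)
    alpha = cp.Variable(m)
    xi = cp.Variable(m, nonneg=True)
    pred = K @ alpha
    # alpha^T K alpha written as ||L^T alpha||^2 with K ~ L L^T (jitter for safety)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(m))
    objective = cp.Minimize(cp.sum(xi) / m + lam * cp.sum_squares(L.T @ alpha))
    constraints = [xi >= nu * (y - pred) - eps, xi >= mu * (pred - y) - eps]
    cp.Problem(objective, constraints).solve()
    return alpha.value

# K is the Gram matrix {k(x_i, x_j)} of the training inputs (see Sec. 1)
```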

The studies [2, 28] consider modifications of problem (14) with a variable insensitivity parameter $\varepsilon \ge 0$. In this case, the objective function is supplemented with a penalty $q(\varepsilon) \ge 0$, $q(0) = 0$, monotonically increasing in $\varepsilon \ge 0$.

Some applications require the unknown regression function to possess theoretical properties known beforehand (monotonicity, convexity or concavity, invariance with respect to some transformations, etc.). Let the set $G$ of such functions be representable by linear combinations $g(x) = \sum_{h=1}^{r}\gamma_h\phi_h(x)$ of basis functions $\{\phi_h(x)\}$ with coefficients $\gamma^r = (\gamma_1, \ldots, \gamma_r) \in \Gamma$. Then the unknown function $f(x)$ should, on the one hand, minimize the empirical risk and, on the other hand, be closest to the set $G$. The associated optimization problem takes the following form (see, for example, [2, 9, 15]):

$$\frac{1}{m}\sum_{i=1}^{m} c(z_i, f(x_i)) + \lambda\Bigl\| f - \sum_{h=1}^{r}\gamma_h\phi_h\Bigr\|_k^2 \to \inf_{f\in H_k,\ \gamma^r\in\Gamma}, \qquad (15)$$

where $\lambda$ is a weight coefficient. In this case, a solution $f_m^\lambda(x)$ of problem (15) is representable as [2]

$$f_m^\lambda(x) = \sum_{i=1}^{m}\alpha_i k(x_i, x) + \sum_{h=1}^{r}\gamma_h\phi_h(x). \qquad (16)$$

Substituting (16) into (15) again leads to a finite-dimensional optimization problem with respect to the unknown expansion coefficients $\{\alpha_i\}$ and $\{\gamma_h\}$, which, in turn, reduces to quadratic optimization for piecewise-linear loss functions.

5. CONVERGENCE OF THE SVM

Let us consider the asymptotic properties (as $m \to \infty$ and $\lambda \to 0$) of the empirical kernel estimates $f_m^\lambda(x)$, which are solutions of the minimization problem (11) for the regularized empirical risk. The fundamental studies [1, 4, 5] analyze the convergence $R(f_m^\lambda) \to \inf_{f\in F} R(f)$ under the assumption that the class of functions $F$ is of bounded capacity. This approach is based on establishing conditions for the convergence, uniform in $f \in F$, of the empirical approximations $\tilde R_m(f) = \frac{1}{m}\sum_{i=1}^{m} c(z_i, f(x_i))$ of the risk functional to its true value $R(f) = Ec(z, f(x))$. However, the appropriate class of functions is not always of finite capacity (finite dimension in the Vapnik–Chervonenkis sense). Moreover, the condition of uniform convergence of the approximations $\tilde R_m(f)$ to $R(f)$ is not necessary for the convergence of the minima [45]. Therefore, we follow another approach [14], based on the stability of the estimates $f_m^\lambda(x)$ with respect to individual observations. A similar approach was used in [2, Sec. 12.1; 15, 16, 19], where convergence in probability of the estimates was studied. In contrast to these studies, we impose conditions on $\lambda = \lambda(m)$ under which the estimates $f_m^{\lambda(m)}(x)$ converge (uniformly in $x \in X$, with unit probability) to the minimizer $f^*$ of the risk functional $R(f)$ with the minimum norm $\|f^*\|_k$. Let us make additional assumptions on the loss function $c(\cdot,\cdot)$.

Assumption 4 (additional properties of the loss function):
(a) the loss function $c(z, \cdot)$ is convex and Lipschitz in the second argument with constant $L$ uniformly in $z \in Z$;
(b) $\sup_{z\in Z} c(z, 0) = C < +\infty$.
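As a quick check of Assumption 4 against the losses used earlier (a verification of ours, not taken from the paper): for the quantile loss (2),

$$|c(z, a) - c(z, a')| = \bigl|\max\{\mu(a - y), \nu(y - a)\} - \max\{\mu(a' - y), \nu(y - a')\}\bigr| \le \max\{\mu, \nu\}\,|a - a'|,$$

so condition 4a holds with $L = \max\{\mu, \nu\}$; moreover, $c(z, 0) = \max\{-\mu y, \nu y\} \le \max\{\mu, \nu\}\,|y|$, so condition 4b holds with $C = \max\{\mu, \nu\}\sup_{z\in Z}|y|$ whenever the outputs $y$ are bounded on $Z$. The same constants serve for the $\varepsilon$-insensitive version (6), since taking the maximum with $0$ neither increases the Lipschitz constant nor the value at $a = 0$.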


The theorem below estimates the nonoptimality (on average) of the SVM-estimates $f_m^\lambda$ of the unknown functional dependence as a function of $m$ and $\lambda$. These estimates are random variables with values in the space of continuous functions and are defined on a countable product of the initial probability space $(X, \mathcal{B}_X, P)$.

THEOREM 6 [46]. Let a solution of problem (1) exist and let the functions $f_m^\lambda$ be solutions of problem (11). Then, under Assumptions 3 and 4, the following estimate holds for any $\lambda > 0$ and $m$:

$$E_m R(f_m^\lambda) \le R(f^*) + 2\,\frac{2C + L\|f^*\|_\infty}{\sqrt{m}} + \frac{LK\bigl(5LK + 2\sqrt{\lambda C}\bigr)}{\lambda\sqrt{m}} + \lambda\|f^*\|_k^2, \qquad (17)$$

where the expectation $E_m$ is taken over all samples $\{\omega_1, \ldots, \omega_m\}$ of independent identically distributed observations and $f^*$ is any solution of problem (1).

The theorem guarantees that $R(f_m^\lambda)$ converges on average to the minimum value $R(f^*)$ if $\lambda(m) \to 0$ and $1/(\sqrt{m}\,\lambda(m)) \to 0$ as $m \to \infty$. Let us now establish strong-consistency conditions for the estimates $f_m^\lambda(x)$, i.e., conditions for their convergence, uniformly in $x \in X$, to the minimizer $f^*(x)$ of the risk functional $R$ as $\lambda = \lambda(m) \to 0$ and $m \to \infty$.

Definition 3 [40]. A solution $f^* \in F^*$ of problem (1) is called normal if it has minimum norm, $\|f^*\|_k = \min_{f\in F^*}\|f\|_k$.

The two theorems below establish sufficient conditions under which the SVM-estimates $f_m^{\lambda(m)}$ converge uniformly, with unit probability, to a normal solution $f^* \in F^*$ of problem (1), i.e., $\lim_{m\to\infty}\sup_{x\in X}|f_m^{\lambda(m)}(x) - f^*(x)| = 0$.

THEOREM 7 (sufficient strong-consistency conditions for SVM-estimates) [46]. Let a solution of problem (1) exist and let Assumptions 3 and 4 hold. Consider a family of solutions $f_m^{\lambda(m)}$ of problem (11) with $\lim_{m\to\infty}\lambda(m) = 0$. If $\lim_{m\to\infty} m\lambda^4(m)/\ln m = \infty$, then $R(f_m^{\lambda(m)}) \to R(f^*)$ and the solutions $f_m^{\lambda(m)}$ of problem (11) converge, uniformly in $x \in X$, to a normal solution $f^*$ of problem (1) with unit probability as $m \to +\infty$.

THEOREM 8 (estimate of the rate of convergence of SVM-estimates) [46]. Let, under the conditions of the previous theorem, $\lambda(m) = \Lambda(\ln m)^\varepsilon/m^{1/4}$, $\Lambda > 0$, $1/4 < \varepsilon \le 1$. Then the statements of Theorem 7 hold and the following estimate is true:

$$E_m R(f_m^{\lambda(m)}) - R(f^*) \le 2\,\frac{2C + L\|f^*\|_\infty}{\sqrt{m}} + \frac{LK\bigl(5LK + 2\sqrt{2\Lambda C}\bigr)}{\Lambda(\ln m)^\varepsilon\,\sqrt[4]{m}} + \frac{\|f^*\|_k^2\,\Lambda(\ln m)^\varepsilon}{\sqrt[4]{m}}. \qquad (18)$$
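The admissible schedules in Theorems 7 and 8 are easy to inspect numerically. The sketch below (our illustration, with Λ = 1 and ε = 1/2 chosen arbitrarily) evaluates λ(m) = Λ(ln m)^ε/m^{1/4} and the Theorem 7 quantity mλ⁴(m)/ln m, which should grow without bound.

```python
import numpy as np

Lam, eps = 1.0, 0.5  # assumed values of Lambda and epsilon, with 1/4 < eps <= 1
for m in [10**2, 10**4, 10**6, 10**8]:
    lam = Lam * np.log(m) ** eps / m ** 0.25   # lambda(m) from Theorem 8
    crit = m * lam ** 4 / np.log(m)            # Theorem 7 requires this -> infinity
    print(m, lam, crit)
```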

CONCLUSIONS

The paper has established the rate of convergence of the Vapnik support vector machine (SVM): it has been shown that the values of the risk functional on the SVM-estimates converge on average to the minimum proportionally to $1/\sqrt[4]{m}$, where $m$ is the number of elements in the training sample. It has also been proved that the SVM is asymptotically stable, i.e., converges, with unit probability, to a (normal) solution of the problem, provided the regularization parameter decreases proportionally to $(\ln m)/\sqrt[4]{m}$.

REFERENCES

1. V. N. Vapnik, Statistical Learning Theory, Wiley, New York (1998).
2. B. Schoelkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA (2002).
3. M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, Potential Function Method in Machine Learning Theory [in Russian], Nauka, Moscow (1970).
4. V. N. Vapnik and A. Ya. Chervonenkis, Pattern Recognition Theory: Statistical Problems of Learning [in Russian], Nauka, Moscow (1974).
5. V. N. Vapnik, Estimation of Dependences Based on Empirical Data [in Russian], Nauka, Moscow (1979).
6. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge Univ. Press, Cambridge, UK (2000).
7. T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Adv. Comput. Math., 13, No. 1, 1–50 (2000).
8. F. Cucker and S. Smale, "On the mathematical foundations of learning," Bull. Amer. Math. Soc. (N.S.), 39, No. 1, 1–49 (2002).
9. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York–Berlin–Heidelberg (2001).
10. T. Poggio and S. Smale, "The mathematics of learning: Dealing with data," Notices Amer. Math. Soc., 50, No. 5, 537–544 (2003).
11. M. I. Schlesinger and V. Hlaváč, Ten Lectures on Statistical and Structural Pattern Recognition, Kluwer Acad. Publ. (2004).
12. N. R. Draper and H. Smith, Applied Regression Analysis, 3rd ed., Wiley, Chichester (1998).
13. L. Györfi, M. Kohler, A. Krzyżak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, New York–Berlin–Heidelberg (2002).
14. O. Bousquet and A. Elisseeff, "Stability and generalization," J. Mach. Learn. Res., No. 2, 499–526 (2002).
15. I. Takeuchi, Q. V. Le, T. Sears, and A. J. Smola, "Nonparametric quantile estimation," J. Mach. Learn. Res., No. 7, 1231–1264 (2006).
16. E. De Vito, A. Caponnetto, and L. Rosasco, "Model selection for regularized least-squares algorithm in learning theory," Found. Comput. Math., 5, No. 1, 59–85 (2005).
17. F. Cucker and S. Smale, "Best choices for regularization parameters in learning theory: On the bias-variance problem," Found. Comput. Math., 2, No. 4, 413–428 (2002).
18. S. Smale and D. X. Zhou, "Shannon sampling and function reconstruction from point values," Bull. Amer. Math. Soc. (N.S.), 41, No. 3, 279–305 (2004).
19. S. Smale and D. X. Zhou, "Shannon sampling II: Connections to learning theory," Appl. Comput. Harmon. Anal., 19, No. 3, 285–302 (2005).
20. A. M. Gupal, S. V. Pashko, and I. V. Sergienko, "Efficiency of Bayesian classification procedure," Cybern. Syst. Analysis, 31, No. 4, 543–554 (1995).
21. I. V. Sergienko and A. M. Gupal, "Optimal pattern recognition procedures and their application," Cybern. Syst. Analysis, 43, No. 6, 799–809 (2007).
22. R. Koenker and G. W. Bassett, "Regression quantiles," Econometrica, No. 46, 33–50 (1978).
23. R. Koenker, Quantile Regression, Cambridge Univ. Press, Cambridge–New York (2005).
24. C. McDiarmid, "On the method of bounded differences," in: Surveys in Combinatorics, Cambridge Univ. Press, Cambridge (1989), pp. 148–188.
25. S. Smale and Y. Yao, "Online learning algorithms," Found. Comput. Math., 6, No. 2, 145–170 (2006).
26. P. S. Knopov and E. J. Kasitskaya, Empirical Estimates in Stochastic Optimization and Identification, Kluwer Acad. Publ., Dordrecht–Boston–London (2002).
27. Yu. M. Ermoliev and P. S. Knopov, "Method of empirical means in stochastic programming problems," Cybern. Syst. Analysis, 42, No. 6, 773–785 (2006).
28. M. A. Keyzer, "Rule-based and support vector (SV-) regression/classification algorithms for joint processing of census, map, survey and district data," Working Paper WP-05-01, Centre for World Food Studies, Amsterdam (http://www.sow.vu.nl/pdf/wp05.01.pdf) (2005).
29. V. I. Norkin and M. A. Keyzer, "On convergence of kernel learning estimators," in: L. Sakalauskas, O. W. Weber, and E. K. Zavadskas (eds.), Proc. 20th EURO Mini Conf. on Continuous Optimization and Knowledge-Based Technologies (EUROPT-2008), Inst. Math. and Inform., Vilnius (2008), pp. 306–310.
30. N. Aronszajn, "Theory of reproducing kernels," Matematika, 7, No. 2, 67–130 (1963).
31. I. M. Gel'fand and N. Ya. Vilenkin, Generalized Functions, Issue 4: Some Applications of Harmonic Analysis. Equipped Hilbert Spaces [in Russian], Fizmatgiz, Moscow (1961).
32. A. N. Kolmogorov and S. V. Fomin, Elements of the Theory of Functions and Functional Analysis [in Russian], Nauka, Moscow (1981).
33. G. Wahba, Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia, PA (1990).
34. E. A. Nadaraya, Nonparametric Estimation of Density and Regression Curve [in Russian], Izd. Tbilis. Gos. Univ., Tbilisi (1983).
35. P. J. Huber, Robust Statistics, Wiley, New York (1981).
36. Yu. M. Ermoliev and A. I. Yastremskii, Stochastic Models and Methods in Economic Planning [in Russian], Nauka, Moscow (1979).
37. Y. M. Ermoliev and G. Leonardi, "Some proposals for stochastic facility location models," Math. Modelling, 3, 407–420 (1982).
38. A. Ruszczynski and A. Shapiro (eds.), Stochastic Programming, Vol. 10 of Handbooks in Operations Research and Management Science, Elsevier, Amsterdam (2003).
39. L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York (1996).
40. A. N. Tikhonov and V. Ya. Arsenin, Methods of Solving Ill-Posed Problems [in Russian], Nauka, Moscow (1986).
41. F. P. Vasil'ev, Methods for Solving Extremal Problems: Minimization Problems in Functional Spaces, Regularization, and Approximation [in Russian], Nauka, Moscow (1981).
42. I. Ekeland and R. Temam, Convex Analysis and Variational Problems, Elsevier, North-Holland (1976).
43. A. Kaplan and R. Tichatschke, Stable Methods for Ill-Posed Variational Problems: Prox-Regularization of Elliptic Variational Inequalities and Semi-Infinite Problems, Akad. Verlag, Berlin (1999).
44. A. M. Fedotov, Ill-Posed Linear Problems with Random Errors in Data [in Russian], Nauka, Sib. Otdel., Novosibirsk (1982).
45. R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer, Berlin (1998).
46. V. Norkin and M. Keyzer, "On stochastic optimization and statistical learning in reproducing kernel Hilbert spaces by support vector machines (SVM)," Informatica (Vilnius), 20, No. 2, 273–292 (2009).
