Hyperparameter design criteria for support vector classifiers


Neurocomputing 55 (2003) 109–134, www.elsevier.com/locate/neucom

Davide Anguita*, Sandro Ridella, Fabio Rivieccio, Rodolfo Zunino

DIBE, Department of Biophysical and Electronic Engineering, University of Genova, Via all'Opera Pia 11a, 16145 Genova, Italy

Received 28 February 2002; accepted 8 January 2003

Abstract

The design of a support vector machine (SVM) consists in tuning a set of hyperparameter quantities, and requires an accurate prediction of the classifier's generalization performance. The paper describes the application of the maximal-discrepancy criterion to the hyperparameter-setting process, and points out the advantages of such an approach over existing theoretical frameworks. The resulting theoretical predictions are then compared with the k-fold cross-validation empirical method, which probably is the current best-performing approach to the SVM design problem. Experimental results on a wide range of real-world testbeds show that the features of the maximal-discrepancy method can notably narrow the gap that has so far separated theoretical and empirical estimates of a classifier's generalization error.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Generalization error estimate; Hyperparameters tuning; k-fold cross-validation; Maximal-discrepancy criterion; Support vector classifier

Corresponding author. Fax: +39-010-353-2175. E-mail addresses: [email protected] (D. Anguita), [email protected] (S. Ridella), [email protected] (F. Rivieccio), [email protected] (R. Zunino).
0925-2312/03/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0925-2312(03)00430-2

1. Introduction

Vapnik developed support vector machines (SVMs) [7,23] as an effective model to optimize generalization performance by suitably controlling class-separating surfaces. The opportunity to find a subset of training patterns that draws the eventual class boundaries was a notable advantage of SVM classifiers on their introduction.

From a computational perspective, posing the training of SVMs as a constrained quadratic-programming (QP) problem [10] boosted their practical effectiveness. The widespread diffusion of SVMs resulted mostly from their successful applications in real-world domains, where SVMs performed effectively, especially in terms of generalization error.

While SVM training relies on an efficient QP algorithm, the design of an SVM requires one to tune a set of quantities ('hyperparameters') that ultimately affect the classifier's generalization error. Therefore, the underlying problem of effectively setting the hyperparameters of an SVM is of major importance. In spite of the several approaches proposed in the literature to characterize the generalization ability of SVMs theoretically [6,7,11,23,26], the most accurate estimates of their performance in real applications still remain empirical ones [8]. The main reason for this conflict between theory and practice lies in the goal of theoretical analysis, which mostly aims to cover widely general cases, thus yielding upper bounds on the classifier's generalization error; by contrast, empirical approaches deal with actual data distributions and hence give tighter, albeit sometimes 'optimistic', estimates.

The research presented in this paper faces the problem of SVM design by tackling hyperparameter setting as an indirect result of accurately estimating generalization performance. The paper adopts the maximal-discrepancy criterion [3] for bounding the classifier's generalization error; the possibility of applying such a sample-based method to SVMs represents a somewhat novel feature. From a theoretical standpoint, the paper compares the maximal-discrepancy approach with existing theoretical frameworks. A significant result of this analysis lies in showing that the former yields tighter generalization bounds. This is mostly associated with the overall model simplicity, which is not affected by the additional bounding terms that typically have to be taken into account when one considers the classifier's growth function or VC-dim explicitly. From a practical perspective, the results from the maximal-discrepancy method also seem to close on the empirical estimates more than theoretical predictions have usually done before. The empirical method that has so far been reported as best performing [8] is compared with the maximal-discrepancy approach. To this end, the method is tested on a set of different real-world applications, including disease diagnosis, OCR, and coin recognition. The use of standard testbeds makes comparisons feasible and reliable; in all cases, the considerable number of experiments carried out indicates that empirical and theoretical estimates span a range of values resulting in practical advantage.

Section 2 briefly summarizes the SVM model, mainly to introduce the notation and to frame the SVM design problem. Section 3 addresses the various approaches to hyperparameter setting and describes the maximal-discrepancy method, comparing it with existing empirical and theoretical methods for generalization prediction. Section 4 presents the experimental results obtained on the different testbeds. Some concluding remarks are made in Section 5.

2. A framework for SVM classifiers

A learning machine (LM) can be thought of as a mapping:

LM : X × S × Y → Π,    (1)


where X denotes the d-dimensional input space from which data are drawn, Y is the n_y-dimensional space of pattern classes ('targets'), and S is the space holding all the n_s variables specifying the LM. The error on the data population is a number, π ∈ Π, where Π is the closed real-valued interval between 0 and 1, and can be expressed as [3,23]:

π = E_{x,y}{ L(f(s, x), y) },    (2)

where s ∈ S, x ∈ X, L(·,·) is a {0,1}-valued loss function measuring errors, and f(s, x) is the LM estimate of the actual target value, y. The variables characterizing an LM can be divided into parameters and hyperparameters; training data are used to optimize the former, whereas the latter typically determine the classifier's regularity and are tuned on a test set.

If the LM is a support vector classifier [7], the target space is Y = {+1, −1}, and the LM structure can be written as

f(s, x) = Σ_{i=1}^{n_SV} α_i y_i K(x̃_i, x) + b,    (3)

where n_SV is the number of support vectors, α_i are positive parameters, x̃_i are the support vectors, K(·,·) is a kernel function and b is a bias. The support vectors are selected from the input data set, hence n_SV ≤ np, where np is the number of input patterns. Expression (3) shows that f(s, x) is a series expansion having K(·,·) as a basis and involving part or all of the training examples. The choice of the particular basis K(·,·) involves a mapping, Φ, of training data into a higher-dimensional space where the kernel function supports an inner product. As a result of this well-known kernel trick, one can handle inner products of patterns in a space different from X, yet disregarding the specific mapping of each single pattern. By using this notation, expression (3) is rewritten as

f(s, x) = Σ_{i=1}^{n_SV} α_i y_i Φ(x̃_i) · Φ(x) + b    (4)

and the class-separating surface is a hyperplane in the Φ space (Fig. 1).

Under the hypothesis of linearly separable data, an SVM will choose the maximum-margin separating hyperplane, that is, the one having the maximum distance from the two separating hyperplanes closest to each class (Fig. 1). It has been proved [23] that using that particular hyperplane conveys remarkable generalization properties. To accomplish the classification task with the maximum-margin solution, one needs to solve the following constrained minimum problem (primal problem):

min_{w,ξ,b} { (1/2)‖w‖² + C Σ_{i=1}^{np} ξ_i },  subject to: y_i (w · Φ_i + b) ≥ 1 − ξ_i  ∀i,    (5)

where w is an array perpendicular to the separating surface, Φ_i = Φ(x_i), ξ_i are slack variables, and C is a regularization parameter weighting the relevance of classification errors. Setting C = ∞ implies that all ξ_i are nullified, and supports linearly separable problems.


Fig. 1. f(s, x) as a separating hyperplane lying in a high-dimensional space. Support vectors are inside circles.

Table 1
Some valid kernel functions

Kernel type                    Kernel structure
Linear                         K(x1, x2) = x1 · x2
Polynomial                     K(x1, x2) = (x1 · x2 + 1)^P
Radial basis function (RBF)    K(x1, x2) = exp(−γ ‖x1 − x2‖²)

The above formulation can take into account non-linearly separable cases by letting C take on finite values. The solution of this modified dual problem (DP) can be found [10,15] by maximizing the following cost function:

DP ≡ max_α { Σ_{i=1}^{np} α_i − (1/2) Σ_{i,j=1}^{np} α_i α_j y_i y_j K(x_i, x_j) },  subject to: 0 ≤ α_i ≤ C ∀i,  Σ_{i=1}^{np} y_i α_i = 0,    (6)

where α_i are the Lagrange multipliers for the constraints in (5). The DP problem (6) can be solved by using a quadratic-programming algorithm [13,19]. Some valid kernel functions are listed above (Table 1).

In the following, we shall denote the set of Lagrange multipliers, {α_i}, in an SVM classifier as 'parameters'; the set of hyperparameters includes the error-weighting value, C, and any other specific quantity that shapes the kernel functions, such as γ for RBF kernels. Hyperparameter values ultimately affect the complexity of the resulting function. For example, higher values of C will bring about fewer errors, but the separating function will turn out to be somewhat twisted; at the same time, kernel-specific hyperparameters can alter the effectiveness of the eventual separating surface. In the specific case of RBFs, it has been proved [4] that one can classify any consistent training set with zero errors by using a sufficiently large value of γ. The use of RBF kernels will be adopted as a default throughout the paper.
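To make the role of the two hyperparameters concrete, the following minimal sketch (not taken from the paper) trains a soft-margin SVM with an RBF kernel for one fixed pair {C, γ} and measures the empirical error ν on the training patterns; it assumes the scikit-learn QP-based solver and a small synthetic two-class sample, both introduced here purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class sample, standing in for one of the testbeds used later
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), +np.ones(50)])

C, gamma = 1e2, 1e-3          # one point of the {C, gamma} lattice
# RBF kernel K(x1, x2) = exp(-gamma * ||x1 - x2||^2); the dual (6) is solved internally
svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)

# Empirical classification error nu on the training patterns (0/1 loss)
nu = np.mean(svm.predict(X) != y)
print(f"nu = {nu:.3f}, n_SV = {len(svm.support_)}")
```

Raising C or γ typically drives ν down at the cost of a more convoluted separating surface, which is exactly the trade-off the hyperparameter-setting criteria discussed below are meant to control.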


When applied to classification problems, the SVM decision function (3) ascribes pattern classes and yields the empirical error rate, ν, on the available data set. However, one should be aware that the solution of problem (6) does not actually seek the desired minimum of ν. Indeed, it is known [4] that the solution of SVM training (6) yields a set of slack values ξ_i for which the following bound holds:

ν ≤ (1/np) Σ_{i=1}^{np} ξ_i.    (7)

As a result, the second term of the cost function (5) that is actually optimized is just an upper bound to the error, ν. One should also consider that the cost function involves the regularization term ‖w‖². The use of such an analog cost function is mainly justified by the availability of efficient QP algorithms; in fact, processing discrete costs would require the expensive use of integer-programming techniques.

3. Hyperparameter tuning criteria for support vector classifiers

SVM training based on the cost function (6) optimizes the values of the Lagrange multipliers {α_i}, but requires that the values of the hyperparameters be fixed in advance. The problem of setting hyperparameters effectively is therefore strongly felt [6,8]; it does not relate to the empirical training error, but is rather formulated in terms of generalization performance. As a consequence, the SVM design process ultimately brings about the problem of evaluating π. Several tuning criteria exist for this purpose, and they can be roughly divided into two basic approaches.

Theoretical methods aim to derive the most stringent bounds on the expected generalization error, typically by setting the accuracy of the estimator as a function of the number of training patterns. A typical approach of this kind characterizes the expected generalization performance by statistically taking into account the possible distribution of test patterns that might be encountered [23]. Empirical methods, instead, aim to squeeze all possible information from the available data set, and ultimately rely on the fact that the data distribution is well represented by the available sample. A typical empirical technique is to put aside some portion of the data set in the training process, and to use it to assess generalization performance [8].

Statistical learning theory states that a machine featuring a small value of π should score few errors on training data and, at the same time, exhibit low complexity. In practice, this property is often estimated empirically with a test set, and one tunes hyperparameters to shape the best-performing function among those permitted by the specific kernel model.

3.1. A theoretical approach to hyperparameter tuning

This section describes a theoretical approach to predicting generalization performance called 'penalization by maximal discrepancy' [3]. The method adopts the empirical error rate on the training set as an estimate of generalization performance. Therefore,


one first must train a support vector classifier on the training set by solving the dual problem (6), for a given setting of the hyperparameter values {C, γ}. Then one measures the empirical classification error attained by the resulting machine on the training patterns; in the following, such a quantity will be denoted as ν. Clearly, ν estimates π but is subject to a bias, and some penalty should somehow take into account the classifier complexity.

To compute the required correction term, the maximal-discrepancy method requires that training data be randomly split into two halves; recent results [5] also explored the possibility of splitting such data into unequal portions. Under the balanced-splitting assumption, the two errors made on the two subsets can then be defined as

ν₁ = (2/np) Σ_{i=1}^{np/2} l(f(x_i), y_i),    ν₂ = (2/np) Σ_{i=np/2+1}^{np} l(f(x_i), y_i),    (8)

where l(f(x_i), y_i) = 1 if f(x_i) · y_i ≤ 0, and l(f(x_i), y_i) = 0 otherwise. The classifier's complexity term that is associated with an empirical estimate is then attained by maximizing the difference (ν₂ − ν₁) over parameter configurations, {α_i}; in this process, the hyperparameters C and γ are the same as those used for estimating ν. The problem to be solved is therefore

max_α (ν₂ − ν₁).    (9)

The following inequality [3] eventually relates the maximum discrepancy to a bound, π_MD, on the generalization error:

P{ π_MD − ν ≥ max_α(ν₂ − ν₁) + ε } ≤ exp(−2ε² np/9).    (10)

Expression (10) is usually rewritten to derive the generalization bound

π_MD ≤ ν + max_α(ν₂ − ν₁)    (11)

with probability 1 − δ and an associated confidence interval given by ε = 3 √( log(1/δ) / (2np) ). In some respect, such an approach is similar to that for estimating the effective VC-dimension [24]. The crucial difference here is that the discrepancy is used to estimate the generalization error directly, and neither the VC-dim nor the Growth Function enters the eventual expression for the estimate of π.

A simple and effective procedure to solve (9) was described in [3,20,24], and can be outlined as follows:
1. Split input data into two halves (np is taken even).
2. Keep the targets of the patterns in the first half unchanged, and flip those in the second half. Denote the modified data set by (X, Y′).
3. Train an SVM classifier on (X, Y′).


4. Work out the empirical loss, ν̄, on the modified data set, expressed in terms of the real targets, as

ν̄ = (1/np) Σ_{i=1}^{np} l(f(X_i), Y′_i) = 1/2 + (1/np) Σ_{i=1}^{np/2} l(f(X_i), Y_i) − (1/np) Σ_{i=np/2+1}^{np} l(f(X_i), Y_i) = (1 + ν₁ − ν₂)/2.    (12)

5. Compute the maximum discrepancy value as

max_α (ν₂ − ν₁) = 1 − 2ν̄.    (13)
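In code, the five steps above reduce to a single additional training run on the flipped targets. The sketch below (Python, again assuming the scikit-learn solver, targets in {−1, +1}, and data already randomly shuffled into two halves) computes the discrepancy term and the resulting bound (11) for one hyperparameter configuration; it is only an illustration of the procedure, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def max_discrepancy(X, y, C, gamma):
    """Steps 1-5: flip the targets of the second half, retrain, and return
    max_alpha(nu2 - nu1) = 1 - 2*nu_bar, as in (12)-(13)."""
    n = len(y)                          # np is taken even; halves assumed pre-shuffled
    y_flip = y.copy()                   # y: numpy array of targets in {-1, +1}
    y_flip[n // 2:] *= -1               # step 2: flip the targets of the second half
    svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y_flip)   # step 3
    nu_bar = np.mean(svm.predict(X) != y_flip)                 # step 4: loss on (X, Y')
    return 1.0 - 2.0 * nu_bar                                  # step 5

def md_bound(X, y, C, gamma):
    """Generalization bound (11): empirical error nu plus the discrepancy term."""
    svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    nu = np.mean(svm.predict(X) != y)
    return nu + max_discrepancy(X, y, C, gamma)
```

Model selection then amounts to evaluating md_bound over a discrete {C, γ} lattice and keeping the minimizer, which is exactly the criterion formalized in (14) below.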

As a result, minimizing (12) amounts to maximizing the discrepancy value in (9). The above algorithm appears quite handy from a computational perspective, as it allows one to use conventional QP algorithms and SVM training strategies for solving (9). Under the caveats noted in Section 2 about approximation (7), which is involved in using the SVM optimization process for classification purposes, the practical advantages of the above-described algorithm seem overwhelming. That approximation proves acceptable, as the implemented criterion applies effectively for SVM model selection; indeed, the target-flip approach has also been used by Vapnik [24] and Cherkassky [20] in their methods to find the effective VC dimension.

The above procedure applies for a fixed configuration of the hyperparameters C, γ. To extend such an approach to model selection, one simply iterates the algorithm involving ν-computation and discrepancy maximization (9) for various values of C and γ. Each configuration {C, γ} will have an associated generalization bound (11). The optimal hyperparameter configuration will be the one solving

min_{C,γ} { ν + max_α(ν₂ − ν₁) }.    (14)

The confidence interval of the bound (11) does not enter the optimization process, because the quantities involved do not depend on the hyperparameter values. In practice, the procedure is repeated for all pairs {C, γ} in a discrete lattice with a given step.

3.2. The maximal-discrepancy criterion and previous theoretical results

This section discusses the relationships between the maximal-discrepancy approach and other well-known theoretical approaches to generalization-performance bounding. In particular, Vapnik [23] developed a theoretical framework for predicting a classifier's generalization performance; a basic result of that comprehensive work is that, with probability 1 − δ, one has

π ≤ ν + (ε/2) ( 1 + √( 1 + 4ν/ε ) ),    (15)

where ε = (4/np) ( ln GF(2np) − ln(δ/4) ); GF(m) is the classifier's growth function [1], which by Sauer's lemma satisfies GF(m) ≤ (e·m/h)^h, where h is the classifier's VC-dim.


Expression (15) does not allow one to sharply separate the complexity and confidence-correction terms; a comparison of (11) with (15) actually becomes feasible only when ν = 0. Anyway, common practice clearly indicates that the maximal-discrepancy approach yields much tighter bounds than those resulting from (15). Those benefits can be explained by considering a few theoretical issues. Although Vapnik's proof involves the same complexity term (9), subsequent derivations penalize that term by bringing in the concept of growth function, which is further upper bounded by using Sauer's Lemma. Finally, one should consider that the value of the VC-dim is most often unknown and subject to additional upper bounds.

Several approaches have been proposed in the literature to make up for the overestimates involved by Vapnik's framework. Sample-based techniques disregard the highest generality of the obtained solutions, and derive generalization bounds by taking into account the available training set. The previously cited methods by Vapnik and Cherkassky [20,23] lie in this context. As a matter of fact, estimating an effective value of the VC-dim does not free the overall framework from the subsequent penalties brought about by the growth function and Sauer's Lemma. Fat-shattering methods [2] estimate the generalization error for SVMs by using margin properties; they derive a substitute for the VC-dim, which is obtained from the ratio of the radius to the margin and is computed as h = D²‖w‖². Alternatively, the number of support vectors has been used as an approximate assessment of the VC-dim [11]; such a technique is certainly interesting when the number, n_SV, is small.

As compared with all the above methods, the maximal-discrepancy technique seems to perform better (i.e., to yield tighter bounds) thanks to some crucial features: first of all, it has the advantages shared by all sample-based methods; moreover, the complexity term (9) involves a very small number of subsequent bounding operations; in particular, it does not take into account the classifier's growth function. In this sense, the relationships between the concepts of Vapnik's theory and the maximal-discrepancy approach can be illustrated by considering the following curious properties.

Property 1. Let A and B denote two learning machines having the same VC-dim: h_A = h_B. Then there exists some case in which GF_A(m) ≠ GF_B(m).

Property 2. Let A and B be two learning machines such that ∃m: GF_A(m) = GF_B(m). Then there exists some case in which E{max_A(ν₂ − ν₁)} ≠ E{max_B(ν₂ − ν₁)} over the same sample.

The proofs of both properties are given in the appendix. The relevance of these facts to the generalization-assessment problem lies in defining a sort of graded hierarchy of quantities arranged according to their descriptive powers. Property 1 seems to suggest the use of the classifier's Growth Function in place of its VC-dim whenever possible, especially because bypassing Sauer's Lemma yields tighter generalization bounds. In that case, however, one usually faces the problem that the GF of a classifier is most often unknown and possibly harder to bound than its VC-dim.

Conversely, Property 2 also indicates that even characterizing a classifier in terms of its growth function may somehow yield incomplete or inaccurate results. Indeed,


the sample-based nature of the discrepancy-based criterion may be more suited to the empirical data distribution, thus better rendering the classifier's expected performance. In fact, some researchers recently tried to characterize a classifier's generalization ability 'qualitatively' by checking on the existence of some clusters within the target configurations spanned by the GF. The overall goal is to find a classifier's ability descriptor that improves on the trivial configuration counting supported by the GF. An analysis of that kind was based on the concept of 'connectedness' [21]. A theoretical and established explanation for this phenomenon still seems to be lacking, and opens new interesting vistas for future work.

3.3. Hyperparameter tuning via performance measures

As the actual data distributions in real problems are not known in advance, one needs some reliable estimate of the classifier's performance; in several real cases, it often happens that the results from empirical methods exhibit a greater practical effectiveness than those predicted by theoretical criteria. The comprehensive research described in [8] presents a review of several techniques, covering both empirical and theoretical methods: the comparison includes, among others, k-fold cross-validation, the Xi-Alpha bound [14], GACV [25], the Approximate Span Bound [8], the VC Bound [23], and D²‖w‖² [23]. All of the criteria considered were tested on a wide variety of datasets, and provided different results; for example, the span rule given in [23] performs well, although the Approximate Span Bound in [8] does not. The comparative analysis reported therein showed that k-fold cross-validation (CV) proved the best method in terms of accuracy in predicting generalization performance (and therefore in terms of effectiveness in driving model selection).

Such a technique requires one to partition the original training set into k non-overlapping subsets. The learning machine is trained on the union of (k − 1) subsets; then one uses the remaining kth subset as a provisional test set and measures the associated classification performance; in the following, that test error rate will be denoted as ν₂^(k). The procedure is cycled over all possible k test sets, and the average test error, π_CV, is regarded as a final index of generalization performance:

π_CV = (1/k) Σ_{j=1}^{k} ν₂^(j).    (16)

In principle, the k-fold technique just aims to point out the classifier that promises the best generalization performance. As a matter of fact, the ultimate purpose of the overall process is not an accurate prediction of the eventual generalization error; model selection stems from the empirical comparison among different classifier configurations. This is mainly due to the fact that the empirical tests on folded partitions are correlated with one another, hence the resulting estimate (16) is affected by an empirical bias. As a consequence, a reliable comparison between this method and the theoretical criteria described in the previous sections requires some prediction of the actual generalization error. In particular, one needs to express the associated estimate of π by means of a correction term to be added to the empirical error on the training sets.


A consistent and theoretically sound baseline for that purpose can be provided by the statistical framework of bootstrap techniques [9]. The related procedure is very similar to that adopted for π_MD. The basic estimator still is the empirical classification error, ν, attained on the entire data set by a trained SVM. The additional correction term to such an estimate is computed according to the k-fold approach: one trains an SVM by using k − 1 subsets and minimizes the associated empirical error, ν₁^(k−1); then one measures the classification error, ν₂^(k), on the remaining data subset. The difference between those quantities gives a discrepancy term acting as an estimate-correction penalty. Iterating this algorithm over all possible partitions and averaging leads to the predicted generalization error, π_KF:

π_KF = ν + (1/k) Σ_{j=1}^{k} ( ν₂^(k) − ν₁^(k−1) ),    (17)

where ε = √( ln(1/δ) / (2np/k) ) is the associated confidence interval [12]. The second term in (17) expresses the bias displacing the resubstitution error from the generalization error.

Setting k = 2 in (17) clearly points out the difference between the theoretical and empirical approaches to estimating π. The complexity penalty in (11) features a worst-case discrepancy term, max_α(ν₂ − ν₁), that is typical of theoretical approaches; conversely, the parallel complexity-related contribution in (17) stems from a different discrepancy problem, ν₂^(k) − min ν₁^(k−1), that is instead characteristic of empirical methods. This reflects the practical nature of the k-fold method, which requires that generalization performance be estimated after minimizing the empirical error, and indirectly explains why the associated estimates prove smaller than theoretical ones.

It is also worth noting that expression (17) reduces to the conventional quantity used in k-fold CV (16) if the average training error over the several subsets equals the empirical error rate on the entire training set, that is, if ν ≅ (1/k) Σ_{j=1}^{k} ν₁^(k); incidentally, this result has actually been verified throughout all the experiments performed in the present research. Those properties also hold for unbalanced partitions using k = 2, provided one exploits the results presented in [5]. Clearly no comparison is easy to perform when the two approaches use different split portions. The work presented in [8] gives the best value, k = 5, for the splitting strategy. Therefore, thanks to the comprehensive nature of that comparison, the k-fold method with k = 5 will be adopted in the following as the reference paradigm for the empirically driven design of hyperparameters, and π_KF will denote the associated estimate of the classifier's generalization error.
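The corrected estimate (17) is equally simple to sketch; the fragment below (Python, k = 5, scikit-learn again used as the underlying SVM trainer, with a standard shuffled partition as an illustrative choice) returns π_KF for one hyperparameter configuration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def kfold_estimate(X, y, C, gamma, k=5, seed=0):
    """Estimate (17): resubstitution error nu plus the average discrepancy
    between the test error nu2 and the training error nu1 over the k folds."""
    svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    nu = np.mean(svm.predict(X) != y)          # empirical error on the whole set

    bias = 0.0
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        f = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[train_idx], y[train_idx])
        nu1 = np.mean(f.predict(X[train_idx]) != y[train_idx])  # error on the k-1 subsets
        nu2 = np.mean(f.predict(X[test_idx]) != y[test_idx])    # error on the held-out subset
        bias += (nu2 - nu1) / k
    return nu + bias
```

Scanning this estimate over the same {C, γ} lattice used for (14) gives the empirical counterpart of the maximal-discrepancy model selection.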


4. Hyperparameter setting in practice

This section compares the theoretical and practical design criteria experimentally. K-fold cross-validation is used as a reference method to assess the performance of the maximal-discrepancy criterion. The experiments first compare the lowest values of the generalization-error estimates obtained by the two methods over an area of the hyperparameter space; then the two model-selection approaches are also compared by estimating the actual generalization error they eventually exhibit. The consistency between theory and practice is analyzed by considering different, real-world applications, which provide standard testbeds to make comparisons feasible. The results show that the maximal-discrepancy approach actually fits the empirical findings obtained by the k-fold method quite accurately.

The experimental procedure adopted was the following. For each testbed, the available data set was split into a 'basic' set for model selection and a 'validation' set for assessing the true generalization error of each classifier. The basic data set underwent an empirical calibration of the hyperparameters, according to the 5-fold cross-validation runs described in [8]. The constraint parameter was sampled in the range C ∈ [1, 10⁵], whereas the γ parameter swept the range γ⁻¹ ∈ [1, 10⁷], and the associated generalization prediction, π_KF, was computed according to (17). The most promising values of {C, γ} yielding the optimal error estimate were recorded. Likewise, the same intervals were scanned to search for the best-performing values of {C, γ} by using the theoretical approach described in Section 3.1. For this method, the optimal values of the hyperparameters were recorded, together with the generalization estimate, π_MD, computed according to (11).

The comparison between the two approaches is attained both by matching the specific hyperparameter settings and by measuring the discrepancies between the individual error estimates. The former comparison gives an estimate of the robustness of the setting method, whereas the latter measures the relative accuracies of the compared approaches. Then the research also validates experimentally the actual prediction performances of both methods and the overall consistency of the associated model-selection criteria.

4.1. The breast cancer testbed

Wisconsin's breast cancer database [27] collects 699 cases of such diagnostic samples; after removing 16 cases with 'missing' values, the dataset cardinality eventually amounted to np = 683 patterns. Each pattern (patient) was represented by a nine-dimensional feature vector. The discrimination problem featured an almost balanced distribution between the two classes in the original set of patients.

The data set was divided into two sets of 500 and 183 patterns, respectively. The former was used for both the k-fold and the theoretical model-selection procedure, whereas the latter provided the 'validation' set for assessing the true generalization error of the designed classifiers.

When applied to the 500-pattern sample, the 5-fold procedure gave the results shown in Table 2(a). Boldface figures point out, for each value of C, the γ⁻¹ values yielding the smallest predicted error, π_KF, computed according to (17); those minima therefore provide the actual solutions of the hyperparameter-setting problem. For graphical purposes, the error distribution is also given as a 3-D surface (Fig. 2). The relatively small values obtained witness the possibly simple nature of this problem. This result is confirmed by the outcomes of the theoretical generalization criterion, π_MD (11), which are provided in Table 2(b) and graphically presented in Fig. 3.

The comparative analysis of the obtained results can be performed from different viewpoints.


Table 2
BCancer: Perc. generalization error

(a) Estimated empirically by 5-fold, π_KF
  C \ γ⁻¹     1      10     10²    10³    10⁴    10⁵    10⁶    10⁷
  1          9.84   3.28   2.73   1.64   1.64   2.73   3.82   3.83
  10         9.84   3.28   2.73   2.19   1.64   1.64   2.73   2.73
  10²        9.84   3.28   3.28   2.19   2.19   1.64   2.19   3.28
  10³        9.84   3.28   3.57   3.28   2.19   2.19   1.64   2.19
  10⁴        9.84   3.28   3.83   4.92   2.19   2.19   2.19   1.64
  10⁵        9.84   3.28   3.83   3.28   4.37   2.19   2.19   2.19

(b) Predicted theoretically, π_MD
  C \ γ⁻¹     1      10     10²    10³    10⁴    10⁵    10⁶    10⁷
  1          74.0   55.3   24.3   10.3    8.5   41.0   43.0   40.0
  10         74.0   65.0   40.8   15.8   10.0   10.5   40.0   41.0
  10²        76.0   72.0   53.0   25.3   12.8   10.4    8.5   44.0
  10³        75.0   74.0   61.0   32.0   16.5   10.8   10.4   10.5
  10⁴        76.0   76.0   67.0   42.0   22.3   14.8   13.8   10.4
  10⁵        75.0   75.0   73.0   54.0   29.3   17.5   12.8   11.8

(c) Measured on the validation set, π
  C \ γ⁻¹     1      10     10²    10³    10⁴    10⁵    10⁶    10⁷
  1          12.6   4.0    3.4    2.3    2.9    3.4    4.6    6.9
  10         12.6   3.4    4.6    3.4    2.3    2.9    3.4    4.6
  10²        12.6   3.4    4.0    4.0    2.9    2.3    2.9    3.4
  10³        12.6   3.4    4.0    5.2    3.4    3.4    2.3    2.9
  10⁴        12.6   3.4    4.6    5.2    4.0    3.4    3.4    2.3
  10⁵        12.6   3.4    4.6    6.9    5.2    3.4    3.4    3.4

Fig. 2. BCancer: generalization estimates by 5-fold cross-validation.


Fig. 3. BCancer: generalization estimates by theoretical prediction.

Fig. 4. BCancer: comparative results of hyperparameter design.

First of all, as far as the hyperparameter-design problem is concerned, experimental evidence shows a remarkable fit between the empirical and theoretical approaches in suggesting the most promising classifier configurations (Fig. 4). Secondly, the two model-selection methods agree in predicting the overall generalization error. The 95%-confidence value of the empirical minimum is given by π_KF^(0.05) = 12.38% (1.91 + 10.47), which well covers the theoretical prediction π_MD = 7.2%.
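The confidence half-widths used in this and the following paragraphs come directly from the expressions attached to (11) and (17); as a quick check, the sketch below (Python, with np = 500, k = 5 and δ = 0.05, as in the model-selection runs on this testbed) reproduces the ±16.42% and ±12.24% terms quoted below.

```python
import math

def md_half_width(n_patterns, delta):
    """Confidence term attached to bound (11): eps = 3*sqrt(ln(1/delta)/(2*np))."""
    return 3.0 * math.sqrt(math.log(1.0 / delta) / (2.0 * n_patterns))

def kfold_half_width(n_patterns, k, delta):
    """Confidence term attached to estimate (17): eps = sqrt(ln(1/delta)/(2*np/k))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n_patterns / k))

print(f"{md_half_width(500, 0.05):.4f}")        # 0.1642 -> +/-16.42%
print(f"{kfold_half_width(500, 5, 0.05):.4f}")  # 0.1224 -> +/-12.24%
```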


Fig. 5. BCancer: generalization errors.

Finally, the accuracy of the two estimators was compared by sweeping again the hyperparameter space to train a set of SVMs with all of the 500 patterns available, and measuring the classification errors on the remaining validation patterns. This provided rough estimates of the classifiers' generalization performance. The obtained results are presented in Table 2(c) and graphically displayed in Fig. 5.

Empirical evidence confirms the validity of both model-selection methods as estimators of the eventual value of π. As expected, the empirical method using the k-fold technique yields a better approximation over the whole hyperparameter space, whereas the maximal-discrepancy criterion ascribes a severe penalty term to those configurations liable to overfitting. The relative advantage of the k-fold approach is mainly due to the empirical nature of the estimation procedure itself. Conversely, it is worth stressing that both methods agree in the error predictions associated with the model-selection outcomes. If 95% confidence intervals are rated acceptable, the most promising hyperparameter setting derived from Table 2(b), {C = 10², γ⁻¹ = 10⁶; π_MD = 8.5 ± 16.42%}, predicts a generalization error that well covers the value predicted by π_KF = 1.64 ± 12.24% and the corresponding empirical measure of π = 2.9 ± 9.05% in Table 2(c). Of course, these values are affected by quite wide confidence intervals due to the relatively small number of patterns involved.

4.2. The diabetes testbed

The 'Pima Indians diabetes database' [22] provided another clinical testbed. The data set included 768 patient descriptions, each characterized by 8 features; the two classes were quite intermixed and made this sample rather difficult to discriminate.


Table 3
Diabetes: Perc. generalization error

(a) Error estimated empirically, π_KF
  C \ γ⁻¹     1    10    10²   10³   10⁴   10⁵   10⁶   10⁷
  1          33    33    31    26    21    22    33    33
  10         33    33    35    27    20    21    21    33
  10²        33    33    35    30    21    20    21    22
  10³        33    33    35    33    24    22    23    21
  10⁴        33    33    35    33    22    20    20    19
  10⁵        33    33    35    33    21    27    19    21

(b) Predicted theoretically, π_MD
  C \ γ⁻¹      1      10     10²    10³    10⁴    10⁵    10⁶    10⁷
  1         100.0   100.0   98.0   68.5   40.3   33.5   40.8   37.8
  10        100.0   100.0  100.0   86.3   48.8   34.5   28.3   38.8
  10²       100.0   100.0  100.0   95.5   61.5   44.0   32.8   29.0
  10³       100.0   100.0  100.0  100.0   73.5   45.3   36.5   34.8
  10⁴       100.0   100.0  100.0  100.0   84.3   52.8   42.0   35.0
  10⁵       100.0   100.0  100.0  100.0   92.5   58.5   42.0   38.3

(c) Measured experimentally, π
  C \ γ⁻¹     1    10    10²   10³   10⁴   10⁵   10⁶   10⁷
  1          41    41    42    35    28    32    41    41
  10         41    41    42    35    28    27    29    41
  10²        41    41    42    47    30    28    26    28
  10³        41    41    42    48    30    24    27    25
  10⁴        41    41    42    48    34    27    26    28
  10⁵        41    41    42    48    41    30    25    27

In this case, too, 500 patterns formed the data set used for model selection, and the remaining 268 were used for subsequent validation. Table 3(a) and the associated graph in Fig. 6 provide the experimental results obtained by processing the diabetes dataset with the k-fold procedure for generalization assessment, as computed in (17). The relatively large values of the estimated error, π_KF, witness the difficulty of the testbed in separating the classes consistently.

Then the data set was processed by using the maximal-discrepancy estimation criterion (11), which yielded the results presented in Table 3(b) and Fig. 7. The difficult testbed distribution was confirmed by the fact that a significant portion of the hyperparameter configurations did not allow a meaningful prediction of the generalization error (in several cases, π_MD = 100%). In spite of the fact that theoretical predictions for large values of the γ parameter appeared useless, the values in the 'valid' region of Table 3(b) matched empirical evidence quite well. The two methods also agreed substantially in choosing the hyperparameter configurations promising the smallest generalization estimates for possible values of C (Fig. 8).

Table 3(c) and Fig. 9 give the classification errors on the validation set, and allow one to compare the two methods in terms of predicted generalization error. When


Fig. 6. Diabetes: generalization estimates by 5-fold cross-validation.

Fig. 7. Diabetes: generalization estimates by theoretical prediction.

averaging over the entire hyperparameter space, the k-fold approach seems to perform better than the maximal-discrepancy one; nevertheless, for those configurations supporting model selection, the theoretical estimator better succeeds in rendering the inherent problem complexity and proves much more accurate than the k-fold estimator, which turns out to be overly 'optimistic'.


Fig. 8. Diabetes: comparative results of hyperparameter design.

Fig. 9. Diabetes: generalization errors.

4.3. The NIST handwritten-numerals testbed

The optical character recognition application allowed by the NIST database of handwritten numerals offered a significant real-world problem to verify the method's consistency. In order to reduce the optimization complexity, 500 patterns were randomly extracted from the original data set (holding 60,000 patterns) to make up the training set; two-class problems were built up by sampling separate pairs of numerals (250 patterns per digit). For the sake of simplicity and brevity, among the several


Table 4
NIST49: Perc. generalization error

(a) Estimated empirically, π_KF
  C \ γ⁻¹     1      10     10²    10³    10⁴    10⁵    10⁶    10⁷
  1          0.40   0.00   0.90   7.10   8.00   8.00   8.00   8.00
  10         0.40   0.00   0.10   0.90   6.90   8.00   8.00   8.00
  10²        0.40   0.00   0.20   0.10   0.90   6.90   8.00   8.00
  10³        0.40   0.00   0.30   0.30   0.10   0.90   6.90   8.00
  10⁴        0.40   0.00   0.30   0.40   0.30   0.10   0.90   6.90
  10⁵        0.40   0.00   0.30   0.40   0.40   0.30   0.10   0.90

(b) Estimated theoretically, π_MD
  C \ γ⁻¹     1      10     10²    10³    10⁴    10⁵    10⁶    10⁷
  1          0.98   0.32   0.87   6.64   7.84   7.83   7.84   7.84
  10         1.00   0.75   0.45   0.88   6.63   7.84   7.82   7.84
  10²        1.00   1.00   0.41   0.40   0.89   6.63   7.84   7.83
  10³        1.00   1.00   0.74   0.29   0.40   0.89   6.64   7.83
  10⁴        1.00   1.00   0.98   0.41   0.27   0.42   0.87   6.65
  10⁵        1.00   1.00   1.00   0.73   0.31   0.30   0.42   0.88

(c) Estimated on the validation set
  C \ γ⁻¹     1      10     10²    10³     10⁴     10⁵     10⁶     10⁷
  1          9.82   2.89   4.61   17.58   18.09   18.09   18.09   18.09
  10         9.61   2.81   3.54    4.59   17.61   18.09   18.09   18.09
  10²        9.61   2.81   3.62    3.66    4.6    17.61   18.09   18.09
  10³        9.61   2.81   3.82    3.85    3.7     4.59   17.61   18.09
  10⁴        9.61   2.81   3.82    4.28    3.89    3.7     4.59   17.6
  10⁵        9.61   2.81   3.82    4.28    4.43    3.89    3.7     4.59

experiments performed with this sampling methodology, only the results on the problem '4' versus '9' are reported here. This case has been picked mainly because it involved the hardest pair of numerals to separate for the SV classifier; all other results were substantially similar to those presented in this section. Each two-class data set held 80-dimensional patterns; feature extraction followed the procedure described in [17].

The 5-fold procedure for the 500 patterns extracted gave the generalization estimates (17) shown in Table 4(a) and Fig. 10. The figures indicated rather a surprising result, as the SVM classifier always managed to separate the two classes consistently. In fact, such a peculiar distribution in the NIST database had already been observed previously [16], and mainly depends on the fact that the pattern categories in the overall NIST training set are confined to thick, well-separated clusters that make very small errors possible in both the training and testing phases.

A very interesting result is obtained when comparing the empirical estimates with the theoretical predictions, π_MD as per (11). Table 4(b) and Fig. 11 give in the usual formats the results of the tests performed. In principle, one should expect theory to diverge significantly from empirical evidence, mainly owing to the random, independent settings of pattern classes in the maximal-discrepancy method. Instead, the figures shown are in notable accordance with the values in Table 4(a). From this viewpoint, the


Fig. 10. NIST49: generalization estimates by 5-fold cross-validation.

Fig. 11. NIST49: generalization estimates by theoretical predictions.

agreement of the generalization estimates in this odd case ultimately provided an important support for the validity of both estimation approaches.
It is worth noting that the actual estimates associated with the best-performing configurations in the two cases were quite close, as the theoretical criterion yielded an


Fig. 12. NIST49: generalization performance on the validation set.

estimated error π_MD = 0.27%. The most striking difference between the two estimation methods was observed when considering the hyperparameter-design results, which differed in the two cases especially for larger values of the hyperparameter C. The following discussion will point out that this deviation is not as significant as it might seem, when considering the overall performance.

Indeed, the huge sample size of the NIST testbed made an additional measurement feasible to get a reliable estimate of the true generalization error. The database provided a validation set (including 58,646 patterns), which previous research had proved quite difficult to classify. The validation set had been drawn from a distribution apparently quite 'distant' from that used for the training set [16,17]. The validation subset holding numerals '4' and '9' amounted to 11,535 patterns. Generalization measurements followed a standard procedure: for each hyperparameter configuration tested previously, all of the original 500 patterns were used to train an SVM, whose classification error was subsequently measured on the validation set. The considerable size of the latter set made the 95% confidence interval very narrow (ε = 1.14%). Table 4(c) and Fig. 12 present the results in the usual formats, allowing comparisons with empirical and theoretical predictions.

The boldface numbers in Table 4(c) indicate, for each setting of C, the best-performing value of γ. Such results clearly highlight the effectiveness of the 5-fold method, in terms of both consistency in setting hyperparameters and accuracy in predicting π (the 95% confidence interval associated with the null values of π_KF in Table 4(a) is ε = 12.24%).

Actually one should also note that the error distribution in Table 4(c) has two 'valleys': one coincides with the minima in Table 4(a) (empirical method) and the other


Table 5
Coins: Perc. generalization error estimated theoretically, π_MD

  C \ γ⁻¹     1     10    10²    10³    10⁴    10⁵    10⁶    10⁷
  1          4.0   5.0    5.0    6.0    6.0    5.0    3.0    4.0
  10         6.0   7.0    5.0    5.0    4.0    5.0    5.0    4.0
  10²        3.0   6.0    5.0    4.0    3.0    6.0    5.0    4.0
  10³        3.0   3.0    6.0    5.0    4.0    4.0    4.0    4.0
  10⁴        3.0   1.0    2.0    6.0    7.0    5.0    4.0    5.0
  10⁵        4.0   5.0    4.0    3.0    6.0    6.0    4.0    4.0

with those in Table 4(b) (theoretical method). This very important result clearly indicates that both methods ultimately performed satisfactorily, as they always succeeded in suggesting effective SVM design.

4.4. The coin-recognition testbed

This pattern-recognition testbed involves the classification of coins in vending machines. In fact, the original domain involved five pattern classes, and was converted into a two-class problem by a procedure similar to that used for the NIST database, namely, by considering only two categories. The resulting data set included 20,000 patterns, each described by a five-dimensional feature vector representing electromagnetic measures of observed coins. The classes were balanced in the data set, which was split into a training set of 1,000 patterns and a validation set of 19,000 patterns. This partitioning mainly aimed to limit the training-set size for computational purposes and to tighten the confidence interval on the validation results.

A peculiarity of the coin-recognition database lies in the sample distribution, which is fairly easy to classify; indeed, the entire data set was composed of two well-separated clusters with very few errors. In spite of such apparent simplicity, this real-world testbed has a practical significance [18], and has been included here because it can still help clarify the behavior of the generalization prediction criteria.

The crucial issue about those tests was that the empirical training errors invariably turned out to be null for both the maximal-discrepancy and the 5-fold training sessions, regardless of the specific hyperparameter settings. For the empirical estimator (17), this also implied a null cross-validation term, hence π_KF ≡ 0 ∀C, ∀γ. The tests with flipped targets to maximize the discrepancy (9), instead, resulted in different correction terms for the various hyperparameter configurations. The results of the maximal-discrepancy model-selection tests, pointing out a best classification error of π_MD = 1%, are shown in Table 5 and Fig. 13.

The actual generalization error over the validation set always turned out to be very small (0.1%) for all configurations, again as a result of the particularly favorable sample distribution. From this viewpoint, the k-fold estimator seemed to perform more accurately than the theoretical one; nevertheless, a noticeable aspect of those


Fig. 13. Coins: generalization estimates by theoretical prediction.

experiments was that the predictions based on the maximal-discrepancy criterion were in close agreement with empirical evidence. The fact that a theoretically derived estimator also exhibits practical validity is not often shared by formal approaches to generalization performance, and denotes a basic merit of the related framework. Such a feature mostly results from the sample-based nature of the correction procedure.

5. Discussion and conclusions

A crucial issue in setting up an effective SVM classifier often concerns the method for the eventual hyperparameter setting; in particular, one generally faces the odd situation in which established theory seldom supports the choices that would instead be suggested by empirical expertise. This is typically due to the fact that theoretical predictions aim to attain the widest generality, also covering peculiar situations that do not usually occur in real-world cases.

This paper has considered a theoretical and an empirical method for predicting a classifier's generalization performance within the context of SVMs, and hence for choosing optimal hyperparameters. The k-fold cross-validation approach has been adopted as a 'reference' paradigm mainly because it is the favorite approach [8] to effectively setting up a classifier on the basis of its generalization behavior; the maximal-discrepancy criterion, which is relatively novel in the context of SVMs, has been chosen because it benefits from both a sound theoretical framework and the practical effectiveness of sample-based methods.

The different operations of the two approaches help clarify the existing gap between theory and practice. Both methods split the available sample into two subsets; the theoretical approach, however, exploits both subsets to derive a bound to the generalization


error, whereas the empirical approach uses one subset for training and the other for estimating π. As a result, the former approach often overestimates the error but, for example, need not tune the additional parameter k, which rules data partitioning.

From this viewpoint, a first, significant achievement of the comparative research presented in this paper lies in showing that the estimates of π obtained by the maximal-discrepancy method and those attained by k-fold CV span quite a narrow interval; as a result, one may choose to adopt the theoretical estimator in order to limit the computational cost of the empirical tests usually required by CV-based techniques. Secondly, experimental results also show that the two approaches often concur in the model-selection outcomes, as the corresponding hyperparameter settings are strongly correlated. Finally, both paradigms seem to provide reliable estimates of the resulting classifier's actual generalization error. As expected, the empirical method based on CV yields better estimates whenever the nature of the problem or the validation set is well correlated with training data. In difficult problems with complex sample distributions, however, the advantages of the general approach supported by a theoretical framework seem to hold great promise and often prove more satisfactory. Therefore an important conclusion that can be drawn from the presented research is that the maximal-discrepancy framework is of high practical interest for SVM design.

The paper has also analyzed the possible advantages of the proposed approach over existing theoretical frameworks. In spite of the aforesaid apparent convergence, a gap still exists between practice and theory. In this respect, the paper has also defined possible lines of research to further analyze the obtained results. The idea of connectedness, for example, might be developed and applied to interpret theoretical predictions in order to overcome the overestimates that often affect approaches based only on the concept of Growth Function or VC-dim.

Acknowledgements The authors wish to thank the Reviewers for their critical reading of the manuscript and for providing valuable suggestions, which helped improve the technical contents and the overall presentation quality.

Appendix

Property 1. Let A and B denote two learning machines having the same VC-dim: h_A = h_B. Then there exists some case for which GF_A(m) ≠ GF_B(m).

Proof. Let A be a linear perceptron over the d-dimensional space; it is known [23] that, in this case, h_A = d + 1 and

GF_A(m) = 2^m  if m ≤ h_A;    GF_A(m) = 2 Σ_{i=0}^{d} C(m−1, i)  if m > h_A.

Let B be a nearest-prototype classifier with n_h = d + 1 prototypes; in this


case, theory shows [17] that h_B = n_h = d + 1 and GF_B(m) = 2^{n_h} independently of m. Therefore one has: h_A = h_B and, ∀m > d + 1, GF_A(m) ≠ GF_B(m).

Property 2. Let A and B be two learning machines such that ∃m: GF_A(m) = GF_B(m). Then there exists some case for which E{max_A(ν₂ − ν₁)} ≠ E{max_B(ν₂ − ν₁)} over the same sample.

Proof. Consider a discrete distribution, X, including np patterns drawn from the set Z of integer numbers in the range (−∞, +∞); possible target values are '1' and '0'. Let A be the ray classifier [1] defined as: f_A^(T)(x) = '1' if x ≥ T, and f_A^(T)(x) = '0' otherwise. Clearly, the number of pattern configurations spanned over X that A can classify correctly is GF_A(np) = np + 1. Computing max_A(ν₂ − ν₁) follows the procedure described in (13); the classification error, ν̄_A, for a given target configuration is obtained as the minimum of the number of '0's following the leftmost '1' and the number of '1's preceding the rightmost '0', so that

E{ν̄_A} = (1/2^np) (1/np) Σ min{ # of '0' following the leftmost '1', # of '1' preceding the rightmost '0' },

where the sum runs over all target configurations. By imposing np = 4 one has E{ν̄_A} = 12/64 = 0.1875 and E{max_A(ν₂ − ν₁)} = 1 − 2 E{ν̄_A} = 0.625.

Let now B be the odd '1-point' classifier, defined as: f_B^(x₀)(x₀) = '1', x₀ ∈ Z, and f_B^(x₀)(x) = '0' ∀x ≠ x₀, x ∈ Z. One can easily verify that the number of target assignments over the sample X that B can handle correctly is GF_B(np) = np + 1. In order to compute ν̄_B, let k denote, for a given target configuration, the number of patterns in X labeled as class '1'; then one enumerates all possible target configurations and counts classification errors as

E{ν̄_B} = (1/2^np) (1/np) Σ_{k=1}^{np} C(np, k) (k − 1).

In the sample case, np = 4 gives E{ν̄_B} = 0.2656 and E{max_B(ν₂ − ν₁)} = 1 − 2 E{ν̄_B} = 0.4688. Thus, GF_A(np) = GF_B(np) ∀np, but E{max_A(ν₂ − ν₁)} ≠ E{max_B(ν₂ − ν₁)}.
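The counting argument for classifier B can be double-checked by brute-force enumeration; the small sketch below (Python, np = 4, written here only as an illustration of the counting and not part of the original appendix) reproduces E{ν̄_B} ≈ 0.2656.

```python
from itertools import product

n = 4
# Labelings realizable by the one-point classifier B on n distinct integers:
# the all-zero labeling plus the n labelings containing a single '1'.
realizable = [(0,) * n] + [tuple(int(i == j) for i in range(n)) for j in range(n)]

total = 0
for target in product((0, 1), repeat=n):              # all 2**n target configurations
    best = min(sum(a != b for a, b in zip(target, r)) for r in realizable)
    total += best                                      # minimum attainable loss

print(total / (2 ** n) / n)                            # 0.265625, i.e. the 0.2656 above
```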

When applying the concept of connectedness to the above sample cases, one enumerates the target configurations that are covered by the Growth Function of each classifier. In the case np = 4, the 5 configurations accounted for by the GFs can be listed as follows:

A ≡ Ray classifier      B ≡ One-point classifier
'0000'                  '0000'
'0001'                  '0001'
'0011'                  '0010'
'0111'                  '0100'
'1111'                  '1000'


Averaging the Hamming distance, H, for the configuration set associated with each classifier gives the average values H_A = 2 and H_B = 1.6 for the ray and one-point classifiers, respectively. This result indicates that classifier B seems more connected than A. According to the theory illustrated in [21], this justifies the fact that classifier B exhibits a smaller discrepancy value than A, even though the two Growth Functions coincide.

References

[1] M. Anthony, N. Biggs, Computational Learning Theory, Cambridge University Press, Cambridge, 1992.
[2] P. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inform. Theory 44 (2) (1998) 525–536.
[3] P. Bartlett, S. Boucheron, G. Lugosi, Model selection and error estimation, Mach. Learn. 48 (1–3) (2002) 85–113.
[4] C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining Knowledge Discovery 2 (2) (1998) 121–167.
[5] A. Cannon, J.M. Ettinger, D. Hush, C. Scovel, Machine learning with data dependent hypothesis classes, J. Mach. Learn. Res. 2 (2002) 335–358.
[6] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (1–2) (2002) 131–159.
[7] C. Cortes, V. Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273–297.
[8] K. Duan, S. Keerthi, A. Poo, Evaluation of simple performance measures for tuning SVM hyperparameters, Technical Report CD-01-11, Department of Mechanical Engineering, National University of Singapore, Singapore, 2001.
[9] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, 1993.
[10] R. Fletcher, Practical Methods of Optimization, 2nd Edition, Wiley, New York, 1987.
[11] S. Floyd, M. Warmuth, Sample compression, learnability, and the Vapnik–Chervonenkis dimension, Mach. Learn. 21 (1995) 1–36.
[12] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. 58 (1963) 13–30.
[13] http://www.kernel-machines.org/software.html
[14] T. Joachims, The maximum-margin approach to learning text classifiers: methods, theory and algorithms, Ph.D. Thesis, Department of Computer Science, University of Dortmund, 2000.
[15] D.G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1973.
[16] S. Ridella, S. Rovetta, R. Zunino, Circular backpropagation networks embed vector quantization, IEEE Trans. Neural Networks 10 (4) (1999) 972–975.
[17] S. Ridella, S. Rovetta, R. Zunino, K-Winner Machines for pattern classification, IEEE Trans. Neural Networks 12 (2) (2001) 371–385.
[18] S. Ridella, R. Zunino, Using K-Winner Machines for domain analysis—Part II: applications, Neurocomputing, submitted for publication.
[19] B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[20] X. Shao, V. Cherkassky, W. Li, Measuring the VC-dimension using optimized experimental design, Neural Comput. 12 (2000) 1969–1986.
[21] J. Sill, Monotonicity and connectedness in learning systems, Ph.D. Thesis, California Institute of Technology, 1998.
[22] J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, R.S. Johannes, Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, in: Proceedings of the IEEE Symposium on Computer Applications and Medical Care, IEEE Computer Society Press, Silver Spring, MD, 1988, pp. 261–265.
[23] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, New York, 1998.


[24] V. Vapnik, E. Levin, Y. Le Cun, Measuring the VC-dimension of a learning machine, Neural Comput. 6 (1994) 851–876.
[25] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[26] R.C. Williamson, J. Shawe-Taylor, B. Schölkopf, A.J. Smola, Sample based generalization bounds, NeuroCOLT2 Technical Report Series, NC-TR-1999-055, November 1999.
[27] W.H. Wolberg, O.L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proc. Natl. Acad. Sci. USA 87 (1990) 9193–9196.

Davide Anguita (M'93) graduated in Electronic Engineering in 1989 and obtained the Ph.D. in Computer Science and Electronic Engineering at the University of Genova, Italy, in 1993. After working as a research associate at the International Computer Science Institute, Berkeley, USA, on special-purpose processors for neurocomputing, he joined the Department of Biophysical and Electronic Engineering at the University of Genova, where he teaches digital electronics. His current research focuses on industrial applications of artificial neural networks and kernel methods and their implementation on digital and analog electronic devices. He is a member of IEEE and chair of the Smart Adaptive Systems committee of the European Network on Intelligent Technologies (EUNITE).

Sandro Ridella (M'93) received the Laurea degree in electronic engineering from the University of Genova, Italy, in 1966. He is a full Professor in the Department of Biophysical and Electronic Engineering, University of Genova, Italy, where he teaches Inductive Learning. In the last nine years, his scientific activity has been mainly focused on the field of neural networks.

Fabio Rivieccio received the Laurea degree in electronic engineering in 1999 and is currently pursuing his Ph.D. in electronic engineering and computer science. His research interests include applications of support vector machines and model selection criteria.

Rodolfo Zunino (S'90–M'90) received the Laurea degree in electronic engineering from Genova University, Italy, in 1985. From 1986 to 1995, he was a Research Consultant with the Department of Biophysical and Electronic Engineering of Genova University. He is currently with the same department as an Associate Professor in Industrial Electronics and Application-oriented Electronic Systems. His main scientific interests include electronic systems for neural networks, efficient models for data representation and learning, advanced techniques for multimedia data processing, and distributed-control methodologies. Since 2001 he has been contributing as an Associate Editor of the IEEE Transactions on Neural Networks.
