Kernel Machines for Non-vectorial Data

F.J. Ruiz¹, C. Angulo¹, N. Agell², and A. Català¹

¹ GREC - Knowledge Engineering Research Group, UPC - Universitat Politècnica de Catalunya, Spain
{francisco.javier.ruiz,cecilio.angulo,andreu.catala}@upc.edu
² GREC - Knowledge Engineering Research Group, ESADE-URL - Universitat Ramon Llull, Spain
[email protected]

Abstract. This work presents a short introduction to the main ideas behind the design of specific kernel functions for machine learning algorithms, for example support vector machines, when the patterns involved are described by non-vectorial information. In particular, the case of interval data is analysed as an illustrative example: explicit kernels based on the centre-radius diagram are formulated for closed bounded intervals on the real line.

1 Introduction

A computer program is said to learn from experience with respect to some task and some fitness measure if it is able to complete the task with a fitness value that improves with experience [7]. The main goal of such a program is, essentially, to estimate the value of an output variable from a set of input variables. The experience, composed of a set of patterns with known inputs and, in the supervised case, known outputs, is used by supervised learning systems to derive the input-output relationship. Once the learning phase is finished, the system is ready to evaluate new instances.

The simplest supervised learning systems are linear classifiers, which determine an adequate set of weights for a linear mapping between input and output variables. However, they are not effective when the data are not linearly separable. A direct method to deal with this more general problem is to expand linear learning systems into non-linear ones by considering non-linear decision functions, for example Multi-Layer Perceptrons with sigmoidal activation functions.

A different approach to the non-linear separability problem is to apply the so-called 'kernel trick' [1]. This methodology can be considered whenever the learning algorithm has a formulation that depends only on some inner product of the input variables, as in radial basis function networks, whose decision function is a linear expansion over a basis of radial functions. In fact, this property is also shared by the formulations of support vector machines, Fisher's discriminant analysis, principal component analysis, etc. The kernel trick is based on substituting the inner product in the input space with an adequate bivariate function, the kernel K(·, ·), satisfying the Mercer condition [6].
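As an aside not in the original paper, the following Python sketch illustrates the kernel trick on a toy problem: a dual-form perceptron in which the training patterns enter only through kernel evaluations, so that swapping the plain dot product for a Gaussian kernel separates an XOR-like set that no linear classifier in the input space can handle. Function names, the kernel width and the data are illustrative choices.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def kernel_perceptron(X, y, kernel, epochs=100):
    """Dual (kernelised) perceptron: the data appear only inside kernel evaluations."""
    n = len(X)
    alpha = np.zeros(n)  # one mistake counter per training pattern
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        mistakes = 0
        for j in range(n):
            f = np.sum(alpha * y * K[:, j])  # decision value built from inner products only
            if y[j] * f <= 0:
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

# XOR-like data: not linearly separable in the input space R^2.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = kernel_perceptron(X, y, gaussian_kernel)
for x_new in X:
    score = sum(a * yi * gaussian_kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X))
    print(x_new, int(np.sign(score)))  # all four points end up on the correct side
```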


Mercer's theorem ensures in this case that the kernel function expresses an inner product of the images of the input data in a different space, the feature space, according to some, possibly unknown, mapping. Learning is then performed in this space in an easier, eventually linear, form than in the original input space, because the new space usually has a higher dimensionality.

Machine learning systems based on kernel functions are specially suitable for non-vectorial learning problems, that is, for non-quantitative input patterns, because the representation in the feature space induced by the kernel function allows any kind of data to be used, even data coming from unstructured input spaces. The space that must be endowed with some structure is the feature space, where the learning is actually performed. Therefore, these learning systems can be applied to non-vectorial data provided an adequate representation of them exists in a structured feature space.

Two approaches can be followed for defining kernel functions: using a bivariate function that satisfies the Mercer condition, or defining a certain feature space through a specific map. In the former case, by defining the kernel function, neither the feature space nor the inner product needs to be explicitly defined. In the latter, the mapping function defining the Euclidean feature space and the inner product are explicitly defined, and their composition is a kernel function. This strategy is used for defining, for instance, kernels on strings of characters [5], and it allows this kind of learning system to be used on any type of data: documents, images, graphs, etc.

In Section 2, kernel machines are defined from the perspective of support vector machines, a state-of-the-art machine learning algorithm searching for linear discriminants. Next, they are applied to non-vectorial data, in particular to interval data, and several approaches for using this kind of data in learning algorithms are derived. Finally, some conclusions about the developed work are presented.

2 Kernel Machines

Kernel machines are a relatively new class of learning algorithms that look for a linear discriminant solution and to which the kernel trick can be applied. The support vector machine is a learning method, initially designed for supervised bi-classification problems, that finds a maximal-margin separating hyperplane between classes [10]. Support vectors are those patterns whose distance to the optimal hyperplane is exactly the largest geometrical margin; therefore, they determine it univocally. The problem of searching for the maximum of a (convex) margin function subject to a (convex) constraint set can be treated as an optimisation problem, a convex quadratic programming (QP) problem in the case of the SVM, which can be stated in primal or dual form. The main advantage of this type of optimisation is that the solution, the linear discriminant with maximal margin, is a global one. For the SVM, the dual formulation of the QP problem used for searching the discriminant function is based only on inner products of the training patterns, not directly on the patterns themselves.


Fig. 1. Representation of the feature space. Patterns in the input space become linearly separable in the feature space.

Likewise, the resulting discriminant function depends only on the inner products of the training patterns with the test patterns. These facts allow the use of the kernel trick, i.e. replacing the inner product in the input space with an adequate function K representing an inner product in a certain Euclidean (possibly Hilbert) feature space (see Figure 1). The dimension of the feature space is usually higher than that of the input space, possibly infinite, which provides linear separability of the images of the input patterns. It could be argued that this favourable property also increases the complexity of the learning algorithm due to the curse of dimensionality; however, this is not the case, because the data in the feature space are never handled explicitly by the algorithm: only their inner products appear in the discriminant function.

One of the simplest kernel functions defined on Rⁿ is the (non-homogeneous) polynomial kernel, defined as

K(x, y) = (⟨x, y⟩ + c)^d,    (1)

which represents the feature space of all the monomials of degree lower than or equal to d. The generated feature space has dimension

( (n + d) choose d ) = (d + n)! / (d! · n!),    (2)

which, for instance, in the case n = 10 and d = 4 is 1001. Another kernel function, surely the most popular one, is the Gaussian kernel,

K(x, y) = exp( −||x − y||² / (2σ²) ),    (3)

with σ > 0.
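As a quick numerical illustration (code not from the paper; parameter values are arbitrary), the kernels of Eqs. (1) and (3) and the dimension count of Eq. (2) can be sketched as follows.

```python
import math
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=4):
    """Non-homogeneous polynomial kernel of Eq. (1): K(x, y) = (<x, y> + c)^d."""
    return (np.dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel of Eq. (3): K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Dimension of the space of monomials of degree <= d in R^n, Eq. (2).
n, d = 10, 4
print(math.comb(n + d, d))  # (d + n)! / (d! n!) = 1001, as quoted in the text

x, y = np.random.randn(n), np.random.randn(n)
print(polynomial_kernel(x, y), gaussian_kernel(x, y))
```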


By selecting the Gaussian kernel, the feature space considered is infinite-dimensional; hence any training set can be effectively separated.

The Mercer condition of positive definiteness ensures that a function K can be considered an inner product in a certain (possibly unknown) Hilbert space. In fact, thinking about the training patterns used by the learning machines, this condition is equivalent to ensuring that for any subset {x₁, ..., xₙ} of instances in the input space, the associated Gram matrix, with entries K(xᵢ, xⱼ), i, j = 1, ..., n, is symmetric and positive semi-definite, i.e. it has no negative eigenvalues. It is not straightforward to validate the Mercer condition for a bivariate function in order to confirm that it is a kernel; nevertheless, it is possible to build new kernels from existing ones by applying some simple closure properties such as addition, product, etc. [3]. Perhaps the most interesting one states that the normalised images of the input patterns in the feature space also define a kernel, because it is not necessary to know the implied feature space to normalise the images:

Kⁿ(x, y) = ⟨ φ(x)/||φ(x)||, φ(y)/||φ(y)|| ⟩ = K(x, y) / √(K(x, x) K(y, y)).    (4)
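The Gram-matrix characterisation of the Mercer condition and the normalised kernel of Eq. (4) can be checked empirically on any finite sample; the sketch below (illustrative code, sample points chosen arbitrarily) does exactly that with the linear kernel.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix with entries K(x_i, x_j) for a finite sample X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def is_positive_semidefinite(K, tol=1e-10):
    """Empirical Mercer check on a sample: symmetry and no negative eigenvalues."""
    return bool(np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) >= -tol))

def normalised_kernel(kernel):
    """Normalised kernel of Eq. (4): K(x, y) / sqrt(K(x, x) K(y, y))."""
    def kn(x, y):
        return kernel(x, y) / np.sqrt(kernel(x, x) * kernel(y, y))
    return kn

linear = lambda u, v: float(np.dot(u, v))  # the plain dot product is a kernel
X = [np.array([1.0, 2.0]), np.array([0.0, 1.0]), np.array([3.0, -1.0])]
print(is_positive_semidefinite(gram_matrix(linear, X)))                     # True
print(is_positive_semidefinite(gram_matrix(normalised_kernel(linear), X)))  # True
```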

3 Non-vectorial Structures

Patterns employed by learning techniques are usually real variables. Standard kernel functions like the polynomial or Gaussian ones meet this requirement. However, the data processed in many learning procedures are not numerical at all, but are described by different qualitative structures. The nature of the kernel trick, sending the original patterns of the input space to another Euclidean feature space where learning is performed, enables machine learning systems to deal with this kind of non-vectorial data. The term non-vectorial is kept as general as possible in order to cover a wide range of cases: graphs, categorical, fuzzy, interval [2], and order-of-magnitude [9] variables. Interval-type variables will be analysed in particular in this work.

3.1 Interval Kernels

Roughly speaking, a kernel is a similarity measure that arises from a particular representation of the patterns. The simplest kernel is the usual dot product (known as the linear kernel), where the feature space is equal to the input space. The geometric interpretation of this linear kernel is that it computes the cosine of the angle between two vectors, provided they are normalised to length 1. A kernel defined on an arbitrary non-empty set X allows a distance function (indeed a pseudo-distance) to be considered. If (E, ⟨·, ·⟩) is a Euclidean space and φ : X → E a map, the function

d(x, y) = √( K(x, x) + K(y, y) − 2K(x, y) ) = √( ⟨φ(x) − φ(y), φ(x) − φ(y)⟩ )    (5)

is a pseudo-distance (a distance if φ is injective) defined on X.
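A small helper (illustrative, not from the paper) turns Eq. (5) into code; with the linear kernel the induced pseudo-distance reduces to the ordinary Euclidean distance.

```python
import numpy as np

def kernel_distance(kernel, x, y):
    """Pseudo-distance induced by a kernel, Eq. (5)."""
    return np.sqrt(kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y))

linear = lambda u, v: float(np.dot(u, v))
x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(kernel_distance(linear, x, y))  # 5.0
print(np.linalg.norm(x - y))          # 5.0, the same value
```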


In non-vectorial spaces such as the set of real closed and bounded intervals, I(R), there is no direct similarity measure based on an inner product. However, when a kernel is considered on such a space, an indirect similarity measure is also obtained. The map φ must be able to emphasise the basic features of the input space. The centre-radius representation [4], associating to every interval [a, b] = B(c, r) in I(R) the point (c, r) ∈ R², is one of the simplest maps that can be considered. In this way, the set I(R) is represented by the points of the half-plane r > 0, and this map captures the two features of an interval: its position and its precision, the former associated with the centre and the latter with the radius.

3.2 Defining an Interval Distance from a Kernel

For the space I(R), the most general choice to define a distance is to use the centre-radius representation. The kernel machine perspective drives this first choice to define a mapping φ associating every interval to a vector in R² in the natural form¹

φ(I) = (1/√2) (c − r, c + r)ᵗ = (1/√2) [[1, −1], [1, 1]] (c, r)ᵗ = P (c, r)ᵗ ∈ R²,    (6)

where P is an orthogonal matrix² (PᵗP = I). The image of the mapping φ is the half-plane {(c, r) ∈ R² : r > 0} transformed by P. The kernel function associated to this mapping φ, which allows a similarity function between intervals to be established, is k : I(R) × I(R) → R defined by

k(I₁, I₂) = ⟨φ(I₁), φ(I₂)⟩ = c₁c₂ + r₁r₂.    (7)

The most interesting property of this similarity kernel function is that it allows a distance between intervals to be defined from the distance between their images in R². Given two intervals Iᵢ = (aᵢ, bᵢ) = B(cᵢ, rᵢ), i = 1, 2, the distance between them, denoted d(I₁, I₂), is defined as

d(I₁, I₂) = (1/√2) √( (a₂ − a₁)² + (b₂ − b₁)² ) = √( Δc² + Δr² ),    (8)

with Δc = c₂ − c₁ and Δr = r₂ − r₁, since

d²(I₁, I₂) = ⟨φ(I₂) − φ(I₁), φ(I₂) − φ(I₁)⟩ = (1/2)(Δc − Δr)² + (1/2)(Δc + Δr)² = Δc² + Δr².    (9)
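The centre-radius kernel and distance of Eqs. (6)-(9) are easy to reproduce numerically; the sketch below (illustrative code, interval values chosen arbitrarily) checks that the closed forms agree with the inner product and norm of the mapped points.

```python
import numpy as np

def phi(interval):
    """Map of Eq. (6): [a, b] = B(c, r) -> (1/sqrt(2)) (c - r, c + r) = (a, b)/sqrt(2)."""
    a, b = interval
    return np.array([a, b]) / np.sqrt(2.0)

def centre_radius(interval):
    a, b = interval
    return (a + b) / 2.0, (b - a) / 2.0

def interval_kernel(i1, i2):
    """Interval kernel of Eq. (7): k(I1, I2) = c1 c2 + r1 r2."""
    c1, r1 = centre_radius(i1)
    c2, r2 = centre_radius(i2)
    return c1 * c2 + r1 * r2

def interval_distance(i1, i2):
    """Interval distance of Eq. (8), computed through the kernel as in Eq. (5)."""
    k = interval_kernel
    return np.sqrt(k(i1, i1) + k(i2, i2) - 2.0 * k(i1, i2))

I1, I2 = (1.0, 3.0), (2.0, 6.0)  # centres 2 and 4, radii 1 and 2
print(np.isclose(interval_kernel(I1, I2), float(np.dot(phi(I1), phi(I2)))))  # True
print(interval_distance(I1, I2))  # sqrt(Delta c^2 + Delta r^2) = sqrt(5)
```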

It can be noticed that the defined distance is the ℓ₂ distance on the components Δc and Δr. Hence, the defined distance takes into consideration the weighted relation between the distance between the centres of the intervals, Δc, and their relative size, Δr.

¹ The term 1/√2 is inserted, as will be seen later, in order to obtain an orthogonal transformation.
² In fact, to ensure that the mapping is injective, it is enough that P be non-singular.


In quadratic form it can be expressed as

d²(I₁, I₂) = (Δc, Δr) [[1, 0], [0, 1]] (Δc, Δr)ᵗ.    (10)

3.3 Generalising the Interval Distance

Several more or less direct generalisations can be derived from the initially defined interval distance.

Weighted Interval Distance. A generalisation that weights both the interval sizes and the distance between them can be considered for unbalanced measures. To this end, a mapping φ is defined by embedding the original input space in a feature space in the form

φ(I) = A (c, r)ᵗ = [[a₁₁, a₁₂], [a₂₁, a₂₂]] (c, r)ᵗ,    (11)

so that

k(I₁, I₂) = ⟨ A (c₁, r₁)ᵗ, A (c₂, r₂)ᵗ ⟩ = (c₁, r₁) S (c₂, r₂)ᵗ,    (12)

and therefore

d²(I₁, I₂) = ⟨ A (Δc, Δr)ᵗ, A (Δc, Δr)ᵗ ⟩ = (Δc, Δr) S (Δc, Δr)ᵗ,    (13)

where A is a non-singular matrix, to preserve the injectivity of the mapping φ, and S = AᵗA is a symmetric positive definite matrix. In this form, the balance between the position and the size of the intervals can be controlled.

Interval Distance Defined on a Compact Support. The similarity kernel function that allows the interval distance to be defined is not bounded, so it could be improved to be more useful when representing results. Furthermore, the defined distance is not able to deal with semi-open intervals when the working support is the whole real line R. In order to bound the similarity and the distance, the definition of the similarity kernel will be analysed for intervals inside a compact set of R, in general [α, β], with α, β ∈ R and α < β. Let [α, β] be a compact (closed and bounded interval of R) and consider the set

I₁ = {intervals (a, b) : α ≤ a < b ≤ β}.    (14)


On this set a new similarity measure is defined through the map φ₁ : I₁ → R², with

φ₁((a, b)) = (1/√2) (a − α, β − b)ᵗ.    (15)

It can be noticed that this mapping can be expressed in the form

φ₁((a, b)) = (1/√2) (c − r − α, β − c − r)ᵗ = (1/√2) [[1, −1], [−1, −1]] (c, r)ᵗ + (1/√2) (−α, β)ᵗ = P₁ (c, r)ᵗ + Λ,    (16)

where P₁ is an orthogonal matrix. Hence, the obtained distance does not vary with respect to the one initially formulated, because

φ₁(I₂) − φ₁(I₁) = P₁ (Δc, Δr)ᵗ    (17)

and P₁ᵗP₁ = PᵗP = I. Moreover, as desired, by the Weierstrass theorem the distance is now bounded, with d(I₁, I₂) ≤ β − α for any pair of intervals. The formulation of the new distance is simple; nevertheless, the associated similarity kernel function is more complex,

k((a₁, b₁), (a₂, b₂)) = c₁c₂ + r₁r₂ + C² + R² − C(c₁ + c₂) − R(r₁ + r₂),    (18)

with C = (α + β)/2 being the centre of the compact support and R = (β − α)/2 its radius.
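Both generalisations of this section can be sketched in a few lines of illustrative code (the weighting matrix A, the support [α, β] and the interval values are arbitrary choices): the weighted distance of Eqs. (11)-(13), and the compact-support kernel of Eqs. (15)-(18), checking numerically that the closed form of Eq. (18) coincides with the inner product of the φ₁ images and that the induced distance is bounded by β − α.

```python
import numpy as np

def centre_radius(interval):
    a, b = interval
    return (a + b) / 2.0, (b - a) / 2.0

def weighted_interval_kernel(i1, i2, A):
    """Weighted kernel of Eqs. (11)-(12): k(I1, I2) = (c1, r1) S (c2, r2)^t with S = A^t A."""
    v1 = np.array(centre_radius(i1))
    v2 = np.array(centre_radius(i2))
    S = A.T @ A  # symmetric positive definite when A is non-singular
    return float(v1 @ S @ v2)

def phi1(interval, alpha, beta):
    """Map of Eq. (15) for intervals inside the compact support [alpha, beta]."""
    a, b = interval
    return np.array([a - alpha, beta - b]) / np.sqrt(2.0)

def compact_kernel(i1, i2, alpha, beta):
    """Closed form of Eq. (18); C and R are the centre and radius of [alpha, beta]."""
    c1, r1 = centre_radius(i1)
    c2, r2 = centre_radius(i2)
    C, R = (alpha + beta) / 2.0, (beta - alpha) / 2.0
    return c1 * c2 + r1 * r2 + C ** 2 + R ** 2 - C * (c1 + c2) - R * (r1 + r2)

I1, I2 = (1.0, 3.0), (2.0, 6.0)

# Weighted distance of Eq. (13): A trades off centre position against radius.
A = np.array([[2.0, 0.0], [0.0, 1.0]])  # centres weighted twice as much as radii
k = lambda u, v: weighted_interval_kernel(u, v, A)
print(np.sqrt(k(I1, I1) + k(I2, I2) - 2.0 * k(I1, I2)))  # sqrt((2*Dc)^2 + Dr^2) = sqrt(17)

# Compact-support kernel: closed form of Eq. (18) vs. inner product of the phi1 images.
alpha, beta = 0.0, 10.0
lhs = compact_kernel(I1, I2, alpha, beta)
rhs = float(phi1(I1, alpha, beta) @ phi1(I2, alpha, beta))
print(np.isclose(lhs, rhs))  # True

# The induced distance is bounded by beta - alpha, as stated in the text.
d = np.sqrt(compact_kernel(I1, I1, alpha, beta) + compact_kernel(I2, I2, alpha, beta)
            - 2.0 * compact_kernel(I1, I2, alpha, beta))
print(d, d <= beta - alpha)  # sqrt(5), True
```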

4 Conclusions and Future Work

Machine learning systems based on kernel functions, SVMs being the most representative among them, are gaining high relevance in the research community due to their high accuracy even with reduced training sets in high-dimensional input spaces, their theoretical justification in statistical learning theory, and their nature, performing learning in a feature space, which allows any kind of data to be learnt from. Inspired by the way kernel machines work, it has been shown how non-vectorial data can be handled by them. In particular, interval data have been considered by using the centre-radius diagram in order to represent them as points in a plane. Several distances have been defined on this space and some of their characteristics have been analysed. Considering R² as a feature space for the set of intervals allows any kernel defined on it, e.g. polynomial, Gaussian, etc., to be composed with any of the maps introduced above; this composition gives rise to another valid kernel. The kernels defined on the interval space were all explicit kernels, i.e. they are defined by means of an explicit feature space and an explicit map. Other methodologies can be applied to search for kernels on the interval space and on other non-vectorial structures, such as looking for functions K that directly satisfy the Mercer condition [8].


Acknowledgements. This research has been partially supported by the research projects AURA (TIN2005-08873-C02-01,02) and EXODUS-ADA (DPI2006-15630-C02-01) of the Spanish Ministry of Education and Science.

References

1. Aizerman, M.A., Braverman, E.A., Rozonoer, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964)
2. Angulo, C., Anguita, D., González, L.: Interval discriminant analysis using support vector machines. In: ESANN, forthcoming (2007)
3. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
4. Kulpa, Z.: Diagrammatic representation for interval arithmetic. Linear Algebra and its Applications 324(1–3), 55–80 (2001)
5. Lodhi, H., Shawe-Taylor, J., Cristianini, N., Watkins, C.J.C.H.: Text classification using string kernels. Adv. Neural Inform. Process. Syst., pp. 563–569 (2000)
6. Mercer, J.: Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London 209, 415–446 (1909)
7. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
8. Ruiz, F.J., Angulo, C., Agell, N.: A kernel intersection defined on intervals. In: Aguiló, I., Vitrià, J., Radeva, P. (eds.) Frontiers in Artificial Intelligence and Applications: Recent Advances in Artificial Intelligence Research and Development, pp. 103–110. IOS Press, Amsterdam (2004)
9. Sánchez, M., Prats, F., Agell, N., Rovira, X.: Kernel functions over orders of magnitude spaces by means of usual kernels. Application to measure financial credit risk. In: Current Topics in AI — CAEPIA 2003. Lecture Notes in Artificial Intelligence, Vol. 3040, pp. 69–78. Springer (2003)
10. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
