Transferring neural network based knowledge into an exemplar-based learner

Neural Comput & Applic (2007) 16:257–265 DOI 10.1007/s00521-007-0088-8

ORIGINAL ARTICLE

Maria do Carmo Nicoletti · Lucas Baggio Figueira · Estevam R. Hruschka Jr

Received: 1 December 2006 / Accepted: 21 December 2006 / Published online: 17 February 2007
© Springer-Verlag London Limited 2007

Abstract This paper investigates knowledge transfer from a neural network based system into an exemplar-based learning system. In order to examine the possibilities of such transfer, it proposes and evaluates a system that implements a collaborative scheme, where a particular type of neural network induced by the neural system RuleNet is used by an exemplar-based system (NGE) to carry out a learning task. The proposed collaboration between the two learning models, implemented as the hybrid system RuleNet → NGE, is feasible due to the similarity of the concept description languages employed by both. The paper also describes a few experiments conducted; the results show that the RuleNet-NGE collaboration is plausible and, in some domains, improves the performance obtained by NGE on its own.

Keywords Knowledge transfer · RuleNet · NGE · Hybrid systems

M. do Carmo Nicoletti (✉) · E. R. Hruschka Jr
Computer Science Department, UFSCar, Sao Carlos, Brazil
e-mail: [email protected]
URL: www.dc.ufscar.br/~carmo

E. R. Hruschka Jr
e-mail: [email protected]

L. B. Figueira
Physics and Math Department, DFM-FFCLRP, University of Sao Paulo, Sao Carlos, Brazil
e-mail: [email protected]

1 Introduction

One of the main goals of inductive machine learning (ML) research is to develop computational tools capable of automatic knowledge acquisition (see [1] for a characterization of the area and its main contributions). Inductive learning models have been widely explored and have effectively been used as the basis for implementing very successful machine learning systems. Inductive ML systems can implement a broad variety of models, such as instance-based, neural network, decision tree, rule-based, and so on. Independently of the model, however, each ML system is strongly dependent on many factors, such as the expressiveness of the representation languages used for describing training instances and induced concepts, the use (or not) of domain knowledge for inducing concepts, the degree of generalization employed, the user-defined parameters, the type of search mechanism employed, etc.; consequently, each system has advantages and disadvantages, depending on the learning situation in which it is used.

Aiming at overcoming and compensating for some frailties of machine learning models, it is worth considering hybrid systems, which implement a combination of models, allowing systems to collaborate among themselves in carrying out learning subtasks, as well as to collaborate by means of the transfer of knowledge between them. This paper proposes a collaboration between a neural network system known as RuleNet [2, 3] and an exemplar-based system known as Nested Generalized Exemplar (NGE) [4, 5], and aims to investigate the plausibility of such collaboration, to evaluate its



advantages, as well as to identify learning situations where it can be useful.

The motivation for the hybrid system proposed in this work, implemented as the system RuleNet → NGE, is to promote a neural-symbolic collaboration by transferring the knowledge induced by RuleNet, as a neural network, into an NGE-based learner. NGE then uses this knowledge to establish its initial set of hypotheses and carries on its own learning process using a training set. Implementing this idea was a particularly straightforward process because of the similarity between the knowledge representation languages used by both systems.

This paper is organized as follows: Sect. 2 describes the main characteristics of the neural network method named RuleNet and its variation named XRuleNet. Sect. 3 presents a general review of the exemplar-based method named NGE and its greedy version NONGE. In Sect. 4, the experiments conducted using the hybrid system RuleNet → NGE (as well as some of its variations) are described and the results are analyzed. Sect. 5 presents the conclusions and highlights possible next steps.
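Both representations detailed in Sects. 2 and 3 are sets of class-labeled, axis-parallel hyper-rectangles, which is what makes the transfer straightforward. The sketch below illustrates the idea only; the Rectangle structure, the hidden-node fields W, lam_L, lam_R and cls, and the function name are hypothetical stand-ins, not the authors' code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rectangle:
    """Axis-parallel hyper-rectangle: one closed interval per attribute."""
    lower: List[float]   # lower bound per dimension
    upper: List[float]   # upper bound per dimension
    label: str           # associated class

def transfer_rulenet_to_nge(hidden_nodes) -> List[Rectangle]:
    """Turn RuleNet hidden nodes into NGE's initial hypothesis set.

    Each hidden node is assumed to expose a reference point W, left/right
    widths lam_L/lam_R per dimension, and the class cls of its output node.
    """
    seeds = []
    for node in hidden_nodes:
        lower = [w - l for w, l in zip(node.W, node.lam_L)]
        upper = [w + r for w, r in zip(node.W, node.lam_R)]
        seeds.append(Rectangle(lower, upper, node.cls))
    return seeds  # NGE then refines these seeds with its own training pass
```

NGE treats the returned rectangles exactly as it would treat its own user-defined seeds, so no change to its learning procedure is needed.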

2 The RuleNet method

In the context of this work, RuleNet is understood to be either the learning algorithm or the resulting network. RuleNet is a simple constructive neural learning method with a direct propagation algorithm, suitable for classification tasks. Although the RuleNet model was proposed as a feedforward neural network with discrete outputs, it has a very strong symbolic aspect, due both to its learning algorithm and to the language used to represent the induced concept.

A RuleNet network has three layers: input, hidden and output; only one-way connections are allowed between neighboring layers. The input layer is fully connected to the hidden layer, and each hidden node is connected to a single output node through a constant weight. The input nodes have a buffering function only, and each of them represents a single attribute describing the training instances. Figure 1 shows a general RuleNet architecture.

Fig. 1 Network architecture of RuleNet (input nodes u1, ..., uN with input values x1, ..., xN; hidden nodes H1, ..., Hk; output nodes O1, ..., OM; weights wij between the input and hidden layers and constant weights vij between the hidden and output layers)

In a RuleNet network, the hidden and the output layers are empty at the beginning of the training phase and are constructed along with the learning process. After learning, the hidden layer of a RuleNet can be seen as a set of hyper-rectangles whose associated classes are given by the output nodes. The hidden nodes are modified radius-limited perceptrons; geometrically, this modification means that the influence regions of the hidden nodes describe hyper-rectangles instead of hyper-spheres (see Fig. 2). During the training phase, hidden nodes are added to the network as needed. A hidden node is added in its most general shape, i.e., the largest hyper-rectangle possible, respecting the limits given by a default value and provided it does not overlap others connected to output nodes representing different classes. A connection between a hidden node and an output node simply informs whether the hyper-rectangle described by the hidden node belongs to the class represented by the output node.

According to [6], a RuleNet system is a three-layer (U_I, U_H and U_O) radial basis function network with the following specifications:

1. The input layer U_I is fully connected to the hidden layer U_H.

2. Each hidden node v ∈ U_H is connected to exactly one output node u ∈ U_O. The network input of a hidden node v ∈ U_H is

   net_v = ||o − w_v||_∞ = max{ |o_u1 − W(u1, v)|, ..., |o_un − W(un, v)| }

   where o = (o_u1, ..., o_un) is the vector of the output values of the input nodes and w_v = [W(u1, v), ..., W(un, v)] is the weight vector of the hidden node v.


3. For each hidden node v ∈ U_H there are two vectors λ^L and λ^R ∈ R^|U_I|. They determine, for each input node u ∈ U_I connected to v, the left and right influence region of this hidden node. Figure 2 shows the reference point of a hidden node vj relative to input node ui.

Fig. 2 Influence region of a hidden node (i = 1, ..., n, where n is the number of input nodes; j = 1, ..., m, where m is the number of hidden nodes; x_ui is the input value for input node ui; W(ui, vj) is the reference point of hidden node vj relative to input node ui; λ^L_ui and λ^R_ui are the left and right widths of the influence region of hidden node vj; d(ui, vj) is part of the summation function)

4. The activation of a hidden node v ∈ U_H is computed as a_v = D_v · net_v, with D_v = min_{u ∈ U_I} { d(u, v) } and

   d(u, v) = +1 if W(u, v) − λ^L_u ≤ o_u ≤ W(u, v) + λ^R_u, and −1 otherwise.

5. The activation of an output node w ∈ U_O is determined as

   a_w = max_{v ∈ U_H} { a_v · W(v, w) }

   where W(v, w) is the weight of the connection between hidden node v and output node w.

Each hidden node v ∈ U_H of a RuleNet can be directly translated into a rule of the form:

If x_u1 ∈ [W(u1, v) − λ^L_u1, W(u1, v) + λ^R_u1] and ... and x_un ∈ [W(un, v) − λ^L_un, W(un, v) + λ^R_un], then pattern X = (x_u1, ..., x_un) belongs to class C.

This rule checks whether, for each variable of the antecedent, a condition is satisfied, i.e., whether its value belongs to an interval. If all conditions are satisfied, the pattern lies within the hyper-rectangle representing the hidden node, and consequently the node becomes active.

The propagation algorithm used in RuleNet is based on the winner-take-all rule. As pointed out in [2, 3], three possible situations can occur during the learning phase when a training instance X of class C is propagated (a code sketch of the computations in items 2, 4 and 5 is given after this list):

1. The network classifies X correctly and no changes occur.

2. The network cannot classify X, meaning that X is not contained in any of the hyper-rectangles represented by the hidden nodes. A new hidden node is created with X as its reference point. The parameters λ^L and λ^R associated with each dimension of the hyper-rectangle are chosen to be as large as possible, without exceeding a user-defined default value and without overlapping other hyper-rectangles created so far. If there is already an output node for class C, the new hidden node is connected to it; if this is the first training example of class C, a new output node with value C is created and connected to the new hidden node.

3. The network classifies X incorrectly (Fig. 3). In this case, first, the λ values of the winning hidden nodes are adjusted in one dimension, removing X from their decision region; theoretically there are two ways of doing that, as shown in Figs. 4 and 5. Second, a new node is created with X as its reference point, as explained in situation 2.
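Read as code, items 2, 4 and 5 amount to a Chebyshev distance, a per-dimension interval test and a winner-take-all vote. The following is a minimal sketch, not the authors' implementation; node objects with fields W, lam_L, lam_R and cls are the same hypothetical stand-ins used in the sketch in Sect. 1.

```python
def net_input(o, node):
    """Item 2: net_v = ||o - w_v||_inf, the largest per-dimension
    deviation of the input vector o from the node's reference point."""
    return max(abs(o_i - w) for o_i, w in zip(o, node.W))

def contains(o, node):
    """Item 4's d(u, v) tests, combined: True iff every coordinate of o
    falls inside the node's influence interval on its dimension."""
    return all(w - l <= o_i <= w + r
               for o_i, w, l, r in zip(o, node.W, node.lam_L, node.lam_R))

def hidden_activation(o, node):
    """Item 4: a_v = D_v * net_v, with D_v = +1 when o lies inside the
    node's hyper-rectangle and -1 otherwise."""
    D_v = 1 if contains(o, node) else -1
    return D_v * net_input(o, node)

def classify(o, hidden_nodes):
    """Winner-take-all classification: a pattern contained in some
    hyper-rectangle takes that node's class (learning guarantees all
    containing nodes share one output node); otherwise it is unknown."""
    winners = [n for n in hidden_nodes if contains(o, n)]
    return winners[0].cls if winners else "unknown class"
```

Each node in hidden_nodes corresponds directly to one if-then rule of the form shown above; the contains test is exactly the rule's antecedent.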

The size and accuracy of a RuleNet network depend on the order in which the training instances are presented during the learning phase.

The algorithm used in the classification phase is also based on the winner-take-all rule. The hidden node H_K is the winner node for the input pattern X if X is contained in the hyper-rectangle described by H_K. The input pattern X can be contained in more than one hyper-rectangle and consequently can activate more than one hidden node; during learning, however, it is ensured that all active nodes are connected to the same output neuron.

During the classification phase, if the input pattern is not contained in any of the hyper-rectangles represented by the hidden nodes of a RuleNet network, the output has the value "unknown class". For applications where this is not convenient, an extended version named XRuleNet can be used. An XRuleNet network is "forced" to produce an output by finding the class to which the pattern most plausibly belongs. This can be accomplished either by (a) calculating the distance between the input pattern and the reference points of the hidden nodes, or (b) calculating the distance from the input pattern to the closest hyper-rectangle. The XRuleNet version used in this work implements option (b), as well as the improvement suggested in [3] of modifying the reference points of the hidden nodes during the learning phase according to their performance.

Fig. 3 RuleNet classifies X incorrectly
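A sketch of XRuleNet's option (b), under one stated assumption: the paper does not spell out how the per-dimension gaps are aggregated, so the code below reuses RuleNet's infinity norm. The node fields are the same hypothetical ones as in the previous sketch.

```python
def gap(x_i, lo, hi):
    """Per-dimension distance from coordinate x_i to interval [lo, hi]
    (zero when x_i already lies inside the interval)."""
    return max(lo - x_i, 0.0, x_i - hi)

def rect_distance(x, node):
    """Distance from pattern x to a hidden node's hyper-rectangle; the
    Chebyshev aggregation over dimensions is an assumption made here to
    match RuleNet's infinity-norm network input."""
    return max(gap(x_i, w - l, w + r)
               for x_i, w, l, r in zip(x, node.W, node.lam_L, node.lam_R))

def xrulenet_classify(x, hidden_nodes):
    """Option (b): when no hyper-rectangle contains x, answer with the
    class of the nearest hyper-rectangle instead of 'unknown class'."""
    nearest = min(hidden_nodes, key=lambda n: rect_distance(x, n))
    return nearest.cls
```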


Fig. 4 Resizing the first dimension of the hidden node that classified X incorrectly

Fig. 5 Resizing the second dimension of the hidden node that classified X incorrectly

3 The NGE method

The exemplar-based learning model was originally proposed in [7] as a model for human learning, based on the idea that, when solving a problem, human beings tend to retrieve previously experienced similar situations and adapt their solutions to the current problem. This human strategy for solving problems has been used in the machine learning area as the basis for an automatic learning model known as exemplar-based learning. As suggested in [5], learning algorithms characterized as exemplar-based can be classified as either instance-based or exemplar-based generalization algorithms.

Learning with instance-based algorithms consists of storing the training examples in memory and never changing them; the concept description is the training set itself. To classify a new instance, these algorithms retrieve similar instances from memory and use their classes to classify the new instance. This is why instance-based algorithms are known as lazy algorithms: they postpone generalization to the classification phase. Instance-based algorithms treat similarity as a synonym for closeness and rely on a distance metric to measure the similarity between a new instance and the stored training instances. A family of instance-based algorithms is proposed in [8]; they are strongly based on the nearest neighbor algorithm (NN) [9] and were proposed with the aim of exploring the limitations and possible extensions of the NN (see [10] for an in-depth and extensive presentation and analysis of lazy learning algorithms).
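For reference, the nearest neighbor rule this family builds on fits in a few lines. A minimal 1-NN sketch with a plain, unweighted Euclidean metric and hypothetical names:

```python
import math

def one_nn(query, training_set):
    """Classify query with the class of its closest stored instance.
    training_set: iterable of (vector, label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, label = min(training_set, key=lambda ex: dist(query, ex[0]))
    return label
```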


The nested generalized exemplar (NGE) [4, 5] can be approached as an exemplar-based generalization algorithm and, as such, can be considered an instance-based algorithm with the extra capability of generalization. Figure 6, found in [5], shows NGE as a type of exemplar-based learning.

Fig. 6 NGE as a type of exemplar-based learning (taxonomy relating instance-based learning, exemplar-based generalization, NGE and case-based reasoning)

In an NGE learning environment, an exemplar is a training instance and a generalized exemplar is an axis-parallel hyper-rectangle. NGE can be approached as a symbolic method, since the concepts it induces can easily be translated into a set of symbolic rules. The input to an NGE system is a set of training examples (the training set), presented incrementally, each described as a vector of numeric attribute/value pairs and an associated class. The n attributes used for describing the training examples define the n-dimensional Euclidean space in which the learnt concepts are represented.

NGE generalizes an initial set of training examples (considered trivial hyper-rectangles and named seeds) into a set of hyper-rectangles, which can be nested one inside another. NGE generalizes the initial user-defined set of seeds, expanding them (and during the learning phase sometimes shrinking them) along one or more dimensions, as new training examples are presented. The choice of which hyper-rectangle to generalize depends on a distance metric, generally a weighted Euclidean distance, either point-to-point or point-to-hyper-rectangle.

For each new training instance Enew, NGE finds, among all hyper-rectangles built so far, the closest to Enew, referred to as Hclosest1, and the second closest, Hclosest2. These are the candidates for generalization. If Enew and Hclosest1 have the same class, Hclosest1 is expanded to include Enew, a process called generalization; otherwise the class comparison takes place between Enew and Hclosest2. If they have the same class, NGE specializes Hclosest1, reducing its size by moving its edges away from Enew, so that Hclosest2 becomes the closer of the two to Enew along that dimension, and stretches Hclosest2 to make it absorb Enew. If the class of Enew differs from the classes of both Hclosest1 and Hclosest2, Enew itself becomes a new exemplar, assuming the shape of a trivial hyper-rectangle (i.e., a point in the n-dimensional space). Figure 7 shows the induced concepts of three different classes, represented by five hyper-rectangles, one of them trivial.

Fig. 7 Induced concept represented as five hyper-rectangles, one of them trivial

Weight adjustments are adopted by NGE as a way of reinforcing the relevance of attributes and exemplars in the classification process. Such reinforcement can be either positive or negative, depending on the contribution of each attribute to the correct classification of examples. During the learning process, the increasing relevance of an attribute is reflected by the decreasing value of its associated weight, and vice versa. A similar policy is adopted for weights associated with exemplars. According to [11], "... Salzberg's weight procedure has no significant impact on NGE's behaviour in most domains" and, for this reason, the NGE implementation used in the experiments has neither an attribute nor an exemplar weight strategy. Our implementation also follows the suggestion given in [11] that "the construction of overlapping hyper-rectangles should be avoided."

There is also another version of NGE, called Greedy NGE, proposed and described in [4, 5], which does not implement the second-choice heuristic. The Greedy NGE always stores an example as a new (trivial) hyper-rectangle when the class of the example differs from the class given by its closest hyper-rectangle. In spite of inhibiting the second choice, the Greedy NGE can still construct overlapping hyper-rectangles. The implemented version, named NONGE (used in the experiments described in Sect. 4), evaluates the impact of modifying a hyper-rectangle before actually doing so; if the modification would cause an overlap, NONGE creates a new trivial hyper-rectangle instead.
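The control flow of one training step is compact enough to sketch. The version below follows the Greedy NGE just described (no second-choice heuristic) and omits attribute and exemplar weights, in line with the implementation used in the experiments; it reuses the hypothetical Rectangle structure from the sketch in Sect. 1 and assumes an unweighted Euclidean point-to-hyper-rectangle distance.

```python
import math

def point_rect_distance(e, rect):
    """Euclidean distance from point e to the axis-parallel rectangle
    rect (zero when e lies inside it)."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(e, rect.lower, rect.upper)))

def greedy_nge_step(e, label, rects):
    """Process one training example (e, label) against the current
    set of hyper-rectangles, in place."""
    if rects:
        h1 = min(rects, key=lambda r: point_rect_distance(e, r))
        if h1.label == label:
            # generalization: stretch the closest same-class rectangle
            # just enough to cover e
            h1.lower = [min(lo, x) for lo, x in zip(h1.lower, e)]
            h1.upper = [max(hi, x) for hi, x in zip(h1.upper, e)]
            return
    # otherwise e itself is stored as a trivial (point) hyper-rectangle
    rects.append(Rectangle(list(e), list(e), label))
```

NONGE would differ only in the generalization branch: before stretching h1, it would test whether the enlarged rectangle overlaps a rectangle of another class and, if so, fall back to storing e as a trivial rectangle.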

4 Experimental results

This section describes a set of experiments conducted on eight knowledge domains in order to investigate the feasibility and adequacy of the hybrid system RuleNet → NGE (and a few variations).

Three artificial domains, identified in this paper as A, B, and C, were used with the intention of detecting the sensitivity and studying the behavior of the hybrid system in well-defined domains with different decision boundaries. Each of them has 500 instances. Domains A and B describe two classes (250 instances per class) and domain C describes ten classes (50 instances per class). Figure 8 presents a pictorial representation of these domains.

The Breast Cancer (BC), E. coli and Iris domains are well known; the data as well as their descriptions are part of the UCI repository [12].

The Vestibular System (VS) data are related to the identification of a lesion in the cerebellum, known as a central vestibular lesion. The domain contains 198 instances, divided into 97 normal and 101 abnormal. Six attributes and a corresponding class describe each example. The VS data were collected by electrodes placed next to the patient's left and right eyes. The movements of both eyes were monitored as they focused on a spotlight that shone alternately at one extremity of a horizontal bar and then the other, at a constant frequency, during a certain period of time. The electrodes measured the electrical signals produced by the saccadic movements; these signals were amplified, filtered and recorded for further analysis. The movements were recorded as a vector of 1,240 different values (taken throughout the period of testing) and an associated class, given by a human expert. For each eye and for each eye movement, the values of three characteristics, namely latency, amplitude and accuracy, were extracted from the data by means of software. The averages of these values over the period of testing, together with the corresponding class, constitute the data used for the purpose of this work.

Fig. 8 Artificial domains A, B and C, each containing 500 instances


Each instance is described by six attributes (three related to the left eye and three related to the right eye) and a corresponding class. The meanings of the three attributes are as follows:

Latency: the time interval between the movement of the target (spotlight) and the beginning of the eye movement towards it.

Amplitude: the extension of the eye movement needed to reach the final position of the target. The exact amplitude determines the precision of the eyes' search.

Accuracy: the precision in locating the target.

Excipients is a pharmaceutical knowledge domain, with 170 instances, related to the industrial production of pharmaceutical drugs. In order to optimize drug delivery systems, a better understanding of excipients, their properties and limitations is required. The Excipients domain consists of data described by 14 different characteristics associated with excipients: bulk density, tapped density, compressibility, angle of repose, relative filling, flow rate, particle size distributions (with percentage of retention on 840, 420, 250, 177, 149 and