Generation of temporal sequences using local dynamic programming

N.A. Borghese+ and M.A. Arbib*
* Center for Neural Engineering, University of Southern California, Los Angeles, CA 90089-0782, USA
+ I.N.B. - C.N.R., Via Mario Bianco 9, 20131 Milano, Italy

ABSTRACT
The generation of a sequence of control actions to move a system from an initial state to a final one is an ill-posed problem because the solution is not unique. Soft constraints, like the minimization of a cost associated with the control actions, make the problem mathematically solvable in the framework of optimal control theory. We present here a method to approximate the solution of problems of this category, Local Dynamic Programming, based on the Heuristic Dynamic Programming proposed by Werbos. Its main features are the exploration of a volume around the actual trajectory and the introduction of a set of correcting functions. Its application to the generation of a trajectory whose kinematics is minimum jerk is presented; in this situation, the introduction of a short-term temporal credit assignment improves convergence by tackling the lack of controllability in the Plant.

1.0 INTRODUCTION
A control problem can be stated as the generation of a temporal sequence that moves a certain plant in a desired way. The simplest structure that can accomplish this is the feedback controller (figure 1a), in which the control signal is a function of the actual state of the plant and of a set of parameters, w_b, that can be tuned in order to generate the desired control sequence. The way in which these parameters are tuned gives rise to very different learning paradigms, which in the connectionist literature are grouped into two broad classes: Supervised Learning and Reinforcement Learning.


In Supervised Learning, the temporal sequence of the desired states of the plant and/or of the Controller output is given to the system as a training set (Kawato et al., 1987; Jordan, 1988; Jordan & Rumelhart, 1992; Massone & Bizzi, 1989). The discrepancy between the generated and the desired state or control at each time step gives an error that is used to tune the controller parameters. This procedure is repeated for every time step of the movement. In this paradigm, we can define "learning" as the memorization of a given temporal sequence of controls, which can later be retrieved. New trajectories can be generated by interpolation of the memorized ones.
A very different learning paradigm is adopted by those systems that are classified under the broad family of Reinforcement Learning. Here there is no a priori information on the temporal sequence of controls and states which will do the job; these are the outcome of the learning process itself. Here "learning" is the self-discovery of the temporal sequence that will accomplish the specified task. In this paradigm, the hard problem to face is the Temporal Credit Assignment: "which control in the temporal sequence should receive the credit or the blame to improve the overall performance when its consequences unfold over time and interact with consequences of other controller outputs through the plant?" (Werbos, 1990a). This problem is solved by introducing a particular module that relates the actual control to the global goal of the system. Among the approaches based on Reinforcement Learning, those related to Optimal Control theory (Bryson & Ho, 1969) are considered the most general: Back Propagation Through Time (Nguyen & Widrow, 1990; Werbos, 1990c; Kawato et al., 1991), the Minimum Principle (Narendra & Parthasarathy, 1991; Uno et al., 1989) and Dynamic Programming (Barto et al., 1983; Werbos, 1987, 1990a, 1990b; Sutton, 1988; Williams, 1988; Watkins & Dayan, 1992; Berenji & Khedkar, 1992) are all inspired by it and differ in the architecture and in the learning procedure implemented. We present here Local Dynamic Programming, an improved version of the Heuristic Dynamic Programming proposed by Werbos as an approximate method to generate the desired sequence of controls, and its application to the generation of point-to-point movements in three-dimensional space in which the jerk is minimum.


2.0 HEURISTIC DYNAMIC PROGRAMMING (HDP)
Werbos (1987, 1990a, 1990b) proposed an approximate, gradient-descent method to perform Dynamic Programming (cf. Appendix A). In this approach, called Heuristic Dynamic Programming (HDP), the optimal control sequence is achieved by performing the given task repetitively and updating the parameters after each repetition until convergence. Moreover, this is obtained running the system only forward in time.
In HDP the feedback controller is embedded into a two-level hierarchical adaptive structure (figure 1b). To the first level, constituted of the Plant and the Controller, a second level is added which is used in the learning phase only. This second level comprises two modules. The Cost module outputs at each time step a local cost, U_k, as a function of the actual control and state (eq. A2a); it may be regarded as a tactical function involved in a local minimization process, related only to the kth step. The Critic module outputs an estimate of the cost-to-go for each state, given a certain sequence of controls (eq. A3); this is a strategic function involved in a long-term optimization goal.

Figure 1a (left). The first level of the control system, constituted of the Plant and the Feedback Controller. Figure 1b (right). The four components required to learn the optimal control using dynamic programming: the Plant and the Cost function are in analytical form, while the Critic and the Controller are parametric functions (neural networks). The second level, constituted of the Critic and the Cost, is used only in the learning phase.
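To make the data flow of figure 1b concrete, the following Python sketch (our own illustration, not the original NSL implementation; every module body is a placeholder) runs one forward pass through the four components:

import numpy as np

# Minimal sketch of the two-level structure of figure 1b. Only the data flow is
# meant to be faithful; every module body below is a placeholder.

def plant(x, u, dt=0.01):
    # next-state function x_{k+1} = f(x_k, u_k, k); a generic integrator stands in here
    return x + dt * u

def controller(x, k, wb):
    # u_k = g(x_k, k | wb); a linear map stands in for the sigmoidal network
    return float(wb @ np.append(x, k))

def cost(x, u, mu=1.0):
    # local cost U_k = h(x_k, u_k): the tactical, single-step quantity
    return mu * u ** 2

def critic(x, k, wa):
    # J_k = J(x_k, k | wa), the estimated cost-to-go: the strategic quantity
    return float(wa @ np.append(x, k))

def forward_pass(x0, N, wa, wb):
    """Execute one movement; return per-step quantities used by the learning rules."""
    x, steps = np.asarray(x0, dtype=float), []
    for k in range(N):
        u = controller(x, k, wb)
        x_next = plant(x, u)
        U = cost(x, u)
        # the mismatch J_{k+1} + U_k - J_k is the quantity later used to tune the Critic (equation 4)
        e_a = critic(x_next, k + 1, wa) + U - critic(x, k, wa)
        steps.append((x, u, U, e_a))
        x = x_next
    return steps

# example: a 3-dimensional state and arbitrary small weights
steps = forward_pass(np.zeros(3), N=10, wa=0.01 * np.ones(4), wb=0.01 * np.ones(4))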


In this approximated formulation of DP, the Critic and the Controller are parameterized by two sets of parameters, w_a and w_b. A convenient parametrization is through three-layered sigmoidal neural networks, for which it has been shown that, with an arbitrarily large number of hidden units, they can approximate any continuous function on the real space to any degree of accuracy (Hornik et al., 1988). The parameters w_a and w_b become the synaptic weights of the units and we can write the cost-to-go and control functions implemented in the Critic and in the Controller as:

J_k = J(x_k, u_k, k | w_a)    (1a)
u_k = g(x_k, k | w_b)    (1b)

In this parametrization, gradient descent is efficiently carried out using the back-propagation algorithm (Rumelhart et al., 1986; Werbos, 1974), provided that an error or an increment¹ on the output is given. The gradient-descent updating of the weights, Δw_k, can be expressed as:

Δw_k = −µ ∇_{w_k} e_k    (2)

where µ is the learning rate, e_k is the error or the increment in the output of the network, w_k is the vector of synaptic weights and ∇_{w_k} is the gradient of the error with respect to the weights. After each iteration, n, the new value of the weights is:

w^(n+1) = w^(n) + Δw^(n) = w^(n) − µ ∇_w e^(n)    (3)

In our situation, the error on the output is not given. If the Critic were already perfect, at each time step the Controller could be tuned by computing the gradient of the output of the Critic (the cost-to-go) with respect to the input state, back-propagating this gradient through the Plant and using the obtained increment in u_k to tune the Controller. If the Controller were already optimal, the error on the cost-to-go at each time step could be determined and used to train the Critic.

¹ The vector of the output increment and that of the error have the same orientation and differ only in magnitude. This does not prevent convergence: since the updating of the weights is proportional to the error (equation 2), whatever measure of the "error" is taken is effective to train the network, provided that there is an effective way to change the learning rate µ as the system gets close to the target. This distinction is critical when there is no explicit target, but only the direction of the change in the output is given. In our case, where the output of the Critic network is a scalar, an effective output error can be ±1, provided that the learning rate is kept very small (0.001 in our examples).


But neither the temporal sequence of the cost-to-go nor that of the optimal control is known in advance; the Critic and the Controller are therefore trained simultaneously, with a procedure based, at each time step k, on an estimate of the error for the Critic and of the increment of u_k for the Controller. Learning is obtained by running the system only forward in time and is carried out iteratively in two steps: Forward Pass and Adaptation. Each iteration, indexed n, is the execution of a movement from the initial state x_o, at time k = 0, to the final state x_N at k = N, using the sequence of controls {u_k = g(x_k, k | w_b^(n))}. The Critic monitors the trajectory of the plant and produces at each time step an output, J_k, which is an estimate of the cumulative future cost under that control sequence.
Forward Pass. The nth movement is executed. The outputs of the Controller, of the Plant and of the Critic are computed but are not memorized, unlike in Back Propagation Through Time approaches (Nguyen & Widrow, 1990; Werbos, 1990c). At each time step, the updating of the weights is computed as follows:
(a) Critic. The updating of the weights w_a^(n+1) is computed such that J(x_k, k | w_a^(n+1)) will be a better approximation of J(x_{k+1}, k+1 | w_a^(n)) + U_k, given the sequence of state vectors {x_k} and assuming that {u_k = g(x_k, k | w_b^(n))} is used as the control. The local error used to tune the Critic weights therefore is:

e_{a,k} = J(f(x_k, u_k, k), k+1) + U_k − J(x_k, k)    (4)

where f(·) is the output of the Plant and U_k is the local cost at time step k.
(b) Controller. The procedure to update the Controller weights is more complex. We do not have an explicit target value for the Controller from which to compute an error; at any time step, we can only change the weights in the Controller to obtain a lower cost-to-go from there to the end of the movement. This is equivalent to applying a gradient-descent procedure to the minimization of equation A5a, from which we obtain:

u_k^(n+1) = u_k^(n) − µ [∂J(f(x_k, u_k, k), k)/∂u_k]^T [∂u(·)_k/∂w_{b,k}]    (5)

This function can be rewritten using the chain rule of derivatives as:


u_k^(n+1) = u_k^(n) − µ [∂J(·)_{k+1}/∂x_{k+1}]^T [∂x_{k+1}/∂u_k] [∂u(·)_k/∂w_{b,k}] = u_k^(n) − µ ∇_{w_{b,k}} [ ∇_{x_{k+1}} J(·)_{k+1} · ∂x_{k+1}/∂u_k ]    (6)

which is in the form of equation 2 and can be used to tune the Controller network. The term inside square brackets represents the estimate of the increment required in the Controller output to decrease the cost-to-go. ∇_{x_{k+1}} represents the gradient of the Critic with respect to its input (what Jordan & Rumelhart (1992) called back propagation-to-activation, to distinguish it from back propagation-to-weights). This gradient is back-propagated through the Plant (∂x_{k+1}/∂u_k) to obtain the gradient of J_{k+1} with respect to u_k, that is, the change in the vector u_k that would decrease the cost-to-go. Finally, the gradient on the control is back-propagated into the Controller, and the Controller weights, w_b, are updated.
Adaptation. At the end of the movement, the weight updates collected through the entire trajectory are added to the actual value of the weights to generate a new weight value.
Whenever an analytical formulation of the Plant is adopted, another transformation has to be introduced at the interfaces between the plant and the neural network to allow the two structures to communicate properly. Different types of coding of the input/output variables can be adopted, which have different properties and different abilities to generalize: Gaussian coding and non-linear pre-processing are widely used (e.g. Miller et al., 1990; Massone and Bizzi, 1989; Kawato et al., 1987; Moody & Darken, 1989). We concentrate here on the learning capability, and such complex codings, which require more complex structures, are beyond the scope of this work. As a first approximation, linear coding can be sufficient. In this case the transformation from an analytical variable to an input of a neural network is:

v_NN = v_NN^min + (v_r − v_r^min) / f_s    (7)

where the input of the neural network varies between v_NN^min and v_NN^max while the actual input, v_r, varies between v_r^min and v_r^max. f_s is the scale factor applied to the state. The transformation from an output unit of a neural network to an analytical variable can be expressed by:


v_r = v_r^min + v_NN · f_s    (8)

where the output value of a sigmoidal unit is constrained to be between 0 and 1. From this linear transformation, the derivatives of the output as a function of the input are computed; they are equivalent to the scale factors. The updating of the weights, equation 2, therefore has the final form:

Δw_a = −µ Σ_{k=0}^{N−1} ∇_{w_{a,k}} e_{a,k}    (9a)

Δw_b = −µ Σ_{k=0}^{N−1} ∇_{w_{b,k}} ( f_c ∇_{x_{k+1}} J(·)_{k+1} (∂x_{k+1}/∂u_k) f_s^{−1} )    (9b)
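The linear coding of equations 7-8 reduces to a pair of scaling helpers; a minimal sketch (the function names are ours):

def to_network(v_r, v_r_min, fs, v_nn_min=0.0):
    # equation 7: map an analytical variable v_r onto the network's input range
    return v_nn_min + (v_r - v_r_min) / fs

def from_network(v_nn, v_r_min, fs):
    # equation 8: map a sigmoidal output (between 0 and 1) back to the analytical range
    return v_r_min + v_nn * fs

# The derivative of either mapping is just the scale factor (or its inverse),
# which is why f_s and 1/f_s appear explicitly in equations 9a-9b.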

If the Plant is approximated with a third neural network, the gradient of the Critic can be propagated directly through the three networks as if they were a single multi-layered network. This procedure has been named Back-propagated Adaptive Critic (BAC) by Werbos (1990a). Nevertheless, we point out that learning the weights of the Plant network (identification problem) should be carried out before the optimization procedure (control problem) is started, to avoid problems of multicollinearity and interference between the three networks. These problems may lead to poor learning (Psaltis et al., 1987) because the system does not know whether the error should be attributed to an incorrect plant or to a poor controller.
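A sketch of the chain of gradients in equation 6 and of the accumulated update of equation 9b; grad_critic_x, plant_jac_u and backprop_controller are placeholder callables that would be provided by the Critic, the Plant and the Controller networks respectively:

import numpy as np

def control_increment(grad_critic_x, plant_jac_u, x_k, u_k, k, fc=1.0, fs=1.0):
    # term in square brackets of equation 6: the change in u_k that lowers the cost-to-go
    dJ_dx = grad_critic_x(x_k, u_k, k)      # nabla_x J_{k+1}, back-propagation-to-activation
    dx_du = plant_jac_u(x_k, u_k, k)        # dx_{k+1}/du_k, back-propagated through the Plant
    return -(fc / fs) * float(dJ_dx @ dx_du)

def adapt_controller(wb, steps, backprop_controller, mu=1e-3):
    # Adaptation step, equation 9b: accumulate the updates over the whole movement
    # and apply them once at the end of the trajectory.
    dwb = np.zeros_like(wb)
    for (x_k, u_k, k, du) in steps:                      # du from control_increment(...)
        dwb += du * backprop_controller(wb, x_k, k)      # nabla_{wb} g(x_k, k | wb), scaled by du
    return wb + mu * dwb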

3.0 LOCAL DYNAMIC PROGRAMMING
Starting from HDP, we propose here a local approximation to Dynamic Programming (LDP) which improves on HDP in two respects: volume exploration and correction of the weight updates.

3.1 Volume exploration
In tuning the controller as prescribed by the HDP algorithm, not only the absolute value of the cost-to-go but also the shape of this function in the surroundings of the actual trajectory is required to reliably compute the local gradient of the cost-to-go relative to its input state. Using HDP in the original formulation, information is collected only on the actual trajectory and none is given on the behaviour of the cost-to-go outside that trajectory. Therefore, unless a


massive exploration of the state space is carried out at the same time, the gradient of the Critic with respect to the input state can be incorrectly estimated and the system may oscillate or diverge during learning. In LDP, the state space close to the actual trajectory is explored by sampling the output of the Critic on the actual trajectory and in a volume around it. To achieve this, we start the trajectories {x_k} in parallel from x_o and from a set of close states: {x_o} = {x_o; x_o + dx_o^i; x_o − dx_o^i}, where dx_o^i is an S × 1 vector defined as: dx_o^i[j] = {0 if j ≠ i, d if j = i}, and S is the dimension of the state space. The number of trajectories explored in every Forward Pass is 1 + 2S. If the temporal behaviour of the Plant and the Controller is continuous, the perturbed trajectories will remain close to the actual trajectory (Bryson & Ho, 1969). The exploration of the cost-to-go on the set of resulting trajectories gives a good approximation of the cost-to-go in an entire region of the state space. With this procedure we can determine with good approximation not only the absolute value of the Critic, but also its local gradient. This is schematically represented in figure 2, where the explored trajectories constitute the border of the volume in state/time space inside which the gradient of the cost-to-go relative to its input state can be reliably assessed. In this case, the state space has dimension equal to 2.
INSERT FIGURE 2 HERE
Figure 2. Exploration of a volume around the nominal trajectory for a system with S = 2. The actual trajectory is located in the center of the tube. xa indicates the trajectory starting from x_o + dx_o^1, xb the one starting from x_o − dx_o^1, xc the one starting from x_o + dx_o^2 and xd the one starting from x_o − dx_o^2.
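A sketch of the construction of the 1 + 2S starting points used in every Forward Pass (the perturbation size d is a free parameter):

import numpy as np

def perturbed_initial_states(x0, d):
    # {x_o ; x_o + d*e_i ; x_o - d*e_i}, i = 1..S  ->  1 + 2*S parallel starting points
    x0 = np.asarray(x0, dtype=float)
    starts = [x0]
    for i in range(len(x0)):
        e = np.zeros_like(x0)
        e[i] = d
        starts.append(x0 + e)
        starts.append(x0 - e)
    return starts

# for S = 3 (position, velocity, acceleration) this yields 7 trajectories per Forward Pass
print(len(perturbed_initial_states([0.0, 0.0, 0.0], d=0.05)))   # -> 7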

3.2 Correcting Functions
An important issue related to the speed of convergence is the evaluation of the information collected by the Critic during the first repetitions of the task. Given the choice of running the system forward in time, we question the value of the weight updates, Δw, obtained from the early stages of the trajectories. In fact, the error used to train the Critic, e_{a,k} = (J_{k+1} + U_k) − J_k


(equation 4), has a recursive formulation that goes backwards in time. The accuracy of the cost-to-go estimate at every time step k depends on the cost-to-go at the next time step, J_{k+1}, which, in turn, improves as learning progresses. In the very first movement, the only exact error that can be determined is at the last time step, N: e_{a,N} = C(x_N), where C(x_N) can be directly measured at k = N. This error is effective not only for the last stage, but also predisposes the Critic to learn the correct cost-to-go for the previous states and times². Therefore the information in the different time steps does not have the same reliability. A set of Correcting Functions for the updating of the weights has been introduced (figure 3). Each function is used for a set of P trajectories. At the beginning of learning, the parameter updates due to the last stages of the trajectory are heavily weighted, while those collected in the first stages count very little. After that, the weights given to the first time steps progressively increase. An improved version of the Correcting Functions could take into account a run-time estimate of the reliability of the Critic at each time step to weight the parameter updates in the Critic and the Controller.
INSERT FIGURE 3 HERE
Figure 3. The four correcting functions used in our simulations. They are used for 20 trajectories each (P = 20). The first function has the shape of e^(αk − β) with α = 0.01 and β = 5; the second function is a straight line with slope equal to 1/N; the third function has a logarithmic shape, ln(1 + γk)/δ, with γ = 0.01 and δ = ln(1 + γN).
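A sketch of the three shapes quoted in the caption of figure 3, used as weights on the updates collected at time step k; the exact way the weight multiplies each update is our reading of the text:

import numpy as np

def correcting_weight(kind, k, N, alpha=0.01, beta=5.0, gamma=0.01):
    # weight applied to the parameter update collected at time step k (0 <= k < N)
    if kind == "exponential":
        return float(np.exp(alpha * k - beta))                    # e^(alpha*k - beta)
    if kind == "linear":
        return k / N                                              # straight line, slope 1/N
    if kind == "logarithmic":
        return float(np.log(1.0 + gamma * k) / np.log(1.0 + gamma * N))
    raise ValueError(kind)

# Early in learning the exponential shape is used, so the last stages of the trajectory
# dominate the update; a flatter shape is adopted every P = 20 trajectories.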

4.0 APPLICATION OF LDP TO THE TRAJECTORY GENERATION IN REACHING MOVEMENTS
We applied the Optimal Control/LDP paradigm to the problem of moving a human hand in 3D space. It has been shown that the kinematics of this movement can be predicted by hypothesising that the jerk is minimized throughout the entire movement (Hogan, 1982; Hoff, 1992).

² Unlike the analytical solution of DP, here the different stages are not completely independent because they are represented in the same neural network. Due to the smoothness properties of the network, the costs-to-go of states and times close to each other tend to have close values too. This information flows inside the Critic network and predisposes it to learn the correct gradient of the cost-to-go corresponding to previous states and times without explicit back-propagation through time.


The problem can be stated as follows: the movement of the hand (considered as a single point) along one direction in space can be described by the kinematic equations:

ẋ_p = x_v,  i.e.  x_p(k+1) − x_p(k) = x_v(k) Δt    (10a)
ẋ_v = x_a,  i.e.  x_v(k+1) − x_v(k) = x_a(k) Δt    (10b)
ẋ_a = u,    i.e.  x_a(k+1) − x_a(k) = u(k) Δt      (10c)

with x_p representing the position, x_v the velocity, x_a the acceleration of the hand and u the jerk. In matrix notation we can write:

x_{k+1} = [[1, Δt, 0], [0, 1, Δt], [0, 0, 1]] x_k + [0, 0, Δt]^T u_k    (11)

This is our plant to be controlled, which moves from x_o at k = 0 to x_N at k = N. The goal is to minimize the jerk during the movement while realizing a trajectory which reaches a specified target, x_d, from the given initial state, x_o. The global cost combines a penalty for lack of smoothness with a penalty for deviation from the desired final state (cf. eq. A2):

U_k = µ · u_k²    (12a)
C_N = λ · (x_N − x_d)²    (12b)

where λ and µ are positive scalars. The problem can be stated as finding the temporal sequence of controls {u_k} which minimizes the overall cost (cf. eq. A1):

min_{u_k} ( µ Σ_{k=0}^{N−1} u_k² + λ (x_N − x_d)² )    (13)
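The plant of equation 11 and the cost of equations 12-13 translate directly into code; the value of the integration step Δt below is an assumption:

import numpy as np

DT = 0.01                                    # integration step (assumed value)

A = np.array([[1.0, DT, 0.0],
              [0.0, 1.0, DT],
              [0.0, 0.0, 1.0]])
B = np.array([0.0, 0.0, DT])

def plant_step(x, u):
    # equation 11: x = [position, velocity, acceleration], u = jerk
    return A @ x + B * u

def global_cost(x0, xd, controls, mu, lam):
    # equation 13: the quantity minimized over the control sequence {u_k}
    x = np.asarray(x0, dtype=float)
    smoothness = 0.0
    for u in controls:
        smoothness += mu * u ** 2            # equation 12a, summed over the movement
        x = plant_step(x, u)
    return smoothness + lam * float((x - xd) @ (x - xd))   # adds equation 12b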

The Critic and the Controller are implemented as three-layer, fully interconnected networks with 4 input units, 21 hidden units and 1 output unit. The four input units linearly encode the state of the system (position, velocity, acceleration and time), while the other two layers are composed of sigmoidal units. The two networks learn to approximate the temporal sequences of the cost-to-go and of the control. All the inputs are scaled to the interval [−1, +1], which corresponds to [100 mm, 1100 mm] for position, [−0.3 m/sec, 4.2 m/sec] for velocity, [−0.27 m/sec², 0.27 m/sec²] for acceleration and [−50 msec, 550 msec] for time. The output units are constrained to the interval [0, 1], which corresponds to [−0.27 m/sec³, 0.27 m/sec³] for the jerk and [−0.1, 2.3] for the cost-to-go in arbitrary


units (it is composed of squared position, velocity and acceleration terms and a squared jerk). The system is implemented within the NSL simulation environment (Weitzenfeld, 1991).

4.1 Convergence
In order to learn the optimal control, we follow the strategy adopted by Kawato et al. (1991): the system first learns to maximize the smoothness of the movement and only afterwards to reach the desired final state. This is achieved by decrementing the value of µ and incrementing the value of λ in the cost function (equation 13) in discrete steps. In the particular simulation shown in figure 4, three steps are used. After each change, the values of the cost and of the cost-to-go change, and the estimate stored at that time in the Critic is no longer representative of the real cost-to-go. To facilitate learning, the Critic network alone is trained for a few repetitions of the task (10 in our simulations) while the weights in the Controller are kept fixed (therefore generating the same trajectory). In the first phase, until the 50th repetition (figure 4), the value of λ is set to 0.01, very small compared with µ, which is set to 10. The greater part of the cost is associated with the jerk and the Critic forces the Controller to decrease the output jerk to zero; this causes the hand to move very little from its initial position. This phase amounts to teaching the Controller to produce a smooth output. When the final state gets close to the initial one (no movement), the system triggers a decrement of µ, which is set to 1, and an increment of λ, which is also set to 1. This causes a jump in the global cost³ which is monitored by the Critic and forces the Controller to modify its output to get the final state closer to the desired one. This in turn decrements the global cost and the cost-to-go. To avoid saturation, when the output of the Critic goes under a certain threshold, λ is incremented again, to 2, at repetition 680; the procedure is repeated until convergence.

³ This is not shown in the picture, where the global cost is scaled with respect to λ.
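The discrete scheduling of µ and λ described above can be summarized in the following sketch; the thresholds that trigger each change are not given numerically in the text and are placeholders:

def schedule_mu_lambda(repetition, displacement, critic_output, mu, lam,
                       still_thr=0.01, sat_thr=0.05):
    """Sketch of the three-step scheduling used for the simulation of figure 4."""
    if repetition < 50:
        return 10.0, 0.01                 # phase 1: cost dominated by the jerk term
    if mu == 10.0 and displacement < still_thr:
        return 1.0, 1.0                   # hand barely moves: shift weight onto the target term
    if critic_output < sat_thr:
        return mu, 2.0 * lam              # Critic near its lower saturation: raise lambda (1 -> 2)
    return mu, lam

# After every change of (mu, lambda) the Critic alone is retrained for about 10
# repetitions while the Controller weights are kept fixed.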


Figure 4. A typical convergence graph. The overall cost associated with the movement (normalized to λ), the position at the end of the trajectory and the parameter λ are shown as learning progresses.

4.2 Controllability
The key operation in dynamic programming is to relate the gradient offered by the Critic to the gradient needed by the Controller. This is done by back-propagating, at each time step, the gradient through the Plant to find the change in the weights w_b that would decrease the cost-to-go at that time step. To perform this operation, we need to compute ∂x_{k+1}/∂u_k, the derivatives of the Plant output with respect to its input (cf. equation 6). Inspecting equations 10, it can be noticed that the jerk at time k, u_k, affects only the acceleration at time k+1 and has no direct effect on position or velocity. This means that by acting on the jerk at time k we can improve only that part of the cost-to-go that depends on the acceleration:

(∂J/∂x_p)(∂x_p/∂u)_k ≡ 0    (∂J/∂x_v)(∂x_v/∂u)_k ≡ 0    (∂J/∂x_a)(∂x_a/∂u)_k ≠ 0    (14)

Let us suppose that µ = 0 in equation 13; the global cost is then equal to the cost associated with the deviations from the desired final state. If at time k = N, x_{a,N} is equal to the desired final acceleration, the resulting trajectory can be stable whatever value is assumed by the final position x_{p,N} and velocity x_{v,N}. The system is trapped in a minimum and cannot escape with local gradient information alone. In control theory, this is called lack of controllability (Kalman et al., 1969). There is no control u_k that can transfer the state of the Plant at time k from an arbitrary state x_k to another arbitrary state x_{k+1} in a single time step. The controllability subspace has rank 1, which means that only those states that lie on a straight line, determined by the actual state through the matrix A (equation 11), can be reached in a single time step. To be able to reach the whole three-dimensional state space, we must also act on the controls u_{k−1} and u_{k−2}, as can be seen by explicitly relating equations 10 to the jerk:

x_p(k+1) = x_p(k) + Δt ( x_v(k−1) + x_a(k−2) Δt + u(k−2) Δt² )    (15a)
x_v(k+1) = x_v(k) + Δt ( x_a(k−1) + u(k−1) Δt )    (15b)
x_a(k+1) = x_a(k) + Δt u(k)    (15c)

When we back-propagate the gradient given by the Critic through the Plant, we take this explicitly into account and create a short-term memory (three steps) that allows us to compute the derivatives, with respect to the jerk, of all three components of the state on which the Critic gradient is given:

Δu(k) = (∂x_p(k+3)/∂u(k)) Δx_p(k+3) + (∂x_v(k+2)/∂u(k)) Δx_v(k+2) + (∂x_a(k+1)/∂u(k)) Δx_a(k+1)    (16)

where Δx_p, Δx_v and Δx_a are given by the Critic network. Substituting the values of the derivatives:

Δu(k) = Δt³ Δx_p(k+3) + Δt² Δx_v(k+2) + Δt Δx_a(k+1)    (17)

In connectionist terms, this can be considered equivalent to creating a structure for "short-term" credit assignment that takes care of changing the values output by the Controller in the previous three time steps, in order to effectively decrease the overall cost and the cost-to-go at that time step.
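A sketch of equations 16-17; dxp, dxv and dxa denote the state increments suggested by the Critic at the indicated time steps:

def jerk_increment(dxp_k3, dxv_k2, dxa_k1, dt):
    # equation 17: combine the Critic's requested changes on position (k+3),
    # velocity (k+2) and acceleration (k+1) into a change of the jerk at time k
    return dt ** 3 * dxp_k3 + dt ** 2 * dxv_k2 + dt * dxa_k1

# In practice a short delay line holds dxa(k+1) and dxv(k+2) until dxp(k+3) becomes
# available; only then is du(k) formed and back-propagated into the Controller.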


4.3 Learning one optimal trajectory starting from random initial weights
In figure 5 the trajectories of the state (position, velocity and acceleration) and of the control (jerk) at the end of learning are plotted versus time. They are indistinguishable from the analytical solution (Hogan, 1983) and show the classical minimum-jerk profile. In the bottom row, the time course of the activity in the hidden units of the Controller is shown. The analysis of the temporal profile of the activity in the hidden units of the Controller and of the Critic reveals that it is highly modulated throughout the movement. The hidden units can be seen as "primitives" of the movement whose activity is fed into the units of the output layer, which through a non-linear sum generates the actual output temporal sequence. We also observe that none of them is directly correlated with position, velocity or acceleration alone. They are all effective in producing the jerk and only their ensemble can output the correct control as a function of the input state.

Figure 5. Trajectories of the Plant and the Controller at the end of the learning phase. Upper row: the trajectories of the state (position, velocity and acceleration) plotted versus time. Middle row: the trajectory of the output of the Controller (jerk). Bottom row: the activity in the 21 hidden units of the Controller network plotted over time.

We can obtain the same optimal control with very different temporal activity in the hidden units starting with a different set of initial weights and a different


scheduling of λ and µ. In the simulation shown in figure 6, the activity of some of the hidden units gets very close to saturation levels during the movement, whereas before they were all operating in their most linear region. This may explain the poorer approximation of the optimal control (observe the slight overshooting of the target) achieved in this experiment.

Figure 6. Optimal control obtained starting with a different set of initial weights and adopting different learning parameters with respect to those of Figure 5.

Learning is achieved by specifying only a cost at each stage of the movement and the desired final state at the end of the movement. No information about the path or the final trajectory is given to the system. In fact, the trajectories of the first repetitions of the task are very different from the final ones (figure 7). The Critic and the Controller are initialized with random weights between −0.1 and 0.1, which correspond to a small positive or negative jerk, almost constant throughout the movement (middle row). This is integrated three times in the Plant (equations 10) and drives the hand far away from the desired final position (in this simulation the final position reached was −654 mm, five times further than the desired final position). In the first phase of learning, when the system maximizes the smoothness without caring about the final state, it decreases the Controller output to zero for the entire movement. This results in trajectories for position, velocity and acceleration that deviate only slightly from zero. The big


change in the final value of the state is accomplished with a small change in the activity of the hidden units of the Controller (figure 7, bottom row). With initial random weights, the activity in these units is monotonic and usually flat. The substantial change is to go from this kind of activity in the hidden units to the highly modulated one shown in figures 5 and 6.

Figure 7. Optimal solution obtained starting from random values of the weights for both the Controller and the Critic. The Controller generates a small offset (middle row) that is integrated three times in the Plant and drives the position five times further than the desired final position (upper left plot). After the first learning phase, the jerk is reduced to zero and the trajectories of the state show small deviations from zero (upper trace of U and the trajectories of A_Hidden with the flatter slope).

4.4 Learning multiple optimal trajectories
We tested the ability of the system to learn to generate multiple optimal trajectories starting from different initial states. We chose three initial states, with positions at 0%, 10% and 50% of the path length: x_o1 = [0, 0, 0], x_o2 = [110, 0, 0], x_o3 = [500, 0, 0] (figure 8). The system explores the three trajectories sequentially, updating the weights of the Controller and the Critic after each trajectory has been completed. In this situation the system learns to get to the final state with a residual error that is


less than 2% in position and acceleration and less than 5% in velocity, with a jerk that is sub-optimal for the problem. We explicitly notice that also in this case the modulation in the activity of the hidden units of the Controller is small compared to the modifications occurring in its output.

Figure 8. The trajectories of position, velocity, acceleration, jerk and of the Controller's hidden units are shown for the three different initial states: x_o1 = [0, 0, 0], x_o2 = [110, 0, 0], x_o3 = [500, 0, 0]. The state is plotted between −1.5 and 1.5.

We also tested the ability of the system to extrapolate into regions in which the networks have not been trained (figure 9). We chose initial states with non-zero initial velocity and acceleration: x_o1 = [0, 0, 0], x_o2 = [300, 2, 0.13], x_o3 = [500, 3, 0.13]. At the end of learning, the Controller is still able to drive the plant very close to the desired final state, though the jerk is still sub-optimal for the problem.


Figure 9. The trajectories obtained starting from x_o1 = [0, 0, 0], x_o2 = [300, 2, 0.13] and x_o3 = [500, 3, 0.13] are reported. The state is plotted between −1.5 and 1.5.

5.0 DISCUSSION
LDP belongs to the class of neural networks concerned with the learning and generation of temporal sequences that are not known a priori but are discovered by the learning process itself. The common feature of the different architectures devoted to this task is an embedded machinery which transforms the information collected throughout the movement into information useful for tuning the Controller. The approaches based on Dynamic Programming/Optimal Control allow the maximum degree of flexibility in defining the performance of the Controller and can run forward in time. The price to be paid is the complexity of the cost-to-go and Controller functions. These quantities have to be estimated from very limited, indirect information, such as the local cost at each time step and the


deviation from the desired final state detected at the end of the movement, if any⁴. HDP and LDP are deterministic, gradient-descent approximations of Dynamic Programming. The Critic estimates the cost-to-go and the system uses its local gradient with respect to the input state to deterministically update the Controller parameters: the job of the Critic is to transform the long-term cost represented by the cost-to-go into a gradient, local in time, used to tune the Controller.

5.1 Comparison of LDP with other methods
The other approaches based on a Critic are BTT, ACE/ASE and Q-learning. They all rely directly only on the information on the final state, but they are very different in their tasks. In BTT (Nguyen & Widrow, 1990; Werbos, 1990c; Kawato et al., 1991) the state reached at the end of the movement is compared with the desired final state and the error is computed; this is back-propagated through time to the different states and controls, starting from the last time step back to the first one. This approach can be visualized as unfolding the network as many times as there are time steps, yielding a spatial representation of time. Thus the problem of performing BTT on a neural network with m layers, run through n time steps, becomes the problem of performing normal back-propagation through a feed-forward network of n·m layers. At each time step, the gradient necessary to tune the weights in the Controller network is generated from the combination of the error back-propagated from the next time step and of the activation in the units of the

⁴ We remark here that control problems can have a finite or an infinite horizon (Bryson and Ho, 1969), depending on whether the task can go on forever (like balancing a pole on a rig) or not (like reaching for a target in a fixed time). For infinite-horizon problems, the control output does not have to be explicitly a function of time; when the system reaches a state already visited, the cost-to-go associated with it will be the same. For finite-horizon problems the Controller output depends not only on the state but also explicitly on time. The cost-to-go depends on the fixed number of states visited from that time step to the end of the movement: the same state can have two different costs-to-go if it is visited at different time steps. Methods like TD (Temporal Differences, Sutton, 1988), which estimate the long-term reward, analogous to the cost-to-go for infinite-horizon problems, lose some of their appeal here.


network at that time step. Therefore the activity in all the units of the network has to be memorized throughout the movement, and the memory resources that must be allocated to the task greatly increase with the duration of the movement. This, and the propagation of the error backwards through time, is not consistent with true real-time learning. The other systems use recurrent structures (like the Plant-Controller loop of LDP), in which time is an additional input (explicit or implicit) to the network (Jordan, 1988; Williams & Zipser, 1989; Jordan and Rumelhart, 1992). The recurrent formulation allows the use of a number of units that is independent of the number of time steps into which the movement has been discretized. Moreover, in classical BTT the control sequence to which the system converges is the result of the interaction between the gradient at each time step, created by back-propagating through time the error on the final state, and the continuity properties of the input/output mapping given by a sigmoidal neural network. Therefore, in BTT there is an implicit optimization in which the only criterion allowed relates to the temporal smoothness of the Controller output. In HDP, LDP and similar approaches, the optimization criterion can be general, yielding more flexible architectures.
In Q-learning (Watkins and Dayan, 1992), the concepts of Dynamic Programming are revised in the framework of stochastic systems. The learning is carried out by a stochastic exploration of the different possibilities in the Controller at each time step; those which give the lowest estimated cost-to-go (future discounted reward), based on a given Cost function, increase their probability of being chosen, until the system converges to the optimal sequence of controls. The choice is therefore stochastic, which means that a large set of controls is tried before settling on the optimal ones; convergence is guaranteed with probability one for a Markov domain and infinite sampling of the state space.
In ACE/ASE approaches (Barto et al., 1983), the task is to avoid a certain set of states that correspond to failures of the system (like a controller that fails to balance a pole on a rig). The control sequence is monitored by a special Critic (Adaptive Critic Element) that analyses the correlation between the state and the control output at each time step. In these procedures, instead of explicitly computing the gradient of the Critic with respect to the Controller output, a run-and-twiddle strategy (borrowed from trial and error in psychology) is adopted: a


control is stochastically chosen, and its effect on the quality of the subsequent state is observed. A better performance of the controller increases the probability that that sequence of controls will be used again. It is a much slower process, because it requires trial and error and a stochastic search over the whole state space, but eventually the information needed to tune the controller is collected (Jarvis & Fallside, 1992).
The latter two approaches are therefore related to a global search through the weight space. Heuristic, stochastic, or exhaustive search methods through the weight space might be used, but they all require a strong computational effort. The computations involved in the LDP and HDP approaches are equivalent to searching for the minimum of a parametrized function with respect to its parameters. Gradient-descent procedures quickly and directly proceed to a solution, avoiding an exhaustive search of the parameter space. The resulting control sequence is not necessarily the optimal one, but may correspond to a local minimum (i.e., a sub-optimal solution). A possible solution is to restart learning with different sets of initial weights: after the system has converged, the different local minima obtained by running the system from each initial configuration can be compared and the weights corresponding to the absolute minimum chosen to obtain the optimal control. Alternatively, improved methods can be applied, in particular Boltzmann learning (Hinton & Sejnowski, 1983) and conjugate directions (Shanno, 1990). Back-propagation has the advantage of simplicity and has been used here. When the system gets stuck, a change in the learning rate can lead the controller to oscillations which may drive the system beyond local minima (Heskes et al., 1993), but also to instability and saturation. The scheduling of the learning rate is therefore critical for learning.

5.2 Local Dynamic Programming
In HDP, the frailest module is the Critic. What we get from it are the derivatives of the estimate of the cost-to-go, J*, with respect to the state at each time step. These cannot be reliably estimated from the exploration of a single trajectory, where at each time step only one state is visited; they require knowledge of the values of the function J* in a region around the state x_k at each time step k.


A proposed solution to this problem is Dual Heuristic Programming (DHP: Werbos, 1987, 1990), where the Critic directly outputs the derivatives of the cost-to-go with respect to the state (shadow prices, or Lagrange multipliers). The main drawback of this approach is consistency (Werbos, 1990b). In DHP, instead of the single evaluation of a certain state that HDP performs through the cost-to-go, there is a multiple evaluation of it. In fact, there are as many λ as the dimension of the state space to be estimated. They can suffer from autocorrelation and unobserved variables, which can cause poor approximation (these difficulties parallel conventional least-squares statistical estimation). As a result, the λ_i can be in conflict with one another and Schwarz's theorem may not hold:

∂λ_i(t)/∂x_j(t−1) ≠ ∂λ_j(t)/∂x_i(t−1),    where λ_i(t) = ∂J*(t)/∂x_i(t)    (18)

Moreover, the derivatives of the local cost with respect to the state and the control also have to be determined and, if an analytical form is not available, they add noise to the system.
In Local Dynamic Programming this problem is solved using a modified learning procedure in which a set of trajectories close to the actual one is explored. The reliability is further increased by the use of the Correcting Functions. Operatively, this is equivalent to performing the movement starting from the actual initial state and from a set of close states. The reliability of the estimate of the gradient of the cost-to-go decreases outside the explored trajectories. The price to be paid is the increase in the number of trajectories to be explored in every forward pass, which grows from 1 to 1 + 2S, and in the required computations, which grow by the same amount for the Critic only. This does require more training time, but it allows the system to move in the right direction in weight space every time the weights are updated. This procedure can also be seen as a "local run and twiddle" strategy, similar to that proposed by Barto et al. (1983), where the set of trajectories around the actual trajectory is sequentially explored and, at the end, the sequence of controls which gave the minimum cost-to-go for each time step is chosen. In this respect LDP can be seen as a particular case of a reinforcement learning algorithm.


5.3 Classification of LDP
From the connectionist point of view, the systems based on neural networks that generate temporal sequences are classified in two broad families depending on the paradigm they adopt: supervised learning or reinforcement learning (Jordan and Rumelhart, 1992). Supervised learning requires quantitative information about the error vector at each time step. It can be subdivided into two categories depending on whether the signed error vector refers to the output of the units of the network or to a function that is only related to them; in the latter case the term learning with a distal teacher has been introduced (Jordan, 1988). Reinforcement learning requires only qualitative information about the error (successful, unsuccessful, good or bad) that is given only at the end of the movement and is therefore global. In the first case, the most used learning technique is the gradient-descent method, while in reinforcement learning the optimal sequence of controls is achieved by a stochastic search of the whole space.
Dynamic programming approaches (like LDP) lie in between. They rely on global quantitative information, like the cost and the cost-to-go, and it is the job of the Critic to translate this global measure into a local error. As far as the error in the Critic is concerned, the local target expected from the Critic is partially given from the outside and partially generated by the Critic itself (equation 4). For the Controller network, we do not have any error measure for the output units, but we use the local gradient of the Critic to tune its parameters (equation 6). This gradient goes to zero when the control sequence becomes optimal. Although the local "error" used at every time step is a deterministic quantity, there is no external teacher which can provide the target values for the two networks from which to compute this error. It is the structure of the system and the learning procedure that construct the error: the system creates its teacher by itself. Therefore these procedures can be more conveniently classified as unsupervised learning with an internal teacher.

5.4 Convergence
The use of sigmoidal neural networks as approximators has the nice property of constraining the output to the finite interval [0, 1] which, using a linear coding,


corresponds to the interval [v_min, v_max]. Therefore, the output of both the Critic and the Controller is kept limited, and there is a natural resistance of the networks, thanks to the sigmoidal shape of their transfer function, to increasing (or decreasing) their output towards the maximum (minimum) value⁵. The drawback is the distortion introduced at the extreme ends of the output space. We defined values for [v_min, v_max] bigger than the extreme expected values of the encoded variables (for the jerk an interval from −0.3 to +0.3 was defined, about 10% wider than the output jerk). The other drawback is the need to dynamically remap the output of the Critic to encode a broad class of values: in the simulation described in the previous section, the Critic has to be able to output values that range from 2.7 (optimum) to 25 for the worst initial situation (figure 7). If we want to maintain a constant scaling factor (f_s in equation 7), we must change the values of λ and µ in equation A4 as learning progresses to avoid saturation; this is equivalent to dynamically changing the range of the cost-to-go represented by the Critic⁶. If the output of the Critic enters saturation, the gradient tends to zero, the signal-to-noise ratio becomes very low (the gradient becomes comparable to the numerical error of the computer) and the overall system gets stuck. We therefore monitor the value of the global cost and set new λ and µ values whenever it gets close to the lower saturation value of the cost-to-go. With the new values of λ and µ, we increase the absolute value of the cost and move the Critic network out of saturation. This is equivalent to dynamically remapping the representation of the global cost. Whenever we changed the values of these parameters, we briefly retrained the Critic network without changing the Controller, to facilitate the learning process.
The updating of λ and µ could also be done in a continuous way (Kawato et al., 1990), but in this case the estimate of the cost-to-go would

⁵ This may be a feature shared with our nervous system, from which NN models are often inspired: when neural commands are forced to operate at their limits they can be highly distorted. A low firing rate may be strongly affected by noise, and a very high firing rate can reach saturation and cannot be modulated further.
⁶ The same problem applies to the Controller when an estimate of the minimum and maximum of the control sequence cannot be obtained in advance.


never achieve a high degree of reliability, because the function that it has to estimate (equation A4) changes at every trajectory.
If we set µ to 0, there is still an implicit cost on the smoothness of the output of the Controller, given by the continuity properties of sigmoidal neural networks, in which two outputs generated by two close states (given by position, velocity, acceleration and time) tend to be close to one another. In this respect, with µ = 0 the HDP and LDP approaches can be viewed as a way to perform BTT running the system only forward in a recurrent neural network structure.
The optimal temporal sequence of controls can be generated with very different time courses in the hidden units. Different schedulings of λ and µ and different sets of initial weights can cause the system to settle on different temporal profiles for the hidden units. Comparing figures 5 and 6, we can notice that the time-course activity in the hidden units is structurally different and cannot be reduced from one to the other by a permutation of the hidden units. This can be attributed to the different path used by the system for learning, in terms of learning parameters and scheduling of the dynamic remapping. In supervised learning this would suggest a redundant representation of the control function, but in LDP the system goes through the generation of many control sequences and needs a greater computational capability; this shows that the representation of the final trajectory is over-determined, and this is indeed the case. Using supervised learning to teach the network to store and retrieve the minimum-jerk sequence of controls, eight hidden units are sufficient to memorize the sequence, but they are not sufficient to discover the optimal sequence starting from scratch. The higher number of hidden units is required to provide the greater computational power needed to sequentially explore the whole set of movements performed during the learning phase. We remark here that after an optimal sequence of controls has been learned (that is, discovered), it can be stored in smaller structures, more suitable for storage than for computation. In mammalian motor control, it has been hypothesized that these two structures exist and can be identified with the cerebral motor cortex and the cerebellum (Burnod & Dufossé, 1990). The cerebral cortex would be mainly involved in discovering how to perform the task. After learning has taken place, the cerebellum would become responsible for the


repetitive parts of the task, memorizing the sensory-motor (state-to-output) transformations that are required.
To speed up learning, different time steps are used for the Controller and the Critic. Due to the simple procedure used for computing the derivatives (Euler integration in the Plant), a small time step is required to get a good approximation of the continuous behaviour. The Controller-Plant loop therefore uses a very small time step to keep this error small. This is not required by the Critic, for which the integration of the information over a certain number of time steps helps to filter out the noise introduced by the estimation of the cost-to-go. We use a ratio of 10:1 between the time step of the Plant-Controller loop and the computation of the weight updates: after p time steps of the Controller-Plant loop have been processed, the system analyses the actual output of the Critic, compares it with that of p time steps before, and computes the error used to tune the Critic as:

e_{a,k} = ( J_{k1} − Σ_{j=0}^{p−1} U_{k1+j} − J_{k1+p} )²,    k = k1, k1+1, k1+2, ..., k1+p−1    (19)

and the value of the Critic is used to tune the Controller of p time steps before. Moreover, the time step is not fixed throughout the whole learning process. In the first epochs, a coarse representation in time of the trajectory can be accepted; as learning progresses, the need for more accuracy requires a greater number of samples of both the Controller and the Critic. A coarse representation of the trajectory is used in the first phases of learning, until the system has a low error on the final state, and the finer representation is used thereafter. This coarse-to-fine coding has been explored recently (Mjolsness et al., 1991) and appears to significantly improve the speed of convergence without degrading the accuracy of the output learned by the network.
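A sketch of the 10:1 multi-rate Critic error of equation 19; J_k1 and J_k1_plus_p are Critic outputs p fine time steps apart and local_costs are the p local costs collected in between:

def critic_block_error(J_k1, local_costs, J_k1_plus_p):
    # equation 19, evaluated once every p time steps of the Controller-Plant loop
    return (J_k1 - sum(local_costs) - J_k1_plus_p) ** 2

# With p = 10, local_costs holds U_{k1}, ..., U_{k1+9}; the same error is attributed to
# every Critic update inside the block.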

5.5 Multiple Trajectories
One of the difficulties in learning the absolute optimum in the multiple-trajectories situation may arise from the difficulty of the function to be learnt. For a particular input state vector, for example [500, 0, 0], the system must be able to output a positive jerk for t = 0 and a negative jerk for t = 250, and very different values for the Critic (figure 8). In this situation a pre-processing of the inputs


would help. It has recently been proposed (Jacobs et al., 1991) to decompose the functional space into a set of "experts" that work in a particular domain, with different networks allocated to different experts. Another approach might be the decomposition of the temporal profile of the controller and of the trajectory with respect to the spatial profile: the temporal profile would be scaled according to the distance to be travelled, the initial velocity and the initial acceleration, and replayed at variable speeds (Heskes & Gielen, 1992). Another solution could be the partitioning of the state and control spaces, using Gaussian coding (Moody and Darken, 1989) for the weights; this has been shown to improve learning performance by decreasing the degree of collinearity of the error (i.e. the correlation of the network units with respect to the updating of the weights).
Once the system has been trained to learn a set of trajectories, it is not guaranteed that it can learn an arbitrary new trajectory, unless the new trajectory can be obtained by interpolating in a region of space where the executed trajectories lie. In our case, we can see that the system does extrapolate and produces controls that are comparable to the ones generated for the explored trajectories (figure 9). This may be attributed to the exploration of different regions of the space carried out during the learning phase and to the relatively simple shape of the cost-to-go for this task.

6.0 CONCLUSION
Optimal control is a natural way to describe the formation of an elementary motor act in humans, where the goal would be dictated by internal needs and the smoothness by evolution. DP is an effective way to relate actual consequences to past actions, translating global minimisation problems into local ones which are mathematically much more tractable. The use of LDP, with a local exploration of the space, allows convergence to an optimal solution in a more robust way.
The system proposed is able to learn minimum-jerk trajectories given only a local cost associated with each control in the sequence and the desired final state. It is constituted of a two-level hierarchical adaptive structure that allows a separation of the learning phase (Critic) from the execution phase (Controller-Plant loop). The Critic may also have the role of recalibrating the


Controller whenever the performance degrades. This property can make the system robust with respect to changes in the environment and to ageing. It also has implications for system intelligence, where intelligence is defined as the ability to discover the best long-term behaviour depending on the circumstances.

Appendix A: Dynamic Programming and Optimal Control
In Optimal Control theory, the temporal sequence that is generated is the solution of a minimization problem in functional space: the system has to find the control sequence that minimizes a global cost composed of a cost associated with each control element of the sequence and of the deviation from a desired final state (x_N^d) and a set of via-points. This problem can be analytically expressed as:

min_{u_k}  Σ_{k=0}^{N−1} U_k + λ · C_N,    k = 0, 1, 2, ..., N−1    (A1)

U_k is the local cost associated with control and state at time step k, and C_N is the cost associated with the deviation from the desired final state:

U_k = h(x_k, u_k)    (A2a)
C_N = φ(x_N − x_N^d)    (A2b)

The parameter λ regulates the degree of accuracy with which the final target is achieved: the higher the value of λ, the higher the cost associated with deviations from the final state and the more precisely the system will try to reach it. In the connectionist literature, Kawato et al. (1991) suggested the terminology of hard constraint for A2b, which specifies a point of the state trajectory, and soft constraint for A2a, because it does not directly specify any constraint on the elements of the temporal sequence.
In Dynamic Programming, the Temporal Credit Assignment problem is solved implicitly by transforming the global cost into a local minimization problem. The key concept is the cost-to-go, that is, the cost associated with a certain sequence of controls from a certain state and time to the end of the movement:

J_k = Σ_{j=k}^{N−1} U_j + λ · C_N = J(x_k, k)    (A3)


The cost-to-go at time k = 0 for the initial state x = x_0 corresponds to the global cost of that movement. To find the optimal sequence of controls {u_k}, the solution is carried out recursively, working backwards in time from the last time step to the first one. At each time step a two-step procedure is carried out:

- The cost-to-go is computed for all the possible states at step k:

$$J(x_k, u_k, k) = U_k + J_{k+1} = h(x_k, u_k) + J(f(x_k, u_k), k+1), \qquad k = N-1, N-2, \ldots, 1, 0 \tag{A4}$$

- The optimal control u_k is computed for that k-th time step:

$$J^*(x_k, k) = \min_{u_k} J(x_k, u_k, k) \tag{A5a}$$

$$u_k^* = g(x_k, k) \tag{A5b}$$

where J*(x_k, k) is the optimal cost-to-go at time k. If J_{k+1} is optimal, it can be demonstrated that u_k^* and J*(x_k, k) are themselves optimal for the global minimization problem. We indicate with

$$x_{k+1} = f(x_k, u_k, k) \tag{A6}$$

the next-state function that at each time step outputs the state x_{k+1} that the plant would reach starting from the actual state x_k when the control u_k is applied. All the computations required relate to a single time step and take place "locally" in time. We stress here that we have no a-priori clue about which trajectory and path the system will produce, nor about which the control sequence will be: it is determined solely by the process of optimization itself. Dynamic Programming is computationally very expensive: the optimal cost-to-go and control have to be determined for each possible state, at each time step. If S is the dimension of the state space, each dimension being quantized into k levels, and N is the number of time steps, the number of states to be evaluated amounts to k^S at each time step, i.e. N·k^S overall. Exhaustive solutions are therefore not feasible for a large state space, and approximate numerical solutions have been proposed in recent years. A minimal sketch of the backward recursion A4-A5 on such a quantized grid is given below.
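The sketch below runs the backward recursion A4-A5 exhaustively on a toy one-dimensional plant with quantized state and control sets, to make the procedure (and its cost) concrete. It is a minimal sketch under assumed choices and not the LDP scheme of this paper: the integrator plant f, the quadratic local cost h, the grid sizes and the value of λ are all illustrative.

```python
import numpy as np

# Toy problem: all numerical choices are illustrative, not the paper's task.
N = 20                                    # number of time steps
states = np.linspace(0.0, 1.0, 51)        # quantized 1-D state grid
controls = np.linspace(-0.1, 0.1, 21)     # quantized control set
x_target = 1.0                            # desired final state x_N^d
lam = 100.0                               # weight lambda of the final-state cost

def f(x, u):                              # next-state function (A6): a simple integrator
    return np.clip(x + u, 0.0, 1.0)

def h(x, u):                              # local cost U_k (A2a): quadratic in the control
    return u ** 2

# Backward recursion (A4, A5) over the quantized grid
J = lam * (states - x_target) ** 2        # terminal cost-to-go: lambda * C_N on the grid
policy = np.zeros((N, len(states)), dtype=int)

for k in range(N - 1, -1, -1):
    J_new = np.empty_like(J)
    for i, x in enumerate(states):
        x_next = f(x, controls)                                  # all candidate next states
        idx = np.abs(states[:, None] - x_next).argmin(axis=0)    # nearest grid state
        q = h(x, controls) + J[idx]                              # A4: U_k + J_{k+1}
        policy[k, i] = int(np.argmin(q))                         # A5b: u_k^* = g(x_k, k)
        J_new[i] = q[policy[k, i]]                               # A5a: J^*(x_k, k)
    J = J_new

# Roll the optimal policy forward from x_0 = 0
x = 0.0
for k in range(N):
    i = int(np.argmin(np.abs(states - x)))
    x = f(x, controls[policy[k, i]])
print("final state:", x)                  # ends close to the target for large lambda
```

Even on this toy grid the exhaustive sweep costs N × (number of states) × (number of controls) evaluations; with a multidimensional state the number of grid states grows as k^S, which is why approximate schemes such as HDP and LDP replace the tabulated J values with an adaptive estimator.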


ACKNOWLEDGEMENTS

We wish to thank Dr. Bruce Hoff and Dr. F. Lacquaniti for helpful and stimulating discussions. This work was mainly carried out at the CNE of USC, where N.A. Borghese was supported by CNR grant n. 203.15.3/1990.

REFERENCES

Atkeson & Reinkensmeyer (1990) Using Content-Addressable Memories to Control Robots. In: Neural Networks for Control, WT Miller, RS Sutton, PJ Werbos, eds., Cambridge, MA: MIT Press, 287-299.
Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Systems, Man, and Cybernetics, SMC-13(5): 834-846.
Bellman RE (1957) Dynamic Programming. Princeton, NJ: Princeton University Press.
Berenji HR, Khedkar P (1992) Learning and tuning fuzzy logic controllers through reinforcement. IEEE Trans. Neural Networks, 3(5): 724-740.
Bryson AE Jr, Ho YC (1969) Applied Optimal Control. Waltham, MA: Ginn and Co.
Burnod Y, Dufossé M (1990) A model for the co-operation between cerebral cortex and cerebellar cortex in movement learning. In: Brain and Space, J Paillard (ed.), Oxford University Press, 446-458.
Heskes TM, Gielen S (1992) Retrieval of Pattern Sequences at Variable Speeds in a Neural Network With Delays. Neural Networks, 5: 145-152.
Heskes TM, Slijpen ETP, Kappen B (1993) Physical Review, 46(8).
Hinton GE, Sejnowski TJ (1983) Optimal perceptual inference. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 448-453.
Hoff BR (1992) A Computational Description of the Organization of Human Reaching and Prehension. Technical Report USC-CS-92-523 / USC-CNE-92-02, University of Southern California, Center for Neural Engineering, Los Angeles, CA.


Hogan N (1982) An organizing principle for a class of voluntary movements. J Neurosci, 4(11): 2745-2754.
Hornik K, Stinchcombe M, White H (1988) Multi-layer feed-forward networks are universal approximators. Discussion paper, Department of Economics, University of California, San Diego, La Jolla, CA.
Ito M (1984) The Cerebellum and Neural Control. New York: Raven Press.
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive Mixtures of Local Experts. Neural Computation, 3: 79-87.
Jervis TT, Fallside F (1992) Pole Balancing on a Real Rig using a Reinforcement Learning Controller. TR 115, Cambridge University.
Jordan MI (1988) Supervised learning and systems with excess degrees of freedom. MIT COINS Technical Report 88-27.
Jordan MI, Rumelhart DE (1992) Forward models: Supervised learning with a distal teacher. Cognitive Science, in press.
Kalman RE, Falb PL, Arbib MA (1969) Topics in Mathematical System Theory. McGraw-Hill.
Kawato M, Furukawa K, Suzuki R (1987) A hierarchical neural network model for control and learning of voluntary movement. Biological Cybernetics, 57: 169-185.
Kawato M, Maeda Y, Uno Y, Suzuki R (1991) Trajectory Formation of Arm Movement by Cascade Neural Network Model Based on Minimum Torque-change Criterion. Technical Report TR-A-0056.
Massone L, Bizzi E (1989) A Neural Network Model for Limb Trajectory Formation. Biol Cybern, 61: 417-425.
Miller WT, Herves RP, Glanz FH, Kraft LG (1990) Real-time dynamic control of an industrial manipulator using a neural network based learning controller. IEEE Trans. Robotics and Automation, 6: 1-9.
Mjolsness E, Garrett CD, Miranker WL (1991) Multiscale Optimization in Neural Nets. IEEE Trans. Neural Networks, 2(2): 263-273.
Moody J, Darken CJ (1989) Fast Learning in Networks of Locally-Tuned Processing Units. Neural Computation, 1: 281-294.


Narendra KS, Parthasarathy K (1991) Gradient Methods for the Optimization of Dynamical Systems Containing Neural Networks. IEEE Trans. Neural Networks, 2(2): 252-262.
Nguyen D, Widrow B (1990) The Truck Backer-Upper: An Example of Self-Learning in Neural Networks. In: Neural Networks for Control, WT Miller, RS Sutton, PJ Werbos, eds., Cambridge, MA: MIT Press, 287-299.
Psaltis D, Sideris A, Yamamura A (1987) Neural Controllers. Proc. ICNN, IV: 551-558.
Rumelhart DE, Hinton GE, McClelland JL (1986) Parallel Distributed Processing, Vol. 1. Cambridge, MA: Bradford Books/MIT Press.
Shanno DF (1990) Recent advances in numerical techniques for large-scale optimization. In: Neural Networks for Control, WT Miller, RS Sutton, PJ Werbos, eds., Cambridge, MA: MIT Press, 171-178.
Sutton RS (1988) Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3: 9-44.
Uno Y, Kawato M, Suzuki R (1989) Formation and control of optimal trajectory in human multijoint arm movement: a minimum torque-change model. Biol Cybern, 61: 89-101.
Watkins CJCH, Dayan P (1992) Q-Learning. Machine Learning, 8: 279-292.
Weitzenfeld A (1991) NSL: Neural Simulation Language Version 2.1. CNE-TR 9105, University of Southern California, Center for Neural Engineering.
Werbos PJ (1974) Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD dissertation, Committee on Appl. Math., Harvard Univ., Cambridge, MA, Nov. 1974.
Werbos PJ (1987) Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17: 7-19.
Werbos PJ (1989) Maximizing long-term gas industry profits in two minutes in Lotus using neural network methods. IEEE Transactions on Systems, Man, and Cybernetics, 19(2): 315-333.
Werbos PJ (1990a) A menu of designs for reinforcement learning over time. In: Neural Networks for Control, WT Miller, RS Sutton, PJ Werbos, eds., Cambridge, MA: MIT Press, 67-95.


Werbos PJ (1990b) The consistency of HDP applied to a simple reinforcement problem. Neural Networks, 3: 179-189.
Werbos PJ (1990c) Backpropagation through Time: What it is and how to do it. Proceedings of the IEEE, 78(10): 1550-1560.
Williams RJ (1988) Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University.
Williams RJ, Zipser D (1989) A Learning Algorithm for Continually Running Fully Recurrent Networks. Neural Computation, 1: 270-280.

