Knowledge Transfer in Markov Decision Processes

Caitlin Phillips
Supervised by Joelle Pineau, Doina Precup, and Prakash Panangaden
Reasoning and Learning Lab, McGill University
September 14, 2006

Abstract

Markov Decision Processes (MDPs) are an effective way to formulate many problems in Machine Learning. However, learning the optimal policy for an MDP can be a time-consuming process, especially when nothing is known about the policy to begin with. An alternative approach is to find a similar MDP for which an optimal policy is known, and modify this policy as needed. We present a framework for measuring the quality of knowledge transfer when transferring policies from one MDP to another. Our formulation is based upon the use of MDP bisimulation metrics, which provide a stable quantitative notion of state similarity for MDPs. Given two MDPs and a state mapping from the first to the second, a policy defined on the latter naturally induces a policy on the former. We provide a bound on the value function of the induced policy, showing that if the two MDPs are behaviorally close in terms of bisimulation distance and the original policy is close to optimal, then the induced policy is guaranteed to be close to optimal as well. We also present some experiments in which simple MDPs are used to test the tightness of the bound provided by the bisimulation distance. In light of the results of these experiments, we suggest a new similarity measure.

Contents

1 Introduction
2 Background
  2.1 Markov Decision Processes
  2.2 Bisimulation
  2.3 Bisimulation Metrics
3 State Similarity Metrics
  3.1 The Fixed Point Metric
  3.2 The Total Variation Metric
  3.3 The Bisimulation Metric
  3.4 The Fixed Point Metric With Sampling
  3.5 The Fixed Point Metric Adapted for Policy Transfer Purposes
  3.6 The Fixed Policy Measure
4 The Similarity Calculation
5 Experiments
  5.1 Benchmarking Samples
  5.2 Basic Knowledge Transfer on Gridworlds
  5.3 How Tight Are The Bounds?
  5.4 Policy-Specific Estimates
6 Conclusion and Future Work

1 Introduction

Markov decision processes (MDPs) are a common way of encoding decision-making problems in reinforcement learning environments. They provide a standard formalism for multi-stage decision making, while at the same time being versatile enough to capture the stochastic nature of realistic situations. The goal of an MDP is to maximize some measure of performance over time. An MDP is considered to be solved when a policy (a way of behaving for each state) has been discovered which maximizes the expected return. However, finding this policy is a computationally intensive task, usually done using dynamic programming algorithms such as policy iteration or value iteration [Puterman, 1994]. It is often the case that such algorithms are too costly to be used on any decision-making problem of realistic size. Thus, when it is possible to make these computations, it would be nice to get as much mileage out of them as possible. Specifically, rather than learning a new policy for every MDP, a policy could be learned on one MDP and then transferred to another, similar MDP, where it is either used as is, or treated as a starting point from which to learn a policy.

Clearly this transfer cannot be done successfully between any two MDPs. We must give a better definition of what constitutes a similar MDP. In probabilistic systems such as MDPs, similarity is measured in terms of behavior: we consider two systems to be alike if they behave in a similar manner to one another. This is best captured formally through the notion of bisimulation [Milner, 1980; Park, 1981; Larsen and Skou, 1991]. Bisimulation can roughly be described as the largest equivalence relation on the state space of an MDP that relates two states precisely when, for every action, they achieve the same immediate reward and have the same probability of transitioning to classes of equivalent states. This means that bisimilar states lead to essentially the same long-term behavior.

For the purposes of knowledge transfer, an equivalence relation such as bisimulation is too strong. That is, it is unlikely that we will find two distinct MDPs whose states have exactly the same probabilities of transitioning to other equivalent states and exactly the same rewards. We need more information. Rather than simply asking whether two MDPs are similar or not, we would like to know how dissimilar they are. To this end, we use (pseudo)metrics. The metrics on which we focus specify the degree to which objects of interest behave similarly. In [Ferns et al., 2004; 2005], based on similar work in the context of labeled Markov processes [Desharnais et al., 1999; van Breugel and Worrell, 2001a; 2001b], the authors extended bisimulation for MDPs quantitatively in terms of such metrics. This summer, we have been adapting these metrics for the purposes of knowledge transfer, as well as defining new measures of similarity. In addition, we present a bound on the sub-optimality of a transferred policy, based on the similarity of the MDPs and the success of the original policy. The policy transfer was then tested on simple MDPs.

The paper is organized as follows. In section 2, we define the concepts of MDPs and bisimulation metrics. In section 3, we outline some of the ways in which we estimated the bisimulation metric in our experiments. In section 4, we formalize the notion of policy transfer in Markov decision processes and present a bound on the success of a transferred policy. In section 5, we present the results of experiments designed to test the accuracy of this bound.

2 Background

2.1 Markov Decision Processes

A Markov decision process is defined as a quadruple M = (S, A, {P^a_{ss'}}, {R^a_s}) where: S is a finite set of states, A is a set of actions, P^a_{ss'} is a probability function P : S × A × S → [0, 1], which denotes the probability of transitioning from state s to state s' under action a, and R^a_s is a reward function R : S × A → ℝ, the reward received for taking action a while in state s.

A policy π : S × A → [0, 1] is a way of choosing actions probabilistically based on the current state, meaning that in a given state s, each action is chosen with a certain probability. To measure the success of a policy, we use a value function V^π, which estimates the expected cumulative reward over an infinite horizon with a suitable discount factor. It can be expressed as the unique solution to the set of fixed point equations known as the Bellman equations:

    V^\pi(s) = \sum_{a \in A} \pi(s, a) \left[ R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V^\pi(s') \right]

where γ ∈ (0, 1) is a discount factor. An optimal policy is one which maximizes the expected cumulative reward over an infinite horizon. It is associated with a more specialized form of the Bellman equations:

    V^*(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V^*(s') \right)

Note that this is just the standard Bellman equation under a policy which chooses the optimal action with probability 1. Both sets of equations can be solved using dynamic programming, but as mentioned earlier, this technique is too costly to repeat for every MDP. Specifically, if a good policy is known for a certain MDP, we would like to either reuse this policy on similar MDPs, or at least use it as a starting point for further policy iteration (making the dynamic programming much less costly). Intuitively, this makes sense only if these other models are similar in some sense to the original.

Given two MDPs M_1 = (S_1, A, {P^a_{ss'}}, {R^a_s}) and M_2 = (S_2, A, {Q^a_{tt'}}, {R^a_t}) which share the same set of actions, we can transfer a policy from M_2 to M_1 as follows. We must first define a mapping ρ : S_1 → S_2 between the corresponding state spaces. Note that this mapping does not need to be one-to-one. With this mapping, a policy defined on M_2 naturally induces a policy on M_1:

    \pi_1(s, a) = \pi_2(\rho(s), a)

The goal of the metrics described below is to determine how effective a policy will be when transferred from one MDP to another. This is done by estimating the resulting value functions under the policy we are trying to transfer.
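To make these definitions concrete, here is a minimal sketch of value iteration, policy evaluation, and the induced policy π_1(s, a) = π_2(ρ(s), a) for finite MDPs stored as NumPy arrays. The array layout (P with shape (A, S, S), R with shape (S, A)), the function names, and the use of NumPy are our own conventions for illustration, not something prescribed by the report.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Solve V*(s) = max_a [R[s, a] + gamma * sum_s' P[a, s, s'] V*(s')].

    P: transitions with shape (A, S, S); R: rewards with shape (S, A).
    Returns the optimal values and a greedy (deterministic) policy.
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def policy_evaluation(P, R, pi, gamma):
    """Evaluate a stochastic policy pi[s, a] by solving the linear Bellman system."""
    S = R.shape[0]
    P_pi = np.einsum('sa,ast->st', pi, P)              # state-to-state matrix under pi
    r_pi = (pi * R).sum(axis=1)                        # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def transfer_policy(pi2, rho):
    """Induce pi1 on M1 from pi2 on M2 via the state map rho: S1 -> S2.

    pi2: (S2, A) stochastic policy; rho: integer array of length S1 with entries in S2.
    """
    return pi2[rho]                                    # pi1[s, a] = pi2[rho[s], a]
```

A deterministic policy returned by value_iteration can be converted to the stochastic form expected here with np.eye(A)[greedy], after which policy_evaluation on M_1's model gives V^{π_1} directly.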

2.2 Bisimulation

In order to compare the similarity of two MDPs, we need in fact only concern ourselves with the similarity of states in a given MDP, for two MDPs can be conceptually viewed as one by taking their disjoint union. Thus, equivalence of two models can be defined in terms of equivalence of their states. In [Givan et al., 2003], the authors investigated several notions of MDP state equivalence and determined that the most appropriate is bisimulation, the strongest of a number of equivalence relations considered in concurrency theory:

Definition 2.1. Bisimulation is the largest equivalence relation ∼ on S satisfying the following property:

    s \sim s' \iff \forall a \in A, \; \big( r_s^a = r_{s'}^a \text{ and } \forall C \in S/\!\sim, \; P_s^a(C) = P_{s'}^a(C) \big)

Unfortunately, as an exact equivalence, bisimulation suffers from issues of instability; that is, slight numerical differences in the MDP parameters, {r^a_s} and {P^a_{ss'}}, can lead to vastly different bisimulation partitions. To get around this, one generalizes the notion of bisimulation equivalence through bisimulation metrics.

2.3 Bisimulation Metrics

A metric is essentially a function that assigns distances between all pairs of points in the state space.

Definition 2.2. A pseudometric on S is a map d : S × S → [0, ∞) such that for all s, s', s'':

1. s = s' ⇒ d(s, s') = 0
2. d(s, s') = d(s', s)
3. d(s, s'') ≤ d(s, s') + d(s', s'')

If the converse of the first axiom holds as well, we say d is a metric. (For convenience we will use the terms metric and pseudometric interchangeably; however, we really mean the latter.)

The main reason for moving to metrics is that they quantitatively generalize equivalence relations; the three axioms are the quantitative analogues of reflexivity, symmetry, and transitivity, respectively. In fact, if one identifies states that are at distance zero for a given metric, one gets a full-fledged equivalence relation. If that relation happens to be bisimulation, we say that the metric is a bisimulation metric. In [Ferns et al., 2004; 2005], the authors provided the following metric generalization of bisimulation:

Theorem 2.3. Let c ∈ (0, 1) and V be the set of real-valued functions on S × S that are bounded by R/(1 − c), where R = max_{s,s',a} |r^a_s − r^a_{s'}|. Define F^c : V → V by

    F^c(h)(s, s') = \max_{a \in A} \big( |r_s^a - r_{s'}^a| + c \, T_K(h)(P_s^a, P_{s'}^a) \big)

Then:

1. F^c has a unique fixed point d^c_fix,
2. d^c_fix(s, s') = 0 ⟺ s ∼ s',
3. for any h_0 ∈ V, ‖d^c_fix − (F^c)^n(h_0)‖ ≤ (c^n / (1 − c)) ‖F^c(h_0) − h_0‖, and
4. d^c_fix is continuous in {r^a_s} and {P^a_{ss'}}.

Here T_K(h)(P, Q) is the Kantorovich probability metric applied to distributions P and Q. It is defined as min_λ E_λ[h], where λ is a joint probability function on S × S with projections P and Q, i.e.

    \min_{\lambda} \sum_{k,j=1}^{|S|} \lambda_{kj} \, h(s_k, s_j)

subject to:

    \forall k: \; \sum_j \lambda_{kj} = P(s_k)
    \forall j: \; \sum_k \lambda_{kj} = Q(s_j)
    \forall k, j: \; \lambda_{kj} \ge 0

Figure 1: A visualization of the minimum cost flow problem. Supply nodes (the distributions P^a_s) ship units to demand nodes (the distributions Q^a_t); d(s_i, t_j) is the cost of shipping one unit from s_i to t_j, and f(s_i, t_j) = d(s_i, t_j) × (number of units shipped along s_i t_j). The goal is to ship the contents of the supply nodes to the demand nodes, according to their needs, at minimum total cost.

This is an instance of the minimum cost flow (MCF) network linear program, and as such it admits various solution schemes. In fact, efficient means of approximating the Kantorovich metric exist and have recently led to an efficient approximation algorithm for computing d^c_fix [Ferns et al., 2006]. As shown in Figure 1, the goal of the minimum cost flow problem is to "ship" the contents of the supply nodes to the demand nodes, according to their needs.

The importance of Theorem 2.3 is twofold: it provides us with a bisimulation metric that is sensitive to perturbations in the MDP parameters and, moreover, one that specifies the degree to which two states are equivalent, e.g. it makes sense to state that two states are 50% equivalent. The former implies that the metric is a stable measure of similarity, and the latter is useful for aggregation. Perhaps most important, however, is what bisimulation metrics have to say about the values of states. It is the following theorem, relating how the optimal values of states vary with quantitative bisimilarity, that initially provides justification for the belief that bisimulation metrics can be usefully applied to the problem of knowledge transfer in MDPs:

Theorem 2.4. Suppose γ ≤ c. Then V^* is 1-Lipschitz continuous with respect to d^c_fix, i.e.

    |V^*(s) - V^*(s')| \le d^c_{fix}(s, s').

This continuity result tells us that if two models are closely similar with respect to bisimulation distance, then so are their optimal value functions. In this sense, one can "transfer" the optimal policy of one model to the other.
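To make the transportation LP above concrete, the sketch below computes T_K(h)(P, Q) with an off-the-shelf LP solver. This is only a rough illustration under our own conventions (a dense cost matrix and SciPy's HiGHS backend); the efficient minimum-cost-flow schemes cited above are not reproduced here. One of the n + m equality constraints is redundant, which the solver tolerates.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(h, P, Q):
    """Kantorovich (transportation) distance T_K(h)(P, Q).

    h: (n, m) matrix of pairwise costs; P (length n) and Q (length m) are distributions.
    Solves  min_lambda sum_{k,j} lambda[k, j] * h[k, j]
            s.t. sum_j lambda[k, j] = P[k],  sum_k lambda[k, j] = Q[j],  lambda >= 0.
    """
    n, m = h.shape
    cost = h.reshape(-1)                         # lambda flattened row-major: index k*m + j
    A_eq = np.zeros((n + m, n * m))
    for k in range(n):                           # row sums match the supplies P
        A_eq[k, k * m:(k + 1) * m] = 1.0
    for j in range(m):                           # column sums match the demands Q
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([P, Q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

For the within-MDP metrics of Section 3 the cost matrix is square (a candidate distance on S × S); the rectangular form becomes convenient in Sections 3.5 and 3.6, where states of two different MDPs are compared.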

3 State Similarity Metrics

3.1 The Fixed Point Metric

The fixed point metric d^c_fix, as defined above, can be rewritten as:

    d^c_{fix}(s, s') = \max_{a \in A} \big( |r_s^a - r_{s'}^a| + c \, T_K(d^c_{fix})(P_s^a, P_{s'}^a) \big)

Because d^c_fix is a fixed point, this metric can be calculated iteratively, by starting with a metric that is zero everywhere and iterating until the difference in metric distances between rounds drops below a certain threshold [Ferns et al., 2006]. However, for our purposes, we found this metric too slow to run on any MDP of significant size (in some cases it required more time and memory than computing the value functions directly would), and because of this, we were only able to use the fixed point metric for benchmarking purposes and for checking the accuracy of other metrics.
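A minimal sketch of this iteration, assuming a kantorovich(h, p, q) routine such as the LP sketch in Section 2.3 and the array layout used earlier (these are our choices, not the report's):

```python
import numpy as np

def fixed_point_metric(P, R, c, kantorovich, tol=1e-4, max_iter=1000):
    """Iterate F^c starting from the zero metric until the change falls below tol.

    P: (A, S, S) transitions; R: (S, A) rewards; c: discount in (0, 1).
    """
    A, S, _ = P.shape
    d = np.zeros((S, S))
    for _ in range(max_iter):
        d_new = np.zeros((S, S))
        for s in range(S):
            for t in range(S):
                d_new[s, t] = max(
                    abs(R[s, a] - R[t, a]) + c * kantorovich(d, P[a, s], P[a, t])
                    for a in range(A)
                )
        if np.abs(d_new - d).max() < tol:
            return d_new
        d = d_new
    return d
```

Each sweep solves on the order of |S|^2 |A| transportation problems, which is consistent with the observation above that the exact metric can cost more than computing the value functions directly.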

3.2 The Total Variation Metric

Unlike the fixed point metric, the total variation metric does not use the Kantorovich distance at all. Instead, it uses the total variation distance and gives a very rough upper bound on the distance between two states. It is, however, very quick to compute and can be useful for larger MDPs. The total variation metric is as follows:

    d^c_{TV}(s, s') = \max_{a \in A} \Big( |r_s^a - r_{s'}^a| + \frac{cR}{1-c} \, TV(P_s^a, P_{s'}^a) \Big)

where TV(P_s^a, P_{s'}^a) = \frac{1}{2} \sum_{t \in S} |P_s^a(t) - P_{s'}^a(t)| [Ferns et al., 2006].
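A short sketch of this upper bound, following the formula above (with the maximization over actions taken over both the reward and total-variation terms, as reconstructed here):

```python
import numpy as np

def total_variation_metric(P, R, c):
    """d^c_TV(s, t): a quick upper bound using total variation instead of Kantorovich."""
    A, S, _ = P.shape
    Rmax = np.abs(R[:, None, :] - R[None, :, :]).max()   # R = max_{s,t,a} |R[s,a] - R[t,a]|
    d = np.zeros((S, S))
    for s in range(S):
        for t in range(S):
            d[s, t] = max(
                abs(R[s, a] - R[t, a])
                + c * Rmax / (1 - c) * 0.5 * np.abs(P[a, s] - P[a, t]).sum()
                for a in range(A)
            )
    return d
```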

3.3 The Bisimulation Metric

This metric is very similar to the total variation metric; however, it uses bisimulation as well. First the exact bisimulation is computed (that is, which states are bisimilar to one another), then the total variation metric is calculated with respect to the equivalence classes determined by the bisimulation:

    d^c_{TV_\sim}(s, s') = \max_{a \in A} \Big( |r_s^a - r_{s'}^a| + \frac{cR}{1-c} \, TV_\sim(P_s^a, P_{s'}^a) \Big)

where TV_\sim(P_s^a, P_{s'}^a) = \frac{1}{2} \sum_{C \in S/\sim} |P_s^a(C) - P_{s'}^a(C)|.
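Computing the exact bisimulation can be done by partition refinement: start with a single block and repeatedly split any block whose states disagree, for some action, on the (rounded) reward or on the probability mass sent into some current block. The sketch below is one way to do this; the rounding tolerance and data layout are our own choices. The class masses P^a_s(C) it produces are then plugged into the Section 3.2 formula in place of the per-state probabilities.

```python
import numpy as np

def bisimulation_partition(P, R, decimals=10):
    """Iterative partition refinement: returns an array block[s] of class labels.

    P: (A, S, S) transitions, R: (S, A) rewards.  Bisimilar states end up with the
    same label; rewards and probabilities are rounded before comparison.
    """
    A, S, _ = P.shape
    block = np.zeros(S, dtype=int)          # start with a single block
    n_blocks = 1
    while True:
        signatures = {}
        new_block = np.empty(S, dtype=int)
        for s in range(S):
            # Signature: current block, plus per-action reward and mass into each block.
            sig = (block[s],) + tuple(
                (round(R[s, a], decimals),
                 tuple(round(P[a, s][block == b].sum(), decimals) for b in range(n_blocks)))
                for a in range(A)
            )
            new_block[s] = signatures.setdefault(sig, len(signatures))
        if len(signatures) == n_blocks:      # no block was split: the partition is stable
            return new_block
        block, n_blocks = new_block, len(signatures)

def class_masses(P, block):
    """Aggregate P[a, s, .] into class probabilities P^a_s(C) for each bisimulation class."""
    A, S, _ = P.shape
    n_blocks = block.max() + 1
    Pc = np.zeros((A, S, n_blocks))
    for b in range(n_blocks):
        Pc[:, :, b] = P[:, :, block == b].sum(axis=2)
    return Pc
```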

3.4 The Fixed Point Metric With Sampling

The fixed point metric with sampling is a compromise between runtime and accuracy. Rather than using the transition probabilities to set up a linear program and solve the distances exactly, it generates a specified number of sample trajectories from each state [Ferns et al., 2006]. The Kantorovich distances between states in these samples are then used to solve a linear program, much like the fixed point metric. However, the linear program in this case is an assignment problem rather than a transportation problem, making it much quicker to solve:

    T_K(d^c)(P_s^a, Q_t^a) = \min_{\sigma} \frac{1}{n} \sum_{k=1}^{n} d(X_k, Y_{\sigma(k)})

where X_1, ..., X_n and Y_1, ..., Y_n are the next states sampled from P_s^a and Q_t^a respectively, and σ ranges over permutations of {1, ..., n}. The estimated distances are then used to update the metric according to the following equation:

    d^c(s, s') = \max_{a \in A} \big( |R_s^a - R_{s'}^a| + c \, T_K(d^c)(P_s^a, P_{s'}^a) \big)

The updated distance is then used to solve another assignment problem, and the process is repeated until the maximum change in Kantorovich values between updates is below a specified threshold.
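With equal sample weights, the transportation LP collapses to an assignment problem, which SciPy solves directly. The sketch below assumes a sampler sample(s, a, n) returning n next states drawn from P(· | s, a); this helper and the array conventions are our own, not the report's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def kantorovich_from_samples(d, X, Y):
    """Estimate T_K(d)(P, Q) from equal-size next-state samples X ~ P and Y ~ Q.

    Minimizes (1/n) * sum_k d[X[k], Y[sigma[k]]] over permutations sigma.
    """
    cost = d[np.ix_(X, Y)]                       # cost[k, j] = d(X[k], Y[j])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def sampled_fixed_point_metric(sample, S, A, R, c, n=30, tol=1e-3, max_iter=200):
    """Sampling variant of the fixed point iteration of Section 3.1."""
    d = np.zeros((S, S))
    for _ in range(max_iter):
        d_new = np.zeros((S, S))
        for s in range(S):
            for t in range(S):
                d_new[s, t] = max(
                    abs(R[s, a] - R[t, a])
                    + c * kantorovich_from_samples(d, sample(s, a, n), sample(t, a, n))
                    for a in range(A)
                )
        if np.abs(d_new - d).max() < tol:
            break
        d = d_new
    return d
```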


3.5 The Fixed Point Metric Adapted for Policy Transfer Purposes

In order to use the fixed point metric with sampling for policy transfer, we needed to modify it to calculate distances between states in different MDPs. This is done by generating two separate matrices of samples (one for each MDP) and using them to solve the assignment problem and update the Kantorovich estimate. In other words, the distance between two states s and t is only computed if s lies in one MDP and t lies in the other. It is important to note two things about this variation: the matrix of distances it produces is not necessarily square, or symmetric with a zero diagonal, unlike the ones produced by comparing states within the same MDP; and the updating equation now involves the rewards of both MDPs:

    d^c(s_k, t_j) = \max_{a \in A} \big( |R1_{s_k}^a - R2_{t_j}^a| + c \, T_K(d^c)(P_{s_k}^a, P_{t_j}^a) \big)

where R1 and R2 are the reward matrices for the first and second MDP, respectively.

3.6 The Fixed Policy Measure

The fixed policy measure, rather than maximizing distances over all actions, uses a weighted sum, as defined by a policy specified on input, to estimate distances:

    d^{c,\pi}(s_k, t_j) = \sum_{a \in A} |\pi(s_k, a) \, (R1_{s_k}^a - R2_{t_j}^a)| + c \sum_{a \in A} \pi(s_k, a) \, T_K(d^{c,\pi})(P_{s_k}^a, P_{t_j}^a)

The purpose of this measure is to weight the distance between two states under a certain action proportionally to how often that action is chosen. Besides this, the fixed policy measure works in a manner very similar to the fixed point metric as adapted for policy transfer. It is important to note, however, that this variation is not actually a metric, only a means of measuring similarity in MDPs.
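One update of this measure can be written against a cross-MDP distance matrix d of shape (|S1|, |S2|) and a kantorovich(h, p, q) routine that accepts a rectangular cost matrix (such as the Section 2.3 sketch). The names and shapes below are again our own conventions.

```python
import numpy as np

def fixed_policy_update(d, pi, R1, R2, P1, P2, kantorovich, c):
    """One sweep of the fixed policy measure d^{c,pi}(s_k, t_j).

    pi: (S1, A) policy weights; R1: (S1, A), R2: (S2, A) rewards;
    P1: (A, S1, S1), P2: (A, S2, S2) transitions; d: current (S1, S2) estimate.
    """
    S1, A = R1.shape
    S2 = R2.shape[0]
    d_new = np.zeros((S1, S2))
    for s in range(S1):
        for t in range(S2):
            reward_term = sum(abs(pi[s, a] * (R1[s, a] - R2[t, a])) for a in range(A))
            transport_term = sum(pi[s, a] * kantorovich(d, P1[a, s], P2[a, t])
                                 for a in range(A))
            d_new[s, t] = reward_term + c * transport_term
    return d_new
```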

4 The Similarity Calculation

Using the concepts described above, we now present a bound on the sub-optimality of a transferred policy. Suppose M_i = (S_i, A, {P^a_{ss'}}_i, {r^a_s}_i), for i = 1, 2, are two given MDPs with the same action set, and suppose further that there is a mapping ρ : S_1 → S_2 specifying for each state in the original MDP its representative state in the latter. Any policy π_2 defined on M_2 naturally defines a policy on M_1, given by π_1(s, a) = π_2(ρ(s), a). In this way, we are transferring policy π_2 from M_2 to M_1. The policies here may not be optimal. Thus, in order to assess the quality of knowledge transfer in MDPs, we first need to specify what it means for a given policy to be good; it is this quality that we would like to be invariant under transfer. Intuitively, a policy π is good if it is very close to optimal; thus, we can take as a measure the distance between the value of that policy and the optimal value function, ‖V^π − V^*‖ (and of course, the policy is good if this quantity is small). We now come to the crux of the matter; namely, how does one assess the quality of policy transfer under the current setup? In order to answer this question formally, one must first understand that the problem at hand is to measure the quality of the transfer with respect to two components: the knowledge to be transferred (the policy π_2) and the mechanism of that transfer (the mapping ρ). With that in mind, we have the following result:

Theorem 4.1. Given M_1, M_2, ρ, π_2, and π_1 as above, and with γ ≤ c,

    \| V^{\pi_1} - V_1^* \| \le \frac{2}{1-c} \max_{s \in S_1} d^c_{fix}(s, \rho(s)) + \frac{1+c}{1-c} \| V^{\pi_2} - V_2^* \|.

Proof. Let us write M = (S, A, {P^a_{ss'}}, {r^a_s}) for the disjoint union of M_1 and M_2. That is, S is the disjoint union of S_1 and S_2, and transitions between the two models are taken to be zero. It follows directly from the notation that the restriction of the optimal value function V^* of M to M_i is V_i^* for i = 1, 2. First let us note that by the triangle inequality, we have for any s ∈ S_1,

    |V^{\pi_1}(s) - V_1^*(s)| \le |V^{\pi_1}(s) - V^{\pi_2}(\rho(s))| + |V^{\pi_2}(\rho(s)) - V_2^*(\rho(s))| + |V_2^*(\rho(s)) - V_1^*(s)|.

The latter two terms are easily seen to be bounded above by ‖V^{π_2} − V_2^*‖ and max_{s∈S_1} d^c_fix(s, ρ(s)), respectively (for the last one, note that V^*(ρ(s)) = V_2^*(ρ(s)) and V^*(s) = V_1^*(s)). The first term requires a little more work:

    V^{\pi_1}(s) - V^{\pi_2}(\rho(s))
      \le \max_{a \in A} \{ |r_s^a - r_{\rho(s)}^a| + \gamma ( \sum_{x \in S_1} P_{sx}^a V^{\pi_1}(x) - \sum_{y \in S_2} P_{\rho(s)y}^a V^{\pi_2}(y) ) \}
      \le \max_{a \in A} \{ |r_s^a - r_{\rho(s)}^a| + \gamma ( \sum_{x \in S_1} P_{sx}^a V_1^*(x) - \sum_{y \in S_2} P_{\rho(s)y}^a V_2^*(y) )
            + \gamma ( \sum_{x \in S_1} P_{sx}^a (V^{\pi_1}(x) - V_1^*(x)) + \sum_{y \in S_2} P_{\rho(s)y}^a (V_2^*(y) - V^{\pi_2}(y)) ) \}
      \le \max_{a \in A} \{ |r_s^a - r_{\rho(s)}^a| + \gamma \sum_{z \in S} (P_{sz}^a - P_{\rho(s)z}^a) V^*(z) \} + c ( \|V^{\pi_1} - V_1^*\| + \|V_2^* - V^{\pi_2}\| )
      \le \max_{a \in A} \{ |r_s^a - r_{\rho(s)}^a| + c \, T_K(d^c_{fix})(P_s^a, P_{\rho(s)}^a) \} + c ( \|V^{\pi_1} - V_1^*\| + \|V_2^* - V^{\pi_2}\| )
      \le d^c_{fix}(s, \rho(s)) + c ( \|V^{\pi_1} - V_1^*\| + \|V_2^* - V^{\pi_2}\| )

Here we have again used Theorem 2.4 as well as the primal LP for the Kantorovich metric. Combining this inequality with the other two gives ‖V^{π_1} − V_1^*‖ ≤ 2 max_{s∈S_1} d^c_fix(s, ρ(s)) + c‖V^{π_1} − V_1^*‖ + (1 + c)‖V^{π_2} − V_2^*‖, and rearranging yields the desired result.
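As a small sanity check, the right-hand side of Theorem 4.1 can be evaluated directly once the distances d^c_fix(s, ρ(s)) and the value vectors on M_2 are available. The constants below follow the rearrangement in the proof above, and the helper is our own illustration rather than part of the report.

```python
import numpy as np

def transfer_bound(d_to_rep, v_pi2, v2_star, c):
    """Right-hand side of Theorem 4.1.

    d_to_rep: array of d^c_fix(s, rho(s)) for every s in S1;
    v_pi2, v2_star: value vectors of pi_2 and of the optimal policy on M2.
    """
    gap2 = np.abs(np.asarray(v_pi2) - np.asarray(v2_star)).max()
    return 2.0 / (1.0 - c) * np.max(d_to_rep) + (1.0 + c) / (1.0 - c) * gap2
```

Comparing this quantity with the observed loss ‖V^{π_1} − V_1^*‖ (e.g. via the policy_evaluation sketch of Section 2.1) is essentially the experiment reported in Section 5.3.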

5 Experiments

5.1 Benchmarking Samples

Prior to beginning our work on knowledge transfer, we ran a simple experiment to determine how many samples were required (when estimating the distance using the sampling method) to get a reasonable estimate. For testing purposes, we looked at the distance between states in one MDP, rather than comparing two different MDPs. The MDP used (shown in Figure 2) was a simple two-state one with two actions a_0 and a_1, flip and stay, each succeeding with p = 0.9, and one reward of 1.0 for taking the stay action in state 1. The bisimulation distances were estimated by running the sampling algorithm 30 times and averaging the result. They were then compared to the actual bisimulation distances using the L_∞ norm, ‖D_sample − D_optim‖_∞, where D_sample is the vector of pairwise distances as determined by the sampling method, and D_optim is the vector of actual distances. The resulting values were then scaled by a factor of 1 − γ to account for the fact that, as γ approaches 1, the errors due to imperfect sampling build up more.
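For reference, the benchmark MDP can be written down directly in the array convention of the earlier sketches; the ordering of states and actions below is our own choice.

```python
import numpy as np

# Two states {0, 1}; actions: 0 = stay, 1 = flip, each succeeding with probability 0.9.
P = np.array([
    [[0.9, 0.1],    # stay, from state 0
     [0.1, 0.9]],   # stay, from state 1
    [[0.1, 0.9],    # flip, from state 0
     [0.9, 0.1]],   # flip, from state 1
])

# A single reward of 1.0 for taking `stay` in state 1.
R = np.array([
    [0.0, 0.0],     # state 0: (stay, flip)
    [1.0, 0.0],     # state 1: (stay, flip)
])
```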

Figure 2: A diagram of the simple MDP used for benchmarking.

Figure 3: A comparison of the accuracy of the metric for three different discount factors (γ = 0.1, 0.5, 0.9); each panel plots the error of the estimate against the number of samples (5 to 100).

As the number of samples used increased from 5 to 100, the errors in the distance estimations seemed to decrease steadily when the discount factor was 0.1 or 0.5. However, when the discount factor was 0.9, even with 100 samples, the estimates still fluctuated a lot with each iteration. This is because, as the discount factor approaches 1, the horizon of events approaches infinity. This means that slight differences in probabilities and rewards continue to build up, causing the distance between states to appear much larger than would be expected.

5.2 Basic Knowledge Transfer on Gridworlds

We were interested in determining whether the distances estimated by the sampling algorithm made sense with respect to policy transfer in a simple gridworld environment. To assess this, we created three gridworlds, each with the same starting state and a slightly different goal location. These were created using an MDP-generating script and had four actions, up, down, right, and left, available in each state (when not bounded by walls). Each action succeeds with a probability of 0.9.

Figure 4: Gridworlds A, B, and C.

For each pair of gridworlds, we plotted the distance between corresponding states according to a color map (Figure 5). The dark blue squares are walls, which were assumed to have a distance of 0, while all other states are colored according to how "close" they were estimated to be. The blue states (furthest from the goal) are estimated to be relatively close, while the red states (closer to the goal) are estimated to be a great distance apart. This is to say that if a policy were transferred from one map to another, it would be near optimal at the states near the start, but approaching the goal it would become suboptimal.

Figure 5: State-to-state distances in gridworlds A and B (left), A and C (middle), and B and C (right).
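For readers who want to reproduce this kind of experiment, here is a rough sketch of how a gridworld like those in Figure 4 can be turned into the (P, R) arrays used by the earlier sketches. The report's MDP-generating script is not available to us, so the conventions below (moves blocked by walls or edges stay put, the goal state rewards every action, reward magnitude 1.0) are assumptions.

```python
import numpy as np

def gridworld_mdp(grid, goal, p_success=0.9, goal_reward=1.0):
    """Build (P, R) for a gridworld given as a list of strings with '#' marking walls."""
    rows, cols = len(grid), len(grid[0])
    cells = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c] != '#']
    index = {cell: i for i, cell in enumerate(cells)}
    S, A = len(cells), 4
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
    P = np.zeros((A, S, S))
    R = np.zeros((S, A))
    for (r, c), s in index.items():
        for a, (dr, dc) in enumerate(moves):
            target = index.get((r + dr, c + dc), s)   # walls and edges leave the state unchanged
            P[a, s, target] += p_success
            P[a, s, s] += 1.0 - p_success
        if (r, c) == goal:
            R[s, :] = goal_reward
    return P, R
```

Since gridworlds A, B, and C share the same layout and differ only in the goal cell, the state mapping ρ used for transfer can simply be the identity on cell coordinates.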

5.3 How Tight Are The Bounds?

Finally, we wanted to check whether the state-to-state distances estimated by the metric were close to the actual differences in the value functions of the states. To determine this, we used a simple dynamic programming algorithm which iteratively approximated the value function at each state. Then, we compared the actual difference in value functions to the bounds on these differences that we had determined using the metric. Unfortunately, the bounds were remarkably loose. While the value functions of corresponding states never differed by more than 1.0, the estimated distances were almost consistently a factor of 10 higher. One obvious reason for this is the different way the metric and the value function are calculated:

    |V^*(s) - V^*(t)| = \Big| \max_a \{ R_s^a + \gamma \sum_{s'} P_{ss'}^a V^*(s') \} - \max_a \{ R_t^a + \gamma \sum_{t'} P_{tt'}^a V^*(t') \} \Big|

while the distance according to the Kantorovich metric is calculated as follows:

    d(s, t) = \max_a \{ |R_s^a - R_t^a| + \gamma \, T_K(d)(P_s^a, P_t^a) \}

Thus, if two states had equal values under two different actions, the first method would find them to be of equal value, while the second would not.

State  MDP A Value  MDP B Value  Estimated Bound  Actual Distance
0      1.855691     2.084788     3.421240         0.229097
1      2.084788     2.342170     3.844900         0.257382
2      2.342170     2.631326     4.317780         0.289156
3      2.956181     3.321142     5.569800         0.364961
4      3.321142     3.731160     6.397050         0.410018
5      2.956181     3.321142     3.002120         0.364961
6      2.084788     2.342170     3.638480         0.257382
7      2.342170     2.631326     4.176220         0.289156
8      2.631326     2.956181     4.808280         0.324855
9      2.956181     3.321142     5.501970         0.364961
10     3.321142     3.731160     6.177630         0.410018
11     3.731160     4.191797     7.155850         0.460637
12     3.321142     3.731160     2.685850         0.410018
13     2.084788     2.342170     4.141060         0.257382
14     4.191797     4.709302     8.093910         0.517505
15     1.651769     1.855691     3.202630         0.203922
16     1.855691     2.084788     3.578440         0.229097
17     1.651769     1.855691     2.391550         0.203922
18     4.191797     4.709302     8.122640         0.517505
19     4.709302     5.290698     9.251210         0.581396
20     5.290698     4.709302     9.269860         0.581396
21     1.470256     1.651769     2.942590         0.181513
22     1.651769     1.855691     2.384340         0.203922
23     1.470256     1.651769     2.142650         0.181513
24     4.709302     5.290698     9.141850         0.581396
25     5.290698     4.709302     9.209990         0.581396
26     4.709302     5.290698     9.223670         0.581396

Figure 6: Difference in value functions vs. estimated distances on maps A and B

State  MDP A Value  MDP C Value  Estimated Bound  Actual Distance
0      1.855691     2.084788     2.929290         0.229097
1      2.084788     2.342170     3.300910         0.257382
2      2.342170     2.631326     3.827520         0.289156
3      2.956181     3.321142     5.144850         0.364961
4      3.321142     3.731160     5.925020         0.410018
5      2.956181     3.321142     2.303890         0.364961
6      2.084788     2.342170     3.236510         0.257382
7      2.342170     2.631326     3.709860         0.289156
8      2.631326     2.956181     4.376920         0.324855
9      2.956181     3.321142     4.985930         0.364961
10     3.321142     3.731160     5.768060         0.410018
11     3.731160     4.191797     6.617360         0.460637
12     3.321142     3.731160     2.050670         0.410018
13     2.084788     2.342170     3.598320         0.257382
14     4.191797     4.709302     7.569580         0.517505
15     1.651769     1.855691     2.585340         0.203922
16     1.855691     2.084788     2.978970         0.229097
17     1.651769     1.855691     1.373590         0.203922
18     4.191797     4.709302     7.706310         0.517505
19     4.709302     5.290698     8.772420         0.581396
20     5.290698     4.709302     8.769280         0.581396
21     1.470256     1.651769     2.441830         0.181513
22     1.651769     1.855691     1.368290         0.203922
23     1.470256     1.651769     1.203640         0.181513
24     4.709302     4.191797     7.766920         0.517505
25     5.290698     4.709302     8.850660         0.581396
26     4.709302     5.290698     8.813700         0.581396

Figure 7: Difference in value functions vs. estimated distances on maps A and C

State  MDP B Value  MDP C Value  Estimated Bound  Actual Distance
0      2.084788     2.084788     3.236810         0.000000
1      2.342170     2.342170     3.686560         0.000000
2      2.631326     2.631326     4.169410         0.000000
3      3.321142     3.321142     5.354830         0.000000
4      3.731160     3.731160     5.985870         0.000000
5      3.321142     3.321142     2.731580         0.000000
6      2.342170     2.342170     3.665020         0.000000
7      2.631326     2.631326     4.118510         0.000000
8      2.956181     2.956181     4.659310         0.000000
9      3.321142     3.321142     5.235970         0.000000
10     3.731160     3.731160     5.967660         0.000000
11     4.191797     4.191797     6.667000         0.000000
12     3.731160     3.731160     3.059440         0.000000
13     2.342170     2.342170     4.072350         0.000000
14     4.709302     4.709302     7.490190         0.000000
15     1.855691     1.855691     3.213740         0.000000
16     2.084788     2.084788     3.605180         0.000000
17     1.855691     1.855691     1.576010         0.000000
18     4.709302     4.709302     7.428470         0.000000
19     5.290698     5.290698     8.330860         0.000000
20     4.709302     4.709302     8.108500         0.000000
21     1.651769     1.651769     3.208820         0.000000
22     1.855691     1.855691     1.583600         0.000000
23     1.651769     1.651769     1.399950         0.000000
24     5.290698     4.191797     8.289150         1.098901
25     4.709302     4.709302     8.099150         0.000000
26     5.290698     5.290698     9.099150         0.000000

Figure 8: Difference in value functions vs. estimated distances on maps B and C

5.4 Policy-Specific Estimates

In an effort to get a more accurate estimate of the success of a transferred policy, we developed the fixed policy measure, as described in section 3.6. While this measure lacks several qualities of the bisimulation metric which we would like to have available, it does serve some intuitive purpose. In a manner similar to the way the use of metrics allowed us to look at similar MDPs in addition to equivalent ones, the fixed policy measure allows for the consideration of MDPs which may not always behave similarly, but which, when restricted to choosing actions according to a given policy, are indeed very similar. Figure 9 displays the results of transferring an optimal policy from map A to map B. While the state-to-state distance is still larger than the actual difference in value functions, it is not nearly as large as the distance estimated by the original metric.

Perhaps the biggest issue with the fixed policy measure is that it puts knowledge transfer back in the realm of heuristic methods. Indeed, just as a policy can be defined to minimize behavioral differences between MDPs, one can also be defined to exploit them. This means that the bound established in section 4 no longer holds, and it may be the case that our only means of evaluating a transferred policy is by transferring it and then calculating the value function: exactly the calculation we were trying to avoid.

Figure 9: Actual difference in value functions compared to the estimated distances with respect to a policy, and the estimated distances according to the metric, on maps A and B.

6 Conclusion and Future Work

At the current time, it appears that we have two possible means of measuring the similarities of MDPs. One has all the characteristics of a metric that we would like, but does not provide a close enough estimate, whereas the other is more accurate, but doesn't provide the guarantees we would like. One possible action to take from here would be to continue to work towards a compromise between the two extremes. Perhaps it is possible to modify the fixed policy measure to retain some of the attributes of a metric. Another option is to look at our original metric using larger time steps. As well as putting less emphasis on differences in rewards at every time step, this approach involves raising the discount factor to a power, thus addressing the issues that occurred when the discount factor was 0.9 or higher. On the other hand, it may be that bisimulation is just not the right approach to policy transfer in MDPs, and while it provides a rough estimate, another technique is needed to give an accurate estimation of how well a policy will transfer.


References

[Boutilier et al., 1999] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.

[Desharnais et al., 1999] Josee Desharnais, Vineet Gupta, Radha Jagadeesan, and Prakash Panangaden. Metrics for labeled Markov systems. In International Conference on Concurrency Theory, pages 258–273, 1999.

[Ferns et al., 2004] Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI-04), pages 162–16, Arlington, Virginia, 2004. AUAI Press.

[Ferns et al., 2005] Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for Markov decision processes with infinite state spaces. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), page 201, Arlington, Virginia, 2005. AUAI Press.

[Ferns et al., 2006] Norman Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden. Methods for computing state similarity in Markov decision processes. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), Arlington, Virginia, 2006. AUAI Press.

[Givan et al., 2003] Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.

[Larsen and Skou, 1991] Kim G. Larsen and Arne Skou. Bisimulation through probabilistic testing. Information and Computation, 94(1):1–28, 1991.

[Milner, 1980] Robin Milner. A Calculus of Communicating Systems. Lecture Notes in Computer Science, Vol. 92. Springer-Verlag, 1980.

[Park, 1981] David Park. Concurrency and automata on infinite sequences. In Proceedings of the 5th GI-Conference on Theoretical Computer Science, pages 167–183, London, UK, 1981. Springer-Verlag.

[Puterman, 1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.

[van Breugel and Worrell, 2001a] Franck van Breugel and James Worrell. Towards quantitative verification of probabilistic transition systems. In ICALP '01: Proceedings of the 28th International Colloquium on Automata, Languages and Programming, pages 421–432, London, UK, 2001. Springer-Verlag.

[van Breugel and Worrell, 2001b] Franck van Breugel and James Worrell. An algorithm for quantitative verification of probabilistic transition systems. In CONCUR '01: Proceedings of the 12th International Conference on Concurrency Theory, pages 336–350, London, UK, 2001. Springer-Verlag.
