Approximate dynamic programming for an inventory problem: Empirical comparison

Share Embed


Descripción

Computers & Industrial Engineering 60 (2011) 719–743

Contents lists available at ScienceDirect

Computers & Industrial Engineering journal homepage: www.elsevier.com/locate/caie

Approximate dynamic programming for an inventory problem: Empirical comparison q Tatpong Katanyukul a,⇑, William S. Duff a, Edwin K.P. Chong b a b

Mechanical Engineering Department, College of Engineering, Colorado State University, United States Electrical and Computer Engineering Department, College of Engineering, Colorado State University, United States

a r t i c l e

i n f o

Article history: Received 4 December 2009 Received in revised form 22 November 2010 Accepted 16 January 2011 Available online 22 January 2011 Keywords: Approximate dynamic programming Inventory control Reinforcement learning Simulation Heterogeneity AR(1)/GARCH(1,1)

a b s t r a c t This study investigates the application of learning-based and simulation-based Approximate Dynamic Programming (ADP) approaches to an inventory problem under the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model. Specifically, we explore the robustness of a learning-based ADP method, Sarsa, with a GARCH(1,1) demand model, and provide empirical comparison between Sarsa and two simulation-based ADP methods: Rollout and Hindsight Optimization (HO). Our findings assuage a concern regarding the effect of GARCH(1,1) latent state variables on learning-based ADP and provide practical strategies to design an appropriate ADP method for inventory problems. In addition, we expose a relationship between ADP parameters and conservative behavior. Our empirical results are based on a variety of problem settings, including demand correlations, demand variances, and cost structures. Ó 2011 Elsevier Ltd. All rights reserved.

1. Introduction Inventory management is one of the major business activities. A well-managed inventory can help a business stay competitive by keeping its cash flow at a controllable level. Recently, Zhang (2007) analyzed data obtained from the M3 Census Bureau and established the appropriateness of the GARCH(1,1) model in inventory demand. He also showed that the inventory costs may increase significantly when the GARCH(1,1) model is not accounted for and analytically developed an orderupto-level policy for a problem without a setup cost. However, his solution is highly problem specific—a change in the structure of the problem requires reanalysis of the problem. For example, a setup cost is common in practice, but it is not included in Zhang (2007). Inclusion of a setup cost into a problem makes a cost function highly nonlinear and renders an analytical approach very difficult. Furthermore, an analytical approach is time consuming and requires highly specialized skill and effort, as discussed in Kleinau and Thonemann (2004) and Jiang and Sheng (2009). Inventory problems appear in various forms and their forms often change over time. Therefore an analytical approach is not suitable for practical inventory problems. Many articles, e.g., Silver (1981), Lee and

q

This manuscript was processed by area editor Mohamad Y. Jaber.

⇑ Corresponding author.

E-mail addresses: [email protected] (T. Katanyukul), [email protected] (W.S. Duff), [email protected] (E.K.P. Chong). 0360-8352/$ - see front matter Ó 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.cie.2011.01.007

Billington (1992) and Bertsimas and Thiele (2006) to name a few, addressed the need for an efficient flexible inventory solution simple to implement in practice. Approximate Dynamic Programming (ADP) is a method to solve a practical Markov decision problems. It does not rely on an analytical treatment of the problem or hard-to-obtain information, e.g., transition probabilities. Therefore ADP has received extensive attention in many control applications, as discussed by Werbos (2004). ADP uses an approximation technique, which mostly is attained by either a learning-based scheme or a simulation-based scheme. A learning-based ADP method uses a learning technique, e.g., temporal difference learning, to approximate state-action costs. A simulation-based ADP method uses simulation, e.g., Monte Carlo simulation, to approximate state-action costs. Most previous studies of an ADP application to inventory management focused on learning-based ADP; see, e.g., Van Roy, Bertsekas, Lee, and Tsitsiklis (1997), Giannoccaro and Pontrandolfo (2002), Shervais, Shannon, and Lendaris (2003), Kim, Jun, Baek, Smith, and Kim (2005), Kwon, Kim, Jun, and Lee (2008), Kim, Kwon, and Baek (2008), Chaharsooghi, Heydari, and Zegordi (2008) and Jiang and Sheng (2009). Only Choi, Realff, and Lee (2006) investigated simulationbased ADP. Nonetheless, these two schemes are not mutually exclusive. For example, Kim et al. (2008) used simulation to evaluate consequences of actions not taken for accelerating a learning process in their learning-based ADP. Although all these studies showed satisfying results in their domain problems, to the best of our knowledge, there is no previous

720

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

work investigating an ADP application to a problem with the GARCH(1,1) model. GARCH(1,1) introduces two latent state variables. The latent state variables will be inadvertently left out if the GARCH(1,1) model is not accounted for. This has posed a challenge to the model-free property of a learning-based ADP method. When a model of the problem is available, a simulation-based ADP method is another alternative. A simulation-based ADP method is an approach intermediate between an analytical approach and a learning-based ADP method. An analytical approach requires a model of the problem and a development of an analytical solution. A learning-based ADP method requires minimum knowledge of the problem. A simulation-based ADP method requires a model of the problem and, rather than relying purely on analysis, it uses simulation to provide information for assisting action selection. Two simulation-based ADP methods are investigated here: Rollout and Hindsight Optimization (HO). Though it has been used in other applications, Rollout had been investigated for inventory problems in only one study. HO, as discussed by Chong, Givan, and Chang (2000), had not been studied for an inventory problem previously. Also, how these simulation-based ADP methods perform compared to a learning-based ADP method had not been investigated previously. Our study is intended to fill in this space. It investigates the application of ADP to an inventory problem with demand of AR1/GARCH(1,1) and provides empirical comparison among Sarsa, Rollout, and HO. The findings here provide a practical approach to design an ADP method for an inventory problem and an insight into relations of ADP components, performance, and control behavior. Furthermore, the results reaffirm the model-free property of a learning-based ADP method even in the presence of latent state variables introduced by GARCH(1,1). The understanding exposed in this paper will help promote efficient inventory management and aid in the transfer of inventory research into practice. This section has provided an overview of our study. Section 2 provides a review of the related literature. Section 3 establishes the general background of the three methods. Section 4 explains the problem under investigation, how our empirical study is conducted, how each controller is set up, how experiments are performed, and how the experimental results are evaluated. Section 5 explains what the experimental results indicate. The last section presents conclusions and further discussions. 2. Literature review ADP has been recently introduced into inventory management research by Van Roy et al. (1997), Godfrey et al. (2001), Pontrandolfo, Gosavi, and Okobaa (2002), Giannoccaro and Pontrandolfo (2002), Shervais et al. (2003), Kim et al. (2005), Choi et al. (2006), Topaloglu and Kunnumkal (2006), Iida and Zipkin (2006), Chaharsooghi et al. (2008), Kim et al. (2008), Kwon et al. (2008) and Jiang and Sheng (2009). Fig. 1 shows a block diagram illustrating ADP methods used in previous studies. 2.1. Simulation-based ADP Of all these authors, only Choi et al. (2006) investigates the application of simulation-based ADP. Simulation was used to provide reduced state space, reduced action space, and approximate transition probabilities for a dynamic program, which in turn was solved with either value iteration or Rollout. Rollout uses simulation to provide approximate state-action costs. The simulation requires a control method to provide decisions in simulation. Such a control method is called a base policy. For their base policy, Choi et al. used an (s, S) policy whose parameters were obtained from a heuristic search over pre-defined sets. The pre-defined sets of

parameters used in Choi et al. (2006) are problem specific and it is unclear how they obtained them. Rollout is also investigated in our study. We use a simple formula based on the well-known Economic Order Quantity (EOQ) equation to determine parameters for the base policy. In addition to Rollout, HO—another simulationbased ADP method—has never been investigated for inventory problems. HO does not require a base policy. Therefore we investigate HO for its own virtues as well as to provide a good measure of how simulation-based ADP performs without the choice of a base policy. 2.2. Learning-based ADP Authors studying learning-based ADP methods investigated several learning schemes.  Van Roy et al. (1997) used one-step temporal difference learning (TD0).  Chaharsooghi et al. (2008) used Q-learning. Q-learning, introduced by Watkins (1989), is an algorithm based on TD0.  Kim et al. (2005) used an action-value method whose learning scheme was based on a weighted average value of a current approximation and a new observation. Their approach is similar to TD0, but it only approximates a current state-action value without a cost-to-go.  Kwon et al. (2008) and Jiang and Sheng (2009) used the casebased myopic reinforcement learning (CMRL) method developed by Kwon et al. (2008) CMRL is based on a combination of an action-value method and a case-based reasoning technique. Case-based reasoning is state aggregation with an ability to create a new aggregation when an observed state is over a preset range of any existing aggregation group.  Kim et al. (2008) proposed and used an asynchronous actionreward learning method. For a fast changing inventory system they assumed that information of action-consequence relations, regardless of state, was sufficient for decision making. Their asynchronous action-reward learning scheme is developed based on characteristics of inventory problems that allows simultaneously multiple action updates. Multiple action updates help accelerate the learning process to enable it to catch up with changes in the system. Instead of only updating an action-reward value for an action taken, approximate values of actions not taken were updated as well. Given an observation of an exogenous variable (e.g., demand), consequences of actions not taken can be calculated and the multiple updates achieved with these computed consequences.  Shervais et al. (2003) used the dual heuristic programming (DHP) method. DHP, discussed by Werbos (1992), is a learning ADP method that updates a control policy directly using derivatives of the cost function. It should be noted that inclusion of a setup cost, formulated as a mathematical step function, renders this method inapplicable to the problems addressed in our study, because a step function is not differentiable. However, there is a technique to approximate a step function with a sigmoid function. A sigmoid function is differentiable. The application of DHP with an approximate step function has not been investigated and it may be worth further study.  Giannoccaro and Pontrandolfo (2002) used the SMART algorithm, developed by Das, Gosavi, Mahadevan, and Marchalleck (1999). The SMART algorithm is similar to Q-learning. In Qlearning, every time step is assumed to be equal. Giannoccaro and Pontrandolfo studied an inventory problem whose time response is a function of a current state, a next state and a current action. To handle varied time response, SMART uses a time correction term and its associate procedures to approximate an average state-action value.

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

721

Fig. 1. A block diagram shows ADP methods using in previous studies.

Our work investigates Sarsa, a widely-used algorithm based on TD0. The learning scheme used in Van Roy et al. (1997, Section 6.2) is equivalent to Sarsa. 2.3. Function approximations used in learning-based ADP studies For the issue of function approximation, there are a few choices of function approximation studied previously for application to inventory management.  Look-up table. Jiang and Sheng (2009), Kim et al. (2008), Kwon et al. (2008), Kim et al. (2005) used a look-up table to implement a cost-to-go approximation. A look-up table is a simple index table where each entry, an approximate cost, can be accessed by an index, a state-action pair.  Aggregation. Giannoccaro and Pontrandolfo (2002) and Chaharsooghi et al. (2008) used an Aggregation. An Aggregation is an enhanced version of a look-up table. It is a look-up table with

a group of indexes, instead of a single index. Any indexing value falling within the same index group will be linked to the same entry. For the same problem, an Aggregation requires a smaller table than a look-up table.  Other function approximation. Van Roy et al. (1997) experimented with a linear combination of features and the Multilayer Perceptron Neural Network (MLP). Shervais et al. (2003) also used MLP for an approximate cost-to-go. Among these approximation choices, a look-up table is simplest to implement, but it suffers from a scalability issue. An Aggregation is a good alternative, but the size of its aggregation step needs to be carefully designed. A linear combination of features provides the efficiency of a linear computation, but it requires a customized selection of features specific to each problem. MLP is a very powerful approximation function, but its highly nonlinear nature makes it difficult to fine tune with ADP. A Radial Basis Function (RBF) is a universal approximation function. It can be operated in

722

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

a linear mode, which is preferable for ADP. RBF is a linear combination of locally active functions. Therefore it can be viewed either as a smooth interpolation of an Aggregation or as a linear combination of features, which are the Radial Bases. Our study investigates an application of ADP with RBF to inventory problems. Among previous authors applying ADP to inventory problems, no one has investigated the performance differences between simulation-based and learning-based ADP and applicability to inventory problems of ADP with RBF. In addition to previous studies of ADP application to inventory problems, the conclusions of Zhang (2007) about the presence and significance of the GARCH(1,1) model in inventory management has laid one of the important questions our research intents to answer. A learning-based ADP method has been referred to as the model-free solution by many authors, including Kaelbling, Littman, and Moore (1996). However, the GARCH(1,1) model introduces two latent state variables, which will be inadvertently left out if GARCH(1,1) is not accounted for. This has posed a challenge to the model-free property of a learning-based ADP method. The robustness of learning-based ADP for an inventory problem with GARCH(1,1) has not been investigated previously. The intent of our study is to investigate these unexplored issues to foster ADP applications for practical inventory management. 3. Background

X

a2AðsÞ

  pðs; a; s0 Þ cðs; a; s0 Þ þ aC tþ1 ðs0 Þ

ð1Þ

s0

where a is a discount factor; C t ðsÞ is a minimal expected cost from time t with state s; C tþ1 ðs0 Þ is a next minimal expected cost, often referred to as a cost-to-go; a is an action, chosen from A(s), a feasi0 ble set of actions for the given state; p(s, a, s ) is a transition proba0 bility and c(s, a, s ) is a transition cost. There are many ADP methods. Our study investigates Sarsa, Rollout, and HO. We use the Look-Ahead method as a benchmark for a performance comparison. (See Si, Barto, Powell, & Wunsch (2004) & Powell (2007) for a general discussion on ADP.) 3.1. Look-Ahead A Look-Ahead method is a method to solve MDP based on a future projection. Due to its ease of computation, our study uses the Look-Ahead method as a benchmark to compare its performance with other methods. A Look-Ahead method uses a model of the problem to project events ahead. It chooses actions to minimize a projected cost of a specific number of period(s) ahead. The cost and state projections in our study are projections with zero demand noise. Eq. (2) shows how the Look-Ahead method determines an action:

~ aH ¼ arg min ~ a

3.2. Sarsa Sarsa, as discussed by Sutton and Barto (1998), uses a learning technique, TD0, to facilitate decisions. Using TD0 helps alleviate a requirement of a model of the problem and a need for a transition probability. Sarsa has been investigated for inventory problems by Van Roy et al. (1997). However, the GARCH(1,1) model introduces two extra state variables, forecasting error z and its variance r2. These state variables are less likely to be observable than others. If a model of the problem is not well-defined, these extra state variables may be inadvertently left out. The performance of Sarsa with missing latent state variables had not been investigated previously. We investigated the robustness of Sarsa in the presence of the GARCH(1,1) model (see Section 4.3). Sarsa uses an approximate state-action cost Q(st, at), often called the Q-value, to determine an action and updates Q(st, at) based on TD0. The equations for Sarsa are

Cðst ; at Þ ¼ cðst ; at Þ þ a  Cðstþ1 ; atþ1 Þ wðs; aÞ ¼ cðs; aÞ þ aQ ðs0 ; a0Þ  Q ðs; aÞ Q ðs; aÞ

An inventory structure investigated here is a multi-period inventory decision problem. In multi-period decision problems, a decision is made to optimize both an immediate return and an effect that will carry on to the following periods. This type of problems can be generally classified as Markov decision problems (MDP). ADP is an approach to solve MDP based on the Bellman equation (Eq. (1)) by using an approximation technique to overcome the computational difficulty when applying an exact analysis. The computational difficulties are associated with transition probabilities and computational inefficiencies. The tendency of computational difficulties to increase exponentially when the number of state or action variables increases is referred to as curses of dimensionality. This equation is

C t ðsÞ ¼ min

where ~ a ¼ ½að1Þ að2Þ að3Þ . . . aðHÞ ; cðs; aÞ is the projected period cost; stþ1 ; . . . ; stþH1 are projected states at times t + 1, . . ., t + H  1, respectively; st ¼ st is the state of the beginning period; and the decided action is at = a(1).

H X i¼1

ai1 cðstþi1 ; aðiÞ Þ

ð2Þ

ð3Þ ð4Þ

Qðs; aÞ þ bwðs; aÞ

ð5Þ

at ¼ pðst jQ Þ

ð6Þ

where Q(s, a) approximates the state-action cost C(s, a); c(s, a) is a period cost; b is a Sarsa parameter, called the learning rate, with the other variables as defined earlier. Eq. (4) shows the updated signal w(s, a), called a temporal difference. There are several ways to implement Q(s, a). The simplest way is to use a look-up table so that each entry corresponds to each state-action pair. A Q-value implemented by a look-up table can be updated as shown in Eq. (5). Then, given a state, an action can be determined according to Q-values, as in Eq. (6). When Q-values well represent state-action costs, choosing an action with a low Q-value will lead to a low actual cost. However, in order to have well-approximate Q-values, the state-action space has to be well explored. Therefore, in order to perform well overall, an action has to be chosen on a basis of a good balance between exploitation and exploration (see Sutton & Barto (1998) for detail). Although it is simple to implement, a look-up table is computationally inefficient in case of a large or continuous state-action space. An alternative is a function approximation. A function approximation can be set up to use fewer parameters than the entries required by a look-up table. As discussed by Barreto and Anderson (2008), RBF, because of its capability to represent an arbitrary function, its focus on local adjustments and its ability to operate in a linear mode, is one of the most widely-used function approximations. RBF output is a weighted summation of basis functions of scaled distances between inputs ~ x and centers ~ v m . Eq. (7) shows a single output y 2 R of an RBF with M bases, given inputs ~ x 2 RD , with k~ x~ v m kZm a scaled Euclidean distance:

y ¼ w0 þ

M X

wm  /ðk~ x~ v m kZm Þ

ð7Þ

m¼1

where /() is a basis function; ~ v m is the mth RBF center; Zm is a D  D diagonal matrix whose entries along its diagonal are RBF scales (denoted as z1,m, z2,m, . . ., zD,m), w0, w1, w2, . . ., wM are RBF weights, and

k~ x~ v m kZm

vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u D uX ¼ ð~ x~ v m ÞT  Zm  ð~x  ~ v m Þ ¼ t zd;m  ðxd  v d;m Þ2 : d¼1

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

Our study uses a Gaussian function, /(a) = exp(a2), as an RBF basis. (See Bishop (2006) & Haykin (1994) for a general discussion on RBF.) When operating RBF in the linear mode, centers and scales are pre-defined and only RBF weights need to be tuned during the learning process. The RBF weights are updated as shown in Eq. (8) (a strategy to determine RBF centers and scales will be discussed in Section 4.2):

~ w

~ þ b  wðs; aÞ  ~ w /ðs; aÞ

ð8Þ T

~ ¼ ½w0 w1 . . . wM  ; ~ where w /ðs; aÞ is the column vector [1 /1 . . . T /M] with /m ¼ /ðk½s a0  ~ v m kZm Þ and [s a]0 the column vector composed of state and action variables.

where nn,i is the nth simulated value of a random variable at period t þ i; ^cnn;i ðs; aÞ is the simulated period cost; ^stþi ðnn;i Þ is the simulated state of period t þ i; ~ u ¼ ½uð1Þ uð2Þ . . . uðTÞ ; and u(i) is the replenishment order for the HO simulation period i with the rest of variables as mentioned earlier. 4. Methodology Our study uses computer simulations to conduct numerical experiments to evaluate each inventory control approach and provide evidence to answer research questions. The inventory problem with AR1/GARCH(1,1) demand is used. 4.1. An inventory problem

3.3. Rollout Rollout, as discussed by Bertsekas and Tsitsiklis (1996), is a simulation-based ADP method. Provided there is a model of the problem, it determines an action based on approximate costs obtained with simulation. The action is selected such that it minimizes the approximate cost composed of an approximate period state-action cost and an approximate discounted cost-to-go. The approximate cost-to-go in Rollout is the cost (obtained from simulation) associated with a fixed base policy. Eq. (9) shows how Rollout determines an action:

( ) N T X 1 X i p ^cn ðst ; aÞ þ at ¼ arg min  a ^cn ð^stþi;n Þ a N n¼1 i¼1

ð9Þ

where at is the chosen action; N is the number of iterations; T is the simulation horizon; ^cn ðs; aÞ is the nth simulated period cost when an action a is taken; ^cpn ðsÞ is the nth simulated period cost when an action is determined by a base policy p; and ^stþi;n is the nth simulated state of period t + i. First Rollout uses simulation to obtain a period cost ^cðst ; aÞ and a next state ^stþ1 . Then, with a next state ^stþ1 and a base policy p, a cost for the next period ^cp ð^stþ1 Þ is found by simulation. The process keeps going until the end of simulation horizon. The second term in Eq. (9) represents an approximate cost-to-go. A combination of a period cost and a discounted cost-to-go is an estimated cost of each action in an action search space. In order to take into account process variation, this estimated cost can be re-evaluated multiple times and the decision can be made based on an average value. A base policy used in our study is discussed in Section 4.2.

Similar to Rollout, Hindsight Optimization (HO), as introduced by Chong et al. (2000), uses simulation to estimate a period cost and a cost-to-go. HO does not require a base policy. Therefore we investigate HO for its own virtues as well as to provide a good measure of how simulation-based ADP performs without having to choose a base policy. Rollout uses a base policy to determine a sequence of actions in its simulation. Unlike Rollout, which simulates events in series, HO uses a special simulation. It first generates a sequence of random variables, which are realizations of inventory demand. Then given the known sequence of random variable values, it chooses a sequence of actions to minimize the cost. Therefore the cost obtained in HO is the minimal cost achieved for that instance. Eq. (10) summarizes the HO mechanism:

1 at ¼ arg min a N ( !) N T X X i ðiÞ ^cnn;0 ðst ; aÞ þ min a ^cnn;i ð^stþi ðnn;i Þ; u Þ  ~ u

The inventory problem investigated here is a periodic review single-echelon problem with non-zero leadtime and a setup cost. The inventory system has both in-transit and on-site inventories. A flow of items starts from supplier delivery after a supplier has received a replenishment order. Then, while the items are being delivered, they become in-transit inventory. Later after a delivery duration, called a leadtime, the items will arrive at a storage point. The items at a storage point are called an on-hand inventory. There they are ready to be delivered to a customer once there is a demand. Our study treats the case where demand is exceeded as a backlog order, which is a promise to deliver the items whenever they are available. Since an on-hand inventory and a backlog order are mutually exclusive, they will be represented inclusively as an onsite inventory, denoted x, such that non-negative x will represent an on-hand inventory and negative x will represent a backlog order. The demand is modeled with AR1/GARCH(1,1). Therefore, a state is composed of a previous demand Dt1, a previous demand error t1, a variance of a previous demand error r2t1 , an on-site inventory xt and an in-transit inventory ~ BðtÞ whose length depends on a leadtime L. The state space of this problem is f0; Iþ g  R  R  I  f0; Iþ gL for Dt1 ; t1 ; r2t1 ; xt , and ~ BðtÞ , þ respectively. An action ut 2 f0; I g is a replenishment order. State transitions are specified in Eqs. (11)–(15):

Dt ¼ a0 þ a1  Dt1 þ t

ð11Þ

t ¼ et  rt r2t ¼ m0 þ m1  2t1 þ m2  r2t1

ð12Þ

xtþ1 ¼ xt þ

3.4. Hindsight optimization

n¼1

723

i¼1

ðtþ1Þ

ðtÞ

ðtþ1Þ

ðtÞ

B1

ðtÞ B1

 Dt

ð14Þ

¼ B2 .... .. .. .

BL1 ¼ BL ðtþ1Þ BL

¼ ut

ð15Þ

where a0 and a1 are AR1 model parameters; m0, m1, and m2 are GARCH(1,1) parameters; et is white noise distributed according to N(0, 1); and ut is the replenishment order. The period cost, or transition cost, is the cost associated with the system transitioning from the current state to the next state by the chosen action. In our study, the period cost consists of the replenishment cost and the inventory handling cost, as shown in Eq. (16).

ctþ1 ¼ K t  dðut Þ þ g t  ut þ ht  ðxtþ1 Þþ þ bt  ðxtþ1 Þþ ð10Þ

ð13Þ

ð16Þ

where ct+1 is the period cost, whose value will be known at time t + 1 (the end of period t); Kt is the setup cost; gt is the unit replenishment cost; ht is the unit holding cost; bt is the unit backlogging

724

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

cost; d() is the step function; ()+ is the positive function defined by (a)+ = a  d(a); and other variables are as mentioned earlier. The first two terms in Eq. (16) represent the replenishment cost and the last two terms represent the handling cost. 4.2. Implementation 4.2.1. Sarsa Sarsa is implemented with a pre-defined structure RBF. RBF centers are set up to be distributed uniformly throughout state-action space. The top left plot of Fig. 2 shows the two-dimensional structure of RBF centers. Since RBF scales correspond to centers, therefore RBF scales can be set up to have the same value for each dimension. Our study assigned scales using a strategy based on a relative basis effect at a middle point. A middle point is a point in an empty space surrounded by RBF centers and a location of the middle point is at the center of that surrounding space. The top right plot of Fig. 2 shows middle points, as circles, among RBF centers. The form of each Gaussian basis has a peak of value 1 at its center and declines along a distance from the center with 2 rate expðdh Þ, where dh is a scaled distance from its center. The lower plot of Fig. 2 gives a representation of a Gaussian basis versus the distance from its center. Values of RBF scales are assigned such that the effect of each RBF center will drop to a specific fraction (e.g., half) of its peak at the middle point. For convenience, this strategy will be referred to as a midpoint strategy and the fraction is prefixed to indicate the fraction of a peak effect at the middle

point. Eq. (17) shows a midpoint strategy to obtain scale along the dth dimension:

zd ¼ 

4 D  D2d

 log /

where zd is a scale along the dth dimension; D is the number of dimensions; and Dd is the distance between two adjacent centers along the dth dimension. For example, the 1/10-, 1/2- and 9/10-midpoint strategies assign RBF scales such that the effect of each RBF center will drop to 1/10, 1/2, and 9/10 of its peak value, respectively. Fig. 3 shows a combining effect of all bases when scales are assigned by 1/10-, 1/2-, and 9/10-midpoint strategies. These plots show that the 1/ 2-midpoint strategy has its combining effect distributed evenly throughout the space as being seen as a flat surface. Therefore the 1/2-midpoint strategy is used for experiments in our study. In our experiments, we investigates Sarsa with all state variables and Sarsa without latent state variables introduced by GARCH(1,1). Sarsa with all state variables is implemented with six-dimensional Q-values. A structure of RBF for Sarsa with all state variables has centers at each combination of {0, 100, 200, 300}  {200, 100, 0, 100, 200}  {0, 800, 1600, . . . , 3200}  {300, 200, 100,. . . ,300}  {0,100,200,300,400}  {0,100,200,300,400}, corresponding to five state variables (Dt1 ; t1 ; r2t1 ; xt and ~ BðtÞ ) and one action variable at. With the 1/2-midpoint strategy, RBF scales z1 = z2 = z4 = z5 = z6 = 0.4621  104 and z3 = 0.0072  104. Sarsa without the GARCH(1,1) variables is implemented with four-

RBF centers with middle points

RBF centers 300

300

200

200

100

100

D2

D2

ð17Þ

0

0

−100

−100

−200

−200

−300

−300 0

50

100

150

200

250

300

0

50

100

D1

150

200

250

300

D1 RBF basis effect v.s. distance

1 0.9 0.8 0.7

φ

0.6 0.5 0.4 0.3 0.2 0.1 0 −3

−2

−1

0

1

||x − v||Z Fig. 2. RBF centers and their midpoints. (Midpoints are shown as red circles.)

2

3

725

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

RBF basis effect with 0.1−midpoint

RBF basis effect with 0.1−midpoint (contour) 300 200

2

100

D2

2.5

1.5

0 −100

1 200

−200

300 0

200 −200

D2

100 0

−300

D1

0

50

100

150

200

250

300

D1

RBF basis effect with 0.5−midpoint

RBF basis effect with 0.5−midpoint (contour) 300 200

3.5

100

D2

3

0 −100

2.5 200

−200

300 0

200 −200

D2

100 0

−300 0

D1

50

100

150

200

250

300

D1

RBF basis effect with 0.9−midpoint

RBF basis effect with 0.9−midpoint (contour) 300 200

15

100

D2

20

10

0 −100

5 200

300 0 −200

D2

−200

200 100 0

D1

−300 0

50

100

150

200

250

300

D1

Fig. 3. Combined effect of RBF basis with different midpoint strategies.

dimensional Q-values. A structure of RBF for Sarsa without the GARCH(1,1) variables has centers at each combination of {0, 100, 200, 300}  {300, 200, 100, . . . , 300}  {0, 100, 200, 300, 400}  {0, 100, 200, 300, 400}, corresponding to Dt1, xt, B(t), and at, respectively. RBF scales z1 = z2 = z3 = z4 = 0.4621  104. Regarding to an exploration issue, similar to Van Roy et al. (1997), our study uses an addition of noise for an explorative mechanism. In addition, the zero initialization of RBF weights provides a greater range of exploration to Sarsa, due to the nature of non-negativity of Q-values in a minimization problem.

4.2.2. Rollout A base policy used in Rollout is an (s, S) policy, whose parameters are determined by using the Economic Order Quantity, EOQ, and safety stock calculations. The EOQ and safety stock calculations are easily computed formulas. Since a base policy may be re-calculated several times during the simulation, a simple computation is advantageous. An (s, S) inventory control policy is such that replenishment will be ordered whenever an inventory level, a combination of both onsite and in-transit inventories, falls below a reordering point, denoted s, and the amount of replenishment is determined to bring an inventory level back to an order-upto-level, denoted S. EOQ is a widely used deterministic formula for inventory control, as dis-

pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi cussed by Lambert, Stock, and Ellram (1998). EOQ ¼ 2KD=h, where K is a setup cost, D is a demand rate, and h is a unit holding cost. A safety stock is an inventory kept as a buffer against stock pffiffiffi out. A safety stock, ss, is commonly calculated as ss ¼ 3  r  L, where r is the standard deviation of demand and L is the leadtime. (See Lambert et al. (1998) for a discussion on EOQ and safety stock.) Since the problem investigated here is not steady-state, a previous demand (the demand at time period t  1) is used instead of a demand rate and parameters, s and S, are re-calculated at every period. In our study, parameters of Rollout’s (s, S) base policy are determined as shown in Eq. (18):

s ¼ roundfðL  D mod EOQ Þ þ ssg S ¼ s þ roundfEOQ g

ð18Þ

4.2.3. Hindsight optimization HO does not require a base policy or any extra decision, but a set of algorithmic parameters (N iteration(s) and simulation horizon T). 4.3. Experiments Numerical experiments are conducted to evaluate the performance of Sarsa (with all state variables), Sarsa without the GARCH(1,1) state variables, Rollout, and HO. Our study investigated 15 different inventory problem structures (as shown in Table 1). The

726

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

without the latent state variables introduced by GARCH(1,1), Rollout, and HO.

Table 1 A summary of demand and cost variables investigated. Problem

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Demand variables

Cost variables

Correlation (a1)

Noise variance offset (m0)

Set up cost (K: $/ transaction)

Penalty cost (b: $/ item)

Holding cost (h: $/ item)

0.8 0.4 0 0.4 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8

70 70 70 70 70 40 10 70 70 70 70 70 70 70 70

80 80 80 80 80 80 80 150 100 50 0 0 0 0 0

180 180 180 180 180 180 180 100 100 100 100 200 50 50 50

1 1 1 1 1 1 1 1 1 1 1 1 1 10 50

demand is modeled with AR1/GARCH(1,1) as shown in Eqs. (11)– (13) with AR1’s a0 = 2 and GARCH(1,1)’s m1 = 0.1 and m2 = 0.8. The AR1’s a1 and GARCH(1,1)’s m0 are set up as shown in Table 1. All problems are undiscounted one-period leadtime problems (a discount factor a = 1 and a leadtime L = 1). Each problem has four cost variables. The unit replenishment cost g is set to $100/unit for all problems. Values of a setup cost (K), a penalty cost (b) and a holding cost (h) of each problem are shown in Table 1. Each experiment is initialized at D0 = 50, 0 = 10, r20 ¼ 400, x1 = 10, and B(1) = 0. 4.3.1. Experimental design parameters The experiments run each inventory controller for 25 replications of each of 60 time-unit-indeterminate periods. The exception is that each inventory controller investigated on Problem 1 is run for 50 replications.1 The study of problems 1–5 provides performance comparison across different values of a1 of 0.8, 0.4, 0, 0.4, and 0.8, respectively. The study of problems 1, 6, and 7 provides performance comparison across different values of m0 of 70, 40, and 10, respectively. The study of problems 8–11 provides performance comparison across different values of K of 150, 100, 50, and 0, respectively. The study of problems 12, 11, and 13 provides performance comparison across different values of b of 200, 100, and 50, respectively. The study of problems 13–15 provides performance comparison across different values of h of 1, 10, and 50, respectively. 4.3.2. Evaluation An aggregate cost is used as a main performance indicator. Overall performance is over 60 periods. Because of the highly stochastic nature of this problem, our study uses statistical significance tests as the main tools for drawing conclusions. A one-sided t-test is used when the normality assumption for experimental results holds. For each pair of measurements, the mean of a control method l1 and the mean of a treatment method l2, is tested twice. The first test is a t-test against an alternative hypothesis l1 > l2. In the second test a t-test against an alternative hypothesis l1 < l2 is used. The normality assumption is tested at a 5% significance level with both Lilliefors and v2-good-of-fitness tests. When it is suspected that the normality assumption does not hold, a Wilcoxon rank sum test is used instead of the t-test. The treatments investigated here are the Look-Ahead, Sarsa with all state variables, Sarsa 1 The reason to have more replications on this problem is to provide better data for comments (in Section 5.3), cumulative distribution function (CDF) plots (in Fig. A.11), and maximum single-period cost plots (in Fig. 8).

5. Experimental results The experimental results are shown in Sections 5.1 and 5.2 and comments on each ADP method are provided in Section 5.3. The detailed experimental data are provided in the Appendix. 5.1. Different demand settings Fig. 4 shows the percentage of relative cost differences over different problem scenarios. It has been organized into two rows and three columns. The first row shows the percentage of relative cost differences over different demand correlations (a1 = 0.8 to 0.8). The second row shows the percentage of relative cost differences over different demand noise variance offsets (m0 = 10, 40, and 70). The column corresponds to comparison on different perspectives.  The left column shows percentage of relative total cost difference of each ADP method to the 12-period Look-Ahead method, denoted H12. Using total cost provides better performance comparison between simulation-based ADP methods and the Look-Ahead method.  The middle column shows percentage of relative aggregate cost difference of each ADP method to the Look-Ahead method. Using the percentage of relative cost differences of aggregate costs of Periods 13–60 allows learning-based ADP methods to have initial learning periods and thus provides better performance evaluation of learning-based ADP methods.  The left column shows percentage of relative aggregate cost difference of Sarsa without GARCH variables to Sarsa with GARCH variable. It is to provide evaluation of the significance of GARCH variable when using a learning-based ADP method. The percentage of relative cost difference of method A to method B is (costA/costB  1)  100%, where costA and costB are costs obtained from applying method A and B, respectively. The negative percentage means that method A results in lower inventory cost than method B. The positive percentage means the relation is reverse. The percentage close to zero means that both methods result in similar inventory costs. Sarsa with GARCH variables is denoted as Sa and presented with a green2 solid line and circle markers. Sarsa without GARCH variables is denoted as S and presented with a blue solid line and asterisk markers. Rollout is denoted as R and presented with a light blue solid line. Hindsight Optimization is denoted as HO and presented with a pink solid line with cross markers. The dotted red line provides a visual reference for the zero percentage level. The plots of the percentage of relative cost difference of ADP methods to the Look-Ahead method from negative to positive demand correlation shown in the upper left and upper middle plots of Fig. 4 show that all ADP methods have negative percentages of relative cost difference to the Look-Ahead method. This means that ADP methods provided lower costs than the Look-Ahead method. Also, all ADP methods give stronger negative percentages of relative cost difference for negative demand correlation. This means that ADP methods performed better under negative demand correlation, when compared to the Look-Ahead method. The upper right plot shows the percentage of relative cost difference of Sarsa without the GARCH(1,1) variables to Sarsa (with all state variables) lying close to zero level. This means that the inventory costs 2 For interpretation of color in Figs. 2–10, the reader is referred to the web version of this article.

727

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

(holding cost per penalty cost or h/b) are shown in the top, middle and bottom rows, respectively. The top left plot of Fig. 5 shows that Rollout performed better than HO, especially when the set up cost ratio is either 0.5 or 1, as indicated by larger gaps between the light blue line (representing Rollout) and the pink line with cross markers (representing HO) at K/g = 0.5 and 1. The top middle plot shows that Sarsa and Sarsa without the GARCH(1,1) variables result in about 5% lower cost than the Look-Ahead method, throughout different set up cost ratios. The top right plot shows that the percentage of relative cost differences of Sarsa without the GARCH(1,1) variables compared to Sarsa is close to zero, which indicates that the costs obtained from using Sarsa without the GARCH(1,1) variables are similar to the costs obtained from using Sarsa, throughout different set up cost ratios. The middle left plot shows fluctuation of the percentages of relative cost difference of Rollout and HO. This may be explained by that the costs obtained from Rollout, HO, and the Look-Ahead method are not significantly different. (See summaries of the significance tests shown in Tables A.12,A.13,A.14 for details.) Therefore, the fluctuation seen in the middle left plot may only reflect the variation in the experiment. The middle center plot shows the downtrend, from small to large penalty cost ratio, of the percentage of relative cost difference of the learning-based ADP methods (Sarsa and Sarsa without the GARCH(1,1) variables) to the

obtained from using Sarsa without the GARCH(1,1) variables are similar to costs obtained from using Sarsa. The lower left plot of Fig. 4 shows that Rollout (shown in light blue line) has lower percentages of relative cost difference than HO (shown in a pink line with cross markers), especially under m0 = 40 and 10 (the gaps between the light blue line and the pink line with cross markers are wider than at m0 = 70). This indicates that Rollout performed better than HO, especially under low and medium demand variance scenarios. The lower middle plot of Fig. 4 shows a downtrend of the percentage of relative cost difference of learning-based ADP methods (Sarsa and Sarsa without the GARCH(1,1) variables) to the Look-Ahead method. This indicates that the learning-based ADP methods performed better, when compared to the Look-Ahead method, under a high demand variance scenario. The lower right plot shows a blue line with asterisk markers representing Sarsa without the GARCH(1,1) variables lying close to the zero level. This indicates that the performance of Sarsa and Sarsa without the GARCH(1,1) variables are similar. 5.2. Different cost structures Similar to Figs. 4, 5 shows relative cost differences of ADP methods over different cost structures. Relative cost differences over different set up cost ratio (set up cost per unit cost or K/g), penalty cost ratio (penalty cost per unit cost or b/g) and holding cost ratio

% rel. cost diff. to H12 cost; Period 13−60

0

0

0

−10

−10

−10

−20

−20

−20

−30

%

10

%

10

−30 −40

−40

−50

−50

−50

−60

−60

−60

−0.8

−0.4

0

0.4

−70

0.8

−0.8

−0.4

0

0.4

−70

0.8

a1

a1

% rel. cost diff. to H12 cost; Total cost

% rel. cost diff. to H12 cost; Period 13−60

0

0

0

−10

−10

−10

−20

−20

−20

%

10

−30 −40

−40

−50

−50

−50

−60

−60

−60

40

ν0

70

−70

10

40

ν0

0

0.4

0.8

−30

−40

10

−0.4

% rel. cost diff. of S to Sa; Period 13−60

10

−70

−0.8

a1

10

−30

Sa S S R R HO HO

−30

−40

−70

%

% rel. cost diff. of S to Sa; Period 13−60

10

%

%

% rel. cost diff. to H12 cost

70

−70

10

40

ν0

Fig. 4. Relative cost differences of ADP methods on different demand correlations (upper row) and variances (lower row).

70

728

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

15

% rel. cost diff. to H12 cost; Period 13−60

10

5

5

5

0

0

0

%

10

−5

−5

−5

−10

−10

−10

−15 0.5

1

1.5

0

K/g: $/$ % rel. cost diff. to H12 cost; Total cost

0.5

1

1.5

0

15

% rel. cost diff. to H12 cost; Period 13−60

15 10

5

5

5

0

0

%

10

−5

−5

−10

−10

−10

−15

−15

−15

1

2

0.5

350

350

1

2

0.5

% rel. cost diff. to H12 cost; Period 13−60

350

250

250

200

200

200

%

300

250

%

300

150 100

100

50

50

50

0

0

0

−50

−50

−50

0.2

h/b: $/$

1

0.02

0.2

2

1

% rel. cost diff. of S to Sa; Period 13−60

150

100

0.02

1

b/g: $/$

300

150

1.5

% rel. cost diff. of S to Sa; Period 13−60

b/g: $/$

% rel. cost diff. to H12 cost; Total cost

1

0

−5

b/g: $/$

0.5

K/g: $/$

10

0.5

Sa S R HO

K/g: $/$

%

%

15

% rel. cost diff. of S to Sa; Period 13−60

−15

−15 0

%

15

10

%

%

15

% rel. cost diff. to H12 cost; Total cost

0.02

h/b: $/$

0.2

1

h/b: $/$

Fig. 5. Relative cost differences of ADP methods on different set up costs (top row), penalty costs (middle row) and holding costs (bottom row).

Look-Ahead method. This indicates that the learning-based ADP methods performed better, when compared to the Look-Ahead method, under the high penalty cost scenarios. The middle right plot shows the relatively flat line close to zero level of the percentage of relative cost difference of Sarsa without the GARCH(1,1) variables to Sarsa. This indicates that the cost performance of Sarsa without the GARCH(1,1) variables and Sarsa are not significantly different. The bottom left and center plots show large positive percentages of relative cost difference of HO (total cost, shown in the left plot), Sarsa, and Sarsa without the GARCH(1,1) variables (aggregate costs of Periods 13–60, shown in the center plot) at large holding cost ratios. This indicates that HO and the learning-based ADP methods do not perform well under the large holding cost ratio scenarios. The Rollout method seems to be affected least, as it does not perform worse than the Look-Ahead method under the high holding cost ratio scenarios (the light blue line in bottom left plot lies relatively flat close to zero level). The bottom right plot indicates that the cost performance of Sarsa without the GARCH(1,1) variable is similar to Sarsa. Among previous studies, only Zhang (2007) has studied inventory problems with AR(1)/GARCH(1,1) demand. Zhang (2007) provided the solution for an inventory problem without set up cost. In

order to compare our ADP methods to Zhang’s solution, relative cost differences of our ADP methods to Zhang’s solution (denoted Z) are shown in Fig. 6. All plots in Fig. 6 show that Zhang’s solution yielded lower costs than ADP methods in most cases, as the light blue line (representing RO) and the pink line with cross markers (representing HO) lie above the zero level in most scenarios in the upper and lower left plots and the green line with circle markers (representing Sarsa) and the blue line with asterisk markers (representing Sarsa without the GARCH(1,1) variables) lie above the zero level in the upper and lower right plots. 5.3. Comments on ADP methods Problem 1 experiments (Table 1) have been conducted with more replications and each ADP method has been investigated with wider range of its algorithmic parameter value(s). The comments on ADP methods addressed here are based on observation of Problem 1 experiments. 5.3.1. Sarsa Sarsa was experimentally investigated with learning rates of 0.1, 0.2, . . ., 1.0. These learning rates did not show significantly

729

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

% rel. cost diff. to Z cost; Total cost

20

15

15

10

10

5

5

%

%

20

0

Sa S R HO

0

−5

−5

−10

−10

−15

−15

−20

% rel. cost diff. to Z cost; Period 13−60

−20 0.5

1

2

0.5

b/g: $/$ % rel. cost diff. to Z cost; Total cost

350

300

300

250

250

200

200

%

%

350

150

100

50

50

0

0

−50

−50 0.2

2

1

h/b: $/$

% rel. cost diff. to Z cost; Period 13−60

150

100

0.02

1

b/g: $/$

0.02

0.2

1

h/b: $/$

Fig. 6. Relative cost difference of different ADP methods to Zhang’s AR1/GARCH controller on different penalty (top row) and holding (bottom row) costs.

different performance. Our experiments for Sarsa without the GARCH(1,1) variables spent only around 1–10% of time spent for Sarsa with all state variables included. It should be noted that the timing information mentioned in our study represent only rough estimates since the experiments were not run in a controlled environment, but in a multi-server/multi-user computer system with variable computational loads. 5.3.2. Rollout Rollout was experimentally investigated with N = 1, 5, 10, and 50 iterations and simulation horizons T of 1, 3, 6, and 12. Fig. 7 shows the mean, maximum, third-quartile, second-quartile, firstquartile, and minimum of total costs obtained from Rollout with the different N and T values. The upper left plot of Fig. 7 shows a downtrend of the pink solid line, representing average total costs, when using Rollout with longer T, when N = 1. However, other plots (for N = 5, 10, and 50) do not show this downtrend. Section 6 provides further discussion and investigation on this behavior. While the time spent for Sarsa depends mainly on the size of RBF, the time spent for Rollout and HO depend on the number of iterations and simulation horizons. Rollout with N = 50, T = 12, which requires the most computation time compared to other settings investigated here, spent roughly the same amount of time as Sarsa with all state variables. Rollout with other settings spent much less time than Sarsa with all state variables. 5.3.3. HO HO was experimentally evaluated with the same parameters N and T as Rollout was. Our experiments with HO spent around 4–60 times the computation time spent for Rollout with equivalent settings of N and T. 5.3.4. Zhang’s solution As expected, when it is applicable, Zhang’s solution provided better performance than ADP methods. However, over different problem scenarios investigated, total costs obtained from Zhang’s solution and from Rollout are not significantly different.

6. Conclusions and discussions Regarding the performance3 of ADP methods under different scenarios, Figs. 4 and 5 indicate the following.  As the demand correlation decreases, all ADP methods seem to perform much better than the Look-Ahead method. The improvement seems to settle around 55% for zero to negative correlation.  As the demand variance decreases, Sarsa seems to perform worse compared to the Look-Ahead method. At low demand variance, Sarsa seems to have similar performance as the Look-Ahead method. This may be explained by the better performance of the Look-Ahead method, because the Look-Ahead future projection uses zero demand noise, which provides a good approximate cost for the problem with low variance.  Across different set up cost ratios, the performance of Sarsa seems to be quite constant.  As the penalty cost ratio increases, Sarsa seems to perform better compared to the Look-Ahead method.  As the holding cost ratio increases, Sarsa and HO perform substantially worse than the Look-Ahead method. Rollout still performs relatively similar to the Look-Ahead method at high holding cost ratio. Since the cost of holding inventory and stock out are the same, high holding cost ratio (h/b = 1) may confuse Sarsa learning mechanism and lead to substantially bad performance. Both Sarsa (with all state variables) and Sarsa without GARCH(1,1) state variables perform well compared to the LookAhead method, when holding cost is low compared to penalty cost. Regarding the issue of the significance of the GARCH model when applying learning-based ADP, the performance of Sarsa with all

3 Due to its learning nature that requires initial periods, the conclusions on Sarsa performance or comparison of Sarsa to other methods are drawn based on aggregate costs of Periods 13–60. When not compared to Sarsa, the conclusions on Rollout and HO performance are drawn based on total costs.

730

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 5

5

x 10

N=1

10

9

9

8

8

7

7

total cost: $

total cost: $

x 10 10

6 5 4

max Q3 Q2 Q1 min

6 5 4

3

3

2

2

1

1

0

N=5

0 1

3

6

12

1

T: period(s) 5

5

x 10

N = 10

10

10

9

9

8

8

7

7

6 5 4

4 3 2

1

1 3

6

12

T: period(s)

N = 50

5

2

1

12

6

3

0

6

T: period(s)

total cost: $

total cost: $

x 10

3

0

1

3

6

12

T: period(s)

Fig. 7. Performance comparison of different Rollout parameters. The mean is shown as a solid line.

state variables and Sarsa without GARCH(1,1) state variables is not significantly different. This result assuages the concern about learning-based ADP for problems with latent state variables. The explanation may be that the GARCH(1,1) state variables do not have a direct effect on a period cost (Eq. (16)). They only give extra information about uncertainty in the variance in the demand forecast. The robustness of Sarsa with some information missing shown in our experiments confirms the observations of Csáji and Monostori (2008) that temporal difference learning, a learning technique used in Sarsa, can tolerate inaccurate information to some extent. However, the good performance of Zhang’s solution (in problems without set up costs) and Rollout (in all scenarios) indicate the significance of GARCH model in inventory management and supports Zhang’s claim. Sarsa’s ability to tolerate some degree of missing information is beneficial and can be exploited. Inclusion of the only more relevant state variables allows the use of a smaller RBF. Sarsa with a smaller RBF requires less computational effort. Bertsekas and Tsitsiklis (1996) also support this exploitation within a more general concept, using feature extraction. Feature extraction is an ADP state preparation technique such that the system raw state variables can be preprocessed to extract features that represent more important aspects of state. These features are then used as ADP state variables. To address the issue of function approximation, the evenly distributed structure and the midpoint strategy were used and were effective in setting up RBF. The evenly distributed structure and the midpoint strategy provide a systematic approach to design RBF for an ADP application to inventory problems. Pertinent to a comparison among different methods, Rollout performs significantly better than the Look-Ahead method, Sarsa, and HO in most cases. HO results in lower costs (aggregate costs

in Periods 13–60) than Sarsa in most cases, but only few cases (demand correlation a1 = 0 and 0.8) that significance tests can confirm the differences. Rollout has been investigated in Choi et al. (2006). Choi et al. (2006) used heuristic search over sets of pre-specified (s, S) policies as Rollout’s base policy. The heuristic search is computationally time consuming compared to the EOQ parameterized (s, S) policy used in our study. In addition to a base policy, another decision in the implementation of Rollout is a selection of its parameter values. Rollout has two parameters: the number of iterations N and the simulation horizon T. Rollout with N = 1 shows a monotonically downtrend in its relation between the simulation horizon length and the resultant average total cost. This monotonic downward trend is not apparent for N > 1. To delve into this behavior, average on-site inventory x, an average period cost and a maximum period cost were used as shown in Fig. 8 (based on Problem 1 experiment). The plots in Fig. 8 are arranged such that plots in the left column show average on-site inventory x versus T, plots in the middle column show average single cost versus T, and plots in the right column show the maximum single cost versus T. Each column has four plots for each value of N: N = 1 (top row), N = 5 (second row), N = 10 (third row), and N = 50 (last row), as indicated in each plot title. Plots of average on-site inventory show an uptrend in every N. An average cost when Rollout has lower N and the maximum cost when Rollout has higher N show a more noticeable downtrend. The explanation may lie in the role that the combination of algorithmic parameters, N and T, has in approximating state-action cost-to-go. With a few iterations (small N), Rollout may not have seen many rare demand surges, and a longer horizon (large T) helps to promote an averaging effect, which results in a better decision for the average case. With more iterations and a longer horizon (large

731

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

single cost: $

50

0

7000 6000 5000

N = 1; max(cost)

8 7 6

4000 1

3

6

12

1

3

6

T: period(s)

T: period(s)

N = 5; mean(x)

N = 5; mean(cost)

12

1

3

6

4

T: period(s)

x 10

N = 5; max(cost)

12

10

single cost: $

250 200 150 100 50

6000

single cost: $

x: item(s)

100

x: item(s)

4

x 10

N = 1; mean(cost)

single cost: $

N = 1; mean(x)

5500 5000 4500 4000

9 8 7

0 3

6

12

3

N = 10; mean(x)

N = 10; mean(cost)

single cost: $

200 150 100 50

12

1

1

3

6

4

x 10

5500 5000 4500 4000

12

3

T: period(s) N = 50; mean(x)

6

5

1

3

6

12

T: period(s) 4

x 10

N = 50; mean(cost)

N = 50; max(cost)

6

single cost: $

single cost: $

N = 10; max(cost)

5.5

12

5500

100

12

6

T: period(s)

200

6

4.5 1

300

3

T: period(s)

3500

0

x: item(s)

6

T: period(s)

250

x: item(s)

1

T: period(s)

single cost: $

1

5000 4500 4000

5.5 5 4.5

3500

0 1

3

6

12

1

T: period(s)

3

6

T: period(s)

12

1

3

6

12

T: period(s)

Fig. 8. Plot of average inventory and average and maximum single-period cost for each Rollout setting. The plot is arranged along number of period(s) of Rollout simulation.

N and T), Rollout may have a better chance to see more rare demand surges, which, even though rare, can substantially effect the total cost when this happens. This may drive Rollout to choose a more conservative action, such as stocking a higher inventory level, to safeguard against any chance of unusually high demand. Therefore it generates a higher average cost, but lower maximum cost. This finding on the relationship between Rollout parameters and conservative behavior is new and needs further study to further confirm its nature as well as to better formulate the new relationship. Understanding this relationship may contribute to a new design for ADP, which may explicitly incorporate a factor to control conservative behavior, or to create a systematic strategy to determine ADP parameters to balance out average and worst-case performance. Fig. 9 shows average inventory levels of each Rollout parameter under different demand scenarios. The upper row shows different demand correlations, a1 = 0.8, 0.4, 0, 0.4, and 0.8, as indicated in the legend of the upper left plot. The lower row shows different demand variances, m0 = 70, 40, and 10, as indicated in the legend of the lower left plot. The left, middle, and right columns show plots

when using Rollout with N = 1, 5, and 10, respectively, as indicated in each plot title. Fig. 10 shows average inventory levels of each Rollout parameter under different cost structures. The top row shows different set up cost ratios, K/g = 1.5, 1, 0.5, and 0, as indicated in the legend of the top left plot. The middle row shows different penalty cost ratios, b/g = 2, 1, and 0.5, as indicated in the legend of the middle left plot. The bottom row shows different holding cost ratios, h/b = 1, 0.2, and 0.02, as indicated in the legend of the top left plot. The plot title indicates a value of the Rollout parameter N. The same uptrend relationship between Rollout simulation horizon (T) and average inventory level is found in every scenario, as shown in Figs. 9 and 10. In addition, at N = 10 and T = 12, these two figures indicate the following.  As the demand correlation increases, the average inventory level increases.  As the demand variance increases, the average inventory level increases. This may be explained by that the inventory is raised to safeguard against higher risk of large demand.

732

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

N = 1; mean(x)

N = 10; mean(x)

N = 5; mean(x)

300

300

300

250

250

200

200

a = 0.8 1

a1 = 0.4

250

a = −0.4 1

x: item(s)

x: item(s)

200

a = −0.8 1

150

x: item(s)

a1 = 0

150

150

100

100

100

50

50

50

0

0 1

6

12

0 1

6

12

1

T: period(s)

T: period(s) N = 1; mean(x)

12

N = 10; mean(x)

N = 5; mean(x)

300

6

T: period(s)

300

300

250

250

200

200

ν = 70 0

ν = 40

250

0

x: item(s)

x: item(s)

200

150

x: item(s)

ν0 = 10

150

150

100

100

100

50

50

50

0

0

0

1

6

12

1

T: period(s)

6

T: period(s)

12

1

6

12

T: period(s)

Fig. 9. Plot of average inventory levels of each Rollout parameter under different demand correlations (upper row) and variances (lower row).

 As the set up cost ratio increases, the average inventory level seems to increase. This may be explained by that as the inventory is raised in order to reduce the number of transactions.  As the penalty cost ratio increases, the average inventory level increases. This is explained by that the inventory is raised to reduce the chance of inventory shortage.  As the holding cost ratio increases, the average inventory level decreases. This is explained by that the inventory is lower in response to higher cost of holding items. In summary, Sarsa, Rollout, and HO can be used as inventory controllers for a problem with temporal demand heteroscedasticity, in most cases. Sarsa and HO are not recommended for an inventory problem with high holding to penalty cost ratio. In other cases, Sarsa can be employed effectively without a model of the problem. The absence of latent state variables, introduced by GARCH(1,1), does not apparently deteriorate Sarsa performance. Furthermore, omitting these latent state variables is beneficial in terms of shortening the computation time. When there is a model of the problem, Rollout is seen to provide better performance for all scenarios. Our study compares learning-based and simulation-based ADP methods to each other. As mentioned earlier, each method has its own strength. Learning-based ADP is more adaptive. Simulation-based ADP is more effective at reducing cost and does so more reliably. A combination of learning-based and simulation-based schemes may provide a good balance between adaptability and reliability. A forecasting technique can be used to estimate

uncertainty and variance, or a distribution, for simulation. Then, with known system dynamics, simulation can be used to approximate a state-action value. A forecasting technique provides adaptability, while simulation provides reliability as it reduces variation in control quality. A hybrid system may provide a new inventory management approach that is more adaptive and reliable, as well as new approaches for other decision/control applications. Kim et al. (2008) have started in this direction. They used known system dynamics to simulate consequences of actions not taken, after demand is observed. Then, not only is a value of a taken action updated, but values of all other actions are also updated. Therefore, their approach provides a fast-adapting inventory control system. However, the work of Kim et al. (2008) is limited to a stateless inventory problem. The extension to a state-based case will produce a wider range of applications. Appendix A. Detailed experimental results: statistics and summaries of cross reference tests Tables A.2–A.16 show statistics of the experimental results: mean, minimum, first-quartile, second-quartile, third-quartile and maximum, and a summary of their cross significance tests of aggregate costs obtained with different controllers. Look-Ahead with 12-period looking ahead is denoted H12. Sarsa (with all state variables) is denoted Sa, Sarsa without the GARCH(1,1) variables is denoted S, Rollout is denoted R, and HO is denoted HO. These methods are shown with their best performing sets of parameters.

733

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

N = 1; mean(x)

N = 5; mean(x)

x: item(s)

200

x: item(s)

K/g = 1.5 K/g = 1 K/g = 0.5 K/g = 0

250

150 100

300

250

250

200 150 100

50

50

0

0 1

6

12

100

250

200 150 100

0 12

6

12

1

150 100

300

250

250

200 150 100

50

50

0

0 12

12

N = 10; mean(x)

300

x: item(s)

h/b = 1 h/b = 0.2 h/b = 0.02

6

T: period(s)

N = 5; mean(x)

x: item(s)

x: item(s)

100

T: period(s)

N = 1; mean(x)

T: period(s)

150

0

T: period(s)

6

200

50

1

300

12

N = 10; mean(x)

250

50

6

T: period(s)

300

0

1

1

300

50

200

12

x: item(s)

x: item(s)

x: item(s)

150

250

6

N = 5; mean(x)

b/g = 2 b/g = 1 b/g = 0.5

6

100

T: period(s)

N = 1; mean(x)

1

150

0 1

300

200

200

50

T: period(s)

250

N = 10; mean(x)

300

x: item(s)

300

200 150 100 50 0

1

6

T: period(s)

12

1

6

12

T: period(s)

Fig. 10. Plot of average inventory levels of each Rollout parameters under different set up costs (top row), penalty costs (middle row) and holding costs (bottom row).

In the summary of cross significance tests, an entry of 0 denotes the null hypothesis, i.e., that the mean of a row method equals the mean of a column method cannot be rejected. An entry of 1 denotes that the null hypothesis is rejected in favor of that the mean of a row method is less than the mean of a column method and an entry of 1 means the inequality is reversed. Each table is presented in three parts, each showing statistics and summaries of the first 12 periods, the later 48 periods and the total 60 periods, respectively. The mean of aggregate costs provides a simple indicator to evaluate the performance of each method. Minimum, first-quartile, second-quartile, third-quartile, and maximum provide information of how data is distributed and skewed, for interested readers. Summary of cross significance tests provides a decisive answer for whether the two methods under investigation really gave different results and helps screening out variation in experimental results that might be confusing or misleading. A.1. Problem 1 Table A.2 shows statistics and a summary of their cross significance tests of aggregate costs obtained with the Look-Ahead methods, Sarsa, Sarsa without GARCH(1,1) state variables, Rollout, and HO. Look-Ahead methods with 2- and 6-period looking ahead, denoted H2 and H6 respectively, also are shown here. In this

experiment, Sarsa is used with a learning rate of 0.5. Sarsa without GARCH(1,1) state variables is used with a learning rate of 0.8. Rollout is used with N = 50 and T = 3. HO is used with N = 1 and T = 3. In Table A.2 under the group header Periods 1–12, both Sarsa and Sarsa without the GARCH(1,1) variables have greater values of the mean of the aggregate costs, $80,044.56 and $80,287.12 compared to the means of the aggregate costs, obtained by other methods, ranging from $60,928.78 (Rollout) to $72,219 (2-period Look-Ahead method). The significance tests also confirm that both Sarsa and Sarsa without the GARCH(1,1) variables delivered greater aggregate costs than other method, as indicated by ‘1’s in Columns H2, H6, H12, R, and HO on Rows S and Sa. This is explained by the nature of Sarsa, which requires initial periods. In Periods 13–60, the average aggregate costs obtained from both Sarsa and Sarsa without the GARCH(1,1) variables are lower than the average aggregate costs obtained from Look-Ahead methods. The significance tests also confirm that Sarsa without the GARCH(1,1) variables performed better than 2-, 6-, and 12-period Look-Ahead methods and Sarsa performed better the 2- and 6-period LookAhead methods. This indicates that the learning mechanism has worked effectively. Comparing Sarsa to Sarsa without the GARCH(1,1) variables in Periods 13–60, Sarsa without the GARCH(1,1) variables delivered a lower aggregate cost than Sarsa. However, the difference could

734

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

Table A.2 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Q3

Max

H2

H6

H12

67,363 51,241 65,899 80,291 80,319 59,507 58,435

103,577 79,134 87,173 93,698 91,949 69,859 75,218

194,211 166,939 140,042 108,835 112,443 116,162 121,955

0 0 0 1 1 0 0

0 0 1 1 1 0 0

0 1 0 1 1 0 0

1 1 1 0 0 1 1

196,366 171,990 127,108 134,615 133,724 119,575 124,174

298,174 233,320 201,410 173,578 163,567 150,579 146,923

391,553 303,644 248,606 216,267 218,416 180,392 220,191

615,827 426,484 445,277 325,417 294,014 272,252 468,517

0 1 1 1 1 1 1

1 0 1 1 1 1 1

1 1 0 0 1 1 0

263,364 218,624 199,173 207,057 211,360 182,738 187,900

371,688 290,199 255,406 254,922 252,068 212,578 219,928

483,533 372,331 333,089 293,759 296,034 246,560 278,461

810,038 593,423 581,682 428,320 400,647 343,796 590,472

0 1 1 1 1 1 1

1 0 0 1 1 1 1

1 0 0 0 0 1 1

Periods 1–12 H2 72,219.00 H6 61,453.86 H12 69,539.66 Sa 80,044.56 S 80,287.12 R 60,928.78 HO 62,163.54

8271 10,634 29,579 52,140 45,741 28,265 29,891

34,008 32,882 44,822 67,265 68,770 43,798 48,085

Periods 13–60 H2 310,081.52 H6 239,951.12 H12 200,561.66 Sa 176,637.76 S 173,607.68 R 152,749.60 HO 173,871.12

95,160 70,057 61,752 73,513 70,988 75,762 77,664

Periods 1–60 H2 382,300.52 H6 301,404.98 H12 270,101.32 Sa 256,682.32 S 253,894.80 R 213,678.38 HO 236,034.66

125,846 102,271 130,103 139,495 145,074 133,999 128,299

Sa

S

R

HO

1 1 1 0 0 1 1

0 0 0 1 1 0 0

0 0 0 1 1 0 0

1 1 0 0 0 1 0

1 1 1 0 0 1 0

1 1 1 1 1 0 0

1 1 0 0 0 0 0

1 1 0 0 0 1 0

1 1 0 0 0 1 0

1 1 1 1 1 0 0

1 1 1 0 0 0 0

 H2, H6, and H12 represent Look-Ahead methods with 2-, 6-, and 12-period looking ahead, respectively.  Sa, S, R, and HO represent Sarsa (with all state variables), Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  Summary of cross significance tests: Entry of ‘1’: the average cost obtained from a row method is less than from a column method. Entry of ‘0’: the average costs obtained from row and column methods are not significantly different. Entry of ‘1’: the average cost obtained from a row method is greater than from a column method.

Table A.3 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.4, m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Q3

Max

H12

Sa

S

R

HO

Periods 1–12 H12 41,013 Sa 60,531 S 61,027 R 39,015 HO 62,524

7654 47,376 51,528 22,063 48,880

26,393 53,062 55,377 30,846 57,656

38,826 60,777 59,092 36,211 60,646

52,845 66,636 64,878 45,643 67,372

95,718 74,849 78,904 72,278 80,735

0 1 1 0 1

1 0 0 1 0

1 0 0 1 0

0 1 1 0 1

1 0 0 1 0

Periods 13–60 H12 139,188 Sa 89,826 S 90,535 R 81,405 HO 88,934

81,966 53,132 59,058 53,380 59,145

101,357 73,046 77,942 68,233 75,485

120,496 91,543 86,773 78,084 92,668

153,118 100,966 102,503 93,651 97,205

287,409 146,708 145,165 126,996 124,425

0 1 1 1 1

1 0 0 0 0

1 0 0 1 0

1 0 1 0 0

1 0 0 0 0

Periods 1–60 H12 180,201 Sa 150,357 S 151,562 R 120,419 HO 151,458

106,338 109,299 115,278 75,618 124,991

136,766 134,911 136,052 104,208 135,371

167,363 146,717 148,120 116,375 150,595

198,148 159,920 167,848 135,908 160,103

324,893 206,087 208,872 172,216 192,930

0 0 0 1 0

0 0 0 1 0

0 0 0 1 0

1 1 1 0 1

0 0 0 1 0

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

not be confirmed by the significance test, as indicated by ‘0’s in entries corresponding to comparison of Sa and S (Column S on Row Sa and Column Sa on Row S). Comparing learning-based methods to simulation-based methods, the average aggregate costs of Sarsa, Sarsa without the GARCH(1,1) variables, and HO are not signifi-

cantly different, as indicated by summaries of significance tests showing ‘0’s in entries corresponding to comparison of these 3 methods. As indicated by its lower aggregate cost and ‘1’s in Columns Sa and S on Row R, Rollout performed significantly better than Sarsa and Sarsa without the GARCH(1,1) variables. Since

735

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.4 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0, m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Q3

Max

H12

Sa

R

HO

1 0 0 1 1

0 1 1 0 0

0 1 1 0 0

1 0 0 1 1

1 0 0 1 1

1 1 1 0 1

1 1 1 1 0

1 0 0 1 1

1 0 0 1 1

1 1 1 0 0

1 1 1 0 0

Periods 1–12 H12 33,287 Sa 53,876 S 55,098 R 26,573 HO 26,281

4242 45,993 46,069 10,841 17,563

20,788 51,891 51,975 20,153 20,428

27,375 53,867 55,256 24,315 23,903

42,244 54,740 58,405 29,888 29,539

85,060 67,436 65,222 59,913 43,456

0 1 1 0 0

1 0 0 1 1

Periods 13–60 H12 166,327 Sa 72,136 S 71,892 R 59,600 HO 64,351

55,534 50,683 51,522 41,494 44,179

97,012 58,440 62,408 48,617 57,555

167,219 73,199 72,289 52,772 63,963

208,136 79,461 77,970 65,354 70,005

375,810 113,413 114,282 98,770 86,108

0 1 1 1 1

Periods 1–60 H12 199,614 Sa 126,012 S 126,989 R 86,173 HO 90,633

86,813 102,488 106,982 59,598 69,277

134,285 115,243 114,713 72,429 82,925

187,547 123,159 122,234 82,914 89,805

244,791 136,069 135,785 99,762 95,035

395,552 166,205 168,940 139,213 120,222

0 1 1 1 1

S

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Table A.5 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.4,m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Q3

Max

H12

Sa

R

HO

1 0 0 1 1

0 1 1 0 1

1 1 1 1 0

1 0 0 1 0

1 0 0 1 0

1 1 1 0 1

1 0 0 1 0

1 0 0 1 1

1 0 0 1 1

1 1 1 0 0

1 1 1 0 0

Periods 1–12 H12 29,330 Sa 52,066 S 51,621 R 25,107 HO 20,869

5833 44,981 46,229 17,567 15,314

17,785 50,416 49,529 22,663 18,321

26,501 51,845 51,488 25,004 20,730

36,490 54,492 52,662 27,206 22,853

68,638 58,959 60,114 33,668 27,632

0 1 1 0 1

1 0 0 1 1

Periods 13–60 H12 149,284 Sa 63,718 S 64,000 R 55,079 HO 62,481

72,296 48,401 49,787 41,637 45,108

112,546 52,325 53,125 50,558 50,645

141,864 61,453 61,839 54,672 59,863

178,102 71,023 70,332 60,286 70,612

310,972 98,727 97,745 69,792 90,026

0 1 1 1 1

Periods 1–60 H12 178,614 Sa 115,784 S 115,622 R 80,186 HO 83,350

97,856 99,234 99,582 60,661 64,772

144,552 105,721 105,814 75,684 74,442

158,326 111,092 111,952 81,556 82,429

211,002 126,944 124,631 86,279 90,900

345,803 150,760 149,331 96,746 110,018

0 1 1 1 1

S

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Rollout and HO do not require initial periods, the total cost (aggregate costs of Periods 1–60) is used to compare Rollout to HO. Although Rollout has lower average total costs ($213,678.38) than HO ($236,034.66), the significance tests could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO. Fig. A.11 shows a performance comparison of the 12-period Look-Ahead method (H12), Sarsa (Sa), Sarsa without the GARCH(1,1) state variables (S), Rollout (R), and HO (HO). The top left plot shows aggregate costs of each method. On this plot, the horizontal axis indicates aggregation periods. This plot shows how each method performed over time. Sarsa and Sarsa

without the GARCH(1,1) variables performed worse than 12-period Look-Ahead method in the first aggregation period (Periods 1–12), but they performed better than 12-period Look-Ahead in the following aggregation periods. It should be noted that Rollout yielded the best performance throughout every aggregation period. The top right plot shows the mean as a solid line, maximum ‘+’, third quartile ‘.’, median ‘*’, first-quartile ‘.’, and minimum ‘+’, of each method’s total costs. It should be noted that the differences between minimum and maximum of total costs obtained from 12-period Look-Ahead method and HO are relatively large. The difference between the first and third quartiles of total costs obtained from 12-period Look-Ahead method is also much larger than the

736

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

Table A.6 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8,m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Q3

Max

H12

Sa

R

HO

1 0 0 1 1

0 1 1 0 0

0 1 1 0 0

1 0 0 1 1

1 0 0 1 1

1 1 1 0 0

1 1 1 0 0

1 0 0 1 1

1 0 0 1 1

1 1 1 0 0

1 1 1 0 0

Periods 1–12 H12 22,373 Sa 51,049 S 49,884 R 21,770 HO 20,246

5125 43,949 44,638 16,735 12,677

10,109 49,101 47,905 19,955 17,079

18,523 50,855 48,829 21,523 19,465

29,440 53,518 52,449 23,506 23,220

56,678 55,014 57,435 28,665 27,746

0 1 1 0 0

1 0 0 1 1

Periods 13–60 H12 135,892 Sa 60,002 S 61,171 R 50,424 HO 52,622

55,770 44,724 44,872 35,112 39,238

104,405 50,846 51,474 43,343 45,282

125,575 56,611 59,167 50,268 52,639

152,049 66,691 68,126 58,114 57,199

292,277 90,252 94,648 66,625 75,872

0 1 1 1 1

Periods 1–60 H12 158,265 Sa 111,051 S 111,055 R 72,194 HO 72,869

65,920 95,654 93,640 58,444 57,927

123,427 102,673 102,524 62,189 67,201

145,603 107,420 108,557 72,920 72,911

175,337 120,631 117,301 78,874 76,772

302,520 144,214 142,761 90,534 97,346

0 1 1 1 1

S

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Table A.7 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 40, K = 80, b = 180, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q3

Max

H12

Sa

S

R

HO

53,598 74,391 75,041 55,644 88,948

73,242 96,438 88,772 82,876 101,645

109,935 112,713 103,773 127,457 135,746

0 1 1 0 1

1 0 0 1 1

1 0 0 1 1

0 1 1 0 1

1 1 1 1 0

94,359 86,584 91,039 98,797 96,212

127,016 138,396 131,994 119,031 118,629

171,768 159,249 165,179 136,544 159,486

298,842 234,386 233,521 315,946 221,765

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

146,545 178,915 177,032 147,565 194,771

176,336 203,327 202,153 184,378 220,703

229,839 247,875 242,412 210,684 238,961

363,694 334,968 305,862 416,047 297,220

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 1

0 0 0 1 0

Periods 1–12 H12 56,437 Sa 78,975 S 77,020 R 63,783 HO 89,858

28,227 52,977 59,681 30,872 64,502

39,717 65,641 65,107 35,166 75,710

Periods 13–60 H12 135,228 Sa 133,436 S 133,819 R 125,561 HO 127,455

47,999 59,505 60,938 55,958 66,339

Periods 1–60 H12 191,665 Sa 212,411 S 210,839 R 189,344 HO 217,314

87,971 127,540 122,311 111,849 130,841

Q2

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

differences obtained from other methods. This smaller difference between the first and third quartiles of total costs obtained from all ADP methods may imply that an ADP method also helps improve system stability. The lower left and right plots show empirical cumulative distribution functions (CDFs)4 of period costs. The lower left plot shows the CDF for the complete range of period costs. The lower right plot shows the CDF for a selected range of period costs such that it allows a better look at how each method’s CDF approaches one. These plots

4 These CDFs are obtained from MATLAB’s ecdf function, Revision 1.3.6.7, which uses a Kaplan-Meier estimate, as discussed in Kaplan and Meier (1958).

show that the CDF of period costs obtained from the 12-period LookAhead method is lower than those obtained from ADP methods when period cost is larger than around $500. This means that the chance that an ADP method would deliver a large period cost is less than the 12-period Look-Ahead method. In other words, the ADP method has better worst-case performance than the 12-period Look-Ahead method. A.1.1. Experimental results of different demand correlations Similar to Tables A.2,A.3,A.4,A.5 and A.6 show statistics and a summary of cross significance tests of different methods on inventory problem with demand correlation a1 = 0.4, 0, 0.4 and

737

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.8 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 10, K = 80, b = 180, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Max

H12

Sa

S

R

HO

70,282 87,933 81,944 56,854 73,728

113,306 98,165 103,139 89,244 99,725

0 1 1 1 0

1 0 0 1 1

1 0 0 1 1

1 1 1 0 1

0 1 1 1 0

91,771 94,724 97,770 64,499 80,303

108,777 111,604 112,254 82,810 96,523

177,481 144,595 143,335 116,854 136,175

0 0 0 1 0

0 0 0 1 0

0 0 0 1 0

1 1 1 0 1

0 0 0 1 0

137,680 171,205 157,572 123,279 151,495

177,985 190,641 190,078 141,345 171,216

263,056 228,150 239,835 164,526 198,968

0 1 1 1 0

1 0 0 1 1

1 0 0 1 1

1 1 1 0 1

0 1 1 1 0

Periods 1–12 H12 59,498 Sa 79,012 S 75,665 R 48,748 HO 65,848

23,625 51,585 48,067 30,193 39,300

47,832 69,395 67,793 37,653 53,032

59,176 82,260 76,454 45,371 67,468

Periods 13–60 H12 93,352 Sa 95,161 S 94,566 R 72,693 HO 87,810

47,187 53,790 52,865 47,761 60,788

70,202 71,253 75,958 55,499 74,715

Periods 1–60 H12 152,850 Sa 174,173 S 170,231 R 121,441 HO 153,657

99,907 125,244 116,460 84,058 110,716

124,509 149,891 151,015 102,563 134,764

Q3

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Table A.9 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 150, b = 100, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Max

H12

Sa

S

R

HO

68,811 85,600 90,410 85,915 87,954

94,897 102,937 105,915 146,039 124,199

0 1 1 1 1

1 0 0 0 0

1 0 0 0 0

1 0 0 0 0

1 0 0 0 0

168,709 166,136 166,746 148,366 150,946

216,927 204,436 199,270 183,620 190,354

335,187 298,433 296,133 250,906 260,903

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

223,176 237,051 235,288 221,329 219,942

274,713 280,229 283,689 254,433 271,190

393,114 359,545 374,504 383,787 370,272

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

Periods 1–12 H12 57,542 Sa 76,910 S 78,544 R 75,477 HO 76,769

32,460 60,443 52,576 51,522 52,992

44,745 63,781 66,785 59,623 64,846

55,470 79,170 82,580 69,318 68,837

Periods 13–60 H12 173,177 Sa 165,061 S 164,570 R 150,921 HO 158,544

83,315 73,216 73,242 71,724 67,756

124,044 112,441 113,408 114,338 128,169

Periods 1–60 H12 230,719 Sa 241,971 S 243,113 R 226,398 HO 235,313

125,115 136,646 139,225 129,960 133,762

176,218 198,674 194,275 190,356 198,288

Q3

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

0.8, respectively. A summary of the different methods applied to the inventory problem with demand correlation a1 = 0.8 is shown in Table A.2. As indicated by ‘1’s in Column H12 on Rows S and Sa in Periods 13–60 in Table A.3, Sarsa and Sarsa without the GARCH(1,1) variables performed significantly better than the 12-period LookAhead method. Rollout also performed significantly better than the 12-period Look-Ahead method, as indicated by ‘1’ in Column H12 on Row R in Periods 1–60. Although HO has lower total cost than the 12-period Look-Ahead method, the significance test could not confirm the difference, as indicated by ‘0’ in Column H12 on Row HO in Periods 1–60. When Sarsa is compared to Sarsa without

the GARCH(1,1) variable, the ‘0’s in entries corresponding to comparison of Sa and S in Periods 13–60 indicate that they performed indifferently. The entry of ‘1’ in Column HO on Row R in Periods 1–60 indicates that Rollout performed significantly better than HO. Table A.4 shows that Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO performed significantly better than the 12period Look-Ahead method, as as indicated by ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. Sarsa and Sarsa without the GARCH(1,1) variables performed indifferently, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. Rollout outperformed Sarsa and Sarsa

738

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

Table A.10 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 100, b = 100, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q3

Max

H12

Sa

S

R

HO

50,114 73,239 77,420 54,819 86,164

68,706 88,185 94,100 64,632 101,488

110,903 107,206 109,846 90,121 120,575

0 1 1 0 1

1 0 0 1 1

1 0 0 1 0

0 1 1 0 1

1 1 0 1 0

122,735 116,926 114,100 100,762 113,913

164,454 149,029 167,048 129,255 143,856

196,317 202,628 195,993 161,582 179,257

334,458 277,317 291,487 274,435 219,288

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

166,128 189,757 187,603 141,611 196,657

216,028 238,195 231,012 197,529 235,248

262,221 276,813 280,508 219,613 274,059

382,917 370,298 377,042 329,060 292,228

0 0 0 0 0

0 0 0 1 0

0 0 0 1 0

0 1 1 0 1

0 0 0 1 0

Periods 1–12 H12 54,723 Sa 77,318 S 78,706 R 55,731 HO 86,610

22,652 47,764 48,733 32,356 48,743

41,055 66,566 62,665 44,537 74,363

Periods 13–60 H12 169,188 Sa 160,324 S 161,833 R 140,570 HO 146,784

64,845 71,013 62,990 59,595 72,348

Periods 1–60 H12 223,910 Sa 237,642 S 240,539 R 196,302 HO 233,394

110,180 145,602 127,016 116,610 154,566

Q2

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Table A.11 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 50, b = 100, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Max

H12

Sa

S

R

HO

68,352 94,154 89,176 60,730 88,338

84,130 114,332 112,383 110,928 111,769

0 1 1 0 1

1 0 0 1 0

1 0 0 1 0

0 1 1 0 1

1 0 0 1 0

154,976 163,043 164,153 139,879 151,157

216,924 191,624 197,658 179,630 178,400

339,026 288,173 286,082 218,004 271,947

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

207,020 233,910 231,013 195,788 226,531

269,197 278,875 278,687 232,230 251,913

389,201 380,876 364,789 278,502 359,137

0 0 0 0 0

0 0 0 1 0

0 0 0 1 0

0 1 1 0 1

0 0 0 1 0

Periods 1–12 H12 55,363 Sa 78,081 S 79,325 R 54,401 HO 78,640

30,484 52,164 49,648 30,969 53,179

43,225 64,769 72,326 43,848 65,071

52,089 74,325 79,767 51,961 77,096

Periods 13–60 H12 167,459 Sa 160,550 S 160,496 R 145,233 HO 157,968

69,601 67,548 69,630 62,335 102,333

111,325 111,636 108,476 117,242 128,922

Periods 1–60 H12 222,822 Sa 238,630 S 239,822 R 199,633 HO 236,608

112,727 131,692 149,397 105,734 188,647

161,796 193,457 192,938 171,316 210,443

Q3

 H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

without the GARCH(1,1) variables, as indicated by ‘1’s in Columns Sa and S on Row R in Periods 13–60. Although Rollout has lower average total cost than HO, the significance test could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO in Periods 1–60. Table A.5 shows that Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO performed significantly better than the 12period Look-Ahead method, as indicated by ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. Sarsa and Sarsa without the GARCH(1,1) variables performed indifferently, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. Rollout outperformed Sarsa and Sarsa without

the GARCH(1,1) variables, as indicated by ‘1’s in Columns Sa and S on Row R in Periods 13–60. Although Rollout has lower average total cost than HO, the significance test could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO in Periods 1–60. Table A.6 shows the entries of ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. This means that Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO outperformed the 12-period Look-Ahead method. Sarsa and Sarsa without the GARCH(1,1) variables performed indifferently, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. The entries of ‘1’s in Columns Sa and S on Row R in

739

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.12 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 100, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Max

H12

Sa

S

R

HO

Z

74,422 87,043 94,681 78,318 93,834 65,504

94,606 102,512 111,380 97,890 113,496 91,124

0 1 1 0 1 0

1 0 0 1 0 1

1 0 0 1 0 1

0 1 1 0 1 0

1 0 0 1 0 1

0 1 1 0 1 0

161,661 158,858 156,131 142,983 147,223 150,676

207,095 195,125 185,111 164,300 166,832 187,004

360,465 285,208 287,375 238,114 314,752 300,671

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

217,615 225,156 226,146 207,381 227,918 194,194

273,032 275,326 274,385 244,980 239,341 243,579

410,306 346,725 364,336 323,891 403,329 337,442

0 0 0 0 0 0

0 0 0 0 0 1

0 0 0 0 0 1

0 0 0 0 0 0

0 0 0 0 0 0

0 1 1 0 0 0

Periods 1–12 H12 61,071 Sa 76,177 S 78,050 R 62,213 HO 78,661 Z 54,041

17,216 58,426 52,469 31,165 57,380 32,604

43,641 63,766 63,547 43,510 63,106 36,767

63,492 77,626 75,373 62,155 74,276 52,080

Periods 13–60 H12 166,638 Sa 158,500 S 156,897 R 147,223 HO 151,540 Z 150,510

75,164 71,070 65,936 86,820 72,425 59,044

112,282 100,641 110,682 114,064 128,378 95,219

Periods 1–60 H12 227,708 Sa 234,677 S 234,947 R 209,436 HO 230,201 Z 204,551

125,328 130,939 128,879 137,906 143,960 92,879

178,420 189,992 189,145 168,107 200,962 156,586

Q3

 H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Table A.13 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 200, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Max

H12

Sa

S

R

HO

Z

75,997 98,013 88,573 88,566 87,302 72,736

118,090 116,455 119,299 126,010 102,442 94,835

0 1 1 1 1 0

1 0 0 0 0 1

1 0 0 1 0 1

1 0 1 0 0 1

1 0 0 0 0 1

0 1 1 1 1 0

161,134 156,951 150,032 149,493 141,835 152,239

219,709 191,474 199,377 186,721 170,456 179,703

374,962 291,025 278,681 354,871 229,182 314,713

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

208,872 233,234 225,421 223,967 233,124 207,487

275,681 278,112 280,322 257,563 246,984 245,793

434,490 390,412 362,782 397,287 318,048 371,821

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

Periods 1–12 H12 58,083 Sa 81,129 S 81,467 R 72,222 HO 78,343 Z 57,727

26,638 53,353 60,766 42,416 58,205 20,841

41,443 66,295 67,418 53,945 67,045 37,629

50,390 77,151 75,955 71,143 80,481 57,108

Periods 13–60 H12 176,271 Sa 158,721 S 158,679 R 157,325 HO 147,128 Z 152,270

68,743 64,344 71,031 71,657 71,393 48,087

111,268 110,723 114,948 127,077 114,325 101,452

Periods 1–60 H12 234,354 Sa 239,850 S 240,146 R 229,547 HO 225,471 Z 209,996

110,957 134,743 148,425 145,059 147,200 85,735

175,309 196,664 198,889 194,764 192,018 155,079

Q3

 H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Periods 13–60 indicate that Rollout outperformed both Sarsa and Sarsa without the GARCH(1,1) variables. Again, Rollout has lower average total cost than HO, but the significance test could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO in Periods 1–60.

A.1.2. Experimental results of different demand variances Tables A.7 and A.8 show statistics and summaries of cross significance tests of different methods on inventory problems with medium and low noise variances (m0 = 40 and 10), respectively. Detailed experimental results of the different methods applied to

740

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

Table A.14 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 50, and h = 1. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Max

H12

Sa

S

R

HO

Z

67,481 93,884 84,502 74,384 57,978 60,946

107,000 114,095 100,965 137,081 111,193 82,754

0 1 1 1 0 0

1 0 0 1 1 1

1 0 0 1 1 1

1 1 1 0 1 1

0 1 1 1 0 0

0 1 1 1 0 0

152,239 156,433 149,342 133,884 144,771 150,701

191,173 186,973 196,213 170,886 170,908 186,465

350,288 287,381 285,845 249,570 242,769 298,200

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

190,538 229,456 221,830 199,214 196,715 191,328

259,315 276,419 278,895 241,631 227,137 237,402

360,100 377,454 359,840 347,656 309,537 332,784

0 0 0 0 0 0

0 0 0 0 1 1

0 0 0 0 1 1

0 0 0 0 0 0

0 1 1 0 0 0

0 1 1 0 0 0

Periods 1–12 H12 50,926 Sa 77,253 S 74,465 R 66,631 HO 50,317 Z 49,777

9812 46,878 50,796 34,214 24,678 29,952

37,973 58,601 64,092 49,951 37,962 34,924

45,952 75,494 78,101 58,795 44,896 49,809

Periods 13–60 H12 157,838 Sa 158,046 S 158,833 R 149,849 HO 146,971 Z 149,914

56,136 59,862 71,067 80,684 90,417 58,160

103,188 114,196 112,807 120,725 115,021 94,398

Periods 1–60 H12 208,764 Sa 235,299 S 233,298 R 216,480 HO 197,288 Z 199,692

102,842 121,797 135,242 131,073 121,763 90,555

157,909 180,151 187,084 176,765 155,093 151,652

Q3

 H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

Table A.15 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b =50, and h = 10. Statistics Mean

Summaries of cross tests Min

Q1

Q2

Q3

Max

H12

Sa

S

R

HO

Z

Periods 1–12 H12 47,427 Sa 106,898 S 105,540 R 57,969 HO 60,629 Z 47,673

22,630 85,740 87,910 30,060 28,450 19,300

29,548 92,220 95,835 40,198 42,625 33,030

47,220 106,420 104,020 49,750 55,770 41,490

57,603 118,603 115,003 66,150 67,788 56,293

90,020 137,580 131,420 137,160 142,910 111,830

0 1 1 0 1 0

1 0 0 1 1 1

1 0 0 1 1 1

0 1 1 0 0 1

1 1 1 0 0 1

0 1 1 1 1 0

Periods 13–60 H12 187,556 Sa 304,352 S 305,220 R 177,700 HO 198,515 Z 180,778

80,700 231,060 226,760 134,000 108,020 65,490

124,663 270,570 270,228 147,430 158,418 135,415

184,550 306,450 302,650 165,880 193,240 177,720

227,673 330,340 326,313 191,268 230,215 210,470

360,920 396,650 411,970 346,890 307,730 371,140

0 1 1 0 0 0

1 0 0 1 1 1

1 0 0 1 1 1

0 1 1 0 0 0

0 1 1 0 0 0

0 1 1 0 0 0

Periods 1–60 H12 234,984 Sa 411,250 S 410,760 R 235,670 HO 259,144 Z 228,451

109,900 323,670 322,430 179,880 163,410 105,680

175,088 370,280 372,190 204,665 212,110 177,793

232,630 401,740 408,090 228,910 254,170 220,970

287,890 447,308 439,183 250,530 300,960 272,663

393,010 525,580 508,510 389,450 380,680 399,470

0 1 1 0 0 0

1 0 0 1 1 1

1 0 0 1 1 1

0 1 1 0 0 0

0 1 1 0 0 1

0 1 1 0 1 0

 H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

the inventory problem with high noise variance m0 = 70 is shown in Table A.2. The summary of significance tests shown in Table A.7 reveals that under medium variance (m0 = 40) the average costs obtained from the ADP methods investigated in our study and the Look-

Ahead method are not significantly different, as indicated by ‘0’ in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. The difference between a learning-based ADP method and a simulation-based ADP method is not significant, as indicated by all ‘0’s in entries comparing methods in Periods

741

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.16 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 50, and h = 50. Statistics

Summaries of cross tests

Mean

Min

Q1

Q2

Q3

Max

H12

Sa

S

R

HO

Z

114,300 272,150 263,950 86,350 170,600 110,400

0 1 1 0 1 0

1 0 0 1 1 1

1 0 0 1 1 1

0 1 1 0 1 0

1 1 1 1 0 1

0 1 1 0 1 0

289,713 967,913 963,650 264,100 884,813 269,675

450,150 977,700 978,200 358,350 942,750 411,600

0 1 1 0 1 0

1 0 0 1 1 1

1 0 0 1 1 1

0 1 1 0 1 0

1 1 1 1 0 1

0 1 1 0 1 0

344,725 1,218,275 1,209,188 317,650 980,388 338,200

489,250 1,238,750 1,238,100 405,550 1,107,700 472,200

0 1 1 0 1 0

1 0 0 1 1 1

1 0 0 1 1 1

0 1 1 0 1 0

1 1 1 1 0 1

0 1 1 0 1 0

Periods 1–12 H12 60,098 Sa 244,640 S 240,052 R 59,572 HO 114,246 Z 56,084

32,450 224,800 202,150 33,200 83,950 17,400

44,300 235,763 234,438 45,225 91,650 33,638

50,650 245,100 240,500 55,750 115,000 53,350

67,113 253,988 252,138 79,800 129,488 82,588

Periods 13–60 H12 236,810 Sa 953,938 S 948,726 R 227,592 HO 777,532 Z 232,630

113,150 905,000 902,350 145,650 472,350 123,950

182,588 947,650 943,538 187,138 717,513 179,288

227,700 960,200 954,300 211,000 795,200 229,700

Periods 1–60 H12 296,908 Sa 1,198,578 S 1,188,778 R 287,164 HO 891,778 Z 288,714

162,700 1,139,650 1,104,500 193,700 564,150 164,550

245,900 1,183,963 1,177,975 245,513 815,313 224,825

288,650 1,201,150 1,189,100 288,000 903,950 285,600

 H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively.  Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header.  In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.

x 10

4

5

average aggregated cost H12 Sa S R HO

8 7.5 7 6.5

5.5 5 4.5

cost: $

cost: $

statistics of total costs

x 10 6

8.5

6 5.5 5

4 3.5 3 2.5

4.5

2

4

1.5

3.5 1−12

1 13−24

25−36

37−48

49−60

H12

Sa

S

R

HO

Period empirical cdf of period costs

empirical cdf of period costs

1

1

0.9 0.9

0.8 0.7

0.8

cdf

cdf

0.6 0.5

0.7

0.4 0.3 H12 Sa S R HO

0.2 0.1 00

1

2

3

4

cost: $

5

6

7

8 4 x 10

0.6

H12 Sa S R HO

0.5 0

0.5

1

1.5

2

2.5

3

cost: $

Fig. A.11. Performance comparison of Look-Ahead, Sarsa, Sarsa without GARCH(1,1) variables, Rollout, and HO.

3.5

4 4 x 10

742

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

13–60. The average inventory cost obtained from Sarsa is not significantly different from Sarsa without the GARCH(1,1) variables, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. Rollout yielded significantly better performance than HO, as indicated by ‘1’ in Column HO on Row R, in Periods 1–60. Table A.8 shows that Rollout is the only method that outperformed the Look-Ahead method (in Periods 1–60), as indicated by ‘1’ in Column H12 on Row R. The other ADP methods have higher average costs than the Look-Ahead method, but the significance tests could not confirm the difference, as indicated by ‘0’s in Column H12 on Rows Sa and S in Periods 13–60 and on Row HO in Periods 1–60. It should be noted that Sarsa and Sarsa without the GARCH(1,1) variables have lower maximum cost than the LookAhead method. HO has lower third-quartile and maximum cost than the Look-Ahead method. The lower maximum cost may indicate the reliability improvement achieved by Sarsa, Sarsa without the GARCH(1,1) variables, and HO. The average inventory cost obtained by using Sarsa is not significantly different from Sarsa without the GARCH(1,1) variables. Rollout outperformed HO, as indicated by ‘1’ in Column HO on Row R, in Periods 1–60. A.1.3. Experimental results of different set up costs Tables A.9,A.10,A.11 and A.12 show statistics and summaries of cross significance tests of different methods on inventory problems with set up cost K = 150, 100, 50, and 0, respectively. In Tables A.12,A.13,A.14,A.15 and A.16, the previous solution of Zhang (2007) is included. Zhang’s AR(1)/GARCH(1,1) solution, denoted Z, is available for an inventory problem without set up cost. Table A.9 shows that although Sarsa and Sarsa without the GARCH(1,1) in Periods 13–60 and Rollout in Periods 1–60 provided lower average costs than the Look-Ahead method, the significance tests could not confirm the differences. The average inventory costs obtained from Sarsa without the GARCH(1,1) variables are not significantly different from Sarsa, as indicated by ‘0’s in entries corresponding to comparison of S and Sa, in Periods 13–60. Rollout provides lower average cost than Sarsa in Periods 13–60 and HO in Periods 1–60, but the significance tests could not confirm the difference. Table A.10 shows that Sarsa and Sarsa without the GARCH(1,1) variables have lower average costs than the Look-Ahead method in Periods 13–60, but the significance tests could not confirm the difference. The average inventory costs obtained from Sarsa without the GARCH(1,1) and Sarsa are not significantly different, as indicated by ‘0’s in entries corresponding to comparison of Sa and S in Periods 13–60. When comparing learning-based ADP methods to simulation-based ADP methods, although Rollout and HO have lower average costs than Sarsa and Sarsa without the GARCH(1,1) variables in Periods 13–60, the significance test could not confirm the difference. However, when comparing Rollout and HO in Periods 1–60, Rollout performed significantly better than HO, as indicated by ‘1’ in Column HO on Row R. Table A.11 shows that, as indicated by ‘0’ in all entries corresponding to comparison of the Look-Ahead method and ADP methods, the significance test could not confirm the difference between average costs obtained from the ADP methods and the Look-Ahead method. The costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different. Although Sarsa and Sarsa without the GARCH(1,1) variables have higher average costs than Rollout and HO, the significance test could not confirm the differences. Rollout significantly outperformed HO, as indicated by ‘1’ in Column HO on Row R, in Periods 1–60. Table A.12 shows that, as indicated by ‘0’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R, HO, and Z, although average inventory costs obtained from the Look-Ahead method are higher than Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and Zhang’s solution, the significance tests could not confirm the

differences. Comparing learning-based ADP methods to Rollout, HO, and Zhang’s solution, the ‘0’s on Rows Sa and S in Periods 13–60 indicate that average costs obtained from Sarsa or Sarsa without the GARCH(1,1) variables are not significantly different from Rollout, HO, or Zhang’s solution. The difference between average inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables is not significant either. Comparing Rollout, HO, and Zhang’s solution, as indicated by average inventory costs and significance tests in Periods 1–60, Zhang’s solution gives the lowest average inventory costs, but the significance tests could not confirm the differences. A.1.4. Experimental results of different penalty costs Tables A.13 and A.14 show statistics and summaries of cross significance tests of different methods on inventory problems with penalty costs $200 (b/g = 2) and $50 (b/g = 0.5), respectively. A summary of the different methods applied to the inventory problem with penalty cost $100 (b/g = 1) is shown in Table A.12. Table A.13 shows that although Sarsa and Sarsa without the GARCH(1,1) variables have lower average inventory costs than the Look-Ahead method in Periods 13–60 and Rollout, HO, and Zhang’s solution have lower average inventory costs than the Look-Ahead method in Periods 1–60, the significance tests could not confirm the differences. The average inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’ in Column S on Row Sa, in Periods 13–60. Comparing Rollout, HO, and Zhang’s solution, Zhang’s solution has the lowest average inventory cost, but the significance tests could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R, HO, and Z in Periods 1–60. Table A.14 shows that the average inventory costs obtained from Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and Zhang’s solution are not significantly different from the Look-Ahead method, as indicated by ‘0’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R, HO, and Z in Periods 1–60. The average costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’ in Column S on Row Sa, in Periods 3–60. Comparing Rollout, HO, and Zhang’s solution, HO gives the lowest average inventory cost, but the significance tests could not confirm the differences, as indicated by ‘0’s in entries corresponding to comparison of R, HO, and Z in Periods 1–60. A.1.5. Experimental results of different holding costs Tables A.15 and A.16 show statistics and summaries of cross significance tests of different methods on inventory problem with holding costs $10 (h/b = 0.2) and $50 (h/b = 1), respectively. A summary of the different methods applied to the inventory problem with holding cost $1 (h/b = 0.02) is shown in Table A.14. Table A.15 shows that the Look-Ahead method performed significantly better than Sarsa and Sarsa without the GARCH(1,1) variables, as indicated by ‘1’s in Column H12 on Rows Sa and S, in Periods 13–60. Although the Look-Ahead method has lower average inventory cost than Rollout and HO, the significance tests could not confirm the difference, as indicated by ‘0’s in Column H12 on Rows R and HO. The costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’s in Column S on Row Sa in Periods 13–60. When compared to Rollout, HO, and Zhang’s solution, inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are significantly higher than Rollout, HO, Zhang’s solution, as indicated by ‘1’s in Columns R, HO, and Z on Rows Sa and S, in Periods 13–60. The significance test confirms that Zhang’s solution outperformed HO, as indicated by ‘1’ in Column HO on Row Z, in Periods 1–60. The significance test could not confirm the difference

T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743

between average inventory costs obtained from Rollout and Zhang’s solution, as indicated by ‘0’ in Column R on Row Z, in Periods 1–60. Table A.16 shows that the Look-Ahead method outperformed Sarsa, Sarsa without the GARCH(1,1) variables, and HO, as indicated by ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Row HO in Periods 1–60. The average inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’ in Column S on Row Sa, in Periods 13–60. Rollout, HO, and Zhang’s solution outperformed Sarsa and Sarsa without the GARCH(1,1) variables, as indicated by ‘1’ in Columns Sa and S on Rows R, HO, and Z, in Periods 13–60. Rollout and Zhang’s solution also show superior performance to HO, as indicated by ‘1’ in Column HO on Rows R and Z, in Periods 1–60. Rollout has slightly lower cost than Zhang’s solution, but the significance test could not confirm the difference. References Barreto, A. M. S., & Anderson, C. W. (2008). Restricted gradient-descent algorithm for value-function approximation in reinforcement learning. Artificial Intelligence, 172(4–5), 454–482. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA, USA: Athena Scientific. Bertsimas, D., & Thiele, A. (2006). A robust optimization approach to inventory theory. Operations Research, 54(1), 150–168. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer. Chaharsooghi, S. K., Heydari, J., & Zegordi, S. H. (2008). A reinforcement learning model for supply chain ordering management: An application to the beer game. Decision Support System, 45(4), 949–959. Choi, J., Realff, M. J., & Lee, J. H. (2006). Approximate dynamic programming: Application to process supply chain management. AIChe Journal, 52(7), 2473–2485. Chong, E. K. P., Givan, R. L., & Chang, H. S. (2000). A framework for simulation-based network control via hindsight optimization. In Proceedings of the 39th IEEE conference on decision and control. Csáji, B. C., & Monostori, L. (2008). Value function based reinforcement learning in changing Markovian environments. Journal of Machine Learning Research, 9, 1679–1709. Das, T. K., Gosavi, A., Mahadevan, S., & Marchalleck, N. (1999). Solving semi-Markov decision problems using average reward reinforcement learning. Management Science, 45(4), 560–574. Giannoccaro, I., & Pontrandolfo, P. (2002). Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78(2), 153–161. Godfrey, G. A., & Powell, W. B. (2001). An adaptive, distribution-free algorithm for the newsvendor problem with censored demands, with applications to inventory and distribution. Management Science, 47(8), 1101–1112. Haykin, S. (1994). Neural networks: A comprehensive foundation. Macmillan College Publishing Company, Inc.

743

Iida, T., & Zipkin, P. H. (2006). Approximate solutions of a dynamic forecastinventory model. Manufacturing and Service Operations Management, 8(4), 407–425. Jiang, C., & Sheng, Z. (2009). Case-based reinforcement learning for dynamic inventory control in a multi-agent supply chain system. Expert Systems with Applications, 36(3), 6520–6526. Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285. Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481. Kim, C. O., Jun, J., Baek, J. K., Smith, R. L., & Kim, Y. D. (2005). Adaptive inventory control models for supply chain management. International Journal of Advanced Manufacturing Technology, 26(9–10), 1184–1192. Kim, C. O., Kwon, I. H., & Baek, J. G. (2008). Asynchronous action-reward learning for nonstationary serial supply chain inventory control. Applied Intelligence, 28(1), 1–16. Kleinau, P., & Thonemann, W. (2004). Deriving inventory-control policies with genetic programming. OR Spectrum, 26(4), 521–546. Kwon, I. H., Kim, C. O., Jun, J., & Lee, J. H. (2008). case-based myopic reinforcement learning for satisfying target service level in supply chain. Expert Systems with Applications, 35(1–2), 389–397. Lambert, D., Stock, J. R., & Ellram, L. M. (1998). Fundamentals of logistics. McGrawHill/Irwin. Lee, H. L., & Billington, C. (1992). Managing supply chain inventory: Pitfalls and opportunities. Sloan Management Review, 65–73. Pontrandolfo, P., Gosavi, A., & Okobaa, O. G. (2002). Global supply chain management: A reinforcement learning approach. International Journal of Production Research, 40(6), 1299–1317. Powell, W. B. (2007). Approximate dynamic programming: Solving the curses of dimensionality. Wiley. Shervais, S., Shannon, T. T., & Lendaris, G. G. (2003). Intelligent supply chain management using adaptive critic learning. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 33(2), 235–244. Si, J., Barto, A., Powell, W., & Wunsch, D. II (Eds.). (2004). Handbook of learning and approximate dynamic programming. IEEE Press. Silver, E. A. (1981). Operations research in inventory management: A review and critique. Operations Research, 29(4), 628–645. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. MIT Press. Topaloglu, H., & Kunnumkal, S. (2006). Approximate dynamic programming methods for an inventory allocation problem under uncertainty. Naval Research Logistics, 53(8), 822–841. Van Roy, B., Bertsekas, D. P., Lee, Y., & Tsitsiklis, J. N. (1997). A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the IEEE conference on decision and control. Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, Cambridge University. Werbos, P. (1992). Handbook of intelligent control. In D. White & D. Sofge (Eds.), chap. neurocontrol and supervised learning: An overview and evaluation. New York: Van Nostrand Rheinhold. Werbos, P. (2004). Handbook of learning and approximate dynamic programming. In J. Si, A. Barto, W. Powell, & D. Wunsch, II (Eds.), ADP: Goals, opportunities and principles. IEEE Press. Zhang, X. (2007). Inventory control under temporal demand heteroscedasticity. European Journal of Operational Research, 182(1), 127–144.

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.