Computers & Industrial Engineering 60 (2011) 719–743
Contents lists available at ScienceDirect
Computers & Industrial Engineering journal homepage: www.elsevier.com/locate/caie
Approximate dynamic programming for an inventory problem: Empirical comparison q Tatpong Katanyukul a,⇑, William S. Duff a, Edwin K.P. Chong b a b
Mechanical Engineering Department, College of Engineering, Colorado State University, United States Electrical and Computer Engineering Department, College of Engineering, Colorado State University, United States
a r t i c l e
i n f o
Article history: Received 4 December 2009 Received in revised form 22 November 2010 Accepted 16 January 2011 Available online 22 January 2011 Keywords: Approximate dynamic programming Inventory control Reinforcement learning Simulation Heterogeneity AR(1)/GARCH(1,1)
a b s t r a c t This study investigates the application of learning-based and simulation-based Approximate Dynamic Programming (ADP) approaches to an inventory problem under the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model. Specifically, we explore the robustness of a learning-based ADP method, Sarsa, with a GARCH(1,1) demand model, and provide empirical comparison between Sarsa and two simulation-based ADP methods: Rollout and Hindsight Optimization (HO). Our findings assuage a concern regarding the effect of GARCH(1,1) latent state variables on learning-based ADP and provide practical strategies to design an appropriate ADP method for inventory problems. In addition, we expose a relationship between ADP parameters and conservative behavior. Our empirical results are based on a variety of problem settings, including demand correlations, demand variances, and cost structures. Ó 2011 Elsevier Ltd. All rights reserved.
1. Introduction Inventory management is one of the major business activities. A well-managed inventory can help a business stay competitive by keeping its cash flow at a controllable level. Recently, Zhang (2007) analyzed data obtained from the M3 Census Bureau and established the appropriateness of the GARCH(1,1) model in inventory demand. He also showed that the inventory costs may increase significantly when the GARCH(1,1) model is not accounted for and analytically developed an orderupto-level policy for a problem without a setup cost. However, his solution is highly problem specific—a change in the structure of the problem requires reanalysis of the problem. For example, a setup cost is common in practice, but it is not included in Zhang (2007). Inclusion of a setup cost into a problem makes a cost function highly nonlinear and renders an analytical approach very difficult. Furthermore, an analytical approach is time consuming and requires highly specialized skill and effort, as discussed in Kleinau and Thonemann (2004) and Jiang and Sheng (2009). Inventory problems appear in various forms and their forms often change over time. Therefore an analytical approach is not suitable for practical inventory problems. Many articles, e.g., Silver (1981), Lee and
q
This manuscript was processed by area editor Mohamad Y. Jaber.
⇑ Corresponding author.
E-mail addresses:
[email protected] (T. Katanyukul),
[email protected] (W.S. Duff),
[email protected] (E.K.P. Chong). 0360-8352/$ - see front matter Ó 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.cie.2011.01.007
Billington (1992) and Bertsimas and Thiele (2006) to name a few, addressed the need for an efficient flexible inventory solution simple to implement in practice. Approximate Dynamic Programming (ADP) is a method to solve a practical Markov decision problems. It does not rely on an analytical treatment of the problem or hard-to-obtain information, e.g., transition probabilities. Therefore ADP has received extensive attention in many control applications, as discussed by Werbos (2004). ADP uses an approximation technique, which mostly is attained by either a learning-based scheme or a simulation-based scheme. A learning-based ADP method uses a learning technique, e.g., temporal difference learning, to approximate state-action costs. A simulation-based ADP method uses simulation, e.g., Monte Carlo simulation, to approximate state-action costs. Most previous studies of an ADP application to inventory management focused on learning-based ADP; see, e.g., Van Roy, Bertsekas, Lee, and Tsitsiklis (1997), Giannoccaro and Pontrandolfo (2002), Shervais, Shannon, and Lendaris (2003), Kim, Jun, Baek, Smith, and Kim (2005), Kwon, Kim, Jun, and Lee (2008), Kim, Kwon, and Baek (2008), Chaharsooghi, Heydari, and Zegordi (2008) and Jiang and Sheng (2009). Only Choi, Realff, and Lee (2006) investigated simulationbased ADP. Nonetheless, these two schemes are not mutually exclusive. For example, Kim et al. (2008) used simulation to evaluate consequences of actions not taken for accelerating a learning process in their learning-based ADP. Although all these studies showed satisfying results in their domain problems, to the best of our knowledge, there is no previous
720
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
work investigating an ADP application to a problem with the GARCH(1,1) model. GARCH(1,1) introduces two latent state variables. The latent state variables will be inadvertently left out if the GARCH(1,1) model is not accounted for. This has posed a challenge to the model-free property of a learning-based ADP method. When a model of the problem is available, a simulation-based ADP method is another alternative. A simulation-based ADP method is an approach intermediate between an analytical approach and a learning-based ADP method. An analytical approach requires a model of the problem and a development of an analytical solution. A learning-based ADP method requires minimum knowledge of the problem. A simulation-based ADP method requires a model of the problem and, rather than relying purely on analysis, it uses simulation to provide information for assisting action selection. Two simulation-based ADP methods are investigated here: Rollout and Hindsight Optimization (HO). Though it has been used in other applications, Rollout had been investigated for inventory problems in only one study. HO, as discussed by Chong, Givan, and Chang (2000), had not been studied for an inventory problem previously. Also, how these simulation-based ADP methods perform compared to a learning-based ADP method had not been investigated previously. Our study is intended to fill in this space. It investigates the application of ADP to an inventory problem with demand of AR1/GARCH(1,1) and provides empirical comparison among Sarsa, Rollout, and HO. The findings here provide a practical approach to design an ADP method for an inventory problem and an insight into relations of ADP components, performance, and control behavior. Furthermore, the results reaffirm the model-free property of a learning-based ADP method even in the presence of latent state variables introduced by GARCH(1,1). The understanding exposed in this paper will help promote efficient inventory management and aid in the transfer of inventory research into practice. This section has provided an overview of our study. Section 2 provides a review of the related literature. Section 3 establishes the general background of the three methods. Section 4 explains the problem under investigation, how our empirical study is conducted, how each controller is set up, how experiments are performed, and how the experimental results are evaluated. Section 5 explains what the experimental results indicate. The last section presents conclusions and further discussions. 2. Literature review ADP has been recently introduced into inventory management research by Van Roy et al. (1997), Godfrey et al. (2001), Pontrandolfo, Gosavi, and Okobaa (2002), Giannoccaro and Pontrandolfo (2002), Shervais et al. (2003), Kim et al. (2005), Choi et al. (2006), Topaloglu and Kunnumkal (2006), Iida and Zipkin (2006), Chaharsooghi et al. (2008), Kim et al. (2008), Kwon et al. (2008) and Jiang and Sheng (2009). Fig. 1 shows a block diagram illustrating ADP methods used in previous studies. 2.1. Simulation-based ADP Of all these authors, only Choi et al. (2006) investigates the application of simulation-based ADP. Simulation was used to provide reduced state space, reduced action space, and approximate transition probabilities for a dynamic program, which in turn was solved with either value iteration or Rollout. Rollout uses simulation to provide approximate state-action costs. The simulation requires a control method to provide decisions in simulation. Such a control method is called a base policy. For their base policy, Choi et al. used an (s, S) policy whose parameters were obtained from a heuristic search over pre-defined sets. The pre-defined sets of
parameters used in Choi et al. (2006) are problem specific and it is unclear how they obtained them. Rollout is also investigated in our study. We use a simple formula based on the well-known Economic Order Quantity (EOQ) equation to determine parameters for the base policy. In addition to Rollout, HO—another simulationbased ADP method—has never been investigated for inventory problems. HO does not require a base policy. Therefore we investigate HO for its own virtues as well as to provide a good measure of how simulation-based ADP performs without the choice of a base policy. 2.2. Learning-based ADP Authors studying learning-based ADP methods investigated several learning schemes. Van Roy et al. (1997) used one-step temporal difference learning (TD0). Chaharsooghi et al. (2008) used Q-learning. Q-learning, introduced by Watkins (1989), is an algorithm based on TD0. Kim et al. (2005) used an action-value method whose learning scheme was based on a weighted average value of a current approximation and a new observation. Their approach is similar to TD0, but it only approximates a current state-action value without a cost-to-go. Kwon et al. (2008) and Jiang and Sheng (2009) used the casebased myopic reinforcement learning (CMRL) method developed by Kwon et al. (2008) CMRL is based on a combination of an action-value method and a case-based reasoning technique. Case-based reasoning is state aggregation with an ability to create a new aggregation when an observed state is over a preset range of any existing aggregation group. Kim et al. (2008) proposed and used an asynchronous actionreward learning method. For a fast changing inventory system they assumed that information of action-consequence relations, regardless of state, was sufficient for decision making. Their asynchronous action-reward learning scheme is developed based on characteristics of inventory problems that allows simultaneously multiple action updates. Multiple action updates help accelerate the learning process to enable it to catch up with changes in the system. Instead of only updating an action-reward value for an action taken, approximate values of actions not taken were updated as well. Given an observation of an exogenous variable (e.g., demand), consequences of actions not taken can be calculated and the multiple updates achieved with these computed consequences. Shervais et al. (2003) used the dual heuristic programming (DHP) method. DHP, discussed by Werbos (1992), is a learning ADP method that updates a control policy directly using derivatives of the cost function. It should be noted that inclusion of a setup cost, formulated as a mathematical step function, renders this method inapplicable to the problems addressed in our study, because a step function is not differentiable. However, there is a technique to approximate a step function with a sigmoid function. A sigmoid function is differentiable. The application of DHP with an approximate step function has not been investigated and it may be worth further study. Giannoccaro and Pontrandolfo (2002) used the SMART algorithm, developed by Das, Gosavi, Mahadevan, and Marchalleck (1999). The SMART algorithm is similar to Q-learning. In Qlearning, every time step is assumed to be equal. Giannoccaro and Pontrandolfo studied an inventory problem whose time response is a function of a current state, a next state and a current action. To handle varied time response, SMART uses a time correction term and its associate procedures to approximate an average state-action value.
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
721
Fig. 1. A block diagram shows ADP methods using in previous studies.
Our work investigates Sarsa, a widely-used algorithm based on TD0. The learning scheme used in Van Roy et al. (1997, Section 6.2) is equivalent to Sarsa. 2.3. Function approximations used in learning-based ADP studies For the issue of function approximation, there are a few choices of function approximation studied previously for application to inventory management. Look-up table. Jiang and Sheng (2009), Kim et al. (2008), Kwon et al. (2008), Kim et al. (2005) used a look-up table to implement a cost-to-go approximation. A look-up table is a simple index table where each entry, an approximate cost, can be accessed by an index, a state-action pair. Aggregation. Giannoccaro and Pontrandolfo (2002) and Chaharsooghi et al. (2008) used an Aggregation. An Aggregation is an enhanced version of a look-up table. It is a look-up table with
a group of indexes, instead of a single index. Any indexing value falling within the same index group will be linked to the same entry. For the same problem, an Aggregation requires a smaller table than a look-up table. Other function approximation. Van Roy et al. (1997) experimented with a linear combination of features and the Multilayer Perceptron Neural Network (MLP). Shervais et al. (2003) also used MLP for an approximate cost-to-go. Among these approximation choices, a look-up table is simplest to implement, but it suffers from a scalability issue. An Aggregation is a good alternative, but the size of its aggregation step needs to be carefully designed. A linear combination of features provides the efficiency of a linear computation, but it requires a customized selection of features specific to each problem. MLP is a very powerful approximation function, but its highly nonlinear nature makes it difficult to fine tune with ADP. A Radial Basis Function (RBF) is a universal approximation function. It can be operated in
722
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
a linear mode, which is preferable for ADP. RBF is a linear combination of locally active functions. Therefore it can be viewed either as a smooth interpolation of an Aggregation or as a linear combination of features, which are the Radial Bases. Our study investigates an application of ADP with RBF to inventory problems. Among previous authors applying ADP to inventory problems, no one has investigated the performance differences between simulation-based and learning-based ADP and applicability to inventory problems of ADP with RBF. In addition to previous studies of ADP application to inventory problems, the conclusions of Zhang (2007) about the presence and significance of the GARCH(1,1) model in inventory management has laid one of the important questions our research intents to answer. A learning-based ADP method has been referred to as the model-free solution by many authors, including Kaelbling, Littman, and Moore (1996). However, the GARCH(1,1) model introduces two latent state variables, which will be inadvertently left out if GARCH(1,1) is not accounted for. This has posed a challenge to the model-free property of a learning-based ADP method. The robustness of learning-based ADP for an inventory problem with GARCH(1,1) has not been investigated previously. The intent of our study is to investigate these unexplored issues to foster ADP applications for practical inventory management. 3. Background
X
a2AðsÞ
pðs; a; s0 Þ cðs; a; s0 Þ þ aC tþ1 ðs0 Þ
ð1Þ
s0
where a is a discount factor; C t ðsÞ is a minimal expected cost from time t with state s; C tþ1 ðs0 Þ is a next minimal expected cost, often referred to as a cost-to-go; a is an action, chosen from A(s), a feasi0 ble set of actions for the given state; p(s, a, s ) is a transition proba0 bility and c(s, a, s ) is a transition cost. There are many ADP methods. Our study investigates Sarsa, Rollout, and HO. We use the Look-Ahead method as a benchmark for a performance comparison. (See Si, Barto, Powell, & Wunsch (2004) & Powell (2007) for a general discussion on ADP.) 3.1. Look-Ahead A Look-Ahead method is a method to solve MDP based on a future projection. Due to its ease of computation, our study uses the Look-Ahead method as a benchmark to compare its performance with other methods. A Look-Ahead method uses a model of the problem to project events ahead. It chooses actions to minimize a projected cost of a specific number of period(s) ahead. The cost and state projections in our study are projections with zero demand noise. Eq. (2) shows how the Look-Ahead method determines an action:
~ aH ¼ arg min ~ a
3.2. Sarsa Sarsa, as discussed by Sutton and Barto (1998), uses a learning technique, TD0, to facilitate decisions. Using TD0 helps alleviate a requirement of a model of the problem and a need for a transition probability. Sarsa has been investigated for inventory problems by Van Roy et al. (1997). However, the GARCH(1,1) model introduces two extra state variables, forecasting error z and its variance r2. These state variables are less likely to be observable than others. If a model of the problem is not well-defined, these extra state variables may be inadvertently left out. The performance of Sarsa with missing latent state variables had not been investigated previously. We investigated the robustness of Sarsa in the presence of the GARCH(1,1) model (see Section 4.3). Sarsa uses an approximate state-action cost Q(st, at), often called the Q-value, to determine an action and updates Q(st, at) based on TD0. The equations for Sarsa are
Cðst ; at Þ ¼ cðst ; at Þ þ a Cðstþ1 ; atþ1 Þ wðs; aÞ ¼ cðs; aÞ þ aQ ðs0 ; a0Þ Q ðs; aÞ Q ðs; aÞ
An inventory structure investigated here is a multi-period inventory decision problem. In multi-period decision problems, a decision is made to optimize both an immediate return and an effect that will carry on to the following periods. This type of problems can be generally classified as Markov decision problems (MDP). ADP is an approach to solve MDP based on the Bellman equation (Eq. (1)) by using an approximation technique to overcome the computational difficulty when applying an exact analysis. The computational difficulties are associated with transition probabilities and computational inefficiencies. The tendency of computational difficulties to increase exponentially when the number of state or action variables increases is referred to as curses of dimensionality. This equation is
C t ðsÞ ¼ min
where ~ a ¼ ½að1Þ að2Þ að3Þ . . . aðHÞ ; cðs; aÞ is the projected period cost; stþ1 ; . . . ; stþH1 are projected states at times t + 1, . . ., t + H 1, respectively; st ¼ st is the state of the beginning period; and the decided action is at = a(1).
H X i¼1
ai1 cðstþi1 ; aðiÞ Þ
ð2Þ
ð3Þ ð4Þ
Qðs; aÞ þ bwðs; aÞ
ð5Þ
at ¼ pðst jQ Þ
ð6Þ
where Q(s, a) approximates the state-action cost C(s, a); c(s, a) is a period cost; b is a Sarsa parameter, called the learning rate, with the other variables as defined earlier. Eq. (4) shows the updated signal w(s, a), called a temporal difference. There are several ways to implement Q(s, a). The simplest way is to use a look-up table so that each entry corresponds to each state-action pair. A Q-value implemented by a look-up table can be updated as shown in Eq. (5). Then, given a state, an action can be determined according to Q-values, as in Eq. (6). When Q-values well represent state-action costs, choosing an action with a low Q-value will lead to a low actual cost. However, in order to have well-approximate Q-values, the state-action space has to be well explored. Therefore, in order to perform well overall, an action has to be chosen on a basis of a good balance between exploitation and exploration (see Sutton & Barto (1998) for detail). Although it is simple to implement, a look-up table is computationally inefficient in case of a large or continuous state-action space. An alternative is a function approximation. A function approximation can be set up to use fewer parameters than the entries required by a look-up table. As discussed by Barreto and Anderson (2008), RBF, because of its capability to represent an arbitrary function, its focus on local adjustments and its ability to operate in a linear mode, is one of the most widely-used function approximations. RBF output is a weighted summation of basis functions of scaled distances between inputs ~ x and centers ~ v m . Eq. (7) shows a single output y 2 R of an RBF with M bases, given inputs ~ x 2 RD , with k~ x~ v m kZm a scaled Euclidean distance:
y ¼ w0 þ
M X
wm /ðk~ x~ v m kZm Þ
ð7Þ
m¼1
where /() is a basis function; ~ v m is the mth RBF center; Zm is a D D diagonal matrix whose entries along its diagonal are RBF scales (denoted as z1,m, z2,m, . . ., zD,m), w0, w1, w2, . . ., wM are RBF weights, and
k~ x~ v m kZm
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u D uX ¼ ð~ x~ v m ÞT Zm ð~x ~ v m Þ ¼ t zd;m ðxd v d;m Þ2 : d¼1
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
Our study uses a Gaussian function, /(a) = exp(a2), as an RBF basis. (See Bishop (2006) & Haykin (1994) for a general discussion on RBF.) When operating RBF in the linear mode, centers and scales are pre-defined and only RBF weights need to be tuned during the learning process. The RBF weights are updated as shown in Eq. (8) (a strategy to determine RBF centers and scales will be discussed in Section 4.2):
~ w
~ þ b wðs; aÞ ~ w /ðs; aÞ
ð8Þ T
~ ¼ ½w0 w1 . . . wM ; ~ where w /ðs; aÞ is the column vector [1 /1 . . . T /M] with /m ¼ /ðk½s a0 ~ v m kZm Þ and [s a]0 the column vector composed of state and action variables.
where nn,i is the nth simulated value of a random variable at period t þ i; ^cnn;i ðs; aÞ is the simulated period cost; ^stþi ðnn;i Þ is the simulated state of period t þ i; ~ u ¼ ½uð1Þ uð2Þ . . . uðTÞ ; and u(i) is the replenishment order for the HO simulation period i with the rest of variables as mentioned earlier. 4. Methodology Our study uses computer simulations to conduct numerical experiments to evaluate each inventory control approach and provide evidence to answer research questions. The inventory problem with AR1/GARCH(1,1) demand is used. 4.1. An inventory problem
3.3. Rollout Rollout, as discussed by Bertsekas and Tsitsiklis (1996), is a simulation-based ADP method. Provided there is a model of the problem, it determines an action based on approximate costs obtained with simulation. The action is selected such that it minimizes the approximate cost composed of an approximate period state-action cost and an approximate discounted cost-to-go. The approximate cost-to-go in Rollout is the cost (obtained from simulation) associated with a fixed base policy. Eq. (9) shows how Rollout determines an action:
( ) N T X 1 X i p ^cn ðst ; aÞ þ at ¼ arg min a ^cn ð^stþi;n Þ a N n¼1 i¼1
ð9Þ
where at is the chosen action; N is the number of iterations; T is the simulation horizon; ^cn ðs; aÞ is the nth simulated period cost when an action a is taken; ^cpn ðsÞ is the nth simulated period cost when an action is determined by a base policy p; and ^stþi;n is the nth simulated state of period t + i. First Rollout uses simulation to obtain a period cost ^cðst ; aÞ and a next state ^stþ1 . Then, with a next state ^stþ1 and a base policy p, a cost for the next period ^cp ð^stþ1 Þ is found by simulation. The process keeps going until the end of simulation horizon. The second term in Eq. (9) represents an approximate cost-to-go. A combination of a period cost and a discounted cost-to-go is an estimated cost of each action in an action search space. In order to take into account process variation, this estimated cost can be re-evaluated multiple times and the decision can be made based on an average value. A base policy used in our study is discussed in Section 4.2.
Similar to Rollout, Hindsight Optimization (HO), as introduced by Chong et al. (2000), uses simulation to estimate a period cost and a cost-to-go. HO does not require a base policy. Therefore we investigate HO for its own virtues as well as to provide a good measure of how simulation-based ADP performs without having to choose a base policy. Rollout uses a base policy to determine a sequence of actions in its simulation. Unlike Rollout, which simulates events in series, HO uses a special simulation. It first generates a sequence of random variables, which are realizations of inventory demand. Then given the known sequence of random variable values, it chooses a sequence of actions to minimize the cost. Therefore the cost obtained in HO is the minimal cost achieved for that instance. Eq. (10) summarizes the HO mechanism:
1 at ¼ arg min a N ( !) N T X X i ðiÞ ^cnn;0 ðst ; aÞ þ min a ^cnn;i ð^stþi ðnn;i Þ; u Þ ~ u
The inventory problem investigated here is a periodic review single-echelon problem with non-zero leadtime and a setup cost. The inventory system has both in-transit and on-site inventories. A flow of items starts from supplier delivery after a supplier has received a replenishment order. Then, while the items are being delivered, they become in-transit inventory. Later after a delivery duration, called a leadtime, the items will arrive at a storage point. The items at a storage point are called an on-hand inventory. There they are ready to be delivered to a customer once there is a demand. Our study treats the case where demand is exceeded as a backlog order, which is a promise to deliver the items whenever they are available. Since an on-hand inventory and a backlog order are mutually exclusive, they will be represented inclusively as an onsite inventory, denoted x, such that non-negative x will represent an on-hand inventory and negative x will represent a backlog order. The demand is modeled with AR1/GARCH(1,1). Therefore, a state is composed of a previous demand Dt1, a previous demand error t1, a variance of a previous demand error r2t1 , an on-site inventory xt and an in-transit inventory ~ BðtÞ whose length depends on a leadtime L. The state space of this problem is f0; Iþ g R R I f0; Iþ gL for Dt1 ; t1 ; r2t1 ; xt , and ~ BðtÞ , þ respectively. An action ut 2 f0; I g is a replenishment order. State transitions are specified in Eqs. (11)–(15):
Dt ¼ a0 þ a1 Dt1 þ t
ð11Þ
t ¼ et rt r2t ¼ m0 þ m1 2t1 þ m2 r2t1
ð12Þ
xtþ1 ¼ xt þ
3.4. Hindsight optimization
n¼1
723
i¼1
ðtþ1Þ
ðtÞ
ðtþ1Þ
ðtÞ
B1
ðtÞ B1
Dt
ð14Þ
¼ B2 .... .. .. .
BL1 ¼ BL ðtþ1Þ BL
¼ ut
ð15Þ
where a0 and a1 are AR1 model parameters; m0, m1, and m2 are GARCH(1,1) parameters; et is white noise distributed according to N(0, 1); and ut is the replenishment order. The period cost, or transition cost, is the cost associated with the system transitioning from the current state to the next state by the chosen action. In our study, the period cost consists of the replenishment cost and the inventory handling cost, as shown in Eq. (16).
ctþ1 ¼ K t dðut Þ þ g t ut þ ht ðxtþ1 Þþ þ bt ðxtþ1 Þþ ð10Þ
ð13Þ
ð16Þ
where ct+1 is the period cost, whose value will be known at time t + 1 (the end of period t); Kt is the setup cost; gt is the unit replenishment cost; ht is the unit holding cost; bt is the unit backlogging
724
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
cost; d() is the step function; ()+ is the positive function defined by (a)+ = a d(a); and other variables are as mentioned earlier. The first two terms in Eq. (16) represent the replenishment cost and the last two terms represent the handling cost. 4.2. Implementation 4.2.1. Sarsa Sarsa is implemented with a pre-defined structure RBF. RBF centers are set up to be distributed uniformly throughout state-action space. The top left plot of Fig. 2 shows the two-dimensional structure of RBF centers. Since RBF scales correspond to centers, therefore RBF scales can be set up to have the same value for each dimension. Our study assigned scales using a strategy based on a relative basis effect at a middle point. A middle point is a point in an empty space surrounded by RBF centers and a location of the middle point is at the center of that surrounding space. The top right plot of Fig. 2 shows middle points, as circles, among RBF centers. The form of each Gaussian basis has a peak of value 1 at its center and declines along a distance from the center with 2 rate expðdh Þ, where dh is a scaled distance from its center. The lower plot of Fig. 2 gives a representation of a Gaussian basis versus the distance from its center. Values of RBF scales are assigned such that the effect of each RBF center will drop to a specific fraction (e.g., half) of its peak at the middle point. For convenience, this strategy will be referred to as a midpoint strategy and the fraction is prefixed to indicate the fraction of a peak effect at the middle
point. Eq. (17) shows a midpoint strategy to obtain scale along the dth dimension:
zd ¼
4 D D2d
log /
where zd is a scale along the dth dimension; D is the number of dimensions; and Dd is the distance between two adjacent centers along the dth dimension. For example, the 1/10-, 1/2- and 9/10-midpoint strategies assign RBF scales such that the effect of each RBF center will drop to 1/10, 1/2, and 9/10 of its peak value, respectively. Fig. 3 shows a combining effect of all bases when scales are assigned by 1/10-, 1/2-, and 9/10-midpoint strategies. These plots show that the 1/ 2-midpoint strategy has its combining effect distributed evenly throughout the space as being seen as a flat surface. Therefore the 1/2-midpoint strategy is used for experiments in our study. In our experiments, we investigates Sarsa with all state variables and Sarsa without latent state variables introduced by GARCH(1,1). Sarsa with all state variables is implemented with six-dimensional Q-values. A structure of RBF for Sarsa with all state variables has centers at each combination of {0, 100, 200, 300} {200, 100, 0, 100, 200} {0, 800, 1600, . . . , 3200} {300, 200, 100,. . . ,300} {0,100,200,300,400} {0,100,200,300,400}, corresponding to five state variables (Dt1 ; t1 ; r2t1 ; xt and ~ BðtÞ ) and one action variable at. With the 1/2-midpoint strategy, RBF scales z1 = z2 = z4 = z5 = z6 = 0.4621 104 and z3 = 0.0072 104. Sarsa without the GARCH(1,1) variables is implemented with four-
RBF centers with middle points
RBF centers 300
300
200
200
100
100
D2
D2
ð17Þ
0
0
−100
−100
−200
−200
−300
−300 0
50
100
150
200
250
300
0
50
100
D1
150
200
250
300
D1 RBF basis effect v.s. distance
1 0.9 0.8 0.7
φ
0.6 0.5 0.4 0.3 0.2 0.1 0 −3
−2
−1
0
1
||x − v||Z Fig. 2. RBF centers and their midpoints. (Midpoints are shown as red circles.)
2
3
725
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
RBF basis effect with 0.1−midpoint
RBF basis effect with 0.1−midpoint (contour) 300 200
2
100
D2
2.5
1.5
0 −100
1 200
−200
300 0
200 −200
D2
100 0
−300
D1
0
50
100
150
200
250
300
D1
RBF basis effect with 0.5−midpoint
RBF basis effect with 0.5−midpoint (contour) 300 200
3.5
100
D2
3
0 −100
2.5 200
−200
300 0
200 −200
D2
100 0
−300 0
D1
50
100
150
200
250
300
D1
RBF basis effect with 0.9−midpoint
RBF basis effect with 0.9−midpoint (contour) 300 200
15
100
D2
20
10
0 −100
5 200
300 0 −200
D2
−200
200 100 0
D1
−300 0
50
100
150
200
250
300
D1
Fig. 3. Combined effect of RBF basis with different midpoint strategies.
dimensional Q-values. A structure of RBF for Sarsa without the GARCH(1,1) variables has centers at each combination of {0, 100, 200, 300} {300, 200, 100, . . . , 300} {0, 100, 200, 300, 400} {0, 100, 200, 300, 400}, corresponding to Dt1, xt, B(t), and at, respectively. RBF scales z1 = z2 = z3 = z4 = 0.4621 104. Regarding to an exploration issue, similar to Van Roy et al. (1997), our study uses an addition of noise for an explorative mechanism. In addition, the zero initialization of RBF weights provides a greater range of exploration to Sarsa, due to the nature of non-negativity of Q-values in a minimization problem.
4.2.2. Rollout A base policy used in Rollout is an (s, S) policy, whose parameters are determined by using the Economic Order Quantity, EOQ, and safety stock calculations. The EOQ and safety stock calculations are easily computed formulas. Since a base policy may be re-calculated several times during the simulation, a simple computation is advantageous. An (s, S) inventory control policy is such that replenishment will be ordered whenever an inventory level, a combination of both onsite and in-transit inventories, falls below a reordering point, denoted s, and the amount of replenishment is determined to bring an inventory level back to an order-upto-level, denoted S. EOQ is a widely used deterministic formula for inventory control, as dis-
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi cussed by Lambert, Stock, and Ellram (1998). EOQ ¼ 2KD=h, where K is a setup cost, D is a demand rate, and h is a unit holding cost. A safety stock is an inventory kept as a buffer against stock pffiffiffi out. A safety stock, ss, is commonly calculated as ss ¼ 3 r L, where r is the standard deviation of demand and L is the leadtime. (See Lambert et al. (1998) for a discussion on EOQ and safety stock.) Since the problem investigated here is not steady-state, a previous demand (the demand at time period t 1) is used instead of a demand rate and parameters, s and S, are re-calculated at every period. In our study, parameters of Rollout’s (s, S) base policy are determined as shown in Eq. (18):
s ¼ roundfðL D mod EOQ Þ þ ssg S ¼ s þ roundfEOQ g
ð18Þ
4.2.3. Hindsight optimization HO does not require a base policy or any extra decision, but a set of algorithmic parameters (N iteration(s) and simulation horizon T). 4.3. Experiments Numerical experiments are conducted to evaluate the performance of Sarsa (with all state variables), Sarsa without the GARCH(1,1) state variables, Rollout, and HO. Our study investigated 15 different inventory problem structures (as shown in Table 1). The
726
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
without the latent state variables introduced by GARCH(1,1), Rollout, and HO.
Table 1 A summary of demand and cost variables investigated. Problem
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Demand variables
Cost variables
Correlation (a1)
Noise variance offset (m0)
Set up cost (K: $/ transaction)
Penalty cost (b: $/ item)
Holding cost (h: $/ item)
0.8 0.4 0 0.4 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
70 70 70 70 70 40 10 70 70 70 70 70 70 70 70
80 80 80 80 80 80 80 150 100 50 0 0 0 0 0
180 180 180 180 180 180 180 100 100 100 100 200 50 50 50
1 1 1 1 1 1 1 1 1 1 1 1 1 10 50
demand is modeled with AR1/GARCH(1,1) as shown in Eqs. (11)– (13) with AR1’s a0 = 2 and GARCH(1,1)’s m1 = 0.1 and m2 = 0.8. The AR1’s a1 and GARCH(1,1)’s m0 are set up as shown in Table 1. All problems are undiscounted one-period leadtime problems (a discount factor a = 1 and a leadtime L = 1). Each problem has four cost variables. The unit replenishment cost g is set to $100/unit for all problems. Values of a setup cost (K), a penalty cost (b) and a holding cost (h) of each problem are shown in Table 1. Each experiment is initialized at D0 = 50, 0 = 10, r20 ¼ 400, x1 = 10, and B(1) = 0. 4.3.1. Experimental design parameters The experiments run each inventory controller for 25 replications of each of 60 time-unit-indeterminate periods. The exception is that each inventory controller investigated on Problem 1 is run for 50 replications.1 The study of problems 1–5 provides performance comparison across different values of a1 of 0.8, 0.4, 0, 0.4, and 0.8, respectively. The study of problems 1, 6, and 7 provides performance comparison across different values of m0 of 70, 40, and 10, respectively. The study of problems 8–11 provides performance comparison across different values of K of 150, 100, 50, and 0, respectively. The study of problems 12, 11, and 13 provides performance comparison across different values of b of 200, 100, and 50, respectively. The study of problems 13–15 provides performance comparison across different values of h of 1, 10, and 50, respectively. 4.3.2. Evaluation An aggregate cost is used as a main performance indicator. Overall performance is over 60 periods. Because of the highly stochastic nature of this problem, our study uses statistical significance tests as the main tools for drawing conclusions. A one-sided t-test is used when the normality assumption for experimental results holds. For each pair of measurements, the mean of a control method l1 and the mean of a treatment method l2, is tested twice. The first test is a t-test against an alternative hypothesis l1 > l2. In the second test a t-test against an alternative hypothesis l1 < l2 is used. The normality assumption is tested at a 5% significance level with both Lilliefors and v2-good-of-fitness tests. When it is suspected that the normality assumption does not hold, a Wilcoxon rank sum test is used instead of the t-test. The treatments investigated here are the Look-Ahead, Sarsa with all state variables, Sarsa 1 The reason to have more replications on this problem is to provide better data for comments (in Section 5.3), cumulative distribution function (CDF) plots (in Fig. A.11), and maximum single-period cost plots (in Fig. 8).
5. Experimental results The experimental results are shown in Sections 5.1 and 5.2 and comments on each ADP method are provided in Section 5.3. The detailed experimental data are provided in the Appendix. 5.1. Different demand settings Fig. 4 shows the percentage of relative cost differences over different problem scenarios. It has been organized into two rows and three columns. The first row shows the percentage of relative cost differences over different demand correlations (a1 = 0.8 to 0.8). The second row shows the percentage of relative cost differences over different demand noise variance offsets (m0 = 10, 40, and 70). The column corresponds to comparison on different perspectives. The left column shows percentage of relative total cost difference of each ADP method to the 12-period Look-Ahead method, denoted H12. Using total cost provides better performance comparison between simulation-based ADP methods and the Look-Ahead method. The middle column shows percentage of relative aggregate cost difference of each ADP method to the Look-Ahead method. Using the percentage of relative cost differences of aggregate costs of Periods 13–60 allows learning-based ADP methods to have initial learning periods and thus provides better performance evaluation of learning-based ADP methods. The left column shows percentage of relative aggregate cost difference of Sarsa without GARCH variables to Sarsa with GARCH variable. It is to provide evaluation of the significance of GARCH variable when using a learning-based ADP method. The percentage of relative cost difference of method A to method B is (costA/costB 1) 100%, where costA and costB are costs obtained from applying method A and B, respectively. The negative percentage means that method A results in lower inventory cost than method B. The positive percentage means the relation is reverse. The percentage close to zero means that both methods result in similar inventory costs. Sarsa with GARCH variables is denoted as Sa and presented with a green2 solid line and circle markers. Sarsa without GARCH variables is denoted as S and presented with a blue solid line and asterisk markers. Rollout is denoted as R and presented with a light blue solid line. Hindsight Optimization is denoted as HO and presented with a pink solid line with cross markers. The dotted red line provides a visual reference for the zero percentage level. The plots of the percentage of relative cost difference of ADP methods to the Look-Ahead method from negative to positive demand correlation shown in the upper left and upper middle plots of Fig. 4 show that all ADP methods have negative percentages of relative cost difference to the Look-Ahead method. This means that ADP methods provided lower costs than the Look-Ahead method. Also, all ADP methods give stronger negative percentages of relative cost difference for negative demand correlation. This means that ADP methods performed better under negative demand correlation, when compared to the Look-Ahead method. The upper right plot shows the percentage of relative cost difference of Sarsa without the GARCH(1,1) variables to Sarsa (with all state variables) lying close to zero level. This means that the inventory costs 2 For interpretation of color in Figs. 2–10, the reader is referred to the web version of this article.
727
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
(holding cost per penalty cost or h/b) are shown in the top, middle and bottom rows, respectively. The top left plot of Fig. 5 shows that Rollout performed better than HO, especially when the set up cost ratio is either 0.5 or 1, as indicated by larger gaps between the light blue line (representing Rollout) and the pink line with cross markers (representing HO) at K/g = 0.5 and 1. The top middle plot shows that Sarsa and Sarsa without the GARCH(1,1) variables result in about 5% lower cost than the Look-Ahead method, throughout different set up cost ratios. The top right plot shows that the percentage of relative cost differences of Sarsa without the GARCH(1,1) variables compared to Sarsa is close to zero, which indicates that the costs obtained from using Sarsa without the GARCH(1,1) variables are similar to the costs obtained from using Sarsa, throughout different set up cost ratios. The middle left plot shows fluctuation of the percentages of relative cost difference of Rollout and HO. This may be explained by that the costs obtained from Rollout, HO, and the Look-Ahead method are not significantly different. (See summaries of the significance tests shown in Tables A.12,A.13,A.14 for details.) Therefore, the fluctuation seen in the middle left plot may only reflect the variation in the experiment. The middle center plot shows the downtrend, from small to large penalty cost ratio, of the percentage of relative cost difference of the learning-based ADP methods (Sarsa and Sarsa without the GARCH(1,1) variables) to the
obtained from using Sarsa without the GARCH(1,1) variables are similar to costs obtained from using Sarsa. The lower left plot of Fig. 4 shows that Rollout (shown in light blue line) has lower percentages of relative cost difference than HO (shown in a pink line with cross markers), especially under m0 = 40 and 10 (the gaps between the light blue line and the pink line with cross markers are wider than at m0 = 70). This indicates that Rollout performed better than HO, especially under low and medium demand variance scenarios. The lower middle plot of Fig. 4 shows a downtrend of the percentage of relative cost difference of learning-based ADP methods (Sarsa and Sarsa without the GARCH(1,1) variables) to the Look-Ahead method. This indicates that the learning-based ADP methods performed better, when compared to the Look-Ahead method, under a high demand variance scenario. The lower right plot shows a blue line with asterisk markers representing Sarsa without the GARCH(1,1) variables lying close to the zero level. This indicates that the performance of Sarsa and Sarsa without the GARCH(1,1) variables are similar. 5.2. Different cost structures Similar to Figs. 4, 5 shows relative cost differences of ADP methods over different cost structures. Relative cost differences over different set up cost ratio (set up cost per unit cost or K/g), penalty cost ratio (penalty cost per unit cost or b/g) and holding cost ratio
% rel. cost diff. to H12 cost; Period 13−60
0
0
0
−10
−10
−10
−20
−20
−20
−30
%
10
%
10
−30 −40
−40
−50
−50
−50
−60
−60
−60
−0.8
−0.4
0
0.4
−70
0.8
−0.8
−0.4
0
0.4
−70
0.8
a1
a1
% rel. cost diff. to H12 cost; Total cost
% rel. cost diff. to H12 cost; Period 13−60
0
0
0
−10
−10
−10
−20
−20
−20
%
10
−30 −40
−40
−50
−50
−50
−60
−60
−60
40
ν0
70
−70
10
40
ν0
0
0.4
0.8
−30
−40
10
−0.4
% rel. cost diff. of S to Sa; Period 13−60
10
−70
−0.8
a1
10
−30
Sa S S R R HO HO
−30
−40
−70
%
% rel. cost diff. of S to Sa; Period 13−60
10
%
%
% rel. cost diff. to H12 cost
70
−70
10
40
ν0
Fig. 4. Relative cost differences of ADP methods on different demand correlations (upper row) and variances (lower row).
70
728
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
15
% rel. cost diff. to H12 cost; Period 13−60
10
5
5
5
0
0
0
%
10
−5
−5
−5
−10
−10
−10
−15 0.5
1
1.5
0
K/g: $/$ % rel. cost diff. to H12 cost; Total cost
0.5
1
1.5
0
15
% rel. cost diff. to H12 cost; Period 13−60
15 10
5
5
5
0
0
%
10
−5
−5
−10
−10
−10
−15
−15
−15
1
2
0.5
350
350
1
2
0.5
% rel. cost diff. to H12 cost; Period 13−60
350
250
250
200
200
200
%
300
250
%
300
150 100
100
50
50
50
0
0
0
−50
−50
−50
0.2
h/b: $/$
1
0.02
0.2
2
1
% rel. cost diff. of S to Sa; Period 13−60
150
100
0.02
1
b/g: $/$
300
150
1.5
% rel. cost diff. of S to Sa; Period 13−60
b/g: $/$
% rel. cost diff. to H12 cost; Total cost
1
0
−5
b/g: $/$
0.5
K/g: $/$
10
0.5
Sa S R HO
K/g: $/$
%
%
15
% rel. cost diff. of S to Sa; Period 13−60
−15
−15 0
%
15
10
%
%
15
% rel. cost diff. to H12 cost; Total cost
0.02
h/b: $/$
0.2
1
h/b: $/$
Fig. 5. Relative cost differences of ADP methods on different set up costs (top row), penalty costs (middle row) and holding costs (bottom row).
Look-Ahead method. This indicates that the learning-based ADP methods performed better, when compared to the Look-Ahead method, under the high penalty cost scenarios. The middle right plot shows the relatively flat line close to zero level of the percentage of relative cost difference of Sarsa without the GARCH(1,1) variables to Sarsa. This indicates that the cost performance of Sarsa without the GARCH(1,1) variables and Sarsa are not significantly different. The bottom left and center plots show large positive percentages of relative cost difference of HO (total cost, shown in the left plot), Sarsa, and Sarsa without the GARCH(1,1) variables (aggregate costs of Periods 13–60, shown in the center plot) at large holding cost ratios. This indicates that HO and the learning-based ADP methods do not perform well under the large holding cost ratio scenarios. The Rollout method seems to be affected least, as it does not perform worse than the Look-Ahead method under the high holding cost ratio scenarios (the light blue line in bottom left plot lies relatively flat close to zero level). The bottom right plot indicates that the cost performance of Sarsa without the GARCH(1,1) variable is similar to Sarsa. Among previous studies, only Zhang (2007) has studied inventory problems with AR(1)/GARCH(1,1) demand. Zhang (2007) provided the solution for an inventory problem without set up cost. In
order to compare our ADP methods to Zhang’s solution, relative cost differences of our ADP methods to Zhang’s solution (denoted Z) are shown in Fig. 6. All plots in Fig. 6 show that Zhang’s solution yielded lower costs than ADP methods in most cases, as the light blue line (representing RO) and the pink line with cross markers (representing HO) lie above the zero level in most scenarios in the upper and lower left plots and the green line with circle markers (representing Sarsa) and the blue line with asterisk markers (representing Sarsa without the GARCH(1,1) variables) lie above the zero level in the upper and lower right plots. 5.3. Comments on ADP methods Problem 1 experiments (Table 1) have been conducted with more replications and each ADP method has been investigated with wider range of its algorithmic parameter value(s). The comments on ADP methods addressed here are based on observation of Problem 1 experiments. 5.3.1. Sarsa Sarsa was experimentally investigated with learning rates of 0.1, 0.2, . . ., 1.0. These learning rates did not show significantly
729
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
% rel. cost diff. to Z cost; Total cost
20
15
15
10
10
5
5
%
%
20
0
Sa S R HO
0
−5
−5
−10
−10
−15
−15
−20
% rel. cost diff. to Z cost; Period 13−60
−20 0.5
1
2
0.5
b/g: $/$ % rel. cost diff. to Z cost; Total cost
350
300
300
250
250
200
200
%
%
350
150
100
50
50
0
0
−50
−50 0.2
2
1
h/b: $/$
% rel. cost diff. to Z cost; Period 13−60
150
100
0.02
1
b/g: $/$
0.02
0.2
1
h/b: $/$
Fig. 6. Relative cost difference of different ADP methods to Zhang’s AR1/GARCH controller on different penalty (top row) and holding (bottom row) costs.
different performance. Our experiments for Sarsa without the GARCH(1,1) variables spent only around 1–10% of time spent for Sarsa with all state variables included. It should be noted that the timing information mentioned in our study represent only rough estimates since the experiments were not run in a controlled environment, but in a multi-server/multi-user computer system with variable computational loads. 5.3.2. Rollout Rollout was experimentally investigated with N = 1, 5, 10, and 50 iterations and simulation horizons T of 1, 3, 6, and 12. Fig. 7 shows the mean, maximum, third-quartile, second-quartile, firstquartile, and minimum of total costs obtained from Rollout with the different N and T values. The upper left plot of Fig. 7 shows a downtrend of the pink solid line, representing average total costs, when using Rollout with longer T, when N = 1. However, other plots (for N = 5, 10, and 50) do not show this downtrend. Section 6 provides further discussion and investigation on this behavior. While the time spent for Sarsa depends mainly on the size of RBF, the time spent for Rollout and HO depend on the number of iterations and simulation horizons. Rollout with N = 50, T = 12, which requires the most computation time compared to other settings investigated here, spent roughly the same amount of time as Sarsa with all state variables. Rollout with other settings spent much less time than Sarsa with all state variables. 5.3.3. HO HO was experimentally evaluated with the same parameters N and T as Rollout was. Our experiments with HO spent around 4–60 times the computation time spent for Rollout with equivalent settings of N and T. 5.3.4. Zhang’s solution As expected, when it is applicable, Zhang’s solution provided better performance than ADP methods. However, over different problem scenarios investigated, total costs obtained from Zhang’s solution and from Rollout are not significantly different.
6. Conclusions and discussions Regarding the performance3 of ADP methods under different scenarios, Figs. 4 and 5 indicate the following. As the demand correlation decreases, all ADP methods seem to perform much better than the Look-Ahead method. The improvement seems to settle around 55% for zero to negative correlation. As the demand variance decreases, Sarsa seems to perform worse compared to the Look-Ahead method. At low demand variance, Sarsa seems to have similar performance as the Look-Ahead method. This may be explained by the better performance of the Look-Ahead method, because the Look-Ahead future projection uses zero demand noise, which provides a good approximate cost for the problem with low variance. Across different set up cost ratios, the performance of Sarsa seems to be quite constant. As the penalty cost ratio increases, Sarsa seems to perform better compared to the Look-Ahead method. As the holding cost ratio increases, Sarsa and HO perform substantially worse than the Look-Ahead method. Rollout still performs relatively similar to the Look-Ahead method at high holding cost ratio. Since the cost of holding inventory and stock out are the same, high holding cost ratio (h/b = 1) may confuse Sarsa learning mechanism and lead to substantially bad performance. Both Sarsa (with all state variables) and Sarsa without GARCH(1,1) state variables perform well compared to the LookAhead method, when holding cost is low compared to penalty cost. Regarding the issue of the significance of the GARCH model when applying learning-based ADP, the performance of Sarsa with all
3 Due to its learning nature that requires initial periods, the conclusions on Sarsa performance or comparison of Sarsa to other methods are drawn based on aggregate costs of Periods 13–60. When not compared to Sarsa, the conclusions on Rollout and HO performance are drawn based on total costs.
730
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 5
5
x 10
N=1
10
9
9
8
8
7
7
total cost: $
total cost: $
x 10 10
6 5 4
max Q3 Q2 Q1 min
6 5 4
3
3
2
2
1
1
0
N=5
0 1
3
6
12
1
T: period(s) 5
5
x 10
N = 10
10
10
9
9
8
8
7
7
6 5 4
4 3 2
1
1 3
6
12
T: period(s)
N = 50
5
2
1
12
6
3
0
6
T: period(s)
total cost: $
total cost: $
x 10
3
0
1
3
6
12
T: period(s)
Fig. 7. Performance comparison of different Rollout parameters. The mean is shown as a solid line.
state variables and Sarsa without GARCH(1,1) state variables is not significantly different. This result assuages the concern about learning-based ADP for problems with latent state variables. The explanation may be that the GARCH(1,1) state variables do not have a direct effect on a period cost (Eq. (16)). They only give extra information about uncertainty in the variance in the demand forecast. The robustness of Sarsa with some information missing shown in our experiments confirms the observations of Csáji and Monostori (2008) that temporal difference learning, a learning technique used in Sarsa, can tolerate inaccurate information to some extent. However, the good performance of Zhang’s solution (in problems without set up costs) and Rollout (in all scenarios) indicate the significance of GARCH model in inventory management and supports Zhang’s claim. Sarsa’s ability to tolerate some degree of missing information is beneficial and can be exploited. Inclusion of the only more relevant state variables allows the use of a smaller RBF. Sarsa with a smaller RBF requires less computational effort. Bertsekas and Tsitsiklis (1996) also support this exploitation within a more general concept, using feature extraction. Feature extraction is an ADP state preparation technique such that the system raw state variables can be preprocessed to extract features that represent more important aspects of state. These features are then used as ADP state variables. To address the issue of function approximation, the evenly distributed structure and the midpoint strategy were used and were effective in setting up RBF. The evenly distributed structure and the midpoint strategy provide a systematic approach to design RBF for an ADP application to inventory problems. Pertinent to a comparison among different methods, Rollout performs significantly better than the Look-Ahead method, Sarsa, and HO in most cases. HO results in lower costs (aggregate costs
in Periods 13–60) than Sarsa in most cases, but only few cases (demand correlation a1 = 0 and 0.8) that significance tests can confirm the differences. Rollout has been investigated in Choi et al. (2006). Choi et al. (2006) used heuristic search over sets of pre-specified (s, S) policies as Rollout’s base policy. The heuristic search is computationally time consuming compared to the EOQ parameterized (s, S) policy used in our study. In addition to a base policy, another decision in the implementation of Rollout is a selection of its parameter values. Rollout has two parameters: the number of iterations N and the simulation horizon T. Rollout with N = 1 shows a monotonically downtrend in its relation between the simulation horizon length and the resultant average total cost. This monotonic downward trend is not apparent for N > 1. To delve into this behavior, average on-site inventory x, an average period cost and a maximum period cost were used as shown in Fig. 8 (based on Problem 1 experiment). The plots in Fig. 8 are arranged such that plots in the left column show average on-site inventory x versus T, plots in the middle column show average single cost versus T, and plots in the right column show the maximum single cost versus T. Each column has four plots for each value of N: N = 1 (top row), N = 5 (second row), N = 10 (third row), and N = 50 (last row), as indicated in each plot title. Plots of average on-site inventory show an uptrend in every N. An average cost when Rollout has lower N and the maximum cost when Rollout has higher N show a more noticeable downtrend. The explanation may lie in the role that the combination of algorithmic parameters, N and T, has in approximating state-action cost-to-go. With a few iterations (small N), Rollout may not have seen many rare demand surges, and a longer horizon (large T) helps to promote an averaging effect, which results in a better decision for the average case. With more iterations and a longer horizon (large
731
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
single cost: $
50
0
7000 6000 5000
N = 1; max(cost)
8 7 6
4000 1
3
6
12
1
3
6
T: period(s)
T: period(s)
N = 5; mean(x)
N = 5; mean(cost)
12
1
3
6
4
T: period(s)
x 10
N = 5; max(cost)
12
10
single cost: $
250 200 150 100 50
6000
single cost: $
x: item(s)
100
x: item(s)
4
x 10
N = 1; mean(cost)
single cost: $
N = 1; mean(x)
5500 5000 4500 4000
9 8 7
0 3
6
12
3
N = 10; mean(x)
N = 10; mean(cost)
single cost: $
200 150 100 50
12
1
1
3
6
4
x 10
5500 5000 4500 4000
12
3
T: period(s) N = 50; mean(x)
6
5
1
3
6
12
T: period(s) 4
x 10
N = 50; mean(cost)
N = 50; max(cost)
6
single cost: $
single cost: $
N = 10; max(cost)
5.5
12
5500
100
12
6
T: period(s)
200
6
4.5 1
300
3
T: period(s)
3500
0
x: item(s)
6
T: period(s)
250
x: item(s)
1
T: period(s)
single cost: $
1
5000 4500 4000
5.5 5 4.5
3500
0 1
3
6
12
1
T: period(s)
3
6
T: period(s)
12
1
3
6
12
T: period(s)
Fig. 8. Plot of average inventory and average and maximum single-period cost for each Rollout setting. The plot is arranged along number of period(s) of Rollout simulation.
N and T), Rollout may have a better chance to see more rare demand surges, which, even though rare, can substantially effect the total cost when this happens. This may drive Rollout to choose a more conservative action, such as stocking a higher inventory level, to safeguard against any chance of unusually high demand. Therefore it generates a higher average cost, but lower maximum cost. This finding on the relationship between Rollout parameters and conservative behavior is new and needs further study to further confirm its nature as well as to better formulate the new relationship. Understanding this relationship may contribute to a new design for ADP, which may explicitly incorporate a factor to control conservative behavior, or to create a systematic strategy to determine ADP parameters to balance out average and worst-case performance. Fig. 9 shows average inventory levels of each Rollout parameter under different demand scenarios. The upper row shows different demand correlations, a1 = 0.8, 0.4, 0, 0.4, and 0.8, as indicated in the legend of the upper left plot. The lower row shows different demand variances, m0 = 70, 40, and 10, as indicated in the legend of the lower left plot. The left, middle, and right columns show plots
when using Rollout with N = 1, 5, and 10, respectively, as indicated in each plot title. Fig. 10 shows average inventory levels of each Rollout parameter under different cost structures. The top row shows different set up cost ratios, K/g = 1.5, 1, 0.5, and 0, as indicated in the legend of the top left plot. The middle row shows different penalty cost ratios, b/g = 2, 1, and 0.5, as indicated in the legend of the middle left plot. The bottom row shows different holding cost ratios, h/b = 1, 0.2, and 0.02, as indicated in the legend of the top left plot. The plot title indicates a value of the Rollout parameter N. The same uptrend relationship between Rollout simulation horizon (T) and average inventory level is found in every scenario, as shown in Figs. 9 and 10. In addition, at N = 10 and T = 12, these two figures indicate the following. As the demand correlation increases, the average inventory level increases. As the demand variance increases, the average inventory level increases. This may be explained by that the inventory is raised to safeguard against higher risk of large demand.
732
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
N = 1; mean(x)
N = 10; mean(x)
N = 5; mean(x)
300
300
300
250
250
200
200
a = 0.8 1
a1 = 0.4
250
a = −0.4 1
x: item(s)
x: item(s)
200
a = −0.8 1
150
x: item(s)
a1 = 0
150
150
100
100
100
50
50
50
0
0 1
6
12
0 1
6
12
1
T: period(s)
T: period(s) N = 1; mean(x)
12
N = 10; mean(x)
N = 5; mean(x)
300
6
T: period(s)
300
300
250
250
200
200
ν = 70 0
ν = 40
250
0
x: item(s)
x: item(s)
200
150
x: item(s)
ν0 = 10
150
150
100
100
100
50
50
50
0
0
0
1
6
12
1
T: period(s)
6
T: period(s)
12
1
6
12
T: period(s)
Fig. 9. Plot of average inventory levels of each Rollout parameter under different demand correlations (upper row) and variances (lower row).
As the set up cost ratio increases, the average inventory level seems to increase. This may be explained by that as the inventory is raised in order to reduce the number of transactions. As the penalty cost ratio increases, the average inventory level increases. This is explained by that the inventory is raised to reduce the chance of inventory shortage. As the holding cost ratio increases, the average inventory level decreases. This is explained by that the inventory is lower in response to higher cost of holding items. In summary, Sarsa, Rollout, and HO can be used as inventory controllers for a problem with temporal demand heteroscedasticity, in most cases. Sarsa and HO are not recommended for an inventory problem with high holding to penalty cost ratio. In other cases, Sarsa can be employed effectively without a model of the problem. The absence of latent state variables, introduced by GARCH(1,1), does not apparently deteriorate Sarsa performance. Furthermore, omitting these latent state variables is beneficial in terms of shortening the computation time. When there is a model of the problem, Rollout is seen to provide better performance for all scenarios. Our study compares learning-based and simulation-based ADP methods to each other. As mentioned earlier, each method has its own strength. Learning-based ADP is more adaptive. Simulation-based ADP is more effective at reducing cost and does so more reliably. A combination of learning-based and simulation-based schemes may provide a good balance between adaptability and reliability. A forecasting technique can be used to estimate
uncertainty and variance, or a distribution, for simulation. Then, with known system dynamics, simulation can be used to approximate a state-action value. A forecasting technique provides adaptability, while simulation provides reliability as it reduces variation in control quality. A hybrid system may provide a new inventory management approach that is more adaptive and reliable, as well as new approaches for other decision/control applications. Kim et al. (2008) have started in this direction. They used known system dynamics to simulate consequences of actions not taken, after demand is observed. Then, not only is a value of a taken action updated, but values of all other actions are also updated. Therefore, their approach provides a fast-adapting inventory control system. However, the work of Kim et al. (2008) is limited to a stateless inventory problem. The extension to a state-based case will produce a wider range of applications. Appendix A. Detailed experimental results: statistics and summaries of cross reference tests Tables A.2–A.16 show statistics of the experimental results: mean, minimum, first-quartile, second-quartile, third-quartile and maximum, and a summary of their cross significance tests of aggregate costs obtained with different controllers. Look-Ahead with 12-period looking ahead is denoted H12. Sarsa (with all state variables) is denoted Sa, Sarsa without the GARCH(1,1) variables is denoted S, Rollout is denoted R, and HO is denoted HO. These methods are shown with their best performing sets of parameters.
733
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
N = 1; mean(x)
N = 5; mean(x)
x: item(s)
200
x: item(s)
K/g = 1.5 K/g = 1 K/g = 0.5 K/g = 0
250
150 100
300
250
250
200 150 100
50
50
0
0 1
6
12
100
250
200 150 100
0 12
6
12
1
150 100
300
250
250
200 150 100
50
50
0
0 12
12
N = 10; mean(x)
300
x: item(s)
h/b = 1 h/b = 0.2 h/b = 0.02
6
T: period(s)
N = 5; mean(x)
x: item(s)
x: item(s)
100
T: period(s)
N = 1; mean(x)
T: period(s)
150
0
T: period(s)
6
200
50
1
300
12
N = 10; mean(x)
250
50
6
T: period(s)
300
0
1
1
300
50
200
12
x: item(s)
x: item(s)
x: item(s)
150
250
6
N = 5; mean(x)
b/g = 2 b/g = 1 b/g = 0.5
6
100
T: period(s)
N = 1; mean(x)
1
150
0 1
300
200
200
50
T: period(s)
250
N = 10; mean(x)
300
x: item(s)
300
200 150 100 50 0
1
6
T: period(s)
12
1
6
12
T: period(s)
Fig. 10. Plot of average inventory levels of each Rollout parameters under different set up costs (top row), penalty costs (middle row) and holding costs (bottom row).
In the summary of cross significance tests, an entry of 0 denotes the null hypothesis, i.e., that the mean of a row method equals the mean of a column method cannot be rejected. An entry of 1 denotes that the null hypothesis is rejected in favor of that the mean of a row method is less than the mean of a column method and an entry of 1 means the inequality is reversed. Each table is presented in three parts, each showing statistics and summaries of the first 12 periods, the later 48 periods and the total 60 periods, respectively. The mean of aggregate costs provides a simple indicator to evaluate the performance of each method. Minimum, first-quartile, second-quartile, third-quartile, and maximum provide information of how data is distributed and skewed, for interested readers. Summary of cross significance tests provides a decisive answer for whether the two methods under investigation really gave different results and helps screening out variation in experimental results that might be confusing or misleading. A.1. Problem 1 Table A.2 shows statistics and a summary of their cross significance tests of aggregate costs obtained with the Look-Ahead methods, Sarsa, Sarsa without GARCH(1,1) state variables, Rollout, and HO. Look-Ahead methods with 2- and 6-period looking ahead, denoted H2 and H6 respectively, also are shown here. In this
experiment, Sarsa is used with a learning rate of 0.5. Sarsa without GARCH(1,1) state variables is used with a learning rate of 0.8. Rollout is used with N = 50 and T = 3. HO is used with N = 1 and T = 3. In Table A.2 under the group header Periods 1–12, both Sarsa and Sarsa without the GARCH(1,1) variables have greater values of the mean of the aggregate costs, $80,044.56 and $80,287.12 compared to the means of the aggregate costs, obtained by other methods, ranging from $60,928.78 (Rollout) to $72,219 (2-period Look-Ahead method). The significance tests also confirm that both Sarsa and Sarsa without the GARCH(1,1) variables delivered greater aggregate costs than other method, as indicated by ‘1’s in Columns H2, H6, H12, R, and HO on Rows S and Sa. This is explained by the nature of Sarsa, which requires initial periods. In Periods 13–60, the average aggregate costs obtained from both Sarsa and Sarsa without the GARCH(1,1) variables are lower than the average aggregate costs obtained from Look-Ahead methods. The significance tests also confirm that Sarsa without the GARCH(1,1) variables performed better than 2-, 6-, and 12-period Look-Ahead methods and Sarsa performed better the 2- and 6-period LookAhead methods. This indicates that the learning mechanism has worked effectively. Comparing Sarsa to Sarsa without the GARCH(1,1) variables in Periods 13–60, Sarsa without the GARCH(1,1) variables delivered a lower aggregate cost than Sarsa. However, the difference could
734
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
Table A.2 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Q3
Max
H2
H6
H12
67,363 51,241 65,899 80,291 80,319 59,507 58,435
103,577 79,134 87,173 93,698 91,949 69,859 75,218
194,211 166,939 140,042 108,835 112,443 116,162 121,955
0 0 0 1 1 0 0
0 0 1 1 1 0 0
0 1 0 1 1 0 0
1 1 1 0 0 1 1
196,366 171,990 127,108 134,615 133,724 119,575 124,174
298,174 233,320 201,410 173,578 163,567 150,579 146,923
391,553 303,644 248,606 216,267 218,416 180,392 220,191
615,827 426,484 445,277 325,417 294,014 272,252 468,517
0 1 1 1 1 1 1
1 0 1 1 1 1 1
1 1 0 0 1 1 0
263,364 218,624 199,173 207,057 211,360 182,738 187,900
371,688 290,199 255,406 254,922 252,068 212,578 219,928
483,533 372,331 333,089 293,759 296,034 246,560 278,461
810,038 593,423 581,682 428,320 400,647 343,796 590,472
0 1 1 1 1 1 1
1 0 0 1 1 1 1
1 0 0 0 0 1 1
Periods 1–12 H2 72,219.00 H6 61,453.86 H12 69,539.66 Sa 80,044.56 S 80,287.12 R 60,928.78 HO 62,163.54
8271 10,634 29,579 52,140 45,741 28,265 29,891
34,008 32,882 44,822 67,265 68,770 43,798 48,085
Periods 13–60 H2 310,081.52 H6 239,951.12 H12 200,561.66 Sa 176,637.76 S 173,607.68 R 152,749.60 HO 173,871.12
95,160 70,057 61,752 73,513 70,988 75,762 77,664
Periods 1–60 H2 382,300.52 H6 301,404.98 H12 270,101.32 Sa 256,682.32 S 253,894.80 R 213,678.38 HO 236,034.66
125,846 102,271 130,103 139,495 145,074 133,999 128,299
Sa
S
R
HO
1 1 1 0 0 1 1
0 0 0 1 1 0 0
0 0 0 1 1 0 0
1 1 0 0 0 1 0
1 1 1 0 0 1 0
1 1 1 1 1 0 0
1 1 0 0 0 0 0
1 1 0 0 0 1 0
1 1 0 0 0 1 0
1 1 1 1 1 0 0
1 1 1 0 0 0 0
H2, H6, and H12 represent Look-Ahead methods with 2-, 6-, and 12-period looking ahead, respectively. Sa, S, R, and HO represent Sarsa (with all state variables), Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. Summary of cross significance tests: Entry of ‘1’: the average cost obtained from a row method is less than from a column method. Entry of ‘0’: the average costs obtained from row and column methods are not significantly different. Entry of ‘1’: the average cost obtained from a row method is greater than from a column method.
Table A.3 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.4, m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Q3
Max
H12
Sa
S
R
HO
Periods 1–12 H12 41,013 Sa 60,531 S 61,027 R 39,015 HO 62,524
7654 47,376 51,528 22,063 48,880
26,393 53,062 55,377 30,846 57,656
38,826 60,777 59,092 36,211 60,646
52,845 66,636 64,878 45,643 67,372
95,718 74,849 78,904 72,278 80,735
0 1 1 0 1
1 0 0 1 0
1 0 0 1 0
0 1 1 0 1
1 0 0 1 0
Periods 13–60 H12 139,188 Sa 89,826 S 90,535 R 81,405 HO 88,934
81,966 53,132 59,058 53,380 59,145
101,357 73,046 77,942 68,233 75,485
120,496 91,543 86,773 78,084 92,668
153,118 100,966 102,503 93,651 97,205
287,409 146,708 145,165 126,996 124,425
0 1 1 1 1
1 0 0 0 0
1 0 0 1 0
1 0 1 0 0
1 0 0 0 0
Periods 1–60 H12 180,201 Sa 150,357 S 151,562 R 120,419 HO 151,458
106,338 109,299 115,278 75,618 124,991
136,766 134,911 136,052 104,208 135,371
167,363 146,717 148,120 116,375 150,595
198,148 159,920 167,848 135,908 160,103
324,893 206,087 208,872 172,216 192,930
0 0 0 1 0
0 0 0 1 0
0 0 0 1 0
1 1 1 0 1
0 0 0 1 0
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
not be confirmed by the significance test, as indicated by ‘0’s in entries corresponding to comparison of Sa and S (Column S on Row Sa and Column Sa on Row S). Comparing learning-based methods to simulation-based methods, the average aggregate costs of Sarsa, Sarsa without the GARCH(1,1) variables, and HO are not signifi-
cantly different, as indicated by summaries of significance tests showing ‘0’s in entries corresponding to comparison of these 3 methods. As indicated by its lower aggregate cost and ‘1’s in Columns Sa and S on Row R, Rollout performed significantly better than Sarsa and Sarsa without the GARCH(1,1) variables. Since
735
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.4 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0, m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Q3
Max
H12
Sa
R
HO
1 0 0 1 1
0 1 1 0 0
0 1 1 0 0
1 0 0 1 1
1 0 0 1 1
1 1 1 0 1
1 1 1 1 0
1 0 0 1 1
1 0 0 1 1
1 1 1 0 0
1 1 1 0 0
Periods 1–12 H12 33,287 Sa 53,876 S 55,098 R 26,573 HO 26,281
4242 45,993 46,069 10,841 17,563
20,788 51,891 51,975 20,153 20,428
27,375 53,867 55,256 24,315 23,903
42,244 54,740 58,405 29,888 29,539
85,060 67,436 65,222 59,913 43,456
0 1 1 0 0
1 0 0 1 1
Periods 13–60 H12 166,327 Sa 72,136 S 71,892 R 59,600 HO 64,351
55,534 50,683 51,522 41,494 44,179
97,012 58,440 62,408 48,617 57,555
167,219 73,199 72,289 52,772 63,963
208,136 79,461 77,970 65,354 70,005
375,810 113,413 114,282 98,770 86,108
0 1 1 1 1
Periods 1–60 H12 199,614 Sa 126,012 S 126,989 R 86,173 HO 90,633
86,813 102,488 106,982 59,598 69,277
134,285 115,243 114,713 72,429 82,925
187,547 123,159 122,234 82,914 89,805
244,791 136,069 135,785 99,762 95,035
395,552 166,205 168,940 139,213 120,222
0 1 1 1 1
S
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Table A.5 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.4,m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Q3
Max
H12
Sa
R
HO
1 0 0 1 1
0 1 1 0 1
1 1 1 1 0
1 0 0 1 0
1 0 0 1 0
1 1 1 0 1
1 0 0 1 0
1 0 0 1 1
1 0 0 1 1
1 1 1 0 0
1 1 1 0 0
Periods 1–12 H12 29,330 Sa 52,066 S 51,621 R 25,107 HO 20,869
5833 44,981 46,229 17,567 15,314
17,785 50,416 49,529 22,663 18,321
26,501 51,845 51,488 25,004 20,730
36,490 54,492 52,662 27,206 22,853
68,638 58,959 60,114 33,668 27,632
0 1 1 0 1
1 0 0 1 1
Periods 13–60 H12 149,284 Sa 63,718 S 64,000 R 55,079 HO 62,481
72,296 48,401 49,787 41,637 45,108
112,546 52,325 53,125 50,558 50,645
141,864 61,453 61,839 54,672 59,863
178,102 71,023 70,332 60,286 70,612
310,972 98,727 97,745 69,792 90,026
0 1 1 1 1
Periods 1–60 H12 178,614 Sa 115,784 S 115,622 R 80,186 HO 83,350
97,856 99,234 99,582 60,661 64,772
144,552 105,721 105,814 75,684 74,442
158,326 111,092 111,952 81,556 82,429
211,002 126,944 124,631 86,279 90,900
345,803 150,760 149,331 96,746 110,018
0 1 1 1 1
S
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Rollout and HO do not require initial periods, the total cost (aggregate costs of Periods 1–60) is used to compare Rollout to HO. Although Rollout has lower average total costs ($213,678.38) than HO ($236,034.66), the significance tests could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO. Fig. A.11 shows a performance comparison of the 12-period Look-Ahead method (H12), Sarsa (Sa), Sarsa without the GARCH(1,1) state variables (S), Rollout (R), and HO (HO). The top left plot shows aggregate costs of each method. On this plot, the horizontal axis indicates aggregation periods. This plot shows how each method performed over time. Sarsa and Sarsa
without the GARCH(1,1) variables performed worse than 12-period Look-Ahead method in the first aggregation period (Periods 1–12), but they performed better than 12-period Look-Ahead in the following aggregation periods. It should be noted that Rollout yielded the best performance throughout every aggregation period. The top right plot shows the mean as a solid line, maximum ‘+’, third quartile ‘.’, median ‘*’, first-quartile ‘.’, and minimum ‘+’, of each method’s total costs. It should be noted that the differences between minimum and maximum of total costs obtained from 12-period Look-Ahead method and HO are relatively large. The difference between the first and third quartiles of total costs obtained from 12-period Look-Ahead method is also much larger than the
736
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
Table A.6 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8,m0 = 70, K = 80, b = 180, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Q3
Max
H12
Sa
R
HO
1 0 0 1 1
0 1 1 0 0
0 1 1 0 0
1 0 0 1 1
1 0 0 1 1
1 1 1 0 0
1 1 1 0 0
1 0 0 1 1
1 0 0 1 1
1 1 1 0 0
1 1 1 0 0
Periods 1–12 H12 22,373 Sa 51,049 S 49,884 R 21,770 HO 20,246
5125 43,949 44,638 16,735 12,677
10,109 49,101 47,905 19,955 17,079
18,523 50,855 48,829 21,523 19,465
29,440 53,518 52,449 23,506 23,220
56,678 55,014 57,435 28,665 27,746
0 1 1 0 0
1 0 0 1 1
Periods 13–60 H12 135,892 Sa 60,002 S 61,171 R 50,424 HO 52,622
55,770 44,724 44,872 35,112 39,238
104,405 50,846 51,474 43,343 45,282
125,575 56,611 59,167 50,268 52,639
152,049 66,691 68,126 58,114 57,199
292,277 90,252 94,648 66,625 75,872
0 1 1 1 1
Periods 1–60 H12 158,265 Sa 111,051 S 111,055 R 72,194 HO 72,869
65,920 95,654 93,640 58,444 57,927
123,427 102,673 102,524 62,189 67,201
145,603 107,420 108,557 72,920 72,911
175,337 120,631 117,301 78,874 76,772
302,520 144,214 142,761 90,534 97,346
0 1 1 1 1
S
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Table A.7 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 40, K = 80, b = 180, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q3
Max
H12
Sa
S
R
HO
53,598 74,391 75,041 55,644 88,948
73,242 96,438 88,772 82,876 101,645
109,935 112,713 103,773 127,457 135,746
0 1 1 0 1
1 0 0 1 1
1 0 0 1 1
0 1 1 0 1
1 1 1 1 0
94,359 86,584 91,039 98,797 96,212
127,016 138,396 131,994 119,031 118,629
171,768 159,249 165,179 136,544 159,486
298,842 234,386 233,521 315,946 221,765
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
146,545 178,915 177,032 147,565 194,771
176,336 203,327 202,153 184,378 220,703
229,839 247,875 242,412 210,684 238,961
363,694 334,968 305,862 416,047 297,220
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 1
0 0 0 1 0
Periods 1–12 H12 56,437 Sa 78,975 S 77,020 R 63,783 HO 89,858
28,227 52,977 59,681 30,872 64,502
39,717 65,641 65,107 35,166 75,710
Periods 13–60 H12 135,228 Sa 133,436 S 133,819 R 125,561 HO 127,455
47,999 59,505 60,938 55,958 66,339
Periods 1–60 H12 191,665 Sa 212,411 S 210,839 R 189,344 HO 217,314
87,971 127,540 122,311 111,849 130,841
Q2
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
differences obtained from other methods. This smaller difference between the first and third quartiles of total costs obtained from all ADP methods may imply that an ADP method also helps improve system stability. The lower left and right plots show empirical cumulative distribution functions (CDFs)4 of period costs. The lower left plot shows the CDF for the complete range of period costs. The lower right plot shows the CDF for a selected range of period costs such that it allows a better look at how each method’s CDF approaches one. These plots
4 These CDFs are obtained from MATLAB’s ecdf function, Revision 1.3.6.7, which uses a Kaplan-Meier estimate, as discussed in Kaplan and Meier (1958).
show that the CDF of period costs obtained from the 12-period LookAhead method is lower than those obtained from ADP methods when period cost is larger than around $500. This means that the chance that an ADP method would deliver a large period cost is less than the 12-period Look-Ahead method. In other words, the ADP method has better worst-case performance than the 12-period Look-Ahead method. A.1.1. Experimental results of different demand correlations Similar to Tables A.2,A.3,A.4,A.5 and A.6 show statistics and a summary of cross significance tests of different methods on inventory problem with demand correlation a1 = 0.4, 0, 0.4 and
737
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.8 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 10, K = 80, b = 180, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Max
H12
Sa
S
R
HO
70,282 87,933 81,944 56,854 73,728
113,306 98,165 103,139 89,244 99,725
0 1 1 1 0
1 0 0 1 1
1 0 0 1 1
1 1 1 0 1
0 1 1 1 0
91,771 94,724 97,770 64,499 80,303
108,777 111,604 112,254 82,810 96,523
177,481 144,595 143,335 116,854 136,175
0 0 0 1 0
0 0 0 1 0
0 0 0 1 0
1 1 1 0 1
0 0 0 1 0
137,680 171,205 157,572 123,279 151,495
177,985 190,641 190,078 141,345 171,216
263,056 228,150 239,835 164,526 198,968
0 1 1 1 0
1 0 0 1 1
1 0 0 1 1
1 1 1 0 1
0 1 1 1 0
Periods 1–12 H12 59,498 Sa 79,012 S 75,665 R 48,748 HO 65,848
23,625 51,585 48,067 30,193 39,300
47,832 69,395 67,793 37,653 53,032
59,176 82,260 76,454 45,371 67,468
Periods 13–60 H12 93,352 Sa 95,161 S 94,566 R 72,693 HO 87,810
47,187 53,790 52,865 47,761 60,788
70,202 71,253 75,958 55,499 74,715
Periods 1–60 H12 152,850 Sa 174,173 S 170,231 R 121,441 HO 153,657
99,907 125,244 116,460 84,058 110,716
124,509 149,891 151,015 102,563 134,764
Q3
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Table A.9 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 150, b = 100, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Max
H12
Sa
S
R
HO
68,811 85,600 90,410 85,915 87,954
94,897 102,937 105,915 146,039 124,199
0 1 1 1 1
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
168,709 166,136 166,746 148,366 150,946
216,927 204,436 199,270 183,620 190,354
335,187 298,433 296,133 250,906 260,903
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
223,176 237,051 235,288 221,329 219,942
274,713 280,229 283,689 254,433 271,190
393,114 359,545 374,504 383,787 370,272
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
Periods 1–12 H12 57,542 Sa 76,910 S 78,544 R 75,477 HO 76,769
32,460 60,443 52,576 51,522 52,992
44,745 63,781 66,785 59,623 64,846
55,470 79,170 82,580 69,318 68,837
Periods 13–60 H12 173,177 Sa 165,061 S 164,570 R 150,921 HO 158,544
83,315 73,216 73,242 71,724 67,756
124,044 112,441 113,408 114,338 128,169
Periods 1–60 H12 230,719 Sa 241,971 S 243,113 R 226,398 HO 235,313
125,115 136,646 139,225 129,960 133,762
176,218 198,674 194,275 190,356 198,288
Q3
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
0.8, respectively. A summary of the different methods applied to the inventory problem with demand correlation a1 = 0.8 is shown in Table A.2. As indicated by ‘1’s in Column H12 on Rows S and Sa in Periods 13–60 in Table A.3, Sarsa and Sarsa without the GARCH(1,1) variables performed significantly better than the 12-period LookAhead method. Rollout also performed significantly better than the 12-period Look-Ahead method, as indicated by ‘1’ in Column H12 on Row R in Periods 1–60. Although HO has lower total cost than the 12-period Look-Ahead method, the significance test could not confirm the difference, as indicated by ‘0’ in Column H12 on Row HO in Periods 1–60. When Sarsa is compared to Sarsa without
the GARCH(1,1) variable, the ‘0’s in entries corresponding to comparison of Sa and S in Periods 13–60 indicate that they performed indifferently. The entry of ‘1’ in Column HO on Row R in Periods 1–60 indicates that Rollout performed significantly better than HO. Table A.4 shows that Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO performed significantly better than the 12period Look-Ahead method, as as indicated by ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. Sarsa and Sarsa without the GARCH(1,1) variables performed indifferently, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. Rollout outperformed Sarsa and Sarsa
738
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
Table A.10 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 100, b = 100, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q3
Max
H12
Sa
S
R
HO
50,114 73,239 77,420 54,819 86,164
68,706 88,185 94,100 64,632 101,488
110,903 107,206 109,846 90,121 120,575
0 1 1 0 1
1 0 0 1 1
1 0 0 1 0
0 1 1 0 1
1 1 0 1 0
122,735 116,926 114,100 100,762 113,913
164,454 149,029 167,048 129,255 143,856
196,317 202,628 195,993 161,582 179,257
334,458 277,317 291,487 274,435 219,288
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
166,128 189,757 187,603 141,611 196,657
216,028 238,195 231,012 197,529 235,248
262,221 276,813 280,508 219,613 274,059
382,917 370,298 377,042 329,060 292,228
0 0 0 0 0
0 0 0 1 0
0 0 0 1 0
0 1 1 0 1
0 0 0 1 0
Periods 1–12 H12 54,723 Sa 77,318 S 78,706 R 55,731 HO 86,610
22,652 47,764 48,733 32,356 48,743
41,055 66,566 62,665 44,537 74,363
Periods 13–60 H12 169,188 Sa 160,324 S 161,833 R 140,570 HO 146,784
64,845 71,013 62,990 59,595 72,348
Periods 1–60 H12 223,910 Sa 237,642 S 240,539 R 196,302 HO 233,394
110,180 145,602 127,016 116,610 154,566
Q2
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Table A.11 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, and HO controllers; a1 = 0.8, m0 = 70, K = 50, b = 100, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Max
H12
Sa
S
R
HO
68,352 94,154 89,176 60,730 88,338
84,130 114,332 112,383 110,928 111,769
0 1 1 0 1
1 0 0 1 0
1 0 0 1 0
0 1 1 0 1
1 0 0 1 0
154,976 163,043 164,153 139,879 151,157
216,924 191,624 197,658 179,630 178,400
339,026 288,173 286,082 218,004 271,947
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
207,020 233,910 231,013 195,788 226,531
269,197 278,875 278,687 232,230 251,913
389,201 380,876 364,789 278,502 359,137
0 0 0 0 0
0 0 0 1 0
0 0 0 1 0
0 1 1 0 1
0 0 0 1 0
Periods 1–12 H12 55,363 Sa 78,081 S 79,325 R 54,401 HO 78,640
30,484 52,164 49,648 30,969 53,179
43,225 64,769 72,326 43,848 65,071
52,089 74,325 79,767 51,961 77,096
Periods 13–60 H12 167,459 Sa 160,550 S 160,496 R 145,233 HO 157,968
69,601 67,548 69,630 62,335 102,333
111,325 111,636 108,476 117,242 128,922
Periods 1–60 H12 222,822 Sa 238,630 S 239,822 R 199,633 HO 236,608
112,727 131,692 149,397 105,734 188,647
161,796 193,457 192,938 171,316 210,443
Q3
H12, Sa, S, R, and HO represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO, respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
without the GARCH(1,1) variables, as indicated by ‘1’s in Columns Sa and S on Row R in Periods 13–60. Although Rollout has lower average total cost than HO, the significance test could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO in Periods 1–60. Table A.5 shows that Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO performed significantly better than the 12period Look-Ahead method, as indicated by ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. Sarsa and Sarsa without the GARCH(1,1) variables performed indifferently, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. Rollout outperformed Sarsa and Sarsa without
the GARCH(1,1) variables, as indicated by ‘1’s in Columns Sa and S on Row R in Periods 13–60. Although Rollout has lower average total cost than HO, the significance test could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO in Periods 1–60. Table A.6 shows the entries of ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. This means that Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and HO outperformed the 12-period Look-Ahead method. Sarsa and Sarsa without the GARCH(1,1) variables performed indifferently, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. The entries of ‘1’s in Columns Sa and S on Row R in
739
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.12 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 100, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Max
H12
Sa
S
R
HO
Z
74,422 87,043 94,681 78,318 93,834 65,504
94,606 102,512 111,380 97,890 113,496 91,124
0 1 1 0 1 0
1 0 0 1 0 1
1 0 0 1 0 1
0 1 1 0 1 0
1 0 0 1 0 1
0 1 1 0 1 0
161,661 158,858 156,131 142,983 147,223 150,676
207,095 195,125 185,111 164,300 166,832 187,004
360,465 285,208 287,375 238,114 314,752 300,671
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
217,615 225,156 226,146 207,381 227,918 194,194
273,032 275,326 274,385 244,980 239,341 243,579
410,306 346,725 364,336 323,891 403,329 337,442
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 0 1
0 0 0 0 0 0
0 0 0 0 0 0
0 1 1 0 0 0
Periods 1–12 H12 61,071 Sa 76,177 S 78,050 R 62,213 HO 78,661 Z 54,041
17,216 58,426 52,469 31,165 57,380 32,604
43,641 63,766 63,547 43,510 63,106 36,767
63,492 77,626 75,373 62,155 74,276 52,080
Periods 13–60 H12 166,638 Sa 158,500 S 156,897 R 147,223 HO 151,540 Z 150,510
75,164 71,070 65,936 86,820 72,425 59,044
112,282 100,641 110,682 114,064 128,378 95,219
Periods 1–60 H12 227,708 Sa 234,677 S 234,947 R 209,436 HO 230,201 Z 204,551
125,328 130,939 128,879 137,906 143,960 92,879
178,420 189,992 189,145 168,107 200,962 156,586
Q3
H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Table A.13 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 200, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Max
H12
Sa
S
R
HO
Z
75,997 98,013 88,573 88,566 87,302 72,736
118,090 116,455 119,299 126,010 102,442 94,835
0 1 1 1 1 0
1 0 0 0 0 1
1 0 0 1 0 1
1 0 1 0 0 1
1 0 0 0 0 1
0 1 1 1 1 0
161,134 156,951 150,032 149,493 141,835 152,239
219,709 191,474 199,377 186,721 170,456 179,703
374,962 291,025 278,681 354,871 229,182 314,713
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
208,872 233,234 225,421 223,967 233,124 207,487
275,681 278,112 280,322 257,563 246,984 245,793
434,490 390,412 362,782 397,287 318,048 371,821
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
Periods 1–12 H12 58,083 Sa 81,129 S 81,467 R 72,222 HO 78,343 Z 57,727
26,638 53,353 60,766 42,416 58,205 20,841
41,443 66,295 67,418 53,945 67,045 37,629
50,390 77,151 75,955 71,143 80,481 57,108
Periods 13–60 H12 176,271 Sa 158,721 S 158,679 R 157,325 HO 147,128 Z 152,270
68,743 64,344 71,031 71,657 71,393 48,087
111,268 110,723 114,948 127,077 114,325 101,452
Periods 1–60 H12 234,354 Sa 239,850 S 240,146 R 229,547 HO 225,471 Z 209,996
110,957 134,743 148,425 145,059 147,200 85,735
175,309 196,664 198,889 194,764 192,018 155,079
Q3
H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Periods 13–60 indicate that Rollout outperformed both Sarsa and Sarsa without the GARCH(1,1) variables. Again, Rollout has lower average total cost than HO, but the significance test could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R and HO in Periods 1–60.
A.1.2. Experimental results of different demand variances Tables A.7 and A.8 show statistics and summaries of cross significance tests of different methods on inventory problems with medium and low noise variances (m0 = 40 and 10), respectively. Detailed experimental results of the different methods applied to
740
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
Table A.14 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 50, and h = 1. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Max
H12
Sa
S
R
HO
Z
67,481 93,884 84,502 74,384 57,978 60,946
107,000 114,095 100,965 137,081 111,193 82,754
0 1 1 1 0 0
1 0 0 1 1 1
1 0 0 1 1 1
1 1 1 0 1 1
0 1 1 1 0 0
0 1 1 1 0 0
152,239 156,433 149,342 133,884 144,771 150,701
191,173 186,973 196,213 170,886 170,908 186,465
350,288 287,381 285,845 249,570 242,769 298,200
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
190,538 229,456 221,830 199,214 196,715 191,328
259,315 276,419 278,895 241,631 227,137 237,402
360,100 377,454 359,840 347,656 309,537 332,784
0 0 0 0 0 0
0 0 0 0 1 1
0 0 0 0 1 1
0 0 0 0 0 0
0 1 1 0 0 0
0 1 1 0 0 0
Periods 1–12 H12 50,926 Sa 77,253 S 74,465 R 66,631 HO 50,317 Z 49,777
9812 46,878 50,796 34,214 24,678 29,952
37,973 58,601 64,092 49,951 37,962 34,924
45,952 75,494 78,101 58,795 44,896 49,809
Periods 13–60 H12 157,838 Sa 158,046 S 158,833 R 149,849 HO 146,971 Z 149,914
56,136 59,862 71,067 80,684 90,417 58,160
103,188 114,196 112,807 120,725 115,021 94,398
Periods 1–60 H12 208,764 Sa 235,299 S 233,298 R 216,480 HO 197,288 Z 199,692
102,842 121,797 135,242 131,073 121,763 90,555
157,909 180,151 187,084 176,765 155,093 151,652
Q3
H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
Table A.15 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b =50, and h = 10. Statistics Mean
Summaries of cross tests Min
Q1
Q2
Q3
Max
H12
Sa
S
R
HO
Z
Periods 1–12 H12 47,427 Sa 106,898 S 105,540 R 57,969 HO 60,629 Z 47,673
22,630 85,740 87,910 30,060 28,450 19,300
29,548 92,220 95,835 40,198 42,625 33,030
47,220 106,420 104,020 49,750 55,770 41,490
57,603 118,603 115,003 66,150 67,788 56,293
90,020 137,580 131,420 137,160 142,910 111,830
0 1 1 0 1 0
1 0 0 1 1 1
1 0 0 1 1 1
0 1 1 0 0 1
1 1 1 0 0 1
0 1 1 1 1 0
Periods 13–60 H12 187,556 Sa 304,352 S 305,220 R 177,700 HO 198,515 Z 180,778
80,700 231,060 226,760 134,000 108,020 65,490
124,663 270,570 270,228 147,430 158,418 135,415
184,550 306,450 302,650 165,880 193,240 177,720
227,673 330,340 326,313 191,268 230,215 210,470
360,920 396,650 411,970 346,890 307,730 371,140
0 1 1 0 0 0
1 0 0 1 1 1
1 0 0 1 1 1
0 1 1 0 0 0
0 1 1 0 0 0
0 1 1 0 0 0
Periods 1–60 H12 234,984 Sa 411,250 S 410,760 R 235,670 HO 259,144 Z 228,451
109,900 323,670 322,430 179,880 163,410 105,680
175,088 370,280 372,190 204,665 212,110 177,793
232,630 401,740 408,090 228,910 254,170 220,970
287,890 447,308 439,183 250,530 300,960 272,663
393,010 525,580 508,510 389,450 380,680 399,470
0 1 1 0 0 0
1 0 0 1 1 1
1 0 0 1 1 1
0 1 1 0 0 0
0 1 1 0 0 1
0 1 1 0 1 0
H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
the inventory problem with high noise variance m0 = 70 is shown in Table A.2. The summary of significance tests shown in Table A.7 reveals that under medium variance (m0 = 40) the average costs obtained from the ADP methods investigated in our study and the Look-
Ahead method are not significantly different, as indicated by ‘0’ in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R and HO in Periods 1–60. The difference between a learning-based ADP method and a simulation-based ADP method is not significant, as indicated by all ‘0’s in entries comparing methods in Periods
741
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743 Table A.16 Summary of cross significance tests of Look-Ahead, Sarsa, Rollout, HO, and Zhang’s AR1/GARCH controllers; a1 = 0.8, m0 = 70, K = 0, b = 50, and h = 50. Statistics
Summaries of cross tests
Mean
Min
Q1
Q2
Q3
Max
H12
Sa
S
R
HO
Z
114,300 272,150 263,950 86,350 170,600 110,400
0 1 1 0 1 0
1 0 0 1 1 1
1 0 0 1 1 1
0 1 1 0 1 0
1 1 1 1 0 1
0 1 1 0 1 0
289,713 967,913 963,650 264,100 884,813 269,675
450,150 977,700 978,200 358,350 942,750 411,600
0 1 1 0 1 0
1 0 0 1 1 1
1 0 0 1 1 1
0 1 1 0 1 0
1 1 1 1 0 1
0 1 1 0 1 0
344,725 1,218,275 1,209,188 317,650 980,388 338,200
489,250 1,238,750 1,238,100 405,550 1,107,700 472,200
0 1 1 0 1 0
1 0 0 1 1 1
1 0 0 1 1 1
0 1 1 0 1 0
1 1 1 1 0 1
0 1 1 0 1 0
Periods 1–12 H12 60,098 Sa 244,640 S 240,052 R 59,572 HO 114,246 Z 56,084
32,450 224,800 202,150 33,200 83,950 17,400
44,300 235,763 234,438 45,225 91,650 33,638
50,650 245,100 240,500 55,750 115,000 53,350
67,113 253,988 252,138 79,800 129,488 82,588
Periods 13–60 H12 236,810 Sa 953,938 S 948,726 R 227,592 HO 777,532 Z 232,630
113,150 905,000 902,350 145,650 472,350 123,950
182,588 947,650 943,538 187,138 717,513 179,288
227,700 960,200 954,300 211,000 795,200 229,700
Periods 1–60 H12 296,908 Sa 1,198,578 S 1,188,778 R 287,164 HO 891,778 Z 288,714
162,700 1,139,650 1,104,500 193,700 564,150 164,550
245,900 1,183,963 1,177,975 245,513 815,313 224,825
288,650 1,201,150 1,189,100 288,000 903,950 285,600
H12, Sa, S, R, HO, and Z represent the Look-Ahead method with 12-period looking ahead, Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and the previous solution of Zhang (2007), respectively. Statistics provide mean, minimum, first-quartile, second-quartile, third-quartile, and maximum of aggregate costs obtained with method indicated on the row header in periods as indicated in the group header. In the summaries of cross significance tests, entries of ‘1’, ‘0’, and ‘1’ mean that the average cost obtained from a row method is less than, is not significantly different from, and is greater than a column method, respectively.
x 10
4
5
average aggregated cost H12 Sa S R HO
8 7.5 7 6.5
5.5 5 4.5
cost: $
cost: $
statistics of total costs
x 10 6
8.5
6 5.5 5
4 3.5 3 2.5
4.5
2
4
1.5
3.5 1−12
1 13−24
25−36
37−48
49−60
H12
Sa
S
R
HO
Period empirical cdf of period costs
empirical cdf of period costs
1
1
0.9 0.9
0.8 0.7
0.8
cdf
cdf
0.6 0.5
0.7
0.4 0.3 H12 Sa S R HO
0.2 0.1 00
1
2
3
4
cost: $
5
6
7
8 4 x 10
0.6
H12 Sa S R HO
0.5 0
0.5
1
1.5
2
2.5
3
cost: $
Fig. A.11. Performance comparison of Look-Ahead, Sarsa, Sarsa without GARCH(1,1) variables, Rollout, and HO.
3.5
4 4 x 10
742
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
13–60. The average inventory cost obtained from Sarsa is not significantly different from Sarsa without the GARCH(1,1) variables, as indicated by ‘0’s in entries corresponding to comparison of Sa and S. Rollout yielded significantly better performance than HO, as indicated by ‘1’ in Column HO on Row R, in Periods 1–60. Table A.8 shows that Rollout is the only method that outperformed the Look-Ahead method (in Periods 1–60), as indicated by ‘1’ in Column H12 on Row R. The other ADP methods have higher average costs than the Look-Ahead method, but the significance tests could not confirm the difference, as indicated by ‘0’s in Column H12 on Rows Sa and S in Periods 13–60 and on Row HO in Periods 1–60. It should be noted that Sarsa and Sarsa without the GARCH(1,1) variables have lower maximum cost than the LookAhead method. HO has lower third-quartile and maximum cost than the Look-Ahead method. The lower maximum cost may indicate the reliability improvement achieved by Sarsa, Sarsa without the GARCH(1,1) variables, and HO. The average inventory cost obtained by using Sarsa is not significantly different from Sarsa without the GARCH(1,1) variables. Rollout outperformed HO, as indicated by ‘1’ in Column HO on Row R, in Periods 1–60. A.1.3. Experimental results of different set up costs Tables A.9,A.10,A.11 and A.12 show statistics and summaries of cross significance tests of different methods on inventory problems with set up cost K = 150, 100, 50, and 0, respectively. In Tables A.12,A.13,A.14,A.15 and A.16, the previous solution of Zhang (2007) is included. Zhang’s AR(1)/GARCH(1,1) solution, denoted Z, is available for an inventory problem without set up cost. Table A.9 shows that although Sarsa and Sarsa without the GARCH(1,1) in Periods 13–60 and Rollout in Periods 1–60 provided lower average costs than the Look-Ahead method, the significance tests could not confirm the differences. The average inventory costs obtained from Sarsa without the GARCH(1,1) variables are not significantly different from Sarsa, as indicated by ‘0’s in entries corresponding to comparison of S and Sa, in Periods 13–60. Rollout provides lower average cost than Sarsa in Periods 13–60 and HO in Periods 1–60, but the significance tests could not confirm the difference. Table A.10 shows that Sarsa and Sarsa without the GARCH(1,1) variables have lower average costs than the Look-Ahead method in Periods 13–60, but the significance tests could not confirm the difference. The average inventory costs obtained from Sarsa without the GARCH(1,1) and Sarsa are not significantly different, as indicated by ‘0’s in entries corresponding to comparison of Sa and S in Periods 13–60. When comparing learning-based ADP methods to simulation-based ADP methods, although Rollout and HO have lower average costs than Sarsa and Sarsa without the GARCH(1,1) variables in Periods 13–60, the significance test could not confirm the difference. However, when comparing Rollout and HO in Periods 1–60, Rollout performed significantly better than HO, as indicated by ‘1’ in Column HO on Row R. Table A.11 shows that, as indicated by ‘0’ in all entries corresponding to comparison of the Look-Ahead method and ADP methods, the significance test could not confirm the difference between average costs obtained from the ADP methods and the Look-Ahead method. The costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different. Although Sarsa and Sarsa without the GARCH(1,1) variables have higher average costs than Rollout and HO, the significance test could not confirm the differences. Rollout significantly outperformed HO, as indicated by ‘1’ in Column HO on Row R, in Periods 1–60. Table A.12 shows that, as indicated by ‘0’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R, HO, and Z, although average inventory costs obtained from the Look-Ahead method are higher than Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, and Zhang’s solution, the significance tests could not confirm the
differences. Comparing learning-based ADP methods to Rollout, HO, and Zhang’s solution, the ‘0’s on Rows Sa and S in Periods 13–60 indicate that average costs obtained from Sarsa or Sarsa without the GARCH(1,1) variables are not significantly different from Rollout, HO, or Zhang’s solution. The difference between average inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables is not significant either. Comparing Rollout, HO, and Zhang’s solution, as indicated by average inventory costs and significance tests in Periods 1–60, Zhang’s solution gives the lowest average inventory costs, but the significance tests could not confirm the differences. A.1.4. Experimental results of different penalty costs Tables A.13 and A.14 show statistics and summaries of cross significance tests of different methods on inventory problems with penalty costs $200 (b/g = 2) and $50 (b/g = 0.5), respectively. A summary of the different methods applied to the inventory problem with penalty cost $100 (b/g = 1) is shown in Table A.12. Table A.13 shows that although Sarsa and Sarsa without the GARCH(1,1) variables have lower average inventory costs than the Look-Ahead method in Periods 13–60 and Rollout, HO, and Zhang’s solution have lower average inventory costs than the Look-Ahead method in Periods 1–60, the significance tests could not confirm the differences. The average inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’ in Column S on Row Sa, in Periods 13–60. Comparing Rollout, HO, and Zhang’s solution, Zhang’s solution has the lowest average inventory cost, but the significance tests could not confirm the difference, as indicated by ‘0’s in entries corresponding to comparison of R, HO, and Z in Periods 1–60. Table A.14 shows that the average inventory costs obtained from Sarsa, Sarsa without the GARCH(1,1) variables, Rollout, HO, and Zhang’s solution are not significantly different from the Look-Ahead method, as indicated by ‘0’s in Column H12 on Rows Sa and S in Periods 13–60 and on Rows R, HO, and Z in Periods 1–60. The average costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’ in Column S on Row Sa, in Periods 3–60. Comparing Rollout, HO, and Zhang’s solution, HO gives the lowest average inventory cost, but the significance tests could not confirm the differences, as indicated by ‘0’s in entries corresponding to comparison of R, HO, and Z in Periods 1–60. A.1.5. Experimental results of different holding costs Tables A.15 and A.16 show statistics and summaries of cross significance tests of different methods on inventory problem with holding costs $10 (h/b = 0.2) and $50 (h/b = 1), respectively. A summary of the different methods applied to the inventory problem with holding cost $1 (h/b = 0.02) is shown in Table A.14. Table A.15 shows that the Look-Ahead method performed significantly better than Sarsa and Sarsa without the GARCH(1,1) variables, as indicated by ‘1’s in Column H12 on Rows Sa and S, in Periods 13–60. Although the Look-Ahead method has lower average inventory cost than Rollout and HO, the significance tests could not confirm the difference, as indicated by ‘0’s in Column H12 on Rows R and HO. The costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’s in Column S on Row Sa in Periods 13–60. When compared to Rollout, HO, and Zhang’s solution, inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are significantly higher than Rollout, HO, Zhang’s solution, as indicated by ‘1’s in Columns R, HO, and Z on Rows Sa and S, in Periods 13–60. The significance test confirms that Zhang’s solution outperformed HO, as indicated by ‘1’ in Column HO on Row Z, in Periods 1–60. The significance test could not confirm the difference
T. Katanyukul et al. / Computers & Industrial Engineering 60 (2011) 719–743
between average inventory costs obtained from Rollout and Zhang’s solution, as indicated by ‘0’ in Column R on Row Z, in Periods 1–60. Table A.16 shows that the Look-Ahead method outperformed Sarsa, Sarsa without the GARCH(1,1) variables, and HO, as indicated by ‘1’s in Column H12 on Rows Sa and S in Periods 13–60 and on Row HO in Periods 1–60. The average inventory costs obtained from Sarsa and Sarsa without the GARCH(1,1) variables are not significantly different, as indicated by ‘0’ in Column S on Row Sa, in Periods 13–60. Rollout, HO, and Zhang’s solution outperformed Sarsa and Sarsa without the GARCH(1,1) variables, as indicated by ‘1’ in Columns Sa and S on Rows R, HO, and Z, in Periods 13–60. Rollout and Zhang’s solution also show superior performance to HO, as indicated by ‘1’ in Column HO on Rows R and Z, in Periods 1–60. Rollout has slightly lower cost than Zhang’s solution, but the significance test could not confirm the difference. References Barreto, A. M. S., & Anderson, C. W. (2008). Restricted gradient-descent algorithm for value-function approximation in reinforcement learning. Artificial Intelligence, 172(4–5), 454–482. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA, USA: Athena Scientific. Bertsimas, D., & Thiele, A. (2006). A robust optimization approach to inventory theory. Operations Research, 54(1), 150–168. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer. Chaharsooghi, S. K., Heydari, J., & Zegordi, S. H. (2008). A reinforcement learning model for supply chain ordering management: An application to the beer game. Decision Support System, 45(4), 949–959. Choi, J., Realff, M. J., & Lee, J. H. (2006). Approximate dynamic programming: Application to process supply chain management. AIChe Journal, 52(7), 2473–2485. Chong, E. K. P., Givan, R. L., & Chang, H. S. (2000). A framework for simulation-based network control via hindsight optimization. In Proceedings of the 39th IEEE conference on decision and control. Csáji, B. C., & Monostori, L. (2008). Value function based reinforcement learning in changing Markovian environments. Journal of Machine Learning Research, 9, 1679–1709. Das, T. K., Gosavi, A., Mahadevan, S., & Marchalleck, N. (1999). Solving semi-Markov decision problems using average reward reinforcement learning. Management Science, 45(4), 560–574. Giannoccaro, I., & Pontrandolfo, P. (2002). Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78(2), 153–161. Godfrey, G. A., & Powell, W. B. (2001). An adaptive, distribution-free algorithm for the newsvendor problem with censored demands, with applications to inventory and distribution. Management Science, 47(8), 1101–1112. Haykin, S. (1994). Neural networks: A comprehensive foundation. Macmillan College Publishing Company, Inc.
743
Iida, T., & Zipkin, P. H. (2006). Approximate solutions of a dynamic forecastinventory model. Manufacturing and Service Operations Management, 8(4), 407–425. Jiang, C., & Sheng, Z. (2009). Case-based reinforcement learning for dynamic inventory control in a multi-agent supply chain system. Expert Systems with Applications, 36(3), 6520–6526. Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285. Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481. Kim, C. O., Jun, J., Baek, J. K., Smith, R. L., & Kim, Y. D. (2005). Adaptive inventory control models for supply chain management. International Journal of Advanced Manufacturing Technology, 26(9–10), 1184–1192. Kim, C. O., Kwon, I. H., & Baek, J. G. (2008). Asynchronous action-reward learning for nonstationary serial supply chain inventory control. Applied Intelligence, 28(1), 1–16. Kleinau, P., & Thonemann, W. (2004). Deriving inventory-control policies with genetic programming. OR Spectrum, 26(4), 521–546. Kwon, I. H., Kim, C. O., Jun, J., & Lee, J. H. (2008). case-based myopic reinforcement learning for satisfying target service level in supply chain. Expert Systems with Applications, 35(1–2), 389–397. Lambert, D., Stock, J. R., & Ellram, L. M. (1998). Fundamentals of logistics. McGrawHill/Irwin. Lee, H. L., & Billington, C. (1992). Managing supply chain inventory: Pitfalls and opportunities. Sloan Management Review, 65–73. Pontrandolfo, P., Gosavi, A., & Okobaa, O. G. (2002). Global supply chain management: A reinforcement learning approach. International Journal of Production Research, 40(6), 1299–1317. Powell, W. B. (2007). Approximate dynamic programming: Solving the curses of dimensionality. Wiley. Shervais, S., Shannon, T. T., & Lendaris, G. G. (2003). Intelligent supply chain management using adaptive critic learning. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 33(2), 235–244. Si, J., Barto, A., Powell, W., & Wunsch, D. II (Eds.). (2004). Handbook of learning and approximate dynamic programming. IEEE Press. Silver, E. A. (1981). Operations research in inventory management: A review and critique. Operations Research, 29(4), 628–645. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. MIT Press. Topaloglu, H., & Kunnumkal, S. (2006). Approximate dynamic programming methods for an inventory allocation problem under uncertainty. Naval Research Logistics, 53(8), 822–841. Van Roy, B., Bertsekas, D. P., Lee, Y., & Tsitsiklis, J. N. (1997). A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the IEEE conference on decision and control. Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, Cambridge University. Werbos, P. (1992). Handbook of intelligent control. In D. White & D. Sofge (Eds.), chap. neurocontrol and supervised learning: An overview and evaluation. New York: Van Nostrand Rheinhold. Werbos, P. (2004). Handbook of learning and approximate dynamic programming. In J. Si, A. Barto, W. Powell, & D. Wunsch, II (Eds.), ADP: Goals, opportunities and principles. IEEE Press. Zhang, X. (2007). Inventory control under temporal demand heteroscedasticity. European Journal of Operational Research, 182(1), 127–144.