!python --version
Python 3.7.15
Tabular Monte-Carlo Control
Kobus Esterhuysen
July 20, 2022
In previous projects we have only dealt with the prediction problem. In this project we move on to the control problem. We stay in the model-free or learning domain (rather than planning), i.e. making use of Reinforcement Learning algorithms. We will explore the depth of update dimension, considering both bootstrapping and non-bootstrapping methods like Temporal-Difference (TD) and Monte-Carlo (MC) algorithms respectively.
What is different about Control, rather than Prediction, is the dependence on the concept of Generalized Policy Iteration (GPI).
The key concept of GPI is that we can:
- Evaluate a policy with any Policy Evaluation method
- Improve a policy with any Policy Improvement method
Here is a reasonable way to approach MC Control based on what we have done so far:
- do MC Policy Evaluation (which is essentially MC Prediction)
- do greedy Policy Improvement
- do MC Policy Evaluation on the improved policy
- etc., etc.
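The alternation described above can be sketched generically. This is only a minimal illustration, not the code used in this project: `generalized_policy_iteration`, `evaluate` and `improve` are hypothetical names, and the toy problem is a one-step task in which the action values do not depend on the policy.

```python
from typing import Callable, Dict

Policy = Dict[str, str]                  # deterministic policy: state -> action
QTable = Dict[str, Dict[str, float]]     # state -> action -> value


def generalized_policy_iteration(
    policy: Policy,
    evaluate: Callable[[Policy], QTable],
    improve: Callable[[QTable], Policy],
    max_rounds: int = 100,
) -> Policy:
    """Alternate evaluation and improvement until the policy stops changing."""
    for _ in range(max_rounds):
        q = evaluate(policy)          # any Policy Evaluation method
        new_policy = improve(q)       # any Policy Improvement method
        if new_policy == policy:
            break
        policy = new_policy
    return policy


# Toy one-step problem (illustrative numbers): each episode ends after one
# action, so Q(s, a) is just the immediate reward and ignores the policy.
rewards: QTable = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 2.0, "right": 0.5},
}


def toy_evaluate(policy: Policy) -> QTable:
    return rewards


def greedy_improve(q: QTable) -> Policy:
    return {s: max(q_s, key=q_s.get) for s, q_s in q.items()}


final_policy = generalized_policy_iteration(
    {"s0": "left", "s1": "right"}, toy_evaluate, greedy_improve
)
print(final_policy)
```

In MC Control the `evaluate` step would be replaced by sample-based estimation from episodes, which is where the complications discussed next come from.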
However, this naive approach leads to some complications.
The first complication is that Greedy Policy Improvement requires:
- the state transition probability function \(\mathcal P\)
- the reward function \(\mathcal R\)

These are not available in the model-free domain. (They were, however, available in the model-based domain when we performed dynamic programming.)
We address this complication by noting that
\[ \pi'_D(s) = \underset{a \in \mathcal A}{\text{arg max}} \{\mathcal R(s, a) + \gamma \sum_{s' \in \mathcal N}\mathcal P(s, a, s') \cdot V^\pi(s')\} \text{ for all } s \in \mathcal N \]
can be expressed more concisely as:
\[ \pi'_D(s) = \underset{a \in \mathcal A}{\text{arg max}} Q^\pi(s, a) \text{ for all } s \in \mathcal N \]
This is valuable because the arg max over \(Q^\pi\) needs neither \(\mathcal P\) nor \(\mathcal R\). So:
- instead of doing Policy Evaluation to calculate \(V^\pi\)
- we do Policy Evaluation to calculate \(Q^\pi\) (i.e. MC Prediction for the Action Value Function, \(Q\))
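To make the equivalence of the two arg max expressions concrete, here is a small sketch on a made-up model. All numbers, state names and action names are assumptions chosen for illustration. It builds \(Q^\pi\) from \(\mathcal R\), \(\mathcal P\) and \(V^\pi\), and then shows that the greedy action read from the \(Q\) table alone matches the one computed from the model.

```python
from typing import Dict, List, Tuple

gamma = 0.9
actions: List[str] = ["a0", "a1"]

# Hypothetical reward function R(s, a) and transition function P(s, a, s').
R: Dict[Tuple[str, str], float] = {
    ("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
    ("s1", "a0"): 0.0, ("s1", "a1"): 2.0,
}
P: Dict[Tuple[str, str], Dict[str, float]] = {
    ("s0", "a0"): {"s0": 1.0}, ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0}, ("s1", "a1"): {"s1": 1.0},
}
# Assumed state values of the current policy (illustrative only).
V: Dict[str, float] = {"s0": 5.0, "s1": 10.0}


def q_from_model(s: str, a: str) -> float:
    """R(s, a) + gamma * sum over s' of P(s, a, s') * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())


# With a model, Q can be tabulated once ...
Q: Dict[Tuple[str, str], float] = {sa: q_from_model(*sa) for sa in R}


def greedy_from_model(s: str) -> str:
    return max(actions, key=lambda a: q_from_model(s, a))


def greedy_from_q(s: str) -> str:
    # ... and from then on the arg max needs only the Q table.
    return max(actions, key=lambda a: Q[(s, a)])


for s in ("s0", "s1"):
    print(s, greedy_from_model(s), greedy_from_q(s))
```

In the model-free setting the table `Q` is not computed from `R` and `P` as it is here; it is estimated from sampled returns, but the final arg max step is the same.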
The second complication is that early, random returns can bias the updates. An action that happens to produce poor returns at first may rarely be chosen again, so its action-value estimate stays inaccurate. We want to exploit the actions that currently look best, but at the same time be sure to explore all possible actions. This problem is known as the exploration-exploitation dilemma.
The way to address this complication is to modify our Tabular MC Control algorithm:
- instead of doing greedy Policy Improvement
- we want to do \(\epsilon\)-greedy Policy Improvement
This improved stochastic policy is defined as:
\[ \pi'(s, a) = \left\{ \begin{array}{ll} \frac{\epsilon}{|\mathcal A|}+1-\epsilon & \mbox{if } a= \mbox{arg max}_{b \in \mathcal A} Q(s, b) \\ \frac{\epsilon}{|\mathcal A|} & \mbox{otherwise} \end{array} \right. \]
This means:
- the exploit probability is \(1-\epsilon\): select the action that maximizes \(Q\) for a given state
- the explore probability is \(\epsilon\): select an allowable action uniformly at random
The deterministic greedy policy \(\pi'_D\) is a special case of the \(\epsilon\)-greedy policy \(\pi'\). This happens when \(\epsilon=0\).
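The definition above can be checked numerically with a short sketch. The helper names `epsilon_greedy_probs` and `sample_action` and the example action values are assumptions made for illustration; ties are broken in favour of the first maximizing action.

```python
import random
from typing import Dict


def epsilon_greedy_probs(q_values: Dict[str, float],
                         epsilon: float) -> Dict[str, float]:
    """Action probabilities of the epsilon-greedy policy for one state."""
    n = len(q_values)
    best = max(q_values, key=q_values.get)       # arg max over actions
    probs = {a: epsilon / n for a in q_values}   # explore share for every action
    probs[best] += 1.0 - epsilon                 # exploit share for the best one
    return probs


def sample_action(q_values: Dict[str, float], epsilon: float,
                  rng: random.Random) -> str:
    """Draw one action according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    names = list(probs)
    return rng.choices(names, weights=[probs[a] for a in names], k=1)[0]


q_for_state = {"a0": -3.0, "a1": -1.0, "a2": -2.0}   # illustrative values
print(epsilon_greedy_probs(q_for_state, 0.3))
print(epsilon_greedy_probs(q_for_state, 0.0))        # purely greedy
print(sample_action(q_for_state, 0.3, random.Random(0)))
```

With \(\epsilon = 0.3\) and three actions, each action receives \(0.1\) from exploration and the maximizing action receives an additional \(0.7\); with \(\epsilon = 0\) all of the probability sits on the maximizing action.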
We need to consider two more aspects:
Here is a summary of the GLIE (Greedy in the Limit with Infinite Exploration) Tabular Monte-Carlo Control algorithm.
For each episode (terminating trace experience):
- Generate the trace experience (episode) with actions sampled from the \(\epsilon\)-greedy policy \(\pi\) coming from the latest estimate of the Action Value Function \(Q\)
- Sample the initial state of the trace experience uniformly from the non-terminal states \(\mathcal N\)
- We now have a trace experience: \(S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \ldots, R_T, S_T\)
- The return \(G_t\) associated with \((S_t,A_t)\) is: \(G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots+\gamma^{T-t-1}R_T\)
- End of trace updates:
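\[ \text{Count}(S_t,A_t) \leftarrow \text{Count}(S_t,A_t) + 1 \]
\[ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{\text{Count}(S_t,A_t)} \cdot [G_t - Q(S_t,A_t)] \]
\[ \epsilon \leftarrow \frac{1}{k} \]
Here \(k\) is the number of episodes completed so far, so \(\epsilon\) decays towards zero as learning proceeds.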
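The end-of-trace updates can be sketched in plain Python with dictionaries. This is an illustrative stand-in rather than the library code used further down: `returns_from_episode` and `glie_update` are hypothetical names, an episode is assumed to be a list of \((S_t, A_t, R_{t+1})\) tuples, and every occurrence of a state-action pair in the episode is updated.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Step = Tuple[str, str, float]            # (S_t, A_t, R_{t+1})
Key = Tuple[str, str]                    # (state, action)


def returns_from_episode(episode: List[Step], gamma: float) -> List[float]:
    """G_t for every t, accumulated backwards: G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    out: List[float] = []
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        out.append(g)
    out.reverse()
    return out


def glie_update(q: DefaultDict[Key, float], counts: DefaultDict[Key, int],
                episode: List[Step], gamma: float, k: int) -> float:
    """Apply the Count and Q updates for one episode; return epsilon = 1/k."""
    for (s, a, _), g in zip(episode, returns_from_episode(episode, gamma)):
        counts[(s, a)] += 1
        q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
    return 1.0 / k


q: DefaultDict[Key, float] = defaultdict(float)
counts: DefaultDict[Key, int] = defaultdict(int)
episode = [("s0", "a", 1.0), ("s1", "b", 2.0), ("s2", "a", 3.0)]  # made-up data
epsilon = glie_update(q, counts, episode, gamma=0.5, k=4)
print(dict(q), epsilon)
```

Because each pair is seen once in this episode and starts at zero with a count of one, every \(Q\) entry simply becomes the corresponding return \(G_t\).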
The adaptation for Function Approximation is simple. Instead of the above update, we have:
\[ \Delta w = \alpha \cdot [G_t - Q(S_t,A_t;w)] \cdot \nabla_wQ(S_t,A_t;w) \]
where \(\alpha\) is the learning rate and \(G_t\) is the trace experience return from state \(S_t\) after taking action \(A_t\) at time \(t\) on a trace experience.
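As a worked instance of this update, assume a linear approximation \(Q(s,a;w) = w \cdot \phi(s,a)\), for which the gradient with respect to \(w\) is just the feature vector \(\phi(s,a)\). The linear form, the feature values and the names `linear_q` and `mc_gradient_step` are assumptions made for this sketch.

```python
from typing import List


def linear_q(w: List[float], features: List[float]) -> float:
    """Q(s, a; w) = w . phi(s, a) for a linear approximation."""
    return sum(wi * xi for wi, xi in zip(w, features))


def mc_gradient_step(w: List[float], features: List[float],
                     g: float, alpha: float) -> List[float]:
    """Return w + alpha * [G_t - Q(S_t, A_t; w)] * grad_w Q(S_t, A_t; w)."""
    error = g - linear_q(w, features)
    # For a linear Q the gradient with respect to w is the feature vector.
    return [wi + alpha * error * xi for wi, xi in zip(w, features)]


w = [0.0, 0.0]
phi = [1.0, 2.0]      # made-up features of one (state, action) pair
w_new = mc_gradient_step(w, phi, g=5.0, alpha=0.1)
print(w_new, linear_q(w_new, phi))
```

The step moves the prediction for this pair from \(0\) towards the observed return of \(5\). With one-hot features per state-action pair and \(\alpha = 1/\text{Count}(S_t,A_t)\), the same step reduces to the tabular update given earlier.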
The following code can handle both the tabular (TAB) and function approximation (FAP) cases.
def epsilon_greedy_policy(
    q: QValueFunctionApprox[S, A],
    mdp: MarkovDecisionProcess[S, A],
    ε: float = 0.0
) -> Policy[S, A]:
    # Exploration: all actions allowed in state s, chosen uniformly.
    def explore(s: S, mdp=mdp) -> Iterable[A]:
        return mdp.actions(NonTerminal(s))
    # Mix the uniform policy (weight ε) with the greedy policy (weight 1 - ε).
    return RandomPolicy(Categorical(
        {UniformPolicy(explore): ε,
         greedy_policy_from_qvf(q, mdp.actions): 1 - ε}
    ))
def glie_mc_control(
    mdp: MarkovDecisionProcess[S, A],
    states: NTStateDistribution[S],
    approx_0: QValueFunctionApprox[S, A],
    γ: float,
    ϵ_as_func_of_episodes: Callable[[int], float],
    episode_length_tolerance: float = 1e-6
) -> Iterator[QValueFunctionApprox[S, A]]:
    q: QValueFunctionApprox[S, A] = approx_0
    # Start fully exploratory (ϵ = 1.0) and yield the initial estimate.
    p: Policy[S, A] = epsilon_greedy_policy(q, mdp, 1.0)
    yield q
    num_episodes: int = 0
    while True:
        # Generate one episode by following the current ϵ-greedy policy.
        trace: Iterable[TransitionStep[S, A]] = \
            mdp.simulate_action(states, p) #.
            # mdp.simulate_actions(states, p) #.
        num_episodes += 1
        # Move Q towards the return observed for each (state, action) step.
        for step in returns(trace, γ, episode_length_tolerance):
            q = q.update([((step.state, step.action), step.return_)])
        # Improve the policy with the decayed ϵ and yield the new estimate.
        p = epsilon_greedy_policy(q, mdp, ϵ_as_func_of_episodes(num_episodes))
        yield q
We run the GLIE TAB MC Control algorithm on the Inventory problem.
Let us set the capacity \(C=5\), but keep the other parameters as before.
qvfas: Iterator[QValueFunctionApprox[InventoryState, int]] = glie_mc_control(
    mdp=si_mdp,
    states=Choose(si_mdp.non_terminal_states),
    approx_0=Tabular(
        values_map=initial_qvf_dict,
        count_to_weight_func=learning_rate_func
    ),
    γ=gamma,
    ϵ_as_func_of_episodes=epsilon_as_func_of_episodes,
    episode_length_tolerance=mc_episode_length_tol
)
qvfas
<generator object glie_mc_control at 0x7ff9e1b4d350>
Now we get the final estimate of the Optimal Action-Value Function after n_episodes. Then we extract from it the estimate of the Optimal State-Value Function and the Optimal Policy.
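Since `glie_mc_control` returns an infinite generator, some way of stopping after a fixed number of episodes is needed. One possible approach, sketched here with a stand-in generator instead of the real one, is `itertools.islice`. The names `nth_estimate` and `running_mean_estimates` are hypothetical and not part of the project's code.

```python
import itertools
from typing import Iterable, Iterator, TypeVar

X = TypeVar("X")


def nth_estimate(estimates: Iterator[X], n: int) -> X:
    """Return item n (0-based) of a possibly infinite iterator."""
    return next(itertools.islice(estimates, n, n + 1))


def running_mean_estimates(samples: Iterable[float]) -> Iterator[float]:
    """Stand-in for a control loop: yields an initial estimate, then one
    refined estimate per processed sample."""
    total = 0.0
    yield 0.0                      # initial estimate, before any sample
    for i, x in enumerate(samples, start=1):
        total += x
        yield total / i


# Item 4 is the estimate after 4 samples, because item 0 is the initial one.
estimate = nth_estimate(running_mean_estimates(itertools.cycle([1.0, 3.0])), 4)
print(estimate)
```

With the generator from this project, the equivalent call would presumably be `nth_estimate(qvfas, n_episodes)`. The timing and the `Tabular` output that follow come from the original run, not from this sketch.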
CPU times: user 2min 50s, sys: 7.34 s, total: 2min 57s
Wall time: 2min 48s
Tabular(values_map={(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): -34.03322148393033, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): -29.020209183999388, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): -26.627145648076866, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): -29.14229159609633, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): -29.083772263900514, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): -30.59031203906599, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): -25.32177188084976, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): -20.960023620302074, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): -19.771390238869504, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): -21.918336203169535, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): -22.204214961666906, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): -21.59608715735911, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): -18.514480015418165, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): -19.729972311340145, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): -20.670368262219455, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): -22.505027675244786, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): -18.498951466322506, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): -23.631308215603628, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): -21.064211166622616, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): -25.700858037710592, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): -23.87449284911821, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): -24.28212841574819, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): -23.07702634694885, (NonTerminal(state=InventoryState(on_hand=1, 
on_order=0)), 2): -20.08614993159479, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): -22.578766436037245, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): -23.67197364066413, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): -23.1976582684385, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): -19.28700559050116, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): -21.194611200755286, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): -20.787724580885936, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): -21.920701267313003, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): -20.436319583369052, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): -22.716447291896973, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): -22.639086837096524, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): -24.14606280853149, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): -24.284458257076704, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): -22.688895113138887, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): -20.353379740767895, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): -22.30071399052781, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): -23.729150109387337, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): -22.827065624947515, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): -21.25757526657738, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): -24.68781063371742, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): -22.75024821444209, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): -24.83171253975065, (NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): -24.809513898197483, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): -24.40754279540118, 
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): -22.718947367168315, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): -24.324439592173484, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): -23.388534356752288, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): -24.26983392987888, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): -26.345027269838162, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): -24.387050038557934, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): -26.536561759412, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): -25.347216865676412, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): -29.056513846485817}, counts_map={(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): 2698, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): 3805, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): 78127, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): 4287, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): 122974, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): 2009, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): 82490, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): 205196, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): 1250, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): 2826, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): 1140, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): 71, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): 550, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): 84122, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): 1201, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): 1920, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): 473, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 
3): 902, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): 632, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): 2696, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): 6130, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): 250, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): 6067, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): 3552, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): 9349, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): 100, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): 7695, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): 546, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): 38, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): 52886, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): 442, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): 52, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): 1225, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): 99, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): 453, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): 109272, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): 151, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): 708, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): 4748, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): 4694, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): 1474, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): 3203, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): 10985, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): 25991, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): 931, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): 8558, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): 
169733, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): 636, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): 982, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 2): 13227, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): 873, (NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): 1645, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): 40806, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): 1986, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): 259, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): 776}, count_to_weight_func=<function learning_rate_schedule.<locals>.lr_func at 0x7ff9e1b4bc20>)
def get_vf_and_policy_from_qvf(
    mdp: FiniteMarkovDecisionProcess[S, A],
    qvf: QValueFunctionApprox[S, A]
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
    # Optimal state value: the best action value available in each state.
    opt_vf: V[S] = {
        s: max(qvf((s, a)) for a in mdp.actions(s))
        for s in mdp.non_terminal_states
    }
    # Optimal policy: the action that attains that maximum in each state.
    opt_policy: FiniteDeterministicPolicy[S, A] = \
        FiniteDeterministicPolicy({
            s.state: qvf.argmax((s, a) for a in mdp.actions(s))[1]
            for s in mdp.non_terminal_states
        })
    return opt_vf, opt_policy
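The same extraction can be illustrated on a plain dictionary, without the library types. The sketch below uses the estimated action values of the first two states from the output above, rounded to three decimals; the function name `vf_and_policy_from_q` and the tuple encoding of states as `(on_hand, on_order)` are assumptions for this example.

```python
from typing import Dict, Tuple

State = Tuple[int, int]                  # (on_hand, on_order)
QDict = Dict[Tuple[State, int], float]   # (state, action) -> value


def vf_and_policy_from_q(q: QDict) -> Tuple[Dict[State, float], Dict[State, int]]:
    """For each state keep the largest action value and the action attaining it."""
    best: Dict[State, Tuple[int, float]] = {}
    for (s, a), value in q.items():
        if s not in best or value > best[s][1]:
            best[s] = (a, value)
    vf = {s: v for s, (_, v) in best.items()}
    policy = {s: a for s, (a, _) in best.items()}
    return vf, policy


# Action values for states (0, 0) and (0, 1), rounded from the output above.
q_excerpt: QDict = {
    ((0, 0), 0): -34.033, ((0, 0), 1): -29.020, ((0, 0), 2): -26.627,
    ((0, 0), 3): -29.142, ((0, 0), 4): -29.084, ((0, 0), 5): -30.590,
    ((0, 1), 0): -25.322, ((0, 1), 1): -20.960, ((0, 1), 2): -19.771,
    ((0, 1), 3): -21.918, ((0, 1), 4): -22.204,
}

vf_excerpt, policy_excerpt = vf_and_policy_from_q(q_excerpt)
print(vf_excerpt)
print(policy_excerpt)
```

For both states the maximizing action is 2, which agrees with the first two rows of the value function and policy printed below.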
GLIE MC Optimal Value Function with 10000 episodes
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -26.627145648076866,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -19.771390238869504,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -18.514480015418165,
NonTerminal(state=InventoryState(on_hand=0, on_order=3)): -18.498951466322506,
NonTerminal(state=InventoryState(on_hand=0, on_order=4)): -21.064211166622616,
NonTerminal(state=InventoryState(on_hand=0, on_order=5)): -23.87449284911821,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -20.08614993159479,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -19.28700559050116,
NonTerminal(state=InventoryState(on_hand=1, on_order=2)): -20.436319583369052,
NonTerminal(state=InventoryState(on_hand=1, on_order=3)): -22.639086837096524,
NonTerminal(state=InventoryState(on_hand=1, on_order=4)): -24.284458257076704,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -20.353379740767895,
NonTerminal(state=InventoryState(on_hand=2, on_order=1)): -21.25757526657738,
NonTerminal(state=InventoryState(on_hand=2, on_order=2)): -22.75024821444209,
NonTerminal(state=InventoryState(on_hand=2, on_order=3)): -24.809513898197483,
NonTerminal(state=InventoryState(on_hand=3, on_order=0)): -22.718947367168315,
NonTerminal(state=InventoryState(on_hand=3, on_order=1)): -23.388534356752288,
NonTerminal(state=InventoryState(on_hand=3, on_order=2)): -26.345027269838162,
NonTerminal(state=InventoryState(on_hand=4, on_order=0)): -24.387050038557934,
NonTerminal(state=InventoryState(on_hand=4, on_order=1)): -25.347216865676412,
NonTerminal(state=InventoryState(on_hand=5, on_order=0)): -29.056513846485817}
GLIE MC Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
For comparison, we run Value Iteration to find the true Optimal Value Function and Optimal Policy.
True Optimal State Value Function
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -26.98163492786722,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -19.9909558006178,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -18.86849301795165,
NonTerminal(state=InventoryState(on_hand=0, on_order=3)): -19.934022635857023,
NonTerminal(state=InventoryState(on_hand=0, on_order=4)): -20.91848526863397,
NonTerminal(state=InventoryState(on_hand=0, on_order=5)): -22.77205311116058,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -20.9909558006178,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -19.86849301795165,
NonTerminal(state=InventoryState(on_hand=1, on_order=2)): -20.934022635857026,
NonTerminal(state=InventoryState(on_hand=1, on_order=3)): -21.918485268633972,
NonTerminal(state=InventoryState(on_hand=1, on_order=4)): -23.77205311116058,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -20.86849301795165,
NonTerminal(state=InventoryState(on_hand=2, on_order=1)): -21.934022635857023,
NonTerminal(state=InventoryState(on_hand=2, on_order=2)): -22.91848526863397,
NonTerminal(state=InventoryState(on_hand=2, on_order=3)): -24.77205311116058,
NonTerminal(state=InventoryState(on_hand=3, on_order=0)): -22.934022635857026,
NonTerminal(state=InventoryState(on_hand=3, on_order=1)): -23.918485268633972,
NonTerminal(state=InventoryState(on_hand=3, on_order=2)): -25.772053111160584,
NonTerminal(state=InventoryState(on_hand=4, on_order=0)): -24.91848526863397,
NonTerminal(state=InventoryState(on_hand=4, on_order=1)): -26.772053111160577,
NonTerminal(state=InventoryState(on_hand=5, on_order=0)): -27.772053111160577}
True Optimal Policy
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
Now we compare values by state for the State Value Function
[(-26.98163492786722, -26.627145648076866),
(-19.9909558006178, -19.771390238869504),
(-18.86849301795165, -18.514480015418165),
(-19.934022635857023, -18.498951466322506),
(-20.91848526863397, -21.064211166622616),
(-22.77205311116058, -23.87449284911821),
(-20.9909558006178, -20.08614993159479),
(-19.86849301795165, -19.28700559050116),
(-20.934022635857026, -20.436319583369052),
(-21.918485268633972, -22.639086837096524),
(-23.77205311116058, -24.284458257076704),
(-20.86849301795165, -20.353379740767895),
(-21.934022635857023, -21.25757526657738),
(-22.91848526863397, -22.75024821444209),
(-24.77205311116058, -24.809513898197483),
(-22.934022635857026, -22.718947367168315),
(-23.918485268633972, -23.388534356752288),
(-25.772053111160584, -26.345027269838162),
(-24.91848526863397, -24.387050038557934),
(-26.772053111160577, -25.347216865676412),
(-27.772053111160577, -29.056513846485817)]
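A single summary number can make the comparison above easier to read. The sketch below computes the largest absolute gap between true and estimated values. Only the first four pairs from the list above are copied in, to keep the example short, and `max_abs_error` is a helper name introduced here for illustration.

```python
from typing import List, Tuple


def max_abs_error(pairs: List[Tuple[float, float]]) -> float:
    """Largest absolute difference between (true, estimated) values."""
    return max(abs(true_v - est_v) for true_v, est_v in pairs)


# First four (true, estimated) pairs from the comparison above.
first_pairs = [
    (-26.98163492786722, -26.627145648076866),
    (-19.9909558006178, -19.771390238869504),
    (-18.86849301795165, -18.514480015418165),
    (-19.934022635857023, -18.498951466322506),
]

worst_gap = max_abs_error(first_pairs)
print(round(worst_gap, 3))
```

Passing all 21 pairs instead of the excerpt would give the worst-case error over the whole state space.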
We also compare values by state for the Policy
[(true_opt_pol.policy_map[s.state], opt_pol.policy_map[s.state]) for s in si_mdp.non_terminal_states]
[(Constant(value=2), Constant(value=2)),
(Constant(value=2), Constant(value=2)),
(Constant(value=1), Constant(value=1)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=2), Constant(value=2)),
(Constant(value=1), Constant(value=1)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=1), Constant(value=1)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0))]
Let us visualize the convergence of the Action Value Function (Q) for each of the states:
Tabular(values_map={(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): -16.66003352275379, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): -8.84712186006307, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): -3.484852528806128, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): -3.0417892530956645, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): -3.323154957160499, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): -5.436978480160036, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): -2.5474975726878393, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): -2.8275062369570962, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): -4.466077967978892, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): -9.695543163249456, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): -2.933739288365049, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): -5.749167329955894, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): -2.8750300064628984, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): -3.166028228334391, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): -6.3666719406671, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): -2.5812832857338877, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): -4.707597321874659, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 2): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): -8.07172866454282, 
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): -3.0339476206022704, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): -10.993906420542446, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): -2.844003683042871, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): -2.7911653293309358, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): -6.316763168951876, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): -5.623645309953783, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): -2.4301257768173894, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): -3.259941800669189, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): -12.446739523987436, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): -7.468555067559018, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): -3.6864546028819642, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): -2.716748543151219, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): -2.5805663199524878, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): -5.029640798466062, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): -5.72700024482039, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): -3.5178091425937676, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): -8.58102363687888, (NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): 0.0, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): -2.8123996539699156, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): -9.664140042001613, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): -5.957203516680925, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 
0): -6.901471514837327, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): -16.37093741815397, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): -3.76606713446086, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): -8.303203454300261, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): -18.762667910359255, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): -22.132186839172704, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): -22.579487794364617}, counts_map={(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): 8, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): 10, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): 3, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): 9, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): 3, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): 2, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): 3, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): 2, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): 4, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): 1, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): 4, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): 2, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): 4, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): 3, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): 2, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): 1, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): 10, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): 1, 
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): 2, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): 2, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): 2, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): 2, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): 4, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): 3, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): 2, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): 1}, count_to_weight_func=<function learning_rate_schedule.<locals>.lr_func at 0x7ff9e1b4bc20>)
{(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
0): -16.66003352275379,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
2): -8.84712186006307,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
3): -3.484852528806128,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
4): -3.0417892530956645,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
5): -3.323154957160499,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
0): -5.436978480160036,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
2): -2.5474975726878393,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
3): -2.8275062369570962,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
4): -4.466077967978892,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)),
1): -9.695543163249456,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)),
2): -2.933739288365049,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)),
0): -5.749167329955894,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)),
1): -2.8750300064628984,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)),
2): -3.166028228334391,
(NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=4)),
1): -6.3666719406671,
(NonTerminal(state=InventoryState(on_hand=0, on_order=5)),
0): -2.5812832857338877,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)),
0): -4.707597321874659,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 2): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)),
3): -8.07172866454282,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)),
4): -3.0339476206022704,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)),
0): -10.993906420542446,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)),
2): -2.844003683042871,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)),
3): -2.7911653293309358,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)),
0): -6.316763168951876,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)),
0): -5.623645309953783,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)),
1): -2.4301257768173894,
(NonTerminal(state=InventoryState(on_hand=1, on_order=4)),
0): -3.259941800669189,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
0): -12.446739523987436,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
1): -7.468555067559018,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
2): -3.6864546028819642,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
3): -2.716748543151219,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)),
0): -2.5805663199524878,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)),
1): -5.029640798466062,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)),
2): -5.72700024482039,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)),
0): -3.5178091425937676,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)),
1): -8.58102363687888,
(NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): 0.0,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)),
0): -2.8123996539699156,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)),
1): -9.664140042001613,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)),
2): -5.957203516680925,
(NonTerminal(state=InventoryState(on_hand=3, on_order=1)),
0): -6.901471514837327,
(NonTerminal(state=InventoryState(on_hand=3, on_order=1)),
1): -16.37093741815397,
(NonTerminal(state=InventoryState(on_hand=3, on_order=2)),
0): -3.76606713446086,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)),
0): -8.303203454300261,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)),
1): -18.762667910359255,
(NonTerminal(state=InventoryState(on_hand=4, on_order=1)),
0): -22.132186839172704,
(NonTerminal(state=InventoryState(on_hand=5, on_order=0)),
0): -22.579487794364617}
{(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): 8,
(NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): 10,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): 3,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): 9,
(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): 3,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): 2,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): 3,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): 2,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): 4,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): 4,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): 2,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): 4,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): 3,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): 2,
(NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): 10,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): 2,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): 2,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): 2,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): 2,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): 4,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): 3,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): 2,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): 1}
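The action-value table printed above can be turned into a greedy policy by taking, for each state, the action with the largest estimate. The following is a minimal sketch of that step, not the notebook's own code: it assumes the table is a plain dict keyed by `(state, action)`, with states simplified to `(on_hand, on_order)` tuples, and `greedy_policy_from_q` is a hypothetical helper name. The sample values are copied from the table above.

```python
from typing import Dict, Hashable, Tuple

def greedy_policy_from_q(
    q: Dict[Tuple[Hashable, int], float]
) -> Dict[Hashable, int]:
    """Return, for each state, the action with the largest Q estimate."""
    best: Dict[Hashable, Tuple[int, float]] = {}
    for (state, action), value in q.items():
        # Keep the first action seen, then replace it only on a strict improvement
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}

# Values copied from the printed table; states simplified to (on_hand, on_order)
q_sample = {
    ((0, 0), 0): -16.66003352275379,
    ((0, 0), 1): 0.0,
    ((0, 0), 2): -8.84712186006307,
    ((0, 0), 3): -3.484852528806128,
    ((0, 0), 4): -3.0417892530956645,
    ((0, 0), 5): -3.323154957160499,
    ((2, 1), 0): -2.5805663199524878,
    ((2, 1), 1): -5.029640798466062,
    ((2, 1), 2): -5.72700024482039,
}

policy = greedy_policy_from_q(q_sample)
for state, action in sorted(policy.items()):
    print(f"For State {state}: Do Action {action}")
```

In this sample, state `(0, 0)` selects action 1 only because that pair has no entry in the counts and still holds its 0.0 value, which beats every negative estimate. With so few visits per pair, the greedy choice says more about which actions have been tried than about which are good.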
print('capacity =', capacity)
print('poisson_lambda =', poisson_lambda)
print('holding_cost =', holding_cost)
print('stockout_cost =', stockout_cost)
print('gamma =', gamma)
capacity = 5
poisson_lambda = 1.0
holding_cost = 1.0
stockout_cost = 10.0
gamma = 0.9
\(\theta = \text{max}(r - (\alpha + \beta), 0)\)
where:
- \(\theta \in \mathbb Z_{\ge0}\) is the order quantity
- \(r \in \mathbb Z_{\ge0}\) is the reorder point below which reordering is allowed
- \(\alpha\) is the on-hand inventory
- \(\beta\) is the on-order inventory
- \((\alpha,\beta)\) is the state
We set \(r = C\) where \(C\) is the inventory capacity for the item.
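The rule can be written as a one-line function. This is a minimal sketch rather than the `SimpleInventoryDeterministicPolicy` class itself; `order_quantity` and its parameter names are hypothetical, chosen to mirror the symbols \(\alpha\), \(\beta\) and \(r\).

```python
def order_quantity(on_hand: int, on_order: int, reorder_point: int) -> int:
    """theta = max(r - (alpha + beta), 0): order up to the reorder point."""
    return max(reorder_point - (on_hand + on_order), 0)

capacity = 5  # C, as printed above
for on_hand, on_order in [(2, 0), (1, 3), (4, 1)]:
    theta = order_quantity(on_hand, on_order, reorder_point=capacity)
    print(f"state=({on_hand}, {on_order}) -> order {theta}")
```

With \(r = C = 5\), the three states give orders of 3, 1 and 0, which are the state-action pairs that appear in the trace for this policy.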
SimpleInventoryDeterministicPolicy(action_for=<function SimpleInventoryDeterministicPolicy.__init__.<locals>.action_for at 0x7ff9e1b8e440>)
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), action=3, next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=3)), reward=-2.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=3)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), action=3, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=3)), reward=-2.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=3)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), reward=-2.0)]
\(\theta = \text{max}(r - (\alpha + \beta), 0)\)
where:
- \(\theta \in \mathbb Z_{\ge0}\) is the order quantity
- \(r \in \mathbb Z_{\ge0}\) is the reorder point below which reordering is allowed
- \(\alpha\) is the on-hand inventory
- \(\beta\) is the on-order inventory
- \((\alpha,\beta)\) is the state
We set \(r = 1\).
SimpleInventoryDeterministicPolicy(action_for=<function SimpleInventoryDeterministicPolicy.__init__.<locals>.action_for at 0x7ff9e1b8e680>)
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-4.678794411714423),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-10.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-3.6787944117144233),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-10.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-3.6787944117144233)]
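The fractional rewards in this trace are consistent with a holding cost on the on-hand inventory plus the stockout cost times the expected unmet demand, \(\mathbb E[\max(D - (\alpha + \beta), 0)]\) with \(D \sim \text{Poisson}(\lambda)\). The sketch below checks that reading against the printed numbers. It is an assumption inferred from the output, not a statement of how the underlying MDP class is implemented, and the function names are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a Poisson(lam) demand equals k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_shortfall(position: int, lam: float, max_demand: int = 60) -> float:
    """E[max(D - position, 0)] for D ~ Poisson(lam), with the sum truncated at max_demand."""
    return sum(
        (d - position) * poisson_pmf(d, lam)
        for d in range(position + 1, max_demand + 1)
    )

def stockout_step_reward(
    on_hand: int,
    on_order: int,
    holding_cost: float = 1.0,
    stockout_cost: float = 10.0,
    lam: float = 1.0,
) -> float:
    """Assumed reward on a step that ends with zero on-hand inventory."""
    position = on_hand + on_order  # inventory available to meet demand
    return -holding_cost * on_hand - stockout_cost * expected_shortfall(position, lam)

for state in [(1, 0), (0, 0), (0, 1)]:
    print(state, stockout_step_reward(*state))
```

With the parameters printed earlier (holding cost 1.0, stockout cost 10.0, \(\lambda = 1.0\)), this gives approximately -4.6788 for state (1, 0), -10.0 for (0, 0) and -3.6788 for (0, 1), matching the rewards in the trace.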
\(\theta = \text{max}(r - (\alpha + \beta), 0)\)
where:
- \(\theta \in \mathbb Z_{\ge0}\) is the order quantity
- \(r \in \mathbb Z_{\ge0}\) is the reorder point below which reordering is allowed
- \(\alpha\) is the on-hand inventory
- \(\beta\) is the on-order inventory
- \((\alpha,\beta)\) is the state
We set \(r = \frac{C}{2}\).
SimpleInventoryDeterministicPolicy(action_for=<function SimpleInventoryDeterministicPolicy.__init__.<locals>.action_for at 0x7ff9decef4d0>)
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), reward=-5.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), reward=-5.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-3.2333692644293284)]
GLIE MC Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=4)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=1)), reward=-3.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=3, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), reward=-3.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-4.0)]
GLIE MC True Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=4)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-0.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=1)), reward=-2.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=1)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-2.2333692644293284),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), action=2, next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=2)), reward=-0.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=2)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0)]
fig, axs = plt.subplots(figsize=(16, 15))
axs.set_xlabel('Steps', fontsize=20)
axs.set_title('Cumulative Reward', fontsize=24)
# One curve per entry in plot_list, labelled from label_list
for i, cum_r in enumerate(plot_list):
    axs.plot(cum_r, label=label_list[i])
# axs.set_ylim([-5000,0])
axs.legend(fontsize=15);
fig, axs = plt.subplots(figsize=(16, 15))
axs.set_xlabel('Steps', fontsize=20)
axs.set_title('Cumulative Cost', fontsize=24)
# Cost is the negated reward, so the same curves are flipped in sign
for i, cum_r in enumerate(plot_list):
    axs.plot(-cum_r, label=label_list[i])
# axs.set_ylim([0, 5000])
axs.legend(fontsize=15);
Now we zoom in a bit: