!python --version
Python 3.7.15
Tabular Monte-Carlo Control
Kobus Esterhuysen
July 20, 2022
In previous projects we have only dealt with the prediction problem. In this project we move on to the control problem. We stay in the model-free or learning domain (rather than planning), i.e. making use of Reinforcement Learning algorithms. We will explore the depth of update dimension, considering both bootstrapping and non-bootstrapping methods like Temporal-Difference (TD) and Monte-Carlo (MC) algorithms respectively.
What distinguishes Control from Prediction is its dependence on the concept of Generalized Policy Iteration (GPI).
The key concept of GPI is that we can:
- Evaluate a policy with any Policy Evaluation method
- Improve a policy with any Policy Improvement method

Here is a reasonable way to approach MC Control based on what we have done so far:
- do MC Policy Evaluation (which is essentially MC Prediction)
- do greedy Policy Improvement
- do MC Policy Evaluation on the improved policy
- etc., etc.
However, this naive approach leads to some complications.
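The first complication is that:
- Greedy Policy Improvement requires the
  - state transition probability function $\mathcal{P}$
  - reward function $\mathcal{R}$
- Neither is available in the model-free setting

We address this complication by noting that

$$\pi'_D(s) = \arg\max_{a \in \mathcal{A}} \Big\{ \mathcal{R}(s,a) + \gamma \sum_{s'} \mathcal{P}(s,a,s') \cdot V^{\pi}(s') \Big\}$$

can be expressed more concisely as:

$$\pi'_D(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi}(s,a)$$

This is valuable because:
- Instead of doing
  - Policy Evaluation to calculate $V^{\pi}$
- we do
  - Policy Evaluation to calculate $Q^{\pi}$, which MC can estimate directly from sampled returns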
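The following is a minimal sketch of why working with action values removes the need for a model. The tiny MDP, the state values `v_pi`, and all function names are hypothetical and chosen only for illustration. The greedy action computed by a one-step lookahead (which needs transition probabilities and rewards) is the same as the greedy action read off a Q-table. In MC Control the Q-table would be estimated from sampled returns; here it is filled from the model purely to show that the two argmax expressions agree.

```python
from typing import Dict, List, Tuple

# Hypothetical tiny MDP: model[state][action] = [(probability, next_state, reward), ...]
model: Dict[str, Dict[str, List[Tuple[float, str, float]]]] = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "s0", 0.0), (0.5, "s1", 4.0)]},
    "s1": {"a": [(1.0, "s1", 0.0)],
           "b": [(1.0, "s0", 2.0)]},
}
v_pi: Dict[str, float] = {"s0": 5.0, "s1": 3.0}  # assumed state values, for illustration only
gamma = 0.9


def one_step_lookahead(s: str, a: str) -> float:
    """Expected reward plus discounted next-state value; this needs the model."""
    return sum(p * (r + gamma * v_pi[s2]) for p, s2, r in model[s][a])


def greedy_from_q(q: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Greedy improvement that only reads a Q-table; no model is consulted."""
    best: Dict[str, str] = {}
    for (s, a), value in q.items():
        if s not in best or value > q[(s, best[s])]:
            best[s] = a
    return best


# Greedy improvement from state values requires the lookahead through the model.
greedy_from_v = {
    s: max(model[s], key=lambda a: one_step_lookahead(s, a)) for s in model
}

# Q-table (filled from the model here; in MC Control it is estimated from returns).
q_pi = {(s, a): one_step_lookahead(s, a) for s in model for a in model[s]}

print(greedy_from_v)
print(greedy_from_q(q_pi))
```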
The second complication is that updates can get biased by initial random occurrences of returns. This could prevent certain actions from being sufficiently chosen. This will lead to inaccurate estimates of the action values for those actions. We want to exploit actions that provide better returns but at the same time also be sure to explore all possible actions. This problem is known as the exploration-exploitation dilemma.
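The way to address this complication is to modify our Tabular MC Control algorithm:
- Instead of doing
  - greedy Policy Improvement
- we want to do
  - $\epsilon$-greedy Policy Improvement

This improved stochastic policy is defined as:

$$\pi'(s,a) = \begin{cases} \frac{\epsilon}{|\mathcal{A}|} + 1 - \epsilon & \text{if } a = \arg\max_{b \in \mathcal{A}} Q(s,b) \\ \frac{\epsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$

This means the:
- exploit probability is: $1 - \epsilon$ (take the greedy action)
- explore probability is: $\epsilon$ (take an action chosen uniformly at random from all available actions, which may itself be the greedy one)

The deterministic greedy policy is the special case of the $\epsilon$-greedy policy with $\epsilon = 0$.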
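As a small illustration of the $\epsilon$-greedy idea, the sketch below computes the action probabilities for a single state: with probability $\epsilon$ an action is drawn uniformly, otherwise the action with the highest Q-value is taken. The function name `epsilon_greedy_probs` and the example Q-values are assumptions made for this example and are not part of the library used in this project.

```python
from typing import Dict


def epsilon_greedy_probs(q_values: Dict[str, float], epsilon: float) -> Dict[str, float]:
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_values)
    greedy_action = max(q_values, key=q_values.get)
    # Every action receives an equal share of the exploration mass ...
    probs = {a: epsilon / n_actions for a in q_values}
    # ... and the greedy action additionally receives the exploitation mass.
    probs[greedy_action] += 1.0 - epsilon
    return probs


# Hypothetical Q-values for one state with three actions.
probs = epsilon_greedy_probs({"a0": -3.0, "a1": -1.0, "a2": -2.0}, epsilon=0.3)
print(probs)
```

Setting `epsilon=0.0` puts all probability on the greedy action, and `epsilon=1.0` gives the uniform random policy.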
We need to consider two more aspects:
Here is a summary of the GLIE Tabular Monte-Carlo Control algorithm.
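For each episode (terminating trace experience):
- Generate the trace experience (episode) with actions sampled from the $\epsilon$-greedy policy derived from the current estimate of the Action-Value Function
- For each state-action pair $(S_t, A_t)$ visited in the episode, move its estimate toward the observed return $G_t$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \cdot (G_t - Q(S_t, A_t))$$

  where $\alpha$ is the learning-rate weight (choosing $\alpha$ as the reciprocal of the visit count of $(S_t, A_t)$ makes the estimate the running average of the returns)
- Reduce $\epsilon$ as a function of the number of completed episodes and form the $\epsilon$-greedy policy for the next episode from the updated estimate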
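The adaptation for Function Approximation is simple. Instead of the above update, we have:

$$\Delta w = \alpha \cdot (G_t - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w)$$

where $w$ are the parameters of the function approximation, $\alpha$ is the learning rate, and $G_t$ is the return observed from time step $t$ of the episode.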
The following code can handle both the tabular (TAB) and function approximation (FAP) cases.
def epsilon_greedy_policy(
    q: QValueFunctionApprox[S, A],
    mdp: MarkovDecisionProcess[S, A],
    ε: float = 0.0
) -> Policy[S, A]:
    def explore(s: S, mdp=mdp) -> Iterable[A]:
        return mdp.actions(NonTerminal(s))
    return RandomPolicy(Categorical(
        {UniformPolicy(explore): ε,
         greedy_policy_from_qvf(q, mdp.actions): 1 - ε}
    ))
def glie_mc_control(
    mdp: MarkovDecisionProcess[S, A],
    states: NTStateDistribution[S],
    approx_0: QValueFunctionApprox[S, A],
    γ: float,
    ϵ_as_func_of_episodes: Callable[[int], float],
    episode_length_tolerance: float = 1e-6
) -> Iterator[QValueFunctionApprox[S, A]]:
    q: QValueFunctionApprox[S, A] = approx_0
    # Start fully exploratory: ε = 1.0 gives the uniform random policy
    p: Policy[S, A] = epsilon_greedy_policy(q, mdp, 1.0)
    yield q  # the initial estimate is yielded before any episode is run

    num_episodes: int = 0
    while True:
        # Policy Evaluation: sample one episode under the current policy
        trace: Iterable[TransitionStep[S, A]] = \
            mdp.simulate_action(states, p) #.
            # mdp.simulate_actions(states, p) #.
        num_episodes += 1
        # Update q toward the return observed at each step of the episode
        for step in returns(trace, γ, episode_length_tolerance):
            q = q.update([((step.state, step.action), step.return_)])
        # Policy Improvement: ε-greedy on the updated q, with ε set by the schedule
        p = epsilon_greedy_policy(q, mdp, ϵ_as_func_of_episodes(num_episodes))
        yield q
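To make the loop above concrete without the supporting library, here is a self-contained toy version under stated assumptions. The two-state MDP (`TOY_MDP`), the helper names, and the schedule $\epsilon_k = 1/\sqrt{k}$ are all hypothetical choices for this sketch; the schedule used in this project is whatever is passed as `ϵ_as_func_of_episodes`. The sketch uses a plain dictionary for Q, an every-visit update with weight 1/count (a running average of returns), and chooses the start state uniformly.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical episodic MDP: (state, action) -> (reward, next_state); None means terminal.
TOY_MDP: Dict[Tuple[str, str], Tuple[float, Optional[str]]] = {
    ("A", "safe"): (1.0, None),
    ("A", "risky"): (0.0, "B"),
    ("B", "go"): (3.0, None),
    ("B", "stop"): (0.0, None),
}
ACTIONS: Dict[str, List[str]] = {"A": ["safe", "risky"], "B": ["go", "stop"]}


def choose_action(q: Dict[Tuple[str, str], float], state: str,
                  epsilon: float, rng: random.Random) -> str:
    """Epsilon-greedy choice: explore uniformly with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS[state])
    return max(ACTIONS[state], key=lambda a: q[(state, a)])


def toy_glie_mc_control(num_episodes: int, gamma: float,
                        epsilon_fn: Callable[[int], float],
                        seed: int = 0) -> Dict[Tuple[str, str], float]:
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in TOY_MDP}
    counts = {sa: 0 for sa in TOY_MDP}
    for k in range(1, num_episodes + 1):
        epsilon = epsilon_fn(k)
        state: Optional[str] = rng.choice(sorted(ACTIONS))
        episode: List[Tuple[str, str, float]] = []
        # Generate one terminating episode under the current epsilon-greedy policy.
        while state is not None:
            action = choose_action(q, state, epsilon, rng)
            reward, next_state = TOY_MDP[(state, action)]
            episode.append((state, action, reward))
            state = next_state
        # Walk backwards to accumulate returns and update the running averages.
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
    return q


q = toy_glie_mc_control(5000, gamma=1.0, epsilon_fn=lambda k: k ** -0.5)
greedy = {s: max(acts, key=lambda a: q[(s, a)]) for s, acts in ACTIONS.items()}
print(q)
print(greedy)
```

Because the policy is read from `q` at every action choice, improvement is implicit here, whereas the code above rebuilds the policy object once per episode. The estimate for `("A", "risky")` averages returns collected under earlier, more exploratory policies, so it approaches the optimal value of 3 only gradually.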
We run the GLIE TAB MC Control algorithm on the Inventory problem.
Let us set the capacity
qvfas: Iterator[QValueFunctionApprox[InventoryState, int]] = glie_mc_control(
    mdp=si_mdp,
    states=Choose(si_mdp.non_terminal_states),
    approx_0=Tabular(
        values_map=initial_qvf_dict,
        count_to_weight_func=learning_rate_func
    ),
    γ=gamma,
    ϵ_as_func_of_episodes=epsilon_as_func_of_episodes,
    episode_length_tolerance=mc_episode_length_tol
)
qvfas
<generator object glie_mc_control at 0x7ff9e1b4d350>
Now we get the final estimate of the Optimal Action-Value Function after n_episodes. Then we extract from it the estimate of the Optimal State-Value Function and the Optimal Policy.
CPU times: user 2min 50s, sys: 7.34 s, total: 2min 57s
Wall time: 2min 48s
Tabular(values_map={(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): -34.03322148393033, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): -29.020209183999388, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): -26.627145648076866, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): -29.14229159609633, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): -29.083772263900514, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): -30.59031203906599, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): -25.32177188084976, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): -20.960023620302074, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): -19.771390238869504, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): -21.918336203169535, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): -22.204214961666906, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): -21.59608715735911, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): -18.514480015418165, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): -19.729972311340145, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): -20.670368262219455, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): -22.505027675244786, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): -18.498951466322506, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): -23.631308215603628, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): -21.064211166622616, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): -25.700858037710592, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): -23.87449284911821, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): -24.28212841574819, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): -23.07702634694885, (NonTerminal(state=InventoryState(on_hand=1, 
on_order=0)), 2): -20.08614993159479, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): -22.578766436037245, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): -23.67197364066413, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): -23.1976582684385, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): -19.28700559050116, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): -21.194611200755286, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): -20.787724580885936, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): -21.920701267313003, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): -20.436319583369052, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): -22.716447291896973, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): -22.639086837096524, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): -24.14606280853149, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): -24.284458257076704, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): -22.688895113138887, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): -20.353379740767895, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): -22.30071399052781, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): -23.729150109387337, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): -22.827065624947515, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): -21.25757526657738, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): -24.68781063371742, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): -22.75024821444209, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): -24.83171253975065, (NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): -24.809513898197483, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): -24.40754279540118, 
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): -22.718947367168315, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): -24.324439592173484, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): -23.388534356752288, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): -24.26983392987888, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): -26.345027269838162, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): -24.387050038557934, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): -26.536561759412, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): -25.347216865676412, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): -29.056513846485817}, counts_map={(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): 2698, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): 3805, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): 78127, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): 4287, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): 122974, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): 2009, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): 82490, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): 205196, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): 1250, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): 2826, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): 1140, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): 71, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): 550, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): 84122, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): 1201, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): 1920, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): 473, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 
3): 902, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): 632, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): 2696, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): 6130, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): 250, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): 6067, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): 3552, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): 9349, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): 100, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): 7695, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): 546, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): 38, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): 52886, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): 442, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): 52, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): 1225, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): 99, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): 453, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): 109272, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): 151, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): 708, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): 4748, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): 4694, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): 1474, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): 3203, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): 10985, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): 25991, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): 931, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): 8558, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): 
169733, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): 636, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): 982, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 2): 13227, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): 873, (NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): 1645, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): 40806, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): 1986, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): 259, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): 776}, count_to_weight_func=<function learning_rate_schedule.<locals>.lr_func at 0x7ff9e1b4bc20>)
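Since `glie_mc_control` returns an unbounded generator, some care is needed to pick out the estimate after a fixed number of episodes. The sketch below shows one way to do this with `itertools.islice`; the helper `last` and the stand-in generator `counting_estimates` are hypothetical and only mimic the yield pattern of `glie_mc_control`, which yields its initial approximation before running any episode.

```python
import itertools
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def last(values: Iterable[T]) -> Optional[T]:
    """Return the final item of a finite iterable, or None if it is empty."""
    result: Optional[T] = None
    for result in values:
        pass
    return result


def counting_estimates() -> Iterator[int]:
    """Stand-in generator: each yielded value is the number of episodes completed."""
    episodes_done = 0
    yield episodes_done  # initial estimate, before any episode
    while True:
        episodes_done += 1
        yield episodes_done


n_episodes = 10
# The first item is the initial estimate, so n_episodes + 1 items are needed
# to reach the estimate produced after n_episodes episodes.
final_estimate = last(itertools.islice(counting_estimates(), n_episodes + 1))
print(final_estimate)
```

Taking only `n_episodes` items from such a generator would return the estimate after `n_episodes - 1` episodes, which rarely matters at this scale but is worth knowing when comparing runs.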
def get_vf_and_policy_from_qvf(
    mdp: FiniteMarkovDecisionProcess[S, A],
    qvf: QValueFunctionApprox[S, A]
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
    opt_vf: V[S] = {
        s: max(qvf((s, a)) for a in mdp.actions(s))
        for s in mdp.non_terminal_states
    }
    opt_policy: FiniteDeterministicPolicy[S, A] = \
        FiniteDeterministicPolicy({
            s.state: qvf.argmax((s, a) for a in mdp.actions(s))[1]
            for s in mdp.non_terminal_states
        })
    return opt_vf, opt_policy
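As a worked check of what this extraction does, the sketch below applies the same max and argmax to two states using Q-values copied from the `Tabular` output above, rounded to four decimals. The plain dictionary keyed by `(on_hand, on_order)` is a simplification made for this example and is not the library's data structure.

```python
from typing import Dict, Tuple

# Q-values copied from the Tabular output above (rounded to four decimals).
# Outer key: (on_hand, on_order); inner key: action (order quantity).
q_values: Dict[Tuple[int, int], Dict[int, float]] = {
    (0, 0): {0: -34.0332, 1: -29.0202, 2: -26.6271,
             3: -29.1423, 4: -29.0838, 5: -30.5903},
    (0, 2): {0: -21.5961, 1: -18.5145, 2: -19.7300, 3: -20.6704},
}

# State value = best action value; greedy action = the action attaining it.
state_values = {s: max(qs.values()) for s, qs in q_values.items()}
greedy_actions = {s: max(qs, key=qs.get) for s, qs in q_values.items()}

for s in q_values:
    print(s, state_values[s], greedy_actions[s])
```

For both states the result agrees with the value function and policy listed for these states in this section: action 2 for state (0, 0) and action 1 for state (0, 2).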
GLIE MC Optimal Value Function with 10000 episodes
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -26.627145648076866,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -19.771390238869504,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -18.514480015418165,
NonTerminal(state=InventoryState(on_hand=0, on_order=3)): -18.498951466322506,
NonTerminal(state=InventoryState(on_hand=0, on_order=4)): -21.064211166622616,
NonTerminal(state=InventoryState(on_hand=0, on_order=5)): -23.87449284911821,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -20.08614993159479,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -19.28700559050116,
NonTerminal(state=InventoryState(on_hand=1, on_order=2)): -20.436319583369052,
NonTerminal(state=InventoryState(on_hand=1, on_order=3)): -22.639086837096524,
NonTerminal(state=InventoryState(on_hand=1, on_order=4)): -24.284458257076704,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -20.353379740767895,
NonTerminal(state=InventoryState(on_hand=2, on_order=1)): -21.25757526657738,
NonTerminal(state=InventoryState(on_hand=2, on_order=2)): -22.75024821444209,
NonTerminal(state=InventoryState(on_hand=2, on_order=3)): -24.809513898197483,
NonTerminal(state=InventoryState(on_hand=3, on_order=0)): -22.718947367168315,
NonTerminal(state=InventoryState(on_hand=3, on_order=1)): -23.388534356752288,
NonTerminal(state=InventoryState(on_hand=3, on_order=2)): -26.345027269838162,
NonTerminal(state=InventoryState(on_hand=4, on_order=0)): -24.387050038557934,
NonTerminal(state=InventoryState(on_hand=4, on_order=1)): -25.347216865676412,
NonTerminal(state=InventoryState(on_hand=5, on_order=0)): -29.056513846485817}
GLIE MC Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
For comparison, we run Value Iteration to find the true Optimal Value Function and Optimal Policy.
True Optimal State Value Function
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -26.98163492786722,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -19.9909558006178,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -18.86849301795165,
NonTerminal(state=InventoryState(on_hand=0, on_order=3)): -19.934022635857023,
NonTerminal(state=InventoryState(on_hand=0, on_order=4)): -20.91848526863397,
NonTerminal(state=InventoryState(on_hand=0, on_order=5)): -22.77205311116058,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -20.9909558006178,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -19.86849301795165,
NonTerminal(state=InventoryState(on_hand=1, on_order=2)): -20.934022635857026,
NonTerminal(state=InventoryState(on_hand=1, on_order=3)): -21.918485268633972,
NonTerminal(state=InventoryState(on_hand=1, on_order=4)): -23.77205311116058,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -20.86849301795165,
NonTerminal(state=InventoryState(on_hand=2, on_order=1)): -21.934022635857023,
NonTerminal(state=InventoryState(on_hand=2, on_order=2)): -22.91848526863397,
NonTerminal(state=InventoryState(on_hand=2, on_order=3)): -24.77205311116058,
NonTerminal(state=InventoryState(on_hand=3, on_order=0)): -22.934022635857026,
NonTerminal(state=InventoryState(on_hand=3, on_order=1)): -23.918485268633972,
NonTerminal(state=InventoryState(on_hand=3, on_order=2)): -25.772053111160584,
NonTerminal(state=InventoryState(on_hand=4, on_order=0)): -24.91848526863397,
NonTerminal(state=InventoryState(on_hand=4, on_order=1)): -26.772053111160577,
NonTerminal(state=InventoryState(on_hand=5, on_order=0)): -27.772053111160577}
True Optimal Policy
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
Now we compare values by state for the State Value Function
[(-26.98163492786722, -26.627145648076866),
(-19.9909558006178, -19.771390238869504),
(-18.86849301795165, -18.514480015418165),
(-19.934022635857023, -18.498951466322506),
(-20.91848526863397, -21.064211166622616),
(-22.77205311116058, -23.87449284911821),
(-20.9909558006178, -20.08614993159479),
(-19.86849301795165, -19.28700559050116),
(-20.934022635857026, -20.436319583369052),
(-21.918485268633972, -22.639086837096524),
(-23.77205311116058, -24.284458257076704),
(-20.86849301795165, -20.353379740767895),
(-21.934022635857023, -21.25757526657738),
(-22.91848526863397, -22.75024821444209),
(-24.77205311116058, -24.809513898197483),
(-22.934022635857026, -22.718947367168315),
(-23.918485268633972, -23.388534356752288),
(-25.772053111160584, -26.345027269838162),
(-24.91848526863397, -24.387050038557934),
(-26.772053111160577, -25.347216865676412),
(-27.772053111160577, -29.056513846485817)]
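To summarize the comparison in a couple of numbers, the sketch below computes the largest and the mean absolute difference between the two columns. The pairs are copied from the list above as (true value, GLIE MC estimate) and rounded to four decimals; the variable names are chosen for this example only.

```python
from typing import List, Tuple

# (true value, GLIE MC estimate) per state, in the order listed above,
# rounded to four decimals.
pairs: List[Tuple[float, float]] = [
    (-26.9816, -26.6271), (-19.9910, -19.7714), (-18.8685, -18.5145),
    (-19.9340, -18.4990), (-20.9185, -21.0642), (-22.7721, -23.8745),
    (-20.9910, -20.0861), (-19.8685, -19.2870), (-20.9340, -20.4363),
    (-21.9185, -22.6391), (-23.7721, -24.2845), (-20.8685, -20.3534),
    (-21.9340, -21.2576), (-22.9185, -22.7502), (-24.7721, -24.8095),
    (-22.9340, -22.7189), (-23.9185, -23.3885), (-25.7721, -26.3450),
    (-24.9185, -24.3871), (-26.7721, -25.3472), (-27.7721, -29.0565),
]

abs_errors = [abs(true - est) for true, est in pairs]
max_error = max(abs_errors)
worst_index = abs_errors.index(max_error)  # position in the list above
mean_error = sum(abs_errors) / len(abs_errors)

print(f"max abs error  = {max_error:.4f} (pair index {worst_index})")
print(f"mean abs error = {mean_error:.4f}")
```

The largest gap is roughly 1.4 and occurs at the fourth pair, which corresponds to state (on_hand=0, on_order=3); the average gap is roughly 0.6. The estimates are therefore close to the true values but not yet fully converged.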
We also compare values by state for the Policy
[(true_opt_pol.policy_map[s.state], opt_pol.policy_map[s.state]) for s in si_mdp.non_terminal_states]
[(Constant(value=2), Constant(value=2)),
(Constant(value=2), Constant(value=2)),
(Constant(value=1), Constant(value=1)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=2), Constant(value=2)),
(Constant(value=1), Constant(value=1)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=1), Constant(value=1)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=1), Constant(value=1)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0)),
(Constant(value=0), Constant(value=0))]
Let us visualize the convergence of the Action Value Function (Q) for each of the states:
Tabular(values_map={(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): -16.66003352275379, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): -8.84712186006307, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): -3.484852528806128, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): -3.0417892530956645, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): -3.323154957160499, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): -5.436978480160036, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): -2.5474975726878393, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): -2.8275062369570962, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): -4.466077967978892, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): -9.695543163249456, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): -2.933739288365049, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): -5.749167329955894, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): -2.8750300064628984, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): -3.166028228334391, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): 0.0, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): -6.3666719406671, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): -2.5812832857338877, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): -4.707597321874659, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 2): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): -8.07172866454282, 
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): -3.0339476206022704, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): -10.993906420542446, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): -2.844003683042871, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): -2.7911653293309358, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): -6.316763168951876, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): 0.0, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): -5.623645309953783, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): -2.4301257768173894, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): -3.259941800669189, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): -12.446739523987436, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): -7.468555067559018, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): -3.6864546028819642, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): -2.716748543151219, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): -2.5805663199524878, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): -5.029640798466062, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): -5.72700024482039, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): -3.5178091425937676, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): -8.58102363687888, (NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): 0.0, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): -2.8123996539699156, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): -9.664140042001613, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): -5.957203516680925, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 
0): -6.901471514837327, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): -16.37093741815397, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): -3.76606713446086, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): -8.303203454300261, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): -18.762667910359255, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): -22.132186839172704, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): -22.579487794364617}, counts_map={(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): 8, (NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): 10, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): 3, (NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): 9, (NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): 3, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): 2, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): 3, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): 2, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): 4, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): 1, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): 4, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): 2, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): 4, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): 3, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): 2, (NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): 1, (NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): 10, (NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): 1, 
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): 2, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): 2, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): 2, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): 2, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): 4, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): 1, (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): 1, (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): 3, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): 2, (NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): 1, (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): 1}, count_to_weight_func=<function learning_rate_schedule.<locals>.lr_func at 0x7ff9e1b4bc20>)
{(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
0): -16.66003352275379,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
2): -8.84712186006307,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
3): -3.484852528806128,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
4): -3.0417892530956645,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
5): -3.323154957160499,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
0): -5.436978480160036,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
2): -2.5474975726878393,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
3): -2.8275062369570962,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
4): -4.466077967978892,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 0): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)),
1): -9.695543163249456,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)),
2): -2.933739288365049,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 3): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)),
0): -5.749167329955894,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)),
1): -2.8750300064628984,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)),
2): -3.166028228334391,
(NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 0): 0.0,
(NonTerminal(state=InventoryState(on_hand=0, on_order=4)),
1): -6.3666719406671,
(NonTerminal(state=InventoryState(on_hand=0, on_order=5)),
0): -2.5812832857338877,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)),
0): -4.707597321874659,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 2): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)),
3): -8.07172866454282,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)),
4): -3.0339476206022704,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)),
0): -10.993906420542446,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)),
2): -2.844003683042871,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)),
3): -2.7911653293309358,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)),
0): -6.316763168951876,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 1): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 2): 0.0,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)),
0): -5.623645309953783,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)),
1): -2.4301257768173894,
(NonTerminal(state=InventoryState(on_hand=1, on_order=4)),
0): -3.259941800669189,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
0): -12.446739523987436,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
1): -7.468555067559018,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
2): -3.6864546028819642,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)),
3): -2.716748543151219,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)),
0): -2.5805663199524878,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)),
1): -5.029640798466062,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)),
2): -5.72700024482039,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)),
0): -3.5178091425937676,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)),
1): -8.58102363687888,
(NonTerminal(state=InventoryState(on_hand=2, on_order=3)), 0): 0.0,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)),
0): -2.8123996539699156,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)),
1): -9.664140042001613,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)),
2): -5.957203516680925,
(NonTerminal(state=InventoryState(on_hand=3, on_order=1)),
0): -6.901471514837327,
(NonTerminal(state=InventoryState(on_hand=3, on_order=1)),
1): -16.37093741815397,
(NonTerminal(state=InventoryState(on_hand=3, on_order=2)),
0): -3.76606713446086,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)),
0): -8.303203454300261,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)),
1): -18.762667910359255,
(NonTerminal(state=InventoryState(on_hand=4, on_order=1)),
0): -22.132186839172704,
(NonTerminal(state=InventoryState(on_hand=5, on_order=0)),
0): -22.579487794364617}
{(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 1): 8,
(NonTerminal(state=InventoryState(on_hand=4, on_order=1)), 0): 10,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 0): 3,
(NonTerminal(state=InventoryState(on_hand=4, on_order=0)), 1): 9,
(NonTerminal(state=InventoryState(on_hand=3, on_order=1)), 0): 3,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 2): 2,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 1): 3,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 1): 2,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 3): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 1): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 0): 4,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 5): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=5)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 1): 4,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 3): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 0): 2,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 0): 4,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 3): 3,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 0): 2,
(NonTerminal(state=InventoryState(on_hand=3, on_order=2)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=5, on_order=0)), 0): 10,
(NonTerminal(state=InventoryState(on_hand=0, on_order=3)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=2)), 0): 2,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 2): 2,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 4): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=4)), 1): 2,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 1): 2,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 0): 4,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 3): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=0)), 4): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=3)), 1): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 4): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=4)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 2): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=2)), 1): 3,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 2): 2,
(NonTerminal(state=InventoryState(on_hand=3, on_order=0)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=1)), 0): 1,
(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 3): 1}
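The two dictionaries above read as action-value estimates and visit counts, both keyed by (state, action). Pairs that show a value of 0.0 have no entry in the count dictionary, which suggests they were never updated and still hold their initial value. A minimal sketch of how such a pair of tables could be maintained is given below, assuming an every-visit Monte-Carlo update with step size 1/N(s, a). The function name `mc_update`, the string state labels, and the two-step episode are hypothetical and are not taken from the notebook.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Tuple

# A (state, action) key; any hashable state works for this sketch.
SA = Tuple[Hashable, int]


def mc_update(
    q: DefaultDict[SA, float],
    counts: DefaultDict[SA, int],
    episode: List[Tuple[Hashable, int, float]],
    gamma: float,
) -> None:
    """Every-visit MC update with step size 1/N(s, a), i.e. a running mean."""
    g = 0.0
    # Walk the episode backwards so the discounted return builds up in one pass.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        key = (state, action)
        counts[key] += 1
        q[key] += (g - q[key]) / counts[key]


q: DefaultDict[SA, float] = defaultdict(float)   # unvisited pairs read as 0.0
counts: DefaultDict[SA, int] = defaultdict(int)

# Hypothetical episode of (state, action, reward) triples.
episode = [("s0", 1, -2.0), ("s1", 0, -1.0)]
mc_update(q, counts, episode, gamma=0.9)
print(dict(q))
print(dict(counts))
```

With the 1/N step size each entry is the plain average of the returns observed for that pair, so a pair with a count of 1 simply stores the single return it has seen.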
print('capacity =', capacity)
print('poisson_lambda =', poisson_lambda)
print('holding_cost =', holding_cost)
print('stockout_cost =', stockout_cost)
print('gamma =', gamma)
capacity = 5
poisson_lambda = 1.0
holding_cost = 1.0
stockout_cost = 10.0
gamma = 0.9
where: -
We set
SimpleInventoryDeterministicPolicy(action_for=<function SimpleInventoryDeterministicPolicy.__init__.<locals>.action_for at 0x7ff9e1b8e440>)
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), action=3, next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=3)), reward=-2.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=3)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), action=3, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=3)), reward=-2.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=3)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), reward=-2.0)]
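The actions in the trace above are consistent with a reorder-point rule that orders max(r - (on_hand + on_order), 0) units. The sketch below checks that reading with r = 5. Both the rule and the value 5 are inferred from the printed actions rather than stated in the text, and the `InventoryState` dataclass here is only a stand-in for the notebook's own class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryState:
    """Stand-in for the notebook's state type (assumption)."""
    on_hand: int
    on_order: int


def reorder_action(state: InventoryState, reorder_point: int) -> int:
    """Order enough to lift the inventory position up to the reorder point."""
    return max(reorder_point - (state.on_hand + state.on_order), 0)


# (on_hand, on_order) pairs and actions copied from the printed trace.
trace_states = [(2, 0), (1, 3), (4, 1), (2, 0), (2, 3)]
trace_actions = [3, 1, 0, 3, 0]

# reorder_point=5 is an inferred value, not one given in the source.
predicted = [reorder_action(InventoryState(h, o), reorder_point=5)
             for h, o in trace_states]
print(predicted == trace_actions)
```

A short trace cannot identify the rule uniquely, so this is a plausibility check and not a reconstruction of the policy's definition.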
where: -
We set
SimpleInventoryDeterministicPolicy(action_for=<function SimpleInventoryDeterministicPolicy.__init__.<locals>.action_for at 0x7ff9e1b8e680>)
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-4.678794411714423),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-10.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-3.6787944117144233),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-10.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-3.6787944117144233)]
where: -
We set
SimpleInventoryDeterministicPolicy(action_for=<function SimpleInventoryDeterministicPolicy.__init__.<locals>.action_for at 0x7ff9decef4d0>)
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), reward=-5.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=5, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), reward=-5.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-3.2333692644293284)]
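The non-integer rewards in these traces occur on transitions that end with on_hand=0. They are consistent with a reward of minus the holding cost on the units on hand, minus the stockout cost times the expected shortage E[max(D - ip, 0)], where D is Poisson demand and ip = on_hand + on_order. The sketch below reproduces the printed numbers under that reading, using the parameter values printed earlier (holding_cost = 1.0, stockout_cost = 10.0, poisson_lambda = 1.0). The function names are hypothetical, and the formula is an interpretation of the output, not code taken from the notebook.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(D = k) for D ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_shortage(ip: int, lam: float) -> float:
    """E[max(D - ip, 0)] in closed form for D ~ Poisson(lam)."""
    tail = 1.0 - sum(poisson_pmf(k, lam) for k in range(ip))  # P(D >= ip)
    return tail * (lam - ip) + ip * poisson_pmf(ip, lam)


def reward_when_sold_out(on_hand: int, on_order: int,
                         holding_cost: float = 1.0,
                         stockout_cost: float = 10.0,
                         lam: float = 1.0) -> float:
    """Reward attached to a transition whose next state has on_hand == 0."""
    ip = on_hand + on_order
    return -holding_cost * on_hand - stockout_cost * expected_shortage(ip, lam)


# States taken from the printed traces; compare with the rewards shown there.
for on_hand, on_order in [(3, 0), (1, 0), (0, 1), (0, 0)]:
    print((on_hand, on_order), reward_when_sold_out(on_hand, on_order))
```

Transitions that leave stock on hand show only the holding cost, for example -4.0 from a state with on_hand=4.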
GLIE MC Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=4)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=3, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=3, on_order=1)), reward=-3.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=3, on_order=1)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), reward=-3.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), reward=-4.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=4, on_order=0)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-4.0)]
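A "For State ...: Do Action ..." listing like the one above can be read off an action-value table by taking, for each state, the action with the largest estimate. The sketch below shows that greedy extraction on a small table. The function name `greedy_policy` is hypothetical, states are written as plain (on_hand, on_order) tuples, and the values are rounded from the early estimate table shown earlier, so they illustrate the mechanics and do not reproduce the 10000-episode result.

```python
from typing import Dict, Hashable, Tuple


def greedy_policy(
    q: Dict[Tuple[Hashable, int], float]
) -> Dict[Hashable, int]:
    """Return, for each state, the action with the highest action value."""
    best: Dict[Hashable, Tuple[int, float]] = {}
    for (state, action), value in q.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}


# Illustrative values only, keyed by ((on_hand, on_order), action).
q_table = {
    ((3, 1), 0): -6.9,
    ((3, 1), 1): -16.4,
    ((4, 0), 0): -8.3,
    ((4, 0), 1): -18.8,
}

policy = greedy_policy(q_table)
for (on_hand, on_order), action in sorted(policy.items()):
    print(f"For State (on_hand={on_hand}, on_order={on_order}): "
          f"Do Action {action}")
```

Because every reward in this problem is zero or negative, a pair that was never visited and still holds an initial value of 0.0 would win this argmax, which is one reason the estimates need enough episodes before the greedy policy is trusted.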
GLIE MC True Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 2
For State InventoryState(on_hand=0, on_order=2): Do Action 1
For State InventoryState(on_hand=0, on_order=3): Do Action 1
For State InventoryState(on_hand=0, on_order=4): Do Action 0
For State InventoryState(on_hand=0, on_order=5): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 2
For State InventoryState(on_hand=1, on_order=1): Do Action 1
For State InventoryState(on_hand=1, on_order=2): Do Action 1
For State InventoryState(on_hand=1, on_order=3): Do Action 0
For State InventoryState(on_hand=1, on_order=4): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 1
For State InventoryState(on_hand=2, on_order=1): Do Action 1
For State InventoryState(on_hand=2, on_order=2): Do Action 0
For State InventoryState(on_hand=2, on_order=3): Do Action 0
For State InventoryState(on_hand=3, on_order=0): Do Action 1
For State InventoryState(on_hand=3, on_order=1): Do Action 0
For State InventoryState(on_hand=3, on_order=2): Do Action 0
For State InventoryState(on_hand=4, on_order=0): Do Action 0
For State InventoryState(on_hand=4, on_order=1): Do Action 0
For State InventoryState(on_hand=5, on_order=0): Do Action 0
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=4)), action=0, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-0.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=1)), reward=-2.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=1)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-2.2333692644293284),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), action=2, next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=2)), reward=-0.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=2)), action=1, next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0)]
fig, axs = plt.subplots(figsize=(16, 15))
axs.set_xlabel('Steps', fontsize=20)
axs.set_title('Cumulative Reward', fontsize=24)
for i, cum_r in enumerate(plot_list):
    axs.plot(cum_r, label=label_list[i])
# axs.set_ylim([-5000,0])
axs.legend(fontsize=15);
fig, axs = plt.subplots(figsize=(16, 15))
axs.set_xlabel('Steps', fontsize=20)
axs.set_title('Cumulative Cost', fontsize=24)
for i, cum_r in enumerate(plot_list):
    axs.plot(-cum_r, label=label_list[i])
# axs.set_ylim([0, 5000])
axs.legend(fontsize=15);
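The plotting cells rely on `plot_list` and `label_list`, which are not defined in this part of the notebook. A minimal sketch of how one curve could be built from a trace is shown below, assuming each entry of `plot_list` is a running sum of per-step rewards. The rewards are copied from the first printed trace, and the helper name `cumulative` is hypothetical.

```python
from itertools import accumulate
from typing import List


def cumulative(rewards: List[float]) -> List[float]:
    """Running sum of per-step rewards."""
    return list(accumulate(rewards))


# Per-step rewards copied from the first printed trace.
rewards = [-2.0, -1.0, -4.0, -2.0, -2.0]

cum_reward = cumulative(rewards)
# Cost is the negated reward, so the cost curve mirrors the reward curve.
cum_cost = [-r for r in cum_reward]

print(cum_reward)
print(cum_cost)
```

The notebook negates each curve with `-cum_r`, which works for a NumPy array but not for a plain Python list, so the curves there are presumably array-valued.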
Now we zoom in a bit: