!python --version
Python 3.7.15
Tabular Monte-Carlo Prediction
Kobus Esterhuysen
July 18, 2022
So far, we have always had a model of the environment, in the form of transition probabilities. In the real world we often do not have these probabilities. Instead, we only observe individual experiences of next state and reward, given that we take a specific action in a specific state. We still need a way to obtain the optimal value function or the optimal policy, and there are algorithms for exactly this need. This is the subfield of Reinforcement Learning. When we do have a model (as in previous parts), the subfield is called Dynamic Programming or Approximate Dynamic Programming.
Note that we can use Reinforcement Learning even when a model could be obtained. Sometimes the state space is so large that obtaining a model is hard, or the computations with it become intractable.
Let us repeat some points related to prediction:
- prediction is the problem of estimating the value function of an MDP given a policy \(\pi\)
- the equivalent problem is to estimate the value function of the \(\pi\)-implied MRP
In this project we choose to work with MRPs rather than MDPs, relying on the latter point. The relationship with the MRP environment is as follows:
- The environment is available as an interface that serves up individual experiences of (next state, reward), given a current state. Note the absence of an action.
- The environment might be real or simulated.
We define the agent’s experience with the environment as follows:
- atomic experience: the agent receives a single experience of (next state, reward), given the current state
- trace experience: starting from state \(S_0\), repeated interactions with the environment lead to a sequence of atomic experiences
The RL prediction problem is to estimate the value function, given a stream of atomic experiences or a stream of trace experiences.
An MRP’s value function is:
\[ V(s) = \mathbb E[G_t|S_t=s] \]
for all \(s \in \mathcal N\), for all \(t=0,1,2,...\)
where the return \(G_t\) for each \(t=0,1,2,...\) is defined as:
\[ \begin{aligned} G_t &= R_{t+1}+\gamma \cdot R_{t+2}+\gamma^2 \cdot R_{t+3}+... \\ &= \sum_{i=t+1}^\infty \gamma ^{i-t-1} \cdot R_i \\ &= R_{t+1}+\gamma \cdot G_{t+1} \end{aligned} \]
This infinite sum remains valid even if a trace experience terminates, say at \(t=T\) (\(S_T \in \mathcal T\)), because we take \(R_i=0\) for all \(i>T\).
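As a quick check of the recursion \(G_t = R_{t+1}+\gamma \cdot G_{t+1}\), here is a minimal sketch that computes the returns of a short, terminated reward sequence by walking backwards. The function name `discounted_returns` and the sample rewards are illustrative assumptions and are not part of the course code.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return [G_0, G_1, ...] where rewards[t] plays the role of R_{t+1}."""
    returns: List[float] = []
    g = 0.0  # rewards after termination are taken to be zero
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns.append(g)
    returns.reverse()  # restore chronological order
    return returns


example = discounted_returns([-1.0, -1.0, -1.0, -1.0], gamma=0.9)
print([round(g, 3) for g in example])
```

Walking backwards means each return is obtained from the next one with a single multiplication and addition, instead of re-summing the tail of the sequence for every time step.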
We make use of the approach and code used in http://web.stanford.edu/class/cme241/.
Let us set up our Inventory problem again:
Next we look at how to implement in code the simulation of experiences described above. The essential element is the TransitionStep class:
@dataclass(frozen=True)
class TransitionStep(Generic[S]): #. s -> s'r (atomic experience)
    state: NonTerminal[S]
    next_state: State[S]
    reward: float

    def add_return(self, γ: float, return_: float) -> ReturnStep[S]:
        return ReturnStep( #. s -> s'r
            self.state,
            self.next_state,
            self.reward,
            return_=self.reward + γ*return_
        )
A TransitionStep instance captures an atomic experience. It carries the state, next_state, and reward information. The add_return method allows for the incorporation of return values, making use of the class ReturnStep.
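To make the relationship between the two classes concrete, here is a simplified, self-contained sketch. It assumes that the return-carrying class simply extends the transition class with one extra `return_` field; the names `SimpleTransitionStep` and `SimpleReturnStep`, and the use of plain strings as states, are hypothetical stand-ins and not the library's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleTransitionStep:
    """Simplified stand-in for TransitionStep, with plain strings as states."""
    state: str
    next_state: str
    reward: float

    def add_return(self, gamma: float, return_: float) -> "SimpleReturnStep":
        # The return from this step is its own reward plus the discounted
        # return of the step that follows it.
        return SimpleReturnStep(
            self.state,
            self.next_state,
            self.reward,
            return_=self.reward + gamma * return_,
        )


@dataclass(frozen=True)
class SimpleReturnStep(SimpleTransitionStep):
    """A transition step that also carries the return G_t."""
    return_: float


step = SimpleTransitionStep("s0", "s1", reward=-1.0)
with_return = step.add_return(gamma=0.9, return_=-1.0)
print(with_return)
```

Because both dataclasses are frozen, `add_return` does not modify the step in place; it builds a new object that holds the same transition plus its return.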
In general, the input to an RL prediction algorithm will be either:
- a stream/sequence of atomic experiences: Iterable[TransitionStep[S]]
- a stream/sequence of trace experiences: Iterable[Iterable[TransitionStep[S]]]
As before, we make use of the class MarkovRewardProcess:
class MarkovRewardProcess(MarkovProcess[S]):
    #. transition from this state
    def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]: #s'|s or s->s'
        distribution = self.transition_reward(state)

        def next_state(distribution=distribution):
            next_s, _ = distribution.sample() #.ignores reward
            return next_s

        return SampledDistribution(next_state)

    @abstractmethod
    def transition_reward(#. transition from this state
        self,
        state: NonTerminal[S]
    ) -> Distribution[Tuple[State[S], float]]: #. s'r|s or s->s'r
        pass

    def simulate_reward(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[TransitionStep[S]]: #. sequence of atomic experiences
        state: State[S] = start_state_distribution.sample()
        reward: float = 0.

        while isinstance(state, NonTerminal):
            next_distribution = self.transition_reward(state)
            next_state, reward = next_distribution.sample()
            yield TransitionStep(state, next_state, reward) # s -> s'r
            state = next_state

    def reward_traces(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[Iterable[TransitionStep[S]]]: #. sequence of trace experiences
        while True:
            yield self.simulate_reward(start_state_distribution)
Our current focus is on the methods:
- simulate_reward()
  - operates as a step (atomic experience) generator
  - yields a sequence of (state, next state, reward) 3-tuples, i.e. a sequence of atomic experiences
- reward_traces()
  - operates as a trace (trace experience) generator
  - yields a sequence of trace experiences, each trace yielding a sequence of (state, next state, reward) atomic experiences
  - picks a start state \(S_0\) from the provided start_state_distribution
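The nesting of these two generators can be illustrated with a toy process that is independent of the inventory example. This is only a sketch: the states "A", "B" and the terminal state "T", the uniform transition rule, and the reward values are all made up for illustration, and the two functions are free-standing stand-ins for the methods above.

```python
import itertools as it
import random
from typing import Iterator, Tuple

Step = Tuple[str, str, float]  # (state, next_state, reward)


def simulate_reward(rng: random.Random, start: str) -> Iterator[Step]:
    """Yield atomic experiences until the terminal state 'T' is reached."""
    state = start
    while state != "T":
        next_state = rng.choice(["A", "B", "T"])  # toy transition rule
        reward = 0.0 if next_state == "T" else -1.0
        yield (state, next_state, reward)
        state = next_state


def reward_traces(rng: random.Random) -> Iterator[Iterator[Step]]:
    """Yield an endless stream of traces, each from a random start state."""
    while True:
        yield simulate_reward(rng, rng.choice(["A", "B"]))


rng = random.Random(0)
# The outer generator never ends, so we slice off a finite number of traces.
traces = [list(trace) for trace in it.islice(reward_traces(rng), 3)]
for trace in traces:
    print(trace)
```

Both generators are lazy: no transition is sampled until a trace is actually iterated over, which is why the outer stream can be infinite.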
[NonTerminal(state=InventoryState(on_hand=0, on_order=0)),
NonTerminal(state=InventoryState(on_hand=0, on_order=1)),
NonTerminal(state=InventoryState(on_hand=0, on_order=2)),
NonTerminal(state=InventoryState(on_hand=1, on_order=0)),
NonTerminal(state=InventoryState(on_hand=1, on_order=1)),
NonTerminal(state=InventoryState(on_hand=2, on_order=0))]
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 0.16666666666666666, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 0.16666666666666666, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 0.16666666666666666, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 0.16666666666666666, NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 0.16666666666666666, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 0.16666666666666666}
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0)
#
# this trace generator (reward_traces()) will generate 3 atomic generators (simulate_reward())
[atom_gen for atom_gen in it.islice(si_mrp.reward_traces(ssd), n_traces)]
[<generator object MarkovRewardProcess.simulate_reward at 0x7fde24b94dd0>,
<generator object MarkovRewardProcess.simulate_reward at 0x7fde24b94750>,
<generator object MarkovRewardProcess.simulate_reward at 0x7fde24bc9150>]
<generator object MarkovRewardProcess.reward_traces at 0x7fde24bc90d0>
[[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-4.678794411714423),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-3.6787944117144233),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-0.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-2.0363832351432696)],
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-2.0363832351432696),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=2)), reward=-10.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=2)), next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-0.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=2, on_order=0)), reward=-2.0)],
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=2)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-0.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), reward=-2.0363832351432696),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=2)), reward=-10.0)]]
The Monte-Carlo (MC) prediction algorithm is a popular and simple RL algorithm which:
- performs supervised learning in an incremental way
- takes as predictor points \(x\) the states encountered across the stream of input trace experiences
- takes as response points \(y\) the associated returns on the trace experiences, i.e. the returns starting from the corresponding encountered state
- predicts the expected return from any state of an MRP
The function mc_prediction() provides this prediction:
def mc_prediction(
    traces: Iterable[Iterable[mp.TransitionStep[S]]],#. seq/stream of trace experiences
    approx_0: ValueFunctionApprox[S],
    γ: float,
    episode_length_tolerance: float = 1e-6
) -> Iterator[ValueFunctionApprox[S]]:
    episodes: Iterator[Iterator[mp.ReturnStep[S]]] = \
        (returns(trace, γ, episode_length_tolerance) for trace in traces)
    f = approx_0
    yield f

    for i,episode in enumerate(episodes): #.
        print(f"\repisode {i}", end="") #.
        f = iterate.last(f.iterate_updates(
            [(step.state, step.return_)] for step in episode
        ))
        yield f
    print('\n') #.
The inputs to mc_prediction() are:
- traces: Iterable[Iterable[mp.TransitionStep[S]]]
  - stream/sequence of trace experiences
  - each trace experience is an Iterable of TransitionSteps
To prepare the stream of trace experiences we make use of two helper functions:
- fmrp_episodes_stream()
- mrp_episodes_stream()
def mrp_episodes_stream(
    mrp: MarkovRewardProcess[S],
    start_state_distribution: NTStateDistribution[S]
) -> Iterable[Iterable[TransitionStep[S]]]:
    return mrp.reward_traces(start_state_distribution)

def fmrp_episodes_stream(
    fmrp: FiniteMarkovRewardProcess[S]
) -> Iterable[Iterable[TransitionStep[S]]]:
    return mrp_episodes_stream(fmrp, Choose(fmrp.non_terminal_states))
Let us see how they behave.
<generator object MarkovRewardProcess.reward_traces at 0x7fde24bc95d0>
[<generator object MarkovRewardProcess.simulate_reward at 0x7fde24bc96d0>,
<generator object MarkovRewardProcess.simulate_reward at 0x7fde24bc92d0>]
<generator object MarkovRewardProcess.reward_traces at 0x7fde24bc9750>
[<generator object MarkovRewardProcess.simulate_reward at 0x7fde24bc98d0>,
<generator object MarkovRewardProcess.simulate_reward at 0x7fde24bc97d0>]
initial_vf_dict: Mapping[NonTerminal[InventoryState], float] = \
{s: 0. for s in si_mrp.non_terminal_states}
initial_vf_dict
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 0.0,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 0.0,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 0.0,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 0.0,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 0.0,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 0.0}
n_traces = 2
n_atoms = 4
some_traces = [list(it.islice(trace, n_atoms)) for trace in it.islice(traces, n_traces)]; some_traces
[[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0)],
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-4.678794411714423),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-3.6787944117144233),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-3.6787944117144233),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=0, on_order=1)), reward=-3.6787944117144233)]]
The returns() function calculates the accumulated discounted rewards, \(G_t\), from each state \(S_t\) in a trace experience. This is done by walking backwards from the end of the trace to the start (allowing the reuse of the calculated returns), making use of
\[ G_t = R_{t+1} + \gamma \cdot G_{t+1} \]
It makes use of iterate.accumulate to perform the backwards walk, which uses TransitionStep's add_return to return a ReturnStep. The class ReturnStep is derived from TransitionStep and adds the return_ attribute, which captures the return.
Trace experiences are of two kinds:
- episodic trace: the trace experience ends in a terminal state
- continuing trace: the trace experience does not terminate
An RL problem is:
- episodic: all the input trace experiences are episodic
- continuing: some of the input trace experiences are continuing
def returns(trace, γ, tolerance):
    trace = iter(trace)

    max_steps = round(math.log(tolerance)/math.log(γ)) if γ < 1 else None
    if max_steps is not None:
        trace = it.islice(trace, max_steps*2)

    *transitions, last_transition = list(trace)

    return_steps = iterate.accumulate(
        reversed(transitions),
        func=lambda next, curr: curr.add_return(γ, next.return_),
        initial=last_transition.add_return(γ, 0)
    )
    return_steps = reversed(list(return_steps))

    if max_steps is not None:
        return_steps = it.islice(return_steps, max_steps)

    return return_steps
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0)]
gamma: float = 0.9
episode_length_tolerance: float = 1e-6
max_steps = round(math.log(episode_length_tolerance)/math.log(gamma)) if gamma < 1 else None; max_steps
131
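The cutoff of 131 steps comes from solving \(\gamma^n = \text{tolerance}\) for \(n\), so that rewards more than \(n\) steps ahead carry a discount weight of roughly the tolerance or less. The sketch below only re-checks this arithmetic; the variable name `residual_weight` is an illustrative assumption.

```python
import math

gamma = 0.9
tolerance = 1e-6

# Number of steps n at which the discount weight gamma**n is about `tolerance`.
max_steps = round(math.log(tolerance) / math.log(gamma))

# Discount weight applied to a reward that lies max_steps steps ahead.
residual_weight = gamma ** max_steps

print(max_steps, residual_weight)
```

This is also why returns() first cuts the trace to `max_steps*2` transitions and then keeps only the first `max_steps` return steps: every retained return then has at least `max_steps` further rewards behind it, so truncating a continuing trace changes it only marginally.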
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0)]
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0)
The iterate module's accumulate() function is a wrapper around the itertools accumulate() function.
Let us investigate how the accumulate() function in the itertools module works.
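As a small self-contained illustration, the sketch below runs itertools.accumulate with a multiply function and with max, and then seeds an accumulation with a starting value. The helper `accumulate_with_initial` is a hypothetical stand-in written for this example (it prepends the seed with itertools.chain, which also works on Python 3.7, where accumulate has no `initial` keyword); it is an assumption about how such a wrapper can be built, not the library's code.

```python
import itertools
import operator
from typing import Callable, Iterable, Iterator, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")

# Running product: each output is the product of all inputs so far.
products = list(itertools.accumulate([1, 2, 3, 4], operator.mul))
print(products)

# Running maximum: each output is the largest input seen so far.
maxima = list(itertools.accumulate([3, 1, 4, 1, 5], max))
print(maxima)


def accumulate_with_initial(
    iterable: Iterable[X],
    func: Callable[[Y, X], Y],
    initial: Y,
) -> Iterator[Y]:
    """Hypothetical helper: seed the accumulation with an initial value."""
    return itertools.accumulate(itertools.chain([initial], iterable), func)


# Backward walk over rewards: the seed is the return of the last step,
# and each further element is the reward of the step before it.
gamma = 0.9
backward_returns = list(accumulate_with_initial(
    [-1.0, -1.0, -1.0],
    lambda next_return, reward: reward + gamma * next_return,
    initial=-1.0,
))
print([round(g, 3) for g in backward_returns])
```

The accumulating function always receives the running result first and the new element second, which is why the lambda in returns() names its arguments `next` and `curr` in that order.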
Multiply
<itertools.accumulate at 0x7fde24b64a50>
Max
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0)]
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0)
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0)
[TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0),
TransitionStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0)]
ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0, return_=-1.0)
return_steps = iterate.accumulate(
reversed(transitions),
func=lambda next, curr: curr.add_return(gamma, next.return_),
initial=last_transition.add_return(gamma, 0)
)
return_steps
<itertools.accumulate at 0x7fde2579aaf0>
[ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0, return_=-1.0),
ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0, return_=-1.9),
ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0, return_=-2.71),
ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0, return_=-3.439)]
[ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0, return_=-1.0),
ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0, return_=-1.9),
ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), reward=-1.0, return_=-2.71),
ReturnStep(state=NonTerminal(state=InventoryState(on_hand=1, on_order=0)), next_state=NonTerminal(state=InventoryState(on_hand=1, on_order=1)), reward=-1.0, return_=-3.439)]
The other important function called by mc_prediction() is FunctionApprox’s iterate_updates(). This method calls the update method of FunctionApprox in an iterative manner. Each call to update updates the ValueFunctionApprox for a single (state, return) data point. Here is the FunctionApprox class:
X = TypeVar('X') #. for arbitrary data types scriptX

class FunctionApprox(ABC, Generic[X]):
    @abstractmethod
    def __add__(self: F, other: F) -> F:
        pass

    @abstractmethod
    def __mul__(self: F, scalar: float) -> F:
        pass

    @abstractmethod
    def objective_gradient( #. scriptG(w_t)
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]],
        obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], np.ndarray]
    ) -> Gradient[F]:
        pass

    @abstractmethod
    def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
        pass

    def __call__(self, x_value: X) -> float:
        return self.evaluate([x_value]).item()

    @abstractmethod
    def update_with_gradient(
        self: F,
        gradient: Gradient[F]
    ) -> F:
        pass

    def update(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]]
    ) -> F:
        def deriv_func(x: Sequence[X], y: Sequence[float]) -> np.ndarray:
            return self.evaluate(x) - np.array(y)

        return self.update_with_gradient(
            self.objective_gradient(xy_vals_seq, deriv_func)
        )

    @abstractmethod
    def solve(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]],
        error_tolerance: Optional[float] = None
    ) -> F:
        pass

    @abstractmethod
    def within(self: F, other: F, tolerance: float) -> bool:
        pass

    def iterate_updates(
        self: F,
        xy_seq_stream: Iterator[Iterable[Tuple[X, float]]]
    ) -> Iterator[F]:
        return iterate.accumulate(
            xy_seq_stream,
            lambda fa, xy: fa.update(xy),
            initial=self
        )

    def rmse(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]]
    ) -> float:
        x_seq, y_seq = zip(*xy_vals_seq)
        errors: np.ndarray = self.evaluate(x_seq) - np.array(y_seq)
        return np.sqrt(np.mean(errors * errors))

    def argmax(self, xs: Iterable[X]) -> X:
        args: Sequence[X] = list(xs)
        return args[np.argmax(self.evaluate(args))]
As stated before, we perform incremental supervised learning with:
- state \(x=S_t\) represented by step.state
- return \(y=G_t\) represented by step.return_
- return estimate \(\hat y=V(S_t;w)\) represented by f: ValueFunctionApprox[S]
The loss function is:
\[ \mathcal L_{(S_t,G_t)}(w) = \frac{1}{2}\cdot [V(S_t;w)-G_t]^2 \]
where \(S_t\) is a state visited at time \(t\) in a trace experience and \(G_t\) is its associated return on the trace experience.
The gradient with respect to \(w\) is:
\[ \nabla_w \mathcal L_{(S_t,G_t)}(w) = [V(S_t;w)-G_t] \cdot \nabla_w V(S_t;w) \]
The change in the parameters is:
\[ \Delta w = \alpha \cdot [G_t-V(S_t;w)] \cdot \nabla_w V(S_t;w) \]
where the 3 factors are the:
- step size or learning rate \(\alpha\)
- return residual \(G_t-V(S_t;w)\)
- gradient of the estimate \(\nabla_w V(S_t;w)\)
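In the tabular case there is one parameter per state, so the gradient of \(V(S_t;w)\) is 1 for the visited state and 0 elsewhere, and the update reduces to \(V(S_t) \leftarrow V(S_t) + \alpha \cdot [G_t - V(S_t)]\). The sketch below assumes a count-based step size \(\alpha = 1/n\), where \(n\) is the number of visits to the state, which turns each estimate into the running average of the returns seen for that state. The function name `tabular_mc_update` and the sample data are hypothetical, and plain dictionaries are used in place of the Tabular class.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def tabular_mc_update(
    values: Dict[Hashable, float],
    counts: Dict[Hashable, int],
    state_return_pairs: Iterable[Tuple[Hashable, float]],
) -> None:
    """Apply V(s) <- V(s) + alpha * (G - V(s)) with alpha = 1 / count(s)."""
    for state, g in state_return_pairs:
        counts[state] += 1
        alpha = 1.0 / counts[state]  # step size shrinks with each visit
        values[state] += alpha * (g - values[state])  # step size times residual


values: Dict[Hashable, float] = defaultdict(float)
counts: Dict[Hashable, int] = defaultdict(int)
tabular_mc_update(values, counts, [("s0", -3.0), ("s1", -2.0), ("s0", -5.0)])
print(dict(values))
```

With this step size the first visit overwrites the initial value entirely, so the starting estimate of 0 has no lasting influence on the result.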
<generator object MarkovRewardProcess.reward_traces at 0x7fde240be950>
episodes: Iterator[Iterator[mp.ReturnStep[S]]] = \
(returns(trace, gamma, episode_length_tolerance) for trace in traces); episodes
<generator object <genexpr> at 0x7fde2409d1d0>
Tabular(values_map={NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 0.0, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 0.0, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 0.0, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 0.0, NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 0.0, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 0.0}, counts_map={}, count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7fde240c0dd0>)
<itertools.islice at 0x7fde2404f350>
for i,episode in enumerate(episodes_):
lst = [(step.state, step.return_) for step in episode]
# print(f"step.state={step.state}, step.return_={step.return_}")
print(lst)
# print(f"\repisode {i}", end="") #.
[(NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -36.80235899659813), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -38.66928777399792), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -41.85476419333102), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -44.2426455090975), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -38.04738389899722), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -41.12333407094883), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -34.58148230105426), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -37.27233229545665), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -30.302591439396274), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -33.669546043773636), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -32.21194625784357), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.703502051254606), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.138564043933535), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.51085514691012), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.81340081688411), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.03845156129965), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.17739683287247), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.220669356842276), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -26.157638827919836), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.97649379578379), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -23.664110426743736), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.205906683365903), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -20.585680301834977), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -18.785428766800614), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 
-20.87269862977846), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -22.0807762553094), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -23.423084728121555), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -24.91453858680173), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.48416019454145), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -20.894850869807804), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -23.216500966453115), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -24.685001073836794), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.2291185134693), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -24.699020570521444), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.33224507835716), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -28.146938975952402), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -30.163265528836003), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -31.292517254262226), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -31.39570446568773), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -23.77300496187525), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.414449957639167), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -28.238277730710184), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -30.264753034122425), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -31.405281149024916), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.696096374789434), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.908113292305565), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.032576533990156), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.0597579136397), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -25.978848335472527), (NonTerminal(state=InventoryState(on_hand=0, 
on_order=1)), -24.77783769306456), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -23.44338142372237), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -21.960652235564385), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -20.31317535983329), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -18.48264549790985), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -20.536272775455387), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -21.706969750505984), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -21.896633056117757), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -22.107370062353063), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -22.341522291503402), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -23.71280254611489), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -25.23644727346099), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -26.929385859401098), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -28.810428732667887), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -29.789365258519876), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -29.725535581529563), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -21.91726175725507), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -24.352513063616744), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -24.836125626240825), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.397034682807114), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -24.885594092007903), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.53954899111989), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.289727310450516), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -26.988585900500574), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -28.87620655611175), 
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -26.886013493774804), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -29.873348326416448), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -30.929961212525754), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -23.25551245836195), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -25.83945828706883), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -23.511848750393785), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -26.12427638932643), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -27.915862654807142), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -29.906514060896825), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -32.118348956552026), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -33.46483217394669), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -34.96092463771855), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -33.646811362226806), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -33.297796611680425), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -36.99755179075603), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -39.9972797675067), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -43.33031085278522), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -45.882141797379944), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -39.86904644153327), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -44.29894049059252), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -45.84728583938806), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -39.83031759932006), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -44.25590844368895), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -45.79947245393965), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 
-39.7771916154885), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -43.04534264482803), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -36.71704738314226), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -39.64518238666554), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -32.93909154073949), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -36.59899060082166), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -37.291785961864875), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -30.32420662429431), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -32.5420259879456), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -25.04669554216178), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -27.829661713513087), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -25.723185890887404), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -28.581317656541557), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -30.6459085072684), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.85234899505997), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.97061620371727), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -26.990913102225385), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -29.989903446917094), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -32.21100382990788), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.59134379799273), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -33.9903819977697), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -35.50444306958492), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -28.338270077316583), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -30.335429824637014), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -22.59492202737446), (NonTerminal(state=InventoryState(on_hand=1, 
on_order=0)), -25.105468919304954), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.69630500843392), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -25.21811667603769), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -26.909018528930766), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -27.67668725436752), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -25.55321426961455), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.30491095322236), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.00545661469151)]
[(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -30.337311265963837), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -32.597012517737596), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -33.956254758438135), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -26.61806084270904), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -28.42408623062863), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -20.4712069229207), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -22.74578546991189), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -24.161983855457656), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -25.735537617175172), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -26.37281957463908), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -27.080910638487868), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -27.867678487208742), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -29.8529760968986), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -32.05886232988733), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.42229768685879), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -33.802552985398655), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -36.447281094887394), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -35.29831853685886), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -35.132804583493815), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -34.94890019086599), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -34.74456197683507), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -34.517519516800725), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -38.352799463111914), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -40.35157358663183), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), 
-33.72397065181314), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -36.31954157407763), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -29.243935082308482), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -32.49326120256498), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -33.88140133618331), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -32.44734102718765), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.96505179497025), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.42917487028425), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.833756065077583), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.172179614847956), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.43709467014837), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.62033362048216), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.712821343075262), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -26.70447436817871), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -29.67163818686523), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -31.85737576318359), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.198423723854628), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -33.553804137616254), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -35.019356558303315), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -27.799285064781465), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -29.736557588486885), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -21.929508431652096), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -24.36612047961344), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -24.851244977348266), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -25.39027219705363), (NonTerminal(state=InventoryState(on_hand=1, 
on_order=0)), -25.989191330059587), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.765768144510652), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -29.73974238278961), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -30.821935980877345), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.04793507684769), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.187934072370293), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -31.319926747078103), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -33.68880749675345), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -35.20978610750383), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -36.89976234167092), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -38.77751371296769), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -40.86390412551965), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -42.03057876708487), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -35.58953196342763), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -39.543924403808475), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -40.56393463185023), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -33.95992736872248), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -37.73325263191387), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -38.55207710752289), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -31.72453011946988), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -34.09794098258512), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -26.775489980650136), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -29.750544422944596), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -30.83393824771622), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -32.03770916412913), 
(NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.398794169349678), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.688888619594724), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -32.98765402177192), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -34.39030087403183), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -27.10033430447981), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -30.11148256053312), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.258542387576327), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -31.398380430640362), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -33.77597825626707), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -32.33020427172516), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.834899844456373), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.2845615919355), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.673074644690082), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.993644703306288), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.238722546213182), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.399920149443062), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -31.55546683271451), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -32.798981775079156), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -25.33220197231017), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -28.14689108034463), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -26.075662965144673), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.88540950381139), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -23.562905657885516), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -26.18100628653946), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), 
-27.978895873932732), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -25.88900162468701), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.678008014413983), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.420008904904424), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -29.355565449893803), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -30.395072722104224), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -31.550080802338027), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.85698487847067), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.086878296395827), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.23120431631267), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -31.368004795902966), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -32.59069062306633), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -25.10076735896259), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -27.88974150995843), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -29.877490566620477), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -30.93456370164134), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -23.260626335157045), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -25.845140372396717), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -23.51816217853588), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.043741963134952), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -20.40549727935614), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -18.585225408490796), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -20.650250453878662), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -21.833611615420736), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -19.060908004118122), 
(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -21.178786671242356), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -22.420874079158175), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -22.689860087953527), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -22.988733431059472), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -20.344376688161162), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -22.604862986845735), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -24.005403318717484), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -25.561559243019428)]
[(NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -38.69076506175256), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -40.72709091845476), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -34.1412121316164), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -36.78314321830348), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -29.759048020337193), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -33.06560891148577), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -34.51734323498418), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -34.97884444426768), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -27.754271604741863), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -29.686542632887324), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -21.873936258763692), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -24.304373620848548), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -21.806199121260136), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -24.229110134733485), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -25.810122371926095), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.566802635473437), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -29.518669594970483), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.599861314728955), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -26.578963225571698), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -25.44463201539697), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.184264004091716), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.783855102641436), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -21.227845212141123), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -19.49894533380744), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), 
-21.6654948153416), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -22.96166090593511), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -24.40184545103901), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.002050501154454), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -23.692506766044477), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -22.237458171477837), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -20.620737510848237), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -18.82438122125979), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -20.915979134733103), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -22.128865705259003), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -23.47651745028778), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -24.973908278097532), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -25.5265647534417), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.14062750382411), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.934030559804565), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -29.92670062200507), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.05322912254516), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.082705234256373), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -30.091894704729302), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -32.324327449699226), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.717258931094225), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.042738354866444), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.29327104794669), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -32.54807894216299), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -35.05342104684777), (NonTerminal(state=InventoryState(on_hand=0, 
on_order=1)), -33.74958515014816), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -33.411989709370815), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -33.03688366406266), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -32.62009916927582), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -32.15700528617933), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.642456527183228), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.070735683854224), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.435490302377556), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.72966210073681), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.945408543358205), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -32.16156504817578), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -33.472424236702786), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -26.080471374114204), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -28.978301526793558), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -26.999452350087925), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -25.911842153748335), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.703386380037678), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.44820708893075), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -29.38689676547861), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -30.429885294976234), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -31.58876143886248), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -32.87640159873609), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -33.155575959547576), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -25.728417732830636), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -28.587130814256263), 
(NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -29.541256460284735), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.62495783174479), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -30.694397590827545), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -32.99377510091949), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -31.461089654672303), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.869216936619864), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -34.299129929577624), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -36.999033255086246), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -37.73627779993664), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -30.818086444374043), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -34.24231827152671), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -36.935909190585235), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -38.777251061602186), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -31.974723401780203), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -35.52747044642245), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -36.10120801253242), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -29.00134223614713), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -32.22371359571903), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -30.605465760005124), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.918523720322998), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -29.15525478734286), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -28.307178195142704), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.364870870475865), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -30.405412078306515), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), 
-32.67268008700724), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -34.080755652230266), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -34.49374713009666), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -27.215274588996287), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -29.087657059836687), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -21.208507844262986), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -23.56500871584776), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -25.07223190649751), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.746924340552788), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.52014436537596), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.244604850417733), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -29.160672056019703), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -30.178524506688557), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -30.15793474616143), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -22.3977052735127), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -24.886339192791887), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -25.42926576976876), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -26.03251752196529), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.702797246628098), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -24.47111426101519), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -27.190126956683546), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -29.100141062981717), (NonTerminal(state=InventoryState(on_hand=0, on_order=1)), -27.134829612519212), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -30.149810680576902), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -31.23714160603737), (NonTerminal(state=InventoryState(on_hand=0, 
on_order=2)), -23.596824006708186), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -26.218693340786874), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -28.02077037865208), (NonTerminal(state=InventoryState(on_hand=1, on_order=0)), -30.02307819850231), (NonTerminal(state=InventoryState(on_hand=1, on_order=1)), -32.24786466500257), (NonTerminal(state=InventoryState(on_hand=0, on_order=0)), -33.56831269984366), (NonTerminal(state=InventoryState(on_hand=0, on_order=2)), -26.187014110937398), (NonTerminal(state=InventoryState(on_hand=2, on_order=0)), -29.096682345486)]
Tabular(values_map={NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 0.0, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 0.0, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 0.0, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 0.0, NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 0.0, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 0.0}, counts_map={}, count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7fde240c0dd0>)
<generator object MarkovRewardProcess.reward_traces at 0x7fde240c53d0>
Tabular(values_map={NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 0.0, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 0.0, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 0.0, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 0.0, NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 0.0, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 0.0}, counts_map={}, count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7fde240c3b00>)
#
# Tabular(
# values_map={
# NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 0.0,
# NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 0.0,
# NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 0.0,
# NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 0.0,
# NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 0.0,
# NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 0.0},
# counts_map={},
# count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7f65c8d738c0>
# )
vfa_gen = mc_prediction(
    traces=episodes,
    approx_0=Tabular(values_map=initial_vf_dict),
    γ=gamma,
    episode_length_tolerance=episode_length_tolerance
)
vfa_gen
<generator object mc_prediction at 0x7fde240c5b50>
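The `episode_length_tolerance` argument controls how much of each trace contributes to the returns. The sketch below shows one common way to turn such a tolerance into a step horizon (stop once \(\gamma^n\) falls below the tolerance) and how the returns \(G_t\) follow from the truncated rewards. It is an illustration only, not the library's code: the helper names `max_steps_for` and `discounted_returns` are hypothetical, and the values \(\gamma = 0.9\) and tolerance \(10^{-6}\) are assumptions. Under those assumptions the horizon is 131 steps, which is consistent with the visit counts reported below for the four episodes (they sum to 524, i.e. 131 per episode).

```python
import math
from typing import List


def max_steps_for(gamma: float, tolerance: float) -> int:
    """Number of steps n after which gamma**n has decayed to about `tolerance`."""
    return round(math.log(tolerance) / math.log(gamma))


def discounted_returns(
    rewards: List[float], gamma: float, tolerance: float
) -> List[float]:
    """Return G_t for every step t of a (truncated) reward sequence."""
    # With gamma == 1 there is no decay, so keep the whole sequence.
    horizon = max_steps_for(gamma, tolerance) if gamma < 1 else len(rewards)
    truncated = rewards[:horizon]
    returns: List[float] = []
    g = 0.0
    # Walk backwards so that G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(truncated):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Assumed values: gamma = 0.9 and tolerance = 1e-6.
horizon = max_steps_for(0.9, 1e-6)

# Made-up rewards, chosen so the arithmetic is easy to follow by hand.
toy_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5, tolerance=1e-6)

print(horizon)
print(toy_returns)
```

Computing the returns backwards keeps the cost linear in the episode length, since each \(G_t\) reuses \(G_{t+1}\).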
episode 0
episode 1
episode 2
episode 3
Tabular(values_map={NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -33.93609296741643, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -28.2923333935379, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -26.654802007458628, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.400756560600964, NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.143275761034605, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -29.5809835739014}, counts_map={NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 67, NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 66, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 97, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 87, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 120, NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 87}, count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7fde240c35f0>)
#
# Tabular(
# values_map={
# NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -34.424545927793496,
# NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -26.88378412441469,
# NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -27.126144224638786,
# NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.348748283036677,
# NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -28.678457445602575,
# NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -29.25572665163403},
# counts_map={
# NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 368,
# NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 222,
# NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 141,
# NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 143,
# NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 216,
# NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 220},
# count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7f65c8db0680>)
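The `counts_map` in the output above records how many returns have been folded into each state's value. A minimal sketch of such a tabular update is given below, assuming the weight for the n-th visit is 1/n (so each value is the plain average of the returns seen so far). The function `tabular_mc_update` and the sample data are hypothetical and only illustrate the idea; they are not the `Tabular` class itself.

```python
from typing import Dict, Hashable, Iterable, Tuple


def tabular_mc_update(
    values: Dict[Hashable, float],
    counts: Dict[Hashable, int],
    samples: Iterable[Tuple[Hashable, float]],
) -> None:
    """Fold (state, return) samples into running per-state averages, in place."""
    for state, g in samples:
        counts[state] = counts.get(state, 0) + 1
        weight = 1.0 / counts[state]  # assumed count-to-weight rule: 1/n
        old = values.get(state, 0.0)
        # Move the old estimate a fraction `weight` of the way towards g.
        values[state] = old + weight * (g - old)


# Made-up returns for two states, starting from an all-zero value table.
values: Dict[Hashable, float] = {"A": 0.0, "B": 0.0}
counts: Dict[Hashable, int] = {}
tabular_mc_update(
    values,
    counts,
    [("A", -30.0), ("B", -10.0), ("A", -28.0), ("A", -32.0)],
)
print(values)
print(counts)
```

With a weight of 1/n the first visit has weight 1, so the initial value of 0.0 is overwritten and does not bias the estimate.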
Let us now focus on the essential part of this project. We create an instance of the MRP.
This is the exact value function:
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.511,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.932,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.345,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.932,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.345,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.345}
Now we perform a Monte-Carlo Prediction. First, we generate a stream of trace experiences:
traces: Iterable[Iterable[TransitionStep[S]]] = \
    si_mrp.reward_traces(Choose(si_mrp.non_terminal_states))
traces
<generator object MarkovRewardProcess.reward_traces at 0x7fe478d55c50>
Here are the value function approximations:
vfas: Iterator[ValueFunctionApprox[InventoryState]] = \
    mc_prediction(
        traces=traces,
        # approx_0=Tabular(values_map=initial_vf_dict),
        approx_0=Tabular(),
        γ=item_gamma,
        episode_length_tolerance=1e-6
    )
vfas
<generator object mc_prediction at 0x7fe478646ed0>
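Note that `vfas` is a generator: creating it does no work, and the estimates are only computed as the stream is consumed. The sketch below illustrates this pattern with a stand-in generator of running means, using `itertools.islice` to cap an endless stream and a small `last` helper to keep only the final element. The names `running_means` and `last` are hypothetical stand-ins for illustration and are not part of the library used here.

```python
import itertools
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def running_means(samples: Iterable[float]) -> Iterator[float]:
    """Lazily yield the mean of the samples seen so far, one per sample."""
    total = 0.0
    for n, x in enumerate(samples, start=1):
        total += x
        yield total / n


def last(iterator: Iterable[T]) -> Optional[T]:
    """Consume an iterable and return its final element (None if it is empty)."""
    result = None
    for result in iterator:
        pass
    return result


# An endless stream of samples; nothing is computed on this line.
estimates = running_means(itertools.cycle([1.0, 3.0]))

# The work happens here: take the first 1000 estimates and keep the last one.
final = last(itertools.islice(estimates, 1000))
print(final)
```

This is why building the generator returns immediately, while the cell that walks through the approximations accounts for essentially all of the running time.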
episode 59998
CPU times: user 10min, sys: 12.3 s, total: 10min 12s
Wall time: 10min 18s
%%time
# last_func: ValueFunctionApprox[InventoryState] = iterate.last(it.islice(vfas, n_traces))
last_func: ValueFunctionApprox[InventoryState] = iterate.last(vfa_lst)
CPU times: user 509 µs, sys: 0 ns, total: 509 µs
Wall time: 518 µs
Tabular(values_map={NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.33718825911932, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.3204417831081, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.92510625546161, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.919237394996568, NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.51051162503372, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.34249492226053}, counts_map={NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 212530, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 212155, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 213439, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 364768, NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 153249, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 153728}, count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7fe478c00320>)
After Monte-Carlo:
{NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.342,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.32,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.511,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.925,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.337,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.919}
Exact Value Function:
{NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.345,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.345,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.511,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.932,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.345,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.932}
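The comparison shows that, with 60,000 trace experiences, the Tabular Monte-Carlo Prediction is within about 0.03 of the exact Value Function for every state. The largest gap is 0.025, for InventoryState(on_hand=2, on_order=0); the other states differ by 0.013 or less.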
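A quick way to quantify the agreement between the two tables is to compute the absolute difference per state. The sketch below copies the rounded values printed above into plain dictionaries keyed by `(on_hand, on_order)` tuples; the tuple keys and variable names are a simplification chosen for this illustration.

```python
from typing import Dict, Tuple

State = Tuple[int, int]  # (on_hand, on_order)

# Rounded values copied from the "After Monte-Carlo" table above.
mc_vf: Dict[State, float] = {
    (0, 2): -28.342, (2, 0): -30.32, (0, 0): -35.511,
    (1, 0): -28.925, (1, 1): -29.337, (0, 1): -27.919,
}

# Rounded values copied from the "Exact Value Function" table above.
exact_vf: Dict[State, float] = {
    (0, 2): -28.345, (2, 0): -30.345, (0, 0): -35.511,
    (1, 0): -28.932, (1, 1): -29.345, (0, 1): -27.932,
}

# Absolute error of the Monte-Carlo estimate for each state.
abs_errors: Dict[State, float] = {
    s: abs(mc_vf[s] - exact_vf[s]) for s in exact_vf
}

# State with the largest error, and that error.
worst_state = max(abs_errors, key=abs_errors.get)
worst_error = abs_errors[worst_state]

for state, err in sorted(abs_errors.items()):
    print(state, round(err, 3))
print("largest error:", worst_state, round(worst_error, 3))
```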
Before visualizing the convergence of the State Value Function, let us inspect one of the early approximations, together with its values and its visit counts per state:
Tabular(values_map={NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -27.47703855547519, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -28.266796323503186, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -27.692010838072004, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.81699461451744, NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -33.31945960871553, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -25.606237182246815}, counts_map={NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 22, NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 25, NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 21, NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 44, NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 10, NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 9}, count_to_weight_func=<function Tabular.<lambda>.<locals>.<lambda> at 0x7fe478c00320>)
{NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -27.47703855547519,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -28.266796323503186,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -27.692010838072004,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.81699461451744,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -33.31945960871553,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -25.606237182246815}
{NonTerminal(state=InventoryState(on_hand=1, on_order=1)): 22,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): 25,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): 21,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): 44,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): 10,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): 9}
Let us visualize how the value function for each state converges during the operation of the Monte-Carlo algorithm.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(figsize=(13, 10))
axs.set_xlabel('Iterations', fontsize=20)
axs.set_title('Convergence of value functions during Monte-Carlo', fontsize=24)
# Plot one curve per state; unpacking the items avoids reusing the name `it`
for state, values in merged_dict.items():
    axs.plot(values, label=f'{state.state}')
axs.legend(fontsize=20);
Next we visualize the number of visits for each state during the operation of the Monte-Carlo algorithm.