There are 3 classical dynamic programming algorithms: - Policy Evaluation algorithm - calculate the Value Function of a Finite MDP evaluated with a fixed policy (this is the Prediction problem) - Equivalent to calculating the Value Function of the -implied Finite MRP - Policy Iteration algorithm - solve the MDP Control problem - use Policy Evaluation together with Policy Improvement - Value Iteration algorithm
Each of these is an iterative algorithm such that the computed Value Function converges to the true Value Function.
These algorithms depends on the idea of a fixed-point. The computed Value Function is updated each time towards the fixed-point which is the true Value Function in this setting.
Formally, the fixed-point of a function is defined to be a value such that .
This simply means that when the function under consideration is applied to the fixed-point, there is no change - the effect of the function is to just give back the value of the fixed-point.
2 Problem Statement
In this project we need to setup a mechanism what provides an iterative numerical solution to an equation like
where is the fixed-point of . We need this to solve a number of Bellman equations. For example
By choosing
we have
which resembles .
is known as the Bellman Policy Operator. This means that is a fixed-point of the Bellman Policy Operator.
3 Implementation
!python --version
Python 3.7.14
from typing import Callable, Iterable, Iterator, Optional, TypeVarimport numpy as npimport itertoolsimport matplotlib.pyplot as plt
3.1 Finding Fixed-Points
In practical terms, finding the fixed-point means solving the equation . Depending on the complexity of this may be trivial or very difficult. Visually, we need to find where and the line intersect.
3.1.1 Example 1:
Let’s take our first example:
such that
x = np.linspace(start=1, stop=2, num=10)y =1+1/xz = x
fig,axs = plt.subplots(figsize=(13,10))axs.set_xlabel('$x$', fontsize=20)axs.set_title(f'Fixed-Point for $f(x) = 1 + 1/x$', fontsize=24)axs.plot(x, y, color='r', label='$f(x)=1 + 1/x$')axs.plot(x, z, color='k', label='$f(x)=x$')axs.legend(fontsize=20);
3.1.2 Example 2:
Here is another example:
such that
x = np.linspace(start=0, stop=np.pi/2, num=10)y = np.cos(x)z = x
fig,axs = plt.subplots(figsize=(13,10))axs.set_xlabel('$x$', fontsize=20)axs.set_title(f'Fixed-Point for $f(x) = cos(x)$', fontsize=24)axs.plot(x, y, color='r', label='$f(x)=cos(x)$')axs.plot(x, z, color='k', label='$f(x)=x$')axs.legend(fontsize=20);
3.2 Iteration to find Fixed-Points
Instead of relying on an analytical technique or depending on a visual inspection to try and find the fixed-point, we could try a different approach: Using numerical principles and iteration. We pick a starting value for , say . Then we apply the function to this value to get . Then apply the function again to this result, and on and on, to get:
For a certain class of functions this approach converges to the fixed-point of the function which is the solution to the equation . The theory that studies this is known as fixed-point theory. A relevant theorem is that by Stefan Banach and is called the Banach Fixed-Point Theorem. In addition we only consider functions that have a single fixed-point.
Next, we will create a function that provides the iteration to find the fixed-point of a function. We make use of the approach followed in http://web.stanford.edu/class/cme241/.
The following function will serve our purpose:
X = TypeVar('X')Y = TypeVar('Y')
def iterate(step: Callable[[X], X], start: X) -> Iterator[X]:'''Get the fixed point of a function f by applying it to its own result repeatedly to get x, f(x), f(f(x)), f(f(f(x)))... ''' state = startwhileTrue:yield state state = step(state)
The step input accepts the function for which we want to find its fixed-point. The start input takes the value. The function is constructed in the form of a generator. Each invocation generates the next value of the iteration.
3.2.1 Example 1:
Let’s now use this function on the first example:
x =1.0gen = iterate(step=lambda y: 1+1/y, start=x); gen
# # x = f(x)np.cos(np.cos(np.cos(np.cos(np.cos(np.cos(np.cos(np.cos(np.cos(fp)))))))))
0.7390851332151607
3.3 converge function
The iterate function provides an endless stream of values. It would be nice to wrap it with another function that can specify the criterium for convergence so that the iteration can stop based on this criterium rather than a pre-specified number of iterations.
Let’s call such a function converge():
def converge(values: Iterator[X], done: Callable[[X, X], bool]) -> Iterator[X]:'''Read values from an iterator until two consecutive values satisfy the given done function or the input iterator ends. ''' a =next(values, None)if a isNone:returnyield afor b in values:yield bif done(a, b):return a = b
The converge() function has the inputs: - values: the values generated by the iterate function - done: a function that returns True if the convergence condition is satisfied
Next we call the converge() function with: - values the iterate() function - done the distance between the two latest values
3.3.1 Example 1:
x =1.0values = converge( values=iterate(lambda y: 1+1/y, x), done=lambda a, b: np.abs(a-b) <1e-3)
# # convert values iterator to a listvals =list(values); vals
The converge() function returns an iterator. It would be nice if we wrap this function even more to just return the final converged value. We also need a last() function to extract the last value of the iteration.
def last(values: Iterator[X]) -> Optional[X]:'''Return the last value of the given iterator. '''try:*_, last_element = valuesreturn last_elementexceptValueError:returnNone
def converged(values: Iterator[X], done: Callable[[X, X], bool]) -> X:'''Return the final value of the given iterator after its values have converged subject to the done function. ''' result = last(converge(values, done))if result isNone:raiseValueError("converged called on an empty iterator")return result
3.4.1 Example 1:
x =1.0converged_value = converged( values=iterate(lambda y: 1+1/y, x), done=lambda a, b: np.abs(a-b) <1e-3)converged_value
1.6181818181818182
3.4.2 Example 2:
x =0.0converged_value = converged( values=iterate(lambda y: np.cos(y), x), done=lambda a, b: np.abs(a-b) <1e-3)converged_value
0.7387603198742113
3.5 values_provider_ and almost_equal_ functions
Finally, it would be nice if we wrap the iterate function with a values_provider function that includes the function . It could also accept some parameters that might be used in . Each use case will likely have its own version of this function.
We can also wrap the function that provides the convergence test, which will be called almost_equal_floats.