
The DP algorithm.
Finite horizon control
Principle of optimality
Global optimality
The dynamic programming (DP) algorithm is used in the context of a decision problem: a problem or game that we want to solve by taking a series of actions at discrete time steps. The problem is divided into stages k, going from 0 up to N. This N is our planning horizon; after stage N the problem terminates.
At each stage k the environment can be in one of a set of possible states, called Xk.
At each stage we also have a set of available actions (controls), called Uk.
Finally, there is often some noise or randomness in the system, described by a disturbance Wk.
To judge whether the agent has made good decisions, we attach a cost to each action (common in control theory) or a reward (common in reinforcement learning); we always want to minimize cost or, equivalently, maximize reward. We do this by defining a policy µk that maps the state at stage k to a control uk. The full policy sequence is denoted π. The cost incurred at stage k is denoted gk, and the total cost accumulated over all stages is denoted GN. Since there is typically noise in the system, we formulate the problem as minimizing the expected total cost, so that the noise is taken into account.
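The expected-cost objective described above can be written compactly. The following is one standard way to formalize it (the symbols follow the definitions in the text; the expectation is taken over the noise sequence):

```latex
\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}, \qquad u_k = \mu_k(x_k)

J_\pi(x_0) = \mathbb{E}_{w_0,\ldots,w_{N-1}}
\Big[\, g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k,\, \mu_k(x_k),\, w_k\big) \Big]

J^*(x_0) = \min_{\pi} \, J_\pi(x_0)
```

Minimizing this expectation over all policies π gives the optimal expected cost J*(x0) from the initial state x0.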
The dynamic programming algorithm rests on the principle of optimality. The intuition is that for a policy to be globally optimal, every tail of it must be optimal as well. For example, if we know that the shortest path from Copenhagen to Berlin goes through Odense, then the Odense-to-Berlin leg of that path must itself be the shortest path from Odense to Berlin; if a shorter Odense-to-Berlin route existed, we could splice it in and improve the supposedly optimal Copenhagen-to-Berlin path.
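The shortest-path intuition can be checked on a toy example. The graph below and its edge lengths are made up for illustration; the cities mirror the Copenhagen-Odense-Berlin example in the text. We compute the shortest full path with Dijkstra's algorithm and then verify that its tail coincides with the shortest path of the subproblem:

```python
import heapq

# Hypothetical road network with illustrative distances (km).
graph = {
    "Copenhagen": {"Odense": 170, "Berlin": 440},
    "Odense":     {"Berlin": 230, "Hamburg": 180},
    "Hamburg":    {"Berlin": 290},
    "Berlin":     {},
}

def shortest_path(src, dst):
    """Dijkstra's algorithm; returns (total length, list of cities)."""
    heap = [(0, src, [src])]
    seen = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == dst:
            return dist, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph[node].items():
            heapq.heappush(heap, (dist + w, nbr, path + [nbr]))
    return float("inf"), []

full = shortest_path("Copenhagen", "Berlin")
tail = shortest_path("Odense", "Berlin")
print(full)  # (400, ['Copenhagen', 'Odense', 'Berlin'])
print(tail)  # (230, ['Odense', 'Berlin'])
```

The tail of the optimal Copenhagen-to-Berlin path (Odense to Berlin) is exactly the optimal path of the Odense subproblem, which is what the principle of optimality asserts.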
The dynamic programming algorithm solves a finite-horizon problem through the following steps.
On the first line it initialises an empty list J, which will hold the cost-to-go functions.
On the second line it initialises an empty list π, which will hold the optimal policy at each stage.
On lines 3-5, for every state x in the state space at the final stage N, it sets the terminal cost-to-go JN(x) equal to gN(x), the terminal cost associated with ending up in state x at time N.
Line 6 starts a loop that iterates backward from stage N−1 down to stage 0.
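The steps above can be sketched in code. This is a minimal illustration, not the pseudocode from the text: the concrete states, actions, dynamics f, stage cost g, and terminal cost gN below are all made-up examples, while the structure (initialise J and π, set JN = gN, loop backward) follows the lines described above:

```python
N = 3                         # planning horizon
states = [0, 1, 2]            # Xk, here the same set at every stage
actions = [-1, 0, 1]          # Uk
noise = [(0, 0.5), (1, 0.5)]  # Wk as (value, probability) pairs

def f(x, u, w):
    """Hypothetical dynamics x_{k+1} = f(x_k, u_k, w_k), clipped to the state space."""
    return max(0, min(2, x + u + w))

def g(k, x, u, w):
    """Hypothetical stage cost gk: quadratic penalty on state and control."""
    return x * x + u * u

def gN(x):
    """Hypothetical terminal cost."""
    return 10 * x * x

# Lines 1-2: lists J and pi for the cost-to-go functions and the policy.
J = [None] * (N + 1)
pi = [None] * N

# Lines 3-5: JN(x) = gN(x) for every state x at the final stage.
J[N] = {x: gN(x) for x in states}

# Line 6 onward: iterate backward from stage N-1 down to stage 0.
for k in range(N - 1, -1, -1):
    J[k], pi[k] = {}, {}
    for x in states:
        # Expected cost of each action, averaging over the noise w.
        def q(u):
            return sum(p * (g(k, x, u, w) + J[k + 1][f(x, u, w)])
                       for w, p in noise)
        best_u = min(actions, key=q)
        pi[k][x] = best_u       # optimal decision rule mu_k(x)
        J[k][x] = q(best_u)     # optimal expected cost-to-go from x at stage k

print(J[0])   # optimal expected cost-to-go from each initial state
print(pi[0])  # optimal first-stage decision rule mu_0
```

Note the backward order: each J[k] is computed from J[k+1], which is exactly why the terminal costs must be filled in first.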