


find optimal policy
a policy that acts greedily with respect to an action-value function will

does converge; it is a way to find the optimal epsilon-greedy policy
SARSA can find a good policy without the state-action values having converged


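If you follow a random policy, Q-learning will, after enough steps (every state-action pair visited often enough), converge to the optimal Q-function, i.e. the solution of the Bellman optimality equation. This would not happen with SARSA: it is on-policy, so it would learn the Q-function of the random policy instead.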
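A minimal sketch of this difference, under assumptions that are not in the notes: a made-up 3-state chain (state 2 is terminal, reward 1 for reaching it), discount 0.9, a constant step size, and a uniformly random behaviour policy. The names `step` and `run` are hypothetical helpers. On this toy problem the optimal value of "go right" in state 0 is 0.9, while its value under the random policy works out to roughly 0.71, so the two algorithms should settle on visibly different numbers.

```python
import random

# Toy setup (assumed for illustration): states 0, 1, 2 with 2 terminal.
N_STATES, TERMINAL, GAMMA, ALPHA = 3, 2, 0.9, 0.01


def step(state, action):
    """Deterministic chain: action 1 moves right, action 0 moves left (floor at 0)."""
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == TERMINAL else 0.0
    return nxt, reward


def run(algorithm, n_steps=200_000, seed=0):
    """Learn a Q-table while always behaving uniformly at random."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # terminal row stays at 0
    state, action = 0, rng.randrange(2)
    for _ in range(n_steps):
        nxt, reward = step(state, action)
        nxt_action = rng.randrange(2)  # behaviour policy: random action
        if nxt == TERMINAL:
            target = reward
        elif algorithm == "q_learning":
            # Off-policy: bootstrap from the best next action.
            target = reward + GAMMA * max(q[nxt])
        else:
            # SARSA, on-policy: bootstrap from the action actually taken next.
            target = reward + GAMMA * q[nxt][nxt_action]
        q[state][action] += ALPHA * (target - q[state][action])
        if nxt == TERMINAL:
            state, action = 0, rng.randrange(2)  # start a new episode
        else:
            state, action = nxt, nxt_action
    return q


if __name__ == "__main__":
    print("Q-learning Q(0, right):", run("q_learning")[0][1])
    print("SARSA      Q(0, right):", run("sarsa")[0][1])
```

Both runs see the same kind of random experience; only the bootstrap target differs. The max in the Q-learning target is what makes it estimate the optimal Q-function regardless of how actions are chosen.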
Q-learning vs SARSA
