Finite Markov Decision Processes (Finite MDPs)

Notation for RL, following Sutton & Barto (control-theory symbol → RL symbol):

- states: x → s
- actions: u → a
- rewards: -cost → R
- agent (the learner and decision maker)
- environment (everything the agent interacts with)
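These elements can be sketched as a tiny agent-environment loop. Everything below is a made-up toy MDP (states, actions, and probabilities are illustrative, not from the book): the environment samples a next state and reward from the dynamics, and a random-policy agent accumulates reward.

```python
import random

# Hypothetical 2-state, 2-action finite MDP (toy numbers).
# P[(s, a)] = list of (probability, next_state, reward) triples.
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.5)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def step(state, action):
    """Environment: sample (next_state, reward) from the MDP dynamics."""
    r, cum = random.random(), 0.0
    for prob, s_next, reward in P[(state, action)]:
        cum += prob
        if r <= cum:
            return s_next, reward
    return s_next, reward  # guard against floating-point rounding

# Agent: a uniformly random policy interacting for 100 time steps.
random.seed(0)
state, total_reward = "s0", 0.0
for t in range(100):
    action = random.choice(["stay", "go"])
    state, reward = step(state, action)
    total_reward += reward
print(total_reward)
```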


The reward hypothesis: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
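The "cumulative sum" being maximized is usually the discounted return, G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + …. A minimal sketch of computing it for a finite reward sequence (the backward recursion G = R + γG):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum over k of gamma^k * R_{t+k+1}, for a finite episode."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```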


Markov chains: stochastic processes where the next state depends only on the current state, not on the history (the Markov property).
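A Markov chain is the state-transition part of an MDP with actions and rewards stripped away. A sketch with a made-up 2-state transition matrix, iterating the state distribution toward its stationary point:

```python
# Made-up 2x2 transition matrix: T[i][j] = P(next state j | current state i).
T = [[0.9, 0.1],
     [0.5, 0.5]]

def step_distribution(dist, T):
    """One step of the chain: new_dist[j] = sum over i of dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

# Start fully in state 0 and iterate; the distribution converges
# to the stationary distribution of this chain, [5/6, 1/6].
dist = [1.0, 0.0]
for _ in range(100):
    dist = step_distribution(dist, T)
print(dist)
```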