Reinforcement learning introduction
The state marginal p(s_t) is a vector, and each entry is the probability that you're in a particular state s at timestep t.
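A minimal sketch of this idea for a toy 2-state Markov chain (the transition probabilities are made up for illustration): the marginal p(s_t) is a probability vector, and one step of the chain is a matrix-vector product p(s_{t+1}) = T p(s_t).

```python
# T[i][j] = probability of moving to state i given the current state is j.
# These numbers are hypothetical, chosen only to illustrate the mechanics.
T = [[0.9, 0.5],
     [0.1, 0.5]]

def step(p):
    """One step of the chain: p(s_{t+1}) = T p(s_t)."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]

p = [1.0, 0.0]  # start in state 0 with probability 1
for t in range(3):
    p = step(p)
print(p)  # entry i is the probability of being in state i at timestep 3
```

Each `step` call pushes the whole distribution forward one timestep; the entries of `p` always stay nonnegative and sum to 1.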
Markov Decision Process
Partially Observed Markov Decision Process
The goal of reinforcement learning
If we know the policy, we can easily turn our MDP into a Markov chain over state-action pairs (s_t, a_t).
Finite horizon case
The probability of s_t and a_t, p(s_t, a_t), is the marginal distribution at timestep t in this Markov chain.
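The construction above can be sketched concretely. In this toy example (all numbers are made up), folding a fixed policy pi(a|s) into the MDP dynamics p(s'|s,a) gives a Markov chain over (s, a) pairs, and the marginal p(s_t, a_t) is just that chain's distribution at timestep t.

```python
# p_dyn[s2][s][a] = p(s' = s2 | s, a); pi[a][s] = pi(a | s).
# Hypothetical 2-state, 2-action MDP, used only to show the mechanics.
p_dyn = [[[0.8, 0.3], [0.2, 0.6]],
         [[0.2, 0.7], [0.8, 0.4]]]
pi = [[0.5, 0.9], [0.5, 0.1]]

def step(p_sa):
    """State-action chain: p(s', a') = sum_{s,a} pi(a'|s') p(s'|s,a) p(s,a)."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for s2 in range(2):
        for a2 in range(2):
            for s in range(2):
                for a in range(2):
                    out[s2][a2] += pi[a2][s2] * p_dyn[s2][s][a] * p_sa[s][a]
    return out

# Initial marginal: start in state 0, sample the first action from pi(.|s=0).
p_sa = [[pi[0][0], pi[1][0]], [0.0, 0.0]]
for t in range(5):
    p_sa = step(p_sa)
total = sum(sum(row) for row in p_sa)  # remains 1: it is a distribution
```

The key point is that once the policy is fixed, nothing about the (s, a) process depends on anything but the previous (s, a) pair, which is exactly the Markov property.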
Infinite horizon case
You can always find a distribution that the state and action will converge to, called the stationary distribution μ: it satisfies μ = Tμ, where T is the transition operator of the state-action Markov chain (so μ is an eigenvector of T with eigenvalue 1). As the timestep goes to infinity, once the Markov chain falls into the stationary distribution and stays there for an infinitely long time, the sum of expected rewards is entirely dominated by the expectation under the stationary distribution. But few reinforcement learning algorithms explicitly try to find μ.
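A sketch of the convergence claim, using a toy transition matrix (made-up numbers): repeatedly applying the transition operator drives any initial distribution toward the stationary distribution μ, which is then a fixed point, μ = Tμ.

```python
# Hypothetical 2-state chain; columns of T sum to 1.
T = [[0.9, 0.5],
     [0.1, 0.5]]

def step(p):
    """Apply the transition operator once: p <- T p."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]

mu = [1.0, 0.0]        # arbitrary starting distribution
for _ in range(200):   # power iteration; 200 steps is far past convergence here
    mu = step(mu)

# mu is now (numerically) stationary: applying T leaves it unchanged.
```

This is just power iteration: the stationary distribution is the eigenvector of T with eigenvalue 1, and iterating the chain washes out every other eigencomponent.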
Expectations are what we care about
In reinforcement learning, we always care about expectations, not individual values, and this is important because it gives us some good mathematical properties. Suppose we drive a car up a mountain: if we stay on the road, we get a reward of +1, and if we drive off the cliff, we get a reward of -1. This reward function is not smooth. But if we abstract this dynamical system into a probability θ of falling off the cliff, then the expectation of the reward under the distribution induced by θ is actually smooth in θ. This is very important because it is what allows us to use gradient-based algorithms in reinforcement learning to optimize non-smooth objectives, including non-smooth dynamics, non-smooth rewards, or both.
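A toy version of this driving example: the per-outcome reward is a step function (not smooth in the outcome), but its expectation under the fall probability θ is linear, hence smooth, in θ. The numbers mirror the ±1 rewards above; everything else is illustrative.

```python
def reward(fell_off_cliff):
    """Discontinuous outcome reward: +1 on the road, -1 off the cliff."""
    return -1.0 if fell_off_cliff else 1.0

def expected_reward(theta):
    """E[r] = theta * r(fall) + (1 - theta) * r(road) = 1 - 2*theta."""
    return theta * reward(True) + (1 - theta) * reward(False)

# The expectation varies smoothly as theta changes, so gradients exist
# even though the underlying reward is a step function.
vals = [expected_reward(t / 10) for t in range(11)]
```

Here d/dθ E[r] = -2 everywhere, which is exactly the kind of well-defined gradient that a non-smooth pointwise reward would never give you.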
The difference between on-policy and off-policy
If you have a very fast simulator, maybe you don't care about sample efficiency. Policy gradient algorithms, even though they are on-policy, are often very easy to parallelize. Model-based RL algorithms, on the other hand, typically need to fit many different neural network models.
Examples of specific algorithms