cs294 note3

Posted by ZhY on September 3, 2018

Reinforcement learning introduction

Markov chain

Let $\mu_{t,i} = p(s_t = i)$. Then $\mu_t$ is a vector, and $\mu_{t,i}$ is the probability that you're in state $i$ at timestep $t$.
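As a quick illustration (not from the original notes; the transition matrix and its values are made up here, using the standard convention $\mu_{t+1} = \mathcal{T}\mu_t$ where $\mathcal{T}_{i,j} = p(s_{t+1} = i \mid s_t = j)$), repeatedly applying the transition operator propagates the state distribution forward:

```python
# Minimal sketch: propagate the state distribution of a toy 2-state Markov chain.
import numpy as np

# T[i, j] = p(s_{t+1} = i | s_t = j); columns sum to 1 (made-up numbers)
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

mu = np.array([1.0, 0.0])  # start in state 0 with probability 1
for t in range(50):
    mu = T @ mu            # mu_{t+1} = T mu_t

print(mu)  # after many steps, mu approaches the chain's stationary distribution
```

Running this loop long enough also shows the convergence to a stationary distribution discussed in the infinite horizon case below.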

Markov Decision Process

Partially Observed Markov Decision Process

The goal of reinforcement learning

If we know the policy, we can easily turn our MDP into a Markov chain.
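Writing this out (reconstructing the standard notation from the lecture), the trajectory distribution under a policy $\pi_\theta$ factorizes as

$$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),$$

and the goal of reinforcement learning is

$$\theta^* = \arg\max_\theta \; E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big].$$

The induced Markov chain is over state-action pairs, with transition probability $p\big((s_{t+1}, a_{t+1}) \mid (s_t, a_t)\big) = p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_{t+1} \mid s_{t+1})$.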

Finite horizon case

The probability of $s_t$ and $a_t$, $p_\theta(s_t, a_t)$, is the marginal distribution at timestep $t$ of this Markov chain.
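So the finite-horizon objective can be rewritten as a sum of expectations under these marginals (the standard form from the lecture):

$$\theta^* = \arg\max_\theta \sum_{t=1}^{T} E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\big[r(s_t, a_t)\big].$$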

Infinite horizon case

You can always find a distribution that the state and action will converge to: the stationary distribution $\mu$, which satisfies $\mu = \mathcal{T}\mu$, where $\mathcal{T}$ is the state-action transition operator (so $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1). As the number of timesteps goes to infinity, once the Markov chain falls into the stationary distribution and then stays there for an infinitely long time, the sum of rewards is entirely dominated by the expectation under the stationary distribution. But reinforcement learning algorithms that explicitly try to find $\mu$ are rare.
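A sketch of why the stationary distribution dominates (dividing the finite-horizon objective by $T$ and taking the limit, with $\mu$ denoting the stationary state-action distribution):

$$\theta^* = \arg\max_\theta \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\big[r(s_t, a_t)\big] = \arg\max_\theta \; E_{(s, a) \sim \mu}\big[r(s, a)\big].$$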

Expectations are what we care about

In reinforcement learning we care about expectations rather than individual values, and this is important because expectations give us good mathematical properties. Suppose we drive a car along a mountain road: if we stay on the road, we get a reward of +1, and if we drive off the cliff, we get a reward of -1. This reward function is not smooth. But if we abstract this dynamical system into a probability of falling off the cliff, then the expectation of the reward under the stationary distribution is actually smooth in that probability. This is very important, because it is what allows us to use gradient-based algorithms in reinforcement learning to optimize non-smooth objectives, including non-smooth dynamics, non-smooth rewards, or both.
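To make this concrete (my notation: let $\psi$ denote the probability of falling off the cliff), the expected reward is

$$E[r] = (1 - \psi)\cdot(+1) + \psi\cdot(-1) = 1 - 2\psi,$$

which is a smooth (here even linear) function of $\psi$, even though the reward itself is discontinuous.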

Some algorithms

Trade-off

The difference between on-policy and off-policy

Off-policy algorithms can improve the policy without generating new samples from the current policy, while on-policy algorithms need new samples each time the policy changes. But if you have a very fast simulator, maybe you don't care about sample efficiency. Policy gradient algorithms, even though they are on-policy, are often very easy to parallelize. And model-based RL algorithms always need to fit a lot of different neural network models.

Examples of specific algorithms