cs294 note3

Posted by ZhY on September 3, 2018

Reinforcement learning introduction

Markov chain

$\hat{\mu}_t$ is a vector of probabilities, and its entry $\mu_{t,i}$ is the probability that you are in state $i$ at timestep $t$.
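To make the state-probability vector concrete, here is a minimal sketch of how it moves forward one timestep. It assumes a transition matrix `T` with `T[i][j]` equal to the probability of going to state `i` from state `j`. The 2-state numbers and the helper name `step_distribution` are made up for illustration.

```python
def step_distribution(T, mu):
    """Advance the state distribution by one timestep.

    mu_next[i] = sum_j T[i][j] * mu[j], where T[i][j] = p(s_{t+1}=i | s_t=j).
    """
    n = len(mu)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]


# Hypothetical 2-state chain; each column of T sums to 1.
T = [[0.9, 0.5],
     [0.1, 0.5]]

mu0 = [1.0, 0.0]                  # start in state 0 with probability 1
mu1 = step_distribution(T, mu0)   # distribution at timestep 1
mu2 = step_distribution(T, mu1)   # distribution at timestep 2
print(mu1, mu2)
```

Each call keeps the vector a valid probability distribution, because every column of `T` sums to 1.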

The goal of reinforcement learning

If we know the policy, we can easily turn our MDP into a Markov chain.
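One way to see this: once the policy is fixed, the pair $(s, a)$ evolves as a Markov chain with transition probability $p(s' \mid s, a)\,\pi(a' \mid s')$. Below is a minimal sketch for a tiny tabular MDP. The tables `P` and `pi`, their numbers, and the helper `induced_chain` are hypothetical, chosen only to illustrate the construction.

```python
def induced_chain(P, pi):
    """Build the Markov chain over (state, action) pairs.

    P[s][a][s2]  = p(s2 | s, a)   (MDP dynamics)
    pi[s][a]     = pi(a | s)      (fixed policy)
    Returns M with M[(s, a)][(s2, a2)] = P[s][a][s2] * pi[s2][a2].
    """
    n_states, n_actions = len(P), len(P[0])
    M = {}
    for s in range(n_states):
        for a in range(n_actions):
            M[(s, a)] = {
                (s2, a2): P[s][a][s2] * pi[s2][a2]
                for s2 in range(n_states)
                for a2 in range(n_actions)
            }
    return M


# Hypothetical MDP with 2 states and 2 actions.
P = [
    [[0.8, 0.2], [0.1, 0.9]],  # from state 0: action 0, action 1
    [[0.5, 0.5], [0.0, 1.0]],  # from state 1: action 0, action 1
]
pi = [[0.7, 0.3],              # pi(a | s=0)
      [0.4, 0.6]]              # pi(a | s=1)

M = induced_chain(P, pi)
print(M[(0, 0)])
```

Every row of `M` sums to 1, so `M` is a valid transition table with no remaining dependence on an action choice.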

Finite horizon case

The probability of $(s_t, a_t)$ is the state-action marginal distribution at timestep $t$ in this Markov chain.

Infinite horizon case

Under mild conditions (for example, an ergodic chain), the distribution over state $s$ and action $a$ converges to a stationary distribution $\mu$, that is, one that is left unchanged by one more step of the chain. $p_\theta(s,a)$ then follows this stationary distribution. As the horizon $T$ goes to infinity, the Markov chain falls into the stationary distribution and stays there for an infinitely long time, so the sum of expected rewards (averaged over $T$) is entirely dominated by the expectation under the stationary distribution. In practice, though, few reinforcement learning algorithms try to find $\mu$ explicitly.
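As a rough illustration of what converging to a stationary distribution means, the sketch below applies the transition matrix repeatedly until the distribution stops changing. The 2-state matrix `T` (with `T[i][j]` the probability of moving to state `i` from state `j`) and the helper `stationary_distribution` are assumptions for the example. Repeated application is just one simple way to approximate $\mu$, and it relies on the chain being ergodic.

```python
def stationary_distribution(T, iters=200):
    """Approximate mu with mu = T mu by applying T repeatedly.

    T[i][j] = p(next state = i | current state = j); columns sum to 1.
    """
    n = len(T)
    mu = [1.0 / n] * n  # arbitrary starting distribution
    for _ in range(iters):
        mu = [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]
    return mu


# Hypothetical ergodic 2-state chain.
T = [[0.9, 0.5],
     [0.1, 0.5]]

mu = stationary_distribution(T)
# Applying T once more should leave mu (numerically) unchanged.
mu_next = [sum(T[i][j] * mu[j] for j in range(2)) for i in range(2)]
print(mu, mu_next)
```

For this matrix, solving $\mu = T\mu$ by hand gives $\mu = (5/6, 1/6)$, which the iteration approaches regardless of the starting distribution.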

Expectations are what we care about

In reinforcement learning we almost always care about expectations rather than individual values, and this matters because expectations have nice mathematical properties. Suppose we drive a car up a mountain road: staying on the road gives a reward of +1, and driving off the cliff gives a reward of -1. This reward function is not smooth. But if we abstract the probability of falling in this dynamical system as $\psi$, then the expectation of the reward under the stationary distribution with this probability is actually smooth in $\psi$. This is very important, because it is what allows us to use gradient-based algorithms in reinforcement learning to optimize non-smooth objectives, including non-smooth dynamics, non-smooth rewards, or both.
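The car example can be sketched numerically. As a simplifying assumption, the snippet below looks at a single step where the car falls with probability `psi`, so the expected reward is $(1-\psi)(+1) + \psi(-1) = 1 - 2\psi$. The function names are made up for illustration.

```python
def reward(fell_off):
    """Non-smooth reward: -1 for driving off the cliff, +1 for staying on the road."""
    return -1.0 if fell_off else 1.0


def expected_reward(psi):
    """Expected reward when the car falls with probability psi."""
    return psi * reward(True) + (1.0 - psi) * reward(False)


def finite_difference(f, x, eps=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


value = expected_reward(0.3)                    # close to 1 - 2 * 0.3
grad = finite_difference(expected_reward, 0.3)  # close to -2
print(value, grad)
```

The reward itself only ever takes the values +1 and -1, yet the expectation has a well-defined derivative with respect to `psi`, and a gradient-based method can follow that derivative.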