# cs294 note3

Posted by ZhY on September 3, 2018

# Reinforcement learning introduction

## Markov chain

$\hat{\mu_t}$ is the state probability vector at timestep $t$: its entry $\mu_{t,i}$ is the probability that you’re in state $i$ at timestep $t$.
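To make the state probability vector concrete, here is a minimal sketch of how it evolves over time. It assumes the usual transition operator convention, `T[i][j] = p(s_{t+1} = i | s_t = j)`, which this note does not spell out. The two-state chain and its numbers are hypothetical, chosen only for illustration.

```python
def step(mu, T):
    """Advance the state distribution by one timestep.

    mu[j] is the probability of being in state j at time t.
    T[i][j] is assumed to be p(s_{t+1} = i | s_t = j), so each column sums to 1.
    Returns mu_next with mu_next[i] = sum_j T[i][j] * mu[j].
    """
    n = len(mu)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]


# Hypothetical 2-state chain (illustrative numbers only).
T = [[0.9, 0.5],
     [0.1, 0.5]]

mu = [1.0, 0.0]  # start in state 0 with certainty
for t in range(3):
    mu = step(mu, T)  # mu now holds the distribution one timestep later
```

Each call to `step` is just a matrix-vector product, so `mu` stays a valid probability vector as long as every column of `T` sums to 1.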

## Markov Decision Process

## Partially Observed Markov Decision Process

## The goal of reinforcement learning

If we know the policy, we can easily turn our MDP into a Markov chain.

#### Finite horizon case

The probability of $s_t$ and $a_t$ is the marginal distribution at timestep $t$ in this Markov chain.
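The reduction from an MDP to a Markov chain can be sketched as follows. This sketch assumes the chain is defined over $(s, a)$ pairs, with one-step probability $p(s' \mid s, a)\,\pi(a' \mid s')$. The names `induced_chain` and `marginals`, and the toy MDP itself, are hypothetical and not part of the course material.

```python
from itertools import product


def induced_chain(P, pi):
    """Build the Markov chain over (state, action) pairs induced by a policy.

    P[s][a][s2] is p(s2 | s, a) and pi[s][a] is pi(a | s).
    Returns the list of pairs and T, where T[nxt][cur] is the probability
    of moving from pair `cur` to pair `nxt` in one step.
    """
    n_states, n_actions = len(pi), len(pi[0])
    pairs = list(product(range(n_states), range(n_actions)))
    T = {
        (s2, a2): {(s, a): P[s][a][s2] * pi[s2][a2] for (s, a) in pairs}
        for (s2, a2) in pairs
    }
    return pairs, T


def marginals(P, pi, p_s0, horizon):
    """Return the marginals p(s_t, a_t) for t = 0 .. horizon - 1."""
    pairs, T = induced_chain(P, pi)
    # Initial marginal: draw s_0 from p_s0, then a_0 from the policy.
    mu = {(s, a): p_s0[s] * pi[s][a] for (s, a) in pairs}
    out = [mu]
    for _ in range(horizon - 1):
        mu = {nxt: sum(T[nxt][cur] * mu[cur] for cur in pairs) for nxt in pairs}
        out.append(mu)
    return out


# Hypothetical toy MDP: the chosen action becomes the next state with certainty.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy
p_s0 = [1.0, 0.0]              # always start in state 0
mus = marginals(P, pi, p_s0, horizon=3)
```

Here `mus[t][(s, a)]` plays the role of the marginal of $s_t$ and $a_t$ at timestep $t$. In the finite horizon case, the expected reward at each timestep would be taken under these marginals.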

#### Infinite horizon case

Under mild conditions on the Markov chain (for example, ergodicity), the distribution over state $s$ and action $a$ converges to a stationary distribution $\mu$, one that no longer changes from one timestep to the next. So $p_\theta(s,a)$ follows this stationary distribution. Once the chain reaches the stationary distribution, it stays there forever, so as the horizon $T$ goes to infinity the sum of expected rewards is entirely dominated by the expectation under the stationary distribution. In practice, however, few reinforcement learning algorithms try to find $\mu$ explicitly.

## Expectations are what we care about

In reinforcement learning, we always care about expectations rather than individual values. This is important because it gives us some good mathematical properties.

Suppose we drive a car up a mountain: if we stay on the road we get a reward of +1, and if we drive off the cliff we get a reward of -1. This reward function is not smooth. But if we summarize the dynamics by the probability $\psi$ of falling, then the expectation of the reward under the stationary distribution is actually smooth in $\psi$.

This is very important because it is what allows us to use gradient-based algorithms in reinforcement learning to optimize non-smooth objectives, including non-smooth dynamics, non-smooth rewards, or both.
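The mountain-road example can be checked numerically. This is a minimal sketch that assumes a single decision with falling probability $\psi$, so that $E[r] = (1 - \psi)(+1) + \psi(-1) = 1 - 2\psi$. The function names are hypothetical.

```python
def reward(fell_off):
    """Non-smooth reward: +1 for staying on the road, -1 for falling."""
    return -1.0 if fell_off else 1.0


def expected_reward(psi):
    """Expected reward when the car falls with probability psi."""
    return (1.0 - psi) * reward(False) + psi * reward(True)


def finite_difference(f, x, eps=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


psi = 0.25
value = expected_reward(psi)                      # equals 1 - 2 * psi
slope = finite_difference(expected_reward, psi)   # close to -2
```

The reward itself jumps between two values, but `expected_reward` is linear in `psi`, so its derivative exists everywhere. That is the property gradient-based methods rely on.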

## Some algorithms

#### The difference between on-policy and off-policy

If you have a very fast simulator, you may not care much about sample efficiency. Even though policy gradient algorithms are on-policy, they are often very easy to parallelize. Model-based RL algorithms, in contrast, usually need to fit many different neural network models.

## Examples of specific algorithms