# Cs294_note5

Posted by ZhY Blog on September 25, 2018

layout: post title: “Acvtor-critic introduction” date: 2018-09-25 12:00:00 author: “ZhY” header-img: “img/post-bg-basic.jpg” header-mask: 0.3 catalog: true tags: - 强化学习 —

## Acvtor-critic introduction

### Review  The $\hat{Q}_{i,t}$ here is the estimate of $Q$ function, and it is an estimates that uses just a single sample. There’s a trajectory here, if we want to calculate $\hat{Q}_{i,t}$, we can sum together these rewards and it can give us an estimate of the reward to go from that timestep but it’s a single sample estimate. The reality is that the true expected reward from that state is more complex that some expectation that depends on our policy and on our dynamics, and there are many possible futures all of which needs to be average together to obtain the true reward to go. The true reward to go we do not know. But if we knew it, we can plug it in the formula so we could replace the $\hat{Q}$ with the true $Q$ if we have some ways to find the true integral and this would be a better estimate of the reward to go.

If we use the true reward to go, we can have a lower variance. We might choose to add a baseline as well. The $Q$ function is the expected reward to go, and the baseline is the average reward. So in these single sample estimates we just average together the rewards, and now the baseline is the average of the $Q$ values. And this baseline called $V$ function, $Q-V$ is the advantage function here.  In the single sample estimates, it is unbiased but has high variance. We will use neural network to introduce a little bias to reduce variance.

### Value function fitting If we want to directly fit $A^\pi$, we need to fit $s$ and $a$, which will introduce many parameters, and will become hard to fit. But there are a trick here.   In this way, we don’t need to learn a function of both $s$ and $a$, we just need to learn $V$ which is a function of only $s$. So we will fit $V^\pi$.

### Policy evaluation   ### An actor-critic algorithm #### Discount    ### Architecture design ### Practice  #### Control variates #### Eligibility traces & n-step returns   