---
layout: post
title: "Actor-critic introduction"
date: 2018-09-25 12:00:00
author: "ZhY"
header-img: "img/post-bg-basic.jpg"
header-mask: 0.3
catalog: true
tags:
  - 强化学习
---
Improving the policy gradient
Recall the policy gradient with reward to go:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}$$

The $\hat{Q}_{i,t}$ here is an estimate of the Q-function, and it is an estimate that uses just a single sample. Given a sampled trajectory, if we want to calculate $\hat{Q}_{i,t}$, we can sum the rewards from timestep $t$ onward. That gives an estimate of the reward to go from that timestep, but it is only a single-sample estimate. In reality, the true expected reward to go from that state is an expectation that depends on our policy and on the dynamics: there are many possible futures, all of which need to be averaged together to obtain the true reward to go.
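The single-sample reward to go described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `reward_to_go` and the two reward sequences are hypothetical, chosen only to show that different sampled futures from the same start give different estimates.

```python
def reward_to_go(rewards):
    """Single-sample estimate: out[t] = rewards[t] + rewards[t+1] + ... + rewards[-1]."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already accumulated after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


# Two hypothetical trajectories sampled from the same start state.
q_hat_a = reward_to_go([1.0, 0.0, 2.0])  # [3.0, 2.0, 2.0]
q_hat_b = reward_to_go([1.0, 3.0, 0.0])  # [4.0, 3.0, 0.0]

# Averaging over many sampled futures approaches the true expected reward to go;
# with only these two samples the estimate at t = 0 is 3.5.
avg_first_step = (q_hat_a[0] + q_hat_b[0]) / 2
```

Each individual estimate (3.0 or 4.0 at the first timestep) can sit far from the average, which is the variance problem the rest of the post addresses.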
We do not know the true reward to go. But if we knew it, we could plug it into the formula, replacing $\hat{Q}_{i,t}$ with the true $Q^\pi(s_{i,t}, a_{i,t})$. If we had some way to evaluate the true expectation, this would be a better estimate of the reward to go.
If we use the true reward to go, the gradient estimate has lower variance. We can also subtract a baseline.
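The function $Q^\pi(s_t, a_t)$ is the expected reward to go, and the baseline is the average reward. With single-sample estimates we simply averaged the sampled rewards to go; now the baseline is the average of the Q-values. This baseline is called the value function, $V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right]$, and the difference $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ is the advantage function.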
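To make the relationship between the expected reward to go, the baseline, and the advantage concrete, here is a small tabular sketch for a single state. The helper `value_and_advantage` and the numbers are hypothetical; it assumes a discrete action set and that the Q-values are already known, which is not the case in practice.

```python
def value_and_advantage(q_values, action_probs):
    """For one state s: q_values[a] = Q(s, a), action_probs[a] = pi(a | s)."""
    # The baseline V(s) is the policy-weighted average of the Q-values.
    v = sum(p * q for p, q in zip(action_probs, q_values))
    # The advantage says how much better each action is than that average.
    advantages = [q - v for q in q_values]
    return v, advantages


q_values = [1.0, 3.0]      # hypothetical Q(s, a) for two actions
action_probs = [0.5, 0.5]  # hypothetical pi(a | s)
v, adv = value_and_advantage(q_values, action_probs)  # v = 2.0, adv = [-1.0, 1.0]

# Under the policy, the advantage averages to zero by construction.
mean_adv = sum(p * a for p, a in zip(action_probs, adv))  # 0.0
```

Actions with positive advantage are better than the policy's average behaviour in that state, so the policy gradient increases their probability.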
The single-sample estimate is unbiased but has high variance. We will use a neural network to approximate these quantities, which introduces a little bias in exchange for reduced variance.
Value function fitting
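If we want to fit $A^\pi$ directly, we need to fit both $Q^\pi$ and $V^\pi$, which introduces many parameters and makes the fit hard. But there is a trick here: the Q-function is the current reward plus the expected value of the next state, and we can approximate that expectation with the single next state we actually observed.

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[V^\pi(s_{t+1})\right] \approx r(s_t, a_t) + V^\pi(s_{t+1})$$

$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$

In this way, we don't need to learn a function of both $s$ and $a$; we just need to learn $V^\pi$, which is a function of $s$ only. So we will fit $V^\pi(s)$.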
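As a rough sketch of what fitting a state-only value function buys us, the example below swaps the neural network for a lookup table, which is an assumption made here purely to keep the code short. The table is fit by averaging the sampled rewards to go observed from each state (the least-squares fit for a tabular function), and the advantage is then estimated as the reward plus the next state's value minus the current state's value. The names `fit_tabular_value` and `advantage_estimate`, the state labels, and the convention that a terminal next state has value 0 are all hypothetical.

```python
from collections import defaultdict


def fit_tabular_value(trajectories):
    """trajectories: list of [(state, reward), ...]. Returns {state: mean reward to go}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trajectory in trajectories:
        running = 0.0
        # Walk backwards to accumulate the reward to go at every visited state.
        for state, reward in reversed(trajectory):
            running += reward
            totals[state] += running
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


def advantage_estimate(v, state, reward, next_state):
    """A(s, a) ~ r + V(s') - V(s); a terminal next_state (None) is assumed to have value 0."""
    v_next = 0.0 if next_state is None else v[next_state]
    return reward + v_next - v[state]


# Two hypothetical sampled trajectories over states "s0" -> "s1".
trajectories = [
    [("s0", 1.0), ("s1", 2.0)],
    [("s0", 0.0), ("s1", 4.0)],
]
v_hat = fit_tabular_value(trajectories)             # {"s1": 3.0, "s0": 3.5}
adv = advantage_estimate(v_hat, "s0", 1.0, "s1")    # 1.0 + 3.0 - 3.5 = 0.5
```

Note that the advantage estimate never needs a table indexed by actions: the action enters only through the observed reward and next state, which is why a function of the state alone is enough.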
An actor-critic algorithm
Eligibility traces & n-step returns
Generalized advantage estimation