Policy gradients introduction
Evaluating the objective
We can approximate the objective with Monte Carlo: sample trajectories by running the policy, then average their total rewards.
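As a minimal sketch of the Monte Carlo estimate, the snippet below averages the total reward over sampled rollouts. The function `rollout_rewards` is a hypothetical stand-in for running the policy in an environment (here it just draws random rewards with mean 1.0 per step), so the names, the horizon, and the reward distribution are all assumptions made for illustration.

```python
import numpy as np

def rollout_rewards(rng, horizon=5):
    # Hypothetical stand-in for one rollout of the policy:
    # returns the per-step rewards of a single sampled trajectory.
    return rng.normal(loc=1.0, scale=0.5, size=horizon)

def estimate_objective(num_samples, horizon=5, seed=0):
    # Monte Carlo estimate: average of total trajectory reward over samples.
    rng = np.random.default_rng(seed)
    returns = [rollout_rewards(rng, horizon).sum() for _ in range(num_samples)]
    return float(np.mean(returns))

J_hat = estimate_objective(num_samples=10_000)
print(J_hat)
```

In this toy setup the true expected return is 5.0 (five steps with mean reward 1.0), and the estimate approaches that value as the number of samples grows.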
Now that we know how to evaluate the objective, how do we improve it?
Recall that $\pi_\theta(a_t \mid s_t)$ is the probability of choosing action $a_t$ conditioned on the state $s_t$.
We want to output a continuous vector, so one representation we might use is a conditional Gaussian. The neural network takes the state as input and outputs the mean of the Gaussian distribution; the variance may be fixed, in which case only the mean is learned.
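To make the Gaussian policy concrete, here is a small sketch of its log-probability and the gradient of that log-probability with respect to the parameters. It assumes a linear map `W @ s` in place of a neural network and a fixed, isotropic standard deviation `sigma`; both are simplifications chosen for illustration, and all names are hypothetical.

```python
import numpy as np

def gaussian_log_prob(W, s, a, sigma=1.0):
    # log N(a; mean = W @ s, covariance = sigma^2 * I)
    diff = a - W @ s
    d = a.shape[0]
    return float(
        -0.5 * diff @ diff / sigma**2
        - d * np.log(sigma)
        - 0.5 * d * np.log(2.0 * np.pi)
    )

def grad_log_prob(W, s, a, sigma=1.0):
    # Gradient of the log-probability with respect to W:
    # entry (i, j) is (a_i - mean_i) * s_j / sigma^2.
    return np.outer(a - W @ s, s) / sigma**2

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))      # maps a 3-dim state to a 2-dim action mean
s = rng.normal(size=3)           # example state
a = W @ s + rng.normal(size=2)   # action sampled from the policy (sigma = 1)
print(gaussian_log_prob(W, s, a))
print(grad_log_prob(W, s, a))
```

With a real network, the same quantity would be obtained by backpropagating the log-probability through the network that produces the mean.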
Because the derivation of the policy gradient never uses the Markov property, if we substitute observations $o_t$ for states $s_t$, all the math still works out. That does not necessarily mean you will get a good policy, because when the observations are not Markovian, a good policy may depend on previous history. If it is really important for your policy to consider the entire history of observations in order to succeed, then a policy conditioned on that history may work well, and you will get a good policy in your policy class. If you believe that incorporating past history is very important for success, you may want to use an RNN.
Problem of the policy gradient
Suppose we draw three samples from our policy $\pi_\theta$: one has a large negative reward and the other two have small positive rewards. The gradient update then pushes the curve (the policy distribution) to the right. If we add a constant to the reward function, the policy still moves to the right, but only a little, and its variance becomes wider. So one problem with the policy gradient is high variance: with a finite number of samples, the gradient estimate changes when the rewards are shifted by a constant.
Of course, we can add more samples to reduce this problem, but more samples mean more training time.
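The effect of shifting the reward by a constant can be checked numerically. The sketch below assumes a toy one-step problem that is not from the notes: a 1-D Gaussian policy with mean `theta` and unit variance, and a reward `-(a - 2)^2`. It forms many three-sample gradient estimates with and without a constant offset added to the reward, then compares their mean and spread.

```python
import numpy as np

def gradient_estimates(offset, num_trials=50_000, samples_per_estimate=3,
                       theta=0.0, seed=0):
    # Each row holds the actions used for one small-sample gradient estimate.
    rng = np.random.default_rng(seed)
    a = rng.normal(loc=theta, scale=1.0, size=(num_trials, samples_per_estimate))
    reward = -(a - 2.0) ** 2 + offset   # toy reward plus a constant offset
    score = a - theta                   # d/dtheta of log N(a; theta, 1)
    return (score * reward).mean(axis=1)

plain = gradient_estimates(offset=0.0)
shifted = gradient_estimates(offset=100.0)
print("no offset:   mean", plain.mean(), "std", plain.std())
print("offset +100: mean", shifted.mean(), "std", shifted.std())
```

In this toy problem both sets of estimates average to about the same gradient, since the offset does not change the expectation, but the individual three-sample estimates are spread much more widely once the large offset is added.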
Consider a one-dimensional problem with continuous state and action. The parameters here are $k$ and $\sigma$, and the policy is a Gaussian distribution determined by these two parameters. The reward function expresses that we want the final state to be 0 and that we hope to get there by taking small actions. Because the reward favors small actions, the policy gradient prefers to decrease $\sigma$ to reduce the probability of large actions. In fact, the smaller $\sigma$ is, the more the gradient prefers to decrease it. As a result, $\sigma$ decreases quickly while $k$ eventually stops moving, so convergence to the right answer is very slow. Using a larger step size to overcome the slow convergence does not make $k$ move any faster.
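The $b$ here is the weighted expected reward, that is, the expected reward weighted by gradient magnitudes. Using the average reward as the baseline is unbiased but still leaves some variance; using the weighted expected reward is unbiased and gives the lowest variance.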
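The comparison between baselines can be illustrated with a rough numerical sketch. It subtracts a constant baseline from the reward before multiplying by the score, and compares three choices: no baseline, the average reward, and the reward weighted by the squared score. The toy problem (1-D Gaussian policy with mean 0 and unit variance, reward `-(a - 2)^2`) and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = rng.normal(loc=0.0, scale=1.0, size=200_000)  # samples from the toy policy
score = actions - 0.0                  # d/dtheta of log N(a; theta, 1) at theta = 0
reward = -(actions - 2.0) ** 2         # toy reward, assumed for this sketch

def per_sample_gradient(baseline):
    # One gradient term per sample, with a constant baseline subtracted.
    return score * (reward - baseline)

b_average = reward.mean()                                    # average reward
b_weighted = (score**2 * reward).mean() / (score**2).mean()  # score-weighted reward

results = {}
for name, b in [("none", 0.0), ("average", b_average), ("weighted", b_weighted)]:
    g = per_sample_gradient(b)
    results[name] = (g.mean(), g.var())
    print(name, "baseline:", b, "mean:", g.mean(), "variance:", g.var())
```

All three choices give roughly the same mean gradient. In this example the average-reward baseline removes a large share of the variance, and the weighted baseline lowers it a little further.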