cs294 note2

Posted by ZhY on September 2, 2018

Terminology & notation

The output layer of this model can be discrete (a softmax over actions), continuous (outputting the mean and the variance of a Gaussian distribution), or a hybrid of the two.
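As a minimal sketch (assuming a PyTorch-style model; `feat_dim` and the layer shapes are made-up placeholders), the discrete head ends in a softmax over actions while the continuous head outputs the parameters of a Gaussian:

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Sketch of a policy output head: discrete (softmax) or continuous (Gaussian)."""
    def __init__(self, feat_dim, n_actions, discrete=True):
        super().__init__()
        self.discrete = discrete
        if discrete:
            self.logits = nn.Linear(feat_dim, n_actions)      # softmax over actions
        else:
            self.mean = nn.Linear(feat_dim, n_actions)        # Gaussian mean
            self.log_std = nn.Linear(feat_dim, n_actions)     # log-std for stability

    def forward(self, features):
        if self.discrete:
            return torch.distributions.Categorical(logits=self.logits(features))
        return torch.distributions.Normal(self.mean(features),
                                          self.log_std(features).exp())
```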

For instance, consider a cheetah chasing a gazelle. My observation is a big array of pixels. Underlying this observation there is a real physical dynamical system, which includes the positions of the cheetah and the gazelle and their velocities.

If a car drives in front of the cheetah so that the cheetah is no longer visible, the cheetah is still there, and its state does not change. The observation may not always be sufficient to fully infer the state.

A probabilistic graphical model relates the states, observations, and actions.

Observations do not have the Markov property: unlike states, the current observation by itself does not make the future conditionally independent of the past.
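In symbols, the state satisfies the Markov property:

$$p(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t),$$

whereas in general $p(o_{t+1} \mid o_t, a_t) \neq p(o_{t+1} \mid o_1, a_1, \ldots, o_t, a_t)$: past observations carry extra information about the future.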

Imitation Learning

Why doesn't it work?

The policy was probably trained well and achieved a good training error, but it is never really perfect. When we run this policy, it may make a small mistake and deviate a little from the training trajectory. It will then see a state it never saw during training, make more mistakes, and deviate further. Eventually the policy diverges.
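A known result due to Ross and Bagnell makes this intuition quantitative: if the policy errs with probability at most $\epsilon$ on states from the training distribution, the expected number of mistakes over a horizon of $T$ steps can grow as $O(\epsilon T^2)$ rather than the $O(\epsilon T)$ we might hope for, because each early mistake pushes the policy into unfamiliar states for the rest of the trajectory.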

Any tricks?

In this paper, there is a camera facing left and a camera facing right in addition to the forward-facing camera. Images from the left camera are labeled with the command the human executed plus a little turn to the right, and likewise (with a little turn to the left) for the right camera.

This idea is a special case of a stabilizing controller: the extra labels teach the policy to correct small deviations back toward the trajectory.

Because the actions influence the states, the distribution of the training data and the distribution of the data the policy actually encounters are always different. That means we are testing on a different distribution from the one we trained on, which is also called domain shift.
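In symbols: the training observations come from the expert's distribution $p_{\text{data}}(o_t)$, while at test time the policy sees observations drawn from its own distribution $p_{\pi_\theta}(o_t)$, and in general $p_{\text{data}}(o_t) \neq p_{\pi_\theta}(o_t)$.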

DAgger
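DAgger (Dataset Aggregation, from Ross et al.) attacks this problem directly: instead of hoping the policy stays on the training distribution, it collects training data from the policy's own distribution $p_{\pi_\theta}(o_t)$ and has the expert label it. The loop: (1) train $\pi_\theta(a_t \mid o_t)$ on human data $\mathcal{D}$; (2) run $\pi_\theta$ to collect a dataset of observations $\mathcal{D}_\pi$; (3) ask a human to label $\mathcal{D}_\pi$ with actions; (4) aggregate $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_\pi$ and repeat. A minimal sketch, where `train`, `rollout`, and `expert_label` are hypothetical helpers supplied by the caller:

```python
def dagger(policy, demos, train, rollout, expert_label, n_iters=10):
    """Sketch of DAgger; `train`, `rollout`, and `expert_label` are
    hypothetical helpers supplied by the caller."""
    dataset = list(demos)                      # (observation, action) pairs from the expert
    for _ in range(n_iters):
        train(policy, dataset)                 # 1. fit the policy to all data so far
        observations = rollout(policy)         # 2. run the policy, keep what it sees
        labels = [expert_label(o) for o in observations]  # 3. expert labels those states
        dataset += list(zip(observations, labels))        # 4. aggregate and repeat
    return policy
```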

Why might we fail to fit the expert?

  1. Non-Markovian behavior

The expert's behavior might not depend entirely on the current observation; past observations also affect our behavior. If we have seen a sign telling us that children are nearby, we will slow down even when the sign is no longer visible. So we need to use the whole history of observations.

We can use a recurrent network to represent our policy, processing each image in the history through shared weights, as in the sketch below.
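A minimal sketch of such a policy (PyTorch-style; the layer sizes are made-up placeholders): a small CNN with shared weights encodes each frame, and an LSTM aggregates the sequence before the action head:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch: shared CNN encoder per frame + LSTM over the history."""
    def __init__(self, n_actions, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(               # same weights applied to every frame
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, n_actions)

    def forward(self, frames):                      # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))  # encode all frames with shared weights
        out, _ = self.rnn(feats.view(b, t, -1))     # aggregate the history
        return self.head(out[:, -1])                # action logits from the last step
```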

  2. Multimodal behavior

If we want to avoid a tree, we can choose to turn left or right. With discrete actions this is fine, but with continuous actions a single Gaussian will average the two modes and the distribution becomes a mess (its mean drives straight into the tree). There are three ways to solve this problem:

1) Output a mixture of Gaussians (see the sketch after this list). But we have to pick the number of modes.

2) Implicit density model: feed noise into the model as an extra input, so that different noise samples produce different actions. The trouble is that it is harder to train.

3) Autoregressive discretization: discretize one action dimension at a time, feeding the sampled value of each dimension back into the network before predicting the next, so we never have to discretize the full joint action space.
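For option 1), a minimal sketch of a mixture-of-Gaussians head (a mixture density network; PyTorch-style, with `feat_dim` and the number of modes `k` as made-up placeholders). Training maximizes the mixture's log-probability of the expert's action:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MixtureHead(nn.Module):
    """Sketch: mixture-of-Gaussians action distribution with k modes."""
    def __init__(self, feat_dim, action_dim, k=5):
        super().__init__()
        self.k, self.action_dim = k, action_dim
        self.weights = nn.Linear(feat_dim, k)                 # mixture weights
        self.means = nn.Linear(feat_dim, k * action_dim)      # one mean per mode
        self.log_stds = nn.Linear(feat_dim, k * action_dim)   # one std per mode

    def forward(self, features):
        b = features.shape[0]
        mix = D.Categorical(logits=self.weights(features))
        comp = D.Independent(
            D.Normal(self.means(features).view(b, self.k, self.action_dim),
                     self.log_stds(features).view(b, self.k, self.action_dim).exp()),
            1)
        # Train by maximizing dist.log_prob(expert_action)
        return D.MixtureSameFamily(mix, comp)
```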

What are the problems with imitation learning?

Reward functions