- The model is characterized by a set of parameters w
- Given an input x, for the model predicts y_hat = f(x; w) for regression, or the model predicts the probability of each possible class for classification
- Define a Lost function L(y, y_hat) for regression, or L(y, P(y=a | x), P(y=b | x) ...), find the parameters w to minimize L

This problem is typically view as an optimization problem, and use gradient descent approach to solve it.

### Need for Structure Learning

However, in some cases, y is not as simple as a number or a class. For example

- For machine translation, x is a sentence in English, and y is a translated sentence in French
- For self-driving-vehicle, x is a camera image, and y is the control action on steering wheel, brake, gas pedal

In these cases, the output y can be viewed as an object. But wait, can we break down the object as multiple numbers / categories and use the classical regression / classification approach to solve it ? Not quite, because the lost function cannot be just formulated as the summation of loss of individual components. For example, two French sentences using different words may still be a very good translation of the same English sentence. So we need to generalize a bit more and introduce the concept of an object and compatibility here.

The prediction problem can be generalized as: Given input x, finding an object y such that y is the "most compatible" with x. The compatibility is a parameterized function that we are going to learn from the training data.

- The compatibility function is defined as F(x, y; w)
- During training phase, we tune parameter w such that for every sample in training data, F(x, y; w) is the maximum. In other words, F(x, Y=y; w) > F(x, Y=other_val; w)

Notice that the training process is different from classical ML in the following way.

- There are two optimization loop here. a) Given parameter w, find y_opt that maximize F(x,y,w). b) Given Lost = gap between F(x,y,w) and F(x,y_opt,w), find w that minimize the gap.
- It turns out the first optimization is solved in a problem specific way while the second optimization can be solved by classical gradient descent approach.

Rather than trying to exactly match y_hat to y in the training data, structure learning enable us to learn a more abstract relationship (ie: compatibility) between x and y so that we can output other equally-good y even it is not the same as the y in the training data. This more-generalized form is very powerful when we don't have a lot of training data. The downside of structure learning is it is compute intensive, because at inference phase it needs to solve an optimization problem which is typically expensive.

### Imitation Learning

In a typical setting of Reinforcement Learning, an digital agent observe the state from the environment, formulate its policy to determine an action that it believes will maximize its accumulative future reward, take the action, get the reward from the environment and transition to the next state. Reinforcement learning is about how the agent optimize its policy along its experience when interacting with the environment.

For an overview of Reinforcement Learning and basic algorithm, you can visit my previous blog here.

Basically, reinforcement learning is based on a trial-and-error approach to learn the lesson. This approach can be very costly because during the trial process, serious mistake can be made (imagine if we use reinforcement learning to learn self-driving car, we may crash many cars before we learn something meaningful). In practice, people rely on simulator to mimic the environment. However, coming up good simulator is not easy because it requires a very deep understanding of how the actual environment behaves, this is one of the limitation that restrict reinforcement learning to be broadly applied.

Another important design consideration is how the reward is assigned. Of course, we can use the actual reward from the environment to train the policy, but this is usually very inefficient. Imagine when playing the chess game, we only get the reward at the end when we win/lose the game. Propagating this reward all the way to each move will be very inefficient. To make the learning faster, we usually use a technique called "reward shaping", basically to assign some artificial reward along the trajectory to bias the agent towards certain desirable actions (based on domain knowledge).

One special form of reward shaping is "imitation learning" where we assign intermediate reward based on how the action "similar" to when an expert does in real-life circumstances. Lets say we collect a set of observations that the expert is taking action y in state x, and try to learn a model that the agent will bias to take action y when seeing state x. But wait, does it sound like a supervised learning problem ? Can we just train a prediction model from x to y and we are done ?

Unfortunately, it is not as simple. Expert data is typically very sparse and expensive to get, meaning we usually don't have too many data from the expert. Imagine in a self-driving program, if we want to learn how to react when the car is almost crash, we may not find any situation in the expert's observation because the expert may not run into such dangerous situation at all. On the other hand, we don't need to copy exactly when the expert did in every situation, we just need to copy those that is relevant to the situation.

"Inverse Reinforcement Learning" comes into rescue. Basically, it cuts off the reward from the environment and replace it with a "reward estimator function", which is trained from a set of expert behavior, assuming that expert behavior will achieve highest reward.

Underlying algorithm of inverse reinforcement learning is based on the "structure learning" algorithm. In this case, the x is the start state, y is the output of the trajectory of the expert which is basically the training data. And y_opt is the output of the trajectory based on the agent policy, which is learned from the reward function using Reinforcement Learning algorithm. The compatibility function is basically our reward function because we assume expert behavior achieve highest reward.

Then we bring it the structure learning algorithm below ...

The agent still need to interact with the environment (or simulator) to get its trajectory, but the environment only need to determine the next state, but not the reward. Again, there are two nested optimization loop in the algorithm

- Given a reward function (characterized by w), use classical RL to learn the optimal policy
- Use the optimal policy to interact with the environment to collect the total reward of each episode, then adjust the reward function parameter w such that the expert behavior always get the highest total reward.