Pragmatic Programming Techniques: Reinforcement Learning Overview

There are basically 3 different types of Machine Learning:

Supervised Learning: The major use case is prediction. We provide a set of training data that includes both the input and the output, then train a model that can predict the output for an unseen input.
Unsupervised Learning: The major use case is pattern extraction. We provide a set of data that has no output, and the algorithm tries to extract the underlying non-trivial structure within the data.
Reinforcement Learning: The major use case is optimization. Mimicking how humans learn from childhood, we use a trial and error approach to find out which actions produce good outcomes, and bias our preference towards those good actions.

In this post, I will provide an overview of the settings of Reinforcement Learning as well as some of its key algorithms.

Agent / Environment Interaction

Reinforcement Learning is all about how we can make good decisions through trial and error. It is framed as the interaction between an "agent" and an "environment".

Repeat the following steps until reaching a termination condition:

The agent observes the environment, which is in state s
Out of all possible actions, the agent needs to decide which action to take. (This decision rule is called the "policy", which is a function that outputs an action given the current state)
The agent takes the action a, and the environment receives that action
Through a transition matrix model, the environment determines the next state and proceeds to that state
Through a reward distribution model, the environment determines the reward given to the agent for taking action a at state s
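
The steps above can be sketched in Python as follows. This is only a minimal illustration: the corridor environment (CorridorEnv), random_policy and run_episode are made-up names for this post, not part of any real library, and the reward numbers are arbitrary.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()
    episode = []
    for _ in range(max_steps):
        action = policy(state, rng)                  # the agent decides
        next_state, reward, done = env.step(action)  # the environment responds
        episode.append((state, action, reward))
        state = next_state
        if done:                                     # termination condition
            break
    return episode


episode = run_episode(CorridorEnv(), random_policy, random.Random(0))
```

Each entry of episode is one (state, action, reward) triple, matching the (st, at, rt) notation used in the terminology below.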

The goal for the agent is to determine an optimal policy such that the "value" of the start state is maximized.

Some terminology

Episode: a sequence of (s1, a1, r1, s2, a2, r2, s3, a3, r3 .... st, at, rt ... sT, aT, rT)
Reward rt: The payoff (think of it as money) the agent receives after taking action at at state st at time t
Return: Cumulative reward from time t onward (sum of rt, r[t+1], ... rT)
Value: Expected return at a particular state, called "state value" V(s), or expected return when taking action a at state s, called "Q Value" Q(s,a)

The optimal policy can be formulated as choosing the action a* among all choices of a at state s such that Q(s, a*) is maximum.

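To deal with never-ending interaction, we put a discount factor "gamma" (between 0 and 1) on future rewards, so that a reward k steps in the future is weighted by gamma^k. As long as gamma is less than 1 and the rewards are bounded, this discount factor turns the sum of an infinite series into a finite number.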
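
As a small illustration of the discounted return, here is a sketch that sums a list of rewards with a discount factor. The function name discounted_return is my own choice, and the constant reward stream is just an assumed example.

```python
def discounted_return(rewards, gamma):
    """Return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a list of rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # fold in rewards from the last step backwards
    return g


# A constant reward of 1 at every step sums to 1 / (1 - gamma) when gamma < 1.
gamma = 0.9
approx = discounted_return([1.0] * 1000, gamma)
limit = 1.0 / (1.0 - gamma)
```

With 1000 steps, approx is already very close to limit, which shows how the discount keeps a long (or endless) reward stream finite.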

Optimal Policy when model is known

If we know the "model" (the transition probabilities and the reward distribution), then figuring out the policy is easy. We just need to use a dynamic programming technique to compute the optimal policy offline, and there is no need for learning.

Two algorithms can be used:

"Value iteration" starts with a random value and iteratively updates the value based on Bellman's equation, finally computing the "value" of each state or state/action pair (also called Q state). The optimal policy for a given state s is to choose the action a* that maximizes the Q value, Q(s, a).

Another algorithm, "Policy iteration", starts with a random policy and iteratively modifies the policy to make it better, until the policy at the next iteration doesn't change any more.
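
Both algorithms can be sketched on a tiny example. The 3-state MDP below, its rewards and all function names are assumptions made up for illustration. For repeatability the sketch starts from zero values and an all-"stay" policy rather than random ones.

```python
# Hypothetical 3-state MDP. P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing state
}
GAMMA = 0.9


def backup(V, s, a):
    # Bellman backup: expected reward plus discounted value of the next state.
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def greedy(V):
    # For each state, pick the action with the highest backed-up value.
    return {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}


def sweep(V, choose_value, tol=1e-8):
    # Repeat in-place Bellman updates until the values stop changing.
    while True:
        delta = 0.0
        for s in P:
            new = choose_value(V, s)
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < tol:
            return V


def value_iteration():
    V = sweep({s: 0.0 for s in P}, lambda V, s: max(backup(V, s, a) for a in P[s]))
    return V, greedy(V)


def policy_iteration():
    policy = {s: "stay" for s in P}  # arbitrary starting policy
    while True:
        # Evaluate the current policy, then improve it greedily.
        V = sweep({s: 0.0 for s in P}, lambda V, s: backup(V, s, policy[s]))
        improved = greedy(V)
        if improved == policy:
            return V, policy
        policy = improved


V_vi, policy_vi = value_iteration()
V_pi, policy_pi = policy_iteration()
```

On this toy problem both routines end up with the same policy, which is what we would expect since they solve the same Bellman equation in different ways.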

However, in practice, we usually don't know the model, so we cannot compute the optimal policy as described above.

Optimal Policy when model is unknown

One solution is "model based" learning: we spend some time finding out the transition probability model as well as the reward distribution model. To make sure we experience all possible combinations of state/action pairs, we take random actions in order to learn the model.

Once we learn the model, we can go back to using value iteration or policy iteration to determine the optimal policy.
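
One simple way to learn such a model is to count what happened. The sketch below is an assumption about how this could be done, not the only way: it estimates transition probabilities and average rewards from a list of observed (s, a, r, s') samples. The function name estimate_model and the sample data are invented for illustration.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Estimate transition probabilities and mean rewards from (s, a, r, s2) samples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
    T, R = {}, {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())
        T[sa] = {s2: c / n for s2, c in next_counts.items()}  # empirical P(s2 | s, a)
        R[sa] = reward_sum[sa] / n                            # empirical mean reward
    return T, R


# Made-up experience: three tries of "go" at state 0, one try of "go" at state 1.
samples = [(0, "go", 0.0, 1), (0, "go", 0.0, 1), (0, "go", 0.0, 0), (1, "go", 1.0, 2)]
T, R = estimate_model(samples)
```

The estimates only cover state/action pairs that were actually tried, which is why random exploration matters here.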

Learning has a cost though. Rather than taking the best action, we take random actions in order to explore actions that we haven't tried before, and it is very likely that the associated reward is not the maximum. However, we accumulate knowledge about how the environment reacts under a wider range of scenarios, and hopefully this will help us choose better actions in the future. In other words, we trade off short term gain for long term gain.

Striking the right balance is important. A common approach is to use the epsilon greedy algorithm. At each decision step, we take a random action with a small probability e, and take the best known action we have explored so far with probability (1-e).
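
A minimal sketch of epsilon greedy action selection, assuming the current value estimates are kept in a dictionary from action to value. The function name and the example values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values maps each action to its current estimated value."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)         # explore: random action
    return max(actions, key=q_values.get)  # exploit: best known action


rng = random.Random(0)
q = {"left": 0.2, "right": 0.7}
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
```

With e = 0.1, most of the picks are the best known action, and the rest are spread randomly over all actions.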

Another approach is "model free" learning. Let's go back and look at the detailed formulas under value iteration and policy iteration: the reason for knowing the model is to calculate the expected value of the state value and Q value. Can we directly figure out the expected state value and Q value through trial and error?

Value based model free learning

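If we modify the Q value iteration algorithm to replace the expected reward/next state with the actual reward/next state observed from experience, we arrive at the SARSA algorithm. After each step (s, a, r, s', a'), it nudges the table entry Q(s, a) towards the sampled target r + gamma*Q(s', a') using a learning rate alpha: Q(s, a) <- Q(s, a) + alpha*[r + gamma*Q(s', a') - Q(s, a)].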
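
Here is a minimal tabular SARSA sketch on a made-up 5-cell corridor. The environment, the reward numbers, the hyperparameters and the helper names (step, choose, sarsa) are all assumptions for illustration, and actions are picked with epsilon greedy.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)  # move left or right
GOAL = 4            # hypothetical corridor with states 0..4


def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else -0.1), done


def choose(Q, state, epsilon, rng):
    # Epsilon greedy over the current Q table.
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # the Q(s, a) table, defaulting to 0
    for _ in range(episodes):
        s = 0
        a = choose(Q, s, epsilon, rng)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = choose(Q, s2, epsilon, rng)
            # The target uses the action actually chosen next (on-policy).
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q


Q = sarsa()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

The greedy dictionary reads the learned policy out of the table by taking the action with the largest Q value in each state.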

Deep Q Learning

The algorithm above requires us to keep a table to remember all Q(s,a) values, which can be huge and becomes infinite if any of the states or actions is continuous. To deal with this, we introduce the idea of a value function. The state and action become the input parameters of this function, which creates "input features" from them, feeds those into a linear model, and finally outputs the Q value.

Now we modify the previous SARSA algorithm as follows:

Instead of looking up the Q(s,a) value, we call the function (which can be a DNN), passing in the f(s, a) features, and get its output
We randomly initialize the parameters of the function (these are the weights if the function is a DNN)
We update the parameters using gradient descent on the loss, which measures the difference (for example, the squared error) between the estimated value and the target value (the target can be a one step look ahead estimation: r + gamma*max_a'[Q(s',a')] )
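
The three changes above can be sketched with a linear model in plain Python. The feature function, the learning rate and the repeated training sample are made up for illustration, and the weights start at zero (rather than random) only to keep the example repeatable.

```python
def features(state, action):
    # Hypothetical hand-made features f(s, a) for a 1-D state and an action in {0, 1}.
    return [1.0, state, float(action), state * action]


def q_value(w, state, action):
    # Linear model: Q(s, a) is the dot product of the weights and the features.
    return sum(wi * xi for wi, xi in zip(w, features(state, action)))


def td_update(w, s, a, r, s2, done, actions, alpha=0.05, gamma=0.9):
    """One gradient step on the squared error between Q(s, a) and the target."""
    target = r if done else r + gamma * max(q_value(w, s2, a2) for a2 in actions)
    error = target - q_value(w, s, a)
    # For a linear model the gradient of Q(s, a) with respect to w is just f(s, a).
    return [wi + alpha * error * xi for wi, xi in zip(w, features(s, a))]


w = [0.0, 0.0, 0.0, 0.0]
before = q_value(w, 0.5, 1)
for _ in range(200):
    # Replay the same terminal transition with reward 1 many times.
    w = td_update(w, s=0.5, a=1, r=1.0, s2=0.5, done=True, actions=(0, 1))
after = q_value(w, 0.5, 1)
```

After the repeated updates, the estimate for that state/action pair moves from before towards the target reward. Swapping the linear model for a neural network keeps the same structure, with the gradient computed by the network instead.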

If we further generalize the Q value function using a deep neural network, and update the parameters using backpropagation, then we reach a simple version of Deep Q Learning.

While this algorithm allows us to learn a Q value function that can represent a continuous state, we still need to evaluate every action and pick the one with the maximum Q value. In other words, the action space can only be discrete and finite.

Policy gradient

Since the end goal is to pick the right action, and finding out the Q value is just the means (so we can pick the action of maximum Q), why don't we learn a function that takes a state and directly outputs an action? Using this policy function approach, we can handle both continuous and discrete action spaces.

The key idea is to learn a function that, given a state, outputs an action:

If the action is discrete, it outputs a probability for each action
If the action is continuous, it outputs the mean and variance of the action, assuming a normal distribution

The agent will sample from the output distribution to determine the action, so its chosen action is stochastic (nondeterministic). Then the environment will determine the reward and next state, and the cycle repeats.
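
To make the sampling step concrete, here is a small sketch. It assumes the policy function has already produced its raw outputs (scores for the discrete case, a mean and variance for the continuous case). The numbers and helper names are invented for illustration.

```python
import math
import random


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(scores, rng):
    # Turn raw scores into probabilities, then draw one action index.
    probs = softmax(scores)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


def sample_continuous(mean, variance, rng):
    # Draw the action from a normal distribution with the given mean and variance.
    return rng.gauss(mean, math.sqrt(variance))


rng = random.Random(0)
action, probs = sample_discrete([2.0, 1.0, 0.1], rng)
torque = sample_continuous(mean=0.5, variance=0.04, rng=rng)
```

Because the action is drawn from a distribution, running the same policy on the same state can produce different actions, which is what gives the agent its built-in exploration.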

The goal is to find the best policy function, where the expected value of Q(s, a) is maximized. Notice that s and a are random variables whose distribution is parameterized by θ, the parameters of the policy function.

To maximize the "expected value" of a function with respect to parameters θ, we need to calculate the gradient of that expected value with respect to θ, and then move θ in the direction of the gradient.
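
The gradient can be written out as below. This is a sketch in my own notation, where π_θ(a|s) is the probability that the policy picks action a in state s and J(θ) is the objective. The middle line is the general "log derivative" identity for a function f that does not depend on θ, and the last line is the usual statement of the policy gradient theorem.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{s, a \sim \pi_\theta}\left[ Q(s, a) \right] \\
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right]
  &= \int f(x) \, \nabla_\theta p_\theta(x) \, dx
   = \int f(x) \, p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, dx
   = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \, \nabla_\theta \log p_\theta(x) \right] \\
\nabla_\theta J(\theta) &= \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q(s, a) \right]
\end{aligned}
```

The practical benefit is that the expectation on the right can be estimated from sampled (s, a) pairs, so we never need to differentiate through the environment.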

Actor Critic Algorithm

There are 2 moving targets in the policy gradient:

To improve the policy function, we need an accurate estimation of the Q value and also need to know the gradient of log π(a|s), the log probability of the chosen action under the policy
To make the Q value estimation more accurate, we need a stable policy function

We can break these down into two different roles:

An actor, whose job is to improve the policy function by tuning the policy function parameters
A critic, whose job is to fine tune the estimation of the Q value based on the current (incrementally improving) policy

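Together these form the "actor critic" algorithm: at each step the actor picks an action from the current policy, the critic updates its Q value estimate from the observed reward and next state, and the actor then updates the policy parameters using the critic's estimate.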
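
A minimal tabular sketch of this idea is shown here, under several assumptions of mine: a made-up 4-cell corridor, a softmax policy over per-state action preferences (theta) as the actor, a Q table updated SARSA-style as the critic, and arbitrary learning rates. It is meant to show the two roles side by side, not to be a tuned implementation.

```python
import math
import random
from collections import defaultdict

ACTIONS = (0, 1)  # hypothetical: 0 = left, 1 = right
GOAL = 3          # corridor with states 0..3


def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    done = s2 == GOAL
    return s2, (1.0 if done else -0.1), done


def policy_probs(theta, s):
    # Softmax over the action preferences of state s.
    prefs = [theta[(s, a)] for a in ACTIONS]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(theta, s, rng):
    return rng.choices(ACTIONS, weights=policy_probs(theta, s), k=1)[0]


def actor_critic(episodes=2000, lr_actor=0.1, lr_critic=0.2, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = defaultdict(float)  # actor: policy parameters
    Q = defaultdict(float)      # critic: Q value estimates
    for _ in range(episodes):
        s, done = 0, False
        a = sample_action(theta, s, rng)
        while not done:
            s2, r, done = step(s, a)
            a2 = sample_action(theta, s2, rng)
            # Actor: move theta along grad log pi(a|s), scaled by the critic's Q(s, a).
            probs = policy_probs(theta, s)
            for b in ACTIONS:
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[(s, b)] += lr_actor * grad_log * Q[(s, a)]
            # Critic: one step update towards r + gamma * Q(s', a').
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += lr_critic * (target - Q[(s, a)])
            s, a = s2, a2
    return theta, Q


theta, Q = actor_critic()
right_probs = [policy_probs(theta, s)[1] for s in range(GOAL)]
```

Here right_probs holds, for each non-goal state, the probability that the learned policy moves right (towards the goal).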

Then we enhance this algorithm with the following steps:

Replace the Q value function with an Advantage function, where A(s, a) = Q(s, a) - Expected Q(s, *), i.e. A(s, a) = Q(s, a) - V(s)
Run multiple threads asynchronously
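
The advantage calculation itself is small enough to show directly. In this sketch V(s) is taken as the policy-weighted average of Q(s, a) over the actions, and the numbers are made up. The asynchronous, multi-threaded part is left out.

```python
def advantages(q_values, probs):
    """A(s, a) = Q(s, a) - V(s), with V(s) the policy-weighted mean of Q(s, .)."""
    v = sum(p * q for p, q in zip(probs, q_values))
    return [q - v for q in q_values]


# Three actions at one state, with their Q values and policy probabilities.
adv = advantages(q_values=[1.0, 0.5, 0.0], probs=[0.5, 0.25, 0.25])
```

Actions that are better than the policy's average get a positive advantage and the rest get a negative one, so the actor is pushed towards the former and away from the latter.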

This is the A3C (Asynchronous Advantage Actor-Critic) algorithm, which was state of the art at the time of writing.

Learning resources and credits

Some of the algorithms I discussed above are extracted from the following sources

Friday, August 25, 2017