In RL, consider there exist a set of "states" (from the environment) where the agent is going to make some decision of which actions to take and this action will cause it to transfer to a different state. A reward is assigned to the agent after this state transition. During the RL process, the agent's goal is to go through a trial and error process to learn what would be the optimal decision at each state such that the reward is maximized.

The hard part of RL is to know which action has a long term effect on the final outcome. For example, a wrong decision may not have an immediate bad result and therefore may be hidden. RL is about how to assign blames to previous decisions when a bad outcome has been detected.

Basic Iteration Approach

There is a reward matrix, each row represents the from-state and each column represent the to-state. The cell (i, j) represent the "immediate reward" obtained when moving from state i to state j.

The goal is to find an optimal policy which recommends the action that should be taken at each state in order to maximize the sum of reward.

We can use a value vector of each element (i) to represent agent's perception of the overall gained reward if he is at state (i). At the beginning, the value vector is set with random value. We use the following iterative approach to modify the value vector until it converges.

`def learn_value_vector`

current_state = initial_state

set value_vector to all zeros

repeat until value_vector.converges

# Need to enumerate all reachable next states

for each state(j) reachable by current state(i)

Take action to reach next state(j)

Collect reward(i, j)

action_value(j) =

reward(i, j) + discount * value_vector(j)

end

# Since you will always take the path of max overall reward

value_vector(i) = max_over_j(action_value(j))

current_state = state(maxj)

end

end

After we figure out this value vector, deriving the policy is straightforward. We just need to look across all the value of subsequent next states and pick the one with the highest value.`def learn_policy1`

for each state(i)

best_transition = nil

max_value = 0

for each state(j) reachable from state(i)

if value(j) > max_value

best_transition = j

max_value = value(j)

end

end

policy(i) = best_transition

end

end

One problem of this approach is requiring us to try out all possible actions and evaluate all the rewards to the next state. So there is an improve iterative approach described below.

Q-Learning

In Q-Learning, we use a Q Matrix instead of the value vector. Instead of estimating the value of each state, we estimate the value of each transition from the current state. In other words, we associate the value with the

Therefore, the cell(i, j) of the Q matrix represents the agent's perceived value of the transition from state(i) to state(j). We use the following iterative approach to modify the value vector until it converges.

`def learn_q_matrix`

current_state = initial_state

set Q matrix to all zeros

repeat until Q matrix converges

select the next state(j) randomly

collect reward (i, j)

value(j) = max Q(j, k) across k

Q(i, j) = reward(i, j) + discount * value(j)

end

end

After figuring out the Q matrix, the policy at state (i) is simply by picking state(j) which has the max Q(i, j) value.`def learn_policy2`

for each state(i)

best_transition = nil

max_value = 0

for each state(j) reachable from state(i)

if Q(i,j) > max_value

best_transition = j

max_value = Q(i,j)

end

end

policy(i) = best_transition

end

end

The relationship between the action (what the agent do) and the state transition (what is the new state end up) doesn't necessary be deterministic. In real life, the action and its effect is usually probabilistic rather than deterministic. (e.g. if you leave your house early, you are more likely to reach your office earlier, but it is not guaranteed). Imagine of a probabilistic state transition diagram, where each action has multiple branches leading to each possible next state with a probability assigned to each branch. Making decisions in this model is called the Marchov Decision Process.

The Q-learning approach described above is also good for Marchov Decision Process.

For some good articles in RL,

Reinforcement Learning: A tutorial

Q-learning by examples