Sunday, September 27, 2009

Reinforcement Learning

Reinforcement Learning (RL) is a type of Machine Learning other than "supervised learning" (having a teaching phase that shows the learning between inputs and correct answers) and "unsupervised learning" (discovering clusters and outliers from a set of input samples).

In RL, consider there exist a set of "states" (from the environment) where the agent is going to make some decision of which actions to take and this action will cause it to transfer to a different state. A reward is assigned to the agent after this state transition. During the RL process, the agent's goal is to go through a trial and error process to learn what would be the optimal decision at each state such that the reward is maximized.

The hard part of RL is to know which action has a long term effect on the final outcome. For example, a wrong decision may not have an immediate bad result and therefore may be hidden. RL is about how to assign blames to previous decisions when a bad outcome has been detected.

Basic Iteration Approach
There is a reward matrix, each row represents the from-state and each column represent the to-state. The cell (i, j) represent the "immediate reward" obtained when moving from state i to state j.

The goal is to find an optimal policy which recommends the action that should be taken at each state in order to maximize the sum of reward.

We can use a value vector of each element (i) to represent agent's perception of the overall gained reward if he is at state (i). At the beginning, the value vector is set with random value. We use the following iterative approach to modify the value vector until it converges.

def learn_value_vector
current_state = initial_state
set value_vector to all zeros
repeat until value_vector.converges
# Need to enumerate all reachable next states
for each state(j) reachable by current state(i)
Take action to reach next state(j)
Collect reward(i, j)
action_value(j) =
reward(i, j) + discount * value_vector(j)
# Since you will always take the path of max overall reward
value_vector(i) = max_over_j(action_value(j))
current_state = state(maxj)
After we figure out this value vector, deriving the policy is straightforward. We just need to look across all the value of subsequent next states and pick the one with the highest value.

def learn_policy1
for each state(i)
best_transition = nil
max_value = 0
for each state(j) reachable from state(i)
if value(j) > max_value
best_transition = j
max_value = value(j)
policy(i) = best_transition

One problem of this approach is requiring us to try out all possible actions and evaluate all the rewards to the next state. So there is an improve iterative approach described below.

In Q-Learning, we use a Q Matrix instead of the value vector. Instead of estimating the value of each state, we estimate the value of each transition from the current state. In other words, we associate the value with the pair instead of just .

Therefore, the cell(i, j) of the Q matrix represents the agent's perceived value of the transition from state(i) to state(j). We use the following iterative approach to modify the value vector until it converges.

def learn_q_matrix
current_state = initial_state
set Q matrix to all zeros
repeat until Q matrix converges
select the next state(j) randomly
collect reward (i, j)
value(j) = max Q(j, k) across k
Q(i, j) = reward(i, j) + discount * value(j)
After figuring out the Q matrix, the policy at state (i) is simply by picking state(j) which has the max Q(i, j) value.

def learn_policy2
for each state(i)
best_transition = nil
max_value = 0
for each state(j) reachable from state(i)
if Q(i,j) > max_value
best_transition = j
max_value = Q(i,j)
policy(i) = best_transition

The relationship between the action (what the agent do) and the state transition (what is the new state end up) doesn't necessary be deterministic. In real life, the action and its effect is usually probabilistic rather than deterministic. (e.g. if you leave your house early, you are more likely to reach your office earlier, but it is not guaranteed). Imagine of a probabilistic state transition diagram, where each action has multiple branches leading to each possible next state with a probability assigned to each branch. Making decisions in this model is called the Marchov Decision Process.

The Q-learning approach described above is also good for Marchov Decision Process.

For some good articles in RL,
Reinforcement Learning: A tutorial
Q-learning by examples

Thursday, September 10, 2009

Math Concepts for kids and teens

Summarizing some key math concepts that I teach my kids.

  • A correct value system is the most important foundation (the goal to excel, the willingness to help)
  • How to make decisions ? (differentiate emotional decision and strategic decision)
  • How to do planning ?
  • How to do reasoning, analyzing and drawing conclusion ?
  • How to be open minded, humble but not blindly follow conventional wisdom ? (why human walk with 2 legs, why do we have supermarkets, how do we decide where to put a bus station, why an apple is more expensive than an orange)
  • How to be patient and control emotions ?
  • Develop a good sense of numbers and able to read different charts and graphs, observing relationship between variables and their trends.
  • Appreciation of doing things in a smart way

Basic Math Concepts
  • Numbers, counting (Integer) and quantities (Real)
  • Cause and effect
  • Set (belongs, union, intersect, subset)
  • Function (dependent and independent variable, continuous vs discrete). Various graphing (histogram, line graph, plot), 2D curve and 3D plane
  • Linear equations, degree of freedoms, relationship between number of variables and number of equations.
  • Calculus (differentiation and integration), multi-variables and partial differentiation
  • Logic (if/then, necessary/sufficient conditions, equivalence) and Proof establishments
  • Debate and Logic fallacies
  • Geometry and Vector (think 3D instead of 2D)
  • Probabilities (Draw a tree of all outcomes and counting)
  • Probability distribution function and expected gains
  • Permutations and Combinations (how to find out "all possibilities")
  • Mathematical induction, recursion in proofs.
  • Digits with different bases (and their relationship with Polynomials)
  • Making predictions: False positives, False negatives and how trade-off decisions should be made

Math Models
  • Decision tree (decision and outcome alternations, min/max strategy). Expected gain and optimization
  • Game theory (Nash equilibrium). Outcome prediction within a social group. Win/win and win/lose and lose/lose situations.
  • Finding solution using Search tree, exhaustive search in all possibilities in a systematic way (tree traversal, breath-first vs depth-first vs heuristic)
  • Linear programming for constraint satisfaction and optimization
  • Deterministic vs Stochastic process (Markov chains), Queuing theory
  • Control system (equilibrium, stability and feedback loop)
  • Graph model (nodes and arcs, path finding, shortest path, minimal spanning tree)
  • Finite State Machines (everything happens in a cycle)