Pragmatic Programming Techniques: September 2009

Sunday, September 27, 2009

Reinforcement Learning

Reinforcement Learning (RL) is a type of Machine Learning other than "supervised learning" (having a teaching phase that shows the learning between inputs and correct answers) and "unsupervised learning" (discovering clusters and outliers from a set of input samples).

In RL, consider there exist a set of "states" (from the environment) where the agent is going to make some decision of which actions to take and this action will cause it to transfer to a different state. A reward is assigned to the agent after this state transition. During the RL process, the agent's goal is to go through a trial and error process to learn what would be the optimal decision at each state such that the reward is maximized.

The hard part of RL is to know which action has a long term effect on the final outcome. For example, a wrong decision may not have an immediate bad result and therefore may be hidden. RL is about how to assign blames to previous decisions when a bad outcome has been detected.

Basic Iteration Approach
There is a reward matrix, each row represents the from-state and each column represent the to-state. The cell (i, j) represent the "immediate reward" obtained when moving from state i to state j.

The goal is to find an optimal policy which recommends the action that should be taken at each state in order to maximize the sum of reward.

We can use a value vector of each element (i) to represent agent's perception of the overall gained reward if he is at state (i). At the beginning, the value vector is set with random value. We use the following iterative approach to modify the value vector until it converges.

def learn_value_vector
 current_state = initial_state
 set value_vector to all zeros
 repeat until value_vector.converges
   # Need to enumerate all reachable next states
   for each state(j) reachable by current state(i)
     Take action to reach next state(j)
     Collect reward(i, j)
     action_value(j) =
       reward(i, j) + discount * value_vector(j)
   end
   # Since you will always take the path of max overall reward
   value_vector(i) = max_over_j(action_value(j))
   current_state = state(maxj)
 end
end

After we figure out this value vector, deriving the policy is straightforward. We just need to look across all the value of subsequent next states and pick the one with the highest value.

def learn_policy1
 for each state(i)
   best_transition = nil
   max_value = 0
   for each state(j) reachable from state(i)
     if value(j) > max_value
       best_transition = j
       max_value = value(j)
     end
   end
   policy(i) = best_transition
 end
end

One problem of this approach is requiring us to try out all possible actions and evaluate all the rewards to the next state. So there is an improve iterative approach described below.

Q-Learning
In Q-Learning, we use a Q Matrix instead of the value vector. Instead of estimating the value of each state, we estimate the value of each transition from the current state. In other words, we associate the value with the pair instead of just .

Therefore, the cell(i, j) of the Q matrix represents the agent's perceived value of the transition from state(i) to state(j). We use the following iterative approach to modify the value vector until it converges.

def learn_q_matrix
 current_state = initial_state
 set Q matrix to all zeros
 repeat until Q matrix converges
   select the next state(j) randomly
   collect reward (i, j)
   value(j) = max Q(j, k) across k
   Q(i, j) = reward(i, j) + discount * value(j)
 end
end

After figuring out the Q matrix, the policy at state (i) is simply by picking state(j) which has the max Q(i, j) value.

def learn_policy2
 for each state(i)
   best_transition = nil
   max_value = 0
   for each state(j) reachable from state(i)
     if Q(i,j) > max_value
       best_transition = j
       max_value = Q(i,j)
     end
   end
   policy(i) = best_transition
 end
end

The relationship between the action (what the agent do) and the state transition (what is the new state end up) doesn't necessary be deterministic. In real life, the action and its effect is usually probabilistic rather than deterministic. (e.g. if you leave your house early, you are more likely to reach your office earlier, but it is not guaranteed). Imagine of a probabilistic state transition diagram, where each action has multiple branches leading to each possible next state with a probability assigned to each branch. Making decisions in this model is called the Marchov Decision Process.

The Q-learning approach described above is also good for Marchov Decision Process.

For some good articles in RL,
Reinforcement Learning: A tutorial
Q-learning by examples

Thursday, September 10, 2009

Math Concepts for kids and teens

Summarizing some key math concepts that I teach my kids.

Fundamentals

A correct value system is the most important foundation (the goal to excel, the willingness to help)
How to make decisions ? (differentiate emotional decision and strategic decision)
How to do planning ?
How to do reasoning, analyzing and drawing conclusion ?
How to be open minded, humble but not blindly follow conventional wisdom ? (why human walk with 2 legs, why do we have supermarkets, how do we decide where to put a bus station, why an apple is more expensive than an orange)
How to be patient and control emotions ?
Develop a good sense of numbers and able to read different charts and graphs, observing relationship between variables and their trends.
Appreciation of doing things in a smart way

Basic Math Concepts

Numbers, counting (Integer) and quantities (Real)
Cause and effect
Set (belongs, union, intersect, subset)
Function (dependent and independent variable, continuous vs discrete). Various graphing (histogram, line graph, plot), 2D curve and 3D plane
Linear equations, degree of freedoms, relationship between number of variables and number of equations.
Calculus (differentiation and integration), multi-variables and partial differentiation
Logic (if/then, necessary/sufficient conditions, equivalence) and Proof establishments
Debate and Logic fallacies
Geometry and Vector (think 3D instead of 2D)
Probabilities (Draw a tree of all outcomes and counting)
Probability distribution function and expected gains
Permutations and Combinations (how to find out "all possibilities")
Mathematical induction, recursion in proofs.
Digits with different bases (and their relationship with Polynomials)
Making predictions: False positives, False negatives and how trade-off decisions should be made

Math Models

Decision tree (decision and outcome alternations, min/max strategy). Expected gain and optimization
Game theory (Nash equilibrium). Outcome prediction within a social group. Win/win and win/lose and lose/lose situations.
Finding solution using Search tree, exhaustive search in all possibilities in a systematic way (tree traversal, breath-first vs depth-first vs heuristic)
Linear programming for constraint satisfaction and optimization
Deterministic vs Stochastic process (Markov chains), Queuing theory
Control system (equilibrium, stability and feedback loop)
Graph model (nodes and arcs, path finding, shortest path, minimal spanning tree)
Finite State Machines (everything happens in a cycle)