## Saturday, May 2, 2009

### Machine Learning Intuition

As more and more user data are gathered on different web sites (such as e-commerce, social network), data mining / machine learning technique becomes an increasingly important tool to analysis them and extract useful information out of it.

There a wide variety of machine learning applications, such as …
• Recommendation: After buying a book at Amazon, or rent a movie from Netflix, they recommends what other items that you may be interested
• Fraud detection: To protect its buyer and seller, an auction site like EBay detect abnormal patterns to identify fraudulent transaction
• Market segmentation: Product company divide their market into segments of similar potential customers and design specific marketing campaign for each segment.
• Social network analysis: By analysis the user’s social network profile data, social networking site like Facebook can categorize their users and personalize their experience
• Medical research: Analyzing DNA patterns, Cancer research, Diagnose problem from symptoms
However, machine learning theory involves a lot of math which is non-trivial for people who doesn’t have the rigorous math background. Therefore, I am trying to provide an intuition perspective behind the math.

General Problem

Each piece of data can be represented as a vector [x1, x2, …] where xi are the attributes of the data.

Such attributes can be numeric or categorical. (e.g. age is an numeric attribute and gender is a categorical attribute)

There are basically 3 branch of machine learning ...

Supervised learning
• The main use of supervised learning is to predict an output based on a set of training data. A set of data with structure [x1, x2 …, y] is presented. (in this case y is the output). The learning algorithm will learn (from the training set) how to predict the output y for future seen data.
• When y is numeric, the prediction is called regression. When y is categorical, the prediction is called classification.

Unsupervised learning
• The main use of unsupervised learning is to discover unknown patterns within data. (e.g. grouping similar data, or detecting outliers).
• Identifying clusters is a classical scenario of unsupervised learning

Reinforcement learning
• This is also known as “continuous learning” where the final output is not given. The agent will choose an action based on its current state and then will be present with an award. The agent learns how to maximize its award and come up with a model call “optimal policy”. A policy is a mapping between from “state” to “action” (given I am at a particular state, what action should I take).

Data Warehouse

Data warehouse is not “machine learning”, it is basically a special way to store your data so that it can be easily group in many ways for doing analysis in a manual way.

Typically, data is created from OLTP systems which runs the company’s business operation. OLTP capture the “latest state” of the company. Data are periodically snapshot to the data-warehouse for OLAP, in other words, data-warehouse add a time dimension to the data.

There is an ETL process that extract data from various sources, cleansing the data, transform to the form needed by the data-warehouse and then load into the data cube.

Data-warehouse typically organize the data as a multi-dimensional data cube based on a "Star schema" (1 Fact table + N Dimension tables). Each cell contains aggregate data along different (combination) of dimensions.

OLAP processing involves the following operations
• Rollup: Aggregate data within a particular dimension. (e.g. For the “time” dimension, you can “rollup” the aggregation from “month” into “quarter”)
• Drilldown: Breakdown the data within a particular dimension (e.g. For the “time” dimension, you can “drilldown” from months” into “days”)
• Slice: Cut a layer out of a particular dimension (e.g. Look at all data at “Feb”)
• Dice: Select a sub data cube (e.g. Look at all data at “Jan” and “Feb” as well as product “laptop” and “hard disk”
The Data-warehouse can be further diced into specific “data marts” that focus in different areas for further drilldown analysis.

Some Philosophy

To determine the output from a set of input attributes, one way is to study the physics behinds them and write a function that transform the input attributes to the output. However, what if the relationship is unknown ? or the relationship hasn’t been formally specified ?

Instead of based on a sound theoretical model, machine learning is trying to make prediction based on previously observed data. There are 2 broad type of learning strategies

Instance-based learning
• Also known as lazy learning, the learner remembers all the previous seen examples. When a new piece of input data arrives, it tried to find the best matched data it previous seen and use its output to predict the output of the new data. It has an underlying assumption that if two piece of data are “similar” in their input attributes, their output are also similar.
• Nearest neighbor is a classical approach for instance-based learning

Model-based learning
Eager learning that learn a generalized model upfront, and lazy learning learn from seen examples at the time of query. Instead of learning a generic model that fits all observed data, lazy learning can focus its analysis close to the query point. However, getting a comprehensible model in lazy learning is harder and it also require large memory to store all seen data.