There is a wide variety of machine learning applications, such as …
- Recommendation: After you buy a book at Amazon or rent a movie from Netflix, they recommend other items that you may be interested in
- Fraud detection: To protect its buyers and sellers, an auction site like eBay detects abnormal patterns to identify fraudulent transactions
- Market segmentation: Product companies divide their market into segments of similar potential customers and design a specific marketing campaign for each segment.
- Social network analysis: By analyzing users’ social network profile data, a social networking site like Facebook can categorize its users and personalize their experience
- Medical research: Analyzing DNA patterns, cancer research, diagnosing problems from symptoms
Each piece of data can be represented as a vector [x1, x2, …] where the xi are the attributes of the data.
Such attributes can be numeric or categorical. (e.g. age is a numeric attribute and gender is a categorical attribute)
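As a concrete sketch of this representation, here is one common way to turn a record with a numeric and a categorical attribute into a vector: keep numeric values as-is and one-hot encode the categorical ones. The record fields and category list below are invented for illustration.

```python
# Sketch: encode a record as a vector [age, one-hot gender...].
# The attribute names and categories here are illustrative assumptions.

def to_vector(record, genders=("male", "female")):
    vec = [float(record["age"])]  # numeric attribute: use the value directly
    # categorical attribute: one-hot encode (a 1.0 in the matching slot)
    vec += [1.0 if record["gender"] == g else 0.0 for g in genders]
    return vec

print(to_vector({"age": 30, "gender": "female"}))  # [30.0, 0.0, 1.0]
```

One-hot encoding is just one option; some learners can consume categorical attributes directly.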
There are basically three branches of machine learning ...
- The main use of supervised learning is to predict an output based on a set of training data. A set of data with structure [x1, x2 …, y] is presented (in this case y is the output). The learning algorithm learns (from the training set) how to predict the output y for future unseen data.
- When y is numeric, the prediction is called regression. When y is categorical, the prediction is called classification.
- The main use of unsupervised learning is to discover unknown patterns within data (e.g. grouping similar data, or detecting outliers).
- Identifying clusters is a classical scenario of unsupervised learning.
- Reinforcement learning is also known as “continuous learning”, where the final output is not given. The agent chooses an action based on its current state and is then presented with a reward. The agent learns how to maximize its reward and comes up with a model called an “optimal policy”. A policy is a mapping from “state” to “action” (given that I am at a particular state, what action should I take).
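To make the reinforcement learning idea concrete, here is a minimal tabular Q-learning sketch (one classical way to learn a policy) on a toy 5-state chain: the agent starts at state 0, moving “right” from state 3 reaches the goal and earns a reward of 1, and everything else earns 0. The environment, state count, and learning parameters are all invented for illustration.

```python
import random

random.seed(0)

N_STATES, ACTIONS = 5, ("left", "right")
alpha, gamma, eps = 0.5, 0.9, 0.1           # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Toy environment: move along the chain; reward 1 on reaching the goal."""
    s2 = min(s + 1, N_STATES - 1) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

for _ in range(500):                         # training episodes
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit the current Q estimates, sometimes explore
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS) - Q[(s, a)])
        s = s2

# The learned "optimal policy" maps each state to its best action.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)  # every non-goal state should prefer "right"
```

The `policy` dict at the end is exactly the state-to-action mapping described above.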
A data warehouse is not “machine learning”; it is basically a special way to store your data so that it can be easily grouped in many ways for doing analysis in a manual way.
Typically, data is created by OLTP systems, which run the company’s business operations. OLTP captures the “latest state” of the company. Data is periodically snapshotted to the data warehouse for OLAP; in other words, the data warehouse adds a time dimension to the data.
There is an ETL process that extracts data from various sources, cleanses it, transforms it into the form needed by the data warehouse, and then loads it into the data cube.
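A toy sketch of those ETL steps, assuming invented source rows and an aggregation keyed by (month, product); real ETL pipelines are of course far more involved.

```python
# Illustrative ETL sketch: extract raw sales rows, cleanse and transform
# them, then load them into a cube keyed by (month, product).
# All field names and values below are invented.

raw_rows = [
    {"date": "2024-01-15", "product": "laptop", "amount": "1200"},
    {"date": "2024-01-20", "product": "laptop", "amount": "800"},
    {"date": "2024-02-03", "product": None,     "amount": "50"},   # dirty row
]

def etl(rows):
    cube = {}
    for row in rows:
        if not row["product"]:                 # cleanse: drop incomplete rows
            continue
        month = row["date"][:7]                # transform: date -> month key
        amount = float(row["amount"])          # transform: string -> number
        key = (month, row["product"])
        cube[key] = cube.get(key, 0.0) + amount  # load: aggregate into the cube
    return cube

print(etl(raw_rows))  # {('2024-01', 'laptop'): 2000.0}
```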
A data warehouse typically organizes the data as a multi-dimensional data cube based on a "star schema" (1 fact table + N dimension tables). Each cell contains aggregate data along a different combination of dimensions.
OLAP processing involves the following operations:
- Rollup: Aggregate data within a particular dimension. (e.g. For the “time” dimension, you can “roll up” the aggregation from “month” into “quarter”)
- Drilldown: Break down the data within a particular dimension (e.g. For the “time” dimension, you can “drill down” from “months” into “days”)
- Slice: Cut a layer out of a particular dimension (e.g. Look at all data at “Feb”)
- Dice: Select a sub data cube (e.g. Look at all data at “Jan” and “Feb” as well as the products “laptop” and “hard disk”)
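The operations above can be sketched on a toy cube held as a plain dict keyed by (month, product); the sales figures are made up, and drilldown is omitted since this toy cube has no finer grain than months.

```python
# Toy data cube: (month, product) -> total sales. Values are invented.
cube = {
    ("Jan", "laptop"): 10, ("Jan", "hard disk"): 4, ("Jan", "mouse"): 7,
    ("Feb", "laptop"): 12, ("Feb", "hard disk"): 5, ("Feb", "mouse"): 6,
    ("Mar", "laptop"):  9, ("Mar", "hard disk"): 3, ("Mar", "mouse"): 8,
}

QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}

def rollup(cube):
    """Rollup: re-aggregate the time dimension from month up to quarter."""
    out = {}
    for (month, product), v in cube.items():
        key = (QUARTER[month], product)
        out[key] = out.get(key, 0) + v
    return out

def slice_(cube, month):
    """Slice: fix one dimension at a single value (e.g. month == 'Feb')."""
    return {k: v for k, v in cube.items() if k[0] == month}

def dice(cube, months, products):
    """Dice: select a sub-cube along several dimensions at once."""
    return {k: v for k, v in cube.items()
            if k[0] in months and k[1] in products}

print(rollup(cube)[("Q1", "laptop")])                       # 10 + 12 + 9 = 31
print(slice_(cube, "Feb"))                                  # the Feb layer only
print(dice(cube, {"Jan", "Feb"}, {"laptop", "hard disk"}))  # 2x2 sub-cube
```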
To determine the output from a set of input attributes, one way is to study the physics behind them and write a function that transforms the input attributes into the output. However, what if the relationship is unknown? Or the relationship hasn’t been formally specified?
Instead of being based on a sound theoretical model, machine learning tries to make predictions based on previously observed data. There are two broad types of learning strategies:
- Instance-based learning: Also known as lazy learning, the learner remembers all previously seen examples. When a new piece of input data arrives, it tries to find the best-matched data it has previously seen and uses its output to predict the output of the new data. The underlying assumption is that if two pieces of data are “similar” in their input attributes, their outputs are also similar.
- Nearest neighbor is a classical approach for instance-based learning
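A minimal 1-nearest-neighbor sketch of the idea: predict the output of a new point by copying the output of the closest previously seen example. The tiny training set below is invented for illustration.

```python
import math

# Each training example is (input attributes, output label).
train = [([1.0, 1.0], "A"), ([1.5, 2.0], "A"),
         ([5.0, 5.0], "B"), ([6.0, 4.5], "B")]

def predict(x):
    """Return the label of the training example nearest to x (Euclidean)."""
    _, y = min(train, key=lambda ex: math.dist(x, ex[0]))
    return y

print(predict([1.2, 1.1]))  # A -- closest to the (1.0, 1.0) example
print(predict([5.5, 5.0]))  # B
```

In practice, k-nearest-neighbor votes among the k closest examples rather than trusting a single one.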
- Model-based learning: Also known as eager learning, the learner assumes the input attributes and the output are related through a particular model (e.g. linear regression, logistic regression, neural network, probabilistic belief network, decision tree, etc.). The learner learns the parameters of such a model from previously seen examples. When a new piece of input data arrives, it uses the learned model to predict the output.
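A minimal model-based sketch using the simplest of the models listed above, linear regression: assume y = a*x + b, fit the two parameters by least squares on the training examples, then use the learned model on unseen inputs. The data points are invented.

```python
# Training data, roughly following y = 2x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

# Closed-form least-squares fit of the assumed model y = a*x + b.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
b = my - a * mx                    # learned parameters of the assumed model

def predict(x):
    return a * x + b               # eager: apply the learned model directly

print(round(a, 2), round(b, 2))    # close to the true 2 and 1
print(round(predict(5.0), 2))      # prediction for an unseen input
```

Note the contrast with lazy learning: all the work happens up front when fitting a and b, and prediction is then just evaluating the model.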