On the "learning" side, the most significant algorithm was the neural network, which was not very successful due to overfitting (the model was too powerful for the limited data available). Nevertheless, in some more specific tasks, the idea of "using data to fit a function" achieved significant success, and this forms the foundation of "machine learning" today.
On the "mimic" side, people have focused on "image recognition", "speech recognition", and "natural language processing". Experts have spent tremendous amounts of time creating features such as "edge detection", "color profiles", "N-grams", "syntax trees", etc. Nevertheless, the success has been moderate.
Traditional Machine Learning
Machine Learning (ML) techniques have played a significant role in prediction. ML has undergone multiple generations, with a rich set of model structures, such as:
- Linear regression
- Logistic regression
- Decision tree
- Support Vector Machine
- Bayesian model
- Regularization model
- Ensemble model
- Neural network
Each of these predictive models is based on a certain algorithmic structure, with parameters as tunable knobs. Training a predictive model involves the following steps:
- Choose a model structure (e.g. Logistic regression, or Random forest, or ...)
- Feed the model with training data (with both input and output)
- The learning algorithm outputs the optimal model (i.e. the model with the specific parameter values that minimize the training error)
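The three steps above can be sketched in plain Python. As an illustration (the article names no specific library), the chosen model structure here is simple linear regression, y = a*x + b, and the learning algorithm is ordinary least squares, which minimizes the training error:

```python
# Step 1: the chosen model structure is y = a*x + b.
def fit_linear(xs, ys):
    """Learning algorithm: ordinary least squares, minimizing training error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b  # the "optimal model": specific parameter values

# Step 2: training data with both input and output (generated from y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Step 3: the learning algorithm outputs the optimal parameters.
a, b = fit_linear(xs, ys)
print(a, b)  # -> 2.0 1.0
```

The data here is noise-free, so the learned parameters recover the true relationship exactly; with noisy data they would be the best straight-line approximation instead.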
Each model has its own characteristics and will perform well on some tasks and poorly on others. Generally, though, we can group them into low-power (simple) models and high-power (complex) models. Choosing between different models is a tricky question.
Traditionally, a low-power / simple model was preferred over a high-power / complex model for the following reasons:
- Until we have massive processing power, training a high-power model takes too long
- Until we have a massive amount of data, training a high-power model causes the overfitting problem (since a high-power model has rich parameters and can fit a wide range of data shapes, we may end up training a model that fits the current training data too specifically and does not generalize well enough to predict future data).
However, choosing a low-power model suffers from the so-called "underfitting" problem, where the model structure is too simple to fit the training data when the data is more complex. (Imagine the underlying data has a quadratic relationship, y = 5 * x^2: there is no way to fit a linear regression y = a*x + b well, no matter what a and b we pick.)
To mitigate the underfitting problem, data scientists typically apply their domain knowledge to come up with "input features" that have a more direct relationship with the output. (Going back to the quadratic relationship y = 5 * x^2: if you create a feature z = x^2, then you can fit a linear regression y = a*z + b by picking a = 5 and b = 0.)
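The quadratic example can be verified directly. In this sketch, a least-squares straight line fitted on the raw input x underfits y = 5 * x^2, while the same linear model fitted on the engineered feature z = x^2 is exact:

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [5 * x ** 2 for x in xs]     # true relationship: y = 5 * x^2

a1, b1 = fit_linear(xs, ys)       # underfit: a straight line cannot match the curve
zs = [x ** 2 for x in xs]         # domain knowledge: engineered feature z = x^2
a2, b2 = fit_linear(zs, ys)       # now the linear model fits perfectly

print(a1, b1)  # -> 0.0 10.0  (a flat line: slope 0 on this symmetric data)
print(a2, b2)  # -> 5.0 0.0   (recovers y = 5*z + 0 exactly)
```

The feature z did the "hard" nonlinear part by hand, which is exactly what feature engineering is: reshaping the input so a simple model can succeed.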
The major obstacle in machine learning is this "feature engineering" step, which requires deep domain experts to identify important signals before feeding data into the training process. Feature engineering is very manual and demands a lot of scarce domain expertise, and it has therefore become the major bottleneck of most machine learning tasks today.
In other words, if we don't have enough processing power and enough data, then we have to use low-power / simple models, which require us to spend significant time and effort creating appropriate input features. This is where most data scientists spend their time today.
Return of Neural Network
In the early 2000s, machine processing power increased tremendously with the advancement of cloud computing and massively parallel processing infrastructure, together with the big data era in which massive amounts of fine-grained event data are collected. We are no longer restricted to low-power / simple models. For example, two of the most popular mainstream machine learning models today are Random Forest and Gradient Boosting Trees. Nevertheless, although both are very powerful and provide non-linear fits to the training data, data scientists still need to carefully create features in order to achieve good performance.

At the same time, computer scientists revisited the use of many-layered neural networks for these human-mimicking tasks. This gave new birth to the DNN (Deep Neural Network) and provided a significant breakthrough in image classification and speech recognition tasks. The major difference with a DNN is that you can feed the raw signals (e.g. RGB pixel values) directly into it without creating any domain-specific input features. Through many layers of neurons (hence the name "deep" neural network), a DNN can "automatically" generate the appropriate features through each layer and finally provide a very good prediction. This significantly reduces the feature-engineering effort, a major bottleneck for data scientists.
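A toy NumPy sketch can illustrate the key idea of learned features (the architecture, seed, and learning rate here are arbitrary choices, not from the article). XOR is not linearly separable in its raw inputs, so a single linear layer underfits it; but a two-layer network fed only the raw inputs can invent an internal representation in its hidden layer that makes the problem solvable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # raw input signals
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR labels

# Two-layer network: no hand-made features, just raw inputs in.
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)     # hidden layer: automatically learned features
    p = sigmoid(h @ W2 + b2)     # output layer: prediction built on those features
    grad_p = (p - y) / len(X)    # gradient of mean cross-entropy w.r.t. output logits
    grad_h = (grad_p @ W2.T) * (1.0 - h ** 2)   # backpropagate to the hidden layer
    W2 -= 0.5 * (h.T @ grad_p); b2 -= 0.5 * grad_p.sum(axis=0)
    W1 -= 0.5 * (X.T @ grad_h); b1 -= 0.5 * grad_h.sum(axis=0)

print((p > 0.5).astype(int).ravel())  # typically recovers XOR: 0 1 1 0
```

Real DNNs stack many such layers and scale to millions of raw pixel inputs, but the principle is the same: each layer transforms the representation coming out of the previous one.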
DNNs have also evolved into many different network topologies, giving us the CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GAN (Generative Adversarial Network), transfer learning, attention models, etc. The whole spectrum is called Deep Learning, which is catching the whole machine learning community's attention today.
Reinforcement Learning
Another key component is mimicking how a person (or animal) learns. Imagine the very natural animal behavior of the perceive/act/reward cycle. A person or animal first understands the environment by sensing what "state" he is in. Based on that, he picks an "action", which brings him to another "state", and then he receives a "reward". The cycle repeats until he dies. This way of learning (called "reinforcement learning") is quite different from the curve-fitting approach of traditional supervised machine learning. In particular, learning in RL is very fast because every new piece of feedback (such as performing an action and receiving a reward) immediately influences subsequent decisions. Reinforcement learning has gained tremendous success in self-driving cars as well as AlphaGo (a Go-playing program).

Reinforcement learning also provides a smooth integration between "prediction" and "optimization", because it maintains a belief about the current state and the possible transition probabilities of different actions, and then decides which action can lead to the best outcome.
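The perceive/act/reward cycle can be sketched with tabular Q-learning, one of the simplest RL algorithms (the corridor environment and all parameters below are illustrative, not from the article). The agent walks a 5-state corridor; only reaching the goal state yields a reward, and each reward immediately updates the value estimate that drives the next decision:

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4
MOVES = (-1, +1)                          # action 0 = left, action 1 = right
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # learned value of each (state, action)
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(200):
    s = 0                                  # perceive: start state
    while s != GOAL:
        # Act: epsilon-greedy -- explore at random, otherwise take the best-known action.
        if random.random() < eps or Q[s][0] == Q[s][1]:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2 = min(max(s + MOVES[a], 0), N_STATES - 1)   # new state (walls clamp)
        r = 1.0 if s2 == GOAL else 0.0                 # reward only at the goal
        # Feedback is applied immediately and influences subsequent decisions.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = ["left" if Q[s][0] > Q[s][1] else "right" for s in range(GOAL)]
print(policy)  # -> ['right', 'right', 'right', 'right']
```

Note how the Q-table blends prediction and optimization: it predicts the long-term value of each action in each state, and the policy simply picks the action with the best predicted outcome.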
AI = DL + RL
Compared to classical ML techniques, DL provides a more powerful prediction model that usually produces good prediction accuracy. Compared to classical optimization models using linear programming (LP), RL provides a much faster learning mechanism that is also more adaptive to changes in the environment.