Gradient Descent
The basic idea of Gradient Descent is to use a feedback loop to adjust the model based on the error it observes between its predicted output and the actual output. The adjustment (notice that there are multiple model parameters, so it should be thought of as a vector) points in the direction in which the error decreases most steeply (hence the term "gradient").
Notice that we intentionally leave the following items vaguely defined so that this approach can be applied to a wide range of machine learning scenarios (a short code sketch of the update rule follows the lists below).
- The Model
- The loss function
- The learning rate
Gradient Descent has a number of attractive properties:
- Intuitive and easy to understand
- Easy to run in parallel processing architecture
- Easy to run incrementally with additional data
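As a concrete illustration (a minimal sketch in Python, not code from the original post), the update rule can be written as follows. The model, loss function, and learning rate are exactly the pluggable pieces listed above; the quadratic loss used in the usage example is purely illustrative.

```python
import numpy as np

def gd_step(theta, grad_loss, eta):
    """One gradient-descent adjustment: move the parameter vector against
    the gradient of the loss, i.e. in the direction of steepest decrease."""
    return theta - eta * grad_loss(theta)

# Tiny usage example: minimize the loss L(theta) = ||theta - target||^2.
target = np.array([3.0, -1.0])
grad = lambda theta: 2.0 * (theta - target)   # gradient of that loss
theta = np.zeros(2)
for _ in range(100):
    theta = gd_step(theta, grad, eta=0.1)
print(theta)                                  # converges toward [3, -1]
```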
Batch vs Online Learning
While some other Machine Learning models (e.g. decision trees) require a batch of data points before learning can start, Gradient Descent can learn from each data point independently and hence supports both batch learning and online learning easily. The difference lies in how the training data is fed to the model and how the loss function computes its error.
In batch learning, all training data is fed to the model, which estimates the output for every data point. The errors are then summed to compute the loss, and the model is updated. In other words, the model is updated only after it has made predictions for the whole batch of data points.
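A minimal sketch of batch learning, assuming a linear model and squared-error loss (both choices are illustrative, not prescribed by the text):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, epochs=100):
    """Batch learning: predict the whole batch, combine all errors into one
    loss gradient, then update the model once per pass over the data."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        errors = X @ theta - y        # errors for all data points
        grad = X.T @ errors / n       # gradient of the mean squared error
        theta -= eta * grad           # single update for the whole batch
    return theta
```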
In online learning mode (also called stochastic gradient descent), data is fed to the model one point at a time, and the model is adjusted immediately after evaluating the error of that single data point. Notice that the final result of incremental learning can differ from batch learning, but it can be shown that the difference is bounded and inversely proportional to the square root of the number of data points.
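The corresponding online (stochastic) variant, under the same illustrative linear-model assumption, updates the parameters after every single data point:

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.01, epochs=1):
    """Online learning: adjust the model immediately after evaluating the
    error of each individual data point."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):                  # one data point at a time
            error = X[i] @ theta - y[i]     # error of this single point
            grad = X[i] * error             # gradient from this point only
            theta -= eta * grad             # immediate adjustment
    return theta
```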
The learning rate can also be adjusted to achieve better stability of convergence. In general, the learning rate is higher initially and decreases over the course of training (in batch learning it decreases between rounds; in online learning it decreases at every data point). This is quite intuitive: you pay less attention to the error as you have learned more and more. Because of this, online learning is sensitive to the arrival order of the data.
One common way to adjust the learning rate is to divide a constant by the square root of t (where t is the number of data points seen so far):
η_t = η_initial / √t
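A small sketch of this schedule (the function name and the generalized decay exponent are illustrative additions):

```python
def learning_rate(eta_initial, t, decay=0.5):
    """Decaying learning-rate schedule: eta_t = eta_initial / t**decay.
    decay = 0.5 reproduces the 1/sqrt(t) schedule above; a larger decay
    discounts late-arriving data more heavily, a smaller one less so."""
    return eta_initial / (t ** decay)

# e.g. with eta_initial = 1.0 the rate after 1, 4, 9, 16 points is 1.0, 0.5, 0.33, 0.25
```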
By using a different decay exponent, we can control how much attention to pay to late-arriving data. In online learning, where data arrives in time order, we can play with this decay factor to guide how much attention the learning mechanism pays to the latest data. In this way, online learning automatically adapts to changes of trend over time.
Most real-world machine learning scenarios rely on the stationarity of the model. After all, learning is about "learning from past experience". If the environment changes so rapidly that past experience becomes invalid, there is little value in learning. For this reason, most machine learning projects are well served by batch learning (run daily or weekly), and the demand for online learning is not very high. A very common batch learning model is described in my previous blog here.
Parallel Learning
Because there is no dependency between data points during processing, Gradient Descent is very easy to run in a parallel processing environment such as Map/Reduce. Here we illustrate how to parallelize the execution of batch learning.
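As a rough sketch (illustrative names, not the original pseudocode): each mapper computes the partial gradient of its own partition of the training data, the reducer sums these partial gradients and applies a single model update, and the driver repeats these rounds until the model converges.

```python
import numpy as np

def map_partial_gradient(X_part, y_part, theta):
    """Map step: compute this partition's gradient contribution
    using the current model parameters theta."""
    errors = X_part @ theta - y_part
    return X_part.T @ errors, len(y_part)

def reduce_and_update(partials, theta, eta):
    """Reduce step: sum the partial gradients from all mappers and
    apply one update to the shared model."""
    total_grad = sum(g for g, _ in partials)
    total_n = sum(n for _, n in partials)
    return theta - eta * total_grad / total_n

def parallel_batch_gd(partitions, d, eta=0.01, rounds=50):
    """Driver: repeated Map/Reduce rounds until the model converges
    (here simply a fixed number of rounds for brevity)."""
    theta = np.zeros(d)
    for _ in range(rounds):
        partials = [map_partial_gradient(X, y, theta) for X, y in partitions]
        theta = reduce_and_update(partials, theta, eta)
    return theta
```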
Notice that multiple rounds of Map/Reduce are run until the model converges. On the other hand, online learning is not feasible on Hadoop Map/Reduce, which does not support real-time processing at this moment.
In summary, gradient descent is a very powerful approach to machine learning and works well in a wide spectrum of scenarios.
Comments:
For a framework around online gradient descent (and online processing in general), I would point to Storm, which is a distributed and fault-tolerant realtime computation framework developed by Nathan Marz from Twitter.
Hi,
I have learnt that one should randomly pick training examples when applying stochastic gradient descent, which might not be true for your MapReduce pseudocode. If I understood you correctly, each mapper will process a subset of training examples, and they will do it in parallel. The problem is that these calculations can be interdependent, and then different orderings produce different results. Is this still OK?
Perhaps, the processing order is not so important for SGD, is it?
Thank you!
Yes, SGD is non-deterministic anyway.
So different order is OK.