Saturday, July 15, 2017

Regression model outputting probability density distribution

For a classification problem (let's say the output is one of the labels R, G, B), how do we report a prediction?

There are two formats in which we can report our prediction:
  1. Output a single value, the most probable outcome, e.g. output "B" if P(B) > P(R) and P(B) > P(G).
  2. Output the estimated probability of each label (e.g. R=0.2, G=0.3, B=0.5).
But if we look at a regression problem (let's say we output a numeric value v), most regression models output only a single value (the one that minimizes the RMSE).  In this article, we will look at some use cases where outputting a probability density function is much preferred.

Predict the event occurrence time

As an illustrative example, we want to predict when a student will finish her work, given that she has already spent some time s on it.  In other words, we want to estimate E[t | t > s], where t is a random variable representing the total duration and s is the elapsed time so far.

Estimating the time t is generally hard if the model outputs only an expectation.  Notice that the model has the same set of features, except that the elapsed time changes continuously as time passes, so a single expected value computed at the start cannot simply be adjusted as s grows.

Let's look at how we can train a prediction model that outputs a density distribution.

Let's say our raw data has the schema [feature, duration]:
  • f1, 13.30
  • f2, 14.15
  • f3, 15.35
  • f4, 15.42
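Take a look at the range (i.e. the min and max) of the output value and pick a set of thresholds that covers it.  We then transform the raw data into training data with the following schema, where each column records whether the duration falls below that threshold: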
[feature, dur<13, dur<14, dur<15, dur<16]
  • f1, 0, 1, 1, 1
  • f2, 0, 0, 1, 1
  • f3, 0, 0, 0, 1
  • f4, 0, 0, 0, 1
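
The transformation above can be sketched in a few lines of Python.  This is only an illustration: the function name `to_threshold_labels` is made up here, and the records and thresholds are the ones from the example.

```python
# Raw records (feature, duration), copied from the example above.
raw = [("f1", 13.30), ("f2", 14.15), ("f3", 15.35), ("f4", 15.42)]

# Thresholds chosen to cover the observed range of durations.
thresholds = [13, 14, 15, 16]

def to_threshold_labels(records, thresholds):
    """Return (feature, labels) rows, where labels[i] is 1 if duration < thresholds[i]."""
    return [
        (feature, [int(duration < th) for th in thresholds])
        for feature, duration in records
    ]

rows = to_threshold_labels(raw, thresholds)
for feature, labels in rows:
    print(feature, labels)
```

Each row carries one binary label per threshold, so every label column can serve as the target of its own classifier.
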
After that, we train 4 classification models, one per threshold column:
  • feature, dur<13
  • feature, dur<14
  • feature, dur<15
  • feature, dur<16
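
The per-threshold training loop might look like the sketch below.  `BaseRateClassifier` is a hypothetical placeholder that ignores the feature and returns the fraction of positive training labels; in practice, any binary classifier that outputs a probability would take its place.  The names `train_threshold_models` and `prob_positive` are likewise illustrative and do not come from any particular library.

```python
class BaseRateClassifier:
    """Placeholder binary classifier (an assumption for this sketch).

    It ignores the features and predicts the training base rate.
    """

    def fit(self, features, labels):
        self.rate = sum(labels) / len(labels)
        return self

    def prob_positive(self, feature):
        # Probability that the label is 1 for the given feature.
        return self.rate


def train_threshold_models(features, durations, thresholds, make_model):
    """Train one binary classifier per threshold on the label (duration < threshold)."""
    models = {}
    for th in thresholds:
        labels = [int(d < th) for d in durations]
        models[th] = make_model().fit(features, labels)
    return models


features = ["f1", "f2", "f3", "f4"]
durations = [13.30, 14.15, 15.35, 15.42]
thresholds = [13, 14, 15, 16]

models = train_threshold_models(features, durations, thresholds, BaseRateClassifier)

# Cumulative probabilities P(dur < th) for a new observation.
cdf = [models[th].prob_positive("f_new") for th in thresholds]
print(cdf)
```

With the placeholder model, the cumulative probabilities are just the empirical fractions of the training durations below each threshold.  A real classifier would make them depend on the feature.
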

Now, given a new observation with its corresponding feature, we can invoke these 4 models, each of which outputs the probability of its binary classification (a cumulative probability).  If we want the probability density, we simply take the difference between consecutive cumulative probabilities (i.e. differentiate the cumulative probability).
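
Taking the difference of cumulative probabilities can be sketched as follows.  One caveat, which is an observation added here rather than part of the original recipe: because the classifiers are trained independently, nothing guarantees that their outputs increase with the threshold.  The sketch applies a running maximum as one simple way to enforce that.  The `cdf` values are hypothetical classifier outputs.

```python
def cdf_to_bin_probs(thresholds, cdf):
    """Turn cumulative probabilities at each threshold into per-bin probability mass.

    Returns a list of ((low, high), probability) pairs, one per pair of
    consecutive thresholds. Mass below the first threshold is not included.
    """
    # Enforce a non-decreasing sequence capped at 1 (an assumption of this sketch).
    monotone = []
    running = 0.0
    for p in cdf:
        running = max(running, min(p, 1.0))
        monotone.append(running)

    bins = []
    for i in range(1, len(thresholds)):
        interval = (thresholds[i - 1], thresholds[i])
        bins.append((interval, monotone[i] - monotone[i - 1]))
    return bins


thresholds = [13, 14, 15, 16]
cdf = [0.0, 0.25, 0.5, 1.0]  # hypothetical outputs of the 4 classifiers

bins = cdf_to_bin_probs(thresholds, cdf)
for interval, prob in bins:
    print(interval, prob)
```

Finer thresholds give a finer-grained density, at the cost of training more classifiers.
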

At this point, we can output a probability distribution for any given input feature.

Now, we can easily estimate the remaining time from the expected time within the shaded region (the part of the distribution beyond the elapsed time s).  As time passes, we just need to slide the red line continuously and recalculate the expected time.  We don't need to execute the prediction model again unless the input features have changed.
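
The sliding recalculation can be sketched as below.  It assumes the density is available as per-bin probability mass and that the mass is spread uniformly within each bin; both the `bins` values and the function name `expected_total_given_elapsed` are hypothetical.

```python
def expected_total_given_elapsed(bins, s):
    """Estimate E[t | t > s] from ((low, high), probability) bins.

    Assumes probability mass is uniform within each bin. Returns None when
    no mass remains beyond s.
    """
    mass = 0.0
    weighted = 0.0
    for (low, high), prob in bins:
        if high <= s:
            continue  # the whole bin lies before the elapsed time
        lo = max(low, s)
        remaining = prob * (high - lo) / (high - low)  # mass of the bin beyond s
        mass += remaining
        weighted += remaining * (lo + high) / 2  # midpoint of the remaining part
    if mass == 0.0:
        return None
    return weighted / mass


bins = [((13, 14), 0.25), ((14, 15), 0.25), ((15, 16), 0.5)]  # hypothetical density

for s in (12.0, 14.0, 15.5):
    total = expected_total_given_elapsed(bins, s)
    print("elapsed", s, "expected total", total, "expected remaining", total - s)
```

Only `s` changes between calls, so the distribution is computed once per observation and reused as time passes.
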

Predict cancellation before commitment 

As an illustrative example, let's say a restaurant customer has reserved a table at 8:00pm.  It is now 7:55pm and the customer still hasn't arrived.  What is the chance of a no-show?

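The same question arises in other settings.  For example, given a person (with feature x) and a current time of S - t, where S is the start time of a movie and the person still hasn't bought a ticket, predict the probability that this person ends up watching the movie.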

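Let's say our raw data has the schema [feature, arrival], where arrival is measured relative to the reserved time (negative means before it) and infinity marks a customer who never arrived: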
  • f1, -15.42
  • f2, -15.35
  • f3, -14.15
  • f4, -13.30
  • f5, infinity
  • f6, infinity
We transform this into training data with the following schema:
[feature, arr<-16, arr<-15, arr<-14, arr<-13]
  • f1, 0, 1, 1, 1
  • f2, 0, 1, 1, 1
  • f3, 0, 0, 1, 1
  • f4, 0, 0, 0, 1
  • f5, 0, 0, 0, 0
  • f6, 0, 0, 0, 0
After that, we train 4 classification models.
  • feature, arr<-16
  • feature, arr<-15
  • feature, arr<-14
  • feature, arr<-13
Notice that P(arr<0) can be smaller than 1 because the customer may be a no-show.
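
To answer the no-show question, the cumulative probabilities can be combined with the conditional probability rule: P(arr >= 0 | arr >= s) = (1 - P(arr<0)) / (1 - P(arr<s)), where s is the current time relative to the reservation.  The sketch below treats "no-show" as "has not arrived by the reserved time", which is an assumption, and the input values are hypothetical rather than outputs of a trained model.

```python
def no_show_probability(cdf_at_now, cdf_at_start):
    """P(arr >= 0 | arr >= s) given F(s) = P(arr < s) and F(0) = P(arr < 0).

    cdf_at_now is F(s) for the current time s; cdf_at_start is F(0).
    """
    still_missing = 1.0 - cdf_at_now  # probability the customer has not arrived by s
    if still_missing <= 0.0:
        return 0.0  # the model says the customer has certainly arrived already
    return (1.0 - cdf_at_start) / still_missing


# Hypothetical model outputs: P(arr < -5) = 0.5 and P(arr < 0) = 0.75.
p = no_show_probability(cdf_at_now=0.5, cdf_at_start=0.75)
print(p)
```

As the current time approaches the reserved time without an arrival, the denominator shrinks and the no-show probability rises.
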

In this post, we discussed some use cases where we need the regression model to output not just a single predicted value but also a probability density distribution.  We also illustrated how to build such a prediction model.
