Pragmatic Programming Techniques: An output of a truly random process

Recently I had a discussion with my data science team about whether we can challenge the claim that a set of observations follows a random process. Basically, data science is all about learning the hidden patterns that affect the observations. If the observations follow a truly random process, then there is nothing to learn. Let me walk through an example to illustrate.

Let's say someone claims that he is throwing a fair die (with numbers 1 to 6) sequentially.
In other words, he claims that the output of each throw is uniformly random, i.e. every number from 1 to 6 has an equal chance of coming up.

He then throws the die 12 times and shows you the output sequence. From the output, can you judge whether this really is a sequence of throws of a fair die? In other words, does the output really follow a random process as claimed?

Let's look at 3 situations:

Situation 1 output is [4, 1, 3, 1, 2, 6, 3, 5, 5, 1, 2, 4]
Situation 2 output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Situation 3 output is [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]

At first glance, the output of situation 1 looks like the result of a random process. Situation 2 definitely doesn't. Situation 3 is harder to judge. If you look at the proportions of the output numbers, every number appears exactly twice in situation 3, which matches the uniform distribution of a fair die perfectly. But if you look at the ordering of the numbers, situation 3 follows a well-defined order that doesn't seem random at all. Therefore, I don't think the output of situation 3 follows a random process.
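The proportion check described above can be sketched in a few lines of Python. This is only an illustration: the helper name `face_counts` and the variable names are my own, not part of any library.

```python
from collections import Counter

situation_1 = [4, 1, 3, 1, 2, 6, 3, 5, 5, 1, 2, 4]
situation_2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
situation_3 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]

def face_counts(throws):
    """Return how often each face 1..6 appears (hypothetical helper)."""
    counts = Counter(throws)
    return [counts[face] for face in range(1, 7)]

for name, throws in [("situation 1", situation_1),
                     ("situation 2", situation_2),
                     ("situation 3", situation_3)]:
    print(name, face_counts(throws))
```

Note that the counts ignore ordering entirely, which is exactly why situation 3 passes this check even though its ordering looks suspicious.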

However, this seems to be a very arbitrary choice. Why would I look at the number ordering at all? Should I look at more properties, such as ...

Whether the numbers in even positions are even
The average gap between consecutive throws
Whether the number in the 3rd position is always smaller than the number in the 10th position
...

As you can see, depending on my imagination, the list can go on and on. How can I tell whether situation 3 follows a random process or not?

Method 1: Randomization Test

This is based on the hypothesis testing methodology. We establish the null hypothesis H0 that situation 3 follows a random process.

First, I define an arbitrary list of statistics of my choice

statisticA = proportion of even numbers in even position
statisticB = average gap between consecutive output numbers
statisticC = ...
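As a rough sketch, the first two statistics could be written as below. The exact definitions are my own reading of the descriptions above (positions counted from 1, gap taken as the absolute difference), and the function names are hypothetical.

```python
situation_1 = [4, 1, 3, 1, 2, 6, 3, 5, 5, 1, 2, 4]
situation_3 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]

def statistic_a(throws):
    """Share of the numbers in even positions (2nd, 4th, ...) that are even.

    Assumption: positions are counted from 1.
    """
    even_position_values = throws[1::2]
    evens = sum(1 for value in even_position_values if value % 2 == 0)
    return evens / len(even_position_values)

def statistic_b(throws):
    """Mean absolute difference between consecutive throws.

    Assumption: "gap" means the absolute difference.
    """
    gaps = [abs(b - a) for a, b in zip(throws, throws[1:])]
    return sum(gaps) / len(gaps)

print(statistic_a(situation_1), statistic_b(situation_1))
print(statistic_a(situation_3), statistic_b(situation_3))
```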

Second, I run a simulation that generates 12 numbers from a truly random process, and calculate the statistics defined above on the simulated output.

Third, I repeat the simulation N times and output the mean and standard deviation of each statistic.

If statistic A, B or C of situation 3 is too far from the simulated mean of that statistic, measured in simulated standard deviations and converted to a p-value, then we conclude that situation 3 does not follow a random process. Otherwise, we don't have enough evidence to reject the null hypothesis, so we keep treating situation 3 as the output of a random process.
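The three steps above could look roughly like this in plain Python. It is a minimal sketch under my own assumptions: the function names, the default of 10,000 simulations, the fixed seed, and the way the empirical two-sided p-value is computed are all illustrative choices, and only statistic A (read as the share of even-position numbers that are even) is shown.

```python
import random
import statistics

def even_share_in_even_positions(throws):
    """Statistic A under one reading: share of 2nd, 4th, ... values that are even."""
    picked = throws[1::2]
    return sum(1 for value in picked if value % 2 == 0) / len(picked)

def randomization_test(observed, statistic, n_sims=10_000, seed=0):
    """Compare a statistic of `observed` with its distribution under fair-die throws."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    simulated = [
        statistic([rng.randint(1, 6) for _ in observed])
        for _ in range(n_sims)
    ]
    mean = statistics.mean(simulated)
    std = statistics.stdev(simulated)
    observed_value = statistic(observed)
    z_score = (observed_value - mean) / std
    # Empirical two-sided p-value: share of simulations at least as far
    # from the simulated mean as the observed value is.
    distance = abs(observed_value - mean)
    p_value = sum(1 for s in simulated if abs(s - mean) >= distance) / n_sims
    return observed_value, mean, std, z_score, p_value

situation_3 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
observed_value, mean, std, z_score, p_value = randomization_test(
    situation_3, even_share_in_even_positions
)
print(f"observed={observed_value:.2f} mean={mean:.2f} std={std:.2f} "
      f"z={z_score:.2f} p={p_value:.3f}")
```

Any other statistic can be passed in the same way, as long as it is a function that takes the list of throws and returns a number.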

Method 2: Predictability Test

This is based on the theory of predictive analytics.

First, I pick a particular machine learning algorithm, let's say time series forecasting using ARIMA.

Note that I could also choose RandomForest and create some arbitrary input features (such as the previous output number, the maximum of the last 3 numbers, etc.).

Second, I train my selected predictive model on the output data of situation 3 (in this example situation 3 has only 12 data points, but imagine we have many more than that).

Third, I evaluate my model on the test set and see whether its predictions are much better than a random guess. For example, I can measure the lift of my model by comparing the RMSE (root mean square error) of my predictions with the standard deviation of the test data. If the lift is insignificant, then I conclude that situation 3 results from a random process, because my predictive model didn't learn any pattern.
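To make the idea concrete without pulling in ARIMA or RandomForest, here is a minimal sketch that swaps in a much simpler model: a lookup table that predicts the next number as the average of whatever followed the same previous number in the training data. The model, the 70/30 split, the sequence length of 1200, and all names are my own assumptions for illustration.

```python
import math
import random
import statistics
from collections import defaultdict

def fit_previous_value_model(train):
    """Map each previous value to the mean of the values that followed it."""
    followers = defaultdict(list)
    for previous, following in zip(train, train[1:]):
        followers[previous].append(following)
    table = {prev: statistics.mean(vals) for prev, vals in followers.items()}
    fallback = statistics.mean(train)  # used for values never seen in training
    return table, fallback

def rmse_to_std_ratio(sequence, train_fraction=0.7):
    """RMSE on the test part divided by the standard deviation of the test part.

    A ratio near 1 means the model is no better than always guessing the mean;
    a ratio well below 1 means it has learned a pattern.
    """
    split = int(len(sequence) * train_fraction)
    train, test = sequence[:split], sequence[split:]
    table, fallback = fit_previous_value_model(train)
    # Each test value is predicted from the value immediately before it.
    previous_values = sequence[split - 1:-1]
    errors = [table.get(prev, fallback) - actual
              for prev, actual in zip(previous_values, test)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return rmse / statistics.pstdev(test)

# A longer version of situation 3, and a simulated fair die for comparison.
patterned = [1, 2, 3, 4, 5, 6] * 200
rng = random.Random(0)
simulated_fair = [rng.randint(1, 6) for _ in range(1200)]

patterned_ratio = rmse_to_std_ratio(patterned)
fair_ratio = rmse_to_std_ratio(simulated_fair)
print(f"patterned: {patterned_ratio:.3f}  simulated fair die: {fair_ratio:.3f}")
```

A single model failing to find a pattern only shows that this particular model found nothing, so in practice one would try several models and feature sets before settling on "random".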

Saturday, April 29, 2017
