Sometimes answers lead to more questions. In data science, this is true almost every time.
Analytics Challenge – The Shady Gamble
In the previous article, you were approached by Scotland Yard to investigate charges against a gambler of dubious character. The allegation against him was that he had loaded his die so that some numbers are more likely to appear after a throw than others. They were seeking your help to determine whether the gambler’s die is biased. They had shared the following barplot of the gambler’s previous 1000 throws.
|As mentioned in the previous part, all the analytics challenges on YOU CANalytics require your participation to proceed. We have a good discussion running in the comments section of the previous part (discussion link). This part will reveal a few more clues for you to further that discussion. I will use your comments from the previous part, along with some creative freedom, to move this investigation ahead.|
Your initial reaction after looking at this plot was: the gambler’s die looks biased. You know that for a fair die, each number has an equal likelihood of appearing. In other words, the lengths of all the bars in the above bar chart must be roughly the same. Since that doesn’t seem to be the case, the die must be biased.
Later, after a more careful look, you noticed that the bar plot is zoomed in on the Y-axis. In effect, Scotland Yard has used Sherlock Holmes’ favorite instrument of detection, i.e. the magnifying glass.
Realizing this, you made an appropriately scaled version of the same plot (notice that this one starts from 0 on the Y-axis).
Please help Scotland Yard answer the following questions. Like last time, please add your comments in the discussion section at the bottom of this article.
|1||Do you still feel that the die is biased? Why or why not?|
|2||Is the use of magnifying glass (Sherlock Holmes’ favorite instrument for detection) justified here? Why or why not?|
|3||Can you see a similarity between this analysis and the way the popular press reports economic indicators (the stock market in particular)?|
To continue with your analysis, you have placed bars for the expected frequency, i.e. 1000/6 ≈ 166.67 throws per face, next to the observed frequencies for the gambler’s die. The following is the resulting plot.
Now, our original question can be rephrased as: are the blue bars and the orange bars significantly different from each other? If yes, the die is biased. At first, to answer this question, you had suggested simulation. The idea behind simulation is to repeat the experiment of 1000 die rolls many times through a random number generator and compare those results with our observation.
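The simulation idea can be sketched in a few lines of code. The snippet below is a minimal illustration in Python (the article’s own code is in R); the choice of “maximum deviation from 166.67” as the yardstick is an assumption made here for simplicity, not something Scotland Yard prescribed.

```python
# Sketch of the simulation idea: roll a fair die 1000 times, many times over,
# and see how often a fair die looks at least as lopsided as the gambler's.
import random
from collections import Counter

observed = [162, 160, 156, 173, 182, 167]  # the gambler's 1000 throws

def simulate_counts(n_rolls=1000, seed=None):
    """Roll a fair die n_rolls times and return counts for faces 1..6."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return [counts[face] for face in range(1, 7)]

def max_deviation(counts, expected=1000 / 6):
    """Largest absolute gap between a face's count and the expected 166.67."""
    return max(abs(c - expected) for c in counts)

# Repeat the 1000-roll experiment many times with a fair die.
n_sims = 2000
sims = [max_deviation(simulate_counts(seed=s)) for s in range(n_sims)]
obs_dev = max_deviation(observed)  # 15.33, driven by the 182 fives

# Fraction of fair-die simulations at least as extreme as the observation:
p_sim = sum(dev >= obs_dev for dev in sims) / n_sims
print(f"observed max deviation: {obs_dev:.2f}, simulated p ~ {p_sim:.3f}")
```

If a large fraction of fair-die simulations stray from 166.67 at least as far as the gambler’s counts do, the observed plot is unremarkable for a fair die.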
Simulation is a good way to test our hypothesis; however, for this problem we will employ a more theoretically grounded method: the Chi-Square goodness-of-fit (GOF) test. In its simplest form, the Chi-Square GOF test computes a normalized difference between the observed and expected distributions and compares that difference against a reference value from the Chi-Square distribution. Scotland Yard wants to know:
|4||What is the difference between simulation and Chi-square test? How are they similar to test random events?|
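Before turning to the R code, the “normalized difference” mentioned above can be made concrete: the statistic is χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, summed over the six faces. Here is a quick by-hand computation (a Python sketch for illustration only; the article’s code is R). The critical value 11.070 for 5 degrees of freedom at the 0.05 level is a standard Chi-Square table value.

```python
# By-hand Chi-Square goodness-of-fit statistic for the gambler's die.
observed = [162, 160, 156, 173, 182, 167]
expected = [1000 / 6] * 6  # about 166.67 throws per face for a fair die

# chi-square = sum over faces of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-square statistic: {chi_sq:.3f}")  # 2.732

# With 6 - 1 = 5 degrees of freedom, the standard 0.05-level critical
# value is 11.070; a statistic below it cannot reject the fair-die hypothesis.
critical_value = 11.070
print("reject fair-die hypothesis?", chi_sq > critical_value)
```

The statistic of about 2.73 is nowhere near the critical value, which foreshadows the verdict the R code below will deliver.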
You wrote down the following lines of R code to calculate the Chi-Square statistic (credit for this code: Jason Corderoy / Ram). The first line of the code takes in the observed frequency for each number on the die:
Observed <- c(162,160,156,173,182,167)
The next line of the code produces the expected probabilities under a uniform distribution, where each number on the die has a 1/6 chance of appearing:
Expected <- rep(1/6,6)
The subsequent line of the code runs the Chi-Square goodness-of-fit test on the observed counts against the expected probabilities:
test.chi <- chisq.test(Observed,p=Expected)
The final line of the code tests the hypothesis at the 0.05 significance level, i.e. we accept a 5% chance of wrongly declaring a fair die biased (a Type I error):
ifelse(test.chi$p.value <= 0.05, "NULL Hypothesis Rejected: Die not fair.", "NULL Hypothesis not rejected: Die fair.")
After running the last line of code, R spews out the following result: “NULL Hypothesis not rejected: Die fair.” The inspector from Scotland Yard is completely confused by now. He poses the following questions:
|5||What is the rationale behind choosing p-value of 0.05 as the threshold to accept or reject the hypothesis?|
|6||Is there a better way to represent the results rather than conclusive acceptance or rejection of hypothesis? How will it help?|
Finally, Scotland Yard provides you with one last piece of information, about the values that favor the house. The house always makes money when a number above 3 appears on the die (credit: Terry Taerum). The final question you hear from the inspector is:
|7||In light of this new evidence, i.e. numbers 4, 5, and 6 always make money for the house, would you change your conclusion about the die being fair? Why or why not?|
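One way to start exploring question 7 (a sketch of the idea, not the article’s answer): collapse the six faces into two bins, house-favorable (4–6) versus player-favorable (1–3), and rerun the goodness-of-fit test on the two bin counts. Again, Python is used here purely for illustration; the 3.841 critical value for 1 degree of freedom at the 0.05 level is a standard Chi-Square table value.

```python
# Regroup the counts into house-favorable (faces 4-6) vs player-favorable
# (faces 1-3) bins, then apply the same chi-square goodness-of-fit idea.
observed = [162, 160, 156, 173, 182, 167]
house_wins = sum(observed[3:])   # faces 4, 5, 6 -> 522
player_wins = sum(observed[:3])  # faces 1, 2, 3 -> 478

# A fair die gives each bin an expected count of 500 out of 1000 throws.
expected = 500
chi_sq = ((house_wins - expected) ** 2 / expected
          + (player_wins - expected) ** 2 / expected)
print(f"house: {house_wins}, player: {player_wins}, chi-square: {chi_sq:.3f}")

# Standard 0.05-level critical value for 1 degree of freedom is 3.841.
print("reject 50/50 split?", chi_sq > 3.841)
```

Whether this regrouped view should change the verdict, and whether slicing the data after seeing it is even legitimate, is exactly what the inspector wants you to argue in the comments.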
|This time, Scotland Yard has way too many questions for you (7, to be precise). You don’t need to discuss all of them; take your pick and start a discussion in the comment section at the bottom. Please mention the question number(s) in your comment.|