Why do data science and analytics projects fail? At what stage of the project life cycle are they most vulnerable to failure? Like any living creature, an analytics project is most likely to fail either in its infancy or in the final stages of its life cycle. A successful analytics project, like a successful life-form, leaves a legacy for the next generations to follow.
Thinking about data in a meticulous and scientific way is at the core of a successful analytics project that produces a competitive edge for the organization. In this article, we will discuss some of the mistakes made while working with data that invariably lead analytics projects to failure. In the absence of a scientific and rational thought process while working with data, most analytics projects experience infant mortality. A good way for us to understand this struggle is to compare it with the...
Struggle of a New Born Wildebeest
The National Geographic Channel is, for many of us, the most important way to experience the wild. While I was growing up, on Sunday mornings the national television in India used to broadcast an hour-long show by the National Geographic Society. I vividly remember one of the episodes where a wildebeest mother gave birth to a baby. The wildebeest, by the way, got its name from its resemblance to wild cattle: the word is Afrikaans for “wild beast”.
I found the birth of a baby wildebeest both wonderful and grotesque at the same time. The baby wildebeest, covered in slimy discharge, slowly dropped out of the mother. Yuck! That was gross to me when I was 13 years old. However, what followed were the visuals of a great struggle and triumph. The baby had to stand up immediately after the birth and suckle milk from its mother. This was absolutely essential for the baby’s survival; otherwise, it would have become a meal for the lurking predators. The baby struggled for a while and fell hard on the ground several times during this effort. The baby had no one to help, not even the mother, while it rose from the ground and reached for her. Finally, we witnessed a great triumph of nature when the baby crossed the first hurdle for its survival.
Analytics projects also need to cross this initial hurdle for their survival. Sound thinking about data and the problem statement is at the core of crossing this hurdle.
5 Mistakes at the Initial Stages of Analytics Projects
There are several reasons why analytics projects fail to generate beneficial outcomes for an organization. In this article, I will focus purely on the initial hurdles for analytics projects. Moreover, our complete attention will be on good practices while thinking about data and the business problem statement.
This is my list of 5 mistakes that one wants to avoid at the beginning of analytics projects.
- Eagerness to solve problems
- Failure to identify the right variables
- Treatment of missing data
- Beating down the outliers
- Not being careful about reproducibility of results
Let me discuss these mistakes and ways to avoid them in some detail in the next sections.
Mistake 1. Eagerness to Solve Problems
On both LinkedIn and Facebook you must have seen people posting problems like the ones shown below:
i) Tell a word that starts and ends with the letter ‘R’
ii) 95% of people will fail this simple mathematics problem: solve 36÷3−4×9÷3
Almost always, users on these social media sites immediately start answering these questions. For instance, for the first problem, you will notice answers such as rear, roar, render, rejoinder etc. The answers tend to get much more sophisticated and complicated as more users chime in. You will invariably find hundreds or thousands of responses to every such problem. Interestingly, you will rarely find anybody responding with: why is this an important question? Anyone who does ask this is considered a spoilsport.
Something extremely interesting is happening here. Humans are wired, both by nature and nurture (read schooling), to answer questions without questioning the question. We see a problem and we need to solve it. This is a dangerous strategy for analytics projects and often results in quick mortality for the projects.
Identification of the right business problem is at the core of successful analytics projects. Moreover, estimation of the business benefit, both financial and intangible, is the foremost task for the project team. Not every business problem is equally important, and trust me, several problems are not even worth putting any effort into. Always ask why the problem you are solving is important, and don’t start your project till you have a satisfactory answer.
As for the second problem posted at the top of this section, the solution is in the BODMAS rule we learned in primary school. The answer is zero, but again, why is this problem important? You could type this equation in an Excel cell and get the answer in no time.
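For the curious, the order of operations is easy to check in code. Here is a minimal Python sketch; the variable names are only illustrative.

```python
# Division and multiplication bind tighter than subtraction,
# so the expression groups as (36 / 3) - ((4 * 9) / 3).
left_term = 36 / 3        # 12.0
right_term = 4 * 9 / 3    # 36 / 3 = 12.0
result = left_term - right_term

# Python applies the same precedence when given the whole expression.
assert result == 36 / 3 - 4 * 9 / 3
print(result)  # 0.0
```

Getting the number takes seconds. Deciding whether the question deserved an answer is the harder part.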
Mistake 2. Failure to Identify the Right Variables
After identification of the right question(s), the second step is to identify the right data and variables to work with. Assume you want to build a model to predict job satisfaction for employees. In any human resources system, the easily available and highly quantifiable metrics are income, bonus, designation, increments etc. But we all know from our experience that job satisfaction is a highly complicated phenomenon and can barely be predicted with just these variables. However, when one builds this model there is a greater temptation to just use the easily available variables. The ability to identify the right set of variables at the beginning of the project differentiates a good analyst from the rest. Identification of variables requires a good understanding of the domain and lots of creativity. Creativity helps in generating derived variables from the available data in the business systems.
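To make the idea of derived variables concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `Employee` fields, the sample records, and the three derived variables are hypothetical, not taken from any real HR system, and whether they actually relate to job satisfaction would need to be tested with domain experts and data.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class Employee:
    # Hypothetical fields; a real HR system will differ.
    name: str
    designation: str
    annual_income: float
    joined_on: date
    last_promoted_on: date


def derive_variables(employees, as_of):
    """Return one dict of derived variables per employee."""
    # Median income per designation, used as a peer benchmark.
    incomes_by_role = {}
    for e in employees:
        incomes_by_role.setdefault(e.designation, []).append(e.annual_income)
    peer_median = {role: median(v) for role, v in incomes_by_role.items()}

    rows = []
    for e in employees:
        rows.append({
            "name": e.name,
            # Income relative to peers, rather than raw income.
            "income_vs_peers": e.annual_income / peer_median[e.designation],
            # Time since the last promotion, in years.
            "years_since_promotion": (as_of - e.last_promoted_on).days / 365.25,
            # Time with the organization, in years.
            "tenure_years": (as_of - e.joined_on).days / 365.25,
        })
    return rows


# Made-up records for illustration only.
staff = [
    Employee("A", "Analyst", 50_000, date(2015, 1, 1), date(2015, 1, 1)),
    Employee("B", "Analyst", 60_000, date(2017, 1, 1), date(2019, 1, 1)),
    Employee("C", "Manager", 90_000, date(2012, 1, 1), date(2018, 1, 1)),
]
derived = derive_variables(staff, as_of=date(2020, 1, 1))
for row in derived:
    print(row)
```

The raw columns (income, joining date, promotion date) are the easily available ones. The derived columns are where domain knowledge and creativity come in.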
Once the right set of variables is identified and prepared, the next step is diagnostic or exploratory data analysis (EDA) of these variables. The next two mistakes are linked to EDA, and they happen while handling missing data and outliers.
Mistake 3. Treatment of Missing Data
“Is there any point to which you would wish to draw my attention [Mr. Sherlock Holmes]?”
“To the curious incident of the dog in the night-time.”
“The dog did nothing in the night-time.”
“That was the curious incident,” remarked Sherlock Holmes. (From “Silver Blaze” in The Memoirs of Sherlock Holmes)
Missing data is a reality of virtually every business data set. In statistics classes, it is taught that missing data is the biggest enemy of analysis. You are told to replace missing data with either the average or some other, more sophisticated value generated through regression or other fancy techniques. At times, this process of replacing missing values becomes so mechanical that analysts tend to forget that there could be a reason why the data is missing.
Sherlock Holmes, in the above dialogue, points out that the absence of something is also evidence, as in the case of the dog that did not bark. It signified that someone familiar to the dog had entered the stable during the night. This helped Sherlock Holmes solve the mystery of a missing horse named Silver Blaze.
Similarly, missing data or the absence of something can, in certain cases, be strong evidence in itself. This is particularly true in risk and fraud analytics. At the beginning of an analytics project, it is a good idea to scrutinize missing data and identify whether there are compelling clues hiding within it.
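One simple way to scrutinize missing data is to compare the outcome between records where a field is missing and records where it is present. The Python sketch below does this on toy data. The `applications` list, the `income` and `fraud` fields, and the numbers are all made up for illustration.

```python
# Hypothetical loan applications: None marks a missing income field.
applications = [
    {"income": 52_000, "fraud": False},
    {"income": None,   "fraud": True},
    {"income": 48_000, "fraud": False},
    {"income": None,   "fraud": True},
    {"income": 75_000, "fraud": False},
    {"income": None,   "fraud": False},
    {"income": 61_000, "fraud": True},
    {"income": 39_000, "fraud": False},
]


def fraud_rate(rows):
    """Share of rows flagged as fraud; 0.0 for an empty group."""
    return sum(r["fraud"] for r in rows) / len(rows) if rows else 0.0


# Split on whether the field is missing, before any imputation.
missing = [r for r in applications if r["income"] is None]
present = [r for r in applications if r["income"] is not None]

rate_missing = fraud_rate(missing)
rate_present = fraud_rate(present)
print(f"fraud rate, income missing: {rate_missing:.2f}")
print(f"fraud rate, income present: {rate_present:.2f}")
```

If the two rates differ noticeably on real data, it may be worth keeping a "was missing" indicator as a variable in its own right instead of silently filling in the average.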
Mistake 4. Beating Down the Outliers
Another problem for analysis, highlighted by every statistics textbook, is outliers. Outliers are observations that are extremely dissimilar to the rest of the studied population. For instance, if you are studying the net wealth of individuals on the planet, then Bill Gates and the Sultan of Brunei are complete outliers. One of the strategies to deal with outliers is data transformation, i.e. taking the log or the square root of all the observations. This beats the data down to a normal range. Again, this is a good strategy in many cases but ineffective in several others. For example, in several marketing analytics applications, it is a good idea to create different segments of the population and build a separate model for each segment. Including Bill Gates and the Sultan of Brunei in the same model as the majority of the world’s population does not make sense.
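The two strategies can be compared side by side. In this Python sketch the net-worth figures and the `THRESHOLD` cut-off are invented for illustration; the last two values simply stand in for extreme outliers.

```python
import math
from statistics import mean

# Made-up net-worth figures; the last two play the role of extreme outliers.
net_worth = [20_000, 35_000, 50_000, 80_000, 120_000,
             40_000_000_000, 90_000_000_000]

# Strategy 1: a log transform compresses the range,
# but everyone still sits in one population.
log_worth = [math.log10(x) for x in net_worth]

# Strategy 2: split into segments and summarize (or model) each separately.
THRESHOLD = 1_000_000_000  # hypothetical cut-off between segments
mass_segment = [x for x in net_worth if x < THRESHOLD]
ultra_segment = [x for x in net_worth if x >= THRESHOLD]

print("raw mean:", mean(net_worth))
print("log10 range:", round(min(log_worth), 2), "to", round(max(log_worth), 2))
print("mass-segment mean:", mean(mass_segment))
print("ultra-segment mean:", mean(ultra_segment))
```

The raw mean describes nobody in the data. The log transform tames the scale, while the segment means each describe a group that actually exists.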
I have used missing data and outliers as a way to highlight that analysts need to be careful about blindly using any statistical technique. In the next segment, we will discuss a serious problem that plagues many scientific investigations.
Mistake 5. Not Being Careful about Reproducibility of Results
A few years ago, Amgen, a biotech company, decided to repeat over 50 landmark cancer biology studies published in the topmost scientific journals. They could only reproduce results for 6 out of 53 studies. That is a success rate of a little over 10 percent. In another effort of a similar sort, a group of qualified researchers tried to repeat studies from three prestigious psychology journals. They could reproduce just 39 out of 100 studies. Reproducibility is a fundamental tenet of any scientific investigation. Any result that you get today must be reproducible tomorrow in somewhat similar conditions. Analytics or business analysis is no different.
On the brighter side, scientists have recently confirmed that Albert Einstein’s General Theory of Relativity holds true for a galaxy 13 billion light-years from Earth. If you talk about reproducibility, Einstein’s theory takes it to a new height, or rather distance. Analytics projects need not be as generalizable as the General Theory of Relativity, but they still need to be reproducible within a localized boundary of region and time.
Predictive models are built with the idea that the model built today will be good in the future. If the results are not reproducible, then the predictive models are completely worthless. It is essential for the project team to identify reasons why their models won’t work in the future.
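One common way to probe this, which I am using here as an illustration and not as the only approach, is an out-of-time check: score the model on a later period it never saw and compare the error with the development period. The Python sketch below uses a toy model and invented observations; `model`, `development` and `later` are all hypothetical.

```python
from statistics import mean


def mean_abs_error(model, rows):
    """Average absolute gap between predictions and observed outcomes."""
    return mean(abs(model(r["x"]) - r["y"]) for r in rows)


def degradation_ratio(model, development_rows, later_rows):
    """Error on the later period divided by error on the development period."""
    dev_error = mean_abs_error(model, development_rows)
    if dev_error == 0:
        raise ValueError("development error is zero; ratio is undefined")
    return mean_abs_error(model, later_rows) / dev_error


# Hypothetical model fitted on the development period: y is roughly 2 * x.
def model(x):
    return 2 * x


# Toy observations; in the later period the relationship has drifted.
development = [{"x": 1, "y": 2.1}, {"x": 2, "y": 3.9}, {"x": 3, "y": 6.2}]
later = [{"x": 1, "y": 2.6}, {"x": 2, "y": 5.1}, {"x": 3, "y": 7.4}]

ratio = degradation_ratio(model, development, later)
print(f"out-of-time error is {ratio:.1f}x the development error")
```

A ratio well above 1 is a prompt to ask what changed between the two periods, before the model goes anywhere near a business system.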
Define Segments and Boundaries to Make Your Models Robust and Reproducible
Moreover, it is also a good idea to define boundaries and segments within which the model will operate properly. For instance, consider this fictitious model to estimate work experience for professionals:
Work Experience = Age - 21
This equation says that someone who is just born has -21 years of work experience. We know this is incorrect. However, most models in business systems are implemented without defining the boundaries of the predictor variables and the surrounding environment. This makes the model behave erratically for a new segment. The above model for work experience is possibly correct for ages between 21 and 60 years. Outside these boundaries, the model makes no sense.
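Writing the boundary into the implementation is straightforward. Here is a minimal Python sketch of the fictitious model with the 21 to 60 range enforced. The function name and the choice to raise an error (instead of, say, returning a missing value) are my assumptions.

```python
MIN_AGE, MAX_AGE = 21, 60  # range within which the toy model is assumed valid


def work_experience(age):
    """Estimate years of work experience as age - 21, only inside the valid range."""
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(
            f"age {age} is outside the supported range {MIN_AGE}-{MAX_AGE}"
        )
    return age - MIN_AGE


print(work_experience(35))  # 14

# A newborn is rejected instead of getting -21 years of experience.
try:
    work_experience(0)
except ValueError as err:
    print("rejected:", err)
```

Failing loudly outside the boundary forces whoever feeds a new segment into the model to notice that it was never built for that segment.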
Sign-off Note
The struggle and triumph of a baby wildebeest hold an important lesson for teams involved in analytics projects. The baby had to stand up without any help from anyone, including the mother. Similarly, analytics teams need to rely on their own scientific logic and knowledge of numbers to make the journey, because for these aspects of the project they won’t get any help from the champions or the sponsors of the project.