Why do data science and analytics projects fail? At what stage of the project life cycle are they most vulnerable to failure? Like any living creature, an analytics project is most likely to fail either in its infancy or in the final stages of its life cycle. A successful analytics project, like a successful life-form, leaves a legacy for the next generations to follow.
Thinking about data in a meticulous and scientific way is at the core of a successful analytics project that produces a competitive edge for the organization. In this article, we will discuss some of the mistakes made while working with data that invariably lead analytics projects to failure. In the absence of a scientific and rational thought process while working with data, most analytics projects experience infant mortality. A good way for us to understand this struggle is to compare it with the...
Struggle of a New Born Wildebeest
National Geographic Channel is certainly the most important source for many of us to experience the wild. While I was growing up, on Sunday mornings the national television in India used to broadcast an hour-long show by the National Geographic Society. I vividly remember one of the episodes where a wildebeest mother gave birth to a baby. Wildebeest, by the way, got their name from their resemblance to wild cattle, i.e. wild ox or wildebeest.
I found the birth of a baby wildebeest both a wonderful and a grotesque event at the same time. The baby wildebeest, covered in slimy discharge, slowly dropped out of the mother. Yuck! That was gross to me when I was 13 years old. However, what followed were the visuals of a great struggle and triumph. The baby had to stand up immediately after the birth and suckle milk from its mother. This was absolutely essential for the baby’s survival; otherwise, it would have become a meal for the lurking predators. The baby struggled for a while and fell hard on the ground several times during this effort. The baby had no one to help, not even the mother, while it rose from the ground and reached for her. Finally, we witnessed a great triumph of nature when the baby crossed the first hurdle for its survival.
Analytics projects also need to cross this initial hurdle for their survival. Sound thinking about data and the problem statement is at the core of crossing this hurdle.
5 Mistakes at the Initial Stages of Analytics Projects

This is my list of 5 mistakes that one wants to avoid at the beginning of analytics projects.
- Eagerness to solve problems
- Failure to identify the right variables
- Treatment of missing data
- Beating down the outliers
- Not being careful about reproducibility of results
Let me discuss these mistakes and ways to avoid them in some detail in the next sections.
Mistake 1. Eagerness to Solve Problems
On both LinkedIn and Facebook you must have seen people posting problems like the ones shown below:
i) Tell a word that starts and ends with the letter ‘R’
ii) 95% people will fail this simple mathematics problem: solve 36÷3−4×9÷3
Almost always, users on these social media sites immediately start answering these questions. For instance, for the first problem, you will notice answers such as rear, roar, render, rejoinder, etc. The answers tend to get much more sophisticated and complicated as more users chime in. You will invariably find hundreds or even thousands of responses to every such problem. Interestingly, you will rarely find anybody retorting with: why is this an important question? If someone does ask this, he/she is considered a spoilsport.

Identification of the right business problem is at the core of successful analytics projects. Moreover, estimation of business benefit, both financial and intangible, is the foremost task for the project team. Not every business problem is equally important, and trust me, several problems are not even worth putting any effort into. Always ask why the problem you are solving is important, and don’t start your project till you have a satisfactory answer.
As for the second problem posted at the top of this section, the solution is in the BODMAS rule we learned in primary school: division and multiplication are carried out before subtraction, so 36÷3 = 12 and 4×9÷3 = 12. The answer is ‘zero’, but again, why is this problem important? You could type this equation in an Excel cell and get the answer in no time.
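The same check takes a few lines in any programming language. Below is a minimal sketch in Python, whose operator precedence follows the same convention as BODMAS for this expression; the variable names are only illustrative.

```python
# Evaluate 36 ÷ 3 − 4 × 9 ÷ 3 one term at a time.
first_term = 36 / 3        # 12.0
second_term = 4 * 9 / 3    # (4 * 9) / 3 = 12.0
result = first_term - second_term

# Python applies the same precedence when given the whole expression at once.
assert result == 36 / 3 - 4 * 9 / 3

print(result)  # 0.0
```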
Mistake 2. Failure to Identify the Right Variables

Once the right set of variables is identified and prepared, the next step is diagnostic or exploratory data analysis (EDA) of these variables. The next two mistakes are linked to EDA, and they happen while handling missing data and outliers.
Mistake 3. Treatment of Missing data
“Is there any point to which you would wish to draw my attention [Mr. Sherlock Holmes]?”
“To the curious incident of the dog in the night-time.”
“The dog did nothing in the night-time.”
“That was the curious incident,” remarked Sherlock Holmes.
― from Silver Blaze in The Memoirs of Sherlock Holmes

Sherlock Holmes, in the above dialogue, pointed out that the absence of something is also evidence, as in the case of the dog that did not bark. The silence signified that someone familiar to the dog had entered the stable during the night. This helped Sherlock Holmes solve the mystery of a lost horse named Silver Blaze.
Similarly, missing data, or the absence of something, can in certain cases be strong evidence in itself. This is particularly true in risk and fraud analytics. At the beginning of an analytics project, it is a good idea to scrutinize missing data and identify whether there are compelling clues hiding within it.
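One simple way to scrutinize missing data is to compare the outcome rate for records where a field is missing with the rate for records where it is present. The sketch below does this in plain Python. The records, the field names (income, is_fraud) and the helper fraud_rate are all invented for illustration and do not come from any real dataset.

```python
# Toy records, made up purely for illustration (not real data).
# "income" is None when the applicant left the field blank.
applications = [
    {"income": 52000, "is_fraud": False},
    {"income": None,  "is_fraud": True},
    {"income": 61000, "is_fraud": False},
    {"income": None,  "is_fraud": False},
    {"income": 48000, "is_fraud": False},
    {"income": None,  "is_fraud": True},
    {"income": 75000, "is_fraud": True},
    {"income": 39000, "is_fraud": False},
]


def fraud_rate(records):
    """Share of records flagged as fraud; None if the group is empty."""
    if not records:
        return None
    return sum(r["is_fraud"] for r in records) / len(records)


# Split on whether the field is missing, instead of dropping or imputing it.
missing = [r for r in applications if r["income"] is None]
present = [r for r in applications if r["income"] is not None]

rate_missing = fraud_rate(missing)
rate_present = fraud_rate(present)

print(f"fraud rate, income missing: {rate_missing:.2f}")
print(f"fraud rate, income present: {rate_present:.2f}")
```

If the two rates differ sharply on real data, the missingness itself may deserve to be kept as a predictor, for example as a separate "is missing" flag, rather than being silently imputed away.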
Mistake 4. Beating down the Outliers

I have used missing data and outliers as a way to highlight that analysts need to be careful about blindly using any statistical technique. In the next segment, we will discuss a serious problem that plagues many scientific investigations.
Mistake 5. Not being Careful about Reproducibility of Results

On the brighter side, scientists have recently confirmed that Albert Einstein’s General Theory of Relativity holds true for a galaxy 13 billion light-years from Earth. If you talk about reproducibility, Einstein’s theory takes it to a new height, or distance. Analytics projects need not be as generalizable as the General Theory of Relativity, but they still need to be reproducible within a localized boundary of region and time.
Predictive models are built with the idea that the model built today will be good in the future. If the results are not reproducible, then the predictive models are completely worthless. It is essential for the project team to identify reasons why their models won’t work in the future.
Define Segments and Boundaries to Make Your Models Robust and Reproducible
Moreover, it is also a good idea to define boundaries and segments within which the model will operate properly. For instance, consider this fictitious model to estimate work experience for professionals:
Work Experience = Age - 21
This mathematical equation says that someone who is just born has -21 years of work experience. We know this is incorrect. However, most models in business systems are implemented without defining the boundaries of the predictor variables and the surrounding environment. This makes the model behave erratically for a new segment. The above model for work experience is possibly correct within the age boundary of 21 to 60 years. Outside these boundaries, the model makes no sense.
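A boundary like this can be written into the model's implementation so that it refuses to score inputs it was never meant for. Here is a minimal sketch in Python. The function name estimate_work_experience and the constants MIN_AGE and MAX_AGE are hypothetical, and the limits of 21 and 60 are simply the ones suggested above.

```python
MIN_AGE = 21  # lower boundary suggested in the text
MAX_AGE = 60  # upper boundary suggested in the text


def estimate_work_experience(age):
    """Fictitious model: Work Experience = Age - 21, valid for 21 <= age <= 60."""
    if not MIN_AGE <= age <= MAX_AGE:
        # Fail loudly instead of returning a nonsensical estimate.
        raise ValueError(
            f"age {age} is outside the model's valid range [{MIN_AGE}, {MAX_AGE}]"
        )
    return age - 21


print(estimate_work_experience(35))  # 14

try:
    estimate_work_experience(0)  # a newborn is outside the boundary
except ValueError as err:
    print(err)
```

Raising an error is only one possible design choice; a production system might instead route out-of-range cases to a fallback rule or to manual review.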
Sign-off Note
The struggle and triumph of a baby wildebeest hold an important lesson for teams involved in analytics projects. The baby had to stand up without any help from anyone, including the mother. Similarly, analytics teams need to rely on their own scientific logic and knowledge of numbers to complete the journey, because for these aspects of the project they won’t get any help from the champions or the sponsors of the project.

