Every data scientist knows that in any business analytics and data science exercise, 70-80% of the time is consumed by data preparation and preprocessing. This is usually considered drudgery in comparison to the actual statistical modeling, machine learning, and business insights. However, every good data scientist understands that data preparation is an art and a highly intellectual exercise. We will discover the art and science of data preparation in this article. Before that, however, let’s experience some culinary delights in the…
Master Chef’s Kitchen
My wife is a huge fan of MasterChef Australia. I think she has watched every season of the show since its inception. As for me, I enjoy eating good food more than watching someone cook and judges relish their meal. However, over the years I have watched a few episodes because of my wife, and I must say that cooking holds a lot of good lessons for data scientists. In virtually every episode of MasterChef I have seen, the participants spent the larger part of their time preparing ingredients for their final dishes. They all run around and collect the appropriate supplies for their dishes from the larder. To me, all this seemed quite similar to data scientists running around and getting the right data fields from data sources and data warehouses. After this, the act of peeling, chopping, cutting, roasting etc. is similar to data preparation in data science.
I used to think that master chefs in big restaurants usually just order sous chefs and assistants around to prepare ingredients for their main dish. However, in one of the episodes of MasterChef, Marco Pierre White, a celebrity chef, asked the participants to chop onions really fine to test their skills with the knife. He first demonstrated his meticulous knife skills by chopping the onion into perfect microscopic pieces within a few seconds. For Marco Pierre White, this was not an act of showcasing his supremacy but a necessity for the dish he had in mind. Similarly, a senior data scientist needs to be completely involved in the process of data preparation and preprocessing to produce desirable models and business results.
A data scientist can’t expect a delicious model unless she has spent enough time preparing the right ingredients for the model through data preparation. In the next section, let’s explore some key elements of data preparation.
Continuous improvement to generate competitive advantage is the only way for companies to survive in the current times of cut-throat business competition. For modern businesses, analytics is an essential instrument to generate competitive advantage. Analytics and data science generate novel business insights for improved business actions that keep the company ahead in the race. Even in the age of big data, companies dismiss the fundamental rules of rigorous research and analysis design at their own peril. As we will see, intelligent data preparation is a fundamental aspect of rigorous data science design. Unfortunately, to date there is no software or tool that will create an intelligent data preparation design across industries. Hence, the onus of data preparation is still on creative human minds.
Before I share a few key aspects of data preparation, let us have a look at a simplistic data schema from banking. Although the schema is from banking, our discussion of data preparation will suggest general strategies for other industries as well, including healthcare, telecom, and retail.
In every industry, the IT systems are designed to capture transaction data. For example, consider the adjacent chequing / savings account statement, where every debit and credit transaction for a customer is recorded with a description. This is similar to the statement you get for your own bank account. Additionally, your bank captures transaction information for other investment and loan products you hold with them. In their databases, they have transaction-level information for all customers, keyed by account number. Hence, the base data that data scientists start with is a continuous stream of transactions. It is easy to lose sight of the big picture in data preparation. In my opinion, the following points will help you keep the focus on the right data preparation:
1) Business objectives, questions: business objectives are the driving force for data preparation. From the business objective come the questions for which the analytics will provide solutions. I have noticed a tendency in new data scientists to jump into the data immediately without focusing on business objectives. My recommendation: don’t touch the data till you are clear about the business objectives and business questions. Having a clear data strategy based on business objectives will keep you from getting lost in the labyrinth of huge data and save you a lot of rework time.
2) Curiosity: data science doesn’t start with data but with curiosity, which is the key to being a good data scientist. All data scientists possess the inner desire to decipher and learn the business patterns or facts hidden within data. Data preparation is the first step in deciphering new patterns. Hence, let your curiosity run wild while preparing derived data fields from the transaction data.
3) Unit of analysis & business hypothesis: one could analyse transaction data at various units, e.g. customer, branch, region, agent, relationship manager, channel partner etc. The unit of analysis for model development comes from the business objective. For instance, a customer risk scorecard has the customer as the unit of analysis, while a business expansion model has the branch or region as the unit of analysis. Once the unit of analysis is defined, it is good practice to create a few business hypotheses or hunches about the variables that you believe will feature in your predictive model. One might also go for a complete data mining approach, creating hundreds of variables during data preparation to detect patterns with the target variable. However, I prefer a mix of hypothesis / hunch driven and data mining approaches during data preparation.
4) Data roll-up & data quality checks: after defining your unit of analysis and your list of prospective predictor variables, you are ready to approach the data. The idea is to roll up the transaction data to your unit of analysis and prepare predictor variables for model development. Most statistics and data mining books provide you with data that is the end product of this exercise. These datasets make your life easy but also prevent you from experiencing an essential process of data preparation in data science.
5) Operationalization of analytics base table: finally, data scientists must think about operationalizing their model on business systems while preparing data. It is better to lose a few notches of predictive power in your model than to prepare a complicated data set that is difficult to productionize.
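To make the roll-up in point 4 a little more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the field names (`customer_id`, `branch`, `type`, `amount`), the sample transactions, the two quality rules, and the derived `debit_to_credit` field are hypothetical and not taken from any real banking schema. It runs a basic quality check on each transaction and then rolls the clean ones up to a chosen unit of analysis.

```python
from collections import defaultdict

# Hypothetical transaction records; field names and values are illustrative only.
transactions = [
    {"customer_id": "C001", "branch": "B01", "type": "credit", "amount": 2500.0},
    {"customer_id": "C001", "branch": "B01", "type": "debit", "amount": 400.0},
    {"customer_id": "C001", "branch": "B01", "type": "debit", "amount": 150.0},
    {"customer_id": "C002", "branch": "B02", "type": "credit", "amount": 1000.0},
    {"customer_id": "C002", "branch": "B02", "type": "debit", "amount": -50.0},  # bad sign
]


def quality_issues(txn):
    """Return a list of basic data quality problems found in one transaction."""
    issues = []
    if txn["amount"] is None or txn["amount"] <= 0:
        issues.append("non-positive or missing amount")
    if txn["type"] not in ("debit", "credit"):
        issues.append("unknown transaction type")
    return issues


def roll_up(txns, unit="customer_id"):
    """Aggregate clean transactions to one row of derived fields per unit of analysis."""
    rows = defaultdict(lambda: {"n_txn": 0, "total_credit": 0.0, "total_debit": 0.0})
    rejected = []
    for txn in txns:
        problems = quality_issues(txn)
        if problems:
            # Set failing records aside for review instead of silently dropping them.
            rejected.append((txn, problems))
            continue
        row = rows[txn[unit]]
        row["n_txn"] += 1
        row["total_" + txn["type"]] += txn["amount"]
    for row in rows.values():
        # Derived predictor: share of credited money that was spent.
        row["debit_to_credit"] = (
            row["total_debit"] / row["total_credit"] if row["total_credit"] else None
        )
    return dict(rows), rejected


customer_table, rejected = roll_up(transactions, unit="customer_id")
branch_table, _ = roll_up(transactions, unit="branch")
```

Changing the `unit` argument from `customer_id` to `branch` is the code-level counterpart of choosing a different unit of analysis in point 3. In practice you would probably do this in SQL or a data frame library, and the quality rules would come from the business context rather than the two toy checks used here.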
My wife pointed out to me that as the contestants get better at cooking in MasterChef, the judges put a lot of emphasis on the final plating and presentation of the dish. The idea is that the served dish needs to be aesthetically pleasing, which enhances the experience of eating it. This is an important lesson for data scientists presenting their models and analysis results to the senior management of a company. Communicating and presenting predictive models in a way that makes them appetising is an art that every great data scientist has mastered.
See you soon with a new post.