Modeling in Advanced Analytics
The room, full of Analysts, erupts with a loud round of laughter when a young business analyst narrates to us an incident from his recent trip back home. A distant aunt inquired about his new profession. His response – I am into modeling. She got all excited and asked – is it just on the ramp or will I see you on the television? Jokes apart, this left me wondering about the roots of the word modeling or model. What is a model?
A model is defined as a simplified representation of reality. A representation of reality, hmmm, a photograph is a representation of reality – a moment of reality capture on the reel – does that makes it into a model. I think yes. Similarly, a newspaper reporter covering an incident and makes it into breaking news is also a model – a descriptive model. Now, let us try to link models with Analytics.
Data warehouse, Business Intelligence and Advanced Analytics
Analytics has received a massive boost because of the emergence of information technology. We are living in the era of big data. A plethora of data collected at every stage of the business process had created a need to extract knowledge out of the information. This overall process has three aspects to it
1. Data warehouse or data marts: transactional data is extracted-transformed and loaded (ETL) into a data model / schema for the purpose of analysis
2. Business Intelligence or dashboards: “as is” business reports
3. Predictive Analytics or Advanced Analytics: high-end statistical and data mining exercise
As the quantum of data is exponentially increasing, Hadoop and big data technologies are replacing the data warehouses. However, the thought process for business intelligence and predictive analytics – the focus of this article – will not change much. Let me try to distinguish between business intelligence and predictive Analytics using something I learned at a professional theater.
5Ws for business intelligence & predictive Analytics – Lessons from Theater
I joined a professional theater group a few years ago. To understand the nuances of acting we started with improv or improvisation theater. This form of theater does not have a predefined script but the actors built the story while performing. Most people thought I was a good improv actor. However, the style of remembering dialogue while performing did not work very well for me and hence it was the end of my theater gig. However, I learn some good lessons from the whole experience. One of them was the five-Ws of deciphering a character to build the drama.
1. What had happened?
2. When did it happen?
3. Where did it happen?
4. Who was part of this?
5. Why did it happen?
Clearly, the first four questions are trying to report an as-is version of the reality – a descriptive model. This is exactly what the business intelligence professionals try to achieve through the fancy reporting platforms & software. The fifth question is the trickiest of the lot. The question that keeps scientists and inquisitive minds awake late at night.
An apple falls from a tree. How difficult is it to answer the first four questions? Most of us can answer them with a help of a clock and a map. However, Isaac Newton answered the fifth question and his answer – Gravity. If he had stopped there, nobody would have remembered him after close to four hundred years since his birth. He gave a mathematical model to explain this phenomenon.
Replace apple and earth with any other objects and you have the general equation for the model. Albert Einstein did shatter the Newtonian notion of Gravity. However, this model still holds good for all problems of practical purposes and used extensively in rocket science.
Advanced analytics tries to facilitate the answer to the fifth question of why did something happen using predictive modeling. The combination of high-end statistical and data mining techniques along with analysts’ business acumen produces models that help organizations make informed decisions. Remember, this is just the beginning and causality is still a fair distance!
Credit Scoring Models
Credit scorecards are models to predict the probability of a borrower default on his/her loan. The following is a simplified version of credit score with three variables
Credit Score = Age + Loan to Value Ratio (LTV) + Installment (EMI) to Income Ratio (IIR)
A 28-year-old man with the LTV of 75 and the IIR of 60 will have the score of 10+50+5 =65 and hence is a high credit risk.
Now the question is, how did we arrive at the bucket-wise score points and associated risk tables? By now, after going through the previous three articles of the series, you must have some idea how we will go about it. We have a historical list of good / bad borrowers (article 2) that we want to distinguish using predictor variables (article 3). There are several statistical & data mining techniques that could help us achieve our object such as
1. Decision tree
2. Neural Networks
3. Support Vector Machines
4. Probit Regression
5. Linear discriminant analysis
6. Logistic Regression
Logistic regression is the most commonly used technique for the purpose. We will explore more about logistic regression in the next article.
I must conclude this article by saying that the good analysts find a good mathematical model as beautiful as the model walking on the catwalk ramp.
References 1. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring – Naeem Siddiqi 2. Credit Scoring for Risk Managers: The Handbook for Lenders – Elizabeth Mays and Niall Lynas