A Primer on Logistic Regression – Are you Happy?
A few years ago, my wife and I took a couple of weeks' vacation to England and Scotland. Just before boarding the British Airways plane, a flight attendant informed us that we had been upgraded to business class. Jolly good! What a wonderful start to the vacation. Once we got onto the plane, we received another tempting offer: a further upgrade to first class. This time, however, there was a catch – just one seat was available. What a shame; of course, we could not take the offer. The business class seats had seemed fabulous before the first-class offer came along – all free upgrades, by the way. This is the situation behavioral economists describe as relativity and anchoring – in plain English, comparison. Anchoring, or comparison, is at the root of pricing strategies in business, and also of much human sorrow. Eventually, though, the vacation mood took over and we enjoyed business class thoroughly. Humans are phenomenally good at adjusting to a situation and, in the end, enjoying it. You will find some of the happiest faces among people in the most difficult situations. Here is a quote by Henry Miller: "I have no money, no resources, no hopes. I am the happiest man alive." Human behavior is full of anomalies – full of puzzles. The following is an example to strengthen this thesis.
Lennon, McCartney, Harrison, and Best are the members of the most famous band ever on the planet – the Beatles. OK, I know you have spotted the error. By now you must have uttered the right names: John Lennon, Paul McCartney, George Harrison and Ringo Starr – not Pete Best. Actually, Ringo Starr was the replacement for Pete Best, the original regular drummer for the Beatles. Pete must have been devastated to see his partners rise to glory while he was left behind. Wrong – search for him on Google; he is the happiest Beatle of all. Now that is counterintuitive. I guess we do not have a clue what makes us happy.
As promised in a previous article, I will attempt here to explore happiness using logistic regression – the technique used extensively in scorecard development.
Logistic Regression – An Experiment
I am a thorough empiricist – a proponent of fact-based management. Hence, let me design a quick-and-dirty experiment* to generate data for evaluating happiness. The idea is to identify the factors / variables that influence our overall happiness. Let me present a representative list of factors for a working adult living in a city:
Now, throw some other factors into the above list, such as a random act of kindness or an unplanned visit to a friend. As you can see, the list can easily be expanded (recall the article on variable selection – article 3). This is a representative list; you will have to create your own to figure out the factors that influence your level of happiness.
The second part of the experiment is to collect data. This is like maintaining a diary, only this one will be in Microsoft Excel. Every night before sleeping, you could assess your day and fill in the numbers in the spreadsheet, along with your overall level of happiness for the day (as shown in the figure below).
*I am calling this a quick-and-dirty experiment for the following reasons: (1) it is not a well-thought-out experiment but is created mainly to illustrate how logistic regression works; (2) the observer and the observed are the same in this experiment, which might create a challenge for objective measurement.
After a couple of years of data collection, you will have enough observations to create a model – a logistic regression model in this case. We are trying to model the feeling of happiness (column B) using the other columns (C to I) in the above data set. If we plot B on the Y-axis and an additive combination of C to I (we'll call it Z) on the X-axis, it will look something like the plot shown below.
The idea behind logistic regression is to optimize Z in such a way that we get the best possible separation between happy and sad faces, as achieved in the plot above. This is a curve-fitting problem with the sigmoid function (the curve in violet) as the choice of function.
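The idea can be sketched in a few lines of code. The sketch below is purely illustrative: the three daily factors, their effect sizes, and the simulated diary data are all invented, standing in for columns C to I of the spreadsheet.

```python
# Minimal sketch: fit a logistic regression that maps an additive
# combination Z of daily factors to the odds of a happy day.
# All factor names and data below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500

# Hypothetical daily factors (stand-ins for columns C to I of the diary)
X = np.column_stack([
    rng.integers(5, 10, n),   # hours of sleep
    rng.integers(0, 3, n),    # hours of exercise
    rng.integers(0, 12, n),   # hours of work
])

# Simulate the happy (1) / sad (0) label from a known linear combination
z_true = 0.8 * X[:, 0] + 1.2 * X[:, 1] - 0.5 * X[:, 2] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-z_true))).astype(int)

model = LogisticRegression().fit(X, y)

# Z is the fitted linear combination; the sigmoid squashes Z into a
# probability of a happy day between 0 and 1
Z = model.decision_function(X)
p_happy = 1 / (1 + np.exp(-Z))  # same as model.predict_proba(X)[:, 1]
print("coefficients:", model.coef_[0], "intercept:", model.intercept_[0])
```

The fitting step searches for the coefficients of Z that best separate the two classes, which is exactly the optimization described above.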
I would also recommend using the dates of observation (column A) in the model; this might reveal an interesting influence of the seasons on our mood.
Applications in Banking and Finance
This is exactly what we do in the case of analytical scorecards such as credit scorecards, behavioral scorecards, fraud scorecards or buying-propensity models. Just replace the happy and sad faces with …
• Good and Bad borrowers
• Fraud and genuine cases
• Buyers and non-buyers
… for the respective cases and you have the model. As you may remember, in the previous article (4) I showed a simple credit scorecard model: Credit Score = Age + Loan-to-Value Ratio (LTV) + Instalment (EMI)-to-Income Ratio (IIR)
A straightforward transformation of the sigmoid function – taking the log-odds, ln(p/(1−p)), which is linear in the predictors – will help us arrive at the above equation of the line. This is the final link to arrive at the desired scorecard.
Variable Transformation in Credit Scorecards
I loved the movie Kill Bill, both parts. In the first part, I enjoyed the scene where Uma Thurman's character goes to Japan to get a sword from Hattori Hanzō, the legendary swordsmith. After learning about her motive, he agrees to make his finest sword for her. Quentin Tarantino, the director, then briefly shows the process of making the sword: Hattori Hanzō transforms a regular piece of iron into a fabulous sword – what a craftsman. This is fairly similar to how analysts transform the sigmoid function into the linear equation. The difference is that analysts use mathematical tools rather than hammers, and are not as legendary as Hattori Hanzō.
Reject inference is a distinguishing aspect of credit or application scorecards that sets them apart from other classification models. For application scorecards, the development sample is biased because the performance of rejected loans is never observed. Reject inference is a way to rectify this shortcoming and remove the bias from the sample. We will discuss reject inference in detail in a later article on YOU CANalytics.
Now that we have our scorecard ready, the next task is to validate its predictive power. This is precisely what we will do in the next article. See you soon.
References
1. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring – Naeem Siddiqi
2. Credit Scoring for Risk Managers: The Handbook for Lenders – Elizabeth Mays and Niall Lynas
Very interesting, really appreciate it. Roopam, it would be really helpful if you explained it in more depth, with all the possible cases we might face during practical execution.
Thanks Vikash. The articles on this site are presented as icebreakers for all of us to start the discussion and share our views and working knowledge. I know the devil is in the details, but all the fun is in the details too. Hence, let's post specific questions and, as a community, we'll try to answer them. So, let me know what specific question you have to start with.
In my experience as an analytics practitioner, I have realized that there is never one right answer but several good answers. We will explore these good answers through rounds of discussion. The possible cases one would face during execution will also come out in those discussions.
Guys, for credit scorecard development the data-reduction method most commonly used is based on Information Value, which is covered in the first book mentioned above by Mr. Siddiqi (pp. 80–83). I would like to discuss other methods practiced in the industry for data reduction.
@Saumitra, Information Value (IV) is a widely used measure in the scorecard world. The reason is the set of very convenient rules of thumb for variable selection associated with it – it is quite handy. However, if we examine the formula for IV closely, there is a log component in it, i.e. ln(%good/%bad). This means that IV breaks down when either the numerator or the denominator goes to zero. A mathematician will hate it. The assumption, a fair one, is that this will never happen during scorecard development because of the reasonably large sample size. A word of caution: if you are developing non-standard scorecards with smaller sample sizes, use IV carefully.
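To make the point concrete, here is a minimal sketch of the IV calculation using its usual definition, IV = Σ over bins of (%good − %bad) × ln(%good/%bad). The bin counts are invented; the zero-cell check marks exactly where the log component breaks down.

```python
# Sketch of Information Value for one binned variable; counts invented.
import math

def information_value(goods, bads):
    """goods/bads: per-bin counts of good and bad borrowers."""
    total_good, total_bad = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        if g == 0 or b == 0:
            # Exactly where ln() breaks down; with small samples one
            # typically merges bins or adds a smoothing count instead
            raise ValueError("empty good/bad cell: IV undefined for this bin")
        pct_good = g / total_good   # share of all goods in this bin
        pct_bad = b / total_bad     # share of all bads in this bin
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv

# Hypothetical bins of, say, an age variable
goods = [100, 300, 600]
bads  = [ 60,  50,  40]
print(round(information_value(goods, bads), 3))
```

Each term is non-negative, so IV only accumulates evidence of separation; the common rules of thumb then grade the total (e.g. below about 0.02 as useless, above about 0.3 as strong).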
A measure other than IV that is much more mathematically consistent is the Chi-square statistic, used extensively in statistics. I would guess Mr. Siddiqi must have mentioned it in his book. I personally prefer to examine the data visually for trends and logical consistency. However, both IV and Chi-square are extremely useful measures.
Thanks Roopam. Very interesting post – I've learnt a lot through your blog.
Relating to your response, could you show the key differences between IV, Chi-square and the Z-statistic? And in practice, which one should we rely on if the three measures give different results?
Thanks a lot Roopam.
Thanks David, am glad you find my blog useful.
To answer your question: in practice, with relatively large data (as is often the case with most business problems), all the measures will more or less provide the same results. A variable class that is a significant predictor under one measure won't become completely irrelevant under another.
Just to play devil's advocate: aren't econometric models point estimates, tied to the instant in time when the data was collected, with the model then developed over the following couple of months?
Although we do out-of-time validation to check that the characteristics of the variables stay relevant, what if a new variable comes into existence? We would then need to rebuild the complete model.
Wasn't this the case in the 2009 recession, when econometric models failed so badly?
Any thoughts?
I have replaced the continuous variables with WoE and checked for multicollinearity. Is this procedure right?
Yes, you can check multicollinearity with WoE – that's fine. However, WoE is used for binned variables, i.e. categorical variables, not continuous ones. I assume you have converted your continuous variables into bins first.
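The workflow described in this exchange can be sketched as follows. The data, the bin edges, and the deliberately collinear second variable are all invented; WoE is computed per bin as ln(%good/%bad).

```python
# Sketch: bin a continuous variable, recode each bin with its Weight of
# Evidence, WoE = ln(%good / %bad), then check multicollinearity on the
# WoE-coded columns. All data below is simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 60, n)
# Simulate good (1) / bad (0) with goodness rising in age
y = (rng.random(n) < 1 / (1 + np.exp(-(age - 40) / 10))).astype(int)

# Step 1: bin the continuous variable (hypothetical cut points)
cuts = [30, 40, 50]
bins = np.digitize(age, cuts)   # bin index 0..3 per observation

# Step 2: WoE per bin
total_good, total_bad = y.sum(), (1 - y).sum()
woe = {}
for b in np.unique(bins):
    g = y[bins == b].sum()
    bd = (1 - y)[bins == b].sum()
    woe[b] = np.log((g / total_good) / (bd / total_bad))

age_woe = np.array([woe[b] for b in bins])

# Step 3: with several WoE-coded variables, inspect pairwise correlation
# (or VIF) to spot multicollinearity; here a second, related variable
income_woe = age_woe + rng.normal(0, 0.1, n)   # deliberately collinear
corr = np.corrcoef(age_woe, income_woe)[0, 1]
print(f"correlation between WoE-coded variables: {corr:.2f}")
```

A high correlation (or a large variance inflation factor) between WoE-coded columns is the signal that one of the variables is redundant for the scorecard.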