Analytics Lab
Malcolm Gladwell in his book ‘Outliers‘ says that it takes approximately 10,000 hours of training or practice for anybody to master a subject. There is nothing in the world that could beat hands-on practice for learning. Keeping this in mind let me start a new series on YOU CANalytics called ‘Analytics Lab’ where we will practice some of the case studies we have discussed in the past on YOU CANalytics. We will use R and Microsoft Excel (both easily available tools) for this.
It’s a bird…It’s a plane…It’s Excel man! Did I ever tell you what a super hero Excel is. In my opinion Excel is the best tool to start learning advanced analytics! Today we will start the ‘Analytics Lab’ using concepts we have discussed in the banking case study. This time we will create a two Y axes plot by using the secondary axis function in Excel. In the next article we will create the same plot on R.
Banking Case
Recall the banking case: in that you were playing the role of the Chief Risk Officer (CRO) for CyndiCat bank. The bank had disbursed 60816 auto loans in the quarter between April–June 2012. Additionally, you had noticed around 2.5% of overall bad rate for these loans. The idea was to identify customer segments with distinct bad rates. Bad rate, by the way, is percentage of customer defaulted on their payments. You did some exploratory data analysis (EDA) using tools of data visualization and found a relationship between age with bad rates (Part 1). If you recall, you have observed the following age groups wise population histogram and bad rate trend.
Using Secondary Axis to Create Two Y Axes in Excel
A good practice is to present these two graphs in a single chart. This is useful because you and your audience don’t need to jump from one graph to another during analysis. Let me present the data we are going to use for our purpose. I recommend you copy past this data on Excel and evaluate for yourself the merits of a single chart in comparison to two.
Age Groups | Number of Loans | Bad Loans | Good Loans | Bad Rate |
21-24 | 310 | 14 | 296 | 4.5% |
24-27 | 511 | 20 | 491 | 3.9% |
27-30 | 4000 | 172 | 3828 | 4.3% |
30-33 | 4568 | 169 | 4399 | 3.7% |
33-36 | 5698 | 188 | 5510 | 3.3% |
36-39 | 8209 | 197 | 8012 | 2.4% |
39-42 | 8117 | 211 | 7906 | 2.6% |
42-45 | 9000 | 216 | 8784 | 2.4% |
45-48 | 7600 | 152 | 7448 | 2.0% |
48-51 | 6000 | 84 | 5916 | 1.4% |
51-54 | 4000 | 64 | 3936 | 1.6% |
54-57 | 2000 | 26 | 1974 | 1.3% |
57-60 | 788 | 9 | 779 | 1.1% |
The last column i.e. ‘Bad Rate’ is the percentage of ‘Bad Loans’ to ‘Number of Loans’ i.e. for the age group 21-24 it is 14/310=4.5%.
The first thing you need to do is create a bar plot of the first two columns i.e. ‘Age Groups’ and ‘Number of Loans’. Use Column in Charts within Insert tab (as shown above) to create a vertical bar plot. The default excel bar plot will look different. I have done some additional formatting for the bar plot to make it appropriate for my purpose. You could leave the default excel format for most part. The only thing you want to change is the color of the bars from default excel blue to something mellow like grey.
The next step is to add the trend line for the last column i.e. ‘Bad Rate’ on top of the above chart. For this, right click you mouse anywhere on top of the above bar chart in Excel and then choose Select Data…You will notice a pop up window similar to the one shown below
In the above pop-up window click on Add below Legend Entries (Series). You will get a new pop-up window similar to the one shown below. For series name choose the cell $E$1 (the cell that contain the name ‘Bad Loans’) and for series values select cells $E$2 to $E$14 i.e. the values of ‘Bad Rate’ for all the ‘Age Groups’.
Once you will click OK in the above Edit Series pop-up window, you will get additional bars for ‘Bad Loans’. Notice, these bars are really tiny in comparison to ‘Number of Loans’ bars.
Select any of these tiny bars for ‘Bad Rate’ by a mouse click. Then right click your mouse and choose Format Data Series…You will see a pop-up window similar to the one shown below.
In the Plot Series On section choose Secondary Axis and press Close in the pop-up window. You will get a bar plot similar to the one shown below. Notice the secondary Y Axis at the right side of the plot with ‘Bad Rate’. Remember, I have re-formatted the default Excel plot so your plot will have different appearance. Now we are left with just one final thing.
We need to now convert the above orange bars for ‘Bad Rate’ to a line plot. This is really simple, click on the orange ‘Bad Rate’ bars and convert them to line by choosing Line in Charts from Insert tab
Great! We are done. You might appreciate how useful it is to present these two plots in one single chart with 2 Y axes or the secondary axis in Excel. In the above chart, it is much easier to see where you have thinner data and modify your bad rates accordingly. Now, you could easily create coarse classes as discussed in the banking case study.
Sign-off Note
Hope you had fun performing the above task on Excel. In the next article we will do the exactly same thing on R. See you soon!
The following are the parts of the banking case study for risk scoring - Part 1: Data visualization for scoring - Part 2: Creating ratio variables for better scoring - Part 3: Logistic regression - Part 4: Information Value and Weights of Evidence - Part 5: Reject inference - Part 6: Population stability index for scorecard monitoring