This article is a continuation of the retail case study example we have been working on for the last few weeks. You can find the previous 4 parts of the case at the following links:
In this article, we will discuss a type of decision tree called the classification and regression tree (CART) to develop a quick & dirty model for the same case study example. But before that, let us explore the essence of decision trees through a simple analogy.
Let’s admit it, we all do this before we pick a slice of pizza from the box: we quickly analyze the size of each piece and the proportion of toppings. In this quick optimization, you mostly look for the biggest slice with the maximum amount of your favorite toppings (and possibly avoid your least favorite ones). Hence, I would rather not call this little boy, shown in the picture, greedy. He is just trying to cut his birthday cake to maximize his preferred taste. The cake has his favorite topping, i.e. red cherries, and his not-so-favorite green apples in equal proportions (50-50). He needs to make a clean cut with just two strokes of the knife; otherwise, the guests at his party won’t appreciate his messy use of the knife. With complete deftness, and using the built-in decision tree in his brain, this boy cuts the perfect piece to savor his taste. Let us have a look at his artistry:
He started with equal proportions of red and green (50%-50%). Remember, he wanted the most reds and the fewest greens on his piece. His slice, a quarter of the cake, has 71% reds and 29% greens. Not bad! This is precisely how decision tree algorithms operate. As in the problem above, the CART algorithm tries to cut/split the root node (the full cake) into just two pieces (no more). However, there are other decision tree algorithms, which we will discuss in the next article, capable of splitting the root node into many more pieces.
I must point out that though we are using discrete data (such as red cherries and green apples) for the decision tree in this article, CART is equally capable of splitting continuous data such as age, distance, etc. Let us explore more about the CART decision tree algorithm.
Classification and Regression Tree (CART)
I find algorithms extremely fascinating, be it Google’s PageRank algorithm, Alan Turing’s cryptography algorithms, or the many machine-learning algorithms. To me, algorithms are a mirror of structured thinking expressed through logic. For instance, the CART algorithm is an extension of the process that happened inside the little boy’s brain while splitting his birthday cake. He was trying to cut the largest piece for himself with the maximum cherries and the fewest green apples. In this problem, he had two objectives:
- Cut the largest piece with a clean cut
- Maximize the number of cherries on this piece while keeping the number of green apples at a minimum
The CART decision tree algorithm is an effort to abide by the above two objectives. The following equation combines the two objectives into a single measure. Don’t get intimidated by this equation, it is actually quite simple; you will realize this once we solve an example in the next segment.

Goodness of Split = 2 × PL × PR × Σk |P(k|L) − P(k|R)|
The first term here, i.e. 2 × PL × PR, controls the first objective: to cut the largest piece. Let me call this first term ‘Ψ(Large Piece)’ because it will remind us of the purpose behind the mathematical expression.
Whereas the second term, Σk |P(k|L) − P(k|R)|, controls the second objective. Like the first term, let me call this second term ‘Ψ(Pick Cherries)’.
For our example, the classes k = 0, 1 in the above equation are 0 = green apples and 1 = red cherries. Remember, for our case study with marketing campaigns, k = 0, 1 will become responded (r) and not-responded (nr) customers. Similarly, for our banking case study & credit scoring articles (link), they will become loan defaulters & non-defaulters. However, the philosophy of decision trees and of CART will remain the same for all these examples and many more practical classification problems.
Let me define some important terminology for the CART decision tree algorithm before explaining the components of the above equation for the goodness of split:

- Root node: the complete data before any split (the whole cake)
- Child nodes: the two segments, left (L) and right (R), produced by a binary split
- Leaf node: a node that is not split any further
The following are the components of the above equation for the goodness of split:

- PL, PR: the proportion of records falling into the left and right child node, respectively
- P(k|L), P(k|R): the proportion of class-k records (e.g. responders) within the left and right child node, respectively
I hope you have realized that the split with the largest value of the product of Ψ(Large Piece) and Ψ(Pick Cherries), called the goodness of split, will generate the best decision tree for our purpose. Things will get much clearer when we solve an example for our retail case study using the CART decision tree.
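The goodness-of-split measure can be sketched in a few lines of Python. This is a minimal sketch; the function and variable names are my own, not from any particular library:

```python
def goodness_of_split(p_left, p_right, dist_left, dist_right):
    """Goodness of split for a binary (CART-style) partition.

    p_left, p_right  -- proportion of records sent to each child node
    dist_left/right  -- dict mapping each class k to P(k | child node)
    """
    # Psi(Large Piece): favours splits that produce two sizeable pieces
    psi_large_piece = 2 * p_left * p_right
    # Psi(Pick Cherries): favours children with very different class mixes
    psi_pick_cherries = sum(abs(dist_left[k] - dist_right[k]) for k in dist_left)
    return psi_large_piece * psi_pick_cherries

# First candidate split from the retail example: Low vs Medium+High
score = goodness_of_split(0.4, 0.6,
                          {"r": 0.018, "nr": 0.982},
                          {"r": 0.058, "nr": 0.942})
print(round(score, 4))  # 0.0384
```

Notice that a lopsided split (tiny PL or PR) shrinks the first factor, while two children with identical class mixes zero out the second, so the product rewards exactly the two objectives above.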
Retail Case – Decision Tree (CART)
Back to our retail case study example, where you are the Chief Analytics Officer & Business Strategy Head at an online shopping store called DresSMart Inc. that specializes in apparel and clothing. In this case example, your effort is to improve a future campaign’s performance. To meet this objective, you are analyzing data from an earlier campaign in which direct-mail product catalogs were sent to a hundred thousand customers out of a complete customer base of over two million customers. The overall response rate for this campaign was 4.2%.
You have divided the total hundred thousand solicited customers into three categories based on their activity in the three months before the campaign. The following is the distribution. Here, the success rate is the percentage of solicited customers who responded (r) to the campaign.
| Activity in the Last Quarter | Number of Solicited Customers | Campaign Results: Responded (r) | Campaign Results: Not Responded (nr) | Success Rate |
|---|---|---|---|---|
| Low | 40,000 | 720 | 39,280 | 1.8% |
| Medium | 30,000 | 1,380 | 28,620 | 4.6% |
| High | 30,000 | 2,100 | 27,900 | 7.0% |
| Total | 100,000 | 4,200 | 95,800 | 4.2% |
As you know, the CART decision tree algorithm splits the root node into just two child nodes. Hence, for this data, CART can form three combinations of binary trees, and we need to figure out which of these three is the best split. The results are shown in the table below.
| Left Node | Right Node | PL | PR | P(k\|L) | P(k\|R) | Ψ(Large Piece) | Ψ(Pick Cherries) | Goodness of Split |
|---|---|---|---|---|---|---|---|---|
| Low | Medium+High | 0.4 | 0.6 | r: 0.018, nr: 0.982 | r: 0.058, nr: 0.942 | 0.48 | 0.080 | 0.0384 |
| Low+Medium | High | 0.7 | 0.3 | r: 0.030, nr: 0.970 | r: 0.070, nr: 0.930 | 0.42 | 0.080 | 0.0336 |
| Low+High | Medium | 0.7 | 0.3 | r: 0.040, nr: 0.960 | r: 0.046, nr: 0.954 | 0.42 | 0.011 | 0.0048 |
Let me help you out with the calculation of each column in the above table. We will use the first row (i.e. left node: Low; right node: Medium+High) for the following calculations, and then you can do the rest of the calculations yourself. To start with, we have calculated PL and PR in the following way:

PL = 40,000 / 100,000 = 0.4 and PR = 60,000 / 100,000 = 0.6
Now the calculation for Ψ(Large Piece) is simple, as shown below:

Ψ(Large Piece) = 2 × PL × PR = 2 × 0.4 × 0.6 = 0.48
Now, let’s come to the second part of the equation, that is, Ψ(Pick Cherries). Remember, r represents responded and nr represents not-responded customers in our campaign example. For the left node (Low), the class proportions are:

r: P(k|L) = 720 / 40,000 = 0.018 and nr: P(k|L) = 1 − 0.018 = 0.982
You may want to calculate the other two terms (i.e. r: P(k|R) and nr: P(k|R)) yourself before plugging them into the following equation to get the value of Ψ(Pick Cherries):

Ψ(Pick Cherries) = |0.018 − 0.058| + |0.982 − 0.942| = 0.04 + 0.04 = 0.08
This leaves us with one last calculation, for the last column, i.e. the goodness of split:

Goodness of Split = Ψ(Large Piece) × Ψ(Pick Cherries) = 0.48 × 0.08 = 0.0384
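Putting the pieces together, the whole comparison table can be reproduced with a short script. The per-segment counts below are back-calculated from the proportions and response rates quoted in the case, so treat them as illustrative rather than as the official dataset:

```python
# (responders, solicited) per activity segment, back-calculated from the
# stated figures: Low = 40% of 100,000 at 1.8% response, and so on
segments = {
    "Low":    (720, 40_000),
    "Medium": (1_380, 30_000),
    "High":   (2_100, 30_000),
}
total = sum(n for _, n in segments.values())  # 100,000 solicited customers

def node_stats(names):
    """Aggregate responders/solicited over the segments in one child node."""
    r = sum(segments[s][0] for s in names)
    n = sum(segments[s][1] for s in names)
    return n / total, {"r": r / n, "nr": 1 - r / n}

# The three possible binary splits of the three activity categories
candidate_splits = [
    (("Low",), ("Medium", "High")),
    (("Low", "Medium"), ("High",)),
    (("Low", "High"), ("Medium",)),
]

best = None
for left, right in candidate_splits:
    p_l, dist_l = node_stats(left)
    p_r, dist_r = node_stats(right)
    psi_large = 2 * p_l * p_r
    psi_cherries = sum(abs(dist_l[k] - dist_r[k]) for k in dist_l)
    goodness = psi_large * psi_cherries
    print(left, right, round(goodness, 4))
    if best is None or goodness > best[0]:
        best = (goodness, left, right)

print("Best split:", best[1], "vs", best[2])
```

Running this prints one goodness value per candidate split and confirms that Low vs Medium+High wins.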
The final task now is to find the maximum value of the goodness of split in the last column. The maximum, 0.0384, belongs to the first split, so the CART algorithm produces the decision tree with Low on the left node and Medium+High on the right node.
This also confirms an important business insight: customers with higher activity tend to respond better to campaigns. I agree this was already clear from the first table at the top, but in the process we have learned the science of creating a decision tree using the CART algorithm. This is extremely useful when you are dealing with a large dataset and want to create a decision tree through recursive partitioning.
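Recursive partitioning simply repeats the best-split search inside each child node until a stopping rule fires. Below is a toy, pure-Python sketch for a single categorical feature; the function names and stopping rules are my own, not those of scikit-learn or any production library:

```python
from itertools import combinations

def goodness(left, right, n_total):
    """2 * PL * PR * sum_k |P(k|L) - P(k|R)| over lists of (value, label) rows."""
    p_l, p_r = len(left) / n_total, len(right) / n_total
    labels_l = [y for _, y in left]
    labels_r = [y for _, y in right]
    classes = set(labels_l) | set(labels_r)
    diff = sum(abs(labels_l.count(k) / len(left) - labels_r.count(k) / len(right))
               for k in classes)
    return 2 * p_l * p_r * diff

def grow_tree(rows, min_size=2):
    """Recursively partition rows of (feature_value, class_label) pairs."""
    labels = [y for _, y in rows]
    # Stop when the node is pure or too small to split further
    if len(set(labels)) == 1 or len(rows) < min_size:
        return {"leaf": max(set(labels), key=labels.count)}
    values = sorted({x for x, _ in rows})
    best = None
    # Try every way of sending a proper subset of feature values left
    for size in range(1, len(values)):
        for left_vals in combinations(values, size):
            left = [r for r in rows if r[0] in left_vals]
            right = [r for r in rows if r[0] not in left_vals]
            g = goodness(left, right, len(rows))
            if best is None or g > best[0]:
                best = (g, set(left_vals), left, right)
    if best is None or best[0] == 0:
        return {"leaf": max(set(labels), key=labels.count)}
    return {"split": best[1],
            "left": grow_tree(best[2], min_size),
            "right": grow_tree(best[3], min_size)}

# Tiny illustration with made-up counts in the spirit of the case study
rows = ([("Low", "nr")] * 9 + [("Low", "r")]
        + [("Medium", "nr")] * 8 + [("Medium", "r")] * 2
        + [("High", "nr")] * 7 + [("High", "r")] * 3)
tree = grow_tree(rows)
print(tree)
```

Each recursive call re-runs the same goodness-of-split search on its own subset of rows, which is exactly what "recursive partitioning" means.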
OK, next time while choosing that pizza slice, remember the evolutionary decision tree that helps you maximize your chances of getting the best slice. Once in a while, you may want to leave that best slice for someone else; I bet you will feel equally good!
In the next article, we will extend this concept of the binary decision tree built through the CART algorithm to decision trees with more than two child nodes, built through other algorithms. See you soon!