This is a continuation of our case study example to estimate property pricing. In this part, you will learn the nuances of regression modeling by building three different regression models and comparing their results. We will also use the results of the principal component analysis, discussed in the last part, to develop a regression model. You can find all the parts of this case study at the following link: regression analysis case study example.
However, before we start building regression models, let me highlight the importance of information in pricing and explain how data science and regression create a level playing field by eliminating information asymmetry.
Information Asymmetry & Regression Models
You want to sell a house that you purchased 8 years ago. You get mixed messages from different sources about the state of the real-estate market. Some say the housing market is booming and others believe it’s a bust. Thoroughly confused, you approach a real-estate agent in good faith to help you crack a good deal. The agent is an expert. He gets a 1.5% commission on the selling price, so surely he will grab you the best deal. Freakonomics, a book by Steven Levitt & Stephen Dubner, argues otherwise. Levitt, as a part of his research, analyzed sales patterns of houses owned by real-estate agents versus those owned by their customers. He observed that houses owned by agents sold at roughly a 3% higher price than their customers’ houses. So how do these agents get better deals for their own properties? They keep their own properties listed on the market for roughly 10 more days than their clients’ properties. A $300,000 house will fetch about $10,000 more in these additional 10 days. However, if the property belongs to the client, the agent will get just $150 (1.5% of $10,000) for 10 extra days of effort. He would rather close the deal early and move on to the next one.
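The incentive gap is easy to check with a few lines of R. This is just the arithmetic from the paragraph above, using the figures quoted there; the variable names are made up for illustration.

```r
# Figures quoted in the paragraph above
house_price     <- 300000   # listing price in dollars
extra_price     <- 10000    # extra sale price from waiting roughly 10 more days
commission_rate <- 0.015    # agent's commission on the selling price

uplift_pct <- 100 * extra_price / house_price   # extra price as a % of the house price
agent_gain <- extra_price * commission_rate     # what the agent earns from waiting
owner_gain <- extra_price - agent_gain          # what the owner keeps from waiting

cat("Price uplift (%):        ", round(uplift_pct, 1), "\n")
cat("Agent's extra commission:", agent_gain, "\n")
cat("Owner's extra proceeds:  ", owner_gain, "\n")
```

The owner captures almost all of the upside from waiting, while the agent captures almost none of it, which is the whole point of the argument.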
According to Freakonomics, real-estate agents are like the Ku Klux Klan (KKK), a notorious secret society responsible for lynching African Americans. The entire existence of both the KKK and real-estate agents depends on the asymmetry of information. The more information they have than others, the stronger they become. The KKK was eventually destroyed by efforts to make their secret information public. Once the mask was removed from the KKK’s face and the information asymmetry was gone, the secret society was blown away like a puff of dust. Data science has a massive role to play in the democratization of information. Imagine a world where everyone has access to all the data/information, and everyone knows how to extract knowledge from that data. In this scenario, expertise will rely upon sophisticated human skills like creativity and innovation rather than secrecy and deceit.
As we move towards our regression case study, we can take a big lesson from this Freakonomics analysis of real-estate agents. While in our case study example we are using just a handful of predictor variables to estimate housing price, there are many interesting phenomena outside our dataset that influence housing price, such as whether the owner of the house is also a real-estate agent. As data scientists, it’s our job to unearth these interesting phenomena and build robust models.
Case Study Example – Regression Model
In this case study example, you are building regression models to help an investment firm make money through property price arbitrage. You are under a lot of pressure from your client to deliver the price estimation model soon. You have prepared your data by adjusting it for outliers and missing values. To begin with, you will build a complete model with all the predictor variables. You can find the entire R code used in this article at this link: regression-models-r-code.
The first step in model building is to load the data into R and identify the numeric and categorical predictor variables. We will also tag house price as the target or response variable. This is exactly what the next few lines of code do.
Step 1: fetch data for regression modeling & tag the variables
Clean_Data = read.csv('http://ucanalytics.com/blogs/wp-content/uploads/2016/09/Regression-Clean-Data.csv')
Now, we will tag the variables based on their properties.
numeric = c('Dist_Taxi', 'Dist_Market', 'Dist_Hospital', 'Carpet', 'Builtup', 'Rainfall')
categoric = c('Parking', 'City_Category')
Target = c('House_Price')
The next step is to divide your sample into a training set and a test set. We will build all three models on the training set and evaluate the performance of each model on the test set. The training set is a random selection of 70% of the data, and the remaining 30% forms the test set.
Step 2: prepare train and test data for regression modeling
set.seed(42)
train = sample(nrow(Clean_Data), 0.7*nrow(Clean_Data))
test = setdiff(seq_len(nrow(Clean_Data)), train)
Now, we will build our first regression model with all the available variables in our dataset.
Step 3: build 1st regression model with all the available variables
Org_Reg = lm(House_Price ~ ., data = Clean_Data[train, c(Target, numeric, categoric)])
summary(Org_Reg)
These are the results of the first regression model.
Coefficients:
Variable | Estimate | Std. Error | t value | Pr(>|t|) | Signif. |
(Intercept) | 5.25E+06 | 4.40E+05 | 11.913 | < 2e-16 | *** |
Dist_Taxi | 3.25E+01 | 3.07E+01 | 1.059 | 0.2902 | |
Dist_Market | 4.74E+00 | 2.38E+01 | 0.199 | 0.8421 | |
Dist_Hospital | 8.27E+01 | 3.43E+01 | 2.408 | 0.0163 | * |
Carpet | -1.61E+03 | 3.92E+03 | -0.41 | 0.6818 | |
Builtup | 2.02E+03 | 3.27E+03 | 0.617 | 0.5376 | |
Rainfall | -2.01E+02 | 1.76E+02 | -1.146 | 0.2524 | |
(Parking) No Parking | -6.70E+05 | 1.59E+05 | -4.222 | 2.78E-05 | *** |
(Parking) NotProvided | -5.09E+05 | 1.43E+05 | -3.56 | 0.0004 | *** |
(Parking) Open | -2.83E+05 | 1.31E+05 | -2.156 | 0.0315 | * |
(City_Category) CAT B | -1.81E+06 | 1.11E+05 | -16.388 | < 2e-16 | *** |
(City_Category) CAT C | -2.87E+06 | 1.22E+05 | -23.404 | < 2e-16 | *** |
If you examine the results of the first regression model in the above table, the first thing to notice is that all the variables are part of this model, including the categorical variables, i.e. parking and city category. Moreover, categorical variables are converted to dummy variables, where each category other than the baseline is represented as a separate variable.
The next thing to notice is the level of significance or importance of these variables in the model. This is presented in the last column of the table. In this model, carpet and built-up area of the house are not showing as important. This is a bit weird since we noticed while doing bivariate analysis that these variables had significant correlations with the house price. What is happening here? If you remember, these two variables have a high correlation with each other. This is where we see the demons of multicollinearity: significant variables get tagged as unimportant. We need to do something about multicollinearity. But before we build our next model to handle multicollinearity with principal component analysis, let’s evaluate the performance of this ‘all variable model’ on the testing sample.
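If you want to see this effect in isolation, here is a minimal sketch on simulated data (not the case study data). The variables `carpet`, `builtup` and `price` and all the numbers used to generate them are assumptions made up for the illustration. The idea is simply to compare the standard error of the same coefficient with and without its near-duplicate in the model.

```r
set.seed(1)
n <- 500

# Simulated (hypothetical) data: builtup is almost an exact multiple of carpet
carpet  <- rnorm(n, mean = 1500, sd = 250)
builtup <- 1.2 * carpet + rnorm(n, sd = 10)
price   <- 800 * carpet + rnorm(n, sd = 200000)

# The two predictors are almost perfectly correlated
r <- cor(carpet, builtup)
r

# Variance inflation factor for a model with just these two predictors
vif_carpet <- 1 / (1 - r^2)
vif_carpet

# Fit the model with both predictors, and with carpet alone
both_fit <- lm(price ~ carpet + builtup)
one_fit  <- lm(price ~ carpet)

se_both <- summary(both_fit)$coefficients["carpet", "Std. Error"]
se_one  <- summary(one_fit)$coefficients["carpet", "Std. Error"]

# How many times larger the standard error becomes with the collinear partner
se_both / se_one
```

A larger standard error means a smaller t value, which is how a variable that genuinely drives the price can end up looking insignificant.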
Step 4: evaluate performance of the 1st regression model
Estimate = predict(Org_Reg, type = 'response', newdata = Clean_Data[test, c(numeric, categoric, Target)])
Observed = subset(Clean_Data[test, c(numeric, categoric, Target)], select = Target)
format(cor(Estimate, Observed$House_Price)^2, digits = 4)
In the above code, ‘Estimate’ is the model-estimated value of the house prices for the test sample and ‘Observed’ is the actual value of the house price. The correlation between the observed and estimated values tells us the level of accuracy of the model. The square of this correlation is referred to as the R-square value, which we use here as the measure of the predictive power of the model. The R-square value for this multicollinearity-infected model is 0.4489. This means that around 44.89% of the variation in the house price in the test sample can be explained by these predictor variables.
The next step for us is to remove multicollinearity from our model. A good way to achieve this is by building the model with the orthogonal principal components derived from the original variables. Remember, principal component analysis transforms a set of numeric variables into uncorrelated components.
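As a side note, the squared correlation is one of two common ways to compute R-square on a test sample; the other is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below compares the two on simulated data. The data and all the names in it are assumptions for illustration only. On the training data of a linear model with an intercept the two definitions agree; on a test sample the second one can only be equal or lower, because it also penalizes predictions that are systematically shifted or scaled.

```r
set.seed(2)
n <- 200

# Simulated (hypothetical) data with one predictor
dat <- data.frame(x = rnorm(n))
dat$y <- 3 + 2 * dat$x + rnorm(n)

# 70/30 split, mirroring the approach used in this article
train_idx <- sample(n, round(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

fit  <- lm(y ~ x, data = dat[train_idx, ])
pred <- predict(fit, newdata = dat[test_idx, ])
obs  <- dat$y[test_idx]

# Definition 1: squared correlation between predicted and observed values
r2_cor <- cor(pred, obs)^2

# Definition 2: one minus residual sum of squares over total sum of squares
r2_sse <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(squared_correlation = r2_cor, one_minus_sse_over_sst = r2_sse)
```

If the two numbers are far apart on your own test data, it is a hint that the predictions are biased even though they are well correlated with the observed values.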
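Here is a small sketch of that property on simulated data. Note that it uses base R's `prcomp` rather than the FactoMineR `PCA` function used in this article, only so that it runs without any extra package; the simulated variables are made up for the illustration.

```r
set.seed(3)
n <- 300

# Simulated (hypothetical) numeric variables; two of them are highly correlated
carpet   <- rnorm(n, mean = 1500, sd = 250)
builtup  <- 1.2 * carpet + rnorm(n, sd = 10)
rainfall <- rnorm(n, mean = 800, sd = 250)
X <- data.frame(Carpet = carpet, Builtup = builtup, Rainfall = rainfall)

# Correlation matrix of the original variables: Carpet and Builtup are near 1
round(cor(X), 2)

# PCA on centred and scaled variables (base R, not FactoMineR)
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x

# Correlation matrix of the component scores: off-diagonal entries are zero
round(cor(scores), 2)
```

The components carry the same information as the original variables, just rearranged so that no component is correlated with another.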
Step 5: prepare data for 2nd regression model with principal components
require(FactoMineR)
Data_for_PCA <- Clean_Data[, numeric]
pca1 = PCA(Data_for_PCA)
PCA_data = as.data.frame(cbind(Clean_Data[train, c(Target, categoric)], pca1$ind$coord[train, ]))
In PCA_data we have replaced all the numeric variables with principal components. We will use this data to build our second regression model to counter multicollinearity.
Step 6: build 2nd regression model with principal components
Step_PCA_Reg = step(lm(House_Price ~ ., data = PCA_data))
summary(Step_PCA_Reg)
Coefficients:
Variable | Estimate | Std. Error | t value | Pr(>|t|) | Signif. |
(Intercept) | 7684893 | 120912 | 63.558 | < 2e-16 | *** |
Comp 1 | 181462 | 32083 | 5.656 | 2.37e-08 | *** |
Comp 2 | 149740 | 34506 | 4.340 | 1.67e-05 | *** |
(Parking) No Parking | -643139 | 157929 | -4.072 | 5.26e-05 | *** |
(Parking) NotProvided | -503083 | 142925 | -3.520 | 0.0004 | *** |
(Parking) Open | -280855 | 130877 | -2.146 | 0.0322 | * |
(City_Category) CAT B | -1802882 | 110352 | -16.338 | < 2e-16 | *** |
(City_Category) CAT C | -2860830 | 110352 | -23.418 | < 2e-16 | *** |
As you must have noticed, this model contains none of the original numeric variables; they have been replaced by the uncorrelated principal components, i.e. Comp 1 and Comp 2. Moreover, we have run stepwise regression to remove unimportant variables and components. In this case, only components 1 and 2 were retained, and the other components (3 to 6) were dropped because they did not help in estimating house prices. Let’s see how this new model performs in terms of accuracy on the test dataset.
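To get a feel for what `step` does, here is a minimal sketch on simulated data with two real predictors and two pure-noise predictors. All the names and numbers are assumptions made up for the illustration. One thing worth knowing is that, by default, `step` compares models by AIC rather than by p-values, so it may occasionally keep a weak variable that a strict significance cut-off would drop.

```r
set.seed(4)
n <- 300

# Simulated (hypothetical) data: two useful predictors and two noise predictors
dat <- data.frame(
  signal1 = rnorm(n),
  signal2 = rnorm(n),
  noise1  = rnorm(n),
  noise2  = rnorm(n)
)
dat$y <- 5 + 3 * dat$signal1 - 2 * dat$signal2 + rnorm(n)

# Start from the full model and let step() drop terms that do not improve AIC
full_fit <- lm(y ~ ., data = dat)
step_fit <- step(full_fit, trace = 0)   # trace = 0 suppresses the step-by-step log

# Terms kept in the final model
names(coef(step_fit))
```

The strong predictors always survive; whether a noise predictor is dropped depends on the sample, which is why it is worth looking at the summary of the final model rather than trusting the selection blindly.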
Step 7: performance evaluation of the 2nd regression model
PCA_Estimate = predict(Step_PCA_Reg, type = 'response', newdata = cbind(Clean_Data[test, c(Target, categoric)], pca1$ind$coord[test, ]))
format(cor(PCA_Estimate, Observed$House_Price)^2, digits = 4)
The accuracy or R-square value for this model is 0.4559. This is a slight improvement in accuracy over the original model. More importantly, we know that the numeric variables in this model are not correlated, hence we have tackled the demons of multicollinearity.
It is always a little problematic for analysts to explain an analysis based on principal components to their clients. Moreover, during operationalization of models, principal components add another level of complexity. Hence, it is a good idea, if possible, to build the model with the original raw variables. You may remember this table from the previous part of this article on principal component analysis.
Variable | comp 1 | comp 2 | comp 3 | comp 4 | comp 5 | comp 6 |
Dist_Hospital | 88% | 0% | 0% | 2% | 10% | 0% |
Dist_Taxi | 76% | 0% | 1% | 17% | 6% | 0% |
Dist_Market | 61% | 0% | 0% | 38% | 1% | 0% |
Rainfall | 1% | 1% | 98% | 0% | 0% | 0% |
Carpet | 0% | 100% | 0% | 0% | 0% | 0% |
Builtup | 0% | 100% | 0% | 0% | 0% | 0% |
As you can see, the dominant variables in comp 1 & 2 are distance to hospital and carpet area of the house. Built-up area loads just as strongly on comp 2, but it is highly correlated with carpet area, so we keep only one of the two. Hence, we will build our 3rd and final model with these variables.
Step 8: build 3rd regression model with dominant variables in significant principal components
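The same selection can be written in a few lines of R. The sketch below simply types in the percentages from the table above and picks, for each of the two retained components, the variable with the largest share. The object names are made up for the illustration.

```r
# Percentages copied from the table above (rows: variables, columns: components)
contrib <- matrix(
  c(88,   0,  0,  2, 10, 0,
    76,   0,  1, 17,  6, 0,
    61,   0,  0, 38,  1, 0,
     1,   1, 98,  0,  0, 0,
     0, 100,  0,  0,  0, 0,
     0, 100,  0,  0,  0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Dist_Hospital", "Dist_Taxi", "Dist_Market", "Rainfall", "Carpet", "Builtup"),
    paste("comp", 1:6)
  )
)

# For each retained component, take the variable with the largest percentage.
# which.max() returns the first maximum, so the Carpet/Builtup tie goes to Carpet.
dominant <- sapply(c("comp 1", "comp 2"), function(comp) {
  rownames(contrib)[which.max(contrib[, comp])]
})
dominant
```

Because Carpet and Builtup are tied on comp 2, the choice between them is a judgment call rather than something the table decides for you.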
numeric_new = c('Dist_Hospital', 'Carpet')
New_Reg = lm(House_Price ~ ., data = Clean_Data[train, c(Target, numeric_new, categoric)])
summary(New_Reg)
Coefficients:
Variable | Estimate | Std. Error | t value | Pr(>|t|) | Signif. |
(Intercept) | 5050297 | 406156 | 12.434 | < 2e-16 | *** |
Dist_Hospital | 109 | 18.7 | 5.824 | 9.22e-09 | *** |
Carpet | 811 | 195 | 4.161 | 3.61e-05 | *** |
(Parking) No Parking | -646164 | 157896 | -4.092 | 5.26e-05 | *** |
(Parking) NotProvided | -497397 | 142745 | -3.485 | 0.0005 | *** |
(Parking) Open | -274208 | 130744 | -2.097 | 0.0363 | * |
(City_Category) CAT B | -1811069 | 110093 | -16.450 | < 2e-16 | *** |
(City_Category) CAT C | -2854096 | 122091 | -23.377 | < 2e-16 | *** |
In this model we have much friendlier numeric variables. Carpet area turned out to be significant now that we have removed built-up area; remember, it was not significant in the 1st model. Now the only question is how accurate this model is in comparison to the model we built with principal components. Let’s evaluate the performance of this model.
Step 9: performance evaluation of the 3rd regression model
New_Estimate = predict(New_Reg, type = 'response', newdata = Clean_Data[test, c(numeric, categoric, Target)])
Observed = subset(Clean_Data[test, c(numeric, categoric, Target)], select = Target)
format(cor(New_Estimate, Observed$House_Price)^2, digits = 4)
The R-square value for this model is 0.4517. This is not too bad. You can live with a slight reduction of accuracy since it will make your job of the operationalization of this model on your client’s system much less complicated. This is the final model that you will share with your client.
Sign-off Note
Your model, despite your best effort, is only good enough to explain about 45% of the variation in the house price. But this is still better for estimating house prices than having no model at all. You are also slightly better equipped to tackle pseudo-experts like some real-estate agents. However, there is still 55% of the variation in this data that can’t be explained by these predictor variables. You will have to bring new and innovative variables into this model to completely throw pseudo-experts out of business.
YASSS! I always wonder how we can create ‘innovative variables’ that boost the predictive power of regression model. CAN’T WAIT FOR PART 6! 🙂
Roopam, you rock! No one could be any more lucid than you in explaining PCA! Thumbs up!
Just have a quick question. Why did we not find PCs for categorical variables? Do the categorical variables not have multicollinearity? If yes, how to deal with them using PCA?
Thanks, Chandramouli.
That’s a good question. Actually, PCA only works for numeric variables, as you must have noticed from the approach to PCA in these articles. For most models, numeric variables tend to show a higher degree of correlation. However, there is no reason to believe that categorical variables will never create problems of collinearity. In the case of categorical variables, a simple cross table and a chi-square test can reveal a lot about whether two variables are strongly associated.
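As a rough illustration of the cross table and chi-square test, here is a sketch on simulated data. The two categorical variables, their levels, and the strength of the dependence between them are all assumptions made up for the example; they are not taken from the case study data.

```r
set.seed(5)
n <- 400

# Simulated (hypothetical) categorical variables with a built-in dependence:
# covered parking is far more common in CAT A than in the other categories
city      <- sample(c("CAT A", "CAT B", "CAT C"), n, replace = TRUE)
p_covered <- ifelse(city == "CAT A", 0.8, 0.2)
parking   <- ifelse(runif(n) < p_covered, "Covered", "Open")

# Cross table of the two variables
tab <- table(city, parking)
tab

# Chi-square test of independence; a very small p-value signals strong association
chi <- chisq.test(tab)
chi$p.value
```

A strong association between two categorical predictors is the categorical counterpart of a high correlation between two numeric ones, and it deserves the same caution when you interpret the coefficients.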
Hi,
When using pca$ind$coord for other datasets, I see that it provides components covering only up to 75% of the variation. Is there any way to see all the components?
Thanks
For this, increase the number of dimensions kept by the PCA command by setting ncp to a higher value. If you make ncp equal to the number of input variables (i.e. 6 in this case), you will capture 100% of the variance.
PCA(X, scale.unit = TRUE, ncp = 5, ind.sup = NULL,
quanti.sup = NULL, quali.sup = NULL, row.w = NULL,
col.w = NULL, graph = TRUE, axes = c(1,2))
ncp: number of dimensions kept in the results (by default 5)
Roopam
Great to see you sharing your knowledge with the world.
New to DS and loving learning what you are sharing.
A blessing in disguise.
Would you also try to put the code in python if time permits please.
Thanks
Sai
You can find the Python notebook for this case study at this link http://ucanalytics.com/blogs/python-code-time-series-forecasting-arima-models-manufacturing-case-study-example/
Hello Roopam, thanks for the article. When you used PCA 1 and PCA 2 in your regression, how did you do it? Do you just use the loadings as weights? Or do eigenvalues/eigenvectors come into play? Or something else? I cannot wrap my head around this. Thank you.
The eigenvectors are the basis vectors, so after PCA the original data matrix is projected onto the eigenvectors with the highest eigenvalues. Hence, variable loadings, eigenvalues, and eigenvectors are all interrelated.
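To make this concrete, here is a minimal sketch in base R on simulated data. It uses `eigen` and `prcomp` rather than FactoMineR, whose scaling conventions may differ slightly, and the variables are made up for the illustration. The component scores that go into the regression are just the standardized data multiplied by the eigenvectors of the correlation matrix.

```r
set.seed(6)
n <- 200

# Simulated (hypothetical) numeric variables; x1 and x2 are correlated
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

Z   <- scale(X)        # centre and scale each column
eig <- eigen(cor(X))   # eigenvalues and eigenvectors of the correlation matrix

# Component scores: standardized data projected onto the eigenvectors
scores <- Z %*% eig$vectors

# The variance of each score column equals the corresponding eigenvalue
score_var <- apply(scores, 2, var)
rbind(score_var = score_var, eigenvalue = eig$values)

# prcomp() gives the same scores, up to the sign of each component
pca      <- prcomp(X, center = TRUE, scale. = TRUE)
max_diff <- max(abs(abs(scores) - abs(pca$x)))
max_diff
```

So the eigenvectors supply the weights, the eigenvalues tell you how much variance each component carries, and the score columns are what you pass to `lm` in place of the original variables.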
Hi, I wasn’t able to proceed with this line of code in R from step five. I attempted to remove the “amp” from the code, but then R couldn’t read “lt”. Could you explain what this code is trying to do and whether any modifications are needed?
Data_for_PCA&amp;amp;lt;-Clean_Data[,numeric]