How do you figure out whether you are paying the right price for a property you are about to purchase? Welcome to a new data science case study example on YOU CANalytics, which aims to identify the right housing price. Pricing is a highly important and specialized function for any business. The right price can make the difference between profit and loss. In this case study, we will use the example of property pricing to gain a deeper understanding of regression analysis.
Regression analysis is the mother of all machine learning and analysis techniques. Hence, it is essential for every data scientist to have an intuitive understanding of regression. This understanding helps them appreciate other advanced data science and analytics techniques. In this case study, we will explore the nuances of regression analysis, including data preparation, correlation analysis, principal component analysis (PCA), traditional regression with variable selection, and regression with regularization (Ridge & Lasso – used in machine learning).
Based on suggestions from several readers, I will share the data for this case study example right at the beginning so that you can play around with it and learn. Essentially, we will work on this case study together. Download the data file: Regression analysis data. But before we analyse this data, let’s create some connections between regression analysis and the zodiac signs that govern your daily horoscope.
Zodiac Signs & Regression Analysis – Connect the Dots
The night sky has always fascinated humans. For centuries, human imagination has looked at the night sky as a vast canvas with stars as dots waiting to be connected. Constellations are the result of this imaginative thinking. A constellation is a group of stars connected together to form a mythical character such as Orion or the Great Bear. There are 88 officially recognized constellations, of which the twelve most popular are the zodiac signs, i.e. Aquarius, Pisces, Aries etc. Constellations had a practical use in ancient times, when they were used to identify seasons. Each of the 12 zodiac constellations is clearly visible in the night sky during a particular period of the calendar year. For instance, Aries is visible in October, Taurus in November, and so on. In the absence of modern calendars, farmers used the position of the constellations to plan their crops. This representation displays the relative position of zodiac constellations to the Earth and the Sun.
So how did constellations become part of horoscopes? In several cultures, changes in seasons are also associated with changes in fate. This makes sense, since agricultural productivity is directly linked to seasons. Hence, the zodiac constellations, which change with the seasons, became the indicators for horoscopes. In October, we see Aries in the clear night sky. During the same period, the Sun blocks our view of Libra, which lies on the opposite side of the sky. This means Libra is in the house of the Sun. If you were born between 23rd September and 23rd October, your zodiac sign is Libra. Other aspects of horoscopes are as much a part of human imagination as the shapes of constellations.
Again, what do constellations and zodiac signs have to do with regression analysis? Regression analysis is also an effort to connect the dots, similar to the formation of constellations from stars. The major difference is that regression analysis relies not on human imagination but on mathematics to find the optimal connection. Keeping this in mind, let’s move to our case study example.
Housing Price – Regression Analysis Case Study Example
Buy cheap and sell dear is the fundamental goal in a market economy. If you purchase something below its market price, you have more room to make a profit. ByeBuyHome is a property listing site that aggregates ready-to-buy properties and their quoted prices across the country. This is a good opportunity for property investors to identify properties that are selling at a lower premium. The question is: how do you identify whether a property is up for grabs at a price lower than the market price?
You are a data analytics consultant to one such investing firm. You have accumulated data for the properties sold this month along with the features of these properties: Regression analysis data. This data contains the following parameters:

You have calculated the distance for the first 3 variables based on your proprietary algorithm and data from Google Maps. Now that you have some data with you, here are the two immediate goals for you:
1) Do an exploratory data analysis to identify the initial patterns in this data and report your findings in the ‘Leave a Comment’ section at the bottom of this page. Please don’t do any regression analysis at this point, just data exploration, i.e. identification of outliers, missing values, and univariate and bivariate patterns.
2) Think of yourself as a god who has access to all the information in the Universe. What information (variables) would you use to estimate the house price? Please report the variables of your choice in the comments section.
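If you would like a starting point for the first goal, here is a minimal sketch in Python with pandas. It is only one way to begin, not the method this case study prescribes. The data below is synthetic and made up for illustration, the helper `iqr_outliers` is a hypothetical name, and the column names `Carpet` and `House_Price` are assumed to match the data file; adjust them to whatever you find in the download.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the case study file: values are invented,
# and the column names are assumptions about the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Carpet": rng.normal(1500, 250, 200),
    "House_Price": rng.normal(6_000_000, 1_200_000, 200),
})
df.loc[0, "House_Price"] = 150_000_000   # plant one extreme price
df.loc[[3, 7], "Carpet"] = np.nan        # plant two missing values


def iqr_outliers(series, k=1.5):
    """Flag values outside the boxplot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Univariate summary, missing values per column, and flagged prices
summary = df.describe()
missing_counts = df.isna().sum()
price_outliers = df.loc[iqr_outliers(df["House_Price"]), "House_Price"]

print(summary)
print(missing_counts)
print(price_outliers)
```

The 1.5 × IQR rule is the same rule a standard boxplot uses to draw its whiskers, so the flagged rows are the points you would see plotted individually beyond them. Whether a flagged value is a data error or a genuine extreme is a judgment you still have to make.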
You will find these articles on regression analysis useful: article 1 & article 2.
Signoff Note
I look forward to reading your comments. Your answers will lead us to the next part in this case study example. We don’t need horoscopes anymore in this case study to estimate the right price and fate of a house; it is all mathematics and logic from here on.
As a result of my exploratory data analysis I came to the following conclusions:
1. The house prices are approximately normally distributed. All values except the three outliers lie between 1492000 and 10515000. To obtain this result I first created a boxplot of the housing prices to identify the outliers. After excluding them from my data set I created a histogram of the housing prices to obtain their distribution.
2. Among all numeric variables, house prices are most highly correlated with Carpet (0.9) and Builtup (0.75).
3. After creating parallel boxplots to compare house prices between categories of Parking and City Category, I came to the conclusion that Parking is only slightly (and probably not significantly) related to house prices, while City Category is strongly related to house prices.
Other variables that might have predictive power for house prices are: number of bedrooms, number of bathrooms, year built, time since last repair, and crime rate in the vicinity.
I am looking forward to the next part of the case study.
Thanks Katya, these are some good observations. I will use some of them in my next post.
You may want to think of some more variables that influence the price of a house. It’s a completely creative exercise – don’t worry about the availability of data for those factors – treat it like a creative and lateral thinking exercise. I find this kind of lateral thinking effort extremely useful while doing data science modeling. Remember, there are no wrong answers here.
Hi, I am a novice in Analytics and was trying to find the relations using SPSS. However, I am not sure how I could use the data for Parking and City_Category to obtain their relations with House_Price. Could you please enlighten me with your advice?
I am also curious to learn how parallel boxplots help in understanding the relation, i.e. what they tell us from an analytical point of view.
City category and Parking are categorical variables. These types of variables are converted into dummy variables (0/1) before building the model. In advanced modeling tools like SPSS, R, and SAS, this conversion into dummy variables happens automatically in the background. Hence, you can model these variables the same way as numeric variables with these tools. For box plots, keep an eye on the medians (the central lines) and compare whether they are significantly different across groups.
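To make the dummy-variable idea concrete, here is a small sketch in Python with pandas. It is an illustration, not what SPSS, R, or SAS do internally. The category labels (`CAT A`, `Open`, and so on) and the prices are made up for this example, and the column names are assumed to match the data file.

```python
import pandas as pd

# Tiny made-up sample: labels and prices are illustrative only.
df = pd.DataFrame({
    "City_Category": ["CAT A", "CAT B", "CAT C", "CAT A", "CAT B", "CAT C"],
    "Parking": ["Open", "Covered", "Open", "Covered", "Open", "Covered"],
    "House_Price": [9.0, 6.0, 4.0, 10.0, 7.0, 5.0],  # in millions
})

# One 0/1 column per category level. drop_first=True drops one level per
# variable, which then acts as the baseline the others are compared against.
dummies = pd.get_dummies(
    df[["City_Category", "Parking"]], drop_first=True, dtype=int
)

# The central line of each box in a parallel boxplot is the group median.
median_by_city = df.groupby("City_Category")["House_Price"].median()
median_by_parking = df.groupby("Parking")["House_Price"].median()

print(dummies)
print(median_by_city)
print(median_by_parking)
```

In this toy sample the medians differ far more across city categories than across parking types, which is the pattern a parallel boxplot would show as boxes sitting at clearly different heights. A visual gap alone does not establish statistical significance.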
Hi,
Below are some of my observations:
a) All the numeric variables seem to be normally distributed, by and large.
b) Correlation between housing price and other variables:
Initially, it appears as if housing price has a good correlation with Builtup and Carpet. But once we remove all observations with missing values (which is just ~4% of total observations), I find that the correlation drops very low (~0.09 range).
c) A couple of variables have high correlation among themselves (again with the same ~96% of the data):
(i) Builtup and Carpet have high correlation
(ii) Distance to taxi/market has high correlation with distance to hospital
This means that we will have to remove the proxy variables before using them for prediction.
Of course, some outliers were found, like Rainfall having a negative value and a housing price of 150 million.
Thinking about the possible predictor variables, SQ.Ft Area, #Rooms, Amenities, Locality Real estate value, Locality Pop. density, Proximity to main centres (given), recent price appreciation of the locality, Type of housing, Median Income of the locality etc., could be useful.
Thanks Mani, all good observations. I suggest you further explore the missing data. Also, your observation (b) is interesting. We will use this to figure out some peculiar properties about regression analysis.
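If you want to check observations like (b) and (c) yourself, one possible approach is to compare correlations computed on all available pairs with correlations computed on complete rows only, and then list the predictor pairs that move together. The sketch below uses Python with pandas on synthetic data, so it does not reproduce the numbers reported above. The column names are assumed to match the data file, and the 0.8 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: Builtup is constructed to track Carpet closely.
rng = np.random.default_rng(1)
n = 300
carpet = rng.normal(1500, 250, n)
df = pd.DataFrame({
    "Carpet": carpet,
    "Builtup": carpet * 1.2 + rng.normal(0, 20, n),
    "Dist_Hospital": rng.normal(13000, 2500, n),
    "House_Price": carpet * 4000 + rng.normal(0, 400_000, n),
})
# Knock out 12 of 300 values (4%) to mimic missing data
df.loc[rng.choice(n, 12, replace=False), "Dist_Hospital"] = np.nan

# DataFrame.corr() skips missing values pair by pair, so each cell may be
# based on a different set of rows; dropna() first keeps complete rows only.
corr_pairwise = df.corr()
complete_rows = df.dropna()
corr_complete = complete_rows.corr()

# Predictor pairs whose absolute correlation exceeds the threshold
predictors = corr_complete.drop(index="House_Price", columns="House_Price")
high_pairs = [
    (a, b, round(predictors.loc[a, b], 2))
    for i, a in enumerate(predictors.columns)
    for b in predictors.columns[i + 1:]
    if abs(predictors.loc[a, b]) > 0.8
]

print(corr_pairwise.round(2))
print(corr_complete.round(2))
print(high_pairs)
```

If a correlation changes sharply between the two matrices, look at the rows that were dropped: a handful of rows with extreme values can move a correlation a long way.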
Hello Roopam,
Following are some of my observations excluding the ones mentioned above:
1. What data can be considered in place of Dist_Taxi, now that there are online taxi players (like Uber, Ola etc.)?
2. Proximity to travel terminals (such as railway stations, airports, bus stops, highways etc.)
3. Proximity to malls?
4. Availability of continuous utility supply (such as frequency of power cuts and water cuts)