{"id":8488,"date":"2016-08-08T20:31:24","date_gmt":"2016-08-08T15:01:24","guid":{"rendered":"http:\/\/ucanalytics.com\/blogs\/?p=8488"},"modified":"2016-10-22T11:45:05","modified_gmt":"2016-10-22T06:15:05","slug":"data-preparation-regression-pricing-case-study-example-part-2","status":"publish","type":"post","link":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/","title":{"rendered":"Data Preparation for Regression &#8211; Pricing Case Study Example (Part 2)"},"content":{"rendered":"<hr \/>\n<p>In the last post we had started a case study example for regression analysis to help an investment firm make money through property price arbitrage\u00a0(read part 1 :\u00a0<strong><a href=\"http:\/\/ucanalytics.com\/blogs\/regression-analysis-pricing-case-study-example-part-1\/\" target=\"_blank\">regression case study example<\/a><\/strong>).\u00a0This is an interactive case study example and required your help to move forward. These are some of your observations from\u00a0exploratory analysis that you shared in the comments of the last part (<strong><a href=\"http:\/\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/07\/Regression-Analysis-Data.csv\">download the data here<\/a><\/strong>)<\/p>\n<blockquote><p><span style=\"color: #0000ff;\"><strong>Katya Chomakova<\/strong><\/span> : The house prices are approximately normally distributed. All values except the three outliers lie between 1492000 and 10515000. Among all numeric variables, house prices are most highly correlated with Carpet (0.9) and Builtup(0.75).<br \/>\n<span style=\"color: #0000ff;\"><strong>Mani<\/strong><\/span> : Initially, it appears as if housing price has good correlation with built up and carpet. 
But, once we remove all observations having missing values (which is just ~4% of total obs), I find that the correlation drops down very low (~0.09 range)<\/p><\/blockquote>\n<div id=\"attachment_8489\" style=\"width: 327px\" class=\"wp-caption alignright\"><a href=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg\"><img aria-describedby=\"caption-attachment-8489\" data-attachment-id=\"8489\" data-permalink=\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/regression-analysis\/\" data-orig-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&amp;ssl=1\" data-orig-size=\"448,528\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Regression analysis\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=255%2C300&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-8489\" src=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?resize=317%2C374\" alt=\"Punch and Regression Analysis - by Roopam\" width=\"317\" height=\"374\" srcset=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?w=448&amp;ssl=1 448w, 
https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?resize=212%2C250&amp;ssl=1 212w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?resize=255%2C300&amp;ssl=1 255w\" sizes=\"(max-width: 317px) 100vw, 317px\" data-recalc-dims=\"1\" \/><\/a><p id=\"caption-attachment-8489\" class=\"wp-caption-text\">Data Preparation for Regression Analysis &#8211; by Roopam<\/p><\/div>\n<p>Katya and Mani noticed something unusual about missing observations and outliers in the data, and how their presence or absence changed the results dramatically. This is why data preparation is an important exercise for any machine learning or statistical analysis that needs to produce consistent results. We will learn about data preparation for regression analysis in this part of the case study. Before we explore this in detail, let&#8217;s take a slight detour to understand the crux of stability and talk about the fall of heroes.<\/p>\n<p>Every kid needs a hero. I had many when I was growing up. This is a story of how I used a concept in physics called &#8216;<em>center of gravity<\/em>&#8217; to choose one of my heroes by having an imaginary competition between:<\/p>\n<h2><span style=\"color: #3366ff;\">Mike Tyson Vs. Bop Bag<\/span><\/h2>\n<p><strong>The Champion<\/strong> : Mike Tyson was the\u00a0undisputed heavyweight boxing champion in the late 1980s. He was no Muhammad Ali but was on his path to come closest to\u00a0<em>The Greatest<\/em>. This is where things went wrong for Tyson; he was convicted of rape and was in prison for 3 years. Out of jail and desperate to regain his glory days, Tyson challenged\u00a0Evander Holyfield, the then undisputed champion. 
What followed was a disgrace for any sport: during the challenge match, Tyson bit off a part of Holyfield&#8217;s ear and was disqualified.<\/p>\n<p><strong>The Challenger<\/strong> : Most of us have played with a bop bag, or punching toy, as kids. It is designed in such a way that when\u00a0punched, it topples for a while but eventually stands back up on its own. The bop bag is a perfect example of an object whose center of gravity is highly grounded and stays within its\u00a0body. You could punch it, kick it, or perturb it in any possible way, but the bop bag will stand back up after a fall &#8211; yes, it has that cute, funny smile too. On the other hand, like Mike Tyson, most of us struggle big time after a fall, possibly because our center of gravity is outside our body, in other people&#8217;s opinions about us. Tyson was mostly driven by\u00a0the praise\u00a0from others after a win rather than his love for the game.<\/p>\n<p><strong>The Winner<\/strong> : Center of gravity helped me choose my hero : the bop bag. This cute toy reminds me every day to keep my center grounded and inside my body and not let others perturb my core &#8211; even when punched. I wish I could always wear a sincere\u00a0smile like my hero.<\/p>\n<p>The bop bag also has important lessons for data preparation for machine learning and data science models. The data for modeling needs to display stability similar to the bop bag and must not give completely different results with different observations.\u00a0Katya and Mani have noticed a major\u00a0instability in our data in their exploratory analysis. They have highlighted the presence of missing data and outliers; we will explore these ideas further in this part as we work through data preparation for regression analysis. 
Now, let&#8217;s go back to our case study example.<\/p>\n<h2><span style=\"color: #3366ff;\">Data Preparation for Regression &#8211; Case Study Example<\/span><\/h2>\n<p><a href=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg\"><img data-attachment-id=\"8527\" data-permalink=\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/regression-model\/\" data-orig-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg?fit=745%2C761&amp;ssl=1\" data-orig-size=\"745,761\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Regression model\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg?fit=294%2C300&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg?fit=640%2C654&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-8527\" src=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg?resize=307%2C313\" alt=\"Regression model\" width=\"307\" height=\"313\" srcset=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg?w=745&amp;ssl=1 745w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg?resize=245%2C250&amp;ssl=1 245w, 
https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-model.jpg?resize=294%2C300&amp;ssl=1 294w\" sizes=\"(max-width: 307px) 100vw, 307px\" data-recalc-dims=\"1\" \/><\/a>You are a data science consultant for\u00a0an investment firm that tries to make money through property price arbitrage. They get daily data for thousands of houses available for sale across the country. They expect you to suggest properties worth investing in. This requires you to identify properties selling at a lower price than the market price. You already have quoted prices for all the properties. Now,\u00a0you need to create a model to estimate the market price of properties. Your client should invest in the properties with a higher estimated price than the quoted price.<\/p>\n<p>In your effort to create a price estimation model, you have gathered <a href=\"http:\/\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/07\/Regression-Analysis-Data.csv\">this data<\/a>. The next step is\u00a0data preparation for regression analysis before the development of a model. This will require us to prepare a robust and logically correct dataset for analysis.<\/p>\n<p>We will do our analysis for this case study example in R. For this,\u00a0I recommend you install <strong><a href=\"https:\/\/cran.r-project.org\/bin\/windows\/base\/\" target=\"_blank\">R<\/a><\/strong> &amp; <a href=\"https:\/\/www.rstudio.com\/products\/rstudio\/download2\/\" target=\"_blank\"><strong>RStudio<\/strong> <\/a>on your system. 
However, you could also try this code on this online R engine :\u00a0<strong><a href=\"http:\/\/www.r-fiddle.org\/#\/fiddle?id=G5ZUcE5U\" target=\"_blank\">R-Fiddle<\/a><\/strong>.<\/p>\n<p>We will first import the data into R and then prepare a summary report for all the variables using these commands:<\/p>\n<pre><span style=\"font-size: 12pt;\"><span style=\"color: #ff6600;\"><strong>data<\/strong><\/span>&lt;-<span style=\"color: #0000ff;\"><strong>read.csv<\/strong><\/span>('<span style=\"color: #993366;\">http:\/\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/07\/Regression-Analysis-Data.csv<\/span>')<\/span>\r\n<span style=\"font-size: 12pt;\">\r\n<span style=\"color: #0000ff;\"><strong>summary<\/strong><\/span>(<strong><span style=\"color: #ff6600;\">data<\/span><\/strong>)<\/span><\/pre>\n<p>A version of the summary report is displayed here. Remember that there are 932 observations in total in this data set.<\/p>\n<table style=\"border-color: #000000;\" width=\"594\">\n<tbody>\n<tr style=\"height: 23px;\">\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\" width=\"112\"><\/td>\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\" width=\"112\"><span style=\"font-size: 10pt;\"><strong>Dist_Taxi<\/strong><\/span><\/td>\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\" width=\"95\"><span style=\"font-size: 10pt;\"><strong>Dist_Market<\/strong><\/span><\/td>\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\" width=\"95\"><span style=\"font-size: 10pt;\"><strong>Dist_Hospital<\/strong><\/span><\/td>\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\" width=\"90\"><span style=\"font-size: 10pt;\"><strong>Carpet<\/strong><\/span><\/td>\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\" width=\"90\"><span style=\"font-size: 10pt;\"><strong>Builtup<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"height: 23px; width: 112px; 
background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Min.\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">146<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1666<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">3227<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">775<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">932<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">1st Qu.<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">6476<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">9354<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">11302<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1318<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1583<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Median\u00a0\u00a0<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">8230<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">11161<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">13163<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1480<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1774<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Mean\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">8230<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">11019<\/span><\/td>\n<td 
style=\"height: 23px;\"><span style=\"font-size: 10pt;\">13072<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1512<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1795<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">3rd Qu.<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">9937<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">12670<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">14817<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1655<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1982<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23.25px;\">\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Max.\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"height: 23.25px;\"><span style=\"font-size: 10pt;\">20662<\/span><\/td>\n<td style=\"height: 23.25px;\"><span style=\"font-size: 10pt;\">20945<\/span><\/td>\n<td style=\"height: 23.25px;\"><span style=\"font-size: 10pt;\">23294<\/span><\/td>\n<td style=\"height: 23.25px;\"><span style=\"font-size: 10pt;\">24300<\/span><\/td>\n<td style=\"height: 23.25px;\"><span style=\"font-size: 10pt;\">12730<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"height: 23px; width: 112px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">NA&#8217;s\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">13<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">13<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">1<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 10pt;\">8<\/span><\/td>\n<td style=\"height: 23px;\"><span style=\"font-size: 
10pt;\">15<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Look at the last row: all of the above variables have some missing data. Parking and City_Category are categorical variables, so the summary reports the number of observations at each of their levels instead. Notice that Parking also has missing data, marked as &#8216;Not Provided&#8217;.<\/p>\n<table style=\"width: 584px; height: 160px;\">\n<tbody>\n<tr style=\"height: 23px;\">\n<td style=\"width: 215px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\"><strong>Parking<\/strong><\/span><\/td>\n<td style=\"width: 108.4px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\"><strong>City_Category<\/strong><\/span><\/td>\n<td style=\"width: 75.6px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\"><strong>\u00a0<\/strong><\/span><\/td>\n<td style=\"width: 67px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\"><strong>Rainfall<\/strong><\/span><\/td>\n<td style=\"width: 90px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\"><strong>House_Price<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"width: 215px; height: 147.05px;\" rowspan=\"7\"><span style=\"font-size: 10pt;\">Covered : 188<\/span><\/p>\n<p><span style=\"font-size: 10pt;\">No Parking: 145<\/span><\/p>\n<p><span style=\"font-size: 10pt;\">Not Provided : 227<\/span><\/p>\n<p><span style=\"font-size: 10pt;\">Open : 372<\/span><\/td>\n<td style=\"width: 108.4px; height: 147.05px;\" rowspan=\"7\"><span style=\"font-size: 10pt;\">CAT A: 329<\/span><\/p>\n<p><span style=\"font-size: 10pt;\">CAT B: 365<\/span><\/p>\n<p><span style=\"font-size: 10pt;\">CAT C: 238<\/span><\/p>\n<p><span style=\"font-size: 10pt;\">\u00a0\u00a0<\/span><\/td>\n<td style=\"width: 75.6px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Min.\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"width: 67px; height: 23px;\"><span style=\"font-size: 
10pt;\">110<\/span><\/td>\n<td style=\"width: 90px; height: 23px;\"><span style=\"font-size: 10pt;\">30000<\/span><\/td>\n<\/tr>\n<tr style=\"height: 29.05px;\">\n<td style=\"width: 75.6px; height: 29.05px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">1st Qu.<\/span><\/td>\n<td style=\"width: 67px; height: 29.05px;\"><span style=\"font-size: 10pt;\">600<\/span><\/td>\n<td style=\"width: 90px; height: 29.05px;\"><span style=\"font-size: 10pt;\">4658000<\/span><\/td>\n<\/tr>\n<tr style=\"height: 2px;\">\n<td style=\"width: 75.6px; height: 2px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Median\u00a0<\/span><\/td>\n<td style=\"width: 67px; height: 2px;\"><span style=\"font-size: 10pt;\">780<\/span><\/td>\n<td style=\"width: 90px; height: 2px;\"><span style=\"font-size: 10pt;\">5866000<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"width: 75.6px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Mean\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"width: 67px; height: 23px;\"><span style=\"font-size: 10pt;\">785.6<\/span><\/td>\n<td style=\"width: 90px; height: 23px;\"><span style=\"font-size: 10pt;\">6084695<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 75.6px; height: 24px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">3rd Qu.<\/span><\/td>\n<td style=\"width: 67px; height: 24px;\"><span style=\"font-size: 10pt;\">970<\/span><\/td>\n<td style=\"width: 90px; height: 24px;\"><span style=\"font-size: 10pt;\">7187250<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"width: 75.6px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">Max.\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"width: 67px; height: 23px;\"><span style=\"font-size: 10pt;\">1560<\/span><\/td>\n<td style=\"width: 90px; height: 23px;\"><span style=\"font-size: 10pt;\">150000000<\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"width: 
75.6px; height: 23px; background-color: #badcf7;\"><span style=\"font-size: 10pt;\">NA&#8217;s\u00a0\u00a0\u00a0<\/span><\/td>\n<td style=\"width: 67px; height: 23px;\"><span style=\"font-size: 10pt;\">\u00a00<\/span><\/td>\n<td style=\"width: 90px; height: 23px;\"><span style=\"font-size: 10pt;\">\u00a00<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The first thing we will do is remove the observations with missing values from this dataset. We will explore later whether removing such observations is a good strategy or not. We will also calculate how many\u00a0observations we lose by removing missing data.<\/p>\n<pre><span style=\"font-size: 12pt;\"><strong><span style=\"color: #ff6600;\">data_without_missing<\/span><\/strong>&lt;-<span style=\"color: #ff6600;\">data<\/span>[<strong><span style=\"color: #0000ff;\">complete.cases<\/span><\/strong>(<span style=\"color: #ff6600;\">data<\/span>),]\r\n\r\n<strong><span style=\"color: #0000ff;\">nrow<\/span><\/strong>(<strong><span style=\"color: #ff6600;\">data<\/span><\/strong>) - <strong><span style=\"color: #0000ff;\">nrow<\/span><\/strong>(<strong><span style=\"color: #ff6600;\">data_without_missing)<\/span><\/strong><\/span><\/pre>\n<p>We have lost 34 observations after the removal of missing data. The data set is now down to 898 observations. This is ~4% of the observations, as Mani pointed out in his comment. 
Also, notice that the missing values of the categorical variable (Parking) are not removed. Can you reason why?<\/p>\n<p>In the next step, we will draw a box plot of house price to identify outliers in the dependent variable.<\/p>\n<pre><span style=\"font-size: 12pt;\"><strong><span style=\"color: #0000ff;\">options(<\/span><\/strong>scipen<strong><span style=\"color: #0000ff;\"> = <\/span><\/strong>100<strong><span style=\"color: #0000ff;\">) <\/span><\/strong><\/span><span style=\"font-size: 10pt; color: #808080;\"># this will print the numbers without scientific notation\r\n<\/span><span style=\"font-size: 12pt;\"><strong><span style=\"color: #0000ff;\">\r\nboxplot<\/span><\/strong>(<strong><span style=\"color: #ff6600;\">data_without_missing<\/span><\/strong><span style=\"color: #800000;\">$House_Price<\/span>, <span style=\"color: #993366;\">col =<\/span> \"Orange\",<span style=\"color: #993366;\">main=<\/span>\"Box Plot of House Price\")\r\n<\/span><\/pre>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg\"><img data-attachment-id=\"8600\" data-permalink=\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/data-preparation-for-regression-aanalysis\/\" data-orig-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg?fit=619%2C416&amp;ssl=1\" data-orig-size=\"619,416\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" 
data-image-title=\"Data Preparation for Regression Aanalysis\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg?fit=300%2C202&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg?fit=619%2C416&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-8600\" src=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg?resize=270%2C182\" alt=\"Data Preparation for Regression Aanalysis\" width=\"270\" height=\"182\" srcset=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg?w=619&amp;ssl=1 619w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg?resize=250%2C168&amp;ssl=1 250w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-Preparation-for-Regression-Aanalysis-e1470635024956.jpeg?resize=300%2C202&amp;ssl=1 300w\" sizes=\"(max-width: 270px) 100vw, 270px\" data-recalc-dims=\"1\" \/><\/a>Clearly, there is an extreme outlier in this dataset. The dot at the top represents that outlier. All the other data-points are packed in the almost flat box at the bottom. (Click on the image to enlarge it)<\/p>\n<p>Let&#8217;s try to look at this extreme\u00a0outlier by fetching this observation.<\/p>\n<pre><span style=\"font-size: 12pt;\"><span style=\"color: #ff6600;\">data_without_missing<\/span>[data_without_missing$House_Price&gt;10^8,]<\/span><\/pre>\n<p>This observation seems to be for a large mansion in some countryside. 
You can see this by comparing its values with the summary data for the other observations: each of its distance, area, and price values is the maximum for its column.<\/p>\n<table style=\"border-color: #000000; height: 46px;\" width=\"565\">\n<tbody>\n<tr>\n<td style=\"width: 70px; background-color: #aacdf0;\">Dist_Taxi<\/td>\n<td style=\"width: 90.8px; background-color: #aacdf0;\">Dist_Market<\/td>\n<td style=\"width: 99.6px; background-color: #aacdf0;\">Dist_Hospital<\/td>\n<td style=\"width: 47.6px; background-color: #aacdf0;\">Carpet<\/td>\n<td style=\"width: 52.4px; background-color: #aacdf0;\">Builtup<\/td>\n<td style=\"width: 58px; background-color: #aacdf0;\">Parking<\/td>\n<td style=\"width: 101.2px; background-color: #aacdf0;\">City_Category<\/td>\n<td style=\"width: 55.6px; background-color: #aacdf0;\">Rainfall<\/td>\n<td style=\"width: 91.6px; background-color: #aacdf0;\">House_Price<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 70px;\">20662<\/td>\n<td style=\"width: 90.8px;\">20945<\/td>\n<td style=\"width: 99.6px;\">23294<\/td>\n<td style=\"width: 47.6px;\">24300<\/td>\n<td style=\"width: 52.4px;\">12730<\/td>\n<td style=\"width: 58px;\">Covered<\/td>\n<td style=\"width: 101.2px;\">CAT B<\/td>\n<td style=\"width: 55.6px;\">1130<\/td>\n<td style=\"width: 91.6px;\">150000000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>There is no point in keeping this super-rich property in the data\u00a0while preparing a model for middle-class housing. Hence, we will remove this observation. The next step is to look at the box plots of all the numerical variables in the model to find unusual observations. 
We will normalize the data to bring it to the same scale.<\/p>\n<pre><span style=\"font-size: 12pt;\"><strong><span style=\"color: #0000ff;\">boxplot<\/span><\/strong>(<strong><span style=\"color: #0000ff;\">scale<\/span><\/strong>(<span style=\"color: #ff6600;\">data_without_missing<\/span>[data_without_missing$House_Price&lt;10^8,c(2:6,9:10)]),<span style=\"color: #993366;\">col=<\/span>\"Orange\")<\/span><\/pre>\n<p><a href=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg\"><img data-attachment-id=\"8611\" data-permalink=\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/data-prepration-for-regression-analysis-1\/\" data-orig-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?fit=1524%2C818&amp;ssl=1\" data-orig-size=\"1524,818\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Data prepration for regression analysis 1\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?fit=300%2C161&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?fit=640%2C344&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-8611 aligncenter\" 
src=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?resize=640%2C344\" alt=\"Data preparation for regression analysis 1\" width=\"640\" height=\"344\" srcset=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?w=1524&amp;ssl=1 1524w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?resize=250%2C134&amp;ssl=1 250w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?resize=300%2C161&amp;ssl=1 300w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?resize=768%2C412&amp;ssl=1 768w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?resize=1024%2C550&amp;ssl=1 1024w, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Data-prepration-for-regression-analysis-1.jpg?w=1280 1280w\" sizes=\"(max-width: 640px) 100vw, 640px\" data-recalc-dims=\"1\" \/><\/a>This data now looks fairly centered, which is desirable for most modeling techniques.<\/p>\n<h4><span style=\"color: #3366ff;\">Sign-off Note<\/span><\/h4>\n<p>In this part, we have primarily spent our time on univariate analysis for data preparation for regression. In the next part, we will explore patterns through bivariate analysis before the development of multivariate models. These are some of the questions you may want to ponder and share your views on before the next part:<\/p>\n<p>1) We removed 34 observations with missing data. What impact can this removal have on our analysis? Could we do something to minimize this impact?<\/p>\n<p>2) Why did we not remove missing values from the categorical variable, i.e. 
Parking?<\/p>\n<p>3) What impact could the extreme outlier, a large mansion, have on the model we are developing for middle-class house prices? Was it a good idea to remove that outlier?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the last post we had started a case study example for regression analysis to help an investment firm make money through property price arbitrage\u00a0(read part 1 :\u00a0regression case study example).\u00a0This is an interactive case study example and required your help to move forward. These are some of your observations from\u00a0exploratory analysis that you shared<\/p>\n<p><a class=\"excerpt-more blog-excerpt\" href=\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/\">Read More&#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":8489,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_newsletter_tier_id":0,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[63,80],"tags":[],"jetpack_publicize_connections":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v17.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Data Preparation for Regression - Pricing Case Study Example (Part 2) &ndash; YOU CANalytics |<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" 
content=\"Data Preparation for Regression - Pricing Case Study Example (Part 2) &ndash; YOU CANalytics |\" \/>\n<meta property=\"og:description\" content=\"In the last post we had started a case study example for regression analysis to help an investment firm make money through property price arbitrage\u00a0(read part 1 :\u00a0regression case study example).\u00a0This is an interactive case study example and required your help to move forward. These are some of your observations from\u00a0exploratory analysis that you sharedRead More...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/\" \/>\n<meta property=\"og:site_name\" content=\"YOU CANalytics |\" \/>\n<meta property=\"article:author\" content=\"roopam\" \/>\n<meta property=\"article:published_time\" content=\"2016-08-08T15:01:24+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2016-10-22T06:15:05+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&#038;ssl=1\" \/>\n\t<meta property=\"og:image:width\" content=\"448\" \/>\n\t<meta property=\"og:image:height\" content=\"528\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Roopam Upadhyay\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Organization\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#organization\",\"name\":\"YOU CANalytics\",\"url\":\"https:\/\/ucanalytics.com\/blogs\/\",\"sameAs\":[],\"logo\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#logo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2015\/11\/YOU-CANalytics-Logo.jpg?fit=607%2C120\",\"contentUrl\":\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2015\/11\/YOU-CANalytics-Logo.jpg?fit=607%2C120\",\"width\":607,\"height\":120,\"caption\":\"YOU CANalytics\"},\"image\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#logo\"}},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#website\",\"url\":\"https:\/\/ucanalytics.com\/blogs\/\",\"name\":\"YOU CANalytics |\",\"description\":\"Explore the Power of Data Science\",\"publisher\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ucanalytics.com\/blogs\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#primaryimage\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1\",\"width\":448,\"height\":528},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#webpage\",\"url\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/\",\"name\":\"Data Preparation for Regression - Pricing Case Study Example (Part 2) &ndash; YOU CANalytics |\",\"isPartOf\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#primaryimage\"},\"datePublished\":\"2016-08-08T15:01:24+00:00\",\"dateModified\":\"2016-10-22T06:15:05+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ucanalytics.com\/blogs\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Data Preparation for Regression &#8211; Pricing Case Study Example (Part 
2)\"}]},{\"@type\":\"Article\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#webpage\"},\"author\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#\/schema\/person\/55961a1cea272ecdf290cb387be069b6\"},\"headline\":\"Data Preparation for Regression &#8211; Pricing Case Study Example (Part 2)\",\"datePublished\":\"2016-08-08T15:01:24+00:00\",\"dateModified\":\"2016-10-22T06:15:05+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#webpage\"},\"wordCount\":1379,\"commentCount\":9,\"publisher\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#organization\"},\"image\":{\"@id\":\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1\",\"articleSection\":[\"Analytics Labs\",\"Pricing Case Study Example\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#respond\"]}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#\/schema\/person\/55961a1cea272ecdf290cb387be069b6\",\"name\":\"Roopam Upadhyay\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/ucanalytics.com\/blogs\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/dd1aa0b0e813f7639800bcfad6a554f1?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/dd1aa0b0e813f7639800bcfad6a554f1?s=96&d=mm&r=g\",\"caption\":\"Roopam Upadhyay\"},\"description\":\"This blog contains my personal views and thoughts on predictive 
Analytics and big data. - Roopam Upadhyay\",\"sameAs\":[\"roopam\"],\"url\":\"https:\/\/ucanalytics.com\/blogs\/author\/roopam\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Data Preparation for Regression - Pricing Case Study Example (Part 2) &ndash; YOU CANalytics |","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/","og_locale":"en_US","og_type":"article","og_title":"Data Preparation for Regression - Pricing Case Study Example (Part 2) &ndash; YOU CANalytics |","og_description":"In the last post we had started a case study example for regression analysis to help an investment firm make money through property price arbitrage\u00a0(read part 1 :\u00a0regression case study example).\u00a0This is an interactive case study example and required your help to move forward. These are some of your observations from\u00a0exploratory analysis that you sharedRead More...","og_url":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/","og_site_name":"YOU CANalytics |","article_author":"roopam","article_published_time":"2016-08-08T15:01:24+00:00","article_modified_time":"2016-10-22T06:15:05+00:00","og_image":[{"width":448,"height":528,"url":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1","type":"image\/jpeg"}],"twitter_misc":{"Written by":"Roopam Upadhyay","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Organization","@id":"https:\/\/ucanalytics.com\/blogs\/#organization","name":"YOU CANalytics","url":"https:\/\/ucanalytics.com\/blogs\/","sameAs":[],"logo":{"@type":"ImageObject","@id":"https:\/\/ucanalytics.com\/blogs\/#logo","inLanguage":"en-US","url":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2015\/11\/YOU-CANalytics-Logo.jpg?fit=607%2C120","contentUrl":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2015\/11\/YOU-CANalytics-Logo.jpg?fit=607%2C120","width":607,"height":120,"caption":"YOU CANalytics"},"image":{"@id":"https:\/\/ucanalytics.com\/blogs\/#logo"}},{"@type":"WebSite","@id":"https:\/\/ucanalytics.com\/blogs\/#website","url":"https:\/\/ucanalytics.com\/blogs\/","name":"YOU CANalytics |","description":"Explore the Power of Data Science","publisher":{"@id":"https:\/\/ucanalytics.com\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ucanalytics.com\/blogs\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"ImageObject","@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#primaryimage","inLanguage":"en-US","url":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1","contentUrl":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1","width":448,"height":528},{"@type":"WebPage","@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#webpage","url":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/","name":"Data Preparation for Regression - Pricing Case Study Example (Part 2) &ndash; YOU CANalytics 
|","isPartOf":{"@id":"https:\/\/ucanalytics.com\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#primaryimage"},"datePublished":"2016-08-08T15:01:24+00:00","dateModified":"2016-10-22T06:15:05+00:00","breadcrumb":{"@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ucanalytics.com\/blogs\/"},{"@type":"ListItem","position":2,"name":"Data Preparation for Regression &#8211; Pricing Case Study Example (Part 2)"}]},{"@type":"Article","@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#article","isPartOf":{"@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#webpage"},"author":{"@id":"https:\/\/ucanalytics.com\/blogs\/#\/schema\/person\/55961a1cea272ecdf290cb387be069b6"},"headline":"Data Preparation for Regression &#8211; Pricing Case Study Example (Part 
2)","datePublished":"2016-08-08T15:01:24+00:00","dateModified":"2016-10-22T06:15:05+00:00","mainEntityOfPage":{"@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#webpage"},"wordCount":1379,"commentCount":9,"publisher":{"@id":"https:\/\/ucanalytics.com\/blogs\/#organization"},"image":{"@id":"https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1","articleSection":["Analytics Labs","Pricing Case Study Example"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ucanalytics.com\/blogs\/data-preparation-regression-pricing-case-study-example-part-2\/#respond"]}]},{"@type":"Person","@id":"https:\/\/ucanalytics.com\/blogs\/#\/schema\/person\/55961a1cea272ecdf290cb387be069b6","name":"Roopam Upadhyay","image":{"@type":"ImageObject","@id":"https:\/\/ucanalytics.com\/blogs\/#personlogo","inLanguage":"en-US","url":"https:\/\/secure.gravatar.com\/avatar\/dd1aa0b0e813f7639800bcfad6a554f1?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/dd1aa0b0e813f7639800bcfad6a554f1?s=96&d=mm&r=g","caption":"Roopam Upadhyay"},"description":"This blog contains my personal views and thoughts on predictive Analytics and big data. 
- Roopam Upadhyay","sameAs":["roopam"],"url":"https:\/\/ucanalytics.com\/blogs\/author\/roopam\/"}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-analysis.jpg?fit=448%2C528&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p3L0jT-2cU","jetpack-related-posts":[{"id":8388,"url":"https:\/\/ucanalytics.com\/blogs\/regression-analysis-pricing-case-study-example-part-1\/","url_meta":{"origin":8488,"position":0},"title":"Regression Analysis &#8211; Pricing Case Study Example (Part 1)","author":"Roopam Upadhyay","date":false,"format":false,"excerpt":"How to figure out if you are paying the right price for the property you are about to purchase? Welcome to a new data science case study example on YOU CANalytics to identify the right housing price. Pricing is a highly important and\u00a0specialized function for any business. A right price\u2026","rel":"","context":"In &quot;Pricing Case Study Example&quot;","block_context":{"text":"Pricing Case Study Example","link":"https:\/\/ucanalytics.com\/blogs\/category\/pricing-case-study-example\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/07\/Connect-the-Dots.jpg?fit=397%2C603&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":8649,"url":"https:\/\/ucanalytics.com\/blogs\/bivariate-analysis-leverage-regression-case-study-example-part-3\/","url_meta":{"origin":8488,"position":1},"title":"Bivariate Analysis &#038; Leverage &#8211; Regression Case Study Example (Part 3)","author":"Roopam Upadhyay","date":false,"format":false,"excerpt":"Welcome back to the\u00a0case study example for regression analysis where you are helping an investment firm make money through property price arbitrage. In the last two parts (Part 1 & Part 2) you started with the univariate analysis to identify patterns in the data including missing data and outliers. 
In\u2026","rel":"","context":"In &quot;Pricing Case Study Example&quot;","block_context":{"text":"Pricing Case Study Example","link":"https:\/\/ucanalytics.com\/blogs\/category\/pricing-case-study-example\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-Case-Study-Example.jpg?fit=1156%2C720&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-Case-Study-Example.jpg?fit=1156%2C720&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-Case-Study-Example.jpg?fit=1156%2C720&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-Case-Study-Example.jpg?fit=1156%2C720&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/Regression-Case-Study-Example.jpg?fit=1156%2C720&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":9018,"url":"https:\/\/ucanalytics.com\/blogs\/step-step-regression-models-pricing-case-study-example-part-5\/","url_meta":{"origin":8488,"position":2},"title":"Step by Step Regression Modeling Using Principal Component Analysis &#8211; Case Study Example (Part 5)","author":"Roopam Upadhyay","date":false,"format":false,"excerpt":"This is a continuation of our case study example to estimate property pricing. 
In this part, you will learn nuances of regression modeling by building three different regression models and compare their results.\u00a0We will also use results of the principal component analysis, discussed in the last part, to develop a\u2026","rel":"","context":"In &quot;Pricing Case Study Example&quot;","block_context":{"text":"Pricing Case Study Example","link":"https:\/\/ucanalytics.com\/blogs\/category\/pricing-case-study-example\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/09\/Sumo-and-Regression-Model.jpg?fit=918%2C384&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/09\/Sumo-and-Regression-Model.jpg?fit=918%2C384&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/09\/Sumo-and-Regression-Model.jpg?fit=918%2C384&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/09\/Sumo-and-Regression-Model.jpg?fit=918%2C384&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":9145,"url":"https:\/\/ucanalytics.com\/blogs\/data-simulation-regression-modeling-pricing-case-study-example-part-6\/","url_meta":{"origin":8488,"position":3},"title":"Data Simulation for Regression Modeling &#8211; Pricing Case Study Example (Part 6)","author":"Roopam Upadhyay","date":false,"format":false,"excerpt":"\"Data! Data! Data!\" he cried impatiently. \"I can't make bricks without clay.\" - Sherlock Holmes This is a continuation of our regression case study example. In the previous parts, we have learned, as Sherlock Holmes says, to make bricks i.e. develop regression models. 
In this part, we will learn how\u2026","rel":"","context":"In &quot;Pricing Case Study Example&quot;","block_context":{"text":"Pricing Case Study Example","link":"https:\/\/ucanalytics.com\/blogs\/category\/pricing-case-study-example\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/10\/Potter-1.jpg?fit=403%2C301&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":8700,"url":"https:\/\/ucanalytics.com\/blogs\/principal-component-analysis-step-step-guide-r-regression-case-study-example-part-4\/","url_meta":{"origin":8488,"position":4},"title":"Principal Component Analysis: Step-by-Step Guide using R- Regression Case Study Example (Part 4)","author":"Roopam Upadhyay","date":false,"format":false,"excerpt":"Principal component analysis is a wonderful technique for data reduction without losing critical information. Yes, you could reduce the size of 2GB data to a few MBs without losing a lot of information. This is like a mp3 version of music. Many, including some experienced data scientists, find principal component\u2026","rel":"","context":"In &quot;Pricing Case Study Example&quot;","block_context":{"text":"Pricing Case Study Example","link":"https:\/\/ucanalytics.com\/blogs\/category\/pricing-case-study-example\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2016\/08\/principal-component-analysis-Death-Profile.jpg?fit=495%2C329&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":5782,"url":"https:\/\/ucanalytics.com\/blogs\/how-effective-is-my-marketing-budget-regression-with-arima-errors-arimax-case-study-example-part-5\/","url_meta":{"origin":8488,"position":5},"title":"How Effective is My Marketing Budget? 
&#8211; Regression with ARIMA Errors, Case Study Example (Part 5)","author":"Roopam Upadhyay","date":false,"format":false,"excerpt":"So far we have covered the following topics in this case study example\u00a0on time series forecasting and ARIMA models: Part 1\u00a0: Introduction to time series modeling & forecasting Part 2: Time series decomposition to decipher patterns and trends before forecasting Part 3: Introduction to ARIMA models for forecasting Part 4:\u2026","rel":"","context":"In &quot;Manufacturing Case Study Example&quot;","block_context":{"text":"Manufacturing Case Study Example","link":"https:\/\/ucanalytics.com\/blogs\/category\/manufacturing-case-study-example\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/ucanalytics.com\/blogs\/wp-content\/uploads\/2015\/07\/rope-walk.jpg?fit=480%2C640&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/posts\/8488"}],"collection":[{"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/comments?post=8488"}],"version-history":[{"count":0,"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/posts\/8488\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/media\/8489"}],"wp:attachment":[{"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/media?parent=8488"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/categories?post=8488"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ucanalytics.com\/blogs\/wp-json\/wp\/v2\/tags?post=8488"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templ
ated":true}]}}