House Price Prediction using a Random Forest Classifier
In this blog post, I will use machine learning and Python for predicting house prices. I will use a Random Forest Classifier (in fact Random Forest regression). In the end, I will demonstrate my Random Forest Python algorithm!
Data Science is about discovering hidden patterns (laws) in your data. Observing your data is as important as discovering patterns in your data. Without examining the data, your pattern detection will be imperfect and without pattern detection, you cannot draw any conclusions about your data. Therefore, both ideas are needed for drawing conclusions.
The description of the competition can be found on Kaggle
The remainder of this notebook is divided into the following chapters:
- Study the variables. What is the problem about? What are the target variables and what do the variables represent?
- Variable analysis. We will focus on the target variable and predicting variables and try to clean up as many variables as possible.
- Machine Learning. Here we will build and test our pattern detection algorithm. Yay!
Let’s give it a try!
['Id' 'MSSubClass' 'MSZoning' 'LotFrontage' 'LotArea' 'Street' 'Alley'
'LotShape' 'LandContour' 'Utilities' 'LotConfig' 'LandSlope'
'Neighborhood' 'Condition1' 'Condition2' 'BldgType' 'HouseStyle'
'OverallQual' 'OverallCond' 'YearBuilt' 'YearRemodAdd' 'RoofStyle'
'RoofMatl' 'Exterior1st' 'Exterior2nd' 'MasVnrType' 'MasVnrArea'
'ExterQual' 'ExterCond' 'Foundation' 'BsmtQual' 'BsmtCond' 'BsmtExposure'
'BsmtFinType1' 'BsmtFinSF1' 'BsmtFinType2' 'BsmtFinSF2' 'BsmtUnfSF'
'TotalBsmtSF' 'Heating' 'HeatingQC' 'CentralAir' 'Electrical' '1stFlrSF'
'2ndFlrSF' 'LowQualFinSF' 'GrLivArea' 'BsmtFullBath' 'BsmtHalfBath'
'FullBath' 'HalfBath' 'BedroomAbvGr' 'KitchenAbvGr' 'KitchenQual'
'TotRmsAbvGrd' 'Functional' 'Fireplaces' 'FireplaceQu' 'GarageType'
'GarageYrBlt' 'GarageFinish' 'GarageCars' 'GarageArea' 'GarageQual'
'GarageCond' 'PavedDrive' 'WoodDeckSF' 'OpenPorchSF' 'EnclosedPorch'
'3SsnPorch' 'ScreenPorch' 'PoolArea' 'PoolQC' 'Fence' 'MiscFeature'
'MiscVal' 'MoSold' 'YrSold' 'SaleType' 'SaleCondition' 'SalePrice']
No. variables: 81
Study the variables
So there are roughly 80 variables. That is a lot and we probably don’t need most of them. The ‘SalePrice’ variable is our target variable. We would like to predict this variable given the other variables. What are the other variables and what are their types? At this point, I will not throw away any variable unless it does not give any information.
Clean missing data
Now let’s take a look at which variables contain lots of NaNs. We will dump these variables since they do not contribute a lot to the predictability of the target variable.
Variable Analysis
Here we will do a quick analysis of the variables and the underlying relations. Let’s build a correlation matrix.

Now we can zoom in on the SalePrice and determine which variables are strongly correlated to it.
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
Name: SalePrice, dtype: float64
Nice! So apparently, the overall quality is the most predicting variable so far. Which makes sense, but it is also quite vague. What is exactly meant by this score? Let’s zoom in on the most predicting variable.
Unique values of "OverallQual": [ 1 2 3 4 5 6 7 8 9 10]
So apparently, we have a semi-categorical variable “OverallQual” with a score from 1 to 10. According to the description of the variables, 1 means Very Poor, 5 means Average and 10 means Very Excellent. Let’s plot the relationship between “OverallQual” and “SalePrice”:

Okay, the trend is clearly visible. Now let’s analyse all of our variables-of-interest.

This plot reveals a lot. It gives clues about the types of the different variables. There are a few discrete variables (OverallQual, GarageCars) and some continuous variables (GrLivArea, GarageArea, TotalBsmtSF, 1stFlrSF).
We will now zoom in on the heatmap we produced earlier by only showing the variables of interest. This could potentially reveal some underlying relations!

I see definitely some clusters here! GarageCars and GarageArea are strongly correlated, which also makes a lot of sense. Furthermore, TotalBsmtSF and 1stFlrSF are also correlated which also makes sense. And we intended to only use variables which were correlated to SalePrice which is also visible in this plot. Great! Now we will start with some Machine Learning and try to predict the SalePrice!
Machine Learning (Random Forest regression)
In this chapter, I will use a Random Forest classifier. In fact, it is Random Forest regression since the target variable is a continuous real number. I will split the train set into a train and a test set since I am not interested in running the analysis on the test set. Let’s find out how well the models work!

That is great! The red line shows the perfect predictions. If the prediction would equal the real value, then all points would lie on the red line. Here you can see that there are some deviations and a few outliers, but that is mainly the case for prices which are extremely high. There are some outliers in the low range and it would be interesting to find out what is the cause of these outliers. To conclude, we can compute the RMS error (Root Mean Squared error):
MAE: $23552.62
MSLE: 0.03613
A deviation of $23K,- which is mainly due to the extreme outliers, not too bad for a quick try! This is definitely something to keep in mind when buying a house!
No comments:
Post a Comment