House Price Prediction using a Random Forest Classifier

In this blog post, I will use machine learning and Python to predict house prices. I will use a Random Forest Classifier (in fact, Random Forest regression, since the target is continuous). At the end, I will demonstrate my Random Forest Python model in action!

Data Science is about discovering hidden patterns (laws) in your data. Observing the data is just as important as detecting patterns in it: without examining the data, your pattern detection will be flawed, and without pattern detection, you cannot draw any conclusions from your data. Both steps are therefore needed to draw sound conclusions.

The description of the competition can be found on Kaggle.

The remainder of this notebook is divided into the following chapters:

  1. Study the variables. What is the problem about? What is the target variable and what do the other variables represent?
  2. Variable analysis. We will focus on the target variable and the predictor variables, and try to clean up as many variables as possible.
  3. Machine Learning. Here we will build and test our pattern detection algorithm. Yay!

Let’s give it a try!

# Loading stuff
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

sns.set()
pd.set_option('display.max_columns', 1000)
warnings.filterwarnings('ignore')
%matplotlib inline
# Load the data
df_train = pd.read_csv('../input/train.csv')

# Explore the columns
print(df_train.columns.values)
print('No. variables:', len(df_train.columns.values))
['Id' 'MSSubClass' 'MSZoning' 'LotFrontage' 'LotArea' 'Street' 'Alley'
'LotShape' 'LandContour' 'Utilities' 'LotConfig' 'LandSlope'
'Neighborhood' 'Condition1' 'Condition2' 'BldgType' 'HouseStyle'
'OverallQual' 'OverallCond' 'YearBuilt' 'YearRemodAdd' 'RoofStyle'
'RoofMatl' 'Exterior1st' 'Exterior2nd' 'MasVnrType' 'MasVnrArea'
'ExterQual' 'ExterCond' 'Foundation' 'BsmtQual' 'BsmtCond' 'BsmtExposure'
'BsmtFinType1' 'BsmtFinSF1' 'BsmtFinType2' 'BsmtFinSF2' 'BsmtUnfSF'
'TotalBsmtSF' 'Heating' 'HeatingQC' 'CentralAir' 'Electrical' '1stFlrSF'
'2ndFlrSF' 'LowQualFinSF' 'GrLivArea' 'BsmtFullBath' 'BsmtHalfBath'
'FullBath' 'HalfBath' 'BedroomAbvGr' 'KitchenAbvGr' 'KitchenQual'
'TotRmsAbvGrd' 'Functional' 'Fireplaces' 'FireplaceQu' 'GarageType'
'GarageYrBlt' 'GarageFinish' 'GarageCars' 'GarageArea' 'GarageQual'
'GarageCond' 'PavedDrive' 'WoodDeckSF' 'OpenPorchSF' 'EnclosedPorch'
'3SsnPorch' 'ScreenPorch' 'PoolArea' 'PoolQC' 'Fence' 'MiscFeature'
'MiscVal' 'MoSold' 'YrSold' 'SaleType' 'SaleCondition' 'SalePrice']
No. variables: 81

Study the variables

So there are roughly 80 variables. That is a lot, and we probably do not need most of them. The ‘SalePrice’ variable is our target variable: we would like to predict it from the other variables. But what are those other variables and what are their types? At this point, I will not throw away any variable unless it gives no information at all.
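
To get a first idea of what the other variables are and their types, one option is to inspect the dtypes directly. A minimal sketch (the numeric/categorical split is only a rough indication, since some numeric columns are really ordinal codes):

# Quick overview of the column types
print(df_train.dtypes.value_counts())

# Split into numeric and categorical (object) columns
numeric_cols = df_train.select_dtypes(include=[np.number]).columns
object_cols = df_train.select_dtypes(include=['object']).columns
print('Numeric variables:', len(numeric_cols))
print('Categorical (object) variables:', len(object_cols))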

Clean missing data

Now let’s take a look at which variables contain a lot of NaNs. We will drop these variables, since columns that are mostly missing contribute little to the predictability of the target variable.

num_missing = df_train.isnull().sum()
percent = num_missing / df_train.isnull().count()

df_missing = pd.concat([num_missing, percent], axis=1, keys=['MissingValues', 'Fraction'])
df_missing = df_missing.sort_values('Fraction', ascending=False)
df_missing[df_missing['MissingValues'] > 0]
              MissingValues  Fraction
PoolQC                 1453  0.995205
MiscFeature            1406  0.963014
Alley                  1369  0.937671
Fence                  1179  0.807534
FireplaceQu             690  0.472603
LotFrontage             259  0.177397
GarageYrBlt              81  0.055479
GarageCond               81  0.055479
GarageType               81  0.055479
GarageFinish             81  0.055479
GarageQual               81  0.055479
BsmtFinType2             38  0.026027
BsmtExposure             38  0.026027
BsmtQual                 37  0.025342
BsmtCond                 37  0.025342
BsmtFinType1             37  0.025342
MasVnrArea                8  0.005479
MasVnrType                8  0.005479
Electrical                1  0.000685
To simplify the problem, we will throw away every variable that has missing values. This will make our prediction a bit worse, but it also ensures that we do not have to make any assumptions about how to fill in these variables (which could be risky).
variables_to_keep = df_missing[df_missing['MissingValues'] == 0].index
df_train = df_train[variables_to_keep]
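
As a side note, instead of dropping these columns we could also have imputed the missing values. Below is a minimal sketch of what that could look like with scikit-learn's SimpleImputer; it is not used in the rest of this post, and the chosen strategies (median for numeric, most frequent for categorical) are just one possible assumption:

# Alternative (not used below): impute the missing values instead of dropping the columns
from sklearn.impute import SimpleImputer

df_imputed = pd.read_csv('../input/train.csv')
num_cols = df_imputed.select_dtypes(include=[np.number]).columns
cat_cols = df_imputed.select_dtypes(include=['object']).columns
df_imputed[num_cols] = SimpleImputer(strategy='median').fit_transform(df_imputed[num_cols])
df_imputed[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df_imputed[cat_cols])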

Variable Analysis

Here we will do a quick analysis of the variables and the underlying relations. Let’s build a correlation matrix.

# Build the correlation matrix
matrix = df_train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(matrix, vmax=0.7, square=True)
[Figure: correlation heatmap of all remaining variables]

Now we can zoom in on the SalePrice and determine which variables are strongly correlated to it.

interesting_variables = matrix['SalePrice'].sort_values(ascending=False)
# Keep only variables that correlate strongly with SalePrice (|corr| >= 0.6), then drop SalePrice itself
interesting_variables = interesting_variables[abs(interesting_variables) >= 0.6]
interesting_variables = interesting_variables[interesting_variables.index != 'SalePrice']
interesting_variables
OverallQual    0.790982
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431
TotalBsmtSF    0.613581
1stFlrSF       0.605852
Name: SalePrice, dtype: float64

Nice! So apparently, the overall quality is the most predictive variable so far. That makes sense, but it is also quite vague: what exactly is meant by this score? Let’s zoom in on this most predictive variable.

values = np.sort(df_train['OverallQual'].unique())
print('Unique values of "OverallQual":', values)
Unique values of "OverallQual": [ 1  2  3  4  5  6  7  8  9 10]

So apparently, we have a semi-categorical variable “OverallQual” with a score from 1 to 10. According to the description of the variables, 1 means Very Poor, 5 means Average and 10 means Very Excellent. Let’s plot the relationship between “OverallQual” and “SalePrice”:

data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
data.plot.scatter(x='OverallQual', y='SalePrice')
[Figure: scatter plot of OverallQual against SalePrice]
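
Since OverallQual only takes ten discrete values, a box plot per quality score arguably shows the same trend even more clearly than a scatter plot. A small optional sketch:

# Optional: distribution of SalePrice per OverallQual score
f, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=data)
plt.show()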

Okay, the trend is clearly visible. Now let’s analyse all of our variables-of-interest.

cols = interesting_variables.index.values.tolist() + ['SalePrice']
sns.pairplot(df_train[cols], height=2.5)
plt.show()

[Figure: pair plot of the variables of interest and SalePrice]

This plot reveals a lot. It gives clues about the types of the different variables. There are a few discrete variables (OverallQual, GarageCars) and some continuous variables (GrLivArea, GarageArea, TotalBsmtSF, 1stFlrSF).
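
One way to back up this impression is to count the number of unique values per variable: a low count points to a discrete (or ordinal) variable. A quick sketch:

# Count unique values per variable of interest; low counts indicate discrete variables
print(df_train[cols].nunique().sort_values())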

We will now zoom in on the heatmap we produced earlier by only showing the variables of interest. This could potentially reveal some underlying relations!

# Build the correlation matrix
matrix = df_train[cols].corr()
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(matrix, vmax=1.0, square=True)
[Figure: correlation heatmap of the variables of interest]

I definitely see some clusters here! GarageCars and GarageArea are strongly correlated, which makes a lot of sense. Furthermore, TotalBsmtSF and 1stFlrSF are correlated as well, which is also intuitive. And since we intended to use only variables that are correlated with SalePrice, that is clearly visible in this plot too. Great! Now we will start with some Machine Learning and try to predict the SalePrice!

Machine Learning (Random Forest regression)

In this chapter, I will use a Random Forest classifier. In fact, it is Random Forest regression, since the target variable is a continuous real number. I will split the training set into my own train and test set, since I am not interested in running the analysis on Kaggle’s (unlabelled) test set. Let’s find out how well the model works!

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

pred_vars = [v for v in interesting_variables.index.values if v != 'SalePrice']
target_var = 'SalePrice'

X = df_train[pred_vars]
y = df_train[target_var]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Build a plot
plt.scatter(y_pred, y_test)
plt.xlabel('Prediction')
plt.ylabel('Real value')

# Now add the perfect prediction line
diagonal = np.linspace(0, np.max(y_test), 100)
plt.plot(diagonal, diagonal, '-r')
plt.show()
[Figure: predicted vs. real SalePrice, with the perfect-prediction line in red]

That is great! The red line shows the perfect predictions: if a prediction equalled the real value, its point would lie exactly on the red line. You can see that there are some deviations and a few outliers, but that is mainly the case for extremely high prices. There are also some outliers in the low range, and it would be interesting to find out what causes them. To conclude, we can compute two error metrics, the mean absolute error (MAE) and the mean squared log error (MSLE):

from sklearn.metrics import mean_squared_log_error, mean_absolute_error

print('MAE:\t$%.2f' % mean_absolute_error(y_test, y_pred))
print('MSLE:\t%.5f' % mean_squared_log_error(y_test, y_pred))
MAE:	$23552.62
MSLE: 0.03613

A mean deviation of roughly $23K, which is mainly due to the extreme outliers: not too bad for a quick try! This is definitely something to keep in mind when buying a house!
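
As a follow-up: since SalePrice is a continuous variable, scikit-learn's RandomForestRegressor is arguably the more natural estimator than the classifier used above. A minimal sketch on the same train/test split (I have not re-run the metrics, so I make no claim about how much this changes the scores):

# Alternative: a proper regression model instead of the classifier used above
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
y_pred_reg = reg.predict(X_test)

print('MAE:\t$%.2f' % mean_absolute_error(y_test, y_pred_reg))
print('MSLE:\t%.5f' % mean_squared_log_error(y_test, y_pred_reg))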
