House Price Prediction using a Random Forest Classifier

In this blog post, I will use machine learning and Python to predict house prices. I will use a Random Forest Classifier (in fact, Random Forest regression, since the target is continuous). At the end, I will demonstrate my Random Forest Python model in action!

Data Science is about discovering hidden patterns (laws) in your data. Observing the data is just as important as detecting patterns in it: without examining the data, your pattern detection will be flawed, and without pattern detection, you cannot draw any conclusions from your data. Both steps are therefore needed to draw sound conclusions.

The description of the competition can be found on Kaggle.

The remainder of this notebook is divided into the following chapters:

  1. Study the variables. What is the problem about? What is the target variable and what do the other variables represent?
  2. Variable analysis. We will focus on the target variable and the predictor variables, and try to clean up as many variables as possible.
  3. Machine Learning. Here we will build and test our pattern detection algorithm. Yay!

Let’s give it a try!

# Loading stuff
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

sns.set()
pd.set_option('display.max_columns', 1000)
warnings.filterwarnings('ignore')
%matplotlib inline
# Load the data
df_train = pd.read_csv('../input/train.csv')

# Explore the columns
print(df_train.columns.values)
print('No. variables:', len(df_train.columns.values))
['Id' 'MSSubClass' 'MSZoning' 'LotFrontage' 'LotArea' 'Street' 'Alley'
'LotShape' 'LandContour' 'Utilities' 'LotConfig' 'LandSlope'
'Neighborhood' 'Condition1' 'Condition2' 'BldgType' 'HouseStyle'
'OverallQual' 'OverallCond' 'YearBuilt' 'YearRemodAdd' 'RoofStyle'
'RoofMatl' 'Exterior1st' 'Exterior2nd' 'MasVnrType' 'MasVnrArea'
'ExterQual' 'ExterCond' 'Foundation' 'BsmtQual' 'BsmtCond' 'BsmtExposure'
'BsmtFinType1' 'BsmtFinSF1' 'BsmtFinType2' 'BsmtFinSF2' 'BsmtUnfSF'
'TotalBsmtSF' 'Heating' 'HeatingQC' 'CentralAir' 'Electrical' '1stFlrSF'
'2ndFlrSF' 'LowQualFinSF' 'GrLivArea' 'BsmtFullBath' 'BsmtHalfBath'
'FullBath' 'HalfBath' 'BedroomAbvGr' 'KitchenAbvGr' 'KitchenQual'
'TotRmsAbvGrd' 'Functional' 'Fireplaces' 'FireplaceQu' 'GarageType'
'GarageYrBlt' 'GarageFinish' 'GarageCars' 'GarageArea' 'GarageQual'
'GarageCond' 'PavedDrive' 'WoodDeckSF' 'OpenPorchSF' 'EnclosedPorch'
'3SsnPorch' 'ScreenPorch' 'PoolArea' 'PoolQC' 'Fence' 'MiscFeature'
'MiscVal' 'MoSold' 'YrSold' 'SaleType' 'SaleCondition' 'SalePrice']
No. variables: 81

Study the variables

So there are roughly 80 variables. That is a lot, and we probably do not need most of them. The ‘SalePrice’ variable is our target variable: we would like to predict it from the other variables. But what are those other variables and what are their types? At this point, I will not throw away any variable unless it gives no information at all.
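
To get a first idea of what the other variables are and their types, one option is to inspect the dtypes directly. A minimal sketch (the numeric/categorical split is only a rough indication, since some numeric columns are really ordinal codes):

# Quick overview of the column types
print(df_train.dtypes.value_counts())

# Split into numeric and categorical (object) columns
numeric_cols = df_train.select_dtypes(include=[np.number]).columns
object_cols = df_train.select_dtypes(include=['object']).columns
print('Numeric variables:', len(numeric_cols))
print('Categorical (object) variables:', len(object_cols))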

Clean missing data

Now let’s take a look at which variables contain a lot of NaNs. We will drop these variables, since columns that are mostly missing contribute little to the predictability of the target variable.

num_missing = df_train.isnull().sum()
percent = num_missing / df_train.isnull().count()

df_missing = pd.concat([num_missing, percent], axis=1, keys=['MissingValues', 'Fraction'])
df_missing = df_missing.sort_values('Fraction', ascending=False)
df_missing[df_missing['MissingValues'] > 0]
              MissingValues  Fraction
PoolQC                 1453  0.995205
MiscFeature            1406  0.963014
Alley                  1369  0.937671
Fence                  1179  0.807534
FireplaceQu             690  0.472603
LotFrontage             259  0.177397
GarageYrBlt              81  0.055479
GarageCond               81  0.055479
GarageType               81  0.055479
GarageFinish             81  0.055479
GarageQual               81  0.055479
BsmtFinType2             38  0.026027
BsmtExposure             38  0.026027
BsmtQual                 37  0.025342
BsmtCond                 37  0.025342
BsmtFinType1             37  0.025342
MasVnrArea                8  0.005479
MasVnrType                8  0.005479
Electrical                1  0.000685
To simplify the problem, we will throw away every variable that has missing values. This will make our prediction a bit worse, but it also ensures that we do not have to make any assumptions about how to fill in these variables (which could be risky).
variables_to_keep = df_missing[df_missing['MissingValues'] == 0].index
df_train = df_train[variables_to_keep]
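
As a side note, instead of dropping these columns we could also have imputed the missing values. Below is a minimal sketch of what that could look like with scikit-learn's SimpleImputer; it is not used in the rest of this post, and the chosen strategies (median for numeric, most frequent for categorical) are just one possible assumption:

# Alternative (not used below): impute the missing values instead of dropping the columns
from sklearn.impute import SimpleImputer

df_imputed = pd.read_csv('../input/train.csv')
num_cols = df_imputed.select_dtypes(include=[np.number]).columns
cat_cols = df_imputed.select_dtypes(include=['object']).columns
df_imputed[num_cols] = SimpleImputer(strategy='median').fit_transform(df_imputed[num_cols])
df_imputed[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df_imputed[cat_cols])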

Variable Analysis

Here we will do a quick analysis of the variables and the underlying relations. Let’s build a correlation matrix.

# Build the correlation matrix
matrix = df_train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(matrix, vmax=0.7, square=True)
[Figure: correlation heatmap of all remaining variables]

Now we can zoom in on the SalePrice and determine which variables are strongly correlated to it.

interesting_variables = matrix['SalePrice'].sort_values(ascending=False)
# Keep only variables that correlate strongly with SalePrice (|corr| >= 0.6), then drop SalePrice itself
interesting_variables = interesting_variables[abs(interesting_variables) >= 0.6]
interesting_variables = interesting_variables[interesting_variables.index != 'SalePrice']
interesting_variables
OverallQual    0.790982
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431
TotalBsmtSF    0.613581
1stFlrSF       0.605852
Name: SalePrice, dtype: float64

Nice! So apparently, the overall quality is the most predictive variable so far. That makes sense, but it is also quite vague: what exactly is meant by this score? Let’s zoom in on this most predictive variable.

values = np.sort(df_train['OverallQual'].unique())
print('Unique values of "OverallQual":', values)
Unique values of "OverallQual": [ 1  2  3  4  5  6  7  8  9 10]

So apparently, we have a semi-categorical variable “OverallQual” with a score from 1 to 10. According to the description of the variables, 1 means Very Poor, 5 means Average and 10 means Very Excellent. Let’s plot the relationship between “OverallQual” and “SalePrice”:

data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
data.plot.scatter(x='OverallQual', y='SalePrice')
[Figure: scatter plot of OverallQual against SalePrice]
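
Since OverallQual only takes ten discrete values, a box plot per quality score arguably shows the same trend even more clearly than a scatter plot. A small optional sketch:

# Optional: distribution of SalePrice per OverallQual score
f, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=data)
plt.show()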

Okay, the trend is clearly visible. Now let’s analyse all of our variables-of-interest.

cols = interesting_variables.index.values.tolist() + ['SalePrice']
sns.pairplot(df_train[cols], height=2.5)
plt.show()

[Figure: pair plot of the variables of interest and SalePrice]

This plot reveals a lot. It gives clues about the types of the different variables. There are a few discrete variables (OverallQual, GarageCars) and some continuous variables (GrLivArea, GarageArea, TotalBsmtSF, 1stFlrSF).
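
One way to back up this impression is to count the number of unique values per variable: a low count points to a discrete (or ordinal) variable. A quick sketch:

# Count unique values per variable of interest; low counts indicate discrete variables
print(df_train[cols].nunique().sort_values())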

We will now zoom in on the heatmap we produced earlier by only showing the variables of interest. This could potentially reveal some underlying relations!

# Build the correlation matrix
matrix = df_train[cols].corr()
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(matrix, vmax=1.0, square=True)
[Figure: correlation heatmap of the variables of interest]

I definitely see some clusters here! GarageCars and GarageArea are strongly correlated, which makes a lot of sense. Furthermore, TotalBsmtSF and 1stFlrSF are correlated as well, which is also intuitive. And since we intended to use only variables that are correlated with SalePrice, that is clearly visible in this plot too. Great! Now we will start with some Machine Learning and try to predict the SalePrice!

Machine Learning (Random Forest regression)

In this chapter, I will use a Random Forest classifier. In fact, it is Random Forest regression, since the target variable is a continuous real number. I will split the training set into my own train and test set, since I am not interested in running the analysis on Kaggle’s (unlabelled) test set. Let’s find out how well the model works!

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

pred_vars = [v for v in interesting_variables.index.values if v != 'SalePrice']
target_var = 'SalePrice'

X = df_train[pred_vars]
y = df_train[target_var]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Build a plot
plt.scatter(y_pred, y_test)
plt.xlabel('Prediction')
plt.ylabel('Real value')

# Now add the perfect prediction line
diagonal = np.linspace(0, np.max(y_test), 100)
plt.plot(diagonal, diagonal, '-r')
plt.show()
[Figure: predicted vs. real SalePrice, with the perfect-prediction line in red]

That is great! The red line shows the perfect predictions: if a prediction equalled the real value, its point would lie exactly on the red line. You can see that there are some deviations and a few outliers, but that is mainly the case for extremely high prices. There are also some outliers in the low range, and it would be interesting to find out what causes them. To conclude, we can compute two error metrics, the mean absolute error (MAE) and the mean squared log error (MSLE):

from sklearn.metrics import mean_squared_log_error, mean_absolute_error

print('MAE:\t$%.2f' % mean_absolute_error(y_test, y_pred))
print('MSLE:\t%.5f' % mean_squared_log_error(y_test, y_pred))
MAE:	$23552.62
MSLE: 0.03613

A mean deviation of roughly $23K, which is mainly due to the extreme outliers: not too bad for a quick try! This is definitely something to keep in mind when buying a house!
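
As a follow-up: since SalePrice is a continuous variable, scikit-learn's RandomForestRegressor is arguably the more natural estimator than the classifier used above. A minimal sketch on the same train/test split (I have not re-run the metrics, so I make no claim about how much this changes the scores):

# Alternative: a proper regression model instead of the classifier used above
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
y_pred_reg = reg.predict(X_test)

print('MAE:\t$%.2f' % mean_absolute_error(y_test, y_pred_reg))
print('MSLE:\t%.5f' % mean_squared_log_error(y_test, y_pred_reg))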
