California Median Housing Price Prediction via Linear Regression & More

Data collected from the 1990 US Census, later organized into nine continuous variables, offers a set of (presumably) highly correlated features, each of which may reveal properties of interest to people studying housing markets. In this article, we investigate this dataset and test the predictive capabilities of linear regression models. We begin with an overview of the properties exhibited by the dataset, then derive one of the most popular low-complexity linear models, and finish by comparing its performance with other popular machine learning algorithms.

The dataset is composed of the following nine variables. Note that a ‘block group’ on average includes 1425.5 individuals living in a geographically compact area.

MedInc :     Median income in block group                  -  0
HouseAge:    Median house age in block group               -  1
AveRooms:    Average number of rooms per household         -  2
AveBedrms:   Average number of bedrooms per household      -  3
Pop:         Block group population                        -  4
AveOccup:    Average number of household members           -  5
Latitude:    Block group latitude                          -  6
Longitude:   Block group longitude                         -  7
HousePrice:  Median House Price of Block Group (TARGET)    -  8

In this article we will restrict ourselves to the given variables. It is worth noting, however, that such datasets can often be improved after some exploration and analysis, so in the following we give a few examples of ways to improve the dataset, and thereby our model. To uncover potential hidden insights, we begin by plotting the two geographical variables against the target variable, the median house price.
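A minimal sketch of how such a plot could be produced, assuming the data is loaded through scikit-learn's fetch_california_housing helper (the variables X and y are reused in the rest of the article; the exact loading step is an assumption, not shown in the original post):

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the 1990 census data: X holds the eight features, y the median house
# price per block group (in units of $100,000)
X, y = fetch_california_housing(return_X_y=True)

# Scatter longitude (column 7) against latitude (column 6), colored by price
plt.figure(figsize=(7, 6))
sc = plt.scatter(X[:, 7], X[:, 6], c=y, s=4, cmap="viridis")
plt.colorbar(sc, label="Median house price")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Median house price by block group location")
plt.show()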

We can clearly see the outline of the state of California, along with some other interesting properties not immediately offered by the raw data, such as a seemingly strong correlation between housing prices and coastal and/or metropolitan proximity (especially in the regions around San Francisco and Los Angeles). It would therefore be advantageous to include additional variables which explicitly capture this information. One way to do this would be to generate a variable whose magnitude reflects such proximities and concatenate it onto our input matrix, as sketched below.
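As a rough sketch of this idea (not part of the original pipeline), one could measure the distance from each block group to the nearest of a few hand-picked coastal metropolitan coordinates and append it as an extra column. The city list, the plain Euclidean distance in degrees, and the helper name add_metro_proximity are all illustrative assumptions:

import numpy as np

# Approximate (latitude, longitude) of two coastal metropolitan areas
CITIES = np.array([
    [37.77, -122.42],   # San Francisco
    [34.05, -118.24],   # Los Angeles
])

def add_metro_proximity(X, lat_col=6, lon_col=7):
    """Append the distance (in degrees) to the nearest listed city as a new column."""
    coords = X[:, [lat_col, lon_col]]                       # (n_samples, 2)
    diffs = coords[:, None, :] - CITIES[None, :, :]         # (n_samples, n_cities, 2)
    dists = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)   # nearest-city distance
    return np.column_stack([X, dists])

X_augmented = add_metro_proximity(X)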

Just as adding correlated variables to our dataset can generally improve a model, removing potentially uncorrelated features can also improve the quality of our data and model. To investigate whether the given data contains uninformative features, we set up a nine-by-nine correlation matrix and look at the row or column corresponding to our target variable.
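One way to compute this matrix, assuming the eight features in X and the target in y from before:

import numpy as np

# Stack the eight features and the target into one (n_samples, 9) array,
# then compute the 9x9 Pearson correlation matrix
data = np.column_stack([X, y])
corr = np.corrcoef(data, rowvar=False)   # rowvar=False: columns are the variables

# Correlation of every variable with the target (last row/column)
print(np.round(corr[-1], 3))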

Looking at the last row or column, we see that variable 4, corresponding to block group population, has the smallest correlation with our target variable, rendering it of little use to the model. By contrast, variable 0, corresponding to median income, exhibits the highest correlation, larger by a factor of more than 4 than the next most correlated variable. Not only is it intuitive for higher median income in a block group to be associated with higher house prices, but given this demonstrated correlation we can also expect this variable to carry the largest weight when predicting housing prices.

Generally, to make machine learning models more robust, we would like all features to be on roughly the same scale. This way, an algorithm can learn how much each variable ought to influence the model without being biased by the orders of magnitude present in the raw data (especially when weakly correlated variables come in larger magnitudes than strongly correlated ones). A simple and common way to standardize the data is, for each feature, to compute the mean and standard deviation over all samples, subtract the mean from every datapoint, and divide by the standard deviation. This gives every feature zero mean and unit variance, and can be done efficiently by importing and executing the following function.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, so no test-set statistics leak
# into the preprocessing (the train/test split itself is shown further below)
scaler = StandardScaler().fit(X_train)

def preprocessor(X):
    # Work on a copy so the original array is left untouched
    A = np.copy(X)
    A = scaler.transform(A)
    return A

X_train_standardized = preprocessor(X_train)

After considering such improvements to our data, we initialize our regression model. The goal of linear regression is to find, within a family of (possibly nonlinear) working functions, the function that most closely matches our data via a (to be derived) projection operator. Once we linearly combine our family of functions with a set of weights, the resulting prediction is guaranteed to be the closest possible value to the unknown target given our constraints (in our case, the span of the working functions applied to the input data). An important first step is to split the data into a training and a testing set: the training set is used to find the linear weights, and the testing set evaluates the performance of those weights. Because the dataset is large, we can afford a relatively large test set, using a training-test split of 10:90, together with a fixed random seed so that we can reliably compare this model to other common methods later on.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, random_state=1)

Although I mentioned that our family of functions could be nonlinear, we will begin with a linear family. This means our input variables labeled {0, 1, 2, 3, 4, 5, 6, 7} (excluding target variable 8) undergo an affine transformation by eight weights plus one offset term. In the ideal case, the output of this function yields our target value once we have found the optimal set of weights. It is important to note that "optimal" means nothing without reference to some metric; the metric we use is the commonly utilized mean squared error. In the following derivation, we define the mean squared error and the optimal weights as the argument minimizing that error, then find the optimal weights by taking the derivative of this metric and setting it equal to zero.
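In standard notation, with X the n-by-9 training design matrix (the eight features plus a column of ones for the offset term), y the vector of observed median house prices, and w the weight vector, the derivation reads:

    \mathrm{MSE}(w) = \frac{1}{n}\,\lVert y - Xw \rVert_2^2,
    \qquad
    w^{*} = \arg\min_{w} \mathrm{MSE}(w)

    \nabla_{w}\,\mathrm{MSE}(w) = -\frac{2}{n}\,X^{\top}(y - Xw) = 0
    \;\Longrightarrow\;
    X^{\top}X\,w^{*} = X^{\top}y
    \;\Longrightarrow\;
    w^{*} = \left(X^{\top}X\right)^{-1}X^{\top}y = X^{+}y

where X^{+} denotes the Moore-Penrose pseudoinverse discussed next.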

We find that our optimal weights take the form of the Moore-Penrose pseudoinverse, introduced by E. H. Moore in 1920 (and later rediscovered by Roger Penrose in 1955), a widely known generalization of the matrix inverse that is not restricted to invertible transformations. This closed-form solution for the optimal weights is a major reason linear regression is so popular: no iterative optimization or other higher-complexity algorithm is required. Even if we choose a nonlinear family of functions, the solution remains essentially the same. The only modification required is to preprocess the feature matrix X with our predefined working functions, organize the results into a new matrix, and replace X with that matrix in the closed-form solution.
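As a brief illustration of this last point, a polynomial feature map could be applied before reusing the same closed-form solution. The use of PolynomialFeatures with degree 2 here is an assumption made purely for illustration, not part of the original pipeline:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Map the standardized training features to all degree-2 monomials
# (include_bias=True adds the constant/offset column automatically)
phi = PolynomialFeatures(degree=2, include_bias=True)
Phi_train = phi.fit_transform(X_train_standardized)

# Same closed-form solution as before, with X replaced by the transformed matrix
w_opt_poly = np.linalg.pinv(Phi_train) @ y_train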

Now we have all the ingredients required to program our pipeline. To train our model and then apply it to our test set, we will define the following four functions.

def fit_least_squares(X_train, y_train)
def fit_ridge(X_train, y_train, regularization_strength)
def predict_linear_model(X_test, w_opt)
def mean_squared_error(y_predicted, y_true)

The first function applies the derived pseudoinverse to our training data and returns the optimal set of weights (this is the "training" step of the procedure). The second does the same, but for a loss function with an added squared L2-regularization term, which discourages the weights from growing too large, a behavior commonly seen in overfitted models. The third function uses the returned optimal weights to apply an affine transformation to the test set and produce predicted target values. The last function takes the predicted and true median house prices of the test set and measures the mean squared error of the model.
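Sketch implementations of these four functions, under the convention that a column of ones is appended to X so the last weight acts as the offset; the helper add_offset_column and the exact ridge formulation are illustrative assumptions and may differ from the original notebook:

import numpy as np

def add_offset_column(X):
    """Append a column of ones so the last weight acts as the offset term."""
    return np.column_stack([X, np.ones(X.shape[0])])

def fit_least_squares(X_train, y_train):
    """Closed-form least-squares weights via the Moore-Penrose pseudoinverse."""
    A = add_offset_column(X_train)
    return np.linalg.pinv(A) @ y_train

def fit_ridge(X_train, y_train, regularization_strength):
    """Closed-form ridge weights: (A^T A + lambda I)^{-1} A^T y."""
    A = add_offset_column(X_train)
    n_features = A.shape[1]
    return np.linalg.solve(
        A.T @ A + regularization_strength * np.eye(n_features),
        A.T @ y_train,
    )

def predict_linear_model(X_test, w_opt):
    """Apply the learned affine transformation to new data."""
    return add_offset_column(X_test) @ w_opt

def mean_squared_error(y_predicted, y_true):
    """Average squared difference between predictions and ground truth."""
    return np.mean((y_predicted - y_true) ** 2)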

Before executing our chain of functions, we would like to compare both the linear regression and ridge regression models with other commonly used machine learning techniques. To do so, we run these two regressors alongside an ensemble method and a k-nearest-neighbors regressor. The ensemble method is a random forest, a collection of trained decision trees that jointly decide the most likely median house price given the features; the k-NN regressor simply looks at the 'k' feature vectors closest to the test vector under consideration and predicts the median house price as the average of the prices assigned to those 'k' vectors. The following estimators were imported from scikit-learn.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# k-nearest-neighbors baseline: average the prices of the 4 nearest training points
knn = KNeighborsRegressor(n_neighbors=4).fit(X_train, y_train)
MSE_knn = mean_squared_error(y_test, knn.predict(X_test))

# Random forest baseline: an ensemble of depth-limited decision trees
rfr = RandomForestRegressor(max_depth=4).fit(X_train, y_train)
MSE_rfr = mean_squared_error(y_test, rfr.predict(X_test))
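For completeness, the two linear-model scores can be obtained with the functions defined earlier; whether the raw or standardized features were used, and the regularization strength of 1.0, are assumptions made purely for illustration:

# Closed-form least squares and ridge regression on the same split
w_ls = fit_least_squares(X_train, y_train)
MSE_ls = mean_squared_error(predict_linear_model(X_test, w_ls), y_test)

w_ridge = fit_ridge(X_train, y_train, regularization_strength=1.0)
MSE_ridge = mean_squared_error(predict_linear_model(X_test, w_ridge), y_test)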

After applying our four different models to the data, we find the following results:

>> MSE for k-NN = 1.43901490776407
>> MSE for RF = 0.5078535241487971
>> MSE for Least squares = 4.638432867430077
>> MSE for Ridge regression = 4.695883266990483

We can conclude from these results that the linear subspace spanned by our feature vectors is not expressive enough to capture the effect the inputs have on median housing prices. The nonlinearities offered by the other two machine learning algorithms give them the advantage in our case, so if we want our regression model to perform better, further modifications are needed. One option is to expand our family of working functions to include different nonlinearities, so that by linearly regressing onto the space spanned by these working functions we capture more of the underlying effects created by our inputs. A different approach would be nonlinear regression, where the weights enter the model nonlinearly, applied either to our data directly or to some preprocessed functions of it.
