Predicting Car Prices using KNN

11 minute read

Predicting Car Prices

The goal of this project is to practice the machine learning workflow by predicting car prices from a dataset of car characteristics. The dataset comes from the UCI Machine Learning Repository.

Disclaimer: This project is one of the guided projects in the Data Scientist path on dataquest.io.

Introduction

Let’s first read in the dataset and explore it to get a better understanding of the data. Since the original file doesn’t include column headers, we need to provide the list of column names manually.

The numeric feature columns in the dataset are:

  • normalized-losses
  • wheel-base
  • length
  • width
  • height
  • curb-weight
  • engine-size
  • bore
  • stroke
  • compression-rate
  • horsepower
  • peak-rpm
  • city-mpg
  • highway-mpg

The target column is price.

# import necessary libraries
import pandas as pd
import numpy as np
import collections
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# enable pandas to show all columns
pd.options.display.max_columns = 99


# list of column names
columns = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
           'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base',
           'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
           'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower',
           'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

# read the dataset
cars = pd.read_csv("imports-85.data", names=columns)

# check first few values
cars.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
# check columns and observations
cars.shape
(205, 26)

Data Cleaning

Machine learning algorithms can’t handle missing values on their own, so we need to treat them first. From the output above we can immediately see that the normalized-losses column contains missing values in the form of “?”. We first replace the “?” values with numpy.nan across the dataset; this will make it easier to convert the affected columns to the required data types.

# replace ? values with np.nan
cars = cars.replace("?", np.nan)
# check datatypes
cars.dtypes
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-rate     float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

Now that we have replaced the “?” values, let’s convert the columns containing int or float values to their correct data types.

# list of columns to convert
cols = ["normalized-losses","bore","stroke","horsepower","peak-rpm","price"]

# convert the values to their appropriate datatype
for col in cols:
    cars[col] = pd.to_numeric(cars[col], errors='coerce')

Rerun the datatypes cell to check if the conversion was successful.
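
Equivalently, a quick check on just the converted columns should now show numeric dtypes (float64, since the missing entries became NaN):

# confirm the converted columns are now numeric
cars[cols].dtypes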

Now let’s look at the columns which have these missing values and decide on how to handle them.

# Check for missing values
cars.isnull().sum()
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

We can see that there are 41 null values out of 205 total records in normalized-losses, which is roughly 20% of all records. To handle this column we can either fill the missing values with the average value or drop the column entirely; we won’t drop the affected rows, since they make up a sizeable part of the dataset. To decide, let’s check the distribution of values in normalized-losses.

cars['normalized-losses'].value_counts().plot(kind='bar', figsize=(16,8))
[Bar chart: value counts of normalized-losses]

We can see that the distribution of this column is right-skewed, so imputing with the raw average could pull the imputed values toward the tail and introduce bias. We will normalize the column first (z-score), which also puts all features on a comparable scale for the distance-based KNN model, and then fill the missing values with the column mean.
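
As an extra numerical check, the skewness statistic should be clearly positive for a right-skewed column:

# a positive value confirms the right skew visible in the plot
cars['normalized-losses'].skew()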

# normalize the column (z-score); missing values remain NaN
cars['normalized-losses'] = (cars['normalized-losses'] - cars['normalized-losses'].mean()) / cars['normalized-losses'].std()

# impute missing values with the (post-normalization) mean, which is ~0
cars['normalized-losses'] = cars['normalized-losses'].fillna(cars['normalized-losses'].mean())

Rerun the cell showing missing values to ensure that the imputation was successful. We can also observe that there are a few missing values in columns such as price, horsepower, peak-rpm, bore and stroke. We will also normalize these and impute with the average value.

# columns to be treated as per above
cols = ['bore', 'stroke', 'peak-rpm', 'horsepower', 'price']

# normalize and fill missing values with the column mean
for col in cols:
    cars[col] = (cars[col] - cars[col].mean()) / cars[col].std()
    cars[col] = cars[col].fillna(cars[col].mean())

Rerun the cell generating the null value counts to make sure all of the missing values have been treated.
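
Equivalently, a quick check on just the treated columns should show zero missing values in each:

# confirm that none of the imputed columns still contain missing values
cars[['normalized-losses', 'bore', 'stroke', 'peak-rpm', 'horsepower', 'price']].isnull().sum()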

Univariate Model

Now let’s start with a univariate k-nearest neighbors model for predicting car prices. Starting with simple models before moving to more complex ones helps us structure our workflow and understand the features better.

To enhance and simplify our model testing workflow, we will create a function that encapsulates the training and simple validation process.

# function for KNN model training and simple validation
def knn_train_test(train_name, target_name, data):
    # make a copy of the dataset
    df = data.copy()
    
    # number of observations for split
    obs = int(df.shape[0] / 2)
    
    # split the data into train and test
    train_df = df.iloc[:obs]
    test_df = df.iloc[obs:]
    
    # reshape the values since we have a single feature
    train_data = train_df[train_name].values.reshape(-1, 1)
    test_data = test_df[train_name].values.reshape(-1, 1)
    
    # instantiate  and fit the model
    knn = KNeighborsRegressor()
    knn.fit(train_data, train_df[target_name])
    
    # make predictions and evaluate
    preds = knn.predict(test_data)
    rmse = mean_squared_error(test_df[target_name], preds) ** (1/2)
    
    return rmse
# use the function to train a model on normalized-losses
knn_train_test('normalized-losses', 'price', cars)
1.0768788918340664

The resulting RMSE comes from a model that uses only a single feature, normalized-losses. Note that because we normalized the price column, the RMSE is expressed in standard deviations of price rather than in dollars. Let’s test the other numeric features to see how the model performs with each.
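
If we wanted to report the error in dollars instead, one option is to recover the scale of the original price column from the raw file and multiply the RMSE by it:

# recover the standard deviation of the original, unnormalized price column
raw_price = pd.to_numeric(
    pd.read_csv("imports-85.data", names=columns)['price'], errors='coerce')
price_std = raw_price.std()

# RMSE scales linearly with its target, so multiplying by the original
# standard deviation expresses the error in dollars
rmse_standardized = knn_train_test('normalized-losses', 'price', cars)
print(rmse_standardized * price_std)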

# use numeric features
numeric_features = ['wheel-base', 'length', 'width', 'height',
                    'curb-weight', 'engine-size', 'compression-rate',
                    'city-mpg', 'highway-mpg', 'horsepower', 'peak-rpm']

# loop through models with individual feature
for col in numeric_features:
    result = knn_train_test(col, 'price', cars)
    print("RMSE for model with {}: {}".format(col, result))
RMSE for model with wheel-base: 1.214252890227465
RMSE for model with length: 1.1107969135114395
RMSE for model with width: 0.8581184488064216
RMSE for model with height: 1.3712276576274274
RMSE for model with curb-weight: 0.7526797743700095
RMSE for model with engine-size: 0.6066522674323407
RMSE for model with compression-rate: 0.9980929742948769
RMSE for model with city-mpg: 0.6071818370617673
RMSE for model with highway-mpg: 0.5904065996733864
RMSE for model with horsepower: 0.6212450466337898
RMSE for model with peak-rpm: 1.1696138374863996

While these univariate models aren’t indicative of final performance, the model using the highway-mpg feature has the lowest error of the group.

Next, let’s modify our function to accept a k parameter and rerun our univariate model with the above features to see the performance. We will use k values of 1, 3, 5, 7 and 9 to see which model performs well.

# function for KNN model training and validation - modified to accept k
def knn_train_test(train_name, target_name, data, k=5):
    # make a copy of the dataset
    df = data.copy()
    
    # number of observations for split
    obs = int(df.shape[0] / 2)
    
    # split the data into train and test
    train_df = df.iloc[:obs]
    test_df = df.iloc[obs:]
    
    # reshape the values since we have a single feature
    train_data = train_df[train_name].values.reshape(-1, 1)
    test_data = test_df[train_name].values.reshape(-1, 1)
    
    # instantiate  and fit the model
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(train_data, train_df[target_name])
    
    # make predictions and evaluate
    preds = knn.predict(test_data)
    rmse = mean_squared_error(test_df[target_name], preds) ** (1/2)
    
    return rmse
# use numeric features
numeric_features = ['wheel-base', 'length', 'width', 'height',
                    'curb-weight', 'engine-size', 'compression-rate',
                    'city-mpg', 'highway-mpg', 'horsepower', 'peak-rpm']

# dictionary to store model results
lowest_rmse = {}

# loop through models with individual feature
for col in numeric_features:
    for k in [1,3,5,7,9]:
        result = knn_train_test(col, 'price', cars, k=k)
        lowest_rmse[(col, k)] = result

lowest_rmse
{('wheel-base', 1): 1.0200915246848115,
 ('wheel-base', 3): 1.1330559244770682,
 ('wheel-base', 5): 1.214252890227465,
 ('wheel-base', 7): 1.1050644625745172,
 ('wheel-base', 9): 1.0706582345927034,
 ('length', 1): 1.2582151367719674,
 ('length', 3): 1.164040684899018,
 ('length', 5): 1.1107969135114395,
 ('length', 7): 1.0788091491246363,
 ('length', 9): 0.9649699176225501,
 ('width', 1): 1.173235330629889,
 ('width', 3): 0.776724402613776,
 ('width', 5): 0.8581184488064216,
 ('width', 7): 0.7920962270887307,
 ('width', 9): 0.7818084447558675,
 ('height', 1): 1.835270961051065,
 ('height', 3): 1.4522475457196613,
 ('height', 5): 1.3712276576274274,
 ('height', 7): 1.2536105424726978,
 ('height', 9): 1.1799262400433599,
 ('curb-weight', 1): 0.8977925627200231,
 ('curb-weight', 3): 0.7810203611189004,
 ('curb-weight', 5): 0.7526797743700095,
 ('curb-weight', 7): 0.7273112573379618,
 ('curb-weight', 9): 0.6962953604831444,
 ('engine-size', 1): 0.711130777393697,
 ('engine-size', 3): 0.6670315366668025,
 ('engine-size', 5): 0.6066522674323407,
 ('engine-size', 7): 0.5846523661450151,
 ('engine-size', 9): 0.5470224273794155,
 ('compression-rate', 1): 0.9995557239010212,
 ('compression-rate', 3): 1.0014666665882759,
 ('compression-rate', 5): 0.9980929742948769,
 ('compression-rate', 7): 0.994312348888388,
 ('compression-rate', 9): 0.9657803897563202,
 ('city-mpg', 1): 0.672247401307514,
 ('city-mpg', 3): 0.6692599843503406,
 ('city-mpg', 5): 0.6071818370617673,
 ('city-mpg', 7): 0.5947058619062482,
 ('city-mpg', 9): 0.6184289881119851,
 ('highway-mpg', 1): 0.65226957297654,
 ('highway-mpg', 3): 0.5514192730892143,
 ('highway-mpg', 5): 0.5904065996733864,
 ('highway-mpg', 7): 0.5316220529521046,
 ('highway-mpg', 9): 0.5045969744641035,
 ('horsepower', 1): 0.7424499390187735,
 ('horsepower', 3): 0.751441782151203,
 ('horsepower', 5): 0.6212450466337898,
 ('horsepower', 7): 0.5678990997984015,
 ('horsepower', 9): 0.51029924508396,
 ('peak-rpm', 1): 1.2547163045227205,
 ('peak-rpm', 3): 1.1322701001836366,
 ('peak-rpm', 5): 1.1696138374863996,
 ('peak-rpm', 7): 1.1987345300306547,
 ('peak-rpm', 9): 1.1998145119829764}

We can observe that several univariate models achieve a low RMSE. Nevertheless, let’s find the single combination of feature and k with the smallest RMSE.

# find the feature and k with minimum rmse
minval = min(lowest_rmse.values())
res = [k for k, v in lowest_rmse.items() if v == minval]

print("Model with lowest rmse is based on feature:{} with k={}".format(res[0][0], res[0][1]))
Model with lowest rmse is based on feature:highway-mpg with k=9
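
To summarize the grid above, we can also average each feature’s RMSE across the tested k values and rank the features; a similar ranking informs the feature order used in the next section.

# average RMSE per feature across the tested k values
feature_rmse = {}
for (col, k), rmse in lowest_rmse.items():
    feature_rmse.setdefault(col, []).append(rmse)
avg_rmse = {col: sum(vals) / len(vals) for col, vals in feature_rmse.items()}

# features sorted from lowest to highest average RMSE
sorted(avg_rmse.items(), key=lambda item: item[1])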

Multivariate Model

Let’s modify the function we wrote to work with multiple columns.

# function for KNN model training and validation
# modified for multivariate modeling
def knn_train_test(train_name, target_name, data, k=5):
    # make a copy of the dataset
    df = data.copy()
    
    # number of observations for split
    obs = int(df.shape[0] / 2)
    
    if len(train_name) <= 1:
        return "Please provide a list of features"
    
    # split the data into train and test
    train_df = df.iloc[:obs]
    test_df = df.iloc[obs:]
    
    # instantiate  and fit the model
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(train_df[train_name], train_df[target_name])
    
    # make predictions and evaluate
    preds = knn.predict(test_df[train_name])
    rmse = mean_squared_error(test_df[target_name], preds) ** (1/2)
    
    return rmse

To test our new function, we will add the best-performing features from the univariate exploration one at a time, training models on 2, 3, 4, 5 and 6 columns with the default value of k=5. (Note that, as the loop below is written, the seed feature highway-mpg ends up in every feature list twice, which effectively doubles its weight in the distance calculation.)

# features ordered by performance from univariate modeling
features = ['highway-mpg', 'horsepower', 'engine-size',
            'city-mpg', 'curb-weight']

# starting feature set
cols = features[:1]

# iterate and add each feature to the next model
for col in features:
    # add the next feature
    cols.append(col)
    
    # train the model
    res = knn_train_test(cols, 'price', cars)
    
    # print the rmse
    print("The model with {} features has rmse:{}".format(cols,res))
The model with ['highway-mpg', 'highway-mpg'] features has rmse:0.5904065996733864
The model with ['highway-mpg', 'highway-mpg', 'horsepower'] features has rmse:0.6622571328368921
The model with ['highway-mpg', 'highway-mpg', 'horsepower', 'engine-size'] features has rmse:0.5902520717106036
The model with ['highway-mpg', 'highway-mpg', 'horsepower', 'engine-size', 'city-mpg'] features has rmse:0.5614448491361091
The model with ['highway-mpg', 'highway-mpg', 'horsepower', 'engine-size', 'city-mpg', 'curb-weight'] features has rmse:0.7406898108805515

We can observe from the results above that the model with five features has the lowest RMSE.

Hyperparameter Tuning

Last but not least, we will perform hyperparameter tuning to find the best k parameter for the top three models from the previous section. We will visualize the tuning results for each model to pick an optimal value of k.

# list of features of top models
model1 = ['highway-mpg', 'highway-mpg', 'horsepower', 'engine-size', 'city-mpg']
model2 = ['highway-mpg', 'highway-mpg']
model3 = ['highway-mpg', 'highway-mpg', 'horsepower', 'engine-size']

# list to hold the results (converted to a DataFrame below)
results = []

# iterate through various k values
for k in range(1,25):

    # calculate the rmse for models
    res1 = knn_train_test(model1, 'price', cars, k=k)
    res2 = knn_train_test(model2, 'price', cars, k=k)
    res3 = knn_train_test(model3, 'price', cars, k=k)
    
    # save the results to the df
    results.append(['model1', k, res1])
    results.append(['model2', k, res2])
    results.append(['model3', k, res3])
    
# convert list of list to df
df = pd.DataFrame(results, columns=['model', 'k', 'score'])
df.dtypes
model     object
k          int64
score    float64
dtype: object
# visualize the results
import seaborn as sns
fig, ax = plt.subplots(figsize=(16,8))
sns.barplot(x='k', y='score', hue='model', data=df)
plt.show()

[Bar chart: RMSE by k value for each of the three models]

The bar chart displays the RMSE for each model across k values. We can observe that model 2 consistently has the lowest RMSE, and that all three models reach their lowest errors around k=12-14. Past k values of about 15 the RMSE starts to climb again, which indicates that the models begin to underfit: with too many neighbors, each prediction is averaged over cars that are no longer very similar to the one being priced. Overall, k values around 12-14 seem like a good choice for the k parameter.
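
To complement the chart, we can also read the single best combination straight from the results table:

# row with the lowest RMSE across all models and k values
df.loc[df['score'].idxmin()]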

Summary & Ideas

In this project we built a KNN regressor and trained both univariate and multivariate models with varying k values. One idea for improving the workflow is k-fold cross-validation: currently we train on one half of the data and test on the other half, so a k-fold scheme would give us more reliable error estimates, and the number of folds could be chosen just as we tuned the k value for KNN.
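
As a rough sketch of that idea, scikit-learn’s cross-validation utilities could replace our manual half-and-half split; the feature set and fold count below are only illustrative.

from sklearn.model_selection import KFold, cross_val_score

# illustrative feature set and number of folds
features = ['highway-mpg', 'horsepower', 'engine-size', 'city-mpg']
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# cross-validated RMSE for a KNN regressor with k=5
knn = KNeighborsRegressor(n_neighbors=5)
neg_mses = cross_val_score(knn, cars[features], cars['price'],
                           scoring='neg_mean_squared_error', cv=kf)
rmses = np.sqrt(-neg_mses)
print(rmses.mean(), rmses.std())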

That’s it for this project. Feel free to leave your feedback and comments below.

For a complete jupyter notebook file please refer to the github page of the project.
