Classification and Regression Models for Car MSRP and Market Category

Car Classification and MSRP Prediction Model
knitr::opts_chunk$set(error = TRUE)
library(plyr)
library(pls)
library(tree)
library(tidyverse)

Introduction

The focus of this report is to shed light on models that predict the manufacturer's suggested retail price (MSRP) of cars from their important features. Car trading has always been a huge market in the country, and our team set out to find a way to accurately predict the price, as well as the market category, of used cars on the market. The inferences and models we derive from this analysis will be helpful in the decision-making process of used-car buyers, used-car sellers, and car dealers.

Dataset

The data set we used for this project is sourced from Kaggle. The primary goal for this dataset is to use the important features of cars to predict their price. The data set lists 10,879 cars with detailed information about each, including the make, model, year, engine, and transmission type of the cars. It also includes information on drive wheels, number of doors, vehicle size and style, and city and highway MPG. The dataset also records the market category, popularity, and price of the cars. All the recorded car information is US-based and covers 1990 through March 2017, inclusive.

Methodology

Before starting the analysis, we cleaned the data set of data points that were not relevant to the goals of this project. We started by removing cars with a manufacturer's suggested retail price of 2,000 USD or less, as well as cars with an MSRP of 100,000 USD or more. We applied hierarchical clustering and used PCR to predict MSRP, fitting a linear model to the same data as a benchmark for how the PCR model performs. We also fitted a regression tree to MSRP and used cross-validation to find the best size for the tree. We then applied a classification tree as well as logistic regression to predict the market category of the cars.

Building Models



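The chunks below assume the Kaggle CSV has already been read into data; a minimal sketch of the loading step, with a placeholder file name:

# placeholder file name; the actual path depends on where the Kaggle CSV was saved
data <- read.csv("car_data.csv", stringsAsFactors = TRUE)
# a UTF-8 byte-order mark in the file makes read.csv name the first column "ï..Make",
# which is why that identifier appears throughout this report
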
dim(data)
## [1] 10879    19
data <- na.omit(data)  # drop rows with missing values
dim(data)
## [1] 10788    19
data$Age <- 2017 - data$Year

We added another variable, Age, calculated as 2017 minus the year the car was introduced; 2017 is the year the dataset was compiled.
Since our analysis focuses on predicting the price of the car, our response variable is MSRP.

Data First Look and Visualization

hist(data$MSRP, main="MSRP Distribution")

We see that the data is heavily skewed to the right, with some outliers around the $200,000 mark in MSRP. We decided to focus on the more common cars, roughly those in the range of $2,000 to $100,000.

# keep cars with MSRP strictly between $2,000 and $100,000
data <- data[data$MSRP>2000,]
data <- data[data$MSRP<100000,]
dim(data)
## [1] 10155    20
hist(data$MSRP, main="MSRP Distribution")

After removing the outliers, we have a more focused distribution.
We then created a scatterplot matrix for the numerical variables we initially thought might be correlated with MSRP.

pairs(~MSRP+Year+Popularity+Engine.Cylinders+Engine.HP+highway.MPG+city.mpg, data=data)

We found that horsepower and MSRP are correlated, as are city and highway MPG. Cylinders and MSRP are somewhat correlated, and year and horsepower are correlated.

Categorical Variables Visualization

We first look at the categorical variables in this data set; 9 of the 20 variables are categorical.

head(plyr::count(data, 'ï..Make'))
##        ï..Make freq
## 1        Acura  242
## 2   Alfa Romeo    5
## 3 Aston Martin    1
## 4         Audi  256
## 5          BMW  304
## 6        Buick  170

We can see that there is only one Aston Martin and only three Genesis cars. When splitting the data set into training and test sets, such rare makes would cause a problem.
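
One common remedy would be to pool rare makes into a catch-all level before splitting; a minimal sketch on a copy of the data (we did not apply this in our analysis):

# pool makes with fewer than 5 cars into an "Other" level (hypothetical remedy)
pooled <- data
pooled$ï..Make <- as.character(pooled$ï..Make)
make_counts <- table(pooled$ï..Make)
rare <- names(make_counts[make_counts < 5])
pooled$ï..Make[pooled$ï..Make %in% rare] <- "Other"
pooled$ï..Make <- factor(pooled$ï..Make)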

par(mar = c(5.1,15,4.1,2.1))
barplot(plyr::count(data,'Engine.Fuel.Type')[,2], names.arg=plyr::count(data,'Engine.Fuel.Type')[,1], horiz=TRUE, col='navyblue', las=2, main = "Engine fuel type")

The most common recommended fuel type is regular unleaded.

par(mar = c(5.1,4.1,4.1,2.1))
barplot(plyr::count(data,'Engine.Cylinders')[,2], names.arg=plyr::count(data,'Engine.Cylinders')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Number of Cylinders')

Most cars have either four, six, or eight cylinders. We also saw in the scatterplot matrix that the number of cylinders is correlated with horsepower.

par(mfrow = c(1,2))
par(mar = c(5.1,10,4.1,2.1))
barplot(plyr::count(data,'Transmission.Type')[,2], names.arg=plyr::count(data,'Transmission.Type')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Transmission Type')
barplot(plyr::count(data,'Number.of.Doors')[,2], names.arg=plyr::count(data,'Number.of.Doors')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Number of Doors')

Automatic transmission and four doors are the norm for the average car nowadays.

par(mfrow = c(1,2))
par(mar = c(5.1,7,4.1,2.1))
barplot(plyr::count(data,'Country')[,2], names.arg=plyr::count(data,'Country')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Country')
barplot(plyr::count(data,'Vehicle.Size')[,2], names.arg=plyr::count(data,'Vehicle.Size')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Vehicle Size')

Cars are mostly manufactured in the US, Japan, and Germany. Vehicle size varies.

a <- plyr::count(data,'Vehicle.Style')
a[order(a[,2]),]
##          Vehicle.Style freq
## 8      Convertible SUV   18
## 2              2dr SUV   61
## 5        Cargo Minivan   63
## 6            Cargo Van   66
## 13       Passenger Van  110
## 14  Regular Cab Pickup  316
## 12   Passenger Minivan  365
## 1        2dr Hatchback  378
## 16               Wagon  495
## 7          Convertible  577
## 11 Extended Cab Pickup  587
## 3        4dr Hatchback  641
## 10     Crew Cab Pickup  681
## 9                Coupe  797
## 4              4dr SUV 2404
## 15               Sedan 2596

The most frequent vehicle styles are 4dr SUV and Sedan.

head(plyr::count(data,'Market.Category'))
##                                   Market.Category freq
## 1                                       Crossover 1104
## 2                                Crossover,Diesel    7
## 3        Crossover,Exotic,Luxury,High-Performance    1
## 4             Crossover,Exotic,Luxury,Performance    1
## 5 Crossover,Factory Tuner,Luxury,High-Performance   14
## 6      Crossover,Factory Tuner,Luxury,Performance    5

There are many market categories for these cars, and many entries are comma-separated combinations rather than single categories. We decided to create two dummy variables, Performance and Luxury. A car has a value of 1 for Performance if Performance appears anywhere in its market category; the same goes for Luxury. If the category contains neither Performance nor Luxury, both dummies are 0.
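
Our working data already contained these dummies; a minimal sketch of how they could be built with substring matching (note that this also flags High-Performance entries, which contain the word Performance):

# hypothetical construction of the two dummy variables
data$Performance <- as.integer(grepl("Performance", as.character(data$Market.Category)))
data$Luxury <- as.integer(grepl("Luxury", as.character(data$Market.Category)))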

Visualization of Numerical Variables

We created another variable, PriceCat, which is the MSRP divided by 2,000. The reason for this is that it would help with our classification question later on.

# drop the Make and Model columns (columns 1 and 2), which have too many levels
data2 <- data[,-c(1,2)]
data2$PriceCat <- data$MSRP/2000

We plotted histograms of the numerical variables.

par(mfrow = c(1, 2))
options(scipen=10)
hist(data$MSRP, main = "MSRP", col='darkgreen', xlab='Price')
hist(data2$PriceCat, main = "Price Category", col='darkred', xlab = 'Price')

par(mfrow = c(1, 2))
hist(data$highway.MPG, main = "Highway MPG", col = "pink", xlab='mpg')
hist(data$city.mpg, main = "City MPG", col = "navy",xlab='mpg')

There are some outliers in the MPG variables, which is why the graphs look skewed.

par(mfrow = c(1, 1))
hist(data$Engine.HP, main = "Horsepower", col = "orange", xlab='HP')

The horsepower distribution is bimodal, and there are some outliers in the 500 to 600 HP range.

Hierarchical Clustering

We first try to cluster the data.

# perform average-linkage clustering; dist() coerces factor columns, hence the warning
disthc.average <- hclust(dist(data), method = 'average')
## Warning in dist(data): NAs introduced by coercion
# plotting the dendrogram would be too dense to read, so we color scatterplots by cluster
par(mfrow = c(1, 2))
ggplot2::ggplot(data, aes(Age, MSRP)) + geom_point(aes(color = cutree(disthc.average, k=4)), size = 3, na.rm = TRUE) + ggtitle("Age vs MSRP Clusters") + xlab('Age') + ylab('MSRP')

ggplot2::ggplot(data, aes(Engine.HP, MSRP)) + geom_point(aes(color = cutree(disthc.average, k=4)), size = 3, na.rm = TRUE) + ggtitle("Horsepower vs MSRP Clusters") + xlab('Horsepower') + ylab('MSRP')

The clusters are separated by price range. This correlation was already visible in the scatterplot matrix from earlier.
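
A cleaner variant would restrict the distance computation to scaled numeric columns, which avoids the coercion warning; a sketch (not what was run above):

# distance on scaled numeric columns only; ~10,000 rows still yields a large distance matrix
num_cols <- sapply(data, is.numeric)
disthc.numeric <- hclust(dist(scale(data[, num_cols])), method = 'average')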

PCR

We already had an idea of which variables are correlated with MSRP.
We now use PCR to predict MSRP. We split the data into a training set and a test set; the training set contains 3/4 of the data, and the remaining observations form the test set.

Since PCR works with numerical variables, our team used Horsepower, Number of Doors, Highway MPG, City MPG, and Popularity as predictors.

# creating train set
set.seed(1)
train <- sample(1:nrow(data), 3*as.integer(nrow(data)/4))
# fit PCR
pcr.fit <- pcr(MSRP ~ Engine.HP + Number.of.Doors + highway.MPG +city.mpg + Popularity, data = data, subset=train, scale = TRUE, validation = "CV", segments = 10)
summary(pcr.fit)
## Data:    X dimension: 7614 5 
##  Y dimension: 7614 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
## CV           17066    15975    15981    15527    10691    10974
## adjCV        17066    15973    15979    15525    10689    10952
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       43.16    64.24    83.18    96.46   100.00
## MSRP    12.63    12.67    17.59    60.81    60.82

We plot the RMSEP against the number of components of the PCR model.

par(mfrow = c(1, 1))
validationplot(pcr.fit, val.type = 'RMSEP', estimate = "CV")
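
Rather than reading the best component count off the plot, it could also be extracted programmatically; a sketch using the pls package's RMSEP accessor:

# cross-validated RMSEP curve; the first entry is the 0-component (intercept-only) model
cv.rmsep <- RMSEP(pcr.fit, estimate = "CV")
ncomp.best <- which.min(cv.rmsep$val[1, 1, ]) - 1  # 4 components for the fit above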

The CV RMSEP is lowest with four components, so we use ncomp = 4 for prediction below. We also fit a linear model with the same data and predictors.

# Fitting linear regression with train set
lm.fit <- lm(MSRP ~ Engine.HP + Number.of.Doors + highway.MPG + city.mpg + Popularity, data = data, subset=train)
summary(lm.fit)$r.squared
## [1] 0.6081907

After that, we compare the test root mean squared errors (RMSE) of the two models.

pcr.pred <- predict(pcr.fit, data[-train,], ncomp = 4)
sqrt(mean((pcr.pred - data[-train, "MSRP"])^2))
## [1] 10777.91
lm.pred <- predict(lm.fit, data[-train,])
sqrt(mean((lm.pred - data[-train, "MSRP"])^2))
## [1] 10785.96

The test RMSEs of the two models are nearly identical. We will try other models and compare their predictive power.

Regression Tree

Our next step is to fit a regression tree on MSRP.

tree.fit <- tree(MSRP~.-ï..Make - Model - Luxury - Performance - Market.Category, data=data, subset=train)
summary(tree.fit)
## 
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance - 
##     Market.Category, data = data, subset = train)
## Variables actually used in tree construction:
## [1] "Engine.HP"     "Year"          "Country"       "Vehicle.Style"
## Number of terminal nodes:  10 
## Residual mean deviance:  65090000 = 494900000000 / 7604 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -69000   -4615    -888       0    3408   53060

There are 10 terminal nodes in this tree.

# plot
plot(tree.fit)
text(tree.fit, pretty=0)

We will use cross-validation to check whether a tree with fewer terminal nodes can be used without compromising the residual mean deviance.

set.seed(567)
(cv.car <- cv.tree(tree.fit, FUN = prune.tree, K = 10))
## $size
##  [1] 10  9  8  7  6  5  4  3  2  1
## 
## $dev
##  [1]  501444636813  535568446773  591972373779  642792968430  642792968430
##  [6]  844444116541  844444116541  943588221853 1408713504575 2217401064435
## 
## $k
##  [1]         -Inf  25959535352  44327778921  50753457966  52592362938
##  [6] 108189909015 113138731338 163482024579 327075653963 836470389886
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

The best tree size is the one with the lowest cross-validated deviance.

(bestsize <- cv.car$size[which.min(cv.car$dev)])
## [1] 10

In this case, it is still 10. Therefore, we use the same tree to predict on the test set.

prune.car <- prune.tree(tree.fit, best = bestsize)
yhat <- predict(prune.car, newdata = data[-train,])
mean((yhat - data[-train, 'MSRP'])^2)
## [1] 63065307
sqrt(mean((yhat - data[-train, 'MSRP'])^2))
## [1] 7941.367

The root mean squared error is about $8,000. We would say that a prediction that can be $8,000 lower or higher is not that good: imagine buying a car and the seller saying you have to pay between $10,000 and $26,000. That is a lot of deviation.

We then tried the earlier self-created variable, PriceCat, which is MSRP divided by 2,000, and ran the same regression tree all over again.

tree.fit2 <- tree(PriceCat~.- Luxury - Performance - Market.Category - MSRP, data=data2, subset=train)
summary(tree.fit2)
## 
## Regression tree:
## tree(formula = PriceCat ~ . - Luxury - Performance - Market.Category - 
##     MSRP, data = data2, subset = train)
## Variables actually used in tree construction:
## [1] "Engine.HP"     "Year"          "Country"       "Vehicle.Style"
## Number of terminal nodes:  10 
## Residual mean deviance:  16.27 = 123700 / 7604 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -34.500  -2.307  -0.444   0.000   1.704  26.530
plot(tree.fit2)
text(tree.fit2, pretty=0)

set.seed(567)
(cv.car2 <- cv.tree(tree.fit2, FUN = prune.tree, K = 10))
## $size
##  [1] 10  9  8  7  6  5  4  3  2  1
## 
## $dev
##  [1] 125361.2 133892.1 147993.1 160698.2 160698.2 211111.0 211111.0 235897.1
##  [9] 352178.4 554350.3
## 
## $k
##  [1]       -Inf   6489.884  11081.945  12688.364  13148.091  27047.477
##  [7]  28284.683  40870.506  81768.913 209117.597
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
(bestsize2 <- cv.car2$size[which.min(cv.car2$dev)])
## [1] 10
prune.car2 <- prune.tree(tree.fit2, best = bestsize2)

yhat <- predict(prune.car2, newdata = data2[-train,])
mean((yhat - data2[-train, 'PriceCat'])^2)
## [1] 15.76633
sqrt(mean((yhat - data2[-train, 'PriceCat'])^2))
## [1] 3.970683

This time the RMSE is exactly our earlier RMSE divided by 2,000. We realized that this step is redundant, since linearly rescaling the response does not change the fitted tree. We keep it here for completeness.
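
A quick sanity check of that claim (a sketch, assuming both pruned trees are still in memory; it should return TRUE):

# predictions from the rescaled model are exactly the original predictions divided by 2000
all.equal(2000 * predict(prune.car2, newdata = data2[-train,]),
          predict(prune.car, newdata = data[-train,]))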

Split the data

We learned that dividing MSRP by 2,000 does not affect the model.
Our team came up with the idea of splitting the data into three parts by the MSRP of the car. We split each of the three parts into training and test sets and fit a separate regression tree on each. We believe that by splitting the data into smaller parts, the models will predict better.

We divided the data into three parts: MSRP of at most $20,000, strictly between $20,000 and $35,000, and at least $35,000.

lowdata <- data[data$MSRP<=20000,]
middata <- data[data$MSRP>20000,]
middata <- middata[middata$MSRP<35000,]
highdata <- data[data$MSRP>=35000,]

# split each price band 3/4 train, 1/4 test
trainlow <- sample(1:nrow(lowdata), 3*as.integer(nrow(lowdata)/4))
trainmid <- sample(1:nrow(middata), 3*as.integer(nrow(middata)/4))
trainhigh <- sample(1:nrow(highdata), 3*as.integer(nrow(highdata)/4))

Regression on separate data

We then fit a regression tree on each of the three data sets.

Low Price (MSRP <= $20,000)

tree.fitlow <- tree(MSRP~.-ï..Make - Model- Luxury - Performance - Market.Category, data=lowdata, subset=trainlow)
summary(tree.fitlow)
## 
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance - 
##     Market.Category, data = lowdata, subset = trainlow)
## Variables actually used in tree construction:
## [1] "Age"       "Engine.HP"
## Number of terminal nodes:  3 
## Residual mean deviance:  3063000 = 3823000000 / 1248 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5791.0  -886.2  -116.5     0.0   963.6 15210.0
plot(tree.fitlow)
text(tree.fitlow, pretty=0)

set.seed(567)
(cv.carlow <- cv.tree(tree.fitlow, FUN = prune.tree, K = 10))
## $size
## [1] 3 2 1
## 
## $dev
## [1]  3846960705  4744853113 60972032782
## 
## $k
## [1]        -Inf   899885254 56133954775
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
(bestsizelow <- cv.carlow$size[which.min(cv.carlow$dev)])
## [1] 3
prune.carlow <- prune.tree(tree.fitlow, best = bestsizelow)

yhat <- predict(prune.carlow, newdata = lowdata[-trainlow,])
mean((yhat - lowdata[-trainlow, 'MSRP'])^2)
## [1] 3015514
sqrt(mean((yhat - lowdata[-trainlow, 'MSRP'])^2))
## [1] 1736.524

The regression tree has only 3 terminal nodes, and its root mean squared error is around $1,700.

Mid Price ($20,000 < MSRP < $35,000)

tree.fitmid <- tree(MSRP~.-ï..Make - Model- Luxury - Performance - Market.Category, data=middata, subset=trainmid)
summary(tree.fitmid)
## 
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance - 
##     Market.Category, data = middata, subset = trainmid)
## Variables actually used in tree construction:
## [1] "Engine.HP"        "Engine.Fuel.Type" "Vehicle.Style"    "city.mpg"        
## Number of terminal nodes:  9 
## Residual mean deviance:  9173000 = 32090000000 / 3498 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9649.0 -2106.0  -175.5     0.0  2118.0 10850.0
plot(tree.fitmid)
text(tree.fitmid, pretty=0)

set.seed(567)
(cv.carmid <- cv.tree(tree.fitmid, FUN = prune.tree, K = 10))
## $size
## [1] 9 8 7 6 5 4 3 2 1
## 
## $dev
## [1] 33440622828 33600033810 36904259469 37116214348 37487310509 40122137084
## [7] 41447259810 42697728232 61101239850
## 
## $k
## [1]        -Inf   656734842  1037784639  1058207475  1132491659  1883884122
## [7]  2251939450  2527533376 18426522290
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
(bestsizemid <- cv.carmid$size[which.min(cv.carmid$dev)])
## [1] 9
prune.carmid <- prune.tree(tree.fitmid, best = bestsizemid)

yhat <- predict(prune.carmid, newdata = middata[-trainmid,])
mean((yhat - middata[-trainmid, 'MSRP'])^2)
## [1] 8756495
sqrt(mean((yhat - middata[-trainmid, 'MSRP'])^2))
## [1] 2959.138

The tree has 9 terminal nodes and the root mean squared error is about $2,959.

High Price (MSRP >= $35,000)

tree.fithigh <- tree(MSRP~.-ï..Make - Model- Luxury - Performance - Market.Category, data=highdata, subset=trainhigh)
summary(tree.fithigh)
## 
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance - 
##     Market.Category, data = highdata, subset = trainhigh)
## Variables actually used in tree construction:
## [1] "Engine.HP"        "Engine.Fuel.Type" "Country"          "Vehicle.Style"   
## [5] "Vehicle.Size"     "highway.MPG"     
## Number of terminal nodes:  10 
## Residual mean deviance:  78190000 = 222500000000 / 2846 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -37460   -5427   -1212       0    4776   46480
plot(tree.fithigh)
text(tree.fithigh, pretty=0)

set.seed(567)
(cv.carhigh <- cv.tree(tree.fithigh, FUN = prune.tree, K = 10))
## $size
##  [1] 10  9  8  7  6  5  4  3  2  1
## 
## $dev
##  [1] 227698362301 248459509824 251094509016 257635733239 290470355586
##  [6] 294879970084 304094878759 352092725478 413142010333 615636954090
## 
## $k
##  [1]         -Inf   6972283683   7780255262   9217718625  16579391383
##  [6]  17277757300  19879609496  46749283255  61203580094 207265224514
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
(bestsizehigh <- cv.carhigh$size[which.min(cv.carhigh$dev)])
## [1] 10
prune.carhigh <- prune.tree(tree.fithigh, best = bestsizehigh)

yhat <- predict(prune.carhigh, newdata = highdata[-trainhigh,])
mean((yhat - highdata[-trainhigh, 'MSRP'])^2)
## [1] 84429178
sqrt(mean((yhat - highdata[-trainhigh, 'MSRP'])^2))
## [1] 9188.535

The tree has 10 terminal nodes and the root mean squared error is about $9,189.

We now have a regression tree and PCR to predict MSRP. Our models for two out of three budget levels perform well: their RMSE is within the normal price fluctuation across individual listings, so these models have reached our desired goal.
Our team decided to move on to build another model to predict the category a new car would fall into.

Classification Tree

We want to predict whether a car will be categorized as Performance or Luxury, and we use a classification tree for this goal.
Since we are predicting the Performance and Luxury labels, we remove the raw market category variable from the data.

# drop the Market.Category column (column 8 of data2)
data2 <- data2[,-8]

Performance Model

We had already removed Make and Model, and now Market Category, because they have too many levels and do not contribute to our prediction model.

par(mfrow = c(1, 1))

pr.tree <- tree(as.factor(Performance)~. , split = 'deviance', data = data2, subset=train)
plot(pr.tree)
text(pr.tree, pretty = 0)

We use the test set to predict and build a confusion matrix.

pr.pred <- predict(pr.tree, newdata = data2[-train,], type='class')
table(pr.pred, data2[-train,]$Performance)
##        
## pr.pred    0    1
##       0 1797  152
##       1   75  517
mean(pr.pred == data2[-train,]$Performance)
## [1] 0.9106651

The test classification accuracy is about 91%, which is quite high.
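
Accuracy alone can hide imbalance between the two classes; per-class rates can be read off the confusion matrix above (a sketch):

cm <- table(pr.pred, data2[-train,]$Performance)
cm[2, 2] / sum(cm[, 2])  # sensitivity: share of true Performance cars caught (~0.77 above)
cm[1, 1] / sum(cm[, 1])  # specificity: share of non-Performance cars correctly rejected (~0.96 above)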

Luxury Model

We do the same thing to predict the Luxury label.

lu.tree <- tree(as.factor(Luxury)~. , split = 'deviance', data = data2, subset=train)
plot(lu.tree)
text(lu.tree, pretty = 0)

lu.pred <- predict(lu.tree, newdata = data2[-train,], type='class')
table(lu.pred, data2[-train,]$Luxury)
##        
## lu.pred    0    1
##       0 1857   27
##       1   25  632
mean(lu.pred == data2[-train,]$Luxury)
## [1] 0.9795356

The test classification accuracy is about 98%, meaning we assign the right market category label to about 98% of new cars.

Logistic Regression

With good results from the classification tree, our team decided to fit a logistic regression model for comparison.

Performance Model

pr.logfit <- glm(as.factor(Performance)~., data=data2, subset=train, family = 'binomial')

We then removed some of the variables because they were not significant.
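
A sketch of how such a check could be run on the full fit above, inspecting the Wald z-test p-values:

# coefficient estimates and p-values; terms with large p-values are candidates to drop
summary(pr.logfit)$coefficients[, c("Estimate", "Pr(>|z|)")]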

pr.logfit <- glm(as.factor(Performance)~. - Engine.Fuel.Type - Number.of.Doors - highway.MPG - city.mpg - Age - PriceCat, data=data2, subset=train, family = 'binomial')
pr.prob <- predict(pr.logfit, newdata = data2[-train,], type="response")  # predicted probabilities
pr.pred <- rep(0,nrow(data2[-train,]))
pr.pred[pr.prob>0.5] <- 1  # classify with a 0.5 threshold
table(pr.pred, data2[-train,]$Performance)
##        
## pr.pred    0    1
##       0 1783  134
##       1   89  535
mean(pr.pred == data2[-train,]$Performance)
## [1] 0.9122393

The classification accuracy is 91%.

Luxury Model

lu.logfit <- glm(as.factor(Luxury)~., data=data2, subset=train, family = 'binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

We removed some of the variables because they were not significant.

lu.logfit <- glm(as.factor(Luxury)~. - Engine.Fuel.Type - Number.of.Doors -Age- PriceCat -Year , data=data2, subset=train, family = 'binomial')
lu.prob <- predict(lu.logfit, newdata = data2[-train,], type="response")
lu.pred <- rep(0,nrow(data2[-train,]))
lu.pred[lu.prob>0.5] <- 1
table(lu.pred, data2[-train,]$Luxury)
##        
## lu.pred    0    1
##       0 1793  169
##       1   89  490
mean(lu.pred == data2[-train,]$Luxury)
## [1] 0.8984652

The test classification accuracy is about 90%. This model performs worse than the classification tree.

Implementation and Conclusion

MSRP Model

  • Scenario: when a person or dealership is trying to purchase a used car from an individual
  • Gives a good estimate of how much money one should expect to spend on a certain used car within a given budget
  • Could also potentially be used to predict used-car prices in the future
  • Not applicable to dealership sales, as dealerships follow a commission and markup pricing model
  • Strength: the model has an RMSE that is deemed usable in real-world applications; the RMSE is within the range of price fluctuation based on an individual car's condition
  • Limitations and further research directions:
    • The model is not robust enough to handle all cars under $100,000
    • The model has very limited ability to predict MSRP for higher-end cars
    • It lacks important predicting factors such as mileage

Performance/Luxury Model

  • Scenario: when a person or dealership is trying to sell a used car
  • The Performance/Luxury category dictates what kind of marketing content the seller should emphasize for the car:
    • Performance: horsepower, 0-60 time, handling
    • Luxury: luxury features, ride quality, rear-seat room
    • Regular: reliability, maintenance, condition of wearable parts
  • Limitations and further research directions:
    • Some information is missing from the dataset's Market Category variable
    • Some cars may be subject to brand perception bias (e.g., the BMW M2 and the VW Phaeton)