Regression and Classification ML Models for Car MSRP and Market Label
Mason Huang, Croft Li, Phan Nguyen
28 Nov 2019
knitr::opts_chunk$set(error = TRUE)
library(plyr)
library(pls)
library(tree)
library(tidyverse)
Introduction
This report explores models that predict a car's manufacturer's suggested retail price (MSRP) from its important features. Car trading is a huge market in the US, and our team set out to accurately predict the price and market category of used cars. The inferences and models we derive from this analysis should help the decision-making of people buying used cars, people selling used cars, and car dealers.
Dataset
The data set we used for this project is sourced from Kaggle. Its primary purpose is to predict car prices from important car features. It lists 10,879 cars, with detailed information about each, including make, model, year, engine, and transmission type. It also includes information on the driven wheels, number of doors, vehicle size and style, and city and highway MPG, as well as each car's market category, popularity, and price. All recorded cars are US-market vehicles from 1990 through March 2017 inclusive.
Methodology
Before starting the analysis, we cleaned the data set of points not relevant to the goals of this project. We removed cars with an MSRP at or below 2,000 USD, as well as cars with an MSRP at or above 100,000 USD. We applied hierarchical clustering and used principal component regression (PCR) to predict MSRP, fitting a linear model to the same data as a benchmark for the PCR model. We also fit a regression tree to MSRP and used cross-validation to find the best tree size. Finally, we applied a classification tree as well as logistic regression to predict the market category of cars.
Building Models
dim(data)
## [1] 10879 19
data <- na.omit(data)
dim(data)
## [1] 10788 19
data$Age <- 2017 - data$Year
We added another variable, Age, calculated as 2017 minus the year the car was introduced; 2017 is the year the dataset was compiled.
Since our analysis focuses on predicting the price of the car, our response variable is MSRP.
Data First Look and Visualization
hist(data$MSRP, main="MSRP Distribution")
We see that the data is heavily skewed to the right, with some MSRP outliers above $200,000. We decided to focus on the more common cars, those priced between $2,000 and $100,000.
data <- data[data$MSRP > 2000,]
data <- data[data$MSRP < 100000,]
dim(data)
## [1] 10155 20
hist(data$MSRP, main="MSRP Distribution")
After removing the outliers, we have a more focused distribution.
We created a scatterplot matrix for the numerical variables that we initially thought might be correlated with MSRP.
pairs(~MSRP+Year+Popularity+Engine.Cylinders+Engine.HP+highway.MPG+city.mpg, data=data)
From the matrix, we found that horsepower and MSRP are correlated, as are city and highway MPG. The number of cylinders and MSRP are somewhat correlated, and year and horsepower are correlated.
Categorical Variables Visualization
We first look at the categorical variables in this data set; 9 of the 20 variables are categorical.
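As a quick sanity check of that count (a sketch, not part of the original analysis), the categorical columns can be counted directly:
# Sketch: count the non-numeric (categorical) columns in the cleaned data
sum(!sapply(data, is.numeric))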
head(plyr::count(data, 'ï..Make'))
## ï..Make freq
## 1 Acura 242
## 2 Alfa Romeo 5
## 3 Aston Martin 1
## 4 Audi 256
## 5 BMW 304
## 6 Buick 170
We can see that there is only one Aston Martin and, further down the full table, only three Genesis cars. Such rare levels would cause a problem when splitting the data set into training and test sets.
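One quick way to flag these rare levels (a sketch using the make column as read in):
# Sketch: list makes that appear fewer than 5 times; a random train/test
# split can easily leave every copy of such a make on one side
make_counts <- table(data$ï..Make)
make_counts[make_counts < 5]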
par(mar = c(5.1,15,4.1,2.1))
barplot(plyr::count(data,'Engine.Fuel.Type')[,2], names.arg=plyr::count(data,'Engine.Fuel.Type')[,1], horiz=TRUE, col='navyblue', las=2, main = "Engine fuel type")
The most common recommended fuel type is regular unleaded.
par(mar = c(5.1,4.1,4.1,2.1))
barplot(plyr::count(data,'Engine.Cylinders')[,2], names.arg=plyr::count(data,'Engine.Cylinders')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Number of Cylinders')
Most cars have four, six, or eight cylinders. We also saw in the scatterplot matrix that the number of cylinders is correlated with horsepower.
par(mfrow = c(1,2))
par(mar = c(5.1,10,4.1,2.1))
barplot(plyr::count(data,'Transmission.Type')[,2], names.arg=plyr::count(data,'Transmission.Type')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Transmission Type')
barplot(plyr::count(data,'Number.of.Doors')[,2], names.arg=plyr::count(data,'Number.of.Doors')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Number of Doors')
Automatic transmission and four doors are the norm for the average car nowadays.
par(mfrow = c(1,2))
par(mar = c(5.1,7,4.1,2.1))
barplot(plyr::count(data,'Country')[,2], names.arg=plyr::count(data,'Country')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Country')
barplot(plyr::count(data,'Vehicle.Size')[,2], names.arg=plyr::count(data,'Vehicle.Size')[,1], horiz=TRUE, col='navyblue', las=2, main = 'Vehicle Size')
Most cars are manufactured in the US, Japan, and Germany. Vehicle size varies.
a <- plyr::count(data,'Vehicle.Style')
a[order(a[,2]),]
## Vehicle.Style freq
## 8 Convertible SUV 18
## 2 2dr SUV 61
## 5 Cargo Minivan 63
## 6 Cargo Van 66
## 13 Passenger Van 110
## 14 Regular Cab Pickup 316
## 12 Passenger Minivan 365
## 1 2dr Hatchback 378
## 16 Wagon 495
## 7 Convertible 577
## 11 Extended Cab Pickup 587
## 3 4dr Hatchback 641
## 10 Crew Cab Pickup 681
## 9 Coupe 797
## 4 4dr SUV 2404
## 15 Sedan 2596
The most frequent vehicle styles are 4dr SUV and Sedan.
head(plyr::count(data,'Market.Category'))
## Market.Category freq
## 1 Crossover 1104
## 2 Crossover,Diesel 7
## 3 Crossover,Exotic,Luxury,High-Performance 1
## 4 Crossover,Exotic,Luxury,Performance 1
## 5 Crossover,Factory Tuner,Luxury,High-Performance 14
## 6 Crossover,Factory Tuner,Luxury,Performance 5
There are many market categories for these cars, and many entries are comma-separated combinations rather than clean categories. We therefore created two dummy variables, Performance and Luxury. A car gets a 1 for Performance if "Performance" appears in its market category, and likewise for Luxury; if the category contains neither, both dummies are 0.
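A minimal sketch of how these flags could be built (our illustration; the original construction code is not shown in this report):
# Sketch (assumption): flag cars whose Market.Category string mentions
# "Performance" or "Luxury"; note that grepl("Performance", ...) also
# matches "High-Performance" entries
data$Performance <- as.integer(grepl("Performance", data$Market.Category))
data$Luxury <- as.integer(grepl("Luxury", data$Market.Category))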
Visualization of Numerical Variables
We created another variable, PriceCat, which is MSRP divided by 2,000. The reason is that it would help with our classification question later on.
data2 <- data[,-c(1,2)] # drop the make and model columns
data2$PriceCat <- data$MSRP/2000 # MSRP in units of $2,000
We plotted histograms of the numerical variables.
par(mfrow = c(1, 2))
options(scipen=10)
hist(data$MSRP, main = "MSRP", col='dark green', xlab='Price')
hist(data2$PriceCat, main = "Price Category", col =' dark red', xlab = 'Price')
par(mfrow = c(1, 2))
hist(data$highway.MPG, main = "Highway MPG", col = "pink", xlab='mpg')
hist(data$city.mpg, main = "City MPG", col = "navy",xlab='mpg')
There are some MPG outliers in our data set, so the graphs look skewed.
par(mfrow = c(1, 1))
hist(data$Engine.HP, main = "Horsepower", col = "orange", xlab='HP')
The horsepower distribution is bimodal, and there are some outliers around 500 and 600 HP.
Hierarchical Clustering
We first try to cluster the data.
#perform clustering; dist() coerces the full data frame to numeric, so the
#factor columns trigger the NA warning below
disthc.average <- hclust(dist(data), method = 'average')
## Warning in dist(data): NAs introduced by coercion
#plotting the dendrogram would be too dense to read
#note: par(mfrow) has no effect on ggplot output; the two plots print separately
ggplot2::ggplot(data, aes(x = Age, y = MSRP)) + geom_point(aes(color = cutree(disthc.average, k = 4)), size = 3, na.rm = TRUE) + ggtitle("Age vs MSRP Clusters") + xlab('Age') + ylab('MSRP')
ggplot2::ggplot(data, aes(x = Engine.HP, y = MSRP)) + geom_point(aes(color = cutree(disthc.average, k = 4)), size = 3, na.rm = TRUE) + ggtitle("Horsepower vs MSRP Clusters") + xlab('Horsepower') + ylab('MSRP')
The clusters are separated by price range; this relationship was already visible in the scatterplot matrix from earlier.
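To avoid the coercion warning, one could instead cluster on scaled numeric columns only; a sketch (the column choice here is our assumption, not from the original analysis):
# Sketch: hierarchical clustering on scaled numeric columns only, so that
# dist() never coerces factors and introduces NAs
num_cols <- c("Engine.HP", "highway.MPG", "city.mpg", "Popularity", "Age", "MSRP")
hc_num <- hclust(dist(scale(data[, num_cols])), method = 'average')
table(cutree(hc_num, k = 4)) # cluster sizes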
PCR
We already have an idea of which variables are correlated with MSRP.
We now use PCR to predict MSRP. We split the data into a training set containing 3/4 of the observations and a test set containing the rest.
Since PCR works best with numerical variables, we use Horsepower, Number of Doors, Highway MPG, City MPG, and Popularity as predictors.
# creating train set
set.seed(1)
train <- sample(1:nrow(data), 3*as.integer(nrow(data)/4))
# fit PCR
pcr.fit <- pcr(MSRP ~ Engine.HP + Number.of.Doors + highway.MPG +city.mpg + Popularity, data = data, subset=train, scale = TRUE, validation = "CV", segments = 10)
summary(pcr.fit)
## Data: X dimension: 7614 5
## Y dimension: 7614 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 17066 15975 15981 15527 10691 10974
## adjCV 17066 15973 15979 15525 10689 10952
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 43.16 64.24 83.18 96.46 100.00
## MSRP 12.63 12.67 17.59 60.81 60.82
We plot RMSEP against the number of components of the PCR model:
par(mfrow = c(1, 1))
validationplot(pcr.fit, val.type = 'RMSEP', estimate = "CV")
We will also fit a linear model with the same set of data and predictors.
# Fitting linear regression with train set
lm.fit <- lm(MSRP ~ Engine.HP + Number.of.Doors + highway.MPG + city.mpg + Popularity, data = data, subset=train)
summary(lm.fit)$r.squared
## [1] 0.6081907
After that, we compare the test root mean squared error (RMSE) of the two models.
pcr.pred <- predict(pcr.fit, data[-train,], ncomp = 4)
sqrt(mean((pcr.pred - data[-train, "MSRP"])^2))
## [1] 10777.91
lm.pred <- predict(lm.fit, data[-train,])
sqrt(mean((lm.pred - data[-train, "MSRP"])^2))
## [1] 10785.96
The test RMSEs of the two models differ very little. We will try other models and compare their predictive power.
Regression Tree
Our next step is to fit a regression tree on MSRP.
tree.fit <- tree(MSRP~.-ï..Make - Model - Luxury - Performance - Market.Category, data=data, subset=train)
summary(tree.fit)
##
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance -
## Market.Category, data = data, subset = train)
## Variables actually used in tree construction:
## [1] "Engine.HP" "Year" "Country" "Vehicle.Style"
## Number of terminal nodes: 10
## Residual mean deviance: 65090000 = 494900000000 / 7604
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -69000 -4615 -888 0 3408 53060
The tree has 10 terminal nodes in total.
# plot
plot(tree.fit)
text(tree.fit, pretty=0)
We will use cross-validation to see whether a tree with fewer terminal nodes achieves comparable deviance.
set.seed(567)
(cv.car <- cv.tree(tree.fit, FUN = prune.tree, K = 10))
## $size
## [1] 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 501444636813 535568446773 591972373779 642792968430 642792968430
## [6] 844444116541 844444116541 943588221853 1408713504575 2217401064435
##
## $k
## [1] -Inf 25959535352 44327778921 50753457966 52592362938
## [6] 108189909015 113138731338 163482024579 327075653963 836470389886
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
The best tree size is the one with the lowest cross-validated deviance.
(bestsize <- cv.car$size[which.min(cv.car$dev)])
## [1] 10
In this case, it is still 10, so we use the unpruned tree to predict on the test set.
prune.car <- prune.tree(tree.fit, best = bestsize)
yhat <- predict(prune.car, newdata = data[-train,])
mean((yhat - data[-train, 'MSRP'])^2)
## [1] 63065307
sqrt(mean((yhat - data[-train, 'MSRP'])^2))
## [1] 7941.367
The root mean squared error is about $8,000. A prediction that can be $8,000 too low or too high is not great: imagine buying a car and being told you will pay somewhere between $10,000 and $26,000. That is a lot of deviation.
We decided to use the earlier self-created variable, PriceCat (MSRP divided by 2,000), and ran the same regression tree again.
tree.fit2 <- tree(PriceCat~.- Luxury - Performance - Market.Category - MSRP, data=data2, subset=train)
summary(tree.fit2)
##
## Regression tree:
## tree(formula = PriceCat ~ . - Luxury - Performance - Market.Category -
## MSRP, data = data2, subset = train)
## Variables actually used in tree construction:
## [1] "Engine.HP" "Year" "Country" "Vehicle.Style"
## Number of terminal nodes: 10
## Residual mean deviance: 16.27 = 123700 / 7604
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -34.500 -2.307 -0.444 0.000 1.704 26.530
plot(tree.fit2)
text(tree.fit2, pretty=0)
set.seed(567)
(cv.car2 <- cv.tree(tree.fit2, FUN = prune.tree, K = 10))
## $size
## [1] 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 125361.2 133892.1 147993.1 160698.2 160698.2 211111.0 211111.0 235897.1
## [9] 352178.4 554350.3
##
## $k
## [1] -Inf 6489.884 11081.945 12688.364 13148.091 27047.477
## [7] 28284.683 40870.506 81768.913 209117.597
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(bestsize2 <- cv.car2$size[which.min(cv.car2$dev)])
## [1] 10
prune.car2 <- prune.tree(tree.fit2, best = bestsize2)
yhat <- predict(prune.car2, newdata = data2[-train,])
mean((yhat - data2[-train, 'PriceCat'])^2)
## [1] 15.76633
sqrt(mean((yhat - data2[-train, 'PriceCat'])^2))
## [1] 3.970683
This time the RMSE is exactly our earlier RMSE divided by 2,000. We realized this step is redundant: a linear change of units rescales the predictions but does not change the fitted tree. We keep it here for completeness.
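A quick arithmetic check of the scaling:
# Sanity check: the PriceCat-unit RMSE times 2000 recovers the MSRP-unit RMSE
3.970683 * 2000 # = 7941.366, matching the earlier tree's test RMSE up to rounding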
Split the data
We learned that dividing MSRP by 2,000 does not affect the model.
Our team instead came up with the idea of splitting the data into three parts by the MSRP of the car. We split each part into training and test sets and fit a separate regression tree to each, believing that models fit to smaller, more homogeneous price ranges will predict better.
We divided the data into three parts: MSRP at or below $20,000, strictly between $20,000 and $35,000, and at or above $35,000.
lowdata <- data[data$MSRP <= 20000,]
middata <- data[data$MSRP > 20000 & data$MSRP < 35000,]
highdata <- data[data$MSRP >= 35000,]
trainlow <- sample(1:nrow(lowdata), 3*as.integer(nrow(lowdata)/4))
trainmid <- sample(1:nrow(middata), 3*as.integer(nrow(middata)/4))
trainhigh <- sample(1:nrow(highdata), 3*as.integer(nrow(highdata)/4))
Regression on separate data
We then fit a regression tree on each of the three data sets.
Low Price (MSRP <= $20,000)
tree.fitlow <- tree(MSRP~.-ï..Make - Model- Luxury - Performance - Market.Category, data=lowdata, subset=trainlow)
summary(tree.fitlow)
##
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance -
## Market.Category, data = lowdata, subset = trainlow)
## Variables actually used in tree construction:
## [1] "Age" "Engine.HP"
## Number of terminal nodes: 3
## Residual mean deviance: 3063000 = 3823000000 / 1248
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5791.0 -886.2 -116.5 0.0 963.6 15210.0
plot(tree.fitlow)
text(tree.fitlow, pretty=0)
set.seed(567)
(cv.carlow <- cv.tree(tree.fitlow, FUN = prune.tree, K = 10))
## $size
## [1] 3 2 1
##
## $dev
## [1] 3846960705 4744853113 60972032782
##
## $k
## [1] -Inf 899885254 56133954775
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(bestsizelow <- cv.carlow$size[which.min(cv.carlow$dev)])
## [1] 3
prune.carlow <- prune.tree(tree.fitlow, best = bestsizelow)
yhat <- predict(prune.carlow, newdata = lowdata[-trainlow,])
mean((yhat - lowdata[-trainlow, 'MSRP'])^2)
## [1] 3015514
sqrt(mean((yhat - lowdata[-trainlow, 'MSRP'])^2))
## [1] 1736.524
The regression tree has only 3 terminal nodes, and its root mean squared error is around $1,700.
Mid Price ($20,000 < MSRP < $35,000)
tree.fitmid <- tree(MSRP~.-ï..Make - Model- Luxury - Performance - Market.Category, data=middata, subset=trainmid)
summary(tree.fitmid)
##
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance -
## Market.Category, data = middata, subset = trainmid)
## Variables actually used in tree construction:
## [1] "Engine.HP" "Engine.Fuel.Type" "Vehicle.Style" "city.mpg"
## Number of terminal nodes: 9
## Residual mean deviance: 9173000 = 32090000000 / 3498
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9649.0 -2106.0 -175.5 0.0 2118.0 10850.0
plot(tree.fitmid)
text(tree.fitmid, pretty=0)
set.seed(567)
(cv.carmid <- cv.tree(tree.fitmid, FUN = prune.tree, K = 10))
## $size
## [1] 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 33440622828 33600033810 36904259469 37116214348 37487310509 40122137084
## [7] 41447259810 42697728232 61101239850
##
## $k
## [1] -Inf 656734842 1037784639 1058207475 1132491659 1883884122
## [7] 2251939450 2527533376 18426522290
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(bestsizemid <- cv.carmid$size[which.min(cv.carmid$dev)])
## [1] 9
prune.carmid <- prune.tree(tree.fitmid, best = bestsizemid)
yhat <- predict(prune.carmid, newdata = middata[-trainmid,])
mean((yhat - middata[-trainmid, 'MSRP'])^2)
## [1] 8756495
sqrt(mean((yhat - middata[-trainmid, 'MSRP'])^2))
## [1] 2959.138
The tree size is 9 and the root mean squared error is 2,959.14.
High Price (MSRP >= $35,000)
tree.fithigh <- tree(MSRP~.-ï..Make - Model- Luxury - Performance - Market.Category, data=highdata, subset=trainhigh)
summary(tree.fithigh)
##
## Regression tree:
## tree(formula = MSRP ~ . - ï..Make - Model - Luxury - Performance -
## Market.Category, data = highdata, subset = trainhigh)
## Variables actually used in tree construction:
## [1] "Engine.HP" "Engine.Fuel.Type" "Country" "Vehicle.Style"
## [5] "Vehicle.Size" "highway.MPG"
## Number of terminal nodes: 10
## Residual mean deviance: 78190000 = 222500000000 / 2846
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -37460 -5427 -1212 0 4776 46480
plot(tree.fithigh)
text(tree.fithigh, pretty=0)
set.seed(567)
(cv.carhigh <- cv.tree(tree.fithigh, FUN = prune.tree, K = 10))
## $size
## [1] 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 227698362301 248459509824 251094509016 257635733239 290470355586
## [6] 294879970084 304094878759 352092725478 413142010333 615636954090
##
## $k
## [1] -Inf 6972283683 7780255262 9217718625 16579391383
## [6] 17277757300 19879609496 46749283255 61203580094 207265224514
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(bestsizehigh <- cv.carhigh$size[which.min(cv.carhigh$dev)])
## [1] 10
prune.carhigh <- prune.tree(tree.fithigh, best = bestsizehigh)
yhat <- predict(prune.carhigh, newdata = highdata[-trainhigh,])
mean((yhat - highdata[-trainhigh, 'MSRP'])^2)
## [1] 84429178
sqrt(mean((yhat - highdata[-trainhigh, 'MSRP'])^2))
## [1] 9188.535
The tree has 10 terminal nodes and the root mean squared error is 9,188.54.
We now have both PCR and regression-tree models to predict MSRP. The split models perform well for two of the three budget levels: their RMSEs fall within the normal price fluctuation seen across individual listings, so those models meet our goal.
Our team then moved on to build another model to predict the market category a new car would fall into.
Classification Tree
We want to predict whether a car is categorized as Performance or Luxury, using a classification tree.
Since we are predicting the Performance and Luxury labels directly, we remove the original market category variable from the data.
data2 <- data2[,-8] # drop the market category column (column 8 of data2)
Performance Model
We removed Make, Model, and Market.Category because they have too many levels and do not contribute to our prediction model.
par(mfrow = c(1, 1))
pr.tree <- tree(as.factor(Performance)~. , split = 'deviance', data = data2, subset=train)
plot(pr.tree)
text(pr.tree, pretty = 0)
We use the test set to predict and build a confusion matrix.
pr.pred <- predict(pr.tree, newdata = data2[-train,], type='class')
table(pr.pred, data2[-train,]$Performance)
##
## pr.pred 0 1
## 0 1797 152
## 1 75 517
mean(pr.pred == data2[-train,]$Performance)
## [1] 0.9106651
The test classification accuracy is about 91%, which is quite high.
Luxury Model
We do the same thing to predict the Luxury label.
lu.tree <- tree(as.factor(Luxury)~. , split = 'deviance', data = data2, subset=train)
plot(lu.tree)
text(lu.tree, pretty = 0)
lu.pred <- predict(lu.tree, newdata = data2[-train,], type='class')
table(lu.pred, data2[-train,]$Luxury)
##
## lu.pred 0 1
## 0 1857 27
## 1 25 632
mean(lu.pred == data2[-train,]$Luxury)
## [1] 0.9795356
The classification accuracy is 98%, meaning the model assigns the correct Luxury label to about 98% of new cars.
Logistic Regression
Given the good results from the classification trees, our team decided to fit logistic regression models for comparison.
Performance Model
pr.logfit <- glm(as.factor(Performance)~., data=data2, subset=train, family = 'binomial')
We then removed some of the variables because they were not significant.
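Significance can be read off the full model's coefficient table, for example:
# Inspect the Wald z-tests for each coefficient of the full model
summary(pr.logfit)$coefficients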
pr.logfit <- glm(as.factor(Performance)~. - Engine.Fuel.Type - Number.of.Doors - highway.MPG - city.mpg - Age - PriceCat, data=data2, subset=train, family = 'binomial')
pr.prob <- predict(pr.logfit, newdata = data2[-train,], type="response")
pr.pred <- rep(0,nrow(data2[-train,]))
pr.pred[pr.prob>0.5] <- 1
table(pr.pred, data2[-train,]$Performance)
##
## pr.pred 0 1
## 0 1783 134
## 1 89 535
mean(pr.pred == data2[-train,]$Performance)
## [1] 0.9122393
The classification accuracy is 91%.
Luxury Model
lu.logfit <- glm(as.factor(Luxury)~., data=data2, subset=train, family = 'binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
We removed some of the variables because they were not significant.
lu.logfit <- glm(as.factor(Luxury)~. - Engine.Fuel.Type - Number.of.Doors -Age- PriceCat -Year , data=data2, subset=train, family = 'binomial')
lu.prob <- predict(lu.logfit, newdata = data2[-train,], type="response")
lu.pred <- rep(0,nrow(data2[-train,]))
lu.pred[lu.prob>0.5] <- 1
table(lu.pred, data2[-train,]$Luxury)
##
## lu.pred 0 1
## 0 1793 169
## 1 89 490
mean(lu.pred == data2[-train,]$Luxury)
## [1] 0.8984652
The classification accuracy is about 90%. This model performs worse than the classification tree.
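For a finer comparison than raw accuracy, precision and recall can be computed from the confusion matrices above; a minimal sketch (the helper function is our own):
# Hypothetical helper: precision and recall from a 2x2 confusion matrix
# laid out as table(predicted, actual) with classes 0/1, as printed above
class_metrics <- function(tab) {
  tp <- tab["1", "1"]; fp <- tab["1", "0"]; fn <- tab["0", "1"]
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}
class_metrics(table(lu.pred, data2[-train,]$Luxury)) # logistic Luxury model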
Implementation and Conclusion
MSRP Model
- Scenario: when people or dealerships are trying to purchase a used car from an individual
- Gives a good estimate of how much one should expect to spend on a given used car within a given budget
- Could also be used to predict used-car prices in the future
- Not applicable to dealership sales, since dealerships follow a commission/markup pricing model
- Strength: the model's RMSE is usable in real-world applications; it falls within the range of price fluctuation based on an individual car's condition
- Limitations and further research directions:
- The model is not robust enough to handle all cars under $100,000
- The model has very limited ability to predict MSRP for higher-end cars
- It lacks important predictors such as mileage
Performance/Luxury Model
- Scenario: when people or dealerships are trying to sell a used car
- The Performance/Luxury category dictates what kind of marketing content the seller should emphasize:
- Performance: horsepower, 0-60 time, handling
- Luxury: luxury features, ride quality, rear-seat room
- Regular: reliability, maintenance, condition of wearable parts
- Limitations and further research directions:
- Some information is missing from the dataset's Market Category variable
- Some cars may be subject to brand-perception bias (e.g., BMW M2 vs. VW Phaeton)