Simple Logistic Regression on Titanic
15 Dec 2020
You can find the full report and dataset here on Kaggle.
Predicting Survival on the Titanic
Hello, welcome to my first attempt at cracking this competition. I am new to data science, so any comments would be highly appreciated.
1. Importing Libraries
These are the libraries I found necessary for this model-building exercise.
library(tidyverse) # metapackage of all tidyverse packages
library('plyr')    # data manipulation (used for plyr::count below)
library('dplyr')   # loaded after plyr so the dplyr verbs take precedence
2. First look at the data
First, I’m going to pull in the data from Kaggle. There are two data sets, train and test.
train <- read.csv("../input/titanic/train.csv")
test <- read.csv("../input/titanic/test.csv")
dim(train)
## [1] 891 12
dim(test)
## [1] 418 11
The train data set has 891 rows, each row represents a passenger and the test set has 418 rows. Let’s look at the first few rows of the data.
head(train,n=5)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
These are the 12 variables in the train data set. The test data set doesn’t contain Survived.
Variable Name | Description |
---|---|
PassengerId | Unique passenger identifier |
Survived | Survived (1) or died (0) |
Pclass | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) |
Name | Passenger’s name |
Sex | Passenger’s sex |
Age | Passenger’s age in years |
SibSp | Number of siblings/spouses aboard |
Parch | Number of parents/children aboard |
Ticket | Ticket number |
Fare | Passenger fare |
Cabin | Cabin number |
Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
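Before going further, it also helps to check the column types and how many values are missing in each column. This is a quick sketch, not part of the original output:
str(train)             # column types as read by read.csv
colSums(is.na(train))  # missing values per column (blank Cabin/Embarked cells are read as "" rather than NA)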
Right off the bat, I am speculating that Sex and Age will be significant predictors of survival. PassengerId, Name and Ticket are not relevant in my opinion.
For this exercise, I am going to fit a logistic regression model, which estimates the probability of survival by modelling the log-odds of survival as a linear function of the predictors.
However, first, we are going to remove rows with missing data from the train data set.
a <- nrow(train)         # number of rows before dropping missing values
train <- na.omit(train)  # remove every row containing an NA
a - nrow(train)          # number of rows removed
## [1] 177
plyr::count(train$Survived)
## x freq
## 1 0 424
## 2 1 290
We removed 177 observations due to missing data. Among the 714 remaining passengers, 424 died and 290 survived.
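As a quick sanity check of the earlier speculation, a cross-tabulation of survival by sex on the cleaned train data (a sketch, not part of the original output) can be produced with:
table(train$Sex, train$Survived)                          # counts of deaths (0) and survivals (1) by sex
prop.table(table(train$Sex, train$Survived), margin = 1)  # proportions within each sex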
3. Fitting the model
Let’s fit a logistic model with all the variables except PassengerId, Name and Ticket.
titanic.lr <- glm(formula = as.factor(Survived) ~ . - PassengerId - Name - Ticket, data = train, family = "binomial")
summary(titanic.lr)
##
## Call:
## glm(formula = as.factor(Survived) ~ . - PassengerId - Name -
## Ticket, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9325 -0.5407 -0.2779 0.3380 2.4949
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.829e+00 8.132e-01 5.938 2.88e-09 ***
## Pclass -1.027e+00 2.352e-01 -4.367 1.26e-05 ***
## Sexmale -2.664e+00 2.510e-01 -10.612 < 2e-16 ***
## Age -4.926e-02 1.040e-02 -4.737 2.17e-06 ***
## SibSp -5.055e-01 1.505e-01 -3.358 0.000784 ***
## Parch -1.999e-02 1.356e-01 -0.147 0.882753
## Fare 8.274e-03 6.561e-03 1.261 0.207334
## CabinA10 -1.863e+01 6.523e+03 -0.003 0.997721
## CabinA16 1.694e+01 6.523e+03 0.003 0.997928
## CabinA20 1.951e+01 6.523e+03 0.003 0.997614
## CabinA23 2.112e+01 6.523e+03 0.003 0.997416
## CabinA24 -1.859e+01 6.523e+03 -0.003 0.997725
## CabinA26 1.952e+01 6.523e+03 0.003 0.997612
## CabinA31 1.877e+01 6.523e+03 0.003 0.997704
## CabinA34 1.699e+01 6.523e+03 0.003 0.997922
## CabinA36 -1.778e+01 6.523e+03 -0.003 0.997825
## CabinA5 -1.686e+01 6.523e+03 -0.003 0.997937
## CabinA6 1.851e+01 6.523e+03 0.003 0.997735
## CabinA7 -1.757e+01 6.523e+03 -0.003 0.997851
## CabinB101 1.454e+01 6.523e+03 0.002 0.998221
## CabinB18 1.555e+01 4.422e+03 0.004 0.997194
## CabinB19 -1.698e+01 6.523e+03 -0.003 0.997923
## CabinB20 1.788e+01 3.680e+03 0.005 0.996123
## CabinB22 -1.636e-01 2.609e+00 -0.063 0.949990
## CabinB28 1.667e+01 4.469e+03 0.004 0.997024
## CabinB3 1.515e+01 6.523e+03 0.002 0.998146
## CabinB30 -1.737e+01 6.523e+03 -0.003 0.997876
## CabinB35 1.500e+01 4.612e+03 0.003 0.997405
## CabinB37 -1.746e+01 6.523e+03 -0.003 0.997864
## CabinB38 -1.771e+01 6.523e+03 -0.003 0.997834
## CabinB39 1.511e+01 6.523e+03 0.002 0.998152
## CabinB4 1.633e+01 6.523e+03 0.003 0.998002
## CabinB41 1.988e+01 6.523e+03 0.003 0.997568
## CabinB42 1.545e+01 6.523e+03 0.002 0.998110
## CabinB49 1.701e+01 3.794e+03 0.004 0.996423
## CabinB5 1.414e+01 4.563e+03 0.003 0.997527
## CabinB50 1.838e+01 6.523e+03 0.003 0.997752
## CabinB51 B53 B55 -1.754e+00 2.781e+00 -0.631 0.528242
## CabinB57 B59 B63 B66 1.424e+01 4.610e+03 0.003 0.997536
## CabinB58 B60 -3.046e+00 2.047e+00 -1.488 0.136785
## CabinB69 1.708e+01 6.523e+03 0.003 0.997911
## CabinB71 -1.810e+01 6.523e+03 -0.003 0.997786
## CabinB73 1.547e+01 6.523e+03 0.002 0.998108
## CabinB77 1.560e+01 4.610e+03 0.003 0.997299
## CabinB79 1.484e+01 6.523e+03 0.002 0.998185
## CabinB80 1.604e+01 6.523e+03 0.002 0.998038
## CabinB82 B84 -1.846e+01 6.523e+03 -0.003 0.997741
## CabinB86 -1.955e+01 6.523e+03 -0.003 0.997609
## CabinB94 -1.773e+01 6.523e+03 -0.003 0.997831
## CabinB96 B98 1.732e+01 2.708e+03 0.006 0.994897
## CabinC101 1.796e+01 6.523e+03 0.003 0.997803
## CabinC103 1.740e+01 6.523e+03 0.003 0.997871
## CabinC104 1.974e+01 6.523e+03 0.003 0.997586
## CabinC110 -1.782e+01 6.523e+03 -0.003 0.997820
## CabinC111 -1.883e+01 6.523e+03 -0.003 0.997697
## CabinC118 -1.848e+01 6.523e+03 -0.003 0.997740
## CabinC123 -6.302e-01 1.802e+00 -0.350 0.726500
## CabinC124 -1.770e+01 6.523e+03 -0.003 0.997835
## CabinC125 1.598e+01 4.526e+03 0.004 0.997183
## CabinC126 1.987e+01 6.523e+03 0.003 0.997570
## CabinC148 1.809e+01 6.523e+03 0.003 0.997787
## CabinC2 -1.259e+00 1.878e+00 -0.670 0.502655
## CabinC22 C26 -4.052e+00 1.598e+00 -2.535 0.011234 *
## CabinC23 C25 C27 -1.791e+00 1.926e+00 -0.930 0.352431
## CabinC30 -1.725e+01 6.523e+03 -0.003 0.997890
## CabinC32 1.504e+01 6.523e+03 0.002 0.998160
## CabinC45 1.438e+01 6.523e+03 0.002 0.998241
## CabinC46 -1.808e+01 6.523e+03 -0.003 0.997789
## CabinC49 -2.051e+01 6.523e+03 -0.003 0.997491
## CabinC50 1.648e+01 6.523e+03 0.003 0.997984
## CabinC52 1.859e+01 6.523e+03 0.003 0.997726
## CabinC54 1.489e+01 6.523e+03 0.002 0.998179
## CabinC62 C64 1.390e+01 6.523e+03 0.002 0.998299
## CabinC65 -2.374e+00 1.863e+00 -1.275 0.202381
## CabinC68 -1.065e+00 1.973e+00 -0.540 0.589165
## CabinC7 1.497e+01 6.523e+03 0.002 0.998169
## CabinC70 1.702e+01 6.523e+03 0.003 0.997918
## CabinC78 -1.468e-01 2.210e+00 -0.066 0.947047
## CabinC82 -2.046e+01 6.523e+03 -0.003 0.997498
## CabinC83 -6.845e-01 1.932e+00 -0.354 0.723085
## CabinC85 1.618e+01 6.523e+03 0.002 0.998021
## CabinC86 -1.799e+01 6.523e+03 -0.003 0.997800
## CabinC87 -1.687e+01 6.523e+03 -0.003 0.997936
## CabinC90 1.517e+01 6.523e+03 0.002 0.998145
## CabinC91 -1.908e+01 6.523e+03 -0.003 0.997666
## CabinC92 1.924e+01 6.523e+03 0.003 0.997647
## CabinC93 1.806e+01 3.838e+03 0.005 0.996246
## CabinC99 1.537e+01 6.523e+03 0.002 0.998120
## CabinD 2.671e-01 1.528e+00 0.175 0.861213
## CabinD10 D12 1.769e+01 6.523e+03 0.003 0.997837
## CabinD11 1.714e+01 6.523e+03 0.003 0.997904
## CabinD15 1.534e+01 6.523e+03 0.002 0.998124
## CabinD17 1.694e+01 4.612e+03 0.004 0.997069
## CabinD19 1.957e+01 6.523e+03 0.003 0.997606
## CabinD20 1.686e+01 4.611e+03 0.004 0.997082
## CabinD26 -1.866e+01 4.357e+03 -0.004 0.996583
## CabinD28 1.525e+01 6.523e+03 0.002 0.998135
## CabinD30 -1.870e+01 6.523e+03 -0.003 0.997712
## CabinD33 1.833e+01 3.899e+03 0.005 0.996248
## CabinD35 1.849e+01 4.037e+03 0.005 0.996346
## CabinD36 1.530e+01 4.595e+03 0.003 0.997343
## CabinD37 1.723e+01 6.523e+03 0.003 0.997892
## CabinD46 -1.767e+01 6.523e+03 -0.003 0.997838
## CabinD47 1.552e+01 6.523e+03 0.002 0.998101
## CabinD48 -1.812e+01 6.523e+03 -0.003 0.997784
## CabinD49 1.775e+01 6.523e+03 0.003 0.997828
## CabinD50 -1.697e+01 6.523e+03 -0.003 0.997925
## CabinD56 2.002e+01 6.523e+03 0.003 0.997551
## CabinD6 -1.852e+01 6.523e+03 -0.003 0.997734
## CabinD7 1.773e+01 6.523e+03 0.003 0.997831
## CabinD9 1.515e+01 6.523e+03 0.002 0.998146
## CabinE10 2.099e+01 6.523e+03 0.003 0.997432
## CabinE101 1.716e+01 4.605e+03 0.004 0.997026
## CabinE12 1.957e+01 6.523e+03 0.003 0.997606
## CabinE121 1.900e+01 4.203e+03 0.005 0.996394
## CabinE17 1.972e+01 6.523e+03 0.003 0.997588
## CabinE24 1.912e+01 4.599e+03 0.004 0.996684
## CabinE25 1.898e+01 4.612e+03 0.004 0.996716
## CabinE31 -1.744e+01 6.523e+03 -0.003 0.997867
## CabinE33 1.541e+01 6.523e+03 0.002 0.998115
## CabinE34 1.578e+01 6.523e+03 0.002 0.998070
## CabinE36 1.540e+01 6.523e+03 0.002 0.998116
## CabinE38 -1.672e+01 6.523e+03 -0.003 0.997955
## CabinE40 1.530e+01 6.523e+03 0.002 0.998128
## CabinE44 -2.347e-01 1.930e+00 -0.122 0.903231
## CabinE46 -1.747e+01 6.523e+03 -0.003 0.997863
## CabinE49 1.615e+01 6.523e+03 0.002 0.998024
## CabinE50 1.834e+01 6.523e+03 0.003 0.997757
## CabinE58 -1.760e+01 6.523e+03 -0.003 0.997847
## CabinE63 -1.771e+01 6.523e+03 -0.003 0.997834
## CabinE67 -3.619e-01 1.972e+00 -0.184 0.854360
## CabinE68 1.503e+01 6.523e+03 0.002 0.998161
## CabinE77 -1.862e+01 6.523e+03 -0.003 0.997722
## CabinE8 1.792e+01 3.975e+03 0.005 0.996403
## CabinF G63 -1.564e+01 6.523e+03 -0.002 0.998086
## CabinF G73 -1.664e+01 4.603e+03 -0.004 0.997116
## CabinF2 1.458e+00 1.304e+00 1.118 0.263681
## CabinF33 1.714e+01 3.750e+03 0.005 0.996353
## CabinF4 1.828e+01 3.929e+03 0.005 0.996289
## CabinG6 -8.603e-01 1.083e+00 -0.795 0.426753
## CabinT -1.778e+01 6.523e+03 -0.003 0.997825
## EmbarkedC 3.712e-01 3.560e-01 1.043 0.297138
## EmbarkedQ -4.129e-01 5.877e-01 -0.702 0.482378
## EmbarkedS NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 490.36 on 571 degrees of freedom
## AIC: 776.36
##
## Number of Fisher Scoring iterations: 17
Among all these variables, only Pclass (ticket class), Sex, Age and SibSp are significant. The Cabin dummies have enormous standard errors because most cabins hold only one or two passengers, so their effects cannot be estimated reliably, and EmbarkedS is dropped because it is linearly dependent on the other dummy variables (the singularity flagged in the output).
These variables, in my opinion, make logical sense for survival. Sex would be a strong factor, since men back then were more likely to know how to swim. Age plays an important part too, since younger people might have a higher chance of survival thanks to greater physical strength and agility. Travelling with siblings or a spouse might also improve the chance of survival, since the passenger has someone to help them.
The only variable I find harder to explain is ticket class. My guess is that passengers in a higher class had higher social status and were therefore prioritized, or that ticket class reflects how easy it was to reach the lifeboats.
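As a rough sanity check of this guess, we can look at the raw survival proportion within each ticket class (a sketch, not part of the original report):
prop.table(table(train$Pclass, train$Survived), margin = 1)  # proportion died (0) / survived (1) within each class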
Let’s fit the model one more time with only these variables.
titanic.lr <- glm(formula = as.factor(Survived) ~ as.factor(Pclass) + as.factor(Sex) + Age + SibSp, data = train, family = "binomial")
summary(titanic.lr)
##
## Call:
## glm(formula = as.factor(Survived) ~ as.factor(Pclass) + as.factor(Sex) +
## Age + SibSp, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7876 -0.6417 -0.3864 0.6261 2.4539
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.334201 0.450700 9.617 < 2e-16 ***
## as.factor(Pclass)2 -1.414360 0.284727 -4.967 6.78e-07 ***
## as.factor(Pclass)3 -2.652618 0.285832 -9.280 < 2e-16 ***
## as.factor(Sex)male -2.627679 0.214771 -12.235 < 2e-16 ***
## Age -0.044760 0.008225 -5.442 5.27e-08 ***
## SibSp -0.380190 0.121516 -3.129 0.00176 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 636.56 on 708 degrees of freedom
## AIC: 648.56
##
## Number of Fisher Scoring iterations: 5
All the variables have low p-values and are highly significant. I believe this is the best logistic regression model for this data set.
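To put the coefficients on a more intuitive scale, we can exponentiate them into odds ratios; this is a small sketch that is not part of the original report:
exp(coef(titanic.lr))     # odds ratios for each predictor
exp(confint(titanic.lr))  # 95% confidence intervals on the odds-ratio scale
For example, exp(-2.63) ≈ 0.07, so, holding class, age and SibSp fixed, the estimated odds of survival for a male passenger are roughly 7% of those for a female passenger.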
4. Prediction
Let’s push the test data through this model.
prob <- predict(titanic.lr, newdata = test, type = "response")  # predicted probability of survival
prediction <- rep(0, nrow(test))                                # start with everyone predicted not to survive
prediction[prob > 0.5] <- 1                                     # predict survival when the probability exceeds 0.5
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)
And write it into a submission file
write.csv(solution, file = 'solution.csv', row.names = F)
head(solution)
## PassengerID Survived
## 1 892 0
## 2 893 0
## 3 894 0
## 4 895 0
## 5 896 1
## 6 897 0
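One caveat: some passengers in the test set have a missing Age, so predict() returns NA for them and, with the indexing above, they keep the default prediction of 0 (did not survive). If you would rather impute than default, a minimal sketch (assuming median imputation from the training data, which the original submission does not do) looks like this:
test_imputed <- test
test_imputed$Age[is.na(test_imputed$Age)] <- median(train$Age)  # train has no NAs after na.omit
prob <- predict(titanic.lr, newdata = test_imputed, type = "response")
prediction <- as.integer(prob > 0.5)
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)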
Thank you for reading my report on using logistic regression to predict Titanic survival!