Simple Logistic Regression on Titanic

You can find the full report and dataset here on Kaggle.

Predicting Survival on the Titanic

Hello, and welcome to my first attempt at cracking this competition. I am new to data science, so any comments would be highly appreciated.

1. Importing Libraries

These are the libraries I found necessary for this model-building exercise.

library(tidyverse) # metapackage of all tidyverse packages
library('plyr')    # data manipulation (used below for count())
library('dplyr')   # data manipulation verbs

2. First look at the data

First, I’m going to pull in the data from Kaggle. There are two data sets, train and test.

train <- read.csv("../input/titanic/train.csv") # training data (includes Survived)
test <- read.csv("../input/titanic/test.csv")   # test data (Survived withheld)
dim(train)
## [1] 891  12
dim(test)
## [1] 418  11

The train data set has 891 rows, each representing a passenger, and the test set has 418 rows. Let’s look at the first few rows of the data.

head(train,n=5)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S

These are the 12 variables in the train data set. The test data set doesn’t contain Survived.

Variable Name   Description
PassengerId     Unique passenger identifier
Survived        Survived (1) or died (0)
Pclass          Passenger’s class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name            Passenger’s name
Sex             Passenger’s sex
Age             Passenger’s age
SibSp           Number of siblings/spouses aboard
Parch           Number of parents/children aboard
Ticket          Ticket number
Fare            Passenger fare
Cabin           Cabin number
Embarked        Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
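
Before modeling anything, it is worth confirming how R parsed each column, for example that Pclass came in as an integer and Sex as a character/factor. A minimal check, not part of the original report:

# Print the type and first few values of every column
str(train)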

Right off the bat, I am speculating that Sex and Age will be significant predictors of survival. PassengerId, Name and Ticket are, in my opinion, not relevant.

For this exercise, I am going to fit a logistic model.
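
For reference, this is just the standard logistic regression setup, nothing specific to this report: the model relates the survival probability p to the predictors through the log-odds,

log( p / (1 - p) ) = b0 + b1*x1 + ... + bk*xk

so a negative fitted coefficient lowers the odds of survival, and exp(b) is the multiplicative change in the odds for a one-unit increase in that predictor.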

First, though, we are going to remove rows with missing data from the train data set.

a <- nrow(train)            # row count before dropping missing values
train <- na.omit(train)     # drop every row with at least one NA
a - nrow(train)             # number of rows removed
## [1] 177
plyr::count(train$Survived) # tally of deaths (0) and survivals (1)
##   x freq
## 1 0  424
## 2 1  290

We removed 177 observations due to missing data. Among the remaining 714 passengers, 424 died and 290 survived.
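
For context, the rows that na.omit() removed are exactly the ones with a missing Age; blank Cabin and Embarked entries are read by read.csv() as empty strings rather than NA, so they are not dropped. A quick check on the raw file, not part of the original workflow:

# Count the NA values per column in the raw training file
raw <- read.csv("../input/titanic/train.csv")
colSums(is.na(raw))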

3. Fitting the model

Let’s fit a logistic model with all the variables, except for PassengerId, Name and Ticket.

titanic.lr <- glm(formula = as.factor(Survived) ~ . - PassengerId - Name - Ticket,
                  data = train, family = 'binomial')
summary(titanic.lr)
## 
## Call:
## glm(formula = as.factor(Survived) ~ . - PassengerId - Name - 
##     Ticket, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9325  -0.5407  -0.2779   0.3380   2.4949  
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           4.829e+00  8.132e-01   5.938 2.88e-09 ***
## Pclass               -1.027e+00  2.352e-01  -4.367 1.26e-05 ***
## Sexmale              -2.664e+00  2.510e-01 -10.612  < 2e-16 ***
## Age                  -4.926e-02  1.040e-02  -4.737 2.17e-06 ***
## SibSp                -5.055e-01  1.505e-01  -3.358 0.000784 ***
## Parch                -1.999e-02  1.356e-01  -0.147 0.882753    
## Fare                  8.274e-03  6.561e-03   1.261 0.207334    
## CabinA10             -1.863e+01  6.523e+03  -0.003 0.997721    
## CabinA16              1.694e+01  6.523e+03   0.003 0.997928    
## CabinA20              1.951e+01  6.523e+03   0.003 0.997614    
## CabinA23              2.112e+01  6.523e+03   0.003 0.997416    
## CabinA24             -1.859e+01  6.523e+03  -0.003 0.997725    
## CabinA26              1.952e+01  6.523e+03   0.003 0.997612    
## CabinA31              1.877e+01  6.523e+03   0.003 0.997704    
## CabinA34              1.699e+01  6.523e+03   0.003 0.997922    
## CabinA36             -1.778e+01  6.523e+03  -0.003 0.997825    
## CabinA5              -1.686e+01  6.523e+03  -0.003 0.997937    
## CabinA6               1.851e+01  6.523e+03   0.003 0.997735    
## CabinA7              -1.757e+01  6.523e+03  -0.003 0.997851    
## CabinB101             1.454e+01  6.523e+03   0.002 0.998221    
## CabinB18              1.555e+01  4.422e+03   0.004 0.997194    
## CabinB19             -1.698e+01  6.523e+03  -0.003 0.997923    
## CabinB20              1.788e+01  3.680e+03   0.005 0.996123    
## CabinB22             -1.636e-01  2.609e+00  -0.063 0.949990    
## CabinB28              1.667e+01  4.469e+03   0.004 0.997024    
## CabinB3               1.515e+01  6.523e+03   0.002 0.998146    
## CabinB30             -1.737e+01  6.523e+03  -0.003 0.997876    
## CabinB35              1.500e+01  4.612e+03   0.003 0.997405    
## CabinB37             -1.746e+01  6.523e+03  -0.003 0.997864    
## CabinB38             -1.771e+01  6.523e+03  -0.003 0.997834    
## CabinB39              1.511e+01  6.523e+03   0.002 0.998152    
## CabinB4               1.633e+01  6.523e+03   0.003 0.998002    
## CabinB41              1.988e+01  6.523e+03   0.003 0.997568    
## CabinB42              1.545e+01  6.523e+03   0.002 0.998110    
## CabinB49              1.701e+01  3.794e+03   0.004 0.996423    
## CabinB5               1.414e+01  4.563e+03   0.003 0.997527    
## CabinB50              1.838e+01  6.523e+03   0.003 0.997752    
## CabinB51 B53 B55     -1.754e+00  2.781e+00  -0.631 0.528242    
## CabinB57 B59 B63 B66  1.424e+01  4.610e+03   0.003 0.997536    
## CabinB58 B60         -3.046e+00  2.047e+00  -1.488 0.136785    
## CabinB69              1.708e+01  6.523e+03   0.003 0.997911    
## CabinB71             -1.810e+01  6.523e+03  -0.003 0.997786    
## CabinB73              1.547e+01  6.523e+03   0.002 0.998108    
## CabinB77              1.560e+01  4.610e+03   0.003 0.997299    
## CabinB79              1.484e+01  6.523e+03   0.002 0.998185    
## CabinB80              1.604e+01  6.523e+03   0.002 0.998038    
## CabinB82 B84         -1.846e+01  6.523e+03  -0.003 0.997741    
## CabinB86             -1.955e+01  6.523e+03  -0.003 0.997609    
## CabinB94             -1.773e+01  6.523e+03  -0.003 0.997831    
## CabinB96 B98          1.732e+01  2.708e+03   0.006 0.994897    
## CabinC101             1.796e+01  6.523e+03   0.003 0.997803    
## CabinC103             1.740e+01  6.523e+03   0.003 0.997871    
## CabinC104             1.974e+01  6.523e+03   0.003 0.997586    
## CabinC110            -1.782e+01  6.523e+03  -0.003 0.997820    
## CabinC111            -1.883e+01  6.523e+03  -0.003 0.997697    
## CabinC118            -1.848e+01  6.523e+03  -0.003 0.997740    
## CabinC123            -6.302e-01  1.802e+00  -0.350 0.726500    
## CabinC124            -1.770e+01  6.523e+03  -0.003 0.997835    
## CabinC125             1.598e+01  4.526e+03   0.004 0.997183    
## CabinC126             1.987e+01  6.523e+03   0.003 0.997570    
## CabinC148             1.809e+01  6.523e+03   0.003 0.997787    
## CabinC2              -1.259e+00  1.878e+00  -0.670 0.502655    
## CabinC22 C26         -4.052e+00  1.598e+00  -2.535 0.011234 *  
## CabinC23 C25 C27     -1.791e+00  1.926e+00  -0.930 0.352431    
## CabinC30             -1.725e+01  6.523e+03  -0.003 0.997890    
## CabinC32              1.504e+01  6.523e+03   0.002 0.998160    
## CabinC45              1.438e+01  6.523e+03   0.002 0.998241    
## CabinC46             -1.808e+01  6.523e+03  -0.003 0.997789    
## CabinC49             -2.051e+01  6.523e+03  -0.003 0.997491    
## CabinC50              1.648e+01  6.523e+03   0.003 0.997984    
## CabinC52              1.859e+01  6.523e+03   0.003 0.997726    
## CabinC54              1.489e+01  6.523e+03   0.002 0.998179    
## CabinC62 C64          1.390e+01  6.523e+03   0.002 0.998299    
## CabinC65             -2.374e+00  1.863e+00  -1.275 0.202381    
## CabinC68             -1.065e+00  1.973e+00  -0.540 0.589165    
## CabinC7               1.497e+01  6.523e+03   0.002 0.998169    
## CabinC70              1.702e+01  6.523e+03   0.003 0.997918    
## CabinC78             -1.468e-01  2.210e+00  -0.066 0.947047    
## CabinC82             -2.046e+01  6.523e+03  -0.003 0.997498    
## CabinC83             -6.845e-01  1.932e+00  -0.354 0.723085    
## CabinC85              1.618e+01  6.523e+03   0.002 0.998021    
## CabinC86             -1.799e+01  6.523e+03  -0.003 0.997800    
## CabinC87             -1.687e+01  6.523e+03  -0.003 0.997936    
## CabinC90              1.517e+01  6.523e+03   0.002 0.998145    
## CabinC91             -1.908e+01  6.523e+03  -0.003 0.997666    
## CabinC92              1.924e+01  6.523e+03   0.003 0.997647    
## CabinC93              1.806e+01  3.838e+03   0.005 0.996246    
## CabinC99              1.537e+01  6.523e+03   0.002 0.998120    
## CabinD                2.671e-01  1.528e+00   0.175 0.861213    
## CabinD10 D12          1.769e+01  6.523e+03   0.003 0.997837    
## CabinD11              1.714e+01  6.523e+03   0.003 0.997904    
## CabinD15              1.534e+01  6.523e+03   0.002 0.998124    
## CabinD17              1.694e+01  4.612e+03   0.004 0.997069    
## CabinD19              1.957e+01  6.523e+03   0.003 0.997606    
## CabinD20              1.686e+01  4.611e+03   0.004 0.997082    
## CabinD26             -1.866e+01  4.357e+03  -0.004 0.996583    
## CabinD28              1.525e+01  6.523e+03   0.002 0.998135    
## CabinD30             -1.870e+01  6.523e+03  -0.003 0.997712    
## CabinD33              1.833e+01  3.899e+03   0.005 0.996248    
## CabinD35              1.849e+01  4.037e+03   0.005 0.996346    
## CabinD36              1.530e+01  4.595e+03   0.003 0.997343    
## CabinD37              1.723e+01  6.523e+03   0.003 0.997892    
## CabinD46             -1.767e+01  6.523e+03  -0.003 0.997838    
## CabinD47              1.552e+01  6.523e+03   0.002 0.998101    
## CabinD48             -1.812e+01  6.523e+03  -0.003 0.997784    
## CabinD49              1.775e+01  6.523e+03   0.003 0.997828    
## CabinD50             -1.697e+01  6.523e+03  -0.003 0.997925    
## CabinD56              2.002e+01  6.523e+03   0.003 0.997551    
## CabinD6              -1.852e+01  6.523e+03  -0.003 0.997734    
## CabinD7               1.773e+01  6.523e+03   0.003 0.997831    
## CabinD9               1.515e+01  6.523e+03   0.002 0.998146    
## CabinE10              2.099e+01  6.523e+03   0.003 0.997432    
## CabinE101             1.716e+01  4.605e+03   0.004 0.997026    
## CabinE12              1.957e+01  6.523e+03   0.003 0.997606    
## CabinE121             1.900e+01  4.203e+03   0.005 0.996394    
## CabinE17              1.972e+01  6.523e+03   0.003 0.997588    
## CabinE24              1.912e+01  4.599e+03   0.004 0.996684    
## CabinE25              1.898e+01  4.612e+03   0.004 0.996716    
## CabinE31             -1.744e+01  6.523e+03  -0.003 0.997867    
## CabinE33              1.541e+01  6.523e+03   0.002 0.998115    
## CabinE34              1.578e+01  6.523e+03   0.002 0.998070    
## CabinE36              1.540e+01  6.523e+03   0.002 0.998116    
## CabinE38             -1.672e+01  6.523e+03  -0.003 0.997955    
## CabinE40              1.530e+01  6.523e+03   0.002 0.998128    
## CabinE44             -2.347e-01  1.930e+00  -0.122 0.903231    
## CabinE46             -1.747e+01  6.523e+03  -0.003 0.997863    
## CabinE49              1.615e+01  6.523e+03   0.002 0.998024    
## CabinE50              1.834e+01  6.523e+03   0.003 0.997757    
## CabinE58             -1.760e+01  6.523e+03  -0.003 0.997847    
## CabinE63             -1.771e+01  6.523e+03  -0.003 0.997834    
## CabinE67             -3.619e-01  1.972e+00  -0.184 0.854360    
## CabinE68              1.503e+01  6.523e+03   0.002 0.998161    
## CabinE77             -1.862e+01  6.523e+03  -0.003 0.997722    
## CabinE8               1.792e+01  3.975e+03   0.005 0.996403    
## CabinF G63           -1.564e+01  6.523e+03  -0.002 0.998086    
## CabinF G73           -1.664e+01  4.603e+03  -0.004 0.997116    
## CabinF2               1.458e+00  1.304e+00   1.118 0.263681    
## CabinF33              1.714e+01  3.750e+03   0.005 0.996353    
## CabinF4               1.828e+01  3.929e+03   0.005 0.996289    
## CabinG6              -8.603e-01  1.083e+00  -0.795 0.426753    
## CabinT               -1.778e+01  6.523e+03  -0.003 0.997825    
## EmbarkedC             3.712e-01  3.560e-01   1.043 0.297138    
## EmbarkedQ            -4.129e-01  5.877e-01  -0.702 0.482378    
## EmbarkedS                    NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 490.36  on 571  degrees of freedom
## AIC: 776.36
## 
## Number of Fisher Scoring iterations: 17

Among all these variables, only passenger class (Pclass), Sex, Age and SibSp are significant. The Cabin dummies have enormous standard errors because most cabin values identify only one or two passengers, so their coefficients cannot be estimated reliably, and one coefficient (EmbarkedS) is dropped altogether because of collinearity.
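
To pull those terms out of the long summary programmatically, the coefficient matrix from summary() can be filtered on its p-value column. A small helper, not part of the original report (note that the intercept and one sparsely occupied Cabin level also clear the 0.05 cutoff):

# Keep only the rows of the coefficient table with p < 0.05
coefs <- summary(titanic.lr)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]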

These variables, in my opinion, make logical sense for survival. Sex is clearly a major factor: the large negative coefficient for Sexmale matches the “women and children first” evacuation, so men were far less likely to survive. Age matters too: the negative coefficient means younger passengers had a better chance of surviving. The negative SibSp coefficient suggests that traveling with more siblings or spouses actually lowered the odds of survival, perhaps because those passengers stayed together or went looking for family instead of boarding a lifeboat.

The only variable I am less sure about is passenger class. My guess is that passengers in a higher class had higher social status and were given priority during the evacuation, or that ticket class simply reflects access to the lifeboats.

Let’s refit the model with just these variables, this time treating Pclass as a factor rather than a numeric variable.

titanic.lr <- glm(formula = as.factor(Survived) ~ as.factor(Pclass) + as.factor(Sex) + Age + SibSp,
                  data = train, family = 'binomial')
summary(titanic.lr)
## 
## Call:
## glm(formula = as.factor(Survived) ~ as.factor(Pclass) + as.factor(Sex) + 
##     Age + SibSp, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7876  -0.6417  -0.3864   0.6261   2.4539  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         4.334201   0.450700   9.617  < 2e-16 ***
## as.factor(Pclass)2 -1.414360   0.284727  -4.967 6.78e-07 ***
## as.factor(Pclass)3 -2.652618   0.285832  -9.280  < 2e-16 ***
## as.factor(Sex)male -2.627679   0.214771 -12.235  < 2e-16 ***
## Age                -0.044760   0.008225  -5.442 5.27e-08 ***
## SibSp              -0.380190   0.121516  -3.129  0.00176 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 636.56  on 708  degrees of freedom
## AIC: 648.56
## 
## Number of Fisher Scoring iterations: 5

All of the variables now have low p-values and are highly significant. I believe this is the best logistic regression model for this data.
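
To make these coefficients easier to interpret, they can be exponentiated into odds ratios. A quick follow-up, not part of the original workflow:

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
# For example, exp(-2.63) for Sexmale is about 0.07: the odds of survival
# for a man are roughly 7% of the odds for a woman, all else held equal.
exp(coef(titanic.lr))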

4. Prediction

Let’s push the test data through this model, classifying anyone with a predicted survival probability above 0.5 as a survivor.

prob <- predict(titanic.lr, newdata = test, type = "response")  # predicted survival probabilities
prediction <- rep(0, nrow(test))                                # default: did not survive
prediction[prob > 0.5] <- 1                                     # survived if probability > 0.5
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)
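
One caveat not handled in the original code: the test set contains passengers with a missing Age, so predict() returns NA for them and the indexing line above silently leaves their prediction at the default 0. A simple workaround, sketched here under the assumption that median imputation is acceptable, is to fill those ages before predicting:

# Hypothetical refinement: impute missing Age in the test set with the
# median Age from the training data, then redo the predictions
test$Age[is.na(test$Age)] <- median(train$Age, na.rm = TRUE)
prob <- predict(titanic.lr, newdata = test, type = "response")
prediction <- ifelse(prob > 0.5, 1, 0)
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)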

And write it to a submission file.

write.csv(solution, file = 'solution.csv', row.names = FALSE)
head(solution)
##   PassengerID Survived
## 1         892        0
## 2         893        0
## 3         894        0
## 4         895        0
## 5         896        1
## 6         897        0

Thank you for reading my report on using logistic regression to predict Titanic survival!