I’ll let you search for info on the life-cycle of data science projects.

But, the first thing is pretty obviously defining the business problem. We don’t really have one. We are just trying to develop the best model we can to most accurately classify the Kaggle Titanic test dataset. This will likely take a number of iterations.

Generally, the second step is to get the data. We pretty much did that in the previous two posts in this series. Following that is usually the step of cleaning up/wrangling the data. Once again, what little we had to do in this case was done in the previous post.

Now comes the critical step of Exploratory Data Analysis (EDA). Which may include some data visualization (something we’ve had a cursory look at in prior posts) and, perhaps, some initial feature engineering.

In this post we will begin with the EDA process. Perhaps some feature engineering as well, if the EDA indicates there might be something of value.

Exploratory Data Analysis

Let’s start with a look at the features in the data set.

Load Datasets

In [3]:
# imports used throughout this notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
rek_k_tst2 = "./data/titanic/rek_test_2.csv"
In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(kaggle_trn)
k_tst = pd.read_csv(rek_k_tst2)

Examine Feature Set

General Info

In [5]:
print(k_trn.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

From various sources, this is what the features represent.

  • PassengerId: sequential row identifier that ships with the Kaggle CSVs
  • Survived: survival status (0 = No; 1 = Yes)
  • Pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • Name: well, the passenger’s name (duh)
  • Sex: passenger’s gender (female, male)
  • Age: passenger’s age in years; ages under 1 are decimal fractions
  • SibSp: number of siblings/spouses aboard
  • Parch: number of parents/children aboard
  • Ticket: ticket number
  • Fare: passenger fare (British pound)
  • Cabin: allocated cabin
  • Embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

With respect to the family relation variables (i.e. SibSp and Parch) some relations were ignored. I am not going to worry about the exact details as I really don’t plan to account for that in any way.

Now, thinking about which features might be of value in our model:

  • PassengerId: useless m’thinks
  • Survived: knowing/predicting that is the whole purpose of this exercise
  • Pclass: may have some value, as I expect the higher the class the closer they would be to lifeboats and more likely to be treated differently
  • Name: may or may not have any value, but something to have a look at and a think about
  • Sex: I expect this likely had a good deal to do with survival
  • Age: possibly affected survival, especially for the youngest and oldest
  • SibSp and Parch: hard to say for sure, but maybe we should investigate further
  • Ticket: don’t see that the ticket number is going to tell us much
  • Fare: perhaps, as again the higher the fare, the higher the deck you would likely be on, but may be correlated with Pclass
  • Cabin: in and of itself, probably not of much use. But, the cabin information also includes the deck which might be of value
  • Embarked: off the top of my head don’t see that it would be of consequence, but should likely investigate at least a little
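As a quick illustration of the Cabin idea above, pulling the deck letter out of the cabin string is a one-liner. A minimal sketch on a toy frame (hypothetical cabin values, not rows from the real dataset):

```python
import pandas as pd

# toy stand-in for the Cabin column (hypothetical values)
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", "B28", None]})

# first character of the cabin string is the deck; missing cabins stay NaN
toy["Deck"] = toy["Cabin"].str[0]
print(toy["Deck"].tolist())
```

On the real training set the same `.str[0]` would, of course, leave a lot of NaNs given how sparse Cabin is.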

I think I will start by looking at how each feature relates to survival (in the training dataset). If there seems to be some sort of correlation, I will include that feature in the modelling. Though we should also look at the correlations between features, and perhaps not use all of the features that are highly correlated with each other. This is going to take some work and a bit of time.

And, for any of those features I choose to use, I will have to decide how to deal with missing data. And how to present them to the model: use “female” and “male”, or an integer binary encoding, or…
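For example, a minimal sketch of the integer-encoding option, on a toy Sex column (hypothetical rows, and the 0/1 assignment is my arbitrary choice):

```python
import pandas as pd

# toy Sex column (hypothetical values)
toy = pd.DataFrame({"Sex": ["female", "male", "male", "female"]})

# one option: map the strings to a 0/1 integer code the model can consume
toy["SexCode"] = toy["Sex"].map({"female": 0, "male": 1})
print(toy["SexCode"].tolist())  # [0, 1, 1, 0]
```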

Missing Values

Let’s look a bit more at the data. Things like: is there missing data? The DataFrame info() method should get us started.

In [6]:
k_trn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Now, using a different approach, let’s look for missing data in the test dataset.

In [7]:
# looks to be missing data in the training dataset, what about the test dataset
#sns.color_palette("Set2")
fig, ax = plt.subplots(figsize=(5,5))
o_ttl = plt.title("Test Dataset: Missing Values")
o_map = sns.heatmap(k_tst.isnull(), cbar=False, cmap="rocket_r")


heatmap showing missing data in test dataset

As the above shows, the training and test datasets both contain features with missing values. The worst cases are “Age” and “Cabin”. But the training set is also missing some values in the “Embarked” column, and the test dataset in the “Fare” column. We could just leave these features out, but we have no idea how valuable they may be to our model. So that decision will wait. We may in fact have to impute the missing values in some fashion so our model gets as much, and the best, information as possible.
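For the record, here is a numeric companion to the heatmap, plus one simple imputation option (median for Age, most frequent value for Embarked). Sketched on a toy frame with hypothetical values; the choice of median/mode is an assumption, not a decision we have made yet:

```python
import numpy as np
import pandas as pd

# toy frame with gaps in Age and Embarked (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# count missing values per column -- a numeric companion to the heatmap
print(toy.isnull().sum())

# one simple imputation: fill Age with the median, Embarked with the mode
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
print(toy.isnull().sum().sum())  # 0 -- no gaps left
```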

Now, let’s take a closer look at individual features. Not in any particular order.

Feature: Sex

Did gender affect the likelihood of survival?

In [8]:
# gender vs survival
k_trn.groupby('Sex').Survived.mean()
Out[8]:
Sex
female   0.74
male     0.19
Name: Survived, dtype: float64
In [9]:
# or visually
srvvd = k_trn[k_trn["Survived"] == 1]["Sex"].value_counts()
not_s = k_trn[k_trn["Survived"] == 0]["Sex"].value_counts()
df_s_no = pd.DataFrame([srvvd, not_s])
df_s_no.index = ['Survived', 'Did Not']
ax = df_s_no.plot(kind='bar', stacked=True, figsize=(5,5))
stacked barchart showing survival or not by gender

Would definitely appear gender affected survival. Even though male passengers outnumbered female passengers, female survivors outnumbered male survivors. By something like 4 to 1. Would seem the “women” part of “women and children first” was true.

Feature: Age

What effect did age have on survival?

In [10]:
# age vs survival
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Age Distribution")
p_plt = sns.histplot(k_trn["Age"], ax=ax, bins=30, kde=True, color='b')
kde plots of survival by age

Fairly “normal” looking distribution.

In [11]:
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Age Distribution by Survival Status")
p_x = ax.set_xlabel("Age")
# note: fill= replaces the deprecated shade= keyword in newer seaborn
p_p1 = sns.kdeplot(k_trn["Age"].loc[k_trn["Survived"] == 1], ax=ax, label='Survived', fill=True)
p_p2 = sns.kdeplot(k_trn["Age"].loc[k_trn["Survived"] == 0], ax=ax, label='Did Not', fill=True)
kde plots of survival by age

The kernel density estimate for survival by age doesn’t seem to add much additional information. Except perhaps for the little bump in the younger ages. Still, it seems unlikely that age was irrelevant to survival. Recall the “women and children first” concept prevalent at the time.

So, let’s try another chart — swarmplot.

In [12]:
# not much info, try something else
fig, ax = plt.subplots(figsize=(15,5))
ax.grid(True)
p_ttl = ax.set_title("Survival by Age and Gender")
p_xtk = plt.xticks(list(range(0,100,2)))
p_p2 = sns.swarmplot(y="Sex", x="Age", data=k_trn, hue="Survived")
swarmplot showing survival by gender and age

So it would seem that, amongst the males, being between 0 and 12 improved the chance of survival. We likely need to consider age in our modelling.
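If we do end up engineering an age feature, binning is one option. A sketch using `pd.cut` on toy ages; the band edges and labels are my assumptions (picked to isolate the 0–12 group the swarmplot highlights), not anything fixed by the data:

```python
import pandas as pd

# toy ages (hypothetical values)
ages = pd.Series([4, 11, 15, 30, 70])

# bin Age into bands; the 0-12 band captures the "children first" effect
bands = pd.cut(ages, bins=[0, 12, 18, 60, 100],
               labels=["child", "teen", "adult", "senior"])
print(bands.tolist())  # ['child', 'child', 'teen', 'adult', 'senior']
```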

Feature: Pclass

Okay, let’s look at the probable effect of social status on survival via the Pclass feature.

In [13]:
# let's checkout Pclass
k_trn.groupby(['Pclass']).Survived.mean().to_frame()
Out[13]:
        Survived
Pclass
1           0.63
2           0.47
3           0.24
In [14]:
# how about a chart
# yes, I know, should have created a function
srvvd = k_trn[k_trn["Survived"] == 1]["Pclass"].value_counts()
not_s = k_trn[k_trn["Survived"] == 0]["Pclass"].value_counts()
df_s_no = pd.DataFrame([srvvd, not_s])
df_s_no.index = ['Survived', 'Did Not']
ax = df_s_no.plot(kind='bar', stacked=True, figsize=(5,5))
stacked barchart showing survival by Pclass feature

Let’s look at the numbers for some perspective.

In [15]:
# to help understand the above, let's look at the numbers
pd.pivot_table(k_trn, index='Survived', columns='Pclass', values='Ticket', aggfunc='count')
Out[15]:
Pclass     1   2    3
Survived
0         80  97  372
1        136  87  119

Seems fairly clear that Pclass was a significant factor with respect to passenger survival. About 500 passengers (~55%) had third class tickets. But only 24% of them survived. Whereas ~63% of first class and ~47% of second class passengers survived.
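As a sanity check on those percentages, we can rebuild the survival rates directly from the pivot counts above:

```python
import pandas as pd

# rebuild the pivot counts (rows: Survived=0/1, columns: Pclass)
counts = pd.DataFrame({1: [80, 136], 2: [97, 87], 3: [372, 119]},
                      index=[0, 1])

# survivors divided by total passengers in each class
rates = counts.loc[1] / counts.sum()
print(rates.round(2).to_dict())  # {1: 0.63, 2: 0.47, 3: 0.24}
```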

But we should check whether class or gender is the bigger factor.

In [16]:
k_trn.groupby(['Pclass', 'Sex']).Survived.mean().to_frame()
Out[16]:
               Survived
Pclass Sex
1      female      0.97
       male        0.37
2      female      0.92
       male        0.16
3      female      0.50
       male        0.14

Regardless of class, gender is the most important factor when it comes to survival. Though, men in first class were twice as likely to survive as those in the other two classes. Which probably indicates we should keep both features in our model.

Before moving on, let’s see how Pclass compares with Age as a metric for survival.

In [17]:
# let's also look at Pclass vs Age
# yes another function needed
fig, ax = plt.subplots(figsize=(15,5))
ax.grid(True)
p_ttl = ax.set_title("Survival by Age and Pclass")
p_xtk = plt.xticks(list(range(0,100,2)))
p_p2 = sns.swarmplot(y="Age", x="Pclass", data=k_trn, hue="Survived")
swarmplot showing survival by age and pclass

That seems to confirm the importance of Pclass on the likelihood of survival.

You may also have noticed (took me a while) that first class appears to have fewer children amongst its passengers than the two other classes. Are children left home with the nanny when parents are off having fun?

Feature: Embarked

Intuitively, this one should, to me, have little or no significance in helping determine the odds of survival. But more than one of the articles I have read suggests one should not jump to conclusions, and should at least take a brief, “unbiased” look at what the feature values may be saying.

In [18]:
# how about point of embarkation
# use display() so both outputs show, not just the last expression
display(k_trn['Embarked'].value_counts().to_frame())
pd.pivot_table(k_trn, index="Survived", columns="Embarked", values="Ticket", aggfunc="count")
Out[18]:
   Embarked
S       644
C       168
Q        77
Out[18]:
Embarked   C   Q    S
Survived
0         75  47  427
1         93  30  217

Looking at the above, it seems there might be some differences in survival rate based on point of embarkation. But maybe an underlying element better explains those differences.

Let’s start by looking at the higher survival rate for those boarding at Cherbourg versus the much lower rate for those boarding at Southampton.

First we will have a look at the travelling class of those boarding at those locations.

In [19]:
# why high survival rate at Cherbourg and lowest at Southampton
# due to numbers of first class and/or third class passengers embarking at those points?
#k_trn.groupby(['Embarked', 'Pclass']).Survived.sum().to_frame()

def count_zeros(s):
    return s.size - s.sum()

k_trn.groupby(['Embarked', 'Pclass']).agg(
    Survived=pd.NamedAgg(column='Survived', aggfunc='sum'),
    DidNot=pd.NamedAgg(column='Survived', aggfunc=count_zeros),
)

Out[19]:
                 Survived  DidNot
Embarked Pclass
C        1             59      26
         2              9       8
         3             25      41
Q        1              1       1
         2              2       1
         3             27      45
S        1             74      53
         2             76      88
         3             67     286

And, now let’s consider the gender of those boarding at each harbour.

In [20]:
# did gender balance make a difference in survival rates, e.g. Queenstown vs Southampton
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Count by Embarkation Point and Gender")
p_p = sns.countplot(x="Embarked", data=k_trn, ax=ax, hue="Sex")
plot of counts by embarcation point and gender

The bulk of the passengers boarded at Southampton, and the majority of those were in third class. Whereas at Cherbourg, third class passengers were a minority of the boarders. So the discrepancy in survival is more likely explained by the class of the passengers boarding at each location than by the location itself.

Whereas for Queenstown, it looks like the gender of the boarding passengers more accurately explains the overall survival rate for that group.
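To back that up with numbers, here is the class mix per port, rebuilt from the Survived/DidNot counts in the table above (each class count is Survived + DidNot):

```python
import pandas as pd

# boarders per class at each port, from the Survived/DidNot table
mix = pd.DataFrame(
    {"C": [59 + 26, 9 + 8, 25 + 41],
     "Q": [1 + 1, 2 + 1, 27 + 45],
     "S": [74 + 53, 76 + 88, 67 + 286]},
    index=[1, 2, 3])  # rows: Pclass

# share of each class among boarders at each port
share = (mix / mix.sum()).round(2)
print(share)
```

Over half of Cherbourg’s boarders were first class, while Southampton and (especially) Queenstown skewed heavily third class, which lines up with the argument above.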

Done m’thinks

Okay, this post is getting rather lengthy. So, I am going to take a break and continue this exploratory data analysis in the next one.

Feel free to download and play with my version of this post’s related notebook.

Until next time, be happy and safe.

Resources