Let’s continue with the exploratory data analysis we started in the last post. We still have the following features to have a look at: SibSp, Parch, Fare, Cabin and perhaps Name.

Exploratory Data Analysis

Dataset

After loading the datasets, we’ll give ourselves a bit of a reminder about what’s in the training dataset.

In [4]:

# load the datasets currently of interest
k_trn = pd.read_csv(kaggle_trn)
k_tst = pd.read_csv(rek_k_tst2)

In [5]:

# a wee reminder
k_trn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

And let’s continue investigating our features.

Feature: Fare

Fare may tell us something about the odds of survival, but I expect it is closely correlated with Pclass and may not be of any independent value. But, let’s have a look just in case.
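
One quick way to check that hunch is to compare the median fare per class and look at a rank correlation between the two columns. The following is only a sketch: the toy rows are made up (they are not Titanic data) and fare_vs_class is a hypothetical helper name. In the notebook it would presumably be called as fare_vs_class(k_trn).

```python
import pandas as pd

# Toy rows for illustration only; these values are made up, not Titanic data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 52.0, 120.0, 26.0, 13.0, 7.9, 8.05, 7.25],
})


def fare_vs_class(df):
    """Return the median fare per class and the rank correlation of Fare with Pclass."""
    medians = df.groupby("Pclass")["Fare"].median()
    # Pearson correlation of the ranks is the Spearman rank correlation.
    rho = df["Pclass"].rank().corr(df["Fare"].rank())
    return medians, rho


medians, rho = fare_vs_class(toy)
print(medians)
print(f"rank correlation: {rho:.2f}")
```

A strongly negative value (higher class number, lower fare) would support the idea that Fare adds little beyond Pclass, though it would not prove it.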

In [6]:

# did the fare paid influence the survival rate
# let's look at the fare feature info
k_trn.Fare.describe()

Out[6]:

count   891.00
mean     32.20
std      49.69
min       0.00
25%       7.91
50%      14.45
75%      31.00
max     512.33
Name: Fare, dtype: float64

In [7]:

fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Fare Distribution")
p_plt = sns.histplot(k_trn["Fare"], ax=ax, bins=50, kde=True, color='b')

Hardly a normal distribution. The 3rd quartile tells us that 75% of paid fares were £31 or less. I wonder what a £512 ticket got Miss. Anna Ward.

The prior approach of plotting Fare vs Survived won’t really work here. Just too many different values for Fare. So, let’s try putting the fares into a small number of categories.

In [8]:

# categorize fares and plot against survival - let's just use the quartiles for now
f_cats = ['Lowest', "Medium", "High", "Highest"]
f_rngs = pd.qcut(k_trn["Fare"], len(f_cats), labels=f_cats)
fig, ax = plt.subplots(figsize=(5, 5))
p_ttl = ax.set_title("Survival vs Fare")
p_y = ax.set_ylabel("Survival Rate")
p_plt = sns.barplot(x=f_rngs, y=k_trn.Survived, ax=ax, ci=None)

bar plot of survival against 4 categories of fare
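
The same quartile binning can also be summarized as a small table, which makes the bar heights easier to read off. This is a minimal sketch on made-up rows. survival_by_fare_bin is a hypothetical helper, and the labels simply mirror the ones used in the cell above.

```python
import pandas as pd

# Made-up rows for illustration; eight distinct fares so each quartile gets two.
toy = pd.DataFrame({
    "Fare": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})


def survival_by_fare_bin(df, labels=("Lowest", "Medium", "High", "Highest")):
    """Survival rate and passenger count per fare quantile bin."""
    # qcut splits on quantiles, so each bin holds roughly the same number of rows.
    bins = pd.qcut(df["Fare"], len(labels), labels=list(labels))
    return df.groupby(bins, observed=True)["Survived"].agg(["mean", "count"])


rates = survival_by_fare_bin(toy)
print(rates)
```

The count column is worth keeping next to the mean: a high survival rate in a bin with only a handful of passengers says much less than the same rate in a large bin.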

In [9]:

# let's add gender to the equations
fig, ax = plt.subplots(figsize=(20,8))
ax.grid(True)
p_ttl = ax.set_title("Survival by Fare and Gender")
p_xtk = plt.xticks(list(range(0,100,2)))
p_p2 = sns.swarmplot(y="Fare", x="Sex", data=k_trn, hue="Survived", size=5)

swarmplot showing survival by gender and fare

Sorry, that image is a little small, and the plot takes a bit of time to run (~11 sec). But everyone who paid over £500 survived. Perhaps strangely, the males paying between £200 and £300 all perished, while the females paying fares in that range all survived. This might be something a model can use to improve its accuracy.

Also, the minimum fare was listed as £0. Unlikely a trip on the maiden voyage of the Titanic would be given to anyone for free. This is probably something we should deal with before training the model.
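
I have not settled on how to handle those zero fares. One possible approach, and it is only an assumption on my part, is to treat a £0 fare as missing and fill it with the median fare of the passenger's class. A sketch on made-up rows, with fill_zero_fares as a hypothetical helper:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [0.0, 60.0, 80.0, 0.0, 7.0, 9.0],
})


def fill_zero_fares(df):
    """Replace zero fares with the median non-zero fare of the same Pclass."""
    out = df.copy()
    # Treat a fare of 0 as missing rather than as a genuine price.
    out["Fare"] = out["Fare"].replace(0, np.nan)
    # transform("median") skips NaN, so zeros do not drag the median down.
    class_median = out.groupby("Pclass")["Fare"].transform("median")
    out["Fare"] = out["Fare"].fillna(class_median)
    return out


n_zero = int((toy["Fare"] == 0).sum())
filled = fill_zero_fares(toy)
print(n_zero)
print(filled)
```

Counting the zero fares first, as n_zero does here, is a sensible check before deciding whether the fix is worth the trouble.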

Feature: SibSp and Parch

Now, how likely is it that the number of relatives one is travelling with would influence one’s chances of survival?

Let’s start by visualizing these features.

In [10]:

# let's have a look at SibSp and Parch features
fig, ax = plt.subplots(figsize=(5, 5))
p_ttl = ax.set_title("Survival vs Siblings/Spouses Aboard")
p_y = ax.set_ylabel("Survival Rate")
p_plt = sns.barplot(x="SibSp", y="Survived", data=k_trn, ax=ax, ci=None)

In [11]:

fig, ax = plt.subplots(figsize=(5, 5))
p_ttl = ax.set_title("Survival vs Parents/Children Aboard")
p_y = ax.set_ylabel("Survival Rate")
p_plt = sns.barplot(x="Parch", y="Survived", data=k_trn, ax=ax, ci=None)

In [12]:

fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Survival Count by Siblings/Spouses Onboard")
p_p = sns.countplot(x="SibSp", data=k_trn, ax=ax, hue="Survived")

count plot of survival status by SibSp feature

In [13]:

fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Survival Count by Parents/Children Onboard")
p_p = sns.countplot(x="Parch", data=k_trn, ax=ax, hue="Survived")

count plot of survival status by Parch feature

Looks like those with a SibSp count of 1 (perhaps 2 as well) and those with a Parch count of 1 to 3 inclusive have a slightly higher likelihood of survival.

Seems to me that it would be better to combine the Parch and SibSp values into a new feature indicating family size, FamilySize. I know we are getting into feature engineering, but let’s just have a look. I am including the passenger in the family size value.

In [14]:

# what if we combine SibSp and Parch to get "family size", adding one for the passenger themselves
k_trn['FamilySize'] = k_trn['Parch'] + k_trn['SibSp'] + 1

In [15]:

fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Survival Count by Family Size Onboard")
p_p = sns.countplot(x="FamilySize", data=k_trn, ax=ax, hue="Survived")

count plot of survival status by family size engineered feature

Looks like there might be a slightly improved probability of survival for family sizes of 2-4. Those travelling alone or in families larger than 4 do not fare well at all.
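
Count plots make it hard to compare rates when the groups differ a lot in size, so a table of survival rate per family size can help. This is a sketch on made-up rows, and family_survival is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "SibSp": [0, 1, 1, 0, 4, 0],
    "Parch": [0, 0, 2, 0, 2, 1],
    "Survived": [0, 1, 1, 0, 0, 1],
})


def family_survival(df):
    """Survival rate and passenger count per family size (passenger included)."""
    size = (df["SibSp"] + df["Parch"] + 1).rename("FamilySize")
    return df.groupby(size)["Survived"].agg(["mean", "count"])


by_size = family_survival(toy)
print(by_size)
```

The helper computes the family size on the fly, so it does not depend on the FamilySize column having been added to the DataFrame already.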

Something else to look at is perhaps how many people are travelling on one ticket. This might be families, as above, but it could also be other groups, relatives or not. This might be something to consider when we get into feature engineering.
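
Counting passengers per ticket is straightforward in pandas. A minimal sketch, assuming a hypothetical column name TicketGroupSize and made-up ticket values:

```python
import pandas as pd

# Made-up ticket values for illustration only.
toy = pd.DataFrame({"Ticket": ["A1", "A1", "B2", "A1", "C3", "B2"]})


def add_ticket_group_size(df):
    """Add a column giving how many rows share each passenger's ticket."""
    out = df.copy()
    # value_counts gives rows per ticket; map looks that count up for each row.
    out["TicketGroupSize"] = out["Ticket"].map(out["Ticket"].value_counts())
    return out


grouped = add_ticket_group_size(toy)
print(grouped)
```

One caveat: a group may be split between the training and test files, so counts taken from the training data alone could understate the real group sizes.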

Feature: Cabin, Name, Ticket

I am not going to do any tabular or visual analysis on these three features.

I do think that using the Cabin feature to get each passenger’s assigned deck would very likely be useful information for the model. Unfortunately, there are way too many missing Cabin values: only 204 of the 891 rows have one. I did think about trying to use Fare to predict the deck assigned to passengers with missing Cabin information, but I am really not sure there are enough cabin values to make that possible. And Fare is not necessarily an individual value for each passenger. There are numerous cases where multiple passengers are travelling on one ticket, which could inflate the upper fare values for a given deck. That said, I may give it a shot down the line. For the next while, I am going to drop this feature from the dataset used to train the models I will be testing.
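
For reference, pulling the deck out of a cabin value is just a matter of taking its first character. The sketch below assumes the deck is always that leading letter and uses a hypothetical label "U" for missing cabins. Apart from "B96 B98", which appears in the describe() output further down, the cabin values are only illustrative.

```python
import pandas as pd

# Illustrative cabin values; None stands in for a missing Cabin entry.
toy = pd.DataFrame({"Cabin": ["C85", None, "B96 B98", None, "E46"]})


def add_deck(df, missing_label="U"):
    """Add a Deck column: first character of Cabin, or missing_label if absent."""
    out = df.copy()
    # .str[0] leaves missing values as NaN, which fillna then labels.
    out["Deck"] = out["Cabin"].str[0].fillna(missing_label)
    return out


decks = add_deck(toy)
print(decks)
```

With so few known cabins, the "U" group would dominate, which is part of why I am hesitant about this feature.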

Similarly, Name would appear to be pretty unhelpful as a model feature. But, the title in each name may provide some clues for the model. I will most likely try to incorporate that information in the model somewhere down the line. As with the cabin info, not sure when.
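
Extracting the title could look something like the sketch below. It assumes every name follows the "Surname, Title. Given names" pattern seen in the dataset. add_title is a hypothetical helper, and apart from "Zimmerman, Mr. Leo" the names are made up.

```python
import pandas as pd

# Only the first name comes from the dataset; the other two are made up.
toy = pd.DataFrame({
    "Name": [
        "Zimmerman, Mr. Leo",
        "Doe, Mrs. Jane (Joan Roe)",
        "Roe, Miss. Ann",
    ]
})


def add_title(df):
    """Add a Title column holding the text between the comma and the first period."""
    out = df.copy()
    # Capture everything after "Surname," up to (not including) the next period.
    pattern = r",\s*([^.,]+)\."
    out["Title"] = out["Name"].str.extract(pattern, expand=False).str.strip()
    return out


titled = add_title(toy)
print(titled)
```

Names that do not match the pattern come back as NaN, so a quick check for missing titles after running this on the full dataset would be prudent.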

As for Ticket, it seems of little or no value to me. But one article I looked at seemed to indicate that there was information regarding passenger class in the ticket values. So far I have been too lazy to dig into the suggestion.

In [16]:

k_trn["Cabin"].describe()

Out[16]:

count         204
unique        147
top       B96 B98
freq            4
Name: Cabin, dtype: object

In [17]:

k_trn["Ticket"].describe()

Out[17]:

count          891
unique         681
top       CA. 2343
freq             7
Name: Ticket, dtype: object

In [18]:

k_trn["Name"].describe()

Out[18]:

count                    891
unique                   891
top       Zimmerman, Mr. Leo
freq                       1
Name: Name, dtype: object

Done

I think I will end our EDA exercise at this point. In the next post, I am going to look at dealing with the missing data in the embarkation and age features. You may recall a previous post covering one or both of these (see the Dealing with Missing Data section). I will also look at what to do with the non-numeric features in the dataset (many models don’t handle features with string values very well).

Feel free to download and play with my version of this post’s related notebook.

Resources

To make life a touch easier, I have copied over the resources section from the previous post. Wasn’t going to, but…

pandas.DataFrame.groupby
pandas.DataFrame.plot
Group by: split-apply-combine
seaborn.countplot
seaborn.heatmap
seaborn.histplot
seaborn.swarmplot
Description of Titanic dataset
Comprehensive Guide to Grouping and Aggregating with Pandas
Multiple aggregations of the same column using pandas GroupBy.agg()

Too Old To Code

Titanic Dataset: Exploratory Data Analysis, Part II