Let’s continue with that exploratory data analysis we started last post. We still have the following features to have a look at: SibSp, Parch, Fare, Cabin and perhaps Name.
Exploratory Data Analysis
Dataset
After loading the datasets, we’ll give ourselves a bit of a reminder about what’s in the training dataset.
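import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load the datasets currently of interest (file paths assumed to be defined earlier in the notebook)
k_trn = pd.read_csv(kaggle_trn)
k_tst = pd.read_csv(rek_k_tst2)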
# a wee reminder
k_trn.info()
And let’s continue investigating our features.
Feature: Fare
Fare may tell us something about the odds of survival, but I expect it is closely correlated with Pclass and may not be of any independent value. But, let’s have a look just in case.
# did the fare paid influence the survival rate
# let's look at the fare feature info
k_trn.Fare.describe()
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Fare Distribution")
p_plt = sns.histplot(k_trn["Fare"], ax=ax, bins=50, kde=True, color='b')
Hardly a normal distribution. 75% (3rd quartile) of paid fares were £31 or less. Wonder what a £512 ticket got for Miss. Anna Ward.
The prior approach of plotting Fare vs Survived won’t really work here. Just too many different values for Fare. So, let’s try putting the fares into a small number of categories.
# categorize fares and plot against survival - let's just use the quartiles for now
f_cats = ['Lowest', "Medium", "High", "Highest"]
f_rngs = pd.qcut(k_trn["Fare"], len(f_cats), labels=f_cats)
fig, ax = plt.subplots(figsize=(5, 5))
p_ttl = ax.set_title("Survival vs Fare")
p_y = ax.set_ylabel("Survival Rate")
p_plt = sns.barplot(x=f_rngs, y=k_trn.Survived, ax=ax, ci=None)
# let's add gender to the equation
fig, ax = plt.subplots(figsize=(20,8))
ax.grid(True)
p_ttl = ax.set_title("Survival by Fare and Gender")
p_xtk = plt.xticks(list(range(0,100,2)))
p_p2 = sns.swarmplot(y="Fare", x="Sex", data=k_trn, hue="Survived", size=5)
Sorry, that image is a little small, and the plot takes a bit of time to run (~11 sec). But everyone who paid over £500 survived. Perhaps strangely, males paying between £200 and £300 all perished, while females paying fares in that range all survived. This is something a model might be able to use to improve its accuracy.
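A swarm plot is hard to read precisely, so a small table of survival rate by fare band and gender can back up the eyeball impression. Here is a minimal sketch of that idea. The band edges, the labels, the helper name `survival_by_fare_band` and the toy rows are all my own assumptions for illustration; swap in `k_trn` to see the real numbers.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Fare": [7.25, 8.05, 250.0, 263.0, 30.0, 512.33],
    "Survived": [0, 1, 0, 1, 1, 1],
})


def survival_by_fare_band(df):
    """Return survival rate and passenger count per (fare band, sex)."""
    # Hypothetical band edges; chosen only so that 200-300 is its own band.
    edges = [0, 50, 100, 200, 300, 600]
    labels = ["0-50", "50-100", "100-200", "200-300", "300+"]
    bands = pd.cut(df["Fare"], bins=edges, labels=labels,
                   include_lowest=True).rename("FareBand")
    return (
        df.groupby([bands, "Sex"], observed=True)["Survived"]
          .agg(rate="mean", n="count")
          .reset_index()
    )


fare_sex = survival_by_fare_band(toy)
print(fare_sex)
```

The `n` column matters as much as the rate: a 0% or 100% survival rate built on a handful of passengers is a weak signal.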
Also, the minimum fare was listed as £0. It seems unlikely that a trip on the maiden voyage of the Titanic would be given to anyone for free. This is probably something we should deal with before training the model.
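One possible way to deal with the £0 fares is to treat them as missing and fill them with the median fare for the same passenger class. To be clear, that is just one option I am sketching here, not a decision I have made for the model. The helper name `fill_zero_fares` and the toy rows are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the Titanic data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [0.0, 80.0, 100.0, 0.0, 7.0, 9.0],
})


def fill_zero_fares(df):
    """Treat zero fares as missing, then fill with the median fare of the same Pclass."""
    out = df.copy()
    # A zero fare becomes NaN so it no longer drags down the class median.
    out["Fare"] = out["Fare"].replace(0, np.nan)
    class_median = out.groupby("Pclass")["Fare"].transform("median")
    out["Fare"] = out["Fare"].fillna(class_median)
    return out


print((toy["Fare"] == 0).sum(), "zero fares before")
fixed = fill_zero_fares(toy)
print(fixed)
```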
Feature: SibSp and Parch
Now, how likely is it that the number of relatives one is travelling with would influence one’s chances of survival?
Let’s start by visualizing these features.
# let's have a look at SibSp and Parch features
fig, ax = plt.subplots(figsize=(5, 5))
p_ttl = ax.set_title("Survival vs Siblings/Spouses Aboard")
p_y = ax.set_ylabel("Survival Rate")
p_plt = sns.barplot(x="SibSp", y="Survived", data=k_trn, ax=ax, ci=None)
fig, ax = plt.subplots(figsize=(5, 5))
p_ttl = ax.set_title("Survival vs Parents/Children Aboard")
p_y = ax.set_ylabel("Survival Rate")
p_plt = sns.barplot(x="Parch", y="Survived", data=k_trn, ax=ax, ci=None)
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Survival Count by Siblings/Spouses Onboard")
p_p = sns.countplot(x="SibSp", data=k_trn, ax=ax, hue="Survived")
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Survival Count by Parents/Children Onboard")
p_p = sns.countplot(x="Parch", data=k_trn, ax=ax, hue="Survived")
Looks like those with SibSp count of 1, perhaps 2 as well, and those with a Parch count of 1-3 inclusive have a slightly higher likelihood of survival.
Seems to me that it would be better to combine the Parch and SibSp values into a new feature indicating family size, FamilySize. I know we are getting into feature engineering, but let’s just have a look. I am including the passenger in the family size value.
# what if we combine SibSp and Parch to get "family size", adding one for the passenger themselves
k_trn['FamilySize'] = k_trn['Parch'] + k_trn['SibSp'] + 1
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Survival Count by Family Size Onboard")
p_p = sns.countplot(x="FamilySize", data=k_trn, ax=ax, hue="Survived")
Looks like there might be a slightly improved probability of survival for family sizes of 2-4. Travelling alone or family sizes greater than 4 do not fare well at all.
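A count plot shows raw numbers, so it can help to also look at the survival rate for each family size alongside the count. A minimal sketch, using made-up rows and a hypothetical helper name `family_size_rates`; run it against `k_trn` for the real figures.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the Titanic data.
toy = pd.DataFrame({
    "SibSp": [0, 1, 1, 4, 0],
    "Parch": [0, 0, 2, 2, 0],
    "Survived": [0, 1, 1, 0, 1],
})


def family_size_rates(df):
    """Survival rate and passenger count per family size (passenger included)."""
    size = (df["SibSp"] + df["Parch"] + 1).rename("FamilySize")
    return df.groupby(size)["Survived"].agg(rate="mean", n="count")


fam = family_size_rates(toy)
print(fam)
```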
Something else to look at is perhaps how many people are travelling on one ticket. This might be families, as above, but it could also be other groups, relatives or not. This might be something to consider when we get into feature engineering.
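Counting passengers per ticket could look something like the sketch below. The column name `TicketGroupSize`, the helper name and the toy ticket values are my own inventions for the example. Also note that counting within the training set alone may understate group sizes, since other members of a group could be in the test set.

```python
import pandas as pd

# Toy rows for illustration only; ticket values are made up.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["T100", "T200", "T200", "T300"],
})


def add_ticket_group_size(df):
    """Add a column with the number of passengers sharing each ticket."""
    out = df.copy()
    # value_counts gives passengers per ticket; map puts it back on each row.
    out["TicketGroupSize"] = out["Ticket"].map(out["Ticket"].value_counts())
    return out


grouped = add_ticket_group_size(toy)
print(grouped)
```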
Feature: Cabin, Name, Ticket
I am not going to do any tabular or visual analysis on these three features.
I do think that using the Cabin feature to get each passenger’s assigned deck would very likely be useful information for the model. Unfortunately, there are way too many missing Cabin values. I did think about trying to use Fare to predict the deck assigned to passengers with missing Cabin information. But I’m really not sure there are enough cabin values to make that possible. And Fare is not necessarily an individual value for each passenger: there are numerous cases where multiple passengers are travelling on one ticket, which could inflate the upper fare values for a given deck. That said, I may give it a shot down the line. For the next while, I am going to drop this feature from the dataset used to train the models I will be testing.
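For the record, pulling a deck out of Cabin would probably be as simple as taking the first letter of the value. This is only a sketch under that assumption; the "Unknown" label, the helper name `extract_deck` and the toy cabin values are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; NOT taken from the Titanic data.
toy = pd.DataFrame({"Cabin": ["C10", None, "E20", "B30 B32", np.nan]})


def extract_deck(cabin):
    """Take the first character of each Cabin value as the deck letter."""
    # Missing cabins stay missing after .str[0], so label them explicitly.
    return cabin.str[0].fillna("Unknown").rename("Deck")


deck = extract_deck(toy["Cabin"])
print(deck.value_counts())
```

Entries listing more than one cabin simply get the deck of the first one here, which may or may not be good enough.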
Similarly, Name would appear to be pretty unhelpful as a model feature. But, the title in each name may provide some clues for the model. I will most likely try to incorporate that information in the model somewhere down the line. As with the cabin info, not sure when.
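Getting the title out of Name should be a small regex job, assuming names follow the "Surname, Title. Given names" pattern. A rough sketch; the helper name `extract_title` and the sample names are invented for the example, and the pattern is my guess at something workable rather than a tested solution for every name in the dataset.

```python
import pandas as pd

# Made-up names in the assumed "Surname, Title. Given names" format.
names = pd.Series([
    "Smith, Mr. John",
    "Jones, Mrs. Mary (Ann Brown)",
    "Doe, Miss. Jane",
])


def extract_title(name):
    """Return the text between the comma and the first period of each name."""
    return (
        name.str.extract(r",\s*([^.]+)\.", expand=False)
            .str.strip()
            .rename("Title")
    )


titles = extract_title(names)
print(titles.value_counts())
```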
As for Ticket, it seems of little or no value to me. But one article I looked at seemed to indicate that there was information regarding passenger class in the ticket values. So far I have been too lazy to dig into the suggestion.
k_trn["Cabin"].describe()
k_trn["Ticket"].describe()
k_trn["Name"].describe()
Done
I think I will end our EDA exercise at this point. In the next post, I am going to look at dealing with the missing data in the embarkation and age features. You may recall a previous post covering one or both of these (see the Dealing with Missing Data section). I will also look at what to do with the non-numeric features in the dataset (many models don’t handle features with string values very well).
Feel free to download and play with my version of this post’s related notebook.
Resources
To make life a touch easier, I have copied over the resources section from the previous post. Wasn’t going to, but…
- pandas.DataFrame.groupby
- pandas.DataFrame.plot
- Group by: split-apply-combine
- seaborn.countplot
- seaborn.heatmap
- seaborn.histplot
- seaborn.swarmplot
- Description of Titanic dataset
- Comprehensive Guide to Grouping and Aggregating with Pandas
- Multiple aggregations of the same column using pandas GroupBy.agg()