I thought I’d take a step back and do a little more exploratory data analysis on our dataset, to see if we can spot any good candidates for feature selection and testing.
As is typical with EDA, most of the analysis will be done with charts.
Big Picture
Load our data.
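# imports used throughout this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # built in to notebooks, explicit elsewhere
# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
# full dataset: training rows followed by test rows, with a fresh index
k_all = pd.concat([k_trn, k_tst], ignore_index=True)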
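As a quick sanity check on the concatenation, here is a minimal sketch using two tiny made-up frames (the `trn`, `tst` and `both` names and all values are hypothetical, not the real files). It also shows what happens when one frame lacks a column: the missing values come through as NaN. Whether the engineered test file carries a `Survived` column is an assumption I am not making either way here.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (hypothetical values).
trn = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
tst = pd.DataFrame({"PassengerId": [4, 5]})  # no Survived column in this toy frame

# Same pattern as for k_all: stack the rows and rebuild the index.
both = pd.concat([trn, tst], ignore_index=True)

print(both.shape)                      # rows from both frames, union of columns
print(both["Survived"].isna().sum())   # rows with no survival label
```

If the second count is non-zero on the real data, any survival rate computed on the combined frame only reflects the labelled rows.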
Likelihood of Survival by Feature
Now let’s plot some of our features against survival. I’ll start by defining myself a utility function for plotting the barcharts.
def plot_bar(x_nm, y_nm, d_src, fg, **kwargs):
    """Draw a seaborn bar chart of y_nm by x_nm on the current axes.

    Optional kwargs: xlbl (x-axis label), hue, hue_order.
    fg (the figure) is accepted but not currently used.
    """
    xlbl = kwargs.get('xlbl')
    hue = kwargs.get('hue')
    hue_order = kwargs.get('hue_order')
    # fall back to the column name if no label was supplied
    use_x = xlbl or x_nm
    _ = sns.barplot(x=x_nm, y=y_nm, data=d_src, hue=hue, hue_order=hue_order)
    _ = plt.xticks(rotation=0)
    _ = plt.xlabel(use_x, weight="bold")
sns.set(font_scale=1.5)
# note: on matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(40, 24))
fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.3, hspace=0.35)
ax1 = fig.add_subplot(3, 3, 1)
plot_bar("Sex", "Survived", k_trn, fig)
ax2 = fig.add_subplot(3, 3, 2, sharey=ax1)
plot_bar("Pclass", "Survived", k_trn, fig)
ax3 = fig.add_subplot(3, 3, 3, sharey=ax1)
plot_bar("Embarked", "Survived", k_trn, fig)
ax4 = fig.add_subplot(3, 3, 4)
plot_bar("SibSp", "Survived", k_trn, fig)
ax5 = fig.add_subplot(3, 3, 7, sharey=ax4)
plot_bar("Parch", "Survived", k_trn, fig)
ax6 = fig.add_subplot(3, 3, 6, sharey=ax4)
plot_bar("FamilySize", "Survived", k_trn, fig)
ax7 = fig.add_subplot(3, 3, 9, sharey=ax4)
plot_bar("Group", "Survived", k_trn, fig, xlbl="Group (Size)")
- Sex and Pclass both show a strong influence on survival.
- FamilySize of 2-4 seems to improve the likelihood of survival. That likelihood drops for FamilySize >= 5.
- Embarked seems to show that those who embarked in Cherbourg had a higher chance of survival than those who embarked elsewhere. As unlikely as that seems, we may need to investigate this feature in more detail.
- Group (size) looks a lot like what is seen with FamilySize. Survival increases up to a group size of 4, then drops off sharply for larger group sizes. But the confidence bounds for this variable look much tighter than those for FamilySize. Perhaps Group may be a better feature for model training than FamilySize.
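The "tighter confidence bounds" remark can be roughly checked with numbers rather than by eye. The sketch below uses a small made-up frame (`toy` and its values are hypothetical; the real input would be `k_trn`) and reports, per group size, the survival rate, the number of passengers and the standard error of the mean. The standard error is only a rough proxy for the bootstrapped intervals seaborn draws, but it makes the point that small categories give wide bounds.

```python
import pandas as pd

# Hypothetical toy data; on the real data this would be k_trn.
toy = pd.DataFrame({
    "Group":    [1, 1, 1, 1, 2, 2, 2, 2, 5, 5],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
})

# Survival rate, sample size and standard error for each group size.
summary = toy.groupby("Group")["Survived"].agg(rate="mean", n="count", sem="sem")
print(summary)
```

Running the same aggregation for both `FamilySize` and `Group` would show whether the difference in the bars comes from the counts per category.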
Compare Survival for Training Set and Full Dataset
I thought I’d have a look to see if there was a significant difference in survival rate, for a few features, between the full and the training datasets.
# let's compare survival rate of training set against complete dataset for sex, class and embarkation site
fig = plt.figure(figsize=(40, 16))
fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.3, hspace=0.35)
ax1 = fig.add_subplot(2, 3, 1)
plot_bar("Sex", "Survived", k_all, fig, xlbl="Sex (all)")
ax2 = fig.add_subplot(2, 3, 2, sharey=ax1)
plot_bar("Pclass", "Survived", k_all, fig, xlbl="Pclass (all)")
ax3 = fig.add_subplot(2, 3, 3, sharey=ax1)
plot_bar("Embarked", "Survived", k_all, fig, xlbl="Embarked (all)")
ax4 = fig.add_subplot(2, 3, 4, sharey=ax1)
plot_bar("Sex", "Survived", k_trn, fig, xlbl="Sex (train)")
ax5 = fig.add_subplot(2, 3, 5, sharey=ax1)
plot_bar("Pclass", "Survived", k_trn, fig, xlbl="Pclass (train)")
ax6 = fig.add_subplot(2, 3, 6, sharey=ax1)
plot_bar("Embarked", "Survived", k_trn, fig, xlbl="Embarked (train)")
In a few cases, the training set differs measurably from the full dataset. I wonder how that might affect the predictions?
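To put numbers on "differs measurably", the rates can be laid side by side. This is only a sketch on made-up frames (`trn`, `tst`, `full`, the `rate_by` helper and all values are hypothetical), and it assumes the test rows carry survival labels; without them the two columns would be identical.

```python
import pandas as pd

# Hypothetical toy frames; both are assumed to have a Survived label.
trn = pd.DataFrame({"Sex": ["male", "female", "male", "female"],
                    "Survived": [0, 1, 0, 1]})
tst = pd.DataFrame({"Sex": ["male", "female", "male", "female"],
                    "Survived": [1, 1, 0, 0]})
full = pd.concat([trn, tst], ignore_index=True)

def rate_by(df, col):
    """Mean survival per category of col (illustrative helper)."""
    return df.groupby(col)["Survived"].mean()

# One row per category, one column per dataset, plus the gap between them.
compare = pd.DataFrame({"train": rate_by(trn, "Sex"), "all": rate_by(full, "Sex")})
compare["diff"] = compare["train"] - compare["all"]
print(compare)
```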
Add Gender to the Mixture
Let’s look at survival by class and embarkation point split by passenger gender.
# quick look at class and embarkation by sex
fig = plt.figure(figsize=(12, 5))
fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.3, hspace=0.35)
ax1 = fig.add_subplot(1, 2, 1)
plot_bar("Pclass", "Survived", k_trn, fig, hue='Sex', hue_order=["female", "male"])
ax2 = fig.add_subplot(1, 2, 2, sharey=ax1)
plot_bar("Embarked", "Survived", k_trn, fig, hue='Sex', hue_order=["female", "male"])
Clearly it was better to have been in 1st class, for both sexes. The same seems to be true for embarking in Cherbourg, though I cannot fathom how embarkation site could make a difference in survival rate. There must be some other reason.
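One candidate for that "other reason" is that the mix of passenger classes differs by port. I have not verified that on the real data; the sketch below only shows how one might check, using a made-up frame (`toy` and its values are hypothetical) and a row-normalized cross-tabulation.

```python
import pandas as pd

# Hypothetical toy data; on the real data this would be k_trn.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "S", "S", "S", "S", "Q", "Q"],
    "Pclass":   [1, 1, 1, 3, 1, 3, 3, 3, 3, 3],
})

# Share of each passenger class within each embarkation port (rows sum to 1).
mix = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(mix)
```

If one port turns out to be dominated by 1st class passengers, the Embarked effect may largely be a Pclass effect in disguise.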
Correlation Heatmap
I’ll start by numerically encoding a couple of the categorical features myself. I know I could have used Pandas to do this for me, but… And, I’ll add a LargeGroup feature as well.
# let's see what correlation between features looks like
feature_list = ['Survived', 'Pclass', 'Sex', 'Age', 'AgeBin', 'Fare', 'iFare', 'SibSp', 'Parch', 'FamilySize', 'Group']
k_t1 = k_trn[feature_list].copy()
k_t1['Sex'] = k_t1['Sex'].map({'male': 0, 'female': 1})
ageBin_to_integer = {'0-15': 1,
                     '16-29': 2,
                     '30-40': 3,
                     '41-59': 4,
                     '60+': 5}
k_t1['AgeBin'] = k_t1['AgeBin'].map(ageBin_to_integer)
k_t1['LargeGroup'] = np.where(k_t1['Group'] > 4, 1, 0)
display(k_t1.iloc[25:30, :])
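For comparison, here is roughly what letting pandas do the encoding could look like. This is a sketch on a made-up frame (`toy`, `bins` and the new column names are hypothetical), not what the notebook does: an ordered categorical gives the same 1 to 5 codes for the age bins, and `pd.get_dummies` gives an indicator column for sex.

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns.
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "AgeBin": ["16-29", "0-15", "60+"]})

# Ordered categories: codes start at 0, so add 1 to match the manual mapping.
bins = ["0-15", "16-29", "30-40", "41-59", "60+"]
toy["AgeBinCode"] = pd.Categorical(toy["AgeBin"], categories=bins, ordered=True).codes + 1

# One indicator column per category; keep 'female' so male -> 0, female -> 1.
toy["SexCode"] = pd.get_dummies(toy["Sex"])["female"].astype(int)
print(toy)
```

The manual `map` is arguably easier to read for two small features; the pandas route scales better when there are many categories.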
And, now the correlation values for the features selected/created above displayed in a heatmap.
corr_list = k_t1.columns
_ = plt.figure(figsize=(18,14))
sns.set(font_scale=1.5)
g = sns.heatmap(k_t1[np.isfinite(k_t1['Survived'])][corr_list].corr(),
                square=True, annot=True, fmt='.2f')
- Pclass, Sex and iFare all show a reasonable correlation with survival, as we’ve clearly seen previously for Pclass and Sex.
- There is some negative correlation between Age and FamilySize, which makes sense: the larger the family, the more youngsters, lowering the average age.
- There is also some negative correlation between Age and Pclass, which basically suggests 1st class passengers are older than 3rd class passengers. That should be easy enough to check.
- And, as one would expect, SibSp, Parch and FamilySize are all correlated with one another. You can also add Group (size) to those related features, since those groups contain many families.
- Fare is fairly correlated with Group (size). Again, this is logical, as Fare is the total paid by all passengers on the same ticket.
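The Age versus Pclass point in the list above is indeed easy to check with a groupby. A minimal sketch on made-up data (`toy` and its values are hypothetical; the real input would be `k_trn`):

```python
import pandas as pd

# Hypothetical toy data; missing ages are skipped by the aggregations.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age":    [38.0, 54.0, 45.0, 30.0, 28.0, 22.0, None, 19.0],
})

# Typical age per class, plus how many known ages each figure rests on.
age_by_class = toy.groupby("Pclass")["Age"].agg(["median", "mean", "count"])
print(age_by_class)
```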
In the plots above we saw that FamilySize had a noticeable impact on survival. Yet here we see the correlation between the two is virtually zero. Ditto for Group (size). The LargeGroup feature also doesn’t seem to show any correlation with survival. The same goes for Age, even though we’ve seen that being under 15 years of age represents an advantage. For the family/group sizes, this could be because the trend (slope) changes direction for sizes greater than 4: the correlation coefficient measures a linear relationship, so a rise followed by a fall can largely cancel out. That may also explain the lack of correlation for the age-related features.
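The cancelling effect is easy to demonstrate with an idealized example. The numbers below are made up (a perfectly symmetric rise and fall around a size of 4, which the real data certainly is not); the point is only that a strong relationship can produce a correlation coefficient of essentially zero.

```python
import numpy as np

# Hypothetical, idealized survival rates: rising up to size 4, then falling.
size = np.array([1, 2, 3, 4, 5, 6, 7])
rate = np.array([0.1, 0.3, 0.5, 0.7, 0.5, 0.3, 0.1])

# Pearson correlation over the whole range, then over each half separately.
r_all = np.corrcoef(size, rate)[0, 1]
r_small = np.corrcoef(size[size <= 4], rate[size <= 4])[0, 1]
r_large = np.corrcoef(size[size >= 4], rate[size >= 4])[0, 1]

print(f"all sizes: {r_all:.2f}, sizes 1-4: {r_small:.2f}, sizes 4-7: {r_large:.2f}")
```

Each half is perfectly correlated with size, in opposite directions, and the two halves cancel over the full range.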
We may have to try a different approach to determine the best features on which to train/test our model.
Done
These posts and related notebooks always seem to grow in size rather quickly. So, I’m going to call it quits for this post and notebook.
Next post, I will perhaps take a closer look at some of the features.
Feel free to download and play with my version of this post’s related notebook.
Until next time, be happy and safe.
(Note: I think I may have published the last two posts in the wrong order. Not that it likely mattered much.)
Resources
- matplotlib.figure
- matplotlib.figure.add_subplot
- matplotlib.figure.subplots_adjust
- pandas.get_dummies
- seaborn.barplot
- In supervised learning, why is it bad to have correlated features? Perhaps not the most definitive source.