I’ll let you search for info on the life-cycle of data science projects.
But, the first thing is pretty obviously defining the business problem. We don’t really have one. We are just trying to develop the best model we can to most accurately classify the Kaggle Titanic test dataset. This will likely take a number of iterations.
Generally, the second step is to get the data. We’ve pretty much done that in the previous two posts in this series. Following that is usually the step of cleaning up/wrangling the data. Once again, what little we had to do in this case was done in the previous post.
Now comes the critical step of Exploratory Data Analysis (EDA), which may include some data visualization (something we’ve had a cursory look at in prior posts) and, perhaps, some initial feature engineering.
In this post we will begin the EDA process, with perhaps some feature engineering if the EDA indicates there might be something of value.
Exploratory Data Analysis
Let’s start with a look at the features in the data set.
Load Datasets
# imports used throughout
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
rek_k_tst2 = "./data/titanic/rek_test_2.csv"
# load the datasets currently of interest
k_trn = pd.read_csv(kaggle_trn)
k_tst = pd.read_csv(rek_k_tst2)
Examine Feature Set
General Info
print(k_trn.columns)
From various sources, this is what the features represent.
- PassengerId: sequential row identifier included in the CSV files
- Survived: survival status (0 = No; 1 = Yes)
- Pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name: well, the passenger’s name (duh)
- Sex: passenger’s gender (female, male)
- Age: passenger’s age in years, unless age < 1 then decimal fraction
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: passenger fare (British pound)
- Cabin: allocated cabin
- Embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
With respect to the family relation variables (i.e. SibSp and Parch), some relations were ignored. I am not going to worry about the exact details, as I really don’t plan to account for that in any way.
Now, thinking about which features might be of value in our model:
- PassengerId: useless m’thinks
- Survived: knowing/predicting that is the whole purpose of this exercise
- Pclass: may have some value, as I expect the higher the class the closer they would be to lifeboats and more likely to be treated differently
- Name: may or may not have any value, but something to have a look at and a think about
- Sex: I expect this likely had a good deal to do with survival
- Age: possibly affected survival, especially for the youngest and oldest
- SibSp and Parch: hard to say for sure, but maybe we should investigate further
- Ticket: don’t see that the ticket number is going to tell us much
- Fare: perhaps, as again the higher the fare, the higher the deck you would likely be on, but may be correlated with Pclass
- Cabin: in and of itself, probably not of much use. But, the cabin information also includes the deck which might be of value
- Embarked: off the top of my head don’t see that it would be of consequence, but should likely investigate at least a little
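On the Cabin idea in the list above: the deck letter is just the first character of the cabin string, so extracting it is cheap. A minimal sketch, using a tiny stand-in frame rather than the real train.csv:

```python
import pandas as pd

# tiny stand-in frame; in the notebook this would be k_trn
cabins = pd.DataFrame({"Cabin": ["C85", None, "E46", "B28"]})

# the first character of the cabin string is the deck letter; missing cabins stay NaN
cabins["Deck"] = cabins["Cabin"].str[0]
```

Note that `.str[0]` propagates NaN for missing cabins, so the (many) passengers without cabin data would still need some handling.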
I think I will start by looking at how each feature relates to survival (in the training dataset). If there seems to be some sort of correlation, I will include that feature in the modelling. Though, we should also look at the correlation between features and perhaps not use all the features that are highly correlated with each other. This is going to take some work and a bit of time.
And, for any of those features I choose to use, I will have to decide how to deal with missing data, and how to present them to the model. Use “female” and “male”, or an integer binary encoding, or…
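To make the encoding question concrete, here is a sketch of two common options for the Sex column (using a stand-in Series, and column names of my own choosing, rather than the real data):

```python
import pandas as pd

# stand-in values; in the notebook this would be k_trn["Sex"]
sex = pd.Series(["female", "male", "male", "female"])

# option 1: explicit binary mapping
sex_code = sex.map({"female": 0, "male": 1})

# option 2: one-hot columns via get_dummies
sex_dummies = pd.get_dummies(sex, prefix="Sex")
```

For a two-valued feature the binary mapping is usually enough; one-hot encoding becomes more interesting for features like Embarked with three or more categories.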
Missing Values
Let’s look a bit more at the data. Things like: is there missing data? The DataFrame info() method should get us started.
k_trn.info()
Now, using a different approach, let’s look for missing data in the test dataset.
# looks to be missing data in the training dataset, what about the test dataset
fig, ax = plt.subplots(figsize=(5,5))
o_ttl = plt.title("Test Dataset: Missing Values")
o_map = sns.heatmap(k_tst.isnull(), cbar=False, cmap="rocket_r")
As the above (together with the earlier info() output) shows, both the training and test datasets contain features with missing values. The worst cases are “Age” and “Cabin”. The training set is also missing some values in the “Embarked” column, and the test dataset in the “Fare” column. We could simply leave these features out, but we have no idea how valuable they may be to our model. So, that decision will wait. We may in fact have to impute the missing values in some fashion, so our model gets as much information, of the best quality, as possible.
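A quick tabular way to get the same information as the heatmap is to count nulls per column. A sketch on a stand-in frame (the notebook would run this on k_trn and k_tst):

```python
import numpy as np
import pandas as pd

# stand-in frame with some deliberately missing values
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# per-column count of missing values
missing = df.isnull().sum()
```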
Now, let’s take a closer look at individual features. Not in any particular order.
Feature: Sex
Did gender affect the likelihood of survival?
# gender vs survival
k_trn.groupby('Sex').Survived.mean()
# or visually
srvvd = k_trn[k_trn["Survived"] == 1]["Sex"].value_counts()
not_s = k_trn[k_trn["Survived"] == 0]["Sex"].value_counts()
df_s_no = pd.DataFrame([srvvd, not_s])
df_s_no.index = ['Survived', 'Did Not']
ax = df_s_no.plot(kind='bar', stacked=True, figsize=(5,5))
It would definitely appear gender affected survival. Even though male passengers outnumbered female passengers, female survivors outnumbered male survivors by roughly two to one, and the female survival rate was close to four times the male rate. It would seem the “women” part of “women and children first” held true.
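The per-gender survival proportions behind the stacked bars can be pulled out directly with crosstab. A sketch on stand-in data (the notebook would use k_trn’s Sex and Survived columns):

```python
import pandas as pd

# stand-in data: 2 of 3 women survive, 1 of 3 men survive
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# normalize='index' turns the counts into per-gender proportions
rates = pd.crosstab(df["Sex"], df["Survived"], normalize="index")
```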
Feature: Age
What effect did age have on survival?
# age vs survival
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Age Distribution")
p_plt = sns.histplot(k_trn["Age"], ax=ax, bins=30, kde=True, color='b')
Fairly “normal” looking distribution.
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Age Distribution by Survival Status")
p_x = ax.set_xlabel("Age")
# fill=True replaces the shade=True parameter deprecated in newer seaborn versions
p_p1 = sns.kdeplot(k_trn["Age"].loc[k_trn["Survived"] == 1], ax=ax, label='Survived', fill=True)
p_p2 = sns.kdeplot(k_trn["Age"].loc[k_trn["Survived"] == 0], ax=ax, label='Did Not', fill=True)
p_lgd = ax.legend()
The kernel density estimate for survival by age doesn’t seem to add much additional information, except perhaps for the little bump at the younger ages. Still, it seems unlikely age was irrelevant to survival. Recall the “women and children first” concept prevalent at the time.
So, let’s try another chart — swarmplot.
# not much info, try something else
fig, ax = plt.subplots(figsize=(15,5))
ax.grid(True)
p_ttl = ax.set_title("Survival by Age and Gender")
p_xtk = plt.xticks(list(range(0,100,2)))
p_p2 = sns.swarmplot(y="Sex", x="Age", data=k_trn, hue="Survived")
So it would seem that, amongst the males, being between 0 and 12 improved the chance of survival. We likely need to consider age in our modelling.
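If that age band does matter, one bit of feature engineering would be a child flag. A sketch, with a cutoff of 12 and a column name (IsChild) that are my assumptions, not anything from the dataset:

```python
import numpy as np
import pandas as pd

# stand-in ages; in the notebook this would be k_trn["Age"]
df = pd.DataFrame({"Age": [4.0, 30.0, np.nan, 11.0, 60.0]})

# flag passengers aged 12 or under; NaN ages compare False and end up 0
df["IsChild"] = (df["Age"] <= 12).astype(int)
```

Note the missing ages silently become 0 here; whether that is acceptable depends on how we end up imputing Age.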
Feature: Pclass
Okay, let’s look at the probable effect of social status on survival via the Pclass feature.
# let's checkout Pclass
k_trn.groupby(['Pclass']).Survived.mean().to_frame()
# how about a chart
# yes, I know, should have created a function
srvvd = k_trn[k_trn["Survived"] == 1]["Pclass"].value_counts()
not_s = k_trn[k_trn["Survived"] == 0]["Pclass"].value_counts()
df_s_no = pd.DataFrame([srvvd, not_s])
df_s_no.index = ['Survived', 'Did Not']
ax = df_s_no.plot(kind='bar', stacked=True, figsize=(5,5))
Let’s look at the numbers for some perspective.
# to help understand the above, let's look at the numbers
pd.pivot_table(k_trn, index='Survived', columns='Pclass', values='Ticket', aggfunc='count')
Seems fairly clear that Pclass was a significant factor with respect to passenger survival. About 500 passengers (~55%) had third class tickets. But only 24% of them survived. Whereas ~63% of first class and ~47% of second class passengers survived.
But we should check whether class or gender is the bigger factor.
k_trn.groupby(['Pclass', 'Sex']).Survived.mean().to_frame()
Regardless of class, gender is the most important factor when it comes to survival. Though, men in first class were twice as likely to survive as those in the other two classes. Which probably indicates we should keep both features in our model.
Before moving on, let’s see how Pclass compares with Age as a metric for survival.
# let's also look at Pclass vs Age
# yes another function needed
fig, ax = plt.subplots(figsize=(15,5))
ax.grid(True)
p_ttl = ax.set_title("Survival by Age and Pclass")
p_xtk = plt.xticks(list(range(0,100,2)))
p_p2 = sns.swarmplot(y="Age", x="Pclass", data=k_trn, hue="Survived")
That seems to confirm the importance of Pclass on the likelihood of survival.
You may also have noticed (took me a while) that first class appears to have fewer children amongst its passengers than the two other classes. Are children left home with the nanny when parents are off having fun?
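That impression about children per class is easy to check with a filter and a groupby. A sketch on stand-in data (the notebook would run it on k_trn, and the under-12 cutoff is my own choice):

```python
import pandas as pd

# stand-in data: one child in each of classes 1 and 2, two in class 3
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age": [40.0, 8.0, 6.0, 5.0, 9.0, 30.0],
})

# count passengers under 12 in each class
kids_per_class = df[df["Age"] < 12].groupby("Pclass").size()
```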
Feature: Embarked
Intuitively, this one should, to me, have little or no significance in determining the odds of survival. But more than one of the articles I have read suggests one should not jump to conclusions, and should at least take a brief, “unbiased” look at what the feature values may be saying.
# how about point of embarkation
k_trn['Embarked'].value_counts().to_frame()
pd.pivot_table(k_trn, index="Survived", columns="Embarked", values="Ticket", aggfunc="count")
Looking at the above, it seems there might be some differences in survival rate based on point of embarkation. But perhaps an underlying element might better explain those differences.
Let’s start by looking at the higher survival rate for those boarding at Cherbourg versus the much lower rate for those boarding at Southampton.
First we will have a look at the travelling class of those boarding at those locations.
# why high survival rate at Cherbourg and lowest at Southampton
# due to numbers of first class and/or third class passengers embarking at those points?
def count_zeros(s):
    return s.size - s.sum()

k_trn.groupby(['Embarked', 'Pclass']).agg(
    Survived=pd.NamedAgg(column='Survived', aggfunc='sum'),
    DidNot=pd.NamedAgg(column='Survived', aggfunc=count_zeros)
)
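As an aside, much the same table can be had without a custom aggregation function by using crosstab with a two-level row index. A sketch on stand-in data (the notebook would use k_trn directly):

```python
import pandas as pd

# stand-in data covering a few (Embarked, Pclass) combinations
df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", "S"],
    "Pclass": [3, 1, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# one column per outcome, one row per observed (Embarked, Pclass) pair
tbl = pd.crosstab([df["Embarked"], df["Pclass"]], df["Survived"])
```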
And, now let’s consider the gender of those boarding at each harbour.
# did gender balance make difference in survival rates Queenstown vs Southampton
fig, ax = plt.subplots(figsize=(5,5))
p_ttl = ax.set_title("Count by Embarkation Point and Gender")
p_p = sns.countplot(x="Embarked", data=k_trn, ax=ax, hue="Sex")
The bulk of the passengers boarded at Southampton. And the majority of those were in third class. Whereas at Cherbourg, those in third class made up the minority of boarders. So, the discrepancy in survival is more likely based on the class of the passengers boarding in each location rather than the location.
Whereas for Queenstown, it looks like the gender of the boarding passengers more accurately explains the overall survival rate for that group.
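To put a number on that Queenstown observation, we could compute per-gender survival rates for just those passengers. A sketch on stand-in data (the notebook would filter k_trn):

```python
import pandas as pd

# stand-in data: at Queenstown, the women survive and the men do not
df = pd.DataFrame({
    "Embarked": ["Q", "Q", "Q", "Q", "S", "S"],
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# survival rate by gender for Queenstown boarders only
q_rates = df[df["Embarked"] == "Q"].groupby("Sex")["Survived"].mean()
```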
Done m’thinks
Okay, this post is getting rather lengthy. So, I am going to take a break and continue this exploratory data analysis in the next one.
Feel free to download and play with my version of this post’s related notebook.
Until next time, be happy and safe.
Resources
- pandas.DataFrame.groupby
- pandas.DataFrame.plot
- Group by: split-apply-combine
- seaborn.countplot
- seaborn.heatmap
- seaborn.histplot
- seaborn.swarmplot
- Description of Titanic dataset
- Comprehensive Guide to Grouping and Aggregating with Pandas
- Multiple aggregations of the same column using pandas GroupBy.agg()