I should likely have dealt with the missing values before the Exploratory Data Analysis. But then again, maybe a better idea of the dataset will aid in this process.
As I understand things, most machine learning algorithms/models do not respond well to missing data. Nor do many column transformers (I’ll be getting to those in a future post). There are really two choices: drop the rows or columns with the missing data from the dataset(s), or somehow replace (impute) the missing values. With a large dataset and only a few missing values, dropping the rows with missing data may be a workable solution. Or perhaps dropping the whole column, if that feature does not look important to getting a good prediction from the generated model.
We don’t have a large dataset, so dropping rows with missing data would likely be disastrous. I don’t believe Cabin will prove of any real value, so dropping that whole feature/column will likely not hurt our model development. But I think we should keep the Embarked and Age features.
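To make those options concrete, here is a minimal sketch, using a hypothetical DataFrame df with our columns (the variable names are mine, for illustration only):
df_rows = df.dropna(subset=['Age'])               # drop rows missing Age
df_cols = df.drop(columns=['Cabin'])              # drop the whole Cabin column
df['Age'] = df['Age'].fillna(df['Age'].median())  # impute with a simple statistic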
Let’s get the easy one out of the way first: the two missing values in the Embarked feature, given we’ve solved that one before.
I will skip most of the preliminary code cells (defaults, load datasets, etc.). I will be saving the datasets, with at least some of the fixes, to new CSV files.
Imports
I decided to include the imports in the post, as there is at least one idiosyncrasy with respect to the (experimental) IterativeImputer.
from IPython.core.interactiveshell import InteractiveShell
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
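# enable_iterative_imputer must be imported before IterativeImputer: the latter is
# still experimental, and sklearn.impute only exposes it once this module is loaded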
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
Missing Values for Both Datasets
We will not only need to fix the missing values in the training dataset, but also in the test dataset. So, let’s have a quick look at both of those.
# a wee reminder
k_trn.info()
# let's check test dataset for missing values
k_tst.info()
Okay, ignoring Cabin for now: one missing Fare value of interest in the test dataset, plus a batch of missing Age values. Many more missing values in the training dataset.
Feature: Embarked
I again refer you to the Dealing with Missing Data section in a previous post, pandas — Reshaping/Filtering Data, Group By. Note the same ticket and cabin values.
# Ok let's first sort out missing embarked values
display(k_trn.loc[k_trn['Embarked'].isnull()])
# searching suitable sources, turns out both embarked at Southampton
# https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html
# https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html
k_trn.loc[k_trn['Embarked'].isnull(), 'Embarked'] = 'S'
Feature: Fare (test dataset)
# let's deal with that
display(k_tst.loc[k_tst['Fare'].isnull()])
# will use the average 3rd-class fare for passengers travelling without any SibSp or Parch;
# not quite correct, as we are not excluding fares for multi-passenger tickets
avg_fare_3 = k_trn[(k_trn['Pclass'] == 3) &
                   (k_trn['SibSp'] == 0) &
                   (k_trn['Parch'] == 0)][['Fare']].mean().values[0]
print(avg_fare_3)
mask = k_tst['PassengerId'] == 1044
k_tst.loc[mask, 'Fare'] = avg_fare_3
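As an aside, the Titanic fares are heavily right-skewed, so the median would arguably be a more robust estimate than the mean. A quick alternative (a sketch, not what I used above; med_fare_3 is an illustrative name):
# alternative: median 3rd-class fare for solo travellers, more robust to outliers
med_fare_3 = k_trn[(k_trn['Pclass'] == 3) &
                   (k_trn['SibSp'] == 0) &
                   (k_trn['Parch'] == 0)]['Fare'].median()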
Save Modified Datasets
Going to save the above modified datasets to CSV files, oma_trn.csv and oma_tst.csv. oma -> only missing age, which is not exactly true, but at this point I only plan on imputing the missing age values. For now I am going to totally ignore the Cabin feature. And, oh yes, I am going to drop the TName feature from the test dataset; it was only there as a check for my earlier work.
# I want to save these two datasets to CSVs so I don't have to redo these few fixes
# Not going to currently save imputed ages, as may use different methods to impute missing ages
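# oma_trn and oma_tst are path variables set in the skipped setup cells;
# per the text above, they point at oma_trn.csv and oma_tst.csv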
k_trn.to_csv(oma_trn, index=False)
dTN_tst = k_tst.drop(columns="TName")
dTN_tst.to_csv(oma_tst, index=False)
Feature: Age
Now onto something potentially more complicated.
There are a few options. So, I am currently thinking of trying a couple of them and seeing how the model used earlier performs with the differing value estimates.
But first, let’s write a function to do the testing on our various approaches to modifying the datasets for the missing age values.
Function for Training/Testing
def trn_tst(X_trn, y_trn, X_tst):
    # fit a modest random forest on the training data and return its test set predictions
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    model.fit(X_trn, y_trn)
    predictions = model.predict(X_tst)
    return predictions
Scikit-Learn Imputers
Modified Train/Test Datasets
Let’s set up some variables, and our reduced datasets.
r_feats = ['Pclass','Sex','SibSp','Parch','Age','Embarked','Fare']
d_feats = ['Pclass','SibSp','Parch','Age','Fare','Sex_female','Sex_male','Embarked_C','Embarked_Q','Embarked_S']
y_trn = k_trn['Survived']
Xd_age = pd.get_dummies(k_trn[r_feats])
Xd_t_age = pd.get_dummies(k_tst[r_feats])
#print(Xd_age.iloc[:2, :])
SimpleImputer
imputer = SimpleImputer(missing_values=np.nan)
X_age = imputer.fit_transform(Xd_age)
X_age = pd.DataFrame(X_age, columns=d_feats)
# X_age.info()
X_t_age = imputer.transform(Xd_t_age)
X_t_age = pd.DataFrame(X_t_age, columns=d_feats)
# X_t_age.info()
#print(X_age.columns, "\n", X_t_age.columns)
t_age_pred = trn_tst(X_age, y_trn, X_t_age)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred)}')
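Note that SimpleImputer defaults to strategy='mean'; it also accepts 'median', 'most_frequent' and 'constant'. Had I wanted median-based imputation, it would be a one-line change (a sketch; imputer_med and X_age_med are illustrative names):
# same idea, but imputing each column's median instead of its mean
imputer_med = SimpleImputer(missing_values=np.nan, strategy='median')
X_age_med = pd.DataFrame(imputer_med.fit_transform(Xd_age), columns=d_feats)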
IterativeImputer
imputer_2 = IterativeImputer(missing_values=np.nan, max_iter=25)
X_age_2 = imputer_2.fit_transform(Xd_age)
# build the DataFrames from the IterativeImputer output (not from X_age/X_t_age,
# which would silently reuse the SimpleImputer results)
X_age_2 = pd.DataFrame(X_age_2, columns=d_feats)
X_t_age_2 = imputer_2.transform(Xd_t_age)
X_t_age_2 = pd.DataFrame(X_t_age_2, columns=d_feats)
#print(X_age.columns, X_t_age.columns)
t_age_pred_2 = trn_tst(X_age_2, y_trn, X_t_age_2)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred_2)}')
Surprised that Simple and Iterative both produced the same accuracy score. May have to dig deeper and compare the imputed ages directly.
#X_age_2['Age'].head()
imps = pd.DataFrame(X_age_2['Age'].to_list(), columns=['Iterative'])
#imps.head()
imps2 = pd.concat([imps, X_age['Age']], axis=1)
# imps2.groupby(['Iterative', 'Age']).ngroups
imps2[(imps2['Iterative'] == imps2['Age'])].count()
imps2[(imps2['Iterative'] != imps2['Age'])]
Both imputers generated exactly the same imputed ages?! Digging deeper, that turned out to be self-inflicted: I had originally built the X_age_2 and X_t_age_2 DataFrames from X_age and X_t_age (the SimpleImputer outputs) rather than from the IterativeImputer results, so I was comparing the SimpleImputer ages with themselves. With each DataFrame built from its own imputer's output, the two sets of imputed ages will in general differ.
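For what it's worth, IterativeImputer models each feature that has missing values as a function of the other features, using BayesianRidge regression by default. Something to perhaps try later: swapping in a different regressor via the estimator parameter. A sketch (the RandomForestRegressor choice and parameter values are just examples, not something I ran above):
from sklearn.ensemble import RandomForestRegressor

# iterative imputation with a tree-based regressor instead of the default BayesianRidge
imputer_rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=1),
                              missing_values=np.nan, max_iter=10, random_state=1)
X_age_rf = pd.DataFrame(imputer_rf.fit_transform(Xd_age), columns=d_feats)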
Imputing My Way
Now, let’s look at imputing with our own code and approach.
EDA
Let’s see if there is any correlation between Age and some of the other features. Specifically, Pclass, Sex_female, Sex_male, SibSp and Parch.
If you are wondering about Sex_female and Sex_male, they were created when we used Xd_age = pd.get_dummies(k_trn[r_feats]) on our datasets. get_dummies() converts categorical variables into dummy/indicator variables: Sex_female will be 1 when that passenger had a value of female in the Sex feature, and zero otherwise; similarly for the male value. Something similar happened with the Embarked feature, as you will see below. This is done because a good many models do not like categorical features.
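A tiny illustration of what get_dummies() does, on toy data rather than our datasets:
# one categorical column becomes one indicator column per category value
toy = pd.DataFrame({'Sex': ['female', 'male', 'male']})
print(pd.get_dummies(toy))
#    Sex_female  Sex_male
# 0           1         0
# 1           0         1
# 2           0         1
# (recent pandas versions print True/False rather than 1/0)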
sns.set(font_scale=1.5)
feature_list = ['Pclass', 'Age','Sex_female','Sex_male', 'SibSp', 'Parch']
hm = plt.figure(figsize=(10,6))
g = sns.heatmap(Xd_age[feature_list].corr(), square=True, annot=True, cmap='coolwarm', fmt='.2f')
Looks like there is some correlation between Age and these three features: Pclass, Parch and SibSp. Though somewhat less for Parch than the other two. Let’s have a look at those three.
plt.style.use('seaborn-whitegrid')  # on newer matplotlib: 'seaborn-v0_8-whitegrid'
fig = plt.figure(figsize=(14, 5))
fig.subplots_adjust(wspace=0.3, hspace=0.5)
ax = fig.add_subplot(1, 3, 1)
ax = sns.boxplot(x='Pclass', y='Age', data=k_trn)
ax = fig.add_subplot(1, 3, 2)
ax = sns.boxplot(x='Parch', y='Age', data=k_trn)
ax = fig.add_subplot(1, 3, 3)
ax = sns.boxplot(x='SibSp', y='Age', data=k_trn)
Median Age by Group
Let’s get a median age by grouping on those three features.
median_ages = k_trn.groupby(['Pclass', 'Parch', 'SibSp'], as_index=False).Age.median()
# median_ages.head()
pc = 3
pa = 2
ss = 3
median_ages[(median_ages['Pclass']==pc) & (median_ages['Parch']==pa) & (median_ages['SibSp']==ss)].Age.item()
print(f"Number of missing Age values: {median_ages['Age'].isnull().sum()}")
median_ages[(median_ages['Pclass']==3) & (median_ages['Parch']==2) & (median_ages['SibSp']==8)].Age.item()
Well! Seems our grouped ages are missing at least one value?
k_trn[(k_trn['Pclass']==3) & (k_trn['Parch']==2) & (k_trn['SibSp']==8)]
My Imputing Attempt 1
I will get the imputed values from the training dataset. And, use that information to replace the missing values in both the training and test datasets.
Along with that missing imputed value for the training dataset, there are combinations of the three features in the test dataset that didn’t have any matching rows in the training dataset. I am not showing the trial and error, just the final code covering the problem feature values. Since most of the problem cases involved a higher number of children than adults, I used 15 as the replacement age. Possibly should have gone younger, but…
X_age_3 = Xd_age.copy()     # work on copies, so Xd_age/Xd_t_age keep their missing values
X_t_age_3 = Xd_t_age.copy()
miss_trn = list(X_age_3['Age'][X_age_3['Age'].isnull()].index)
miss_tst = list(X_t_age_3['Age'][X_t_age_3['Age'].isnull()].index)
# print(miss_trn[:5])
for i in miss_trn:
    age_est = median_ages[(median_ages['Pclass']==X_age_3.iloc[i]['Pclass'])
                          & (median_ages['Parch']==X_age_3.iloc[i]['Parch'])
                          & (median_ages['SibSp']==X_age_3.iloc[i]['SibSp'])].Age.item()
    if pd.isnull(age_est):
        X_age_3.loc[i, 'Age'] = 25
    else:
        X_age_3.loc[i, 'Age'] = math.ceil(age_est)
for i in miss_tst:
    if not median_ages[(median_ages['Pclass']==X_t_age_3.iloc[i]['Pclass'])
                       & (median_ages['Parch']==X_t_age_3.iloc[i]['Parch'])
                       & (median_ages['SibSp']==X_t_age_3.iloc[i]['SibSp'])].empty:
        age_est = median_ages[(median_ages['Pclass']==X_t_age_3.iloc[i]['Pclass'])
                              & (median_ages['Parch']==X_t_age_3.iloc[i]['Parch'])
                              & (median_ages['SibSp']==X_t_age_3.iloc[i]['SibSp'])].Age.item()
        if pd.isnull(age_est):
            X_t_age_3.loc[i, 'Age'] = 15
        else:
            X_t_age_3.loc[i, 'Age'] = math.ceil(age_est)
    else:
        X_t_age_3.loc[i, 'Age'] = 15
print(f"Number of missing Age training values: {X_age_3['Age'].isnull().sum()}")
print(f"Number of missing Age test values: {X_t_age_3['Age'].isnull().sum()}")
And, the model score using these imputed values?
t_age_pred_3 = trn_tst(X_age_3, y_trn, X_t_age_3)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred_3)}')
My Imputing Attempt 2
For comparison, let’s drop Parch from the imputation computation and see what happens.
median_ages_2 = k_trn.groupby(['Pclass', 'SibSp'], as_index=False).Age.median()
# median_ages_2.head()
pc = 3
ss = 3
median_ages_2[(median_ages_2['Pclass']==pc) & (median_ages_2['SibSp']==ss)].Age.item()
print(f"Number of missing Age values: {median_ages_2['Age'].isnull().sum()}")
median_ages_2[(median_ages_2['Pclass']==3) & (median_ages_2['SibSp']==8)].Age.item()
X_age_4 = Xd_age.copy()     # copies again, so attempt 2 starts from the original missing values
X_t_age_4 = Xd_t_age.copy()
miss_trn = list(X_age_4['Age'][X_age_4['Age'].isnull()].index)
miss_tst = list(X_t_age_4['Age'][X_t_age_4['Age'].isnull()].index)
# median_ages_2 was grouped on Pclass and SibSp only, so no Parch conditions below
for i in miss_trn:
    age_est = median_ages_2[(median_ages_2['Pclass']==X_age_4.iloc[i]['Pclass'])
                            & (median_ages_2['SibSp']==X_age_4.iloc[i]['SibSp'])].Age.item()
    if pd.isnull(age_est):
        X_age_4.loc[i, 'Age'] = 25
    else:
        X_age_4.loc[i, 'Age'] = math.ceil(age_est)
for i in miss_tst:
    if not median_ages_2[(median_ages_2['Pclass']==X_t_age_4.iloc[i]['Pclass'])
                         & (median_ages_2['SibSp']==X_t_age_4.iloc[i]['SibSp'])].empty:
        age_est = median_ages_2[(median_ages_2['Pclass']==X_t_age_4.iloc[i]['Pclass'])
                                & (median_ages_2['SibSp']==X_t_age_4.iloc[i]['SibSp'])].Age.item()
        if pd.isnull(age_est):
            X_t_age_4.loc[i, 'Age'] = 15
        else:
            X_t_age_4.loc[i, 'Age'] = math.ceil(age_est)
    else:
        X_t_age_4.loc[i, 'Age'] = 15
print(f"Number of missing Age training values: {X_age_4['Age'].isnull().sum()}")
print(f"Number of missing Age test values: {X_t_age_4['Age'].isnull().sum()}")
t_age_pred_4 = trn_tst(X_age_4, y_trn, X_t_age_4)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred_4)}')
In my runs, both of my imputing attempts netted the same score, which was less than that for the Scikit-learn imputers. But I expect that in many cases the IterativeImputer will fare better. And since it didn’t take all that long on these datasets, I will use it to impute the missing passenger ages.
Done
I think that’s it for another post.
Feel free to download and play with my version of this post’s related notebook.
Perhaps a look at some feature engineering in the next post.
Resources
- pandas.DataFrame
- pandas.get_dummies
- pandas.DataFrame.groupby: do check out the as_index= parameter
- Indexing and selecting data
- How do I select a subset of a DataFrame
- Pandas dataframe filter with Multiple conditions
- Filter Pandas Dataframe with multiple conditions
- seaborn.boxplot
- seaborn.heatmap
- sklearn.impute.IterativeImputer
- sklearn.impute.SimpleImputer
- Imputation of missing values
- Metrics and scoring: quantifying the quality of predictions
- Using scikit-learn’s Iterative Imputer