I should likely have dealt with the missing values before the Exploratory Data Analysis. But, then again, maybe a better idea of the dataset will aid in this process.

As I understand things, most machine learning algorithms/models do not respond well to missing data. Nor do many column transformers (I’ll be getting there in the future.) There are really two choices: drop the rows or columns with the missing data from the dataset(s) or some how replace (impute) the missing values. With a large dataset and a few missing values, dropping the rows with missing data may be a workable solution. Or perhaps dropping the whole column if that feature does not look to be important to getting a good prediction from the generated model.

We don’t have a large dataset, so dropping rows with missing data would likely be disastrous. I don’t believe the Cabin will prove of any real value, so dropping that whole feature/column will likely not hurt our model development. But, I think we should keep the Embark and Age features.

Let’s get the easy one out of the way first. The two missing values in the Embark feature. Given we’ve solved that one before.

I will skip most of the preliminary code cells (defaults, load datasets, etc.). I will be saving the datasets, with at least some of the fixes, to new CSV files.

Imports

Decided to include the imports in the post, as there is at least one idiosyncrasy with respect to the IterativeImputer (experimental).

In [1]:
from IPython.core.interactiveshell import InteractiveShell
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting 
import seaborn as sns # for plotting

from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score from sklearn.impute import SimpleImputer from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer

Missing Values for Both Datasets

We will not only need to fix the missing values in the training dataset, but also in the test dataset. So, let’s have a quick look at both of those.

In [5]:
# a wee reminder
k_trn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [6]:
# let's check test dataset for missing values
k_tst.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
 11  Survived     418 non-null    int64  
 12  TName        418 non-null    object 
dtypes: float64(2), int64(5), object(6)
memory usage: 42.6+ KB

Okay, 1 missing value of interest in the test dataset (currently ignoring Cabin). Many more in the training dataset.

Feature: Embark

I again refer you to the Dealing with Missing Data sectoin in a previous post, pandas — Reshaping/Filtering Data, Group By. Note the same ticket and cabin values.

In [6]:
# Ok let's first sort out missing embarked values
display(k_trn.loc[k_trn['Embarked'].isnull()])
# searching suitable sources, turns out both embarked at Southhampton
# https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html
# https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html
k_trn.loc[k_trn['Embarked'].isnull(), 'Embarked'] = 'S'
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
616211Icard, Miss. Ameliefemale38.000011357280.00B28NaN
82983011Stone, Mrs. George Nelson (Martha Evelyn)female62.000011357280.00B28NaN

Feature: Fare (test dataset)

In [8]:
# let's deal with that
display(k_tst.loc[k_tst['Fare'].isnull()])
# will use avg 3 class fare for passenger without any SibSp or Parch,
# not quite correct as we are not ignoring fares for multipassenger tickets
avg_fare_3 = k_trn[((k_trn['Pclass'] == 3) & 
                    (k_trn['SibSp'] == 0) & (k_trn['Parch'] == 0))][['Fare']].mean().values[0]
print(avg_fare_3)
mask = k_tst['PassengerId'] == 1044 
k_tst.loc[mask, 'Fare'] = avg_fare_3
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSurvivedTName
15210443Storey, Mr. Thomasmale60.50003701NaNNaNS0Storey, Mr. Thomas
9.272051851851854

Save Modified Datasets

Going to save the above, modified datasets to CSV files, oma_trn.csv and oma_tst.csv. oma -> only missing age, which is not exactly true, but at this point I only plan on imputing the missing age values. For now I am going to totally ignore the Cabin feature. And, oh yes, I am going to drop the TName feature from the test dataset, it was only there as a check for my earlier work.

In [9]:
# I want to save these two dataset to CSVs so don't have to redo these few fixes
# Not going to currently save imputed ages, as may use different methods to impute missing ages
k_trn.to_csv(oma_trn, index=False)
dTN_tst = k_tst.drop(columns="TName")
dTN_tst.to_csv(oma_tst, index=False)

Feature: Age

Now onto something potentially more complicated.

There are a few options. So, I am currently thinking of trying a couple of them and seeing how the model used earlier performs with the differing value estimates.

But first, let’s write a function to do the testing on our various approaches to modifying the datasets for the missing age values.

Function for Training/Testing

In [10]:
def trn_tst(X_trn, y_trn, X_tst):
  model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
  model.fit(X_trn, y_trn)
  predictions = model.predict(X_tst)
  # print(predictions[:5])
  # display(ds_tst["Survived"].head())
  # return accuracy_score(ds_tst["Survived"], predictions)
  return predictions

Scikit-Learn Imputers

Modified Train/Test Datasets

Let’s set up some variables, and our reduced datasets.

In [11]:
r_feats = ['Pclass','Sex','SibSp','Parch','Age','Embarked','Fare']
d_feats = ['Pclass','SibSp','Parch','Age','Fare','Sex_female','Sex_male','Embarked_C','Embarked_Q','Embarked_S']
y_trn = k_trn['Survived']
Xd_age = pd.get_dummies(k_trn[r_feats])
Xd_t_age = pd.get_dummies(k_tst[r_feats])
#print(Xd_age.iloc[:2, :])

SimpleImputer

In [12]:
imputer = SimpleImputer(missing_values=np.nan)
X_age = imputer.fit_transform(Xd_age)
X_age = pd.DataFrame(X_age, columns=d_feats)
# X_age.info()
X_t_age = imputer.transform(Xd_t_age)
X_t_age = pd.DataFrame(X_t_age, columns=d_feats)
# X_t_age.info()
#print(X_age.columns, "\n", X_t_age.columns)
t_age_pred = trn_tst(X_age, y_trn, X_t_age)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred)}')

IterativeImputer

model accuracy: 0.7870813397129187
In [13]:
imputer_2 = IterativeImputer(missing_values=np.nan, max_iter=25)
X_age_2 = imputer_2.fit_transform(Xd_age)
X_age_2 = pd.DataFrame(X_age, columns=d_feats)
X_t_age_2 = imputer_2.transform(Xd_t_age)
X_t_age_2 = pd.DataFrame(X_t_age, columns=d_feats)
#print(X_age.columns, X_t_age.columns)
t_age_pred_2 = trn_tst(X_age_2, y_trn, X_t_age_2)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred_2)}')
model accuracy: 0.7870813397129187

Surprised Simple and Iterative both produced the same result. May have to dig deeper.

In [14]:
#X_age_2['Age'].head()
imps = pd.DataFrame(X_age_2['Age'].to_list(), columns=['Iterative'])
#imps.head()
imps2 = pd.concat([imps, X_age['Age']], axis=1)
# imps2.groupby(['Iterative', 'Age']).ngroups
imps2[(imps2['Iterative'] == imps2['Age'])].count()
imps2[(imps2['Iterative'] != imps2['Age'])]

Both imputers generated exactly the same imputed ages?!

Out[14]:
Iterative    891
Age          891
dtype: int64
Out[14]:
IterativeAge

Imputing My Way

Now, let’s look a imputing with our own code and approach.

EDA

Let’s see if there is any correlation between Age and some of the other features. Specifically, Pclass, Sex_female, Sex_male, SibSp and Parch.

If you are wondering about Sex_female and Sex_male, they were created when we used Xd_age = pd.get_dummies(k_trn[r_feats]) on our datasets. get_dummies() converts categorical variables into dummy/indicator variables. So, Sex_female will be 1 when that passenger had a value of female in the Sex feature, zero otherwise. Similarly for the male feature value. Something similar will have happened with the Embarked feature — as you will see below. This is/was done because a good many models do not like categorical features.

In [15]:
sns.set(font_scale=1.5)
feature_list = ['Pclass', 'Age','Sex_female','Sex_male', 'SibSp', 'Parch']
hm = plt.figure(figsize=(10,6))
g = sns.heatmap(Xd_age[feature_list].corr(), square=True, annot=True, cmap='coolwarm', fmt='.2f')
heatmap showing correlation between age and a few other features

Looks like there is some correlation between Age and these three features: Pclass, Parch and SibSp. Though somewhat less for Parch than the other two. Let’s have a look at those three.

In [16]:
plt.style.use('seaborn-whitegrid')

fig = plt.figure(figsize=(14, 5)) top = fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.3, hspace=0.5) ax = fig.add_subplot(1, 3, 1) ax = sns.boxplot(x='Pclass', y='Age', data=k_trn) ax = fig.add_subplot(1, 3, 2) ax = sns.boxplot(x='Parch', y='Age', data=k_trn) ax = fig.add_subplot(1, 3, 3) ax = sns.boxplot(x='SibSp', y='Age', data=k_trn)

boxplots of Pclass, Parch and SibSp vs Age

Median Age by Group

Let’s get a median age by grouping on those three features.

In [17]:
median_ages = k_trn.groupby(['Pclass', 'Parch', 'SibSp'], as_index=False).Age.median()
# median_ages.head()
pc = 3
pa = 2
ss = 3
median_ages[(median_ages['Pclass']==pc) & (median_ages['Parch']==pa) & (median_ages['SibSp']==ss)].Age.item()
print(f"Number of missing Age values: {median_ages['Age'].isnull().sum()}")
median_ages[(median_ages['Pclass']==3) & (median_ages['Parch']==2) & (median_ages['SibSp']==8)].Age.item()
Out[17]:
6.5
Number of missing Age values: 1
Out[17]:
nan

Well! Seems our grouped ages are missing at least one value?

In [18]:
k_trn[(k_trn['Pclass']==3) & (k_trn['Parch']==2) & (k_trn['SibSp']==8)]
Out[18]:
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
15916003Sage, Master. Thomas HenrymaleNaN82CA. 234369.55NaNS
18018103Sage, Miss. Constance GladysfemaleNaN82CA. 234369.55NaNS
20120203Sage, Mr. FrederickmaleNaN82CA. 234369.55NaNS
32432503Sage, Mr. George John JrmaleNaN82CA. 234369.55NaNS
79279303Sage, Miss. Stella AnnafemaleNaN82CA. 234369.55NaNS
84684703Sage, Mr. Douglas BullenmaleNaN82CA. 234369.55NaNS
86386403Sage, Miss. Dorothy Edith "Dolly"femaleNaN82CA. 234369.55NaNS

My Imputing Attempt 1

I will get the imputed values from the training dataset. And, use that information to replace the missing values in both the training and test datasets.

Along with that missing imputed value for the training dataset. There are values for the three features in the test data set that didn’t have any matching values in the training data set. I am not showing the trial and error. Just the final code covering the problem feature values. Since most of the problem situations involved a higher number of children than adults I used 15 as the replacement age. Possibly should have gone younger, but…

In [19]:
X_age_3 = Xd_age
X_t_age_3 = Xd_t_age
miss_trn = list(X_age_3['Age'][X_age_3['Age'].isnull()].index)
miss_tst = list(X_t_age_3['Age'][X_t_age_3['Age'].isnull()].index)

# print(miss_age[:5]) for i in miss_trn: age_est = median_ages[(median_ages['Pclass']==X_age_3.iloc[i]['Pclass']) & (median_ages['Parch']==X_age_3.iloc[i]['Parch']) & (median_ages['SibSp']==X_age_3.iloc[i]['SibSp'])].Age.item() if pd.isnull(age_est): # print(f"! {k_trn.iloc[i]['Pclass']}, {k_trn.iloc[i]['Parch']}, {k_trn.iloc[i]['SibSp']} -> {age_est}") X_age_3.loc[i, 'Age'] = 25 else: X_age_3.loc[i, 'Age'] = math.ceil(age_est)

for i in miss_tst: # print(f"! {X_t_age_3.iloc[i]['Pclass']}, {X_t_age_3.iloc[i]['Parch']}, {X_t_age_3.iloc[i]['SibSp']}") if not median_ages[(median_ages['Pclass']==X_t_age_3.iloc[i]['Pclass']) & (median_ages['Parch']==X_t_age_3.iloc[i]['Parch']) & (median_ages['SibSp']==X_t_age_3.iloc[i]['SibSp'])].empty:

<span class="n">age_est</span> <span class="o">=</span> <span class="n">median_ages</span><span class="p">[(</span><span class="n">median_ages</span><span class="p">[</span><span class="s1">&#39;Pclass&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">X_t_age_3</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s1">&#39;Pclass&#39;</span><span class="p">])</span>
  <span class="o">&amp;</span> <span class="p">(</span><span class="n">median_ages</span><span class="p">[</span><span class="s1">&#39;Parch&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">X_t_age_3</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s1">&#39;Parch&#39;</span><span class="p">])</span>
  <span class="o">&amp;</span> <span class="p">(</span><span class="n">median_ages</span><span class="p">[</span><span class="s1">&#39;SibSp&#39;</span><span class="p">]</span><span class="o">==</span><span class="n">X_t_age_3</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s1">&#39;SibSp&#39;</span><span class="p">])]</span><span class="o">.</span><span class="n">Age</span><span class="o">.</span><span class="n">item</span><span class="p">()</span>
<span class="k">if</span> <span class="n">pd</span><span class="o">.</span><span class="n">isnull</span><span class="p">(</span><span class="n">age_est</span><span class="p">):</span>
  <span class="c1"># print(f&quot;! {k_trn.iloc[i][&#39;Pclass&#39;]}, {k_trn.iloc[i][&#39;Parch&#39;]}, {k_trn.iloc[i][&#39;SibSp&#39;]} -&gt; {age_est}&quot;)</span>
  <span class="n">X_t_age_3</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s1">&#39;Age&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">15</span>
<span class="k">else</span><span class="p">:</span>
  <span class="n">X_t_age_3</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s1">&#39;Age&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">ceil</span><span class="p">(</span><span class="n">age_est</span><span class="p">)</span>

else: X_t_age_3.loc[i, 'Age'] = 15

print(f"Number of missing Age training values: {X_age_3['Age'].isnull().sum()}") print(f"Number of missing Age test values: {X_t_age_3['Age'].isnull().sum()}")

Number of missing Age training values: 0
Number of missing Age test values: 0

And, the model score using these imputed values?

In [20]:
t_age_pred_3 = trn_tst(X_age_3, y_trn, X_t_age_3)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred_3)}')
model accuracy: 0.7799043062200957

My Imputing Attempt 2

For comparison, ket’s drop Parch from the imputation computation and see what happens.

In [21]:
median_ages_2 = k_trn.groupby(['Pclass', 'SibSp'], as_index=False).Age.median()
# median_ages.head()
pc = 3
ss = 3
median_ages_2[(median_ages_2['Pclass']==pc) & (median_ages_2['SibSp']==ss)].Age.item()
print(f"Number of missing Age values: {median_ages_2['Age'].isnull().sum()}")
median_ages_2[(median_ages_2['Pclass']==3) & (median_ages_2['SibSp']==8)].Age.item()
Out[21]:
6.0
Number of missing Age values: 1
Out[21]:
nan
In [22]:
X_age_4 = Xd_age
X_t_age_4 = Xd_t_age
miss_trn = list(X_age_4['Age'][X_age_4['Age'].isnull()].index)
miss_tst = list(X_t_age_4['Age'][X_t_age_4['Age'].isnull()].index)

# print(miss_age[:5]) for i in miss_trn: age_est = median_ages_2[(median_ages_2['Pclass']==X_age_4.iloc[i]['Pclass']) & (median_ages_2['Parch']==X_age_4.iloc[i]['Parch']) & (median_ages_2['SibSp']==X_age_4.iloc[i]['SibSp'])].Age.item() if pd.isnull(age_est): # print(f"! {k_trn.iloc[i]['Pclass']}, {k_trn.iloc[i]['Parch']}, {k_trn.iloc[i]['SibSp']} -> {age_est}") X_age_4.loc[i, 'Age'] = 25 else: X_age_4.loc[i, 'Age'] = math.ceil(age_est)

for i in miss_tst: # print(f"! {X_t_age_3.iloc[i]['Pclass']}, {X_t_age_3.iloc[i]['Parch']}, {X_t_age_3.iloc[i]['SibSp']}") if not median_ages_2[(median_ages_2['Pclass']==X_t_age_4.iloc[i]['Pclass']) & (median_ages_2['Parch']==X_t_age_4.iloc[i]['Parch']) & (median_ages_2['SibSp']==X_t_age_4.iloc[i]['SibSp'])].empty:

age_est = median_ages_2[(median_ages['Pclass']==X_t_age_4.iloc[i]['Pclass']) & (median_ages_2['Parch']==X_t_age_4.iloc[i]['Parch']) & (median_ages_2['SibSp']==X_t_age_4.iloc[i]['SibSp'])].Age.item() if pd.isnull(age_est): # print(f"! {k_trn.iloc[i]['Pclass']}, {k_trn.iloc[i]['Parch']}, {k_trn.iloc[i]['SibSp']} -> {age_est}") X_t_age_4.loc[i, 'Age'] = 15 else: X_t_age_4.loc[i, 'Age'] = math.ceil(age_est) else: X_t_age_4.loc[i, 'Age'] = 15

print(f"Number of missing Age training values: {X_age_4['Age'].isnull().sum()}") print(f"Number of missing Age test values: {X_t_age_4['Age'].isnull().sum()}")

Number of missing Age training values: 0
Number of missing Age test values: 0
In [23]:
t_age_pred_4 = trn_tst(X_age_4, y_trn, X_t_age_4)
print(f'model accuracy: {accuracy_score(k_tst["Survived"], t_age_pred_4)}')
model accuracy: 0.7799043062200957

Both of my imputing attempts netted the same score. Which was less than that for the Scikit-learn imputers. But, I expect in many cases the IterativeImputer will fair better. And since it didn’t take all that long for these datasets, I will use it to impute the missing passenger ages.

Done

I think that’s it for another post.

Feel free to download and play with my version of this post’s related notebook.

Perhaps a look at some feature engineering in the next post.

Resources