Well, here we are again. Looks like I will be making a third attempt at getting this somewhat right.
Missing Age-Related Data
While I was setting up categorical feature encoding for a future post, missing data caused the column transformer to choke.
The first time it choked, I figured the problem was the Cabin feature, which has missing data. So, I dropped it from the dataset. But no relief; it choked again with the same problem. So, a little investigating eventually led me to the following. (You will need to scroll to the right side.)
# show any rows in the training set with missing values
k_trn[pd.isnull(k_trn).any(axis=1)]
Well, I am guessing that most people realized in that last post that I had failed to update the AgeBin feature. So, the passengers with an imputed negative age had no entry in that column/feature. I expect the same holds true for the testing dataset.
But, looking at the above, I didn’t like what I saw. Everyone in that family was assigned an age of 0.17 years? So, I decided I would try imputing the missing ages once again.
I had read in one of the many posts/articles on working with the Titanic dataset that the title “Master” was given to male children aged 15 and under. So, to any “Master” missing an age, I am going to assign the average age of those with an age. Then I will fit the imputer and go from there. Don’t know how much that will help, but it is likely better than not doing so.
A lot of this will be plain and simple repetition from the previous post. But, you know, “practice, practice, practice”.
Load Datasets and Re-initialize Age Feature
# pandas should already be imported if you have been following along, but just in case
import pandas as pd

# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
k_all = pd.concat([k_trn, k_tst], ignore_index=True)
y_trn = k_trn['Survived']
# start fresh
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age', 'Title']
X_trn = k_trn[features].copy()
X_tst = k_tst[features].copy()
# let's load the Kaggle datasets so we can get the original Age data
kg_trn = pd.read_csv(kaggle_trn)
kg_tst = pd.read_csv(kaggle_tst)
# now let's replace the Age data in my versions of the datasets with the original Kaggle Age data
# (keeping my previously imputed ages around in an iiAge column for comparison)
X_trn.rename(columns={"Age": "iiAge"}, inplace=True)
X_trn["Age"] = kg_trn["Age"]
X_tst.rename(columns={"Age": "iiAge"}, inplace=True)
X_tst["Age"] = kg_tst["Age"]
X_trn.head()
X_trn.tail()
X_trn["Age"].describe()
Missing Age for Passengers with Title Master
Work out the mean age of the “Master” passengers for both datasets, then combine them into a single weighted mean. Also find those with a missing age (so we can check afterwards).
# use the original Kaggle ages in X_trn/X_tst so the means aren't skewed by my earlier imputed values
mean_trn = X_trn[X_trn["Title"] == "Master"]["Age"].mean()
mean_tst = X_tst[X_tst["Title"] == "Master"]["Age"].mean()
cnt_trn = X_trn[X_trn["Title"] == "Master"]["Age"].count()
cnt_tst = X_tst[X_tst["Title"] == "Master"]["Age"].count()
# weight the two means by the number of non-missing ages in each set
mean_all = ((mean_trn * cnt_trn) + (mean_tst * cnt_tst)) / (cnt_trn + cnt_tst)
print(f"Mean Age for 'Master': training set {mean_trn:.3f} ({cnt_trn}), test set {mean_tst:.3f} ({cnt_tst}) -> {mean_all:.3f}")
X_trn[(X_trn["Title"] == "Master") & (X_trn["Age"].isnull())].head()
X_tst[(X_tst["Title"] == "Master") & (X_tst["Age"].isnull())].head()
Let’s replace the missing ages with the mean we calculated and look at the results.
X_trn.loc[(X_trn["Title"] == "Master") & (X_trn["Age"].isnull()), "Age"] = mean_all
X_tst.loc[(X_tst["Title"] == "Master") & (X_tst["Age"].isnull()), "Age"] = mean_all
# let's check
pid_trn = [66, 160, 177, 710]
pid_tst = [1136, 1231, 1236, 1309]
X_trn[X_trn["PassengerId"].isin(pid_trn)]
X_tst[X_tst["PassengerId"].isin(pid_tst)]
Looks reasonable.
Impute Remaining Missing Ages
Once again I am going to use IterativeImputer. For better or worse.
# okay, now on to the IterativeImputer
# (these imports will already be in place if you are following along from the previous post)
from sklearn.experimental import enable_iterative_imputer  # required to enable IterativeImputer
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

min_age = min(X_trn["Age"].min(), X_tst["Age"].min())
max_age = max(X_trn["Age"].max(), X_tst["Age"].max())
print(min_age, max_age)
transformer = FeatureUnion(
    transformer_list=[
        ('features', IterativeImputer(max_iter=10, min_value=min_age, max_value=max_age, random_state=0)),
        ('indicators', MissingIndicator())])
features = ["PassengerId", "Pclass", "Sex", "SibSp", "Parch", "Title", "Age"]
X_trn = pd.get_dummies(X_trn[features])
X_tst = pd.get_dummies(X_tst[features])
X_trn.head()
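One thing worth keeping an eye on: running get_dummies separately on the training and test sets can produce different dummy columns if a Title value shows up in only one of them, and the imputer fitted on X_trn expects exactly the same columns in X_tst. I expect my Title values are identical in both sets, so the following should be a no-op, but just in case:

# safeguard (likely a no-op here): make sure X_tst has exactly the same dummy
# columns, in the same order, as X_trn before fitting/transforming
X_tst = X_tst.reindex(columns=X_trn.columns, fill_value=0)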
# let's train, and transform, our imputer on X_trn, and have a look
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
disp_cols = ["PassengerId", "Pclass", "Sex_female", "Sex_male", "SibSp", "Parch", "Age"]
# X_trn_trans[disp_cols].tail()
X_trn_trans[disp_cols].describe()
# looks better, do the same for X_tst
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
disp_cols = ["PassengerId", "Pclass", "Sex_female", "Sex_male", "SibSp", "Parch", "Age"]
# X_trn_trans[disp_cols].tail()
X_tst_trans[disp_cols].describe()
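And a quick check, for my own peace of mind, that the imputer left no missing ages and kept everything within the min/max bounds we gave it.

# confirm there are no missing ages left and all ages fall within [min_age, max_age]
for name, df in [("train", X_trn_trans), ("test", X_tst_trans)]:
    print(name, df["Age"].isnull().sum(), df["Age"].between(min_age, max_age).all())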
New Updated Datasets
Let’s create updated datasets (training and testing).
# new updated training dataset dataframe
k_trn_2 = k_trn.copy()
k_trn_2 = k_trn_2.drop("AgeMissing", axis=1)
k_trn_2["Age"] = X_trn_trans["Age"]
k_trn_2 = pd.concat([k_trn_2, X_trn_trans["AgeMissing"]], axis=1)
# new updated testing dataset dataframe
k_tst_2 = k_tst.copy()
k_tst_2 = k_tst_2.drop("AgeMissing", axis=1)
k_tst_2["Age"] = X_tst_trans["Age"]
k_tst_2 = pd.concat([k_tst_2, X_tst_trans["AgeMissing"]], axis=1)
k_trn_2.describe()
k_tst_2.describe()
k_trn_2.info()
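Since negative imputed ages are what got me into this mess in the first place, one more quick check that none of them survived into the updated dataframes.

# no negative and no missing ages should remain in the updated dataframes
print((k_trn_2["Age"] < 0).sum(), k_trn_2["Age"].isnull().sum())
print((k_tst_2["Age"] < 0).sum(), k_tst_2["Age"].isnull().sum())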
Let’s Not Forget Why We’re Here
Best not to forget to update the AgeBin feature this time.
# glad I did that, almost forgot to update AgeBin again
bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
k_trn_2['AgeBin'] = pd.cut(k_trn_2['Age'], bins=bin_thresholds, labels=bin_labels)
k_tst_2['AgeBin'] = pd.cut(k_tst_2['Age'], bins=bin_thresholds, labels=bin_labels)
k_trn_2.info()
k_tst_2.info()
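And, since forgetting AgeBin is what triggered attempt number three, one last count to confirm no passenger is missing a bin this time.

# every passenger should now have an AgeBin value
print(k_trn_2["AgeBin"].isnull().sum(), k_tst_2["AgeBin"].isnull().sum())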
Looks like we are where we want to be.
Last But Not Least
So, let’s save our work (once again and maybe not for the last time).
# save updated datasets to our CSV files
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)
Done???
Who knows for sure, but hopefully this repetitive effort can finally be laid to rest.
Feel free to download and play with my version of this post’s related notebook.