The title of this post might only be partially correct.
As mentioned in the last post, I am going to be adding all the missing age data to my modified datasets, training and testing. But I am also going to add a new feature assigning passengers to age ranges. As things stand in my head, this feature will be categorical. I will look at adding a suitable conversion in any pipeline using the feature.
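Just to sketch what I currently have in mind for that conversion, something along these lines might do. Nothing here is final; AgeBin is just the planned column name and the encoder choice is only a guess at this point:
# rough sketch only: two ways I might convert the planned AgeBin column
# dummy encode it up front, like I already do for Sex
X_trn = pd.get_dummies(k_trn, columns=['AgeBin'])
# or one-hot encode it inside the pipeline itself
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
encoder = ColumnTransformer(
    [('age_bin', OneHotEncoder(handle_unknown='ignore'), ['AgeBin'])],
    remainder='passthrough')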
Impute Missing Ages & Update Datasets
This will be a bit of a repeat from last post. But…
# will use IterativeImputer in a pipeline to fill in missing ages
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: required to enable IterativeImputer
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.ensemble import RandomForestClassifier
transformer = FeatureUnion(
    transformer_list=[
        ('features', IterativeImputer(max_iter=10, random_state=0)),
        ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier())
I initially attempted to run the imputer on the whole training dataset. But, it choked on the Name feature. The imputer assumes all features in the dataset are numeric when calculating its fit statistics/parameters.
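Had I wanted to impute on the whole dataset anyway, I expect restricting the imputer to the numeric columns would have gotten around that. A rough sketch of that idea, not what I actually did:
# sketch only: hand the imputer just the numeric columns so it doesn't choke on Name, etc.
num_cols = k_trn.select_dtypes(include='number').columns.drop('Survived')
imp = IterativeImputer(max_iter=10, random_state=0)
k_trn_num = pd.DataFrame(imp.fit_transform(k_trn[num_cols]), columns=num_cols)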
So, I am going to impute using the same features used in the previous post, to see if adding Age to the model improves our prediction accuracy. I will fit the imputer on the training set, transform both datasets using the resulting imputer, and then add the imputed ages to the appropriate datasets.
All in all a touch more work than I anticipated.
# target plus the features used in the previous post, with Age added
y_trn = k_trn['Survived']
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_tst = pd.get_dummies(k_tst[features])
# fit and transform on the training set; the MissingIndicator adds a column flagging rows with missing Age
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
# transform the test set with the already fitted imputer
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
X_trn_trans.tail()
X_tst_trans.tail()
k_trn.head(2)
print(X_trn_trans.iloc[0].PassengerId == k_trn.iloc[0].PassengerId)
k_trn_2 = k_trn.copy()
k_trn_2['Age'] = X_trn_trans['Age']
k_trn_2 = pd.concat([k_trn_2, X_trn_trans['AgeMissing']], axis=1)
k_tst_2 = k_tst.copy()
k_tst_2['Age'] = X_tst_trans['Age']
k_tst_2 = pd.concat([k_tst_2, X_tst_trans['AgeMissing']], axis=1)
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)
# reload the updated datasets and see if any missing Age data
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
k_trn.info()
k_tst.info()
I should probably have been much more thorough in confirming that my concatenations actually assigned the ages to the correct people. But I am trusting that Pandas is good at looking after that kind of thing by preserving and aligning on the indices.
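For what it is worth, a quick check along these lines would have given me a bit more confidence. A minimal sketch using PassengerId, since it is present in both dataframes:
# sanity check sketch: confirm the rows still line up after the concatenations
print((k_trn_2['PassengerId'] == X_trn_trans['PassengerId']).all())
print((k_tst_2['PassengerId'] == X_tst_trans['PassengerId']).all())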
How to Define the Age Groups
But, what ages should be in each group? I don’t think just splitting the ages into equal length ranges will necessarily be the most effective approach. Though that likely would be the easiest to implement.
Because of issues if I re-run the code above, I started a new notebook for the following.
Let’s start by looking at the age data split into even length ranges, say 5 years. You may recall the max age was 80.
# let's bin the age data and have a look
k_all['AgeRng'] = pd.cut(k_all['Age'], bins=range(0, 90, 5))
sns.set(rc={'figure.figsize':(12,8)})
sns.set(font_scale=1.0)
# plt.style.use('seaborn-whitegrid')
g = sns.barplot(x='AgeRng', y='Survived', data=k_all)
table = pd.crosstab(k_all['AgeRng'], k_all['Survived'])
print('\n', table)
Looks like:
- 0-15: survival pretty good
- 30-40: a slight increase in survival rate compared to adjacent bins
- 60+: survival rate seems to decline
So, I am going to use the following ranges: [‘0-15’, ‘16-29’, ‘30-40’, ‘41-59’, ‘60+’]
bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
k_trn['AgeBin'] = pd.cut(k_trn['Age'], bins=bin_thresholds, labels=bin_labels)
k_tst['AgeBin'] = pd.cut(k_tst['Age'], bins=bin_thresholds, labels=bin_labels)
k_trn.tail()
k_tst.tail()
That appears to have worked. So, let’s save our changes to the CSV files.
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)
Done M’thinks
I really think this post has done what it intended to do. So, another fairly short and sweet post.
Feel free to download and play with this post’s two related notebooks: adding missing ages or creating new feature, AgeBin.
Apparently Not Done!
Well, turns out I’m not done.
While working on a future post, when generating a histplot for Age with hue="Survived", I saw negative values for Age. This is apparently a possibility when using the IterativeImputer.
Let's have a quick look; describe() is your friend.
k_trn['Age'].describe()
Sure enough. Should have done that when I was originally working on this post/notebook. Would have saved myself some grief.
A look at the test dataset also shows negative ages.
I was just going to use a backup of the CSV files and redo the post/notebook, hopefully with the correct result this time. But, I decided to fix things without using the backup, working from just the current CSVs and the original Kaggle CSVs, and to add the fix to both the post and the related notebook. Expect it might get messy.
I have added a new variable near the top of the notebook, do_cell, defaulting to False. For all the cells I do not want run again when I execute the full notebook, I put the cell body in an if do_cell: conditional block. They would likely mess up my CSVs if they ran repeatedly, m’thinks. And, if not, no sense wasting time having them run.
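For the record, the guard is nothing fancier than this (a sketch of one such wrapped cell):
# near the top of the notebook
do_cell = False

# cells I don't want re-run on a full notebook execution get wrapped like this
if do_cell:
    k_trn_2.to_csv(oma_trn_3, index=False)
    k_tst_2.to_csv(oma_tst_3, index=False)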
Reload Datasets
# start fresh
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_tst = pd.get_dummies(k_tst[features])
# let's load the Kaggle datasets so we can get the original Age data
kg_trn = pd.read_csv(kaggle_trn)
kg_tst = pd.read_csv(kaggle_tst)
Re-initialize Age Column
And, replace the Age column in the dataframes to be used for imputing missing Age values with the Age column from the original Kaggle datasets.
Sort How to Go About It
Let’s sort out how to do this, testing on the training dataset.
We will start by getting back the original Age data. I will temporarily rename the imputed data column to iiAge.
# now let's replace the Age data in my versions of the datasets with that from the Kaggle datasets
# for test will add kaggle column as well
X_trn.rename(columns={"Age": "iiAge"}, inplace=True)
X_trn["Age"] = kg_trn["Age"]
X_trn.head()
X_trn.tail()
X_trn["Age"].describe()
# let's check things out before deleting the old column
X_trn.loc[(X_trn['iiAge'].ne(X_trn['Age'])) & (X_trn['Age'].notna())]
# get rid of the iiAge column in X_trn (drop returns a new dataframe, so assign it back)
X_trn = X_trn.drop('iiAge', axis=1)
Process Test Dataset
# let's do same for test data set
X_tst.rename(columns={"Age": "iiAge"}, inplace=True)
X_tst["Age"] = kg_tst["Age"]
X_tst.loc[(X_tst['iiAge'].ne(X_tst['Age'])) & (X_tst['Age'].notna())]
X_tst = X_tst.drop('iiAge', axis=1)
Impute Missing Values (Properly?)
# now see if we can fix that imputer
# bound the imputed values by the observed minimum and maximum ages
min_age = min(X_trn["Age"].min(), X_tst["Age"].min())
max_age = max(X_trn["Age"].max(), X_tst["Age"].max())
print(min_age, max_age)
transformer = FeatureUnion(
transformer_list=[
('features', IterativeImputer(max_iter=10, min_value=min_age, max_value=max_age, random_state=0)),
('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier())
Training Dataset
# let's train, and transform, our imputer on X_trn, and have a look
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
X_trn_trans.describe()
Test Dataset
# looks better, do the same for X_tst
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
X_tst_trans.describe()
New Dataframes with Updated Values
Now, re-build the datasets we will be saving to our CSVs.
# new updated training dataset dataframe
k_trn_2 = k_trn.copy()
k_trn_2 = k_trn_2.drop("AgeMissing", axis=1)
k_trn_2['Age'] = X_trn_trans['Age']
k_trn_2 = pd.concat([k_trn_2, X_trn_trans['AgeMissing']], axis=1)
# new updated testing dataset dataframe
k_tst_2 = k_tst.copy()
k_tst_2 = k_tst_2.drop("AgeMissing", axis=1)
k_tst_2['Age'] = X_tst_trans['Age']
k_tst_2 = pd.concat([k_tst_2, X_tst_trans['AgeMissing']], axis=1)
k_trn_2.describe()
k_tst_2.describe()
Save to CSV
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)
Done for the 2nd Time
Okay I think that’s it. And, now this post really is lengthy.
Feel free to download and play with the updated version of this post’s related notebook.
Resources
- pandas.DataFrame.filter
- pandas.DataFrame.max
- pandas.DataFrame.rename
- pandas.cut
- pandas.qcut — in case you want to look at another option
- python pandas select rows where two columns are (not) equal
- Create new column from multiple columns where value that is not NaN
- sklearn.impute.IterativeImputer
- I’m getting negative values as output of IterativeImputer from sklearn