Let’s have a quick look at what adding Age as a feature might do for us.
Age Model #1
Note the title: there may be variations on how I handle imputing Age, or perhaps on engineering a new feature based on age (e.g. age groupings).
I’ll skip the setup cells contained in the notebook (you can always download the related notebook for a look). But I have finally created a base notebook that I copy to start each new attempt/post. Took me a while to see the light. The base will likely change over time.
I am also going to try using a pipeline in my modelling process.
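Since I am skipping the setup cells, here, for reference, is roughly what the imports behind the following cells look like. Note that IterativeImputer is still experimental in scikit-learn, so its enable import has to come first.

# rough reconstruction of the skipped setup imports
import pandas as pd
# the enable import must precede importing IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score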
# will use IterativeImputer in pipeline to fill in missing ages
# may also try KNNImputer in future
# MissingIndicator adds a boolean column flagging rows where Age was missing
transformer = FeatureUnion(
    transformer_list=[
        ('features', IterativeImputer(max_iter=10, random_state=0)),
        ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
y_trn = k_trn['Survived']
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])
# column names for the transformed arrays: the dummied features plus the missing-Age indicator
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
tst_cols = X_test.columns.tolist()
tst_cols.append("AgeMissing")
clf = clf.fit(X_trn, y_trn)
preds = clf.predict(X_test)
accuracy_score(k_tst["Survived"], preds)
# want to have a look at the intermediate/transformed data for training and prediction
X_trn.tail()
x_intermediate = X_trn
for step in clf.steps[:-1]:
    x_intermediate = step[1].transform(x_intermediate)
x_int_trn_trans = pd.DataFrame(x_intermediate, columns=trn_cols)
x_int_trn_trans.tail()
x_tst_int = X_test
for step in clf.steps[:-1]:
    x_tst_int = step[1].transform(x_tst_int)
x_int_tst_trans = pd.DataFrame(x_tst_int, columns=tst_cols)
x_int_tst_trans.tail()
# let's check the pipeline result
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
tst_cols = X_test.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_test)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_trn_trans, y_trn)
predictions = model.predict(X_tst_trans)
accuracy_score(k_tst["Survived"], predictions)
I could likely play with the random seed or the maximum number of iterations used by the IterativeImputer to see if the model could do better. But I have decided, at this time, not to bother.
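Had I bothered, the sweep would be a simple loop, much like the n_neighbors loop further down. A minimal sketch (the parameter values are arbitrary):

# a sketch of the sweep I skipped; max_iter/random_state values are arbitrary
for mi, rs in [(5, 0), (10, 0), (20, 0), (10, 1), (10, 42)]:
    transformer_s = FeatureUnion(
        transformer_list=[
            ('features', IterativeImputer(max_iter=mi, random_state=rs)),
            ('indicators', MissingIndicator())])
    clf_s = make_pipeline(transformer_s, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
    clf_s = clf_s.fit(X_trn, y_trn)
    score_s = accuracy_score(k_tst["Survived"], clf_s.predict(X_test))
    print(f"max_iter={mi}, random_state={rs} -> {score_s}")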
Age Model #2
Okay, let’s see if the KNNImputer does any better. I am going to go with the default of n_neighbors=5. And, I did have to add an import for this imputer.
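For reference, that import is:

from sklearn.impute import KNNImputer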
# let's try the KNNImputer
transformer_2 = FeatureUnion(
    transformer_list=[
        ('features', KNNImputer(n_neighbors=5)),
        ('indicators', MissingIndicator())])
clf_2 = make_pipeline(transformer_2, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
clf_2 = clf_2.fit(X_trn, y_trn)
preds_2 = clf_2.predict(X_test)
accuracy_score(k_tst["Survived"], preds_2)
Well, that attempt did not do better than the IterativeImputer. Close but no cigar. Let’s see if differing values for n_neighbors make any meaningful difference.
# let's try a few values for n_neighbors
for kn in [2, 4, 6, 8, 10, 12]:
    transformer_3 = FeatureUnion(
        transformer_list=[
            ('features', KNNImputer(n_neighbors=kn)),
            ('indicators', MissingIndicator())])
    clf_3 = make_pipeline(transformer_3, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
    clf_3 = clf_3.fit(X_trn, y_trn)
    preds_3 = clf_3.predict(X_test)
    score = accuracy_score(k_tst["Survived"], preds_3)
    print(f"n_neighbors={kn} -> {score}")
Nothing better than the default.
Next Step
I am going to use the IterativeImputer to fill in the missing Age values and save the result to my modified dataset, so that I do not need to re-impute Age every time.
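A minimal sketch of that step, reusing the feature matrices from above (the CSV file names are hypothetical):

# impute Age once, write it back into the working datasets, and save them
imputer = IterativeImputer(max_iter=10, random_state=0)
X_trn_imp = pd.DataFrame(imputer.fit_transform(X_trn), columns=X_trn.columns)
X_tst_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
k_trn["Age"] = X_trn_imp["Age"].to_numpy()
k_tst["Age"] = X_tst_imp["Age"].to_numpy()
# file names below are hypothetical
k_trn.to_csv("titanic_trn_imputed.csv", index=False)
k_tst.to_csv("titanic_tst_imputed.csv", index=False)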
I am also going to look at adding a new categorical feature assigning passengers to an age range. E.g. ‘0-15’, ‘16-29’,…
I may add that to this post, but will likely put it into its own post — short and sweet as it may be.
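As a quick preview of that future post, pd.cut would handle the binning. A sketch, with illustrative bin edges and labels:

# assign each passenger to an age bracket; edges/labels are illustrative
bins = [0, 15, 29, 44, 59, 120]
labels = ["0-15", "16-29", "30-44", "45-59", "60+"]
k_trn["AgeGroup"] = pd.cut(k_trn["Age"], bins=bins, labels=labels, include_lowest=True)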
Done
That’s it for another week.
Feel free to download and play with my version of this post’s related notebook.
Resources
- sklearn.impute.IterativeImputer
- sklearn.impute.KNNImputer
- sklearn.impute.MissingIndicator
- sklearn.pipeline.FeatureUnion
- sklearn.pipeline.make_pipeline
- sklearn.pipeline.Pipeline
- Imputation of missing values
- Pipelines and composite estimators
- Get intermediate data state in scikit-learn Pipeline