Let’s have a quick look at what adding Age as a feature might do for us.

Age Model #1

Note the title: there may be more than one age model, each a variation on how I handle imputing Age, or perhaps on engineering a new feature based on age (e.g. age groupings).

I’ll skip the setup cells contained in the notebook (you can always download the related notebook for a look). But I have finally created a base notebook that I copy to start each new attempt/post. Took me a while to see the light. The base will likely change over time.

I am also going to try using a pipeline in my modelling process.

In [5]:
# will use iterativeimputer in pipeline to fill in missing ages
# may also try KNNImputer in future
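# note: IterativeImputer is still experimental in scikit-learn, so it must be enabled
#   with 'from sklearn.experimental import enable_iterative_imputer' before it can be imported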
transformer = FeatureUnion(
  transformer_list=[
    ('features', IterativeImputer(max_iter=10, random_state=0)),
    ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
In [6]:
y_trn = k_trn['Survived']

features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

In [7]:
# the MissingIndicator in the FeatureUnion appends a boolean column flagging rows
# where Age was missing, so add a matching column name for displaying the output
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
tst_cols = X_test.columns.tolist()
tst_cols.append("AgeMissing")
In [8]:
clf = clf.fit(X_trn, y_trn)
preds = clf.predict(X_test)
accuracy_score(k_tst["Survived"], preds)
Out[8]:
0.7870813397129187
In [9]:
# want to have a look at the intermediate/transformed data for training and prediction
X_trn.tail()
# run the training data through every pipeline step except the final classifier
x_intermediate = X_trn
for step in clf.steps[:-1]:
    x_intermediate = step[1].transform(x_intermediate)
x_int_trn_trans = pd.DataFrame(x_intermediate, columns=trn_cols)
x_int_trn_trans.tail()
# and the same for the test data
x_tst_int = X_test
for step in clf.steps[:-1]:
    x_tst_int = step[1].transform(x_tst_int)
x_int_tst_trans = pd.DataFrame(x_tst_int, columns=tst_cols)
x_int_tst_trans.tail()
Out[9]:
     Pclass  SibSp  Parch    Age  Sex_female  Sex_male
886       2      0      0  27.00           0         1
887       1      0      0  19.00           1         0
888       3      1      2    NaN           1         0
889       1      0      0  26.00           0         1
890       3      0      0  32.00           0         1
Out[9]:
     Pclass  SibSp  Parch    Age  Sex_female  Sex_male  AgeMissing
886    2.00   0.00   0.00  27.00        0.00      1.00        0.00
887    1.00   0.00   0.00  19.00        1.00      0.00        0.00
888    3.00   1.00   2.00  19.49        1.00      0.00        1.00
889    1.00   0.00   0.00  26.00        0.00      1.00        0.00
890    3.00   0.00   0.00  32.00        0.00      1.00        0.00
Out[9]:
     Pclass  SibSp  Parch    Age  Sex_female  Sex_male  AgeMissing
413    3.00   0.00   0.00  28.50        0.00      1.00        1.00
414    1.00   0.00   0.00  39.00        1.00      0.00        0.00
415    3.00   0.00   0.00  38.50        0.00      1.00        0.00
416    3.00   0.00   0.00  28.50        0.00      1.00        1.00
417    3.00   1.00   1.00  23.67        0.00      1.00        1.00
In [10]:
# let's check the pipeline result by running the transformer and classifier separately
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
tst_cols = X_test.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_test)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
In [11]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_trn_trans, y_trn)
predictions = model.predict(X_tst_trans)
accuracy_score(k_tst["Survived"], predictions)
Out[11]:
RandomForestClassifier(max_depth=5, random_state=1)
Out[11]:
0.7870813397129187

I could likely play with the random seed or the maximum number of iterations used by the IterativeImputer to see if the model could do better, along the lines of the sketch below. But I have decided, at this time, not to bother.
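If I ever do go down that road, a quick sketch of what such a search might look like. The specific max_iter and random_state values below are placeholders I made up for illustration, not values I have actually tested.

# hypothetical sketch: try a few IterativeImputer settings and compare accuracy
for max_iter in [5, 10, 20]:
    for seed in [0, 1, 42]:
        trans = FeatureUnion(
            transformer_list=[
                ('features', IterativeImputer(max_iter=max_iter, random_state=seed)),
                ('indicators', MissingIndicator())])
        pipe = make_pipeline(trans, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
        pipe.fit(X_trn, y_trn)
        score = accuracy_score(k_tst["Survived"], pipe.predict(X_test))
        print(f"max_iter={max_iter}, random_state={seed} -> {score}")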

Age Model #2

Okay, let’s see if the KNNImputer does any better. I am going to go with the default of n_neighbors=5. And I did have to add an import for this imputer.
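For anyone following along, that import is simply:

from sklearn.impute import KNNImputer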

In [12]:
# let's try the KNNImputer
transformer_2 = FeatureUnion(
  transformer_list=[
    ('features', KNNImputer(n_neighbors=5)),
    ('indicators', MissingIndicator())])
clf_2 = make_pipeline(transformer_2, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
In [13]:
clf_2 = clf_2.fit(X_trn, y_trn)
preds_2 = clf_2.predict(X_test)
accuracy_score(k_tst["Survived"], preds_2)
Out[13]:
0.7751196172248804

Well, that attempt did not do better than the IterativeImputer. Close but no cigar. Let’s see if differing values for n_neighbors make any meaningful difference.

In [14]:
# let's try a few values for n_neighbors
for kn in [2, 4, 6, 8, 10, 12]:
  transformer_3 = FeatureUnion(
    transformer_list=[
      ('features', KNNImputer(n_neighbors=kn)),
      ('indicators', MissingIndicator())])
  clf_3 = make_pipeline(transformer_3, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
  clf_3 = clf_3.fit(X_trn, y_trn)
  preds_3 = clf_3.predict(X_test)
  score = accuracy_score(k_tst["Survived"], preds_3)
  print(f"k_neighbors={kn} -> {score}")
k_neighbors=2 -> 0.7751196172248804
k_neighbors=4 -> 0.7751196172248804
k_neighbors=6 -> 0.777511961722488
k_neighbors=8 -> 0.7727272727272727
k_neighbors=10 -> 0.777511961722488
k_neighbors=12 -> 0.777511961722488

A few values did marginally better than the default, but nothing came close to the IterativeImputer result.

Next Step

I am going to use the IterativeImputer to fill in the missing age values and save the result to my modified dataset, so that I do not need to re-impute Age every time. Something along the lines of the sketch below.
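To make that concrete, a rough sketch of what I have in mind. The output file names are placeholders for illustration, not the actual names of my modified dataset files.

# hypothetical sketch: impute Age once on the dummy-encoded features,
# write the filled values back, and save the updated datasets
imp = IterativeImputer(max_iter=10, random_state=0)
k_trn['Age'] = imp.fit_transform(X_trn)[:, X_trn.columns.get_loc('Age')]
k_tst['Age'] = imp.transform(X_test)[:, X_test.columns.get_loc('Age')]
k_trn.to_csv("kaggle_trn_mod.csv", index=False)
k_tst.to_csv("kaggle_tst_mod.csv", index=False)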

I am also going to look at adding a new categorical feature assigning passengers to an age range, e.g. ‘0-15’, ‘16-29’, … Perhaps something like the sketch below.
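Just to make the idea concrete, a rough sketch using pandas.cut. The bin edges and labels here are placeholders, not necessarily the groupings I will end up with.

# hypothetical sketch: bin Age into a categorical age-range feature
bins = [0, 15, 29, 49, 64, 120]
labels = ['0-15', '16-29', '30-49', '50-64', '65+']
k_trn['AgeGroup'] = pd.cut(k_trn['Age'], bins=bins, labels=labels)
k_tst['AgeGroup'] = pd.cut(k_tst['Age'], bins=bins, labels=labels)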

I may add that to this post, but will likely put it into its own post — short and sweet as it may be.

Done

That’s it for another week.

Feel free to download and play with my version of this post’s related notebook.

Resources