Let’s have a quick look at what adding Age as a feature might do for us.

Age Model #1

Note the title: there may be more than one age model, each a variation on how I handle imputing Age, or perhaps on engineering a new feature based on age (e.g. age groupings).

I’ll skip the setup cells contained in the notebook (you can always download the related notebook for a look). But I have finally created a base notebook that I copy to start each new attempt/post. Took me a while to see the light. The base will likely change over time.

I am also going to try using a pipeline in my modelling process.

In [5]:
# will use iterativeimputer in pipeline to fill in missing ages
# may also try KNNImputer in future
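# note: IterativeImputer is still experimental in scikit-learn, so it must be enabled
#   with 'from sklearn.experimental import enable_iterative_imputer' before it can be imported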
transformer = FeatureUnion(
  transformer_list=[
    ('features', IterativeImputer(max_iter=10, random_state=0)),
    ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
In [6]:
y_trn = k_trn['Survived']

features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

In [7]:
# the MissingIndicator in the FeatureUnion appends a boolean column flagging rows
# where Age was missing, so add a matching column name for displaying the output
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
tst_cols = X_test.columns.tolist()
tst_cols.append("AgeMissing")
In [8]:
clf = clf.fit(X_trn, y_trn)
preds = clf.predict(X_test)
accuracy_score(k_tst["Survived"], preds)
Out[8]:
0.7870813397129187
In [9]:
# want to have a look at the intermediate/transformed data for training and prediction
X_trn.tail()
# run the training data through every pipeline step except the final classifier
x_intermediate = X_trn
for step in clf.steps[:-1]:
    x_intermediate = step[1].transform(x_intermediate)
x_int_trn_trans = pd.DataFrame(x_intermediate, columns=trn_cols)
x_int_trn_trans.tail()
# and the same for the test data
x_tst_int = X_test
for step in clf.steps[:-1]:
    x_tst_int = step[1].transform(x_tst_int)
x_int_tst_trans = pd.DataFrame(x_tst_int, columns=tst_cols)
x_int_tst_trans.tail()
Out[9]:
     Pclass  SibSp  Parch    Age  Sex_female  Sex_male
886       2      0      0  27.00           0         1
887       1      0      0  19.00           1         0
888       3      1      2    NaN           1         0
889       1      0      0  26.00           0         1
890       3      0      0  32.00           0         1
Out[9]:
     Pclass  SibSp  Parch    Age  Sex_female  Sex_male  AgeMissing
886    2.00   0.00   0.00  27.00        0.00      1.00        0.00
887    1.00   0.00   0.00  19.00        1.00      0.00        0.00
888    3.00   1.00   2.00  19.49        1.00      0.00        1.00
889    1.00   0.00   0.00  26.00        0.00      1.00        0.00
890    3.00   0.00   0.00  32.00        0.00      1.00        0.00
Out[9]:
     Pclass  SibSp  Parch    Age  Sex_female  Sex_male  AgeMissing
413    3.00   0.00   0.00  28.50        0.00      1.00        1.00
414    1.00   0.00   0.00  39.00        1.00      0.00        0.00
415    3.00   0.00   0.00  38.50        0.00      1.00        0.00
416    3.00   0.00   0.00  28.50        0.00      1.00        1.00
417    3.00   1.00   1.00  23.67        0.00      1.00        1.00
In [10]:
# let's check the pipeline result by running the transformer and classifier separately
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
tst_cols = X_test.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_test)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
In [11]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_trn_trans, y_trn)
predictions = model.predict(X_tst_trans)
accuracy_score(k_tst["Survived"], predictions)
Out[11]:
RandomForestClassifier(max_depth=5, random_state=1)
Out[11]:
0.7870813397129187

I could likely play with the random seed or the maximum number of iterations used by the IterativeImputer to see if the model could do better, along the lines of the sketch below. But I have decided, at this time, not to bother.
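If I ever do go down that road, a quick sketch of what such a search might look like. The specific max_iter and random_state values below are placeholders I made up for illustration, not values I have actually tested.

# hypothetical sketch: try a few IterativeImputer settings and compare accuracy
for max_iter in [5, 10, 20]:
    for seed in [0, 1, 42]:
        trans = FeatureUnion(
            transformer_list=[
                ('features', IterativeImputer(max_iter=max_iter, random_state=seed)),
                ('indicators', MissingIndicator())])
        pipe = make_pipeline(trans, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
        pipe.fit(X_trn, y_trn)
        score = accuracy_score(k_tst["Survived"], pipe.predict(X_test))
        print(f"max_iter={max_iter}, random_state={seed} -> {score}")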

Age Model #2

Okay, let’s see if the KNNImputer does any better. I am going to go with the default of n_neighbors=5. And I did have to add an import for this imputer.
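For anyone following along, that import is simply:

from sklearn.impute import KNNImputer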

In [12]:
# let's try the KNNImputer
transformer_2 = FeatureUnion(
  transformer_list=[
    ('features', KNNImputer(n_neighbors=5)),
    ('indicators', MissingIndicator())])
clf_2 = make_pipeline(transformer_2, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
In [13]:
clf_2 = clf_2.fit(X_trn, y_trn)
preds_2 = clf_2.predict(X_test)
accuracy_score(k_tst["Survived"], preds_2)
Out[13]:
0.7751196172248804

Well, that attempt did not do better than the IterativeImputer. Close but no cigar. Let’s see if differing values for n_neighbors make any meaningful difference.

In [14]:
# let's try a few values for n_neighbors
for kn in [2, 4, 6, 8, 10, 12]:
  transformer_3 = FeatureUnion(
    transformer_list=[
      ('features', KNNImputer(n_neighbors=kn)),
      ('indicators', MissingIndicator())])
  clf_3 = make_pipeline(transformer_3, RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1))
  clf_3 = clf_3.fit(X_trn, y_trn)
  preds_3 = clf_3.predict(X_test)
  score = accuracy_score(k_tst["Survived"], preds_3)
  print(f"k_neighbors={kn} -> {score}")
k_neighbors=2 -> 0.7751196172248804
k_neighbors=4 -> 0.7751196172248804
k_neighbors=6 -> 0.777511961722488
k_neighbors=8 -> 0.7727272727272727
k_neighbors=10 -> 0.777511961722488
k_neighbors=12 -> 0.777511961722488

A few values did marginally better than the default, but nothing came close to the IterativeImputer result.

Next Step

I am going to use the IterativeImputer to fill in the missing age values and save the result to my modified dataset, so that I do not need to re-impute Age every time. Something along the lines of the sketch below.
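To make that concrete, a rough sketch of what I have in mind. The output file names are placeholders for illustration, not the actual names of my modified dataset files.

# hypothetical sketch: impute Age once on the dummy-encoded features,
# write the filled values back, and save the updated datasets
imp = IterativeImputer(max_iter=10, random_state=0)
k_trn['Age'] = imp.fit_transform(X_trn)[:, X_trn.columns.get_loc('Age')]
k_tst['Age'] = imp.transform(X_test)[:, X_test.columns.get_loc('Age')]
k_trn.to_csv("kaggle_trn_mod.csv", index=False)
k_tst.to_csv("kaggle_tst_mod.csv", index=False)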

I am also going to look at adding a new categorical feature assigning passengers to an age range, e.g. ‘0-15’, ‘16-29’, … Perhaps something like the sketch below.
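Just to make the idea concrete, a rough sketch using pandas.cut. The bin edges and labels here are placeholders, not necessarily the groupings I will end up with.

# hypothetical sketch: bin Age into a categorical age-range feature
bins = [0, 15, 29, 49, 64, 120]
labels = ['0-15', '16-29', '30-49', '50-64', '65+']
k_trn['AgeGroup'] = pd.cut(k_trn['Age'], bins=bins, labels=labels)
k_tst['AgeGroup'] = pd.cut(k_tst['Age'], bins=bins, labels=labels)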

I may add that to this post, but will likely put it into its own post — short and sweet as it may be.

Done

That’s it for another week.

Feel free to download and play with my version of this post’s related notebook.

Resources