The title of this post might only be partially correct.
As mentioned in the last post, I am going to be adding all the missing age data to my modified datasets, training and testing. But I am also going to add a new feature assigning passengers to age ranges. As things stand in my head, this feature will be categorical. I will look at adding a suitable conversion in any pipeline using the feature.
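Just to sketch what I currently have in mind for that conversion, something along these lines might do. Nothing here is final; AgeBin is just the planned column name and the encoder choice is only a guess at this point:
# rough sketch only: two ways I might convert the planned AgeBin column
# dummy encode it up front, like I already do for Sex
X_trn = pd.get_dummies(k_trn, columns=['AgeBin'])
# or one-hot encode it inside the pipeline itself
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
encoder = ColumnTransformer(
    [('age_bin', OneHotEncoder(handle_unknown='ignore'), ['AgeBin'])],
    remainder='passthrough')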
Impute Missing Ages & Update Datasets
This will be a bit of a repeat from last post. But…
# will use IterativeImputer in a pipeline to fill in missing ages
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: required to enable IterativeImputer
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.ensemble import RandomForestClassifier
transformer = FeatureUnion(
    transformer_list=[
        ('features', IterativeImputer(max_iter=10, random_state=0)),
        ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier())
I initially attempted to run the imputer on the whole training dataset. But, it choked on the Name feature. The imputer assumes all features in the dataset are numeric when calculating its fit statistics/parameters.
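Had I wanted to impute on the whole dataset anyway, I expect restricting the imputer to the numeric columns would have gotten around that. A rough sketch of that idea, not what I actually did:
# sketch only: hand the imputer just the numeric columns so it doesn't choke on Name, etc.
num_cols = k_trn.select_dtypes(include='number').columns.drop('Survived')
imp = IterativeImputer(max_iter=10, random_state=0)
k_trn_num = pd.DataFrame(imp.fit_transform(k_trn[num_cols]), columns=num_cols)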
So, I am going to impute using the same features used in the previous post, to see if adding Age to the model improves our prediction accuracy. I will fit the imputer on the training set, transform both datasets using the resulting imputer, and then add the imputed ages to the appropriate datasets.
All in all a touch more work than I anticipated.
# target plus the features used in the previous post, with Age added
y_trn = k_trn['Survived']
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_tst = pd.get_dummies(k_tst[features])
# fit and transform on the training set; the MissingIndicator adds a column flagging rows with missing Age
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
# transform the test set with the already fitted imputer
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
X_trn_trans.tail()
X_tst_trans.tail()
k_trn.head(2)
print(X_trn_trans.iloc[0].PassengerId == k_trn.iloc[0].PassengerId)
k_trn_2 = k_trn.copy()
k_trn_2['Age'] = X_trn_trans['Age']
k_trn_2 = pd.concat([k_trn_2, X_trn_trans['AgeMissing']], axis=1)
k_tst_2 = k_tst.copy()
k_tst_2['Age'] = X_tst_trans['Age']
k_tst_2 = pd.concat([k_tst_2, X_tst_trans['AgeMissing']], axis=1)
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)
# reload the updated datasets and see if any missing Age data
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
k_trn.info()
k_tst.info()
I should probably have been much more thorough in confirming that my concatenations actually assigned the ages to the correct people. But I am trusting that Pandas is good at looking after that kind of thing by preserving and aligning on the indices.
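For what it is worth, a quick check along these lines would have given me a bit more confidence. A minimal sketch using PassengerId, since it is present in both dataframes:
# sanity check sketch: confirm the rows still line up after the concatenations
print((k_trn_2['PassengerId'] == X_trn_trans['PassengerId']).all())
print((k_tst_2['PassengerId'] == X_tst_trans['PassengerId']).all())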
How to Define the Age Groups
But, what ages should be in each group? I don’t think just splitting the ages into equal length ranges will necessarily be the most effective approach. Though that likely would be the easiest to implement.
Because of issues if I re-run the code above, I started a new notebook for the following.
Let’s start by looking at the age data split into even length ranges, say 5 years. You may recall the max age was 80.
# let's bin the age data and have a look
k_all['AgeRng'] = pd.cut(k_all['Age'], bins=range(0, 90, 5))
sns.set(rc={'figure.figsize':(12,8)})
sns.set(font_scale=1.0)
# plt.style.use('seaborn-whitegrid')
g = sns.barplot(x='AgeRng', y='Survived', data=k_all)
table = pd.crosstab(k_all['AgeRng'], k_all['Survived'])
print('\n', table)
Looks like:
- 0-15: survival pretty good
- 30-40: a slight increase in survival rate compared to adjacent bins
- 60+: survival rate seems to decline
So, I am going to use the following ranges: [‘0-15’, ‘16-29’, ‘30-40’, ‘41-59’, ‘60+’]
bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
k_trn['AgeBin'] = pd.cut(k_trn['Age'], bins=bin_thresholds, labels=bin_labels)
k_tst['AgeBin'] = pd.cut(k_tst['Age'], bins=bin_thresholds, labels=bin_labels)
k_trn.tail()
k_tst.tail()
That appears to have worked. So, let’s save our changes to the CSV files.
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)
Done M’thinks
I really think this post has done what it intended to do. So, another fairly short and sweet post.
Feel free to download and play with this post’s two related notebooks: adding missing ages or creating new feature, AgeBin.
Apparently Not Done!
Well, turns out I’m not done.
While working on a future post, when generating a histplot for Age with hue="Survived", I saw negative values for Age. This is apparently a possibility when using the IterativeImputer.
Let's have a quick look; describe() is your friend.
k_trn['Age'].describe()
Sure enough. Should have done that when I was originally working on this post/notebook. Would have saved myself some grief.
A look at the test dataset also shows negative ages.
I was just going to use a backup of the CSV files and redo the post/notebook, hopefully with the correct result this time. But, I decided to fix things without using the backup, working from just the current CSVs and the original Kaggle CSVs, and to add the fix to both the post and the related notebook. Expect it might get messy.
I have added a new variable near the top of the notebook, do_cell, defaulting to False. For all the cells I do not want run again when I execute the full notebook, I put the cell body in an if do_cell: conditional block. They would likely mess up my CSVs if they ran repeatedly, m’thinks. And, if not, no sense wasting time having them run.
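For the record, the guard is nothing fancier than this (a sketch of one such wrapped cell):
# near the top of the notebook
do_cell = False

# cells I don't want re-run on a full notebook execution get wrapped like this
if do_cell:
    k_trn_2.to_csv(oma_trn_3, index=False)
    k_tst_2.to_csv(oma_tst_3, index=False)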
Reload Datasets
# start fresh
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_tst = pd.get_dummies(k_tst[features])
# let's load the Kaggle datasets so we can get the original Age data
kg_trn = pd.read_csv(kaggle_trn)
kg_tst = pd.read_csv(kaggle_tst)
Re-initialize Age Column
And, replace the Age column in the dataframes to be used for imputing missing Age values with the Age column from the original Kaggle datasets.
Sort How to Go About It
Let’s sort out how to do this, testing on the training dataset.
We will start by getting back the original Age data. I will temporarily rename the imputed data column to iiAge.
# now let's replace the Age data in my versions of the datasets with that from the Kaggle datasets
# for test will add kaggle column as well
X_trn.rename(columns={"Age": "iiAge"}, inplace=True)
X_trn["Age"] = kg_trn["Age"]
X_trn.head()
X_trn.tail()
X_trn["Age"].describe()
# let's check things out before deleting the old column
X_trn.loc[(X_trn['iiAge'].ne(X_trn['Age'])) & (X_trn['Age'].notna())]
# get rid of the iiAge column in X_trn (drop returns a new dataframe, so assign it back)
X_trn = X_trn.drop('iiAge', axis=1)
Process Test Dataset
# let's do same for test data set
X_tst.rename(columns={"Age": "iiAge"}, inplace=True)
X_tst["Age"] = kg_tst["Age"]
X_tst.loc[(X_tst['iiAge'].ne(X_tst['Age'])) & (X_tst['Age'].notna())]
X_tst = X_tst.drop('iiAge', axis=1)
Impute Missing Values (Properly?)
# now see if we can fix that imputer
# bound the imputed values by the observed minimum and maximum ages
min_age = min(X_trn["Age"].min(), X_tst["Age"].min())
max_age = max(X_trn["Age"].max(), X_tst["Age"].max())
print(min_age, max_age)
transformer = FeatureUnion(
transformer_list=[
('features', IterativeImputer(max_iter=10, min_value=min_age, max_value=max_age, random_state=0)),
('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier())
Training Dataset
# let's train, and transform, our imputer on X_trn, and have a look
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
X_trn_trans.describe()
Test Dataset
# looks better, do the same for X_tst
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
X_tst_trans.describe()
New Dataframes with Updated Values
Now, re-build the datasets we will be saving to our CSVs.
# new updated training dataset dataframe
k_trn_2 = k_trn.copy()
k_trn_2 = k_trn_2.drop("AgeMissing", axis=1)
k_trn_2['Age'] = X_trn_trans['Age']
k_trn_2 = pd.concat([k_trn_2, X_trn_trans['AgeMissing']], axis=1)
# new updated testing dataset dataframe
k_tst_2 = k_tst.copy()
k_tst_2 = k_tst_2.drop("AgeMissing", axis=1)
k_tst_2['Age'] = X_tst_trans['Age']
k_tst_2 = pd.concat([k_tst_2, X_tst_trans['AgeMissing']], axis=1)
k_trn_2.describe()
k_tst_2.describe()
Save to CSV
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)
Done for the 2nd Time
Okay I think that’s it. And, now this post really is lengthy.
Feel free to download and play with the updated version of this post’s related notebook.
Resources
- pandas.DataFrame.filter
- pandas.DataFrame.max
- pandas.DataFrame.rename
- pandas.cut
- pandas.qcut — in case you want to look at another option
- python pandas select rows where two columns are (not) equal
- Create new column from multiple columns where value that is not NaN
- sklearn.impute.IterativeImputer
- I’m getting negative values as output of IterativeImputer from sklearn