The title might be only partially correct for this post.

As mentioned in the last post, I am going to be adding all the missing age data to my modified datasets, training and testing. But I am also going to add a new feature assigning passengers to age ranges. As things stand in my head, this feature will be categorical. I will look at adding a suitable conversion in any pipeline using the feature.
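The conversion itself is a decision for later, but a rough sketch of what I have in mind might look like the following. This is only an illustration on made-up data; the AgeBin column name and the labels anticipate the feature created later in this post.

```python
# Hypothetical sketch only: encode an ordered, categorical age-range feature
# as integer codes so numeric models in a pipeline can consume it.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

age_bins = ['0-15', '16-29', '30-40', '41-59', '60+']
df = pd.DataFrame({'AgeBin': ['16-29', '0-15', '60+']})

# an explicit category order keeps the encoding ordinal (0-15 < 16-29 < ...)
enc = OrdinalEncoder(categories=[age_bins])
codes = enc.fit_transform(df[['AgeBin']])
print(codes.ravel().tolist())  # [1.0, 0.0, 4.0]
```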

Impute Missing Ages & Update Datasets

This will be a bit of a repeat from last post. But…

In [5]:
# will use IterativeImputer in a pipeline to fill in missing ages
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.ensemble import RandomForestClassifier

transformer = FeatureUnion(
  transformer_list=[
    ('features', IterativeImputer(max_iter=10, random_state=0)),
    ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier())

I initially attempted to run the imputer on the whole training dataset. But it choked on the Name feature: the imputer assumes all features in the dataset are numeric when calculating its fit statistics/parameters.
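A tiny illustration of the failure mode (toy data, not the Titanic frames): non-numeric columns have to be dropped or encoded before the imputer sees them, e.g. via select_dtypes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Braund', 'Cumings'],  # object dtype: the imputer chokes on this
    'Age': [22.0, np.nan],
    'Pclass': [3, 1],
})
# keep only numeric features before imputing
numeric = df.select_dtypes(include='number')
print(numeric.columns.tolist())  # ['Age', 'Pclass']
```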

So, I am going to impute on the same features used in the previous post, to see whether adding Age to the model improves our prediction accuracy. I will fit the imputer on the training set, transform both datasets using the fitted imputer, then add the imputed ages to the appropriate datasets.

All in all a touch more work than I anticipated.

In [6]:
y_trn = k_trn['Survived']

features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_tst = pd.get_dummies(k_tst[features])

In [7]:
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
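The column bookkeeping above (appending a single "AgeMissing" name) relies on how FeatureUnion stacks its outputs: the imputed features come first, followed by one indicator column per feature that had missing values, and only Age had any. A tiny synthetic check of that assumption:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# two features, only the second has a missing value
X = np.array([[1.0, 25.0], [2.0, np.nan], [3.0, 40.0]])
union = FeatureUnion(transformer_list=[
    ('features', IterativeImputer(max_iter=10, random_state=0)),
    ('indicators', MissingIndicator()),
])
out = union.fit_transform(X)
print(out.shape)  # (3, 3): 2 imputed columns + 1 indicator column
```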
In [8]:
X_trn_trans.tail()
X_tst_trans.tail()
Out[8]:
| | PassengerId | Pclass | SibSp | Parch | Age | Sex_female | Sex_male | AgeMissing |
|---|---|---|---|---|---|---|---|---|
| 886 | 887.00 | 2.00 | 0.00 | 0.00 | 27.00 | 0.00 | 1.00 | 0.00 |
| 887 | 888.00 | 1.00 | 0.00 | 0.00 | 19.00 | 1.00 | 0.00 | 0.00 |
| 888 | 889.00 | 3.00 | 1.00 | 2.00 | 19.55 | 1.00 | 0.00 | 1.00 |
| 889 | 890.00 | 1.00 | 0.00 | 0.00 | 26.00 | 0.00 | 1.00 | 0.00 |
| 890 | 891.00 | 3.00 | 0.00 | 0.00 | 32.00 | 0.00 | 1.00 | 0.00 |
Out[8]:
| | PassengerId | Pclass | SibSp | Parch | Age | Sex_female | Sex_male | AgeMissing |
|---|---|---|---|---|---|---|---|---|
| 413 | 1,305.00 | 3.00 | 0.00 | 0.00 | 28.55 | 0.00 | 1.00 | 1.00 |
| 414 | 1,306.00 | 1.00 | 0.00 | 0.00 | 39.00 | 1.00 | 0.00 | 0.00 |
| 415 | 1,307.00 | 3.00 | 0.00 | 0.00 | 38.50 | 0.00 | 1.00 | 0.00 |
| 416 | 1,308.00 | 3.00 | 0.00 | 0.00 | 28.55 | 0.00 | 1.00 | 1.00 |
| 417 | 1,309.00 | 3.00 | 1.00 | 1.00 | 23.72 | 0.00 | 1.00 | 1.00 |
In [9]:
k_trn.head(2)
Out[9]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize | Group | Title | iFare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.00 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 2 | 1 | Mr | 7.25 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.00 | 1 | 0 | PC 17599 | 71.28 | C85 | C | 2 | 2 | Mrs | 35.64 |
In [10]:
print(X_trn_trans.iloc[0].PassengerId == k_trn.iloc[0].PassengerId)
True
In [11]:
k_trn_2 = k_trn.copy()
k_trn_2['Age'] = X_trn_trans['Age']
k_trn_2 = pd.concat([k_trn_2, X_trn_trans['AgeMissing']], axis=1)
In [12]:
k_tst_2 = k_tst.copy()
k_tst_2['Age'] = X_tst_trans['Age']
k_tst_2 = pd.concat([k_tst_2, X_tst_trans['AgeMissing']], axis=1)
In [13]:
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)
In [14]:
# reload the updated datasets and see if any missing Age data
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
k_trn.info()
k_tst.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
 12  FamilySize   891 non-null    int64  
 13  Group        891 non-null    int64  
 14  Title        891 non-null    object 
 15  iFare        891 non-null    float64
 16  AgeMissing   891 non-null    float64
dtypes: float64(4), int64(7), object(6)
memory usage: 118.5+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
 11  Survived     418 non-null    int64  
 12  FamilySize   418 non-null    int64  
 13  Group        418 non-null    int64  
 14  Title        418 non-null    object 
 15  iFare        418 non-null    float64
 16  AgeMissing   418 non-null    float64
dtypes: float64(4), int64(7), object(6)
memory usage: 55.6+ KB

I should probably have been much more thorough in confirming that my concatenations actually assigned the ages to the correct people. But I am trusting that pandas is good at looking after that kind of thing when preserving indices.
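For what it’s worth, an explicit check is cheap. This toy sketch (my own, not from the notebook) shows the kind of assertion I could have made: pandas aligns rows by index during concat, so matching PassengerId columns confirm the rows stayed paired.

```python
import pandas as pd

left = pd.DataFrame({'PassengerId': [1, 2, 3], 'Age': [22.0, 38.0, 26.0]})
right = pd.DataFrame({'PassengerId': [1, 2, 3], 'AgeMissing': [0.0, 0.0, 1.0]})

# concat on axis=1 aligns rows by index
merged = pd.concat([left, right['AgeMissing']], axis=1)
aligned = merged['PassengerId'].equals(right['PassengerId'])
print(aligned)  # True
```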

How to Define the Age Groups

But, what ages should be in each group? I don’t think just splitting the ages into equal-length ranges will necessarily be the most effective approach, though that would likely be the easiest to implement.

Because re-running the code above could cause issues, I started a new notebook for the following.

Let’s start by looking at the age data split into equal-length ranges, say 5 years each. You may recall the max age was 80.

In [7]:
# let's bin the age data and have a look
k_all['AgeRng'] = pd.cut(k_all['Age'], bins=range(0, 90, 5))
sns.set(rc={'figure.figsize':(12,8)})
sns.set(font_scale=1.0)
# plt.style.use('seaborn-whitegrid')
g = sns.barplot(x='AgeRng', y='Survived', data=k_all)

table = pd.crosstab(k_all['AgeRng'], k_all['Survived'])
print('\n', table)

 Survived    0   1
AgeRng           
(0, 5]     19  37
(5, 10]    17  13
(10, 15]   17  16
(15, 20]   89  50
(20, 25]  146  87
(25, 30]  223  95
(30, 35]   75  62
(35, 40]   58  42
(40, 45]   64  36
(45, 50]   36  24
(50, 55]   17  19
(55, 60]   15  11
(60, 65]   17   6
(65, 70]    4   0
(70, 75]    4   0
(75, 80]    0   2
barplot showing survival rate by binned 5-yr age ranges

Looks like:

  • 0-15: survival pretty good
  • 30-40: a slight increase in survival rate compared to adjacent bins
  • 60+: survival rate seems to decline

So, I am going to use the following ranges: [‘0-15’, ‘16-29’, ‘30-40’, ‘41-59’, ‘60+’]

In [9]:
bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
k_trn['AgeBin'] = pd.cut(k_trn['Age'], bins=bin_thresholds, labels=bin_labels)
k_tst['AgeBin'] = pd.cut(k_tst['Age'], bins=bin_thresholds, labels=bin_labels)
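One wrinkle worth knowing (an aside, not in the original notebook): pd.cut bins are right-inclusive by default, so with these thresholds the ‘16-29’ label actually covers the interval (15, 30], and an age of exactly 30 lands there rather than in ‘30-40’.

```python
import pandas as pd

bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
ages = pd.Series([15.0, 30.0, 30.5])
# right-inclusive bins: 15.0 -> (0, 15], 30.0 -> (15, 30], 30.5 -> (30, 40]
binned = pd.cut(ages, bins=bin_thresholds, labels=bin_labels)
print(binned.tolist())  # ['0-15', '16-29', '30-40']
```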
In [10]:
k_trn.tail()
Out[10]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize | Group | Title | iFare | AgeMissing | AgeBin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.00 | 0 | 0 | 211536 | 13.00 | NaN | S | 1 | 1 | Official | 13.00 | 0.00 | 16-29 |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.00 | 0 | 0 | 112053 | 30.00 | B42 | S | 1 | 1 | Miss | 30.00 | 0.00 | 16-29 |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 19.55 | 1 | 2 | W./C. 6607 | 23.45 | NaN | S | 4 | 4 | Miss | 5.86 | 1.00 | 16-29 |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.00 | 0 | 0 | 111369 | 30.00 | C148 | C | 1 | 1 | Mr | 30.00 | 0.00 | 16-29 |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.00 | 0 | 0 | 370376 | 7.75 | NaN | Q | 1 | 1 | Mr | 7.75 | 0.00 | 30-40 |
In [11]:
k_tst.tail()
Out[11]:
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived | FamilySize | Group | Title | iFare | AgeMissing | AgeBin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 413 | 1305 | 3 | Spector, Mr. Woolf | male | 28.55 | 0 | 0 | A.5. 3236 | 8.05 | NaN | S | 0 | 1 | 1 | Mr | 8.05 | 1.00 | 16-29 |
| 414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.00 | 0 | 0 | PC 17758 | 108.90 | C105 | C | 1 | 1 | 3 | Noble | 36.30 | 0.00 | 30-40 |
| 415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.50 | 0 | 0 | SOTON/O.Q. 3101262 | 7.25 | NaN | S | 0 | 1 | 1 | Mr | 7.25 | 0.00 | 30-40 |
| 416 | 1308 | 3 | Ware, Mr. Frederick | male | 28.55 | 0 | 0 | 359309 | 8.05 | NaN | S | 0 | 1 | 1 | Mr | 8.05 | 1.00 | 16-29 |
| 417 | 1309 | 3 | Peter, Master. Michael J | male | 23.72 | 1 | 1 | 2668 | 22.36 | NaN | C | 1 | 3 | 3 | Master | 7.45 | 1.00 | 16-29 |

That appears to have worked. So, let’s save our changes to the CSV files.

In [12]:
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Done M’thinks

I really think this post has done what it intended to do. So, another fairly short and sweet post.

Feel free to download and play with this post’s two related notebooks: adding missing ages or creating new feature, AgeBin.

Apparently Not Done!

Well, turns out I’m not done.

While working on a future post, I generated a histplot for Age with hue="Survived" and saw negative values for Age. This is apparently a possibility when using the IterativeImputer.

Let’s have a quick look; describe() is your friend.

In [15]:
k_trn['Age'].describe()
Out[15]:
count   891.00
mean     29.30
std      13.62
min      -7.33
25%      21.51
50%      28.51
75%      36.00
max      80.00
Name: Age, dtype: float64

Sure enough. I should have done that when I was originally working on this post/notebook. It would have saved me some grief.

A look at the test dataset also shows negative ages.

I was just going to restore a backup of the CSV files and redo the post/notebook, hopefully with the correct result this time. But I decided to fix things without using the backup, using only the current CSVs and the original Kaggle CSVs, and to add the fix to both the post and the related notebook. I expect it might get messy.

I have added a new variable near the top of the notebook, do_cell, defaulting to False. Any cell I do not want re-executed when I run the full notebook goes inside an if do_cell: conditional block. Running those cells repeatedly would likely mess up my CSVs, m’thinks. And, if not, there is no sense wasting time re-running them.
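The guard is nothing fancy; a minimal sketch of the pattern (the flag name do_cell is mine, from the notebook):

```python
# cells with destructive side effects get wrapped in a conditional so a
# full notebook re-run can't clobber the CSVs
do_cell = False  # flip to True only when the cell should really execute

executed = False
if do_cell:
    # e.g. k_trn_2.to_csv(oma_trn_3, index=False)  # one-time write
    executed = True
print(executed)  # False on a normal full run
```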

Reload Datasets

In [16]:
# start fresh
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age']
X_trn = pd.get_dummies(k_trn[features])
X_tst = pd.get_dummies(k_tst[features])
# let's load the Kaggle datasets so we can get the original Age data
kg_trn = pd.read_csv(kaggle_trn)
kg_tst = pd.read_csv(kaggle_tst)

Re-initialize Age Column

And, in the dataframes to be used for imputing missing Age values, replace the Age column with the Age column from the original Kaggle datasets.

Sort How to Go About It

Let’s sort how to do this and test on the training dataset.

We will start by getting back the original Age data. I will temporarily rename the imputed data column to iiAge.

In [17]:
# now let's replace the Age data in my versions of the datasets with that from the Kaggle datasets
# for test will add kaggle column as well
X_trn.rename(columns={"Age": "iiAge"}, inplace=True)
X_trn["Age"] = kg_trn["Age"]
In [18]:
X_trn.head()
X_trn.tail()
X_trn["Age"].describe()
Out[18]:
| | PassengerId | Pclass | SibSp | Parch | iiAge | Sex_female | Sex_male | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 0 | 22.00 | 1 | 0 | 22.00 |
| 1 | 2 | 1 | 1 | 0 | 38.00 | 1 | 0 | 38.00 |
| 2 | 3 | 3 | 0 | 0 | 26.00 | 1 | 0 | 26.00 |
| 3 | 4 | 1 | 1 | 0 | 35.00 | 1 | 0 | 35.00 |
| 4 | 5 | 3 | 0 | 0 | 35.00 | 0 | 1 | 35.00 |
Out[18]:
| | PassengerId | Pclass | SibSp | Parch | iiAge | Sex_female | Sex_male | Age |
|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 2 | 0 | 0 | 27.00 | 0 | 1 | 27.00 |
| 887 | 888 | 1 | 0 | 0 | 19.00 | 1 | 0 | 19.00 |
| 888 | 889 | 3 | 1 | 2 | 19.55 | 1 | 0 | NaN |
| 889 | 890 | 1 | 0 | 0 | 26.00 | 0 | 1 | 26.00 |
| 890 | 891 | 3 | 0 | 0 | 32.00 | 0 | 1 | 32.00 |
Out[18]:
count   714.00
mean     29.70
std      14.53
min       0.42
25%      20.12
50%      28.00
75%      38.00
max      80.00
Name: Age, dtype: float64
In [19]:
# let's check things out before deleting the old column
X_trn.loc[(X_trn['iiAge'].ne(X_trn['Age'])) & (X_trn['Age'].notna())]
Out[19]:
| | PassengerId | Pclass | SibSp | Parch | iiAge | Sex_female | Sex_male | Age |
|---|---|---|---|---|---|---|---|---|

(0 rows)
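The empty result above is exactly what we want: nowhere does the restored Age disagree with the imputed iiAge, NaNs excluded (ne() treats NaN != NaN as True, hence the notna() clause). On toy data the same filter flags only a genuine mismatch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'iiAge': [22.0, 30.0, 19.5],
                   'Age':   [22.0, 31.0, np.nan]})
# rows where the two columns differ, ignoring rows where Age is NaN
bad = df.loc[df['iiAge'].ne(df['Age']) & df['Age'].notna()]
print(len(bad))  # 1 -- only the 30.0 vs 31.0 row
```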
In [20]:
# get rid of iiAge column in X_trn
# note: without re-assignment or inplace=True this only displays the result;
# X_trn itself still contains iiAge (as the describe() output further down shows)
X_trn.drop('iiAge', axis=1)
Out[20]:
| | PassengerId | Pclass | SibSp | Parch | Sex_female | Sex_male | Age |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 0 | 0 | 1 | 22.00 |
| 1 | 2 | 1 | 1 | 0 | 1 | 0 | 38.00 |
| 2 | 3 | 3 | 0 | 0 | 1 | 0 | 26.00 |
| 3 | 4 | 1 | 1 | 0 | 1 | 0 | 35.00 |
| 4 | 5 | 3 | 0 | 0 | 0 | 1 | 35.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 2 | 0 | 0 | 0 | 1 | 27.00 |
| 887 | 888 | 1 | 0 | 0 | 1 | 0 | 19.00 |
| 888 | 889 | 3 | 1 | 2 | 1 | 0 | NaN |
| 889 | 890 | 1 | 0 | 0 | 0 | 1 | 26.00 |
| 890 | 891 | 3 | 0 | 0 | 0 | 1 | 32.00 |

891 rows × 7 columns

Process Test Dataset

In [21]:
# let's do same for test data set
X_tst.rename(columns={"Age": "iiAge"}, inplace=True)
X_tst["Age"] = kg_tst["Age"]
X_tst.loc[(X_tst['iiAge'].ne(X_tst['Age'])) & (X_tst['Age'].notna())]
Out[21]:
| | PassengerId | Pclass | SibSp | Parch | iiAge | Sex_female | Sex_male | Age |
|---|---|---|---|---|---|---|---|---|

(0 rows)
In [22]:
# again, display only: X_tst still contains iiAge after this
X_tst.drop('iiAge', axis=1)
Out[22]:
| | PassengerId | Pclass | SibSp | Parch | Sex_female | Sex_male | Age |
|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 0 | 0 | 0 | 1 | 34.50 |
| 1 | 893 | 3 | 1 | 0 | 1 | 0 | 47.00 |
| 2 | 894 | 2 | 0 | 0 | 0 | 1 | 62.00 |
| 3 | 895 | 3 | 0 | 0 | 0 | 1 | 27.00 |
| 4 | 896 | 3 | 1 | 1 | 1 | 0 | 22.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | 0 | 0 | 0 | 1 | NaN |
| 414 | 1306 | 1 | 0 | 0 | 1 | 0 | 39.00 |
| 415 | 1307 | 3 | 0 | 0 | 0 | 1 | 38.50 |
| 416 | 1308 | 3 | 0 | 0 | 0 | 1 | NaN |
| 417 | 1309 | 3 | 1 | 1 | 0 | 1 | NaN |

418 rows × 7 columns

Impute Missing Values (Properly?)

In [25]:
# now see if we can fix that imputer
min_age = min(X_trn["Age"].min(), X_tst["Age"].min())
max_age = max(X_trn["Age"].max(), X_tst["Age"].max())
print(min_age, max_age)
transformer = FeatureUnion(
  transformer_list=[
    ('features', IterativeImputer(max_iter=10, min_value=min_age, max_value=max_age, random_state=0)),
    ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, RandomForestClassifier())
0.17 80.0
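A quick self-contained check (synthetic data, not the Titanic frames) that min_value/max_value really do bound what IterativeImputer fills in:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [100.0, 5.0]])
imputer = IterativeImputer(max_iter=10, min_value=0.0, max_value=80.0,
                           random_state=0)
X_imp = imputer.fit_transform(X)
# whatever the regression predicts, the filled value is clipped to [0, 80]
print(0.0 <= X_imp[2, 1] <= 80.0)  # True
```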

Training Dataset

In [26]:
# let's train, and transform, our imputer on X_trn, and have look
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
In [34]:
X_trn_trans.describe()
Out[34]:
| | PassengerId | Pclass | SibSp | Parch | iiAge | Sex_female | Sex_male | Age | AgeMissing |
|---|---|---|---|---|---|---|---|---|---|
| count | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 |
| mean | 446.00 | 2.31 | 0.52 | 0.38 | 29.30 | 0.35 | 0.65 | 29.35 | 0.20 |
| std | 257.35 | 0.84 | 1.10 | 0.81 | 13.62 | 0.48 | 0.48 | 13.52 | 0.40 |
| min | 1.00 | 1.00 | 0.00 | 0.00 | -7.33 | 0.00 | 0.00 | 0.17 | 0.00 |
| 25% | 223.50 | 2.00 | 0.00 | 0.00 | 21.51 | 0.00 | 0.00 | 21.51 | 0.00 |
| 50% | 446.00 | 3.00 | 0.00 | 0.00 | 28.51 | 0.00 | 1.00 | 28.51 | 0.00 |
| 75% | 668.50 | 3.00 | 1.00 | 0.00 | 36.00 | 1.00 | 1.00 | 36.00 | 0.00 |
| max | 891.00 | 3.00 | 8.00 | 6.00 | 80.00 | 1.00 | 1.00 | 80.00 | 1.00 |

Test Dataset

In [28]:
# looks better, do the same for X_tst
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
In [33]:
X_tst_trans.describe()
Out[33]:
| | PassengerId | Pclass | SibSp | Parch | iiAge | Sex_female | Sex_male | Age | AgeMissing |
|---|---|---|---|---|---|---|---|---|---|
| count | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 |
| mean | 1,100.50 | 2.27 | 0.45 | 0.39 | 29.74 | 0.36 | 0.64 | 29.76 | 0.21 |
| std | 120.81 | 0.84 | 0.90 | 0.98 | 13.03 | 0.48 | 0.48 | 12.99 | 0.40 |
| min | 892.00 | 1.00 | 0.00 | 0.00 | -7.29 | 0.00 | 0.00 | 0.17 | 0.00 |
| 25% | 996.25 | 1.00 | 0.00 | 0.00 | 22.00 | 0.00 | 0.00 | 22.00 | 0.00 |
| 50% | 1,100.50 | 3.00 | 0.00 | 0.00 | 28.53 | 0.00 | 1.00 | 28.53 | 0.00 |
| 75% | 1,204.75 | 3.00 | 1.00 | 0.00 | 36.38 | 1.00 | 1.00 | 36.38 | 0.00 |
| max | 1,309.00 | 3.00 | 8.00 | 9.00 | 76.00 | 1.00 | 1.00 | 76.00 | 1.00 |

New Dataframes with Updated Values

Now, re-build the datasets we will be saving to our CSVs.

In [40]:
# new updated training dataset dataframe
k_trn_2 = k_trn.copy()
k_trn_2 = k_trn_2.drop("AgeMissing", axis=1)
k_trn_2['Age'] = X_trn_trans['Age']
k_trn_2 = pd.concat([k_trn_2, X_trn_trans['AgeMissing']], axis=1)
# new updated testing dataset dataframe
k_tst_2 = k_tst.copy()
k_tst_2 = k_tst_2.drop("AgeMissing", axis=1)
k_tst_2['Age'] = X_tst_trans['Age']
k_tst_2 = pd.concat([k_tst_2, X_tst_trans['AgeMissing']], axis=1)
In [41]:
k_trn_2.describe()
Out[41]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | FamilySize | Group | iFare | AgeMissing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 | 891.00 |
| mean | 446.00 | 0.38 | 2.31 | 29.35 | 0.52 | 0.38 | 32.74 | 1.90 | 2.12 | 14.87 | 0.20 |
| std | 257.35 | 0.49 | 0.84 | 13.52 | 1.10 | 0.81 | 49.54 | 1.61 | 1.80 | 13.57 | 0.40 |
| min | 1.00 | 0.00 | 1.00 | 0.17 | 0.00 | 0.00 | 4.01 | 1.00 | 1.00 | 3.71 | 0.00 |
| 25% | 223.50 | 0.00 | 2.00 | 21.51 | 0.00 | 0.00 | 7.92 | 1.00 | 1.00 | 7.65 | 0.00 |
| 50% | 446.00 | 0.00 | 3.00 | 28.51 | 0.00 | 0.00 | 15.00 | 1.00 | 1.00 | 8.05 | 0.00 |
| 75% | 668.50 | 1.00 | 3.00 | 36.00 | 1.00 | 0.00 | 31.33 | 2.00 | 3.00 | 15.00 | 0.00 |
| max | 891.00 | 1.00 | 3.00 | 80.00 | 8.00 | 6.00 | 512.33 | 11.00 | 11.00 | 128.08 | 1.00 |
In [42]:
k_tst_2.describe()
Out[42]:
| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Survived | FamilySize | Group | iFare | AgeMissing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 | 418.00 |
| mean | 1,100.50 | 2.27 | 29.76 | 0.45 | 0.39 | 36.93 | 0.38 | 1.84 | 2.07 | 16.48 | 0.21 |
| std | 120.81 | 0.84 | 12.99 | 0.90 | 0.98 | 60.46 | 0.49 | 1.52 | 1.75 | 27.80 | 0.40 |
| min | 892.00 | 1.00 | 0.17 | 0.00 | 0.00 | 3.17 | 0.00 | 1.00 | 1.00 | 3.17 | 0.00 |
| 25% | 996.25 | 1.00 | 22.00 | 0.00 | 0.00 | 7.90 | 0.00 | 1.00 | 1.00 | 7.73 | 0.00 |
| 50% | 1,100.50 | 3.00 | 28.53 | 0.00 | 0.00 | 14.48 | 0.00 | 1.00 | 1.00 | 8.66 | 0.00 |
| 75% | 1,204.75 | 3.00 | 36.38 | 1.00 | 0.00 | 31.50 | 1.00 | 2.00 | 2.00 | 22.83 | 0.00 |
| max | 1,309.00 | 3.00 | 76.00 | 8.00 | 9.00 | 512.33 | 1.00 | 11.00 | 11.00 | 512.33 | 1.00 |

Save to CSV

In [44]:
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)

Done for the 2nd Time

Okay, I think that’s it. And now this post really is lengthy.

Feel free to download and play with the updated version of this post’s related notebook.

Resources