Well here we are again. Looks like I will be making a third attempt at getting this somewhat right.

While I was setting up categorical feature encoding for a future post, missing data caused the column transformer to choke.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-2b9af7ef81ea> in <module>
     16 preprocessor = FeatureUnion(transformer_list=[('ord', ord_pipe),
     17                                               ('nom', nom_pipe)])
---> 18 preprocessor.fit(k_trn)
     19 
     20 #Ready to list

• • •

E:\appDev\Miniconda3\envs\ds-3.9\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    109     elif X.dtype == np.dtype('object') and not allow_nan:
    110         if _object_dtype_isnan(X).any():
--> 111             raise ValueError("Input contains NaN")
    112 
    113 

ValueError: Input contains NaN

The first time it choked I figured it was because the Cabin feature, which has missing data, was in the dataset. So, I dropped it from the dataset. But no relief; it choked again with the same problem. A little investigating eventually led me to the following. (You will need to scroll to the right side.)

In [31]:
k_trn[pd.isnull(k_trn).any(axis=1)]
Out[31]:
     PassengerId  Survived  Pclass  Name                               Sex     Age   SibSp  Parch  Ticket    Fare   ...  FamilySize  Group  Title   iFare  AgeBin  AgeMissing  logFare  logiFare  Sex_enc  AgeBin_enc
159          160         0       3  Sage, Master. Thomas Henry         male    0.17      8      2  CA. 2343  69.55  ...          11     11  Master   6.32     NaN        1.00     1.85      0.86     male         NaN
180          181         0       3  Sage, Miss. Constance Gladys       female  0.17      8      2  CA. 2343  69.55  ...          11     11  Miss     6.32     NaN        1.00     1.85      0.86   female         NaN
201          202         0       3  Sage, Mr. Frederick                male    0.17      8      2  CA. 2343  69.55  ...          11     11  Mr       6.32     NaN        1.00     1.85      0.86     male         NaN
324          325         0       3  Sage, Mr. George John Jr           male    0.17      8      2  CA. 2343  69.55  ...          11     11  Mr       6.32     NaN        1.00     1.85      0.86     male         NaN
792          793         0       3  Sage, Miss. Stella Anna            female  0.17      8      2  CA. 2343  69.55  ...          11     11  Miss     6.32     NaN        1.00     1.85      0.86   female         NaN
846          847         0       3  Sage, Mr. Douglas Bullen           male    0.17      8      2  CA. 2343  69.55  ...          11     11  Mr       6.32     NaN        1.00     1.85      0.86     male         NaN
863          864         0       3  Sage, Miss. Dorothy Edith "Dolly"  female  0.17      8      2  CA. 2343  69.55  ...          11     11  Miss     6.32     NaN        1.00     1.85      0.86   female         NaN

7 rows × 21 columns

Well, I am guessing that most people realized in that last post that I had failed to update the AgeBin feature. So, the passengers with an imputed negative age had no entry in that column/feature. I expect the same holds true for the test dataset.

But, looking at the above, I didn’t like what I saw. Everyone in that family was assigned an age of 0.17 years? So, I decided I would try imputing the missing ages once again.

I had read in one of the many posts/articles on working with the Titanic dataset that the title “Master” was given to male children aged 15 and under. So, to any “Master” missing an age, I am going to assign the average age of those with an age. Then I will fit the imputer and go from there. Don’t know how much that will help, but it is likely better than not doing so.
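The gist of that, on a tiny toy frame (the titles and ages here are hypothetical, not the real passenger data):

```python
import numpy as np
import pandas as pd

# toy stand-in for the Titanic frame (values hypothetical)
df = pd.DataFrame({
    "Title": ["Master", "Master", "Master", "Mr"],
    "Age":   [4.0, 8.0, np.nan, 30.0],
})

# mean age over the "Master" rows that actually have an age (NaN is skipped)
master_mean = df.loc[df["Title"] == "Master", "Age"].mean()

# assign that mean to any "Master" still missing an age
df.loc[(df["Title"] == "Master") & (df["Age"].isnull()), "Age"] = master_mean
```

The `.loc` with a boolean mask does the fill in place on the original frame, which is exactly the pattern used on the real datasets below.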

A lot of this will be plain and simple repetition from the previous post. But, you know, “practice, practice, practice”.

Load Datasets and Re-initialize Age Feature

In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
k_all = k_trn
k_all = pd.concat([k_all, k_tst], ignore_index=True)
In [5]:
y_trn = k_trn['Survived']

# start fresh
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age', 'Title']
X_trn = k_trn[features].copy()
X_tst = k_tst[features].copy()
# let's load the Kaggle datasets so we can get the original Age data
kg_trn = pd.read_csv(kaggle_trn)
kg_tst = pd.read_csv(kaggle_tst)

In [6]:
# now let's replace the Age data in my versions of the datasets with that from the Kaggle datasets
# for test will add kaggle column as well
X_trn.rename(columns={"Age": "iiAge"}, inplace=True)
X_trn["Age"] = kg_trn["Age"]
X_tst.rename(columns={"Age": "iiAge"}, inplace=True)
X_tst["Age"] = kg_tst["Age"]
In [7]:
X_trn.head()
X_trn.tail()
X_trn["Age"].describe()
Out[7]:
   PassengerId  Pclass  Sex     SibSp  Parch  iiAge  Title  Age
0            1       3  male        1      0  22.00  Mr     22.00
1            2       1  female      1      0  38.00  Mrs    38.00
2            3       3  female      0      0  26.00  Miss   26.00
3            4       1  female      1      0  35.00  Mrs    35.00
4            5       3  male        0      0  35.00  Mr     35.00
Out[7]:
     PassengerId  Pclass  Sex     SibSp  Parch  iiAge  Title     Age
886          887       2  male        0      0  27.00  Official  27.00
887          888       1  female      0      0  19.00  Miss      19.00
888          889       3  female      1      2  15.11  Miss        NaN
889          890       1  male        0      0  26.00  Mr        26.00
890          891       3  male        0      0  32.00  Mr        32.00
Out[7]:
count   714.00
mean     29.70
std      14.53
min       0.42
25%      20.12
50%      28.00
75%      38.00
max      80.00
Name: Age, dtype: float64

Missing Age for Passengers with Title Master

Work out the mean age for both datasets. Find those with missing age (so we can check afterwards).
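The count-weighted combination used below is just the mean over the pooled data; a quick sanity check with made-up ages (hypothetical values, not the real "Master" rows):

```python
import pandas as pd

# hypothetical "Master" ages in the two splits
trn_ages = pd.Series([2.0, 4.0, 9.0])
tst_ages = pd.Series([1.0, 11.0])

mean_trn, cnt_trn = trn_ages.mean(), trn_ages.count()
mean_tst, cnt_tst = tst_ages.mean(), tst_ages.count()

# count-weighted combination of the two subgroup means...
mean_all = ((mean_trn * cnt_trn) + (mean_tst * cnt_tst)) / (cnt_trn + cnt_tst)

# ...equals the mean over the concatenated data
pooled_mean = pd.concat([trn_ages, tst_ages]).mean()
```

Using the counts as weights matters; a plain average of the two means would give a different (wrong) answer whenever the splits differ in size.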

In [8]:
mean_trn = k_trn[k_trn["Title"] == "Master"]["Age"].mean()
mean_tst = k_tst[k_tst["Title"] == "Master"]["Age"].mean()
cnt_trn = k_trn[k_trn["Title"] == "Master"]["Title"].count()
cnt_tst = k_tst[k_tst["Title"] == "Master"]["Title"].count()
mean_all = ((mean_trn * cnt_trn) + (mean_tst * cnt_tst)) / (cnt_trn + cnt_tst)
print(f"Mean Age for 'Master': training set {mean_trn:.3f} ({cnt_trn}), test set {mean_tst:.3f} ({cnt_tst}) -> {mean_all:.3f}")

X_trn[(X_trn["Title"] == "Master") & (X_trn["Age"].isnull())].head()
X_tst[(X_tst["Title"] == "Master") & (X_tst["Age"].isnull())].head()

Mean Age for 'Master': training set 4.859 (40), test set 7.410 (21) -> 5.737
Out[8]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
65            66       3  male      1      1   7.42  Master  NaN
159          160       3  male      8      2   7.42  Master  NaN
176          177       3  male      3      1   7.42  Master  NaN
709          710       3  male      1      1   7.42  Master  NaN
Out[8]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
244         1136       3  male      1      2   7.42  Master  NaN
339         1231       3  male      0      0   7.42  Master  NaN
344         1236       3  male      1      1   7.42  Master  NaN
417         1309       3  male      1      1   7.42  Master  NaN

Let’s replace the missing ages with the mean we calculated and look at the results.

In [9]:
X_trn.loc[(X_trn["Title"] == "Master") & (X_trn["Age"].isnull()), "Age"] = mean_all
X_tst.loc[(X_tst["Title"] == "Master") & (X_tst["Age"].isnull()), "Age"] = mean_all
In [10]:
# let's check
pid_trn = [66, 160, 177, 710]
pid_tst = [1136, 1231, 1236, 1309]
X_trn[X_trn["PassengerId"].isin(pid_trn)]
X_tst[X_tst["PassengerId"].isin(pid_tst)]
Out[10]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
65            66       3  male      1      1   7.42  Master  5.74
159          160       3  male      8      2   7.42  Master  5.74
176          177       3  male      3      1   7.42  Master  5.74
709          710       3  male      1      1   7.42  Master  5.74
Out[10]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
244         1136       3  male      1      2   7.42  Master  5.74
339         1231       3  male      0      0   7.42  Master  5.74
344         1236       3  male      1      1   7.42  Master  5.74
417         1309       3  male      1      1   7.42  Master  5.74

Looks reasonable.

Impute Remaining Missing Ages

Once again I am going to use IterativeImputer. For better or worse.
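One gotcha worth repeating before the next cell: IterativeImputer is still flagged experimental in scikit-learn, so it has to be explicitly enabled before it can be imported. A minimal sketch of the FeatureUnion pattern on a toy matrix (values hypothetical), without the min/max clamping used below:

```python
import numpy as np

# IterativeImputer is experimental; this import must come before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# tiny toy matrix with one missing entry
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

transformer = FeatureUnion(transformer_list=[
    ("features", IterativeImputer(max_iter=10, random_state=0)),
    ("indicators", MissingIndicator()),
])

# output: the two imputed feature columns, plus one indicator
# column for the feature that had a NaN
Xt = transformer.fit_transform(X)
```

The MissingIndicator branch is what produces the extra "AgeMissing" column tacked on to the end of the transformed frames below.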

In [11]:
# okay, now on to the IterativeImputer
min_age = min(X_trn["Age"].min(), X_tst["Age"].min())
max_age = max(X_trn["Age"].max(), X_tst["Age"].max())
print(min_age, max_age)
transformer = FeatureUnion(
  transformer_list=[
    ('features', IterativeImputer(max_iter=10, min_value=min_age, max_value=max_age, random_state=0)),
    ('indicators', MissingIndicator())])
0.17 80.0
In [12]:
features = ["PassengerId", "Pclass", "Sex", "SibSp", "Parch", "Title", "Age"]
X_trn = pd.get_dummies(X_trn[features])
X_tst = pd.get_dummies(X_tst[features])
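One thing to watch when calling pd.get_dummies on the two frames separately: each call encodes only the category levels that frame happens to contain, so a title present in one split but not the other would leave the dummy columns misaligned for the fitted transformer. It worked out here, but a defensive reindex is cheap. A sketch with hypothetical titles:

```python
import pandas as pd

# toy frames; "Noble" only appears in the training split
trn = pd.get_dummies(pd.DataFrame({"Title": ["Mr", "Miss", "Noble"]}))
tst = pd.get_dummies(pd.DataFrame({"Title": ["Mr", "Miss"]}))

# align the test frame to the training columns, filling absent dummies with 0
tst = tst.reindex(columns=trn.columns, fill_value=0)
```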
In [13]:
X_trn.head()
Out[13]:
   PassengerId  Pclass  SibSp  Parch  Age    Sex_female  Sex_male  Title_Master  Title_Miss  Title_Mr  Title_Mrs  Title_Noble  Title_Official
0            1       3      1      0  22.00           0         1             0           0         1          0            0               0
1            2       1      1      0  38.00           1         0             0           0         0          1            0               0
2            3       3      0      0  26.00           1         0             0           1         0          0            0               0
3            4       1      1      0  35.00           1         0             0           0         0          1            0               0
4            5       3      0      0  35.00           0         1             0           0         1          0            0               0
In [14]:
# let's train, and transform, our imputer on X_trn, and have look
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
In [15]:
disp_cols = ["PassengerId", "Pclass", "Sex_female", "Sex_male", "SibSp", "Parch", "Age"]
# X_trn_trans[disp_cols].tail()
X_trn_trans[disp_cols].describe()
Out[15]:
       PassengerId  Pclass  Sex_female  Sex_male   SibSp   Parch     Age
count       891.00  891.00      891.00    891.00  891.00  891.00  891.00
mean        446.00    2.31        0.35      0.65    0.52    0.38   29.39
std         257.35    0.84        0.48      0.48    1.10    0.81   13.60
min           1.00    1.00        0.00      0.00    0.00    0.00    0.42
25%         223.50    2.00        0.00      0.00    0.00    0.00   21.00
50%         446.00    3.00        0.00      1.00    0.00    0.00   29.05
75%         668.50    3.00        1.00      1.00    1.00    0.00   36.75
max         891.00    3.00        1.00      1.00    8.00    6.00   80.00
In [16]:
# looks better, do the same for X_tst
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
In [17]:
disp_cols = ["PassengerId", "Pclass", "Sex_female", "Sex_male", "SibSp", "Parch", "Age"]
# X_trn_trans[disp_cols].tail()
X_tst_trans[disp_cols].describe()
Out[17]:
       PassengerId  Pclass  Sex_female  Sex_male   SibSp   Parch     Age
count       418.00  418.00      418.00    418.00  418.00  418.00  418.00
mean      1,100.50    2.27        0.36      0.64    0.45    0.39   29.60
std         120.81    0.84        0.48      0.48    0.90    0.98   13.24
min         892.00    1.00        0.00      0.00    0.00    0.00    0.17
25%         996.25    1.00        0.00      0.00    0.00    0.00   21.86
50%       1,100.50    3.00        0.00      1.00    0.00    0.00   28.62
75%       1,204.75    3.00        1.00      1.00    1.00    0.00   36.75
max       1,309.00    3.00        1.00      1.00    8.00    9.00   76.00

New Updated Datasets

Let’s create updated datasets (training and testing).

In [18]:
# new updated training dataset dataframe
k_trn_2 = k_trn.copy()
k_trn_2 = k_trn_2.drop("AgeMissing", axis=1)
# assign by column label; k_trn_2[:].Age would set the attribute on a copy
k_trn_2["Age"] = X_trn_trans["Age"]
k_trn_2 = pd.concat([k_trn_2, X_trn_trans["AgeMissing"]], axis=1)
# new updated testing dataset dataframe
k_tst_2 = k_tst.copy()
k_tst_2 = k_tst_2.drop("AgeMissing", axis=1)
k_tst_2["Age"] = X_tst_trans["Age"]
k_tst_2 = pd.concat([k_tst_2, X_tst_trans["AgeMissing"]], axis=1)
In [19]:
k_trn_2.describe()
Out[19]:
       PassengerId  Survived  Pclass     Age   SibSp   Parch    Fare  FamilySize   Group   iFare  AgeMissing
count       891.00    891.00  891.00  891.00  891.00  891.00  891.00      891.00  891.00  891.00      891.00
mean        446.00      0.38    2.31   29.39    0.52    0.38   32.74        1.90    2.12   14.87        0.19
std         257.35      0.49    0.84   13.60    1.10    0.81   49.54        1.61    1.80   13.57        0.40
min           1.00      0.00    1.00    0.42    0.00    0.00    4.01        1.00    1.00    3.71        0.00
25%         223.50      0.00    2.00   21.00    0.00    0.00    7.92        1.00    1.00    7.65        0.00
50%         446.00      0.00    3.00   29.05    0.00    0.00   15.00        1.00    1.00    8.05        0.00
75%         668.50      1.00    3.00   36.75    1.00    0.00   31.33        2.00    3.00   15.00        0.00
max         891.00      1.00    3.00   80.00    8.00    6.00  512.33       11.00   11.00  128.08        1.00
In [20]:
k_tst_2.describe()
Out[20]:
       PassengerId  Pclass     Age   SibSp   Parch    Fare  Survived  FamilySize   Group   iFare  AgeMissing
count       418.00  418.00  418.00  418.00  418.00  418.00    418.00      418.00  418.00  418.00      418.00
mean      1,100.50    2.27   29.60    0.45    0.39   36.93      0.38        1.84    2.07   16.48        0.20
std         120.81    0.84   13.24    0.90    0.98   60.46      0.49        1.52    1.75   27.80        0.40
min         892.00    1.00    0.17    0.00    0.00    3.17      0.00        1.00    1.00    3.17        0.00
25%         996.25    1.00   21.86    0.00    0.00    7.90      0.00        1.00    1.00    7.73        0.00
50%       1,100.50    3.00   28.62    0.00    0.00   14.48      0.00        1.00    1.00    8.66        0.00
75%       1,204.75    3.00   36.75    1.00    0.00   31.50      1.00        2.00    2.00   22.83        0.00
max       1,309.00    3.00   76.00    8.00    9.00  512.33      1.00       11.00   11.00  512.33        1.00
In [21]:
k_trn_2.info()

Let’s Not Forget Why We’re Here

Best not forget to update the AgeBin feature this time.

In [22]:
# glad I did that, almost forgot to update AgeBin again
bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
k_trn_2['AgeBin'] = pd.cut(k_trn_2['Age'], bins=bin_thresholds, labels=bin_labels)
k_tst_2['AgeBin'] = pd.cut(k_tst_2['Age'], bins=bin_thresholds, labels=bin_labels)
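For the record, here is why those NaN bins snuck in last time: pd.cut labels anything that falls outside the bin edges (a negative imputed age, or a NaN) as NaN. A quick illustration with the same thresholds and a couple of hypothetical ages:

```python
import numpy as np
import pandas as pd

bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']

# a negative (out-of-range) age, or a NaN, falls outside every bin -> NaN label
ages = pd.Series([5.74, 22.0, -0.5, np.nan])
binned = pd.cut(ages, bins=bin_thresholds, labels=bin_labels)
```

Note the bins are right-inclusive by default, so 15 lands in '0-15' and an age of exactly 0 would itself come out NaN.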
In [23]:
k_trn_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int64   
 1   Survived     891 non-null    int64   
 2   Pclass       891 non-null    int64   
 3   Name         891 non-null    object  
 4   Sex          891 non-null    object  
 5   Age          891 non-null    float64 
 6   SibSp        891 non-null    int64   
 7   Parch        891 non-null    int64   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Cabin        204 non-null    object  
 11  Embarked     891 non-null    object  
 12  FamilySize   891 non-null    int64   
 13  Group        891 non-null    int64   
 14  Title        891 non-null    object  
 15  iFare        891 non-null    float64 
 16  AgeBin       891 non-null    category
 17  AgeMissing   891 non-null    float64 
dtypes: category(1), float64(4), int64(7), object(6)
memory usage: 119.5+ KB
In [24]:
k_tst_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  418 non-null    int64   
 1   Pclass       418 non-null    int64   
 2   Name         418 non-null    object  
 3   Sex          418 non-null    object  
 4   Age          418 non-null    float64 
 5   SibSp        418 non-null    int64   
 6   Parch        418 non-null    int64   
 7   Ticket       418 non-null    object  
 8   Fare         418 non-null    float64 
 9   Cabin        91 non-null     object  
 10  Embarked     418 non-null    object  
 11  Survived     418 non-null    int64   
 12  FamilySize   418 non-null    int64   
 13  Group        418 non-null    int64   
 14  Title        418 non-null    object  
 15  iFare        418 non-null    float64 
 16  AgeBin       418 non-null    category
 17  AgeMissing   418 non-null    float64 
dtypes: category(1), float64(4), int64(7), object(6)
memory usage: 56.3+ KB

Looks like we are where we want to be.

Last But Not Least

So, let’s save our work (once again and maybe not for the last time).

In [25]:
# save updated datasets to our CSV files
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)

Done???

Who knows for sure, but hopefully this repetitive effort can finally be laid to rest.

Feel free to download and play with my version of this post’s related notebook.

Resources

  • pandas.cut
  • pandas.DataFrame.any