Well here we are again. Looks like I will be making a third attempt at getting this somewhat right.

While I was setting up categorical feature encoding for a future post, missing data caused the column transformer to choke.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-2b9af7ef81ea> in <module>
     16 preprocessor = FeatureUnion(transformer_list=[('ord', ord_pipe),
     17                                               ('nom', nom_pipe)])
---> 18 preprocessor.fit(k_trn)
     19 
     20 #Ready to list

• • •

E:\appDev\Miniconda3\envs\ds-3.9\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    109     elif X.dtype == np.dtype('object') and not allow_nan:
    110         if _object_dtype_isnan(X).any():
--> 111             raise ValueError("Input contains NaN")
    112 
    113 

ValueError: Input contains NaN

The first time it choked I figured it was because the Cabin feature, which has missing data, was in the dataset. So, I dropped it from the dataset. But no relief; it choked again with the same problem. A little investigating eventually led me to the following. (You will need to scroll to the right side.)

In [31]:
k_trn[pd.isnull(k_trn).any(axis=1)]
Out[31]:
     PassengerId  Survived  Pclass  Name                               Sex     Age   SibSp  Parch  Ticket    Fare   ...  FamilySize  Group  Title   iFare  AgeBin  AgeMissing  logFare  logiFare  Sex_enc  AgeBin_enc
159          160         0       3  Sage, Master. Thomas Henry         male    0.17      8      2  CA. 2343  69.55  ...          11     11  Master   6.32     NaN        1.00     1.85      0.86     male         NaN
180          181         0       3  Sage, Miss. Constance Gladys       female  0.17      8      2  CA. 2343  69.55  ...          11     11  Miss     6.32     NaN        1.00     1.85      0.86   female         NaN
201          202         0       3  Sage, Mr. Frederick                male    0.17      8      2  CA. 2343  69.55  ...          11     11  Mr       6.32     NaN        1.00     1.85      0.86     male         NaN
324          325         0       3  Sage, Mr. George John Jr           male    0.17      8      2  CA. 2343  69.55  ...          11     11  Mr       6.32     NaN        1.00     1.85      0.86     male         NaN
792          793         0       3  Sage, Miss. Stella Anna            female  0.17      8      2  CA. 2343  69.55  ...          11     11  Miss     6.32     NaN        1.00     1.85      0.86   female         NaN
846          847         0       3  Sage, Mr. Douglas Bullen           male    0.17      8      2  CA. 2343  69.55  ...          11     11  Mr       6.32     NaN        1.00     1.85      0.86     male         NaN
863          864         0       3  Sage, Miss. Dorothy Edith "Dolly"  female  0.17      8      2  CA. 2343  69.55  ...          11     11  Miss     6.32     NaN        1.00     1.85      0.86   female         NaN

7 rows × 21 columns

Well, I am guessing that most people realized in that last post that I had failed to update the AgeBin feature. So, the passengers with an imputed negative age had no entry in that column/feature. I expect the same holds true for the test dataset.

But, looking at the above, I didn’t like what I saw. Everyone in that family was assigned an age of 0.17 years? So, I decided I would try imputing the missing ages once again.

I had read in one of the many posts/articles on working with the Titanic dataset that the title “Master” was given to male children aged 15 and under. So, to any “Master” missing an age, I am going to assign the average age of those with an age. Then I will fit the imputer and go from there. Don’t know how much that will help, but it is likely better than not doing so.
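The gist of that, on a tiny toy frame (the titles and ages here are hypothetical, not the real passenger data):

```python
import numpy as np
import pandas as pd

# toy stand-in for the Titanic frame (values hypothetical)
df = pd.DataFrame({
    "Title": ["Master", "Master", "Master", "Mr"],
    "Age":   [4.0, 8.0, np.nan, 30.0],
})

# mean age over the "Master" rows that actually have an age (NaN is skipped)
master_mean = df.loc[df["Title"] == "Master", "Age"].mean()

# assign that mean to any "Master" still missing an age
df.loc[(df["Title"] == "Master") & (df["Age"].isnull()), "Age"] = master_mean
```

The `.loc` with a boolean mask does the fill in place on the original frame, which is exactly the pattern used on the real datasets below.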

A lot of this will be plain and simple repetition from the previous post. But, you know, “practice, practice, practice”.

Load Datasets and Re-initialize Age Feature

In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
k_all = k_trn
k_all = pd.concat([k_all, k_tst], ignore_index=True)
In [5]:
y_trn = k_trn['Survived']

# start fresh
features = ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Age', 'Title']
X_trn = k_trn[features].copy()
X_tst = k_tst[features].copy()
# let's load the Kaggle datasets so we can get the original Age data
kg_trn = pd.read_csv(kaggle_trn)
kg_tst = pd.read_csv(kaggle_tst)

In [6]:
# now let's replace the Age data in my versions of the datasets with that from the Kaggle datasets
# for test will add kaggle column as well
X_trn.rename(columns={"Age": "iiAge"}, inplace=True)
X_trn["Age"] = kg_trn["Age"]
X_tst.rename(columns={"Age": "iiAge"}, inplace=True)
X_tst["Age"] = kg_tst["Age"]
In [7]:
X_trn.head()
X_trn.tail()
X_trn["Age"].describe()
Out[7]:
   PassengerId  Pclass  Sex     SibSp  Parch  iiAge  Title  Age
0            1       3  male        1      0  22.00  Mr     22.00
1            2       1  female      1      0  38.00  Mrs    38.00
2            3       3  female      0      0  26.00  Miss   26.00
3            4       1  female      1      0  35.00  Mrs    35.00
4            5       3  male        0      0  35.00  Mr     35.00
Out[7]:
     PassengerId  Pclass  Sex     SibSp  Parch  iiAge  Title     Age
886          887       2  male        0      0  27.00  Official  27.00
887          888       1  female      0      0  19.00  Miss      19.00
888          889       3  female      1      2  15.11  Miss        NaN
889          890       1  male        0      0  26.00  Mr        26.00
890          891       3  male        0      0  32.00  Mr        32.00
Out[7]:
count   714.00
mean     29.70
std      14.53
min       0.42
25%      20.12
50%      28.00
75%      38.00
max      80.00
Name: Age, dtype: float64

Missing Age for Passengers with Title Master

Work out the mean age for both datasets. Find those with missing age (so we can check afterwards).
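The count-weighted combination used below is just the mean over the pooled data; a quick sanity check with made-up ages (hypothetical values, not the real "Master" rows):

```python
import pandas as pd

# hypothetical "Master" ages in the two splits
trn_ages = pd.Series([2.0, 4.0, 9.0])
tst_ages = pd.Series([1.0, 11.0])

mean_trn, cnt_trn = trn_ages.mean(), trn_ages.count()
mean_tst, cnt_tst = tst_ages.mean(), tst_ages.count()

# count-weighted combination of the two subgroup means...
mean_all = ((mean_trn * cnt_trn) + (mean_tst * cnt_tst)) / (cnt_trn + cnt_tst)

# ...equals the mean over the concatenated data
pooled_mean = pd.concat([trn_ages, tst_ages]).mean()
```

Using the counts as weights matters; a plain average of the two means would give a different (wrong) answer whenever the splits differ in size.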

In [8]:
mean_trn = k_trn[k_trn["Title"] == "Master"]["Age"].mean()
mean_tst = k_tst[k_tst["Title"] == "Master"]["Age"].mean()
cnt_trn = k_trn[k_trn["Title"] == "Master"]["Title"].count()
cnt_tst = k_tst[k_tst["Title"] == "Master"]["Title"].count()
mean_all = ((mean_trn * cnt_trn) + (mean_tst * cnt_tst)) / (cnt_trn + cnt_tst)
print(f"Mean Age for 'Master': training set {mean_trn:.3f} ({cnt_trn}), test set {mean_tst:.3f} ({cnt_tst}) -> {mean_all:.3f}")

X_trn[(X_trn["Title"] == "Master") & (X_trn["Age"].isnull())].head()
X_tst[(X_tst["Title"] == "Master") & (X_tst["Age"].isnull())].head()

Mean Age for 'Master': training set 4.859 (40), test set 7.410 (21) -> 5.737
Out[8]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
65            66       3  male      1      1   7.42  Master  NaN
159          160       3  male      8      2   7.42  Master  NaN
176          177       3  male      3      1   7.42  Master  NaN
709          710       3  male      1      1   7.42  Master  NaN
Out[8]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
244         1136       3  male      1      2   7.42  Master  NaN
339         1231       3  male      0      0   7.42  Master  NaN
344         1236       3  male      1      1   7.42  Master  NaN
417         1309       3  male      1      1   7.42  Master  NaN

Let’s replace the missing ages with the mean we calculated and look at the results.

In [9]:
X_trn.loc[(X_trn["Title"] == "Master") & (X_trn["Age"].isnull()), "Age"] = mean_all
X_tst.loc[(X_tst["Title"] == "Master") & (X_tst["Age"].isnull()), "Age"] = mean_all
In [10]:
# let's check
pid_trn = [66, 160, 177, 710]
pid_tst = [1136, 1231, 1236, 1309]
X_trn[X_trn["PassengerId"].isin(pid_trn)]
X_tst[X_tst["PassengerId"].isin(pid_tst)]
Out[10]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
65            66       3  male      1      1   7.42  Master  5.74
159          160       3  male      8      2   7.42  Master  5.74
176          177       3  male      3      1   7.42  Master  5.74
709          710       3  male      1      1   7.42  Master  5.74
Out[10]:
     PassengerId  Pclass  Sex   SibSp  Parch  iiAge  Title   Age
244         1136       3  male      1      2   7.42  Master  5.74
339         1231       3  male      0      0   7.42  Master  5.74
344         1236       3  male      1      1   7.42  Master  5.74
417         1309       3  male      1      1   7.42  Master  5.74

Looks reasonable.

Impute Remaining Missing Ages

Once again I am going to use IterativeImputer. For better or worse.
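One gotcha worth repeating before the next cell: IterativeImputer is still flagged experimental in scikit-learn, so it has to be explicitly enabled before it can be imported. A minimal sketch of the FeatureUnion pattern on a toy matrix (values hypothetical), without the min/max clamping used below:

```python
import numpy as np

# IterativeImputer is experimental; this import must come before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# tiny toy matrix with one missing entry
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

transformer = FeatureUnion(transformer_list=[
    ("features", IterativeImputer(max_iter=10, random_state=0)),
    ("indicators", MissingIndicator()),
])

# output: the two imputed feature columns, plus one indicator
# column for the feature that had a NaN
Xt = transformer.fit_transform(X)
```

The MissingIndicator branch is what produces the extra "AgeMissing" column tacked on to the end of the transformed frames below.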

In [11]:
# okay, now on to the IterativeImputer
min_age = min(X_trn["Age"].min(), X_tst["Age"].min())
max_age = max(X_trn["Age"].max(), X_tst["Age"].max())
print(min_age, max_age)
transformer = FeatureUnion(
  transformer_list=[
    ('features', IterativeImputer(max_iter=10, min_value=min_age, max_value=max_age, random_state=0)),
    ('indicators', MissingIndicator())])
0.17 80.0
In [12]:
features = ["PassengerId", "Pclass", "Sex", "SibSp", "Parch", "Title", "Age"]
X_trn = pd.get_dummies(X_trn[features])
X_tst = pd.get_dummies(X_tst[features])
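One thing to watch when calling pd.get_dummies on the two frames separately: each call encodes only the category levels that frame happens to contain, so a title present in one split but not the other would leave the dummy columns misaligned for the fitted transformer. It worked out here, but a defensive reindex is cheap. A sketch with hypothetical titles:

```python
import pandas as pd

# toy frames; "Noble" only appears in the training split
trn = pd.get_dummies(pd.DataFrame({"Title": ["Mr", "Miss", "Noble"]}))
tst = pd.get_dummies(pd.DataFrame({"Title": ["Mr", "Miss"]}))

# align the test frame to the training columns, filling absent dummies with 0
tst = tst.reindex(columns=trn.columns, fill_value=0)
```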
In [13]:
X_trn.head()
Out[13]:
   PassengerId  Pclass  SibSp  Parch  Age    Sex_female  Sex_male  Title_Master  Title_Miss  Title_Mr  Title_Mrs  Title_Noble  Title_Official
0            1       3      1      0  22.00           0         1             0           0         1          0            0               0
1            2       1      1      0  38.00           1         0             0           0         0          1            0               0
2            3       3      0      0  26.00           1         0             0           1         0          0            0               0
3            4       1      1      0  35.00           1         0             0           0         0          1            0               0
4            5       3      0      0  35.00           0         1             0           0         1          0            0               0
In [14]:
# let's train, and transform, our imputer on X_trn, and have look
trn_cols = X_trn.columns.tolist()
trn_cols.append("AgeMissing")
X_trn_trans = transformer.fit_transform(X_trn, y_trn)
X_trn_trans = pd.DataFrame(X_trn_trans, columns=trn_cols)
In [15]:
disp_cols = ["PassengerId", "Pclass", "Sex_female", "Sex_male", "SibSp", "Parch", "Age"]
# X_trn_trans[disp_cols].tail()
X_trn_trans[disp_cols].describe()
Out[15]:
       PassengerId  Pclass  Sex_female  Sex_male   SibSp   Parch     Age
count       891.00  891.00      891.00    891.00  891.00  891.00  891.00
mean        446.00    2.31        0.35      0.65    0.52    0.38   29.39
std         257.35    0.84        0.48      0.48    1.10    0.81   13.60
min           1.00    1.00        0.00      0.00    0.00    0.00    0.42
25%         223.50    2.00        0.00      0.00    0.00    0.00   21.00
50%         446.00    3.00        0.00      1.00    0.00    0.00   29.05
75%         668.50    3.00        1.00      1.00    1.00    0.00   36.75
max         891.00    3.00        1.00      1.00    8.00    6.00   80.00
In [16]:
# looks better, do the same for X_tst
tst_cols = X_tst.columns.tolist()
tst_cols.append("AgeMissing")
X_tst_trans = transformer.transform(X_tst)
X_tst_trans = pd.DataFrame(X_tst_trans, columns=tst_cols)
In [17]:
disp_cols = ["PassengerId", "Pclass", "Sex_female", "Sex_male", "SibSp", "Parch", "Age"]
# X_trn_trans[disp_cols].tail()
X_tst_trans[disp_cols].describe()
Out[17]:
       PassengerId  Pclass  Sex_female  Sex_male   SibSp   Parch     Age
count       418.00  418.00      418.00    418.00  418.00  418.00  418.00
mean      1,100.50    2.27        0.36      0.64    0.45    0.39   29.60
std         120.81    0.84        0.48      0.48    0.90    0.98   13.24
min         892.00    1.00        0.00      0.00    0.00    0.00    0.17
25%         996.25    1.00        0.00      0.00    0.00    0.00   21.86
50%       1,100.50    3.00        0.00      1.00    0.00    0.00   28.62
75%       1,204.75    3.00        1.00      1.00    1.00    0.00   36.75
max       1,309.00    3.00        1.00      1.00    8.00    9.00   76.00

New Updated Datasets

Let’s create updated datasets (training and testing).

In [18]:
# new updated training dataset dataframe
k_trn_2 = k_trn.copy()
k_trn_2 = k_trn_2.drop("AgeMissing", axis=1)
# assign by column label; k_trn_2[:].Age would set the attribute on a copy
k_trn_2["Age"] = X_trn_trans["Age"]
k_trn_2 = pd.concat([k_trn_2, X_trn_trans["AgeMissing"]], axis=1)
# new updated testing dataset dataframe
k_tst_2 = k_tst.copy()
k_tst_2 = k_tst_2.drop("AgeMissing", axis=1)
k_tst_2["Age"] = X_tst_trans["Age"]
k_tst_2 = pd.concat([k_tst_2, X_tst_trans["AgeMissing"]], axis=1)
In [19]:
k_trn_2.describe()
Out[19]:
       PassengerId  Survived  Pclass     Age   SibSp   Parch    Fare  FamilySize   Group   iFare  AgeMissing
count       891.00    891.00  891.00  891.00  891.00  891.00  891.00      891.00  891.00  891.00      891.00
mean        446.00      0.38    2.31   29.39    0.52    0.38   32.74        1.90    2.12   14.87        0.19
std         257.35      0.49    0.84   13.60    1.10    0.81   49.54        1.61    1.80   13.57        0.40
min           1.00      0.00    1.00    0.42    0.00    0.00    4.01        1.00    1.00    3.71        0.00
25%         223.50      0.00    2.00   21.00    0.00    0.00    7.92        1.00    1.00    7.65        0.00
50%         446.00      0.00    3.00   29.05    0.00    0.00   15.00        1.00    1.00    8.05        0.00
75%         668.50      1.00    3.00   36.75    1.00    0.00   31.33        2.00    3.00   15.00        0.00
max         891.00      1.00    3.00   80.00    8.00    6.00  512.33       11.00   11.00  128.08        1.00
In [20]:
k_tst_2.describe()
Out[20]:
       PassengerId  Pclass     Age   SibSp   Parch    Fare  Survived  FamilySize   Group   iFare  AgeMissing
count       418.00  418.00  418.00  418.00  418.00  418.00    418.00      418.00  418.00  418.00      418.00
mean      1,100.50    2.27   29.60    0.45    0.39   36.93      0.38        1.84    2.07   16.48        0.20
std         120.81    0.84   13.24    0.90    0.98   60.46      0.49        1.52    1.75   27.80        0.40
min         892.00    1.00    0.17    0.00    0.00    3.17      0.00        1.00    1.00    3.17        0.00
25%         996.25    1.00   21.86    0.00    0.00    7.90      0.00        1.00    1.00    7.73        0.00
50%       1,100.50    3.00   28.62    0.00    0.00   14.48      0.00        1.00    1.00    8.66        0.00
75%       1,204.75    3.00   36.75    1.00    0.00   31.50      1.00        2.00    2.00   22.83        0.00
max       1,309.00    3.00   76.00    8.00    9.00  512.33      1.00       11.00   11.00  512.33        1.00
In [21]:
k_trn_2.info()

Let’s Not Forget Why We’re Here

Best not forget to update the AgeBin feature this time.

In [22]:
# glad I did that, almost forgot to update AgeBin again
bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']
k_trn_2['AgeBin'] = pd.cut(k_trn_2['Age'], bins=bin_thresholds, labels=bin_labels)
k_tst_2['AgeBin'] = pd.cut(k_tst_2['Age'], bins=bin_thresholds, labels=bin_labels)
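For the record, here is why those NaN bins snuck in last time: pd.cut labels anything that falls outside the bin edges (a negative imputed age, or a NaN) as NaN. A quick illustration with the same thresholds and a couple of hypothetical ages:

```python
import numpy as np
import pandas as pd

bin_thresholds = [0, 15, 30, 40, 59, 90]
bin_labels = ['0-15', '16-29', '30-40', '41-59', '60+']

# a negative (out-of-range) age, or a NaN, falls outside every bin -> NaN label
ages = pd.Series([5.74, 22.0, -0.5, np.nan])
binned = pd.cut(ages, bins=bin_thresholds, labels=bin_labels)
```

Note the bins are right-inclusive by default, so 15 lands in '0-15' and an age of exactly 0 would itself come out NaN.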
In [23]:
k_trn_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int64   
 1   Survived     891 non-null    int64   
 2   Pclass       891 non-null    int64   
 3   Name         891 non-null    object  
 4   Sex          891 non-null    object  
 5   Age          891 non-null    float64 
 6   SibSp        891 non-null    int64   
 7   Parch        891 non-null    int64   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Cabin        204 non-null    object  
 11  Embarked     891 non-null    object  
 12  FamilySize   891 non-null    int64   
 13  Group        891 non-null    int64   
 14  Title        891 non-null    object  
 15  iFare        891 non-null    float64 
 16  AgeBin       891 non-null    category
 17  AgeMissing   891 non-null    float64 
dtypes: category(1), float64(4), int64(7), object(6)
memory usage: 119.5+ KB
In [24]:
k_tst_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  418 non-null    int64   
 1   Pclass       418 non-null    int64   
 2   Name         418 non-null    object  
 3   Sex          418 non-null    object  
 4   Age          418 non-null    float64 
 5   SibSp        418 non-null    int64   
 6   Parch        418 non-null    int64   
 7   Ticket       418 non-null    object  
 8   Fare         418 non-null    float64 
 9   Cabin        91 non-null     object  
 10  Embarked     418 non-null    object  
 11  Survived     418 non-null    int64   
 12  FamilySize   418 non-null    int64   
 13  Group        418 non-null    int64   
 14  Title        418 non-null    object  
 15  iFare        418 non-null    float64 
 16  AgeBin       418 non-null    category
 17  AgeMissing   418 non-null    float64 
dtypes: category(1), float64(4), int64(7), object(6)
memory usage: 56.3+ KB

Looks like we are where we want to be.

Last But Not Least

So, let’s save our work (once again and maybe not for the last time).

In [25]:
# save updated datasets to our CSV files
k_trn_2.to_csv(oma_trn_3, index=False)
k_tst_2.to_csv(oma_tst_3, index=False)

Done???

Who knows for sure, but hopefully this repetitive effort can finally be laid to rest.

Feel free to download and play with my version of this post’s related notebook.

Resources

  • pandas.cut
  • pandas.DataFrame.any