Okay, let’s carry on from last time. There are at least a couple more possible features to consider and attempt.

Modelling and feature engineering would normally be done in some iterative fashion. But I am just going to create all the features first, then start testing various combinations. That model testing will come in future posts. And maybe new features will come out of the woodwork as well.

Load Datasets

In [3]:
import pandas as pd

# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"

We need to load the oma_*_3.csv files to get the engineered features we added earlier; I don’t want to lose those.

In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
# combined frame, used only to survey values across both datasets
k_all = pd.concat([k_trn, k_tst], ignore_index=True)

Passenger’s Title

We previously mentioned that a person’s status might affect their likelihood of survival, and that the title in their name might be an indicator of that status. So, let’s give it a shot.

Let’s start with a look at the passenger titles we find in the passenger name feature. You may recall, the name entries in our datasets look like: “<surname>, <title>. …”. Note the period after the title; let’s drop that from our list of titles. And, let’s use a set to get a unique list.

In [6]:
# let's look at passenger title
titles = set()
for name in k_all['Name']:
  titles.add(name.split(',')[1].split('.')[0].strip())
print(titles)
{'Rev', 'Capt', 'Dona', 'Don', 'Mme', 'Ms', 'Major', 'Mr', 'Lady', 'Sir', 'Mlle', 'Col', 'the Countess', 'Jonkheer', 'Master', 'Miss', 'Dr', 'Mrs'}
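
As an aside, the same extraction can probably be done without an explicit loop. The following is only a sketch, assuming every name follows the “<surname>, <title>. …” pattern; the three sample names and the regular expression are my own illustration, not something taken from the notebook.

```python
import pandas as pd

# A few sample names in the "<surname>, <title>. ..." layout (illustrative only)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the next period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(sorted(titles.unique()))  # ['Miss', 'Mr', 'the Countess']
```

A name that does not match the pattern would come back as NaN rather than raising an error, so a quick isna() check afterwards would be worthwhile.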

Let’s combine the ones that are fundamentally the same (e.g. Mme, Mrs, Ms) and reduce the number further by combining others into “categories” (e.g. Capt, Col, Major).

In [7]:
# we can combine at least a few of these in a single title category
d_titles = {
  "Master": "Master",
  "Capt": "Official",
  "Sir": "Noble",
  "Don": "Noble",
  "Miss": "Miss",
  "Dr": "Official",
  "Dona": "Noble",
  "Mme": "Mrs",
  "Major": "Official",
  "Mrs": "Mrs",
  "Mlle": "Miss",
  "the Countess": "Noble",
  "Ms": "Mrs",
  "Rev": "Official",
  "Jonkheer": "Noble",
  "Col": "Official",
  "Lady": "Noble",
  "Mr": "Mr"
}
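
One thing to watch with a lookup dictionary like this: when a pandas Series is mapped through a dict, any value missing from the dict becomes NaN without complaint. Here is a minimal sketch of a coverage check. The names sample_map and sample_titles are hypothetical, shortened stand-ins, not the full d_titles or the real data.

```python
import pandas as pd

# Hypothetical, abbreviated mapping and titles, for illustration only
sample_map = {"Mr": "Mr", "Mrs": "Mrs", "Mme": "Mrs", "Capt": "Official"}
sample_titles = pd.Series(["Mr", "Mme", "Capt", "Dona"])

# Titles with no entry in the mapping come back as NaN
mapped = sample_titles.map(sample_map)
unmapped = sorted(set(sample_titles[mapped.isna()]))

print(unmapped)  # ['Dona']
```

Running the same check against the titles collected from both datasets should return an empty list if the dictionary covers everything.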

Let’s update the training dataset and have a look. I am using map with a lambda function to get the title from the name for each row in the dataset, then map again to convert each title to its category in d_titles.

In [8]:
k_trn["Title"] = k_trn["Name"].map(lambda name:name.split(',')[1].split('.')[0].strip())
k_trn["Title"] = k_trn.Title.map(d_titles)
In [9]:
k_trn.groupby("Title", as_index=False).PassengerId.count()
Out[9]:
      Title  PassengerId
0    Master           40
1      Miss          184
2        Mr          517
3       Mrs          127
4     Noble            5
5  Official           18

Now, let’s add that feature to the test dataset and save our work.

In [10]:
k_tst["Title"] = k_tst["Name"].map(lambda name:name.split(',')[1].split('.')[0].strip())
k_tst["Title"] = k_tst.Title.map(d_titles)
In [11]:
k_tst.groupby("Title", as_index=False).PassengerId.count()
Out[11]:
      Title  PassengerId
0    Master           21
1      Miss           78
2        Mr          240
3       Mrs           73
4     Noble            1
5  Official            5
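
Since the same two lines were applied to both datasets, they could be wrapped in a small helper. This is just a sketch: add_title, demo_map and the demo rows are names and data I made up for illustration, and the helper assumes every name has the “<surname>, <title>. …” shape.

```python
import pandas as pd

def add_title(df, mapping, name_col="Name"):
    """Return a copy of df with a Title column derived from name_col."""
    # Same split logic as above: text between the comma and the first period
    raw = df[name_col].map(lambda name: name.split(",")[1].split(".")[0].strip())
    out = df.copy()
    out["Title"] = raw.map(mapping)
    return out

# Hypothetical mini mapping and rows, for illustration only
demo_map = {"Mr": "Mr", "Mlle": "Miss", "Col": "Official"}
demo = pd.DataFrame({"Name": [
    "Allen, Mr. William Henry",
    "Sagesser, Mlle. Emma",
    "Weir, Col. John",
]})

demo = add_title(demo, demo_map)
print(demo["Title"].tolist())  # ['Mr', 'Miss', 'Official']
```

With something like this, the training and test sets would each need only one call, and any change to the extraction logic would happen in one place.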

Quick look to make sure all three added features (that we’ve created so far) are present in the dataset/DataFrame.

In [12]:
k_trn.head()
Out[12]:
   PassengerId  Survived  Pclass  Name                                                  Sex    Age  SibSp  Parch            Ticket   Fare  Cabin  Embarked  FamilySize  Group  Title
0            1         0       3  Braund, Mr. Owen Harris                              male  22.00      1      0         A/5 21171   7.25    NaN         S           2      1     Mr
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.00      1      0          PC 17599  71.28    C85         C           2      2    Mrs
2            3         1       3  Heikkinen, Miss. Laina                             female  26.00      0      0  STON/O2. 3101282   7.92    NaN         S           1      1   Miss
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.00      1      0            113803  53.10   C123         S           2      2    Mrs
4            5         0       3  Allen, Mr. William Henry                             male  35.00      0      0            373450   8.05    NaN         S           1      1     Mr
In [13]:
# let's save those additions
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Individual Fares

Not sure how much value this will have, but I am not happy with the Fare feature containing the total fare for a ticket that covers multiple passengers. If Fare has any value to the model, I expect individual fares would be more informative than joint fares. Since the Group feature essentially counts the number of people on a single ticket, I will use it to estimate an individual fare for each passenger, iFare (sorry, couldn’t resist).

And I did say that the Group feature might come in handy.

In [14]:
# now how about an individual fare feature
# expect it to be a simple chore
k_trn["iFare"] = round(k_trn["Fare"] / k_trn["Group"], 4)
In [15]:
k_trn.head()
Out[15]:
   PassengerId  Survived  Pclass  Name                                                  Sex    Age  SibSp  Parch            Ticket   Fare  Cabin  Embarked  FamilySize  Group  Title  iFare
0            1         0       3  Braund, Mr. Owen Harris                              male  22.00      1      0         A/5 21171   7.25    NaN         S           2      1     Mr   7.25
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.00      1      0          PC 17599  71.28    C85         C           2      2    Mrs  35.64
2            3         1       3  Heikkinen, Miss. Laina                             female  26.00      0      0  STON/O2. 3101282   7.92    NaN         S           1      1   Miss   7.92
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.00      1      0            113803  53.10   C123         S           2      2    Mrs  26.55
4            5         0       3  Allen, Mr. William Henry                             male  35.00      0      0            373450   8.05    NaN         S           1      1     Mr   8.05
In [16]:
k_tst["iFare"] = round(k_tst["Fare"] / k_tst["Group"], 4)
In [17]:
# let's save those last additions
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)
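
For anyone wanting to sanity-check the individual fare arithmetic, here is a tiny standalone sketch. The demo frame is made up from three of the Fare/Group pairs shown in the table above, and the guard on Group is my own assumption that a ticket always covers at least one passenger.

```python
import pandas as pd

# Toy frame with just the two columns the calculation needs (illustrative only)
demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 53.1],
    "Group": [1, 2, 2],
})

# Assumption: Group is never zero or missing, otherwise the division
# would produce inf or NaN
assert (demo["Group"] > 0).all()

# Split each ticket's fare evenly across the people travelling on it
demo["iFare"] = (demo["Fare"] / demo["Group"]).round(4)

print(demo)
```

A passenger travelling alone keeps the full fare, while the two passengers sharing the 53.1 ticket get roughly 26.55 each, which matches the iFare values shown earlier.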

Done

For now I think that is it. At the moment I don’t have any thoughts about other possible manufactured features. But, you never know if something else might come up as I go along.

Feel free to download and play with my version of this post’s related notebook.