Okay, let’s carry on from last time. There are at least a couple more possible features to consider and attempt.
In practice, the modelling and feature engineering would likely be done in some iterative fashion. But I am just going to create all the features first, then start testing various combinations. That model testing will come in future posts. And, maybe new features will come out of the woodwork as well.
Load Datasets
# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"
We need to load oma_*_3.csv in order to get the engineered features we added earlier; I don’t want to lose those.
# load the datasets currently of interest
import pandas as pd

k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
# combined frame (train + test), used to survey every title present
k_all = pd.concat([k_trn, k_tst], ignore_index=True)
Passenger’s Title
We previously mentioned that a person’s status might affect the likelihood of survival, and that the title in their name might be an indicator of that status. So, let’s give it a shot.
Let’s start with a look at the passenger titles we find in the passenger name feature. You may recall, the name entries in our datasets look like: “<surname>, <title>. …”. Note the period after the title; let’s drop that from our list of titles. And, let’s use a set to get a unique list.
# let's look at passenger title
titles = set()
for name in k_all['Name']:
    titles.add(name.split(',')[1].split('.')[0].strip())
print(titles)
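To see what that chained split is doing, here is a minimal standalone sketch applied to a few sample names in the dataset's format. The helper name `extract_title` and the short sample list are my own additions for illustration; they are not part of the notebook.

```python
def extract_title(name):
    """Return the title from a name formatted as '<surname>, <title>. <rest>'."""
    after_comma = name.split(',')[1]            # e.g. ' Mr. Owen Harris'
    before_period = after_comma.split('.')[0]   # e.g. ' Mr'
    return before_period.strip()                # e.g. 'Mr'


# a few sample names in the dataset's format (illustration only)
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

# a set comprehension keeps each title only once
sample_titles = {extract_title(n) for n in sample_names}
print(sorted(sample_titles))
```

Splitting on the first comma and then on the first period is what lets a multi-word title such as "the Countess" come through intact.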
Let’s combine the ones that are fundamentally the same (e.g. Mme, Mrs, Ms) and reduce the number further by combining others into “categories” (e.g. Capt, Col, Major).
# we can combine at least a few of these in a single title category
d_titles = {
    "Master": "Master",
    "Capt": "Official",
    "Sir": "Noble",
    "Don": "Noble",
    "Miss": "Miss",
    "Dr": "Official",
    "Dona": "Noble",
    "Mme": "Mrs",
    "Major": "Official",
    "Mrs": "Mrs",
    "Mlle": "Miss",
    "the Countess": "Noble",
    "Ms": "Mrs",
    "Rev": "Official",
    "Jonkheer": "Noble",
    "Col": "Official",
    "Lady": "Noble",
    "Mr": "Mr"
}
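One thing worth guarding against: mapping a column through a dictionary yields a missing value (NaN) for any key that is not in the dictionary, so a title left out of the mapping would quietly become a missing value. A quick sketch of a coverage check follows. The names `titles_found`, `title_map` and `unmapped` are small stand-ins I made up for illustration; in the notebook the equivalent check would be `titles - set(d_titles)`.

```python
# stand-in for the set of titles found in the names
titles_found = {"Mr", "Mrs", "Miss", "Mlle", "Capt"}

# stand-in for the mapping, deliberately missing "Capt"
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Mlle": "Miss"}

# titles present in the data but absent from the mapping's keys
unmapped = titles_found - set(title_map)
print(unmapped)
```

An empty set means every title found in the data has a category.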
Let’s update the training dataset and have a look, using map to apply a lambda function that gets the title from the name for each row in the dataset.
k_trn["Title"] = k_trn["Name"].map(lambda name: name.split(',')[1].split('.')[0].strip())
k_trn["Title"] = k_trn["Title"].map(d_titles)
k_trn.groupby("Title", as_index=False).PassengerId.count()
Now, let’s add that feature to the test dataset and save our work.
k_tst["Title"] = k_tst["Name"].map(lambda name: name.split(',')[1].split('.')[0].strip())
k_tst["Title"] = k_tst["Title"].map(d_titles)
k_tst.groupby("Title", as_index=False).PassengerId.count()
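Because the same two steps run on both DataFrames, they could be wrapped in a small helper. This is only a sketch: the function name `add_title`, the tiny demo frame and the shortened mapping are assumptions for illustration, and it uses the pandas `.str` accessor in place of the lambda.

```python
import pandas as pd


def add_title(df, mapping):
    """Return a copy of df with a 'Title' column derived from 'Name'."""
    out = df.copy()
    raw = (
        out["Name"]
        .str.split(",").str[1]   # text after the surname
        .str.split(".").str[0]   # text before the first period
        .str.strip()             # drop surrounding whitespace
    )
    out["Title"] = raw.map(mapping)  # titles missing from mapping become NaN
    return out


# demo data in the dataset's name format (illustration only)
demo = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Sagesser, Mlle. Emma"]})
demo = add_title(demo, {"Mr": "Mr", "Mlle": "Miss"})
print(demo["Title"].tolist())
```

Returning a copy leaves the caller's DataFrame untouched, which makes the helper safe to rerun in a notebook.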
Quick look to make sure all three added features (that we’ve created so far) are present in the dataset/DataFrame.
k_trn.head()
# let's save those additions
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)
Individual Fares
I am not sure how much value this will have, but I am not happy with the Fare feature containing a fare for a ticket with multiple passengers. If Fare has any value to the model, I expect that individual fares would be of better value than joint fares. Since the Group feature essentially counts the number of people on a single ticket, I will use that to estimate an individual fare for each passenger, iFare (sorry, couldn’t resist).
And, I did say that the Group feature might come in handy.
# now how about an individual fare feature
# expect it to be a simple chore
k_trn["iFare"] = (k_trn["Fare"] / k_trn["Group"]).round(4)
k_trn.head()
k_tst["iFare"] = (k_tst["Fare"] / k_tst["Group"]).round(4)
# let's save those last additions
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)
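As a small worked example of the arithmetic: a fare of 30.0 shared by a group of 3 gives an individual fare of 10.0. The sketch below uses made-up numbers, not values from the dataset. It also assumes, purely defensively, that a Group value of 0 could turn up; in that case the result is left as missing instead of infinite. Whether that can happen depends on how Group was built in the earlier post.

```python
import numpy as np
import pandas as pd

# made-up fares and group sizes, not taken from the dataset
demo = pd.DataFrame({"Fare": [30.0, 7.25, 15.5], "Group": [3, 1, 0]})

# swap a zero group size for NaN so the division cannot produce inf
group = demo["Group"].replace(0, np.nan)
demo["iFare"] = (demo["Fare"] / group).round(4)
print(demo)
```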
Done
For now I think that is it. At the moment I don’t have any thoughts about other possible manufactured features. But, you never know if something else might come up as I go along.
Feel free to download and play with my version of this post’s related notebook.