Okay, let’s carry on from last time. There are at least a couple more possible features to consider and attempt.

Modelling and feature engineering would likely be done in some iterative fashion. But I am just going to create all the features first, then start testing various combinations. That model testing will come in future posts, and maybe new features will come out of the woodwork as well.

Load Datasets

In [3]:

# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"

We need to load the oma_*_3.csv files in order to get the engineered features we added earlier. We don’t want to lose those.

In [4]:

import pandas as pd

# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
# combined frame, used to look at values across both datasets
k_all = pd.concat([k_trn, k_tst], ignore_index=True)
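As a small aside, here is a toy sketch of what ignore_index=True does when the two frames are stacked. The two tiny frames below are made up for illustration; they are not the real data.

```python
import pandas as pd

# hypothetical stand-ins for the training and test frames
trn = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
tst = pd.DataFrame({"PassengerId": [3, 4]})  # no Survived column, like a test set

# stack the rows; ignore_index=True renumbers the index from 0
both = pd.concat([trn, tst], ignore_index=True)
print(both)
```

Columns missing from one frame (Survived here) are filled with NaN for those rows, which is fine when the combined frame is only used for inspection.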

Passenger’s Title

We previously mentioned that a person’s status might affect their likelihood of survival, and that the title in their name might be an indicator of that status. So, let’s give it a shot.

Let’s start with a look at the passenger titles we find in the passenger name feature. You may recall, the name entries in our datasets look like: “<surname>, <title>. …”. Note the period after the title; let’s drop that from our list of titles. And, let’s use a set to get a unique list.

In [6]:

# let's look at passenger title
titles = set()
for name in k_all['Name']:
  titles.add(name.split(',')[1].split('.')[0].strip())
print(titles)

{'Rev', 'Capt', 'Dona', 'Don', 'Mme', 'Ms', 'Major', 'Mr', 'Lady', 'Sir', 'Mlle', 'Col', 'the Countess', 'Jonkheer', 'Master', 'Miss', 'Dr', 'Mrs'}
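The same extraction could probably also be done with a regular expression and pandas’ vectorized string methods instead of a loop. This is only a sketch of an alternative, not what the notebook uses. The first three names are taken from the dataset rows shown in this post; the last one is a made-up name, included only to show that a multi-word title is handled.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
    "Example, the Countess. of (Hypothetical Name)",  # made-up entry
])

# capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Either approach relies on the “<surname>, <title>.” pattern holding for every name.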

Let’s combine the ones that are fundamentally the same (e.g. Mme, Mrs, Ms) and reduce the number further by combining others into “categories” (e.g. Capt, Col, Major).

In [7]:

# we can combine at least a few of these in a single title category
d_titles = {
  "Master": "Master",
  "Capt": "Official",
  "Sir": "Noble",
  "Don": "Noble",
  "Miss": "Miss",
  "Dr": "Official",
  "Dona": "Noble",
  "Mme": "Mrs",
  "Major": "Official",
  "Mrs": "Mrs",
  "Mlle": "Miss",
  "the Countess": "Noble",
  "Ms": "Mrs",
  "Rev": "Official",
  "Jonkheer": "Noble",
  "Col": "Official",
  "Lady": "Noble",
  "Mr": "Mr"
}

Let’s update the training dataset and have a look. We use map with a lambda function to pull the title out of the name for each row, then map again with the dictionary to convert each title to its category.

In [8]:

# extract the raw title from each name, then map it to its category
k_trn["Title"] = k_trn["Name"].map(lambda name: name.split(',')[1].split('.')[0].strip())
k_trn["Title"] = k_trn["Title"].map(d_titles)

In [9]:

k_trn.groupby("Title", as_index=False).PassengerId.count()

Out[9]:

	Title	PassengerId
0	Master	40
1	Miss	184
2	Mr	517
3	Mrs	127
4	Noble	5
5	Official	18
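One thing to keep in mind: when map is given a dictionary, any value without a matching key comes back as NaN. The dictionary above was built from the titles found in both datasets, so that should not happen here, but a quick check is cheap. A minimal sketch, using a deliberately incomplete dictionary (d_partial and the sample values are made up for the example):

```python
import pandas as pd

# deliberately incomplete mapping, for illustration only
d_partial = {"Mr": "Mr", "Mrs": "Mrs", "Mme": "Mrs"}

raw = pd.Series(["Mr", "Mme", "Dona"])
mapped = raw.map(d_partial)

# titles missing from the dictionary are mapped to NaN
unmapped = sorted(raw[mapped.isna()].unique())
print(unmapped)
```

An empty list would mean every title found a category.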

Now, let’s add that feature to the test dataset and save our work.

In [10]:

# same two steps for the test dataset
k_tst["Title"] = k_tst["Name"].map(lambda name: name.split(',')[1].split('.')[0].strip())
k_tst["Title"] = k_tst["Title"].map(d_titles)

In [11]:

k_tst.groupby("Title", as_index=False).PassengerId.count()

Out[11]:

	Title	PassengerId
0	Master	21
1	Miss	78
2	Mr	240
3	Mrs	73
4	Noble	1
5	Official	5

A quick look to make sure all three features we’ve created so far (FamilySize, Group and Title) are present in the DataFrame.

In [12]:

k_trn.head()

Out[12]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	FamilySize	Group	Title
0	1	0	3	Braund, Mr. Owen Harris	male	22.00	1	A/5 21171	7.25	NaN	S	2	1	Mr
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.00	1	PC 17599	71.28	C85	C	2	2	Mrs
2	3	1	3	Heikkinen, Miss. Laina	female	26.00	0	STON/O2. 3101282	7.92	NaN	S	1	1	Miss
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.00	1	113803	53.10	C123	S	2	2	Mrs
4	5	0	3	Allen, Mr. William Henry	male	35.00	0	373450	8.05	NaN	S	1	1	Mr

In [13]:

# let's save those additions
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Individual Fares

I’m not sure how much value this will have, but I am not happy that the Fare feature holds the fare for a whole ticket, which may cover multiple passengers. If Fare has any value to the model, I expect individual fares to be more useful than joint fares. Since the Group feature essentially counts the number of people on a single ticket, I will use it to estimate an individual fare for each passenger, iFare (sorry, couldn’t resist).

And, I did say that the Group feature might come in handy.

In [14]:

# now how about an individual fare feature
# expect it to be a simple chore
k_trn["iFare"] = round(k_trn["Fare"] / k_trn["Group"], 4)

In [15]:

k_trn.head()

Out[15]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	FamilySize	Group	Title	iFare
0	1	0	3	Braund, Mr. Owen Harris	male	22.00	1	A/5 21171	7.25	NaN	S	2	1	Mr	7.25
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.00	1	PC 17599	71.28	C85	C	2	2	Mrs	35.64
2	3	1	3	Heikkinen, Miss. Laina	female	26.00	0	STON/O2. 3101282	7.92	NaN	S	1	1	Miss	7.92
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.00	1	113803	53.10	C123	S	2	2	Mrs	26.55
4	5	0	3	Allen, Mr. William Henry	male	35.00	0	373450	8.05	NaN	S	1	1	Mr	8.05
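To sanity check the arithmetic, here is a small sketch that redoes the division on a few rows. The first three rows use the Fare and Group values as displayed above (which are rounded for display); the last row is a hypothetical passenger with a missing fare, added to show what the division does in that case.

```python
import pandas as pd

df = pd.DataFrame({
    "Fare": [7.25, 71.28, 53.10, None],  # last value is a made-up missing fare
    "Group": [1, 2, 2, 1],
})

# fare per person on the ticket, rounded to 4 decimals
df["iFare"] = (df["Fare"] / df["Group"]).round(4)
print(df)
```

A missing Fare simply carries through as NaN in iFare, so any imputation would still need to be dealt with separately.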

In [16]:

k_tst["iFare"] = round(k_tst["Fare"] / k_tst["Group"], 4)

In [17]:

# let's save those last additions
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Done

For now I think that is it. At the moment I don’t have any thoughts about other possible manufactured features. But, you never know if something else might come up as I go along.

Feel free to download and play with my version of this post’s related notebook.

Too Old To Code

Titanic Dataset: Feature Engineering, Part 2
