What I am proposing to do next is some feature engineering. At least I hope it can be considered feature engineering.
I am going to add additional columns to the two modified datasets (oma*.csv) I created in the last post or two.
New Features
I will add that family size feature we covered in Exploratory Data Analysis, Part II. But there were also non-family groups travelling together on a single ticket.
For example, consider Mr. Thomas Storey (for whom we estimated a fare in the test dataset in a previous post, and then modified it in the last post).
… with Storey and several other shipmates (Andrew Shannon [Lionel Leonard], August Johnson, William Henry Törnquist, Alfred Carver and William Cahoone Johnson) forced to travel aboard Titanic as passengers. Storey and his shipmates boarded the Titanic at Southampton, all travelling third class (ticket number 370160).
Encyclopedia Titanica
Given that such groups exist, I want to get that group size information into the possible feature set as well.
Given the preceding, it is likely that a number of the ticket prices shown for individuals are actually group fares. I think the model would really work better if we had individual fares. So, I will attempt to sort something out with respect to fares.
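One possible way to get at an individual fare, sketched here on made-up rows rather than the real data: divide each listed fare by the number of passengers sharing the ticket. This assumes the listed fare is the total for the ticket and is repeated on every row for that ticket, which may not hold everywhere. The `FarePerPerson` column name and the toy values are hypothetical.

```python
import pandas as pd

# Toy data (not from the Titanic files): three people on ticket T1, one on T2
demo = pd.DataFrame({
    "Ticket": ["T1", "T1", "T1", "T2"],
    "Fare": [30.0, 30.0, 30.0, 8.0],
})

# Number of rows sharing each ticket, aligned back to every row
group_size = demo.groupby("Ticket")["Fare"].transform("size")

# Assumption: Fare is the whole ticket's price, so split it evenly
demo["FarePerPerson"] = demo["Fare"] / group_size
print(demo)
```

On the real data the ticket counts would need to come from the training and test sets combined, since a group can be split across the two files.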
Finally, the name feature is, in and of itself, not horribly helpful. But it does in many cases contain the passenger’s title. It is possible that title might be a feature worth looking at. It may provide an additional perspective on a passenger’s status in the eyes of the crew and other passengers, something which would likely affect the likelihood of survival.
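As a rough sketch of how a title might be pulled out of a name, assuming names follow the "Surname, Title. Given names" layout seen in the dataset: a regular expression can capture the text between the comma and the next full stop. The toy rows and the `Title` column name are illustrative assumptions, and real names may need extra handling.

```python
import pandas as pd

# Toy rows in the "Surname, Title. Given names" layout
names = pd.DataFrame({
    "Name": [
        "Palsson, Master. Gosta Leonard",
        "Storey, Mr. Thomas",
        "Heikkinen, Miss. Laina",
    ]
})

# Capture everything between the comma and the first full stop after it
names["Title"] = (
    names["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)
print(names["Title"].tolist())
```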
Some people talked about using the cabin information to determine the deck to which a person was assigned. The thinking being that the lower the deck the worse the likelihood of survival. But, there are a great many passengers without any cabin information. And, there really is no way to impute the cabin value in any meaningful way. So, I am going to ignore the Cabin feature altogether. Though I doubt I will remove it from the CSVs.
Once I get these features sorted, I will save the modified training/test datasets to CSV files for use down the road.
Also, to be safe, I will likely save the datasets after each new feature is added. Don’t know if I will use differing file names for each iteration.
So, let’s get going.
Quick Check
Decided to take a quick look at both datasets to make sure that I wasn’t missing any data values (except for Age and Cabin).
import pandas as pd

# paths to datasets
# current datasets of choice
oma_trn_2 = "./data/titanic/oma_trn_2.csv"
oma_tst_2 = "./data/titanic/oma_tst_2.csv"
# datasets to be generated in this notebook
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_2)
k_tst = pd.read_csv(oma_tst_2)
# curiosity killed the cat
# training set: rows with any missing value (ignoring Age and Cabin), then zero fares
k_trn_nac = k_trn.drop(["Age", "Cabin"], axis=1)
k_trn_nac[k_trn_nac.isnull().any(axis=1)]
k_trn_nac[k_trn_nac["Fare"] == 0.0]
# test set: same two checks
k_tst_nac = k_tst.drop(["Age", "Cabin"], axis=1)
k_tst_nac[k_tst_nac.isnull().any(axis=1)]
k_tst_nac[k_tst_nac["Fare"] == 0.0]
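The same kind of check can also be boiled down to counts, which is handy outside a notebook where dataframes are not displayed automatically. A minimal sketch on a hypothetical miniature frame standing in for `k_trn` / `k_tst` (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the datasets
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, np.nan, 30.0],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 0.0, 8.05],
    "Embarked": ["S", "C", "S"],
})

# Missing values per column, ignoring Age and Cabin
missing = demo.drop(columns=["Age", "Cabin"]).isnull().sum()

# Number of zero fares, which would also be worth a second look
zero_fares = int((demo["Fare"] == 0.0).sum())

print(missing)
print("zero fares:", zero_fares)
```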
Family Size
This should be relatively easy, I’ve already done it once. So add a column/feature FamilySize which is the sum SibSp + Parch + 1 for each passenger in both datasets.
# let's add the family size feature
k_trn['FamilySize'] = k_trn['Parch'] + k_trn['SibSp'] + 1
k_tst['FamilySize'] = k_tst['Parch'] + k_tst['SibSp'] + 1
Quick look.
k_trn.head()
k_tst.head()
And, let’s save that enhancement.
# let's save our work so far
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)
Groups
As mentioned, there were also possibly non-family groups travelling on a single ticket. Let’s have a look.
# let's check for non-family groups travelling on a single ticket
p_grps = k_trn.groupby(["Ticket", "SibSp", "Parch"], as_index=False).PassengerId.count()
p_nofam = p_grps[(p_grps['SibSp'] == 0) & (p_grps['Parch'] == 0) & (p_grps['PassengerId'] > 1)]
p_nofam.head()
k_trn[k_trn['Ticket'] == '110152']
There may possibly be some value in identifying such groups. And, I may need that info elsewhere in my feature engineering. So, let’s add a new column “Group”. Probably should have called the “FamilySize” column “Family” to save some typing.
# add new column "Group" to training dataset, then to test dataset
# add new column with default value of 1
k_trn.loc[:,"Group"] = 1
# now change those rows that need changing
i = 0
for _, rw in k_trn.iterrows():
    p_tkt = rw.Ticket
    rw_tkt = p_grps[p_grps['Ticket'] == p_tkt]
    print(rw_tkt)
    cnt_tkt = rw_tkt.PassengerId.item()
    print(cnt_tkt)
    if cnt_tkt > 1:
        rw.Group = cnt_tkt
    # i += 1
    # if i > 4:
    #     break
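Oops! Need to rework that p_grps dataframe. It was grouped by Ticket, SibSp and Parch, so a ticket can match more than one row, and .item() only works on a single value. Also, do note that I couldn’t use the rw variable returned by iterrows() to modify the dataframe. That’s what I attempted above. Took me a bit of fooling around to sort that out.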
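A small sketch of the iterrows() behaviour, on a made-up frame: with mixed column types the row handed back is a copy, so assigning to it leaves the dataframe unchanged, whereas writing through `.loc` with the row index does update it. The frame and values below are purely illustrative.

```python
import pandas as pd

# Toy frame with mixed dtypes (strings and integers), like the real data
demo = pd.DataFrame({"Ticket": ["A", "A", "B"], "Group": [1, 1, 1]})

# Attempt 1: assign to the row object; this only changes the copy
for ndx, rw in demo.iterrows():
    rw["Group"] = 99
after_copy_write = demo["Group"].tolist()

# Attempt 2: assign through .loc using the row's index label
for ndx, rw in demo.iterrows():
    if rw["Ticket"] == "A":
        demo.loc[ndx, "Group"] = 2
after_loc_write = demo["Group"].tolist()

print(after_copy_write)
print(after_loc_write)
```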
# Okay let's fix that mistake
p_grp2 = k_trn.groupby("Ticket", as_index=False).PassengerId.count()
k_trn.loc[:,"Group"] = 1
for ndx, rw in k_trn.iterrows():
    p_tkt = rw.Ticket
    rw_tkt = p_grp2[p_grp2['Ticket'] == p_tkt]
    cnt_tkt = rw_tkt.PassengerId.item()
    if cnt_tkt > 1:
        k_trn.loc[ndx, "Group"] = cnt_tkt
k_trn[k_trn["Group"] > 1].head()
You may have noticed in the above that “Palsson, Master. Gosta Leonard” ended up with a FamilySize of 5 and a Group of 4. That’s likely due to the other members being in the test dataset. What to do?
# okay need to combine two datasets to get correct group size values by ticket number
k_all = pd.concat([k_trn, k_tst], ignore_index=True)
p_grp3 = k_all.groupby("Ticket", as_index=False).PassengerId.count()
k_trn.loc[:,"Group"] = 1
for ndx, rw in k_trn.iterrows():
    p_tkt = rw.Ticket
    rw_tkt = p_grp3[p_grp3['Ticket'] == p_tkt]
    cnt_tkt = rw_tkt.PassengerId.item()
    if cnt_tkt > 1:
        k_trn.loc[ndx, "Group"] = cnt_tkt
k_trn[k_trn["FamilySize"] == 6].sort_values("Ticket")
Seems to be an improvement. And a group size larger than the family size wouldn’t seem unreasonable if staff were travelling with a family. The other way around likely implies bad data. But I am not going to go looking for trouble. So, update the test dataset and save both.
k_tst.loc[:,"Group"] = 1
for ndx, rw in k_tst.iterrows():
    p_tkt = rw.Ticket
    rw_tkt = p_grp3[p_grp3['Ticket'] == p_tkt]
    cnt_tkt = rw_tkt.PassengerId.item()
    if cnt_tkt > 1:
        k_tst.loc[ndx, "Group"] = cnt_tkt
# let's save our work so far
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)
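For what it’s worth, the row-by-row loops used for the Group column could probably be replaced by a vectorised lookup. Here is a sketch of that alternative (not the approach used in this post), on toy frames standing in for the training and test sets: count tickets over both sets combined, then map each ticket to its count. The frame names `trn` and `tst` and their contents are made up, and it assumes no ticket value is missing.

```python
import pandas as pd

# Toy stand-ins for the training and test datasets
trn = pd.DataFrame({"PassengerId": [1, 2, 3], "Ticket": ["T1", "T2", "T1"]})
tst = pd.DataFrame({"PassengerId": [4, 5], "Ticket": ["T1", "T3"]})

# Count passengers per ticket across both datasets combined
all_tickets = pd.concat([trn["Ticket"], tst["Ticket"]], ignore_index=True)
ticket_counts = all_tickets.value_counts()

# Look up each passenger's ticket count; lone travellers get 1
trn["Group"] = trn["Ticket"].map(ticket_counts)
tst["Group"] = tst["Ticket"].map(ticket_counts)

print(trn)
print(tst)
```

This avoids iterrows() altogether, which tends to matter more as datasets grow; for data this small the loop is perfectly serviceable.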
Done
You know, I think that’s it for this post. Getting lengthy and has taken me more time than I expected. But, all good. Will continue looking at the remaining possible feature additions in the next post.
Feel free to download and play with my version of this post’s related notebook.
Resources
- Group and Aggregate by One or More Columns in Pandas
- How to add new columns to Pandas dataframe?