What I'm proposing to do next is some feature engineering. At least, I hope it can be considered feature engineering.

I am going to add additional columns to the two modified datasets (oma*.csv) I created in the last post or two.

New Features

I will add that family size feature we covered in Exploratory Data Analysis, Part II. But there were also non-family groups travelling together on a single ticket.

For example, Mr. Thomas Storey (for whom we estimated a fare in the test dataset in a previous post, and then modified in the last post) was travelling with a group of shipmates:

…with Storey and several other shipmates (Andrew Shannon [Lionel Leonard], August Johnson, William Henry Törnquist, Alfred Carver and William Cahoone Johnson) forced to travel aboard Titanic as passengers. Storey and his shipmates boarded the Titanic at Southampton, all travelling third class (ticket number 370160).

Encyclopedia Titanica

If I can identify such groups, I want to get that group size information into the possible feature set as well.

Given the preceding, it is likely that a number of the ticket prices shown for individuals are actually group fares. I think the model would work better if we had individual fares. So, I will attempt to sort out something with respect to fares.
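To make that concrete, one plausible approach (just a sketch, not something I have settled on) would be to divide the listed fare by the number of passengers sharing the same ticket. Here FarePP is a hypothetical column name, and k_trn is the training dataframe loaded further down.

# sketch only: per-person fare as the listed fare divided by the
# number of passengers travelling on the same ticket
tkt_counts = k_trn["Ticket"].map(k_trn["Ticket"].value_counts())
k_trn["FarePP"] = k_trn["Fare"] / tkt_counts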

Finally, the Name feature is, in and of itself, not horribly helpful. But, it does in many cases contain the passenger's title. It is possible that title might be a feature worth looking at. It may provide an additional perspective on a passenger's status in the eyes of the crew and other passengers. Something that would likely affect their chances of survival.
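As a rough illustration of the idea (a sketch only; the Title column name and the regular expression are my own placeholders, and k_trn is the training dataframe loaded further down):

# sketch only: the title sits between the comma and the first period,
# e.g. "Braund, Mr. Owen Harris" -> "Mr"
k_trn["Title"] = k_trn["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()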

Some people have talked about using the cabin information to determine the deck to which a passenger was assigned. The thinking being that the lower the deck, the worse the likelihood of survival. But, there are a great many passengers without any cabin information, and there really is no way to impute the cabin value in any meaningful way. So, I am going to ignore the Cabin feature altogether. Though I doubt I will remove it from the CSVs.
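For the record, if one did want to go down that road, the deck is just the first letter of the Cabin value, so a sketch would be as simple as the following (Deck is a hypothetical column name; missing cabins simply stay NaN):

# sketch only: deck letter from the Cabin value; NaN passes through as NaN
k_trn["Deck"] = k_trn["Cabin"].str[0]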

Once I get these features sorted, I will save the modified training/test datasets to CSV files for use down the road.

Also, to be safe, I will likely save the datasets after each new feature is added. Don’t know if I will use differing file names for each iteration.

So, let’s get going.

Quick Check

Decided to take a quick look at both datasets to make sure that I wasn't missing any data values (except for Age and Cabin).

In [3]:
# paths to datasets
# current datasets of choice
oma_trn_2 = "./data/titanic/oma_trn_2.csv"
oma_tst_2 = "./data/titanic/oma_tst_2.csv"
# datasets to be generated in this notebook
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"
In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_2)
k_tst = pd.read_csv(oma_tst_2)
In [5]:
# curiosity killed the cat
k_trn_nac = k_trn.drop(["Age", "Cabin"], axis=1)
# any rows with missing values (Age and Cabin excluded)?
k_trn_nac[k_trn_nac.isnull().any(axis=1)]
# any rows with a fare of zero?
k_trn_nac[k_trn_nac["Fare"] == 0.0]
Out[5]:
| PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | Embarked |
(no rows)
Out[5]:
| PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | Embarked |
(no rows)
In [6]:
k_tst_nac = k_tst.drop(["Age", "Cabin"], axis=1)
k_tst_nac[k_tst_nac.isnull().any(axis=1)]
k_tst_nac[k_tst_nac["Fare"] == 0.0]
Out[6]:
| PassengerId | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | Embarked | Survived |
(no rows)
Out[6]:
| PassengerId | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | Embarked | Survived |
(no rows)
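For a more compact view, a per-column count of missing values would have worked just as well. A sketch along those lines:

# sketch only: missing-value counts per column, ignoring Age and Cabin
k_trn.drop(columns=["Age", "Cabin"]).isna().sum()
k_tst.drop(columns=["Age", "Cabin"]).isna().sum()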

Family Size

This should be relatively easy; I've already done it once. So, add a column/feature FamilySize, which is the sum SibSp + Parch + 1, for each passenger in both datasets.

In [7]:
# let's add the family size feature
k_trn['FamilySize'] = k_trn['Parch'] + k_trn['SibSp'] + 1
k_tst['FamilySize'] = k_tst['Parch'] + k_tst['SibSp'] + 1

Quick look.

In [8]:
k_trn.head()
k_tst.head()
Out[8]:
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.00 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.00 | 1 | 0 | PC 17599 | 71.28 | C85 | C | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.00 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00 | 1 | 0 | 113803 | 53.10 | C123 | S | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.00 | 0 | 0 | 373450 | 8.05 | NaN | S | 1 |
Out[8]:
|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived | FamilySize |
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.50 | 0 | 0 | 330911 | 7.83 | NaN | Q | 0 | 1 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.00 | 1 | 0 | 363272 | 7.00 | NaN | S | 1 | 2 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.00 | 0 | 0 | 240276 | 9.69 | NaN | Q | 0 | 1 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.00 | 0 | 0 | 315154 | 8.66 | NaN | S | 0 | 1 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.00 | 1 | 1 | 3101298 | 12.29 | NaN | S | 1 | 3 |

And, let’s save that enhancement.

In [9]:
# let's save our work so far
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Groups

As mentioned, there were also possibly non-family groups travelling on a single ticket. Let’s have a look.

In [10]:
# let's check for non-family groups travelling on a single ticket
p_grps = k_trn.groupby(["Ticket", "SibSp", "Parch"], as_index=False).PassengerId.count()
p_nofam = p_grps[(p_grps['SibSp'] == 0) & (p_grps['Parch'] == 0) & (p_grps['PassengerId'] > 1)]
p_nofam.head()
Out[10]:
|    | Ticket | SibSp | Parch | PassengerId |
| 0  | 110152 | 0 | 0 | 3 |
| 3  | 110465 | 0 | 0 | 2 |
| 33 | 113572 | 0 | 0 | 2 |
| 49 | 113798 | 0 | 0 | 2 |
| 85 | 1601 | 0 | 0 | 7 |
In [11]:
k_trn[k_trn['Ticket'] == '110152']
Out[11]:
|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize |
| 257 | 258 | 1 | 1 | Cherry, Miss. Gladys | female | 30.00 | 0 | 0 | 110152 | 86.50 | B77 | S | 1 |
| 504 | 505 | 1 | 1 | Maioni, Miss. Roberta | female | 16.00 | 0 | 0 | 110152 | 86.50 | B79 | S | 1 |
| 759 | 760 | 1 | 1 | Rothes, the Countess. of (Lucy Noel Martha Dye... | female | 33.00 | 0 | 0 | 110152 | 86.50 | B77 | S | 1 |

There may be some value in identifying such groups. And, I may need that info elsewhere in my feature engineering. So, let's add a new column, "Group". Probably should have called the "FamilySize" column "Family" to save some typing.

In [12]:
# add new column "Group" to training dataset, then to test dataset
# add new column with default value of 1
k_trn.loc[:,"Group"] = 1
# now change those rows that need changing
i = 0
for _, rw in k_trn.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grps[p_grps['Ticket'] == p_tkt]
  print(rw_tkt)
  cnt_tkt = rw_tkt.PassengerId.item()
  print(cnt_tkt)
  if cnt_tkt > 1:
    rw.Group = cnt_tkt
  # i += 1
  # if i > 4:
  #   break
        Ticket  SibSp  Parch  PassengerId
564  A/5 21171      1      0            1
1
       Ticket  SibSp  Parch  PassengerId
643  PC 17599      1      0            1
1
               Ticket  SibSp  Parch  PassengerId
723  STON/O2. 3101282      0      0            1
1
    Ticket  SibSp  Parch  PassengerId
51  113803      1      0            2
2
     Ticket  SibSp  Parch  PassengerId
511  373450      0      0            1
1
     Ticket  SibSp  Parch  PassengerId
301  330877      0      0            1
1
   Ticket  SibSp  Parch  PassengerId
93  17463      0      0            1
1
     Ticket  SibSp  Parch  PassengerId
427  349909      0      4            1
428  349909      3      1            3
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-52d90aec49b8> in <module>
      8   rw_tkt = p_grps[p_grps['Ticket'] == p_tkt]
      9   print(rw_tkt)
---> 10   cnt_tkt = rw_tkt.PassengerId.item()
     11   print(cnt_tkt)
     12   if cnt_tkt > 1:

E:\appDev\Miniconda3\envs\ds-3.9\lib\site-packages\pandas\core\base.py in item(self)
    418         if len(self) == 1:
    419             return next(iter(self))
--> 420         raise ValueError("can only convert an array of size 1 to a Python scalar")
    421
    422     @property

ValueError: can only convert an array of size 1 to a Python scalar

Oops! Need to rework that p_grps dataframe. Also, do note that I couldn't use the rw variable returned by iterrows() to modify the dataframe. That's what I attempted above. Took me a bit of fooling around to sort that out.

In [13]:
# Okay, let's fix that mistake: group by Ticket only, so each ticket yields a single count row
p_grp2 = k_trn.groupby("Ticket", as_index=False).PassengerId.count()
k_trn.loc[:,"Group"] = 1
for ndx, rw in k_trn.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grp2[p_grp2['Ticket'] == p_tkt]
  cnt_tkt = rw_tkt.PassengerId.item()
  if cnt_tkt > 1:
    # update via .loc on the dataframe itself, not via the rw copy
    k_trn.loc[ndx, "Group"] = cnt_tkt
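As an aside, I believe the same result could be had without the explicit loop, by letting groupby()/transform() broadcast the per-ticket count back onto every row. A minimal sketch (not what I actually ran above); singleton tickets naturally come out with a count of 1, matching the default value:

# sketch only: per-ticket passenger count broadcast back to each row
k_trn["Group"] = k_trn.groupby("Ticket")["PassengerId"].transform("count")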
In [14]:
k_trn[k_trn["Group"] > 1].head()
Out[14]:
|    | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize | Group |
| 3  | 4  | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00 | 1 | 0 | 113803 | 53.10 | C123 | S | 2 | 2 |
| 7  | 8  | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.00 | 3 | 1 | 349909 | 21.07 | NaN | S | 5 | 4 |
| 8  | 9  | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.00 | 0 | 2 | 347742 | 11.13 | NaN | S | 3 | 3 |
| 9  | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.00 | 1 | 0 | 237736 | 30.07 | NaN | C | 2 | 2 |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.70 | G6 | S | 3 | 2 |

You may have noticed in the above that "Palsson, Master. Gosta Leonard" ended up with a FamilySize of 5 but a Group of 4. That's likely due to the other family members being in the test dataset. What to do?

In [17]:
# okay need to combine two datasets to get correct group size values by ticket number
k_all = k_trn
k_all = pd.concat([k_all, k_tst], ignore_index=True)
In [18]:
p_grp3 = k_all.groupby("Ticket", as_index=False).PassengerId.count()
k_trn.loc[:,"Group"] = 1
for ndx, rw in k_trn.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grp3[p_grp3['Ticket'] == p_tkt]
  cnt_tkt = rw_tkt.PassengerId.item()
  if cnt_tkt > 1:
    k_trn.loc[ndx, "Group"] = cnt_tkt
In [19]:
k_trn[k_trn["FamilySize"] == 6].sort_values("Ticket")
Out[19]:
|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize | Group |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.00 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S | 6 | 6 |
| 27  | 28  | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.00 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S | 6 | 6 |
| 438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.00 | 1 | 4 | 19950 | 263.00 | C23 C25 C27 | S | 6 | 6 |
| 88  | 89  | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.00 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S | 6 | 6 |
| 437 | 438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) | female | 24.00 | 2 | 3 | 29106 | 18.75 | NaN | S | 6 | 3 |
| 638 | 639 | 0 | 3 | Panula, Mrs. Juha (Maria Emilia Ojala) | female | 41.00 | 0 | 5 | 3101295 | 39.69 | NaN | S | 6 | 7 |
| 824 | 825 | 0 | 3 | Panula, Master. Urho Abraham | male | 2.00 | 4 | 1 | 3101295 | 39.69 | NaN | S | 6 | 7 |
| 266 | 267 | 0 | 3 | Panula, Mr. Ernesti Arvid | male | 16.00 | 4 | 1 | 3101295 | 39.69 | NaN | S | 6 | 7 |
| 164 | 165 | 0 | 3 | Panula, Master. Eino Viljami | male | 1.00 | 4 | 1 | 3101295 | 39.69 | NaN | S | 6 | 7 |
| 50  | 51  | 0 | 3 | Panula, Master. Juha Niilo | male | 7.00 | 4 | 1 | 3101295 | 39.69 | NaN | S | 6 | 7 |
| 686 | 687 | 0 | 3 | Panula, Mr. Jaako Arnold | male | 14.00 | 4 | 1 | 3101295 | 39.69 | NaN | S | 6 | 7 |
| 642 | 643 | 0 | 3 | Skoog, Miss. Margit Elizabeth | female | 2.00 | 3 | 2 | 347088 | 27.90 | NaN | S | 6 | 6 |
| 167 | 168 | 0 | 3 | Skoog, Mrs. William (Anna Bernhardina Karlsson) | female | 45.00 | 1 | 4 | 347088 | 27.90 | NaN | S | 6 | 6 |
| 360 | 361 | 0 | 3 | Skoog, Mr. Wilhelm | male | 40.00 | 1 | 4 | 347088 | 27.90 | NaN | S | 6 | 6 |
| 63  | 64  | 0 | 3 | Skoog, Master. Harald | male | 4.00 | 3 | 2 | 347088 | 27.90 | NaN | S | 6 | 6 |
| 634 | 635 | 0 | 3 | Skoog, Miss. Mabel | female | 9.00 | 3 | 2 | 347088 | 27.90 | NaN | S | 6 | 6 |
| 819 | 820 | 0 | 3 | Skoog, Master. Karl Thorsten | male | 10.00 | 3 | 2 | 347088 | 27.90 | NaN | S | 6 | 6 |
| 787 | 788 | 0 | 3 | Rice, Master. George Hugh | male | 8.00 | 4 | 1 | 382652 | 29.12 | NaN | Q | 6 | 6 |
| 16  | 17  | 0 | 3 | Rice, Master. Eugene | male | 2.00 | 4 | 1 | 382652 | 29.12 | NaN | Q | 6 | 6 |
| 171 | 172 | 0 | 3 | Rice, Master. Arthur | male | 4.00 | 4 | 1 | 382652 | 29.12 | NaN | Q | 6 | 6 |
| 278 | 279 | 0 | 3 | Rice, Master. Eric | male | 7.00 | 4 | 1 | 382652 | 29.12 | NaN | Q | 6 | 6 |
| 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.00 | 0 | 5 | 382652 | 29.12 | NaN | Q | 6 | 6 |

Seems to be an improvement. And, a group size larger than the family size wouldn't seem unreasonable if, say, staff were travelling with a family. The other way around likely implies bad data. But I am not going to go looking for trouble. So, update the test dataset and save both.
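Before doing that, though: if I ever did decide to go looking for that trouble, a quick check would be something along these lines (a sketch I am not running here):

# sketch only: rows where the ticket-based group is smaller than the family size
k_trn[k_trn["Group"] < k_trn["FamilySize"]]

Okay, back to updating the test dataset.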

In [22]:
k_tst.loc[:,"Group"] = 1
for ndx, rw in k_tst.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grp3[p_grp3['Ticket'] == p_tkt]
  cnt_tkt = rw_tkt.PassengerId.item()
  if cnt_tkt > 1:
    k_tst.loc[ndx, "Group"] = cnt_tkt
In [23]:
# let's save our work so far
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Done

You know, I think that's it for this post. It's getting lengthy and has taken me more time than I expected. But, all good. I will continue looking at the remaining possible feature additions in the next post.

Feel free to download and play with my version of this post’s related notebook.

Resources