When I started what I thought would be my next post, on feature engineering, I discovered there were still some missing Fare values in the training and test datasets. Here’s how that came to be.

Additional Missing Data

I was looking at adding a feature for group size. My ramblings were going like this:

The above will be removed from that draft post on Feature Engineering.

And I couldn’t just leave it alone. So I am doing another post on sorting out missing data before moving on to some feature engineering. Sorry!

Some Investigating

Okay, let’s have a look at what appears to be missing and how we might impute the iffy values. (Note: as has been the case of late, I am not showing all the preparatory code cells. See the notebook.)

In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn)
k_tst = pd.read_csv(oma_tst)

Storey and Company

Let’s start with a look at that group of sailors Encyclopedia Titanica tied to Mr. Storey. For now, please ignore that imputed Fare we have for Mr. Storey. Not to mention a name difference (or two) between the Encyclopedia Titanica article and the datasets.

In [5]:
k_trn[(k_trn['Ticket'] == 'LINE')]
k_tst[(k_tst['Ticket'] == 'LINE') | (k_tst['Ticket'] == '3701') | (k_tst['Ticket'] == '392095')]
Out[5]:
     PassengerId  Survived  Pclass  Name                             Sex   Age    SibSp  Parch  Ticket  Fare  Cabin  Embarked
179  180          0         3       Leonard, Mr. Lionel              male  36.00  0      0      LINE    0.00  NaN    S
271  272          1         3       Tornquist, Mr. William Henry     male  25.00  0      0      LINE    0.00  NaN    S
302  303          0         3       Johnson, Mr. William Cahoone Jr  male  19.00  0      0      LINE    0.00  NaN    S
597  598          0         3       Johnson, Mr. Alfred              male  49.00  0      0      LINE    0.00  NaN    S
Out[5]:
     PassengerId  Pclass  Name                     Sex   Age    SibSp  Parch  Ticket  Fare  Cabin  Embarked  Survived
123  1015         3       Carver, Mr. Alfred John  male  28.00  0      0      392095  7.25  NaN    S         0
152  1044         3       Storey, Mr. Thomas       male  60.50  0      0      3701    9.27  NaN    S         0

What I am most likely going to do is leave Mr. Carver’s information as is, since he has a different ticket number. For Mr. Storey and the others, I am going to give them all the ticket number shown in the Encyclopedia Titanica post and a fare that is a suitable multiple (5) of Mr. Carver’s fare. You will understand why I am doing the latter when I start looking at feature engineering.
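As a rough sketch of that plan (this is not the notebook's actual update code), the shared fare works out to Mr. Carver's 7.25 times the five men in the group, or 36.25. The two small frames below are stand-ins for k_trn and k_tst, and GROUP_TICKET is a hypothetical placeholder, since the Encyclopedia Titanica ticket number is not quoted here.

```python
import pandas as pd

# Stand-in rows for the passengers discussed above (not the full datasets).
trn = pd.DataFrame({
    "PassengerId": [180, 272, 303, 598],
    "Ticket": ["LINE"] * 4,
    "Fare": [0.00] * 4,
})
tst = pd.DataFrame({
    "PassengerId": [1015, 1044],
    "Ticket": ["392095", "3701"],
    "Fare": [7.25, 9.27],
})

GROUP_TICKET = "GROUP_TICKET"  # hypothetical placeholder, not the real number
group_size = 5                 # four 'LINE' passengers plus Mr. Storey

# Mr. Carver (PassengerId 1015) supplies the per-person fare.
carver_fare = tst.loc[tst["PassengerId"] == 1015, "Fare"].iloc[0]
group_fare = round(float(carver_fare) * group_size, 2)

# Give the 'LINE' rows and Mr. Storey the shared ticket and fare.
trn_mask = trn["Ticket"] == "LINE"
tst_mask = tst["PassengerId"] == 1044
trn.loc[trn_mask, "Ticket"] = GROUP_TICKET
trn.loc[trn_mask, "Fare"] = group_fare
tst.loc[tst_mask, "Ticket"] = GROUP_TICKET
tst.loc[tst_mask, "Fare"] = group_fare

print(group_fare)
```

Mr. Carver's own row is left untouched, which matches the intent of keeping his information as is.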

Zero Fares

Let’s get all the passengers with a zero fare value.

In [6]:
k_trn[(k_trn['Fare'] == 0.0)]
k_tst[(k_tst['Fare'] == 0.0)]
Out[6]:
     PassengerId  Survived  Pclass  Name                              Sex   Age    SibSp  Parch  Ticket  Fare  Cabin  Embarked
179  180          0         3       Leonard, Mr. Lionel               male  36.00  0      0      LINE    0.00  NaN    S
263  264          0         1       Harrison, Mr. William             male  40.00  0      0      112059  0.00  B94    S
271  272          1         3       Tornquist, Mr. William Henry      male  25.00  0      0      LINE    0.00  NaN    S
277  278          0         2       Parkes, Mr. Francis "Frank"       male  NaN    0      0      239853  0.00  NaN    S
302  303          0         3       Johnson, Mr. William Cahoone Jr   male  19.00  0      0      LINE    0.00  NaN    S
413  414          0         2       Cunningham, Mr. Alfred Fleming    male  NaN    0      0      239853  0.00  NaN    S
466  467          0         2       Campbell, Mr. William             male  NaN    0      0      239853  0.00  NaN    S
481  482          0         2       Frost, Mr. Anthony Wood "Archie"  male  NaN    0      0      239854  0.00  NaN    S
597  598          0         3       Johnson, Mr. Alfred               male  49.00  0      0      LINE    0.00  NaN    S
633  634          0         1       Parr, Mr. William Henry Marsh     male  NaN    0      0      112052  0.00  NaN    S
674  675          0         2       Watson, Mr. Ennis Hastings        male  NaN    0      0      239856  0.00  NaN    S
732  733          0         2       Knight, Mr. Robert J              male  NaN    0      0      239855  0.00  NaN    S
806  807          0         1       Andrews, Mr. Thomas Jr            male  39.00  0      0      112050  0.00  A36    S
815  816          0         1       Fry, Mr. Richard                  male  NaN    0      0      112058  0.00  B102   S
822  823          0         1       Reuchlin, Jonkheer. John George   male  38.00  0      0      19972   0.00  NaN    S
Out[6]:
     PassengerId  Pclass  Name                                   Sex   Age    SibSp  Parch  Ticket  Fare  Cabin        Embarked  Survived
266  1158         1       Chisholm, Mr. Roderick Robert Crispin  male  NaN    0      0      112051  0.00  NaN          S         0
372  1264         1       Ismay, Mr. Joseph Bruce                male  49.00  0      0      112058  0.00  B52 B54 B56  S         1

Experiment

Let’s do a little looking at similar cases to see if we can find any suitable replacement values.

We’ll begin with tickets containing the string “2398”, though it might be better to look for tickets starting with that string.

In [7]:
k_trn[(k_trn["Ticket"].str.contains('2398'))]
Out[7]:
     PassengerId  Survived  Pclass  Name                              Sex   Age    SibSp  Parch  Ticket  Fare   Cabin  Embarked
20   21           0         2       Fynney, Mr. Joseph J              male  35.00  0      0      239865  26.00  NaN    S
277  278          0         2       Parkes, Mr. Francis "Frank"       male  NaN    0      0      239853  0.00   NaN    S
413  414          0         2       Cunningham, Mr. Alfred Fleming    male  NaN    0      0      239853  0.00   NaN    S
466  467          0         2       Campbell, Mr. William             male  NaN    0      0      239853  0.00   NaN    S
481  482          0         2       Frost, Mr. Anthony Wood "Archie"  male  NaN    0      0      239854  0.00   NaN    S
674  675          0         2       Watson, Mr. Ennis Hastings        male  NaN    0      0      239856  0.00   NaN    S
732  733          0         2       Knight, Mr. Robert J              male  NaN    0      0      239855  0.00   NaN    S
791  792          0         2       Gaskell, Mr. Alfred               male  16.00  0      0      239865  26.00  NaN    S

Ticket ‘239865’ had two passengers at 26.00, or 13.00 each. I will use 13.00 per passenger as the fare for the other passengers above, with the appropriate multiple for tickets with more than one passenger, e.g. ‘239853’, which covers three of the passengers above.
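To make the "appropriate multiples" idea concrete, here is a minimal sketch under my own assumptions: the frame is a stand-in holding only the six zero-fare rows above, and the group size is counted within that frame. With the real data you would probably want to count ticket holders across both datasets.

```python
import pandas as pd

# Stand-in rows for the zero-fare second-class tickets listed above.
zero_fares = pd.DataFrame({
    "PassengerId": [278, 414, 467, 482, 675, 733],
    "Ticket": ["239853", "239853", "239853", "239854", "239856", "239855"],
    "Fare": [0.0] * 6,
})

PER_PERSON = 26.00 / 2  # ticket '239865': 26.00 shared by two passengers

# Number of passengers on each ticket, aligned with the original rows.
group_size = zero_fares.groupby("Ticket")["PassengerId"].transform("count")

# Ticket fare = per-person fare times the number of people on the ticket.
zero_fares["Fare"] = PER_PERSON * group_size
print(zero_fares)
```

In this sketch the three passengers on ‘239853’ each end up with a ticket fare of 39.00, and the single-passenger tickets get 13.00.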

Now let’s look at the ‘11205..’ ticket numbers, in both datasets.

In [8]:
k_trn[(k_trn["Ticket"].str.contains('11205'))]
k_tst[(k_tst["Ticket"].str.contains('11205'))]
Out[8]:
     PassengerId  Survived  Pclass  Name                           Sex     Age    SibSp  Parch  Ticket  Fare   Cabin  Embarked
263  264          0         1       Harrison, Mr. William          male    40.00  0      0      112059  0.00   B94    S
633  634          0         1       Parr, Mr. William Henry Marsh  male    NaN    0      0      112052  0.00   NaN    S
806  807          0         1       Andrews, Mr. Thomas Jr         male    39.00  0      0      112050  0.00   A36    S
815  816          0         1       Fry, Mr. Richard               male    NaN    0      0      112058  0.00   B102   S
887  888          1         1       Graham, Miss. Margaret Edith   female  19.00  0      0      112053  30.00  B42    S
Out[8]:
     PassengerId  Pclass  Name                                   Sex   Age    SibSp  Parch  Ticket  Fare  Cabin        Embarked  Survived
266  1158         1       Chisholm, Mr. Roderick Robert Crispin  male  NaN    0      0      112051  0.00  NaN          S         0
372  1264         1       Ismay, Mr. Joseph Bruce                male  49.00  0      0      112058  0.00  B52 B54 B56  S         1

That doesn’t really look too helpful. Let’s look at the last case and then come back to this one afterwards.

So, let’s have a look at Mr. Reuchlin’s ticket number.

In [9]:
k_trn[(k_trn["Ticket"].str.contains('1997'))]
Out[9]:
     PassengerId  Survived  Pclass  Name                             Sex   Age    SibSp  Parch  Ticket  Fare  Cabin  Embarked
822  823          0         1       Reuchlin, Jonkheer. John George  male  38.00  0      0      19972   0.00  NaN    S

Not very helpful. Let’s try extending the search somewhat.

In [10]:
# that didn't do much, let's try '199'
k_trn[(k_trn["Ticket"].str.startswith('199')) & (k_trn["Pclass"] == 1) & (k_trn["SibSp"] == 0) & (k_trn["Parch"] == 0)]
Out[10]:
     PassengerId  Survived  Pclass  Name                             Sex   Age    SibSp  Parch  Ticket  Fare   Cabin  Embarked
55   56           1         1       Woolner, Mr. Hugh                male  NaN    0      0      19947   35.50  C52    S
298  299          1         1       Saalfeld, Mr. Adolphe            male  NaN    0      0      19988   30.50  C106   S
460  461          1         1       Anderson, Mr. Harry              male  48.00  0      0      19952   26.55  E12    S
822  823          0         1       Reuchlin, Jonkheer. John George  male  38.00  0      0      19972   0.00   NaN    S

That appeared to work. I’ll use the median of the three non-zero fares, 30.50, for Mr. Reuchlin.
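For the record, the 30.50 comes from taking the median after dropping the zero fare we are trying to replace. A small sketch using just the four fares from the output above (the Series and its labels are my own stand-ins):

```python
import pandas as pd

# Fares of the four solo first-class passengers with tickets starting '199'.
fares = pd.Series(
    [35.50, 30.50, 26.55, 0.00],
    index=["Woolner", "Saalfeld", "Anderson", "Reuchlin"],
)

# Exclude the zero fare being replaced, then take the median of the rest.
imputed = fares[fares > 0].median()
print(imputed)  # 30.5
```

Leaving the zero in would drag the median down to roughly 28.53, so the filter matters.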

Now for that group without fares for tickets beginning with ‘11205’.

As head of the White Star Line, Mr. Ismay may well have paid nothing for the trip. He apparently accompanied all his ships on their maiden voyages. As his valet, Mr. Fry may also not have paid any fare. Ditto for his secretary, William Henry Harrison. That said, I am going to add fares so the model treats their cases appropriately. But probably not as you expect. I want to make a clear distinction between Mr. Ismay and his staff, as I very much doubt the crew would have treated them equally.

To that end, I am going to assume Mr. Fry paid his own fare on a separate ticket (I will add an ‘A’ to the end of his ticket number), and that he would likely have paid one of the lower first class fares. Say, 30.00.

I am going to assume that Mr. Ismay, if travelling on his own penny, would travel top class and, as such, pay a fare commensurate with his social standing. So, 512.33 it is.

I expect Mr. Harrison’s cabin was somewhat cheaper. As he was in B94, I am going to use half the price charged for the ticket assigned cabins B96 and B98. That is, 60.00.

That only leaves two others, and since they have no cabin number I am going to charge them the same as Miss Graham, 30.00.
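Pulling those choices together, a sketch of the updates might look like the following. It is not the notebook's code. The frame is a stand-in combining rows from both datasets, and I have read "two others" as Mr. Parr and Mr. Chisholm, the two rows with no cabin; that reading is an assumption on my part.

```python
import pandas as pd

# Stand-in rows for the '11205..' zero-fare passengers (train and test combined).
df = pd.DataFrame({
    "PassengerId": [264, 634, 816, 1158, 1264],
    "Name": ["Harrison", "Parr", "Fry", "Chisholm", "Ismay"],
    "Ticket": ["112059", "112052", "112058", "112051", "112058"],
    "Fare": [0.0] * 5,
}).set_index("PassengerId")

# Fares chosen in the text above, keyed by PassengerId.
new_fares = {
    816: 30.00,    # Mr. Fry: one of the lower first class fares
    1264: 512.33,  # Mr. Ismay: top fare
    264: 60.00,    # Mr. Harrison: half the B96/B98 ticket price
    634: 30.00,    # assumed to be one of the "two others"
    1158: 30.00,   # assumed to be one of the "two others"
}
for pid, fare in new_fares.items():
    df.loc[pid, "Fare"] = fare

# Put Mr. Fry on his own ticket by appending 'A' to the number.
df.loc[816, "Ticket"] = df.loc[816, "Ticket"] + "A"
print(df)
```

Keying the changes by PassengerId keeps the sketch independent of row order, and Mr. Ismay stays on ticket ‘112058’ while Mr. Fry moves to ‘112058A’.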

Guess that’s that.

In [12]:
k_trn[(k_trn["Cabin"].str.startswith('B')) & (k_trn["Pclass"] == 1) & (k_trn["SibSp"] == 0) & (k_trn["Parch"] == 0)].sort_values(['Ticket', 'Cabin'])
Out[12]:
     PassengerId  Survived  Pclass  Name                                               Sex     Age    SibSp  Parch  Ticket    Fare    Cabin        Embarked
257  258          1         1       Cherry, Miss. Gladys                               female  30.00  0      0      110152    86.50   B77          S
759  760          1         1       Rothes, the Countess. of (Lucy Noel Martha Dye...  female  33.00  0      0      110152    86.50   B77          S
504  505          1         1       Maioni, Miss. Roberta                              female  16.00  0      0      110152    86.50   B79          S
170  171          0         1       Van der hoef, Mr. Wyckoff                          male    61.00  0      0      111240    33.50   B19          S
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.00  0      0      112053    30.00   B42          S
815  816          0         1       Fry, Mr. Richard                                   male    NaN    0      0      112058    0.00    B102         S
263  264          0         1       Harrison, Mr. William                              male    40.00  0      0      112059    0.00    B94          S
536  537          0         1       Butt, Major. Archibald Willingham                  male    45.00  0      0      113050    26.55   B38          S
61   62           1         1       Icard, Miss. Amelie                                female  38.00  0      0      113572    80.00   B28          S
829  830          1         1       Stone, Mrs. George Nelson (Martha Evelyn)          female  62.00  0      0      113572    80.00   B28          S
487  488          0         1       Kent, Mr. Edward Austin                            male    58.00  0      0      11771     29.70   B37          C
520  521          1         1       Perreault, Miss. Anne                              female  30.00  0      0      12749     93.50   B73          S
632  633          1         1       Stahelin-Maeglin, Dr. Max                          male    32.00  0      0      13214     30.50   B50          C
730  731          1         1       Allen, Miss. Elisabeth Walton                      female  29.00  0      0      24160     211.34  B5           S
872  873          0         1       Carlsson, Mr. Frans Olof                           male    33.00  0      0      695       5.00    B51 B53 B55  S
369  370          1         1       Aubart, Mme. Leontine Pauline                      female  24.00  0      0      PC 17477  69.30   B35          C
641  642          1         1       Sagesser, Mlle. Emma                               female  24.00  0      0      PC 17477  69.30   B35          C
195  196          1         1       Lurette, Miss. Elise                               female  58.00  0      0      PC 17569  146.52  B80          C
789  790          0         1       Guggenheim, Mr. Benjamin                           male    46.00  0      0      PC 17593  79.20   B82 B84      C
139  140          0         1       Giglio, Mr. Victor                                 male    24.00  0      0      PC 17593  79.20   B86          C
194  195          1         1       Brown, Mrs. James Joseph (Margaret Tobin)          female  44.00  0      0      PC 17610  27.72   B4           C
737  738          1         1       Lesurer, Mr. Gustave J                             male    35.00  0      0      PC 17755  512.33  B101         C

I won’t bother including the code to update and save the datasets. See the notebook.

Check Score

I decided to see if adding “Fare” as a feature to our very first modelling attempt would improve our score.

In [23]:
# let's see if adding Fare to our basic set of features improves our score
# imports this cell relies on
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Y = k_trn['Survived']

features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
# one-hot encode the categorical 'Sex' column
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
accuracy_score(k_tst["Survived"], predictions)

Out[23]:
RandomForestClassifier(max_depth=5, random_state=1)
Out[23]:
0.777511961722488

Our first model score was 0.7751196172248804. So adding Fare to the feature set appears to have helped the model make slightly better predictions.
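To put the size of that improvement in perspective, the two scores can be turned back into counts of correct predictions. This little sketch assumes the test set has the usual 418 rows of the Kaggle Titanic test file; both scores do come out as whole numbers of passengers under that assumption.

```python
# Scores reported above, and the assumed size of the test set.
score_with_fare = 0.777511961722488
score_first_model = 0.7751196172248804
n_test = 418  # assumption: rows in the Kaggle Titanic test set

# Accuracy times the number of rows gives the count of correct predictions.
correct_with_fare = round(score_with_fare * n_test)
correct_first = round(score_first_model * n_test)
print(correct_with_fare, correct_first, correct_with_fare - correct_first)
```

So the gain is a single additional correct prediction, which is worth keeping in mind before reading too much into it.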

Done

I think that’s it for this one. Looks like I can get back to some possible feature engineering in the next post.

Feel free to download and play with my version of this post’s related notebook.

Resources