When starting what I thought would be my next post, on feature engineering, I discovered there were still some missing Fare values in the training/test datasets. Here’s how that came to be.
Additional Missing Data
I was looking at adding a feature for group size. My ramblings were going like this:
The above will be removed from that draft post on Feature Engineering.
And, I couldn’t just leave it alone. So, I am doing another post on sorting out missing data before moving on to some Feature Engineering. Sorry!
Some Investigating
Okay, let’s have a look at what appears to be missing and how we might impute the iffy values. (Note: as has been the case of late, I am not showing all the preparatory code cells. See the notebook.)
# load the datasets currently of interest
# (oma_trn and oma_tst are the CSV paths defined in the preparatory cells)
k_trn = pd.read_csv(oma_trn)
k_tst = pd.read_csv(oma_tst)
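Before digging in, a quick tally along these lines (a sketch, not a cell from the notebook) shows where the dodgy Fare values live in each dataset:

# sketch: count missing and zero Fare values in each dataset
for name, df in [('train', k_trn), ('test', k_tst)]:
    print(name, '-> missing Fare:', df['Fare'].isnull().sum(),
          ', zero Fare:', df['Fare'].eq(0).sum())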
Storey and Company
Let’s start with a look at that group of sailors Encyclopedia Titanica tied to Mr. Storey. For now, please ignore the imputed Fare we have for Mr. Storey, not to mention a name difference (or two) between the Encyclopedia Titanica article and the datasets.
k_trn[(k_trn['Ticket'] == 'LINE')]
k_tst[(k_tst['Ticket'] == 'LINE') | (k_tst['Ticket'] == '3701') | (k_tst['Ticket'] == '392095')]
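As an aside, the same filter can be written a touch more compactly with isin (see the resources below); purely a stylistic alternative:

# equivalent filter using Series.isin rather than chained '|' conditions
k_tst[k_tst['Ticket'].isin(['LINE', '3701', '392095'])]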
What I am most likely going to do is leave Mr. Carver’s information as is, since he has a different ticket number. For Mr. Storey and the others, I am going to give them all the ticket number shown in the Encyclopedia Titanica post, and a suitable multiple (5) of Mr. Carver’s fare. You will understand why I am doing the latter when I start looking at feature engineering.
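A rough sketch of that update (with assumptions flagged in the comments: I am taking ‘392095’ to be Mr. Carver’s ticket from the filter above, and ‘3701’ to be the group ticket number from the Encyclopedia Titanica article; the notebook has the real thing):

# sketch: give the 'LINE' group (and Mr. Storey) one ticket number and
# a fare of 5x Mr. Carver's fare, per the reasoning above
carver_fare = k_tst.loc[k_tst['Ticket'] == '392095', 'Fare'].iloc[0]  # assumption: Mr. Carver's ticket
group_ticket = '3701'  # assumption: the number shown in the Encyclopedia Titanica article
for df in (k_trn, k_tst):
    grp = df['Ticket'].isin(['LINE', group_ticket])
    df.loc[grp, 'Ticket'] = group_ticket
    df.loc[grp, 'Fare'] = 5 * carver_fare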
Zero Fares
Let’s get all the passengers with a zero fare value.
k_trn[(k_trn['Fare'] == 0.0)]
k_tst[(k_tst['Fare'] == 0.0)]
Experiment
Let’s do a little looking at similar cases to see if we can find any suitable replacement values.
We’ll begin with tickets containing the string “2398”. Though it might be better to look for tickets starting with that string.
k_trn[(k_trn["Ticket"].str.contains('2398'))]
Ticket ‘239865’ had two passengers at 26.00, or 13.00 each. I will use 13.00 as the fare for each of the other passengers above, with appropriate multiples for tickets carrying more than one passenger, e.g. ‘239853’.
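A sketch of that imputation (assuming 13.00 per person, scaled by the number of passengers sharing the ticket; repeat for k_tst if any turn up there):

# sketch: 13.00 per person for the zero-fare '2398..' tickets,
# multiplied by the number of passengers travelling on the same ticket
zero_2398 = k_trn['Fare'].eq(0) & k_trn['Ticket'].str.startswith('2398')
per_ticket = k_trn['Ticket'].value_counts()
k_trn.loc[zero_2398, 'Fare'] = 13.00 * k_trn.loc[zero_2398, 'Ticket'].map(per_ticket)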
Now let’s look at the ‘11205..’ ticket numbers, in both datasets.
k_trn[(k_trn["Ticket"].str.contains('11205'))]
k_tst[(k_tst["Ticket"].str.contains('11205'))]
That doesn’t really look too helpful. Let’s look at the last case and then come back to this one afterwards.
So, let’s have a look at Mr. Reuchlin’s ticket number.
k_trn[(k_trn["Ticket"].str.contains('1997'))]
Not very helpful. Let’s try extending the search somewhat.
# that didn't do much; let's try tickets starting with '199', restricted to solo 1st class passengers
k_trn[(k_trn["Ticket"].str.startswith('199')) & (k_trn["Pclass"] == 1) & (k_trn["SibSp"] == 0) & (k_trn["Parch"] == 0)]
That appeared to work. I’ll use the median value of 30.50 for Mr. Reuchlin.
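In code, that might look something like this (a sketch; I am picking out Mr. Reuchlin via the zero fare and the ‘1997’ ticket search used above):

# sketch: median fare for solo first class passengers on '199..' tickets,
# used to fill in Mr. Reuchlin's zero fare
solo_199 = k_trn[(k_trn['Ticket'].str.startswith('199'))
                 & (k_trn['Pclass'] == 1)
                 & (k_trn['SibSp'] == 0) & (k_trn['Parch'] == 0)]
median_fare = solo_199['Fare'].median()   # 30.50 per the discussion above
reuchlin = k_trn['Ticket'].str.contains('1997') & k_trn['Fare'].eq(0)
k_trn.loc[reuchlin, 'Fare'] = median_fare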
Now for that group without fares on tickets beginning with ‘11205’.
Mr. Ismay may in fact have paid nothing for the trip. As head of the White Star Line, he apparently accompanied all his ships on their maiden voyages, and I expect he did not pay a fare. And, as his valet, Mr. Fry may also not have paid any fare. Ditto for his secretary, William Henry Harrison. That said, I am going to add fares so the model treats their cases appropriately. But probably not as you might expect: I want to make a clear distinction between Mr. Ismay and his staff. I very much doubt the crew would have treated them equally.
To that end, I am going to assume Mr. Fry paid his own fare on a separate ticket (I will add an ‘A’ to the end of his ticket number), and that he would likely have paid one of the lower first class fares. Say, 30.00.
I am going to assume that Mr. Ismay, if travelling on his own penny, would travel top class and, as such, pay a fare commensurate with his social standing. So, 512.33 it is.
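If you’re wondering where 512.33 comes from, it matches (rounded) the highest fare in the training data; a quick check, assuming that’s the idea:

# sketch: the top fare anyone paid, which rounds to 512.33
k_trn['Fare'].max()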
I expect Mr. Harrison’s cabin was somewhat cheaper. As he was in B94, I am going to use half the price charged for the ticket assigned to cabins B96 and B98. That is, 60.00.
That only leaves two others, and since they have no cabin number I am going to charge them the same as Miss Graham, 30.00.
Guess that’s that.
k_trn[(k_trn["Cabin"].str.startswith('B')) & (k_trn["Pclass"] == 1) & (k_trn["SibSp"] == 0) & (k_trn["Parch"] == 0)].sort_values(['Ticket', 'Cabin'])
I won’t bother including the code to update and save the datasets. See the notebook.
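For the curious, though, a rough sketch of what those updates might look like (the notebook has the real thing; the name-based matching and the ‘A’ suffix for Mr. Fry’s ticket are my assumptions from the discussion above):

# sketch: fill in the remaining zero fares per the reasoning above
def set_fare(df, name_part, fare):
    mask = df['Name'].str.contains(name_part) & df['Fare'].eq(0)
    df.loc[mask, 'Fare'] = fare
    return mask

for df in (k_trn, k_tst):
    set_fare(df, 'Ismay', 512.33)
    fry = set_fare(df, 'Fry', 30.00)
    df.loc[fry, 'Ticket'] = df.loc[fry, 'Ticket'] + 'A'   # the valet's own ticket
    set_fare(df, 'Harrison', 60.00)
    # the remaining '11205..' zero fares get Miss Graham's 30.00
    rest = df['Fare'].eq(0) & df['Ticket'].str.startswith('11205')
    df.loc[rest, 'Fare'] = 30.00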
Check Score
I decided to see if adding “Fare” as a feature to our very first modelling attempt would improve our score.
# let's see if adding Fare to our basic set of features improves our score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Y = k_trn['Survived']
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
accuracy_score(k_tst["Survived"], predictions)
Our first model score was 0.7751196172248804. So adding the Fare to the feature set appears to have helped the model make better predictions.
Done
I think that’s it for this one. Looks like I can get back to some possible feature engineering in the next post.
Feel free to download and play with my version of this post’s related notebook.
Resources
- pandas.DataFrame.isin
- pandas.isnull
- Select rows without NaN values
- Pandas filter string data based on its string length
- 6 ways to Sort Pandas Dataframe
- Image Titanic Cabins by Class