When starting what I thought would be my next post, on feature engineering, I discovered there were still some missing Fare values in the training/test datasets. Here’s how that came to be.
Additional Missing Data
I was looking at adding a feature for group size. My ramblings were going like this:
The above will be removed from that draft post on Feature Engineering.
And, I couldn’t just leave it alone. So, I am doing another post on sorting out missing data before moving on to some Feature Engineering. Sorry!
Some Investigating
Okay, let’s have a look at what appears to be missing and how we might impute the iffy values. (Note: as has been the case of late, I am not showing all the preparatory code cells. See the notebook.)
# load the datasets currently of interest
# (oma_trn and oma_tst are the CSV paths defined in the preparatory cells)
k_trn = pd.read_csv(oma_trn)
k_tst = pd.read_csv(oma_tst)
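Before digging in, a quick tally along these lines (a sketch, not a cell from the notebook) shows where the dodgy Fare values live in each dataset:

# sketch: count missing and zero Fare values in each dataset
for name, df in [('train', k_trn), ('test', k_tst)]:
    print(name, '-> missing Fare:', df['Fare'].isnull().sum(),
          ', zero Fare:', df['Fare'].eq(0).sum())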
Storey and Company
Let’s start with a look at that group of sailors Encyclopedia Titanica tied to Mr. Storey. For now, please ignore the imputed Fare we have for Mr. Storey, not to mention a name difference (or two) between the Encyclopedia Titanica article and the datasets.
k_trn[(k_trn['Ticket'] == 'LINE')]
k_tst[(k_tst['Ticket'] == 'LINE') | (k_tst['Ticket'] == '3701') | (k_tst['Ticket'] == '392095')]
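As an aside, the same filter can be written a touch more compactly with isin (see the resources below); purely a stylistic alternative:

# equivalent filter using Series.isin rather than chained '|' conditions
k_tst[k_tst['Ticket'].isin(['LINE', '3701', '392095'])]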
What I am most likely going to do is leave Mr. Carver’s information as is, since he has a different ticket number. For Mr. Storey and the others, I am going to give them all the ticket number shown in the Encyclopedia Titanica post, and a suitable multiple (5) of Mr. Carver’s fare. You will understand why I am doing the latter when I start looking at feature engineering.
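A rough sketch of that update (with assumptions flagged in the comments: I am taking ‘392095’ to be Mr. Carver’s ticket from the filter above, and ‘3701’ to be the group ticket number from the Encyclopedia Titanica article; the notebook has the real thing):

# sketch: give the 'LINE' group (and Mr. Storey) one ticket number and
# a fare of 5x Mr. Carver's fare, per the reasoning above
carver_fare = k_tst.loc[k_tst['Ticket'] == '392095', 'Fare'].iloc[0]  # assumption: Mr. Carver's ticket
group_ticket = '3701'  # assumption: the number shown in the Encyclopedia Titanica article
for df in (k_trn, k_tst):
    grp = df['Ticket'].isin(['LINE', group_ticket])
    df.loc[grp, 'Ticket'] = group_ticket
    df.loc[grp, 'Fare'] = 5 * carver_fare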
Zero Fares
Let’s get all the passengers with a zero fare value.
k_trn[(k_trn['Fare'] == 0.0)]
k_tst[(k_tst['Fare'] == 0.0)]
Experiment
Let’s do a little looking at similar cases to see if we can find any suitable replacement values.
We’ll begin with tickets containing the string “2398”. Though it might be better to look for tickets starting with that string.
k_trn[(k_trn["Ticket"].str.contains('2398'))]
Ticket ‘239865’ had two passengers at 26.00, or 13.00 each. I will use 13.00 as the fare for each of the other passengers above, with appropriate multiples for tickets carrying more than one passenger, e.g. ‘239853’.
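A sketch of that imputation (assuming 13.00 per person, scaled by the number of passengers sharing the ticket; repeat for k_tst if any turn up there):

# sketch: 13.00 per person for the zero-fare '2398..' tickets,
# multiplied by the number of passengers travelling on the same ticket
zero_2398 = k_trn['Fare'].eq(0) & k_trn['Ticket'].str.startswith('2398')
per_ticket = k_trn['Ticket'].value_counts()
k_trn.loc[zero_2398, 'Fare'] = 13.00 * k_trn.loc[zero_2398, 'Ticket'].map(per_ticket)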
Now let’s look at the ‘11205..’ ticket numbers, in both datasets.
k_trn[(k_trn["Ticket"].str.contains('11205'))]
k_tst[(k_tst["Ticket"].str.contains('11205'))]
That doesn’t really look too helpful. Let’s look at the last case and then come back to this one afterwards.
So, let’s have a look at Mr. Reuchlin’s ticket number.
k_trn[(k_trn["Ticket"].str.contains('1997'))]
Not very helpful. Let’s try extending the search somewhat.
# that didn't do much; let's try tickets starting with '199', restricted to solo 1st class passengers
k_trn[(k_trn["Ticket"].str.startswith('199')) & (k_trn["Pclass"] == 1) & (k_trn["SibSp"] == 0) & (k_trn["Parch"] == 0)]
That appeared to work. I’ll use the median value of 30.50 for Mr. Reuchlin.
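In code, that might look something like this (a sketch; I am picking out Mr. Reuchlin via the zero fare and the ‘1997’ ticket search used above):

# sketch: median fare for solo first class passengers on '199..' tickets,
# used to fill in Mr. Reuchlin's zero fare
solo_199 = k_trn[(k_trn['Ticket'].str.startswith('199'))
                 & (k_trn['Pclass'] == 1)
                 & (k_trn['SibSp'] == 0) & (k_trn['Parch'] == 0)]
median_fare = solo_199['Fare'].median()   # 30.50 per the discussion above
reuchlin = k_trn['Ticket'].str.contains('1997') & k_trn['Fare'].eq(0)
k_trn.loc[reuchlin, 'Fare'] = median_fare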
Now for that group without fares on tickets beginning with ‘11205’.
Mr. Ismay may in fact have paid nothing for the trip. As head of the White Star Line, he apparently accompanied all his ships on their maiden voyages, and I expect he did not pay a fare. And, as his valet, Mr. Fry may also not have paid any fare. Ditto for his secretary, William Henry Harrison. That said, I am going to add fares so the model treats their cases appropriately. But probably not as you might expect: I want to make a clear distinction between Mr. Ismay and his staff. I very much doubt the crew would have treated them equally.
To that end, I am going to assume Mr. Fry paid his own fare on a separate ticket (I will add an ‘A’ to the end of his ticket number), and that he would likely have paid one of the lower first class fares. Say, 30.00.
I am going to assume that Mr. Ismay, if travelling on his own penny, would travel top class and, as such, pay a fare commensurate with his social standing. So, 512.33 it is.
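If you’re wondering where 512.33 comes from, it matches (rounded) the highest fare in the training data; a quick check, assuming that’s the idea:

# sketch: the top fare anyone paid, which rounds to 512.33
k_trn['Fare'].max()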
I expect Mr. Harrison’s cabin was somewhat cheaper. As he was in B94, I am going to use half the price charged for the ticket assigned to cabins B96 and B98. That is, 60.00.
That only leaves two others, and since they have no cabin number I am going to charge them the same as Miss Graham, 30.00.
Guess that’s that.
k_trn[(k_trn["Cabin"].str.startswith('B')) & (k_trn["Pclass"] == 1) & (k_trn["SibSp"] == 0) & (k_trn["Parch"] == 0)].sort_values(['Ticket', 'Cabin'])
I won’t bother including the code to update and save the datasets. See the notebook.
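For the curious, though, a rough sketch of what those updates might look like (the notebook has the real thing; the name-based matching and the ‘A’ suffix for Mr. Fry’s ticket are my assumptions from the discussion above):

# sketch: fill in the remaining zero fares per the reasoning above
def set_fare(df, name_part, fare):
    mask = df['Name'].str.contains(name_part) & df['Fare'].eq(0)
    df.loc[mask, 'Fare'] = fare
    return mask

for df in (k_trn, k_tst):
    set_fare(df, 'Ismay', 512.33)
    fry = set_fare(df, 'Fry', 30.00)
    df.loc[fry, 'Ticket'] = df.loc[fry, 'Ticket'] + 'A'   # the valet's own ticket
    set_fare(df, 'Harrison', 60.00)
    # the remaining '11205..' zero fares get Miss Graham's 30.00
    rest = df['Fare'].eq(0) & df['Ticket'].str.startswith('11205')
    df.loc[rest, 'Fare'] = 30.00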
Check Score
I decided to see if adding “Fare” as a feature to our very first modelling attempt would improve our score.
# let's see if adding Fare to our basic set of features improves our score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Y = k_trn['Survived']
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
accuracy_score(k_tst["Survived"], predictions)
Our first model score was 0.7751196172248804. So adding the Fare to the feature set appears to have helped the model make better predictions.
Done
I think that’s it for this one. Looks like I can get back to some possible feature engineering in the next post.
Feel free to download and play with my version of this post’s related notebook.
Resources
- pandas.DataFrame.isin
- pandas.isnull
- Select rows without NaN values
- Pandas filter string data based on its string length
- 6 ways to Sort Pandas Dataframe
- Image Titanic Cabins by Class