Why?
When a beginner first signs up with Kaggle, it is recommended they go through the process of building a model for the Titanic dataset and submitting a solution for “grading”. It is, apparently, a common way to get started.
The idea is that after that initial workthrough of the Kaggle process, you would go on to try and improve the success of your classification by changing models, using feature selection, creating new features, etc.
This, of course, involves submitting your predictions against the test dataset to Kaggle for scoring every time you make changes to your model.
As I never got around to trying to improve on my initial result, I plan to go through that whole process in a series of blog posts, measuring improvement, or the lack thereof, as I go along. Now, I don’t really want to make a submission on Kaggle with each post to get an updated score. So, I need/want to create my own test set with the appropriate targets.
Target Data
I still want to use the Kaggle datasets. That way I can always make a submission if so inclined, with a reasonable idea of what it will score. So, I am going to try and create a CSV file with the targets that match the entries in the Kaggle test dataset. Expect that will be easier said than done. I did a bit of searching and couldn’t find any such target data or a Kaggle test dataset with the targets included.
Data Sources
It’s easy enough to get the Kaggle datasets. On the Titanic - Machine Learning from Disaster competition page, select the Data tab and go from there.
I also found two more datasets that I thought would help me get the job done: a full Titanic dataset hosted on OSF and the phpMYEkMl.csv dataset (both linked in the Resources section at the end of this post).
Generate target.csv
Setup Notebook
Ok, let’s set up a Jupyter notebook. Start with the usuals.
from IPython.core.interactiveshell import InteractiveShell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
# set up some notebook display defaults
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.style.use('default')
sns.set()
pd.options.display.float_format = '{:,.2f}'.format
Now let’s define some variables with the paths to the dataset files of interest. (This cell, like the preceding ones, will usually be copied into future notebooks.)
# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
kaggle_trg = "./data/titanic/target.csv"
osf_full = "./data/titanic/osf_titanic.csv"
MYEkMl_full = "./data/titanic/phpMYEkMl.csv"
And, load the three we are currently most interested in.
# load the three datasets of interest
k_tst = pd.read_csv(kaggle_tst)
osf_f = pd.read_csv(osf_full)
ekml_f = pd.read_csv(MYEkMl_full)
And, of course, it always pays to have a look at what we are dealing with. Notice the differences in column-name case between the datasets.
# have a quick look at each of them
k_tst.head()
osf_f.head()
ekml_f.head()
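As an aside, one way to avoid juggling “Name” versus “name” would be to normalize the column names up front. I don’t do that in this post (the remaining cells assume the original column names), but a minimal sketch would look like this:
# just a sketch, not used below: lower-case all column names so the
# three datasets use the same case (later cells here assume the originals)
for df in (k_tst, osf_f, ekml_f):
    df.columns = df.columns.str.lower()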
Working with the Data
Decided to do a little testing with the datasets before I got too carried away writing the code to produce the target data file.
To start, I took the first row in the Kaggle test dataset, got the traveller’s name, and searched the other two datasets to see if I could find that name.
# some testing
print(f"Test 1:\n______\n")
tst_nm = k_tst.loc[0, "Name"]
osf_srvv = osf_f[osf_f["Name"] == tst_nm]
print(f"osf matching entries ({tst_nm}):\n{osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm]
print(f"\nekml matching entries ({tst_nm}): {ekml_srvv}")
That didn’t go so well. I didn’t notice earlier, but the OSF dataset doesn’t have the periods after salutations and the like. So, let’s create a second name variable for that dataset and try again.
print(f"Test 1 (cont):\n______\n")
tst_nm = k_tst.loc[0, "Name"]
tst_nm_2 = tst_nm.replace(".", "")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2]
print(f"osf matching entries ({tst_nm_2}):\n{osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm]
print(f"\nekml matching entries ({tst_nm}): {ekml_srvv}")
And, something else I wasn’t expecting: two entries for each name found. And a number of discrepancies in the data. But at least we found the name in both of the datasets. As for the multiple entries, I decided to just take the mean of the survived values: if all are zeroes the mean will be zero, and if all are ones, the mean will be one. And, if both datasets agree, we are good to go. Let’s try that.
# some testing
print(f"Test 1 (cont):\n______\n")
tst_nm = k_tst.loc[0, "Name"]
tst_nm_2 = tst_nm.replace(".", "")
# print(f"tst_nm: {tst_nm}, tst_nm_2: {tst_nm_2}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2]
print(f"osf matching entries ({tst_nm_2}):\n{osf_srvv}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2].Survived
print(f"osf survived values for matching names\n{osf_srvv}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2].Survived.mean()
print(f"osf_srvv mean: {osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm].survived.mean()
print(f"ekml_srvv mean ({tst_nm}): {osf_srvv}")
That seems to work. Okay, let’s try the next name in test.csv.
print(f"Test 2:\n______\n")
tst_nm = k_tst.loc[1, "Name"]
tst_nm_2 = tst_nm.replace(".", "")
# print(f"tst_nm: {tst_nm}, tst_nm_2: {tst_nm_2}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2]
#print(f"osf matching entries:\n{osf_srvv}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2].Survived
print(f"osf survived values for matching names ({tst_nm_2}):\n{osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm]
# print(f"ekml matching entries:\n{osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm].survived
print(f"ekml survived values for matching names ({tst_nm}):\n{osf_srvv}")
Neither name was found. I did some looking; the person is in all three datasets, there are just some naming variations. I am too lazy (and don’t have enough time for this post) to try to sort out all the possible variations (there were a number of others, as you will see). Maybe at some later date.
I will simply identify these cases in the output file and deal with them manually.
There is also the possibility that a passenger will be found in only one of the two non-Kaggle datasets. I will have to deal with that as well; likely I will just use the one value we get.
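For what it’s worth, here is a rough idea of what sorting out those naming variations might eventually look like. It is only a sketch, not something used in this post: difflib.get_close_matches suggests a near-match when the exact lookup fails, the suggest_name helper is my own invention, and the cutoff value is a guess rather than anything tuned. Depending on how different the spellings are, it may or may not find the variant.
from difflib import get_close_matches

def suggest_name(name, candidates, cutoff=0.85):
    """Return the closest candidate name, if any, when an exact match fails."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# hypothetical usage with the second test passenger, who failed the exact match above
tst_nm = k_tst.loc[1, "Name"]
print(suggest_name(tst_nm.replace(".", ""), osf_f["Name"].tolist()))
print(suggest_name(tst_nm, ekml_f["name"].tolist()))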
Okay, on to the next traveller.
print(f"Test 3:\n______\n")
tst_nm = k_tst.loc[2, "Name"]
tst_nm_2 = tst_nm.replace(".", "")
# print(f"tst_nm: {tst_nm}, tst_nm_2: {tst_nm_2}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2].Survived
print(f"osf survival value ({tst_nm_2}):\n{osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm].survived
print(f"\nekml survival value ({tst_nm}):\n{ekml_srvv}")
Okay, when there is only one matching entry found, we are getting a Series, not the actual survival value. Series.item() should fix that for us. (It only works when the Series has exactly one element; more than one raises a ValueError, which is why the final code further down checks the length first.)
print(f"Test 3 (cont):\n______\n")
tst_nm = k_tst.loc[2, "Name"]
tst_nm_2 = tst_nm.replace(".", "")
# print(f"tst_nm: {tst_nm}, tst_nm_2: {tst_nm_2}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2].Survived.item()
print(f"osf survival value ({tst_nm_2}): {osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm].survived.item()
print(f"\nekml survival value ({tst_nm}): {ekml_srvv}")
Okay, that seems to work. But I dug around and found another potential case I will need to deal with. Let’s have a look.
print(f"Test 4:\n______\n")
tst_nm = k_tst.loc[39, "Name"]
tst_nm_2 = tst_nm.replace(".", "")
# print(f"tst_nm: {tst_nm}, tst_nm_2: {tst_nm_2}")
osf_srvv = osf_f[osf_f["Name"] == tst_nm_2].Survived.item()
print(f"osf survival value as number ({tst_nm_2}): {osf_srvv}")
ekml_srvv = ekml_f[ekml_f["name"] == tst_nm].survived.item()
print(f"\nekml survival value as number ({tst_nm}): {ekml_srvv}")
Guess I should have expected mismatches in the Survived values between the datasets, given there were mismatches in other data items. I am not going to try to resolve these in my code either; I expect that would be quite difficult. I will resolve them manually as well, by checking Encyclopedia Titanica.
Plan
Based on the above, the plan of attack is to go through the Kaggle test dataset line by line:
- locate the individual in the other two datasets
- get the survival value from both files for that individual
- if name not found in either dataset, write name to target.csv
- if found in both and survival values same, write to target.csv
- if one missing but other found, write the one result to file (this may be a risk)
- if survival values different, write name and note to file
The Code
Sorry, it is very sloppy. But, I am sure you can tidy it up if so desired.
# okay, not going to go through all the steps I went through to get a semblance of success
# create target.csv matching entries in kaggle test.csv
i = 0 # for testing only
with open(kaggle_trg, 'w') as trg_fh:
    for _, rw in k_tst.iterrows():
        # i += 1
        osf_fnd = True
        ekml_fnd = True
        tst_nm = rw.Name
        tst_nm = tst_nm.replace('"', '')
        tst_nm_2 = tst_nm.replace(".", "")
        # print(f"tst_nm: {tst_nm}, tst_nm_2: {tst_nm_2}")
        osf_srvv = osf_f[osf_f["Name"] == tst_nm_2].Survived
        # if i == 2:
        #     print(f"osf_srvv series ({tst_nm_2}):")
        #     display(osf_srvv)
        #     print(f"osf_srvv raw ({tst_nm_2}): {osf_srvv}")
        if len(osf_srvv) == 0:
            osf_fnd = False
            print(f"osf: {tst_nm_2} not found!")
        elif len(osf_srvv) == 1:
            osf_srvv = osf_srvv.item()
            # if i == 2:
            #     print(f"\tosf_srvv: {osf_srvv}")
        else:
            osf_srvv = osf_srvv.mean()
        ekml_srvv = ekml_f[ekml_f["name"] == tst_nm].survived
        # if i == 2:
        #     print(f"ekml_srvv series ({tst_nm}):")
        #     display(ekml_srvv)
        if len(ekml_srvv) == 0:
            ekml_fnd = False
            print(f"ekml: {tst_nm} not found!")
        elif len(ekml_srvv) == 1:
            ekml_srvv = ekml_srvv.item()
            # if i == 2:
            #     print(f"\tekml_srvv: {ekml_srvv}")
        else:
            ekml_srvv = ekml_srvv.mean()
        # print(f"osf_srvv ({tst_nm_2}): {osf_srvv}, ekml_srvv ({tst_nm}): {ekml_srvv}")
        if osf_fnd and ekml_fnd and osf_srvv == ekml_srvv:
            t_out = trg_fh.write(f"{int(ekml_srvv)}\n")
        elif osf_fnd and ekml_fnd and osf_srvv != ekml_srvv:
            t_out = trg_fh.write(f"{tst_nm} -> {osf_srvv} != {ekml_srvv}\n")
        elif osf_fnd and not ekml_fnd:
            t_out = trg_fh.write(f"{int(osf_srvv)}\n")
        elif ekml_fnd and not osf_fnd:
            t_out = trg_fh.write(f"{int(ekml_srvv)}\n")
        else:
            t_out = trg_fh.write(f"{tst_nm}\n")
        # if i == 5:
        #     break
Done, m’thinks
I had, before my testing, hoped to have a fully functional target.csv once things were coded and working. No such luck. And, given the list above, it is clearly going to take me a while to manually resolve the errors in target.csv. So, sorry, I can’t make the file available with this post. But I will do my best to make sure it is available to you when needed.
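If you are curious how much manual clean-up remains, a quick tally like the one below (just a sketch, reusing the kaggle_trg path defined earlier) counts the lines in target.csv that are not a plain 0 or 1:
# count how many lines in target.csv still need manual attention
with open(kaggle_trg) as trg_fh:
    lines = [ln.strip() for ln in trg_fh]
todo = [ln for ln in lines if ln not in ("0", "1")]
print(f"{len(todo)} of {len(lines)} entries need manual resolution")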
Feel free to download and play with my version of this post’s related notebook.
This was a bit of a rush. I am still deeply embroiled in that course I am working on. And still over a month to go. That said, I encountered something working on one of the projects that sort of left me flabbergasted. So, the next post will take a step sideways and look at what I encountered and the lesson I learned. Until then…
Resources
- Titanic.csv (Version: 1), uploaded 2018-02-08 09:17 AM by JASP
- Titanic.csv, uploaded 16-10-2017 by Joaquin Vanschoren
- Titanic: phpMYEkMl.csv
- Encyclopedia Titanica
- pandas.Series.item