I was just going to use the target dataset I created in the last post and move on. But, I decided, this is not just about data science but also about programming. So, I am going to look at building that target data one more time.
While manually fixing the issues I didn’t resolve in code the first time around, it became clear that the data in the phpMYEkMl.csv Titanic dataset was of better quality than the one I downloaded from OSF. In fact, for those entries where there was a discrepancy in survival status the former was always correct.
So, I am going to look at reworking the code using only the phpMYEkMl.csv dataset to look up the survival status to match the passengers in the Kaggle test.csv dataset. I am also going to try and resolve more of the cases where a passenger’s name can not be found in the phpMYEkMl.csv dataset. May not have much success but…
I ran into a problem, after the first post in this series, when concatenating the survival status with the test.csv dataset. I hadn’t specified that there was no header row when I loaded the CSV file. Instead, I added a header row to the file manually. I will do that in the code this time.
I am also going to modify the format of the output csv file, just for my own interest. And, to perhaps, confirm that the survival status assignments are in fact assigned to the right individual in the modified Kaggle test.csv dataset I will build.
I will skip showing the workbook set up.
Refactor
Some file paths that may prove useful. A few of these below do not yet exist. They will, so I have added them here to save adding them elsewhere in the post.
# paths to datasets, current and future
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
kaggle_tst_2 = "./data/titanic/test_2.csv"
rek_k_tst = "./data/titanic/rek_test.csv"
rek_k_tst2 = "./data/titanic/rek_test_2.csv"
kaggle_trg = "./data/titanic/target.csv"
kaggle_trg_2 = "./data/titanic/target_2.csv"
osf_full = "./data/titanic/osf_titanic.csv"
MYEkMl_full = "./data/titanic/phpMYEkMl.csv"
Testing and Fixing
Issue 1
Let’s first do some more testing. The first case where I didn’t find a match in either file was for Assaf Khalil, Mrs. Mariana. There is in fact more to the name than shown.
# load the datasets of interest
k_tst = pd.read_csv(kaggle_tst)
ekml_f = pd.read_csv(MYEkMl_full)
# some more testing
# previous time: ekml: Assaf Khalil, Mrs. Mariana (Miriam) not found!
k_pass = k_tst[k_tst["Name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass)
Nothing found? Let’s try something else.
k_pass_2 = k_tst[k_tst["Name"].str.startswith("Assaf Khalil, Mrs. Mariana")]
print(k_pass_2)
Turns out the name in k_tst
(i.e. Kaggle’s test.csv
) looks like: Assaf Khalil, Mrs. Mariana (Miriam"")""
. And, when loaded into a DataFrame, the double quotes are converted to single quotes. There were ~30 other instances of the unnecessary quotes in the file. I also found a couple of samples that had trailing spaces in the name field. So, used text editor to remove them (i.e. replaced with empty string). In a new file, test_2.csv, of course.
Let’s continue looking at the prior problem names.
# okay, let's load the revised test dataset
k_tst2 = pd.read_csv(kaggle_tst_2)
k_pass_3 = k_tst2[k_tst2["Name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass_3)
# can we find it in ekml
k_pass_4 = ekml_f[ekml_f["name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass_4)
# okay why not
k_pass_5 = ekml_f[ekml_f["name"].str.startswith("Assaf Khalil, Mrs. Mariana")]
print(k_pass_5)
# okay let's look at the next one not found last time
# Johnston, Mrs. Andrew G (Elizabeth Lily Watson)
k_pass_6 = ekml_f[ekml_f["name"].str.startswith("Johnston, Mrs. Andrew G")]
print(k_pass_6)
# and the third
# Katavelas, Mr. Vassilios (Catavelas Vassilios)
k_pass_7 = ekml_f[ekml_f["name"].str.startswith("Katavelas, Mr. Vassilios")]
print(k_pass_7)
# and the fourth
# Coutts, Mrs. William (Winnie Minnie Treanor)
k_pass_8 = ekml_f[ekml_f["name"].str.startswith("Coutts, Mrs. William")]
print(k_pass_8)
I think I am seeing a trend. So let’s start coding — slow and easy.
# going to store passenger name along with survival status so can check I got things right
# we will start slow
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
t_out = trg_fh.write("Survived,TName\n")
for _, rw in k_tst2.iterrows():
tst_nm = rw.Name
ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
if len(ekml_rw) > 1:
print(f"\n{tst_nm} -> found: {len(ekml_rw)} entry/entries")
print(ekml_rw)
elif len(ekml_rw == 1):
pass
else:
print(f"\n{tst_nm} -> not found")
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">25</span><span class="p">:</span>
<span class="k">break</span>
Well, definitely still some problems.
Issue 2
The first/next problem is a little different: duplicate
names in different samples. You may have noticed the ticket number is different for each one. So, let’s see if we can find the right ticket number in the samples returned by searching on the name. If we can, we’ll save the related info to our target file. If not, we’ll identify an error.
# okay, let's deal with more than 1 name first
# we'll try using the ticket feature to select from the set of rows
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
t_out = trg_fh.write("Survived,TName\n")
for _, rw in k_tst2.iterrows():
tst_nm = rw.Name
ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
if len(ekml_rw) > 1:
# print(f"\n{tst_nm} -> found: {len(ekml_rw)} entry/entries")
# print(ekml_rw)
tkt = rw.Ticket
ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
if len(ekml_tkt) == 1:
# print(f"want ticket {tkt} -> found:\n{ekml_tkt}")
ekml_srvv = ekml_tkt.survived.item()
ekml_nm = ekml_tkt.name.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"ekml: {tst_nm} w/ticket {tkt} not found")
t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
elif len(ekml_rw == 1):
ekml_nm = ekml_rw.name.item()
ekml_srvv = ekml_rw.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
elif len(ekml_rw) == 0:
print(f"\n{tst_nm} -> not found")
t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
else:
print(f"\n{tst_nm} -> ! else: !, else what")
t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">25</span><span class="p">:</span>
<span class="k">break</span>
And, at least for our test of 25 samples, that issue seems to have been resolved.
Issue 3
On to the problem pattern we identified earlier. That use of single quotes in the phpMYEkMl.csv
file. Those quotes appear to be limited to the bracketed sections in passenger names. So let’s remove those from the test name, and try again. If we find something great, if not we’ll flag it as a problem. Note, we will need to change the way we search for names in the phpMYEkMl.csv
file.
# now let's deal with those name fields with '()' in them
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
t_out = trg_fh.write("Survived,TName\n")
for _, rw in k_tst2.iterrows():
tst_nm = rw.Name
ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
if len(ekml_rw) > 1:
tkt = rw.Ticket
ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
if len(ekml_tkt) == 1:
ekml_srvv = ekml_tkt.survived.item()
ekml_nm = ekml_tkt.name.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"ekml: {tst_nm} w/ticket {tkt} not found")
t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
elif len(ekml_rw == 1):
ekml_nm = ekml_rw.name.item()
ekml_srvv = ekml_rw.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
elif len(ekml_rw) == 0:
if "(" in tst_nm:
print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
print(f"\ttry {t_nm2}")
ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
if len(ekml_re) == 1:
ekml_nm = ekml_re.name.item()
ekml_srvv = ekml_re.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
else:
print(f"\n{tst_nm} -> not found")
t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
else:
print(f"\n{tst_nm} -> ! else: !, else what")
t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">25</span><span class="p">:</span>
<span class="k">break</span>
Look’s like our test of 25 samples no longer has any errors. So, let’s remove some print statements and give it a shot using the whole test.csv
dataset.
Issue 4
# now let's go for it
with open(kaggle_trg_2, 'w') as trg_fh:
t_out = trg_fh.write("Survived,TName\n")
for _, rw in k_tst2.iterrows():
tst_nm = rw.Name
ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
if len(ekml_rw) > 1:
tkt = rw.Ticket
ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
if len(ekml_tkt) == 1:
ekml_srvv = ekml_tkt.survived.item()
ekml_nm = ekml_tkt.name.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"ekml: {tst_nm} w/ticket {tkt} not found")
t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
elif len(ekml_rw == 1):
ekml_nm = ekml_rw.name.item()
ekml_srvv = ekml_rw.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
elif len(ekml_rw) == 0:
if "(" in tst_nm:
# print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
# print(f"\ttry {t_nm2}")
ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
if len(ekml_re) == 1:
ekml_nm = ekml_re.name.item()
ekml_srvv = ekml_re.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
t_out = trg_fh.write(f'?,"{tst_nm} -> {t_nm2} NOT FOUND"\n')
else:
print(f"\n{tst_nm} -> not found")
t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
else:
print(f"\n{tst_nm} -> ! else: !, else what")
t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
Looks like we still have an issue or two. A little research identified another use of single quotes in the phpMYEkMl.csv
file.
2,1,“Hocking, Miss. Ellen ‘Nellie’”,“female”,20,2,1,“29105”,23,?,“S”,“4”,?,“Cornwall / Akron, OH”
So, let’s try removing anything in the name from the test dataset after the last space.
# another fix, but how
with open(kaggle_trg_2, 'w') as trg_fh:
t_out = trg_fh.write("Survived,TName\n")
for _, rw in k_tst2.iterrows():
tst_nm = rw.Name
ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
if len(ekml_rw) > 1:
tkt = rw.Ticket
ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
if len(ekml_tkt) == 1:
ekml_srvv = ekml_tkt.survived.item()
ekml_nm = ekml_tkt.name.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"ekml: {tst_nm} w/ticket {tkt} not found")
t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
elif len(ekml_rw == 1):
ekml_nm = ekml_rw.name.item()
ekml_srvv = ekml_rw.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
elif len(ekml_rw) == 0:
if "(" in tst_nm:
# print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
# print(f"\ttry {t_nm2}")
ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
if len(ekml_re) == 1:
ekml_nm = ekml_re.name.item()
ekml_srvv = ekml_re.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
t_out = trg_fh.write(f'?,"{tst_nm} -> {t_nm2} NOT FOUND"\n')
else:
# let's try to deal with those single quotes in the ekml file
nm_b4ls = tst_nm.rpartition(' ')[0]
ekml_b4ls = ekml_f[ekml_f["name"].str.contains(nm_b4ls)]
if len(ekml_b4ls) == 1:
ekml_nm = ekml_b4ls.name.item()
ekml_srvv = ekml_b4ls.survived.item()
t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
else:
print(f"\n{tst_nm} -> not found")
t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
else:
print(f"\n{tst_nm} -> ! else: !, else what")
t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
And, bingo, no problems flagged. Think we got it.
I realize it is very sloppy code (lot’s of repeated bits). But, as this is sort of a one off, I just wanted to get it working. Which it appears to be doing.
Finishing Touches
Let’s reload the dataset from our new, modified file.
# now a little checking
targ_2 = pd.read_csv(kaggle_trg_2)
targ_2.head()
Quick look.
Now, let’s concatenate the target data with the modified test_2.csv
(i.e. the modified Kaggle test.csv
).
rek_tst = pd.concat([k_tst2, targ_2], axis=1)
And let’s look at a few entries to cofirm things match (at least for those test samples).
rek_tst.head(10)
rek_tst.tail(10)
We best save that as well!
rek_tst.to_csv(rek_k_tst2, index=False)
Done
I think that makes for a decent day’s effort. So, that’s it for this one. Next post in the series we will use our updated rek_test_2.csv
to train and evaluate a model. The same model used in that first Kaggle training competition.
Feel free to download and play with my version of this post’s related notebook.
Do note: Before this post, I had planned to make the target dataset set available for download. Mainly because, after the first post in the series, I thought it would require some considerable manual effort to get right. With this post, that is no longer the case. So, if you would like to have the target dataset, you will have to use the code above, or the related notebook, to generate it yourself. After downloading the appropriate datasets of course.
Resources
- pandas.DataFrame.to_csv
- pandas.concat
- pandas.Series.str
- Indexing and selecting data
- How do I select a subset of a DataFrame?
- Python String Methods (search that list for
rpartition
and the like) - re — Regular expression operations
- Regular Expression HOWTO
- Split on last occurrence of delimiter