I was just going to use the target dataset I created in the last post and move on. But, I decided, this is not just about data science but also about programming. So, I am going to look at building that target data one more time.

While manually fixing the issues I didn’t resolve in code the first time around, it became clear that the data in the phpMYEkMl.csv Titanic dataset was of better quality than the one I downloaded from OSF. In fact, for those entries where there was a discrepancy in survival status the former was always correct.

So, I am going to look at reworking the code using only the phpMYEkMl.csv dataset to look up the survival status to match the passengers in the Kaggle test.csv dataset. I am also going to try and resolve more of the cases where a passenger’s name can not be found in the phpMYEkMl.csv dataset. May not have much success but…

I ran into a problem, after the first post in this series, when concatenating the survival status with the test.csv dataset. I hadn’t specified that there was no header row when I loaded the CSV file. Instead, I added a header row to the file manually. I will do that in the code this time.

I am also going to modify the format of the output csv file, just for my own interest. And, to perhaps, confirm that the survival status assignments are in fact assigned to the right individual in the modified Kaggle test.csv dataset I will build.

I will skip showing the workbook set up.

Refactor

Some file paths that may prove useful. A few of these below do not yet exist. They will, so I have added them here to save adding them elsewhere in the post.

In [3]:
# paths to datasets, current and future
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
kaggle_tst_2 = "./data/titanic/test_2.csv"
rek_k_tst = "./data/titanic/rek_test.csv"
rek_k_tst2 = "./data/titanic/rek_test_2.csv"
kaggle_trg = "./data/titanic/target.csv"
kaggle_trg_2 = "./data/titanic/target_2.csv"
osf_full = "./data/titanic/osf_titanic.csv"
MYEkMl_full = "./data/titanic/phpMYEkMl.csv"

Testing and Fixing

Issue 1

Let’s first do some more testing. The first case where I didn’t find a match in either file was for Assaf Khalil, Mrs. Mariana. There is in fact more to the name than shown.

In [4]:
# load the datasets of interest
k_tst = pd.read_csv(kaggle_tst)
ekml_f = pd.read_csv(MYEkMl_full)
In [5]:
# some more testing
# previous time: ekml: Assaf Khalil, Mrs. Mariana (Miriam) not found!

k_pass = k_tst[k_tst["Name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"] print(k_pass)

Empty DataFrame
Columns: [PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []

Nothing found? Let’s try something else.

In [6]:
k_pass_2 = k_tst[k_tst["Name"].str.startswith("Assaf Khalil, Mrs. Mariana")]
print(k_pass_2)
    PassengerId  Pclass                                   Name     Sex   Age  \
19          911       3  Assaf Khalil, Mrs. Mariana (Miriam")"  female 45.00   
SibSp  Parch Ticket  Fare Cabin Embarked  

19 0 0 2696 7.22 NaN C

Turns out the name in k_tst (i.e. Kaggle’s test.csv) looks like: Assaf Khalil, Mrs. Mariana (Miriam"")"". And, when loaded into a DataFrame, the double quotes are converted to single quotes. There were ~30 other instances of the unnecessary quotes in the file. I also found a couple of samples that had trailing spaces in the name field. So, used text editor to remove them (i.e. replaced with empty string). In a new file, test_2.csv, of course.

Let’s continue looking at the prior problem names.

In [7]:
# okay, let's load the revised test dataset
k_tst2 = pd.read_csv(kaggle_tst_2)
In [8]:
k_pass_3 = k_tst2[k_tst2["Name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass_3)
    PassengerId  Pclass                                 Name     Sex   Age  \
19          911       3  Assaf Khalil, Mrs. Mariana (Miriam)  female 45.00   
SibSp  Parch Ticket  Fare Cabin Embarked  

19 0 0 2696 7.22 NaN C

In [9]:
# can we find it in ekml
k_pass_4 = ekml_f[ekml_f["name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass_4)
Empty DataFrame
Columns: [pclass, survived, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked, boat, body, home.dest]
Index: []
In [10]:
# okay why not
k_pass_5 = ekml_f[ekml_f["name"].str.startswith("Assaf Khalil, Mrs. Mariana")]
print(k_pass_5)
     pclass  survived                                   name     sex age  \
647       3         1  Assaf Khalil, Mrs. Mariana ('Miriam')  female  45   
 sibsp  parch ticket   fare cabin embarked boat body   home.dest  

647 0 0 2696 7.225 ? C C ? Ottawa, ON

In [11]:
# okay let's look at the next one not found last time
# Johnston, Mrs. Andrew G (Elizabeth Lily Watson)
k_pass_6 = ekml_f[ekml_f["name"].str.startswith("Johnston, Mrs. Andrew G")]
print(k_pass_6)
     pclass  survived                                               name  \
903       3         0  Johnston, Mrs. Andrew G (Elizabeth 'Lily' Watson)   
    sex age  sibsp  parch      ticket   fare cabin embarked boat body  \

903 female ? 1 2 W./C. 6607 23.45 ? S ? ?

home.dest  

903 ?

In [12]:
# and the third
# Katavelas, Mr. Vassilios (Catavelas Vassilios)
k_pass_7 = ekml_f[ekml_f["name"].str.startswith("Katavelas, Mr. Vassilios")]
print(k_pass_7)
     pclass  survived                                              name   sex  \
919       3         0  Katavelas, Mr. Vassilios ('Catavelas Vassilios')  male   
  age  sibsp  parch ticket    fare cabin embarked boat body home.dest  

919 18.5 0 0 2682 7.2292 ? C ? 58 ?

In [13]:
# and the fourth
# Coutts, Mrs. William (Winnie Minnie Treanor)
k_pass_8 = ekml_f[ekml_f["name"].str.startswith("Coutts, Mrs. William")]
print(k_pass_8)
     pclass  survived                                            name     sex  \
735       3         1  Coutts, Mrs. William (Winnie 'Minnie' Treanor)  female   
age  sibsp  parch      ticket  fare cabin embarked boat body  \

735 36 0 2 C.A. 37671 15.9 ? S 2 ?

            home.dest  

735 England Brooklyn, NY

I think I am seeing a trend. So let’s start coding — slow and easy.

In [14]:
# going to store passenger name along with survival status so can check I got things right
# we will start slow
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      print(f"\n{tst_nm} -> found: {len(ekml_rw)} entry/entries")
      print(ekml_rw)
    elif len(ekml_rw == 1):
      pass
    else:
      print(f"\n{tst_nm} -> not found")
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">25</span><span class="p">:</span>
  <span class="k">break</span>

Kelly, Mr. James -> found: 2 entry/entries
     pclass  survived              name   sex   age  sibsp  parch  ticket  \
924       3         0  Kelly, Mr. James  male  34.5      0      0  330911   
925       3         0  Kelly, Mr. James  male    44      0      0  363592   
   fare cabin embarked boat body home.dest  

924 7.8292 ? Q ? 70 ?
925 8.05 ? S ? ? ?

Connolly, Miss. Kate -> found: 2 entry/entries pclass survived name sex age sibsp parch ticket
725 3 1 Connolly, Miss. Kate female 22 0 0 370373
726 3 0 Connolly, Miss. Kate female 30 0 0 330972

   fare cabin embarked boat body home.dest  

725 7.75 ? Q 13 ? Ireland
726 7.6292 ? Q ? ? Ireland

Assaf Khalil, Mrs. Mariana (Miriam) -> not found

Well, definitely still some problems.

Issue 2

The first/next problem is a little different: duplicate names in different samples. You may have noticed the ticket number is different for each one. So, let’s see if we can find the right ticket number in the samples returned by searching on the name. If we can, we’ll save the related info to our target file. If not, we’ll identify an error.

In [15]:
# okay, let's deal with more than 1 name first
# we'll try using the ticket feature to select from the set of rows
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      # print(f"\n{tst_nm} -> found: {len(ekml_rw)} entry/entries")
      # print(ekml_rw)
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        # print(f"want ticket {tkt} -> found:\n{ekml_tkt}")
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      print(f"\n{tst_nm} -> not found")
      t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">25</span><span class="p">:</span>
  <span class="k">break</span>

Assaf Khalil, Mrs. Mariana (Miriam) -> not found

And, at least for our test of 25 samples, that issue seems to have been resolved.

Issue 3

On to the problem pattern we identified earlier. That use of single quotes in the phpMYEkMl.csv file. Those quotes appear to be limited to the bracketed sections in passenger names. So let’s remove those from the test name, and try again. If we find something great, if not we’ll flag it as a problem. Note, we will need to change the way we search for names in the phpMYEkMl.csv file.

In [16]:
# now let's deal with those name fields with '()' in them
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      if "(" in tst_nm:
        print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
        t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
        print(f"\ttry {t_nm2}")
        ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
        if len(ekml_re) == 1:
          ekml_nm = ekml_re.name.item()
          ekml_srvv = ekml_re.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
          t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
      else:
        print(f"\n{tst_nm} -> not found")
        t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">25</span><span class="p">:</span>
  <span class="k">break</span>

Assaf Khalil, Mrs. Mariana (Miriam) contains brackets, let's remove that section and try again
	try Assaf Khalil, Mrs. Mariana 

Look’s like our test of 25 samples no longer has any errors. So, let’s remove some print statements and give it a shot using the whole test.csv dataset.

Issue 4

In [17]:
# now let's go for it
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      if "(" in tst_nm:
        # print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
        t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
        # print(f"\ttry {t_nm2}")
        ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
        if len(ekml_re) == 1:
          ekml_nm = ekml_re.name.item()
          ekml_srvv = ekml_re.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
          t_out = trg_fh.write(f'?,"{tst_nm} -> {t_nm2} NOT FOUND"\n')
      else:
        print(f"\n{tst_nm} -> not found")
        t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
Hocking, Miss. Ellen Nellie -> not found

Johnston, Master. William Arthur Willie -> not found

Daly, Miss. Margaret Marcella Maggie -> not found

McCarthy, Miss. Catherine Katie -> not found

Dean, Miss. Elizabeth Gladys Millvina -> not found

Riihivouri, Miss. Susanna Juhantytar Sanni -> not found

Cotterill, Mr. Henry Harry -> not found

Wheeler, Mr. Edwin Frederick -> not found

Riordan, Miss. Johanna Hannah -> not found

Looks like we still have an issue or two. A little research identified another use of single quotes in the phpMYEkMl.csv file.

2,1,“Hocking, Miss. Ellen ‘Nellie’”,“female”,20,2,1,“29105”,23,?,“S”,“4”,?,“Cornwall / Akron, OH”

So, let’s try removing anything in the name from the test dataset after the last space.

In [18]:
# another fix, but how
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      if "(" in tst_nm:
        # print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
        t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
        # print(f"\ttry {t_nm2}")
        ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
        if len(ekml_re) == 1:
          ekml_nm = ekml_re.name.item()
          ekml_srvv = ekml_re.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
          t_out = trg_fh.write(f'?,"{tst_nm} -> {t_nm2} NOT FOUND"\n')
      else:
        # let's try to deal with those single quotes in the ekml file
        nm_b4ls = tst_nm.rpartition(' ')[0]
        ekml_b4ls = ekml_f[ekml_f["name"].str.contains(nm_b4ls)]
        if len(ekml_b4ls) == 1:
          ekml_nm = ekml_b4ls.name.item()
          ekml_srvv = ekml_b4ls.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"\n{tst_nm} -> not found")
          t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')

And, bingo, no problems flagged. Think we got it.

I realize it is very sloppy code (lot’s of repeated bits). But, as this is sort of a one off, I just wanted to get it working. Which it appears to be doing.

Finishing Touches

Let’s reload the dataset from our new, modified file.

In [19]:
# now a little checking
targ_2 = pd.read_csv(kaggle_trg_2)
targ_2.head()

Quick look.

Out[19]:
SurvivedTName
00Kelly, Mr. James
11Wilkes, Mrs. James (Ellen Needs)
20Myles, Mr. Thomas Francis
30Wirz, Mr. Albert
41Hirvonen, Mrs. Alexander (Helga E Lindqvist)

Now, let’s concatenate the target data with the modified test_2.csv (i.e. the modified Kaggle test.csv).

In [20]:
rek_tst = pd.concat([k_tst2, targ_2], axis=1)

And let’s look at a few entries to cofirm things match (at least for those test samples).

In [21]:
rek_tst.head(10)
Out[21]:
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSurvivedTName
08923Kelly, Mr. Jamesmale34.50003309117.83NaNQ0Kelly, Mr. James
18933Wilkes, Mrs. James (Ellen Needs)female47.00103632727.00NaNS1Wilkes, Mrs. James (Ellen Needs)
28942Myles, Mr. Thomas Francismale62.00002402769.69NaNQ0Myles, Mr. Thomas Francis
38953Wirz, Mr. Albertmale27.00003151548.66NaNS0Wirz, Mr. Albert
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.0011310129812.29NaNS1Hirvonen, Mrs. Alexander (Helga E Lindqvist)
58973Svensson, Mr. Johan Cervinmale14.000075389.22NaNS1Svensson, Mr. Johan Cervin
68983Connolly, Miss. Katefemale30.00003309727.63NaNQ0Connolly, Miss. Kate
78992Caldwell, Mr. Albert Francismale26.001124873829.00NaNS1Caldwell, Mr. Albert Francis
89003Abrahim, Mrs. Joseph (Sophie Halaut Easu)female18.000026577.23NaNC1Abrahim, Mrs. Joseph (Sophie Halaut Easu)
99013Davies, Mr. John Samuelmale21.0020A/4 4887124.15NaNS0Davies, Mr. John Samuel
In [22]:
rek_tst.tail(10)
Out[22]:
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSurvivedTName
40813003Riordan, Miss. Johanna HannahfemaleNaN003349157.72NaNQ1Riordan, Miss. Johanna 'Hannah'
40913013Peacock, Miss. Treasteallfemale3.0011SOTON/O.Q. 310131513.78NaNS0Peacock, Miss. Treasteall
41013023Naughton, Miss. HannahfemaleNaN003652377.75NaNQ0Naughton, Miss. Hannah
41113031Minahan, Mrs. William Edward (Lillian E Thorpe)female37.00101992890.00C78Q1Minahan, Mrs. William Edward (Lillian E Thorpe)
41213043Henriksson, Miss. Jenny Lovisafemale28.00003470867.78NaNS0Henriksson, Miss. Jenny Lovisa
41313053Spector, Mr. WoolfmaleNaN00A.5. 32368.05NaNS0Spector, Mr. Woolf
41413061Oliva y Ocana, Dona. Ferminafemale39.0000PC 17758108.90C105C1Oliva y Ocana, Dona. Fermina
41513073Saether, Mr. Simon Sivertsenmale38.5000SOTON/O.Q. 31012627.25NaNS0Saether, Mr. Simon Sivertsen
41613083Ware, Mr. FrederickmaleNaN003593098.05NaNS0Ware, Mr. Frederick
41713093Peter, Master. Michael JmaleNaN11266822.36NaNC1Peter, Master. Michael J

We best save that as well!

In [23]:
rek_tst.to_csv(rek_k_tst2, index=False)

Done

I think that makes for a decent day’s effort. So, that’s it for this one. Next post in the series we will use our updated rek_test_2.csv to train and evaluate a model. The same model used in that first Kaggle training competition.

Feel free to download and play with my version of this post’s related notebook.

Do note: Before this post, I had planned to make the target dataset set available for download. Mainly because, after the first post in the series, I thought it would require some considerable manual effort to get right. With this post, that is no longer the case. So, if you would like to have the target dataset, you will have to use the code above, or the related notebook, to generate it yourself. After downloading the appropriate datasets of course.

Resources