I was just going to use the target dataset I created in the last post and move on. But, I decided, this is not just about data science but also about programming. So, I am going to look at building that target data one more time.

While manually fixing the issues I didn’t resolve in code the first time around, it became clear that the data in the phpMYEkMl.csv Titanic dataset was of better quality than the one I downloaded from OSF. In fact, for those entries where there was a discrepancy in survival status the former was always correct.

So, I am going to look at reworking the code using only the phpMYEkMl.csv dataset to look up the survival status to match the passengers in the Kaggle test.csv dataset. I am also going to try and resolve more of the cases where a passenger’s name can not be found in the phpMYEkMl.csv dataset. May not have much success but…

I ran into a problem, after the first post in this series, when concatenating the survival status with the test.csv dataset. I hadn’t specified that there was no header row when I loaded the CSV file. Instead, I added a header row to the file manually. I will do that in the code this time.

I am also going to modify the format of the output csv file, just for my own interest. And, to perhaps, confirm that the survival status assignments are in fact assigned to the right individual in the modified Kaggle test.csv dataset I will build.

I will skip showing the workbook set up.

Refactor

Some file paths that may prove useful. A few of these below do not yet exist. They will, so I have added them here to save adding them elsewhere in the post.

In [3]:

# paths to datasets, current and future
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
kaggle_tst_2 = "./data/titanic/test_2.csv"
rek_k_tst = "./data/titanic/rek_test.csv"
rek_k_tst2 = "./data/titanic/rek_test_2.csv"
kaggle_trg = "./data/titanic/target.csv"
kaggle_trg_2 = "./data/titanic/target_2.csv"
osf_full = "./data/titanic/osf_titanic.csv"
MYEkMl_full = "./data/titanic/phpMYEkMl.csv"

Testing and Fixing

Issue 1

Let’s first do some more testing. The first case where I didn’t find a match in either file was for Assaf Khalil, Mrs. Mariana. There is in fact more to the name than shown.

In [4]:

# load the datasets of interest
k_tst = pd.read_csv(kaggle_tst)
ekml_f = pd.read_csv(MYEkMl_full)

In [5]:

# some more testing
# previous time: ekml: Assaf Khalil, Mrs. Mariana (Miriam) not found!
k_pass = k_tst[k_tst["Name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass)

Empty DataFrame
Columns: [PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []

Nothing found? Let’s try something else.

In [6]:

k_pass_2 = k_tst[k_tst["Name"].str.startswith("Assaf Khalil, Mrs. Mariana")]
print(k_pass_2)

PassengerId Pclass Name Sex Age \ 19 911 3 Assaf Khalil, Mrs. Mariana (Miriam")" female 45.00

SibSp  Parch Ticket  Fare Cabin Embarked

19 0 0 2696 7.22 NaN C

Turns out the name in k_tst (i.e. Kaggle’s test.csv) looks like: Assaf Khalil, Mrs. Mariana (Miriam"")"". And, when loaded into a DataFrame, the double quotes are converted to single quotes. There were ~30 other instances of the unnecessary quotes in the file. I also found a couple of samples that had trailing spaces in the name field. So, used text editor to remove them (i.e. replaced with empty string). In a new file, test_2.csv, of course.

Let’s continue looking at the prior problem names.

In [7]:

# okay, let's load the revised test dataset
k_tst2 = pd.read_csv(kaggle_tst_2)

In [8]:

k_pass_3 = k_tst2[k_tst2["Name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass_3)

PassengerId Pclass Name Sex Age \ 19 911 3 Assaf Khalil, Mrs. Mariana (Miriam) female 45.00

SibSp  Parch Ticket  Fare Cabin Embarked

19 0 0 2696 7.22 NaN C

In [9]:

# can we find it in ekml
k_pass_4 = ekml_f[ekml_f["name"] == "Assaf Khalil, Mrs. Mariana (Miriam)"]
print(k_pass_4)

Empty DataFrame
Columns: [pclass, survived, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked, boat, body, home.dest]
Index: []

In [10]:

# okay why not
k_pass_5 = ekml_f[ekml_f["name"].str.startswith("Assaf Khalil, Mrs. Mariana")]
print(k_pass_5)

pclass survived name sex age \ 647 3 1 Assaf Khalil, Mrs. Mariana ('Miriam') female 45

 sibsp  parch ticket   fare cabin embarked boat body   home.dest

647 0 0 2696 7.225 ? C C ? Ottawa, ON

In [11]:

# okay let's look at the next one not found last time
# Johnston, Mrs. Andrew G (Elizabeth Lily Watson)
k_pass_6 = ekml_f[ekml_f["name"].str.startswith("Johnston, Mrs. Andrew G")]
print(k_pass_6)

pclass survived name \ 903 3 0 Johnston, Mrs. Andrew G (Elizabeth 'Lily' Watson)

    sex age  sibsp  parch      ticket   fare cabin embarked boat body  \

903 female ? 1 2 W./C. 6607 23.45 ? S ? ?

home.dest

903 ?

In [12]:

# and the third
# Katavelas, Mr. Vassilios (Catavelas Vassilios)
k_pass_7 = ekml_f[ekml_f["name"].str.startswith("Katavelas, Mr. Vassilios")]
print(k_pass_7)

pclass survived name sex \ 919 3 0 Katavelas, Mr. Vassilios ('Catavelas Vassilios') male

  age  sibsp  parch ticket    fare cabin embarked boat body home.dest

919 18.5 0 0 2682 7.2292 ? C ? 58 ?

In [13]:

# and the fourth
# Coutts, Mrs. William (Winnie Minnie Treanor)
k_pass_8 = ekml_f[ekml_f["name"].str.startswith("Coutts, Mrs. William")]
print(k_pass_8)

pclass survived name sex \ 735 3 1 Coutts, Mrs. William (Winnie 'Minnie' Treanor) female

age  sibsp  parch      ticket  fare cabin embarked boat body  \

735 36 0 2 C.A. 37671 15.9 ? S 2 ?

            home.dest

735 England Brooklyn, NY

I think I am seeing a trend. So let’s start coding — slow and easy.

In [14]:

# going to store passenger name along with survival status so can check I got things right
# we will start slow
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      print(f"\n{tst_nm} -> found: {len(ekml_rw)} entry/entries")
      print(ekml_rw)
    elif len(ekml_rw == 1):
      pass
    else:
      print(f"\n{tst_nm} -> not found")
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">25</span><span class="p">:</span>
  <span class="k">break</span>

Kelly, Mr. James -> found: 2 entry/entries pclass survived name sex age sibsp parch ticket \ 924 3 0 Kelly, Mr. James male 34.5 0 0 330911 925 3 0 Kelly, Mr. James male 44 0 0 363592

   fare cabin embarked boat body home.dest

924 7.8292 ? Q ? 70 ?
925 8.05 ? S ? ? ?

Connolly, Miss. Kate -> found: 2 entry/entries pclass survived name sex age sibsp parch ticket
725 3 1 Connolly, Miss. Kate female 22 0 0 370373
726 3 0 Connolly, Miss. Kate female 30 0 0 330972

   fare cabin embarked boat body home.dest

725 7.75 ? Q 13 ? Ireland
726 7.6292 ? Q ? ? Ireland

Assaf Khalil, Mrs. Mariana (Miriam) -> not found

Well, definitely still some problems.

Issue 2

The first/next problem is a little different: duplicate names in different samples. You may have noticed the ticket number is different for each one. So, let’s see if we can find the right ticket number in the samples returned by searching on the name. If we can, we’ll save the related info to our target file. If not, we’ll identify an error.

In [15]:

# okay, let's deal with more than 1 name first
# we'll try using the ticket feature to select from the set of rows
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      # print(f"\n{tst_nm} -> found: {len(ekml_rw)} entry/entries")
      # print(ekml_rw)
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        # print(f"want ticket {tkt} -> found:\n{ekml_tkt}")
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      print(f"\n{tst_nm} -> not found")
      t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">25</span><span class="p">:</span>
  <span class="k">break</span>

Assaf Khalil, Mrs. Mariana (Miriam) -> not found

And, at least for our test of 25 samples, that issue seems to have been resolved.

Issue 3

On to the problem pattern we identified earlier. That use of single quotes in the phpMYEkMl.csv file. Those quotes appear to be limited to the bracketed sections in passenger names. So let’s remove those from the test name, and try again. If we find something great, if not we’ll flag it as a problem. Note, we will need to change the way we search for names in the phpMYEkMl.csv file.

In [16]:

# now let's deal with those name fields with '()' in them
i = 0 # for testing only
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      if "(" in tst_nm:
        print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
        t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
        print(f"\ttry {t_nm2}")
        ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
        if len(ekml_re) == 1:
          ekml_nm = ekml_re.name.item()
          ekml_srvv = ekml_re.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
          t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
      else:
        print(f"\n{tst_nm} -> not found")
        t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">25</span><span class="p">:</span>
  <span class="k">break</span>

Assaf Khalil, Mrs. Mariana (Miriam) contains brackets, let's remove that section and try again
	try Assaf Khalil, Mrs. Mariana

Look’s like our test of 25 samples no longer has any errors. So, let’s remove some print statements and give it a shot using the whole test.csv dataset.

Issue 4

In [17]:

# now let's go for it
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      if "(" in tst_nm:
        # print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
        t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
        # print(f"\ttry {t_nm2}")
        ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
        if len(ekml_re) == 1:
          ekml_nm = ekml_re.name.item()
          ekml_srvv = ekml_re.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
          t_out = trg_fh.write(f'?,"{tst_nm} -> {t_nm2} NOT FOUND"\n')
      else:
        print(f"\n{tst_nm} -> not found")
        t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')

Hocking, Miss. Ellen Nellie -> not found

Johnston, Master. William Arthur Willie -> not found

Daly, Miss. Margaret Marcella Maggie -> not found

McCarthy, Miss. Catherine Katie -> not found

Dean, Miss. Elizabeth Gladys Millvina -> not found

Riihivouri, Miss. Susanna Juhantytar Sanni -> not found

Cotterill, Mr. Henry Harry -> not found

Wheeler, Mr. Edwin Frederick -> not found

Riordan, Miss. Johanna Hannah -> not found

Looks like we still have an issue or two. A little research identified another use of single quotes in the phpMYEkMl.csv file.

2,1,“Hocking, Miss. Ellen ‘Nellie’”,“female”,20,2,1,“29105”,23,?,“S”,“4”,?,“Cornwall / Akron, OH”

So, let’s try removing anything in the name from the test dataset after the last space.

In [18]:

# another fix, but how
with open(kaggle_trg_2, 'w') as trg_fh:
  t_out = trg_fh.write("Survived,TName\n")
  for _, rw in k_tst2.iterrows():
    tst_nm = rw.Name
    ekml_rw = ekml_f[ekml_f["name"] == tst_nm]
    if len(ekml_rw) > 1:
      tkt = rw.Ticket
      ekml_tkt = ekml_rw[ekml_rw["ticket"] == tkt]
      if len(ekml_tkt) == 1:
        ekml_srvv = ekml_tkt.survived.item()
        ekml_nm = ekml_tkt.name.item()
        t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
      else:
        print(f"ekml: {tst_nm} w/ticket {tkt} not found")
        t_out = trg_fh.write(f'?,"{ekml_nm} w/TICKET {tkt} NOT FOUND"\n')
    elif len(ekml_rw == 1):
      ekml_nm = ekml_rw.name.item()
      ekml_srvv = ekml_rw.survived.item()
      t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
    elif len(ekml_rw) == 0:
      if "(" in tst_nm:
        # print(f"\n{tst_nm} contains brackets, let's remove that section and try again")
        t_nm2 = re.match("(.*?)\(",tst_nm).group(1)
        # print(f"\ttry {t_nm2}")
        ekml_re = ekml_f[ekml_f["name"].str.contains(t_nm2)]
        if len(ekml_re) == 1:
          ekml_nm = ekml_re.name.item()
          ekml_srvv = ekml_re.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"{tst_nm} -> {t_nm2} found {ekml_re}")
          t_out = trg_fh.write(f'?,"{tst_nm} -> {t_nm2} NOT FOUND"\n')
      else:
        # let's try to deal with those single quotes in the ekml file
        nm_b4ls = tst_nm.rpartition(' ')[0]
        ekml_b4ls = ekml_f[ekml_f["name"].str.contains(nm_b4ls)]
        if len(ekml_b4ls) == 1:
          ekml_nm = ekml_b4ls.name.item()
          ekml_srvv = ekml_b4ls.survived.item()
          t_out = trg_fh.write(f'{int(ekml_srvv)},"{ekml_nm}"\n')
        else:
          print(f"\n{tst_nm} -> not found")
          t_out = trg_fh.write(f'?,"{tst_nm} NOT FOUND"\n')
    else:
      print(f"\n{tst_nm} -> ! else: !, else what")
      t_out = trg_fh.write(f'?,"{tst_nm} HIT ELSE: IN LOOP?"\n')

And, bingo, no problems flagged. Think we got it.

I realize it is very sloppy code (lot’s of repeated bits). But, as this is sort of a one off, I just wanted to get it working. Which it appears to be doing.

Finishing Touches

Let’s reload the dataset from our new, modified file.

In [19]:

# now a little checking
targ_2 = pd.read_csv(kaggle_trg_2)
targ_2.head()

Quick look.

Out[19]:

	Survived	TName
0	0	Kelly, Mr. James
1	1	Wilkes, Mrs. James (Ellen Needs)
2	0	Myles, Mr. Thomas Francis
3	0	Wirz, Mr. Albert
4	1	Hirvonen, Mrs. Alexander (Helga E Lindqvist)

Now, let’s concatenate the target data with the modified test_2.csv (i.e. the modified Kaggle test.csv).

In [20]:

rek_tst = pd.concat([k_tst2, targ_2], axis=1)

And let’s look at a few entries to cofirm things match (at least for those test samples).

In [21]:

rek_tst.head(10)

Out[21]:

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Survived	TName
0	892	3	Kelly, Mr. James	male	34.50	0	0	330911	7.83	NaN	Q	0	Kelly, Mr. James
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.00	1	0	363272	7.00	NaN	S	1	Wilkes, Mrs. James (Ellen Needs)
2	894	2	Myles, Mr. Thomas Francis	male	62.00	0	0	240276	9.69	NaN	Q	0	Myles, Mr. Thomas Francis
3	895	3	Wirz, Mr. Albert	male	27.00	0	0	315154	8.66	NaN	S	0	Wirz, Mr. Albert
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.00	1	1	3101298	12.29	NaN	S	1	Hirvonen, Mrs. Alexander (Helga E Lindqvist)
5	897	3	Svensson, Mr. Johan Cervin	male	14.00	0	0	7538	9.22	NaN	S	1	Svensson, Mr. Johan Cervin
6	898	3	Connolly, Miss. Kate	female	30.00	0	0	330972	7.63	NaN	Q	0	Connolly, Miss. Kate
7	899	2	Caldwell, Mr. Albert Francis	male	26.00	1	1	248738	29.00	NaN	S	1	Caldwell, Mr. Albert Francis
8	900	3	Abrahim, Mrs. Joseph (Sophie Halaut Easu)	female	18.00	0	0	2657	7.23	NaN	C	1	Abrahim, Mrs. Joseph (Sophie Halaut Easu)
9	901	3	Davies, Mr. John Samuel	male	21.00	2	0	A/4 48871	24.15	NaN	S	0	Davies, Mr. John Samuel

In [22]:

rek_tst.tail(10)

Out[22]:

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Survived	TName
408	1300	3	Riordan, Miss. Johanna Hannah	female	NaN	0	0	334915	7.72	NaN	Q	1	Riordan, Miss. Johanna 'Hannah'
409	1301	3	Peacock, Miss. Treasteall	female	3.00	1	1	SOTON/O.Q. 3101315	13.78	NaN	S	0	Peacock, Miss. Treasteall
410	1302	3	Naughton, Miss. Hannah	female	NaN	0	0	365237	7.75	NaN	Q	0	Naughton, Miss. Hannah
411	1303	1	Minahan, Mrs. William Edward (Lillian E Thorpe)	female	37.00	1	0	19928	90.00	C78	Q	1	Minahan, Mrs. William Edward (Lillian E Thorpe)
412	1304	3	Henriksson, Miss. Jenny Lovisa	female	28.00	0	0	347086	7.78	NaN	S	0	Henriksson, Miss. Jenny Lovisa
413	1305	3	Spector, Mr. Woolf	male	NaN	0	0	A.5. 3236	8.05	NaN	S	0	Spector, Mr. Woolf
414	1306	1	Oliva y Ocana, Dona. Fermina	female	39.00	0	0	PC 17758	108.90	C105	C	1	Oliva y Ocana, Dona. Fermina
415	1307	3	Saether, Mr. Simon Sivertsen	male	38.50	0	0	SOTON/O.Q. 3101262	7.25	NaN	S	0	Saether, Mr. Simon Sivertsen
416	1308	3	Ware, Mr. Frederick	male	NaN	0	0	359309	8.05	NaN	S	0	Ware, Mr. Frederick
417	1309	3	Peter, Master. Michael J	male	NaN	1	1	2668	22.36	NaN	C	1	Peter, Master. Michael J

We best save that as well!

In [23]:

rek_tst.to_csv(rek_k_tst2, index=False)

Done

I think that makes for a decent day’s effort. So, that’s it for this one. Next post in the series we will use our updated rek_test_2.csv to train and evaluate a model. The same model used in that first Kaggle training competition.

Feel free to download and play with my version of this post’s related notebook.

Do note: Before this post, I had planned to make the target dataset set available for download. Mainly because, after the first post in the series, I thought it would require some considerable manual effort to get right. With this post, that is no longer the case. So, if you would like to have the target dataset, you will have to use the code above, or the related notebook, to generate it yourself. After downloading the appropriate datasets of course.

Resources

pandas.DataFrame.to_csv
pandas.concat
pandas.Series.str
Indexing and selecting data
How do I select a subset of a DataFrame?
Python String Methods (search that list for rpartition and the like)
re — Regular expression operations
Regular Expression HOWTO
Split on last occurrence of delimiter

Too Old To Code

Titanic Dataset: Test Target Data II