What I am proposing to do next is some feature engineering. At least I hope it can be considered feature engineering.

I am going to add additional columns to the two modified datasets (oma*.csv) I created in the last post or two.

New Features

I will add that family size feature we covered in Exploratory Data Analysis, Part II. But there were also non-family groups travelling together on a single ticket.

For example, Mr. Thomas Storey (for whom we estimated a fare in the test dataset in a previous post, and then modified in the last post).

with Storey and several other shipmates; Andrew Shannon [Lionel Leonard], August Johnson, William Henry Törnquist, Alfred Carver and William Cahoone Johnson) forced to travel aboard Titanic as passengers. Storey and his shipmates boarded the Titanic at Southampton, all travelling third class (ticket number 370160).

Encyclopedia Titanica

If there are groups like that travelling on a single ticket, I want to get that group size information into the possible feature set as well.

Given the preceding, it is likely that a number of the ticket prices shown for individuals are actually group fares. I think the model would really work better if we had individual fares. So, I will attempt to sort something out with respect to fares.
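
As a rough sketch of what an individual fare could look like (not necessarily how I will end up doing it), the shared fare can be split evenly across everyone on the same ticket. The toy dataframe and the column names TktCount and FarePP are made up for illustration; an even split is an assumption.

```python
import pandas as pd

# Toy data only: three passengers share one ticket, one travels alone.
toy = pd.DataFrame({
    "Ticket": ["110152", "110152", "110152", "373450"],
    "Fare": [86.50, 86.50, 86.50, 8.05],
})

# Number of passengers per ticket, mapped back onto every row.
toy["TktCount"] = toy["Ticket"].map(toy["Ticket"].value_counts())

# Hypothetical per-person fare: assume the ticket fare is shared evenly.
toy["FarePP"] = toy["Fare"] / toy["TktCount"]
print(toy)
```

On the real data the ticket counts would need to come from the training and test datasets combined, since a group can be split across the two.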

Finally, the name feature is, in and of itself, not horribly helpful. But it does in many cases contain the passenger’s title. It is possible that title might be a feature worth looking at. It may provide an additional perspective on a passenger’s status in the eyes of the crew and other passengers, something which would likely affect the likelihood of survival.
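
A minimal sketch of pulling the title out of the name, assuming every name follows the "Surname, Title. Given names" pattern seen in the dataset. The regular expression and the variable names are my own choices for illustration, and names that break the pattern would come back as NaN.

```python
import pandas as pd

# A few names in the format used by the Titanic dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture whatever sits between the comma after the surname and the next period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Rare titles would probably need to be grouped together before being used as a feature.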

Some people talked about using the cabin information to determine the deck to which a person was assigned. The thinking being that the lower the deck the worse the likelihood of survival. But, there are a great many passengers without any cabin information. And, there really is no way to impute the cabin value in any meaningful way. So, I am going to ignore the Cabin feature altogether. Though I doubt I will remove it from the CSVs.

Once I get these features sorted, I will save the modified training/test datasets to CSV files for use down the road.

Also, to be safe, I will likely save the datasets after each new feature is added. Don’t know if I will use differing file names for each iteration.

So, let’s get going.

Quick Check

Decided to take a quick look at both datasets to make sure that I wasn’t missing any data values (except for Age and Cabin).

In [3]:

import pandas as pd

# paths to datasets
# current datasets of choice
oma_trn_2 = "./data/titanic/oma_trn_2.csv"
oma_tst_2 = "./data/titanic/oma_tst_2.csv"
# datasets to be generated in this notebook
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"

In [4]:

# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_2)
k_tst = pd.read_csv(oma_tst_2)

In [5]:

# curiosity killed the cat
k_trn_nac = k_trn.drop(["Age", "Cabin"], axis=1)
k_trn_nac[k_trn_nac.isnull().any(axis=1)]
k_trn_nac[k_trn_nac["Fare"] == 0.0]

Out[5]:

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Embarked

Out[5]:

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Embarked

In [6]:

k_tst_nac = k_tst.drop(["Age", "Cabin"], axis=1)
k_tst_nac[k_tst_nac.isnull().any(axis=1)]
k_tst_nac[k_tst_nac["Fare"] == 0.0]

Out[6]:

	PassengerId	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Embarked	Survived

Out[6]:

	PassengerId	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Embarked	Survived
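
The same check can also be summarised as counts rather than as filtered rows. This is only a sketch on a made-up toy dataframe; missing_per_col and zero_fares are names I picked for illustration.

```python
import numpy as np
import pandas as pd

# Toy data only: a few gaps in Age and Cabin, plus one zero fare.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 0.0],
    "Embarked": ["S", "C", "S"],
})

# Count of missing values in each column.
missing_per_col = toy.isnull().sum()

# Count of rows where the fare is exactly zero.
zero_fares = int((toy["Fare"] == 0.0).sum())

print(missing_per_col)
print(zero_fares)
```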

Family Size

This should be relatively easy; I’ve already done it once. So, add a column/feature FamilySize, which is the sum SibSp + Parch + 1 for each passenger in both datasets.

In [7]:

# let's add the family size feature
k_trn['FamilySize'] = k_trn['Parch'] + k_trn['SibSp'] + 1
k_tst['FamilySize'] = k_tst['Parch'] + k_tst['SibSp'] + 1

Quick look.

In [8]:

k_trn.head()
k_tst.head()

Out[8]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	FamilySize
0	1	0	3	Braund, Mr. Owen Harris	male	22.00	1	A/5 21171	7.25	NaN	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.00	1	PC 17599	71.28	C85	C	2
2	3	1	3	Heikkinen, Miss. Laina	female	26.00	0	STON/O2. 3101282	7.92	NaN	S	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.00	1	113803	53.10	C123	S	2
4	5	0	3	Allen, Mr. William Henry	male	35.00	0	373450	8.05	NaN	S	1

Out[8]:

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Survived	FamilySize
0	892	3	Kelly, Mr. James	male	34.50	0	0	330911	7.83	NaN	Q	0	1
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.00	1	0	363272	7.00	NaN	S	1	2
2	894	2	Myles, Mr. Thomas Francis	male	62.00	0	0	240276	9.69	NaN	Q	0	1
3	895	3	Wirz, Mr. Albert	male	27.00	0	0	315154	8.66	NaN	S	0	1
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.00	1	1	3101298	12.29	NaN	S	1	3

And, let’s save that enhancement.

In [9]:

# let's save our work so far
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Groups

As mentioned, there were also possibly non-family groups travelling on a single ticket. Let’s have a look.

In [10]:

# let's check for non-family groups travelling on a single ticket
p_grps = k_trn.groupby(["Ticket", "SibSp", "Parch"], as_index=False).PassengerId.count()
p_nofam = p_grps[(p_grps['SibSp'] == 0) & (p_grps['Parch'] == 0) & (p_grps['PassengerId'] > 1)]
p_nofam.head()

Out[10]:

	Ticket	PassengerId
0	110152	3
3	110465	2
33	113572	2
49	113798	2
85	1601	7

In [11]:

k_trn[k_trn['Ticket'] == '110152']

Out[11]:

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked	FamilySize
257	258	1	1	Cherry, Miss. Gladys	female	30.00	110152	86.50	B77	S	1
504	505	1	1	Maioni, Miss. Roberta	female	16.00	110152	86.50	B79	S	1
759	760	1	1	Rothes, the Countess. of (Lucy Noel Martha Dye...	female	33.00	110152	86.50	B77	S	1

There may possibly be some value in identifying such groups. And, I may need that info elsewhere in my feature engineering. So, let’s add a new column “Group”. Probably should have called the “FamilySize” column “Family” to save some typing.

In [12]:

# add new column "Group" to training dataset, then to test dataset
# add new column with default value of 1
k_trn.loc[:,"Group"] = 1
# now change those rows that need changing
i = 0
for _, rw in k_trn.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grps[p_grps['Ticket'] == p_tkt]
  print(rw_tkt)
  cnt_tkt = rw_tkt.PassengerId.item()
  print(cnt_tkt)
  if cnt_tkt > 1:
    rw.Group = cnt_tkt
  # i += 1
  # if i > 4:
  #   break

        Ticket  SibSp  Parch  PassengerId
564  A/5 21171      1      0            1
1
       Ticket  SibSp  Parch  PassengerId
643  PC 17599      1      0            1
1
               Ticket  SibSp  Parch  PassengerId
723  STON/O2. 3101282      0      0            1
1
    Ticket  SibSp  Parch  PassengerId
51  113803      1      0            2
2
     Ticket  SibSp  Parch  PassengerId
511  373450      0      0            1
1
     Ticket  SibSp  Parch  PassengerId
301  330877      0      0            1
1
   Ticket  SibSp  Parch  PassengerId
93  17463      0      0            1
1
     Ticket  SibSp  Parch  PassengerId
427  349909      0      4            1
428  349909      3      1            3

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-52d90aec49b8> in <module>
      8   rw_tkt = p_grps[p_grps['Ticket'] == p_tkt]
      9   print(rw_tkt)
---> 10   cnt_tkt = rw_tkt.PassengerId.item()
     11   print(cnt_tkt)
     12   if cnt_tkt > 1:
E:\appDev\Miniconda3\envs\ds-3.9\lib\site-packages\pandas\core\base.py in item(self)
    418         if len(self) == 1:
    419             return next(iter(self))
--> 420         raise ValueError("can only convert an array of size 1 to a Python scalar")
    421
    422     @property
ValueError: can only convert an array of size 1 to a Python scalar

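Oops! Need to rework that p_grps dataframe. Because it is grouped by Ticket, SibSp and Parch, a single ticket can match more than one row (as ticket 349909 does above), and .item() only works when there is exactly one value. Also, do note that I couldn’t use the rw variable returned by iterrows() to modify the dataframe, which is what I attempted above. Took me a bit of fooling around to sort that out.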
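
A small sketch of that iterrows() behaviour, using a made-up toy dataframe. With columns of mixed types, each row comes back as a new Series, so assigning to it leaves the dataframe untouched. Writing through .loc with the index label does change it.

```python
import pandas as pd

# Toy data only: mixed column types, as in the Titanic datasets.
toy = pd.DataFrame({"Ticket": ["A", "A", "B"], "Group": [1, 1, 1]})

# Attempt 1: assign to the row object handed back by iterrows().
for _, rw in toy.iterrows():
    rw["Group"] = 2  # only changes the temporary Series
after_row_assign = toy["Group"].tolist()

# Attempt 2: write to the dataframe itself using the index label.
for ndx, rw in toy.iterrows():
    if rw["Ticket"] == "A":
        toy.loc[ndx, "Group"] = 2
after_loc_assign = toy["Group"].tolist()

print(after_row_assign)
print(after_loc_assign)
```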

In [13]:

# Okay let's fix that mistake
p_grp2 = k_trn.groupby("Ticket", as_index=False).PassengerId.count()
k_trn.loc[:,"Group"] = 1
for ndx, rw in k_trn.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grp2[p_grp2['Ticket'] == p_tkt]
  cnt_tkt = rw_tkt.PassengerId.item()
  if cnt_tkt > 1:
    k_trn.loc[ndx, "Group"] = cnt_tkt

In [14]:

k_trn[k_trn["Group"] > 1].head()

Out[14]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	FamilySize	Group
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.00	1	0	113803	53.10	C123	S	2	2
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.00	3	1	349909	21.07	NaN	S	5	4
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.00	0	2	347742	11.13	NaN	S	3	3
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.00	1	0	237736	30.07	NaN	C	2	2
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.00	1	1	PP 9549	16.70	G6	S	3	2

You may have noticed in the above that “Palsson, Master. Gosta Leonard” ended up with a FamilySize of 5 and a Group of 4. That’s likely due to the other members being in the test dataset. What to do?

In [17]:

# okay need to combine two datasets to get correct group size values by ticket number
k_all = pd.concat([k_trn, k_tst], ignore_index=True)

In [18]:

p_grp3 = k_all.groupby("Ticket", as_index=False).PassengerId.count()
k_trn.loc[:,"Group"] = 1
for ndx, rw in k_trn.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grp3[p_grp3['Ticket'] == p_tkt]
  cnt_tkt = rw_tkt.PassengerId.item()
  if cnt_tkt > 1:
    k_trn.loc[ndx, "Group"] = cnt_tkt

In [19]:

k_trn[k_trn["FamilySize"] == 6].sort_values("Ticket")

Out[19]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	FamilySize	Group
341	342	1	1	Fortune, Miss. Alice Elizabeth	female	24.00	3	2	19950	263.00	C23 C25 C27	S	6	6
27	28	0	1	Fortune, Mr. Charles Alexander	male	19.00	3	2	19950	263.00	C23 C25 C27	S	6	6
438	439	0	1	Fortune, Mr. Mark	male	64.00	1	4	19950	263.00	C23 C25 C27	S	6	6
88	89	1	1	Fortune, Miss. Mabel Helen	female	23.00	3	2	19950	263.00	C23 C25 C27	S	6	6
437	438	1	2	Richards, Mrs. Sidney (Emily Hocking)	female	24.00	2	3	29106	18.75	NaN	S	6	3
638	639	0	3	Panula, Mrs. Juha (Maria Emilia Ojala)	female	41.00	0	5	3101295	39.69	NaN	S	6	7
824	825	0	3	Panula, Master. Urho Abraham	male	2.00	4	1	3101295	39.69	NaN	S	6	7
266	267	0	3	Panula, Mr. Ernesti Arvid	male	16.00	4	1	3101295	39.69	NaN	S	6	7
164	165	0	3	Panula, Master. Eino Viljami	male	1.00	4	1	3101295	39.69	NaN	S	6	7
50	51	0	3	Panula, Master. Juha Niilo	male	7.00	4	1	3101295	39.69	NaN	S	6	7
686	687	0	3	Panula, Mr. Jaako Arnold	male	14.00	4	1	3101295	39.69	NaN	S	6	7
642	643	0	3	Skoog, Miss. Margit Elizabeth	female	2.00	3	2	347088	27.90	NaN	S	6	6
167	168	0	3	Skoog, Mrs. William (Anna Bernhardina Karlsson)	female	45.00	1	4	347088	27.90	NaN	S	6	6
360	361	0	3	Skoog, Mr. Wilhelm	male	40.00	1	4	347088	27.90	NaN	S	6	6
63	64	0	3	Skoog, Master. Harald	male	4.00	3	2	347088	27.90	NaN	S	6	6
634	635	0	3	Skoog, Miss. Mabel	female	9.00	3	2	347088	27.90	NaN	S	6	6
819	820	0	3	Skoog, Master. Karl Thorsten	male	10.00	3	2	347088	27.90	NaN	S	6	6
787	788	0	3	Rice, Master. George Hugh	male	8.00	4	1	382652	29.12	NaN	Q	6	6
16	17	0	3	Rice, Master. Eugene	male	2.00	4	1	382652	29.12	NaN	Q	6	6
171	172	0	3	Rice, Master. Arthur	male	4.00	4	1	382652	29.12	NaN	Q	6	6
278	279	0	3	Rice, Master. Eric	male	7.00	4	1	382652	29.12	NaN	Q	6	6
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.00	0	5	382652	29.12	NaN	Q	6	6

Seems to be an improvement. And a group size larger than the family size wouldn’t seem unreasonable if staff were travelling with a family. The other way around likely implies bad data. But I am not going to go looking for trouble. So, update the test dataset and save both.

In [22]:

k_tst.loc[:,"Group"] = 1
for ndx, rw in k_tst.iterrows():
  p_tkt = rw.Ticket
  rw_tkt = p_grp3[p_grp3['Ticket'] == p_tkt]
  cnt_tkt = rw_tkt.PassengerId.item()
  if cnt_tkt > 1:
    k_tst.loc[ndx, "Group"] = cnt_tkt
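
For what it’s worth, the loops above could likely be replaced with a vectorized lookup. This is a sketch on made-up toy data, not what the notebook actually runs; trn, tst and tkt_counts are illustrative names.

```python
import pandas as pd

# Toy data only: ticket 349909 is split across the two datasets.
trn = pd.DataFrame({"PassengerId": [1, 2, 3],
                    "Ticket": ["349909", "349909", "373450"]})
tst = pd.DataFrame({"PassengerId": [4, 5],
                    "Ticket": ["349909", "330911"]})

# Passengers per ticket, counted across both datasets combined.
all_tickets = pd.concat([trn["Ticket"], tst["Ticket"]], ignore_index=True)
tkt_counts = all_tickets.value_counts()

# Look up each passenger's ticket count; solo travellers get 1.
trn["Group"] = trn["Ticket"].map(tkt_counts)
tst["Group"] = tst["Ticket"].map(tkt_counts)

print(trn)
print(tst)
```

Every ticket in either dataset appears in the combined counts, so the lookup never returns a missing value, and the default of 1 falls out of the count itself.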

In [23]:

# let's save our work so far
k_trn.to_csv(oma_trn_3, index=False)
k_tst.to_csv(oma_tst_3, index=False)

Done

You know, I think that’s it for this post. It is getting lengthy and has taken me more time than I expected. But, all good. Will continue looking at the remaining possible feature additions in the next post.

Feel free to download and play with my version of this post’s related notebook.


Too Old To Code

Titanic Dataset: Feature Engineering