Well, I think it’s time to see if any of that feature engineering can help me improve my score on Kaggle.

There will probably be a couple of posts covering my attempts to do so. Each one could prove to be relatively short, though hopefully focused.

Review Engineered Features

Let’s recall the features we originally had and those we’ve added to the dataset.

I’ll load the datasets and have a look at what’s in the training dataset, from two differing views.

In [3]:
# imports used in this post (presumably loaded in earlier cells of the notebook)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"
In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
# combined view of the training and test data
k_all = pd.concat([k_trn, k_tst], ignore_index=True)
In [5]:
k_trn.info()
k_trn.describe()

This is a list of all the features in the training dataset, indicating those that are numeric and those that are strings or possibly categorical. The latter will need to be numerically encoded for our model.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
 12  FamilySize   891 non-null    int64  
 13  Group        891 non-null    int64  
 14  Title        891 non-null    object 
 15  iFare        891 non-null    float64
dtypes: float64(3), int64(7), object(6)
memory usage: 111.5+ KB
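As a quick aside on the encoding just mentioned: pandas’ get_dummies does the work, turning a string column into 0/1 indicator columns. A tiny sketch of the idea (illustration only, the output shown in comments is approximate and not from the notebook):

# illustration only: encode the Sex column as indicator columns
pd.get_dummies(k_trn['Sex']).head(3)
#    female  male
# 0       0     1
# 1       1     0
# 2       1     0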

Do note that we will need to impute the missing values for Age in both the training and test datasets if we want to use that feature to train our model and make predictions with it.
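That’s a job for a later post, but just to sketch the simplest possible approach (my illustration, not necessarily the method I’ll end up using), a median fill would look something like this:

# simplest imputation sketch: fill missing ages with the training-set median
# (illustration only; a later post will look at imputation properly)
age_median = k_trn['Age'].median()
k_trn['Age'] = k_trn['Age'].fillna(age_median)
k_tst['Age'] = k_tst['Age'].fillna(age_median)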

And, here are some stats for the existing numerical features.

Out[5]:
       PassengerId  Survived  Pclass     Age   SibSp   Parch    Fare  FamilySize   Group   iFare
count       891.00    891.00  891.00  714.00  891.00  891.00  891.00      891.00  891.00  891.00
mean        446.00      0.38    2.31   29.70    0.52    0.38   32.74        1.90    2.12   14.87
std         257.35      0.49    0.84   14.53    1.10    0.81   49.54        1.61    1.80   13.57
min           1.00      0.00    1.00    0.42    0.00    0.00    4.01        1.00    1.00    3.71
25%         223.50      0.00    2.00   20.12    0.00    0.00    7.92        1.00    1.00    7.65
50%         446.00      0.00    3.00   28.00    0.00    0.00   15.00        1.00    1.00    8.05
75%         668.50      1.00    3.00   38.00    1.00    0.00   31.33        2.00    3.00   15.00
max         891.00      1.00    3.00   80.00    8.00    6.00  512.33       11.00   11.00  128.08

You will likely recall that Kaggle, and consequently I, selected ['Pclass', 'Sex', 'SibSp', 'Parch'] as the features for training and testing our first model. So, let’s try that again, replacing 'SibSp' and 'Parch' with 'FamilySize' and 'Group'. And, perhaps after that, with all four in the feature set.

Model #2

Okay, let’s build the datasets for training and testing with our chosen features, in this case ['Pclass', 'Sex', 'FamilySize', 'Group']. Then fit the model, generate the predictions and score the model’s accuracy.

In [6]:
Y = k_trn['Survived']

features = ['Pclass', 'Sex', 'FamilySize', 'Group']
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

In [7]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
Out[7]:
RandomForestClassifier(max_depth=5, random_state=1)
In [8]:
accuracy_score(k_tst["Survived"], predictions)
Out[8]:
0.7727272727272727

Well, disappointingly, no improvement over the very first attempt. Let’s add back the 'SibSp' and 'Parch' features and try that again.
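Before doing that, it’s perhaps worth remembering that a single accuracy score against one test set can be a bit noisy. A quick 5-fold cross-validation on the training data (a sanity-check sketch of my own, not part of the original workflow) would look something like this:

# sanity-check sketch: average accuracy over 5 folds of the training data
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X, Y, cv=5)
print(f"mean: {cv_scores.mean():.4f}, std: {cv_scores.std():.4f}")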

Model #3

In [9]:
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'FamilySize', 'Group']
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])
In [10]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
accuracy_score(k_tst["Survived"], predictions)
Out[10]:
0.7751196172248804

And, that score is exactly the same as our first attempt. It would appear that, by themselves, 'FamilySize' and 'Group' add no value to our model. But I expect that doesn’t mean they won’t when combined with other features. Well, lots of playing around left to do.
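One way to dig into that suspicion is to look at the forest’s feature importances, which sklearn exposes on a fitted model. A minimal sketch (the column names come from the get_dummies output above):

# how much does each encoded column contribute to the forest's splits?
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))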

Done

Like I said above, this and the next few posts could prove to be pretty short and narrowly focused. This could be one of the shortest I’ve written.

In the next post, I think I will look at imputing the missing values in the Age feature and seeing if that has any impact on the accuracy of the model’s predictions. We may even look at creating a feature that puts the ages into ranges (categorical?) rather than leaving age as a continuous variable.
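For what it’s worth, pandas has pd.cut for exactly that sort of binning. A minimal sketch, with bin edges and labels I’ve chosen arbitrarily for illustration:

# bin Age into ranges; these edges/labels are placeholders, not a final choice
age_bins = [0, 12, 18, 35, 60, 80]
age_labels = ['child', 'teen', 'young_adult', 'adult', 'senior']
k_trn['AgeBin'] = pd.cut(k_trn['Age'], bins=age_bins, labels=age_labels)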

Feel free to download and play with my version of this post’s related notebook.