Well, I think it’s time to see whether any of that feature engineering can help me improve my score on Kaggle. There will probably be a couple of posts covering my attempts to do so, and each one could prove to be relatively short, though hopefully focused.
Review Engineered Features
Let’s recall the features we originally had and those we’ve added to the dataset.
I’ll load the datasets and have a look at what’s in the training dataset, from two differing views.
import pandas as pd

# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"

# load the datasets currently of interest
k_trn = pd.read_csv(oma_trn_3)
k_tst = pd.read_csv(oma_tst_3)
# combined dataset, in case it proves useful later
k_all = pd.concat([k_trn, k_tst], ignore_index=True)

# two views of the training data: column/dtype info and numeric stats
k_trn.info()
k_trn.describe()
The first view lists all the features in the training dataset, indicating which are numeric and which are strings or possibly categorical. The latter will need to be numerically encoded for our model. Do note that we will need to impute the missing values for Age in both the training and test datasets if we want to use that feature to train our model and make predictions with it. And here are some stats for the existing numerical features.
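If you’d like to see the missing-value counts for yourself, a quick check along these lines (a minimal sketch, using the k_trn and k_tst dataframes loaded above) will show them:

# count missing values per column in each dataset
print(k_trn.isna().sum())
print(k_tst.isna().sum())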
You will likely recall that Kaggle, and consequently I, selected ['Pclass', 'Sex', 'SibSp', 'Parch'] as the features for training and testing our first model. So, let’s try that again, replacing 'SibSp' and 'Parch' with 'FamilySize' and 'Group'. And, perhaps after that, with all four in the feature set.
Model #2
Okay, let’s build the datasets for training and testing with our chosen features, in this case ['Pclass', 'Sex', 'FamilySize', 'Group']. Then fit the model, generate the predictions, and score the model’s accuracy.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# target and features for this model
Y = k_trn['Survived']
features = ['Pclass', 'Sex', 'FamilySize', 'Group']
# one-hot encode the string/categorical columns
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

# fit a random forest, predict on the test set and score the result
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
accuracy_score(k_tst["Survived"], predictions)
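As a side note, if you want to confirm what pd.get_dummies did to the feature set, printing the resulting column names will show which columns were expanded into 0/1 indicators:

# string columns (e.g. Sex) get expanded into indicator columns
print(X.columns.tolist())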
Well, disappointingly, no improvement over the very first attempt. Let’s add the 'SibSp' and 'Parch' features back in and try that again.
Model #3
# same model, now with both the original and the engineered family features
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'FamilySize', 'Group']
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
accuracy_score(k_tst["Survived"], predictions)
And that score is exactly the same as our first attempt. It would appear that, by themselves, 'FamilySize' and 'Group' add no value to our model. But I expect that doesn’t rule out their being useful in combination with other features. Well, lots of playing around left to do.
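One quick way to get a sense of which features the model actually leans on is to look at the fitted forest’s feature importances. A minimal sketch, using the model and X from Model #3 above:

# relative importance of each (encoded) feature in the fitted forest
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))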
Done
Like I said above, this and the next few posts could prove to be pretty short and narrowly focused. This could be one of the shortest I’ve written.
In the next post, I think I will look at imputing the missing values in the Age feature and seeing whether that has any impact on the accuracy of the model’s predictions. We may even look at creating a feature that puts the ages into ranges (categorical?) rather than leaving Age as a continuous variable.
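For that age-range idea, pd.cut is the likely tool. A minimal sketch, with bin edges and labels that are purely illustrative placeholders, not necessarily what I’ll use next time:

# bin Age into labelled ranges; these edges/labels are illustrative only
age_bins = [0, 12, 18, 35, 60, 100]
age_labels = ['child', 'teen', 'young_adult', 'adult', 'senior']
k_trn['AgeGroup'] = pd.cut(k_trn['Age'], bins=age_bins, labels=age_labels)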
Feel free to download and play with my version of this post’s related notebook.