Okay, let’s make a first attempt at generating a model from the training set and testing how well it does. We will use the same model as in that very first Kaggle training exercise with the Titanic dataset. We will score our predictions against the target data we created in the previous post in the series, and see whether the result matches what I got for my first submission on Kaggle.

We will be using the RandomForestClassifier for our model, and accuracy_score for our metric.

The training data will be a subset of the features in the Kaggle Titanic training dataset. For the testing dataset we will use the modified dataset produced in the last post, rek_test_2.csv.

I expect this will be a short post. I am really just trying to confirm that the modified test dataset is at least close to that used by Kaggle to measure prediction accuracy, and that the metric I am using is the correct one.

Imports

A couple of model/metric imports. And the usual notebook defaults.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting 
import seaborn as sns # for plotting
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
In [2]:
# set up some notebook display defaults
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.style.use('default')
sns.set()
pd.options.display.float_format = '{:,.2f}'.format

Datasets

Variables for the datasets I will or might be using.

In [3]:
# paths to datasets
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
rek_k_tst2 = "./data/titanic/rek_test_2.csv"

Okay, let’s load the datasets.

In [4]:
# load the datasets currently of interest
k_trn = pd.read_csv(kaggle_trn)
k_tst = pd.read_csv(rek_k_tst2)

Train Model

We’ll start by defining the features and targets to use for training and testing.

I am using pandas.get_dummies() to convert categorical variables into numerical values using indicator columns/variables. Machine learning algorithms in general do not appreciate categorical values. There are other ways to do this conversion; this just happens to be the one used in the Kaggle introductory notebook. We may look at some of the others in future. get_dummies() uses one-hot encoding: each category value gets its own indicator column.

Have a look at the X and X_test dataframes to see what happens.
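To make the indicator columns concrete, here is a minimal sketch on a toy frame. The rows are made up for illustration and are not taken from the Titanic data; only the column names Pclass and Sex match the ones used in this post.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
})

# Numeric columns pass through unchanged; the string column 'Sex'
# is replaced by one indicator column per category value
toy_encoded = pd.get_dummies(toy)

print(toy_encoded.columns.tolist())
```

The Sex column is replaced by Sex_female and Sex_male, while Pclass is left as it was.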

In [5]:
Y = k_trn['Survived']

features = ['Pclass', 'Sex', 'SibSp', 'Parch']
X = pd.get_dummies(k_trn[features])
X_test = pd.get_dummies(k_tst[features])
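One thing that may be worth checking, though the notebook itself does not do it: get_dummies() is applied to the training and test frames separately, so the two only end up with the same columns if both contain the same category values. For the features used here that should be the case, since both files contain male and female passengers. The sketch below uses invented toy frames (the names toy_trn, toy_tst, X_toy and X_toy_test are hypothetical) to show how a mismatch could be detected and patched with reindex().

```python
import pandas as pd

# Toy frames, invented for illustration; the test frame has no 'male' rows
toy_trn = pd.DataFrame({"Pclass": [1, 3], "Sex": ["male", "female"]})
toy_tst = pd.DataFrame({"Pclass": [2, 3], "Sex": ["female", "female"]})

X_toy = pd.get_dummies(toy_trn)
X_toy_test = pd.get_dummies(toy_tst)

# Encoding each frame separately can leave the test frame short a column
missing = set(X_toy.columns) - set(X_toy_test.columns)
print(missing)

# Reindex to the training columns, filling absent indicators with 0
X_toy_test = X_toy_test.reindex(columns=X_toy.columns, fill_value=0)
print(X_toy_test.columns.tolist())
```

This is only one way to handle it; fitting a single encoder on the training data would be another.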

Now, let’s train the model and make our predictions.

In [6]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, Y)
predictions = model.predict(X_test)
Out[6]:
RandomForestClassifier(max_depth=5, random_state=1)
In [7]:
predictions[0:5]
Out[7]:
array([0, 1, 0, 0, 1], dtype=int64)

Prediction Accuracy

Okay, let’s see how well our model made out.

In [8]:
accuracy_score(k_tst["Survived"], predictions)
Out[8]:
0.7751196172248804
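With its default settings, accuracy_score returns the fraction of predictions that exactly match the targets. A minimal sketch of that same calculation done by hand, on toy labels invented for illustration:

```python
import numpy as np

# Toy labels, invented for illustration only
y_true = np.array([0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 1, 0, 1])

# Accuracy is the share of positions where prediction and target agree
manual_accuracy = np.mean(y_true == y_pred)
print(manual_accuracy)
```

Here four of the five toy predictions match. Assuming rek_test_2.csv keeps all 418 rows of the Kaggle test file, the score above would correspond to 324 correct predictions out of 418.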

Looks Good

This is what I got on that first training submission on Kaggle.

Get Started with Titanic (version 1/1)
Public Score: 0.77511

The test looks close enough for me. It would seem my generated test targets are reasonably close to those on Kaggle.

Done

I did say it would be short and sweet.

Feel free to download and play with my version of this post’s related notebook.

Resources