Okay, September 29th, but going to try and get a post done for this coming Monday, October 4th. Spending way too much time on that MITx course (Machine Learning with Python-From Linear Models to Deep Learning).
Thought I’d try to cover something of value in many, if not most, machine learning projects: cross-validation. Cross-validation is primarily used for model selection and parameter tuning. Hopefully we will get a peek at the latter in this post. Essentially, it gives you multiple training and validation datasets from a single training dataset.
It is simple to understand, and it generally gives a better estimate of the model’s effectiveness than other methods, e.g. a simple train/test split.
k-fold Cross-Validation
The concept is rather straightforward. Break your training data into k pieces (typically of equal size). Then repeatedly use a different piece as the validation data while using the remaining pieces as the training data. For each iteration, save the model’s accuracy on the validation dataset, then discard the model. Finish by taking the average of the results to get a measure of how well a given model and/or set of tuning parameters is working. Then you apply the selected model and tuning parameters to the complete training set to train the final model.
Do note that the dataset is generally randomized before the k-fold pieces are created.
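The splitting step described above can be sketched in plain Python. This is only an illustration of the idea, not what scikit-learn does internally; the helper name `k_fold_indices` and its `seed` parameter are made up for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # randomize before creating the folds
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]                 # one piece for validation
        train_idx = indices[:start] + indices[start + size:]  # the rest for training
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: validate on {sorted(val_idx)}, train on {len(train_idx)} samples")
```

Every sample lands in exactly one validation fold, and in the training data for the other k-1 iterations.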
Let’s have a look at that for k=5. Though do note that k=10 is often considered a good default for most cases. But the value is also somewhat dependent on the number of samples in your dataset. I am selecting the validation fold in the order that most easily generates the following image. You can use any order you like.
Basically, over the k iterations you are training on the whole training dataset and also validating on the whole training dataset, but in each iteration the model is validated only on the fold it was not trained on. That adds some statistical validity to the process. Training on the whole dataset and then evaluating on that same dataset would intuitively not be particularly informative regarding the effectiveness of the model.
By using k-fold cross-validation, you do get a more meaningful measure of the effectiveness of the model, though at the cost of time. You will be training the model k times. And, depending on the model and the size of the training dataset, that could take considerable time. At this point I have no idea how long any of my examples are going to take.
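One way to see the time cost is scikit-learn's `cross_validate`, which reports the fit time for each fold alongside the scores. The sketch below is a rough illustration only: the synthetic dataset from `make_classification` and its sizes are assumptions picked for the example, not the data used later in this post.

```python
from sklearn import datasets
from sklearn import model_selection as ms
from sklearn.linear_model import Perceptron

# Synthetic data purely for illustration (sizes are arbitrary assumptions)
X_demo, y_demo = datasets.make_classification(n_samples=500, n_features=20, random_state=0)

results = ms.cross_validate(
    Perceptron(tol=1e-3, random_state=0), X_demo, y_demo, cv=5, scoring="accuracy"
)

# One entry per fold: the model is trained k=5 times
print("fit time per fold (s):", results["fit_time"])
print("total fit time (s):", results["fit_time"].sum())
print("mean accuracy:", results["test_score"].mean())
```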
Note: image above was created with in-line SVG — very simple SVG, I don’t know enough to get fancy. Have a look at the page source if interested.
An Example
I am not going to code a procedure for the k-fold cross-validation. I am going to use the cross-validation method provided by scikit-learn. I am sure it does a better job than any code I could put together.
In that MITx course, we have been looking at using the perceptron algorithm to train classification models. So, I think I will start by using a variation of that algorithm, likely SGD (stochastic gradient descent), to train on one of the scikit-learn toy datasets.
The dataset I am going to use is the “Breast cancer wisconsin (diagnostic) dataset”, provided by scikit-learn.
Setup
So let’s set up our imports, do some Jupyter-specific setup and load the dataset.
from IPython.core.interactiveshell import InteractiveShell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
from sklearn import datasets
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import model_selection as ms
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.style.use('default')
sns.set()
pd.options.display.float_format = '{:,.2f}'.format
c_data = datasets.load_breast_cancer(as_frame=True)
y = c_data.target # Training labels ('malignant = 0', 'benign = 1')
X = c_data.data # 30 attributes; https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset
Review the Data
Let’s have a quick look at the dataset.
y.head()
X.head()
# let's get a bit more info on the labels
# note: 1 = benign, 0 = malignant
target_cnt = y.value_counts()
print(target_cnt)
# or we could have used a countplot
# semi-colons weren't preventing output, so assigning to variables
fig = plt.figure(figsize=(8,6))
ax = sns.countplot(x=y, order=[0, 1])
pt = plt.title('Distribution of Outcomes')
px = plt.xlabel('Is Benign (1 = True)')
py = plt.ylabel('Count')
for p in ax.patches:
    #pv = ax.annotate(f'{p.get_height()}', (p.get_x(), p.get_height()+2));
    pv = ax.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='center', va='top', color='white', size=14)
# a deeper look at the attributes
X.info()
Okay, no missing data. And the target is already a numeric value, rather than textual. That makes life easier.
One article I looked at normalized the data:
most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1.
“Building a Simple Machine Learning Model on Breast Cancer Data”,
by vishabh goel, Sep 29, 2018
I am going to go with the raw data, then perhaps normalize the attributes and have another look at the cross-validation and test scores.
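For reference, here is a rough sketch of how scaling could be combined with cross-validation. It assumes a 0–1 scaling via `MinMaxScaler` (to match the quote) wrapped in a `make_pipeline`, so the scaler is fit only on the training folds in each iteration. The choice of `SVC` and the variable names are mine, for illustration only.

```python
from sklearn import datasets
from sklearn import model_selection as ms
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_all, y_all = datasets.load_breast_cancer(return_X_y=True)

# Raw attributes vs. attributes scaled to the 0-1 range inside each fold
raw_scores = ms.cross_val_score(SVC(), X_all, y_all, cv=5, scoring="accuracy")
scaled_model = make_pipeline(MinMaxScaler(), SVC())
scaled_scores = ms.cross_val_score(scaled_model, X_all, y_all, cv=5, scoring="accuracy")

print(f"raw:    mean accuracy = {raw_scores.mean():.3f}")
print(f"scaled: mean accuracy = {scaled_scores.mean():.3f}")
```

Putting the scaler inside the pipeline keeps the validation fold out of the scaling statistics, which would otherwise leak a little information into training.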
Test and Training Data
Decided I’d look at how three basic algorithms do on this dataset: perceptron, support vector machine and stochastic gradient descent. We’ll look at data transformations (e.g. scaling) and parameter tuning in future posts — still using k-fold cross-validation.
I am also going to pull out a “test” dataset in order to test the models after the cross-validation to compare accuracy during training and on a test dataset. So, let’s start with that. As the data looks unordered, I am not going to sort it prior to creating my test data.
# I am going to split the full dataset (569 rows) in two, keeping the last 99 rows for a final test case
# since the data looks to be unordered, I am just going to pull out roughly the last 1/6 of the samples
y_trn = y[:470]
y_tst = y[470:]
X_trn = X[:470]
X_tst = X[470:]
print(len(y_trn), len(y_tst), len(X_trn), len(X_tst))
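If the row order turned out to matter after all, a shuffled and stratified split is one alternative to slicing. The sketch below uses scikit-learn's `train_test_split` with the same 470/99 sizes; the `random_state` value and the `_alt` variable names are arbitrary choices for this example, not something used elsewhere in this post.

```python
from sklearn import datasets
from sklearn import model_selection as ms

c_data = datasets.load_breast_cancer(as_frame=True)
X, y = c_data.data, c_data.target

# Hold out 99 rows at random, keeping the benign/malignant ratio similar in both parts
X_trn_alt, X_tst_alt, y_trn_alt, y_tst_alt = ms.train_test_split(
    X, y, test_size=99, shuffle=True, stratify=y, random_state=0
)
print(len(y_trn_alt), len(y_tst_alt), len(X_trn_alt), len(X_tst_alt))
```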
Compare Algorithms
Perceptron
# let's start with a quick look at training/validating the perceptron algorithm on this data set
alf = linear_model.Perceptron(tol=1e-3, random_state=0)
# using 5-fold, as that would put 94 samples in each fold, 10-fold would only leave a 47 sample test on each iteration
# and, no, I don't know if that is a correct assessment of the situation
scores = ms.cross_val_score(alf, X_trn, y_trn, cv=5, scoring="accuracy")
score = scores.mean()
print(f"Perceptron: min = {min(scores)}, max = {max(scores)}, mean = {score}")
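One thing worth knowing: with an integer `cv` and a classifier, `cross_val_score` uses stratified folds and does not shuffle the rows first. To get the randomization mentioned earlier, a splitter object can be passed instead. A minimal sketch, assuming `random_state=0` for repeatability (my choice, not a recommendation):

```python
from sklearn import datasets, linear_model
from sklearn import model_selection as ms

X_all, y_all = datasets.load_breast_cancer(return_X_y=True)
X_trn, y_trn = X_all[:470], y_all[:470]  # same training slice as above

clf = linear_model.Perceptron(tol=1e-3, random_state=0)
# Shuffle the rows before building 5 stratified folds
shuffled_cv = ms.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = ms.cross_val_score(clf, X_trn, y_trn, cv=shuffled_cv, scoring="accuracy")

print(f"Perceptron (shuffled folds): min = {shuffled_scores.min()}, "
      f"max = {shuffled_scores.max()}, mean = {shuffled_scores.mean()}, "
      f"std = {shuffled_scores.std()}")
```

The standard deviation gives a slightly more robust feel for the spread across folds than min and max alone.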
Support Vector Machine
# let's look at basic SVM
svm = SVC()
scores1 = ms.cross_val_score(svm, X_trn, y_trn, cv=5, scoring="accuracy")
score1 = scores1.mean()
print(f"SVM: min = {min(scores1)}, max = {max(scores1)}, mean = {score1}")
Stochastic Gradient Descent
# and finally let's look at SGD, using defaults for most parameters
sgm = linear_model.SGDClassifier(max_iter=25)
scores2 = ms.cross_val_score(sgm, X_trn, y_trn, cv=5, scoring="accuracy")
score2 = scores2.mean()
print(f"SGD: min = {min(scores2)}, max = {max(scores2)}, mean = {score2}")
Looks like we need to modify one of our parameters! Let's try tripling it.
# let's increase max_iter
sgm = linear_model.SGDClassifier(max_iter=75)
scores3 = ms.cross_val_score(sgm, X_trn, y_trn, cv=5, scoring="accuracy")
score3 = scores3.mean()
print(f"SGD: min = {min(scores3)}, max = {max(scores3)}, mean = {score3}")
Conclusion?
Looks like the Support Vector Machine model is the best performing of the three, with the other two close behind. But it also looks like the Perceptron model may have had a noticeably wider range of accuracy values across the folds.
Here are the three model cross-validation means together.
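A compact way to gather those means, and to check each model against the held-out test rows, might look like the sketch below. It repeats the 470/99 split from earlier so it runs on its own. The `random_state=0` on `SGDClassifier` is an assumption added for repeatability, and the `models` and `summary` names are just for this example.

```python
from sklearn import datasets, linear_model
from sklearn import model_selection as ms
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

c_data = datasets.load_breast_cancer(as_frame=True)
X, y = c_data.data, c_data.target
X_trn, X_tst = X[:470], X[470:]
y_trn, y_tst = y[:470], y[470:]

models = {
    "Perceptron": linear_model.Perceptron(tol=1e-3, random_state=0),
    "SVM": SVC(),
    "SGD": linear_model.SGDClassifier(max_iter=75, random_state=0),  # random_state is an assumption
}

summary = {}
for name, model in models.items():
    # 5-fold cross-validation on the training rows only
    cv_scores = ms.cross_val_score(model, X_trn, y_trn, cv=5, scoring="accuracy")
    # Retrain on the full training set, then score once on the held-out test rows
    model.fit(X_trn, y_trn)
    test_acc = accuracy_score(y_tst, model.predict(X_tst))
    summary[name] = (cv_scores.mean(), test_acc)
    print(f"{name}: cv mean = {cv_scores.mean():.3f}, test accuracy = {test_acc:.3f}")
```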