Okay, September 29th, but going to try and get a post done for this coming Monday, October 4th. Spending way too much time on that MITx course (Machine Learning with Python-From Linear Models to Deep Learning).

Thought I’d try to cover something of value in many, if not most, machine learning projects: cross-validation. Cross-validation is primarily used for model selection and parameter tuning. Hopefully we will get a peek at the latter in this post. Essentially, it gives you multiple training and validation datasets from a single training dataset.

It is simple to understand and it generally gives a better estimate of the model’s effectiveness than other methods, e.g. a simple train/test split.

k-fold Cross-Validation

The concept is rather straightforward. Break your training data into k pieces, or folds (typically of equal size). Then repeatedly use a different piece as the validation data while using the remaining k-1 pieces as the training data. For each iteration, save the model’s accuracy on the validation dataset, then discard the model. Finish by taking the average of the k results to get a measure of how well a given model and/or set of tuning parameters is working. Once you have selected a model and tuning parameters, you train the final model on the complete training set.

Do note that the dataset is generally randomized before the k-fold pieces are created.
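The steps above can be sketched by hand with scikit-learn's KFold splitter. This is only a minimal illustration, not the code used later in this post: the synthetic data from make_classification and the names X_demo, y_demo and fold_scores are assumptions made so the sketch is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, only here to make the sketch runnable)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

k = 5
# shuffle=True randomizes the row order before the k folds are cut
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for trn_idx, val_idx in kf.split(X_demo):
    model = Perceptron(tol=1e-3, random_state=0)   # fresh model on every iteration
    model.fit(X_demo[trn_idx], y_demo[trn_idx])    # train on the other k-1 folds
    # accuracy on the held-out fold; the fitted model itself is not kept
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

print(f"fold accuracies: {np.round(fold_scores, 3)}")
print(f"mean accuracy:   {np.mean(fold_scores):.3f}")
```

The mean of fold_scores is the number you would compare across candidate models or parameter settings.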

Let’s have a look at that for k=5. Do note that k=10 is often considered a good default, though the best value also depends somewhat on the number of samples in your dataset. I picked the order of the validation folds to make the following image easy to generate. You can use any order you like.

[Figure: 5-fold cross-validation. Iterations 1, 2, ... 5, each splitting the training data into Training Set folds and one Validation Set fold.]

Basically, you end up both training and validating on the whole training dataset, but in a way that adds some statistical validity to the process, because no model is ever validated on samples it was trained on. Training on the whole dataset and then evaluating on that same dataset would, intuitively, not tell you much about the effectiveness of the model.

By using k-fold cross-validation, you do get a more meaningful measure of the effectiveness of the model, though at the cost of time: you will be training the model k times. Depending on the model and the size of the training dataset, that could take considerable time. At this point I have no idea how long any of my examples are going to take.
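One way to find out how long the k fits take is scikit-learn's cross_validate, which returns per-fold fit and scoring times alongside the scores. A minimal sketch, assuming the breast cancer dataset used later in this post and a Perceptron as the model; the names X_all, y_all and cv_out are only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_validate

X_all, y_all = load_breast_cancer(return_X_y=True)

# cross_validate returns a dict with 'fit_time', 'score_time' and 'test_score',
# each holding one entry per fold
cv_out = cross_validate(
    Perceptron(tol=1e-3, random_state=0), X_all, y_all, cv=5, scoring="accuracy"
)

print(f"fit time per fold (s): {cv_out['fit_time']}")
print(f"total fit time (s):    {cv_out['fit_time'].sum():.4f}")
print(f"mean accuracy:         {cv_out['test_score'].mean():.3f}")
```

The timings depend on the machine, so treat them as a rough guide only.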

Note: image above was created with in-line SVG — very simple SVG, I don’t know enough to get fancy. Have a look at the page source if interested.

An Example

I am not going to code a procedure for the k-fold cross-validation. I am going to use the cross-validation method provided by scikit-learn. I am sure it does a better job than any code I could put together.

In that MITx course, we have been looking at using the perceptron algorithm to train classification models. So, I think I will start by using a variation of that algorithm, likely SGD (stochastic gradient descent), to train on one of the scikit-learn toy datasets.

The dataset I am going to use is the “Breast cancer wisconsin (diagnostic) dataset”, provided by scikit-learn.

Setup

So let’s set up our imports, do some Jupyter-specific setup and load the dataset.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting 
import seaborn as sns # for plotting
from sklearn import datasets
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import model_selection as ms
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
In [2]:
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.style.use('default')
sns.set()
pd.options.display.float_format = '{:,.2f}'.format
In [3]:
c_data = datasets.load_breast_cancer(as_frame=True)
y = c_data.target # Training labels ('malignant = 0', 'benign = 1')
X = c_data.data # 30 attributes; https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset

Review the Data

Let’s have a quick look at the dataset.

In [4]:
y.head()
X.head()
Out[4]:
0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int32
Out[4]:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  mean fractal dimension  ...
0        17.99         10.38          122.80   1,001.00             0.12              0.28            0.30                 0.15           0.24                    0.08  ...
1        20.57         17.77          132.90   1,326.00             0.08              0.08            0.09                 0.07           0.18                    0.06  ...
2        19.69         21.25          130.00   1,203.00             0.11              0.16            0.20                 0.13           0.21                    0.06  ...
3        11.42         20.38           77.58     386.10             0.14              0.28            0.24                 0.11           0.26                    0.10  ...
4        20.29         14.34          135.10   1,297.00             0.10              0.13            0.20                 0.10           0.18                    0.06  ...

   worst radius  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
0         25.38          17.33           184.60    2,019.00              0.16               0.67             0.71                  0.27            0.46                     0.12
1         24.99          23.41           158.80    1,956.00              0.12               0.19             0.24                  0.19            0.28                     0.09
2         23.57          25.53           152.50    1,709.00              0.14               0.42             0.45                  0.24            0.36                     0.09
3         14.91          26.50            98.87      567.70              0.21               0.87             0.69                  0.26            0.66                     0.17
4         22.54          16.67           152.20    1,575.00              0.14               0.20             0.40                  0.16            0.24                     0.08

5 rows × 30 columns

In [5]:
# let's get a bit more info on the labels
# note: 1 = benign, 0 = malignant
target_cnt = y.value_counts()
print(target_cnt)
1    357
0    212
Name: target, dtype: int64
In [6]:
# or we could have used a countplot
# semi-colons weren't preventing output, so assigning to variables
fig = plt.figure(figsize=(8,6))
ax = sns.countplot(x=y, order=[0, 1])
pt = plt.title('Distribution of Outcomes')
px = plt.xlabel('Is Benign (1 = True)')
py = plt.ylabel('Count')

# annotate each bar with its count
for p in ax.patches:
    # alternative placement, just above the bar:
    # pv = ax.annotate(f'{p.get_height()}', (p.get_x(), p.get_height()+2))
    pv = ax.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='center', va='top', color='white', size=14)

[Figure: count plot of target variable by label]
In [7]:
# a deeper look at the attributes
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
dtypes: float64(30)
memory usage: 133.5 KB

Okay, no missing data. And the target is already a numeric value, rather than textual. That makes life easier.

One article I looked at normalized the data:

“most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1.”

“Building a Simple Machine Learning Model on Breast Cancer Data”,
by vishabh goel, Sep 29, 2018

I am going to go with the raw data, then perhaps normalize the attributes and have another look at the cross-validation and test scores.
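For reference, here is a rough sketch of how scaling to the 0–1 range mentioned in the quote could be combined with cross-validation. It is not something run for this post: the choice of MinMaxScaler, the use of a pipeline, and the names scaled_svm and scaled_scores are my assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)

# Putting the scaler in a pipeline means it is re-fit on the training folds
# of each iteration, so the validation fold never influences the scaling.
scaled_svm = make_pipeline(MinMaxScaler(), SVC())
scaled_scores = cross_val_score(scaled_svm, X_all, y_all, cv=5, scoring="accuracy")

print(f"scaled SVM: mean accuracy = {scaled_scores.mean():.3f}")
```

Scaling the whole dataset once before cross-validating would let information from each validation fold leak into training, which is why the sketch keeps the scaler inside the pipeline.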

Test and Training Data

Decided I’d look at how three basic algorithms do on this dataset: perceptron, support vector machine and stochastic gradient descent. We’ll look at data transformations (e.g. scaling) and parameter tuning in future posts — still using k-fold cross-validation.

I am also going to pull out a “test” dataset in order to test the models after the cross-validation to compare accuracy during training and on a test dataset. So, let’s start with that. As the data looks unordered, I am not going to sort it prior to creating my test data.

In [8]:
# I am going to split the full dataset in two, keeping 99 rows for a final test case
# since the data looks to be unordered, I am just going to pull out roughly the last 1/6 of the samples
y_trn = y[:470]
y_tst = y[470:]
X_trn = X[:470]
X_tst = X[470:]
print(len(y_trn), len(y_tst), len(X_trn), len(X_tst))
470 99 470 99
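Slicing off the last rows relies on the data really being unordered. As a hedged alternative, scikit-learn's train_test_split can shuffle the rows and keep the class proportions the same in both parts. The sketch below is an assumption about how that could look, with made-up names (X_trn2, X_tst2 and so on); it is not the split used in the rest of this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

# test_size=99 keeps the same 470/99 split sizes as the slice above;
# stratify=y_all keeps the benign/malignant ratio the same in both parts
X_trn2, X_tst2, y_trn2, y_tst2 = train_test_split(
    X_all, y_all, test_size=99, shuffle=True, stratify=y_all, random_state=0
)

print(len(X_trn2), len(X_tst2))
# share of benign (label 1) samples, for comparison with the simple slice
print("benign share, last-99 slice:    ", round(y_all.iloc[470:].mean(), 3))
print("benign share, stratified test:  ", round(y_tst2.mean(), 3))
print("benign share, stratified train: ", round(y_trn2.mean(), 3))
```

If the two benign shares for the slice and the stratified test set differ noticeably, that would suggest the row order is not as random as it looks.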

Compare Algorithms

Perceptron

In [9]:
# let's start with a quick look at training/validating the perceptron algorithm on this data set
alf = linear_model.Perceptron(tol=1e-3, random_state=0)
# using 5-fold, as that puts 94 samples in each fold; 10-fold would leave only a 47-sample validation set on each iteration
# and, no, I don't know if that is a correct assessment of the situation
scores = ms.cross_val_score(alf, X_trn, y_trn, cv=5, scoring="accuracy")
score = scores.mean()
print(f"Perceptron: min = {min(scores)}, max = {max(scores)}, mean = {score}")
Perceptron: min = 0.7978723404255319, max = 0.9148936170212766, mean = 0.8808510638297872
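The fold-size reasoning in the comment above can be checked directly. As far as I know, when cv is an integer and the estimator is a classifier, cross_val_score uses StratifiedKFold without shuffling, so the sketch below builds that splitter explicitly (with shuffling turned on) to list the validation fold sizes and to pass it in as cv. The names X_trn_demo, y_trn_demo, val_sizes and shuffled_scores are only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_all, y_all = load_breast_cancer(return_X_y=True)
X_trn_demo, y_trn_demo = X_all[:470], y_all[:470]   # same 470-row slice as above

# validation fold sizes for the two candidate values of k
fold_sizes = {}
for k in (5, 10):
    splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    val_sizes = [len(val_idx) for _, val_idx in splitter.split(X_trn_demo, y_trn_demo)]
    fold_sizes[k] = val_sizes
    print(f"k={k}: validation fold sizes = {val_sizes}")

# passing a splitter instead of an integer makes the shuffling explicit
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(
    Perceptron(tol=1e-3, random_state=0), X_trn_demo, y_trn_demo,
    cv=shuffled_cv, scoring="accuracy",
)
print(f"Perceptron, shuffled folds: mean = {shuffled_scores.mean():.3f}")
```

With only around 94 samples in a validation fold, a single misclassified sample moves that fold's accuracy by roughly one percentage point.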

Support Vector Machine

In [10]:
# let's look at basic SVM
svm = SVC()
scores1 = ms.cross_val_score(svm, X_trn, y_trn, cv=5, scoring="accuracy")
score1 = scores1.mean()
print(f"SVM: min = {min(scores1)}, max = {max(scores1)}, mean = {score1}")
SVM: min = 0.8723404255319149, max = 0.9468085106382979, mean = 0.9127659574468086

Stochastic Gradient Descent

In [11]:
# and finally let's look at SGD, using defaults for most parameters
sgm = linear_model.SGDClassifier(max_iter=25)
scores2 = ms.cross_val_score(sgm, X_trn, y_trn, cv=5, scoring="accuracy")
score2 = scores2.mean()
print(f"SGM: min = {min(scores2)}, max = {max(scores2)}, mean = {score2}")
SGM: min = 0.7978723404255319, max = 0.9468085106382979, mean = 0.8936170212765957
E:\appDev\Miniconda3\envs\ds-3.9\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py:574: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn("Maximum number of iteration reached before "
[the same ConvergenceWarning is repeated 4 more times, 5 in total]

Looks like we need to increase one of our parameters, max_iter, since the model is not converging within 25 iterations. Let’s try tripling it to 75.
In [12]:
# let's increase max_iter
sgm = linear_model.SGDClassifier(max_iter=75)
scores3 = ms.cross_val_score(sgm, X_trn, y_trn, cv=5, scoring="accuracy")
score3 = scores3.mean()
print(f"SGM: min = {min(scores3)}, max = {max(scores3)}, mean = {score3}")
SGM: min = 0.851063829787234, max = 0.9361702127659575, mean = 0.8914893617021278
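One caveat when comparing the two SGD runs: SGDClassifier shuffles the training data between epochs, and without a fixed random_state the scores will differ from run to run. The sketch below is my own illustration, not part of the original notebook. It repeats the cross-validation with a few fixed seeds to show how much of the difference may just be noise; seed_means and the seed values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X_all, y_all = load_breast_cancer(return_X_y=True)
X_trn_demo, y_trn_demo = X_all[:470], y_all[:470]   # same 470-row slice as above

seed_means = {}
for seed in (0, 1, 2):
    # random_state fixes the shuffling, so each seed gives a repeatable result
    sgd = SGDClassifier(max_iter=75, random_state=seed)
    seed_scores = cross_val_score(sgd, X_trn_demo, y_trn_demo, cv=5, scoring="accuracy")
    seed_means[seed] = seed_scores.mean()
    print(f"seed {seed}: mean accuracy = {seed_means[seed]:.3f}")

spread = max(seed_means.values()) - min(seed_means.values())
print(f"spread of the means across seeds: {spread:.3f}")
```

If the spread across seeds is as large as the gap between two settings of max_iter, the comparison between those settings is not telling you much.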

Conclusion?

Looks like the Support Vector Machine model is the best performing of the three, with the other two close behind. But the Perceptron model also had the widest range of fold accuracies: roughly 0.80 to 0.91, versus 0.87 to 0.95 for the SVM and 0.85 to 0.94 for SGD (with max_iter=75).
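To put a number on the spread, the range (max minus min) of each model's fold accuracies can be worked out from the values printed above. The figures in this small sketch are copied from the earlier cell outputs; the dictionary names are just for illustration.

```python
# (min, max) fold accuracies as printed by the cells above
reported = {
    "Perceptron": (0.7978723404255319, 0.9148936170212766),
    "SVM": (0.8723404255319149, 0.9468085106382979),
    "SGD (max_iter=75)": (0.851063829787234, 0.9361702127659575),
}

# range of fold accuracies per model
ranges = {name: hi - lo for name, (lo, hi) in reported.items()}

for name, rng in ranges.items():
    print(f"{name:<18} range = {rng:.3f}")
```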

Here are the three model cross-validation means together.

Perceptron: mean = 0.8808510638297872
SVM: mean = 0.9127659574468086
SGM: mean = 0.8914893617021278
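The last step described earlier, training the selected model on the complete training set and checking it against the held-out test rows, might look something like the sketch below. It assumes the same 470/99 slice and a default SVC, as in the cells above; the names final_svm and test_acc are hypothetical, and no result is quoted here since this was not run for the post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_trn_demo, y_trn_demo = X_all[:470], y_all[:470]   # training rows, as above
X_tst_demo, y_tst_demo = X_all[470:], y_all[470:]   # the 99 held-out rows

# train the selected model once, on the complete training set
final_svm = SVC()
final_svm.fit(X_trn_demo, y_trn_demo)

# accuracy on data the model never saw during cross-validation
test_acc = accuracy_score(y_tst_demo, final_svm.predict(X_tst_demo))
print(f"SVM accuracy on the {len(y_tst_demo)} held-out rows: {test_acc:.3f}")
```

Comparing test_acc with the cross-validation mean for the SVM gives a sense of how well the cross-validation estimate carried over to unseen data.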