As mentioned in the previous post:

In the next post I plan to look at what happens if we scale the data. And then see if we can improve either the SVM or SGM results with some hyperparameter tuning.

~~Using cross-validation of course.~~

But, do note, that given a bit of reading, this may be more involved than I expected. So, hyperparameter tuning may move to another post.

Setup

I am not going to show any of the basic notebook setup (imports, display settings and such). I do, as last time, plan to create a test dataset from the dataset provided by scikit-learn. But this time I am going to use sklearn.model_selection.train_test_split. I will set aside ~20% of the original training data as test data.

In [4]:

# let's split into training and test datasets
X.shape
X_trn, X_tst, y_trn, y_tst = ms.train_test_split(X, y, test_size=.2, random_state=42)
X_tst.shape
X_trn.head()
X_tst.head()

Out[4]:

(569, 30)

Out[4]:

(114, 30)

Out[4]:

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
68	9.03	17.33	58.79	250.50	0.11	0.14	0.31	0.04	0.21	0.08	...	10.31	22.65	65.50	324.70	0.15	0.44	1.25	0.17	0.42	0.12
181	21.09	26.57	142.70	1,311.00	0.11	0.28	0.25	0.15	0.24	0.07	...	26.68	33.48	176.50	2,089.00	0.15	0.76	0.68	0.29	0.41	0.13
63	9.17	13.86	59.20	260.90	0.08	0.09	0.06	0.02	0.23	0.07	...	10.01	19.23	65.59	310.10	0.10	0.17	0.14	0.05	0.33	0.08
248	10.65	25.22	68.01	347.00	0.10	0.07	0.02	0.02	0.19	0.06	...	12.25	35.19	77.98	455.70	0.15	0.14	0.11	0.06	0.34	0.08
60	10.17	14.88	64.55	311.90	0.11	0.08	0.01	0.01	0.27	0.07	...	11.02	17.45	69.86	368.60	0.13	0.10	0.02	0.03	0.36	0.08

5 rows × 30 columns

Out[4]:

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
204	12.47	18.60	81.09	481.90	0.10	0.11	0.08	0.04	0.19	0.06	...	14.97	24.64	96.05	677.90	0.14	0.24	0.27	0.10	0.30	0.09
70	18.94	21.31	123.60	1,130.00	0.09	0.10	0.11	0.08	0.16	0.05	...	24.86	26.58	165.90	1,866.00	0.12	0.23	0.27	0.18	0.26	0.07
131	15.46	19.48	101.70	748.90	0.11	0.12	0.15	0.08	0.19	0.06	...	19.26	26.00	124.90	1,156.00	0.15	0.24	0.38	0.15	0.28	0.08
431	12.40	17.68	81.47	467.80	0.11	0.13	0.08	0.03	0.18	0.07	...	12.88	22.91	89.61	515.80	0.14	0.26	0.24	0.07	0.26	0.09
540	11.54	14.44	74.65	402.90	0.10	0.11	0.07	0.03	0.18	0.07	...	12.26	19.68	78.78	457.80	0.13	0.21	0.18	0.07	0.23	0.08

5 rows × 30 columns

Scaling Methods

I plan to look at the following scaling methods:

Min-Max Scaler: works well if the standard deviation is small and the data distribution is not Gaussian. Is sensitive to outliers.
Standard Scaler: if data is not normally distributed, this is not the best scaler to use.
Robust Scaler: specifically designed to be robust to outliers, scaling on the IQR of the data.

Might also look at:

Quantile Transformer Scaler: can distort linear correlations between attributes measured on a similar scale, but makes attributes at different scales more comparable.

I am also trying to determine whether or not it is acceptable to scale individual attributes using different methods. But so far have not seen any posts/articles suggesting or describing that approach.

Given the limits of each method, I had originally planned to check the individual attributes for outliers and whether or not it is probably taken from a gaussian distribution. Then select a specific scaler. But, one article suggested you just test them all and pick the one that performs best. So, that is what I think I will do. I may later get back to trying to determine attributes with significant outliers or from non-gaussian distributions.

ScalIng Our Data

I plan to test at least the first 3 methods listed above. And to test the 3 supervised learning models against each set of transformed data.

But first, a note of caution:

A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set.

sklearn.preprocessing.scale

I’ll start by instantiating and/or declaring some variables I will use later on.

In [8]:

# let's instantiate our 3 models
plf = linear_model.Perceptron(tol=1e-3, random_state=0)
svm = SVC(random_state=0)
sgm = linear_model.SGDClassifier(max_iter=75, random_state=0)
mlms = [plf, svm, sgm]
# and our 4 scaler method variables
mnxs = preprocessing.MinMaxScaler()
stds = preprocessing.StandardScaler()
robs = preprocessing.RobustScaler()
qts = preprocessing.QuantileTransformer(n_quantiles=50)

Min-Max Scaler

In [9]:

# let's scale our data with min-max and train and test against the transformed data
# I am creating copies of the data in each case, you may not want to do that
mnx_scores = []
X_trn_mnx = mnxs.fit_transform(X_trn)
display(X_trn_mnx[:5,:5])
X_tst_mnx = mnxs.transform(X_tst)
for mlm in mlms:
  m_f = mlm.fit(X_trn_mnx, y_trn);
  m_preds = mlm.predict(X_tst_mnx)
  mnx_scores.append(accuracy_score(y_tst, m_preds))

array([[0.06552721, 0.25769361, 0.07732252, 0.03436883, 0.48722578],
       [0.65620256, 0.57017247, 0.67420686, 0.48940187, 0.55493365],
       [0.07257946, 0.14034494, 0.08023901, 0.0388312 , 0.22190124],
       [0.14491405, 0.52451809, 0.14290795, 0.07577448, 0.3966778 ],
       [0.12140653, 0.17483936, 0.11829563, 0.06071398, 0.54861425]])

Standard Scaler

In [10]:

# now scale with standard scaler and test models
ss_scores = []
X_trn_std = stds.fit_transform(X_trn)
display(X_trn_std[:5,:5])
X_tst_std = stds.transform(X_tst)
for mlm in mlms:
  m_f = mlm.fit(X_trn_std, y_trn);
  m_preds = mlm.predict(X_tst_std)
  ss_scores.append(accuracy_score(y_tst, m_preds))

array([[-1.44075296, -0.43531947, -1.36208497, -1.1391179 ,  0.78057331],
       [ 1.97409619,  1.73302577,  2.09167167,  1.85197292,  1.319843  ],
       [-1.39998202, -1.24962228, -1.34520926, -1.10978518, -1.33264483],
       [-0.98179678,  1.41622208, -0.98258746, -0.86694414,  0.05938999],
       [-1.11769991, -1.0102595 , -1.12500192, -0.96594206,  1.26951116]])

Robust Scaler

In [11]:

# and finally, robust scaler
rs_scores = []
X_trn_rob = robs.fit_transform(X_trn)
display(X_trn_rob[:5,:5])
X_tst_rob = robs.transform(X_tst)
for mlm in mlms:
  m_f = mlm.fit(X_trn_rob, y_trn);
  m_preds = mlm.predict(X_tst_rob)
  rs_scores.append(accuracy_score(y_tst, m_preds))

array([[-1.05848823, -0.24930748, -0.94904014, -0.86726173,  0.63978638],
       [ 1.93060719,  1.45706371,  1.97975567,  2.18629427,  1.04032043],
       [-1.0228005 , -0.89012004, -0.93472949, -0.83731644, -0.92977303],
       [-0.65675341,  1.20775623, -0.62722513, -0.58940397,  0.10413885],
       [-0.77571252, -0.70175439, -0.74799302, -0.69046933,  1.00293725]])

Quantile Transformer

Those went fairly quickly, so let’s also use that fourth scaling technique.

In [12]:

# that was pretty quick, let's go with the 4 scaler method
qts_scores = []
X_trn_qts = qts.fit_transform(X_trn)
display(X_trn_qts[:5,:5])
X_tst_qts = qts.transform(X_tst)
for mlm in mlms:
  m_f = mlm.fit(X_trn_qts, y_trn);
  m_preds = mlm.predict(X_tst_qts)
  qts_scores.append(accuracy_score(y_tst, m_preds))

array([[0.03109881, 0.35373437, 0.03655297, 0.03282051, 0.78926441],
       [0.96332706, 0.9369511 , 0.96702779, 0.95393664, 0.90260553],
       [0.03621535, 0.08154944, 0.03851789, 0.03972439, 0.07131895],
       [0.12640858, 0.91838419, 0.11986995, 0.12623271, 0.54127496],
       [0.09649916, 0.13902961, 0.08876363, 0.09297491, 0.89253229]])

Raw Data

For comparison purposes, let’s also train the algorithms on original, un-scaled training data. And, test against the un-scaled test data.

In [13]:

# for comparison let's get the scors for the raw, un-modified features
raw_scores = []
for mlm in mlms:
  m_f = mlm.fit(X_trn, y_trn);
  m_preds = mlm.predict(X_tst)
  raw_scores.append(accuracy_score(y_tst, m_preds))

Scores Using Single Train/Test Split

Okay, now let’s have a look at how the 3 models faired using the raw data and the variously scaled datasets.

In [14]:

s_scores = pd.DataFrame([raw_scores, mnx_scores, ss_scores, rs_scores, qts_scores],
  columns=['Perceptron', 'SVM', 'SGD'],
  index=["No Scaling", "Min-Max", "Standard Scaler", "Robust Scaler", "Quantile"])
pd.options.display.float_format = '{:,.4f}'.format
display(s_scores)

	Perceptron	SVM	SGD
No Scaling	0.9474	0.9474	0.7281
Min-Max	0.9561	0.9737	0.9649
Standard Scaler	0.9737	0.9825	0.9825
Robust Scaler	0.9561	0.9649	0.9737
Quantile	0.8684	0.9737	0.9825

Looks like using the Standard Scaler produced the best results for all 3 supervised learning models. And, with the given training and test data SVM and SGM appear to have done equally well under Standard Scaling. With the raw data, Perceptron and SVM appear to have done equally well.

Cross-Validation

I previously said I’d be conducting these tests using k-fold cross-validation. Clearly, I misled everyone; myself included.

So, let’s have a look at using cross-validation using the complete dataset for all the models and scaler methods.

First a few declarations or perhaps re-declarations (the latter for local visibility).

In [19]:

# okay, let's do cross val on complete data set for all scalers and models
mlms = [plf, svm, sgm]
scls = [mnxs, stds, robs, qts]
df_ndx = ["No Scaling", "Min-Max", "Standard Scaler", "Robust Scaler", "Quantile"]
df_cols = ['Perceptron', 'SVM', 'SGD']
xv_scores = {ssm: [] for ssm in df_ndx}

Now a wee function to save some coding.

In [20]:

def get_xval_mean(X_arr, y_vec, cv=5, scoring="accuracy"):
  prc_scr = ms.cross_val_score(plf, X_arr, y_vec, cv=cv, scoring=scoring)
  prc_scr = np.append(prc_scr, prc_scr.mean())
  svm_scr = ms.cross_val_score(svm, X_arr, y, cv=cv, scoring=scoring)
  svm_scr = np.append(svm_scr, svm_scr.mean())
  sgm_scr = ms.cross_val_score(sgm, X_arr, y_vec, cv=cv, scoring=scoring)
  sgm_scr = np.append(sgm_scr, sgm_scores2.mean())
  return [prc_scr.mean(), svm_scr.mean(), sgm_scr.mean()]

And, now scale our full dataset. Note, already done with the standard scaler in a previous cell (not included in this post).

In [21]:

# X_trn_std2 = stds.fit_transform(X)
X_trn_mnx2 = mnxs.fit_transform(X)
X_trn_rob2 = robs.fit_transform(X)
X_trn_qts2 = qts.fit_transform(X)
trn_sets = [X, X_trn_mnx2, X_trn_std2, X_trn_rob2, X_trn_qts2]

Let’s run all those cross-validations.

In [22]:

for i in range(len(df_ndx)):
  xv_scores[df_ndx[i]] = get_xval_mean(trn_sets[i], y)
print(xv_scores)

{'No Scaling': [0.8839465921440769, 0.9121720229777983, 0.7901154065103763], 'Min-Max': [0.9612948299953423, 0.9736531594472907, 0.9716089634114785], 'Standard Scaler': [0.9648501785437045, 0.9736376339077782, 0.9525384257102935], 'Robust Scaler': [0.964865704083217, 0.9736376339077782, 0.9628111576877295], 'Quantile': [0.9596180717279924, 0.9771619313771155, 0.9539616001656057]}

In [23]:

s_scores4 = pd.DataFrame(xv_scores.values(),
  columns=df_cols,
  index=df_ndx)
#pd.options.display.float_format = '{:,.4f}'.format
display(s_scores4)

	Perceptron	SVM	SGD
No Scaling	0.8839	0.9122	0.7901
Min-Max	0.9613	0.9737	0.9716
Standard Scaler	0.9649	0.9736	0.9525
Robust Scaler	0.9649	0.9736	0.9628
Quantile	0.9596	0.9772	0.9540

Not quite like the results with got with the simple train/test split we used at first. But hard to say it is significantly different. At two decimal places, there would not be a lot of difference for any of the results using scaling. That said, scaling did more significantly beat no scaling than any particular scaler beat any of the others.

Though, in the earlier results, there seemed to be a more definitive best choice. I.E. the Standard Scaler. I am sure it has a lot to do with the difference in training and test data. I also expect that perhaps the cross-validation results are more indicative of the expected real results. But…

Sorry, But That’s It

I am working on the homework for Unit 2 in the oft mentioned machine learning course, and it is taking a bit of my time. And, I also have to get the Unit 2 project completed over the next week or so. Consquently that is it for this post.

Though I have to admit I am not as happy with it as I was with the previous one. Such is life. (Though I did, as usual, enjoy working on the code for the notebook/post.)

Feel free to download and play with my version of this post’s related notebook.

Resources

sklearn.model_selection.train_test_split
sklearn.model_selection.cross_val_score
sklearn.preprocessing: Preprocessing and Normalization API sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
sklearn.preprocessing.scale
sklearn.preprocessing.RobustScaler
sklearn.preprocessing.QuantileTransformer
Compare the effect of different scalers on data with outliers
Data Transformation: Standardization vs Normalization
All about Feature Scaling
Scale, Standardize, or Normalize with Scikit-Learn
Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization
A Gentle Introduction to Normality Tests in Python
Building a Simple Machine Learning Model on Breast Cancer Data
Breast Cancer Dataset Analysis, Visualization and Machine Learning in Python

Too Old To Code

Data Science Basics: Not So Cross-Validation, Part II

Setup

Scaling Methods

ScalIng Our Data

Min-Max Scaler

Standard Scaler

Robust Scaler

Quantile Transformer

Raw Data

Scores Using Single Train/Test Split

Cross-Validation

Sorry, But That’s It

Resources