Okay, let’s have a look at using cross-validation to determine the best values for model hyperparameters. We will only be looking at the SVM and SGD classifiers, as the Perceptron model doesn’t have any hyperparameters. Not that the other two have that many; some models have considerably more.
I will also only scale the data using the Standard Scaler. Refer to the related notebook for the code/notebook setup (imports, loading data, train/test split, scaling, etc.).
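For reference, here is a minimal sketch of the kind of setup the rest of the code assumes (the names X_trn_std, X_tst_std, y_trn and y_tst are the ones used below). The actual data loading and split live in the related notebook; the breast cancer dataset, test_size and random_state here are stand-ins of mine, not what the notebook uses.
# minimal setup sketch; the breast cancer data is only a stand-in for the
# dataset actually loaded in the related notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn import model_selection as ms
from sklearn.datasets import load_breast_cancer  # stand-in dataset
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# train/test split
X_trn, X_tst, y_trn, y_tst = ms.train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# scale the features, fitting the Standard Scaler on the training data only
std_sclr = StandardScaler().fit(X_trn)
X_trn_std = std_sclr.transform(X_trn)
X_tst_std = std_sclr.transform(X_tst)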
Hyperparameter Tuning
Parameters which define the model architecture are referred to as hyperparameters and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning.
…
I want to be absolutely clear, hyperparameters are not model parameters and they cannot be directly trained from the data.
Hyperparameter tuning for machine learning models, Jeremy Jordan, 2017.11.02
SVM
Let’s start with svm.SVC.
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
The hyperparameter we are likely most interested in is C. We will stick with the defaults for the other function parameters.
C: float, default=1.0
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
I propose that we pick a set of possible values, run cross-validation on the training dataset for each of those values, select the best 2 or 3 results, fit models using those values, and measure the performance against the test dataset for each.
Tuning
I wanted to see how long a single 5-fold cross-validation would take. The notebook said 0.1 s.
# let's see how long a single cross val, k=5, takes
svm = SVC(C=0.5, random_state=0)
svm_scr = ms.cross_val_score(svm, X_trn_std, y_trn, cv=5, scoring="accuracy")
print(svm_scr)
Let’s create a list of potential regularization values, and then run cross-validation against the SVM model using each of them.
# let's start with the C hyperparameter for svm.SVC
cv_scores_1 = {}
reg_cs_1 = [i/1000 for i in range(1, 10, 2)]
reg_cs_2 = [i/1000 for i in range(10, 100, 20)]
reg_cs_3 = [i/1000 for i in range(100, 1000, 200)]
reg_cs_4 = [i/1000 for i in range(1000, 10000, 2000)]
reg_cs = np.concatenate([reg_cs_1, reg_cs_2, reg_cs_3, reg_cs_4])
for cr in reg_cs:
    svm = SVC(C=cr, random_state=0)
    svm_scr = ms.cross_val_score(svm, X_trn_std, y_trn, cv=5, scoring="accuracy")
    cv_scores_1[str(cr)] = svm_scr
And, let’s look at the results.
for lbl, scrs in cv_scores_1.items():
    print(f"{lbl}: {[round(scr,5) for scr in scrs]} -> {scrs.mean()}")
Ok, the scores still look to be going up, so let’s try a few more values. I am going to use integer values in range(10, 20, 2).
# looks like it just keeps getting better, let's try a few more values, 10 and up
reg_cs_5 = [i for i in range(10, 20, 2)]
reg_cs_2 = np.concatenate([reg_cs, reg_cs_5])
cv_scores_2 = {}
for cr in reg_cs_2:
    svm = SVC(C=cr, random_state=0)
    svm_scr = ms.cross_val_score(svm, X_trn_std, y_trn, cv=5, scoring="accuracy")
    cv_scores_2[str(cr)] = svm_scr
for lbl, scrs in cv_scores_2.items():
    print(f"{lbl}: {[round(scr,5) for scr in scrs]} -> {scrs.mean()}")
sv_means = [scrs.mean() for _, scrs in cv_scores_2.items()]
ax = plt.axes()
ax.plot(reg_cs_2, sv_means, color='k', linestyle='-')
ax.scatter(reg_cs_2, sv_means, color='blue', linestyle='-')
Looks like the best values are in the range 7 <= C <= 9. Given the SVC function default is C=1, let’s test with C=1 and C=7.
Testing
# looks like C>=7 and C<=9 gives us the best results
# C=1 is the default, so let's test using C=1 & C=7
c_tst = [1.0, 7.0]
ct_scr = {}
for ct in c_tst:
    svm = SVC(C=ct, random_state=0)
    t_f = svm.fit(X_trn_std, y_trn)
    t_preds = svm.predict(X_tst_std)
    ct_scr[str(ct)] = accuracy_score(y_tst, t_preds)
for lbl, scr in ct_scr.items():
    print(f"{lbl}: {scr}")
And, there you go: our tuning attempt appears to have failed. I guess real-world data makes things complicated. I suspect the training/testing dataset split we created is the likely culprit, so I will have to do some more research. But that isn’t going to stop us from trying to tune the SGD classifier.
SGD
Okay, onto sklearn.linear_model.SGDClassifier.
class sklearn.linear_model.SGDClassifier(loss='hinge', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False)
Once again, there is only one obvious hyperparameter. We might also want to look at learning_rate, and there might be other method parameters that would qualify for tuning as well.
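If we did want to poke at learning_rate, the same cross-validation loop used above would do the job. Here is a quick, untested sketch; the eta0 value and the list of learning rate schedules to try are assumptions of mine, not something from the original notebook.
# untested sketch: cross-validate the learning_rate schedules, leaving alpha at
# its default; eta0=0.01 is an arbitrary pick ('constant', 'invscaling' and
# 'adaptive' all require eta0 > 0, 'optimal' ignores it)
lr_scores = {}
for lr in ['optimal', 'constant', 'invscaling', 'adaptive']:
    sgd = linear_model.SGDClassifier(learning_rate=lr, eta0=0.01,
                                     max_iter=1000, random_state=0)
    lr_scr = ms.cross_val_score(sgd, X_trn_std, y_trn, cv=5, scoring="accuracy")
    lr_scores[lr] = lr_scr.mean()
for lbl, scr in lr_scores.items():
    print(f"{lbl}: {scr}")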
For the actual tuning, let’s go with the first, alpha, the regularization penalty. It defaults to a value considerably lower than that used for the SVM model. Based on the sklearn documentation, I think we will try the following values for alpha: 10.0**-np.arange(1,7).
cv_scores_3 = {}
reg_alphas = 10.0**-np.arange(1,7)
for alpha in reg_alphas:
    sgd = linear_model.SGDClassifier(alpha=alpha, max_iter=1000, random_state=0)
    sgd_scr = ms.cross_val_score(sgd, X_trn_std, y_trn, cv=5, scoring="accuracy")
    cv_scores_3[str(alpha)] = sgd_scr
for lbl, scrs in cv_scores_3.items():
    print(f"{lbl}: {[round(scr,5) for scr in scrs]} -> {scrs.mean()}")
gd_means = [scrs.mean() for _, scrs in cv_scores_3.items()]
ax = plt.axes()
ax.plot(reg_alphas, gd_means, color='k', linestyle='-')
ax.scatter(reg_alphas, gd_means, color='blue', linestyle='-')
Ok, let’s fit a model on the training dataset, using alpha=0.01, and measure its performance against the test dataset.
sgd = linear_model.SGDClassifier(alpha=0.010, max_iter=1000, random_state=0)
sg_f = sgd.fit(X_trn_std, y_trn)
sg_preds = sgd.predict(X_tst_std)
print(f'SGD, alpha=0.010 -> {accuracy_score(y_tst, sg_preds)}')
That score is pretty much the same as the SVM model with C=1.
If we wanted to tune multiple hyper-parameters, a reasonable approach would be to find the best value for one of them, then, using that selected value, tune the second, and repeat for as many parameters as you wish to test. That’s where the scikit-learn methods really help: they do them all at one time, at the expense of computing cycles of course.
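Before getting to those, here is a rough sketch of the one-hyperparameter-at-a-time approach, using the SVC model’s C and gamma as an example. The candidate values are arbitrary picks of mine, and I did not run this for the post.
# untested sketch: tune C first, freeze the best value, then tune gamma
c_cands = [0.1, 1.0, 5.0, 10.0]
gamma_cands = [0.001, 0.01, 0.1, 'scale']

# step 1: cross-validate over C with gamma left at its default
c_means = {}
for c in c_cands:
    scrs = ms.cross_val_score(SVC(C=c, random_state=0),
                              X_trn_std, y_trn, cv=5, scoring="accuracy")
    c_means[c] = scrs.mean()
best_c = max(c_means, key=c_means.get)

# step 2: with C fixed at best_c, cross-validate over gamma
g_means = {}
for g in gamma_cands:
    scrs = ms.cross_val_score(SVC(C=best_c, gamma=g, random_state=0),
                              X_trn_std, y_trn, cv=5, scoring="accuracy")
    g_means[g] = scrs.mean()
best_gamma = max(g_means, key=g_means.get)
print(f"best C: {best_c}, best gamma: {best_gamma}")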
scikit-learn Hyper-parameter Optimizers
scikit-learn provides a number of hyper-parameter optimizers. The two I’ve seen most often in online articles/posts/tutorials are:
- model_selection.GridSearchCV
- model_selection.RandomizedSearchCV
GridSearchCV pretty much does what we did. The randomized version is probably more efficient time-wise if you have a lot of possibilities. Both accept a dictionary providing the parameters to tune, which means they can tune multiple parameters at the same time. The grid version will literally test every possible combination. The randomized version will sample a given number of candidates from a parameter space with a specified distribution.
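For example, a grid search over both of the SVC hyperparameters from the sketch above might look something like this (again untested, and the candidate values are only for illustration).
# untested sketch: GridSearchCV cross-validates every C/gamma combination in one go
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1.0, 5.0, 10.0],
              'gamma': [0.001, 0.01, 0.1, 'scale']}
gsearch = GridSearchCV(estimator=SVC(random_state=0), param_grid=param_grid,
                       cv=5, scoring="accuracy")
gsearch.fit(X_trn_std, y_trn)
print(gsearch.best_params_, gsearch.best_score_)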
I thought I might try one of them out, but as of the afternoon of 2021.10.11 I am not sure. I might leave it for later posts when I am actually working on a machine learning exercise.
sklearn.model_selection.RandomizedSearchCV
Ah, I decided to at least have a quick look at one of the scikit-learn methods. I will use scipy.stats.uniform to provide the distribution function for the alpha parameter in the SGDClassifier model.
# let's try one of sklearn's tuning methods
from scipy.stats import uniform as sp_rand
from sklearn.model_selection import RandomizedSearchCV
param_grid = {'alpha': sp_rand()}
sgd = linear_model.SGDClassifier(max_iter=1000, random_state=0)
rsearch = RandomizedSearchCV(estimator=sgd, param_distributions=param_grid, n_iter=100)
# run the search on the training data, keeping the test set for the final check
rsearch.fit(X_trn_std, y_trn)
rs_tbl = pd.DataFrame(rsearch.cv_results_)
display(rs_tbl)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
# And test the suggested value
sgd_sr = linear_model.SGDClassifier(alpha=rsearch.best_estimator_.alpha, max_iter=1000, random_state=0)
sgsr_f = sgd_sr.fit(X_trn_std, y_trn)
sgsr_preds = sgd_sr.predict(X_tst_std)
print(f'SGD, alpha={rsearch.best_estimator_.alpha:.4f} -> {accuracy_score(y_tst, sgsr_preds)}')
Pretty much in line with the number we got using regular cross-validation on a selection of potential values.
sklearn.model_selection.RandomizedSearchCV may have taken a little longer to run, but it looks to have tested a lot more values.
Done
And with that, I do think this post is ready to be finished. It is perhaps not horribly informative; maybe a different problem and a model with more hyper-parameters would have been a better choice. But it does provide a workable introduction to the concepts and tools available.
Feel free to download and play with my version of this post’s related notebook.
Resources
- sklearn.svm.SVC
- sklearn.linear_model.SGDClassifier
- sklearn.model_selection.RandomizedSearchCV
- Tuning the hyper-parameters of an estimator
- Hyperparameter tuning for machine learning models
- Nested Cross-Validation for Machine Learning with Python
- How to make SGD Classifier perform as well as Logistic Regression using parfit
- K- Fold Cross Validation For Parameter Tuning
- scipy.stats.uniform