Okay, going to continue with my stepping sideways. Just can’t seem to come up with the energy to tackle machine learning once again.
I have been trying to figure out how to best use the base Titanic template. Especially dealing with the various variables created and functions written for some pretty specific purposes. Don’t really want them cluttering the template.
You, of course, are screaming “use packages.” Took me a while, but I did eventually get there. The problem I am having is how to structure the package and what its modules should look like. But, I do think it is something that’s worth a look. So I am going to devote a post to my initial attempt. Expect there will be changes over time. And, perhaps the package/modules can be used in future machine learning projects. That would definitely be a nice bonus.
My apologies, in advance, for any terminology mistakes I make as I go along. I have been somewhat lax at getting a proper handle on Python and other programming lingo.
I am going to put the package in the directory rek_ml (for rek, that's me, and machine learning). The package name is likely going to be the same. Any modules specifically related to the Titanic dataset and posts will have a trailing _t in the module name.
Initial Plan
I am going to put all the variables related to the Titanic dataset into one module, dsets_t.py. The functions will likely go into one or more modules depending on how generic they are or how tied they are to the Titanic dataset related work. I may also split them by function. For example, all the scoring/cross validation code in one module, display related code in another.
I looked into possibly creating a local conda/PyPI package, which would allow me to install it into virtual environments and so use it in various projects more easily. But for now I have decided to just add the package directory under my working directory. If I do end up using it in other projects, I will deal with the situation then.
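For what it's worth, here is a minimal sketch of why dropping the package directory under the working directory is enough, and of one way another project could reach it without building a conda/PyPI package. Python looks for packages in the directories listed in sys.path, and the directory a script or notebook runs from is normally on that list. The temporary directory, the demo_pkg package name and the dsets_demo module below are all made up for illustration; none of this is how rek_ml itself is set up.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# build a throwaway package (hypothetical name: demo_pkg) in a temp directory
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").touch()
    (pkg_dir / "dsets_demo.py").write_text('kaggle_trn = "./data/titanic/train.csv"\n')

    # the PARENT directory of the package must be on sys.path for the import to work
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        dsets = importlib.import_module("demo_pkg.dsets_demo")
        found_path = dsets.kaggle_trn
    finally:
        sys.path.remove(tmp)

print(found_path)
```

Editing sys.path by hand is a stopgap; a proper install into the environment is the tidier route if the package ends up shared across projects.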
Directory Structure and Initial Contents
I am just going to start with a single package with multiple modules. That may change as I go along. Something like:
rek_ml
│
├── __init__.py
├── dsets_t.py
├── ml_cv.py
└── ml_misc.py
First Module
Let’s start with the dataset related variables. They take a fair bit of space in the template notebook. And, I will use this first module to confirm that my approach will work.
I’ll start in a terminal window.
(base) PS R:\learn\py_play> cd ..\ds_intro
(base) PS R:\learn\ds_intro> conda activate ds-3.9
(ds-3.9) PS R:\learn\ds_intro> md rek_ml
(ds-3.9) PS R:\learn\ds_intro> cd rek_ml
(ds-3.9) PS R:\learn\ds_intro\rek_ml> type nul > __init__.py
type : Cannot find path 'R:\learn\ds_intro\rek_ml\nul' because it does not exist.
At line:1 char:1
+ type nul > __init__.py
+ ~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : ObjectNotFound: (R:\learn\ds_intro\rek_ml\nul:String) [Get-Content], ItemNotFoundExcepti
on
+ FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetContentCommand
(ds-3.9) PS R:\learn\ds_intro\rek_ml> type nul > dsets_t.py
... skip error msg ...
(ds-3.9) PS R:\learn\ds_intro\rek_ml> dir
Directory: R:\learn\ds_intro\rek_ml
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2022-01-13 13:08 0 dsets_t.py
-a---- 2022-01-13 13:04 0 __init__.py
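The error comes from using a cmd.exe idiom in PowerShell: there, type is an alias for Get-Content, which goes looking for a file named nul. The redirection still created the empty files, as the directory listing shows. Since I am using PowerShell, I likely should have used something like the following. New-Item, alias ni, should create the file as UTF-8. The -type parameter is required. You’ll be asked if it’s not there.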
ni __init__.py -type file
Do note that the empty __init__.py is important: it is what marks the directory as a regular package. Check the docs.
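If you would rather sidestep the shell differences altogether, the same empty files can be created from Python. This is only a sketch: make_pkg_skeleton is a hypothetical helper, and it works in a temporary directory so nothing real gets touched. Only the file names come from the plan above.

```python
import tempfile
from pathlib import Path

def make_pkg_skeleton(parent, pkg_name, module_names):
    """Create pkg_name under parent with an empty __init__.py and empty modules."""
    pkg_dir = Path(parent) / pkg_name
    pkg_dir.mkdir(parents=True, exist_ok=True)
    for name in ["__init__.py", *module_names]:
        # touch() creates a zero-length file; it does not overwrite existing content
        (pkg_dir / name).touch(exist_ok=True)
    return pkg_dir

with tempfile.TemporaryDirectory() as tmp:
    pkg = make_pkg_skeleton(tmp, "rek_ml", ["dsets_t.py", "ml_cv.py", "ml_misc.py"])
    created = sorted(p.name for p in pkg.iterdir())
    sizes = [p.stat().st_size for p in pkg.iterdir()]

print(created)
```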
Now, in VSCode, I copied over the pertinent code to dsets_t.py. I also copied over the reduced feature sets we tested in the previous post. May as well have those at hand just in case.
# Package: rek_ml, module: dsets_t.py, version: 0.1.0
# paths to datasets of possible interest
kaggle_trn = "./data/titanic/train.csv"
kaggle_tst = "./data/titanic/test.csv"
oma_trn_3 = "./data/titanic/oma_trn_3.csv"
oma_tst_3 = "./data/titanic/oma_tst_3.csv"
oma_trn_4 = "./data/titanic/oma_trn_4.csv"
oma_tst_4 = "./data/titanic/oma_tst_4.csv"
oma_trn_5 = "./data/titanic/oma_trn_5.csv"
oma_tst_5 = "./data/titanic/oma_tst_5.csv"
# define data types for current set of features
d_types_4 = {
"PassengerId": "int16", "Survived": "uint8", "Pclass": "uint8", "Name": "string", "Sex": "category",
"Age": "float32", "SibSp": "uint16", "Parch": "uint16", "Ticket": "string", "Fare": "float32",
"Cabin": "string", "Embarked": "category", "FamilySize": "uint16", "Group": "uint16", "Title": "category",
"iFare": "float32", "AgeBin": "category", "AgeMissing": "uint8", "logFare": "float64", "logiFare": "float64",
"Sex_enc": "uint8", "AgeBin_enc": "uint8", "Emb_C": "uint8", "Emb_Q": "uint8", "Emb_S": "uint8",
"Ttl_Master": "uint8", "Ttl_Miss": "uint8", "Ttl_Mr": "uint8", "Ttl_Mrs": "uint8",
"Ttl_Noble": "uint8", "Ttl_Official": "uint8"
}
d_types_5 = {
"PassengerId": "int16", "Survived": "uint8", "Pclass": "uint8", "Name": "string", "Sex": "category",
"Age": "float32", "SibSp": "float32", "Parch": "float32", "Ticket": "string", "Fare": "float32",
"Cabin": "string", "Embarked": "category", "FamilySize": "float32", "Group": "float32", "Title": "category",
"iFare": "float32", "AgeBin": "category", "AgeMissing": "uint8", "logFare": "float64", "logiFare": "float64",
"Sex_enc": "uint8", "AgeBin_enc": "uint8", "Emb_C": "uint8", "Emb_Q": "uint8", "Emb_S": "uint8",
"Ttl_Master": "uint8", "Ttl_Miss": "uint8", "Ttl_Mr": "uint8", "Ttl_Mrs": "uint8",
"Ttl_Noble": "uint8", "Ttl_Official": "uint8"
}
# let's make a list of all possible features (numeric) that can be used in our modelling
full_trn_features = ["Age", "Parch", "Pclass", "Sex_enc", "SibSp",
"FamilySize", "Group", "AgeBin_enc", "AgeMissing", "logFare", "logiFare",
"Emb_C", "Emb_Q", "Emb_S",
"Ttl_Master", "Ttl_Miss", "Ttl_Mr", "Ttl_Mrs", "Ttl_Noble", "Ttl_Official"
]
# and a list of some of the reduced feature sets we have been investigating
rdc_ds = {
'full': full_trn_features,
'base(4)': ['Pclass', 'Sex_enc', 'SibSp', 'Parch'],
'guess(6)': ['Age', 'Pclass', 'Sex_enc', 'FamilySize', 'Group', 'logiFare'],
'guess(7)': ['Age', 'Pclass', 'Sex_enc', 'FamilySize', 'Group', 'logiFare', 'Ttl_Mr'],
'RFE(5)': ['Age', 'Pclass', 'Group', 'logiFare', 'Ttl_Mr'],
'RFE(8)': ['Age', 'Pclass', 'Sex_enc', 'FamilySize', 'Group', 'logFare', 'logiFare', 'Ttl_Mr'],
'RFE(9)': ['Age', 'Pclass', 'Sex_enc', 'FamilySize', 'Group', 'logFare', 'logiFare', 'Ttl_Miss', 'Ttl_Mr'],
'RFE(14)': ['Age', 'Pclass', 'Sex_enc', 'SibSp', 'FamilySize', 'Group', 'logFare', 'logiFare',
'Emb_Q', 'Emb_S', 'Ttl_Miss', 'Ttl_Mr', 'Ttl_Mrs', 'Ttl_Official']
}
I removed all those cells from the base Titanic notebook template, titanic_base.ipynb. I added the import, modified the code to use the package data and added a simple test to make sure things more or less worked. That all looks like this.
from IPython.core.interactiveshell import InteractiveShell
import math
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
# Feature and Model Selection:
import lightgbm as lgb
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression, SGDClassifier # linear classifiers
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold # train/test splitting tool for cross-validation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc # scoring metrics
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import FeatureUnion, make_pipeline, Pipeline
# My own packages/modules
import rek_ml.dsets_t as dst
# set up some notebook display defaults
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.style.use('default')
sns.set()
pd.options.display.float_format = '{:,.2f}'.format
# load the datasets currently of interest
k_trn = pd.read_csv(dst.oma_trn_5, dtype=dst.d_types_5)
y_trn = k_trn["Survived"]
_ = k_trn.drop("Survived", axis=1, inplace=True)
# _ = k_trn.drop("Cabin", axis=1, inplace=True)
k_tst = pd.read_csv(dst.oma_tst_5, dtype=dst.d_types_5)
y_tst = k_tst["Survived"]
_ = k_tst.drop("Survived", axis=1, inplace=True)
# _ = k_tst.drop("Cabin", axis=1, inplace=True)
k_all = k_trn
k_all = pd.concat([k_all, k_tst], ignore_index=True)
# let's try the full feature set
X_trn = k_trn[dst.full_trn_features]
X_trn.head()
And, that appeared to work. More testing to come.
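Since more testing is on the to-do list, here is a rough sketch of one cheap check that could be run against the new module: every feature named in a reduced set should also appear in the full feature list and in the dtype dictionary. The check_feature_sets helper is hypothetical (it is not part of rek_ml), and the data is a trimmed-down stand-in for what is in dsets_t.py, with one deliberately misspelled entry so there is something to report.

```python
def check_feature_sets(full_features, reduced_sets, d_types):
    """Return {set_name: [problem features]} for names missing from the full list or the dtypes."""
    problems = {}
    for set_nm, cols in reduced_sets.items():
        bad = [c for c in cols if c not in full_features or c not in d_types]
        if bad:
            problems[set_nm] = bad
    return problems

# trimmed stand-ins for dst.full_trn_features, dst.d_types_5 and dst.rdc_ds
full_features = ["Age", "Parch", "Pclass", "Sex_enc", "SibSp"]
d_types = {"Age": "float32", "Parch": "float32", "Pclass": "uint8",
           "Sex_enc": "uint8", "SibSp": "float32"}
reduced_sets = {
    "full": full_features,
    "base(4)": ["Pclass", "Sex_enc", "SibSp", "Parch"],
    "typo(2)": ["Age", "Sex_emc"],   # deliberate misspelling for the demo
}

problems = check_feature_sets(full_features, reduced_sets, d_types)
print(problems)
```

Against the real module the call would presumably look like check_feature_sets(dst.full_trn_features, dst.rdc_ds, dst.d_types_5), with an empty dictionary meaning nothing is out of step.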
Second Module
Now, let’s dump all of our various cross-validation and testing functions into a second module, ml_misc.py. I will also put the classifier list in this module (for now at least). Same basic process: copy stuff over to the new module, remove it from the notebook template, add the import, add a test. I also added a function from a future post that was not previously in the template, run_classifiers_test().
I also copied over all the imports, except for the rek_ml package modules. The rest are currently duplicated in the template and in the package module. But I expect Python is smart enough to not load them again.
That is something I will need to sort out. How many, if any, of the imports in the new module do not need to be in the notebook; and vice versa. That may wait for another day.
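On the duplicated imports: Python keeps every imported module in the sys.modules cache, so a second import of the same module hands back the cached object rather than running the module's code again. A small sketch using only standard library modules (math and re stand in for the heavier numpy/pandas/scikit-learn imports):

```python
import sys
import math
import re

# simulate "the notebook" and "the package module" each importing the same things
import math as math_again
import re as re_again

same_math = math_again is math          # both names point at one module object
same_re = re_again is re
cached = sys.modules["math"] is math    # and that object lives in the import cache

print(same_math, same_re, cached)
```

So the duplication costs next to nothing at run time; which imports each file actually needs is more a question of tidiness.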
To test I will use some code from that future post (sorry).
# Package: rek_ml, module: ml_misc.py, version: 0.1.0
import math
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
# Feature and Model Selection:
import lightgbm as lgb
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression, SGDClassifier # linear classifiers
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold # train/test splitting tool for cross-validation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc # scoring metrics
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import FeatureUnion, make_pipeline, Pipeline
# define a few classifiers to test things against
# this should very likely be a much larger list, but I just want to give these techniques a try
# variable definitions
num_jobs=-1 # use all available CPUs when possible
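# note: these arguments match the scikit-learn release I had installed at the time of writing;
# as far as I know, newer releases renamed loss='deviance' to 'log_loss' and dropped
# max_features='auto' ('sqrt' being the equivalent for classifiers)
classifier_list = [LogisticRegression(n_jobs=num_jobs),
                   SGDClassifier(alpha=0.01, n_jobs=num_jobs),
                   GradientBoostingClassifier(n_estimators=100, loss='deviance'),
                   RandomForestClassifier(n_estimators=100, n_jobs=num_jobs),
                   DecisionTreeClassifier(max_features='auto'),
                   ExtraTreesClassifier(n_estimators=100, n_jobs=num_jobs)]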
# Functions currently defined in module
"""
get_clf_name(classifier)
run_classifier_cv(clf, X_data, y_data, n_folds=5, r_state=29)
run_classifiers_test(clfs, trn_x, trn_y, tst_x, tst_y, r_state=29)
run_multi_clf_cv(clfs, X_data, y_data, n_folds=5, r_state=29)
disp_tbl_scores(clf_stats, fset_lbls)
plot_feat_importances(clf_nm, fi_means, ds_cols)
"""
# get classifier name from classifier object
# a regular expression should look after things nicely, though perhaps a bit of overkill
def get_clf_name(classifier):
    rgx = re.compile(r".*\.(.*)'\>")
    # remember type() returns an object
    rslt = rgx.match(str(type(classifier)))
    return rslt[1]
# function to generate classifier statistics/feature importances
def run_classifier_cv(clf, X_data, y_data, n_folds=5, r_state=29):
    """
    Parameters:
    - clf: a classifier object
    - X_data: the feature set on which to run the cross validation (pandas DataFrame)
    - y_data: the class labels for our training data (pandas Series)
    - n_folds: the number of folds/splits to use, default = 5
    - r_state: random state to use for reproducibility, default = 29
    Returns:
    - clf_score: the summary statistics across all folds (pandas DataFrame)
    - feat_import_mean: the mean of the feature importances across all folds
      (numpy array, or None if the classifier does not provide them)
    """
    # start with defaults
    kfold = StratifiedKFold(n_splits=n_folds)
    params = clf.get_params()
    params['random_state'] = r_state
    clf.set_params(**params)
    # print(type(clf))
    clf_nm = get_clf_name(clf)
    # store results for each fold, 5 for default case
    trn_accuracy = []
    tst_accuracy = []
    tst_f1_score = []
    feat_import = []
    # iterate over trn, tst indices returned by kfold
    for (trn, tst) in kfold.split(X_data, y_data):
        # fit fold
        _ = clf.fit(X_data.iloc[trn], y_data.iloc[trn])
        # get various scores, etc. from the classifier's fit attempt
        trn_accuracy.append(clf.score(X_data.iloc[trn], y_data.iloc[trn]))
        tst_accuracy.append(clf.score(X_data.iloc[tst], y_data.iloc[tst]))
        tst_f1_score.append(f1_score(y_true=y_data.iloc[tst], y_pred=clf.predict(X_data.iloc[tst])))
        if hasattr(clf, 'feature_importances_'):  # some classifiers don't
            feat_import.append(clf.feature_importances_)
    # generate return values
    trn_accuracy_mean = np.mean(trn_accuracy)
    trn_accuracy_std = np.std(trn_accuracy)
    tst_accuracy_mean = np.mean(tst_accuracy)
    tst_accuracy_std = np.std(tst_accuracy)
    tst_f1_score_mean = np.mean(tst_f1_score)
    tst_f1_score_std = np.std(tst_f1_score)
    clf_score = pd.DataFrame({'Classifier Name': clf_nm,
                              'Mean Train Accuracy': trn_accuracy_mean,
                              'Train Accuracy Standard Deviation': trn_accuracy_std,
                              'Mean Test Accuracy': tst_accuracy_mean,
                              'Test Accuracy Standard Deviation': tst_accuracy_std,
                              'Mean Test F1-Score': tst_f1_score_mean,
                              'F1-Score Standard Deviation': tst_f1_score_std}, index=[0])
    if hasattr(clf, 'feature_importances_'):
        feat_import_mean = np.mean(feat_import, axis=0)
    else:
        feat_import_mean = None
    return (clf_score, feat_import_mean)
# function to get test set accuracy scores over all specified classifiers for a given training set
def run_classifiers_test(clfs, trn_x, trn_y, tst_x, tst_y, r_state=29):
    trn_clfs = []
    for clf_t in clfs:
        params = clf_t.get_params()
        params['random_state'] = r_state
        clf_t.set_params(**params)
        clf_t.fit(trn_x, trn_y)
        trn_clfs.append(clf_t)
    y_preds = {}
    for clf_t in trn_clfs:
        y_pred_t = clf_t.predict(tst_x)
        p_scr_t = accuracy_score(tst_y, y_pred_t)
        y_preds[get_clf_name(clf_t)] = p_scr_t
    return y_preds
# run cross-val for all classifiers passed in
def run_multi_clf_cv(clfs, X_data, y_data, n_folds=5, r_state=29):
    all_stats = []
    all_f_import = []
    for clf in clfs:
        clf_stats, clf_f_import = run_classifier_cv(clf, X_data, y_data, n_folds=n_folds, r_state=r_state)
        all_stats.append(clf_stats)
        all_f_import.append(clf_f_import)
    # DataFrame.append() was removed in pandas 2.0, so collect the frames and concatenate once
    stats_df = pd.concat(all_stats, ignore_index=True) if all_stats else pd.DataFrame()
    return stats_df, all_f_import
# function to build a hierarchical dataframe comparing CV results across feature sets
def disp_tbl_scores(clf_stats, fset_lbls):
    """
    Parameters:
    clf_stats: list of dataframes containing cross-val stats for one or more feature sets
    fset_lbls: list of labels for each feature set in clf_stats
    """
    nbr_sets = len(fset_lbls)
    set_rows = clf_stats[0].shape[0]
    index = clf_stats[0]["Classifier Name"].to_list()
    # print(index)
    columns = pd.MultiIndex.from_product([["Mean Train Accuracy", "Mean Test Accuracy", "Mean Test F1-Score"], fset_lbls],
                                         names=['Score', 'Feature Set'])
    # print(columns)
    stats = []
    for i in range(set_rows):
        r_stats = []
        for scr in ["Mean Train Accuracy", "Mean Test Accuracy", "Mean Test F1-Score"]:
            for j in range(nbr_sets):
                r_stats.append(clf_stats[j].loc[i, scr])
        stats.append(r_stats)
    return pd.DataFrame(stats, index=index, columns=columns)
# barchart function for the feature importances
def plot_feat_importances(clf_nm, fi_means, ds_cols):
    s_ndx = np.argsort(fi_means)
    # reverse the order
    s_ndx = s_ndx[::-1]
    _ = plt.figure(figsize=(12,5))
    _ = plt.title(f"Feature Importances for {clf_nm}")
    _ = plt.bar(range(len(ds_cols)), fi_means[s_ndx], align='center')
    _ = plt.xticks(range(len(ds_cols)), ds_cols[s_ndx], rotation=90)
    _ = plt.xlim([-1, len(ds_cols)])
    plt.show()
Yes, the documentation needs work! And, add the new import, import rek_ml.ml_misc as rml, to the notebook template.
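About that "bit of overkill" regular expression in get_clf_name(): the class name is also available directly as type(obj).__name__. A quick sketch comparing the two approaches. The standard library classes below are only stand-ins for the scikit-learn classifiers, and get_clf_name_simple is a hypothetical name, not something in the module.

```python
import re
from collections import OrderedDict
from fractions import Fraction

def get_clf_name(classifier):
    # regex approach from the module: grab the text after the last dot
    rgx = re.compile(r".*\.(.*)'\>")
    rslt = rgx.match(str(type(classifier)))
    return rslt[1]

def get_clf_name_simple(obj):
    # hypothetical alternative: ask the class for its own name
    return type(obj).__name__

samples = [OrderedDict(), Fraction(1, 2)]
names_regex = [get_clf_name(s) for s in samples]
names_simple = [get_clf_name_simple(s) for s in samples]
print(names_regex, names_simple)
```

The simpler version also copes with classes whose printed type has no dot in it (built-ins such as int), where the regex finds no match.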
Bigger Test
I am copying the code used in that future post to score the reduced datasets and list of classifiers against the Titanic test dataset.
clf_preds = {}
for ds_nm, ds_cl in dst.rdc_ds.items():
    X_dtrn = k_trn[ds_cl]
    X_dtst = k_tst[ds_cl]
    tmp_pred = rml.run_classifiers_test(rml.classifier_list, X_dtrn, y_trn, X_dtst, y_tst)
    clf_preds[ds_nm] = list(tmp_pred.values())
print(clf_preds)
tst_acc = {
'PCA(12)': [0.7751196172248804, 0.7751196172248804, 0.7559808612440191, 0.7416267942583732, 0.6961722488038278, 0.7464114832535885]
}
clf_preds['PCA(12)'] = tst_acc['PCA(12)']
df_cols = ['LogRegression', 'SGD', 'GBoosting', 'RandomForest', 'DecisionTree', 'ExtraTrees']
df_preds = pd.DataFrame.from_dict(clf_preds, orient='index', columns=df_cols)
pd.options.display.float_format = '{:,.5f}'.format
df_preds
And as near as I can tell that is identical to the output in the draft for that future post. (I did say I try to have a few draft posts in the pipeline at any given time.)
‘Til Next Time
Well that was a fun, little exercise.
Feel free to download and play with my version of this post’s notebook.
You can also download the refactored base Titanic notebook template and a zipped copy of the rek_ml package directory.
Resources
Lot of stuff in here I did not use. But, it may yet come down to me doing so.
- Modules and Packages
- Creating a Package — likely overkill for my purposes
- Making a Python Package
- Packaging a python library
- Build Python Packages Without Publishing
- How to Build Your Very First Python Package
- Creating a Package
- Add packages to Anaconda environment in Python
- Anaconda: Permanently include external packages
- How to install my own python module (package) via conda and watch its changes