In that data science course I keep mentioning, in one of the projects we were asked to code (Python) some functions related to the Softmax classifier. I’d link to the appropriate page(s), but they are only available to registered students. Also, the project was due Wednesday, 2021.10.20; so, I don’t expect I am giving anything away. That said, it isn’t as if all of this is not available in numerous scholarly documents, Jupyter notebooks and posts on the web.

My initial implementation was pretty simple and used one or more loops in most of the functions. When it came to running the classification model with my simple code, it took so long to fit the model that I gave up and terminated the program.

The code was running on a set of grayscale images of handwritten digits from the MNIST database, split into a training set and a test set. Each sample contained the features of one image, which are simply the raw pixel values flattened out into a vector of length \(784 = 28^2\).

A little reading suggested my problem was likely not with the logic of the functions as such, but with the use of loops in the code. The suggestion was to eliminate the loops and use the tools NumPy provides to manipulate the matrices involved in the calculations. After doing so, the model was fit and tested in a couple of minutes of wall-clock time.

This was a big learn for me. So, I wanted to document the before and after code for the three key functions: probability, cost and gradient descent. But first a very brief look at Softmax.

Softmax Classifier

The Softmax classifier is a generalization of the binary form of Logistic Regression, and it is especially suited to multi-class classification tasks.

Softmax is a mathematical function that converts a vector of numbers into a vector of probabilities, where the probabilities of each value are proportional to the relative scale of each value in the vector.
…
Specifically, the network is configured to output N values, one for each class in the classification task, and the softmax function is used to normalize the outputs, converting them from weighted sum values into probabilities that sum to one. Each value in the output of the softmax function is interpreted as the probability of membership for each class.

Softmax Activation Function with Python, Jason Brownlee PhD

Softmax Regression is also known as Maximum Entropy Classifier.

There were three basic functions that had to be written for use by the model itself. The following provides the nitty-gritty formulas needed to code each function. No attempt will be made to provide any derivation or proof for them; feel free to search the internet.

Probability Function

For each data point \(x^{(i)}\), compute the probability that \(x^{(i)}\) is labeled as \(j\), for \(j = 0,1,\dots ,k-1\):

$$h(x) = \frac{1}{\sum _{j=0}^{k-1} e^{[\theta _ j \cdot x / \tau ] - c}} \begin{bmatrix} e^{[\theta _0 \cdot x / \tau ] - c} \\ e^{[\theta _1 \cdot x / \tau ] - c} \\ \vdots \\ e^{[\theta _{k-1} \cdot x / \tau ] - c} \end{bmatrix}$$

where \(\tau >0\) is the temperature parameter, and c is used to reduce the possibility of numerical or overflow errors. A good choice is \(c = \max _ j \theta _ j \cdot x / \tau\).
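
To make the role of \(c\) concrete, here is a small NumPy sketch (illustrative only, not the project code; the theta and x values are made up) showing that subtracting the maximum leaves the probabilities unchanged while avoiding overflow:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(10, 784)) * 50   # deliberately large values to provoke overflow
x = rng.normal(size=784)
tau = 1.0

z = theta @ x / tau                           # the theta_j . x / tau values
naive = np.exp(z) / np.exp(z).sum()           # overflows to inf/nan for large z
c = z.max()                                   # c = max_j theta_j . x / tau
stable = np.exp(z - c) / np.exp(z - c).sum()  # same probabilities, no overflow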

Cost Function

$$J(\theta ) = -\frac{1}{n}\Bigg[\sum _{i=1}^ n \sum _{j=0}^{k-1} [[y^{(i)} == j]] \log {\frac{e^{\theta _ j \cdot x^{(i)} / \tau }}{\sum _{l=0}^{k-1} e^{\theta _ l \cdot x^{(i)} / \tau }}}\Bigg] + \frac{\lambda }{2}\sum _{j=0}^{k-1}\sum _{i=0}^{d-1} \theta _{ji}^2$$

Gradient Descent

$$\displaystyle \displaystyle \frac{\partial J(\theta )}{\partial \theta _ m} \displaystyle = -\frac{1}{\tau n} \sum _{i = 1} ^{n} [x^{(i)}([[y^{(i)} == m]] - p(y^{(i)} = m | x^{(i)}, \theta ))] + \lambda \theta _ m$$

Dataset

Getting the dataset used for the project is a bit of a pain, especially as I didn’t want to add Keras to my conda environment. So, after some hunting, I found mnist_784 on the OpenML web site and downloaded the CSV (quite large, ~125 MB).

It did take a bit of time to load the dataset into the notebook. I also converted the CSV data into feature and target arrays, split into training and testing datasets.

In [3]:
csv_path = './data/mnist_digits/mnist_784.csv'
df = pd.read_csv(csv_path)
data = df.values
In [4]:
print(data.shape)
print(data[:4,780:])
(70000, 785)
[[0 0 0 0 5]
 [0 0 0 0 0]
 [0 0 0 0 4]
 [0 0 0 0 1]]
In [5]:
# now to make training and test sets
# as I don't plan to run model, will make life easy, just take the first 60,000 rows as the training set

X_trn = data[:60000, :-1]
y_trn = data[:60000, -1]
print(X_trn.shape, y_trn.shape)
X_tst = data[60000: , :-1]
y_tst = data[60000:, -1]
print(X_tst.shape, y_tst.shape)
(60000, 784) (60000,)
(10000, 784) (10000,)

Now on to my first attempts at coding the functions.

Initial Code for the Functions

Okay, I have limited coding experience/knowledge and virtually no linear algebra experience/knowledge. But, based on past experience one tends to use loops for coding summations. So that’s what I did. The functions I wrote correctly processed the simple test cases so I figured I was good to go.

Working on the project, I did all my testing at the command line using the test data provided by the course. For this post, I am working in a Jupyter notebook, which means the usual setup and loading of the dataset. I won’t bother showing the notebook setup in the post.
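
(For what it's worth, based on the code in the cells below, the setup cell needs something along these lines; this is inferred, not my actual setup cell.)

# imports used by the cells in this post
import timeit
import numpy as np
import pandas as pd
from scipy import sparse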

Probability

So, go through each sample in the dataset and calculate, for each possible classification, the probability that the sample matches that classification. That looks like two loops: an outer loop over the samples and an inner loop over the possible classification values. Simple to code, right?

In [6]:
# probability function
def compute_probabilities(X, theta, temp_parameter):
  """
  Computes, for each datapoint X[i], the probability that X[i] is labeled as j
  for j = 0, 1, ..., k-1

  Args:
    X - (n, d) NumPy array (n datapoints each with d features)
    theta - (k, d) NumPy array, where row j represents the parameters of our model for label j
    temp_parameter - the temperature parameter of softmax function (scalar)

  Returns:
    H - (k, n) NumPy array, where each entry H[j][i] is the probability that X[i] is labeled as j
  """
  n_lbls = theta.shape[0]
  n_d_pts = X.shape[0]
  probs = np.zeros(n_lbls)
  bases = np.array([np.e for _ in range(n_lbls)])

  r = 0
  for row in X:
    tmp_p = np.zeros(n_lbls)
    poss_c = [theta[i].dot(row) / temp_parameter for i in range(n_lbls)]
    c_use = max(poss_c)
    for j in range(n_lbls):
      tmp_p[j] = (theta[j].dot(row) / temp_parameter) - c_use
    tmp_p = np.power(bases, tmp_p)
    avg_div = tmp_p.sum()
    tmp_p = tmp_p / avg_div
    probs = np.column_stack((probs, tmp_p))
    r += 1
  return probs[:,1:]

In [7]:
# let's make ourselves a test theta array and a couple of result arrays
theta = np.zeros((10, 784))
In [8]:
%%timeit -n 1 -r 1
# let's time it for one iteration
if True:
  p_60 = compute_probabilities(X_trn, theta, temp_parameter=1)
# result: 1min 30s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1min 30s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

That’s a fair bit of time. In this case, that represents 60,000 * 10 loops.

Going forward, I may do my timing using the smaller test dataset.

This is just one execution of one of the three functions. I expect you can imagine that if the model ran my code for a few to several hundred iterations to find a reasonable solution, it was going to take considerable time. Which, in fact, it was doing…

In [9]:
%%timeit
p_10 = compute_probabilities(X_tst, theta, temp_parameter=1)
2.45 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The above cell took ~20 seconds to execute, and that’s after 7 runs with 1 loop each. I only used 1 run with 1 loop for the training set. Of course, the times will likely vary each time the notebook is executed.

Variable Assignment During Timing?

Using the IPython magic command %%timeit didn’t allow me to access the values p_60 or p_10. So, I did some digging and found the following approach. You will see below why I want those values. (Note: I added an import for timeit to the notebook’s first cell and re-ran it.)

In [10]:
result_1 = []
t_stmt = f'result_1.append(compute_probabilities(X_trn, theta, temp_parameter=1))'
print(timeit.timeit(stmt=t_stmt, setup='from __main__ import result_1, compute_probabilities, X_trn, theta', number=1))
p_60 = result_1[0] 
print(p_60)
95.5449653
[[0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 ...
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]]
In [11]:
result_2 = []
t_stmt = f'result_2.append(compute_probabilities(X_tst, theta, temp_parameter=1))'
print(timeit.timeit(stmt=t_stmt, setup='from __main__ import result_2, compute_probabilities, X_tst, theta', number=5))
p_10 = result_2[0] 
print(p_10)
11.900594299999966
[[0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 ...
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]]
In [12]:
# a different approach to timing the execution speed
result_3 = []
t_stmt = f'result_3.append(compute_probabilities(X_tst, theta, temp_parameter=1))'
print(timeit.repeat(stmt=t_stmt, setup='from __main__ import result_3, compute_probabilities, X_tst, theta', repeat=7, number=1))
p_10_2 = result_3[0]
print(p_10 == p_10_2)
[2.6034837999999922, 2.6143754000000285, 2.2095014999999876, 2.6556428000000096, 2.460252799999978, 2.525352699999985, 2.1465939999999932]
[[ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]
 ...
 [ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]]
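
(An aside: timeit will also accept a callable instead of a statement string, which avoids the __main__ import dance. A small sketch of that variant, using a throwaway wrapper name run_once and result name p_10_alt of my own invention:)

def run_once():
  global p_10_alt
  p_10_alt = compute_probabilities(X_tst, theta, temp_parameter=1)

print(timeit.timeit(run_once, number=1))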

Cost

Okay, let’s repeat that with the cost function.

I will be modifying the function slightly. In the project assignment, the probability function was called within the cost function. I don’t really want to add the time for the probability function to the execution time of this function. So, I am going to add a probs parameter for our purposes here. And, that is why I wanted to save the probability function output when timing its performance earlier.

In this case I only used one loop. The inner summation over classes reduces to indexing the appropriate element of the probability array, and the regularization term is a single matrix calculation outside the loop.

In [13]:
def compute_cost_function(X, Y, theta, probs, lambda_factor, temp_parameter):
    """
    Computes the total cost over every datapoint.

    Args:
      X - (n, d) NumPy array (n datapoints each with d features)
      Y - (n, ) NumPy array containing the labels (a number from 0-9) for each data point
      theta - (k, d) NumPy array, where row j represents the parameters of our model for label j
      probs - (k, n) NumPy array of precomputed class probabilities (parameter added for this post)
      lambda_factor - the regularization constant (scalar)
      temp_parameter - the temperature parameter of softmax function (scalar)

    Returns:
      c - the cost value (scalar)
    """
    #YOUR CODE HERE
    #probs = compute_probabilities(X, theta, temp_parameter)
    l_sum = 0
    r_sum = (lambda_factor / 2) * np.sum(theta * theta)
    for i in range(X.shape[0]):
        l_sum += np.log(probs[Y[i], i])
    l_sum = -(l_sum / X.shape[0])
    return l_sum + r_sum

Now, some timings.

In [14]:
cost_1 = []
t_stmt = f'cost_1.append(compute_cost_function(X_trn, y_trn, theta, p_60, lambda_factor=1.0e-4, temp_parameter=1))'
print(timeit.timeit(stmt=t_stmt, setup='from __main__ import cost_1, compute_cost_function, X_trn, y_trn, theta, p_60', number=1))
c_60 = cost_1[0] 
print(c_60)
0.3359523999997691
2.3025850929954172
In [15]:
cost_2 = []
t_stmt = f'cost_2.append(compute_cost_function(X_trn, y_trn, theta, p_60, lambda_factor=1.0e-4, temp_parameter=1))'
print(timeit.repeat(stmt=t_stmt, setup='from __main__ import cost_2, compute_cost_function, X_trn, y_trn, theta, p_60', repeat=7, number=3))
[0.8501957999997103, 0.8696082000005845, 0.8564771999999721, 0.8133994999998322, 0.807842499999424, 0.8858384000004662, 0.8531422000005477]

That was quicker than I expected, even on the full 60,000 training samples. Do note that each of the last seven numbers is the running time for 3 calls to the cost function. So, the fastest running time was ~0.8078 / 3 ≈ 0.27 s per call, which is in keeping with the timeit result for a single execution of the function. (As a sanity check, the cost value of ~2.3026 is just \(\ln(10)\): with theta all zeros, every class gets probability 0.1 and the regularization term is zero.)

Gradient Descent

This is the funny one. There was a hint that we should use a scipy.sparse matrix to efficiently one-hot encode the target array. Consequently, I ended up doing things differently than in the previous two functions. I should have taken the hint further, but…

In the course project, I didn’t write a function to do the one-hot encoding. But for this exercise I did (you know, don’t repeat your code).

In [16]:
def make_onehot(v_col, n_smp, n_cat):
  return sparse.coo_matrix((np.ones(n_smp), (v_col, range(n_smp))), shape=(n_cat,n_smp)).toarray()
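
As a quick sanity check (not part of the project code), here is what make_onehot returns for a tiny, made-up label vector. Note that the result is (k, n): one row per category, one column per sample, matching the orientation of the probability array.

print(make_onehot(np.array([2, 0, 1]), n_smp=3, n_cat=3))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
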
In [17]:
# modified so as to not call probability function, added new parameter 'probs'
# also in course code, didn't have function for creating onehot encoding of target data
def run_gradient_descent_iteration(X, Y, theta, probs, alpha, lambda_factor, temp_parameter):
    """
    Runs one step of batch gradient descent
    Args:
      X - (n, d) NumPy array (n datapoints each with d features)
      Y - (n, ) NumPy array containing the labels (a number from 0-9) for each data point
      theta - (k, d) NumPy array, where row j represents the parameters of our model for label j
      probs - (k, n) NumPy array of precomputed class probabilities (parameter added for this post)
      alpha - the learning rate (scalar)
      lambda_factor - the regularization constant (scalar)
      temp_parameter - the temperature parameter of softmax function (scalar)

    Returns:
      theta - (k, d) NumPy array that is the final value of parameters theta
    """
    #YOUR CODE HERE
    n_smp = X.shape[0]
    n_cat = theta.shape[0]
    # y_ohe = sparse.coo_matrix((np.ones(n_smp), (Y, range(n_smp))), shape=(n_cat,n_smp)).toarray()
    y_ohe = make_onehot(Y, n_smp, n_cat)
    # probs = compute_probabilities(X, theta, temp_parameter)
    tmp_dot = np.dot((y_ohe - probs), X)
    grad = (-1 / (n_smp * temp_parameter)) * tmp_dot + lambda_factor * theta
    theta = theta - (alpha * grad)
    return theta
In [18]:
gdi_1 = []
t_stmt = f'gdi_1.append(run_gradient_descent_iteration(X_trn, y_trn, theta, p_60, alpha=0.3, lambda_factor=1.0e-4, temp_parameter=1))'
t_out = timeit.repeat(stmt=t_stmt, setup='from __main__ import gdi_1, run_gradient_descent_iteration, X_trn, y_trn, theta, p_60', repeat=7, number=3)
print([tm / 3 for tm in t_out])
[0.37253196666665644, 0.23015510000004724, 0.21697746666662474, 0.21578276666665866, 0.20662850000000313, 0.20816326666666404, 0.20455953333331914]

And, the best time is ~0.2046.

Looks like the bottleneck in my code was the probability function. Let’s fix that.

Replace Loops with Numpy Matrix Arithmetic

Now, it’s not that the loops really disappear; it’s that NumPy executes them much faster than pure Python. A good percentage of NumPy is written in C and compiled. Python code, on the other hand, is interpreted: each line is executed by the interpreter every time it runs, which is considerably slower than compiled code. One of the benefits of Python is that there are a number of ways to easily extend your code with C, and NumPy takes advantage of that opportunity.

The thing I didn’t really see at first is that matrix arithmetic is, in many (if not most) cases, very much the same as loops. Take a simple example: say I have a 2D matrix (a plain Python list of lists) and I want to multiply every element by 5. Using loops in pure Python, it looks something like this:

# yes, there are other ways to do this
mX5 = [[0 for _ in range(len(matrix[0]))] for _ in range(len(matrix))]
for i in range(len(matrix)):
  for j in range(len(matrix[0])):
    mX5[i][j] = matrix[i][j] * 5

Now, with matrix as a Numpy array, it looks like this:

mX5 = matrix * 5

Or, let’s say I want to sum each column of the matrix. In pure Python:

col_sums = [0 for _ in range(len(matrix[0]))]
for i in range(len(matrix)):
  for j in range(len(matrix[0])):
    col_sums[j] += matrix[i][j]

With Numpy’s assistance:

col_sums = np.sum(matrix, axis=0)
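
If you want to see the difference for yourself on something small, a quick and rough comparison along these lines works (col_sums_loop is just a throwaway name; exact numbers will depend on your machine):

import timeit
import numpy as np

matrix = np.random.rand(1000, 1000)

def col_sums_loop(m):
  # pure-Python double loop over the array
  sums = [0.0] * m.shape[1]
  for i in range(m.shape[0]):
    for j in range(m.shape[1]):
      sums[j] += m[i][j]
  return sums

print(timeit.timeit(lambda: col_sums_loop(matrix), number=1))   # typically a second or so
print(timeit.timeit(lambda: np.sum(matrix, axis=0), number=1))  # typically around a millisecond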

That said, let’s see what happens when my initial code is refactored to eliminate Python loops as much as possible.

Probability

It took me a fair bit of fooling around to refactor this function. But, I did manage to get it done. And, it tested well when I re-submitted it in my assignment (online grading system).

In [19]:
# eliminate loops as much as possible
def compute_probabilities_2(X, theta, temp_parameter):
    """
    Computes, for each datapoint X[i], the probability that X[i] is labeled as j
    for j = 0, 1, ..., k-1

    Args:
      X - (n, d) NumPy array (n datapoints each with d features)
      theta - (k, d) NumPy array, where row j represents the parameters of our model for label j
      temp_parameter - the temperature parameter of softmax function (scalar)

    Returns:
      H - (k, n) NumPy array, where each entry H[j][i] is the probability that X[i] is labeled as j
    """
    bases = np.full((theta.shape[0], X.shape[0]), fill_value=np.e)
    base_dot = np.matmul(theta, X.T) / temp_parameter
    base_dot -= np.amax(base_dot, axis=0)
    probs = np.power(bases, base_dot)
    div_v = np.sum(np.power(bases, base_dot), axis=0)
    probs /= div_v
    return probs
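
A small aside on the refactored code: since every entry of bases is e, the same result can be had with np.exp, which raises e element-wise and skips both the bases array and the second np.power call. A possible variant (my own sketch, not what I submitted):

def compute_probabilities_exp(X, theta, temp_parameter):
  # same (k, n) output as compute_probabilities_2, using np.exp instead of np.power
  base_dot = np.matmul(theta, X.T) / temp_parameter
  base_dot -= np.amax(base_dot, axis=0)   # subtract c for numerical stability
  probs = np.exp(base_dot)
  return probs / probs.sum(axis=0)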

In [20]:
prob_rf_1 = []
t_stmt = f'prob_rf_1.append(compute_probabilities_2(X_trn, theta, temp_parameter=1))'
t_out_prf_1 = timeit.repeat(stmt=t_stmt, setup='from __main__ import prob_rf_1, compute_probabilities_2, X_trn, theta', repeat=7, number=3)
p_60_rf = prob_rf_1[0]
print([tm / 3 for tm in t_out_prf_1])
print(p_60_rf)
[0.3978473333333265, 0.27019319999999425, 0.2707788666666602, 0.2602649000000383, 0.26236436666658847, 0.25575143333329226, 0.25740093333327724]
[[0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 ...
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]
 [0.1 0.1 0.1 ... 0.1 0.1 0.1]]

Wow! From something around 1 min 30 s per call down to around 0.26 s! When I originally saw this improvement, it absolutely boggled my mind.

Cost

Now, the cost function seemed to run quickly enough. But I did also remove the Python loop in that function. So, let’s have a look and get a proper timing of the change as well.

In [21]:
# once again, modified to accept probabilities, to limit timing to this function's actual code
# with the exception of the call to make_onehot()
def compute_cost_function_2(X, Y, theta, probs, lambda_factor, temp_parameter):
    """
    Computes the total cost over every datapoint.

    Args:
      X - (n, d) NumPy array (n datapoints each with d features)
      Y - (n, ) NumPy array containing the labels (a number from 0-9) for each data point
      theta - (k, d) NumPy array, where row j represents the parameters of our model for label j
      probs - (k, n) NumPy array of precomputed class probabilities (parameter added for this post)
      lambda_factor - the regularization constant (scalar)
      temp_parameter - the temperature parameter of softmax function (scalar)

    Returns:
      c - the cost value (scalar)
    """
    n_smp = X.shape[0]
    n_cat = theta.shape[0]
    # probs = compute_probabilities(X, theta, temp_parameter)
    l_sum = 0
    r_sum = (lambda_factor / 2) * np.sum(theta * theta)
    # y_ohe = sparse.coo_matrix((np.ones(n_smp), (Y, range(n_smp))), shape=(n_cat,n_smp)).toarray()
    y_ohe = make_onehot(Y, n_smp, n_cat)
    # np.log writes its output into y_ohe (used as the 'out' buffer); np.choose then picks,
    # for each sample i, the log-probability of its correct label Y[i]
    valid_probs = np.choose(Y, np.log(probs[np.arange(theta.shape[0])], y_ohe))
    l_sum = - np.sum(valid_probs) / X.shape[0]
    return l_sum + r_sum

In [22]:
prob_cf_1 = []
t_stmt = f'prob_cf_1.append(compute_cost_function_2(X_trn, y_trn, theta, p_60, lambda_factor=1.0e-4, temp_parameter=1))'
t_out_crf_1 = timeit.repeat(stmt=t_stmt, setup='from __main__ import prob_cf_1, compute_cost_function_2, X_trn, y_trn, theta, p_60', repeat=7, number=3)
c_60_rf = prob_cf_1[0]
print([tm / 3 for tm in t_out_crf_1])
print(c_60_rf)
[0.3329439333335055, 0.29496253333339456, 0.2793265666668958, 0.2701374333331235, 0.2652082000001125, 0.2797459999998561, 0.27032926666682516]
2.3025850929940455

The best time above is perhaps marginally better than that for the original function, once again confirming that the bottleneck was the probability function.

And, no need to look at the gradient descent function as it did not change during my refactoring.

Done

That’s it for this one. I can’t quite tell you how pleased I was once I managed to speed up that probability function. It was truly a light-bulb moment for me. Perhaps only a 7-watt light-bulb moment, but…

And, in working on the notebook for this post I did learn a little about using timeit in both Python and Jupyter notebooks.

Feel free to download and play with my version of this post’s related notebook.

Resources