Okay, let’s have a bit of a look at the NumPy package. In its own words:

“The fundamental package for scientific computing with Python” (What is NumPy?)

Why NumPy?

It really boils down to what Python fundamentally provides with respect to matrices and vectors. Almost everything in data science, at its lowest levels, depends on matrices and vectors. (Well, you can think of a vector as a one-dimensional matrix.)

Python’s list object could possibly be used to define matrices, but not very efficiently. That’s because Python’s lists are not truly arrays in the sense of an array in C. Pretty much everything in Python is an object. That includes all the fundamental data types. Take an integer. In a language like C (which I was re-introduced to in CS50x) an integer is simply a series of memory locations (the number of those locations depending on the type of integer). In Python that integer is really a C struct (given most of Python is written in C). It carries a lot of extra information along with the representation of that integer. Not so in C: there you have a piece of memory with the binary encoding of that integer. Period!

Here’s a notebook cell with a bit of Python looking at an int.

In [1]:
import sys

x = 42
y = 2**1000
print(x.__repr__)
print(y.__repr__)
print(f"size of x: {sys.getsizeof(x)}")
print(f"size of y: {sys.getsizeof(y)}")
<method-wrapper '__repr__' of int object at 0x00000237698E6E50>
<method-wrapper '__repr__' of int object at 0x000002376FCDF670>
size of x: 28
size of y: 160

In C, assuming you declared both variables with the same integer type, they would be the same size in memory.

In C an array is a sequential piece of memory with enough space for the requested number of whatever data type you are storing in that array. In Python, a list ends up being an array storing pointers to whatever is stored at each index. And there is no rule that says each element has to be of the same type (pointers don’t require that).
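To make that concrete, here is a small hypothetical snippet (not from the notebook) showing one list happily holding completely unrelated types:

# each slot in a Python list is just a pointer to some object,
# so nothing stops us from mixing types in a single list
mixed = [42, 3.14, "forty-two", [1, 2, 3]]
for item in mixed:
    print(type(item), item)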

What this boils down to is not something really conducive to the kinds of arithmetic data science needs to do on its matrices and vectors. For one, it is generally expected that all elements of a vector or matrix are of the same type. There are a few packages in Python meant to help with using fixed-type arrays. None perhaps quite up to the challenges of data science.
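One such package I am aware of is the standard library’s array module. A quick sketch (hypothetical, not from the notebook) of what it gives us, and what it doesn’t:

import array

# array.array stores fixed-type values compactly,
# but offers essentially no mathematics on top of plain Python
a = array.array('i', [1, 2, 3, 4])  # 'i' = signed int
print(a)
# a.append("five")  # would raise a TypeError, only integers are accepted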

NumPy provides true multi-dimensional, fixed-type arrays/matrices with the bonus of a large number of mathematical functions designed to efficiently operate on those arrays/matrices. And, because of that combination of design elements it does so rather quickly. Certainly more quickly than anything I could bumble together.
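A tiny illustration of that (a hypothetical snippet, jumping ahead to the np alias introduced below):

import numpy as np

# elementwise arithmetic on a NumPy array runs in compiled code,
# no explicit Python loop required
nums = np.arange(5)   # [0 1 2 3 4]
print(nums * 2 + 1)   # [1 3 5 7 9]
print(np.sqrt(nums))  # elementwise square roots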

And, it provides a number of additional niceties. E.g. sparse arrays (strictly speaking, those live in the companion SciPy package), something I last saw in the 1970s while in university. Probably not something I have thought about since perhaps 1985 or so. I was trying to use linear algebra to resolve scheduling in our smaller facilities at the time. At the moment I don’t expect to be using them any time soon, but…

Let’s Have a Look

We will need to import NumPy in order to use it. The convention is to use np as an alias. So,

import numpy as np

NumPy Arrays from Python Lists

To start simply, we can create vectors and matrices from Python lists.

In [3]:
# let's create a numpy vector (1 dimensional array)
list_1 = [x for x in range(1, 16, 3)]
vector = np.array(list_1)
print(vector)
list_2 = [1, 4, "two"]
vector = np.array(list_2)
print(vector)
[ 1  4  7 10 13]
['1' '4' 'two']

What happened there? Recall that NumPy only supports arrays whose elements are all the same type (unlike Python’s lists). So, when it got our second list it attempted to resolve the problem by casting all the elements to a common type. The only one that would work was casting them all to strings; there was no way to cast “two” to an integer.
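As a quick sanity check (a hypothetical extra cell, not in the notebook), the dtype attribute shows what NumPy settled on for each array:

# dtype reports the element type NumPy chose for the array
print(np.array(list_1).dtype)  # an integer dtype, e.g. int32 or int64 depending on platform
print(np.array(list_2).dtype)  # a fixed-width Unicode dtype, something like <U21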

We can tell NumPy that we want all our elements to be of a specific type. So, let’s try “list_2” again, specifying that we want all elements of the NumPy array to be integers.

In [4]:
np.array(list_2, dtype='int32')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-6695f130ec88> in <module>
----> 1 np.array(list_2, dtype='int32')
ValueError: invalid literal for int() with base 10: 'two'

As expected, an error is raised. Which is probably what we’d rather have happen than all the elements being turned into strings. Let’s try a string that could be cast to an integer.

In [5]:
list_3 = [1, 7, '42']
vector = np.array(list_3, dtype='int32')
print(vector)
[ 1  7 42]

Let’s quickly look at initializing a multidimensional array or two.

In [6]:
# let's quickly look at initializing multidimensional arrays
list_4 = [x for x in range(42, 424, 84)]
list_5 = [x for x in range(22, 223, 44)]
array = np.array([list_4, list_5])
print(f"array = \n{array}")
list_6 = [x for x in range(101, 106)]
array_2 = np.array([list_5, list_4, list_6])
print(f"\narray_2 = \n{array_2}")
array = 
[[ 42 126 210 294 378]
 [ 22  66 110 154 198]]

array_2 = 
[[ 22  66 110 154 198]
 [ 42 126 210 294 378]
 [101 102 103 104 105]]
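As a quick hypothetical follow-on (not an actual cell in the notebook), the shape and ndim attributes confirm what we just built:

# shape gives the size along each dimension, ndim the number of dimensions
print(array.shape)    # (2, 5)
print(array_2.shape)  # (3, 5)
print(array_2.ndim)   # 2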

Alternate Methods to Initialize NumPy Arrays

NumPy provides numerous methods to generate arrays. Some of these will likely come in handy down the road.

In [7]:
# Create a 10 element integer vector filled with zeros
vector_0 = np.zeros(10, dtype='uint16')
print(vector_0)
[0 0 0 0 0 0 0 0 0 0]
In [8]:
# Create a 3x3 array full of ones (as floats)
array_1s = np.ones((3, 3), dtype=float)
print(array_1s)
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
In [9]:
# Create a 4x4 array filled with the sqrt of 2, specifically of type float32
array_sqrt = np.full((4, 4), 2**0.5, dtype='float32')
print(array_sqrt)
[[1.4142135 1.4142135 1.4142135 1.4142135]
 [1.4142135 1.4142135 1.4142135 1.4142135]
 [1.4142135 1.4142135 1.4142135 1.4142135]
 [1.4142135 1.4142135 1.4142135 1.4142135]]
In [10]:
# set seed so that repeatedly running the cell gives the same result each time
np.random.seed(333)

# how about a 3x3 array of uniformly distributed random values between 0 and 1
a_rnd = np.random.random((3, 3))
print(a_rnd)

# or a random set of normally distributed values with a mean of 0.25 and standard deviation of 1
a_norm = np.random.normal(0.25, 1, (3, 3))
print("\n", a_norm)
#print(sum([ 1.3940603, 1.74046459, -1.61346926]) / 3)

# or 3x3 array of random integers in the interval (1, 6) inclusive
a_die = np.random.randint(1, 7, (3, 3))
print("\n", a_die)

[[0.54329109 0.72895073 0.01688145]
 [0.3303388  0.36872182 0.04830367]
 [0.10453019 0.09743752 0.24540331]]

 [[-0.27811698 -0.23079156  0.88909145]
 [ 0.12753144  0.8172136   0.20920219]
 [-0.82934137  0.20437988  0.17573818]]

 [[3 1 1]
 [6 4 4]
 [6 5 6]]

In [11]:
# and the last example, create a 3x3 integer identity matrix
m_id = np.eye(3, dtype=int)
print(m_id)
[[1 0 0]
 [0 1 0]
 [0 0 1]]

There are more methods available, but as an introduction that should do for now.
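For what it’s worth, two more generators I expect to reach for later are np.arange and np.linspace. A quick hypothetical sketch (not from the notebook):

# evenly spaced values, by step size and by count respectively
print(np.arange(0, 10, 2))   # [0 2 4 6 8]
print(np.linspace(0, 1, 5))  # [0.   0.25 0.5  0.75 1.  ]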

Note: in the above code I used np.random.seed(333) to ensure I got the same result each time I ran that specific code block and any that came after it. I recently read an article that says not to do just that: Stop using numpy.random.seed(). I will have to read it again and make a decision at some point. Your choice, but I thought I should at least mention the article.
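For reference, here is a minimal sketch (hypothetical, not in the notebook) of the Generator-based approach that article points toward, producing the same kinds of values as the cells above:

# seed a dedicated generator object rather than NumPy's global state
rng = np.random.default_rng(333)
print(rng.random((3, 3)))           # uniform values in [0, 1)
print(rng.normal(0.25, 1, (3, 3)))  # normally distributed, mean 0.25, std dev 1
print(rng.integers(1, 7, (3, 3)))   # integers 1 through 6 inclusive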

Time to End This Post

I was going to look at accessing NumPy array elements before calling it a day. But I think there is more than enough to look at and play with already in this one. So, until next time.

I have generated a simple notebook showing some of the above bits and pieces. Feel free to download the notebook and play with it. Again, you will likely have to tell VSCode which kernel to use on your system.

Resources