Data manipulation in Python is nearly synonymous with NumPy array manipulation: even tools like Pandas are built around the NumPy array. NumPy arrays can be thought of as mathematical vectors and behave correspondingly. This differs from Python lists, which are containers that store arbitrary objects.
We saw previously that lists allow us to store a series of values in a single variable. For example:
listnumbers = [1, 2, 3]
print(listnumbers)
[1, 2, 3]
Notably, in Python, a list can contain objects that are not of the same data type. For example:
[1, 'string', {'a': 1}, [[1, 3], set()]]
Adding two lists concatenates them rather than performing a mathematical operation: adding two containers that hold arbitrary objects means “combining” them.
Being able to store multiple data types in a single list can be convenient. However, Python lists can be cumbersome to work with when doing mathematical operations, and for more complex and multi-dimensional data. They are also relatively slow to process. For example, say we wanted to add two large lists of values together. We would need to do the following:
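A minimal sketch of what element-wise addition might look like with plain lists. The names a, b, and c and the size of one million values are assumptions chosen for illustration, not necessarily the original code:

```python
# Two large lists of integers (the size is an assumption for illustration).
a = list(range(1000000))
b = list(range(1000000))

# a + b concatenates, giving 2,000,000 values rather than element-wise sums.
print(len(a + b))

# Element-wise addition needs an explicit loop, here a list comprehension.
c = [x + y for x, y in zip(a, b)]
```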
Let’s time how long that last line takes. For that, we can use the %timeit magic command in a Jupyter notebook:
98.7 ms ± 873 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
While this may seem fast in human terms, it’s quite slow computationally. If you had many operations like this, the time would quickly add up. As we will see below, it’s possible to do array operations like this much more quickly using NumPy.
For much faster, easier manipulation of numerical arrays, we use NumPy. What is NumPy? A good summary is provided by Claude AI:
NumPy is a fundamental Python library for scientific computing that provides support for large, multi-dimensional arrays and matrices. It offers a comprehensive collection of mathematical functions to operate on these arrays efficiently, with operations implemented in C for high performance. NumPy serves as the foundation for most other scientific Python libraries like pandas, scikit-learn, and matplotlib, making it essential for data science, machine learning, and numerical analysis workflows.
Let’s see how to do some basic array operations with numpy. First, if you have not done so, you’ll need to install numpy into your conda environment. To do so, in a terminal, activate your conda environment, then either run:
pip install numpy
or
conda install -y -c conda-forge numpy
We can now import numpy (using the conventional alias np, which the examples below rely on):
import numpy as np
There are multiple ways to create an array using numpy. Some examples:
# from a Python list:
arrnumbers = np.array(listnumbers)
print("arrnumbers:", arrnumbers)
# manually creating it:
arrnumbers2 = np.array([5, 3, 42])
print("arrnumbers2:", arrnumbers2)
# from a range of values (compare to range above):
arrnumbers3 = np.arange(1000000)
print("arrnumbers3:", arrnumbers3)
# an array of values linearly spaced between two endpoints:
arrlinspace = np.linspace(0, 10, 5) # 5 values equally spaced between 0 and 10
print("arrlinspace:", arrlinspace)
# an array of zeros:
arrzeros = np.zeros(4)
print("arrzeros:", arrzeros)
# an array of ones:
arrones = np.ones(4)
print("arrones:", arrones)
# an empty array (values will be whatever is in memory at the time):
arrempty = np.empty(4)
print("arrempty:", arrempty)
arrnumbers: [1 2 3]
arrnumbers2: [ 5 3 42]
arrnumbers3: [ 0 1 2 ... 999997 999998 999999]
arrlinspace: [ 0. 2.5 5. 7.5 10. ]
arrzeros: [0. 0. 0. 0.]
arrones: [1. 1. 1. 1.]
arrempty: [1. 1. 1. 1.]
NumPy arrays can have multiple dimensions:
[[1 2 3]
[4 5 6]]
arr2d.ndim: 2
arr2d.shape: (2, 3)
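Output like the above could be produced by a sketch along these lines, assuming arr2d is built from a nested list (the construction shown here is an assumption; only the printed values come from the text):

```python
import numpy as np

# A nested list becomes a 2-D array: each inner list is one row.
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d)
print("arr2d.ndim:", arr2d.ndim)    # number of dimensions
print("arr2d.shape:", arr2d.shape)  # (rows, columns)
```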
Many array constructor functions take a shape argument to create a multi-dimensional array:
zeros2d:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
ones3d:
[[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]]
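As a hedged illustration, the two arrays printed above are consistent with passing a tuple as the shape argument. The variable names are taken from the output labels; the exact calls are an assumption:

```python
import numpy as np

zeros2d = np.zeros((3, 4))    # 3 rows, 4 columns
ones3d = np.ones((2, 3, 4))   # 2 blocks, each with 3 rows and 4 columns
print("zeros2d:")
print(zeros2d)
print("ones3d:")
print(ones3d)
```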
Or, we can reshape an existing array:
arrnumbers3.shape: (1000000,)
arrnumbers3:
[ 0 1 2 ... 999997 999998 999999]
arrnumbers2d.shape: (1000, 1000)
arrnumbers2d:
[[ 0 1 2 ... 997 998 999]
[ 1000 1001 1002 ... 1997 1998 1999]
[ 2000 2001 2002 ... 2997 2998 2999]
...
[997000 997001 997002 ... 997997 997998 997999]
[998000 998001 998002 ... 998997 998998 998999]
[999000 999001 999002 ... 999997 999998 999999]]
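A sketch of a reshape that would give the shapes shown above, assuming the reshape method was used (np.reshape would work equally well). The total number of elements must stay the same:

```python
import numpy as np

arrnumbers3 = np.arange(1000000)
# The new shape must hold the same number of elements: 1000 * 1000 == 1000000.
arrnumbers2d = arrnumbers3.reshape((1000, 1000))
print("arrnumbers3.shape:", arrnumbers3.shape)
print("arrnumbers2d.shape:", arrnumbers2d.shape)
```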
Similar to Python lists, we can access elements of an array using square brackets and indices. The syntax is:
arr[start:end:step]
Some examples:
1
2
# print a range of elements in arrnumbers:
print('arrnumbers[0:2]:', arrnumbers[0:2])
# equivalently:
print('arrnumbers[:2]:', arrnumbers[:2]) # start is 0 by default
# or to print from an index to the end:
print('arrnumbers[1:]:', arrnumbers[1:]) # goes to the end by default
# or to print all numbers:
print('arrnumbers[:]:', arrnumbers[:]) # start and end are default
# print every second element:
print('arrnumbers[::2]:', arrnumbers[::2])
arrnumbers[0:2]: [1 2]
arrnumbers[:2]: [1 2]
arrnumbers[1:]: [2 3]
arrnumbers[:]: [1 2 3]
arrnumbers[::2]: [1 3]
Negative indices can be used to slice starting from the end, and to reverse order. For example:
arrnumbers[-2:] [2 3]
arrnumbers[::-1] [3 2 1]
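The negative-index results above can be reproduced with a short sketch like the following, assuming arrnumbers holds the values 1, 2, 3 as earlier in the text:

```python
import numpy as np

arrnumbers = np.array([1, 2, 3])
print("arrnumbers[-1]", arrnumbers[-1])      # last element
print("arrnumbers[-2:]", arrnumbers[-2:])    # last two elements
print("arrnumbers[::-1]", arrnumbers[::-1])  # negative step reverses the order
```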
For multi-dimensional arrays, the same rules apply: you just separate the indexing for each dimension with commas. For example:
arr2d:
[[1 2 3]
[4 5 6]]
arr2d[0, 0]: 1
arr2d[:, 0]: [1 4]
arr2d[0, :]: [1 2 3]
arr2d[0:2, 1:3]:
[[2 3]
[5 6]]
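One way the indexing results above might be generated, with the row index first and the column index second (the array contents follow the printed output; the code itself is a reconstruction):

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print("arr2d[0, 0]:", arr2d[0, 0])   # row 0, column 0
print("arr2d[:, 0]:", arr2d[:, 0])   # all rows, column 0
print("arr2d[0, :]:", arr2d[0, :])   # row 0, all columns
print("arr2d[0:2, 1:3]:")
print(arr2d[0:2, 1:3])               # rows 0-1, columns 1-2
```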
One important, and extremely useful, thing to know about array slices is that they return views rather than copies of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices are copies. Consider our two-dimensional array from before:
Let’s extract a \(2 \times 2\) subarray from this:
Now if we modify this subarray, we’ll see that the original array is changed! Observe:
This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.
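A minimal sketch of the view behavior described above. The name sub and the value 99 are assumptions for illustration; np.shares_memory is used only to confirm that no data was copied:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6]])
sub = arr2d[:2, :2]        # a 2x2 view onto arr2d; no data is copied
sub[0, 0] = 99             # writing through the view...
print(arr2d)               # ...changes arr2d[0, 0] as well
print(np.shares_memory(arr2d, sub))  # both use the same data buffer

# For contrast, slicing a Python list makes a copy.
lst = [1, 2, 3]
lst_slice = lst[:2]
lst_slice[0] = 99
print(lst)                 # the original list is unchanged
```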
Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the copy() method:
If we now modify this subarray, the original array is not touched:
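A sketch of the copy case, again with hypothetical names (sub_copy and the value 42 are not from the original):

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6]])
sub_copy = arr2d[:2, :2].copy()   # independent copy of the 2x2 block
sub_copy[0, 0] = 42
print(sub_copy)
print(arr2d)                      # the original array is unchanged
```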
You can use boolean expressions to retrieve certain values in an array. For example:
arrnumbers2:
[ 5 3 42]
arrnumbers2[arrnumbers2 > 4]:
[ 5 42]
What’s actually happening here is you’re first creating a boolean array. This is an array in which each element is either True or False. In this case, arrnumbers2 > 4 is creating an array indicating which indices in arrnumbers2 are greater than 4. Passing the boolean array as an index then pulls out those values. We can see this if we break it into two steps:
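The two steps might be written as follows; the names mask and selected are assumptions introduced for illustration:

```python
import numpy as np

arrnumbers2 = np.array([5, 3, 42])

# Step 1: the comparison produces a boolean array, one entry per element.
mask = arrnumbers2 > 4
print("mask:", mask)          # True, False, True

# Step 2: indexing with the boolean array keeps the elements where it is True.
selected = arrnumbers2[mask]
print("selected:", selected)  # the values 5 and 42
```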
A key difference between NumPy arrays and Python lists is that the data in a NumPy array must all be of the same type. You can get the data type of the values in an array using .dtype. For example:
arrnumbers: [1 2 3]
arrnumbers.dtype: int64
arrones: [1. 1. 1. 1.]
arrones.dtype: float64
If you try to create an array with different data types, numpy will automatically cast them to all be the same. For example:
mixed: [1. 2. 3. 4.8 1. 0. ]
mixed.dtype: float64
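The input that produced this output is not shown, so the list below is an assumption: integers, one float, and two booleans would give the same result, since NumPy promotes every value to the most general type present (here float64).

```python
import numpy as np

# Integers, a float, and booleans in one list (an assumed input).
mixed = np.array([1, 2, 3, 4.8, True, False])
print("mixed:", mixed)
print("mixed.dtype:", mixed.dtype)   # every value is promoted to float64
```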
You can cast an array to a different type using .astype. This will create a copy of the array with values cast to the type you specified. For example:
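A small sketch of .astype, using hypothetical arrays arrfloat and arrint. Note that casting floats to integers drops the fractional part rather than rounding:

```python
import numpy as np

arrfloat = np.array([1.7, 2.2, 3.9])
arrint = arrfloat.astype(int)     # new array; the fractional part is dropped
print("arrint:", arrint, arrint.dtype)
print("arrfloat:", arrfloat, arrfloat.dtype)  # the original is unchanged
```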
One of the most useful aspects of NumPy arrays is that they allow you to perform mathematical operations on all the elements in the array using the same syntax you would use for single variables. For example, we can add all the values in one array to another by doing:
c: [ 0 2 4 ... 1999994 1999996 1999998]
Compare that to the way we had to add two Python lists together above. Note that if a and b were Python lists, a+b would concatenate them (i.e., append the values of b to the end of a), whereas if a and b are NumPy arrays, the values are added together element-wise.
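A sketch of the addition, assuming a and b are both np.arange(1000000), which is consistent with the printed values of c:

```python
import numpy as np

a = np.arange(1000000)
b = np.arange(1000000)
c = a + b            # element-wise sum, no explicit loop
print("c:", c)

# The same operator applied to lists concatenates instead.
print([1, 2, 3] + [4, 5, 6])
```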
Aside from being easier to write, NumPy array operations are also much faster than Python operations. Let’s time how long it took to create c:
Compare to what we got when we did the same thing with Python lists above. It’s about 100 times faster!
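Outside a notebook, where the %timeit magic is unavailable, a similar comparison could be sketched with the standard-library timeit module. The array size and the number of repetitions are assumptions, and the measured ratio will vary by machine:

```python
import timeit
import numpy as np

n = 1000000
list_a, list_b = list(range(n)), list(range(n))
arr_a, arr_b = np.arange(n), np.arange(n)

# Average seconds per call over 10 calls of each version.
t_list = timeit.timeit(lambda: [x + y for x, y in zip(list_a, list_b)], number=10) / 10
t_numpy = timeit.timeit(lambda: arr_a + arr_b, number=10) / 10

print(f"lists: {t_list * 1e3:.2f} ms per call")
print(f"NumPy: {t_numpy * 1e3:.2f} ms per call")
print(f"speed-up: {t_list / t_numpy:.0f}x")
```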
NumPy comes with a large number of built-in math functions, which we can run on NumPy arrays. For example:
[ 0. 0.84147098 0.90929743 ... 0.21429647 -0.70613761
-0.97735203]
499999500000
499999.5
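The three results above are consistent with the following calls, assuming a is np.arange(1000000) as in the addition example:

```python
import numpy as np

a = np.arange(1000000)
print(np.sin(a))    # sine of every element
print(np.sum(a))    # sum of all elements
print(np.mean(a))   # arithmetic mean
```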
Some operations can also be executed as methods on the array. For example:
499999500000
499999.5
Note that this doesn’t work with the sine function, however:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[33], line 1
----> 1 a.sin()

AttributeError: 'numpy.ndarray' object has no attribute 'sin'
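The method forms and the missing sin method can be checked side by side. This sketch assumes the same array a as before and uses hasattr to avoid raising the error:

```python
import numpy as np

a = np.arange(1000000)
print(a.sum())    # same result as np.sum(a)
print(a.mean())   # same result as np.mean(a)

# There is no sin method on arrays; sine is only available as np.sin.
print(hasattr(a, "sin"))
print(np.sin(a[:3]))
```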
In VS Code, you can see all the operations available as methods of an array by typing the array name followed by a period (e.g., a.). That will show a drop-down list that you can cycle through.
Let’s illustrate the speed and simplicity of NumPy vs native Python lists.
1. Create a Python list that has 100,000 values equally spaced between 0 and 2*pi (pi = 3.141592653589793).
2. Calculate the average of the cosine of every value in the Python list. For the cos function, you will need to import the math module.
3. Repeat steps 1 and 2, but using purely NumPy arrays and functions. You should be able to do step 2 in a single line of code. Note that NumPy has a built-in pi value (np.pi).
4. Time how long it takes the computer to do step 2. Compare how long that takes when you use NumPy. When doing the comparison, just time the math operation step, not the array creation. Hint: for timing the Python version, you’ll need to use %%timeit rather than %timeit, as the Python version will require multiple lines of code. Put all the lines in a single cell in your notebook, and put %%timeit at the top to time the entire cell, rather than just a single line.
5. Which is computationally faster? By what factor?
np.float64(9.99999999994543e-06)
Numpy:
You should find that the NumPy version is faster, by a factor of 10-100.
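One possible sketch of the exercise, not necessarily the intended solution. It uses time.perf_counter for a single rough timing instead of the %timeit magic, so the measured times are noisier than what %timeit reports; the variable names are assumptions:

```python
import math
import time
import numpy as np

n = 100000

# Pure Python: build the list, then average the cosine with a loop.
xs = [2 * math.pi * i / (n - 1) for i in range(n)]
start = time.perf_counter()
total = 0.0
for x in xs:
    total += math.cos(x)
mean_py = total / n
t_py = time.perf_counter() - start

# NumPy: the same calculation as a single expression.
arr = np.linspace(0, 2 * np.pi, n)
start = time.perf_counter()
mean_np = np.mean(np.cos(arr))
t_np = time.perf_counter() - start

print("Python:", mean_py, f"({t_py * 1e3:.2f} ms)")
print("NumPy: ", mean_np, f"({t_np * 1e3:.2f} ms)")
```

Both means should be close to 1e-05: the cosine averages to zero over a full period, and the repeated endpoint at 2*pi contributes one extra value of 1.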