NumPy

Basic data structure

A Python list can hold elements of different types, so its elements are actually pointers to objects, which wastes a lot of memory and CPU time. The basic objects of NumPy are ndarray and ufunc. An ndarray stores data (bool, int, float, etc.), and a ufunc is a function that operates on ndarrays. The ndarray, an indexable, n-dimensional array whose elements all have the same type (dtype), is the basic data structure of NumPy. Its number of dimensions is the number of indices needed to reach a scalar in the array. Vectors are 1-D arrays and matrices are 2-D arrays.

Use .shape and .dtype to get the shape (the size of each dimension) and the element type of an array.
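As a small illustration of these attributes, the sketch below builds an array and inspects it. The variable name m is only for this example, and .ndim (the number of dimensions) is shown alongside the two attributes mentioned above.

```python
import numpy as np

# A 2-D array (matrix) with 3 rows and 4 columns
m = np.zeros((3, 4))

print(m.shape)  # (3, 4): size of each dimension
print(m.ndim)   # 2: number of dimensions
print(m.dtype)  # float64: element type
```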

Instead of float64, we often use float32 to accelerate computing.
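A minimal sketch of switching to float32 follows. It only demonstrates the memory saving, which is certain; how much faster the computation gets depends on the hardware and the operation, so no timing is claimed here. The array names are made up for the example.

```python
import numpy as np

a64 = np.ones(1000)                    # float64 by default
a32 = np.ones(1000, dtype=np.float32)  # request float32 at creation
b32 = a64.astype(np.float32)           # or convert an existing array (makes a copy)

print(a64.nbytes)  # 8000 bytes
print(a32.nbytes)  # 4000 bytes
print(b32.dtype)   # float32
```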

Vectors

Vectors are 1-D arrays in NumPy. To create a vector, we can use:

import numpy as np

# Create a vector with 4 elements, all 0, of type float64
a = np.zeros(4)
# The same as above. Passing the shape as a tuple is how we create n-D arrays
a = np.zeros((4,)) # np.ones creates an ndarray whose values are all 1
# Create a vector with 4 random float64 values in [0, 1)
a = np.random.random_sample((4,))
# np.arange([start=0], stop, [step=1]). Create an arithmetic progression in [start, stop)
a = np.arange(4.)
# Create a vector with 4 values drawn uniformly from [0, 1)
a = np.random.rand(4) # rand takes the dimensions as separate arguments, not a tuple
# Specify values manually
a = np.array([5, 4, 3, 2, 1])
a = np.array([5.0, 4, 3, 2, 1])
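The last two lines above differ only in one literal, 5 versus 5.0, yet they produce arrays of different types. The sketch below makes that visible; the names ints and floats are chosen for this example only.

```python
import numpy as np

ints = np.array([5, 4, 3, 2, 1])      # all integers -> integer dtype
floats = np.array([5.0, 4, 3, 2, 1])  # one float -> every element becomes float64

print(ints.dtype)     # a platform-dependent integer type, e.g. int64
print(floats.dtype)   # float64
print(np.arange(4.))  # [0. 1. 2. 3.]: a float argument gives a float vector
```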

Matrices

Matrices are 2-D arrays in NumPy. To create a matrix, we use the same functions as for vectors. For example:

a = np.zeros((4, 2))

However, when specifying values manually, NumPy reads the matrix row by row, so each inner list is one row:

a = np.array([[1, 2],
              [3, 4]])

We can also create a matrix from a vector:

a = np.zeros(6).reshape(-1, 2)

which creates a 3x2 matrix. -1 tells NumPy to infer the number of rows from the total number of elements and the number of columns. With .reshape(-1, ), we turn a matrix into a vector by concatenating its rows one after another, that is:

a = np.array([[2, 3],
              [4, 5]])
print(a.reshape(-1, ))
# We get: [2 3 4 5]

.reshape treats a matrix as a vector with $mn$ elements and refills the new shape row by row. Therefore, if we want to transpose a matrix, we must use a.T rather than .reshape. The object returned by .reshape shares memory with the original object whenever possible (otherwise NumPy makes a copy), but .reshape doesn't change the shape of the original object.
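To make the difference between .reshape and .T concrete, here is a short sketch on a 2x3 matrix. The names r and t are illustrative, and np.shares_memory is used only to check the memory-sharing claim.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0 1 2] [3 4 5]]

r = a.reshape(3, 2)  # refills row by row: [[0 1] [2 3] [4 5]]
t = a.T              # true transpose:     [[0 3] [1 4] [2 5]]

print(np.array_equal(r, t))    # False
print(a.shape)                 # (2, 3): a itself is unchanged
print(np.shares_memory(a, r))  # True: r is a view of a here

r[0, 0] = 100                  # writing through the view...
print(a[0, 0])                 # 100: ...changes a as well
```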

Operations

Indexing & slicing

Arrays in NumPy can be indexed and sliced like Python lists, though the values returned are NumPy types (float64, int32, ndarray, etc.) rather than built-in Python types.

When slicing a certain column, we should use a[:, j].

In general, there are 5 different ways to read an ndarray using []:

  1. Integer
    a = np.arange(12).reshape(3, 4)
    '''
    a = [[0 1 2 3]
    [4 5 6 7]
    [8 9 10 11]]
    '''
    print(a[0])
    # Get [0 1 2 3], shape(4,)

    print(a[0][0])
    # Get 0, shape(), scalar

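    The returned object shares memory with the original object: a[0] is a view of the first row. (a[0][0] is a scalar; a[0, 0] reads the same element in one step.)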

  2. Slicing
    print(a[:, 1])
    # Get [1 5 9], the column with index 1, shape (3,)

    print(a[0:2, 1:3])
    # Get [[1 2] [5 6]], shape (2, 2)

    print(a[::2, ::2]) # rows and columns with a step of 2
    # Get [[0 2] [8 10]], shape (2, 2)

    The returned object still shares memory with the original object.

  3. Integer list
    print(a[[0, 1]])
    # Get [[0 1 2 3] [4 5 6 7]], shape (2, 4)

    The returned object is a new object. Indexing with an integer list copies the selected rows, so the result doesn't share memory with the original object.

  4. Integer matrix
    a = np.arange(12)
    b = np.array([[1, 2, 4], [5, 6, 9]])
    print(a[b])
    # Get [[1 2 4] [5 6 9]], shape (2, 3)

    c = np.array([[1, 2], [3, 4]])
    d = np.array([[1, 0], [0, 1]])
    print(c[d])
    '''
    [[[3 4]
    [1 2]]

    [[1 2]
    [3 4]]]
    '''
    # c[[1, 0]] is (2, 2) so c[d] is (2, 2, 2)

    The returned object is a new object. It doesn't share memory with the original object. Each value of b is an index into the first dimension of a, and the result has the shape of b.

    To understand the second example, focus on d rather than c. The values of d are indices into c's first dimension (rows), and NumPy replaces each of them with the row it points to.

  5. Bool array
    a = np.array([[1, 2], [3, 4]])
    b = np.array([[True, False], [False, True]])
    print(a[b])
    # Get [1 4], shape(2, )

    The returned object is a new object. It doesn't share memory with the original object. When indexing with a bool array, NumPy keeps only the elements where the mask is True and returns them as a vector.
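Whether an indexing form returns a view or a copy can be checked directly. The sketch below uses np.shares_memory for that; the variable names are chosen for this example. Integer indices and slices give views, while integer lists, integer arrays and bool arrays give copies.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

row = a[0]            # integer index -> view
block = a[0:2, 1:3]   # slice         -> view
picked = a[[0, 1]]    # integer list  -> copy
masked = a[a > 5]     # bool array    -> copy

print(np.shares_memory(a, row))     # True
print(np.shares_memory(a, block))   # True
print(np.shares_memory(a, picked))  # False
print(np.shares_memory(a, masked))  # False

picked[0, 0] = -1   # modifying a copy leaves a untouched
print(a[0, 0])      # 0
block[0, 0] = -1    # modifying a view writes into a
print(a[0, 1])      # -1
```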

Elementwise computations

In elementwise computations, NumPy applies the same operation to each element of a matrix. For example:

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

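# Arithmetic operators work element by element
s = a + b  # [[ 6  8] [10 12]]
d = a - b  # [[-4 -4] [-4 -4]]
m = a * b  # [[ 5 12] [21 32]]
q = a / b  # [[1/5 2/6] [3/7 4/8]], true division, the result is float64
p = a**b   # [[1**5 2**6] [3**7 4**8]], equal to np.power(a, b)
# e**a, element by element
e = np.exp(a)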

Because of broadcasting, b = 5 * a, b = a + 1 and b = a**2 are also valid. The same rule applies to boolean operations, that is:

a = np.array([[0, 1],
              [1, 1],
              [0, 1]])
pos = a == 1 # a == 1 is applied to each element of a and returns a bool matrix
print(pos)
We get:
[[False True]
[ True True]
[False True]]
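A bool matrix like pos is usually consumed right away, for counting or for selecting elements. The sketch below repeats the definitions so that it runs on its own; it is one possible use, not the only one.

```python
import numpy as np

a = np.array([[0, 1],
              [1, 1],
              [0, 1]])
pos = a == 1

print(pos.sum())        # 4: each True counts as 1
print(pos.sum(axis=0))  # [1 3]: number of ones in each column
print(a[pos])           # [1 1 1 1]: the elements where pos is True
```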

Broadcasting

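If a is a matrix and b is a vector whose length equals the number of columns of a, then a + b, a - b, a * b and a / b apply b to each row of a. (A column vector of shape (m, 1), where m is the number of rows of a, is applied to each column instead.) This is broadcasting in NumPy. See more about it on Broadcasting.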

More generally, broadcasting aligns the shapes of the two arrays and stretches each dimension to the larger of the two sizes. It works only when, comparing the shapes from the last dimension backwards, each pair of sizes is either equal or contains a 1; a dimension missing from the array with fewer dimensions counts as 1. For example:

a = np.arange(6).reshape(-1, 1) # shape (6, 1)
b = np.arange(5) # shape(5, )
c = a + b # shape(6, 5)
'''
a extends to (6, 5)
b extends to (6, 5)
'''

The procedure of broadcasting:

  1. Compare the sizes of the two arrays dimension by dimension, starting from the last dimension; [e.g. a(6, 1), b(5, ), compare 1 with 5]
  2. Broadcast the number of dimensions: pad the shape of the array with fewer dimensions with 1s on the left; [e.g. a(6, 1), b(5, ), broadcast b(5, ) to b(1, 5)]
  3. Broadcast the size of each dimension: an array with size 1 in a dimension is stretched to match the size of the other array in that dimension; [e.g. a(6, 1)->a(6, 5); b(1, 5)->b(6, 5)]
  4. Report an error when a dimension can't be broadcast, namely when the two arrays have different sizes in that dimension and neither size is 1.
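The four steps can be checked with a short sketch. It reuses the shapes from the example above and adds a hypothetical array bad, of shape (6, 2), to trigger the error case of step 4.

```python
import numpy as np

a = np.arange(6).reshape(-1, 1)  # shape (6, 1)
b = np.arange(5)                 # shape (5,)

# Steps 1-3: b is treated as (1, 5), then both are stretched to (6, 5)
print(np.broadcast(a, b).shape)  # (6, 5)
c = a + b
print(c[2, 3])                   # a[2, 0] + b[3] = 2 + 3 = 5

# Step 4: the trailing sizes 2 and 5 differ and neither is 1
bad = np.zeros((6, 2))
try:
    bad + b
except ValueError as err:
    print("cannot broadcast:", err)
```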

Dot product

c = np.dot(a, b)
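What np.dot computes depends on the number of dimensions of its arguments. The sketch below, with made-up values, shows the two most common cases: for two vectors it returns the scalar inner product, and for two matrices it returns the matrix product, which differs from the elementwise a * b.

```python
import numpy as np

# Two vectors: sum of elementwise products, a scalar
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
print(np.dot(u, v))  # 1*4 + 2*5 + 3*6 = 32

# Two matrices: matrix product, not elementwise multiplication
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b))  # [[19 22] [43 50]]
print(a * b)         # [[ 5 12] [21 32]], elementwise, for comparison
```

For 2-D arrays, a @ b gives the same result as np.dot(a, b).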

Others

# Add all the elements up, return a scalar
b = np.sum(a) # pass axis=0 to sum down each column, axis=1 to sum along each row
# Average of a, return a scalar
c = np.mean(a) # axis works the same way as in np.sum
# Stack vectors side by side to form a matrix. Each item becomes a column.
d = np.c_[a, a**2] # if a.shape is (4,), d.shape is (4, 2)
# Return the index of the maximum value of an array, optionally along an axis (0: per column, 1: per row).
f = np.argmax(a) # optional arguments: axis, out; with axis given, the result is an ndarray
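The effect of the axis argument is easiest to see on a small matrix. The sketch below uses made-up values; axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row).

```python
import numpy as np

a = np.array([[1, 5, 2],
              [7, 0, 3]])

print(np.sum(a))             # 18: all elements
print(np.sum(a, axis=0))     # [8 5 5]: one sum per column
print(np.sum(a, axis=1))     # [ 8 10]: one sum per row
print(np.mean(a, axis=1))    # per-row means, 8/3 and 10/3
print(np.argmax(a, axis=0))  # [1 0 1]: row index of the max in each column
print(np.argmax(a, axis=1))  # [1 0]: column index of the max in each row

v = np.arange(4)
print(np.c_[v, v**2].shape)  # (4, 2): v and v**2 become the two columns
```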

Tile

numpy.tile(A, reps), where A is the input array and reps is the number of repetitions of A along each dimension, repeats A to build a larger array and, if needed, adds dimensions.

A.ndim > len(reps)

For example:

a = np.array([[0, 1, 2],[3,4,5]])
b = np.tile(a, (2))

NumPy pads reps with leading 1s, from back to front, until it has as many entries as a has dimensions, so (2) becomes (1, 2). Therefore, it is equal to np.tile(a, (1, 2)):

b = 
[[0 1 2 0 1 2]
[3 4 5 3 4 5]]

A.ndim < len(reps)

For example:

a = np.array([[0, 1, 2],[3,4,5]]) # shape (2, 3)
b = np.tile(a, (2, 3, 4))

NumPy will extend the shape of a to (1, 2, 3) from back to front. Therefore, the columns of a are repeated 4 times, the rows 3 times and the new first dimension 2 times, giving b the shape (2, 6, 12).

b =
[[[0 1 2 0 1 2 0 1 2 0 1 2]
[3 4 5 3 4 5 3 4 5 3 4 5]
[0 1 2 0 1 2 0 1 2 0 1 2]
[3 4 5 3 4 5 3 4 5 3 4 5]
[0 1 2 0 1 2 0 1 2 0 1 2]
[3 4 5 3 4 5 3 4 5 3 4 5]]

[[0 1 2 0 1 2 0 1 2 0 1 2]
[3 4 5 3 4 5 3 4 5 3 4 5]
[0 1 2 0 1 2 0 1 2 0 1 2]
[3 4 5 3 4 5 3 4 5 3 4 5]
[0 1 2 0 1 2 0 1 2 0 1 2]
[3 4 5 3 4 5 3 4 5 3 4 5]]]
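Both cases can be verified by looking at the resulting shapes. The sketch below reuses the array from the examples above; the name c for the second result is chosen here to keep both results available.

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5]])  # shape (2, 3)

# reps shorter than a.ndim: reps is padded with leading 1s
b = np.tile(a, 2)
print(b.shape)                                # (2, 6)
print(np.array_equal(b, np.tile(a, (1, 2))))  # True

# reps longer than a.ndim: a is padded with leading axes of length 1
c = np.tile(a, (2, 3, 4))
print(c.shape)                                # (2, 6, 12) = (1*2, 2*3, 3*4)
print(np.array_equal(c[0, :2, :3], a))        # True: every tile is a copy of a
```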

Vectorized operations run in NumPy's compiled loops, which can use SIMD instructions, so they are usually much faster than Python loops.
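As a rough illustration, the sketch below computes a sum of squares twice, once with an explicit Python loop and once with a vectorized expression. It only checks that the two results agree; the actual speed difference depends on the machine and can be measured with the standard timeit module.

```python
import numpy as np

x = np.arange(1000, dtype=np.float64)

# Explicit Python loop: one interpreted iteration per element
total_loop = 0.0
for value in x:
    total_loop += value * value

# Vectorized: the loop runs inside NumPy's compiled code
total_vec = np.sum(x * x)

print(np.isclose(total_loop, total_vec))  # True
```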

Random.choice

numpy.random.choice(a, size=None, replace=True, p=None) is a random sampling operation. It returns an array whose elements are the sampled values (or a single value when size is None).

  • a, an array or an integer. If a is an array, the samples are drawn from it; otherwise, they are drawn from np.arange(a);
  • size, an integer or a tuple. If size is an integer, it draws size samples in total. If size is a tuple (e.g. (m, n, k)), it draws m x n x k samples and arranges them in the shape (m, n, k);
  • replace, True or False, where True means sampling with replacement and False means sampling without replacement;
  • p, None or an array. If None, every element is equally likely to be selected; if it is an array, its length must equal the length of a, its entries must sum to 1, and each entry is the probability of choosing the corresponding element of a.
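The parameters above can be tried out in a few lines. In this sketch the seed, the array colors and the probabilities are arbitrary choices for illustration; exact sampled values are not shown because they depend on the random state.

```python
import numpy as np

np.random.seed(0)  # fix the seed so the sketch is repeatable

# a is an integer: sample from np.arange(5); size is a tuple
grid = np.random.choice(5, size=(2, 3))
print(grid.shape)  # (2, 3)

# a is an array: sample without replacement, so no value repeats
colors = np.array(["red", "green", "blue"])
perm = np.random.choice(colors, size=3, replace=False)
print(sorted(perm.tolist()))  # ['blue', 'green', 'red']

# p gives one probability per element of a and must sum to 1
biased = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])
print(biased.mean())  # close to 0.1
```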

More information