Python for Data Scientists - NumPy
Introduction
We'll start our Python for Data Scientists series with NumPy, short for Numerical Python, the foundational package for scientific computing in Python. In data analysis, one of its main roles is to serve as the container for data passed between algorithms. For numerical data, NumPy arrays are a much more efficient way of storing and manipulating data than Python's built-in data structures. Also, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying it. Here are some of the things it provides:
- A fast and efficient multidimensional array object, ndarray
- Functions for performing element-wise computations on arrays
- Tools for reading and writing array-based data sets to disk
- Linear algebra operations, Fourier transforms, and random number generation
- Tools for integrating C, C++, and Fortran code with Python
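The point about operating on array data without copying can be seen from Python itself. The following is only a minimal sketch: it uses Python's buffer protocol (the same mechanism many C extensions rely on) to wrap an array's memory in a second array object. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(5, dtype=np.int64)

# Expose the array's memory through the buffer protocol; no data is copied.
buf = memoryview(arr)

# Wrap that same memory in a second array object.
alias = np.frombuffer(buf, dtype=np.int64)

# A change made through the original array is visible through the alias,
# because both objects refer to the same block of memory.
arr[0] = 99
shares = np.shares_memory(arr, alias)
print(alias[0], shares)
```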
Installation
Since everyone uses Python for different applications, there is no single solution for setting up Python and required add-on packages. Personally I recommend using one of the following base Python distributions:
- Enthought Python Distribution: a scientific-oriented Python distribution from Enthought. This includes Canopy Express, a free base scientific distribution (with NumPy, SciPy, matplotlib, Chaco, and IPython) and Canopy Full, a comprehensive suite of more than 300 scientific packages across many domains.
- Python(x,y): A free scientific-oriented Python distribution for Windows.
If you already have a working Python installation, you can also install NumPy on its own with pip:

    pip install numpy
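Once NumPy is installed, a quick sanity check is to import it and print its version. This is just a small sketch; the version string you see depends on your installation.

```python
import numpy as np

# If the import succeeds, NumPy is available in this environment.
version = np.__version__
print("NumPy version:", version)
```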
Features
ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using syntax similar to the equivalent operations between scalar elements. This matters because it lets you express batch operations on data without writing any for loops, a practice usually called vectorization. Consider the next snippet:

    import numpy as np

    arr = np.arange(15)        # integers from 0 to 14 (15 is excluded), as an array
    arr[5:8] = 12              # assign 12 to the items at indices 5, 6, and 7
    arr.sort()                 # sort the array in place
    arr = 1 / arr              # element-wise reciprocal; the 0 entry becomes inf (with a RuntimeWarning)
    arr = arr.reshape((3, 5))  # reshape returns a new 3x5 array, so assign it back
    arr[arr < 5] = 0           # zero the elements less than 5

The code is largely self-explanatory and gives you a little taste of what you can do with NumPy. Let us take a step further.
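Before that, it may help to see vectorization side by side with the loop it replaces. This is a minimal sketch with made-up values; both versions compute the same result, but the vectorized form hands the loop to NumPy's compiled code.

```python
import numpy as np

values = np.arange(1, 6)  # the integers 1 through 5

# Explicit Python loop: compute v*v + 1 one element at a time.
result_loop = np.empty_like(values)
for i in range(len(values)):
    result_loop[i] = values[i] * values[i] + 1

# Vectorized form: the same arithmetic applied to the whole array at once.
result_vec = values * values + 1

print(np.array_equal(result_loop, result_vec))
```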
Universal functions
A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of ufuncs as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results. Here are a few examples. For more details, have a look at the ufunc page of the NumPy documentation.

    x = np.sqrt(arr)                     # element-wise square root
    y = np.random.randn(*x.shape) * 100  # random values with the same shape as x
    y = np.floor(y)                      # floor each element of the array
    np.maximum(x, y)                     # element-wise maximum
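To make the "one or more inputs, one or more results" idea concrete, here is a small sketch with a unary ufunc, a binary ufunc, and one that returns two arrays. The input values are made up for illustration.

```python
import numpy as np

a = np.array([1.5, -2.25, 3.0])
b = np.array([2.0, 0.5, -1.0])

absolute = np.abs(a)      # unary ufunc: one input array, one output array
total = np.add(a, b)      # binary ufunc: equivalent to a + b
frac, whole = np.modf(a)  # two outputs: fractional and integral parts

print(absolute, total, frac, whole)
```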
Storing Arrays on Disk in Binary Format
np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with the file extension .npy.

    arr1 = np.arange(10)
    np.save('some_array', arr1)       # writes some_array.npy (the extension is added automatically)
    arr2 = np.load('some_array.npy')
    np.array_equal(arr1, arr2)        # True: the round trip preserves the data

Loading text from files is a fairly standard task. It will at times be useful to load data into vanilla NumPy arrays using np.loadtxt or the more specialized np.genfromtxt. These functions have many options allowing you to specify different delimiters, converter functions for certain columns, rows to skip, and other things.
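As a rough sketch of those text loaders, the example below reads comma-separated data from in-memory strings (standing in for files on disk; the contents are invented for illustration). It uses np.loadtxt for clean data and np.genfromtxt for data with a missing field.

```python
import io
import numpy as np

# Clean CSV text with a header row; io.StringIO stands in for a real file.
text = "x,y,z\n1,2,3\n4,5,6\n"
data = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)

# The same layout with one empty field; genfromtxt fills the gap with nan.
text_with_gap = "x,y,z\n1,,3\n4,5,6\n"
gappy = np.genfromtxt(io.StringIO(text_with_gap), delimiter=",", skip_header=1)

print(data)
print(gappy)
```

Note that the two functions name their row-skipping option differently: skiprows for np.loadtxt and skip_header for np.genfromtxt.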
Linear Algebra
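To get a feel for what this section covers, here is a minimal sketch of three common numpy.linalg operations: the determinant, the inverse, and solving a linear system. The matrix and right-hand side are made-up values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])

det_A = np.linalg.det(A)   # determinant: 4*3 - 3*6 = -6
A_inv = np.linalg.inv(A)   # matrix inverse
x = np.linalg.solve(A, b)  # solves A @ x = b without forming the inverse explicitly

print(det_A, x)
print(np.allclose(A @ A_inv, np.eye(2)))
```

For solving systems, np.linalg.solve is generally preferable to multiplying by the inverse, since it avoids computing the inverse at all.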
Linear algebra operations, such as matrix multiplication, decompositions, and determinants, are the building blocks of nearly every data algorithm. numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant. These are implemented under the hood using the same industry-standard Fortran libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or the Intel MKL.

    import numpy as np
    from numpy import dot, allclose
    from numpy.random import randn
    from numpy.linalg import svd

    a = randn(9, 6)
    b = randn(9, 6)
    c = a + 1j * b                        # build a complex 9x6 matrix
    U, s, V = svd(c, full_matrices=True)  # SVD; V holds the conjugate-transposed right singular vectors
    S = np.zeros((9, 6), dtype=complex)   # 9x6 complex zero matrix
    S[:6, :6] = np.diag(s)                # place the singular values on the diagonal
    allclose(c, dot(U, dot(S, V)))        # True: the factors reconstruct c within a tolerance

This concludes the NumPy tutorial; feel free to explore its documentation in depth. Next time we'll take a deeper look into the Python data science toolkit with an overview of SciPy.