The performance impact of vectorized operations

Scripting language are often associated with being slow. As computers get faster, this may become less and less relevant, but as we get into the realm of big(ger) data, performance efficiency is always welcome. In this blog post we take a look at the impact of vectorized operations. As an example, we will create a Python script to check whether a postal code lies within a certain distance from a reference point and measure how what is the performance impact when we use vectorization.

Checking whether two points are within a certain distance, given their latitude and longitude, is usually very fast. Vectorization wouldn’t help much when doing the operation only once. But what if you repeat that operation hundred of thousands of times? In a C-world, that would also be very fast. In Python? Not so much!

First things first: distance between two points

As it would not make sense to improve the performance of a script without knowing what the script should do, let me introduce you with the Haversine formula to compute the distance between two points:

 $d = 2 r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1) \cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$

where $\phi_{1,2}$ is the latitude of the two points and $\lambda_{1,2}$ is their longitude (both in radians). If you fill in for $r$ the earth (average) radius, you will get the distance between the two points in the same unit (so in $km$ if you used $km$ , $m$ if you used $m$ , and so forth.).

A pythonic haversine formula

If you’re scared by math but in love with Python fear not: we also present you with the ready to use Python code (WARNING: you need NumPy to get it working):

from numpy import sin, cos, pi, arcsin, sqrt
def get_distance(lat, lon, pcode_lat, pcode_lon):
    """
    Find the distance between <code>(lat,lon) and the reference point
    (pcode_lat,pcode_lon).
    """
    RAD_FACTOR = pi / 180.0  # degrees to radians for trig functions
    lat_in_rad = lat * RAD_FACTOR
    lon_in_rad = lon * RAD_FACTOR
    pcode_lat_in_rad = pcode_lat * RAD_FACTOR
    pcode_lon_in_rad = pcode_lon * RAD_FACTOR
    delta_lon = lon_in_rad - pcode_lon_in_rad
    delta_lat = lat_in_rad - pcode_lat_in_rad
    # Next two lines is the Haversine formula
    inverse_angle = (sin(delta_lat / 2) ** 2 + cos(pcode_lat_in_rad) *
                     cos(lat_in_rad) * sin(delta_lon / 2) ** 2)
    haversine_angle = 2 * arcsin(sqrt(inverse_angle))
    EARTH_RADIUS = 6367  # kilometers
    return haversine_angle * EARTH_RADIUS

We could ask ourselves how fast this code is when we want to check 400000 postal codes (which is, approximately, the number of postal codes in the Netherlands). This is useful is we want to get a list of postal codes that fall within a certain radius from a reference postal code.

Some performance measurements

To measure the performance of the above code we create an array containing some random latitude and longitude pairs centered around (52.3905927,4.8412508) (the coordinate pair of our office)

from numpy import random
godatadriven = (52.3905927,4.8412508)
points = random.randn(400000, 2) * 0.01
points[:, 0] = points[:, 0] + godatadriven[0]
points[:, 1] = points[:, 1] + godatadriven[1]

The points[:, 0] syntax is the NumPy way of saying: select all fields from the first column, which in our case would be the latitude of points. To measure how much time is needed to get the distances between points and godatadriven, we can use the ipython %timeit magic function, that measures how much time, averaged on multiple runs, a script takes to execute:

def iterate_distance():
    d = []
    for p in points:
        d.append(get_distance(p[0], p[1], godatadriven[0], godatadriven[1]))
%timeit iterate_distance()  # runs in 3.53 seconds

If you’re familiar with Python, the code above will give you the shivers, as it does not use list comprehension. Changing the code, alas, has only a marginal impact on performance:

%timeit [get_distance(p[0], p[1], godatadriven[0], godatadriven[1]) for p in points]
# runs in 3.46 seconds

If you’re building a real-time application, where a postal code check could be an intermediate computation, 3.46s is an eternity. Luckily, we can do some magic with NumPy.

Vectorization to the rescue

In our code, points is a NumPy array. NumPy arrays can be easily manipulated because, under the hood, they have been coded to be a really convenient way to describe blocks of computer memory. The particular structure chosen by NumPy allows its arrays to be straightforwardly modified by C or Fortran code. This is fundamental here because, as we saw above, computing a similar operation 400000 times in Python is not exactly the most clever idea.

Vectorization is a process by which these element-wise operations are grouped together, allowing NumPy to perform them much more rapidly. In the code above, you already saw a couple of examples of vectorized operations:

points = random.randn(400000, 2) * 0.01
points[:, 0] = points[:, 0] + godatadriven[0]

Scaling randn by 0.01 didn’t require two for-loop for each of its elements: we simply told NumPy to multiply the whole array by 0.01 and it was done. The same when we added godatadriven[0] to points[:, 0]: we didn’t have to write a for-loop for each element because once again NumPy took care of it.

That said, here’s the vectorized operation of getting the distance in NumPy:

%timeit get_distance(points[:, 0], points[:, 1], godatadriven[0], godatadriven[1])
# runs in 21.5 ms!

which is some 165 times faster!

Putting it all together with Pandas

Now suppose you have a pandas dataframe with the complete list of postal codes

import pandas as pd
zip = pd.read_csv("zip.csv", names=["zip", "lat", "lon"], index = "zip")

and that, given a postal code and a given radius, you want a list of postal codes that fall within the given radius. That can be accomplished with the following code:

def get_zips(postal_code, radius):
    """
    Return the postal code that are within <code>radius of postal_code.
    """
    lat, lon = zip.ix[postal_code]
    zip_distance = get_distance(zip.lat, zip.lon, lat, lon) < radius
    return zip[zip_distance].index.values

Thanks to the vectorized nature of NumPy operations (including the comparison with radius on the previous chunk of code), the check is done in a matter of milliseconds, making it a good fit for real time web applications. Not only that: the resulting code is very clean and pythonic, requiring even less typing than what a list comprehension would.

Note: if you are so inclined, a IPython notebook with the performance test is online at nbviewer. The times reported there may vary slightly compared with the ones in this post.