Tuesday, January 6, 2015

Benchmarks of speed (Numpy vs all)

Personally I am a big fan of numpy package, since it makes the code clean and still quite fast.
However I am much worried about the speed, so decided to collect different benchmarks

I decided to write my own benchmarks with several non-trivial operation to check things claimed at different sources. Please find it here here. In my tests numpy is not much slower than other libraries or C++.

numpy vs cython vs weave (numpy is about 2 times slower than others)
(posted in 2011)

Primarily the post is about numba, the pairwise distances are computed with cython, numpy, numba.
Numba is claimed to be the fastest, around 10 times faster than numpy.
(posted in 2013)

Julia is claimed by its developers to be very fast language.
Well, if that is true, there would be no need in writing easily-vectorizable operation in pure python (yep, I mean they are simply cheating), currently these 'benchmarks' are at the main page

The following post was written in 2011, the problem observed is solution of Laplace equation as usual.
Numpy is ~10 times slower than pure C++ solution (and equal to matlab), weave shows the speed comparable with pure C++ (I don't know anyone using weave now)

Fresh (2014) benchmark of different python tools, simple vectorized expression A*B-4.1*A > 2.5*B is evaluated with numpy, cython, numba, numexpr, and parakeet (and two latest are the fastest - about 10 times less time than numpy, achieved by using multithreading with two cores)

Haven't found any general-purpose theano vs numpy benchmarks, but in the article there is comparison of neural networks and theano is expected to give much better speed than numpy/torch(c++)/matlab, specially it is fast on GPU

One more detailed review of numpy vs cython vs c (held in 2014)
Let me copy-paste the example results (computation of std).

MethodTime (ms)Compared to PythonCompared to Numpy
Pure Python183x1x0.03
Naive Cython7.76x24x0.8
Optimised Cython2.18x84x2.7
Cython calling C2.22x82x2.7

Simple, but comprehensive comparison of python accelerators was prepared by Jake Vanderplaas and Olivier Grisel:
Problem is computation of pairwise distances. The results this time seem to be unbiased:

Naive numpy64.7 ms
Parakeet12.3 ms
Cython6.57 ms

By the way, I was recently looking for a slow place in my code, and it totally dropped of my mind that computation of residual is long operation:
Integer division / remainder computation time significaly depends on the bit depth (16 bit operation are ~2 times faster than 32 bit ones, division takes roughly 10 times more time than multiplication).

Surprising: multiplication/addition/binary operations take the same time (compared on numpy 1.9)


Dr. Eisenhoover said...

In this example numpy is indeed just as fast at c. But, the main reason for your speed differences you get is because your code is different.

The following cython code is twice as fast as yours and is equivalent to what you do in your numpy implementation.

from libc.math cimport log, sqrt, exp

cimport cython
import numpy as np
cimport numpy as np

cdef double pi = np.pi
def llh_cython(np.ndarray[np.float64_t, ndim=1] data, double mean, double sigma):
double llh = 0
double s, denom_1, denom_2
np.intp_t i
denom_1 = 2 * (sigma ** 2)
denom_2 = sqrt(2 * pi) * sigma
for i in range(data.size):
s = data[i] - mean
s *= s
s /= denom_1
llh += log(exp(-s) / denom_2)
return llh

Alex said...

Hi, Dr.
First, thank you for providing alternative cython code (I am not deep into cython, so some side-check is needed).
Your cython code is indeed a bit faster (not twice, 87% compared to numpy on my machine: latest numpy and cython, Xeon core)

In some sense you're right that numpy is guaranteed not to recalculate denominators, while cython / c++ versions aren't.
But the version you provided hardly can be 'fairly' compared to original:
1. I expect compilers to optimize computations (-O3 should be enough to precompute constants in loop, IMO; otherwise my code will turn into something awful).
2. It's common situation that likelihood is written for one sample as an abstract function, being later called within loop - this makes precomputations impossible (and we never do so in numpy).

In general, we can work much on low-level implementations of tiny examples to prove that we are able to produce faster code, but that doesn't make much sense in real world.
The readability actually matters too, and from your code it is not instantly obvious what is it doing.

Anyway, I think I'll add some remark later not to mess up people.