Personally I am a big fan of

However I am much worried about the speed, so decided to collect different benchmarks

(posted in 2011)

http://technicaldiscovery.blogspot.ru/2011/06/speeding-up-python-numpy-cython-and.html

Primarily the post is about numba, the pairwise distances are computed with

(posted in 2013)

https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/

Well, if that is true, there would be no need in writing easily-vectorizable operation in pure python (yep, I mean they are simply cheating), currently these 'benchmarks' are at the main page

http://julialang.org/

The following post was written in 2011, the problem observed is solution of Laplace equation as usual.

Numpy is ~10 times slower than pure C++ solution (and equal to matlab),

http://wiki.scipy.org/PerformancePython

Fresh (2014) benchmark of different python tools, simple vectorized expression A*B-4.1*A > 2.5*B is evaluated with

http://nbviewer.ipython.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day7_2_jit_numpy.ipynb

Haven't found any general-purpose

http://conference.scipy.org/proceedings/scipy2010/pdfs/bergstra.pdf

One more detailed review of numpy vs cython vs c (held in 2014)

http://notes-on-cython.readthedocs.org/en/latest/std_dev.html

Let me copy-paste the example results (computation of std).

Simple, but comprehensive comparison of python accelerators was prepared by Jake Vanderplaas and Olivier Grisel:

http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Numba%20Parakeet%20Cython.ipynb

Problem is computation of pairwise distances. The results this time seem to be unbiased:

By the way, I was recently looking for a slow place in my code, and it totally dropped of my mind that

http://embeddedgurus.com/stack-overflow/2011/02/efficient-c-tip-13-use-the-modulus-operator-with-caution/

Integer division / remainder computation time significaly depends on the bit depth (16 bit operation are ~2 times faster than 32 bit ones, division takes roughly 10 times more time than multiplication).

Surprising: multiplication/addition/binary operations take the same time (compared on numpy 1.9)

**numpy**package, since it makes the code clean and still quite fast.However I am much worried about the speed, so decided to collect different benchmarks

I decided to write my own benchmarks with several non-trivial operation to check things claimed at different sources.
Please find it here here.
In my tests numpy is not much slower than other libraries or C++.

**numpy**vs**cython**vs**weave**(numpy is about 2 times slower than others)(posted in 2011)

http://technicaldiscovery.blogspot.ru/2011/06/speeding-up-python-numpy-cython-and.html

Primarily the post is about numba, the pairwise distances are computed with

**cython**,**numpy**,**numba**.**Numba**is claimed to be the fastest, around 10 times faster than numpy.(posted in 2013)

https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/

**Julia**is claimed by its developers to be very fast language.Well, if that is true, there would be no need in writing easily-vectorizable operation in pure python (yep, I mean they are simply cheating), currently these 'benchmarks' are at the main page

http://julialang.org/

The following post was written in 2011, the problem observed is solution of Laplace equation as usual.

Numpy is ~10 times slower than pure C++ solution (and equal to matlab),

**weave**shows the speed comparable with pure C++ (I don't know anyone using weave now)http://wiki.scipy.org/PerformancePython

Fresh (2014) benchmark of different python tools, simple vectorized expression A*B-4.1*A > 2.5*B is evaluated with

**numpy, cython, numba, numexpr, and parakeet**(and two latest are the fastest - about 10 times less time than numpy, achieved by using multithreading with two cores)http://nbviewer.ipython.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day7_2_jit_numpy.ipynb

Haven't found any general-purpose

**theano vs numpy**benchmarks, but in the article there is comparison of neural networks and theano is expected to give much better speed than numpy/torch(c++)/matlab, specially it is fast on GPUhttp://conference.scipy.org/proceedings/scipy2010/pdfs/bergstra.pdf

One more detailed review of numpy vs cython vs c (held in 2014)

http://notes-on-cython.readthedocs.org/en/latest/std_dev.html

Let me copy-paste the example results (computation of std).

Method | Time (ms) | Compared to Python | Compared to Numpy |
---|---|---|---|

Pure Python | 183 | x1 | x0.03 |

Numpy | 5.97 | x31 | x1 |

Naive Cython | 7.76 | x24 | x0.8 |

Optimised Cython | 2.18 | x84 | x2.7 |

Cython calling C | 2.22 | x82 | x2.7 |

Simple, but comprehensive comparison of python accelerators was prepared by Jake Vanderplaas and Olivier Grisel:

http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Numba%20Parakeet%20Cython.ipynb

Problem is computation of pairwise distances. The results this time seem to be unbiased:

Python | 9.51s |

Naive numpy | 64.7 ms |

Numba | 6.72ms |

Cython | 6.57ms |

Parakeet | 12.3 ms |

Cython | 6.57 ms |

By the way, I was recently looking for a slow place in my code, and it totally dropped of my mind that

*computation of residual is long operation*:http://embeddedgurus.com/stack-overflow/2011/02/efficient-c-tip-13-use-the-modulus-operator-with-caution/

Integer division / remainder computation time significaly depends on the bit depth (16 bit operation are ~2 times faster than 32 bit ones, division takes roughly 10 times more time than multiplication).

Surprising: multiplication/addition/binary operations take the same time (compared on numpy 1.9)

## 2 comments :

In this example numpy is indeed just as fast at c. But, the main reason for your speed differences you get is because your code is different.

The following cython code is twice as fast as yours and is equivalent to what you do in your numpy implementation.

%%cython

from libc.math cimport log, sqrt, exp

cimport cython

import numpy as np

cimport numpy as np

cdef double pi = np.pi

@cython.boundscheck(False)

@cython.wraparound(False)

@cython.cdivision(True)

def llh_cython(np.ndarray[np.float64_t, ndim=1] data, double mean, double sigma):

cdef:

double llh = 0

double s, denom_1, denom_2

np.intp_t i

denom_1 = 2 * (sigma ** 2)

denom_2 = sqrt(2 * pi) * sigma

for i in range(data.size):

s = data[i] - mean

s *= s

s /= denom_1

llh += log(exp(-s) / denom_2)

return llh

Hi, Dr.

First, thank you for providing alternative cython code (I am not deep into cython, so some side-check is needed).

Your cython code is indeed a bit faster (not twice, 87% compared to numpy on my machine: latest numpy and cython, Xeon core)

In some sense you're right that numpy is guaranteed not to recalculate denominators, while cython / c++ versions aren't.

But the version you provided hardly can be 'fairly' compared to original:

1. I expect compilers to optimize computations (-O3 should be enough to precompute constants in loop, IMO; otherwise my code will turn into something awful).

2. It's common situation that likelihood is written for one sample as an abstract function, being later called within loop - this makes precomputations impossible (and we never do so in numpy).

In general, we can work much on low-level implementations of tiny examples to prove that we are able to produce faster code, but that doesn't make much sense in real world.

The readability actually matters too, and from your code it is not instantly obvious what is it doing.

Anyway, I think I'll add some remark later not to mess up people.

Post a Comment