Recently worked over optimization of some (secret) classifier. The problem was mostly not in training, but in applying of trained classifier — this code was originally written in C++ and then translated to cython (which decreased the speed by factor of 2).

This was quite easy rewrite the code using numpy and vectorized approach (initally predictions were built event-by event, after rewriting the classifier was applied tree-by-tree). However this gave only speed comparable with original C++ code (so, twixe faster cython).

What really fastened the code is switching from int8 operations to int64 (the latest are natively supported in all modern processors). So 8 operations in int8 were grouped into one64-bit. This is done pretty simple by creating views:

In this simple example we see 5x speed up. Views of course do not copy the data, which is very essential for the speed. This trick can be applied with: summation / subtration / binary or / binary and, but you need that the size of original array was divisible by 8.

By the way, there is awesome collection of twiddling bit

This was quite easy rewrite the code using numpy and vectorized approach (initally predictions were built event-by event, after rewriting the classifier was applied tree-by-tree). However this gave only speed comparable with original C++ code (so, twixe faster cython).

What really fastened the code is switching from int8 operations to int64 (the latest are natively supported in all modern processors). So 8 operations in int8 were grouped into one

In[1]: import numpy In[2]: x = numpy.random.randint(0, 100, size=64000).astype(’int8′) In[3]: y = numpy.random.randint(0, 100, size=64000).astype(’int8′) In[4]: %timeit x & y 10000 loops, best of 3: 60.6 µs per loop In[5]: %timeit x.view(’int64′) & y.view(’int64′) 100000 loops, best of 3: 12 µs per loop # Checked that output is correct In[6]: numpy.all( (x & y) == (x.view(’int64′) & y.view(’int64’)).view(’int8′) ) Out[6]: True

In this simple example we see 5x speed up. Views of course do not copy the data, which is very essential for the speed. This trick can be applied with: summation / subtration / binary or / binary and, but you need that the size of original array was divisible by 8.

By the way, there is awesome collection of twiddling bit

__s__, which was my starting point in bit optimizations.

## No comments :

Post a Comment