Sunday, December 7, 2014

Optimization of vector operations

Recently worked over optimization of some (secret) classifier. The problem was mostly not in training, but in applying of trained classifier — this code was originally written in C++ and then translated to cython (which decreased the speed by factor of 2).

This was quite easy rewrite the code using numpy and vectorized approach (initally predictions were built event-by event, after rewriting the classifier was applied tree-by-tree). However this gave only speed comparable with original C++ code (so, twixe faster cython).

What really fastened the code is switching from int8 operations to int64 (the latest are natively supported in all modern processors). So 8 operations in int8 were grouped into one 64-bit. This is done pretty simple by creating views:

In[1]: import numpy
In[2]: x = numpy.random.randint(0, 100, size=64000).astype(’int8′)
In[3]: y = numpy.random.randint(0, 100, size=64000).astype(’int8′)
In[4]: %timeit x & y
10000 loops, best of 3: 60.6 µs per loop
In[5]: %timeit x.view(’int64′) & y.view(’int64′)
100000 loops, best of 3: 12 µs per loop
# Checked that output is correct
In[6]: numpy.all( (x & y) == (x.view(’int64′) & y.view(’int64’)).view(’int8′) )
Out[6]: True



In this simple example we see 5x speed up. Views of course do not copy the data, which is very essential for the speed. This trick can be applied with: summation / subtration / binary or / binary and, but you need that the size of original array was divisible by 8.

By the way, there is awesome collection of twiddling bits, which was my starting point in bit optimizations.