Friday, October 9, 2015

Moving blog to github pages

I'm moving this blog to github pages.

Now it is powered by Jekyll, so I have a bit more freedom in styling and the ability to write posts in Markdown.

New posts that appeared recently:

  1. My own benchmark of numpy number-crunchers on a simple example, which contradicts most of the results I had heard and written about before
    http://arogozhnikov.github.io/2015/09/08/SpeedBenchmarks.html
  2. A set of advanced numpy tricks for data analysis
    http://arogozhnikov.github.io/2015/09/29/NumpyTipsAndTricks1.html
  3. An interactive demonstration of ROC curve
    http://arogozhnikov.github.io/2015/10/05/roc-curve.html
  4. Want to know how physicists reconstruct distributions from noisy data? A post about sPlot
    http://arogozhnikov.github.io/2015/10/07/splot.html
  5. Finally, a post about my algorithm: reweighting distributions with gradient boosting!
    http://arogozhnikov.github.io/2015/10/09/gradient-boosted-reweighter.html
Also, now you can see the sources of all posts in the corresponding repository.

Thanks, blogspot.

Saturday, July 4, 2015

Machine learning in COMET experiment (part II)

In the previous post I wrote a short explanation of COMET, a Japanese experiment in particle physics. This second post is devoted to the machine learning approach we developed for tracking (currently it uses information which is probably unavailable online, but I still hope we will be able to get some online information and thus apply part of the approach as a high-level trigger).

Local wire features

The data we are going to use for classification is:
  • energy deposit for each wire
  • time of deposit for each wire

Let's look at their distributions:
Energy deposited on each 'wire' by signal and background tracks
Time elapsed from the moment of triggering until the particles are detected.


How do we obtain these values?
COMET uses straw chambers (which we call wires for simplicity): long tubes which fill the detector. They are filled with gas, which is ionized by moving charged particles (mostly electrons).

After ionization, the electrons and ions drift in opposite directions, so we can estimate the moment of ionization from the drift time (which is, of course, much longer than the particle's time of flight through the detector). The drift time also gives approximate information about the distance between the center of the wire and the track.
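To make this concrete, here is a hedged back-of-envelope sketch in Python; the drift velocity is an assumed typical value for such gases, not the actual CyDet number:

    # Back-of-envelope: distance from the wire estimated from the drift time.
    # 5 cm/us is a typical gas drift velocity, assumed here for illustration.
    drift_velocity = 5e-2 / 1e-6   # m/s (5 cm per microsecond)
    drift_time = 120e-9            # s, a hypothetical measured drift time
    distance = drift_velocity * drift_time
    print(distance * 1e3, "mm from the wire")   # 6.0 mm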

Straw tubes used in many detector systems


Finally, each wire has one more characteristic, namely its distance to the target:
Signal hits are more frequent on the inner layers, by construction of the experiment.




Time is counted from the moment the trigger fired (the trigger is another subdetector system). The event is 'recorded' starting from that moment.

The basic algorithm that was proposed (the baseline) uses a cut on energy deposition. As you can see, there is a really significant difference: the energy deposited by background hits is higher. The reason is that background particles move more slowly, so they ionize the gas more.

The ROC AUC when we use energy as the only feature is around 0.95, which seems very high. Nevertheless, it's not enough: we have around 4400 wires, of which around 1000 get activated within each measurement, or event (this number is called the occupancy), while a signal track usually contains only around 80 points.

So an event is represented by 4400 (energy, time) pairs, most of which are zeros.

So the noise that passes through such a basic filter is still very significant, and there is plenty of room for improvement.
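As a sanity check, here is a minimal sketch of how such a single-feature baseline is evaluated; the energy distributions below are toy stand-ins, not the real COMET simulation:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Toy stand-ins: background deposits more energy on average.
    energy_sig = rng.gamma(shape=2.0, scale=1.0, size=8_000)
    energy_bck = rng.gamma(shape=4.0, scale=1.0, size=100_000)
    energy = np.concatenate([energy_sig, energy_bck])
    is_signal = np.r_[np.ones(len(energy_sig)), np.zeros(len(energy_bck))]

    # Background deposits MORE energy, so low energy is signal-like:
    # use -energy as the signal score.
    print(roc_auc_score(is_signal, -energy))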

First, let's combine all the information we have about a single wire (distance from the center, time and energy deposition); let's call these the wire features:
ROC curves (in the physics notation): using the wire features allowed us to halve the background efficiency.
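For concreteness, a hedged sketch of such a classifier in scikit-learn (we did use scikit-learn's gradient boosting, but the arrays below are toy stand-ins; only the shape of the pipeline matters: one row per wire, one column per feature):

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 30_000
    is_signal = rng.random(n) < 0.1
    # Toy wire features: energy, drift time, layer index.
    energy = rng.gamma(2.0 + 2.0 * ~is_signal, 1.0)   # background deposits more
    time = rng.exponential(1.0 + 2.0 * ~is_signal)    # background spread in time
    layer = np.where(is_signal, rng.integers(0, 8, n), rng.integers(0, 18, n))

    X = np.column_stack([energy, time, layer])
    X_tr, X_te, y_tr, y_te = train_test_split(X, is_signal, random_state=0)
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    clf.fit(X_tr, y_tr)
    print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))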

Neighbors features

Let me remind you what the whole picture of hits in the COMET detector looks like (here we use the projection onto the plane orthogonal to the beam line):

Blue dots designate background hits, most of which belong to tracks of other charged particles, say protons; but some of them are simply noise.
The signal hits we are looking for are the red points forming a circular arc. Its approximate radius is known in advance.

A simple but useful observation can be made from this plot: signal hits nearly always come together.
This suggests trying features collected from the neighbours of each point.

This leads to a dramatic improvement in classification: we almost get rid of the random noise hits, though some clustered misidentified tracks remain. The AUC is about 0.995, and the FPR (background efficiency) decreased by a factor of 4-5. In principle, it suffices to use only the information from the left and right wires in the same layer.
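A minimal numpy sketch of assembling such neighbour features, assuming (hypothetically) that the wire features are already arranged in a (layer, wire, feature) array; the real geometry is cylindrical, so neighbours wrap around:

    import numpy as np

    rng = np.random.default_rng(0)
    n_layers, n_wires, n_feats = 18, 300, 3   # toy geometry, not the real CyDet one
    features = rng.random((n_layers, n_wires, n_feats))

    # Features of the left and right neighbours in the same layer
    # (np.roll wraps around, matching the cylindrical geometry).
    left = np.roll(features, shift=+1, axis=1)
    right = np.roll(features, shift=-1, axis=1)
    extended = np.concatenate([features, left, right], axis=2)

    # Flatten to one row per wire for a scikit-learn classifier.
    X = extended.reshape(-1, 3 * n_feats)
    print(X.shape)   # (5400, 9)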

OK, I believe that was simple. Is there still something we missed that could be improved?
Certainly: the overall shape of the track. It's time to use this information.

Hough transform and circle shape

The Hough transform was initially developed to detect lines and circles. Despite being quite simple, it is one of the most effective algorithms in high energy physics.

After applying a GBDT trained on the wire features and the features of the neighbours, we get quite a clean picture with very few false positives. All we want now is to detect and remove the remaining isolated 'islands' of misidentified background hits.

This is done by a very approximate reconstruction of the track centers. Since we know the approximate radius of the track, we can use the Hough transform with a fixed radius. It looks like this:
Visualization of the Hough transform for circles of fixed radius. We are trying to reconstruct the center of a track going through the red points. Assuming that we know the radius of the fitted track, all possible centers lie on a circle around each red point.
We discretize the space of possible track centers and, for each point, estimate how likely it is to be the center of a track. This is done using sparse matrices and some normalization + regularization (because otherwise tracks with few or many points would get very low or very high probabilities).

Once we have computed the Hough transform, we keep only the centers where its value is high, apply some nonlinear transformation there, and then apply the inverse Hough transform plus some filtering. This way we obtain, for each wire, the probability that it belongs to some signal track.
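Here is a hedged, minimal sketch of the voting core of a fixed-radius circular Hough transform (a dense accumulator instead of our sparse matrices, and without the normalization and regularization mentioned above):

    import numpy as np

    def hough_centers(x, y, radius, grid, n_angles=64):
        """Every hit votes for all candidate centers at distance `radius` from it."""
        votes = np.zeros((len(grid), len(grid)))
        angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
        for xi, yi in zip(x, y):
            cx = xi + radius * np.cos(angles)   # candidate centers around the hit
            cy = yi + radius * np.sin(angles)
            ix = np.searchsorted(grid, cx)
            iy = np.searchsorted(grid, cy)
            ok = (ix < len(grid)) & (iy < len(grid))
            np.add.at(votes, (ix[ok], iy[ok]), 1.0)
        return votes

    # Toy usage: 30 hits on an arc of a circle centered at (10, -5) with radius 25.
    grid = np.linspace(-60, 60, 121)
    phi = np.linspace(0, 1.5, 30)
    votes = hough_centers(10 + 25 * np.cos(phi), -5 + 25 * np.sin(phi),
                          radius=25, grid=grid)
    ix, iy = np.unravel_index(votes.argmax(), votes.shape)
    print(grid[ix], grid[iy])   # close to the true center (10, -5)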

Finally, we collect all the information we get for each wire:

  • local features (energy deposition, timing, layer_id)
  • features collected from neighbors
  • result of the inverse Hough transform
and train a GBDT on these features to obtain the final classifier. Its ROC AUC is 0.9993 (roughly a 100 times smaller probability of misordering a random signal-background pair than the baseline).
The final classifier is shown in red and is very close to the ideal one. Its ROC AUC is about 0.9993.

When we compare the ROC curves at the threshold of interest (very high signal sensitivity), things are a bit worse, but still very impressive:
At the benchmark working point the background yield decreased by a factor of 34. The original ROC curve is not visible on the plot, since it lies much lower.


Visualization of all steps

Initial picture of hits in CyDet. Red are signal tracks, blue are background ones.

After we apply the initial GBDT (which uses the wire features + neighbours), some background hits remain; see, for instance, the 'island' of misidentified background to the right.
Green dots denote the possible centers of tracks. The bigger the point, the greater the value of the Hough image.

Now we apply some non-linear transformation to keep only the centers with very high probability, then apply the inverse Hough transform and train the second GBDT, which also incorporates the information from the inverse Hough transform.

Conclusion

I've described how simple machine learning techniques coupled with well-known algorithms can produce a very good result, superior to many complex approaches.
These results are to be checked on better simulation data once it is ready. We also have some ideas about classification for online triggers based on the same ideas.

Links


  1. Repository with algorithms and results of tracking.
    Most of the plots are taken from it; thanks to Ewen Gillies.
  2. The Reproducible Experiment Platform was used for the experiments; gradient boosting was taken from scikit-learn.
  3. A review of straw chambers. Straw chambers are used in many different experiments due to their high resolution and low cost.

Multimixture fitting

I was wondering how one can modify the Expectation-Maximization procedure for fitting mixtures (well, Gaussian mixtures, because those are the distributions that can be fitted easily) to support really many overlapping components in the mixture.

Randomization can probably be a solution to this problem.

Let me first recall how EM works. There are two steps that are computed iteratively:

  1. (Expectation) where we compute the probability that each particular event belongs to each component;
  2. (Maximization) where, given these probabilities, we re-estimate the parameters of each component.
What if we sample assignments according to the distribution from the expectation step? At each iteration we would attribute each event to one component of the mixture (in the simplest case), or maybe to several of them. This kind of randomization should prevent the distributions from 'shrinking'.
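Here is a minimal 1-D sketch of this sampled-assignment idea (sometimes called stochastic EM), in the simplest 'one component per event' variant; the data and the number of components are toy assumptions:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    # Toy data: two overlapping gaussians.
    x = np.concatenate([rng.normal(-1.0, 1.0, 3000), rng.normal(1.5, 0.7, 2000)])

    k = 2
    mu, sigma, w = rng.choice(x, k), np.ones(k), np.full(k, 1.0 / k)
    for _ in range(200):
        # E-step: responsibilities p(component | event).
        dens = w * norm.pdf(x[:, None], mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Randomization: sample one hard assignment per event
        # instead of keeping the soft responsibilities.
        labels = (rng.random(len(x))[:, None] < resp.cumsum(axis=1)).argmax(axis=1)
        # M-step on the sampled assignments.
        for j in range(k):
            mask = labels == j
            if mask.sum() > 1:
                mu[j], sigma[j] = x[mask].mean(), x[mask].std() + 1e-6
                w[j] = mask.mean()
    print(mu, sigma, w)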

Ok, this again needs time for experiments.

LibFM in python

LibFM is a library for factorization machines, using the approach proposed by Steffen Rendle. However, it seems there is no Python wrapper for this famous library.

If you're looking for a Python version, take a look at these projects:

Update: I've tested several LibFM implementations on a number of datasets, including libFM itself and the options listed in this post.
  • https://github.com/coreylynch/pyFM
    Lovely minimalistic implementation of Factorization Machines using cython (previous version used numpy).
    This library has an interface similar to scikit-learn.
  • http://ibayer.github.io/fastFM/index.html
    fastFM is another option. The library supports both classification and regression.
    It contains three different solvers (SGD, ALS, MCMC). ALS = alternating least squares.
    In the paper by the library's author, he argues that the other solvers are comparable to SGD.

    This library completely follows the scikit-learn interface (even deriving from BaseEstimator and the appropriate mixin classes).
  • I was also looking for code in theano, but the only code I found was very dirty and minimalistic (so I doubt it is usable):
    https://github.com/instagibbs/FactorizationMachine
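For illustration, a minimal usage sketch of fastFM's scikit-learn-style interface (the toy one-hot matrix and ratings are made up; fastFM expects scipy sparse input):

    import numpy as np
    import scipy.sparse as sp
    from fastFM import als

    # Toy design matrix: one-hot users + one-hot items, a typical FM use case.
    X = sp.csc_matrix(np.array([
        [1, 0, 0, 1, 0],   # user 0, item 0
        [1, 0, 0, 0, 1],   # user 0, item 1
        [0, 1, 0, 1, 0],   # user 1, item 0
        [0, 0, 1, 0, 1],   # user 2, item 1
    ], dtype=np.float64))
    y = np.array([5.0, 3.0, 4.0, 1.0])   # ratings

    fm = als.FMRegression(n_iter=100, rank=2, l2_reg_w=0.1, l2_reg_V=0.5)
    fm.fit(X, y)
    print(fm.predict(X))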

Friday, June 26, 2015

Learning to rank (software, datasets)

For some time now I've been working on ranking. I was going to adapt pruning techniques to the ranking problem, which could be rather helpful, but the problem is that I haven't seen any significant improvement from changing the algorithm.

Ok, anyway, let's see what we have in this area.

Datasets for ranking (LETOR datasets)


  1. MSLR-WEB10k and MSLR-WEB30k
    You'll need a lot of patience to download them, since Microsoft's server serves data at 1 Mbit/s or even slower.
    The only difference between the two datasets is the number of queries (10,000 and 30,000 respectively). They contain 136 columns, mostly filled with various term frequencies and the like.
  2. Apart from these datasets, LETOR 3.0 and LETOR 4.0 are available, published in 2008 and 2009. Those datasets are smaller. From LETOR 4.0, MQ-2007 and MQ-2008 are interesting (46 features there). MQ stands for 'million queries'.
  3. Yahoo! LETOR dataset, from the challenge organized in 2010. There are currently two versions: 1.0 (400 MB) and 2.0 (600 MB). Here is more info about the two sets within this data:

                    Set 1                          Set 2
                    Train     Val      Test        Train    Val      Test
       # queries    19,944    2,994    6,983       1,266    1,266    3,798
       # urls       473,134   71,083   165,660     34,815   34,881   103,174
       # features   519                            596
  4. There is also the Yandex imat'2009 (Интернет-Математика, 'Internet Mathematics', 2009) dataset, which is rather small (~100,000 query-document pairs in the test set and about as many in the train set, 245 features).
And it seems these are most of the datasets available.
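Conveniently, most of them come in the svmlight/libsvm text format with an extra qid field (lines like "2 qid:10 1:0.03 2:0.11 ..."), so loading is a one-liner; the file path below is hypothetical:

    from sklearn.datasets import load_svmlight_file

    # MSLR/LETOR files are svmlight-formatted with a query id per line.
    X, y, qid = load_svmlight_file("MSLR-WEB10K/Fold1/train.txt", query_id=True)
    print(X.shape, y.shape, qid.shape)   # sparse features, relevance labels, query ids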

Algorithms

There are plenty of algorithms and modifications of them created specifically for LETOR (with papers).

Implementations

Many algorithms have been developed, but checking most of them is a real problem, because there is no available implementation one can try. Yet new algorithms constantly appear, and their developers claim that the new algorithm gives the best results on all (or almost all) datasets.

This is of course hardly believable, especially given that most researchers don't publish the code of their algorithms. Ideally, one should publish not only the code of the algorithm, but the whole code of the experiment.

However, there are some algorithms that are available (apart from regression, of course).
  1. The LEMUR project's RankLib incorporates many algorithms (written in Java)
    http://sourceforge.net/projects/lemur/
    probably the best option unless you need an implementation of something specific. It currently contains:

    MART (=GBRT), RankNet, RankBoost, AdaRank, Coordinate Ascent, LambdaMART and ListNet
  2. LEROT: an online learning-to-rank framework written in Python.
    There is also a less detailed but longer list of datasets there: https://bitbucket.org/ilps/lerot#rst-header-data
  3. IPython demo on learning to rank
  4. An implementation of LambdaRank (in Python, written specifically for a Kaggle ranking competition)
  5. xapian-letor is part of the xapian project; this library was developed during GSoC 2014. Though I haven't found anything on ranking in the documentation, some implementations can be found in the C++ code:
    https://github.com/xapian/xapian/tree/master/xapian-letor
    https://github.com/v-hasu/xapian/tree/master/xapian-letor
  6. Metric learning to rank (mlr) for matlab
  7. SVM-Rank implementation in C++
  8. ListMLE and ListNet rankers (it seems these were used in xapian)
  9. SVM-MAP implementation in C++
Some comparisons (pictures randomly sampled from the net):
Taken from http://www2009.org/pdf/T7A-LEARNING%20TO%20RANK%20TUTORIAL.pdf


Comparison from http://www.ke.tu-darmstadt.de/events/PL-12/papers/07-busa-fekete.pdf,
though the paper was about comparing nDCG implementations.


Thursday, June 25, 2015

A Programming Language

Since I'm very interested in numpy and vectorization, I became curious about how this approach appeared. I find this notion very convenient and very productive, while it still leaves room for optimizations under the hood.

Obviously, a predecessor of numpy is MATLAB, which appeared a long time ago, in the late 70's, while the first release happened more than 30 years ago, in 1984! Many things were inherited from this language (not the multi-CPU/GPU backend, unfortunately, but I hope that things will change one day).

But the more interesting point is that the vector-based syntax did not come from Fortran, as I thought earlier. The latter got its operations over vectors only in Fortran 90 (which was much later).

The real predecessor of MATLAB was APL (A Programming Language, not named after the Apple company). This language was quite cryptic and terse, and mostly looked not like code but like a string of formulas. Apart from this, APL used many special symbols and combined operators. For instance, +/seq is the sum of the elements of a sequence.

Wikipedia claims that the following expression returns the list of prime numbers up to $R$:

(~R∊R∘.×R)/R←1↓ιR

Now you understand why this language isn't popular today :)
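For comparison, here is a (hedged) numpy transliteration of the same expression; it mirrors the APL operations step by step rather than being an efficient sieve:

    import numpy as np

    def primes_upto(n):
        r = np.arange(2, n + 1)              # R <- 1 drop iota n
        products = np.multiply.outer(r, r)   # R outer-product-times R
        return r[~np.isin(r, products)]      # keep elements that are not products

    print(primes_upto(30))   # [ 2  3  5  7 11 13 17 19 23 29]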

Since then we have learnt that many things like map, where and sum can be written using words, which is much clearer and more reliable.

Another thing I find interesting is that APL's interpretation process was much more complicated than that of numpy or MATLAB, and in particular supported lazy evaluation (thus, building up expressions was possible). This also means that various optimizations were possible, such as executing several vector operations at once (like a + b*c) if the processor supports it. In Russian this is called 'векторное зацепление' (vector chaining), but I haven't found an appropriate translation in the wiki.

Wednesday, June 24, 2015

Practical tasks on bioinformatics

I recently got interested in bioinformatics and the topics scientists work on there.



A nice read for starters like me is http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html,
which is quite inspiring (there is also an extended version in Russian).

Among other recommended resources I've found rosalind, which is basically a set of programming problems for 'computational biologists' (if you have ever seen sites with ICM problems, you can imagine it). The difference here is that you don't submit a program, only the solution.

Each time you attempt a problem, new data is generated and you're given 5 minutes or so.

Resources on bioinformatics in Russian that I have found useful so far:

  1. http://biomolecula.ru/
  2. http://postnauka.ru/themes/bioinformatika
  3. rosalind 

Monday, June 22, 2015

Optics and smartphones

I was always sure that optics is the most important factor for taking nice photos in the dark. But today's smartphones have cameras with very small lenses, and nevertheless they produce better pictures than my Lumix TZ5.
The latter has a lens that is at least 10 times larger in area, but this doesn't help.



Machine-learning reconstruction of the image? Good stabilization? A miracle? Or am I simply doing something wrong with my old Lumix?

Machine learning used in tracking of COMET (Japanese particle physics experiment), part I

Recently I worked for two months together with an intern from Imperial College on the tracking system of COMET, and here I'm going to briefly summarize our (impressive) results. To begin with, let me explain what this experiment is about.

From physics courses we know that there are conservation laws; the most famous are conservation of energy and momentum, but there are others, for instance, conservation of the different lepton family numbers.

For instance, the electron number is the number of electrons + the number of electron neutrinos. For some time these numbers (electron number, muon number, tau number) were considered to be conserved, but in the late 1960's it was proved that neutrino oscillations change the lepton family numbers of a system (a muon neutrino can become an electron neutrino, for instance). This observation is called lepton flavour violation. However, in the Standard Model the overall lepton number (the sum of the three family numbers) is conserved.

Since that time physicists have been searching for other processes that change the lepton family numbers and involve charged leptons (electrons, muons, tauons), which could be a key to new physics; however, there has been no success in this direction so far, and the current results are mostly upper limits we can establish for some processes. Such hypothetical processes are called CLFV (charged lepton flavour violation).


Details of COMET experiment

COMET is a small experiment (still at the planning stage) built to detect one specific process (muon-to-electron conversion on a nucleus), which is suppressed in the Standard Model:
$$ \mu^{-} + N \to e^{-} + N.$$

The rate of this process is extremely small in the SM: its probability is about $10^{-52}$, which in particular means that we are sure it will not happen (within the SM) during the experiment.

COMET sends lots of muons (with fixed energy) onto an aluminum target and makes the muons interact with it. The process we are looking for is
$$ \mu^{-} + Al \to e^{-} + Al,$$
but this isn't the only way to obtain an electron from a muon:
$$ \mu^{-} \to e^{-} + \nu_{\mu} + \overline{\nu}_e.$$
Note that the second process is 'normal' in the sense that it doesn't violate the lepton conservation laws.

The only way to determine whether we had a decay-in-orbit (DIO) or a conversion is to measure the energy of the resulting electron (neutrinos are very elusive particles, and we are unable to detect them). Fortunately, the electron energy distributions are quite different for these processes, and background electrons usually have lower energy.

Distribution of energies for the signal process and the main background (decay-in-orbit) in COMET


Some other illustrations of COMET experiment

The scheme of the planned COMET experiment. The detector (with the aluminum target) will be at the end of this pipeline.
This is what is planned for Phase II; at first the experiment will be simpler.
COMET during Phase I will be much simpler.

Detector of COMET (CyDet). The detector is filled with sensitive wires (white).
When a muon hits the aluminum target (in the center), the electron produced travels along a helical trajectory of larger radius in the magnetic field and hits the wires. The radius of the trajectory depends on the transverse momentum of the electron.



Data collected from COMET


The main information (apart from that of the trigger) is the data taken from the wires, which register when a charged particle flies nearby (by absorbing the electrons produced in ionization) and also measure the time when this happened.

Another characteristic that helps in distinguishing electrons is the radius of the trajectory, which (in a constant magnetic field) is given by the formula
$$r = \dfrac{p_T}{eB}, $$
so for the electron we know the distribution of gyroradii from the distribution of energies. The gyroradius is how we will actually reconstruct the energy of the electron.
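A quick hedged numeric check (the field value is illustrative, not the exact CyDet one): with the momentum in GeV/c and the field in tesla, the formula becomes $r \, [\text{m}] \approx p_T / (0.3 B)$.

    # Gyroradius of a ~105 MeV/c electron (the signal energy scale) in a 1 T field.
    p_t = 0.105           # GeV/c
    B = 1.0               # tesla, an assumed illustrative value
    r = p_t / (0.3 * B)   # metres; the 0.3 comes from the unit conversion for e
    print(round(r, 3), "m")   # 0.35 m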

Typical signal event in COMET: the particle spirals in the magnetic field, leaving energy in the drift chambers.


In the orthogonal projection of CyDet, the information collected from the wires looks like this:
Typical picture of a signal event in the experiment. The red dots correspond to signal hits and form a circle of the expected radius.
One can also see many background hits here, which come from other particles flying through the detector.

But the circular shape with a certain radius is not the only useful information. Signal hits usually deposit less energy (which may sound a bit contradictory to what I wrote earlier, but the faster a particle flies, the less energy it deposits).

The deposited energy is a very strong feature, though it leaves a significant number of background hits, which can spoil the picture when the occupancy is high.

Apart from this, we can use the 'hit time' (measured from the moment the trigger detected a particle).
The time for signal hits tends to be smaller, while background hits are distributed uniformly in time.
This is actually to be checked with better Monte Carlo simulations (which are quite poor at the moment).

The main goal of our part of the work is the classification of signal vs background hits.
In the next post I'll write about how we applied machine learning to this problem and achieved promising results.

PS. There is a second part of this post, devoted to machine learning.

Links

1. Official COMET site.
2. Detailed presentation about CLFV and COMET (and mu2e).
3. Раритеты микромира ('Rarities of the Microworld'), an article in Russian about rare processes (CLFV being one of the examples)
4. COMET Phase-I proposal
5. Repository with algorithms and results of tracking.