Previously I posted about our work in progress on using persistent homology to identify dark solitons in images of Bose-Einstein condensates (BECs). We recently finished this project and the manuscript is now available on arXiv. The approach we ended up using differs from the one outlined in my original post.
Our original method was based on applying persistent homology to point clouds formed by the coordinates of low-intensity pixels. When applied to point clouds, persistent homology can reliably distinguish isolated minima (clusters) from the lines (loops) forming the dark solitons. However, this method has a limitation: it requires an arbitrary choice of cutoff intensity to decide which pixels count as low intensity. Ideally, when building models for machine learning we want to minimise the use of hyper-parameters that need to be optimised. For example, identifying optimal neural network hyper-parameters (number of layers, nodes, and their connectivity) is one of the most time-consuming parts of deep learning.
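To make the cutoff problem concrete, here is a minimal sketch of the thresholding step. The function name `low_intensity_point_cloud`, the toy image, and the two cutoff values are all made up for illustration; this is not the code from the manuscript.

```python
def low_intensity_point_cloud(image, cutoff):
    """Return (row, col) coordinates of pixels whose intensity is below cutoff."""
    return [
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value < cutoff
    ]


# Toy 4x5 "image" (hypothetical data): a dark vertical stripe in column 2,
# plus one isolated, moderately dim pixel at (1, 4).
image = [
    [0.9, 0.8, 0.1, 0.9, 0.9],
    [0.9, 0.9, 0.1, 0.8, 0.4],
    [0.8, 0.9, 0.2, 0.9, 0.9],
    [0.9, 0.9, 0.1, 0.9, 0.9],
]

# The same image gives different point clouds for different cutoffs.
strict = low_intensity_point_cloud(image, cutoff=0.3)
loose = low_intensity_point_cloud(image, cutoff=0.5)
```

With the stricter cutoff only the stripe survives, while the looser one also picks up the isolated dim pixel, so whatever topology is computed afterwards depends on a number that has to be chosen or tuned by hand.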
Therefore, in the final manuscript we instead used the "lower star image filtration" approach, which computes persistent topological features directly from the image data. Because the filtration is built from the pixel intensities themselves, no cutoff hyper-parameter is needed, and the method can achieve reasonable accuracy quite quickly.
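For readers who have not met this filtration before, the following is a rough pure-Python sketch of its zero-dimensional part: pixels are added in order of increasing intensity, and a union-find structure tracks when connected components of dark pixels are born and when they merge. The function name `lower_star_h0`, the 8-neighbour connectivity, and the toy image are my assumptions for illustration, and loops (one-dimensional features) are not computed here. It is not the implementation used in the manuscript.

```python
import math


def lower_star_h0(image):
    """0-dimensional sublevel-set persistence pairs (birth, death) of a 2D image.

    Assumes 8-neighbour pixel connectivity. Zero-persistence pairs are dropped.
    """
    rows, cols = len(image), len(image[0])
    # Visit pixels from darkest to brightest.
    order = sorted((image[r][c], r, c) for r in range(rows) for c in range(cols))
    parent = {}  # union-find forest over pixels added so far
    birth = {}   # intensity at which each pixel entered the filtration

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    pairs = []
    for value, r, c in order:
        p = (r, c)
        parent[p] = p
        birth[p] = value
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q = (r + dr, c + dc)
                if q == p or q not in parent:
                    continue  # itself, out of bounds, or not yet added
                a, b = find(p), find(q)
                if a == b:
                    continue
                # Elder rule: the component born later dies at this intensity.
                if birth[a] > birth[b]:
                    a, b = b, a
                if birth[b] < value:
                    pairs.append((birth[b], value))
                parent[b] = a
    # Components that never merge persist forever.
    roots = {find(p) for p in parent}
    pairs.extend((birth[root], math.inf) for root in roots)
    return sorted(pairs)


# Toy image (hypothetical data): two dark minima separated by a bright ridge.
image = [
    [0.1, 0.5, 0.9, 0.6, 0.2],
    [0.9, 0.9, 0.9, 0.9, 0.9],
]
diagram = lower_star_h0(image)
```

Here the two minima are born at intensities 0.1 and 0.2, and the shallower one dies when the ridge at 0.9 joins them. No threshold was chosen anywhere: the persistence (death minus birth) of each feature plays the role that the cutoff played in the point cloud approach.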
On the machine learning side, we tried several different methods including support vector machines and logistic regression, which had comparable performance once we were able to identify the relevant features to use. These methods perform very well while using much less training data than neural networks. One important issue I overlooked in my original post was how to properly evaluate the performance of the trained model, given that the training data set had large imbalances between the three image classes. Luckily, scikit-learn includes a variety of suitable metrics for imbalanced supervised learning problems.
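To illustrate why class imbalance matters for evaluation, here is a small pure-Python sketch comparing plain accuracy with balanced accuracy (the mean of the per-class recalls, which is what scikit-learn's `balanced_accuracy_score` reports). The labels, class proportions, and the "always predict the majority class" model are invented for illustration and are not our actual data or results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical, heavily imbalanced three-class labels (0, 1, 2).
y_true = [0] * 8 + [1] * 1 + [2] * 1
# A useless "model" that always predicts the majority class.
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
```

Plain accuracy comes out at 0.8 even though the two minority classes are never identified, while balanced accuracy drops to one third, which is exactly chance level for three classes.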
All in all, this project taught me a lot about persistent homology (the power of point summaries of persistence diagrams), machine learning (the power of simple models such as logistic regression if one can identify the right data features), and Jupyter notebooks (the power of being able to revise all the manuscript figures in a few seconds without having to muck around with touching them up in Inkscape).
I've now spent a year and finished two manuscripts exploring topological data analysis. My next goal is to employ these methods to solve some important and timely problem and publish the results in Physical Review Letters (yes, a generic goal that every physicist has!). The reason is that I think we need high-impact publications in order to convince other physicists that these methods are really powerful and worth learning more about. I have a few promising ideas on problems to attack, so watch this space!