I’m at a very specialized conference outside of Madrid on structure finding in cosmological simulations, called haloes going MAD (MAD being the airport code for Madrid). Not only is it a beautiful location, situated amidst the mountains (I hiked for a few hours today after the main session), but the remoteness also removes distractions. The goal is for the meeting to be more than just talk: we are here to get something done, namely to collectively write a paper comparing various algorithms.
What is a halo, you ask? Well, in comparing our halo finders that’s something we have to define, and part of the reason we are here at the meeting is to develop a common language that enables said comparison. The accepted physical definition is that a halo is a gravitationally bound structure. In practice, people use all sorts of approximations, because a gravity calculation is computationally intensive and determining which particles are bound to a structure poses a chicken-and-egg problem: you first have to guess at a structure to determine whether a particle is bound to it, and once you unbind a particle, the structure becomes different, so your guess should have been different as well. In the end, to enable comparison, we decided to define a halo as a tuple (or object) described by various properties, including mass, radius, velocity, etc.
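To make the chicken-and-egg aspect concrete, here is a minimal sketch of iterative unbinding. Everything in it is my own illustration, not any particular halo finder’s method: it uses a direct O(N²) potential sum in toy units with G = 1, whereas production codes use tree or grid methods in physical units.

```python
# Toy sketch of iterative unbinding (illustrative only; names and
# units are assumptions, not any real halo finder's implementation).
import numpy as np

G = 1.0  # gravitational constant in code units (assumption)

def unbind(pos, vel, mass, softening=1e-3, max_iter=100):
    """Iteratively remove particles not bound to the remaining set.

    Returns a boolean mask of bound particles. Each pass recomputes
    the potential from the *current* candidate members -- exactly the
    chicken-and-egg problem: removing a particle changes the potential
    felt by every other particle, so the guess must be revised.
    """
    bound = np.ones(len(mass), dtype=bool)
    for _ in range(max_iter):
        idx = np.flatnonzero(bound)
        if len(idx) < 2:
            break
        p, v, m = pos[idx], vel[idx], mass[idx]
        # Pairwise softened potential per particle (direct sum, O(N^2)).
        dx = p[:, None, :] - p[None, :, :]
        r = np.sqrt((dx ** 2).sum(axis=-1) + softening ** 2)
        np.fill_diagonal(r, np.inf)  # no self-interaction
        phi = -G * (m[None, :] / r).sum(axis=1)
        # Kinetic energy relative to the candidate halo's bulk velocity.
        v_com = (m[:, None] * v).sum(axis=0) / m.sum()
        ke = 0.5 * ((v - v_com) ** 2).sum(axis=1)
        still_bound = ke + phi < 0
        if still_bound.all():
            break  # converged: no particle was removed this pass
        bound[idx[~still_bound]] = False
    return bound
```

For example, feeding it a compact clump of slow particles plus one distant, fast interloper should keep the clump and discard the interloper after a pass or two.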
And why do we care? At the end of a cosmological simulation we have a collection of particles, up to trillions these days, and to connect that to the observable universe we need a layer of abstraction. We believe galaxies form in these gravitational potential wells, so tracking the wells’ properties within the simulations gives us insight into galaxy formation, and their distributions tell us whether we have our cosmological models screwed on correctly, among other things.
And why do I care? My background is not only in physics but in computer science, specifically machine learning and algorithms, of which a large component is identifying patterns or clusters in data sets, and this problem initially reminded me of my “home territory”. During my Master’s thesis, on parallel analysis and visualization of large-scale cosmological simulations, I got interested in the problem from a big data, parallel algorithms perspective. As we move to the peta- (and the exa-) scale, storing all the outputs of our simulations becomes infeasible, so we will need to do much of the analysis on the fly; choosing the right algorithms for the job a priori becomes critical, and comparing existing algorithms, and developing new ones where necessary, even more so. Post-processing won’t do, so we have to start preparing by comparing.
For now I have to prepare a short talk and hope it compares with the others; a number of people (including the organizer of the conference) connected my Master’s work to my name after I arrived and asked if I could take a slot on the schedule.