Fault tolerant computing

I’ve been wanting to write a simulation code for a long while as a step to better understanding and improving upon the tools I work with. I was especially inspired by the high performance computing tutorials at EuroPython to look into driving with python, then adding in optimizations in C/C++ as needed. In addition I was very excited to start working with libraries such as pypar to use MPI directly from  python.

As a first step, and to do something useful, a few days ago I started writing a code to explore failure and recovery from failure in a distributed computation. By failure in this case, I mean when one of the computation units goes down. My test system is N harmonic oscillators on N nodes (or processes on a shared memory machine).

To see a bit into my thought process as an engineer and scientist: first I wrote several different integrators for a single harmonic oscillator using common algorithms (leapfrog, Euler, gears, …) and tested them out. Next I tested splitting the integration into two parts looking forward to the stage where this will make error recovery more straightforward. Then I implemented N harmonic oscillators on a single process and tested that. Then I started thinking about how to simulate error, and looked extensively into Simpy but decided I didn’t need something so complex:  basically I decide by sampling from a probability distribution (which can be tuned to match a given system interested in) if a given process will fail at a given time step.

Harmony and Discord designFinally I designed a system which worked in parallel. First I wrote some test code using pypar, to make sure I understood the libraries correctly, then I thought a bit overnight and sat down to design. I actually did this on pen and paper in code which ended up being almost identical to my final code, another advantage of using python. I have two classes: Harmony and Discord. The way MPI works is that each process executes the same code, and the code can do something different depending on which process it is, and how many processes there are-including doing things like communicating to other processes. Discord knows about the failure, and each Harmony class has a copy of a Discord class. So this mean each process=a Harmony class containing a Discord class.

A Harmony class, if the root node, distributes the work and checks on the other processes.  If the Harmony class isn’t the root node, is responsible for one harmonic oscillator and it receives instructions and work from the root, as well as checking it if has failed or not and clobbering itself if so. Of course in the wild, the Harmony classes won’t need Discord members. They will fail on their own accord due to actual physical failures. The point is to in the meantime, in a controlled environment, monitor their resilience to failure. So I’m happy to say that success and failure are well modeled and I’ve tested on up to 50 processes with a variety of error scenarios. Tomorrow is the fun part. Namely implementing a variety of techniques for the recovery from the failures and quantifying the cost of recovery for a given quality of recovery.

For example, this is what the sum of the kinetic (blue), potential (green) and total energy (red) of N=3 harmonic oscillators looks like when they are happy (no failures):

Happy Energy

Happy Energy

This is what the  sum of the kinetic, potential and total energy of N=~30 oscillators looks like when they are sad (failures):

Sad Energy

So the task will be to get the latter graph to get to look like the former with a minimum of effort. Then playing around with parameters, making things speedier, and then moving on to a more complicated and physically relevant scenario than N harmonic oscillators, keeping in mind that harmonic oscillators do tend to turn up everywhere in physics so I’m hopefully not that far off.

Leave a Reply

Your email address will not be published. Required fields are marked *