Fault-tolerant computing
As a first step toward writing my own simulation code, and to do something useful along the way, a few days ago I started writing a program to explore failure and recovery in a distributed computation. By failure I mean one of the computation units going down. My test system is N harmonic oscillators on N nodes (or N processes on a shared-memory machine).
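A minimal sketch of this kind of experiment, not the actual code from the project. It assumes the oscillators are independent, that each "node" is just an index in a single Python process, and that recovery works by periodic checkpointing plus replay. The names `step`, `run`, `fail_node`, `fail_step`, and `checkpoint_every` are hypothetical and chosen only for illustration.

```python
import math


def step(state, omega, dt):
    """Advance one oscillator (x, v) by dt using the exact harmonic solution."""
    x, v = state
    c, s = math.cos(omega * dt), math.sin(omega * dt)
    return (x * c + (v / omega) * s, -x * omega * s + v * c)


def run(n_nodes=4, n_steps=100, dt=0.01, omega=1.0,
        checkpoint_every=10, fail_node=2, fail_step=57):
    """Simulate n_nodes oscillators; one node fails once and is recovered."""
    # Every node starts at x = 1, v = 0 (an assumption for this sketch).
    states = [(1.0, 0.0)] * n_nodes
    checkpoints = list(states)   # last saved state of every node
    checkpoint_step = 0          # step at which the checkpoint was taken
    failed_once = False

    for t in range(1, n_steps + 1):
        for i in range(n_nodes):
            if i == fail_node and t == fail_step and not failed_once:
                # The node goes down: its in-memory state is lost.
                states[i] = None
                # Recovery: restart from the last checkpoint and replay
                # the steps that were completed since then.
                state = checkpoints[i]
                for _ in range(checkpoint_step, t - 1):
                    state = step(state, omega, dt)
                states[i] = state
                failed_once = True
            states[i] = step(states[i], omega, dt)

        if t % checkpoint_every == 0:
            checkpoints = list(states)
            checkpoint_step = t

    return states


if __name__ == "__main__":
    for i, (x, v) in enumerate(run()):
        print(f"node {i}: x = {x:.6f}, v = {v:.6f}")
```

Because the replay repeats the same deterministic steps from the saved state, the recovered node ends in the same state as the nodes that never failed. A real distributed version would also have to detect the failure and store checkpoints somewhere that survives the node, which this sketch leaves out.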