Science as a practice can be depressingly messy, especially when relying on computer programs that other people have written, especially if those programs are undocumented and, even worse, not open source. Luckily, the establishment’s insistence on reproducibility and the abundance of simple sanity checks tend to iron out higher-level bugs, but it is within the realm of reason that “smaller” problems in a closed-source program could be quietly fixed from one analysis to the next, without alerting the community as a whole.
As an example of what I mean by “messiness”: while waiting for a reply from the author of a particular program regarding its output format, I spent part of the day reverse engineering it. The reverse engineering consisted of creating a series of mock datasets and seeing what the code spit out given these as input. Since I knew very well what it should spit out given the mock data, I could figure out how to interpret its output. I’m in the process of confirming my inferences with the author of the program, but while waiting for that confirmation I can get on with my work.
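The approach is simple enough to sketch. Something along the following lines (in Python) captures the idea; the program name mystery_tool, its --input flag, and the particular mock-data generator are hypothetical stand-ins, not the actual tool I was probing:

```python
import subprocess
import numpy as np

def make_mock_dataset(path, n_points=1000, true_mean=5.0, true_sigma=1.0, seed=0):
    """Write a mock dataset whose statistical properties are known exactly."""
    rng = np.random.default_rng(seed)
    np.savetxt(path, rng.normal(true_mean, true_sigma, size=n_points))
    return {"mean": true_mean, "sigma": true_sigma}

def probe(path):
    """Run the undocumented program on the mock data and capture what it prints.

    "mystery_tool" and its flag are placeholders for the real black box.
    """
    result = subprocess.run(
        ["mystery_tool", "--input", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    truth = make_mock_dataset("mock_000.txt")
    output = probe("mock_000.txt")
    # Since the input statistics are known, the fields in the output can be
    # matched against them by inspection.
    print("known truth:", truth)
    print("program output:")
    print(output)
```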
But this brings me to the real point of this post: what is to be done? Data provenance is the overarching issue, and I am of the opinion that it should be possible to re-generate any series of results quickly (as measured in scientist-time, not computer-time) based solely on meta-data provided (and this is the key point) as part of the results themselves. A few simple guiding principles can go a long way toward achieving this goal; a short sketch after the list illustrates what I have in mind.
- In the absence of a well-defined standard, it’s the individual scientist’s/consortium’s responsibility to define and actively use an organized meta-data standard.
- If it’s not open source, it’s not science.
- A snapshot of the source code used to generate results should be given/pointed to when the results are presented.
- Minimizing reproduction time is an integral part of science.
- Principles 1-4 should be actively encouraged, nay demanded, by funding agencies, program heads, and research advisors.
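To make the meta-data and source-snapshot principles a bit more concrete, here is a minimal sketch in Python of the sort of provenance record I have in mind. The field names and the JSON layout are my own invention rather than a proposed standard, and the git call assumes the analysis code lives in a git repository:

```python
import json
import subprocess
import sys
import time

def provenance_record(params):
    """Collect enough meta-data to re-generate this analysis later.

    Illustrative only: the field names are mine, not a community standard,
    and the analysis code is assumed to be tracked in git.
    """
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return {
        "code_version": commit,             # pointer to a source snapshot
        "command": " ".join(sys.argv),      # how this run was invoked
        "parameters": params,               # analysis-specific knobs
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }

if __name__ == "__main__":
    results = {"best_fit": 3.14}            # stand-in for real scientific output
    results["provenance"] = provenance_record({"n_iterations": 10000})
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2)     # meta-data ships with the results
```

The particular fields matter less than the fact that the meta-data travels inside the results file itself: anyone holding results.json can, in principle, check out the recorded commit and re-run the analysis without having to ask me anything.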
Neat! I didn’t realize you had a more research-focused blog. I’m still pretty unlearned in the ways of physics aside from a couple of years in high school, but I’m reading a biography of Feynman called Genius, & as I read it increasingly sounds like a lot of the formulations just come down to algorithm design. Better algorithms more accurately represent what is observed, so yes, clearly they should be well documented and easily available for reuse!
Given how computationally intense these algorithms were getting even by the ’20s & ’30s (based on where I am in the book so far), it sounds like performing a lot of these calculations by hand was becoming too costly in human calculation time, even with the help of early adding machines. But if that’s the case, and we now have high-speed computers & clusters doing the number crunching, then I most certainly agree: source should be open in order to verify the veracity of the approach, or maybe even to find optimizations that would still accurately reflect our observations, or to determine where breakdowns in the model may be related to some critically flawed component. And as computing systems have become more layered, isolating the meaningful executed code from the stack on which it’s written may become more important.
I’ve been thinking a lot about the evolution of communication over the ages, from physical language -> tonal/song/prosody -> ‘natural’ spoken language -> written languages -> mathematical languages -> programming languages. All of these changes have, on some level, been ways to allow for more precise communication in less time. Closed source programs are analogous to writing in Etruscan: sure, they communicate something in the moment, but without an open lexical reference, such contributions will be lost in time. Not to say open source is a panacea, but at least it offers a Rosetta Stone to help bootstrap an understanding of the operations, and it keeps things moving forward because it facilitates collaboration with people you may never have direct contact with (perhaps geographically, perhaps hundreds of years after our passing).