Sep 9, 2014

The meaning of replicability across scientific disciplines

Recently, Shauna Gordon-McKeon wrote about the meaning of replicability on this blog, concentrating on examples from psychology. In this post, I summarize for comparison the situation in computational science. These two fields may well be at opposite ends of the spectrum as far as replication and replicability are concerned, so the comparison should be of interest for establishing terminology that is also suitable for other domains of science. For a more detailed discussion of the issues specific to computational science, see this post on my personal blog.

The general steps in conducting a scientific study are the same in all fields:

  1. Design: define in detail what needs to be done in order to obtain useful insight into a scientific problem. This includes a detailed description of required equipment, experimental samples, and procedures to be applied.

  2. Execution: do whatever the design requires to be done.

  3. Interpretation: draw conclusions from whatever results were obtained.

The details of the execution phase vary enormously from one discipline to another. In psychology, the "experimental sample" is typically a group of volunteers, who need to be recruited, and the "equipment" includes not only the people interacting with the volunteers and the tools they use, but also the conditions under which the experiment takes place. In physics or chemistry, for which the terms "sample" and "equipment" are most appropriate, both are highly specific to an experiment, and acquiring them (by buying or producing them) is often the hard part of the work. In computational science, there are no samples at all, and once the procedure is sufficiently well defined, its execution is essentially left to a computer, which is a very standardized form of equipment. Of course, what I have given here are caricatures, as reality is usually much more complicated. Even the three steps I have listed are hardly ever done one after the other, as problems discovered during execution lead to a revision of the design. But for explaining concepts and establishing terminology, such caricatures are actually quite useful.

Broadly speaking, the term "replication" refers to taking an existing study design and repeating the execution phase. The motivation for doing this is mainly verification: the scientists who designed and executed the study initially may have made mistakes that went unnoticed, forgotten to mention an important aspect of their design in their publication, or, in the extreme case, cheated by making up or manipulating data.

What varies enormously across scientific disciplines is the effort or cost associated with replication. A literal replication (as defined in Shauna's post) of a psychology experiment requires recruiting another group of volunteers, respecting their characteristics as defined by the original design, and investing a lot of researchers' time to repeat the experimental procedure. A literal replication of a computational study that was designed to be replicable involves minimal human effort and an amount of computer time that is in most cases negligible. On the other hand, the benefit obtained from a literal replication varies as well. The more human intervention a replication involves, the more chances there are for human error, and the more important it is to verify the results. The variability of the “sample” also matters: repeating an experiment with human volunteers is likely to yield different outcomes even if done with exactly the same subjects, and similar problems arise in principle with any living subjects, even ones as small as bacteria. In contrast, re-running a computer program is much less useful, as it can only uncover rare defects in computer hardware and system software.

These differences lead to different attitudes toward replication. In psychology, as Shauna describes, literal replication is expensive and can detect only some kinds of potential problems, which are not necessarily expected to be the most frequent or important ones. This makes a less rigid approach, which Shauna calls "direct replication", more attractive: the initial design is repeated not literally but in spirit. Details of the protocol are modified in ways that, according to the current state of knowledge in the field, should not make a difference. This makes replication cheaper to implement (because the replicators can use materials and methods they are more familiar with) and covers a wider range of possible problems. On the other hand, when such an approach leads to results that contradict the original study, more work must be invested to figure out the cause of the difference.

In computational science, literal replication is cheap but at first sight seems to yield almost no benefit. The point of my original blog post was to show that this is not true: replication proves replicability, i.e. it proves that the published description of the study design is in fact sufficiently complete and detailed to make replication possible. To see why this matters, we have to look at the specific characteristics of computation in science, and at the current habits that make most published studies impossible to replicate.

A computational study consists essentially of running a sequence of computer programs, providing each one with the input data it requires, which is usually obtained in part from the output of programs run earlier. The order in which the programs are run is very important, and the amount of input data that must be provided is often large. Typically, changing the order of execution or a single number in the input data leads to different results that are not obviously wrong. Mistakes therefore easily go unnoticed when individual computational steps require manual intervention, and that is still the rule rather than the exception in computational science. The most common cause of non-replicability is that scientists do not keep a complete and accurate log of what they actually did, because keeping such a log is a very laborious, time-consuming, and completely uninteresting task. There is also a lack of standards and conventions for recording and publishing such a log, which makes the task harder still. For these reasons, replicable computational studies remain the exception to this day. There is of course no excuse for this: it’s a moral obligation for scientists to be as accurate as humanly and technologically possible in documenting their work. While today’s insufficient technology can be partly blamed, most computational scientists (myself included) could do much better than they do. It is really a case of bad habits that we have acquired as a community.
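
To make this concrete, here is a minimal sketch of how the execution of such a pipeline can be logged automatically instead of by hand, so that the record of what was done is produced as a side effect of doing it. The script and file names are invented purely for illustration:

    # Minimal sketch: run a two-step computational pipeline and keep an
    # exact log of every command executed. All script and file names are
    # hypothetical placeholders.
    import datetime
    import subprocess

    LOG_FILE = "execution_log.txt"

    def run_step(command):
        """Run one pipeline step and append the command to the log."""
        with open(LOG_FILE, "a") as log:
            log.write("%s  %s\n" % (datetime.datetime.now().isoformat(),
                                    " ".join(command)))
        subprocess.check_call(command)

    # The order of these calls is part of the study design: the second
    # step consumes the output of the first.
    run_step(["python", "prepare_data.py", "raw_data.csv", "clean_data.csv"])
    run_step(["python", "run_analysis.py", "clean_data.csv", "results.csv"])
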

The good news is that people are becoming aware of the problem (see for example this status report in Nature) and are working on solutions. Early adopters consistently report that the additional initial effort for ensuring replicability quickly pays off over the duration of a study, even before it gets published. As with any new development, potential adopters are faced with a bewildering choice of technologies and recommended practices. I'll mention in passing my own technology, which makes computations replicable by construction. More generally, interested readers might want to look at this book, a Coursera course, two special issues of CiSE magazine (January 2009 and July 2012), and a discussion forum where you can ask questions.

An interesting way to summarize the differences across disciplines concerning replication and reproducibility is to look at the major “sources of variation” in the execution phase of a scientific study. At one end of the spectrum, we have uncontrollable and even indescribable variation in the behavior of the sample or the equipment. This is an important problem in biology or psychology, i.e. in disciplines studying phenomena that we do not yet understand very well. To a lesser degree, it exists in all experimental sciences, because we never have full control over our equipment or the environmental conditions. Nevertheless, in technically more mature disciplines studying simpler phenomena, e.g. physics or chemistry, one is more likely to blame human error for discrepancies between two measurements that are supposed to be identical. Replication of someone else's published results is therefore attempted only for spectacularly surprising findings (remember cold fusion?), but in-house replication is very common when testing new scientific equipment. At the other end of the spectrum, there is the zero-variation situation of computational science, where the study design uniquely determines the outcome, meaning that any difference showing up in a replication indicates a mistake whose source can in principle be found and eliminated. Variation due to human intervention (e.g. in entering data) is considered a fault in the design, as a computational study should ideally not require any human intervention at all, and where it does, everything should be recorded.
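
As a small illustration of this zero-variation situation, verifying a replication in computational science can be as simple as comparing checksums of the output files from the original run and the new one; any difference points to a mistake somewhere in the pipeline or its description. The file names below are again placeholders:

    # Sketch: bitwise comparison of the outputs of an original run and a
    # replication attempt. File names are hypothetical placeholders.
    import hashlib

    def sha256_of(filename):
        """Return the SHA-256 hash of a file's contents."""
        h = hashlib.sha256()
        with open(filename, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    if sha256_of("results_original.csv") == sha256_of("results_replication.csv"):
        print("Replication successful: the outputs are bitwise identical.")
    else:
        print("The outputs differ: some step or input was not reproduced faithfully.")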