May 28, 2014

The etiquette of train wreck prevention


In a famous open letter to scientists, Daniel Kahneman, seeing “a train wreck looming”, argued that social psychologists (and presumably, especially those who are publishing social priming effects) should engage in systematic and extensive replication studies to avoid a loss of credibility in the field. The fact that a Nobel Prize-winning psychologist made such a clear statement gave a strong boost of support to systematic replication efforts in social psychology (see Pashler & Wagenmakers, 2012, and their special section in Perspectives on Psychological Science).

But in a more recent commentary, Kahneman appears to have changed his mind, and argues that “current norms allow replicators too much freedom to define their study as a direct replication of previous research”, and that the “seemingly reasonable demand” of requiring method sections to be so precise that they enable direct replications is “rarely satisfied in psychology, because behavior is easily affected by seemingly irrelevant factors”. A similar argument was put forth by Simone Schnall, who recently wrote that “human nature is complex, and identical materials will not necessarily have the identical effect on all people in all contexts”.

While I wholeheartedly agree with Kahneman’s original letter on this topic, I strongly disagree with his commentary, for reasons that I will outline here.

First, he argues (as Schnall did too) that there always are potentially influential differences between the original study and the replication attempt. But this would imply that any replication study, no matter how meticulously performed, would be meaningless. (Note that this also holds for successful replication studies.) This is a clear case of a reductio ad absurdum.

The main reason why this argument is flawed is that there is a fundamental relationship between the theoretical claim based on a finding and its proper replication, which is the topic of an interesting discussion about the degree to which a replication should be similar to the study it addresses (see Stroebe & Strack, 2014; Simons, 2014; Pashler & Harris, 2012). My position in this debate is the following. The more general the claim that a finding is taken to support, the more “conceptual” the replication of the supporting findings can (and should) be.

Suppose we have a finding F that we report in order to claim evidence for scientific claim C. In the case that C is identical to F, such that C is a claim of the type “The participants in our experiment did X at time T in location L”, it is indeed impossible to do any type of replication study, because the exact circumstances of F were unique and therefore by definition irreproducible. But in this case (F = C), C obviously has no generality at all, and is therefore scientifically not very interesting. In such a case, there would also be no point in doing inferential statistics.

If, on the other hand, C is more general than F, the level of methodological detail that is provided should be sufficient to enable readers to attempt to replicate the finding, allowing for variation that the authors do not consider important. If the authors remark that this result arises under condition A but acknowledge that it might not arise under condition A' (let's say, with participants who are aged 21-24 rather than 18-21), then clearly a follow-up experiment under condition A' isn't a valid replication. But if their claim (explicitly or implicitly) is that it doesn't matter whether condition A or A' is in effect, then a follow-up study involving condition A' might well be considered a replication. The failure to specify any particular detail might reasonably be considered an implicit claim that this detail is not important.

Second, Kahneman is worried that even the rumor of a failed replication could damage the reputation of the original authors. But if researchers attempt to do a replication study, this does not imply that they believe or suggest that the original author was cheating. Cheating does occasionally happen, sadly, and replication studies are a good way to catch these cases. But, assuming that cheating is not completely rampant, it is much more likely that a finding cannot be replicated successfully because variables or interactions have been overlooked or not controlled for, because there were unintentional errors in the data collection or analysis, or because the results were simply a fluke, caused by our standard statistical practices severely overestimating evidence against the null hypothesis (Sellke, Bayarri & Berger, 2001; Johnson, 2013).
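To give a rough sense of how much p values can overstate the evidence, here is a minimal Python sketch of the calibration that, as I understand it, Sellke, Bayarri & Berger (2001) propose: a lower bound of -e * p * ln(p) on the Bayes factor in favor of the null hypothesis, valid for p < 1/e. The function names are my own, and the conversion to a posterior probability assumes equal prior odds for the null and the alternative, which is an illustrative assumption rather than something argued for in this post.

```python
import math


def bayes_factor_bound(p):
    """Lower bound on the Bayes factor in favor of the null: -e * p * ln(p).

    Only defined for 0 < p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound is only defined for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def calibrated_error(p):
    """Lower bound on the probability that the null is true.

    Assumes equal prior odds for the null and the alternative
    (an illustrative assumption).
    """
    bound = bayes_factor_bound(p)
    return bound / (1 + bound)


# Print the bound for a few commonly reported p values.
for p in (0.05, 0.01, 0.001):
    print(p, round(bayes_factor_bound(p), 3), round(calibrated_error(p), 3))
```

Under these assumptions, p = 0.05 corresponds to a lower bound of roughly 0.29 on the probability that the null is true, which is far weaker evidence than the conventional threshold suggests.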

Furthermore, replication studies are not hostile or friendly. People are. I think it is safe to say that we all dislike uncollegial behavior and rudeness, and we all agree that it should be avoided. If Kahneman wants to give us a stern reminder that it is important for replicators to contact the original authors, then I support that, even though I personally suspect that the vast majority of replicators already do that. There already is etiquette in place in experimental psychology, and as far as I can tell, it’s mostly being observed. And for those cases where it is not, my impression is that the occasional unpleasant behavior originates not only from replicators, but also from original authors.

Another point I would like to address is the asymmetry of the relationship between author and replicator. Kahneman writes: “The relationship is also radically asymmetric: the replicator is in the offense, the author plays defense.” This may be true in some sense, but it is counteracted by other asymmetries that work in the opposite direction. The author has already successfully published the finding in question and is reaping the benefits of it. The replicator, however, is up against the strong reluctance of journals to publish replication studies, is required to have much higher statistical power (and hence to invest far more resources), and is often aiming at a moving target, as more and more newly emerging and potentially relevant details of the original study can be brought forward by the original authors.

A final point: the problem that started the present replication discussion was that a number of findings that were deemed both important and implausible by many researchers failed to replicate. The defensiveness of the original authors of these findings is understandable, but so is the desire of skeptics to investigate if these effects are in fact reliable. I, both as a scientist and as a human being, really want to know if I can boost my creativity by putting an open box on my desk (Leung et al., 2012) or if the fact that I frequently take hot showers could be caused by loneliness (Bargh & Shalev, 2012). As Kahneman himself rightly put it in his original open letter: “The unusually high openness to scrutiny may be annoying and even offensive, but it is a small price to pay for the big prize of restored credibility.”


Bargh, J. A., & Shalev, I. (2012). The substitutability of physical and social warmth in daily life. Emotion, 12(1), 154. doi:10.1037/a0023527

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48), 19313-19317. doi:10.1073/pnas.1313476110

Leung, A. K.-y., Kim, S., Polman, E., Ong, L. S., Qiu, L., Goncalo, J. A., et al. (2012). Embodied metaphors and creative "acts". Psychological Science, 23(5), 502-509. doi:10.1177/0956797611429801

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7(6), 531-536. doi:10.1177/1745691612463401

Pashler, H., & Wagenmakers, E.-J. (2012). Editors' Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence? Perspectives on Psychological Science, 7(6), 528-530. doi:10.1177/1745691612465253

Sellke, T., Bayarri, M., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62-71. doi:10.1198/000313001300339950

Simons, D. J. (2014). The Value of Direct Replication. Perspectives on Psychological Science, 9(1), 76-80. doi:10.1177/1745691613514755

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59-71. doi:10.1177/1745691613514450

May 15, 2014

How anonymous peer review fails to do its job and damages science.


Churchill believed that democracy was the “worst form of government except all those other forms that have been tried from time to time.” Something analogous is often said about anonymous peer review (APR) in science: “it may have its flaws, but it’s the ‘least bad’ of all possible systems.” In this contribution, I present some arguments to the contrary. I believe that APR is threatening scientific progress, and therefore that it urgently needs to be fixed.

The reason we have a review system in the first place is to uphold basic standards of scientific quality. The two main goals of a review system are to minimize both the number of bad studies that are accepted for publication and the number of good studies that are rejected. Borrowing terminology from signal detection theory, let’s call these false positives and false negatives respectively.

It is often implicitly assumed that minimizing the number of false positives is the primary goal of APR. However, signal detection theory tells us that, as long as reviewers' ability to tell good work from bad stays the same, reducing the rate of false positives inevitably increases the rate of false negatives. I want to draw attention here to the fact that the cost of false negatives is both invisible and potentially very high. It is invisible, obviously, because we never get to see the good work that was rejected for the wrong reasons. And the cost is high, because it removes not only good papers from our scientific discourse, but also entire scientists. I personally know a number of very talented and promising young scientists who first sent their work to a journal, fully expecting to be scrutinized, but then received reviews that were so personal, rude, scathing, and above all, unfair, that they decided to look for another profession and never looked back. I also know a large number of talented young scientists who are still in the game, but who suffer intensely every time they attempt to publish something and get trashed by anonymous reviewers. I would not be surprised if they also leave academia soon. The inherent conservatism in APR means that people with new, original approaches to old problems run the risk of being shut out, humiliated, and consequently chased away from academia. In the short term, this is to the advantage of the established scientists who do not like their work to be challenged. In the long run, this is obviously very damaging for science. The damage is greatest at the many journals that will only accept papers that receive unanimously positive reviews. These journals are not facilitating scientific progress, because work with even the faintest hint of controversy is almost automatically rejected.
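The trade-off between false positives and false negatives invoked at the start of the preceding paragraph can be made concrete with a small sketch. It assumes the textbook equal-variance Gaussian model of signal detection theory: the perceived quality of bad papers is drawn from a normal distribution with mean 0, that of good papers from one with mean d' (here set arbitrarily to 1), and a paper is accepted when its perceived quality exceeds a criterion. The model, the value of d' and the function name are illustrative assumptions, not a description of how any journal actually works.

```python
from statistics import NormalDist


def error_rates(criterion, d_prime=1.0):
    """Return (false positive rate, false negative rate) for a criterion.

    Assumes perceived quality ~ N(0, 1) for bad papers and
    N(d_prime, 1) for good papers (hypothetical, illustrative model).
    """
    std_normal = NormalDist()
    # A bad paper is accepted when its perceived quality exceeds the criterion.
    false_positive = 1 - std_normal.cdf(criterion)
    # A good paper is rejected when its perceived quality falls below it.
    false_negative = std_normal.cdf(criterion - d_prime)
    return false_positive, false_negative


# Make the acceptance criterion progressively stricter.
for c in (0.0, 0.5, 1.0, 1.5, 2.0):
    fp, fn = error_rates(c)
    print(f"criterion={c:.1f}  false positives={fp:.3f}  false negatives={fn:.3f}")
```

In this model, a stricter criterion lowers the false positive rate while raising the false negative rate; only a better ability to discriminate good from bad work (a larger d') lowers both at once.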

With all this in mind, it is somewhat surprising that APR also fails to keep out many obviously bad papers.