James Randi, magician and scientific skeptic, has compared those who believe in the paranormal to “unsinkable rubber ducks”: after a particular claim has been thoroughly debunked, the ducks submerge, only to resurface again a little later to put forward similar claims.
In light of this analogy, it comes as no surprise that Bem and colleagues have produced a new paper claiming that people can look into the future. The paper is titled “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events” and it is authored by Bem, Tressoldi, Rabeyron, and Duggan.
Several of my colleagues have browsed Bem's meta-analysis and have asked for my opinion. Surely, they say, the statistical evidence is overwhelming, regardless of whether you compute a p-value or a Bayes factor. Have you not changed your opinion? This is a legitimate question, one which I will try to answer below by sharing my review of an earlier version of the Bem et al. manuscript.
I agree with the proponents of precognition on one crucial point: their work is important and should not be ignored. In my opinion, the work on precognition shows in dramatic fashion that our current methods for quantifying empirical knowledge are insufficiently strict. If Bem and colleagues can use a meta-analysis to demonstrate the presence of precognition, what should we conclude from a meta-analysis on other, more plausible phenomena?
Disclaimer: the authors have revised their manuscript since I reviewed it, and they are likely to revise their manuscript again in the future. However, my main worries call into question the validity of the enterprise as a whole.
To keep this blog post self-contained, I have added annotations in italics to provide context for those who have not read the Bem et al. manuscript in detail.
My review of Bem, Tressoldi, Rabeyron, and Duggan
“Ah, psi. The remainder of my review will use a professional tone, and I will try to outline the problems I have with the authors' analyses. That said, I do think that this line of research tarnishes the reputation of psychology as an academic discipline. I urge the authors to convince themselves of the absence of psi by trying to replicate one of Bem's experiments in a purely confirmatory setting, with a preregistered analysis protocol. When they monitor the Bayes factor they will, as N grows large, obtain massive evidence in favor of the truth. One good, preregistered experiment is worth a thousand experiments whose results are based on cherry-picking. As an indication that cherry-picking is indeed the problem: I have never seen a preregistered experiment that monitored the Bayes factor and ended up supporting psi. Never. If the authors are able to produce such evidence in their own lab (after preregistering the analysis on OSF, and collecting data until the one-sided BF in favor of psi reaches, say, 20) then they can challenge me to an adversarial collaboration and I will gladly accept. Anyway, having made my prior opinion clear, let's move on to the review. I have several major worries:
Background for Worry 1: In the abstract, Bem and colleagues suggest that the meta-analysis concerns replications of Bem's 2011 studies. However, this is not the case. The meta-analysis largely consists of studies that pre-date the 2011 work. Almost all of the pre-2011 studies were conducted by ESP proponents. The post-2011 studies, many of which were conducted by ESP skeptics, found no effect.
Worry 1. The authors wish to study replications of Bem's work. This means they should only consider studies that were inspired by Bem (2011). A quick look at Table A1 shows that very many studies in the meta-analysis preceded Bem, sometimes by as much as 10 years. It is possible that the earlier studies had advance access to Bem’s protocol, but if this is the case it should be made clear from the outset. A related worry is that skeptics only got interested after the publication of Bem (2011). Hence, I believe that there may be a difference between replications “pre-Bem” (conducted and reported by proponents only) and “post-Bem” (conducted by proponents and skeptics alike). This is a factor that should be taken into account. Perhaps the size of the effect suddenly decreased after 2011?
Indeed, when I consider only those psi replications that have been published post-Bem, I find Galak, Ritchie, Robinson, Subbotsky, Traxler, and Wagenmakers (the Hitchman studies seemed to be about creativity and luck, so I did not incorporate them; including them does not change the pattern of results). Below is a table of their experiments and effect sizes:
| Study | N | Effect size |
|-------|---|-------------|
| Galak Exp 1 | 112 | -0.113 |
| Galak Exp 2 | 158 | 0.000 |
| Galak Exp 3 | 124 | 0.110 |
| Galak Exp 4 | 109 | 0.170 |
| Galak Exp 5 | 211 | 0.050 |
| Galak Exp 6 | 106 | -0.029 |
| Galak Exp 7 | 2469 | -0.005 |
| Ritchie Exp 1 | 50 | 0.016 |
| Ritchie Exp 2 | 50 | -0.222 |
| Ritchie Exp 3 | 50 | -0.041 |
| Subbotsky Exp 1 | 75 | 0.282 |
| Subbotsky Exp 2 | 25 | 0.302 |
| Subbotsky Exp 3 | 26 | -0.412 |
| Traxler Exp 1 | 48 | 0.060 |
| Traxler Exp 2 | 60 | -0.346 |
When it comes to a proper assessment of the replication success of Bem’s studies, I think the above table is the correct one. I have not run the meta-analysis myself, but from eyeballing the numbers it seems that there is nothing there whatsoever. The fact that this picture changes when the other studies are added supports the assertion that those studies are contaminated by researcher bias and a lack of control over the analysis procedure.
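Readers who want to check the eyeballing can compute a quick sample-size-weighted average of the effect sizes in the table. This is only a rough sanity check, not the random-effects meta-analysis that a proper treatment would require; the numbers are copied from the table above.

```python
# Sample-size-weighted mean effect size for the post-Bem replications
# listed in the table above (study label, N, effect size).
studies = [
    ("Galak Exp 1", 112, -0.113), ("Galak Exp 2", 158, 0.000),
    ("Galak Exp 3", 124, 0.110), ("Galak Exp 4", 109, 0.170),
    ("Galak Exp 5", 211, 0.050), ("Galak Exp 6", 106, -0.029),
    ("Galak Exp 7", 2469, -0.005),
    ("Ritchie Exp 1", 50, 0.016), ("Ritchie Exp 2", 50, -0.222),
    ("Ritchie Exp 3", 50, -0.041),
    ("Subbotsky Exp 1", 75, 0.282), ("Subbotsky Exp 2", 25, 0.302),
    ("Subbotsky Exp 3", 26, -0.412),
    ("Traxler Exp 1", 48, 0.060), ("Traxler Exp 2", 60, -0.346),
]

total_n = sum(n for _, n, _ in studies)
weighted_mean = sum(n * es for _, n, es in studies) / total_n
print(f"Total N = {total_n}, weighted mean effect size = {weighted_mean:.4f}")
```

With these numbers the weighted mean lands within a hair of zero, consistent with the "nothing there whatsoever" reading.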
Background for Worry 2: A meta-analysis is sometimes likened to the goose who eats garbage and produces golden eggs. In fact, it is more apt to recall the saying “garbage in, garbage out”; when a meta-analysis is unleashed on a set of biased studies, the result is likely to be incorrect and misleading.
Worry 2. A meta-analysis is only reliable when the individual studies are reliable. When the individual studies tend to be biased towards “discovering” psi, the meta-analysis is useless and will just confirm the presence of researcher bias rather than psi. Some of the studies that tried to replicate Bem prevented cherry-picking and data torture by using preregistration. Only with preregistration can we be somewhat certain that the results are clean. Bem’s own studies, for instance, are tainted by all kinds of post-hoc procedures in order to present the results in the most favorable light. This was common practice in the field, and it still is. For a subject that is as contentious as psi, only studies with preregistration should be allowed in the meta-analysis.
Background for Worry 3: Bem et al. deem replication studies suspect when they do not use the same software program as the original study. This is odd.
Worry 3. I had never before heard the idea that a precise replication is one that uses the same software program. One reason not to use the original program is that it capitalized on chance by presenting many different sorts of pictures. Another reason is that the design of specific experiments was suboptimal (e.g., not counterbalanced). When the authors argue that the Galak experiments (and my own) are not direct replications, they make it quite clear that their purpose is to present evidence for the presence of psi, regardless of any inconvenient facts they may encounter along the way.
Background for Worry 4: Bayes factors (BFs) for related experiments may not simply be multiplied. This worry is important, because it affects the analysis of more traditional studies as well.
Worry 4. The authors do not indicate how they calculated the Bayes factor. My guess is that they simply multiplied the BFs from the individual studies; this practice is attractive but flawed, as it assumes that the individual experiments are independent and unrelated. Instead, the authors should calculate a BF that compares the following two models: H0: the mean of the random-effects meta-analysis equals zero, versus H1: the mean of the random-effects meta-analysis is larger than zero.
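A toy calculation makes the flaw concrete. The model below is deliberately simple and the data are made up for illustration: normal observations with known unit variance, H0: mu = 0 versus H1: mu ~ N(0, 1). Multiplying per-study BFs implicitly lets each study draw its own effect afresh from the prior, whereas the pooled BF assumes one common effect; the two answers differ even when the studies report identical results.

```python
import math

def normal_pdf(x, var):
    """Density of a zero-mean normal with variance var, evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bf10(mean, n):
    """BF10 for H0: mu = 0 vs H1: mu ~ N(0, 1), given the observed mean
    of n observations with known sigma = 1.  Under H0 the sample mean is
    N(0, 1/n); under H1 it is marginally N(0, 1 + 1/n)."""
    return normal_pdf(mean, 1.0 + 1.0 / n) / normal_pdf(mean, 1.0 / n)

# Two hypothetical studies with identical (made-up) results: n = 50 each,
# observed mean 0.2.  Multiplying their BFs is not the same as computing
# the BF for the pooled data under a single common effect.
bf_product = bf10(0.2, 50) * bf10(0.2, 50)
bf_pooled = bf10(0.2, 100)

print(f"product of per-study BFs: {bf_product:.3f}, pooled BF: {bf_pooled:.3f}")
```

With these numbers the two approaches disagree by a factor of about five, which is exactly why related experiments need a joint (e.g., random-effects) model rather than a product of BFs.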
Background for Worry 5: Fail-safe N is the number of studies that have to show a null result in order to wash out the reported effect. When fail-safe N is very large, it is tempting to conclude that the reported effect is robust to publication bias. However, meta-analytic fail-safe methods are not informative in the presence of questionable research practices. Again, this worry is important because it also affects the analysis of more traditional studies.
Worry 5. The fail-safe analyses are not informative. Even if the null hypothesis is exactly true, one can still extract a significant p-value through a combination of questionable research practices. In fact, for me this is the main, really important message from the authors’ fail-safe analysis: fail-safe analyses are meaningless in the presence of questionable research practices.
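One way to see this is to simulate optional stopping, a classic questionable research practice: test after every batch of observations and stop as soon as p &lt; .05. The simulation below is my own illustration, not anything from the manuscript; it uses a two-sided z-test on normal data where the true effect is exactly zero.

```python
import math
import random

def p_value(sample_mean, n):
    """Two-sided z-test p-value for H0: mu = 0 with known sigma = 1."""
    z = abs(sample_mean) * math.sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

random.seed(1)
n_sims, batch, max_n = 2000, 10, 100
false_positives = 0

for _ in range(n_sims):
    total, n = 0.0, 0
    while n < max_n:
        for _ in range(batch):
            total += random.gauss(0.0, 1.0)  # true effect is exactly zero
        n += batch
        if p_value(total / n, n) < 0.05:     # peek after every batch of 10
            false_positives += 1             # "significant" under a true null
            break

rate = false_positives / n_sims
print(f"False-positive rate with optional stopping: {rate:.3f}")
```

Despite a nominal alpha of .05, peeking after every batch inflates the false-positive rate severalfold, and no fail-safe calculation can detect that the "significant" studies were produced this way.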
Background for Worry 6: Several researchers have argued that the Bem 2011 findings do not stand up to statistical scrutiny. The meta-analysis does not cite these critical articles.
Worry 6. Can one replicate that which does not exist? The Bem studies have been subjected to critique from many different angles. For instance, Greg Francis has argued that the results are too good to be true, and Judd, Westfall, and Kenny have shown that certain experiments no longer yield significant results once you include items as a random effect.
Background for Worry 7: A meta-analysis is of limited value when the individual studies are biased. This is perhaps the main lesson we can learn from research on ESP.
Worry 7. This is a worry for the editor rather than the authors. If ever there was a paper that showed the futility of meta-analysis, it is this one. Here we have one of the most ridiculous claims that intelligent people have ever dared to make (yes, the hypothesis that aliens built the pyramids is more plausible than people being able to look into the future) – and a meta-analysis supports this claim. The unavoidable conclusion is not that psi exists; rather, it is that meta-analysis is a tool fraught with danger.
I raised some minor concerns as well, but I have deleted them here.
I always sign my reviews,”
In conclusion, the work by Bem and colleagues demonstrates that our bread-and-butter statistical methods cannot be relied upon blindly. Meta-analyses, Bayes factors, p-curve analyses: all of these methods crumble to dust in the hands of researchers who are motivated to prove that ESP exists. In fact, it is not an exaggeration to state that ESP is where statistical methods go to die. Of course, Bem and colleagues will disagree with my interpretation of their work. To resolve the conflict, I repeat my challenge: preregister the experiment on OSF, monitor the Bayes factor until it exceeds 20, and I will happily participate in an adversarial collaboration, grudgingly acknowledging defeat if the data force me to. If the ESP proponents truly believe that the effect is real, preregistration and adversarial collaborations are the only way forward. Perhaps the same holds for other psychological phenomena that are currently under scrutiny as well.