Oct 30, 2014

Reproducible Reporting


Why reproducible reporting?

You are running an experiment when a conference deadline comes along; you quickly put together an abstract with 'preliminary findings' showing this or that effect. When preparing the talk for the conference, typically months later, you find a different p-value, F-value, t-value or whatever your favorite statistic was; it may still be significant or as expected, but different regardless; this nags you.

You are preparing a manuscript with multiple experiments, a dozen or so tables and figures, and many statistical results such as t-tests, ANOVAs, et cetera. During manuscript preparation you decide, for good reasons, to use different exclusion criteria for your response time data as well as for your participants. Then the new semester starts, you are overloaded with teaching duties, and you are only able to return to your manuscript weeks later. Did you already update figure 5 with the new exclusion criteria? Did you already re-run the t-test on page 17 concerning the response times?

Unfortunately, I myself and many others are familiar with such experiences of not being able to reproduce results. Here, reproducibility does not refer to experimental results themselves being reproducible. Reproducibility of experimental results has been written about frequently (see discussion here), and serious effort is now being put into testing the reproducibility of common results (e.g. in the Reproducibility Project), as well as into improving standards when it comes to experimentation (e.g. by pre-registration, see discussion here). Rather, this post focuses on the reproducibility of preparing the data, running statistical analyses, producing figures and tables, and reporting results in a journal manuscript.

During my own training as a psychologist, not much attention was given in the curriculum to standards of bookkeeping, lab journals, and reporting of results. The standards that are in place focus on stylistic aspects of reporting, such as that the F in an F-value should be italicized rather than upright. The APA reporting guidelines mostly concern matters of form and style, such as the exquisitely detailed guidelines for producing references. While such uniformity of manuscripts in form and style is highly relevant when it comes to printing material in journals, those guidelines have little to say about the contents of what is reported or about how to maintain the reproducibility of analyses.

What is reproducible reporting?

When it comes to reproducible reporting, the goal is a workflow from (raw) data files to (pre-)print manuscript in which every step along the way is reproducible by others, including your future self -- for example, by others who have an interest in the results for the purposes of replication or quality control. It may be useful to have someone particular in mind, such as Jelte Wicherts or Uri Simonsohn, both famous for tracking down errors (and worse) in published research -- hence, the goal is to make your work, say, Wicherts-proof. Typically, there are at least three phases involved in (re-)producing reports from data:

  1. transforming raw data files to a format that admits statistical analysis

  2. making figures and tables, and producing analytical results from the data

  3. putting it all together into a report, either a journal manuscript, a presentation for a conference, or a website

To accomplish reproducibility in reporting, the following are important desiderata for our toolchain:

  1. Scripting Using a scripting language (as opposed to a point-and-click interface) for the statistical analyses, such as R, has the important advantage that analyses can easily be reproduced from the script. Popular statistical packages, such as SPSS, also allow analyses to be saved as scripts. However, since this is not their standard way of operating, doing so is easily omitted.

  2. One tool instead of many R (and possibly other similar statistical packages) can be used to perform all the analytical steps, from reading raw data files to analysing the data and producing (camera-ready) figures and tables. This is a major advantage over using a combination of tools: a text editor to reformat raw data, a spreadsheet program to clean them, SPSS or the like to do the analyses, and a graphics program to produce figures.

  3. High quality output in multiple formats A toolchain to produce scientific output should produce high quality output, preferably suitable for immediate publication; that is, it should produce camera-ready copy. It should also produce outputs in different formats, that is, analysis scripts and text should be re-usable to produce either journal manuscripts, web pages or conference presentations. Other important requirements are i) portability (anyone using your source, in plain text format, should be able to reproduce your manuscript, whether they are working on Mac, Linux, Windows or otherwise), ii) flexibility (seamless integration of references and production of the reference list, automatic table of contents, indexing), iii) scalability (the tool should work similarly for small journal manuscripts and multi-volume books), iv) the separation of form and content (scientific reporting should be concerned with content, formatting issues should be left to journals, website maintainers et cetera).

  4. Maintain a single source Possibly the most important requirement for reproducible reporting is that the three phases of getting from raw data to pre-print manuscript are kept in a single file; data preprocessing, data analysis, and the text making up a manuscript should be tightly coupled in a single place. A workflow that only comprises a single file and where analytical results are automatically inserted into the text of a manuscript prevents common errors such as forgetting to update tables and figures (did I update this figure with the new exclusion criteria or not?), and, most importantly, simply making typing errors in copying results (the latter is arguably quite a common source of errors and hard to detect, especially so when the results are in favour of your hypotheses).

Fortunately, many tools are available these days that satisfy these requirements and help ensure reproducibility of each of the phases of reporting scientific results. In particular, data pre-processing and data analysis can be made reproducible using R, an open-source, script-based statistics package, which currently has close to 6000 add-on packages for analyses ranging from simple t-tests to multilevel IRT models, and from analysing two-by-two tables to intricate Markov models for complex panel designs.

As for the third requirement, high-quality output, LaTeX is the tool of choice. LaTeX has been the de facto standard for high-quality typesetting in the sciences since its creation by Leslie Lamport in the mid-1980s. LaTeX belongs to the family of markup languages, together with HTML, Markdown, XML, et cetera; the key characteristic of markup languages is the separation of form and content. Writing in LaTeX means assigning semantic labels to parts of your text, such as title, author, section, et cetera. Rather than deciding that a title should be typeset in 12-point Times New Roman, as an author I just want to indicate what the title is. Journals, websites and other media can then decide what their preferred format for titles is, and use the source LaTeX file to produce, say, PDF output for a journal or HTML output for a website. Separation of form and content is precisely the feature that allows this flexibility in output formats.
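To make the idea of semantic markup concrete, a minimal LaTeX document might look like the sketch below (the class and labels are generic, not tied to any particular journal):

```latex
\documentclass{article}  % the class, not the author, controls appearance

\title{A Minimal Example}
\author{A. N. Author}

\begin{document}
\maketitle               % typesets title and author per the class's rules

\section{Introduction}   % numbered and styled by the class
We only state \emph{what} each element is (title, section, emphasis);
\emph{how} it looks is decided by the class file, journal, or website.
\end{document}
```

Swapping the document class, without touching the text, is all it takes to re-style the entire manuscript.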

The superior typographical quality of LaTeX-produced output is nicely illustrated here, showing improved hyphenation, spacing, standard use of ligatures, and more. Besides typographical quality, portability, automatic indexing, tables of contents, citations and referencing, and scalability were built into the design of LaTeX. See here for a -- balanced -- discussion of the benefits of LaTeX over other word processors.

The real benefit of these tools -- R and LaTeX -- comes when they are used in combination. Several tools are available to combine R with LaTeX in a single document, and the main advantage of this combination is that all three phases of scientific report production are combined in a single document, fulfilling the fourth requirement. In the following I provide some minimal examples of using R and LaTeX together.

How to get started with reproducible reporting?

The main tools required for combining statistical analyses in R with LaTeX are Sweave and knitr. Below are the main portals for getting the required -- free! open-source! -- software, as well as some references to appropriate introductory guides.

  1. Getting started with R: the R home page, for downloads of the program, add-on packages and manuals; as for the latter, R for psychologists, is particularly useful with worked examples of many standard analyses.

  2. Getting started with LaTeX: the LaTeX-homepage for downloads, introductory guides and much much more; here's a cheat sheet that helps in avoiding having to read introductory material.

  3. Getting started with Sweave, or alternatively knitr; both provide minimal examples to produce reproducible reports.

If using these programs through their command-line interfaces seems overwhelming to start with, RStudio provides a nice graphical user interface for R, as well as options to produce PDFs from Sweave and/or knitr files.

Needless to say, Google is your friend for finding more examples, troubleshooting et cetera.

Minimal R, LaTeX and Sweave example

A minimal example of using R and LaTeX using Sweave can be downloaded here: reproducible-reporting.zip.

To get it working do the following:

  1. Install RStudio
  2. Install LaTeX
  3. After unpacking, open the .Rnw file in RStudio and run Compile PDF
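For reference, the essence of such an .Rnw file can be sketched as follows (a hypothetical minimal document, not the contents of the zip file, using R's built-in sleep data): R code lives between <<>>= and @ markers inside ordinary LaTeX, and \Sexpr{} splices computed values into the running text.

```latex
\documentclass{article}
\begin{document}

\section{Results}

% an R chunk: executed each time the document is compiled
<<paired-test, echo=TRUE>>=
tt <- with(sleep, t.test(extra[group == 1], extra[group == 2],
                         paired = TRUE))
tt
@

The paired $t$-test gave $t(\Sexpr{tt$parameter}) =
\Sexpr{round(tt$statistic, 2)}$, $p = \Sexpr{round(tt$p.value, 3)}$;
these numbers are inserted at compile time, so they cannot get out of
sync with the analysis.

\end{document}
```

Running Compile PDF in RStudio (or R CMD Sweave followed by pdflatex) turns this single source into the finished PDF.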

Some Markdown examples

Similar to the combination of LaTeX and R in Sweave, Markdown combines simple markup with R code to produce HTML pages, as is done in this blog. The examples below illustrate some of the possibilities.

To get the flavour, start by loading some data and using 'head(sleep)' to see what's in it:
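The R code in that chunk might look as follows (the sleep data set ships with base R):

```r
# 'sleep' ships with base R: extra sleep (in hours) under two drugs
data(sleep)
head(sleep)  # columns: extra (hours gained), group (drug 1 or 2), ID (patient)
```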


These data give the increase in hours of sleep, compared to control, for patients given each of two drugs. The following plots the means and standard deviations for the two treatments:
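One way to produce such a plot in base R (a sketch; the original figure may have used different graphical choices):

```r
# per-drug means and standard deviations of extra sleep
means <- tapply(sleep$extra, sleep$group, mean)
sds   <- tapply(sleep$extra, sleep$group, sd)

# plot the two means with +/- 1 SD error bars
plot(1:2, means, xlim = c(0.5, 2.5),
     ylim = range(c(means - sds, means + sds)),
     xaxt = "n", pch = 19,
     xlab = "Drug", ylab = "Extra sleep (hours)")
axis(1, at = 1:2, labels = paste("Drug", levels(sleep$group)))
arrows(1:2, means - sds, 1:2, means + sds,
       angle = 90, code = 3, length = 0.1)
```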


This data set is the one used by Student to introduce his t-test, so it's only fitting to do such a test on these data here:
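Since each patient received both drugs, a paired test is appropriate; the chunk presumably ran something like:

```r
# Student's original data: each patient tried both drugs, so pair the scores
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))
# gives t(9) of about -4.06, p of about 0.003
```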


There is much more to say about reproducibility in reporting and statistical analysis than a blog can contain. There are several recent books on the topic for further reading:

Gandrud, C. (2013). Reproducible research with R and RStudio. CRC Press.

Stodden, V., Leisch, F., & Peng, R. D. (Eds.). (2014). Implementing Reproducible Research. CRC Press.

Xie, Y. (2013). Dynamic Documents with R and knitr. CRC Press.

... and an online course here.

There is much to be gained in the acceptance of reproducible research practices, and reproducible reporting should be part and parcel of that effort. Such acceptance requires investment not only from scientists and authors but also from journals, in that they should accept LaTeX/R formatted manuscripts. Many psychometrics and mathematical psychology journals do accept LaTeX, as does Elsevier, which provides explicit instructions for submitting manuscripts in LaTeX, and so does Frontiers, see here. Change in journal policy can be brought about by (associate) editors: if you hold such a position, make sure that your journal subscribes to open-source, reproducible reporting standards, as this will also help improve the standards of reviewing and eventually the journal itself. As a reviewer, one may request to see the data and analysis files to aid in judging the adequacy of reported results. At least some journals, such as the Journal of Statistical Software, require the source and analysis files to be part of every submission; in fact, JSS accepts LaTeX submissions only, and this should become the standard throughout psychology and the other social sciences as well.

Oct 24, 2014

Two Calls to Conscience in the Fight for Open Access


In celebration of Open Access Week, we'd like to share two pieces of writing from open access advocates who faced or are facing persecution for their efforts towards sharing knowledge.

The first is a letter from Diego A. Gómez Hoyos. Gómez is a Colombian graduate student studying biodiversity who is facing up to eight years in prison for sharing a research article. He writes:

The use of FLOSS was my first approach to the open source world. Many times I could not access ecological or statistical software, nor geographical information systems, despite my active interest in using them to make my first steps in research and conservation. As a student, it was impossible for me to cover the costs of the main commercial tools. Today, I value access to free software such as The R project and QGis project, which keep us away from proprietary software when one does not have the budget for researching.

But it was definitely since facing a criminal prosecution for sharing information on the Internet for academic purposes, for ignoring the rigidity of copyright law, that my commitment to support initiatives promoting open access and to learn more about ethical, political, and economic foundations has been strengthened.

I am beginning my career with the conviction that access to knowledge is a global right. The first articles I have published in journals have been under Creative Commons licenses. I use free or open software for analyzing. I also do my job from a social perspective as part of my commitment and as retribution for having access to public education in both Colombia and Costa Rica.

From the situation I face, I highlight the support I have received from so many people in Colombia and worldwide. Particularly, I thank the valuable support of institutions working for our freedom in the digital world. Among them I would like to acknowledge those institutions that have joined the campaign called “Let’s stand together to promote open access worldwide”: EFF, Fundación Karisma, Creative Commons, Internet Archive, Knowledge Ecology International, Open Access Button, Derechos Digitales, Open Coalition, Open Knowledge, The Right to Research Coalition, Open Media, Fight for the Future, USENIX, Public Knowledge and all individuals that have supported the campaign.

If open access was the default choice for publishing scientific research results, the impact of these results would increase and cases like mine would not exist. There would be no doubt that the right thing is to circulate this knowledge, so that it should serve everyone.

Thank you all for your support. Diego A. Gómez Hoyos

The second document we’re sharing today is the Guerilla Open Access Manifesto, written by the late Aaron Swartz in 2008:

Information is power. But like all power, there are those who want to keep it for themselves. The world's entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations. Want to read the papers featuring the most famous results of the sciences? You'll need to send enormous amounts to publishers like Reed Elsevier.

There are those struggling to change this. The Open Access Movement has fought valiantly to ensure that scientists do not sign their copyrights away but instead ensure their work is published on the Internet, under terms that allow anyone to access it. But even under the best scenarios, their work will only apply to things published in the future. Everything up until now will have been lost.

That is too high a price to pay. Forcing academics to pay money to read the work of their colleagues? Scanning entire libraries but only allowing the folks at Google to read them? Providing scientific articles to those at elite universities in the First World, but not to children in the Global South? It's outrageous and unacceptable.

"I agree," many say, "but what can we do? The companies hold the copyrights, they make enormous amounts of money by charging for access, and it's perfectly legal — there's nothing we can do to stop them." But there is something we can, something that's already being done: we can fight back.

Those with access to these resources — students, librarians, scientists — you have been given a privilege. You get to feed at this banquet of knowledge while the rest of the world is locked out. But you need not — indeed, morally, you cannot — keep this privilege for yourselves. You have a duty to share it with the world. And you have: trading passwords with colleagues, filling download requests for friends.

Meanwhile, those who have been locked out are not standing idly by. You have been sneaking through holes and climbing over fences, liberating the information locked up by the publishers and sharing them with your friends.

But all of this action goes on in the dark, hidden underground. It's called stealing or piracy, as if sharing a wealth of knowledge were the moral equivalent of plundering a ship and murdering its crew. But sharing isn't immoral — it's a moral imperative. Only those blinded by greed would refuse to let a friend make a copy.

Large corporations, of course, are blinded by greed. The laws under which they operate require it — their shareholders would revolt at anything less. And the politicians they have bought off back them, passing laws giving them the exclusive power to decide who can make copies.

There is no justice in following unjust laws. It's time to come into the light and, in the grand tradition of civil disobedience, declare our opposition to this private theft of public culture.

We need to take information, wherever it is stored, make our copies and share them with the world. We need to take stuff that's out of copyright and add it to the archive. We need to buy secret databases and put them on the Web. We need to download scientific journals and upload them to file sharing networks. We need to fight for Guerilla Open Access.

With enough of us, around the world, we'll not just send a strong message opposing the privatization of knowledge — we'll make it a thing of the past. Will you join us?

Aaron Swartz July 2008, Eremo, Italy

In the past few years, the open access movement has gained momentum as the benefits it provides to individual researchers and to the scientific community as a whole have started to manifest. But even if open access lacked these benefits, we would still be morally obligated to advocate for it, because access to knowledge is a human right.

To learn more about open access efforts, visit the Open Access Week website.

Oct 22, 2014

Reexamining Reviewer Anonymity - More Costs than Benefits


Academic publishing dogma holds that peer reviewers (aka referees) should be anonymous. In the vast majority of cases, however, there are more costs than benefits to reviewer anonymity. Here, I make the case that reviewer identity and written reviews themselves should become publicly accessible information. Until then, reviewers should sign their reviews, as this practice can increase rigor, expose biases, encourage goodwill, and could serve as an honest signal of review quality and integrity.

Why reviewer anonymity solves nothing

The story goes that anonymity frees the reviewer from any reputational costs associated with providing a negative review. Without the cloak of invisibility, reviewers who provided devastating critiques would then become the target of attacks from those debased authors. Vengeful authors could sabotage the reviewer’s ability to get publications, grants, and tenure.

It’s imaginable that these vengeful authors who have the clout to sabotage another’s career might exist, but I’m willing to bet that few careers have been injured or sidelined due primarily to a bullying senior scientist. It’s difficult to say whether the absence of these horror stories is due to a lack of vengeful saboteurs or a lack of named reviewers. If you’re aware of rumored or confirmed instances of a scorned author who exacted revenge, please let me know in the comments section below.

Let’s appreciate that our default is to be onymous. Without hiding behind anonymity, we manage to navigate our careers, which often includes being critical and negative. We openly criticize others’ work in commentaries, at conferences, in post-publication reviews, and on Facebook and Twitter. Editorships are not the kiss of death, even though editors’ names appear at the bottom of rejection letters. Sure, editors typically have tenure and so you might think that there are no costs to their onymous criticism. But they also still attempt to publish and get grant funding, and their criticism, in the form of rejection letters, probably doesn’t hinder this. Moreover, for every enemy you make by publicly criticizing their work, in the form of post-publication reviews for example, you probably recruit an ally. Can’t newfound allies influence your ability to get publications, grants, and tenure just as much as adversaries?

JP de Ruiter, who wrote an excellent piece also questioning anonymous peer review, offered a simple workaround to the problem of the fearful young scientist criticizing the senior scientist: “Reviewers with tenure always sign their reviews.” This is great, but my fear is that most reviews are written by untenured scientists, so the problems associated with reviewer anonymity will remain with this rule in place. My advice to the untenured and those on, or soon to be on, the job market would be the same: sign all reviews. Even negative reviews that recommend rejection should be signed. Needless to say, negative reviews need to be written very carefully. Drawing attention to flaws, fatal or otherwise, can be done with tact. Speaking tentatively will take the sting out of any criticism. In the review that the author (or public) sees, you can suggest a more appropriate analysis or alternative explanation, but in the private comments to the editor, you can emphasize how devastating these shortcomings are. Keep in mind that reviewers do not need to communicate their recommendation (i.e., accept, revise, reject) in the review that the authors see. In fact, most editors prefer the recommendation be kept separate from the review. This allows them more freedom with their decision. Also, newly minted PhDs and postdocs should keep in mind that there are practices and laws in place so that a scorned search committee member cannot make a unilateral decision.

A second worry is that, without anonymity, reviewers would have to worry about being critical of a known colleague’s (and sometimes a friend’s) work. With anonymity, they’re free to criticize any manuscript and maintain favorable relationships with the authors. But if you’re worried about hurting a colleague’s feelings by delivering an honest and critical review, then you shouldn’t be a reviewer. Recuse yourself. Or maybe you shouldn’t infantilize your colleagues. They’ve probably learned to keep personal and professional relationships separate, and they would surely prefer an honest and constructive review, even if it was accompanied by some short-lived emotional pangs.

A third worry about reviewer transparency might be that it could produce credulous or timid reviews. I don’t see this as a serious threat. Even in the light of onymity, reviewers will still demand good evidence. Identified reviewers will still provide the occasional dismissive, sarcastic, and insulting review. I’m certain of this because of my history of providing brutal, onymous reviews and because of those few that I’ve received. Even with my name signed at the bottom, I’ve written some things that I would find difficult to say to the author’s face. I’m not inappropriate, but I can be frank.

Moreover, the concern that identifying reviewers will lead to overly effusive reviews is alleviated when we appreciate the reputational costs associated with providing uncritical reviews. No one wants their name in the acknowledgements of a worthless paper.

Five benefits of reviewer transparency

1) Encourage goodwill. Obviously, reviewer transparency can curb misbehavior. We’re all well aware that it’s easy to be nasty when anonymity reduces the associated risks. The vileness of many YouTube comments is an obvious example. de Ruiter argues that anonymous peer review not only has the unintended consequence of removing good science from the literature, but it also removes good scientists. I can attest to this, too. One of my former graduate students, having just gone through the peer review process, questioned his future in academia. He expressed that he didn’t want the fate of his career to hinge on the whims of three random people who are loaded with biases and can behave badly without consequence.

2) Get credit. Currently, we don't get credit for reviewing. If it's not related to your research, it's not worth your time to write a review, let alone write a high-quality review. "Opportunity costs!" screams the economist. But if we make reviews archivable, then we can receive credit, and we should be more likely to review. Some retention, tenure, and promotion committees would likely count these archived reviews as forms of scholarship and productivity. Altmetrics—quantitative measures of a researcher’s impact, other than journal impact factor—are becoming more and more popular, and unless journal impact factors are completely revamped, which is unlikely to happen anytime soon, we’ll all be hearing a lot more about altmetrics in the future. Digital-born journals are in a good position to overhaul the peer review process to make it transparent and archivable. F1000Research and PeerJ, for example, have laudable open peer review models.

The flip side of this “getting credit” benefit is that we’ll be able to see who’s free-riding. In a correspondence piece in Nature, Dan Graur argued that those scientists who publish the most are least likely to serve as reviewers. “The biggest consumers of peer review seem to contribute the least to the process,” he wrote. This inverse correlation was not found, however, in a proper analysis of four ecology journals over an 8-year period, but the ratio of researchers’ reviews to submissions could be journal or discipline specific. Bottom line: free-riding could be a problem, and reviewer onymity could help to reduce it.

A journal’s prestige comes primarily from the quality of the papers it publishes. And the quality of papers rests largely on the shoulders of the editors and peer reviewers. It follows, then, that the prestige of a journal is owed to its editors and reviewers. Editors get acknowledged. Their names are easily found in the colophon of a print journal and on the journal’s website, but not so for the reviewers’ names. Some journals publish an annual acknowledgment of manuscript reviewers, but because it’s divorced from any content—e.g., you don’t know who reviewed what and how often—it’s largely worthless and probably ignored. Given that the dissemination (and progress?) of science depends on free labor provided by reviewers, they should get credit for doing it. Admittedly, this would introduce complexities, such as including the recommendations of the reviewers. I’d appreciate it if I were acknowledged as a reviewer in each paper I review, but only if my recommendation accompanied my name: “Aaron Goetz recommended rejection.” A reviewer’s name, without her accompanying recommendation, in the acknowledgements of a published paper would look like an endorsement, and I know I’m not the only one to recommend rejection for a paper that was subsequently published. Even better, it would not be difficult to link associated reviews to the paper.

3) Accountability. You’ve surely opened up the pages of your discipline’s flagship journal, seen a laughable article, and wondered who let this nonsense get published. Picking the low-hanging fruit: who reviewed Bem’s precognition paper? Some reviewers of that paper, not Eric-Jan Wagenmakers, should be held accountable for wasting researcher resources. Try not to calculate how much time and effort was spent on the projects that set the record straight.

Another benefit that comes from shining light on reviewers would be the ability to recognize unsuitable reviewers and conflicts of interest. I bet a nontrivial number of people have reviewed their former students’ or former advisor’s work. I also have a hunch that a few topics within psychology owe their existence to a small network of researchers who are continually selected as the reviewers for these papers. As a hypothetical example, wouldn’t it be important to know that the majority of work on terror management theory was reviewed by Greenberg, Solomon, Pyszczynski, and their students? Although I don’t think that G, S, P, and their students conducted the majority of reviews of the hundreds of terror management papers, I am highly skeptical of TMT for theoretical reasons. But I digress.

Some colleagues have confessed that, when reviewing a manuscript that has the potential to steal their thunder or undermine their work, they were more critical, were more likely to recommend rejection, and took significantly longer to return their review. This is toxic and is “damaging science” in de Ruiter’s words.

And for those senior researchers who delegate reviews to graduate students, onymity could alleviate the associated bad practices. Senior researchers will either be forced to write their own reviews or engage in more pedagogy so that their students’ reviews meet basic standards of quality.

4) Clarification. Authors would be able to ask reviewers to clarify their comments and suggestions, even if official correspondence between the two is severed due to the manuscript’s rejection. I’ve certainly received comments that I didn’t know quite what to do with. I once got “The authors should consider whether these perceptual shifts are commercial.” Huh? Commercial? Of course, a potential danger is that authors and reviewers could open a back-channel dialog that excludes the editor. I imagine that some editors will read potential danger, while some will read potential benefit. If you’re the former, an explicit “Authors and reviewers should refrain from communicating with one another about the manuscript throughout the review process” would probably do the trick.

5) Increased quality. This is the primary benefit of review transparency. I know that I’m not the only reviewer who has, at some point, written a hasty or careless review. Currently, there are no costs to reviewers who provide careless or poor-quality reviews, but there are serious costs associated with careless reviews, the primary being impeding scientific progress and wasting researcher resources. If we tie reputational consequences to reviews, then review quality increases. This practice might also increase review time, but that’s a cost we should be willing to incur to increase quality and accountability, expose biases, give credit where it’s due, and encourage goodwill.

There’s actually some empirical evidence suggesting that signed reviews are of higher quality than unsigned reviews. Reviewers for the British Journal of Psychiatry were randomly assigned to signed and unsigned groups and provided real reviews for manuscripts, per business as usual. The researchers then measured the quality of the reviews and compared them. By most measures, review quality was modestly but statistically better among the signed reviews. These data, however, aren’t as relevant to my argument, because reviewer identity was only revealed to the authors of the manuscripts rather than to the entire scientific community. Any differences noted between the signed and unsigned groups are likely conservative estimates of what would happen if reviewers’ names and recommendations were publicly attached to papers, where reputational costs and benefits could be incurred. Another study also examining the effect of reviewer anonymity on review quality did not find any differences between the signed and unsigned groups, but this study suffers from the same limitation as the first: reviewer identity was only revealed to the authors of the manuscripts and did not appear in the subsequent publishing of accepted manuscripts.

Signed reviews could become what evolutionary scientists call honest signals. Honest signals—sometimes referred to as hard-to-fake signals, costly signals, or Zahavian signals—refer to traits or behaviors that are metabolically and energetically costly or dangerous to produce, maintain, or express. We all know the peacock’s tail as the standard example. A peacock’s tail honestly signals low parasite load. Only healthy, high quality males can afford to produce large, bright tail feathers. And many of us learned that the stotting of many antelope species is best understood as an honest signal of quality.

Much in the same way that large, bright tail feathers honestly signal health, signed reviews can honestly signal review quality and integrity. Only a reviewer who writes a high quality review and is confident that the review is high quality can afford to sign her name at the bottom of her review. And only a reviewer who is confident that her critical review is fair and warranted can afford to sign her name.

It’s easy to write a subpar review; it probably happens every day. It’s not easy, however, to write a subpar review if your name is attached to it. Our desire to maintain our reputation is strong. To illustrate this, I challenge you to tweet or update your Facebook status to this: “CBS is hands down the best network. I could watch it all day.”


I recently reread an onymous review I received from a colleague I deeply respect. His review did not pull any punches, and parts of it would probably be considered abrasive by those who haven’t developed a thick skin. When I received it, I recognized that to give my paper such a review—to dig up those obscure references, to run those analyses, to identify all of the strengths and weaknesses, and to entertain those alternative explanations—he had to get intimate with it. He had to let it marinate in his thoughts before he savored each paragraph, citation, and analysis. Although he ultimately devoured it, it was deeply rewarding to have someone else care that much about my work. And it fills me with gratitude to know who it was that gave up their weekend, their time on their own work, or their time with their friends and family. Anything that increases rigor, exposes biases, aids scientific progress, and promotes gratitude and goodwill should at least be considered. And beyond mere consideration, journals should think seriously about examining the differences between signed and unsigned reviews and between public and private reviews. Editors have all they need to examine the differences between signed and unsigned reviews, and editors open to testing an open reviewer system that links reviews to published papers can contact me and others at the Open Science Collaboration.


1 Yes, you’re recognizing onymous as the back-formation of anonymous. Onymous is synonymous with named, identified, or signed.

2 JP de Ruiter, whom I mention a few times throughout, wrote a piece that argued the same basic point that I’m trying to make here: anonymous peer review is toxic and it should be fixed. Andrew Sabisky alerted me to de Ruiter’s post, and it inspired me to finish writing this piece. Many thanks to both. I encourage you to read de Ruiter’s post. Also, Kristy MacLeod wrote a great post about her decision to sign a review, and her decision seems to be rooted in honest signaling. I recommend you read it, too.

3 Geoffrey Miller wrote and signed the review I referenced in the last paragraph.

4 Special thanks go to Jessica Ayers, Kayla Causey, JP de Ruiter, Jon Grahe, and Gorge Romero, who gave comments on an earlier draft of this post. All errors are my own, of course.

Oct 14, 2014

A Psi Test for the Health of Science


Science is sick. How will we know when it's been cured?

Meta-analysis quantitatively combines the evidence from multiple experiments, across different papers and laboratories. It's the best way we have to determine the upshot of a spate of studies.

Published studies of psi (telepathy, psychokinesis, and other parapsychological phenomena) have been submitted to meta-analysis. The verdict of these meta-analyses is that the evidence for the existence of psi is close to overwhelming. Bosch, Steinkamp, & Boller (2006, Psychological Bulletin), for example, meta-analyzed studies of the ability of participants to affect the output of random number generators. These experiments stemmed from an older tradition in which participants attempted to influence a throw of dice to yield a particular target number. As with the old dice experiments, many of the studies found that the number spat out by the random number generator was more often the target number that the participant was gunning for than one would expect by chance. In their heroic effort, Bosch et al. combined the results of 380 published experiments and calculated that, if in fact psychokinesis does not exist, the probability of obtaining the published evidence by chance was less than one in a thousand (for one of their measures, z = 3.67). In other words, it is extremely unlikely that so much evidence in favor of psychokinesis would have resulted if psychokinesis actually does not exist.
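
For readers who want to check this arithmetic, the quoted probability follows directly from the z-score via the standard normal survival function. A minimal sketch in Python (my illustration, not code from the original paper):

```python
import math

def one_tailed_p(z):
    """One-tailed p-value for a z-score, via the standard normal
    survival function: p = 1 - Phi(z) = 0.5 * erfc(z / sqrt(2))."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Bosch et al.'s z = 3.67 corresponds to a p of roughly 1 in 8000,
# comfortably below the "one in a thousand" stated above.
print(one_tailed_p(3.67))
```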

Like many others, I suspect that this evidence stems not from the existence of psi, but rather from various biases in the way science today is typically conducted.

"Publication bias" refers to the tendency for a study to be published if it is interesting, while boring results rarely make it out of the lab. "P-hacking" - equally insidious - is the tendency of scientists to try many different statistical analyses until they find a statistically significant result. If you try enough analyses or tests, you're nearly guaranteed to find a statistically significant although spurious result. But despite scientists' suspicion that the seemingly-overwhelming evidence for psi is a result of publication bias and p-hacking, there is no way to prove this, or to establish it beyond a reasonable doubt (we shouldn't expect proof, as that may be a higher standard than is feasible for empirical studies of a probabilistic phenomenon).
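
The "nearly guaranteed" claim is easy to simulate: under a true null hypothesis every p-value is uniform on [0, 1], so the chance of at least one p < .05 among k independent tests is 1 - 0.95^k. A toy Python sketch (illustrative, not from any of the studies discussed):

```python
import random

def chance_of_false_positive(n_tests, n_sims=20_000, alpha=0.05):
    """Estimate the probability that at least one of n_tests independent
    tests of a true null comes out 'significant' at level alpha.
    Under the null, each simulated p-value is uniform on [0, 1]."""
    rng = random.Random(1)  # fixed seed for a reproducible estimate
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_sims)
    )
    return hits / n_sims

# With 20 analyses to choose from, a spurious "finding" turns up in
# roughly 64% of simulated studies (1 - 0.95**20 is about 0.64).
for k in (1, 5, 20):
    print(k, chance_of_false_positive(k))
```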

Fortunately these issues have received plenty of attention, and new measures are being adopted (albeit slowly) to address them. Researchers have been encouraged to publicly announce (simply by posting on a website) a single, specific statistical analysis plan prior to collecting data. This can eliminate p-hacking. Other positive steps, like the sharing of code and data, help other scientists to evaluate the evidence more deeply and to spot signs of p-hacking as well as inappropriate analyses and simple errors. In the case of a recent study of psi by Tressoldi et al., Sam Schwartzkopf has been able to wade into the arcane details of the study, revealing possible problems. But even if the Tressoldi et al. study is shown to be seriously flawed, Sam's efforts won't overturn all the previous evidence for psi, nor will they combat publication bias in future studies. We need a combination of measures to address the maladies that afflict science.

OK, so let's say that preregistration, open science, and other measures are implemented, and together fully remedy the unhealthy traditions that hold back efforts to know the truth. How will we know science has been cured?

A Psi Test for the health of science might be the answer. According to the Psi Test, until it can be concluded that psi does not exist using the same meta-analysis standards as are applied to any other phenomenon in the biomedical or psychological literature, science has not yet been cured.

Do we really need to eliminate publication bias to pass the Psi Test, or can meta-analyses deal with it? Funnel plots can provide evidence for publication bias. But given that most areas of science are rife with publication bias, if we use publication bias to overturn the evidence for psi, to be consistent we'd end up disbelieving countless more-legitimate phenomena. And my reading of medicine’s standard meta-analysis guide, by the Cochrane Collaboration, is that in Cochrane reviews, evidence for publication bias raises concerns but is not used to overturn the verdict indicated by the evidence.

Of course, instead of concluding that science is sick, we might instead conclude that psi actually exists. But I think this is not the case - mainly because of what I hear from physicists. And I think if psi did exist, there’d likely be even more overwhelming evidence for it by now than we have. Still, I want us to be able to dismiss psi using the same meta-analysis techniques we use for the run-of-the-mill. Others have made similar points.

The Psi Test for the health of science, even if valid, won't tell us right away that science has been fixed. But in retrospect we’ll know: once science is cured, a standard meta-analysis of the psi studies published from that year onward will conclude that psi does not exist.

Below, I consider two objections to this Psi Test.

Objection 1: Some say that we already can conclude that psi does not exist, based on Bayesian evaluation of the psi proposition. To evaluate the evidence from psi studies, a Bayesian first assigns a probability that psi exists, prior to seeing the studies' data. Most physicists and neuroscientists would say that our knowledge of how the brain works and of physical law very strongly suggests that psychokinesis is impossible. To overturn this Bayesian prior, one would need much stronger evidence than even the one-in-a-thousand chance derived from psi studies that I mentioned above. I agree; it's one reason I don't believe in psi. However, such a prior is pretty hard to pin down quantitatively, which partially explains why Bayesian analysis hasn’t taken over the scientific literature more rapidly. Also, there may be expert physicists out there who think some sort of quantum interaction could underlie psi, and it's hard to know how to quantitatively combine the opinions of dissenters with the majority.

Rather than relying on a Bayesian argument (although Bayesian analysis is still useful, even with a neutral prior), I'd prefer that our future scientific practice, involving preregistration, unbiased publishing, replication protocols, and so on reach the point where if hundreds of experiments on a topic are available, they should be fairly definitive. Do you think we will get there?

Objection 2: Some will say that science can never eliminate publication bias. While publication bias is reduced by the advent of journals like PLoS ONE that accept null results, and by the growing number of journals that accept papers prior to the data being collected, it may forever remain a significant problem. But there are further steps one could take: in open notebook science, all data is posted on the net as soon as it is collected, eliminating all opportunity for publication bias. But open notebook science might never become standard practice, and publication bias may remain strong enough that substantial doubt will persist for many scientific issues. In that case, the only solution may be a pre-registered, confirmatory large-scale replication of an experiment, similar to what we are doing at Perspectives on Psychological Science (I'm an associate editor for the new Registered Replication Report article track). Will science always need that to pass the Psi Test?

Oct 7, 2014

What Open Science Framework and Impactstory mean to these scientists' careers


This article was originally posted on the Impactstory blog.

Yesterday, we announced three winners in the Center for Open Science’s random drawing to win a year’s subscription to Impactstory for users that connected their Impactstory profile to their Open Science Framework (OSF) profile: Leonardo Candela (OSF, Impactstory), Rebecca Dore (OSF, Impactstory), and Calvin Lai (OSF, Impactstory). Congrats, all!

We know our users would be interested to hear from other researchers practicing Open Science, especially how and why they use the tools they use. So, we emailed our winners who graciously agreed to share their experiences using the OSF (a platform that supports project management with collaborators and project sharing with the public) and Impactstory (a webapp that helps researchers discover and share the impacts of all their research outputs). Read on!

What's your research focus?

Leonardo: I’m a computer science researcher. My research interests include Data Infrastructures, Virtual Research Environments, Data Publication, Open Science, Digital Library Management Systems and Architectures, Digital Libraries Models, Distributed Information Retrieval, and Grid and Cloud Computing.

Rebecca: I am a PhD student in Developmental Psychology. Broadly, my research focuses on children’s experiences in pretense, fiction and fantasy. How do children understand these experiences? How might these experiences affect children's behaviors, beliefs and abilities?

Calvin: I'm a doctoral student in Social Psychology studying how to change unconscious or automatic biases. In their most insidious forms, unconscious biases lead to discrepancies between what people value (e.g., egalitarianism) and how people act (e.g., discriminating based on race). My interest is in understanding how to change these unconscious thoughts so that they're aligned with our conscious values and behavior.

How do you use the Open Science Framework in the course of your research?

Leonardo: Rather than being an end user of the system for supporting my research tasks, I’m interested in analysing and comparing the facilities offered by such an environment with the concept of Virtual Research Environments.

Rebecca: At this stage, I use the OSF to keep all of the information about my various projects in one place and to easily make that information available to my collaborators--it is much more efficient to stay organized than constantly exchanging and keeping track of emails. I use the wiki feature to keep notes on what decisions were made and when and store files with drafts of materials and writing related to each project. Version control of everything is very convenient.

Calvin: For me, the Open Science Framework (OSF) encompasses all aspects of the research process - from study inception to publication. I use the OSF as a staging ground in the early stages for plotting out potential study designs and analysis plans. I will then register my study shortly before data collection to gain the advantage of pre-registered confirmatory testing. After data collection, I will often refer back to the OSF as a reminder of what I did and as a guide for analyses and manuscript-writing. Finally, after publication, I use the OSF as a repository for public access to my data and study materials.

What's your favorite Impactstory feature? Why?

Leonardo: I really appreciate the effort Impactstory is putting into collecting metrics on the impact my research products have on the web. I like its integration with ORCID and the recently introduced “Key profile metrics”, since it gives a nice overview of a researcher’s impact.

Rebecca: I had never heard of ImpactStory before this promotion, and it has been really neat to start testing out. It took me 2 minutes to copy my publication DOIs into the system, and I got really useful information that shows the reach of my work that I hadn't considered before, for example shares on Twitter and where the reach of each article falls relative to other psychology publications. I'm on the job market this year and can see this being potentially useful as supplementary information on my CV.

Calvin: Citation metrics can only tell us so much about the reach of a particular publication. For me, Impactstory's alternative metrics have been important for figuring out where else my publications are having impact across the internet. It has been particularly valuable for pointing out connections that my research is making that I wasn't aware of before.

Thanks to all our users who participated in the drawing by connecting their OSF and Impactstory profiles! Both of our organizations are proud and excited to be working to support the needs of researchers practicing Open Science, and thereby changing science for the better.

To learn more about our open source non-profits, visit the Impactstory and Open Science Framework websites.

Sep 9, 2014

The meaning of replicability across scientific disciplines


Recently, Shauna Gordon-McKeon wrote about the meaning of replicability on this blog, concentrating on examples from psychology. In this post, I summarize for comparison the situation in computational science. These two fields may well be at opposite ends of the spectrum as far as replication and replicability are concerned, so the comparison should be of interest for establishing terminology that is also suitable for other domains of science. For a more detailed discussion of the issues specific to computational science, see this post on my personal blog.

The general steps in conducting a scientific study are the same in all fields:

  1. Design: define in detail what needs to be done in order to obtain useful insight into a scientific problem. This includes a detailed description of required equipment, experimental samples, and procedures to be applied.

  2. Execution: do whatever the design requires to be done.

  3. Interpretation: draw conclusions from whatever results were obtained.

The details of the execution phase vary enormously from one discipline to another. In psychology, the "experimental sample" is typically a group of volunteers, who need to be recruited, and the "equipment" includes the people interacting with the volunteers and the tools they use, but also the conditions in which the experiment takes place. In physics or chemistry, for which the terms "sample" and "equipment" are most appropriate, both are highly specific to an experiment and acquiring them (by buying or producing) is often the hard part of the work. In computational science, there are no samples at all, and once the procedure is sufficiently well defined, its execution is essentially left to a computer, which is a very standardized form of equipment. Of course what I have given here are caricatures, as reality is usually much more complicated. Even the three steps I have listed are hardly ever done one after the other, as problems discovered during execution lead to a revision of the design. But for explaining concepts and establishing terminology, such caricatures are actually quite useful.

Broadly speaking, the term "replication" refers to taking an existing study design and repeating the execution phase. The motivation for doing this is mainly verification: the scientists who designed and executed the study initially may have made mistakes that went unnoticed, forgotten to mention an important aspect of their design in their publication, or at the extreme have cheated by making up or manipulating data.

What varies enormously across scientific disciplines is the effort or cost associated with replication. A literal replication (as defined in Shauna's post) of a psychology experiment requires recruiting another group of volunteers, respecting their characteristics as defined by the original design, and investing a lot of researchers' time to repeat the experimental procedure. A literal replication of a computational study that was designed to be replicable involves minimal human effort and an amount of computer time that is in most cases not important. On the other hand, the benefit obtained from a literal replication varies as well. The more human intervention is involved in a replication, the more chances for human error there are, and the more important it is to verify the results. The variability of the “sample” is also important: repeating an experiment with human volunteers is likely to yield different outcomes even if done with exactly the same subjects, and similar problems apply in principle with other living subjects, even as small as bacteria. In contrast, re-running a computer program is much less useful, as it can only discover rare defects in computer hardware and system software.
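
The near-zero benefit of re-running a deterministic program can be made concrete: when the design (here, a random seed and a sample size) fully specifies the computation, a literal replication is bit-for-bit identical. A toy illustration (my own example, not from any study discussed here):

```python
import random

def toy_study(seed, n=1000):
    """A miniature 'computational study': the design (seed, n)
    uniquely determines the outcome."""
    rng = random.Random(seed)
    return sum(rng.gauss(0, 1) for _ in range(n)) / n

# Two literal replications of the same design agree exactly,
# unlike repeated experiments on human volunteers.
assert toy_study(seed=42) == toy_study(seed=42)
# A change in the design, however small, changes the result.
assert toy_study(seed=42) != toy_study(seed=43)
```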

These differences lead to different attitudes toward replication. In psychology, as Shauna describes, literal replication is expensive and can detect only some kinds of potential problems, which are not necessarily expected to be the most frequent or important ones. This makes a less rigid approach, which Shauna calls "direct replication", more attractive: the initial design is not repeated literally, but in spirit. Details of the protocol are modified in a way that, according to the state of knowledge of the field, should not make a difference. This makes replication cheaper to implement (because the replicators can use materials and methods they are more familiar with), and covers a wider range of possible problems. On the other hand, when such an approach leads to results that are in contradiction with the original study, more work must be invested to figure out the cause of the difference.

In computational science, literal replication is cheap but at first sight seems to yield almost no benefit. The point of my original blog post was to show that this is not true: replication proves replicability, i.e. it proves that the published description of the study design is in fact sufficiently complete and detailed to make replication possible. To see why this is important, we have to look at the specificities of computation in science, and at the current habits that make most published studies impossible to replicate.

A computational study consists essentially in running a sequence of computer programs, providing each one with the input data it requires, which is usually in part obtained from the output of programs run earlier. The order in which the programs are run is very important, and the amount of input data that must be provided is often large. Typically, changing the order of execution or a single number in the input data leads to different results that are not obviously wrong. It is therefore common that mistakes go unnoticed when individual computational steps require manual intervention. And that is still the rule rather than the exception in computational science. The most common cause for non-replicability is that the scientists do not keep a complete and accurate log of what they actually did, because keeping such a log is a very laborious, time-consuming, and completely uninteresting task. There is also a lack of standards and conventions for recording and publishing such a log, making the task quite difficult as well. For these reasons, replicable computational studies remain the exception to this day. There is of course no excuse for this: it’s a moral obligation for scientists to be as accurate as humanly and technologically possible about documenting their work. While today’s insufficient technology can be partly blamed, most computational scientists (myself included) could do much better than they do. It is really a case of bad habits that we have acquired as a community.
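
One low-tech way to keep such a log is to drive every step from a script that records the command that was run together with a hash of each input file. The sketch below is a hypothetical illustration: the file names, log format, and helper functions are my own assumptions, not an existing tool:

```python
import hashlib
import json
import subprocess
import sys
import time

def sha256_of(path):
    """Content hash of an input file, so the log pins down exactly
    which data a step consumed."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_step(cmd, inputs, log_path="study-log.jsonl"):
    """Run one pipeline step and append a machine-readable record
    of what was done to an append-only log."""
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "cmd": cmd,
        "inputs": {p: sha256_of(p) for p in inputs},
    }
    subprocess.run(cmd, check=True)
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")

# Example: one step of a hypothetical two-program pipeline.
with open("raw-data.txt", "w") as f:
    f.write("1 2 3\n")
run_step([sys.executable, "-c", "print('analysis step ran')"],
         inputs=["raw-data.txt"])
```

Because the log records input hashes rather than just file names, a later reader can tell whether a step was re-run on changed data, which is exactly the kind of mistake described above.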

The good news is that people are becoming aware of the problem (see for example this status report in Nature) and working on solutions. Early adopters report consistently that the additional initial effort for ensuring replicability quickly pays off over the duration of a study, even before it gets published. As with any new development, potential adopters are faced with a bewildering choice of technologies and recommended practices. I'll mention my own technology in passing, which makes computations replicable by construction. More generally, interested readers might want to look at this book, a Coursera course, two special issues of CiSE magazine (January 2009 and July 2012), and a discussion forum where you can ask questions.

An interesting way to summarize the differences across disciplines concerning replication and reproducibility is to look at the major “sources of variation” in the execution phase of a scientific study. At one end of the spectrum, we have uncontrollable and even indescribable variation in the behavior of the sample or the equipment. This is an important problem in biology or psychology, i.e. disciplines studying phenomena that we do not yet understand very well. To a lesser degree, it exists in all experimental sciences, because we never have full control over our equipment or the environmental conditions. Nevertheless, in technically more mature disciplines studying simpler phenomena, e.g. physics or chemistry, one is more likely to blame human error for discrepancies between two measurements that are supposed to be identical. Replication of someone else's published results is therefore attempted only for spectacularly surprising findings (remember cold fusion?), but in-house replication is very common when testing new scientific equipment. At the other end of the spectrum, there is the zero-variation situation of computational science, where study design uniquely determines the outcome, meaning that any difference showing up in a replication indicates a mistake, whose source can in principle be found and eliminated. Variation due to human intervention (e.g. in entering data) is considered a fault in the design, as a computational study should ideally not require any human intervention, and where it does, everything should be recorded.

Aug 22, 2014

Call for Papers on Research Transparency


The Berkeley Initiative for Transparency in the Social Sciences (BITSS) will be holding its 3rd annual conference at UC Berkeley on December 11-12, 2014. The goal of the meeting is to bring together leaders from academia, scholarly publishing, and policy who are committed to strengthening the standards of rigor across social science disciplines.

A select number of papers elaborating new tools and strategies to increase the transparency of research will be presented and discussed. Topics for papers include, but are not limited to:

  • Pre-registration and the use of pre-analysis plans;
  • Disclosure and transparent reporting;
  • Replicability and reproducibility;
  • Data sharing;
  • Methods for detecting and reducing publication bias or data mining.

Papers or long abstracts must be submitted by Friday, October 10th (midnight Pacific time) through CEGA’s Submission Platform. Travel funds will be provided for presenters. Submissions can be completed papers or works in progress.

The 2014 BITSS Conference is sponsored by the Alfred P. Sloan Foundation and the Laura and John Arnold Foundation.

Aug 7, 2014

What we talk about when we talk about replication


If I said, “Researcher A replicated researcher B’s work”, what would you take me to mean?

There are many possible interpretations. I could mean that A had repeated precisely the methods of researcher B, and obtained similar results. Or I could be saying that A had repeated precisely the methods of researcher B, and obtained very different results. I could be saying that A had repeated only those methods which were theorized to influence the results. I could mean that A had devised new methods which were meant to explore the same phenomenon. Or I could mean that researcher A had copied everything down to the last detail.

We do have terms for these different interpretations. A replication of precise methods is a direct replication, while a replication which uses new methods but gets at the same phenomenon is a conceptual replication. Once a replication has been completed, you can look at the results and call it a “successful replication” if the results are the same, and a “failed replication” if the results are different.

Unfortunately, these terms are not always used, and the result is that recent debates over replication have become not only heated, but confused.

Take, for instance, Nobel laureate Daniel Kahneman’s open letter to the scientific community, A New Etiquette for Replication. He writes:

“Even rumors of a failed replication cause immediate reputational damage by raising a suspicion of negligence (if not worse). The hypothesis that the failure is due to a flawed replication comes less readily to mind – except for authors and their supporters, who often feel wronged.”

Here he uses the common phrasing, “failed replication”, to indicate a replication where different results were obtained. The cause of those different results is unknown, and he suggests that one option is that the methods used in the direct replication were not correct, which he calls a “flawed replication”. What, then, is the term for a replication where the methods are known to be correct but different results were still found?

Further on in his letter, Kahneman adds:

“In the myth of perfect science, the method section of a research report always includes enough detail to permit a direct replication. Unfortunately, this seemingly reasonable demand is rarely satisfied in psychology, because behavior is easily affected by seemingly irrelevant factors.”

We take “direct replication” to mean copying the original researcher’s methods. As Kahneman points out, perfect copying is impossible. When a factor that once seemed irrelevant may have influenced the results, is that a “flawed replication”, or simply no longer a “direct replication”? How can we distinguish between replications which copy as much of the methods as possible, and those which copy only those elements of the methods which the original author hypothesizes should influence the result?

This terminology is not only imprecise, it differs from what others use. In their Registered Reports: A Method to Increase the Credibility of Published Results, Brian Nosek and Daniel Lakens write:

“There is no such thing as an exact replication. Any replication will differ in innumerable ways from the original. A direct replication is the attempt to duplicate the conditions and procedure that existing theory and evidence anticipate as necessary for obtaining the effect (Open Science Collaboration, 2012, 2013; Schmidt, 2009). Successful replication bolsters evidence that all of the sample, setting, and procedural differences presumed to be irrelevant are, in fact, irrelevant.”

This statement contains an admirably clear definition of “direct replication”, which the authors use here to mean a replication copying only those elements of the methods considered relevant. This is distinct from Kahneman’s usage of the term “direct replication”. Kahneman, instead, may be conflating “direct replication” with “literal replication”, a much less common term meaning “the precise duplication of the specific design and results of a previous study” (Heiman, 2002).

Nosek and Lakens also use the term “successful replication” in a way which implies that not only were the results replicated, the methods were as well, as they take the replication’s success to be a commentary on the methods. However, even “successful replications” may not successfully replicate methods, as pointed out by Simone Schnall in her critique of the special issue edited by Nosek and Lakens:

“Various errors in several of the replications (e.g., in the ‘Many Labs’ paper) became only apparent once original authors were allowed to give feedback. Errors were uncovered even for successfully replicated findings.”

Whether or not there were methodological errors in these particular cases, such errors remain possible even when results are replicated, a possibility that is elided by the terminology of “successful replication”. This is not merely a point of semantics, as “successful replications” may be checked less carefully for methodological errors than “failed replications”.

There are many other examples of researchers using replication terminology in ways that are not maximally clear. So far I have only quoted from social psychologists. When we attempt to speak across disciplines we face even greater potential for confusion.

As such, I propose:

1) That we resurrect the term “literal replication”, meaning “the precise duplication of the specific design of a previous study” rather than overload the term “direct replication”. Direct replication can then mean only the duplication of those methods deemed to be relevant. Of course, a perfect literal replication is impossible, but using this terminology implies that duplication of as much of the previous study as possible is the goal.

2) That we retire the phrases “failed replication” and “successful replication”, which do not distinguish between procedure and results. In their place, we can use “replication with different results” and “flawed replication” for the former, and “replication with similar results” and “sound replication” for the latter.

Thus, a replication attempt where the goal was to precisely duplicate materials and where this was successfully done, but different results were found, would be a sound literal replication with different results. An attempt only to duplicate elements of the design hypothesized to be relevant, leading to some methodological questions, yet where similar results were found, would be a flawed direct replication with similar results.

These terms may seem unnecessarily wordy, and indeed may not always be needed, but I encourage everyone to use them when precision is important, for instance in published articles or in debates with those who disagree with you. I know that from now on, when I hear someone use the bare term “replication”, I will ask, “What kind?”

Thanks to JP de Ruiter, Etienne LeBel, and Sheila Miguez for their feedback on this post.

Jul 30, 2014

Open-source software for science


A little more than three years ago I started working on OpenSesame, a free program for the easy development of experiments, mostly oriented at psychologists and neuroscientists. The first version of OpenSesame was the result of a weekend-long hacking sprint. By now, OpenSesame has grown into a substantial project, with a small team of core developers, dozens of occasional contributors, and about 2500 active users.

Because of my work on OpenSesame, I've become increasingly interested in open-source software in general. How is it used? Who makes it? Who is crazy enough to invest time in developing a program, only to give it away for free? Well ... quite a few people, because open source is everywhere. Browsers like Firefox and Chrome. Operating systems like Ubuntu and Android. Programming languages like Python and R. Media players like VLC. These are all examples of open-source programs that many people use on a daily basis.

But what about specialized scientific software? More specifically: Which programs do experimental psychologists and neuroscientists use? Although this varies from person to person, a number of expensive, closed-source programs come to mind first: E-Prime, SPSS, MATLAB, Presentation, Brainvoyager, etc. The average psychonomist is not really into open source.

In principle, there are open-source alternatives to all of the above programs. Think of PsychoPy, R, Python, or FSL. But I can imagine the frown on the reader's face: Come on, really? These freebies are not nearly as good as 'the real thing', are they? But this, although true to some extent, merely raises another question: Why doesn't the scientific community invest more effort in the development of open-source alternatives? Why do we keep accepting inconvenient licenses (no SPSS license at home?), high costs ($995 for E-Prime 2 professional), and scripts written in proprietary languages that cannot easily be shared between labs? This last point has become particularly relevant with the recent focus on replication and transparency. How do you perform a direct replication of an experiment if you do not have the required software? And what does transparency even mean if we cannot run each other's scripts?

Despite widespread skepticism, I suspect that most scientists feel that open source is ideologically preferable over proprietary scientific software. But open source suffers from an image problem. For example, a widely shared misconception is that open-source software is buggy, whereas proprietary software is solid and reliable. But even though quality is subjective--and due to cognitive dissonance strongly biased in favor of expensive software!--this belief is not consistent with reality: Reports have shown that open-source software contains about half as many errors per line of code as proprietary software.

Another misconception is that developing (in-house) open-source software is expensive and inefficient. This is essentially a prisoner's dilemma. Of course, for an individual organization it is often more expensive to develop software than to purchase a commercial license. But what if scientific organizations were to work together to develop the software that they all need: You write this for me, I write that for you? Would open source still be inefficient then?

Let's consider this by first comparing a few commercial packages: E-Prime, Presentation, and Inquisit. These are all programs for developing experiments. Yet the wheel has been re-invented for each program. All overlapping functionality has been re-designed and re-implemented anew, because vendors of proprietary software dislike few things as much as sharing code and ideas. (This is made painfully clear by numerous patent wars.) Now, let's compare a few open-source programs: Expyriment, OpenSesame, and PsychoPy. These too are all programs for developing experiments. And these too have overlapping functionality. But you can use these programs together. Moreover, they build on each other's functionality, because open-source licenses allow developers to modify and re-use each other's code. The point that I'm trying to make is not that open-source programs are better than their proprietary counterparts. Everyone can decide that for him or herself. The crucial point is that the development process of open-source software is collaborative and therefore efficient. Certainly in theory, but often in practice as well.

So it is clear that open-source software has many advantages, also--maybe even especially so--for science. Therefore, development of open-source software should be encouraged. How could universities and other academic organizations contribute to this?

A necessary first step is to acknowledge that software needs time to mature. There are plenty of young researchers, technically skilled and brimming with enthusiasm, who start a software project. Typically, this is software that they developed for their own research, and subsequently made freely available. If you are lucky, your boss allows this type of frivolous fun, as long as the 'real' work doesn't suffer. And maybe you can even get a paper out of it, for example in Behavior Research Methods, Journal of Neuroscience Methods, or Frontiers in Neuroinformatics. But it is often forgotten that software needs to be maintained. Bugs need to be fixed. Changes in computers and operating systems require software updates. Unmaintained software spoils like an open carton of milk.

And this is where things get awkward, because universities don't like maintenance. Developing new software is one thing. That's innovation, and somewhat resembles doing research. But maintaining software after the initial development stage is over is not interesting at all. You cannot write papers about maintenance, and maintenance does not make an attractive grant proposal. Therefore, a lot of software ends up as 'abandonware': unmaintained ghost pages on development sites like GitHub, SourceForge, or Google Code.

Ideally, universities would encourage maintenance of open-source scientific software. The message should be: Once you start something, go through with it. They should recognize that the development of high-quality software requires stamina. This would be an attitude change, and would require that universities get over their publication fetish. Because the value of a program is not in the papers that have been written about it, but in the scientists that use it. Open-source scientific software has a very concrete and self-evident impact for which developers should be rewarded. Without incentives, they won't make the high-quality software that we all need!

In other words, developers could use a bit of encouragement and support, and this is currently lacking. I recently attended the APS convention, where I met Jeffrey Spies, one of the founders of the Center for Open Science (COS). As readers of this blog probably know, the COS is an American organization that (among many other things) facilitates development of open-source scientific software. They provide advice, support promising projects, and build networks. (Social, digital, and a mix of both, like this blog!) A related organization that focuses more specifically on software development is the Mozilla Science Lab (MSL). I think that the COS and MSL do great work, and provide models that could be adopted by other organizations. For example, I currently work for the CNRS, the French organization for fundamental research. The CNRS is very large, and could easily provide sustained support for the development of high-quality open-source projects. And the European Research Council could easily do so as well. However, these large research organizations do not appear to recognize the importance of software development. They prefer to invest all of their budget in individual research projects, rather than invest a small part of it in the development and maintenance of the software that these research projects need.

In summary, a little systematic support would do wonders for the quality and availability of open-source scientific software. Investing in the future, is that not what science is about?

A Dutch version of this article initially appeared in De Psychonoom, the magazine of the Dutch psychonomic society. This article has been translated and updated for the OSC blog.

Jul 16, 2014

Digging a little deeper - Understanding workflows of archaeologists


Scientific domains vary by the tools and instruments used, the way data are collected and managed, and even how results are analyzed and presented. As advocates of open science practices, it’s important that we understand the common obstacles to scientific workflow across many domains. The COS team visits scientists in their labs and out in the field to discuss and experience their research processes first-hand. We experience the day-to-day of researchers and do our own investigating. We find where data loss occurs, where there are inefficiencies in workflow, and what interferes with reproducibility. These field trips inspire new tools and features for the Open Science Framework to support openness and reproducibility across scientific domains.

Last week, the team visited the Monticello Department of Archaeology to dig a little deeper (bad pun) into the workflow of archaeologists, as well as learn about the Digital Archaeological Archive of Comparative Slavery (DAACS). Derek Wheeler, Research Archaeologist at Monticello, gave us a nice overview of how the Archaeology Department surveys land for artifacts. Shovel test pits, approximately 1 foot square, are dug every 40 feet on center as deep as anyone has dug in the past (i.e., down to undisturbed clay). If artifacts are found, the shovel test pits are dug every 20 feet on center. At Monticello, artifacts are primarily man-made items like nails, bricks or pottery. The first 300 acres surveyed contained 12,000 shovel test pits -- and that’s just 10% of the total planned survey area. That’s a whole lot of holes, and even more data.

[Photo] Fraser Neiman, Director of Archaeology at Monticello, describes the work being done to excavate on Mulberry Row - the industrial hub of Jefferson's agricultural industry.

At the Mulberry Row excavation site, Fraser Neiman, Director of Archaeology, explained the meticulous and painstaking process of excavating quadrats, small plots of land isolated for study. Within a quadrat, there exist contexts - stratigraphic units. Any artifacts found within a context are carefully recorded on a context sheet - what the artifact is, its location within the quadrat, along with information about the fill (dirt, clay, etc.) in the context. The fill itself is screened to pull out smaller artifacts the eye may not catch. All of the excavation and data collection at the Mulberry Row Reassessment is conducted following the standards of the Digital Archaeological Archive of Comparative Slavery (DAACS). Standards developed by DAACS help archaeologists in the Chesapeake region to generate, report, and compare data from 20 different sites across the region in a systematic way. Without these standards, archiving and comparing artifacts from different sites would be extremely difficult.

[Photo] Researchers make careful measurements at the Monticello Mulberry Row excavation site, while recording data on a context sheet.

The artifacts, often sherds, are collected by context and taken to the lab for washing, labeling, analysis and storage. After washing, every sherd within a particular context is labeled with the same number and stored together. All of the data from the context sheets, as well as photos of the quadrats and sherds, are carefully input into DAACS following the standards set out in the DAACS Cataloging Manual. There is an enormous amount of manual labor associated with preparing and curating each artifact. Jillian Galle, Project Manager of DAACS, described the extensive training users must undergo in order to deposit their data in the archive to ensure the standards outlined by the Cataloging Manual are kept. This regimented process ensures the quality and consistency of the data -- and thus its utility. The result is a publicly available dataset of the history of Monticello for researchers of all kinds to examine this important site in America's history.

[Photo] These sherds have been washed and numbered to denote their context.

Our trip to Monticello Archaeology was eye-opening, as none of us had any practical experience with archaeological research or data. The impressive DAACS protocols and standards represent an important aspect of all scientific research - the ability to accurately capture large amounts of data in a systematic, thoughtful way - and then share it freely with others.

Jul 10, 2014

What Jason Mitchell's 'On the emptiness of failed replications' gets right


Jason Mitchell's essay 'On the emptiness of failed replications' is notable for being against the current effort to publish replication attempts. Commentary on the essay that I saw was pretty negative (e.g. "awe-inspiringly clueless", “defensive pseudo-scientific, anti-Bayesian academic ass-covering”, "Do you get points in social psychology for publicly declaring you have no idea how science works?").

Although I reject his premises, and disagree with his conclusion, I don't think Mitchell's arguments are incomprehensibly mad. This seems to put me in a minority, so I thought I'd try and explain the value in what he's saying. I'd like to walk through his essay assuming he is a thoughtful rational person. Why would a smart guy come to the views he has? What is he really trying to say, and what are his assumptions about the world of psychology that might, perhaps, illuminate our own assumptions?

Experiments as artefacts, not samples

First off, key to Mitchell's argument is a view that experiments are complex artefacts, in the construction of which errors are very likely. Effects, in this view, are hard won, eventually teased out via a difficult process of refinement and validation. The value of replication is self-evident to anyone who thinks statistically: sampling error and publication bias will produce lots of false positives, and you improve your estimate of the true effect with independent samples (= replications). Mitchell seems to be saying that experiments are so complex that replications by other labs aren't independent samples of the same effect. Although they are called replications, they are, he claims, most likely botched, and so informative of nothing more than the incompetence of the replicators.

When teaching our students many of us will have deployed the saying "The plural of anecdote is not data". What we mean by this is that many weak observations - of ghosts, aliens or psychic powers - do not combine multiplicatively to make strong evidence in favour of these phenomena. If I've read him right, Mitchell is saying the same thing about replication experiments - many weak experiments are uninformative about real effects.

Tacit practical knowledge

Part of Mitchell's argument rests on the importance of tacit knowledge in running experiments (see his section "The problem with recipe-following"). We all know that tacit knowledge about experimental procedures exists in science. Mitchell puts a heavy weight on the importance of this. This is a position which presumably would have lots of sympathy from Daniel Kahneman, who suggested that all replication attempts should involve the original authors.

There's a tension here between how science should be and how it is. Obviously our job is to make things explicit, to explain how to successfully run experiments so that anyone can run them but the truth is, full explanations aren't always possible. Sure, anyone can try and replicate based on a methods section, but - says Mitchell - you will probably be wasting your time generating noise rather than data, and shouldn't be allowed to commit this to the scientific record.

Most of us would be comfortable with the idea that if a non-psychologist ran our experiments they might make some serious errors (one thinks of the hash some physical scientists made of psi-experiments, failing completely to account for things like demand effects, for example). Mitchell's line of thought here seems to take this one step further: you can't run a social psychologist's experiments without special training in social psychology. Or even, maybe, you can't successfully run another lab's experiment without training from that lab.

I happen to think he's wrong on this, and that he neglects to mention the harm of assuming that successful experiments have a "special sauce" which cannot be easily communicated (it seems to me a road to elitism and mysticism, completely contrary to the goals science should have). Nonetheless, there's definitely some truth to the idea, and I think it is useful to consider the errors we will make if we assume the contrary: that methods sections are complete records and that no special background is required to run experiments.


Mitchell makes the claim that targeting an effect for replication amounts to the innuendo that the effects under inspection are unreliable, which is a slur on the scientists who originally published them. Isn't this correct? Several people on twitter admitted, or tacitly admitted, that their prior beliefs were that many of these effects aren't real. There is something disingenuous about claiming, on the one hand, that all effects should be replicated, but, on the other, targeting particular effects for attention. If you bought Mitchell's view that experiments are delicate artefacts which render most replications uninformative, you can see how the result is a situation which isn't just uninformative but actively harmful to the hard-working psychologists whose work is impugned. Even if you don't buy that view, you might think that selection of which effects should be the focus of something like the Many Labs project is an active decision made by a small number of people, and one which targets particular individuals. How this process works out in practice deserves careful consideration, even if everyone agrees that it is a Good Thing overall.


There are a number of issues in Mitchell's essay I haven't touched on - this isn't meant to be a complete treatment, just an explanation of some of the reasonable arguments I think he makes. Even if I disagree with them, I think they are reasonable; they aren't as obviously wrong as some have suggested and should be countered rather than dismissed.

Stepping back, my take on the 'replication crisis' in psychology is that it really isn't about replication. Instead, this is what digital disruption looks like in a culture organised around scholarly kudos rather than profit. We now have the software tools to coordinate data collection, share methods and data, analyse data, and interact with non-psychologists, both directly and via the media, in unprecedented ways and at an unprecedented rate. Established scholarly communities are threatened as "the way things are done" is challenged. Witness John Bargh's incredulous reaction to having his work challenged (and note that this was 'a replicate and explain via alternate mechanism' type study that Mitchell says is a valid way of doing replication). Witness the recent complaint of medical researcher Jonathan S. Nguyen-Van-Tam when a journalist included critique of his analysis technique in a report on his work. These guys obviously believe in a set of rules concerning academic publishing which many of us aren't fully aware of or believe no longer apply.

By looking at other disrupted industries, such as music or publishing, we can discern morals for both sides. Those who can see the value in the old way of doing things, like Mitchell, need to articulate that value and fast. There's no way of going back, but we need to salvage the good things about tight-knit, slow moving, scholarly communities. The moral for the progressives is that we shouldn't let the romance of change blind us to the way that the same old evils will reassert themselves in new forms, by hiding behind a facade of being new, improved and more equitable.

Jul 9, 2014

Response to Jason Mitchell’s “On the Emptiness of Failed Replications”


Jason Mitchell recently wrote an article entitled “On the Emptiness of Failed Replications.” In this article, Dr. Mitchell takes an unconventional and extremely strong stance against replication, arguing that: “… studies that produce null results -- including preregistered studies -- should not be published.” The crux of the argument seems to be that scientists who get p > .05 are just incompetent. This argument completely ignores the possibility that a positive result could also (maybe even equally) be due to experimenter error. Dr. Mitchell also appears to ignore the possibility of simply getting a false positive (which is expected to happen under the null in 5% of cases).

More importantly, it ignores issues of effect size and treats the outcome of research as a dichotomous “success or fail.” The advantages of examining effect sizes over simple directional hypotheses using null hypothesis significance testing are beyond the scope of this short post, but you might check out Sullivan and Feinn (2012) as an open-access starting point. Generally speaking, the problem is that sampling variation means that some experiments will find null results even when the experimenter does everything right. As an illustration, below are 1000 simulated correlations, assuming that r = .30 in the population and a sample size of 100 (I used a Monte Carlo method).

In this picture, the units of analysis are individual correlations obtained in 1 of 1000 hypothetical research studies. The x-axis is the value of the correlation coefficient found, and the y-axis is the number of studies reporting that value. The red line is the critical value for significant results at p < .05 assuming a sample size of 100. As you can see from this picture, the majority of studies are supportive of an effect that is greater than zero. However (simply due to chance) all the studies to the left of the red line turned out non-significant. If we suppressed all the null results (i.e., all those unlucky scientists to the left of the red line) as Dr. Mitchell suggests, then our estimate of the effect size in the population would be inaccurate; specifically, it would appear to be larger than it really is, because certain aspects of random variation (i.e., null results) are being selectively suppressed. Without the minority of null findings (in addition to the majority of positive findings) the overall estimate of the effect cannot be correctly estimated.
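The simulation described above can be sketched in a few lines. This is my own minimal Python illustration, not the code behind the figure (which was presumably R); the critical value of roughly r = .197 for p < .05 at n = 100 and the random seed are my assumptions:

```python
import math
import random

# Minimal sketch: 1000 studies, each sampling n = 100 pairs from a
# bivariate normal population with a true correlation of rho = .30.

def simulate_correlations(rho=0.30, n=100, studies=1000, seed=1):
    rng = random.Random(seed)
    rs = []
    for _ in range(studies):
        xs, ys = [], []
        for _ in range(n):
            x = rng.gauss(0, 1)
            # construct y so that corr(x, y) = rho in the population
            y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
            xs.append(x)
            ys.append(y)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx = sum((a - mx) ** 2 for a in xs)
        syy = sum((b - my) ** 2 for b in ys)
        rs.append(sxy / math.sqrt(sxx * syy))
    return rs

rs = simulate_correlations()
crit = 0.197  # approximate critical r for p < .05 at n = 100 (assumed)
sig = [r for r in rs if r > crit]
print(sum(rs) / len(rs))    # mean of all 1000 studies: near .30
print(sum(sig) / len(sig))  # mean of significant-only studies: inflated
```

Averaging only the studies to the right of the critical value gives a larger estimate than averaging all of them, which is exactly the bias introduced by suppressing null results.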

The situation is even more grim if there really is no effect in the population.

In this case, a small proportion of studies will produce false positives, with a roughly equal chance of an effect in either direction. If we fail to report null results, false positives may be reified as substantive effects. The reversal of signs across repeated studies might be a warning sign that the effect doesn’t really exist, but without replication, a single false positive could define a field if it happens (by chance) to be in line with prior theory.
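The null-effect case can be sketched the same way. Again this is a hypothetical Python illustration: rho = 0, n = 100, 1000 studies, a two-tailed critical value of about .197, and the seed are all my assumptions:

```python
import math
import random

# With no true effect, x and y are sampled independently; roughly 5% of
# studies still cross the two-tailed critical value, and those false
# positives split between positive and negative correlations.

def sample_r_under_null(n, rng):
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)
rs = [sample_r_under_null(100, rng) for _ in range(1000)]
false_pos = [r for r in rs if abs(r) > 0.197]  # assumed critical r, n = 100
print(len(false_pos))                      # roughly 5% of 1000 studies
print(sum(1 for r in false_pos if r > 0),
      sum(1 for r in false_pos if r < 0))  # both signs occur
```

If only the significant studies were published, either sign of false positive could be reified as a substantive effect; seeing the sign flip across repeated studies is the warning described above.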

With this in mind, I also disagree that replications are “publicly impugning the scientific integrity of their colleagues.” Some people feel threatened or attacked by replication. The ideas we produce as scientists are close to our hearts, and we tend to get defensive when they're challenged. If we focus on effect sizes, rather than the “success or fail” logic of null hypothesis significance testing, then I don't believe that “failed” replications damage the integrity of the original author; rather, they simply suggest that we should adjust the estimate of the effect size downwards. In this framework, replication is less about “proving someone wrong” and more about centering on the magnitude of an effect size.

Something that is often missed in discussion of replication is that the very nature of randomness inherent in the statistical procedures scientists use means that any individual study (even if perfectly conducted) will probably generate an effect size that is a bit larger or smaller than it is in the population. It is only through repeated experiments that we are able to center on an accurate estimate of the effect size. This issue is independent of researcher competence, and means that even the most competent researchers will come to the wrong conclusions occasionally because of the statistical procedures and cutoffs we’ve chosen to rely on. With this in mind, people should be aware that a failed replication does not necessarily mean that one of the two researchers is incorrect or incompetent – instead, it is assumed (until further evidence is collected) that the best estimate is a weighted average of the effect size from each research study.
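As a concrete illustration of such a weighted average (the post does not give a formula; the sample-size-weighted Fisher-z combination below is one standard fixed-effect choice, and the two studies are hypothetical):

```python
import math

# Combine correlations from several studies into a single estimate by
# weighting each Fisher-z transformed r by n - 3 (its inverse variance).

def combined_r(results):
    """results: list of (r, n) pairs; returns the weighted-average r."""
    num = sum((n - 3) * math.atanh(r) for r, n in results)
    den = sum(n - 3 for r, n in results)
    return math.tanh(num / den)

# Hypothetical original study (r = .45, n = 50) and a larger replication
# with a smaller effect (r = .10, n = 200): the combined estimate lies
# between the two, pulled toward the better-powered study.
est = combined_r([(0.45, 50), (0.10, 200)])
print(round(est, 3))  # about 0.17
```

Neither researcher is declared "incorrect" here; the replication simply shifts the best estimate of the effect.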

For some more commentary from other bloggers, you might check out the following links:

"Are replication efforts pointless?" by Richard Tomsett

"Being as wrong as can be on the so-called replication crisis of science" by drugmonkey at Scientopia

"Are replication efforts useless?" by Jan Moren

"Jason Mitchell’s essay" by Chris Said

"#MethodsWeDontReport – brief thought on Jason Mitchell versus the replicators" by Micah Allen

"On 'On the emptiness of failed replications'" by Neuroskeptic

Jul 2, 2014

phack - An R Function for Examining the Effects of P-hacking


This article was originally posted in the author's personal blog.

Imagine you have a two group between-S study with N=30 in each group. You compute a two-sample t-test and the result is p = .09, not statistically significant with an effect size r = .17. Unbeknownst to you there is really no relationship between the IV and the DV. But, because you believe there is a relationship (you decided to run the study after all!), you think maybe adding five more subjects to each condition will help clarify things. So now you have N=35 in each group and you compute your t-test again. Now p = .04 with r = .21.

If you are reading this blog you might recognize what happened here as an instance of p-hacking. This particular form (testing periodically as you increase N) of p-hacking was one of the many data analytic flexibility issues exposed by Simmons, Nelson, and Simonsohn (2011). But what are the real consequences of p-hacking? How often will p-hacking turn a null result into a positive result? What is the impact of p-hacking on effect size?

These were the kinds of questions that I had. So I wrote a little R function that simulates this type of p-hacking. The function – called phack – is designed to be flexible, although right now it only works for two-group between-S designs. The user is allowed to input and manipulate the following factors (argument name in parentheses):

  • Initial Sample Size (initialN): The initial sample size (for each group) one had in mind when beginning the study (default = 30).
  • Hack Rate (hackrate): The number of subjects to add to each group if the p-value is not statistically significant before testing again (default = 5).
  • Population Means (grp1M, grp2M): The population means (Mu) for each group (default 0 for both).
  • Population SDs (grp1SD, grp2SD): The population standard deviations (Sigmas) for each group (default = 1 for both).
  • Maximum Sample Size (maxN): You weren’t really going to run the study forever, right? This is the sample size (for each group) at which you will give up the endeavor and go run another study (default = 200).
  • Type I Error Rate (alpha): The value (or lower) at which you will declare a result statistically significant (default = .05).
  • Hypothesis Direction (alternative): Did your study have a directional hypothesis? Two-group studies often do (i.e., this group will have a higher mean than that group). You can choose from “greater” (Group 1 mean is higher), “less” (Group 2 mean is higher), or “two.sided” (any difference at all will work for me, thank you very much!). The default is “greater.”
  • Display p-curve graph (graph)?: The function will output a figure displaying the p-curve for the results based on the initial study and the results for just those studies that (eventually) reached statistical significance (default = TRUE). More on this below.
  • Number of Simulations (sims): The number of times you want to simulate your p-hacking experiment.

To make this concrete, consider the following R code:

res <- phack(initialN=30, hackrate=5, grp1M=0, grp2M=0, grp1SD=1, 
  grp2SD=1, maxN=200, alpha=.05, alternative="greater", graph=TRUE, sims=1000)

This says you have planned a two-group study with N=30 (initialN=30) in each group. You are going to compute your t-test on that initial sample. If that is not statistically significant you are going to add 5 more (hackrate=5) to each group and repeat that process until it is statistically significant or you reach 200 subjects in each group (maxN=200). You have set the population Ms to both be 0 (grp1M=0; grp2M=0) with SDs of 1 (grp1SD=1; grp2SD=1). You have set your nominal alpha level to .05 (alpha=.05), specified a direction hypothesis where group 1 should be higher than group 2 (alternative=“greater”), and asked for graphical output (graph=TRUE). Finally, you have requested to run this simulation 1000 times (sims=1000).

So what happens if we run this experiment? So that we get the same results, I have set the random seed in the code below.

source("http://rynesherman.com/phack.r") # read in the p-hack function
res <- phack(initialN=30, hackrate=5, grp1M=0, grp2M=0, grp1SD=1, grp2SD=1,
   maxN=200, alpha=.05, alternative="greater", graph=TRUE, sims=1000)

The following output appears in R:

Proportion of Original Samples Statistically Significant = 0.054
Proportion of Samples Statistically Significant After Hacking = 0.196
Probability of Stopping Before Reaching Significance = 0.805
Average Number of Hacks Before Significant/Stopping = 28.871
Average N Added Before Significant/Stopping = 144.355
Average Total N 174.355
Estimated r without hacking 0
Estimated r with hacking 0.03
Estimated r with hacking 0.19 (non-significant results not included)

The first line tells us how many (out of the 1000 simulations) of the originally planned (N=30 in each group) studies had a p-value that was .05 or less. Because there was no true effect (grp1M = grp2M), this is just about at the nominal rate of .05. But what if we had used our p-hacking scheme (testing every 5 subjects per condition until significant or N=200)? That result is in the next line. It shows that just about 20% of the time we would have gotten a statistically significant result. So this type of hacking has inflated our Type I error rate from 5% to 20%. How often would we have given up (i.e., N=200) before reaching statistical significance? That is about 80% of the time. We also averaged 28.87 “hacks” before reaching significance/stopping, averaged having to add N=144 (per condition) before significance/stopping, and had an average total N of 174 (per condition) before significance/stopping.

What about effect sizes? Naturally, the estimated effect size (r) was .00 if we just used our original N=30-in-each-group design. If we include the results of all 1000 completed simulations, that effect size averages out to be r = .03. Most importantly, if we exclude those studies that never reached statistical significance, our average effect size is r = .19.

This is pretty telling. But there is more. We also get this nice picture:


It shows the distribution of the p-values below .05 for the initial study (upper panel) and for the final study in those simulations that reached statistical significance (lower panel). The p-curves (see Simonsohn, Nelson, & Simmons, 2013) are also overlaid. If there is really no effect, we should see a flat p-curve (as we do in the upper panel). And if there is no effect and p-hacking has occurred, we should see a p-curve that slopes up towards the critical value (as we do in the lower panel).

Finally, the function provides us with more detailed output than is summarized above. We can get a glimpse of the first few simulations by running the following code:

head(res)

This generates the following output:

Initial.p  Hackcount     Final.p  NAdded    Initial.r       Final.r
0.86410908         34  0.45176972     170  -0.14422580   0.006078565
0.28870264         34  0.56397332     170   0.07339944  -0.008077691
0.69915219         27  0.04164525     135  -0.06878039   0.095492249
0.84974744         34  0.30702946     170  -0.13594941   0.025289555
0.28048754         34  0.87849707     170   0.07656582  -0.058508736
0.07712726         34  0.58909693     170   0.18669338  -0.011296131

The object res contains the key results from each simulation including the p-value for the initial study (Initial.p), the number of times we had to hack (Hackcount), the p-value for the last study run (Final.p), the total N added to each condition (NAdded), the effect size r for the initial study (Initial.r), and the effect size r for the last study run (Final.r).
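Since res holds these per-simulation results as columns, the three "Estimated r" lines in the summary can be reproduced from it directly. A small helper, sketched here under the assumption that res is a data frame with the column names shown above (summarize_phack is my own name, not part of phack.r):

```r
# Summarize a phack() result, assuming res is a data frame with the
# columns shown above (Initial.r, Final.r, Final.p).
summarize_phack <- function(res, alpha = .05) {
  c(r_no_hacking   = mean(res$Initial.r),                      # original N=30 studies
    r_with_hacking = mean(res$Final.r),                        # all completed simulations
    r_significant  = mean(res$Final.r[res$Final.p <= alpha]))  # significant runs only
}
```

Calling summarize_phack(res) should then recover the summary values (0, .03, and .19 above) up to rounding.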

So what can we do with this? I see lots of possibilities and quite frankly I don’t have the time or energy to do them. Here are some quick ideas:

  • What would happen if there were a true effect?
  • What would happen if there were a true (but small) effect?
  • What would happen if we checked for significance after each subject (hackrate=1)?
  • What would happen if the maxN were lower?
  • What would happen if the initial sample size was larger/smaller?
  • What happens if we set the alpha = .10?
  • What happens if we try various combinations of these things?

I’ll admit I have tried out a few of these ideas myself, but I haven’t really done anything systematic. I just thought other people might find this function interesting and fun to play with.
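Any of these variations amounts to changing one or two arguments in the call shown earlier. For instance, the first and third questions could be explored like this (the true effect of grp1M=0.5 is an arbitrary choice for illustration):

```r
source("http://rynesherman.com/phack.r") # read in the p-hack function, as above

# What if there were a true effect? Give group 1 a higher population mean.
res_true <- phack(initialN=30, hackrate=5, grp1M=0.5, grp2M=0, grp1SD=1, grp2SD=1,
                  maxN=200, alpha=.05, alternative="greater", graph=TRUE, sims=1000)

# What if we checked for significance after every single added subject?
res_peek <- phack(initialN=30, hackrate=1, grp1M=0, grp2M=0, grp1SD=1, grp2SD=1,
                  maxN=200, alpha=.05, alternative="greater", graph=TRUE, sims=1000)
```

(These calls require an internet connection to fetch phack.r, just like the original example.)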

1 By the way, all of these arguments are set to their default values, so you can do the same thing by simply running:

res <- phack()

Jun 25, 2014

Bem is Back: A Skeptic's Review of a Meta-Analysis on Psi


James Randi, magician and scientific skeptic, has compared those who believe in the paranormal to “unsinkable rubber ducks”: after a particular claim has been thoroughly debunked, the ducks submerge, only to resurface again a little later to put forward similar claims.

In light of this analogy, it comes as no surprise that Bem and colleagues have produced a new paper claiming that people can look into the future. The paper is titled “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events” and it is authored by Bem, Tressoldi, Rabeyron, and Duggan.

Several of my colleagues have browsed Bem's meta-analysis and have asked for my opinion. Surely, they say, the statistical evidence is overwhelming, regardless of whether you compute a p-value or a Bayes factor. Have you not changed your opinion? This is a legitimate question, one which I will try to answer below by showing you my review of an earlier version of the Bem et al. manuscript.

I agree with the proponents of precognition on one crucial point: their work is important and should not be ignored. In my opinion, the work on precognition shows in dramatic fashion that our current methods for quantifying empirical knowledge are insufficiently strict. If Bem and colleagues can use a meta-analysis to demonstrate the presence of precognition, what should we conclude from a meta-analysis on other, more plausible phenomena?

Disclaimer: the authors have revised their manuscript since I reviewed it, and they are likely to revise their manuscript again in the future. However, my main worries call into question the validity of the enterprise as a whole.

To keep this blog post self-contained, I have added annotations in italics to provide context for those who have not read the Bem et al. manuscript in detail.

My review of Bem, Tressoldi, Rabeyron, and Duggan


Jun 18, 2014

Open Science Initiatives promote Diversity, Social Justice, and Sustainability


As I follow the recent social media ruckus centered on replication science, questioning motives and methods, it becomes clear that the open science discussion needs to consider the point made by the title of this post, perhaps repeatedly. For readers who weren't following this, this blog by a political scientist and another post from the SPSP Blog might be of interest. I invite you to join me in evaluating this argument as the discussion progresses. I contend that "Open Science Initiatives promote Diversity, Social Justice, and Sustainability." Replication science and registered reports are two Open Science Initiatives and by extension should also promote these ideals. If this is not true, I will abandon this revolution and go back to the status quo. However, I am confident that, when considering all the evidence, you will agree with me that these idealistic principles benefit from openness generally and from open science specifically.

Before suggesting specific mechanisms by which this occurs, I will briefly note that the definitions of Open Science, Diversity, Social Justice, and Sustainability listed on Wikipedia are sufficient for this discussion, since Wikipedia itself is an Open Science initiative. Also, I would like to convey the challenge of advancing each of these simultaneously. My own institution, Pacific Lutheran University (PLU), in our recent long-range plan, PLU2020, highlighted the importance of uplifting each of these at our own institution, as introduced on page 11: "As we discern our commitments for the future, we reaffirm the ongoing commitments to diversity, sustainability, and justice that already shape our contemporary identity, and we resolve to integrate these values ever more intentionally into our mission and institution." This is easier said than done because the goals of these ideals sometimes conflict. For instance, the environmental costs of feeding billions of people and heating their homes are enormous. Sometimes valuing diversity (such as scholarships targeted for people of color) seems unjust because resources are being assigned unevenly. These tensions can be described with many examples across numerous goals in all three dimensions and highlight the need to make balanced decisions.

PLU has not yet resolved this challenge of uplifting all three simultaneously, but I hope that we succeed as we continue the vigorous discussion. Why each is important can be seen in the Venn diagram on the sustainability Wikipedia page, which shows sustainable development as the intersection of three pillars, social (people), economic, and environmental, because even sustainability itself represents competing interests. Diversity and Social Justice are both core aspects of the social dimension, where uplifting diversity highlights the importance of distinct ideas and cultures and helps us understand why people and their varied ideas, in addition to oceans and forests, are important resources of our planet. The ideals of social justice aim to provide mechanisms such that all members of our diverse population receive and contribute their fair share of these resources. Because resources are limited and society complex and flawed, these ideals are often more aspirational than practical. However, the basic premise of uplifting all three is that we are better when valuing diversity, providing social justice, and sustainably using the planet's resources (people, animals, plants, and rocks). Below I provide examples of how OSIs promote each of these principles while illustrating why each is important to science.


Jun 11, 2014

Thoughts on this debate about social scientific rigor


This article was originally posted on Betsy Levy Paluck's website.

On his terrific blog, Professor Sanjay Srivastava points out that the current (vitriolic) debate about replication in psychology has been "salted with casually sexist language, and historically illiterate" arguments, on both sides. I agree, and thank him for pointing this out.

I'd like to add that I believe academics participating in this debate should be mindful of co-opting powerful terms like bullying and police (e.g., the "replication police") to describe the replication movement. Why? Bullying describes repeated abuse from a person of higher power and influence. Likewise, many people in the US and throughout the world have a well-grounded terror of police abuse. The terror and power inequality that these terms connote are diminished when we use them to describe the experience of academics replicating one another's studies. Let's keep evocative language in reserve so that we can use it to name and change the experience of truly powerless and oppressed people.

Back to replication. Here is the thing: we all believe in the principle of replication. As scientists and as psychologists, we are all here because we wish to contribute to cumulative research that makes progress on important psychological questions. This desire unites us.

So what's up?

It seems to me that some people oppose the current wave of replication efforts because they do not like the tenor of the recent public discussions. As I already mentioned, neither do I. I'm bewildered by the vitriol. Just a few days ago, one of the most prominent modern economists, currently an internationally bestselling author, had his book called into question over alleged data errors in a spreadsheet that he made public. His response was cordial and curious; his colleagues followed up with care, equanimity, and respect.

Are we really being taught a lesson in manners from economists? Is that happening?

As one of my favorite TV characters said recently ...

If we don't like the tenor of the discussion about replication, registration, etc., let's change it.

In this spirit, I offer a brief description of what we are doing in my lab to try to make our social science rigorous, transparent, and replicable. It's one model for your consideration, and we are open to suggestions.

For the past few years we have registered analysis plans for every new project we start. (They can be found here on the EGAP website; this is a group to which I belong. EGAP has had great discussions in partnership with BITSS about transparency.) My lab's analysis registrations are accompanied by a codebook describing each variable in the dataset.

I am happy to say that we are just starting to get better at producing replication code and data & file organization that is sharing-ready as we do the research, rather than trying to reconstruct these things from our messy code files and Dropbox disaster areas following publication (for this, I thank my brilliant students, who surpass me with their coding skills and help me to keep things organized and in place. See also this). What a privilege and a learning experience to have graduate students, right? Note that they are listening to us have this debate.

Margaret Tankard, Rebecca Littman, Graeme Blair, Sherry Wu, Joan Ricart-Huguet, Andreana Kenrick (awesome grad students), and Robin Gomila and David Mackenzie (awesome lab managers) have all been writing analysis registrations, organizing files, checking data codebooks, and writing replication code for the experiments we've done in the past three years, and colleagues Hana Shepherd, Peter Aronow, Debbie Prentice, and Eldar Shafir are doing the same with me. Thank goodness for all these amazing and dedicated collaborators, because one reason I understand replication to be so difficult is that it is a huge challenge to reconstruct what you thought and did over a long period of time, without careful record keeping (note: analysis registration also serves that purpose for us!).

Previously, I posted data at Yale's ISPS archive, and for other datasets made them available on request if I thought I was going to work more on them. But in the future we plan to post all published data plus each dataset's codebook. Economist and political scientist friends often post to their personal websites. Another possibility is posting in digital archives (like Yale's, but there are others; I follow @annthegreen for updates on digital archiving).

I owe so much of my appreciation for these practices to my advisor Donald Green. I've also learned a lot from Macartan Humphreys.

I'm interested in how we can be better. I'm listening to the constructive debates and to the suggestions out there. If anyone has questions about our current process, please leave a comment below! I'd be happy to answer questions, provide examples, and to take suggestions.

It costs nothing to do this, but it slows us down. Slowing down is not a bad thing for research (though I recognize that a bad heuristic of quantity = quality still dominates our discipline). During registration, we can stop to think: are we sure we want to predict this? With this kind of measurement? Should we go back to the drawing board about this particular secondary prediction? I know that if I personally slow down, I can oversee everything more carefully. I'm learning how to say no to new and shiny projects.

I want to end on the following note. I am now tenured. If good health continues, I'll be on hiring committees for years to come. In a hiring capacity, I will appreciate applicants who, though they do not have a ton of publications, can link their projects to an online analysis registration, or have posted data and replication code. Why? I will infer that they were slowing down to do very careful work, that they are doing their best to build a cumulative science. I will also appreciate candidates who have conducted studies that "failed to replicate" and who responded to those replication results with follow up work and with thoughtful engagement and curiosity (I have read about Eugene Caruso's response and thought that he is a great model of this kind of response).

I say this because it's true, and also because some academics report that their graduate students are very nervous about how replication of their lab's studies might ruin their reputations on the job market (see Question 13). I think the concern is understandable, so it's important for those of us in these lucky positions to speak out about what we value and to allay fears of punishment over non-replication (see Funder: SERIOUSLY NOT OK).

In sum, I am excited by efforts to improve the transparency and cumulative power of our social science. I'll try them myself and support newer academics who engage in these practices. Of course, we need to have good ideas as well as good research practices (ugh--this business is not easy. Tell that to your friends who think that you've chosen grad school as a shelter from the bad job market).

I encourage all of my colleagues, and especially colleagues from diverse positions in academia and from underrepresented groups in science, to comment on what they are doing in their own research and how they are affected by these ideas and practices. Feel free to post below, post on (real) blogs, write letters to the editor, have conversations in your lab and department, or tweet. I am listening. Thanks for reading.


A collection of comments I've been reading about the replication debate, in case you haven't been keeping up. Please do post more links below, since this isn't comprehensive.

I'm disappointed: a graduate student's perspective

Does the replication debate have a diversity problem?

Replications of Important Results in Social Psychology: Special Issue of Social Psychology

The perilous plight of the (non)-replicator

"Replication Bullying": Who replicates the replicators?

Rejoinder to Schnall (2014) in Social Psychology

Context and Correspondence for Special Issue of Social Psychology

Behavioral Priming: Time to Nut Up or Shut Up


Jun 5, 2014

Open Projects - Open Humans


This article is the second in a series highlighting open science projects around the community. You can read the interview this article was based on: edited for clarity, unedited.

While many researchers encounter no privacy-based barriers to releasing data, those working with human participants, such as doctors, psychologists, and geneticists, have a difficult problem to surmount. How do they reconcile their desire to share data, allowing their analyses and conclusions to be verified, with the need to protect participant privacy? It's a dilemma we've talked about before on the blog (see: Open Data and IRBs, Privacy and Open Data). A new project, Open Humans, seeks to resolve the issue by finding patients who are willing - even eager - to share their personal data.

Open Humans, which recently won a $500,000 grant from the Knight Foundation, grew out of the Personal Genome Project. Founded in 2005 by Harvard genetics professor George Church, the Personal Genome Project sought to solve a problem that many genetics researchers had yet to recognize. "At the time people didn't really see genomes as inherently identifiable," Madeleine Price Ball explains. Ball is co-founder of Open Humans, Senior Research Scientist at PersonalGenomes.org, and Director of Research at the Harvard Personal Genome Project. She quotes from 1000 Genomes' informed consent form: "'Because of these measures, it will be very hard for anyone who looks at any of the scientific databases to know which information came from you, or even that any information in the scientific databases came from you.'"

"So that's sort of the attitude scientists had towards genomes at the time. Also, the Genetic Information Nondiscrimination Act didn't exist yet. And there was GATTACA. Privacy was still this thing everyone thought they could have, and genomes were this thing people thought would be crazy to share in an identifiable manner. I think the scientific community had a bit of unconscious blindness, because they couldn't imagine an alternative."

Church found an initial ten participants; the list includes university professors, health care professionals, and Church himself. The IRB interviewed each of the participants to make sure they truly understood the project and, satisfied, allowed it to move forward. The Personal Genome Project now boasts over 3,400 participants, each of whom has passed an entrance exam showing that they understand what will happen to their data, and the risks involved. Most participants are enthusiastic about sharing. One participant described it as "donating my body to science, but I don't have to die first".

The Personal Genome Project's expansion hasn't been without growing pains. "We've started to try to collect data beyond genomes." Personal health information, including medical history, procedures, test results, prescriptions, has been provided by a subset of participants. "Every time one of these new studies was brought before the IRB they'd be like ‘what? that too?? I don't understand what are you doing???' It wasn't scaling, it was confusing, the PGP was trying to collect samples and sequence genomes and it was trying to let other groups collect samples and do other things."

Thus, Open Humans was born. "Open Humans is an abstraction that takes part of what the PGP was doing (the second part) and makes it scalable," Ball explains. "It's a cohort of participants that demonstrate an interest in public data sharing, and it's researchers that promise to return data to participants."

Open Humans will start out with a number of participants and an array of public data sets, thanks to collaborating projects American Gut, Flu Near You, and of course, the Harvard Personal Genome Project. Participants share data and, in return, researchers promise to share results. What precisely "sharing results" means has yet to be determined. "We're just starting out and know that figuring out how this will work is a learning process," Ball explains. But she's already seen what can happen when participants are brought into the research process - and brought together:

"One of the participants made an online forum, another a Facebook group, and another maintains a LinkedIn group… before this happened it hadn't occurred to me that abandoning the privacy-assurance model of research could empower participants in this manner. Think about the typical study - each participant is isolated, they never see each other. Meeting each other could breach confidentiality! Here they can talk to each other and gasp complain about you. That's pretty empowering." Ball and her colleague Jason Bobe, Open Humans co-founder and Executive Director of PersonalGenomes.org, hope to see all sorts of collaborations between participants and researchers. Participants could help researchers refine and test protocols, catch errors, and even provide their own analyses.

Despite these dreams, Ball is keeping the project grounded. When asked whether Open Humans will require articles published using their datasets to be made open access, she replies that, "stacking up a bunch of ethical mandates can sometimes do more harm than good if it limits adoption". Asked about the effect of participant withdrawals on datasets and reproducibility, she responds, "I don't want to overthink it and implement things to protect researchers at the expense of participant autonomy based on just speculation." (It is mostly speculation. Less than 1% of Personal Genome Project users have withdrawn from the study, and none of the participants who've provided whole genome or exome data have done so.)

It's clear that Open Humans is focused on the road directly ahead. And what does that road look like? "Immediately, my biggest concern is building our staff. Now that we won funding, we need to hire a good programmer... so if you are or know someone that seems like a perfect fit for us, please pass along our hiring opportunities". She adds that anyone can join the project's mailing list to get updates and find out when Open Humans is open to new participants - and new researchers. "And just talk about us. Referring to us is an intangible but important aspect for helping promote awareness of participant-mediated data sharing as a participatory research method and as a method for creating open data."

In other words: start spreading the news. Participant mediated data isn't the only solution to privacy issues, but it's an enticing one - and the more people who embrace it, the better a solution it will be.

May 29, 2014

Questions and Answers about the Förster case


By now, everyone is probably familiar with the recent investigation of the work of Dr. Förster, in which the Landelijk Orgaan Wetenschappelijke Integriteit (LOWI) concluded that data reported in a paper by Dr. Förster had been manipulated. In his reaction to the article in the newspaper NRC, Dr. Förster suggested that our department was engaged in a witch-hunt. This is incorrect.

However, we have noticed that there are many questions about both the nature of the case and the procedure followed. We have compiled the following list of questions and answers to explain what happened. If any other questions arise, feel free to email them to us so we can add them to this document.

Q: What was the basis of the allegations against Dr. Förster?
A: In every single one of 40 experiments, reported across three papers, the means of two experimental conditions (“local focus” and “global focus”) showed almost exactly opposite behavior with respect to the control condition. So whenever the local focus condition led to a one-point increase of the mean level of the dependent variable compared to the control condition, the global condition led almost exactly to a one-point decrease. Thus, the samples exhibit an unrealistic level of linearity.

Q: Couldn’t the effects actually be linear in reality?
A: Yes, that is unlikely but possible. However, in addition to the perfect linearity of the effects themselves, there is far too little variance in the means of the conditions, given the variance that is present within the conditions. In other words: the means across the conditions follow the linear pattern (much) too perfectly. To show this, the whistleblower’s complaint computed the probability of finding this level of linearity (or even more perfect linearity) in the samples researched, under the assumption that, in reality, the effect is linear in the population. That probability equals 1/508,000,000,000,000,000,000.
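To convey the flavor of that computation, here is a hypothetical toy simulation (my own illustration, not the whistleblower's actual analysis, which worked from the reported statistics): draw three condition means from a population in which the effect really is perfectly linear, and ask how often the sample means come out as close to a straight line as some small tolerance.

```r
# Toy illustration: even when the population means are perfectly linear
# (-1, 0, 1; SD = 1; n per condition), the *sample* means almost never
# line up nearly perfectly. All numbers here are arbitrary choices.
set.seed(1)
linearity_deviation <- function(n = 20) {
  m_local   <- mean(rnorm(n, -1))
  m_control <- mean(rnorm(n,  0))
  m_global  <- mean(rnorm(n,  1))
  abs(m_local + m_global - 2 * m_control)  # 0 = perfectly linear means
}
mean(replicate(10000, linearity_deviation()) < 0.01)  # a rare event
```

Forty experiments that all show deviations this small multiply many such small probabilities together, which is how a figure on the order of 1 in 5 x 10^20 can arise.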


May 28, 2014

The etiquette of train wreck prevention


In a famous open letter to scientists, Daniel Kahneman, seeing "a train wreck looming", argued that social psychologists (and presumably, especially those who are publishing social priming effects) should engage in systematic and extensive replication studies to avoid a loss of credibility in the field. The fact that a Nobel Prize-winning psychologist made such a clear statement gave a strong boost of support to systematic replication efforts in social psychology (see Pashler & Wagenmakers, 2012, and their special issue in Psychological Science).

But in a more recent commentary, Kahneman appears to have changed his mind, and argues that “current norms allow replicators too much freedom to define their study as a direct replication of previous research”, and that the “seemingly reasonable demand” of requiring method sections to be so precise that they enable direct replications is “rarely satisfied in psychology, because behavior is easily affected by seemingly irrelevant factors”. A similar argument was put forth by Simone Schnall, who recently wrote that “human nature is complex, and identical materials will not necessarily have the identical effect on all people in all contexts”.

While I wholeheartedly agree with Kahneman’s original letter on this topic, I strongly disagree with his commentary, for reasons that I will outline here.

First, he argues (as Schnall did too) that there always are potentially influential differences between the original study and the replication attempt. But this would imply that any replication study, no matter how meticulously performed, would be meaningless. (Note that this also holds for successful replication studies.) This is a clear case of a reductio ad absurdum.

The main reason why this argument is flawed is that there is a fundamental relationship between the theoretical claim based on a finding and its proper replication, which is the topic of an interesting discussion about the degree to which a replication should be similar to the study it addresses (see Stroebe & Strack, 2014; Simons, 2014; Pashler & Harris, 2012). My position in this debate is the following. The more general the claim that the finding is claimed to support, the more “conceptual” the replication of the supporting findings can (and should) be. Suppose we have a finding F that we report in order to claim evidence for scientific claim C. In the case that C is identical to F, such that C is a claim of the type “The participants in our experiment did X at time T in location L”, it is indeed impossible to do any type of replication study, because the exact circumstances of F were unique and therefore by definition irreproducible. But in this case (that F = C), C has obviously no generality at all, and is therefore scientifically not very interesting. In such a case, there would also be no point in doing inferential statistics. If, on the other hand, C is more general than F, the level of methodological detail that is provided should be sufficient to enable readers to attempt to replicate the finding, allowing for variation that the authors do not consider important. If the authors remark that this result arises under condition A but acknowledge that it might not arise under condition A' (let's say, with participants who are aged 21-24 rather than 18-21), then clearly a follow-up experiment under condition A' isn't a valid replication. But if their claim (explicitly or implicitly) is that it doesn't matter whether condition A or A' is in effect, then a follow-up study involving condition A' might well be considered a replication. The failure to specify any particular detail might reasonably be considered an implicit claim that this detail is not important.

Second, Kahneman is worried that even the rumor of a failed replication could damage the reputation of the original authors. But if researchers attempt to do a replication study, this does not imply that they believe or suggest that the original author was cheating. Cheating does occasionally happen, sadly, and replication studies are a good way to catch these cases. But, assuming that cheating is not completely rampant, it is much more likely that a finding cannot be replicated successfully because variables or interactions have been overlooked or not controlled for, because there were unintentional errors in the data collection or analysis, or because the results were simply a fluke, caused by our standard statistical practices severely overestimating evidence against the null hypothesis (Sellke, Bayarri, & Berger, 2001; Johnson, 2013).

Furthermore, replication studies are not hostile or friendly. People are. I think it is safe to say that we all dislike uncollegial behavior and rudeness, and we all agree that it should be avoided. If Kahneman wants to give us a stern reminder that it is important for replicators to contact the original authors, then I support that, even though I personally suspect that the vast majority of replicators already do that. There already is etiquette in place in experimental psychology, and as far as I can tell, it’s mostly being observed. And for those cases where it is not, my impression is that the occasional unpleasant behavior originates not only from replicators, but also from original authors.

Another point I would like to address is the asymmetry of the relationship between author and replicator. Kahneman writes: “The relationship is also radically asymmetric: the replicator is in the offense, the author plays defense.” This may be true in some sense, but it is counteracted by other asymmetries that work in the opposite direction: The author has already successfully published the finding in question and is reaping the benefits of it. The replicator, however, is up against the strong reluctance of journals to publish replication studies, is required to have a much higher statistical power (hence invest far more resources), and is often arguing against a moving target, as more and more newly emerging and potentially relevant details of the original study can be brought forward by the original authors.

A final point: the problem that started the present replication discussion was that a number of findings that were deemed both important and implausible by many researchers failed to replicate. The defensiveness of the original authors of these findings is understandable, but so is the desire of skeptics to investigate if these effects are in fact reliable. I, both as a scientist and as a human being, really want to know if I can boost my creativity by putting an open box on my desk (Leung et al., 2012) or if the fact that I frequently take hot showers could be caused by loneliness (Bargh & Shalev, 2012). As Kahneman himself rightly put it in his original open letter: “The unusually high openness to scrutiny may be annoying and even offensive, but it is a small price to pay for the big prize of restored credibility.”


Bargh, J. A., & Shalev, I. (2012). The substitutability of physical and social warmth in daily life. Emotion, 12(1), 154. doi:10.1037/a0023527

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48), 19313-19317. doi:10.1073/pnas.1313476110

Leung, A. K.-y., Kim, S., Polman, E., Ong, L. S., Qiu, L., Goncalo, J. A., et al. (2012). Embodied metaphors and creative "acts". Psychological Science, 23(5), 502-509. doi:10.1177/0956797611429801

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7(6), 531-536. doi:10.1177/1745691612463401

Pashler, H., & Wagenmakers, E.-J. (2012). Editors' Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence? Perspectives on Psychological Science, 7(6), 528-530. doi:10.1177/1745691612465253

Sellke, T., Bayarri, M., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62-71. doi:10.1198/000313001300339950

Simons, D. J. (2014). The Value of Direct Replication. Perspectives on Psychological Science, 9(1), 76-80. doi:10.1177/1745691613514755

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59-71. doi:10.1177/1745691613514450

May 20, 2014

Support Publication of Clinical Trials for International Clinical Trials Day


Today is International Clinical Trials Day, held on May 20th in honor of James Lind, the famous Scottish physician who began one of the world's first clinical trials on May 20th, 1747. That trial demonstrated that citrus fruit could cure scurvy, a disease we now know is caused by vitamin C deficiency. While it and the other life-saving trials conducted in the 267 years since are surely worth celebrating, International Clinical Trials Day is also a time to reflect on the problems that plague the clinical trials system. In particular, the lack of reporting on nearly half of all clinical trials has potentially deadly consequences.

The AllTrials campaign, launched in January 2013, aims to have all past and present clinical trials registered and reported. From the AllTrials campaign website:

Doctors and regulators need the results of clinical trials to make informed decisions about treatments.

But companies and researchers can withhold the results of clinical trials even when asked for them. The best available evidence shows that about half of all clinical trials have never been published, and trials with negative results about a treatment are much more likely to be brushed under the carpet.

This is a serious problem for evidence based medicine because we need all the evidence about a treatment to understand its risks and benefits. If you tossed a coin 50 times, but only shared the outcome when it came up heads and you didn’t tell people how many times you had tossed it, you could make it look as if your coin always came up heads. This is very similar to the absurd situation that we permit in medicine, a situation that distorts the evidence and exposes patients to unnecessary risk that the wrong treatment may be prescribed.

It also affects some very expensive drugs. Governments around the world have spent billions on a drug called Tamiflu: the UK alone spent £500 million on this one drug in 2009, which is 5% of the total £10bn NHS drugs budget. But Roche, the drug’s manufacturer, published fewer than half of the clinical trials conducted on it, and continues to withhold important information about these trials from doctors and researchers. So we don’t know if Tamiflu is any better than paracetamol. (Author's note: in April 2014 a review based on full clinical trial data determined that Tamiflu was almost entirely ineffective.)

Initiatives have been introduced to try to fix this problem, but they have all failed. Since 2008 in the US the FDA has required results of all trials to be posted within a year of completion of the trial. However an audit published in 2012 has shown that 80% of trials failed to comply with this law. Despite this fact, no fines have ever been issued for non-compliance. In any case, since most currently used drugs came on the market before 2008, the trial results that are most important for current medical practice would not have been released even if the FDA’s law was fully enforced.

We believe that this situation cannot go on. The AllTrials initiative is campaigning for the publication of the results (that is, full clinical study reports) from all clinical trials – past, present and future – on all treatments currently being used.

We are calling on governments, regulators and research bodies to implement measures to achieve this.

And we are calling for all universities, ethics committees and medical bodies to enact a change of culture, recognise that underreporting of trials is misconduct and police their own members to ensure compliance.

You can learn more about the problem of missing clinical trial data in this brief. AllTrials also provides slides on this issue to incorporate into talks and presentations as well as a petition you can sign.

May 15, 2014

How anonymous peer review fails to do its job and damages science.


Churchill believed that democracy was the “worst form of government except all those other forms that have been tried from time to time.” Something analogous is often said about anonymous peer review (APR) in science: “it may have its flaws, but it’s the ‘least bad’ of all possible systems.” In this contribution, I present some arguments to the contrary. I believe that APR is threatening scientific progress, and therefore that it urgently needs to be fixed.

The reason we have a review system in the first place is to uphold basic standards of scientific quality. The two main goals of a review system are to minimize both the number of bad studies that are accepted for publication and the number of good studies that are rejected. Borrowing terminology from signal detection theory, let's call these false positives and false negatives, respectively.

It is often implicitly assumed that minimizing the number of false positives is the primary goal of APR. However, signal detection theory tells us that reducing the number of false positives inevitably increases the rate of false negatives. I want to draw attention here to the fact that the cost of false negatives is both invisible and potentially very high. It is invisible, obviously, because we never get to see the good work that was rejected for the wrong reasons. And the cost is high because it removes not only good papers from our scientific discourse, but also entire scientists. I personally know a number of very talented and promising young scientists who first sent their work to a journal, fully expecting to be scrutinized, but then received reviews that were so personal, rude, scathing, and above all unfair, that they decided to look for another profession and never looked back. I also know a large number of talented young scientists who are still in the game, but who suffer intensely every time they attempt to publish something and get trashed by anonymous reviewers. I would not be surprised if they also leave academia soon.

The inherent conservatism of APR means that people with new, original approaches to old problems run the risk of being shut out, humiliated, and consequently chased away from academia. In the short term, this is to the advantage of established scientists who do not like their work to be challenged. In the long run, it is obviously very damaging for science. This is especially true of the many journals that will only accept papers that receive unanimously positive reviews. Such journals are not facilitating scientific progress, because work with even the faintest hint of controversy is almost automatically rejected.

With all this in mind, it is somewhat surprising that APR also fails to keep out many obviously bad papers.


May 7, 2014

When Science Selects for Fraud


This post is in response to Jon Grahe's recent article in which he invited readers to propose metaphors that might help us understand why fraud occurs and how to prevent it.

Natural selection is the process by which populations change as individuals with different heritable traits survive and reproduce at different rates in their environments. It is also an apt metaphor for how human cultures form and thrive. The scientific community, broadly speaking, selects for a number of personality traits, and those traits are more common among scientists than in the general population. In some cases, this is necessary and beneficial. In other cases, it is tragic.

The scientific community selects for curiosity. Not every scientist is driven by a deep desire to understand the natural world, but so many are. How boring would endless conferences, lab meetings, and lectures be if one didn't delight in asking questions and figuring out answers? It also selects for a certain kind of analytical thinking. Those who can spot a confound or design a carefully controlled experiment are more likely to succeed. And it selects for perseverance. Just ask the researchers who work late into the night running gels, observing mice, or analyzing data.

The scientific community, like the broader culture of which it is a part, sometimes selects unjustly. It selects for the well-off: those who can afford the kind of schools where a love of science is cultivated rather than ignored or squashed, those who can volunteer in labs because they don’t need to work to support themselves and others, those who can pay $30 to read a journal article. It selects for white men: those who don’t have to face conscious and unconscious discrimination, cultural stereotyping, and microaggressions.

Of particular relevance right now is the way the scientific community selects for fraud. If asked, most scientists would say that the ideal scientist is honest, open-minded, and able to accept being wrong. But we do not directly reward these attributes. Instead, success - publication of papers, grant funding, academic positions and tenure, the approbation of our peers - is too often based on a specific kind of result. We reward those who can produce novel and positive results. We don’t reward based on how they produce them.

This does give an advantage to those with good scientific intuitions, which is a reasonable thing to select for. It also gives an advantage to risk-takers, those willing to stake their careers on being right. The risk-averse? They have two options: to drop out of scientific research, as I did, or to commit fraud in order to ensure positive results, as Diederik Stapel, Marc Hauser and Jens Förster did. Among the risk-averse, those who are unwilling to do shoddy or unethical science are selected against. Those who are willing are selected for, and often reach the tops of their fields.

One of the more famous examples of natural selection is the peppered moth of England. Before the Industrial Revolution, these moths were lightly colored, allowing them to blend in with the light gray bark of the average tree. During the Industrial Revolution, extreme pollution painted the trees of England black with soot. Darker moths now escaped predators more often, and over generations the population evolved dark, soot-colored wings.

We can censure the individuals who commit fraud, but this is like punishing the peppered moth for its dirty wings. As long as success in the scientific community is measured by results and not process, we will continue to select for those willing to violate process in order to ensure results. Our species, the scientists, need to change our environment if we want to evolve past fraud.

Photo of Biston betularia by Donald Hobern, CC BY 2.0

May 2, 2014

Avoiding a Witch Hunt: What is the Next Phase of our Scientific Inquisition?


Earlier this week, I learned about another case of fraud in psychological science (Retraction Watch, 4.29.2014). The conclusions from the evidence gathered in the extended investigation are hard to ignore: the probability that the findings occurred by chance is so minute that it is hard to believe they did not result from falsified data. In an email to the scientific community (Retraction Watch, 4.30.2014), the target of this investigation strongly asserted that he never faked any data, while assuring us that his coauthor never worked on the data; it was all his. Some comments on the Retraction Watch post use the term “witch hunt.” It was the first term I used in response as well, suggesting caution before judgment. A colleague pointed out the difference: there were no witches, but there clearly are dishonest scientists. I have no choice but to agree; I think a better analogy is the Inquisition. We are entering the era of the Scientific Inquisition. A body of experts (the LOWI, in this case) will use a battery of sophisticated tools to examine the likelihood that irregularities in the findings occurred by chance. In this case it is hard to believe the denial, but thankfully I am not a judge in the Scientific Inquisition.


Apr 23, 2014

Memo From the Office of Open Science


Dear Professor Lucky,

Congratulations on your new position as assistant professor at Utopia University. We look forward to your joining our community and are eager to aid you in your transition from Antiquated Academy. It’s our understanding that Antiquated Academy does not have an Office of Open Science, so you may be unfamiliar with who we are and what we do.

The Office of Open Science was created to provide faculty, staff and students with the technical, educational, social and logistical support they need to do their research openly. We recognize that the fast pace of research and the demands placed on scientists to be productive make it difficult to prioritize open science. We collaborate with researchers at all levels to make it easier to do this work.

Listed below are some of the services we offer.


Apr 16, 2014

Expectations of replicability and variability in priming effects, Part II: When should we expect replication, how does this relate to variability, and what do we do when we fail to replicate?


Continued from Part 1.

Now that some initial points and clarifications have been offered, we can move to the meat of the argument. Direct replication is essential to science. What does it mean to replicate an effect? All effects require a set of contingencies to be in place. To replicate an effect is to set up those same contingencies that were present in the initial investigation and observe the same effect, whereas to fail to replicate an effect is to set up those same contingencies and fail to observe the same effect. Putting aside what we mean by "same effect" (i.e., directional consistency versus magnitude), we don't see any way in which people can reasonably disagree on this point. This is a general point true of all domains of scientific inquiry.

The real question becomes, how can we know what contingencies produced the effect in the original investigation? Or more specifically, how can we separate the important contingencies from the unimportant contingencies? There are innumerable contingencies present in a scientific investigation that are totally irrelevant to obtaining the effect: the brand of the light bulb in the room, the sock color of the experimenter, whether the participant got a haircut last Friday morning or Friday afternoon. Common sense can provide some guidance, but in the end the theory used to explain the effect specifies the necessary contingencies and, by omission, the unnecessary contingencies. Therefore, if one is operating under the wrong theory, one might think some contingencies are important when really they are unimportant, and more interestingly, one might miss some necessary contingencies because the theory did not mention them as being important.

Before providing an example, it might be useful to note that, as far as we can tell, no one has offered any criticism of the logic outlined above. Many sarcastic comments have been made along the lines of, "apparently we can never learn anything because of all these mysterious moderators." And it is true that the argument can be misused to defend poor research practices. But at core, there is no criticism about the basic point that contingencies are necessary for all effects and a theory establishes those contingencies.


Apr 9, 2014

Expectations of replicability and variability in priming effects, Part I: Setting the scope and some basic definitions


We are probably thought of as "defenders" of priming effects, and along with that comes the expectation that we will provide some convincing argument for why priming effects are real. We will do no such thing. The kinds of priming effects under consideration (priming of social categories resulting in behavioral priming effects) constitute a field with relatively few direct replications1, and we therefore lack good estimates of the effect size of any specific effect. Judgments about the nature of such effects can only be made after thorough, systematic research, which will take some years still (assuming priming researchers change their research practices). And of course, we must be open to the possibility that further data will show any given effect to be small or non-existent.

One really important thing we could do to advance the field to that future ideal state is to stop calling everything priming. It appears now, especially with the introduction of the awful term "social priming," that any manipulation used by a social cognition researcher can be called priming and, if such a manipulation fails to have an effect, it is cheerfully linked to this nebulous, poorly-defined class of research called "social priming." There is no such thing as "social priming." There is priming of social categories (elderly, professor) and priming of motivational terms (achievement) and priming of objects (flags, money) and so on. And there are priming effects at the level of cognition (increased activation of concepts) or affect (valence, arousal, or emotions) or behavior (walking, Trivial Pursuit performance) or physiology, and some of these priming effects will be automatic and some not (and even then recognizing the different varieties of automaticity; Bargh, 1989). These are all different things and need to be treated separately.


Apr 2, 2014

The Deathly Hallows of Psychological Science


This piece was originally posted to the Personality Interest Group and Espresso (PIG-E) web blog at the University of Illinois.

As of late, psychological science has arguably done more to address the ongoing believability crisis than most other areas of science. Many notable efforts have been put forward to improve our methods. From the Open Science Framework (OSF), to changes in journal reporting practices, to new statistics, psychologists are doing more than any other science to rectify practices that allow far too many unbelievable findings to populate our journal pages.

The efforts in psychology to improve the believability of our science can be boiled down to some relatively simple changes. We need to replace/supplement the typical reporting practices and statistical approaches by:

  1. Providing more information with each paper so others can double-check our work, such as the study materials, hypotheses, data, and syntax (through the OSF or journal reporting practices).
  2. Designing our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes).
  3. Providing more information about effect sizes in each report, such as what the effect sizes are for each analysis and their respective confidence intervals.
  4. Valuing direct replication.

It seems pretty simple. Actually, the proposed changes are simple, even mundane.

What has been most surprising is the consistent push back and protests against these seemingly innocuous recommendations. When confronted with these recommendations it seems many psychological researchers balk. Despite calls for transparency, most researchers avoid platforms like the OSF. A striking number of individuals argue against and are quite disdainful of reporting effect sizes. Direct replications are disparaged. In response to the various recommendations outlined above, prototypical protests are:

  1. Effect sizes are unimportant because we are “testing theory” and effect sizes are only for “applied research.”
  2. Reporting effect sizes is nonsensical because our research is on constructs and ideas that have no natural metric, so that documenting effect sizes is meaningless.
  3. Having highly powered studies is cheating because it allows you to lay claim to effects that are so small as to be uninteresting.
  4. Direct replications are uninteresting and uninformative.
  5. Conceptual replications are to be preferred because we are testing theories, not confirming techniques.

While these protestations seem reasonable, the passion with which they are provided is disproportionate to the changes being recommended. After all, if you’ve run a t-test, it is little trouble to estimate an effect size too. Furthermore, running a direct replication is hardly a serious burden, especially when the typical study only examines 50 to 60 odd subjects in a 2×2 design. Writing entire treatises arguing against direct replication when direct replication is so easy to do falls into the category of “the lady doth protest too much, methinks.” Maybe it is a reflection of my repressed Freudian predilections, but it is hard not to take a Depth Psychology stance on these protests. If smart people balk at seemingly benign changes, then there must be something psychologically big lurking behind those protests. What might that big thing be? I believe the reason for the passion behind the protests lies in the fact that, though mundane, the changes that are being recommended to improve the believability of psychological science undermine the incentive structure on which the field is built.
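The point that an effect size falls out of a t-test almost for free can be made concrete. A minimal sketch, using standard textbook conversion formulas and purely hypothetical numbers (an independent-samples design with equal group sizes is assumed):

```python
# Hedged sketch: deriving effect sizes from a reported t statistic.
# Formulas assume an independent-samples t-test with equal group sizes;
# the inputs t(58) = 2.10 are hypothetical.
import math

def effect_sizes_from_t(t, df):
    """Return Cohen's d (equal-n, between-subjects) and point-biserial r."""
    d = 2 * t / math.sqrt(df)                 # d = 2t / sqrt(df)
    r = math.sqrt(t ** 2 / (t ** 2 + df))     # r = sqrt(t^2 / (t^2 + df))
    return d, r

# E.g., a 30-per-group study reporting t(58) = 2.10:
d, r = effect_sizes_from_t(2.10, 58)
```

Two lines of arithmetic on numbers that are already in the manuscript, which is the sense in which the protest seems disproportionate.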

I think this confrontation needs to be more closely examined, because we need to consider the challenges and consequences of deconstructing our incentive system and status structure. This raises the question: what is our incentive system, and just what are we proposing to do to it? For this, I believe a good analogy is the dilemma faced by Harry Potter in the last book of the eponymous series.


Mar 26, 2014

Behavioral Priming: Time to Nut Up or Shut Up


In the epic movie "Zombieland", one of the main protagonists –Tallahassee, played by Woody Harrelson– is about to enter a zombie-infested supermarket in search of Twinkies. Armed with a banjo, a baseball bat, and a pair of hedge shears, he tells his companion it is "time to nut up or shut up". In other words, the pursuit of happiness sometimes requires that you expose yourself to grave danger. Tallahassee could have walked away from that supermarket and its zombie occupants, but then he would never have discovered whether or not it contained the Twinkies he so desired.

At its not-so-serious core, Zombieland is about leaving one's comfort zone and facing up to your fears. This I believe is exactly the challenge that confronts the proponents of behavioral priming today. To recap, the phenomenon of behavioral priming refers to unconscious, indirect influences of prior experiences on actual behavior. For instance, presenting people with words associated with old age ("Florida", "grey", etc.) primes the elderly stereotype and supposedly makes people walk more slowly; in the same vein, having people list the attributes of a typical professor ("confused", "nerdy", etc.) primes the concept of intelligence and supposedly makes people answer more trivia questions correctly.

In recent years, the phenomenon of behavioral priming has been scrutinized with increasing intensity. Crucial to the debate is that many (if not all) of the behavioral priming effects appear to vanish into thin air in the hands of other researchers. Many of these researchers –from now on, the skeptics– have reached the conclusion that behavioral priming effects are elusive, brought about mostly by confirmation bias, the use of questionable research practices, and selective reporting.


Mar 19, 2014

If You Have Data, Use It When Theorizing


There is a reason data collection is part of the empirical cycle. If you have a good theory that allows for what Platt (1964) called ‘strong inferences’, then statistical inferences from empirical data can be used to test theoretical predictions. In psychology, as in most sciences, this testing is not done in a Popperian fashion (where we consider a theory falsified if the data does not support our prediction), but we test ideas in Lakatosian lines of research, which can either be progressive or degenerative (e.g., Meehl, 1990). In (meta-scientific) theory, we judge (scientific) theories based on whether they have something going for them.

In scientific practice, this means we need to evaluate research lines. One really flawed way to do this is to use ‘vote-counting’ procedures, where you examine the literature, and say: "Look at all these significant findings! And there are almost no non-significant findings! This theory is the best!” Read Borenstein, Hedges, Higgins, & Rothstein (2006) who explain “Why Vote-Counting Is Wrong” (p. 252 – but read the rest of the book while you’re at it).
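A quick simulation makes Borenstein and colleagues' point vivid. This is a hedged sketch with made-up numbers (a small true effect of d = 0.2, forty studies of n = 50 each), not anything drawn from their book: vote-counting tallies mostly non-significant results even though the combined data leave little doubt that the effect is real.

```python
# Sketch: why vote-counting misleads. A real but small effect studied
# with underpowered designs yields mostly "non-significant" votes,
# while pooling the very same data detects the effect clearly.
# All numbers here are hypothetical.
import math
import random
import statistics

random.seed(1)
true_effect, n_per_study, n_studies = 0.2, 50, 40

studies = [[random.gauss(true_effect, 1.0) for _ in range(n_per_study)]
           for _ in range(n_studies)]

def t_stat(sample):
    """One-sample t statistic against a null mean of zero."""
    m, s = statistics.mean(sample), statistics.stdev(sample)
    return m / (s / math.sqrt(len(sample)))

# Vote count: how many individual studies reach t > 2.01 (~p < .05, df = 49)?
wins = sum(t_stat(s) > 2.01 for s in studies)  # a minority of the 40

# Pooling all 2,000 observations gives an unambiguous answer.
pooled_t = t_stat([x for s in studies for x in s])  # far beyond any cutoff
```

With roughly 30% power per study, most "votes" come up non-significant, so the vote-counter wrongly concludes the effect is absent; the pooled analysis of identical data says the opposite.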


Mar 12, 2014

In the Previous Episodes of the Tale of Social Priming and Reproducibility


We have lined up a nice set of posts responding to the recent special section in PoPS on social priming and replication/reproducibility, which we will publish in the coming weeks. It has proven easier to find critics of social priming than to find defenders of the phenomenon, and if there are primers out there who want to chime in they are most welcome and may contact us at oscblog@googlegroups.com.

The special section in PoPS was immediately prompted by this wonderful November 2012 issue from PoPS on replicability in psychology (open access!), but the Problems with Priming started prior to this. For those of you who didn’t seat yourself in front of the screen with a tub of well-buttered popcorn every time behavioral priming made it outside the trade journals, I’ll provide some back-story, plus links to posts and articles that frame the current responses.

The mitochondrial Eve of behavioral priming is Bargh’s Elderly Prime1. The unsuspecting participants were given scrambled sentences and asked to create proper sentences out of four of the five words in each. Some of the sentences included words like Bingo or Florida – words that may have made you think of the elderly, if you were a student in New York in the mid-nineties. Then the researchers measured the speed with which each participant walked down the corridor to return their work, and, surprisingly to many, those who unscrambled sentences that included “Bingo” and “Florida” walked more slowly than those who did not. Conclusion: the construct of “elderly” had been primed, causing participants to adjust their behavior (a slower walk) accordingly. You can check out sample sentences in this Marginal Revolution post – yes, priming made it to this high-traffic economics blog.

This paper has been cited 2571 times, so far (according to Google Scholar). It even appears in Kahneman’s Thinking, Fast and Slow, and has been high on the wish-list for replication on Pashler’s PsychFile Drawer. (No longer in the top 20, though).

Finally, in January 2012, Doyen, Klein, Pichon & Cleeremans (a Belgian group) published a replication attempt in PLOSone in which they suggest the effect was due to demand characteristics. Ed Yong did this nice write-up of the research.

Bargh was not amused, and wrote a scathing rebuttal on his blog in the Psychology Today domain. He took it down after some time (for good reason – I think it can still be found, but I won’t look for it). Ed commented on this too.

A number of good posts from blogging psychological scientists also commented on the story. A sampling: Sanjay Srivastava on his blog The Hardest Science, Chris Chambers on NeuroChambers, and Cedar Riener on his Cedarsdigest.

The British Psychological Society published a notice about it in The Psychologist which links to additional commentary. In May, Ed Yong had an article in Nature discussing the status of non-replication in psychology in general, but where he also brings up the Doyen/Bargh controversy. On January 13, the Chronicle published a summary of what had happened.

But, prior to that, Daniel Kahneman made a call for psychologists to clean up their act as far as behavioral priming goes. Ed Yong (again) published two pieces about it. One in Nature and one on his blog.

The controversies surrounding priming continued in the spring of 2013. This time it was David Shanks who, as a hobby (see his video – scroll down below the fold), had taken to attempting to replicate priming of intelligence, work originally done by Dijksterhuis and van Knippenberg in 1998. He had his students perform a series of replications, all of which showed no effect; the results were then collected in this PLOSone paper.

Dijksterhuis retorted in the comment section2. Rolf Zwaan blogged about it. Then Nature posted a breathless article suggesting that this was a fresh blow for social psychology.

Now, most of us who do science thought instead that this was science working just as it ought to, and blogged up a storm about it – with some of the posts (including one of mine) linked in Ed Yong’s “Missing links” feature. The links are all in the fourth paragraph, above the scroll, and include additional links to discussions of replicability and of the damage done by a certain Dutch fraudster.

So here you are, ready for the next set of installments.

1 Ancestral to this is Srull & Wyer’s (1979) story of Donald, who is either hostile or kind, depending on which set of sentences the participant unscrambled in that earlier experiment that had nothing to do with judging Donald.

2 A nice feature. No waiting years for the retorts to be published in the dead tree variant we all get as PDF’s anyway.

Mar 6, 2014

Confidence Intervals for Effect Sizes from Noncentral Distributions


(Thanks to Shauna Gordon-McKeon, Fred Hasselman, Daniël Lakens, Sean Mackinnon, and Sheila Miguez for their contributions and feedback to this post.)

I recently took on the task of calculating a confidence interval around an effect size stemming from a noncentral statistical distribution (the F-distribution, to be precise). This was new to me, and since I believe such statistical procedures would add value to work in the social and behavioral sciences, yet they remain uncommon in practice, potentially due to a lack of awareness, I wanted to pass along some of the things that I found.

In an effort to estimate the replicability of psychological science, an important first step is to determine the criteria for declaring a given replication attempt successful. Lacking clear consensus on these criteria, the Open Science Collaboration determined that, rather than settling on a single set of criteria by which the replicability of psychological research would be assessed, multiple methods would be employed, all of which provide valuable insight regarding the reproducibility of published findings in psychology (Open Science Collaboration, 2012). One such method is to examine the confidence interval around the original target effect and to see if it overlaps with the confidence interval from the replication effect. However, estimating the confidence interval around many effects in social science research requires the use of noncentral probability distributions, and most mainstream statistical packages (e.g., SAS, SPSS) do not provide off-the-shelf capabilities for deriving confidence intervals from these distributions (Kelley, 2007).

Most of us probably picture common statistical distributions such as the t-distribution, the F-distribution, and the χ2 distribution as being two-dimensional, with the x-axis representing the value of the test statistic and the area under the curve representing the likelihood of observing such a value in a sample. When first learning to conduct these statistical tests, such visual representations likely provided a helpful way to convey the concept that more extreme values of the test statistic are less likely. In the realm of null hypothesis statistical testing (NHST), this provides a tool for visualizing how extreme the test statistic would need to be before we would be willing to reject a null hypothesis. However, it is important to remember that these distributions vary along a third parameter as well: the noncentrality parameter. The distribution that we use to determine the cut-off points for rejecting a null hypothesis is a special, central case of the distribution in which the noncentrality parameter is zero. This special-case distribution gives the probabilities of test statistic values when the null hypothesis is true (i.e., when the population effect is zero). As the noncentrality parameter changes (i.e., when we assume that an effect does exist), the shape of the distribution that defines the probabilities of obtaining various values of the test statistic changes as well. The following figure (copied from the Wikipedia page for the noncentral t-distribution) might help provide a sense of how the shape of the t-distribution changes as the noncentrality parameter varies.

non-central T distribution
Figure by Skbkekas, licensed CC BY 3.0.

The first two plots (orange and purple) illustrate the different shapes of the distribution under the assumption that the true population parameter (the difference in means) is zero. The value of v indicates the degrees of freedom used to determine the probabilities under the curve. The difference between these first two curves stems from the fact that the purple curve has more degrees of freedom (a larger sample), and thus there will be a higher probability of observing values near the mean. These distributions are central (and symmetrical), and as such, values of x that are equally higher or lower than the mean are equally probable. The second two plots (blue and green) illustrate the shapes of the distribution under the assumption that the true population parameter is two. Notice that both of these curves are positively skewed, and that this skewness is particularly pronounced in the blue curve as it is based on fewer degrees of freedom (smaller sample size). The important thing to note is that for these plots, values of x that are equally higher or lower than the mean are NOT equally probable. Observing a value of x = 4 under the assumption that the true value of x is two is considerably more probable than observing a value of x = 0. Because of this, a confidence interval around an effect that is anything other than zero will be asymmetrical and will require a bit of work to calculate.
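This asymmetry is easy to verify numerically. As a quick illustration (my addition, written in Python with scipy rather than in the post's R), the following compares the noncentral t density at x = 4 and x = 0 for a small-sample case with noncentrality 2:

```python
from scipy.stats import nct

df, nc = 3, 2  # few degrees of freedom, noncentrality parameter of 2

# Both x = 4 and x = 0 lie two units from the noncentrality parameter,
# but the distribution is right-skewed, so the density at 4 is
# considerably larger than the density at 0.
d4 = nct.pdf(4, df, nc)
d0 = nct.pdf(0, df, nc)
print(d4 > d0)  # True
```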

Because the shape (and thus the degree of symmetry) of many statistical distributions depends on the size of the effect that is present in the population, we need a noncentrality parameter to aid in determining the shape of the distribution and the boundaries of any confidence interval of the population effect. As mentioned previously, these complexities do not arise as often as we might expect in everyday research because when we use these distributions in the context of null-hypothesis statistical testing (NHST), we can assume a special, ‘centralized’ case of the distributions that occurs when the true population effect of interest is zero (the typical null hypothesis). However, confidence intervals can provide different information than what can be obtained through NHST. When testing a null hypothesis, what we glean from our statistics is the probability of obtaining the effect observed in our sample if the true population effect is zero. The p-value represents this probability, and is derived from a probability curve with a noncentrality parameter of zero. As mentioned above, these special cases of statistical distributions such as the t, F, and χ2 are ‘central’ distributions. On the other hand, when we wish to construct a confidence interval of a population effect, we are no longer in the NHST world, and we no longer operate under the assumption of ‘no effect’. In fact, when we build a confidence interval, we are not necessarily making assumptions at all about the existence or non-existence of an effect. Instead, when we build a confidence interval, we want a range of values that is likely to contain the true population effect with some degree of confidence. 
To be crystal clear, when we construct a 95% confidence interval around a test statistic, what we are saying is that if we repeatedly tested random samples of the same size from the target population under identical conditions, the true population parameter will be bounded by the 95% confidence interval derived from these samples 95% of the time.
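This repeated-sampling interpretation can be demonstrated with a short simulation (a toy sketch I am adding for illustration; the particular mean, sample size, and number of repetitions are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, reps = 50.0, 30, 4000
covered = 0

for _ in range(reps):
    # Draw a fresh random sample and build a t-based 95% CI for the mean.
    sample = rng.normal(loc=true_mean, scale=10.0, size=n)
    m, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)
    tcrit = stats.t.ppf(0.975, df=n - 1)
    lo, hi = m - tcrit * se, m + tcrit * se
    # Count how often the interval captures the true population mean.
    covered += (lo <= true_mean <= hi)

print(covered / reps)  # close to 0.95
```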

From a practical standpoint, a confidence interval can tell us everything that NHST can, and then some. If the 95% confidence interval of a given effect contains the value of zero, then there is a good chance that the effect in the relationship you are testing is negligible. In this case, the conclusion you would reach as a researcher is conceptually similar to declaring that you are not willing to reject a null hypothesis of zero effect, on the grounds that the data are consistent with an effect of zero. However, a confidence interval allows the researcher to say a bit more about the potential size of a population effect as well as the degree of variability that exists in its estimate, whereas NHST only permits the researcher to state, with a specified level of confidence, the likelihood that an effect exists at all.

Why, then, is NHST the overwhelming choice of researchers in the social sciences? The likely answer has to do with the idea of noncentrality stated above. When we build a confidence interval around an effect size, we generally do not build the confidence interval around an effect of zero. Instead, we build the confidence interval around the effect that we find in our sample. As such, we are unable to build the confidence interval using the symmetrical, special-case instances of many of our statistical distributions. We have to build it using an asymmetrical distribution whose shape (degree of noncentrality) depends on the effect that we found in our sample. This gets messy, complicated, and requires a lot of computation. As such, the calculation of these confidence intervals was not practical until it became commonplace for researchers to have the computational power of modern computing systems at their disposal. However, research in the social sciences has been around much longer than your everyday, affordable, quad-core laptop, and because building confidence intervals around effects from noncentral distributions was impractical for much of the history of the social sciences, these statistical techniques were not often taught, and their lack of use is likely an artifact of institutional history (Steiger & Fouladi, 1997).

All of this is to say that in today’s world, researchers generally have more than enough computational power at their disposal to easily and efficiently construct a confidence interval around an effect from a noncentral distribution. The barriers to these statistical techniques have been largely removed, and as the value of the information obtained from a confidence interval exceeds the value of the information that can be obtained from NHST, it is useful to spread the word about resources that can help in the computation of confidence intervals around common effect size metrics in the social and behavioral sciences.

One resource that I found to be particularly useful is the MBESS (Methods for the Behavioral, Educational, and Social Sciences) package for the R statistical software platform. For those unfamiliar with R, it is a free, open-source statistical software package which can be run on Unix, Mac, and Windows platforms. The standard R software contains basic statistics functionality, but also provides the capability for contributors to develop their own functionality (typically referred to as ‘packages’) which can be made available to the larger user community for download. MBESS is one such package which provides ninety-seven different functions for statistical procedures that are readily applicable to statistical analysis techniques in the behavioral, educational, and social sciences. Twenty-five of these functions involve the calculation of confidence intervals or confidence limits, mostly for statistics stemming from noncentral distributions.

For example, I used the ci.pvaf (confidence interval of the proportion of variance accounted for) function from the MBESS package to obtain a 95% confidence interval around an η2 effect of 0.11 from a one-way between groups analysis of variance. In order to do this, I only needed to supply the function with several relevant arguments:

F-value: This is the F-value from a fixed-effects ANOVA
df: The numerator and denominator degrees of freedom from the analysis
N: The sample size
Confidence Level: The confidence level coverage that you desire (i.e. 95%)

No more information is required. Based on this, the function can calculate the desired confidence interval around the effect. Here is a copy of the code that I entered and the output it produced (with comments, prefixed by #, explaining what is going on in each step):

# Once you have installed the MBESS package, this command makes it
# available for your current session of R:
library(MBESS)

# This uses the ci.pvaf function in the MBESS package to calculate the
# confidence interval. I have given the function an F-value (F.value) of
# 4.97, with 2 degrees of freedom between groups (df.1), 81 degrees of
# freedom within groups (df.2), a sample size (N) of 84, and have asked
# it to produce a 95% confidence interval (conf.level):
ci.pvaf(F.value=4.97, df.1=2, df.2=81, N=84, conf.level=.95)

Executing the above command produces the following output:

[1] 0.007611619

[1] 0.025

[1] 0.2320935

[1] 0.025

[1] 0.95

Thus, the 95% confidence interval around my η2 effect is [0.01, 0.23].
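For readers outside R, the pivoting procedure that MBESS performs can be sketched in Python with scipy's noncentral F distribution. This is an illustrative re-implementation of the general technique, not the MBESS source; it assumes Steiger's conversion from the noncentrality parameter λ to the proportion of variance, λ/(λ + N):

```python
from scipy.stats import ncf
from scipy.optimize import brentq

def ci_pvaf(F, df1, df2, N, conf=0.95):
    """CI for the proportion of variance accounted for in a fixed-effects
    one-way ANOVA, by pivoting the noncentral F CDF (Steiger's method)."""
    alpha = 1 - conf
    # Lower limit: the noncentrality lambda at which the observed F sits
    # at the (1 - alpha/2) quantile; 0 if even lambda = 0 is compatible.
    def upper_tail(lam):
        return ncf.cdf(F, df1, df2, lam) - (1 - alpha / 2)
    lam_lo = brentq(upper_tail, 0, 1000) if upper_tail(0) > 0 else 0.0
    # Upper limit: the lambda at which the observed F sits at alpha/2.
    def lower_tail(lam):
        return ncf.cdf(F, df1, df2, lam) - alpha / 2
    lam_hi = brentq(lower_tail, 0, 1000)
    # Convert each noncentrality limit to a proportion of variance.
    return lam_lo / (lam_lo + N), lam_hi / (lam_hi + N)

lo, hi = ci_pvaf(F=4.97, df1=2, df2=81, N=84)
print(round(lo, 2), round(hi, 2))  # reproduces the [0.01, 0.23] interval above
```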

Similar functions are available in the MBESS package for calculating confidence intervals around a contrast in a fixed-effects ANOVA, multiple correlation coefficient, squared multiple correlation coefficient, regression coefficient, reliability coefficient, RMSEA, standardized mean difference, signal-to-noise ratio, and χ2 parameters, among others.

Additional Resources
  • Fred Hasselman has created a brief tutorial for computing effect size confidence intervals using R.

  • For those more familiar with conducting statistics in an SPSS environment, Dr. Karl Wuensch at East Carolina University provides links to several SPSS programs on his Web Page, including one for calculating confidence intervals for a standardized mean difference (Cohen’s d).

  • In addition, I came across several publications that I found useful in providing background information regarding non-central distributions (a few of which are cited above). I’m sure there are more, but I found these to be a good place to start:

Cumming, G. (2006). How the noncentral t distribution got its hump. Paper presented at the seventh International Conference on Teaching Statistics, Salvador, Bahia, Brazil.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7-29. DOI: 10.1177/0956797613504966

Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20, 1-24.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational And Psychological Measurement, 61(4), 605-632. doi:10.1177/00131640121971392

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-256). Mahwah, NJ: Erlbaum.

Hopefully others find this information as useful as I did!

Feb 27, 2014

Data trawling and bycatch – using it well


Pre-registration is starting to outgrow its old home, clinical trials. Because it is a good way to show (a) that your theory can make viable predictions and (b) that your empirical finding is not vulnerable to hypothesising after the results are known (HARKing) and some other questionable research practices, more and more scientists endorse and actually do pre-registration. Many remain wary, though, and some simply think pre-registration cannot work for their kind of research. A recent amendment (October 2013) to the Declaration of Helsinki mandates public registration of all research on humans before recruiting the first subject, and the publication of all results, positive, negative and inconclusive.

For some of science the widespread “fishing for significance” metaphor illustrates the problem well: Like an experimental scientist, the fisherman casts out the rod many times, tinkering with a variety of baits and bobbers, one at a time, trying to make a good catch, but possibly developing a superstition about the best bobber. And, like an experimental scientist, if he returns the next day to the same spot, it would be easy to check whether the success of the bobber replicates. If he prefers to tell fishing lore and enshrine his bobber in a display at his home, other fishermen can evaluate his lore by doing as he did in his stories.

Some disciplines (epidemiology, economics, developmental and personality psychology come to mind) proceed, quite legitimately, more like fishing trawlers – that is to say data collection is a laborious, time-consuming, collaborative endeavour. Because these operations are so large and complex, some data bycatch will inevitably end up in the dragnet.


Feb 5, 2014

Open Data and IRBs


Among other things the open science movement encourages “open data” practices, that is, researchers making data freely available on personal/lab websites or institutional repositories for others to use. For some, open data is a necessity as the NIH and NSF have adopted data-sharing policies and require some grant applications to include data management and dissemination plans. According to the NIH:

“...all data should be considered for data sharing. Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data.” (emphasis theirs)

Before making human subject data open, several issues must be considered. First, data should be de-identified to maintain subject confidentiality, so that responses cannot be linked to identities and the data are effectively anonymous. Second, researchers should consider their Institutional Review Board’s (IRB) policies about data sharing. (Disclosure: I have been a member of my university's IRB for 6 years and chair of my Departmental Review Board, DRB, for 7 years.)

Unfortunately, while the policies and procedures of all IRBs require researchers to obtain consent, disclose study procedures to subjects, and maintain confidentiality, it is unknown how many IRBs have policies and procedures for open data dissemination. Thus, a conflict may arise between researchers who want to adopt open data practices or need to disseminate data (those with NIH or NSF grants) and the judgments of IRBs.

This is an especially important issue for those who want to share data that are already collected: can such data be openly disseminated without IRB review? (I address this below when I offer recommendations.) What can researchers do when they want or need to share data freely, but their IRB does not have a clear policy? And what say does an IRB have in open data practices?

While IRBs should be consulted and informed about open data, as I delineate below, IRBs are not now and were never intended to be data-monitoring groups (Bankert & Amdur, 2000). Given their regulated purview, scope, and responsibilities, IRBs have little say in whether a researcher can share data.

IRBs in the United States are regulated under US Health and Human Services (HHS) guidelines for Protection of Human Subjects. The guidelines describe the composition of IRBs and their record keeping, define levels of risk, list specific duties of IRBs, and hint at their limits.

When they function appropriately IRBs review research protocols to (1) evaluate risks; (2) determine whether subject confidentiality is maintained, that is, whether responses are linked to identities (‘confidentiality’ differs from ‘privacy’, which means others will not know a person participated in a study); and (3) evaluate whether subjects are given sufficient information about risks, procedures, privacy, and confidentiality. HHS Regulations Part 46, Subpart A, Section 111 ("Criteria for IRB Approval of Research") (a)(2), is very specific on the purview of IRBs in evaluating protocols:

"In evaluating risks and benefits, the IRB should consider only those risks and benefits that may result from the research (as distinguished from risks and benefits of therapies subjects would receive even if not participating in the research). The IRB should not consider possible long-range effects of applying knowledge gained in the research (for example, the possible effects of the research on public policy) as among those research risks that fall within the purview of its responsibility." [emphasis added]

And regulations §46.111 (a)(6) and (a)(7) state that IRBs are to evaluate the safety, privacy, and confidentiality of subjects in proposed research:

(a)(6) "When appropriate, the research plan makes adequate provision for monitoring the data collected to ensure the safety of subjects.” (a)(7) “When appropriate, there are adequate provisions to protect the privacy of subjects and to maintain the confidentiality of data."

The regulations make it clear that IRBs should consider only risks directly related to the study, and explicitly forbid IRBs from evaluating potential long-range effects of new knowledge gained from the study, as in new knowledge resulting from data sharing. Thus, IRBs should concern themselves with evaluating a study for safety, confidentiality, and that information is disclosed; reviewing existing data for dissemination is not under the purview of the IRB. The only issue that should concern IRBs about open data is whether the data are de-identified to “...protect the privacy of subjects and to maintain the confidentiality of data." It is not the responsibility of the IRB to monitor data, that responsibility falls to the researcher.

Nonetheless, IRBs may take the position that they are data monitors and deny a researcher’s request to openly disseminate data. In denying a request an IRB may use the argument ‘subjects would not have participated if they knew the data would be openly shared.’ In this case, IRBs would be playing mind-readers; there is no way an IRB can assume subjects would not have participated if they knew data would be openly shared. However, whether a person would decline to participate if they were informed about a researcher’s intent to openly disseminate data is an empirical question.

Also, with this argument the IRB is implicitly suggesting subjects would need to have been informed about open data dissemination in the consent form. But, such a requirement for consent forms neglects other federal guidelines. The Belmont Report provides responsibilities for human researchers, much like the APA's ethical principles, and describes what information should be included in the consent process:

“Most codes of research establish specific items for disclosure intended to assure that subjects are given sufficient information. These items generally include: the research procedure, their purposes, risks and anticipated benefits, alternative procedures (where therapy is involved), and a statement offering the subject the opportunity to ask questions and to withdraw at any time from the research.”

The Belmont Report does not even mention that subjects should be informed about the potential long-range plans or uses of the data they provide. Indeed, researchers do not have to tell subjects what analyses will be used, and for good reason. All the Belmont requires is for subjects be informed about the purpose of the study, the procedures, and be informed about their privacy and confidentiality of responses.

Another argument an IRB could make is the data could be used maliciously. For example, a researcher could make a data set open that included ethnicity and test scores and someone else could use that data to show certain ethnic groups are smarter than others. (This example is based on a recent Open Science Framework post that is the basis for this post.)

Although it is more likely that open data would be used as intended, someone could use the data in ways that were not intended and may find a relationship between ethnicity and test scores. So what? The data are not malicious or problematic; it is the person using (misusing?) the data, and IRBs should not be in the habit of allowing only politically correct research to proceed (Lilienfeld, 2010). Also, by considering what others might do with open data, IRBs would be mind-reading and overstepping their purview by considering “...long-range effects of applying knowledge gained in the research (for example, the possible effects of the research on public policy).”

The bottom line is IRBs cannot know whether subjects would not have participated in a project if they knew the data would be openly disseminated, or potential findings by others. Federal regulations inform IRBs of their specific duties, which do not include data monitoring or making judgments on open data dissemination; those duties are the responsibilities of the researcher.

So what should you do if you want to make your data open? First, don't fear the IRB, but don’t forget the IRB. Perhaps re-examine IRB policies any time you plan a new project to remind yourself of the IRB requirements.

Second, making your data open does depend on what subjects agree to on the consent form, and this is especially important if you want to make existing data open. If subjects are told their participation will remain private (identities not disclosed) and responses will remain confidential (identities not linked to responses), openly disseminating de-identified data would not violate the agreement. However, if subjects were told the data would ‘not be disseminated’, the researcher may violate the agreement if they openly share data. In this case the IRB would need to be involved, subjects may need to re-consent to allow their responses to be disseminated, and new IRB approval may be needed as the original consent agreement may change.

Third, de-identify data sets you plan to make open. This includes removing names, student IDs, the subject numbers, timestamps, and anything else that could be used to uniquely identify a person.

Fourth, inform your IRB and department of your intentions. Describe your de-identification process and that you are engaging in open data practices as you see appropriate while maintaining subject confidentiality and privacy. (If someone objects, direct them toward federal IRB regulations.)

Finally, work with your IRB to develop guidelines and policies for data sharing. Given the speed and recency of the open science and open data movements, it is unlikely many IRBs have considered such policies.

We want greater transparency in science, and open data is one practice that can help. The IRB should not be seen as a hurdle or barrier to disseminating data, but as a reminder that one of the best practices in science is to ensure the integrity of our data and communications by responsibly maintaining the confidentiality and privacy of our research subjects.


Bankert, E., & Amdur, R. (2000). The IRB is not a data and safety monitoring board. IRB: Ethics and Human Research, 22(6), 9-11.

De Wolfe, V. A., Sieber, J. E., Steel, P. M., & Zarate, A. O. (2005). Part I: What is the requirement for data sharing? IRB: Ethics and Human Research, 27(6), 12-16.

De Wolfe, V. A., Sieber, J. E., Steel, P. M., & Zarate, A. O. (2006). Part III: Meeting the challenge when data sharing is required. IRB: Ethics and Human Research, 28(2), 10-15.

Lilienfeld, S.O. (2010). Can psychology become a science? Personality and Individual Differences, 49, 281-288.

Jan 29, 2014

Privacy in the Age of Open Data


Nothing is really private anymore. Corporations like Facebook and Google have been collecting our information for some time, and selling it in aggregate to the highest bidder. People have been raising concerns over these invasions of privacy, but generally only technically-savvy, highly motivated people can really be successful at remaining anonymous in this new digital world.

For a variety of incredibly important reasons, we are moving towards open research data as a scientific norm – that is, micro datasets and statistical syntax openly available to anyone who wants it. However, some people are uncomfortable with open research data, because they have concerns about privacy and confidentiality violations. Some of these violations are even making the news: A high profile case about people being identified from their publicly shared genetic information comes to mind.

With open data comes increased responsibility. As researchers, we need to take particular care to balance the advantages of data-sharing with the need to protect research participants from harm. I’m particularly primed for this issue because my own research often intersects with clinical psychology. I ask questions about things like depression, anxiety, eating disorders, substance use and conflict with romantic partners. The data collected in many of my studies have the potential to seriously harm the reputation – and potentially the mental health – of participants if linked to their identity by a malicious person. That said, I believe in the value of open data sharing. In this post, I’m going to discuss a few core issues as they pertain to de-identification – that is, ensuring the anonymity of participants in an openly shared dataset. Violations of privacy will always be a risk: However, some relatively simple steps on the part of the researcher can make re-identification of individual participants much more challenging.

Who are we protecting the data from?

Throughout the process, it’s helpful to imagine yourself as a person trying to get dirt on a potential participant. Of course, this is ignoring the fact that very few people are likely to use data for malicious purposes … but for now, let’s just consider the rare cases where this might happen. It only takes one high-profile incident to be a public relations and ethics nightmare for your research! There are two possibilities for malicious users that I can think of:

  1. Identity thieves who don’t know the participant directly, but are looking for enough personal information to duplicate someone’s identity for criminal activities, such as credit card fraud. These users are unlikely to know anything about participants ahead of time, so they have a much more challenging job because they have to be able to identify people exclusively using publicly available information.

  2. People who know the participant in real-life and want to find out private information about someone for some unpleasant purpose (e.g., stalkers, jealous romantic partners, a fired employee, etc.). In this case, the party likely knows (a) that the person of interest is in your dataset; (b) basic demographic information on the person such as sex, age, occupation, and the city they live in. Whether or not this user is successful in identifying individuals in an open dataset depends on what exactly the researcher has shared. For fine-grained data, it could be very easy; however, for properly de-identified data, it should be virtually impossible.

Key Identifiers to Consider when De-Identifying Data

The primary way to safeguard privacy in publicly shared data is to avoid identifiers; that is, pieces of information that can be used directly or indirectly to determine a person’s identity. A useful starting point for this is the list of 18 identifiers indicated in the Health Insurance Portability and Accountability Act that are to be used with Protected Health Information. A full list of these identifiers can be found here. Many of these identifiers are obvious (e.g., no names, phone numbers, SIN numbers, etc.), but some are worth discussing more specifically in the context of a psychological research paradigm that shares data openly.

Demographic variables. Most of the variables that psychologists are interested in are not going to be very informative for identifying individuals. For example, reaction time data (even if unique to an individual) are very unlikely to identify participants – and in any event, most people are unlikely to care if other people know that they respond 50ms faster to certain types of visual stimuli. The type of data that are generally problematic are what I’ll call “demographic variables”: things like sex, ethnicity, age, occupation, university major, etc. These data are sometimes used in analyses, but most often are just used to characterize the sample in the participants section of manuscripts. Most of the time, demographic variables can’t be used in isolation to identify people; instead, combinations of variables are used (e.g., a 27-year-old Mexican woman who works as a nurse may be the only person with that combination of traits in the data, leaving her vulnerable to loss of privacy). Because the combination of several demographic characteristics can potentially produce identifiable profiles, a common rule of thumb I picked up when working with Statistics Canada is to require a minimum of 5 participants per cell. In other words, if a particular combination of demographic features yields fewer than 5 individuals, the group will be collapsed into a larger, more anonymous, aggregate group. The most common example of this would be using age ranges (e.g., ages 18-25) instead of exact ages; similar logic could apply to most demographic variables. This rule can get restrictive fast (but also demonstrates how little data can be required to identify individual people!) so ideally, share only the demographic information that is theoretically and empirically important to your research area.
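To make the 5-per-cell rule concrete, here is a toy Python sketch (the records and categories are hypothetical, my own invention) that bins exact ages into ranges and then flags demographic cells that still hold fewer than five participants:

```python
from collections import Counter

# Hypothetical demographic records: (sex, ethnicity, exact age).
records = [("F", "Mexican", 27), ("M", "Irish", 34), ("F", "Mexican", 29),
           ("M", "Irish", 31), ("F", "Irish", 22), ("M", "Mexican", 45),
           ("F", "Irish", 24), ("M", "Irish", 38), ("F", "Irish", 19)]

def age_band(age):
    # Replace exact age with a coarse range to enlarge the cells.
    if age <= 25:
        return "18-25"
    if age <= 35:
        return "26-35"
    return "36-45"

binned = [(sex, eth, age_band(age)) for sex, eth, age in records]
counts = Counter(binned)

# Any cell with fewer than 5 members is a re-identification risk and
# should be collapsed further (or, in the worst case, reported missing).
risky = [cell for cell, n in counts.items() if n < 5]
print(len(risky))  # 5 -- every cell is still too small; the rule bites fast
```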

Outliers and rare values. Another major issue is outliers and other rare values. Outliers are variably defined depending on the statistical text you read, but generally refer to extreme values on variables measured on continuous, interval, or ordinal scales (e.g., someone has an IQ of 150 in your sample, and the next highest person is 120). Rare values refer to categorical data that very few people endorse (e.g., the only physics professor in a sample). There are lots of different ways you can deal with outliers, and there’s not necessarily a lot of agreement on which is the best – indeed, it’s one of those researcher degrees of freedom you might have heard about. Though this may depend on the sensitivity of the data in question, outliers often have the potential to be a privacy risk. From a privacy standpoint, it may be best for the researcher to deal with outliers by deleting or transforming them before sharing the data. For rare values, you can collapse response options together until there are no more unique values (e.g., perhaps classify the physics professor as a “teaching professional” if there are other teachers in the sample). In the worst case scenario, you may need to report the value as missing data (e.g., a single intersex person in your sample who doesn’t identify as male or female). Whatever you decide, you should disclose your strategy for dealing with outliers and rare values in the accompanying documentation so it is clear for everyone using the data.

Dates. Though it might not be immediately obvious, any exact dates in the dataset place participants at risk for re-identification. For example, if someone knew what day the participant took part in a study (e.g., they mention it to a friend; they’re seen in a participant waiting area) then their data would be easily identifiable by this date. To minimize privacy risks, no exact dates should be included in the shared dataset. If dates are necessary for certain analyses, transforming the data into some less identifiable format that is still useful for analyses is preferable (e.g., have variables for “day of week” or “number of days in between measurement occasions” if these are important).
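A minimal sketch of this kind of transformation, using Python’s standard datetime module (the variable names are my own):

```python
from datetime import date

def deidentify_dates(dates):
    """Replace exact testing dates with day-of-week and days since the
    first measurement occasion -- often all an analysis really needs."""
    first = min(dates)
    return [{"day_of_week": d.strftime("%A"),
             "days_since_first": (d - first).days} for d in dates]
```

The exact calendar dates never need to appear in the shared file; only the derived variables do.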

Geographic Locations. The rule of having “no geographic subdivisions smaller than a state” from the HIPAA guidelines is immediately problematic for many studies. Most researchers collect data from their surrounding community. Thus, it will be impossible to blind the geographic location in many circumstances (e.g., if I recruit psychology students for my study, it will be easy for others to infer that I did so from my place of employment at Dalhousie University). So at a minimum, people will know that participants are probably living relatively close to my place of employment. This is going to be unavoidable in many circumstances, but in most cases it should not be enough to identify participants. However, you will need to consider if this geographical information can be combined with other demographic information to potentially identify people, since it will not be possible to suppress this information in many cases. Aside from that, you’ll just have to do your best to avoid more finely grained geographical information. For example, in Canada, a reverse lookup of postal codes can identify some locations with a surprising degree of accuracy, sometimes down to a particular street!
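One low-tech safeguard is to share only the coarse half of a postal code. In Canada, the first three characters (the “forward sortation area”) cover a fairly large region, while the full code can pinpoint a street. A hypothetical sketch (the example code below is made up):

```python
def coarsen_postal_code(code):
    """Keep only the forward sortation area of a Canadian postal code,
    dropping the second half, which can identify street-level locations."""
    return code.replace(" ", "").upper()[:3]
```

Analogous truncation applies to ZIP codes and similar systems elsewhere, subject to local rules like the HIPAA geographic guidelines mentioned above.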

Participant ID numbers. Almost every dataset will (and should) have a unique identification number for each participant. If this is just a randomly selected number, there are no major issues. However, most researchers I know generate ID numbers in non-random ways. For example, in my own research on romantic couples we assign ID numbers chronologically, with a suffix number of “1” indicating men and “2” indicating women. So ID 003-2 would be the third couple that participated, and the male within that couple. In this kind of research, the most likely person to snoop would probably be the other romantic partner. If I were to leave the ID numbers as originally entered, the romantic partner would easily be able to find their own partner’s data (assuming a heterosexual relationship and that participants remember their own ID number). There are many other algorithms researchers might use to create ID numbers, many of which do not provide helpful information to other researchers, but could be used to identify people. Before freely sharing data, you might consider scrambling the unique ID numbers so that they cannot be a privacy risk (you can, of course, keep a record of the original ID numbers in your own files if needed for administrative purposes).
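A sketch of that last suggestion in plain Python (the four-digit format of the new IDs is an arbitrary choice of mine):

```python
import random

def scramble_ids(old_ids, seed=None):
    """Map informative participant IDs to random four-digit codes.
    Share only the new IDs; keep the returned mapping in your own
    private records for administrative purposes."""
    rng = random.Random(seed)
    new_codes = rng.sample(range(1000, 10000), len(old_ids))
    return {old: str(code) for old, code in zip(old_ids, new_codes)}
```

Because the new codes carry no chronology, sex suffix, or couple structure, a snooping partner can no longer locate anyone’s row from a remembered ID.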

Some Final Thoughts

Risk of re-identification is never zero. Especially when data are shared openly online, there will always be a risk for participants. Making sure participants are fully informed about the risks involved during the consent process is essential. Careless sharing of data could result in a breach of privacy, which could have extremely negative consequences both for the participants and for your own research program. However, with proper safeguards, the risk of re-identification is low, in part due to some naturally occurring features of research. The slow, plodding pace of scientific research inadvertently protects the privacy of participants: Databases are likely to be 1-3 years old by the time they are posted, and people can change considerably within that time, making them harder to identify. Naturally occurring noise (e.g., missing data, imputation, errors by participants) also impedes the ability to identify people, and the variables psychologists are usually most interested in are often not likely candidates to re-identify someone.

As a community of scientists devoted to making science more transparent and open, we also carry the responsibility of protecting the privacy and rights of participants as much as is possible. I don’t think we have all the answers yet, and there’s a lot more to consider when moving forward. Ethical principles are not static; there are no single “right” answers that will be appropriate for all research, and standards will change as technology and social mores change with each generation. Still, by moving forward with an open mind, and a strong ethical conscience to protect the privacy of participants, I believe that data can really be both open and private.

Jan 22, 2014

Open Projects - Wikipedia Project Medicine


This article is the first in a series highlighting open science projects around the community. You can read the interview this article was based on, in versions both edited for clarity and unedited.

Six years ago, Doctor James Heilman was working a night shift in the ER when he came across an error-ridden article on Wikipedia. Someone else might have used the article to dismiss the online encyclopedia, which was then less than half the size it is now. Instead, Heilman decided to improve the article. “I noticed an edit button and realized that I could fix it. Sort of got hooked from there. I’m still finding lots of articles that need a great deal of work before they reflect the best available medical evidence.”

Heilman, who goes by the username Jmh649 on Wikipedia, is now the president of the board of Wiki Project Med. A non-profit corporation created to promote medical content on Wikipedia, WPM contains over a dozen different initiatives aimed at adding and improving articles, building relationships with schools, journals and other medical organizations, and increasing access to research.

One of the initiatives closest to Heilman’s heart is the Translation Task Force, an effort to identify key medical articles and translate them into as many languages as possible. These articles cover common and potentially deadly medical circumstances, such as gastroenteritis (diarrhea), birth control, HIV/AIDS, and burns. With the help of Translators Without Borders, over 3 million words have been translated into about 60 languages. One of these languages is Yoruba, a West African language. Although Yoruba is spoken by nearly 30 million people, there are only a few editors working to translate medical articles into it.

“The first two billion people online by and large speak/understand at least one of the wealthy languages of the world. With more and more people getting online via cellphones that is not going to be true for the next 5 billion coming online. Many of them will find little that they can understand.” Wikipedia Zero, a program which provides users in some developing countries access to Wikipedia without mobile data charges, is increasing access to the site.

“People are, for better or worse, learning about life and death issues through Wikipedia. So we need to make sure that content is accurate, up to date, well-sourced, comprehensive, and accessible. For readers with no native medical literature, Wikipedia may well be the only option they have to learn about health and disease.”

That’s Jake Orlowitz (Ocaasi), WPM’s outreach coordinator. He and Heilman stress that there’s a lot of need for volunteer help, and not just with translating. Of the 80+ articles identified as key, only 31 are ready to be translated. The rest need citations verified, jargon simplified, content updated and restructured, and more.

In an effort to find more expert contributors, WPM has launched a number of initiatives to partner with medical schools and other research organizations. Orlowitz was recently a course ambassador to the UCSF medical school, where students edited Wikipedia articles for credit. He also set up a partnership with the Cochrane Collaboration, a non-profit made up of over 30,000 volunteers, mostly medical professionals, who conduct reviews of medical interventions. “We arranged a donation of 100 full access accounts to The Cochrane Library, and we are currently coordinating a Wikipedian in Residence position with them. That person will teach dozens of Cochrane authors how to incorporate their findings into Wikipedia,” explains Orlowitz.

Those who are familiar with how Wikipedia is edited might balk at the thought of contributing. Won’t they be drawn into “edit wars”, endless battles with people who don’t believe in evolution or who just enjoy conflict? “There are edit wars,” admits Heilman. “They are not that common though. 99% of articles can be easily edited without problems.”

Orlowitz elaborates on some of the problems that arise. “We have a lot of new editors who don't understand evidence quality.” The medical experts they recruit face a different set of challenges. “One difficulty many experts have is that they wish to reference their own primary sources. Or write about themselves. Both those are frowned upon. We also have some drug and device companies that edit articles in their area of business--we discourage this strongly and it's something we keep an eye on.”

And what about legitimate differences of opinion about as yet unsettled medical theories, facts and treatments?

“Wikipedia 'describes debates rather than engaging in them'. We don't take sides, we just summarize the evidence on all sides--in proportion to the quality and quantity of that evidence,” says Orlowitz. Heilman continues: “For example Cochrane reviews state it is unclear if the risk versus benefits of breast cancer screening are positive or negative. The USPSTF is supportive. We state both.” Wikipedia provides detailed guidelines for evaluating sources and dealing with conflicting evidence.

Another reason academics might hesitate before contributing is the poor reputation Wikipedia has in academic circles. Another initiative, the Wikipedia-journal collaboration, states: “One reason some academics express for not contributing to Wikipedia is that they are unable to get the recognition they require for their current professional position. A number of medical journals have agreed in principle to publishing high quality Wikipedia articles under authors' real names following formal peer review.” A pilot paper, adapted from the Wikipedia article on Dengue Fever, is to be published in the Journal of Open Medicine, with more publications hopefully to come.

The stigma against Wikipedia itself is also decreasing. “The usage stats for the lay public, medical students, junior physicians, doctors, and pharmacists are just mindbogglingly high. It's in the range of 50-90%, even for clinical professionals. We hear a lot that doctors 'jog their memory' with Wikipedia, or use it as a starting point,” says Orlowitz. One 2013 study found that a third or more of general practitioners, specialists and medical professors had used Wikipedia, with over half of physicians in training accessing it. As more diverse kinds of scientific contributions begin to be recognized, Wikipedia edits may make their way onto CVs.

Open science activists may be disappointed to learn that Wikipedia doesn’t require or even prefer open access sources for its articles. “Our policy simply states that our primary concern is article content, and verifi*ability*. That standard is irrespective of how hard or easy it is to verify,” explains Orlowitz. Both Wikipedians personally support open access, and would welcome efforts to supplement closed access citations with open ones. “If there are multiple sources of equal quality that come to the same conclusions we support using the open source ones,” says Heilman. A new project, the Open Access Signalling project, aims to help readers quickly distinguish what sources they’ll be able to access.

So what are the best ways for newcomers to get involved? Heilman stresses that editing articles remains one of the most important tasks of the project. This is especially true of people affiliated with universities. “Ironically, since these folks have access to high quality paywalled sources, one great thing they could do would be to update articles with them. We also could explore affiliating a Wikipedia editor with a university as a Visiting Scholar, so they'd have access to the library's catalogue to improve Wikipedia, in the spirit of research affiliates,” says Orlowitz.

Adds Heilman, “If there are institutions that would be willing to donate library accounts to Wikipedians we would appreciate it. This would require having the Wikipedian register in some manner with the university. There are also a number of us who may be willing / able to speak to Universities that wish to learn more about the place of Wikipedia in Medicine.” The two also speak at conferences and other events.

Wiki Project Med, like Wikipedia itself, is an open community - a “do-ocracy”, as Orlowitz calls it. If you’re interested in learning more, or in getting involved, you can check out their project page, which details their many initiatives, or reach out to Orlowitz or the project as a whole on Twitter (@JakeOrlowitz, @WikiProjectMed) or via email (jorlowitz@gmail.com, wikiprojectmed@gmail.com).

Jan 15, 2014

The APA and Open Data: one step forward, two steps back?


I was pleasantly surprised when, last year, I was approached with the request to become Consulting Editor for a new APA journal called Archives of Scientific Psychology. The journal, as advertised on its website upon launch, had a distinct Open Science signature. As its motto said, it was an “Open Methodology, Open Data, Open Access journal”. That’s a lot of openness indeed.

When the journal started, the website not only touted the Open Access feature of the journal, but went on to say that "[t]he authors have made available for use by others the data that underlie the analyses presented in the paper". This was an incredibly daring move by the APA - or so it seemed. Of course, I happily accepted the position.

After a few months, the first papers in Archives were published. Open Data enthusiast Jelte Wicherts of Tilburg University immediately tried to retrieve data for reanalysis. Then it turned out that the APA holds a quite idiosyncratic definition of the word “open”: upon his request, Wicherts was referred to a website that presented a daunting list of requirements for data-requests to fulfill. That was quite a bit more intimidating than the positive tone struck in the editorial that accompanied the launch of the journal.

This didn’t seem open to me at all. So: I approached the editors and said that I could not subscribe to this procedure, given the fact that the journal is supposed to have open data. The editors then informed me that their choice to implement these procedures was an entirely conscious one, and that they stood by it. Their point of view is articulated in their data sharing guidelines. For instance, "next-users of data must formally agree to offer co-authorship to the generator(s) of the data on any subsequent publications" since "[i]t is the opinion of the Archives editors that designing and conducting the original data collection is a scientific contribution that cannot be exhausted after one use of the data; it resides in the data permanently."

Well, that's not my opinion at all. In fact it's quite directly opposed to virtually everything I think is important about openness in scientific research. So I chose to resign my position.

In October 2013, I learned that Wicherts had taken the initiative of exposing the Archives’ policy in an open letter to the editorial board, in which he says:

“[…] I recently learned that data from empirical articles published in the Archives are not even close to being “open”.

In fact, a request for data published in the Archives involves not only a full-blown review committee but also the filling in and signing of an extensive form: http://www.apa.org/pubs/journals/features/arc-data-access-request-form.pdf

This 15-page form asks for the sending of professional resumes, descriptions of the policies concerning academic integrity at one’s institution, explicit research plans including hypotheses and societal relevance, specification of the types of analyses, full ethics approval of the reanalysis by the IRB, descriptions of the background of the research environment, an indication of the primary source of revenue of one’s institution, dissemination plans of the work to be done with the data, a justification for the data request, manners of storage, types of computers and storage media being used, ways of transmitting data between research team members, whether data will be encrypted, and signatures of institutional heads.

The requester of the data also has to sign that (s)he provides an “Offer [of] co-authorship to the data generators on any subsequent publications” and that (s)he will offer to the review committee an “annual data use report that outlines what has been done, that the investigator remains in compliance with the original research proposal, and provide references of any resulting publications.”

In case of non-compliance with any of these stipulations, the requester can face up to a $10,000 fine as well as a future prohibition of data access from work published in the Archives.”

A fine? Seriously? Kafkaesque!

Wicherts also notes that “the guidelines with respect to data sharing in the Archives considerably exceed APA’s Ethical Standard 8.14”. Ethical Standard 8.14 is a default that applies to all APA journals, and says:

“After research results are published, psychologists do not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis and who intend to use such data only for that purpose, provided that the confidentiality of the participants can be protected and unless legal rights concerning proprietary data preclude their release.”

Since this guideline says nothing about fines and co-authorship requirements, we indeed have to conclude that it’s harder to get data from APA’s open science journal than it is to get data from its regular journals. Picture that!

In response to my resignation and Wicherts' letter, the editors have taken an interesting course of action. Rather than change their policy such that their deeds match their name, they have changed their name to match their deeds. The journal is now no longer an "Open Methodology, Open Data, Open Access Journal" but an "Open Methodology, Collaborative Data Sharing, Open Access Journal".

The APA and open data. One step forward, two steps back.

Jan 8, 2014

When Open Science is Hard Science


When it comes to opening up your work there is, ironically, a bit of a secret. Here it is: being open - in open science, open source software, or any other open community - can be hard. Sometimes it can be harder than being closed.

In an effort to attract more people to the cause, advocates of openness tend to tout its benefits. Said benefits are bountiful: increased collaboration and dissemination of ideas, transparency leading to more frequent error checking, improved reproducibility, easier meta-analysis, and greater diversity in participation, just to name a few.

But there are downsides, too. One of those is that it can be difficult to do your research openly. (Note here that I mean well and openly. Taking the full contents of your hard drive and dumping it on a server somewhere might be technically open, but it’s not much use to anyone.)

How is it hard to open up your work? And why?

Closed means privacy.

In the privacy of my own home, I seldom brush my hair. Sometimes I spend all day in my pajamas. I leave my dirty dishes on the table and eat ice cream straight out of the tub. But when I have visitors, or when I’m going out, I make sure to clean up.

In the privacy of a closed access project, you might take shortcuts. You might recruit participants from your own 101 class, or process your data without carefully documenting which steps you took. You’d never intentionally do something unethical, but you might get sloppy.

Humans are social animals. We try to be more perfect for each other than we do for ourselves. This makes openness better, but it also makes it harder.

Two heads need more explanation than one.

As I mentioned above, taking all your work and throwing it online without organization or documentation is not very helpful. There’s a difference between access and accessibility. To create a truly open project, you need to be willing to explain your research to those trying to understand it.

There are numerous routes towards sharing your work, and the most open projects take more than one. You can create stellar documentation of your project. You can point people towards background material, finding good explanations of the way your research methodology was developed or the math behind your data analysis or how the code that runs your stimulus presentation works. You can design tutorials or trainings for people who want to run your study. You can encourage people to ask questions about the project, and reply publicly. You can make sure to do all the above for people at all levels - laypeople, students, and participants as well as colleagues.

Even closed science is usually collaborative, so hopefully your project is decently well documented. But making it accessible to everyone is a project in itself.

New ideas and tools need to be learned.

As long as closed is the default, we’ll need to learn new skills and tools in the process of becoming open, such as version control, format conversion and database management.

These skills aren’t unique to working openly. And if you have a good network of friends and colleagues, you can lean on them to supplement your own expertise. But the fact remains that “going open” isn’t as easy as flipping a switch. Unless you’re already well-connected and well-informed, you’ll have a lot to learn.

People can be exhausting.

Making your work open often means dealing with other people - and not always the people you want to deal with. There are the people who mean well, but end up confusing, misleading, or offending you. There are the people who don’t mean well at all. There are the discussions that go off in unproductive directions, the conversations that turn into conflicts, the promises that get forgotten.

Other people are both a joy and a frustration, in many areas of life beyond open science. But the nature of openness assures you’ll get your fair share. This is especially true of open science projects that are explicitly trying to build community.

It can be all too easy to overlook this emotional labor, but it’s work - hard work, at that.

There are no guarantees.

For all the effort you put into opening up your research, you may find no one else is willing to engage with it. There are plenty of open source software projects with no forks or new contributors, open science articles that are seldom downloaded, science wikis that remain mostly empty, and open government tools or datasets that no one uses.

Open access may increase impact on the whole, but there are no promises for any particular project. It’s a sobering prospect to someone considering opening up their research.

How can we make open science easier?

We can advocate for open science while acknowledging the barriers to achieving it. And we can do our best to lower those barriers:

Forgive imperfections. We need to create an environment where mistakes are routine and failures are expected - only then will researchers feel comfortable exposing their work to widespread review. That’s a tall order in the cutthroat world of academia, but we can begin with our own roles as teachers, mentors, reviewers, and internet commentators. Be a role model: encourage others to review your work and point out your mistakes.

Share your skills as well as your research. Talk about your experiences opening up your research with colleagues. Host lab meetings, department events, and conference panels to discuss the practical difficulties. If a training, website, or individual helped you understand some skill or concept, recommend it widely. Talking about the individual steps will help the journey seem less intimidating - and will give others a map for how to get there.

Recognize the hard work of others with words and, if you can, financial support. Organization, documentation, mentorship, community management. These are areas that often get overlooked when it comes to celebrating scientific achievement - and allocating funding. Yet many open science projects would fail without leadership in these areas. Contribute what you can and support others who take on these roles.

Collaborate. Open source advocates have been creating tools to help share the work involved in opening research - there’s Software Carpentry, the Open Science Framework, Sage Bionetworks, and Research Compendia, just to name a few. But beyond sharing tools, we can share time and resources. Not every researcher will have the skillset, experience, or personality to quickly and easily open up their work. Sharing efforts across labs, departments and even schools can lighten the load. So can open science specialists, if we create a scientific culture where these specialists are trained, utilized and valued.

We can and should demand open scientific practices from our colleagues and our institutions. But we can also provide guidelines, tools, resources and sympathy. Open science is hard. Let’s not make it any harder.

Jan 1, 2014

Timeline of Notable Open Science Events in 2013 - Psychology


Happy New Year! New Year’s is a great time for reflection and resolution, and when I reflect on 2013, I view it with an air of excitement and promise. As a social psychologist, I celebrated with many of my colleagues in Washington, DC at the 25th anniversary of the Association for Psychological Science. There were many celebrations, including an ’80s-themed dance night at the Convention. However, this year was also marred by the “Crisis of Confidence” in psychological and broader sciences that has been percolating since the turn of the 21st century. Our timeline begins the year with Perspectives on Psychological Science’s special issue dedicated to addressing this Crisis. Rather than focusing on the problems, papers in this issue suggested solutions, and many of those suggestions emerged as projects in 2013. This timeline focuses on these many Open Science Collaboration successes and initiatives and offers a glimpse at the activity directed at reaching the Scientific Utopia envisioned by so many in the OSC.

Maybe when APS celebrates its 50th Anniversary, it will also mark the 25th anniversary of the year the tide turned on the bad practices that led to the “Crisis of Confidence”. Perhaps, alongside a ’13-themed band playing Lorde’s “Royals” or Imagine Dragons’ “Demons”, there will be a theme reflecting on changing science practices. With the COS celebrating a 25th anniversary of its own by then, share your memories of the important events from 2013 with us.

These posts reflect a limited list of psychology-related events that one person noticed. We invite you to add other notable events that you feel are missing from this list, particularly in other scientific areas. Add a comment below with information about any research projects aimed at replication across institutions or initiatives directed at making science practices more transparent.

View the timeline!

Dec 18, 2013

Researcher Degrees of Freedom in Data Analysis


The enormous number of options available for modern data analysis is both a blessing and a curse. On one hand, researchers have specialized tools for any number of complex questions. On the other hand, we’re also faced with a staggering number of equally viable choices, many times without any clear-cut guidelines for deciding between them. For instance, I just popped open SPSS statistical software and counted 18 different ways to conduct post-hoc tests for a one-way ANOVA. Some choices are clearly inferior (e.g., the LSD test doesn’t adjust p-values for multiple comparisons) but it’s possible to defend the use of many of the available options. These ambiguous choice points are sometimes referred to as researcher degrees of freedom.

In theory, researcher degrees of freedom shouldn’t be a problem. More choice is better, right? The problem arises from two interconnected issues: (a) ambiguity as to which statistical test is most appropriate and (b) an incentive system where scientists are rewarded with publications, grants, and career stability when their p-values fall below the revered p < .05 criterion. So, perhaps unsurprisingly, when faced with a host of ambiguous options for data analysis, most people settle on the one that achieves statistically significant results. Simmons, Nelson, and Simonsohn (2011) argue that this undisclosed flexibility in data analysis allows people to present almost any data as “significant,” and call for 10 simple guidelines for reviewers and authors to disclose in every paper – which, if you haven’t read them yet, are worth checking out. In this post, I will discuss a few guidelines of my own for conducting data analysis in a way that strives to overcome our inherent tendency to be self-serving.

  1. Make as many data analytic decisions as possible before looking at your data. Review the statistical literature and decide on which statistical test(s) will be best before looking at your collected data. Continue to use those tests until enough evidence emerges to change your mind. The important thing is that you make these decisions before looking at your data. Once you start playing with the actual data, your self-serving biases will start to kick in. Do not underestimate your ability for self-deception: Self-serving biases are powerful, pervasive, and apply to virtually everyone. Consider pre-registering your data analysis plan (perhaps using the Open Science Framework) to keep yourself honest and to convince future reviewers that you aren’t exploiting researcher degrees of freedom.

  2. When faced with a situation where there are too many equally viable choices, run a small number of the best choices, and report all of them. In this case, decide on 2-5 different tests ahead of time. Report the results of all choices, and make a tentative conclusion based on whether the majority of these tests agree. For instance, when determining model fit in structural equation modeling, there are many different methods you might use. If you can’t figure out which method is best by reviewing the statistical literature – it’s not entirely clear; statisticians disagree about as often as any other group of scientists – then report the results of all tests, and make a conclusion if they all converge on the same solution. When they disagree, make a tentative conclusion based on the majority of tests that agree (e.g., 2 of 3 tests come to the same conclusion). For the record, I currently use CFI, TLI, RMSEA, and SRMR in my own work, and use these even if other fit indices provide more favorable results.

  3. When deciding on a data analysis plan after you’ve seen the data, keep in mind that most researcher degrees of freedom have minimal impact on strong results. For any number of reasons, you might find yourself deciding on a data analysis plan after you’ve played around with the data for a while. At the end of the day, strong data will not be influenced much by researcher degrees of freedom. For instance, results should look much the same regardless of whether you exclude outliers, transform them, or leave them in the data when you have a study with high statistical power. Simmons et al. (2011) specifically recommend that results should be presented (a) with and without covariates, and (b) with and without specific data points excluded, if any were removed. Again, the general idea is that strong results will not change much when you alter researcher degrees of freedom. Thus, I again recommend analyzing the data in a few different ways and looking for convergence across all methods when you’re developing a data analysis plan after seeing the data. This sets the bar higher to try and combat your natural tendency to report just the one analysis that “works.” When minor data analytic choices drastically change the conclusions, this should be a warning sign that your solution is unstable and the results are probably not trustworthy. The number one reason why you have an unstable solution is probably because you have low statistical power. Since you hopefully had a strict data collection end date, the only viable alternative when results are unstable is to replicate the results in a second, more highly-powered study using the same data analytic approach.
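The convergence rule in point 2 can be sketched in a few lines of Python (the test names and p-values below are made up for illustration):

```python
def majority_conclusion(pvalues, alpha=0.05):
    """Given p-values from several pre-specified tests, list which were
    significant and draw only a tentative, majority-based conclusion.
    All results should still be reported in the manuscript."""
    sig = sorted(name for name, p in pvalues.items() if p < alpha)
    verdict = "significant" if len(sig) * 2 > len(pvalues) else "not significant"
    return sig, f"tentatively {verdict} ({len(sig)} of {len(pvalues)} tests)"
```

The point is not the helper itself but the discipline it encodes: the set of tests is fixed in advance, and no single "favorable" test gets to speak for the lot.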
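The warning sign in point 3 — conclusions that swing with minor analytic choices — can be checked directly by re-running the same comparison under several exclusion cutoffs. A hypothetical sketch with made-up reaction times:

```python
from statistics import mean

def robustness_check(group_a, group_b, cutoffs):
    """Recompute a group mean difference under several outlier-exclusion
    cutoffs; a strong result should barely move across choices."""
    diffs = {}
    for cut in cutoffs:
        a = [x for x in group_a if x <= cut]
        b = [x for x in group_b if x <= cut]
        diffs[cut] = mean(a) - mean(b)
    return diffs
```

If the difference changes drastically between cutoffs (as it would if one group contained a single 3000ms response), treat the result as unstable rather than cherry-picking the cutoff that “works.”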

At the end of the day, there is no “quick-fix” for the problem of self-serving biases during data analysis so long as the incentive system continues to reward novel, statistically significant results. However, by using the tips in this article (and elsewhere) researchers can focus on finding strong, replicable results by minimizing the natural human tendency to be self-serving.


Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366. doi:10.1177/0956797611417632

Dec 13, 2013

Chasing Paper, Part 3


This is part three of a three part post brainstorming potential improvements to the journal article format. Part one is here, part two is here.

The classic journal article is only readable by domain experts.

Journal articles are currently written for domain experts. While novel concepts or terms are usually explained, there is the assumption of a vast array of background knowledge and jargon is the rule, not the exception. While this leads to quick reading for domain experts, it can make for a difficult slog for everyone else.

Why is this a problem? For one thing, it prevents interdisciplinary collaboration. Researchers will not make a habit of reading outside their field if it takes hours of painstaking, self-directed work to comprehend a single article. It also discourages public engagement. While science writers do admirable work boiling hard concepts down to their comprehensible cores, many non-scientists want to actually read the articles, and get discouraged when they can’t.

While opaque scientific writing exists in every format, new technologies present options to translate and teach. Jargon could be linked to a glossary or other reference material. You could be given a plain-English explanation of a term when your mouse hovers over it. Perhaps each article could have multiple versions - one for domain experts, one for other scientists, and one for laypeople.

Of course, the ability to write accessibly is a skill not everyone has. Luckily, any given paper would mostly use terminology already introduced in previous papers. If researchers could easily credit the teaching and popularization work done by others, they could acknowledge the value of those contributions while at the same time making their own work accessible.

The classic journal article has no universally-agreed upon standards.

Academic publishing, historically, has been a distributed system. Currently, the top three publishers still account for less than half (42%) of all published articles (McGuigan and Russell, 2008). While certain format and content conventions are shared among publishers, generally speaking it’s difficult to propagate new standards, and even harder to enforce them. Not only do standards vary, they are frequently hidden, with most of the review and editing process taking place behind closed doors.

There are benefits to decentralization, but the drawbacks are clear. Widespread adoption of new standards, such as Simmons et al.’s 21 Word Solution or open science practices, depends on the hard work and high status of those advocating for them. How can the article format be changed to better accommodate changing standards, while still retaining individual publishers’ autonomy?

One option might be to create a new section of each journal article, a free-form field where users could record whether an article met this or that standard. Researchers could then independently decide what standards they wanted to pay attention to. While this sounds messy, if properly implemented this feature could be used very much like a search filter, yet would not require the creation or maintenance of a centralized database.

A different approach is already being embraced: an effort to make the standards that currently exist more transparent by bringing peer review out into the open. Open peer review allows readers to view an article’s pre-publication history, including the authorship and content of peer reviews, while public peer review allows the public to participate in the review process. However, these methods have yet to be generally adopted.


It’s clear that journal articles are already changing. But they may not be changing fast enough. It may be better to forgo the trappings of the journal article entirely, and seek a new system that more naturally encourages collaboration, curation, and the efficient use of the incredible resources at our disposal. With journal articles commonly costing more than $30 each, some might jump at the chance to leave them behind.

Of course, it’s easy to play “what if” and imagine alternatives; it’s far harder to actually implement them. And not all innovations are improvements. But with over a billion dollars spent on research each day in the United States, with over 25,000 journals in existence, and over a million articles published each year, surely there is room to experiment.


Budd, J.M., Coble, Z.C. and Anderson, K.M. (2011) Retracted Publications in Biomedicine: Cause for Concern.

Wright, K. and McDaid, C. (2011). Reporting of article retractions in bibliographic databases and online journals. Journal of the Medical Library Association, 99(2), 164-167.

McGuigan, G.S. and Russell, R.D. (2008). The Business of Academic Publishing: A Strategic Analysis of the Academic Journal Publishing Industry and its Impact on the Future of Scholarly Publishing. Electronic Journal of Academic and Special Librarianship. Winter 2008; 9(3).

Simmons, J.P., Nelson, L.D. and Simonsohn, U.A. (2012) A 21 Word Solution.

Dec 12, 2013

Chasing Paper, Part 2


This is part two of a three part post brainstorming potential improvements to the journal article format. Part one is here, part three is here.

The classic journal article format is not easily updated or corrected.

Scientific understanding is constantly changing as phenomena are discovered and mistakes uncovered. The classic journal article, however, is static. When a serious flaw in an article is found, the best a paper-based system can do is issue a retraction, and hope that a reader going through past issues will eventually come across the change.

Surprisingly, retractions and corrections continue to go mostly unnoticed in the digital era. Studies have shown that retracted papers go on to receive, on average, more than 10 post-retraction citations, with less than 10% of those citations acknowledging the retraction (Budd et al, 2011). Why is this happening? While many article databases such as PubMed provide retraction notices, the articles themselves are often not amended. Readers accessing papers directly from publishers’ websites, or from previously saved copies, can miss the notices entirely. A case study of 18 retracted articles found several that were classified as at “high risk of missing [the] notice”, with no notice given in the text of the pdf or html copies themselves (Wright et al, 2011). It seems likely that corrections have even more difficulty being seen and acknowledged by subsequent researchers.

There are several technological solutions which can be tried. One promising avenue would be the adoption of version control. Also called revision control, this is a way of tracking all changes made to a project. This technology has been used for decades in computer science and is becoming more and more popular - Wikipedia and Google Docs, for instance, both use version control. Citations for a paper could reference the version of the paper then available, but subsequent readers would be notified that a more recent version could be viewed. In addition to making it easy to see how articles have been changed, adopting such a system would acknowledge the frequency of retractions and corrections and the need to check for up-to-date information.
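To make this concrete, here is a minimal sketch of how a manuscript could be tracked with Git, a widely used version control tool; the file names, tag names, and commit messages are all illustrative:

```shell
# Track a manuscript under version control so every revision is recoverable.
mkdir manuscript && cd manuscript
git init -q
git config user.name "Author" ; git config user.email "author@example.org"

echo "Draft of the introduction." > paper.md
git add paper.md
git commit -qm "Version of record at publication"
git tag v1                         # the version readers originally cited

echo "Corrected analysis in section 3." >> paper.md
git commit -aqm "Post-publication correction"
git tag v2                         # the corrected, current version

git diff v1 v2                     # shows readers exactly what changed
```

A citation could then point to tag v1, while anyone retrieving the paper would see at once that a later v2 exists and could inspect the exact difference between the two.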

Another potential tool would be an alert system. When changes are made to an article, the authors of all articles which cite it could be notified. However, this would require the maintenance of up-to-date contact information for authors, and the adoption of communications standards across publishers (something that has been accomplished before with initiatives like CrossRef).

A more transformative approach would be to view papers not as static documents but as ongoing projects that can be updated and contributed to over time. Projects could be tracked through version control from their very inception, allowing for a kind of pre-registration. Replications and new analyses could be added to the project as they’re completed. The most insightful questions and critiques from the public could lead to changes in new versions of the article.

The classic journal article only recognizes certain kinds of contributions.

When journal articles were first developed in the 1600s, the idea of crediting an author or authors must have seemed straightforward. After all, most research was being done by individuals or very small groups, and there were no such things as curriculum vitae or tenure committees. Over time, academic authorship has become the single most important factor in determining career success for individual scientists. The limitations of authorship can therefore have an incredible impact on scientific progress.

There are two major problems with authorship as it currently functions, and they are sides of the same coin. Authorship does not tell you what, precisely, each author did on a paper. And authorship does not tell you who, precisely, is responsible for each part of a paper. Currently, the authorship model provides only a vague idea of who is responsible for a paper. While this is sometimes elaborated upon briefly in the footnotes, or mentioned in the article, more often readers employ simple heuristics. In psychology, the first author is believed to have led the work, the last author to have provided physical and conceptual resources for the experiment, and any middle authors to have contributed in an unknown but significant way. This is obviously not an ideal way to credit people, and often leads to disputes, with first authorship sometimes misattributed. It has grown increasingly impractical as multiauthor papers have become more and more common. What does authorship on a 500-author paper even mean?

The situation is even worse for people whose contributions are not rewarded with authorship. While contributions may be mentioned in the acknowledgements or cited in the body of the paper, neither of these carries much weight when scientists are applying for jobs or up for tenure. This gives them little motivation to do work which will not be recognized with authorship. And such work is greatly needed. The development of tools, the collection and release of open data sets, the creation of popularizations and teaching materials, and the deep and thorough review of others’ work - these are all done as favors or side projects, even though they are vital to the progress of research.

How can new technologies address these problems? There have been few changes made in this area, perhaps due to the heavy weight of authorship in scientific life, although there are some tools like Figshare which allow users to share non-traditional materials such as datasets and posters in citable (and therefore creditable) form. A more transformative change might be to use the version control system mentioned above. Instead of tracking changes to the article from publication onwards, it could follow the article from its beginning stages. In that way, each change could be attributed to a specific person.

Another option might simply be to describe contributions in more detail. Currently if I use your methodology wholesale, or briefly mention a finding of yours, I acknowledge you in the same way - a citation. What if, instead, all significant contributions were listed? Although space is not a constraint with digital articles, the human attention span remains limited, and so it might be useful to create common categories for contribution, such as reviewing the article, providing materials, doing analyses, or coming up with an explanation for discussion.

Two other problems are worth mentioning in brief. The first is ghost authorship, where substantial contributions to the running of a study or the preparation of a manuscript go unacknowledged. This is frequently done in industry-sponsored research to hide conflicts of interest. If journal articles used a format where every contribution was tracked, ghost authorship would be impossible. The second is the assignment of contact authors, the researchers on a paper to whom readers are invited to direct questions. Contact information can become outdated fairly quickly, causing access to data and materials to be lost; if contact information can be updated, or responsibility passed on to a new person, such losses can be prevented.

Dec 11, 2013

Chasing Paper, Part 1


This is part one of a three part post. Parts two and three have now been posted.

The academic paper is old - older than the steam engine, the pocket watch, the piano, and the light bulb. The first journal, Philosophical Transactions, was published on March 6th, 1665. Now that doesn’t mean that the journal article format is obsolete - many inventions much older are still in wide use today. But after a third of a millennium, it’s only natural that the format needs some serious updating.

When brainstorming changes, it may be useful to think of the limitations of ink and paper. From there, we can consider how new technologies can improve or even transform the journal article. Some of these changes have already been widely adopted, while others have never even been debated. Some are adaptive, using the greater storage capacity of computing to extend the functions of the classic journal article, while others are transformative, creating new functions and features only available in the 21st century.

The ideas below are suggestions, not recommendations - it may be that some aspects of the journal article format are better left alone. But we all benefit from challenging our assumptions about what an article is and ought to be.

The classic journal article format cannot convey the full range of information associated with an experiment.

Until the rise of modern computing, there was simply no way for researchers to share all the data they collected in their experiments. Researchers were forced to summarize: to gloss over the details of their methods and the reasoning behind their decisions and, of course, to provide statistical analyses in the place of raw data. While fields like particle physics and genetics continue to push the limits of memory, most experimenters now have the technical capacity to share all of their data.

Many journals have taken to publishing supplemental materials, although these rarely encompass the entirety of data collected, or enough methodological detail to allow for independent replication. There are plenty of explanations for this slow adoption, including ethical considerations around human subjects data, the potential to patent methods, or the cost to journals of hosting these extra materials. But these are obstacles to address, not reasons to give up. The potential benefits are enormous: What if every published paper contained enough methodological detail that it could be independently replicated? What if every paper contained enough raw data that it could be included in a meta-analysis? How much meta-scientific work is never undertaken because it's dependent on getting dozens or hundreds of contact authors to return your emails, and on universities to properly store data and materials?

Providing supplemental material, no matter how extensive, is still an adaptive change. What might a transformative change look like? Elsevier’s Article of the Future project attempts to answer that question with new, experimental formats that include videos, interactive models, and infographics. These designs are just the beginning. What if articles allowed readers to actually interact with the data and perform their own analyses? Virtual environments could be set up, lowering the barrier to independent verification of results. What if authors reported when they made questionable methodological decisions, and allowed readers, where possible, to see the results when a variable was not controlled for, or a sample was not excluded?

The classic journal article format is difficult to organize, index or search.

New technology has already transformed the way we search the scientific literature. Where before researchers were reliant on catalogues and indexes from publishers, and used abstracts to guess at relevance, databases such as PubMed and Google Scholar allow us to find all mentions of a term, tool, or phenomenon across vast swathes of articles. While searching databases is itself a skill, it's one that allows us to search comprehensively and efficiently, and gives us more opportunities to explore.

Yet old issues of organization and curation remain. Indexes used to speed the slow process of skimming through physical papers. Now they’re needed to help researchers sort through the abundance of articles constantly being published. With tens of millions of journal articles out there, how can we be sure we’re really accessing all the relevant literature? How can we compare and synthesize the thousands of results one might get on a given search?

Special kinds of articles - reviews and meta-analyses - have traditionally helped us synthesize and curate information. As discussed above, new technologies can help make meta-analyses more common by making it easier for researchers to access information about past studies. We can further improve the search experience by creating more detailed metadata. Metadata, in this context, is the information attached to an article which lets us categorize it without having to read the article itself. Currently, fields like title, author, date, and journal are quite common in databases. More complicated fields are less often adopted, but you can find metadata on study type, population, level of clinical trial (where applicable), and so forth. What would truly comprehensive metadata look like? Is it possible to store the details of experimental structure or analysis in machine-readable format - and is that even desirable?
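As a thought experiment, a richer machine-readable record might look something like the sketch below. Every field name and value here is invented for illustration, not drawn from any existing standard:

```json
{
  "title": "Example effect in a student sample",
  "authors": ["A. Author", "B. Author"],
  "journal": "Journal of Examples",
  "year": 2013,
  "study_type": "randomized experiment",
  "population": {"n": 120, "description": "undergraduate students"},
  "design": {"conditions": 2, "measures": ["response time", "accuracy"]},
  "analysis": ["t-test", "ANOVA"],
  "data_available": true
}
```

A database of such records could be filtered on design and population fields directly, rather than by skimming abstracts, which is exactly the search-and-synthesis gain described above.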

What happens when we reconsider not the metadata but the content itself? Most articles are structurally complex, containing literature reviews, methodological information, data, and analysis. Perhaps we might be better served by breaking those articles down into their constituent parts. What if methods, data, analysis were always published separately, creating a network of papers that were linked but discrete? Would that be easier or harder to organize? It may be that what we need here is not a better kind of journal article, but a new way of curating research entirely.

Dec 9, 2013

New “Reviewer Statement” Initiative Aims to (Further) Improve Community Norms Toward Disclosure



An Open Science Collaboration -- made up of Uri Simonsohn, Etienne LeBel, Don Moore, Leif D. Nelson, Brian Nosek, and Joe Simmons -- is glad to announce a new initiative aiming to improve community norms toward the disclosure of basic methodological information during the peer-review process. Endorsed by the Center for Open Science, the initiative involves a standard reviewer statement that any peer reviewer can include in their review, requesting that authors add a statement to the paper confirming that they have disclosed all data exclusions, experimental conditions, assessed measures, and how they determined their sample sizes (following from the 21-word solution; Simmons, Nelson, & Simonsohn, 2012, 2013; see also PsychDisclosure.org; LeBel et al., 2013). Here is the statement, which is available on the Open Science Framework:

"I request that the authors add a statement to the paper confirming whether, for all experiments, they have reported all measures, conditions, data exclusions, and how they determined their sample sizes. The authors should, of course, add any additional text to ensure the statement is accurate. This is the standard reviewer disclosure request endorsed by the Center for Open Science (see http://osf.io/project/hadz3). I include it in every review."

The idea originated from the realization that as peer reviewers, we typically lack fundamental information regarding how the data were collected and analyzed, which prevents us from being able to properly evaluate the claims made in a submitted manuscript. Some reviewers interested in requesting such information, however, were concerned that such requests would make them appear selective and/or compromise their anonymity. Discussions ensued, and the contributors developed a standard reviewer disclosure request statement that overcomes these concerns and allows the community of reviewers to improve community norms toward the disclosure of such methodological information across all journals and articles.

Some of the contributors, including myself, were hoping for a reviewer statement with a bit more teeth: for instance, making the disclosure of such information a requirement before agreeing to review an article, or requiring the re-review of a revised manuscript once the requested information has been disclosed. The team of contributors, however, ultimately decided that it would be better to start small to gain acceptance, in order to maximize the probability that the initiative has an impact in shaping community norms.

Hence, next time you are invited to review a manuscript for publication at any journal, please remember to include the reviewer disclosure statement!


LeBel, E. P., Borsboom, D., Giner-Sorolla, R., Hasselman, F., Peters, K. R., Ratliff, K. A., & Smith, C. T. (2013). PsychDisclosure.org: Grassroots support for reforming reporting standards in psychology. Perspectives on Psychological Science, 8(4), 424-432. doi: 10.1177/1745691613491437

Simmons J., Nelson L. & Simonsohn U. (2011) False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allow Presenting Anything as Significant. Psychological Science, 22(11), 1359-1366.

Simmons J., Nelson L. & Simonsohn U. (2012) A 21 Word Solution. Dialogue: The Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4-7.

Nov 27, 2013

The State of Open Access


To celebrate Open Access Week last month, we asked people four questions about the state of open access and how it's changing. Here are some in depth answers from two people working on open access: Peter Suber, Director of the Harvard Office for Scholarly Communication and the Harvard Open Access Project, and Elizabeth Silva, associate editor at the Public Library of Science (PLOS).

How is your work relevant to the changing landscape of Open Access? What would be a successful outcome of your work in this area?

Elizabeth: PLOS is now synonymous with open access publishing, so it’s hard to believe that 10 years ago, when PLOS was founded, most researchers were not even aware that availability of research was a problem. We all published our best research in the best journals. We assumed our colleagues could access it, and we weren’t aware of (or didn’t recognize the problem with) the inability of people outside of the ivory tower to see this work. At that time it was apparent to the founders of PLOS, who were among the few researchers who recognized the problem, that the best way to convince researchers to publish open access would be for PLOS to become an open access publisher, and prove that OA could be a viable business model and an attractive publishing venue at the same time. I think that we can safely say that the founders of PLOS succeeded in this mission, and they did it decisively.

We’re now at an exciting time, where open access in the natural sciences is all but inevitable. We now get to work on new challenges, trying to solve other issues in research communication.

Peter: My current job has two parts. I direct the Harvard Office for Scholarly Communication (OSC), and I direct the Harvard Open Access Project (HOAP). The OSC aims to provide OA to research done at Harvard University. We implement Harvard's OA policies and maintain its OA repository. We focus on peer-reviewed articles by faculty, but are expanding to other categories of research and researchers. In my HOAP work, I consult pro bono with universities, scholarly societies, publishers, funding agencies, and governments, to help them adopt effective OA policies. HOAP also maintains a guide to good practices for university OA policies, manages the Open Access Tracking Project, writes reference pages on federal OA-related legislation, such as FASTR, and makes regular contributions to the Open Access Directory and the catalog of OA journals from society publishers.

To me success would be making OA the default for new research in every field and language. However, this kind of success is more like a new plateau than a finish line. We often focus on the goal of OA itself, or the goal of removing access barriers to knowledge. But that's merely a precondition for an exciting range of new possibilities for making use of that knowledge. In that sense, OA is closer to the minimum than the maximum of how to take advantage of the internet for improving research. Once OA is the default for new research, we can give less energy to attaining it and more energy to reaping the benefits, for example, integrating OA texts with open data, improving the methods of meta-analysis and reproducibility, and building better tools for knowledge extraction, text and data mining, question answering, reference linking, impact measurement, current awareness, search, summary, translation, organization, and recommendation.

From the researcher's side, making OA the new default means that essentially all the new work they write, and essentially all the new work they want to read, will be OA. From the publisher's side, making OA the new default means that sustainability cannot depend on access barriers that subtract value, and must depend on creative ways to add value to research that is already and irrevocably OA.

How do you think the lack of Open Access is currently impacting how science is practiced?

Peter: The lack of OA slows down research. It distorts inquiry by making the retrievability of research a function of publisher prices and library budgets rather than author consent and internet connectivity. It hides results that happen to sit in journals that exceed the affordability threshold for you or your institution. It limits the correction of scientific error by limiting the number of eyeballs that can examine new results. It prevents the use of text and data mining to supplement human analysis with machine analysis. It hinders the reproducibility of research by excluding many who would want to reproduce it. At the same time, and ironically, it increases the inefficient duplication of research by scholars who don't realize that certain experiments have already been done.

It prevents journalists from reading the latest developments, reporting on them, and providing direct, usable links for interested readers. It prevents unaffiliated scholars and the lay public from reading new work in which they may have an interest, especially in the humanities and medicine. It blocks research-driven industries from creating jobs, products, and innovations. It prevents taxpayers from maximizing the return on their enormous investment in publicly-funded research.

I assume we're talking about research that authors publish voluntarily, as opposed to notes, emails, and unfinished manuscripts, and I assume we're talking about research that authors write without expectation of revenue. If so, then the lack of OA harms research and researchers without qualification. The lack of OA benefits no one except conventional publishers who want to own it, sell it, and limit the audience to paying customers.

Elizabeth: There is a prevailing idea that those that need access to the literature already have it; that those that have the ability to understand the content are at institutions that can afford the subscriptions. First, this ignores the needs of physicians, educators, science communicators, and smaller institutions and companies. More fundamentally, limiting access to knowledge so that it rests in the hands of an elite 1% is archaic, backwards, and counterproductive. There has never been a greater urgency to find solutions to problems that fundamentally threaten human existence – climate change, disease transmission, food security – and in the face of this why would we advocate limited dissemination of knowledge? Full adoption of open access has the potential to fundamentally change the pace of scientific progress, as we make this information available to everyone, worldwide.

When it comes to issues of reproducibility, fraud or misreporting, all journals face similar issues regardless of the business model. Researchers design their experiments and collect their data long before they decide the publishing venue, and the quality of the reporting likely won’t change based on whether the venue is OA. I think that these issues are better tackled by requirements for open data and improved reporting. Of course these philosophies are certainly intrinsically linked – improved transparency and access can only improve matters.

What do you think is the biggest reason that people resist Open Access? Do you think there are good reasons for not making a paper open access?

Elizabeth: Of course there are many publishers who resist open access, which reflects a need to protect established revenue streams. In addition to large commercial publishers, there are a lot of scholarly societies whose primary sources of income are the subscriptions for the journals they publish.

Resistance from authors, in my experience, comes principally in two forms. The first is linked to the impact factor, rather than the business model. Researchers are stuck in a paradigm that requires them to publish as ‘high’ as possible to achieve career advancement. While there are plenty of high impact OA publications with which people choose to publish, it just so happens that the highest are subscription journals. We know that open access increases utility, visibility and impact of individual pieces of research, but the fallacy that a high impact journal is equivalent to high impact research persists.

The second reason cited is that the cost is prohibitive. This is a problem everyone at PLOS can really appreciate, and we very much sympathize with authors who do not have the money in their budget to pay author publication charges (APCs). However, it’s a problem that should really be a lot easier to overcome. If research institutions were to pay publication fees, rather than subscription fees, they would save a fortune; a few institutions have realized this and are paying the APCs for authors who choose to go OA. It would also help if funders could recognize publishing as an intrinsic part of the research, folding the APC into the grant. We are also moving the technology forward in an effort to reduce costs, so that savings can be passed on to authors. PLOS ONE has been around for nearly 7 years, and the fees have not changed. This reflects efforts to keep costs as low as we can. Ironically, the biggest of the pay-walled journals already charge authors to publish: for example, it can be between $500 and $1000 for the first color figure, and a few hundred dollars for each additional one; on top of this there are page charges and reprint costs. Not only is the public paying for the research and the subscription, they are paying for papers that they can’t read.

Peter: There are no good reasons for not making a paper OA, or at least for not wanting to.

There are sometimes reasons not to publish in an OA journal. For example, the best journals in your field may not be OA. Your promotion and tenure committee may give you artificial incentives to limit yourself to a certain list of journals. Or the best OA journals in your field may charge publication fees which your funder or employer will not pay on your behalf. However, in those cases you can publish in a non-OA journal and deposit the peer-reviewed manuscript in an OA repository.

The resistance of non-OA publishers is easier to grasp. But if we're talking about publishing scholars, not publishers, then the largest cause of resistance by far is misunderstanding. Far too many researchers still accept false assumptions about OA, such as these 10:

--that the only way to make an article OA is to publish it in an OA journal
--that all or most OA journals charge publication fees
--that all or most publication fees are paid by authors out of pocket
--that all or most OA journals are not peer reviewed
--that peer-reviewed OA journals cannot use the same standards and even the same people as the best non-OA journals
--that publishing in a non-OA journal closes the door on lawfully making the same article OA
--that making work OA makes it harder rather than easier to find
--that making work OA limits rather than enhances author rights over it
--that OA mandates are about submitting new work to OA journals rather than depositing it in OA repositories, or
--that everyone who needs access already has access.

In a recent article in The Guardian I corrected six of the most widespread and harmful myths about OA. In a 2009 article, I corrected 25. And in my 2012 book, I tried to take on the whole legendarium.

How has the Open Access movement changed in the last five years? How do you think it will change in the next five years?

Peter: OA has been making unmistakable progress for more than 20 years. Five years ago we were not in a qualitatively different place. We were just a bit further down the slope from where we are today.

Over the next five years, I expect more than just another five years' worth of progress as usual. I expect five years' worth of progress toward the kind of success I described in my answer to your first question. In fact, insofar as progress tends to add cooperating players and remove or convert resisting players, I expect five years' worth of compound interest and acceleration.

In some fields, like particle physics, OA is already the default. In the next five years we'll see this new reality move at an uneven rate across the research landscape. Every year more and more researchers will be able to stop struggling for access against needless legal, financial, and technical barriers. Every year, those still struggling will have the benefit of a widening circle of precedents, allies, tools, policies, best practices, accommodating publishers, and alternatives to publishers.

Green OA mandates are spreading among universities. They're also spreading among funding agencies, for example in the US, the EU, and the global south. This trend will definitely continue, especially with the support it has received from the Global Research Council, Science Europe, the G8 Science Ministers, and the World Bank.

With the exception of the UK and the Netherlands, countries adopting new OA policies are learning from the experience of their predecessors and starting with green. I've argued in many places that mandating gold OA is a mistake. But it's a mistake mainly for historical reasons, and historical circumstances will change. Gold OA mandates are foolish today in part because too few journals are OA, and there's no reason to limit the freedom of authors to publish in the journals of their choice. But the percentage of peer-reviewed journals that are OA is growing and will continue to grow. (Today it's about 30%.) Gold OA mandates are also foolish today because gold OA is much more expensive than green OA, and there's no reason to compromise the public interest in order to guarantee revenue for non-adaptive publishers. But the costs of OA journals will decline, as the growing number of OA journals compete for authors, and the money to pay for OA journals will grow as libraries redirect money from conventional journals to OA.

We'll see a rise in policies linking deposit in repositories with research assessment, promotion, and tenure. These policies were pioneered by the University of Liege, have since been adopted at institutions in nine countries, and have been recommended by the Budapest Open Access Initiative, the UK House of Commons Select Committee on Business, Innovation and Skills, and the Mediterranean Open Access Network. Most recently, this kind of policy has been proposed at the national level by the Higher Education Funding Council for England. If it's adopted, it will mitigate the damage of a gold-first policy in the UK. A similar possibility has been suggested for the Netherlands.

I expect we'll see OA in the humanities start to catch up with OA in the sciences, and OA for books start to catch up with OA for articles. But in both cases, the pace of progress has already picked up significantly, and so has the number of people eager to see these two kinds of progress accelerate.

The recent decision that Google's book scanning is fair use means that a much larger swath of print literature will be digitized, if not in every country, then at least in the US, and if not for OA, then at least for searching. This won't open the doors to vaults that have been closed, but it will open windows to help us see what is inside.

Finally, I expect to see evolution in the genres or containers of research. Like most people, I'm accustomed to the genres I grew up with. I love articles and books, both as a reader and author. But they have limitations that we can overcome, and we don't have to drop them to enhance them or to create post-articles and post-books alongside them. The low barriers to digital experimentation mean that we can try out new breeds until we find some that carry more advantages than disadvantages for specific purposes. Last year I sketched out one idea along these lines, which I call an evidence rack, but it's only one in an indefinitely large space constrained only by the limits on our imagination.

Elizabeth: It’s starting to feel like universal open access is no longer “if” but “when”. In the next five years we will see funders and institutions recognize the importance of access and adopt policies that mandate and financially support OA; resistance will fade away, and it will simply be the way research is published. As that happens, I think the OA movement will shift towards tackling other issues in research communication: providing better measures of impact in the form of article level metrics, decreasing the time to publication, and improving reproducibility and utility of research.

Nov 20, 2013

Theoretical Amnesia


Photo of Denny

In the past few months, the Center for Open Science and its associated enterprises have gathered enormous support in the community of psychological scientists. While these developments are happy ones, in my view, they also cast a shadow over the field of psychology: clearly, many people think that the activities of the Center for Open Science, like organizing massive replication work and promoting preregistration, are necessary. That, in turn, implies that something in the current scientific order is seriously broken. I think that, apart from working towards improvements, it is useful to investigate what that something is. In this post, I want to point towards a factor that I think has received too little attention in the public debate; namely, the near absence of unambiguously formalized scientific theory in psychology.

Scientific theories are perhaps the most bizarre entities that the scientific imagination has produced. They have incredible properties that, if we weren’t so familiar with them, would do pretty well in a Harry Potter novel. For instance, scientific theories allow you to work out, on a piece of paper, what would happen to stuff in conditions that aren’t actually realized. So you can figure out whether an imaginary bridge will stand or collapse in imaginary conditions. You can do this simply by feeding some imaginary quantities that your imaginary bridge would have (like its mass and dimensions) to a scientific theory (say, Newton’s) and out comes a prediction on what will happen. In the more impressive cases, the predictions are so good that you can actually design the entire bridge on paper, then build it according to specifications (by systematically mapping empirical objects to theoretical terms), and then the bridge will do precisely what the theory says it should do. No surprises.

That’s how they put a man on the moon and that’s how they make the computer screen you’re now looking at. It’s all done in theory before it’s done for real, and that’s what makes it possible to construct complicated but functional pieces of equipment. This is, in effect, why scientific theory makes technology possible, and therefore this is an absolutely central ingredient of the scientific enterprise which, without technology, would be much less impressive than it is.

It’s useful to take stock here, and marvel. A good scientific theory allows you to infer what would happen to things in certain situations without creating those situations. Thus, scientific theories are crystal balls that actually work. For this reason, some philosophers of science have suggested that scientific theories should be interpreted as inference tickets. Once you’ve got the ticket, you get to sidestep all the tedious empirical work. Which is great, because empirical work is, well, tedious. Scientific theories are thus exquisitely suited to the needs of lazy people.

My field – psychology – unfortunately does not afford much of a lazy life. We don’t have theories that can offer predictions sufficiently precise to intervene in the world with appreciable certainty. That’s why there exists no such thing as a psychological engineer. And that’s why there are fields of theoretical physics, theoretical biology, and even theoretical economics, while there is no parallel field of theoretical psychology. It is a sad but, in my view, inescapable conclusion: we don’t have much in the way of scientific theory in psychology. For this reason, we have very few inference tickets – let alone inference tickets that work.

And that’s why psychology is so hyper-ultra-mega empirical. We never know how our interventions will pan out, because we have no theory that says how they will pan out (incidentally, that’s also why we need preregistration: in psychology, predictions are made by individual researchers rather than by standing theory, and you can’t trust people the way you can trust theory). The upshot is that, if we want to know what would happen if we did X, we have to actually do X. Because we don’t have inference tickets, we never get to take the shortcut. We always have to wade through the empirical morass. Always.

This has important consequences. For instance, as a field has less theory, it has to leave more to the data. Since you can’t learn anything from data without the armature of statistical analysis, a field without theory tends to grow a thriving statistical community. Thus, the role of statistics grows as soon as the presence of scientific theory wanes. In extreme cases, when statistics has entirely taken over, fields of inquiry can actually develop a kind of philosophical disorder: theoretical amnesia. In fields with this disorder, researchers no longer know what a theory is, which means that they can recognize neither its presence nor its absence. In such fields, for instance, a statistical model – like a factor model – can come to occupy the vacuum created by the absence of theory. I am often afraid that this is precisely what has happened with the advent of “theories” like those of general intelligence (a single factor model) and the so-called “Big Five” of personality (a five-factor model). In fact, I am afraid that this happened in many fields in psychology, where statistical models (which, in their barest null-hypothesis testing form, are misleadingly called “effects”) rule the day.

If your science thrives on experiment and statistics, but lacks the power of theory, you get peculiar problems. Most importantly, you get slow. To see why, it’s interesting to wonder how psychologists would build a bridge, if they were to use their typical methodological strategies. Probably, they would build a thousand bridges, record whether they stand or fall, and then fit a regression equation to figure out which properties are predictive of the outcome. Predictors would be chosen on the basis of statistical significance, which would introduce a multiple testing problem. In response, some of the regressors might be clustered through factor analysis, to handle the overload of predictive variables. Such analyses would probably indicate lots of structure in the data, and psychologists would likely find that the bridges’ weight, size, and elasticity load on a single latent “strength factor”, producing the “theory” that bridges higher on the “strength factor” are less likely to fall down. Cross validation of the model would be attempted by reproducing the analysis in a new sample of a thousand bridges, to weed out chance findings. It’s likely that, after many years of empirical research, and under a great number of “context-dependent” conditions that would be poorly understood, psychologists would be able to predict a modest but significant proportion of the variance in the outcome variable. Without a doubt, it would take a thousand years to establish empirically what Newton grasped in a split second, as he wrote down his F = ma.

Because increased reliance on empirical data makes you so incredibly slow, it also makes you susceptible to fads and frauds. A good theory can be tested in an infinity of ways, many of which are directly available to the interested reader (this is what gives classroom demonstrations such enormous evidential force). But if your science is entirely built on generalizations derived from specifics of tediously gathered experimental data, you can’t really test these generalizations without tediously gathering the same, or highly similar, experimental data. That’s not something that people typically like to do, and it’s certainly not what journals want to print. As a result, a field can become dominated by poorly tested generalizations. When that happens, you’re in very big trouble. The reason is that your scientific field becomes susceptible to the equivalent of what evolutionary theorists call free riders: people who capitalize on the invested honest work of others by consistently taking the moral shortcut. Free riders can come to rule a scientific field if two conditions are satisfied: (a) fame is bestowed on whoever dares to make the most adventurous claims (rather than the most defensible ones), and (b) it takes longer to falsify a bogus claim than it takes to become famous. If these conditions are satisfied, you can build your scientific career on a fad and get away with it. By the time they find out your work really doesn’t survive detailed scrutiny, you’re sitting warmly by the fire in the library of your National Academy of Sciences1.

Much of our standard methodological teachings in psychology rest on the implicit assumption that scientific fields are similar if not identical in their methodological setup. That simply isn’t true. Without theory, the scientific ball game has to be played by different rules. I think that these new rules are now being invented: without good theory, you need fast acting replication teams, you need a reproducibility project, and you need preregistered hypotheses. Thus, the current period of crisis may lead to extremely important methodological innovations – especially those that are crucial in fields that are low on theory.

Nevertheless, it would be extremely healthy if psychologists received more education in fields which do have some theories, even if they are empirically shaky ones, like you often see in economics or biology. In itself, it’s no shame that we have so little theory: psychology probably has the hardest subject matter ever studied, and to change that may very well take a scientific event of the order of Newton’s discoveries. I don’t know how to do it and I don’t think anybody else knows either. But what we can do is keep in contact with other fields, and at least try to remember what theory is and what it’s good for, so that we don’t fall into theoretical amnesia. As they say, it’s the unknown unknowns that hurt you most.

1 Caveat: I am not saying that people do this on purpose. I believe that free riders are typically unaware of the fact that they are free riders – people are very good at labeling their own actions positively, especially if the rest of the world says that they are brilliant. So, if you think this post isn’t about you, that could be entirely wrong. In fact, I cannot even be sure that this post isn’t about me.

Nov 13, 2013

Let’s Report Our Findings More Transparently – As We Used to Do


Photo of Etienne LeBel

In 1959, Festinger and Carlsmith reported the results of an experiment that spawned a voluminous body of research on cognitive dissonance. In that experiment, all subjects performed a boring task. Some participants were paid $1 or $20 to tell the next subject the task was interesting and fun whereas participants in a control condition did no such thing. All participants then indicated how enjoyable they felt the task was, their desire to participate in another similar experiment, and the scientific importance of the experiment. The authors hypothesized that participants paid $1 to tell the next participant that the boring task they just completed was interesting would experience internal dissonance, which could be reduced by altering their attitudes on the three outcome measures. A little known fact about the results of this experiment, however, is that only on one of these outcome measures did a statistically significant effect emerge across conditions. The authors reported that subjects paid $1 enjoyed the task more than those paid $20 (or the control participants), but no statistically significant differences emerged on the other two measures.

In another highly influential paper, Word, Zanna, and Cooper (1974) documented the self-fulfilling prophecies of racial stereotypes. The researchers had white subjects interview trained white and black applicants and coded for six non-verbal behaviors of immediacy (distance, forward lean, eye contact, shoulder orientation, interview length, and speech error rate). They found that white interviewers treated black applicants with lower levels of non-verbal immediacy than white applicants. In a follow-up study involving white subjects only, applicants treated with less immediate non-verbal behaviors were judged to perform less well during the interview than applicants treated with more immediate non-verbal behaviors. A fascinating result; however, a little-known fact about this paper is that only three of the six measures of non-verbal behaviors assessed in the first study (and subsequently used in the second study) were statistically significant.

What do these two examples make salient in relation to how psychologists report their results nowadays? Regular readers of prominent journals like Psychological Science and Journal of Personality and Social Psychology may see what I’m getting at: It is very rare these days to see articles in these journals wherein half or most of the reported dependent variables (DVs) fail to show statistically significant effects. Rather, one typically sees squeaky-clean looking articles where all of the DVs show statistically significant effects across all of the studies, with an occasional mention of a DV achieving “marginal significance” (Giner-Sorolla, 2012).

In this post, I want us to consider the possibility that psychologists’ reporting practices may have changed in the past 50 years. This then raises the question as to how this came about. One possibility is that as incentives became increasingly more perverse in psychology (Nosek, Spies, & Motyl, 2012), some researchers realized that they could out-compete their peers by reporting “cleaner” looking results which would appear more compelling to editors and reviewers (Giner-Sorolla, 2012). For example, decisions were made to simply not report DVs that failed to show significant differences across conditions or that only achieved “marginal significance”. Indeed, nowadays sometimes even editors or reviewers will demand that such DVs not be reported (see PsychDisclosure.org; LeBel et al., 2013). A similar logic may also have contributed to researchers’ deciding not to fully report independent variables that failed to yield statistically significant effects and not fully reporting the exclusion of outlying participants due to fear that this information may raise doubts among the editor/reviewers and hurt their chance of getting their foot in the door (i.e., at least getting a revise-and-resubmit).

An alternative explanation is that new tools and technology have given us the ability to measure a greater number of DVs, which makes it more difficult to report on all of them. For example, neuroscience (e.g., EEG, fMRI) and eye-tracking methods yield multitudes of analyzable data that were not previously available. Though this is undeniably true, the internet and online article supplements give us the ability to fully report our methods and results and use the article to draw attention to the most interesting data.

Considering the possibility that psychologists’ reporting practices have changed in the past 50 years has implications for how to construe recent calls for the need to raise reporting standards in psychology (LeBel et al., 2013; Simmons, Nelson, & Simonsohn, 2011; Simmons, Nelson, & Simonsohn, 2012). Rather than seeing these calls as rigid new requirements that might interfere with exploratory research and stifle our science, one could construe such calls as a plea to revert to the fuller reporting of results that used to be the norm in our science. From this perspective, it should not be viewed as overly onerous or authoritarian to ask researchers to disclose all excluded observations, all tested experimental conditions, all analyzed measures, and their data collection termination rule (what I’m now calling the BASIC 4 methodological categories covered by PsychDisclosure.org and Simmons et al.’s, 2012 21-word solution). It would simply be the way our forefathers used to do it.


Festinger, L., & Carlsmith, J. M. (1959). Cognitive consequences of forced compliance. The Journal of Abnormal and Social Psychology, 58(2), 203-210. doi: 10.1037/h0041593

Giner-Sorolla, R. (2012). Science or art? How esthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science, 7(6), 562-571. doi: 10.1177/1745691612457576

LeBel, E. P., Borsboom, D., Giner-Sorolla, R., Hasselman, F., Peters, K. R., Ratliff, K. A., & Smith, C. T. (2013). PsychDisclosure.org: Grassroots support for reforming reporting standards in psychology. Perspectives on Psychological Science, 8(4), 424-432. doi: 10.1177/1745691613491437

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631. doi: 10.1177/1745691612459058

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

Simmons, J., Nelson, L., & Simonsohn, U. (2012). A 21 word solution. Dialogue: The Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4-7.

Word, C. O., Zanna, M. P., & Cooper, J. (1974). The nonverbal mediation of self-fulfilling prophecies in interracial interaction. Journal of Experimental Social Psychology, 10(2), 109–120. doi: 10.1016/0022-1031(74)90059-6

Nov 3, 2013

Increasing statistical power in psychological research without increasing sample size


What is statistical power and precision?

This post is going to give you some practical tips to increase statistical power in your research. Before going there though, let’s make sure everyone is on the same page by starting with some definitions.

Statistical power is the probability that the test will reject the null hypothesis when the null hypothesis is false. Many authors suggest a statistical power rate of at least .80, which corresponds to an 80% probability of not committing a Type II error.

Precision refers to the width of the confidence interval for an effect size. The smaller this width, the more precise your results are. For 80% power, the confidence interval width will be roughly plus or minus 70% of the population effect size (Goodman & Berlin, 1994). Studies that have low precision have a greater probability of both Type I and Type II errors (Button et al., 2013).

To get an idea of how this works, here are a few examples of the sample size required to achieve .80 power for small, medium, and large (Cohen, 1992) correlations, as well as the expected confidence intervals:

Population Effect Size Sample Size for 80% Power Estimated Precision
r = .10 782 95% CI [.03, .17]
r = .30 84 95% CI [.09, .51]
r = .50 29 95% CI [.15, .85]
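To see where numbers like these come from, here is a small sketch (Python standard library only) of the usual Fisher z approximation for the sample size needed to detect a correlation; it is an approximation, so it lands within a participant or so of the exact values in the table.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate N needed to detect a population correlation r
    at the given power, using the Fisher z approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    z_beta = nd.inv_cdf(power)            # 0.84 for power = .80
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

for r in (0.10, 0.30, 0.50):
    print(f"r = {r:.2f}: n = {n_for_correlation(r)}")
```

Plugging in r = .21 gives about 176, in line with the ~175 participants quoted later in this post.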

Studies in psychology are grossly underpowered

Okay, so now you know what power is. But why should you care? Fifty years ago, Cohen (1962) estimated the statistical power to detect a medium effect size in abnormal psychology was about .48. That’s a false negative rate of 52%, which is no better than a coin-flip! The situation has improved slightly, but it’s still a serious problem today. For instance, one review suggested only 52% of articles in the applied psychology literature achieved .80 power for a medium effect size (Mone et al., 1996). This is in part because psychologists are studying small effects. One massive review of 322 meta-analyses including 8 million participants (Richard et al., 2003) suggested that the average effect size in social psychology is relatively small (r = .21). To put this into perspective, you’d need 175 participants to have .80 power for a simple correlation between two variables at this effect size. This gets even worse when we’re studying interaction effects. One review suggests that the average effect size for interaction effects is even smaller (f2 = .009), which means that sample sizes of around 875 people would be needed to achieve .80 power (Aguinis et al., 2005). Odds are, if you took the time to design a research study and collect data, you want to find a relationship if one really exists. You don’t want to "miss" something that is really there. More than this, you probably want to have a reasonably precise estimate of the effect size (it’s not that impressive to just say a relationship is positive and probably non-zero). Below, I discuss concrete strategies for improving power and precision.

What can we do to increase power?

It is well-known that increasing sample size increases statistical power and precision. Increasing the population effect size increases statistical power, but has no effect on precision (Maxwell et al., 2008). Increasing sample size improves power and precision by reducing the standard error of the effect size. Take a look at this formula for the confidence interval of a linear regression coefficient (McClelland, 2000):

CI = b ± t × √( MSE / (n × Vx × (1 − R²)) )

MSE is the mean square error, n is the sample size, Vx is the variance of X, and (1-R2) is the proportion of the variance in X not shared by any other variables in the model. Okay, hopefully you didn’t nod off there. There’s a reason I’m showing you this formula. In this formula, decreasing any value in the numerator (MSE) or increasing anything in the denominator (n, Vx, 1-R2) will decrease the standard error of the effect size, and will thus increase power and precision. This formula demonstrates that there are at least three other ways to increase statistical power aside from sample size: (a) Decreasing the mean square error; (b) increasing the variance of x; and (c) increasing the proportion of the variance in X not shared by any other predictors in the model. Below, I’ll give you a few ways to do just that.
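To make the trade-offs concrete, here is the standard error from that formula as a small Python function (a sketch that drops the t multiplier and just tracks the standard error term):

```python
def se_slope(mse, n, var_x, r2_x):
    """Standard error of a regression slope per the formula above:
    sqrt(MSE / (n * Vx * (1 - R^2))), where R^2 is the proportion of
    the variance in X shared with the other predictors."""
    return (mse / (n * var_x * (1 - r2_x))) ** 0.5

base = se_slope(mse=4.0, n=100, var_x=1.0, r2_x=0.0)  # 0.2
# Doubling n, doubling Vx, or halving MSE all shrink the SE by the
# same factor of sqrt(2):
assert abs(se_slope(4.0, 200, 1.0, 0.0) - base / 2 ** 0.5) < 1e-12
assert abs(se_slope(4.0, 100, 2.0, 0.0) - base / 2 ** 0.5) < 1e-12
assert abs(se_slope(2.0, 100, 1.0, 0.0) - base / 2 ** 0.5) < 1e-12
```

The three assertions are the whole point: the formula treats more data, a wider spread of X, and a cleaner model as interchangeable routes to precision.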

Recommendation 1: Decrease the mean square error

Referring to the formula above, you can see that decreasing the mean square error will have about the same impact as increasing sample size. Okay. You’ve probably heard the term "mean square error" before, but the definition might be kind of fuzzy. Basically, your model makes a prediction for what the outcome variable (Y) should be, given certain values of the predictor (X). Naturally, it’s not a perfect prediction because you have measurement error, and because there are other important variables you probably didn’t measure. The mean square error is the difference between what your model predicts, and what the true values of the data actually are. So, anything that improves the quality of your measurement or accounts for potential confounding variables will reduce the mean square error, and thus improve statistical power. Let’s make this concrete. Here are three specific techniques you can use:

a) Reduce measurement error by using more reliable measures (i.e., better internal consistency, test-retest reliability, inter-rater reliability, etc.). You’ve probably read that .70 is the "rule-of-thumb" for acceptable reliability. Okay, sure. That’s publishable. But consider this: Let’s say you want to test a correlation coefficient. Assuming both measures have a reliability of .70, your observed correlation will be about 1.43 times smaller than the true population parameter (I got this using Spearman’s correlation attenuation formula). Because you have a smaller observed effect size, you end up with less statistical power. Why do this to yourself? Reduce measurement error. If you’re an experimentalist, make sure you execute your experimental manipulations exactly the same way each time, preferably by automating them. Slight variations in the manipulation (e.g., different locations, slight variations in timing) might reduce the reliability of the manipulation, and thus reduce power.
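That 1.43 figure is easy to verify. A minimal sketch of Spearman’s attenuation formula, run forwards:

```python
from math import sqrt

def observed_r(true_r, rel_x, rel_y):
    """Spearman's attenuation formula, run forwards: the correlation
    you observe is the true correlation shrunk by the (square root of
    the) product of the two measures' reliabilities."""
    return true_r * sqrt(rel_x * rel_y)

r_obs = observed_r(0.30, 0.70, 0.70)  # 0.30 * 0.70 = 0.21
print(0.30 / r_obs)                   # ~1.43: the shrinkage factor
```

So a true correlation of .30 measured with two .70-reliable scales shows up as about .21, which (per the table above) roughly doubles the sample size needed for .80 power.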

b) Control for confounding variables. With correlational research, this means including control variables that predict the outcome variable, but are relatively uncorrelated with other predictor variables. In experimental designs, this means taking great care to control for as many potential confounds as possible. In both cases, this reduces the mean square error and improves the overall predictive power of the model – and thus, improves statistical power. Be careful when adding control variables into your models though: There are diminishing returns for adding covariates. Adding a couple of good covariates is bound to improve your model, but you always have to balance predictive power against model complexity. Adding a large number of predictors can sometimes lead to overfitting (i.e., the model is just describing noise or random error) when there are too many predictors in the model relative to the sample size. So, controlling for a couple of good covariates is generally a good idea, but too many covariates will probably make your model worse, not better, especially if the sample is small.

c) Use repeated-measures designs. Repeated-measures designs are those in which participants are measured multiple times (e.g., once-a-day surveys, multiple trials in an experiment, etc.). Repeated measures designs reduce the mean square error by partitioning out the variance due to individual participants. Depending on the kind of analysis you do, it can also increase the degrees of freedom for the analysis substantially. For example, you might only have 100 participants, but if you measured them once a day for 21 days, you’ll actually have 2100 data points to analyze. The data analysis can get tricky and the interpretation of the data may change, but many multilevel and structural equation models can take advantage of these designs by examining each measurement occasion (i.e., each day, each trial, etc.) as the unit of interest, instead of each individual participant. Increasing the degrees of freedom is much like increasing the sample size in terms of increasing statistical power. I’m a big fan of repeated measures designs, because they allow researchers to collect a lot of data from fewer participants.
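A back-of-the-envelope sketch of the gain (assuming equal variances at both occasions, and ignoring the full multilevel machinery): when the same people are measured twice, the standard error of the difference shrinks with the correlation between a person’s two scores.

```python
def se_between(sd, n):
    """SE of a mean difference with two independent groups of n each."""
    return sd * (2 / n) ** 0.5

def se_within(sd, n, rho):
    """SE of a mean difference when the same n people are measured
    twice; rho is the correlation between a person's two scores."""
    return sd * (2 * (1 - rho) / n) ** 0.5

# With a typical test-retest correlation of .5, 100 people measured
# twice match the precision of 200 people measured once:
print(se_within(1.0, 100, 0.5))  # 0.1
print(se_between(1.0, 200))      # 0.1
```

The higher the correlation between occasions, the larger the effective saving, which is why within-person designs are so efficient.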

Recommendation 2: Increase the variance of your predictor variable

Another less-known way to increase statistical power and precision is to increase the variance of your predictor variables (X). The formula above shows that doubling the variance of X has the same impact on statistical precision as doubling the sample size does! So it’s worth figuring out how to do this.
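As a sanity check on that claim, here is a tiny sketch using the standard sampling-variance formula for a regression slope, Var(b) = σ² / (n · Var(X)) (up to an n versus n − 1 convention; the function name is mine, not the post's):

```python
# The sampling variance of a regression slope is sigma^2 / (n * var_x),
# so doubling var_x and doubling n shrink it by exactly the same factor.
def slope_variance(sigma2, n, var_x):
    return sigma2 / (n * var_x)

base = slope_variance(sigma2=1.0, n=100, var_x=1.0)
double_n = slope_variance(sigma2=1.0, n=200, var_x=1.0)
double_var_x = slope_variance(sigma2=1.0, n=100, var_x=2.0)

assert double_n == double_var_x == base / 2   # identical precision gain
```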

a) In correlational research, use more comprehensive continuous measures. That is, there should be a large possible range of values endorsed by participants. However, the measure should also capture many different aspects of the construct of interest; artificially increasing the range of X by adding redundant items (i.e., simply re-phrasing existing items to ask the same question) will actually hurt the validity of the analysis. Also, avoid dichotomizing your measures (e.g., median splits), because this reduces the variance and typically reduces power (MacCallum et al., 2002).
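The cost of a median split is easy to see in a quick simulation (an illustrative sketch assuming bivariate normal data, not an analysis from MacCallum et al.): the correlation computed with the dichotomized predictor is weaker than the one computed with the continuous predictor.

```python
# Illustrative simulation: dichotomizing a continuous predictor at the
# median discards variance and attenuates the observed correlation.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

r_continuous = np.corrcoef(x, y)[0, 1]
x_split = (x > np.median(x)).astype(float)   # median split: high vs. low
r_split = np.corrcoef(x_split, y)[0, 1]

assert r_split < r_continuous                # the split version is attenuated
```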

b) In experimental research, unequally allocating participants to conditions can improve statistical power. For example, imagine you are designing an experiment with three conditions (say, low, medium, or high self-esteem). Most of us would assign participants equally to all three groups, right? Well, as it turns out, assigning participants equally across groups usually reduces statistical power. The idea behind assigning participants unequally to conditions is to maximize the variance of X for the particular kind of relationship under study -- which, according to the formula given earlier, will increase power and precision. For example, the optimal design for a linear relationship is 50% low, 50% high, omitting the medium condition; the optimal design for a quadratic relationship is 25% low, 50% medium, and 25% high. The optimal proportions vary widely depending on the design and the kind of relationship you expect, so I recommend checking out McClelland (1997) for more on efficient experimental designs. You might be surprised.
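A quick check of the intuition (a sketch of the variance calculation only, not McClelland's full optimality analysis; the allocations are illustrative): for a linear effect, putting all 200 participants at the two endpoints yields a larger variance of X than spreading them across three levels.

```python
# Variance of the design variable X under different allocations of 200
# participants to low (-1), medium (0), and high (+1) conditions.
def design_variance(counts, levels=(-1, 0, 1)):
    n = sum(counts)
    mean = sum(c * lv for c, lv in zip(counts, levels)) / n
    return sum(c * (lv - mean) ** 2 for c, lv in zip(counts, levels)) / n

equal_thirds = design_variance((67, 66, 67))      # the "obvious" equal split
endpoints_only = design_variance((100, 0, 100))   # 50% low, 50% high

assert endpoints_only > equal_thirds              # more Var(X), hence more power
```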

Recommendation 3: Make sure predictor variables are uncorrelated with each other

A final way to increase statistical power is to increase the proportion of the variance in X not shared with other variables in the model. When predictor variables are correlated with each other, this is known as collinearity. For example, depression and anxiety are positively correlated with each other; including both as simultaneous predictors (say, in multiple regression) means that statistical power will be reduced, especially if one of the two variables doesn’t actually predict the outcome variable. Lots of textbooks suggest that we should only worry about this when collinearity is extremely high (e.g., correlations above about .70). However, studies have shown that even modest intercorrelations among predictor variables will reduce statistical power (Mason et al., 1991). Bottom line: if you can design a model where your predictor variables are relatively uncorrelated with each other, you can improve statistical power.
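The usual way to quantify this cost is the variance inflation factor, VIF = 1 / (1 − r²), where r is a predictor's correlation with the other predictors; the slope's standard error gets multiplied by the square root of the VIF. A tiny sketch (function names are mine):

```python
# Even modest collinearity inflates standard errors: the multiplier on a
# slope's SE is sqrt(VIF) = sqrt(1 / (1 - r^2)).
import math

def vif(r):
    return 1.0 / (1.0 - r ** 2)

def se_inflation(r):
    return math.sqrt(vif(r))

# A modest r = .50 already inflates the standard error by about 15%:
assert abs(se_inflation(0.5) - 1.1547) < 1e-3
assert se_inflation(0.7) > se_inflation(0.5) > se_inflation(0.0) == 1.0
```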


Increasing statistical power is one of the rare cases where what is good for science and what is good for your career actually coincide. It increases the accuracy and replicability of results, so it’s good for science. It also increases your likelihood of finding a statistically significant result (assuming the effect actually exists), making it more likely that you’ll get something published. You don’t need to torture your data with obsessive re-analysis until you get p < .05. Instead, put more thought into research design in order to maximize statistical power. Everyone wins, and you can spend the time you used to spend sweating over p-values on something more productive. Like volunteering with the Open Science Collaboration.


Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect Size and Power in Assessing Moderating Effects of Categorical Variables Using Multiple Regression: A 30-Year Review. Journal of Applied Psychology, 90, 94-107. doi:10.1037/0021-9010.90.1.94

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376. doi:10.1038/nrn3475

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65, 145-153. doi:10.1037/h0045186

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159. doi:10.1037/0033-2909.112.1.155

Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine, 121, 200-206.

Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (NIDA Research Monograph 142, NIH Publication No. 94-3599, pp. 184–195). Rockville, MD: National Institutes of Health.

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40. doi:10.1037/1082-989X.7.1.19

Mason, C. H., & Perreault, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28, 268-280. doi:10.2307/3172863

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563. doi:10.1146/annurev.psych.59.103006.093735

McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19. doi:10.1037/1082-989X.2.1.3

McClelland, G. H. (2000). Increasing statistical power without increasing sample size. American Psychologist, 55, 963-964. doi:10.1037/0003-066X.55.8.963

Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120. doi:10.1111/j.1744-6570.1996.tb01793.x

Open Science Collaboration. (in press). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series). New York, NY: Taylor & Francis. doi:10.2139/ssrn.2195999

Richard, F. D., Bond, C. F., Jr., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363. doi:10.1037/1089-2680.7.4.331

Oct 25, 2013

It’s Easy Being Green (Open Access)


This is the first installment of the Open Science Toolkit, a recurring feature that outlines practical steps individuals and organizations can take to make science more open and reproducible.


Congratulations! Your manuscript has been peer reviewed and accepted for publication in a journal. The journal is owned by a major publisher who wants you to know that, for $3,000, you can make your article open access (OA) forever. Anyone in the world with access to the Internet will have access to your article, which may be cited more often because of its OA status. Otherwise, the journal would be happy to make your paper available to subscribers and others willing to pay a fee.

Does this sound familiar? It sure does to me. For many years, when I heard about Open Access (OA) to scientific research, it was always about making an article freely available in a peer-reviewed journal -- the so-called “gold” OA option -- often at considerable expense. I liked the idea of making my work available to the widest possible audience, but the costs were prohibitive.

As it turns out, however, the “best-kept secret” of OA is that you can make your work OA for free by self-archiving it in an OA repository, even if it has already been published in a non-OA journal. Such “green” OA is possible because many journals have already provided blanket permission for authors to deposit their peer-reviewed manuscript in an OA repository.

The flowchart below shows how easy it is to make your prior work OA. The key is to make sure you follow the self-archiving policy of the journal your work was published in, and to deposit the work in a suitable repository.

[Flowchart showing how to archive your research. Click to enlarge.]

Journals typically display their self-archiving and copyright policies on their websites, but you can also search for them in the SHERPA/RoMEO database, which has a nicely curated collection of policies from 1,333 publishers. The database assigns a code to journals based on how far in the publication process their permissions extend. It distinguishes between a pre-print, which is the version of the manuscript before it underwent peer review, and a post-print, the peer-reviewed version before the journal copyedited and typeset it. Few journals allow you to self-archive their copyedited PDF version of your article, but many let you do so for the pre-print or post-print version. Unfortunately, some journals don’t provide blanket permission for self-archiving, or require you to wait for an embargo period to end before doing so. If you run into this problem, you should contact the journal and ask for permission to deposit the non-copyedited version of your article in an OA repository.

It has also become easy to find a suitable OA repository in which to deposit your work. Your first stop should be the Directory of Open Access Repositories (OpenDOAR), which currently lists over 2,200 institutional, disciplinary, and universal repositories. Although an article deposited in an OA repository will be available to anyone with Internet access, repositories differ in feature sets and policies. For example, some repositories, like figshare, automatically assign a CC BY license to all publicly shared papers; others, like Open Depot, allow you to choose a license before making the article public. A good OA repository will tell you how it ensures the long-term digital preservation of your content as well as what metadata it exposes to search engines and web services.

Once you’ve deposited your article in an OA repository, consider making others aware of its existence. Link to it on your website, mention it on social media, or add it to your CV.

In honor of Open Access Week, I am issuing a “Green OA Challenge” to readers of this blog who have published at least one peer-reviewed article. The challenge is to self-archive one of your articles in an OA repository and link to it in the comments below. Please also feel free to share any comments you have about the self-archiving process or about green OA. Happy archiving!

Oct 16, 2013

Thriving au naturel amid science on steroids

Brian A. Nosek
University of Virginia, Center for Open Science

Jeffrey R. Spies
Center for Open Science

Last fall, the present first author taught a graduate class called “Improving (Our) Science” at the University of Virginia. The class reviewed evidence suggesting that scientific practices are not operating ideally and are damaging the reproducibility of published findings. For example, the power of an experimental design in null hypothesis significance testing is a function of the effect size being investigated and the size of the sample used to test it—power is greater when effects are larger and samples are bigger. In the authors’ field of psychology, estimates suggest that the power of published studies to detect an average effect size is .50 or less (Cohen, 1962; Sedlmeier & Gigerenzer, 1989). Assuming that all of the published effects are true, then, approximately 50% of published studies would reveal positive results (i.e., p < .05 supporting the hypothesis). In reality, more than 90% of published results are positive (Sterling, 1959; Fanelli, 2010).
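The mismatch is simple arithmetic. If some proportion of tested hypotheses are true, the expected rate of positive results is (proportion true × power) + (proportion false × alpha); the sketch below (function name is mine, not the authors') shows that even under the most generous assumption, honest positives top out at the power level.

```python
# Expected rate of positive (p < .05) results as a function of the
# proportion of true hypotheses, statistical power, and alpha.
def expected_positive_rate(prop_true, power, alpha=0.05):
    return prop_true * power + (1 - prop_true) * alpha

# Even if every tested hypothesis were true, power of .50 caps honest
# positive results at 50% -- far below the ~90% observed in journals.
all_true = expected_positive_rate(prop_true=1.0, power=0.5)
assert all_true == 0.5
```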

How is it possible that the average study has power to detect the average true effect 50% of the time or less and yet does so about 90% of the time? It isn’t. Then how does this occur? One obvious contributor is selective reporting. Positive effects are more likely than negative effects to be submitted and accepted for publication (Greenwald, 1975). The consequences include [1] that the published literature exaggerates the size of true effects, because with low-powered designs researchers must still leverage chance to obtain an effect size large enough to produce a positive result; and [2] that the proportion of false positives – results where there isn’t actually an effect to detect – is inflated beyond the nominal alpha level of 5% (Ioannidis, 2005).
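Consequence [2] can be made concrete with the same arithmetic (the numbers below are illustrative, not estimates from the literature): among positive results, the share that are false positives exceeds the nominal alpha whenever some tested hypotheses are false and power is low.

```python
# Share of positive results that are false positives, given the proportion
# of true hypotheses, statistical power, and alpha (illustrative sketch).
def false_positive_share(prop_true, power, alpha=0.05):
    true_pos = prop_true * power
    false_pos = (1 - prop_true) * alpha
    return false_pos / (true_pos + false_pos)

# If half of tested hypotheses are false and power is .50, about 9% of
# positive results are false positives -- nearly double the nominal 5%.
share = false_positive_share(prop_true=0.5, power=0.5)
assert abs(share - 0.0909) < 1e-3
```

Selective reporting then publishes mostly the positives, so this inflated share is what the literature inherits.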

The class discussed this and other scientific practices that may interfere with knowledge accumulation. Some of the relatively common ones are described in Table 1 along with some solutions that we, and others, identified. Problem. Solution. Easy. The class just fixed science. Now, class members can adopt the solutions as best available practices. Our scientific outputs will be more accurate, and significant effects will be more reproducible. Our science will be better.

Alex Schiller, a class member and graduate student, demurred. He agreed that the new practices would make science better, but disagreed that we should do them all. A better solution, he argued, is to take small steps: adopt one solution, wait for that to become standard scientific practice, and then adopt another solution.

We know that some of our practices are deficient, we know how to improve them, but Alex is arguing that we shouldn’t implement all the solutions? Alex’s lapse of judgment can be forgiven—he’s just a graduate student. However, his point isn’t a lapse. Faced with the reality of succeeding as a scientist, Alex is right.

Table 1

Scientific practices that increase irreproducibility of published findings, possible solutions, and barriers that prevent adoption of those solutions

Practice | Problem | Possible Solution | Barrier to Solution
Run many low-powered studies rather than few high-powered studies | Inflates false positive and false negative rates | Run high-powered studies | Non-significant effects are a threat to publishability; risky to devote extra resources to high-powered tests that might not produce significant effects
Report significant effects and dismiss non-significant effects as methodologically flawed | Using outcome to evaluate method is a logical error and can inflate false positive rate | Report all effects with rationale for why some should be ignored; let reader decide | Non-significant and mixed effects are a threat to publishability
Analyze during data collection; stop when significant result is obtained, or continue until significant result is obtained | Inflates false positive rate | Define data stopping rule in advance | Non-significant effects are a threat to publishability
Include multiple conditions or outcome variables; report the subset that showed significant effects | Inflates false positive rate | Report all conditions and outcome variables | Non-significant and mixed effects are a threat to publishability
Try multiple analysis strategies, data exclusions, data transformations; report cleanest subset | Inflates false positive rate | Pre-specify data analysis plan, or report all analysis strategies | Non-significant and mixed effects are a threat to publishability
Report discoveries as if they had resulted from confirmatory tests | Inflates false positive rate | Pre-specify hypotheses; report exploratory and confirmatory analyses separately | Many findings are discoveries, but stories are nicer and scientists seem smarter if they had thought it in advance
Never do a direct replication | Inflates false positive rate | Conduct direct replications of important effects | Incentives are focused on innovation, replications are boring; original authors might feel embarrassed if their original finding is irreproducible

Note: For reviews of these practices and their effects see Ioannidis, 2005; Giner-Sorolla, 2012; Greenwald, 1975; John et al., 2012; Nosek et al., 2012; Rosenthal, 1979; Simmons et al., 2011; Young et al., 2008

The Context

In an ideal world, scientists use the best available practices to produce accurate, reproducible science. But, scientists don’t live in an ideal world. Alex is creating a career for himself. To succeed, he must publish. Papers are academic currency. They are Alex’s ticket to job security, fame, and fortune. Well, okay, maybe just job security. But, not everything is published, and some publications are valued more than others. Alex can maximize his publishing success by producing particular kinds of results. Positive effects, not negative effects (Fanelli, 2010; Sterling, 1959). Novel effects, not verifications of prior effects (Open Science Collaboration, 2012). Aesthetically appealing, clean results, not results with ambiguities or inconsistencies (Giner-Sorolla, 2012). Just look in the pages of Nature, or any other leading journal, they are filled with articles producing positive, novel, beautiful results. They are wonderful, exciting, and groundbreaking. Who wouldn’t want that?

We do want that, and science advances in leaps with groundbreaking results. The hard reality is that few results are actually groundbreaking. And, even for important research, the results are often far from beautiful. There are confusing contradictions, apparent exceptions, and things that just don’t make sense. To those in the laboratory, this is no surprise. Being at the frontiers of knowledge is hard. We don’t quite know what we are looking at. That’s why we are studying it. Or, as Einstein said, “If we knew what we were doing, it wouldn’t be called research”.   But, those outside the laboratory get a different impression. When the research becomes a published article, much of the muck goes away. The published articles are like the pictures of this commentary’s authors at the top of this page. Those pictures are about as good as we can look. You should see the discards. Those with insider access know, for example, that we each own three shirts with buttons and have highly variable shaving habits. Published articles present the best-dressed, clean-shaven versions of the actual work.

Just as with people, when you replicate effects yourself to see them in person, they may not be as beautiful as they appeared in print. The published version often looks much better than reality. The effect is hard to get, dependent on a multitude of unmentioned limiting conditions, or entirely irreproducible (Begley & Ellis, 2012; Prinz et al., 2011).

The Problem

It is not surprising that effects are presented in their best light. Career advancement depends on publishing success. More beautiful looking results are easier to publish and more likely to earn rewards (Giner-Sorolla, 2012). Individual incentives align for maximizing publishability, even at the expense of accuracy (Nosek et al., 2012).

Consider three hypothetical papers shown in Table 2. For all three, the researchers identified an important problem and had an idea for a novel solution. Paper A is a natural beauty. Two well planned studies showed effects supporting the idea. Paper B and Paper C were conducted with identical study designs. Paper B is natural, but not beautiful; Paper C is a manufactured beauty. Both Paper B and Paper C were based on 3 studies. One study for each showed clear support for the idea. A second study was a mixed success for Paper B, but “worked” for Paper C after increasing the sample size a bit and analyzing the data a few different ways. A third study did not work for either. Paper B reported the failure with an explanation for why the methodology might be to blame, rather than the idea being incorrect. The authors of Paper C generated the same methodological explanation, categorized the study as a pilot, and did not report it at all. Also, Paper C described the final sample sizes and analysis strategies, but did not mention that extra data was collected after initial analysis, or that alternative analysis strategies had been tried and dismissed.

Table 2

Summary of research practices for three hypothetical papers

Step | Paper A | Paper B | Paper C
Data collection | Conducted two studies | Conducted three studies | Conducted three studies
Data analysis | Analyzed data after completing data collection, following a pre-specified analysis plan | Analyzed data after completing data collection, following a pre-specified analysis plan | Analyzed during data collection and collected more data to get to significance in one case; selected from multiple analysis strategies for all studies
Result reporting | Reported the results of the planned analyses for both studies | Reported the results of the planned analyses for all studies | Reported results of final analyses only; did not report one study that did not reach significance
Final paper | Two studies demonstrating clear support for idea | One study demonstrating clear support for idea, one mixed, one not at all | Two studies demonstrating clear support for idea

Paper A is clearly better than Paper B. Paper A should be published in a more prestigious outlet and generate more attention and accolades. Paper C looks like Paper A, but in reality it is like Paper B: the actual evidence is weaker than the apparent evidence. Based on the report alone, however, no one can tell the difference between Paper A and Paper C.

Two possibilities would minimize the negative impact of publishing manufactured beauties like Paper C. First, if replication were standard practice, then manufactured effects would be identified rapidly. However, direct replication is very uncommon (Open Science Collaboration, 2012). Once an effect is in the literature, there is little systematic pressure to self-correct. Rather than being weeded out, false effects persist or just slowly fade away. Second, scientists could simply avoid the practices that lead to Paper C, making this illustration an irrelevant hypothetical. Unfortunately, a growing body of evidence suggests that these practices occur, and some are even common (e.g., John et al., 2012).

To avoid the practices that produce Paper C, the scientist must be aware of and confront a conflict of interest—what is best for science versus what is best for me. Scientists have inordinate opportunity to pursue flexible decision-making in design and analysis, and there is minimal accountability for those practices. Further, humans’ prodigious capacities for motivated reasoning provide a way to decide that the outcomes that look best for us also have the most compelling rationale (Kunda, 1990). So, we may convince ourselves that the best course of action for us was the best course of action, period. It is very difficult to stop doing suspect practices when we have thoroughly convinced ourselves that we are not doing them.

The Solution

Alex needs to publish to succeed. The practices in Table 1 are to the scientist what steroids are to the athlete. They amplify the likelihood of success in a competitive marketplace. If others are using and Alex decides to rely on his natural performance, then he will disadvantage his career prospects. Alex wants to do the best science he can and be successful for doing it. In short, he is the same as every other scientist we know, ourselves included. Alex shouldn’t have to choose between doing the best science and being successful—these should be the same thing.

Is Alex stuck? Must he wait for institutional regulation, audits, and the science police to fix the system? In a regulatory world, practices are enforced, and he need not worry that he’s committing career suicide by following them. Many scientists are wary of a strong regulatory environment in science, particularly because of the possibility of stifling innovation. Some of the best ideas start with barely any evidence at all, and restrictive regulations on confidence in outputs could discourage taking risks on new ideas. Nonetheless, funders, governments, and other stakeholders are taking notice of the problematic incentive structures in science. If we don’t solve the problem ourselves, regulators may solve it for us.

Luckily, Alex has an alternative. The practices in Table 1 may be widespread, but the solutions are also well known and endorsed as good practice (Fuchs et al., 2012). That is, scientists easily understand the differences between Papers A, B, and C – if they have full access to how the findings were produced. As a consequence, the only way to be rewarded for natural achievements over manufactured ones is to make the process of obtaining the results transparent. Using the best available practices privately will improve science but hurt careers. Using the best available practices publicly will improve science while simultaneously improving the reputation of the scientist. With openness, success can be influenced by the results and by how they were obtained.  


The present incentives for publishing are focused on the one thing that we scientists are absolutely, positively not supposed to control - the results of the investigation. Scientists have complete control over the design, procedures, and execution of the study. The results are what they are.

A better science will emerge when the incentives for achievement align with the things that scientists can (and should) control with their wits, effort, and creativity. With results, beauty is contingent on what is known about their origin. Obfuscation of methodology can make ugly results appear beautiful. With methodology, if it looks beautiful, it is beautiful. The beauty of methodology is revealed by openness.

Most scientific results have warts. Evidence is halting, uncertain, incomplete, confusing, and messy. It is that way because scientists are working on hard problems. Exposing it will accelerate finding solutions to clean it up. Instead of trying to make results look beautiful when they are not, the inner beauty of science can be made apparent. Whatever the results, the inner beauty—strong design, brilliant reasoning, careful analysis—is what counts. With openness, we won’t stop aiming for A papers. But, when we get them, it will be clear that we earned them.


Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature, 483, 531-533.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLoS ONE, 5(4), e10068. doi:10.1371/journal.pone.0010068

Fuchs H., Jenny, M., & Fiedler, S. (2012). Psychologists are open to change, yet wary of rules. Perspectives on Psychological Science, 7, 634-637. doi:10.1177/1745691612459521

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

Giner-Sorolla, R. (2012). Science or art? How esthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science, 7, 562-571.

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524-532. doi:10.1177/0956797611430953

Kunda, Z. (1990). The case for motivated reasoning. Psychological Bulletin, 108, 480-498. doi:10.1037/0033-2909.108.3.480

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7,615-631. doi:10.1177/1745691612459058

Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657-660. doi:10.1177/1745691612462588

Prinz, F., Schlange, T. & Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712-713.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641. doi:10.1037/0033-2909.86.3.638

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa. Journal of the American Statistical Association, 54, 30-34.

Young, N. S., Ioannidis, J. P. A., & Al-Ubaydli, O. (2008). Why current publication practices may distort science. PLoS Medicine, 5, 1418-1422.

Oct 10, 2013

Opportunities for Collaborative Research


Photo of Jon Grahe

As a research methods instructor, I encourage my students to conduct “authentic” research projects. For now, consider authentic undergraduate research experiences to be student projects at any level where the findings might result in a conference presentation or a publishable paper. For us, this has included getting IRB approval and attempting to present our findings at regional conferences, sometimes with a journal submission. Eventually, I found opportunities for my students to contribute to big science by joining two crowd-sourcing projects organized by Alan Reifman. The first was published in Teaching of Psychology (School Spirit Study Group, 2004) as an example for others to follow; the data included samples from 22 research methods classes that measured indices of school spirit (hence the group name). Classes evaluated the data from their own school, and the researchers collapsed the data to look at generalization across institutions. The second collaborative project comprised surveys collected by students in 10 research methods classes; the topic focused primarily on emerging adulthood and political attitudes, but included many other psychological constructs. Later in this post, I note the current opportunities for emerging adulthood theorists to use these data for a special issue of the journal Emerging Adulthood. These two projects notwithstanding, there have been few open calls for instructors to participate in big science. Instructors who want to include authentic research experiences do so by completing all their own legwork, as characterized by Frank and Saxe (2012) or as displayed by many of the poster presentations you see at regional conferences.

However, that state of affairs has changed. Though instructors are likely to continue engaging in authentic research in their classrooms, they don’t have to develop the projects on their own anymore. I am very excited about the recent opportunities that allow students to contribute to “big science” by acting as crowd-sourcing experimenters. In full disclosure, I acknowledge my direct involvement in developing three of the following projects. However, I will save you a description of my own pilot test called the Collective Undergraduate Research Project (Grahe, 2010). Instead, I will briefly review recent “open invitation projects”, those where any qualified researcher can contribute. The first to emerge (Psych File Drawer, Reproducibility Project) were focused on PhD level researchers. However, since August 2012 there have been three research projects that specifically invite students to help generate data either as experimenters or as coders. More are likely to emerge soon as theorists grasp the idea that others can help them collect data.

This is a great time to be a research psychologist. These projects provide real opportunities for students and other researchers at any level of expertise to get involved in not only authentic but transformative research. The Council on Undergraduate Research recently published an edited volume (Karukstis & Hensel, 2010) dedicated to fostering transformative research for undergraduates. According to Wikipedia, “Transformative research is a term that became increasingly common within the science policy community in the 2000s for research that shifts or breaks existing scientific paradigms.” I consider open invitation projects transformative because they change the way we view minimally acceptable standards for research. Each one is intended to change the basic premise of collaboration by bridging not only institutions but also the chasm of acquaintance: any qualified researcher can participate in data collection and authorship. Now, when I introduce research opportunities to my students, the ideas are grand. Here are research projects with grand ideas that invite contributions.

Many Labs Project – The Many Labs Team’s original invitation to contributors, sent out in February 2013, asked contributors to join their “wide-scale replication project [that was] conditionally accepted for publication in the special issue of Social Psychology.” They sent a follow-up invitation in July reminding us of their Oct. 1, 2013 deadline for data collection. The deadline limits future contributions, but the project is a great example of crowd-sourced research. Their goal is to replicate 12 effects using the Project Implicit infrastructure. As is typical of these projects, contributors meeting a minimum goal (N > 80 cases in this instance) will be listed as coauthors on future publications. As they stated in their July post, “The project and accepted proposal has been registered on the Open Science Framework and can be found at this link: http://www.openscienceframework.org/project/WX7Ck/.” Richard Klein (project coordinator) reported that the project includes 18 different researchers at 15 distinct labs, with a plan for 38 labs contributing data before the deadline. Though new contributions are now limited as the deadline nears, the project represents an important exemplar for future efforts.

Reproducibility Project – As stated in their invitation to new contributors on the Open Science Framework page, "Our primary goal is to conduct high quality replications of studies from three 2008 psychology journals and then compare our results to the original studies." At the time of writing, Johanna Cohoon from the Center for Open Science states, “We have 136 replication authors who come from 59 different institutions, with an additional 19 acknowledged replication researchers (155 replication researchers total). We also have an additional 40 researchers who are not involved in a replication that have earned authorship through coding/vetting 10 or more articles.” In short, this is a large collaboration that welcomes more committed researchers. Though the project needs advanced researchers, faculty can work closely with students who wish to contribute.

PsychFileDrawer Project – This is less a collaborative effort between researchers than a compendium of replications, as stated by the project organizers: “PsychFileDrawer.org is a tool designed to address the File Drawer Problem as it pertains to psychological research: the distortion in the scientific literature that results from the failure to publish non-replications.” It is collaborative in the sense that it hosts a “top 20” list of studies that viewers want to see replicated. However, any researcher with a replication is invited to contribute the data. Further, although the site was not initially targeted toward students, a new feature allows contributors to identify a sample as a class project, and the FAQ page asks instructors to comment on “…the level and type of instructor supervision, and instructor’s degree of confidence in the fidelity with which the experimental protocol was implemented.”

Emerging Adulthood, Political Decisions, and More (2004) project – Alan Reifman is an early proponent of collaborative undergraduate research projects. After successfully guiding the School Spirit Study Group, he again called on research methods instructors to collectively conduct a survey that included two emerging adulthood scales, some political attitudes and intentions, and other scales that contributors wanted to add to the paper-and-pencil survey. By the time the survey was finalized, it was over 10 pages long and included hundreds of individual items measuring dozens of distinct psychological constructs on 12 scales. In retrospect, the survey was too long, and the invitation to add on measures might have caused the attrition of committed contributors that occurred. However, the final sample included over 1300 cases from 11 distinct institutions across the US. A list of contributors and initial findings can be found at Alan Reifman’s Emerging Adulthood Page. The project suffered from the amorphous structure of the collaboration; in short, no one instructor was interested in all the constructs. To resolve this situation, in which wonderfully rich data sit unanalyzed, the contributors are inviting emerging adulthood theorists to analyze the data and submit their work for a special issue of Emerging Adulthood. Details for how to request the data are available on the project’s OSF page. The submission deadline is July 2014.

International Situations Project (ISP) – David Funder and Esther Guillaume-Hanes from the University of California—Riverside have organized a coalition of international researchers (19 and counting) to complete an internet-based protocol. As they describe on their about page, “Participants will describe situations by writing a brief description of the situation they experienced the previous day at 7 pm. They will identify situational characteristics using the Riverside Situation Q-sort (RSQ) which includes 89 items that participants place into one of nine categories ranging from not at all characteristic to very characteristic. They then identify the behaviors they displayed using the Riverside Behavioral Q-Sort (RBQ) which includes 68 items using the same sorting procedure.” The UC-Riverside researchers are taking primary responsibility for writing research reports. They have papers in print and others in preparation on which all contributors who provide more than 80 cases are listed as authors. In Fall 2012, Psi Chi and Psi Beta encouraged their members to replicate the ISP in the US to create a series of local samples, yielding 11 samples with 5 more committed contributors (Grahe, Guillaume-Hanes, & Rudmann, 2014). Currently, contributors are invited to complete either this protocol or the subsequent Personality project, which involves completing this protocol and then the California Adult Q-Sort two weeks later. Interested researchers should contact Esther Guillaume directly at eguil002@ucr.edu.

Collaborative Replications and Education Project (CREP) – This project has recently started inviting contributions and is explicitly designed for undergraduates. Instructors who guide student research projects are encouraged to share the available studies list with their students. Hans IJzerman and Mark Brandt reviewed the top three cited empirical articles from the top journals in nine subdisciplines, rated them for the feasibility of their being completed by undergraduates, and selected the nine most feasible studies. The project provides small ($200-$500) CREP research awards for completed projects (sponsored by Psi Chi and the Center for Open Science). Contributors will be encouraged and supported in writing and submitting reports for publication by the project coordinators. Anyone interested in participating should contact Jon Grahe (graheje@plu.edu).

Archival Project – This Center for Open Science project is also specifically designed as a crowd-sourcing opportunity for students. Once it completes the beta-testing phase, the Archival Project will be publicly advertised. Unlike all the projects reviewed thus far, this one asks contributors to serve as coders rather than experimenters. It is a companion to the OSC Reproducibility Project in that the target articles come from the same three journals from the first three months of 2008. The project has a low bar for entry, as training can take little time (particularly with the now-available online tutorial) and coders can code as few as a single article and still make a real contribution. However, it also has a system of honors, as stated on its “getting involved” page: “Contributors who provide five or more accurate codings will be listed as part of the collective authorship.” The project was designed with the expectation that instructors will find the opportunity pedagogically useful and will employ it as a course exercise. Alternatively, students in organized clubs (such as Psi Chi) are invited to participate to increase their own methodological competence while simultaneously accumulating evidence of their contributions to an authentic research project. Finally, graduate students are invited to participate without faculty supervision. Interested parties should contact Johanna Cohoon (johannacohoon@gmail.com) for more information.

Future Opportunities – While this is intended to be an exhaustive list of open invitation projects, the field is not static and the list is likely to grow. What is exciting is that we now have ample opportunities to participate in “big science” with relatively small contributions. When developed with care, these projects follow Grahe et al.’s (2012) recommendation to take advantage of the magnitude of research being conducted each year by psychology students. The bar for entry varies across projects, from relatively intensive (e.g., Reproducibility Project, CREP) to relatively easy (e.g., Archival Project, ISP), providing opportunities for individuals with varying resources, from graduate students and PhD-level researchers capable of completing high-quality replications to students and instructors who seek opportunities in the classroom. Beyond the basic opportunity to participate in transformative research, these projects provide exemplars for how future collaborative projects should be designed and managed.

This is surely an incomplete list of current or potential examples of crowd-sourced research. Please share other examples in the comments below. Further, consider pointing out strengths, weaknesses, opportunities, or threats that could emerge from collaborative research. Finally, any public statements about intentions to participate in this type of open science initiative are welcome.


Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives On Psychological Science, 7(6), 600-604. doi:10.1177/1745691612460686

Grahe, J. E., Guillaume-Hanes, E., & Rudmann, J. (2014). Students collaborate to advance science: The International Situations Project. Council on Undergraduate Research Quarterly.

Grahe, J. E., Reifman, A., Hermann, A. D., Walker, M., Oleson, K. C., Nario-Redmond, M., & Wiebe, R. P. (2012). Harnessing the undiscovered resource of student research projects. Perspectives On Psychological Science, 7(6), 605-607. doi:10.1177/1745691612459057

Karukstis, K. K., & Hensel, N. (2010). Transformative research at predominately undergraduate institutions. Washington, DC: Council on Undergraduate Research.

Oct 4, 2013

A publishing sting, but what was stung?


Before Open Science there was Open Access (OA) — a movement driven by the desire to make published research publicly accessible (after all, the public usually had paid for it), rather than hidden behind paywalls.

Open Access is, by now, its very own strong movement — follow for example Björn Brembs if you want to know what is happening there — and there are now nice Open Access journals, like PLOS and Frontiers, which are peer-reviewed and reputable. Some of these journals charge an article processing fee for OA articles, but in many cases funders have introduced or are developing provisions to cover these costs. (In fact, the big private funder in Sweden INSISTS on it.)

But, as always, situations where there is money involved and targets who are desperate (please please please publish my baby so I won’t perish) breed mimics and cuckoos and charlatans, ready to game the new playing field to their advantage. This is probably just a feature of the human condition (see Trivers’s “Folly of Fools”).

There are lists of potentially predatory Open Access journals — I have linked some in on my private blog (Åse Fixes Science) here and here. Reputation counts. Buyers beware!

Demonstrating the challenges of this new marketplace, John Bohannon published in Science (a decidedly not Open Access journal) a sting operation in which he spoofed Open Access journals to test their peer-review systems. The papers were machine-generated nonsense — one may recall the Sokal Hoax from the previous Science Wars. One may also recall the classic Ceci paper from 1982, which made the rounds again earlier this year (and I blogged about that one too, on my other blog).

Crucially, all of Bohannon’s papers contained fatal flaws that a decent peer reviewer should catch. The big news? Lots of them did not (though PLOS did). Here’s the Science article with its commentary stream, and a commentary from Retraction Watch.

This is, of course, interesting — and it is generating buzz. But it is also generating some negative reaction. For one, Bohannon did not include regular non-OA journals in his test, so the experiment lacks a control group, which means we can make no comparison and draw no firm inferences from the data. The reason he gives (quoted on the Retraction Watch site) is the very long turnaround time of regular journals, which can be months, even a year (or longer, as I’ve heard). I kinda buy it, but this is really what is angering the Open Access crowd, who see this article as an attempt to implicate Open Access itself as the source of the problem. And, apart from Bohannon not including regular journals in his test, Science published what he wrote without peer reviewing it.

My initial take? I think it is important to test these things — to uncover the flaws in the system, and also to uncover the cuckoos and the mimics and the gamers. Of course, the problems in peer review are not solely on the shoulders of Open Access — Diederik Stapel, and other frauds, published almost exclusively in paywalled journals (including Science). The status of peer review warrants its own scrutiny.

But, I think the Open Access advocates have a really important point to make. As noted on Retraction Watch, Bohannon's study didn't include non-OA journals, so it's unclear whether the peer-review problems he identified in OA journals are unique to their OA status.

I’ll end by linking in some of the commentary that I have seen so far — and, of course, we’re happy to hear your comments.

Oct 2, 2013

Smoking on an Airplane


Photo of Denny

People used to smoke on airplanes. It's hard to imagine, but it's true. In less than twenty years, smoking on airplanes has grown so unacceptable that it has become difficult to see how people ever condoned it in the first place. Psychological scientists used to refuse to share their data. It's not so hard to imagine, and it's still partly true. However, my guess is that a few years from now, data-secrecy will be as unimaginable as smoking on an airplane is today. We've already come a long way. When in 2005 Jelte Wicherts, Dylan Molenaar, Judith Kats, and I asked 141 psychological scientists to send us their raw data to verify their analyses, many of them told us to get lost - even though, at the time of publishing the research, they had signed an agreement to share their data upon request. "I don't have time for this," one famous psychologist said bluntly, as if living up to a written agreement is a hobby rather than a moral responsibility. Many psychologists responded in the same way. If they responded at all, that is.

Like Diederik Stapel.

I remember that Judith Kats, the student in our group who prepared the emails asking researchers to make data available, stood in my office. She explained to me how researchers had responded to our emails. Although many researchers had refused to share data, some of our Dutch colleagues had done so in an extremely colloquial, if not downright condescending way. Judith asked me how she should respond. Should she once again inform our colleagues that they had signed an APA agreement, and that they were in violation of a moral code?

I said no.

It's one of the very few things in my scientific career that I regret. Had we pushed our colleagues to the limit, perhaps we would have been able to identify Stapel's criminal practices years earlier. As his autobiography shows, Stapel counterfeited his data in an unbelievably clumsy way, and I am convinced that we would have easily identified his data as fake. I had many reasons for saying no, which seemed legitimate at the time, but in hindsight I think my behavior was a sign of adaptation to a defective research culture. I had simply grown accustomed to the fact that, when I entered an elevator, conversations regarding statistical analyses would fall silent. I took it as a fact of life that, after we methodologists had explained to students how to analyze data in a responsible way, some of our colleagues would take it upon themselves to show students how scientific data analysis really worked (today, these practices are known as p-hacking). We all lived in a scientific version of The Matrix, in which the reality of research was hidden from all - except those who had the dubious honor of being initiated. There was the science that people reported and there was the science that people did.

At Groningen University, where Stapel used to work, he was known as The Lord of the Data, because he never let anyone near his SPSS files. He pulled results out of thin air, throwing them around as presents to his co-workers, and when anybody asked him to show the underlying data files, he simply didn't respond. Very few people saw this as problematic, because, hey, these were his data, and why should Stapel share his data with outsiders?

That was the moral order of scientific psychology. Data are private property. Nosy colleagues asking for data? Just chase them away, like you chase coyotes from your farm. That is why researchers had no problem whatsoever denying access to their data, and that is why several people saw the data-sharing request itself as unethical. "Why don't you trust us?," I recall one researcher saying in a suspicious tone of voice.

It is unbelievable how quickly things have changed.

In the wake of the Stapel case, the community of psychological scientists committed to openness, data-sharing, and methodological transparency quickly reached a critical mass. The Open Science Framework allows researchers to archive all of their research materials, including stimuli, analysis code, and data, and to make them public by simply pressing a button. The new Journal of Open Psychology Data offers an outlet specifically designed to publish datasets, thereby giving these the status of a publication. PsychDisclosure.org asks researchers to document decisions regarding, e.g., sample size determination and variable selection, that were left unmentioned in publications; most researchers provide this information without hesitation - some actually do so voluntarily. The journal Psychological Science will likely implement requirements for this type of information in the submission process. Data-archiving possibilities are growing like crazy. Major funding institutes require data-archiving or are preparing regulations that do. In the Reproducibility Project, hundreds of studies are being replicated in a concerted effort. As a major symbol of these developments, we now have the Center for Open Science, which facilitates the massive grassroots effort to open up the scientific regime in psychology.

If you had told me that any of this would happen back in 2005, I would have laughed you away, just as I would have laughed you away in 1990, had you told me that the future would involve such bizarre elements as smoke-free airplanes.

The moral order of research in psychology has changed. It has changed for the better, and I hope it has changed for good.



Welcome to the blog of the Open Science Collaboration! We are a loose network of researchers, professionals, citizen scientists, and others with an interest in open science, metascience, and good scientific practices. We’ll be writing about:

  • Open science topics like reproducibility, transparency in methods and analyses, and changing editorial and publication practices
  • Updates on open science initiatives like the Reproducibility Project and opportunities to get involved in new projects like the Archival Project
  • Interviews and opinions from researchers, developers, publishers, and citizen scientists working to make science more transparent, rigorous, or reproducible.

We hope that the blog will stimulate open discussion and help improve the way science is done!

If you'd like to suggest a topic for a post, propose a guest post or column, or get involved with moderation, promotion, or technical development, we would love to hear from you! Email us at oscbloggers@gmail.com or tweet @OSCbloggers.

The OSC is an open collaboration - anyone is welcome to join, contribute to discussions, or develop a project. The OSC blog is supported by the Center for Open Science, a non-profit organization dedicated to increasing openness, integrity and reproducibility of scientific research.