Nov 20, 2013

Theoretical Amnesia

by Denny Borsboom

In the past few months, the Center for Open Science and its associated enterprises have gathered enormous support in the community of psychological scientists. While these developments are happy ones, in my view, they also cast a shadow over the field of psychology: clearly, many people think that the activities of the Center for Open Science, like organizing massive replication work and promoting preregistration, are necessary. That, in turn, implies that something in the current scientific order is seriously broken. I think that, apart from working towards improvements, it is useful to investigate what that something is. In this post, I want to point towards a factor that I think has received too little attention in the public debate; namely, the near absence of unambiguously formalized scientific theory in psychology.

Scientific theories are perhaps the most bizarre entities that the scientific imagination has produced. They have incredible properties that, if we weren’t so familiar with them, would do pretty well in a Harry Potter novel. For instance, scientific theories allow you to work out, on a piece of paper, what would happen to stuff in conditions that aren’t actually realized. So you can figure out whether an imaginary bridge will stand or collapse in imaginary conditions. You can do this simply by feeding some imaginary quantities that your imaginary bridge would have (like its mass and dimensions) to a scientific theory (say, Newton’s), and out comes a prediction of what will happen. In the more impressive cases, the predictions are so good that you can actually design the entire bridge on paper, then build it according to specifications (by systematically mapping empirical objects to theoretical terms), and then the bridge will do precisely what the theory says it should do. No surprises.

That’s how they put a man on the moon and that’s how they make the computer screen you’re now looking at. It’s all done in theory before it’s done for real, and that’s what makes it possible to construct complicated but functional pieces of equipment. This, in effect, is why scientific theory makes technology possible, and why it is an absolutely central ingredient of the scientific enterprise, which without technology would be much less impressive than it is.

It’s useful to take stock here, and marvel. A good scientific theory allows you to infer what would happen to things in certain situations without creating the situations. Thus, scientific theories are crystal balls that actually work. For this reason, some philosophers of science have suggested that scientific theories should be interpreted as inference tickets. Once you’ve got the ticket, you get to sidestep all the tedious empirical work. Which is great, because empirical work is, well, tedious. Scientific theories are thus exquisitely suited to the needs of lazy people.

My field – psychology – unfortunately does not afford much of a lazy life. We don’t have theories that can offer predictions sufficiently precise to intervene in the world with appreciable certainty. That’s why there exists no such thing as a psychological engineer. And that’s why there are fields of theoretical physics, theoretical biology, and even theoretical economics, while there is no parallel field of theoretical psychology. It is a sad but, in my view, inescapable conclusion: we don’t have much in the way of scientific theory in psychology. For this reason, we have very few inference tickets – let alone inference tickets that work.

And that’s why psychology is so hyper-ultra-mega empirical. We never know how our interventions will pan out, because we have no theory that says how they will pan out (incidentally, that’s also why we need preregistration: in psychology, predictions are made by individual researchers rather than by standing theory, and you can’t trust people the way you can trust theory). The upshot is that, if we want to know what would happen if we did X, we have to actually do X. Because we don’t have inference tickets, we never get to take the shortcut. We always have to wade through the empirical morass. Always.

This has important consequences. For instance, the less theory a field has, the more it has to leave to the data. Since you can’t learn anything from data without the armature of statistical analysis, a field without theory tends to grow a thriving statistical community. Thus, the role of statistics grows as soon as the presence of scientific theory wanes. In extreme cases, when statistics has entirely taken over, fields of inquiry can actually develop a kind of philosophical disorder: theoretical amnesia. In fields with this disorder, researchers no longer know what a theory is, which means that they can recognize neither its presence nor its absence. In such fields, for instance, a statistical model – like a factor model – can come to occupy the vacuum created by the absence of theory. I am often afraid that this is precisely what has happened with the advent of “theories” like those of general intelligence (a single factor model) and the so-called “Big Five” of personality (a five-factor model). In fact, I am afraid that this happened in many fields in psychology, where statistical models (which, in their barest null-hypothesis testing form, are misleadingly called “effects”) rule the day.

If your science thrives on experiment and statistics, but lacks the power of theory, you get peculiar problems. Most importantly, you get slow. To see why, it’s interesting to wonder how psychologists would build a bridge, if they were to use their typical methodological strategies. Probably, they would build a thousand bridges, record whether they stand or fall, and then fit a regression equation to figure out which properties are predictive of the outcome. Predictors would be chosen on the basis of statistical significance, which would introduce a multiple testing problem. In response, some of the regressors might be clustered through factor analysis, to handle the overload of predictive variables. Such analyses would probably indicate lots of structure in the data, and psychologists would likely find that the bridges’ weight, size, and elasticity load on a single latent “strength factor”, producing the “theory” that bridges higher on the “strength factor” are less likely to fall down. Cross validation of the model would be attempted by reproducing the analysis in a new sample of a thousand bridges, to weed out chance findings. It’s likely that, after many years of empirical research, and under a great number of “context-dependent” conditions that would be poorly understood, psychologists would be able to predict a modest but significant proportion of the variance in the outcome variable. Without a doubt, it would take a thousand years to establish empirically what Newton grasped in a split second, as he wrote down his F = ma.

Because increased reliance on empirical data makes you so incredibly slow, it also makes you susceptible to fads and frauds. A good theory can be tested in an infinity of ways, many of which are directly available to the interested reader (this is what gives classroom demonstrations such enormous evidential force). But if your science is entirely built on generalizations derived from specifics of tediously gathered experimental data, you can’t really test these generalizations without tediously gathering the same, or highly similar, experimental data. That’s not something that people typically like to do, and it’s certainly not what journals want to print. As a result, a field can become dominated by poorly tested generalizations. When that happens, you’re in very big trouble. The reason is that your scientific field becomes susceptible to the equivalent of what evolutionary theorists call free riders: people who capitalize on the invested honest work of others by consistently taking the moral shortcut. Free riders can come to rule a scientific field if two conditions are satisfied: (a) fame is bestowed on whoever dares to make the most adventurous claims (rather than the most defensible ones), and (b) it takes longer to falsify a bogus claim than it takes to become famous. If these conditions are satisfied, you can build your scientific career on a fad and get away with it. By the time they find out your work really doesn’t survive detailed scrutiny, you’re sitting warmly by the fire in the library of your National Academy of Sciences1.

Much of our standard methodological teachings in psychology rest on the implicit assumption that scientific fields are similar if not identical in their methodological setup. That simply isn’t true. Without theory, the scientific ball game has to be played by different rules. I think that these new rules are now being invented: without good theory, you need fast acting replication teams, you need a reproducibility project, and you need preregistered hypotheses. Thus, the current period of crisis may lead to extremely important methodological innovations – especially those that are crucial in fields that are low on theory.

Nevertheless, it would be extremely healthy if psychologists received more education in fields which do have some theories, even if they are empirically shaky ones, like you often see in economics or biology. In itself, it’s no shame that we have so little theory: psychology probably has the hardest subject matter ever studied, and to change that may very well take a scientific event of the order of Newton’s discoveries. I don’t know how to do it and I don’t think anybody else knows either. But what we can do is keep in contact with other fields, and at least try to remember what theory is and what it’s good for, so that we don’t fall into theoretical amnesia. As they say, it’s the unknown unknowns that hurt you most.

1 Caveat: I am not saying that people do this on purpose. I believe that free riders are typically unaware of the fact that they are free riders – people are very good at labeling their own actions positively, especially if the rest of the world says that they are brilliant. So, if you think this post isn’t about you, that could be entirely wrong. In fact, I cannot even be sure that this post isn’t about me.

Nov 13, 2013

Let’s Report Our Findings More Transparently – As We Used to Do

by Etienne LeBel

In 1959, Festinger and Carlsmith reported the results of an experiment that spawned a voluminous body of research on cognitive dissonance. In that experiment, all subjects performed a boring task. Some participants were paid $1 or $20 to tell the next subject the task was interesting and fun whereas participants in a control condition did no such thing. All participants then indicated how enjoyable they felt the task was, their desire to participate in another similar experiment, and the scientific importance of the experiment. The authors hypothesized that participants paid $1 to tell the next participant that the boring task they just completed was interesting would experience internal dissonance, which could be reduced by altering their attitudes on the three outcome measures. A little-known fact about the results of this experiment, however, is that a statistically significant effect emerged across conditions on only one of these outcome measures. The authors reported that subjects paid $1 enjoyed the task more than those paid $20 (or the control participants), but no statistically significant differences emerged on the other two measures.

In another highly influential paper, Word, Zanna, and Cooper (1974) documented the self-fulfilling prophecies of racial stereotypes. The researchers had white subjects interview trained white and black applicants and coded for six non-verbal behaviors of immediacy (distance, forward lean, eye contact, shoulder orientation, interview length, and speech error rate). They found that white interviewers treated black applicants with lower levels of non-verbal immediacy than white applicants. In a follow-up study involving white subjects only, applicants treated with less immediate non-verbal behaviors were judged to perform less well during the interview than applicants treated with more immediate non-verbal behaviors. A fascinating result; a little-known fact about this paper, however, is that only three of the six measures of non-verbal behavior assessed in the first study (and subsequently used in the second study) showed statistically significant differences.

What do these two examples make salient in relation to how psychologists report their results nowadays? Regular readers of prominent journals like Psychological Science and Journal of Personality and Social Psychology may see what I’m getting at: It is very rare these days to see articles in these journals wherein half or most of the reported dependent variables (DVs) fail to show statistically significant effects. Rather, one typically sees squeaky-clean looking articles where all of the DVs show statistically significant effects across all of the studies, with an occasional mention of a DV achieving “marginal significance” (Giner-Sorolla, 2012).

In this post, I want us to consider the possibility that psychologists’ reporting practices may have changed in the past 50 years. This then raises the question of how such a change came about. One possibility is that as incentives became increasingly perverse in psychology (Nosek, Spies, & Motyl, 2012), some researchers realized that they could out-compete their peers by reporting “cleaner” looking results, which would appear more compelling to editors and reviewers (Giner-Sorolla, 2012). For example, decisions were made to simply not report DVs that failed to show significant differences across conditions or that only achieved “marginal significance”. Indeed, nowadays even editors or reviewers will sometimes demand that such DVs not be reported (see PsychDisclosure.org; LeBel et al., 2013). A similar logic may also have contributed to researchers deciding not to fully report independent variables that failed to yield statistically significant effects, and not to fully report the exclusion of outlying participants, for fear that this information would raise doubts among editors and reviewers and hurt their chance of getting a foot in the door (i.e., at least getting a revise-and-resubmit).

An alternative explanation is that new tools and technology have given us the ability to measure a greater number of DVs, which makes it more difficult to report on all of them. For example, neuroscience (e.g., EEG, fMRI) and eye-tracking methods yield multitudes of analyzable data that were not previously available. Though this is undeniably true, the internet and online article supplements give us the ability to fully report our methods and results and use the article to draw attention to the most interesting data.

Considering the possibility that psychologists’ reporting practices have changed in the past 50 years has implications for how to construe recent calls for the need to raise reporting standards in psychology (LeBel et al., 2013; Simmons, Nelson, & Simonsohn, 2011; Simmons, Nelson, & Simonsohn, 2012). Rather than seeing these calls as rigid new requirements that might interfere with exploratory research and stifle our science, one could construe such calls as a plea to revert back to the fuller reporting of results that used to be the norm in our science. From this perspective, it should not be viewed as overly onerous or authoritarian to ask researchers to disclose all excluded observations, all tested experimental conditions, all analyzed measures, and their data collection termination rule (what I’m now calling the BASIC 4 methodological categories covered by PsychDisclosure.org and Simmons et al.’s (2012) 21-word solution). It would simply be the way our forefathers used to do it.

References

Festinger, L., & Carlsmith, J. M. (1959). Cognitive consequences of forced compliance. The Journal of Abnormal and Social Psychology, 58(2), 203-210. doi: 10.1037/h0041593

Giner-Sorolla, R. (2012). Science or art? How esthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science, 7(6), 562-571. doi: 10.1177/1745691612457576

LeBel, E. P., Borsboom, D., Giner-Sorolla, R., Hasselman, F., Peters, K. R., Ratliff, K. A., & Smith, C. T. (2013). PsychDisclosure.org: Grassroots support for reforming reporting standards in psychology. Perspectives on Psychological Science, 8(4), 424-432. doi: 10.1177/1745691613491437

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631. doi: 10.1177/1745691612459058

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2012). A 21 word solution. Dialogue: The Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4-7.

Word, C. O., Zanna, M. P., & Cooper, J. (1974). The nonverbal mediation of self-fulfilling prophecies in interracial interaction. Journal of Experimental Social Psychology, 10(2), 109–120. doi: 10.1016/0022-1031(74)90059-6

Nov 3, 2013

Increasing statistical power in psychological research without increasing sample size

by

What is statistical power and precision?

This post is going to give you some practical tips to increase statistical power in your research. Before going there though, let’s make sure everyone is on the same page by starting with some definitions.

Statistical power is the probability that the test will reject the null hypothesis when the null hypothesis is false. Many authors suggest a statistical power rate of at least .80, which corresponds to an 80% probability of not committing a Type II error.

Precision refers to the width of the confidence interval for an effect size. The smaller this width, the more precise your results are. For 80% power, the confidence interval width will be roughly plus or minus 70% of the population effect size (Goodman & Berlin, 1994). Studies that have low precision have a greater probability of both Type I and Type II errors (Button et al., 2013).

To get an idea of how this works, here are a few examples of the sample size required to achieve .80 power for small, medium, and large (Cohen, 1992) correlations, as well as the expected confidence intervals (a rough version of this calculation is sketched in the code after the table):

Population Effect Size | Sample Size for 80% Power | Estimated Precision
r = .10 | 782 | 95% CI [.03, .17]
r = .30 | 84 | 95% CI [.09, .51]
r = .50 | 29 | 95% CI [.15, .85]
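To see roughly where these numbers come from, here is a minimal sketch (added here, not part of the original post) that approximates the required sample size for a correlation using the Fisher z transformation; it lands within a participant or two of the table’s values, which come from exact power software.

```python
# Approximate n needed to detect a correlation r with a two-tailed test,
# via the Fisher z approximation (a sketch; exact software differs slightly).
import numpy as np
from scipy.stats import norm

def n_for_correlation(r, power=0.80, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical z for the two-tailed test
    z_beta = norm.ppf(power)            # z corresponding to the desired power
    fz = np.arctanh(r)                  # Fisher z-transform of the correlation
    return int(np.ceil(((z_alpha + z_beta) / fz) ** 2 + 3))

for r in (0.10, 0.30, 0.50):
    print(f"r = {r}: n ~ {n_for_correlation(r)}")
# Prints roughly 783, 85, 30 -- within one or two of the table's 782, 84, 29.
```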

Studies in psychology are grossly underpowered

Okay, so now you know what power is. But why should you care? Fifty years ago, Cohen (1962) estimated the statistical power to detect a medium effect size in abnormal psychology was about .48. That’s a false negative rate of 52%, which is no better than a coin-flip! The situation has improved slightly, but it’s still a serious problem today. For instance, one review suggested only 52% of articles in the applied psychology literature achieved .80 power for a medium effect size (Mone et al., 1996). This is in part because psychologists are studying small effects. One massive review of 322 meta-analyses including 8 million participants (Richard et al., 2003) suggested that the average effect size in social psychology is relatively small (r = .21). To put this into perspective, you’d need 175 participants to have .80 power for a simple correlation between two variables at this effect size. This gets even worse when we’re studying interaction effects. One review suggests that the average effect size for interaction effects is even smaller (f2 = .009), which means that sample sizes of around 875 people would be needed to achieve .80 power (Aguinis et al., 2005). Odds are, if you took the time to design a research study and collect data, you want to find a relationship if one really exists. You don’t want to "miss" something that is really there. More than this, you probably want to have a reasonably precise estimate of the effect size (it’s not that impressive to just say a relationship is positive and probably non-zero). Below, I discuss concrete strategies for improving power and precision.

What can we do to increase power?

It is well-known that increasing sample size increases statistical power and precision. Increasing the population effect size increases statistical power, but has no effect on precision (Maxwell et al., 2008). Increasing sample size improves power and precision by reducing the standard error of the effect size. Take a look at this formula for the confidence interval of a linear regression coefficient (McClelland, 2000):

b ± t(α/2) × sqrt[ MSE / (n × Vx × (1 − R²)) ]

MSE is the mean square error, n is the sample size, Vx is the variance of X, and (1-R2) is the proportion of the variance in X not shared by any other variables in the model. Okay, hopefully you didn’t nod off there. There’s a reason I’m showing you this formula. In this formula, decreasing any value in the numerator (MSE) or increasing anything in the denominator (n, Vx, 1-R2) will decrease the standard error of the effect size, and will thus increase power and precision. This formula demonstrates that there are at least three other ways to increase statistical power aside from sample size: (a) Decreasing the mean square error; (b) increasing the variance of x; and (c) increasing the proportion of the variance in X not shared by any other predictors in the model. Below, I’ll give you a few ways to do just that.
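As a quick numerical illustration of that point (a sketch added here, not from the original post): halving MSE, doubling n, or doubling Vx each shrink the standard error by the same factor of about 1.41 (the square root of 2).

```python
# Standard error of a regression slope, following the formula above.
import numpy as np

def se_slope(mse, n, var_x, r2_other):
    return np.sqrt(mse / (n * var_x * (1 - r2_other)))

baseline = se_slope(mse=1.0, n=100, var_x=1.0, r2_other=0.0)
print(baseline)                                            # 0.100
print(se_slope(mse=0.5, n=100, var_x=1.0, r2_other=0.0))   # halve MSE  -> ~0.071
print(se_slope(mse=1.0, n=200, var_x=1.0, r2_other=0.0))   # double n   -> ~0.071
print(se_slope(mse=1.0, n=100, var_x=2.0, r2_other=0.0))   # double Vx  -> ~0.071
```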

Recommendation 1: Decrease the mean square error

Referring to the formula above, you can see that decreasing the mean square error will have about the same impact as increasing sample size. Okay. You’ve probably heard the term "mean square error" before, but the definition might be kind of fuzzy. Basically, your model makes a prediction for what the outcome variable (Y) should be, given certain values of the predictor (X). Naturally, it’s not a perfect prediction because you have measurement error, and because there are other important variables you probably didn’t measure. The mean square error is the average squared difference between what your model predicts and the values actually observed in the data. So, anything that improves the quality of your measurement or accounts for potential confounding variables will reduce the mean square error, and thus improve statistical power. Let’s make this concrete. Here are three specific techniques you can use:

a) Reduce measurement error by using more reliable measures (i.e., better internal consistency, test-retest reliability, inter-rater reliability, etc.). You’ve probably read that .70 is the "rule-of-thumb" for acceptable reliability. Okay, sure. That’s publishable. But consider this: Let’s say you want to test a correlation coefficient. Assuming both measures have a reliability of .70, your observed correlation will be only about 70% of the true population value – attenuated by a factor of about 1.43 (using Spearman’s correlation attenuation formula, observed r = true r × √(.70 × .70) = .70 × true r). Because you have a smaller observed effect size, you end up with less statistical power. Why do this to yourself? Reduce measurement error. If you’re an experimentalist, make sure you execute your experimental manipulations exactly the same way each time, preferably by automating them. Slight variations in the manipulation (e.g., different locations, slight variations in timing) might reduce the reliability of the manipulation, and thus reduce power.

b) Control for confounding variables. With correlational research, this means including control variables that predict the outcome variable, but are relatively uncorrelated with other predictor variables. In experimental designs, this means taking great care to control for as many potential confounds as possible. In both cases, this reduces the mean square error and improves the overall predictive power of the model – and thus, improves statistical power. Be careful when adding control variables into your models though: There are diminishing returns for adding covariates. Adding a couple of good covariates is bound to improve your model, but you always have to balance predictive power against model complexity. Adding a large number of predictors can sometimes lead to overfitting (i.e., the model is just describing noise or random error) when there are too many predictors in the model relative to the sample size. So, controlling for a couple of good covariates is generally a good idea, but too many covariates will probably make your model worse, not better, especially if the sample is small.

c) Use repeated-measures designs. Repeated measures designs are where participants are measured multiple times (e.g., once a day surveys, multiple trials in an experiment, etc.). Repeated measures designs reduce the mean square error by partitioning out the variance due to individual participants. Depending on the kind of analysis you do, it can also increase the degrees of freedom for the analysis substantially. For example, you might only have 100 participants, but if you measured them once a day for 21 days, you’ll actually have 2100 data points to analyze. The data analysis can get tricky and the interpretation of the data may change, but many multilevel and structural equation models can take advantage of these designs by examining each measurement occasion (i.e., each day, each trial, etc.) as the unit of interest, instead of each individual participant. Increasing the degrees of freedom is much like increasing the sample size in terms of increasing statistical power. I’m a big fan of repeated measures designs, because they allow researchers to collect a lot of data from fewer participants.
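Here is a minimal sketch of the kind of repeated-measures analysis described above, on simulated daily-diary data (the variable names, sample sizes, and effect sizes are invented for illustration). A random-intercept model in statsmodels treats each of the 2,100 person-days as a data point while partitioning out between-person variance.

```python
# Sketch: 100 participants x 21 daily reports, analyzed with a random-intercept
# multilevel model so each person-day is the unit of analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_people, n_days = 100, 21

person = np.repeat(np.arange(n_people), n_days)          # participant id per row
person_intercept = rng.normal(0, 1, n_people)[person]    # stable person differences
x = rng.normal(0, 1, n_people * n_days)                  # daily predictor
y = 0.3 * x + person_intercept + rng.normal(0, 1, n_people * n_days)

diary = pd.DataFrame({"person": person, "x": x, "y": y})

# The random intercept per person removes between-person variance from the
# error term, which is what reduces the mean square error for the effect of x.
fit = smf.mixedlm("y ~ x", diary, groups=diary["person"]).fit()
print(fit.summary())
```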

Recommendation 2: Increase the variance of your predictor variable

Another less-known way to increase statistical power and precision is to increase the variance of your predictor variables (X). The formula listed above shows that doubling the variance of X has the same impact on increasing statistical precision as doubling the sample size does! So it’s worth figuring this out.

a) In correlational research, use more comprehensive continuous measures. That is, there should be a large possible range of values endorsed by participants. However, the measure should also capture many different aspects of the construct of interest; artificially increasing the range of X by adding redundant items (i.e., simply re-phrasing existing items to ask the same question) will actually hurt the validity of the analysis. Also, avoid dichotomizing your measures (e.g., median splits), because this reduces the variance and typically reduces power (MacCallum et al., 2002).

b) In experimental research, unequally allocating participants to each condition can improve statistical power. For example, suppose you were designing an experiment with three conditions (let’s say low, medium, or high self-esteem). Most of us would assign participants equally to all three groups, right? Well, as it turns out, assigning participants equally across groups usually reduces statistical power. The idea behind assigning participants unequally to conditions is to maximize the variance of X for the particular kind of relationship under study -- which, according to the formula I gave earlier, will increase power and precision. For example, the optimal design for a linear relationship would be 50% low, 50% high, and omit the medium condition. The optimal design for a quadratic relationship would be 25% low, 50% medium, and 25% high. The proportions vary widely depending on the design and the kind of relationship you expect, but I recommend you check out McClelland (1997) to get more information on efficient experimental designs. You might be surprised.
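A small numerical sketch (added here; the condition codes are an arbitrary choice for illustration) shows the gain from the two-extreme-groups allocation when the relationship is linear:

```python
# Variance of the predictor under two allocation schemes, with conditions
# coded low = -1, medium = 0, high = +1; larger Var(X) means a smaller
# standard error for a linear effect, per the formula earlier in the post.
import numpy as np

N = 300
equal_thirds = np.repeat([-1, 0, 1], N // 3)   # 100 per condition
two_extremes = np.repeat([-1, 1], N // 2)      # 150 low, 150 high, no medium

print(equal_thirds.var())   # ~0.67
print(two_extremes.var())   # 1.00 -- 50% more predictor variance from the same N,
# roughly equivalent to running the equal-allocation design with 50% more people.
```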

Recommendation 3: Make sure predictor variables are uncorrelated with each other

A final way to increase statistical power is to increase the proportion of the variance in X not shared with other variables in the model. When predictor variables are correlated with each other, this is known as collinearity. For example, depression and anxiety are positively correlated with each other; including both as simultaneous predictors (say, in multiple regression) means that statistical power will be reduced, especially if one of the two variables actually doesn’t predict the outcome variable. Lots of textbooks suggest that we should only be worried about this when collinearity is extremely high (e.g., correlations above .70). However, studies have shown that even modest intercorrelations among predictor variables will reduce statistical power (Mason & Perreault, 1991). Bottom line: If you can design a model where your predictor variables are relatively uncorrelated with each other, you can improve statistical power.
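To see how quickly this bites, here is a small sketch (added here) of the factor by which the standard error of a slope grows as two predictors become correlated; it is just the square root of the familiar variance inflation factor, 1 / (1 − r²).

```python
# Inflation of a slope's standard error when two predictors correlate r.
import numpy as np

def se_inflation(r):
    return np.sqrt(1.0 / (1.0 - r ** 2))   # sqrt of the variance inflation factor

for r in (0.0, 0.3, 0.5, 0.7):
    print(f"r = {r}: SE inflated by a factor of {se_inflation(r):.2f}")
# Even r = .50 between predictors inflates the SE by about 15%, and r = .70
# by about 40% -- modest collinearity already costs real statistical power.
```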

Conclusion

Increasing statistical power is one of the rare cases where what is good for science and what is good for your career actually coincide. It increases the accuracy and replicability of results, so it’s good for science. It also increases your likelihood of finding a statistically significant result (assuming the effect actually exists), making it more likely to get something published. You don’t need to torture your data with obsessive re-analysis until you get p < .05. Instead, put more thought into research design in order to maximize statistical power. Everyone wins, and you can use that time you used to spend sweating over p-values to do something more productive. Like volunteering with the Open Science Collaboration.

References

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect Size and Power in Assessing Moderating Effects of Categorical Variables Using Multiple Regression: A 30-Year Review. Journal of Applied Psychology, 90, 94-107. doi:10.1037/0021-9010.90.1.94

Button, K. S., Ioannidis, J. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. doi: 10.1038/nrn3475

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65, 145-153. doi:10.1037/h0045186

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159. doi:10.1037/0033-2909.112.1.155

Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine, 121, 200-206.

Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (NIDA Research Monograph 142, NIH Publication No. 94-3599, pp. 184–195). Rockville, MD: National Institutes of Health.

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40. doi:10.1037/1082-989X.7.1.19

Mason, C. H., & Perreault, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28, 268-280. doi:10.2307/3172863

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563. doi:10.1146/annurev.psych.59.103006.093735

McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19. doi:10.1037/1082-989X.2.1.3

McClelland, G. H. (2000). Increasing statistical power without increasing sample size. American Psychologist, 55, 963-964. doi:10.1037/0003-066X.55.8.963

Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120. doi:10.1111/j.1744-6570.1996.tb01793.x

Open Science Collaboration. (in press). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series). New York, NY: Taylor & Francis. doi:10.2139/ssrn.2195999

Richard, F. D., Bond, C. F., Jr., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363. doi:10.1037/1089-2680.7.4.331

Oct 25, 2013

It’s Easy Being Green (Open Access)

by Frank Farach

This is the first installment of the Open Science Toolkit, a recurring feature that outlines practical steps individuals and organizations can take to make science more open and reproducible.


Congratulations! Your manuscript has been peer reviewed and accepted for publication in a journal. The journal is owned by a major publisher who wants you to know that, for $3,000, you can make your article open access (OA) forever. Anyone in the world with access to the Internet will have access to your article, which may be cited more often because of its OA status. Otherwise, the journal would be happy to make your paper available to subscribers and others willing to pay a fee.

Does this sound familiar? It sure does to me. For many years, when I heard about Open Access (OA) to scientific research, it was always about making an article freely available in a peer-reviewed journal -- the so-called “gold” OA option -- often at considerable expense. I liked the idea of making my work available to the widest possible audience, but the costs were prohibitive.

As it turns out, however, the “best-kept secret” of OA is that you can make your work OA for free by self-archiving it in an OA repository, even if it has already been published in a non-OA journal. Such “green” OA is possible because many journals have already provided blanket permission for authors to deposit their peer-reviewed manuscript in an OA repository.

The flowchart below shows how easy it is to make your prior work OA. The key is to make sure you follow the self-archiving policy of the journal your work was published in, and to deposit the work in a suitable repository.

[Flowchart: how to archive your research]

Journals typically display their self-archiving and copyright policies on their websites, but you can also search for them in the SHERPA/RoMEO database, which has a nicely curated collection of policies from 1,333 publishers. The database assigns a code to journals based on how far in the publication process their permissions extend. It distinguishes between a pre-print, which is the version of the manuscript before it underwent peer review, and a post-print, the peer-reviewed version before the journal copyedited and typeset it. Few journals allow you to self-archive their copyedited PDF version of your article, but many let you do so for the pre-print or post-print version. Unfortunately, some journals don’t provide blanket permission for self-archiving, or require you to wait for an embargo period to end before doing so. If you run into this problem, you should contact the journal and ask for permission to deposit the non-copyedited version of your article in an OA repository.

It has also become easy to find a suitable OA repository in which to deposit your work. Your first stop should be the Directory of Open Access Repositories (OpenDOAR), which currently lists over 2,200 institutional, disciplinary, and universal repositories. Although an article deposited in an OA repository will be available to anyone with Internet access, repositories differ in feature sets and policies. For example, some repositories, like figshare, automatically assign a CC BY license to all publicly shared papers; others, like Open Depot, allow you to choose a license before making the article public. A good OA repository will tell you how it ensures the long-term digital preservation of your content as well as what metadata it exposes to search engines and web services.

Once you’ve deposited your article in an OA repository, consider making others aware of its existence. Link to it on your website, mention it on social media, or add it to your CV.

In honor of Open Access Week, I am issuing a “Green OA Challenge” to readers of this blog who have published at least one peer-reviewed article. The challenge is to self-archive one of your articles in an OA repository and link to it in the comments below. Please also feel free to share any comments you have about the self-archiving process or about green OA. Happy archiving!

Oct 16, 2013

Thriving au naturel amid science on steroids

Brian A. Nosek
University of Virginia, Center for Open Science

Jeffrey R. Spies
Center for Open Science

Last fall, the present first author taught a graduate class called “Improving (Our) Science” at the University of Virginia. The class reviewed evidence suggesting that scientific practices are not operating ideally and are damaging the reproducibility of published findings. For example, the power of an experimental design in null hypothesis significance testing is a function of the effect size being investigated and the size of the sample used to test it—power is greater when effects are larger and samples are bigger. In the authors’ field of psychology, for example, estimates suggest that the power of published studies to detect an average effect size is .50 or less (Cohen, 1962; Sedlmeier & Gigerenzer, 1989). Assuming that all of the published effects are true, approximately 50% of published studies would reveal positive results (i.e., p < .05 supporting the hypothesis). In reality, more than 90% of published results are positive (Sterling, 1959; Fanelli, 2010).

How is it possible that the average study has power to detect the average true effect 50% of the time or less, and yet does so about 90% of the time? It isn’t. Then how does this occur? One obvious contributor is selective reporting. Positive effects are more likely than negative effects to be submitted and accepted for publication (Greenwald, 1975). The consequences include: [1] the published literature will tend to exaggerate the size of true effects, because with low-powered designs researchers must still leverage chance to obtain an effect size large enough to produce a positive result; and [2] the proportion of false positives – cases in which there isn’t actually an effect to detect – will be inflated beyond the nominal alpha level of 5% (Ioannidis, 2005).
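A quick simulation, added here as a sketch rather than an analysis from the cited papers, makes the first consequence concrete: if only positive, significant results from low-powered two-group studies reach print, the published estimates of a true effect of d = 0.3 come out far too large.

```python
# Sketch: selective publication of significant results from low-powered studies
# exaggerates true effects (assumed true standardized difference d = 0.3).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d, n_studies = 20, 0.3, 20000

published = []
for _ in range(n_studies):
    treatment = rng.normal(true_d, 1, n_per_group)
    control = rng.normal(0.0, 1, n_per_group)
    t, p = stats.ttest_ind(treatment, control)
    if p < .05 and t > 0:              # only positive, significant results "publish"
        published.append(treatment.mean() - control.mean())

print(len(published) / n_studies)   # power is low: only about 15% of studies "work"
print(np.mean(published))           # published mean difference ~0.75-0.80, not 0.3
```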

The class discussed this and other scientific practices that may interfere with knowledge accumulation. Some of the relatively common ones are described in Table 1 along with some solutions that we, and others, identified. Problem. Solution. Easy. The class just fixed science. Now, class members can adopt the solutions as best available practices. Our scientific outputs will be more accurate, and significant effects will be more reproducible. Our science will be better.

Alex Schiller, a class member and graduate student, demurred. He agreed that the new practices would make science better, but disagreed that we should do them all. A better solution, he argued, is to take small steps: adopt one solution, wait for that to become standard scientific practice, and then adopt another solution.

We know that some of our practices are deficient, we know how to improve them, but Alex is arguing that we shouldn’t implement all the solutions? Alex’s lapse of judgment can be forgiven—he’s just a graduate student. However, his point isn’t a lapse. Faced with the reality of succeeding as a scientist, Alex is right.

Table 1

Scientific practices that increase irreproducibility of published findings, possible solutions, and barriers that prevent adoption of those solutions

Practice | Problem | Possible Solution | Barrier to Solution
Run many low-powered studies rather than few high-powered studies | Inflates false positive and false negative rates | Run high-powered studies | Non-significant effects are a threat to publishability; Risky to devote extra resources to high-powered tests that might not produce significant effects
Report significant effects and dismiss non-significant effects as methodologically flawed | Using outcome to evaluate method is a logical error and can inflate false positive rate | Report all effects with rationale for why some should be ignored; let reader decide | Non-significant and mixed effects are a threat to publishability
Analyze during data collection, stop when significant result is obtained or continue until significant result is obtained | Inflates false positive rate (see the simulation sketch below the table) | Define data stopping rule in advance | Non-significant effects are a threat to publishability
Include multiple conditions or outcome variables, report the subset that showed significant effects | Inflates false positive rate | Report all conditions and outcome variables | Non-significant and mixed effects are a threat to publishability
Try multiple analysis strategies, data exclusions, data transformations, report cleanest subset | Inflates false positive rate | Prespecify data analysis plan, or report all analysis strategies | Non-significant and mixed effects are a threat to publishability
Report discoveries as if they had resulted from confirmatory tests | Inflates false positive rate | Pre-specify hypotheses; Report exploratory and confirmatory analyses separately | Many findings are discoveries, but stories are nicer and scientists seem smarter if they had thought it in advance
Never do a direct replication | Inflates false positive rate | Conduct direct replications of important effects | Incentives are focused on innovation, replications are boring; Original authors might feel embarrassed if their original finding is irreproducible

Note: For reviews of these practices and their effects see Ioannidis, 2005; Giner-Sorolla, 2012; Greenwald, 1975; John et al., 2012; Nosek et al., 2012; Rosenthal, 1979; Simmons et al., 2011; Young et al., 2008
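To illustrate the third practice in Table 1 (analyzing during data collection and stopping as soon as p < .05), here is a small simulation sketch, added here; the peeking schedule of every ten subjects between 20 and 100 is an arbitrary choice for illustration.

```python
# Sketch: optional stopping on a true null effect inflates the false positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def study_with_peeking(n_start=20, n_max=100, step=10):
    data = rng.normal(0, 1, n_max)          # true null: the mean really is 0
    for n in range(n_start, n_max + 1, step):
        if stats.ttest_1samp(data[:n], 0).pvalue < .05:
            return True                     # stop and declare an "effect"
    return False

rate = np.mean([study_with_peeking() for _ in range(5000)])
print(rate)   # typically around .15, roughly three times the nominal .05
```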

The Context

In an ideal world, scientists use the best available practices to produce accurate, reproducible science. But, scientists don’t live in an ideal world. Alex is creating a career for himself. To succeed, he must publish. Papers are academic currency. They are Alex’s ticket to job security, fame, and fortune. Well, okay, maybe just job security. But, not everything is published, and some publications are valued more than others. Alex can maximize his publishing success by producing particular kinds of results. Positive effects, not negative effects (Fanelli, 2010; Sterling, 1959). Novel effects, not verifications of prior effects (Open Science Collaboration, 2012). Aesthetically appealing, clean results, not results with ambiguities or inconsistencies (Giner-Sorolla, 2012). Just look at the pages of Nature, or any other leading journal: they are filled with articles reporting positive, novel, beautiful results. They are wonderful, exciting, and groundbreaking. Who wouldn’t want that?

We do want that, and science advances in leaps with groundbreaking results. The hard reality is that few results are actually groundbreaking. And, even for important research, the results are often far from beautiful. There are confusing contradictions, apparent exceptions, and things that just don’t make sense. To those in the laboratory, this is no surprise. Being at the frontiers of knowledge is hard. We don’t quite know what we are looking at. That’s why we are studying it. Or, as Einstein said, “If we knew what we were doing, it wouldn’t be called research”.   But, those outside the laboratory get a different impression. When the research becomes a published article, much of the muck goes away. The published articles are like the pictures of this commentary’s authors at the top of this page. Those pictures are about as good as we can look. You should see the discards. Those with insider access know, for example, that we each own three shirts with buttons and have highly variable shaving habits. Published articles present the best-dressed, clean-shaven versions of the actual work.

Just as with people, when you replicate effects yourself to see them in person, they may not be as beautiful as they appeared in print. The published version often looks much better than reality. The effect is hard to get, dependent on a multitude of unmentioned limiting conditions, or entirely irreproducible (Begley & Ellis, 2012; Prinz et al., 2011).

The Problem

It is not surprising that effects are presented in their best light. Career advancement depends on publishing success. More beautiful looking results are easier to publish and more likely to earn rewards (Giner-Sorolla, 2012). Individual incentives align for maximizing publishability, even at the expense of accuracy (Nosek et al., 2012).

Consider three hypothetical papers shown in Table 2. For all three, the researchers identified an important problem and had an idea for a novel solution. Paper A is a natural beauty. Two well planned studies showed effects supporting the idea. Paper B and Paper C were conducted with identical study designs. Paper B is natural, but not beautiful; Paper C is a manufactured beauty. Both Paper B and Paper C were based on 3 studies. One study for each showed clear support for the idea. A second study was a mixed success for Paper B, but “worked” for Paper C after increasing the sample size a bit and analyzing the data a few different ways. A third study did not work for either. Paper B reported the failure with an explanation for why the methodology might be to blame, rather than the idea being incorrect. The authors of Paper C generated the same methodological explanation, categorized the study as a pilot, and did not report it at all. Also, Paper C described the final sample sizes and analysis strategies, but did not mention that extra data was collected after initial analysis, or that alternative analysis strategies had been tried and dismissed.

Table 2

Summary of research practices for three hypothetical papers

Step | Paper A | Paper B | Paper C
Data collection | Conducted two studies | Conducted three studies | Conducted three studies
Data analysis | Analyzed data after completing data collection following a pre-specified analysis plan | Analyzed data after completing data collection following a pre-specified analysis plan | Analyzed during data collection and collected more data to get to significance in one case. Selected from multiple analysis strategies for all studies.
Result reporting | Reported the results of the planned analyses for both studies | Reported the results of the planned analyses for all studies | Reported results of final analyses only. Did not report one study that did not reach significance.
Final paper | Two studies demonstrating clear support for idea | One study demonstrating clear support for idea, one mixed, one not at all | Two studies demonstrating clear support for idea

Paper A is clearly better than Paper B. Paper A should be published in a more prestigious outlet and generate more attention and accolades. Paper C looks like Paper A, but in reality it is like Paper B. The actual evidence is weaker than the apparent evidence. Based on the report, however, no one can tell the difference between Paper A and Paper C.

Two possibilities would minimize the negative impact of publishing manufactured beauties like Paper C. First, if replication were standard practice, then manufactured effects would be identified rapidly. However, direct replication is very uncommon (Open Science Collaboration, 2012). Once an effect is in the literature, there is little systematic ethic to self-correct. Rather than be weeded out, false effects persist or just slowly fade away. Second, scientists could just avoid doing the practices that lead to Paper C, making this illustration an irrelevant hypothetical. Unfortunately, a growing body of evidence suggests that these practices occur, and some are even common (e.g., John et al., 2012).

To avoid the practices that produce Paper C, the scientist must be aware of and confront a conflict-of-interest—what is best for science versus what is best for me. Scientists have inordinate opportunity to pursue flexible decision-making in design and analysis, and there is minimal accountability for those practices. Further, humans’ prodigious motivated reasoning capacities provide a way to decide that the outcomes that look best for us also have the most compelling rationale (Kunda, 1990). So, we may convince ourselves that the best course of action for us was the best course of action period. It is very difficult to stop doing suspect practices when we have thoroughly convinced ourselves that we are not doing them.

The Solution

Alex needs to publish to succeed. The practices in Table 1 are to the scientist what steroids are to the athlete. They amplify the likelihood of success in a competitive marketplace. If others are using, and Alex decides to rely on his natural performance, then he will disadvantage his career prospects. Alex wants to do the best science he can and be successful for doing it. In short, he is the same as every other scientist we know, ourselves included. Alex shouldn’t have to make a choice between doing the best science and being successful—these should be the same thing.

Is Alex stuck? Must he wait for institutional regulation, audits, and the science police to fix the system? In a regulatory world, practices are enforced and he need not worry that he’s committing career suicide by following them. Many scientists are wary of a strong regulatory environment in science, particularly for the possibility of stifling innovation. Some of the best ideas start with barely any evidence at all, and restrictive regulations on confidence in outputs could discourage taking risks on new ideas. Nonetheless, funders, governments, and other stakeholders are taking notice of the problematic incentive structures in science. If we don’t solve the problem ourselves, regulators may solve it for us.

Luckily, Alex has an alternative. The practices in Table 1 may be widespread, but the solutions are also well known and endorsed as good practice (Fuchs et al., 2012). That is, scientists easily understand the differences between Papers A, B, and C – if they have full access to how the findings were produced. As a consequence, the only way to be rewarded for natural achievements over manufactured ones is to make the process of obtaining the results transparent. Using the best available practices privately will improve science but hurt careers. Using the best available practices publicly will improve science while simultaneously improving the reputation of the scientist. With openness, success can be influenced by the results and by how they were obtained.  

Conclusion

The present incentives for publishing are focused on the one thing that we scientists are absolutely, positively not supposed to control - the results of the investigation. Scientists have complete control over the design, procedures, and execution of the study. The results are what they are.

A better science will emerge when the incentives for achievement align with the things that scientists can (and should) control with their wits, effort, and creativity. With results, beauty is contingent on what is known about their origin. Obfuscation of methodology can make ugly results appear beautiful. With methodology, if it looks beautiful, it is beautiful. The beauty of methodology is revealed by openness.

Most scientific results have warts. Evidence is halting, uncertain, incomplete, confusing, and messy. It is that way because scientists are working on hard problems. Exposing it will accelerate finding solutions to clean it up. Instead of trying to make results look beautiful when they are not, the inner beauty of science can be made apparent. Whatever the results, the inner beauty—strong design, brilliant reasoning, careful analysis—is what counts. With openness, we won’t stop aiming for A papers. But, when we get them, it will be clear that we earned them.

References

Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature, 483, 531-533.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLoS ONE, 5(4), e10068. doi:10.1371/journal.pone.0010068

Fuchs, H., Jenny, M., & Fiedler, S. (2012). Psychologists are open to change, yet wary of rules. Perspectives on Psychological Science, 7, 634-637. doi:10.1177/1745691612459521

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

Giner-Sorolla, R. (2012). Science or art? How esthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science, 7(6), 562-571. doi:10.1177/1745691612457576

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524-532. doi:10.1177/0956797611430953

Kunda, Z. (1990). The case for motivated reasoning. Psychological Bulletin, 108, 480-498. doi:10.1037/0033-2909.108.3.480

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631. doi:10.1177/1745691612459058

Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657-660. doi:10.1177/1745691612462588

Prinz, F., Schlange, T. & Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712-713.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641. doi:10.1037/0033-2909.86.3.638

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa. Journal of the American Statistical Association, 54, 30-34.

Young, N. S., Ioannidis, J. P. A., & Al-Ubaydli, O. (2008). Why current publication practices may distort science. PLoS Medicine, 5, 1418-1422.

Oct 10, 2013

Opportunities for Collaborative Research

by Jon Grahe

As a research methods instructor, I encourage my students to conduct “authentic” research projects. For now, consider authentic undergraduate research experiences as student projects at any level where the findings might result in a conference presentation or a publishable paper. In my classes, this has included getting IRB approval and attempting to present our findings at regional conferences, sometimes with a journal submission. Eventually, I participated in opportunities for my students to contribute to big science by joining two crowd-sourcing projects organized by Alan Reifman. The first was published in Teaching of Psychology (School Spirit Study Group, 2004) as an example for others to follow. The data included samples from 22 research methods classes that measured indices representative of the group name. Classes evaluated the data from their own school, and the researchers collapsed the data to look at generalization across institutions. The second collaborative project included surveys collected by students in 10 research methods classes. The topic was primarily focused on emerging adulthood and political attitudes, but included many other psychological constructs. Later, I note the current opportunities for Emerging Adulthood theorists to use these data for a special issue of the journal Emerging Adulthood. These two projects notwithstanding, there have been few open calls for instructors to participate in big science. Instructors who want to include authentic research experiences do so by completing all their own legwork, as characterized by Frank and Saxe (2012) or as displayed by many of the poster presentations that you see at regional conferences.

However, that state of affairs has changed. Though instructors are likely to continue engaging in authentic research in their classrooms, they no longer have to develop the projects on their own. I am very excited about the recent opportunities that allow students to contribute to “big science” by acting as crowd-sourcing experimenters. In full disclosure, I acknowledge my direct involvement in developing three of the following projects. However, I will spare you a description of my own pilot test, the Collective Undergraduate Research Project (Grahe, 2010). Instead, I will briefly review recent “open invitation projects”: those in which any qualified researcher can contribute. The first to emerge (PsychFileDrawer, the Reproducibility Project) were aimed at PhD-level researchers. Since August 2012, however, three research projects have specifically invited students to help generate data, either as experimenters or as coders. More are likely to emerge soon as theorists grasp the idea that others can help them collect data.

This is a great time to be a research psychologist. These projects provide real opportunities for students and other researchers at any expertise level to get involved in research that is not only authentic but transformative. The Council on Undergraduate Research recently published an edited volume (Karukstis & Hensel, 2010) dedicated to fostering transformative research for undergraduates. According to Wikipedia, “Transformative research is a term that became increasingly common within the science policy community in the 2000s for research that shifts or breaks existing scientific paradigms.” I consider open invitation projects transformative because they change the way we view minimally acceptable standards for research. Each one changes the basic premise of collaboration by bridging not only institutions, but also the chasm of acquaintance: any qualified researcher can participate in data collection and authorship. Now, when I introduce research opportunities to my students, the ideas are grand. Here are research projects with grand ideas that invite contributions.

Many Labs Project – The Many Labs team’s original invitation, sent out in February 2013, asked contributors to join their “wide-scale replication project [that was] conditionally accepted for publication in the special issue of Social Psychology.” They sent a follow-up invitation in July reminding us of their October 1, 2013 deadline for data collection. The deadline limits future contributions, but the project is a great example of crowd-sourcing. Their goal is to replicate 12 effects using the Project Implicit infrastructure. As is typical of these projects, contributors meeting a minimum goal (N > 80 cases in this instance) will be listed as coauthors on future publications. As they stated in their July post, “The project and accepted proposal has been registered on the Open Science Framework and can be found at this link: http://www.openscienceframework.org/project/WX7Ck/.” Richard Klein (project coordinator) reported that the project includes 18 researchers at 15 distinct labs, with 38 labs planning to contribute data before the deadline. Because the deadline is near, new contributions are limited, but the project represents an important exemplar of what such projects can be.

Reproducibility Project – As stated in their invitation to new contributors on the Open Science Framework page, “Our primary goal is to conduct high quality replications of studies from three 2008 psychology journals and then compare our results to the original studies.” As of this writing, Johanna Cohoon from the Center for Open Science reports, “We have 136 replication authors who come from 59 different institutions, with an additional 19 acknowledged replication researchers (155 replication researchers total). We also have an additional 40 researchers who are not involved in a replication that have earned authorship through coding/vetting 10 or more articles.” In short, this is a large collaboration that is welcoming more committed researchers. Though the project needs advanced researchers, faculty can work closely with students who wish to contribute.

PsychFileDrawer Project – This is less a collaborative effort between researchers than a compendium of replications, as the project organizers state: “PsychFileDrawer.org is a tool designed to address the File Drawer Problem as it pertains to psychological research: the distortion in the scientific literature that results from the failure to publish non-replications.” It is collaborative in the sense that it hosts a “top 20” list of studies that viewers want to see replicated, and any researcher with a replication is invited to contribute data. Further, although the site was not initially targeted toward students, a new feature allows contributors to identify a sample as a class project, and the FAQ page asks instructors to comment on “…the level and type of instructor supervision, and instructor’s degree of confidence in the fidelity with which the experimental protocol was implemented.”

Emerging Adulthood, Political Decisions, and More (2004) project – Alan Reifman is an early proponent of collaborative undergraduate research projects. After successfully guiding the School Spirit Study Group, he again called on research methods instructors to collectively conduct a survey that included two emerging adulthood scales, some political attitudes and intentions, and other scales that contributors wanted to add to the paper-and-pencil survey. By the time the survey was finalized, it was over 10 pages long and included hundreds of individual items measuring dozens of distinct psychological constructs on 12 scales. In retrospect, the survey was too long, and the invitation to add measures might have contributed to the attrition of committed contributors that occurred. Even so, the final sample included over 1,300 cases from 11 distinct institutions across the US. A list of contributors and initial findings can be found at Alan Reifman’s Emerging Adulthood Page. The project suffered from the amorphous structure of the collaboration; in short, no one instructor was interested in all the constructs. To resolve this situation, in which wonderfully rich data sit unanalyzed, the contributors are inviting emerging adulthood theorists to analyze the data and submit their work for a special issue of Emerging Adulthood. Details on how to request the data are available on the project’s OSF page. The deadline for submissions is July 2014.

International Situations Project (ISP) – David Funder and Esther Guillaume-Hanes from the University of California, Riverside have organized a coalition of international researchers (19 and counting) to complete an internet-based protocol. As they describe on their about page, “Participants will describe situations by writing a brief description of the situation they experienced the previous day at 7 pm. They will identify situational characteristics using the Riverside Situational Q-sort (RSQ) which includes 89 items that participants place into one of nine categories ranging from not at all characteristic to very characteristic. They then identify the behaviors they displayed using the Riverside Behavioral Q-Sort (RBQ) which includes 68 items using the same sorting procedure.” The UC-Riverside researchers are taking primary responsibility for writing research reports. They have papers in print and others in preparation on which all contributors who provide more than 80 cases are listed as authors. In Fall 2012, Psi Chi and Psi Beta encouraged their members to replicate the ISP in the US to create a series of local samples, yielding 11 samples so far with 5 more contributors committed (Grahe, Guillaume-Hanes, & Rudmann, 2014). Currently, contributors are invited to complete either this project or the subsequent Personality project, which involves completing this protocol and then the California Adult Q-Sort two weeks later. Interested researchers should contact Esther Guillaume directly at eguil002@ucr.edu.

Collaborative Replications and Education Project (CREP) – This project has recently started inviting contributions and is explicitly designed for undergraduates. Instructors who guide student research projects are encouraged to share the list of available studies with their students. Hans IJzerman and Mark Brandt reviewed the three most-cited empirical articles in the top journals of nine subdisciplines, rated them for feasibility of completion by undergraduates, and selected the nine most feasible studies. The project provides small ($200-$500) CREP research awards for completed projects (sponsored by Psi Chi and the Center for Open Science). The project coordinators will encourage and support contributors in writing and submitting reports for publication. Anyone interested in participating should contact Jon Grahe (graheje@plu.edu).

Archival Project – This Center for Open Science project is also specifically designed as a crowd-sourcing opportunity for students. When it completes the beta-testing phase, the Archival Project will be publicly advertised. Unlike the projects reviewed thus far, this one asks contributors to serve as coders rather than experimenters. It is a companion to the OSC Reproducibility Project in that the target articles come from the same three journals during the first three months of 2008. The project has a low bar for entry: training takes little time (particularly with the now-available online tutorial), and coders can code as little as a single article and still make a real contribution. It also has a system of honors, as stated on its “getting involved” page: “Contributors who provide five or more accurate codings will be listed as part of the collective authorship.” The project was designed with the expectation that instructors will find it pedagogically useful and employ it as a course exercise. Alternatively, students in organized clubs (such as Psi Chi) are invited to participate to increase their own methodological competence while accumulating evidence of their contributions to an authentic research project. Finally, graduate students are invited to participate without faculty supervision. Interested parties should contact Johanna Cohoon (johannacohoon@gmail.com) for more information.

Future Opportunities – While this is intended to be an exhaustive list of open invitation projects, the field is not static and the list is likely to grow. What is exciting is that we now have ample opportunities to participate in “big science” through relatively small contributions. When developed with care, these projects follow the recommendation of Grahe et al. (2012) to take advantage of the magnitude of research conducted each year by psychology students. The bar for entry varies from relatively intensive (e.g., the Reproducibility Project and CREP) to relatively easy (e.g., the Archival Project and ISP), providing opportunities for individuals with varying resources, from graduate students and PhD-level researchers capable of completing high-quality replications to students and instructors who seek opportunities in the classroom. Beyond the basic opportunity to participate in transformative research, these projects provide exemplars for how future collaborative projects should be designed and managed.

This is surely an incomplete list of current and potential examples of crowd-sourced research. Please share other examples in the comments below. Further, consider pointing out strengths, weaknesses, opportunities, or threats that could emerge from collaborative research. Finally, any public statements about intentions to participate in this type of open science initiative are welcome.

References

Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives on Psychological Science, 7(6), 600-604. doi:10.1177/1745691612460686

Grahe, J. E., Guillaume-Hanes, E., & Rudmann, J. (2014). Students collaborate to advance science: The International Situations Project. Council on Undergraduate Research Quarterly.

Grahe, J. E., Reifman, A., Hermann, A. D., Walker, M., Oleson, K. C., Nario-Redmond, M., & Wiebe, R. P. (2012). Harnessing the undiscovered resource of student research projects. Perspectives on Psychological Science, 7(6), 605-607. doi:10.1177/1745691612459057

Karukstis, K. K., & Hensel, N. (2010). Transformative research at predominantly undergraduate institutions. Washington, DC: Council on Undergraduate Research.

Oct 4, 2013

A publishing sting, but what was stung?

by

Before Open Science there was Open Access (OA) — a movement driven by the desire to make published research publicly accessible (after all, the public usually had paid for it), rather than hidden behind paywalls.

Open Access is, by now, its very own strong movement — follow for example Björn Brembs if you want to know what is happening there — and there are now nice Open Access journals, like PLOS and Frontiers, which are peer-reviewed and reputable. Some of these journals charge an article processing fee for OA articles, but in many cases funders have introduced or are developing provisions to cover these costs. (In fact, the big private funder in Sweden INSISTS on it.)

But, as always, situations where money is involved and targets are desperate (please please please publish my baby so I won’t perish) breed mimics and cuckoos and charlatans, ready to game the new playing field to their advantage. This is probably just a feature of the human condition (see Trivers’s “The Folly of Fools”).

There are lists of potentially predatory Open Access journals — I have linked to some on my private blog (Åse Fixes Science) here and here. Reputation counts. Buyers beware!

Demonstrating the challenges of this new marketplace, John Bohannon published in Science (a decidedly not Open Access journal) a sting operation in which he spoofed Open Access journals to test their peer-review systems. The papers were machine-generated nonsense — one may recall the Sokal hoax from the Science Wars of the 1990s. One may also recall the classic Ceci paper from 1982, which made the rounds again earlier this year (and which I blogged about on my other blog).

Crucially, all of Bohannon’s papers contained fatal flaws that a decent peer reviewer should catch. The big news? Many journals' reviewers did not catch them (though PLOS ONE did). Here’s the Science article with its commentary stream, and a commentary from Retraction Watch.

This is, of course, interesting — and it is generating buzz. But it is also generating some negative reactions. For one, Bohannon did not include regular non-OA journals in his test, so the experiment lacks a control group, which means we can make no comparison and draw no firm inferences from the data. The reason he gives (quoted on the Retraction Watch site) is the very long turnaround time of regular journals, which can be months, even a year (or longer, as I’ve heard). I kinda buy it, but this is really what is angering the Open Access crowd, who see the piece as an attempt to implicate Open Access itself as the source of the problem. And, apart from not including regular journals in the test, Science published what Bohannon wrote without peer reviewing it.

My initial take? I think it is important to test these things — to uncover the flaws in the system, and also to uncover the cuckoos and the mimics and the gamers. Of course, the problems in peer review do not rest solely on the shoulders of Open Access — Diederik Stapel and other frauds published almost exclusively in paywalled journals (including Science). The status of peer review warrants its own scrutiny.

But I think the Open Access advocates have a really important point. As noted on Retraction Watch, Bohannon's study did not include non-OA journals, so it is unclear whether the peer-review problems he identified are specific to OA journals at all.

I’ll end by linking in some of the commentary that I have seen so far — and, of course, we’re happy to hear your comments.

Oct 2, 2013

Smoking on an Airplane

by

Photo of Denny Borsboom

People used to smoke on airplanes. It's hard to imagine, but it's true. In less than twenty years, smoking on airplanes has grown so unacceptable that it has become difficult to see how people ever condoned it in the first place. Psychological scientists used to refuse to share their data. It's not so hard to imagine, and it's still partly true. However, my guess is that a few years from now, data-secrecy will be as unimaginable as smoking on an airplane is today. We've already come a long way. When, in 2005, Jelte Wicherts, Dylan Molenaar, Judith Kats, and I asked 141 psychological scientists to send us their raw data so we could verify their analyses, many of them told us to get lost, even though, at the time they published the research, they had signed an agreement to share their data upon request. "I don't have time for this," one famous psychologist said bluntly, as if living up to a written agreement were a hobby rather than a moral responsibility. Many psychologists responded in the same way. If they responded at all, that is.

Like Diederik Stapel.

I remember Judith Kats, the student in our group who prepared the emails asking researchers to make their data available, standing in my office. She explained how researchers had responded to our emails. Many had simply refused to share data, but some of our Dutch colleagues had done so in an extremely colloquial, if not downright condescending, way. Judith asked me how she should respond. Should she once again inform our colleagues that they had signed an APA agreement, and that they were violating a moral code?

I said no.

It's one of the very few things in my scientific career that I regret. Had we pushed our colleagues to the limit, perhaps we would have identified Stapel's criminal practices years earlier. As his autobiography shows, Stapel counterfeited his data in an unbelievably clumsy way, and I am convinced that we would have easily identified his data as fake. I had many reasons for saying no, which seemed legitimate at the time, but in hindsight I think my behavior was a sign of adaptation to a defective research culture. I had simply grown accustomed to the fact that, when I entered an elevator, conversations about statistical analyses would fall silent. I took it as a fact of life that, after we methodologists had explained to students how to analyze data in a responsible way, some of our colleagues would take it upon themselves to show students how scientific data analysis really worked (today, these practices are known as p-hacking). We all lived in a scientific version of The Matrix, in which the reality of research was hidden from everyone except those who had the dubious honor of being initiated. There was the science that people reported, and there was the science that people did.

At Groningen University, where Stapel used to work, he was known as The Lord of the Data, because he never let anyone near his SPSS files. He pulled results out of thin air, throwing them around as presents to his co-workers, and when anybody asked him to show the underlying data files, he simply didn't respond. Very few people saw this as problematic because, hey, these were his data, and why should he share them with outsiders?

That was the moral order of scientific psychology: data are private property. Nosy colleagues asking for data? Just chase them away, like you chase coyotes from your farm. That is why researchers had no problem whatsoever denying access to their data, and why several people saw the data-sharing request itself as unethical. "Why don't you trust us?" I recall one researcher saying in a suspicious tone of voice.

It is unbelievable how quickly things have changed.

In the wake of the Stapel case, the community of psychological scientists committed to openness, data-sharing, and methodological transparency quickly reached a critical mass. The Open Science Framework allows researchers to archive all of their research materials, including stimuli, analysis code, and data, and to make them public by simply pressing a button. The new Journal of Open Psychology Data offers an outlet specifically designed to publish datasets, thereby giving them the status of a publication. PsychDisclosure.org asks researchers to document decisions, such as how sample size was determined and variables were selected, that were left unmentioned in publications; most researchers provide this information without hesitation, and some now do so voluntarily. The journal Psychological Science will likely build requirements for this type of information into the submission process. Data-archiving possibilities are growing like crazy. Major funding institutes require data-archiving or are preparing regulations to that effect. In the Reproducibility Project, hundreds of studies are being replicated in a concerted effort. As a major symbol of these developments, we now have the Center for Open Science, which facilitates the massive grassroots effort to open up the scientific regime in psychology.

If you had told me that any of this would happen back in 2005, I would have laughed you away, just as I would have laughed you away in 1990, had you told me that the future would involve such bizarre elements as smoke-free airplanes.

The moral order of research in psychology has changed. It has changed for the better, and I hope it has changed for good.

Welcome!

by

Welcome to the blog of the Open Science Collaboration! We are a loose network of researchers, professionals, citizen scientists, and others with an interest in open science, metascience, and good scientific practices. We’ll be writing about:

  • Open science topics like reproducibility, transparency in methods and analyses, and changing editorial and publication practices
  • Updates on open science initiatives like the Reproducibility Project and opportunities to get involved in new projects like the Archival Project
  • Interviews and opinions from researchers, developers, publishers, and citizen scientists working to make science more transparent, rigorous, or reproducible.

We hope that the blog will stimulate open discussion and help improve the way science is done!

If you'd like to suggest a topic for a post, propose a guest post or column, or get involved with moderation, promotion, or technical development, we would love to hear from you! Email us at oscbloggers@gmail.com or tweet @OSCbloggers.

The OSC is an open collaboration: anyone is welcome to join, contribute to discussions, or develop a project. The OSC blog is supported by the Center for Open Science, a non-profit organization dedicated to increasing the openness, integrity, and reproducibility of scientific research.
