An adversarial test for replication success
(tl;dr: I argue that the only way to tell if a replication study was successful is by considering the theory that motivated the original.)
Psychology is in the middle of a sea change in its attitudes towards direct replication. Despite the value of direct replications in providing evidence for the reliability of a particular experimental finding, incentives for conducting them have typically been limited. Increasingly, however, journals and funding agencies value these sorts of efforts. One major challenge has been evaluating the success of direct replication studies. In short, how do we know if the finding is the same?
There has been limited consensus on this issue, so many projects have used a diversity of methods. The 100-study Reproducibility Project: Psychology (RP:P), for example, reports several indicators of replication success, including 1) the statistical significance of the replication, 2) whether the original effect size lies within the confidence interval of the replication, 3) the relationship between the original and replication effect sizes, 4) the meta-analytic estimate of effect size combining both, and 5) a subjective assessment of replication by the team. These indicators mostly hung together, though there were numerical differences.
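To make these indicators concrete, here's a minimal sketch in Python of how indicators 1, 2, and 4 could be computed for a single original/replication pair reported as correlations. The numbers, and the Fisher-z approach, are my own illustrative choices, not the RP:P analysis code.

```python
# Illustrative sketch (not RP:P code): three replication indicators for a
# single original/replication pair, assuming both report a correlation r.
import numpy as np
from scipy import stats

# Hypothetical effect sizes and sample sizes.
r_orig, n_orig = 0.40, 30
r_rep,  n_rep  = 0.15, 120

# (1) Statistical significance of the replication (two-sided test of r = 0).
t_rep = r_rep * np.sqrt((n_rep - 2) / (1 - r_rep**2))
p_rep = 2 * stats.t.sf(abs(t_rep), df=n_rep - 2)

# (2) Does the original effect size fall inside the replication's 95% CI?
# CI computed on the Fisher-z scale, then back-transformed.
se_rep = 1 / np.sqrt(n_rep - 3)
ci_rep = np.tanh(np.arctanh(r_rep) + np.array([-1.96, 1.96]) * se_rep)
orig_in_rep_ci = ci_rep[0] <= r_orig <= ci_rep[1]

# (4) Fixed-effect meta-analytic estimate combining both studies
# (inverse-variance weights on the Fisher-z scale).
weights = np.array([n_orig - 3, n_rep - 3])
z_meta = np.average(np.arctanh([r_orig, r_rep]), weights=weights)
r_meta = np.tanh(z_meta)

print(f"replication p = {p_rep:.3f}")
print(f"original r in replication CI [{ci_rep[0]:.2f}, {ci_rep[1]:.2f}]: {orig_in_rep_ci}")
print(f"meta-analytic r = {r_meta:.2f}")
```

With these made-up inputs, the different indicators can easily disagree with one another, which is exactly the problem.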
Several of these criteria are flawed from a technical perspective. As Uri Simonsohn points out in his "Small Telescopes" paper, as the power of the replication study goes to infinity, the replication will always be statistically significant, even if it's finding a very small effect that's quite different from the original. And similarly, as N in the original study goes to zero (if it's very underpowered), it gets harder and harder to differentiate its effect size from any other, because of its wide confidence interval. So both statistical significance of the replication and comparison of effect sizes have notable flaws.*
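A quick simulation (with arbitrary, made-up parameters) illustrates both failure modes: a huge replication sample makes even a tiny effect "significant," and a tiny original sample yields a confidence interval wide enough to be consistent with almost anything.

```python
# Illustrative simulation of the two failure modes (arbitrary parameters).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# (a) As replication power grows, even a tiny effect is "significant".
tiny_d, huge_n = 0.05, 200_000                # true effect d = 0.05, n per group
treatment = rng.normal(tiny_d, 1, huge_n)
control = rng.normal(0, 1, huge_n)
_, p = stats.ttest_ind(treatment, control)
print(f"(a) d = {tiny_d}, n = {huge_n} per group: p = {p:.2g}")

# (b) As the original N shrinks, its CI becomes consistent with almost anything.
small_n = 10
orig = rng.normal(0.5, 1, small_n)            # "original" study, true d = 0.5
half_width = stats.t.ppf(0.975, small_n - 1) * orig.std(ddof=1) / np.sqrt(small_n)
print(f"(b) n = {small_n}: 95% CI on the mean = "
      f"[{orig.mean() - half_width:.2f}, {orig.mean() + half_width:.2f}]")
```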
In addition, all this trouble is just for a single effect. In fact, one weakness of RP:P was that researchers were forced to choose just a single effect size as the key analysis in the original study. If you start looking at an experiment that has multiple important analyses, the situation gets way worse. Consider a simple 2x2 factorial design: even if the key test identified by a replicator is the interaction, if the replication study fails to see a main effect or sees a new, unpredicted main effect, those findings might lead someone to say that the replication result was different from the original. And in practice it's even more complicated than that, because sometimes it's not straightforward to figure out whether it was the main effect or the interaction that the authors cared about (or maybe it was both). Students in my class routinely struggle to find the key effect that they should focus on in their replication projects.
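To see how this plays out, here's a small simulation (with invented effect sizes, not data from any real study) of a 2x2 design in which the interaction comes out as in the original, but an unpredicted main effect shows up as well. It uses statsmodels purely for convenience.

```python
# Illustrative simulation (invented parameters): the key interaction
# "replicates", but an unpredicted main effect of B also appears.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_per_cell = 150

rows = []
for a in (0, 1):
    for b in (0, 1):
        # Original theory: only the A-by-B interaction matters.
        # These replication data also contain a main effect of B.
        mu = 0.5 * (a * b) + 0.4 * b
        rows.extend({"A": a, "B": b, "y": y}
                    for y in rng.normal(mu, 1, n_per_cell))

df = pd.DataFrame(rows)
model = smf.ols("y ~ C(A) * C(B)", data=df).fit()
print(model.summary().tables[1])   # shows the interaction plus an extra B effect
```

Did this study replicate? The interaction is there, but so is something the original theory never predicted.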
Recently we had a case that was yet more confounding than this. We did a direct replication of an influential paper and found that we were able to reproduce every single one of the statistical tests. The only issue was that we also found another significant result where the authors' theory would predict a null effect or even an effect in the opposite direction. (We were studying theory of mind reasoning, and we found that participants' responses were slower not only when the state of the world was incongruent with their own and others' beliefs, but also when it was congruent with belief information.) In this case, it was only the theoretical interpretation that allowed us to argue that our "successful replication" was in fact inconsistent with the authors' theory.
I think this case illustrates a broader generalization, namely that statistical methods for assessing replication success need to be considered as secondary to the theoretical interpretation of the result. Instead, I propose that:

The primary measure of whether a replication result is congruent with the original finding is whether it provides support for the theoretical interpretation given to the original.

And as in my discussion of publication bias, I think the key test is adversarial. In other words, a replication is unsuccessful if a knowledgeable and adversarial reviewer could reasonably argue that the new data fail to support the interpretation.

Note that statistical and theoretical criteria may often line up in simple cases. If the original study was an RCT of an intervention, then the key theoretical interpretation was essentially captured by the effect size, and so the relationship between the two effect sizes is important. But in other cases, the theoretical interpretation may hinge on the reliability and direction of the effect, not its magnitude; in those cases, the theoretical test gives the right answer while the statistics may not. And similarly, in the case I described above, the theoretical test gives the right answer: our data didn't support the original theory even though the statistics all lined up.
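Here's one more invented example of that divergence: a replication that is reliable and in the predicted direction (so a directional theory is supported), but whose effect is much smaller than the original, so the effect-size criteria would call it a failure. The numbers are hypothetical.

```python
# Illustrative numbers only: direction/reliability vs. effect-size criteria.
import numpy as np
from scipy import stats

d_orig = 0.80                     # small original study's effect size
d_rep, n_rep = 0.25, 400          # large replication, per-group n, same direction

# Rough SE of Cohen's d for two groups of n_rep (ignoring the small d^2 term).
se_rep = np.sqrt(2 / n_rep)

# The replication is reliable and in the predicted direction...
z_rep = d_rep / se_rep
p_rep = 2 * stats.norm.sf(abs(z_rep))

# ...but the original effect size falls far outside the replication's 95% CI,
# so the "original in replication CI" criterion calls this a failure.
ci_rep = d_rep + np.array([-1.96, 1.96]) * se_rep
print(f"replication: d = {d_rep}, p = {p_rep:.4f}")
print(f"original d = {d_orig} vs. replication CI [{ci_rep[0]:.2f}, {ci_rep[1]:.2f}]")
```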
This argument apparently puts me in an odd position, because it seems like I'm advocating for giving up on an important family of quantitative approaches to reproducibility. In particular, the effect-size estimation approach to reproducibility emerges from the tradition of statistical meta-analysis. And meta-analysis is just about as good as it gets in terms of aggregating data across multiple studies right now. So is this an argument for vagueness?
No. The key point that emerges from this set of ideas is instead that the precision of the original theoretical specification is what governs whether a replication is successful or not. If the original theory is vague, it's simply hard to tell whether what you saw supports it, and all the statistics in the world won't really help. (This is, of course, the problem underlying all the discussion of context-sensitivity in replication.) In contrast, if the original theory is precisely specified, it's much easier to assess support.
In other words, rather than arguing for a vaguer definition of replication success, what I'm arguing for is more precise theories. Replication is only well-defined if we know what we're looking for. The tools of meta-analysis provide a theory-neutral fix for a class of single-effect statistics (think of the effect size for an RCT). But once we get beyond the light shed by that small lamp post, theory is going to be the only way we find our keys.
---
* Simonsohn proposes another, perhaps more promising, criterion for distinguishing effect sizes that I won't go into here because it's limited to the single-effect domain.