The Value of Reproducible Research: Sometimes the response matters more than the results

Yesterday I followed a tweet to a post by Jason Lyall responding to apparently widespread criticism of a new survey in Afghanistan done by the Asia Foundation. The post was the first I’d heard of the survey or of the response to it, so I don’t know anything more about the criticism, its nature, or its arguments than what Jason wrote. But the post did link to one criticism in particular, from Sarah Chayes, a journalist turned NGO-founder and regular ISAF-hired expert on Afghanistan. The general approach taken in her critique seems illustrative of something I find very valuable about systematic and reproducible research and analysis: it facilitates productive and progressive (though perhaps not always intentionally so) responses.

This is probably best demonstrated by first considering the alternative. What kind of responses does non-reproducible research facilitate? I got into this a little bit in a post I wrote about a wide swath of anthropology a few months ago. But in general, non-reproducible research methods are typically communicated in some combination of the following phrases:

“I spent x years…talked to many local people…formed friendships…everyone on the inside knows…deep study of contextual issues…time spent on the ground…thorough analysis of the ethnographical work that’s been done…extensive experience with locals.” Etc.

When taking on the study of practically any subject, there are countless ways to formulate the research problem. But let’s say in attempting to tackle a social topic that you limit yourself to a modest ten features or factors of society that you consider important, and then for convenience’s sake let’s pretend each of those factors is only a binary: either it’s present or it isn’t (e.g. Muslim/non-Muslim, Poor/non-Poor, Literate/non-Literate). Just those ten binary factors can be combined in 2^10 = 1,024 ways. Now let’s say you’ve spent an impressive amount of time and effort and as one result you’ve produced a 30-page paper, which can examine only a small fraction of those combinations.
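For anyone who wants to check that arithmetic, here is a minimal sketch in Python. Only the first three factor names come from the examples above; the other seven are hypothetical placeholders I've added just to reach ten.

```python
from itertools import product

# The first three factors come from the text; the remaining seven are
# hypothetical placeholders, used only to bring the total to ten.
factors = ["Muslim", "Poor", "Literate"] + [f"factor_{i}" for i in range(4, 11)]

# Each factor is treated as binary: present (True) or absent (False).
# Every tuple is one possible combination of the ten factors.
profiles = list(product([False, True], repeat=len(factors)))

print(len(factors))   # 10
print(len(profiles))  # 1024, i.e. 2 ** 10

# Each additional binary factor doubles the number of combinations.
print(2 ** (len(factors) + 1))  # 2048
```

The doubling in the last line is the part that matters for the argument: relax the ten-factor assumption even slightly and the space of combinations grows far faster than anyone can write papers about it.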

At a general level, two responses come immediately to mind: 1) One could throw up one’s hands in the air and say “there’s just absolutely no way you can ever really analyze human culture/behavior/society. So, whatever.” (A response I consider more often than is probably healthy) or 2) One can say something like, “ok, but I think it’d be better or more useful, or we still need, to explore these other two or three combinations.” And off they go to implement their design. Now how does the resulting research connect? It doesn’t really. One person’s take is her take. Another person’s take is his. The basic feature of this process is that it’s entirely horizontal scaling. You can’t really build on someone else’s work because there’s no clear way one set of features is related to another. You can just build out. And there’s really no end in sight, because the original assumption of ten factors isn’t a reasonable one. In this world of endlessly additional factors and rapidly multiplying new combinations, “progress” in research and analysis means very little.

Now turn to reproducible research. What the Asia Foundation did was say “this question is important” and “we’re going to ask these people.” So we know exactly what questions they asked (assuming some basic research integrity) and we know fairly precisely whom they asked. And just having those clear details of the research design facilitates a very different kind of response. The very worst-case scenario is where we believe outright fraud occurred. In that case the very clear response is to go and actually ask those same questions of those same people (or as close as we can get), and compare the responses (here are fraudulent responses, here are real responses). Few people really do assume outright fraud, though, and short of fraud there are some other clear responses:

“That was the wrong question.”

Ok, let’s go ask the right question and compare responses again.

“Those were the wrong people.”

Ok, let’s go ask the right people, and again compare responses.

“We shouldn’t be asking questions, we should be comparing weapons procurement and sales.”

Ok, let’s go try to survey the prices of weapons in different areas, or find out who has weapons. Afghans will lie because they think you’ll try to take their weapons away? Ok, let’s have other Afghans ask. Or let’s figure out some alternative indirect measures, and again compare responses. All of a sudden it’s clear and obvious how each subsequent study builds on and relates to the one before it. Each new research project isn’t its own brand new thing; instead it’s built on the work of the past. All of a sudden we’re engaged in a great collective effort and not lost in our own eternally isolated worlds. We have a foothold and we can begin to climb towards improvement.

Reproducible research is so powerful that even those who fundamentally believe that systematic research has nothing to offer find themselves making their criticisms in reproducible, checkable ways. Even Chayes, who doubts any methods are as good as experience, spends a substantial portion of her critique making criticisms of the methods (which she can only do because they were clear and reproducible). How does one criticize an article based on experience? “You should have had a different experience!”?

This brings me to my main point. The value of reproducible research is sometimes not really in its initial results, but rather in the responses it provokes. It may be that the Asia Foundation’s study is so problematic that we can’t really use it to say much about Afghanistan (I haven’t looked at it thoroughly enough to say anything about that). Fine, but just by doing the study they’ve provoked a criticism that motivates progress. “They did a poor job, and we can do it better.” Whatever happens next, it can be built on what happened before. And through an iterative process a collective effort will emerge with collective results that are simply not possible without that continuity and collective effort. The alternatives are a world where we all lay our own individual bricks on our own individual plots of land and a world where we lay our bricks on top of each other and build something really usable. This, after all, is a popular and important critique that’s frequently made of the ISAF effort: that there is no continuity between deployments, that units do their thing and then leave, and that the next unit comes along with no knowledge of what was done before and so proceeds to do something entirely new and entirely different and entirely doomed to end six-to-nine months later with the end of that unit’s deployment. It’s recognizable madness.

Somewhat weirdly, in the community I currently inhabit it doesn’t make sense to write this post. Tell someone here at Harvard (at least in the parts of the departments I tend to experience) that you’re writing about the importance of reproducible research and evidence, and they’ll look at you like you’re writing a paper for an intro to research methods course. “Why would you waste time writing something this basic and obvious?” This past summer, a Scientific American blog post made an argument similar to the one Chayes makes, but at a general level (not about Afghanistan). I didn’t really see much of a response to it, except for this small strident tweet from Dan Gilbert, a prominent professor in the psychology department:

Stupidest essay of the year (so far). The claim that “we shouldn’t study this scientifically” is always wrong. bit.ly/P3E2Fe

— Daniel Gilbert (@DanTGilbert) August 10, 2012

Arguments that are “the stupidest of the year” don’t warrant much attention, and they don’t typically get it around here. But I think that’s actually sad and unfortunate. Because in the other (government analysis) world, the idea that analysis of the conflict in Afghanistan should be done systematically and using methods that are reproducible is hardly a trivial point. In that other world, the world where recognizably important decisions are being made, reproducibility is a strange and somewhat incoherent concept. Chayes writes that “recent conversations…(are) factors (which) provide more eloquent indications about prevailing conditions than do opinion surveys”, and she really believes that. It’s very likely that she doesn’t know anyone (and she knows a lot of the population of “experts” considered to “know something” about Afghanistan) who works in Afghanistan and believes systematic analysis has anything to do with understanding the conflict in Afghanistan.

The gap between the reproducible social science community and the foreign policy/national security community is so large that neither side recognizes how or why it might need to communicate with the other. Hence in the US we have a relatively vibrant scientific community that has almost no contact with or impact on a relatively powerful policy community. And of such gaps and disconnects are massive disasters like the war in Afghanistan (and many other congressional policies, for that matter) made.

If you want to comment on this post, write it up somewhere public and let us know - we'll join in. If you have a question, please email the author at meinshap@gmail.com.