I attended a presentation on “a framework of corruption” the other day. Perhaps this is true for other areas of research as well, but researchers and analysts who look at corruption love to talk about frameworks and maps and indices and typologies. In a sense you can’t blame them. Corruption is about as vague a term in social research as is possible. To make talking and thinking about it useful you have to first break it into pieces. What kinds of corruption are there? Unfortunately, the typologies usually just involve other terms and ideas that are nearly as vague as the original word.

This bothers me. As Schaun mentioned in his last post, corruption is a label we place on observable behaviors. You can point to an official engaged in bribery or in fraudulent record-keeping; you can’t point to an official engaged in corruption (until you’ve identified an observable behavior as corruption). This was my typical response to Army commanders in Afghanistan when they asked me about corruption. “Don’t tell me you have a problem with corruption. Tell me you have a problem with such-and-such actions on the part of such-and-such people. Then we can start to deal with that problem.” Now I mostly look at corruption in the US, and think more about “institutional corruption” than the kind of behaviors we encountered in Afghanistan. If anything, the problem of specificity here is worse.

Since switching focus I’ve developed a slightly different way of looking at the problem. Researchers interested in corruption don’t tend to get very specific because they don’t actually know what to be specific about. We “know” corruption contributed to the financial crisis. We argue about contributing factors like “a lack of transparency” or “financial misrepresentation”. Ask which people did what and how many of them did it, and you probably won’t get many replies. But everything that’s happened in the last two to five years has happened in actual locations and has been done by actual people. In big banks and big organizations, lots of analysts and managers and personnel of varying sorts have engaged in a relatively wide variety of tasks and behaviors (not all of which, or even most of which, needless to say, are corrupt). The consequences of all those individual tasks and behaviors, taken together, constitute the problems we face today. Corruption is all real stuff. It’s just that we know very little about, and have very little access to, that kind of population-level reality.

I don’t claim this is a new insight. Lots of researchers have pointed out the general lack of empirical data in social science and social research. In fact, I’m basically saying something very similar to what Schaun expressed here. It’s probably safe to claim that social science has become much more empirical over the last few decades; it seems to me that most well-known work tends to have some empirical element. However, I think we can do a lot more in response to this insight than we have, and I think we must do more about the problem than most researchers have considered necessary. The rest of this post describes three limitations of current data collection, and then makes the case, with some examples, for massively increased data collection.

Isolated factors and overly targeted data collection: Data collection is expensive in terms of time, energy, and money, so researchers tend to get very specific in the data they collect. You think income level affects smoking? Collect data on income levels and smoking patterns. Think social networks (who is friends with whom) affect smoking? Collect social network data. In the long view, this narrowness is a mistake. Few researchers actually think the “x causes y” model accurately represents reality. They “exclude” and “control for” and “hold everything else constant” because that’s all they can do, given that they can’t get data for all the other factors. The problem is that all those other factors and behaviors are constantly occurring before, during, and after the behavior the researcher is concerned with. We can hypothesize why they don’t matter, or why they matter less than the things for which we did collect data, but as argued previously, we don’t actually know enough to make that case very strongly yet. Before we can know something like that, we need much fuller and more robust descriptions of people’s behaviors. Ultimately, we have to collect data on almost every aspect and stage of human life you can imagine. That’s what the advance of social science will require.
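The “control for everything else” problem can be made concrete with a toy simulation. In the sketch below (the variable names, effect sizes, and confounding story are all hypothetical, chosen only to illustrate the point), an unmeasured third factor drives both of the measured variables. The narrow two-variable dataset shows a strong association, while the partial correlation that controls for the missing factor – computable only if someone actually collected that data – comes out near zero.

```python
import random

random.seed(42)

# Toy simulation: an unmeasured factor ("stress") drives both "income"
# and "smoking"; neither causes the other, but a narrow two-variable
# dataset can't reveal that.
n = 10_000
stresses, incomes, smokings = [], [], []
for _ in range(n):
    stress = random.gauss(0, 1)  # the factor nobody collected data on
    stresses.append(stress)
    incomes.append(-0.8 * stress + random.gauss(0, 1))
    smokings.append(0.8 * stress + random.gauss(0, 1))

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# With only the narrow dataset, income and smoking look related...
r_is = corr(incomes, smokings)

# ...but the partial correlation, controlling for the missing factor,
# is near zero -- and it can only be computed from the broader data.
r_iz = corr(incomes, stresses)
r_sz = corr(smokings, stresses)
partial = (r_is - r_iz * r_sz) / ((1 - r_iz**2) * (1 - r_sz**2)) ** 0.5

print(f"naive correlation:   {r_is:.2f}")
print(f"partial correlation: {partial:.2f}")
```

The simulation only makes the essay’s point in miniature: whether the narrow association is real or an artifact is undecidable from the narrow dataset alone.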

Giving credit to data collectors: Everyone admires a researcher with an awesome data set. I don’t think many really care that much about the actual data collector. That’s a mistake. Getting out into the world and collecting data should be a big and important part of the work of research. I think some of the great psychologists knew that. In his writings, Stanley Milgram describes tons of research projects that required active engagement with everyday life – on subway cars, in the streets, etc. Anthropologists have always placed importance on the active work of participant observation (although, sadly, most bring back and publish their interpretations and reflections more than actual data). I’ve always enjoyed reading the work of scientists who study ants (like Edward O. Wilson and Deborah Gordon), and it’s amazing how much time they spend just watching ants and recording their movements. The physicist Ernest Rutherford is alleged to have said “all science is either physics or stamp collecting.” Social scientists and researchers need to do a whole lot more stamp collecting, and we should love it, and we should admire it.

Data ownership: I was in a Geographic Information Systems (GIS) workshop recently and overheard some senior researchers discussing the growing interest in the idea of making data public. They were speaking specifically in reference to federally funded research, and seemed to support the notion that when research is funded by the National Science Foundation, the researchers’ data should eventually be made accessible to the public (which I later learned is actually something of a hot topic at the moment). A few weeks later I mentioned this idea in a meeting and was kind of surprised at the skeptical reception. No one wants to release data until they’ve “got their papers out of it”. By that time they’re working on new things and no one cares (“because the data has already been used”). I think that’s a really strange way of thinking about science. Data is treated like tissue paper – use it once and throw it away, and definitely don’t use someone else’s. That’s a horrible mistake (Google “The Republic of Science” by Michael Polanyi; he makes a great case for why).

But all of the above are problems that might go away quickly if data collection came to be perceived as an exciting endeavor. An analogy is appropriate here. Accounting is an extremely important part of contemporary human society. Modern accounting is based on double-entry bookkeeping, which was invented sometime between the 13th and 15th centuries. It’s basically a great way of tracking transactions over time. It doesn’t seem at first like an amazing idea, but the poet Goethe called double-entry bookkeeping one of “the most beautiful discoveries of the human spirit” (or “finest inventions of the human mind”, depending on your translation). The development and advance of social data collection is going to be just as important, and based on a lot of similarly mundane-seeming record-keeping.

Getting data at the level that will be really useful requires increased prioritization of record keeping. Focus on record keeping and data collection, and analysis of the data will follow fairly naturally. Focus on analysis, on the other hand, and data collection will remain something only a minority of researchers does much of, and only to the extent that it allows them to squeeze a paper or two out of it (after which they promptly drop the collection). When I worked for the Army and with the intelligence community, most of the work revolved around dueling assertions. You make one claim and then someone else makes another. There wasn’t a lot of systematic data or description. That frustrated me, and I often said so. In a sense I was just making another claim, which would typically be met by another counter-claim (“this isn’t about evidence! You just have to make the call!”), slipping us back into the patterned grind of “intelligence work”. Then Schaun and I and a couple of others started putting together systematic data sets and using them to conduct our analyses and guide our claims. When we did that, we met with one of two responses from other analysts: they either thought our dataset was crap, so they tried to put together a better one, or they thought our analysis was crap, so they tried to reanalyze our dataset. Doing without data was just no longer an option. At first I was irritated. I quickly became thrilled. Both responses were an improvement over the old world of claim and counter-claim.

With everyone constantly doing things, no one – including researchers – has the time to record it all for everyone else. Keeping records at that scale requires technology. Given all the other things people have to do, recording their own doings is going to fall fairly low on the list of priorities, so we need to invent technology that makes record-keeping as costless as possible. That kind of technology is becoming more and more available. A lot of popular activities have a fundamental technological component which almost inherently involves record-keeping: Twitter, Facebook, Foursquare, Pinterest, texting, phone calls, online chatting, product and customer-experience reviews (e.g. Yelp), credit cards. All of these automatically record date and time information, as well as some information about the experience or event. Most could easily record location information. But even older, basic technologies that used to have no “social” component could be updated to aid record-keeping: cars, microwaves, plumbing, heating and cooling, and so on. How long does your microwave operate every day? How often? At what times of day? Cars already record mileage. If auto-makers linked odometers to the car’s computers, that information could be made even more accessible and usable.
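As a sketch of how cheap that kind of built-in record-keeping could be, here is a hypothetical event record of the sort a device might append automatically every time it is used. The field names and devices are illustrative, not any real product’s API; the point is just that a timestamped, optionally geotagged record costs the user nothing.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional

# Hypothetical sketch: the near-zero-cost record a device (car,
# microwave, phone app) could append automatically on each use.
@dataclass
class EventRecord:
    source: str                    # e.g. "microwave", "odometer"
    event: str                     # e.g. "run", "trip_end"
    value: Optional[float] = None  # e.g. seconds run, miles driven
    lat: Optional[float] = None    # location, when the device has it
    lon: Optional[float] = None
    ts: float = field(default_factory=time.time)  # stamped automatically

def append_record(log: list, record: EventRecord) -> None:
    """Append one record; a real device would write to durable storage."""
    log.append(asdict(record))

log: list = []
append_record(log, EventRecord("microwave", "run", value=90.0))
append_record(log, EventRecord("odometer", "trip_end", value=12.4))

# Records serialize to plain JSON, easy to pool across many devices.
print(json.dumps(log[0], sort_keys=True))
```

Because the time stamp is filled in by the device itself, the human never does any bookkeeping; the record is a by-product of using the thing, which is exactly the property that makes collection at scale plausible.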

We need as full an account of daily life as possible. We can figure out what matters, when, and why, once we have the data. Assuming something doesn’t matter before then is unwise.

It’s an exciting prospect. I think it adds a great deal of importance and appeal to research. In a sense it also expands the world of research. The emergence of modern science involved a whole lot of amateur scientists and hobbyists who made important contributions by working away in their own corners of the world. The theory-heavy social science that has become typical and conventional today presents a fairly massive barrier to entry. In order to participate you have to spend a whole lot of time reading what a whole lot of other people have said about the world. You have to know the terms they used and you have to use them yourself. Only then do you get to start doing your own work. Most non-academic people don’t have the time or interest to engage in that primarily theoretical work. The work of systematic description, on the other hand, is an activity with a fairly low entry cost, and it’s very accessible to amateurs working away at developing their own tools and technologies and data sets. If we can let go, just for a brief period, of the pie-in-the-sky goal of “building a science of behavior” and focus instead on the more practical goal of just getting systematic descriptions of observables, we might quickly discover we are able to achieve both goals at once.


If you want to comment on this post, write it up somewhere public and let us know - we'll join in. If you have a question, please email the author at .