I’ve had some time over the last year to rework some of my thinking on data science and data-driven decision making. I’ve reached the point where I need to make some of that thinking explicit to keep it from falling out of my head.
I don’t like the data science Venn diagram. There’s nothing really wrong with it - it’s just that it seems to reduce data science almost entirely to analytic capabilities. I’ve written before on how I think the most important thing about data science is the “science” part, and analysis is certainly a necessary part of that. But I think the focus on analysis has fostered the misconception among data-science consumers that data science is the practice of taking data and feeding it into a set of sophisticated computations that yield insights. I guess that’s technically accurate, in the same sense that it’s technically accurate to say driving a car is the practice of feeding fuel into an internal combustion engine to yield movement.
I want to make it clear that my ideas deal with the types of environments I typically work in: small-to-not-quite-large, non-tech companies with limited analytic budgets. I think the rules are probably different for the Googles of the world. There aren’t that many Googles in the world.
The way I think about this now wouldn’t really fit into a diagram, so I’ll make do with lists. Data-driven decision making is five overlapping steps: define, prepare, analyze, communicate, and curate.
Each step overlaps with the next, and the last step overlaps with the first. It’s like a Venn diagram where five overlapping rings wrap around in a circle. I can’t really define a problem fully without getting into data preparation, because the availability, quality, etc. of the data will shape how I define the problem. Likewise, it’s difficult…or at least unfruitful…to fully analyze something without planning for how I will communicate the findings from the analysis. Communication plans often loop back and motivate changes to the analysis. Curation is a topic that I think deserves much more attention than it gets. When an analysis is finished and the communication is well under way, it’s almost always worth it to give a lot of thought to how the data and results can be stored to make them as integrate-able as possible with future projects. That means curation depends on the definition of the next problems coming up the road.
In each of these stages, I typically have three tasks that really don’t come in any particular order: consolidation, validation, and experimentation. I usually jump between tasks many, many times in any given stage for any given project.
Consolidation means getting rid of noise - removing information that hides or distorts other, more useful information. Validation means explicitly estimating to what extent other people ought to take my work seriously. Experimentation means doing lots of random stuff in hopes that it will do some good even though I have little empirical reason to believe that it will.
I’ve chosen some words to characterize each task within each step, but don’t get hung up on those. I’m just using them for the sake of convenience.
Define. Most projects - at least the not-terribly-painful ones - start with some kind of business problem: “X is happening and we don’t want it to,” or “We want people to do Y and they aren’t.” That needs to be translated into analyzable metrics, and translation usually isn’t a straightforward process. The natural-language version of a problem hides a lot of assumptions and implicit expectations. Those need to be made explicit. Each time something implicit is made a little more explicit, I have to circle back and make sure that the explicit version of things still seems to stand a reasonable chance of addressing the problem that launched the project in the first place. That’s sanity-checking. I’ve often seen a completely reasonable-looking request turn entirely unreasonable as it is made explicit: once we unpacked all the assumptions, we saw the project wasn’t feasible within our time and resource constraints. I also lump collection in with the definition tasks. In my experience, it’s rare to know beforehand whether and to what extent new data will really help with a project. A lot of collection is just getting new observations on the chance they will show me something I didn’t see before.
Prepare. Data rarely come in ready-to-analyze format. Even if they’ve been carefully collected and stored, they need to be transformed - binned into relevant categories; added to, subtracted from, or multiplied or divided by other values or series; turned into a series of separate binary variables; and so on. A full data set of un-transformed variables is about as useful as having the data set in several different pieces. It hasn’t been fully brought together until everything is formatted correctly. This also involves a lot of cleaning. No one ever overestimates the time it will take to clean data. It always takes longer than expected. Over time, I’ve come to appreciate how much preparation involves reshaping the data set. Most analytic techniques, at their most basic, require me to line up the data into rows and columns. I recently had data that were divided up by year, country, and product category. I eventually lined up product categories by year and country because that best met the goals of the project, but I didn’t arrive at that understanding until I had already tried lining up countries by year and product, countries by year separately for each product, and even years by country and product. There often isn’t a way to know beforehand which shape is best. I have to try out a few options and see what works, often going back to the definition stage to keep grounded in the business problem that motivated the project.
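To make the reshaping point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the records, the field names, the sales figures, and the `pivot` helper are all made up, and in practice a data-frame library would usually do this job. It only shows how the same observations can be lined up in more than one way, so the shapes can be compared against the business problem.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per (year, country, category).
# The field names and values are invented for illustration.
records = [
    {"year": 2014, "country": "US", "category": "toys", "sales": 10},
    {"year": 2014, "country": "US", "category": "books", "sales": 7},
    {"year": 2014, "country": "CA", "category": "toys", "sales": 4},
    {"year": 2015, "country": "US", "category": "toys", "sales": 12},
    {"year": 2015, "country": "CA", "category": "books", "sales": 3},
]


def pivot(rows, row_keys, col_key, value_key):
    """Line up values so each combination of row_keys is one row and
    each distinct value of col_key is one column.

    If two records share the same row and column, the later one
    overwrites the earlier one; a real project would need to decide
    how to aggregate duplicates.
    """
    table = defaultdict(dict)
    for record in rows:
        row_id = tuple(record[key] for key in row_keys)
        table[row_id][record[col_key]] = record[value_key]
    return dict(table)


# One shape: a row per (year, country), a column per product category.
by_year_country = pivot(records, ("year", "country"), "category", "sales")

# Another shape: a row per (year, category), a column per country.
by_year_category = pivot(records, ("year", "category"), "country", "sales")
```

Missing combinations simply have no entry in a row, which is itself useful to see: different shapes expose different gaps in the data.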
Analyze. This stage gets a lot of press. I’m not sure it deserves it. More often than not, I have a bunch of potential predictors, and most of those are just going to clutter things up. I have to select the ones that will be most useful. There are a bunch of methods for doing this, and while some (stepwise selection based on p-values, for example) are demonstrably ineffective, most options seem about as good as any other. Same goes for the process of validating the analysis. Choose a loss function or a goodness-of-fit measure and make sure I don’t over-fit. Lots of options for doing this seem to work well. The hard part, for me, is the tuning of a model. So I’ve got a reasonable set of predictor variables and the model seems to make reasonably accurate predictions. Can I do better? Add more trees? Change a prior or a hyper-parameter? Every tunable method I’ve found seems to enjoy near-consensus on good default parameters, and the opposite of consensus on what I should try after I’ve tried the defaults. In many cases, good non-default settings seem to depend on the peculiarities of the data. I just have to try some stuff and see if any of it improves the model.
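The "try the default, then try some other stuff" loop can be sketched in a few lines. This is only an illustration under stated assumptions: a tiny k-nearest-neighbours regressor stands in for whatever model is actually being tuned, the data are synthetic, the default of `k = 5` is a hypothetical default, and the loss is mean squared error on a held-out set so that a setting only wins if it helps on data the model has not seen.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def holdout_mse(train, holdout, k):
    """Mean squared error on held-out points, which guards against over-fitting."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in holdout]
    return sum(errors) / len(errors)


def tune_k(train, holdout, candidates):
    """Score every candidate setting and return the best one with all scores."""
    scores = {k: holdout_mse(train, holdout, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = random.Random(0)
points = [(x / 4, (x / 4) ** 2 + rng.gauss(0, 1)) for x in range(40)]
train, holdout = points[0::2], points[1::2]

DEFAULT_K = 5  # hypothetical default setting
best_k, scores = tune_k(train, holdout, candidates=[1, 3, DEFAULT_K, 9, 15])
improvement = scores[DEFAULT_K] - scores[best_k]  # zero if the default wins
```

Which setting wins here depends entirely on the synthetic data, which is the point of the paragraph above: the scores have to be computed, not guessed.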
Communicate. I get kind of worked up about this stage, mostly because two of the three tasks seem to get ignored in almost every data/analytics conversation I witness. Visualization is great. Even if I have an awesome model based on awesome data derived from an awesome research plan, most people who are going to use my work aren’t trained to see, understand, or care about any of that awesomeness. A good visualization - chart or table - shows the results, the uncertainty surrounding those results, and the direct ties to the business problem in one neat package. But before I can do that, I need to back-translate the results. With natural languages, when you have a document translated from language A to language B, you back-translate it by giving it to a native speaker of B and having him or her translate it into A. If the original document and the back-translation are similar, that means the translation was pretty good. Once I have the results of an analysis, I need to talk about those results with someone close to the original business problem. I need them to tell me what I’ve told them about the analysis. If what they tell me is wildly different from what I think the analysis actually says, that tells me I communicated it poorly, which tells me that I might not have translated the business problem appropriately in the first place (which is kind of a downer), or that I need to work on the way I talk about the analysis (less of a downer). There’s a third option: my customers might need instruction. Sometimes I back-translate just fine, and my visualizations are fine - my customers just need to learn more about analysis. This is especially true in situations where an organization is growing its analytic capabilities. If a person has only ever seen analysis in the form of averages and bar charts, that person is not going to get much benefit from a cluster analysis unless he or she is willing to become a more educated consumer.
Curate. At the time that I actually present the results of an analysis to stakeholders, those results are usually spread across half a dozen files. The code that generated those results is full of hacks and not full of documentation. In each of those output files, the results are in whatever format seemed most appropriate for explaining the analysis to other people, which might be (and usually is) entirely different from the format needed to reproduce the analysis or integrate the results into current data stores. I find few things more difficult than mustering the will to structure the code and results of a project in a way that maximizes their usability for future projects. It’s tempting to tell myself that I’ll take care of that down the road if I need to. I have never, ever had a good experience giving in to that temptation. The outputs need to be structured. They need to be mapped to the data currently in the databases so they can be quickly called up and integrated in future analyses. They need to be tagged so I or other people can find the stuff later. All of this requires me to look at the next projects on the horizon and make guesses about what I’m going to need at future dates.
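One lightweight way to start on this is a small manifest stored alongside each project's outputs. The sketch below is an assumption-heavy illustration, not a standard: the field names (`tags`, `source_tables`, `joins_to`), the project name, the file name, and the table name are all hypothetical. It records what each output file is, which existing table it maps to, and the tags someone could search on later.

```python
import json
from datetime import date


def build_manifest(project, outputs, tags, source_tables, created=None):
    """Describe a finished project's outputs so they can be found and joined later."""
    return {
        "project": project,
        "created": (created or date.today()).isoformat(),
        "tags": sorted(set(tags)),
        "source_tables": sorted(source_tables),
        "outputs": outputs,
    }


def find_by_tag(manifests, tag):
    """Return the names of all projects whose manifest carries the given tag."""
    return [m["project"] for m in manifests if tag in m["tags"]]


# Hypothetical example: every name below is made up for illustration.
manifest = build_manifest(
    project="churn-2015",
    outputs=[
        # "key" and "joins_to" map the output back to an existing table.
        {"file": "scores.csv", "key": "customer_id", "joins_to": "crm.customers"},
    ],
    tags=["churn", "retention", "churn"],
    source_tables=["crm.customers"],
    created=date(2015, 1, 31),
)

# Serialize to text that can be saved next to the output files.
manifest_text = json.dumps(manifest, indent=2, sort_keys=True)
```

The useful part is less the format than the habit: writing the manifest forces a guess about which keys and tags the next projects will need.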
Data-driven decision making requires all of these capabilities. Under-fund a capability and you create bottlenecks: you’ll get analysts with super-fancy tool belts who have to go begging to your IT department every time they need data in a format other than what IT planned for, or who have to find ways to cram results into PowerPoints because they don’t have the time or resources to automate reporting, or who have to do the analyses the stakeholders think they want instead of having the authority to choose the best tool given the stakeholders’ objectives. You’ll torpedo your capabilities. You’ll find it hard to keep talented and motivated employees from leaving. There are cheaper ways to make gut decisions.
I don’t think wording changes mean that much by themselves, but I do think it’s worth it for any organization to think about the difference between “data”, “analytics”, “research”, and “science”. In my experience, non-academic organizations (and quite a few academic ones) define “data” as “numbers in a spreadsheet,” “analytics” as “aggregated information in tables and charts,” and “research” as “papers and presentations that justify a position.” In my experience, few businesses outside of the tech industry have a working definition of “science.” It’s just not the way most people are trained to think about decision making. It shifts the decision maker from the driver’s seat to the passenger’s seat: still very much involved, but only one part of the process that brings us to our final destination. That’s a difficult change to make, but it’s where the real potential lies. You can’t have a data-driven organization and constantly try to take the wheel yourself.