By Jack Moore
The work folder on my laptop contains 218 Microsoft Excel files totaling 436 megabytes of data. These files have been the muse for hundreds of posts I've published at statistically slanted sites like FanGraphs.com, AdvancedNFLStats.com and many others over the last year and a half. I'm not alone -- for amateur statisticians like me publishing work in the sports field, Excel is our main weapon.
Monday morning, the Political Economy Research Institute at the University of Massachusetts-Amherst released a paper refuting the results of an influential study by Harvard economists Carmen Reinhart and Kenneth Rogoff. One of the paper's major findings turns on something tiny, the kind of mistake we'd expect from an undergraduate student -- or, you know, a sports blogger -- one wrong character in one cell of an Excel spreadsheet.
Reinhart and Rogoff's study found that countries with massive public debt -- think the United States -- experienced startlingly low economic growth. Their findings were widely cited by politicians, particularly on the right, as the reason the United States needed to enact austerity policies. If we didn't slash spending and social programs, debt would rise, and high debt would lead to depression-level economic conditions.
Specifically, according to Reinhart and Rogoff, countries carrying public debt over 90 percent of GDP experienced an average growth rate of minus-0.1 percent. For reference, the United States experienced an average growth rate of plus-3.3 percent from 1947 to 2008, according to data from the Bureau of Economic Analysis.
As complicated as refuting such a claim may seem, the UMass paper's destruction of Reinhart and Rogoff's results is agonizingly simple. Thomas Herndon, Michael Ash and Robert Pollin, the authors of the UMass study, got their hands on Reinhart and Rogoff's data. They found a litany of mistakes in the spreadsheets Reinhart and Rogoff turned over. The most infuriating can be found in just one cell of their working spreadsheet:
In cell L51, Reinhart and Rogoff take the average of cells L30 through L44 to find the growth experienced by countries with public debt greater than 90 percent of GDP, not cells L30 through L49, the full range of the data. In missing these data points, they excluded Belgium, which has experienced 2.6 percent growth since the end of World War II when carrying public debt over 90 percent of GDP.
Until yesterday, fellow economists had been unable to replicate the results of the Reinhart and Rogoff study. The culprit, it turns out, was that little switch of a "9" to a "4" in cell L51.
Fixing this and the other errors drastically changes the results -- instead of minus-0.1 percent average growth for countries with high public debt, the figure becomes plus-2.2 percent. The Roosevelt Institute covers these issues in more detail here, along with other problems with the study unrelated to data entry and pure calculation -- as if that weren't enough.
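The mechanics of the mistake are worth seeing in miniature. Here's a minimal sketch of the range error, with hypothetical placeholder growth rates standing in for the actual country data: averaging only the first 15 of 20 rows, as a formula referencing L30:L44 instead of L30:L49 would, silently drops the last five countries and shifts the result.

```python
# A minimal sketch of the range error: averaging only part of a column.
# The growth figures below are hypothetical placeholders, not the actual
# Reinhart-Rogoff country data.
from statistics import mean

# Imagine 20 rows of per-country growth rates (spreadsheet rows 30-49).
growth = [-1.0, 0.5, 2.6, 1.2, -0.3, 0.8, 1.5, -2.0, 0.2, 1.1,
          0.9, -0.5, 1.8, 0.4, 2.2, 1.3, -0.7, 2.6, 1.0, 0.6]

full_average = mean(growth)            # like AVERAGE(L30:L49): all 20 rows
truncated_average = mean(growth[:15])  # like AVERAGE(L30:L44): 5 rows dropped

print(f"full range:      {full_average:.2f}")
print(f"truncated range: {truncated_average:.2f}")
# The two averages diverge whenever the dropped rows aren't typical of the
# rest -- which is exactly how one wrong cell reference can move a headline
# number.
```

Nothing in the code is sophisticated, and that's the point: a spreadsheet gives no warning that a range reference stops short of the data.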
This study was used to drive economic policy in Washington for years, holding influence everywhere from Representative Paul Ryan's office to the Washington Post's editorial page. It's the perfect illustration of what can happen when we let data -- or, at least, our all-too-human handling of data -- define our thinking by itself.
When used correctly, data can be incredibly powerful. But we must step back and check our work. We must step back and ask ourselves, "Does this make sense?" And sports are far from immune to this kind of mistake.
Did it make sense for Matt Wieters to be projected for an MVP-caliber 7.9 WARP in 2009 by Baseball Prospectus's PECOTA system before he had played a major league game? No, and it turned out his performance in the Double-A Eastern League, which Wieters torched for a .365/.460/.625 line in 2008, was weighted too heavily by PECOTA. Sure enough, Wieters hit .288/.340/.412 in 2009 -- solidly above average for a catcher, but nowhere near an All-Star level overall.
Asking questions of our data is critically important. Whether in economics or in sports, the results we derive from data must rest on a logical explanation. Graham MacAree, a sabermetrician whose work has certainly influenced me, covered this issue in a post titled "The Problem with Sabermetrics":
"At its best, sabermetrics flows directly from the innate logic of the game, and then fits the observed data in an agreeable way. Stuff like win probability, linear weights and baseruns aren't really statistical constructs but logical ones. Thinking about the game in a rigorous enough manner gets you to those concepts, whether or not you can do the maths involved to nail down the minutiae. The ideas are what's important, and they all come from baseball."
Big Data in sports gives us so many new ways to evaluate players, whether via PITCHf/x for baseball fielders, QBR for quarterbacks or the fascinating SportVU system for basketball players. But we have to understand the data first, whether it's the calibration of the cameras behind PITCHf/x, the formulas behind QBR, or the programming behind the Toronto Raptors' ghost defense system. And whether it comes from a blogger, a team executive or a university economist, mistakes, whether conceptual or typographical, can -- and will -- be made.
In the era of Big Data, with statistics determining so much more of our processes and decisions, whether in sports, economics or political policy, it is more critical than ever to be vigilant. To sniff out these mistakes before they lead us astray, whether we're making decisions that impact a fantasy team, a professional team or a country's economy, we must find the logic (or lack thereof) behind the data rather than blindly follow it.
* * *
Jack Moore's sports addiction was a lost cause from the moment his older brother mowed a makeshift baseball diamond into his backyard. Now he writes about sports wherever the web will have him. Right now, you can catch him at CBSSports.com, FanGraphs, Advanced NFL Stats, Bucky's 5th Quarter, DisciplesOfUecker.com, RotoWire.com and on Twitter (@jh_moore).