We at Insightus are currently busy organizing a conference, coming later this year to Durham, North Carolina, to bring activist defenders of voting rights from across the South together with progressive data science practitioners in the interest of defending democracy. As the conference’s prospectus states,
In the never-ending struggle against voter suppression and election manipulation, technology is no substitute for good old-fashioned volunteers on the ground, demonstrators in the halls of power, and attorneys in the courts of law. But when effectively applied, new technologies have the potential to be a potent force multiplier in defense of democracy. Data science – sophisticated Big Data analysis and visualization for situational awareness and decision support – particularly holds that promise.
This conference is, in part, our effort to respond constructively to a challenge that many would-be do-gooders in this field have encountered too often: too few activists understand what data science is, even fewer can think of anything it might possibly do to assist their efforts, and way too many have a vague, unsettling suspicion that data science might just be a bad thing.
Because we all know that ‘Big Data’ (and therefore, by extension, its Big Daddy, data science) is a bad thing, right? Edward-Snowden-something-something-NSA-something-Google-something-DATA! It follows us around the web, hoovering up our personal information, monetizing our interests, beliefs, and problems, predicting what we’ll buy or who we’ll vote for next, scrutinizing our associations, reading our email, deciding whether we should be allowed on an airplane or not. Amiright?
Well...sure, yeah. Except, um, no. Nuh-uh. For, in the immortal words of an eminent authority (cough-me-cough) commenting recently here at DKos on a book with the really scary subtitle “How Big Data Increases Inequality and Threatens Democracy”:
“Big Data increases inequality and threatens democracy” in a manner similar to the way that “big hammers smash unsightly holes in drywall.” Yes, they do...if that’s what you use them for. But if you use them for hanging drywall, quite the opposite is true.
There’s nothing inherently good or bad about data science. The goodness or badness is in the motivation of the user. Just as it is with hammers.
Data Science: A Definition (Or: “Begin at the beginning,” the King said, very gravely, “and go on till you come to the end: then stop.”)
Let’s really start at the beginning, and first define data.
I had three cups of coffee and a bowl of granola for breakfast this morning (don’t you judge me). What makes that data isn’t the fact that it happened. What makes it data is the fact that I wrote it down just now...I recorded it. If a tree falls in the forest and no one hears, that’s not data. Data is information recorded.
In this particular example, that record of what I had for breakfast this morning is what you might call Small Data. Now, if everyone in America somehow pulled together to list on this page what all 321 million of us had for breakfast this morning (including, of course, those who went to school or work or out on the streets hungry...which damn sure shouldn’t happen, but don’t get me started)...well, that would be Big Data. That’s all Big Data is; it’s one whole helluva lot of Small Data.
Small data: Suzy Democrat just donated $35 to the school board campaign of Joe Progressive (go Joe!). Big Data corollary: the awesome database at FollowTheMoney.org, lovingly recording for posterity, in one handy location, practically every dang reportable penny anyone has donated to any American politician in the last umpteen years. You see where I’m going with this? FollowTheMoney is Freaking Big Data, but it certainly isn’t evil. In fact, it’s on our side.
Continuing with our breakfast analogy, here’s where data science starts to come in. If three of us made a simple list of what we had for breakfast this morning, like so...
Joe Blow: green eggs and ham
SuziQ: breakfast burritoe
Doc Dawg: granola and coffee
...it would be easy enough to use this list to ask “which of us had granola this morning?” But if all 321 million of us were to add our breakfast menus to that list, obviously the result would be bloody worthless. Because do you really want to scan 321 million rows of a table by eye and count the number of “granola”s on your fingers? That’s why God created data science: to enable mere mortals to extract useful insights from big data.
That 321 million line long list on a Daily Kos page is, if nothing else, at least a good start, because at least the data exists somewhere...we won’t have to go around ringing millions of doorbells and polling people in order to create it. Data collection is really outside the domain of data science...it’s SEP (Somebody Else’s Problem). If you don’t have the data then there’s nothing to science the shit out of. Which is why data science always begins with “show me the data!” Still, it will pay you to check with a data scientist first before you go collecting data. There’s one holy helluva lot of data already out there in this world, hidden away (yet accessible) in the darnedest places. And data scientists tend to know where a lot of those places are. We might-could save you a whole lot of effort.
Instead, our data science challenge begins with the fact that our web page — with a table that’s 321 million rows long, remember — is completely unusable. That data needs to be given a good home, such as a database — a type of software that stores any sort of data in a beautifully organized fashion that makes it easy to ask questions of, such as “hey, who in America had a burrito this morning?”
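For the curious, here is a minimal sketch of what "giving the data a good home" can look like, using Python and the SQLite database that ships with it. The table name (`breakfasts`), the column names, and the helper functions are all made up for illustration, and the three rows are just the toy list from above, not real data.

```python
import sqlite3

# Toy data: the three-person breakfast list from earlier (spelling left as typed).
ROWS = [
    ("Joe Blow", "green eggs and ham"),
    ("SuziQ", "breakfast burritoe"),
    ("Doc Dawg", "granola and coffee"),
]


def build_db(rows):
    """Load (person, meal) pairs into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, invented for this sketch.
    conn.execute("CREATE TABLE breakfasts (person TEXT, meal TEXT)")
    conn.executemany("INSERT INTO breakfasts VALUES (?, ?)", rows)
    return conn


def who_had(conn, food):
    """Return the names of everyone whose recorded meal mentions `food`."""
    cur = conn.execute(
        "SELECT person FROM breakfasts WHERE meal LIKE ? ORDER BY person",
        (f"%{food}%",),
    )
    return [name for (name,) in cur]


conn = build_db(ROWS)
print(who_had(conn, "granola"))
```

The same question works whether the table holds three rows or 321 million; the database does the scanning so your fingers don't have to.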
If your data set is relatively small...say, fewer than a hundred thousand rows...you might be tempted to just stuff it into an Excel spreadsheet and roll your own data analysis using Excel’s built-in functions. But please, please don’t do that. Spreadsheets aren’t built for data science. They aren’t just easy to use...they’re incredibly easy to screw up with, too. If you have more data than you can easily take stock of by eye, it’s nigh-on a certainty that you will screw it up all to hell with a spreadsheet. Happens all the time. Just don’t go there. Trust me. Call a friendly data scientist.
Talk Dirty To Me
But I’ve just skipped over quite a lot of thorny issues that are the day-to-day grind of data science. How am I going to get that list of 321 million lines off of a Daily Kos page and into a database? And once I do, how am I going to detect (and correct) the misspelled word ‘burritoe’ in that list? Because if I don’t, we’ll never know that SuziQ had a burrito for breakfast. And how am I going to know whether ‘Dawg’ is a first name or a last name, or whether SuziQ even has a last name, or whether “granola and coffee” counts as one dish, or two? Plus a whole lot of other niggling issues, all of which matter. This is a messy old world, and its data is messy, too...it’s in the wrong places, in the wrong forms, full of errors and omissions and undocumented assumptions, and worse. The short answer to these questions (all part of what is called data preparation or data cleaning) is that data scientists are, first and foremost, tool-users (rather like chimpanzees), and there are tools for these things. Except when there aren’t...in which case one just has to jump in and start cleaning stuff up by hand. Or, if you’re cool with promoting gig-economy indentured servitude, you could just hop over to Amazon and pay a pittance for a bunch of Mechanical Turks to hand-clean the data for you. Oy. But even with appropriate tools and automated processes, data acquisition and preparation are inevitably slow, tiring, often maddening work, informally known as data munging. And really tireless, skilled, careful data mungers are the unsung heroes of data science (but, alas, they seldom get the applause they so richly deserve).
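To make the ‘burritoe’ problem a little more concrete, here is one small sketch of the kind of tool a munger might reach for: fuzzy matching against a list of known-good words, using `difflib` from Python's standard library. The word list (`KNOWN_FOODS`), the helper names, and the 0.8 similarity cutoff are all assumptions picked for this toy example; a real cleanup would need a far bigger vocabulary and a human checking the results, because an automatic fixer can "correct" things that were never wrong.

```python
import difflib

# Hypothetical vocabulary of correctly spelled menu words (assumed for this sketch).
KNOWN_FOODS = ["burrito", "granola", "coffee", "eggs", "ham"]


def clean_word(word, vocabulary=KNOWN_FOODS, cutoff=0.8):
    """Return the closest known spelling of `word`, or `word` unchanged if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def clean_meal(meal):
    """Clean each whitespace-separated word of a meal description."""
    return " ".join(clean_word(w) for w in meal.split())


print(clean_meal("breakfast burritoe"))
```

Words that aren't close to anything in the vocabulary (like "breakfast") pass through untouched, which is the cautious choice: better to leave a word alone than to guess badly.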
So the next time you walk in cold on a data scientist and tell her “Here’s a shite-load of data rattling around in this box. I need a straight answer from it by tomorrow morning,” don’t be surprised if you get a Wolverine action figure (or worse) thrown at you. Which is why we’re holding our upcoming conference a full year before the next election: because people like me are fast running out of things to throw at people like you.
The dang thing is that...as crucial and costly and time-consuming as data munging can be (and in my experience, anyway, a good 50% to 90% of a typical project is devoted to it), munging is actually only the tiniest little sliver of data science. Because once your sweet little data-baby is birthed, bathed, and buttoned up in a nice snug onesie, it’s time to put that lazy brat to work. And that’s the heart of data science: putting data to work.
Data Visualization (or: Let’s Just See About That, Shall We?)
You and I both know that it’s a lot easier to look at a nice chart or graph, and quickly find something informative there, than it is to puzzle over a table full of numbers. Turning numbers into helpful graphics is called data visualization. It can be as simple as cranking out a nice graph, or as woo-woo as a 3D virtual reality animation. Either way, there’s a lot more to a really good data visualization than meets the eye. Whole books...hell, whole rooms full of books...have been written on the subject, including The Bible: Edward Tufte’s The Visual Display of Quantitative Information (which, by the way, is a great read even if you’re not a techie...its style is very accessible...and it makes a wonderful coffee table book if you’re really just into looking at pictures).
Any data scientist worth his or her salt knows a whole helluva lot about how to produce really good data visualizations...and, perhaps more importantly, how to avoid producing really bad ones. Which is more than one can say for the typical “infographic artist,” who can be relied upon to produce nightmarish monstrosities such as the following, which I stumbled across only this morning.
There’s so much wrong here it’s hard to know where to start. For instance those yellow bars, which a casual reader could be forgiven for mistaking for a bar graph? They’re not. They’re just some graphic artist’s aesthetic masturbations. They’re worse than useless...they’re misleading. Secondly, what are we to make of the assertion that 182% of new voters aged 26-40 are registered as unaffiliated? Thirdly, if your graph requires footnotes (for chrissake) and several hundred words of text, you’re doing it wrong. I could go on and on, but I won’t. You want to have a graphic artist pretty up your data visualizations? Fine. But for God’s sake please have a data scientist design them.
Here There Be Dragons
One particular type of data visualization that is really hot these days, and that can be fantastically informative, is mapping. Here at DKos for instance, Stephen Wolf is routinely doing really nice things with gerrymandering and election result stories using really nice, clear, easy-to-understand quantitative maps.
Importantly, mapping can do much more than merely illustrate your point. It can actually help you discover new, important, and actionable insights. I’ll be utterly shameless here and take my own team’s work as one good example. Employing quite a lot of nice data science methods, including mapping, we broke the story (previously suppressed by Republican governor Pat McCrory’s Department of Health and Human Services) of widespread contamination of drinking water wells in North Carolina with the toxic and radioactive element, uranium. When we mapped the spatial distribution of those contaminated wells we discovered evidence that one county’s health department (Wake County’s) knew about the problem, but must have failed to warn neighboring counties’ health departments about the threat. Or, as the state’s chief toxicologist told me in an interview,
“Wake never told us the uranium contamination was on its county line. We never looked at a map of the data. It never occurred to us before this [our interview with Rudo] that those other counties might have problems, too.”
Almost a year after we published that story, my inbox is still frequently flooded with gushing thank-yous from well owners in the affected area. If your data has a geospatial dimension to it (for instance, addresses), mapping can sometimes produce miracles of insight. It can even save lives.
Yeah, okay, but what does this have to do with voting rights? I’m glad you asked. Let’s look at another example: Insightus’s interactive web application, EVE (North Carolina’s Early Voting Evaluator). I’ve written about it extensively here at DKos already (here and here), so I won’t do so again now. But if you want to understand the kind of actionable voting rights insights that mapping voters, polling places, precincts and more can deliver, check out those diaries. EVE was a massive undertaking, requiring (among many other challenges faced) converting more than 7 million voters’ addresses to latitude/longitude data so that they could all be located on maps. That’s the kind of thing that is simply impossible to do by hand, but that becomes pretty straightforward with the proper application of suitable data science tools.
There’s So Much More (But Not Today)
Data acquisition, preparation, cleaning, and visualization (plus maybe a dash of statistical analysis) are just table-stakes in data science. There’s a lot more woo-woo stuff we could get into as well, such as modeling, machine-learning, prediction, profiling...but we won’t. At least not right now. Because the reality is that most voting rights activists will never need more than just the basics from data science.
Next week at this same time (9:15 AM EDT on Friday...recs then will be mighty appreciated to keep the article from falling off the Recent/Recommended lists) we’ll take a more laser-focused look at precisely what kinds of things data science can do to assist in the defense of voting rights and fair elections. And we’ll start by looking at how the bad guys use data science to suppress voting and nullify the voters’ will.
In the meantime, please pass along to your favorite neighborhood voting rights activist a link to this diary, and also to the prospectus for our upcoming Data & Democracy 2018 conference. Because remember: the whole point of these diaries is to figure out how to avoid once again bringing a knife to a gunfight in 2018.