Previously I wrote about the allure of big data. Now I turn to the question of “raw” data. Is there such a thing or is it a myth, an oxymoron — like “jumbo shrimp” or “just one episode on Netflix”?
Why do we cling to this notion of raw data if it doesn’t exist?
I recently read “Raw Data” Is an Oxymoron (2013, edited by Lisa Gitelman), a fascinating book that turns “raw” data on its head just about every which way, looking both back in time and across disciplines. Just listen to this mind-blowing sentence from the introduction: “Indeed, the seemingly indispensable misperception that data are ever raw seems to be one way in which data are forever contextualized — that is, framed — according to a mythology of their own supposed decontextualization” (p. 6). Basically, just by thinking of data as “raw,” we are already framing and molding it to our preconceptions of rawness.
Data: from Latin to English
One of my favorite chapters in this edited volume is a historical and linguistic overview of the word “data.” When faced with any dauntingly broad topic, I always love homing in on the word itself to get some definitional and etymological clarity (remember, etymology=words, entomology=bugs). So I just ate up Daniel Rosenberg’s chapter, “Data Before the Fact”; the next two sections are my summary of it.
Rosenberg traces the word “data” from Latin into English, a “naturalization” process that occurred during the 1700s. (I suppose that if people can be naturalized when they change citizenship, so too can words when they take up residence in a new language.) “Data” comes from the Latin verb “dare,” to give, so right off the bat we have this inclination to think of data as “a given.” The common Latin phrase “data desuper” means “given from above.”
Indeed, the early English-language instances of the word “data” were primarily in the context of theology and mathematics. Data were either given from above and therefore not questioned, or they were given as a set of assumptions before starting a mathematical proof. Either way, data were something you started with, something everyone agreed was “beyond argument.”
Data: from given to gotten
By the 1800s, English-language usage of the word “data” had begun to shift away from something given to something obtained. Specifically, data came to be thought of as something gained through empirical observation and experimentation. This latter connotation is closer to what we have today: even if we think of data as raw, we do tend to think of it as something that you get or collect from out in the world.
Rosenberg made these observations by searching a large collection of texts called Eighteenth Century Collections Online. He also discusses using Google Ngram for these types of queries, although it was not available at the time of his research. Just for fun, I tried a Google Ngram of the words “data,” “fact,” and “evidence” from 1800 to 2000. Here’s what that looks like:
I have to note the irony of the Google Ngram page footer: “Run your own experiment! Raw data is available for download here.”
Data: from plural to “mass” noun
Returning to the book’s introduction: it explains how data are inherently “aggregative,” i.e., we tend to think of data in herds rather than as solo animals. And here I was sadly robbed of what I thought was one of my solidly “smarty pants” moves. I used to pride myself on correctly treating “data” as a plural noun: i.e., “data are” and “datum is.” But now I understand that “data is” is about as common as “data are,” and bright folks like Steven Pinker are telling us to wake up and smell the mass noun (p. 19). This “massness” of data is broader than just a grammatical issue, however. I tie it back to the concept of big data: data are (is? ack!) powerful in the aggregate. Data kind of presupposes a horde of like-minded data, such that we don’t pay much attention to an individual datum/data point.
Data: rawness is relative
My experience with the idea of “raw” data is that it’s all relative. In the genetics data coordinating center where I work, raw data are the genetic data (the A’s, C’s, T’s and G’s) that we get from the genotyping lab. When I’ve talked to people who work in genotyping labs, they say “Oh no, the raw data is what comes off the machine” (here the “machine” being a genotyping or sequencing machine). It seems like data are raw when they first come into our possession, or at least that’s a convenient way to think about it. It’s similar to when you go to the grocery store: the raw produce is in the bins, and it’s what you take home to chop up and cook. Rawness may be relative in practice, but in absolute terms, Gitelman and the book’s contributing authors would remind us, it’s elusive!