TECH TALK: Welcome to the world of Big Data

ERIC’S TECH TALK

by Eric Austin
Computer Technical Advisor

 

What exactly is Big Data? Forbes defines it as “the exponential explosion in the amount of data we have generated since the dawn of the digital age.”

Harvard researchers Erez Aiden and Jean-Baptiste Michel explore this phenomenon in their book, Uncharted: Big Data as a Lens on Human Culture. They note, “If we write a book, Google scans it; if we take a photo, Flickr stores it; if we make a movie, YouTube streams it.”

And Big Data is more than just user-created content from the digital era. It also includes previously published books that are now newly digitized and available for analysis.

Together with Google, Aiden and Michel have created the Google Ngram Viewer, a free online tool allowing anyone to search for n-grams, or linguistic phrases, in published works and plot their occurrence over time.

Since 2004, Google has been scanning the world’s books and storing their full text in a database. To date, it has scanned 15 million of the 129 million books published between 1500 and 2008. From this database, researchers created a table of two billion phrases, or n-grams, which can be analyzed by the publication year of the book in which they appear. Such analysis can provide insight into the evolution of language and culture over many generations.

As an example, the researchers investigated the phrase “the United States are” versus “the United States is.” When did we start referring to the United States as a singular entity, rather than a group of individual states? Most linguists think this change occurred right after the Civil War ended in 1865, but careful analysis with the Google Ngram Viewer makes it clear that the singular form didn’t take off until a generation later, in the 1880s.
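The year-by-year phrase counting described above can be sketched in a few lines of Python. This is only an illustration, not the researchers’ actual method: the three-book corpus is made up, and the helper names ngrams and count_by_year are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_by_year(books, n):
    """Map each publication year to a Counter of n-gram occurrences."""
    table = defaultdict(Counter)
    for year, text in books:
        table[year].update(ngrams(text, n))
    return table


# Hypothetical toy corpus of (publication year, full text) pairs.
books = [
    (1850, "the United States are a union and the United States are many"),
    (1890, "the United States is a nation"),
    (1890, "the United States is strong but the United States are divided"),
]

table = count_by_year(books, 4)
for year in sorted(table):
    plural = table[year]["the united states are"]
    singular = table[year]["the united states is"]
    print(year, "are:", plural, "is:", singular)
```

Plotting those two counts for every year, across millions of real books instead of three invented ones, gives the kind of curve the Ngram Viewer draws.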

Author Seth Stephens-Davidowitz thinks the internet has an even greater resource for understanding human behavior: Google searches. Whenever we do a search on Google, our query is stored in a database. That database of search queries is itself searchable using the online tool Google Trends. Stephens-Davidowitz found this data so interesting he wrote his dissertation on it, and has now written a book: Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.

Google Trends doesn’t just tell us what people are searching for on the internet; it also shows where those searches come from and how they change over time. Clever analysts can cross-index this data with other sources, such as regional demographics, to tell us some interesting facts about ourselves. Stephens-Davidowitz argues this data is even more accurate than surveys because people lie to other people, but not to Google.

In Everybody Lies, Stephens-Davidowitz reports that on the night of Obama’s election in 2008, one out of a hundred Google searches containing the word “Obama” also contained the word “nigger” or “KKK.” But who was making those searches? Are Republicans more racist than Democrats? Not according to the data. Stephens-Davidowitz says there were a similar number of these types of searches in Democratic-leaning areas of the country as in Republican ones. The real divide is not North/South or Democrat/Republican, he asserts, but East/West, with a sharp drop-off in states west of the Mississippi River.

Stephens-Davidowitz even suggests Google Trends can offer a more accurate way of predicting vote outcomes than polling. By looking at searches containing the names of both candidates in the 2016 election, he found that the order in which the names appear in a search may demonstrate voter preference. In key swing states, there were more searches for “Trump Clinton” than for “Clinton Trump,” indicating a general movement toward the Republican candidate. This contradicted much of the polling data at the time, but turned out to be a more accurate barometer of candidate preference.
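The name-order comparison can be pictured with a small Python sketch. It is only an illustration of the idea, not Stephens-Davidowitz’s actual analysis: Google Trends reports aggregate search interest rather than individual queries, so the list of queries below is invented, and the function name name_order_counts is hypothetical.

```python
def name_order_counts(queries, first, second):
    """Count queries naming both people, split by whose name comes first."""
    first_leads = 0
    second_leads = 0
    for query in queries:
        words = query.lower().split()
        # Only queries that mention both names are relevant.
        if first in words and second in words:
            if words.index(first) < words.index(second):
                first_leads += 1
            else:
                second_leads += 1
    return first_leads, second_leads


# Hypothetical search queries, made up for this example.
queries = [
    "trump clinton debate",
    "clinton trump polls",
    "trump clinton polls",
    "weather tomorrow",
    "trump vs clinton",
]

trump_first, clinton_first = name_order_counts(queries, "trump", "clinton")
print("Trump first:", trump_first, "Clinton first:", clinton_first)
```

Whether a lopsided count like this really reflects voter preference is the author’s hypothesis; the code only does the counting.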

The world of Big Data is huge and growing larger every day. Researchers and scientists are finding new and better ways of analyzing it to tell us more about the most devious creatures on this planet. Us.

But we must be careful of the seductive lure of Big Data, and we should remember the words immortalized by Mark Twain: “There are three kinds of lies: lies, damn lies, and statistics.”

 
 

Responsible journalism is hard work!
It is also expensive!


If you enjoy reading The Town Line and the good news we bring you each week, would you consider a donation to help us continue the work we’re doing?

The Town Line is a 501(c)(3) nonprofit private foundation, and all donations are tax deductible under the Internal Revenue Service code.

To help, please visit our online donation page or mail a check payable to The Town Line, PO Box 89, South China, ME 04358. Your contribution is appreciated!

 