Many of you have heard of the Google Books Project, an effort to digitize as many books as possible in three main categories (public domain and out of copyright, out-of-print but still in copyright, and in-print) in order to make them all Google-able. The project involves both cooperation with public research libraries to scan books on the supply side, and efforts to sell electronic versions of these scanned books through a new Google Bookstore on the demand side. It is an effort not without controversy.
Recently Google has begun to roll out some tools for scholars to mine this incredible database of scanned literature. One of the most intriguing of these is the Google Ngram Viewer. Here’s the description:
When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years. Let’s look at a sample graph:
This shows trends in three ngrams from 1950 to 2000: “nursery school” (a 2-gram or bigram), “kindergarten” (a 1-gram or unigram), and “child care” (another bigram). What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are “nursery school” or “child care”? Of all the unigrams, what percentage of them are “kindergarten”? Here, you can see that use of the phrase “child care” started to rise in the late 1960s, overtaking “nursery school” around 1970 and then “kindergarten” around 1973. It peaked shortly after 1990 and has been falling steadily since.
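To make that y-axis definition concrete, here is a minimal Python sketch of the calculation it describes: for each year, count how often a phrase appears and divide by the total number of n-grams of the same length. This is not Google's code. The function names, the whitespace tokenizer, and the two-sentence toy corpus are all made up for illustration, and the real viewer handles tokenization and other details that are ignored here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_share(texts_by_year, phrase):
    """For each year, return the percentage of all n-grams (with n equal to
    the number of words in the phrase) that match the phrase exactly."""
    target = tuple(phrase.lower().split())
    n = len(target)
    shares = {}
    for year, text in texts_by_year.items():
        # Naive tokenizer (an assumption): lowercase, split on whitespace.
        counts = Counter(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        shares[year] = 100.0 * counts[target] / total if total else 0.0
    return shares


# Hypothetical toy corpus: one invented sentence per year, not real data.
toy_corpus = {
    1960: "the nursery school opened near the kindergarten",
    1990: "child care costs rose and child care centers grew",
}

for phrase in ("nursery school", "kindergarten", "child care"):
    print(phrase, ngram_share(toy_corpus, phrase))
```

Note that a bigram like "child care" is measured against all bigrams, while a unigram like "kindergarten" is measured against all unigrams, which is why the lines on the graph are percentages of different totals.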
I went ahead and made a Google Ngram graph of “journalism, advertising, public relations”:
I'm not sure this chart offers any easy conclusions, but it’s a fascinating starting point for asking more interesting questions. Any other ideas for using this tool?