The number of books in the Google Ngram database varies considerably across years. As the plot below shows, even considering only the last century, the number of words increases roughly tenfold between 1900 and 2000.
In the Google Ngram Viewer (and in the 2011 Science paper that introduced the Culturomics project), word frequency is obtained by dividing the word count by the total number of words for each year (tot-normalization). Others (for example Bentley et al. 2012, Acerbi et al. 2013) preferred to normalize using the yearly count of the word “the” (the-normalization). The rationale (let’s call it the “recent-trash” argument) is that the raw number of words would be affected by the influx of data, special characters, etc. that may have increased in recent books. The word “the”, by contrast, would be a good representative of “real” writing and “real” sentences. If this is correct, we would expect tot-normalized words to be biased towards a decrease in frequency in recent years.
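As a rough illustration of the two normalizations, here is a minimal Python sketch. The function name `normalize` and all the counts are made up for the example (they are not taken from the Ngram database); the only point is that the same raw counts can look flat under one denominator and rising under the other when the share of “the” shrinks.

```python
from math import isclose


def normalize(word_counts, denominators):
    """Divide a word's yearly count by a yearly denominator.

    Both arguments are dicts keyed by year. Passing the total word count
    gives the tot-normalization; passing the count of "the" gives the
    the-normalization.
    """
    return {year: word_counts[year] / denominators[year] for year in word_counts}


# Hypothetical counts, chosen only for illustration.
word_counts = {1900: 50, 2000: 500}
total_counts = {1900: 1_000_000, 2000: 10_000_000}
the_counts = {1900: 60_000, 2000: 550_000}  # "the" is 6% of words, then 5.5%

tot_normalized = normalize(word_counts, total_counts)
the_normalized = normalize(word_counts, the_counts)

for year in sorted(word_counts):
    print(year, tot_normalized[year], the_normalized[year])
```

With these invented numbers the tot-normalized frequency is the same in both years, while the the-normalized frequency is higher in 2000 simply because “the” makes up a smaller share of the total.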
Overall, the total number of words and the number of occurrences of “the” are, not surprisingly, strongly correlated (a visualisation is provided below; each data point represents one year from 1900 to 2000 inclusive), but, again not surprisingly, small differences exist. Most importantly, the differences are small but consistent: for example, in recent years, the count of “the” is consistently lower (in proportion) than the total count of words (as the upper right corner of the plot shows). This is indeed what would be expected according to the recent-trash argument. However, the question is: what is the influence of these differences?
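The two observations above (a strong correlation overall, plus a small but systematic drift) can be checked with a few lines of Python. This is only a sketch: the yearly values are invented, and `pearson` is a helper written for the example rather than the code used for the plot.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical yearly totals and counts of "the" (illustration only).
totals = [1_000_000, 2_000_000, 5_000_000, 10_000_000]
the_counts = [60_000, 118_000, 285_000, 550_000]

r = pearson(totals, the_counts)

# Share of "the" among all words, year by year.
the_share = [the / total for the, total in zip(the_counts, totals)]

print("correlation:", r)
print("share of 'the':", the_share)
```

In this toy data the correlation is very close to 1, yet the share of “the” drops steadily, which is the kind of pattern that a scatter plot alone can hide.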
To try to answer this question, I re-analysed some data I had collected to test the new (July 2012) version of the database. In short, I extracted 100 random words from the Part of Speech database, stemmed them (the results are the same for the unstemmed words) and searched for those words in the Google Ngram database, limiting the search to the years 1900 to 2000. I repeated this operation 100 times (making a total of 10,000 random word searches). I tried both normalizations: the plots below show the same 100 repetitions (averaged) for the tot-normalization (left) and the the-normalization (right).
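The sampling procedure described above can be sketched roughly as follows. This is not the original analysis code: the function names, the toy vocabulary and the counts are all assumptions made for the example, and the stemming step is left out. Changing the `denominators` argument switches between the two normalizations.

```python
import random


def mean_series(series_list):
    """Average several equally long yearly series, year by year."""
    n = len(series_list)
    return [sum(values) / n for values in zip(*series_list)]


def one_repetition(vocabulary, counts, denominators, sample_size, rng):
    """Draw random words, normalize each yearly count, average over words."""
    words = rng.sample(vocabulary, sample_size)
    normalized = [
        [count / denom for count, denom in zip(counts[word], denominators)]
        for word in words
    ]
    return mean_series(normalized)


def averaged_trend(vocabulary, counts, denominators, sample_size, repetitions, seed=0):
    """Repeat the random draw several times and average the repetitions."""
    rng = random.Random(seed)
    runs = [
        one_repetition(vocabulary, counts, denominators, sample_size, rng)
        for _ in range(repetitions)
    ]
    return mean_series(runs)


# Toy data: four hypothetical words, two years.
toy_counts = {"alpha": [10, 30], "beta": [20, 20], "gamma": [5, 15], "delta": [40, 60]}
toy_totals = [1000, 2000]
toy_vocabulary = sorted(toy_counts)

trend = averaged_trend(toy_vocabulary, toy_counts, toy_totals, sample_size=2, repetitions=3)
print(trend)
```

For the setup in the post one would call it with `sample_size=100` and `repetitions=100`, once with the yearly totals and once with the yearly counts of “the” as `denominators`.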
Even on visual inspection, it seems quite clear that the frequencies of the same words tend to decrease in the case of the tot-normalization and to increase in the case of the the-normalization. If we average the repetitions, the effect is more striking (I also z-transformed the data to have the same scale in the two plots, but this does not change the trends).
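For readers unfamiliar with the z-transform mentioned above, here is a minimal sketch. It assumes the population standard deviation (the post does not say which variant was used), and the two input series are invented. The transform puts both series on the same scale without changing the direction of their trends.

```python
from statistics import mean, pstdev


def z_transform(series):
    """Rescale a series to mean 0 and standard deviation 1.

    Assumption: population standard deviation. A constant series would
    have a standard deviation of 0 and cannot be z-transformed.
    """
    centre = mean(series)
    spread = pstdev(series)
    return [(value - centre) / spread for value in series]


# Two hypothetical averaged trends on very different scales.
tot_trend = [5.0e-5, 4.8e-5, 4.5e-5, 4.1e-5]   # decreasing
the_trend = [8.0e-4, 8.3e-4, 8.7e-4, 9.2e-4]   # increasing

z_tot = z_transform(tot_trend)
z_the = z_transform(the_trend)

print(z_tot)
print(z_the)
```

After the transform both series have mean 0 and standard deviation 1, so they can share one axis, while the first still falls and the second still rises.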
If I am not missing something, this confirms the recent-trash argument (words do tend to decrease in frequency when tot-normalized), but it also shows that the opposite problem is present with the the-normalization, that is, word frequency artificially increases in recent years. We have a few ideas to explain why this should be the case, but they need to be tested.
These biases do not represent a problem when comparing trends of words (like “Sherlock Holmes” vs. “Frankenstein”) as long as they are normalized in the same way (obviously). However, if one takes a single word, or, especially, a set of semantically related words (e.g. words associated with emotions, religion, economy, etc.) to analyse their “absolute” trends, the normalization might create unwanted effects. One way to avoid this is to compare them with trends from random words normalized in the same way (as we did in our recent paper, which showed a general decrease in the use of words related to emotions, with the exception of words associated with “fear”).
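The reason why relative comparisons are safe can be made explicit with a small sketch. It uses a simple per-year ratio between a target series and a baseline series; this is just one possible way to compare them, not necessarily the method used in the paper, and all names and counts are hypothetical. Because both series are divided by the same yearly denominator, that denominator cancels out of the ratio.

```python
from math import isclose


def normalize(counts, denominators):
    """Divide yearly counts by yearly denominators."""
    return [count / denom for count, denom in zip(counts, denominators)]


def relative_trend(target, baseline):
    """Per-year ratio of a target series to a baseline series."""
    return [t / b for t, b in zip(target, baseline)]


# Hypothetical yearly counts (illustration only).
target_counts = [120, 900]      # e.g. a set of semantically related words
baseline_counts = [100, 1000]   # e.g. a sample of random words
total_counts = [1_000_000, 10_000_000]
the_counts = [60_000, 550_000]

relative_tot = relative_trend(
    normalize(target_counts, total_counts),
    normalize(baseline_counts, total_counts),
)
relative_the = relative_trend(
    normalize(target_counts, the_counts),
    normalize(baseline_counts, the_counts),
)

print(relative_tot)
print(relative_the)
```

Both calls give the same relative trend (up to floating-point rounding), whichever normalization is used, whereas the “absolute” trend of the target alone would depend on the denominator.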
- Michel et al., 2011, Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 331 (6014)
- Bentley et al., 2012, Word Diffusion and Climate Science, PLoS ONE, 7 (11)
- Acerbi et al., 2013, The Expression of Emotions in 20th Century Books, PLoS ONE, 8 (3)