Normalization biases in Google Ngram

The number of books in the Google Ngram database varies considerably across years. As the plot below shows, even considering only the last century, the number of words increases roughly tenfold between 1900 and 2000.

[Figure: totWords, total number of words per year]

In the Google Ngram Viewer (and in the 2011 Science paper that introduced the Culturomics project), word frequency is obtained by normalizing the word count by the total number of words for each year (tot-normalization). Others (for example Bentley et al. 2012, Acerbi et al. 2013) preferred to normalize using the yearly count of the word “the” (the-normalization). The rationale (let’s call it the “recent-trash” argument) is that the raw number of words would be affected by the influx of data, special characters, etc. that may have increased in recent books. In contrast, the word “the” would be a good representative of “real” writing and “real” sentences. If this is correct, we would expect tot-normalized words to be biased towards a decrease in frequency in recent years.
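As a minimal sketch of the two normalizations, the snippet below divides a word's yearly count by either the yearly total of words or the yearly count of “the”. The function name `normalize` and all the counts are made up for illustration; they are not taken from the Ngram data.

```python
def normalize(word_counts, denominator_counts):
    """Divide a word's yearly count by a yearly denominator.

    Both arguments map year -> count. Passing the yearly total of words
    gives the tot-normalization; passing the yearly count of "the" gives
    the the-normalization.
    """
    return {year: count / denominator_counts[year]
            for year, count in word_counts.items()}


# Made-up counts for two years (illustrative only, not real Ngram data).
word_counts = {1900: 50, 2000: 400}
total_counts = {1900: 1_000_000, 2000: 10_000_000}
the_counts = {1900: 60_000, 2000: 550_000}

tot_freq = normalize(word_counts, total_counts)
the_freq = normalize(word_counts, the_counts)

# Relative change between 1900 and 2000 under each normalization.
tot_change = tot_freq[2000] / tot_freq[1900]
the_change = the_freq[2000] / the_freq[1900]
print(tot_change, the_change)
```

In this toy example the share of “the” drops from 6% to 5.5% of all words, so the same word appears to decline less under the the-normalization than under the tot-normalization.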

Overall, the total number of words and the number of occurrences of “the” are, not surprisingly, strongly correlated (a visualisation is provided below; each data point represents one year from 1900 to 2000 inclusive), but, again not surprisingly, small differences exist. Most importantly, the differences are small but consistent: for example, in recent years, the count of “the” is consistently lower (in proportion) than the total count of words (as the upper right corner of the plot shows). This is indeed what would be expected according to the recent-trash argument. However, the question is: what is the influence of these differences?
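The “in proportion” comparison can be made explicit by computing, for each year, the share of all words that are “the”. The sketch below does this on made-up counts (the function name `the_share` and the numbers are assumptions for illustration); under the recent-trash argument one would expect this share to drift downwards in recent years.

```python
def the_share(the_counts, total_counts):
    """Return, for each year, the fraction of all words that are "the"."""
    return {year: the_counts[year] / total_counts[year]
            for year in sorted(total_counts)}


# Made-up yearly counts (illustrative only, not real Ngram data).
total_counts = {1900: 1_000_000, 1950: 3_000_000, 2000: 10_000_000}
the_counts = {1900: 60_000, 1950: 177_000, 2000: 550_000}

share = the_share(the_counts, total_counts)
for year, value in share.items():
    print(year, round(value, 4))
```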

[Figure: TotVSThe, total word count vs. count of “the”, one point per year from 1900 to 2000]

To try to answer this question, I re-analysed some data I had collected to test the new (July 2012) version of the database. In short, I extracted 100 random words from the Part of Speech database, stemmed them (but the results are the same for the unstemmed words), and searched for those words in the Google Ngram database, limiting the search to the years 1900 to 2000. I repeated this operation 100 times (making a total of 10,000 random word searches). I tried both normalizations: the plots below show the same 100 repetitions (averaged) for the tot-normalization (left) and the the-normalization (right).
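The sampling procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the original analysis code: the function names, the tiny vocabulary, and the frequencies are made up, stemming is left out, and I assume that each repetition draws its words without replacement and is summarized by the mean frequency of its words in each year.

```python
import random


def mean_series(words, freqs, years):
    """Average the yearly frequencies of a group of words."""
    return [sum(freqs[w][y] for w in words) / len(words) for y in years]


def run_repetitions(vocabulary, freqs, years, n_words=100, n_reps=100, seed=0):
    """Draw n_words random words n_reps times; return one averaged series per draw."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    runs = []
    for _ in range(n_reps):
        sample = rng.sample(vocabulary, n_words)  # without replacement within a draw
        runs.append(mean_series(sample, freqs, years))
    return runs


# Tiny made-up data set standing in for the real word list and frequencies.
years = [1900, 1950, 2000]
vocabulary = ["alpha", "beta", "gamma", "delta"]
freqs = {w: {y: (i + 1) * 1e-6 for y in years} for i, w in enumerate(vocabulary)}

runs = run_repetitions(vocabulary, freqs, years, n_words=2, n_reps=3)
```

With real data, `freqs` would hold the normalized yearly frequencies of each word (tot- or the-normalized), and the defaults `n_words=100` and `n_reps=100` would correspond to the 10,000 searches described above.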

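[Figure: allAverages, the 100 repetitions under the tot-normalization (left) and the the-normalization (right)]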

Even on visual inspection, it seems quite clear that the frequencies of the same words tend to decrease under the tot-normalization and to increase under the the-normalization. If we average the repetitions, the effect is more striking (I also z-transformed the data to have the same scale in the two plots, but this does not change the trends).
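Averaging the repetitions and z-transforming the result could look like the sketch below. The function names and the numbers are made up, and the use of the population standard deviation is my assumption; the original analysis may have used the sample version.

```python
from statistics import mean, pstdev


def average_runs(runs):
    """Average several equal-length series position by position."""
    return [mean(values) for values in zip(*runs)]


def z_transform(series):
    """Rescale a series to mean 0 and standard deviation 1."""
    mu = mean(series)
    sigma = pstdev(series)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot z-transform a constant series")
    return [(x - mu) / sigma for x in series]


# Made-up averaged frequencies for three repetitions over three years.
runs = [[3.0e-6, 2.5e-6, 2.0e-6],
        [3.2e-6, 2.4e-6, 2.2e-6],
        [2.8e-6, 2.6e-6, 1.8e-6]]

overall = average_runs(runs)
z_scores = z_transform(overall)
```

Because the z-transform only subtracts a constant and divides by a positive constant, it changes the scale of a series but not the direction of its trend.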

[Figure: zScores, z-transformed average of the repetitions under each normalization]

If I am not missing something, this confirms the recent-trash argument (words do tend to decrease in frequency when tot-normalized), but it also shows that the the-normalization has the opposite problem: word frequency artificially increases in recent years. We have a few ideas to explain why this should be the case, but they need to be tested.

These biases are not a problem when comparing trends of words (like “Sherlock Holmes” VS “Frankenstein”), as long as they are normalized in the same way (obviously). However, if one takes a single word or, especially, a set of semantically related words (e.g. words associated with emotions, religion, economy, etc.) to analyse their “absolute” trends, the normalization might create unwanted effects. One way to avoid this is to compare them with trends from random words normalized in the same way (as we did in our recent paper, showing a general decrease in the use of words related to emotions, with the exception of words associated with “fear”).
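One simple way to make such a comparison, sketched below, is to divide the normalized trend of the words of interest by the normalized trend of random words, year by year. This is only an illustration and not necessarily the procedure used in the paper; the function name and the numbers are made up. When both series use the same denominator, that denominator cancels in the ratio.

```python
def relative_to_baseline(target_freq, baseline_freq):
    """Divide a normalized trend by a random-word baseline, year by year.

    Both arguments map year -> normalized frequency and must have been
    normalized in the same way (both tot- or both the-normalized).
    """
    return {year: target_freq[year] / baseline_freq[year]
            for year in sorted(target_freq)}


# Made-up normalized frequencies (illustrative only).
emotion_words = {1900: 2.0e-5, 1950: 1.6e-5, 2000: 1.0e-5}
random_words = {1900: 1.0e-4, 1950: 0.9e-4, 2000: 0.8e-4}

relative = relative_to_baseline(emotion_words, random_words)
```

A flat ratio would mean that the words of interest simply follow the random-word baseline; in this toy example the ratio falls, so the made-up emotion words decline faster than the baseline.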

The Expression of Emotions in 20th Century Books

We just published a paper in the journal PLoS ONE in which we analysed the usage of words with emotional content in English-language books, using the enormous database provided by Google Books (the version we used contains more than 5 million books).

We found, for example, that there is a general, steady decrease in the usage of words with emotional content throughout the last century, with the interesting exception of words associated with “fear”, which show the opposite trend starting from the 70s. We also found that American and British books are quite different in their trends regarding emotional content, with American books being more ‘emotional’ than British ones. Perhaps surprisingly, this divergence is only observable from the 60s; before that, books in the two variants of English showed pretty much the same trends.

These findings resonate well with the popular narrative, but it is great (at least from my quantitative-scientific-minded-anthropological point of view) that we can support it with data, and that we will be able to use those data to dig further into it. Of course many big questions remain open: for example, we don’t know what caused those changes (but hopefully our results can provide a starting point to study this), and we don’t know what the relationship is between changes in books and broader cultural changes. My hope is that, given the amount of data, and the fact that Google Books is not explicitly biased towards successful or influential books, we may be able to detect genuine long-term cultural changes rather than just ‘literary’ ones.

The paper is open access and can be found here. We had quite a bit of press interest: articles, not surprisingly, varied from enthusiastic to skeptical, and from accurate and scientifically sound to sort-of sensationalist (and I learned a new British word: boffin), but overall I am happy with what happened. Philip Ball wrote a great (and ‘neat’) piece for Nature about our work, and I was briefly interviewed by Adam Rutherford for the BBC Radio 4 science programme Material World.

BAARS Seminar: “Attraction VS social influence in cultural evolution”

On Wednesday 20 March I will give a talk in the Bristol Archaeology and Anthropology Research Seminars series (Department of Archaeology and Anthropology, MA Seminar Room, 43 Woodland Road, Bristol). Below is the abstract:

In this talk I will explore how individual-level biases in the selection of cultural variants affect long-term cultural change. Research in cultural evolution usually focuses on contextual, or social, biases (e.g. copy the majority, copy prestigious individuals, and so on), and relatively less attention is paid to content biases, i.e. intrinsic features of cultural variants that make them more attractive, or more “sticky”. I will show, using mathematical models and computer simulations, that various combinations of biases produce different long-term cultural dynamics, and that we may be able, by comparing model predictions with empirical data, to recognise the role of attraction and social influence in cultural evolution.