Abstract submission is now open, with a deadline of 31 December 2013.
See you in Bristol!
In the last few days, for independent reasons, (i) I was told the Horse_ebooks story (in short, an “artistic” project in which humans pretended to be a Twitterbot and gained around 200K followers – if you don’t know anything about it, please read the Wikipedia page and the links cited in its References, it is quite interesting), (ii) I stumbled upon this page with a few examples of Twitterbots worth following (at least according to digitaltrends.com), and, finally, (iii) I was pointed to this NYTimes article (from August 2013) on social bots (claiming, among other things, that only 35% of Twitter users are humans). This seemed enough for me to try and see how difficult it was to set up a Twitterbot.
A Twitterbot is a program that produces automated posts via Twitter (surprise!). In my case, @CultEvoBot is a short Python script that, every hour – when my laptop is on – uses Google News search or Google Blogs search (after flipping a coin to decide) and searches there for “cultural evolution”. It then goes through the links returned and, if one is not in its log file of past links, posts it on Twitter with the title provided by Google (and adds it to the log file). That’s all (it also follows its followers, which is completely useless at the moment – among other things because I am the only follower – but might be useful in the future).
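For the curious, the core logic is roughly the following. This is a minimal sketch, not the actual script: `search_for_links` and `post_tweet` are hypothetical placeholders for the Google search and the Twitter API call, and the log file name is made up.

```python
import random

LOG_FILE = "posted_links.log"  # hypothetical file name

def search_for_links(source, query):
    """Placeholder for the Google News/Blogs search: returns [(title, link), ...]."""
    return []

def post_tweet(text):
    """Placeholder for the Twitter API call."""
    print("Would tweet:", text)

def load_log():
    """Return the set of links already posted."""
    try:
        with open(LOG_FILE) as f:
            return {line.strip() for line in f}
    except FileNotFoundError:
        return set()

def run_once():
    source = random.choice(["news", "blogs"])  # flip a coin to pick the source
    posted = load_log()
    for title, link in search_for_links(source, query="cultural evolution"):
        if link not in posted:                 # skip links already tweeted
            post_tweet(title + " " + link)
            with open(LOG_FILE, "a") as f:     # remember it for next time
                f.write(link + "\n")

run_once()  # the real script runs this once an hour
```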
So, basically, @CultEvoBot does not do much more than provide links to potentially interesting sources; still, I am pretty satisfied with the result. Programming a Twitterbot – even with more elaborate functions (such as replying to specific users or posts, retweeting, etc.) – seems quite straightforward, and I can imagine that I will be able to use bots in the future for scientific (or artsy) projects, even though at the moment I don’t have any specific idea (suggestions welcome).
p.s.: to be honest, @CultEvoBot was preceded shortly by @Lonley_Giorgio, which I am mainly using as a sandbox. @Lonley_Giorgio mixes random pieces of sentences from Britney Spears’s songs (translated into Italian) with random pieces of sentences from the prolific Italian continental philosopher Giorgio Agamben. “He” also replies directly (with a random Giorgio Agamben quote) to messages including the word “solitudine” (Italian for “loneliness”, as in Britney Spears’s “My loneliness”). Weirdly enough, “he” has received quite a lot of replies and, perhaps not surprisingly, they come mainly from pseudo-depressed teenagers and religiously oriented people.
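The mixing itself is trivial – something like the following toy sketch, with made-up fragment lists (my rough translations) standing in for the real corpora:

```python
import random

# Hypothetical fragment lists; the real bot draws on actual lyrics and book excerpts.
britney = ["la mia solitudine mi sta uccidendo", "colpiscimi ancora, piccola"]
agamben = ["la nuda vita", "lo stato di eccezione è divenuto la regola"]

def compose_tweet():
    """Glue a random fragment from each corpus into one pseudo-sentence."""
    first, second = random.choice(britney), random.choice(agamben)
    return (first + ", " + second)[:140]  # old-style Twitter character limit

print(compose_tweet())
```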
And, of course, please follow them both! (or at least @CultEvoBot …)
Just to give an idea of the analysis mentioned in the previous post, the plot below shows the trend of a rough measure of the “happiness” of the books in the Google Books database. For WordNet-Affect (WNA) this is obtained, simplifying a little, by subtracting the cumulative score of the category “Sadness” from that of “Joy”, while for Linguistic Inquiry and Word Count (LIWC) the two (equivalent) categories are called “Positive emotions” and, again, “Sadness”. Values above zero indicate generally ‘happy’ periods, and values below zero indicate generally ‘sad’ periods.
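In essence, for each year one sums the occurrences of the words in each list and takes the (normalised) difference – a minimal sketch, with tiny made-up lists in place of the real WNA/LIWC categories:

```python
# Tiny made-up lists standing in for the real WNA/LIWC categories.
JOY = {"joy", "happy", "delight"}
SADNESS = {"sad", "grief", "sorrow"}

def happiness_score(word_counts, total_words):
    """Return (joy - sadness) as a fraction of all words in one year of the corpus.

    word_counts: dict mapping word -> count for that year.
    """
    joy = sum(word_counts.get(w, 0) for w in JOY)
    sad = sum(word_counts.get(w, 0) for w in SADNESS)
    return (joy - sad) / total_words

# Example: a made-up year with 1000 words in total.
print(happiness_score({"happy": 3, "grief": 5}, 1000))  # -> -0.002
```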
This result is interesting to me not so much because we can discover something new about the last century (even though I wonder why the 80s seem to be so sad), but because if (i) two independent ways to score the emotional content of texts, (ii) through a quite rough analysis of (iii) an enormous database of books, give highly correlated trends, this means that there is a meaningful “signal” that we can extract (which cannot be taken for granted).
We also performed an analogous analysis using a tool called “Hedonometer” (HED – see the plot below). In this case the results are quite different, even though some similarities are present, e.g. the positive peak in the 20s, the negative peak corresponding to the Second World War, and the post-80s increase in happiness. The reason is probably that LIWC and WNA are conceptually quite different from HED. LIWC and WNA are basically “lists” of words related to specific emotions (so, for example, the first five words – alphabetically – in LIWC’s “Sadness” category are: abandon*, ache*, aching, agoniz*, agony), while HED uses a list of generic words not directly related to emotional states, but rated by human subjects as particularly happy or sad. So, for example, HED scores the presence in texts of words such as “terrorism” or “Christmas”.
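In practice the two approaches compute different things: the list-based indices count category members, while a hedonometer-style score is, roughly, a frequency-weighted average of per-word happiness ratings. A toy sketch, with made-up ratings standing in for the real human-subject scores:

```python
# Made-up per-word happiness ratings on a 1-9 scale (the real HED ratings
# come from human subjects scoring thousands of common words).
RATINGS = {"christmas": 7.9, "terrorism": 1.6, "the": 5.0}

def hedonometer_score(word_counts):
    """Frequency-weighted average happiness of the rated words in a text."""
    total = sum(c for w, c in word_counts.items() if w in RATINGS)
    if total == 0:
        return None
    return sum(RATINGS[w] * c for w, c in word_counts.items() if w in RATINGS) / total

print(hedonometer_score({"christmas": 2, "terrorism": 1, "unrated": 10}))  # ~5.8
```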
One interesting thing to notice regarding HED is that it is the only index that “tracks” the effect of the First World War. Also, comparing the absolute values of our results (the right y-axis in the plot above) with the values obtained for contemporary Twitter messages (see here), it seems that, in general, books tend to be slightly more “sad” than tweets.
If you are interested in more details, and in the other analyses, the preprint of our contribution can be found here.
Today I’ll give a short talk at the Big Humanities Workshop, held in conjunction with the 2013 IEEE International Conference on Big Data, on our research on the emotional content of English-language books.
The next step has been to perform additional analyses to check the robustness of these results. In detail, we re-ran the same analysis with the latest (2012) version of the Google Books corpus (which contains approximately 3 million more books than the one we used originally); we compared the results of different, independent ways to score the emotional content of the texts (originally we used WordNet-Affect, which we now compare with Linguistic Inquiry and Word Count and the “Hedonometer”); we ran more detailed statistical analyses (to check the effect of high-frequency mood-words that might determine on their own the trends for specific emotions, obscuring the role of the numerous low-frequency terms); and, finally, we compared our original results with trends obtained by considering only terms tagged as adjectives or adverbs, which are considered reliable indicators of emotional content (part-of-speech information was not present in the first version of the Google corpus).
Overall, we were happy to see that the original results turned out to be quite robust (especially results #2 and #3). The next step would now be to understand what they mean – to me, the decrease in emotional content is especially interesting – assuming that they do not derive from some idiosyncrasy of the Google database. Apparently the official proceedings of the IEEE Big Data conference are not around yet, but here you can find a preprint of our contribution (thanks to Bill, co-author together with Alex Bentley).
Unfortunately I will not be physically in some room in Santa Clara, California, to present my talk. It would have been very interesting for me to get to know more of the “Digital Humanities” world (to me, books are just one kind of artefact useful for studying more general cultural dynamics, and it happens that they are convenient to quantify and have temporal depth – some speak, in this regard, of long data); hopefully there will be other occasions. Also, my remote talk will end up being after 11 pm Bristol time, and after Puccini’s La Bohème, so if you, reader, are at the workshop, I apologise in advance…
The dynamics of dog breed popularity have recently been used to test various assumptions of models of cultural evolution. Bentley et al. (Random drift and large shifts in popularity of dog breeds) found that the cumulative distribution of breed popularity (i.e. how many dogs of each breed were registered overall, in a period covering the years 1946 to 2001) roughly follows a power law, meaning that very few breeds account for the great majority of the registrations, and the great majority of breeds account for proportionally very few registrations. Power laws are ubiquitous in natural (the distribution of earthquake magnitudes) and socio-cultural (the distribution of wealth, the frequency of words in books, etc.) phenomena, and Bentley et al. showed that, for socio-cultural phenomena, these distributions can be produced by a simple “neutral” model of cultural evolution, which assumes that individuals just copy cultural traits (dogs, in this specific case) randomly from each other.
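A minimal version of such a neutral model is easy to write down. In the following sketch (all parameter values are arbitrary), each individual either copies the trait of a random individual from the previous generation or, with a small probability, invents a new one; the cumulative trait counts typically end up long-tailed, with a few very popular traits and many rare ones:

```python
import random
from collections import Counter

N = 1000      # population size (arbitrary)
MU = 0.01     # innovation rate (arbitrary)
STEPS = 2000  # generations (arbitrary)

population = list(range(N))  # everyone starts with a unique trait
next_trait = N
cumulative = Counter(population)

for _ in range(STEPS):
    new_population = []
    for _ in range(N):
        if random.random() < MU:
            trait = next_trait                # invent a brand-new trait
            next_trait += 1
        else:
            trait = random.choice(population)  # copy a random individual
        new_population.append(trait)
    population = new_population
    cumulative.update(population)

# A handful of traits account for most copies, most traits for very few.
print(cumulative.most_common(5))
```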
Subsequently, together with Stefano Ghirlanda and Magnus Enquist (The logic of fashion cycles), we used the same data, focusing on another feature of dog breed popularity: the correlation between the speed of a breed’s increase in popularity and the speed of its decrease. Dogs that become popular quickly also tend to become unpopular quickly, and vice versa (see the Rottweiler example below; many others here). To explain this feature – also found for baby names – we proposed a slightly more complicated model of cultural evolution, in which individuals may copy from each other not only cultural traits (the dogs) but also preferences for cultural traits (“I love Dalmatians!”). One of the properties of this model is that individuals with low preferences for popular cultural traits tend to be better “influencers” (this is quite counterintuitive and, I think, interesting – you can have a look here), so that, in the model, when a cultural trait becomes popular quickly, preferences for this cultural trait also become negative quickly, generating the correlation we found in the data.
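The core twist – copying preferences as well as traits – can be sketched in a few lines. To be clear, this is a loose toy illustration of the idea, not the published model:

```python
import random

N = 200        # agents (arbitrary)
N_TRAITS = 10  # cultural traits (arbitrary)
STEPS = 5000

# Each agent has a current trait and a preference (0-1) for every trait.
traits = [random.randrange(N_TRAITS) for _ in range(N)]
prefs = [[random.random() for _ in range(N_TRAITS)] for _ in range(N)]

for _ in range(STEPS):
    observer, demonstrator = random.sample(range(N), 2)
    d_trait = traits[demonstrator]
    # Sometimes copy the demonstrator's preference for their own trait...
    if random.random() < 0.5:
        prefs[observer][d_trait] = prefs[demonstrator][d_trait]
    # ...and adopt the trait with probability given by one's own preference.
    if random.random() < prefs[observer][d_trait]:
        traits[observer] = d_trait
```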
Both models, however, assume that what drives the popularity of dog breeds is social influence (“fashion”). This is paradoxical: given that people presumably ponder their choice when deciding to have a pet, one would expect features of the breeds (“function”) to be more important in this decision. We have just published a paper in which we take on this question. We used data on the longevity, health, and behavioural characteristics of breeds (such as aggressiveness, trainability, attachment, etc.) and correlated them with various popularity measures (speed of increase and decrease, total popularity, and volatility) to see which features influenced popularity the most. The answer? None!
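The analysis itself boils down to a table of pairwise correlations between breed features and popularity measures – along the lines of the following sketch, where the data and column names are made up:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical toy data; the real values come from breed surveys and
# kennel-club registration records.
df = pd.DataFrame({
    "longevity":        [12.0, 10.5, 9.0, 13.5, 11.0],
    "trainability":     [0.8, 0.5, 0.6, 0.9, 0.4],
    "total_popularity": [50000, 120000, 8000, 30000, 260000],
    "increase_rate":    [0.02, 0.15, 0.01, 0.05, 0.20],
})

for feature in ["longevity", "trainability"]:
    for measure in ["total_popularity", "increase_rate"]:
        rho, pval = spearmanr(df[feature], df[measure])
        print(f"{feature} vs {measure}: rho={rho:.2f}, p={pval:.3f}")
```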
Apart from taking a clear side in the debate on whether or not to publish negative results, I think there are some interesting conclusions to draw from our analysis. Either social influence can indeed be, at least in some domains, a “blind” force that almost autonomously generates cultural dynamics, or – as I now prefer to think – it is tricky to recognise, from population-level data, biases acting at the individual level. This seems to me a quite interesting problem for people studying cultural evolution.
Ghirlanda S, Acerbi A, Herzog H, Serpell JA (2013) Fashion vs. Function in Cultural Evolution: The Case of Dog Breed Popularity. PLoS ONE 8(9): e74770.
I admit this is due to my current situation (i.e. I have a couple of papers I don’t know where to submit), but I wonder whether it may be of more general interest.
A quite coherent field – the scientific and (with nuances) evolutionary study of culture – has grown in recent years. I am referring to works à la Boyd and Richerson – if you are reading this you know what I mean – or to the use of phylogenetic methods for studying culture and language, but also to works inspired by “scientifically oriented” cognitive anthropologists such as Dan Sperber, Pascal Boyer, etc., or, more generally, to the use of experiments and mathematical or computer models to study cultural dynamics. Finally, and this is an even more recent development, the availability of, and ease of access to, vast amounts of data – either because they are only now being produced (e.g. Twitter) or because they have become accessible thanks to digitalisation (e.g. Google Ngram) – has opened new perspectives, with works from statisticians, or people specialised, for example, in machine learning or network theory, being very relevant for the study of cultural evolution.
If one wants to publish in this field, the first choice is, of course, the Triad of BIG interdisciplinary journals (Nature/Science/PNAS), but your work needs to be very good and of general interest – and sometimes this is, legitimately, not the case – or you need to be very lucky (this is not a rant; I have so far been able to publish once in the Triad). Another possibility is “classic” anthropological journals (e.g. Current Anthropology, American Anthropologist, etc.), but they are not especially receptive to quantitative/modelling works. Then, certainly, there are other journals where one can try to “fit” a manuscript. Proceedings of the Royal Society B and Evolution and Human Behavior publish works in this field (but sometimes – legitimately again – these works are not “biological” or “evolutionary” enough); Theoretical Population Biology and Journal of Theoretical Biology do as well (but with a tendency towards heavy – especially mathematical – modelling); and interesting works appear in Nature Communications or in the new EPJ Data Science. Psychology journals can at times fit, and even marketing-oriented publications (cognitive anthropologists have a journal, though: Journal of Cognition and Culture). Finally, without fail, PLoS ONE is there for that, and, I have to say, my experiences there have been very positive (and yes, I had very careful reviews). The impression, however, is of great fragmentation. Is there a need for a “Journal of Cultural Evolution”?
While it would probably make my scientific life easier, I also want to point out two possible negative sides. First, people working in this field tend to proudly claim their interdisciplinary approach, so a dedicated journal might look like an attempt to tame their (our) efforts and to make cultural evolution look like any other academic discipline. Second, academic publishing is now in such great turmoil that the very idea of a traditional-style journal looks almost reactionary. But, yes, it would make my scientific life easier.
9 June 2013
A very quick update after some Twitter feedback:
The number of books in the Google Ngram database varies considerably through the years. As the plot below shows, even considering only the last century, the number of words increases roughly tenfold between 1900 and 2000.
In the Google Ngram Viewer (and in the 2011 Science paper that introduced the Culturomics project) word frequency is obtained by normalizing the word count by the total number of words for each year (tot-normalization). Others (for example Bentley et al. 2012, Acerbi et al. 2013) preferred to normalize using the yearly count of the word “the” (the-normalization). The rationale (let’s call it the “recent-trash” argument) is that the raw number of words would be affected by the influx of data, special characters, etc. that may have increased in recent books. By contrast, the word “the” would be a good proxy for “real” writing and “real” sentences. If this is correct, we would expect tot-normalized words to be biased towards a decrease in frequency in recent years.
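In code, the two normalizations differ only in the denominator – a minimal sketch, assuming you have per-year counts for the word of interest, the yearly totals, and the yearly counts of “the”:

```python
def tot_normalize(word_counts, total_counts):
    """Frequency as a fraction of all words in each year."""
    return {year: word_counts[year] / total_counts[year] for year in word_counts}

def the_normalize(word_counts, the_counts):
    """Frequency relative to the yearly count of the word 'the'."""
    return {year: word_counts[year] / the_counts[year] for year in word_counts}

# Made-up example for a single year:
print(tot_normalize({1950: 120}, {1950: 1_000_000}))  # {1950: 0.00012}
print(the_normalize({1950: 120}, {1950: 50_000}))     # {1950: 0.0024}
```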
Overall, the total number of words and the number of “the” are, not surprisingly, strongly correlated (a visualisation is provided below – each data point represents one year from 1900 to 2000 inclusive), but, again not surprisingly, small differences exist. Most importantly, the differences are small but consistent: for example, in recent years, the count of “the” is consistently lower (in proportion) than the total count of words (as the upper right corner of the plot shows). This is indeed what would be expected according to the recent-trash argument. However, the question is: what is the influence of these differences?
To try to answer this question, I re-analysed some data I had collected to test the new (July 2012) version of the database. In short, I extracted 100 random words from the part-of-speech database, stemmed them (the results are the same for the non-stemmed words), and searched for those words in the Google Ngram database, limiting the search to the years 1900 to 2000. I repeated this operation 100 times (making a total of 10,000 random word searches). I tried both normalizations: the plots below show the same 100 repetitions (averaged) for the tot-normalization (left) and the the-normalization (right).
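The sampling procedure looks roughly like this – a sketch, where `ngram_frequencies` is a hypothetical stand-in for the Google Ngram lookup and `word_pool` for the part-of-speech word list:

```python
import random
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def ngram_frequencies(word):
    """Placeholder for the Google Ngram lookup: returns {year: frequency}."""
    return {}

def one_repetition(word_pool, n_words=100):
    """Sample n_words random words, stem them, and average their yearly frequencies."""
    words = [stemmer.stem(w) for w in random.sample(word_pool, n_words)]
    yearly = {year: 0.0 for year in range(1900, 2001)}
    for w in words:
        for year, freq in ngram_frequencies(w).items():
            yearly[year] += freq
    return {year: total / n_words for year, total in yearly.items()}

# The full analysis repeats this 100 times (10,000 word searches in total)
# and averages the repetitions.
```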
Even on visual inspection (you can click on the image for a larger version), it seems quite clear that the frequencies of the same words tend to decrease in the case of the tot-normalization and to increase in the case of the the-normalization. If we average the repetitions, the effect is more striking (I also z-transformed the data to have the same scale in the two plots, but this does not change the trends).
If I am not missing something, this confirms the recent-trash argument (words do tend to decrease in frequency when tot-normalised), but it also shows that the-normalization has the opposite problem: word frequencies artificially increase in recent years. We have a few ideas about why this should be the case, but they need to be tested.
These biases do not represent a problem when comparing the trends of words (like “Sherlock Holmes” vs. “Frankenstein”) as long as they are normalized in the same way (obviously). However, if one takes a single word or, especially, a set of semantically related words (e.g. words associated with emotions, religion, economy, etc.) to analyse their “absolute” trends, the normalization might create unwanted effects. One way to avoid this is to compare them with trends from random words normalized in the same way (as we did in our recent paper, showing a general decrease in the use of words related to emotions, with the exception of words associated with “fear”).
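Concretely, the comparison amounts to z-scoring the trend of the target word set against a random-word baseline computed with the same normalization – a sketch, assuming both trends are stored as {year: frequency} dicts:

```python
from statistics import mean, stdev

def z_scores(series):
    """Standardise a {year: value} series to zero mean and unit variance."""
    values = list(series.values())
    mu, sigma = mean(values), stdev(values)
    return {year: (v - mu) / sigma for year, v in series.items()}

def relative_trend(target, baseline):
    """Target trend minus random-word baseline, both z-scored first.

    Values above zero mean the word set is over-represented relative to
    random words normalized the same way.
    """
    zt, zb = z_scores(target), z_scores(baseline)
    return {year: zt[year] - zb[year] for year in zt}
```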