Movie scripts dataset

I uploaded a dataset to the Open Science Framework (here). From the description there:

This dataset contains 1,093 movie scripts collected from the website, each in a separate text file. The file imsdb_sample.txt contains the titles of all movies (corresponding file names are in the form Script_TITLE.txt).

The website was crawled in January 2017. Some scripts are not present, either because they were missing or because they were uploaded as PDF files. Please note that (i) the original scripts were uploaded to the website by individual users, so they might not correspond exactly to the movie scripts and typos may be present; (ii) HTML formatting was not consistent across the website, and so neither is the formatting of the resulting text files.

Even considering (i) and (ii), the quality seems good on average and the dataset can be easily used for text-mining tasks.
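As an illustration of the file layout described above, here is a minimal sketch of how the scripts could be loaded in Python. It assumes that imsdb_sample.txt lists one title per line and that the files are readable as UTF-8; neither is stated in the description, so both may need adjusting. The function name load_scripts and its parameters are my own choices for the example.

```python
from pathlib import Path


def load_scripts(data_dir, index_name="imsdb_sample.txt"):
    """Return {title: script text} for every listed title whose file exists.

    Assumptions (not confirmed by the dataset description):
    - the index file holds one title per line;
    - files decode as UTF-8 (undecodable bytes are replaced, not fatal).
    """
    data_dir = Path(data_dir)
    index_text = (data_dir / index_name).read_text(
        encoding="utf-8", errors="replace"
    )
    # Drop blank lines and surrounding whitespace from the title list.
    titles = [line.strip() for line in index_text.splitlines() if line.strip()]

    scripts = {}
    for title in titles:
        # File names follow the documented pattern Script_TITLE.txt.
        path = data_dir / f"Script_{title}.txt"
        if path.is_file():  # some scripts are missing from the dataset
            scripts[title] = path.read_text(encoding="utf-8", errors="replace")
    return scripts
```

Calling load_scripts on the folder where the dataset was unpacked returns a dictionary keyed by title. Titles without a matching file are skipped rather than raising an error.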

My initial intention was to use this material to check whether movies would show the same decline in emotional content that we found in English literature (see this post). However, the great majority of the scripts are recent, almost all from the 1980s onwards, so a meaningful comparison would not have been possible.

That said, I decided to make the dataset public, as it is ready to use for any text-mining task. By finding appropriate metadata (for example ratings, earnings, gender of the actors, or anything else one can think of), it is possible to check how various textual features of the scripts relate to them. If you have any interesting hypothesis, try it there, or let me know!
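To give an idea of what such a check might look like, here is a rough sketch that relates one textual feature to one piece of metadata. The feature (type-token ratio, a crude measure of lexical diversity) is just an example, and the metadata dictionary is hypothetical: the dataset ships no metadata, so it would have to be collected separately. All function names are illustrative.

```python
import math
import re


def type_token_ratio(text):
    """Distinct words divided by total words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def feature_vs_metadata(scripts, metadata, feature=type_token_ratio):
    """Correlate a text feature with a numeric metadata value.

    scripts:  {title: script text}
    metadata: {title: number}, e.g. a rating (hypothetical, user-supplied)
    Only titles present in both dictionaries are used.
    """
    titles = sorted(t for t in scripts if t in metadata)
    xs = [feature(scripts[t]) for t in titles]
    ys = [metadata[t] for t in titles]
    return pearson(xs, ys)
```

Any other feature can be plugged in through the feature argument, as long as it maps a script's text to a number. A single correlation like this is only a starting point, since the uneven formatting of the files can affect word counts.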

