Building a tweet corpus

I wanted to play around with some tweets, but I quickly discovered that getting hold of a corpus is not that easy because of Twitter's terms of service. It is up to everyone to create their own corpus.

Luckily, Twitter has an API to sample tweets randomly. I built a small application on top of it, which you can use by following these steps:

  1. Register a Twitter application at https://dev.twitter.com/terms/api-terms. The application name is not important; you only need its credentials.

  2. Download the latest version of twitter-sampler.

  3. Download credentials.clj and fill in the blanks with the credentials of your application.

  4. Run the following command:

java -jar twitter-sampler-1.0.0-SNAPSHOT-standalone.jar -c credentials.clj -n 1000 tweets.json

where -c credentials.clj is the file containing your credentials, -n 1000 is the number of tweets you want to download, and tweets.json is the file where the tweets should be saved.

You should now have a corpus of tweets to play with.
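
To start playing with the corpus, here is a minimal Python sketch for loading it. It rests on assumptions that are not confirmed above: I don't state here how tweets.json is laid out, so the loader accepts either a single JSON array or one JSON object per line, and it assumes each tweet carries a "text" field as in Twitter's API payloads. The function names load_tweets and tweet_texts are made up for this illustration.

```python
import json
from pathlib import Path


def load_tweets(path):
    """Load tweets from a JSON file.

    Assumption: the file holds either one JSON array of tweets
    or one JSON object per line. Adjust if your file differs.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to line-delimited JSON: one tweet per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single top-level object is treated as a one-tweet corpus.
    return data if isinstance(data, list) else [data]


def tweet_texts(tweets):
    """Return the text of every tweet that has one.

    Assumption: tweets are dicts with a "text" field.
    """
    return [t["text"] for t in tweets if isinstance(t, dict) and "text" in t]
```

Calling tweet_texts(load_tweets("tweets.json")) should then give you a plain list of strings to work with, provided the file matches one of the two layouts assumed above.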
