Building a gazetteer of music bands using Wikidata

A few weeks ago I started a hobby project for fun and learning. My task this week is to compile a gazetteer of music bands and artists. I had wanted to play with Wikidata for a long time, and this was the perfect opportunity.

Getting familiar with Wikidata

Wikidata is the Wikipedia of data. Contributors, either robots or humans, update a database of facts. The best way to grasp how Wikipedia and Wikidata compare is to look at a concrete example, the entries for the band Arcade Fire: Wikidata, Wikipedia.

The first thing we notice is that Wikipedia is written in prose and is targeted at our fellow humans. Wikidata, on the other hand, is much more structured and doesn’t bother with well-formed sentences. This makes it a perfect source of knowledge for machines.

To convince ourselves, let’s find the place of origin of Arcade Fire from both sources. On Wikipedia, we first enter Arcade Fire in the search box, pick the right article and then read the text. We get Montreal, our answer, from this paragraph:

Win Butler and Josh Deu founded Arcade Fire in Montreal around 2001, having first met at Phillips Exeter Academy as high school students.

On Wikidata, we can ask directly for the place of origin of Arcade Fire using the following query:

SELECT ?origin ?originLabel
{
  wd:Q58608 wdt:P740 ?origin.
  ?origin rdfs:label ?originLabel.
  FILTER (LANG(?originLabel) = 'en').
}

And get the following result:

origin originLabel
http://www.wikidata.org/entity/Q340 Montreal

Let’s unroll what just happened. The query is written in SPARQL, a query language that reads like a cross between Prolog and SQL. The symbols beginning with a question mark (?origin and ?originLabel) are variables that we would like Wikidata to fill in for us. The first line inside the curly braces asks Wikidata to bind the ?origin variable to the place of origin (wdt:P740) of the band Arcade Fire (wd:Q58608). The returned value (Q340) is the id for Montreal. The two remaining lines ask for the label of Montreal in English.

We could also ask for the place of origin to be labeled in French:

SELECT ?origin ?originLabel
{
  wd:Q58608 wdt:P740 ?origin.
  ?origin rdfs:label ?originLabel.
  FILTER (LANG(?originLabel) = 'fr').
}

and get the following result:

origin originLabel
http://www.wikidata.org/entity/Q340 Montréal

One fair question is why we use these opaque ids, prefixed with wd: for entities and wdt: for properties, instead of plain names. The simple answer is that ids are language-agnostic and unambiguous. Q340 identifies the concept Montreal and can only refer to the city in Canada, never to Montreal in Wisconsin.
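
Because the id stays the same and only the label language changes, the two queries above differ by a single token. As a minimal sketch (the helper name `origin_query` and its input checks are illustrative assumptions, not part of Wikidata or of any library), the query text could be generated for any entity and language like this:

```python
import re

# Same query as in the text, with the entity id and language left as slots.
# Doubled braces are literal braces in str.format templates.
ORIGIN_QUERY_TEMPLATE = """\
SELECT ?origin ?originLabel
{{
  wd:{entity} wdt:P740 ?origin.
  ?origin rdfs:label ?originLabel.
  FILTER (LANG(?originLabel) = '{lang}').
}}
"""


def origin_query(entity_id, lang):
    """Return SPARQL text asking for the place of origin (P740) of an entity.

    Hypothetical helper: the checks below only guard against malformed
    input being pasted into the query text.
    """
    if not re.fullmatch(r"Q\d+", entity_id):
        raise ValueError("expected a Wikidata entity id such as Q58608")
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", lang):
        raise ValueError("expected a language code such as en or fr")
    return ORIGIN_QUERY_TEMPLATE.format(entity=entity_id, lang=lang)
```

Calling `origin_query("Q58608", "fr")` returns the text of the French query shown above; swapping in `"en"` gives the English one.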

Using a SPARQL query may look overcomplicated, but it allows us to do things that would be difficult if we were only using Wikipedia. For example, if we wanted to get five other music bands from Montreal, we could run the following query:

SELECT ?band ?bandLabel
{
  ?band wdt:P740 wd:Q340.
  ?band wdt:P31 wd:Q215380.
  ?band rdfs:label ?bandLabel.
  FILTER (LANG(?bandLabel) = 'en').
}
LIMIT 5

And obtain results like these:

band bandLabel
http://www.wikidata.org/entity/Q368132 Blessed by a Broken Heart
http://www.wikidata.org/entity/Q485825 Simple Plan
http://www.wikidata.org/entity/Q499847 Islands
http://www.wikidata.org/entity/Q614949 The Luyas
http://www.wikidata.org/entity/Q630797 The Stills

For this query, we used one of Wikidata’s most useful properties, instance of (P31), to find other bands (Q215380) from Montreal. If you would like to learn more about SPARQL, you can start with this tutorial and work your way through more complex queries.

Searching for music bands on Wikidata

Our task is to build an extensive list of music band names from Wikidata. I am no Wikidata taxonomist, so we will have to look around to learn how to build the right query. Let’s start our investigation by looking at the Wikidata entry for Arcade Fire.

Arcade Fire is an instance of band (with id Q215380), which is a subclass of musical ensemble (with id Q2088357), which is defined as group of people who perform instrumental and/or vocal music, with the ensemble typically known by a distinct name.

Let’s investigate other examples:

  • Alt-J is also an instance of band.

  • Jean Leloup, a songwriter from Quebec, is not an instance of band but an instance of human. This is not very useful. If we look further down the page, we see that his occupation is singer.

  • Céline Dion is also entered as a singer.

  • André Gagnon, a famous pianist, is entered as a pianist, an occupation whose field is music. Looking back at singer, the same is true.

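If we generalize from these examples, we are searching for musical ensembles or humans whose occupation is in the field of music. Let’s translate this into two different queries. One for musical ensembles, where the property path wdt:P31/wdt:P279* matches anything that is an instance of (P31) musical ensemble or of one of its subclasses (P279, followed zero or more times):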

SELECT DISTINCT ?band ?bandLabel
WHERE
{
  ?band wdt:P31/wdt:P279* wd:Q2088357.
  ?band rdfs:label ?bandLabel.
  FILTER (LANG(?bandLabel) = 'en')
}
LIMIT 5

band bandLabel
http://www.wikidata.org/entity/Q396 U2
http://www.wikidata.org/entity/Q371 !!!
http://www.wikidata.org/entity/Q689 Bastille
http://www.wikidata.org/entity/Q50598 Infinite
http://www.wikidata.org/entity/Q18788 Epik High
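
To make the `wdt:P31/wdt:P279*` path in this query concrete, here is a toy sketch in Python. It is not how the query service evaluates paths; it only mimics the logic on the two facts mentioned earlier (Arcade Fire is an instance of band, and band is a subclass of musical ensemble). The dictionaries and function names are illustrative assumptions.

```python
# Toy facts taken from the text; the real query runs against all of Wikidata.
INSTANCE_OF = {"Q58608": {"Q215380"}}    # Arcade Fire -> band (P31)
SUBCLASS_OF = {"Q215380": {"Q2088357"}}  # band -> musical ensemble (P279)


def reachable_classes(class_id):
    """Return class_id plus every class reachable through subclass of (P279*)."""
    seen, stack = set(), [class_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def matches_path(entity_id, target_class):
    """True if entity_id P31/P279* target_class holds in the toy data."""
    return any(
        target_class in reachable_classes(direct_class)
        for direct_class in INSTANCE_OF.get(entity_id, ())
    )
```

With these facts, `matches_path("Q58608", "Q2088357")` is true because the path takes one instance-of step followed by one subclass-of step, and `matches_path("Q58608", "Q215380")` is also true because the star allows zero subclass-of steps.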

And one for humans (Q5) whose occupation (P106) is musician (Q639669) or one of its subclasses (P279):

SELECT DISTINCT ?musician ?musicianLabel
WHERE
{
  ?musician wdt:P31 wd:Q5;
            wdt:P106/wdt:P279* wd:Q639669.

  ?musician rdfs:label ?musicianLabel.
  FILTER (LANG(?musicianLabel) = 'en')
}
LIMIT 5

musician musicianLabel
http://www.wikidata.org/entity/Q254 Wolfgang Amadeus Mozart
http://www.wikidata.org/entity/Q255 Ludwig van Beethoven
http://www.wikidata.org/entity/Q180861 Roger Waters
http://www.wikidata.org/entity/Q122538 Laurentius Laurentii
http://www.wikidata.org/entity/Q107164 Atlas Crusius

Without the LIMIT clause, these queries return around 77,000 bands and 238,000 musicians.

Downloading our gazetteers

Now that we know which queries we want to build our gazetteers from, we are ready to download the results. The easiest way I found to achieve this is to save a query in a file (file-with-query.sparql here) and use the following curl command:

curl --data-urlencode query@file-with-query.sparql \
  --header "Accept: text/csv" \
  --output dataset.csv \
  https://query.wikidata.org/bigdata/namespace/wdq/sparql
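
For readers who prefer to stay in a script, the same POST request can be sketched with Python's standard library. This is a minimal sketch, not a tested downloader: the function names are hypothetical, and the User-Agent value is a placeholder I am assuming you would replace with something that identifies your own tool.

```python
import urllib.parse
import urllib.request

# Same endpoint as in the curl command.
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


def build_request(query):
    """Build the POST request: a form-encoded `query` field, CSV requested."""
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    headers = {
        "Accept": "text/csv",
        # Assumption: placeholder value, replace with your own identifier.
        "User-Agent": "gazetteer-example/0.1 (hypothetical)",
    }
    return urllib.request.Request(ENDPOINT, data=body, headers=headers)


def download(query_path, output_path):
    """Send the query stored in query_path and save the response body."""
    with open(query_path, encoding="utf-8") as query_file:
        query = query_file.read()
    with urllib.request.urlopen(build_request(query)) as response, \
            open(output_path, "wb") as output_file:
        output_file.write(response.read())
```

Calling `download("file-with-query.sparql", "dataset.csv")` would mirror the curl invocation. For large result sets you may want to stream the response in chunks rather than read it in one call.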

The curl command above saves the results as a CSV file named dataset.csv. With these datasets in hand, my next step for Word of Mouth is to annotate band and musician mentions in Reddit posts.
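
As a rough idea of how the downloaded file could be turned into a lookup table, here is a minimal Python sketch. It assumes the CSV starts with a header row named after the query variables (for example band and bandLabel), and lower-casing the labels is my own simplification for matching, not something Wikidata prescribes. The function name and the sample rows, copied from the results above, are for illustration only.

```python
import csv
import io
from collections import defaultdict


def load_gazetteer(csv_file, id_column, label_column):
    """Map each lower-cased label to the set of entity URIs that carry it."""
    gazetteer = defaultdict(set)
    for row in csv.DictReader(csv_file):
        label = row[label_column].strip()
        if label:
            gazetteer[label.lower()].add(row[id_column])
    return dict(gazetteer)


# Sample rows copied from the band results shown earlier.
SAMPLE = """\
band,bandLabel
http://www.wikidata.org/entity/Q396,U2
http://www.wikidata.org/entity/Q371,!!!
http://www.wikidata.org/entity/Q689,Bastille
"""

sample_bands = load_gazetteer(io.StringIO(SAMPLE), "band", "bandLabel")
```

For the real file, the call would be `load_gazetteer(open("dataset.csv", newline="", encoding="utf-8"), "band", "bandLabel")`. Keeping a set of ids per label matters because several entities can share one name.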
