Mapping Literature with Google BigQuery and the Maps JavaScript API

UPDATE: New Maps are now posted here at Litmaps

Currently Available Maps (April 26, 2016):

Shakespeare:
http://bit.ly/1MUrCCz

Herman Melville:
http://bit.ly/1SvU4r9

Although hardly a novel idea, mapping the locations mentioned in works of literature can provide an interesting visualization of an author's perception of the world. Typically, this is done through crowdsourcing (www.placingliterature.com/) or by mapping the general setting of a novel rather than each location mentioned within it. This is one of the few mapping scripts, if not the only one, that uses BigQuery to automate the whole process and produce a map of a novel within seconds. Using a combination of the Google BigQuery API, the Google Maps JavaScript API, the Wikipedia API, and Google Apps Script, you can map out every location mentioned in a book.

Google BigQuery

Google offers a service called BigQuery that lets you query very large datasets for free, up to a monthly allotment of data processed. As part of this service, Google hosts large public datasets, such as all the GPS data for New York taxis in a given year and all the birth records in the United States for the past 100 years. Among these are datasets full of works of literature. BigQuery can search every public domain book written since 1800 and return results within seconds. Using BigQuery to return locations is the first step of this process.
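As a concrete illustration, here is a minimal Apps Script sketch that queries the public Shakespeare sample table through the BigQuery advanced service (which must be enabled for the script). The project ID is a placeholder, and the capitalized-word filter previews the location search described below; it's a sketch, not the project's actual query.

```javascript
// Sketch only: list the most frequent capitalized words in the public
// Shakespeare sample table. Requires the BigQuery advanced service.
function listCapitalizedWords() {
  var request = {
    // Legacy SQL (the default at the time): one row per distinct word.
    query: 'SELECT word, SUM(word_count) AS mentions ' +
        'FROM [bigquery-public-data:samples.shakespeare] ' +
        "WHERE REGEXP_MATCH(word, r'^[A-Z]') " +
        'GROUP BY word ORDER BY mentions DESC LIMIT 20'
  };
  var response = BigQuery.Jobs.query(request, 'my-project-id');
  // rows may be absent if the job hasn't completed yet.
  (response.rows || []).forEach(function(row) {
    Logger.log(row.f[0].v + ': ' + row.f[1].v);
  });
}
```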

Location Search Algorithm

The two maps above draw on two separate databases. The Shakespeare corpus comes with no metadata, while the Internet Archive dataset comes with a wealth of metadata, including coordinates for every location mentioned in a given work.

Mapping works from the Internet Archive is simple: extract the coordinates for a given work or author, then add them as markers with the Maps JavaScript API. However, the database only covers works from the year 1800 onward; nothing written before then is included.
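A minimal sketch of that marker step might look like the following; the sample locations and the object shape are made up for illustration, not taken from the Internet Archive data:

```javascript
// Hypothetical extract from the metadata: place names with coordinates.
var locations = [
  {name: 'Nantucket', lat: 41.28, lng: -70.10},
  {name: 'Lima', lat: -12.04, lng: -77.03}
];

function initMap() {
  var map = new google.maps.Map(document.getElementById('map'), {
    center: {lat: 20, lng: 0},  // rough world view
    zoom: 2
  });
  // One marker per extracted location.
  locations.forEach(function(loc) {
    new google.maps.Marker({
      position: {lat: loc.lat, lng: loc.lng},
      map: map,
      title: loc.name
    });
  });
}
```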

This means that for all other works, including the Shakespeare corpus, you have to write an algorithm that parses the entire text of a book for locations. My approach was two-fold.

  1. The first step of the algorithm cuts down a large part of the text at the query level. Locations mentioned in novels tend to be capitalized, so I used BigQuery to return every capitalized word. This step completes in seconds.
  2. The next step uses the Wikipedia search API to check each of the resulting capitalized words for coordinates. This step isn't perfect on its own, so I needed to develop checks for each word search (a sketch of the lookup follows this list).
    1. To deal with duplicate place names, I check whether a word leads to a disambiguation page; if it does, I return the first location listed on that page. An obvious problem with this is that the correct location might not be the first one listed, and that hasn't been solved yet. One possible solution I've considered is to search the text for the nearest word (to the current query word) that is a country, then use that country to decide which candidate location is the proper one.
    2. The next issue was checking for misspellings. In older texts, the spelling of a location can vary slightly. Wikipedia is good at redirecting to the right page when a search contains a misspelled word, but its API doesn't make this easy to handle. My solution was to fetch the HTTP response for the given keyword and read the title of the page it redirects to; that corrected term can then be searched for coordinates.
       
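Here is a rough Apps Script sketch of the lookup described in steps 2.1 and 2.2, assuming the standard MediaWiki API: a single request follows redirects (which covers the spelling variants Wikipedia already redirects), flags disambiguation pages, and returns coordinates where the article has them. The function name and return shape are mine, not from the actual project code.

```javascript
// Sketch only: resolve one word to coordinates via the Wikipedia API.
function lookupCoordinates(word) {
  var url = 'https://en.wikipedia.org/w/api.php' +
      '?action=query&format=json&redirects=1' +        // follow redirects
      '&prop=coordinates|pageprops&ppprop=disambiguation' +
      '&titles=' + encodeURIComponent(word);
  var data = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  var pages = data.query.pages;
  var page = pages[Object.keys(pages)[0]];  // one title in, one page out
  if (!page || page.missing !== undefined) {
    return null;                            // no article for this word
  }
  if (page.pageprops && 'disambiguation' in page.pageprops) {
    // Disambiguation page: the follow-up step of picking the first
    // listed location (step 2.1) would go here; not shown.
    return null;
  }
  if (!page.coordinates) {
    return null;                            // article isn't a place
  }
  return {
    name: page.title,                       // redirect-corrected title
    lat: page.coordinates[0].lat,
    lng: page.coordinates[0].lon
  };
}
```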

Creating the Map

Once all of the coordinates are returned, a Google Apps Script places them in a Google Sheet for safekeeping. I also record the name of each location and the number of times it is mentioned in the book. The Google Maps JavaScript API side then calls a function within the Apps Script to return all items in the Google Sheet as an array, and from this array I add a marker for each location.
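One way that hand-off can work (a sketch, assuming the script is deployed as a web app and the sheet holds one row per location) is a doGet function that serves the sheet contents as JSON:

```javascript
// Sketch: serve the sheet's rows (name, lat, lng, mentions) as JSON so
// the map page can fetch them. 'Locations' is a placeholder sheet name.
function doGet() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet()
      .getSheetByName('Locations');
  var rows = sheet.getDataRange().getValues();  // 2-D array of cell values
  return ContentService
      .createTextOutput(JSON.stringify(rows))
      .setMimeType(ContentService.MimeType.JSON);
}
```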

Marker size is determined by a simple function that normalizes each location's mention count to the range between the greatest number of mentions and the least.
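Here is one way such a normalization can look; the pixel bounds are assumptions for illustration, not the values the maps actually use:

```javascript
// Sketch: linearly map a mention count onto an assumed size range.
function markerSize(mentions, minCount, maxCount) {
  var minSize = 5, maxSize = 40;              // assumed size bounds
  if (maxCount === minCount) return minSize;  // avoid division by zero
  var t = (mentions - minCount) / (maxCount - minCount);  // 0..1
  return minSize + t * (maxSize - minSize);
}
```

The result could then feed a marker icon's scale option, for example with a google.maps.SymbolPath.CIRCLE icon.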

That’s a simple (and poorly written) overview of the process and expect to see more updates as I map out more works of literature and clean up the code so it can be open sourced and distributed.