UPDATE: New Maps are now posted here at Litmaps
Current Available Maps (April 26, 2016):
Google offers a service called BigQuery that lets you search large databases for free: you’re given an allotment of thousands of queries and terabytes of scanned data per day. As part of this service, Google has integrated large public datasets, such as all the GPS data for New York taxis in a given year or all the birth records in the United States for the past 100 years. Among these are databases full of works of literature. BigQuery lets you run a query across all public-domain books written since 1800 and get results within seconds. Using BigQuery to return locations is the first step of this process.
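As a rough illustration, a first query against one of these public datasets might look like the sketch below. The `samples.shakespeare` table is a real BigQuery public dataset, but the exact tables and queries used in this project aren’t given here, so treat the SQL as an assumption; running it requires Google Cloud credentials, so the client call is shown commented out.

```python
# Sketch (not the project's actual query): build SQL for a BigQuery
# public dataset. Table and column names follow the public
# `bigquery-public-data.samples.shakespeare` schema.

def build_query(corpus="hamlet", limit=10):
    """SQL that returns the most frequent words in one Shakespeare corpus."""
    return f"""
        SELECT word, word_count
        FROM `bigquery-public-data.samples.shakespeare`
        WHERE corpus = '{corpus}'
        ORDER BY word_count DESC
        LIMIT {limit}
    """

sql = build_query("hamlet")

# With credentials configured, this would run via the official client:
#   from google.cloud import bigquery        # pip install google-cloud-bigquery
#   for row in bigquery.Client().query(sql):
#       print(row.word, row.word_count)
```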
Location Search Algorithm
These two datasets come from two separate databases. The Shakespeare corpus comes with no metadata, while the Internet Archive comes with a wealth of metadata, including coordinates for every location mentioned in a given work.
This means that for all other works, including the Shakespeare corpus, I had to write an algorithm that parses the entire text of a book for locations. My approach was two-fold.
- The first step of the algorithm cuts away a large part of the text at the query level. Location names in novels are capitalized, so I used BigQuery to return all capitalized words. This step completes in seconds.
- The next step uses the Wikipedia search API to look up each of the resulting capitalized words for coordinates. This step isn’t perfect on its own, so I needed to develop checks for each word search.
- To deal with duplicate place names, I check whether a word leads to a disambiguation page. From the disambiguation page I then return the first location listed. An obvious problem with this is that the correct location might not be the first one listed; this hasn’t been solved yet. One possible solution I’ve considered is to search the text for the nearest word (to the current query word) that is a country, and then use that country to determine the proper location.
- The next issue was checking for misspellings. In older texts, the spelling of a location can vary slightly. Wikipedia is good at redirecting to the right page when a search contains a misspelled word, but its API doesn’t easily expose this. My solution was to fetch the HTTP response for the given keyword and read the title of the page it redirects to. That corrected title can then be searched for coordinates.
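The steps above can be sketched as follows. This is a minimal sketch, not the project’s actual code: the Wikipedia lookup is abstracted behind a hypothetical `fetch_page` callable, so the disambiguation and redirect logic can be shown without network access.

```python
import re
from collections import Counter

def capitalized_words(text):
    """Step 1 (done in BigQuery in practice): collect capitalized words
    and how often each one is mentioned."""
    return Counter(re.findall(r'\b[A-Z][a-z]+\b', text))

def resolve_place(word, fetch_page):
    """Look a word up via an injected fetch_page callable (a stand-in for
    the real Wikipedia HTTP lookup), which returns a dict like:
      {"title": str, "is_disambiguation": bool,
       "locations": [str, ...], "coords": (lat, lon) or None}
    Trusting the returned title over the raw word is what handles the
    misspelling/redirect case."""
    page = fetch_page(word)
    if page is None:
        return None
    if page["is_disambiguation"]:
        # Naive choice: take the first location listed (known limitation).
        first = page["locations"][0] if page["locations"] else None
        page = fetch_page(first) if first else None
        if page is None:
            return None
    return (page["title"], page["coords"]) if page["coords"] else None
```

In the real pipeline, `fetch_page` would call the Wikipedia API and follow HTTP redirects; injecting it keeps the control flow testable on its own.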
Creating the Map
Marker size is determined by a simple function that normalizes each location’s mention count between the most-mentioned and least-mentioned places.
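A minimal version of such a normalization, assuming a linear rescale onto arbitrary pixel bounds (the bounds are illustrative, not the values used on the actual maps):

```python
def marker_size(mentions, min_m, max_m, min_px=4, max_px=40):
    """Linearly map a mention count onto a marker-size range.
    min_px/max_px are arbitrary pixel bounds chosen for illustration."""
    if max_m == min_m:                      # every place mentioned equally often
        return (min_px + max_px) / 2
    scale = (mentions - min_m) / (max_m - min_m)
    return min_px + scale * (max_px - min_px)
```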
That’s a quick overview of the process. Expect more updates as I map out more works of literature and clean up the code so it can be open-sourced and distributed.