Dictionary Building for Word-Search

Mining Wikipedia for a "geek" lexicon.

One of the local pubs styles itself after geek/nerd culture (e.g. sci-fi, fantasy, and board games). The back of the coasters feature a word-search. It reports to contain 45 "geek words and phrases":

Writing code to solve a word-search isn't particularly tricky (if you remember to use a prefix trie) as long as you have a list of words to find, but in this case we are given no such clues.

But, since we are tremendously lazy, how can we solve this with code anyways?

First, I spent a few minutes searching the old-fashioned way (and not particularly well at that):

We have isolated some Star Trek, Star Wars, Game of Thrones, Alien, and Predator. Ergo, lets mine those terms to get a vocabulary.

I wrote a Wikipedia scraper in Python that can recursively walk from a starting point (caching pages at is goes for successive runs), and output the unique wordlist from that set of documents:

$ python wp_fetch.py -d 0 -o depth0.txt 'Star Trek' 'Star Wars' 'Game of Thrones' 'Alien (film)' 'Predator (film)'
Star Trek 79383
Star Wars 114584
Game of Thrones 86865
Alien (film) 116904
Predator (film) 32345
$ wc depth0.txt 
    5335    5335   43433 depth0.txt

We collected 5335 words! Lets quickly OCR the puzzle, write a word-search solver in Python, and try it with our found dictionary:

$ python solve.py -w depth0.txt puzzle.txt 
Puzzle is 19 by 19
Loading depth0.txt
5335 words
<manually snipped to remove lots of garbage results>
*  2,11  E: ANDROID
*  3, 8  E: PREDATOR
*  4,11  E: DROID
*  5,14  W: ROBOT
*  6,16  W: WIZARD
*  9, 5  E: HERO
*  9,14  E: STARWARS
* 10, 2 SW: ALIEN
* 11,15  N: BATTLES
* 12,12 SE: DRAGON
* 13, 7 SW: MASTER
* 15,16  E: WOLF
* 17,13  N: STARTREK
* 18, 2 SW: BEAST

That found a few more words for us (and a few times as much junk, but this isn't a very smart solver), but we are really missing quite a bit. Lets recurse once into Wikipedia:

$ python wp_fetch.py -d 1 -o depth1.txt 'Star Trek' 'Star Wars' 'Game of Thrones' 'Alien (film)' 'Predator (film)' | wc
    2398    9038   57852
$ wc depth1.txt
   61245   61245  561530 depth1.txt

That found 12 times as many unique words. Hopefully many more of them will be related to the topics at hand. Lets see!

$ python solve.py -w depth1.txt puzzle.txt 
Puzzle is 19 by 19
Loading depth0.txt
61245 words
<manually snipped for interesting results>
*  0,12  N: GOBLIN
*  1,12 SE: HOBBIT
*  2,11  E: ANDROID
*  6,16  W: WIZARD
*  9,17 NW: SAURON
* 11,15  N: BATTLESTAR
* 12, 0  S: DUNGEON
* 12,13 SW: WALKER
* 16, 3  S: DARTHVADER
* 18, 0 SW: TOLKIEN
* 18,14  N: BRADBURY

Still only 35 words of the reported 45, but we found some words (e.g. "SAURON", "WIZARD", "WALKER") for which we didn't even list the source material. Score!

Posted . Categories: .