martin carpenter

generating word lists from ispell(1) dictionaries

2013/03/20

tags: dictionary cracking

ispell(1) is a venerable UNIX spellchecking utility. It ships with dictionaries in a variety of languages, each built from a base set of root words (the dictionary) and a set of rules for deriving inflected forms from those roots (the affix file).

If you're looking for a word list in a little-known language then ispell(1) can be persuaded to output all the forms of all of the words in the base dictionary.

I was looking for a list of Bulgarian words (Cyrillic alphabet, native population ~10 000 000) but this was not easy to find. However, I did find Bulgarian ispell .dict and .aff files. So...

First we combine the dictionary and affix file into a hash file using the ispell utility buildhash:

me@home:~$ buildhash bg.dict bg.aff bg.hash

(This creates the file bg.hash.)

Then we can create the word list by asking ispell to expand (-e) each dictionary root with all applicable affixes. Since it places all variations of a given root on the same line, we pipe the output through sed(1) (GNU sed, which understands \n in the replacement; tr ' ' '\n' is a portable alternative) to write one word per line into bg.words:

me@home:~$ ispell -d ./bg -e < bg.dict | sed "s/ /\n/g" > bg.words
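
The splitting step can also be done in Ruby, which sidesteps differences between sed implementations. What follows is only a minimal sketch and not part of the original workflow: the function name expand_to_words and the script name split_words.rb are made up for illustration, and the de-duplication and sorting are additions that the sed one-liner does not perform. It works on raw bytes, so it makes no assumption about the dictionary's encoding.

```ruby
# Turn the lines printed by `ispell -e` (a root followed by its expansions,
# separated by spaces) into a sorted list of unique words.
# Strings are handled as raw bytes (String#b, Ruby 2.0+), so any 8-bit
# encoding passes through untouched. Sorting is therefore by byte value.
def expand_to_words(lines)
  lines.flat_map { |line| line.b.split(' ') }.uniq.sort
end
```

To use it in the pipeline, one could append the line puts expand_to_words(STDIN.each_line) to the (hypothetical) split_words.rb and run ispell -d ./bg -e < bg.dict | ruby split_words.rb > bg.words.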

Finally we have to consider the question of character encoding. I pulled up bg.words in Firefox and clicked through View → Character Encoding → Auto Detect → Universal. This told me that the encoding was (surprisingly) Cyrillic Windows-1251.
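
If no browser is to hand, a crude guess can be made in Ruby. This is a rough heuristic of my own and not what Firefox does: it accepts the bytes as UTF-8 if they are valid UTF-8, and otherwise picks whichever single-byte candidate yields the most lowercase Cyrillic letters. The candidate list and the names cyrillic_score and guess_encoding are assumptions made for illustration, and the heuristic presumes mostly lowercase Cyrillic text such as a word list.

```ruby
# Single-byte Cyrillic encodings to try. This list is an assumption;
# extend it as needed.
SINGLE_BYTE_CANDIDATES = %w[Windows-1251 KOI8-R ISO-8859-5].freeze

# Count the lowercase Cyrillic letters (U+0430..U+044F) obtained when the
# raw bytes are interpreted as the given encoding. Returns 0 if the bytes
# cannot be converted from that encoding.
def cyrillic_score(bytes, encoding)
  text = bytes.dup.force_encoding(encoding)
  text.encode('UTF-8').scan(/[\u0430-\u044F]/).size
rescue EncodingError
  0
end

# Return the name of the most plausible encoding for the given bytes.
def guess_encoding(bytes)
  return 'UTF-8' if bytes.dup.force_encoding('UTF-8').valid_encoding?
  SINGLE_BYTE_CANDIDATES.max_by { |enc| cyrillic_score(bytes, enc) }
end
```

A call such as guess_encoding(File.binread('bg.words')) would then give a first guess, which is still worth confirming by eye on a few lines.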

I needed to convert this to UTF-8. This is easy using Ruby 1.9 or later and the external/internal encoding conversion on IO objects:

#!/usr/bin/env ruby
# Read bg.words as Windows-1251 and print each line transcoded to UTF-8.
File.open('bg.words', 'r:Windows-1251:UTF-8') { |f| f.each_line { |line| puts line } }
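
The same external/internal encoding trick can be wrapped in a small function that writes straight to a file instead of standard output. This is just a sketch: the name convert_to_utf8, its from: parameter and the output file name are illustrative choices, not anything Ruby or ispell prescribes.

```ruby
# Convert a text file from a given encoding to UTF-8, line by line.
# Returns the number of lines written.
def convert_to_utf8(src, dst, from: 'Windows-1251')
  count = 0
  File.open(dst, 'w:UTF-8') do |out|
    # "r:EXT:INT" makes Ruby transcode from EXT to INT while reading.
    File.open(src, "r:#{from}:UTF-8") do |input|
      input.each_line do |line|
        out.write(line)
        count += 1
      end
    end
  end
  count
end
```

For example, convert_to_utf8('bg.words', 'bg.utf8.words') would leave the original file untouched and write the converted list alongside it.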

Done!