2013/03/20
tags: dictionary cracking
ispell(1) is a venerable UNIX spellchecking utility. It ships with
dictionaries in a variety of languages, each built from a base set of root
words (the dictionary) and rules for stemming those words (the affix file).
If you're looking for a word list in a little-known language then ispell(1)
can be persuaded to output all the forms of all of the words in the base
dictionary.
I was looking for a list of Bulgarian words (Cyrillic alphabet, native
population ~10 000 000) but this was not easy to find. However, I did find
Bulgarian ispell .dict and .aff files. So...
Firstly we combine the dictionary and affix file into a hash file using the
ispell utility buildhash:
me@home:~$ buildhash bg.dict bg.aff bg.hash
(This creates the file bg.hash.)
Then we can create the word list by asking ispell to expand (-e) each
dictionary root with all applicable affixes. Since it places all variations
of a given root on the same line, we pipe through sed(1) to write one word
per line into bg.words:
me@home:~$ ispell -d ./bg -e < bg.dict | sed "s/ /\n/g" > bg.words
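The same splitting step can be sketched in Ruby, which also makes it easy to drop duplicate forms. This is only an illustration: split_forms is a hypothetical helper name, the sample lines are made-up English stand-ins for the expanded output, and for the real file you would want to read the lines with the correct encoding first.

```ruby
# Hypothetical helper: turn lines of space-separated word forms into a
# flat list with one word per entry, dropping blanks and duplicates
# while keeping first-seen order.
def split_forms(lines)
  lines.flat_map { |line| line.split(' ') }.uniq
end

# Made-up sample standing in for lines of expanded ispell output.
sample = ["walk walks walked", "talk talks", "walk"]
puts split_forms(sample)
```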
Finally we have to consider the question of character encoding. I pulled up
bg.words in Firefox and clicked through View → Character Encoding →
Auto Detect → Universal. This told me that the encoding was (surprisingly)
Cyrillic Windows-1251.
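If no browser is to hand, a rough check in Ruby can at least rule UTF-8 in or out. The sketch below is an assumption-laden illustration, not a real detector: looks_like_utf8? is a hypothetical helper, and a negative answer only says the bytes are not valid UTF-8, not which legacy encoding they actually use.

```ruby
# Hypothetical helper: true if the raw bytes form a valid UTF-8 sequence.
# It cannot tell Windows-1251 apart from other single-byte encodings.
def looks_like_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# The Bulgarian word "дума" as Windows-1251 bytes and as UTF-8 text.
cp1251_sample = "\xE4\xF3\xEC\xE0"
utf8_sample   = "\u0434\u0443\u043C\u0430"

puts looks_like_utf8?(cp1251_sample)
puts looks_like_utf8?(utf8_sample)
```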
I needed to convert this to UTF-8. This is easy using a modern (1.9+) Ruby and the external/internal encoding conversion on IO objects:
#!/usr/bin/env ruby
# Read bg.words as Windows-1251, transcode each line to UTF-8, and print it.
File.open('bg.words', 'r:Windows-1251:UTF-8').each { |x| puts x }
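The same conversion can also be done on a string that is already in memory, which is a handy way to sanity-check the result on a single word. A minimal sketch, assuming the input really is Windows-1251; cp1251_to_utf8 is a hypothetical helper name:

```ruby
# Hypothetical helper: label raw bytes as Windows-1251, then transcode
# them to UTF-8. Assumes the caller knows the source encoding.
def cp1251_to_utf8(bytes)
  bytes.dup.force_encoding('Windows-1251').encode('UTF-8')
end

# "дума" ("word" in Bulgarian) as Windows-1251 bytes.
word = cp1251_to_utf8("\xE4\xF3\xEC\xE0")
puts word
puts word.encoding
```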
Done!