September 14, 2010

State-Of-The-Art Unsupervised Part-Of-Speech Tagging in 300 lines of Clojure (from Scratch)

Recently, Yoong-Keok Lee, Regina Barzilay, and myself, published a paper on doing unsupervised part-of-speech tagging. I.e., how do we learn syntactic categories of words from raw text. This model is actually pretty simple relevant to other published papers and actually yields the best results on several languages. The C++ code for this project is available and can finish in under a few minutes for a large corpus.

Although the model is pretty simple, you might not be able to tell from the C++ code, despite Yoong being a top-notch coder. The problem is the language just doesn’t facilitate expressiveness the way my favorite language, Clojure, does. In fact the entire code for the model, without dependencies beyond the language and the standard library, clojure contrib, can be written in about 300 lines of code, complete with comments. This includes a lot of standard probabilistic computation utilities necessary for doing something like Gibbs Sampling, which is how inference is done here.

Without further ado, the code is on gisthub and github (in case I make changes).