State-Of-The-Art Unsupervised Part-Of-Speech Tagging in 300 lines of Clojure (from Scratch)

Recently, Yoong-Keok Lee, Regina Barzilay, and I published a paper on unsupervised part-of-speech tagging, i.e., how to learn the syntactic categories of words from raw text. The model is pretty simple relative to those in other published papers, and it yields the best results on several languages. The C++ code for this project is available and finishes in just a few minutes on a large corpus.

Although the model is pretty simple, you might not be able to tell from the C++ code, despite Yoong being a top-notch coder. The problem is that the language just doesn’t facilitate expressiveness the way my favorite language, Clojure, does. In fact, the entire code for the model, with no dependencies beyond the language and its standard library, clojure contrib, can be written in about 300 lines, comments included. This includes a lot of the standard probabilistic computation utilities needed for something like Gibbs sampling, which is how inference is done here.
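To give a flavor of what "standard probabilistic computation utilities" for Gibbs sampling tend to look like, here is a minimal sketch in Clojure. It is not taken from the actual tagger code: the function names (`log-add`, `normalize-log-scores`, `sample-index`, `gibbs-resample`) and the idea of passing a `log-score-fn` are assumptions made purely for illustration. It shows the generic pattern of scoring each candidate value in log space, normalizing, and drawing one value at random.

```clojure
;; Illustrative sketch only; names are hypothetical, not from the tagger's source.

(defn log-add
  "Numerically stable log(sum(exp(x))) over a non-empty collection of log-values."
  [log-xs]
  (let [m (double (apply max log-xs))]
    (if (Double/isInfinite m)
      m
      (+ m (Math/log (reduce + (map #(Math/exp (- (double %) m)) log-xs)))))))

(defn normalize-log-scores
  "Turns unnormalized log-scores into a vector of probabilities summing to 1."
  [log-scores]
  (let [log-z (log-add log-scores)]
    (mapv #(Math/exp (- (double %) log-z)) log-scores)))

(defn sample-index
  "Returns index i with probability (nth probs i), given a uniform draw u in [0,1).
   Falls back to the last index to guard against floating-point round-off."
  [probs u]
  (loop [i 0, acc 0.0]
    (let [acc (+ acc (double (nth probs i)))]
      (if (or (< u acc) (= i (dec (count probs))))
        i
        (recur (inc i) acc)))))

(defn gibbs-resample
  "One Gibbs step for a single variable: score every candidate value
   (conditioned on everything else, via log-score-fn), then sample one."
  [candidates log-score-fn ^java.util.Random rng]
  (let [probs (normalize-log-scores (map log-score-fn candidates))]
    (nth candidates (sample-index probs (.nextDouble rng)))))
```

In a tagger, `candidates` would be the possible tags for a word and `log-score-fn` would compute the log-probability of each tag given the current assignments of all other variables. Working in log space avoids underflow when many small probabilities are multiplied together.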

Without further ado, the code is available as a gist and on GitHub (in case I make changes).


4 Responses to State-Of-The-Art Unsupervised Part-Of-Speech Tagging in 300 lines of Clojure (from Scratch)

  1. Chris Brew says:

    I hope you’ll expand your paper for a journal article. If you do, a useful
    reference point might be Julian Kupiec’s article

    Computer Speech & Language
    Volume 6, Issue 3, July 1992, Pages 225-242

    doi:10.1016/0885-2308(92)90019-Z

    I find this work clearer on what “unsupervised” might mean than is Merialdo. As usual, unsupervised is really a matter of degree, not a 1-0 thing.

    To my horror, I find myself drawn into the role of Guardian of the History.

  2. Pingback: Clojure Unsupervised Part-Of-Speech Tagger Explained | Stuff Aria Likes

  3. Pingback: Unsupervised Part-Of-Speech Tagger | The Empirical Humanist

  4. Pingback: Alexander7
