Recently, Yoong-Keok Lee, Regina Barzilay, and I published a paper on unsupervised part-of-speech tagging, i.e., how do we learn syntactic categories of words from raw text? The model is pretty simple relative to other published papers, and it yields the best results on several languages. The C++ code for this project is available and finishes in a few minutes on a large corpus.
Although the model is pretty simple, you might not be able to tell from the C++ code, despite Yoong being a top-notch coder. The problem is that C++ just doesn’t facilitate expressiveness the way my favorite language, Clojure, does. In fact, the entire code for the model, with no dependencies beyond the language and the standard library, clojure contrib, can be written in about 300 lines, complete with comments. This includes a lot of the standard probabilistic computation utilities needed for something like Gibbs sampling, which is how inference is done here.
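To give a flavor of the kind of probabilistic utility meant here, below is a minimal sketch in Python (not the Clojure code from the project, and not the model itself). It shows the basic step a Gibbs sampler repeats: drawing an outcome from a set of unnormalized log-probabilities. The function names `log_sum_exp` and `sample_from_log_weights` are made up for this illustration.

```python
import math
import random


def log_sum_exp(log_xs):
    """Return log(sum(exp(x) for x in log_xs)), computed stably."""
    m = max(log_xs)
    if m == float("-inf"):
        # Every weight is zero, so the sum is zero and its log is -inf.
        return m
    return m + math.log(sum(math.exp(x - m) for x in log_xs))


def sample_from_log_weights(log_weights, rng=random):
    """Draw index i with probability proportional to exp(log_weights[i]).

    Illustrative helper only: a Gibbs sampler would call something like
    this once per variable, with log_weights holding the unnormalized
    conditional log-probability of each candidate value.
    """
    total = log_sum_exp(log_weights)
    target = rng.random()
    cumulative = 0.0
    for i, lw in enumerate(log_weights):
        cumulative += math.exp(lw - total)
        if target < cumulative:
            return i
    # Guard against floating-point round-off in the cumulative sum.
    return len(log_weights) - 1
```

Working in log space avoids underflow when many small probabilities are multiplied together, which is the usual situation when scoring a tag assignment over a large corpus.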
Without further ado, the code is on gisthub and github (in case I make changes).
I hope you’ll expand your paper for a journal article. If you do, a useful reference point might be Julian Kupiec’s article in Computer Speech & Language, Volume 6, Issue 3, July 1992, Pages 225-242, doi:10.1016/0885-2308(92)90019-Z.
I find this work clearer on what “unsupervised” might mean than is Merialdo. As usual, unsupervised is really a matter of degree, not a 1-0 thing.
To my horror, I find myself drawn into the role of Guardian of the History.