Condensr: Automatic Yelp Snippet Summarization using Natural Language Processing

In an earlier post, I hinted at a Yelp review summarization system demo. Well, here it is. Some background: Christy Sauper, Regina Barzilay, and I recently presented Incorporating Content Structure into Text Analysis Applications at the Empirical Methods of Natural Language Processing (EMNLP) 2010 conference. We used the model developed in that paper on a whole lot of Yelp from the Boston/Cambridge, SF, Berkeley, New York, Los Angeles, Philadelphia, and Seattle areas.

If you head over to condensr.com, you can see highlights for tons of restaurants in these areas. The system identifies the informative highlights from several restaurant reviews which you might want to know (e.g., a lot of people saying something like “Chicken parmesan is great”) and organizes these snippets into basic attributes (food, service, ambiance, etc.). This is definitely work in progress and
we’re still actively working out new models to do a better job at high-level sentiment summarization and clustering, we thought we’d share what we have now. Please take a look and feel free to drop me a line with comments or suggestions.

Posted in computer science, machine learning, nlp, reviews | 1 Comment

Classification with MIRA in Clojure

A few people from my last post asked for an accessible explanation of the margin infused relaxation algorithm (MIRA) and confidence-weighted learning (CW) classification algorithms I discussed. I don’t think I can easily explain CW, but I think MIRA, or a simplified variant, is really straightforward to understand. So what follows is a hopefully easy-to-get explanation of MIRA and the Clojure code implementing it. The code for the project is available on GitHub.

The Online Machine Learning Setup

We’re assuming the standard supervised learning scenario. We have access to a set of labeled examples: (x_1,y_1),\ldots,(x_n,y_n), where x_i is a feature vector and y_i is a label or class. A feature vector is basically a set of key-value pairs where the key represents a feature, such as this document contains the word “awesome.” Each y_i represents a label of interest about the feature vector x_i. For instance, y_i might say this document (represented by x_i) has positive sentiment. Getting ahead of ourselves, in Clojure, I implement a feature vector as just a map from anything to a double.[1]

The model family is just a simple linear family. For each possible label y, there is a weight vector w_y over possible features. At any time for a given feature vector, x, we predict \hat{y} = \arg\max_y w_y^T x, where w_y^T x represents the dot-product between the vectors. Note that computing a dot-product can be done in time proportional to the sparser of the input vectors (which will be x in this case). The score for each label is w_y^T x and we simply select the highest scoring class/label.

In particular, we’ll be working in the online learning which works as follows:

Initialize weights w_y to zero vector for each y
For each iteration:
  For each example (x,y^*):
    compute prediction: \hat{y} = \arg\max_y w_y^T x
    if y^* \neq \hat{y}: update weight vectors w_{y^*} and w_{\hat{y}}

MIRA is about a particular way of implementing the weight update step. Let’s look at that.

How MIRA works

Here’s how MIRA works.[2] In response to an example pair (x,y^*), we make an update to the current weight vector w’_y to a new one w_y for y=\hat{y} and y=y^*. Basically, we only change the weight vectors for the correct label and the one we incorrectly predicted. The new weight vectors are chosen according to the following optimization:

\begin{array}{l}<br />
\min_w \frac{1}{2}\|w_{y^*} – w_{y^*}’\|^2 + \frac{1}{2}\|w_{\hat{y}} – w_{\hat{y}}’\|^2   \\<br />
 \mbox{s.t.}  \hspace{2pt} w^T_{y^*} x – w^T_{\hat{y}} x \geq \ell(y^*,\hat{y}) \\<br />
\hspace{15pt} \hat{y} = \arg\max_{y} w_y’^T x<br />
\end{array}

What the heck does that mean?

Here’s the optimization problem in words. Consider the prediction you would make \hat{y} which is best according to your current weights (w’_ys). You made an error, so you want to update w_{y^*} to score higher on x and update w_{\hat{y}} to score lower on x. The term w^T_{y^*} x – w^T_{\hat{y}} x represents the gap between the score for the correct answer and the predicted answer. This quantity can’t be positive for the old weights w’ since \hat{y} scored at least as high as y^*; we made a mistake after all. We want the new weight vectors to have the property that this gap is positive and at least \ell(y^*,\hat{y}), a user-specific loss between the two labels. Typically this loss is just 1 when \hat{y}\neq y^*, but it can be more complex. This is the constraint we want for the new weight vectors w_{y^*} and w_{\hat{y}}. Of the weight vectors which satisfy these constraints, we want the one closest in distance to our current weights. So we want to get the correct answer without changing things too much.

How do you solve the problem?

Using a little optimization theory, it’s straightforward to see that the solution for the new w_{y^*} and w_{\hat{y}} take the forms:[3]

\begin{array}{l}<br />
w_{y^*} \leftarrow w’_{y^*} + \alpha x \\<br />
w_{\hat{y}} \leftarrow w’_{\hat{y}} – \alpha x<br />
\end{array}<br />

Where \alpha is some positive constant. Essentially, you want whatever features where active (non-zero) in x to get bigger for the correct answer y^* and for
the weights for the incorrect answer w_{\hat{y}} to get smaller for those features active in x.

You can just solve for \alpha which satisfy the constraint:

 (w’_{y^*} + \alpha x)^T x – (w’_{\hat{y}} – \alpha x)^T x \geq \ell(y^*,\hat{y})

Using some basic algebra we get:

 (w’^T_{y^*} x – w’^T_{\hat{y}} x) + 2 \alpha \| x \|^2  \geq \ell(y^*,\hat{y})

Solving for \alpha yields:

 \alpha \geq \frac{\ell(y^*,\hat{y}) – (  w’^T_{y^*} x – w’^T_{\hat{y}} x )}{2 \| x \|^2}

Any \alpha which satisfies the above will satisfy our condition. Since we need to make the smallest changes possible, this corresponds to selecting the smallest \alpha which satisfies the constraint. Basically we set \alpha to the right hand-side of the above. So the \alpha is composed of: the loss, the gap with the current weights (w’), and the datum norm (\|x\|^2). Once we compute alpha, we make weight vector updates and move on through the rest of the examples. Notice that once we make a pass over the data and don’t make an error, the weights never change.

Clojure Code

Here’s the Clojure code for implementing MIRA. I implement machine learning vectors via the clojure map where the keys are typically strings given as input. This isn’t the most efficient encoding, but it makes the code easier to write. This code has been fairly optimized and is reasonably fast. It can write weights to disk, load them and make predictions on new datums. In general, when I need quick and dirty multi-class classifcation, I’ll use this.

One detail in the code is that it’s usually better to use the average of weight vectors over all updates rather than the final weight vectors. We accomplish this by tracking the sum over all weight vectors (:cum-label-weights). In order to make the updates to the summed vector efficient, we need to know how many updates are left to go. This way we can add the contribution of the current update to all future updates.

  1. [1] Although unfortunately, like Java, you have to map to a Double object and pay the cost of boxing and unboxing.
  2. [2] The variant of working with here is for k=1 so the update has a closed form.
  3. [3] You get this by
    looking at the dual optimization problem.
Posted in clojure, computer science, machine learning | 4 Comments

The Relation between MIRA and CW Learning

Note: This post won’t make sense unless you’re steeped in recent machine learning. There’s a good chance that if you are, you already know this.

During a machine learning reading group with Mike Collins, Jenny Finkel, Alexander Rush and myself were reading a paper about confidence-weighted (CW) learning. At a very high-level, CW is a online learning approach: you make updates to parameters after observing a labeled examples (x,y^*). Online methods have enjoyed a lot of popularity recently: the margin infused relaxation algorithm (MIRA)[1] and Primal Estimated sub-GrAdient SOlver for SVM (PEGASOS)[2]. These techniques are much simpler and faster to converge than their batch counterparts; the performance gap is or has become negligible (although there might be folks who disagree).

One thing we wanted to understand better is how this approach is different from MIRA. One obvious difference which the authors push is that they’re capturing variance of individual features as well as between features which yields stronger performance. Those are all valid points. But if we strip the feature covariance out of the picture how does the update optimization problem differ? The answer, I think, is that they’re essentially equivalent modulo one subtle difference which is probably important. This is probably obvious to machine learning gurus, but it took me a few minutes to work out. I’m sure this observation is even spelled out in one of the CW papers.

MIRA: Here’s the variant of MIRA I’m working with. You have a current weight vector \mu’ and you want to update to a new weight vector \mu based on a new example pair (x,y^*):

\begin{array}{l}<br />
\min_\mu \|\mu – \mu’\|^2  \\<br />
 \mbox{s.t.}  \hspace{2pt} \mu^T \Delta f \geq \gamma \\<br />
\hspace{15pt} \hat{y} = \arg\max_{y} \mu^T f(x,y)   \\<br />
\hspace{15pt} \Delta f = f(x,y^*) – f(x,\hat{y})<br />
\end{array}

Here \gamma > 0 is typically a fixed constant and the above update is done only when an error is made (\hat{y} \neq y^*).

CW: In contrast, CW doesn’t have a single weight vector, it has distribution over weight vectors w \sim \mathcal{N}(\mu,\Sigma). In normal CW, you get the covariance matrix \Sigma as parameters. Here, I’m considering a variant where the covariance matrix is fixed to be the identity.[3] The only parameters I’m considering here are the mean weight vector \mu. The update optimization for CW in this context is given by:

\begin{array}{l}<br />
\min_\mu  KL(\mathcal{N}(\mu’,I) | \mathcal{N}(\mu,I))  \\<br />
\mbox{s.t.} \hspace{2pt}  P(w^T \Delta f \geq 0) \geq \eta\\<br />
 \hspace{15pt} w \sim \mathcal{N}(\mu, I)<br />
\end{array}

The KL(\cdot | \cdot) is the Kullback-Liebler Divergence. If you take a look at the expression for the KL diverence between two gaussians, here, it’s pretty straightforward to see that if the covariance matrices are the identity, the KL divergence is within a constant of \| \mu – \mu’ \|^2.

Now for the constraint. The first thing to notice is that w^{T} \Delta f \sim \mathcal{N}(\mu^{T} \Delta f, \| \Delta f \|^{2}). So if Z is a zero-mean unit-variance gaussian, we want

 P(\| \Delta f \| Z + \mu^{T} \Delta f \geq 0) = P\left(Z \geq -\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right)

If \Phi is the cumulative distribution function for the unit-normal, we want:

 1 – \Phi\left(-\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) \geq \eta

This implies,

<br />
	   \Phi\left(-\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) \leq 1 – \eta \Rightarrow<br />
	    erf\left(\frac{ \mu^{T} \Delta f}{\sqrt{2} \| \Delta f \|}\right) \leq 1 – 2\eta<br />

where erf(\cdot) is the error function.

Here’s the subtlety: If we assume that our feature vectors f(x,y) are normalized and that for any two y,y’ that f(x,y) and f(x,y’) don’t overlap in non-zero features (which is common in NLP since weight vectors are partitioned for different ys) then \| \Delta f \| is a constant independent of the particular update. In which case, ensuring erf (c \mu^{T} \Delta f) \leq 1 – 2 \eta (assuming \eta > 0.5) just amounts to making sure \mu^{T} \Delta f exceeds some constant independent of the particular update, which is equivalent to selecting that choice of \gamma in MIRA. So the two optimizations are essentially the same.

However, if feature vectors are not normalized, then the two aren’t equivalent. Essentially, the larger the feature vector norm the larger the “gap” term \mu^T \Delta f needs to be. If you have exclusively binary features, which many NLP applications do, this means the more features active in a datum, the larger “gap” (\mu^T \Delta f) we require. This makes a lot of sense. We can get this in MIRA pretty straightforwardly:

\begin{array}{l}<br />
\min_\mu \|\mu – \mu’\|^2  \\<br />
\hspace{2pt} \mbox{s.t.}  \mu^T \Delta f > \gamma \| \Delta f \| \\<br />
\end{array}

then it’s always equivalent modulo constant choices. I don’t actually know if the \| \Delta f \| scaling improves accuracy, but I wouldn’t be surprised if it did.

  1. [1] This algorithm is actually called the Passive Aggressive algorithm, but I’ve always known it as MIRA.
  2. [2] Yes I know, it’s a tortured backronym
  3. [3] In practice, you wouldn’t use this variant, here I’m just trying to get a handle on the objective
Posted in computer science, machine learning, nlp | 4 Comments

Extracting Useful Review Snippets from Yelp Using Natural Language Processing

Recently, Christy Sauper, Regina Barzilay, and me published
a paper, Incorporating Content Structure into Text Analysis Applications, about how to use content structure in a document to improve accuracy on information extraction tasks. One of the datasets we worked with was derived from Yelp. The specific task is to extract short key-phrases about various aspects of a restaurant review: the food, service, ambiance , price, etc.

When you extract these key-phrases from all the reviews of a particular restaurant and cluster them, you start to get a good sense of what people think about the restaurant that is a lot richer than just a “I hated it” or “I loved it” perspective. There will be a proper demo soon, but for now, I have a few screen shots to give you a small example. The snippets from each restaurant are automatically extracted from raw review data using the technique described in our paper and some simple clustering of these phrases. In the next week or so, I’ll post a link to a fully-functional demo that participants in the upcoming Empirical Methods in NLP (EMNLP) 2010 conference will hopefully find useful in finding good places to eat in the Boston area. For automatically extracted snippets, I think these look quite good! The sentiment ratings are also done automatically by another system.

example1.png
example3.png
example2.png

Posted in computer science, discourse, information extraction, nlp | Leave a comment

Clojure Unsupervised Part-Of-Speech Tagger Explained

Last week, I posted a 300 line clojure script which implements some recent work I’ve published in unsupervised part-of-speech tagging. In this post, I’m going to describe more fully how the model works and also how the implementation works. This post is going to assume that you have some basic background in probability and that you know some clojure. The post is massive, so feel free to skip sections if you feel like something is too remedial; I’ve put superfluous details in footnotes or marked paragraphs.

What exactly is unsupervised part-of-speech tagging?

Unsupervised part-of-speech (POS) tagging is the task of taking the raw text of a language and inducing syntactic categories of words. For instance, in English, the words “dog”,”basketball”, and “compiler”, aren’t semantically related, but all are common nouns. You probably know the basic syntactic categories of words: nouns, verbs, adjectives, adverbs, etc. However, natural language processing (NLP) applications typically require more fine grained distinctions. For instance, the difference between a singular, plural, or proper nouns. In English, the most commonly used annotated POS data has 45 different tags.

What we’re interested in here is unsupervised learning, meaning that at no point does the model get told about the kinds of syntactic categories we want nor does it get examples of annotated sentences or examples (there is no supervised data); you just get raw data. There are several advantages to using unsupervised learning, not least of which being there are languages that don’t have POS annotated data.

A subtle consequence of being unsupervised is that we aren’t going to directly learn that the word “dog” is a singular common noun. Instead, we learn there are some fixed number of tag states and all the things we call a singular common noun may map to tag state 42, for instance. Basically, the tags in the model don’t come with the names we recognize, we have to map them to meaningful names (if that’s what you’re application requires). Essentially, what you get out of this is a clustering over words which corresponds to meaningful syntactic distinctions.

How does the model work?

The model is a variation on the standard Hidden Markov Model (HMM), which I’ll briefly recap. The HMM unsupervised POS tagging story works as follows: We assume a fixed number of possible tag states, K as well as a fixed vocabulary of size  V. The Markov model part of HMM refers to the fact that the probability distribution of tag states for a single sentence is generated under a first-order Markov assumption. I.e., the probability
P(t_1,\ldots,t_m) for a sentence of length m is given by,

 P(t_1,\ldots,t_m) = \prod_{i=1}^m P(t_{i+1} | t_i)

This encodes the intuition that typically some kind of noun or adjectives usually follows a determiner (e.g. “the”,”a”,”this”).[1]

Once the tags of the sentence have been generated, for each position, a word is drawn conditioned on the underlying tag. Specifically, for i=1,\ldots,n a word w_i is drawn conditioned on the corresponding tag t_i, P(w_i | t_i). This emission distribution is parametrized according to parameters \theta_t for each tag t over the V vocabulary elements. So for instance, for tag state 42, which we suppose corresponds to a singular noun, there is a distribution over all possible singular nouns, that might look like this:[2]

 \theta_{42} =  \left\{ dog: 0.03, basketball: 0.02, compiler: 0.01, \ldots \right\}

If we let, \mathbf{w} denote a corpus, consisting of a bunch of sentences, then the HMM puts mass on the corpus as well as corresponding
tag sequences \mathbf{t} as follows,

 P(\mathbf{w},\mathbf{t}) = \prod_{(w,t) \in (\mathbf{w},\mathbf{t})} P(w,t) = \prod_{(w,t) \in (\mathbf{w},\mathbf{t})} \left( \prod_{i=0}^m P(w_{i+1} | t_{i+1}) P(t_{i+1} | t_i) \right)

What’s wrong with the HMM?

There’s a lot wrong with the HMM approach to learning POS structure. One of the most important is that the model doesn’t encode the constraint that a given word typically should be associated with a few number of tags. The model is perfectly happy to learn that a given word can be generated any number of tags. A lot of doing unsupervised machine learning is understanding how to alter models to reflect the constraints and preferences that the structure we are interested in has.

Another more subtle issue is that there is a significant skew to the number of words which belong to each part-of-speech category. For instance, there are very few determiners in most languages, but they appear very frequently at the token level. There is no way to encode this constraint that some tags are infrequent or frequent at the type-level (have very few (or many) unique word types that can use a given tag category). So the model has a prior P(T) over tag assignments to words.[3]

What’s the approach in the paper?

The approach in the paper is actually very simple: For each word type W_i, assign it a single legal tag state T_i. So for the word type “dog”, the model chooses a single legal tag (amongst t=1,\ldots,K); essentially, decisions are made for each word once at the type level, rather than at the token-level for each individual instantiation of the word. Once this has been done for the entire vocabulary, these type tag assignments constrain the HMM \theta_t parameters so only words assigned to tag t can have non-zero probability. Essentially, we strictly enforce the constrain that a given word be given a single tag throughout a corpus.

When the model makes this decision it can use a type-level prior on how likely it is that a word is a determiner. Determiners, or articles, in general are very frequent at the token level (they occur a lot in sentences), but there are very few unique words which are determiners. Another thing we can do is have features on a word type in a naive-bayes fashion. We assume that each word is a bag of feature-type and feature-value pairs which are generated from the tag assigned to the word. The features you might have on a word type are what is the suffix of the word? Is it capitalized? You can configure these features very easily.

Let’s summarize the model. Assume that the vocabulary of a language consists of n word types. The probability of a type-level tag assignment is given by:

 P(\mathbf{T},\mathbf{W}) = \prod_{i=1}^n P(T_i) P(W_i | T_i) = \prod_{i=1}^n P(T_i) \left( \prod_{(f,v) \in W_i} P(v | f, T_i) \right)

where, (f,v) is a feature-type and feature-value pair in the word type (e.g., (:hasPunctuation, false). So each tag t has a distribution over the values for each feature type. For instance, the common noun, tag 42 in our examples so far, is somewhat likely to have punctuation in the word (as in “roly-poly”). It’s distribution over the :hasPunctuation feature-type might look like:[4]

\left\{ false: 0.95, true: 0.05 \right\}

Once the tag assignments have been generated, everything proceeds identically to the standard token-level HMM except with the constraint that emission distributions have been constrained so that a tag can only emit a word if that word has been assigned to the tag.

How do you learn?

The fairly simple change to the model made in the last section not only yields better performance, but also makes learning much simpler and efficient. Learning and inference will be done using Gibbs Sampling. I can’t go over Gibbs Sampling fully, but I’ll summarize the idea in the context of this work. The random variable we don’t know in this model are the type-level assignments \mathbf{T} = T_1,\ldots, T_n. In the context of Bayesian models, we are interested in the posterior P(\mathbf{T} | \mathbf{W}, \mathbf{w}), where \mathbf{W} and \mathbf{w} denote the word types in the vocabulary and the tokens of the corpus respectively; essentially, they’re both observed data.[5] We can obtain samples from this posterior by repeatedly sampling each of the T_i variables with the other assignments, denoted \mathbf{T}_{-i}, fixed. We sample T_i according to the posterior P(T_i | \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w}), which basically reprsents the following probability: If I assume all my other tag assignments are correct, what is the distribution for the tag assignment to the ith word. It’s relatively straightforward to show that if we continually update the sampling state \mathbf{T} one-tag-at-a-time in this way, at some point, the sampling state \mathbf{T} is drawn from the desired posterior P(\mathbf{T} | \mathbf{W}, \mathbf{w}).[6] So essentially, learning boils down to looping over tagging assignments and sampling values while all other decisions are fixed.

In the original HMM, when using Gibbs Sampling, the state consists of all token-level assignments of words to tags. So the number of variables you need to sample is proportional to the number of words in the corpus, which can be massive. In this model, we only need to sample a variable for each word type, which is substantially smaller, and importantly grows very slowly relative to the amount of data you want to learn on.

Okay, so learning with this model boils down to how to compute the local posterior:

 \begin{array}{cl}<br />
        P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})<br />
             \propto& P(T_i = t | \mathbf{T}_{-i}) P(W_i | T_i = t,\mathbf{T}_{-i},\mathbf{W}_{-i}) \\<br />
             & P(\mathbf{w} | T_i = t, \mathbf{T}_{-i})<br />
\end{array}

Let me break down each of these terms. The P(T_i = t | \mathbf{T}_{-i}) is straight-forward to compute; if we count all the other tag assignments, the probability of assigning T_i to t is given by,  \frac{n_{t} + \alpha}{n-1 + \alpha} where n_t is the number of tags in \mathbf{T}_{-i} which are currently assigned to t. The \alpha term is the smoothing concentration parameter.[7]

A similar reasoning is used to compute,

 P(W_i | T_i = t,\mathbf{T}_{-i}) = \prod_{(f,v) \in W_i} P(v | f, T_i = t, \mathbf{T}_{-i}, \mathbf{W}_{-i})

which decomposes a product over the various features on the word type. Each individual feature probability can be computed by using counts of how often a feature value is seen for other words assigned to the same tag.

The last term requires a little thinking. For the purpose of Gibbs Sampling, any probability term which doesn’t involve the thing we’re sampling, we can safely drop. At the token-level, the assignment of the ith word type to t only affects the local contexts in which the ith word type appears. Let’s use w to denote the ith word type. Each usage of w in the corpora are associated with a previous (before) word and a following (after) word.[8] Let’s use (b,w,a) to represent the before word, the word itself, and the after word; so (b,w,a) represents a trigram in the corpus. Let T(b) and T(a) denote the tag assignments to words b and a (this is given to us by \mathbf{T}). The only probability terms associated with this usage which not constant with respect to the T_i = t assignment are:

 P(w | T_i = t, \mathbf{T}_{-i}, \mathbf{w}_{-i}) P(t | T(b), \mathbf{T}) P(T(a) | t, \mathbf{T})

These terms are the probability of the word itself with the considered tag, the probability of transitioning to tag t from the tag assigned to the previous word, and transitioning to the tag assigned to the successor word. The only terms which are relevant to the assignment come from all the context usages of the ith word type.
Specifically, if C_i represents the multi-set of such context usages, we have P(\mathbf{w} | T_i=t, \mathbf{T}_{-i}) is proportional to a product of the terms
in each (b,w,a) usage. These probabilities can be computed by storing corpus level counts. Specifically for each word, we need counts of the (before, after) words as well as the counts for all individual words.

Finally, walking through the implementation!

Okay, so after a lot of prep work, we’re ready to dissect the code. I’m going to go linearly through the code and explain how each piece work. For reference, the full
script can be found here.

It’s all about counters

So one of the basic data abstractions you need for probabilistic computing is a counter.[9] Essentially, a counter is a map of items to their counts, that needs, for computing probabilities, to support a fast way to get the sum of all counts. Here’s the code snippet that declares the appropriate data structure as well as the important methods. The proper way to do this is to make Counter a protocol (which I’ve done in my NLP clojure library here):

The two functions here are the only two we need for a counter: inc-count increments a count and returns a new counter, and get-count returns the current count. Since in Gibbs Sampling, none of our counts should be negative, we add an important :post check on get-count which will likely catch bugs.

Dirichlet Distributions

Once we have the counter abstraction, it’s very straightforward to build a probability distribution; all the distributions here are over a finite number of possible events. This kind of distribution is called a multinomial. Here, we use a DiricheltMultinomial which represent the a multinomial drawn from the symmetric Dirichlet distribution, which essentially means that all outcomes are given “fake” counts to smooth the probabilities (i.e., ensure no probability becomes zero or too small). The kinds of things we want to do with a distribution, simply include asking for the log-probability[10] and making a weighted observation which changes the probabilities the distribution produces. Here’s the code. I’ll give more explanation and examples after:

Paragraph can be safely skipped: The probabilities we need from the DirichletMultinomial are actually the “predictive” probabilities obtained from integrating out the Dirichelt parameters. Specifically, suppose we have a distribution with n possible event outcomes and assume the multinomial over these n events are drawn \theta \sim Dirichlet(n, \alpha). Without observing any data, all n outcomes are equally likely. Now, suppose we have observed data \mathbf{X} and that n_i is the number of times, we have observed the ith outcome in \mathbf{X}. Then, we want the probability of a new event e^* given the observed data,

 P(e^* = i | \mathbf{X}) = \int_\theta P(\theta | \mathbf{X}) P(e^* = i | \theta) d \theta = \frac{n_i + \alpha}{\left(\sum_{i’} n_{i’}\right) + n * \alpha}

Given, a counter over events, we can efficiently compute a given probability. Each probability depends on knowing: the count of the event (get-count), the sum over all counts for all events (total from the counter), as well as the number of unique keys that this distribution could emit (num-keys). The reason we don’t just look at the number of keys in the counter is because we’re interested in the number of possible values; at any given time, we may not have counts for all possible events.

Making an observation to a distribution, in this context, just requires increment the count of the event so that subsequent calls to log-prob reflect this observation.

What’s in a word?

Okay, now that we have some standard code out of the way, we need to do some POS-specific code. I’m going ot use a record WordInfo which represents all the information we need about a word in order to efficiently support Gibbs Sampling inference. This information includes: a string of the word itself, its total count in the corpus, a map of the feature-type and feature-value pairs, and a counter over the pairs of the context words which occur before and after word (specifically it will be a counter over [before-word after-word] pairs). Here’s the code:

The get-feats function simply returns a map of the feature-type (a keyword here) and its value. You can easily edit this function to have other features and the rest of the code will just work.

Now that we have this data-structure, we need to build this data structure to represent the statistics from a large corpus. Okay, suppose that I want to update the word-info for a given word after having observed a usage. The only info we need from the usage is the before and after word:

Two things need to change: (1) We need to increment the total usage of the word (the :count field in WordInfo). (2) We need to increment the count of the before-and-after pair [before after] in the counter for the context usages. Here’s what I love about clojure: If you design your abstractions and functions correctly, they work seamlessly with the language. If you don’t know the -> threading macro: learn it, live it, love it. I think in conjunction with the update-in function, it allows for very succinct functions to update several piece of a complex data structure.

Okay, so let me show you the rest of the pieces which build of the word info data structures from a corpus (a seq of sentences):

What we want to do in tally-sent is update the word-info records for a given sentence. For this function, we have a map from a word string to its corresponding WordInfo record. The (partition 3 1 sent) produces a sequence of (before-word word after-word) trigrams which are all we need to update against. For each word in this triple, we ensure we have a WordInfo record (is there a assoc-if-absent function in core or contrib). And then we use our tally-usage function to update against the before and after word. Finally, we perform this update over all the sentences of a corpus in build-vocab.

Gibbs Sampling State

Let’s talk about how we represent the state of the Gibbs Sampler. Okay state is a dirty word in Clojure, and luckily the usage of state here is from Statistics and it represents an immutable value: for a given point in Gibbs Sampling, what are all the relevant assignments and the derived corpus counts from this assignment. Here’s the code:

I think the comments are sufficient here. The one thing that I should explain is that given the corpus and the type-assigns, all the other fields are determined and could theoretically be computing on the fly as needed. For efficiency however, it’s better to update those counts incrementally.

Updating Gibbs Sampling State After an assignment/unassignment

Now there are a lot of functions we need to write to support what happens when you add an assignment of a tag to a given word type or remove the assignment. These operations are the same, except when you make an assignment you are adding positive counts, and when you are unassigning, you remove counts. All these functions tend to take a weight to allow code reuse for these operations. Okay, so let’s take the case of updating the emission distribution associated with the tag which has been assigned/unassigned to a word-info. Two things need to change: we need to change the number of possible values the distribution can produce. If we are assigning the tag to the word, there is another possible outcome for the emission distribution; similarly we need to decrement if we are removing the assignment. Also, we need to observe the word the number of times it occurs in the corpus.

To be clear, tag-emission-distr is obtained from (get-in state [:emission-distrs , tag]) where state is an instance of State.

There are analogous functions for updating the counts for the feature distributions and for the transitions. I’ll briefly go over updating the transitions since it’s bit trickier. When we assign a word to a tag, we need to loop over the [before-word after-word] counts in the WordInfo and, depending on the current tagging assignment, change these counts. Here’s the code:

Gibbs Sampling Step

Okay, so let’s take a top-down perspective for looking at how we make a simple Gibbs Sampling step. We first take our current state, unassign the current assignment to a word, and then sample a new value from the distribution  P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w}):

I didn’t show you the assign and unassign functions. All they do is update the Gibbs Sampling state data structures to reflect the change in assignment for a given word as discussed above. They both are nice pure functions and return new states.

You also haven't seen score-assign and sample-from-scores, which I'll discuss now. score-assign will return something proportional to the log-probability of  P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w}). sample-from-scores will take these scores from the possible assignments and sample one.

Here's score-assign:

The (log-prob (:tag-prior state) tag) corresponds to P(T_i = t | \mathbf{T}_{-i}). The following sum form corresponds to the log of \prod_{(f,v) \in W_i} P(v | f, T_i), the probability of the bundle of features associated with a given word type conditioned on the tag. The last top-level form (headed by let) has all the token-level terms: P(w | \mathbf{T}, \mathbf{w}_{-i})^{n_i} \prod_{(b,a) \in C_i}  P(t | T(b), \mathbf{T}) P(T(a) | t, \mathbf{T}). That let statement needs to suppose that the tag assignment has already happened to correctly compute the probability of the word under the tag. The inner sum term for each [[before-word after-word] count] entry adds the log-probabilities for all these usages (I also lump in the word log-probability itself, although this could be in a separate term weighted with the total occurrence of the word).

Note that the time it takes to score a given assignment is proportional to the number of unique contexts in which a word appears.

Once we have this function, we need to sample proportionally to these log-probabilities. Here is some very standard machine learning code that would normally be in a standard library:

All the rest...

From here, I think the rest of the code is straightforward. An iteration of the code consists of sampling each word's assignment. There is a lot of code towards the end for initializing state. The complexity here is due to the fact that I need to initialize all maps with distributions with the correct number of possible keys. I hope this code make sense.

  1. [1] For each tag t, there are transition parameters \psi_t over successor tags, drawn from a Dirichlet distribution over K elements and hyper-parameter \alpha. These parameterize the P(t_{i+1} | t_i) distributions.
  2. [2] Each \theta_t is drawn from a symmetric Dirichlet prior over V elements and with concentration parameter \alpha (shared with transition parameters for simplicity).
  3. [3] This distribution is parametrized from a symmetric Dirichelt with hyper-parameter \beta over K possible tags.
  4. [4] For each tag and feature-type, the distribution is parametrized by a symmetric Dirichlet over all possible feature-values and hyper-parameter \beta.
  5. [5] Note that the token-level tags \mathbf{t} are determined by type-assignments \mathbf{T}, since each word can only have one tag which can generate it.
  6. [6] In practice, for any real problem, one doesn’t know when Gibbs Sampling, or MCMC in general, has “burned in”.
  7. [7] I don’t have the room to discuss this here, but this probability represents the “predictive” distribution obtained by integrating out the distribution parameters.
  8. [8] We pad each sentence with start and stop symbols to ensure this.
  9. [9] A lot of the names for these abstractions come from Dan Klein, my PhD advisor, but I’m pretty sure modulo the name, the abstractions are pretty universal from my survey of machine learning libraries.
  10. [10] To guard against numerical underflow, we work primarily with log-probabilities.
Posted in clojure, computer science, nlp | 10 Comments

State-Of-The-Art Unsupervised Part-Of-Speech Tagging in 300 lines of Clojure (from Scratch)

Recently, Yoong-Keok Lee, Regina Barzilay, and myself, published a paper on doing unsupervised part-of-speech tagging. I.e., how do we learn syntactic categories of words from raw text. This model is actually pretty simple relevant to other published papers and actually yields the best results on several languages. The C++ code for this project is available and can finish in under a few minutes for a large corpus.

Although the model is pretty simple, you might not be able to tell from the C++ code, despite Yoong being a top-notch coder. The problem is the language just doesn’t facilitate expressiveness the way my favorite language, Clojure, does. In fact the entire code for the model, without dependencies beyond the language and the standard library, clojure contrib, can be written in about 300 lines of code, complete with comments. This includes a lot of standard probabilistic computation utilities necessary for doing something like Gibbs Sampling, which is how inference is done here.

Without further ado, the code is on gisthub and github (in case I make changes).

Posted in computer science | 4 Comments

Computer and Computational Science

There’s a divide I’ve noticed amongst people lumped into a “computer science” department. Compactly, I think there are computer scientists and computational scientists; the knowledge base of these groups is rapidly diverging and CS departments should do a better job catering to each’s needs.

So what exactly is the difference? Well, it’s definitely a fuzzy distinction, but essentially a computational scientist works with data and her primary job is extracting useful information from it. Typically, a computational scientist requires a significant amount of statistical knowledge as well as usually a lot of knowledge from a particular domain in order to make use of data.

Take myself for example: my specialty is statistical natural language processing. My research essentially involves inducing structured data
from unstructured language data and this requires far more knowledge about statistics and linguistics than it does expertise with computer architecture, databases, or systems.

A computer scientist, on the other hand, well, is a computer scientist. Her daily bread is understanding the science of how computers run: low-level operating and embedded systems, tuning a database, scaling a web server, etc. So for instance, a post like this is all about the computer science.

Now, most computational scientists have to know a little bit about computer science in order to implement what it is she wants to do with data. Increasingly though,  advances, made by computer scientists, have enabled data scientists to do their job at higher levels of abstractions without having to think much about what computer scientists think about. These improvements range from the fact that you can make performant systems in higher-level languages  to frameworks like Hadoop that let a computational scientist focus on data and her domain.

There is plenty the areas share which justifies putting them in the same department: much of standard algorithms and computational theory I believe are still broadly relevant to both areas. Procedural thinking, for better or worse, is at the foundation of computer science as well as how we think about doing things with data.

Thinking about data and how to use it certainly isn’t new; statisticians have been doing it for centuries. What is new is the availability of large data and a focus on what actionable decision should be made with it. Computational science has certainly enjoyed a lot of recent success and growth. The New York Times recently called the area the new sexy job. The number of areas which can make use of computational science is growing and will continue to do so for a long time. Computational science will, hopefully, still be a big part of CS departments for a long time to come.

Here’s the issue though. I don’t think the educational curriculum of CS departments has adjusted itself to this growing area. Machine learning isn’t a standard part of the undergraduate curriculum; some instructors have converted their Artificial Intelligence courses into ML ones, but those aren’t always required either. A statistics course isn’t typically required; and no, bundling probability theory into the tail end of a discrete math course doesn’t count. I mean a course where a student does basic analytics on a larg-ish dataset, including things such as simple statistical tests, which are useful in a lot of surprising contexts. Many universities require physics and EE courses for the computer scientists, where is the equivalent statistics course for computational scientists?

A related problem with the standard CS (conflating computer and computational) curriculum is that it doesn’t really convey the broad range of potential CS applications: social science, biology, law, finance, linguistics, astronomy, even comparative literature. I think this is one of the most exciting things about doing CS and exploring these applications is important for budding young computational scientists. However, early CS courses focus on the nuts & bolts important to computer scientists: programming language details, data structures, low-level memory management, etc. I’m not sure it’s a fair analogy, but it’s as though your first year biology course focused on the structure and use of lab equipment; this week: bunsen burners. Clearly, you need a little computer science to do computational science, but I don’t think it needs to be buried so deep in the curriculum.

Posted in computer science | Tagged , | 3 Comments