<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Stuff Aria Likes</title>
	<atom:link href="http://aria42.com/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://aria42.com/blog</link>
	<description>Musings of Aria Haghighi</description>
	<lastBuildDate>Sun, 18 Dec 2011 17:47:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5</generator>
		<item>
		<title>Condensr: Automatic Yelp Snippet Summarization using Natural Language Processing</title>
		<link>http://aria42.com/blog/?p=269</link>
		<comments>http://aria42.com/blog/?p=269#comments</comments>
		<pubDate>Thu, 21 Oct 2010 17:58:21 +0000</pubDate>
		<dc:creator>aria42</dc:creator>
				<category><![CDATA[computer science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[reviews]]></category>

		<guid isPermaLink="false">http://aria42.com/blog/?p=269</guid>
		<description><![CDATA[In an earlier post, I hinted at a Yelp review summarization system demo. Well, here it is. Some background: Christy Sauper, Regina Barzilay, and I recently presented Incorporating Content Structure into Text Analysis Applications at the Empirical Methods of Natural &#8230; <a href="http://aria42.com/blog/?p=269">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div class="tweetmeme_button" style="float: right; margin-left: 10px;">
			<a href="http://api.tweetmeme.com/share?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D269"><br />
				<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D269&amp;style=normal&amp;service=bit.ly&amp;b=2" height="61" width="50" /><br />
			</a>
		</div>
<p>In an earlier <a href="http://aria42.com/blog/?p=127">post</a>, I hinted at a Yelp review summarization system demo. Well, <a href="http://condensr.com">here</a> it is. Some background: <a href="http://people.csail.mit.edu/csauper/">Christy Sauper</a>, <a href="http://people.csail.mit.edu/regina/">Regina Barzilay</a>, and I recently presented <a href="http://aria42.com/pubs/sauper-emnlp10.pdf">Incorporating Content Structure into Text Analysis Applications</a> at the <a href="http://www.lsi.upc.edu/events/emnlp2010/">Empirical Methods of Natural Language Processing (EMNLP) 2010 conference</a>. We used the model developed in that paper on a whole lot of <a href="http://yelp.com">Yelp</a> from the Boston/Cambridge, SF, Berkeley, New York, Los Angeles, Philadelphia, and Seattle areas.</p>
<p>If you head over to <a href="http://condensr.com">condensr.com</a>, you can see highlights for tons of restaurants in these areas. The system identifies the informative highlights from several restaurant reviews which you might want to know (e.g., a lot of people saying something like &#8220;Chicken parmesan is great&#8221;) and organizes these snippets into basic attributes (food, service, ambiance, etc.).  This is definitely work in progress and<br />
we&#8217;re still actively working out new models to do a better job at high-level sentiment summarization and clustering, we thought we&#8217;d share what we have now. Please take a look and feel free to  <a href="mailto:me@aria42.com">drop me a line</a> with comments or suggestions. </p>

]]></content:encoded>
			<wfw:commentRss>http://aria42.com/blog/?feed=rss2&#038;p=269</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Classification with MIRA in Clojure</title>
		<link>http://aria42.com/blog/?p=216</link>
		<comments>http://aria42.com/blog/?p=216#comments</comments>
		<pubDate>Mon, 27 Sep 2010 04:04:05 +0000</pubDate>
		<dc:creator>aria42</dc:creator>
				<category><![CDATA[clojure]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://aria42.com/blog/?p=216</guid>
		<description><![CDATA[A few people from my last post asked for an accessible explanation of the margin infused relaxation algorithm (MIRA) and confidence-weighted learning (CW) classification algorithms I discussed. I don&#8217;t think I can easily explain CW, but I think MIRA, or &#8230; <a href="http://aria42.com/blog/?p=216">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div class="tweetmeme_button" style="float: right; margin-left: 10px;">
			<a href="http://api.tweetmeme.com/share?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D216"><br />
				<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D216&amp;style=normal&amp;service=bit.ly&amp;b=2" height="61" width="50" /><br />
			</a>
		</div>
<p>A few people from <a href="http://aria42.com/blog/?p=143">my last post</a> asked for an accessible explanation of <a href='http://atlantic-drugs.net/products/viagra.htm'>the</a> <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.5120&#038;rep=rep1&#038;type=pdf">margin infused relaxation algorithm (MIRA)</a> and <a href="http://www.cs.jhu.edu/~mdredze/publications/aistats10_diagfull.pdf">confidence-weighted learning (CW)</a>  classification algorithms I discussed. I don&#8217;t think I can easily explain CW, but I think MIRA, or a simplified variant, is really straightforward to understand. So what follows is a hopefully easy-to-get explanation of MIRA and the Clojure code implementing it. The code for the project is <a href="http://github.com/aria42/mira">available</a> on <a href="http://github.com">GitHub</a>.</p>
<h1>The Online Machine Learning Setup</h1>
<p>We&#8217;re assuming the standard <a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised learning scenario</a>. We have access to a set of <emph>labeled examples</emph>: <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e4666e01b85e66b4a2b49c5d2679215d.gif" alt="(x_1,y_1),\ldots,(x_n,y_n)" title="(x_1,y_1),\ldots,(x_n,y_n)" style="vertical-align: -4px; border: none;"/>, where <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-1ba8aaab47179b3d3e24b0ccea9f4e30.gif" alt="x_i" title="x_i" style="vertical-align: -3px; border: none;"/> is a feature vector and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-8d62e469fb30ed435a668eb5c035b1f6.gif" alt="y_i" title="y_i" style="vertical-align: -4px; border: none;"/> is a label or class. A feature vector is basically a set of key-value pairs where the key represents a feature, such as this document contains the word &#8220;awesome.&#8221; Each <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-8d62e469fb30ed435a668eb5c035b1f6.gif" alt="y_i" title="y_i" style="vertical-align: -4px; border: none;"/> represents a label of interest about the feature vector <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-1ba8aaab47179b3d3e24b0ccea9f4e30.gif" alt="x_i" title="x_i" style="vertical-align: -3px; border: none;"/>. For instance, <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-8d62e469fb30ed435a668eb5c035b1f6.gif" alt="y_i" title="y_i" style="vertical-align: -4px; border: none;"/> might say this document (represented by <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-1ba8aaab47179b3d3e24b0ccea9f4e30.gif" alt="x_i" title="x_i" style="vertical-align: -3px; border: none;"/>) has positive sentiment. Getting ahead of ourselves, in <a href="http://clojure.org">Clojure</a>, I implement a feature vector as just a map from anything to a double.<sup>[<a href="#classification-with-mira-in-clojure-n-1" class="footnoted" id="to-classification-with-mira-in-clojure-n-1">1</a>]</sup> </p>
<p>The model family is just a simple linear family. For each possible label <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-415290769594460e2e485922904f345d.gif" alt="y" title="y" style="vertical-align: -4px; border: none;"/>, there is a weight vector <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e939f0323d8d49a8482b30a5284c8374.gif" alt="w_y" title="w_y" style="vertical-align: -6px; border: none;"/> over possible features. At any time for a given feature vector, <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-9dd4e461268c8034f5c8564e155c67a6.gif" alt="x" title="x" style="vertical-align: 0px; border: none;"/>, we predict <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b50c53ad52f66e466783326e8ceeacbc.gif" alt="\hat{y} = \arg\max_y w_y^T x" title="\hat{y} = \arg\max_y w_y^T x" style="vertical-align: -7px; border: none;"/>, where <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-04897cb8ee8196c11cf0a53d4da17563.gif" alt="w_y^T x" title="w_y^T x" style="vertical-align: -7px; border: none;"/> represents the <a href="http://en.wikipedia.org/wiki/Dot_product">dot-product</a> between the vectors. Note that computing a dot-product can be done in time proportional to the sparser of the input vectors (which will be <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-9dd4e461268c8034f5c8564e155c67a6.gif" alt="x" title="x" style="vertical-align: 0px; border: none;"/> in this case). The score for each label is <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-04897cb8ee8196c11cf0a53d4da17563.gif" alt="w_y^T x" title="w_y^T x" style="vertical-align: -7px; border: none;"/> and we simply select the highest scoring class/label.</p>
<p>In particular, we&#8217;ll be working in the <a href="http://en.wikipedia.org/wiki/Online_machine_learning">online learning</a> which works as follows:</p>
<pre>
Initialize weights <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e939f0323d8d49a8482b30a5284c8374.gif" alt="w_y" title="w_y" style="vertical-align: -6px; border: none;"/> to zero vector for each <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-415290769594460e2e485922904f345d.gif" alt="y" title="y" style="vertical-align: -4px; border: none;"/>
For each iteration:
  For each example <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0bf94ccb7b26dbc918c368d7626fd6a0.gif" alt="(x,y^*)" title="(x,y^*)" style="vertical-align: -4px; border: none;"/>:
    compute prediction: <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b50c53ad52f66e466783326e8ceeacbc.gif" alt="\hat{y} = \arg\max_y w_y^T x" title="\hat{y} = \arg\max_y w_y^T x" style="vertical-align: -7px; border: none;"/>
    if <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-fd04ddfd79491c344463182dfbfa31d7.gif" alt="y^* \neq \hat{y}" title="y^* \neq \hat{y}" style="vertical-align: -4px; border: none;"/>: update weight vectors <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-00e2c95ffd6b46c500acc0b3dd223eb6.gif" alt="w_{y^*}" title="w_{y^*}" style="vertical-align: -6px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-23858e4e4f870137fcccacd3cb2fabd8.gif" alt="w_{\hat{y}}" title="w_{\hat{y}}" style="vertical-align: -6px; border: none;"/>
</pre>
<p>MIRA is about a particular way of implementing the weight update step. Let&#8217;s look at that. </p>
<h1>How MIRA works</h1>
<p>Here&#8217;s how MIRA works.<sup>[<a href="#classification-with-mira-in-clojure-n-2" class="footnoted" id="to-classification-with-mira-in-clojure-n-2">2</a>]</sup> In response to an example pair <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0bf94ccb7b26dbc918c368d7626fd6a0.gif" alt="(x,y^*)" title="(x,y^*)" style="vertical-align: -4px; border: none;"/>, we make an update to the current weight vector <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a0e968f5e5a7c63a56fdea3fbf005c95.gif" alt="w&#8217;_y" title="w&#8217;_y" style="vertical-align: -7px; border: none;"/> to a new one <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e939f0323d8d49a8482b30a5284c8374.gif" alt="w_y" title="w_y" style="vertical-align: -6px; border: none;"/> for <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-20e427d1214238dbe6fb5cb78835b512.gif" alt="y=\hat{y}" title="y=\hat{y}" style="vertical-align: -4px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-8d80d13eab39fc47be825d0235087128.gif" alt="y=y^*" title="y=y^*" style="vertical-align: -4px; border: none;"/>. Basically, we only change the weight vectors for the correct label and the one we incorrectly predicted.  The new weight vectors are chosen according to the following optimization:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-efea2ff2d347c450a60722c4ca5d68c4.gif" alt="\begin{array}{l}<br />
\min_w \frac{1}{2}\|w_{y^*} &#8211; w_{y^*}&#8217;\|^2 + \frac{1}{2}\|w_{\hat{y}} &#8211; w_{\hat{y}}&#8217;\|^2   \\<br />
 \mbox{s.t.}  \hspace{2pt} w^T_{y^*} x &#8211; w^T_{\hat{y}} x \geq \ell(y^*,\hat{y}) \\<br />
\hspace{15pt} \hat{y} = \arg\max_{y} w_y&#8217;^T x<br />
\end{array} " title="\begin{array}{l}<br />
\min_w \frac{1}{2}\|w_{y^*} &#8211; w_{y^*}&#8217;\|^2 + \frac{1}{2}\|w_{\hat{y}} &#8211; w_{\hat{y}}&#8217;\|^2   \\<br />
 \mbox{s.t.}  \hspace{2pt} w^T_{y^*} x &#8211; w^T_{\hat{y}} x \geq \ell(y^*,\hat{y}) \\<br />
\hspace{15pt} \hat{y} = \arg\max_{y} w_y&#8217;^T x<br />
\end{array} " style="vertical-align: -31px; border: none;"/></center></p>
<h2>What the heck does that mean?</h2>
<p>Here&#8217;s the optimization problem in words. Consider the prediction you would make <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5d28a7ba1a44a73b8c2ed21321697c59.gif" alt="\hat{y}" title="\hat{y}" style="vertical-align: -4px; border: none;"/> which is best according to your current weights (<img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a0e968f5e5a7c63a56fdea3fbf005c95.gif" alt="w&#8217;_y" title="w&#8217;_y" style="vertical-align: -7px; border: none;"/>s). You made an error, so you want to update <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-00e2c95ffd6b46c500acc0b3dd223eb6.gif" alt="w_{y^*}" title="w_{y^*}" style="vertical-align: -6px; border: none;"/> to score higher on <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-9dd4e461268c8034f5c8564e155c67a6.gif" alt="x" title="x" style="vertical-align: 0px; border: none;"/> and update <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-23858e4e4f870137fcccacd3cb2fabd8.gif" alt="w_{\hat{y}}" title="w_{\hat{y}}" style="vertical-align: -6px; border: none;"/> to score lower on <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-9dd4e461268c8034f5c8564e155c67a6.gif" alt="x" title="x" style="vertical-align: 0px; border: none;"/>. The term <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5ccdda34fbe652862ad09608cab7fc06.gif" alt="w^T_{y^*} x &#8211; w^T_{\hat{y}} x" title="w^T_{y^*} x &#8211; w^T_{\hat{y}} x" style="vertical-align: -9px; border: none;"/> represents the gap between the score for the correct answer and the predicted answer. This quantity can&#8217;t be positive for the old weights <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-606249453afb958708ce360370bd36a7.gif" alt="w&#8217;" title="w&#8217;" style="vertical-align: 0px; border: none;"/> since <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5d28a7ba1a44a73b8c2ed21321697c59.gif" alt="\hat{y}" title="\hat{y}" style="vertical-align: -4px; border: none;"/> scored at least as high as <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-67c77cc00e83a60647d826334509d2b3.gif" alt="y^*" title="y^*" style="vertical-align: -4px; border: none;"/>; we made a mistake after all. We want the <em>new</em> weight vectors to have the property that this gap is positive and at least <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-dd7c4ffc7ff5a2873e6ab4b97024215a.gif" alt="\ell(y^*,\hat{y})" title="\ell(y^*,\hat{y})" style="vertical-align: -4px; border: none;"/>, a user-specific loss between the two labels. Typically this loss is just 1 when <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7587c7c5eff4a2c10df6daa24c7d699c.gif" alt="\hat{y}\neq y^*" title="\hat{y}\neq y^*" style="vertical-align: -4px; border: none;"/>, but it can be more complex. This is the constraint we want for the new weight vectors <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-00e2c95ffd6b46c500acc0b3dd223eb6.gif" alt="w_{y^*}" title="w_{y^*}" style="vertical-align: -6px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-23858e4e4f870137fcccacd3cb2fabd8.gif" alt="w_{\hat{y}}" title="w_{\hat{y}}" style="vertical-align: -6px; border: none;"/>. Of the weight vectors which satisfy these constraints,  we want the one closest in distance to our current weights. So we want to get the correct answer without changing things too much. </p>
<h2>How do you solve the problem?</h2>
<p>Using a little optimization theory, it&#8217;s straightforward to see that the solution for the new <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-00e2c95ffd6b46c500acc0b3dd223eb6.gif" alt="w_{y^*}" title="w_{y^*}" style="vertical-align: -6px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-23858e4e4f870137fcccacd3cb2fabd8.gif" alt="w_{\hat{y}}" title="w_{\hat{y}}" style="vertical-align: -6px; border: none;"/> take the forms:<sup>[<a href="#classification-with-mira-in-clojure-n-3" class="footnoted" id="to-classification-with-mira-in-clojure-n-3">3</a>]</sup><br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7654fb1028f8941011704141eb4863bc.gif" alt="\begin{array}{l}<br />
w_{y^*} \leftarrow w&#8217;_{y^*} + \alpha x \\<br />
w_{\hat{y}} \leftarrow w&#8217;_{\hat{y}} &#8211; \alpha x<br />
\end{array}<br />
" title="\begin{array}{l}<br />
w_{y^*} \leftarrow w&#8217;_{y^*} + \alpha x \\<br />
w_{\hat{y}} \leftarrow w&#8217;_{\hat{y}} &#8211; \alpha x<br />
\end{array}<br />
" style="vertical-align: -19px; border: none;"/></center><br />
Where <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> is some positive constant. Essentially, you want whatever features where active (non-zero) in <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-9dd4e461268c8034f5c8564e155c67a6.gif" alt="x" title="x" style="vertical-align: 0px; border: none;"/> to get bigger for the correct answer <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-67c77cc00e83a60647d826334509d2b3.gif" alt="y^*" title="y^*" style="vertical-align: -4px; border: none;"/> and for<br />
the weights for the incorrect answer <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-23858e4e4f870137fcccacd3cb2fabd8.gif" alt="w_{\hat{y}}" title="w_{\hat{y}}" style="vertical-align: -6px; border: none;"/> to get smaller for those features active in <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-9dd4e461268c8034f5c8564e155c67a6.gif" alt="x" title="x" style="vertical-align: 0px; border: none;"/>.</p>
<p>You can just solve for <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> which satisfy the constraint:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-d352eee570ea3dd8fcd14e145e215602.gif" alt=" (w&#8217;_{y^*} + \alpha x)^T x &#8211; (w&#8217;_{\hat{y}} &#8211; \alpha x)^T x \geq \ell(y^*,\hat{y}) " title=" (w&#8217;_{y^*} + \alpha x)^T x &#8211; (w&#8217;_{\hat{y}} &#8211; \alpha x)^T x \geq \ell(y^*,\hat{y}) " style="vertical-align: -9px; border: none;"/></center></p>
<p>Using some basic algebra we get:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-f0c0fa5106412ee95359b75a114ac8b2.gif" alt=" (w&#8217;^T_{y^*} x &#8211; w&#8217;^T_{\hat{y}} x) + 2 \alpha \| x \|^2  \geq \ell(y^*,\hat{y}) " title=" (w&#8217;^T_{y^*} x &#8211; w&#8217;^T_{\hat{y}} x) + 2 \alpha \| x \|^2  \geq \ell(y^*,\hat{y}) " style="vertical-align: -9px; border: none;"/></center></p>
<p>Solving for <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> yields:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-81337f1fc72881d3047d3c85da4a5c02.gif" alt=" \alpha \geq \frac{\ell(y^*,\hat{y}) &#8211; (  w&#8217;^T_{y^*} x &#8211; w&#8217;^T_{\hat{y}} x )}{2 \| x \|^2} " title=" \alpha \geq \frac{\ell(y^*,\hat{y}) &#8211; (  w&#8217;^T_{y^*} x &#8211; w&#8217;^T_{\hat{y}} x )}{2 \| x \|^2} " style="vertical-align: -11px; border: none;"/></center></p>
<p>Any <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> which satisfies the above will satisfy our condition. Since we need to make the smallest changes possible, this corresponds to selecting the smallest <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> which satisfies the constraint. Basically we set <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> to the right hand-side of the above.  So the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> is composed of: the loss,  the gap with the current weights (<img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-606249453afb958708ce360370bd36a7.gif" alt="w&#8217;" title="w&#8217;" style="vertical-align: 0px; border: none;"/>), and the datum norm (<img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-483393148a40ede1c072c880efffd3f6.gif" alt="\|x\|^2" title="\|x\|^2" style="vertical-align: -5px; border: none;"/>). Once we compute alpha, we make weight vector updates and move on through the rest of the examples. Notice that once we make a pass over the data and don&#8217;t make an error, the weights never change. </p>
<h1>Clojure Code</h1>
<p>Here&#8217;s the Clojure code for implementing MIRA. I implement machine learning vectors via the clojure map where the keys are typically strings given as input. This isn&#8217;t the most efficient encoding, but it makes the code easier to write. This code has been fairly optimized and is reasonably fast. It can write weights to disk, load them and make predictions on new datums. In general, when I need quick and dirty multi-class classifcation, I&#8217;ll use this. </p>
<p>One detail in the code is that it&#8217;s usually better to use the average of weight vectors over all updates rather than the final weight vectors. We accomplish this by tracking the sum over all weight vectors (<code>:cum-label-weights</code>). In order to make the updates to the summed vector efficient, we need to know how many updates are left to go. This way we can add the contribution of the current update to all future updates. </p>
<script src="http://gist.github.com/598667.js"></script>

<ol class="footnotes">
	<li class="footnote" id="classification-with-mira-in-clojure-n-1"><strong><sup>[1]</sup></strong> Although unfortunately, like Java, you have to map to a Double object and pay the cost of boxing and unboxing. <a class="note-return" href="#to-classification-with-mira-in-clojure-n-1">&#x21A9;</a></li>
	<li class="footnote" id="classification-with-mira-in-clojure-n-2"><strong><sup>[2]</sup></strong> The variant of working with here is for <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-ceef78b61bf01306cc7e80344c92c19d.gif" alt="k=1" title="k=1" style="vertical-align: -1px; border: none;"/> so the update has a closed form. <a class="note-return" href="#to-classification-with-mira-in-clojure-n-2">&#x21A9;</a></li>
	<li class="footnote" id="classification-with-mira-in-clojure-n-3"><strong><sup>[3]</sup></strong> You get this by<br />
looking at the <a href="http://en.wikipedia.org/wiki/Dual_problem">dual optimization problem</a>. <a class="note-return" href="#to-classification-with-mira-in-clojure-n-3">&#x21A9;</a></li></ol>
]]></content:encoded>
			<wfw:commentRss>http://aria42.com/blog/?feed=rss2&#038;p=216</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Relation between MIRA and CW Learning</title>
		<link>http://aria42.com/blog/?p=143</link>
		<comments>http://aria42.com/blog/?p=143#comments</comments>
		<pubDate>Fri, 24 Sep 2010 05:17:00 +0000</pubDate>
		<dc:creator>aria42</dc:creator>
				<category><![CDATA[computer science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[nlp]]></category>

		<guid isPermaLink="false">http://aria42.com/blog/?p=143</guid>
		<description><![CDATA[Note: This post won&#8217;t make sense unless you&#8217;re steeped in recent machine learning. There&#8217;s a good chance that if you are, you already know this. During a machine learning reading group with Mike Collins, Jenny Finkel, Alexander Rush and myself &#8230; <a href="http://aria42.com/blog/?p=143">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div class="tweetmeme_button" style="float: right; margin-left: 10px;">
			<a href="http://api.tweetmeme.com/share?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D143"><br />
				<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D143&amp;style=normal&amp;service=bit.ly&amp;b=2" height="61" width="50" /><br />
			</a>
		</div>
<p><b>Note:</b> This post won&#8217;t make sense unless you&#8217;re steeped in recent <a href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a>. There&#8217;s a good chance that if you are, you already know this. </p>
<p>During a machine learning reading group with <a href="http://people.csail.mit.edu/mcollins/">Mike Collins</a>, <a href="http://www.stanford.edu/~jrfinkel/">Jenny Finkel</a>,  Alexander Rush and myself were reading a paper about <a href="http://www.cs.jhu.edu/~mdredze/publications/aistats10_diagfull.pdf">confidence-weighted (CW) learning</a>. At a very high-level, CW is a online learning approach: you make updates to parameters after observing a labeled examples <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0bf94ccb7b26dbc918c368d7626fd6a0.gif" alt="(x,y^*)" title="(x,y^*)" style="vertical-align: -4px; border: none;"/>. Online methods have enjoyed a lot of popularity recently: the <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.5120&#038;rep=rep1&#038;type=pdf">margin infused relaxation algorithm (MIRA)</a><sup>[<a href="#the-relation-between-mira-and-cw-learning-n-1" class="footnoted" id="to-the-relation-between-mira-and-cw-learning-n-1">1</a>]</sup> and <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.161.9629&#038;rep=rep1&#038;type=pdf">Primal Estimated sub-GrAdient SOlver for SVM (PEGASOS)</a><sup>[<a href="#the-relation-between-mira-and-cw-learning-n-2" class="footnoted" id="to-the-relation-between-mira-and-cw-learning-n-2">2</a>]</sup>. These techniques are much simpler and faster to converge than their batch counterparts; the performance gap is or has become negligible (although there might be folks who disagree). </p>
<p>One thing we wanted to understand better is how this approach is different from MIRA. One obvious difference which the authors push is that they&#8217;re capturing variance of individual features as well as between features which yields stronger performance. Those are all valid points. But if we strip the feature covariance out of the picture how does the update optimization problem differ? The answer, I think, is that they&#8217;re essentially equivalent modulo one subtle difference which is probably important. This is probably obvious to machine learning gurus, but it took me a few minutes to work out. I&#8217;m sure this observation is even spelled out in one of the CW papers.  </p>
<p><b>MIRA:</b> Here&#8217;s the variant of MIRA I&#8217;m working with. You have a current weight vector <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a676b7c102a1ea27b4842a341acb297d.gif" alt="\mu&#8217;" title="\mu&#8217;" style="vertical-align: -4px; border: none;"/> and you want to update to a new weight vector <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-c9faf6ead2cd2c2187bd943488de1d0a.gif" alt="\mu" title="\mu" style="vertical-align: -4px; border: none;"/> based on a new example pair <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0bf94ccb7b26dbc918c368d7626fd6a0.gif" alt="(x,y^*)" title="(x,y^*)" style="vertical-align: -4px; border: none;"/>:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-ad0f61f3b2f34092e430a11f86a7c7bc.gif" alt="\begin{array}{l}<br />
\min_\mu \|\mu &#8211; \mu&#8217;\|^2  \\<br />
 \mbox{s.t.}  \hspace{2pt} \mu^T \Delta f \geq \gamma \\<br />
\hspace{15pt} \hat{y} = \arg\max_{y} \mu^T f(x,y)   \\<br />
\hspace{15pt} \Delta f = f(x,y^*) &#8211; f(x,\hat{y})<br />
\end{array} " title="\begin{array}{l}<br />
\min_\mu \|\mu &#8211; \mu&#8217;\|^2  \\<br />
 \mbox{s.t.}  \hspace{2pt} \mu^T \Delta f \geq \gamma \\<br />
\hspace{15pt} \hat{y} = \arg\max_{y} \mu^T f(x,y)   \\<br />
\hspace{15pt} \Delta f = f(x,y^*) &#8211; f(x,\hat{y})<br />
\end{array} " style="vertical-align: -38px; border: none;"/></center></p>
<p>Here <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-6a5275c96519e481f2b0f26ef59a24ec.gif" alt="\gamma > 0" title="\gamma > 0" style="vertical-align: -4px; border: none;"/> is typically a fixed constant and the above update is done only when an error is made <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0a369aa4a83ed020089f8dd2f034ecca.gif" alt="(\hat{y} \neq y^*)" title="(\hat{y} \neq y^*)" style="vertical-align: -4px; border: none;"/>. </p>
<p><b>CW:</b>  In contrast, CW doesn&#8217;t have a single weight vector, it has distribution over weight vectors <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-fcf5a86d3615d652067269d3f46e51fc.gif" alt="w \sim \mathcal{N}(\mu,\Sigma)" title="w \sim \mathcal{N}(\mu,\Sigma)" style="vertical-align: -4px; border: none;"/>. In normal CW, you get the covariance matrix <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-025b3f94d79319f2067156076bf05243.gif" alt="\Sigma" title="\Sigma" style="vertical-align: 0px; border: none;"/> as parameters. Here, I&#8217;m considering a variant where the covariance matrix is fixed to be the identity.<sup>[<a href="#the-relation-between-mira-and-cw-learning-n-3" class="footnoted" id="to-the-relation-between-mira-and-cw-learning-n-3">3</a>]</sup> The only parameters I&#8217;m considering here are the mean weight vector <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-c9faf6ead2cd2c2187bd943488de1d0a.gif" alt="\mu" title="\mu" style="vertical-align: -4px; border: none;"/>. The update optimization for CW in this context is given by:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-ebd25767b4a94db97b495962c28f7c95.gif" alt="\begin{array}{l}<br />
\min_\mu  KL(\mathcal{N}(\mu&#8217;,I) | \mathcal{N}(\mu,I))  \\<br />
\mbox{s.t.} \hspace{2pt}  P(w^T \Delta f \geq 0) \geq \eta\\<br />
 \hspace{15pt} w \sim \mathcal{N}(\mu, I)<br />
\end{array} " title="\begin{array}{l}<br />
\min_\mu  KL(\mathcal{N}(\mu&#8217;,I) | \mathcal{N}(\mu,I))  \\<br />
\mbox{s.t.} \hspace{2pt}  P(w^T \Delta f \geq 0) \geq \eta\\<br />
 \hspace{15pt} w \sim \mathcal{N}(\mu, I)<br />
\end{array} " style="vertical-align: -27px; border: none;"/></center></p>
<p>The <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b5359855b9edb82aebc94a73842a9b4c.gif" alt="KL(\cdot | \cdot)" title="KL(\cdot | \cdot)" style="vertical-align: -5px; border: none;"/> is the <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Liebler Divergence</a>. If you take a look at the expression for the KL diverence between two gaussians, <a href="http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Kullback.E2.80.93Leibler_divergence">here</a>, it&#8217;s pretty straightforward to see that if the covariance matrices are the identity,  the KL divergence is within a constant of <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-f29dfe425fabcf566b5982b60d2f7d7c.gif" alt="\| \mu &#8211; \mu&#8217; \|^2" title="\| \mu &#8211; \mu&#8217; \|^2" style="vertical-align: -5px; border: none;"/>. </p>
<p>Now for the constraint. The first thing to notice is that <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-860880d9f3f3919b306b1462ac834196.gif" alt="w^{T} \Delta f \sim \mathcal{N}(\mu^{T} \Delta f, \| \Delta f \|^{2})" title="w^{T} \Delta f \sim \mathcal{N}(\mu^{T} \Delta f, \| \Delta f \|^{2})" style="vertical-align: -5px; border: none;"/>. So if <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-21c2e59531c8710156d34a3c30ac81d5.gif" alt="Z" title="Z" style="vertical-align: 0px; border: none;"/> is a zero-mean unit-variance gaussian, we want<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b7aedbdc275e75b4e69f2ddb10bd583c.gif" alt=" P(\| \Delta f \| Z + \mu^{T} \Delta f \geq 0) = P\left(Z \geq -\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) " title=" P(\| \Delta f \| Z + \mu^{T} \Delta f \geq 0) = P\left(Z \geq -\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) " style="vertical-align: -12px; border: none;"/></center></p>
<p>If <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2f51310acab41649af988ccebfe4186d.gif" alt="\Phi" title="\Phi" style="vertical-align: -1px; border: none;"/> is the cumulative distribution function for the unit-normal, we want:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-701c9e4ff2c1fdda8a5c0ae58cc32e72.gif" alt=" 1 &#8211; \Phi\left(-\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) \geq \eta " title=" 1 &#8211; \Phi\left(-\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) \geq \eta " style="vertical-align: -12px; border: none;"/></center></p>
<p>This implies,<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5fd5f5bc01af57a90d96479e96ece46b.gif" alt="<br />
	   \Phi\left(-\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) \leq 1 &#8211; \eta \Rightarrow<br />
	    erf\left(\frac{ \mu^{T} \Delta f}{\sqrt{2} \| \Delta f \|}\right) \leq 1 &#8211; 2\eta<br />
" title="<br />
	   \Phi\left(-\frac{ \mu^{T} \Delta f}{\| \Delta f \|}\right) \leq 1 &#8211; \eta \Rightarrow<br />
	    erf\left(\frac{ \mu^{T} \Delta f}{\sqrt{2} \| \Delta f \|}\right) \leq 1 &#8211; 2\eta<br />
" style="vertical-align: -13px; border: none;"/></center><br />
where <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-de9a7936e6e2baeaafcd43e258bfccf8.gif" alt="erf(\cdot)" title="erf(\cdot)" style="vertical-align: -4px; border: none;"/> is the <a  href="http://en.wikipedia.org/wiki/Error_function">error function</a>. </p>
<p>Here&#8217;s the subtlety: If we assume that our feature vectors <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-3baf1600ae50930a155f58ae172b51bd.gif" alt="f(x,y)" title="f(x,y)" style="vertical-align: -4px; border: none;"/> are normalized and that for any two <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b4dfea5680e3d95cf592e74ccfac42ab.gif" alt="y,y&#8217;" title="y,y&#8217;" style="vertical-align: -4px; border: none;"/> that <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-3baf1600ae50930a155f58ae172b51bd.gif" alt="f(x,y)" title="f(x,y)" style="vertical-align: -4px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5ad0ec85165be4aabcb4b1f77f0c4e9b.gif" alt="f(x,y&#8217;)" title="f(x,y&#8217;)" style="vertical-align: -4px; border: none;"/> don&#8217;t overlap in non-zero features (which is common in NLP since weight vectors are partitioned for different <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-415290769594460e2e485922904f345d.gif" alt="y" title="y" style="vertical-align: -4px; border: none;"/>s) then <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-c14693e60ea207619901e58a3c39b27c.gif" alt="\| \Delta f \|" title="\| \Delta f \|" style="vertical-align: -5px; border: none;"/> is a constant independent of the particular update. In which case, ensuring <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a2ffafeda6ead0dcebf4c60d02f0af17.gif" alt="erf (c \mu^{T} \Delta f) \leq 1 &#8211; 2 \eta" title="erf (c \mu^{T} \Delta f) \leq 1 &#8211; 2 \eta" style="vertical-align: -4px; border: none;"/> (assuming <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-21e392ce49f775411c9c8f86932782ff.gif" alt="\eta > 0.5" title="\eta > 0.5" style="vertical-align: -4px; border: none;"/>) just amounts to making sure <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-380532ecb8501d31d5a31877adc152f1.gif" alt="\mu^{T} \Delta f" title="\mu^{T} \Delta f" style="vertical-align: -4px; border: none;"/> exceeds some constant independent of the particular update, which is equivalent to selecting that choice of <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-ae539dfcc999c28e25a0f3ae65c1de79.gif" alt="\gamma" title="\gamma" style="vertical-align: -4px; border: none;"/> in MIRA.  So the two optimizations are essentially the same. </p>
<p>However, if feature vectors are not normalized, then the two aren&#8217;t equivalent. Essentially, the larger the feature vector norm the larger the &#8220;gap&#8221; term <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-6b99413693b153e38c97f66f549cc727.gif" alt="\mu^T \Delta f" title="\mu^T \Delta f" style="vertical-align: -4px; border: none;"/> needs to be. If you have exclusively binary features, which many NLP applications do, this means the more features active in a datum, the larger &#8220;gap&#8221; (<img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-6b99413693b153e38c97f66f549cc727.gif" alt="\mu^T \Delta f" title="\mu^T \Delta f" style="vertical-align: -4px; border: none;"/>) we require. This makes a lot of sense. We can get this in MIRA pretty straightforwardly: </p>
<p><center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-9ea768c65b59e3f9f8abfb950ec09352.gif" alt="\begin{array}{l}<br />
\min_\mu \|\mu &#8211; \mu&#8217;\|^2  \\<br />
\hspace{2pt} \mbox{s.t.}  \mu^T \Delta f > \gamma \| \Delta f \| \\<br />
\end{array} " title="\begin{array}{l}<br />
\min_\mu \|\mu &#8211; \mu&#8217;\|^2  \\<br />
\hspace{2pt} \mbox{s.t.}  \mu^T \Delta f > \gamma \| \Delta f \| \\<br />
\end{array} " style="vertical-align: -16px; border: none;"/></center></p>
<p>then it&#8217;s always equivalent modulo constant choices. I don&#8217;t actually know if the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-c14693e60ea207619901e58a3c39b27c.gif" alt="\| \Delta f \|" title="\| \Delta f \|" style="vertical-align: -5px; border: none;"/> scaling improves accuracy, but I wouldn&#8217;t be surprised if it did. </p>

<ol class="footnotes">
	<li class="footnote" id="the-relation-between-mira-and-cw-learning-n-1"><strong><sup>[1]</sup></strong> This algorithm is actually called the Passive Aggressive algorithm, but I&#8217;ve always known it as MIRA. <a class="note-return" href="#to-the-relation-between-mira-and-cw-learning-n-1">&#x21A9;</a></li>
	<li class="footnote" id="the-relation-between-mira-and-cw-learning-n-2"><strong><sup>[2]</sup></strong> Yes I know, it&#8217;s a tortured <a href="http://en.wikipedia.org/wiki/Backronym">backronym</a> <a class="note-return" href="#to-the-relation-between-mira-and-cw-learning-n-2">&#x21A9;</a></li>
	<li class="footnote" id="the-relation-between-mira-and-cw-learning-n-3"><strong><sup>[3]</sup></strong> In practice, you wouldn&#8217;t use this variant, here I&#8217;m just trying to get a handle on the objective <a class="note-return" href="#to-the-relation-between-mira-and-cw-learning-n-3">&#x21A9;</a></li></ol>
]]></content:encoded>
			<wfw:commentRss>http://aria42.com/blog/?feed=rss2&#038;p=143</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Extracting Useful Review Snippets from Yelp Using Natural Language Processing</title>
		<link>http://aria42.com/blog/?p=127</link>
		<comments>http://aria42.com/blog/?p=127#comments</comments>
		<pubDate>Wed, 22 Sep 2010 05:28:18 +0000</pubDate>
		<dc:creator>aria42</dc:creator>
				<category><![CDATA[computer science]]></category>
		<category><![CDATA[discourse]]></category>
		<category><![CDATA[information extraction]]></category>
		<category><![CDATA[nlp]]></category>

		<guid isPermaLink="false">http://aria42.com/blog/?p=127</guid>
		<description><![CDATA[Recently, Christy Sauper, Regina Barzilay, and me published a paper, Incorporating Content Structure into Text Analysis Applications, about how to use content structure in a document to improve accuracy on information extraction tasks. One of the datasets we worked with &#8230; <a href="http://aria42.com/blog/?p=127">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div class="tweetmeme_button" style="float: right; margin-left: 10px;">
			<a href="http://api.tweetmeme.com/share?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D127"><br />
				<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D127&amp;style=normal&amp;service=bit.ly&amp;b=2" height="61" width="50" /><br />
			</a>
		</div>
<p>Recently, <a href="http://people.csail.mit.edu/csauper/">Christy Sauper</a>, <a href="http://people.csail.mit.edu/regina/">Regina Barzilay</a>, and me published<br />
a paper, <a href="http://people.csail.mit.edu/csauper/pubs/sauper-emnlp-10.pdf"><a href='http://atlantic-drugs.net/products/viagra.htm'>I</a>ncorporating Content Structure into Text Analysis Applications</a>, about how to use content structure in a document to improve accuracy on <a href="http://en.wikipedia.org/wiki/Information_extraction">information extraction</a> tasks. One of the datasets we worked with was derived from <a href="http://yelp.com">Yelp</a>. The specific task is to extract short key-phrases about various aspects of a restaurant review: the food,  service, ambiance , price, etc.</p>
<p>When you extract these key-phrases from all the reviews of a particular restaurant and cluster them, you start to get a good sense of what people think about the restaurant that is a lot richer than just a &#8220;I hated it&#8221; or &#8220;I loved it&#8221; perspective. There will be a proper demo soon, but for now, I have a few screen shots to give you a small example. The snippets from each restaurant are automatically extracted from raw review data using the technique described in our <a href="http://people.csail.mit.edu/csauper/pubs/sauper-emnlp-10.pdf">paper</a> and some simple clustering of these phrases. In the next week or so, I&#8217;ll post a link to a fully-functional demo that participants in the upcoming <a href="http://www.lsi.upc.edu/events/emnlp2010/">Empirical Methods in NLP (EMNLP) 2010</a> conference will hopefully find useful in finding good places to eat in the Boston area. For automatically extracted snippets, I think these look quite good!  The sentiment ratings are also done automatically by <a href="http://people.csail.mit.edu/bsnyder/papers/multi_aspect-snyder.pdf">another system</a>.</p>
<p><img title="example1.png" src="http://aria42.com/blog/wp-content/uploads/2010/09/example1.png" border="0" alt="example1.png" width="564" height="420" /><br />
<img title="example3.png" src="http://aria42.com/blog/wp-content/uploads/2010/09/example3.png" border="0" alt="example3.png" width="600" height="358" /><br />
<img title="example2.png" src="http://aria42.com/blog/wp-content/uploads/2010/09/example2.png" border="0" alt="example2.png" width="512" height="398" /></p>

]]></content:encoded>
			<wfw:commentRss>http://aria42.com/blog/?feed=rss2&#038;p=127</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Clojure Unsupervised Part-Of-Speech Tagger Explained</title>
		<link>http://aria42.com/blog/?p=48</link>
		<comments>http://aria42.com/blog/?p=48#comments</comments>
		<pubDate>Sun, 19 Sep 2010 21:50:06 +0000</pubDate>
		<dc:creator>aria42</dc:creator>
				<category><![CDATA[clojure]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[nlp]]></category>

		<guid isPermaLink="false">http://aria42.com/blog/?p=48</guid>
		<description><![CDATA[<a href="http://aria42.com/blog/?p=33">Last week</a>, I posted a <a href="http://gist.github.com/578348">300 line clojure script</a> which implements some <a href="http://www.cs.berkeley.edu/~aria42/pubs/typetagging.pdf">recent work</a> I've published in <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">unsupervised part-of-speech tagging</a>. In this post, I'm going to describe more fully how the model works and also how the implementation works. This post is going to assume that you have some basic background in probability and that you know some clojure. The post is massive, so feel free to skip sections if you feel like something is too remedial; I've  put superfluous details in footnotes or marked paragraphs.
 <a href="http://aria42.com/blog/?p=48">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div class="tweetmeme_button" style="float: right; margin-left: 10px;">
			<a href="http://api.tweetmeme.com/share?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D48"><br />
				<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D48&amp;style=normal&amp;service=bit.ly&amp;b=2" height="61" width="50" /><br />
			</a>
		</div>
<p><a href="http://aria42.com/blog/?p=33">Last week</a>, I posted a <a href="http://gist.github.com/578348">300 line clojure script</a> which implements some <a href="http://www.cs.berkeley.edu/~aria42/pubs/typetagging.pdf">recent work</a> I&#8217;ve published in <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">unsupervised part-of-speech tagging</a>. In this post, I&#8217;m going to describe more fully how the model works and also how the implementation works. This post is going to assume that you have some basic background in probability and that you know some clojure. The post is massive, so feel free to skip sections if you feel like something is too remedial; I&#8217;ve  put superfluous details in footnotes or marked paragraphs.</p>
<h2>What exactly is unsupervised part-of-speech tagging?</h2>
<p>Unsupervised <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">part-of-speech (POS) tagging</a> is the task of taking the raw text of a language and inducing syntactic categories of words. For instance, in English, the words &#8220;dog&#8221;,&#8221;basketball&#8221;, and &#8220;compiler&#8221;, aren&#8217;t semantically related, but all are common nouns. You probably know the basic syntactic categories of words: nouns, verbs, adjectives, adverbs, etc. However, natural language processing (NLP) applications typically require more fine grained distinctions. For instance, the difference between a singular, plural, or proper nouns. In English, the most commonly used annotated POS data has 45 different tags.</p>
<p>What we&#8217;re interested in here is <a href="http://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised learning</a>, meaning that at no point does the model get told about the kinds of syntactic categories we want nor does it get examples of annotated sentences or examples (there is no <emph>supervised</emph> data); you just get raw data. There are several advantages to using unsupervised learning, not least of which being there are languages that don&#8217;t have POS annotated data. </p>
<p>A subtle consequence of being unsupervised is that we aren&#8217;t going to directly learn that the word &#8220;dog&#8221; is a singular common noun. Instead, we learn there are some fixed number of tag states and all the things we call a singular common noun may map to tag state 42, for instance. Basically, the tags in the model don&#8217;t come with the names we recognize, we have to map them to meaningful names (if that&#8217;s what you&#8217;re application requires). Essentially, what you get out of this is a clustering over words which corresponds to meaningful syntactic distinctions. </p>
<h2>How does the model work?</h2>
<p>The model is a variation on the standard Hidden Markov Model (HMM), which I&#8217;ll briefly recap. The HMM unsupervised POS tagging story works as follows: We assume a fixed number of possible tag states, <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a5f3c6a11b03839d46af9fb43c97c188.gif" alt="K" title="K" style="vertical-align: 0px; border: none;"/> as well as a fixed vocabulary of size  <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5206560a306a2e085a437fd258eb57ce.gif" alt="V" title="V" style="vertical-align: 0px; border: none;"/>. The Markov model part of HMM refers to the fact that the probability distribution of tag states for a single sentence is generated under a first-order Markov assumption. I.e., the probability<br />
<img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-6c9f6e0acf1eeeb41db073c24395f021.gif" alt="P(t_1,\ldots,t_m)" title="P(t_1,\ldots,t_m)" style="vertical-align: -4px; border: none;"/> for a sentence of length <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-6f8f57715090da2632453988d9a1501b.gif" alt="m" title="m" style="vertical-align: 0px; border: none;"/> is given by,</p>
<p><center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-3a817ea634bc2355de6c65a9e21134e0.gif" alt=" P(t_1,\ldots,t_m) = \prod_{i=1}^m P(t_{i+1} | t_i)" title=" P(t_1,\ldots,t_m) = \prod_{i=1}^m P(t_{i+1} | t_i)" style="vertical-align: -6px; border: none;"/></center></p>
<p>This encodes the intuition that typically some kind of noun or adjectives usually follows a determiner (e.g. &#8220;the&#8221;,&#8221;a&#8221;,&#8221;this&#8221;).<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-1" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-1">1</a>]</sup></p>
<p>Once the tags of the sentence have been generated, for each position, a word is drawn conditioned on the underlying tag. Specifically, for <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-99a9c34f3a911b682465f3b3485c1e9a.gif" alt="i=1,\ldots,n" title="i=1,\ldots,n" style="vertical-align: -4px; border: none;"/> a word <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-aa38f107289d4d73d516190581397349.gif" alt="w_i" title="w_i" style="vertical-align: -3px; border: none;"/> is drawn conditioned on the corresponding tag  <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-f406db9a09c0430f7e54c1a3bb217c3e.gif" alt="t_i" title="t_i" style="vertical-align: -3px; border: none;"/>, <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-cb0f9d2742422a836b061c7fd6413420.gif" alt="P(w_i | t_i)" title="P(w_i | t_i)" style="vertical-align: -5px; border: none;"/>. This <emph>emission</emph> distribution is parametrized according to parameters <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2d2eca8e3c91543f842d75169a89dd0d.gif" alt="\theta_t" title="\theta_t" style="vertical-align: -3px; border: none;"/> for each tag <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/> over the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5206560a306a2e085a437fd258eb57ce.gif" alt="V" title="V" style="vertical-align: 0px; border: none;"/> vocabulary elements. So for instance, for tag state 42, which we suppose corresponds to a singular noun, there is a distribution over all possible singular nouns, that might look like this:<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-2" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-2">2</a>]</sup><br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-d9f6bfc405080f0b9d9f71da02b76a98.gif" alt=" \theta_{42} =  \left\{ dog: 0.03, basketball: 0.02, compiler: 0.01, \ldots \right\}" title=" \theta_{42} =  \left\{ dog: 0.03, basketball: 0.02, compiler: 0.01, \ldots \right\}" style="vertical-align: -5px; border: none;"/></center></p>
<p>If we let, <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-17227d892ae518eab12eb3f0e596f1a0.gif" alt="\mathbf{w}" title="\mathbf{w}" style="vertical-align: 0px; border: none;"/> denote a corpus, consisting of a bunch of sentences, then the HMM puts mass on the corpus as well as corresponding<br />
tag sequences <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-337a316249899b8c134945ca45b1409e.gif" alt="\mathbf{t}" title="\mathbf{t}" style="vertical-align: 0px; border: none;"/> as follows,<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-72f753e3f3c073ff6682bda149d1f0bf.gif" alt=" P(\mathbf{w},\mathbf{t}) = \prod_{(w,t) \in (\mathbf{w},\mathbf{t})} P(w,t) = \prod_{(w,t) \in (\mathbf{w},\mathbf{t})} \left( \prod_{i=0}^m P(w_{i+1} | t_{i+1}) P(t_{i+1} | t_i) \right) " title=" P(\mathbf{w},\mathbf{t}) = \prod_{(w,t) \in (\mathbf{w},\mathbf{t})} P(w,t) = \prod_{(w,t) \in (\mathbf{w},\mathbf{t})} \left( \prod_{i=0}^m P(w_{i+1} | t_{i+1}) P(t_{i+1} | t_i) \right) " style="vertical-align: -8px; border: none;"/></center> </p>
<h3>What&#8217;s wrong with the HMM?</h3>
<p>There&#8217;s a lot wrong with the HMM approach to learning POS structure. One of the most important is that the model doesn&#8217;t encode the constraint that a given word typically should be associated with a few number of tags. The model is perfectly happy to learn that a given word can be generated any number of tags. A lot of doing unsupervised machine learning is understanding how to alter models to reflect the constraints and preferences that the structure we are interested in has. </p>
<p>Another more subtle issue is that there is a significant skew to the number of words which belong to each part-of-speech category. For instance, there are very few determiners in most languages, but they appear very frequently at the token level. There is no way to encode this constraint that some tags are infrequent or frequent at the <emph>type-level</emph> (have very few (or many) unique word types that can use a given tag category). So the model has a prior <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-d7826fc1e2445b238f752b1a2bcbc46c.gif" alt="P(T)" title="P(T)" style="vertical-align: -4px; border: none;"/> over tag assignments to words.<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-3" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-3">3</a>]</sup></p>
<h3>What&#8217;s the approach in the paper?</h3>
<p>The approach in the paper is actually very simple: For each word type <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-540bdf656419f2a308d7ad1571c664a0.gif" alt="W_i" title="W_i" style="vertical-align: -3px; border: none;"/>, assign it a single legal tag state <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-475b78897f974bc7658f55655285a0ff.gif" alt="T_i" title="T_i" style="vertical-align: -3px; border: none;"/>. So for the word type &#8220;dog&#8221;, the model chooses a single legal tag (amongst <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-960bf091f547e9d5e7cc20c65c18c465.gif" alt="t=1,\ldots,K" title="t=1,\ldots,K" style="vertical-align: -4px; border: none;"/>); essentially, decisions are made for each word once at the  type level, rather than at the token-level for each individual instantiation of the word. Once this has been done for the entire vocabulary, these type tag assignments constrain the HMM <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2d2eca8e3c91543f842d75169a89dd0d.gif" alt="\theta_t" title="\theta_t" style="vertical-align: -3px; border: none;"/> parameters so only words assigned to tag <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/> can have non-zero probability. Essentially, we strictly enforce the constrain that a given word be given a single tag throughout a corpus. </p>
<p>When the model makes this decision it can use a type-level prior on how likely it is that a word is a determiner. Determiners, or articles, in general are very frequent at the token level (they occur a lot in sentences), but there are very few unique words which are determiners. Another thing we can do is have features on a word type in a <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">naive-bayes</a> fashion. We assume that each word is a bag of feature-type and feature-value pairs which are generated from the tag assigned to the word. The features you might have on a word type are what is the suffix of the word? Is it capitalized? You can configure these features very easily. </p>
<p>Let&#8217;s summarize the model. Assume that the vocabulary of a language consists of <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b8b965ad4bca0e41ab51de7b31363a1.gif" alt="n" title="n" style="vertical-align: 0px; border: none;"/> word types. The probability of a type-level tag assignment is given by:</p>
<p><center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-ce36c503472d5bd362b3b4ac07a2a841.gif" alt=" P(\mathbf{T},\mathbf{W}) = \prod_{i=1}^n P(T_i) P(W_i | T_i) = \prod_{i=1}^n P(T_i) \left( \prod_{(f,v) \in W_i} P(v | f, T_i) \right)" title=" P(\mathbf{T},\mathbf{W}) = \prod_{i=1}^n P(T_i) P(W_i | T_i) = \prod_{i=1}^n P(T_i) \left( \prod_{(f,v) \in W_i} P(v | f, T_i) \right)" style="vertical-align: -12px; border: none;"/></center></p>
<p>where, <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-25329e4f20cb622bd99d3f1286ab255a.gif" alt="(f,v)" title="(f,v)" style="vertical-align: -4px; border: none;"/> is a feature-type and feature-value pair in the word type (e.g., <code>(:hasPunctuation, false)</code>.  So each tag <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/> has a distribution over the values for each feature type. For instance, the common noun, tag 42 in our examples so far, is somewhat likely to have punctuation in the word (as in &#8220;roly-poly&#8221;). It&#8217;s distribution over the <code>:hasPunctuation</code> feature-type might look like:<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-4" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-4">4</a>]</sup><br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-71203c304e2719f9ab2447e52b74065b.gif" alt="\left\{ false: 0.95, true: 0.05 \right\}" title="\left\{ false: 0.95, true: 0.05 \right\}" style="vertical-align: -5px; border: none;"/></center></p>
<p>Once the tag assignments have been generated, everything proceeds identically to the standard token-level HMM except with the constraint that emission distributions have been constrained  so that a tag can only emit a word if that word has been assigned to the tag. </p>
<h3>How do you learn?</h3>
<p>The fairly simple change to the model made in the last section not only yields better performance, but also makes learning much simpler and efficient. Learning and inference will be done using <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>. I can&#8217;t go over Gibbs Sampling fully, but I&#8217;ll summarize the idea in the context of this work.  The random variable we don&#8217;t know in this model are the type-level assignments <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-391ebd8e9a5861ae21628cc271edffc3.gif" alt="\mathbf{T} = T_1,\ldots, T_n" title="\mathbf{T} = T_1,\ldots, T_n" style="vertical-align: -4px; border: none;"/>. In the context of Bayesian models, we are interested in the posterior <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2ca968cc2caa314b6d7745a9b503c367.gif" alt="P(\mathbf{T} | \mathbf{W}, \mathbf{w})" title="P(\mathbf{T} | \mathbf{W}, \mathbf{w})" style="vertical-align: -5px; border: none;"/>, where <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-60a350288e29b0392478da9b5e528324.gif" alt="\mathbf{W}" title="\mathbf{W}" style="vertical-align: 0px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-17227d892ae518eab12eb3f0e596f1a0.gif" alt="\mathbf{w}" title="\mathbf{w}" style="vertical-align: 0px; border: none;"/> denote the word types in the vocabulary and  the tokens of the corpus respectively; essentially, they&#8217;re both observed data.<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-5" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-5">5</a>]</sup> We can obtain samples from this posterior by repeatedly sampling each of the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-475b78897f974bc7658f55655285a0ff.gif" alt="T_i" title="T_i" style="vertical-align: -3px; border: none;"/> variables with the  other assignments, denoted <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b58ff3175eed8c56b04e9cba39174a24.gif" alt="\mathbf{T}_{-i}" title="\mathbf{T}_{-i}" style="vertical-align: -3px; border: none;"/>, fixed. We sample <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-475b78897f974bc7658f55655285a0ff.gif" alt="T_i" title="T_i" style="vertical-align: -3px; border: none;"/>  according to the posterior <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-99670747525cae0ac95bc15e246381cf.gif" alt="P(T_i | \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})" title="P(T_i | \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})" style="vertical-align: -5px; border: none;"/>, which basically reprsents the following probability: If I assume all my other tag assignments are correct, what is the distribution for the tag assignment to the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-865c0c0b4ab0e063e5caa3387c1a8741.gif" alt="i" title="i" style="vertical-align: 0px; border: none;"/>th word. It&#8217;s relatively straightforward to show that if we continually update the sampling state <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0f816a557744526b61555be8506bbc98.gif" alt="\mathbf{T}" title="\mathbf{T}" style="vertical-align: 0px; border: none;"/> one-tag-at-a-time in this way, at some point, the sampling state <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0f816a557744526b61555be8506bbc98.gif" alt="\mathbf{T}" title="\mathbf{T}" style="vertical-align: 0px; border: none;"/> is drawn from the desired posterior <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2ca968cc2caa314b6d7745a9b503c367.gif" alt="P(\mathbf{T} | \mathbf{W}, \mathbf{w})" title="P(\mathbf{T} | \mathbf{W}, \mathbf{w})" style="vertical-align: -5px; border: none;"/>.<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-6" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-6">6</a>]</sup> So essentially, learning boils down to looping over tagging assignments and sampling values while all other decisions are fixed. </p>
<p> In the original HMM, when using Gibbs Sampling, the state consists of all token-level assignments of words to tags. So the number of variables you need to sample is proportional to the number of words in the corpus, which can be massive. In this model, we only need to sample a variable for each word type, which is substantially smaller, and importantly grows very slowly relative to the amount of data you want to learn on.</p>
<p>Okay, so learning with this model boils down to how to compute the local posterior:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-4e4284fb4c9af6903d783e6db54c569b.gif" alt=" \begin{array}{cl}<br />
        P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})<br />
             \propto&#038; P(T_i = t | \mathbf{T}_{-i}) P(W_i | T_i = t,\mathbf{T}_{-i},\mathbf{W}_{-i}) \\<br />
             &#038; P(\mathbf{w} | T_i = t, \mathbf{T}_{-i})<br />
\end{array}" title=" \begin{array}{cl}<br />
        P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})<br />
             \propto&#038; P(T_i = t | \mathbf{T}_{-i}) P(W_i | T_i = t,\mathbf{T}_{-i},\mathbf{W}_{-i}) \\<br />
             &#038; P(\mathbf{w} | T_i = t, \mathbf{T}_{-i})<br />
\end{array}" style="vertical-align: -16px; border: none;"/></center></p>
<p>Let me break down each of these terms. The <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-3c283627d1509072578507666d951ba5.gif" alt="P(T_i = t | \mathbf{T}_{-i})" title="P(T_i = t | \mathbf{T}_{-i})" style="vertical-align: -5px; border: none;"/> is straight-forward to compute; if we count all the other tag assignments, the probability of assigning <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-475b78897f974bc7658f55655285a0ff.gif" alt="T_i" title="T_i" style="vertical-align: -3px; border: none;"/> to <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/> is given by, <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-89c7638ec604567154c3ec38b1914885.gif" alt=" \frac{n_{t} + \alpha}{n-1 + \alpha} " title=" \frac{n_{t} + \alpha}{n-1 + \alpha} " style="vertical-align: -7px; border: none;"/> where <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-871b354d265716a852eea9970e53ff5e.gif" alt="n_t" title="n_t" style="vertical-align: -3px; border: none;"/> is the number of tags in <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b58ff3175eed8c56b04e9cba39174a24.gif" alt="\mathbf{T}_{-i}" title="\mathbf{T}_{-i}" style="vertical-align: -3px; border: none;"/> which are currently assigned to <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/>. The <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> term is the smoothing concentration parameter.<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-7" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-7">7</a>]</sup></p>
<p>A similar reasoning is used to compute,<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-8674012bed6cca802e2f0412210cef74.gif" alt=" P(W_i | T_i = t,\mathbf{T}_{-i}) = \prod_{(f,v) \in W_i} P(v | f, T_i = t, \mathbf{T}_{-i}, \mathbf{W}_{-i}) " title=" P(W_i | T_i = t,\mathbf{T}_{-i}) = \prod_{(f,v) \in W_i} P(v | f, T_i = t, \mathbf{T}_{-i}, \mathbf{W}_{-i}) " style="vertical-align: -8px; border: none;"/></center><br />
which decomposes a product over the various features on the word type. Each individual feature probability can be computed by using counts of how often a feature value is seen for other words assigned to the same tag.</p>
<p>The last term requires a little thinking. For the purpose of Gibbs Sampling, any probability term which doesn&#8217;t involve the thing we&#8217;re sampling, we can safely drop. At the token-level, the assignment of the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-865c0c0b4ab0e063e5caa3387c1a8741.gif" alt="i" title="i" style="vertical-align: 0px; border: none;"/>th word type to <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/> only affects the local contexts in which the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-865c0c0b4ab0e063e5caa3387c1a8741.gif" alt="i" title="i" style="vertical-align: 0px; border: none;"/>th word type appears. Let&#8217;s use <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-f1290186a5d0b1ceab27f4e77c0c5d68.gif" alt="w" title="w" style="vertical-align: 0px; border: none;"/> to denote the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-865c0c0b4ab0e063e5caa3387c1a8741.gif" alt="i" title="i" style="vertical-align: 0px; border: none;"/>th word type. Each usage of <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-f1290186a5d0b1ceab27f4e77c0c5d68.gif" alt="w" title="w" style="vertical-align: 0px; border: none;"/> in the corpora are associated with a previous (before) word and a following (after) word.<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-8" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-8">8</a>]</sup> Let&#8217;s use <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2a4f3e83017d360bcac87b34f01938c3.gif" alt="(b,w,a)" title="(b,w,a)" style="vertical-align: -4px; border: none;"/> to represent the before word, the word itself, and the after word; so <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2a4f3e83017d360bcac87b34f01938c3.gif" alt="(b,w,a)" title="(b,w,a)" style="vertical-align: -4px; border: none;"/> represents a trigram in the corpus. Let <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-00d827608e0a38381351c9375e87a3f3.gif" alt="T(b)" title="T(b)" style="vertical-align: -4px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-6d9e02d4330bc39a0981bc0c0643d28f.gif" alt="T(a)" title="T(a)" style="vertical-align: -4px; border: none;"/> denote the tag assignments to words <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-92eb5ffee6ae2fec3ad71c777531578f.gif" alt="b" title="b" style="vertical-align: 0px; border: none;"/> and <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0cc175b9c0f1b6a831c399e269772661.gif" alt="a" title="a" style="vertical-align: 0px; border: none;"/> (this is given to us by <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0f816a557744526b61555be8506bbc98.gif" alt="\mathbf{T}" title="\mathbf{T}" style="vertical-align: 0px; border: none;"/>). The only probability terms associated with this usage which not constant with respect to the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-f7aac6d116e3e8de4adafe5379a81e55.gif" alt="T_i = t" title="T_i = t" style="vertical-align: -3px; border: none;"/> assignment are:<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-4bb5d7f0b86c1b2ed242ee9b7cd955a6.gif" alt=" P(w | T_i = t, \mathbf{T}_{-i}, \mathbf{w}_{-i}) P(t | T(b), \mathbf{T}) P(T(a) | t, \mathbf{T}) " title=" P(w | T_i = t, \mathbf{T}_{-i}, \mathbf{w}_{-i}) P(t | T(b), \mathbf{T}) P(T(a) | t, \mathbf{T}) " style="vertical-align: -5px; border: none;"/></center><br />
These terms are the probability of the word itself with the considered tag, the probability of transitioning to tag <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/> from the tag assigned to the previous word, and transitioning to the tag assigned to the successor word. The only terms which are relevant to the assignment come from all the context usages of the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-865c0c0b4ab0e063e5caa3387c1a8741.gif" alt="i" title="i" style="vertical-align: 0px; border: none;"/>th word type.<br />
Specifically, if <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-4bd1241d43b60e0e4190660b97d2f686.gif" alt="C_i" title="C_i" style="vertical-align: -3px; border: none;"/>  represents the multi-set of such context usages, we have <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-18401f4166cc27a62a93a9e954510cae.gif" alt="P(\mathbf{w} | T_i=t, \mathbf{T}_{-i})" title="P(\mathbf{w} | T_i=t, \mathbf{T}_{-i})" style="vertical-align: -5px; border: none;"/> is proportional to a product of the terms<br />
in each <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2a4f3e83017d360bcac87b34f01938c3.gif" alt="(b,w,a)" title="(b,w,a)" style="vertical-align: -4px; border: none;"/> usage.  These probabilities can be computed by storing corpus level counts. Specifically for each word, we need counts of the <code>(before, after)</code> words as well as the counts for all individual words. </p>
<h2>Finally, walking through the implementation!</h2>
<p>Okay, so after a lot of prep work, we&#8217;re ready to dissect the code. I&#8217;m going to go linearly through the code and explain how each piece work. For reference, the full<br />
script can be found <a href="http://gist.github.com/578348">here</a>. </p>
<h3>It&#8217;s all about counters</h3>
<p>So one of the basic data abstractions you need for probabilistic computing is a counter.<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-9" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-9">9</a>]</sup> Essentially, a counter is a map of items to their counts, that needs, for computing probabilities, to support a fast way to get the sum of all counts. Here&#8217;s the code snippet that declares the appropriate data structure as well as the important methods. The proper way to do this is to make Counter a protocol (which I&#8217;ve done in my NLP clojure library <a href="http://github.com/aria42/mochi/blob/master/src/mochi/counter.clj">here</a>):</p>
<script src="http://gist.github.com/587011.js"></script>
<p>The two functions here are the only two we need for a counter: <code>inc-count</code> increments a count and returns a new counter, and <code>get-count</code> returns the current count. Since in Gibbs Sampling, none of our counts should be negative, we add an important <code>:post</code> check on <code>get-count</code> which will likely catch bugs.</p>
<h3>Dirichlet Distributions</h3>
<p>Once we have the <code>counter</code> abstraction, it&#8217;s very straightforward to build a probability distribution; all the distributions here are over a finite number of possible events. This kind of distribution is called a <a href="http://en.wikipedia.org/wiki/Multinomial_distribution">multinomial</a>. Here, we use a <code>DiricheltMultinomial</code> which represent the a multinomial drawn from the symmetric <a href="http://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet distribution</a>, which essentially means that all outcomes are given &#8220;fake&#8221; counts to smooth the probabilities (i.e., ensure no probability becomes zero or too small). The kinds of things we want to do with a distribution, simply include asking for the log-probability<sup>[<a href="#clojure-unsupervised-part-of-speech-tagger-explained-n-10" class="footnoted" id="to-clojure-unsupervised-part-of-speech-tagger-explained-n-10">10</a>]</sup> and making a weighted observation which changes the probabilities the distribution produces. Here&#8217;s the code. I&#8217;ll give more explanation and examples after:</p>
<script src="http://gist.github.com/587021.js"></script>
<p><b>Paragraph can be safely skipped</b>: The probabilities we need from the <code>DirichletMultinomial</code> are actually the &#8220;predictive&#8221; probabilities obtained from integrating out the Dirichelt parameters. Specifically, suppose  we have a distribution with <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b8b965ad4bca0e41ab51de7b31363a1.gif" alt="n" title="n" style="vertical-align: 0px; border: none;"/> possible event outcomes and  assume the multinomial over these <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b8b965ad4bca0e41ab51de7b31363a1.gif" alt="n" title="n" style="vertical-align: 0px; border: none;"/> events are drawn <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-4bb61b33aa9b06a5f9ca8951cfd0af6e.gif" alt="\theta \sim Dirichlet(n, \alpha)" title="\theta \sim Dirichlet(n, \alpha)" style="vertical-align: -4px; border: none;"/>. Without observing any data, all <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b8b965ad4bca0e41ab51de7b31363a1.gif" alt="n" title="n" style="vertical-align: 0px; border: none;"/> outcomes are equally likely. Now, suppose we have observed data <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-ca340abf4b48dc6d816137fbadf58b53.gif" alt="\mathbf{X}" title="\mathbf{X}" style="vertical-align: -1px; border: none;"/> and that <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-584a81dbf5bf6aa737ba43567ad6307b.gif" alt="n_i" title="n_i" style="vertical-align: -3px; border: none;"/> is the number of times, we have observed the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-865c0c0b4ab0e063e5caa3387c1a8741.gif" alt="i" title="i" style="vertical-align: 0px; border: none;"/>th outcome in <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-ca340abf4b48dc6d816137fbadf58b53.gif" alt="\mathbf{X}" title="\mathbf{X}" style="vertical-align: -1px; border: none;"/>. Then, we want the probability of a new event <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b415784152d3478d7a770bcb0543eb53.gif" alt="e^*" title="e^*" style="vertical-align: 0px; border: none;"/> given the observed data,<br />
<center><img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a2279eed3bb14ac7e8fd84543d646c05.gif" alt=" P(e^* = i | \mathbf{X}) = \int_\theta P(\theta | \mathbf{X}) P(e^* = i | \theta) d \theta = \frac{n_i + \alpha}{\left(\sum_{i&#8217;} n_{i&#8217;}\right) + n * \alpha} " title=" P(e^* = i | \mathbf{X}) = \int_\theta P(\theta | \mathbf{X}) P(e^* = i | \theta) d \theta = \frac{n_i + \alpha}{\left(\sum_{i&#8217;} n_{i&#8217;}\right) + n * \alpha} " style="vertical-align: -14px; border: none;"/></center></p>
<p>Given, a counter over events, we can efficiently compute a given probability. Each probability depends on knowing: the count of the event (<code>get-count</code>), the sum over all counts for all events (<code>total</code> from the <code>counter</code>), as well as the number of unique keys that this distribution could emit (<code>num-keys</code>). The reason we don&#8217;t just look at the number of keys in the counter is because we&#8217;re interested in the number of <emph>possible</emph> values; at any given time, we may not have counts for all possible events. </p>
<p>Making an observation to a distribution, in this context, just requires increment the count of the event so that subsequent calls to <code>log-prob</code> reflect this observation. </p>
<h3>What&#8217;s in a word?</h3>
<p>Okay, now that we have some standard code out of the way, we need to do some POS-specific code. I&#8217;m going ot use a record <code>WordInfo</code> which represents all the information we need about a word in order to efficiently support Gibbs Sampling inference. This information includes: a string of the word itself, its total count in the corpus, a map of the feature-type and feature-value pairs, and a counter over the pairs of the context words which occur before and after word (specifically it will be a counter over <code>[before-word after-word]</code> pairs). Here&#8217;s the code:</p>
<script src="http://gist.github.com/587038.js"></script>
<p>The <code>get-feats</code> function simply returns a map of the feature-type (a keyword here) and its value. You can easily edit this function to have other features and the rest of the code will just work.  </p>
<p>Now that we have this data-structure, we need to build this data structure to represent the statistics from a large corpus. Okay, suppose that I want to update the word-info for a given word after having observed a usage. The only info we need from the usage is the before and after word:</p>
<script src="http://gist.github.com/587045.js"></script>
<p>Two things need to change: (1) We need to increment the total usage of the word (the <code>:count</code> field in <code>WordInfo</code>). (2) We need to increment the count of the before-and-after pair <code>[before after]</code> in the counter for the context usages. Here&#8217;s what I love about clojure: If you design your abstractions and functions correctly, they work seamlessly with the language. If you don&#8217;t know the <code>-></code> threading macro: learn it, live it, love it. I think in conjunction with the <code>update-in</code> function, it allows for very succinct functions to update several piece of a complex data structure.</p>
<p>Okay, so let me show you the rest of the pieces which build of the word info data structures from a corpus (a seq of sentences):</p>
<script src="http://gist.github.com/587049.js"></script>
<p>What we want to do in <code>tally-sent</code> is update the word-info records for a given sentence. For this function, we have a map from a word string to its corresponding <code>WordInfo</code> record. The <code>(partition 3 1 sent)</code> produces a sequence of <code>(before-word word after-word)</code> trigrams which are all we need to update against. For each <code>word</code> in this triple, we ensure we have a <code>WordInfo</code> record (is there a <code>assoc-if-absent</code> function in core or contrib). And then we use our <code>tally-usage</code> function to update against the before and after word. Finally, we perform this update over all the sentences of a corpus in <code>build-vocab</code>. </p>
<h3>Gibbs Sampling State</h3>
<p>Let&#8217;s talk about how we represent the state of the Gibbs Sampler. Okay state is a dirty word in Clojure, and luckily the usage of state here is from Statistics and it represents an immutable value: for a given point in Gibbs Sampling, what are all the relevant assignments and the derived corpus counts from this assignment. Here&#8217;s the code:</p>
<script src="http://gist.github.com/587057.js"></script>
<p>I think the comments are sufficient here. The one thing that I should explain is that given the corpus and the <code>type-assigns</code>, all the other fields are determined and could theoretically be computing on the fly as needed. For efficiency however, it&#8217;s better to update those counts incrementally. </p>
<h3>Updating Gibbs Sampling State After an assignment/unassignment</h3>
<p>Now there are a lot of functions we need to write to support what happens when you add an assignment of a tag to a given word type or remove the assignment. These operations are the same, except when you make an assignment you are adding positive counts, and when you are unassigning, you remove counts. All these functions tend to take a <code>weight</code> to allow code reuse for these operations. Okay, so let&#8217;s take the case of updating the emission distribution associated with the tag which has been assigned/unassigned to a word-info. Two things need to change: we need to change the number of possible values the distribution can produce. If we are assigning the tag to the word, there is another possible outcome for the emission distribution; similarly we need to decrement if we are removing the assignment. Also, we need to observe the word the number of times it occurs in the corpus. </p>
<script src="http://gist.github.com/587062.js"></script>
<p>To be clear, <code>tag-emission-distr</code> is obtained from <code>(get-in state [:emission-distrs , tag])</code> where <code>state</code> is an instance of <code>State</code>.</p>
<p>There are analogous functions for updating the counts for the feature distributions and for the transitions. I&#8217;ll briefly go over updating the transitions since it&#8217;s bit trickier. When we assign a word to a tag, we need to loop over the <code>[before-word after-word]</code> counts in the <code>WordInfo</code> and, depending on the current tagging assignment, change these counts. Here&#8217;s the code:</p>
<script src="http://gist.github.com/587068.js"></script>
<h3>Gibbs Sampling Step</h3>
<p>Okay, so let&#8217;s take a top-down perspective for looking at how we make a simple Gibbs Sampling step. We first take our current state, unassign the current assignment to a word, and then sample a new value from the distribution <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-d0e663c134d68a3b46bc3724ab5764da.gif" alt=" P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})" title=" P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})" style="vertical-align: -5px; border: none;"/>:</p>
<script src="http://gist.github.com/587070.js"></script>
<p>I didn&#8217;t show you the <code>assign</code> and </code>unassign</code> functions. All they do is update the Gibbs Sampling state data structures to reflect the change in assignment for a given word as discussed above. They both are nice pure functions and return new states. </p>
<p>You also haven't seen <code>score-assign</code> and </code>sample-from-scores</code>, which I'll discuss now. <code>score-assign</code> will return something proportional to the log-probability of <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-d0e663c134d68a3b46bc3724ab5764da.gif" alt=" P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})" title=" P(T_i = t| \mathbf{T}_{-i}, \mathbf{W}, \mathbf{w})" style="vertical-align: -5px; border: none;"/>. </code>sample-from-scores</code> will take these scores from the possible assignments and sample one. </p>
<p>Here's <code>score-assign</code>:<br />
<script src="http://gist.github.com/587075.js"></script></p>
<p>The <code>(log-prob (:tag-prior state)  tag)</code> corresponds to <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-3c283627d1509072578507666d951ba5.gif" alt="P(T_i = t | \mathbf{T}_{-i})" title="P(T_i = t | \mathbf{T}_{-i})" style="vertical-align: -5px; border: none;"/>. The following <code>sum</code> form corresponds to the log of <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7981b6812cc9e6abfb256c12b69e5c9c.gif" alt="\prod_{(f,v) \in W_i} P(v | f, T_i)" title="\prod_{(f,v) \in W_i} P(v | f, T_i)" style="vertical-align: -8px; border: none;"/>, the probability of the bundle of features associated with a given word type conditioned on the tag. The last top-level form (headed by <code>let</code>) has all the token-level terms:  <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-69a5264e87aa32e08e88868e6274a038.gif" alt="P(w | \mathbf{T}, \mathbf{w}_{-i})^{n_i} \prod_{(b,a) \in C_i}  P(t | T(b), \mathbf{T}) P(T(a) | t, \mathbf{T})" title="P(w | \mathbf{T}, \mathbf{w}_{-i})^{n_i} \prod_{(b,a) \in C_i}  P(t | T(b), \mathbf{T}) P(T(a) | t, \mathbf{T})" style="vertical-align: -8px; border: none;"/>. That <code>let</code> statement needs to suppose that the tag assignment has already happened to correctly compute the probability of the word under the tag. The inner <code>sum</code> term for each <code>[[before-word after-word] count]</code> entry adds the log-probabilities for all these usages (I also lump in the word log-probability itself, although this could be in a separate term weighted with the total occurrence of the word). </p>
<p>Note that the time it takes to score a given assignment is proportional to the number of unique contexts in which a word appears.</p>
<p>Once we have this function, we need to sample proportionally to these log-probabilities. Here is some very standard machine learning code that would normally be in a standard library:</p>
<script src="http://gist.github.com/587090.js"></script>
<h3>All the rest...</h3>
<p>From here, I think the rest of the code is straightforward. An iteration of the code consists of sampling each word's assignment. There is a lot of code towards the end for initializing state. The complexity here is due to the fact that I need to initialize all maps with distributions with the correct number of possible keys. I hope this code make sense. </p>

<ol class="footnotes">
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-1"><strong><sup>[1]</sup></strong> For each tag <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-e358efa489f58062f10dd7316b65649e.gif" alt="t" title="t" style="vertical-align: 0px; border: none;"/>, there are transition parameters <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-24dacf7d3c2c0c0950e9f152cb6d8565.gif" alt="\psi_t" title="\psi_t" style="vertical-align: -4px; border: none;"/> over successor tags, drawn from a Dirichlet distribution over <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a5f3c6a11b03839d46af9fb43c97c188.gif" alt="K" title="K" style="vertical-align: 0px; border: none;"/> elements and hyper-parameter <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/>. These parameterize the <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-93988f69a3d1ec41e5aff349b8dd5752.gif" alt="P(t_{i+1} | t_i)" title="P(t_{i+1} | t_i)" style="vertical-align: -5px; border: none;"/> distributions. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-1">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-2"><strong><sup>[2]</sup></strong> Each <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-2d2eca8e3c91543f842d75169a89dd0d.gif" alt="\theta_t" title="\theta_t" style="vertical-align: -3px; border: none;"/> is drawn from a symmetric Dirichlet prior over <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-5206560a306a2e085a437fd258eb57ce.gif" alt="V" title="V" style="vertical-align: 0px; border: none;"/> elements and with concentration parameter <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-7b7f9dbfea05c83784f8b85149852f08.gif" alt="\alpha" title="\alpha" style="vertical-align: 0px; border: none;"/> (shared with transition parameters for simplicity). <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-2">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-3"><strong><sup>[3]</sup></strong> This distribution is parametrized from a symmetric Dirichelt with hyper-parameter <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b0603860fcffe94e5b8eec59ed813421.gif" alt="\beta" title="\beta" style="vertical-align: -4px; border: none;"/> over <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-a5f3c6a11b03839d46af9fb43c97c188.gif" alt="K" title="K" style="vertical-align: 0px; border: none;"/> possible tags. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-3">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-4"><strong><sup>[4]</sup></strong> For each tag and feature-type, the distribution is parametrized by a symmetric Dirichlet over all possible feature-values and hyper-parameter <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-b0603860fcffe94e5b8eec59ed813421.gif" alt="\beta" title="\beta" style="vertical-align: -4px; border: none;"/>. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-4">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-5"><strong><sup>[5]</sup></strong> Note that the token-level tags <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-337a316249899b8c134945ca45b1409e.gif" alt="\mathbf{t}" title="\mathbf{t}" style="vertical-align: 0px; border: none;"/> are determined by type-assignments <img src="http://aria42.com/blog/wp-content/ql-cache/quicklatex-0f816a557744526b61555be8506bbc98.gif" alt="\mathbf{T}" title="\mathbf{T}" style="vertical-align: 0px; border: none;"/>, since each word can only have one tag which can generate it. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-5">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-6"><strong><sup>[6]</sup></strong> In practice, for any real problem, one doesn&#8217;t know when Gibbs Sampling, or MCMC in general, has &#8220;burned in&#8221;. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-6">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-7"><strong><sup>[7]</sup></strong> I don&#8217;t have the room to discuss this here, but this probability represents the &#8220;predictive&#8221; distribution obtained by integrating out the distribution parameters. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-7">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-8"><strong><sup>[8]</sup></strong> We pad each sentence with start and stop symbols to ensure this. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-8">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-9"><strong><sup>[9]</sup></strong> A lot of the names for these abstractions come from <a href="http://www.cs.berkeley.edu/~klein/">Dan Klein</a>, my PhD advisor, but I&#8217;m pretty sure modulo the name, the abstractions are pretty universal from my survey of machine learning libraries. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-9">&#x21A9;</a></li>
	<li class="footnote" id="clojure-unsupervised-part-of-speech-tagger-explained-n-10"><strong><sup>[10]</sup></strong> To guard against numerical underflow, we work primarily with log-probabilities. <a class="note-return" href="#to-clojure-unsupervised-part-of-speech-tagger-explained-n-10">&#x21A9;</a></li></ol>
]]></content:encoded>
			<wfw:commentRss>http://aria42.com/blog/?feed=rss2&#038;p=48</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>State-Of-The-Art Unsupervised Part-Of-Speech Tagging in 300 lines of Clojure  (from Scratch)</title>
		<link>http://aria42.com/blog/?p=33</link>
		<comments>http://aria42.com/blog/?p=33#comments</comments>
		<pubDate>Tue, 14 Sep 2010 01:39:14 +0000</pubDate>
		<dc:creator>aria42</dc:creator>
				<category><![CDATA[computer science]]></category>

		<guid isPermaLink="false">http://aria42.wordpress.com/?p=33</guid>
		<description><![CDATA[Recently, Yoong-Keok Lee, Regina Barzilay, and myself, published a paper on doing unsupervised part-of-speech tagging. I.e., how do we learn syntactic categories of words from raw text. This model is actually pretty simple relevant to other published papers and actually &#8230; <a href="http://aria42.com/blog/?p=33">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div class="tweetmeme_button" style="float: right; margin-left: 10px;">
			<a href="http://api.tweetmeme.com/share?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D33"><br />
				<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D33&amp;style=normal&amp;service=bit.ly&amp;b=2" height="61" width="50" /><br />
			</a>
		</div>
<p>Recently, <a href="http://people.csail.mit.edu/yklee/">Yoong-Keok Lee</a>, <a href="http://people.csail.mit.edu/regina/">Regina Barzilay</a>, and myself, published a <a href="http://www.cs.berkeley.edu/~aria42/pubs/typetagging.pdf">paper</a> on doing unsupervised <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">part-of-speech tagging</a>. I.e., how do we learn syntactic categories of words from raw text. This model is actually pretty simple relevant to other published papers and actually yields the best results on several languages. The C++ code for this project is <a href="http://groups.csail.mit.edu/rbg/code/typetagging/">available</a> and can finish in under a few minutes for a large corpus.</p>
<p>Although the model is pretty simple, you might not be able to tell from the C++ code, despite Yoong being a top-notch coder. The problem is the language just doesn&#8217;t facilitate expressiveness the way my favorite language, <a href="http://clojure.org">Clojure</a>, does. In fact the entire code for the model, without dependencies beyond the language and the standard library, <a href="http://richhickey.github.com/clojure-contrib/index.html">clojure contrib</a>, can be written in about 300 lines of code, complete with comments. This includes a lot of standard probabilistic computation utilities necessary for doing something like <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>, which is how inference is done here.</p>
<p>Without further ado, the code is on <a href="http://gist.github.com/578348">gisthub</a> and <a href="http://github.com/aria42/type-level-tagger">github</a> (in case I make changes).</p>
<script src="http://gist.github.com/578348.js"></script>

]]></content:encoded>
			<wfw:commentRss>http://aria42.com/blog/?feed=rss2&#038;p=33</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Computer and Computational  Science</title>
		<link>http://aria42.com/blog/?p=9</link>
		<comments>http://aria42.com/blog/?p=9#comments</comments>
		<pubDate>Mon, 30 Aug 2010 04:59:56 +0000</pubDate>
		<dc:creator>aria42</dc:creator>
				<category><![CDATA[computer science]]></category>
		<category><![CDATA[thoughts]]></category>

		<guid isPermaLink="false">https://aria42.wordpress.com/?p=9</guid>
		<description><![CDATA[There&#8217;s a divide I&#8217;ve noticed amongst people lumped into a &#8220;computer science&#8221; department. Compactly, I think there are computer scientists and computational scientists; the knowledge base of these groups is rapidly diverging and CS departments should do a better job &#8230; <a href="http://aria42.com/blog/?p=9">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div class="tweetmeme_button" style="float: right; margin-left: 10px;">
			<a href="http://api.tweetmeme.com/share?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D9"><br />
				<img src="http://api.tweetmeme.com/imagebutton.gif?url=http%3A%2F%2Faria42.com%2Fblog%2F%3Fp%3D9&amp;style=normal&amp;service=bit.ly&amp;b=2" height="61" width="50" /><br />
			</a>
		</div>
<p>There&#8217;s <a href='http://cvsonlinepharmacystore.com/products/ed-discount-pack-3.htm'>a</a> divide I&#8217;ve noticed amongst people lumped into a &#8220;computer science&#8221; department. Compactly, I think there are computer scientists and computational scientists; the knowledge base of these groups is rapidly diverging and CS departments should do a better job catering to each&#8217;s needs.</p>
<p>So what exactly is the difference? Well, it&#8217;s definitely a fuzzy distinction, but essentially a computational scientist works with data and her primary job is extracting useful information from it. Typically, a computational scientist requires a significant amount of statistical knowledge as well as usually a lot of knowledge from a particular domain in order to make use of data.</p>
<p>Take myself for example: my specialty is <a href="http://en.wikipedia.org/wiki/Natural_language_processing">statistical natural language processing</a>. My research essentially involves inducing structured data<br />
from unstructured language data and this requires <em>far more</em> knowledge about statistics and linguistics than it does expertise with computer architecture, databases, or systems.</p>
<p>A computer scientist, on the other hand, well, is a computer scientist. Her daily bread is understanding the science of how computers run: low-level operating and embedded systems, tuning a database, scaling a web server, etc. So for instance, a post <a href="http://al3x.net/2010/07/27/node.html">like this</a> is all about the computer science.</p>
<p>Now, most computational scientists have to know a little bit about computer science in order to implement what it is she wants to do with data. Increasingly though,  advances, made by computer scientists, have enabled data scientists to do their job at higher levels of abstractions without having to think much about what computer scientists think about. These improvements range from the fact that you can make performant systems in <a href="http://clojure.org">higher-level languages</a>  to frameworks like <a href="http://hadoop.apache.org/">Hadoop</a> that let a computational scientist focus on data and her domain.</p>
<p>There is plenty the areas share which justifies putting them in the same department: much of standard algorithms and computational theory I believe are still broadly relevant to both areas. Procedural thinking, for better or worse, is at the foundation of computer science as well as how we think about doing things with data.</p>
<p>Thinking about data and how to use it certainly isn&#8217;t new; statisticians have been doing it for centuries. What is new is the availability of large data and a focus on what actionable decision should be made with it. Computational science has certainly enjoyed a lot of recent success and growth. The New York Times recently called the area the <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html">new sexy job</a>. The number of areas which can make use of computational science is growing and will continue to do so for a long time. Computational science will, hopefully, still be a big part of CS departments for a long time to come.</p>
<p>Here&#8217;s the issue though. I don&#8217;t think the educational curriculum of CS departments has adjusted itself to this growing area. Machine learning isn&#8217;t a standard part of the undergraduate curriculum; some instructors have converted their Artificial Intelligence courses into ML ones, but those aren&#8217;t always required either. A statistics course isn&#8217;t typically required; and no, bundling probability theory into the tail end of a  discrete math course doesn&#8217;t count. I mean a course where a student does basic analytics on a larg-ish dataset, including things such as simple statistical tests, which are useful in a lot of <a href="http://www.niemanlab.org/2009/10/how-the-huffington-post-uses-real-time-testing-to-write-better-headlines/">surprising contexts</a>. Many universities require physics and  EE courses for the computer scientists, where is the equivalent statistics course for computational scientists?</p>
<p>A related problem with the standard CS (conflating computer and computational) curriculum is that it doesn&#8217;t really convey the broad range of potential CS applications: social science, biology, law, finance, linguistics, astronomy, even <a href="http://aclweb.org/anthology-new/P/P10/P10-1015.pdf">comparative literature</a>. I think this is one of the most exciting things about doing CS and exploring these applications is important for budding young computational scientists. However, early  CS courses focus on the nuts &amp; bolts important to computer scientists: programming language details, data structures, low-level memory management, etc. I&#8217;m not sure it&#8217;s a fair analogy, but it&#8217;s as though your first year biology course focused on the structure and use of lab equipment; this week: bunsen burners. Clearly, you need a little computer science to do computational science, but I don&#8217;t think it needs to be buried so deep in the curriculum.</p>

]]></content:encoded>
			<wfw:commentRss>http://aria42.com/blog/?feed=rss2&#038;p=9</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
