Topic Modeling with Mahout

The Apache Machine Learning library

 

Latent Dirichlet Allocation (LDA; Blei et al., 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics. It has been successfully applied to model change in scientific fields over time (Griffiths and Steyvers, 2004; Hall et al., 2008).

A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over "topics", which are in turn distributions over words. For instance, in a collection of newswire articles, a "sports" topic might give high probability to words such as "baseball", "home run", and "player", and a document about steroid use in baseball might draw on the topics "sports", "drugs", and "politics". Note that the labels "sports", "drugs", and "politics" are post-hoc labels assigned by a human; the algorithm itself only associates words with probabilities. The task of parameter estimation in these models is to learn both what the topics are and which documents employ them in what proportions.
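The generative story described above can be sketched in a few lines of Python. This is only an illustration, not Mahout code: the `TOPICS` table, its probabilities, the `generate_document` helper, and the document mixture are all hypothetical values chosen to mirror the newswire example.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
# The labels ("sports", ...) are human-assigned; a model would only see the numbers.
TOPICS = {
    "sports": {"baseball": 0.5, "home run": 0.3, "player": 0.2},
    "drugs": {"steroid": 0.6, "test": 0.4},
    "politics": {"senate": 0.5, "hearing": 0.5},
}


def generate_document(topic_mixture, length, rng):
    """Generate (topic, word) pairs: pick a topic from the document's
    mixture, then pick a word from that topic's word distribution."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    pairs = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        word_dist = TOPICS[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        pairs.append((topic, word))
    return pairs


rng = random.Random(0)
# Assumed mixture for a document about steroid use in baseball.
doc_mixture = {"sports": 0.5, "drugs": 0.3, "politics": 0.2}
doc = generate_document(doc_mixture, 10, rng)
```

Parameter estimation runs this story in reverse: given only the words, it tries to recover both the topic table and each document's mixture.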

Another way to view a topic model is as a generalization of a mixture model like Dirichlet Process Clustering. Starting from a standard mixture model, in which we have a single global mixture of several distributions, we instead say that each document has its own mixture distribution over the globally shared mixture components. Operationally, in Dirichlet Process Clustering each document has its own latent variable, drawn from a global mixture, that specifies which model it belongs to, while in LDA each word in each document has its own latent variable, drawn from a document-wide mixture.
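The difference in where the latent variables live can be made concrete with a small sketch. It is a simplification under stated assumptions: the function names are hypothetical, the number of components is fixed (a true Dirichlet process would not fix it), and only the latent assignments are sampled, not the words.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mixture_assignments(num_docs, words_per_doc, global_weights, rng):
    """Mixture-model view: ONE latent component per document, drawn from a
    single global mixture; every word in the document shares it."""
    components = list(range(len(global_weights)))
    docs = []
    for _ in range(num_docs):
        z = rng.choices(components, weights=global_weights)[0]
        docs.append([z] * words_per_doc)
    return docs


def sample_lda_assignments(num_docs, words_per_doc, alpha, rng):
    """LDA view: each document draws its own mixture over the shared
    components, and each WORD draws its own latent component from it."""
    components = list(range(len(alpha)))
    docs = []
    for _ in range(num_docs):
        theta = sample_dirichlet(alpha, rng)  # document-wide mixture
        docs.append(rng.choices(components, weights=theta, k=words_per_doc))
    return docs


rng = random.Random(1)
mixture_docs = sample_mixture_assignments(4, 6, [0.2, 0.5, 0.3], rng)
lda_docs = sample_lda_assignments(4, 6, [1.0, 1.0, 1.0], rng)
```

In `mixture_docs` every document is a run of a single component index, whereas a document in `lda_docs` may mix several components.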

The idea behind a mixture model is to explain some observed data with a probabilistic mixture of several models. Each observed data point is assumed to have come from one of the models in the mixture, but we don't know which one. We deal with that by introducing a so-called latent parameter that specifies which model each data point came from.
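As a minimal illustration of reasoning about that latent parameter, the sketch below applies Bayes' rule to a two-component mixture and computes the probability that each component generated a given data point. The one-dimensional Gaussian components, their weights, and the function names are assumptions made for this example and do not come from Mahout.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability that each mixture component generated x:
    p(component k | x) is proportional to weight_k * p(x | component k)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Two equally weighted components centered at 0 and 2.
post = responsibilities(0.9, weights=[0.5, 0.5], means=[0.0, 2.0], stds=[1.0, 1.0])
```

Because 0.9 lies closer to the first component's mean, `post[0]` is larger than `post[1]`; a point at exactly 1.0 would be split evenly between the two.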
