Topic Modeling with Mahout
The Apache Machine Learning library
 

Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics. It has been successfully applied to modeling change in scientific fields over time (Griffiths and Steyvers, 2004; Hall et al., 2008).

A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over "topics", which are in turn distributions over words. For instance, a topic in a collection of newswire might give high probability to sports-related words such as "baseball", "home run", and "player", and a document about steroid use in baseball might mix the topics "sports", "drugs", and "politics". Note that the labels "sports", "drugs", and "politics" are post-hoc labels assigned by a human; the algorithm itself only associates words with probabilities. The task of parameter estimation in these models is to learn both what the topics are and which documents employ them in what proportions.
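As a rough illustration of the paragraph above, the sketch below represents each topic as a distribution over words and one document as a mixture over topics, then computes the probability of a word in that document as the mixture-weighted sum over topics. All numbers, the four-word vocabulary, and the names `topics`, `doc_topic_mix`, and `word_probability` are invented for illustration; they are not output from Mahout or from any trained model.

```python
# Hypothetical toy topics: each topic is a probability distribution over words.
# The topic names are human labels; a model would only see the numbers.
topics = {
    "sports":   {"baseball": 0.50, "player": 0.30, "steroid": 0.10, "senate": 0.10},
    "drugs":    {"baseball": 0.05, "player": 0.05, "steroid": 0.80, "senate": 0.10},
    "politics": {"baseball": 0.05, "player": 0.05, "steroid": 0.20, "senate": 0.70},
}

# Hypothetical document about steroid use in baseball: a distribution over topics.
doc_topic_mix = {"sports": 0.5, "drugs": 0.3, "politics": 0.2}


def word_probability(word, doc_mix, topic_word_probs):
    """Return p(word | document) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word_probs[topic].get(word, 0.0)
        for topic, weight in doc_mix.items()
    )


# 0.5 * 0.10 + 0.3 * 0.80 + 0.2 * 0.20
print(round(word_probability("steroid", doc_topic_mix, topics), 3))  # 0.33
```

Parameter estimation runs in the opposite direction: given only the observed words, it has to recover both tables (the per-topic word distributions and the per-document topic proportions).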

Another way to view a topic model is as a generalization of a mixture model like Dirichlet Process Clustering. Starting from an ordinary mixture model, in which we have a single global mixture of several distributions, we instead say that each document has its own mixture distribution over the globally shared mixture components. Operationally, in Dirichlet Process Clustering each document has its own latent variable, drawn from a global mixture, that specifies which model it belongs to, while in LDA each word in each document has its own latent variable drawn from a document-wide mixture.
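The contrast can be sketched as two toy generative procedures. This is a simplified illustration, not Mahout's implementation: it uses a finite mixture with a fixed number of components rather than a true Dirichlet process, and the helper names (`sample_dirichlet`, `draw_word`, `generate_mixture_doc`, `generate_lda_doc`) and the two example components are hypothetical.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw a symmetric Dirichlet(alpha) sample of size k via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def draw_word(component, rng):
    """Draw one word from a component given as a {word: probability} dict."""
    words = list(component)
    return rng.choices(words, weights=[component[w] for w in words])[0]


def generate_mixture_doc(global_mix, components, n_words, rng):
    """Mixture model: ONE latent variable per document, drawn from the global mixture."""
    z = rng.choices(range(len(components)), weights=global_mix)[0]
    words = [draw_word(components[z], rng) for _ in range(n_words)]
    return [z] * n_words, words


def generate_lda_doc(alpha, components, n_words, rng):
    """LDA-style: a per-document mixture, then one latent variable PER WORD."""
    theta = sample_dirichlet(alpha, len(components), rng)  # document-wide mixture
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(components)), weights=theta)[0]
        assignments.append(z)
        words.append(draw_word(components[z], rng))
    return theta, assignments, words


# Hypothetical shared components (topics) over a tiny vocabulary.
components = [
    {"baseball": 0.7, "player": 0.3},
    {"steroid": 0.6, "senate": 0.4},
]
rng = random.Random(0)
mix_assignments, mix_words = generate_mixture_doc([0.5, 0.5], components, 6, rng)
theta, lda_assignments, lda_words = generate_lda_doc(0.5, components, 6, rng)
```

In the mixture version every word in a document shares the same component index, whereas in the LDA-style version the indices in `lda_assignments` may differ from word to word, governed by that document's own `theta`.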

The idea behind a mixture model is that we use a probabilistic mixture of several models to explain some observed data. Each observed data point is assumed to have come from one of the models in the mixture, but we don't know which one. We deal with that by introducing a so-called latent parameter that specifies which model each data point came from.
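Because the latent parameter is unobserved, inference typically works with a probability for each possible value rather than a single answer. The sketch below computes, for one data point, the posterior probability that each component generated it, using Bayes' rule. The choice of one-dimensional Gaussian components and the names `gaussian_pdf` and `responsibilities` are assumptions made for illustration; the text above does not prescribe a particular component family.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior p(component k | x), proportional to weight_k * p(x | component k)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture with equal weights, centered at 0 and 4.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

near_first = responsibilities(0.0, weights, means, stds)  # strongly favors component 0
midpoint = responsibilities(2.0, weights, means, stds)    # equidistant, so an even split
```

A point near one component's mean is attributed almost entirely to that component, while a point halfway between two identical-width, equally weighted components is split evenly between them.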
