How LDA Works

Amazon SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of different categories. These categories are themselves a probability distribution over the features. LDA is a generative probability model, which means it attempts to provide a model for the distribution of outputs and inputs based on latent variables. This is opposed to discriminative models, which attempt to learn how inputs map to outputs.

You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in text corpora. Observations are referred to as documents, the feature set as the vocabulary, a feature as a word, and the resulting categories as topics.


Note: Lemmatization significantly increases algorithm performance and accuracy. Consider pre-processing any input text data.

An LDA model is defined by two parameters:

  • α—A prior estimate of topic probability (in other words, the average frequency with which each topic occurs within a given document).

  • β—A collection of k topics, where each topic is given a probability distribution over the vocabulary used in the document corpus, also called a "topic-word distribution."

LDA is a "bag-of-words" model, which means that the order of words does not matter. LDA is a generative model where each document is generated word-by-word by choosing a topic mixture θ ∼ Dirichlet(α).

For each word in the document:

  • Choose a topic z ∼ Multinomial(θ)

  • Choose the corresponding topic-word distribution β_z.

  • Draw a word w ∼ Multinomial(β_z).
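The generative process above can be sketched in a few lines of NumPy. The topic count, vocabulary size, and document length here are illustrative assumptions, not SageMaker defaults:

```python
import numpy as np

# Minimal sketch of the LDA generative process.
# k, V, doc_len, and alpha are illustrative values, not SageMaker defaults.
rng = np.random.default_rng(0)

k = 3            # number of topics
V = 8            # vocabulary size
doc_len = 20     # words to generate for one document

alpha = np.full(k, 0.5)                    # symmetric Dirichlet prior
beta = rng.dirichlet(np.ones(V), size=k)   # k topic-word distributions

# Choose a topic mixture theta ~ Dirichlet(alpha) for the document.
theta = rng.dirichlet(alpha)

words = []
for _ in range(doc_len):
    z = rng.choice(k, p=theta)             # topic z ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])           # word w ~ Multinomial(beta_z)
    words.append(w)

print(theta)   # the document's topic mixture
print(words)   # the generated "bag of words" (word indices)
```

Because the model is bag-of-words, only the counts of the generated word indices matter, not their order.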

When training the model, the goal is to find parameters α and β, which maximize the probability that the text corpus is generated by the model.

The most popular methods for estimating the LDA model use Gibbs sampling or Expectation Maximization (EM) techniques. Amazon SageMaker LDA instead uses tensor spectral decomposition, which provides several advantages:

  • Theoretical guarantees on results. The standard EM-method is guaranteed to converge only to local optima, which are often of poor quality.

  • Embarrassingly parallelizable. The work can be trivially divided over input documents in both training and inference. The EM-method and Gibbs Sampling approaches can be parallelized, but not as easily.

  • Fast. Although the EM-method has a low iteration cost, it is prone to slow convergence rates. Gibbs sampling is also subject to slow convergence rates and additionally requires a large number of samples.

At a high level, the tensor decomposition algorithm follows this process:

  1. The goal is to calculate the spectral decomposition of a V x V x V tensor, which summarizes the moments of the documents in our corpus. V is vocabulary size (in other words, the number of distinct words in all of the documents). The spectral components of this tensor are the LDA parameters α and β, which maximize the overall likelihood of the document corpus. However, because vocabulary size tends to be large, this V x V x V tensor is prohibitively large to store in memory.

  2. Instead, the algorithm uses the V x V moment matrix, which is the two-dimensional analog of the tensor from step 1, to find a whitening matrix of dimension V x k. This matrix can be used to convert the V x V moment matrix into a k x k identity matrix, where k is the number of topics in the model.

  3. This same whitening matrix can then be used to find a smaller k x k x k tensor. When spectrally decomposed, this tensor has components that have a simple relationship with the components of the V x V x V tensor.

  4. Alternating Least Squares is used to decompose the smaller k x k x k tensor. This provides a substantial improvement in memory consumption and speed. The parameters α and β can be found by “unwhitening” these outputs in the spectral decomposition.
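The whitening step (step 2) can be illustrated with a synthetic moment matrix. The eigendecomposition-based construction below is one standard way to build a whitening matrix; it is a sketch under assumed dimensions, not the SageMaker implementation:

```python
import numpy as np

# Sketch of the whitening step: from a V x V moment matrix M of rank k,
# build a V x k whitening matrix W such that W.T @ M @ W = I_k.
# M here is synthetic; it stands in for the moment matrix estimated
# from the document corpus.
rng = np.random.default_rng(0)

V, k = 50, 5
B = rng.random((V, k))
M = B @ B.T                      # symmetric PSD "moment" matrix of rank k

# Keep the top-k eigenpairs and scale by inverse square-root eigenvalues.
eigvals, eigvecs = np.linalg.eigh(M)
top = np.argsort(eigvals)[::-1][:k]
W = eigvecs[:, top] / np.sqrt(eigvals[top])   # V x k whitening matrix

I_k = W.T @ M @ W                # ~= k x k identity
print(np.allclose(I_k, np.eye(k), atol=1e-6))
```

Applying the same W to the three modes of the V x V x V moment tensor is what yields the much smaller k x k x k tensor of step 3.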

After the LDA model's parameters have been found, you can find the topic mixture for each document. For each document, you use stochastic gradient descent to find the topic mixture that maximizes the likelihood of observing that document's words under the model.
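As a rough sketch of that inference step, the following recovers a topic mixture by plain gradient ascent on a softmax reparameterization of θ. This is a stand-in for the stochastic gradient descent used in practice; β, the true mixture, and the document are all synthetic:

```python
import numpy as np

# Sketch: given known topic-word distributions beta, recover a document's
# topic mixture theta by maximizing the bag-of-words log-likelihood
#   L(theta) = sum_v n_v * log( (theta @ beta)_v )
# via gradient ascent on eta, where theta = softmax(eta).
rng = np.random.default_rng(1)

k, V = 3, 8
beta = rng.dirichlet(np.ones(V), size=k)      # known topic-word dists
true_theta = np.array([0.7, 0.2, 0.1])

# Word counts for a synthetic document drawn from the model.
counts = rng.multinomial(5000, true_theta @ beta)

eta = np.zeros(k)                             # theta = softmax(eta)
for _ in range(2000):
    theta = np.exp(eta) / np.exp(eta).sum()
    p = theta @ beta                          # per-word probabilities
    g = beta @ (counts / p)                   # dL/dtheta
    grad_eta = theta * (g - theta @ g)        # chain rule through softmax
    eta += 1e-4 * grad_eta

print(np.round(theta, 2))                     # should approach true_theta
```

The softmax reparameterization keeps θ on the probability simplex without an explicit projection step; SageMaker's actual optimizer is not reproduced here.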

Topic quality can be improved by increasing the number of topics to look for in training and then filtering out the poor-quality ones. SageMaker LDA does this automatically: 25% more topics are computed, and only those with the largest associated Dirichlet priors are returned. To perform further topic filtering and analysis, you can increase the topic count and modify the resulting LDA model as follows:

> import mxnet as mx
> alpha, beta = mx.ndarray.load('model.tar.gz')
> # modify alpha and beta
> mx.ndarray.save('new_model.tar.gz', [new_alpha, new_beta])
> # upload to S3 and create new SageMaker model using the console
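The prior-based filtering itself can be illustrated with plain NumPy arrays standing in for the loaded NDArrays. The α values and sizes below are made up for the example:

```python
import numpy as np

# Hypothetical illustration of topic filtering: keep only the topics
# with the largest Dirichlet prior weights. The alpha values and array
# shapes here are made up; plain NumPy stands in for mxnet NDArrays.
alpha = np.array([0.8, 0.1, 0.05, 0.6, 0.3])    # per-topic prior weights
beta = np.random.default_rng(0).dirichlet(np.ones(10), size=5)

k_keep = 3
keep = np.argsort(alpha)[::-1][:k_keep]          # indices of largest priors

new_alpha = alpha[keep]                          # filtered priors
new_beta = beta[keep]                            # matching topic-word rows

print(keep)         # -> [0 3 4]
print(new_alpha)    # -> [0.8 0.6 0.3]
```

Note that α and β must be filtered with the same index set so each retained prior still corresponds to its topic-word distribution.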

For more information about algorithms for LDA and the SageMaker implementation, see the following:

  • Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor Decompositions for Learning Latent Variable Models, Journal of Machine Learning Research, 15:2773–2832, 2014.

  • David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

  • Thomas L Griffiths and Mark Steyvers. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

  • Tamara G Kolda and Brett W Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, 2009.