How Factorization Machines Work
The prediction task for a factorization machine model is to estimate a function ŷ from a feature vector x to a target domain. This domain is real-valued for regression and binary for classification. The factorization machine model is supervised, so it requires a training dataset of pairs (x_{n}, y_{n}). The advantage of this model lies in its factorized parametrization, which captures pairwise feature interactions. It can be represented mathematically as follows:
$$\hat{y} = w_{0} + \sum_{i} w_{i} x_{i} + \sum_{i} \sum_{j>i} \langle v_{i}, v_{j} \rangle x_{i} x_{j}$$
Here x_{i} denotes the i^{th} component of the feature vector x.
The three terms in this equation correspond respectively to the three components of the model:

The w_{0} term represents the global bias.

The w_{i} linear terms model the strength of the i^{th} variable.

The <v_{i},v_{j}> factorization terms model the pairwise interaction between the i^{th} and j^{th} variables.
The global bias and linear terms are the same as in a linear model. The pairwise feature interactions are modeled in the third term as the inner product of the corresponding factors learned for each feature. The learned factors can also be viewed as embedding vectors, one per feature. For example, in a classification task, if a pair of features tends to co-occur more often in positively labeled samples, then the inner product of their factors will be large. In other words, their embedding vectors will be close to each other in cosine similarity. For more information about the factorization machine model, see Factorization Machines.
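The three-term model described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular library: the function names `fm_predict` and `fm_predict_fast`, the list-of-lists layout for the factor matrix `V`, and the sample values are all assumptions made for this example. The second function uses the well-known algebraic identity that rewrites the pairwise sum so it costs time linear in the number of features.

```python
from itertools import combinations


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def fm_predict(x, w0, w, V):
    """Direct evaluation: global bias + linear terms + pairwise terms.

    x  : feature values, length d
    w0 : global bias
    w  : linear weights, length d
    V  : d factor vectors (one per feature), each of length k
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    pairwise = sum(
        dot(V[i], V[j]) * x[i] * x[j]
        for i, j in combinations(range(len(x)), 2)  # all pairs with i < j
    )
    return w0 + linear + pairwise


def fm_predict_fast(x, w0, w, V):
    """Same value, using the identity
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ].
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        terms = [V[i][f] * x[i] for i in range(len(x))]
        s = sum(terms)
        pairwise += 0.5 * (s * s - sum(t * t for t in terms))
    return w0 + linear + pairwise


# Hypothetical parameters: d = 3 features, k = 2 factors per feature.
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, -0.2, 0.3]
V = [[1.0, 0.0], [0.5, 0.5], [1.0, 2.0]]

print(fm_predict(x, w0, w, V))
print(fm_predict_fast(x, w0, w, V))
```

Both functions should print the same value up to floating-point rounding. Only the pair of features 1 and 3 contributes to the interaction term here, because feature 2 is zero in this sample.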
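For regression tasks, the model is trained by minimizing the squared error between the model prediction ŷ_{n} and the target value y_{n}, averaged over the N training samples. This is known as the square loss:
$$L = \frac{1}{N} \sum_{n} \left( y_{n} - \hat{y}_{n} \right)^{2}$$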
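For a classification task, the model is trained by minimizing the cross entropy loss, also known as the log loss. With labels y_{n} in {0, 1}:
$$L = -\frac{1}{N} \sum_{n} \left[ y_{n} \log \sigma(\hat{y}_{n}) + (1 - y_{n}) \log \left( 1 - \sigma(\hat{y}_{n}) \right) \right]$$
where:
$$\sigma(\hat{y}_{n}) = \frac{1}{1 + e^{-\hat{y}_{n}}}$$
is the logistic function that maps the model output ŷ_{n} to a probability, and N is the number of training samples.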
For more information about loss functions for classification, see Loss functions for classification.
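The two training objectives described above can be sketched as follows. This is an illustrative example under stated assumptions, not a library implementation: the function names are hypothetical, classification labels are assumed to be 0 or 1, the raw model output is assumed to pass through a logistic sigmoid before the log loss is computed, and the `eps` clipping is an added numerical safeguard.

```python
import math


def square_loss(y_true, y_pred):
    """Mean squared error between targets and predictions (regression)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(y_true, scores, eps=1e-12):
    """Mean cross entropy for labels in {0, 1} and raw model scores.

    Probabilities are clipped to [eps, 1 - eps] so log() never sees 0;
    this clipping is an assumption of the sketch.
    """
    total = 0.0
    for y, s in zip(y_true, scores):
        p = min(max(sigmoid(s), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Regression: one exact prediction, one off by 2.
print(square_loss([1.0, 2.0], [1.0, 4.0]))

# Classification: a score of 0 maps to probability 0.5 for both samples.
print(log_loss([1, 0], [0.0, 0.0]))
```

A confident correct score (large positive for label 1, large negative for label 0) drives the log loss toward zero, while a confident wrong score is penalized heavily.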