Object2Vec Algorithm
Object2Vec is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in such a way that the semantics of the relationship between pairs of objects in the original space are preserved in the embedding space. You can use the learned embeddings, for example, to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.
Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the Amazon SageMaker BlazingText Algorithm. For a blog post that discusses how to apply Object2Vec to some practical use cases, see Introduction to Amazon SageMaker Object2Vec.
Topics
 Input/Output Interface for the Object2Vec Algorithm
 EC2 Instance Recommendation for the Object2Vec Algorithm
 Object2Vec Sample Notebooks
 How Object2Vec Works
 Object2Vec Hyperparameters
 Tune an Object2Vec Model
 Data Formats for Object2Vec Training
 Data Formats for Object2Vec Inference
 Encoder Embeddings for Object2Vec
Input/Output Interface for the Object2Vec Algorithm
You can use Object2Vec on many different input data types, including the following:

Sentence-sentence pairs

Labels-sequence pairs

Customer-customer pairs

Product-product pairs

Item review user-item pairs
Natively, Object2Vec currently supports two types of input:

Discrete tokens, which are represented as a list consisting of a single integer-id. For example, [10].

Sequences of discrete tokens, which are represented as a list of integer-ids. For example, [0, 12, 10, 13].
To transform the input data into the supported formats, you must preprocess it.
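As a concrete preprocessing sketch, the snippet below maps raw sentence pairs to the integer-id lists described above and serializes them as JSON Lines, with the two sequences in "in0" and "in1" and the relationship in "label" as in the Data Formats for Object2Vec Training topic listed above. The toy vocabulary and sentences are made-up examples.

```python
import json

# Toy vocabulary; in practice, build this from your own corpus.
vocab = {"<unk>": 0, "a": 1, "great": 2, "movie": 3, "terrible": 4, "film": 5}

def to_ids(sentence):
    """Map each whitespace-split token to its integer id (0 for unknown)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]

# (sentence, sentence, label) pairs; label 1 marks the pair as related.
pairs = [
    ("a great movie", "a great film", 1),
    ("a great movie", "a terrible film", 0),
]

# One JSON object per line: two token-id sequences plus their label.
lines = [
    json.dumps({"label": label, "in0": to_ids(s0), "in1": to_ids(s1)})
    for s0, s1, label in pairs
]
print("\n".join(lines))
```

Each resulting line is one training record, ready to be uploaded as a JSON Lines channel for the algorithm.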
The objects in each pair can be asymmetric. For example, they can be (token, sequence) pairs, (token, token) pairs, or (sequence, sequence) pairs. For token inputs, the algorithm supports simple embeddings as compatible encoders. For sequences of token vectors, the algorithm supports the following as encoders:

Average-pooled embeddings

Hierarchical convolutional neural networks (CNNs)

Multi-layered bidirectional long short-term memory networks (BiLSTMs)
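Of the encoders above, average pooling is the simplest: it collapses a variable-length sequence of token embeddings into one fixed-size vector by taking the element-wise mean. The following is a minimal sketch of that idea in plain Python; the embedding table, vocabulary size, and dimension are made-up illustrative values, not parameters learned by the algorithm.

```python
import random

random.seed(0)
embed_dim = 4
vocab_size = 16

# Toy embedding table: one dense vector per token id.
embeddings = [[random.uniform(-1, 1) for _ in range(embed_dim)]
              for _ in range(vocab_size)]

def average_pool(token_ids):
    """Encode a token-id sequence as the element-wise mean of its embeddings."""
    vectors = [embeddings[t] for t in token_ids]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Sequences of any length map to a vector of the same fixed size.
encoded = average_pool([0, 12, 10, 13])
assert len(encoded) == embed_dim
```

Because pooling ignores token order, it is the cheapest encoder; the CNN and BiLSTM encoders trade extra compute for sensitivity to word order and local structure.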
The input label for each pair can be a categorical label that expresses the relationship between the objects in the pair, or it can be a rating/score that expresses the strength of the similarity between the two objects. For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss function. You specify these loss functions with the output_layer hyperparameter.
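The choice of loss therefore comes down to how you set output_layer alongside the encoder hyperparameters. The sketch below assembles a hyperparameter dictionary for either task; the names (enc0_network, enc_dim, output_layer, and so on) follow the Object2Vec Hyperparameters reference listed in the topics above, but the specific values are illustrative assumptions, not recommendations.

```python
def object2vec_hyperparameters(task):
    """Return a sketch of an Object2Vec hyperparameter dict for the given task."""
    common = {
        "enc0_network": "bilstm",   # encoder for the first input in each pair
        "enc1_network": "hcnn",     # encoder for the second input
        "enc0_vocab_size": 50000,   # illustrative vocabulary sizes
        "enc1_vocab_size": 50000,
        "enc_dim": 256,             # illustrative embedding dimension
    }
    if task == "classification":
        # Categorical labels -> cross-entropy loss via a softmax output layer.
        common["output_layer"] = "softmax"
        common["num_classes"] = 2
    else:
        # Ratings/score labels -> mean squared error loss.
        common["output_layer"] = "mean_squared_error"
    return common
```

In a training job, a dictionary like this would be passed as the hyperparameters of a SageMaker Estimator configured with the Object2Vec algorithm image.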
EC2 Instance Recommendation for the Object2Vec Algorithm
Instance Recommendation for Training
To start, try running training on a CPU, using, for example, an ml.m5.2xlarge instance, or on a GPU, using, for example, an ml.p2.xlarge instance. Currently, the Object2Vec algorithm trains only on a single machine; however, it does support multiple GPUs.
Instance Recommendation for Inference
Inference requests from CPUs generally have a lower average latency than requests from GPUs because there is a tax on CPU-to-GPU communication when you use GPU hardware. However, GPUs generally have higher throughput for larger batches.
Object2Vec Sample Notebooks
For a sample notebook that uses the Amazon SageMaker Object2Vec algorithm to encode sequences into fixed-length embeddings, see Using Object2Vec to Encode Sentences into Fixed Length Embeddings. For a sample notebook that uses the Amazon SageMaker Object2Vec algorithm in the multi-label prediction setting to predict the genre of a movie from its plot description, see Movie genre prediction with Object2Vec Algorithm. For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in Amazon SageMaker, see Use Notebook Instances. After you have created a notebook instance and opened it, choose SageMaker Examples to see a list of Amazon SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.