Object2Vec Algorithm
The Amazon SageMaker Object2Vec algorithm is a highly customizable, general-purpose neural embedding algorithm. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned so that the semantics of the relationship between pairs of objects in the original space are preserved in the embedding space. For example, you can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.
Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the SageMaker BlazingText algorithm. For a blog post that discusses how to apply Object2Vec to some practical use cases, see Introduction to Amazon SageMaker Object2Vec.
Topics
- I/O Interface for the Object2Vec Algorithm
- EC2 Instance Recommendation for the Object2Vec Algorithm
- Object2Vec Sample Notebooks
- How Object2Vec Works
- Object2Vec Hyperparameters
- Tune an Object2Vec Model
- Data Formats for Object2Vec Training
- Data Formats for Object2Vec Inference
- Encoder Embeddings for Object2Vec
I/O Interface for the Object2Vec Algorithm
You can use Object2Vec on many input data types, including the following examples.
Input Data Type | Example |
---|---|
Sentence-sentence pairs | "A soccer game with multiple males playing." and "Some men are playing a sport." |
Labels-sequence pairs | The genre tags of the movie "Titanic", such as "Romance" and "Drama", and its short description: "James Cameron's Titanic is an epic, action-packed romance set against the ill-fated maiden voyage of the R.M.S. Titanic. She was the most luxurious liner of her era, a ship of dreams, which ultimately carried over 1,500 people to their death in the ice cold waters of the North Atlantic in the early hours of April 15, 1912." |
Customer-customer pairs | The customer ID of Jane and customer ID of Jackie. |
Product-product pairs | The product ID of football and product ID of basketball. |
Item review user-item pairs | A user's ID and the items she has bought, such as apple, pear, and orange. |
To transform the input data into the supported formats, you must preprocess it. Currently, Object2Vec natively supports two types of input:
- A discrete token, which is represented as a list of a single integer-id. For example, [10].
- A sequence of discrete tokens, which is represented as a list of integer-ids. For example, [0,12,10,13].
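As a rough illustration of the preprocessing step, the following sketch maps raw text to the two input types above. The vocabulary scheme (ids assigned in order of first appearance) and the helper names `build_vocab`, `encode_sequence`, and `encode_token` are hypothetical and are not part of Object2Vec. Any consistent token-to-integer mapping would serve the same purpose.

```python
def build_vocab(sentences):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            # setdefault only inserts the token if it has not been seen yet
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_sequence(sentence, vocab):
    """Represent a sequence of discrete tokens as a list of integer-ids."""
    return [vocab[token] for token in sentence.lower().split()]


def encode_token(token, vocab):
    """Represent a single discrete token as a list holding one integer-id."""
    return [vocab[token.lower()]]


sentences = ["some men are playing a sport", "men are playing soccer"]
vocab = build_vocab(sentences)

sequence_ids = encode_sequence(sentences[1], vocab)  # a sequence input
token_ids = encode_token("soccer", vocab)            # a single-token input
```

Tokens that are missing from the vocabulary raise a `KeyError` in this sketch. A real pipeline would usually reserve an id for unknown tokens.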
The objects in each pair can be asymmetric. For example, the pairs can be (token, sequence), (token, token), or (sequence, sequence). For token inputs, the algorithm supports simple embeddings as compatible encoders. For sequences of token vectors, the algorithm supports the following encoders:
- Average-pooled embeddings
- Hierarchical convolutional neural networks (CNNs)
- Multi-layered bidirectional long short-term memory networks (BiLSTMs)
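To make the simplest of the sequence encoders listed above concrete, the following sketch computes an average-pooled embedding in plain Python. It is an illustration only: the embedding table here is a small hand-written list, whereas in Object2Vec the embeddings are learned during training, and the function name `average_pool` is hypothetical.

```python
def average_pool(token_ids, embedding_table):
    """Return the element-wise mean of the embedding vectors for token_ids."""
    if not token_ids:
        raise ValueError("token_ids must contain at least one integer-id")
    dim = len(embedding_table[0])
    pooled = [0.0] * dim
    for token_id in token_ids:
        for j, value in enumerate(embedding_table[token_id]):
            pooled[j] += value
    return [total / len(token_ids) for total in pooled]


# Toy embedding table: row i is the 2-dimensional vector for integer-id i.
embedding_table = [
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 4.0],
]

pooled = average_pool([0, 2], embedding_table)  # mean of rows 0 and 2
single = average_pool([1], embedding_table)     # a one-token sequence
```

Average pooling produces a fixed-size vector regardless of sequence length, but it discards token order. The CNN and BiLSTM encoders are the options that can take order into account.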
The input label for each pair can be one of the following:
- A categorical label that expresses the relationship between the objects in the pair
- A score that expresses the strength of the similarity between the two objects
For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss function. Specify these loss functions with the output_layer hyperparameter when you create the model training job.
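The two loss functions can be sketched in a few lines of plain Python. This is a textbook formulation for a single example or a small batch, not the algorithm's internal implementation. The `output_layer` values shown in the two dictionaries (`softmax` and `mean_squared_error`) are stated here as an assumption and should be checked against the Object2Vec Hyperparameters topic.

```python
import math


def cross_entropy(probabilities, label):
    """Cross-entropy loss for one example: -log of the true class probability."""
    return -math.log(probabilities[label])


def mean_squared_error(predictions, targets):
    """Mean of the squared differences between predicted and target scores."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(targets)


# Classification: predicted class probabilities and the index of the true class.
ce = cross_entropy([0.1, 0.7, 0.2], 1)

# Regression: predicted similarity scores and the labeled scores.
mse = mean_squared_error([4.0, 2.5], [5.0, 2.5])

# Assumed hyperparameter values for selecting each loss (verify before use).
classification_hyperparameters = {"output_layer": "softmax"}
regression_hyperparameters = {"output_layer": "mean_squared_error"}
```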
EC2 Instance Recommendation for the Object2Vec Algorithm
The type of Amazon Elastic Compute Cloud (Amazon EC2) instance that you use depends on whether you are training or running inference.
When training a model using the Object2Vec algorithm on a CPU, start with an ml.m5.2xlarge instance. For training on a GPU, start with an ml.p2.xlarge instance. If the training takes too long on this instance, you can use a larger instance. Currently, the Object2Vec algorithm can train only on a single machine. However, it does offer support for multiple GPUs. Object2Vec supports P2, P3, G4dn, and G5 GPU instance families for training and inference.
For inference with a trained Object2Vec model that has a deep neural network, we recommend using an ml.p3.2xlarge GPU instance. Because GPU memory is scarce, you can specify the INFERENCE_PREFERRED_MODE environment variable to choose which inference network is loaded into the GPU: the one described in GPU optimization: Classification or Regression, or the one described in GPU optimization: Encoder Embeddings.
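As a hedged sketch of how the environment variable might be supplied, the following helper builds a container definition dictionary in the shape that the SageMaker CreateModel API accepts for a primary container (`Image`, `ModelDataUrl`, `Environment`). The helper name `object2vec_container`, the placeholder image URI and S3 path, and the two mode values `classification` and `embedding` are assumptions that do not come from this section. Confirm the accepted values in the inference data format topics before relying on them.

```python
# Assumed values for INFERENCE_PREFERRED_MODE (verify against the
# Object2Vec inference documentation).
ASSUMED_MODES = {"classification", "embedding"}


def object2vec_container(image_uri, model_data_url, preferred_mode):
    """Build a container definition that sets INFERENCE_PREFERRED_MODE."""
    if preferred_mode not in ASSUMED_MODES:
        raise ValueError(f"unexpected mode: {preferred_mode!r}")
    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "Environment": {"INFERENCE_PREFERRED_MODE": preferred_mode},
    }


# Placeholder values: substitute the real image URI and model artifact path.
container = object2vec_container(
    "<object2vec-image-uri>",
    "s3://<bucket>/<prefix>/model.tar.gz",
    "embedding",
)
```

A dictionary like this could be passed as the primary container when creating the model, so that most GPU memory goes to the network that serves the majority of your inference requests.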
Object2Vec Sample Notebooks
Note
To run the notebooks on a notebook instance, see Access example notebooks. To run the notebooks on Studio, see Create or Open an Amazon SageMaker Studio Classic Notebook.