The Amazon SageMaker Object2Vec algorithm is a general-purpose, highly customizable neural embedding algorithm. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves, in the embedding space, the semantics of the relationship between pairs of objects in the original space. For example, you can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.
Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the Amazon SageMaker BlazingText Algorithm. For a blog post that discusses how to apply Object2Vec to some practical use cases, see Introduction to Amazon SageMaker Object2Vec.
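As a rough illustration of the nearest-neighbor use case mentioned above, the following sketch ranks objects by cosine similarity between their embeddings. The embedding values are made up for illustration, and the `cosine_similarity` and `nearest_neighbors` helpers are hypothetical names, not part of Object2Vec. A real workload would read the vectors from a trained model and would likely use an approximate nearest-neighbor index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_id, embeddings, k=2):
    """Return the k object IDs most similar to query_id, best match first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), obj_id)
        for obj_id, vec in embeddings.items()
        if obj_id != query_id
    ]
    scored.sort(reverse=True)
    return [obj_id for _, obj_id in scored[:k]]


# Made-up 3-dimensional embeddings, for illustration only.
embeddings = {
    "football": [0.9, 0.1, 0.0],
    "basketball": [0.8, 0.2, 0.1],
    "pear": [0.0, 0.9, 0.4],
    "apple": [0.1, 0.8, 0.5],
}

print(nearest_neighbors("football", embeddings, k=1))
```

With these toy vectors, the two sports products end up closest to each other, as do the two fruits.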
You can use Object2Vec on many input data types, including the following examples.
Input Data Type | Example |
---|---|
Sentence-sentence pairs | "A soccer game with multiple males playing." and "Some men are playing a sport." |
Labels-sequence pairs | The genre tags of the movie "Titanic", such as "Romance" and "Drama", and its short description: "James Cameron's Titanic is an epic, action-packed romance set against the ill-fated maiden voyage of the R.M.S. Titanic. She was the most luxurious liner of her era, a ship of dreams, which ultimately carried over 1,500 people to their death in the ice cold waters of the North Atlantic in the early hours of April 15, 1912." |
Customer-customer pairs | The customer ID of Jane and customer ID of Jackie. |
Product-product pairs | The product ID of football and product ID of basketball. |
Item review user-item pairs | A user's ID and the items she has bought, such as apple, pear, and orange. |
To transform the input data into the supported formats, you must preprocess it. Currently, Object2Vec natively supports two types of input:
A discrete token, which is represented as a list of a single integer-id. For example, [10].
A sequence of discrete tokens, which is represented as a list of integer-ids. For example, [0,12,10,13].
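The preprocessing step described above can be sketched as follows. This is a minimal example, assuming whitespace tokenization and a vocabulary built in first-seen order. The `build_vocab`, `encode_sequence`, and `encode_token` helpers are hypothetical, and the `in0`, `in1`, and `label` keys in the record are assumptions about the record layout that should be checked against the Object2Vec data format documentation.

```python
import json


def build_vocab(sentences):
    """Assign each distinct lowercase token an integer ID in first-seen order."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_sequence(sentence, vocab):
    """A sequence of discrete tokens becomes a list of integer IDs."""
    return [vocab[token] for token in sentence.lower().split()]


def encode_token(token_id):
    """A single discrete token becomes a one-element list."""
    return [token_id]


sentences = ["some men are playing a sport", "men are playing soccer"]
vocab = build_vocab(sentences)

# Hypothetical record layout: the "in0", "in1", and "label" keys are assumptions.
record = {
    "label": 1,
    "in0": encode_sequence(sentences[0], vocab),
    "in1": encode_sequence(sentences[1], vocab),
}
print(json.dumps(record))
```

A (token, sequence) pair would use `encode_token` for one side and `encode_sequence` for the other.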
The objects in each pair can be asymmetric. For example, the pairs can be (token, sequence), (token, token), or (sequence, sequence). For token inputs, the algorithm supports simple embeddings as compatible encoders. For sequences of token vectors, the algorithm supports the following as encoders:
Average-pooled embeddings
Hierarchical convolutional neural networks (CNNs)
Multi-layered bidirectional long short-term memory networks (BiLSTMs)
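Of the three sequence encoders, average pooling is the simplest to illustrate. The following is a sketch of the general technique, not the algorithm's internal implementation: each integer ID selects a row of an embedding table, and the rows are averaged dimension by dimension to give one fixed-length vector. The table values and the `average_pool` helper are made up for illustration.

```python
def average_pool(token_ids, embedding_table):
    """Average the embedding vectors of the given token IDs, dimension by dimension."""
    vectors = [embedding_table[i] for i in token_ids]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Made-up 2-dimensional embedding table indexed by integer ID.
embedding_table = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

pooled = average_pool([0, 1, 2], embedding_table)
print(pooled)
```

The output has the same length regardless of how many tokens the sequence contains, which is what lets sequences of different lengths be compared in one embedding space.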
The input label for each pair can be one of the following:
A categorical label that expresses the relationship between the objects in the pair
A score that expresses the strength of the similarity between the two objects
For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss function. Specify these loss functions with the output_layer hyperparameter when you create the model training job.
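To make the two loss functions concrete, here is a minimal sketch of how each is computed in general. It is not the algorithm's own code. The `choose_output_layer` helper is hypothetical, and the `softmax` and `mean_squared_error` value strings are assumptions that should be verified against the Object2Vec hyperparameter reference.

```python
import math


def cross_entropy(probabilities, true_class):
    """Cross-entropy loss for one example: negative log-probability of the true class."""
    return -math.log(probabilities[true_class])


def mean_squared_error(predictions, targets):
    """Average squared difference between predicted and target scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def choose_output_layer(label_kind):
    """Hypothetical helper: map a label kind to an output_layer value (assumed strings)."""
    return {"categorical": "softmax", "score": "mean_squared_error"}[label_kind]


# Classification: predicted class probabilities for one pair, true class is index 1.
print(cross_entropy([0.25, 0.75], true_class=1))

# Regression: predicted similarity scores versus labeled scores for two pairs.
print(mean_squared_error([3.0, 5.0], [1.0, 5.0]))
```

Cross-entropy penalizes a low predicted probability for the correct relationship class, while MSE penalizes the squared distance between predicted and labeled similarity scores.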
The type of Amazon Elastic Compute Cloud (Amazon EC2) instance that you use depends on whether you are training or running inferences.
When training a model using the Object2Vec algorithm on a CPU, start with an ml.m5.2xlarge instance. For training on a GPU, start with an ml.p2.xlarge instance. If training takes too long on the starting instance, you can use a larger instance, such as an ml.m5.4xlarge or an ml.m5.12xlarge instance. Currently, the Object2Vec algorithm can train only on a single machine. However, it does offer support for multiple GPUs.
For inference with a trained Object2Vec model that has a deep neural network, we recommend using an ml.p3.2xlarge GPU instance. Because GPU memory is scarce, you can specify the INFERENCE_PREFERRED_MODE environment variable to choose which inference network is loaded onto the GPU: the classification or regression network (see GPU optimization: Classification or Regression) or the encoder embeddings network (see GPU optimization: Encoder Embeddings).
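As a sketch of how that choice might be expressed, the hypothetical helper below builds an environment mapping for the inference container. Only the variable name `INFERENCE_PREFERRED_MODE` comes from the text above. The value strings `classification` and `embedding` are assumptions, as is the idea of passing the mapping as the model's container environment, so confirm both against the inference documentation before relying on them.

```python
def inference_environment(use_case):
    """Hypothetical helper that builds the container environment mapping.

    The value strings below are assumptions, not confirmed by the text above.
    """
    modes = {
        "classification_or_regression": "classification",  # assumed value
        "encoder_embeddings": "embedding",  # assumed value
    }
    return {"INFERENCE_PREFERRED_MODE": modes[use_case]}


# Prefer loading the encoder embeddings network onto the GPU.
env = inference_environment("encoder_embeddings")
print(env)
```

An unknown use case raises `KeyError`, so a misspelled mode is caught early.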
For a sample notebook that uses the Amazon SageMaker Object2Vec algorithm to encode sequences into fixed-length embeddings, see Using Object2Vec to Encode Sentences into Fixed Length Embeddings.