IP Insights
Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insight models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms.
SageMaker IP insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker IP Insights model returns a score that infers how anomalous the pattern of the event is. For example, when a user attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the IP Insights score into another machine learning model. For example, you can combine the IP Insight score with other features to rank the findings of another security system, such as those from Amazon GuardDuty.
The SageMaker IP Insights algorithm can also learn vector representations of IP addresses, known as embeddings. You can use vector-encoded embeddings as features in downstream machine learning tasks that use the information observed in the IP addresses. For example, you can use them in tasks such as measuring similarities between IP addresses in clustering and visualization tasks.
Topics
Input/Output Interface for the IP Insights Algorithm
Training and Validation
The SageMaker IP Insights algorithm supports training and validation data channels. It uses
the optional validation channel to compute an area-under-curve (AUC) score on a
predefined negative sampling strategy. The AUC metric validates how well the model
discriminates between positive and negative samples. Training and validation data
content types need to be in text/csv
format. The first column of the CSV
data is an opaque string that provides a unique identifier for the entity. The second
column is an IPv4 address in decimal-dot notation. IP Insights currently supports only
File mode. For more information and some examples, see IP Insights Training Data
Formats.
Inference
For inference, IP Insights supports text/csv
,
application/json
, and application/jsonlines
data content
types. For more information about the common data formats for inference provided by
SageMaker, see Common data formats for inference. IP Insights
inference returns output formatted as either application/json
or
application/jsonlines
. Each record in the output data contains the
corresponding dot_product
(or compatibility score) for each input data
point. For more information and some examples, see IP Insights Inference Data
Formats.
EC2 Instance Recommendation for the IP Insights Algorithm
The SageMaker IP Insights algorithm can run on both GPU and CPU instances. For training jobs, we recommend using GPU instances. However, for certain workloads with large training datasets, distributed CPU instances might reduce training costs. For inference, we recommend using CPU instances. IP Insights supports P2, P3, G4dn, and G5 GPU families.
GPU Instances for the IP Insights Algorithm
IP Insights supports all available GPUs. If you need to speed up training, we
recommend starting with a single GPU instance, such as ml.p3.2xlarge, and then
moving to a multi-GPU environment, such as ml.p3.8xlarge and ml.p3.16xlarge.
Multi-GPUs automatically divide the mini batches of training data across themselves.
If you switch from a single GPU to multiple GPUs, the mini_batch_size
is divided equally into the number of GPUs used. You may want to increase the value
of the mini_batch_size
to compensate for this.
CPU Instances for the IP Insights Algorithm
The type of CPU instance that we recommend depends largely on the instance's
available memory and the model size. The model size is determined by two
hyperparameters: vector_dim
and num_entity_vectors
. The
maximum supported model size is 8 GB. The following table lists typical EC2 instance
types that you would deploy based on these input parameters for various model sizes.
In Table 1, the value for vector_dim
in the first column range from 32
to 2048 and the values for num_entity_vectors
in the first row range
from 10,000 to 50,000,000.
vector_dim \
num_entity_vectors . |
10,000 | 50,000 | 100,000 | 500,000 | 1,000,000 | 5,000,000 | 10,000,000 | 50,000,000 |
---|---|---|---|---|---|---|---|---|
32 |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.xlarge |
ml.m5.2xlarge |
ml.m5.4xlarge |
|
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.2xlarge |
ml.m5.2xlarge |
|
|
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.2xlarge |
ml.m5.4xlarge |
|
|
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.xlarge |
ml.m5.4xlarge |
||
|
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.2xlarge |
|||
|
ml.m5.large |
ml.m5.large |
ml.m5.large |
ml.m5.xlarge |
ml.m5.4xlarge |
|||
|
ml.m5.large |
ml.m5.large |
ml.m5.xlarge |
ml.m5.xlarge |
The values for the mini_batch_size
,
num_ip_encoder_layers
, random_negative_sampling_rate
,
and shuffled_negative_sampling_rate
hyperparameters also affect the
amount of memory required. If these values are large, you might need to use a larger
instance type than normal.
IP Insights Sample Notebooks
For a sample notebook that shows how to train the SageMaker IP Insights algorithm and
perform inferences with it, see An Introduction to the SageMakerIP Insights Algorithm