MLPER-12: Choose an optimal deployment option in the cloud
If your models are suitable for cloud deployment, determine how to deploy them for the best performance efficiency based on the request frequency, latency, and runtime requirements of your use cases.
Implementation plan
- Amazon SageMaker Real-time Inference - Use if you need a persistent endpoint that returns near-instantaneous responses from the ML model for requests that can arrive at any time. You can host the models behind an HTTPS endpoint and integrate it with your applications. SageMaker real-time endpoints are fully managed and support autoscaling (see the first sketch after this list).
- Amazon SageMaker Serverless Inference - Use if you receive spiky inference requests that vary substantially in rate and volume. This is a purpose-built inference option that makes it easy to deploy and scale ML models without managing any servers. Serverless Inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts (see the second sketch after this list).
- Amazon SageMaker Asynchronous Inference - Use if you have model requests with large payload sizes (up to 1 GB), long processing times (up to 15 minutes), and near real-time rather than near-instantaneous latency requirements. SageMaker Asynchronous Inference is ideal here because it has larger payload and longer time-out limits than SageMaker Real-time Inference. It queues incoming requests with an internal queueing system and processes them asynchronously (see the third sketch after this list).
- Amazon SageMaker Batch Transform - Use if you do not need a near-instantaneous response from the ML model and can gather data points into a large batch for schedule-based inference. When a batch transform job starts, SageMaker initializes compute instances and distributes the inference or preprocessing workload among them. SageMaker Batch Transform automatically splits input files into mini-batches, so you do not need to worry about out-of-memory (OOM) errors on large datasets, and shuts down the compute instances once the entire dataset is processed (see the fourth sketch after this list).
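The following minimal sketch shows how a model could be deployed to a SageMaker real-time endpoint with the SageMaker Python SDK. The IAM role ARN, container image URI, S3 model artifact location, instance type, and endpoint name are placeholder assumptions; substitute values from your own account.

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

# Placeholder role ARN, container image, and model artifact location.
role = "arn:aws:iam::111122223333:role/ExampleSageMakerRole"
model = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://example-bucket/models/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Create a persistent, fully managed HTTPS endpoint for near-instantaneous responses.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="example-realtime-endpoint",
)
```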
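A corresponding sketch for Serverless Inference, assuming the same Model object as in the first sketch; the memory size, concurrency ceiling, and endpoint name are illustrative assumptions.

```python
from sagemaker.serverless import ServerlessInferenceConfig

# SageMaker provisions and scales capacity per request; you pay only while
# requests are being processed, which suits spiky traffic with idle periods.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,  # illustrative; size to your model's memory footprint
    max_concurrency=5,       # illustrative concurrency ceiling
)

predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="example-serverless-endpoint",
)
```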
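A sketch for Asynchronous Inference, again assuming the Model object from the first sketch; the S3 paths and endpoint name are placeholders. Requests reference an input object already staged in S3, and results are written to the configured output path.

```python
from sagemaker.async_inference import AsyncInferenceConfig

# Incoming requests are queued internally; responses are written to S3.
async_config = AsyncInferenceConfig(
    output_path="s3://example-bucket/async-output/",
    max_concurrent_invocations_per_instance=4,
)

async_predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=async_config,
    endpoint_name="example-async-endpoint",
)

# Submit a request by pointing at a payload already staged in S3.
response = async_predictor.predict_async(
    input_path="s3://example-bucket/async-input/payload.json"
)
```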
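Finally, a sketch for Batch Transform on the same Model object; the instance count, payload size, and S3 locations are illustrative assumptions.

```python
# Launch transient compute instances, fan the dataset out across them, and
# shut the instances down once the whole dataset has been processed.
transformer = model.transformer(
    instance_count=2,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/batch-output/",
    strategy="MultiRecord",  # group records into mini-batches per request
    max_payload=6,           # MB per request
)

transformer.transform(
    data="s3://example-bucket/batch-input/",
    content_type="text/csv",
    split_type="Line",       # split input files into individual records
)
transformer.wait()
```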