Train a Model with Amazon SageMaker
The following diagram shows how you train and deploy a model with Amazon SageMaker. Your training code reads your training data from an S3 bucket and writes the resulting model artifacts to an S3 bucket. After deployment, you can make requests to a model endpoint to run inference. You can store both the training and inference container images in Amazon Elastic Container Registry (Amazon ECR).

The following guide highlights two components of SageMaker: model training and model deployment.
To train a model in SageMaker, you create a training job. The training job includes the following information:
- The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training data.
- The compute resources that you want SageMaker to use for model training. Compute resources are machine learning (ML) compute instances that are managed by SageMaker.
- The URL of the S3 bucket where you want to store the output of the job.
- The Amazon Elastic Container Registry path where the training code is stored. For more information, see Docker Registry Paths and Example Code.
Note
Your input dataset must be in the same AWS Region as your training job.
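The following sketch shows how these pieces map onto a CreateTrainingJob request made with boto3. The bucket names, role ARN, and image URI are hypothetical placeholders; substitute your own values.

```python
# A minimal sketch of a CreateTrainingJob request using boto3.
# All ARNs, bucket names, and image URIs below are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_training_job(
    TrainingJobName="my-training-job",  # must be unique within the Region
    AlgorithmSpecification={
        # Amazon ECR path of the training image (placeholder account/repo)
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    # S3 URL of the training data (same Region as the job)
                    "S3Uri": "s3://my-bucket/train/",
                }
            },
        }
    ],
    # S3 URL where SageMaker stores the output of the job
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    # The ML compute instances that SageMaker manages for training
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```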
You have the following options for a training algorithm:
- Use an algorithm provided by SageMaker—SageMaker provides dozens of built-in training algorithms and hundreds of pre-trained models. If one of these meets your needs, it's a great out-of-the-box solution for quick model training. For a list of algorithms provided by SageMaker, see Use Amazon SageMaker Built-in Algorithms or Pre-trained Models. To try an exercise that uses an algorithm provided by SageMaker, see Get started. You can also use SageMaker JumpStart to use algorithms and models through the Studio UI. For a minimal example, see the first sketch after this list.
- Use SageMaker Debugger—Inspect training parameters and data throughout the training process when working with the TensorFlow, PyTorch, and Apache MXNet learning frameworks or the XGBoost algorithm. Debugger automatically detects and alerts users to commonly occurring errors such as parameter values getting too large or small. For more information about using Debugger, see Use Amazon SageMaker Debugger to debug and improve model performance. Debugger sample notebooks are available at Amazon SageMaker Debugger Samples.
- Use Apache Spark with SageMaker—SageMaker provides a library that you can use in Apache Spark to train models with SageMaker. Using the library provided by SageMaker is similar to using Apache Spark MLlib. For more information, see Use Apache Spark with Amazon SageMaker.
- Submit custom code to train with deep learning frameworks—You can submit custom Python code that uses TensorFlow, PyTorch, or Apache MXNet for model training. For more information, see Use TensorFlow with Amazon SageMaker, Use PyTorch with Amazon SageMaker, and Use Apache MXNet with Amazon SageMaker. For a minimal example, see the second sketch after this list.
- Use your own custom algorithms—Put your code together as a Docker image and specify the registry path of the image in a SageMaker CreateTrainingJob API call. For more information, see Use Docker containers to build models.
- Use an algorithm that you subscribe to from AWS Marketplace—For information, see Find and Subscribe to Algorithms and Model Packages on AWS Marketplace.
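The first sketch below shows one way to train with a built-in algorithm (XGBoost, as an example) through the SageMaker Python SDK. The role ARN and S3 paths are hypothetical placeholders.

```python
# A minimal sketch of training with a SageMaker built-in algorithm (XGBoost)
# through the SageMaker Python SDK. Role ARN and S3 paths are placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

# Resolve the Amazon ECR path of the built-in XGBoost image for this Region
image_uri = image_uris.retrieve(framework="xgboost", region=region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",  # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Launch the training job; SageMaker provisions the instances and runs the container
estimator.fit({"train": TrainingInput("s3://my-bucket/train/", content_type="text/csv")})
```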
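The second sketch shows submitting custom deep learning code with a framework estimator, using PyTorch as an example. The entry-point script name, role, and S3 path are hypothetical placeholders; TensorFlow and MXNet estimators follow the same pattern.

```python
# A minimal sketch of submitting custom PyTorch training code with the
# SageMaker Python SDK. Script name, role, and S3 path are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # your custom training script (hypothetical)
    source_dir="src",        # directory containing the script and its dependencies
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    framework_version="2.1",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    hyperparameters={"epochs": 10, "lr": 0.001},  # passed to train.py as CLI args
)

estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 path
```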
After you create the training job, SageMaker launches the ML compute instances and uses the training code and the training dataset to train the model. It saves the resulting model artifacts and other output in the S3 bucket you specified for that purpose.
You can create a training job with the SageMaker console or the API. For information about creating a training job with the API, see the CreateTrainingJob API.
When you create a training job with the API, SageMaker replicates the entire dataset on the ML compute instances by default. To make SageMaker replicate a subset of the data on each ML compute instance, you must set the S3DataDistributionType field to ShardedByS3Key. You can set this field using the low-level SDK. For more information, see S3DataDistributionType in S3DataSource.
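For example, the S3DataSource portion of an InputDataConfig entry might look like the following sketch, where the S3 path is a hypothetical placeholder:

```python
# A sketch of an InputDataConfig channel that sets S3DataDistributionType to
# ShardedByS3Key so each ML compute instance receives a subset of the data.
input_data_config = [
    {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train/",  # placeholder
                # Default is FullyReplicated; ShardedByS3Key splits the objects
                # under the prefix across the ML compute instances.
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }
]
```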
Important
To prevent your algorithm container from contending for memory, SageMaker reserves memory on your ML compute instances for its critical system processes. As a result, your container cannot use all of the memory available for the instance type.