Machine learning experiments using Amazon SageMaker with MLflow
Amazon SageMaker with MLflow is a capability of Amazon SageMaker that lets you create, manage, analyze, and compare your machine learning experiments.
Experimentation in machine learning
Machine learning is an iterative process that requires experimenting with various combinations of data, algorithms, and parameters, while observing their impact on model accuracy. The iterative nature of ML experimentation results in numerous model training runs and versions, making it challenging to track the best performing models and their configurations. The complexity of managing and comparing iterative training runs increases with generative artificial intelligence (generative AI), where experimentation involves not only fine-tuning models but also exploring creative and diverse outputs. Researchers must adjust hyperparameters, select suitable model architectures, and curate diverse datasets to optimize both the quality and creativity of the generated content. Evaluating generative AI models requires both quantitative and qualitative metrics, adding another layer of complexity to the experimentation process.
Use MLflow with Amazon SageMaker to track, organize, view, analyze, and compare iterative ML experiments, gain comparative insights, and register and deploy your best-performing models.
MLflow integrations
Use MLflow while training and evaluating models to find the best candidates for your use case. You can compare model performance, parameters, and metrics across experiments in the MLflow UI, keep track of your best models in the MLflow Model Registry, automatically register them as a SageMaker model, and deploy registered models to SageMaker endpoints.
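For example, the following minimal sketch logs parameters and a metric across several candidate runs and then ranks them. The experiment name, hyperparameter values, and metric are illustrative, and the sketch assumes the tracking URI already points to your tracking server (see How it works below).

```python
import mlflow

# Assumes mlflow.set_tracking_uri() was already called with your tracking
# server ARN (see "How it works" below).
mlflow.set_experiment("price-prediction")  # hypothetical experiment name

for lr in (0.01, 0.05, 0.1):  # illustrative hyperparameter sweep
    with mlflow.start_run():
        mlflow.log_param("learning_rate", lr)
        # Placeholder for real training; log whatever metric you evaluate.
        mlflow.log_metric("rmse", 1.0 / (1.0 + lr))

# Compare runs programmatically; the same view is available in the MLflow UI.
runs = mlflow.search_runs(order_by=["metrics.rmse ASC"])
print(runs[["run_id", "params.learning_rate", "metrics.rmse"]].head())
```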
Amazon SageMaker with MLflow
Use MLflow to track and manage the experimentation phase of the machine learning (ML) lifecycle with AWS integrations for model development, management, deployment, and tracking.
Amazon SageMaker Studio
Create and manage tracking servers, run notebooks to create experiments, and access the MLflow UI to view and compare experiment runs all through Studio.
SageMaker Model Registry
Manage model versions and catalog models for production by automatically registering models from MLflow Model Registry to SageMaker Model Registry. For more information, see Automatically register SageMaker models with SageMaker Model Registry.
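For example, registering a model version from a tracked run is a single MLflow call; with automatic registration enabled on your tracking server, SageMaker also registers the model in SageMaker Model Registry. In this sketch the run ID and model name are placeholders.

```python
import mlflow

# Register the model artifact from a completed run in the MLflow Model
# Registry. With automatic registration enabled on your tracking server,
# SageMaker also registers it in SageMaker Model Registry.
mlflow.register_model(
    model_uri="runs:/<run_id>/model",   # placeholder run ID
    name="price-prediction-model",      # hypothetical registered model name
)
```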
SageMaker Inference
Prepare your best models for deployment on a SageMaker endpoint using ModelBuilder. For more information, see Deploy MLflow models with ModelBuilder.
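The following sketch shows how this deployment flow can look with the SageMaker Python SDK. The role ARN, run ID, tracking server ARN, and instance type are placeholders, and the model_metadata keys reflect our understanding of the MLflow integration.

```python
from sagemaker.serve import ModelBuilder
from sagemaker.serve.mode.function_pointers import Mode

# Build a deployable SageMaker model directly from an MLflow run.
model_builder = ModelBuilder(
    mode=Mode.SAGEMAKER_ENDPOINT,
    role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    model_metadata={
        "MLFLOW_MODEL_PATH": "runs:/<run_id>/model",  # placeholder run ID
        "MLFLOW_TRACKING_ARN": "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server",  # placeholder
    },
)
model = model_builder.build()
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```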
AWS Identity and Access Management
Configure access to MLflow using role-based access control (RBAC) with IAM. Write IAM identity policies to authorize the MLflow APIs that can be called by a client of an MLflow tracking server. All MLflow REST APIs are represented as IAM actions under the sagemaker-mlflow service prefix. For more information, see Set up IAM permissions for MLflow.
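As a sketch, the following boto3 call creates an identity policy that allows all sagemaker-mlflow actions against a single tracking server. The policy name and resource ARN are placeholders, and a production policy should list only the specific actions a client needs.

```python
import json
import boto3

# A minimal identity policy granting a client full MLflow API access to one
# tracking server. Scope "Action" down to specific sagemaker-mlflow actions
# in production.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker-mlflow:*",
            "Resource": "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server",  # placeholder
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="MlflowTrackingServerAccess",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)
```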
AWS CloudTrail
View logs in AWS CloudTrail to help you enable operational and risk auditing, governance, and compliance of your AWS account. For more information, see AWS CloudTrail logs.
Amazon EventBridge
Automate the model review and deployment lifecycle using MLflow events captured by Amazon EventBridge. For more information, see Amazon EventBridge events.
Supported AWS Regions
Amazon SageMaker with MLflow is generally available in all AWS commercial Regions where Amazon SageMaker Studio is available, except the China Regions and AWS GovCloud (US) Regions. SageMaker with MLflow is available using only the AWS CLI in the Europe (Zurich), Asia Pacific (Hyderabad), Asia Pacific (Melbourne), and Canada West (Calgary) AWS Regions.
Tracking servers are launched in a single Availability Zone within their specified Region.
How it works
An MLflow Tracking Server has three main components: compute, backend metadata storage, and artifact storage. The compute that hosts the tracking server and the backend metadata storage are securely hosted in the SageMaker service account. The artifact storage lives in an Amazon S3 bucket in your own AWS account.
Each tracking server has an Amazon Resource Name (ARN). You can use this ARN to connect the MLflow SDK to your tracking server and start logging your training runs to MLflow.
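A minimal connection sketch, assuming the mlflow and sagemaker-mlflow packages are installed and using a placeholder ARN:

```python
import mlflow  # the sagemaker-mlflow plugin must also be installed

# Point the MLflow SDK at your tracking server by its ARN (placeholder shown).
mlflow.set_tracking_uri(
    "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server"
)

with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.93)  # illustrative metric
```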
Read on for more information about the following key concepts:
Backend metadata storage
When you create an MLflow Tracking Server, a backend store is automatically provisioned and managed for you in the SageMaker service account. The backend store persists MLflow metadata about your experiment runs, such as run names, parameters, metrics, and tags.
Artifact storage
To provide MLflow with persistent storage for the artifacts of each run, such as model weights, images, model files, and data files, you must create an artifact store using Amazon S3. The artifact store must be set up within your AWS account, and you must explicitly give MLflow access to Amazon S3 in order to access your artifact store. For more information, see Artifact Stores in the MLflow documentation.
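As an example, the following boto3 sketch creates a tracking server whose artifact store is an S3 bucket in your account. The server name, bucket, and role ARN are placeholders, and the role must grant access to the bucket.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Create a tracking server whose artifact store is an S3 bucket you own.
# The execution role must allow access to that bucket so MLflow can read
# and write artifacts.
sagemaker.create_mlflow_tracking_server(
    TrackingServerName="my-tracking-server",                           # placeholder
    ArtifactStoreUri="s3://amzn-s3-demo-bucket/mlflow-artifacts",      # placeholder bucket
    RoleArn="arn:aws:iam::123456789012:role/MlflowTrackingServerRole", # placeholder
)
```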
MLflow Tracking Server sizes
You can optionally specify the size of your tracking server in the Studio UI or with the AWS CLI parameter --tracking-server-size. You can choose between "Small", "Medium", and "Large"; the default tracking server size is "Small". Choose a size based on the projected use of the tracking server, such as the volume of data logged, the number of users, and the frequency of use (a resizing sketch follows the table below).
We recommend a small tracking server for teams of up to 25 users, a medium tracking server for teams of up to 50 users, and a large tracking server for teams of up to 100 users. These recommendations assume that all users make concurrent requests to your MLflow Tracking Server. Select your tracking server size based on your expected usage pattern and the transactions per second (TPS) that each size supports.
Note
The nature of your workload and the type of requests that you make to the tracking server dictate the TPS you see.
Tracking server size | Sustained TPS | Burst TPS |
---|---|---|
Small | Up to 25 | Up to 50 |
Medium | Up to 50 | Up to 100 |
Large | Up to 100 | Up to 200 |
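If a tracking server outgrows its size, you can update it in place. A boto3 sketch with a placeholder server name:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Resize an existing tracking server, for example when a team outgrows
# the "Small" default.
sagemaker.update_mlflow_tracking_server(
    TrackingServerName="my-tracking-server",  # placeholder
    TrackingServerSize="Medium",
)
```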
Tracking server versions
The following MLflow versions are available to use with SageMaker:
MLflow version | Python version |
---|---|
MLflow 2.13.2 | Python 3.8 |
AWS CloudTrail logs
AWS CloudTrail automatically logs activity related to your MLflow Tracking Server. The following API calls are logged in CloudTrail:
- CreateMlflowTrackingServer
- DescribeMlflowTrackingServer
- UpdateMlflowTrackingServer
- DeleteMlflowTrackingServer
- ListMlflowTrackingServers
- CreatePresignedMlflowTrackingServerUrl
- StartMlflowTrackingServer
- StopMlflowTrackingServer
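As a sketch, you can query recent tracking server management activity from the CloudTrail event history with boto3:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up recent CreateMlflowTrackingServer calls in the CloudTrail
# event history for this Region.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "CreateMlflowTrackingServer"}
    ],
    MaxResults=10,
)
for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username", ""))
```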
For more information about CloudTrail, see the AWS CloudTrail User Guide.
Amazon EventBridge events
Use EventBridge to route events from using MLflow with SageMaker to consumer applications across your organization. The following events are emitted to EventBridge:
- "SageMaker Tracking Server Creating"
- "SageMaker Tracking Server Created"
- "SageMaker Tracking Server Create Failed"
- "SageMaker Tracking Server Updating"
- "SageMaker Tracking Server Updated"
- "SageMaker Tracking Server Update Failed"
- "SageMaker Tracking Server Deleting"
- "SageMaker Tracking Server Deleted"
- "SageMaker Tracking Server Delete Failed"
- "SageMaker Tracking Server Starting"
- "SageMaker Tracking Server Started"
- "SageMaker Tracking Server Start Failed"
- "SageMaker Tracking Server Stopping"
- "SageMaker Tracking Server Stopped"
- "SageMaker Tracking Server Stop Failed"
- "SageMaker Tracking Server Maintenance In Progress"
- "SageMaker Tracking Server Maintenance Complete"
- "SageMaker Tracking Server Maintenance Failed"
- "SageMaker MLFlow Tracking Server Creating Run"
- "SageMaker MLFlow Tracking Server Creating RegisteredModel"
- "SageMaker MLFlow Tracking Server Creating ModelVersion"
- "SageMaker MLFlow Tracking Server Transitioning ModelVersion Stage"
- "SageMaker MLFlow Tracking Server Setting Registered Model Alias"
For more information about EventBridge, see the Amazon EventBridge User Guide.