Manage machine learning experiments using Amazon SageMaker with MLflow

Amazon SageMaker with MLflow is a capability of Amazon SageMaker that lets you create, manage, analyze, and compare your machine learning experiments.

Experimentation in machine learning

Machine learning is an iterative process that requires experimenting with various combinations of data, algorithms, and parameters, while observing their impact on model accuracy. The iterative nature of ML experimentation results in numerous model training runs and versions, making it challenging to track the best performing models and their configurations. The complexity of managing and comparing iterative training runs increases with generative artificial intelligence (generative AI), where experimentation involves not only fine-tuning models but also exploring creative and diverse outputs. Researchers must adjust hyperparameters, select suitable model architectures, and curate diverse datasets to optimize both the quality and creativity of the generated content. Evaluating generative AI models requires both quantitative and qualitative metrics, adding another layer of complexity to the experimentation process.

Use MLflow with Amazon SageMaker to track, organize, view, analyze, and compare iterative ML experiments, and then register and deploy your best-performing models.

MLflow integrations

Use MLflow while training and evaluating models to find the best candidates for your use case. You can compare model performance, parameters, and metrics across experiments in the MLflow UI, keep track of your best models in the MLflow Model Registry, automatically register them as SageMaker models, and deploy registered models to SageMaker endpoints.

Amazon SageMaker with MLflow

Use MLflow to track and manage the experimentation phase of the machine learning (ML) lifecycle with AWS integrations for model development, management, deployment, and tracking.

Amazon SageMaker Studio

Create and manage tracking servers, run notebooks to create experiments, and access the MLflow UI to view and compare experiment runs, all through Studio.

SageMaker Model Registry

Manage model versions and catalog models for production by automatically registering models from MLflow Model Registry to SageMaker Model Registry. For more information, see Automatically register SageMaker models with SageMaker Model Registry.

SageMaker Inference

Prepare your best models for deployment on a SageMaker endpoint using ModelBuilder. For more information, see Deploy MLflow models with ModelBuilder.

AWS Identity and Access Management

Configure access to MLflow using role-based access control (RBAC) with IAM. Write IAM identity policies to authorize the MLflow APIs that can be called by a client of an MLflow tracking server. All MLflow REST APIs are represented as IAM actions under the sagemaker-mlflow service prefix. For more information, see Set up IAM permissions for MLflow.

AWS CloudTrail

View logs in AWS CloudTrail to help you enable operational and risk auditing, governance, and compliance of your AWS account. For more information, see AWS CloudTrail logs.

Amazon EventBridge

Automate the model review and deployment lifecycle using MLflow events captured by Amazon EventBridge. For more information, see Amazon EventBridge events.
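As an illustration of the IAM integration described above, the following sketch builds an identity policy that limits a client to read-style MLflow operations on one tracking server. The function name, the example ARN, and the specific action names after the sagemaker-mlflow prefix are assumptions chosen for illustration; check them against the IAM permissions documentation before use.

```python
import json

# Assumed action names: MLflow REST APIs exposed under the sagemaker-mlflow
# prefix. Verify each one against the IAM documentation before relying on it.
READ_ONLY_ACTIONS = [
    "sagemaker-mlflow:AccessUI",
    "sagemaker-mlflow:GetExperiment",
    "sagemaker-mlflow:SearchExperiments",
    "sagemaker-mlflow:GetRun",
    "sagemaker-mlflow:SearchRuns",
]


def read_only_mlflow_policy(tracking_server_arn: str) -> dict:
    """Build an identity policy document scoped to a single tracking server."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(READ_ONLY_ACTIONS),
                "Resource": tracking_server_arn,
            }
        ],
    }


# Hypothetical ARN used only to show the shape of the output.
policy = read_only_mlflow_policy(
    "arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/example-server"
)
print(json.dumps(policy, indent=2))
```

Scoping the Resource element to one tracking server ARN, rather than a wildcard, keeps the grant limited to that server.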

Supported AWS Regions

Amazon SageMaker with MLflow is generally available in all AWS commercial Regions where Amazon SageMaker Studio is available, except the China Regions and AWS GovCloud (US) Regions. SageMaker with MLflow is available using only the AWS CLI in the Europe (Zurich), Asia Pacific (Hyderabad), Asia Pacific (Melbourne), and Canada West (Calgary) AWS Regions.

Tracking servers are launched in a single Availability Zone within their specified Region.

How it works

An MLflow Tracking Server has three main components: compute, backend metadata storage, and artifact storage. The compute that hosts the tracking server and the backend metadata storage are securely hosted in the SageMaker service account. The artifact storage lives in an Amazon S3 bucket in your own AWS account.

Figure: The compute and metadata store for an MLflow Tracking Server are located in the SageMaker service account, and the artifact store is located in an Amazon S3 bucket in the customer account.

Each tracking server has an Amazon Resource Name (ARN). You can use this ARN to connect the MLflow SDK to your tracking server and start logging your training runs to MLflow.
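The following is a minimal sketch of connecting the MLflow SDK to a tracking server by its ARN. It assumes the mlflow and sagemaker-mlflow packages are installed and that AWS credentials are configured. The ARN layout in the regular expression, the example ARN, the experiment name, and the logged values are assumptions for illustration only.

```python
import re

# Assumed ARN layout for a tracking server (not stated in the text above):
# arn:<partition>:sagemaker:<region>:<account-id>:mlflow-tracking-server/<name>
_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):mlflow-tracking-server/"
    r"(?P<name>[A-Za-z0-9][A-Za-z0-9-]*)$"
)


def parse_tracking_server_arn(arn: str) -> dict:
    """Split a tracking server ARN into its parts, or raise ValueError."""
    match = _ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a tracking server ARN: {arn!r}")
    return match.groupdict()


def log_example_run(tracking_server_arn: str) -> None:
    """Point the MLflow SDK at the tracking server and log one small run."""
    parse_tracking_server_arn(tracking_server_arn)  # fail early on typos

    # Imported lazily so the parser above works without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment("example-experiment")  # hypothetical name
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)  # illustrative values
        mlflow.log_metric("accuracy", 0.93)


# Hypothetical ARN; replace it with the ARN of your own tracking server.
parts = parse_tracking_server_arn(
    "arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/example-server"
)
print(parts["region"], parts["name"])
```

Calling log_example_run with a real ARN would create one run in the named experiment. The parsing step is optional and only guards against malformed input.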

Read on for more information about the following key concepts:

Backend metadata storage

When you create an MLflow Tracking Server, a backend store is automatically configured within the SageMaker service account and fully managed for you. The backend store persists metadata for each run, such as the run ID, start and end times, parameters, and metrics.

Artifact storage

To provide MLflow with persistent storage for the artifacts of each run, such as model weights, images, model files, and data files, you must create an artifact store using Amazon S3. The artifact store must be set up within your AWS account, and you must explicitly give MLflow access to Amazon S3 so that it can reach your artifact store. For more information, see Artifact Stores in the MLflow documentation.
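One way to grant that access is through a policy attached to the IAM role that the tracking server uses. The sketch below builds such a policy document for a single bucket. The function name, the bucket name, the prefix, and the choice of S3 actions are assumptions for illustration; adjust them to your own security requirements.

```python
import json


def artifact_store_policy(bucket: str, prefix: str = "") -> dict:
    """Build a policy document granting access to one artifact location."""
    clean_prefix = prefix.strip("/")
    object_path = f"{clean_prefix}/*" if clean_prefix else "*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed action set: read, write, and list on the store.
                "Action": ["s3:Get*", "s3:Put*", "s3:List*"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/{object_path}",
                ],
            }
        ],
    }


# Hypothetical bucket and prefix for the artifact store.
s3_policy = artifact_store_policy("example-mlflow-artifacts", "experiments/")
print(json.dumps(s3_policy, indent=2))
```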

MLflow Tracking Server sizes

You can optionally specify the size of your tracking server in the Studio UI or with the AWS CLI parameter --tracking-server-size. You can choose between "Small", "Medium", and "Large". The default MLflow tracking server configuration size is "Small". You can choose a size depending on the projected use of the tracking server such as the volume of data logged, number of users, and frequency of use.
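Besides the Studio UI and the AWS CLI, a tracking server can also be created programmatically. The sketch below uses the boto3 SageMaker client and validates the size before sending the request. It assumes boto3 is installed and credentials are configured; the helper names and all example values are hypothetical.

```python
VALID_SIZES = ("Small", "Medium", "Large")


def build_create_request(name, artifact_store_uri, role_arn, size="Small"):
    """Validate the inputs and return keyword arguments for the create call."""
    if size not in VALID_SIZES:
        raise ValueError(f"size must be one of {VALID_SIZES}, got {size!r}")
    if not artifact_store_uri.startswith("s3://"):
        raise ValueError("artifact_store_uri must be an s3:// URI")
    return {
        "TrackingServerName": name,
        "ArtifactStoreUri": artifact_store_uri,
        "RoleArn": role_arn,
        "TrackingServerSize": size,
    }


def create_tracking_server(name, artifact_store_uri, role_arn, size="Small"):
    """Create the tracking server and return its ARN (calls AWS)."""
    import boto3  # imported lazily so the validation helper has no dependencies

    client = boto3.client("sagemaker")
    request = build_create_request(name, artifact_store_uri, role_arn, size)
    response = client.create_mlflow_tracking_server(**request)
    return response["TrackingServerArn"]


# Hypothetical values; nothing is sent to AWS by building the request.
request = build_create_request(
    name="example-server",
    artifact_store_uri="s3://example-mlflow-artifacts/experiments",
    role_arn="arn:aws:iam::123456789012:role/example-mlflow-role",
    size="Medium",
)
print(request["TrackingServerSize"])
```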

We recommend using a small tracking server for teams of up to 25 users, a medium tracking server for teams of up to 50 users, and a large tracking server for teams of up to 100 users. These recommendations assume that all users make concurrent requests to your MLflow Tracking Server. Select the tracking server size based on your expected usage pattern and the transactions per second (TPS) supported by each tracking server size.

Note

The nature of your workload and the type of requests that you make to the tracking server dictate the TPS you see.

Tracking server size    Sustained TPS    Burst TPS
Small                   Up to 25         Up to 50
Medium                  Up to 50         Up to 100
Large                   Up to 100        Up to 200
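The team-size recommendations and sustained TPS limits above can be combined into a simple selection rule. The following sketch picks the smallest size that satisfies both. The function name and the rule of treating the two limits as hard thresholds are assumptions; real workloads may behave differently, as the note above explains.

```python
# (size, recommended maximum users, sustained TPS), taken from the text above.
SIZE_LIMITS = [
    ("Small", 25, 25),
    ("Medium", 50, 50),
    ("Large", 100, 100),
]


def recommend_size(users: int, sustained_tps: float) -> str:
    """Return the smallest size that covers both the team size and the TPS."""
    for size, max_users, max_tps in SIZE_LIMITS:
        if users <= max_users and sustained_tps <= max_tps:
            return size
    raise ValueError("Expected load exceeds the limits of a Large server")


print(recommend_size(users=10, sustained_tps=5))
print(recommend_size(users=30, sustained_tps=10))
print(recommend_size(users=10, sustained_tps=80))
```

The three calls select Small, Medium, and Large respectively: the second exceeds the Small team size, and the third exceeds the Medium sustained TPS even though the team is small.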

Tracking server versions

The following MLflow versions are available to use with SageMaker:

MLflow version    Python version
MLflow 2.13.2     Python 3.8 or later

AWS CloudTrail logs

AWS CloudTrail automatically logs activity related to your MLflow Tracking Server. The following API calls are logged in CloudTrail:

  • CreateMlflowTrackingServer

  • DescribeMlflowTrackingServer

  • UpdateMlflowTrackingServer

  • DeleteMlflowTrackingServer

  • ListMlflowTrackingServers

  • CreatePresignedMlflowTrackingServer

  • StartMlflowTrackingServer

  • StopMlflowTrackingServer

For more information about CloudTrail, see the AWS CloudTrail User Guide.
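To review these calls programmatically, one option is the CloudTrail LookupEvents API. The sketch below builds a lookup request for one of the API names listed above and, in a separate function, sends it with boto3. It assumes boto3 is installed and credentials are configured; the helper names and the 24-hour default window are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

# API call names as listed above.
TRACKING_SERVER_API_CALLS = (
    "CreateMlflowTrackingServer",
    "DescribeMlflowTrackingServer",
    "UpdateMlflowTrackingServer",
    "DeleteMlflowTrackingServer",
    "ListMlflowTrackingServers",
    "CreatePresignedMlflowTrackingServer",
    "StartMlflowTrackingServer",
    "StopMlflowTrackingServer",
)


def build_lookup_request(event_name, hours=24, now=None):
    """Return keyword arguments for a lookup of one API call name."""
    if event_name not in TRACKING_SERVER_API_CALLS:
        raise ValueError(f"Unknown tracking server API call: {event_name!r}")
    end_time = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
    }


def recent_calls(event_name, hours=24):
    """Return (time, user name) pairs for recent calls (calls AWS)."""
    import boto3  # imported lazily so the request builder has no dependencies

    client = boto3.client("cloudtrail")
    response = client.lookup_events(**build_lookup_request(event_name, hours))
    return [(e["EventTime"], e.get("Username")) for e in response["Events"]]


lookup = build_lookup_request("CreateMlflowTrackingServer")
print(lookup["LookupAttributes"])
```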

Amazon EventBridge events

Use EventBridge to route events generated by MLflow tracking servers in SageMaker to consumer applications across your organization. The following events are emitted to EventBridge:

  • "SageMaker Tracking Server Creating"

  • "SageMaker Tracking Server Created"

  • "SageMaker Tracking Server Create Failed"

  • "SageMaker Tracking Server Updating"

  • "SageMaker Tracking Server Updated"

  • "SageMaker Tracking Server Update Failed"

  • "SageMaker Tracking Server Deleting"

  • "SageMaker Tracking Server Deleted"

  • "SageMaker Tracking Server Delete Failed"

  • "SageMaker Tracking Server Starting"

  • "SageMaker Tracking Server Started"

  • "SageMaker Tracking Server Start Failed"

  • "SageMaker Tracking Server Stopping"

  • "SageMaker Tracking Server Stopped"

  • "SageMaker Tracking Server Stop Failed"

  • "SageMaker Tracking Server Maintenance In Progress"

  • "SageMaker Tracking Server Maintenance Complete"

  • "SageMaker Tracking Server Maintenance Failed"

  • "SageMaker MLFlow Tracking Server Creating Run"

  • "SageMaker MLFlow Tracking Server Creating RegisteredModel"

  • "SageMaker MLFlow Tracking Server Creating ModelVersion"

  • "SageMaker MLFlow Tracking Server Transitioning ModelVersion Stage"

  • "SageMaker MLFlow Tracking Server Setting Registered Model Alias"

For more information about EventBridge, see the Amazon EventBridge User Guide.
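As a sketch of how these events might be routed, the following code builds an EventBridge event pattern that matches a chosen set of the event names listed above, and shows how a rule could be created from it with boto3. It assumes that the event names appear in the detail-type field and that the event source is aws.sagemaker; the helper names and the rule name are hypothetical.

```python
import json


def tracking_server_event_pattern(detail_types):
    """Build an event pattern matching the given event names."""
    return {
        "source": ["aws.sagemaker"],  # assumed source for SageMaker events
        "detail-type": list(detail_types),
    }


def create_rule(rule_name, pattern):
    """Create an enabled rule for the pattern and return its ARN (calls AWS)."""
    import boto3  # imported lazily so the pattern builder has no dependencies

    events = boto3.client("events")
    response = events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    return response["RuleArn"]


# Match model registry activity, for example to trigger a review workflow.
pattern = tracking_server_event_pattern(
    [
        "SageMaker MLFlow Tracking Server Creating ModelVersion",
        "SageMaker MLFlow Tracking Server Transitioning ModelVersion Stage",
    ]
)
print(json.dumps(pattern, indent=2))
```

A rule created this way still needs a target, such as a queue or a function, before matched events are delivered anywhere.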