Build an MLOps workflow by using Amazon SageMaker and Azure DevOps - AWS Prescriptive Guidance


Created by Deepika Kumar (AWS) and Sara van de Moosdijk (AWS)

Environment: Production

Technologies: Machine learning & AI; DevOps; Operations

Workload: Microsoft

AWS services: Amazon API Gateway; Amazon ECR; Amazon EventBridge; AWS Lambda; Amazon SageMaker

Summary

Machine learning operations (MLOps) is a set of practices that automate and simplify machine learning (ML) workflows and deployments. MLOps focuses on automating the ML lifecycle. It helps ensure that models are not just developed but also deployed, monitored, and retrained systematically and repeatedly. It brings DevOps principles to ML. MLOps results in faster deployment of ML models, better accuracy over time, and stronger assurance that they provide real business value.

Organizations often have existing DevOps tools and data storage solutions before starting their MLOps journey. This pattern showcases how to harness the strengths of both Microsoft Azure and AWS. It helps you integrate Azure DevOps with Amazon SageMaker to create an MLOps workflow.

The solution simplifies working across Azure and AWS: you can use Azure for development workflows and AWS for machine learning. It promotes an effective end-to-end process for building machine learning models, including data handling, training, and deployment on AWS. For efficiency, you manage these processes through Azure DevOps pipelines.

Prerequisites and limitations

Prerequisites

  • Azure subscription – Access to Azure services, such as Azure DevOps, for setting up the continuous integration and continuous deployment (CI/CD) pipelines.

  • Active AWS account – Permissions to use the AWS services used in this pattern.

  • Data – Access to historical data for training the machine learning model.

  • Familiarity with ML concepts – Understanding of Python, Jupyter Notebooks, and machine learning model development.

  • Security configuration – Proper configuration of roles, policies, and permissions across both Azure and AWS to ensure secure data transfer and access.

Limitations

  • This pattern does not provide guidance on secure cross-cloud data transfers. For more information about cross-cloud data transfers, see AWS Solutions for Hybrid and Multicloud.

  • Multicloud solutions may increase latency for real-time data processing and model inference.

  • This guidance provides one example of a multi-account MLOps architecture. Adjustments are necessary based on your machine learning and AWS strategy.

Architecture

Target architecture

The target architecture integrates Azure DevOps with Amazon SageMaker, creating a cross-cloud ML workflow. It uses Azure for CI/CD processes and SageMaker for ML model training and deployment. It outlines the process of obtaining data (from sources such as Amazon S3, Snowflake, and Azure Data Lake) through model building and deployment. Key components include CI/CD pipelines for model building and deployment, data preparation, infrastructure management, and Amazon SageMaker for training, evaluation, and deployment of ML models. This architecture is designed to provide efficient, automated, and scalable ML workflows across cloud platforms.

Architecture diagram of an MLOps workflow that uses Azure DevOps and SageMaker

The architecture consists of the following components:

  1. Data scientists perform ML experiments in the development account to explore different approaches for ML use cases by using various data sources. Data scientists perform unit tests and trials. Following model evaluation, data scientists push and merge the code to the Model Build repository, which is hosted on Azure DevOps. This repository contains code for a multi-step model building pipeline.

  2. On Azure DevOps, the Model Build Pipeline, which provides continuous integration (CI), can be activated automatically or manually upon code merge to the main branch. In the Automation account, this activates the SageMaker pipeline for data preprocessing, model training and evaluation, and conditional model registration based on accuracy.

  3. The Automation account is a central account across ML platforms that hosts ML environments (Amazon ECR), models (Amazon S3), model metadata (SageMaker Model Registry), features (SageMaker Feature Store), automated pipelines (SageMaker Pipelines), and ML log insights (CloudWatch and OpenSearch Service). This account allows reusability of ML assets and enforces best practices to accelerate delivery of ML use cases.

  4. The latest model version is added to SageMaker Model Registry for review. It tracks model versions and respective artifacts (lineage and metadata). It also manages the status of the model (approve, reject, or pending), and it manages the version for downstream deployment.

  5. After a trained model in Model Registry is approved through the studio interface or an API call, an event can be dispatched to Amazon EventBridge. EventBridge starts the Model Deploy pipeline on Azure DevOps.

  6. The Model Deploy pipeline, which provides continuous deployment (CD), checks out the source from the Model Deploy repository. The source contains code, the configuration for the model deployment, and test scripts for quality benchmarks. The Model Deploy pipeline can be tailored to your inference type.

  7. After quality control checks, the Model Deploy pipeline deploys the model to the Staging account. The Staging account is a copy of the Production account, and it is used for integration testing and evaluation. For a batch transformation, the Model Deploy pipeline can automatically update the batch inference process to use the latest approved model version. For a real-time, serverless, or asynchronous inference, it sets up or updates the respective model endpoint.

  8. After successful testing in the Staging account, a model can be deployed to the Production account by manual approval through the Model Deploy pipeline. This pipeline provisions a production endpoint in the Deploy to production step, including model monitoring and a data feedback mechanism.

  9. After the model is in production, use tools such as SageMaker Model Monitor and SageMaker Clarify to identify bias, detect drift, and continuously monitor the model's performance.
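The approval event in step 5 can be expressed as an EventBridge event pattern. The following sketch uses the documented SageMaker model package state change event shape; the matcher function is a simplified stand-in for EventBridge's own matching logic, and the model group name is hypothetical:

```python
# EventBridge event pattern that matches a model package reaching
# "Approved" status in SageMaker Model Registry (step 5 above).
approval_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelApprovalStatus": ["Approved"]},
}

def matches(pattern, event):
    """Simplified EventBridge matching: every pattern key must exist in the
    event, and the event value must be one of the listed allowed values."""
    for key, allowed in pattern.items():
        if key not in event:
            return False
        if isinstance(allowed, dict):
            if not matches(allowed, event[key]):
                return False
        elif event[key] not in allowed:
            return False
    return True

# Abbreviated sample event, as SageMaker might emit it.
sample_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Model Package State Change",
    "detail": {
        "ModelPackageGroupName": "churn-model",  # hypothetical name
        "ModelApprovalStatus": "Approved",
    },
}

print(matches(approval_pattern, sample_event))  # True
```

A rule with this pattern would then route the event to the target that starts the Model Deploy pipeline on Azure DevOps.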

Automation and scale

Use infrastructure as code (IaC) to automatically deploy to multiple accounts and environments. By automating the process of setting up an MLOps workflow, it is possible to separate the environments used by ML teams working on different projects. AWS CloudFormation helps you model, provision, and manage AWS resources by treating infrastructure as code.
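As an example of treating this infrastructure as code, a CloudFormation template can provision shared resources in the Automation account, such as a versioned artifact bucket and a model package group. This is a minimal sketch; the resource names are placeholders:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Shared MLOps resources for the Automation account (sketch)

Resources:
  # Versioned bucket for model artifacts produced by SageMaker pipelines.
  ModelArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "mlops-model-artifacts-${AWS::AccountId}"  # placeholder naming
      VersioningConfiguration:
        Status: Enabled

  # Model package group that holds the registered versions of one model.
  ModelPackageGroup:
    Type: AWS::SageMaker::ModelPackageGroup
    Properties:
      ModelPackageGroupName: churn-model  # hypothetical model group
      ModelPackageGroupDescription: Versions of the churn model
```

Deploying the same template into each environment account keeps the environments consistent and reproducible.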

Tools

AWS services

  • Amazon SageMaker is a managed ML service that helps you build and train ML models and then deploy them into a production-ready hosted environment.

  • AWS Glue is a fully managed extract, transform, and load (ETL) service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.

  • Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data. In this pattern, Amazon S3 is used for data storage and integrated with SageMaker for model training and model objects.

  • AWS Lambda is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use. In this pattern, Lambda is used for data preprocessing and postprocessing tasks.

  • Amazon Elastic Container Registry (Amazon ECR) is a managed container image registry service that’s secure, scalable, and reliable. In this pattern, it stores Docker containers that SageMaker uses as training and deployment environments.

  • Amazon EventBridge is a serverless event bus service that helps you connect your applications with real-time data from a variety of sources. In this pattern, EventBridge orchestrates event-driven or time-based workflows that initiate automatic model retraining or deployment.

  • Amazon API Gateway helps you create, publish, maintain, monitor, and secure REST, HTTP, and WebSocket APIs at any scale. In this pattern, it is used to create an external-facing, single point of entry for Amazon SageMaker endpoints.
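As a concrete illustration of the Lambda preprocessing mentioned above, a minimal handler might normalize an incoming JSON payload into the CSV row format that SageMaker endpoints commonly accept. This is a sketch under assumed field names, not part of the pattern's code:

```python
import json

# Hypothetical feature order expected by the model endpoint.
FEATURE_ORDER = ["age", "tenure_months", "monthly_spend"]

def handler(event, context):
    """Lambda preprocessing sketch: turn a JSON request body into a CSV row."""
    # API Gateway proxy events carry the payload as a JSON string in "body";
    # fall back to treating the event itself as the payload for direct invokes.
    body = json.loads(event["body"]) if isinstance(event.get("body"), str) else event
    row = ",".join(str(body[name]) for name in FEATURE_ORDER)
    # In the real workflow, this row would be sent to the SageMaker endpoint
    # (for example, through the sagemaker-runtime InvokeEndpoint API).
    return {"statusCode": 200, "body": row}

print(handler({"age": 42, "tenure_months": 18, "monthly_spend": 59.9}, None))
```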


Best practices

Before implementing any component of this multicloud MLOps workflow, complete the following activities:

  • Define and understand the machine learning workflow and the tools required to support it. Different use cases require different workflows and components. For example, a feature store might be required for feature reuse and low latency inference in a personalization use case, but it may not be required for other use cases. Understanding the target workflow, use case requirements, and preferred collaboration methods of the data science team is needed to successfully customize the architecture.

  • Create a clear separation of responsibility for each component of the architecture. Spreading data storage across Azure Data Lake Storage, Snowflake, and Amazon S3 can increase complexity and cost. If possible, choose a consistent storage mechanism. Similarly, avoid using a combination of Azure and AWS DevOps services, or a combination of Azure and AWS ML services.

  • Choose one or more existing models and datasets to perform end-to-end testing of the MLOps workflow. The test artifacts should reflect real use cases that the data science teams develop when the platform enters production.

Epics

Task | Description | Skills required

Identify data sources.

Based on current and future use cases, available data sources, and types of data (such as confidential data), document the data sources that need to be integrated with the MLOps platform. Data can be stored in Amazon S3, Azure Data Lake Storage, Snowflake, or other sources. Create a plan for integrating these sources with your platform and securing access to the correct resources.

Data engineer, Data scientist, Cloud architect

Choose applicable services.

Customize the architecture by adding or removing services based on the desired workflow of the data science team, applicable data sources, and existing cloud architecture. For example, data engineers and data scientists may perform data preprocessing and feature engineering in SageMaker, AWS Glue, or Amazon EMR. It is unlikely that all three services would be required.

AWS administrator, Data engineer, Data scientist, ML engineer

Analyze security requirements.

Gather and document security requirements. This includes determining:

  • Which teams or engineers can access specific data sources

  • Whether teams are allowed to access the code and models of other teams

  • Which permissions (if any) team members should have for non-development accounts

  • Which security measures need to be implemented for cross-cloud data transfer

AWS administrator, Cloud architect
Task | Description | Skills required

Set up AWS Organizations.

Set up AWS Organizations on the root AWS account. This helps you manage the subsequent accounts that you create as part of a multi-account MLOps strategy. For more information, see the AWS Organizations documentation.

AWS administrator
Task | Description | Skills required

Create an AWS development account.

Create an AWS account where data engineers and data scientists have permissions to experiment and create ML models. For instructions, see Creating a member account in your organization in the AWS Organizations documentation.

AWS administrator

Create a Model Build repository.

Create a Git repository in Azure where data scientists can push their model build and deployment code after the experimentation phase is complete. For instructions, see Set up a Git repository in the Azure DevOps documentation.

DevOps engineer, ML engineer

Create a Model Deploy repository.

Create a Git repository in Azure that stores standard deployment code and templates. It should include code for every deployment option that the organization uses, as identified in the design phase. For example, it should include real-time endpoints, asynchronous endpoints, serverless inference, or batch transforms. For instructions, see Set up a Git repository in the Azure DevOps documentation.

DevOps engineer, ML engineer

Create an Amazon ECR repository.

Set up an Amazon ECR repository that stores the approved ML environments as Docker images. Allow data scientists and ML engineers to define new environments. For instructions, see Creating a private repository in the Amazon ECR documentation.

ML engineer

Set up SageMaker Studio.

Set up SageMaker Studio on the development account according to the previously defined security requirements and preferred data science tools, such as your choice of integrated development environment (IDE). Use lifecycle configurations to automate the installation of key functionality and create a uniform development environment for data scientists. For more information, see Amazon SageMaker Studio in the SageMaker documentation.

ML engineer, Data scientist
Task | Description | Skills required

Create an Automation account.

Create an AWS account where automated pipelines and jobs run. You can give data science teams read access to this account. For instructions, see Creating a member account in your organization in the AWS Organizations documentation.

AWS administrator

Set up a model registry.

Set up SageMaker Model Registry in the Automation account. This registry stores the metadata for ML models and helps certain data scientists or team leads to approve or reject models. For more information, see Register and deploy models with Model Registry in the SageMaker documentation.

ML engineer
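Approving or rejecting a model version in this registry maps to the SageMaker `UpdateModelPackage` API. The sketch below only builds the request parameters (the ARN is a placeholder); in practice, you would pass them to a boto3 SageMaker client:

```python
# Build parameters for approving a model version in SageMaker Model Registry.
def build_approval_request(model_package_arn, approved=True):
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": "Approved" if approved else "Rejected",
        "ApprovalDescription": "Passed offline evaluation",  # free-text note
    }

# Placeholder ARN; a real one comes from the model registry.
params = build_approval_request(
    "arn:aws:sagemaker:ap-southeast-2:111122223333:model-package/churn-model/3"
)
# With boto3, this would be submitted as:
#   boto3.client("sagemaker").update_model_package(**params)
print(params["ModelApprovalStatus"])  # Approved
```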

Create a Model Build pipeline.

Create a CI/CD pipeline in Azure that starts manually or automatically when code is pushed to the Model Build repository. The pipeline should check out the source code and create or update a SageMaker pipeline in the Automation account. The pipeline should add a new model to the model registry. For more information about creating a pipeline, see the Azure Pipelines documentation.

DevOps engineer, ML engineer
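An Azure Pipelines definition for this CI stage might look like the following sketch. The script name, variable names, and credential mechanism are assumptions, not part of this pattern:

```yaml
# azure-pipelines.yml for the Model Build repository (sketch)
trigger:
  branches:
    include:
      - main  # run on merge to main; manual runs are also possible

pool:
  vmImage: ubuntu-latest

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: "3.11"

  - script: pip install -r requirements.txt
    displayName: Install dependencies

  # Assumes AWS credentials are supplied to the job (for example, through a
  # variable group) and that upsert_pipeline.py is a hypothetical script that
  # creates or updates the SageMaker pipeline in the Automation account.
  - script: python upsert_pipeline.py --region $(AWS_REGION)
    displayName: Create or update SageMaker pipeline
    env:
      AWS_ACCESS_KEY_ID: $(AWS_ACCESS_KEY_ID)
      AWS_SECRET_ACCESS_KEY: $(AWS_SECRET_ACCESS_KEY)
```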
Task | Description | Skills required

Create AWS staging and deployment accounts.

Create AWS accounts for staging and deployment of ML models. These accounts should be identical to allow for accurate testing of the models in staging before moving to production. You can give data science teams read access to the staging account. For instructions, see Creating a member account in your organization in the AWS Organizations documentation.

AWS administrator

Set up S3 buckets for model monitoring.

Complete this step if you want to enable model monitoring for the deployed models that are created by the Model Deploy pipeline. Create Amazon S3 buckets for storing the input and output data. For more information about creating S3 buckets, see Creating a bucket in the Amazon S3 documentation. Set up cross-account permissions so that the automated model monitoring jobs run in the Automation account. For more information, see Monitor data and model quality in the SageMaker documentation.

ML engineer
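The cross-account access for the monitoring jobs can be granted with a bucket policy on the monitoring buckets. The following sketch assumes a placeholder account ID, role name, and bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAutomationAccountMonitoringRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/sagemaker-monitoring-role"
      },
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::model-monitoring-bucket",
        "arn:aws:s3:::model-monitoring-bucket/*"
      ]
    }
  ]
}
```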

Create a Model Deploy pipeline.

Create a CI/CD pipeline in Azure that starts when a model is approved in the model registry. The pipeline should check out the source code and model artifact, build the infrastructure templates for deploying the model in the staging and production accounts, deploy the model in the staging account, run automated tests, wait for manual approval, and deploy the approved model into the production account. For more information about creating a pipeline, see the Azure Pipelines documentation.

DevOps engineer, ML engineer
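The staging deployment, automated tests, and manual production gate can be modeled as a multi-stage Azure Pipelines definition. In this sketch, the script names are hypothetical, and the manual approval is assumed to be configured as a check on the `production` environment in Azure DevOps:

```yaml
# azure-pipelines.yml for the Model Deploy repository (sketch)
trigger: none  # started by the EventBridge-driven trigger, not by commits

stages:
  - stage: DeployStaging
    jobs:
      - job: Deploy
        steps:
          - script: python deploy_model.py --account staging  # hypothetical script
            displayName: Deploy approved model to Staging
          - script: python run_integration_tests.py
            displayName: Run automated quality checks

  - stage: DeployProduction
    dependsOn: DeployStaging
    jobs:
      # Targeting an Azure DevOps environment with a manual approval check
      # gates the production rollout until a reviewer approves it.
      - deployment: Deploy
        environment: production
        strategy:
          runOnce:
            deploy:
              steps:
                - script: python deploy_model.py --account production
                  displayName: Deploy model to Production
```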
Task | Description | Skills required

Build AWS CDK or CloudFormation templates.

Define AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation templates for all environments that need to be deployed automatically. This might include the development environment, automation environment, and staging and deployment environments. For more information, see the AWS CDK and CloudFormation documentation.

AWS DevOps

Create an Infrastructure pipeline.

Create a CI/CD pipeline in Azure for infrastructure deployment. An administrator can initiate this pipeline to create new AWS accounts and set up the environments that the ML team requires.

DevOps engineer

Troubleshooting

Issue | Solution

Insufficient monitoring and drift detection – Inadequate monitoring can lead to missed detection of model performance issues or data drift.

Strengthen monitoring frameworks with tools such as Amazon CloudWatch, SageMaker Model Monitor, and SageMaker Clarify. Configure alerts for immediate action on identified issues.

CI pipeline trigger errors – The CI pipeline in Azure DevOps might not be triggered upon code merge due to misconfiguration.

Check the Azure DevOps project settings to ensure that the pipeline's branch triggers and webhooks are properly configured and point to the correct targets.

Governance – The central Automation account might not enforce best practices across ML platforms, leading to inconsistent workflows.

Audit the Automation account settings, ensuring that all ML environments and models conform to predefined best practices and policies.

Model registry approval delays – This happens when there's a delay in checking and approving the model, either because people take time to review it or because of technical issues.

Implement a notification system to alert stakeholders of models that are pending approval, and streamline the review process.

Model deployment event failures – Events dispatched to start model deployment pipelines might fail, causing deployment delays.

Confirm that Amazon EventBridge has the correct permissions and event patterns to invoke Azure DevOps pipelines successfully.

Production deployment bottlenecks – Manual approval processes can create bottlenecks, delaying the production deployment of models.

Optimize the approval workflow within the model deploy pipeline, promoting timely reviews and clear communication channels.
