Overview of Machine Learning on Amazon EKS
Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes platform that empowers organizations to deploy, manage, and scale AI and machine learning (ML) workloads with a high degree of flexibility and control. Built on the open source Kubernetes ecosystem, EKS lets you apply your existing Kubernetes expertise while integrating with open source tools and AWS services.
Whether you’re training large-scale models, running real-time online inference, or deploying generative AI applications, EKS delivers the performance, scalability, and cost efficiency your AI/ML projects demand.
Why Choose EKS for AI/ML?
EKS is a managed Kubernetes platform that helps you deploy and manage complex AI/ML workloads. Built on the open source Kubernetes ecosystem, it integrates with AWS services, providing the control and scalability needed for advanced projects. For teams new to AI/ML deployments, existing Kubernetes skills transfer directly, allowing efficient orchestration of multiple workloads.
EKS supports everything from operating system customizations to compute scaling, and its open source foundation promotes technological flexibility, preserving choice for future infrastructure decisions. The platform provides the performance and tuning options AI/ML workloads require, supporting features such as:
- Full cluster control to fine-tune costs and configurations without hidden abstractions
- Sub-second latency for real-time inference workloads in production
- Advanced customizations like multi-instance GPUs, multi-cloud strategies, and OS-level tuning
- Ability to centralize workloads using EKS as a unified orchestrator across AI/ML pipelines
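As an illustration of the multi-instance GPU customization, the following Python sketch builds a Pod manifest that requests a single MIG slice instead of a whole GPU. It assumes the NVIDIA device plugin (for example, deployed through the NVIDIA GPU Operator) runs with the `mixed` MIG strategy on a MIG-capable GPU such as the NVIDIA A100 in Amazon EC2 P4d instances. Under that strategy, each MIG profile is advertised as its own extended resource named `nvidia.com/mig-<profile>`. The function name, container name, and image are hypothetical.

```python
import json


def mig_pod_manifest(name: str, image: str, mig_profile: str = "1g.5gb") -> dict:
    """Return a Pod manifest (as a dict) that requests one MIG slice.

    Assumes the NVIDIA device plugin uses the "mixed" MIG strategy, which
    exposes each MIG profile as its own extended resource.
    """
    resource_name = f"nvidia.com/mig-{mig_profile}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "inference",  # hypothetical container name
                    "image": image,
                    # Extended resources are set under limits; Kubernetes
                    # copies the limit into requests when requests are omitted.
                    "resources": {"limits": {resource_name: 1}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical image; replace with your own model server image.
    manifest = mig_pod_manifest("mig-demo", "example.com/model-server:latest")
    print(json.dumps(manifest, indent=2))
```

The JSON output can be saved to a file and applied with `kubectl apply -f`. Requesting a slice lets several small inference pods share one physical GPU with hardware-level isolation between them.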
Key use cases
Amazon EKS provides a robust platform for a wide range of AI/ML workloads, supporting various technologies and deployment patterns:
- Real-time (online) inference: EKS powers immediate predictions on incoming data, such as fraud detection, with sub-second latency using tools like TorchServe, Triton Inference Server, and KServe on Amazon EC2 Inf1 and Inf2 instances. These workloads benefit from dynamic scaling with Karpenter and KEDA, while leveraging Amazon EFS for model sharding across pods. Amazon ECR Pull Through Cache (PTC) accelerates model updates, and Bottlerocket data volumes with Amazon EBS-optimized volumes ensure fast data access.
- General model training: Organizations leverage EKS to train complex models on large datasets over extended periods using the Kubeflow Training Operator, Ray Train, and Torch Distributed Elastic on Amazon EC2 P4d and Amazon EC2 Trn1 instances. These workloads are supported by batch scheduling with tools like Volcano, Apache YuniKorn, and Kueue. Amazon EFS enables sharing of model checkpoints, and Amazon S3 handles model import/export with lifecycle policies for version management.
- Retrieval augmented generation (RAG) pipelines: EKS manages customer support chatbots and similar applications by integrating retrieval and generation processes. These workloads often use tools like Argo Workflows and Kubeflow for orchestration, vector databases like Pinecone, Weaviate, or Amazon OpenSearch Service, and expose applications to users via the AWS Load Balancer Controller (LBC). NVIDIA NIM optimizes GPU utilization, while Prometheus and Grafana monitor resource usage.
- Generative AI model deployment: Companies deploy real-time content creation services on EKS, such as text or image generation, using Ray Serve, vLLM, and Triton Inference Server on Amazon EC2 G5 instances and AWS Inferentia accelerators. These deployments optimize performance and memory utilization for large-scale models. JupyterHub enables iterative development, Gradio provides simple web interfaces, and the Mountpoint for Amazon S3 CSI driver allows mounting S3 buckets as file systems for accessing large model files.
- Batch (offline) inference: Organizations process large datasets efficiently through scheduled jobs with AWS Batch or Volcano. These workloads often use Amazon EC2 Inf1 and Inf2 instances for AWS Inferentia chips, Amazon EC2 G4dn instances for NVIDIA T4 GPUs, or C5 and C6i CPU instances, maximizing resource utilization during off-peak hours for analytics tasks. The AWS Neuron SDK and NVIDIA GPU drivers optimize performance, while Multi-Instance GPU (MIG) and time-slicing (TS) enable GPU sharing. Storage solutions include Amazon S3, Amazon EFS, and Amazon FSx for Lustre, with CSI drivers for various storage classes. Model management leverages tools like Kubeflow Pipelines, Argo Workflows, and Ray Cluster, while monitoring is handled by Prometheus, Grafana, and custom model monitoring tools.
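The dynamic scaling described for real-time inference can be sketched with KEDA. The following Python sketch builds a KEDA `ScaledObject` (API version `keda.sh/v1alpha1`) that scales an inference Deployment based on a Prometheus query. The Deployment name, Prometheus address, query, and threshold are hypothetical placeholders, and the example query assumes that Prometheus scrapes Triton Inference Server's `nv_inference_request_success` metric.

```python
import json


def keda_scaled_object(
    deployment: str,
    prometheus_url: str,
    query: str,
    threshold: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> dict:
    """Return a KEDA ScaledObject that scales `deployment` on a Prometheus metric."""
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("expected 0 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            # Target workload; KEDA defaults the target kind to Deployment.
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "prometheus",
                    # Trigger metadata is a string-to-string map in KEDA.
                    "metadata": {
                        "serverAddress": prometheus_url,
                        "query": query,
                        "threshold": str(threshold),
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    scaled = keda_scaled_object(
        deployment="triton-server",  # hypothetical Deployment name
        prometheus_url="http://prometheus.monitoring.svc:9090",  # hypothetical
        query="sum(rate(nv_inference_request_success[1m]))",
        threshold=50,
        min_replicas=2,
        max_replicas=8,
    )
    print(json.dumps(scaled, indent=2))
```

KEDA adjusts the pod count. If the new pods cannot be scheduled on existing nodes, a node autoscaler such as Karpenter can provision capacity for them.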
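The GPU sharing described for batch inference can use time-slicing in the NVIDIA Kubernetes device plugin. The following Python sketch builds the device plugin's sharing configuration, which oversubscribes each physical GPU as several schedulable `nvidia.com/gpu` units, and computes how many units a node would advertise. It assumes that the NVIDIA device plugin (or the NVIDIA GPU Operator) is installed and reads this configuration, for example from a ConfigMap, and that the shared resource is not renamed. The function names are hypothetical. Unlike MIG, time-slicing gives pods no memory or fault isolation from each other.

```python
import json


def time_slicing_config(replicas: int, resource: str = "nvidia.com/gpu") -> dict:
    """Return an NVIDIA device plugin config that time-slices each GPU `replicas` ways."""
    # This sketch rejects values below 2, since 1 replica would mean no sharing.
    if replicas < 2:
        raise ValueError("replicas must be at least 2 to share a GPU")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": resource, "replicas": replicas}],
            }
        },
    }


def advertised_gpu_units(physical_gpus: int, replicas: int) -> int:
    """Number of GPU units a node reports once time-slicing is enabled."""
    return physical_gpus * replicas


if __name__ == "__main__":
    # JSON is valid YAML, so this output can serve as the plugin's config file.
    print(json.dumps(time_slicing_config(4), indent=2))
    # For example, a node with 4 physical GPUs and 4 replicas advertises 16 units.
    print(advertised_gpu_units(physical_gpus=4, replicas=4))
```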
Case studies
Customers choose Amazon EKS for various reasons, such as optimizing GPU usage or running real-time inference workloads with sub-second latency, as demonstrated in the following case studies. For a list of all case studies for Amazon EKS, see AWS Customer Success Stories.
- Unitary processes 26 million videos daily using AI for content moderation, which requires high-throughput, low-latency inference, and has achieved an 80% reduction in container boot times, ensuring fast response to scaling events as traffic fluctuates.
- Miro, the visual collaboration platform supporting 70 million users worldwide, reported an 80% reduction in compute costs compared to its previous self-managed Kubernetes clusters.
- Synthesia, which offers generative AI video creation as a service for customers to create realistic videos from text prompts, achieved a 30x improvement in ML model training throughput.
- Harri, which provides HR technology for the hospitality industry, achieved 90% faster scaling in response to spikes in demand and reduced its compute costs by 30% by migrating to AWS Graviton processors.
- Ada Support, an AI-powered customer service automation company, achieved a 15% reduction in compute costs alongside a 30% increase in compute efficiency.
- Snorkel AI, which equips enterprises to build and adapt foundation models and large language models, achieved over 40% cost savings by implementing intelligent scaling mechanisms for its GPU resources.
Start using Machine Learning on EKS
To begin planning for and using machine learning platforms and workloads on Amazon EKS, proceed to the Get started with ML section.