Multi-tenant generative AI platform scenario
This scenario presents a generative AI service approach to alleviate challenges that organizations are facing in governing generative AI development. A generative AI service provides a managed and governed environment to develop generative AI applications at scale. The service aims to consolidate components and offer as foundational building blocks to lines of businesses (LOBs) and internal teams.
It offers a suite of tools, services, and infrastructure to streamline and automate parts of the generative AI lifecycle from data preparation, model fine tuning, developing and evaluating applications, deployment, and operations. Generative AI services democratize access to advanced AI technologies, enabling both seasoned data scientists and engineers to create sophisticated features and applications. They also help you address critical areas of model governance, security, and compliance, aligning generative AI initiatives with organizational and regulatory requirements.
To scale the power of AI application development at enterprises, we need to solve these key challenges:
-
Limitations of LLM domain knowledge and need for enterprise data to power AI apps: 1. Limitations of LLM domain knowledge and need for enterprise data to power AI apps: Commercial and open source LLMs typically cannot solve enterprise's most pressing needs upon their creation. There is a need for the application of several techniques to solve industry problems and add value for enterprise customers. These include retrieval augmented generation (RAG), grounding LLMs with context on domain-specific data, and integrating with existing capabilities.
-
Support for rapid experimentation: Generative AI technologies and customer expectations are changing at rapid speed. Competitive and industry leading solutions require that we provide teams with the ability to experiment with large numbers of potential solutions quickly so they can identify the best possible ways to leverage generative AI in their applications.
-
Responsible development: As generative AI technologies evolve, we must continuously consider risks to customers and users and identify strategies to mitigate them. We also need to account for security, legal, compliance, and privacy requirements related to handling user data on a single system across the enterprise.
Scenario characteristics
The service serves as a single point of access for state of the art generative AI models and common core services avoiding the need for individual departments to deploy separate solutions. It typically includes a gateway that provides unified APIs to access state-of-the-art multi-modal models across different providers. For building agentic applications, service will have a runtime to deploy agents, a gateway for agents, and tool discoverability and connectivity. A mature service provides infrastructure and experiment tracking capabilities for model customization. It provides reusable data pipelines and hooks to integrate with enterprise data.
The service must implement robust data encryption, access controls, and data residency compliance. It needs to avoid leaking sensitive organizational. The service requires high availability, low latency response times, and the ability to handle concurrent users without degradation in service quality. Comprehensive logging, usage analytics, cost tracking, and performance monitoring capabilities are necessary for optimization and governance. The service fosters responsible AI development with built-in evaluations and guradrails. The service can be developed in a federated pattern where common core services are offered by the service, and other parts of the application such as orchestration and data pipelines can be owned by the end users or LOB teams.
Service personas
The personas interacting with generative AI service are:
-
Service admins, who onboard LoBs onto the service, tenant rate limits, and manager use access.
-
Service engineers, who build and manage the service, on-board applications, and services. They also onboard models and host them. Engineers manage infrastructure provisioning, CI/CD pipelines, and deployments across cloud and on-premises Kubernetes environments, requiring robust automation tools and unified observability dashboards
-
DevOps engineers are responsible for infrastructure provisioning, CI/CD pipelines for AI models and applications, monitoring service health and AI application performance, and managing deployments across both cloud (AWS) and on-premises Kubernetes environments. Their specific needs include robust automation tools, clear deployment patterns for containerized models, unified observability dashboards, and infrastructure as code (IaC) templates.
-
Software developers (application developers within LOBs) build user-facing applications that integrate generative AI capabilities. They need SDKs and APIs to access service services (like fine-tuned models or RAG components), well-documented interfaces, pre-built accelerator components for common generative AI patterns, and clear examples for integrating AI into products.
-
Data scientists and ML engineers (central AI team and LOBs) focus on selecting, customizing, fine-tuning, evaluating, and deploying foundation models. They prepare domain-specific datasets and develop new AI-driven functionalities. Their specific needs include access to a curated catalog of foundation models, powerful tools for data preprocessing and augmentation, streamlined workflows for model fine-tuning (for example, using Amazon SageMaker AI for PEFT or LoRA), robust experimentation tracking (for example, MLflow or SageMaker AI Experiments), standardized model evaluation frameworks, and pathways to package models for Amazon Bedrock or on-premises deployment.
-
LOB managers and product owners are responsible for defining product roadmaps that use AI and managing ROI and budgets. They need assurance of governance, security, and compliance for AI features, clear cost attribution, capabilities for rapid prototyping to validate ideas, and metrics on AI feature adoption and impact.
Service components
A generative AI service functions as a comprehensive system that combines multiple essential components, enabling the development and deployment of AI capabilities at scale. Drawing parallels to traditional enterprise technology services, it provides developers with a standardized and flexible framework encompassing crucial elements such as user interaction handling, rendering capabilities, runtime environments, and both temporary and permanent storage solutions.
The service implements sophisticated access control mechanisms, including role-based and use case-specific permissions, providing robust security across system components. Additionally, it comes equipped with built-in features for observability, governance, and cost tracking, delivering a streamlined management experience that aligns with enterprise standards. This integrated approach allows organizations to effectively harness AI capabilities while maintaining control, security, and operational efficiency throughout their AI initiatives.
The AI service is designed with a single tenet: build once and deploy anywhere. Amazon SageMaker AI serves as a comprehensive service for the entire ML lifecycle, particularly excelling in model customization and fine-tuning. It offers built-in algorithms, but more importantly, it provides robust support for custom training scripts and environments (for example, using PyTorch or TensorFlow) essential for advanced fine-tuning techniques like Parameter-Efficient Fine-Tuning (PEFT) methods (like LoRA or QLoRA) or full model fine-tuning. It includes features like SageMaker AI Experiments for tracking tuning runs and automated hyperparameter optimization to find the best configurations for custom models, all while handling the underlying infrastructure complexity (including optimized instances like Trainium).
Once models are trained and fine-tuned in Amazon SageMaker AI (or sourced from elsewhere), Amazon Bedrock provides a serverless approach to model deployment and inference. It allows access to both Amazon's and leading third-party foundation models (like Claude, Titan, and Llama 2) and, crucially for this service, custom-trained or fine-tuned models imported as provisioned throughput or using custom model import features. This is achieved through a unified API, simplifying integration.
Amazon Bedrock can also be used for continued pre-training or fine-tuning of select foundation models directly within its environment, further streamlining the customization workflow for supported models. It reduces the need to manage infrastructure for inference, automatically scaling based on demand.
This integration allows organizations to focus on their ML use cases rather than infrastructure management, while maintaining security and governance through AWS's built-in controls. Models trained on Amazon Sagemaker AI can be containerized and stored in a container registry for on-premises deployment into a Kubernetes architecture.
Amazon's Trainium (Trn2) and Inferentia (Inf2) instance types
offer cost-effective alternatives to GPU instances for machine
learning workloads. Trainium is designed for training deep
learning models, provides up to
30-40%
cost savings compared to GPU-based instances
Architecture and design
API layer and router
In an AI service, routing functionality serves as a crucial orchestration layer to intelligently direct data flows and service requests. A routing mechanism seamlessly manages communications between different service components including inference engines, model repositories, and data processing pipelines.
A key feature of the router is its ability to dynamically direct inference requests to the most appropriate runtime environment based on the service's deployment context. For instance, when the AI service LLM prompts to propreitary models, the router channels requests to Amazon Bedrock for inference.
Conversely, when requests are made to models not available on Amazon Bedrock, the router redirects these requests to local inference infrastructure. This intelligent routing maintains system flexibility across different deployment scenarios.
Training pipeline
The training pipeline is designed with the build once and deploy anywhere philosophy, generating model artifacts that seamlessly support both Amazon Bedrock and Amazon Sagemaker deployments.
At its core, the pipeline produces optimized model weights and parameters that serve as the foundation for multiple deployment scenarios. These artifacts are packaged in formats compatible with Amazon Bedrock for cloud deployment, enabling organizations to use AWS managed services for scalable inference.
Simultaneously, the pipeline automatically containerizes the model completely with the necessary dependencies and configurations, facilitating straightforward Amazon Sagemaker AI deployment.
This dual-output approach creates consistency across deployment environments while reducing the need for environment-specific model training or manual intervention, streamlining the transition from training to production regardless of the target infrastructure.
Responsible AI: Guardrails for security and privacy
Embed controls throughout the architecture to address compliance, privacy, and security concerns. The AI service registries and policies help enforce applicable guardrails so that authorization, data handling, legal, security, privacy, and compliance requirements can be addressed.
To solve for the wide range of applications developed using the AI service, we need an extensible and configurable framework to embed applicable controls reducing redundant work, accelerating development and maintaining consistency with an enterprise's responsible AI practices.
Additionally, policies should address ethical considerations such as bias detection, fairness metrics, and transparency in model decisions.
Domain-adapted custom LLMs
The strategy is to use the best LLMs to meet the needs of your use cases. This includes domain-adapted LLMs built on top of LLMs such as Llama and Nova, which are custom trained on your domain-specific data and specialize in solving customer specific usecases.
These custom LLMs help overcome the limitations of commercial or open-source LLMs, providing increased ability for your teams to manage a variety of issues, including accuracy, cost, and latency issues.
Inference
The AI service's inference component implements a unified abstraction layer which provides a standardized interface for model invocation across Amazon Bedrock, Amazon Sagemaker AI, and Amazon EC2 deployment. This abstraction provides consistent interaction patterns whether accessing models hosted on Amazon Bedrock, deployed on Amazon SageMaker AI, or running in self-hosted Amazon EC2 infrastructure.
By normalizing the inference interface, developers can seamlessly switch between different model deployments without modifying their application code, while maintaining consistent request and response patterns. This architectural approach simplifies integration efforts, promotes code reusability, and provides flexibility in model deployment choices while abstracting away the underlying complexity of different hosting environments and their specific implementation details.
Policies
Security policies for AI development require a comprehensive framework that addresses multiple layers of protection throughout the AI lifecycle. Organizations must implement strict access controls and authentication mechanisms to protect sensitive training data, model artifacts, and deployment environments, often using principles of least privilege and role-based access control (RBAC).
Data governance policies should enforce encryption both at rest and in transit, with special attention to data anonymization and regulatory compliance.
Version control and audit trails must be maintained for all model development stages, including training data, model parameters, and deployment configurations.
Security scanning of containerized applications and regular vulnerability assessments should be mandatory, along with continuous monitoring for unusual patterns or potential security breaches.
Agents and tools: A source of domain-specific capabilities
The agents and tool components are fundamental elements of modern AI services that enhance their functionality and problem-solving capabilities.
Agents act as autonomous software entities that can perceive their environment, make decisions, and take actions to achieve specific goals, while tools are specialized functions or utilities that agents can use to perform specific tasks.
In a well-designed AI service, agents can dynamically select and combine different tools such as language processors, data analyzers, API connectors, and computational modules to solve complex problems. Customers can use open source frameworks such as LangGraph and Strands SDK for building agents and use Genesis for deployment on AWS.
This approach results in using open source tools and serverless deployment for scalability and reliability.
Configuration and implementation
The preceding AI service diagram illustrates that the gateway is a crucial component in the system architecture. It serves as a central point of control and management for integrating various AI model providers. Let's explore how the gateway can be effectively implemented to accommodate and support a diverse range of model providers, providing flexibility and scalability in the AI service's capabilities.
Gateway configuration and setup
The generative AI service uses a modular, cloud-native architecture with a central generative AI gateway responsible for securely routing inference requests across multiple foundation models and deployment backends. The gateway is implemented using the LiteLLM open-source project and deployed as a containerized service on either Amazon ECS or Amazon EKS, depending on the operational needs of the enterprise.
At the core of the gateway is a FastAPI-based proxy service served through Uvicorn, listening on port 4000. It offers OpenAI-compatible RESTful APIs and supports various transport protocols including HTTP/2, Server-Sent Events (SSE), and WebSockets. The gateway translates and forwards requests to downstream model providers such as Amazon Bedrock or third-party APIs like OpenAI and Anthropic.
A separate adapter, also containerized, operates on port 3000 and provides translation services to convert OpenAI-style calls to provider-specific APIs, with full support for Bedrock's API syntax and provisioning models.
The infrastructure is designed for multi-tenant operation, with each tenant onboarded through a virtual key mechanism supporting OAuth 2.0 bearer token authentication. LoBs are assigned rate limits, provisioning tiers, and usage quotas which are enforced through a combination of in-memory and persistent controls. Sensitive credentials such as API keys, configuration parameters, and routing logic are securely stored and managed using AWS Secrets Manager.
The system architecture integrates tightly with the broader AWS system. Incoming traffic is routed through AWS RouteĀ 53 and protected by AWS Web Application Firewall (WAF). An Application Load Balancer (ALB) distributes traffic to containerized gateway instances running in a customer-defined VPC.
Persistent metadata including tenant configurations, cost usage records, and model invocation logs are stored in Amazon RDS. Amazon ElastiCache (Redis) is employed for both semantic and request-level caching, significantly improving response times and reducing redundant calls to LLM providers.
Additionally, Amazon S3 serves as the long-term storage backend for evaluation datasets, prompt templates, logging archives, and fine-tuned model artifacts.
Operational telemetry is collected using Amazon CloudWatch, enabling real-time monitoring of gateway health, token usage, failure rates, and model performance. Integration with OpenTelemetry-compatible tools fosters extensibility for enterprise-wide observability services like Datadog or Grafana.
Example gateway architecture
Best practices, considerations, and trade-offs
A number of architectural decisions have been made to balance performance, security, cost, and manageability within this implementation.
Security and governance are implemented by default. TLS encryption is enforced using AWS Certificate Manager (ACM), and communication between the gateway and model providers is encrypted in transit.
Model guardrails, PII redaction, and usage policies are enforced both at the Bedrock layer (for models hosted on AWS) and within the LiteLLM proxy for external model calls. Audit logging, including invocation details and metadata, is enabled across API calls.
Tenant isolation is a central design goal. Each tenant has its own rate limits, API tokens, and cost tracking records, maintained both in memory and in persistent store. This architecture allows enterprises to manage fine-grained usage controls and enables internal chargeback or cost attribution strategies.
In terms of deployment choices, the service supports both Amazon ECS and Amazon EKS. Amazon ECS provides a simpler operational model with lower overhead and is recommended for most use cases, especially those that can benefit from AWS Fargate's serverless container execution. Amazon EKS is suitable for customers who need greater flexibility, including hybrid or on-premise deployments, or those already invested in Kubernetes-native tooling and GitOps pipelines.
Model provider selection and routing is abstracted from the application layer. LiteLLM enables latency-aware, error-based, and fallback routing strategies. If a primary model endpoint fails or hits rate limits, requests are seamlessly redirected to alternative endpoints based on predefined logic, including least-busy selection or weighted routing. Prompt routing and versioning are also supported, which is particularly useful for A/B testing and model experiments.
A key trade-off arises in the choice between serverless inference (using Amazon Bedrock) and self-managed model serving (using Amazon SageMaker AI or containerized on-premises deployments). Amazon Bedrock provides elasticity, low operational overhead, and built-in access to top commercial FMs, making it ideal for most general-purpose applications.
However, organizations that require extensive model customization or need to support specialized ML pipelines may benefit from training and deploying models in Amazon SageMaker AI, where full control over infrastructure, training loops, and optimization techniques such as LoRA or QLoRA is available. Trained models from SageMaker AI can be containerized and deployed using Bedrock's custom model import feature or to on-prem Kubernetes clusters as needed.
Optimization techniques, scaling strategies, and cost-saving measures
This service architecture incorporates multiple layers of optimization to provide scalability and cost efficiency across workloads.
Caching is employed aggressively at the request, prefix, and semantic levels using Amazon ElastiCache. For applications with repeated or similar prompt structures, this approach can reduce token usage and API calls by a significant margin. The gateway also supports asynchronous request handling and job queuing, allowing for scalable batch inference and high-throughput scenarios.
Scaling is handled at both the infrastructure and application levels. Container workloads on Amazon ECS or Amazon EKS scale automatically based on CPU, memory, or custom CloudWatch metrics such as token throughput or request latency. The stateless nature of the LiteLLM gateway enables horizontal scaling without service disruption, and backend components like Amazon Redis and Amazon RDS are provisioned with multi-AZ high availability for fault tolerance.
Cost controls are tightly integrated into the service. The API calls include a callback function that updates tenant-level usage records in the backend database. Combined with AWS Cost and Usage Reports (CUR) and AWS Budgets, this enables proactive alerting on spend thresholds, rate exhaustion, and budget overruns. In addition, Amazon Bedrock's provisioned throughput model allows organizations to predictably budget for high-volume applications.
Finally, a comprehensive observability stack, using CloudWatch, OpenTelemetry, and optionally Langfuse, supports model evaluation, tracing, and token-based metrics. This enables continuous performance monitoring, facilitates root-cause analysis, and supports responsible AI practices by surfacing metrics like hallucination frequency, latency variation, and guardrail violations.
Security and compliance
Implementing strong security practices is foundational tenet of the generative AI service. It is the bedrock upon which enterprise trust is built, enabling teams to innovate confidently while safeguarding organizational data and intellectual property. The service should implement a defense-in-depth strategy that addresses security within the stack, from the network perimeter to the individual API call.
Network and perimeter security
All external communication with the service's API gateway and other endpoints must be encrypted in transit using strong, current TLS protocols. Communication should be over TLS, and private network access should be supported.
For internal or hybrid cloud scenarios, the service must integrate seamlessly with the enterprise network fabric. This is achieved using AWS PrivateLink, which allows services within the LOB VPCs to connect to the service's core services (like the gateway or Amazon Bedrock) as if they were local, without traversing the public internet. This minimizes the scope of impact and verifies that data remains within the AWS network boundary.
The service's public-facing endpoints, such as the gateway, should be protected by AWS WAF to mitigate common web exploits, SQL injection, and malicious bot activity.
Identity, access, and tenant management
Secure access begins with robust identity management. User access should be secure, and a system should support fine-grained access control. The service must integrate with the enterprise's existing identity provider (for example, Okta or Azure AD) for single sign-on (SSO). Role-based access control (RBAC) is then applied to enforce the principle of least privilege for the different service personas.
For instance, a service admin can manage tenants and global policies, a data scientist can access fine-tuning pipelines and model registries, while an application developer can only generate API keys for specific, pre-approved models relevant to their LOB.
Rate limiting and throttling should be in place to help reduce abuse. This is managed at the gateway, where each tenant is assigned specific usage quotas and burst limits, helping to protect the service from denial-of-service attacks and verifying fair resource allocation.
Data and model security
Protecting data and custom models is paramount. For data security, data should be encrypted at rest and transit, and tenant data isolation patterns should be implemented. All data, including training datasets in Amazon S3, operational logs, and cached results in ElastiCache, must be encrypted at rest using AWS KMS, with customer-managed keys (CMKs) for maximum control. Embeddings stored in vector stores should be encrypted. This is critical to avoid the reverse-engineering of sensitive source data from its vector representation.
For model security, custom model weights should be encrypted and isolated for different tenants. Fine-tuned model artifacts stored in Amazon S3 or a container registry like ECR must be encrypted and protected by resource policies that restrict access to authorized deployment roles or services only. This logical isolation verifies that one LOB's proprietary model cannot be accessed by another.
Responsible AI and auditability
Beyond infrastructure security, the service must enforce responsible AI practices. Guardrails should be applied to input and output to filter topics and harmful content. The service uses built-in capabilities like Amazon Bedrock Guardrails to create customizable policies for denying specific topics, filtering PII and profanity, and removing harmful language.
These guardrails are a configurable control plane applied by the gateway to inference requests to provide consistent policy enforcement. For full accountability, collect telemetry for actions that users take on the central system. This includes detailed audit trails through AWS CloudTrail for API management actions and comprehensive logging of inference requests (including metadata, but not PII) to Amazon CloudWatch for security forensics, compliance reporting, and troubleshooting.
While the service provides these foundational controls, data quality is ownership of the consuming applications or data producers. Similarly, the consuming applications should integrate observability into applications to monitor for issues like data or concept drift that fall outside the service's direct control.
Validation and testing
Validation in the context of generative AI is not a single gate but a continuous, multi-faceted discipline essential for building trust, maintaining accuracy, and delivering tangible business value. It spans the entire application lifecycle, from initial model selection to post-deployment monitoring.
Model and application evaluation capabilities are essential needs throughout the lifecycle of a generative AI application. The AI service is designed to industrialize this process, moving it from a manual effort to a systematic, repeatable practice. AI services play multiple roles when it comes to evaluation.
The service achieves this through a holistic approach that combines automated metrics, human-in-the-loop workflows, and robust operational monitoring.
-
Automated and judge-based evaluation: During development, the service must provide access to models and application as API and batch inference to evaluate and models to serve as judge. This allows data scientists to run batch jobs against evaluation datasets to calculate quantitative metrics (for example, ROUGE for summarization or code-match for code generation). More powerfully, it supports the LLM-as-a-judge pattern, where a powerful model like Claude 3 Opus is used to score the output of a candidate model based on qualitative criteria like helpfulness, coherence, or adherence to a specific persona.
-
Traceability and experiment management: Meaningful evaluation requires perfect recall of the conditions that produced a given result. The service must provide tracing capabilities to associate evaluation results to model, application, endpoint, dataset, and prompt templates. This is achieved by integrating with tools like MLflow or SageMaker AI Experiments, where evaluation runs log the exact model version, prompt hash, hyperparameters, and evaluation dataset used. This traceability is crucial for debugging, reproducing results, and satisfying audit requirements.
-
Efficient resource provisioning: Evaluations, especially on large datasets, can be computationally intensive. The service will provide on-demand compute resources to run the evaluation, sing services like AWS Batch or Amazon SageMaker AI Processing jobs. This allows teams to run large-scale evaluations in parallel without managing underlying infrastructure, accelerating the experimentation cycle.
-
Human-in-the-loop review: Automated metrics cannot capture all nuances of quality, safety, or user preference. Therefore, the service must facilitate human review. This involves providing simple UIs or integrating with services like Amazon SageMaker Ground Truth to enable subject matter experts to rate model responses, compare outputs (A/B testing), and perform structured red teaming to proactively identify potential harms, biases, or security vulnerabilities before deployment.
-
Curation of evaluation assets: The quality of an evaluation is only as good as the data it is based on. The service must assist in the generation and curation of evaluation datasets. This includes storing and versioning golden datasets (curated prompt-response pairs representing ideal behavior) and providing tools to augment these datasets based on production traffic or insights from human reviews.
-
Evaluation as a framework: Ultimately, for the most mature organizations, the goal is to offer a complete evaluation framework as an API for different use cases. This allows LOB teams to programmatically run a standardized suite of tests (covering accuracy, robustness, toxicity, and bias) as part of their CI/CD pipeline. A new model version cannot be promoted to production unless it passes this predefined quality bar, embedding responsible and high-quality AI practices directly into the development workflow.
Lessons learned and best practices
The development and deployment of enterprise AI services has yielded several critical lessons that organizations must consider for successful implementation.
First and foremost, data quality and governance emerge as foundational requirements rather than afterthoughts. Organizations consistently find that investing heavily in data infrastructure, establishing clear data lineage, and implementing robust governance frameworks early in the process helps you avoid costly rework and verifies that AI models perform reliably in production environments. Without clean, well-structured data pipelines, even the most sophisticated AI algorithms fail to deliver meaningful business value.
Integrate security and compliance considerations from the initial design phase rather than bolting them on later. Enterprise AI services handle sensitive business data and often operate in highly regulated industries, making it essential to implement security measures such as encryption, access controls, and audit trails from the ground up. Organizations have learned that retrofitting security into existing AI systems is both expensive and risky, often requiring complete architectural overhauls that could have been avoided with proper planning.
Change management and stakeholder buy-in prove to be as critical as the technical implementation itself. Successful deployments invariably involve extensive training programs, clear communication about AI capabilities and limitations, and gradual rollouts that allow users to adapt to new workflows. Organizations that rush deployment without adequate change management frequently encounter resistance, low adoption rates, and ultimately project failure despite having technically sound solutions.
Scalability and performance optimization require careful architectural planning from the outset. Many organizations underestimate the computational resources and infrastructure requirements needed to support enterprise-scale AI workloads. Building services that can handle increasing data volumes, user loads, and model complexity while maintaining acceptable performance levels demands thoughtful system design, often involving cloud-native architectures, containerization, and sophisticated monitoring systems.
Finally, the importance of establishing clear metrics and continuous monitoring cannot be overstated. Successful AI services incorporate comprehensive observability tools that track not only technical performance metrics but also business impact indicators. This enables organizations to identify model drift, performance degradation, and opportunities for improvement while demonstrating tangible value to stakeholders and justifying continued investment in AI initiatives.