Implement AI-powered Kubernetes diagnostics and troubleshooting with K8sGPT and Amazon Bedrock integration - AWS Prescriptive Guidance

Ishwar Chauthaiwale, Muskan ., and Prafful Gupta, Amazon Web Services

Summary

This pattern demonstrates how to implement AI-powered Kubernetes diagnostics and troubleshooting by integrating K8sGPT with the Anthropic Claude v2 model available on Amazon Bedrock. The solution provides natural language analysis and remediation steps for Kubernetes cluster issues through a secure bastion host architecture. By combining the Kubernetes expertise of K8sGPT with the advanced language capabilities of Amazon Bedrock, DevOps teams can quickly identify and resolve cluster problems. With these capabilities, it’s possible to reduce mean time to resolution (MTTR) by up to 50 percent.

This cloud-native pattern uses Amazon Elastic Kubernetes Service (Amazon EKS) for Kubernetes management. The pattern implements security best practices through AWS Identity and Access Management (IAM) roles and network isolation. This solution is particularly valuable for organizations that want to streamline their Kubernetes operations and enhance their troubleshooting capabilities with AI assistance.

Prerequisites and limitations

Prerequisites

  • An active AWS account with appropriate permissions

  • AWS Command Line Interface (AWS CLI) installed and configured

  • An Amazon EKS cluster

  • Access to the Anthropic Claude v2 model on Amazon Bedrock

  • A bastion host with required security group settings

  • K8sGPT installed

Limitations

  • K8sGPT analysis is limited by the context window size of the Claude v2 model.

  • Amazon Bedrock API rate limits apply based on your account quotas.

  • Some AWS services aren’t available in all AWS Regions. For Region availability, see AWS Services by Region. For specific endpoints, see Service endpoints and quotas, and choose the link for the service.

Product versions

Architecture

The following diagram shows the architecture for AI-powered Kubernetes diagnostics using K8sGPT integrated with Amazon Bedrock in the AWS Cloud.

Workflow for Kubernetes diagnostics using K8sGPT integrated with Amazon Bedrock.

The architecture shows the following workflow:

  1. Developers access the environment through a secure connection to the bastion host. This Amazon EC2 instance serves as the secure entry point and contains the K8sGPT command line interface (CLI) installation and required configurations.

  2. The bastion host, configured with specific IAM roles, establishes secure connections to both the Amazon EKS cluster and the Amazon Bedrock endpoints. K8sGPT is installed and configured on the bastion host to perform Kubernetes cluster analysis.

  3. Amazon EKS manages the Kubernetes control plane and worker nodes, providing the target environment for K8sGPT analysis. The service runs across multiple Availability Zones within a virtual private cloud (VPC), which helps to provide high availability and resilience. Amazon EKS supplies operational data through the Kubernetes API, enabling comprehensive cluster analysis.

  4. K8sGPT sends analysis data to Amazon Bedrock, which provides the Claude v2 foundation model (FM) for natural language processing. The service processes K8sGPT analysis to generate human-readable explanations and offers detailed remediation suggestions based on identified issues. Amazon Bedrock operates as a serverless AI service with high availability and scalability.

Note

Throughout this workflow, IAM controls access between components through roles and policies, managing authentication for the bastion host, Amazon EKS, and Amazon Bedrock interactions. IAM implements the principle of least privilege and enables secure cross-service communication throughout the architecture.
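As an illustration of the least-privilege principle described in this note, the following minimal sketch builds an identity-based policy document that allows the bastion host role to invoke only the Claude v2 model in one Region. The function name `bedrock_invoke_policy` and the statement ID are hypothetical. The pattern doesn't state which Amazon Bedrock actions K8sGPT requires, so treating `bedrock:InvokeModel` as sufficient is an assumption to verify in your environment. Permissions for Amazon EKS access are configured separately and aren't shown.

```python
import json

# Assumption: these values match the Region and model ID used elsewhere
# in this pattern (us-west-2, anthropic.claude-v2).
REGION = "us-west-2"
MODEL_ID = "anthropic.claude-v2"


def bedrock_invoke_policy(region: str, model_id: str) -> dict:
    """Build a policy document that allows invoking a single foundation model.

    Hypothetical helper for illustration; it is not part of K8sGPT or any AWS SDK.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeSingleBedrockModel",
                "Effect": "Allow",
                # Assumption: InvokeModel is the only action K8sGPT needs.
                "Action": ["bedrock:InvokeModel"],
                # Foundation model ARNs have an empty account ID segment.
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so that it can be reviewed before it is attached to a role.
    print(json.dumps(bedrock_invoke_policy(REGION, MODEL_ID), indent=2))
```

Scoping the `Resource` element to one model ARN, instead of using a wildcard, keeps the role from invoking other models that are enabled in the account.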

Automation and scale

K8sGPT operations can be automated and scaled across multiple Amazon EKS clusters through various AWS services and tools. For scheduled analysis, this solution can be integrated with continuous integration and continuous deployment (CI/CD) tools such as Jenkins, GitHub Actions, or AWS CodeBuild. The K8sGPT operator enables continuous in-cluster monitoring with automated issue detection and reporting capabilities. For enterprise-scale deployments, you can use Amazon EventBridge to schedule scans and trigger automated responses with custom scripts. AWS SDK integration enables programmatic control across a large fleet of clusters.
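The following minimal sketch shows one way that a scheduled job might run the same analysis against several clusters. It assumes that each cluster is reachable through a named context in the local kubeconfig file and that the K8sGPT CLI is on the `PATH`. The helper names `build_analyze_command` and `scan_clusters` and the context names are hypothetical.

```python
import subprocess


def build_analyze_command(kube_context, namespace=None, resource_filter=None):
    """Return the k8sgpt argument list for one cluster (hypothetical helper)."""
    cmd = [
        "k8sgpt", "analyze",
        "--backend", "amazonbedrock",
        "--explain",
        "--output", "json",
        "--kubecontext", kube_context,
    ]
    if resource_filter:
        cmd += ["--filter", resource_filter]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def scan_clusters(contexts, runner=subprocess.run):
    """Run the analysis once per context and collect the raw JSON output."""
    reports = {}
    for context in contexts:
        completed = runner(
            build_analyze_command(context),
            capture_output=True,
            text=True,
            check=False,  # Collect output even if one cluster scan fails.
        )
        reports[context] = completed.stdout
    return reports


if __name__ == "__main__":
    # Assumption: these context names exist in the local kubeconfig file.
    for name, report in scan_clusters(["dev-cluster", "prod-cluster"]).items():
        print(name, report)
```

Passing the runner as a parameter keeps the loop easy to exercise without a live cluster. A CI/CD job or a scheduled task could call `scan_clusters` and forward the reports to a notification channel.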

Tools

AWS services

Other tools

  • K8sGPT is an open source AI-powered tool that transforms Kubernetes management. It acts as a virtual site reliability engineering (SRE) expert, automatically scanning, diagnosing, and troubleshooting Kubernetes cluster issues. Administrators can interact with K8sGPT using natural language and get clear, actionable insights about cluster state, pod crashes, and service failures. The tool's built-in analyzers detect a wide range of issues, from misconfigured components to resource constraints, and provide easy-to-understand explanations and solutions.

Best practices

  • Implement secure access controls by using AWS Systems Manager Session Manager for bastion host access.

  • Make sure that K8sGPT authentication uses dedicated IAM roles with least privilege permissions for Amazon Bedrock and Amazon EKS interactions. For more information, see Grant least privilege and Security best practices in the IAM documentation.

  • Configure resource tagging, enable Amazon CloudWatch logging for audit trails, and implement data anonymization for sensitive information.

  • Maintain regular backups of K8sGPT configurations while setting up automated scanning schedules during off-peak hours to minimize operational impact.

Epics

Task | Description | Skills required

Set Amazon Bedrock as the AI backend provider for K8sGPT.

To set Amazon Bedrock as the AI backend provider for K8sGPT, use the following K8sGPT CLI command:

    k8sgpt auth add -b amazonbedrock \
      -r us-west-2 \
      -m anthropic.claude-v2 \
      -n endpoint-name

The example command uses us-west-2 for the AWS Region. However, you can select another Region, provided that both the Amazon EKS cluster and the corresponding Amazon Bedrock model are available and enabled in that selected Region.

To check that amazonbedrock is added to the AI backend provider list and is in the Active state, run the following command:

k8sgpt auth list

Following is an example of the expected output of this command:

    Default:
    > openai
    Active:
    > amazonbedrock
    Unused:
    > openai
    > localai
    > ollama
    > azureopenai
    > cohere
    > amazonsagemaker
    > google
    > noopai
    > huggingface
    > googlevertexai
    > oci
    > customrest
    > ibmwatsonxai
AWS DevOps
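If you want a script to confirm that the backend is active before it runs a scan, the following minimal sketch parses the text that `k8sgpt auth list` prints. It assumes that the output has the plain-text layout shown in the example, with one section heading per line that ends in a colon and one provider per line that starts with `>`, and that it contains no terminal color codes. The function names are hypothetical.

```python
def parse_auth_list(output: str) -> dict:
    """Group provider names under the section heading that precedes them."""
    sections = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Provider entry: attach it to the most recent heading.
            if current is not None:
                sections[current].append(line.lstrip(">").strip())
        elif line.endswith(":"):
            # Section heading such as "Default:", "Active:", or "Unused:".
            current = line[:-1].strip()
            sections[current] = []
    return sections


def is_backend_active(output: str, backend: str = "amazonbedrock") -> bool:
    """Return True if the backend is listed in the Active section."""
    return backend in parse_auth_list(output).get("Active", [])


if __name__ == "__main__":
    sample = "Default:\n> openai\nActive:\n> amazonbedrock\nUnused:\n> openai\n> localai\n"
    print(is_backend_active(sample))
```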
Task | Description | Skills required

View a list of available filters.

To see the list of all available filters, use the following K8sGPT CLI command:

k8sgpt filters list

Following is an example of the expected output of this command:

    Active:
    > Deployment
    > ReplicaSet
    > PersistentVolumeClaim
    > Service
    > CronJob
    > Node
    > MutatingWebhookConfiguration
    > Pod
    > Ingress
    > StatefulSet
    > ValidatingWebhookConfiguration
AWS DevOps

Scan a pod in a specific namespace by using a filter.

This command is useful for targeted debugging of specific pod issues within a Kubernetes cluster, using Amazon Bedrock AI capabilities to analyze and explain the problems it finds.

To scan a pod in a specific namespace by using a filter, use the following K8sGPT CLI command:

k8sgpt analyze --backend amazonbedrock --explain --filter Pod -n default

Following is an example of the expected output of this command:

    100% |████████████████████████████████████████████████████████| (1/1, 645 it/s)
    AI Provider: amazonbedrock

    0: Pod default/crashme()
    - Error: the last termination reason is Error container=crashme pod=crashme
    Error: The pod named crashme terminated because the container named crashme crashed.
    Solution: Check logs for crashme pod to identify reason for crash. Restart pod or redeploy application to resolve crash.
AWS DevOps

Scan a deployment in a specific namespace by using a filter.

This command is useful for identifying and troubleshooting deployment-specific issues, particularly when the actual state doesn't match the desired state.

To scan a deployment in a specific namespace by using a filter, use the following K8sGPT CLI command:

k8sgpt analyze --backend amazonbedrock --explain --filter Deployment -n default

Following is an example of the expected output of this command:

    100% |██████████████████████████████████████████████████████████| (1/1, 10 it/min)
    AI Provider: amazonbedrock

    0: Deployment default/nginx()
    - Error: Deployment default/nginx has 1 replicas but 2 are available
    Error: The Deployment named nginx in the default namespace has 1 replica specified but 2 pod replicas are running.
    Solution: Check if any other controllers like ReplicaSet or StatefulSet have created extra pods. Delete extra pods or adjust replica count to match available pods.
AWS DevOps

Scan a node in a specific namespace by using a filter.

To scan a node in a specific namespace by using a filter, use the following K8sGPT CLI command:

k8sgpt analyze --backend amazonbedrock --explain --filter Node -n default

Following is an example of the expected output of this command:

    AI Provider: amazonbedrock

    No problems detected
AWS DevOps
Task | Description | Skills required

Get detailed outputs.

To get detailed outputs in JSON format, use the following K8sGPT CLI command:

    k8sgpt analyze --backend amazonbedrock --explain --output json

Following is an example of the expected output of this command:

    {
      "provider": "amazonbedrock",
      "errors": null,
      "status": "ProblemDetected",
      "problems": 1,
      "results": [
        {
          "kind": "Pod",
          "name": "default/crashme",
          "error": [
            {
              "Text": "the last termination reason is Error container=crashme pod=crashme",
              "KubernetesDoc": "",
              "Sensitive": []
            }
          ],
          "details": " Error: The pod named crashme terminated because the container named crashme crashed.\nSolution: Check logs for crashme pod to identify reason for crash. Restart pod or redeploy application to resolve crash.",
          "parentObject": ""
        }
      ]
    }
AWS DevOps
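Because the JSON output is machine-readable, a script can extract the affected resources and the generated explanations. The following minimal sketch reads a report that has the same fields as the example output (`results`, `kind`, `name`, `error`, `Text`, and `details`). The function name `summarize_report` is hypothetical, and the sketch assumes that `results` can be `null` or absent when no problems are detected.

```python
import json


def summarize_report(report_json: str) -> list:
    """Return one summary dictionary for each problem in a K8sGPT JSON report."""
    report = json.loads(report_json)
    summaries = []
    # Assumption: "results" might be null or missing when no problems are found.
    for result in report.get("results") or []:
        errors = [entry.get("Text", "") for entry in result.get("error") or []]
        summaries.append(
            {
                "kind": result.get("kind"),
                "name": result.get("name"),
                "errors": errors,
                # The AI-generated explanation can have leading whitespace.
                "details": (result.get("details") or "").strip(),
            }
        )
    return summaries


if __name__ == "__main__":
    import sys

    # Example usage: k8sgpt analyze ... --output json | python summarize.py
    for item in summarize_report(sys.stdin.read()):
        print(f"{item['kind']} {item['name']}: {'; '.join(item['errors'])}")
```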

Check problematic pods.

To check for specific problematic pods, use the following kubectl command:

kubectl get pods --all-namespaces | grep -v Running

Following is an example of the expected output of this command:

    NAMESPACE   NAME      READY   STATUS             RESTARTS        AGE
    default     crashme   0/1     CrashLoopBackOff   260 (91s ago)   21h
AWS DevOps
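The `grep -v Running` filter also passes the header row and any pods in the `Completed` state. If a script needs only the problematic pods, the following minimal sketch applies a stricter filter to the same `kubectl get pods --all-namespaces` text output. It assumes the default column order (NAMESPACE, NAME, READY, STATUS), and the function name `unhealthy_pods` is hypothetical. Like the `grep` command, it doesn't flag pods that report `Running` but aren't ready.

```python
# Statuses that this sketch treats as healthy (an assumption for illustration).
HEALTHY_STATUSES = {"Running", "Completed"}


def unhealthy_pods(kubectl_output: str) -> list:
    """Return (namespace, name, status) for each pod that isn't healthy."""
    pods = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, status = fields[:4]
        if status not in HEALTHY_STATUSES:
            pods.append((namespace, name, status))
    return pods


if __name__ == "__main__":
    import sys

    # Example usage: kubectl get pods --all-namespaces | python unhealthy.py
    for namespace, name, status in unhealthy_pods(sys.stdin.read()):
        print(f"{namespace}/{name}: {status}")
```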

Get application-specific insights.

This command is particularly useful when:

  • You want to focus on a specific application in your cluster.

  • You use labels effectively to organize your Kubernetes resources.

  • You need to quickly check the health of a particular application component.

To get application-specific insights, use the following command:

k8sgpt analyze --backend amazonbedrock --explain -L app=nginx -n default

Following is an example of the expected output of this command:

    AI Provider: amazonbedrock

    No problems detected

Related resources

AWS Blogs

AWS documentation

Other resources