Solution for Monitoring Amazon EKS infrastructure with Amazon Managed Grafana

Monitoring Amazon Elastic Kubernetes Service infrastructure is one of the most common scenarios for which Amazon Managed Grafana and Amazon Managed Service for Prometheus are used. This page describes a pre-built AWS Cloud Development Kit (AWS CDK) project that provides you with a template solution for this scenario. This solution configures an Amazon Managed Service for Prometheus backend to collect and store metrics from your Amazon EKS cluster, and an Amazon Managed Grafana front end with dashboards to view details of your cluster. Applying this solution will create dashboards that:

  • Assess the overall Amazon EKS cluster health.

  • Show the health and performance of the Amazon EKS control plane.

  • Show the health and performance of the Amazon EKS data plane.

  • Display insights on Amazon EKS workloads across Kubernetes namespaces.

  • Display resource usage across namespaces, including CPU, memory, disk, and network usage.

About this solution

This solution configures an Amazon Managed Grafana workspace to provide metrics for your Amazon EKS cluster.

The metrics help you operate Amazon EKS clusters more effectively by providing insights into the health and performance of the Kubernetes control plane and data plane. You can understand your Amazon EKS cluster from the node level, to pods, down to the Kubernetes level, including detailed monitoring of resource usage.

The solution provides both anticipatory and corrective capabilities:

  • Anticipatory capabilities include:

    • Manage resource efficiency by driving scheduling decisions. For example, to provide performance and reliability SLAs to internal users of the Amazon EKS cluster, you can allocate enough CPU and memory resources to their workloads based on tracking historical usage.

    • Usage forecasts: Based on the current utilization of your Amazon EKS cluster resources, such as nodes, Persistent Volumes backed by Amazon EBS, or Application Load Balancers, you can plan ahead, for example, for a new product or project with similar demands.

    • Detect potential issues early: For example, by analyzing resource consumption trends on a Kubernetes namespace level, you can understand the seasonality of the workload’s usage.

  • Corrective capabilities include:

    • Decrease the mean time to detection (MTTD) of issues on the infrastructure and the Kubernetes workload level. For example, by looking at the troubleshooting dashboard, you can quickly test hypotheses about what went wrong and eliminate them.

    • Determine where in the stack a problem is happening. For example, the Amazon EKS control plane is fully managed by AWS and certain operations such as updating a Kubernetes deployment may fail if the API server is overloaded or connectivity is impacted.

The following image shows a sample of the dashboard folder for the solution.

An image showing an example of a Grafana dashboard folder built using this solution.

You can choose a dashboard to see more details. For example, choosing to view the Compute Resources for workloads displays a dashboard like the one shown in the following image.

An image showing an example of a Grafana dashboard showing CPU usage built using this solution.

The metrics are scraped with a 1-minute scrape interval. The dashboards show metrics aggregated to 1 minute, 5 minutes, or more, depending on the specific metric.

For a list of metrics tracked by this solution, see List of metrics tracked.

Costs

This solution creates and uses resources in your workspace. You will be charged for standard usage, including:

  • Amazon Managed Grafana workspace access by users.

  • Amazon Managed Service for Prometheus metric ingestion and storage, including use of the Amazon Managed Service for Prometheus agentless collector. The number of metrics used by this solution depends on the Amazon EKS cluster configuration and usage.

    You can view the ingestion and storage metrics in Amazon Managed Service for Prometheus using CloudWatch. For more information, see CloudWatch metrics in the Amazon Managed Service for Prometheus User Guide.

    You can estimate the cost using the Amazon Managed Service for Prometheus pricing calculator. For more details, see the following procedure.

The pricing calculator can help you estimate approximate costs for the metrics ingested into Amazon Managed Service for Prometheus. The cost depends mostly on the number of nodes in your Amazon EKS cluster, which determines the number of metrics ingested.

To use the pricing calculator to estimate your metric ingestion costs
  1. Open the Amazon Managed Service for Prometheus pricing calculator.

  2. Choose Calculate the cost using your existing self-hosted Prometheus.

  3. In the Metrics sample ingestion section, make the following selections:

    • Select Active Series for the metric.

    • For Average active time series, enter 8,000 + (the number of nodes in your cluster * 15,000). The cluster itself provides about 8,000 time series, and each node adds approximately 15,000 more. For example, if you have 2 nodes, you would enter 38,000, which is 8,000 + (2 * 15,000).

    • For Avg Collection Interval (in seconds), enter 60.

  4. In the Managed Collectors section, enter the following information:

    • For Number of collectors, enter 1. This solution creates a single collector for you.

    • For Number of samples collected, enter 150 + (the number of nodes in your cluster * 250). This is calculated similarly to the number of time series you entered previously. For example, if you have 2 nodes, you would enter 650, which is 150 + (2 * 250).

  5. You can see your estimated costs for metric ingestion at the bottom of the pricing calculator. You can save your estimate and add other costs using other features of the pricing calculator.
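If you prefer to compute these two inputs from the command line, the following shell snippet applies the same approximations. This is an illustrative sketch, not part of the solution; set NODES to your cluster's node count.

    # Approximate pricing calculator inputs, using the estimates above.
    NODES=2
    ACTIVE_SERIES=$((8000 + NODES * 15000))   # ~8,000 base series plus ~15,000 per node
    SAMPLES=$((150 + NODES * 250))            # samples collected per scrape
    echo "Average active time series: $ACTIVE_SERIES"
    echo "Number of samples collected: $SAMPLES"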

Prerequisites

Before using this solution, you must complete the following steps.

  1. You must have or create an Amazon Elastic Kubernetes Service cluster that you wish to monitor, and the cluster must have at least one node. The cluster must have API server endpoint access set to include private access (it can also allow public access).

    The authentication mode must include API access (it can be set to either API or API_AND_CONFIG_MAP). This allows the CDK application to use access entries.

    Save the Cluster name to specify later. This can be found in the cluster details in the Amazon EKS console.

    Note

    For details about how to create an Amazon EKS cluster, see Getting started with Amazon EKS.
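    You can verify both settings with the AWS CLI before proceeding (a quick check; replace EKS_CLUSTER with your cluster name, and note that accessConfig may be absent on clusters created before access entries were introduced):

    aws eks describe-cluster --name EKS_CLUSTER \
      --query "cluster.{privateAccess: resourcesVpcConfig.endpointPrivateAccess, authMode: accessConfig.authenticationMode}"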

  2. You must create an Amazon Managed Service for Prometheus workspace in the same AWS account as your Amazon EKS cluster. For details, see Create a workspace in the Amazon Managed Service for Prometheus User Guide.

    Save the Amazon Managed Service for Prometheus workspace ID to specify later.
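    For example, you can create the workspace and capture its ID and ARN with the AWS CLI (a sketch; the alias is illustrative):

    # Create the workspace and save its ID.
    export AMP_WS_ID=$(aws amp create-workspace \
      --alias eks-infra-monitoring \
      --query workspaceId \
      --output text)
    # The workspace ARN is needed later as AMP_WS_ARN.
    aws amp describe-workspace --workspace-id $AMP_WS_ID \
      --query workspace.arn --output text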

  3. You must create an Amazon Managed Grafana workspace compatible with Grafana version 9 or newer, in the same AWS Region as your Amazon EKS cluster. For details about creating a new workspace, see Create an Amazon Managed Grafana workspace.

    The workspace role must have permissions to access Amazon Managed Service for Prometheus APIs. The easiest way to do this is to use Service-managed permissions and select Amazon Managed Service for Prometheus. You can also manually add the AmazonPrometheusQueryAccess policy to your workspace IAM role.

    Save the Amazon Managed Grafana workspace URL to specify later. This can be found in the Amazon Managed Grafana console. For example, https://g-123example.grafana-workspace.<region>.amazonaws.com/.
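    If you know the workspace ID, you can also retrieve the URL with the AWS CLI (replace g-123example with your workspace ID):

    aws grafana describe-workspace --workspace-id g-123example \
      --query workspace.endpoint --output text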

  4. You must create an API key with ADMIN access for calling Grafana HTTP APIs. For details, see Use API keys to authenticate with Grafana HTTP APIs. You can use the AWS CLI with the following commands to create the API key. Replace GRAFANA_ID with the ID of your workspace (it will be in the form g-123example).

    export AMG_WORKSPACE_ID=GRAFANA_ID
    export AMG_API_KEY=$(aws grafana create-workspace-api-key \
      --key-name "grafana-operator-key" \
      --key-role "ADMIN" \
      --seconds-to-live 432000 \
      --workspace-id $AMG_WORKSPACE_ID \
      --query key \
      --output text)

    After creating the API key, you must make it available to the AWS CDK by adding it to AWS Systems Manager Parameter Store with the following command. Replace $AMG_API_KEY with the API key that you created, and $AWS_REGION with the Region that your solution will run in.

    aws ssm put-parameter --name "/observability-aws-solution-eks-infra/grafana-api-key" \
      --type "SecureString" \
      --value $AMG_API_KEY \
      --region $AWS_REGION \
      --overwrite
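    You can confirm that the parameter was stored, without printing the secret value (a quick check):

    aws ssm get-parameter \
      --name "/observability-aws-solution-eks-infra/grafana-api-key" \
      --region $AWS_REGION \
      --query Parameter.Name --output text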
Note

While not strictly required to set up the solution, you must set up user authentication in your Amazon Managed Grafana workspace before users can access the dashboards created. For more information, see Authenticate users in Amazon Managed Grafana workspaces.

Using this solution

This solution is provided to you as an AWS CDK application. You will provide information about the resources you want to use, and the solution will create the scraper and dashboards for you.

Note

The steps here assume that you have an environment with the AWS CLI, AWS CDK, and NPM installed. One way to have an environment with all of the prerequisites is to use AWS CloudShell.
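For example, you can confirm that the required tools are available before you begin (the exact versions reported will vary):

    aws --version
    cdk --version
    npm --version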

To use this solution to monitor an Amazon EKS cluster
  1. Make sure that you have completed all of the prerequisite steps.

  2. Download all files for the solution from https://github.com/aws-observability/observability-best-practices/tree/main/solutions/oss/eks-infra/v2.0.0/iac.

    You do not need to modify these files.

  3. In your command line environment (from the iac folder containing the files you downloaded), run the following commands.

    Set up the needed environment variables. Replace REGION, AMG_ENDPOINT, EKS_CLUSTER, and AMP_ARN with your AWS Region, Amazon Managed Grafana workspace endpoint (in the form https://g-123example.grafana-workspace.us-east-1.amazonaws.com), Amazon EKS cluster name, and Amazon Managed Service for Prometheus workspace ARN.

    export AWS_REGION=REGION
    export AMG_ENDPOINT=AMG_ENDPOINT
    export EKS_CLUSTER_NAME=EKS_CLUSTER
    export AMP_WS_ARN=AMP_ARN

    Run the following make command, which will install any other dependencies for the project.

    make deps

    Finally, run the AWS CDK project:

    make build && make pattern aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME deploy
  4. [Optional] After the stack creation is complete, you can use the same environment to create more instances of the stack for other Amazon EKS clusters in the same Region, as long as you complete the other prerequisites for each (including separate Amazon Managed Grafana and Amazon Managed Service for Prometheus workspaces). You will need to redefine the export commands with the new parameters.

When the stack creation is complete, your Amazon Managed Grafana workspace is populated with dashboards showing metrics for your Amazon EKS cluster. It will take a few minutes for metrics to appear as the scraper begins collecting them.
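If you want to verify the deployment from the command line, you can check the stack status and confirm that the scraper was created. This is a sketch; the stack name follows the pattern used in the deploy command above.

    aws cloudformation describe-stacks \
      --stack-name aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME \
      --query "Stacks[0].StackStatus" --output text
    aws amp list-scrapers \
      --query "scrapers[*].{id: scraperId, status: status.statusCode}"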

List of metrics tracked

This solution creates a scraper that collects metrics from your Amazon EKS cluster. Those metrics are stored in Amazon Managed Service for Prometheus, and then displayed in Amazon Managed Grafana dashboards. The following metrics are tracked with this solution.

Each metric is listed with its description or purpose.

  • aggregator_unavailable_apiservice: Gauge of APIServices which are marked as unavailable, broken down by APIService name.
  • apiserver_admission_webhook_admission_duration_seconds_bucket: Admission webhook latency histogram in seconds, identified by name and broken out for each operation, API resource, and type (validate or admit).
  • apiserver_current_inflight_requests: Maximal number of currently used inflight request limit of this apiserver per request kind in the last second.
  • apiserver_envelope_encryption_dek_cache_fill_percent: Percent of the cache slots currently occupied by cached DEKs.
  • apiserver_flowcontrol_current_executing_requests: Number of requests in initial (for a WATCH) or any (for a non-WATCH) execution stage in the API Priority and Fairness subsystem.
  • apiserver_flowcontrol_rejected_requests_total: Number of requests in initial (for a WATCH) or any (for a non-WATCH) execution stage in the API Priority and Fairness subsystem that were rejected.
  • apiserver_flowcontrol_request_concurrency_limit: Nominal number of execution seats configured for each priority level.
  • apiserver_flowcontrol_request_execution_seconds_bucket: The bucketed histogram of duration of initial stage (for a WATCH) or any (for a non-WATCH) stage of request execution in the API Priority and Fairness subsystem.
  • apiserver_flowcontrol_request_queue_length_after_enqueue_count: The count of initial stage (for a WATCH) or any (for a non-WATCH) stage of request execution in the API Priority and Fairness subsystem.
  • apiserver_request: Indicates an API server request.
  • apiserver_requested_deprecated_apis: Gauge of deprecated APIs that have been requested, broken out by API group, version, resource, subresource, and removed_release.
  • apiserver_request_duration_seconds: Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component.
  • apiserver_request_duration_seconds_bucket: The bucketed histogram of response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component.
  • apiserver_request_slo_duration_seconds: The Service Level Objective (SLO) response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component.
  • apiserver_request_terminations_total: Number of requests which the apiserver terminated in self-defense.
  • apiserver_request_total: Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
  • container_cpu_usage_seconds_total: Cumulative CPU time consumed.
  • container_fs_reads_bytes_total: Cumulative count of bytes read.
  • container_fs_reads_total: Cumulative count of reads completed.
  • container_fs_writes_bytes_total: Cumulative count of bytes written.
  • container_fs_writes_total: Cumulative count of writes completed.
  • container_memory_cache: Total page cache memory.
  • container_memory_rss: Size of RSS.
  • container_memory_swap: Container swap usage.
  • container_memory_working_set_bytes: Current working set.
  • container_network_receive_bytes_total: Cumulative count of bytes received.
  • container_network_receive_packets_dropped_total: Cumulative count of packets dropped while receiving.
  • container_network_receive_packets_total: Cumulative count of packets received.
  • container_network_transmit_bytes_total: Cumulative count of bytes transmitted.
  • container_network_transmit_packets_dropped_total: Cumulative count of packets dropped while transmitting.
  • container_network_transmit_packets_total: Cumulative count of packets transmitted.
  • etcd_request_duration_seconds_bucket: The bucketed histogram of etcd request latency in seconds for each operation and object type.
  • go_goroutines: Number of goroutines that currently exist.
  • go_threads: Number of OS threads created.
  • kubelet_cgroup_manager_duration_seconds_bucket: The bucketed histogram of duration in seconds for cgroup manager operations, broken down by method.
  • kubelet_cgroup_manager_duration_seconds_count: Duration in seconds for cgroup manager operations, broken down by method.
  • kubelet_node_config_error: This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise.
  • kubelet_node_name: The node's name. The count is always 1.
  • kubelet_pleg_relist_duration_seconds_bucket: The bucketed histogram of duration in seconds for relisting pods in PLEG.
  • kubelet_pleg_relist_duration_seconds_count: The count of duration in seconds for relisting pods in PLEG.
  • kubelet_pleg_relist_interval_seconds_bucket: The bucketed histogram of interval in seconds between relistings in PLEG.
  • kubelet_pod_start_duration_seconds_count: The count of duration in seconds from the kubelet seeing a pod for the first time to the pod starting to run.
  • kubelet_pod_worker_duration_seconds_bucket: The bucketed histogram of duration in seconds to sync a single pod, broken down by operation type: create, update, or sync.
  • kubelet_pod_worker_duration_seconds_count: The count of duration in seconds to sync a single pod, broken down by operation type: create, update, or sync.
  • kubelet_running_containers: Number of containers currently running.
  • kubelet_running_pods: Number of pods that have a running pod sandbox.
  • kubelet_runtime_operations_duration_seconds_bucket: The bucketed histogram of duration in seconds of runtime operations, broken down by operation type.
  • kubelet_runtime_operations_errors_total: Cumulative number of runtime operation errors by operation type.
  • kubelet_runtime_operations_total: Cumulative number of runtime operations by operation type.
  • kube_node_status_allocatable: The amount of resources allocatable for pods (after reserving some for system daemons).
  • kube_node_status_capacity: The total amount of resources available for a node.
  • kube_pod_container_resource_limits (CPU): The CPU limit set for a container.
  • kube_pod_container_resource_limits (Memory): The memory limit set for a container.
  • kube_pod_container_resource_requests (CPU): The CPU requested by a container.
  • kube_pod_container_resource_requests (Memory): The memory requested by a container.
  • kube_pod_owner: Information about the Pod's owner.
  • kube_resourcequota: Resource quotas in Kubernetes that enforce usage limits on resources such as CPU, memory, and storage within namespaces.
  • node_cpu: The CPU usage metrics for a node, including usage per core and total usage.
  • node_cpu_seconds_total: Seconds the CPUs spent in each mode.
  • node_disk_io_time_seconds: The cumulative amount of time spent performing I/O operations on disk by a node.
  • node_disk_io_time_seconds_total: The total amount of time spent performing I/O operations on disk by the node.
  • node_disk_read_bytes_total: The total number of bytes read from disk by the node.
  • node_disk_written_bytes_total: The total number of bytes written to disk by the node.
  • node_filesystem_avail_bytes: The amount of available space in bytes on the filesystem of a node in a Kubernetes cluster.
  • node_filesystem_size_bytes: The total size of the filesystem on the node.
  • node_load1: The 1-minute load average of a node's CPU usage.
  • node_load5: The 5-minute load average of a node's CPU usage.
  • node_load15: The 15-minute load average of a node's CPU usage.
  • node_memory_Buffers_bytes: The amount of memory used for buffer caching by the node's operating system.
  • node_memory_Cached_bytes: The amount of memory used for disk caching by the node's operating system.
  • node_memory_MemAvailable_bytes: The amount of memory available for use by applications and caches.
  • node_memory_MemFree_bytes: The amount of free memory available on the node.
  • node_memory_MemTotal_bytes: The total amount of physical memory available on the node.
  • node_network_receive_bytes_total: The total number of bytes received over the network by the node.
  • node_network_transmit_bytes_total: The total number of bytes transmitted over the network by the node.
  • process_cpu_seconds_total: Total user and system CPU time spent, in seconds.
  • process_resident_memory_bytes: Resident memory size in bytes.
  • rest_client_requests_total: Number of HTTP requests, partitioned by status code, method, and host.
  • rest_client_request_duration_seconds_bucket: The bucketed histogram of request latency in seconds, broken down by verb and host.
  • storage_operation_duration_seconds_bucket: The bucketed histogram of duration of storage operations.
  • storage_operation_duration_seconds_count: The count of duration of storage operations.
  • storage_operation_errors_total: Cumulative number of errors during storage operations.
  • up: A metric indicating whether the monitored target (for example, a node) is up and running.
  • volume_manager_total_volumes: The total number of volumes managed by the volume manager.
  • workqueue_adds_total: Total number of adds handled by workqueue.
  • workqueue_depth: Current depth of workqueue.
  • workqueue_queue_duration_seconds_bucket: The bucketed histogram of how long in seconds an item stays in workqueue before being requested.
  • workqueue_work_duration_seconds_bucket: The bucketed histogram of how long in seconds processing an item from workqueue takes.

Troubleshooting

There are a few things that can cause the setup of the project to fail. Be sure to check the following.

  • You must complete all Prerequisites before installing the solution.

  • Your Amazon EKS cluster must have the CoreDNS and kube-proxy add-ons installed. If they are not installed, the solution will not work correctly. They are both installed by default.
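    You can confirm that both are present with kubectl (they run in the kube-system namespace, whether installed as Amazon EKS add-ons or self-managed):

    kubectl get deployment coredns -n kube-system
    kubectl get daemonset kube-proxy -n kube-system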

  • Missing SSM parameter. If you see an error like the following, run cdk bootstrap and try again.

    Deployment failed: Error: aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME: SSM parameter /cdk-bootstrap/xxxxxxx/version not found. Has the environment been bootstrapped? Please run 'cdk bootstrap' (see https://docs.aws.amazon.com/cdk/latest/guide/bootstrapping.html)
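    If you need to bootstrap a specific environment, you can pass it explicitly (replace ACCOUNT-NUMBER and REGION with your AWS account ID and Region):

    cdk bootstrap aws://ACCOUNT-NUMBER/REGION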
  • The cluster must have at least one node in it before attempting to create the solution or access the metrics.

  • Deployment can fail if the OIDC provider already exists. You will see an error like the following:

    | CREATE_FAILED | Custom::AWSCDKOpenIdConnectProvider | OIDCProvider/Resource/Default
    Received response status [FAILED] from custom resource. Message returned: EntityAlreadyExistsException: Provider with url https://oidc.eks.REGION.amazonaws.com/id/PROVIDER_ID already exists.

    In this case, go to the IAM console, delete the OIDC provider, and try again.

  • Amazon EKS pod installation timed out. This happens when there is not enough node capacity available, which can occur for multiple reasons, including:

    • The Amazon EKS cluster was initialized with Fargate instead of Amazon EC2. This project requires Amazon EC2.

    • The nodes are tainted and therefore unavailable.

      You can use kubectl describe node NODENAME | grep Taints to check the taints, and then kubectl taint node NODENAME TAINT_NAME- to remove them. Make sure to include the - after the taint name.

    • The nodes have reached the capacity limit. In this case, you can create a new node or increase the capacity.