Automate deployment of Node Termination Handler in Amazon EKS by using a CI/CD pipeline - AWS Prescriptive Guidance

Created by Sandip Gangapadhyay (AWS), John Vargas (AWS), Pragtideep Singh (AWS), Sandeep Gawande (AWS), and Viyoma Sachdeva (AWS)

Code repository: Deploy NTH to EKS

Environment: Production

Technologies: Containers & microservices; DevOps

AWS services: AWS CodePipeline; Amazon EKS; AWS CodeBuild

Summary

Notice: AWS CodeCommit is no longer available to new customers. Existing customers of AWS CodeCommit can continue to use the service as normal.

On the Amazon Web Services (AWS) Cloud, you can use AWS Node Termination Handler, an open-source project, to gracefully handle Amazon Elastic Compute Cloud (Amazon EC2) instance shutdown within Kubernetes. AWS Node Termination Handler helps to ensure that the Kubernetes control plane responds appropriately to events that can cause your EC2 instance to become unavailable. Such events include EC2 scheduled maintenance, Amazon EC2 Spot Instance interruptions, and Auto Scaling group scale-in.

If an event isn’t handled, your application code might not stop gracefully. It also might take longer to recover full availability, or work might accidentally be scheduled on nodes that are going down. The aws-node-termination-handler (NTH) can operate in two different modes: Instance Metadata Service (IMDS) or Queue Processor. For more information about the two modes, see the Readme file.

This pattern uses AWS CodeCommit, and it automates the deployment of NTH by using Queue Processor through a continuous integration and continuous delivery (CI/CD) pipeline.

Note: If you're using EKS managed node groups, you don't need the aws-node-termination-handler.

Prerequisites and limitations

Prerequisites 

  • An active AWS account.

  • A web browser that is supported for use with the AWS Management Console. See the list of supported browsers.

  • AWS Cloud Development Kit (AWS CDK) installed.

  • kubectl, the Kubernetes command line tool, installed.

  • eksctl, the command line tool for creating and managing clusters on Amazon Elastic Kubernetes Service (Amazon EKS), installed.

  • A running EKS cluster with version 1.20 or later.

  • A self-managed node group attached to the EKS cluster. To create an Amazon EKS cluster with a self-managed node group, run the following command.

    eksctl create cluster --managed=false --region <region> --name <cluster_name>

    For more information on eksctl, see the eksctl documentation.

  • AWS Identity and Access Management (IAM) OpenID Connect (OIDC) provider for your cluster. For more information, see Creating an IAM OIDC provider for your cluster.

Limitations 

  • You must use an AWS Region that supports the Amazon EKS service.

Product versions

  • Kubernetes version 1.20 or later

  • eksctl version 0.107.0 or later

  • AWS CDK version 2.27.0 or later
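The minimum versions above can be checked from a terminal before you start. The following is a minimal sketch, not part of the pattern's repo: the helper names version_ge and check_tool are hypothetical, and the comparison assumes a sort command that supports the -V (version sort) option.

```shell
#!/usr/bin/env bash
# Sketch: compare an installed tool version with the minimum listed above.
# version_ge and check_tool are hypothetical helper names.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes a sort implementation that supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints whether a tool meets its minimum version; returns 1 if it does not.
check_tool() {
  local name="$1" installed="$2" minimum="$3"
  if version_ge "$installed" "$minimum"; then
    echo "$name $installed: ok (minimum $minimum)"
  else
    echo "$name $installed: too old (minimum $minimum)"
    return 1
  fi
}

# Example usage. How each tool prints its version is an assumption;
# adjust the extraction if your output differs.
#   check_tool eksctl "$(eksctl version)" 0.107.0
#   check_tool cdk "$(cdk --version | awk '{print $1}')" 2.27.0
```

The same check can help you decide the value of the install_cdk parameter described in the Epics section.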

Architecture

Target technology stack  

  • A virtual private cloud (VPC)

  • An EKS cluster

  • Amazon Simple Queue Service (Amazon SQS)

  • IAM

  • Kubernetes

Target architecture 

The following diagram shows the high-level view of the end-to-end steps when the node termination is started.

(Diagram: A VPC with an Auto Scaling group, an EKS cluster with Node Termination Handler, and an SQS queue.)

The workflow shown in the diagram consists of the following high-level steps:

  1. The Auto Scaling group's EC2 instance termination event is sent to the SQS queue.

  2. The NTH Pod monitors for new messages in the SQS queue.

  3. The NTH Pod receives the new message and does the following:

    • Cordons the node so that new Pods aren't scheduled on the node.

    • Drains the node so that the existing Pods are evicted.

    • Sends a lifecycle hook signal to the Auto Scaling group so that the node can be terminated.
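To make the three actions in step 3 concrete, the following sketch shows roughly equivalent manual commands for a single node. It is an illustration only: NTH performs these actions itself, and the function name handle_termination and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: manual equivalent of the cordon, drain, and lifecycle steps.
# handle_termination is a hypothetical name; all four arguments are
# values that you supply for your own cluster.
handle_termination() {
  local node="$1" instance_id="$2" asg_name="$3" hook_name="$4"

  # 1. Cordon: mark the node unschedulable so that no new Pods land on it.
  kubectl cordon "$node" || return 1

  # 2. Drain: evict the existing Pods. DaemonSet-managed Pods are skipped.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1

  # 3. Signal the Auto Scaling group that termination can continue.
  aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name "$hook_name" \
    --auto-scaling-group-name "$asg_name" \
    --lifecycle-action-result CONTINUE \
    --instance-id "$instance_id"
}

# Example usage (placeholder values):
#   handle_termination <node_name> <instance_id> <asg_name> <hook_name>
```

The ordering matters: the lifecycle action is completed only after the drain succeeds, so the instance isn't terminated while Pods are still running on it.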

Automation and scale

Tools

AWS services

  • AWS Cloud Development Kit (AWS CDK) is a software development framework that helps you define and provision AWS Cloud infrastructure in code.

  • AWS CodeBuild is a fully managed build service that helps you compile source code, run unit tests, and produce artifacts that are ready to deploy.

  • AWS CodeCommit is a version control service that helps you privately store and manage Git repositories, without needing to manage your own source control system.

  • AWS CodePipeline helps you quickly model and configure the different stages of a software release and automate the steps required to release software changes continuously.

  • Amazon Elastic Kubernetes Service (Amazon EKS) helps you run Kubernetes on AWS without needing to install or maintain your own Kubernetes control plane or nodes.

  • Amazon EC2 Auto Scaling helps you maintain application availability and allows you to automatically add or remove Amazon EC2 instances according to conditions you define.

  • Amazon Simple Queue Service (Amazon SQS) provides a secure, durable, and available hosted queue that helps you integrate and decouple distributed software systems and components.

Other tools

  • kubectl is a Kubernetes command line tool for running commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.

Code 

The code for this pattern is available in the deploy-nth-to-eks repo on GitHub.com. The code repo contains the following files and folders.

  • nth folder – The Helm chart, values files, and the scripts to scan and deploy the AWS CloudFormation template for Node Termination Handler.

  • config/config.json – The configuration parameter file for the application. This file contains all the parameters needed to deploy the AWS CDK application.

  • cdk – AWS CDK source code.

  • setup.sh – The script used to deploy the AWS CDK application to create the required CI/CD pipeline and other required resources.

  • uninstall.sh – The script used to clean up the resources.

To use the example code, follow the instructions in the Epics section.

Best practices

For best practices when automating AWS Node Termination Handler, see the following:

Epics

Task | Description | Skills required

Clone the repo.

To clone the repo by using SSH (Secure Shell), run the following command.

git clone git@github.com:aws-samples/deploy-nth-to-eks.git

To clone the repo by using HTTPS, run the following command.

git clone https://github.com/aws-samples/deploy-nth-to-eks.git

Cloning the repo creates a folder named deploy-nth-to-eks.

Change to that directory.

cd deploy-nth-to-eks

Skills required: App developer, AWS DevOps, DevOps engineer

Set the kubeconfig file.

Set your AWS credentials in your terminal and confirm that you have rights to assume the cluster role. You can use the following example code.

aws eks update-kubeconfig --name <Cluster_Name> --region <region> --role-arn <Role_ARN>

Skills required: AWS DevOps, DevOps engineer, App developer

Task | Description | Skills required

Set up the parameters.

In the config/config.json file, set up the following required parameters.

  • pipelineName: The name of the CI/CD pipeline to be created by AWS CDK (for example, deploy-nth-to-eks-pipeline). AWS CodePipeline will create a pipeline that has this name.

  • repositoryName: The AWS CodeCommit repo to be created (for example, deploy-nth-to-eks-repo). AWS CDK will create this repo and set it as the source for the CI/CD pipeline.

    Note: This solution will create this CodeCommit repo and the branch (provided in the following branch parameter).

  • branch: The branch name in the repo (for example, main). A commit to this branch will initiate the CI/CD pipeline.

  • cfn_scan_script: The path of the script that will be used to scan the AWS CloudFormation template for NTH (scan.sh). This script exists in the nth folder that will be part of the AWS CodeCommit repo.

  • cfn_deploy_script: The path of the script that will be used to deploy the AWS CloudFormation template for NTH (installApp.sh).

  • stackName: The name of the CloudFormation stack to be deployed.

  • eksClusterName: The name of the existing EKS cluster.

  • eksClusterRole: The IAM role that will be used to access the EKS cluster for all Kubernetes API calls (for example, clusteradmin). Usually, this role is added to the aws-auth ConfigMap.

  • create_cluster_role: To create the eksClusterRole IAM role, enter yes. If you want to provide an existing cluster role in the eksClusterRole parameter, enter no.

  • create_iam_oidc_provider: To create the IAM OIDC provider for your cluster, enter yes. If an IAM OIDC provider already exists, enter no. For more information, see Creating an IAM OIDC provider for your cluster.

  • AsgGroupName: A comma-separated list of Auto Scaling group names that are part of the EKS cluster (for example, ASG_Group_1,ASG_Group_2).

  • region: The name of the AWS Region where the cluster is located (for example, us-east-2).

  • install_cdk: Whether the setup.sh script should install AWS CDK. To check whether AWS CDK version 2.27.0 or later is already installed, run the cdk --version command. If it is, enter no. If AWS CDK isn’t currently installed on the machine, enter yes.

    If you enter yes, the setup.sh script will run the sudo npm install -g cdk@2.27.0 command to install AWS CDK on the machine. The script requires sudo permissions, so provide the account password when prompted.

Skills required: App developer, AWS DevOps, DevOps engineer
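Before you run the setup script, it can help to confirm that none of the parameters listed above is missing from config/config.json. The following is a minimal sketch: the function name missing_config_keys is hypothetical, and the check only looks for each quoted key name followed by a colon, so it makes no assumption about the rest of the file's structure and doesn't validate values.

```shell
#!/usr/bin/env bash
# Sketch: report required parameters that are absent from the config file.
# The key names come from the parameter list above; the helper name is
# hypothetical.
REQUIRED_KEYS="pipelineName repositoryName branch cfn_scan_script \
cfn_deploy_script stackName eksClusterName eksClusterRole \
create_cluster_role create_iam_oidc_provider AsgGroupName region install_cdk"

# Prints each required key that does not appear as a quoted JSON key.
missing_config_keys() {
  local file="$1" key
  for key in $REQUIRED_KEYS; do
    grep -q "\"$key\"[[:space:]]*:" "$file" || echo "$key"
  done
}

# Example usage:
#   missing_config_keys config/config.json
```

No output means that every required key name was found.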

Create the CI/CD pipeline to deploy NTH.

Run the setup.sh script.

./setup.sh

The script will deploy the AWS CDK application that will create the CodeCommit repo with example code, the pipeline, and CodeBuild projects based on the user input parameters in the config/config.json file.

This script will ask for the password because it installs npm packages with the sudo command.

Skills required: App developer, AWS DevOps, DevOps engineer

Review the CI/CD pipeline.

Open the AWS Management Console, and review the following resources created in the stack.

  • CodeCommit repo with the contents of the nth folder

  • AWS CodeBuild project cfn-scan, which will scan the CloudFormation template for vulnerabilities.

  • CodeBuild project Nth-Deploy, which will deploy the AWS CloudFormation template and the corresponding NTH Helm charts through the AWS CodePipeline pipeline.

  • A CodePipeline pipeline to deploy NTH.

After the pipeline runs successfully, the Helm release aws-node-termination-handler is installed in the EKS cluster. Also, a Pod named aws-node-termination-handler is running in the kube-system namespace in the cluster.

Skills required: App developer, AWS DevOps, DevOps engineer

Task | Description | Skills required
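The end state described above can also be confirmed from a terminal. The following is a minimal sketch with a hypothetical function name, nth_pod_running. It assumes that your kubeconfig already points at the cluster and that the Pod name starts with aws-node-termination-handler, as in the example output in the Additional information section.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if an NTH Pod is listed as Running in kube-system.
# nth_pod_running is a hypothetical helper name.
nth_pod_running() {
  kubectl get pods -n kube-system --no-headers \
    | awk '$1 ~ /^aws-node-termination-handler/ && $3 == "Running" { found = 1 }
           END { exit !found }'
}

# Example usage:
#   nth_pod_running && echo "NTH Pod is running"
# To inspect the Helm release by the name given above:
#   helm status aws-node-termination-handler -n kube-system
```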

Simulate an Auto Scaling group scale-in event.

To simulate an Auto Scaling group scale-in event, do the following:

  1. On the AWS console, open the EC2 console, and choose Auto Scaling Groups.

  2. Select the Auto Scaling group that has the same name as the one provided in config/config.json, and choose Edit.

  3. Decrease Desired and Minimum Capacity by 1.

  4. Choose Update.
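If you prefer the AWS CLI to the console, the same scale-in can be sketched as follows. This is an illustration rather than part of the pattern: the function name scale_in_by_one is hypothetical, and it assumes that the text output of the query holds the minimum size and desired capacity as two whitespace-separated numbers.

```shell
#!/usr/bin/env bash
# Sketch: lower the minimum and desired capacity of one Auto Scaling
# group by 1. scale_in_by_one is a hypothetical helper name.
scale_in_by_one() {
  local asg_name="$1" sizes min desired new_min new_desired

  # Read the current minimum size and desired capacity.
  sizes="$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" \
    --query 'AutoScalingGroups[0].[MinSize,DesiredCapacity]' \
    --output text)" || return 1
  min="$(echo "$sizes" | awk '{print $1}')"
  desired="$(echo "$sizes" | awk '{print $2}')"

  if [ "$desired" -lt 1 ]; then
    echo "desired capacity is already 0; nothing to scale in" >&2
    return 1
  fi

  # Decrease both values by 1, but never push the minimum below 0.
  new_desired=$((desired - 1))
  new_min=$(( min > 0 ? min - 1 : 0 ))

  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg_name" \
    --min-size "$new_min" \
    --desired-capacity "$new_desired"
}

# Example usage (placeholder value):
#   scale_in_by_one <asg_name>
```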

Review the logs.

During the scale-in event, the NTH Pod will cordon and drain the corresponding worker node (the EC2 instance that will be terminated as part of the scale-in event). To check the logs, use the code in the Additional information section.

Skills required: App developer, AWS DevOps, DevOps engineer

Task | Description | Skills required

Clean up all AWS resources.

To clean up the resources created by this pattern, run the following command.

./uninstall.sh

This will clean up all the resources created in this pattern by deleting the CloudFormation stack.

Skills required: DevOps engineer

Troubleshooting

Issue | Solution

The npm registry isn’t set correctly.

During the installation of this solution, the script runs npm install to download all the required packages. If, during the installation, you see a message that says "Cannot find module," the npm registry might not be set correctly. To see the current registry setting, run the following command.

npm config get registry

To set the registry to https://registry.npmjs.org/, run the following command.

npm config set registry https://registry.npmjs.org

Delay SQS message delivery.

As part of your troubleshooting, if you want to delay the SQS message delivery to NTH Pod, you can adjust the SQS delivery delay parameter. For more information, see Amazon SQS delay queues.
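The delivery delay can also be set with the AWS CLI. The following is a minimal sketch with a hypothetical function name, set_queue_delay. The queue URL is a value that you supply, and the range check reflects the Amazon SQS limit of 0 through 900 seconds for DelaySeconds.

```shell
#!/usr/bin/env bash
# Sketch: set the delivery delay on the SQS queue that NTH reads from.
# set_queue_delay is a hypothetical helper name.
set_queue_delay() {
  local queue_url="$1" delay_seconds="$2"

  # Reject anything that is not a whole number.
  case "$delay_seconds" in
    ''|*[!0-9]*)
      echo "delay must be a whole number of seconds" >&2
      return 1
      ;;
  esac

  # Amazon SQS accepts DelaySeconds from 0 through 900 (15 minutes).
  if [ "$delay_seconds" -gt 900 ]; then
    echo "delay must be between 0 and 900 seconds" >&2
    return 1
  fi

  aws sqs set-queue-attributes \
    --queue-url "$queue_url" \
    --attributes "DelaySeconds=$delay_seconds"
}

# Example usage (placeholder value):
#   set_queue_delay <queue_url> 30
```

Remember to set the delay back to its previous value when you finish troubleshooting, because a delay postpones the cordon and drain of terminating nodes.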

Related resources

Additional information

1. Find the NTH Pod name.

kubectl get pods -n kube-system | grep aws-node-termination-handler
aws-node-termination-handler-65445555-kbqc7   1/1   Running   0   26m

2. Check the logs. An example log looks like the following. It shows that the node is cordoned and drained before the Auto Scaling group lifecycle hook completion signal is sent.

kubectl -n kube-system logs aws-node-termination-handler-65445555-kbqc7
2022/07/17 20:20:43 INF Adding new event to the event store event={"AutoScalingGroupName":"eksctl-my-cluster-target-nodegroup-ng-10d99c89-NodeGroup-ZME36IGAP7O1","Description":"ASG Lifecycle Termination event received. Instance will be interrupted at 2022-07-17 20:20:42.702 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"asg-lifecycle-term-33383831316538382d353564362d343332362d613931352d383430666165636334333564","InProgress":false,"InstanceID":"i-0409f2a9d3085b80e","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"ip-192-168-75-60.us-east-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-east-2c/i-0409f2a9d3085b80e","StartTime":"2022-07-17T20:20:42.702Z","State":""}
2022/07/17 20:20:44 INF Requesting instance drain event-id=asg-lifecycle-term-33383831316538382d353564362d343332362d613931352d383430666165636334333564 instance-id=i-0409f2a9d3085b80e kind=SQS_TERMINATE node-name=ip-192-168-75-60.us-east-2.compute.internal provider-id=aws:///us-east-2c/i-0409f2a9d3085b80e
2022/07/17 20:20:44 INF Pods on node node_name=ip-192-168-75-60.us-east-2.compute.internal pod_names=["aws-node-qchsw","aws-node-termination-handler-65445555-kbqc7","kube-proxy-mz5x5"]
2022/07/17 20:20:44 INF Draining the node
2022/07/17 20:20:44 ??? WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-node-qchsw, kube-system/kube-proxy-mz5x5
2022/07/17 20:20:44 INF Node successfully cordoned and drained node_name=ip-192-168-75-60.us-east-2.compute.internal reason="ASG Lifecycle Termination event received. Instance will be interrupted at 2022-07-17 20:20:42.702 +0000 UTC \n"
2022/07/17 20:20:44 INF Completed ASG Lifecycle Hook (NTH-K8S-TERM-HOOK) for instance i-0409f2a9d3085b80e
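Because the event line is long, it can be easier to confirm the sequence by filtering the log down to its milestone messages. The following is a minimal sketch with a hypothetical function name, nth_drain_milestones. The message strings are taken from the example log above and might differ in other NTH versions.

```shell
#!/usr/bin/env bash
# Sketch: keep only the milestone lines of an NTH log read from stdin.
# nth_drain_milestones is a hypothetical helper name; the patterns are
# copied from the example log above.
nth_drain_milestones() {
  grep -E 'Adding new event to the event store|Draining the node|Node successfully cordoned and drained|Completed ASG Lifecycle Hook'
}

# Example usage (placeholder value):
#   kubectl -n kube-system logs <nth_pod_name> | nth_drain_milestones
```

A healthy scale-in shows the drain messages before the "Completed ASG Lifecycle Hook" line.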