Automate deployment of Node Termination Handler in Amazon EKS by using a CI/CD pipeline
Created by Sandip Gangapadhyay (AWS), John Vargas (AWS), Pragtideep Singh (AWS), Sandeep Gawande (AWS), and Viyoma Sachdeva (AWS)
Code repository: Deploy NTH to EKS | Environment: Production | Technologies: Containers & microservices; DevOps | AWS services: AWS CodePipeline; Amazon EKS; AWS CodeBuild
Summary
Notice: AWS CodeCommit is no longer available to new customers. Existing customers of AWS CodeCommit can continue to use the service as normal.
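On the Amazon Web Services (AWS) Cloud, you can use AWS Node Termination Handler to help Kubernetes respond gracefully to events that can make an Amazon EC2 instance unavailable. Such events include the following: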
Auto Scaling group rebalancing across Availability Zones
EC2 instance termination through the API or the AWS Management Console
If an event isn’t handled, your application code might not stop gracefully. It also might take longer to recover full availability, or it might accidentally schedule work to nodes that are going down. The aws-node-termination-handler (NTH) can operate in two different modes: Instance Metadata Service (IMDS) or Queue Processor. For more information about the two modes, see the Readme file.
This pattern uses AWS CodeCommit, and it automates the deployment of NTH by using Queue Processor through a continuous integration and continuous delivery (CI/CD) pipeline.
Note: If you're using EKS managed node groups, you don't need the aws-node-termination-handler.
Prerequisites and limitations
Prerequisites
An active AWS account.
A web browser that is supported for use with the AWS Management Console. See the list of supported browsers.
AWS Cloud Development Kit (AWS CDK) installed.
kubectl, the Kubernetes command line tool, installed.
eksctl, the command line tool for creating and managing clusters on Amazon Elastic Kubernetes Service (Amazon EKS), installed.
A running EKS cluster with version 1.20 or later.
A self-managed node group attached to the EKS cluster. To create an Amazon EKS cluster with a self-managed node group, run the following command.
eksctl create cluster --managed=false --region <region> --name <cluster_name>
For more information on eksctl, see the eksctl documentation.
AWS Identity and Access Management (IAM) OpenID Connect (OIDC) provider for your cluster. For more information, see Creating an IAM OIDC provider for your cluster.
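Before you start, it can help to confirm that the command line tools named in the prerequisites are on the PATH. The following is a minimal sketch and is not part of the pattern's repository. The helper name `missing_tools` is hypothetical, and the binary names `aws`, `cdk`, `kubectl`, and `eksctl` are assumptions about how the tools are installed on your machine.

```shell
#!/bin/sh
# Print the name of every tool in the argument list that is not on the PATH.
# missing_tools is a hypothetical helper, not part of the pattern's code.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Binary names below are assumptions; adjust them to your installation.
missing=$(missing_tools aws cdk kubectl eksctl)
if [ -n "$missing" ]; then
  # $missing is intentionally unquoted so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All prerequisite tools found."
fi
```

The check only confirms that each binary can be found. It does not verify versions or AWS credentials.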
Limitations
You must use an AWS Region that supports the Amazon EKS service.
Product versions
Kubernetes version 1.20 or later
eksctl version 0.107.0 or later
AWS CDK version 2.27.0 or later
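Minimum versions such as 0.107.0 can't be compared reliably as plain strings, because "0.99.0" sorts after "0.107.0" alphabetically. The sketch below shows one way to compare dotted versions. It assumes a `sort` that supports the `-V` (version sort) option, as GNU coreutils does. The helper name `version_ge` and the sample version `0.110.0` are illustrative only.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Example with a made-up installed version; substitute the value that
# your own `eksctl version` command reports.
if version_ge "0.110.0" "0.107.0"; then
  echo "eksctl version OK"
else
  echo "eksctl is older than 0.107.0"
fi
```

The same function could be applied to the AWS CDK and Kubernetes minimums listed above.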
Architecture
Target technology stack
A virtual private cloud (VPC)
An EKS cluster
Amazon Simple Queue Service (Amazon SQS)
IAM
Kubernetes
Target architecture
The following diagram shows the high-level view of the end-to-end steps when the node termination is started.
The workflow shown in the diagram consists of the following high-level steps:
The EC2 instance-terminate event from the Auto Scaling group is sent to the SQS queue.
The NTH Pod monitors the SQS queue for new messages.
The NTH Pod receives the new message and does the following:
Cordons the node so that no new Pods are scheduled on it.
Drains the node so that the existing Pods are evicted.
Sends a lifecycle hook signal to the Auto Scaling group so that the node can be terminated.
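The three actions that the NTH Pod takes can be approximated by hand with standard kubectl and AWS CLI commands. The sketch below only prints those commands instead of running them. It illustrates the sequence and is not a description of how NTH is implemented internally. The function name `print_nth_steps` and every argument value are placeholders.

```shell
# Print the manual equivalents of the cordon, drain, and lifecycle steps.
# Arguments: node name, EC2 instance ID, Auto Scaling group name, hook name.
# print_nth_steps is a hypothetical helper; nothing is executed.
print_nth_steps() {
  node="$1"; instance_id="$2"; asg_name="$3"; hook_name="$4"
  # Step 1: mark the node unschedulable.
  echo "kubectl cordon $node"
  # Step 2: evict Pods; DaemonSet-managed Pods are skipped.
  echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
  # Step 3: tell the Auto Scaling group that termination can continue.
  echo "aws autoscaling complete-lifecycle-action --lifecycle-hook-name $hook_name --auto-scaling-group-name $asg_name --instance-id $instance_id --lifecycle-action-result CONTINUE"
}

print_nth_steps "<node_name>" "<instance_id>" "<asg_name>" "<hook_name>"
```

Printing rather than executing keeps the example safe to run anywhere. To act on a real node, you would run the printed commands yourself with real values.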
Automation and scale
Code is managed and deployed by AWS CDK, backed by AWS CloudFormation nested stacks.
The Amazon EKS control plane runs across multiple Availability Zones to ensure high availability.
For automatic scaling, Amazon EKS supports the Kubernetes Cluster Autoscaler and Karpenter.
Tools
AWS services
AWS Cloud Development Kit (AWS CDK) is a software development framework that helps you define and provision AWS Cloud infrastructure in code.
AWS CodeBuild is a fully managed build service that helps you compile source code, run unit tests, and produce artifacts that are ready to deploy.
AWS CodeCommit is a version control service that helps you privately store and manage Git repositories, without needing to manage your own source control system.
AWS CodePipeline helps you quickly model and configure the different stages of a software release and automate the steps required to release software changes continuously.
Amazon Elastic Kubernetes Service (Amazon EKS) helps you run Kubernetes on AWS without needing to install or maintain your own Kubernetes control plane or nodes.
Amazon EC2 Auto Scaling helps you maintain application availability and allows you to automatically add or remove Amazon EC2 instances according to conditions you define.
Amazon Simple Queue Service (Amazon SQS) provides a secure, durable, and available hosted queue that helps you integrate and decouple distributed software systems and components.
Other tools
kubectl is a Kubernetes command line tool for running commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.
Code
The code for this pattern is available in the deploy-nth-to-eks repository, which contains the following files and folders:
nth folder – The Helm chart, values files, and the scripts to scan and deploy the AWS CloudFormation template for Node Termination Handler.
config/config.json – The configuration parameter file for the application. This file contains all the parameters that AWS CDK needs for the deployment.
cdk – AWS CDK source code.
setup.sh – The script used to deploy the AWS CDK application to create the required CI/CD pipeline and other required resources.
uninstall.sh – The script used to clean up the resources.
To use the example code, follow the instructions in the Epics section.
Best practices
For best practices when automating AWS Node Termination Handler, see the following:
Epics
Task | Description | Skills required |
---|---|---|
Clone the repo. | Clone the repo by using either SSH (Secure Shell) or HTTPS. Cloning the repo creates a folder named after the repository. Change to that directory. | App developer, AWS DevOps, DevOps engineer |
Set the kubeconfig file. | Set your AWS credentials in your terminal, confirm that you have rights to assume the cluster role, and then update your kubeconfig file for the cluster. | AWS DevOps, DevOps engineer, App developer |
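One way to set the kubeconfig file is with the AWS CLI `update-kubeconfig` command. The following is a sketch under stated assumptions: the cluster name, Region, and role ARN are placeholders, and the `run` helper and `DRY_RUN` variable are hypothetical additions that print each command unless you set `DRY_RUN=0`.

```shell
# Placeholders: replace these values, or export the variables, before a real run.
CLUSTER_NAME="${CLUSTER_NAME:-<cluster_name>}"
REGION="${REGION:-<region>}"
ROLE_ARN="${ROLE_ARN:-<role_arn>}"

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Confirm which identity the terminal credentials resolve to.
run aws sts get-caller-identity
# Write the cluster entry to kubeconfig, assuming the cluster role.
run aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$REGION" --role-arn "$ROLE_ARN"
# Check that kubectl can reach the cluster.
run kubectl get nodes
```

If `kubectl get nodes` fails after a real run, the role in `ROLE_ARN` might not have access to the cluster.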
Task | Description | Skills required |
---|---|---|
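Set up the parameters. | In the configuration parameter file described in the Code section, set the parameters required for the deployment. | App developer, AWS DevOps, DevOps engineer |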
Create the CI/CD pipeline to deploy NTH. | Run the setup.sh script. The script deploys the AWS CDK application, which creates the CodeCommit repo with example code, the pipeline, and the CodeBuild projects based on the parameters that you set in the configuration parameter file. The script asks for your password because it installs npm packages with the sudo command. | App developer, AWS DevOps, DevOps engineer |
Review the CI/CD pipeline. | Open the AWS Management Console, and review the resources created in the stack. After the pipeline runs successfully, the NTH Helm release is installed in the EKS cluster. | App developer, AWS DevOps, DevOps engineer |
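After the pipeline finishes, you can also check from the terminal that an NTH Pod is running. The sketch below assumes that the Pod name starts with `aws-node-termination-handler` and that the Pod runs in the `kube-system` namespace, as in the example in the Additional information section. The helper name `nth_pod_running` is hypothetical.

```shell
# Read `kubectl get pods` output on stdin.
# Succeed if a Pod whose name starts with aws-node-termination-handler
# is in the Running state. nth_pod_running is a hypothetical helper.
nth_pod_running() {
  awk '$1 ~ /^aws-node-termination-handler/ && $3 == "Running" { found = 1 }
       END { exit !found }'
}

# Demonstration with the sample Pod line from this pattern.
# Against a real cluster, pipe in: kubectl get pods -n kube-system
# To list Helm releases in the namespace: helm list -n kube-system
if printf '%s\n' 'aws-node-termination-handler-65445555-kbqc7 1/1 Running 0 26m' | nth_pod_running; then
  echo "NTH Pod is running"
fi
```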
Task | Description | Skills required |
---|---|---|
Simulate an Auto Scaling group scale-in event. | To simulate a scale-in event, reduce the desired capacity (and, if necessary, the minimum capacity) of the Auto Scaling group that backs the self-managed node group. | |
Review the logs. | During the scale-in event, the NTH Pod will cordon and drain the corresponding worker node (the EC2 instance that will be terminated as part of the scale-in event). To check the logs, use the code in the Additional information section. | App developer, AWS DevOps, DevOps engineer |
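A scale-in event can also be triggered from the AWS CLI instead of the console. The sketch below builds an `aws autoscaling update-auto-scaling-group` command that lowers the desired capacity by one, and lowers the minimum size only when it would otherwise block the change. It prints the command rather than running it. The helper name `scale_in_command`, the group name, and the capacity values are placeholders.

```shell
# Print an AWS CLI command that scales an Auto Scaling group in by one.
# Arguments: group name, current minimum size, current desired capacity.
# scale_in_command is a hypothetical helper; nothing is executed.
scale_in_command() {
  asg_name="$1"; min_size="$2"; desired="$3"
  if [ "$desired" -le 0 ]; then
    echo "desired capacity is already 0" >&2
    return 1
  fi
  new_desired=$((desired - 1))
  # Lower the minimum only when it would otherwise block the scale-in.
  new_min="$min_size"
  if [ "$new_min" -gt "$new_desired" ]; then
    new_min="$new_desired"
  fi
  echo "aws autoscaling update-auto-scaling-group --auto-scaling-group-name $asg_name --min-size $new_min --desired-capacity $new_desired"
}

# Example with placeholder values: minimum 2, desired 2.
scale_in_command "<asg_name>" 2 2
```

After running the printed command against a real group, Auto Scaling selects an instance to terminate, which is the event that the NTH Pod reacts to.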
Task | Description | Skills required |
---|---|---|
Clean up all AWS resources. | To clean up the resources created by this pattern, run the uninstall.sh script. The script deletes the CloudFormation stack and, with it, all the resources that this pattern created. | DevOps engineer |
Troubleshooting
Issue | Solution |
---|---|
The npm registry isn’t set correctly. | During the installation of this solution, the script runs npm install to download all the required packages. If, during the installation, you see a message that says "Cannot find module," the npm registry might not be set correctly. Check the current registry setting and, if necessary, set the correct registry by using npm config. |
Delay SQS message delivery. | As part of your troubleshooting, if you want to delay the SQS message delivery to the NTH Pod, you can adjust the SQS delivery delay parameter. For more information, see Amazon SQS delay queues. |
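Both troubleshooting items can be handled from the command line. The sketch below lists the standard npm commands for inspecting and setting the registry as comments, and builds an `aws sqs set-queue-attributes` command that changes the queue's `DelaySeconds` attribute, which Amazon SQS accepts in the range 0 to 900. The helper name `sqs_delay_command`, the queue URL, the registry URL, and the 60-second delay are placeholders chosen for illustration.

```shell
# npm registry checks (standard npm commands; the URL is a placeholder):
#   npm config get registry
#   npm config set registry <registry_url>

# Print the AWS CLI command that sets a queue's delivery delay.
# Arguments: queue URL, delay in seconds (0 to 900).
# sqs_delay_command is a hypothetical helper; nothing is executed.
sqs_delay_command() {
  queue_url="$1"; delay="$2"
  case "$delay" in
    ''|*[!0-9]*)
      echo "delay must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  if [ "$delay" -gt 900 ]; then
    echo "delay must be between 0 and 900 seconds" >&2
    return 1
  fi
  echo "aws sqs set-queue-attributes --queue-url $queue_url --attributes DelaySeconds=$delay"
}

sqs_delay_command "<queue_url>" 60
```

Remember to set the delay back to its original value when you finish troubleshooting, because a delay postpones every message that NTH receives.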
Related resources
Additional information
1. Find the NTH Pod name.
kubectl get pods -n kube-system | grep aws-node-termination-handler
aws-node-termination-handler-65445555-kbqc7 1/1 Running 0 26m
2. Check the logs. An example log looks like the following. It shows that the node has been cordoned and drained before sending the Auto Scaling group lifecycle hook completion signal.
kubectl -n kube-system logs aws-node-termination-handler-65445555-kbqc7
2022/07/17 20:20:43 INF Adding new event to the event store event={"AutoScalingGroupName":"eksctl-my-cluster-target-nodegroup-ng-10d99c89-NodeGroup-ZME36IGAP7O1","Description":"ASG Lifecycle Termination event received. Instance will be interrupted at 2022-07-17 20:20:42.702 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"asg-lifecycle-term-33383831316538382d353564362d343332362d613931352d383430666165636334333564","InProgress":false,"InstanceID":"i-0409f2a9d3085b80e","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"ip-192-168-75-60.us-east-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-east-2c/i-0409f2a9d3085b80e","StartTime":"2022-07-17T20:20:42.702Z","State":""}
2022/07/17 20:20:44 INF Requesting instance drain event-id=asg-lifecycle-term-33383831316538382d353564362d343332362d613931352d383430666165636334333564 instance-id=i-0409f2a9d3085b80e kind=SQS_TERMINATE node-name=ip-192-168-75-60.us-east-2.compute.internal provider-id=aws:///us-east-2c/i-0409f2a9d3085b80e
2022/07/17 20:20:44 INF Pods on node node_name=ip-192-168-75-60.us-east-2.compute.internal pod_names=["aws-node-qchsw","aws-node-termination-handler-65445555-kbqc7","kube-proxy-mz5x5"]
2022/07/17 20:20:44 INF Draining the node
2022/07/17 20:20:44 ??? WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-node-qchsw, kube-system/kube-proxy-mz5x5
2022/07/17 20:20:44 INF Node successfully cordoned and drained node_name=ip-192-168-75-60.us-east-2.compute.internal reason="ASG Lifecycle Termination event received. Instance will be interrupted at 2022-07-17 20:20:42.702 +0000 UTC \n"
2022/07/17 20:20:44 INF Completed ASG Lifecycle Hook (NTH-K8S-TERM-HOOK) for instance i-0409f2a9d3085b80e
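Because the NTH log is verbose, it can be easier to filter it down to the lines that mark progress through the drain. The sketch below is one way to do that with grep. The helper name `nth_milestones` is hypothetical, and the patterns are taken from the example log in this section, so they might need adjusting for other NTH versions.

```shell
# Read NTH logs on stdin and print only the lines that mark drain progress.
# nth_milestones is a hypothetical helper; the patterns come from the
# example log in this pattern.
nth_milestones() {
  grep -E 'Requesting instance drain|Node successfully cordoned and drained|Completed ASG Lifecycle Hook'
}

# Demonstration with three lines copied from the example log.
# Against a real cluster: kubectl -n kube-system logs <nth_pod_name> | nth_milestones
nth_milestones <<'EOF'
2022/07/17 20:20:44 INF Draining the node
2022/07/17 20:20:44 ??? WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-node-qchsw, kube-system/kube-proxy-mz5x5
2022/07/17 20:20:44 INF Completed ASG Lifecycle Hook (NTH-K8S-TERM-HOOK) for instance i-0409f2a9d3085b80e
EOF
```

If the final "Completed ASG Lifecycle Hook" line never appears, the node was probably not drained before the lifecycle hook timed out.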