Amazon EKS Setup
This section provides installation instructions to set up a deep learning environment running AWS Deep Learning Containers on Amazon Elastic Kubernetes Service (Amazon EKS).
Licensing
To use GPU hardware, use an Amazon Machine Image that has the necessary GPU drivers. We recommend using the Amazon EKS-optimized AMI with GPU support, which is used in subsequent steps of this guide. This AMI includes third-party software that is not provided by AWS, so it requires an end user license agreement (EULA). You must subscribe to the EKS-optimized AMI in the AWS Marketplace and accept the EULA before you can use the AMI in your worker node groups.
Important
To subscribe to the AMI, visit the AWS Marketplace.
Configure Security Settings
To use Amazon EKS, you must have a user account with access to several security permissions. These are set with the AWS Identity and Access Management (IAM) tool.
- Create an IAM user or update an existing IAM user by following the steps in Creating an IAM user in your AWS account.
- Get the credentials of this user.
  - Open the IAM console at https://console.aws.amazon.com/iam/.
  - Under Users, select your user.
  - Select Security Credentials.
  - Select Create access key.
  - Download the key pair or copy the information for use later.
- Add the following policies to your IAM user. These policies provide the required access for Amazon EKS, IAM, and Amazon Elastic Compute Cloud (Amazon EC2).
  - Select Permissions.
  - Select Add permissions.
  - Select Create policy.
  - From the Create policy window, select the JSON tab.
  - Paste the following content.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "eks:*",
                "Resource": "*"
            }
        ]
    }

  - Name the policy EKSFullAccess and create the policy.
  - Navigate back to the Grant permissions window.
  - Select Attach existing policies directly.
  - Search for EKSFullAccess, and select the check box.
  - Search for AWSCloudFormationFullAccess, and select the check box.
  - Search for AmazonEC2FullAccess, and select the check box.
  - Search for IAMFullAccess, and select the check box.
  - Search for AmazonEC2ContainerRegistryReadOnly, and select the check box.
  - Search for AmazonEKS_CNI_Policy, and select the check box.
  - Search for AmazonS3FullAccess, and select the check box.
  - Accept the changes.
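If you prefer the command line, the custom policy can also be created and attached with the AWS CLI. This is a minimal sketch, assuming the JSON above is saved as eks-full-access.json, that your IAM user is named <iam-user>, and that <account-id> is your AWS account ID:

$ aws iam create-policy --policy-name EKSFullAccess --policy-document file://eks-full-access.json
$ aws iam attach-user-policy --user-name <iam-user> --policy-arn arn:aws:iam::<account-id>:policy/EKSFullAccess

The AWS managed policies (AmazonEC2FullAccess and the rest) can be attached the same way by passing their arn:aws:iam::aws:policy/... ARNs to attach-user-policy.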
Gateway Node
To set up an Amazon EKS cluster, use the open source tool eksctl. We recommend that you use an Amazon EC2 instance with the Deep Learning Base AMI (Ubuntu) to allocate and control your cluster. You can run these tools locally on your computer or on an Amazon EC2 instance that you already have running. However, to simplify this guide we assume you're using a Deep Learning Base AMI (DLAMI) with Ubuntu 16.04. We refer to this as your gateway node.
Before you start, consider the location of your training data or where you want to run your cluster for responding to inference requests. Typically, your data and your cluster for training or inference should be in the same Region, and you should spin up your gateway node in that same Region as well. You can follow this quick 10-minute tutorial to launch the gateway node.
- Log in to your gateway node.
- Install or upgrade the AWS CLI. To access the required new Kubernetes features, you must have the latest version.

  $ sudo pip install --upgrade awscli

- Install eksctl by running the command corresponding to your operating system in the Amazon EKS User Guide's installation instructions. For more information about eksctl, see also the eksctl documentation.
- Install kubectl by following the steps in the Installing kubectl guide.
  Note: You must use a kubectl version that is within one minor version difference of your Amazon EKS cluster control plane version. For example, a 1.18 kubectl client works with Kubernetes 1.17, 1.18, and 1.19 clusters.
- Install aws-iam-authenticator by running the following commands. For more information on aws-iam-authenticator, see Installing aws-iam-authenticator.

  $ curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/aws-iam-authenticator
  $ chmod +x aws-iam-authenticator
  $ cp ./aws-iam-authenticator $HOME/bin/aws-iam-authenticator && export PATH=$HOME/bin:$PATH

- Run aws configure for the IAM user from the Configure Security Settings section. You are copying the IAM user's AWS Access Key, then the AWS Secret Access Key that you accessed in the IAM console, and pasting these into the prompts from aws configure.
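For reference, a typical aws configure session looks like the following. The key values shown are AWS's standard documentation placeholders, not real credentials; substitute your own:

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json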
GPU Clusters
- Examine the following command to create a cluster using a p3.8xlarge instance type. You must make the following modifications before you run it.
  - name is what you use to manage your cluster. You can change cluster-name to any name you like, as long as it contains no spaces or special characters.
  - eks-version is the Amazon EKS Kubernetes version. For the supported Amazon EKS versions, see Available Amazon EKS Kubernetes versions.
  - nodes is the number of instances you want in your cluster. In this example, we start with three nodes.
  - node-type refers to an instance class.
  - timeout and ssh-access can be left alone.
  - ssh-public-key is the name of the key that you want to use to log in to your worker nodes. Either use a security key you already use, or create a new one, but be sure to swap out the ssh-public-key for a key that was allocated for the Region you use. Note: You only need to provide the key name as seen in the Key Pairs section of the Amazon EC2 console.
  - region is the Amazon EC2 Region where the cluster is launched. If you plan to use training data that resides in a specific Region (other than <us-east-1>), we recommend that you use the same Region. The ssh-public-key must have access to launch instances in this Region.
    Note: The rest of this guide assumes <us-east-1> as the Region.
- After you have made changes to the command, run it and wait. It can take several minutes for a single-node cluster, and even longer if you chose to create a large cluster.

  $ eksctl create cluster <cluster-name> \
      --version <eks-version> \
      --nodes 3 \
      --node-type=<p3.8xlarge> \
      --timeout=40m \
      --ssh-access \
      --ssh-public-key <key_pair_name> \
      --region <us-east-1> \
      --zones=us-east-1a,us-east-1b,us-east-1d \
      --auto-kubeconfig

  You should see something similar to the following output:

  EKS cluster "training-1" in "us-east-1" region is ready
- Ideally, --auto-kubeconfig should have configured your cluster. However, if you run into issues, you can run the following command to set your kubeconfig. You can also use this command if you want to change your gateway node and manage your cluster from elsewhere.

  $ aws eks --region <region> update-kubeconfig --name <cluster-name>

  You should see something similar to the following output:

  Added new context arn:aws:eks:us-east-1:999999999999:cluster/training-1 to /home/ubuntu/.kube/config
- If you plan to use GPU instance types, run the NVIDIA device plugin for Kubernetes on your cluster with the following command:

  $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
- Verify the GPUs available on each node in your cluster:

  $ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
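The output lists one row per node with its allocatable GPU count. A sketch of what to expect for the three p3.8xlarge nodes above (node names will differ; each p3.8xlarge exposes 4 GPUs):

NAME                             GPU
ip-192-168-12-34.ec2.internal    4
ip-192-168-56-78.ec2.internal    4
ip-192-168-90-12.ec2.internal    4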
CPU Clusters
Refer to the previous section's discussion on using the eksctl command to launch a GPU cluster, and modify node-type to use a CPU instance type.
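For example, a minimal variation of the earlier command, with only the instance class changed (c5.18xlarge is chosen purely for illustration):

$ eksctl create cluster <cluster-name> --version <eks-version> --nodes 3 \
    --node-type=c5.18xlarge --timeout=40m --ssh-access \
    --ssh-public-key <key_pair_name> --region <us-east-1> --auto-kubeconfig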
Habana Clusters
Refer to the previous discussion on using the eksctl command to launch a GPU cluster, and modify node-type to use an instance with Habana Gaudi accelerators, such as the DL1 instance type.
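For example (dl1.24xlarge shown here as the assumed DL1 instance size):

$ eksctl create cluster <cluster-name> --version <eks-version> --nodes 3 \
    --node-type=dl1.24xlarge --timeout=40m --ssh-access \
    --ssh-public-key <key_pair_name> --region <us-east-1> --auto-kubeconfig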
Test Your Clusters
- You can run a kubectl command on the cluster to check its status. Try the following command to make sure it is picking up the current cluster you want to manage.

  $ kubectl get nodes -o wide

- Take a look in ~/.kube. This directory holds the kubeconfig files for the various clusters configured from your gateway node. If you browse further into the folder, you can find ~/.kube/eksctl/clusters, which holds the kubeconfig files for clusters created using eksctl. Ideally you shouldn't have to modify these files, since the tools generate and update the configurations for you, but they are good to reference when troubleshooting.
- Verify that the cluster is active.

  $ aws eks --region <region> describe-cluster --name <cluster-name> --query cluster.status

  You should see the following output:

  "ACTIVE"

- Verify the kubectl context if you have multiple clusters set up from the same host instance. Sometimes it helps to make sure that the default context found by kubectl is set properly. Check this using the following command (see the sample output after this list):

  $ kubectl config get-contexts

- If the context is not set as expected, fix it using the following command:

  $ aws eks --region <region> update-kubeconfig --name <cluster-name>
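For reference, output from kubectl config get-contexts looks similar to the following; the asterisk marks the active context, and the ARN matches the example account used earlier in this guide:

CURRENT   NAME                                                     CLUSTER                                                  AUTHINFO
*         arn:aws:eks:us-east-1:999999999999:cluster/training-1   arn:aws:eks:us-east-1:999999999999:cluster/training-1   arn:aws:eks:us-east-1:999999999999:cluster/training-1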
Manage Your Clusters
When you want to control or query a cluster, you can address it by its configuration file using the --kubeconfig parameter. This is useful when you have more than one cluster. For example, if you have a separate cluster called "training-gpu-1", you can call the get pods command on it by passing the configuration file as a parameter as follows:

$ kubectl --kubeconfig=/home/ubuntu/.kube/eksctl/clusters/training-gpu-1 get pods
It is useful to note that you can run this same command without the --kubeconfig parameter. In that case, the command will use the current actively controlled cluster (current-context).

$ kubectl get pods
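If you're not sure which cluster that is, you can print the active context directly. With clusters registered through aws eks update-kubeconfig, the context name is the cluster ARN shown earlier:

$ kubectl config current-context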
If you set up multiple clusters and they don't yet have the NVIDIA device plugin installed, you can install it this way:

$ kubectl --kubeconfig=/home/ubuntu/.kube/eksctl/clusters/training-gpu-1 \
    create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
You can also change the active cluster by updating the kubeconfig, passing the name of the cluster that you want to manage. The following command updates the kubeconfig and removes the need to use the --kubeconfig parameter.

$ aws eks --region us-east-1 update-kubeconfig --name training-gpu-1
If you follow all of the examples in this guide, you might switch frequently between active clusters so that you can orchestrate training or inference, or use different frameworks running on different clusters.
Cleanup
When you're done using the cluster, delete it to avoid incurring additional costs.
$ eksctl delete cluster --name=<cluster-name>

To delete only a pod, run the following:

$ kubectl delete pods <name>

To reset the secret for access to the cluster, run the following:

$ kubectl delete secret ${SECRET} -n ${NAMESPACE} || true
To delete a nodegroup attached to a cluster, run the following (eksctl requires both the cluster and the nodegroup name):

$ eksctl delete nodegroup --cluster <cluster_name> --name <nodegroup_name>
To attach a nodegroup to a cluster, run the following:

$ eksctl create nodegroup --cluster <cluster-name> \
    --node-ami <ami_id> \
    --nodes <num_nodes> \
    --node-type=<instance_type> \
    --timeout=40m \
    --ssh-access \
    --ssh-public-key <key_pair_name> \
    --region <us-east-1> \
    --auto-kubeconfig
Next steps
To learn about training and inference with Deep Learning Containers on Amazon EKS, visit Training or Inference.