Troubleshooting AWS Deep Learning Containers on EKS

This section lists common errors that the command line might return when you use AWS Deep Learning Containers on an Amazon EKS cluster. Each error is followed by its solution.

Topics

Setup Errors
Usage Errors
Cleanup Errors

Setup Errors

The following errors might be returned when setting up Deep Learning Containers on your Amazon EKS cluster.

Error: registry kubeflow does not exist


$ ks pkg install kubeflow/tf-serving
	 ERROR registry 'kubeflow' does not exist

To solve this error, run the following command.


ks registry add kubeflow github.com/google/kubeflow/tree/master/kubeflow

Error: context deadline exceeded


$ eksctl create cluster <args>
	 [✖] waiting for CloudFormation stack "eksctl-training-cluster-1-nodegroup-ng-8c4c94bc" to reach "CREATE_COMPLETE" status: RequestCanceled: waiter context canceled
	 caused by: context deadline exceeded

To solve this error, verify that your account has not exceeded its capacity limits (service quotas) in the Region where you are creating the cluster. You can also try creating your cluster in a different Region.
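
When the timeout message alone does not explain the failure, the CloudFormation stack events may state the underlying reason. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The helper names stack_name_from_error and show_stack_failures are hypothetical, and us-west-2 is a placeholder Region.

```shell
# Extract the CloudFormation stack name from an eksctl error line (reads stdin).
stack_name_from_error() {
  sed -n 's/.*CloudFormation stack "\([^"]*\)".*/\1/p'
}

# Print the logical ID and failure reason of each resource that failed to create.
# Arguments: stack name, Region (use the Region you passed to eksctl).
show_stack_failures() {
  stack="$1"
  region="$2"
  aws cloudformation describe-stack-events \
    --stack-name "$stack" \
    --region "$region" \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output table
}

# Example usage (hypothetical values):
#   stack=$(eksctl create cluster <args> 2>&1 | stack_name_from_error)
#   show_stack_failures "$stack" us-west-2
```

If the events point to a capacity problem, rerunning eksctl create cluster with a different --region value is one way to follow the advice above.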

Error: The connection to the server localhost:8080 was refused


$ kubectl get nodes
	 The connection to the server localhost:8080 was refused - did you specify the right host or port?

To solve this error, copy the cluster's configuration file to the default Kubernetes configuration path by running the following command.


cp ~/.kube/eksctl/clusters/<cluster-name> ~/.kube/config
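
Copying over ~/.kube/config replaces any cluster entries that the file already holds. The following sketch wraps the copy in a hypothetical helper, install_kubeconfig, that keeps a backup of the existing file first. The .bak suffix is an arbitrary choice.

```shell
# Copy an eksctl-generated kubeconfig into place, keeping a backup of any
# existing file so that other cluster entries are not lost.
# Arguments: source file, destination (defaults to ~/.kube/config).
install_kubeconfig() {
  src="$1"
  dest="${2:-$HOME/.kube/config}"
  if [ ! -f "$src" ]; then
    echo "kubeconfig not found: $src" >&2
    return 1
  fi
  mkdir -p "$(dirname "$dest")"
  if [ -f "$dest" ]; then
    cp "$dest" "$dest.bak"
  fi
  cp "$src" "$dest"
}

# Example usage (replace <cluster-name> with the name of your cluster):
#   install_kubeconfig ~/.kube/eksctl/clusters/<cluster-name>
#   kubectl get nodes
```

If the AWS CLI is configured, aws eks update-kubeconfig --name <cluster-name> --region <region> is another option. It merges the cluster entry into the existing file instead of replacing it.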

Error: handle object: patching object from cluster: merging object with existing state: Unauthorized
```
$ ks apply default
	 ERROR handle object: patching object from cluster: merging object with existing state: Unauthorized
```
This error is due to a concurrency issue that can occur when multiple users with different authorization or credentials try to start jobs on the same cluster. Verify that you are starting a job on the correct cluster.
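
One way to verify the target cluster is to compare the active kubeconfig context with the cluster name you expect. The following sketch assumes the context name is an EKS cluster ARN ending in cluster/<name>, as in the ks init output shown later on this page. The helper name context_targets_cluster is hypothetical.

```shell
# Return success if a kubeconfig context name refers to the expected cluster.
# Assumes the context is either the bare cluster name or an EKS cluster ARN
# of the form arn:aws:eks:<region>:<account>:cluster/<name>.
context_targets_cluster() {
  context="$1"
  expected="$2"
  case "$context" in
    "$expected") return 0 ;;
    *":cluster/$expected") return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (training-gpu-1 is a placeholder cluster name):
#   context_targets_cluster "$(kubectl config current-context)" training-gpu-1 \
#     && ks apply default
```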

Error: Could not create app; directory '/home/ubuntu/kubeflow-tf-hvd' already exists


$ APP_NAME=kubeflow-tf-hvd; ks init ${APP_NAME}; cd ${APP_NAME}
	 INFO Using context "arn:aws:eks:eu-west-1:999999999999:cluster/training-gpu-1" from kubeconfig file "/home/ubuntu/.kube/config"
	 ERROR Could not create app; directory '/home/ubuntu/kubeflow-tf-hvd' already exists

You can safely ignore this message. However, the existing folder might contain files from a previous run that need additional cleanup. To simplify cleanup, delete the folder.
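
If you would rather keep the old contents than delete them, a more cautious variant is to move the folder aside before running ks init. This is a sketch of that alternative, not the documented procedure. The helper name set_aside_existing_app and the timestamped .bak suffix are hypothetical.

```shell
# Move an existing app directory out of the way so that ks init can create
# a fresh one. Prints the backup path if a move happened, nothing otherwise.
set_aside_existing_app() {
  app_dir="$1"
  if [ -d "$app_dir" ]; then
    backup="$app_dir.bak.$(date +%Y%m%d%H%M%S)"
    mv "$app_dir" "$backup"
    echo "$backup"
  fi
}

# Example usage:
#   APP_NAME=kubeflow-tf-hvd
#   set_aside_existing_app "${APP_NAME}"
#   ks init "${APP_NAME}" && cd "${APP_NAME}"
```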

Usage Errors


Error: ssh: Could not resolve hostname openmpi-worker-1.openmpi.kubeflow-dist-train-tf: Name or service not known

If you see this error message while using the Amazon EKS cluster, run the NVIDIA device plugin installation step again. Verify that you have targeted the right cluster by either passing in the specific config file or switching your active cluster to the targeted cluster.
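
The two ways of targeting a cluster mentioned above can be sketched as follows. The wrapper name kubectl_for is hypothetical. It passes --kubeconfig when a file is given and otherwise uses the active context.

```shell
# Run kubectl against a specific kubeconfig file when one is given,
# otherwise against the active context.
# Arguments: kubeconfig path (may be empty), then the kubectl arguments.
kubectl_for() {
  config_file="$1"
  shift
  if [ -n "$config_file" ]; then
    kubectl --kubeconfig "$config_file" "$@"
  else
    kubectl "$@"
  fi
}

# Example usage (placeholders in angle brackets):
#   kubectl_for ~/.kube/eksctl/clusters/<cluster-name> get nodes
#
# Alternatively, switch the active context:
#   kubectl config get-contexts
#   kubectl config use-context <context-name>
```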

Cleanup Errors

The following errors might be returned when cleaning up the resources of your Amazon EKS cluster.

Error: the server doesn't have a resource type "namspace"
```
$ kubectl delete namespace ${NAMESPACE}
	 error: the server doesn't have a resource type "namspace"
```
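This error occurs when the resource type in the command is misspelled, for example namspace instead of namespace. Verify the spelling of the resource type, then run the command again.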

Error: the server has asked for the client to provide credentials
```
$ ks delete default
	 ERROR the server has asked for the client to provide credentials
```
To solve this error, verify that ~/.kube/config points to the correct cluster and that your AWS credentials are correctly configured, either by running aws configure or by exporting the AWS environment variables.
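
The checks above can be sketched as follows, assuming kubectl and the AWS CLI are installed. The helper name aws_env_credentials_set is hypothetical. It only reports whether both key variables are exported and does not validate them.

```shell
# Return success if static AWS credentials are exported in the environment.
# This does not validate the keys; it only checks that both variables are set.
aws_env_credentials_set() {
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]
}

# Example usage:
#   kubectl config current-context     # cluster that ~/.kube/config points to
#   aws_env_credentials_set || echo "no exported keys; using aws configure settings" >&2
#   aws sts get-caller-identity        # identity that the AWS CLI resolves to
```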

Error: finding app root from starting path: : unable to find ksonnet project
```
$ ks delete default
	 ERROR finding app root from starting path: : unable to find ksonnet project
```
To solve this error, verify that you are in the ksonnet app directory. This is the folder that ks init created, not the folder from which you ran ks init.
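
To locate the app directory from somewhere inside it, you can walk up the directory tree. The following sketch assumes the ksonnet app root is the directory that contains app.yaml. The helper name find_ks_root is hypothetical.

```shell
# Walk up from a starting directory (default: the current directory) until a
# directory containing app.yaml is found, and print that directory.
# Returns non-zero if no such directory exists up to the filesystem root.
find_ks_root() {
  dir=$(cd "${1:-.}" 2>/dev/null && pwd) || return 1
  while :; do
    if [ -f "$dir/app.yaml" ]; then
      echo "$dir"
      return 0
    fi
    if [ "$dir" = "/" ]; then
      return 1
    fi
    dir=$(dirname "$dir")
  done
}

# Example usage:
#   cd "$(find_ks_root)" && ks delete default
```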

Error: Error from server (NotFound): pods "openmpi-master" not found
```
$ kubectl logs -n ${NAMESPACE} -f ${COMPONENT}-master > results/benchmark_1.out
	 Error from server (NotFound): pods "openmpi-master" not found
```
This error might be caused by trying to access resources after the context is deleted. Deleting the default context causes the corresponding resources to be deleted as well.
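
To avoid losing results, save the logs before you delete the context. The following sketch checks that the pod still exists before writing its logs. The helper name save_pod_logs is hypothetical, and unlike the command above it omits -f, so it takes a snapshot of the logs rather than following them.

```shell
# Save a pod's logs only if the pod still exists; otherwise explain why not.
# Arguments: namespace, pod name, output file.
save_pod_logs() {
  namespace="$1"
  pod="$2"
  out="$3"
  if kubectl get pod "$pod" -n "$namespace" >/dev/null 2>&1; then
    mkdir -p "$(dirname "$out")"
    kubectl logs -n "$namespace" "$pod" > "$out"
  else
    echo "pod $pod not found in namespace $namespace; was the context already deleted?" >&2
    return 1
  fi
}

# Example usage, run before ks delete default:
#   save_pod_logs "${NAMESPACE}" "${COMPONENT}-master" results/benchmark_1.out
```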
