AWS Deep Learning Containers on Amazon EKS

The following are common errors that might be returned on the command line when using AWS Deep Learning Containers on an Amazon EKS cluster. Each error is followed by its solution.



The following errors might be returned when setting up Deep Learning Containers on your Amazon EKS cluster.

  • Error: registry kubeflow does not exist

    $ ks pkg install kubeflow/tf-serving
    ERROR registry 'kubeflow' does not exist

    To resolve this error, run the following command.

    ks registry add kubeflow github.com/google/kubeflow/tree/master/kubeflow
  • Error: context deadline exceeded

    $ eksctl create cluster <args>
    [✖] waiting for CloudFormation stack "eksctl-training-cluster-1-nodegroup-ng-8c4c94bc" to reach "CREATE_COMPLETE" status: RequestCanceled: waiter context canceled
    caused by: context deadline exceeded


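  • Error: The connection to the server localhost:8080 was refused

    $ kubectl get nodes
    The connection to the server localhost:8080 was refused - did you specify the right host or port?

    To resolve this error, run the following command to copy the cluster configuration to the Kubernetes config.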

    cp ~/.kube/eksctl/clusters/<cluster-name> ~/.kube/config
  • Error: handle object: patching object from cluster: merging object with existing state: Unauthorized

    $ ks apply default
    ERROR handle object: patching object from cluster: merging object with existing state: Unauthorized


  • Error: Could not create app; directory '/home/ubuntu/kubeflow-tf-hvd' already exists

    $ APP_NAME=kubeflow-tf-hvd; ks init ${APP_NAME}; cd ${APP_NAME}
    INFO Using context "arn:aws:eks:eu-west-1:999999999999:cluster/training-gpu-1" from kubeconfig file "/home/ubuntu/.kube/config"
    ERROR Could not create app; directory '/home/ubuntu/kubeflow-tf-hvd' already exists
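
For the localhost:8080 error in the list above, the copy step can be wrapped so that it fails loudly when the eksctl cluster file is missing and keeps a backup of any existing config. This is a minimal sketch, not part of eksctl or kubectl: `use_cluster_config` is a hypothetical helper, and its optional second argument (the kube directory) is an assumption added only to make it easy to try against a scratch directory.

```shell
# Sketch: install an eksctl-generated cluster config as the active kubeconfig.
# use_cluster_config is a hypothetical helper name, not an eksctl/kubectl command.
# The body runs in a subshell so its variables do not leak into the caller.
use_cluster_config() (
  cluster_name="$1"
  kube_dir="${2:-$HOME/.kube}"   # assumption: optional override, useful for dry runs
  src="$kube_dir/eksctl/clusters/$cluster_name"
  dest="$kube_dir/config"

  # Fail early if eksctl never wrote a config for this cluster name.
  if [ ! -f "$src" ]; then
    echo "no eksctl config found at $src" >&2
    return 1
  fi

  # Keep a backup so an existing config is not silently overwritten.
  if [ -f "$dest" ]; then
    cp "$dest" "$dest.bak" || return 1
  fi

  cp "$src" "$dest"
)

# Example call (replace the placeholder with your cluster name):
#   use_cluster_config <cluster-name> && kubectl get nodes
```

The backup step is a design choice rather than something the original command does; drop it if overwriting `~/.kube/config` is acceptable.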



ssh: Could not resolve hostname openmpi-worker-1.openmpi.kubeflow-dist-train-tf: Name or service not known

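If you see this error message while using an Amazon EKS cluster, run the NVIDIA device plugin installation step again. Verify that you are targeting the correct cluster, either by passing in a specific config file or by switching your active cluster to the target cluster.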
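
One way to check the active cluster before rerunning the plugin installation is to compare the current kubectl context with the cluster you expect. The sketch below assumes the context is either the bare cluster name or an EKS ARN ending in `cluster/<name>`, as in the `ks init` log shown earlier; `context_targets_cluster` is a hypothetical helper, and other context naming schemes would need their own pattern.

```shell
# Sketch: decide whether a kubeconfig context string refers to the expected cluster.
# context_targets_cluster is a hypothetical helper, not a kubectl subcommand.
# Assumption: the context is the bare cluster name or an ARN ending in "/<name>".
context_targets_cluster() (
  context="$1"
  cluster="$2"
  case "$context" in
    "$cluster" | */"$cluster") return 0 ;;
    *) return 1 ;;
  esac
)

# Example use with real kubectl commands (placeholders are yours to fill in):
#   current=$(kubectl config current-context)
#   context_targets_cluster "$current" <cluster-name> \
#     || kubectl config use-context <target-context>
#
# Alternatively, pass a specific config file for a single command:
#   kubectl --kubeconfig <path-to-config> get nodes
```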


The following errors might be returned when cleaning up the resources of your Amazon EKS cluster.

  • Error: the server doesn't have a resource type "namspace"

    $ kubectl delete namespace ${NAMESPACE}
    error: the server doesn't have a resource type "namspace"


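  • Error: the server has asked for the client to provide credentials

    $ ks delete default
    ERROR the server has asked for the client to provide credentials

    To resolve this error, verify that ~/.kube/config points to the correct cluster and that your AWS credentials are configured correctly, either by using aws configure or by exporting the AWS environment variables.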

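  • Error: finding app root from starting path: : unable to find ksonnet project

    $ ks delete default
    ERROR finding app root from starting path: : unable to find ksonnet project

    To resolve this error, verify that you are in the directory created by the ksonnet app. This is the folder where ks init was run.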

  • Error: Error from server (NotFound): pods "openmpi-master" not found

    $ kubectl logs -n ${NAMESPACE} -f ${COMPONENT}-master > results/benchmark_1.out
    Error from server (NotFound): pods "openmpi-master" not found
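
The two `ks delete default` errors above can be caught before running the command with a pair of pre-flight checks. The following is a rough sketch under stated assumptions: both function names are hypothetical, the app-root search assumes that `ks init` leaves an `app.yaml` file at the root of the ksonnet app, and the credential check covers only the exported-environment-variable route (credentials stored by `aws configure` are not inspected).

```shell
# Sketch: pre-flight checks before `ks delete default`. Both helpers are hypothetical.

# Walk upward from a starting directory until a directory holding app.yaml is found.
# Assumption: app.yaml marks the root of a ksonnet app created by `ks init`.
find_ks_app_root() (
  dir="${1:-$PWD}"
  dir=$(cd "$dir" 2>/dev/null && pwd) || return 1
  while :; do
    if [ -f "$dir/app.yaml" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Stop at the filesystem root without finding an app.
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
)

# Succeed only if both standard AWS credential variables are set and non-empty.
aws_env_credentials_set() {
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]
}

# Example use:
#   root=$(find_ks_app_root) && cd "$root" || echo "not inside a ksonnet app" >&2
#   aws_env_credentials_set || echo "AWS credential variables are not exported" >&2
#   aws sts get-caller-identity   # shows which AWS identity the CLI is using
```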
