Kubeflow on AWS setup
This section provides installation instructions to set up a deep learning environment using AWS Deep Learning Containers with Kubeflow on AWS, an open source distribution of Kubeflow. After you finish the Kubeflow on AWS setup, you can continue with the training tutorials in this series.
Deploy Kubeflow on AWS
To deploy Kubeflow on AWS, follow the Vanilla deployment option.
If you deployed a GPU cluster following the previous instructions, the NVIDIA device plug-in for Kubernetes is already installed. You do not need any additional setup.
Note
The following tutorials use the Vanilla version of Kubeflow on AWS as an example. However, you can run all training and inference tutorials in this Kubeflow on AWS section with any other deployment option of Kubeflow on AWS.
For information about setting up and configuring Amazon RDS, Amazon S3, and Amazon Cognito resources as part of your Kubeflow on AWS deployment, see Deployment options.
After you have set up your Amazon EKS cluster, you can verify that your context points to your cluster in the following section.
Verify cluster connection
These steps show how to verify your kubectl context so that you interact with the correct cluster.
-
First, confirm that the cluster is active by running the following command.
aws eks --region <region> describe-cluster --name <cluster-name> --query cluster.status
You should see the following output.
"ACTIVE"
-
To check your current context, run this command. The current-context field in the output should contain your cluster name.
kubectl config view
If your current-context is not the cluster you want to interact with, run the following command to update it. For more information about updating your kubeconfig, visit the Amazon EKS documentation.
aws eks update-kubeconfig --region <region> --name <cluster-name>
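The two checks above can be combined into a small script. The following is a minimal sketch, not an official helper: the function names cluster_is_active, context_matches, and ensure_context are hypothetical, and the context check assumes, as the step above does, that the context name contains your cluster name. It uses --output text so that the status is printed without the surrounding quotes.

```shell
# Sketch: check that an EKS cluster is ACTIVE and that kubectl points at it.
# All function names below are hypothetical helpers, not part of any CLI.

# Succeeds only when the AWS CLI reports the cluster status as ACTIVE.
cluster_is_active() {
    region="$1"
    cluster_name="$2"
    status=$(aws eks --region "$region" describe-cluster \
        --name "$cluster_name" --query cluster.status --output text) || return 1
    [ "$status" = "ACTIVE" ]
}

# Succeeds when the current kubectl context contains the cluster name.
# Assumption: this is a loose substring match, so similarly named
# clusters (for example "demo" and "demo2") are not told apart.
context_matches() {
    cluster_name="$1"
    context=$(kubectl config current-context) || return 1
    case "$context" in
        *"$cluster_name"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Refresh kubeconfig only when the context does not already match.
ensure_context() {
    region="$1"
    cluster_name="$2"
    context_matches "$cluster_name" ||
        aws eks update-kubeconfig --region "$region" --name "$cluster_name"
}
```

For example, after sourcing these functions you might run cluster_is_active <region> <cluster-name> && ensure_context <region> <cluster-name>, substituting your own values.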
After you have deployed Kubeflow on AWS and updated your current context, verify that your Kubeflow user profile uses the right namespace in the following section.
Verify your namespace
These steps show how to verify that your active Kubeflow user profile uses the namespace kubeflow-user-example-com. All tutorials in this series run in this namespace.
-
Note
In Kubeflow, all namespaces should be created via profiles. The Kubeflow on AWS Vanilla installation creates a user profile with the namespace kubeflow-user-example-com by default.
Ensure that a namespace named kubeflow-user-example-com exists by running the following command.
kubectl get namespace
If the namespace does not appear in the output, create a new Kubeflow profile as follows.
-
Open vi or vim, then copy and paste the following content. Save this profile description file as profile.yaml. Make sure to replace the email under owner.name with your email.
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  # replace with the name of profile you want, this is the user's namespace name
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    # replace with the email of the user
    name: user@example.com
-
Run the following command to create the corresponding profile resource.
kubectl apply -f profile.yaml
-
Export the NAMESPACE variable.
export NAMESPACE=kubeflow-user-example-com
We refer to this namespace as the variable ${NAMESPACE} in all Kubeflow on AWS tutorials.
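The namespace steps above can also be scripted instead of editing the file by hand. The following is a minimal sketch under stated assumptions: write_profile and ensure_profile are hypothetical helper names, the manifest is the same one shown in this section, and the namespace is assumed to be created by Kubeflow shortly after the profile is applied.

```shell
# Sketch: make sure the tutorial namespace exists, creating a Profile if not.
# write_profile and ensure_profile are hypothetical helper names.

# Write a Kubeflow Profile manifest to the given file.
write_profile() {
    profile_name="$1"
    owner_email="$2"
    out_file="$3"
    cat > "$out_file" <<EOF
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: ${profile_name}
spec:
  owner:
    kind: User
    name: ${owner_email}
EOF
}

# Apply the profile only when the namespace is missing, then export NAMESPACE.
ensure_profile() {
    profile_name="$1"
    owner_email="$2"
    if ! kubectl get namespace "$profile_name" >/dev/null 2>&1; then
        write_profile "$profile_name" "$owner_email" profile.yaml || return 1
        kubectl apply -f profile.yaml || return 1
    fi
    export NAMESPACE="$profile_name"
}
```

For example, ensure_profile kubeflow-user-example-com user@example.com leaves an existing namespace untouched and otherwise writes and applies profile.yaml in the current directory. Replace the email with your own.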
Next steps
Now that you have finished the Kubeflow on AWS setup, you can continue with the training and inference tutorials.
To learn about training and inference with Deep Learning Containers on Kubeflow on AWS, see the Training with Kubeflow on AWS or Run inference services on PyTorch or TensorFlow model guides.
Cleanup
This section provides cleanup instructions after you have finished running your tutorials.
Clean jobs
You can delete a specific training job when you are done running an example. To list the jobs of a specific type (PyTorchJob, MPIJob, TFJob) running in a given namespace, run the following command.
kubectl get <job_type> -n ${NAMESPACE}
Retrieve the name of the job you want to delete, then run the following command.
kubectl delete <job_type> <job_name> -n ${NAMESPACE}
Your output should look similar to the following.
<job_type>.kubeflow.org "<job_name>" deleted
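If you clean up jobs often, the list and delete commands above can be wrapped in small functions. The following is a minimal sketch: is_job_type, list_jobs, and delete_job are hypothetical helper names, and the lowercase resource names pytorchjob, mpijob, and tfjob are an assumption about how the training job resource types are registered in your cluster.

```shell
# Sketch: list and delete Kubeflow training jobs in ${NAMESPACE}.
# is_job_type, list_jobs and delete_job are hypothetical helper names.

# Accept only the three job types named in this section.
# Assumption: the cluster exposes them under these lowercase names.
is_job_type() {
    case "$1" in
        pytorchjob|mpijob|tfjob) return 0 ;;
        *) return 1 ;;
    esac
}

# List all jobs of one type in the tutorial namespace.
list_jobs() {
    job_type="$1"
    is_job_type "$job_type" || { echo "unknown job type: $job_type" >&2; return 2; }
    kubectl get "$job_type" -n "${NAMESPACE}"
}

# Delete a single job by type and name in the tutorial namespace.
delete_job() {
    job_type="$1"
    job_name="$2"
    is_job_type "$job_type" || { echo "unknown job type: $job_type" >&2; return 2; }
    kubectl delete "$job_type" "$job_name" -n "${NAMESPACE}"
}
```

Rejecting unknown types before calling kubectl is a design choice that guards against deleting an unrelated resource through a typo. Both functions rely on the NAMESPACE variable exported earlier.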
Uninstall Kubeflow on AWS
The Kubeflow on AWS documentation provides uninstall commands.
Make sure that you run the command that corresponds to your deployment method: Kustomize or Helm.
Delete an Amazon EKS cluster
The Kubeflow on AWS documentation provides a single command to delete your entire Amazon EKS cluster.