CPU Training - AWS Deep Learning Containers

CPU Training

This section shows how to train a model on CPU instances by using Kubeflow training operators and Deep Learning Containers.

For a complete list of Deep Learning Containers, see Deep Learning Containers Images. For tips about configuration settings when using the Intel Math Kernel Library (MKL), see AWS Deep Learning Containers Intel Math Kernel Library (MKL) Recommendations.

PyTorch CPU training

Your deployment of Kubeflow on AWS comes with PyTorchJob. This is the Kubeflow implementation of a Kubernetes custom resource that is used to run distributed PyTorch training jobs on Kubernetes.

This tutorial guides you through training a classification model on MNIST with PyTorch on a single-node CPU instance. The training runs in a container from Deep Learning Containers that is managed by Kubeflow on AWS.
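
The job configuration used in the steps below is an ordinary Kubernetes manifest, and kubectl accepts JSON as well as YAML, so the same configuration can be generated from a script. The following is a minimal sketch, assuming you want to vary the thread count or epoch count without editing the file by hand. The function build_pytorchjob and its parameters are hypothetical; the field values are copied from the pytorch.yaml in this tutorial.

```python
import json

# Image URI copied from the pytorch.yaml in this tutorial.
IMAGE = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "pytorch-training:2.0.0-cpu-py310-ubuntu20.04-ec2"
)


def build_pytorchjob(name="pytorch-training", omp_num_threads=36, epochs=1):
    """Return the tutorial's PyTorchJob manifest as a plain dict.

    Hypothetical helper: not part of Kubeflow or Deep Learning Containers.
    """
    train_cmd = (
        "git clone https://github.com/pytorch/examples.git && "
        f"python examples/mnist/main.py --no-cuda --epochs={epochs}"
    )
    container = {
        "name": "pytorch",
        "imagePullPolicy": "Always",
        "image": IMAGE,
        "command": ["/bin/sh", "-c"],
        "args": [train_cmd],
        "env": [
            # Kubernetes requires environment variable values to be strings.
            {"name": "OMP_NUM_THREADS", "value": str(omp_num_threads)},
            {"name": "KMP_AFFINITY", "value": "granularity=fine,verbose,compact,1,0"},
            {"name": "KMP_BLOCKTIME", "value": "1"},
        ],
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Worker": {
                    "restartPolicy": "OnFailure",
                    "template": {
                        "metadata": {
                            "annotations": {"sidecar.istio.io/inject": "false"}
                        },
                        "spec": {"containers": [container]},
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Print JSON that kubectl create -f can read in place of pytorch.yaml.
    print(json.dumps(build_pytorchjob(), indent=2))
```

Redirecting the output to a file (for example pytorch.json, a name chosen here for illustration) gives a manifest that can be passed to kubectl create -f in the same way as pytorch.yaml.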

  1. To create a PyTorchJob, follow these instructions.

    1. Create the job configuration file.

      Open vi or vim, then copy and paste the following content. Save this file as pytorch.yaml.

      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      metadata:
        name: pytorch-training
      spec:
        pytorchReplicaSpecs:
          Worker:
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    imagePullPolicy: Always
                    image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-cpu-py310-ubuntu20.04-ec2
                    command:
                      - "/bin/sh"
                      - "-c"
                    args:
                      - "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py --no-cuda --epochs=1"
                    env:
                      - name: OMP_NUM_THREADS
                        value: "36"
                      - name: KMP_AFFINITY
                        value: "granularity=fine,verbose,compact,1,0"
                      - name: KMP_BLOCKTIME
                        value: "1"
    2. Deploy the PyTorchJob configuration file using kubectl to start training.

      kubectl create -f pytorch.yaml -n ${NAMESPACE}

      The job creates a pod that runs the Deep Learning Containers image. The image is set in the spec.containers.image field of the YAML file above, under the container name pytorch.

    3. You should see the following output.

      pytorchjob.kubeflow.org/pytorch-training created
    4. Check the status.

      The name of the job pytorch-training appears in the status. It might take some time for the job to reach a Running state. Run the following watch command to monitor the state of the job.

      kubectl get pods -n ${NAMESPACE} -w

      You should see the following output.

      NAME               READY   STATUS    RESTARTS   AGE
      pytorch-training   0/1     Running   8          19m
  2. Monitor your PyTorchJob.

    1. Check the logs to watch the training progress.

      kubectl logs pytorch-training-worker-0 -n ${NAMESPACE}

      You should see something similar to the following output.

      Cloning into 'examples'...
      Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
      9920512it [00:00, 40133996.38it/s]
      Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
      Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
      Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
      32768it [00:00, 831315.84it/s]
      Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
      1654784it [00:00, 13019129.43it/s]
      Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
      Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
      8192it [00:00, 337197.38it/s]
      Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
      Processing...
      Done!
      Train Epoch: 1 [0/60000 (0%)] Loss: 2.300039
      Train Epoch: 1 [640/60000 (1%)] Loss: 2.213470
      Train Epoch: 1 [1280/60000 (2%)] Loss: 2.170460
      Train Epoch: 1 [1920/60000 (3%)] Loss: 2.076699
      Train Epoch: 1 [2560/60000 (4%)] Loss: 1.868078
      Train Epoch: 1 [3200/60000 (5%)] Loss: 1.414199
      Train Epoch: 1 [3840/60000 (6%)] Loss: 1.000870
    2. Monitor the job state.

      Run the following command to refresh the job state. When the status changes to Succeeded, the training job is done.

      watch -n 5 kubectl get pytorchjobs pytorch-training -n ${NAMESPACE}
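
The training log shown in the monitoring step reports one loss value per line, so progress can also be summarized programmatically. The following is a minimal sketch, assuming the log lines keep the "Train Epoch: ... Loss: ..." shape shown above. The names LOSS_LINE and parse_losses are hypothetical and are not part of the PyTorch example script.

```python
import re

# Matches lines such as: Train Epoch: 1 [640/60000 (1%)] Loss: 2.213470
# \s+ before "Loss" tolerates either a space or a tab as the separator.
LOSS_LINE = re.compile(
    r"Train Epoch: (?P<epoch>\d+) "
    r"\[(?P<seen>\d+)/(?P<total>\d+) \(\d+%\)\]\s+"
    r"Loss: (?P<loss>\d+\.\d+)"
)


def parse_losses(log_text):
    """Return (epoch, samples_seen, loss) tuples found in the log text."""
    return [
        (int(m["epoch"]), int(m["seen"]), float(m["loss"]))
        for m in LOSS_LINE.finditer(log_text)
    ]
```

To use it, save the output of the kubectl logs command from the monitoring step to a file, read the file into a string, and pass that string to parse_losses. A steadily decreasing loss column indicates that training is progressing.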

See Cleanup for information on cleaning up a cluster after you are done using it.

TensorFlow CPU training

Your deployment of Kubeflow on AWS comes with TFJob. This is the Kubeflow implementation of a Kubernetes custom resource that is used to run distributed TensorFlow training jobs on Kubernetes.

This tutorial guides you through training a classification model on MNIST with Keras on a single-node CPU instance. The training runs in a container from Deep Learning Containers that is managed by Kubeflow on AWS.

  1. Create a TFJob.

    1. Create the job configuration file.

      Open vi or vim, then copy and paste the following content. Save this file as tf.yaml.

      apiVersion: kubeflow.org/v1
      kind: TFJob
      metadata:
        name: tensorflow-training
      spec:
        tfReplicaSpecs:
          Worker:
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.12.0-cpu-py310-ubuntu20.04-ec2
                    command: ["/bin/sh","-c"]
                    args: ["git clone https://github.com/keras-team/keras-io.git && python keras-io/examples/vision/mnist_convnet.py"]
    2. To start training, deploy the TFJob configuration file using kubectl.

      kubectl create -f tf.yaml -n ${NAMESPACE}

      The job creates a pod that runs the Deep Learning Containers image. The image is set in the spec.containers.image field of the YAML file above, under the container name tensorflow.

    3. You should see the following output.

      tfjob.kubeflow.org/tensorflow-training created
    4. Check the status.

      The name of the job tensorflow-training appears in the status. It might take some time for the job to reach a Running state. Run the following watch command to monitor the state of the job.

      kubectl get pods -n ${NAMESPACE} -w

      You should see the following output.

      NAME                  READY   STATUS    RESTARTS   AGE
      tensorflow-training   0/1     Running   8          19m
  2. Monitor your TFJob.

    1. Check the logs to watch the training progress.

      kubectl logs tensorflow-training-worker-0 -n ${NAMESPACE}

      You should see something similar to the following output.

      Cloning into 'keras'...
      Using TensorFlow backend.
      Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
      8192/11490434 [..............................] - ETA: 0s
      6479872/11490434 [===============>..............] - ETA: 0s
      8740864/11490434 [=====================>........] - ETA: 0s
      11493376/11490434 [==============================] - 0s 0us/step
      x_train shape: (60000, 28, 28, 1)
      60000 train samples
      10000 test samples
      Train on 60000 samples, validate on 10000 samples
      Epoch 1/12
      2019-03-19 01:52:33.863598: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
      2019-03-19 01:52:33.867616: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
      128/60000 [..............................] - ETA: 10:43 - loss: 2.3076 - acc: 0.0625
      256/60000 [..............................] - ETA: 5:59 - loss: 2.2528 - acc: 0.1445
      384/60000 [..............................] - ETA: 4:24 - loss: 2.2183 - acc: 0.1875
      512/60000 [..............................] - ETA: 3:35 - loss: 2.1652 - acc: 0.1953
      640/60000 [..............................] - ETA: 3:05 - loss: 2.1078 - acc: 0.2422
      ...
    2. Monitor the job state.

      Run the following command to refresh the job state. When the status changes to Succeeded, the training job is done.

      watch -n 5 kubectl get tfjobs tensorflow-training -n ${NAMESPACE}
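
As an alternative to watching the job state by eye, a script can poll the job and stop when it finishes. The following is a minimal sketch, assuming the job reports its state as a list under status.conditions, where each entry has a type (such as Created, Running, Succeeded, or Failed) and a status of "True" or "False". Verify this against the output of kubectl get with -o json on your cluster. The helper names current_state, fetch_job, and wait_until_done are hypothetical.

```python
import json
import subprocess
import time

TERMINAL_STATES = {"Succeeded", "Failed"}


def current_state(job):
    """Return the type of the last condition whose status is "True"."""
    conditions = job.get("status", {}).get("conditions", [])
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else "Unknown"


def fetch_job(kind, name, namespace):
    """Read the job object as JSON with kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", kind, name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def wait_until_done(kind, name, namespace, interval=5):
    """Poll every `interval` seconds until the job succeeds or fails."""
    while True:
        state = current_state(fetch_job(kind, name, namespace))
        print(f"{kind}/{name}: {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
```

For the job in this section, the call would be wait_until_done("tfjobs", "tensorflow-training", namespace), where namespace is the value you use for ${NAMESPACE}. Passing "pytorchjobs" and "pytorch-training" should work the same way for the PyTorch job, under the same assumption about status.conditions.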

See Cleanup for information on cleaning up a cluster after you are done using it.

Next steps

To learn about CPU-based inference on Kubeflow on AWS using PyTorch or TensorFlow with Deep Learning Containers, see Inference.