CPU training - AWS Deep Learning Containers




PyTorch CPU training

This tutorial walks you through training a PyTorch model on a single-node CPU pod.

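  1. Create a pod file for your cluster. A pod file tells the cluster what to run. This pod file downloads the PyTorch examples repository and runs the MNIST example. Open vi or vim, then copy and paste the following content. Save the file as pytorch.yaml.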

    apiVersion: v1
    kind: Pod
    metadata:
      name: pytorch-training
    spec:
      restartPolicy: OnFailure
      containers:
      - name: pytorch-training
        image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04
        command:
          - "/bin/sh"
          - "-c"
        args:
        - "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py --no-cuda"
        env:
        - name: OMP_NUM_THREADS
          value: "36"
        - name: KMP_AFFINITY
          value: "granularity=fine,verbose,compact,1,0"
        - name: KMP_BLOCKTIME
          value: "1"
  2. Assign the pod file to the cluster using kubectl.

    $ kubectl create -f pytorch.yaml
  3. You should see the following output:

    pod/pytorch-training created
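  4. Check the status. The job name "pytorch-training" comes from the pytorch.yaml file and now appears in the status list. Any other tests that you are running, or have run previously, also appear in this list. Run the following command repeatedly until the status changes to "Running".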

    $ kubectl get pods


    NAME               READY   STATUS    RESTARTS   AGE
    pytorch-training   0/1     Running   8          19m
  5. Check the logs to see the training output.

    $ kubectl logs pytorch-training


    Cloning into 'examples'...
    Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
    9920512it [00:00, 40133996.38it/s]
    Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
    Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
    Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
    32768it [00:00, 831315.84it/s]
    Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
    1654784it [00:00, 13019129.43it/s]
    Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
    Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
    8192it [00:00, 337197.38it/s]
    Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
    Processing...
    Done!
    Train Epoch: 1 [0/60000 (0%)]    Loss: 2.300039
    Train Epoch: 1 [640/60000 (1%)]  Loss: 2.213470
    Train Epoch: 1 [1280/60000 (2%)] Loss: 2.170460
    Train Epoch: 1 [1920/60000 (3%)] Loss: 2.076699
    Train Epoch: 1 [2560/60000 (4%)] Loss: 1.868078
    Train Epoch: 1 [3200/60000 (5%)] Loss: 1.414199
    Train Epoch: 1 [3840/60000 (6%)] Loss: 1.000870
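  6. Check the logs to follow the training progress. You can also keep running "kubectl get pods" to refresh the status. When the status changes to "Completed", the training job is done.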
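
Running "kubectl get pods" by hand until the status changes can be automated. The following is a minimal sketch, not part of the original tutorial: pod_status and wait_for_pod are hypothetical helper names, and the script assumes that kubectl is already configured for your cluster and that the STATUS value is the third column of the default table output, as in the sample above.

```shell
#!/bin/sh
# Sketch only: pod_status and wait_for_pod are hypothetical helpers.

# Read `kubectl get pods` table output on stdin and print the STATUS
# column (third field) of the row whose NAME equals $1.
pod_status() {
  awk -v name="$1" '$1 == name { print $3 }'
}

# Poll until pod $1 reports "Completed".
#   $2: seconds between polls (default 30)
#   $3: maximum number of polls (default 120)
# Returns 0 on completion, 1 if the pod is not listed or polls run out.
wait_for_pod() {
  pod="$1"
  interval="${2:-30}"
  tries="${3:-120}"
  while [ "$tries" -gt 0 ]; do
    status=$(kubectl get pods | pod_status "$pod")
    echo "$pod: ${status:-not found}"
    [ "$status" = "Completed" ] && return 0
    [ -z "$status" ] && return 1
    tries=$((tries - 1))
    sleep "$interval"
  done
  return 1
}

# Usage (requires a configured kubectl):
#   wait_for_pod pytorch-training
```

Because the pod uses restartPolicy: OnFailure, a failed container is restarted rather than left in a terminal state, so the sketch keeps polling through intermediate statuses and stops only on "Completed", a missing pod, or the poll limit.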

TensorFlow CPU training

This tutorial walks you through training a TensorFlow model on a single-node CPU cluster.

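  1. Create a pod file for your cluster. A pod file tells the cluster what to run. This pod file downloads Keras and runs a Keras example that uses the TensorFlow framework. Open vi or vim, then copy and paste the following content. Save the file as tf.yaml. You can use it with either TensorFlow or TensorFlow 2. To use it with TensorFlow 2, change the Docker image to a TensorFlow 2 image.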

    apiVersion: v1
    kind: Pod
    metadata:
      name: tensorflow-training
    spec:
      restartPolicy: OnFailure
      containers:
      - name: tensorflow-training
        image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.15.2-cpu-py36-ubuntu18.04
        command: ["/bin/sh","-c"]
        args: ["git clone https://github.com/fchollet/keras.git && python /keras/examples/mnist_cnn.py"]
  2. Assign the pod file to the cluster using kubectl.

    $ kubectl create -f tf.yaml
  3. You should see the following output:

    pod/tensorflow-training created
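  4. Check the status. The job name "tensorflow-training" comes from the tf.yaml file and now appears in the status list. Any other tests that you are running, or have run previously, also appear in this list. Run the following command repeatedly until the status changes to "Running".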

    $ kubectl get pods


    NAME                  READY   STATUS    RESTARTS   AGE
    tensorflow-training   0/1     Running   8          19m
  5. Check the logs to see the training output.

    $ kubectl logs tensorflow-training


    Cloning into 'keras'...
    Using TensorFlow backend.
    Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
    8192/11490434 [..............................] - ETA: 0s
    6479872/11490434 [===============>..............] - ETA: 0s
    8740864/11490434 [=====================>........] - ETA: 0s
    11493376/11490434 [==============================] - 0s 0us/step
    x_train shape: (60000, 28, 28, 1)
    60000 train samples
    10000 test samples
    Train on 60000 samples, validate on 10000 samples
    Epoch 1/12
    2019-03-19 01:52:33.863598: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
    2019-03-19 01:52:33.867616: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
    128/60000 [..............................] - ETA: 10:43 - loss: 2.3076 - acc: 0.0625
    256/60000 [..............................] - ETA: 5:59 - loss: 2.2528 - acc: 0.1445
    384/60000 [..............................] - ETA: 4:24 - loss: 2.2183 - acc: 0.1875
    512/60000 [..............................] - ETA: 3:35 - loss: 2.1652 - acc: 0.1953
    640/60000 [..............................] - ETA: 3:05 - loss: 2.1078 - acc: 0.2422
    ...
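  6. Check the logs to follow the training progress. You can also keep running "kubectl get pods" to refresh the status. When the status changes to "Completed", the training job is done.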
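
The Keras progress output is long, so it can help to pull out only the most recent loss and accuracy when checking on a run. The following is a small sketch under stated assumptions: latest_metrics is a hypothetical helper name, the pattern matches the "loss: ... - acc: ..." format shown in the sample log above (other Keras versions may label the accuracy metric differently), and it relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Sketch only: latest_metrics is a hypothetical helper.

# Read training log text on stdin and print the last
# "loss: X - acc: Y" pair that appears in it.
# Prints nothing if no such pair is found.
latest_metrics() {
  grep -o 'loss: [0-9.]* - acc: [0-9.]*' | tail -n 1
}

# Usage (requires a configured kubectl):
#   kubectl logs tensorflow-training | latest_metrics
```

Reading from stdin keeps the helper independent of kubectl, so it can also be run against a saved log file.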


