Train and deploy a custom GPU-supported ML model on Amazon SageMaker

Ankur Shukla, Amazon Web Services

Summary

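Training and deploying a graphics processing unit (GPU)-supported machine learning (ML) model requires an initial setup and the initialization of certain environment variables to fully unlock the benefits of NVIDIA GPUs. However, it can be time-consuming to set up the environment and make it compatible with the Amazon SageMaker architecture on the Amazon Web Services (AWS) Cloud.

This pattern helps you train and build a custom GPU-supported ML model by using Amazon SageMaker. It provides steps to train and deploy a custom CatBoost model built on an open-source Amazon reviews dataset. You can then benchmark its performance on a p3.16xlarge Amazon Elastic Compute Cloud (Amazon EC2) instance.

This pattern is useful if your organization wants to deploy existing GPU-supported ML models on SageMaker. Your data scientists can follow the steps in this pattern to create NVIDIA GPU-supported containers and deploy ML models on those containers.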

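Prerequisites and limitations

Prerequisites

An active AWS account.
An Amazon Simple Storage Service (Amazon S3) source bucket to store the model artifacts and predictions.
An understanding of SageMaker notebook instances and Jupyter notebooks.
An understanding of how to create an AWS Identity and Access Management (IAM) role with basic SageMaker role permissions, S3 bucket access and update permissions, and additional permissions for Amazon Elastic Container Registry (Amazon ECR).

Limitations

This pattern is intended for supervised ML workloads whose training and deployment code is written in Python.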

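Architecture

Workflow to train and build a custom GPU-supported ML model by using SageMaker.

Technology stack

SageMaker
Amazon ECR

Tools

Amazon ECR – Amazon Elastic Container Registry (Amazon ECR) is an AWS managed container image registry service that is secure, scalable, and reliable.
Amazon SageMaker – SageMaker is a fully managed ML service.
Docker – Docker is a software platform for building, testing, and deploying applications quickly.
Python – Python is a programming language.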

Code

The code for this pattern is available on GitHub in the Implementing a review classification model with Catboost and SageMaker repository.

Epics

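Task	Description	Skills required
Create an IAM role and attach the required policies.	Sign in to the AWS Management Console, open the IAM console, and create a new IAM role. Attach the following AWS managed policies to the role: `AmazonEC2ContainerRegistryFullAccess` `AmazonS3FullAccess` `AmazonSageMakerFullAccess` For more information, see Create a notebook instance in the Amazon SageMaker documentation.	Data scientist
Create a SageMaker notebook instance.	Open the SageMaker console, choose Notebook instances, and then choose Create notebook instance. For IAM role, choose the IAM role that you created earlier. Configure the notebook instance according to your requirements, and then choose Create notebook instance. For detailed steps and instructions, see Create a notebook instance in the Amazon SageMaker documentation.	Data scientist
Clone the repository.	Open a terminal in the SageMaker notebook instance and run the following command to clone the GitHub Implementing a review classification model with Catboost and SageMaker repository: `git clone https://github.com/aws-samples/review-classification-using-catboost-sagemaker.git`	Data scientist
Start the Jupyter notebook.	Start the `Review classification model with Catboost and SageMaker.ipynb` Jupyter notebook, which contains the predefined steps.	Data scientist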
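
As a rough illustration of the first task, the following sketch builds the trust policy that lets SageMaker assume the role and lists the ARNs of the three policies named in the table. The helper names are hypothetical, the ARNs assume the standard `aws` partition, and the role itself is not created here; supplying these values to the IAM console or to an SDK call is left to you.

```python
import json

# Policy names taken from the task table above.
MANAGED_POLICIES = [
    "AmazonEC2ContainerRegistryFullAccess",
    "AmazonS3FullAccess",
    "AmazonSageMakerFullAccess",
]


def sagemaker_trust_policy():
    """Return a trust policy that allows SageMaker to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def managed_policy_arns(names=MANAGED_POLICIES):
    """Build the ARN of each AWS managed policy from its name (aws partition assumed)."""
    return ["arn:aws:iam::aws:policy/" + name for name in names]


if __name__ == "__main__":
    print(json.dumps(sagemaker_trust_policy(), indent=2))
    for arn in managed_policy_arns():
        print(arn)
```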

Task	Description	Skills required
Run commands in the Jupyter notebook.	Open the Jupyter notebook and run the commands from the following stories to prepare the data for training your ML model.	Data scientist
Read the data from the S3 bucket.	`import pandas as pd import csv fname = 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz' df = pd.read_csv(fname, sep='\t', error_bad_lines=False)` Note: In pandas 1.3 and later, replace `error_bad_lines=False` with `on_bad_lines='skip'`.	Data scientist
Preprocess the data.	`import numpy as np def pre_process(df): df.fillna(value={'review_body': '', 'review_headline': ''}, inplace=True) df.fillna(value={'verified_purchase': 'Unk'}, inplace=True) df.fillna(0, inplace=True) return df df = pre_process(df) df.review_date = pd.to_datetime(df.review_date) df['target'] = np.where(df['star_rating']>=4,1,0)` Note: This code replaces null values in `'review_body'` and `'review_headline'` with an empty string, and replaces null values in the `'verified_purchase'` column with `'Unk'`, which means "unknown."	Data scientist
Split the data into training, validation, and test datasets.	To keep the distribution of the target label the same across the split sets, you must stratify the sampling by using the scikit-learn library. `from sklearn.model_selection import StratifiedShuffleSplit sss = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=0) sss.get_n_splits(df, df['target']) for train_index, test_index in sss.split(df, df['target']): X_train_valid, X_test = df.iloc[train_index], df.iloc[test_index] sss.get_n_splits(X_train_valid, X_train_valid['target']) for train_index, test_index in sss.split(X_train_valid, X_train_valid['target']): X_train, X_valid = X_train_valid.iloc[train_index], X_train_valid.iloc[test_index]`	Data scientist
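
The idea behind the stratified split can be illustrated without scikit-learn. The following is a minimal sketch, not the notebook's code: `stratified_split` is a hypothetical helper that shuffles the row indices of each class separately and then holds out the same fraction from every class, which is the property that `StratifiedShuffleSplit` provides. Applying it twice with a 10 percent holdout mirrors the train, validation, and test split in the table, here on toy labels.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction, seed=0):
    """Split row indices into (kept, held_out) while preserving label ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    kept, held_out = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        held_out.extend(indices[:n_holdout])
        kept.extend(indices[n_holdout:])
    return sorted(kept), sorted(held_out)


# Toy labels: 70% positive (star_rating >= 4) and 30% negative.
labels = [1] * 700 + [0] * 300

# First split: hold out 10% of all rows as the test set.
train_valid_idx, test_idx = stratified_split(labels, 0.10)

# Second split: hold out 10% of the remaining rows as the validation set.
remaining_labels = [labels[i] for i in train_valid_idx]
train_positions, valid_positions = stratified_split(remaining_labels, 0.10)
train_idx = [train_valid_idx[i] for i in train_positions]
valid_idx = [train_valid_idx[i] for i in valid_positions]


def positive_rate(indices):
    """Fraction of rows in the subset whose label is 1."""
    return sum(labels[i] for i in indices) / len(indices)


print(len(train_idx), len(valid_idx), len(test_idx))
print(positive_rate(train_idx), positive_rate(valid_idx), positive_rate(test_idx))
```

Each of the three subsets keeps the 70 percent positive rate of the full toy dataset, which is what keeps evaluation metrics comparable across the sets.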

Task	Description	Skills required
Prepare and push the Docker image.	In the Jupyter notebook, run the commands from the following stories to prepare the Docker image and push it to Amazon ECR.	ML engineer
Create a repository in Amazon ECR.	`%%sh algorithm_name=sagemaker-catboost-github-gpu-img chmod +x code/train chmod +x code/serve account=$(aws sts get-caller-identity --query Account --output text) # Get the region defined in the current configuration (default to us-east-1 if none defined) region=$(aws configure get region) region=${region:-us-east-1} fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest" aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null`	ML engineer
Build the Docker image locally.	`docker build -t "${algorithm_name}" . docker tag ${algorithm_name} ${fullname}`	ML engineer
Run the Docker image and push it to Amazon ECR.	`docker push ${fullname}`	ML engineer
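
The `fullname` variable in the shell commands follows the Amazon ECR image naming scheme. The following sketch reproduces that string construction in Python, including the fallback to `us-east-1` when no Region is configured. The function name `ecr_image_uri` is hypothetical, and the account ID is a placeholder.

```python
def ecr_image_uri(account, region, algorithm_name, tag="latest"):
    """Build the full Amazon ECR image name used for docker tag and docker push."""
    # Mirror the shell fallback region=${region:-us-east-1}.
    region = region or "us-east-1"
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, algorithm_name, tag)


# 123456789012 is a placeholder account ID, not a real account.
print(ecr_image_uri("123456789012", "us-east-1", "sagemaker-catboost-github-gpu-img"))
```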

Task	Description	Skills required
Create a SageMaker hyperparameter tuning job.	In the Jupyter notebook, run the commands from the following stories to create a SageMaker hyperparameter tuning job by using your Docker image.	Data scientist
Create a SageMaker estimator.	Create a SageMaker estimator by using the name of the Docker image. `import sagemaker as sage from time import gmtime, strftime sess = sage.Session() from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner account = sess.boto_session.client('sts').get_caller_identity()['Account'] region = sess.boto_session.region_name image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-github-gpu-img:latest'.format(account, region) tree_hpo = sage.estimator.Estimator(image, role, 1, 'ml.p3.16xlarge', train_volume_size = 100, output_path="s3://{}/sagemaker/DEMO-GPU-Catboost/output".format(bucket), sagemaker_session=sess)` Note: `role` and `bucket` must already be defined in the notebook session. This call uses the SageMaker Python SDK version 1 argument names; in version 2, the corresponding keyword arguments are `image_uri`, `instance_count`, `instance_type`, and `volume_size`.	Data scientist
Create an HPO job.	Create a hyperparameter optimization (HPO) tuning job with parameter ranges, and pass the training and validation sets as parameters to the function. `hyperparameter_ranges = {'iterations': IntegerParameter(80000, 130000), 'max_depth': IntegerParameter(6, 10), 'max_ctr_complexity': IntegerParameter(4, 10), 'learning_rate': ContinuousParameter(0.01, 0.5)} objective_metric_name = 'auc' metric_definitions = [{'Name': 'auc', 'Regex': 'auc: ([0-9\\.]+)'}] tuner = HyperparameterTuner(tree_hpo, objective_metric_name, hyperparameter_ranges, metric_definitions, objective_type='Maximize', max_jobs=50, max_parallel_jobs=2)`	Data scientist
Run the HPO job.	`train_location = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/train/' valid_location = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/valid/' tuner.fit({'train': train_location, 'validation': valid_location })`	Data scientist
Get the best performing training job.	`import sagemaker as sage from time import gmtime, strftime sess = sage.Session() best_job = tuner.best_training_job()`	Data scientist
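
The tuner finds the objective metric by applying the `Regex` from `metric_definitions` to the training job's log output. The following sketch applies the same pattern to a few log lines to show what gets captured. The log lines and the `extract_auc` helper are hypothetical; the actual lines depend on what the training script prints.

```python
import re

# Same pattern as the 'Regex' entry in metric_definitions.
AUC_PATTERN = re.compile('auc: ([0-9\\.]+)')


def extract_auc(log_lines):
    """Return every AUC value found in the given log lines, in order."""
    values = []
    for line in log_lines:
        match = AUC_PATTERN.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


# Hypothetical log lines; the real training script decides what is printed.
sample_log = [
    "loading training data",
    "validation auc: 0.9312",
    "done",
]
print(extract_auc(sample_log))
```

If the training script prints the metric in a different form, for example `AUC=0.93`, the pattern captures nothing and the tuning job has no objective value to optimize, so the printed format and the regex need to agree.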

Task	Description	Skills required
Create a SageMaker batch transform job on test data for model prediction.	In the Jupyter notebook, run the commands from the following stories to create a model from your SageMaker hyperparameter tuning job and submit a SageMaker batch transform job on the test data for model prediction.	Data scientist
Create the SageMaker model.	Create a model in SageMaker by using the best training job. `attached_estimator = sage.estimator.Estimator.attach(best_job) output_path ='s3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/test-predictions/' input_path ='s3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/test/' transformer = attached_estimator.transformer(instance_count=1, instance_type='ml.p3.16xlarge', assemble_with='Line', accept='text/csv', max_payload=1, output_path=output_path, env = {'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '3600' })`	Data scientist
Create the batch transform job.	Create a batch transform job on the test dataset. `transformer.transform(input_path, content_type='text/csv', split_type='Line')`	Data scientist

Task	Description	Skills required
Read the results and evaluate the model's performance.	In the Jupyter notebook, run the commands from the following stories to read the results and evaluate the model's performance on the area under the ROC curve (ROC-AUC) and area under the precision-recall curve (PR-AUC) model metrics. For more information, see Amazon Machine Learning key concepts in the Amazon Machine Learning (Amazon ML) documentation.	Data scientist
Read the batch transform job results.	Read the batch transform job results into a data frame. `file_name = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/test-predictions/file_1.out' results = pd.read_csv(file_name, names=['review_id','target','score'], sep='\t', escapechar='\\', quoting=csv.QUOTE_NONE, lineterminator='\n', quotechar='"').dropna()`	Data scientist
Evaluate the performance metrics.	Evaluate the model's performance on ROC-AUC and PR-AUC. `from sklearn import metrics import matplotlib import pandas as pd matplotlib.use('agg', force=True) from matplotlib import pyplot as plt %matplotlib inline def analyze_results(labels, predictions): precision, recall, thresholds = metrics.precision_recall_curve(labels, predictions) auc = metrics.auc(recall, precision) fpr, tpr, _ = metrics.roc_curve(labels, predictions) roc_auc_score = metrics.roc_auc_score(labels, predictions) print('CatBoost: ROC auc=%.3f' % (roc_auc_score)) plt.plot(fpr, tpr, label="data 1, auc=" + str(roc_auc_score)) plt.xlabel('1-Specificity') plt.ylabel('Sensitivity') plt.legend(loc=4) plt.show() print('CatBoost: PR auc=%.3f' % (auc)) no_skill = len(labels[labels==1.0]) / len(labels) plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill') plt.plot(recall, precision, marker='.', label='CatBoost') plt.xlabel('Recall') plt.ylabel('Precision') plt.legend() plt.show() return auc analyze_results(results['target'].values, results['score'].values)`	Data scientist
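
To make the two quantities in the evaluation step concrete, the following sketch computes ROC-AUC from its pairwise definition and the "No Skill" baseline that the precision-recall plot draws. It is an illustration on toy data under stated assumptions, not the notebook's code: `roc_auc` and `no_skill_precision` are hypothetical helpers, and the pairwise loop is quadratic, so scikit-learn remains the practical choice for real result files.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive score and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def no_skill_precision(labels):
    """Precision of a classifier that predicts positive for every row."""
    return sum(1 for y in labels if y == 1) / len(labels)


# Toy data: two negative and two positive reviews with model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))       # 0.75
print(no_skill_precision(labels))    # 0.5
```

Three of the four positive-negative pairs are ordered correctly, which gives 0.75. A PR curve that stays above the no-skill line indicates that the scores carry information beyond the class balance.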

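Additional information

The following list shows the different elements of the Dockerfile that is run in the Build, run, and push the Docker image to Amazon ECR epic.

Install Python with aws-cli.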


FROM amazonlinux:1

RUN yum update -y && yum install -y python36 python36-devel python36-libs python36-tools python36-pip && \
    yum install gcc tar make wget util-linux kmod man sudo git -y && \
    yum install aws-cli -y && \
    yum install nginx -y && \
    yum install gcc-c++.noarch -y && yum clean all

Install Python packages


RUN pip-3.6 install --no-cache-dir --upgrade pip && \
    pip3 install --no-cache-dir --upgrade setuptools && \
    pip3 install Cython && \
    pip3 install --no-cache-dir numpy==1.16.0 scipy==1.4.1 scikit-learn==0.20.3 pandas==0.24.2 \
    flask gevent gunicorn boto3 s3fs matplotlib joblib catboost==0.20.2

Install CUDA and CuDNN


RUN wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run \
    && chmod u+x cuda_9.0.176_384.81_linux-run \
    && ./cuda_9.0.176_384.81_linux-run --tmpdir=/data --silent --toolkit --override \
    && wget https://custom-gpu-sagemaker-image.s3.amazonaws.com/installation/cudnn-9.0-linux-x64-v7.tgz \
    && tar -xvzf cudnn-9.0-linux-x64-v7.tgz \
    && cp /data/cuda/include/cudnn.h /usr/local/cuda/include \
    && cp /data/cuda/lib64/libcudnn* /usr/local/cuda/lib64 \
    && chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn* \
    && rm -rf /data/*

Create the directory structure required by SageMaker


RUN mkdir /opt/ml /opt/ml/input /opt/ml/input/config /opt/ml/input/data /opt/ml/input/data/training /opt/ml/model /opt/ml/output /opt/program
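
These directories are where SageMaker exchanges files with the container: hyperparameters arrive in `/opt/ml/input/config/hyperparameters.json`, each input channel is mounted under `/opt/ml/input/data/`, and anything written to `/opt/ml/model` is saved as the model artifact. The following sketch shows how a training script might read that layout. The helper names are hypothetical and the repository's `code/train` script may be organized differently; the demo uses a temporary directory in place of `/opt/ml` so that it runs anywhere.

```python
import json
import os
import tempfile


def load_hyperparameters(prefix="/opt/ml"):
    """Read the hyperparameters file that SageMaker writes for a training job."""
    path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    with open(path) as handle:
        return json.load(handle)


def channel_dir(channel, prefix="/opt/ml"):
    """Directory where the data for one input channel is made available."""
    return os.path.join(prefix, "input", "data", channel)


def model_dir(prefix="/opt/ml"):
    """Directory whose contents SageMaker saves as the model artifact."""
    return os.path.join(prefix, "model")


# Demo: a temporary directory stands in for /opt/ml.
with tempfile.TemporaryDirectory() as prefix:
    config_dir = os.path.join(prefix, "input", "config")
    os.makedirs(config_dir)
    with open(os.path.join(config_dir, "hyperparameters.json"), "w") as handle:
        # SageMaker passes hyperparameter values as strings.
        json.dump({"iterations": "100", "learning_rate": "0.1"}, handle)

    raw = load_hyperparameters(prefix)
    params = {
        "iterations": int(raw["iterations"]),
        "learning_rate": float(raw["learning_rate"]),
    }
    print(params)
    print(channel_dir("train", prefix))
    print(model_dir(prefix))
```

Because the values arrive as strings, the script has to cast them before passing them to the model, as the demo does for `iterations` and `learning_rate`.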

Set the NVIDIA environment variables


ENV PYTHONPATH=/opt/program
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

# Set NVIDIA mount environments
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
ENV NVIDIA_VISIBLE_DEVICES="all"
ENV NVIDIA_DRIVER_CAPABILITIES="compute,utility"
ENV NVIDIA_REQUIRE_CUDA="cuda>=9.0"
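
A training or serving script can verify at startup that these variables reached the container before it attempts GPU work. The following is a minimal sketch under that assumption; `missing_gpu_settings` is a hypothetical helper and is not part of the repository's code.

```python
import os

# Values expected from the ENV instructions in the Dockerfile above.
REQUIRED_ENV = {
    "NVIDIA_VISIBLE_DEVICES": "all",
    "NVIDIA_DRIVER_CAPABILITIES": "compute,utility",
    "NVIDIA_REQUIRE_CUDA": "cuda>=9.0",
}


def missing_gpu_settings(env=None):
    """Return the names of NVIDIA variables that are unset or have another value."""
    env = os.environ if env is None else env
    return sorted(
        name for name, expected in REQUIRED_ENV.items() if env.get(name) != expected
    )


# Example with an incomplete environment passed in as a plain dictionary.
print(missing_gpu_settings({"NVIDIA_VISIBLE_DEVICES": "all"}))
```

An empty list means that all three settings match the Dockerfile; any names in the list point to variables to check in the image build.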

Copy the training and inference files into the Docker image


COPY code/* /opt/program/
WORKDIR /opt/program
