
Train and deploy a custom GPU-supported ML model on Amazon SageMaker

Created by Ankur Shukla (AWS)

Environment: PoC or pilot

Technologies: Machine learning & AI; Containers & microservices

AWS services: Amazon ECS; Amazon SageMaker

Summary

Training and deploying a graphics processing unit (GPU)-supported machine learning (ML) model requires an initial setup and the initialization of certain environment variables to fully unlock the benefits of NVIDIA GPUs. However, setting up the environment and making it compatible with the Amazon SageMaker architecture on the Amazon Web Services (AWS) Cloud can be time-consuming.

This pattern helps you train and build a custom GPU-supported ML model using Amazon SageMaker. It provides steps to train and deploy a custom CatBoost model built on an open-source Amazon reviews dataset. You can then benchmark its performance on a p3.16xlarge Amazon Elastic Compute Cloud (Amazon EC2) instance.

This pattern is useful if your organization wants to deploy existing GPU-supported ML models on SageMaker. Your data scientists can follow the steps in this pattern to create NVIDIA GPU-supported containers and deploy ML models on those containers.

Prerequisites and limitations

Prerequisites 

  • An active AWS account.

  • An Amazon Simple Storage Service (Amazon S3) source bucket to store the model artifacts and predictions.

  • An understanding of SageMaker notebook instances and Jupyter notebooks.

  • An understanding of how to create an AWS Identity and Access Management (IAM) role with basic SageMaker role permissions, S3 bucket access and update permissions, and additional permissions for Amazon Elastic Container Registry (Amazon ECR).

Limitations 

  • This pattern is intended for supervised ML workloads with training and deployment code written in Python.

Architecture

Technology stack

  • SageMaker

  • Amazon ECR

Tools

  • Amazon ECR  – Amazon Elastic Container Registry (Amazon ECR) is an AWS managed container image registry service that is secure, scalable, and reliable.

  • Amazon SageMaker – SageMaker is a fully managed ML service.

  • Docker – Docker is a software platform for building, testing, and deploying applications quickly.

  • Python – Python is a programming language.

Code 

The code for this pattern is available on GitHub, in the Implementing a review classification model with Catboost and SageMaker repository.

Epics

Task | Description | Skills required
Create an IAM role and attach the required policies.

Sign in to the AWS Management Console, open the IAM console, and create a new IAM role. Attach the following policies to the IAM role: 

  • AmazonEC2ContainerRegistryFullAccess

  • AmazonS3FullAccess

  • AmazonSageMakerFullAccess

For more information about this, see Create a notebook instance in the Amazon SageMaker documentation.
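If you prefer to script this step instead of using the console, the following is a minimal boto3 sketch that creates the role and attaches the three managed policies. The role name GPUCatboostSageMakerRole is a hypothetical placeholder, and the trust policy simply allows SageMaker to assume the role.

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets SageMaker assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

# Hypothetical role name -- choose your own.
iam.create_role(RoleName='GPUCatboostSageMakerRole',
                AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the three managed policies listed above.
for policy in ('AmazonEC2ContainerRegistryFullAccess',
               'AmazonS3FullAccess',
               'AmazonSageMakerFullAccess'):
    iam.attach_role_policy(RoleName='GPUCatboostSageMakerRole',
                           PolicyArn='arn:aws:iam::aws:policy/' + policy)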

Data scientist
Create the SageMaker notebook instance.

Open the SageMaker console, choose Notebook instances, and then choose Create notebook instance. For IAM role, choose the IAM role that you created earlier. Configure the notebook instance according to your requirements and then choose Create notebook instance.

For detailed steps and instructions, see Create a notebook instance in the Amazon SageMaker documentation.
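If you prefer to script this step, the following is a minimal boto3 sketch. The notebook instance name, instance type, and volume size are placeholder choices, and role_arn is assumed to be the ARN of the IAM role that you created in the previous task.

import boto3

sm = boto3.client('sagemaker')

# Placeholder name and sizing -- adjust to your requirements.
sm.create_notebook_instance(NotebookInstanceName='gpu-catboost-notebook',
                            InstanceType='ml.t3.medium',
                            RoleArn=role_arn,   # IAM role from the previous task
                            VolumeSizeInGB=50)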

Data scientist
Clone the repository.

Open the terminal in the SageMaker notebook instance and clone the GitHub Implementing a review classification model with Catboost and SageMaker repository by running the following command:

git clone https://github.com/aws-samples/review-classification-using-catboost-sagemaker.git
Data scientist
Start the Jupyter notebook.

Start the Review classification model with Catboost and SageMaker.ipynb Jupyter notebook, which contains the predefined steps.

Data scientist
Task | Description | Skills required
Run commands in the Jupyter notebook.

Open the Jupyter notebook and run the commands from the following stories to prepare the data to train your ML model.

Data scientist
Read the data from the S3 bucket.
import pandas as pd
import csv

# Read the Amazon reviews dataset directly from the public S3 bucket.
fname = 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz'
df = pd.read_csv(fname, sep='\t', error_bad_lines=False)
Data scientist
Preprocess the data.
import numpy as np

def pre_process(df):
    # Replace null review text with empty strings and null
    # verified_purchase values with 'Unk' (unknown).
    df.fillna(value={'review_body': '', 'review_headline': ''}, inplace=True)
    df.fillna(value={'verified_purchase': 'Unk'}, inplace=True)
    df.fillna(0, inplace=True)
    return df

df = pre_process(df)
df.review_date = pd.to_datetime(df.review_date)

# Binary target: 1 for reviews with four or more stars, 0 otherwise.
df['target'] = np.where(df['star_rating'] >= 4, 1, 0)

Note: This code replaces null values in the 'review_body' and 'review_headline' columns with an empty string, and it replaces null values in the 'verified_purchase' column with 'Unk', which means “unknown.”

Data scientist
Split the data into training, validation, and test datasets.

To keep the distribution of the target label identical across the split sets, you must stratify the sampling by using the scikit-learn library.

from sklearn.model_selection import StratifiedShuffleSplit

# First split: hold out 10 percent of the data as the test set, stratified on the target label.
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=0)
sss.get_n_splits(df, df['target'])
for train_index, test_index in sss.split(df, df['target']):
    X_train_valid, X_test = df.iloc[train_index], df.iloc[test_index]

# Second split: hold out 10 percent of the remaining data as the validation set.
sss.get_n_splits(X_train_valid, X_train_valid['target'])
for train_index, test_index in sss.split(X_train_valid, X_train_valid['target']):
    X_train, X_valid = X_train_valid.iloc[train_index], X_train_valid.iloc[test_index]
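The tuning and batch transform steps later in this pattern read the datasets from the s3://<bucket>/sagemaker/DEMO-GPU-Catboost/data/ prefix. The repository's notebook handles that upload; the following is only a minimal sketch of the idea, assuming tab-separated files, placeholder file names, and that the s3fs package is available so that pandas can write directly to Amazon S3. Adjust the columns and format to match what the training code expects.

# Minimal sketch -- file names and format are assumptions; the notebook in the
# repository defines what the training container actually expects.
prefix = 's3://' + bucket + '/sagemaker/DEMO-GPU-Catboost/data'

X_train.to_csv(prefix + '/train/train.tsv', sep='\t', index=False)   # requires s3fs
X_valid.to_csv(prefix + '/valid/valid.tsv', sep='\t', index=False)
X_test.to_csv(prefix + '/test/test.tsv', sep='\t', index=False)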
Data scientist
Task | Description | Skills required
Prepare and push the Docker image.

In the Jupyter notebook, run the commands from the following stories to prepare the Docker image and push it to Amazon ECR.

ML engineer
Create a repository in Amazon ECR.
%%sh
algorithm_name=sagemaker-catboost-github-gpu-img

chmod +x code/train
chmod +x code/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the Region defined in the current configuration (default to us-east-1 if none defined)
region=$(aws configure get region)
region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
ML engineer
Build a Docker image locally.
docker build -t "${algorithm_name}" .
docker tag ${algorithm_name} ${fullname}
ML engineer
Push the Docker image to Amazon ECR.
docker push ${fullname}
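The push assumes that your Docker client is already authenticated to Amazon ECR. If it isn't, the following is a minimal sketch (run from a notebook cell) that retrieves an authorization token with boto3 and pipes the decoded password to docker login; this authentication step is an assumption and is not part of the code shown above.

# Hedged sketch: authenticate the local Docker client to Amazon ECR by using boto3.
import base64
import subprocess
import boto3

token = boto3.client('ecr').get_authorization_token()['authorizationData'][0]
username, password = base64.b64decode(token['authorizationToken']).decode().split(':')
registry = token['proxyEndpoint']

# Pipe the password to docker login instead of passing it on the command line.
subprocess.run(['docker', 'login', '--username', username, '--password-stdin', registry],
               input=password.encode(),
               check=True)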
ML engineer
Task | Description | Skills required
Create a SageMaker hyperparameter tuning job.

In the Jupyter notebook, run the commands from the following stories to create a SageMaker hyperparameter tuning job using your Docker image.

Data scientist
Create a SageMaker estimator.

Create a SageMaker estimator by using the Docker image's name.

import sagemaker as sage
from time import gmtime, strftime
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

sess = sage.Session()

# Build the ECR image URI for the container that you pushed earlier.
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-github-gpu-img:latest'.format(account, region)

# role and bucket are defined in earlier notebook cells.
tree_hpo = sage.estimator.Estimator(image,
                                    role, 1, 'ml.p3.16xlarge',
                                    train_volume_size=100,
                                    output_path="s3://{}/sagemaker/DEMO-GPU-Catboost/output".format(bucket),
                                    sagemaker_session=sess)
Data scientist
Create an HPO job.

Create a hyperparameter optimization (HPO) tuning job with parameter ranges and pass the train and validation sets as parameters to the function.

# Hyperparameter ranges to explore for the CatBoost training job.
hyperparameter_ranges = {'iterations': IntegerParameter(80000, 130000),
                         'max_depth': IntegerParameter(6, 10),
                         'max_ctr_complexity': IntegerParameter(4, 10),
                         'learning_rate': ContinuousParameter(0.01, 0.5)}

# The tuner parses the AUC metric from the training logs.
objective_metric_name = 'auc'
metric_definitions = [{'Name': 'auc', 'Regex': 'auc: ([0-9\\.]+)'}]

tuner = HyperparameterTuner(tree_hpo,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            objective_type='Maximize',
                            max_jobs=50,
                            max_parallel_jobs=2)
Data scientist
Run the HPO job.
train_location = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/train/'
valid_location = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/valid/'

tuner.fit({'train': train_location,
           'validation': valid_location})
Data scientist
Retrieve the best-performing training job.
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()
best_job = tuner.best_training_job()
Data scientist
Task | Description | Skills required
Create a SageMaker batch transform job on test data for model prediction.

In the Jupyter notebook, run the commands from the following stories to create the model from your SageMaker hyperparameter tuning job and submit a SageMaker batch transform job on the test data for model prediction.

Data scientist
Create the SageMaker model.

Create a SageMaker model by using the best training job.

attached_estimator = sage.estimator.Estimator.attach(best_job)

output_path = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/test-predictions/'
input_path = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/test/'

# Create a transformer from the best training job's model artifacts.
transformer = attached_estimator.transformer(instance_count=1,
                                             instance_type='ml.p3.16xlarge',
                                             assemble_with='Line',
                                             accept='text/csv',
                                             max_payload=1,
                                             output_path=output_path,
                                             env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600'})
Data scientist
Create a batch transform job.

Create a batch transform job on the test dataset.

transformer.transform(input_path, content_type='text/csv', split_type='Line')
Data scientist
Task | Description | Skills required
Read the results and evaluate the model's performance.

In the Jupyter notebook, run the commands from the following stories to read the results and evaluate the performance of the model on Area Under the ROC Curve (ROC-AUC) and Area Under the Precision Recall Curve (PR-AUC) model metrics.

For more information about this, see Amazon Machine Learning key concepts in the Amazon Machine Learning (Amazon ML) documentation.

Data scientist
Read the batch transform job results.

Read the batch transform job results into a data frame.

file_name = 's3://'+bucket+'/sagemaker/DEMO-GPU-Catboost/data/test-predictions/file_1.out'

results = pd.read_csv(file_name,
                      names=['review_id', 'target', 'score'],
                      sep='\t',
                      escapechar='\\',
                      quoting=csv.QUOTE_NONE,
                      lineterminator='\n',
                      quotechar='"').dropna()
Data scientist
Evaluate the performance metrics.

Evaluate the performance of the model on ROC-AUC and PR-AUC.

from sklearn import metrics
import matplotlib
import pandas as pd
matplotlib.use('agg', warn=False, force=True)
from matplotlib import pyplot as plt
%matplotlib inline

def analyze_results(labels, predictions):
    # Precision-recall curve and PR-AUC
    precision, recall, thresholds = metrics.precision_recall_curve(labels, predictions)
    auc = metrics.auc(recall, precision)

    # ROC curve and ROC-AUC
    fpr, tpr, _ = metrics.roc_curve(labels, predictions)
    roc_auc_score = metrics.roc_auc_score(labels, predictions)
    print('CatBoost: ROC auc=%.3f' % (roc_auc_score))

    plt.plot(fpr, tpr, label="data 1, auc=" + str(roc_auc_score))
    plt.xlabel('1-Specificity')
    plt.ylabel('Sensitivity')
    plt.legend(loc=4)
    plt.show()

    lr_precision, lr_recall, _ = metrics.precision_recall_curve(labels, predictions)
    lr_auc = metrics.auc(lr_recall, lr_precision)
    # summarize scores
    print('CatBoost: PR auc=%.3f' % (lr_auc))

    # plot the precision-recall curves
    no_skill = len(labels[labels==1.0]) / len(labels)
    plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
    plt.plot(lr_recall, lr_precision, marker='.', label='CatBoost')
    # axis labels
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    # show the legend
    plt.legend()
    # show the plot
    plt.show()

    return auc

analyze_results(results['target'].values, results['score'].values)
Data scientist

Additional information

The following sections show the elements of the Dockerfile that is used in the Build, run, and push the Docker image into Amazon ECR epic.

Install Python and the AWS CLI.

FROM amazonlinux:1

RUN yum update -y && yum install -y python36 python36-devel python36-libs python36-tools python36-pip && \
    yum install gcc tar make wget util-linux kmod man sudo git -y && \
    yum install wget -y && \
    yum install aws-cli -y && \
    yum install nginx -y && \
    yum install gcc-c++.noarch -y && yum clean all

Install the Python packages 

RUN pip-3.6 install --no-cache-dir --upgrade pip && \
    pip3 install --no-cache-dir --upgrade setuptools && \
    pip3 install Cython && \
    pip3 install --no-cache-dir numpy==1.16.0 scipy==1.4.1 scikit-learn==0.20.3 pandas==0.24.2 \
    flask gevent gunicorn boto3 s3fs matplotlib joblib catboost==0.20.2

Install CUDA and cuDNN

RUN wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run \
    && chmod u+x cuda_9.0.176_384.81_linux-run \
    && ./cuda_9.0.176_384.81_linux-run --tmpdir=/data --silent --toolkit --override \
    && wget https://custom-gpu-sagemaker-image.s3.amazonaws.com/installation/cudnn-9.0-linux-x64-v7.tgz \
    && tar -xvzf cudnn-9.0-linux-x64-v7.tgz \
    && cp /data/cuda/include/cudnn.h /usr/local/cuda/include \
    && cp /data/cuda/lib64/libcudnn* /usr/local/cuda/lib64 \
    && chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn* \
    && rm -rf /data/*

Create the required directory structure for SageMaker 

RUN mkdir /opt/ml /opt/ml/input /opt/ml/input/config /opt/ml/input/data /opt/ml/input/data/training /opt/ml/model /opt/ml/output /opt/program

Set the NVIDIA environment variables

ENV PYTHONPATH=/opt/program
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

# Set NVIDIA mount environments
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
ENV NVIDIA_VISIBLE_DEVICES="all"
ENV NVIDIA_DRIVER_CAPABILITIES="compute,utility"
ENV NVIDIA_REQUIRE_CUDA "cuda>=9.0"

Copy training and inference files into the Docker image 

COPY code/* /opt/program/
WORKDIR /opt/program