XGBoost Algorithm
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
You can use the new release of the XGBoost algorithm either as an Amazon SageMaker built-in algorithm or as a framework to run training scripts in your local environments. This implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics compared with the original versions. It provides an XGBoost estimator that runs a training script in a managed XGBoost environment. The current release of SageMaker XGBoost is based on the original XGBoost versions 1.0, 1.2, 1.3, 1.5, and 1.7.
Supported versions
- Framework (open source) mode: 1.0-1, 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1
- Algorithm mode: 1.0-1, 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1
Warning
Due to required compute capacity, version 1.7-1 of SageMaker XGBoost is not compatible with GPU instances from the P2 instance family for training or inference.
Important
When you retrieve the SageMaker XGBoost image URI, do not use :latest or :1 for the image URI tag. You must specify one of the Supported versions to choose the SageMaker-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker XGBoost containers, see Docker Registry Paths and Example Code, choose your AWS Region, and navigate to the XGBoost (algorithm) section.
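For example, the following minimal sketch retrieves a pinned image URI with the SageMaker Python SDK (it assumes a boto3 session that resolves your AWS Region):

import boto3
from sagemaker import image_uris

# pin a supported version tag instead of :latest or :1
region = boto3.Session().region_name
container = image_uris.retrieve("xgboost", region, "1.7-1")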
Warning
The XGBoost 0.90 versions are deprecated. Support for security updates and bug fixes for XGBoost 0.90 is discontinued. We highly recommend that you upgrade to one of the newer XGBoost versions.
Note
XGBoost v1.1 is not supported on SageMaker because XGBoost 1.1 cannot run prediction when the test input has fewer features than the training data in LIBSVM inputs. This capability was restored in XGBoost v1.2. Consider using SageMaker XGBoost 1.2-2 or later.
How to Use SageMaker XGBoost
With SageMaker, you can use XGBoost as a built-in algorithm or framework. By using XGBoost as a framework, you have more flexibility and access to more advanced scenarios, such as k-fold cross-validation, because you can customize your own training scripts. The following sections describe how to use XGBoost with the SageMaker Python SDK. For information on how to use XGBoost from the Amazon SageMaker Studio UI, see SageMaker JumpStart.
- Use XGBoost as a framework
Use XGBoost as a framework to run your customized training scripts that can incorporate additional data processing into your training jobs. The following code example shows how the SageMaker Python SDK provides the XGBoost API as a framework, in the same way it provides other framework APIs such as TensorFlow, MXNet, and PyTorch.
import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "verbosity": "1",
    "objective": "reg:squarederror",
    "num_round": "50"
}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-framework'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-framework')

# construct a SageMaker XGBoost estimator
# specify the entry_point to your xgboost training script
estimator = XGBoost(entry_point="your_xgboost_abalone_script.py",
                    framework_version='1.7-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

For an end-to-end example of using SageMaker XGBoost as a framework, see Regression with Amazon SageMaker XGBoost.
- Use XGBoost as a built-in algorithm
Use the XGBoost built-in algorithm to build an XGBoost training container as shown in the following code example. You can automatically find the XGBoost built-in algorithm image URI using the SageMaker image_uris.retrieve API (or the get_image_uri API if you use Amazon SageMaker Python SDK version 1). To make sure that the image_uris.retrieve API finds the correct URI, see Common parameters for built-in algorithms and look up xgboost in the full list of built-in algorithm image URIs and available Regions.
After specifying the XGBoost image URI, you can use the XGBoost container to construct an estimator using the SageMaker Estimator API and initiate a training job. This XGBoost built-in algorithm mode does not incorporate your own XGBoost training script and runs directly on the input datasets.
Important
When you retrieve the SageMaker XGBoost image URI, do not use :latest or :1 for the image URI tag. You must specify one of the Supported versions to choose the SageMaker-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker XGBoost containers, see Docker Registry Paths and Example Code, choose your AWS Region, and navigate to the XGBoost (algorithm) section.

import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50"
}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-built-in-algo')

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m5.2xlarge',
                                          volume_size=5,  # 5 GB
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

For more information about how to set up XGBoost as a built-in algorithm, see the notebook examples in XGBoost Sample Notebooks.
Input/Output Interface for the XGBoost Algorithm
Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.
The SageMaker implementation of XGBoost supports the following data formats for training and inference:
- text/libsvm (default)
- text/csv
- application/x-parquet
- application/x-recordio-protobuf
Note
There are a few considerations to be aware of regarding training and inference input:
- For training with columnar input, the algorithm assumes that the target variable (label) is the first column. For inference, the algorithm assumes that the input has no label column. For CSV data, the input should not have a header record. For LIBSVM training, the algorithm assumes that subsequent columns after the label column contain the zero-based index and value pairs for features, so each row has the format <label> <index0>:<value0> <index1>:<value1> ... (see the short example after this list).
- For information on instance types and distributed training, see EC2 Instance Recommendation for the XGBoost Algorithm.
- For CSV training input mode, the total memory available to the algorithm (instance count * the memory available in the InstanceType) must be able to hold the training dataset. For LIBSVM training input mode, it's not required, but we recommend it.
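For illustration, a single training record could look like the following in each columnar format (the feature values are made up for this example). As a CSV row, with the label first and no header record:

1,0.514,0.2245,0.101

As a LIBSVM row, with the label followed by zero-based index:value pairs:

1 0:0.514 1:0.2245 2:0.101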
For v1.3-1 and later, SageMaker XGBoost saves the model in the XGBoost internal binary format, using Booster.save_model. Previous versions use the Python pickle module to serialize/deserialize the model.
Note
Be mindful of versions when using a SageMaker XGBoost model in open source XGBoost. Versions 1.3-1 and later use the XGBoost internal binary format, while previous versions use the Python pickle module.
To use a model trained with SageMaker XGBoost v1.3-1 or later in open source XGBoost
- Use the following Python code:
import xgboost as xgb

xgb_model = xgb.Booster()
xgb_model.load_model(model_file_path)
xgb_model.predict(dtest)
To use a model trained with previous versions of SageMaker XGBoost in open source XGBoost
- Use the following Python code:
import pickle as pkl
import tarfile

t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()

model = pkl.load(open(model_file_path, 'rb'))

# prediction with test data
pred = model.predict(dtest)
To differentiate the importance of labelled data points, use instance weight supports
- SageMaker XGBoost allows customers to differentiate the importance of labelled data points by assigning each instance a weight value. For text/libsvm input, customers can assign weight values to data instances by attaching them after the labels. For example, label:weight idx_0:val_0 idx_1:val_1 .... For text/csv input, customers need to turn on the csv_weights flag in the parameters and attach weight values in the column after labels. For example: label,weight,val_0,val_1,... (see the sketch that follows).
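For instance, a minimal sketch of turning on CSV instance weights with the built-in algorithm (this reuses the hyperparameters and estimator variables from the earlier built-in algorithm example):

# the weight column immediately follows the label column in each CSV row
hyperparameters["csv_weights"] = "1"
estimator.set_hyperparameters(**hyperparameters)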
EC2 Instance Recommendation for the XGBoost Algorithm
SageMaker XGBoost supports CPU and GPU training and inference. Instance recommendations depend on training and inference needs, as well as the version of the XGBoost algorithm. Choose one of the following options for more information:
Training
The SageMaker XGBoost algorithm supports CPU and GPU training.
CPU training
SageMaker XGBoost 1.0-1 or earlier only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. Although it supports the use of disk space to handle data that does not fit into main memory (the out-of-core feature available with the libsvm input mode), writing cache files onto disk slows the algorithm processing time.
GPU training
SageMaker XGBoost version 1.2-2 or later supports GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective.
SageMaker XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.
SageMaker XGBoost version 1.7-1 or later supports P3, G4dn, and G5 GPU instance families. Note that due to compute capacity requirements, version 1.7-1 or later does not support the P2 instance family.
To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3) and set the tree_method hyperparameter to gpu_hist in your existing XGBoost script.
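For example, a minimal sketch of enabling GPU training with the built-in algorithm (this reuses the xgboost_container, hyperparameters, and output_path variables from the earlier built-in algorithm example):

import sagemaker

# use the GPU-enabled histogram tree method on a P3 instance
hyperparameters["tree_method"] = "gpu_hist"

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.p3.2xlarge',
                                          output_path=output_path)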
Distributed training
SageMaker XGBoost supports CPU and GPU instances for distributed training.
Distributed CPU training
To run CPU training on multiple instances, set the instance_count parameter for the estimator to a value greater than one. The input data must be divided between the total number of instances.
Divide input data across instances
Divide the input data using the following steps:
- Break the input data down into smaller files. The number of files should be at least equal to the number of instances used for distributed training. Using multiple smaller files as opposed to one large file also decreases the data download time for the training job.
- When creating your TrainingInput, set the distribution parameter to ShardedByS3Key. This parameter ensures that each instance gets approximately 1/n of the number of files in S3 if there are n instances specified in the training job, as shown in the sketch after this list.
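A minimal sketch of sharding the training channel (this assumes the bucket and prefix variables from the earlier examples and CSV input data):

from sagemaker.inputs import TrainingInput

# each training instance receives roughly 1/n of the objects under the prefix
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
                            content_type="text/csv",
                            distribution="ShardedByS3Key")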
Distributed GPU training
You can use distributed training with either single-GPU or multi-GPU instances.
Distributed training with single-GPU instances
SageMaker XGBoost versions 1.2-2 through 1.3-1 only support single-GPU instance training. This means that even if you select a multi-GPU instance, only one GPU is used per instance.
If you use XGBoost versions 1.2-2 through 1.3-1, or if you do not need to use multi-GPU instances, then you must divide your input data between the total number of instances. For more information, see Divide input data across instances.
Note
Versions 1.2-2 through 1.3-1 of SageMaker XGBoost only use one GPU per instance even if you choose a multi-GPU instance.
Distributed training with multi-GPU instances
Starting with version 1.5-1, SageMaker XGBoost offers distributed GPU training with Dask.
Train with Dask using the following steps:
- Either omit the distribution parameter in your TrainingInput or set it to FullyReplicated.
- When defining your hyperparameters, set use_dask_gpu_training to "true".
Important
Distributed training with Dask only supports CSV and Parquet input formats. If you use other data formats such as LIBSVM or PROTOBUF, the training job fails.
For Parquet data, ensure that the column names are saved as strings. Columns that have names of other data types will fail to load.
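For example, a minimal sketch of forcing string column names before writing Parquet training data (the toy DataFrame below is hypothetical and assumes pandas with a Parquet engine such as pyarrow installed):

import pandas as pd

# toy frame with integer column labels; the label column comes first
df = pd.DataFrame({0: [1, 0], 1: [0.51, 0.44], 2: [0.22, 0.19]})

# convert column names to strings so the columns load correctly
df.columns = df.columns.astype(str)
df.to_parquet("train.parquet", index=False)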
Important
Distributed training with Dask does not support pipe mode. If pipe mode is specified, the training job fails.
There are a few considerations to be aware of when training SageMaker XGBoost with Dask. Be sure to split your data into smaller files. Dask reads each Parquet file as a partition. There is a Dask worker for every GPU, so the number of files should be greater than the total number of GPUs (instance count * number of GPUs per instance). Having a very large number of files can also degrade performance. For more information, see Dask Best Practices.
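Putting these steps together, the following is a minimal sketch of a Dask GPU training job (it assumes the bucket, prefix, and output_path variables from the earlier built-in algorithm example and CSV input data):

import boto3
import sagemaker
from sagemaker.inputs import TrainingInput

region = boto3.Session().region_name
container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

# enable the Dask-based distributed GPU trainer
hyperparameters = {
    "objective": "reg:squarederror",
    "num_round": "50",
    "tree_method": "gpu_hist",
    "use_dask_gpu_training": "true"
}

estimator = sagemaker.estimator.Estimator(image_uri=container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=2,
                                          instance_type='ml.g4dn.12xlarge',
                                          output_path=output_path)

# Dask training supports only CSV and Parquet input; keep the channel FullyReplicated
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
                            content_type="text/csv",
                            distribution="FullyReplicated")
estimator.fit({'train': train_input})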
Variations in output
The specified tree_method hyperparameter determines the algorithm that is used for XGBoost training. The tree methods approx, hist, and gpu_hist are all approximate methods and use sketching for quantile calculation. For more information, see Tree Methods.
Inference
SageMaker XGBoost supports CPU and GPU instances for inference. For information about the instance types for inference, see Amazon SageMaker ML Instance Types.
XGBoost Sample Notebooks
The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker XGBoost algorithm.
Notebook Title | Description |
---|---|
How to Create a Custom XGBoost container? | This notebook shows you how to build a custom XGBoost Container with Amazon SageMaker Batch Transform. |
Regression with XGBoost using Parquet | This notebook shows you how to use the Abalone dataset in Parquet to train a XGBoost model. |
How to Train and Host a Multiclass Classification Model? | This notebook shows how to use the MNIST dataset to train and host a multiclass classification model. |
How to train a Model to Predict Mobile Customer Departure? | This notebook shows you how to train a model to predict mobile customer departure in an effort to identify unhappy customers. |
An Introduction to Amazon SageMaker Managed Spot infrastructure for XGBoost Training | This notebook shows you how to use Spot Instances for training with a XGBoost Container. |
How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs? | This notebook shows you how to use Amazon SageMaker Debugger to monitor training jobs to detect inconsistencies using built-in debugging rules. |
How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs in Real-Time? | This notebook shows you how to use the MNIST dataset and Amazon SageMaker Debugger to perform real-time analysis of XGBoost training jobs while training jobs are running. |
For instructions on how to create and access Jupyter notebook instances that you can use to run these examples in SageMaker, see Amazon SageMaker Notebook Instances. After you have created a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the SageMaker samples. The example notebooks that use the XGBoost algorithm are located in the Introduction to Amazon algorithms section. To open a notebook, choose its Use tab and choose Create copy.