XGBoost Release 0.72
This previous release of the Amazon SageMaker XGBoost algorithm is based on the 0.72 release of XGBoost.
Customers should consider using the new release of the XGBoost algorithm. They can use it as a SageMaker built-in algorithm or as a framework to run scripts in their local environments, as they typically would with, for example, a TensorFlow deep learning framework. The new implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics. The earlier implementation of XGBoost remains available to customers who need to postpone migrating to the new version, but it will remain tied to the 0.72 release of XGBoost.
Input/Output Interface for the XGBoost Release 0.72
Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.
The SageMaker implementation of XGBoost supports CSV and libsvm formats for training and inference:

For Training ContentType, valid inputs are text/libsvm (default) or text/csv.

For Inference ContentType, valid inputs are text/libsvm or text/csv (default).
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input does not have the label column.
For libsvm training, the algorithm assumes that the label is in the first column. Subsequent columns contain the zero-based index-value pairs for features, so each row has the format: <label> <index0>:<value0> <index1>:<value1> ... Inference requests for libsvm may or may not have labels in the libsvm format.
This differs from other SageMaker algorithms, which use the protobuf training input format. SageMaker XGBoost uses CSV and libsvm instead to maintain greater consistency with standard XGBoost data formats.
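The two row layouts described above can be sketched in a few lines of Python. This is only an illustration of the formats: the helper names `to_csv_row` and `to_libsvm_row` are hypothetical and are not part of SageMaker or XGBoost, and omitting zero-valued features from the libsvm row is the usual sparse convention rather than something this page requires.

```python
def to_csv_row(label, features):
    """Build one text/csv training row: label first, then features, no header."""
    return ",".join(str(v) for v in [label, *features])


def to_libsvm_row(label, features):
    """Build one text/libsvm training row: '<label> <index>:<value> ...'.

    Feature indices are zero-based. Zero-valued features are skipped,
    which is the usual sparse libsvm convention (an assumption here).
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(features) if v != 0]
    return " ".join([str(label), *pairs])


row_label, row_features = 1, [0.5, 0, 3]
print(to_csv_row(row_label, row_features))     # 1,0.5,0,3
print(to_libsvm_row(row_label, row_features))  # 1 0:0.5 2:3
```

For CSV inference, the same feature values would be sent without the leading label column.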
For CSV training input mode, the total memory available to the algorithm (Instance Count * the memory available in the InstanceType) must be able to hold the training dataset. For libsvm training input mode, it's not required, but we recommend it.
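As a rough sanity check of the memory rule above, the arithmetic can be written out as follows. This is a minimal sketch: `fits_in_memory` is a hypothetical helper, the memory per instance is a figure you supply for your chosen instance type, and a real training job needs headroom beyond the raw dataset size.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def fits_in_memory(dataset_bytes, instance_count, memory_per_instance_gib):
    """Return True if Instance Count * memory per instance can hold the dataset.

    Mirrors only the rule stated in the text; it does not account for the
    extra working memory that training itself requires.
    """
    total_bytes = instance_count * memory_per_instance_gib * GIB
    return dataset_bytes <= total_bytes


# Hypothetical numbers: a 100 GiB CSV dataset on instances with 64 GiB each.
print(fits_in_memory(100 * GIB, instance_count=1, memory_per_instance_gib=64))  # False
print(fits_in_memory(100 * GIB, instance_count=2, memory_per_instance_gib=64))  # True
```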
SageMaker XGBoost uses the Python pickle module to serialize and deserialize the model, so you can use pickle to save and load the model.
To use a model trained with SageMaker XGBoost in open source XGBoost

Use the following Python code:
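    import pickle as pkl
    import tarfile
    import xgboost  # must be installed so that pickle can rebuild the model object

    # Extract the model artifact produced by the training job
    with tarfile.open('model.tar.gz', 'r:gz') as t:
        t.extractall()

    # model_file_path is the path of the extracted model file
    with open(model_file_path, 'rb') as f:
        model = pkl.load(f)

    # Prediction with test data (dtest holds the test data)
    pred = model.predict(dtest)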
To differentiate the importance of labeled data points, use instance weights

SageMaker XGBoost allows customers to differentiate the importance of labeled data points by assigning each instance a weight value. For text/libsvm input, customers can assign weight values to data instances by attaching them after the labels. For example, label:weight idx_0:val_0 idx_1:val_1... For text/csv input, customers need to turn on the csv_weights flag in the parameters and attach weight values in the column after labels. For example: label,weight,val_0,val_1,...
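The weighted row layouts can be sketched as below. The helper names are hypothetical and only illustrate the formats described above. The `csv_weights` flag is the hyperparameter named in the text, and passing it as the string "1" is an assumption about how hyperparameters are typically supplied.

```python
def to_weighted_libsvm_row(label, weight, features):
    """Build a text/libsvm row with the weight attached after the label."""
    pairs = [f"{i}:{v}" for i, v in enumerate(features) if v != 0]
    return " ".join([f"{label}:{weight}", *pairs])


def to_weighted_csv_row(label, weight, features):
    """Build a text/csv row with the weight in the column after the label."""
    return ",".join(str(v) for v in [label, weight, *features])


# For CSV input, the weight column is only honored when csv_weights is enabled.
hyperparameters = {"csv_weights": "1"}

print(to_weighted_libsvm_row(1, 2.0, [0.5, 0, 3]))  # 1:2.0 0:0.5 2:3
print(to_weighted_csv_row(1, 2.0, [0.5, 0, 3]))     # 1,2.0,0.5,0,3
```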
EC2 Instance Recommendation for the XGBoost Release 0.72
SageMaker XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm, so a general-purpose compute instance (for example, M4) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in the selected instances to hold the training data. Although the algorithm supports the use of disk space to handle data that does not fit into main memory (the out-of-core feature available with the libsvm input mode), writing cache files to disk slows processing.
XGBoost Release 0.72 Sample Notebooks
For a sample notebook that shows how to use the latest version of SageMaker XGBoost as a built-in algorithm to train and host a regression model, see Regression with Amazon SageMaker XGBoost algorithm.
XGBoost Release 0.72 Hyperparameters
The following table contains the hyperparameters for the XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker XGBoost algorithm is an implementation of the open-source XGBoost package. Currently SageMaker supports version 0.72. For more detail about hyperparameter configuration for this version of XGBoost, see XGBoost Parameters.
| Parameter Name | Description |
| --- | --- |
| num_class | The number of classes. Required if objective is set to multi:softmax or multi:softprob. Valid values: integer. |
| num_round | The number of rounds to run the training. Required. Valid values: integer. |
| alpha | L1 regularization term on weights. Increasing this value makes models more conservative. Optional. Valid values: float. Default value: 0. |
| base_score | The initial prediction score of all instances, global bias. Optional. Valid values: float. Default value: 0.5. |
| booster | Which booster to use. The gbtree and dart values use a tree-based model, while gblinear uses a linear function. Optional. Valid values: String. One of gbtree, gblinear, or dart. Default value: gbtree. |
| colsample_bylevel | Subsample ratio of columns for each split, in each level. Optional. Valid values: Float. Range: [0,1]. Default value: 1. |
| colsample_bytree | Subsample ratio of columns when constructing each tree. Optional. Valid values: Float. Range: [0,1]. Default value: 1. |
| csv_weights | When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. Optional. Valid values: 0 or 1. Default value: 0. |
| early_stopping_rounds | The model trains until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds rounds to continue training. Optional. Valid values: integer. Default value: none. |
| eta | Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative. Optional. Valid values: Float. Range: [0,1]. Default value: 0.3. |
| eval_metric | Evaluation metrics for validation data. A default metric is assigned according to the objective: rmse for regression, error for classification, and map for ranking. For a list of valid inputs, see XGBoost Parameters. Optional. Valid values: string. Default value: Default according to objective. |
| gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is. Optional. Valid values: Float. Range: [0,∞). Default value: 0. |
| grow_policy | Controls the way that new nodes are added to the tree. Currently supported only if tree_method is set to hist. Optional. Valid values: String. Either depthwise or lossguide. Default value: depthwise. |
| lambda | L2 regularization term on weights. Increasing this value makes models more conservative. Optional. Valid values: float. Default value: 1. |
| lambda_bias | L2 regularization term on bias. Optional. Valid values: Float. Range: [0.0, 1.0]. Default value: 0. |
| max_bin | Maximum number of discrete bins to bucket continuous features. Used only if tree_method is set to hist. Optional. Valid values: integer. Default value: 256. |
| max_delta_step | Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update. Optional. Valid values: Integer. Range: [0,∞). Default value: 0. |
| max_depth | Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. A limit is required when grow_policy is depthwise. Optional. Valid values: Integer. Range: [0,∞). Default value: 6. |
| max_leaves | Maximum number of nodes to be added. Relevant only if grow_policy is set to lossguide. Optional. Valid values: integer. Default value: 0. |
| min_child_weight | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. Optional. Valid values: Float. Range: [0,∞). Default value: 1. |
| normalize_type | Type of normalization algorithm. Optional. Valid values: Either tree or forest. Default value: tree. |
| nthread | Number of parallel threads used to run XGBoost. Optional. Valid values: integer. Default value: Maximum number of threads. |
| objective | Specifies the learning task and the corresponding learning objective. Examples: reg:linear, reg:logistic, multi:softmax. Optional. Valid values: string. Default value: reg:linear. |
| one_drop | When this flag is enabled, at least one tree is always dropped during the dropout. Optional. Valid values: 0 or 1. Default value: 0. |
| process_type | The type of boosting process to run. Optional. Valid values: String. Either default or update. Default value: default. |
| rate_drop | The dropout rate that specifies the fraction of previous trees to drop during the dropout. Optional. Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0. |
| refresh_leaf | This is a parameter of the 'refresh' updater plugin. When set to 1 (true), tree leaves and tree node stats are updated. When set to 0 (false), only tree node stats are updated. Optional. Valid values: 0/1. Default value: 1. |
| sample_type | Type of sampling algorithm. Optional. Valid values: Either uniform or weighted. Default value: uniform. |
| scale_pos_weight | Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). Optional. Valid values: float. Default value: 1. |
| seed | Random number seed. Optional. Valid values: integer. Default value: 0. |
| silent | 0 means print running messages, 1 means silent mode. Optional. Valid values: 0 or 1. Default value: 0. |
| sketch_eps | Used only for approximate greedy algorithm. This translates into O(1 / sketch_eps) number of bins. Optional. Valid values: Float. Range: [0, 1]. Default value: 0.03. |
| skip_drop | Probability of skipping the dropout procedure during a boosting iteration. Optional. Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0. |
| subsample | Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This helps prevent overfitting. Optional. Valid values: Float. Range: [0,1]. Default value: 1. |
| tree_method | The tree construction algorithm used in XGBoost. Optional. Valid values: One of auto, exact, approx, or hist. Default value: auto. |
| tweedie_variance_power | Parameter that controls the variance of the Tweedie distribution. Optional. Valid values: Float. Range: (1, 2). Default value: 1.5. |
| updater | A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees. For a full list of valid inputs, please refer to XGBoost Parameters. Optional. Valid values: comma-separated string. Default value: grow_colmaker,prune. |
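To make the required and optional constraints in the table more concrete, here is a minimal sketch of a pre-flight check on a hyperparameter dictionary. It is not SageMaker's own validation. `check_hyperparameters` and `RANGES` are hypothetical names, only a subset of the ranges from the table is covered, and the rule that num_class is needed for multi:softmax and multi:softprob comes from XGBoost's multiclass objectives.

```python
import math

# Closed ranges for a subset of the parameters in the table above.
RANGES = {
    "colsample_bylevel": (0.0, 1.0),
    "colsample_bytree": (0.0, 1.0),
    "eta": (0.0, 1.0),
    "gamma": (0.0, math.inf),
    "min_child_weight": (0.0, math.inf),
    "rate_drop": (0.0, 1.0),
    "skip_drop": (0.0, 1.0),
    "subsample": (0.0, 1.0),
}


def check_hyperparameters(params):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if "num_round" not in params:
        problems.append("num_round is required")
    objective = params.get("objective", "")
    if objective in ("multi:softmax", "multi:softprob") and "num_class" not in params:
        problems.append("num_class is required for " + objective)
    for name, (low, high) in RANGES.items():
        # Values are often passed as strings, so convert before comparing.
        if name in params and not low <= float(params[name]) <= high:
            problems.append(f"{name}={params[name]} is outside [{low}, {high}]")
    return problems


good = {"num_round": "100", "objective": "reg:linear", "eta": "0.2", "subsample": "0.8"}
bad = {"objective": "multi:softmax", "eta": "1.5"}
print(check_hyperparameters(good))  # []
print(check_hyperparameters(bad))
```

The second call reports three problems: the missing num_round, the missing num_class, and the out-of-range eta.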
Tune an XGBoost Release 0.72 Model
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:

a learning objective function to optimize during model training
an eval_metric to use to evaluate model performance during validation
a set of hyperparameters and a range of values for each to use when tuning the model automatically
You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric.
For more information about model tuning, see Perform Automatic Model Tuning.
Metrics Computed by the XGBoost Release 0.72 Algorithm
The XGBoost algorithm based on version 0.72 computes the following nine metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For a full list of valid eval_metric values, refer to XGBoost Learning Task Parameters.
| Metric Name | Description | Optimization Direction |
| --- | --- | --- |
| validation:auc | Area under the curve. | Maximize |
| validation:error | Binary classification error rate, calculated as #(wrong cases)/#(all cases). | Minimize |
| validation:logloss | Negative log-likelihood. | Minimize |
| validation:mae | Mean absolute error. | Minimize |
| validation:map | Mean average precision. | Maximize |
| validation:merror | Multiclass classification error rate, calculated as #(wrong cases)/#(all cases). | Minimize |
| validation:mlogloss | Negative log-likelihood for multiclass classification. | Minimize |
| validation:ndcg | Normalized Discounted Cumulative Gain. | Maximize |
| validation:rmse | Root mean square error. | Minimize |
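Several of the metric definitions above are short enough to write out directly. The following is a plain-Python sketch of four of them, intended only to illustrate the formulas. It is not the code XGBoost runs, the function names are hypothetical, and the 0.5 decision threshold for the error rate is an assumption based on XGBoost's usual convention for binary classification.

```python
import math


def error_rate(labels, probabilities, threshold=0.5):
    """Binary classification error: #(wrong cases) / #(all cases)."""
    wrong = sum((p > threshold) != bool(y) for y, p in zip(labels, probabilities))
    return wrong / len(labels)


def rmse(labels, predictions):
    """Root mean square error."""
    squared = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    return math.sqrt(squared / len(labels))


def mae(labels, predictions):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def logloss(labels, probabilities):
    """Mean negative log-likelihood for binary labels in {0, 1}."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probabilities)
    )
    return -total / len(labels)


print(error_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))  # 0.5
print(rmse([1.0, 3.0], [2.0, 2.0]))                    # 1.0
print(mae([1.0, 3.0], [2.0, 2.0]))                     # 1.0
```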
Tunable XGBoost Release 0.72 Hyperparameters
Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: alpha, min_child_weight, subsample, eta, and num_round.
| Parameter Name | Parameter Type | Recommended Ranges |
| --- | --- | --- |
| alpha | ContinuousParameterRanges | MinValue: 0, MaxValue: 1000 |
| colsample_bylevel | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1 |
| colsample_bytree | ContinuousParameterRanges | MinValue: 0.5, MaxValue: 1 |
| eta | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 0.5 |
| gamma | ContinuousParameterRanges | MinValue: 0, MaxValue: 5 |
| lambda | ContinuousParameterRanges | MinValue: 0, MaxValue: 1000 |
| max_delta_step | IntegerParameterRanges | [0, 10] |
| max_depth | IntegerParameterRanges | [0, 10] |
| min_child_weight | ContinuousParameterRanges | MinValue: 0, MaxValue: 120 |
| num_round | IntegerParameterRanges | [1, 4000] |
| subsample | ContinuousParameterRanges | MinValue: 0.5, MaxValue: 1 |
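The recommended ranges and metric directions above can be combined into a tuning configuration. The sketch below builds a plain dictionary in the shape of the HyperParameterTuningJobConfig structure used by the SageMaker CreateHyperParameterTuningJob API. It makes no API call, `build_tuning_config` is a hypothetical helper, the job limits are placeholder values, and only the five most influential hyperparameters are included.

```python
# Optimization direction for each metric, copied from the metrics table.
METRIC_DIRECTION = {
    "validation:auc": "Maximize",
    "validation:error": "Minimize",
    "validation:logloss": "Minimize",
    "validation:mae": "Minimize",
    "validation:map": "Maximize",
    "validation:merror": "Minimize",
    "validation:mlogloss": "Minimize",
    "validation:ndcg": "Maximize",
    "validation:rmse": "Minimize",
}

# Recommended ranges for the five most influential hyperparameters.
CONTINUOUS_RANGES = {
    "alpha": (0, 1000),
    "eta": (0.1, 0.5),
    "min_child_weight": (0, 120),
    "subsample": (0.5, 1),
}
INTEGER_RANGES = {"num_round": (1, 4000)}


def _as_ranges(ranges):
    """Convert {name: (low, high)} into range entries with string bounds."""
    return [
        {"Name": name, "MinValue": str(low), "MaxValue": str(high)}
        for name, (low, high) in ranges.items()
    ]


def build_tuning_config(metric_name, max_jobs=20, max_parallel_jobs=2):
    """Assemble a tuning job config dict for the chosen evaluation metric."""
    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": METRIC_DIRECTION[metric_name],
            "MetricName": metric_name,
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,      # placeholder limit
            "MaxParallelTrainingJobs": max_parallel_jobs,  # placeholder limit
        },
        "ParameterRanges": {
            "ContinuousParameterRanges": _as_ranges(CONTINUOUS_RANGES),
            "IntegerParameterRanges": _as_ranges(INTEGER_RANGES),
            "CategoricalParameterRanges": [],
        },
    }


config = build_tuning_config("validation:auc")
print(config["HyperParameterTuningJobObjective"])
```

Looking up the direction from the metric name keeps the objective type consistent with the metrics table, so choosing validation:auc yields Maximize while validation:rmse yields Minimize.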