XGBoost hyperparameters - Amazon SageMaker

XGBoost hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker XGBoost algorithm is an implementation of the open-source DMLC XGBoost package. For details about the full set of hyperparameters that can be configured for this version of XGBoost, see XGBoost Parameters.
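
Hyperparameters are passed to the training job as key-value pairs before you call fit. The following is a minimal sketch using the SageMaker Python SDK; the IAM role, S3 locations, instance type, and container version are placeholders and assumptions rather than requirements. It sets both required hyperparameters (num_round, plus num_class because the objective is multi:softmax) along with a few optional ones from this table.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"  # placeholder execution role

# Retrieve the built-in XGBoost container image (the version string is an assumption).
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://amzn-s3-demo-bucket/output",  # placeholder bucket
    sagemaker_session=session,
)

# num_round is always required; num_class is required here because the
# objective is multi:softmax. The remaining values are optional tuning choices.
estimator.set_hyperparameters(
    objective="multi:softmax",
    num_class=3,
    num_round=100,
    max_depth=6,
    eta=0.3,
    subsample=0.8,
)

estimator.fit({"train": TrainingInput("s3://amzn-s3-demo-bucket/train", content_type="text/csv")})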

Each entry below lists the parameter name, its description, whether it is required or optional, its valid values, and its default value.
num_class

The number of classes.

Required if objective is set to multi:softmax or multi:softprob.

Valid values: Integer.

num_round

The number of rounds to run the training.

Required

Valid values: Integer.

alpha

L1 regularization term on weights. Increasing this value makes models more conservative.

Optional

Valid values: Float.

Default value: 0

base_score

The initial prediction score of all instances, global bias.

Optional

Valid values: Float.

Default value: 0.5

booster

Which booster to use. The gbtree and dart values use a tree-based model, while gblinear uses a linear function.

Optional

Valid values: String. One of "gbtree", "gblinear", or "dart".

Default value: "gbtree"

colsample_bylevel

Subsample ratio of columns for each split, in each level.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

colsample_bynode

Subsample ratio of columns from each node.

Optional

Valid values: Float. Range: (0,1].

Default value: 1

colsample_bytree

Subsample ratio of columns when constructing each tree.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

csv_weights

When this flag is enabled, XGBoost differentiates the importance of instances for CSV input by taking the second column (the column after labels) in the training data as the instance weights. An example layout is sketched after this entry.

Optional

Valid values: 0 or 1

Default value: 0
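
In this sketch the numbers are made up, and the built-in algorithm's CSV format has no header row; the first column is the label and the second column holds the instance weights.

import csv

# Columns: label, instance weight, then the feature columns.
rows = [
    [1, 2.0, 0.5, 3.1],  # counted twice as heavily during training
    [0, 1.0, 1.2, 0.7],
    [0, 0.5, 0.9, 2.4],  # down-weighted
]
with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)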

deterministic_histogram

When this flag is enabled, XGBoost builds histograms on the GPU deterministically. Used only if tree_method is set to gpu_hist.

For a full list of valid inputs, please refer to XGBoost Parameters.

Optional

Valid values: String. Range: "true" or "false".

Default value: "true"

early_stopping_rounds

The model trains until the validation score stops improving. Validation error needs to decrease at least once every early_stopping_rounds rounds to continue training. SageMaker hosting uses the best model for inference. A configuration sketch follows this entry.

Optional

Valid values: Integer.

Default value: -
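
The sketch assumes an estimator configured as in the example near the top of this page and a separate validation dataset; the S3 paths are placeholders. Early stopping is evaluated against the validation channel.

from sagemaker.inputs import TrainingInput

# Train for up to 500 rounds, but stop if the validation metric
# has not improved in 10 consecutive rounds.
estimator.set_hyperparameters(num_round=500, early_stopping_rounds=10)

estimator.fit({
    "train": TrainingInput("s3://amzn-s3-demo-bucket/train", content_type="text/csv"),
    "validation": TrainingInput("s3://amzn-s3-demo-bucket/validation", content_type="text/csv"),
})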

eta

Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.

Optional

Valid values: Float. Range: [0,1].

Default value: 0.3

eval_metric

Evaluation metrics for validation data. A default metric is assigned according to the objective:

  • rmse: for regression

  • error: for classification

  • map: for ranking

For a list of valid inputs, see XGBoost Learning Task Parameters.

Optional

Valid values: String.

Default value: Default according to objective.
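
For example, to override that default with a different metric (a sketch, assuming an estimator configured as in the example near the top of this page):

# Evaluate the validation data with area under the ROC curve
# instead of the objective's default metric.
estimator.set_hyperparameters(eval_metric="auc")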

gamma

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is.

Optional

Valid values: Float. Range: [0,∞).

Default value: 0

grow_policy

Controls the way that new nodes are added to the tree. Currently supported only if tree_method is set to hist.

Optional

Valid values: String. Either "depthwise" or "lossguide".

Default value: "depthwise"

interaction_constraints

Specify groups of variables that are allowed to interact.

Optional

Valid values: Nested list of integers. Each integer represents a feature, and each nested list contains features that are allowed to interact, for example [[1,2], [3,4,5]].

Default value: None

lambda

L2 regularization term on weights. Increasing this value makes models more conservative.

Optional

Valid values: Float.

Default value: 1

lambda_bias

L2 regularization term on bias.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0

max_bin

Maximum number of discrete bins to bucket continuous features. Used only if tree_method is set to hist.

Optional

Valid values: Integer.

Default value: 256

max_delta_step

Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. It is most useful in logistic regression; setting it to a value from 1 to 10 can help control the update.

Optional

Valid values: Integer. Range: [0,∞).

Default value: 0

max_depth

Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. A limit is required when grow_policy is set to depthwise.

Optional

Valid values: Integer. Range: [0,∞)

Default value: 6

max_leaves

Maximum number of nodes to be added. Relevant only if grow_policy is set to lossguide.

Optional

Valid values: Integer.

Default value: 0

min_child_weight

Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.

Optional

Valid values: Float. Range: [0,∞).

Default value: 1

monotone_constraints

Specifies monotonicity constraints on any feature.

Optional

Valid values: Tuple of Integers. Valid integers: -1 (decreasing constraint), 0 (no constraint), 1 (increasing constraint).

For example, (0, 1) means no constraint on the first predictor and an increasing constraint on the second; (-1, 1) means a decreasing constraint on the first predictor and an increasing constraint on the second.

Default value: (0, 0)

normalize_type

Type of normalization algorithm.

Optional

Valid values: Either tree or forest.

Default value: tree

nthread

Number of parallel threads used to run xgboost.

Optional

Valid values: Integer.

Default value: Maximum number of threads.

objective

Specifies the learning task and the corresponding learning objective. Examples: reg:logistic, multi:softmax, reg:squarederror. For a full list of valid inputs, refer to XGBoost Learning Task Parameters.

Optional

Valid values: String

Default value: "reg:squarederror"

one_drop

When this flag is enabled, at least one tree is always dropped during the dropout.

Optional

Valid values: 0 or 1

Default value: 0

process_type

The type of boosting process to run.

Optional

Valid values: String. Either "default" or "update".

Default value: "default"

rate_drop

The dropout rate that specifies the fraction of previous trees to drop during the dropout.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0.0

refresh_leaf

This is a parameter of the 'refresh' updater plug-in. When set to true (1), tree leaves and tree node stats are updated. When set to false (0), only tree node stats are updated.

Optional

Valid values: 0 or 1

Default value: 1

sample_type

Type of sampling algorithm.

Optional

Valid values: Either uniform or weighted.

Default value: uniform

scale_pos_weight

Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). A short calculation sketch follows this entry.

Optional

Valid values: Float.

Default value: 1
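
The sketch assumes the training labels are available as a NumPy array named y, with 1 marking the positive class.

import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])  # hypothetical labels: 4 negatives, 2 positives
scale_pos_weight = (y == 0).sum() / (y == 1).sum()  # sum(negative cases) / sum(positive cases) = 2.0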

seed

Random number seed.

Optional

Valid values: Integer.

Default value: 0

single_precision_histogram

When this flag is enabled, XGBoost uses single precision to build histograms instead of double precision. Used only if tree_method is set to hist or gpu_hist.

For a full list of valid inputs, please refer to XGBoost Parameters.

Optional

Valid values: String. Range: "true" or "false"

Default value: "false"

sketch_eps

Used only for the approximate greedy algorithm. This translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.

Optional

Valid values: Float, Range: [0, 1].

Default value: 0.03
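
For example, the default value of 0.03 corresponds to on the order of 1 / 0.03 ≈ 33 bins.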

skip_drop

Probability of skipping the dropout procedure during a boosting iteration.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0.0

subsample

Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This helps prevent overfitting.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

tree_method

The tree construction algorithm used in XGBoost.

Optional

Valid values: One of auto, exact, approx, hist, or gpu_hist.

Default value: auto

tweedie_variance_power

Parameter that controls the variance of the Tweedie distribution.

Optional

Valid values: Float. Range: (1, 2).

Default value: 1.5

updater

A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees.

For a full list of valid inputs, please refer to XGBoost Parameters.

Optional

Valid values: comma-separated string.

Default value: grow_colmaker, prune

use_dask_gpu_training

Set use_dask_gpu_training to "true" if you want to run distributed GPU training with Dask. Dask GPU training is supported only for versions 1.5-1 and later. Do not set this value to "true" for versions preceding 1.5-1. For more information, see Distributed GPU training. A configuration sketch follows this entry.

Optional

Valid values: String. Range: "true" or "false"

Default value: "false"

verbosity

Verbosity of printing messages.

Optional

Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug).

Default value: 1