Advanced model building configurations

Amazon SageMaker Canvas supports various advanced settings that you can configure when building a model. This page lists the advanced settings along with additional information about their options and configurations.

Note

The following advanced settings are currently only supported for numeric, categorical, and time series forecasting model types.

Advanced numeric and categorical prediction model settings

Canvas supports the following advanced settings for numeric and categorical prediction model types.

Objective metric

The objective metric is the metric that you want Canvas to optimize while building your model. If you don’t select a metric, Canvas chooses one for you by default. For descriptions of the available metrics, see the Metrics reference.

Training method

Canvas can automatically select the training method based on the dataset size, or you can select it manually. The following training methods are available for you to choose from:

  • Ensembling – SageMaker uses the AutoGluon library to train several base models. To find the best combination for your dataset, ensemble mode runs 5–10 trials with different model and meta parameter settings. These models are then combined using a stacking ensemble method to create an optimal predictive model. For a list of algorithms supported by ensemble mode for tabular data, see the following Algorithms section.

  • Hyperparameter optimization (HPO) – SageMaker finds the best version of a model by tuning hyperparameters using Bayesian optimization or multi-fidelity optimization while running training jobs on your dataset. HPO mode selects the algorithms that are most relevant to your dataset and the best range of hyperparameters to tune your models. To tune your models, HPO mode runs up to 100 trials (the default) to find the optimal hyperparameter settings within the selected range. If your dataset is smaller than 100 MB, SageMaker uses Bayesian optimization; if your dataset is larger than 100 MB, SageMaker uses multi-fidelity optimization.

    For a list of algorithms supported by HPO mode for tabular data, see the following Algorithms section.

  • Auto – SageMaker automatically chooses either ensembling mode or HPO mode based on your dataset size. If your dataset is larger than 100 MB, SageMaker chooses HPO mode. Otherwise, it chooses ensembling mode.
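The selection rules above can be summarized in a short sketch. The function names and string labels here are hypothetical, chosen only to illustrate the documented 100 MB threshold; they are not part of any SageMaker API.

```python
# Illustrative sketch of the documented training-method rules.
# MODE_THRESHOLD_MB and both function names are hypothetical.
MODE_THRESHOLD_MB = 100

def select_training_mode(dataset_size_mb: float) -> str:
    """Auto mode: datasets over 100 MB use HPO, otherwise ensembling."""
    return "HPO" if dataset_size_mb > MODE_THRESHOLD_MB else "ENSEMBLING"

def select_hpo_strategy(dataset_size_mb: float) -> str:
    """Within HPO mode: Bayesian optimization for small datasets,
    multi-fidelity optimization for large ones."""
    if dataset_size_mb > MODE_THRESHOLD_MB:
        return "MULTI_FIDELITY"
    return "BAYESIAN"
```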

Algorithms

In Ensembling mode, Canvas supports the following machine learning algorithms:

  • LightGBM – An optimized framework that uses tree-based algorithms with gradient boosting. This algorithm uses trees that grow in breadth, rather than depth, and is highly optimized for speed.

  • CatBoost – A framework that uses tree-based algorithms with gradient boosting. Optimized for handling categorical variables.

  • XGBoost – A framework that uses tree-based algorithms with gradient boosting that grows in depth, rather than breadth.

  • Random Forest – A tree-based algorithm that uses several decision trees on random sub-samples of the data with replacement. The trees are split into optimal nodes at each level. The decisions of each tree are averaged together to prevent overfitting and improve predictions.

  • Extra Trees – A tree-based algorithm that uses several decision trees on the entire dataset. The trees are split randomly at each level. The decisions of each tree are averaged to prevent overfitting and to improve predictions. Extra trees add a degree of randomization in comparison to the random forest algorithm.

  • Linear Models – A framework that uses a linear equation to model the relationship between two variables in observed data.

  • Neural network torch – A neural network model that's implemented using PyTorch.

  • Neural network fast.ai – A neural network model that's implemented using fast.ai.
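Canvas itself combines these base models with AutoGluon, but the stacking idea can be sketched with scikit-learn: several base learners from the families listed above make predictions, and a meta-model learns how to combine them. This is only an analogy, not Canvas's actual implementation.

```python
# Minimal stacking-ensemble sketch (scikit-learn, not AutoGluon):
# tree-based base models combined by a linear meta-model.
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

# Synthetic stand-in for a tabular dataset.
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Base models: two of the algorithm families listed above.
base_models = [
    ("random_forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("extra_trees", ExtraTreesRegressor(n_estimators=50, random_state=0)),
]

# The meta-model (final_estimator) learns to combine base predictions.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge())
stack.fit(X, y)
r2 = stack.score(X, y)  # R^2 on the training data
```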

In HPO mode, Canvas supports the following machine learning algorithms:

  • XGBoost – A supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.

  • Deep learning algorithm – A multilayer perceptron (MLP) and feedforward artificial neural network. This algorithm can handle data that is not linearly separable.

Data split

You have the option to specify how you want to split your dataset between the training set (the portion of your dataset used for building the model) and the validation set (the portion of your dataset used for verifying the model's accuracy). For example, a common split ratio is 80% training and 20% validation, where 80% of your data is used to build the model while 20% is saved for measuring model performance. If you don't specify a custom ratio, then Canvas splits your dataset automatically.
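Canvas performs its split internally, but the 80/20 ratio described above is the same split you would express with scikit-learn's `train_test_split`; this sketch only illustrates the ratio.

```python
# Illustrative 80/20 train/validation split on stand-in row indices.
from sklearn.model_selection import train_test_split

rows = list(range(100))  # stand-in for 100 dataset rows

# test_size=0.2 reserves 20% of the rows for validation.
train, validation = train_test_split(rows, test_size=0.2, random_state=42)
```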

Max candidates

Note

This feature is only available in the HPO training mode.

You can specify the maximum number of model candidates that Canvas generates while building your model. We recommend that you use the default number of candidates, which is 100, to build the most accurate models. The maximum number you can specify is 250. Decreasing the number of model candidates can reduce your model's accuracy.

Max job runtime

You can specify the maximum job runtime, or the maximum amount of time that Canvas spends building your model. When the time limit is reached, Canvas stops building and selects the best model candidate found so far.

The maximum time that you can specify is 720 hours. We highly recommend that you keep the maximum job runtime greater than 30 minutes to ensure that Canvas has enough time to generate model candidates and finish building your model.

Advanced time series forecasting model settings

For time series forecasting models, Canvas supports the following settings described in the previous section:

  • Objective metric

  • Algorithms

  • Max job runtime

Time series forecasting models also support the following advanced settings:

Aggregation

If you set the forecasting frequency lower than the frequency of your recorded data, Canvas aggregates any data points that don't match the new frequency. For example, if you have daily data points but would like to make a weekly forecast, you can set the frequency to weekly, and then Canvas combines all of the daily data points for each week into a single record. Aggregation is only supported for the target column, and the values in your timestamp column must be in datetime format.

The default aggregation method is to sum the values of the aggregated data points, but you can also set the aggregation method to the following:

  • avg – Canvas sets the record’s value to the average of all the aggregated data points.

  • first – Canvas sets the record’s value to the first value of the aggregated data points.

  • min – Canvas sets the record’s value to the minimum value found in the aggregated data points.

  • max – Canvas sets the record’s value to the maximum value found in the aggregated data points.
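The aggregation methods above behave like a pandas resample from a daily to a weekly frequency. This sketch uses made-up column and index values purely for illustration.

```python
# Daily-to-weekly aggregation sketch, analogous to what Canvas does
# when the forecast frequency is coarser than the recorded data.
import pandas as pd

# Seven daily data points for one week (hypothetical "demand" target).
daily = pd.DataFrame(
    {"demand": [1, 2, 3, 4, 5, 6, 7]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

weekly_sum = daily["demand"].resample("W").sum()      # default: sum
weekly_avg = daily["demand"].resample("W").mean()     # avg
weekly_first = daily["demand"].resample("W").first()  # first
weekly_min = daily["demand"].resample("W").min()      # min
weekly_max = daily["demand"].resample("W").max()      # max
```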

Forecast quantiles

For time series forecasting, SageMaker trains 6 model candidates with your target time series. Then, SageMaker combines these models using a stacking ensemble method to create an optimal forecasting model for a given objective metric. Each forecasting model generates a probabilistic forecast by producing forecasts at quantiles between p1 and p99. These quantiles are used to account for forecast uncertainty. By default, forecasts are generated for 0.1 (p10), 0.5 (p50), and 0.9 (p90). You can choose to specify up to five of your own quantiles from 0.01 (p1) to 0.99 (p99), in increments of 0.01 or higher.
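A quantile forecast can be read off from a sample of predictions with a percentile function; this sketch uses made-up sample values and NumPy's `percentile` only to illustrate what p10/p50/p90 mean, not how SageMaker computes them internally.

```python
# Reading off the default p10/p50/p90 quantiles from a sample of
# hypothetical point forecasts.
import numpy as np

forecast_samples = np.arange(1, 101)  # 100 made-up forecast values

# p10: 10% of outcomes fall below this value; p50 is the median;
# p90: 90% of outcomes fall below this value.
p10, p50, p90 = np.percentile(forecast_samples, [10, 50, 90])
```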