Amazon SageMaker
Developer Guide

K-Means Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the k-means training algorithm provided by Amazon SageMaker. For more information about how k-means clustering works, see How K-Means Clustering Works.

Parameter Name Description
feature_dim

The number of features in the input data.

Required

Valid values: Positive integer

k

The number of required clusters.

Required

Valid values: Positive integer

epochs

The number of passes done over the training data.

Optional

Valid values: Positive integer

Default value: 1

eval_metrics

A JSON list of metric types used to report a score for the model. Allowed values are msd for Means Square Error and ssd for Sum of Square Distance. If test data is provided, the score is reported for each of the metrics requested.

Optional

Valid values: Either [\"msd\"] or [\"ssd\"] or [\"msd\",\"ssd\"] .

Default value: [\"msd\"]

extra_center_factor

The algorithm creates K centers = num_clusters * extra_center_factor as it runs and reduces the number of centers from K to k when finalizing the model.

Optional

Valid values: Either a positive integer or auto.

Default value: auto

half_life_time_size

Used to determine the weight given to an observation when computing a cluster mean. This weight decays exponentially as more points are observed. When a point is first observed, it is assigned a weight of 1 when computing the cluster mean. The decay constant for the exponential decay function is chosen so that after observing half_life_time_size points, its weight is 1/2. If set to 0, there is no decay.

Optional

Valid values: Non-negative integer

Default value: 0

init_method

Method by which the algorithm chooses the initial cluster centers. The standard k-means approach chooses them at random. An alternative k-means++ method chooses the first cluster center at random. Then it spreads out the position of the remaining initial clusters by weighting the selection of centers with a probability distribution that is proportional to the square of the distance of the remaining data points from existing centers.

Optional

Valid values: Either random or kmeans++.

Default value: random

local_lloyd_init_method

The initialization method for Lloyd's expectation-maximization (EM) procedure used to build the final model containing k centers.

Optional

Valid values: Either random or kmeans++.

Default value: kmeans++

local_lloyd_max_iter

The maximum number of iterations for Lloyd's expectation-maximization (EM) procedure used to build the final model containing k centers.

Optional

Valid values: Positive integer

Default value: 300

local_lloyd_num_trials

The number of times the Lloyd's expectation-maximization (EM) procedure with the least loss is run when building the final model containing k centers.

Optional

Valid values: Either a positive integer or auto.

Default value: auto

local_lloyd_tol

The tolerance for change in loss for early stopping of Lloyd's expectation-maximization (EM) procedure used to build the final model containing k centers.

Optional

Valid values: Float. Range in [0, 1].

Default value: 0.0001

mini_batch_size

The number of observations per mini-batch for the data iterator.

Optional

Valid values: Positive integer

Default value: 5000