Amazon SageMaker
Developer Guide

K-Means Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the k-means training algorithm provided by Amazon SageMaker. For more information about how k-means clustering works, see How K-Means Clustering Works.
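As a minimal sketch of the string-to-string map described above, the following Python snippet assembles a few k-means hyperparameters and converts every value to a string. The helper name `to_hyperparameter_strings` and the sample values are illustrative assumptions, not part of the SageMaker API. The resulting dictionary is the kind of value that the `HyperParameters` field of a `CreateTrainingJob` request accepts.

```python
import json


def to_hyperparameter_strings(params):
    """Convert hyperparameter values to strings (hypothetical helper).

    Lists and dicts are JSON-encoded; everything else goes through str().
    """
    out = {}
    for name, value in params.items():
        if isinstance(value, (list, dict)):
            out[name] = json.dumps(value)
        else:
            out[name] = str(value)
    return out


# Sample values chosen for illustration only.
hyperparameters = to_hyperparameter_strings({
    "k": 10,
    "feature_dim": 784,
    "mini_batch_size": 5000,
    "init_method": "kmeans++",
    "eval_metrics": ["msd", "ssd"],
})
print(hyperparameters)
```

Keeping the typed values in one place and converting them at the last moment avoids hand-writing quoted numbers and escaped JSON.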

Parameter Name Description
k

Number of required clusters. Required.

Valid values: positive integer

Default value: -

feature_dim

Number of features in the input data. Required.

Valid values: positive integer

Default value: -

mini_batch_size

Number of observations per mini-batch for the data iterator. Optional.

Valid values: positive integer

Default value: 5000
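To illustrate what a mini-batch size means in practice, the sketch below splits a list of observations into consecutive batches of at most `mini_batch_size` items. The function name `iter_mini_batches` is hypothetical, and the built-in data iterator may behave differently (for example, in how it shuffles or handles the last batch).

```python
def iter_mini_batches(observations, mini_batch_size=5000):
    """Yield consecutive slices of at most mini_batch_size observations."""
    if mini_batch_size <= 0:
        raise ValueError("mini_batch_size must be a positive integer")
    for start in range(0, len(observations), mini_batch_size):
        yield observations[start:start + mini_batch_size]


# Twelve toy observations with two features each.
data = [[float(i), float(i) * 2.0] for i in range(12)]
batch_sizes = [len(batch) for batch in iter_mini_batches(data, mini_batch_size=5)]
print(batch_sizes)
```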

init_method

Method by which the algorithm chooses the initial cluster centers. The standard k-means approach chooses them at random. The alternative k-means++ method chooses the first cluster center at random, then spreads out the remaining initial centers by selecting each one with a probability proportional to the square of the distance of the candidate data point from the nearest existing center. Optional.

Valid values: Either random or kmeans++.

Default value: random
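The two initialization strategies can be sketched in plain Python as follows. This is an illustrative implementation of the textbook `random` and `kmeans++` seeding rules, not the code that SageMaker runs; the function names and the `seed` argument are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_init(points, k, seed=0):
    """Pick k distinct observations uniformly at random."""
    rng = random.Random(seed)
    return [list(p) for p in rng.sample(points, k)]


def kmeans_pp_init(points, k, seed=0):
    """Pick k centers with k-means++ style seeding."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Weight each point by the squared distance to its nearest center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            centers.append(list(rng.choice(points)))
            continue
        centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(random_init(points, 2))
print(kmeans_pp_init(points, 2))
```

Because points far from the existing centers carry larger weights, the `kmeans++` variant tends to place the initial centers in different regions of the data.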

extra_center_factor

The algorithm creates K centers as it runs, where K = k * extra_center_factor, and reduces the number of centers from K to k when finalizing the model. Optional.

Valid values: Either a positive integer or auto.

Default value: auto
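As a small worked example of the relationship between k, the extra center factor, and K, the sketch below computes the number of working centers for an integer factor. The function name `working_center_count` is hypothetical, and the rule used to resolve `auto` is not described here, so the sketch deliberately does not guess it.

```python
def working_center_count(k, extra_center_factor):
    """Return K = k * extra_center_factor for an integer factor."""
    if extra_center_factor == "auto":
        # The resolution of "auto" is left to the algorithm; no rule is assumed.
        raise ValueError("cannot compute K for 'auto' without knowing the rule")
    factor = int(extra_center_factor)
    if k <= 0 or factor <= 0:
        raise ValueError("k and extra_center_factor must be positive integers")
    return k * factor


# Hyperparameters arrive as strings, so "4" is accepted as well as 4.
print(working_center_count(10, "4"))
```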

local_lloyd_max_iter

Maximum number of iterations for Lloyd's expectation-maximization (EM) procedure used to build the final model containing k centers. Optional.

Valid values: positive integer

Default value: 300

local_lloyd_tol

Tolerance for change in loss for early stopping of Lloyd's EM procedure used to build the final model containing k centers. Optional.

Valid values: Float. Range in [0, 1].

Default value: 0.0001
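The interaction between the iteration cap and the tolerance can be sketched with a plain-Python version of Lloyd's procedure. This is a textbook illustration, not the SageMaker implementation. In particular, treating the tolerance as a bound on the relative decrease in loss between iterations is an assumption made for this example, as are the function and variable names.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=300, tol=0.0001):
    """Run Lloyd's procedure; return (centers, loss, iterations_run)."""
    if max_iter <= 0:
        raise ValueError("max_iter must be a positive integer")
    centers = [list(c) for c in centers]
    previous_loss = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        loss = 0.0
        for p in points:
            distances = [squared_distance(p, c) for c in centers]
            nearest = distances.index(min(distances))
            clusters[nearest].append(p)
            loss += distances[nearest]
        # Update step: move each non-empty cluster's center to its mean.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
        # Early stopping (assumed rule): relative decrease in loss <= tol.
        if previous_loss is not None:
            if previous_loss == 0 or (previous_loss - loss) / previous_loss <= tol:
                break
        previous_loss = loss
    # Note: loss is measured at assignment time, before the final update.
    return centers, loss, iteration


points = [[0, 0], [0, 2], [10, 0], [10, 2]]
final_centers, final_loss, iterations = lloyd(points, [[0, 0], [10, 0]])
print(final_centers, final_loss, iterations)
```

With a tolerance of 0 the loop stops only when the loss no longer decreases or the iteration cap is reached; larger tolerances stop earlier.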

local_lloyd_init_method

Initialization method for Lloyd's EM procedure used to build the final model containing k centers. Optional.

Valid values: Either random or kmeans++.

Default value: kmeans++

local_lloyd_num_trials

Number of times Lloyd's EM procedure is run when building the final model containing k centers. The run with the least loss is kept. Optional.

Valid values: Either a positive integer or auto.

Default value: auto
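The idea of running the procedure several times and keeping the run with the least loss can be sketched as follows. The helper `best_of_trials` and the stand-in trial function are hypothetical; in a real setting each trial would run Lloyd's procedure from a different initialization. How `auto` selects the number of trials is not covered by this sketch.

```python
def best_of_trials(run_trial, num_trials):
    """Call run_trial(index) num_trials times; keep the least-loss result.

    run_trial must return a (loss, model) pair.
    """
    if num_trials <= 0:
        raise ValueError("num_trials must be a positive integer")
    best = None
    for index in range(num_trials):
        loss, model = run_trial(index)
        if best is None or loss < best[0]:
            best = (loss, model)
    return best


# Stand-in for a real trial: fixed losses so the example is deterministic.
losses = [5.0, 2.0, 3.0]


def fake_trial(index):
    return losses[index], "model-%d" % index


best_loss, best_model = best_of_trials(fake_trial, num_trials=3)
print(best_loss, best_model)
```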

half_life_time_size

Used to determine the weight given to an observation when computing a cluster mean. This weight decays exponentially as more points are observed. When a point is first observed, it is assigned a weight of 1 when computing the cluster mean. The decay constant for the exponential decay function is chosen so that after observing half_life_time_size points, its weight is 1/2. If set to 0, there is no decay. Optional.

Valid values: non-negative integer

Default value: 0
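The half-life description above corresponds to a standard exponential decay, which can be sketched as a one-line formula. The function name `observation_weight` and its argument names are assumptions for illustration; the exact form used internally may differ as long as it satisfies the same half-life property.

```python
def observation_weight(points_observed_since, half_life_time_size):
    """Weight of an observation after a number of later points were observed.

    The weight starts at 1 and halves every half_life_time_size points.
    A half-life of 0 means no decay.
    """
    if points_observed_since < 0 or half_life_time_size < 0:
        raise ValueError("arguments must be non-negative")
    if half_life_time_size == 0:
        return 1.0
    return 0.5 ** (points_observed_since / half_life_time_size)


for n in (0, 100, 200):
    print(n, observation_weight(n, 100))
```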

epochs

Number of passes done over the training data. Optional.

Valid values: positive integer

Default value: 1

eval_metrics

JSON list of metric types used to report a score for the model. Allowed values are "msd" for mean squared distance and "ssd" for sum of squared distances. If test data is provided, the score is reported for each of the requested metrics. Optional.

Valid values: Either [\"msd\"] or [\"ssd\"] or [\"msd\",\"ssd\"].

Default value: [\"msd\"]
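The two metrics can be sketched in plain Python. The sketch treats `ssd` as the sum, and `msd` as the mean, of the squared distances from each test point to its nearest cluster center. That reading of the metric names is an assumption, and the function name `evaluate` is hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def evaluate(points, centers, metrics=("msd",)):
    """Score a set of centers on test points for the requested metrics."""
    if not points:
        raise ValueError("at least one test point is required")
    # Squared distance from each point to its nearest center.
    nearest = [min(squared_distance(p, c) for c in centers) for p in points]
    ssd = sum(nearest)
    scores = {}
    for metric in metrics:
        if metric == "ssd":
            scores[metric] = ssd
        elif metric == "msd":
            scores[metric] = ssd / len(points)
        else:
            raise ValueError("unknown metric: %r" % metric)
    return scores


test_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centers = [[0, 1], [10, 1]]
scores = evaluate(test_points, centers, metrics=("msd", "ssd"))
print(scores)
```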