K-Means Algorithm
K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity.
Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, the version used by Amazon SageMaker streams mini-batches (small, random subsets) of the training data. For more information about mini-batch k-means, see Web-scale k-means Clustering.
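The mini-batch idea can be sketched in a few lines of Python. This is a minimal illustration of generic mini-batch k-means, in which each center keeps a per-center learning rate of 1/count. It is not the modified version that Amazon SageMaker runs, and the function name, parameters, and defaults are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_kmeans(points, k, batch_size=16, iterations=50, seed=0):
    """Illustrative mini-batch k-means (hypothetical helper, not the SageMaker code)."""
    rng = random.Random(seed)
    # Start from k distinct observations chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(iterations):
        # Draw a small random subset instead of scanning the full dataset.
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign every batch point using the centers as they stand now.
        nearest = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in batch
        ]
        # Move each chosen center toward its point with a shrinking step size.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            rate = 1.0 / counts[j]
            centers[j] = [(1.0 - rate) * c + rate * x for c, x in zip(centers[j], p)]
    return centers
```

Because each update is a weighted average of a center and a data point, the centers always stay within the range of the observed data.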
The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The n attributes in each row represent a point in n-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations. The algorithm groups observations with similar attribute values (the points corresponding to these observations are closer together). For more information about how k-means works in Amazon SageMaker, see How K-Means Clustering Works.
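To make the distance-based notion of similarity concrete, the following sketch computes the Euclidean distance between two rows and finds the closest of a set of cluster centers. The function names are chosen for this example only and are not part of any SageMaker API.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_center(point, centers):
    """Return (index, distance) of the center nearest to the given point."""
    distances = [euclidean(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


# An observation with attributes (3, 4) is closer to the center at the origin.
print(closest_center((3.0, 4.0), [(0.0, 0.0), (10.0, 10.0)]))  # (0, 5.0)
```

Because every attribute contributes to the distance on its own scale, attributes with large numeric ranges dominate the result unless the data is scaled beforehand.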
Input/Output Interface for the K-Means Algorithm
For training, the k-means algorithm expects data to be provided in the train channel (recommended S3DataDistributionType=ShardedByS3Key), with an optional test channel (recommended S3DataDistributionType=FullyReplicated) to score the data on. Both recordIO-wrapped-protobuf and CSV formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.
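As a rough illustration, the two channels might be described with plain dictionaries shaped like the InputDataConfig parameter of a training job request. The build_channel helper and the S3 URIs are hypothetical placeholders, and the field names reflect our understanding of the request shape, so verify them against the API reference before use.

```python
def build_channel(name, s3_uri, distribution,
                  content_type="application/x-recordio-protobuf",
                  input_mode="File"):
    """Build one channel entry (hypothetical helper for illustration)."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,  # placeholder location, replace with your own
                "S3DataDistributionType": distribution,
            }
        },
        "ContentType": content_type,
        "InputMode": input_mode,  # "File" or "Pipe"
    }


input_data_config = [
    # Training data is sharded across instances by S3 key.
    build_channel("train", "s3://example-bucket/kmeans/train/", "ShardedByS3Key"),
    # The optional test channel is copied in full to every instance.
    build_channel("test", "s3://example-bucket/kmeans/test/", "FullyReplicated"),
]
```

Sharding the train channel lets each instance stream a distinct part of the data, while replicating the test channel means that every instance scores against the same set.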
For inference, text/csv, application/json, and application/x-recordio-protobuf are supported. k-means returns a closest_cluster label and the distance_to_cluster for each observation.
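The following sketch shows one way to read these two fields from an application/json response. It assumes that the response body wraps the per-observation results in a top-level "predictions" list. That layout is an assumption for this example, so confirm it against K-Means Response Formats. The sample payload is made up for illustration.

```python
import json


def parse_kmeans_response(body):
    """Return a list of (closest_cluster, distance_to_cluster) pairs.

    Assumes a JSON body with a top-level "predictions" list (not confirmed here).
    """
    payload = json.loads(body)
    return [
        (int(row["closest_cluster"]), float(row["distance_to_cluster"]))
        for row in payload["predictions"]
    ]


# Made-up response body, for illustration only.
sample_body = json.dumps({
    "predictions": [
        {"closest_cluster": 1.0, "distance_to_cluster": 3.0},
        {"closest_cluster": 2.0, "distance_to_cluster": 5.0},
    ]
})
print(parse_kmeans_response(sample_body))  # [(1, 3.0), (2, 5.0)]
```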
For more information on input and output file formats, see K-Means Response Formats for inference and the K-Means Sample Notebooks. The k-means algorithm does not support multiple instance learning, in which the training set consists of labeled “bags”, each of which is a collection of unlabeled instances.
EC2 Instance Recommendation for the K-Means Algorithm
We recommend training k-means on CPU instances. You can train on GPU instances, but should limit GPU training to single-GPU instances (such as ml.g4dn.xlarge) because only one GPU is used per instance. The k-means algorithm supports P2, P3, G4dn, and G5 instances for training and inference.
K-Means Sample Notebooks
For a sample notebook that uses the SageMaker K-means algorithm to segment the population of counties in the United States by attributes identified using principal component analysis, see Analyze US census data for population segmentation using Amazon SageMaker.