K-Nearest Neighbors (k-NN) Algorithm - Amazon SageMaker

K-Nearest Neighbors (k-NN) Algorithm

Amazon SageMaker k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression. For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequently used label of their class as the predicted label. For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their feature values as the predicted value.

Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building. Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction, the algorithm decreases the feature dimension of the data to reduce the footprint of the k-NN model in memory and inference latency. We provide two methods of dimension reduction methods: random projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for high-dimensional (d >1000) datasets to avoid the “curse of dimensionality” that troubles the statistical analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is to construct the index. The index enables efficient lookups of distances between points whose values or class labels have not yet been determined and the k nearest points to use for inference.

Input/Output Interface for the k-NN Algorithm

SageMaker k-NN supports train and test data channels.

  • Use a train channel for data that you want to sample and construct into the k-NN index.

  • Use a test channel to emit scores in log files. Scores are listed as one line per mini-batch: accuracy for classifier, mean-squared error (mse) for regressor for score.

For training inputs, k-NN supports text/csv and application/x-recordio-protobuf data formats. For input type text/csv, the first label_size columns are interpreted as the label vector for that row. You can use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.

For inference inputs, k-NN supports the application/json, application/x-recordio-protobuf, and text/csv data formats. The text/csv format accepts a label_size and encoding parameter. It assumes a label_size of 0 and a UTF-8 encoding.

For inference outputs, k-NN supports the application/json and application/x-recordio-protobuf data formats. These two data formats also support a verbose output mode. In verbose output mode, the API provides the search results with the distances vector sorted from smallest to largest, and corresponding elements in the labels vector.

For batch transform, k-NN supports the application/jsonlines data format for both input and output. An example input is as follows:

content-type: application/jsonlines {"features": [1.5, 16.0, 14.0, 23.0]} {"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}

An example output is as follows:

accept: application/jsonlines {"predicted_label": 0.0} {"predicted_label": 2.0}

For more information on input and output file formats, see Data Formats for k-NN Training Input for training, k-NN Request and Response Formats for inference, and the k-NN Sample Notebooks.

k-NN Sample Notebooks

For a sample notebook that uses the SageMaker k-nearest neighbor algorithm to predict wilderness cover types from geological and forest service data, see the K-Nearest Neighbor Covertype .

Use a Jupyter notebook instance to run the example in SageMaker. To learn how to create and open a Jupyter notebook instance in SageMaker, see Amazon SageMaker Notebook Instances. Once you have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker example notebooks. Find K-Nearest Neighbor notebooks in the Introduction to Amazon algorithms section. To open a notebook, click on its Use tab and select Create copy.

EC2 Instance Recommendation for the k-NN Algorithm

We recommend training on a CPU instance (such as ml.m5.2xlarge) or on a GPU instance. The k-NN algorithm supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

Inference requests from CPUs generally have a lower average latency than requests from GPUs because there is a tax on CPU-to-GPU communication when you use GPU hardware. However, GPUs generally have higher throughput for larger batches.