The Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm follows a multi-step training process that includes sampling the input data, performing dimension reduction, and building an index. During inference, the algorithm queries the indexed data to efficiently find the k nearest neighbors of a given data point and makes predictions based on the neighboring labels or values.
Step 1: Sample
To specify the total number of data points to be sampled from the training dataset, use the sample_size parameter. For example, if the initial dataset has 1,000 data points and sample_size is set to 100 with a total of two training instances, each worker samples 50 points, for a combined sample of 100 data points. Sampling runs in linear time with respect to the number of data points.
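As an illustration, the following sketch shows one way to set sample_size (along with other hyperparameters discussed in this topic) when launching a k-NN training job with the SageMaker Python SDK. The role ARN, region, and S3 paths are placeholder assumptions you would replace with your own values.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Placeholder values: substitute your own role ARN, region, and S3 paths.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
region = "us-east-1"

# Resolve the built-in k-NN container image for the region.
knn_image = image_uris.retrieve(framework="knn", region=region)

knn = Estimator(
    image_uri=knn_image,
    role=role,
    instance_count=2,            # two workers; each samples sample_size / 2 points
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/knn/output",
)

# Hyperparameters covered in this topic.
knn.set_hyperparameters(
    k=5,
    sample_size=100,             # total points sampled across all workers
    predictor_type="classifier",
    feature_dim=20,
)

# knn.fit({"train": "s3://my-bucket/knn/train"})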
Step 2: Perform Dimension Reduction
The current implementation of the k-NN algorithm has two methods of dimension reduction. You specify the method in the dimension_reduction_type hyperparameter. The sign method specifies a random projection using a matrix of random signs, and the fjlt method specifies a fast Johnson-Lindenstrauss transform, a method based on the Fourier transform. Both methods approximately preserve L2 and inner product distances. The fjlt method should be used when the target dimension is large, and it has better performance with CPU inference. The methods differ in their computational complexity: the sign method requires O(ndk) time to reduce the dimension of a batch of n points of dimension d into a target dimension k, while the fjlt method requires O(nd log(d)) time, but the constants involved are larger. Using dimension reduction introduces noise into the data, and this noise can reduce prediction accuracy.
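To build intuition for the sign method, the following numpy sketch applies a random sign projection and checks how well a pairwise L2 distance is preserved. It is a conceptual illustration only, not SageMaker's internal implementation; the dimensions and seed mirror the metadata example shown later in this topic.

import numpy as np

rng = np.random.default_rng(seed=3)

n, d, k = 1000, 20, 10          # n points, input dimension d, target dimension k
X = rng.standard_normal((n, d))

# Random projection with a matrix of signs in {-1, +1},
# scaled by 1/sqrt(k) so that L2 distances are preserved in expectation.
R = rng.choice([-1.0, 1.0], size=(d, k))
X_reduced = X @ R / np.sqrt(k)

# Compare the distance between the first two points before and after projection.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(X_reduced[0] - X_reduced[1])
print(f"original distance: {orig:.3f}, projected distance: {proj:.3f}")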
Step 3: Build an Index
During inference, the algorithm queries the index for the k nearest neighbors of a sample point and makes the classification or regression prediction based on the class labels or values of those neighbors. k-NN provides three different types of indexes: a flat index, an inverted index, and an inverted index with product quantization. You specify the type with the index_type parameter.
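These three index types correspond conceptually to the flat, IVF, and IVFPQ index families in the open-source Faiss library. The sketch below builds and queries each type with Faiss to show the trade-off between exact search, clustered search, and compressed search; it is an illustration under that assumption, not the code SageMaker runs internally.

import faiss
import numpy as np

rng = np.random.default_rng(0)
d = 10                                   # dimension after reduction
xb = rng.standard_normal((10000, d)).astype("float32")  # indexed points
xq = rng.standard_normal((5, d)).astype("float32")      # query points

# Flat index: exact brute-force nearest-neighbor search.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# Inverted index: clusters the data into nlist cells and searches
# only the cells nearest to the query.
nlist = 100
quantizer_ivf = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer_ivf, d, nlist)
ivf.train(xb)
ivf.add(xb)

# Inverted index with product quantization: compresses vectors with
# m sub-quantizers for lower memory use at some cost in accuracy.
m = 5                                    # must divide d
quantizer_pq = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer_pq, d, nlist, m, 8)
ivfpq.train(xb)
ivfpq.add(xb)

# Query each index for the k nearest neighbors.
k = 5
for name, index in [("flat", flat), ("ivf", ivf), ("ivfpq", ivfpq)]:
    distances, ids = index.search(xq, k)
    print(name, ids[0])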
Serialize the Model
When the k-NN algorithm finishes training, it serializes three files to prepare for inference.
- model_algo-1: Contains the serialized index for computing the nearest neighbors.
- model_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing the predicted label based on the query result from the index.
- model_algo-1.json: Contains the JSON-formatted model metadata, which stores the k and predictor_type hyperparameters from training for use at inference, along with other relevant state.
With the current implementation of k-NN, you can modify the metadata file to change the way predictions are computed. For example, you can change k to 10 or change predictor_type to regressor.
{
    "k": 5,
    "predictor_type": "classifier",
    "dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
    "normalize": false,
    "version": "1.0"
}
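As a concrete example, the following sketch edits the metadata inside a downloaded training artifact to raise k to 10. SageMaker packages training output as model.tar.gz; the local paths and the re-packing step are assumptions for illustration, so adjust them to match your own workflow.

import json
import tarfile

# Unpack the model artifact downloaded from the training job's S3 output.
with tarfile.open("model.tar.gz", "r:gz") as tar:
    tar.extractall("model")

# Load the metadata, change how predictions are computed, and write it back.
with open("model/model_algo-1.json") as f:
    metadata = json.load(f)

metadata["k"] = 10                          # use 10 neighbors instead of 5
# metadata["predictor_type"] = "regressor"  # or switch to regression

with open("model/model_algo-1.json", "w") as f:
    json.dump(metadata, f)

# Re-pack the three serialized files so the modified model can be deployed.
with tarfile.open("model-modified.tar.gz", "w:gz") as tar:
    for name in ("model_algo-1", "model_algo-1.labels", "model_algo-1.json"):
        tar.add(f"model/{name}", arcname=name)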