How the k-NN Algorithm Works
Step 1: Sample
To specify the total number of data points to be sampled from the training dataset, use the sample_size parameter. For example, if the initial dataset has 1,000 data points, sample_size is set to 100, and the total number of instances is 2, then each worker samples 50 points, for a combined set of 100 data points. Sampling runs in linear time with respect to the number of data points.
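The per-worker split described above can be sketched as follows. This is an illustrative example, not the actual implementation; the sample_size parameter comes from the text, while the helper function and seeding scheme are assumptions.

```python
import random

def sample_per_worker(dataset, sample_size, num_workers, worker_id, seed=0):
    """Each of num_workers independently samples an equal share of sample_size."""
    per_worker = sample_size // num_workers  # e.g. 100 // 2 == 50
    rng = random.Random(seed + worker_id)    # distinct random stream per worker
    return rng.sample(dataset, per_worker)

# 1,000 data points, sample_size=100, 2 workers -> 50 points per worker
data = list(range(1000))
shards = [sample_per_worker(data, 100, 2, w) for w in range(2)]
combined = shards[0] + shards[1]
print(len(combined))  # 100
```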
Step 2: Perform Dimension Reduction
The current implementation of k-NN has two methods of dimension reduction. You specify the method in the dimension_reduction_type hyperparameter. The sign method specifies a random projection, which uses a linear projection with a matrix of random signs, and the fjlt method specifies a fast Johnson-Lindenstrauss transform, a method based on the Fourier transform. Both methods preserve the L2 and inner-product distances. The fjlt method should be used when the target dimension is large; it has better performance with CPU inference. The two methods differ in their computational complexity: the sign method requires O(ndk) time to reduce the dimension of a batch of n points of dimension d into a target dimension k, while the fjlt method requires O(nd log(d)) time, but with larger constants. Using dimension reduction introduces noise into the data, and this noise can reduce prediction accuracy.
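The random-sign projection can be sketched in a few lines of NumPy. This is a minimal illustration of the technique, not the library's implementation; scaling by 1/sqrt(k) is the standard choice for approximately preserving L2 distances.

```python
import numpy as np

def sign_projection(X, target_dim, seed=0):
    """Reduce an n x d batch to n x k with a random +/-1 matrix (O(ndk) time)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random sign matrix, scaled so L2 distances are approximately preserved
    R = rng.choice([-1.0, 1.0], size=(d, target_dim)) / np.sqrt(target_dim)
    return X @ R

X = np.random.default_rng(1).normal(size=(5, 20))  # 5 points of dimension 20
Z = sign_projection(X, target_dim=10)
print(Z.shape)  # (5, 10)
```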
Step 3: Build an Index
During inference, the algorithm queries the index for the k nearest neighbors of a sample point, then makes the classification or regression prediction based on the class labels or values of those neighbors. k-NN provides three different types of indexes: a flat index, an inverted index, and an inverted index with product quantization. You specify the type with the index_type parameter.
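A flat index is simply exact brute-force search. The following sketch, an illustration rather than the actual index implementation, shows a flat-index query followed by a majority-vote classification over the k neighbors:

```python
import numpy as np

def flat_knn_query(index_points, labels, query, k):
    """Flat (brute-force) index query: exact k nearest neighbors by L2 distance."""
    dists = np.linalg.norm(index_points - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return labels[nearest]

def classify(neighbor_labels):
    """Majority vote over the labels of the k nearest neighbors."""
    values, counts = np.unique(neighbor_labels, return_counts=True)
    return values[np.argmax(counts)]

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
result = classify(flat_knn_query(points, labels, np.array([0.05, 0.1]), k=3))
print(result)  # 0
```

For regression, the majority vote would be replaced by the mean of the neighbors' values.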
Serialize the Model
When the k-NN algorithm finishes training, it serializes three files to prepare for inference.

model_algo-1: Contains the serialized index for computing the nearest neighbors.

model_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing the predicted label based on the query result from the index.

model_algo-1.json: Contains the JSON-formatted model metadata, which stores the k and predictor_type hyperparameters from training, along with other relevant state, for use at inference.
With the current implementation of k-NN, you can modify the metadata file to change the way predictions are computed. For example, you can change k to 10 or change predictor_type to regressor.
{
  "k": 5,
  "predictor_type": "classifier",
  "dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
  "normalize": false,
  "version": "1.0"
}
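Editing the metadata amounts to loading the JSON, changing the fields, and writing it back. The snippet below is a hypothetical sketch of that workflow; the metadata content mirrors the example above, but the exact editing procedure is an assumption.

```python
import json

# Metadata as in the example above (inlined here so the sketch is self-contained)
metadata = json.loads(
    '{"k": 5, "predictor_type": "classifier", '
    '"dimension_reduction": {"type": "sign", "seed": 3, '
    '"target_dim": 10, "input_dim": 20}, '
    '"normalize": false, "version": "1.0"}'
)

# Change k to 10 and switch the predictor from classification to regression
metadata["k"] = 10
metadata["predictor_type"] = "regressor"

updated = json.dumps(metadata)  # write this back to the metadata file
print(metadata["k"], metadata["predictor_type"])  # 10 regressor
```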