Random Cut Forest (RCF) Algorithm - Amazon SageMaker

Random Cut Forest (RCF) Algorithm

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.

While there are many applications of anomaly detection algorithms to one-dimensional time series data such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-dimensional input. Amazon SageMaker RCF scales well with respect to number of features, data set size, and number of instances.

Input/Output Interface for the RCF Algorithm

Amazon SageMaker Random Cut Forest supports the train and test data channels. The optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either application/x-recordio-protobuf or text/csv formats. For the test data, when using text/csv format, the content must be specified as text/csv;label_size=1 where the first column of each row represents the anomaly label: "1" for an anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to train RCF models on data that is formatted as recordIO-wrapped-protobuf or as CSV

Also note that the train channel only supports S3DataDistributionType=ShardedByS3Key and the test channel only supports S3DataDistributionType=FullyReplicated. The S3 distribution type can be specified using the Amazon SageMaker Python SDK as follows:

import sagemaker # specify Random Cut Forest training job information and hyperparameters rcf = sagemaker.estimator.Estimator(...) # explicitly specify "ShardedByS3Key" distribution type train_data = sagemaker.inputs.s3_input( s3_data=s3_training_data_location, content_type='text/csv;label_size=0', distribution='ShardedByS3Key') # run the training job on input data stored in S3 rcf.fit({'train': train_data})

To avoid common errors around execution roles, ensure that you have the execution roles required, AmazonSageMakerFullAccess and AmazonEC2ContainerRegistryFullAccess. To avoid common errors around your image not existing or its permissions being incorrect, ensure that your ECR image is not larger then the allocated disk space on the training instance. To avoid this, run your training job on an instance that has sufficient disk space. In addition, if your ECR image is from a different AWS account's Elastic Container Service (ECS) repository, and you do not set repository permissions to grant access, this will result in an error. See the ECR repository permissions for more information on setting a repository policy statement.

See the S3DataSource for more information on customizing the S3 data source attributes. Finally, in order to take advantage of multi-instance training the training data must be partitioned into at least as many files as instances.

For inference, RCF supports application/x-recordio-protobuf, text/csv and application/json input data content types. See the Common Data Formats for Built-in Algorithms documentation for more information. RCF inference returns application/x-recordio-protobuf or application/json formatted output. Each record in these output data contains the corresponding anomaly scores for each input data point. See Common Data Formats--Inference for more information.

For more information on input and output file formats, see RCF Response Formats for inference and the RCF Sample Notebooks.

Instance Recommendations for the RCF Algorithm

For training, we recommend the ml.m4, ml.c4, and ml.c5 instance families. For inference we recommend using a ml.c5.xl instance type in particular, for maximum performance as well as minimized cost per hour of usage. Although the algorithm could technically run on GPU instance types it does not take advantage of GPU hardware.

RCF Sample Notebooks

For an example of how to train an RCF model and perform inferences with it, see the An Introduction to SageMaker Random Cut Forests notebook. For instructions how to create and access Jupyter notebook instances that you can use to run the example in SageMaker, see Use Amazon SageMaker Notebook Instances. Once you have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker samples. To open a notebook, click on its Use tab and select Create copy.