Principal Component Analysis (PCA) Algorithm
PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.
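The two properties described above (uncorrelated components, ordered by decreasing variance) can be checked on a small example. The following is a minimal NumPy sketch on synthetic data, not the SageMaker implementation; the data, variable names, and the use of np.linalg.svd on the mean-centered matrix are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 observations of
# 3 correlated features built from 2 latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# Center the data, then take the SVD; the rows of Vt are the components.
Xc = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt

# Project the observations onto the components.
scores = Xc @ components.T

# Variance captured by each component, and the covariance between them.
variances = scores.var(axis=0, ddof=1)
score_cov = np.cov(scores, rowvar=False)

print("variance per component:", variances)
print("off-diagonal covariance is ~0:",
      np.allclose(score_cov, np.diag(variances), atol=1e-8))
```

The variances come out in descending order, and the covariance matrix of the projected data is diagonal up to floating-point error, which is what "uncorrelated with one another" means in practice.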
In Amazon SageMaker, PCA operates in two modes, depending on the scenario:
- regular: For datasets with sparse data and a moderate number of observations and features.
- randomized: For datasets with both a large number of observations and features. This mode uses an approximation algorithm.
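To give a feel for what an approximation algorithm for PCA can look like, here is a sketch of a generic randomized range-finder approach in NumPy. This is an assumption chosen for illustration: the source does not say which approximation SageMaker uses, and the function name randomized_pca and its parameters (oversample, power_iters, seed) are hypothetical.

```python
import numpy as np

def randomized_pca(X, num_components, oversample=10, power_iters=2, seed=0):
    """Approximate the top principal components with a random projection.

    Generic randomized technique for illustration only; not taken from
    the SageMaker implementation.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    k = min(num_components + oversample, min(Xc.shape))

    # Sample the column space of the centered data with a random matrix.
    Y = Xc @ rng.normal(size=(Xc.shape[1], k))

    # A few power iterations sharpen the estimate of the dominant subspace.
    for _ in range(power_iters):
        Q, _ = np.linalg.qr(Y)
        Y = Xc @ (Xc.T @ Q)
    Q, _ = np.linalg.qr(Y)

    # Solve the small k-by-d problem exactly instead of the full n-by-d one.
    B = Q.T @ Xc
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:num_components], s[:num_components]

# Synthetic low-rank data (an assumption): 500 observations, 50 features.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 5)) * np.array([10.0, 7.0, 5.0, 3.0, 2.0])
X = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(500, 50))

approx_components, approx_singular = randomized_pca(X, num_components=3)

# Compare against the exact decomposition of the centered data.
_, exact_singular, exact_Vt = np.linalg.svd(X - X.mean(axis=0),
                                            full_matrices=False)
print(approx_singular)
print(exact_singular[:3])
```

The trade-off is that the expensive decomposition runs on a small projected matrix rather than on the full dataset, at the cost of a small approximation error.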
PCA uses tabular data.
The rows represent observations you want to embed in a lower dimensional space. The columns represent features that you want to find a reduced approximation for. The algorithm calculates the covariance matrix (or an approximation thereof in a distributed manner), and then performs the singular value decomposition on this summary to produce the principal components.
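The covariance-then-SVD flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not SageMaker's code: the helper names covariance_from_chunks and principal_components are hypothetical, and splitting the rows into chunks stands in for a distributed computation where each worker contributes partial sums.

```python
import numpy as np

def covariance_from_chunks(chunks):
    """Build the covariance matrix from per-chunk partial sums."""
    n = 0
    total = None
    gram = None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        if total is None:
            d = chunk.shape[1]
            total = np.zeros(d)
            gram = np.zeros((d, d))
        n += chunk.shape[0]
        total += chunk.sum(axis=0)   # running sum of each feature
        gram += chunk.T @ chunk      # running sum of outer products
    mean = total / n
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    return cov, mean

def principal_components(cov, num_components):
    """SVD of the covariance summary; rows of the result are components."""
    U, s, _ = np.linalg.svd(cov)
    return U[:, :num_components].T, s[:num_components]

# Synthetic tabular data (an assumption): rows are observations,
# columns are features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

cov, mean = covariance_from_chunks(np.array_split(X, 3))
components, variances = principal_components(cov, num_components=2)

# Embed the observations in the lower-dimensional space.
embedded = (X - mean) @ components.T
print(embedded.shape)
```

Because the covariance matrix only has as many rows and columns as there are features, the decomposition step stays small even when the number of observations is large.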
Input/Output Interface for the PCA Algorithm
For training, PCA expects data provided in the train channel, and optionally supports a dataset passed to the test channel, which is scored by the final algorithm. Both recordIO-wrapped-protobuf and CSV formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.
For inference, PCA supports text/csv, application/json, and application/x-recordio-protobuf. Results are returned in either application/json or application/x-recordio-protobuf format with a vector of "projections."
For more information on input and output file formats, see PCA Response Formats for inference and the PCA Sample Notebooks.
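As a rough illustration of the inference formats, the sketch below builds a text/csv request body and parses an application/json response. The response layout used here (a top-level "projections" list whose items each hold a "projection" vector) is an assumption, so check it against PCA Response Formats; the helper names rows_to_csv and parse_projections and the sample values are hypothetical, and no endpoint is called.

```python
import json

def rows_to_csv(rows):
    """Serialize observations as a text/csv body, one row per line."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)

def parse_projections(response_body):
    """Extract one projection vector per observation from a JSON body.

    Assumes the layout {"projections": [{"projection": [...]}, ...]}.
    """
    payload = json.loads(response_body)
    return [item["projection"] for item in payload["projections"]]

# Request body for two observations with three features each.
csv_body = rows_to_csv([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Hand-written sample response (made-up values, not real model output).
sample_response = json.dumps({
    "projections": [
        {"projection": [0.5, -1.25]},
        {"projection": [2.0, 0.75]},
    ]
})

projections = parse_projections(sample_response)
print(csv_body)
print(projections)
```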
EC2 Instance Recommendation for the PCA Algorithm
PCA supports CPU and GPU instances for training and inference. Which instance type is most performant depends heavily on the specifics of the input data. For GPU instances, PCA supports P2, P3, G4dn, and G5.
PCA Sample Notebooks
For a sample notebook that shows how to use the SageMaker Principal Component Analysis algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see An Introduction to PCA with MNIST.