Amazon SageMaker Autopilot datasets and problem types - Amazon SageMaker

Amazon SageMaker Autopilot datasets and problem types

Amazon SageMaker Autopilot gives you the option in Studio or with the AutoML API of specifying a problem type, such as binary classification or regression, or of detecting it on your behalf based on the data you provide. Autopilot supports tabular data in which each column contains a feature with a specific data type and each row contains an observation.

Autopilot datasets, data types, and formats

Autopilot supports tabular data formatted as CSV files or as Parquet files. For tabular data, each column contains a feature with a specific data type and each row contains an observation. The properties of these two file formats differ considerably.

  • CSV (comma-separated-values) is a row-based file format that stores data in human readable plaintext which a popular choice for data exchange as they are supported by a wide range of applications.

  • Parquet is a column-based file format where the data is stored and processed more efficiently than row-based file formats. This makes them a better option for big data problems.

The data types accepted for columns include numerical, categorical, text, and time series that consists of strings of comma-separate numbers. If Autopilot detects it is dealing with time series sequences, it processes them through specialized feature transformers provided by the tsfresh library. This library takes the time series as an input and outputs a feature such as the highest absolute value of the time series or descriptive statistics on autocorrelation. These outputted features are then used as inputs to one of the three problem types.

Autopilot supports building machine learning models on large datasets up to hundreds of GBs. For details on the default resource limits for input datasets and how to increase them, see Amazon SageMaker Autopilot quotas

How to specify training and validation datasets

When using CreateAutoMLJob to create an AutoML job, you must use the InputDataConfig parameter to specify the AutoMLChannel objects that provide input data sources. Each AutoMLChannel has a ChannelType, which can be set to either training or validation values that specify how the data is to be used when building a machine learning model. At least one data source must be provided and a maximum of two data sources is allowed: one for training data and one for validation data.

How you split the data into training and validation datasets depends on whether you have one or two data sources.

  • If you only have one data source, the ChannelType is set to training by default and must have this value.

    • If the ValidationFraction value in AutoMLDataSplitConfig is not set, 0.2 (20%) of the data from this source is used for validation by default.

    • If the ValidationFraction is set to a value between 0 and 1, the dataset is split based on the value specified, where the value specifies the fraction of the dataset used for validation.

  • If you have two data sources, the ChannelType of one of the AutoMLChannel objects must be set to training, the default value. The ChannelType of the other data source must be set to validation. The two data sources must have the same format, either CSV or Parquet, and the same schema. You must not set the value for the ValidationFraction in this case because all of the data from each source is used for either training or validation. Setting this value will cause an error.

Amazon SageMaker Autopilot problem types

You set the type of problem with the CreateAutoPilot.ProblemType parameter. This limits the kind of preprocessing and algorithms that Autopilot tries. After the job is finished, if you had set the CreateAutoPilot.ProblemType, then the ResolvedAttribute.ProblemType will match the ProblemType you set. If you keep it blank (or null), the ProblemType will be whatever Autopilot decides on your behalf.

Note

In some cases, Autopilot is unable to infer the ProblemType with high enough confidence, in which case you must provide the value for the job to succeed.

Your problem type options are as follows:

Regression

Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. An example is the prediction of house prices using features like the number of bathrooms and bedrooms, square footage of the house and garden. Regression analysis can create a model that takes one or more of these features as an input and predicts the price of a house.

Binary classification

Binary classification is a type of supervised learning that assigns an individual to one of two predefined and mutually exclusive classes based on their attributes. It is supervised because the models are trained using examples where the attributes are provided with correctly labelled objects. A medical diagnosis for whether an individual has a disease or not based on the results of diagnostic tests is an example of binary classification.

Multiclass classification

Multiclass classification is a type of supervised learning that assigns an individual to one of several classes based on their attributes. It is supervised because the models are trained using examples where the attributes are provided with correctly labelled objects. An example is the prediction of the topic most relevant to a text document. A document may be classified as being about, say, religion or politics or finance, or about one of several other predefined topic classes.