Amazon SageMaker Autopilot datasets and problem types
In Studio or through the AutoML API, Amazon SageMaker Autopilot lets you either specify a problem type, such as binary classification or regression, or have Autopilot detect it on your behalf based on the data you provide. Autopilot supports tabular data in which each column contains a feature with a specific data type and each row contains an observation.
Autopilot datasets, data types, and formats
Autopilot supports tabular data formatted as CSV files or as Parquet files. For tabular data, each column contains a feature with a specific data type and each row contains an observation. The properties of these two file formats differ considerably.
- CSV (comma-separated values) is a row-based file format that stores data in human-readable plaintext. It is a popular choice for data exchange because it is supported by a wide range of applications.
- Parquet is a column-based file format in which data is stored and processed more efficiently than in row-based file formats. This makes it a better option for big data problems.
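To make the row-based versus column-based distinction concrete, the following minimal sketch reads a small CSV string with the Python standard library and then regroups the same values by column. It only illustrates the two layouts and does not read or write actual Parquet files. The sample data and column names are hypothetical.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
csv_text = "bedrooms,bathrooms,price\n3,2,250000\n4,3,340000\n"

# Row-oriented view (how CSV stores data): one record per line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Column-oriented view (the layout idea behind Parquet): one sequence per feature.
columns = {name: [row[name] for row in rows] for name in rows[0]}

print(rows[0])           # all features of the first observation
print(columns["price"])  # all observations of a single feature
```

Reading a single feature from the column-oriented structure touches only that feature's values, which is one reason column-based formats suit large analytical datasets.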
The data types accepted for columns include numerical, categorical, text, and time series that consist of strings of comma-separated numbers. If Autopilot detects that it is dealing with time series sequences, it processes them through specialized feature transformers provided by the tsfresh library.
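As an illustration of the time series representation described above, the following sketch shows what such a column might look like in a CSV file and how each cell maps to a sequence of numbers. It assumes the cell is quoted so that its internal commas are not read as column separators; the source does not specify the quoting, and the column names and values are hypothetical. This is not Autopilot's own parsing code.

```python
import csv
import io
from typing import List

# Hypothetical CSV: "readings" holds a time series as a quoted string
# of comma-separated numbers (quoting is an assumption of this sketch).
csv_text = (
    "device_id,readings,label\n"
    'A1,"0.5,0.7,1.2,0.9",ok\n'
    'B2,"2.1,2.0,1.8,1.7",faulty\n'
)

def parse_series(cell: str) -> List[float]:
    """Convert one time series cell into a list of floats."""
    return [float(value) for value in cell.split(",")]

parsed = {
    row["device_id"]: parse_series(row["readings"])
    for row in csv.DictReader(io.StringIO(csv_text))
}
print(parsed["A1"])
```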
Autopilot supports building machine learning models on large datasets of up to hundreds of GBs. For details on the default resource limits for input datasets and how to increase them, see Amazon SageMaker Autopilot quotas.
How to specify training and validation datasets
When using CreateAutoMLJob to create an AutoML job, you must use the InputDataConfig parameter to specify the AutoMLChannel objects that provide input data sources. Each AutoMLChannel has a ChannelType, which can be set to either training or validation to specify how the data is to be used when building a machine learning model. At least one data source must be provided and a maximum of two data sources is allowed: one for training data and one for validation data.
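The following minimal sketch builds an InputDataConfig list as plain Python dictionaries and enforces the one-or-two data source rule described above. The helper names, bucket paths, and target column are hypothetical. The dictionary keys follow the CreateAutoMLJob request shape as commonly documented, so check them against the API reference before use. No AWS call is made.

```python
from typing import Dict, List

VALID_CHANNEL_TYPES = ("training", "validation")

def make_channel(s3_uri: str, target: str, channel_type: str = "training") -> Dict:
    """Build one AutoMLChannel dictionary (hypothetical helper)."""
    if channel_type not in VALID_CHANNEL_TYPES:
        raise ValueError(f"ChannelType must be one of {VALID_CHANNEL_TYPES}")
    return {
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_uri}
        },
        "TargetAttributeName": target,
        "ChannelType": channel_type,
    }

def make_input_data_config(channels: List[Dict]) -> List[Dict]:
    """Check the channel count and that two channels have distinct types."""
    if not 1 <= len(channels) <= 2:
        raise ValueError("Provide one or two data sources")
    types = [channel["ChannelType"] for channel in channels]
    if len(set(types)) != len(types):
        raise ValueError("Two data sources need different ChannelType values")
    return list(channels)

# Hypothetical S3 locations and target column.
input_data_config = make_input_data_config([
    make_channel("s3://example-bucket/train/", "price"),
    make_channel("s3://example-bucket/validation/", "price", "validation"),
])
print([channel["ChannelType"] for channel in input_data_config])
```

The resulting list could then be passed as the InputDataConfig argument of a CreateAutoMLJob request, alongside the other required arguments that this sketch omits.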
How you split the data into training and validation datasets depends on whether you have one or two data sources.
- If you only have one data source, the ChannelType is set to training by default and must have this value.
  - If the ValidationFraction value in AutoMLDataSplitConfig is not set, 0.2 (20%) of the data from this source is used for validation by default.
  - If the ValidationFraction is set to a value between 0 and 1, the dataset is split based on the value specified, where the value specifies the fraction of the dataset used for validation.
- If you have two data sources, the ChannelType of one of the AutoMLChannel objects must be set to training, the default value. The ChannelType of the other data source must be set to validation. The two data sources must have the same format, either CSV or Parquet, and the same schema. You must not set the value for the ValidationFraction in this case because all of the data from each source is used for either training or validation. Setting this value will cause an error.
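The rules above can be sketched as a small client-side check that returns the fragment to merge into AutoMLJobConfig. The function name is hypothetical, and treating "between 0 and 1" as excluding both endpoints is an assumption of this sketch. The service performs its own validation, so this only mirrors the documented behavior.

```python
from typing import Dict, List, Optional

def build_data_split_config(
    channel_types: List[str],
    validation_fraction: Optional[float] = None,
) -> Dict:
    """Return the AutoMLJobConfig fragment for the data split (hypothetical helper)."""
    types = sorted(channel_types)
    if types == ["training"]:
        if validation_fraction is None:
            # Nothing to send: Autopilot uses 0.2 (20%) for validation by default.
            return {}
        # Assumption: both endpoints are excluded.
        if not 0 < validation_fraction < 1:
            raise ValueError("ValidationFraction must be between 0 and 1")
        return {"DataSplitConfig": {"ValidationFraction": validation_fraction}}
    if types == ["training", "validation"]:
        if validation_fraction is not None:
            # Each source is used in full, so a fraction is not allowed.
            raise ValueError("Do not set ValidationFraction with two data sources")
        return {}
    raise ValueError("Expected one training source, or one training and one validation source")

print(build_data_split_config(["training"], 0.3))
print(build_data_split_config(["training", "validation"]))
```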
How to select features for training
You can manually select the features to be used in training with the FeatureSpecificationS3Uri attribute of AutoMLCandidateGenerationConfig within the CreateAutoMLJob API, using the following format.
{ "AutoMLJobConfig": { "CandidateGenerationConfig": { "FeatureSpecificationS3Uri":"string" } } }
Selected features should be contained within a JSON file in the following format:
{ "FeatureAttributeNames":["col1", "col2", ...] }
The values listed in ["col1", "col2", ...] are case sensitive. The list should contain unique strings, each of which is the name of a column in the input data.
The list of columns provided as features cannot include the target column.
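The constraints on the feature list can be checked before the JSON file is uploaded. The following sketch, with a hypothetical helper name and hypothetical column names, validates a selection and serializes it in the format shown above. Uploading the file to Amazon S3 and referencing it from the job configuration are left out.

```python
import json
from typing import List

def build_feature_spec(selected: List[str], input_columns: List[str], target: str) -> str:
    """Validate the selected features and return the JSON file contents."""
    if len(set(selected)) != len(selected):
        raise ValueError("Feature names must be unique")
    if target in selected:
        raise ValueError("The target column cannot be listed as a feature")
    # Membership is case sensitive, so "Bedrooms" does not match "bedrooms".
    unknown = [name for name in selected if name not in input_columns]
    if unknown:
        raise ValueError(f"Not columns of the input data: {unknown}")
    return json.dumps({"FeatureAttributeNames": list(selected)})

# Hypothetical dataset columns and target.
spec_json = build_feature_spec(
    selected=["bedrooms", "bathrooms"],
    input_columns=["bedrooms", "bathrooms", "sqft", "price"],
    target="price",
)
print(spec_json)
```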
Amazon SageMaker Autopilot problem types
You set the type of problem with the ProblemType parameter of CreateAutoMLJob. This limits the kind of preprocessing and algorithms that Autopilot tries. After the job is finished, if you set ProblemType, then ResolvedAttributes.ProblemType matches the ProblemType you set. If you leave it blank (or null), the ProblemType is whatever Autopilot decides on your behalf.
In some cases, Autopilot is unable to infer the ProblemType with high enough confidence, in which case you must provide the value for the job to succeed.
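A minimal sketch of both sides of this behavior follows: building the optional ProblemType argument for the request, and reading the resolved value after the job finishes. The helper names are hypothetical, and the response below is a hand-written stand-in for a DescribeAutoMLJob result rather than real service output. Other required request arguments are omitted.

```python
from typing import Dict, Optional

VALID_PROBLEM_TYPES = ("Regression", "BinaryClassification", "MulticlassClassification")

def problem_type_kwargs(problem_type: Optional[str] = None) -> Dict[str, str]:
    """Return the request fragment; empty when Autopilot should decide."""
    if problem_type is None:
        return {}
    if problem_type not in VALID_PROBLEM_TYPES:
        raise ValueError(f"ProblemType must be one of {VALID_PROBLEM_TYPES}")
    return {"ProblemType": problem_type}

def resolved_problem_type(describe_response: Dict) -> str:
    """Read the problem type that the finished job reports."""
    return describe_response["ResolvedAttributes"]["ProblemType"]

# Hand-written stand-in for a response, used only for illustration.
fake_response = {"ResolvedAttributes": {"ProblemType": "Regression"}}

print(problem_type_kwargs())              # leave detection to Autopilot
print(problem_type_kwargs("Regression"))  # set the problem type explicitly
print(resolved_problem_type(fake_response))
```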
Your problem type options are as follows:
Regression
Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. An example is the prediction of house prices using features like the number of bathrooms and bedrooms, and the square footage of the house and garden. Regression analysis can create a model that takes one or more of these features as an input and predicts the price of a house.
Binary classification
Binary classification is a type of supervised learning that assigns an individual to one of two predefined and mutually exclusive classes based on their attributes. It is supervised because the models are trained on examples whose attributes are paired with correct labels. A medical diagnosis of whether an individual has a disease or not, based on the results of diagnostic tests, is an example of binary classification.
Multiclass classification
Multiclass classification is a type of supervised learning that assigns an individual to one of several classes based on their attributes. It is supervised because the models are trained on examples whose attributes are paired with correct labels. An example is the prediction of the topic most relevant to a text document. A document may be classified as being about, say, religion or politics or finance, or about one of several other predefined topic classes.
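To show how the three problem types differ in terms of the target column, the following toy heuristic guesses a type from the target values. It is purely illustrative and is not how Autopilot detects the problem type. The function name and the max_classes threshold are hypothetical choices made for this sketch.

```python
from typing import Iterable

def guess_problem_type(target_values: Iterable, max_classes: int = 20) -> str:
    """Toy heuristic: classify the task by the distinct target values."""
    distinct = set(target_values)
    if len(distinct) < 2:
        raise ValueError("The target needs at least two distinct values")
    if len(distinct) == 2:
        return "BinaryClassification"
    # Many distinct numeric values suggest a continuous target.
    numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in distinct
    )
    if numeric and len(distinct) > max_classes:
        return "Regression"
    return "MulticlassClassification"

house_prices = [100000 + 5000 * i for i in range(25)]      # hypothetical prices
diagnoses = ["disease", "healthy", "healthy", "disease"]   # hypothetical labels
topics = ["religion", "politics", "finance", "politics"]   # hypothetical labels

print(guess_problem_type(house_prices))
print(guess_problem_type(diagnoses))
print(guess_problem_type(topics))
```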