Select your cookie preferences

We use essential cookies and similar tools that are necessary to provide our site and services. We use performance cookies to collect anonymous statistics, so we can understand how customers use our site and make improvements. Essential cookies cannot be deactivated, but you can choose “Customize” or “Decline” to decline performance cookies.

If you agree, AWS and approved third parties will also use cookies to provide useful site features, remember your preferences, and display relevant content, including relevant advertising. To accept or decline all non-essential cookies, choose “Accept” or “Decline.” To make more detailed choices, choose “Customize.”

Datasets format and objective metric for text classification

Focus mode
Datasets format and objective metric for text classification - Amazon SageMaker AI

In this section we learn about the available formats for datasets used in text classification as well as the metric used to evaluate the predictive quality of machine learning model candidates. The metrics calculated for candidates are specified using an array of MetricDatum types.

Datasets formats

Autopilot supports tabular data formatted as CSV files or as Parquet files. For tabular data, each column contains a feature with a specific data type and each row contains an observation. The properties of these two file formats differ considerably.

  • CSV (comma-separated-values) is a row-based file format that stores data in human readable plaintext which a popular choice for data exchange as they are supported by a wide range of applications.

  • Parquet is a column-based file format where the data is stored and processed more efficiently than row-based file formats. This makes them a better option for big data problems.

The data types accepted for columns include numerical, categorical, text.

Autopilot supports building machine learning models on large datasets up to hundreds of GBs. For details on the default resource limits for input datasets and how to increase them, see Amazon SageMaker Autopilot quotas.

Objective metric

The following list contains the names of the metrics that are currently available to measure the performance of models for text classification.

Accuracy

The ratio of the number of correctly classified items to the total number of (correctly and incorrectly) classified items. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates perfect inaccuracy.

PrivacySite termsCookie preferences
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.