Prepare data - Amazon SageMaker

Prepare data

Data preparation in machine learning is the process of collecting, preprocessing, and organizing raw data so that it is suitable for analysis and modeling. This step puts the data in a format that machine learning algorithms can learn from effectively. Data preparation tasks can include handling missing values, removing outliers, scaling features, encoding categorical variables, assessing and mitigating potential biases, splitting data into training and test sets, labeling, and any other transformations needed to improve the quality and usability of the data for downstream machine learning tasks.
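As a rough illustration of several of the tasks listed above, the following library-free Python sketch imputes missing values, clips outliers (rather than removing them), scales a numeric feature, one-hot encodes a categorical feature, and splits the data. The column names (`age`, `plan`, `label`), the clipping range, and the sample records are hypothetical and chosen only for the example; this is not SageMaker code.

```python
import random
import statistics

AGE_RANGE = (0, 100)  # hypothetical plausible range used to clip outliers


def clip(value, bounds=AGE_RANGE):
    """Limit a value to the given (low, high) bounds."""
    low, high = bounds
    return min(max(value, low), high)


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit(train_rows):
    """Learn imputation, scaling, and encoding parameters from training rows only."""
    ages = [clip(r["age"]) for r in train_rows if r["age"] is not None]
    return {
        "median": statistics.median(ages),
        "low": min(ages),
        "high": max(ages),
        "plans": sorted({r["plan"] for r in train_rows}),
    }


def transform(rows, params):
    """Apply the learned parameters; returns a list of (features, label) pairs."""
    span = (params["high"] - params["low"]) or 1  # avoid division by zero
    out = []
    for r in rows:
        # Missing value -> training median; otherwise clip to the plausible range.
        age = params["median"] if r["age"] is None else clip(r["age"])
        scaled = (age - params["low"]) / span  # min-max scaling
        # One-hot encoding; categories unseen in training become all zeros.
        one_hot = [1.0 if r["plan"] == p else 0.0 for p in params["plans"]]
        out.append(([scaled] + one_hot, r["label"]))
    return out


# Hypothetical sample records.
rows = [
    {"age": 34, "plan": "basic", "label": 0},
    {"age": None, "plan": "pro", "label": 1},   # missing value
    {"age": 52, "plan": "basic", "label": 0},
    {"age": 230, "plan": "pro", "label": 1},    # outlier
    {"age": 41, "plan": "free", "label": 0},
    {"age": 27, "plan": "pro", "label": 1},
    {"age": 63, "plan": "free", "label": 0},
    {"age": 19, "plan": "basic", "label": 1},
]

train_rows, test_rows = split(rows)
params = fit(train_rows)
train_set = transform(train_rows, params)
test_set = transform(test_rows, params)
print(len(train_set), len(test_set))
```

The sketch splits first and learns the median, minimum, and maximum from the training rows only, so that statistics from the test rows do not leak into the transformations. As a consequence, scaled values in the test set can fall slightly outside the 0 to 1 range.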

Amazon SageMaker provides several built-in features for performing data preparation tasks such as cleaning, transforming, and labeling datasets before model training.

  • For low-code data preparation, you can use Amazon SageMaker Data Wrangler to create data flows that define your ML data preprocessing and feature engineering workflows with little to no code. You can import data from sources such as Amazon S3, Amazon Redshift, or Snowflake, and then engineer features from it. You can use built-in visualizations and analyses to get insights from your data. After preparing your data, you can export the finished output to Amazon S3, Amazon SageMaker Feature Store, or SageMaker Pipelines. Data Wrangler is available within Amazon SageMaker Canvas and Amazon SageMaker Studio Classic. We recommend using it within SageMaker Canvas for the latest features. For more information about Data Wrangler within SageMaker Canvas, see Prepare data. For information about Data Wrangler within Studio Classic, see Prepare ML Data with Amazon SageMaker Data Wrangler.

  • For data preparation at scale using open-source frameworks such as Apache Spark, Apache Hive, or Presto, Amazon SageMaker Studio Classic provides a built-in integration with Amazon EMR. You can use SageMaker Studio Classic to connect to or provision Amazon EMR clusters from your notebooks for petabyte-scale data processing, interactive analytics, and machine learning. For more information about using Amazon EMR from SageMaker Studio Classic, see Prepare data using Amazon EMR.

    Alternatively, you can use the Apache Spark-based serverless engine from AWS Glue interactive sessions to aggregate, transform, and prepare data from multiple sources in SageMaker Studio Classic. For more information about using AWS Glue interactive sessions within SageMaker Studio Classic, see Prepare data using AWS Glue Interactive Sessions.

  • For data preparation using SQL in Studio, the default JupyterLab image (SageMaker Distribution version 1.6 and later) includes an SQL extension. Using this SQL environment, you can connect to Amazon Redshift, Amazon Athena, and Snowflake from JupyterLab notebooks. You can explore database schemas, write and run SQL queries, and retrieve results as pandas DataFrames for further analysis. The extension provides auto-complete, syntax highlighting, and query formatting to make writing complex SQL easier in JupyterLab notebooks. Queries can join data across multiple tables for data sampling, exploratory analysis, cleaning, feature engineering, and more. For information about the SQL extension in JupyterLab, see Prepare data with SQL in Studio.

  • For feature discovery and storage, the Amazon SageMaker Feature Store has capabilities to search, discover, and retrieve features for model training and provides a centralized repository to store feature data in a standardized format. Storing curated features in the Feature Store allows reuse of existing features for new ML projects. The Feature Store manages the full lifecycle of features including tracking lineage, calculating statistics, and maintaining audit trails. For more information on feature data storage for your ML pipelines, refer to the Create, store, and share features section in this guide.

  • For bias detection, you can use Amazon SageMaker Clarify to analyze your data and detect potential biases across multiple facets. For example, you can use SageMaker Clarify to detect if your training data contains imbalanced representations or labeling biases between groups such as gender, race, or age. SageMaker Clarify can help you identify these biases before training a model to avoid propagating biases into the model's predictions. For information about using SageMaker Clarify to uncover biases, refer to the Detect Pre-training Data Bias section in this guide.

  • For data labeling, you can use SageMaker Ground Truth to manage the data labeling workflows of your training datasets. For information about how to use Ground Truth for your labeling tasks, refer to the Label data with a human-in-the-loop section in this guide.
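To make the bias-detection item in the list above more concrete, the following sketch computes two simple pre-training measures by hand: class imbalance between two groups, and the difference in the proportion of positive labels between them. It is a plain Python illustration of the idea and does not call SageMaker Clarify; the function names, the group identifiers (`a`, `d`), and the sample data are assumptions made for the example.

```python
def class_imbalance(facet_values, disadvantaged):
    """(n_a - n_d) / (n_a + n_d): 0 means both groups are equally represented."""
    n_d = sum(1 for v in facet_values if v == disadvantaged)
    n_a = len(facet_values) - n_d
    return (n_a - n_d) / (n_a + n_d)


def diff_in_positive_proportions(facet_values, labels, disadvantaged, positive=1):
    """q_a - q_d, where q is the share of positive labels within each group."""
    d_labels = [y for v, y in zip(facet_values, labels) if v == disadvantaged]
    a_labels = [y for v, y in zip(facet_values, labels) if v != disadvantaged]
    q_d = sum(1 for y in d_labels if y == positive) / len(d_labels)
    q_a = sum(1 for y in a_labels if y == positive) / len(a_labels)
    return q_a - q_d


# Hypothetical dataset: group "a" has 6 rows (3 positive labels),
# group "d" has 4 rows (1 positive label).
facet = ["a"] * 6 + ["d"] * 4
labels = [1, 1, 1, 0, 0, 0] + [1, 0, 0, 0]

ci = class_imbalance(facet, disadvantaged="d")
dpl = diff_in_positive_proportions(facet, labels, disadvantaged="d")
print(ci)   # 0.2
print(dpl)  # 0.25
```

Values near zero suggest balanced representation and similar label rates across the two groups. Larger magnitudes indicate that one group is under-represented or receives positive labels less often, which is the kind of signal to investigate before training.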

After performing exploratory data analysis and creating your data transformation steps, you can productionize your transformation code using SageMaker Processing jobs and automate your preparation workflow using Amazon SageMaker Model Building Pipelines.
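As a minimal sketch of what productionized transformation code might look like, the following script reads CSV files from an input directory, drops rows with empty fields, and writes the cleaned files to an output directory. The default paths under `/opt/ml/processing/` follow a convention commonly used in Processing containers and are an assumption here, as are the function names and the cleaning rule. The block ends with a local dry run against temporary directories so that it can be tried without a container.

```python
import argparse
import csv
import tempfile
from pathlib import Path


def clean_file(src: Path, dst: Path) -> int:
    """Copy src to dst, dropping rows with missing or empty fields. Returns rows kept."""
    kept = 0
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:  # empty file: nothing to write
            return 0
        writer.writerow(header)
        for row in reader:
            if len(row) == len(header) and all(cell.strip() for cell in row):
                writer.writerow(row)
                kept += 1
    return kept


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Clean CSV files (illustrative).")
    # Assumed default container paths; adjust to match your job configuration.
    parser.add_argument("--input-dir", default="/opt/ml/processing/input")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output")
    args = parser.parse_args(argv)

    in_dir, out_dir = Path(args.input_dir), Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    total = 0
    for src in sorted(in_dir.glob("*.csv")):
        total += clean_file(src, out_dir / src.name)
    return total


# Local dry run with temporary directories standing in for the container paths.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp, "input")
    in_dir.mkdir()
    (in_dir / "sample.csv").write_text("id,age\n1,34\n2,\n3,52\n")
    kept = main(["--input-dir", str(in_dir), "--output-dir", str(Path(tmp, "output"))])
    cleaned = Path(tmp, "output", "sample.csv").read_text()

print(kept)  # 2
```

Keeping the transformation in a function that takes explicit input and output locations makes the same code usable in a notebook during exploration and as the entry point of a scheduled job later.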

For information about the SageMaker Processing API, see Amazon SageMaker processing jobs.

For information about automating your transformation steps, see SageMaker Model Building Pipelines.