Get Started with Data Wrangler

Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio Classic. Use this section to learn how to access and get started using Data Wrangler. Do the following:

  1. Complete each step in Prerequisites.

  2. Follow the procedure in Access Data Wrangler to start using Data Wrangler.

Prerequisites

To use Data Wrangler, you must complete the following prerequisites.

  1. To use Data Wrangler, you need access to an Amazon Elastic Compute Cloud (Amazon EC2) instance. For more information about the Amazon EC2 instances that you can use, see Instances. To learn how to view your quotas and, if necessary, request a quota increase, see AWS service quotas.

  2. Configure the required permissions described in Security and Permissions.

  3. If your organization is using a firewall that blocks internet traffic, you must have access to the following URLs:

    • https://ui.prod-1.data-wrangler.sagemaker.aws/

    • https://ui.prod-2.data-wrangler.sagemaker.aws/

    • https://ui.prod-3.data-wrangler.sagemaker.aws/

    • https://ui.prod-4.data-wrangler.sagemaker.aws/

To use Data Wrangler, you need an active Studio Classic instance. To learn how to launch a new instance, see Amazon SageMaker domain overview. When your Studio Classic instance is Ready, use the instructions in Access Data Wrangler.

Access Data Wrangler

The following procedure assumes you have completed the Prerequisites.

To access Data Wrangler in Studio Classic, do the following.

  1. Sign in to Studio Classic. For more information, see Amazon SageMaker domain overview.

  2. Choose Studio.

  3. Choose Launch app.

  4. From the dropdown list, select Studio.

  5. Choose the Home icon.

  6. Choose Data.

  7. Choose Data Wrangler.

  8. You can also create a Data Wrangler flow by doing the following.

    1. In the top navigation bar, select File.

    2. Select New.

    3. Select Data Wrangler Flow.

  9. (Optional) Rename the new directory and the .flow file.

  10. When you create a new .flow file in Studio Classic, you might see a carousel that introduces you to Data Wrangler.

    Loading Data Wrangler can take a few minutes.

    This messaging persists as long as the KernelGateway app on your User Details page is Pending. To see the status of this app, in the SageMaker console on the Amazon SageMaker Studio Classic page, select the name of the user you are using to access Studio Classic. On the User Details page, you see a KernelGateway app under Apps. Wait until this app status is Ready to start using Data Wrangler. This can take around 5 minutes the first time you launch Data Wrangler.

  11. To get started, choose a data source and use it to import a dataset. See Import to learn more.

    When you import a dataset, it appears in your data flow. To learn more, see Create and Use a Data Wrangler Flow.

  12. After you import a dataset, Data Wrangler automatically infers the type of data in each column. Choose + next to the Data types step and select Edit data types.

    Important

    After you add transforms to the Data types step, you cannot bulk-update column types using Update types.

  13. Use the data flow to add transforms and analyses. To learn more, see Transform Data and Analyze and Visualize.

  14. To export a complete data flow, choose Export and choose an export option. To learn more, see Export.

  15. Finally, choose the Components and registries icon, and select Data Wrangler from the dropdown list to see all the .flow files that you've created. You can use this menu to find and move between data flows.

After you have launched Data Wrangler, you can use the following section to walk through how you might use Data Wrangler to create an ML data prep flow.

Update Data Wrangler

We recommend that you periodically update the Data Wrangler Studio Classic app to access the latest features and updates. The Data Wrangler app name starts with sagemaker-data-wrang. To learn how to update a Studio Classic app, see Shut down and Update Studio Classic Apps.
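
If you want to find the Data Wrangler app and check its status programmatically instead of in the console, the following is a minimal sketch using the AWS SDK for Python (Boto3); the domain ID and user profile name are placeholder values that you must replace with your own.

    import boto3

    sagemaker_client = boto3.client("sagemaker")

    # Placeholder values; replace with your own domain ID and user profile name.
    response = sagemaker_client.list_apps(
        DomainIdEquals="d-xxxxxxxxxxxx",
        UserProfileNameEquals="my-user-profile",
    )
    for app in response["Apps"]:
        # The Data Wrangler app name starts with sagemaker-data-wrang.
        print(app["AppType"], app["AppName"], app["Status"])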

Demo: Data Wrangler Titanic Dataset Walkthrough

The following sections provide a walkthrough to help you get started using Data Wrangler. This walkthrough assumes that you have already followed the steps in Access Data Wrangler and have a new data flow file open that you intend to use for the demo. You may want to rename this .flow file to something similar to titanic-demo.flow.

This walkthrough uses a modified version of the Titanic dataset that you can import into your Data Wrangler flow more easily. The dataset contains the survival status, age, gender, and class (which serves as a proxy for economic status) of passengers aboard the maiden voyage of the RMS Titanic in 1912.

In this tutorial, you perform the following steps.

  1. Do one of the following:

    • Open your Data Wrangler flow and choose Use Sample Dataset.

    • Upload the Titanic dataset to Amazon Simple Storage Service (Amazon S3), and then import this dataset into Data Wrangler.

  2. Analyze this dataset using Data Wrangler analyses.

  3. Define a data flow using Data Wrangler data transforms.

  4. Export your flow to a Jupyter Notebook that you can use to create a Data Wrangler job.

  5. Process your data, and kick off a SageMaker training job to train an XGBoost binary classifier.

Upload Dataset to S3 and Import

To get started, you can use one of the following methods to import the Titanic dataset into Data Wrangler:

  • Importing the dataset directly from the Data Wrangler flow

  • Uploading the dataset to Amazon S3 and then importing it into Data Wrangler

To import the dataset directly into Data Wrangler, open the flow and choose Use Sample Dataset.

Uploading the dataset to Amazon S3 and importing it into Data Wrangler is closer to the experience you have importing your own data. The following information tells you how to upload your dataset and import it.

Before you start importing the data into Data Wrangler, download the Titanic dataset and upload it to an Amazon S3 bucket in the AWS Region in which you want to complete this demo.

If you are a new user of Amazon S3, you can do this using drag and drop in the Amazon S3 console. To learn how, see Uploading Files and Folders by Using Drag and Drop in the Amazon Simple Storage Service User Guide.
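
If you prefer to upload the file programmatically, the following is a minimal sketch using the AWS SDK for Python (Boto3). The bucket name, Region, local filename, and object key are placeholders; substitute your own values.

    import boto3

    # Placeholder values; replace with your own bucket, Region, and local path.
    bucket = "amzn-s3-demo-bucket"
    region = "us-east-1"

    s3_client = boto3.client("s3", region_name=region)
    s3_client.upload_file("titanic.csv", bucket, "data/titanic.csv")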

Important

Upload your dataset to an S3 bucket in the same AWS Region you want to use to complete this demo.

When your dataset has been successfully uploaded to Amazon S3, you can import it into Data Wrangler.

Import the Titanic dataset to Data Wrangler
  1. Choose the Import data button in your Data flow tab or choose the Import tab.

  2. Select Amazon S3.

  3. Use the Import a dataset from S3 table to find the bucket to which you added the Titanic dataset. Choose the Titanic dataset CSV file to open the Details pane.

  4. Under Details, the File type should be CSV. Check First row is header to specify that the first row of the dataset is a header. You can also name the dataset something more friendly, such as Titanic-train.

  5. Choose the Import button.

When your dataset is imported into Data Wrangler, it appears in your Data Flow tab. You can double-click a node to open the node detail view, which allows you to add transformations or analyses. You can use the plus icon for quick access to the navigation. In the next section, you use this data flow to add analysis and transform steps.

Data Flow

In the data flow, the only steps so far are your recently imported dataset and a Data type step. After applying transformations, you can come back to this tab and see what the data flow looks like. Now, add some basic transformations under the Prepare and Analyze tabs.

Prepare and Visualize

Data Wrangler has built-in transformations and visualizations that you can use to analyze, clean, and transform your data.

The Data tab of the node detail view lists all built-in transformations in the right panel, which also contains an area in which you can add custom transformations. The following use case showcases how to use these transformations.

To get information that might help you with data exploration and feature engineering, create a data quality and insights report. The information from the report can help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention. For more information about creating a report, see Get Insights On Data and Data Quality.

Data Exploration

First, create a table summary of the data using an analysis. Do the following:

  1. Choose the + next to the Data type step in your data flow and select Add analysis.

  2. In the Analysis area, select Table summary from the dropdown list.

  3. Give the table summary a Name.

  4. Select Preview to preview the table that will be created.

  5. Choose Save to save it to your data flow. It appears under All Analyses.

Using the statistics you see, you can make observations similar to the following about this dataset:

  • Fare average (mean) is around $33, while the max is over $500. This column likely has outliers.

  • This dataset uses ? to indicate missing values. A number of columns have missing values: cabin, embarked, and home.dest.

  • The age category is missing over 250 values.
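
If you want to verify these observations outside of Data Wrangler, the following is a minimal pandas sketch. It assumes that you have a local copy of the Titanic CSV file (the filename titanic.csv is a placeholder) and treats ? as a missing-value marker.

    import pandas as pd

    # Placeholder filename; point this at your local copy of the Titanic CSV.
    df = pd.read_csv("titanic.csv", na_values="?")

    print(df["fare"].describe())  # the mean is around 33 and the max is over 500
    print(df.isna().sum())        # missing-value counts per column, including age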

Next, clean your data using the insights gained from these stats.

Drop Unused Columns

Using the analysis from the previous section, clean up the dataset to prepare it for training. To add a new transform to your data flow, choose + next to the Data type step in your data flow and choose Add transform.

First, drop columns that you don't want to use for training. You can use the pandas data analysis library to do this, or you can use one of the built-in transforms.

Use the following procedure to drop the unused columns.

To drop the unused columns.

  1. Open the Data Wrangler flow.

  2. There are two nodes in your Data Wrangler flow. Choose the + to the right of the Data types node.

  3. Choose Add transform.

  4. In the All steps column, choose Add step.

  5. In the Standard transform list, choose Manage Columns. The standard transformations are ready-made, built-in transformations. Make sure that Drop column is selected.

  6. Under Columns to drop, check the following column names:

    • cabin

    • ticket

    • name

    • sibsp

    • parch

    • home.dest

    • boat

    • body

  7. Choose Preview.

  8. Verify that the columns have been dropped, then choose Add.

To do this using pandas, follow these steps.

  1. In the All steps column, choose Add step.

  2. In the Custom transform list, choose Custom transform.

  3. Provide a name for your transformation, and choose Python (Pandas) from the dropdown list.

  4. Enter the following Python script in the code box.

    cols = ['name', 'ticket', 'cabin', 'sibsp', 'parch', 'home.dest', 'boat', 'body']
    df = df.drop(cols, axis=1)
  5. Choose Preview to preview the change, and then choose Add to add the transformation.

Clean up Missing Values

Now, clean up missing values. You can do this with the Handling missing values transform group.

A number of columns in the original dataset have missing values. Of the columns that remain after the previous step, age and fare contain missing values. You can inspect this using a Custom Transform.

With the Python (Pandas) option, use the following to quickly review the number of non-null entries in each column:

df.info()
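
If you prefer an explicit per-column count of missing values rather than inferring it from the non-null counts that df.info() prints, a short pandas sketch like the following also works. The second check assumes that any ? placeholders have not already been converted to NaN; adjust it as needed.

    # Count true NaN values per column.
    print(df.isna().sum())

    # Count any remaining "?" placeholders, in case they were not
    # converted to NaN on import.
    print((df == "?").sum())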

To drop rows with missing values in the age category, do the following:

  1. Choose Handle missing.

  2. Choose Drop missing for the Transformer.

  3. Choose age for the Input column.

  4. Choose Preview to see the new data frame, and then choose Add to add the transform to your flow.

  5. Repeat the same process for fare.

You can use df.info() in the Custom transform section to confirm that every column now has 1,045 non-null entries; that is, 1,045 rows remain and none of them contain missing values.

Custom Pandas: Encode

Try flat encoding using Pandas. Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are Dog and Cat, you may encode this information into two vectors: [1,0] to represent Dog, and [0,1] to represent Cat.
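
The following toy pandas sketch uses hypothetical data, separate from the Titanic flow, to show what pd.get_dummies (the function used in the procedure below) produces for such a column.

    import pandas as pd

    # Hypothetical example data, not part of the Titanic dataset.
    pets = pd.DataFrame({"animal": ["Dog", "Cat", "Dog"]})
    encoded = pd.get_dummies(pets["animal"])
    # 'encoded' has one column per category ("Cat" and "Dog"); each row is
    # marked in the column for its category and unmarked elsewhere.
    print(encoded)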

  1. In the Custom Transform section, choose Python (Pandas) from the dropdown list.

  2. Enter the following in the code box.

    import pandas as pd

    dummies = []
    cols = ['pclass', 'sex', 'embarked']
    for col in cols:
        dummies.append(pd.get_dummies(df[col]))

    encoded = pd.concat(dummies, axis=1)
    df = pd.concat((df, encoded), axis=1)
  3. Choose Preview to preview the change. The encoded version of each column is added to the dataset.

  4. Choose Add to add the transformation.

Custom SQL: SELECT Columns

Now, select the columns you want to keep using SQL. For this demo, select the columns listed in the following SELECT statement. Because survived is your target column for training, put that column first.

  1. In the Custom Transform section, select SQL (PySpark SQL) from the dropdown list.

  2. Enter the following in the code box.

    SELECT survived, age, fare, 1, 2, 3, female, male, C, Q, S FROM df;
  3. Choose Preview to preview the change. The columns listed in your SELECT statement are the only remaining columns.

  4. Choose Add to add the transformation.

Export to a Data Wrangler Notebook

When you've finished creating a data flow, you have a number of export options. The following section explains how to export to a Data Wrangler job notebook. A Data Wrangler job is used to process your data using the steps defined in your data flow. To learn more about all export options, see Export.

Export to Data Wrangler Job Notebook

When you export your data flow using a Data Wrangler job, the process automatically creates a Jupyter Notebook. This notebook automatically opens in your Studio Classic instance and is configured to run a SageMaker processing job that executes your Data Wrangler data flow. This processing job is referred to as a Data Wrangler job.

  1. Save your data flow. Select File and then select Save Data Wrangler Flow.

  2. Return to the Data Flow tab, select the last step in your data flow (SQL), and then choose the + to open the navigation.

  3. Choose Export, and then choose Amazon S3 (via Jupyter Notebook). This opens a Jupyter Notebook.

  4. Choose any Python 3 (Data Science) kernel for the Kernel.

  5. When the kernel starts, run the cells in the notebook until Kick off SageMaker Training Job (Optional).

  6. Optionally, you can run the cells in Kick off SageMaker Training Job (Optional) if you want to create a SageMaker training job to train an XGBoost classifier. You can find the cost to run a SageMaker training job in Amazon SageMaker Pricing.

    Alternatively, you can add the code blocks found in Training XGBoost Classifier to the notebook and run them to use the XGBoost open source library to train an XGBoost classifier.

  7. Uncomment the cell under Cleanup and run it to revert the SageMaker Python SDK to its original version.

You can monitor your Data Wrangler job status in the SageMaker console in the Processing tab. Additionally, you can monitor your Data Wrangler job using Amazon CloudWatch. For additional information, see Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics.
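
If you want to check the job status programmatically rather than in the console, the following is a minimal sketch using the AWS SDK for Python (Boto3); the processing job name is a placeholder, so use the name that the exported notebook prints.

    import boto3

    sagemaker_client = boto3.client("sagemaker")

    # Placeholder name; use the processing job name from the exported notebook.
    job = sagemaker_client.describe_processing_job(
        ProcessingJobName="data-wrangler-flow-processing-job"
    )
    print(job["ProcessingJobStatus"])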

If you kicked off a training job, you can monitor its status using the SageMaker console under Training jobs in the Training section.

Training XGBoost Classifier

You can train an XGBoost binary classifier using either a Jupyter notebook or Amazon SageMaker Autopilot. You can use Autopilot to automatically train and tune models on the data that you've transformed directly from your Data Wrangler flow. For information about Autopilot, see Automatically Train Models on Your Data Flow.

In the same notebook that kicked off the Data Wrangler job, you can pull in the prepared data and train an XGBoost binary classifier with minimal additional preparation.

  1. First, upgrade the necessary modules using pip and remove the _SUCCESS file (this file is problematic when using awswrangler).

    ! pip install --upgrade awscli awswrangler boto sklearn
    ! aws s3 rm {output_path} --recursive --exclude "*" --include "*_SUCCESS*"
  2. Read the data from Amazon S3. You can use awswrangler to recursively read all the CSV files in the S3 prefix. The data is then split into features and labels. The label is the first column of the dataframe.

    import awswrangler as wr

    df = wr.s3.read_csv(path=output_path, dataset=True)
    # The label (survived) is the first column; the remaining columns are features.
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
  3. Finally, create a DMatrix (the XGBoost primitive data structure) and do cross-validation using XGBoost binary classification.

    import xgboost as xgb

    dmatrix = xgb.DMatrix(data=X, label=y)
    params = {
        "objective": "binary:logistic",
        "learning_rate": 0.1,
        "max_depth": 5,
        "alpha": 10,
    }
    xgb.cv(
        dtrain=dmatrix,
        params=params,
        nfold=3,
        num_boost_round=50,
        early_stopping_rounds=10,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )
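
If you want to go one step beyond cross-validation, the following optional sketch trains a single booster on the full DMatrix and prints its feature importances. It reuses the params and dmatrix variables defined above and is an illustration only, not part of the exported notebook.

    # Optional: train one model on the full DMatrix and inspect which
    # features the booster relies on most.
    booster = xgb.train(params=params, dtrain=dmatrix, num_boost_round=50)
    print(booster.get_score(importance_type="weight"))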

Shut down Data Wrangler

When you are finished using Data Wrangler, we recommend that you shut down the instance it runs on to avoid incurring additional charges. To learn how to shut down the Data Wrangler app and associated instance, see Shut Down Data Wrangler.
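
If you prefer to shut down the app programmatically, the following is a minimal sketch using the AWS SDK for Python (Boto3). The domain ID, user profile name, and app name are placeholders; use the values shown for the KernelGateway app on your User Details page.

    import boto3

    sagemaker_client = boto3.client("sagemaker")

    # Placeholder values; replace them with your domain ID, user profile
    # name, and the app name shown in the SageMaker console.
    sagemaker_client.delete_app(
        DomainId="d-xxxxxxxxxxxx",
        UserProfileName="my-user-profile",
        AppType="KernelGateway",
        AppName="my-data-wrangler-app",
    )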