Using the SageMaker Data Agent for data analytics tasks - Amazon SageMaker Unified Studio

Using the SageMaker Data Agent for data analytics tasks

Data exploration and discovery

The agent helps you explore available data sources by querying AWS Glue Data Catalog metadata through MCP connections. When you request data exploration, the agent retrieves table information, schemas, and relationships from your catalog to understand what data is available.

Example prompt for data analyst

"Show me what customer tables are available and their schemas"

The agent queries your AWS Glue Data Catalog and generates code to display table metadata.

Example prompt

"Calculate monthly revenue by product category from the sales_transactions table"

The agent accesses your AWS Glue Data Catalog to verify the sales_transactions table structure and generates:

SELECT DATE_TRUNC('month', transaction_date) as month, product_category, SUM(transaction_amount) as total_revenue, COUNT(DISTINCT customer_id) as unique_customers, AVG(transaction_amount) as avg_transaction_value FROM sales_db.sales_transactions WHERE transaction_date >= DATE_ADD(CURRENT_DATE(), -180) GROUP BY DATE_TRUNC('month', transaction_date), product_category ORDER BY month DESC, total_revenue DESC

Building multi-step analytical workflows

For complex analysis, the agent proposes step-by-step plans that break down your objective into discrete operations, requesting approval before executing each step.

Example prompt for data scientist

"Analyze the churn table and visualize retention rate by region"

The agent proposes a multi-step plan:

Now I'll create a plan to analyze the churn table and visualize retention rate by region. Since the table has a 'state' column representing regions and a 'churn' boolean column, I can calculate the retention rate (percentage of customers who didn't churn) for each state.

  1. Step 1: Load Churn Data from Catalog - Use PySpark to read the churn table from the AWS Glue catalog (sagemaker_sample_db.churn) and preview the data structure.

  2. Step 2: Calculate Retention Rate by Region - Aggregate the data by state to calculate the retention rate (percentage of customers where churn = False) for each state/region.

  3. Step 3: Visualize Retention Rates - Create a bar chart showing the retention rate by region/state, sorted for better readability.

Each step generates code in a new cell, maintaining context from previous cells and building upon earlier results.

Cell level inline code modification

The agent provides inline assistance to modify existing code based on your requirements. You can use the agent icon on every cell to trigger a prompt input box and type your request instead of code.

Example workflow

Initial cell content:

# Load sales data sales_df = spark.table("sales_db.transactions") sales_df.show(10)

Inline prompt: "Add filtering for electronics category and last quarter only"

Updated cell content:

# Load sales data with filters from datetime import datetime, timedelta quarter_start = datetime.now() - timedelta(days=90) sales_df = spark.table("sales_db.transactions") \ .filter(col("category") == "electronics") \ .filter(col("transaction_date") >= quarter_start) sales_df.show(10)

Referencing previous cell results

The agent understands your notebook context and can generate code that references data frames and variables from previous cells.

Example workflow

Cell 1:

# Customer purchase analysis customer_purchases = spark.sql(""" SELECT customer_id, product_category, SUM(amount) as total_spent FROM sales_db.transactions GROUP BY customer_id, product_category """) customer_purchases.createOrReplaceTempView("customer_summary")

Agent prompt: "Create a visualization showing the distribution of customer spending from the results above"

Cell 2 (generated by agent):

# Visualize customer spending distribution import matplotlib.pyplot as plt import pandas as pd # Convert to pandas for visualization spending_data = customer_purchases.toPandas() plt.figure(figsize=(12, 6)) plt.hist(spending_data['total_spent'], bins=50, edgecolor='black') plt.xlabel('Total Spending') plt.ylabel('Number of Customers') plt.title('Customer Spending Distribution') plt.grid(True, alpha=0.3) plt.show() # Show summary statistics print(spending_data['total_spent'].describe())

Error handling and fix with the SageMaker Data Agent

When code execution fails, the SageMaker Data Agent can analyze error messages and suggest corrections. Use the fix with agent capability to automatically generate corrected code.