
Creating jobs

You can create jobs in SageMaker Unified Studio using multiple authoring methods to suit different personas and use cases.

Job authoring methods

IDE/Script Editor

Data engineers can develop jobs using either their preferred external IDE or the SageMaker Unified Studio's built-in code editor. These environments provide full code editing capabilities and remote debugging support. Once scripts are written, engineers can create data processing jobs from them directly within SageMaker Unified Studio.

Visual ETL

ETL developers can create jobs using a drag-and-drop visual interface for building data transformation pipelines without writing code.

Jupyter Notebooks

Data scientists can develop jobs using familiar Jupyter Notebook interfaces for interactive development and exploration.

Note

When creating a job from a Jupyter notebook (.ipynb file), SageMaker Unified Studio automatically converts the notebook to a Python script. Once this conversion happens, subsequent changes to the original notebook will not affect the job. The job will continue to use the Python script created at the time of job creation.

To create a new job

  1. Sign in to SageMaker Unified Studio and navigate to your project.

  2. In Data Analysis & Integration, choose Data processing jobs, then Visual ETL job or Code-based job. Alternatively, you can create a job within an existing Workflow, or through JupyterLab in IDE & Applications.

  3. Select your preferred authoring method (IDE, Visual ETL, or Notebook).

  4. Configure your job settings.

    Name

    Provide a UTF-8 string with a maximum length of 255 characters. Once a job is named, it cannot be renamed, because the name is a unique identifier.

    Description

    Provide an optional description of up to 2048 characters.

    IAM Role

    Specify the IAM role that authorizes access to the resources used to run the job and to the data stores it reads from and writes to. For more information about permissions for running jobs in AWS Glue, see Identity and access management for AWS Glue. Note that the IAM role cannot be changed after job creation; edit IAM roles at the project level.

    Type

    The type of ETL job. This is set automatically based on the type of data sources you select.

    SageMaker supports Apache Spark jobs that run in an Apache Spark environment managed by AWS Glue. These jobs process data in batches and provide powerful distributed computing capabilities.

    • Spark - Runs an Apache Spark ETL script.
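
    A Spark ETL script is typically a PySpark script that uses the AWS Glue libraries. The following is a minimal sketch only; the Data Catalog database, table name, and S3 path are placeholders, not values from your project.

    ```python
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Resolve the job name that is passed to the script at run time.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Placeholder source and target: replace with your own catalog table and S3 path.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="example_table"
    )
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://amzn-s3-demo-bucket/output/"},
        format="parquet",
    )

    job.commit()
    ```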

    AWS Glue version

    AWS Glue version determines the versions of Apache Spark and Python that are available to the job, as specified in the following table:

    AWS Glue version    Supported Spark and Python versions
    5.0                 Spark 3.5.4, Python 3.11
    4.0                 Spark 3.3.0, Python 3.10

    Language

    The code in the ETL script defines your job's logic. The script is written in Python. You can choose whether the script that the job runs is generated by AWS Glue or provided by you. If you provide your own script, specify its name and Amazon S3 location, and confirm that there isn't a file with the same name as the script directory in the path.

    Instance type

    The resources available on AWS Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

    • G.1X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory) with 94 GB of disk. We recommend this instance type for workloads such as data transforms, joins, and queries, because it offers a scalable and cost-effective way to run most jobs.

    • G.2X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 2 DPU (8 vCPUs, 32 GB of memory) with 138 GB of disk. We recommend this instance type for workloads such as data transforms, joins, and queries, because it offers a scalable and cost-effective way to run most jobs.

    • G.4X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 4 DPU (16 vCPUs, 64 GB of memory) with 256 GB of disk (approximately 235 GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries.

    • G.8X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 8 DPU (32 vCPUs, 128 GB of memory) with 512 GB of disk (approximately 487 GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries.

    You are charged an hourly rate based on the number of DPUs used to run your ETL jobs. For more information, see the AWS Glue pricing page.
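
    As a quick sketch of how the DPU-based billing works (the hourly rate below is a placeholder assumption; check the AWS Glue pricing page for your Region):

    ```python
    # Rough DPU-hour estimate for a Spark job run.
    workers = 10             # Number of workers configured for the job
    dpu_per_worker = 1       # G.1X = 1 DPU, G.2X = 2, G.4X = 4, G.8X = 8
    runtime_hours = 0.5      # A 30-minute job run

    dpu_hours = workers * dpu_per_worker * runtime_hours    # 10 * 1 * 0.5 = 5 DPU-hours
    assumed_rate_per_dpu_hour = 0.44                        # Placeholder rate, not an official price
    print(f"Estimated cost: ~${dpu_hours * assumed_rate_per_dpu_hour:.2f}")
    ```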

    Number of instances

    Specify the number of instances that are allocated when the job runs.

    Number of retries

    Specify the number of times, from 0 to 10, that AWS Glue will automatically restart the job if it fails. Jobs that reach the timeout limit are not restarted.

    Job timeout

    Sets the maximum execution time in minutes. The maximum is 7 days (10,080 minutes). If a job run exceeds this limit, the run is terminated with a timeout error.

    Best practices for job timeouts: Jobs are billed based on execution time. To avoid unexpected charges, configure timeout values appropriate for the expected execution time of your job.

    Compute connection

    Specifies the compute environment configuration for the job run, that is, how your job connects to and uses compute resources.

    Network connections

    Specify your network connections here, if any.

    Enable fine-grained access control (FGAC)

    Enables enhanced security controls for data access within your job. This provides granular permissions management, allowing you to control access to specific data sources, tables, or columns based on user roles and policies.

    Spark UI

    Turn on the use of Spark UI for monitoring this job. This will write Spark UI logs to Amazon S3.

    Spark UI logs path

    The path to write logs when Spark UI is enabled.

    Note

    Important job configuration restrictions:

    • Once a job is named, it cannot be renamed (the name is a unique identifier).

    • IAM role cannot be changed after job creation (edit IAM roles at the project level).

    • Job type cannot currently be changed after creation.

    You can also configure advanced settings, job parameters, tags, and non-overridable arguments. See the sections below for more information. For a sketch of how these settings correspond to the underlying AWS Glue job definition, see the example after this procedure.

  5. Write or design your data processing logic using your chosen authoring method.

  6. Save your Job to your project.
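
SageMaker Unified Studio creates and manages the underlying AWS Glue job definition for you, so you don't need to call the API yourself. For reference only, the settings in step 4 correspond roughly to the parameters of the AWS Glue CreateJob API. The following boto3 sketch uses placeholder names, roles, paths, and argument keys.

```python
import boto3

glue = boto3.client("glue")

# Placeholder values throughout; SageMaker Unified Studio normally sets these for you.
glue.create_job(
    Name="example-etl-job",                       # Name (cannot be changed later)
    Description="Example data processing job",    # Description
    Role="arn:aws:iam::111122223333:role/example-glue-role",  # IAM role
    Command={
        "Name": "glueetl",                        # Type: Spark ETL
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/example-etl-job.py",
        "PythonVersion": "3",
    },
    GlueVersion="5.0",                            # AWS Glue version
    WorkerType="G.1X",                            # Instance type
    NumberOfWorkers=10,                           # Number of instances
    MaxRetries=1,                                 # Number of retries
    Timeout=120,                                  # Job timeout (minutes)
    DefaultArguments={"--my_param": "default"},   # Job parameters (overridable at run time)
    NonOverridableArguments={"--my_locked_param": "fixed"},  # Non-overridable arguments (placeholder)
    Tags={"team": "data-processing-team-a"},      # Job tags
)
```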

Advanced settings

Automatically scale the number of instances

When enabled, costs and resource usage are optimized by scaling the number of instances up and down dynamically throughout the job run.

Generate lineage events

When enabled, Spark lineage events from your job runs are captured and sent to Amazon DataZone to visualize the table and column lineage.

Job bookmark

When enabled, specifies how state information is processed when the job runs. You can have it remember previously processed data, update state information, or ignore state information.
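
When bookmarks are enabled, AWS Glue tracks state for each source that is read with a transformation_ctx, and the state is persisted when the script calls job.commit(). A minimal sketch, with placeholder database and table names:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # Loads any existing bookmark state for this job

# transformation_ctx identifies this source so the bookmark can track what was read.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_table",
    transformation_ctx="read_example_table",
)

# ... transforms and writes ...

job.commit()   # Persists the bookmark so the next run skips already-processed data
```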

Job run queuing

When enabled, job runs are queued to run later when they cannot run immediately due to service quotas. If not enabled, the job runs will not be considered for queuing. If this setting does not match the value set in the job run, then the value from the job run field is used.

Flex execution

When enabled, you can reduce costs by running the job on lower capacity. This is ideal for non-urgent workloads that don't require fast job start times or consistent run times.

Script path

The Amazon S3 location where the job script is stored.

Job metrics

Enables the creation of Amazon CloudWatch metrics when this job runs.

Job observability metrics

Enables the creation of additional observability Amazon CloudWatch metrics when this job runs.

Maximum concurrency

Sets the maximum number of concurrent runs that are allowed for this job.

Temporary path

The location of a working directory in Amazon S3 where temporary intermediate results are written when the job script is run.

Delay notification threshold (minutes)

Specify a delay threshold for the job. If the job runs for a longer time than that specified by the threshold, then a delay notification is sent for the job to Amazon CloudWatch.

Server-side encryption

If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon S3 temporary directory are encrypted. This setting is ignored if a security configuration is specified.

Use AWS Glue Data Catalog as the Hive metastore

Choose this option if you want to use the AWS Glue Data Catalog as an alternative to the Apache Hive metastore.
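
With this option enabled, databases and tables in the AWS Glue Data Catalog can be referenced directly from Spark SQL in your job script. A minimal sketch, where example_db and example_table are placeholders for entries in your Data Catalog:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Spark session created by the AWS Glue context in the job script.
spark = GlueContext(SparkContext()).spark_session

# Query a Data Catalog table as if it were a Hive metastore table.
df = spark.sql("SELECT * FROM example_db.example_table LIMIT 10")
df.show()
```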

Python library path, Dependent jars path, or Referenced files path

Use these fields to specify the location of additional files used by the job when it runs the script.
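
For example, a package uploaded to the Python library path is added to the script's import path, and files supplied through the Referenced files path are copied to the script's working directory before it runs. The names below are hypothetical:

```python
import json

# my_helpers is a hypothetical package shipped via the Python library path setting.
from my_helpers.cleaning import normalize_columns

# config.json is a hypothetical file supplied through the Referenced files path setting;
# it is available in the script's working directory at run time.
with open("config.json") as f:
    config = json.load(f)
```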

Job parameters and arguments

SageMaker Unified Studio jobs support parameters that can be configured during job creation and optionally overridden at runtime.

Parameter overrides

Parameters defined during job creation that can be overridden when running the job through custom runs, workflows, or schedules. This allows the same job to be used for multiple use cases by changing parameters at runtime.
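
In the job script, parameters defined this way are typically read with getResolvedOptions. The parameter name input_path below is a placeholder for whatever parameter you define on the job:

```python
import sys

from awsglue.utils import getResolvedOptions

# "input_path" is a hypothetical job parameter; a custom run, workflow, or
# schedule can override its value at run time.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path"])

print("Reading from:", args["input_path"])
```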

Job tags

SageMaker Unified Studio provides native support for tagging jobs, which helps with organization, filtering, and permission management.

Tag use cases
  • Team identification (e.g., "data-processing-team-a").

  • Domain categorization (e.g., "finance", "marketing").

  • Functional purpose (e.g., "data-governance", "quality-check").

  • Environment designation (e.g., "production", "development").

Tag benefits
  • Filter jobs in the jobs list view.

  • Identify job ownership and purpose.

  • Set permission boundaries for job access.

  • Group related jobs for reporting and management.

Non-overridable arguments

Parameters that cannot be changed at runtime, regardless of how the job is triggered. These parameters are locked to their original values set during job creation.

Use cases for non-overridable arguments
  • Security settings (e.g., fine-grained access control).

  • Critical configuration that should remain consistent.

  • Compliance requirements that must be enforced.