PySparkStreamingJob

class aws_cdk.aws_glue_alpha.PySparkStreamingJob(scope, id, *, extra_files=None, extra_jars=None, extra_jars_first=None, extra_python_files=None, job_run_queuing_enabled=None, enable_metrics=None, enable_observability_metrics=None, spark_ui=None, role, script, connections=None, continuous_logging=None, default_arguments=None, description=None, enable_profiling_metrics=None, glue_version=None, job_name=None, max_concurrent_runs=None, max_retries=None, number_of_workers=None, security_configuration=None, tags=None, timeout=None, worker_type=None)

Bases: SparkJob

(experimental) Python Spark Streaming Jobs class.

A Streaming job is similar to an ETL job, except that it performs ETL on data streams using the Apache Spark Structured Streaming framework. These jobs will default to use Python 3.9.

Similar to ETL jobs, streaming job supports Scala and Python languages. Similar to ETL, it supports G1 and G2 worker type and 2.0, 3.0 and 4.0 version. We’ll default to G2 worker and 4.0 version for streaming jobs which developers can override. We will enable —enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.

Stability:: experimental
ExampleMetadata:: infused

Example:

import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkStreamingJob(stack, "ImportedJob", role=role, script=script)

(experimental) PySparkStreamingJob constructor.

Parameters:

scope (Construct)
id (str)
extra_files (Optional[Sequence[Code]]) – (experimental) Additional files, such as configuration files that AWS Glue copies to the working directory of your script before executing it. Default: - no extra files specified.
extra_jars (Optional[Sequence[Code]]) – (experimental) Extra Jars S3 URL (optional) S3 URL where additional jar dependencies are located. Default: - no extra jar files
extra_jars_first (Optional[bool]) – (experimental) Setting this value to true prioritizes the customer’s extra JAR files in the classpath. Default: false - priority is not given to user-provided jars
extra_python_files (Optional[Sequence[Code]]) – (experimental) Extra Python Files S3 URL (optional) S3 URL where additional python dependencies are located. Default: - no extra files
job_run_queuing_enabled (Optional[bool]) – (experimental) Specifies whether job run queuing is enabled for the job runs for this job. A value of true means job run queuing is enabled for the job runs. If false or not populated, the job runs will not be considered for queueing. If this field does not match the value set in the job run, then the value from the job run field will be used. This property must be set to false for flex jobs. If this property is enabled, maxRetries must be set to zero. Default: - no job run queuing
enable_metrics (Optional[bool]) – (experimental) Enable profiling metrics for the Glue job. When enabled, adds ‘–enable-metrics’ to job arguments. Default: true
enable_observability_metrics (Optional[bool]) – (experimental) Enable observability metrics for the Glue job. When enabled, adds ‘–enable-observability-metrics’: ‘true’ to job arguments. Default: true
spark_ui (Union[SparkUIProps, Dict[str, Any], None]) – (experimental) Enables the Spark UI debugging and monitoring with the specified props. Default: - Spark UI debugging and monitoring is disabled.
role (IRole) – (experimental) IAM Role (required) IAM Role to use for Glue job execution Must be specified by the developer because the L2 doesn’t have visibility into the actions the script(s) takes during the job execution The role must trust the Glue service principal (glue.amazonaws.com) and be granted sufficient permissions.
script (Code) – (experimental) Script Code Location (required) Script to run when the Glue job executes. Can be uploaded from the local directory structure using fromAsset or referenced via S3 location using fromBucket
connections (Optional[Sequence[IConnection]]) – (experimental) Connections (optional) List of connections to use for this Glue job Connections are used to connect to other AWS Service or resources within a VPC. Default: [] - no connections are added to the job
continuous_logging (Union[ContinuousLoggingProps, Dict[str, Any], None]) – (experimental) Enables continuous logging with the specified props. Default: - continuous logging is enabled.
default_arguments (Optional[Mapping[str, str]]) – (experimental) Default Arguments (optional) The default arguments for every run of this Glue job, specified as name-value pairs. Default: - no arguments
description (Optional[str]) – (experimental) Description (optional) Developer-specified description of the Glue job. Default: - no value
enable_profiling_metrics (Optional[bool]) – (experimental) Enables the collection of metrics for job profiling. Default: - no profiling metrics emitted.
glue_version (Optional[GlueVersion]) – (experimental) Glue Version The version of Glue to use to execute this job. Default: 3.0 for ETL
job_name (Optional[str]) – (experimental) Name of the Glue job (optional) Developer-specified name of the Glue job. Default: - a name is automatically generated
max_concurrent_runs (Union[int, float, None]) – (experimental) Max Concurrent Runs (optional) The maximum number of runs this Glue job can concurrently run. An error is returned when this threshold is reached. The maximum value you can specify is controlled by a service limit. Default: 1
max_retries (Union[int, float, None]) – (experimental) Max Retries (optional) Maximum number of retry attempts Glue performs if the job fails. Default: 0
number_of_workers (Union[int, float, None]) – (experimental) Number of Workers (optional) Number of workers for Glue to use during job execution. Default: 10
security_configuration (Optional[ISecurityConfiguration]) – (experimental) Security Configuration (optional) Defines the encryption options for the Glue job. Default: - no security configuration.
tags (Optional[Mapping[str, str]]) – (experimental) Tags (optional) A list of key:value pairs of tags to apply to this Glue job resources. Default: {} - no tags
timeout (Optional[Duration]) – (experimental) Timeout (optional) The maximum time that a job run can consume resources before it is terminated and enters TIMEOUT status. Specified in minutes. Default: 2880 (2 days for non-streaming)
worker_type (Optional[WorkerType]) – (experimental) Worker Type (optional) Type of Worker for Glue to use during job execution Enum options: Standard, G_1X, G_2X, G_025X. G_4X, G_8X, Z_2X Default: WorkerType.G_1X

Stability:

experimental

Methods

apply_removal_policy(policy)

Apply the given removal policy to this resource.

The Removal Policy controls what happens to this resource when it stops being managed by CloudFormation, either because you’ve removed it from the CDK application or because you’ve made a change that requires the resource to be replaced.

The resource can be deleted (RemovalPolicy.DESTROY), or left in your AWS account for data recovery and cleanup later (RemovalPolicy.RETAIN).

Parameters:: policy (RemovalPolicy)
Return type:: None

metric(metric_name, type, *, account=None, color=None, dimensions_map=None, id=None, label=None, period=None, region=None, stack_account=None, stack_region=None, statistic=None, unit=None, visible=None)

(experimental) Create a CloudWatch metric.

Parameters:

metric_name (str) – name of the metric typically prefixed with glue.driver., glue.<executorId>. or glue.ALL..
type (MetricType) – the metric type.
account (Optional[str]) – Account which this metric comes from. Default: - Deployment account.
color (Optional[str]) – The hex color code, prefixed with ‘#’ (e.g. ‘#00ff00’), to use when this metric is rendered on a graph. The Color class has a set of standard colors that can be used here. Default: - Automatic color
dimensions_map (Optional[Mapping[str, str]]) – Dimensions of the metric. Default: - No dimensions.
id (Optional[str]) – Unique identifier for this metric when used in dashboard widgets. The id can be used as a variable to represent this metric in math expressions. Valid characters are letters, numbers, and underscore. The first character must be a lowercase letter. Default: - No ID
label (Optional[str]) – Label for this metric when added to a Graph in a Dashboard. You can use dynamic labels to show summary information about the entire displayed time series in the legend. For example, if you use:: [max: ${MAX}] MyMetric As the metric label, the maximum value in the visible range will be shown next to the time series name in the graph’s legend. Default: - No label
period (Optional[Duration]) – The period over which the specified statistic is applied. Default: Duration.minutes(5)
region (Optional[str]) – Region which this metric comes from. Default: - Deployment region.
stack_account (Optional[str]) – Account of the stack this metric is attached to. Default: - Deployment account.
stack_region (Optional[str]) – Region of the stack this metric is attached to. Default: - Deployment region.
statistic (Optional[str]) – What function to use for aggregating. Use the aws_cloudwatch.Stats helper class to construct valid input strings. Can be one of the following: - “Minimum” | “min” - “Maximum” | “max” - “Average” | “avg” - “Sum” | “sum” - “SampleCount | “n” - “pNN.NN” - “tmNN.NN” | “tm(NN.NN%:NN.NN%)” - “iqm” - “wmNN.NN” | “wm(NN.NN%:NN.NN%)” - “tcNN.NN” | “tc(NN.NN%:NN.NN%)” - “tsNN.NN” | “ts(NN.NN%:NN.NN%)” Default: Average
unit (Optional[Unit]) – Unit used to filter the metric stream. Only refer to datums emitted to the metric stream with the given unit and ignore all others. Only useful when datums are being emitted to the same metric stream under different units. The default is to use all matric datums in the stream, regardless of unit, which is recommended in nearly all cases. CloudWatch does not honor this property for graphs. Default: - All metric datums in the given metric stream
visible (Optional[bool]) – Whether this metric should be visible in dashboard graphs. Setting this to false is useful when you want to hide raw metrics that are used in math expressions, and show only the expression results. Default: true

See:

https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html

Stability:

experimental

Return type:

Metric

metric_failure(*, account=None, color=None, dimensions_map=None, id=None, label=None, period=None, region=None, stack_account=None, stack_region=None, statistic=None, unit=None, visible=None)

(experimental) Return a CloudWatch Metric indicating job failure.

This metric is based on the Rule returned by no-args onFailure() call.

Parameters:

account (Optional[str]) – Account which this metric comes from. Default: - Deployment account.
color (Optional[str]) – The hex color code, prefixed with ‘#’ (e.g. ‘#00ff00’), to use when this metric is rendered on a graph. The Color class has a set of standard colors that can be used here. Default: - Automatic color
dimensions_map (Optional[Mapping[str, str]]) – Dimensions of the metric. Default: - No dimensions.
id (Optional[str]) – Unique identifier for this metric when used in dashboard widgets. The id can be used as a variable to represent this metric in math expressions. Valid characters are letters, numbers, and underscore. The first character must be a lowercase letter. Default: - No ID
label (Optional[str]) –
Label for this metric when added to a Graph in a Dashboard. You can use dynamic labels to show summary information about the entire displayed time series in the legend. For example, if you use:: [max: ${MAX}] MyMetric As the metric label, the maximum value in the visible range will be shown next to the time series name in the graph’s legend. Default: - No label
period (Optional[Duration]) – The period over which the specified statistic is applied. Default: Duration.minutes(5)
region (Optional[str]) – Region which this metric comes from. Default: - Deployment region.
stack_account (Optional[str]) – Account of the stack this metric is attached to. Default: - Deployment account.
stack_region (Optional[str]) – Region of the stack this metric is attached to. Default: - Deployment region.
statistic (Optional[str]) – What function to use for aggregating. Use the aws_cloudwatch.Stats helper class to construct valid input strings. Can be one of the following: - “Minimum” | “min” - “Maximum” | “max” - “Average” | “avg” - “Sum” | “sum” - “SampleCount | “n” - “pNN.NN” - “tmNN.NN” | “tm(NN.NN%:NN.NN%)” - “iqm” - “wmNN.NN” | “wm(NN.NN%:NN.NN%)” - “tcNN.NN” | “tc(NN.NN%:NN.NN%)” - “tsNN.NN” | “ts(NN.NN%:NN.NN%)” Default: Average
unit (Optional[Unit]) – Unit used to filter the metric stream. Only refer to datums emitted to the metric stream with the given unit and ignore all others. Only useful when datums are being emitted to the same metric stream under different units. The default is to use all matric datums in the stream, regardless of unit, which is recommended in nearly all cases. CloudWatch does not honor this property for graphs. Default: - All metric datums in the given metric stream
visible (Optional[bool]) – Whether this metric should be visible in dashboard graphs. Setting this to false is useful when you want to hide raw metrics that are used in math expressions, and show only the expression results. Default: true

Stability:

experimental

Return type:

Metric

metric_success(*, account=None, color=None, dimensions_map=None, id=None, label=None, period=None, region=None, stack_account=None, stack_region=None, statistic=None, unit=None, visible=None)

(experimental) Return a CloudWatch Metric indicating job success.

This metric is based on the Rule returned by no-args onSuccess() call.

Parameters:

account (Optional[str]) – Account which this metric comes from. Default: - Deployment account.
color (Optional[str]) – The hex color code, prefixed with ‘#’ (e.g. ‘#00ff00’), to use when this metric is rendered on a graph. The Color class has a set of standard colors that can be used here. Default: - Automatic color
dimensions_map (Optional[Mapping[str, str]]) – Dimensions of the metric. Default: - No dimensions.
id (Optional[str]) – Unique identifier for this metric when used in dashboard widgets. The id can be used as a variable to represent this metric in math expressions. Valid characters are letters, numbers, and underscore. The first character must be a lowercase letter. Default: - No ID
label (Optional[str]) –
Label for this metric when added to a Graph in a Dashboard. You can use dynamic labels to show summary information about the entire displayed time series in the legend. For example, if you use:: [max: ${MAX}] MyMetric As the metric label, the maximum value in the visible range will be shown next to the time series name in the graph’s legend. Default: - No label
period (Optional[Duration]) – The period over which the specified statistic is applied. Default: Duration.minutes(5)
region (Optional[str]) – Region which this metric comes from. Default: - Deployment region.
stack_account (Optional[str]) – Account of the stack this metric is attached to. Default: - Deployment account.
stack_region (Optional[str]) – Region of the stack this metric is attached to. Default: - Deployment region.
statistic (Optional[str]) – What function to use for aggregating. Use the aws_cloudwatch.Stats helper class to construct valid input strings. Can be one of the following: - “Minimum” | “min” - “Maximum” | “max” - “Average” | “avg” - “Sum” | “sum” - “SampleCount | “n” - “pNN.NN” - “tmNN.NN” | “tm(NN.NN%:NN.NN%)” - “iqm” - “wmNN.NN” | “wm(NN.NN%:NN.NN%)” - “tcNN.NN” | “tc(NN.NN%:NN.NN%)” - “tsNN.NN” | “ts(NN.NN%:NN.NN%)” Default: Average
unit (Optional[Unit]) – Unit used to filter the metric stream. Only refer to datums emitted to the metric stream with the given unit and ignore all others. Only useful when datums are being emitted to the same metric stream under different units. The default is to use all matric datums in the stream, regardless of unit, which is recommended in nearly all cases. CloudWatch does not honor this property for graphs. Default: - All metric datums in the given metric stream
visible (Optional[bool]) – Whether this metric should be visible in dashboard graphs. Setting this to false is useful when you want to hide raw metrics that are used in math expressions, and show only the expression results. Default: true

Stability:

experimental

Return type:

Metric

metric_timeout(*, account=None, color=None, dimensions_map=None, id=None, label=None, period=None, region=None, stack_account=None, stack_region=None, statistic=None, unit=None, visible=None)

(experimental) Return a CloudWatch Metric indicating job timeout.

This metric is based on the Rule returned by no-args onTimeout() call.

Parameters:

account (Optional[str]) – Account which this metric comes from. Default: - Deployment account.
color (Optional[str]) – The hex color code, prefixed with ‘#’ (e.g. ‘#00ff00’), to use when this metric is rendered on a graph. The Color class has a set of standard colors that can be used here. Default: - Automatic color
dimensions_map (Optional[Mapping[str, str]]) – Dimensions of the metric. Default: - No dimensions.
id (Optional[str]) – Unique identifier for this metric when used in dashboard widgets. The id can be used as a variable to represent this metric in math expressions. Valid characters are letters, numbers, and underscore. The first character must be a lowercase letter. Default: - No ID
label (Optional[str]) –
Label for this metric when added to a Graph in a Dashboard. You can use dynamic labels to show summary information about the entire displayed time series in the legend. For example, if you use:: [max: ${MAX}] MyMetric As the metric label, the maximum value in the visible range will be shown next to the time series name in the graph’s legend. Default: - No label
period (Optional[Duration]) – The period over which the specified statistic is applied. Default: Duration.minutes(5)
region (Optional[str]) – Region which this metric comes from. Default: - Deployment region.
stack_account (Optional[str]) – Account of the stack this metric is attached to. Default: - Deployment account.
stack_region (Optional[str]) – Region of the stack this metric is attached to. Default: - Deployment region.
statistic (Optional[str]) – What function to use for aggregating. Use the aws_cloudwatch.Stats helper class to construct valid input strings. Can be one of the following: - “Minimum” | “min” - “Maximum” | “max” - “Average” | “avg” - “Sum” | “sum” - “SampleCount | “n” - “pNN.NN” - “tmNN.NN” | “tm(NN.NN%:NN.NN%)” - “iqm” - “wmNN.NN” | “wm(NN.NN%:NN.NN%)” - “tcNN.NN” | “tc(NN.NN%:NN.NN%)” - “tsNN.NN” | “ts(NN.NN%:NN.NN%)” Default: Average
unit (Optional[Unit]) – Unit used to filter the metric stream. Only refer to datums emitted to the metric stream with the given unit and ignore all others. Only useful when datums are being emitted to the same metric stream under different units. The default is to use all matric datums in the stream, regardless of unit, which is recommended in nearly all cases. CloudWatch does not honor this property for graphs. Default: - All metric datums in the given metric stream
visible (Optional[bool]) – Whether this metric should be visible in dashboard graphs. Setting this to false is useful when you want to hide raw metrics that are used in math expressions, and show only the expression results. Default: true

Stability:

experimental

Return type:

Metric

on_event(id, *, target=None, cross_stack_scope=None, description=None, event_pattern=None, rule_name=None)

(experimental) Create a CloudWatch Event Rule for this Glue Job when it’s in a given state.

Parameters:

id (str) – construct id.
target (Optional[IRuleTarget]) – The target to register for the event. Default: - No target is added to the rule. Use addTarget() to add a target.
cross_stack_scope (Optional[Construct]) – The scope to use if the source of the rule and its target are in different Stacks (but in the same account & region). This helps dealing with cycles that often arise in these situations. Default: - none (the main scope will be used, even for cross-stack Events)
description (Optional[str]) – A description of the rule’s purpose. Default: - No description
event_pattern (Union[EventPattern, Dict[str, Any], None]) – Additional restrictions for the event to route to the specified target. The method that generates the rule probably imposes some type of event filtering. The filtering implied by what you pass here is added on top of that filtering. Default: - No additional filtering based on an event pattern.
rule_name (Optional[str]) – A name for the rule. Default: AWS CloudFormation generates a unique physical ID.

See:

https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#glue-event-types

Stability:

experimental

Return type:

Rule

on_failure(id, *, target=None, cross_stack_scope=None, description=None, event_pattern=None, rule_name=None)

(experimental) Return a CloudWatch Event Rule matching FAILED state.

Parameters:

id (str) – construct id.
target (Optional[IRuleTarget]) – The target to register for the event. Default: - No target is added to the rule. Use addTarget() to add a target.
cross_stack_scope (Optional[Construct]) – The scope to use if the source of the rule and its target are in different Stacks (but in the same account & region). This helps dealing with cycles that often arise in these situations. Default: - none (the main scope will be used, even for cross-stack Events)
description (Optional[str]) – A description of the rule’s purpose. Default: - No description
event_pattern (Union[EventPattern, Dict[str, Any], None]) – Additional restrictions for the event to route to the specified target. The method that generates the rule probably imposes some type of event filtering. The filtering implied by what you pass here is added on top of that filtering. Default: - No additional filtering based on an event pattern.
rule_name (Optional[str]) – A name for the rule. Default: AWS CloudFormation generates a unique physical ID.

Stability:

experimental

Return type:

Rule

on_success(id, *, target=None, cross_stack_scope=None, description=None, event_pattern=None, rule_name=None)

(experimental) Create a CloudWatch Event Rule matching JobState.SUCCEEDED.

Parameters:

id (str) – construct id.
target (Optional[IRuleTarget]) – The target to register for the event. Default: - No target is added to the rule. Use addTarget() to add a target.
cross_stack_scope (Optional[Construct]) – The scope to use if the source of the rule and its target are in different Stacks (but in the same account & region). This helps dealing with cycles that often arise in these situations. Default: - none (the main scope will be used, even for cross-stack Events)
description (Optional[str]) – A description of the rule’s purpose. Default: - No description
event_pattern (Union[EventPattern, Dict[str, Any], None]) – Additional restrictions for the event to route to the specified target. The method that generates the rule probably imposes some type of event filtering. The filtering implied by what you pass here is added on top of that filtering. Default: - No additional filtering based on an event pattern.
rule_name (Optional[str]) – A name for the rule. Default: AWS CloudFormation generates a unique physical ID.

Stability:

experimental

Return type:

Rule

on_timeout(id, *, target=None, cross_stack_scope=None, description=None, event_pattern=None, rule_name=None)

(experimental) Return a CloudWatch Event Rule matching TIMEOUT state.

Parameters:

id (str) – construct id.
target (Optional[IRuleTarget]) – The target to register for the event. Default: - No target is added to the rule. Use addTarget() to add a target.
cross_stack_scope (Optional[Construct]) – The scope to use if the source of the rule and its target are in different Stacks (but in the same account & region). This helps dealing with cycles that often arise in these situations. Default: - none (the main scope will be used, even for cross-stack Events)
description (Optional[str]) – A description of the rule’s purpose. Default: - No description
event_pattern (Union[EventPattern, Dict[str, Any], None]) – Additional restrictions for the event to route to the specified target. The method that generates the rule probably imposes some type of event filtering. The filtering implied by what you pass here is added on top of that filtering. Default: - No additional filtering based on an event pattern.
rule_name (Optional[str]) – A name for the rule. Default: AWS CloudFormation generates a unique physical ID.

Stability:

experimental

Return type:

Rule

to_string()

Returns a string representation of this construct.

Return type:: str

Attributes

PROPERTY_INJECTION_ID = '@aws-cdk.aws-glue-alpha.PySparkStreamingJob'

env

The environment this resource belongs to.

For resources that are created and managed in a Stack (those created by creating new class instances like new Role(), new Bucket(), etc.), this is always the same as the environment of the stack they belong to.

For referenced resources (those obtained from referencing methods like Role.fromRoleArn(), Bucket.fromBucketName(), etc.), they might be different than the stack they were imported into.

grant_principal

(experimental) The principal to grant permissions to.

Stability:: experimental

job_arn

(experimental) The ARN of the job.

Stability:: experimental

job_name

(experimental) The name of the job.

Stability:: experimental

node: The tree node.

role

(experimental) The IAM role Glue assumes to run this job.

Stability:: experimental

spark_ui_logging_location

(experimental) The Spark UI logs location if Spark UI monitoring and debugging is enabled.

See:: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
Stability:: experimental

stack: The stack in which this resource is defined.

Static Methods

classmethod from_job_attributes(scope, id, *, job_name, role=None)

(experimental) Identifies an existing Glue Job from a subset of attributes that can be referenced from within another Stack or Construct.

Parameters:

scope (Construct) – The scope creating construct (usually this).
id (str) – The construct’s id.
job_name (str) – (experimental) The name of the job.
role (Optional[IRole]) – (experimental) The IAM role assumed by Glue to run this job. Default: - undefined

Stability:

experimental

Return type:

IJob

classmethod is_construct(x)

Checks if x is a construct.

Use this method instead of instanceof to properly detect Construct instances, even when the construct library is symlinked.

Explanation: in JavaScript, multiple copies of the constructs library on disk are seen as independent, completely different libraries. As a consequence, the class Construct in each copy of the constructs library is seen as a different class, and an instance of one class will not test as instanceof the other class. npm install will not create installations like this, but users may manually symlink construct libraries together or use a monorepo tool: in those cases, multiple copies of the constructs library can be accidentally installed, and instanceof will behave unpredictably. It is safest to avoid using instanceof, and using this type-testing method instead.

Parameters:: x (Any) – Any object.
Return type:: bool
Returns:: true if x is an object created from a class which extends Construct.

classmethod is_owned_resource(construct)

Returns true if the construct was created by CDK, and false otherwise.

Parameters:: construct (IConstruct)
Return type:: bool

classmethod is_resource(construct)

Check whether the given construct is a Resource.

Parameters:: construct (IConstruct)
Return type:: bool