Troubleshooting: DAGs, Operators, Connections, and other issues in Apache Airflow v2 - Amazon Managed Workflows for Apache Airflow

Troubleshooting: DAGs, Operators, Connections, and other issues in Apache Airflow v2

The topics on this page describe how to resolve Apache Airflow v2 Python dependency, custom plugin, DAG, Operator, Connection, task, and Web server issues you may encounter on an Amazon Managed Workflows for Apache Airflow environment.

Connections

The following topic describes the errors you may receive when using an Apache Airflow connection, or using another AWS database.

I can't connect to Secrets Manager

We recommend the following steps:

  1. Learn how to create secret keys for your Apache Airflow connection and variables in Configuring an Apache Airflow connection using an AWS Secrets Manager secret.

  2. Learn how to use the secret key for an Apache Airflow variable (test-variable) in Using a secret key in AWS Secrets Manager for an Apache Airflow variable.

  3. Learn how to use the secret key for an Apache Airflow connection (myconn) in Using a secret key in AWS Secrets Manager for an Apache Airflow connection.
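After the Secrets Manager backend is configured for your environment, your DAG code reads these values through the standard Apache Airflow interfaces. The following is a minimal sketch; it assumes the backend is enabled with the default airflow/connections and airflow/variables prefixes, and that the test-variable and myconn secrets from the steps above exist.

from airflow.models import Variable
from airflow.hooks.base import BaseHook

# Both lookups resolve from AWS Secrets Manager when the secrets backend is configured.
test_value = Variable.get("test-variable")
myconn = BaseHook.get_connection("myconn")
print(myconn.host)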

How do I configure secretsmanager:ResourceTag/<tag-key> Secrets Manager conditions or a resource restriction in my execution role policy?

Note

Applies to Apache Airflow version 2.0 and earlier.

Currently, you cannot limit access to Secrets Manager secrets by using condition keys or other resource restrictions in your environment's execution role, due to a known issue in Apache Airflow.

I can't connect to Snowflake

We recommend the following steps:

  1. Test your DAGs, custom plugins, and Python dependencies locally using the aws-mwaa-local-runner on GitHub.

  2. Add the following entries to the requirements.txt for your environment.

    apache-airflow-providers-snowflake==1.3.0
  3. Add the following imports to your DAG:

    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

Ensure the Apache Airflow connection object includes the following key-value pairs:

  1. Conn Id: snowflake_conn

  2. Conn Type: Snowflake

  3. Host: <my account>.<my region if not us-west-2>.snowflakecomputing.com

  4. Schema: <my schema>

  5. Login: <my user name>

  6. Password: ********

  7. Port: <port, if any>

  8. Extra:

    { "account": "<my account>", "warehouse": "<my warehouse>", "database": "<my database>", "region": "<my region if not using us-west-2 otherwise omit this line>" }

For example:

>>> import json
>>> from airflow.models.connection import Connection
>>> myconn = Connection(
...     conn_id='snowflake_conn',
...     conn_type='Snowflake',
...     host='YOUR_ACCOUNT.YOUR_REGION.snowflakecomputing.com',
...     schema='YOUR_SCHEMA',
...     login='YOUR_USERNAME',
...     password='YOUR_PASSWORD',
...     port='YOUR_PORT',
...     extra=json.dumps(dict(account='YOUR_ACCOUNT', warehouse='YOUR_WAREHOUSE', database='YOUR_DB_OPTION', region='YOUR_REGION')),
... )
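With the connection in place, a minimal DAG that runs a query through the SnowflakeOperator might look like the following sketch (the DAG ID and SQL statement are placeholders):

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="snowflake_test_dag", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
    run_query = SnowflakeOperator(
        task_id="run_query",
        snowflake_conn_id="snowflake_conn",  # the connection configured above
        sql="SELECT CURRENT_VERSION()",
    )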

I can't see my connection in the Airflow UI

Apache Airflow provides connection templates in the Apache Airflow UI and uses the selected template to generate the connection URI string. If a template for your connection type is not available in the Apache Airflow UI, you can use an alternate connection template, such as the HTTP connection template, to generate the connection URI string. A sketch of this approach follows the list below.

We recommend the following steps:

  1. View the connection types that Amazon MWAA provides in the Apache Airflow UI at Apache Airflow provider packages installed on Amazon MWAA environments.

  2. View the commands to create an Apache Airflow connection in the CLI at Apache Airflow CLI command reference.

  3. Learn how to use connection templates interchangeably for connection types that aren't available in the Apache Airflow UI on Amazon MWAA at Overview of connection types.
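For example, the following sketch stores settings for a service whose connection template isn't shown in the UI by reusing the generic HTTP connection type; the connection ID, host, credentials, and Extra values are placeholders:

from airflow.models.connection import Connection

# Provider-specific settings that have no dedicated field go in the Extra JSON.
myconn = Connection(
    conn_id="my_service_conn",
    conn_type="http",
    host="service.example.com",
    login="YOUR_USERNAME",
    password="YOUR_PASSWORD",
    extra='{"role": "YOUR_ROLE"}',
)
print(myconn.get_uri())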

Web server

The following topic describes the errors you may receive for your Apache Airflow Web server on Amazon MWAA.

I see a 5xx error accessing the web server

We recommend the following steps:

  1. Check Apache Airflow configuration options. Verify that the key-value pairs you specified as an Apache Airflow configuration option, such as AWS Secrets Manager, were configured correctly. To learn more, see I can't connect to Secrets Manager.

  2. Check the requirements.txt. Verify that the Airflow "extras" package and other libraries listed in your requirements.txt are compatible with your Apache Airflow version (see the example after this list).

  3. To explore ways to specify Python dependencies in a requirements.txt file, see Managing Python dependencies in requirements.txt.
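For example, a requirements.txt that pins a provider package against the Apache Airflow constraints file for your environment's Apache Airflow and Python versions might look like the following sketch (the versions shown are placeholders; match them to your environment):

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.7.txt"
apache-airflow-providers-snowflake==1.3.0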

I see a 'The scheduler does not appear to be running' error

If the scheduler doesn't appear to be running, or the last "heartbeat" was received several hours ago, your DAGs may not appear in Apache Airflow, and new tasks will not be scheduled.

We recommend the following steps:

  1. Confirm that your VPC security group allows inbound access to port 5432. This port is needed to connect to the Amazon Aurora PostgreSQL metadata database for your environment. After this rule is added, give Amazon MWAA a few minutes, and the error should disappear. To learn more, see Security in your VPC on Amazon MWAA.

    Note
    • The Aurora PostgreSQL metadatabase is part of the Amazon MWAA service architecture and is not visible in your AWS account.

    • Database-related errors are usually a symptom of scheduler failure and not the root cause.

  2. If the scheduler is not running, it might be due to a number of factors such as dependency installation failures, or an overloaded scheduler. Confirm that your DAGs, plugins, and requirements are working correctly by viewing the corresponding log groups in CloudWatch Logs. To learn more, see Monitoring and metrics for Amazon Managed Workflows for Apache Airflow.
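The following is a minimal boto3 sketch for scanning recent scheduler log events for errors. It assumes scheduler logging is enabled and that the log group follows the default Amazon MWAA naming pattern (airflow-<EnvironmentName>-Scheduler); replace the environment name with your own.

import boto3

logs = boto3.client("logs")

# Scan the scheduler log group for recent ERROR entries.
response = logs.filter_log_events(
    logGroupName="airflow-MyEnvironmentName-Scheduler",
    filterPattern="ERROR",
    limit=25,
)
for event in response["events"]:
    print(event["message"])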

Tasks

The following topic describes the errors you may receive for Apache Airflow tasks in an environment.

I see my tasks stuck or not completing

If your Apache Airflow tasks are "stuck" or not completing, we recommend the following steps:

  1. There may be a large number of DAGs defined. Reduce the number of DAGs and perform an update of the environment (such as changing a log level) to force a reset.

    1. Airflow parses DAGs whether or not they are enabled. If you're using more than 50% of your environment's capacity, you may start overwhelming the Apache Airflow scheduler. This appears as a large Total Parse Time in CloudWatch Metrics or long DAG processing times in CloudWatch Logs. There are other ways to optimize Apache Airflow configurations, which are outside the scope of this guide.

    2. To learn more about the best practices we recommend to tune the performance of your environment, see Performance tuning for Apache Airflow on Amazon MWAA.

  2. There may be a large number of tasks in the queue. This often appears as a large—and growing—number of tasks in the "None" state, or as a large number in Queued Tasks and/or Tasks Pending in CloudWatch. This can occur for the following reasons:

    1. There are more tasks to run than the environment has the capacity to run, and/or a large number of tasks were queued before autoscaling had time to detect them and deploy additional Workers.

    2. If there are more tasks to run than the environment has the capacity to run, we recommend reducing the number of tasks that your DAGs run concurrently, and/or increasing the minimum number of Apache Airflow Workers (see the concurrency sketch after this list).

    3. If there are a large number of tasks that were queued before autoscaling has had time to detect and deploy additional workers, we recommend staggering task deployment and/or increasing the minimum Apache Airflow Workers.

    4. You can use the update-environment command in the AWS Command Line Interface (AWS CLI) to change the minimum or maximum number of Workers that run on your environment.

      aws mwaa update-environment --name MyEnvironmentName --min-workers 2 --max-workers 10
    5. To learn more about the best practices we recommend to tune the performance of your environment, see Performance tuning for Apache Airflow on Amazon MWAA.

  3. Tasks may be deleted mid-execution, which appears as task logs that stop with no further indication in Apache Airflow. This can occur for the following reasons:

    1. There is a brief period where 1) the current tasks exceed the environment's capacity, followed by 2) a few minutes with no tasks executing or being queued, and then 3) new tasks being queued.

    2. Amazon MWAA autoscaling reacts to the first scenario by adding additional workers. In the second scenario, it removes the additional workers. Some newly queued tasks may be assigned to workers that are in the process of being removed, and those tasks will end when the worker's container is deleted.

    3. We recommend increasing the minimum number of workers on your environment. Another option is to adjust the timing of your DAGs and tasks to ensure that these scenarios don't occur.

    4. You can also set the minimum workers equal to the maximum workers on your environment, effectively disabling autoscaling. Use the update-environment command in the AWS Command Line Interface (AWS CLI) to disable autoscaling by setting the minimum and maximum number of workers to be the same.

      aws mwaa update-environment --name MyEnvironmentName --min-workers 5 --max-workers 5
    5. To learn more about the best practices we recommend to tune the performance of your environment, see Performance tuning for Apache Airflow on Amazon MWAA.

  4. If your tasks are stuck in the "running" state, you can also clear the tasks or mark them as succeeded or failed. This allows the autoscaling component to scale down the number of workers running on your environment. The following image shows an example of a stranded task.

    (Image: a stranded task in the Apache Airflow UI.)
    1. Choose the circle for the stranded task, and then select Clear (as shown). This allows Amazon MWAA to scale down workers; otherwise, Amazon MWAA can't determine which DAGs are enabled or disabled, and can't scale down, if there are still queued tasks.

      (Image: the Apache Airflow Actions menu.)
  5. Learn more about the Apache Airflow task lifecycle at Concepts in the Apache Airflow reference guide.
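If you choose to reduce DAG concurrency (step 2 in the list above), the following sketch shows DAG-level settings that cap concurrent work; the DAG ID, values, and placeholder task are illustrative and should be tuned to your workload:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="throttled_dag",
    schedule_interval=None,
    catchup=False,
    start_date=days_ago(1),
    concurrency=4,        # maximum tasks from this DAG running at once
    max_active_runs=1,    # maximum concurrent runs of this DAG
) as dag:
    placeholder = DummyOperator(task_id="placeholder")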

CLI

The following topic describes the errors you may receive when running Airflow CLI commands in the AWS Command Line Interface.

I see a '503' error when triggering a DAG in the CLI

The Airflow CLI runs on the Apache Airflow Web server, which has limited concurrency. Typically a maximum of 4 CLI commands can run simultaneously.

Why does the dags backfill Apache Airflow CLI command fail? Is there a workaround?

Note

The following applies only to Apache Airflow v2.0.2 environments.

The backfill command, like other Apache Airflow CLI commands, parses all DAGs locally before any DAGs are processed, regardless of which DAG the CLI operation applies to. In Amazon MWAA environments using Apache Airflow v2.0.2, because plugins and requirements are not yet installed on the web server by the time the CLI command runs, the parsing operation fails, and the backfill operation is not invoked. If you did not have any requirements or plugins in your environment, the backfill operation would succeed.

To run the backfill CLI command, we recommend invoking it from a BashOperator. With a BashOperator, backfill is initiated from the worker, allowing the DAGs to parse successfully because all necessary requirements and plugins are available and installed there. The following example shows how you can create a DAG with a BashOperator to run backfill.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="backfill_dag", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
    cli_command = BashOperator(
        task_id="bash_command",
        bash_command="airflow dags backfill my_dag_id"
    )
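After you upload this DAG to your environment, trigger backfill_dag manually (for example, from the Apache Airflow UI, or with airflow dags trigger backfill_dag). Because its schedule_interval is None, the DAG runs only when triggered, and the BashOperator then starts the backfill for my_dag_id on a worker.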

Operators

The following topic describes the errors you may receive when using Operators.

I received a PermissionError: [Errno 13] Permission denied error using the S3Transform operator

We recommend the following steps if you're trying to run a shell script with the S3Transform operator and you're receiving a PermissionError: [Errno 13] Permission denied error. The following steps assume you have an existing plugins.zip file. If you're creating a new plugins.zip, see Installing custom plugins.

  1. Test your DAGs, custom plugins, and Python dependencies locally using the aws-mwaa-local-runner on GitHub.

  2. Create your "transform" script.

    #!/bin/bash
    cp $1 $2
  3. (optional) macOS and Linux users may need to run the following command to ensure the script is executable.

    chmod 777 transform_test.sh
  4. Add the script to your plugins.zip.

    zip plugins.zip transform_test.sh
  5. Follow the steps in Upload the plugins.zip to Amazon S3.

  6. Follow the steps in Specifying the plugins.zip version on the Amazon MWAA console.

  7. Create the following DAG.

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.s3_file_transform import S3FileTransformOperator
    from airflow.utils.dates import days_ago
    import os

    DAG_ID = os.path.basename(__file__).replace(".py", "")

    with DAG(dag_id=DAG_ID, schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
        file_transform = S3FileTransformOperator(
            task_id='file_transform',
            transform_script='/usr/local/airflow/plugins/transform_test.sh',
            source_s3_key='s3://YOUR_S3_BUCKET/files/input.txt',
            dest_s3_key='s3://YOUR_S3_BUCKET/files/output.txt'
        )
  8. Follow the steps in Uploading DAG code to Amazon S3.