Using a startup script with Amazon MWAA
A startup script is a shell (`.sh`) script that you host in your environment's Amazon S3 bucket, similar to your DAGs, requirements, and plugins. Amazon MWAA runs this script during startup on every individual Apache Airflow component (worker, scheduler, and web server) before installing requirements and initializing the Apache Airflow process. Use a startup script to do the following:
- Install runtimes – Install Linux runtimes required by your workflows and connections.
- Configure environment variables – Set environment variables for each Apache Airflow component. Overwrite common variables such as `PATH`, `PYTHONPATH`, and `LD_LIBRARY_PATH`.
- Manage keys and tokens – Pass access tokens for custom repositories to `requirements.txt` and configure security keys. A sketch of this approach follows the list.
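For example, a startup script can export a token that the `requirements.txt` installation later reads, because the script runs before requirements are installed. The following is a minimal sketch, not a prescribed method: the secret ID and variable name are placeholders, it assumes the environment's execution role can read the secret, and it assumes your requirements file references the variable (for example, through pip's `${VAR}` expansion) instead of hard-coding the token.

```sh
#!/bin/sh

# Example only: expose a private-repository token to the requirements installation.
# The secret ID and variable name are placeholders.
export PRIVATE_REPO_TOKEN="$(aws secretsmanager get-secret-value \
    --secret-id my-private-repo-token \
    --query SecretString \
    --output text)"

# A requirements.txt entry could then reference the token, for example:
#   --extra-index-url https://username:${PRIVATE_REPO_TOKEN}@my.private.repo/simple
```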
The following topics describe how to configure a startup script to install Linux runtimes, set environment variables, and troubleshoot related issues using CloudWatch Logs.
Topics
- Configure a startup script
- Install Linux runtimes using a startup script
- Set environment variables using a startup script
Configure a startup script
To use a startup script with your existing Amazon MWAA environment, upload a `.sh` file to your environment's Amazon S3 bucket.
Then, to associate the script with the environment, specify the following in your environment details:
- The Amazon S3 URL path to the script – The relative path to the script hosted in your bucket, for example, `s3://mwaa-environment/startup.sh`.
- The Amazon S3 version ID of the script – The version of the startup shell script in your Amazon S3 bucket. You must specify the version ID that Amazon S3 assigns to the file every time you update the script. Version IDs are Unicode, UTF-8 encoded, URL-ready, opaque strings that are no more than 1,024 bytes long, for example, `3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo`. You can retrieve the version ID with the AWS CLI, as shown following this list.
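Because Amazon MWAA requires versioning on the bucket, you can look up the current version ID of the script with the AWS CLI. The following is a minimal sketch; the bucket name is a placeholder for your own bucket.

```sh
# Retrieve the version ID of the latest startup.sh object (bucket name is an example)
aws s3api head-object \
    --bucket your-mwaa-bucket \
    --key startup.sh \
    --query 'VersionId' \
    --output text
```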
To complete the steps in this section, use the following sample script. The script outputs the value assigned to `MWAA_AIRFLOW_COMPONENT`. This environment variable identifies the Apache Airflow component that the script runs on. Copy the code and save it locally as `startup.sh`.
#!/bin/sh echo "Printing Apache Airflow component" echo $MWAA_AIRFLOW_COMPONENT
Next, upload the script to your Amazon S3 bucket.
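For example, you can copy the file to the bucket with the AWS CLI. The bucket name is a placeholder for your own bucket.

```sh
# Upload the local startup.sh to the environment's Amazon S3 bucket (example bucket name)
aws s3 cp startup.sh s3://your-mwaa-bucket/startup.sh
```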
Now, associate the script with your environment.
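You can do this in the Amazon MWAA console, or by updating the environment with the script's relative path and version ID. The following is a sketch using the AWS CLI; the environment name and version ID are example values, and the flags correspond to the `StartupScriptS3Path` and `StartupScriptS3ObjectVersion` parameters of `UpdateEnvironment`.

```sh
# Associate the startup script and the version to use with the environment
# (environment name and version ID are example values)
aws mwaa update-environment \
    --name MyAirflowEnvironment \
    --startup-script-s3-path startup.sh \
    --startup-script-s3-object-version 3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo
```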
Finally, retrieve log events to verify that the script is working as expected. When you activate logging for each Apache Airflow component, Amazon MWAA creates a new log group and log stream. For more information, see Apache Airflow log types.
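For example, you can list the startup log streams for the worker component with the AWS CLI. The log group name depends on your environment name, and the stream name prefix shown here is illustrative.

```sh
# List startup-related log streams in the worker log group (names are examples)
aws logs describe-log-streams \
    --log-group-name airflow-MyAirflowEnvironment-Worker \
    --log-stream-name-prefix startup
```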
Install Linux runtimes using a startup script
Use a startup script to update the operating system of an Apache Airflow component, and install additional runtime libraries to use with your workflows.
For example, the following script runs `yum update` to update the operating system.

When running `yum update` in a startup script, you must exclude Python using `--exclude=python*` as shown in the example. For your environment to run, Amazon MWAA installs a specific version of Python compatible with your environment. Therefore, you can't update the environment's Python version using a startup script.
#!/bin/sh echo "Updating operating system" sudo yum update -y --exclude=python*
To install runtimes on a specific Apache Airflow component, use `MWAA_AIRFLOW_COMPONENT` in an `if ... fi` conditional statement. This example runs a single command to install the `libaio` library on the scheduler and worker, but not on the web server.
Important
- If you have configured a private web server, you must either use the following condition or provide all installation files locally in order to avoid installation timeouts.
- Use `sudo` to run operations that require administrative privileges.
```sh
#!/bin/sh

if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
then
    sudo yum -y install libaio
fi
```
You can use a startup script to check the Python version.
```sh
#!/bin/sh

export PYTHON_VERSION_CHECK=`python -c 'import sys; version=sys.version_info[:3]; print("{0}.{1}.{2}".format(*version))'`
echo "Python version is $PYTHON_VERSION_CHECK"
```
Amazon MWAA does not support overriding the default Python version, as this may lead to incompatibilities with the installed Apache Airflow libraries.
Set environment variables using a startup script
Use startup scripts to set environment variables and modify Apache Airflow configurations. The following defines a new variable, `ENVIRONMENT_STAGE`. You can reference this variable in a DAG or in your custom modules.
```sh
#!/bin/sh

export ENVIRONMENT_STAGE="development"
echo "$ENVIRONMENT_STAGE"
```
Use startup scripts to overwrite common Apache Airflow or system variables. For example, you can set `LD_LIBRARY_PATH` to instruct Python to look for binaries in the path you specify. This lets you provide custom binaries for your workflows using plugins:
```sh
#!/bin/sh

export LD_LIBRARY_PATH=/usr/local/airflow/plugins/your-custom-binary
```
Reserved environment variables
Amazon MWAA reserves a set of critical environment variables. If you overwrite a reserved variable, Amazon MWAA restores it to its default. The following lists the reserved variables:
- `MWAA_AIRFLOW_COMPONENT` – Used to identify the Apache Airflow component with one of the following values: `scheduler`, `worker`, or `webserver`.
- `AIRFLOW__WEBSERVER__SECRET_KEY` – The secret key used for securely signing session cookies in the Apache Airflow web server.
- `AIRFLOW__CORE__FERNET_KEY` – The key used for encryption and decryption of sensitive data stored in the metadata database, for example, connection passwords.
- `AIRFLOW_HOME` – The path to the Apache Airflow home directory, where configuration files and DAG files are stored locally.
- `AIRFLOW__CELERY__BROKER_URL` – The URL of the message broker used for communication between the Apache Airflow scheduler and the Celery worker nodes.
- `AIRFLOW__CELERY__RESULT_BACKEND` – The URL of the database used to store the results of Celery tasks.
- `AIRFLOW__CORE__EXECUTOR` – The executor class that Apache Airflow should use. In Amazon MWAA, this is the `CeleryExecutor`.
- `AIRFLOW__CORE__LOAD_EXAMPLES` – Used to activate, or deactivate, the loading of example DAGs.
- `AIRFLOW__METRICS__METRICS_BLOCK_LIST` – Used to manage which Apache Airflow metrics are emitted and captured by Amazon MWAA in CloudWatch.
- `SQL_ALCHEMY_CONN` – The connection string for the RDS for PostgreSQL database used to store Apache Airflow metadata in Amazon MWAA.
- `AIRFLOW__CORE__SQL_ALCHEMY_CONN` – Used for the same purpose as `SQL_ALCHEMY_CONN`, but following the new Apache Airflow naming convention.
- `AIRFLOW__CELERY__DEFAULT_QUEUE` – The default queue for Celery tasks in Apache Airflow.
- `AIRFLOW__OPERATORS__DEFAULT_QUEUE` – The default queue for tasks using specific Apache Airflow operators.
- `AIRFLOW_VERSION` – The Apache Airflow version installed in the Amazon MWAA environment.
- `AIRFLOW_CONN_AWS_DEFAULT` – The default AWS credentials used to integrate with other AWS services.
- `AWS_DEFAULT_REGION` – Sets the default AWS Region used with default credentials to integrate with other AWS services.
- `AWS_REGION` – If defined, this environment variable overrides the value of the `AWS_DEFAULT_REGION` environment variable and the profile setting region.
- `PYTHONUNBUFFERED` – Used to send `stdout` and `stderr` streams to container logs.
- `AIRFLOW__METRICS__STATSD_ALLOW_LIST` – Used to configure an allow list of comma-separated prefixes; only metrics that start with the elements of the list are sent.
- `AIRFLOW__METRICS__STATSD_ON` – Activates sending metrics to StatsD.
- `AIRFLOW__METRICS__STATSD_HOST` – Used to connect to the StatsD daemon.
- `AIRFLOW__METRICS__STATSD_PORT` – Used to connect to the StatsD daemon.
- `AIRFLOW__METRICS__STATSD_PREFIX` – Used to connect to the StatsD daemon.
- `AIRFLOW__CELERY__WORKER_AUTOSCALE` – Sets the maximum and minimum concurrency.
- `AIRFLOW__CORE__DAG_CONCURRENCY` – Sets the number of task instances that the scheduler can run concurrently in one DAG.
- `AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG` – Sets the maximum number of active tasks per DAG.
- `AIRFLOW__CORE__PARALLELISM` – Defines the maximum number of task instances that can run simultaneously.
- `AIRFLOW__SCHEDULER__PARSING_PROCESSES` – Sets the maximum number of processes the scheduler can use to parse DAGs.
- `AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT` – Defines the number of seconds a worker waits to acknowledge a task before the message is redelivered to another worker.
- `AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__REGION` – Sets the AWS Region for the underlying Celery transport.
- `AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__PREDEFINED_QUEUES` – Sets the queue for the underlying Celery transport.
- `AIRFLOW_SCHEDULER_ALLOWED_RUN_ID_PATTERN` – Used to verify the validity of your input for the `run_id` parameter when triggering a DAG.
- `AIRFLOW__WEBSERVER__BASE_URL` – The URL of the web server used to host the Apache Airflow UI.
Unreserved environment variables
You can use a startup script to overwrite unreserved environment variables. The following lists some common variables; a sketch showing how to extend them follows the list.
- `PATH` – Specifies a list of directories where the operating system searches for executable files and scripts. When a command runs in the command line, the system checks the directories in `PATH` in order to find and execute the command. When you create custom operators or tasks in Apache Airflow, you might need to rely on external scripts or executables. If the directories containing these files are not specified in the `PATH` variable, the tasks fail to run when the system is unable to locate them. By adding the appropriate directories to `PATH`, Apache Airflow tasks can find and run the required executables.
- `PYTHONPATH` – Used by the Python interpreter to determine which directories to search for imported modules and packages. It is a list of directories that you can add to the default search path. This lets the interpreter find and load Python libraries not included in the standard library, or installed in system directories. Use this variable to add your modules and custom Python packages and use them with your DAGs.
- `LD_LIBRARY_PATH` – An environment variable used by the dynamic linker and loader in Linux to find and load shared libraries. It specifies a list of directories containing shared libraries, which are searched before the default system library directories. Use this variable to specify your custom binaries.
- `CLASSPATH` – Used by the Java Runtime Environment (JRE) and Java Development Kit (JDK) to locate and load Java classes, libraries, and resources at runtime. It is a list of directories, JAR files, and ZIP archives that contain compiled Java code.
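The following is a minimal sketch of a startup script that appends custom directories to these variables. The directory paths are placeholders; replace them with the locations you actually use in your environment.

```sh
#!/bin/sh

# Example only: append custom locations to common search paths.
# The directories below are placeholders for your own paths.
export PATH="$PATH:/usr/local/airflow/custom-bin"
export PYTHONPATH="$PYTHONPATH:/usr/local/airflow/custom-modules"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/airflow/custom-libs"
```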