
Workflow tasks

Workflow tasks are the individual processes within a run. Each task has a unique identifier. For a private workflow, HealthOmics runs your tasks using the compute specifications that you define. Input files to the workflow and workflow tasks are staged to a scratch volume that's dedicated to the workflow run. Staged inputs are read-only, which prevents tasks from modifying potential inputs to other tasks in the workflow. Directories included as input are also read-only. All inputs are made available in the task's working directory as symbolic links, and they're accessible only if they're declared in the workflow definition file. Many genomics applications assume that index files are in the same location as the sequence file (such as a companion BAI file for a BAM file). To make sure indexes are present, you must specify them as task inputs.
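
For example, if you're writing the workflow in Nextflow, a process can declare the index file alongside the sequence file so that both are staged into the task's working directory. The following is a minimal sketch; the process, channel contents, and file names are illustrative:

    process COUNT_READS {
        input:
        tuple path(bam), path(bai)  // declare the BAI index alongside the BAM

        output:
        path "counts.txt"

        script:
        """
        # samtools finds the index next to the BAM because both files
        # were declared as inputs and staged (as symbolic links) together
        samtools idxstats ${bam} > counts.txt
        """
    }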

Because workflow tasks can't connect to the public internet, they can't download resources by using HTTP, HTTPS, or FTP. Any required resources should be included as workflow inputs from Amazon S3 or a HealthOmics sequence store, or be present in the container images used to run the workflow tasks. Workflow tasks can interact with Amazon S3 as long as the IAM role used to run the workflow has been granted access.
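
For example, a task can copy an object from Amazon S3 as long as the AWS CLI is present in its container image and the workflow's IAM role allows access to the bucket. In the following Nextflow sketch, the bucket and object names are placeholders:

    process FETCH_ANNOTATIONS {
        output:
        path "annotations.gtf"

        script:
        """
        # Succeeds only if the run's IAM role grants s3:GetObject on the
        # bucket and the container image includes the AWS CLI
        aws s3 cp s3://amzn-s3-demo-bucket/annotations.gtf .
        """
    }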

Memory and computation considerations for tasks

Private workflow tasks run on HealthOmics instances, using the smallest instance type that can accommodate the requested CPU and memory. We recommend choosing the most sensible combination of CPU and memory for your needs. For example, if you need 64 GiB of RAM, the most cost-effective instance type is omics.r.2xlarge, which allocates eight vCPUs. If your task only requests one vCPU, the task container only gets one, even though eight are available on the host machine. Unless it would adversely affect the workflow, consider requesting a larger number of vCPUs, because the additional cores are available on the host anyway. All tasks reserve a small amount of memory for management and logging agents, so the full memory allocation might not always be available to the application in the task.
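
For example, in a Nextflow process you could request all eight vCPUs along with the 64 GiB of memory, because the task will be placed on an omics.r.2xlarge host either way. This is a sketch; the process name and command are illustrative:

    process SORT_BAM {
        cpus 8           // all vCPUs available on the omics.r.2xlarge host
        memory '64 GB'

        input:
        path bam

        output:
        path "sorted.bam"

        script:
        """
        samtools sort -@ ${task.cpus} -o sorted.bam ${bam}
        """
    }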

Container resource allocations are hard limits. Tasks that run out of memory or attempt to use additional vCPUs can be immediately shut down by the host, potentially without warning.

Running Java applications in a private workflow task

When running Java applications in a workflow task, the image used to run the task must contain Java 1.8-212 or later. Earlier versions might attempt to allocate heap memory that's not available to the container. By default, for versions after 1.8-212, the JVM heap allocation is 25% of the memory available to the container. If you use the -Xmx flag to request a larger amount, be aware that not all memory consumed by Java is heap memory. Allocating 100% of the available memory to the JVM heap causes the task to fail.
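
For example, a task that allocates 32 GiB of memory might cap the JVM heap well below that amount, leaving headroom for non-heap memory such as metaspace, thread stacks, and native buffers. In the following Nextflow sketch, the heap size and the path to the Picard JAR are illustrative:

    process MARK_DUPLICATES {
        memory '32 GB'

        input:
        path bam

        output:
        path "dedup.bam"

        script:
        """
        # Set -Xmx below the container allocation; allocating 100% of
        # the available memory to the JVM heap causes the task to fail
        java -Xmx24g -jar /usr/picard/picard.jar MarkDuplicates \\
            I=${bam} O=dedup.bam M=metrics.txt
        """
    }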

Including task inputs in Amazon ECR images

All executables, libraries, and scripts needed to run a workflow task should be provided by the Amazon ECR image that's used to run the task.

It's a best practice to avoid using scripts, binaries, and libraries that are external to a task's container image. This is especially important when using nf-core workflows, which include a bin directory as part of the workflow package. While this directory is available to the workflow task, it's mounted as a read-only directory. Required resources in this directory should be copied into the container image used for the task when you build the image, so that they're available at runtime.
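
For example, if an nf-core module calls a helper script from the bin directory, copy that script into the task's container image when you build it (for example, with a Dockerfile COPY instruction that places it on the PATH) and invoke it from there. In the following Nextflow sketch, the ECR image URI and script name are placeholders:

    process SUMMARIZE_VARIANTS {
        container '123456789012.dkr.ecr.us-west-2.amazonaws.com/my-tools:1.0'

        input:
        path vcf

        output:
        path "summary.txt"

        script:
        """
        # summarize_variants.py was copied into the image at build time,
        # so the task doesn't depend on the workflow package's bin directory
        summarize_variants.py ${vcf} > summary.txt
        """
    }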

Debugging workflow tasks

The following are best practices and considerations for debugging your tasks and workflows. A sketch that combines these commands in a single task script follows the list.

  • Task logs rely on STDOUT and STDERR being produced by the task. If the application used in the task doesn’t produce either of these, then there won't be a task log. To assist with debugging, use applications in verbose mode.

  • To view the commands being run in a task along with their interpolated values, use the set -x Bash command. This can help determine if the task is using the correct inputs and identify where errors might have kept the task from running as intended.

  • Use the echo command to output the values of variables to STDOUT or STDERR. This helps you confirm that they're being set as expected.

  • Use commands like ls -l <name_of_input_file> to confirm that inputs are present and are of the expected size. If they aren't, this might reveal a problem with a prior task producing empty outputs due to a bug.

  • Use the command df -Ph . | awk 'NR==2 {print $4}' in a task's script to determine the space currently available to the task. This can help identify situations where you might need to run the workflow with additional storage allocation.
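
The following Nextflow sketch combines the preceding commands in a single task script; the tool and file names are illustrative:

    process DEBUG_EXAMPLE {
        input:
        path input_bam

        output:
        path "flagstat.txt"

        script:
        """
        set -x                              # print each command with its interpolated values
        echo "input: ${input_bam}" >&2      # confirm variables are set as expected
        ls -l ${input_bam}                  # confirm the input is present and of the expected size
        df -Ph . | awk 'NR==2 {print \$4}'  # report the space currently available to the task
        samtools flagstat ${input_bam} > flagstat.txt
        """
    }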

Including any of the preceding commands in a task script assumes that the task container also includes these commands and that they're on the PATH of the container environment.