Workflow tasks
Workflow tasks are the individual processes within a run. Each task has a unique
identifier. For a private workflow, HealthOmics uses your defined compute
specifications to run each task. Input files to the workflow and workflow tasks are
staged to a scratch volume that's dedicated to the workflow run. They are read-only,
which prevents tasks from modifying potential inputs to other tasks in a workflow. The
directories included as input are also read-only. All inputs are made available in the
tasks' working directory as symbolic links, and they're only accessible if they're
declared in the workflow definition file. Many genomics applications assume that index
files are in the same location as a sequence file (such as a companion bai file for a
bam file). To make sure indexes are present, you must specify them as task inputs, as
shown in the sketch below.
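For example, the following minimal Nextflow sketch declares a bam and its companion bai as separate inputs so that both are staged into the task's working directory. The process name, file names, and the use of samtools are illustrative and assume samtools is present in the task's container image:

    process indexAwareTask {
        input:
        path bam    // for example, sample.bam
        path bai    // companion index, for example sample.bam.bai; declare it explicitly

        output:
        path "counts.txt"

        script:
        """
        # Both inputs are staged as read-only symbolic links in the working
        # directory, so samtools finds the index next to the BAM file.
        samtools idxstats ${bam} > counts.txt
        """
    }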
Because workflow tasks can't connect to the public internet, they can't download
resources by using http, https, or ftp. Required resources should be included as
workflow inputs from Amazon S3 or a HealthOmics sequence store, or they should be
present in the container images that are used to run workflow tasks. Workflow tasks
may interact with Amazon S3 as long as the IAM role used to run the workflow has been
granted access.
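As an illustration, a task can stage an object directly from Amazon S3 by using the AWS CLI. This is a sketch only: it assumes the task's container image includes the AWS CLI, the bucket name is hypothetical, and the run's IAM role can read the bucket:

    process fetchReference {
        output:
        path "reference.fa"

        script:
        """
        # This works only because S3 access is permitted by the run's IAM role;
        # an equivalent http, https, or ftp download would fail.
        aws s3 cp s3://amzn-s3-demo-bucket/reference.fa reference.fa
        """
    }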
Memory and computation considerations for tasks
Private workflow tasks are run on HealthOmics instances by using the smallest
instance that can accommodate the requested CPU and memory. We recommend choosing
the most sensible combination of CPU and memory for your needs. For example, if you
need 64 GiB of RAM, then the most cost-effective type is omics.r.2xlarge. This type
allocates eight vCPUs. If your task allocation only calls for one vCPU, the task
container only gets one, even though eight are available on the host machine. Unless
this will adversely affect the workflow, you might want to set a larger number of
vCPUs. All tasks reserve a small amount of memory for management and logging agents,
so the full memory allocation might not always be available to the application in the
task.
Container resource allocations are hard limits. Tasks that run out of memory or attempt to use additional vCPUs can be immediately shut down by the host, potentially without warning.
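For instance, a Nextflow task that needs 64 GiB of RAM can also request all eight vCPUs of the omics.r.2xlarge instance that the memory request already selects. This is a sketch; the process name is hypothetical and samtools is assumed to be in the container image:

    process largeMemoryTask {
        cpus 8           // all vCPUs of omics.r.2xlarge, which the memory request already implies
        memory '64 GB'   // drives selection of the instance type

        input:
        path bam

        output:
        path "sorted.bam"

        script:
        """
        samtools sort -@ ${task.cpus} -o sorted.bam ${bam}
        """
    }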
Running Java applications in a private workflow task
When running Java applications in a workflow task, the image used to run the task
must contain Java 1.8-212 or later. Earlier versions might attempt to allocate heap
memory that's not available to the container. By default, for versions after 1.8-212,
the JVM heap allocation is 25% of the memory available to the container. If you use a
-Xmx flag to request a larger amount, be aware that not all memory consumed by Java
is heap memory. Allocating 100% of the available memory to the JVM heap causes the
task to fail.
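For example, in a task given 16 GB of memory, capping the heap below the container allocation leaves room for the JVM's non-heap memory. This is a sketch; the jar path is hypothetical:

    process javaTask {
        memory '16 GB'

        script:
        """
        # Cap the heap below the container allocation so metaspace, thread
        # stacks, and native buffers still fit; -Xmx16g here would risk the
        # task being shut down by the host.
        java -Xmx12g -jar /opt/tools/app.jar
        """
    }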
Including task inputs in Amazon ECR images
All executables, libraries, and scripts needed to run a workflow task should be provided by the Amazon ECR image that's used to run the task.
It's best practice to avoid using scripts, binaries, and libraries that are
external to a task's container image. This is especially important when using
nf-core workflows that use a bin directory as part of the workflow package. While
this directory is available to the workflow task, it's mounted as read-only. Copy
any required resources from this directory into the task image when you build the
container image, or copy them to a writable location at runtime, as in the sketch
below.
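As a sketch of the runtime approach, a task can copy a required script out of the read-only staged input into its writable working directory before executing it. The script name is hypothetical:

    process binScriptTask {
        input:
        path helper    // a script from the workflow package's bin directory, staged read-only

        script:
        """
        # The staged symbolic link can't be modified, so copy the script to a
        # writable location before making it executable.
        cp ${helper} ./helper.sh
        chmod +x ./helper.sh
        ./helper.sh
        """
    }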
Debugging workflow tasks
The following are best practices and considerations for debugging your tasks and workflows.
- Task logs rely on STDOUT and STDERR being produced by the task. If the application used in the task doesn't produce either of these, there won't be a task log. To assist with debugging, use applications in verbose mode.
- To view the commands being run in a task along with their interpolated values, use the set -x Bash command. This can help determine whether the task is using the correct inputs and identify where errors might have kept the task from running as intended.
- Use the echo command to output the values of variables to STDOUT or STDERR. This helps you confirm that they're being set as expected.
- Use commands like ls -l <name_of_input_file> to confirm that inputs are present and are of the expected size. If they aren't, this might reveal a problem with a prior task producing empty outputs due to a bug.
- Use the command df -Ph . | awk 'NR==2 {print $4}' in a task script to determine the space currently available to the task and to help identify situations where you might need to run the workflow with additional storage allocation.
Including any of the preceding commands in a task script assumes that the task
container also includes these commands and that they are on the path of the
container environment.
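The following sketch combines these debugging aids in a single Nextflow task. The process and input names are illustrative, and the container image is assumed to provide echo, ls, df, and awk. Note that shell and awk field references such as $4 must be escaped as \$ inside a Nextflow script block:

    process debugExample {
        input:
        val sample_id
        path bam

        script:
        """
        set -x                                # print each command with interpolated values
        echo "sample=${sample_id}" >&2        # confirm variable values via STDERR
        ls -l ${bam}                          # confirm the input is present and of the expected size
        df -Ph . | awk 'NR==2 {print \$4}'    # report the space currently available to the task
        """
    }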