Fault-tolerant execution in Trino - Amazon EMR

Fault-tolerant execution in Trino

Fault-tolerant execution is a mechanism in Trino that a cluster can use to mitigate query failures. To do this, it retries queries or their component tasks when they fail. When fault-tolerant execution is activated, intermediate exchange data is spooled, and another worker can reuse it in the event of a worker outage or other fault during query execution.

For more information on fault-tolerant execution in Trino, see Project Tardigrade delivers ETL at Trino speeds to early users on the Trino blog.

Configuration

Fault-tolerant execution is deactivated by default. To activate the feature, set the retry-policy configuration property in the trino-config classification to either QUERY or TASK based on the desired retry policy, as follows.

{"classification": "trino-config", "properties": { "retry-policy": "QUERY" } }

A QUERY retry policy instructs Trino to retry a query automatically when an error occurs on a worker node. We recommend that you use a QUERY retry policy when the majority of the workload for the Trino cluster comprises many small queries.

A TASK retry policy instructs Trino to retry individual query tasks in the event of failure. We recommend this policy when Trino executes large batch queries. The cluster can more efficiently retry smaller tasks within the query rather than retry the whole query.

Exchange manager

An exchange manager stores and manages spooled data for fault-tolerant execution. It uses external storage to store spilled data beyond the in-memory buffer size. You can configure a file system-based exchange manager that stores spooled data in a specified location, such as Amazon S3, Amazon S3 compatible systems, or HDFS.

Amazon EMR releases 6.9.0 and later include the trino-exchange-manager classification to configure the exchange manager. These releases also support HDFS for spooling.

Setting up exchange manager

Use the trino-exchange-manager configuration classification to configure an exchange manager. This classification internally creates a etc/exchange-manager.properties configuration file on the coordinator and all worker nodes. The classification also sets exchange-manager.name configuration property to filesystem.

By default, Amazon EMR releases 6.9.0 and later use HDFS as an exchange manager. HDFS is available in the Amazon EMR EC2 clusters, and spooling occurs in the trino-exchange/ directory by default. To use the default settings, set the following configuration:

{"Classification": "trino-exchange-manager" }

If you want to provide a custom location, set the following properties in the trino-exchange-manager classification:

  • Set exchange.use-local-hdfs to true.

  • Set exchange.base-directories to the custom directory location in HDFS, for example, exchange.base-directories=/exchange. If the custom directory isn't already in HDFS, Amazon EMR will create it.

HDFS exchange manager configurations

Based on internal testing results, we recommend that you spool to local HDFS for better query performance in comparison to other cloud-based file systems. You can set the following configurations for the exchange manager with HDFS.

Configuration Description Default setting

exchange.hdfs.block-size

Block size for HDFS storage

4 MB

hdfs.config.resources

List of file paths to configure HDFS

If exchange.use-local-hdfs is true, uses the paths to core-site.xml, hdfs-site.xml files; otherwise null

For additional fault-tolerant execution configuration properties, and for information on how to set up Amazon S3 or other Amazon S3 compatible systems for spooling, see the Fault-tolerant execution page of the Trino documentation.

Considerations and limitations

  • If you enable fault-tolerant execution, it disables write operations for connectors that do not support write when retry-policy is set. As of Amazon EMR release 6.9.0, Delta Lake, Hive and Iceberg connectors support write operations with retry-policy.

  • If you use exchange manager and perform expensive I/O operations, your queries may experience degraded performance while exchange manager spools the intermediate data to external storage.