Troubleshooting Kinesis Data Analytics for Apache Flink

The following sections can help you troubleshoot problems that you might encounter with Amazon Kinesis Data Analytics for Apache Flink.

Development Troubleshooting

Compile Error: "Could not resolve dependencies for project"

To compile the Kinesis Data Analytics for Apache Flink sample applications, you must first download and compile the Apache Flink Kinesis connector and add it to your local Maven repository. If the connector isn't in your repository, a compile error similar to the following appears:

Could not resolve dependencies for project your project name: Failure to find org.apache.flink:flink-connector-kinesis_2.11:jar:1.8.2 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced

To resolve this error, download the Apache Flink source code (version 1.8.2) from https://flink.apache.org/downloads.html, then compile the connector and install it in your local Maven repository. For instructions about how to download, compile, and install the Apache Flink source code, see Using the Apache Flink Kinesis Streams Connector.

Invalid Choice: "kinesisanalyticsv2"

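The AWS Command Line Interface (AWS CLI) reports this error when the installed version does not include the kinesisanalyticsv2 commands. To use v2 of the Kinesis Data Analytics API, you need the latest version of the AWS CLI.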

For information about upgrading the AWS CLI, see Installing the AWS Command Line Interface in the AWS Command Line Interface User Guide.

UpdateApplication Action Isn't Reloading Application Code

The UpdateApplication action will not reload application code with the same file name if no S3 object version is specified. To reload application code with the same file name, enable versioning on your S3 bucket, and specify the new object version using the ObjectVersionUpdate parameter. For more information about enabling object versioning in an S3 bucket, see Enabling or Disabling Versioning.
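As an illustration of the request shape described above, the following minimal sketch builds the keyword arguments for an UpdateApplication call that pins a specific S3 object version. The helper name build_code_reload_request and all sample values are hypothetical. The sketch assumes that you send the request with the AWS SDK for Python (Boto3), which is not imported here.

```python
def build_code_reload_request(app_name, current_version_id,
                              bucket_arn, file_key, object_version):
    """Build UpdateApplication arguments that point at one S3 object version.

    Hypothetical helper: it only assembles the request dictionary.
    """
    return {
        "ApplicationName": app_name,
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            "ApplicationCodeConfigurationUpdate": {
                "CodeContentUpdate": {
                    "S3ContentLocationUpdate": {
                        "BucketARNUpdate": bucket_arn,
                        "FileKeyUpdate": file_key,
                        # Without this field, code uploaded under the same
                        # file name is not reloaded.
                        "ObjectVersionUpdate": object_version,
                    }
                }
            }
        },
    }


# Placeholder values for illustration only.
request = build_code_reload_request(
    "my-application", 5,
    "arn:aws:s3:::my-code-bucket", "my-application.jar",
    "example-object-version-id",
)

# To send the request (assumes Boto3 is installed and credentials are set):
#   import boto3
#   boto3.client("kinesisanalyticsv2").update_application(**request)
```

The value of CurrentApplicationVersionId comes from the DescribeApplication action, and the object version ID is assigned by Amazon S3 when you upload the new file to a versioned bucket.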

Runtime Troubleshooting

This section contains issues that arise when running an application.

You can investigate runtime application issues by querying your application's CloudWatch logs. We recommend that you set your log level to INFO. This log level writes sufficient information to your logs to troubleshoot most issues. You can use the DEBUG log level for short periods of time while troubleshooting issues, but it can create significant performance issues for your application.

For information about setting up and analyzing CloudWatch logs, see Logging and Monitoring.

If your application is not writing entries to your CloudWatch log, see Logging Troubleshooting.

Application Is Stuck in a Transient Status

If your application stays in a transient status (STARTING, UPDATING, STOPPING, or AUTOSCALING), you can stop your application using the StopApplication action with the Force parameter set to true. You can't force stop an application in the DELETING status.

Note

Force-stopping your application may lead to data loss or duplication. To prevent data loss or duplicate processing of data during application restarts, we recommend that you take frequent snapshots of your application.
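The force-stop rule above can be sketched as a small check that returns StopApplication arguments only when the application is in one of the listed transient statuses. The helper name build_force_stop_request is hypothetical, and the sketch assumes that you read the status from DescribeApplication and send the request with Boto3.

```python
# Transient statuses in which a force stop is allowed, as listed above.
TRANSIENT_STATUSES = {"STARTING", "UPDATING", "STOPPING", "AUTOSCALING"}


def build_force_stop_request(app_name, status):
    """Return StopApplication arguments for a stuck application.

    Returns None when a force stop does not apply, for example when the
    application is in the DELETING status. Hypothetical helper.
    """
    if status not in TRANSIENT_STATUSES:
        return None
    return {"ApplicationName": app_name, "Force": True}


# To use it (assumes Boto3 is installed and credentials are set):
#   import boto3
#   client = boto3.client("kinesisanalyticsv2")
#   detail = client.describe_application(ApplicationName="my-application")
#   status = detail["ApplicationDetail"]["ApplicationStatus"]
#   request = build_force_stop_request("my-application", status)
#   if request is not None:
#       client.stop_application(**request)
```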

Causes for stuck applications include the following:

  • Application state is too large: Having an application state that is too large or too persistent can cause the application to become stuck during a checkpoint or snapshot operation. Check your application's lastCheckpointDuration and lastCheckpointSize metrics for steadily increasing values or abnormally high values.

  • Application code is too large: Verify that your application JAR file is smaller than 512 MB. JAR files larger than 512 MB are not supported.

  • Application snapshot creation fails: Kinesis Data Analytics takes a snapshot of the application during an UpdateApplication or StopApplication request. The service then uses this snapshot state and restores the application using the updated application configuration to provide exactly-once processing semantics. If automatic snapshot creation fails, see Snapshot Creation Fails following.

  • Restoring from a snapshot fails: If you remove or change an operator in an application update and attempt to restore from a snapshot, the restore will fail by default if the snapshot contains state data for the missing operator. In addition, the application will be stuck in either the STOPPED or UPDATING status. To change this behavior and allow the restore to succeed, change the AllowNonRestoredState parameter of the application's FlinkRunConfiguration to true. This will allow the resume operation to skip state data that cannot be mapped to the new program.

You can check your application status using either the ListApplications or the DescribeApplication actions.
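For the failed-restore case in the preceding list, the following sketch shows one way to set AllowNonRestoredState to true through the UpdateApplication action. It assumes the Boto3 request shape, in which the setting is passed in the RunConfigurationUpdate parameter. The helper name and sample values are hypothetical.

```python
def build_allow_non_restored_state_update(app_name, current_version_id):
    """Build UpdateApplication arguments that let a restore skip state
    belonging to operators that no longer exist. Hypothetical helper.
    """
    return {
        "ApplicationName": app_name,
        "CurrentApplicationVersionId": current_version_id,
        # Assumed shape: the Flink run configuration is updated through
        # the top-level RunConfigurationUpdate parameter.
        "RunConfigurationUpdate": {
            "FlinkRunConfiguration": {"AllowNonRestoredState": True}
        },
    }


# Placeholder values for illustration only.
request = build_allow_non_restored_state_update("my-application", 7)

# To send the request (assumes Boto3 is installed and credentials are set):
#   import boto3
#   boto3.client("kinesisanalyticsv2").update_application(**request)
```

Skipped state is discarded, so enable this setting only when you removed or changed the operator on purpose.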

Snapshot Creation Fails

The Kinesis Data Analytics service can't take a snapshot under the following circumstances:

  • The application exceeded the snapshot limit. The limit for snapshots is 1,000. For more information, see Snapshots.

  • The application isn't in a healthy state.

  • The application doesn't have permissions to access its source or sink.

  • The application code isn't functioning properly.

  • The application is experiencing other configuration issues.

If you get an exception while taking a snapshot during an application update or while stopping the application, set the SnapshotsEnabled property of your application's ApplicationSnapshotConfiguration to false and retry the request.

After the application returns to a healthy state, we recommend that you set the application's SnapshotsEnabled property to true.
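The disable-then-reenable sequence above can be sketched as one request builder that is called twice. The helper name build_snapshot_toggle_request and the sample values are hypothetical, and the sketch assumes that the requests are sent with Boto3 through the UpdateApplication action.

```python
def build_snapshot_toggle_request(app_name, current_version_id, enabled):
    """Build UpdateApplication arguments that turn snapshots on or off.

    Hypothetical helper: it only assembles the request dictionary.
    """
    return {
        "ApplicationName": app_name,
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            "ApplicationSnapshotConfigurationUpdate": {
                "SnapshotsEnabledUpdate": enabled
            }
        },
    }


# Placeholder values for illustration only.
# Step 1: disable snapshots so that the failing update or stop can proceed.
disable_request = build_snapshot_toggle_request("my-application", 8, False)
# Step 2: reenable snapshots after the application is healthy again. Each
# update changes the application version ID, so read the current value
# from DescribeApplication instead of assuming it.
enable_request = build_snapshot_toggle_request("my-application", 9, True)

# To send a request (assumes Boto3 is installed and credentials are set):
#   import boto3
#   boto3.client("kinesisanalyticsv2").update_application(**disable_request)
```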

DaemonException: The child process has been shutdown and can no longer accept messages.

This error occurs if you build the Apache Flink Kinesis Streams Connector with an unsupported version of KPL. Ensure that you are building the connector with KPL version 0.14.0 or higher.

Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not found

This error occurs in an application that uses Apache Beam, but doesn't have the correct dependencies or dependency versions. Kinesis Data Analytics applications that use Apache Beam must use the following versions of dependencies:

<jackson.version>2.10.2</jackson.version>
...
<dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-jaxb-annotations</artifactId>
    <version>2.10.2</version>
</dependency>

If you do not provide the correct version of jackson-module-jaxb-annotations as an explicit dependency, your application loads it from the environment dependencies, and since the versions do not match, the application crashes at runtime.

For more information about using Apache Beam with Kinesis Data Analytics, see Apache Beam.

Cannot Access Resources in a VPC

If your application uses resources in a virtual private cloud (VPC) running on Amazon VPC, do the following to verify that your application has access to those resources:

  • Check your CloudWatch logs for the following error. This error indicates that your application cannot access resources in your VPC:

    org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.

    If you see this error, verify that your route tables are set up correctly, and that your connectors have the correct connection settings.

    For information about setting up and analyzing CloudWatch logs, see Logging and Monitoring.

Data Is Lost When Writing to an Amazon S3 Bucket

Some data loss might occur when writing output to an Amazon S3 bucket using Apache Flink version 1.6.2. We recommend using Apache Flink version 1.8.2 when writing output directly to Amazon S3. To write to an Amazon S3 bucket using Apache Flink 1.6.2, we recommend using Kinesis Data Firehose. For more information about using Kinesis Data Firehose with Kinesis Data Analytics, see Kinesis Data Firehose Sink.

Application Is in the RUNNING Status But Isn't Processing Data

You can check your application status by using either the ListApplications or the DescribeApplication actions. If your application enters the RUNNING status but isn't writing data to your sink, you can troubleshoot the issue by adding an Amazon CloudWatch log stream to your application. For more information, see Working with Application CloudWatch Logging Options. The log stream contains messages that you can use to troubleshoot application issues.

Throughput Is Too Slow or MillisBehindLatest Is Increasing

There can be many causes for slow application throughput. For troubleshooting steps for slow throughput or MillisBehindLatest increasing, see Troubleshooting Throughput.

Snapshot, Application Update, or Application Stop Error: InvalidApplicationConfigurationException

An error similar to the following might occur during a snapshot operation, or during an operation that creates a snapshot, such as updating or stopping an application:

An error occurred (InvalidApplicationConfigurationException) when calling the UpdateApplication operation: Failed to take snapshot for the application xxxx at this moment. The application is currently experiencing downtime. Please check the application's CloudWatch metrics or CloudWatch logs for any possible errors and retry the request. You can also retry the request after disabling the snapshots in the Kinesis Data Analytics console or by updating the ApplicationSnapshotConfiguration through the AWS SDK

This error occurs when the application is unable to create a snapshot.

If you encounter this error during a snapshot operation or an operation that creates a snapshot, do the following:

  • Disable snapshots for your application. You can do this either in the Kinesis Data Analytics console, or by using the SnapshotsEnabledUpdate parameter of the UpdateApplication action.

  • Investigate why snapshots cannot be created. For more information, see Application Is Stuck in a Transient Status.

  • Reenable snapshots when the application returns to a healthy state.

java.nio.file.NoSuchFileException: /usr/local/openjdk-8/lib/security/cacerts

The location of the SSL truststore was updated in a previous deployment. Use the following value for the ssl.truststore.location parameter instead:

/usr/local/openjdk-8/jre/lib/security/cacerts

Downtime Is Not Zero

If the Downtime metric is not zero, the application isn't healthy. Common causes of this condition include the following:

  • Your application's sources or sinks are under-provisioned. Check that any sources or sinks used in the application are well provisioned, and are not experiencing read or write throttling.

    If the source or sink is a Kinesis data stream, check the metrics for the stream for ReadProvisionedThroughputExceeded or WriteProvisionedThroughputExceeded errors.

    You can investigate the causes of this condition by querying your application logs for changes in your application's status from RUNNING to FAILED. For more information, see Analyze Errors: Application Task-Related Failures.

  • An exception in an operator is unhandled. If any exception in an operator in your application is unhandled, the application fails over, because the service interprets the failure as one that the operator cannot handle. The application restarts from the latest checkpoint to maintain "exactly-once" processing semantics. As a result, Downtime is not zero during these restart periods. To prevent this from happening, we recommend that you handle any retryable exceptions in the application code.
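As a generic illustration of handling retryable exceptions inside application code, the following sketch retries an operation with exponential backoff before it lets the exception propagate. It is not specific to Apache Flink. The RetryableError class and the call_with_retries helper are hypothetical names, and which exceptions count as retryable depends on your sources, sinks, and client libraries.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for failures that are worth retrying,
    such as a throttled write to a sink."""


def call_with_retries(operation, max_attempts=3, base_delay_seconds=0.1,
                      sleep=time.sleep):
    """Call operation(), retrying on RetryableError with exponential backoff.

    After max_attempts failures, the last exception is raised again, so the
    caller still sees failures that retrying did not fix.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise
            # Wait base, 2 * base, 4 * base, ... between attempts.
            sleep(base_delay_seconds * 2 ** (attempt - 1))
```

Exceptions other than RetryableError are not caught, so failures that cannot be retried still surface immediately. The sleep parameter exists so that the delay can be replaced in tests.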