Troubleshooting Kinesis Data Analytics for Apache Flink - Amazon Kinesis Data Analytics

The following can help you troubleshoot problems that you might encounter with Amazon Kinesis Data Analytics for Apache Flink.

General Troubleshooting: Analyze Logs

You can investigate issues with your application by querying your application's CloudWatch logs.

We recommend that you set your log level to INFO. This log level writes sufficient information to your logs to troubleshoot most issues. You can use the DEBUG log level for short periods of time while troubleshooting issues, but it can create significant performance issues for your application.

For information about setting up and analyzing CloudWatch logs, see Logging and Monitoring.

If your application is not writing entries to your CloudWatch log, see Logging Troubleshooting.
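As a minimal sketch of querying an application's logs, the following Python helper runs a CloudWatch Logs Insights query through a boto3 `logs` client that the caller supplies. The helper name, the one-hour window, and the `/ERROR/` filter pattern are illustrative assumptions rather than values from this guide. Adjust the filter to match what your application actually logs.

```python
import time

# Assumed filter: matches any log event whose text contains "ERROR".
ERROR_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| sort @timestamp desc "
    "| limit 50"
)


def find_recent_errors(logs_client, log_group, window_seconds=3600, poll_seconds=1.0):
    """Run a Logs Insights query and return the matching rows."""
    now = int(time.time())
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=now - window_seconds,
        endTime=now,
        queryString=ERROR_QUERY,
    )["queryId"]
    # Queries run asynchronously, so poll until the query reaches a final status.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return response["results"]
        time.sleep(poll_seconds)
```

A call might look like `find_recent_errors(boto3.client("logs"), "/aws/kinesis-analytics/MyApplication")`, where the log group name is a placeholder for your own.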

Cannot access resources in a VPC

If your application uses an Amazon VPC, do the following to verify that your application has access to its resources:

  • Check your CloudWatch logs for the following error. This error indicates that your application cannot access resources in your VPC:

    org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.

    If you see this error, verify that your route tables are set up correctly, and that your connectors have the correct connection settings.

    For information about setting up and analyzing CloudWatch logs, see Logging and Monitoring.

  • Verify that you are not using restricted CIDRs in the subnets in your Amazon VPC. For more information, see Limitations.

Compile error: "Could not resolve dependencies for project"

To compile the Kinesis Data Analytics for Apache Flink sample applications, you must first download and compile the Apache Flink Kinesis connector and add it to your local Maven repository. If the connector is not in your repository, a compile error similar to the following appears:

Could not resolve dependencies for project your project name: Failure to find org.apache.flink:flink-connector-kinesis_2.11:jar:1.8.2 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced

To resolve this error, you must download the Apache Flink source code (version 1.8.2 from https://flink.apache.org/downloads.html) for the connector. For instructions about how to download, compile, and install the Apache Flink source code, see Using the Apache Flink Kinesis Streams Connector.

Invalid Choice: 'kinesisanalyticsv2'

To use v2 of the Kinesis Data Analytics API, you need the latest version of the AWS Command Line Interface (AWS CLI).

For information about upgrading the AWS CLI, see Installing the AWS Command Line Interface in the AWS Command Line Interface User Guide.

Data Is Lost When Writing to an Amazon S3 Bucket

Some data loss may occur when writing to an Amazon S3 bucket using Apache Flink version 1.6.2. We recommend using Apache Flink version 1.8.2 when writing output directly to Amazon S3. To write to an Amazon S3 bucket using Apache Flink 1.6.2, we recommend using Kinesis Data Firehose. For more information about using Kinesis Data Firehose with Kinesis Data Analytics, see Kinesis Data Firehose Sink.

Application Is in RUNNING State but Not Processing Data

You can check your application state using either the ListApplications or the DescribeApplication actions. If your application enters the RUNNING state but is not writing data to your sink, you can troubleshoot the issue by adding an Amazon CloudWatch log stream to your application. For more information, see Working with Application CloudWatch Logging Options. The log stream contains messages that you can use to troubleshoot application issues.
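The following sketch shows one way to add a log stream programmatically. It assumes the caller passes a boto3 `kinesisanalyticsv2` client and the ARN of an existing CloudWatch log stream. The function name `attach_log_stream` is hypothetical.

```python
def attach_log_stream(kda_client, application_name, log_stream_arn):
    """Attach a CloudWatch log stream to an application and return the API response."""
    # Configuration changes must reference the application's current version ID.
    detail = kda_client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    return kda_client.add_application_cloud_watch_logging_option(
        ApplicationName=application_name,
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
        CloudWatchLoggingOption={"LogStreamARN": log_stream_arn},
    )
```

The same `describe_application` response also carries the `ApplicationStatus` field, so it can double as the state check described above.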

Application Fails to Enter RUNNING State

You can check your application state using either the ListApplications or the DescribeApplication actions. If your application fails to enter the RUNNING state, check the following:

  • Verify that your application JAR file is smaller than 512 MB. JAR files larger than 512 MB are not supported.

  • Check your application's CloudWatch logs for errors. For information about setting up CloudWatch logs for your application, see Setting Up Application Logging.
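The JAR size check can be run locally before you upload the file. This is a small sketch that assumes "512 MB" means 512 × 1024 × 1024 bytes. The helper name `jar_within_limit` is hypothetical.

```python
import os

# Assumption: the 512 MB limit is interpreted as 512 MiB.
MAX_JAR_BYTES = 512 * 1024 * 1024


def jar_within_limit(jar_path, limit_bytes=MAX_JAR_BYTES):
    """Return True if the file at jar_path is smaller than limit_bytes."""
    return os.path.getsize(jar_path) < limit_bytes
```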

Snapshot Fails to Be Created

Kinesis Data Analytics takes a snapshot of the application during an UpdateApplication or StopApplication request. The service then uses this snapshot state and restores the application using the updated application configuration to provide exactly-once processing semantics.

The service can't take a snapshot of the application under the following circumstances:

  • The application exceeded the snapshot limit. The limit for snapshots is 1,000. For more information, see Snapshots.

  • The application is not in a healthy state.

  • The application does not have permissions to access its source or sink.

  • The application code is not functioning properly.

  • The application is experiencing other configuration issues.

If you get an exception while taking a snapshot during an application update or while stopping the application, check the application's CloudWatch logs for errors, and retry the request. You can also retry the request after disabling snapshots by setting the SnapshotsEnabled property of your application's ApplicationSnapshotConfiguration to false.

After the application returns to a healthy state, we recommend that you set the SnapshotsEnabled property to true.

You can set the SnapshotsEnabled property using the UpdateApplication action. The following UpdateApplication example sets the SnapshotsEnabled property to true:

aws kinesisanalyticsv2 update-application \
    --application-name MyApplication \
    --current-application-version-id 10 \
    --application-configuration-update '{"ApplicationSnapshotConfigurationUpdate":{"SnapshotsEnabledUpdate":true}}'

You can also update the SnapshotsEnabled property using the console.

Update the SnapshotsEnabled Property Using the Console

  1. Open the Kinesis Data Analytics console at https://console.aws.amazon.com/kinesisanalytics.

  2. In the Kinesis Data Analytics console, choose your application.

  3. In your application's page, choose Configure.

  4. In the Snapshots section, choose Enable.

     [Screenshot showing the Snapshots section.]

Restoring from a Snapshot Fails

If you remove or change an operator in an application update and attempt to restore from a snapshot, the restore will fail by default if the snapshot contains state data for the missing operator. In addition, the application will be stuck in either the STOPPED or UPDATING state. To change this behavior and allow the restore to succeed, change the AllowNonRestoredState parameter of the application's FlinkRunConfiguration to true. This will allow the resume operation to skip state data that cannot be mapped to the new program.

For more information, see Restoring From a Snapshot That Contains Incompatible State Data.
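As an illustration, the parameter can be set through the UpdateApplication action. The sketch below assumes a boto3 `kinesisanalyticsv2` client supplied by the caller. The function name is hypothetical, and the call succeeds only when the application is in a state that accepts updates.

```python
def allow_non_restored_state(kda_client, application_name):
    """Set AllowNonRestoredState to true in the application's FlinkRunConfiguration."""
    detail = kda_client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    return kda_client.update_application(
        ApplicationName=application_name,
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
        # Lets the restore skip state that no longer maps to an operator.
        RunConfigurationUpdate={
            "FlinkRunConfiguration": {"AllowNonRestoredState": True}
        },
    )
```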

Throughput Is Too Slow or MillisBehindLatest Is Increasing

If the application metrics are showing that throughput is too slow or the MillisBehindLatest metric is steadily increasing, do the following:

  • Enable auto scaling if it is disabled, or increase application parallelism. For more information, see Scaling.

  • Check if the application is logging an entry for every record being processed. Writing a log entry for each record during times when the application has high throughput will cause severe bottlenecks in data processing. To check for this condition, query your logs for log entries that your application writes with every record it processes. For more information, see Analyzing Logs with CloudWatch Logs Insights.

  • Increase the application's parallelism manually. You update the application's parallelism using the ParallelismConfigurationUpdate parameter of the UpdateApplication action.

    The maximum number of KPUs for an application is 32 by default. You can raise this limit by requesting a limit increase.

  • Verify that your application's workload isn't distributed unevenly. To control the distribution of workload across your application's worker processes, use the Parallelism setting and the KeyBy operator. For more information, see the Apache Flink documentation.

  • Examine your application logic for inefficient operations, such as accessing an external dependency (for example, a database or a web service) or accessing application state. If you use an external dependency to enrich or otherwise process incoming data, consider using asynchronous I/O instead. For more information, see Async I/O in the Apache Flink documentation.

  • Ensure that your application is not throwing exceptions that lead to job restarts. Each job restart slows processing because the job is torn down and rebuilt, and state data is persisted and restored. For information about logging application errors using CloudWatch Logs, see Setting Up Application Logging.
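The parallelism change described in the list above can be sketched as a single UpdateApplication request. The example assumes a boto3 `kinesisanalyticsv2` client passed in by the caller. The function name `set_parallelism` and the choice to enable auto scaling by default are illustrative assumptions.

```python
def set_parallelism(kda_client, application_name, parallelism, auto_scaling=True):
    """Request a new parallelism value through ParallelismConfigurationUpdate."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    detail = kda_client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    return kda_client.update_application(
        ApplicationName=application_name,
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
        ApplicationConfigurationUpdate={
            "FlinkApplicationConfigurationUpdate": {
                "ParallelismConfigurationUpdate": {
                    # CUSTOM is required to override the default parallelism.
                    "ConfigurationTypeUpdate": "CUSTOM",
                    "ParallelismUpdate": parallelism,
                    "AutoScalingEnabledUpdate": auto_scaling,
                }
            }
        },
    )
```

If the requested parallelism needs more KPUs than the application's limit allows, expect the service to reject the request until the limit is raised.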

Snapshot, Application Update, or Application Stop Error: InvalidApplicationConfigurationException

An error similar to the following may occur during a snapshot operation, or during an operation that creates a snapshot, such as updating or stopping an application:

An error occurred (InvalidApplicationConfigurationException) when calling the UpdateApplication operation: Failed to take snapshot for the application xxxx at this moment. The application is currently experiencing downtime. Please check the application's CloudWatch metrics or CloudWatch logs for any possible errors and retry the request. You can also retry the request after disabling the snapshots in the Kinesis Data Analytics console or by updating the ApplicationSnapshotConfiguration through the AWS SDK

This error occurs when the application is unable to create a snapshot.

If you encounter this error during a snapshot operation or an operation that creates a snapshot, do the following:

  • Disable snapshots for your application. You can do this either in the Kinesis Data Analytics console, or by using the SnapshotsEnabledUpdate parameter of the UpdateApplication action.

  • Investigate why snapshots cannot be created. For more information, see Snapshot Fails to Be Created.

  • Re-enable snapshots when the application returns to a healthy state.
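The disable and re-enable steps above can be sketched with one helper that toggles the SnapshotsEnabledUpdate parameter. It assumes a boto3 `kinesisanalyticsv2` client supplied by the caller. The name `set_snapshots_enabled` is hypothetical.

```python
def set_snapshots_enabled(kda_client, application_name, enabled):
    """Enable or disable snapshots for an application via UpdateApplication."""
    detail = kda_client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    return kda_client.update_application(
        ApplicationName=application_name,
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
        ApplicationConfigurationUpdate={
            "ApplicationSnapshotConfigurationUpdate": {
                "SnapshotsEnabledUpdate": bool(enabled)
            }
        },
    )
```

A typical sequence would call the helper with `False`, retry the update or stop request that failed, and call it again with `True` once the application is healthy.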

Downtime Is Not Zero

If the Downtime metric is not zero, the application is not healthy. Common causes of this condition include the following:

  • Your application's sources or sinks are under-provisioned. Check that any sources or sinks used in the application have enough capacity and are not experiencing read or write throttling.

    If the source or sink is a Kinesis data stream, check the metrics for the stream for ReadProvisionedThroughputExceeded or WriteProvisionedThroughputExceeded errors.

    You can investigate the causes of this condition by querying your application logs for changes in your application's state from RUNNING to FAILED. For more information, see Analyze Errors: Application Task-Related Failures.

  • An operator in your application throws an unhandled exception. When this happens, the application fails over, because the operator cannot handle the failure, and restarts from the latest checkpoint to maintain exactly-once processing semantics. As a result, Downtime is not zero during these restart periods. To prevent this, we recommend that you handle any retryable exceptions in your application code.
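To check a Kinesis data stream for throttling, you can total the relevant CloudWatch metric over a recent window. This sketch assumes a boto3 `cloudwatch` client supplied by the caller. The function name, the 60-minute default window, and the one-minute period are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def throttled_total(cloudwatch_client, stream_name, metric_name, minutes=60):
    """Sum a Kinesis stream throttling metric over the last `minutes` minutes."""
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName=metric_name,
        Dimensions=[{"Name": "StreamName", "Value": stream_name}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,  # one datapoint per minute
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response["Datapoints"])
```

Calling it with `"ReadProvisionedThroughputExceeded"` for a source stream, or `"WriteProvisionedThroughputExceeded"` for a sink stream, gives a nonzero total when the stream throttled requests during the window.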

java.nio.file.NoSuchFileException: /usr/local/openjdk-8/lib/security/cacerts

The location of the SSL truststore was updated in a previous deployment. Use the following value for the ssl.truststore.location parameter instead:

/usr/local/openjdk-8/jre/lib/security/cacerts
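If your connector properties are assembled in code, the path swap can be expressed as a small helper. Holding the properties in a plain dictionary is an assumption made for illustration, and `fix_truststore_location` is a hypothetical name.

```python
OLD_TRUSTSTORE = "/usr/local/openjdk-8/lib/security/cacerts"
NEW_TRUSTSTORE = "/usr/local/openjdk-8/jre/lib/security/cacerts"


def fix_truststore_location(properties):
    """Return a copy of `properties` with the outdated truststore path replaced."""
    updated = dict(properties)
    if updated.get("ssl.truststore.location") == OLD_TRUSTSTORE:
        updated["ssl.truststore.location"] = NEW_TRUSTSTORE
    return updated
```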

UpdateApplication action not reloading application code

The UpdateApplication action will not reload application code with the same filename if no S3 object version is specified. To reload application code with the same filename, enable versioning on your S3 bucket, and specify the new object version using the ObjectVersionUpdate parameter. For more information about enabling object versioning in an S3 bucket, see Enabling or Disabling Versioning.
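A minimal sketch of this reload, assuming the caller passes boto3 `kinesisanalyticsv2` and `s3` clients and that versioning is already enabled on the bucket. The function name `reload_code` is hypothetical.

```python
def reload_code(kda_client, s3_client, application_name, bucket, key):
    """Point the application at the latest version of its code object in S3."""
    # On a versioned bucket, head_object reports the newest version's ID.
    version_id = s3_client.head_object(Bucket=bucket, Key=key).get("VersionId")
    if not version_id:
        raise ValueError("no VersionId returned; is versioning enabled on the bucket?")
    detail = kda_client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    return kda_client.update_application(
        ApplicationName=application_name,
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
        ApplicationConfigurationUpdate={
            "ApplicationCodeConfigurationUpdate": {
                "CodeContentUpdate": {
                    "S3ContentLocationUpdate": {"ObjectVersionUpdate": version_id}
                }
            }
        },
    )
```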