Maintenance and troubleshooting
The following sections outline how to maintain your long-running Flink jobs and provide guidance on troubleshooting some common issues.
Migrating Flink applications
Flink applications are typically designed to run for long periods of time such as weeks, months, or even years. As with all long-running services, Flink streaming applications need to be maintained. This includes bug fixes, improvements, and migration to a Flink cluster of a later version.
When the spec changes for FlinkDeployment and FlinkSessionJob resources, you need to upgrade the running application. To do this, the operator stops the running job (unless it is already suspended) and redeploys it with the latest spec and, for stateful applications, the state from the previous run. Users control how state is managed when stateful applications stop and restore with the upgradeMode setting of the JobSpec.
Upgrade modes
- Stateless: Stateless application upgrades from an empty state.
- Last state: Quick upgrades in any application state (even for failing jobs). This mode does not require a healthy job because it always uses the latest successful checkpoint. Manual recovery may be necessary if HA metadata is lost. To limit how far the job may fall back when picking up the latest checkpoint, you can configure kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age. If the checkpoint is older than the configured value, a savepoint is taken instead for healthy jobs. This mode is not supported in Session mode.
- Savepoint: Uses a savepoint for the upgrade, providing maximal safety and the possibility to serve as a backup or fork point. The savepoint is created during the upgrade process. Note that the Flink job needs to be running for the savepoint to be created. If the job is in an unhealthy state, the last checkpoint is used (unless kubernetes.operator.job.upgrade.last-state-fallback.enabled is set to false). If the last checkpoint is not available, the job upgrade fails.
Troubleshooting
This section describes how to troubleshoot problems with Amazon EMR on EKS. For information on how to troubleshoot general problems with Amazon EMR, see Troubleshoot a cluster in the Amazon EMR Management Guide.
Troubleshooting Apache Flink on Amazon EMR on EKS
Resource mapping not found when installing the Helm chart
You might encounter the following error message when you install the Helm chart.
Error: INSTALLATION FAILED: pulling from host 1234567890.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests 6.13.0]: 403 Forbidden
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "flink-operator-serving-cert" namespace: "<the namespace to install your operator>" from "": no matches for kind "Certificate" in version "cert-manager.io/v1"
ensure CRDs are installed first, resource mapping not found for name: "flink-operator-selfsigned-issuer" namespace: "<the namespace to install your operator>" from "": no matches for kind "Issuer" in version "cert-manager.io/v1"
ensure CRDs are installed first].
To resolve this error, install cert-manager to enable the webhook component. You must install cert-manager on each Amazon EKS cluster that you use.
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
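Before retrying the Helm install, you can confirm that cert-manager is running and that its CRDs are registered. These are generic checks; the cert-manager namespace is the default used by the upstream manifest:

```shell
# Confirm the cert-manager pods are Running (default namespace: cert-manager)
kubectl get pods -n cert-manager

# Confirm the CRDs the operator webhook needs (Certificate, Issuer) are registered
kubectl get crds | grep cert-manager.io
```

If the CRDs are present, the "resource mapping not found" error from the Helm install should no longer occur.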
AWS service access denied error
If you see an access denied error, confirm that the IAM role for operatorExecutionRoleArn in the Helm chart values.yaml file has the correct permissions. Also ensure that the IAM role under executionRoleArn in your FlinkDeployment specification has the correct permissions.
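One way to inspect those roles is with the AWS CLI, which can list the policies attached to each role. The role names below are placeholder assumptions; substitute the role names from your operatorExecutionRoleArn and executionRoleArn:

```shell
# List managed policies attached to the operator's execution role (placeholder name)
aws iam list-attached-role-policies --role-name my-flink-operator-role

# List inline policies on the job's execution role (placeholder name)
aws iam list-role-policies --role-name my-flink-job-execution-role
```

Compare the returned policies against the AWS services your job accesses (for example, the S3 buckets used for checkpoints and savepoints).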
FlinkDeployment is stuck
If your FlinkDeployment is stuck in an unhealthy state, use the following steps to force delete the deployment:
- Edit the deployment by running:
  kubectl edit -n <Flink Namespace> flinkdeployments/<App Name>
- Remove this finalizer:
  finalizers:
    - flinkdeployments.flink.apache.org/finalizer
- Delete the deployment:
  kubectl delete -n <Flink Namespace> flinkdeployments/<App Name>
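Alternatively, the first two steps can be done non-interactively with kubectl patch, which clears the finalizers in one command. The namespace and application name below are placeholders:

```shell
# Clear all finalizers on the stuck FlinkDeployment (placeholders: my-namespace, my-app)
kubectl patch -n my-namespace flinkdeployments/my-app \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# Then delete the deployment
kubectl delete -n my-namespace flinkdeployments/my-app
```

Note that removing finalizers bypasses the operator's normal cleanup, so use this only when the deployment cannot be deleted normally.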