Spread workloads across nodes and Availability Zones
Distributing a workload across failure domains, such as individual nodes and Availability Zones, reduces the impact of a failure in any single domain.
Use pod topology spread constraints
With Kubernetes pod topology spread constraints, you can do the following:
- Distribute or concentrate pods across different failure domains depending on application requirements. For example, you can distribute pods for resilience, and you can concentrate pods for network performance.
- Combine different conditions, such as distributing across Availability Zones and distributing across nodes.
- Specify the preferred action if conditions can't be met:
  - Use whenUnsatisfiable: DoNotSchedule with a combination of maxSkew and minDomains to create hard requirements for the scheduler (see the sketch after this list).
  - Use whenUnsatisfiable: ScheduleAnyway to reduce maxSkew.
- If a failure zone becomes unavailable, the pods in that zone become unhealthy. Kubernetes reschedules the pods while adhering to the spread constraint if possible.
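The following is a minimal sketch of the hard-requirement variant mentioned in the list above. The label my-app and the value minDomains: 3 are assumptions for illustration; minDomains takes effect only with whenUnsatisfiable: DoNotSchedule and requires Kubernetes 1.25 or later.

topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app: my-app                            # assumed label; match your pod labels
    maxSkew: 1
    minDomains: 3                              # assumed value; treat the global minimum as 0 until pods span at least 3 zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule           # pods stay Pending rather than violate the constraint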
Pod topology spread constraints
The following code shows an example of using pod topology spread constraints across Availability Zones or across nodes:
...
spec:
  selector:
    matchLabels:
      app: <your-app-label>
  replicas: 3
  template:
    metadata:
      labels:
        app: <your-app-label>
    spec:
      serviceAccountName: <ServiceAccountName>
      ...
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: <your-app-label>
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # <--- spread those pods evenly over all Availability Zones
          whenUnsatisfiable: ScheduleAnyway
        - labelSelector:
            matchLabels:
              app: <your-app-label>
          maxSkew: 1
          topologyKey: kubernetes.io/hostname        # <--- spread those pods evenly over all nodes
          whenUnsatisfiable: ScheduleAnyway
Default cluster-wide topology spread constraints
Out of the box, Kubernetes provides a default set of topology spread constraints:
defaultConstraints:
  - maxSkew: 3
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: ScheduleAnyway
Note
Applications that need different types of topology constraints can override the cluster-level policy.
The default constraints set a high maxSkew, which isn't useful for deployments that have a small number of pods. Currently, the KubeSchedulerConfiguration can't be changed in Amazon EKS.
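For context only, on clusters where you do control the kube-scheduler, these defaults are set through the PodTopologySpread plugin arguments in a KubeSchedulerConfiguration. The following is a minimal sketch, assuming the kubescheduler.config.k8s.io/v1 API and a single default profile; the maxSkew: 1 value is an assumption to show a tighter default.

# Sketch only; this can't be applied on Amazon EKS because the managed control plane
# doesn't expose the kube-scheduler configuration.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultingType: List            # use the constraints listed below instead of the built-in defaults
          defaultConstraints:
            - maxSkew: 1                  # assumed tighter value; the built-in defaults use 3 and 5
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway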
The Gatekeeper policy for topology spread constraints
Another option for enforcing topology spread constraints is to use a policy from the Gatekeeper project. The following code examples show the use of a Gatekeeper OPA policy for a deployment. You can modify the policy for your needs. For example, apply the policy only to deployments that have the label HA=true, or write a similar policy using a different policy controller.
This first example shows a ConstraintTemplate that's used with k8stopologyspreadrequired_template.yml:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8stopologyspreadrequired
spec:
  crd:
    spec:
      names:
        kind: K8sTopologySpreadRequired
      validation:
        openAPIV3Schema:
          type: object
          properties:
            message:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8stopologyspreadrequired

        get_message(parameters, _default) = msg {
          not parameters.message
          msg := _default
        }

        get_message(parameters, _default) = msg {
          msg := parameters.message
        }

        violation[{"msg": msg}] {
          input.review.kind.kind == "Deployment"
          not input.review.object.spec.template.spec.topologySpreadConstraints
          def_msg := "Pod Topology Spread Constraints are required for Deployments"
          msg := get_message(input.parameters, def_msg)
        }
The following code shows the constraint YAML manifest k8stopologyspreadrequired_constraint.yml:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sTopologySpreadRequired
metadata:
  name: require-topologyspread-for-deployments
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
    namespaces:      ## Without these two lines, the constraint applies to the whole cluster
      - "example"
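The ConstraintTemplate also defines an optional message parameter. The following sketch shows how you could pass a custom violation message through the constraint; the message text and namespace are illustrative.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sTopologySpreadRequired
metadata:
  name: require-topologyspread-for-deployments
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
    namespaces:
      - "example"
  parameters:
    message: "Deployments in this namespace must define topologySpreadConstraints."   # illustrative text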
When to use topology spread constraints
Consider using topology spread constraints for the following scenarios:
- Any horizontally scalable application (for example, stateless web services)
- Applications with active-active or active-passive replicas (for example, NoSQL databases or caches)
- Applications with stand-by replicas (for example, controllers)
Examples of system components that fit the horizontally scalable scenario include the following:
- Cluster Autoscaler and Karpenter (with replicaCount > 1 and leader-elect = true)
Pod affinity and anti-affinity
In some cases, it's beneficial to ensure that no more than one pod of a specific type is running on a node. For example, to avoid scheduling multiple network-heavy pods on the same node, you can use an anti-affinity rule with a label such as Ingress or Network-heavy. When you use anti-affinity, you can also use a combination of the following:
- Taints on network-optimized nodes
- Corresponding tolerations on network-heavy pods
- Node affinity or node selector to ensure that network-heavy pods use network-optimized instances
Network-heavy pods are used as an example. You might have different requirements, such as GPU, memory, or local storage. For other usage examples and configuration options, see the Kubernetes documentation.
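As a minimal sketch of that combination, the following pod template fragment uses a required anti-affinity rule so that no two pods with the assumed label app: network-heavy land on the same node, plus an assumed toleration and node selector for network-optimized instances:

# Sketch only; the network-heavy label, the network-optimized taint key, and the
# instance type are assumptions that you would replace with your own values.
spec:
  template:
    metadata:
      labels:
        app: network-heavy
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: network-heavy
              topologyKey: kubernetes.io/hostname     # at most one network-heavy pod per node
      tolerations:
        - key: "network-optimized"                    # assumed taint on network-optimized nodes
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        node.kubernetes.io/instance-type: c5n.4xlarge # assumed network-optimized instance type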
Rebalance pods
This section discusses two approaches to rebalancing pods in a Kubernetes cluster. The first uses the Descheduler for Kubernetes. The Descheduler helps to maintain pod distribution by enforcing strategies to remove pods that violate topology spread constraints or anti-affinity rules. The second approach uses the Karpenter consolidation and bin-packing feature. Consolidation continuously evaluates and optimizes resource usage by consolidating workloads onto fewer, more efficiently packed nodes.
We recommend using Descheduler if you aren't using Karpenter. If you're using Karpenter and Cluster Autoscaler together, you can use Descheduler with Cluster Autoscaler for node groups.
Descheduler for groupless nodes
There's no guarantee that the topology constraints remain satisfied when pods are removed. For example, scaling down a deployment might result in an imbalanced pod distribution. However, because Kubernetes applies pod topology spread constraints only at the scheduling stage, the pods are left unbalanced across failure domains.
To maintain a balanced pod distribution in such scenarios, you can use Descheduler for Kubernetes.
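As a sketch, a Descheduler policy that enables the strategies mentioned earlier in this section might look like the following. This assumes the descheduler/v1alpha1 policy API; adjust the strategy names and parameters for the Descheduler version that you deploy.

# Sketch only, assuming the descheduler/v1alpha1 policy API.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: false   # evict only pods that violate hard (DoNotSchedule) constraints
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true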
Karpenter consolidation and bin-packing feature
For workloads that use Karpenter, you can use the consolidation and bin-packing functionality to optimize resource utilization and reduce costs in Kubernetes clusters. Karpenter continuously evaluates pod placements and node utilization, and it attempts to consolidate workloads onto fewer, more efficiently packed nodes when possible. This process involves analyzing resource requirements, considering constraints such as pod affinity rules, and potentially moving pods between nodes to improve overall cluster efficiency. The following code provides an example:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
For consolidationPolicy, you can use WhenUnderutilized or WhenEmpty:
- When consolidationPolicy is set to WhenUnderutilized, Karpenter considers all nodes for consolidation. When Karpenter discovers a node that's empty or underused, Karpenter attempts to remove or replace the node to reduce cost.
- When consolidationPolicy is set to WhenEmpty, Karpenter considers for consolidation only nodes that contain no workload pods (see the sketch after this list).
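The following is a minimal sketch of the WhenEmpty variant. The NodePool name and the consolidateAfter value are assumptions; in the v1beta1 API, consolidateAfter is typically used together with WhenEmpty.

# Sketch only; the name and consolidateAfter value are assumed.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # consider only nodes that run no workload pods
    consolidateAfter: 5m             # wait 5 minutes after a node becomes empty before removing it
    expireAfter: 720h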
The Karpenter consolidation decisions are not based solely on CPU or memory utilization percentages that you might see in monitoring tools. Instead, Karpenter uses a more complex algorithm based on pod resource requests and potential cost optimizations. For more information, see the Karpenter documentation.