High Availability

Ensuring High Availability in Kubernetes

High availability (HA) in Kubernetes ensures that workloads remain operational even when nodes fail, maintenance occurs, or infrastructure issues arise. In Thalassa Cloud, deploying workloads across multiple availability zones, enforcing intelligent scheduling, and configuring proper disruption policies are key to minimizing downtime.

Without high availability strategies, a cluster might suffer from single points of failure, leading to application downtime and degraded performance. Kubernetes provides built-in mechanisms to distribute workloads effectively, but it’s up to the user to configure their applications correctly to take advantage of these capabilities. This page covers best practices for achieving high availability using:

  • Deployment Strategies to ensure redundancy.
  • Pod Topology Spread Constraints to evenly distribute workloads.
  • Pod Disruption Budgets (PDBs) to protect applications during voluntary disruptions.
  • Node Affinity and Anti-Affinity Rules for intelligent scheduling.
  • Health Probes for Reliability to ensure application self-healing.

Distributing Workloads Across Availability Zones

Thalassa Cloud runs across multiple availability zones (AZs) within a region. To ensure application resilience, workloads should be evenly distributed across zones using topology-aware scheduling.

A failure in a single availability zone should not impact the overall availability of the application. If all replicas of a deployment are scheduled in the same zone, a localized failure—such as a power outage or hardware failure—could bring down the entire workload. Topology spread constraints allow Kubernetes to enforce even distribution of Pods across failure domains, reducing the risk of zone-wide outages.

Using Topology Spread Constraints

Topology Spread Constraints ensure that Pods are distributed evenly across zones, preventing single points of failure. This is especially important for stateful applications, databases, and high-traffic services.

Example: Distribute Pods Across Availability Zones

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "k8s.thalassa.cloud/zone"
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-app
      containers:
      - name: app
        image: my-web-app:latest

This configuration spreads Pods across zones with a maximum skew of one Pod between any two zones. Because whenUnsatisfiable is set to ScheduleAnyway, the constraint is best-effort; change it to DoNotSchedule for strict enforcement, at the risk of Pods remaining Pending when no placement can satisfy the constraint.

Protecting Workloads During Maintenance with Pod Disruption Budgets

A Pod Disruption Budget (PDB) defines the minimum number of replicas that must remain running during voluntary disruptions, such as node maintenance, cluster upgrades, or rebalancing activities.

Without a properly configured PDB, Kubernetes may evict too many Pods during maintenance, causing service downtime. By setting minAvailable or maxUnavailable, a PDB ensures that at least a certain number of Pods remain active at all times.

Example: Ensuring At Least Two Pods Remain Available

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app

This ensures that at least two Pods matching the selector remain available during voluntary disruptions such as node drains and upgrades. Note that a PDB does not protect against involuntary failures like node crashes; combine it with zone-aware scheduling for full resilience.
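The text above also mentions maxUnavailable as an alternative to minAvailable. As a sketch, an equivalent budget can cap the fraction of replicas that may be disrupted at once (the name web-app-pdb-percent is illustrative):

Example: Limiting Disruptions with maxUnavailable

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb-percent
spec:
  # At most a quarter of matching Pods may be evicted voluntarily at any time
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: web-app

A percentage scales with the replica count, which is often more convenient than a fixed number when the Deployment is autoscaled.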

Using Node Affinity and Anti-Affinity for Smart Scheduling

Node Affinity ensures Pods are scheduled on the right nodes, while Pod Anti-Affinity prevents too many Pods from running on the same node.

Why Use Node Affinity?

Node affinity is useful for assigning workloads to specific hardware configurations, such as dedicated GPU nodes, high-memory instances, or performance-optimized nodes. This prevents critical workloads from running on unsuitable nodes and helps optimize resource utilization.

Example: Scheduling Pods Only on GPU Nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.thalassa.cloud/gpu
                operator: In
                values:
                - "true"
      containers:
      - name: gpu-container
        image: gpu-workload:latest

Why Use Pod Anti-Affinity?

Pod Anti-Affinity prevents Kubernetes from scheduling all replicas of an application on the same node, ensuring fault tolerance. If a node fails, only a subset of replicas will be affected, allowing the application to continue serving traffic.

Example: Preventing All Pods from Running on a Single Node

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: backend
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: backend-container
        image: backend-app:latest
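Required anti-affinity can leave replicas unschedulable when there are fewer nodes than replicas. Where best-effort spreading is acceptable, a preferred rule lets the scheduler co-locate Pods only as a last resort. A sketch of the same Deployment's affinity section using a soft rule (the weight value of 100 is illustrative):

Example: Preferring, Rather Than Requiring, Node-Level Spreading

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: backend
              topologyKey: "kubernetes.io/hostname"

With this configuration, the scheduler avoids placing two backend Pods on the same node when possible, but still schedules them together if no other node is available.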

Configuring Health Probes for Self-Healing Applications

Kubernetes Liveness Probes and Readiness Probes allow the kubelet to automatically detect unhealthy containers, restart them, and control when Pods receive traffic.

Health probes ensure that applications remain available even when individual components fail. By setting up Liveness Probes, Kubernetes can restart failing containers, while Readiness Probes ensure that traffic is only routed to fully initialized Pods.

Example: Liveness Probe - Restarting Crashed Pods

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

If the application stops responding, Kubernetes automatically restarts the container.

Example: Readiness Probe - Controlling Traffic Routing

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5

This ensures the Pod only receives traffic when it is fully ready.
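The probe fragments above belong under a container definition. As a sketch, both probes attached to the web-app container from the earlier Deployment might look like this (the /healthz and /ready endpoints must be implemented by the application):

Example: Combining Liveness and Readiness Probes in a Container Spec

      containers:
      - name: app
        image: my-web-app:latest
        ports:
        - containerPort: 8080
        # Restart the container if the health endpoint stops responding
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        # Only route traffic once the readiness endpoint reports success
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5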

Summary

Ensuring high availability in Thalassa Cloud Kubernetes requires proper workload distribution, failover strategies, and intelligent scheduling.

Best Practices:

  • Use Topology Spread Constraints to distribute workloads across availability zones.
  • Define Pod Disruption Budgets to limit downtime during voluntary disruptions.
  • Use Node Affinity and Anti-Affinity for workload isolation and fault tolerance.
  • Configure Health Probes to detect failures and trigger self-healing.

By following these strategies, you can ensure resilient, highly available applications on Thalassa Cloud.