Ensuring High Availability in Kubernetes
High availability (HA) in Kubernetes ensures that workloads remain operational even when nodes fail, maintenance occurs, or infrastructure issues arise. In Thalassa Cloud, deploying workloads across multiple availability zones, enforcing intelligent scheduling, and configuring proper disruption policies are key to minimizing downtime.
Without high availability strategies, a cluster might suffer from single points of failure, leading to application downtime and degraded performance. Kubernetes provides built-in mechanisms to distribute workloads effectively, but it’s up to the user to configure their applications correctly to take advantage of these capabilities. This page covers best practices for achieving high availability using:
- Deployment Strategies to ensure redundancy.
- Pod Topology Spread Constraints to evenly distribute workloads.
- Pod Disruption Budgets (PDBs) to protect applications during voluntary disruptions.
- Node Affinity and Anti-Affinity Rules for intelligent scheduling.
- Health Probes to detect failures and enable application self-healing.
Distributing Workloads Across Availability Zones
Thalassa Cloud runs across multiple availability zones (AZs) within a region. To ensure application resilience, workloads should be evenly distributed across zones using topology-aware scheduling.
A failure in a single availability zone should not impact the overall availability of the application. If all replicas of a deployment are scheduled in the same zone, a localized failure—such as a power outage or hardware failure—could bring down the entire workload. Topology spread constraints allow Kubernetes to enforce even distribution of Pods across failure domains, reducing the risk of zone-wide outages.
Using Topology Spread Constraints
Topology Spread Constraints ensure that Pods are distributed evenly across zones, preventing single points of failure. This is especially important for stateful applications, databases, and high-traffic services.
Example: Distribute Pods Across Availability Zones
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "k8s.thalassa.cloud/zone"
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-app
      containers:
        - name: app
          image: my-web-app:latest
This configuration keeps the number of Pods per zone within a skew of one, minimizing the impact of a zone failure. Because whenUnsatisfiable is set to ScheduleAnyway, the spread is best-effort: the scheduler still places Pods even if an even distribution is temporarily impossible.
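If you would rather enforce the spread strictly, even at the cost of Pods staying Pending when a zone lacks capacity, set whenUnsatisfiable to DoNotSchedule. A minimal sketch of the constraint, reusing the zone label and selector from the example above:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "k8s.thalassa.cloud/zone"
    whenUnsatisfiable: DoNotSchedule   # keep Pods Pending rather than violate the spread
    labelSelector:
      matchLabels:
        app: web-app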
Protecting Workloads During Maintenance with Pod Disruption Budgets
A Pod Disruption Budget (PDB) defines the minimum number of replicas that must remain running during voluntary disruptions, such as node maintenance, cluster upgrades, or rebalancing activities.
Without a properly configured PDB, Kubernetes may evict too many Pods during maintenance, causing service downtime. By setting minAvailable or maxUnavailable, a PDB guarantees that voluntary evictions never reduce the application below the configured threshold.
Example: Ensuring At Least Two Pods Remain Available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
This ensures that voluntary evictions, such as a node drain during maintenance, never leave fewer than two Pods of the application running. Note that a PDB does not protect against involuntary failures such as a node crash.
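For larger deployments it is often more practical to cap how many Pods may be down at once instead of pinning an absolute minimum. A minimal sketch using maxUnavailable, with the same illustrative name and selector:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  maxUnavailable: 1        # at most one Pod may be evicted at a time
  selector:
    matchLabels:
      app: web-app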
Using Node Affinity and Anti-Affinity for Smart Scheduling
Node Affinity steers Pods onto nodes that meet specific requirements, while Pod Anti-Affinity keeps replicas of the same application from concentrating on a single node.
Why Use Node Affinity?
Node affinity is useful for assigning workloads to specific hardware configurations, such as dedicated GPU nodes, high-memory instances, or performance-optimized nodes. This prevents critical workloads from running on unsuitable nodes and helps optimize resource utilization.
Example: Scheduling Pods Only on GPU Nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: k8s.thalassa.cloud/gpu
                    operator: In
                    values:
                      - "true"
      containers:
        - name: gpu-container
          image: gpu-workload:latest
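The requiredDuringSchedulingIgnoredDuringExecution rule is strict: if no node carries the GPU label, the Pods remain Pending. When GPU placement is a preference rather than a hard requirement, preferredDuringSchedulingIgnoredDuringExecution lets the scheduler fall back to other nodes. A minimal sketch of the affinity stanza, assuming the same label key as above:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                 # higher weight means a stronger preference
        preference:
          matchExpressions:
            - key: k8s.thalassa.cloud/gpu
              operator: In
              values:
                - "true"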
Why Use Pod Anti-Affinity?
Pod Anti-Affinity prevents Kubernetes from scheduling all replicas of an application on the same node, ensuring fault tolerance. If a node fails, only a subset of replicas will be affected, allowing the application to continue serving traffic.
Example: Preventing All Pods from Running on a Single Node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: backend
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: backend-container
          image: backend-app:latest
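Because the rule above is required, each replica must land on a distinct node, so the cluster needs at least three schedulable nodes or some replicas stay Pending. If spreading is desirable but not mandatory, a preferred (soft) anti-affinity rule keeps scheduling flexible. A minimal sketch of the stanza:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: backend
          topologyKey: "kubernetes.io/hostname"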
Configuring Health Probes for Self-Healing Applications
Kubernetes Liveness Probes and Readiness Probes allow the platform to detect unhealthy containers automatically: a failing liveness check triggers a container restart, while a failing readiness check removes the Pod from Service endpoints until it recovers.
Health probes ensure that applications remain available even when individual components fail. By setting up Liveness Probes, Kubernetes can restart failing containers, while Readiness Probes ensure that traffic is only routed to fully initialized Pods.
Example: Liveness Probe - Restarting Crashed Pods
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
If the /healthz endpoint stops responding, the kubelet restarts the container after repeated probe failures.
Example: Readiness Probe - Controlling Traffic Routing
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
This ensures the Pod only receives traffic when it is fully ready.
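Both probes are configured per container. A minimal sketch of how they fit into the web-app Deployment from earlier; the /healthz and /ready endpoints are assumptions about the application:

containers:
  - name: app
    image: my-web-app:latest
    ports:
      - containerPort: 8080          # port targeted by both probes
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 5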
Summary
Ensuring high availability in Thalassa Cloud Kubernetes requires proper workload distribution, failover strategies, and intelligent scheduling.
Best Practices:
- Use Topology Spread Constraints to distribute workloads across availability zones.
- Define Pod Disruption Budgets to limit downtime during voluntary disruptions.
- Use Node Affinity and Anti-Affinity for workload isolation and fault tolerance.
- Configure Health Probes to detect failures and trigger self-healing.
By following these strategies, you can ensure resilient, highly available applications on Thalassa Cloud.