Zero-Downtime Kubernetes Upgrades: A Battle-Tested Playbook

Kloudpath Team

Over the past two years, our team has upgraded more than 40 production Kubernetes clusters, spanning EKS, AKS, and GKE, without incurring a single minute of customer-facing downtime. This is not because the upgrades were easy. It is because we built a systematic process that anticipates every failure mode and has a tested rollback path for each one. This post is a distillation of that process into a repeatable playbook.

The Upgrade Surface Area

A Kubernetes upgrade is not a single operation. It involves upgrading the control plane (API server, etcd, controller manager, scheduler), the worker nodes (kubelet, container runtime), and the ecosystem components (CNI plugin, CSI drivers, ingress controllers, CoreDNS, cert-manager, and any admission webhooks). Each of these has its own compatibility matrix and its own failure modes. Treating the upgrade as a monolithic operation is the single most common mistake teams make.

We break every upgrade into three phases: pre-flight validation, control plane upgrade, and node pool rolling replacement. Each phase has its own checklist, its own monitoring dashboard, and its own rollback procedure.

Pre-Flight Validation

Before touching production, we run the upgrade on a staging cluster that mirrors production in configuration if not in scale. The staging cluster runs a synthetic workload that exercises all the API features our applications depend on: CRDs, HPA, PodDisruptionBudgets, volume mounts, and network policies. We automated this with a tool that compares the Kubernetes API deprecation list against every manifest in our GitOps repository.

# Check for deprecated APIs before upgrading
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Scan manifests for deprecated API versions
pluto detect-all-in-cluster --target-versions k8s=v1.29

We also run kubent (kube-no-trouble) against the cluster to identify any resources using APIs that will be removed in the target version. In one memorable upgrade from 1.25 to 1.27, we discovered that a third-party Helm chart was still using policy/v1beta1 PodDisruptionBudgets. Had we not caught this in staging, the PDBs would have silently disappeared after the upgrade, leaving our services unprotected during node drains.
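A kubent run against the target version looks something like the following sketch (flag names per current kube-no-trouble releases; verify against your installed version):

```shell
# Scan the live cluster for resources using APIs removed in the target version
kubent --target-version 1.27.0

# Same scan, but exit non-zero on findings so it can gate a CI pipeline
kubent --target-version 1.27.0 --exit-error
```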

PodDisruptionBudgets: Your Safety Net

PodDisruptionBudgets are the single most important primitive for zero-downtime upgrades. A PDB tells the eviction API how many pods backing a workload can be voluntarily disrupted at any given time. Without PDBs, a node drain will evict all pods on a node simultaneously, causing a brief outage for any service that had all its replicas on that node.

Our standard PDB configuration for critical services uses maxUnavailable: 1 with a minimum replica count of 3. This ensures that at least two pods are always running during a drain.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api

One subtlety: PDBs interact with the cluster autoscaler. If you set minAvailable too aggressively (for example, requiring all replicas to be available), the autoscaler cannot drain nodes, and your upgrade will stall indefinitely. We learned this the hard way on an early upgrade where a PDB with minAvailable: 3 on a 3-replica deployment blocked node drain for 45 minutes until we manually intervened.
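Before each drain wave we now look for PDBs that currently permit zero disruptions, since any one of them will stall the drain. A minimal check, assuming jq is available:

```shell
# List PDBs whose status currently allows no voluntary evictions
kubectl get pdb --all-namespaces -o json \
  | jq -r '.items[]
           | select(.status.disruptionsAllowed == 0)
           | "\(.metadata.namespace)/\(.metadata.name)"'
```

Any output here means a drain will block until the offending workload scales up or its PDB is relaxed.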

Control Plane Upgrades

On managed services like EKS, the control plane upgrade is a single API call. AWS handles the rolling replacement of API server instances behind the scenes. The key nuance is that the API server briefly runs mixed versions during the transition. This means you must ensure that your client libraries and kubectl versions are compatible with both the old and new API server versions.
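On EKS, the entire control plane phase reduces to one call plus a wait (the cluster name below is illustrative):

```shell
# Kick off the managed control plane upgrade
aws eks update-cluster-version \
  --name prod-cluster \
  --kubernetes-version 1.29

# Block until the control plane reports ACTIVE again
aws eks wait cluster-active --name prod-cluster
```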

We monitor four metrics during the control plane upgrade: API server request latency (p99), etcd leader election count, webhook admission latency, and the overall API server availability. If any of these metrics breach their thresholds during the upgrade, we halt and investigate before proceeding to the node upgrades. Our Grafana dashboard for this has become one of our most-referenced internal tools.
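Assuming Prometheus is scraping the API server, the four signals map to queries along these lines (metric names are standard Kubernetes/etcd metrics, but exact availability varies by provider; on managed control planes the etcd metric may not be exposed):

```promql
# p99 API server request latency
histogram_quantile(0.99,
  sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))

# etcd leader elections (any increase mid-upgrade is worth a pause)
increase(etcd_server_leader_changes_seen_total[15m])

# p99 admission webhook latency
histogram_quantile(0.99,
  sum by (le, name) (rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])))

# API server 5xx ratio as an availability proxy
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))
```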

Node Pool Rolling Replacement

This is where most downtime occurs if you are not careful. The strategy is to create a new node group running the target Kubernetes version, then cordon and drain the old nodes one at a time, allowing pods to reschedule onto the new nodes.

We use a blue-green approach for node groups rather than in-place upgrades. A new node group is provisioned with the target AMI (or VM image), and once it is healthy, we begin draining the old group. This has a critical advantage: if something goes wrong, the old nodes are still running and we can uncordon them immediately.
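On EKS the blue-green flow can be sketched with eksctl (cluster name, group name, and sizing are illustrative):

```shell
# Provision the replacement node group on the target version's image
eksctl create nodegroup \
  --cluster prod-cluster \
  --name workers-v129 \
  --node-type m5.xlarge \
  --nodes 10

# The old group is then drained node by node; its deletion is
# deferred until the upgrade has been fully validated
```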

# Cordon the old node to prevent new pods from scheduling
kubectl cordon ip-10-0-1-42.ec2.internal

# Drain with a grace period and PDB respect
kubectl drain ip-10-0-1-42.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s

The --timeout flag is essential. Without it, a stuck drain (often caused by a pod that cannot be rescheduled due to node affinity or resource constraints) will block the entire upgrade indefinitely. We set a 5-minute timeout per node and have an alert that fires if any single drain exceeds 3 minutes.
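Our orchestration around the drain boils down to a loop like this sketch; the node-group label is the one EKS applies to managed node groups, and the group name is illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Drain every node in the old group, bailing out on the first failure
for node in $(kubectl get nodes \
    -l eks.amazonaws.com/nodegroup=workers-old -o name); do
  kubectl cordon "$node"
  if ! kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --grace-period=60 \
      --timeout=300s; then
    # A stuck drain: put the node back in service and stop the rollout
    kubectl uncordon "$node"
    echo "drain failed on $node; halting upgrade" >&2
    exit 1
  fi
done
```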

Monitoring During the Upgrade

We deploy a temporary upgrade monitoring dashboard that tracks pod restart count, pending pods, node conditions, and service endpoint counts. The most critical metric is service endpoint count: if an Endpoints object for a critical service drops below the expected replica count, it means pods are being evicted faster than they can be rescheduled, and the upgrade should be paused.
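The endpoint-count check itself is simple; a sketch for a hypothetical api service backed by a deployment of the same name:

```shell
# Compare IPs actually serving behind the Service to desired replicas
expected=$(kubectl get deploy api -o jsonpath='{.spec.replicas}')
actual=$(kubectl get endpoints api \
  -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w)

if [ "$actual" -lt "$expected" ]; then
  echo "endpoint count $actual below expected $expected; pause the drain" >&2
fi
```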

We also run continuous synthetic checks against our services during the upgrade. A simple curl loop hitting the health endpoint from outside the cluster provides a ground-truth signal that is independent of Kubernetes internal metrics. If the health check fails, we stop the drain immediately.
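The synthetic check is deliberately boring; the endpoint URL below is illustrative:

```shell
# Ground-truth availability probe, run from outside the cluster
while true; do
  if ! curl -fsS --max-time 2 https://api.example.com/healthz > /dev/null; then
    echo "$(date -u +%FT%TZ) health check FAILED" >&2
  fi
  sleep 5
done
```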

Rollback Procedures

Rolling back a Kubernetes upgrade is not as simple as downgrading the version. Managed services generally do not support control plane downgrades at all, and etcd schema changes can rule out a downgrade after certain version jumps even where a provider would otherwise allow it. Our rollback strategy is therefore focused on forward fixes: if something breaks, we fix it on the new version rather than reverting.

For node pool rollbacks, we simply uncordon the old node group and cordon the new one. This is instant because we do not terminate old nodes until the upgrade is fully validated. We keep old node groups running for 24 hours after the upgrade completes, which gives us a same-day rollback path if a delayed issue surfaces.
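Because the old nodes are still present, the rollback amounts to two label-selected batch operations (node-group names are illustrative; the label is the one EKS applies to managed node groups):

```shell
# Put the old group back in service
kubectl get nodes -l eks.amazonaws.com/nodegroup=workers-old -o name \
  | xargs -r -n1 kubectl uncordon

# Take the new group out of scheduling while we investigate
kubectl get nodes -l eks.amazonaws.com/nodegroup=workers-v129 -o name \
  | xargs -r -n1 kubectl cordon
```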

Ecosystem Component Upgrades

After the cluster itself is upgraded, we update ecosystem components in a specific order: CNI plugin first (because networking failures are the most disruptive), then CoreDNS, then the CSI driver, then ingress controllers, and finally monitoring and logging agents. Each component is upgraded independently with its own validation step.
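When the components are Helm-managed, the ordered rollout is a short sequence; release and chart names below are illustrative, and --atomic rolls a failed upgrade back automatically:

```shell
# Upgrade in blast-radius order, validating between steps
helm upgrade calico projectcalico/tigera-operator --atomic --wait
helm upgrade coredns coredns/coredns --atomic --wait
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver --atomic --wait
helm upgrade ingress-nginx ingress-nginx/ingress-nginx --atomic --wait
```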

For the CNI plugin (we use Calico on most clusters), we follow the documented upgrade path strictly. A Calico upgrade that goes wrong can partition the network, which is a far worse outage than anything the Kubernetes upgrade itself could cause.

Automation and Repeatability

We codified this entire process into an internal CLI tool that orchestrates the upgrade across all phases. The tool generates a run plan, executes each step with pre and post validation, and produces a report at the end. The entire upgrade for a 50-node cluster takes approximately 2 hours, of which about 20 minutes is human decision points where an engineer reviews metrics before approving the next phase.

The investment in tooling has paid for itself many times over. What used to be a stressful, weekend-long operation is now a routine Tuesday afternoon task that any engineer on the team can execute confidently.
