Cookbook¶

Task-oriented, copy-pasteable workflows that string the commands together. Each recipe follows the refresh loop: status → readiness → patch → upgrade. See the upgrade lifecycle for the why.

Fleet posture: what's stale, everywhere¶

Start every session here. One table, all clusters, all regions — Kubernetes version, support window, stale AMIs, and add-ons behind latest.

# The front door
refresh status -A

# Narrow to "prod" clusters in two regions
refresh status prod -r us-east-1 -r us-west-2

# Machine-readable for a dashboard
refresh status -A -o json

status exits non-zero when the fleet needs attention, so it doubles as a CI gate. See refresh status.

Readiness: am I safe to upgrade?¶

A read-only pre-flight that surfaces AWS Cluster Insights plus control-plane vs. nodegroup/add-on version skew. Nothing is mutated.

refresh cluster upgrade-check -c prod-east

# Include passing checks, as JSON for a gate
refresh cluster upgrade-check -c prod-east --show-passing -o json

# Drill into a flagged insight
refresh cluster upgrade-check -c prod-east --id <insight-id>

See cluster upgrade-check.

Patch a single nodegroup¶

Always preview first. --changelog prints the amazon-eks-ami release notes between the current and target AMI so you know what's changing.

# Preview the roll + read the AMI changelog
refresh nodegroup update prod-east ng-default --dry-run --changelog

# Roll it, requiring a clean health gate
refresh nodegroup update prod-east ng-default --require-healthy

See nodegroup update.

Patch the whole fleet¶

Fleet mode discovers clusters across regions and rolls them serially with one batch confirmation and a worst-outcome exit code. Dry-run, eyeball, then commit.

# Fleet-wide plan
refresh nodegroup update --all-clusters -r us-east-1 -r us-west-2 --dry-run

# Execute once the plan looks right
refresh nodegroup update --all-clusters -r us-east-1 -r us-west-2 --yes

Unattended / CI patch¶

For cron, suppress prompts with --yes, treat health warnings as a hard stop with --require-healthy, and emit a JSON summary with -o json. Branch on the exit code.

refresh nodegroup update -c prod --yes --require-healthy -o json
case $? in
  0) echo "patched cleanly" ;;
  2) echo "health warnings — review" ;;
  3) echo "blocked by health — do not proceed" ;;
  4) echo "some updates failed to start" ;;
  5) echo "rolled, but verification flagged issues" ;;
esac

TTY-less runs need --yes

Without a terminal and without --yes, a run that would otherwise prompt fails fast — so CI never hangs waiting on stdin.

Update add-ons safely¶

Update every add-on in dependency-safe order (vpc-cni → coredns/kube-proxy → others), waiting for each to settle before the next.

refresh addon update prod-east --all --dependency-order --wait

# Skip an add-on you manage via Helm/GitOps
refresh addon update prod-east --all --dependency-order --wait --skip aws-load-balancer-controller

See addon update.

Safe scale-down¶

Scaling down can strand pods behind a PodDisruptionBudget. Preview the exact PDBs that would constrain the operation before touching anything.

# Preview the constraining PDBs (no changes)
refresh nodegroup scale prod-east -n ng-default --desired 2 --check-pdbs --dry-run

# Scale down for real, gated on PDBs, waiting for it to settle
refresh nodegroup scale prod-east -n ng-default --desired 2 --check-pdbs --wait

See nodegroup scale.

Stop repeating `--region` / `--profile`¶

Save your environments once as contexts, then switch by name. The active context fills in cluster/region/profile defaults.

refresh context add prod  --cluster prod-eks  --region us-east-1 --profile prod
refresh context add stage --cluster stage-eks --region us-west-2 --profile stage

refresh use prod
refresh nodegroup list           # targets prod-eks / us-east-1 / prod
refresh cluster upgrade-check    # same context, no flags

refresh use stage                # flip the whole environment

Orchestrate a full cluster upgrade¶

When you're ready to move a minor version, let refresh sequence the whole thing: control plane → add-ons → nodegroups, with a health gate after every phase. EKS upgrades one minor at a time, so a multi-minor jump expands into sequential hops. Always dry-run first.

# Print the ordered plan (exits non-zero if anything blocks)
refresh cluster upgrade -c prod-east --to 1.33 --dry-run

# Execute, confirming each mutating phase
refresh cluster upgrade -c prod-east --to 1.33

# Non-interactive run, skipping a Helm-managed add-on
refresh cluster upgrade -c prod-east --to 1.33 --yes --skip aws-load-balancer-controller

The plan is re-derived from live state on every run, so rerunning after a failure (or Ctrl+C) resumes where it left off. See cluster upgrade.