It happens faster than you can react. A junior developer runs kubectl delete namespace production instead of kubectl delete namespace test-production. In the two seconds before muscle memory kicks in and they hit Ctrl+C, Kubernetes has already begun terminating every pod, service, configmap, and secret in your production namespace.
Your database PersistentVolumeClaim survives because the storage class has reclaimPolicy: Retain. Small mercy. But the Task API deployment, the inference service configuration, the carefully-tuned HPA settings, the secrets containing API keys for three external services - all gone. You stare at the terminal, trying to remember what was in that namespace. You haven't touched some of those configurations in months.
This is the moment you discover whether your disaster recovery strategy exists beyond good intentions. Do you have backups? When were they last tested? Can you restore to a specific namespace? How long until production is back?
Velero answers these questions before disaster strikes. It's an open-source tool that backs up Kubernetes resources and persistent volumes, stores them safely off-cluster, and restores them when you need them. This lesson teaches you to install Velero, configure scheduled backups with retention policies, implement database-aware hooks for consistency, and verify that restores actually work.
In production, you'll use cloud object storage - S3, GCS, Azure Blob. For local development and testing, MinIO provides an S3-compatible storage backend that runs in your cluster.
Step 1: Deploy MinIO
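A minimal sketch of a single-replica MinIO for local testing. The minio namespace, the minio / minio123 credentials, and the emptyDir volume are assumptions chosen for illustration, not values from this lesson; emptyDir loses all data when the pod restarts, so treat this as a throwaway backend.

```shell
# Write a throwaway MinIO manifest (dev/test only).
# Assumed names: namespace "minio", credentials minio / minio123.
cat > minio.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: minio
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio
          args: ["server", "/data"]
          env:
            - name: MINIO_ROOT_USER
              value: minio
            - name: MINIO_ROOT_PASSWORD
              value: minio123
          ports:
            - containerPort: 9000
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}   # data is lost on pod restart
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: minio
spec:
  selector:
    app: minio
  ports:
    - port: 9000
      targetPort: 9000
EOF
```

Apply it with kubectl apply -f minio.yaml, then wait for the pod with kubectl rollout status deployment/minio -n minio.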
Create a bucket for Velero backups:
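One way to do this, assuming MinIO runs as a Service named minio in a minio namespace with the credentials minio / minio123, is a one-off pod using the MinIO client image. The bucket name velero-backups is also an assumption, and the command relies on the minio/mc image shipping a shell.

```shell
# Run the MinIO client once inside the cluster, then let the pod be removed.
# Assumed: Service "minio" on port 9000, credentials minio / minio123.
kubectl -n minio run minio-mc --rm -i --restart=Never \
  --image=minio/mc --command -- /bin/sh -c \
  'mc alias set local http://minio:9000 minio minio123 && mc mb --ignore-existing local/velero-backups'
```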
Step 2: Install Velero
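The install below follows the usual pattern for an S3-compatible backend: Velero's AWS plugin pointed at the MinIO endpoint. The credentials, bucket name, Service address, and plugin tag are assumptions; pick the plugin tag that matches your Velero release.

```shell
# Credentials file in AWS format (assumed MinIO credentials).
cat > credentials-velero <<'EOF'
[default]
aws_access_key_id = minio
aws_secret_access_key = minio123
EOF

# Install Velero into the cluster. MinIO cannot take volume snapshots,
# so snapshot support is switched off for this local setup.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=false \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.minio.svc:9000
```

With snapshots disabled, this setup backs up Kubernetes resources only; backing up volume contents requires a snapshot-capable provider or Velero's file-system backup.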
Verify Velero is running:
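A quick check, assuming Velero was installed into its default velero namespace:

```shell
# The velero deployment's pod should be Running.
kubectl get pods -n velero

# The PHASE column of the default location should read Available.
velero backup-location get
```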
An Available phase on the BackupStorageLocation confirms that Velero can reach the MinIO bucket.
Velero introduces four Custom Resource Definitions for backup logic:
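To see these resource types on your own cluster, you can list the velero.io API group. Expect more entries than four, since Velero also registers supporting types; the kinds this lesson works with directly are Schedule, Backup, Restore, and BackupStorageLocation.

```shell
# List every resource type Velero registered with the API server.
kubectl api-resources --api-group=velero.io

# Inspect the fields of one of them, for example Schedule.
kubectl explain schedule.spec --api-version=velero.io/v1
```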
A 24-hour RPO means daily backups; 30-day retention means you can recover from problems noticed weeks later.
Create the Schedule manifest (task-api-schedule.yaml):
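A sketch of what that manifest could look like for a daily backup with 30-day retention. The Schedule name task-api-daily, the 02:00 run time, and the production namespace selection are assumptions; the ttl is derived from the retention period (30 days x 24 hours = 720 hours).

```shell
# Derive the ttl from the retention target.
RETENTION_DAYS=30
TTL="$((RETENTION_DAYS * 24))h0m0s"

# Write the Schedule manifest; ${TTL} is expanded by the shell.
cat > task-api-schedule.yaml <<EOF
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: task-api-daily        # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"       # daily at 02:00, giving a 24-hour RPO
  template:
    includedNamespaces:
      - production
    ttl: ${TTL}
EOF
```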
Apply and verify:
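The commands below assume the manifest defines a Schedule named task-api-daily (a hypothetical name). Triggering one backup by hand avoids waiting for the first scheduled run.

```shell
# Create the Schedule and confirm Velero accepted it.
kubectl apply -f task-api-schedule.yaml
velero schedule get

# Run one backup immediately from the schedule's template.
velero backup create task-api-manual --from-schedule task-api-daily --wait

# List backups with their status and expiry.
velero backup get
```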
The ttl field sets how long each backup is kept. Once a backup passes its ttl, Velero deletes it along with its associated data in object storage.
If Velero snapshots a PostgreSQL volume while the database is accepting writes, the backup may capture data files mid-write. Backup hooks run commands inside the pod before and after the backup so the database can flush its state first.
The CHECKPOINT command forces PostgreSQL to write all dirty buffers to disk, bringing the data files up to date just before the snapshot. It does not block writes that arrive afterwards, so the pg_dump matters too: it creates an additional SQL backup inside the container, providing application-level recovery if the raw volume snapshot turns out to be unusable.
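Velero reads hooks from annotations on the pod. The sketch below writes a patch that adds a pre-backup hook to a Postgres pod template; the container name postgres, the user postgres, the database taskdb, and the dump path on the data volume are all assumptions for illustration.

```shell
# Strategic-merge patch adding Velero pre-backup hook annotations.
# Assumed: container "postgres", user "postgres", database "taskdb".
cat > postgres-hooks-patch.yaml <<'EOF'
spec:
  template:
    metadata:
      annotations:
        pre.hook.backup.velero.io/container: postgres
        pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -U postgres -c CHECKPOINT && pg_dump -U postgres taskdb > /var/lib/postgresql/data/pre-backup.sql"]'
        pre.hook.backup.velero.io/timeout: 3m
        pre.hook.backup.velero.io/on-error: Fail
EOF
```

Apply it with kubectl patch deployment postgres -n production --patch-file postgres-hooks-patch.yaml (use statefulset instead of deployment if that is how Postgres runs). Changing the pod template triggers a rollout, so schedule it accordingly. With on-error set to Fail, a failed hook marks the backup as failed rather than silently producing an unflushed copy.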
When disaster strikes, you need a reliable restore procedure.
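A sketch of one such procedure, scoped to the production namespace. The backup name is a placeholder; copy a real one from the list. The restore name task-api-restore is hypothetical.

```shell
# 1. Find the most recent Completed backup.
velero backup get

# 2. Placeholder: replace with a backup name from the list above.
BACKUP_NAME=task-api-daily-20240101020000

# 3. Restore only the production namespace and wait for completion.
velero restore create task-api-restore \
  --from-backup "$BACKUP_NAME" \
  --include-namespaces production \
  --wait

# 4. Review per-resource results, warnings, and errors.
velero restore describe task-api-restore --details
velero restore logs task-api-restore
```

To rehearse without touching the live namespace, add --namespace-mappings production:production-restore-test so the resources land in a separate namespace.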
Check that resources are restored:
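For example, assuming the restore targeted the production namespace:

```shell
# The restore should report Completed with no errors.
velero restore get

# Workloads, configuration, secrets, and autoscalers should be back.
kubectl get deployments,services,configmaps,secrets,hpa -n production

# Volume claims should be Bound.
kubectl get pvc -n production
```

Listing resources only proves they exist; also confirm the pods become Ready and the application responds.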
These prompts help you apply Velero patterns to your backup requirements.
Prompt 1 (Backup Strategy Design):
Prompt 2 (Multi-Cluster Backup Architecture):
Prompt 3 (Compliance-Driven Verification):
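Restore operations change a live cluster. By default Velero skips resources that already exist, but with --existing-resource-policy=update it modifies them to match the backup. Always use --include-namespaces to target specific namespaces, and test restores in non-production environments first. To review a restore request before submitting it, add -o yaml to velero restore create: Velero prints the Restore object without sending it to the server.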