Velero for Kubernetes Backup and Restore

Name: Digital FTEs: Engineering — Achieving 10× Productivity
Author: Muhammad Usman Akbar

It happens faster than you can react. A junior developer runs kubectl delete namespace production instead of kubectl delete namespace test-production. In the two seconds before muscle memory kicks in and they hit Ctrl+C, Kubernetes has already begun terminating every pod, service, configmap, and secret in your production namespace.

Your database's data survives because the storage class has reclaimPolicy: Retain: the PersistentVolumeClaim object is deleted along with the namespace, but the underlying PersistentVolume and its data are kept. Small mercy. But the Task API deployment, the inference service configuration, the carefully tuned HPA settings, the secrets containing API keys for three external services - all gone. You stare at the terminal, trying to remember what was in that namespace. You haven't touched some of those configurations in months.

This is the moment you discover whether your disaster recovery strategy exists beyond good intentions. Do you have backups? When were they last tested? Can you restore to a specific namespace? How long until production is back?

Velero answers these questions before disaster strikes. It's an open-source tool, maintained under the vmware-tanzu GitHub organization, that backs up Kubernetes resources and persistent volumes, stores them off-cluster in object storage, and restores them when you need them. This lesson teaches you to install Velero, configure scheduled backups with retention policies, implement database-aware hooks for consistency, and verify that restores actually work.

Installing Velero with MinIO for Local Development

In production, you'll use cloud object storage - S3, GCS, Azure Blob. For local development and testing, MinIO provides an S3-compatible storage backend that runs in your cluster.

Step 1: Deploy MinIO

bash

helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --namespace minio \
  --create-namespace \
  --set rootUser=minioadmin \
  --set rootPassword=minioadmin \
  --set mode=standalone \
  --set resources.requests.memory=512Mi \
  --set persistence.size=10Gi

Create a bucket for Velero backups:

bash

kubectl run minio-client --rm -it --restart=Never \
  --image=minio/mc \
  --namespace=minio \
  --command -- /bin/sh -c "
    mc alias set myminio http://minio:9000 minioadmin minioadmin && \
    mc mb myminio/velero-backups
  "

Step 2: Install Velero

bash

# Velero reads S3 credentials from an AWS-style credentials file.
# Passing it with --set-file preserves the line breaks that --set would mangle.
cat > credentials-velero <<'EOF'
[default]
aws_access_key_id=minioadmin
aws_secret_access_key=minioadmin
EOF

# snapshotsEnabled=false: this local setup has no volume snapshot provider,
# so only Kubernetes objects are backed up, not volume data.
# The plugin image version must be compatible with the Velero version the chart installs.
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set 'configuration.backupStorageLocation[0].name=default' \
  --set 'configuration.backupStorageLocation[0].provider=aws' \
  --set 'configuration.backupStorageLocation[0].bucket=velero-backups' \
  --set 'configuration.backupStorageLocation[0].config.region=minio' \
  --set 'configuration.backupStorageLocation[0].config.s3ForcePathStyle=true' \
  --set 'configuration.backupStorageLocation[0].config.s3Url=http://minio.minio:9000' \
  --set snapshotsEnabled=false \
  --set 'initContainers[0].name=velero-plugin-for-aws' \
  --set 'initContainers[0].image=velero/velero-plugin-for-aws:v1.10.0' \
  --set 'initContainers[0].volumeMounts[0].mountPath=/target' \
  --set 'initContainers[0].volumeMounts[0].name=plugins' \
  --set credentials.useSecret=true \
  --set-file credentials.secretContents.cloud=./credentials-velero

Verify Velero is running:

bash

kubectl get pods -n velero
kubectl get backupstoragelocation -n velero

The BackupStorageLocation should show the Available phase, which confirms that Velero can reach the MinIO bucket with the supplied credentials.
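Right after installation the location may not be Available yet, so a script that continues straight to creating backups can fail. The sketch below polls the phase until it settles. The function name wait_for_bsl and its default attempt count and delay are illustrative choices, not part of Velero.

```bash
# Poll a BackupStorageLocation until Velero reports it as Available.
# Usage: wait_for_bsl [name] [attempts] [delay_seconds]
# (helper name and defaults are hypothetical, chosen for this example)
wait_for_bsl() {
  local name="${1:-default}" attempts="${2:-30}" delay="${3:-5}" phase="" i
  for (( i = 1; i <= attempts; i++ )); do
    # status.phase is written by the Velero server after it validates the bucket
    phase="$(kubectl get backupstoragelocation "$name" -n velero \
      -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Available" ]; then
      echo "BackupStorageLocation '$name' is Available"
      return 0
    fi
    sleep "$delay"
  done
  echo "BackupStorageLocation '$name' not Available (last phase: '${phase:-unknown}')" >&2
  return 1
}
```

Calling wait_for_bsl default before the first backup gives an install script a clear failure point if the bucket name, URL, or credentials are wrong.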

Understanding Velero's CRD Architecture

Velero installs a number of Custom Resource Definitions; these four are the ones you work with day to day:

BackupStorageLocation: Where backups live (S3/GCS bucket configs).
Backup: A point-in-time snapshot of resources and PVCs.
Schedule: Automated recurring backups based on a cron expression.
Restore: Recovering Kubernetes objects and data from a specific Backup.
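Because these are ordinary Kubernetes objects, you can inspect them with kubectl as well as with the velero CLI. A minimal sketch, assuming Velero is installed in the velero namespace as above (the helper name list_velero_objects is hypothetical):

```bash
# List the Velero objects of each kind in the velero namespace.
# The fully qualified resource names avoid clashes with other CRDs.
list_velero_objects() {
  local kind
  for kind in backupstoragelocations backups schedules restores; do
    echo "== ${kind}.velero.io =="
    kubectl get "${kind}.velero.io" -n velero
  done
}
```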

Creating a Production Schedule with 30-Day Retention

A 24-hour recovery point objective (RPO) means you can lose at most one day of changes, which calls for daily backups; 30-day retention means you can still recover from problems noticed weeks later.

Create the Schedule manifest (task-api-schedule.yaml):

yaml

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: task-api-production-daily
  namespace: velero
  labels:
    app: task-api
    environment: production
spec:
  schedule: "0 2 * * *"  # 2 AM UTC daily
  useOwnerReferencesInBackup: false
  template:
    includedNamespaces:
      - production
    includedResources:
      - "*"
    excludedResources:
      - events
      - pods
      - replicasets
    snapshotVolumes: true
    storageLocation: default
    ttl: 720h  # 30 days = 720 hours

Apply and verify:

bash

kubectl apply -f task-api-schedule.yaml
velero schedule describe task-api-production-daily
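A new Schedule produces nothing until its cron expression next fires, so it is worth triggering one backup by hand to confirm the template works. A minimal sketch using the velero CLI's --from-schedule and --wait flags (the wrapper function name is hypothetical):

```bash
# Create an on-demand backup from a Schedule's template and wait for it to finish.
backup_now_from_schedule() {
  local schedule="$1"
  # --from-schedule copies the Schedule's template into a new Backup;
  # --wait blocks until the backup reaches a final phase
  velero backup create --from-schedule "$schedule" --wait
}

# Example: backup_now_from_schedule task-api-production-daily
```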

Understanding TTL Retention

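The ttl field sets how long a backup is kept. Once the TTL expires, Velero's garbage collection deletes the Backup object along with its files in object storage and any associated volume snapshots. If you omit ttl, Velero defaults to 720h. Common values: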

168h = 7 days
720h = 30 days
2160h = 90 days
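Retention is usually stated in days while ttl is written in hours, so the conversion is a multiplication by 24. A small helper makes that explicit (the function name days_to_ttl is hypothetical):

```bash
# Convert a retention period in days to a Velero ttl value in hours.
days_to_ttl() {
  local days="$1"
  case "$days" in
    ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
  esac
  # 10# forces base 10 so input such as "08" is not read as octal
  echo "$(( 10#$days * 24 ))h"
}

days_to_ttl 7    # 168h
days_to_ttl 30   # 720h
days_to_ttl 90   # 2160h
```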

Implementing Backup Hooks for Database Consistency

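If Velero captures a PostgreSQL volume while the database is accepting writes, the copy is at best crash-consistent: PostgreSQL has to replay its write-ahead log after a restore, and a file-level copy may not be usable at all. Backup hooks run commands inside the pod before and after the backup, so you can flush data to disk and capture a logically consistent dump. Hooks execute in pods that are part of the backup, which is why this Schedule does not exclude pods.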

Complete Schedule with Database Hooks

yaml

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: task-api-production-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - production
    snapshotVolumes: true
    ttl: 720h
    hooks:
      resources:
        - name: postgres-consistency
          includedNamespaces:
            - production
          labelSelector:
            matchLabels:
              app: postgres
          pre:
            - exec:
                container: postgres
                command:
                  - /bin/bash
                  - -c
                  - |
                    set -e  # abort the hook if any command fails
                    psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "CHECKPOINT;"
                    pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB" > /var/lib/postgresql/data/backup.sql
                onError: Fail
                timeout: 120s
          post:
            - exec:
                container: postgres
                command:
                  - /bin/bash
                  - -c
                  - |
                    rm -f /var/lib/postgresql/data/backup.sql
                onError: Continue
                timeout: 30s

Why Pre-Backup Hooks Matter

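The CHECKPOINT command forces PostgreSQL to flush all dirty buffers to disk, which minimizes the WAL replay needed when a volume snapshot is restored. It does not pause writes, so the snapshot is still crash-consistent rather than fully quiesced. The pg_dump step writes a transactionally consistent SQL dump into the data volume, so it is captured with the volume and gives you an application-level recovery path if the raw volume data cannot be used. The post hook removes the dump after the pod's backup completes. For large databases, raise the pre-hook timeout so pg_dump has time to finish.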
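Velero also reads hooks from pod annotations, which keeps the hook definition next to the workload instead of in every Schedule. The sketch below sets the pre-backup CHECKPOINT hook that way. The annotation keys are Velero's; the function name and its default namespace and selector are assumptions taken from this lesson's examples.

```bash
# Declare a pre-backup hook through pod annotations instead of the Schedule spec.
annotate_postgres_checkpoint_hook() {
  local namespace="${1:-production}" selector="${2:-app=postgres}"
  # The command is a JSON array; single quotes keep $POSTGRES_USER unexpanded
  # here so it is resolved inside the container when the hook runs.
  local cmd='["/bin/bash", "-c", "psql -U \"$POSTGRES_USER\" -d \"$POSTGRES_DB\" -c \"CHECKPOINT;\""]'
  kubectl annotate pod -n "$namespace" -l "$selector" --overwrite \
    "pre.hook.backup.velero.io/container=postgres" \
    "pre.hook.backup.velero.io/command=${cmd}" \
    "pre.hook.backup.velero.io/on-error=Fail" \
    "pre.hook.backup.velero.io/timeout=120s"
}
```

Annotations applied to a running pod are lost when the pod is recreated, so for anything long-lived put the same keys in the pod template of the StatefulSet or Deployment.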

Restore Procedure: Step-by-Step Recovery

When disaster strikes, you need a reliable restore procedure.

Step 1: Verify Available Backups

bash

velero backup get
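Backups created by a Schedule carry the label velero.io/schedule-name, so during an incident you can look up the newest one rather than scanning the list by eye. A minimal sketch (the helper name is hypothetical, and it does not check the backup's phase, so confirm the result is Completed before restoring from it):

```bash
# Print the name of the most recently created backup for a given Schedule.
latest_backup_for_schedule() {
  local schedule="$1" last
  # -o name prints "backup.velero.io/<name>", oldest first because of --sort-by
  last="$(kubectl get backups.velero.io -n velero \
    -l "velero.io/schedule-name=${schedule}" \
    --sort-by=.metadata.creationTimestamp -o name | tail -n 1)"
  if [ -z "$last" ]; then
    echo "no backups found for schedule '$schedule'" >&2
    return 1
  fi
  echo "${last##*/}"   # strip the "backup.velero.io/" prefix
}

# Example: latest_backup_for_schedule task-api-production-daily
```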

Step 2: Inspect a Specific Backup

bash

velero backup describe task-api-production-daily-20241230020000 --details

Step 3: Create a Restore

bash

velero restore create restore-production-20241230 \
  --from-backup task-api-production-daily-20241230020000 \
  --include-namespaces production \
  --restore-volumes=true
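To rehearse a restore without touching the live namespace, you can map the backed-up namespace to a scratch one with --namespace-mappings. This is a sketch under the assumption that a namespace such as production-restore-test is free to use; the function name and the verify- prefix are hypothetical.

```bash
# Restore a backup's namespace into a different, scratch namespace.
# Usage: restore_to_scratch_namespace <backup> <source-namespace> <target-namespace>
restore_to_scratch_namespace() {
  local backup="$1" source_ns="$2" target_ns="$3"
  velero restore create "verify-${backup}" \
    --from-backup "$backup" \
    --include-namespaces "$source_ns" \
    --namespace-mappings "${source_ns}:${target_ns}" \
    --wait
}

# Example:
# restore_to_scratch_namespace task-api-production-daily-20241230020000 \
#   production production-restore-test
```

Delete the scratch namespace once you have checked the restored objects, so the next rehearsal starts from a clean state.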

Step 4: Monitor Restore Progress

bash

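velero restore describe restore-production-20241230 --details
# Once the restore has finished, review its logs for warnings and errors
velero restore logs restore-production-20241230
# To block until a restore finishes, add --wait to velero restore create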
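In a runbook or CI job it helps to turn the restore's final phase into an exit code. The sketch below reads status.phase from the Restore object and succeeds only on Completed, so PartiallyFailed and Failed both count as failures. The function names are hypothetical.

```bash
# Print the current phase of a Restore object.
restore_phase() {
  kubectl get restores.velero.io "$1" -n velero -o jsonpath='{.status.phase}'
}

# Return success only if the restore finished cleanly.
restore_succeeded() {
  local phase
  phase="$(restore_phase "$1")"
  echo "restore '$1' phase: ${phase:-unknown}"
  [ "$phase" = "Completed" ]
}

# Example: restore_succeeded restore-production-20241230 || echo "investigate before continuing"
```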

Step 5: Verify the Restore

Check that resources are restored:

bash

kubectl get deployments -n production
kubectl get services -n production
kubectl get pvc -n production
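Seeing the objects listed does not mean the workloads are healthy. A minimal follow-up check, assuming the restored workloads are Deployments, waits for each one to report the Available condition (the function name and the 300s default are illustrative):

```bash
# Wait until every Deployment in the namespace reports condition Available.
wait_for_deployments() {
  local namespace="${1:-production}" timeout="${2:-300s}"
  kubectl wait deployment --all -n "$namespace" \
    --for=condition=Available --timeout="$timeout"
}
```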

Step 6: Validate Application Functionality

bash

kubectl port-forward svc/task-api 8000:8000 -n production &
sleep 2  # give the port-forward time to open
curl --fail http://localhost:8000/health
kill %1  # stop the background port-forward

Try With AI

These prompts help you apply Velero patterns to your backup requirements.

Prompt 1 (Backup Strategy Design):

text

I'm designing a backup strategy for my cluster:
- Task API (FastAPI + PostgreSQL)
- Inference Service (stateless, connects to external LLMs)
- Redis cache (ephemeral)
Help me decide:
1. Which components need Velero backups vs just Helm charts?
2. What backup frequency for each?
3. Which need database hooks?

Prompt 2 (Multi-Cluster Backup Architecture):

text

I have three clusters: dev, staging, production. I want to:
- Back up production daily to S3
- Keep 7-day retention in dev, 30-day in staging, 90-day in production
Design the BackupStorageLocation and Schedule configurations for this setup.

Prompt 3 (Compliance-Driven Verification):

text

My company needs to comply with SOC 2 Type II. The auditor asks:
1. How do you ensure backups are encrypted at rest?
2. How do you verify backup integrity?
3. Can you prove restore procedures work?
Show me how to configure Velero to answer these questions with evidence.

Safety Note

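Scope every restore explicitly. Always pass --include-namespaces so the restore touches only the namespaces you intend. By default Velero skips resources that already exist in the cluster instead of overwriting them, so restoring into a partially populated namespace can leave stale objects in place; with --existing-resource-policy=update it modifies existing resources instead. Test restores in a non-production cluster or a scratch namespace first. Velero does not simulate a restore: adding -o yaml to velero restore create only prints the Restore object without submitting it, which lets you review the request but not its effect.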