How to Deploy Globally Without a Single Point of Failure
Multi-Cluster Deployments

So far you've deployed your FastAPI agent to a single Kubernetes cluster. That works for development. But production systems need redundancy: if one cluster fails, your agent keeps running on another. If you need to test a new version before rolling out to all users, you deploy to a staging cluster first. This chapter teaches you to manage multiple clusters from one ArgoCD instance using a hub-spoke architecture.

In hub-spoke, ArgoCD (the hub) manages deployment to many Kubernetes clusters (the spokes). You define your application once in Git. ArgoCD syncs that same application to cluster 1, cluster 2, cluster 3—each with different configurations. One Git repository becomes the source of truth for your entire infrastructure.

The Hub-Spoke Architecture

A hub-spoke topology has one control point (ArgoCD hub) managing many execution points (Kubernetes clusters as spokes). This is different from decentralized approaches where each cluster runs its own ArgoCD instance.

Why Hub-Spoke?

Single pane of glass: One ArgoCD UI/CLI shows status across all clusters

text
ArgoCD Hub                   Kubernetes Clusters
┌──────────────┐
│  ArgoCD      │             ┌──────────────┐
│  Server      │─────────────│ Prod Cluster │
│              │             │  (us-east)   │
│              │             └──────────────┘
│  Git Repo    │
│  (source of  │             ┌──────────────┐
│   truth)     │─────────────│ Staging      │
│              │             │  (us-west)   │
│              │             └──────────────┘
└──────────────┘             ┌──────────────┐
        └────────────────────│ DR Cluster   │
                             │  (eu-west)   │
                             └──────────────┘

Cost of a unified approach: Secrets containing cluster credentials must be stored securely in ArgoCD, not in Git. We'll address this in Chapter 15 (Secrets Management).

Alternative: cluster-local ArgoCD (not hub-spoke):

text
Git Repo                     Kubernetes Clusters

Prod Cluster                 ┌──────────────┐
└─ ArgoCD ───────────────────│ Prod Cluster │
                             └──────────────┘
Staging Cluster              ┌──────────────┐
└─ ArgoCD ───────────────────│ Staging      │
                             └──────────────┘

This approach works for teams with separate infra teams per cluster but loses the unified deployment view. We'll focus on hub-spoke because it's more common for AI agents.

Registering External Clusters

ArgoCD starts with one cluster: the one it's installed in (the hub). To deploy to other clusters (spokes), you must register those clusters with ArgoCD first.

Local Cluster Registration (Hub Cluster)

When you install ArgoCD on a cluster, that cluster is available automatically under the name in-cluster. Internally, ArgoCD represents every registered cluster as a Kubernetes Secret labeled argocd.argoproj.io/secret-type: cluster; the hub's implicit registration is equivalent to:

yaml
apiVersion: v1
kind: Secret
metadata:
  name: in-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: in-cluster
  server: https://kubernetes.default.svc
  config: |
    {"tlsClientConfig": {"insecure": false}}

Verify with the argocd cluster list command:

text
SERVER                          NAME        VERSION  STATUS   MESSAGE
https://kubernetes.default.svc  in-cluster           Unknown  Cluster has no applications and is not being monitored.

Registering External Clusters

To register an external cluster (e.g., your staging environment), you need:

  1. Access to the external cluster's API server (kubeconfig context)
  2. A service account with cluster-admin permissions (or appropriate RBAC)
  3. The argocd CLI to register the cluster

Step 1: Create a service account on the external cluster

bash
# On the external cluster, create a namespace and service account
kubectl create namespace argocd
kubectl create serviceaccount argocd-manager -n argocd

# Grant cluster-admin permissions
kubectl create clusterrolebinding argocd-manager-cluster-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=argocd:argocd-manager

Output:

text
namespace/argocd created
serviceaccount/argocd-manager created
clusterrolebinding.rbac.authorization.k8s.io/argocd-manager-cluster-admin created

Step 2: Get the external cluster's kubeconfig

bash
# Confirm your kubeconfig has a context for the external cluster
kubectl config get-contexts

# The current context should be your external cluster.
# If not, switch to it:
kubectl config use-context <external-cluster-context>

Output:

text
CURRENT   NAME                CLUSTER     AUTHINFO   NAMESPACE
*         staging-us-west-1   us-west-1   admin
          prod-us-east-1      us-east-1   admin

Step 3: Register the cluster with ArgoCD

bash
# Switch back to your HUB cluster where ArgoCD is installed
kubectl config use-context <hub-cluster-context>

# Port-forward to ArgoCD (if it's not exposed)
kubectl port-forward -n argocd svc/argocd-server 8080:443 &

# Register the external cluster by its kubeconfig context name
argocd cluster add staging-us-west-1 --name staging

Output:

text
INFO[0003] ServiceAccount "argocd-manager" created in namespace "argocd"
INFO[0004] ClusterRole "argocd-manager-role" created
INFO[0005] ClusterRoleBinding "argocd-manager-rolebinding" created
Cluster 'staging' has been added to Argo CD.

Cluster Secrets and Authentication

When you register an external cluster, ArgoCD stores the cluster's API server URL and authentication credentials as a Kubernetes Secret in the hub cluster.

Viewing Registered Clusters

bash
# List all registered clusters
argocd cluster list

# Get details of a specific cluster
argocd cluster get staging

# View the cluster secret directly
kubectl get secret -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml

Output:

yaml
NAME        CLUSTER                          TLS
in-cluster  https://kubernetes.default.svc   false
staging     https://staging-api.example.com  true
prod        https://prod-api.example.com     true

---
apiVersion: v1
kind: Secret
metadata:
  name: cluster-staging-0123456789abcdef
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
data:
  server: aHR0cHM6Ly9zdGFnaW5nLWFwaS5leGFtcGxlLmNvbQ==  # base64 encoded
  name: c3RhZ2luZw==  # base64 encoded
  config: eyJiZWFyZXJUb2tlbiI6Ijc4OXB4eVl6ZUZRSXdVMkZrVUhGcGJISmhiblJsIn0=

Cluster Credentials: Bearer Token

The config field in the secret contains authentication details. For external clusters, it typically includes:

json
{
  "bearerToken": "<service-account-token>",
  "tlsClientConfig": {
    "insecure": false,
    "caData": "<base64-encoded-ca-cert>"
  }
}

The bearer token comes from the argocd-manager service account on the external cluster:

bash
# Get the token from the external cluster
kubectl get secret -n argocd \
  $(kubectl get secret -n argocd | grep argocd-manager-token | awk '{print $1}') \
  -o jsonpath='{.data.token}' | base64 -d

# Note: Kubernetes 1.24+ no longer auto-creates token Secrets for service
# accounts; there, mint a token with: kubectl create token argocd-manager -n argocd

Output:

text
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJhcmdvY2QiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlY3JldC5uYW1lIjoiYXJnb2NkLW1hbmFnZXItdG9rZW4tOXA0ZGwiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2VhY2NvdW50Lm5hbWUiOiJhcmdvY2QtbWFuYWdlciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZWFjY291bnQudWlkIjoiOWQ1YTc1YzItZjM0ZS00YjQ3LWJhYmUtODJmMmI4N2RhMjI0In0.4bGl...

Cluster Health Check

ArgoCD periodically verifies cluster connectivity:

bash
# Check cluster health
argocd cluster get staging

Output:

text
Name:               staging
Server:             https://staging-api.example.com
Connection Status:  Successful

If a cluster becomes unreachable, ArgoCD marks it as unhealthy but continues managing other clusters.
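You can automate this health check in a cron job or CI step. Here is a minimal sketch that parses `argocd cluster list` table output and flags any cluster whose connection status is not `Successful`; the column layout is assumed from a typical ArgoCD 2.x CLI, so verify it against your version. A sample of the table is inlined so the sketch is self-contained:

```shell
# Sketch: flag unhealthy clusters from `argocd cluster list` output.
# In practice you would run:  argocd cluster list | check_clusters
check_clusters() {
  # Column 4 of the table is STATUS (Successful / Failed / Unknown); skip the header row.
  awk 'NR > 1 && $4 != "Successful" { print "ALERT: cluster", $2, "is", $4 }'
}

# Inlined sample output (placeholder data) piped through the check:
printf '%s\n' \
  'SERVER NAME VERSION STATUS MESSAGE PROJECT' \
  'https://prod-api.example.com prod 1.29 Failed timeout default' \
  'https://staging-api.example.com staging 1.29 Successful ok default' \
  | check_clusters
# -> ALERT: cluster prod is Failed
```

Wire the alert line into whatever notifier you already use (Slack webhook, PagerDuty, etc.).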

ApplicationSet with Cluster Generator

You've already learned ApplicationSets in Chapter 11. Now you'll use the Cluster generator to deploy an application to multiple registered clusters with cluster-specific configurations.

The Cluster Generator Concept

Instead of creating separate Applications for prod, staging, and DR:

yaml
# ❌ Old way: Three separate Applications
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: agent-prod
spec:
  destination:
    server: https://prod-api.example.com
  # ... etc

Use a Cluster generator to create one Application per registered cluster:

yaml
# ✅ New way: One ApplicationSet generates three Applications
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: agent-multi-cluster
spec:
  generators:
  - clusters: {}  # Generates one Application per registered cluster
  template:
    metadata:
      name: 'agent-{{name}}'
    spec:
      project: default
      destination:
        server: '{{server}}'
        namespace: agent
      source:
        repoURL: https://github.com/example/agent
        path: manifests/
        targetRevision: main

The clusters: {} generator creates template variables for every registered cluster:

  • {{name}}: Cluster name (e.g., "staging", "prod")
  • {{server}}: Cluster API server URL
  • {{metadata.labels}}: Cluster labels (if you've added them)
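To make the expansion concrete, this is roughly the Application the generator renders for a registered cluster named staging (a sketch; the server URL is whatever is stored in that cluster's secret):

```yaml
# Hypothetical Application rendered for the "staging" cluster:
# {{name}} -> staging, {{server}} -> URL from the staging cluster secret
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: agent-staging
spec:
  project: default
  destination:
    server: https://staging-api.example.com  # from {{server}}
    namespace: agent
  source:
    repoURL: https://github.com/example/agent
    path: manifests/
    targetRevision: main
```

ArgoCD then syncs each rendered Application independently, so a failure on one cluster does not block the others.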

Cluster-Specific Configurations

Real deployments need different configs per cluster. You might want:

  • Prod: 3 replicas, resource limits, strict security policies
  • Staging: 1 replica, minimal resources, relaxed policies
  • DR: 3 replicas, same as prod but in different region

Use Helm values overrides to customize per cluster:

yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: agent-multi-cluster
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          deploy: "true"  # Only deploy to clusters with this label
  template:
    metadata:
      name: 'agent-{{name}}'
    spec:
      project: default
      destination:
        server: '{{server}}'
        namespace: agent
      source:
        repoURL: https://github.com/example/agent
        path: helm/
        targetRevision: main
        helm:
          releaseName: agent
          values: |
            environment: "{{name}}"
            # Per-cluster replica counts come from the value files created in Step 2;
            # the cluster generator only exposes {{name}}, {{server}}, and cluster metadata.

Step 1: Add labels to clusters

bash
# Cluster labels live on the cluster Secrets in the hub's argocd namespace.
# List them, then label each one (Secret names below are examples):
kubectl get secret -n argocd -l argocd.argoproj.io/secret-type=cluster
kubectl label secret -n argocd <staging-cluster-secret> env=staging deploy=true
kubectl label secret -n argocd <prod-cluster-secret> env=prod deploy=true
kubectl label secret -n argocd <dr-cluster-secret> env=dr deploy=true

Output:

text
secret/<staging-cluster-secret> labeled
secret/<prod-cluster-secret> labeled
secret/<dr-cluster-secret> labeled

Step 2: Create values-per-cluster in your Git repository

Create these files in your agent repository:

  • helm/values.yaml (default values)
  • helm/values-staging.yaml (staging-specific overrides)
  • helm/values-prod.yaml (prod-specific overrides)
  • helm/values-dr.yaml (DR cluster, same as prod)
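The exact contents depend on your chart; as a sketch, the field names below (replicas, resources) are hypothetical and must match whatever your Helm templates actually reference:

```yaml
# helm/values.yaml -- defaults shared by every cluster
replicas: 1
resources:
  limits:
    cpu: 500m
    memory: 512Mi

# helm/values-prod.yaml -- overrides layered on top of values.yaml for prod
replicas: 3
resources:
  limits:
    cpu: "2"
    memory: 2Gi
```

Helm merges the files in the order they appear in valueFiles, so later files win.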

Verify the files exist:

bash
ls -la helm/values*.yaml

Output:

text
-rw-r--r-- 1 user group 298 Dec 23 10:15 helm/values.yaml
-rw-r--r-- 1 user group 156 Dec 23 10:15 helm/values-staging.yaml
-rw-r--r-- 1 user group 298 Dec 23 10:15 helm/values-prod.yaml
-rw-r--r-- 1 user group 298 Dec 23 10:15 helm/values-dr.yaml

Step 3: Create ApplicationSet with per-cluster values

yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: agent-multi-cluster
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          deploy: "true"
  template:
    metadata:
      name: 'agent-{{name}}'
    spec:
      project: default
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
      destination:
        server: '{{server}}'
        namespace: agent
      source:
        repoURL: https://github.com/example/agent
        path: helm/
        targetRevision: main
        helm:
          releaseName: agent
          valueFiles:
          - values.yaml
          - values-{{name}}.yaml  # Cluster-specific overrides

Apply the ApplicationSet:

bash
kubectl apply -f applicationset.yaml
argocd app list

Output:

text
NAME           CLUSTER  NAMESPACE  PROJECT  STATUS  HEALTH
agent-staging  staging  agent      default  Synced  Healthy
agent-prod     prod     agent      default  Synced  Healthy
agent-dr       dr       agent      default  Synced  Healthy

Cross-Cluster Networking Considerations

Multi-cluster deployments raise networking questions:

Service Discovery Between Clusters

  • Option 1, direct IP/DNS: not recommended; cluster-local IPs don't route between clusters.
  • Option 2, ingress/load balancer: works, but adds extra hops and latency.
  • Option 3, service mesh (e.g. Istio multi-cluster): powerful, but high complexity; requires a shared control plane.

For your AI agent, if each cluster is independent (data doesn't flow between clusters), you don't need cross-cluster communication. Each cluster runs a complete copy of your agent with its own database.

DNS Across Clusters

Each Kubernetes cluster has its own DNS domain:

  • In Cluster A: agent-service.agent.svc.cluster.local resolves only within Cluster A
  • In Cluster B: Same agent-service.agent.svc.cluster.local is different from Cluster A

To expose a service to other clusters, use an external DNS name:

bash
# Get the external endpoint (run once against each cluster's context)
kubectl get svc -n agent agent-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

Output:

text
agent-staging.example.com agent-prod.example.com agent-dr.example.com

Disaster Recovery: ArgoCD HA and Cluster Failover

With multiple clusters, you need resilience at two levels: ArgoCD itself must be HA, and your clusters must be capable of failover.

ArgoCD High Availability (Hub Cluster)

If your ArgoCD hub cluster goes down, you cannot deploy to spoke clusters. Make ArgoCD highly available:

bash
# Install ArgoCD with HA enabled (argo/argo-cd Helm chart)
helm install argocd argo/argo-cd \
  --namespace argocd \
  --set server.replicas=3 \
  --set repoServer.replicas=3 \
  --set controller.replicas=3 \
  --set redis-ha.enabled=true

Output:

text
Release "argocd" has been installed.
argocd-server: 3 replicas
argocd-repo-server: 3 replicas
argocd-application-controller: 3 replicas
redis-ha: enabled

Each component is fault-tolerant (Controller, Server, Repo Server, Redis). If one pod crashes, others take over.

Cluster Failover: Traffic Shifting

Your agent runs on three clusters (staging, prod, DR). If the prod cluster fails:

Scenario: User Traffic Shifting

text
User Traffic → AWS NLB (Network Load Balancer)
  ├─→ Prod cluster    (prod.example.com)     [FAILED]
  ├─→ DR cluster      (dr.example.com)       [HEALTHY]
  └─→ Staging cluster (staging.example.com)  [BACKUP]

Action: NLB detects prod failure → routes traffic to DR cluster

For your agent, implement:

  1. Health checks on all clusters
  2. DNS failover (Route53, Cloudflare) to shift traffic
  3. ArgoCD monitoring to detect when clusters become unhealthy
bash
# Check if a cluster is healthy
argocd cluster get prod

# Check application health on prod cluster
argocd app get agent-prod

Output:

text
Application: agent-prod
Status: Degraded
Server: https://prod-api.example.com (UNREACHABLE)
---
Cluster: prod
Connection Status: Failed (connection timeout)
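For the DNS failover step, a Route53 primary/secondary failover record pair is one option. The sketch below uses the change-resource-record-sets request format; the hostnames and the health-check ID are placeholders, and the assumption is that prod.example.com and dr.example.com already resolve to the two clusters' load balancers:

```json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "agent.example.com",
        "Type": "CNAME",
        "SetIdentifier": "prod-primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "<prod-health-check-id>",
        "ResourceRecords": [{ "Value": "prod.example.com" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "agent.example.com",
        "Type": "CNAME",
        "SetIdentifier": "dr-secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "dr.example.com" }]
      }
    }
  ]
}
```

When the prod health check fails, Route53 answers queries for agent.example.com with the secondary (DR) record instead.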

Complete Multi-Cluster ApplicationSet Example

Here's a production-ready example:

Directory structure:

text
repo/
├── argocd/
│   └── agent-multi-cluster-appset.yaml
└── helm/
    ├── Chart.yaml
    ├── values.yaml
    ├── values-staging.yaml
    ├── values-prod.yaml
    └── values-dr.yaml

argocd/agent-multi-cluster-appset.yaml:

yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: agent-multi-cluster
  namespace: argocd
spec:
  syncPolicy:
    preserveResourcesOnDeletion: true
  generators:
  - clusters:
      selector:
        matchLabels:
          deploy: "true"
  template:
    metadata:
      name: 'agent-{{name}}'
      finalizers:
      - resources-finalizer.argocd.argoproj.io
    spec:
      project: default
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        retry:
          limit: 5
          backoff:
            duration: 5s
            factor: 2
      destination:
        server: '{{server}}'
        namespace: agent
      source:
        repoURL: https://github.com/example/agent
        path: helm/
        targetRevision: main
        helm:
          releaseName: agent-{{name}}
          valueFiles:
          - values.yaml
          - values-{{metadata.labels.env}}.yaml

Deploy the ApplicationSet:

bash
# Label each cluster's Secret in the hub's argocd namespace
# (Secret names are examples; list them with:
#  kubectl get secret -n argocd -l argocd.argoproj.io/secret-type=cluster)
kubectl label secret -n argocd <staging-cluster-secret> env=staging deploy=true
kubectl label secret -n argocd <prod-cluster-secret> env=prod deploy=true
kubectl label secret -n argocd <dr-cluster-secret> env=dr deploy=true

# Apply the ApplicationSet
kubectl apply -f argocd/agent-multi-cluster-appset.yaml

# Check sync status
argocd app list

Output:

text
NAME           CLUSTER  STATUS  HEALTH
agent-staging  staging  Synced  Healthy
agent-prod     prod     Synced  Healthy
agent-dr       dr       Synced  Healthy

Try With AI

Setup: Use the same FastAPI agent from previous chapters. You now have three Kubernetes clusters available (or can simulate with three Minikube instances).

Part 1: Design Your Multi-Cluster Strategy

Ask AI: "I have a FastAPI agent that I want to deploy to three clusters: staging, prod, and DR. Each should have different resource allocations. Design a multi-cluster deployment strategy using ArgoCD that supports: (1) Separate configurations per cluster, (2) Secrets management outside of Git, (3) Automatic failover if one cluster becomes unhealthy."

Part 2: Refine Secret Handling

"How would I configure External Secrets to pull database passwords from HashiCorp Vault for my prod cluster, while the staging cluster gets test credentials from a different secret location?"

Part 3: Test with One Cluster First

"I want to set up a test ApplicationSet with just my staging cluster to verify the approach works before adding prod and DR. Give me a minimal ApplicationSet that deploys to a single cluster with custom values."

Part 4: Scaling to Three Clusters

"Now add the prod and dr clusters to the ApplicationSet. How do I ensure the cluster selector only deploys to clusters with the deploy=true label?"

Part 5: Design Failover

"If my prod cluster becomes unreachable, how does ArgoCD detect this and how would my users be notified? What monitoring should I add to alert when a cluster is unhealthy?"


Reflect on Your Skill

You built a gitops-deployment skill in Chapter 0. Test and improve it based on what you learned.

Test Your Skill

text
Using my gitops-deployment skill, register an external cluster with ArgoCD. Does my skill describe the service account creation and the argocd cluster add command?

Identify Gaps

Ask yourself:

  • Did my skill include ApplicationSet cluster generator for multi-cluster deployments?
  • Did it handle per-cluster Helm value overrides (values-prod.yaml, values-staging.yaml)?

Improve Your Skill

If you found gaps:

text
My gitops-deployment skill doesn't generate multi-cluster ApplicationSets. Update it to include cluster generators with label selectors and environment-specific value files.