You've learned how Services provide stable IP addresses to dynamic Pods. But what happens when Agent A in the agents namespace calls Agent B's API in the services namespace and gets "host not found"? The answer lies in Kubernetes DNS.
Service discovery is the mechanism that translates service names (like api.services.svc.cluster.local) into IP addresses. When DNS resolution fails, entire microservices architectures collapse. In this lesson, you'll build mental models of how Kubernetes DNS works, then debug common connectivity failures systematically.
Imagine this scenario:
Your agent in agents namespace calls the API service in services namespace:
Output:
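A short-name call from the agents namespace might look like this (the Pod name agent-a is illustrative):

```
$ kubectl exec -it agent-a -n agents -- curl http://api
curl: (6) Could not resolve host: api
```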
Why? The agent can't find the service. The full name exists—api.services.svc.cluster.local—but without understanding Kubernetes DNS architecture, you're debugging blind.
Kubernetes doesn't use external DNS for internal service discovery. Instead, every cluster runs CoreDNS—a DNS server that understands Kubernetes services and translates names to IP addresses.
When a Pod requests api.services, the resolution flow is:

1. The Pod's resolver sends the query to the nameserver listed in its /etc/resolv.conf, which is the kube-dns Service ClusterIP.
2. That ClusterIP routes the query to a CoreDNS Pod.
3. CoreDNS, which watches the Kubernetes API for Services, matches the name and answers with the Service's ClusterIP.
Key insight: CoreDNS is itself deployed as a Pod (or Pods) in the kube-system namespace, managed by a Deployment:
Output:
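On a typical cluster you'd see something like (replica count and age vary):

```
$ kubectl get deployment coredns -n kube-system
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
coredns   2/2     2            2           45d
```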
CoreDNS runs as multiple replicas for redundancy. Each CoreDNS pod watches the Kubernetes API for Service changes and automatically updates DNS records.
Every service in Kubernetes has a Fully Qualified Domain Name (FQDN). Understanding this hierarchy is critical for cross-namespace discovery.
Example breakdown for a service named api in the services namespace:

- api: the Service name
- services: the namespace
- svc: marks this as a Service record (rather than a Pod record)
- cluster.local: the cluster's DNS domain (the default; clusters can be configured with a different domain)

Complete FQDN: api.services.svc.cluster.local
A Pod in the same namespace can use a short name:
Output:
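For example, from a Pod in the services namespace (the Pod name client and the response body are illustrative):

```
$ kubectl exec -it client -n services -- curl http://api
{"status":"ok"}
```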
A Pod in a different namespace must use a namespace-qualified name (api.services) or the full FQDN:
Output:
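The same call with the FQDN (Pod name and response body illustrative):

```
$ kubectl exec -it agent-a -n agents -- curl http://api.services.svc.cluster.local
{"status":"ok"}
```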
Short names only work because the Pod's /etc/resolv.conf contains search domains that append the Pod's own namespace. When you request api from namespace agents, the resolver tries, in order:

1. api.agents.svc.cluster.local
2. api.svc.cluster.local
3. api.cluster.local

The FQDN bypasses this search chain entirely.
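You can see the search domains kubelet configures by reading /etc/resolv.conf inside any Pod (here from the agents namespace; the Pod name is illustrative):

```
$ kubectl exec -it agent-a -n agents -- cat /etc/resolv.conf
search agents.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
```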
When service discovery fails, manual DNS debugging reveals the root cause. You'll use two tools: nslookup (simpler) and dig (more detailed).
Create a debug Pod with DNS tools:
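One option is the netshoot image, which bundles nslookup and dig; any image with DNS tools works:

```
kubectl run dns-debug -n agents --rm -it --image=nicolaka/netshoot -- /bin/bash
```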
Once inside the Pod, query a service name:
Output (Success):
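A representative successful lookup (using the addresses from this lesson's example cluster):

```
$ nslookup api.services.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:   api.services.svc.cluster.local
Address: 10.97.45.123
```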
The server 10.96.0.10 is the kube-dns service. The resolved address 10.97.45.123 is the Service's ClusterIP.
Output (Failure - Wrong Namespace):
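Querying a namespace where the service doesn't exist (wrong-ns here is illustrative):

```
$ nslookup api.wrong-ns.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

** server can't find api.wrong-ns.svc.cluster.local: NXDOMAIN
```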
NXDOMAIN means "Non-Existent Domain"—the service doesn't exist in that namespace.
dig provides deeper insights into DNS records:
Output:
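An abridged dig result (query id, timings, and TTL vary):

```
$ dig api.services.svc.cluster.local

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31337
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;api.services.svc.cluster.local.   IN   A

;; ANSWER SECTION:
api.services.svc.cluster.local. 30 IN   A   10.97.45.123

;; Query time: 1 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
```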
Key sections: the header status (NOERROR means the record exists; NXDOMAIN means it doesn't), the ANSWER SECTION (the A record, its TTL, and the resolved ClusterIP), and the SERVER line (which resolver answered, here the kube-dns ClusterIP).
For headless services, CoreDNS creates SRV records that list all Pod IPs:
Output:
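For a headless Service named api-headless with a port named http (both names illustrative), the SRV query and a representative answer:

```
$ dig SRV _http._tcp.api-headless.services.svc.cluster.local

;; ANSWER SECTION:
_http._tcp.api-headless.services.svc.cluster.local. 30 IN SRV 0 50 80 10-244-0-5.api-headless.services.svc.cluster.local.
_http._tcp.api-headless.services.svc.cluster.local. 30 IN SRV 0 50 80 10-244-0-6.api-headless.services.svc.cluster.local.
```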
SRV records include: a priority, a weight, the port number, and a target hostname that resolves to an individual Pod's IP.
This allows clients to connect directly to Pods instead of going through the Service's load balancer.
Some applications need to connect directly to Pods, not through a load-balanced Service. Database replication, peer discovery, and stateful applications all require this. That's where headless services come in.
A regular Service (ClusterIP):
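A minimal manifest might look like this (port and names illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: services
spec:
  selector:
    app: api
  ports:
    - port: 80
```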
Returns a single virtual IP:
Output:
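A lookup returns only the Service's ClusterIP:

```
$ nslookup api.services.svc.cluster.local
Name:   api.services.svc.cluster.local
Address: 10.97.45.123
```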
A headless Service (no ClusterIP):
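The only manifest difference is clusterIP: None (the name api-headless is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-headless
  namespace: services
spec:
  clusterIP: None   # headless: no virtual IP is allocated
  selector:
    app: api
  ports:
    - name: http
      port: 80
```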
Query the headless service and DNS returns Pod IPs directly:

Output:
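A representative lookup (Pod IPs vary per cluster):

```
$ nslookup api-headless.services.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:   api-headless.services.svc.cluster.local
Address: 10.244.0.5
Name:   api-headless.services.svc.cluster.local
Address: 10.244.0.6
```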
Instead of one virtual IP, DNS returns all Pod IPs that match the selector. Clients can connect to any Pod directly.
When DNS resolves but connections fail, the problem is usually endpoint mismatch: the Service selector doesn't match the Pod labels.
List endpoints for a service:
Output (Healthy):
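Using the example addresses from this lesson:

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS       AGE
api    10.244.0.5:80   2d
```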
The Service knows about one Pod at IP 10.244.0.5.
Output (No Endpoints):
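The same command when the selector matches nothing:

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS   AGE
api    <none>      2d
```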
The Service found zero Pods. This means the selector isn't matching any Pods.
Check the Service's selector:
Output:
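One way to read the selector directly (output shape varies slightly by kubectl version):

```
$ kubectl get service api -n services -o jsonpath='{.spec.selector}'
{"app":"api"}
```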
The Service looks for Pods with label app: api.
Check Pods in the namespace:
Output (Mismatch):
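A representative mismatch (Pod names and the app=api-server label are illustrative):

```
$ kubectl get pods -n services --show-labels
NAME         READY   STATUS    RESTARTS   AGE   LABELS
api-abc123   1/1     Running   0          1h    app=api-server
api-def456   1/1     Running   0          1h    app=api-server
```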
Neither Pod has app: api. The selector app: api matches zero Pods.
Output (Match):
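The same command when labels line up (Pod names illustrative):

```
$ kubectl get pods -n services --show-labels
NAME         READY   STATUS    RESTARTS   AGE   LABELS
api-abc123   1/1     Running   0          1h    app=api
api-def456   1/1     Running   0          1h    app=api
```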
Both Pods have app: api. Now check endpoints again:
Output:
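With matching labels, both Pod IPs appear (IPs illustrative):

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS                      AGE
api    10.244.0.5:80,10.244.0.6:80   2d
```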
Both Pods are now listed as endpoints.
Use kubectl get pods with label selector syntax:
Output (No Results):
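When nothing matches:

```
$ kubectl get pods -n services -l app=api
No resources found in services namespace.
```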
The selector matches zero Pods.
Output (Match):
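When the selector matches (Pod names illustrative):

```
$ kubectl get pods -n services -l app=api
NAME         READY   STATUS    RESTARTS   AGE
api-abc123   1/1     Running   0          1h
api-def456   1/1     Running   0          1h
```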
The selector matches Pods successfully. Now endpoints should be populated.
Short service names don't resolve across namespaces by default. To reach across namespaces, you must qualify the name with its namespace (api.services) or use the full FQDN.
From a Pod in agents namespace accessing api service in services namespace:
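For example (the Pod name agent-a is illustrative):

```
kubectl exec -it agent-a -n agents -- curl http://api.services.svc.cluster.local
```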
An FQDN resolves identically from every namespace, which is why it is always the safe choice for cross-namespace calls.
Pods assume their own namespace when resolving short names: the search domains in the Pod's /etc/resolv.conf make the resolver try api.agents.svc.cluster.local first, then api.svc.cluster.local, then api.cluster.local. This is why short names only work in-namespace.
Let's walk through a complete debugging scenario, from failure to root cause to fix.
Setup: Agent A in the agents namespace calls the api Service in the services namespace, and every request fails to connect.
Step 1: Verify Service Exists
Output:
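Using this lesson's example ClusterIP:

```
$ kubectl get service api -n services
NAME   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
api    ClusterIP   10.97.45.123   <none>        80/TCP    2d
```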
Service exists. Step 2: Check endpoints.
Step 2: Verify Endpoints
Output:
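The endpoints check reveals the problem:

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS   AGE
api    <none>      2d
```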
No endpoints. The Service has no Pods. Step 3: Check selectors.
Step 3: Check Service Selector
Output:
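Reading the selector (output shape varies slightly by kubectl version):

```
$ kubectl get service api -n services -o jsonpath='{.spec.selector}'
{"app":"api","version":"v1"}
```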
Step 4: Check if Pods match.
Step 4: List Pods with Required Labels
Output:
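Querying with the Service's exact selector:

```
$ kubectl get pods -n services -l app=api,version=v1
No resources found in services namespace.
```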
No Pods match. Step 5: Check what Pods exist.
Step 5: List All Pods in Namespace
Output:
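Listing everything with labels (the Pod name is illustrative):

```
$ kubectl get pods -n services --show-labels
NAME         READY   STATUS    RESTARTS   AGE   LABELS
api-xyz789   1/1     Running   0          1h    app=api,version=v2
```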
The Pod has label version=v2 but the Service requires version=v1.
Root Cause: Label mismatch.
Fix: Update Deployment to match Service selector:
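Assuming the Deployment's own selector matches only on app: api, updating the Pod template labels is enough (if the Deployment selector also pins version, the Deployment must be recreated, since spec.selector is immutable):

```yaml
# deployment.yaml (excerpt): Pod template labels
spec:
  template:
    metadata:
      labels:
        app: api
        version: v1   # was v2; must match the Service selector
```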
Apply and verify:
Output:
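Apply the change and re-check the endpoints (the new Pod IP is illustrative):

```
$ kubectl apply -f deployment.yaml
deployment.apps/api configured
$ kubectl get endpoints api -n services
NAME   ENDPOINTS       AGE
api    10.244.0.8:80   2d
```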
Endpoints now populated. Test from Agent A:
Output:
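Assuming Agent A runs in a Pod named agent-a:

```
$ kubectl exec -it agent-a -n agents -- curl -s -o /dev/null -w "%{http_code}\n" http://api.services.svc.cluster.local
200
```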
Success.
Setup: You have three microservices in Kubernetes; the two that matter here are api-gateway in the api namespace and auth-service in the auth namespace.
Both api-gateway and auth-service are running, but api-gateway cannot reach auth-service. Connections time out.
Part 1: Manual Diagnosis
Before asking AI, diagnose manually:
Verify auth-service exists:
Check endpoints:
Query DNS from a debug Pod:
Check the Service selector:
List Pods in the namespace and compare labels:
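The five checks above map to commands like these (the debug Pod and image are illustrative):

```
# 1. Verify the Service exists
kubectl get service auth-service -n auth

# 2. Check endpoints
kubectl get endpoints auth-service -n auth

# 3. Query DNS from a debug Pod
kubectl run dns-debug -n api --rm -it --image=nicolaka/netshoot -- \
  nslookup auth-service.auth.svc.cluster.local

# 4. Check the Service selector
kubectl get service auth-service -n auth -o jsonpath='{.spec.selector}'

# 5. List Pods and compare labels
kubectl get pods -n auth --show-labels
```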
Part 2: Gathering Evidence
Before consulting AI, collect: the Service definition (YAML), the kubectl get endpoints output, a DNS query result from a debug Pod, and the labels on the Pods in the auth namespace.
This evidence determines whether the issue is DNS, endpoints, or labels.
Part 3: Collaboration with AI
Once you've gathered evidence, ask AI:
"I have auth-service in the auth namespace. DNS resolves auth-service.auth.svc.cluster.local to 10.97.50.100, but api-gateway in the api namespace gets connection timeouts. Here's the Service definition [paste YAML], the endpoints output shows <none>, and the Pods in the namespace have labels [paste labels]. What's the likely root cause?"
AI will identify the mismatch between Service selector and Pod labels, or point out that endpoints are missing entirely.
Part 4: Validation
Test your fix:
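For example, from the api-gateway Pod (substitute the real Pod name for the placeholder):

```
kubectl exec -it <api-gateway-pod> -n api -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://auth-service.auth.svc.cluster.local
```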
Connection should succeed (HTTP 200 or appropriate response).
Compare your manual diagnosis to AI's analysis. What did you miss? What did AI catch that you didn't?