You've learned how Services provide stable IP addresses to dynamic Pods. But what happens when Agent A in the agents namespace calls Agent B's API in the services namespace and gets "host not found"? The answer lies in Kubernetes DNS.
Service discovery is the mechanism that translates service names (like api.services.svc.cluster.local) into IP addresses. When DNS resolution fails, entire microservices architectures collapse. In this lesson, you'll build mental models of how Kubernetes DNS works, then debug common connectivity failures systematically.
Imagine this scenario:
Your agent in agents namespace calls the API service in services namespace:
Output:
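A short-name call from the agents namespace might look like this (the Pod name agent-a is illustrative):

```
$ kubectl exec -it agent-a -n agents -- curl http://api
curl: (6) Could not resolve host: api
```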
Why? The agent can't find the service. The full name exists—api.services.svc.cluster.local—but without understanding Kubernetes DNS architecture, you're debugging blind.
Kubernetes doesn't use external DNS for internal service discovery. Instead, every cluster runs CoreDNS—a DNS server that understands Kubernetes services and translates names to IP addresses.
When a Pod requests api.services, the resolution flow is:

1. The Pod's resolver sends the query to the nameserver listed in its /etc/resolv.conf, which is the kube-dns Service ClusterIP.
2. That ClusterIP routes the query to a CoreDNS Pod.
3. CoreDNS, which watches the Kubernetes API for Services, matches the name and answers with the Service's ClusterIP.
Key insight: CoreDNS is itself deployed as a Pod (or Pods) in the kube-system namespace, managed by a Deployment:
Output:
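On a typical cluster you'd see something like (replica count and age vary):

```
$ kubectl get deployment coredns -n kube-system
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
coredns   2/2     2            2           45d
```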
CoreDNS runs as multiple replicas for redundancy. Each CoreDNS pod watches the Kubernetes API for Service changes and automatically updates DNS records.
Every service in Kubernetes has a Fully Qualified Domain Name (FQDN). Understanding this hierarchy is critical for cross-namespace discovery.
Example breakdown for a service named api in the services namespace:

- api: the Service name
- services: the namespace
- svc: marks this as a Service record (rather than a Pod record)
- cluster.local: the cluster's DNS domain (the default; clusters can be configured with a different domain)

Complete FQDN: api.services.svc.cluster.local
A Pod in the same namespace can use a short name:
Output:
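For example, from a Pod in the services namespace (the Pod name client and the response body are illustrative):

```
$ kubectl exec -it client -n services -- curl http://api
{"status":"ok"}
```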
A Pod in a different namespace must use a namespace-qualified name (api.services) or the full FQDN:
Output:
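The same call with the FQDN (Pod name and response body illustrative):

```
$ kubectl exec -it agent-a -n agents -- curl http://api.services.svc.cluster.local
{"status":"ok"}
```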
Short names only work because the Pod's /etc/resolv.conf contains search domains that append the Pod's own namespace. When you request api from namespace agents, the resolver tries, in order:

1. api.agents.svc.cluster.local
2. api.svc.cluster.local
3. api.cluster.local

The FQDN bypasses this search chain entirely.
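You can see the search domains kubelet configures by reading /etc/resolv.conf inside any Pod (here from the agents namespace; the Pod name is illustrative):

```
$ kubectl exec -it agent-a -n agents -- cat /etc/resolv.conf
search agents.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
```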
When service discovery fails, manual DNS debugging reveals the root cause. You'll use two tools: nslookup (simpler) and dig (more detailed).
Create a debug Pod with DNS tools:
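One option is the netshoot image, which bundles nslookup and dig; any image with DNS tools works:

```
kubectl run dns-debug -n agents --rm -it --image=nicolaka/netshoot -- /bin/bash
```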
Once inside the Pod, query a service name:
Output (Success):
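A representative successful lookup (using the addresses from this lesson's example cluster):

```
$ nslookup api.services.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:   api.services.svc.cluster.local
Address: 10.97.45.123
```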
The server 10.96.0.10 is the kube-dns service. The resolved address 10.97.45.123 is the Service's ClusterIP.
Output (Failure - Wrong Namespace):
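Querying a namespace where the service doesn't exist (wrong-ns here is illustrative):

```
$ nslookup api.wrong-ns.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

** server can't find api.wrong-ns.svc.cluster.local: NXDOMAIN
```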
NXDOMAIN means "Non-Existent Domain"—the service doesn't exist in that namespace.
dig provides deeper insights into DNS records:
Output:
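An abridged dig result (query id, timings, and TTL vary):

```
$ dig api.services.svc.cluster.local

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31337
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;api.services.svc.cluster.local.   IN   A

;; ANSWER SECTION:
api.services.svc.cluster.local. 30 IN   A   10.97.45.123

;; Query time: 1 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
```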
Key sections: the header status (NOERROR means the record exists; NXDOMAIN means it doesn't), the ANSWER SECTION (the A record, its TTL, and the resolved ClusterIP), and the SERVER line (which resolver answered, here the kube-dns ClusterIP).
For headless services, CoreDNS creates SRV records that list all Pod IPs:
Output:
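For a headless Service named api-headless with a port named http (both names illustrative), the SRV query and a representative answer:

```
$ dig SRV _http._tcp.api-headless.services.svc.cluster.local

;; ANSWER SECTION:
_http._tcp.api-headless.services.svc.cluster.local. 30 IN SRV 0 50 80 10-244-0-5.api-headless.services.svc.cluster.local.
_http._tcp.api-headless.services.svc.cluster.local. 30 IN SRV 0 50 80 10-244-0-6.api-headless.services.svc.cluster.local.
```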
SRV records include: a priority, a weight, the port number, and a target hostname that resolves to an individual Pod's IP.
This allows clients to connect directly to Pods instead of going through the Service's load balancer.
Some applications need to connect directly to Pods, not through a load-balanced Service. Database replication, peer discovery, and stateful applications all require this. That's where headless services come in.
A regular Service (ClusterIP):
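A minimal manifest might look like this (port and names illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: services
spec:
  selector:
    app: api
  ports:
    - port: 80
```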
Returns a single virtual IP:
Output:
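A lookup returns only the Service's ClusterIP:

```
$ nslookup api.services.svc.cluster.local
Name:   api.services.svc.cluster.local
Address: 10.97.45.123
```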
A headless Service (no ClusterIP):
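The only manifest difference is clusterIP: None (the name api-headless is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-headless
  namespace: services
spec:
  clusterIP: None   # headless: no virtual IP is allocated
  selector:
    app: api
  ports:
    - name: http
      port: 80
```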
Query the headless service and DNS returns Pod IPs directly:

Output:
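A representative lookup (Pod IPs vary per cluster):

```
$ nslookup api-headless.services.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:   api-headless.services.svc.cluster.local
Address: 10.244.0.5
Name:   api-headless.services.svc.cluster.local
Address: 10.244.0.6
```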
Instead of one virtual IP, DNS returns all Pod IPs that match the selector. Clients can connect to any Pod directly.
When DNS resolves but connections fail, the problem is usually endpoint mismatch: the Service selector doesn't match the Pod labels.
List endpoints for a service:
Output (Healthy):
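Using the example addresses from this lesson:

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS       AGE
api    10.244.0.5:80   2d
```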
The Service knows about one Pod at IP 10.244.0.5.
Output (No Endpoints):
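The same command when the selector matches nothing:

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS   AGE
api    <none>      2d
```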
The Service found zero Pods. This means the selector isn't matching any Pods.
Check the Service's selector:
Output:
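One way to read the selector directly (output shape varies slightly by kubectl version):

```
$ kubectl get service api -n services -o jsonpath='{.spec.selector}'
{"app":"api"}
```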
The Service looks for Pods with label app: api.
Check Pods in the namespace:
Output (Mismatch):
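A representative mismatch (Pod names and the app=api-server label are illustrative):

```
$ kubectl get pods -n services --show-labels
NAME         READY   STATUS    RESTARTS   AGE   LABELS
api-abc123   1/1     Running   0          1h    app=api-server
api-def456   1/1     Running   0          1h    app=api-server
```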
Neither Pod has app: api. The selector app: api matches zero Pods.
Output (Match):
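The same command when labels line up (Pod names illustrative):

```
$ kubectl get pods -n services --show-labels
NAME         READY   STATUS    RESTARTS   AGE   LABELS
api-abc123   1/1     Running   0          1h    app=api
api-def456   1/1     Running   0          1h    app=api
```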
Both Pods have app: api. Now check endpoints again:
Output:
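With matching labels, both Pod IPs appear (IPs illustrative):

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS                      AGE
api    10.244.0.5:80,10.244.0.6:80   2d
```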
Both Pods are now listed as endpoints.
Use kubectl get pods with label selector syntax:
Output (No Results):
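When nothing matches:

```
$ kubectl get pods -n services -l app=api
No resources found in services namespace.
```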
The selector matches zero Pods.
Output (Match):
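When the selector matches (Pod names illustrative):

```
$ kubectl get pods -n services -l app=api
NAME         READY   STATUS    RESTARTS   AGE
api-abc123   1/1     Running   0          1h
api-def456   1/1     Running   0          1h
```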
The selector matches Pods successfully. Now endpoints should be populated.
Short service names don't resolve across namespaces by default. To reach across namespaces, you must qualify the name with its namespace (api.services) or use the full FQDN.
From a Pod in agents namespace accessing api service in services namespace:
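For example (the Pod name agent-a is illustrative):

```
kubectl exec -it agent-a -n agents -- curl http://api.services.svc.cluster.local
```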
An FQDN resolves identically from every namespace, which is why it is always the safe choice for cross-namespace calls.
Pods assume their own namespace when resolving short names: the search domains in the Pod's /etc/resolv.conf make the resolver try api.agents.svc.cluster.local first, then api.svc.cluster.local, then api.cluster.local. This is why short names only work in-namespace.
Let's walk through a complete debugging scenario, from failure to root cause to fix.
Setup: Agent A in the agents namespace calls the api Service in the services namespace, and every request fails to connect.
Step 1: Verify Service Exists
Output:
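Using this lesson's example ClusterIP:

```
$ kubectl get service api -n services
NAME   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
api    ClusterIP   10.97.45.123   <none>        80/TCP    2d
```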
Service exists. Step 2: Check endpoints.
Step 2: Verify Endpoints
Output:
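The endpoints check reveals the problem:

```
$ kubectl get endpoints api -n services
NAME   ENDPOINTS   AGE
api    <none>      2d
```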
No endpoints. The Service has no Pods. Step 3: Check selectors.
Step 3: Check Service Selector
Output:
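Reading the selector (output shape varies slightly by kubectl version):

```
$ kubectl get service api -n services -o jsonpath='{.spec.selector}'
{"app":"api","version":"v1"}
```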
Step 4: Check if Pods match.
Step 4: List Pods with Required Labels
Output:
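Querying with the Service's exact selector:

```
$ kubectl get pods -n services -l app=api,version=v1
No resources found in services namespace.
```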
No Pods match. Step 5: Check what Pods exist.
Step 5: List All Pods in Namespace
Output:
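Listing everything with labels (the Pod name is illustrative):

```
$ kubectl get pods -n services --show-labels
NAME         READY   STATUS    RESTARTS   AGE   LABELS
api-xyz789   1/1     Running   0          1h    app=api,version=v2
```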
The Pod has label version=v2 but the Service requires version=v1.
Root Cause: Label mismatch.
Fix: Update Deployment to match Service selector:
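Assuming the Deployment's own selector matches only on app: api, updating the Pod template labels is enough (if the Deployment selector also pins version, the Deployment must be recreated, since spec.selector is immutable):

```yaml
# deployment.yaml (excerpt): Pod template labels
spec:
  template:
    metadata:
      labels:
        app: api
        version: v1   # was v2; must match the Service selector
```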
Apply and verify:
Output:
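Apply the change and re-check the endpoints (the new Pod IP is illustrative):

```
$ kubectl apply -f deployment.yaml
deployment.apps/api configured
$ kubectl get endpoints api -n services
NAME   ENDPOINTS       AGE
api    10.244.0.8:80   2d
```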
Endpoints now populated. Test from Agent A:
Output:
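Assuming Agent A runs in a Pod named agent-a:

```
$ kubectl exec -it agent-a -n agents -- curl -s -o /dev/null -w "%{http_code}\n" http://api.services.svc.cluster.local
200
```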
Success.
Setup: You have three microservices in Kubernetes; the two that matter here are api-gateway in the api namespace and auth-service in the auth namespace.
Both api-gateway and auth-service are running, but api-gateway cannot reach auth-service. Connections time out.
Part 1: Manual Diagnosis
Before asking AI, diagnose manually:
Verify auth-service exists:
Check endpoints:
Query DNS from a debug Pod:
Check the Service selector:
List Pods in the namespace and compare labels:
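The five checks above map to commands like these (the debug Pod and image are illustrative):

```
# 1. Verify the Service exists
kubectl get service auth-service -n auth

# 2. Check endpoints
kubectl get endpoints auth-service -n auth

# 3. Query DNS from a debug Pod
kubectl run dns-debug -n api --rm -it --image=nicolaka/netshoot -- \
  nslookup auth-service.auth.svc.cluster.local

# 4. Check the Service selector
kubectl get service auth-service -n auth -o jsonpath='{.spec.selector}'

# 5. List Pods and compare labels
kubectl get pods -n auth --show-labels
```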
Part 2: Gathering Evidence
Before consulting AI, collect: the Service definition (YAML), the kubectl get endpoints output, a DNS query result from a debug Pod, and the labels on the Pods in the auth namespace.
This evidence determines whether the issue is DNS, endpoints, or labels.
Part 3: Collaboration with AI
Once you've gathered evidence, ask AI:
"I have auth-service in the auth namespace. DNS resolves auth-service.auth.svc.cluster.local to 10.97.50.100, but api-gateway in the api namespace gets connection timeouts. Here's the Service definition [paste YAML], the endpoints output shows <none>, and the Pods in the namespace have labels [paste labels]. What's the likely root cause?"
AI will identify the mismatch between Service selector and Pod labels, or point out that endpoints are missing entirely.
Part 4: Validation
Test your fix:
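For example, from the api-gateway Pod (substitute the real Pod name for the placeholder):

```
kubectl exec -it <api-gateway-pod> -n api -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://auth-service.auth.svc.cluster.local
```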
Connection should succeed (HTTP 200 or appropriate response).
Compare your manual diagnosis to AI's analysis. What did you miss? What did AI catch that you didn't?