DNS failures in Kubernetes can be deceptive. A service might start up successfully, then suddenly fail to reach external endpoints or internal cluster services. This guide helps you identify DNS-related issues and implement fixes using CoreDNS, readiness probes, and network policies.
🐛 Common Symptoms of Flaky DNS
- Pods fail to connect to external services intermittently
- Errors like `Temporary failure in name resolution` or `NXDOMAIN`
- App containers fail health checks during startup or restart unexpectedly
- Some pods resolve services correctly, others do not (often node-specific)
🔍 First Steps: Troubleshooting DNS in Clusters
- Run a DNS test pod:
Use BusyBox or Alpine to manually test DNS resolution:

```sh
kubectl run -it --rm dns-test --image=busybox --restart=Never -- sh
```

Then run:

```sh
nslookup google.com
```
- Check CoreDNS logs:
```sh
kubectl -n kube-system logs -l k8s-app=kube-dns
```
Look for timeouts, dropped queries, or upstream issues.
- Validate /etc/resolv.conf inside the pod:
Confirm its nameserver entry points to the cluster DNS service IP; a quick way to check both sides is included in the command sketch after this list.
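A minimal sketch combining the checks above, assuming a pod named `my-app` (substitute one of your own) and the default `kube-dns` Service name, which most clusters keep even when running CoreDNS:

```sh
# Resolve an in-cluster name (not just google.com) from a throwaway test pod
kubectl run -it --rm dns-test --image=busybox --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local

# Inspect the resolver configuration a running pod actually received
kubectl exec my-app -- cat /etc/resolv.conf

# The nameserver above should match the cluster DNS Service IP
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}'
```

If the in-cluster lookup fails while `google.com` resolves (or vice versa), that narrows the problem to the cluster DNS path or to the upstream forwarders respectively.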
🛠️ Resolution Strategies
- Ensure kubelet is configured with a working DNS server on the node
Some node-level DNS configuration bleeds into pod DNS resolution; bad upstream entries will cause flaky behavior (a node-level check is sketched after this list).
- Use readiness probes to delay traffic until DNS is working
For example:

```yaml
readinessProbe:
  exec:
    command: ["nslookup", "your-api.stackproof.dev"]
  initialDelaySeconds: 5
  periodSeconds: 10
```
- Tune CoreDNS configuration (a sketch of a tuned Corefile follows this list):
– Add caching
– Set proper forwarders (e.g., Google, Cloudflare)
– Limit retries or add load balancing
- Restart DNS pods or the affected kubelet node
Sometimes node-specific DNS resolution breaks due to iptables or CNI plugin issues; the restart commands are sketched below.
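For the CoreDNS tuning bullet above, here is a sketch of a Corefile with caching, explicit forwarders, and response load balancing. The plugins shown (`cache`, `forward`, `loadbalance`) are standard CoreDNS plugins, but the upstream IPs, cache TTL, and retry limit are illustrative; in most clusters this file lives in the `coredns` ConfigMap in `kube-system` (`kubectl -n kube-system edit configmap coredns`):

```
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    # Cache answers for up to 30 seconds to smooth over upstream hiccups
    cache 30
    # Forward non-cluster names to Cloudflare and Google, rotating between them
    # and marking an upstream unhealthy after 2 consecutive failures
    forward . 1.1.1.1 8.8.8.8 {
        policy round_robin
        max_fails 2
    }
    # Randomize the order of A/AAAA records in answers
    loadbalance
    loop
    reload
}
```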
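For the node-level and restart bullets, a hedged sketch, assuming a kubeadm-style node where the kubelet config sits at `/var/lib/kubelet/config.yaml` and CoreDNS runs as the `coredns` Deployment; both the path and the Deployment name vary by distribution:

```sh
# On the affected node: confirm the DNS server the kubelet hands to pods
grep -A 2 clusterDNS /var/lib/kubelet/config.yaml
# ...and the upstream resolvers the node itself uses
cat /etc/resolv.conf

# If resolution is broken on specific nodes or cluster-wide, bounce the DNS pods
kubectl -n kube-system rollout restart deployment coredns
```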
📈 Proactive Measures
- Monitor DNS latency using metrics from CoreDNS (via Prometheus)
- Enable alerts for failed DNS lookups on critical services (an example alert rule is sketched below)
- Pin CoreDNS versions and test custom configs in staging
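To make the first two points concrete, here is a sketch of Prometheus alerting rules built on the metrics CoreDNS's `prometheus` plugin exposes in recent versions (`coredns_dns_responses_total`, `coredns_dns_request_duration_seconds`); the 5% error rate and 500 ms p99 thresholds are illustrative starting points, not recommendations:

```yaml
groups:
  - name: coredns-dns
    rules:
      # Fire when more than 5% of DNS responses over 5 minutes are SERVFAILs
      - alert: CoreDNSHighErrorRate
        expr: |
          sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
            / sum(rate(coredns_dns_responses_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
      # Fire when p99 DNS lookup latency stays above 500 ms
      - alert: CoreDNSSlowLookups
        expr: |
          histogram_quantile(0.99,
            sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 10m
        labels:
          severity: warning
```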