Breakpoint

Intermittent DNS Resolution Failures

A microservices platform is experiencing random connection failures between services. The errors are sporadic and affect different service pairs at different times. Infrastructure team says "nothing changed." SRE is escalating.

System Overview

A Kubernetes-based microservices platform with 12 services communicating via internal DNS (service-name.namespace.svc.cluster.local). CoreDNS handles cluster DNS resolution. Services use HTTP clients with default connection pooling. A new monitoring sidecar was recently added to all pods.

Logs


[2024-09-05 16:00:01] INFO: Handling GET /users/42
[2024-09-05 16:00:01] INFO: Calling profile-service.default.svc.cluster.local:8080/profiles/42
[2024-09-05 16:00:01] ERROR: ENOTFOUND profile-service.default.svc.cluster.local
[2024-09-05 16:00:01] ERROR: Failed to fetch profile for user 42. Returning partial response.
[2024-09-05 16:00:15] INFO: Handling GET /users/42
[2024-09-05 16:00:15] INFO: Calling profile-service.default.svc.cluster.local:8080/profiles/42
[2024-09-05 16:00:15] INFO: Profile fetched successfully for user 42. 200 OK.
[2024-09-05 16:01:00] INFO: Handling GET /users/99
[2024-09-05 16:01:00] INFO: Calling profile-service.default.svc.cluster.local:8080/profiles/99
[2024-09-05 16:01:00] INFO: Profile fetched successfully for user 99. 200 OK.
[2024-09-05 16:02:30] INFO: Calling notification-service.default.svc.cluster.local:8080/send
[2024-09-05 16:02:30] ERROR: ENOTFOUND notification-service.default.svc.cluster.local

Diagnosis & Plan

Based on the logs, what do you think the root cause is, and what are your proposed next steps?