Lesson Learnt with Kubernetes Services and DNS
Estimated time to read: 7 minutes
- Originally Written: January, 2023
We recently migrated a simple application (Django web app + Postgres DB) from one Kubernetes cluster to another. As part of the migration we ran into a problem that had us confused for a short while before we figured out what was happening (it probably didn't help that it was late Friday afternoon). This post explains the issue in case it helps someone else in the future.
TLDR: Always confirm the endpoint you think you're connected to is in fact the endpoint that you're connected to.
The application we were migrating is a Python Django web application which connects to a Postgres Database. Rather than configure Django with the IP address of the database pod, we use DNS and reference the Kubernetes service name. In this case the service name is `db`.
From the previously mentioned post, a Kubernetes service keeps track of pods based on a label and the IP address that has been assigned to them. It also sets up networking, for example for external client traffic to a web server, or for internal traffic between frontend and backend tiers. In this case it was between the frontend (web) and backend (Postgres DB) tiers of the application.
We saw the web and DB pods deploy correctly and accessed the frontend admin page through a browser. Every time we tried to access it though, we received an error from Postgres which looked something like:
connect to PostgreSQL server: FATAL: no pg_hba.conf entry for host "XX.XXX.XX.XXX", user "XXX", database "XXX", SSL off
We spent some time troubleshooting and based on the error messages we were seeing we thought it might have been a Postgres authentication issue. It turns out it wasn't.
For reference in case you ever need to know, the `pg_hba.conf` file in a Postgres DB controls the client/host-based authentication. The file is usually found in the `/var/lib/postgresql/data/pgdata` directory and might look something like this (just an example, not for production):
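```
# Illustrative pg_hba.conf entries only - not taken from the cluster in question
# TYPE  DATABASE  USER  ADDRESS        METHOD
local   all       all                  trust
host    all       all   127.0.0.1/32   scram-sha-256
host    all       all   10.0.0.0/8     md5
host    all       all   ::1/128        scram-sha-256
```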
https://www.postgresql.org/docs/current/auth-pg-hba-conf.html
At that point we looked at the connection from the web server to the database. In the Kubernetes deployment YAML file for the webserver, the db connection details are configured something like this (trimmed to the relevant part; apart from `DJANGO_SQL_HOST`, the values below are illustrative):
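```yaml
# Webserver Deployment (excerpt) - the database host is just the Service name
containers:
  - name: web
    image: simple-web-app:latest   # illustrative
    env:
      - name: DJANGO_SQL_HOST
        value: db                  # Kubernetes Service name, resolved via cluster DNS
      - name: DJANGO_SQL_PORT
        value: "5432"              # illustrative
```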
We ran `nslookup db` on the webserver and noticed the IP address returned was not that of our own Postgres DB, but a different address outside of our cluster and lab altogether (example output below). At that point we realised we had forgotten to deploy the Kubernetes Service for the database when we migrated to the new cluster. Once the service was created, the application started working correctly.
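For reference, the `nslookup` output that tipped us off looked something like this (the server and addresses shown here are placeholders, not the real values we saw):

```
$ nslookup db
Server:         10.96.0.10
Address:        10.96.0.10#53

Non-authoritative answer:
Name:   db.another-lab.com
Address: 203.0.113.45
```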
How did forgetting the Kubernetes service cause this issue?
From the post above, Kubernetes DNS schedules a DNS Pod and Service on the cluster and configures the Kubelets (on each K8s node) to tell individual containers to use the DNS Service to resolve DNS names.
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service
Rather than use an IP address, we referenced the database service name, `db`. The webserver took this name and queried the Kubernetes DNS (CoreDNS) pods for the IP address of `db`. The main file used in this operation is `/etc/resolv.conf`, which is automatically configured by Kubernetes in each pod.
The `/etc/resolv.conf` file in the webserver looks something like this (values shown are illustrative):
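```
# nameserver IP and the lab domain's position in the search list are assumptions
search simple-web-app.svc.cluster.local svc.cluster.local cluster.local another-lab.com
nameserver 10.96.0.10
options ndots:5
```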
If no Fully Qualified Domain Name is specified (i.e. just `db`), each domain on the `search` line above is combined with the name we are trying to resolve, creating an FQDN which is then queried.
So, using the above search paths, we saw output along these lines in the CoreDNS logs (reconstructed for illustration; the query IDs, sizes and timings are not the originals):
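```
[INFO] 10.244.1.7:53682 - 31431 "A IN db.simple-web-app.svc.cluster.local. udp 53 false 512" NXDOMAIN qr,aa,rd 146 0.000517s
[INFO] 10.244.1.7:53682 - 31432 "A IN db.svc.cluster.local. udp 38 false 512" NXDOMAIN qr,aa,rd 131 0.000243s
[INFO] 10.244.1.7:53682 - 31433 "A IN db.cluster.local. udp 34 false 512" NXDOMAIN qr,aa,rd 127 0.000181s
[INFO] 10.244.1.7:53682 - 31434 "A IN db.another-lab.com. udp 36 false 512" NOERROR qr,rd,ra 70 0.012345s
```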
In the output, `NXDOMAIN` means the domain does not exist (non-existent domain).
The queries continue through the list of search paths until one is found (`db.another-lab.com`). As it turns out, someone had configured a record for `db.another-lab.com`, and coincidentally it was also running Postgres.
Info
So the Postgres instance we thought was ours was in fact a different DB altogether.
Fixing the error
By default, a pod's DNS search list includes the namespace of the pod and the cluster's default domain. Think of a namespace like a tenant in a multi-tenant solution - it's a way of isolating groups of resources in a cluster.
The default search list is `<namespace>.svc.cluster.local svc.cluster.local cluster.local`, so in the example above, `simple-web-app` is the namespace.
To fix this error we just had to create a new Kubernetes service for the DB. When this service was created (example below), `db.simple-web-app.svc.cluster.local` resolved correctly and pointed our webserver to the IP address of our local `db` pod.
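```yaml
# Service for the database - the name "db" in the "simple-web-app" namespace is
# what matters here; the selector labels and port are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: simple-web-app
spec:
  selector:
    app: db
  ports:
    - port: 5432
      targetPort: 5432
```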
How can we address this in the future?
- Package all resources of the application/tiers using a tool like Helm, making it less likely to miss a resource in the app deployment
https://medium.com/hprog99/helm-a-practical-tutorial-429beeda214a
- Update the `DJANGO_SQL_HOST` from `db` to `db.simple-web-app.svc.cluster.local.` so that the absolute name will be tried before using any search paths (see the snippet after this list)
- Implement observability as part of the migration and deployment of future applications
- If we could see the flows and connections between pods (e.g. AppDynamics), perhaps we would have seen that the web application was trying to connect to the wrong database
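For the second point, the change is a one-liner in the webserver Deployment (a sketch; note the trailing dot makes the name absolute):

```yaml
env:
  - name: DJANGO_SQL_HOST
    # Absolute name (trailing dot): resolved as-is, the search list is skipped
    value: db.simple-web-app.svc.cluster.local.
```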
Some additional info
FQDN
A fully qualified domain name (FQDN), sometimes also referred to as an absolute domain name, is a domain name that specifies its exact location in the tree hierarchy of the Domain Name System (DNS)
ndots
A standard convention that most DNS resolvers follow is that if a domain ends with . (representing the root zone), the domain is considered to be FQDN
...
So mrkaran.dev. is an FQDN but mrkaran.dev is not.
https://mrkaran.dev/posts/ndots-kubernetes/
`ndots`: Represents the number of dots in a query name to consider it a fully qualified domain name, i.e. using `ndots:5`, `db.simple-web-app.svc.cluster.local` is not considered an FQDN and the search list will still be used because there are only 4 dots.
The default is `ndots:5` for Kubernetes. See here for the reasons:
https://github.com/kubernetes/kubernetes/issues/33554#issuecomment-266251056
Since we referenced `db` and not `db.` (note the period), the search paths were used. If we had used `db.` (this time with the period), we would have seen a name resolution error along these lines (exact wording depends on the Django/psycopg2 versions):
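```
django.db.utils.OperationalError: could not translate host name "db." to address: Name or service not known
```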