Monitoring DNS replication
Suppose you’re running a few DNS servers spread across multiple locations in your infrastructure, and you want to monitor DNS replication to make sure they’re all on the same page. Here’s one way to do that using Prometheus and a bit of scripting.
Originally, I just used the blackbox exporter to query the SOA record from the master DC and the backup DCs and compared the serial numbers. Technically this worked, but for some reason the serial in those entries can lag behind for a week or two, so it isn’t really helpful. I eventually switched to a mechanism that writes the current timestamp into a DNS TXT record in the domain, plus a second script that exports this entry for Prometheus to read:
# dig +short dns-monitor-stamp.local.lan TXT
"1585917601"
If the timestamp a server returns falls too far behind the current time, I’ll get an alert.
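To make the check concrete, here is a minimal Python sketch of the comparison, assuming the TXT value arrives as the quoted string that dig prints. The function name `stamp_age` is a hypothetical helper chosen for illustration, and the 2000-second threshold in the example simply mirrors the alert rule at the end of this post.

```python
from time import time


def stamp_age(txt_value, now=None):
    """Return the age in seconds of a timestamp read from the TXT record.

    txt_value is the raw answer as printed by dig, e.g. '"1585917601"'.
    now defaults to the current Unix time; pass it explicitly for testing.
    """
    stamp = int(txt_value.strip().strip('"'))
    if now is None:
        now = time()
    return now - stamp


# A record written 2400 seconds ago is older than a 2000-second threshold.
print(stamp_age('"1585917601"', now=1585920001) > 2000)
```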
Injector
What makes this part tricky is the fact that you’ll need to be authenticated to the DC in order to update DNS. I created a dns user in AD for this, and created a keytab that holds its password using ktutil:
# klist -k -t /secrets/dns.keytab
Keytab name: FILE:/secrets/dns.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
2 06.11.2019 13:55:38 dns@local.lan
2 06.11.2019 13:57:42 dns@LOCAL.LAN
krb5.conf is pretty straightforward:
[libdefaults]
default_realm = LOCAL.LAN
dns_lookup_realm = false
dns_lookup_kdc = true
And finally, here’s the script that creates the record using a samba-tool dns add/update command:
#!/bin/bash

set -e
set -u

RECORD="dns-monitor-stamp"
DOMAIN="local.lan"
SERVER="dc.local.lan"
USER="dns"
KEYTAB="/secrets/dns.keytab"

# Read the timestamp currently stored in the record (empty if it does not exist)
CURRENT_VALUE="$(dig "@$SERVER" +short "$RECORD.$DOMAIN" TXT | tr -d '"')"
NEW_VALUE="$(date '+%s')"

if [ ! -r "$KEYTAB" ]; then
    echo >&2 "Keytab file ($KEYTAB) not found or not readable, aborting"
    exit 1
fi

# Create or update the DNS record
if [ -z "$CURRENT_VALUE" ]; then
    kinit -k -t "$KEYTAB" "$USER"
    samba-tool dns add "$SERVER" "$DOMAIN" "$RECORD" TXT "$NEW_VALUE"
    kdestroy
elif [ "$NEW_VALUE" -gt "$((CURRENT_VALUE + 1700))" ]; then
    # We wait roughly half an hour (1700 seconds) between updates.
    # -gt compares numerically; > inside [[ ]] would compare as strings.
    kinit -k -t "$KEYTAB" "$USER"
    samba-tool dns update "$SERVER" "$DOMAIN" "$RECORD" TXT "$CURRENT_VALUE" "$NEW_VALUE"
    kdestroy
else
    echo >&2 "Current record is younger than half an hour, leaving as-is"
fi
I leave a significant gap between updates so that replication has enough time to actually happen. I was surprised by how fast it is, though.
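The branching in the injector boils down to three outcomes, which the following Python sketch mirrors as a pure function. The name `decide_action` and its return values are hypothetical and exist only for illustration; the 1700-second default matches the script. My assumption is that the script runs from a scheduler every 30 minutes, in which case a limit slightly below 1800 seconds lets every run refresh the record.

```python
def decide_action(current_value, now, min_interval=1700):
    """Mirror the injector's branching: 'add', 'update' or 'skip'.

    current_value is the TXT content with quotes removed ('' if the record
    does not exist yet); now is the current Unix time as an integer.
    """
    if not current_value:
        return "add"
    # Compare as integers; comparing the strings would sort lexicographically.
    if now > int(current_value) + min_interval:
        return "update"
    return "skip"


print(decide_action("", 1585917601))                   # no record yet
print(decide_action("1585917601", 1585917601 + 600))   # ten minutes old
print(decide_action("1585917601", 1585917601 + 1800))  # thirty minutes old
```

The integer conversion matters: a plain string comparison only gives the right answer as long as both timestamps have the same number of digits.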
Monitor
Here’s the Python script I use to query the DNS servers for our timestamp record. It exposes the age of said timestamp as a metric, so that alerts can be configured to yell whenever the record has gone stale:
from uuid import uuid4
from time import time

import dns.exception
import dns.resolver

from flask import Flask, request, Response

app = Flask(__name__)


@app.route("/")
def hai():
    return Response("""<a href="/metrics">metrics</a>""")


@app.route("/metrics")
def metrics():
    # The DNS server to query is passed in by Prometheus as ?server=...
    server = request.args.get("server", "dc.local.lan")

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    try:
        answers = resolver.query('dns-monitor-stamp.local.lan', 'TXT', lifetime=1)
    except dns.resolver.NoAnswer:
        return Response(
            'dns_stamp_not_found{server="%s"} 1.0\n' % server,
            mimetype="text/plain"
        )
    except dns.exception.Timeout:
        return Response(
            'dns_stamp_timeout{server="%s"} 1.0\n' % server,
            mimetype="text/plain"
        )
    else:
        # Use the newest timestamp in case several TXT strings are returned
        latest_timestamp = max([
            float(txt_string)
            for rdata in answers
            for txt_string in rdata.strings
        ])
        return Response(
            'dns_stamp_age{server="%s"} %f\n' % (server, time() - latest_timestamp),
            mimetype="text/plain"
        )


app.secret_key = str(uuid4())
app.debug = True
app.run(host="0.0.0.0", port=9543)
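One detail worth considering: the server value comes straight from the request and is interpolated into the label as-is. The Prometheus text format expects backslashes, double quotes and newlines inside label values to be escaped, so a small helper could guard against malformed output. This is a sketch only; `escape_label_value` and `metric_line` are hypothetical names and not part of the script above.

```python
def escape_label_value(value):
    """Escape a string for use inside a quoted Prometheus label value."""
    return (
        value.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def metric_line(name, server, value):
    """Render one sample in the text exposition format."""
    return '%s{server="%s"} %f\n' % (name, escape_label_value(server), value)


print(metric_line("dns_stamp_age", "192.168.0.5", 123.0), end="")
```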
Targets can be configured like so:
{
    "targets": ["127.0.0.1:9543"],
    "labels": {
        "service": "dns",
        "instance": "dc",
        "fqdn": "dc.local.lan",
        "__param_server": "192.168.0.5"
    }
}
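Since the point is to compare several servers, you will want one such entry per DC, each pointing at the same exporter but with a different `__param_server`. Here is a rough Python sketch that generates them, assuming the targets are loaded through Prometheus file-based service discovery, where the file holds a JSON list of these objects. The `build_targets` function and the second server (`dc2`, `192.168.0.6`) are made up for illustration.

```python
import json


def build_targets(servers, exporter="127.0.0.1:9543"):
    """Build one target group per DNS server.

    servers maps a hypothetical instance name to (fqdn, ip_address).
    """
    return [
        {
            "targets": [exporter],
            "labels": {
                "service": "dns",
                "instance": instance,
                "fqdn": fqdn,
                "__param_server": ip,
            },
        }
        for instance, (fqdn, ip) in sorted(servers.items())
    ]


# Hypothetical inventory; only "dc" appears in the configuration above.
servers = {
    "dc": ("dc.local.lan", "192.168.0.5"),
    "dc2": ("dc2.local.lan", "192.168.0.6"),
}
print(json.dumps(build_targets(servers), indent=2))
```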
Combine this with an alert rule along the lines of dns_stamp_age > 2000, and you’ll be all set.