Monitoring DNS replication

Suppose you’re running a few DNS servers spread across multiple locations in your infrastructure, and you want to monitor DNS replication to make sure they’re all on the same page. Here’s one way to do that using Prometheus and a bit of scripting.

Originally, I tried to just use the blackbox exporter to query the SOA record from the master DC and the backup DCs, and compare the serial numbers. Technically this worked, but for some reason the serial in those entries can lag behind for a week or two, so it is not a useful signal. I finally switched to a mechanism that writes the current Unix timestamp into a DNS TXT record in the domain, plus a second script that exports the age of this record for Prometheus to read:

# dig +short dns-monitor-stamp.local.lan TXT
"1585917601"

If this value falls too far behind the current time on any of the servers, I’ll get an alert.
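As a rough illustration of that check, the snippet below turns the quoted TXT value printed by dig into an age in seconds. The function name stamp_age and the fixed "now" value are made up for this example.

```python
def stamp_age(txt_value: str, now: float) -> float:
    """Return the age in seconds of a timestamp stored in a TXT record.

    txt_value is the string as printed by `dig +short`, quotes included.
    """
    stamp = float(txt_value.strip().strip('"'))
    return now - stamp


# Example: the record shown above, checked ten minutes after it was written.
print(stamp_age('"1585917601"', now=1585917601 + 600))
```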

Injector

What makes this part tricky is that you need to authenticate against the DC in order to update DNS. I created a dns user in AD for this, and used ktutil to create a keytab that holds its Kerberos keys:

# klist -k -t /secrets/dns.keytab
Keytab name: FILE:/secrets/dns.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   2 06.11.2019 13:55:38 dns@local.lan
   2 06.11.2019 13:57:42 dns@LOCAL.LAN

krb5.conf is pretty straightforward:

[libdefaults]
        default_realm = LOCAL.LAN
        dns_lookup_realm = false
        dns_lookup_kdc = true

And finally, here’s the script that creates the record with samba-tool dns add, or refreshes it with samba-tool dns update if it already exists:

#!/bin/bash

set -e
set -u

RECORD="dns-monitor-stamp"
DOMAIN="local.lan"
SERVER="dc.local.lan"

USER="dns"
KEYTAB="/secrets/dns.keytab"

CURRENT_VALUE="$(dig "@$SERVER" +short "$RECORD.$DOMAIN" TXT | tr -d '"')"
NEW_VALUE="$(date '+%s')"

if [ ! -r "$KEYTAB" ]; then
    echo >&2 "Keytab file ($KEYTAB) not found or not readable, aborting"
    exit 1
fi

# Create or update the DNS record
if [ -z "$CURRENT_VALUE" ]; then
    kinit -k -t "$KEYTAB" "$USER"
    samba-tool dns add "$SERVER" "$DOMAIN" "$RECORD" TXT "$NEW_VALUE"
    kdestroy
elif (( NEW_VALUE > CURRENT_VALUE + 1700 )); then
    # Numeric comparison: only update once the record is older than
    # 1700 seconds, i.e. roughly half an hour between updates
    kinit -k -t "$KEYTAB" "$USER"
    samba-tool dns update "$SERVER" "$DOMAIN" "$RECORD" TXT "$CURRENT_VALUE" "$NEW_VALUE"
    kdestroy
else
    echo >&2 "Current record is younger than half an hour, leaving as-is"
fi

I leave a generous interval between updates so that replication has enough time to propagate each change. In practice, I was surprised how fast replication is.
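The decision the script makes can be sketched as a small pure function, which is easier to test than the shell version. This is only an illustration: the name plan_action and the return values are invented here, and the 1700 second threshold is taken from the script above.

```python
MIN_AGE_SECONDS = 1700  # same threshold as the shell script


def plan_action(current_value: str, now: int, min_age: int = MIN_AGE_SECONDS) -> str:
    """Decide what the injector should do with the TXT record.

    current_value is the unquoted TXT content, or "" if the record
    does not exist yet. Returns "add", "update" or "skip".
    """
    if not current_value:
        return "add"
    # Compare as integers; a string comparison would only be correct
    # while both values happen to have the same number of digits.
    if now > int(current_value) + min_age:
        return "update"
    return "skip"


print(plan_action("", now=1585917601))            # no record yet
print(plan_action("1585917601", now=1585919401))  # record is 1800 s old
print(plan_action("1585917601", now=1585918000))  # record is 399 s old
```

Keeping the comparison numeric matters because timestamps are compared by value, not character by character.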

Monitor

Here’s the Python script I use to query the DNS servers for our timestamp record. It exposes the age of said timestamp as a metric, so that alerts can be configured to yell whenever the record has gone stale:

from uuid import uuid4
from time import time
import dns.exception
import dns.resolver
from flask import Flask, request, Response

app = Flask(__name__)

@app.route("/")
def hai():
    return Response("""<a href="/metrics">metrics</a>""")

@app.route("/metrics")
def metrics():
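    # Note: dnspython expects IP addresses in resolver.nameservers,
    # so callers should pass ?server=<ip>
    server = request.args.get("server", "dc.local.lan")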

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    try:
        answers = resolver.query('dns-monitor-stamp.local.lan', 'TXT', lifetime=1)
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        # NXDOMAIN: the name does not exist; NoAnswer: no TXT record for it
        return Response(
            'dns_stamp_not_found{server="%s"} 1.0\n' % server,
            mimetype="text/plain"
        )
    except dns.exception.Timeout:
        return Response(
            'dns_stamp_timeout{server="%s"} 1.0\n' % server,
            mimetype="text/plain"
        )
    else:
        latest_timestamp = max([
            float(txt_string)
            for rdata in answers
            for txt_string in rdata.strings
        ])

        return Response(
            'dns_stamp_age{server="%s"} %f\n' % (server, time() - latest_timestamp),
            mimetype="text/plain"
        )

app.secret_key = str(uuid4())
app.debug = False  # the interactive debugger must not be reachable on 0.0.0.0
app.run(host="0.0.0.0", port=9543)

Targets can be configured like so:

  {
    "targets": ["127.0.0.1:9543"],
    "labels": {
      "service": "dns",
      "instance": "dc",
      "fqdn": "dc.local.lan",
      "__param_server": "192.168.0.5"
    }
  }
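Since every DNS server needs its own entry, the target list can be generated rather than written by hand. The sketch below assumes Prometheus reads the result through file-based service discovery and that the exporter listens on 127.0.0.1:9543 as above. The second server (dc2, 192.168.0.6) and the function name build_targets are made up for illustration. Prometheus turns a __param_server label into the server URL parameter of the scrape request, which is how a single exporter can probe several servers.

```python
import json

EXPORTER = "127.0.0.1:9543"  # where the Flask script listens


def build_targets(servers: dict) -> list:
    """Build one target group per DNS server.

    servers maps a short instance name to the server's IP address.
    """
    return [
        {
            "targets": [EXPORTER],
            "labels": {
                "service": "dns",
                "instance": name,
                "fqdn": f"{name}.local.lan",
                "__param_server": ip,
            },
        }
        for name, ip in servers.items()
    ]


# "dc2" and its address are hypothetical.
servers = {"dc": "192.168.0.5", "dc2": "192.168.0.6"}
print(json.dumps(build_targets(servers), indent=2))
```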

Combine this with an alert rule along the lines of dns_stamp_age > 2000, and you’ll be all set.