Prometheus quick start

Svedrin

2016-07-05 15:50

Here's a little quick start procedure I like to use to get the Prometheus monitoring system up and running, featuring Prometheus itself, the node and SNMP exporters, separate directories for configured and discovered targets and a couple of basic alerts.

So far, I have used this on Ubuntu 14.04 and Debian Jessie.

Prometheus

First, install Prometheus:

add-apt-repository ppa:ubuntu-lxc/lxd-stable
apt-get update
apt-get install golang build-essential

echo 'GOPATH="/opt/gocode"' >> /etc/environment
source /etc/environment
export GOPATH

go get github.com/prometheus/prometheus/cmd/prometheus
go get github.com/prometheus/prometheus/cmd/promtool
go get github.com/prometheus/node_exporter
go get github.com/prometheus/alertmanager

mkdir -p /etc/prometheus/targets/node
mkdir -p /etc/prometheus/targets/snmp
mkdir -p /var/lib/prometheus/data
mkdir -p /var/lib/prometheus/amdata
mkdir -p /var/lib/prometheus/discovery/node
mkdir -p /var/lib/prometheus/discovery/snmp

Put the following into /etc/prometheus/prometheus.yml:

scrape_configs:
  - job_name: "node"
    scrape_interval: "15s"
    file_sd_configs:
      - files:
        - '/etc/prometheus/targets/node/*.json'
        - '/var/lib/prometheus/discovery/node/*.json'
  - job_name: 'snmp'
    params:
      module: [default]
    file_sd_configs:
      - files:
        - '/etc/prometheus/targets/snmp/*.json'
        - '/var/lib/prometheus/discovery/snmp/*.json'
    relabel_configs:
      - source_labels: [instance]
        target_label: hostname
      - source_labels: [__address__]
        target_label: __param_address
      - source_labels: [__param_address]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9116'
rule_files:
  - /etc/prometheus/alert.rules
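
The relabelling in the snmp job is the least obvious part of this file, so here is a minimal Python sketch of what those four rules do to a single target. It is not how Prometheus implements relabelling, only an approximation of the default `replace` action: source values are joined with `;`, matched against the regex `(.*)`, and an empty result removes the target label. Python writes the capture group as `\1` where Prometheus writes `$1`. The target address `switch1.example.com` and the label value `switch1` are made up for illustration.

```python
import re


def relabel(labels, rules):
    """Apply Prometheus-style 'replace' relabel rules to a copy of a label dict."""
    labels = dict(labels)
    for rule in rules:
        # Join the source label values with ';' (the default separator).
        source = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        match = re.fullmatch(rule.get("regex", "(.*)"), source, flags=re.DOTALL)
        if match is None:
            continue  # a rule that does not match leaves the labels untouched
        value = match.expand(rule.get("replacement", r"\1"))
        if value:
            labels[rule["target_label"]] = value
        else:
            labels.pop(rule["target_label"], None)  # empty value drops the label
    return labels


# The four rules of the snmp job, in the same order as in prometheus.yml.
SNMP_RULES = [
    {"source_labels": ["instance"], "target_label": "hostname"},
    {"source_labels": ["__address__"], "target_label": "__param_address"},
    {"source_labels": ["__param_address"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "127.0.0.1:9116"},
]

# Hypothetical target as it would come out of a target file.
target = {"__address__": "switch1.example.com", "instance": "switch1"}
result = relabel(target, SNMP_RULES)
for name in sorted(result):
    print(name, "=", result[name])
```

In other words: the device address ends up in the `address` URL parameter and in the `instance` label, the original `instance` label survives as `hostname`, and the scrape itself goes to the SNMP exporter on 127.0.0.1:9116.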

For starters, you can monitor the Prometheus node itself using Node exporter by putting the following into /etc/prometheus/targets/node/localhost.json:

[
  {
    "targets": ["127.0.0.1:9100"],
    "labels": {
      "instance": "localhost"
    }
  }
]
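
A typo in a target file is easy to miss, so a small sanity check can help before dropping files into the targets directory. The sketch below is only an assumption about what is worth checking: that the file is a list of groups, that `targets` is a list of strings, and that label values are strings. The function name `check_target_groups` is made up.

```python
import json


def check_target_groups(text):
    """Return a list of problems found in a file_sd JSON document (empty if none)."""
    groups = json.loads(text)
    if not isinstance(groups, list):
        return ["top level must be a list of target groups"]
    problems = []
    for index, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append(f"group {index}: must be an object")
            continue
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            problems.append(f"group {index}: 'targets' must be a list of strings")
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(isinstance(v, str) for v in labels.values()):
            problems.append(f"group {index}: 'labels' must map names to strings")
    return problems


# The localhost example from above.
LOCALHOST = '[{"targets": ["127.0.0.1:9100"], "labels": {"instance": "localhost"}}]'
print(check_target_groups(LOCALHOST))
```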

SNMP Exporter

Next, install the SNMP Exporter. You get to choose between the official branch:

apt-get install python-netsnmp python-dev python-pip
pip install snmp_exporter

and my own branch, which has a more modular config and fixes an infinite loop that occurred in our network for some reason:

apt-get install python-netsnmp python-dev python-pip
cd /opt
git clone https://github.com/Svedrin/snmp_exporter.git
cd snmp_exporter
git checkout svedrin-master
python setup.py install
cp -r snmp.yml.d /etc/prometheus

(So, you only need to run one of the two sections above.)

It totally helps if all your nodes are configured to use the same SNMP community and you have a discovery tool that can generate a JSON file that knows them all. This way, you can literally get up and running in minutes.
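
If you don't have such a discovery tool yet, a few lines of Python are enough to start with. The following is a rough sketch, not the tool I use: `write_targets` is a made-up helper that takes a list of host names from wherever you keep them and writes one target group in the JSON format shown earlier. It writes to a temporary file first and renames it, so Prometheus should never pick up a half-written file, and the temporary name does not match the `*.json` glob.

```python
import json
import os
import tempfile


def write_targets(path, hosts, labels=None):
    """Atomically write a single file_sd target group to `path`."""
    groups = [{"targets": sorted(hosts), "labels": labels or {}}]
    directory = os.path.dirname(path) or "."
    # The ".tmp" suffix keeps the unfinished file out of the '*.json' glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by the owner only
    os.replace(tmp_path, path)  # atomic rename within the same directory
    return groups


# Example call (host names are placeholders):
# write_targets("/var/lib/prometheus/discovery/snmp/switches.json",
#               ["switch1", "switch2"])
```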

Alerting

We installed Alert Manager already, time to configure it -- the config file is /etc/prometheus/alertmanager.conf:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.derpyherp.com'
  smtp_from: 'prometheus@herpyderp.com'
  smtp_auth_username: 'derpity'
  smtp_auth_password: 'derpington'

route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 6h

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname']

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'svedrin@herpyderp.com'

Rules go into /etc/prometheus/alert.rules:

ALERT node_down
  IF up{job="node"} == 0
  FOR 5m
  ANNOTATIONS {
    summary = "Node is down",
    description = "Node has been unreachable for more than 5 minutes.",
    severity = "warning"
  }

ALERT snmp_down
  IF up{job="snmp"} == 0
  FOR 5m
  ANNOTATIONS {
    summary = "SNMP is down",
    description = "SNMP has been unreachable for more than 5 minutes.",
    severity = "warning"
  }

ALERT fs_at_80_percent
  IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.8
  FOR 15m
  ANNOTATIONS {
    summary = "File system {{$labels.hrStorageDescr}} is at 80%",
    description = "{{$labels.hrStorageDescr}} has been at 80% for more than 15 Minutes.",
    severity = "warning"
  }

ALERT fs_at_90_percent
  IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.9
  FOR 15m
  ANNOTATIONS {
    summary = "File system {{$labels.hrStorageDescr}} is at 90%",
    description = "{{$labels.hrStorageDescr}} has been at 90% for more than 15 Minutes.",
    severity = "average"
  }

ALERT disk_load_mostly_random_reads
  IF rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND
     rate(diskIONReadX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
  FOR 15m
  ANNOTATIONS {
    summary = "Disk {{$labels.diskIODevice}} reads are mostly random.",
    description = "{{$labels.diskIODevice}} reads have been mostly random for the past 15 Minutes.",
    severity = "info"
  }

ALERT disk_load_mostly_random_writes
  IF rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND
     rate(diskIONWrittenX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
  FOR 15m
  ANNOTATIONS {
    summary = "Disk {{$labels.diskIODevice}} writes are mostly random.",
    description = "{{$labels.diskIODevice}} writes have been mostly random for the past 15 Minutes.",
    severity = "info"
  }

ALERT disk_load_high
  IF diskIOLA1{diskIODevice=~"(s|v)d[a-z]+"} > 30
  FOR 15m
  ANNOTATIONS {
    summary = "Disk {{$labels.diskIODevice}} is at 30%",
    description = "{{$labels.diskIODevice}} Load has exceeded 30% over the past 15 Minutes.",
    severity = "warning"
  }

ALERT cpu_load_high
  IF ssCpuIdle < 70
  FOR 15m
  ANNOTATIONS {
    summary = "CPU is at 30%",
    description = "CPU Load has constantly exceeded 30% over the past 15 Minutes.",
    severity = "warning"
  }

ALERT linux_load_high
  IF laLoad1 > 50
  FOR 15m
  ANNOTATIONS {
    summary = "Linux Load is at 50",
    description = "Linux Load has constantly exceeded 50 over the past 15 Minutes.",
    severity = "average"
  }

ALERT if_operstatus_changed
  IF delta(ifOperStatus[15m]) != 0
  ANNOTATIONS {
    summary = "Port {{$labels.ifDescr}} changed status",
    description = "Port {{$labels.ifDescr}} went up or down in the past 15 Minutes",
    severity = "info"
  }

ALERT if_traffic_at_30_percent
  IF ifSpeed > 10000000 AND
     ifOperStatus == 1 AND
     rate(ifInOctets[5m]) * 8 > ifSpeed * 0.3
  FOR 15m
  ANNOTATIONS {
    summary = "Port {{$labels.ifDescr}} is at 30%",
    description = "Port {{$labels.ifDescr}} has had at least 30% traffic over the past 15 Minutes.",
    severity = "warning"
  }

ALERT if_traffic_at_70_percent
  IF ifSpeed > 10000000 AND
     ifOperStatus == 1 AND
     rate(ifInOctets[5m]) * 8 > ifSpeed * 0.7
  FOR 15m
  ANNOTATIONS {
    summary = "Port {{$labels.ifDescr}} is at 70%",
    description = "Port {{$labels.ifDescr}} has had at least 70% traffic over the past 15 Minutes.",
    severity = "average"
  }

You could also put the instance name into the alert summary and/or description, but I'd advise against it. If you omit that info, you can more easily group alerts by their summary.
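
When tuning the interface traffic thresholds, keep the units in mind: IF-MIB counts ifInOctets in bytes, while ifSpeed is given in bits per second. Here is a small Python sketch of the arithmetic behind the traffic rules, with the byte rate converted to bits before comparing. The function names and the sample numbers are made up for illustration.

```python
def link_utilisation(octets_per_second, if_speed_bps):
    """Fraction of the link capacity in use (octets are bytes, ifSpeed is bits/s)."""
    return octets_per_second * 8 / if_speed_bps


def traffic_alert(octets_per_second, if_speed_bps, oper_status, threshold):
    """Same three conditions as the if_traffic rules: fast link, port up, busy."""
    return (
        if_speed_bps > 10_000_000
        and oper_status == 1
        and link_utilisation(octets_per_second, if_speed_bps) > threshold
    )


# A gigabit port receiving 50 MB/s is 40% utilised:
# it trips the 30% rule but not the 70% rule.
print(link_utilisation(50_000_000, 1_000_000_000))
print(traffic_alert(50_000_000, 1_000_000_000, 1, 0.3))
print(traffic_alert(50_000_000, 1_000_000_000, 1, 0.7))
```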

Upstart configs

All this stuff has to be started somehow. If you're on Ubuntu 14.04, you may want to (or find yourself forced to) use upstart. So, here goes:

/etc/init/prometheus.conf:

# Run prometheus

start on startup

script
        cd /opt/gocode/src/github.com/prometheus/prometheus
        /opt/gocode/bin/prometheus \
                -storage.local.path="/var/lib/prometheus/data" \
                -config.file=/etc/prometheus/prometheus.yml \
                -alertmanager.url=http://localhost:9093/alert-manager/ \
                -web.external-url=http://192.168.0.1/prometheus
end script

/etc/init/alertmanager.conf:

# Run alert manager

start on startup

script
   /opt/gocode/bin/alertmanager \
        -log.level=debug \
        -storage.path="/var/lib/prometheus/amdata" \
        -config.file=/etc/prometheus/alertmanager.conf \
        -web.external-url=http://192.168.0.1/alert-manager/
end script

/etc/init/node-exporter.conf:

# Run node_exporter

start on startup

script
   /opt/gocode/bin/node_exporter
end script

/etc/init/snmp-exporter.conf:

# Run snmp_exporter

start on startup

script
    # This is only relevant for the Svedrin edition. Omit it for upstream.
    cat /etc/prometheus/snmp.yml.d/*.yml > /var/lib/prometheus/snmp.yml
    /usr/local/bin/snmp_exporter /var/lib/prometheus/snmp.yml
end script

Systemd configs

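If you're fortunate enough to be on a platform that supports systemd, the following configs may come in handy. Note that the Prometheus and Alert Manager units run as a prometheus user, which has to exist and needs write access to /var/lib/prometheus.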

/etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus server
After=network.target

[Service]
WorkingDirectory=/opt/gocode/src/github.com/prometheus/prometheus/
ExecStart=/opt/gocode/bin/prometheus \
    -storage.local.path=/var/lib/prometheus/data \
    -config.file=/etc/prometheus/prometheus.yml \
    -alertmanager.url=http://localhost:9093/alert-manager \
    -web.external-url=http://192.168.0.1/prometheus/
User=prometheus

[Install]
WantedBy=multi-user.target

/etc/systemd/system/alertmanager.service:

[Unit]
Description=Prometheus Alert Manager
After=network.target

[Service]
ExecStart=/opt/gocode/bin/alertmanager \
    -log.level=debug \
    -storage.path=/var/lib/prometheus/amdata \
    -config.file=/etc/prometheus/alertmanager.conf \
    -web.external-url=http://192.168.0.1/alert-manager/
User=prometheus

[Install]
WantedBy=multi-user.target

/etc/systemd/system/node-exporter.service:

[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
ExecStart=/opt/gocode/bin/node_exporter
User=nobody

[Install]
WantedBy=multi-user.target

/etc/systemd/system/snmp-exporter.service:

[Unit]
Description=Prometheus SNMP Exporter
After=network.target

[Service]
WorkingDirectory=/opt/snmp_exporter
Environment=PYTHONPATH=.
ExecStart=/usr/bin/python scripts/snmp_exporter snmp.yml
User=nobody

[Install]
WantedBy=multi-user.target

(I haven't yet ported my snmp.yml.d mechanism to my systemd machine, so I don't have a config for that yet.)
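
One way the `cat` step from the upstart job could be reproduced on a systemd machine is a tiny script that runs before the exporter starts, for example from an ExecStartPre= line. The following Python sketch is untested in that role and purely an assumption about how such a port could look; `build_snmp_config` is a made-up name. It joins the fragments in sorted order, which matches the shell glob for plain ASCII file names.

```python
import glob
import os


def build_snmp_config(fragment_dir, output_path):
    """Concatenate all *.yml fragments in sorted order, like `cat dir/*.yml > out`."""
    fragments = sorted(glob.glob(os.path.join(fragment_dir, "*.yml")))
    with open(output_path, "w") as out:
        for fragment in fragments:
            with open(fragment) as handle:
                out.write(handle.read())
    return fragments


# Example call, using the paths from the upstart job:
# build_snmp_config("/etc/prometheus/snmp.yml.d", "/var/lib/prometheus/snmp.yml")
```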

Apache2 Reverse Proxy

/etc/apache2/sites-available/prometheus.conf:

ProxyPass        /prometheus/ http://localhost:9090/prometheus/
ProxyPassReverse /prometheus/ http://localhost:9090/prometheus/

ProxyPass        /alert-manager/ http://localhost:9093/alert-manager/
ProxyPassReverse /alert-manager/ http://localhost:9093/alert-manager/

Summary

This config illustrates a quick way to get started. I consider it more of a guideline than a production-ready setup, so please don't forget to adapt it to your needs. The alert rules in particular will need some tuning.