Prometheus quick start
Here's a little quick start procedure I like to use to get the Prometheus monitoring system up and running. It features Prometheus itself, the node and SNMP exporters, separate directories for configured and discovered targets, and a couple of basic alerts.
So far, I've used this on Ubuntu 14.04 and Debian Jessie.
Prometheus
First, install Prometheus:
add-apt-repository ppa:ubuntu-lxc/lxd-stable
apt-get update
apt-get install golang build-essential
echo 'GOPATH="/opt/gocode"' >> /etc/environment
source /etc/environment
export GOPATH
go get github.com/prometheus/prometheus/cmd/prometheus
go get github.com/prometheus/prometheus/cmd/promtool
go get github.com/prometheus/node_exporter
go get github.com/prometheus/alertmanager
mkdir -p /etc/prometheus/targets/node
mkdir -p /etc/prometheus/targets/snmp
mkdir -p /var/lib/prometheus/data
mkdir -p /var/lib/prometheus/amdata
mkdir -p /var/lib/prometheus/discovery/node
mkdir -p /var/lib/prometheus/discovery/snmp
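If everything built, the binaries end up in /opt/gocode/bin. A quick sanity check (note that the 1.x-era binaries built here use single-dash flags):

/opt/gocode/bin/prometheus -version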
Put the following into /etc/prometheus/prometheus.yml:
scrape_configs:
  - job_name: "node"
    scrape_interval: "15s"
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/node/*.json'
          - '/var/lib/prometheus/discovery/node/*.json'

  - job_name: 'snmp'
    params:
      module: [default]
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/snmp/*.json'
          - '/var/lib/prometheus/discovery/snmp/*.json'
    relabel_configs:
      - source_labels: [instance]
        target_label: hostname
      - source_labels: [__address__]
        target_label: __param_address
      - source_labels: [__param_address]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9116'

rule_files:
  - /etc/prometheus/alert.rules
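Before starting anything, it can't hurt to let promtool validate the config. With the 1.x-era promtool built above, the subcommand should be check-config (newer releases spell it check config):

/opt/gocode/bin/promtool check-config /etc/prometheus/prometheus.yml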
For starters, you can monitor the Prometheus node itself using Node exporter by putting the following into /etc/prometheus/targets/node/localhost.json:
[ { "targets": ["127.0.0.1:9100"], "labels": { "instance": "localhost" } } ]
SNMP Exporter
Next, install the SNMP Exporter. You get to choose between the official branch:
apt-get install python-netsnmp python-dev python-pip
pip install snmp_exporter
and my own branch, which I extended with a more modular config and where I fixed an infinite loop that occurred in our network for some reason:
apt-get install python-netsnmp python-dev python-pip
cd /opt
git clone https://github.com/Svedrin/snmp_exporter.git
cd snmp_exporter
git checkout svedrin-master
python setup.py install
cp -r snmp.yml.d /etc/prometheus
(So, you only need to run one of the two sections above.)
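Either way, once the exporter is running you can test a scrape by hand. The path and parameters follow from the scrape config above; the target address is just an example:

curl -s 'http://127.0.0.1:9116/metrics?module=default&address=192.168.0.254' | head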
It totally helps if all your nodes are configured to use the same SNMP community and you have a discovery tool that can generate a JSON file that knows them all. This way, you can literally get up and running in minutes.
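For illustration, a file generated into /var/lib/prometheus/discovery/snmp/ could look like this (the hostnames are made up; the relabel rules above turn each target into the exporter's address parameter):

[
  { "targets": ["switch01.example.com"], "labels": { "instance": "switch01" } },
  { "targets": ["switch02.example.com"], "labels": { "instance": "switch02" } }
]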
Alerting
We installed the Alert Manager already; time to configure it. The config file is /etc/prometheus/alertmanager.conf:
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.derpyherp.com:25'
  smtp_from: 'prometheus@herpyderp.com'
  smtp_auth_username: 'derpity'
  smtp_auth_password: 'derpington'

route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 6h

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the alertname is the same.
    equal: ['alertname']

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'svedrin@herpyderp.com'
Rules go into /etc/prometheus/alert.rules:
ALERT node_down
  IF up{job="node"} == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Node is down",
    description = "Node has been unreachable for more than 5 minutes."
  }

ALERT snmp_down
  IF up{job="snmp"} == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "SNMP is down",
    description = "SNMP has been unreachable for more than 5 minutes."
  }

ALERT fs_at_80_percent
  IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.8
  FOR 15m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "File system {{$labels.hrStorageDescr}} is at 80%",
    description = "{{$labels.hrStorageDescr}} has been at 80% for more than 15 Minutes."
  }

ALERT fs_at_90_percent
  IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.9
  FOR 15m
  LABELS { severity = "average" }
  ANNOTATIONS {
    summary = "File system {{$labels.hrStorageDescr}} is at 90%",
    description = "{{$labels.hrStorageDescr}} has been at 90% for more than 15 Minutes."
  }

ALERT disk_load_mostly_random_reads
  IF rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) > 20
     AND rate(diskIONReadX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
  FOR 15m
  LABELS { severity = "info" }
  ANNOTATIONS {
    summary = "Disk {{$labels.diskIODevice}} reads are mostly random.",
    description = "{{$labels.diskIODevice}} reads have been mostly random for the past 15 Minutes."
  }

ALERT disk_load_mostly_random_writes
  IF rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) > 20
     AND rate(diskIONWrittenX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
  FOR 15m
  LABELS { severity = "info" }
  ANNOTATIONS {
    summary = "Disk {{$labels.diskIODevice}} writes are mostly random.",
    description = "{{$labels.diskIODevice}} writes have been mostly random for the past 15 Minutes."
  }

ALERT disk_load_high
  IF diskIOLA1{diskIODevice=~"(s|v)d[a-z]+"} > 30
  FOR 15m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Disk {{$labels.diskIODevice}} is at 30%",
    description = "{{$labels.diskIODevice}} Load has exceeded 30% over the past 15 Minutes."
  }

ALERT cpu_load_high
  IF ssCpuIdle < 70
  FOR 15m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "CPU is at 30%",
    description = "CPU Load has constantly exceeded 30% over the past 15 Minutes."
  }

ALERT linux_load_high
  IF laLoad1 > 50
  FOR 15m
  LABELS { severity = "average" }
  ANNOTATIONS {
    summary = "Linux Load is at 50",
    description = "Linux Load has constantly exceeded 50 over the past 15 Minutes."
  }

ALERT if_operstatus_changed
  IF delta(ifOperStatus[15m]) != 0
  LABELS { severity = "info" }
  ANNOTATIONS {
    summary = "Port {{$labels.ifDescr}} changed status",
    description = "Port {{$labels.ifDescr}} went up or down in the past 15 Minutes"
  }

ALERT if_traffic_at_30_percent
  IF ifSpeed > 10000000
     AND ifOperStatus == 1
     AND rate(ifInOctets[5m]) * 8 > ifSpeed * 0.3
  FOR 15m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Port {{$labels.ifDescr}} is at 30%",
    description = "Port {{$labels.ifDescr}} has had at least 30% traffic over the past 15 Minutes."
  }

ALERT if_traffic_at_70_percent
  IF ifSpeed > 10000000
     AND ifOperStatus == 1
     AND rate(ifInOctets[5m]) * 8 > ifSpeed * 0.7
  FOR 15m
  LABELS { severity = "average" }
  ANNOTATIONS {
    summary = "Port {{$labels.ifDescr}} is at 70%",
    description = "Port {{$labels.ifDescr}} has had at least 70% traffic over the past 15 Minutes."
  }
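promtool can syntax-check these as well (again in the 1.x spelling):

/opt/gocode/bin/promtool check-rules /etc/prometheus/alert.rules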
You could also put the instance name into the alert summary and/or description, but I'd advise against it. If you omit that info, you can more easily group alerts by their summary.
Upstart configs
All this stuff has to be started somehow. If you're on Ubuntu 14.04, you may want to (or find yourself forced to) use Upstart. So, here goes:
/etc/init/prometheus.conf:
# Run prometheus
start on startup

script
    cd /opt/gocode/src/github.com/prometheus/prometheus
    /opt/gocode/bin/prometheus \
        -storage.local.path="/var/lib/prometheus/data" \
        -config.file=/etc/prometheus/prometheus.yml \
        -alertmanager.url=http://localhost:9093/alert-manager/ \
        -web.external-url=http://192.168.0.1/prometheus
end script
/etc/init/alertmanager.conf:
# Run alert manager
start on startup

script
    /opt/gocode/bin/alertmanager \
        -log.level=debug \
        -storage.path="/var/lib/prometheus/amdata" \
        -config.file=/etc/prometheus/alertmanager.conf \
        -web.external-url=http://192.168.0.1/alert-manager/
end script
/etc/init/node-exporter.conf:
# Run node_exporter
start on startup

script
    /opt/gocode/bin/node_exporter
end script
/etc/init/snmp-exporter.conf:
# Run snmp_exporter
start on startup

script
    # This is only relevant for the Svedrin edition. Omit it for upstream.
    cat /etc/prometheus/snmp.yml.d/*.yml > /var/lib/prometheus/snmp.yml

    /usr/local/bin/snmp_exporter /var/lib/prometheus/snmp.yml
end script
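With these four files in place, Upstart picks them up automatically and you can start everything:

start prometheus
start alertmanager
start node-exporter
start snmp-exporter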
Systemd configs
If you're fortunate enough to be on a platform that supports Systemd, the following configs may come in handy.
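Note that the Prometheus and Alert Manager units below run as a dedicated prometheus user, which the steps above don't create. Something along these lines should do:

useradd --system --home /var/lib/prometheus --shell /bin/false prometheus
chown -R prometheus: /var/lib/prometheus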
/etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus server
After=network.target

[Service]
WorkingDirectory=/opt/gocode/src/github.com/prometheus/prometheus/
ExecStart=/opt/gocode/bin/prometheus \
    -storage.local.path=/var/lib/prometheus/data \
    -config.file=/etc/prometheus/prometheus.yml \
    -alertmanager.url=http://localhost:9093/alert-manager \
    -web.external-url=http://192.168.0.1/prometheus/
User=prometheus

[Install]
WantedBy=multi-user.target
/etc/systemd/system/alertmanager.service:
[Unit]
Description=Prometheus Alert Manager
After=network.target

[Service]
ExecStart=/opt/gocode/bin/alertmanager \
    -log.level=debug \
    -storage.path="/var/lib/prometheus/amdata" \
    -config.file=/etc/prometheus/alertmanager.conf \
    -web.external-url=http://192.168.0.1/alert-manager/
User=prometheus

[Install]
WantedBy=multi-user.target
/etc/systemd/system/node-exporter.service:
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/sbin/node_exporter
User=nobody

[Install]
WantedBy=multi-user.target
/etc/systemd/system/snmp-exporter.service:
[Unit]
Description=Prometheus SNMP Exporter
After=network.target

[Service]
WorkingDirectory=/opt/snmp_exporter
Environment=PYTHONPATH=.
ExecStart=/usr/bin/python scripts/snmp_exporter snmp.yml
User=nobody

[Install]
WantedBy=multi-user.target
(I haven't yet ported my snmp.yml.d mechanism to my systemd machine, so I don't have a config for that yet.)
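Once the unit files are in place, reload systemd, then enable and start the services:

systemctl daemon-reload
systemctl enable prometheus alertmanager node-exporter snmp-exporter
systemctl start prometheus alertmanager node-exporter snmp-exporter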
Apache2 Reverse Proxy
/etc/apache2/sites-available/prometheus.conf:
ProxyPass        /prometheus/    http://localhost:9090/prometheus/
ProxyPassReverse /prometheus/    http://localhost:9090/prometheus/

ProxyPass        /alert-manager/ http://localhost:9093/alert-manager/
ProxyPassReverse /alert-manager/ http://localhost:9093/alert-manager/
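For this to work, the proxy modules need to be enabled and the site activated (assuming the file is named prometheus.conf as above):

a2enmod proxy proxy_http
a2ensite prometheus
service apache2 reload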
Summary
This config illustrates a quick way to get started. I consider it more of a guideline than a production-ready setup, so please don't forget to adapt it to your needs. The alert rules especially will need some tuning.