Monitoring with SSH and Prometheus

I just wanted to see some qps metrics from BIND9.

This is a bit of hand wavey post on how to set up remote monitoring with Prometheus, Grafana and SSH tunnels.

Initially I just wanted to monitor BIND9 because it actually exports some reasonable metrics, that can be made usable with bind_exporter. But of course BIND is BIND so this is different in BIND 9.10 which is what I have on my server…. sigh. @jpmens also has some interesting tidbits about this BIND9 feature.

The idea to use SSH to remotely scrape hosts came from here. The rest of this post is about how I did not get all BIND9 metrics working, but got something decent running in the end:

Some basic BIND9 metrics, sadly no qps information yet.

Configuring BIND and bind_exporter⌗

Get and install “bind_exporter” and configure BIND to actually export something, I’m using the heuristic “monitoring-port = 9100 + service-port”, in your BIND9 config:

statistics-channels {
   inet * port 9154 allow { 127.0.0.1; }; };

Add a service file (dealing with systemd here) to start bind_exporter so we launch it together with named: /etc/systemd/system/bind_exporter.service:

[Unit]
Description=BIND Prometheus Exporter
After=bind9.service

[Service]
ExecStart=/opt/bin/bind_exporter -bind.pid-file=/var/run/named/named.pid \
    -bind.statsuri=http://localhost:9154 \
    -web.listen-address=localhost:9153

[Install]
WantedBy=multi-user.target

Check if stuff works: curl localhost:9154 should result in some hideous XML and curl localhost:9153/metrics, should result in, ahhh, sane Prometheus metrics.

Scraping with SSH⌗

Ubuntu 16.10 comes with Prometheus and Grafana (note this font-awesome bug in the current Ubuntu one) in the repos. I just installed those, and follow a guide to set these up. The only thing here that I want to mention is using SSH port forwarding to remote scrape targets.

We are going to scrape a lot of localhosts, with some relabeling magic this becomes more readable in Prometheus. (It’s a bit verbose, but I forgot and can’t quickly find how to make this more DRY.)

  - job_name: node
    target_groups:
      - targets: ['localhost:9100', 'localhost:9300']
    relabel_configs:
      - source_labels: ['__address__']
        target_label: 'instance'
        regex: '.+(:93.+)'
        replacement: 'linode'
      - source_labels: ['__address__']
        target_label: 'instance'
        regex: '.+(:91.+)'
        replacement: 'x64'

Now to setup SSH tunnels you’ll need a user that can’t do much, see restricting access for a user only used for tunneling. Add an monitor user all systems:

sudo adduser monitor --gecos "Monitoring Tunnel" --shell /bin/false

Become this user (% sudo -sH -u monitor /bin/bash) and configure SSH keys and test the connection to get the The authenticity of host ... prompt out of the way.

Use systemd magic and write a service file that starts the tunnels. In the end this took most of the time to get right. I’ve settled on a little script such as: /opt/bin/autossh-9100:

#!/bin/bash
autossh="/usr/bin/autossh -f -N -L"
host="$1"

# extract port number from $0
exe=$(basename $0)
port=${exe##autossh\-}

case "${host}" in
    linode)
        remote=$((port + 200))
    ;;
    voordeur)
        remote=$((port + 400))
    ;;
esac
$autossh ${remote}:localhost:${port} \
    monitor@${host}.atoom.net

And then call that from systemd, with service template files that reference the host monitor-tunnel-9100@.service:

[Unit]
Description=Monitor Tunnel Setup %i
After=prometheus.service

[Service]
User=monitor
ExecStart=/opt/bin/autossh-9100 %i
Restart=on-failure

[Install]
WantedBy=multi-user.target

Which you can then symlink and call: monitor-tunnel-9100@linode.service. If you’re wondering why you just can’t start a shell script with all tunnels: each of the autossh forks and manages its own tunnel: multiple forking processes is not something systemd can deal with… Hence this setup.

Next Steps⌗

Next is “just” configuring Prometheus and Grafana to scrape and display the correct metrics. Alerting is something that needs to happen, but I haven’t touched in this post. And I need to fix bind_exporter to get me some more metrics.