In my last post, I ended with a look ahead at the tasks that were my top priority for this website. One of these was setting up monitoring, which I will begin exploring today. So let’s jump into it, starting with…

The Objective

Broadly speaking, I want to know three things: is my website online, is it going to stay online, and how well is it performing? This requires data (metrics and logs) and an interface to read, structure, alert on and interpret this data.

In the corporate world I would want to do stuff like run reports on this data to determine my SLIs, compare where they’re at in relation to SLOs and SLAs, measure my errors against an error budget and so on. That would be a bit much for my personal website, although it would make a great blog post 😉 so I’ll probably attempt this at some point in the future anyway.

The Environment

In the welcome post I explained a little about my environment, so go check that out for more information if you want. In short, the static website we will be monitoring lives in an S3 bucket that is served up by AWS CloudFront.

Monitoring will be done from my personal home server located in my home network, but more on that below when we get to the plan.

The Plan

So here’s what I ended up deciding on.

Grafana will be our dashboard, among many other things. It’s an excellent open source project that’s super flexible and lets us work with our metrics and logs in so many ways. It’s also going to manage our alerts, which I will be sending to Discord (chosen because it’s free and my wife already uses the platform, so she can easily keep up with alerts coming from our home infrastructure).

Prometheus is going to fetch metrics for us from Node Exporter (running on my Docker host) and a Prometheus Blackbox container (set up to probe my site over HTTP(S)). After being fetched by Prometheus these metrics will then be sent to Mimir for storage.

Logs will be collected by Loki with some help from Promtail (to get logs from our host OS) and the Loki Docker driver plugin (to get logs from our docker containers).

All of this will be hosted on my home Ubuntu server (an HP MicroServer Gen8) and will run in containers provisioned by Docker Compose. I did also consider the managed offerings from both Grafana and AWS, but this setup offered more of a chance to understand the tooling, plus I know it’ll never cost me a cent (more than my server already costs me anyway) and I have total control. The big downside is that it introduces a massive single point of failure (my home server and network), but for this use case (a small static blog that’s mostly an excuse for some projects and learning exercises) I’m fine with that for now.

The Implementation

This was a multi-step process for me. With so many moving parts (6 containers, 1 bare metal server, a Docker plugin and a static site on AWS), rather than trying to get them all working at once I opted to get one thing working at a time and add more components as I went. Grafana was the first to be set up, and the rest was broken down into two categories: Metrics and Logs.

Starting with metrics, I got Prometheus up first. Then I set up Node Exporter on my host to send some data to Prometheus (or rather, for Prometheus to pull from Node Exporter). Once that was confirmed working I threw Mimir into the mix, updated Prometheus to send data over to it, and then added the Mimir datasource to Grafana to make sure it was all working as planned. After that, only the Prometheus Blackbox metrics remained, so I got that config written, spun up the container and had Prometheus politely ask Blackbox for the toddmurphy.me metrics. Then I went and grabbed this great looking stock dashboard, and I was off and away with a good looking dashboard displaying the status of my website and some basic metrics (we still need to make our own custom dashboard, which we’ll discuss further down).

Cool, so we’ve got metrics and have a basic idea of how our website is doing. Next is logging. For this I started with Loki, got a container up and then spun up Promtail to collect logs from my host OS and start populating Loki with data. Then, after adding the Loki datasource to Grafana, I could sift through my logs in the Grafana Explore page, fun stuff! Next I wanted to get logs from my containers, so I installed the Grafana Loki Docker plugin. After restarting Docker and force rebuilding my containers, I had their logs showing up in Grafana as well. Immediately I noticed some errors and was getting value from Loki, more on that below when we get to challenges and issues faced.
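Both datasources can be added through the Grafana UI. As a rough sketch of the same thing done with a provisioning file (an alternative, not what is described above), something like the following could be mounted into the Grafana container. The file name and mount path are hypothetical; the hostnames and ports match the compose file further down, and the Mimir URL assumes its Prometheus-compatible API is served under the /prometheus prefix.

```yaml
# Hypothetical file: ./config/grafana-datasources.yml
# Assumed mount point: /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus   # Mimir is queried through its Prometheus-compatible API
    access: proxy
    url: http://mimir:9009/prometheus
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

The upside of a file like this would be that the datasources survive a wipe of the Grafana data directory.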

So that’s the short human friendly version of how I got things running. Now this is the part where I assault your browser with the big wall of YAML that is my Docker Compose and configuration files.

Docker Compose

Pretty straightforward compose file, with comments added for anything that might not be self-explanatory.

version: "3.8"

networks:       # Not the most secure network setup but okay for my home network
  loki:         # Loki and all hosts it talks to
  prometheus:   # Prometheus and all hosts it talks to
  mimir:        # Mimir and all hosts it talks to

services:

  grafana:
    user: ${USER_ID}:${GROUP_ID} # Make sure to set these in ./.env
    volumes:
      - ${DATA_DIR}/grafana:/var/lib/grafana # And set DATA_DIR too
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    networks:
      - loki
      - prometheus
      - mimir

  prometheus:
    user: ${USER_ID}:${GROUP_ID}
    image: prom/prometheus:v2.44.0
    restart: unless-stopped
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/prometheus-targets.yml:/etc/prometheus/prometheus-targets.yml
      - ${DATA_DIR}/prometheus/data:/prometheus
    ports:
      - 9090:9090
    extra_hosts:
      - "host.docker.internal: ${HOST_IP}" # Fetching from host node exporter
    networks:
      - prometheus
      - mimir

  mimir: # Root needed - Some internal container files are owned by root 
    image: grafana/mimir:2.8.0
    restart: unless-stopped
    command: "-config.file=/etc/mimir.yaml"
    volumes:
      - ./config/mimir.yaml:/etc/mimir.yaml
      - ${DATA_DIR}/mimir/tmp:/tmp/mimir # TSDB location
    ports:
      - 9009:9009
    networks:
      - mimir

  loki:
    user: ${USER_ID}:${GROUP_ID}
    image: grafana/loki:2.8.0
    restart: unless-stopped
    command: -config.file=/etc/loki/loki.yaml
    ports:
      - "3100:3100"
    volumes: 
      - ./config/loki.yaml:/etc/loki/loki.yaml
      - ${DATA_DIR}/loki/data:/loki # Logging TSDB location
    networks:
      - loki

  promtail: # Run as root or not all /var/log files are accessible
    image: grafana/promtail:2.8.0
    restart: unless-stopped
    command: -config.file=/etc/promtail/config/promtail.yaml
    volumes:
      - ./config/promtail.yaml:/etc/promtail/config/promtail.yaml
      - /var/log:/opt/var/log # Allows promtail to read logs on localhost
    networks:
      - loki

  blackbox: # Run as root for ICMP to work
    image: prom/blackbox-exporter:v0.24.0
    restart: unless-stopped
    command: --config.file=/etc/blackbox-exporter/blackbox.yml
    volumes: 
      - ./config/blackbox.yml:/etc/blackbox-exporter/blackbox.yml
    ports:
      - 9115:9115
    extra_hosts: 
      - "host.docker.internal: ${HOST_IP}" # For checking node exporter status
    networks:
      - prometheus

Config

Mimir

A fairly standard Mimir config for running in Monolithic mode and using local file storage.

# Monolithic mode - per defaults (target: all)

# Disable multitenancy
multitenancy_enabled: false

# Store data on filesystem
blocks_storage:
  backend: filesystem
  bucket_store:
    sync_dir: /tmp/mimir/tsdb-sync
  filesystem:
    dir: /tmp/mimir/data/tsdb
  tsdb:
    dir: /tmp/mimir/tsdb

compactor:
  data_dir: /tmp/mimir/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist

ingester:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist
    replication_factor: 1

ruler_storage:
  backend: filesystem
  filesystem:
    dir: /tmp/mimir/rules

server:
  http_listen_port: 9009
  log_level: error

store_gateway:
  sharding_ring:
    replication_factor: 1

Loki

Loki config, also fairly standard for using local storage.

auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rule
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

Prometheus

Prometheus config for scraping metrics from my website and home server. One thing I’ll note is that the ICMP probe requires a bare IP or hostname, while the HTTP probe takes a full URL. As a result the ICMP results are stored with a slightly different instance label, so my dashboard variables in Grafana will need to account for this in any dashboard that displays both HTTP(S) and ICMP info (the stock dashboard doesn’t account for this, so ICMP data is missing from it).

remote_write:
  - url: http://mimir:9009/api/v1/push

scrape_configs:

- job_name: prometheus # Scrape metrics
  honor_labels: true
  static_configs:
  - targets:
    - localhost:9090 # Prometheus container
    - <host_os_address>:9100 # Host OS Node Exporter

- job_name: 'blackbox-http' # HTTP/HTTPS Test
  metrics_path: /probe
  params:
    module: [http_2xx]  
  static_configs:
      - targets:
        - https://toddmurphy.me
        - http://toddmurphy.me
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115  # The blackbox exporter's hostname:port.

- job_name: 'blackbox-tls' # TLS Test
  metrics_path: /probe
  params:
    module: [tls_connect]   
  static_configs:
      - targets:
        - toddmurphy.me:443 # The tcp prober expects host:port rather than a URL
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115  # The blackbox exporter's hostname:port.

- job_name: 'blackbox-dns' # DNS Test
  metrics_path: /probe
  params:
    module: [dns_udp]  
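  static_configs:
      - targets:
        - 1.1.1.1 # The dns prober's target is the DNS server to ask (example resolver, swap in your own); the name looked up is query_name in blackbox.yml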
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115  # The blackbox exporter's hostname:port.

- job_name: 'blackbox-icmp' # ICMP Test
  metrics_path: /probe
  params:
    module: [icmp]  
  static_configs:
      - targets:
        - toddmurphy.me
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115  # The blackbox exporter's hostname:port.
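One possible way to smooth over the label mismatch mentioned above would be an extra relabel rule that copies just the hostname into its own label. This is an untested sketch rather than part of my running config: the `site` label name is made up for illustration, and the same rule would need adding to each blackbox job.

```yaml
# Sketch only: the "site" rule is an addition, not part of the config above
- job_name: 'blackbox-http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://toddmurphy.me
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # Hypothetical extra label: strip any scheme, port and path so that
    # "https://toddmurphy.me" and "toddmurphy.me" both end up as site="toddmurphy.me"
    - source_labels: [__param_target]
      regex: '(?:https?://)?([^/:]+).*'
      target_label: site
      replacement: '$1'
    - target_label: __address__
      replacement: blackbox:9115  # The blackbox exporter's hostname:port.
```

A Grafana dashboard variable could then be built on `site` instead of `instance`, so HTTP(S) and ICMP panels select the same value.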

Promtail

Not much to say here, a pretty standard Promtail config.

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /opt/var/log/*log # Where we mounted the host os /var/log folder

Prometheus Blackbox

Blackbox had loads of options for its various modules, but the defaults were very sensible so I stuck with those for the most part, aside from needing to change from IPv6 to IPv4 due to limitations on my home network.

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false # My ISP has not enabled IPv6
  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  dns_udp:
    prober: dns
    timeout: 5s
    dns:
      query_name: "toddmurphy.me" # The name to resolve; the probe target is the DNS server to ask
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false

The Outcome

All of the above has got us to the stage where we’re now monitoring this static website and visualising our metrics in this great looking stock dashboard:

[Image: the stock dashboard]

There’s also now everything in place to start setting up alerts and building our own custom dashboard. We’ve already covered a lot today, so I’ll cover those things in more detail in another post. Short spoiler alert though: I have managed to start sending some alerts to Discord, and I plan to display some of our Loki logs as metrics on our custom dashboard.

Challenges faced and things learned

MinIO

So I haven’t touched on it in this post yet, but originally I planned to use MinIO to simulate having an S3 bucket hosted locally on my home server. This would have been the place for Mimir and Loki to store their data, as S3 buckets seem to be a common storage method for these in production. Unfortunately I was running into so many configuration issues, and it was adding so much complexity, that I decided to ditch it in favour of local filesystem storage. If I one day feel there’s an educational or technical reason to try the MinIO method again, I might look at it, but for now filesystem storage suits my home lab environment perfectly fine and allowed me to get up and running faster.
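For context, this is roughly the shape the Mimir storage section would take when pointed at an S3-compatible store such as MinIO. It is a sketch under assumptions and not a config I got working: the `minio` service name, port 9000, the bucket name and the credentials are all placeholders.

```yaml
# Sketch only: S3-style block storage for Mimir instead of the filesystem backend
blocks_storage:
  backend: s3
  s3:
    endpoint: minio:9000          # Hypothetical MinIO service name and API port
    bucket_name: mimir-blocks     # Hypothetical bucket name
    access_key_id: CHANGE_ME      # Placeholder credentials
    secret_access_key: CHANGE_ME
    insecure: true                # Plain HTTP inside the compose network
```

Compared with the filesystem version earlier in this post, that’s an extra container, a bucket and a set of credentials to manage, which is the complexity I decided to skip for now.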

Communicating with the host

A little trick I hadn’t come across before was this extra_hosts option:

  prometheus:
    # ...
    extra_hosts:
      - "host.docker.internal: ${HOST_IP}" # Fetching from host node exporter
    # ...

Adding this creates a hosts entry inside the container that maps host.docker.internal to ${HOST_IP}, so the container can reach the host on the local network. This was critical for getting Node Exporter’s metrics from the host to where I needed them.
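With that entry in place, a scrape target inside the container can refer to the host by name. A minimal sketch, assuming Node Exporter listens on its default port 9100 on the host (the job name here is just for illustration):

```yaml
scrape_configs:
  - job_name: node            # Hypothetical job name
    static_configs:
      - targets:
          # Resolves to ${HOST_IP} through the extra_hosts entry
          - host.docker.internal:9100
```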

Sometimes containers need to run as root, argh!

Something I kept coming up against with this project was containers not working correctly, for one reason or another, when not run as root. Loki was a lifesaver here, as there were containers I actually thought were working fine, but upon inspecting my errors in Loki I found little things not quite right. For instance, Promtail wasn’t able to read all of the host’s /var/log files without root (this was just straight up an oversight by me, I know) and, a more surprising one, the Mimir container has files baked in that are owned by root. This one I was able to confirm by finding a discussion on GitHub about the same problem, but it was nonetheless an annoying surprise to me.

Really this isn’t the end of the world, just more a pet peeve of mine. I swapped to root and for the most part things are fine. It makes a good argument for moving to Podman though, which was designed to solve problems like this, but that’s a project and discussion for another day.
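One small mitigation when a container has to run as root is to mount host paths read-only. A sketch of what that could look like for Promtail, which only needs to read the logs (the `:ro` suffix is the only change from the volume line in my compose file, and I haven’t tested it in this setup):

```yaml
services:
  promtail:
    image: grafana/promtail:2.8.0
    volumes:
      - ./config/promtail.yaml:/etc/promtail/config/promtail.yaml:ro
      - /var/log:/opt/var/log:ro # Read-only: Promtail reads host logs but cannot modify them
```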

/etc/docker/daemon.json didn’t exist by default

The Loki setup docs said to “modify” /etc/docker/daemon.json, which was a bit hard to do when this file didn’t exist. At first I worried it was hiding somewhere else, but fortunately the answer was far simpler: Docker doesn’t create daemon.json by default. I just needed to create it and put the Docker plugin settings in there, then I was good to move on to the next step of the Loki Docker plugin setup.
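The daemon.json route makes the Loki driver the default for every container. For comparison, Compose can also set the logging driver per service. This is a sketch of that alternative rather than what I did, assuming the plugin was installed under the alias `loki` and that Loki is reachable from the host on the published port 3100:

```yaml
services:
  grafana:
    image: grafana/grafana:latest
    logging:
      driver: loki  # Assumes the plugin alias is "loki"
      options:
        # The driver runs on the host, so it talks to Loki via the published port
        loki-url: "http://localhost:3100/loki/api/v1/push"
```

This keeps the choice visible in the compose file, at the cost of repeating the block for each service.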

What’s next?

So to quickly recap part 1, we’ve built out the skeleton of our monitoring system and have some basic data displayed on a Grafana dashboard. There’s still plenty more to do though so in future posts in this monitoring series I plan to explore:

  • Designing smart alerts that trigger when needed but don’t spam our notifications.
  • Building a custom dashboard.
  • Making use of our Loki data to do some fancy things, beyond just using it as a way to explore the raw logs from within Grafana.