Monitoring - Part 1: Services
In my last post, I ended with a look into the future at what tasks were my top priority for this website. One of these was setting up monitoring, which I will begin exploring today. So let's just jump into it, starting with…
The Objective
Broadly speaking, I want to know: is my website online, is it going to stay online, and how well is it performing? This requires data (metrics and logs) and an interface to read, structure, alert on and interpret that data.
In the corporate world I would want to do things like run reports on this data to determine my SLIs, compare where they're at in relation to SLOs and SLAs, measure my errors against an error budget and so on. That would be a bit much for my personal website, although it would make a great blog post 😉 so I'll probably attempt it at some point in the future anyway.
The Environment
In the welcome post I explained a little about my environment, so go check that out for more information if you want. In short, the static website we will be monitoring lives in an S3 bucket and is served up by AWS CloudFront.
Monitoring will be done from my personal home server located in my home network, but more on that below when we get to the plan.
The Plan
So here’s what I ended up deciding on.
Grafana will be our dashboard, among many other things. It's an excellent open source project that's super flexible and lets us work with our metrics and logs in all sorts of ways. It's also going to manage our alerts, which I will be sending to Discord (chosen because it's free and my wife already uses the platform, so she can easily keep up with alerts coming from our home infrastructure).
Prometheus is going to fetch metrics for us from Node Exporter (running on my Docker host) and a Prometheus Blackbox exporter container (set up to probe my site over HTTP(S), among other things). After being fetched by Prometheus, these metrics will then be sent to Mimir for storage.
Logs will be collected by Loki with some help from Promtail (to get logs from our host OS) and the Loki Docker driver plugin (to get logs from our docker containers).
All of this will be hosted on my home Ubuntu server (an HP Microserver Gen 8) and will run in containers provisioned by docker compose. I did also consider the managed offerings from both Grafana and AWS, but this setup offered more of a chance to understand the tooling, plus I know it'll never cost me a cent (more than my server already costs me anyway) and I have total control. The big downside is that it introduces a massive single point of failure (my home server and network), but for this use case (a small static blog that's mostly an excuse for some projects and learning exercises) I'm fine with that for now.
The Implementation
This was a several-step process for me. With so many moving parts (6 containers, 1 bare metal server, a Docker plugin and a static site on AWS), rather than getting them all working at once I opted to get one thing working at a time and keep adding components as I went. Grafana was the first to be set up, and the rest was broken down into two categories: metrics and logs.
Starting with metrics, I got Prometheus up first. Then I set up Node Exporter on my host to send some data to Prometheus (or rather, for Prometheus to pull from Node Exporter). Once that was confirmed working I threw Mimir into the mix, updated Prometheus to send data over to it, and added the Mimir datasource in Grafana to make sure it was all working as planned. After that, only the Prometheus Blackbox metrics remained, so I got that config written, spun up the container and had Prometheus politely ask Blackbox for the toddmurphy.me metrics. Then I went and grabbed a great looking stock dashboard and I was off and away, with a dashboard displaying the status of my website and some basic metrics (we still need to make our own custom dashboard; we'll discuss that further down).
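For anyone who prefers config over clicking, the datasource step can also be done with Grafana's provisioning files. This is just a sketch of what that might look like for the Mimir datasource; the file name and mount path are my own assumptions (they're not in my compose file below), and it relies on Mimir exposing its Prometheus-compatible API under the default /prometheus prefix.

# Hypothetical config/grafana-datasources.yaml, mounted into
# /etc/grafana/provisioning/datasources/ inside the grafana container
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:9009/prometheus # Mimir's Prometheus-compatible query API
    isDefault: true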
Cool, so we've got metrics and have a basic idea of how our website is doing. Next is logging. For this I started with Loki: got a container up, then spun up Promtail to collect logs from my host OS and start populating Loki with data. After adding the Loki datasource to Grafana, I could sift through my logs in the Grafana Explore page, fun stuff! Next I wanted to get logs from my containers, so I installed the Grafana Loki Docker plugin; after restarting Docker and force-rebuilding my containers, I had their logs showing up and visible in Grafana as well. Immediately I noticed some errors and was already getting value from Loki - more on that below when we get to challenges and issues faced.
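If you went the provisioning route above, the Loki datasource could be added to that same hypothetical file in exactly the same way:

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100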
So that's the short, human friendly version of how I got things running. Now this is the part where I assault your browser with the big wall of YAML that is my docker compose and configuration files.
Docker Compose
Pretty straightforward compose file; I've added comments for anything that might not be self-explanatory.
version: "3.8"
networks: # Not the most secure network setup but okay for my home network
loki: # Loki and all hosts it talks to
prometheus: # Prometheus and all hosts it talks to
mimir: # Mimir and all hosts it talks to
services:
grafana:
user: ${USER_ID}:${GROUP_ID} # Make sure to set these in /.env
volumes:
- ${DATA_DIR}/grafana:/var/lib/grafana # And set DATA_DIR too
image: grafana/grafana:latest
restart: unless-stopped
ports:
- "3000:3000"
networks:
- loki
- prometheus
- mimir
prometheus:
user: ${USER_ID}:${GROUP_ID}
image: prom/prometheus:v2.44.0
restart: unless-stopped
volumes:
- ./config/prometheus.yml:/etc/prometheus/prometheus.yml
- ./config/prometheus-targets.yml:/etc/prometheus/prometheus-targets.yml
- ${DATA_DIR}/prometheus/data:/prometheus
ports:
- 9090:9090
extra_hosts:
- "host.docker.internal: ${HOST_IP}" # Fetching from host node exporter
networks:
- prometheus
- mimir
mimir: # Root needed - Some internal container files are owned by root
image: grafana/mimir:2.8.0
restart: unless-stopped
command: "-config.file=/etc/mimir.yaml"
volumes:
- ./config/mimir.yaml:/etc/mimir.yaml
- ${DATA_DIR}/mimir/tmp:/tmp/mimir # TSDB location
ports:
- 9009:9009
networks:
- mimir
loki:
user: ${USER_ID}:${GROUP_ID}
image: grafana/loki:2.8.0
restart: unless-stopped
command: -config.file=/etc/loki/loki.yaml
ports:
- "3100:3100"
volumes:
- ./config/loki.yaml:/etc/loki/loki.yaml
- ${DATA_DIR}/loki/data:/loki # Logging TSDB location
networks:
- loki
promtail: #Run as root or not all /var/log files are accessible
image: grafana/promtail:2.8.0
restart: unless-stopped
command: -config.file=/etc/promtail/config/promtail.yaml
volumes:
- ./config/promtail.yaml:/etc/promtail/config/promtail.yaml
- /var/log:/opt/var/log #Allows promtail to read logs on localhost
networks:
- loki
blackbox: # Run as root for ICMP to work
image: prom/blackbox-exporter:v0.24.0
restart: unless-stopped
command: --config.file=/etc/blackbox-exporter/blackbox.yml
volumes:
- ./config/blackbox.yml:/etc/blackbox-exporter/blackbox.yml
ports:
- 9115:9115
extra_hosts:
- "host.docker.internal: ${HOST_IP}" # For checking node exporter status
networks:
- prometheus
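The compose file leans on a handful of variables that docker compose reads from a .env file sitting next to it. The exact values depend on your own host; the below is just the shape of it, with placeholder values.

# .env - example values only, adjust for your own host
USER_ID=1000
GROUP_ID=1000
DATA_DIR=/opt/monitoring
HOST_IP=192.168.1.10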
Config
Mimir
A fairly standard Mimir config for running in Monolithic mode and using local file storage.
# Monolithic mode - per defaults (target: all)
# Disable multitenancy
multitenancy_enabled: false
# Store data on filesystem
blocks_storage:
  backend: filesystem
  bucket_store:
    sync_dir: /tmp/mimir/tsdb-sync
  filesystem:
    dir: /tmp/mimir/data/tsdb
  tsdb:
    dir: /tmp/mimir/tsdb
compactor:
  data_dir: /tmp/mimir/compactor
  sharding_ring:
    kvstore:
      store: memberlist
distributor:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist
ingester:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist
    replication_factor: 1
ruler_storage:
  backend: filesystem
  filesystem:
    dir: /tmp/mimir/rules
server:
  http_listen_port: 9009
  log_level: error
store_gateway:
  sharding_ring:
    replication_factor: 1
Loki
Loki config, also fairly standard for using local storage.
auth_enabled: false
server:
  http_listen_port: 3100
common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rule
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
ruler:
  alertmanager_url: http://localhost:9093
Prometheus
Prometheus config for scraping metrics from my website and home server. One thing I'll note is that the ICMP module requires a bare IP or hostname, while the other modules were fine with a full web address. As a result the ICMP results are stored with a slightly different instance label, so my dashboard variables in Grafana will need to account for this in any dashboard that displays both HTTP(S) info and ICMP info (the stock dashboard doesn't account for it, so ICMP data is missing there).
remote_write:
  - url: http://mimir:9009/api/v1/push
scrape_configs:
  - job_name: prometheus # Scrape metrics
    honor_labels: true
    static_configs:
      - targets:
          - localhost:9090 # Prometheus container
          - <host_os_address>:9100 # Host OS Node Exporter
  - job_name: 'blackbox-http' # HTTP/HTTPS Test
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://toddmurphy.me
          - http://toddmurphy.me
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115 # The blackbox exporter's hostname:port.
  - job_name: 'blackbox-tls' # TLS Test
    metrics_path: /probe
    params:
      module: [tls_connect]
    static_configs:
      - targets:
          - https://toddmurphy.me
          - http://toddmurphy.me
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115 # The blackbox exporter's hostname:port.
  - job_name: 'blackbox-dns' # DNS Test
    metrics_path: /probe
    params:
      module: [dns_udp]
    static_configs:
      - targets:
          - https://toddmurphy.me
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115 # The blackbox exporter's hostname:port.
  - job_name: 'blackbox-icmp' # ICMP Test
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - toddmurphy.me
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115 # The blackbox exporter's hostname:port.
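As mentioned above, the ICMP target has no scheme, so its instance label doesn't line up with the HTTP(S) ones. One way I could smooth that over later is an extra relabel rule appended to each blackbox job's relabel_configs that strips the scheme into a common label. This isn't in my running config yet, just a sketch of the idea:

# Not in my config yet - strips any scheme so https://toddmurphy.me and
# toddmurphy.me both end up as site="toddmurphy.me"
- source_labels: [__param_target]
  regex: "(?:https?://)?(.*)"
  target_label: site
  replacement: "${1}"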
Promtail
Not much to say here, a pretty standard Promtail config.
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /opt/var/log/*log # Where we mounted the host OS /var/log folder
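One tweak I might make later: the *log glob above only matches files sitting directly in /var/log, so anything in a subdirectory (say /var/log/nginx/) gets missed. Promtail's path matching supports doublestar-style globs, so a second job along these lines should pick those up - treat it as an untested sketch rather than something from my config:

  - job_name: system-nested # Hypothetical extra job, not in my config
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs-nested
          __path__: /opt/var/log/**/*.log # Recurse into subdirectories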
Prometheus Blackbox
Blackbox had loads of options for its various modules, but the defaults were very sensible so I stuck with those for the most part, aside from needing to change from IPv6 to IPv4 due to limitations on my home network.
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # Defaults to "ip6"
      ip_protocol_fallback: false # My ISP has not enabled IPv6
  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  dns_udp:
    prober: dns
    timeout: 5s
    dns:
      query_name: "dns"
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false
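To give a sense of those extra options, here's a hypothetical stricter HTTP module that could sit alongside http_2xx. The fail_if_not_ssl and valid_status_codes options are part of blackbox's HTTP prober, but this particular module isn't something I'm running:

  http_2xx_strict: # Hypothetical module, not in my running config
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false
      fail_if_not_ssl: true # Fail any target that isn't served over TLS
      valid_status_codes: [200] # Only a 200 counts as a pass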
The Outcome
All of the above has gotten us to the stage where we're now monitoring this static website and visualising our metrics in this great looking stock dashboard:
There's also now everything in place to start setting up alerts and building our own custom dashboard. We've already covered a lot today so I'll go into those in more detail in another post, but as a short spoiler: I have managed to start sending some alerts to Discord, and I plan to display some of our Loki logs as metrics on our custom dashboard.
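To give a taste of where the alerting is heading, the core signal is blackbox's probe_success metric. Written here as a Prometheus-style rule just for brevity (in practice the equivalent condition becomes a Grafana-managed alert against the Mimir datasource), the sketch looks something like this - thresholds and labels are placeholders rather than my final values:

groups:
  - name: website
    rules:
      - alert: WebsiteDown
        expr: probe_success{job="blackbox-http", instance="https://toddmurphy.me"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "toddmurphy.me has failed its HTTP probe for 5 minutes"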
Challenges faced and things learned
MinIO
So I didn't touch on it in this post, but originally I planned to use MinIO to simulate having an S3 bucket hosted locally on my home server. This would have been useful as the place for Mimir and Loki to store their data, since S3 buckets seem to be a common storage backend for both in production. Unfortunately I was running into so many configuration issues, and it was adding so much complexity, that I decided to ditch it in favour of plain local filesystem storage. If one day I feel there's an educational or technical reason to try the MinIO approach again I might revisit it, but for now filesystem storage suits my home lab environment perfectly fine and let me get up and running faster.
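For the curious, the Mimir side of what I was attempting looked roughly like the below: swapping the filesystem backend for S3 pointed at a local MinIO container. The endpoint, bucket name and credentials here are placeholders, and this is a sketch of the approach rather than a working config:

blocks_storage:
  backend: s3
  s3:
    endpoint: minio:9000 # Hypothetical MinIO container
    bucket_name: mimir-blocks
    access_key_id: <minio_access_key>
    secret_access_key: <minio_secret_key>
    insecure: true # Plain HTTP inside the home network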
Communicating with the host
A little trick I hadn't come across before was this extra_hosts option:
prometheus:
  # ...
  extra_hosts:
    - "host.docker.internal: ${HOST_IP}" # Fetching from host node exporter
  # ...
Adding this allows the container to reach the host at ${HOST_IP} on the local network via the hostname host.docker.internal, which was critical for getting metrics from the host's Node Exporter to where I needed them.
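Worth noting: on Docker Engine 20.10 and newer there's a special host-gateway value that Docker resolves to the host for you, which could replace the hard-coded ${HOST_IP} variable entirely. I haven't swapped to it here, but it would look like this:

    extra_hosts:
      - "host.docker.internal:host-gateway" # Docker 20.10+ resolves this automatically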
Sometimes containers need to run as root, argh!
Something I kept coming up against with this project was containers not working correctly, for one reason or another, when not run as root. Loki was a lifesaver here, as there were containers I actually thought were working fine, but upon inspecting my errors in Loki I found little things not quite right. For instance, Promtail wasn't able to read all of the host's /var/log files without root (that one was just straight up an oversight by me, I know), and a more surprising one: the Mimir container has files baked in that are owned by root. That one I was able to confirm by finding a discussion on GitHub about the same problem, but it was nonetheless an annoying surprise.
Really this isn't the end of the world, more a pet peeve of mine; I swapped to root and for the most part things are fine. It does make a good argument for moving to Podman, which was designed to solve problems like this, but that's a project and discussion for another day.
/etc/docker/daemon.json didn’t exist by default
The Loki setup doco said to "modify" /etc/docker/daemon.json - which was a bit hard to do when the file didn't exist. At first I worried it was hiding somewhere else, but fortunately the answer was far simpler: Docker doesn't create daemon.json by default, so I just needed to create it and put the Docker plugin settings in there, and then I was good to move on to the next step of the Loki Docker plugin setup.
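For reference, the plugin itself is installed with docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions, and what goes into daemon.json is roughly the below (a sketch based on the plugin's documentation; the URL assumes Loki's published port is reachable on localhost, so adjust it for wherever your Loki lives):

{
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "http://localhost:3100/loki/api/v1/push",
    "loki-batch-size": "400"
  }
}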
What’s next?
So to quickly recap part 1, we’ve built out the skeleton of our monitoring system and have some basic data displayed on a Grafana dashboard. There’s still plenty more to do though so in future posts in this monitoring series I plan to explore:
- Designing smart alerts that trigger when needed but don’t spam our notifications.
- Building a custom dashboard.
- Making use of our Loki data to do some fancy things, beyond just using it as a way to explore the raw logs from within Grafana.