Building Observability with Prometheus and Grafana
How I built a monitoring stack with Prometheus metrics, Grafana dashboards, and Discord alerting to maintain visibility across my infrastructure.
Running infrastructure without monitoring is like driving without a dashboard. You might get where you're going, but you won't know if you're about to run out of gas until it happens. Early in my homelab journey, I learned this lesson the hard way.
The Wake-Up Call
I was running several services across multiple VMs when I noticed my media server was sluggish. Investigation revealed the host had been at 95% disk capacity for weeks—slowly filling up from log files I'd forgotten to rotate. By the time I noticed, users were complaining and I was scrambling to free up space.
That incident taught me: if you're running services, you need visibility.
The Monitoring Stack
My observability platform has three core components:
Prometheus: The Metrics Engine
Prometheus scrapes metrics from exporters running on each host. At a configurable scrape interval (commonly 15 seconds), it pulls data points such as:
- CPU utilization, load averages
- Memory usage and swap activity
- Disk space, I/O throughput
- Network traffic, connection states
- Custom application metrics
The time-series database lets me query historical data and understand trends, not just current state.
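The scrape setup can be sketched in a minimal `prometheus.yml`. The host names and job layout here are placeholders, not my actual inventory; the pattern is one `node_exporter` target per host plus Prometheus scraping itself:

```yaml
# prometheus.yml (sketch) -- host names are placeholders
global:
  scrape_interval: 15s          # how often targets are pulled

scrape_configs:
  - job_name: "node"            # node_exporter on each VM
    static_configs:
      - targets:
          - "vm-media:9100"     # hypothetical hosts
          - "vm-apps:9100"

  - job_name: "prometheus"      # Prometheus monitors itself too
    static_configs:
      - targets: ["localhost:9090"]
```

New hosts only need an exporter installed and a target line added here (or discovered via service discovery in larger setups).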
Grafana: The Visualization Layer
Grafana turns raw metrics into understandable dashboards. I have views for:
- Infrastructure overview: All hosts at a glance
- Per-service deep dives: Detailed metrics for critical applications
- Capacity planning: Storage trends, resource utilization over time
- Network topology: Traffic patterns and connectivity
A good dashboard tells a story. When something goes wrong, I can see not just that it's broken, but the context around why.
Discord Alerts: The Notification Layer
Prometheus Alertmanager routes alerts to Discord via webhooks. When something goes wrong—disk space low, service down, high CPU for extended periods—I get a notification in a dedicated channel.
The key is making alerts actionable. Every alert should mean "something needs attention." If I start ignoring alerts, the system isn't working.
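The routing side can be sketched in an Alertmanager config. Recent Alertmanager releases (0.25+) support Discord natively via `discord_configs`; older setups need a webhook translation bridge in between. The webhook URL is a placeholder:

```yaml
# alertmanager.yml (sketch) -- webhook URL is a placeholder, keep it secret
route:
  receiver: discord
  group_by: ["alertname", "instance"]  # one message per alert+host
  repeat_interval: 4h                  # re-notify while still firing

receivers:
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/..."
```

Grouping by alert name and instance keeps one flapping host from flooding the channel with duplicate messages.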
What I Monitor
Infrastructure Health
- CPU and memory: Are hosts under pressure?
- Disk utilization: Am I running out of space?
- System load: Is the machine keeping up with demand?
- Uptime: Has something crashed and restarted?
Service Health
- Process status: Is the service actually running?
- Response times: Is it responding quickly?
- Error rates: Are requests failing?
- Custom metrics: Application-specific indicators
Network Health
- Interface throughput: How much traffic is flowing?
- Connection counts: Are we handling expected load?
- Latency: Is the network performing well?
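Most of the signals above come from standard `node_exporter` metrics. A few representative PromQL queries, as I'd chart them in Grafana (a sketch, not my exact dashboard queries):

```promql
# CPU: percent busy per host, averaged over 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory: fraction of RAM in use
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Network: received throughput per interface, in bytes/second
rate(node_network_receive_bytes_total[5m])
```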
Alert Philosophy
Not every metric needs an alert. My approach:
Alert on Symptoms, Not Causes
I alert on "disk space below 10%" rather than "specific process writing too many logs." The symptom is what matters; I can investigate the cause when I respond.
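A symptom-based rule like that looks roughly like this in Prometheus alerting-rule form. The filesystem filter and 10-minute hold are illustrative choices; the point is that the expression tests free space, not any particular cause:

```yaml
# rules/disk.yml (sketch)
groups:
  - name: disk
    rules:
      - alert: DiskSpaceLow
        # Symptom: under 10% free, sustained for 10 minutes
        # (filters out ephemeral tmpfs/overlay mounts)
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 10m
        labels:
          severity: warning
```

The `for: 10m` clause keeps a brief spike (say, a large temporary file) from paging me; only a sustained condition fires.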
Avoid Alert Fatigue
Every alert that doesn't require action is a liability. It trains me to ignore alerts, which means I'll miss the real ones. I regularly review and tune alert thresholds.
Give Context
Alerts include relevant information: which host, what metric, what the current value is, and what the threshold is. I shouldn't have to log in just to understand what's happening.
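That context lives in the rule's annotations. Prometheus templates the host, mountpoint, and current value into the message, so the Discord notification is self-explanatory (a sketch, continuing the hypothetical disk-space rule above):

```yaml
# Annotations on an alert rule (sketch): templated context
annotations:
  summary: "Low disk space on {{ $labels.instance }}"
  description: >-
    {{ $labels.mountpoint }} has {{ $value | humanizePercentage }}
    free (threshold: 10%).
```

`humanizePercentage` is one of Prometheus's built-in template functions; it turns the raw ratio into a readable figure.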
The Workflow
When an alert fires:
- Notification arrives in Discord with context
- I triage the severity and potential impact
- I investigate using Grafana dashboards for more context
- I resolve the issue or acknowledge if it's known
- I document if it revealed a gap in monitoring or configuration
This workflow has prevented several potential outages by catching issues before they became user-facing problems.
Lessons Learned
Building this monitoring stack taught me:
- Visibility enables confidence - I can make changes knowing I'll see the impact immediately
- Alerting is harder than monitoring - Collecting metrics is easy; knowing what to alert on takes iteration
- Dashboards tell stories - A good visualization helps you understand systems, not just stare at numbers
- Monitoring is ongoing - New services need new exporters; changing systems need updated dashboards
What's Next
My monitoring continues to evolve:
- Adding application-level metrics for custom services
- Improving alert routing and escalation
- Building capacity forecasting based on trends
- Documenting monitoring coverage to identify gaps
If you're running any kind of infrastructure, even a small homelab, invest in monitoring early. The visibility you gain will save you hours of debugging and give you confidence to experiment and grow.