Building Observability with Prometheus and Grafana
How I built a monitoring stack with Prometheus metrics, Grafana dashboards, and Discord alerting to maintain visibility across my infrastructure.
Running infrastructure without monitoring is like driving without a dashboard. You might get where you're going, but you won't know if you're about to run out of gas until it happens. Early in my homelab journey, I learned this lesson the hard way.
The Wake-Up Call
I was running several services across multiple VMs when I noticed my media server was sluggish. Investigation revealed the host had been at 95% disk capacity for weeks—slowly filling up from log files I'd forgotten to rotate. By the time I noticed, users were complaining and I was scrambling to free up space.
That incident taught me: if you're running services, you need visibility.
The Monitoring Stack
My observability platform has three core components:
Prometheus: The Metrics Engine
Prometheus scrapes metrics from exporters running on each host. At a configurable scrape interval (commonly 15 seconds), it pulls data points such as:
- CPU utilization, load averages
- Memory usage and swap activity
- Disk space, I/O throughput
- Network traffic, connection states
- Custom application metrics
The time-series database lets me query historical data and understand trends, not just current state.
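The scrape setup can be sketched in a minimal `prometheus.yml`. The host names and job layout here are placeholders, not my actual inventory; the pattern is one `node_exporter` target per host plus Prometheus scraping itself:

```yaml
# prometheus.yml (sketch) -- host names are placeholders
global:
  scrape_interval: 15s          # how often targets are pulled

scrape_configs:
  - job_name: "node"            # node_exporter on each VM
    static_configs:
      - targets:
          - "vm-media:9100"     # hypothetical hosts
          - "vm-apps:9100"

  - job_name: "prometheus"      # Prometheus monitors itself too
    static_configs:
      - targets: ["localhost:9090"]
```

New hosts only need an exporter installed and a target line added here (or discovered via service discovery in larger setups).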
Grafana: The Visualization Layer
Grafana turns raw metrics into understandable dashboards. I have views for:
- Infrastructure overview: All hosts at a glance
- Per-service deep dives: Detailed metrics for critical applications
- Capacity planning: Storage trends, resource utilization over time
- Network topology: Traffic patterns and connectivity
A good dashboard tells a story. When something goes wrong, I can see not just that it's broken, but the context around why.
Discord Alerts: The Notification Layer
Prometheus Alertmanager routes alerts to Discord via webhooks. When something goes wrong—disk space low, service down, high CPU for extended periods—I get a notification in a dedicated channel.
The key is making alerts actionable. Every alert should mean "something needs attention." If I start ignoring alerts, the system isn't working.
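The routing side can be sketched in an Alertmanager config. Recent Alertmanager releases (0.25+) support Discord natively via `discord_configs`; older setups need a webhook translation bridge in between. The webhook URL is a placeholder:

```yaml
# alertmanager.yml (sketch) -- webhook URL is a placeholder, keep it secret
route:
  receiver: discord
  group_by: ["alertname", "instance"]  # one message per alert+host
  repeat_interval: 4h                  # re-notify while still firing

receivers:
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/..."
```

Grouping by alert name and instance keeps one flapping host from flooding the channel with duplicate messages.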
What I Monitor
Infrastructure Health
- CPU and memory: Are hosts under pressure?
- Disk utilization: Am I running out of space?
- System load: Is the machine keeping up with demand?
- Uptime: Has something crashed and restarted?
Service Health
- Process status: Is the service actually running?
- Response times: Is it responding quickly?
- Error rates: Are requests failing?
- Custom metrics: Application-specific indicators
Network Health
- Interface throughput: How much traffic is flowing?
- Connection counts: Are we handling expected load?
- Latency: Is the network performing well?
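Most of the signals above come from standard `node_exporter` metrics. A few representative PromQL queries, as I'd chart them in Grafana (a sketch, not my exact dashboard queries):

```promql
# CPU: percent busy per host, averaged over 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory: fraction of RAM in use
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Network: received throughput per interface, in bytes/second
rate(node_network_receive_bytes_total[5m])
```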
Alert Philosophy
Not every metric needs an alert. My approach:
Alert on Symptoms, Not Causes
I alert on "disk space below 10%" rather than "specific process writing too many logs." The symptom is what matters; I can investigate the cause when I respond.
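A symptom-based rule like that looks roughly like this in Prometheus alerting-rule form. The filesystem filter and 10-minute hold are illustrative choices; the point is that the expression tests free space, not any particular cause:

```yaml
# rules/disk.yml (sketch)
groups:
  - name: disk
    rules:
      - alert: DiskSpaceLow
        # Symptom: under 10% free, sustained for 10 minutes
        # (filters out ephemeral tmpfs/overlay mounts)
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 10m
        labels:
          severity: warning
```

The `for: 10m` clause keeps a brief spike (say, a large temporary file) from paging me; only a sustained condition fires.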
Avoid Alert Fatigue
Every alert that doesn't require action is a liability. It trains me to ignore alerts, which means I'll miss the real ones. I regularly review and tune alert thresholds.
Give Context
Alerts include relevant information: which host, what metric, what the current value is, and what the threshold is. I shouldn't have to log in just to understand what's happening.
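That context lives in the rule's annotations. Prometheus templates the host, mountpoint, and current value into the message, so the Discord notification is self-explanatory (a sketch, continuing the hypothetical disk-space rule above):

```yaml
# Annotations on an alert rule (sketch): templated context
annotations:
  summary: "Low disk space on {{ $labels.instance }}"
  description: >-
    {{ $labels.mountpoint }} has {{ $value | humanizePercentage }}
    free (threshold: 10%).
```

`humanizePercentage` is one of Prometheus's built-in template functions; it turns the raw ratio into a readable figure.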
The Workflow
When an alert fires:
- Notification arrives in Discord with context
- I triage the severity and potential impact
- I investigate using Grafana dashboards for more context
- I resolve the issue or acknowledge if it's known
- I document if it revealed a gap in monitoring or configuration
This workflow has prevented several potential outages by catching issues before they became user-facing problems.
Lessons Learned
Building this monitoring stack taught me:
- Visibility enables confidence - I can make changes knowing I'll see the impact immediately
- Alerting is harder than monitoring - Collecting metrics is easy; knowing what to alert on takes iteration
- Dashboards tell stories - A good visualization helps you understand systems, not just stare at numbers
- Monitoring is ongoing - New services need new exporters; changing systems need updated dashboards
What's Next
My monitoring continues to evolve:
- Adding application-level metrics for custom services
- Improving alert routing and escalation
- Building capacity forecasting based on trends
- Documenting monitoring coverage to identify gaps
If you're running any kind of infrastructure, even a small homelab, invest in monitoring early. The visibility you gain will save you hours of debugging and give you confidence to experiment and grow.