
Building Observability with Prometheus and Grafana

How I built a monitoring stack with Prometheus metrics, Grafana dashboards, and Discord alerting to maintain visibility across my infrastructure.

Prometheus · Grafana · Monitoring · DevOps

Running infrastructure without monitoring is like driving without a dashboard. You might get where you're going, but you won't know if you're about to run out of gas until it happens. Early in my homelab journey, I learned this lesson the hard way.

The Wake-Up Call

I was running several services across multiple VMs when I noticed my media server was sluggish. Investigation revealed the host had been at 95% disk capacity for weeks—slowly filling up from log files I'd forgotten to rotate. By the time I noticed, users were complaining and I was scrambling to free up space.

That incident taught me: if you're running services, you need visibility.

The Monitoring Stack

My observability platform has three core components:

Prometheus: The Metrics Engine

Prometheus scrapes metrics from exporters running on each host. On a regular interval (15 seconds by default), it pulls data points:

  • CPU utilization, load averages
  • Memory usage and swap activity
  • Disk space, I/O throughput
  • Network traffic, connection states
  • Custom application metrics
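
A minimal scrape configuration for a setup like this might look as follows. This is a sketch, not my exact config; the hostnames and job names are placeholders, though 9100 is node_exporter's default port:

```yaml
# prometheus.yml — sketch of a scrape configuration (hostnames are examples)
global:
  scrape_interval: 15s        # how often Prometheus pulls from each target

scrape_configs:
  - job_name: "node"          # node_exporter running on each host
    static_configs:
      - targets:
          - "media-server:9100"
          - "vm-host:9100"
  - job_name: "prometheus"    # Prometheus also monitors itself
    static_configs:
      - targets: ["localhost:9090"]
```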

The time-series database lets me query historical data and understand trends, not just current state.
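
Two PromQL queries illustrate the difference between current state and trends, using standard node_exporter metric names. The first shows per-host CPU usage averaged over the last five minutes; the second extrapolates six hours of disk history four days into the future, which is exactly the kind of question that would have caught my log-rotation incident early:

```promql
# Average CPU utilization (%) per host over the last 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Projected free bytes on the root filesystem 4 days from now,
# based on the last 6 hours of trend
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 86400)
```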

Grafana: The Visualization Layer

Grafana turns raw metrics into understandable dashboards. I have views for:

  • Infrastructure overview: All hosts at a glance
  • Per-service deep dives: Detailed metrics for critical applications
  • Capacity planning: Storage trends, resource utilization over time
  • Network topology: Traffic patterns and connectivity

A good dashboard tells a story. When something goes wrong, I can see not just that it's broken, but the context around why.
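
To keep dashboards reproducible, Grafana can provision its Prometheus data source from a file instead of manual clicks. A sketch of that provisioning file, with the URL assumed to be wherever Prometheus runs in your setup:

```yaml
# grafana/provisioning/datasources/prometheus.yml — data source provisioning sketch
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana server proxies queries to Prometheus
    url: http://prometheus:9090    # adjust to your Prometheus address
    isDefault: true
```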

Discord Alerts: The Notification Layer

Prometheus Alertmanager routes alerts to Discord via webhooks. When something goes wrong—disk space low, service down, high CPU for extended periods—I get a notification in a dedicated channel.

The key is making alerts actionable. Every alert should mean "something needs attention." If I start ignoring alerts, the system isn't working.
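
A sketch of the Alertmanager side, assuming a reasonably recent Alertmanager (native discord_configs landed in v0.25); the webhook URL is redacted and the grouping values are examples, not recommendations:

```yaml
# alertmanager.yml — routing alerts to a Discord channel (sketch)
route:
  receiver: discord
  group_by: [alertname, instance]
  group_wait: 30s        # batch related alerts before the first notification
  repeat_interval: 4h    # re-notify if an alert is still firing

receivers:
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/REDACTED"
```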

What I Monitor

Infrastructure Health

  • CPU and memory: Are hosts under pressure?
  • Disk utilization: Am I running out of space?
  • System load: Is the machine keeping up with demand?
  • Uptime: Has something crashed and restarted?

Service Health

  • Process status: Is the service actually running?
  • Response times: Is it responding quickly?
  • Error rates: Are requests failing?
  • Custom metrics: Application-specific indicators

Network Health

  • Interface throughput: How much traffic is flowing?
  • Connection counts: Are we handling expected load?
  • Latency: Is the network performing well?
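
Most of the checks above map onto standard node_exporter metrics. A few PromQL expressions along these lines (the metric names are node_exporter's; the job label is an assumption about my scrape config):

```promql
# Memory pressure: fraction of memory still available
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Uptime: seconds since boot — a sudden drop means a crash and restart
time() - node_boot_time_seconds

# Process/scrape status: up is 0 when a target can't be scraped
up{job="node"}

# Network throughput per interface, in bytes per second
rate(node_network_receive_bytes_total[5m])
```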

Alert Philosophy

Not every metric needs an alert. My approach:

Alert on Symptoms, Not Causes

I alert on "disk space below 10%" rather than "specific process writing too many logs." The symptom is what matters; I can investigate the cause when I respond.
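
The disk-space symptom translates into a Prometheus alerting rule like the sketch below. The threshold and `for` duration are examples; the fstype filter excludes pseudo-filesystems that would otherwise produce noise:

```yaml
# rules/disk.yml — symptom-based alert: free space below 10%
groups:
  - name: disk
    rules:
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes < 0.10
        for: 15m             # condition must persist before firing, avoids flapping
        labels:
          severity: warning
```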

Avoid Alert Fatigue

Every alert that doesn't require action is a liability. It trains me to ignore alerts, which means I'll miss the real ones. I regularly review and tune alert thresholds.

Give Context

Alerts include relevant information: which host, what metric, what the current value is, and what the threshold is. I shouldn't have to log in just to understand what's happening.
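
Annotations are where that context lives. A sketch of the same disk alert with templated annotations, so the Discord message names the host, mount point, current value, and threshold without requiring a login:

```yaml
# Context-rich annotations (rule names and wording are examples)
groups:
  - name: disk
    rules:
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m
        annotations:
          summary: "Low disk on {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} has {{ $value | humanizePercentage }} free (threshold: 10%)."
```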

The Workflow

When an alert fires:

  1. Notification arrives in Discord with context
  2. I triage the severity and potential impact
  3. I investigate using Grafana dashboards for more context
  4. I resolve the issue or acknowledge if it's known
  5. I document if it revealed a gap in monitoring or configuration

This workflow has prevented several potential outages by catching issues before they became user-facing problems.

Lessons Learned

Building this monitoring stack taught me:

  1. Visibility enables confidence - I can make changes knowing I'll see the impact immediately
  2. Alerting is harder than monitoring - Collecting metrics is easy; knowing what to alert on takes iteration
  3. Dashboards tell stories - A good visualization helps you understand systems, not just stare at numbers
  4. Monitoring is ongoing - New services need new exporters; changing systems need updated dashboards

What's Next

My monitoring continues to evolve:

  • Adding application-level metrics for custom services
  • Improving alert routing and escalation
  • Building capacity forecasting based on trends
  • Documenting monitoring coverage to identify gaps

If you're running any kind of infrastructure, even a small homelab, invest in monitoring early. The visibility you gain will save you hours of debugging and give you confidence to experiment and grow.