Running services without monitoring is flying blind. The first sign of a problem shouldn't be a customer complaint. Here's how to build a monitoring stack that gives you visibility before things break.
Prerequisites
- Docker and Docker Compose installed
- A Linux server or VPS
- At least one service to monitor (Node.js app, PostgreSQL, Nginx)
Stack Overview
Application & exporter metrics → Prometheus (scrape & store)
                                     ↓                 ↓
                           Grafana (visualize)   AlertManager (route alerts)
                                                        ↓
                                             Slack / PagerDuty / Email
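Everything below lives in a single project directory (the name is arbitrary). The volume mounts in the compose file assume a layout roughly like this; the .env file and the datasources provisioning file are introduced later in the post:
monitoring/
├── compose.yml
├── .env
├── prometheus/
│   ├── prometheus.yml
│   └── alerts/
│       └── infrastructure.yml
├── alertmanager/
│   └── alertmanager.yml
└── grafana/
    └── provisioning/
        ├── dashboards/
        │   └── dashboard.yml
        └── datasources/
            └── prometheus.yml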
Docker Compose Setup
# compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro # rule files referenced by rule_files below
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle' # allows config reloads via a POST to /-/reload
    extra_hosts:
      - 'host.docker.internal:host-gateway' # lets Prometheus reach node_exporter on the host network
    ports:
      - "127.0.0.1:9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "127.0.0.1:9093:9093"
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: 'false'
      GF_SERVER_ROOT_URL: https://grafana.yourdomain.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "127.0.0.1:3001:3000"
  node_exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    network_mode: host
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter:latest
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor:${DB_MONITOR_PASSWORD}@db:5432/appdb?sslmode=disable"
volumes:
  prometheus_data:
  grafana_data:
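The compose file reads two secrets (GRAFANA_PASSWORD and DB_MONITOR_PASSWORD) from the environment. Docker Compose automatically loads a .env file sitting next to compose.yml; a minimal sketch with placeholder values (keep this file out of version control):
# .env
GRAFANA_PASSWORD=change-me
DB_MONITOR_PASSWORD=change-me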
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - 'alerts/*.yml'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      # node_exporter runs with network_mode: host, so its compose service name does not
      # resolve here; scrape it through the host alias defined via extra_hosts in compose.yml
      - targets: ['host.docker.internal:9100']
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']
  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app:3000'] # Your app must expose /metrics and join the same Docker network
Alert Rules
# prometheus/alerts/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% for more than 5 minutes."
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanize }}% of disk space remains on /."
      - alert: PostgreSQLDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
AlertManager Routing
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts page via PagerDuty...
    - receiver: 'pagerduty-critical'
      match:
        severity: critical
      continue: true # ...and keep matching so they also hit the Slack route below
    # Everything (warnings, plus criticals thanks to continue) goes to Slack
    - receiver: 'slack-notifications'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Exposing Metrics from Your App
// Node.js with prom-client
import { Registry, Counter, Histogram, collectDefaultMetrics } from 'prom-client';
import express from 'express';
const app = express();
const register = new Registry();
collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});
// Middleware: record latency and request count for every response
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path ?? req.path;
    httpRequestDuration.labels(req.method, route, res.statusCode.toString()).observe(duration);
    httpRequestsTotal.labels(req.method, route, res.statusCode.toString()).inc();
  });
  next();
});
// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
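These application metrics can drive alerts as well. A sketch of two app-level rules built on the metric names registered above; the file name and thresholds (5% error rate, 1s p95 latency) are illustrative, and the alerts/*.yml glob in prometheus.yml will pick the file up automatically:
# prometheus/alerts/app.yml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        # share of requests returning 5xx over the last 5 minutes
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of requests are failing"
      - alert: SlowRequests
        # 95th percentile latency computed from the histogram buckets
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 request latency is above 1s"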
Grafana Dashboard Provisioning
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: 'Infrastructure'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
Import community dashboards by ID in Grafana UI:
- Node Exporter Full: 1860
- PostgreSQL: 9628
- Nginx: 9614
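Dashboards are only half of it: the Prometheus data source can be provisioned from code too, so a fresh Grafana container comes up ready to query. A minimal sketch, assuming the prometheus service name from the compose file; any .yml under provisioning/datasources/ works:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true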
Common Pitfalls
- Alerting on every spike: use for: 5m in alert rules to avoid alert storms from brief spikes
- No runbook links in annotations: add a runbook_url annotation to every alert (see the example after this list); engineers should know what to do before they're paged
- Prometheus retention too short: the default is 15 days; set --storage.tsdb.retention.time=30d at minimum
- Grafana not provisioned: manual dashboard creation doesn't survive container restarts; always provision from code
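For example, the HighCPUUsage rule from earlier with a runbook annotation added; the URL is a placeholder for wherever your team keeps its runbooks:
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    runbook_url: "https://wiki.yourdomain.com/runbooks/high-cpu" # placeholder URL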