Prometheus + Grafana: monitoring without a $2,300/month bill
We monitor 20+ services across 3 clusters with the open-source observability stack. The cost is engineer time, not vendor invoices.
A client asked us to quote a monitoring solution for their healthcare platform -- 20 services across 3 clusters, 15 custom business metrics, and a compliance requirement to retain metrics for 12 months.
We got two quotes: Datadog at $2,300/month and New Relic at $1,800/month. We deployed Prometheus + Grafana for the cost of 40GB of persistent storage per cluster. The annual savings paid for the engineering time to set it up within the first month.
The problem
Every production system needs monitoring. The question isn't whether to monitor -- it's whether you're willing to pay a per-host, per-metric, per-GB fee to a vendor for the privilege of seeing your own data.
Managed monitoring platforms are excellent products. They're also designed to scale their pricing alongside your infrastructure. At 5 hosts with default metrics, the bill is reasonable. At 50 hosts with custom metrics, it's a line item your CFO asks about. At 200 hosts, it's a negotiation.
The open-source stack costs engineer time to set up and occasionally maintain. But once it's running, you own every byte of data, and adding a new host or metric costs nothing.
The stack
- Prometheus -- Time-series database. Scrapes /metrics endpoints from your services at a configurable interval (15s default). Stores metrics locally with configurable retention.
- Grafana -- Visualization and alerting. Dashboards for everything from infrastructure health to business KPIs. Alert rules with Slack, PagerDuty, and email routing.
- Alertmanager -- Alert routing and deduplication. Receives alerts from Prometheus, groups them, and sends notifications based on severity and team routing.
- Loki -- Log aggregation. The Prometheus model applied to logs -- labels instead of full-text indexing, which keeps storage costs 10x lower than Elasticsearch.
- Node Exporter + cAdvisor -- System and container metrics. CPU, memory, disk, network at the host and container level.
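The severity-and-team routing that Alertmanager handles can be sketched as a minimal config. Receiver names, the Slack webhook, and the PagerDuty key below are placeholders, not values from our deployment:

```yaml
route:
  receiver: slack-default
  group_by: [alertname, service]
  group_wait: 30s        # wait to batch related alerts into one notification
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts page the on-call; everything else goes to Slack.
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME
```

The `group_by` labels control deduplication: ten pods of one service failing at once produce a single grouped notification, not ten pages.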
Instrumenting services
Every service exposes a /metrics endpoint. Prometheus scrapes it. In Python (FastAPI):
```python
from fastapi import FastAPI, Response
from prometheus_client import Counter, Histogram, generate_latest

app = FastAPI()

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type="text/plain")
```
In Go, the promhttp package does the same thing in 3 lines. The key insight: define your histogram buckets based on your SLOs, not the defaults. Default buckets waste cardinality on ranges you'll never alert on.
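To see why bucket placement matters, here is a plain-Python sketch (not part of the service code) of the linear interpolation that PromQL's `histogram_quantile` performs over cumulative bucket counts. The counts are hypothetical; the point is that the estimate can never be more precise than the bucket the quantile lands in, so boundaries belong at your SLO thresholds:

```python
def estimate_quantile(q, bounds, cumulative_counts):
    """Approximate a quantile from cumulative histogram buckets,
    mirroring the linear interpolation histogram_quantile performs.
    bounds: bucket upper limits, ascending; cumulative_counts[i]:
    number of observations <= bounds[i]."""
    total = cumulative_counts[-1]
    rank = q * total
    for i, count in enumerate(cumulative_counts):
        if count >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative_counts[i - 1] if i > 0 else 0
            width = bounds[i] - lower
            # Interpolate within the bucket: all resolution finer
            # than the bucket width is lost.
            return lower + width * (rank - prev) / (count - prev)
    return bounds[-1]

# The SLO-aligned buckets from the snippet above.
bounds = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
# Hypothetical cumulative counts for 1,000 requests.
counts = [200, 500, 700, 850, 960, 990, 998, 1000]
p99 = estimate_quantile(0.99, bounds, counts)
```

Here p99 resolves somewhere inside the 0.5 s to 1.0 s bucket. If your SLO is p99 < 500 ms, you can only tell which side of the line you are on because a boundary sits at exactly 0.5.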
The four golden signals
Every service dashboard starts with these four panels:
- Latency -- p50, p95, p99 request duration. Our SLO is p99 < 500ms for API services.
- Traffic -- Requests per second, broken down by endpoint. Helps capacity planning.
- Errors -- 5xx rate as a percentage of total requests. Alert threshold: > 1% for 5 minutes.
- Saturation -- CPU and memory utilization as a percentage of limits. Alert at 80%.
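As a sketch, the four panels map to PromQL roughly like this, assuming the `http_requests_total` and `http_request_duration_seconds` metrics from the instrumentation above; the saturation query assumes kube-state-metrics is installed alongside cAdvisor:

```promql
# Latency: p99 over 5m, per endpoint
histogram_quantile(0.99, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic: requests per second, per endpoint
sum by (endpoint) (rate(http_requests_total[5m]))

# Errors: 5xx as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: per-pod CPU usage as a fraction of the CPU limit
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
  / sum by (pod) (kube_pod_container_resource_limits{resource="cpu"})
```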
We standardize these four panels across all service dashboards using Grafana's dashboard-as-code (JSON models stored in Git, deployed via ArgoCD). A new service gets a monitoring dashboard automatically.
Alert rules that don't cause fatigue
```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m])
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }}: error rate {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
```
Three rules we follow to prevent alert fatigue:
- `for: 5m` minimum. Transient spikes are not incidents. Only alert on sustained conditions.
- Rate, not count. `rate(errors[5m]) / rate(total[5m]) > 0.01` catches real problems. `errors > 10` fires during every traffic spike.
- Every alert has a runbook. If you can't link to a document explaining what to do when the alert fires, the alert isn't ready for production.
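The rate-versus-count distinction can be demonstrated in plain Python with hypothetical counter samples (this mirrors the PromQL, it isn't PromQL): a traffic spike inflates the raw error count but leaves the error ratio flat.

```python
def error_ratio(err_start, err_end, total_start, total_end):
    """Ratio of error rate to request rate over a window,
    mirroring rate(errors[5m]) / rate(total[5m])."""
    return (err_end - err_start) / (total_end - total_start)

# Quiet period: 1,000 requests in the window, 5 errors (0.5%).
quiet = error_ratio(0, 5, 0, 1000)
# Traffic spike: 20,000 requests, 100 errors -- still 0.5%.
spike = error_ratio(5, 105, 1000, 21000)

# A count-based rule (errors > 10 per window) pages during the spike,
# even though nothing is actually wrong:
assert 100 > 10
# The ratio-based rule stays below the 1% threshold in both cases:
assert quiet <= 0.01 and spike <= 0.01
```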
Dashboard structure
We maintain three dashboard layers per cluster:
- Overview -- All services. Green/red health status. Traffic heatmap. CPU and memory across all nodes.
- Service detail -- The four golden signals for one service. Selected from a dropdown. Includes pod-level breakdown.
- Debug -- Per-instance metrics. Goroutine counts, GC pauses, connection pool sizes, query timing. Used during incident response only.
All dashboards are JSON models stored in our infra repo and deployed via a Grafana sidecar. No one edits dashboards through the UI in production.
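Under the sidecar convention used by the Grafana Helm chart (a k8s-sidecar container watching for a configurable label, `grafana_dashboard` by default), shipping a dashboard from Git is just a labeled ConfigMap. The service name and JSON body below are illustrative placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: payments-service-dashboard
  labels:
    grafana_dashboard: "1"   # the sidecar picks up ConfigMaps with this label
data:
  payments-service.json: |
    { "title": "payments-service", "panels": [] }
```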
Cost comparison
| Item | Prometheus + Grafana (self-hosted) | Datadog |
|---|---|---|
| 20 hosts | ~$0 (runs on existing infra) | ~$460/month |
| 100 hosts | ~$50/month (additional storage) | ~$2,300/month |
| Custom metrics | Unlimited | $0.05 per custom metric |
| Log retention (12 months) | ~$30/month (Loki + S3) | ~$600/month |
| Total (20 hosts, 12mo retention) | ~$30/month | ~$1,060/month |
The self-hosted stack costs more in initial setup time (we budget 3-5 days for a full deployment including dashboards and alerts). After that, maintenance is 2-4 hours per month -- mostly Prometheus version upgrades and dashboard additions.
The tradeoffs
- No APM out of the box. Distributed tracing requires adding Jaeger or Tempo to the stack. Datadog's APM is excellent and integrated. We add Tempo when clients need trace-level debugging.
- Prometheus is not a long-term store. Default retention is 15 days. For 12-month retention, we use Thanos or Mimir to ship metrics to S3-compatible storage. This adds operational complexity.
- Grafana dashboard sprawl. Without discipline, teams create one-off dashboards that nobody maintains. We enforce dashboard-as-code and delete anything not in the Git repo.
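For the long-term store, Thanos describes the object storage target in an objstore config that the sidecar and store components share. A minimal S3 sketch, with a placeholder bucket and endpoint:

```yaml
type: S3
config:
  bucket: metrics-long-term
  endpoint: s3.us-east-1.amazonaws.com
  # Credentials typically come from the environment or an IAM role.
```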
Our recommendation
If you're spending more than $500/month on a monitoring vendor, or if you have compliance requirements that mandate data residency, deploy Prometheus + Grafana. The setup cost is 3-5 days. The ongoing cost is negligible. You own the data, you control the retention, and adding new services to monitoring is a 10-line scrape config.
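That 10-line scrape config looks roughly like this; the job name and target are placeholders, and in Kubernetes you would more likely use `kubernetes_sd_configs` with annotations instead of static targets:

```yaml
scrape_configs:
  - job_name: payments-service
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - payments-service.internal:8000
```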
The vendor products are excellent. They're also unnecessary for most workloads if your team can invest the initial setup time.