# Prometheus Best Practices (2025 Edition)

Comprehensive guide to Prometheus configuration, metrics naming, alerting, and operational best practices based on 2025 industry standards.

## Table of Contents

1. [Metrics Naming Conventions](#metrics-naming-conventions)
2. [Configuration Best Practices](#configuration-best-practices)
3. [Alert Design](#alert-design)
4. [Recording Rules](#recording-rules)
5. [Service Discovery](#service-discovery)
6. [Performance & Scaling](#performance--scaling)
7. [Security](#security)
8. [High Availability](#high-availability)
9. [Retention & Storage](#retention--storage)
10. [Troubleshooting](#troubleshooting)

---

## Metrics Naming Conventions

### 2025 Standards

These rules follow the official Prometheus naming conventions and the OpenMetrics specification:

#### Basic Rules

```
✓ GOOD: http_requests_total
✗ BAD:  HTTPRequests, http-requests, httpRequestsCount

✓ GOOD: node_memory_bytes
✗ BAD:  nodeMemory, node_memory_MB

✓ GOOD: api_request_duration_seconds
✗ BAD:  api_latency, request_time_ms
```

#### Naming Structure

```
<namespace>_<subsystem>_<name>_<unit>
```

Examples:
- `myapp_http_requests_total`
- `myapp_api_request_duration_seconds`
- `myapp_database_connections_active`

#### Required Conventions

1. **Use lowercase with underscores**
   ```
   http_requests_total  ✓
   HTTPRequests         ✗
   ```

2. **Include application prefix**
   ```
   myapp_requests_total     ✓
   requests_total           ✗ (too generic)
   ```

3. **Add base units**
   - Time: `_seconds` (always use seconds, not ms)
   - Size: `_bytes` (not KB, MB)
   - Percentage: `_ratio` (0-1, not 0-100)
   - Count: `_total` for counters

4. **Counter suffix**
   ```
   http_requests_total      ✓ (counter)
   http_requests            ✗ (ambiguous)
   ```

5. **Reserve colons for recording rules**
   ```
   job:http_requests:rate5m     ✓ (recording rule)
   http:requests:total          ✗ (instrumentation)
   ```
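
The conventions above can be checked mechanically. The following is a minimal sketch, not an official linter: the function name `check_metric_name` and the list of discouraged suffixes are assumptions chosen for illustration, and the checks cover only the rules listed in this section.

```python
import re

# Assumed, non-exhaustive list of non-base-unit suffixes to flag.
NON_BASE_SUFFIXES = ("_ms", "_milliseconds", "_kb", "_mb", "_gb", "_percent")
NAME_RE = re.compile(r"[a-z][a-z0-9_]*")


def check_metric_name(name: str) -> list[str]:
    """Return the convention violations found in an instrumented metric name."""
    problems = []
    if ":" in name:
        problems.append("colons are reserved for recording rules")
    # Colons are reported separately, so ignore them for the character check.
    if not NAME_RE.fullmatch(name.replace(":", "_")):
        problems.append("use lowercase letters, digits, and underscores")
    if name.lower().endswith(NON_BASE_SUFFIXES):
        problems.append("use base units (seconds, bytes, ratio)")
    return problems


for candidate in ("http_requests_total", "request_time_ms", "node_memory_MB"):
    print(candidate, check_metric_name(candidate))
```

A check like this could run in CI against the metric names an application exposes; an empty list means no violation was detected by these particular rules.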

#### Metric Types

**Counters** (monotonically increasing):
```
http_requests_total
errors_total
bytes_sent_total
```

**Gauges** (can go up or down):
```
memory_usage_bytes
active_connections
queue_size
temperature_celsius
```

**Histograms** (observations in buckets):
```
http_request_duration_seconds
response_size_bytes
```

**Summaries** (like histograms, but quantiles are calculated client-side and cannot be aggregated across instances):
```
rpc_duration_seconds
```

### Label Best Practices

#### Good Labels

```yaml
http_requests_total{
  method="GET",
  status="200",
  endpoint="/api/users"
}
```

#### Label Guidelines

1. **Use meaningful label names**
   ```
   {method="GET"}       ✓
   {m="GET"}            ✗
   ```

2. **Keep cardinality low**
   ```
   {status="200"}       ✓ (limited values)
   {user_id="12345"}    ✗ (unbounded)
   ```

3. **Avoid high-cardinality labels**
   - ✗ User IDs
   - ✗ Email addresses
   - ✗ Timestamps
   - ✗ Request IDs
   - ✓ HTTP methods (GET, POST, etc.)
   - ✓ Status codes (200, 404, 500)
   - ✓ Environments (prod, staging, dev)

4. **Use consistent label names**
   ```
   {instance="server1"}     ✓ (everywhere)
   {host="server1"}         ✗ (inconsistent)
   ```
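
To see why unbounded labels are a problem, it helps to compute the worst-case number of series for a single metric, which is the product of the distinct values of each label. The sketch below uses made-up value counts; `worst_case_series` is a hypothetical helper, not part of any Prometheus library.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Assumed counts of distinct values per label, for illustration only.
bounded = worst_case_series({"method": 5, "status": 10, "endpoint": 20})
unbounded = worst_case_series({"method": 5, "status": 10, "user_id": 1_000_000})

print(bounded)    # 1000
print(unbounded)  # 50000000
```

Replacing a 20-value `endpoint` label with a million-value `user_id` label raises the upper bound from a thousand series to fifty million for this one metric.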

---

## Configuration Best Practices

### Global Configuration

```yaml
global:
  # Balance freshness vs. load
  scrape_interval: 15s      # Typical range: 15-60s (Prometheus default: 1m)
  evaluation_interval: 15s  # Match scrape_interval (default: 1m)
  scrape_timeout: 10s       # Must not exceed scrape_interval (default: 10s)

  # Multi-cluster identification
  external_labels:
    cluster: 'production'
    region: 'us-east-1'
    datacenter: 'dc1'
```

### Scrape Configuration

#### Optimal Intervals

```yaml
scrape_configs:
  # Critical services: 15s
  - job_name: 'api-critical'
    scrape_interval: 15s

  # Standard services: 30s
  - job_name: 'api-standard'
    scrape_interval: 30s

  # Infrastructure: 60s
  - job_name: 'node-exporter'
    scrape_interval: 60s
```

#### Timeout Settings

```yaml
# Rule: scrape_timeout <= scrape_interval (a larger timeout is rejected at load time)
scrape_configs:
  - job_name: 'app'
    scrape_interval: 30s
    scrape_timeout: 10s    # 1/3 of interval
```
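
The interval and timeout rule can be verified before a config is deployed. This is a simplified sketch: `parse_duration` and `timeout_is_valid` are hypothetical helpers, and the parser is more lenient than Prometheus itself (for example, it does not enforce unit ordering).

```python
import re

# Seconds per unit; a year is treated as 365 days.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text: str) -> float:
    """Convert a Prometheus-style duration such as '30s' or '1h30m' to seconds."""
    parts = _PART.findall(text)
    # Reject input with leftover characters, e.g. '10' or '5 minutes'.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def timeout_is_valid(scrape_interval: str, scrape_timeout: str) -> bool:
    """A scrape timeout may equal the interval but must not exceed it."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


print(timeout_is_valid("30s", "10s"))  # True
print(timeout_is_valid("15s", "30s"))  # False
```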

#### Relabeling Best Practices

```yaml
relabel_configs:
  # 1. Drop unwanted targets early
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: drop
    regex: 'false'

  # 2. Keep only desired targets
  - source_labels: [__meta_kubernetes_namespace]
    action: keep
    regex: '(production|staging)'

  # 3. Transform labels
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod

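# 4. Drop high-cardinality labels from scraped samples.
#    This belongs in metric_relabel_configs: relabel_configs only sees
#    target labels, not the labels on ingested series.
metric_relabel_configs:
  - regex: 'user_id|session_id|request_id'
    action: labeldrop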
```
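
The `keep` and `drop` actions above can be reasoned about with a small simulation. The sketch below is an approximation under stated assumptions: `apply_filter` is a hypothetical helper, and Python's `re` module stands in for the RE2 engine Prometheus uses, which is only equivalent for simple patterns. The key behavior it illustrates is that relabel regexes are fully anchored.

```python
import re


def apply_filter(labels: dict[str, str], source_labels: list[str],
                 regex: str, action: str, separator: str = ";") -> bool:
    """Return True if the target survives a keep/drop relabel step."""
    if action not in ("keep", "drop"):
        raise ValueError(f"unsupported action: {action!r}")
    # Source label values are joined with the separator; missing labels are empty.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # fullmatch mirrors the implicit ^...$ anchoring of relabel regexes.
    matched = re.fullmatch(regex, value) is not None
    return matched if action == "keep" else not matched


prod = {"__meta_kubernetes_namespace": "production"}
canary = {"__meta_kubernetes_namespace": "production-canary"}
rule = (["__meta_kubernetes_namespace"], "(production|staging)", "keep")

print(apply_filter(prod, *rule))    # True
print(apply_filter(canary, *rule))  # False
```

Because of the anchoring, `production-canary` does not match `(production|staging)` and the target is dropped by the `keep` rule.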

---

## Alert Design

### 2025 Alerting Standards

#### Alert Structure

```yaml
- alert: AlertName
  expr: |
    # PromQL expression
  for: 5m                    # Prevent flapping
  keep_firing_for: 10m       # Stable alerting (NEW in 2.42+)
  labels:
    severity: critical       # critical, warning, info
    component: api
    team: platform
  annotations:
    summary: "Clear, actionable message"
    description: "Detailed context with {{ $value }}"
    runbook_url: "https://runbooks.company.com/alerts/alert-name"
    dashboard_url: "https://grafana.company.com/d/dashboard"
```

### Severity Levels

```yaml
# Critical: Immediate action, pages on-call
severity: critical
# - Service down
# - Data loss imminent
# - Security breach

# Warning: Investigation needed, ticket/email
severity: warning
# - High resource usage
# - Degraded performance
# - Non-critical errors

# Info: Awareness only, log/metric
severity: info
# - Deployments
# - Configuration changes
# - Scaling events
```

### Alert Best Practices

#### 1. Use `for` Clause

Prevent flapping alerts:

```yaml
# BAD: Alert fires immediately
- alert: HighCPU
  expr: cpu_usage > 80

# GOOD: Alert only if sustained
- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m
```

#### 2. Use `keep_firing_for` (NEW)

Prevent false resolutions:

```yaml
- alert: ServiceDown
  expr: up == 0
  for: 2m
  keep_firing_for: 10m  # Keep firing through brief recoveries or data gaps
```

#### 3. Actionable Annotations

```yaml
annotations:
  # BAD: Vague
  summary: "Problem detected"

  # GOOD: Specific and actionable
  summary: "API latency P95 > 1s on {{ $labels.instance }}"
  description: |
    Current P95 latency: {{ $value | humanizeDuration }}
    Threshold: 1s
    Affected endpoint: {{ $labels.endpoint }}

    Immediate actions:
    1. Check application logs
    2. Review recent deployments
    3. Verify database performance

  runbook_url: "https://runbooks.example.com/high-latency"
```

#### 4. Appropriate Thresholds

```yaml
# Use multiple severity levels
- alert: HighMemoryWarning
  expr: memory_usage > 80
  for: 5m
  labels:
    severity: warning

- alert: HighMemoryCritical
  expr: memory_usage > 95
  for: 2m
  labels:
    severity: critical
```

#### 5. Rate vs. Increase

```yaml
# GOOD: Use rate() for per-second rates
- alert: HighErrorRate
  expr: rate(errors_total[5m]) > 10

# GOOD: Use increase() for total count
- alert: TooManyErrors
  expr: increase(errors_total[1h]) > 100
```
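
The relationship between the two functions is easier to see with numbers: an increase is the total growth of a counter over a window, and a rate is that growth divided by time. The sketch below is a deliberately simplified model with hypothetical helper names. Unlike Prometheus, it does not extrapolate to the window boundaries, so its results will differ slightly from real `rate()` and `increase()` output.

```python
def simple_increase(samples: list[tuple[float, float]]) -> float:
    """Sum of counter deltas, treating any drop in value as a reset to zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts at 0, so the new value is the delta.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second average: increase divided by the covered time span."""
    span = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / span


# (timestamp_seconds, counter_value); the counter resets between 60s and 120s.
samples = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(simple_increase(samples))  # 140.0
print(round(simple_rate(samples), 3))
```

Use the rate form when the threshold is a throughput ("more than 10 errors per second") and the increase form when it is a count over a period ("more than 100 errors in an hour").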

---

## Recording Rules

### Naming Convention

```
<level>:<metric>:<operations>
```

Examples:
```yaml
job:http_requests:rate5m
instance:cpu_utilization:avg
cluster:memory_usage_bytes:sum
```
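
A rule name in this format can be split into its three parts for validation or documentation tooling. The helper below is a hypothetical sketch that only checks the structure, not whether the level or operation names are meaningful.

```python
def parse_rule_name(name: str) -> dict[str, str]:
    """Split a recording rule name into level, metric, and operations."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <level>:<metric>:<operations>, got {name!r}")
    level, metric, operations = parts
    return {"level": level, "metric": metric, "operations": operations}


print(parse_rule_name("job:http_requests:rate5m"))
```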

### Best Practices

#### 1. Multi-Level Aggregations

```yaml
groups:
  - name: http_requests
    interval: 30s
    rules:
      # Instance level
      - record: instance:http_requests:rate5m
        expr: rate(http_requests_total[5m])

      # Job level
      - record: job:http_requests:rate5m
        expr: sum by(job) (instance:http_requests:rate5m)

      # Cluster level
      - record: cluster:http_requests:rate5m
        expr: sum(job:http_requests:rate5m)
```

#### 2. Reduce Dashboard Query Load

```yaml
# Instead of running this expensive query on every dashboard refresh:
histogram_quantile(0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

# Record it once, named <level>:<metric>:<operations>:
- record: job:http_request_duration_seconds:p95
  expr: |
    histogram_quantile(0.95,
      sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

#### 3. SLO Calculations

```yaml
- record: slo:availability:ratio
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

- record: slo:error_budget:remaining
  expr: |
    1 - (
      (1 - slo:availability:ratio)
      /
      (1 - 0.999)  # 99.9% SLO
    )
```
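
The error budget expression above is plain arithmetic, which the following sketch reproduces outside PromQL. The function name and the sample availability values are assumptions for illustration; the formula is the one in the rule.

```python
def error_budget_remaining(availability: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget left: 1 - (observed error / allowed error)."""
    allowed_error = 1 - slo_target
    observed_error = 1 - availability
    return 1 - observed_error / allowed_error


# 99.95% availability against a 99.9% SLO leaves roughly half the budget.
print(round(error_budget_remaining(0.9995), 3))
# 99.8% availability overspends the budget, so the result is negative.
print(round(error_budget_remaining(0.998), 3))
```

A value of 1 means no errors were observed, 0 means the budget is exactly spent, and negative values mean the SLO has been violated for the measured window.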

---

## Service Discovery

### Kubernetes SD Best Practices

```yaml
kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: [production, staging]  # Limit scope

relabel_configs:
  # Filter early in pipeline
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true

  # Use consistent label names
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
```

### Consul SD Best Practices

```yaml
consul_sd_configs:
  - server: 'consul:8500'
    datacenter: 'dc1'
    tags: ['prometheus', 'production']  # Filter by tags

relabel_configs:
  # Only scrape healthy services
  - source_labels: [__meta_consul_health]
    regex: 'passing'
    action: keep
```

### File SD Best Practices

```yaml
file_sd_configs:
  - files:
      - '/etc/prometheus/targets/*.json'
    refresh_interval: 30s  # Balance freshness vs load

# Keep files small and focused
# targets/web-servers.json
# targets/api-servers.json
# targets/databases.json
```

---

## Performance & Scaling

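### Memory and Disk Sizing

```yaml
# Estimate disk needs (formula from the Prometheus storage docs):
# disk = ingested_samples_per_second * retention_seconds * 1-2 bytes
#
# Example:
# 1M samples/sec * 15 days (1,296,000 s) * 2 bytes = ~2.6 TB
#
# Memory scales with the number of active series, not with retention.
# Use remote storage for long-term retention
```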
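
The sizing formula above multiplies ingestion rate, retention, and bytes per sample. In the Prometheus storage documentation this formula is used for disk capacity planning. The sketch below reproduces the worked example; `disk_bytes` is a hypothetical helper and the 2 bytes per sample default is the upper end of the range quoted above.

```python
SECONDS_PER_DAY = 86_400


def disk_bytes(samples_per_second: float, retention_days: float,
               bytes_per_sample: float = 2.0) -> float:
    """Rough capacity estimate: ingestion rate x retention x bytes per sample."""
    return samples_per_second * retention_days * SECONDS_PER_DAY * bytes_per_sample


estimate = disk_bytes(1_000_000, 15)
print(f"{estimate / 1e12:.2f} TB")  # 2.59 TB
```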

### Reduce Cardinality

```yaml
# BAD: Unbounded cardinality
metric_relabel_configs:
  - source_labels: [user_id]  # Millions of users
    target_label: user

# GOOD: Limited cardinality
metric_relabel_configs:
  - source_labels: [user_tier]  # Free/Pro/Enterprise
    target_label: tier
```

### Query Optimization

```yaml
# BAD: Aggregates every raw series on each dashboard refresh
sum(rate(http_requests_total[5m]))

# GOOD: Read the pre-aggregated recording rule
sum(job:http_requests:rate5m)

# BAD: Long range recomputed on every evaluation
rate(metric[1h])

# GOOD: Short range, recorded once by a rule
rate(metric[5m])
```

### Remote Storage

```yaml
remote_write:
  - url: "http://long-term-storage:9009/api/v1/write"
    queue_config:
      capacity: 10000
      max_samples_per_send: 5000
      batch_send_deadline: 5s
    # Send only important metrics
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(critical_metrics|slo:.*)'  # slo:* matches recording rules such as slo:availability:ratio
        action: keep
```

---

## Security

### TLS Configuration

```yaml
scrape_configs:
  - job_name: 'secure-app'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.pem
      cert_file: /etc/prometheus/client-cert.pem
      key_file: /etc/prometheus/client-key.pem
      insecure_skip_verify: false  # Always false in production
```

### Authentication

```yaml
# Basic Auth
basic_auth:
  username: prometheus
  password_file: /etc/prometheus/password

# Bearer Token (bearer_token_file is deprecated in favor of authorization)
authorization:
  type: Bearer
  credentials_file: /etc/prometheus/token

# OAuth2
oauth2:
  client_id: prometheus
  client_secret_file: /etc/prometheus/oauth_secret
  token_url: https://oauth.example.com/token
```

### Network Security

```yaml
# Kubernetes Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
  - Ingress  # Listing Egress without egress rules would block all outbound scrapes
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: grafana
    ports:
    - protocol: TCP
      port: 9090
```

---

## High Availability

### Prometheus HA Setup

```yaml
# Prometheus 1
global:
  external_labels:
    replica: 'prometheus-1'
    cluster: 'production'

# Prometheus 2
global:
  external_labels:
    replica: 'prometheus-2'
    cluster: 'production'

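# Alertmanager deduplicates alerts only when their labels are identical,
# so drop the replica label via alerting.alert_relabel_configs
# Use Thanos/Cortex for deduplicated querying across replicas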
```

### Federation

```yaml
# Global Prometheus federates from regional
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # Only recording rules
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-west:9090'
```
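
When debugging federation it can help to request the `/federate` endpoint by hand with the same `match[]` selectors. The sketch below builds such a URL; `federate_url` is a hypothetical helper, and the host name is taken from the example targets above.

```python
from urllib.parse import urlencode


def federate_url(base: str, matchers: list[str]) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base}/federate?{query}"


url = federate_url(
    "http://prometheus-us-east:9090",
    ['{job="prometheus"}', '{__name__=~"job:.*"}'],
)
print(url)
```

The selectors must be URL-encoded because they contain braces, quotes, and equals signs; `urlencode` handles that and repeats the `match[]` key once per selector.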

---

## Retention & Storage

### Retention Policies

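```bash
# Prometheus 2.x sets retention through command-line flags, not prometheus.yml.
# Data is removed when either limit is reached, whichever comes first.
# WAL compression saves disk space and is enabled by default in recent releases.
prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.wal-compression
```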

### Remote Storage Strategy

```yaml
# Tier 1: Local (15 days, fast queries)
# Set with the flag --storage.tsdb.retention.time=15d

# Tier 2: Remote (90 days, medium queries)
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"

# Tier 3: Object storage (2+ years, slow queries)
# Via Thanos/Cortex compactor
```

---

## Troubleshooting

### Common Issues

#### Targets Not Discovered

```bash
# Check service discovery
curl http://prometheus:9090/api/v1/targets

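# Verify relabeling: compare discoveredLabels (before) with labels (after)
curl -s 'http://prometheus:9090/api/v1/targets?state=active' \
  | jq '.data.activeTargets[] | {discoveredLabels, labels}'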

# Check logs
kubectl logs prometheus-0 -n monitoring
```
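
The JSON returned by the targets endpoint can be summarized per job to spot unhealthy scrape pools quickly. This is a minimal sketch: `health_by_job` is a hypothetical helper, and `SAMPLE` is a made-up, trimmed payload that keeps only the `labels` and `health` fields of each active target.

```python
import json
from collections import Counter

# Hypothetical, trimmed response in the shape returned by /api/v1/targets.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "api", "instance": "api-1:8080"}, "health": "up"},
            {"labels": {"job": "api", "instance": "api-2:8080"}, "health": "down"},
            {"labels": {"job": "node", "instance": "node-1:9100"}, "health": "up"},
        ],
        "droppedTargets": [],
    },
})


def health_by_job(payload: str) -> dict[str, Counter]:
    """Count active targets per job, grouped by health state."""
    data = json.loads(payload)["data"]
    summary: dict[str, Counter] = {}
    for target in data.get("activeTargets", []):
        job = target["labels"].get("job", "unknown")
        summary.setdefault(job, Counter())[target["health"]] += 1
    return summary


for job, counts in health_by_job(SAMPLE).items():
    print(job, dict(counts))
```

If a job is missing from the summary entirely, the targets were either never discovered or were removed by relabeling, which points back to the service discovery and relabel configuration.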

#### High Memory Usage

```bash
# Check cardinality
curl http://prometheus:9090/api/v1/status/tsdb

# Find high-cardinality metrics
promtool tsdb analyze /prometheus/data

# Then drop unnecessary labels in prometheus.yml (YAML, not a shell command):
#   metric_relabel_configs:
#     - regex: 'high_cardinality_label'
#       action: labeldrop
```

#### Slow Queries

```bash
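# Enable query logging in prometheus.yml (a config option, not a CLI flag):
#   global:
#     query_log_file: /var/log/prometheus/queries.log

# Analyze slow queries (the log is one JSON object per line)
jq -r '[.stats.timings.execTotalTime, .params.query] | @tsv' \
  /var/log/prometheus/queries.log | sort -rg | head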

# Use recording rules for expensive queries
```
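
Ranking logged queries by execution time shows which ones are worth turning into recording rules. The sketch below assumes the query log is written as one JSON object per line with `params.query` and `stats.timings.execTotalTime` fields; `slowest_queries` and the sample lines are hypothetical.

```python
import json

# Made-up log lines, trimmed to the two fields this sketch reads.
SAMPLE_LOG = [
    '{"params": {"query": "up == 0"}, "stats": {"timings": {"execTotalTime": 0.0005}}}',
    '{"params": {"query": "sum(rate(http_requests_total[5m]))"}, '
    '"stats": {"timings": {"execTotalTime": 1.75}}}',
    "",
    '{"params": {"query": "histogram_quantile(0.95, x)"}, '
    '"stats": {"timings": {"execTotalTime": 0.4}}}',
]


def slowest_queries(log_lines, top=3):
    """Return (seconds, query) pairs for the slowest queries, slowest first."""
    entries = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        record = json.loads(line)
        seconds = record["stats"]["timings"]["execTotalTime"]
        entries.append((seconds, record["params"]["query"]))
    return sorted(entries, reverse=True)[:top]


for seconds, query in slowest_queries(SAMPLE_LOG, top=2):
    print(f"{seconds:.3f}s  {query}")
```

To run it against a real log, pass an open file object instead of `SAMPLE_LOG`, since the function only needs an iterable of lines.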

---

## References

- [Prometheus Official Docs](https://prometheus.io/docs/)
- [Naming Best Practices](https://prometheus.io/docs/practices/naming/)
- [Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
- [Recording Rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
- [Awesome Prometheus Alerts](https://samber.github.io/awesome-prometheus-alerts/)

---

**Last Updated**: January 2025
**Prometheus Version**: 2.42+ (required for `keep_firing_for`)
**Standards**: OpenMetrics, PromQL, 2025 industry best practices
