Best Practices

This guide covers best practices for using Thalassa Prometheus Service effectively, optimizing costs, and ensuring reliable monitoring.

Label Strategy

Establish a consistent label naming strategy across your infrastructure to enable effective querying and alerting. For example:

  • env=production or env=staging to distinguish between environments
  • app=backend or app=frontend to identify applications
  • team=platform or team=backend to organize by team ownership
  • region=nl-01 to indicate geographic location
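One way to apply such a scheme consistently is in the Prometheus configuration itself. The sketch below is illustrative only; the label values and scrape target are examples, not part of the service:

```yaml
# Illustrative prometheus.yml fragment; label values are examples.
global:
  external_labels:
    env: production     # attached to every series sent via remote_write
    region: nl-01
    team: platform

scrape_configs:
  - job_name: backend
    static_configs:
      - targets: ['backend:8080']   # example target
        labels:
          app: backend              # per-target label, same scheme
```

External labels are applied to all remote-written series, so environment-wide labels belong there, while per-application labels fit better on individual scrape targets.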

Avoid high-cardinality labels that can cause metric explosion, such as user IDs, request IDs, or timestamps. Limit the number of label combinations to prevent creating excessive time series, which increases storage costs and query complexity. When you need to track high-cardinality data, use recording rules to pre-aggregate metrics before they are stored, reducing the number of unique time series.
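As a sketch of the pre-aggregation approach, a recording rule can sum away a high-cardinality dimension so that only the aggregate series is stored. The metric name and the `path` label below are hypothetical:

```yaml
groups:
  - name: pre_aggregation
    rules:
      # Sum away the hypothetical high-cardinality 'path' label so a
      # single aggregate series remains per remaining label set.
      - record: app:http_requests:rate5m
        expr: sum without (path) (rate(http_requests_total[5m]))
```

To actually reduce what is stored remotely, pair a rule like this with write relabeling that drops the raw high-cardinality series before remote write.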

Remote Write Configuration

Optimize Sample Rate

Scrape Intervals

Since pricing is based on samples ingested per 1,000 samples, optimizing your remote write configuration directly impacts costs. Balance the level of detail you need with cost by choosing appropriate scrape intervals: use 15-30 seconds for critical metrics that require frequent monitoring, 30-60 seconds for standard metrics that provide good visibility without excessive sampling, and 60 seconds or more for low-priority metrics where less frequent updates are acceptable.
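These tiers can be expressed with a global default plus per-job overrides. The intervals, job names, and targets below are examples only:

```yaml
# Example intervals; tune to your own requirements.
global:
  scrape_interval: 30s            # default for standard metrics

scrape_configs:
  - job_name: critical-api
    scrape_interval: 15s          # critical metrics: frequent sampling
    static_configs:
      - targets: ['api:9090']     # example target

  - job_name: batch-jobs
    scrape_interval: 60s          # low-priority metrics: fewer samples
    static_configs:
      - targets: ['batch:9090']   # example target
```

Halving the scrape frequency of a job roughly halves the samples it ingests, so per-job overrides are one of the most direct cost levers.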

Filter Out Metrics

Use write relabeling to filter out unnecessary metrics before they are sent to the remote write endpoint, reducing both sample volume and storage costs. Optimize your batch configuration by tuning max_samples_per_send and batch_send_deadline to efficiently batch samples while maintaining acceptable latency.
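A minimal sketch of batch tuning via `queue_config` is shown below; the specific values are illustrative starting points, not recommendations from the service:

```yaml
remote_write:
  - url: https://prometheus.nl-01.thalassa.cloud/api/v1/push
    queue_config:
      max_samples_per_send: 2000   # larger batches, fewer HTTP requests
      batch_send_deadline: 5s      # flush a partial batch after 5s at most
      max_shards: 10               # cap parallel senders
```

Larger batches reduce request overhead at the cost of slightly higher delivery latency; `batch_send_deadline` bounds that latency for partially filled batches.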

Filter metrics before sending to reduce costs:

remote_write:
  - url: https://prometheus.nl-01.thalassa.cloud/api/v1/push
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: '.*_bucket.*|.*_count.*'
        action: drop
      
      # Keep only important metrics
      - source_labels: [__name__]
        regex: 'node_.*|http_.*|container_.*'
        action: keep

Query Optimization

Use Recording Rules

Use recording rules to pre-compute expensive queries, so dashboards and alerts read the stored result instead of re-evaluating the expression on every query:

groups:
- name: recording_rules
  interval: 30s
  rules:
  - record: instance:node_cpu:rate5m
    expr: sum without (cpu, mode) (rate(node_cpu_seconds_total[5m]))
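Dashboards and alerts can then reference the pre-computed series directly; the instance value below is an example:

```promql
# Cheap lookup of the pre-computed series instead of re-running rate()
instance:node_cpu:rate5m{instance="node-1:9100"}
```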

Filter Early

Apply label filters early in queries to reduce data:

# Good: Filter in the selector so only matching series are read
sum(rate(http_requests_total{app="backend"}[5m])) by (instance)

# Less efficient: Aggregate every app's series, then filter the
# result downstream (for example, in the dashboard)
sum(rate(http_requests_total[5m])) by (instance, app)

Retention Planning

Configure data retention based on your operational and compliance requirements.

  • Short-term retention of 7-30 days is typically sufficient for real-time monitoring and operational troubleshooting.
  • Medium-term retention of 30-90 days supports trend analysis and helps identify patterns in system behavior.
  • Long-term retention of 90 days or more is necessary for capacity planning, compliance requirements, and historical analysis.

Balance your retention period with storage costs, as longer retention periods increase storage requirements and associated costs.
