Best Practices
This guide covers best practices for using Thalassa Prometheus Service effectively, optimizing costs, and ensuring reliable monitoring.
Label Strategy
Establish a consistent label naming strategy across your infrastructure to enable effective querying and alerting. Use labels like env=production or env=staging to distinguish between environments, app=backend or app=frontend to identify applications, team=platform or team=backend to organize by team ownership, and region=nl-01 to indicate geographic location.
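One way to apply such labels consistently is through `external_labels` in each Prometheus server's configuration, so every series it forwards carries them. A minimal sketch with illustrative values (per-application labels such as app= are usually attached at the scrape-target level instead):

```yaml
global:
  external_labels:
    env: production    # distinguishes this server's environment
    team: platform     # team ownership
    region: nl-01      # geographic location
```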
Avoid high-cardinality labels that can cause metric explosion, such as user IDs, request IDs, or timestamps. Limit the number of label combinations to prevent creating excessive time series, which increases storage costs and query complexity. When you need to track high-cardinality data, use recording rules to pre-aggregate metrics before they are stored, reducing the number of unique time series.
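As a sketch of pre-aggregation, a recording rule can collapse a high-cardinality label (here a hypothetical path label on http_requests_total) into a cheaper series:

```yaml
groups:
  - name: pre_aggregation
    rules:
      # Sum away the high-cardinality "path" label (metric and label are illustrative)
      - record: app:http_requests:rate5m
        expr: sum without (path) (rate(http_requests_total[5m]))
```

To actually reduce the series that are stored, the raw high-cardinality metric can then be dropped with write relabeling so only the pre-aggregated series is sent.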
Remote Write Configuration
Optimize Sample Rate
Scrape Intervals
Since pricing is based on the number of samples ingested (billed per 1,000 samples), optimizing your remote write configuration directly impacts costs. Balance the level of detail you need against cost by choosing appropriate scrape intervals: use 15-30 seconds for critical metrics that require frequent monitoring, 30-60 seconds for standard metrics that provide good visibility without excessive sampling, and 60 seconds or more for low-priority metrics where less frequent updates are acceptable.
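Scrape intervals can be set per job in prometheus.yml; the job names and targets below are illustrative:

```yaml
scrape_configs:
  - job_name: critical-api        # critical metrics: frequent scrapes
    scrape_interval: 15s
    static_configs:
      - targets: ['api.internal:9090']
  - job_name: batch-workers       # low-priority metrics: sparse scrapes
    scrape_interval: 60s
    static_configs:
      - targets: ['worker.internal:9100']
```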
Filter out Metrics
Use write relabeling to filter out unnecessary metrics before they are sent to the remote write endpoint, reducing both sample volume and storage costs. Optimize your batch configuration by tuning max_samples_per_send and batch_send_deadline to efficiently batch samples while maintaining acceptable latency.
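The batching parameters live under queue_config in the remote_write block. A sketch with illustrative values (tune them for your own sample volume and latency targets):

```yaml
remote_write:
  - url: https://prometheus.nl-01.thalassa.cloud/api/v1/push
    queue_config:
      max_samples_per_send: 2000   # larger batches mean fewer requests
      batch_send_deadline: 5s      # flush a partial batch after this delay
      max_shards: 10               # cap parallelism to limit memory use
```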
Filter metrics before sending to reduce costs:
```yaml
remote_write:
  - url: https://prometheus.nl-01.thalassa.cloud/api/v1/push
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: '.*_bucket.*|.*_count.*'
        action: drop
      # Keep only important metrics
      - source_labels: [__name__]
        regex: 'node_.*|http_.*|container_.*'
        action: keep
```
Query Optimization
Use Recording Rules
Recording rules pre-compute expensive expressions on a schedule, so dashboards and alerts can query the cheaper pre-computed series instead:
```yaml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      - record: instance:node_cpu:rate5m
        expr: rate(node_cpu_seconds_total[5m])
```
Filter Early
Apply label filters early in queries so Prometheus only reads the series you need:
```promql
# Good: filter first, so only backend series are read
sum(rate(http_requests_total{app="backend"}[5m])) by (instance)

# Less efficient: compute rates and aggregate across every app,
# then discard the unwanted results afterwards
sum(rate(http_requests_total[5m])) by (instance, app)
```
Retention Planning
Configure data retention based on your operational and compliance requirements.
- Short-term retention of 7-30 days is typically sufficient for real-time monitoring and operational troubleshooting.
- Medium-term retention of 30-90 days supports trend analysis and helps identify patterns in system behavior.
- Long-term retention of 90 days or more is necessary for capacity planning, compliance requirements, and historical analysis.
Balance your retention period with storage costs, as longer retention periods increase storage requirements and associated costs.
References
- Getting Started — Basic Prometheus setup
- Remote Write Configuration — Configure remote write
- Querying Metrics — Query API and PromQL queries