Update memory_query and cpu_query for Prometheus #420

tjmoyes wants to merge 5 commits into robusta-dev:main
Conversation
…ch no longer works in Azure Monitored Prometheus
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py`:
- Around line 253-262: The PromQL matcher construction for cluster filtering is
producing malformed expressions because single_cluster_label is inserted without
a separating comma; update the cluster-label handling so it yields either an
empty string or a comma+space-prefixed matcher (e.g., ', cluster="xxx"') instead
of just removing the comma. Locate the cluster_label / single_cluster_label
logic and change it so single_cluster_label = f', {cluster_label.strip(", ")}'
if cluster_label.strip() else '' (or otherwise ensure a leading ", " is added
when non-empty) and keep the memory_query and cpu_query f-strings unchanged to
produce valid matchers.
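The suggested handling could be sketched as follows. This is a minimal sketch, not the actual code: `make_single_cluster_label` is a hypothetical helper name, and the real logic in `PrometheusMetricsService` may be structured differently.

```python
def make_single_cluster_label(cluster_label: str) -> str:
    """Return '' when there is no cluster filter, or a ', '-prefixed
    matcher (e.g. ', cluster="xxx"') so it can be interpolated directly
    after another matcher without producing malformed PromQL."""
    stripped = cluster_label.strip(", ")
    return f", {stripped}" if stripped else ""

# With a cluster filter, the comma separates the two matchers:
label = make_single_cluster_label('cluster="prod"')
query = f'sum(kube_node_status_capacity{{ resource="memory"{label} }})'
# -> sum(kube_node_status_capacity{ resource="memory", cluster="prod" })

# Without one, the matcher list stays valid:
assert make_single_cluster_label("") == ""
```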
@coderabbitai resume
@coderabbitai review

✅ Actions performed: Reviews resumed. Review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py`:
- Around line 253-259: The memory_query and cpu_query use
sum(kube_node_status_capacity{...}) which double-counts nodes when metrics are
scraped from multiple targets; change both queries to first deduplicate by node
then sum the per-node totals, e.g. use sum(sum by (node)
(kube_node_status_capacity{ resource="memory" {cluster_label} })) and the
analogous expression for cpu, ensuring you keep the existing cluster_label
interpolation and update the memory_query and cpu_query strings in
prometheus_metrics_service.py.
memory_query = f"""
-    sum(max by (instance) (machine_memory_bytes{{ {single_cluster_label} }}))
+    sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }})
"""
cpu_query = f"""
-    sum(max by (instance) (machine_cpu_cores{{ {single_cluster_label} }}))
+    sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }})
"""
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
PROM="${PROMETHEUS_URL:?Set PROMETHEUS_URL (e.g. https://prom.example.com)}"
queries=(
'sum(kube_node_status_capacity{resource="memory"})'
'sum(max by (node) (kube_node_status_capacity{resource="memory"}))'
'sum(kube_node_status_capacity{resource="cpu"})'
'sum(max by (node) (kube_node_status_capacity{resource="cpu"}))'
)
for q in "${queries[@]}"; do
echo "QUERY: $q"
curl -sG "$PROM/api/v1/query" --data-urlencode "query=$q" | jq -r '.data.result[0].value // "no result"'
echo
done
🏁 Script executed:
# Read the file at the specified lines
cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '240,270p'
🏁 Script executed:
# Check the get_prometheus_cluster_label function referenced in the scratchpad
grep -n "get_prometheus_cluster_label" robusta_krr/core/integrations/prometheus/metrics/base.py | head -5
🏁 Script executed:
# Find where these queries are used
rg "memory_query|cpu_query" --context 3 -A 2 -B 2 robusta_krr/core/integrations/prometheus/metrics_service/
🏁 Script executed:
# Check the implementation of get_prometheus_cluster_label
cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '75,95p'
🏁 Script executed:
# Look for the class definition and understand the context
grep -n "class.*PrometheusMetricsService" robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py
🏁 Script executed:
# Search for kube_node_status_capacity usage in the codebase to understand metric structure
rg "kube_node_status_capacity" robusta_krr/
🏁 Script executed:
# Check if there's any documentation or references to the metric labels
rg "node.*resource.*capacity" robusta_krr/ -A 2 -B 2
🏁 Script executed:
# Look for any test data or example Prometheus queries
find robusta_krr -name "*.py" -type f | xargs grep -l "kube_state_metrics\|kube-state-metrics" | head -5
🏁 Script executed:
# Search for how this metric is documented or used in comments
rg "kube_node_status_capacity" -B 5 -A 5 robusta_krr/
🏁 Script executed:
# Check the structure of the queries - look for any other aggregations in similar contexts
rg "kube_pod_container_resource_requests|kube_node" robusta_krr/core/integrations/prometheus/metrics_service/ -A 3 -B 1 | head -40
🏁 Script executed:
# Look at the similar query patterns for kube_system to understand the aggregation style
cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '260,265p'
🏁 Script executed:
# Check if there are any tests or documentation about this query
rg "get_cluster_summary" robusta_krr/ -A 10 -B 2 | head -50
🏁 Script executed:
# Research kube-state-metrics documentation about kube_node_status_capacity metric
# Check if there are any comments or references in the codebase about metric labels
rg "node.*label\|resource.*label" robusta_krr/ | head -20
🏁 Script executed:
# Check the kube-state-metrics version or any vendor-specific documentation
find robusta_krr -name "*.md" -o -name "*.txt" | xargs grep -l "kube.state\|prometheus" 2>/dev/null | head -5
🌐 Web query:
kube_node_status_capacity kube-state-metrics metric labels
💡 Result:
The kube_node_status_capacity metric from kube-state-metrics is a Gauge that represents the total amount of resources available (capacity) for a Kubernetes node.
Labels:
- node: the node address/name
- resource: the name of the resource (e.g., cpu, memory, ephemeral_storage, pods, attachable_volumes_, hugepages_)
Status: STABLE
This metric is sourced from the Kubernetes Node .status.capacity field.
Citations:
- 1: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/cluster/node-metrics.md
- 2: https://github.com/kubernetes/kube-state-metrics/blob/3193891ee442ee9552745d054c8d0140295431f3/docs/node-metrics.md
- 3: https://docs.cloudera.com/management-console/1.5.4/monitoring-metrics/topics/cdppvc_ds_kube_node_status_capacity_trics.html
🏁 Script executed:
# Check if there are any tests or example configurations
find robusta_krr -name "*test*" -type f -path "*prometheus*" | xargs grep -l "kube_node_status" 2>/dev/null | head -3
🏁 Script executed:
# Look for any existing aggregation patterns for node metrics in the codebase
rg "by \(node\)" robusta_krr/ -B 2 -A 2
🏁 Script executed:
# Verify there are no other similar problematic sum() aggregations for node metrics
rg "sum\(kube_node" robusta_krr/
🏁 Script executed:
# Double-check the cluster_label format one more time
cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '79,88p'
Deduplicate node capacity series before summing.
sum(kube_node_status_capacity...) counts the same node's capacity multiple times when the metric appears in duplicate scrape targets or jobs (e.g., HA kube-state-metrics). Aggregate by node first to deduplicate, then sum.
🔧 Proposed fix
memory_query = f"""
- sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }})
+ sum(max by (node) (kube_node_status_capacity{{ resource="memory" {cluster_label} }}))
"""
cpu_query = f"""
- sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }})
+ sum(max by (node) (kube_node_status_capacity{{ resource="cpu" {cluster_label} }}))
"""
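The double-counting the reviewer describes is easy to reproduce with toy data. The snippet below uses hypothetical values and simply mimics what `sum(...)` versus `sum(max by (node) (...))` would compute over duplicated series:

```python
# Duplicate per-node capacity samples, as would be scraped from two HA
# kube-state-metrics replicas (hypothetical numbers, in GB for readability).
samples = [
    {"node": "node-a", "instance": "ksm-0", "value": 16.0},
    {"node": "node-a", "instance": "ksm-1", "value": 16.0},  # duplicate of node-a
    {"node": "node-b", "instance": "ksm-0", "value": 32.0},
]

# sum(kube_node_status_capacity{...}): node-a is counted twice.
naive_total = sum(s["value"] for s in samples)  # 64.0, overstated

# sum(max by (node) (...)): collapse to one series per node, then sum.
per_node: dict[str, float] = {}
for s in samples:
    per_node[s["node"]] = max(per_node.get(s["node"], float("-inf")), s["value"])
deduped_total = sum(per_node.values())  # 48.0, the actual capacity
```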
machine_memory_bytes and machine_cpu_cores have been [deprecated](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/cluster/node-metrics.md), [as noted on StackOverflow](https://stackoverflow.com/questions/63901926/how-to-query-the-total-memory-available-to-kubernetes-nodes).
This issue came up for me when querying Azure Managed Prometheus, where these metrics no longer exist. I've tested the updated queries against both a manually deployed Prometheus and Azure Managed Prometheus; they work and return the same data.
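Combining the replacement metric with the review suggestions, the query construction might end up looking roughly like this. This is a sketch only: `build_capacity_queries` is a hypothetical helper, it assumes `cluster_label` is already either empty or ", "-prefixed, and the actual layout of prometheus_metrics_service.py may differ.

```python
def build_capacity_queries(cluster_label: str) -> tuple[str, str]:
    """PromQL for total cluster memory/CPU capacity, based on the
    non-deprecated kube_node_status_capacity metric, deduplicated per node.
    cluster_label is '' or a ', '-prefixed matcher like ', cluster="prod"'."""
    memory_query = f'sum(max by (node) (kube_node_status_capacity{{ resource="memory"{cluster_label} }}))'
    cpu_query = f'sum(max by (node) (kube_node_status_capacity{{ resource="cpu"{cluster_label} }}))'
    return memory_query, cpu_query

memory_q, cpu_q = build_capacity_queries(', cluster="prod"')
# memory_q -> sum(max by (node) (kube_node_status_capacity{ resource="memory", cluster="prod" }))
```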