Skip to content

Update memory_query and cpu_query for Prometheus#420

Open
tjmoyes wants to merge 5 commits intorobusta-dev:mainfrom
tjmoyes:feature/update-deprecated-prometheus-mem-cpu-queries
Open

Update memory_query and cpu_query for Prometheus#420
tjmoyes wants to merge 5 commits intorobusta-dev:mainfrom
tjmoyes:feature/update-deprecated-prometheus-mem-cpu-queries

Conversation

@tjmoyes
Copy link

@tjmoyes tjmoyes commented Mar 25, 2025

machine_memory_bytes and machine_cpu_cores have been [deprecated] (https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/cluster/node-metrics.md) [as per StackOverflow] (https://stackoverflow.com/questions/63901926/how-to-query-the-total-memory-available-to-kubernetes-nodes)

This issue came up for me when querying Azure Managed Prometheus, as these queries no longer exist there. I've tested these queries on both manually deployed Prometheus as well as Azure Managed Prometheus and the query works and returns the same data.

…ch no longer works in Azure Monitored Prometheus
@CLAassistant
Copy link

CLAassistant commented Mar 25, 2025

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link

coderabbitai bot commented Oct 26, 2025

Walkthrough

Prometheus queries in get_cluster_summary now use node capacity metrics (kube_node_status_capacity{resource="memory"} and kube_node_status_capacity{resource="cpu"}) instead of machine_memory_bytes/machine_cpu_cores. Some kube_pod_container_resource_requests label matchers switched from single to double quotes. Retrieval and return logic remain unchanged.

Changes

Cohort / File(s) Summary
Prometheus metric queries
robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py
Replaced machine_memory_bytes and machine_cpu_cores with kube_node_status_capacity{resource="memory"} and kube_node_status_capacity{resource="cpu"} in get_cluster_summary; changed kube_pod_container_resource_requests label matchers to use double-quoted namespace="kube-system" and resource="memory"/"cpu"; result keys (cluster_memory, cluster_cpu, kube_system_mem_req, kube_system_cpu_req) and parsing unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Update memory_query and cpu_query for Prometheus' directly describes the main change: updating Prometheus queries for memory and CPU metrics.
Description check ✅ Passed The description explains the rationale for the change (deprecated metrics in kube-state-metrics), the issue encountered (queries failing in Azure Managed Prometheus), and testing performed on both platforms.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7afd7d0 and 9988554.

📒 Files selected for processing (1)
  • robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py (1 hunks)

@tjmoyes tjmoyes force-pushed the feature/update-deprecated-prometheus-mem-cpu-queries branch 2 times, most recently from e268dd1 to 7de0fd5 Compare March 26, 2026 16:29
@tjmoyes tjmoyes force-pushed the feature/update-deprecated-prometheus-mem-cpu-queries branch from 7de0fd5 to 8284b08 Compare March 26, 2026 16:33
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py`:
- Around line 253-262: The PromQL matcher construction for cluster filtering is
producing malformed expressions because single_cluster_label is inserted without
a separating comma; update the cluster-label handling so it yields either an
empty string or a comma+space-prefixed matcher (e.g., ', cluster="xxx"') instead
of just removing the comma. Locate the cluster_label / single_cluster_label
logic and change it so single_cluster_label = f', {cluster_label.strip(", ")}'
if cluster_label.strip() else '' (or otherwise ensure a leading ", " is added
when non-empty) and keep the memory_query and cpu_query f-strings unchanged to
produce valid matchers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 52b71b99-4e00-4662-9716-f941e71d5240

📥 Commits

Reviewing files that changed from the base of the PR and between 7de0fd5 and 8284b08.

📒 Files selected for processing (1)
  • robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py

@tjmoyes
Copy link
Author

tjmoyes commented Mar 26, 2026

@coderabbitai resume

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Mar 26, 2026

@tjmoyes Sure! Resuming reviews and kicking off a review of the PR now.

✅ Actions performed

Reviews resumed.


Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py`:
- Around line 253-259: The memory_query and cpu_query use
sum(kube_node_status_capacity{...}) which double-counts nodes when metrics are
scraped from multiple targets; change both queries to first deduplicate by node
then sum the per-node totals, e.g. use sum(sum by (node)
(kube_node_status_capacity{ resource="memory" {cluster_label} })) and the
analogous expression for cpu, ensuring you keep the existing cluster_label
interpolation and update the memory_query and cpu_query strings in
prometheus_metrics_service.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 32a98242-5117-4124-8155-eace6f1d2375

📥 Commits

Reviewing files that changed from the base of the PR and between 7de0fd5 and 433fca8.

📒 Files selected for processing (1)
  • robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py

Comment on lines 253 to 259
memory_query = f"""
sum(max by (instance) (machine_memory_bytes{{ {single_cluster_label} }}))
sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }})
"""

cpu_query = f"""
sum(max by (instance) (machine_cpu_cores{{ {single_cluster_label} }}))
sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }})
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
PROM="${PROMETHEUS_URL:?Set PROMETHEUS_URL (e.g. https://prom.example.com)}"

queries=(
  'sum(kube_node_status_capacity{resource="memory"})'
  'sum(max by (node) (kube_node_status_capacity{resource="memory"}))'
  'sum(kube_node_status_capacity{resource="cpu"})'
  'sum(max by (node) (kube_node_status_capacity{resource="cpu"}))'
)

for q in "${queries[@]}"; do
  echo "QUERY: $q"
  curl -sG "$PROM/api/v1/query" --data-urlencode "query=$q" | jq -r '.data.result[0].value // "no result"'
  echo
done

Repository: robusta-dev/krr

Length of output: 144


🏁 Script executed:

# Read the file at the specified lines
cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '240,270p'

Repository: robusta-dev/krr

Length of output: 1751


🏁 Script executed:

# Check the get_prometheus_cluster_label function referenced in the scratchpad
grep -n "get_prometheus_cluster_label" robusta_krr/core/integrations/prometheus/metrics/base.py | head -5

Repository: robusta-dev/krr

Length of output: 112


🏁 Script executed:

# Find where these queries are used
rg "memory_query|cpu_query" --context 3 -A 2 -B 2 robusta_krr/core/integrations/prometheus/metrics_service/

Repository: robusta-dev/krr

Length of output: 2031


🏁 Script executed:

# Check the implementation of get_prometheus_cluster_label
cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '75,95p'

Repository: robusta-dev/krr

Length of output: 979


🏁 Script executed:

# Look for the class definition and understand the context
grep -n "class.*PrometheusMetricsService" robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py

Repository: robusta-dev/krr

Length of output: 109


🏁 Script executed:

# Search for kube_node_status_capacity usage in the codebase to understand metric structure
rg "kube_node_status_capacity" robusta_krr/

Repository: robusta-dev/krr

Length of output: 393


🏁 Script executed:

# Check if there's any documentation or references to the metric labels
rg "node.*resource.*capacity" robusta_krr/ -A 2 -B 2

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Look for any test data or example Prometheus queries
find robusta_krr -name "*.py" -type f | xargs grep -l "kube_state_metrics\|kube-state-metrics" | head -5

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Search for how this metric is documented or used in comments
rg "kube_node_status_capacity" -B 5 -A 5 robusta_krr/

Repository: robusta-dev/krr

Length of output: 1965


🏁 Script executed:

# Check the structure of the queries - look for any other aggregations in similar contexts
rg "kube_pod_container_resource_requests|kube_node" robusta_krr/core/integrations/prometheus/metrics_service/ -A 3 -B 1 | head -40

Repository: robusta-dev/krr

Length of output: 2096


🏁 Script executed:

# Look at the similar query patterns for kube_system to understand the aggregation style
cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '260,265p'

Repository: robusta-dev/krr

Length of output: 501


🏁 Script executed:

# Check if there are any tests or documentation about this query
rg "get_cluster_summary" robusta_krr/ -A 10 -B 2 | head -50

Repository: robusta-dev/krr

Length of output: 4489


🏁 Script executed:

# Research kube-state-metrics documentation about kube_node_status_capacity metric
# Check if there are any comments or references in the codebase about metric labels
rg "node.*label\|resource.*label" robusta_krr/ | head -20

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Check the kube-state-metrics version or any vendor-specific documentation
find robusta_krr -name "*.md" -o -name "*.txt" | xargs grep -l "kube.state\|prometheus" 2>/dev/null | head -5

Repository: robusta-dev/krr

Length of output: 41


🌐 Web query:

kube_node_status_capacity kube-state-metrics metric labels

💡 Result:

The kube_node_status_capacity metric from kube-state-metrics is a Gauge that represents the total amount of resources available (capacity) for a Kubernetes node. Labels: - node: The node address/name (e.g., node=) - resource: The name of the resource (e.g., cpu, memory, ephemeral_storage, pods, attachable_volumes_, hugepages_) Example values: - cpu= - memory= - ephemeral_storage= - pods= Status: STABLE This metric is sourced from the Kubernetes Node .status.capacity field.

Citations:


🏁 Script executed:

# Check if there are any tests or example configurations
find robusta_krr -name "*test*" -type f -path "*prometheus*" | xargs grep -l "kube_node_status" 2>/dev/null | head -3

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Look for any existing aggregation patterns for node metrics in the codebase
rg "by \(node\)" robusta_krr/ -B 2 -A 2

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Verify there are no other similar problematic sum() aggregations for node metrics
rg "sum\(kube_node" robusta_krr/

Repository: robusta-dev/krr

Length of output: 393


🏁 Script executed:

# Double-check the cluster_label format one more time
cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '79,88p'

Repository: robusta-dev/krr

Length of output: 523


Deduplicate node capacity series before summing.

sum(kube_node_status_capacity...) counts the same node's capacity multiple times when the metric appears in duplicate scrape targets or jobs (e.g., HA kube-state-metrics). Aggregate by node first to deduplicate, then sum.

🔧 Proposed fix
         memory_query = f"""
-            sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }})
+            sum(max by (node) (kube_node_status_capacity{{ resource="memory" {cluster_label} }}))
         """
 
         cpu_query = f"""
-            sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }})
+            sum(max by (node) (kube_node_status_capacity{{ resource="cpu" {cluster_label} }}))
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py`
around lines 253 - 259, The memory_query and cpu_query use
sum(kube_node_status_capacity{...}) which double-counts nodes when metrics are
scraped from multiple targets; change both queries to first deduplicate by node
then sum the per-node totals, e.g. use sum(sum by (node)
(kube_node_status_capacity{ resource="memory" {cluster_label} })) and the
analogous expression for cpu, ensuring you keep the existing cluster_label
interpolation and update the memory_query and cpu_query strings in
prometheus_metrics_service.py.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants