Skip to content

fix(collector): handle Aurora's unsupported pg_last_xact_replay_times…#1274

Open
dannotripp wants to merge 1 commit into
prometheus-community:masterfrom
dannotripp:fix/aurora-replication-collector
Open

fix(collector): handle Aurora's unsupported pg_last_xact_replay_times…#1274
dannotripp wants to merge 1 commit into
prometheus-community:masterfrom
dannotripp:fix/aurora-replication-collector

Conversation

@dannotripp

Copy link
Copy Markdown
Contributor

fix(collector): handle Aurora's unsupported pg_last_xact_replay_timestamp

Fixes #1273

Aurora PostgreSQL does not support pg_last_xact_replay_timestamp(), causing
the replication collector to abort every scrape with a fatal error on Aurora
instances.

This change detects the Aurora-specific feature_not_supported error (Postgres
error class 0A) and falls back gracefully: pg_replication_is_replica is
still reported via pg_is_in_recovery(), while the time-based metrics emit
NaN to signal they are unavailable.

A new test TestPgReplicationCollectorAurora covers the fallback path.

…tamp

Aurora PostgreSQL does not support pg_last_xact_replay_timestamp() and
returns a feature_not_supported error (code 0A000) when the replication
collector queries it. This causes the collector to crash on every scrape
for Aurora instances.

When this error is detected, the collector now falls back to a simpler
query that only reads pg_is_in_recovery(), so is_replica is still
reported correctly. The time-based metrics (lag_seconds and
last_replay_seconds) are emitted as NaN to signal that the values are
unavailable, rather than crashing the collection cycle entirely.

The error is identified by checking for a *pq.Error with class "0A"
(feature_not_supported) and a message that contains "Aurora", which
avoids incorrectly suppressing the same error code on standard Postgres.

A new test TestPgReplicationCollectorAurora covers this fallback path.

Signed-off-by: Danno Tripp <danno.tripp@reddit.com>
@dannotripp dannotripp force-pushed the fix/aurora-replication-collector branch from df20274 to b7c82e9 Compare March 5, 2026 00:35

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the pg_replication collector to gracefully handle Amazon Aurora PostgreSQL, which does not support pg_last_xact_replay_timestamp(), preventing the collector from failing on every scrape for Aurora instances.

Changes:

  • Detect Aurora’s feature_not_supported error when the replication query uses pg_last_xact_replay_timestamp().
  • Fall back to a simpler query that still reports pg_replication_is_replica via pg_is_in_recovery().
  • Emit NaN for time-based replication metrics when the underlying function is unsupported, and add a unit test covering this fallback.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
collector/pg_replication.go Adds Aurora-specific error detection and a fallback query; emits NaN for unsupported time-based metrics.
collector/pg_replication_test.go Adds TestPgReplicationCollectorAurora to validate fallback behavior and NaN emission.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


row2 := db.QueryRowContext(ctx, pgReplicationIsReplicaQuery)
if err2 := row2.Scan(&isReplica); err2 != nil {
isReplica = 0
}
}

lagValue := math.NaN()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why NaN over just not emitting the metric? The latter is common in other collectors when the database value is NULL.

megative added a commit to megative/postgres_exporter that referenced this pull request May 17, 2026
Pulls pg_wal out of this PR's scope to keep it focused on the new
aurora_* collectors. Will open the pg_wal fallback as a separate small
bugfix PR (the wal-collector analog of prometheus-community#1274).

Removes:
- pg_wal.go/pg_wal_test.go Aurora-specific fallback (reverted to master)
- isAuroraUnsupportedFunction helper (had no remaining callers)
- Related imports + test

Cleans CHANGELOG / README accordingly.

Signed-off-by: Pavel K <megativ3@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Replication collector crashes on Aurora PostgreSQL

4 participants