Skip to content

Add Amazon Aurora PostgreSQL support#1308

Open
megative wants to merge 11 commits into
prometheus-community:masterfrom
megative:feat/aurora-collectors
Open

Add Amazon Aurora PostgreSQL support#1308
megative wants to merge 11 commits into
prometheus-community:masterfrom
megative:feat/aurora-collectors

Conversation

@megative

@megative megative commented May 13, 2026

Copy link
Copy Markdown

Motivation

Amazon Aurora PostgreSQL is a very common deployment target for this
exporter, but the support today is at best lossy:

  • replication collector fails every scrape on Aurora. Aurora
    rejects pg_last_xact_replay_timestamp() with feature_not_supported,
    so the collector returns an error, emits no replication metrics for
    that scrape interval, and writes the same error to the exporter log on
    every tick. This is tracked in Bug: Replication collector crashes on Aurora PostgreSQL #1273; a user there reports the bug
    hitting "over three hundred databases".
  • wal collector fails the same way. Aurora does not support
    pg_ls_waldir(), so the standard WAL metrics also blow up on every
    scrape with no graceful path.
  • Aurora-specific subsystems are invisible. Aurora exposes its
    storage replication lag, Optimized Reads cache, per-database
    commit-latency / DML-activity stats, and Global Database state via
    Aurora-only built-in functions (aurora_replica_status(),
    aurora_stat_database(), aurora_global_db_status(), …). None of
    these are observable through the standard pg_* views, so operators
    on Aurora typically scrape them via ad-hoc custom queries or stitch in
    CloudWatch metrics by hand.

This exporter already ships README-RDS.md and Aurora-related issues
recur in the tracker (#611, #1273), so the user base is real. Putting
Aurora support upstream — behind one opt-in flag, with the existing
non-Aurora behaviour fully preserved — avoids forks duplicating the same
work and gives Aurora operators a single tested integration.

Summary

  • New top-level flag --aurora.enabled (env: PG_EXPORTER_AURORA_ENABLED,
    default false). When set, the exporter runs SELECT aurora_version()
    once per instance to detect Aurora and flips the default state of every
    aurora_* collector to enabled. Per-collector --collector.aurora_X /
    --no-collector.aurora_X flags still win.
  • 9 new opt-in collectors covering Aurora-specific functions:
    aurora_replica_status, aurora_global_db_status,
    aurora_global_db_instance_status, aurora_stat_database,
    aurora_stat_bgwriter, aurora_stat_commit_latency,
    aurora_stat_dml_activity, aurora_stat_optimized_reads_cache,
    aurora_stat_logical_wal_cache. All disabled by default. On non-Aurora
    servers they short-circuit to ErrNoData (no log noise, no extra
    queries).
  • Graceful degradation in replication and wal: when the underlying
    function comes back with PostgreSQL error class 0A
    (feature_not_supported) and the message contains "Aurora", the
    collector emits NaN for the time-/WAL-directory-based metrics and still
    emits pg_replication_is_replica via pg_is_in_recovery(). Fixes
    Bug: Replication collector crashes on Aurora PostgreSQL #1273. The fingerprint is narrow on purpose — non-Aurora users see no
    behaviour change.

Design notes

  • Zero overhead on non-Aurora deployments: when --aurora.enabled=false
    (default), the aurora_version() probe is skipped entirely and every
    aurora_* collector is a no-op. The flag must be opt-in.
  • The Aurora probe runs at most once per process: *auroraProbe carries
    a sync.Once and is shared by every instance.copy() made per scrape.
  • --aurora.enabled is a top-level flag rather than per-collector
    because it controls a probe that affects multiple collectors at once.

Open questions

Happy to address in-PR — flagging up-front so it doesn't get missed:

  • Several Aurora metrics carry their AWS-native units in the name
    (e.g. aurora_replica_lag_milliseconds,
    aurora_stat_commit_latency_microseconds_total) because they mirror
    the underlying column names. If you'd prefer base units in seconds
    (matching the pattern in pg_stat_statements), happy to convert.

Test plan

  • go build ./..., go vet ./..., go test ./... all clean.
  • Unit tests added for each new collector (*_test.go) covering
    happy path with isAurora=true, ErrNoData on isAurora=false,
    empty result set, and excludeDatabases filtering where applicable.
  • pg_replication / pg_wal tests cover both the standard path
    and the Aurora-unsupported-function fallback.
  • README updated with the new collectors and a note on the
    graceful-degradation behaviour.
  • CHANGELOG entry added under ## main / (unreleased).

Closes #1273.

This is an AI assisted PR.

megative added 7 commits May 14, 2026 01:10
Adds two new collectors exposing Aurora-specific replication state and
fixes two existing collectors that fatally abort every scrape on Aurora.

New collectors (both disabled by default)

aurora_replica_status — sourced from aurora_replica_status(), one row per
cluster instance, labelled by server_id (writer and each reader):

* pg_aurora_replica_lag_milliseconds
* pg_aurora_replica_replay_latency_microseconds
* pg_aurora_replica_pending_read_ios

aurora_global_db_status — sourced from aurora_global_db_status(), labelled
by aws_region; useful for Aurora Global Database deployments:

* pg_aurora_global_db_durability_lag_milliseconds
* pg_aurora_global_db_rpo_lag_milliseconds

Enable with --collector.aurora_replica_status and
--collector.aurora_global_db_status.

Graceful fallback for default-enabled collectors

Aurora rejects pg_last_xact_replay_timestamp() and pg_ls_waldir() with a
feature_not_supported error (Postgres class "0A"), which currently causes
the replication and wal collectors to abort every scrape and emit no
metrics — including pg_replication_is_replica, even though
pg_is_in_recovery() works on Aurora.

Adds an isAuroraUnsupportedFunction helper next to the existing
IsNoDataError classifier in collector.go. It only matches *pq.Error with
SQLSTATE class 0A whose message contains "Aurora"; every other error
falls through to the existing ERROR path. A focused unit test covers the
match/no-match boundary across nine error shapes.

pg_replication: on the Aurora error, falls back to a simpler query that
still reports is_replica via pg_is_in_recovery() and emits NaN for
lag_seconds and last_replay_seconds.

pg_wal: on the Aurora error, emits NaN for both segments and size_bytes
instead of failing the scrape.

Signed-off-by: Pavel Konev <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
@sysadmind

Copy link
Copy Markdown
Contributor

This PR is pretty large as is. I'm in favor of improving our Aurora support and I would like to see this work get merged. I think the best way for that to happen is to split this PR up into smaller, easier to review PRs. I think the new collectors are good candidates to either be solo PRs or maybe small groups that are closely related. For some of the other bug fixes, can you double check open PRs? I think at least one of those bug callouts has an open PR that I gave feedback on recently. I want to give people who already opened PRs the chance to get their work merged before we take a different approach.

@megative

Copy link
Copy Markdown
Author

Thanks for the feedback! Will definitely double check the open PRs just in case.

Just for context - we're currently running this version of the exporter across all our environments, including production, so I'm very motivated to get this upstreamed properly.

Also, I would be happy to split it into smaller, more focused PRs - for purposes of future review. I understand there's huge amount of changes but I believe this is a big step forward for exporter functionality since there's no proper aurora support in other exporters as well. I decided to patch it since postgres exporter is a common used solution I'm working with for a long years, and I think it would help community a lot.

Could you please specify what can I exactly do to merge this PR into the upstream?

Comment thread cmd/postgres_exporter/main.go Outdated
includeDatabases = kingpin.Flag("include-databases", "A list of databases to include when autoDiscoverDatabases is enabled (DEPRECATED)").Default("").Envar("PG_EXPORTER_INCLUDE_DATABASES").String()
metricPrefix = kingpin.Flag("metric-prefix", "A metric prefix can be used to have non-default (not \"pg\") prefixes for each of the metrics").Default("pg").Envar("PG_EXPORTER_METRIC_PREFIX").String()
collectionTimeout = kingpin.Flag("collection-timeout", "Timeout for collecting the statistics when the database is slow").Default("1m").Envar("PG_EXPORTER_COLLECTION_TIMEOUT").String()
auroraEnabled = kingpin.Flag("aurora.enabled", "Enable Amazon Aurora PostgreSQL support. When set, the exporter detects Aurora on each instance and turns Aurora-specific collectors on by default. Has no effect on non-Aurora instances.").Default("false").Envar("PG_EXPORTER_AURORA_ENABLED").Bool()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally try and avoid this kind of feature flag. It's preferred to try and detect the version of the database and use appropriate collectors when detected.

See the version field on the instance struct.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SuperQ

Done in 57a79ec--aurora.enabled is deleted now (along with the env var and the three WithX helpers).

Mirrors the version pattern you pointed at: instance.isAurora is set in setup() via a one-shot SELECT aurora_version() (sync.Once, so one probe per process), and aurora_* collectors check it in Update()
the same way others check version.

Per-collector --collector.aurora_X toggles stay default-disabled (same as stat_checkpointer, also version-gated).

Seems like should be OK for now

@megative

Copy link
Copy Markdown
Author

This PR is pretty large as is. I'm in favor of improving our Aurora support and I would like to see this work get merged. I think the best way for that to happen is to split this PR up into smaller, easier to review PRs. I think the new collectors are good candidates to either be solo PRs or maybe small groups that are closely related. For some of the other bug fixes, can you double check open PRs? I think at least one of those bug callouts has an open PR that I gave feedback on recently. I want to give people who already opened PRs the chance to get their work merged before we take a different approach.

Checked — #1274 is already covers the pg_replication Aurora fix. So I'll drop that part from my work and let it land there.

megative added a commit to megative/postgres_exporter that referenced this pull request May 16, 2026
Address @SuperQ feedback on prometheus-community#1308: prefer detecting the database flavor
on the instance struct (mirroring `version`) instead of a top-level
feature flag.

- setup() always runs detectAurora() via sync.Once (one probe per process)
- aurora_* collectors stay default-disabled, gated only by --collector.aurora_X
- removes --aurora.enabled flag, PG_EXPORTER_AURORA_ENABLED env var,
  WithAuroraEnabled, EnableAuroraCollectors, WithProbeAuroraEnabled

Signed-off-by: Pavel K <megativ3@gmail.com>
megative added 3 commits May 17, 2026 02:10
Signed-off-by: Pavel K <megativ3@gmail.com>
Address @SuperQ feedback on prometheus-community#1308: prefer detecting the database flavor
on the instance struct (mirroring `version`) instead of a top-level
feature flag.

- setup() always runs detectAurora() via sync.Once (one probe per process)
- aurora_* collectors stay default-disabled, gated only by --collector.aurora_X
- removes --aurora.enabled flag, PG_EXPORTER_AURORA_ENABLED env var,
  WithAuroraEnabled, EnableAuroraCollectors, WithProbeAuroraEnabled

Signed-off-by: Pavel K <megativ3@gmail.com>
@megative megative force-pushed the feat/aurora-collectors branch from 0b55691 to 115de7b Compare May 17, 2026 00:12
@megative

Copy link
Copy Markdown
Author

@sysadmind — how would you suggest splitting this up? One PR per Aurora collector, or some other grouping?

Pulls pg_wal out of this PR's scope to keep it focused on the new
aurora_* collectors. Will open the pg_wal fallback as a separate small
bugfix PR (the wal-collector analog of prometheus-community#1274).

Removes:
- pg_wal.go/pg_wal_test.go Aurora-specific fallback (reverted to master)
- isAuroraUnsupportedFunction helper (had no remaining callers)
- Related imports + test

Cleans CHANGELOG / README accordingly.

Signed-off-by: Pavel K <megativ3@gmail.com>
@sysadmind

Copy link
Copy Markdown
Contributor

I think that once per collector is ideal.

@megative

Copy link
Copy Markdown
Author

I think that once per collector is ideal.

@sysadmind first PR of the series is ready: #1311 (aurora_replica_status + Aurora auto-detection foundation).
Happy to queue the remaining 8 per-collector PRs right after, or should I wait for #1311 to land first so the architecture pattern can settle?

@sysadmind

Copy link
Copy Markdown
Contributor

@megative I reviewed #1311 and based on the volume of feedback, I would recommend we settle the architecture first to save you having to redo any work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Replication collector crashes on Aurora PostgreSQL

3 participants