Add Amazon Aurora PostgreSQL support#1308
Conversation
Adds two new collectors exposing Aurora-specific replication state and fixes two existing collectors that fatally abort every scrape on Aurora. New collectors (both disabled by default) aurora_replica_status — sourced from aurora_replica_status(), one row per cluster instance, labelled by server_id (writer and each reader): * pg_aurora_replica_lag_milliseconds * pg_aurora_replica_replay_latency_microseconds * pg_aurora_replica_pending_read_ios aurora_global_db_status — sourced from aurora_global_db_status(), labelled by aws_region; useful for Aurora Global Database deployments: * pg_aurora_global_db_durability_lag_milliseconds * pg_aurora_global_db_rpo_lag_milliseconds Enable with --collector.aurora_replica_status and --collector.aurora_global_db_status. Graceful fallback for default-enabled collectors Aurora rejects pg_last_xact_replay_timestamp() and pg_ls_waldir() with a feature_not_supported error (Postgres class "0A"), which currently causes the replication and wal collectors to abort every scrape and emit no metrics — including pg_replication_is_replica, even though pg_is_in_recovery() works on Aurora. Adds an isAuroraUnsupportedFunction helper next to the existing IsNoDataError classifier in collector.go. It only matches *pq.Error with SQLSTATE class 0A whose message contains "Aurora"; every other error falls through to the existing ERROR path. A focused unit test covers the match/no-match boundary across nine error shapes. pg_replication: on the Aurora error, falls back to a simpler query that still reports is_replica via pg_is_in_recovery() and emits NaN for lag_seconds and last_replay_seconds. pg_wal: on the Aurora error, emits NaN for both segments and size_bytes instead of failing the scrape. Signed-off-by: Pavel Konev <megativ3@gmail.com> Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
|
This PR is pretty large as is. I'm in favor of improving our Aurora support and I would like to see this work get merged. I think the best way for that to happen is to split this PR up into smaller, easier to review PRs. I think the new collectors are good candidates to either be solo PRs or maybe small groups that are closely related. For some of the other bug fixes, can you double check open PRs? I think at least one of those bug callouts has an open PR that I gave feedback on recently. I want to give people who already opened PRs the chance to get their work merged before we take a different approach. |
|
Thanks for the feedback! Will definitely double check the open PRs just in case. Just for context - we're currently running this version of the exporter across all our environments, including production, so I'm very motivated to get this upstreamed properly. Also, I would be happy to split it into smaller, more focused PRs - for purposes of future review. I understand there's huge amount of changes but I believe this is a big step forward for exporter functionality since there's no proper aurora support in other exporters as well. I decided to patch it since postgres exporter is a common used solution I'm working with for a long years, and I think it would help community a lot. Could you please specify what can I exactly do to merge this PR into the upstream? |
| includeDatabases = kingpin.Flag("include-databases", "A list of databases to include when autoDiscoverDatabases is enabled (DEPRECATED)").Default("").Envar("PG_EXPORTER_INCLUDE_DATABASES").String() | ||
| metricPrefix = kingpin.Flag("metric-prefix", "A metric prefix can be used to have non-default (not \"pg\") prefixes for each of the metrics").Default("pg").Envar("PG_EXPORTER_METRIC_PREFIX").String() | ||
| collectionTimeout = kingpin.Flag("collection-timeout", "Timeout for collecting the statistics when the database is slow").Default("1m").Envar("PG_EXPORTER_COLLECTION_TIMEOUT").String() | ||
| auroraEnabled = kingpin.Flag("aurora.enabled", "Enable Amazon Aurora PostgreSQL support. When set, the exporter detects Aurora on each instance and turns Aurora-specific collectors on by default. Has no effect on non-Aurora instances.").Default("false").Envar("PG_EXPORTER_AURORA_ENABLED").Bool() |
There was a problem hiding this comment.
We generally try and avoid this kind of feature flag. It's preferred to try and detect the version of the database and use appropriate collectors when detected.
See the version field on the instance struct.
There was a problem hiding this comment.
Done in 57a79ec — --aurora.enabled is deleted now (along with the env var and the three WithX helpers).
Mirrors the version pattern you pointed at: instance.isAurora is set in setup() via a one-shot SELECT aurora_version() (sync.Once, so one probe per process), and aurora_* collectors check it in Update()
the same way others check version.
Per-collector --collector.aurora_X toggles stay default-disabled (same as stat_checkpointer, also version-gated).
Seems like should be OK for now
Checked — #1274 is already covers the |
Address @SuperQ feedback on prometheus-community#1308: prefer detecting the database flavor on the instance struct (mirroring `version`) instead of a top-level feature flag. - setup() always runs detectAurora() via sync.Once (one probe per process) - aurora_* collectors stay default-disabled, gated only by --collector.aurora_X - removes --aurora.enabled flag, PG_EXPORTER_AURORA_ENABLED env var, WithAuroraEnabled, EnableAuroraCollectors, WithProbeAuroraEnabled Signed-off-by: Pavel K <megativ3@gmail.com>
Signed-off-by: Pavel K <megativ3@gmail.com>
Address @SuperQ feedback on prometheus-community#1308: prefer detecting the database flavor on the instance struct (mirroring `version`) instead of a top-level feature flag. - setup() always runs detectAurora() via sync.Once (one probe per process) - aurora_* collectors stay default-disabled, gated only by --collector.aurora_X - removes --aurora.enabled flag, PG_EXPORTER_AURORA_ENABLED env var, WithAuroraEnabled, EnableAuroraCollectors, WithProbeAuroraEnabled Signed-off-by: Pavel K <megativ3@gmail.com>
0b55691 to
115de7b
Compare
|
@sysadmind — how would you suggest splitting this up? One PR per Aurora collector, or some other grouping? |
Pulls pg_wal out of this PR's scope to keep it focused on the new aurora_* collectors. Will open the pg_wal fallback as a separate small bugfix PR (the wal-collector analog of prometheus-community#1274). Removes: - pg_wal.go/pg_wal_test.go Aurora-specific fallback (reverted to master) - isAuroraUnsupportedFunction helper (had no remaining callers) - Related imports + test Cleans CHANGELOG / README accordingly. Signed-off-by: Pavel K <megativ3@gmail.com>
|
I think that once per collector is ideal. |
@sysadmind first PR of the series is ready: #1311 (aurora_replica_status + Aurora auto-detection foundation). |
Motivation
Amazon Aurora PostgreSQL is a very common deployment target for this
exporter, but the support today is at best lossy:
replicationcollector fails every scrape on Aurora. Aurorarejects
pg_last_xact_replay_timestamp()withfeature_not_supported,so the collector returns an error, emits no replication metrics for
that scrape interval, and writes the same error to the exporter log on
every tick. This is tracked in Bug: Replication collector crashes on Aurora PostgreSQL #1273; a user there reports the bug
hitting "over three hundred databases".
walcollector fails the same way. Aurora does not supportpg_ls_waldir(), so the standard WAL metrics also blow up on everyscrape with no graceful path.
storage replication lag, Optimized Reads cache, per-database
commit-latency / DML-activity stats, and Global Database state via
Aurora-only built-in functions (
aurora_replica_status(),aurora_stat_database(),aurora_global_db_status(), …). None ofthese are observable through the standard
pg_*views, so operatorson Aurora typically scrape them via ad-hoc custom queries or stitch in
CloudWatch metrics by hand.
This exporter already ships
README-RDS.mdand Aurora-related issuesrecur in the tracker (#611, #1273), so the user base is real. Putting
Aurora support upstream — behind one opt-in flag, with the existing
non-Aurora behaviour fully preserved — avoids forks duplicating the same
work and gives Aurora operators a single tested integration.
Summary
--aurora.enabled(env:PG_EXPORTER_AURORA_ENABLED,default
false). When set, the exporter runsSELECT aurora_version()once per instance to detect Aurora and flips the default state of every
aurora_*collector to enabled. Per-collector--collector.aurora_X/--no-collector.aurora_Xflags still win.aurora_replica_status,aurora_global_db_status,aurora_global_db_instance_status,aurora_stat_database,aurora_stat_bgwriter,aurora_stat_commit_latency,aurora_stat_dml_activity,aurora_stat_optimized_reads_cache,aurora_stat_logical_wal_cache. All disabled by default. On non-Auroraservers they short-circuit to
ErrNoData(no log noise, no extraqueries).
replicationandwal: when the underlyingfunction comes back with PostgreSQL error class
0A(
feature_not_supported) and the message contains"Aurora", thecollector emits NaN for the time-/WAL-directory-based metrics and still
emits
pg_replication_is_replicaviapg_is_in_recovery(). FixesBug: Replication collector crashes on Aurora PostgreSQL #1273. The fingerprint is narrow on purpose — non-Aurora users see no
behaviour change.
Design notes
--aurora.enabled=false(default), the
aurora_version()probe is skipped entirely and everyaurora_*collector is a no-op. The flag must be opt-in.*auroraProbecarriesa
sync.Onceand is shared by everyinstance.copy()made per scrape.--aurora.enabledis a top-level flag rather than per-collectorbecause it controls a probe that affects multiple collectors at once.
Open questions
Happy to address in-PR — flagging up-front so it doesn't get missed:
(e.g.
aurora_replica_lag_milliseconds,aurora_stat_commit_latency_microseconds_total) because they mirrorthe underlying column names. If you'd prefer base units in seconds
(matching the pattern in
pg_stat_statements), happy to convert.Test plan
go build ./...,go vet ./...,go test ./...all clean.*_test.go) coveringhappy path with
isAurora=true,ErrNoDataonisAurora=false,empty result set, and
excludeDatabasesfiltering where applicable.pg_replication/pg_waltests cover both the standard pathand the Aurora-unsupported-function fallback.
graceful-degradation behaviour.
## main / (unreleased).Closes #1273.
This is an AI assisted PR.