
Conversation

@aljoscha (Contributor) commented Dec 22, 2025

@aljoscha force-pushed the design-more-zero-downtime branch from b5035b2 to 262ec33 on January 6, 2026 at 15:55

Comment on a quoted section of the design document:
works. We have to audit whether it would work for both the old and new version
to write at the same time. This is important for builtin tables that are not
derived from catalog state, for example `mz_sessions`, the audit log, and
probably others.

Contributor:

Isn't the audit log derived from catalog state? Are there any audit events we emit not in response to DDL?

In general, it seems problematic for the old environment to keep writing to any builtin tables. AFAIK, the new environment completely truncates the builtin tables (except the audit log) when it promotes, and re-populates them. If the old environment continues to write, we could quickly end up with inconsistent data. For example, entries in `mz_sessions` could be removed twice, or could leak.
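
A minimal sketch of those two failure modes, modeling a builtin table as a plain multiset of rows with multiplicities rather than the real persist-backed collection (all names here are invented for illustration):

```rust
use std::collections::HashMap;

/// Toy model of a builtin table as a multiset: row -> multiplicity.
fn apply(table: &mut HashMap<&'static str, i64>, row: &'static str, diff: i64) {
    *table.entry(row).or_insert(0) += diff;
}

fn main() {
    let mut mz_sessions = HashMap::new();

    // The old envd registered a session before the upgrade.
    apply(&mut mz_sessions, "session-42", 1);

    // The new envd truncates the builtin table on promotion...
    apply(&mut mz_sessions, "session-42", -1);
    // ...and the old envd also retracts the session when it ends.
    apply(&mut mz_sessions, "session-42", -1);

    // The row now has multiplicity -1: the "removed twice" case.
    assert_eq!(mz_sessions["session-42"], -1);

    // Conversely, if the old envd inserts a new session after the
    // truncation, nobody ever retracts it and it leaks.
    apply(&mut mz_sessions, "session-43", 1);
    println!("{mz_sessions:?}");
}
```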

Comment on lines +199 to +200
TODO: Figure out if we want to allow background tasks to keep writing. This
includes, but is not limited to storage usage collection.

Contributor:

Also all the storage-managed collections, right?

Comment on lines +30 to +32
The lived experience for users will be: **no perceived downtime for DQL and
DML, and an error message about DDL not being allowed during version cutover,
which should be on the order of 10s of seconds**.

Contributor:

I wonder if it would be less disruptive to hang DDL indefinitely, instead of returning an error. The client would see a very slow DDL query that eventually fails due to a connection error when the old envd shuts down. Clients should retry on connection errors, and the retry would connect to the new envd and succeed. So instead of erroring DDL we would have slow DDL during an upgrade, which doesn't require any special handling. That is, unless the client has a timeout configured...

Contributor (author):

A bit sneaky, but it could work! Erroring is more honest, but this is only fudging things during the upgrade.
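
A minimal sketch of the "hang instead of error" idea, assuming a tokio runtime, a watch channel carrying the lame-duck flag, and a hypothetical `sequence_ddl` entry point; none of this reflects how environmentd actually structures DDL sequencing:

```rust
use std::time::Duration;
use tokio::sync::watch;

// Once lame-duck mode is entered, DDL requests are parked until the
// process shuts down, so clients only ever see a dropped connection
// (which they typically retry) instead of an explicit error.
async fn sequence_ddl(lame_duck: watch::Receiver<bool>, stmt: &str) -> Result<(), String> {
    if *lame_duck.borrow() {
        // Instead of returning "DDL not allowed during cutover", park the
        // request forever; the connection dies with the old envd.
        std::future::pending::<()>().await;
    }
    println!("executing DDL: {stmt}");
    Ok(())
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(false);

    // Normal operation: DDL goes through.
    sequence_ddl(rx.clone(), "CREATE TABLE t (a int)").await.unwrap();

    // Enter lame-duck mode; subsequent DDL just hangs.
    tx.send(true).unwrap();
    let hung = tokio::time::timeout(
        Duration::from_millis(100),
        sequence_ddl(rx, "CREATE TABLE t2 (a int)"),
    )
    .await;
    assert!(hung.is_err(), "DDL should hang, not error");
}
```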

Comment on lines +185 to +190
and will serve DQL/DML workload off of the catalog snapshot that it has. An
important detail to figure out here is what happens when the old-version
`environmentd` process crashes while we're in a lame-duck phase. If the since
of the catalog shard has advanced, it will not be able to restart and read the
catalog at the old version that it understands. This may require holding back
the since of the catalog shard during upgrades or a similar mechanism.

Contributor:

Is there a reason to assume the old envd would restart faster than the new envd? There is some migration work to be done, but hopefully that doesn't increase startup time significantly, right? If we don't think the old envd can overtake the new envd, then we don't need to worry about ensuring the old envd can start up again.

Contributor (author):

Ah yeah, that sounds interesting!
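
For reference, the "hold back the since of the catalog shard" mechanism from the quoted section could look roughly like an RAII read hold that is only released once promotion succeeds. The types and names below are invented for illustration; the real mechanism would go through persist's since/read-capability machinery:

```rust
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct Timestamp(u64);

/// While this guard is alive, the catalog shard's since must not advance
/// past `as_of`, so a crashed old-version envd can still restart and read
/// the catalog at a version it understands.
struct CatalogReadHold {
    as_of: Timestamp,
}

impl CatalogReadHold {
    fn acquire(as_of: Timestamp) -> Self {
        println!("holding catalog since at {as_of:?}");
        CatalogReadHold { as_of }
    }
}

impl Drop for CatalogReadHold {
    fn drop(&mut self) {
        println!("releasing read hold at {:?}; since may advance", self.as_of);
    }
}

fn main() {
    let upgrade_hold = CatalogReadHold::acquire(Timestamp(100));
    // ... new-version envd boots, migrates, and is promoted here ...
    // If the old envd crashes in the meantime, it can still restart and
    // read the catalog at Timestamp(100).
    drop(upgrade_hold); // only after promotion succeeds
}
```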

Comment on lines +236 to +238
We have to audit and find out which is which. For builtin tables that mirror
catalog state we can use the self-correcting approach that we also use for
materialized views and for a number of builtin storage-managed collections. For

Contributor:

Using the self-correcting approach for builtin tables seems dubious! Consider the case of `mz_sessions`: the old envd likely has a set of active sessions that it thinks should be in there, while the new envd likely has no active sessions. They would fight, with the new envd continually deleting all the contents and the old envd continually inserting them again. When you query `mz_sessions` you might see the expected contents, or an empty table. If we simply stopped writing to the builtin tables from the old envd you would always see an empty `mz_sessions` table, which doesn't seem much worse, but would be much less effort.

Contributor:

Ah, but I guess "simply stop writing in the old envd" isn't sufficient because we also need to ensure that the old envd stops writing before the new envd has started writing to the builtin tables, which I don't think we have a way to do currently.

Contributor (author):

Yes, this section is a bit hand-wavy, but the general idea is that we have to do something to allow multiple envds to collaborate, for zero-downtime deploys but also for other things, so at some point we need to think about it and figure it out.
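
One possible shape for that collaboration, sketched with invented names: fence builtin-table writers by something like the deploy generation, so that once a newer generation registers, writes from older generations are rejected. Nothing below is an actual Materialize API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Fences builtin-table writers by deploy generation: every writer presents
/// the generation it was started with, and once a newer generation has
/// registered, writes from older generations are refused.
struct BuiltinTableWriteFence {
    /// Highest deploy generation that has registered so far.
    current_generation: AtomicU64,
}

impl BuiltinTableWriteFence {
    fn new() -> Self {
        Self { current_generation: AtomicU64::new(0) }
    }

    /// Called by a newly promoted envd to fence out older generations.
    fn register(&self, generation: u64) {
        self.current_generation.fetch_max(generation, Ordering::SeqCst);
    }

    /// A write from an older generation is refused.
    fn try_write(&self, generation: u64, row: &str) -> Result<(), String> {
        if generation < self.current_generation.load(Ordering::SeqCst) {
            return Err(format!("writer at generation {generation} is fenced out"));
        }
        println!("generation {generation} writes {row}");
        Ok(())
    }
}

fn main() {
    let fence = BuiltinTableWriteFence::new();

    fence.register(1); // old envd
    fence.try_write(1, "mz_sessions: session-42").unwrap();

    fence.register(2); // new envd promotes and fences out generation 1
    assert!(fence.try_write(1, "mz_sessions: session-43").is_err());
    fence.try_write(2, "mz_sessions: session-44").unwrap();
}
```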

Comment on lines +276 to +278
I think they are ready, but sources will have a bit of a fight over who gets to
read, for example for Postgres. Kafka is already fine because it already
supports concurrently running ingestion instances.

Contributor:

How does the Postgres thing work today? When a source starts up, it takes over the replication slot and crashes the other source? So both the old and the new source would be in a restart loop but hopefully still making progress?

Seems like we could just shut down all source dataflows when we enter "lame-duck" mode, no?

Contributor (author):

Yep, it's the fighting and crashing. And yes, we could shut them down in lame-duck mode!
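
A toy model of that fight, with `connect_to_replication_slot` standing in for the real replication protocol handling: a Postgres replication slot has a single active consumer at a time, so when the old and new ingestion both want it, one of them repeatedly fails and retries, as described above:

```rust
use std::{thread, time::Duration};

#[derive(Debug)]
enum SlotError {
    /// E.g. Postgres' "replication slot ... is active for PID ..." error.
    AlreadyActive,
}

// Stand-in for connecting to the replication slot; pretend the slot is
// held by the other envd for the first few attempts.
fn connect_to_replication_slot(attempt: u32) -> Result<(), SlotError> {
    if attempt < 3 { Err(SlotError::AlreadyActive) } else { Ok(()) }
}

fn run_ingestion_with_retries() {
    let mut attempt = 0;
    loop {
        match connect_to_replication_slot(attempt) {
            Ok(()) => {
                println!("attempt {attempt}: streaming from the slot");
                return;
            }
            Err(SlotError::AlreadyActive) => {
                // Each version only makes progress in whatever windows it
                // manages to hold the slot; shutting sources down in
                // lame-duck mode would avoid this tug-of-war entirely.
                println!("attempt {attempt}: slot busy, backing off");
                thread::sleep(Duration::from_millis(10));
                attempt += 1;
            }
        }
    }
}

fn main() {
    run_ingestion_with_retries();
}
```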
