
Conversation

@aljoscha (Contributor) commented Dec 22, 2025

@aljoscha force-pushed the design-more-zero-downtime branch from b5035b2 to 262ec33 on January 6, 2026 at 15:55

Comment on a quoted section of the design document:
works. We have to audit whether it would work for both the old and new version
to write at the same time. This is important for builtin tables that are not
derived from catalog state, for example `mz_sessions`, the audit log, and
probably others.

Contributor:

Isn't the audit log derived from catalog state? Are there any audit events we emit not in response to DDL?

In general, it seems problematic for the old environment to keep writing to any builtin tables. AFAIK, the new environment completely truncates the builtin tables (except the audit log) when it promotes, and re-populates them. If the old environment continues to write, we could quickly end up with inconsistent data. For example, entries in `mz_sessions` could be removed twice, or could leak.
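
A minimal sketch of those two failure modes, modeling a builtin table as a plain multiset of rows with multiplicities rather than the real persist-backed collection (all names here are invented for illustration):

```rust
use std::collections::HashMap;

/// Toy model of a builtin table as a multiset: row -> multiplicity.
fn apply(table: &mut HashMap<&'static str, i64>, row: &'static str, diff: i64) {
    *table.entry(row).or_insert(0) += diff;
}

fn main() {
    let mut mz_sessions = HashMap::new();

    // The old envd registered a session before the upgrade.
    apply(&mut mz_sessions, "session-42", 1);

    // The new envd truncates the builtin table on promotion...
    apply(&mut mz_sessions, "session-42", -1);
    // ...and the old envd also retracts the session when it ends.
    apply(&mut mz_sessions, "session-42", -1);

    // The row now has multiplicity -1: the "removed twice" case.
    assert_eq!(mz_sessions["session-42"], -1);

    // Conversely, if the old envd inserts a new session after the
    // truncation, nobody ever retracts it and it leaks.
    apply(&mut mz_sessions, "session-43", 1);
    println!("{mz_sessions:?}");
}
```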

Comment on lines +199 to +200
TODO: Figure out if we want to allow background tasks to keep writing. This
includes, but is not limited to storage usage collection.

Contributor:

Also all the storage-managed collections, right?

Comment on lines +30 to +32
The lived experience for users will be: **no perceived downtime for DQL and
DML, and an error message about DDL not being allowed during version cutover,
which should be on the order of 10s of seconds**.

Contributor:

I wonder if it would be less disruptive to hang DDL indefinitely, instead of returning an error. The client would see a very slow DDL query that eventually fails due to a connection error when the old envd shuts down. Clients should retry on connection errors, and the retry would connect to the new envd and succeed. So instead of erroring DDL we would have slow DDL during an upgrade, which doesn't require any special handling. That is, unless the client has a timeout configured...

Contributor (author):

A bit sneaky, but it could work! Erroring is more honest, but this is only fudging things during the upgrade.
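
A minimal sketch of the "hang instead of error" idea, assuming a tokio runtime, a watch channel carrying the lame-duck flag, and a hypothetical `sequence_ddl` entry point; none of this reflects how environmentd actually structures DDL sequencing:

```rust
use std::time::Duration;
use tokio::sync::watch;

// Once lame-duck mode is entered, DDL requests are parked until the
// process shuts down, so clients only ever see a dropped connection
// (which they typically retry) instead of an explicit error.
async fn sequence_ddl(lame_duck: watch::Receiver<bool>, stmt: &str) -> Result<(), String> {
    if *lame_duck.borrow() {
        // Instead of returning "DDL not allowed during cutover", park the
        // request forever; the connection dies with the old envd.
        std::future::pending::<()>().await;
    }
    println!("executing DDL: {stmt}");
    Ok(())
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(false);

    // Normal operation: DDL goes through.
    sequence_ddl(rx.clone(), "CREATE TABLE t (a int)").await.unwrap();

    // Enter lame-duck mode; subsequent DDL just hangs.
    tx.send(true).unwrap();
    let hung = tokio::time::timeout(
        Duration::from_millis(100),
        sequence_ddl(rx, "CREATE TABLE t2 (a int)"),
    )
    .await;
    assert!(hung.is_err(), "DDL should hang, not error");
}
```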

Comment on lines +185 to +190
and will serve DQL/DML workload off of the catalog snapshot that it has. An
important detail to figure out here is what happens when the old-version
`environmentd` process crashes while we're in a lame-duck phase. If the since
of the catalog shard has advanced, it will not be able to restart and read the
catalog at the old version that it understands. This may require holding back
the since of the catalog shard during upgrades or a similar mechanism.

Contributor:

Is there a reason to assume the old envd would restart faster than the new envd? There is some migration work to be done, but hopefully that doesn't increase startup time significantly, right? If we don't think the old envd can overtake the new envd, then we don't need to worry about ensuring the old envd can start up again.

Contributor (author):

Ah yeah, that sounds interesting!
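
For reference, the "hold back the since of the catalog shard" mechanism from the quoted section could look roughly like an RAII read hold that is only released once promotion succeeds. The types and names below are invented for illustration; the real mechanism would go through persist's since/read-capability machinery:

```rust
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct Timestamp(u64);

/// While this guard is alive, the catalog shard's since must not advance
/// past `as_of`, so a crashed old-version envd can still restart and read
/// the catalog at a version it understands.
struct CatalogReadHold {
    as_of: Timestamp,
}

impl CatalogReadHold {
    fn acquire(as_of: Timestamp) -> Self {
        println!("holding catalog since at {as_of:?}");
        CatalogReadHold { as_of }
    }
}

impl Drop for CatalogReadHold {
    fn drop(&mut self) {
        println!("releasing read hold at {:?}; since may advance", self.as_of);
    }
}

fn main() {
    let upgrade_hold = CatalogReadHold::acquire(Timestamp(100));
    // ... new-version envd boots, migrates, and is promoted here ...
    // If the old envd crashes in the meantime, it can still restart and
    // read the catalog at Timestamp(100).
    drop(upgrade_hold); // only after promotion succeeds
}
```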

Comment on lines +236 to +238
We have to audit and find out which is which. For builtin tables that mirror
catalog state we can use the self-correcting approach that we also use for
materialized views and for a number of builtin storage-managed collections. For

Contributor:

Using the self-correcting approach for builtin tables seems dubious! Consider the case of `mz_sessions`: the old envd likely has a set of active sessions that it thinks should be in there, while the new envd likely has no active sessions. They would fight, with the new envd continually deleting all the contents and the old envd continually inserting them again. When you query `mz_sessions` you might see the expected contents, or an empty table. If we simply stopped writing to the builtin tables from the old envd you would always see an empty `mz_sessions` table, which doesn't seem much worse, but would be much less effort.

Contributor:

Ah, but I guess "simply stop writing in the old envd" isn't sufficient because we also need to ensure that the old envd stops writing before the new envd has started writing to the builtin tables, which I don't think we have a way to do currently.

Contributor (author):

Yes, this section is a bit hand-wavy, but the general idea is that we have to do something to allow multiple envds to collaborate, for zero-downtime deploys but also for other things, so at some point we need to think about it and figure it out.
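
One possible shape for that collaboration, sketched with invented names: fence builtin-table writers by something like the deploy generation, so that once a newer generation registers, writes from older generations are rejected. Nothing below is an actual Materialize API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Fences builtin-table writers by deploy generation: every writer presents
/// the generation it was started with, and once a newer generation has
/// registered, writes from older generations are refused.
struct BuiltinTableWriteFence {
    /// Highest deploy generation that has registered so far.
    current_generation: AtomicU64,
}

impl BuiltinTableWriteFence {
    fn new() -> Self {
        Self { current_generation: AtomicU64::new(0) }
    }

    /// Called by a newly promoted envd to fence out older generations.
    fn register(&self, generation: u64) {
        self.current_generation.fetch_max(generation, Ordering::SeqCst);
    }

    /// A write from an older generation is refused.
    fn try_write(&self, generation: u64, row: &str) -> Result<(), String> {
        if generation < self.current_generation.load(Ordering::SeqCst) {
            return Err(format!("writer at generation {generation} is fenced out"));
        }
        println!("generation {generation} writes {row}");
        Ok(())
    }
}

fn main() {
    let fence = BuiltinTableWriteFence::new();

    fence.register(1); // old envd
    fence.try_write(1, "mz_sessions: session-42").unwrap();

    fence.register(2); // new envd promotes and fences out generation 1
    assert!(fence.try_write(1, "mz_sessions: session-43").is_err());
    fence.try_write(2, "mz_sessions: session-44").unwrap();
}
```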

Comment on lines +276 to +278
I think they are ready, but sources will have a bit of a fight over who gets to
read, for example for Postgres. Kafka is already fine because it already
supports concurrently running ingestion instances.

Contributor:

How does the Postgres thing work today? When a source starts up, it takes over the replication slot and crashes the other source? So both the old and the new source would be in a restart loop but hopefully still making progress?

Seems like we could just shut down all source dataflows when we enter "lame-duck" mode, no?

Contributor (author):

Yep, it's the fighting and crashing. And yes, we could shut them down in lame-duck mode!
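
A toy model of that fight, with `connect_to_replication_slot` standing in for the real replication protocol handling: a Postgres replication slot has a single active consumer at a time, so when the old and new ingestion both want it, one of them repeatedly fails and retries, as described above:

```rust
use std::{thread, time::Duration};

#[derive(Debug)]
enum SlotError {
    /// E.g. Postgres' "replication slot ... is active for PID ..." error.
    AlreadyActive,
}

// Stand-in for connecting to the replication slot; pretend the slot is
// held by the other envd for the first few attempts.
fn connect_to_replication_slot(attempt: u32) -> Result<(), SlotError> {
    if attempt < 3 { Err(SlotError::AlreadyActive) } else { Ok(()) }
}

fn run_ingestion_with_retries() {
    let mut attempt = 0;
    loop {
        match connect_to_replication_slot(attempt) {
            Ok(()) => {
                println!("attempt {attempt}: streaming from the slot");
                return;
            }
            Err(SlotError::AlreadyActive) => {
                // Each version only makes progress in whatever windows it
                // manages to hold the slot; shutting sources down in
                // lame-duck mode would avoid this tug-of-war entirely.
                println!("attempt {attempt}: slot busy, backing off");
                thread::sleep(Duration::from_millis(10));
                attempt += 1;
            }
        }
    }
}

fn main() {
    run_ingestion_with_retries();
}
```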
