design: add More Zero-Downtime Upgrades design doc #34602
Conversation
| works. We have to audit whether it would work for both the old and new version
| to write at the same time. This is important for builtin tables that are not
| derived from catalog state, for example `mz_sessions`, the audit log, and
| probably others.
Isn't the audit log derived from catalog state? Are there any audit events we emit that aren't in response to DDL?
In general, it seems problematic for the old environment to keep writing to any builtin tables. Afaik, the new environment completely truncates the builtin tables (except the audit log) when it promotes, and re-populates them. If the old environment continues to write, we could quickly end up with inconsistent data. For example, entries from mz_sessions could be removed twice, or could leak.
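To make the inconsistency concrete, here is a toy sketch in plain Rust (not Materialize code): builtin table contents are the accumulation of `(row, diff)` updates, so a retraction issued by both the truncating new envd and the still-running old envd leaves a negative multiplicity, and a late insert from the old envd leaks a row that nothing ever retracts.

```rust
use std::collections::HashMap;

/// Apply a batch of (row, diff) updates and drop rows that accumulate to zero.
fn apply(table: &mut HashMap<&'static str, i64>, updates: &[(&'static str, i64)]) {
    for &(row, diff) in updates {
        *table.entry(row).or_insert(0) += diff;
    }
    table.retain(|_, diff| *diff != 0);
}

fn main() {
    let mut mz_sessions: HashMap<&'static str, i64> = HashMap::new();

    // The old envd records an active session.
    apply(&mut mz_sessions, &[("session-1", 1)]);

    // The new envd promotes and truncates the builtin table by retracting
    // everything it sees, then re-populates it with its own (empty) state.
    apply(&mut mz_sessions, &[("session-1", -1)]);

    // The old envd, still running, also retracts the session when the
    // connection closes: the retraction is applied twice and the row now
    // accumulates to -1.
    apply(&mut mz_sessions, &[("session-1", -1)]);
    assert_eq!(mz_sessions.get("session-1"), Some(&-1));

    // And if the old envd records a *new* session after the truncation,
    // that row leaks: nothing ever retracts it once the old envd is gone.
    apply(&mut mz_sessions, &[("session-2", 1)]);
    println!("{mz_sessions:?}");
}
```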
| TODO: Figure out if we want to allow background tasks to keep writing. This
| includes, but is not limited to storage usage collection.
Also all the storage-managed collections, right?
| The lived experience for users will be: **no perceived downtime for DQL and
| DML, and an error message about DDL not being allowed during version cutover,
| which should be on the order of 10s of seconds**.
I wonder if it would be less disruptive to indefinitely hang DDL instead of returning an error. The client would see a very slow DDL query that eventually fails due to a connection error when the old envd shuts down. Clients should retry on connection errors, and the retry would connect to the new envd and succeed. So instead of erroring on DDL we would have slow DDL during an upgrade, which doesn't require any special handling. That is, unless the client has a timeout configured...
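A minimal sketch of the "hang instead of error" idea, with hypothetical names and assuming tokio (this is not the actual adapter code): DDL during cutover is parked on a future that never completes, so the only thing the client ever observes for that statement is the connection closing when the old envd exits.

```rust
use std::time::Duration;

enum Statement {
    Ddl(String),
    Dml(String),
}

async fn handle(stmt: Statement, in_cutover: bool) -> String {
    match stmt {
        // DQL/DML keeps working off the catalog snapshot.
        Statement::Dml(sql) => format!("executed: {sql}"),
        Statement::Ddl(sql) if !in_cutover => format!("executed: {sql}"),
        // During cutover, never respond: the connection error at shutdown is
        // the only thing the client ever sees for this statement.
        Statement::Ddl(_) => std::future::pending().await,
    }
}

#[tokio::main]
async fn main() {
    // DML goes through immediately, even during cutover.
    println!("{}", handle(Statement::Dml("INSERT ...".into()), true).await);

    // DDL during cutover hangs; we bound it with a timeout here only to
    // demonstrate. In the real flow the old envd eventually exits and the
    // client's retry lands on the new envd.
    let ddl = handle(Statement::Ddl("CREATE TABLE t ()".into()), true);
    match tokio::time::timeout(Duration::from_millis(100), ddl).await {
        Ok(resp) => println!("{resp}"),
        Err(_) => println!("DDL still pending; client sees a connection error at shutdown"),
    }
}
```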
A bit sneaky, but it could work! Erroring is more honest, but this is only fudging things during the upgrade.
| and will serve DQL/DML workload off of the catalog snapshot that it has. An
| important detail to figure out here is what happens when the old-version
| `environmentd` process crashes while we're in a lame-duck phase. If the since
| of the catalog shard has advanced, it will not be able to restart and read the
| catalog at the old version that it understands. This may require holding back
| the since of the catalog shard during upgrades or a similar mechanism.
Is there a reason to assume the old envd would restart faster than the new envd? There is some migration work to be done, but hopefully that doesn't increase startup time significantly, right? If we don't think the old envd can overtake the new envd, then we don't need to worry about ensuring the old envd can start up again.
ah yeah, that sounds interesting!
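For illustration, a toy sketch of the "holding back the since" mechanism mentioned in the design text, using made-up types rather than the persist API: an outstanding hold taken at the start of the upgrade keeps compaction from advancing the catalog shard's since past a timestamp the old-version envd can still read.

```rust
#[derive(Debug)]
struct CatalogShard {
    /// Earliest timestamp at which the shard is still readable.
    since: u64,
    /// Timestamps that readers have asked us not to compact past.
    holds: Vec<u64>,
}

impl CatalogShard {
    /// Register a hold at `ts`; compaction will not advance `since` past it.
    fn acquire_hold(&mut self, ts: u64) -> usize {
        self.holds.push(ts);
        self.holds.len() - 1
    }

    fn release_hold(&mut self, hold: usize) {
        self.holds.remove(hold);
    }

    /// Compaction wants to advance the since to `target`, but may only go as
    /// far as the earliest outstanding hold allows.
    fn maybe_compact(&mut self, target: u64) {
        let limit = self.holds.iter().copied().min().unwrap_or(target);
        self.since = self.since.max(target.min(limit));
    }
}

fn main() {
    let mut catalog = CatalogShard { since: 100, holds: vec![] };

    // The new-version envd starts its upgrade at ts=100 and takes a hold first.
    let upgrade_hold = catalog.acquire_hold(100);

    // Background compaction keeps running but cannot advance past the hold,
    // so an old-version envd crash-restarting during the lame-duck phase can
    // still read the catalog at ts=100.
    catalog.maybe_compact(150);
    assert_eq!(catalog.since, 100);

    // Once cutover is complete and the old envd is gone, release the hold.
    catalog.release_hold(upgrade_hold);
    catalog.maybe_compact(150);
    assert_eq!(catalog.since, 150);
}
```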
| We have to audit and find out which is which. For builtin tables that mirror
| catalog state we can use the self-correcting approach that we also use for
| materialized views and for a number of builtin storage-managed collections. For
Using the self-correcting approach for builtin tables seems dubious! Consider the case of mz_sessions: the old envd likely has a set of active sessions that it thinks should be in there, while the new envd likely has no active sessions. They would fight, with the new envd continually deleting all the contents and the old envd continually inserting them again. When you query mz_sessions you might see the expected contents, or an empty table. If we simply stopped writing to the builtin tables from the old envd, you would always see an empty mz_sessions table, which doesn't seem much worse, but would be much less effort.
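A toy model of that fight in plain Rust (not Materialize's actual self-correcting machinery): each writer repeatedly diffs its desired contents against the table and appends corrections, so two writers with different desired sets keep undoing each other's writes.

```rust
use std::collections::BTreeSet;

/// One self-correcting writer: emit whatever updates are needed to turn the
/// table's `actual` contents into this writer's `desired` contents.
fn correct(desired: &BTreeSet<&'static str>, actual: &mut BTreeSet<&'static str>) -> Vec<String> {
    let missing: Vec<_> = desired.difference(actual).copied().collect();
    let extra: Vec<_> = actual.difference(desired).copied().collect();
    let mut updates = Vec::new();
    for row in missing {
        actual.insert(row);
        updates.push(format!("+{row}"));
    }
    for row in extra {
        actual.remove(row);
        updates.push(format!("-{row}"));
    }
    updates
}

fn main() {
    // The old envd thinks its active sessions belong in mz_sessions; the
    // freshly promoted new envd has no sessions at all.
    let old_desired = BTreeSet::from(["session-1", "session-2"]);
    let new_desired = BTreeSet::new();

    let mut mz_sessions = old_desired.clone();

    // The two writers take turns "correcting" the table and never converge:
    // readers see either the old sessions or an empty table, depending on
    // who wrote last.
    for round in 0..3 {
        let by_new = correct(&new_desired, &mut mz_sessions);
        let by_old = correct(&old_desired, &mut mz_sessions);
        println!("round {round}: new envd wrote {by_new:?}, old envd wrote {by_old:?}");
    }
}
```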
Ah, but I guess "simply stop writing in the old envd" isn't sufficient because we also need to ensure that the old envd stops writing before the new envd has started writing to the builtin tables, which I don't think we have a way to do currently.
Yes, this section is a bit handwave-y, but the general idea is that we have to do something to allow multiple envds to collaborate. We need that for zero-downtime upgrades but also for other things, so at some point we need to think about it and figure it out.
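For reference, a toy sketch of epoch-based fencing, one standard way to make "the old envd stops writing before the new one starts" enforceable; the types here are made up and this is not a claim about Materialize's existing machinery.

```rust
struct FencedTable {
    current_epoch: u64,
    rows: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum WriteError {
    Fenced { current_epoch: u64 },
}

impl FencedTable {
    /// A new writer (e.g. the promoting envd) claims a higher epoch,
    /// implicitly fencing out every writer with a lower one.
    fn register_writer(&mut self) -> u64 {
        self.current_epoch += 1;
        self.current_epoch
    }

    /// Writes carrying a stale epoch are rejected, so the old envd cannot
    /// keep writing once it has been superseded, even if it hasn't noticed.
    fn write(&mut self, epoch: u64, row: &str) -> Result<(), WriteError> {
        if epoch < self.current_epoch {
            return Err(WriteError::Fenced { current_epoch: self.current_epoch });
        }
        self.rows.push(row.to_string());
        Ok(())
    }
}

fn main() {
    let mut table = FencedTable { current_epoch: 0, rows: Vec::new() };

    let old_envd = table.register_writer();
    assert!(table.write(old_envd, "session-1").is_ok());

    // The new envd promotes and registers, fencing out the old writer.
    let new_envd = table.register_writer();
    assert!(table.write(new_envd, "session-2").is_ok());

    // The old envd hasn't realized it's in lame-duck yet; its write bounces.
    assert_eq!(
        table.write(old_envd, "session-3"),
        Err(WriteError::Fenced { current_epoch: new_envd })
    );

    println!("table contents: {:?}", table.rows);
}
```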
| I think they are ready, but sources will have a bit of a fight over who gets to
| read, for example for Postgres. Kafka is already fine because it already
| supports concurrently running ingestion instances.
How does the postgres thing work today? When a source starts up it takes over the replication slot and crashes the other source? So both the old and the new source would be in a restart loop but hopefully still making progress?
Seems like we could just shut down all source dataflows when we enter "lame duck" mode, no?
Yep, it's the fighting and crashing. And yes, we could shut them down in lame-duck mode!
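Sketching the "shut down source dataflows in lame-duck mode" idea with made-up types (nothing here exists in Materialize): entering lame duck suspends all ingestion dataflows so the old envd stops competing for exclusive upstream resources like a Postgres replication slot, while it keeps serving queries off its snapshot.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum IngestionState {
    Running,
    /// Suspended for the duration of the lame-duck phase; the new envd's
    /// instance of the source is the only one talking to upstream.
    Suspended,
}

struct Controller {
    ingestions: BTreeMap<String, IngestionState>,
    lame_duck: bool,
}

impl Controller {
    fn enter_lame_duck(&mut self) {
        self.lame_duck = true;
        for (name, state) in self.ingestions.iter_mut() {
            *state = IngestionState::Suspended;
            println!("suspended ingestion dataflow for {name}");
        }
    }
}

fn main() {
    let mut controller = Controller {
        ingestions: BTreeMap::from([
            ("pg_source".to_string(), IngestionState::Running),
            ("kafka_source".to_string(), IngestionState::Running),
        ]),
        lame_duck: false,
    };

    // Promotion of the new envd begins: the old envd goes lame-duck and stops
    // ingesting, e.g. releasing the Postgres replication slot instead of
    // crash-looping in a fight with the new envd's source instance.
    controller.enter_lame_duck();
    assert!(controller.lame_duck);
    assert!(controller
        .ingestions
        .values()
        .all(|s| *s == IngestionState::Suspended));
}
```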
Rendered: https://github.com/aljoscha/materialize/blob/design-more-zero-downtime/doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_ha.md