bump patchset to v52 #153

Draft
phip1611 wants to merge 304 commits into
cyberus-technology:gardenlinux-next-v52-base from
phip1611:gardenlinux-next-v52

Conversation

@phip1611
Member

@phip1611 phip1611 commented Apr 30, 2026

This series bumps the gardenlinux Cloud Hypervisor patchset onto the current
base (soon to be released as v52).

You can find an overview of the difficulties during the rebase in this outline document (trivial patches, hard to rebase patches, patches that are now upstream...).

From the 248 commits in the current gardenlinux branch, we are now down to ~158 (when TLS is merged upstream). I expect the v52 release to happen very soon.

Changes & Hints for Reviewers

  • The commits that are still here exist under the same name in the old gardenlinux branch
  • I reordered the patchset quite significantly: small standalone commits are mostly moved to the beginning where it makes sense, followed by larger series
  • All commits of a series were consolidated, moved together, and sometimes even squashed (init A -> ... -> fix A commits were squashed)
  • For example, the whole CPU Profiles effort is now a single commit series at the end of our patchset
  • This was by far the toughest patchset rebase we had so far
  • Beware: I am unfortunately fairly sure that I missed minor changes from our gardenlinux branch during the rebase, for example an error message improvement here or there, but nothing major. This is a consequence of how complex the operation was.
  • Changes I had to make relative to upstream to work with our stack:
    • rename pci_device_id from upstream back to device_id to be compatible with us
    • remove mutual TLS (mTLS) (use normal TLS)
  • libvirt pipeline run: https://gitlab.cyberus-technology.de/cyberus/cloud/libvirt/-/merge_requests/194/pipelines

The result is a shorter and more reviewable branch than
cyberus-github/gardenlinux while preserving the relevant Gardenlinux behavior
on top of the current Cloud Hypervisor base.

Ticket: https://github.com/cobaltcore-dev/cobaltcore/issues/503#issuecomment-4311454443

@phip1611 phip1611 self-assigned this Apr 30, 2026
@phip1611 phip1611 force-pushed the gardenlinux-next-v52 branch from bc2452a to 1a41fef Compare April 30, 2026 09:21
@phip1611
Member Author

@olivereanderson please take a brief look. I grouped all your commits and brought them into consecutive order. Once cloud-hypervisor#8029 is merged - what are the implications for our fork? What is your recommendation to keep the patchset working and maintainable? What are your thoughts and ideas?

@olivereanderson

> @olivereanderson please take a brief look. I grouped all your commits and brought them into consecutive order. Once cloud-hypervisor#8029 is merged - what are the implications for our fork? What is your recommendation to keep the patchset working and maintainable? What are your thoughts and ideas?

I plan to backport cloud-hypervisor#8029 as soon as it is merged because the code is simply better.

@phip1611
Member Author

If possible, I'd prefer to not merge (or backport) anything into gardenlinux before we finish this. But we can plan this together next week as well!

@olivereanderson

> If possible, I'd prefer to not merge (or backport) anything into gardenlinux before we finish this. But we can plan this together next week as well!

We can definitely merge this PR (v52) first. Let's discuss further next week 🙂

This extends migration to also support paused VMs, preserving the
paused state on the destination.

Changes:
- Add CompletePaused protocol command that finalizes migration without
  resuming the VM on the destination
- Skip the pause step during migration if the VM is already paused
- On migration failure, only restore the running state if
  the VM was originally running (not paused)

Signed-off-by: Nguyen Dinh Phi <[email protected]>
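
The three changes listed in this commit message can be read as one decision derived from the VM state before migration. The following is only a minimal sketch of that reading: `VmState`, `FinalCommand`, `MigrationPlan`, and `plan_migration` are hypothetical names for illustration, and `Complete` is an assumed name for the pre-existing "finalize and resume" command. Only `CompletePaused` is named in the commit message.

```rust
/// Hypothetical VM states relevant to migration (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VmState {
    Running,
    Paused,
}

/// Final protocol command sent to the destination.
/// `Complete` is an assumed name; `CompletePaused` comes from the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FinalCommand {
    /// Finalize the migration and resume the VM.
    Complete,
    /// Finalize the migration but leave the VM paused.
    CompletePaused,
}

/// Decisions the source side derives from the state before migration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MigrationPlan {
    /// Pause the VM before sending state (skipped if it is already paused).
    pub pause_before_send: bool,
    /// Command used to finalize the migration on the destination.
    pub final_command: FinalCommand,
    /// Resume the VM on the source if the migration fails.
    pub resume_on_failure: bool,
}

/// Builds the plan from the VM state observed before migration starts.
pub fn plan_migration(original: VmState) -> MigrationPlan {
    let was_running = original == VmState::Running;
    MigrationPlan {
        pause_before_send: was_running,
        final_command: if was_running {
            FinalCommand::Complete
        } else {
            FinalCommand::CompletePaused
        },
        resume_on_failure: was_running,
    }
}
```

For a VM that was paused beforehand, `plan_migration(VmState::Paused)` skips the pause step, selects `CompletePaused`, and does not resume on failure.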
Adding a paused flag to live_migration() tests; when this
flag is set, the VM will be paused before migration is
performed.

Signed-off-by: Nguyen Dinh Phi <[email protected]>
@phip1611 phip1611 force-pushed the gardenlinux-next-v52 branch 4 times, most recently from 1d4fdc8 to 5c611f4 Compare May 4, 2026 14:30
weltling and others added 2 commits May 4, 2026 16:08
A malicious or buggy guest can issue an MSI-X table write with an
unexpected size (not 4 or 8 bytes), triggering an assert!() that
crashes the VMM process. Replace the assertion with an error log and
early return to maintain VMM stability under adversarial guest
behavior.

Signed-off-by: Anatol Belski <[email protected]>
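
As a rough illustration of the pattern described above (log and return instead of asserting on guest-controlled input), consider the sketch below. `MsixTable` and its `write` method are simplified, hypothetical stand-ins and not the actual Cloud Hypervisor types; the additional range check is an assumption added to keep the example self-contained.

```rust
/// Hypothetical, simplified MSI-X table backed by raw bytes (illustration only).
pub struct MsixTable {
    pub bytes: Vec<u8>,
}

impl MsixTable {
    /// Handles a guest write to the table.
    ///
    /// Returns `false` and leaves the table unchanged when the access size is
    /// not 4 or 8 bytes, rather than asserting and crashing the process.
    pub fn write(&mut self, offset: usize, data: &[u8]) -> bool {
        if data.len() != 4 && data.len() != 8 {
            eprintln!("invalid MSI-X table write size: {} bytes", data.len());
            return false;
        }
        // Assumption for this sketch: out-of-range writes are rejected too.
        let end = match offset.checked_add(data.len()) {
            Some(end) if end <= self.bytes.len() => end,
            _ => {
                eprintln!("MSI-X table write out of range at offset {offset}");
                return false;
            }
        };
        self.bytes[offset..end].copy_from_slice(data);
        true
    }
}
```

A 3-byte write is logged and ignored, while 4-byte and 8-byte writes inside the table are applied.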
Rate limiting is implemented in the virtio device layer and does not
apply to vhost-user devices which delegate I/O handling to an external
process.

Add validation to reject configurations where vhost_user is enabled
along with rate limiting options (bw_size, ops_size, or
rate_limit_group) for both disk and network devices.

This prevents users from mistakenly configuring rate limiting that would
be silently ignored when using vhost-user backends.

Signed-off-by: Rob Bradford <[email protected]>
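
The validation rule from this commit message could look roughly like the sketch below. The struct layout, the `ValidationError` type, and the `validate` function are assumptions made for illustration; only the option names `vhost_user`, `bw_size`, `ops_size`, and `rate_limit_group` are taken from the message.

```rust
/// Hypothetical, flattened device configuration (illustration only).
#[derive(Default)]
pub struct DeviceConfig {
    pub vhost_user: bool,
    pub bw_size: Option<u64>,
    pub ops_size: Option<u64>,
    pub rate_limit_group: Option<String>,
}

/// Hypothetical validation error for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    /// Rate limiting was requested for a vhost-user device, where it would be ignored.
    RateLimitWithVhostUser,
}

/// Rejects configurations that combine vhost-user with any rate limiting option.
pub fn validate(cfg: &DeviceConfig) -> Result<(), ValidationError> {
    let rate_limited =
        cfg.bw_size.is_some() || cfg.ops_size.is_some() || cfg.rate_limit_group.is_some();
    if cfg.vhost_user && rate_limited {
        return Err(ValidationError::RateLimitWithVhostUser);
    }
    Ok(())
}
```

Failing early at configuration time turns a silently ignored setting into an explicit error for the user.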
@phip1611 phip1611 force-pushed the gardenlinux-next-v52 branch 2 times, most recently from 554b8fc to c11d72b Compare May 4, 2026 18:25
rbradford and others added 4 commits May 4, 2026 21:28
If this test flakes it can then cause subsequent invocations to fail as

the test has left its special test interfaces alive.

Signed-off-by: Rob Bradford <[email protected]>
Booting the VM on these tests takes longer so allow longer before
timing out the boot.

Signed-off-by: Rob Bradford <[email protected]>
Follow the same pattern as other virtio devices using a bool to check if
it needs notification and propagating its own Error enum.

Sadly this does still use `anyhow!()` but this does match with the
behaviour of the other devices in their implementations.

As a side effect we can now remove two errors from the top-level Error
enum in virtio-devices as these were only used by this module and those
errors had mangled descriptions.

Signed-off-by: Rob Bradford <[email protected]>
Use into_iter() for test_list when building tests_to_run.

This keeps the collected type as Vec<&PerformanceTest>.

Signed-off-by: Muminul Islam <[email protected]>
@phip1611 phip1611 force-pushed the gardenlinux-next-v52 branch from 768a632 to 7ddbe2c Compare May 5, 2026 06:17
dgreid and others added 7 commits May 5, 2026 08:21
Request::execute and Request::execute_async checked each data descriptor
against `disk_nsectors` using the request's fixed start sector. With
sector = disk_nsectors-1 and N descriptors of 512 bytes each, every
descriptor passed (top = disk_nsectors) but the vectored I/O
collectively read/wrote N*512 bytes starting at the last sector — N-1
sectors past EOF.

For the io_uring/aio raw backends this lets the guest extend the host
disk image beyond its provisioned size, exhausting the host filesystem.
For fixed-VHD images (footer at end of file) the same chain overwrites
the footer with guest-controlled bytes, corrupting the disk image.

Replace the per-descriptor check with a chain-wide check_data_bounds().
Pre-validating the entire request before beginning the operation avoids
having to unroll a partial submit.

Signed-off-by: Dylan Reid <[email protected]>
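
To make the difference between a per-descriptor and a chain-wide check concrete, here is a minimal sketch. The name `check_data_bounds` comes from the commit message, but its signature, the `BoundsError` type, and the representation of the chain as a slice of descriptor lengths are assumptions for illustration.

```rust
pub const SECTOR_SIZE: u64 = 512;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum BoundsError {
    Overflow,
    PastEndOfDisk,
}

/// Validates the whole descriptor chain at once.
///
/// The lengths of all data descriptors are summed first, so a chain of many
/// small descriptors starting at the last sector is rejected even though
/// each descriptor on its own would fit.
pub fn check_data_bounds(
    sector: u64,
    descriptor_lens: &[u32],
    disk_nsectors: u64,
) -> Result<(), BoundsError> {
    let total_bytes = descriptor_lens
        .iter()
        .try_fold(0u64, |acc, &len| acc.checked_add(u64::from(len)))
        .ok_or(BoundsError::Overflow)?;
    // Round up to whole sectors without risking overflow.
    let total_sectors =
        total_bytes / SECTOR_SIZE + u64::from(total_bytes % SECTOR_SIZE != 0);
    let top = sector
        .checked_add(total_sectors)
        .ok_or(BoundsError::Overflow)?;
    if top > disk_nsectors {
        return Err(BoundsError::PastEndOfDisk);
    }
    Ok(())
}
```

With `disk_nsectors = 100` and `sector = 99`, a single 512-byte descriptor passes, while two 512-byte descriptors are rejected because together they reach one sector past the end of the disk.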
A malicious or buggy guest can write an out-of-bounds value to
queue_msix_vector or msix_config. When the device later triggers
an interrupt, it indexes into table_entries with the unchecked
vector, causing a panic.

Validate the vector against the MSI-X table size in both trigger()
and notifier() paths, logging a warning and returning early when
the vector exceeds the table bounds.

Signed-off-by: Anatol Belski <[email protected]>
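
A minimal sketch of the vector validation follows, under the assumption that interrupt delivery can be reduced to "look up a table index". `TriggerOutcome` and this free-standing `trigger` function are hypothetical simplifications. `VIRTQ_MSI_NO_VECTOR` is the 0xffff "no vector" value from the virtio specification.

```rust
/// "No vector" marker from the virtio specification.
pub const VIRTQ_MSI_NO_VECTOR: u16 = 0xffff;

/// Hypothetical outcome type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum TriggerOutcome {
    /// The interrupt would be delivered through this table entry.
    Delivered(usize),
    /// Nothing was delivered (no vector configured, or vector out of bounds).
    Skipped,
}

/// Resolves a guest-provided vector against the MSI-X table size.
///
/// An out-of-bounds vector is logged and skipped instead of being used as an
/// unchecked index, which is what caused the panic.
pub fn trigger(vector: u16, table_len: usize) -> TriggerOutcome {
    if vector == VIRTQ_MSI_NO_VECTOR {
        return TriggerOutcome::Skipped;
    }
    let index = usize::from(vector);
    if index >= table_len {
        eprintln!("MSI-X vector {index} exceeds table size {table_len}");
        return TriggerOutcome::Skipped;
    }
    TriggerOutcome::Delivered(index)
}
```

The same bounds check would apply wherever the vector is turned into a table index, which the commit message names as the `trigger()` and `notifier()` paths.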
Verify that firing an interrupt with a queue vector beyond the
MSI-X table size returns Ok without panicking.

Signed-off-by: Anatol Belski <[email protected]>
Verify that triggering an interrupt when the vector is set to
VIRTQ_MSI_NO_VECTOR short-circuits and returns Ok.

Signed-off-by: Anatol Belski <[email protected]>
Verify that requesting a notifier with an out-of-bounds MSI-X
vector returns None instead of panicking.

Signed-off-by: Anatol Belski <[email protected]>
Verify that a valid in-bounds vector with MSI-X enabled
successfully triggers the interrupt source group.

Signed-off-by: Anatol Belski <[email protected]>
Verify that firing a config change interrupt with msix_config
vector beyond the table size returns Ok without panicking.

Signed-off-by: Anatol Belski <[email protected]>
@phip1611
Member Author

phip1611 commented May 5, 2026

Normal libvirt-tests (default suite) are already passing.

The unit tests added in bf3279f built sparse files by writing only
at one offset and assuming the surrounding pages stayed unallocated.
That breaks on shmem/tmpfs with huge=within_size: kernel 6.10+ added
large-folio support to shmem, and on first write the kernel allocates
one folio whose order is the largest power-of-two number of pages
that fits inside the file size (capped at PMD-size). For a 64 KiB
test file the very first pwrite anywhere allocates a 64 KiB folio
covering the whole file, so SEEK_HOLE never reports a hole and
written_pages_show_as_data_extents,
sparse_file_yields_extents_at_written_positions, and
single_extent_at_zero_offset all fail. memfd_create lives on shmem
too and inherits the same THP policy from
/sys/kernel/mm/transparent_hugepage/shmem_enabled, so the problem is
not /tmp-specific.

Fix the fixtures, not the production code: build each test file via
a new sparse_layout() helper that writes the requested data extents
and then fallocate(FALLOC_FL_PUNCH_HOLE)s every gap. PUNCH_HOLE is
the explicit "deallocate these pages" syscall and is honored by every
Linux filesystem we run tests on (tmpfs, ext4, xfs, btrfs); the
kernel splits any large folio overlapping the punched range. The
resulting SEEK_DATA/SEEK_HOLE map matches the spec exactly regardless
of folio/THP policy.

For single_extent_at_zero_offset the dst side still loses to the
folio allocator -- writing 8 KiB into a 64 KiB tmpfs file allocates
a 64 KiB folio whether we want it or not -- so the previous
meta.blocks()-based sparseness assertion (which tested the filesystem,
not our code) is replaced with a sentinel pre-fill: dst starts filled
with 0xFE and the post-condition is that bytes outside the
source-data extent are still 0xFE. That directly verifies
write_region_sparse only touched the data extent without depending on
dst-side hole reporting.

Side effect: extent_at_non_zero_src_offset,
two_regions_in_same_destination_file_at_dst_offset, and
round_trip_sparse_write_then_read previously passed by accident on
hosts with mTHP-on-shmem -- their src memfds reported the whole file
as data so write_region_sparse silently fell into a dense copy of
zeros + data. With sparse_layout() the sources are genuinely sparse
and those tests now exercise the sparse path on every host.

Tested on tmpfs (huge=within_size) and ext4 (TMPDIR=/var/tmp); all 9
tests pass on both with no skips.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Dylan Reid <[email protected]>
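
The sentinel pre-fill idea from this commit message can be shown without touching the filesystem. The sketch below is an in-memory analogy only: `copy_data_extents` is a hypothetical stand-in for `write_region_sparse`, extents are assumed to be `(offset, len)` pairs, and no `SEEK_DATA`/`SEEK_HOLE` or `fallocate` calls are involved.

```rust
/// Sentinel byte used to pre-fill the destination.
pub const SENTINEL: u8 = 0xFE;

/// Hypothetical stand-in for a sparse copy: only the listed data extents
/// `(offset, len)` are copied from `src` to the same offsets in `dst`.
pub fn copy_data_extents(src: &[u8], dst: &mut [u8], extents: &[(usize, usize)]) {
    for &(offset, len) in extents {
        dst[offset..offset + len].copy_from_slice(&src[offset..offset + len]);
    }
}

/// Returns true when every destination byte outside the data extents still
/// holds the sentinel, i.e. the copy touched nothing but the data extents.
pub fn untouched_outside(dst: &[u8], extents: &[(usize, usize)]) -> bool {
    dst.iter().enumerate().all(|(i, &byte)| {
        let inside = extents
            .iter()
            .any(|&(offset, len)| i >= offset && i < offset + len);
        inside || byte == SENTINEL
    })
}
```

This checks the behavior of the copy itself rather than how the underlying filesystem reports holes, which is the point of replacing the block-count assertion.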
olivereanderson and others added 29 commits May 7, 2026 19:32
Regenerate CPU profiles in order to enable machine check architecture
(MCA) for non-host CPU profiles which is required to boot Windows
server.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
These are already displayed as not available to guests via CPUID for
non-host CPU profiles, but we forgot to forbid the corresponding MSRs.

The profiles we have generated are OK with respect to this oversight
because KVM_GET_MSR_INDEX_LIST did not report those MSRs at the time
they were generated, but it does now.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
Hardware duty cycling (HDC) does not make sense in the virtualization
setting and should thus not be displayed as available to guests.

We have already disabled certain HDC aspects via CPUID 0x6 ECX[13],
but we forgot to disable the state components which is what we do
in this commit.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
We have already disabled architectural LBR (last branch record) for CPU
profiles, but we forgot to disable the corresponding state components.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
Hardware P-states (HWP) is already disabled for non-host CPU profiles,
but we forgot to also disable the associated state components.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
We already disabled Processor Trace (PT) for CPU profiles, but forgot
to disable the associated state components.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
We have already forbidden IA32_PASID, an MSR related to process
address space identifiers (PASID), but we forgot to disable the
associated state components.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
Bit 56 of VM_ENTRY_HARDWARE_EXCEPTIONS in IA32_VMX_BASIC is only
set on rather recent KVM versions.

Thus whenever a CPU profile is generated on a machine with a recent
Linux kernel, the current inherit policy will lead to the CPU profile
being incompatible on deployments with older Linux kernels. This may
not be the intention of the person generating the CPU profile, thus
we change the policy to `Static(0)` for the time being.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
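
To illustrate the difference between the two policies mentioned above, here is a small sketch. The `Policy` enum, the `apply_policy` function, and the constant name are hypothetical; the real profile generation code may model policies differently. The only facts taken from the commit message are the bit position (56 in IA32_VMX_BASIC) and the switch from inheriting the host value to `Static(0)`.

```rust
/// Bit 56 of IA32_VMX_BASIC (constant name chosen for this sketch).
pub const VMX_BASIC_BIT_56: u64 = 1 << 56;

/// Hypothetical policy for a masked bit field of an MSR value.
#[derive(Clone, Copy, Debug)]
pub enum Policy {
    /// Take whatever the generating host reports.
    Inherit,
    /// Force the masked bits to a fixed value (already shifted into position).
    Static(u64),
}

/// Applies a policy to the bits selected by `mask`, leaving other bits untouched.
pub fn apply_policy(host_value: u64, mask: u64, policy: Policy) -> u64 {
    match policy {
        Policy::Inherit => host_value,
        Policy::Static(bits) => (host_value & !mask) | (bits & mask),
    }
}
```

With `Static(0)` the generated profile no longer depends on whether the kernel of the generating host sets bit 56.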
IA32_XSS (Extended Supervisor State Mask) is only reported via
KVM_GET_MSR_INDEX_LIST on rather recent kernels. As a result, CPU
profiles generated on a machine with the latest Linux kernel may not
work on deployments whose hosts run somewhat older kernels, which
may be unintentional.

We thus decide to forbid this MSR for now, even though
CPUID 0xd.0x1.EAX[3] can inform the guest that the MSR is available.
We do not want to force the aforementioned feature bit to 0 because
it is also used to report support for XSAVES/XRSTORS.

Although not ideal, we consider denying access to IA32_XSS to be
acceptable because the 0xd CPUID leaves report all IA32_XSS related
state components to be unsupported. There is thus no reason for the
guest to be interested in using this MSR.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
We have disabled LBR for non-host CPU profiles, but forgot to also do
so in the VM-Exit and VM-Entry control MSRs.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
We add developer documentation on how to use the CPU profile generation
tool.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
We will later use flate2 in arch/build.rs to compress CPU profile
JSON files at compile time and also later to decompress them at
runtime.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
We introduce a build.rs build script in the arch crate which
automatically constructs the x86_64 CpuProfile enum with one variant
per pre-generated CPU profile.

In order to keep the binary size in check we also take the opportunity
to compress the CPU profile JSON files into the binary which then get
decompressed at runtime.

We will adapt cpu_profile.rs in the next commit to use the output
of build.rs

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
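
The enum generation step could be sketched as plain string generation, as below. This is an assumption-heavy illustration: the helper names, the idea of deriving variants from kebab-case file stems, and the example profile names used in the usage note are all hypothetical, and the compression step is left out entirely.

```rust
/// Converts a kebab-case file stem into a PascalCase variant name.
pub fn variant_name(stem: &str) -> String {
    stem.split('-')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                None => String::new(),
            }
        })
        .collect()
}

/// Generates the source text of an enum with one variant per profile stem.
/// A build script would write this text to a file under OUT_DIR.
pub fn generate_enum(stems: &[&str]) -> String {
    let mut out = String::from("pub enum CpuProfile {\n");
    for stem in stems {
        out.push_str(&format!("    {},\n", variant_name(stem)));
    }
    out.push_str("}\n");
    out
}
```

For example, a hypothetical stem `sapphire-rapids` would map to the variant `SapphireRapids`. Deriving variants from file names means that adding a profile only requires adding a JSON file.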
When we introduced our build script we forgot to tell `serde` to
(de-) serialize the `CpuProfile` enum in kebab-case which is a breaking
change.

Signed-off-by: Oliver Anderson <[email protected]>
On-behalf-of: SAP [email protected]
This is needed as at our customer we deployed everything without mTLS.
Management software can provide the necessary cert files if it knows
that both CH hosts support mTLS already, so we can eventually upgrade
the fleet to mTLS and get rid of this commit.

On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
This is a temporary measure as upstream decided on a different name
than we did in our fork.

On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
…rk)"

This reverts commit 3134e961444cd76ca3afc8abf55a8479f86c1e1c.
This is needed as at our customer we deployed everything without mTLS.
We need to find a migration path soon, though.

On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
This was missing in [0] but is required for proper explicit PCI BDF
management, e.g., when a VM is created via libvirt and each device has
an explicit BDF.

[0] cloud-hypervisor#7965

On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
Restructure CtrlQueue::process() so each command parses its own
descriptor layout and returns the used length alongside the status
descriptor. This is a behavior-neutral cleanup that prepares follow-up
control queue features.

On-behalf-of: SAP [email protected]
Signed-off-by: Sebastian Eydam <[email protected]>
Extract virtio-net constructor bookkeeping into a small helper struct
and dedicated restore/fresh initialization helpers. This keeps
new_with_tap() focused on assembly and makes follow-up feature changes
easier to review.

On-behalf-of: SAP [email protected]
Signed-off-by: Sebastian Eydam <[email protected]>
In addition to the RARP announcement, advertise
VIRTIO_NET_F_GUEST_ANNOUNCE on virtio-net devices and request
a guest announcement after migration by setting the announce status bit
and raising a config interrupt. Handle the guest announce ACK on the
control queue.

On-behalf-of: SAP [email protected]
Signed-off-by: Sebastian Eydam <[email protected]>
Add unit tests for the new guest-announce flow in the control queue,
virtio-net, and vhost-user-net.

The tests cover setting and clearing the announce state, triggering the
config interrupt, and disabling the host-side RARP fallback when the
guest negotiated VIRTIO_NET_F_GUEST_ANNOUNCE.

On-behalf-of: SAP [email protected]
Signed-off-by: Sebastian Eydam <[email protected]>
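
The guest-announce flow described in the two commits above can be sketched as a small state machine. `NetState` and its methods are hypothetical simplifications and not the actual device code. The feature bit number (21) and the status bit value (2) are the ones defined for `VIRTIO_NET_F_GUEST_ANNOUNCE` and `VIRTIO_NET_S_ANNOUNCE` in the virtio specification.

```rust
/// Feature bit number from the virtio specification.
pub const VIRTIO_NET_F_GUEST_ANNOUNCE: u64 = 21;
/// Config status bit from the virtio specification.
pub const VIRTIO_NET_S_ANNOUNCE: u16 = 2;

/// Hypothetical, minimal view of the device state (illustration only).
pub struct NetState {
    pub acked_features: u64,
    pub status: u16,
    /// Counts config interrupts raised, standing in for real delivery.
    pub config_interrupts: u32,
}

impl NetState {
    /// True when the guest negotiated the guest-announce feature.
    pub fn guest_announce_negotiated(&self) -> bool {
        self.acked_features & (1u64 << VIRTIO_NET_F_GUEST_ANNOUNCE) != 0
    }

    /// Requests an announcement after migration.
    ///
    /// Returns `false` when the feature was not negotiated, in which case
    /// the caller would rely on the host-side RARP announcement.
    pub fn request_announce(&mut self) -> bool {
        if !self.guest_announce_negotiated() {
            return false;
        }
        self.status |= VIRTIO_NET_S_ANNOUNCE;
        self.config_interrupts += 1;
        true
    }

    /// Handles the announce ACK from the control queue by clearing the bit.
    pub fn handle_announce_ack(&mut self) {
        self.status &= !VIRTIO_NET_S_ANNOUNCE;
    }
}
```

Starting from a link-up status of 1, a successful request yields status 3 and one config interrupt, and the ACK brings the status back to 1.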
Preserve migration compatibility with older snapshots by defaulting a
missing announce_pending field to false during deserialization, and
cover both cases with regression tests.

On-behalf-of: SAP [email protected]
Signed-off-by: Sebastian Eydam <[email protected]>
Re-trigger config interrupts for restored pending guest announce
requests once the net device is activated. Cover both virtio-net and
vhost-user-net with regression tests.

On-behalf-of: SAP [email protected]
Signed-off-by: Sebastian Eydam <[email protected]>
Track a runtime announce generation for virtio-net and vhost-user-net
so post-migration retry announcers stop after reset or device teardown.
This keeps repeated announce rounds within one migration session, while
preventing stale retry threads from re-arming VIRTIO_NET_S_ANNOUNCE
after the guest already reset, rebooted, or the device was dropped.

On-behalf-of: SAP [email protected]
Signed-off-by: Sebastian Eydam <[email protected]>
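
One way to picture the announce generation is a shared counter that retry announcers capture when they start and compare before each round. The sketch below is an assumption about the general technique, not the actual implementation: `AnnounceGeneration` and `run_retries` are hypothetical names, and a real retry announcer would run on its own thread with delays between rounds.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical shared generation counter (illustration only).
#[derive(Clone, Default)]
pub struct AnnounceGeneration(Arc<AtomicU64>);

impl AnnounceGeneration {
    /// Returns the current generation.
    pub fn current(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }

    /// Called on reset or device teardown; invalidates outstanding retries.
    pub fn bump(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }

    /// True while the captured generation is still the current one.
    pub fn is_current(&self, captured: u64) -> bool {
        self.current() == captured
    }
}

/// Runs up to `rounds` announce attempts and stops as soon as the generation
/// no longer matches the captured one. Returns the number of rounds performed.
pub fn run_retries(
    generation: &AnnounceGeneration,
    captured: u64,
    rounds: u32,
    mut announce: impl FnMut(u32),
) -> u32 {
    let mut done = 0;
    for round in 0..rounds {
        if !generation.is_current(captured) {
            break;
        }
        announce(round);
        done += 1;
    }
    done
}
```

A retry announcer that captured the generation before a reset performs no further rounds afterwards, so it cannot re-arm the announce bit for a guest that has already moved on.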
@phip1611 phip1611 force-pushed the gardenlinux-next-v52 branch from d45b566 to 67574bb Compare May 7, 2026 17:32