bump patchset to v52 #153
bc2452a to 1a41fef
@olivereanderson please take a brief look. I grouped all your commits and brought them into consecutive order. Once cloud-hypervisor#8029 is merged - what are the implications for our fork? What is your recommendation to keep the patchset working and maintainable? What are your thoughts and ideas?

I plan to backport cloud-hypervisor#8029 as soon as it is merged because the code is simply better.

If possible, I'd prefer to not merge (or backport) anything into gardenlinux before we finish this. But we can plan this together next week as well!

We can definitely merge this PR (v52) first. Let's discuss further next week 🙂
This extends migration to also support paused VMs, preserving the paused state on the destination.

Changes:
- Add CompletePaused protocol command that finalizes migration without resuming the VM on the destination
- Skip the pause step during migration if the VM is already paused
- On migration failure, only restore the running state if the VM was originally running (not paused)

Signed-off-by: Nguyen Dinh Phi <[email protected]>
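The three rules of this commit can be sketched as small decision functions. This is only an illustration: `CompletePaused` is the command named in the commit message, while `VmState`, `FinalCommand::Complete`, and the function names are hypothetical and not taken from the Cloud Hypervisor code.

```rust
/// Run state of the VM on the source when migration starts (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmState {
    Running,
    Paused,
}

/// Final protocol command sent to the destination.
/// `CompletePaused` comes from the commit message; `Complete` is an assumed name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FinalCommand {
    Complete,
    CompletePaused,
}

/// The pause step is only needed if the VM is still running.
fn needs_pause_step(original: VmState) -> bool {
    original == VmState::Running
}

/// A VM that was paused before migration stays paused on the destination.
fn final_command(original: VmState) -> FinalCommand {
    match original {
        VmState::Running => FinalCommand::Complete,
        VmState::Paused => FinalCommand::CompletePaused,
    }
}

/// On failure, resume the source only if it was running before migration.
fn resume_after_failure(original: VmState) -> bool {
    original == VmState::Running
}
```

Keeping the original state as the single input makes it hard for the three decisions to drift apart.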
Adding a paused flag to live_migration() tests; when this flag is set, the VM will be paused before migration is performed. Signed-off-by: Nguyen Dinh Phi <[email protected]>
1d4fdc8 to 5c611f4
A malicious or buggy guest can issue an MSI-X table write with an unexpected size (not 4 or 8 bytes), triggering an assert!() that crashes the VMM process. Replace the assertion with an error log and early return to maintain VMM stability under adversarial guest behavior. Signed-off-by: Anatol Belski <[email protected]>
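A minimal sketch of the pattern described here, assuming a hypothetical `decode_table_write` helper rather than the real MSI-X code: an unexpected write size is logged and ignored instead of hitting an assertion.

```rust
/// Value decoded from a guest write to the MSI-X table (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum TableWrite {
    Dword(u32),
    Qword(u64),
}

/// Decode a guest write. Only 4- and 8-byte accesses are accepted; any other
/// size is logged and rejected so the caller can return early without panicking.
fn decode_table_write(data: &[u8]) -> Option<TableWrite> {
    match data.len() {
        4 => {
            let mut buf = [0u8; 4];
            buf.copy_from_slice(data);
            Some(TableWrite::Dword(u32::from_le_bytes(buf)))
        }
        8 => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(data);
            Some(TableWrite::Qword(u64::from_le_bytes(buf)))
        }
        len => {
            // Stand-in for an error log; the real code would use its logger.
            eprintln!("invalid MSI-X table write size: {} bytes", len);
            None
        }
    }
}
```

A caller would treat `None` as "drop this write" and leave the table unchanged.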
Rate limiting is implemented in the virtio device layer and does not apply to vhost-user devices which delegate I/O handling to an external process. Add validation to reject configurations where vhost_user is enabled along with rate limiting options (bw_size, ops_size, or rate_limit_group) for both disk and network devices. This prevents users from mistakenly configuring rate limiting that would be silently ignored when using vhost-user backends. Signed-off-by: Rob Bradford <[email protected]>
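The validation rule can be sketched as follows. The field names `vhost_user`, `bw_size`, `ops_size`, and `rate_limit_group` come from the commit message; the `DeviceConfig` struct, the `validate` function, and the error text are assumptions made for illustration only.

```rust
/// Simplified device configuration (hypothetical; the real configs differ).
#[derive(Debug, Default)]
struct DeviceConfig {
    vhost_user: bool,
    bw_size: Option<u64>,
    ops_size: Option<u64>,
    rate_limit_group: Option<String>,
}

/// Reject configurations that combine vhost-user with any rate limiting
/// option, since the limiter would otherwise be silently ignored.
fn validate(cfg: &DeviceConfig) -> Result<(), String> {
    let rate_limited =
        cfg.bw_size.is_some() || cfg.ops_size.is_some() || cfg.rate_limit_group.is_some();
    if cfg.vhost_user && rate_limited {
        return Err("rate limiting is not supported with vhost-user".to_string());
    }
    Ok(())
}
```

Failing at configuration time surfaces the mistake to the user instead of letting the device start without the expected limits.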
554b8fc to c11d72b
If this test flakes, it can then cause subsequent invocations to fail as the test has left its special test interfaces alive. Signed-off-by: Rob Bradford <[email protected]>
Booting the VM on these tests takes longer so allow longer before timing out the boot. Signed-off-by: Rob Bradford <[email protected]>
Follow the same pattern as other virtio devices using a bool to check if it needs notification and propagating its own Error enum. Sadly this does still use `anyhow!()` but this does match with the behaviour of the other devices in their implementations. As a side effect we can now remove two errors from the top-level Error enum in virtio-devices as these were only used by this module and those errors had mangled descriptions. Signed-off-by: Rob Bradford <[email protected]>
Use into_iter() for test_list when building tests_to_run. This keeps the collected type as Vec<&PerformanceTest>. Signed-off-by: Muminul Islam <[email protected]>
768a632 to 7ddbe2c
Request::execute and Request::execute_async checked each data descriptor against `disk_nsectors` using the request's fixed start sector. With sector = disk_nsectors-1 and N descriptors of 512 bytes each, every descriptor passed (top = disk_nsectors) but the vectored I/O collectively read/wrote N*512 bytes starting at the last sector — N-1 sectors past EOF. For the io_uring/aio raw backends this lets the guest extend the host disk image beyond its provisioned size, exhausting the host filesystem. For fixed-VHD images (footer at end of file) the same chain overwrites the footer with guest-controlled bytes, corrupting the disk image. Replace the per-descriptor check with a chain-wide check_data_bounds(). Pre-validating the entire request before beginning the operation avoids having to unroll a partial submit. Signed-off-by: Dylan Reid <[email protected]>
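The difference between the two checks can be reproduced with plain arithmetic. The sketch below is not the actual implementation: `check_data_bounds` is the name from the commit message, but its signature, `per_descriptor_check`, and `sectors_for` are assumptions, and descriptors are modeled only by their byte lengths.

```rust
const SECTOR_SIZE: u64 = 512;

/// Number of sectors needed to hold `bytes`, rounding up.
fn sectors_for(bytes: u64) -> u64 {
    bytes / SECTOR_SIZE + if bytes % SECTOR_SIZE != 0 { 1 } else { 0 }
}

/// Old behaviour (sketch): every descriptor is checked against the same
/// fixed start sector, so the cumulative length is never considered.
fn per_descriptor_check(sector: u64, descriptor_lens: &[u32], disk_nsectors: u64) -> bool {
    descriptor_lens.iter().all(|&len| {
        sector
            .checked_add(sectors_for(u64::from(len)))
            .map_or(false, |top| top <= disk_nsectors)
    })
}

/// New behaviour (sketch): sum the whole chain first, then check once.
fn check_data_bounds(
    sector: u64,
    descriptor_lens: &[u32],
    disk_nsectors: u64,
) -> Result<(), String> {
    let mut total_bytes: u64 = 0;
    for &len in descriptor_lens {
        total_bytes = total_bytes
            .checked_add(u64::from(len))
            .ok_or_else(|| "request length overflow".to_string())?;
    }
    let top = sector
        .checked_add(sectors_for(total_bytes))
        .ok_or_else(|| "sector overflow".to_string())?;
    if top > disk_nsectors {
        return Err(format!(
            "request ends at sector {} but the disk has {} sectors",
            top, disk_nsectors
        ));
    }
    Ok(())
}
```

With a 100-sector disk, a start sector of 99, and four 512-byte descriptors, the per-descriptor check accepts the request while the chain-wide check rejects it, which is the case described in the commit message.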
A malicious or buggy guest can write an out-of-bounds value to queue_msix_vector or msix_config. When the device later triggers an interrupt, it indexes into table_entries with the unchecked vector, causing a panic. Validate the vector against the MSI-X table size in both trigger() and notifier() paths, logging a warning and returning early when the vector exceeds the table bounds. Signed-off-by: Anatol Belski <[email protected]>
Verify that firing an interrupt with a queue vector beyond the MSI-X table size returns Ok without panicking. Signed-off-by: Anatol Belski <[email protected]>
Verify that triggering an interrupt when the vector is set to VIRTQ_MSI_NO_VECTOR short-circuits and returns Ok. Signed-off-by: Anatol Belski <[email protected]>
Verify that requesting a notifier with an out-of-bounds MSI-X vector returns None instead of panicking. Signed-off-by: Anatol Belski <[email protected]>
Verify that a valid in-bounds vector with MSI-X enabled successfully triggers the interrupt source group. Signed-off-by: Anatol Belski <[email protected]>
Verify that firing a config change interrupt with msix_config vector beyond the table size returns Ok without panicking. Signed-off-by: Anatol Belski <[email protected]>
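The behaviour covered by the MSI-X fix and its tests can be summarized in a small sketch. `VIRTQ_MSI_NO_VECTOR` is the constant named above (0xffff in the virtio specification); `VectorLookup`, `lookup_vector`, and the simplified `trigger`/`notifier` signatures are hypothetical and only model the bounds handling, not the real interrupt plumbing.

```rust
/// "No vector assigned" marker from the virtio specification.
const VIRTQ_MSI_NO_VECTOR: u16 = 0xffff;

#[derive(Debug, PartialEq, Eq)]
enum VectorLookup {
    NoVector,
    OutOfBounds,
    Entry(usize),
}

/// Validate a guest-provided vector against the MSI-X table size.
fn lookup_vector(vector: u16, table_len: usize) -> VectorLookup {
    if vector == VIRTQ_MSI_NO_VECTOR {
        return VectorLookup::NoVector;
    }
    let index = usize::from(vector);
    if index >= table_len {
        // Stand-in for a warning log.
        eprintln!("MSI-X vector {} exceeds table size {}", index, table_len);
        return VectorLookup::OutOfBounds;
    }
    VectorLookup::Entry(index)
}

/// Returns Ok(true) if an interrupt would be fired, Ok(false) if it is skipped.
fn trigger(vector: u16, table_len: usize) -> Result<bool, String> {
    match lookup_vector(vector, table_len) {
        VectorLookup::Entry(_) => Ok(true),
        VectorLookup::NoVector | VectorLookup::OutOfBounds => Ok(false),
    }
}

/// Returns the table index backing the notifier, or None for an invalid vector.
fn notifier(vector: u16, table_len: usize) -> Option<usize> {
    match lookup_vector(vector, table_len) {
        VectorLookup::Entry(index) => Some(index),
        _ => None,
    }
}
```

Routing both paths through one lookup keeps the bounds rule in a single place.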
Normal libvirt-tests (default suite) are already passing.
The unit tests added in bf3279f built sparse files by writing only at one offset and assuming the surrounding pages stayed unallocated. That breaks on shmem/tmpfs with huge=within_size: kernel 6.10+ added large-folio support to shmem, and on first write the kernel allocates one folio whose order is the largest power-of-two number of pages that fits inside the file size (capped at PMD-size). For a 64 KiB test file the very first pwrite anywhere allocates a 64 KiB folio covering the whole file, so SEEK_HOLE never reports a hole and written_pages_show_as_data_extents, sparse_file_yields_extents_at_written_positions, and single_extent_at_zero_offset all fail. memfd_create lives on shmem too and inherits the same THP policy from /sys/kernel/mm/transparent_hugepage/shmem_enabled, so the problem is not /tmp-specific.

Fix the fixtures, not the production code: build each test file via a new sparse_layout() helper that writes the requested data extents and then fallocate(FALLOC_FL_PUNCH_HOLE)s every gap. PUNCH_HOLE is the explicit "deallocate these pages" syscall and is honored by every Linux filesystem we run tests on (tmpfs, ext4, xfs, btrfs); the kernel splits any large folio overlapping the punched range. The resulting SEEK_DATA/SEEK_HOLE map matches the spec exactly regardless of folio/THP policy.

For single_extent_at_zero_offset the dst side still loses to the folio allocator -- writing 8 KiB into a 64 KiB tmpfs file allocates a 64 KiB folio whether we want it or not -- so the previous meta.blocks()-based sparseness assertion (which tested the filesystem, not our code) is replaced with a sentinel pre-fill: dst starts filled with 0xFE and the post-condition is that bytes outside the source-data extent are still 0xFE. That directly verifies write_region_sparse only touched the data extent without depending on dst-side hole reporting.

Side effect: extent_at_non_zero_src_offset, two_regions_in_same_destination_file_at_dst_offset, and round_trip_sparse_write_then_read previously passed by accident on hosts with mTHP-on-shmem -- their src memfds reported the whole file as data so write_region_sparse silently fell into a dense copy of zeros + data. With sparse_layout() the sources are genuinely sparse and those tests now exercise the sparse path on every host.

Tested on tmpfs (huge=within_size) and ext4 (TMPDIR=/var/tmp); all 9 tests pass on both with no skips.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Dylan Reid <[email protected]>
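The sentinel pre-fill idea can be illustrated without any filesystem at all. In this sketch an in-memory buffer stands in for the destination file and `copy_extent` stands in for `write_region_sparse`; both are simplifications, and the helper names are hypothetical.

```rust
const SENTINEL: u8 = 0xFE;

/// Stand-in for the code under test: writes `data` at `offset` and nothing else.
fn copy_extent(dst: &mut [u8], offset: usize, data: &[u8]) {
    dst[offset..offset + data.len()].copy_from_slice(data);
}

/// Post-condition: every byte outside [offset, offset + len) still holds the sentinel.
fn untouched_outside(dst: &[u8], offset: usize, len: usize) -> bool {
    dst.iter()
        .enumerate()
        .all(|(i, &byte)| (i >= offset && i < offset + len) || byte == SENTINEL)
}
```

A test would pre-fill a 64 KiB destination with `SENTINEL`, write an 8 KiB extent, and assert `untouched_outside`. The assertion depends only on what the code wrote, not on how the filesystem reports holes.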
Regenerate CPU profiles in order to enable machine check architecture (MCA) for non-host CPU profiles, which is required to boot Windows Server. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
These are already displayed as not available to guests via CPUID for non-host CPU profiles, but we forgot to forbid the corresponding MSRs. The profiles we have generated are OK with respect to this oversight because KVM_GET_MSR_INDEX_LIST did not report those MSRs at the time they were generated, but it does now. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
Hardware duty cycling (HDC) does not make sense in the virtualization setting and should thus not be displayed as available to guests. We have already disabled certain HDC aspects via CPUID 0x6 ECX[13], but we forgot to disable the state components which is what we do in this commit. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We have already disabled architectural LBR (last branch record) for CPU profiles, but we forgot to disable the corresponding state components. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
Hardware P-states (HWP) is already disabled for non-host CPU profiles, but we forgot to also disable the associated state components. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We already disabled Processor Trace (PT) for CPU profiles, but forgot to disable the associated state components. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We have already forbidden IA32_PASID, an MSR related to process address space identifiers (PASID), but we forgot to disable the associated state components. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
Bit 56 of VM_ENTRY_HARDWARE_EXCEPTIONS in IA32_VMX_BASIC is only set on rather recent KVM versions. Thus whenever a CPU profile is generated on a machine with a recent Linux kernel, the current inherit policy will lead to the CPU profile being incompatible with deployments on older Linux kernels. This may not be the intention of the person generating the CPU profile, thus we change the policy to `Static(0)` for the time being. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
IA32_XSS (Extended Supervisor State Mask) is only reported via KVM_GET_MSR_INDEX_LIST on rather recent kernels. As a result, CPU profiles generated on a machine with the latest Linux kernel may not work on deployments whose hosts run somewhat older kernels, which may be unintentional. We thus decide to forbid this MSR for now, even though CPUID 0xd.0x1.EAX[3] can inform the guest that the MSR is available. We do not want to force the aforementioned feature bit to 0 because it is also used to report support for XSAVES/XRSTORS. Although not ideal, we consider denying access to IA32_XSS to be acceptable because the 0xd CPUID leaves report all IA32_XSS related state components to be unsupported. There is thus no reason for the guest to be interested in using this MSR. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We have disabled LBR for non-host CPU profiles, but forgot to also do so in the VM-Exit and VM-Entry control MSRs. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We add developer documentation on how to use the CPU profile generation tool. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We will later use flate2 in arch/build.rs to compress CPU profile JSON files at compile time and also later to decompress them at runtime. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We introduce a build.rs build script in the arch crate which automatically constructs the x86_64 CpuProfile enum with one variant per pre-generated CPU profile. In order to keep the binary size in check we also take the opportunity to embed the CPU profile JSON files in compressed form in the binary; they are then decompressed at runtime. We will adapt cpu_profile.rs in the next commit to use the output of build.rs. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
When we introduced our build script we forgot to tell `serde` to (de-)serialize the `CpuProfile` enum in kebab-case, which is a breaking change. Signed-off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
This is needed because everything at our customer was deployed without mTLS. Management software can provide the necessary cert files if it knows that both CH hosts support mTLS already, so we can eventually upgrade the fleet to mTLS and get rid of this commit. On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
This is a temporary measure, as upstream decided on a different name than we did in our fork. On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
…rk)" This reverts commit 3134e961444cd76ca3afc8abf55a8479f86c1e1c.
This is needed because everything at our customer was deployed without mTLS. We need to find a migration path soon, though. On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
This was missing in [0] but is required for proper explicit PCI BDF management, e.g., when a VM is created via libvirt and each device has an explicit BDF. [0] cloud-hypervisor#7965 On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
Restructure CtrlQueue::process() so each command parses its own descriptor layout and returns the used length alongside the status descriptor. This is a behavior-neutral cleanup that prepares follow-up control queue features. On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
Extract virtio-net constructor bookkeeping into a small helper struct and dedicated restore/fresh initialization helpers. This keeps new_with_tap() focused on assembly and makes follow-up feature changes easier to review. On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
In addition to the RARP announcement, advertise VIRTIO_NET_F_GUEST_ANNOUNCE on virtio-net devices and request a guest announcement after migration by setting the announce status bit and raising a config interrupt. Handle the guest announce ACK on the control queue. On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
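The announce handshake can be sketched as a small state machine. The constants follow the virtio specification (feature bit 21 for `VIRTIO_NET_F_GUEST_ANNOUNCE`, status value 2 for `VIRTIO_NET_S_ANNOUNCE`); the `NetAnnounceState` struct and its methods are hypothetical and leave out the actual interrupt delivery and control queue parsing.

```rust
/// Feature bit number from the virtio specification.
const VIRTIO_NET_F_GUEST_ANNOUNCE: u64 = 21;
/// Status flag in the virtio-net config space asking the guest to announce.
const VIRTIO_NET_S_ANNOUNCE: u16 = 2;

/// Minimal model of the device state involved in guest announce (hypothetical).
struct NetAnnounceState {
    acked_features: u64,
    status: u16,
}

impl NetAnnounceState {
    fn guest_announce_negotiated(&self) -> bool {
        self.acked_features & (1u64 << VIRTIO_NET_F_GUEST_ANNOUNCE) != 0
    }

    /// Called after migration. Returns true if the caller should raise a
    /// config interrupt so the guest notices the changed status field.
    fn request_announce(&mut self) -> bool {
        if !self.guest_announce_negotiated() {
            return false;
        }
        self.status |= VIRTIO_NET_S_ANNOUNCE;
        true
    }

    /// Called when the guest sends the announce ACK on the control queue.
    fn handle_announce_ack(&mut self) {
        self.status &= !VIRTIO_NET_S_ANNOUNCE;
    }
}
```

When the feature was not negotiated, `request_announce` returns false and the host-side RARP announcement remains the only notification.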
Add unit tests for the new guest-announce flow in the control queue, virtio-net, and vhost-user-net. The tests cover setting and clearing the announce state, triggering the config interrupt, and disabling the host-side RARP fallback when the guest negotiated VIRTIO_NET_F_GUEST_ANNOUNCE. On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
Preserve migration compatibility with older snapshots by defaulting a missing announce_pending field to false during deserialization, and cover both cases with regression tests. On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
Re-trigger config interrupts for restored pending guest announce requests once the net device is activated. Cover both virtio-net and vhost-user-net with regression tests. On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
Track a runtime announce generation for virtio-net and vhost-user-net so post-migration retry announcers stop after reset or device teardown. This keeps repeated announce rounds within one migration session, while preventing stale retry threads from re-arming VIRTIO_NET_S_ANNOUNCE after the guest already reset, rebooted, or the device was dropped. On-behalf-of: SAP [email protected] Signed-off-by: Sebastian Eydam <[email protected]>
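One way to picture the generation mechanism is a shared counter that retry threads compare against a snapshot taken when the migration session starts. The sketch below is an assumption about the general technique, not the actual implementation; `AnnounceGeneration` and `run_retry_rounds` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared generation counter (hypothetical). Reset or teardown bumps it.
#[derive(Clone, Default)]
struct AnnounceGeneration {
    current: Arc<AtomicU64>,
}

impl AnnounceGeneration {
    /// Taken once when a post-migration announce session starts.
    fn snapshot(&self) -> u64 {
        self.current.load(Ordering::SeqCst)
    }

    /// Called on device reset or teardown to invalidate running retriers.
    fn invalidate(&self) {
        self.current.fetch_add(1, Ordering::SeqCst);
    }

    fn is_current(&self, snapshot: u64) -> bool {
        self.snapshot() == snapshot
    }
}

/// Runs up to `max_rounds` announce rounds, stopping as soon as the
/// generation no longer matches the snapshot. Returns the rounds performed.
fn run_retry_rounds<F: FnMut(u32)>(
    generation: &AnnounceGeneration,
    snapshot: u64,
    max_rounds: u32,
    mut announce: F,
) -> u32 {
    let mut done = 0;
    for round in 0..max_rounds {
        if !generation.is_current(snapshot) {
            break;
        }
        announce(round);
        done += 1;
    }
    done
}
```

Repeated rounds within one session keep the same snapshot, while a reset in between makes every later check fail, so a stale retrier cannot re-arm the announce bit.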
d45b566 to 67574bb
This series bumps the gardenlinux Cloud Hypervisor patchset onto the current base (soon to be released as v52).

You can find an overview of the difficulties during the rebase in this outline document (trivial patches, hard to rebase patches, patches that are now upstream...).

From 248 commits we have in the current `gardenlinux` branch, we are now down to ~158 (when TLS is merged upstream). I expect the v52 release to happen very soon.

Changes & Hints for Reviewers

- (`init A -> ... -> fix A` commits were squashed)
- `pci_device_id` from upstream back to `device_id` to be compatible with us

The result is a shorter and more reviewable branch than `cyberus-github/gardenlinux` while preserving the relevant Gardenlinux behavior on top of the current Cloud Hypervisor base.
Ticket: https://github.com/cobaltcore-dev/cobaltcore/issues/503#issuecomment-4311454443