On Thu, 2024-07-25 at 17:47 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > Those people included me. I wanted to interrupt all the vCPUs, even the
> > ones which were in userspace at the moment of migration, and have the
> > kernel deal with passing it on to userspace via a different ABI.
> >
> > It ends up being complex and intricate, and requiring a lot of new
> > kernel and userspace support. I gave up on it in the end for snapshots,
> > and didn't go there again for this.
>
> ok I believe you, I am just curious how come you need userspace
> support - what I imagine would live completely in kernel ...

Userspace doesn't even make a system call for gettimeofday() any more; the relevant information is exposed to userspace through the vDSO. If userspace needs to know that the time has been disrupted by LM, then fundamentally either that needs to be exposed directly to userspace as well, or userspace needs to go back to making actual system calls to get the time (which is slow, and not acceptable for the same use cases which care about the time being accurate).

So how do we make the information available in a form that's mappable directly to userspace?

Well, we could have a hypervisor enlightenment, where the guest kernel uses an MSR or hypercall to tell the hypervisor "please write the information to <this> GPA", and provides an address within the vDSO information page. But that isn't nice for Confidential Compute, makes it hard to ever expand the size of the structure, and is much more complex to support consistently across different hypervisors and different architectures.

We *could* attempt to contrive a system where we indeed interrupt *all* vCPUs and the kernel then updates something in the vDSO page before running userspace again. That could work in theory, and *might* be a bit simpler than what we were trying to do for VMGENID/snapshots, but it's still complex, would take an eternity to deploy to actual users, would probably never work for non-Linux guests, and imposes an even higher cost on the guest kernel when LM occurs.

Or there's this method, where the hypervisor puts the information in a shared memory region which is just a PCI BAR, or an ACPI _CRS, or attached to virtio (we really don't care how it's discovered). There's a nit that the region now has to be page-sized, and a guest which has larger pages than the hypervisor expects will have to use a small PTE to map it (or not support that mode). But I think that's reasonable.

Having gone around in circles a few times, I'm fairly sure that exposing a memory region which the hypervisor updates directly is the simplest and cleanest way of doing it and getting it into the hands of users. We're rolling out the AMZNVCLK device for internal use cases, and plan to add it to public instances some time later. This is the guest driver which consumes that device, and I've separately posted the QEMU patch to provide the same device, because I absolutely do want this to be standardised across hypervisors, for the reasons you point out. You're preaching to the choir there; I even got Microsoft to implement the same 15-bit MSI extensions that we added to KVM :)

Supporting the disruption signal is the critical part; it allows applications to abort operations until their clock is good again. Providing the actual clock information on the new host, so that applications can keep running immediately, is what I'll be working on next.
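Just to make the disruption signal concrete, here's a rough userspace-side sketch of consuming it. To be clear, the structure layout and field names below are purely illustrative and are *not* the actual AMZNVCLK ABI; it only assumes a seqcount-protected shared page containing a marker which changes when the clock is disrupted:

    /*
     * Illustrative sketch only: hypothetical layout, not the real ABI.
     * Assumes the hypervisor makes seq_count odd while it is updating
     * the page, and changes disruption_marker across a live migration.
     */
    #include <stdint.h>
    #include <stdbool.h>

    struct vclock_page {
            uint32_t seq_count;             /* odd => update in progress */
            uint32_t flags;
            uint64_t disruption_marker;     /* changes on clock disruption */
            /* ... clock data would follow here ... */
    };

    /* Read the marker consistently, retrying if an update races with us. */
    static uint64_t vclock_read_marker(const volatile struct vclock_page *p)
    {
            uint32_t seq;
            uint64_t marker;

            do {
                    while ((seq = __atomic_load_n(&p->seq_count,
                                                  __ATOMIC_ACQUIRE)) & 1)
                            ;       /* hypervisor mid-update; spin */
                    marker = p->disruption_marker;
                    /* Order the marker read before the seq_count re-check. */
                    __atomic_thread_fence(__ATOMIC_ACQUIRE);
            } while (__atomic_load_n(&p->seq_count, __ATOMIC_RELAXED) != seq);

            return marker;
    }

    /*
     * An application samples the marker when it starts a time-sensitive
     * operation; if the marker has changed by the time it finishes, the
     * timestamps taken in between cannot be trusted and the operation
     * should be aborted or retried.
     */
    static bool clock_was_disrupted(const volatile struct vclock_page *p,
                                    uint64_t marker_at_start)
    {
            return vclock_read_marker(p) != marker_at_start;
    }

The seqcount dance is there because the hypervisor can update the page at any moment while vCPUs are running, without interrupting them; that's the whole point of the shared memory approach.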
I'd love virtio-rtc to adopt this shared memory structure too, and I've done my best to ensure that that's feasible, but I can't take a dependency on it and wait for it (and as discussed, I wouldn't use the virtio form in my environment anyway).

> mutt sucks less ;)

So does 'nc', but Evolution can talk to the corporate Exchange calendar and email. And I'm used to it and can mostly cope with its quirks :)