On Fri, 2024-07-26 at 01:55 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 01:09:24AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > > > > > Then can't we fix it by interrupting all CPUs right after LM?
> > > > > > To me that seems like a cleaner approach - we then compartmentalize
> > > > > > the ABI issue - kernel has its own ABI against userspace,
> > > > > > devices have their own ABI against kernel.
> > > > > > It'd mean we need a way to detect that interrupt was sent,
> > > > > > maybe yet another counter inside that structure.
> > > > > > 
> > > > > > WDYT?
> > > > > > 
> > > > > > By the way the same idea would work for snapshots -
> > > > > > some people wanted to expose that info to userspace, too.
> > > Those people included me. I wanted to interrupt all the vCPUs, even the
> > > ones which were in userspace at the moment of migration, and have the
> > > kernel deal with passing it on to userspace via a different ABI.
> > > 
> > > It ends up being complex and intricate, and requiring a lot of new
> > > kernel and userspace support. I gave up on it in the end for snapshots,
> > > and didn't go there again for this.
> 
> Maybe become you insist on using ACPI?
> I see a fairly simple way to do it. For example, with virtio:
> 
> one vq per CPU, with a single outstanding buffer,
> callback copies from the buffer into the userspace visible memory.
> 
> Want me to show you the code?
> 
> Couldn't resist, so I wrote a bit of this code.
> Fundamentally, we keep a copy of the hypervisor abi
> in the device:
> 
> struct virtclk_info *vci {
>         struct vmclock_abi abi;
> };
> 
> each vq will has its own copy:
> 
> struct virtqueue_info {
>         struct scatterlist sg[];
>         struct vmclock_abi abi;
> }
> 
> we add it during probe:
>         sg_init_one(vqi->sg, &vqi->abi, sizeof(vqi->abi));
>         virtqueue_add_inbuf(vq,
>                         vqi->sg, 1,
>                         &vq->vabi,
>                         GFP_ATOMIC);
> 
> 
> 
> We set the affinity for each vq:
> 
>         for (i = 0; i < num_online_cpus(); i++)
>                 virtqueue_set_affinity(vi->vq[i], i);
> 
> (virtio net does it, and it handles cpu hotplug as well)
> 
> each vq callback would do:
> 
> static void vmclock_cb(struct virtqueue *vq)
> {
>         struct virtclk_info *vci = vq->vdev->priv;
>         struct virtqueue_info *vqi = vq->priv;
>         void *buf;
>         unsigned int len;
> 
>         buf = virtqueue_get_buf(vq, &len);
>         if (!buf)
>                 return;
> 
>         BUG_ON(buf != &vq->abi);
> 
>         spin_lock(vci->lock);
>         if (memcmp(&vci->abi, &vqi->abi, sizeof(vqi->abi))) {
>                 memcpy(&vci->abi, &vqi->abi, sizeof(vqi->abi));
>         }
> 
>         /* Update the userspace visible structure now */
>         .....
> 
>         /* Re-add the buffer */
>         virtqueue_add_inbuf(vq,
>                         vqi->sg, 1,
>                         &vqi->abi,
>                         GFP_ATOMIC);
> 
>         spin_unlock(vi->lock);
> }
> 
> That's it!
> Where's the problem here?

That's great. You don't even need it to be per-vCPU if you let the hypervisor write directly to the single physical location that's mapped to userspace. It can do that before it even starts *running* the vCPUs after migration. It's a whole lot simpler.

Even if we were to insist on having two *different* ABIs, one for hypervisor←→guest kernel, and another for kernel←→userspace, then you still only need the hypervisor to write one copy. It's just that you do need to interrupt all the other vCPUs to ensure they aren't running userspace before the data are updated.
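[Editorial note: since the quoted snippet has a few mismatched pointers and an unbalanced lock, here is a minimal tidied-up sketch of the same per-vCPU virtqueue idea. All type and field names, and the vmclock_abi definition it relies on, are illustrative assumptions rather than an actual driver.]

/*
 * Hypothetical, tidied-up rendering of the sketch quoted above.
 * struct vmclock_abi is assumed to come from the shared
 * hypervisor<->guest ABI definition.
 */
#include <linux/scatterlist.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/virtio.h>

struct virtqueue_info {
        struct scatterlist sg[1];
        struct vmclock_abi abi;         /* per-vq copy written by the device */
        struct virtqueue *vq;
};

struct virtclk_info {
        spinlock_t lock;
        struct vmclock_abi abi;         /* driver's authoritative copy */
        struct virtqueue_info *vqi;     /* one entry per online CPU */
};

/* Post the single outstanding inbuf for one vq (also done at probe time). */
static int vmclock_add_inbuf(struct virtqueue_info *vqi)
{
        sg_init_one(vqi->sg, &vqi->abi, sizeof(vqi->abi));
        return virtqueue_add_inbuf(vqi->vq, vqi->sg, 1, vqi, GFP_ATOMIC);
}

/* Callback runs on the CPU the vq is affined to. */
static void vmclock_cb(struct virtqueue *vq)
{
        struct virtclk_info *vci = vq->vdev->priv;
        struct virtqueue_info *vqi;
        unsigned int len;

        vqi = virtqueue_get_buf(vq, &len);      /* returns the cookie we added */
        if (!vqi)
                return;

        spin_lock(&vci->lock);
        if (memcmp(&vci->abi, &vqi->abi, sizeof(vqi->abi)))
                memcpy(&vci->abi, &vqi->abi, sizeof(vqi->abi));

        /* ... update the userspace-visible structure here ... */

        /* Re-post the buffer so the device can signal the next change. */
        vmclock_add_inbuf(vqi);
        spin_unlock(&vci->lock);

        virtqueue_kick(vq);
}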
So yes, we can do the above and (ab)use gratuitous data transfers to trigger those IRQs, and then the handler has some kind of locking to ensure that we can't reenter userspace until the user-visible structure is updated?

Obviously the "interrupt all CPUs" approach doesn't work well for hardware implementations, but those won't have to deal with live migration, so that's OK. A hypothetical hardware implementation would map something like the ART to reference time. Obviously there's no LM, but are there any other causes of 'disruption' in the same sense? Not that I can think of.

But I don't think we actually do want to have separate ABIs for hypervisor-kernel and kernel-user. The ABI we have in the structure is just fine, and extensible. And having the hypervisor update it directly is fine. It means that applications can use it with a *simple* externally buildable kernel driver, which is easy to persuade OSVs to add to their product kernels.

If Linux wants to adopt it and map it directly into the vDSO area and use it from vDSO functions, that's *great*, and still entirely possible. But there's no need for us to take a dependency on Linux doing that kind of thing before this functionality is even usable at all.

I think something like the above could well be how we expose the vmclock_abi struct in virtio-rtc, and that's fine. It even fixes the PAGE_SIZE problem that the ACPI model has. But I see it as just another transport, completely orthogonal to the actual contents of the vmclock_abi.
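[Editorial note: as a hedged illustration of how an application might consume the mapped structure through such a driver, here is a lockless seqcount-style reader loop. The layout and field names below are invented for the example; the real ones are whatever the vmclock_abi definition specifies.]

/*
 * Hypothetical userspace reader of an mmap()ed vmclock_abi page.
 * Field names (seq_count, disruption_marker) are illustrative only.
 */
#include <stdint.h>
#include <stdatomic.h>

struct vmclock_abi {
        uint32_t seq_count;             /* odd while the writer is mid-update */
        uint32_t disruption_marker;     /* bumped on migration/disruption */
        /* ... clock rate, error bounds, etc. ... */
};

static struct vmclock_abi vmclock_read(const volatile struct vmclock_abi *map)
{
        struct vmclock_abi snap;
        uint32_t seq;

        do {
                seq = map->seq_count;
                atomic_thread_fence(memory_order_acquire);
                snap = *map;            /* copy the whole structure */
                atomic_thread_fence(memory_order_acquire);
        } while ((seq & 1) || seq != map->seq_count);

        return snap;
}

If Linux does eventually expose this via the vDSO instead, a reader loop of this shape looks much the same; only where the page is mapped changes.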