On Fri, 2024-07-26 at 01:55 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 01:09:24AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > > > > > Then can't we fix it by interrupting all CPUs right after LM?
> > > > > > To me that seems like a cleaner approach - we then compartmentalize
> > > > > > the ABI issue - kernel has its own ABI against userspace,
> > > > > > devices have their own ABI against kernel.
> > > > > > It'd mean we need a way to detect that interrupt was sent,
> > > > > > maybe yet another counter inside that structure.
> > > > > > 
> > > > > > WDYT?
> > > > > > 
> > > > > > By the way the same idea would work for snapshots -
> > > > > > some people wanted to expose that info to userspace, too.
> > > Those people included me. I wanted to interrupt all the vCPUs, even the
> > > ones which were in userspace at the moment of migration, and have the
> > > kernel deal with passing it on to userspace via a different ABI.
> > > 
> > > It ends up being complex and intricate, and requiring a lot of new
> > > kernel and userspace support. I gave up on it in the end for snapshots,
> > > and didn't go there again for this.
> 
> Maybe become you insist on using ACPI?
> I see a fairly simple way to do it. For example, with virtio:
> 
> one vq per CPU, with a single outstanding buffer,
> callback copies from the buffer into the userspace visible memory.
> 
> Want me to show you the code?
> 
> Couldn't resist, so I wrote a bit of this code.
> Fundamentally, we keep a copy of the hypervisor abi
> in the device:
> 
> struct virtclk_info *vci {
>         struct vmclock_abi abi;
> };
> 
> each vq will has its own copy:
> 
> struct virtqueue_info {
>         struct scatterlist sg[];
>         struct vmclock_abi abi;
> }
> 
> we add it during probe:
>         sg_init_one(vqi->sg, &vqi->abi, sizeof(vqi->abi));
>         virtqueue_add_inbuf(vq,
>                         vqi->sg, 1,
>                         &vq->vabi,
>                         GFP_ATOMIC);
> 
> 
> 
> We set the affinity for each vq:
> 
>         for (i = 0; i < num_online_cpus(); i++)
>                 virtqueue_set_affinity(vi->vq[i], i);
> 
> (virtio net does it, and it handles cpu hotplug as well)
> 
> each vq callback would do:
> 
> static void vmclock_cb(struct virtqueue *vq)
> {
>         struct virtclk_info *vci = vq->vdev->priv;
>         struct virtqueue_info *vqi = vq->priv;
>         void *buf;
>         unsigned int len;
> 
>         buf = virtqueue_get_buf(vq, &len);
>         if (!buf)
>                 return;
> 
>         BUG_ON(buf != &vq->abi);
> 
>         spin_lock(vci->lock);
>         if (memcmp(&vci->abi, &vqi->abi, sizeof(vqi->abi))) {
>                 memcpy(&vci->abi, &vqi->abi, sizeof(vqi->abi));
>         }
> 
>         /* Update the userspace visible structure now */
>         .....
> 
>         /* Re-add the buffer */
>         virtqueue_add_inbuf(vq,
>                         vqi->sg, 1,
>                         &vqi->abi,
>                         GFP_ATOMIC);
> 
>         spin_unlock(vi->lock);
> }
> 
> That's it!
> Where's the problem here?

That's great. You don't even need it to be per-vCPU if you let the hypervisor write directly to the single physical location that's mapped to userspace. It can do that before it even starts *running* the vCPUs after migration. It's a whole lot simpler.

Even if we were to insist on having two *different* ABIs, one for hypervisor←→guest kernel, and another for kernel←→userspace, then you still only need the hypervisor to write one copy. It's just that you do need to interrupt all the other vCPUs to ensure they aren't running userspace before the data are updated.
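[Editorial note: since the quoted snippet has a few mismatched pointers and an unbalanced lock, here is a minimal tidied-up sketch of the same per-vCPU virtqueue idea. All type and field names, and the vmclock_abi definition it relies on, are illustrative assumptions rather than an actual driver.]

/*
 * Hypothetical, tidied-up rendering of the sketch quoted above.
 * struct vmclock_abi is assumed to come from the shared
 * hypervisor<->guest ABI definition.
 */
#include <linux/scatterlist.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/virtio.h>

struct virtqueue_info {
        struct scatterlist sg[1];
        struct vmclock_abi abi;         /* per-vq copy written by the device */
        struct virtqueue *vq;
};

struct virtclk_info {
        spinlock_t lock;
        struct vmclock_abi abi;         /* driver's authoritative copy */
        struct virtqueue_info *vqi;     /* one entry per online CPU */
};

/* Post the single outstanding inbuf for one vq (also done at probe time). */
static int vmclock_add_inbuf(struct virtqueue_info *vqi)
{
        sg_init_one(vqi->sg, &vqi->abi, sizeof(vqi->abi));
        return virtqueue_add_inbuf(vqi->vq, vqi->sg, 1, vqi, GFP_ATOMIC);
}

/* Callback runs on the CPU the vq is affined to. */
static void vmclock_cb(struct virtqueue *vq)
{
        struct virtclk_info *vci = vq->vdev->priv;
        struct virtqueue_info *vqi;
        unsigned int len;

        vqi = virtqueue_get_buf(vq, &len);      /* returns the cookie we added */
        if (!vqi)
                return;

        spin_lock(&vci->lock);
        if (memcmp(&vci->abi, &vqi->abi, sizeof(vqi->abi)))
                memcpy(&vci->abi, &vqi->abi, sizeof(vqi->abi));

        /* ... update the userspace-visible structure here ... */

        /* Re-post the buffer so the device can signal the next change. */
        vmclock_add_inbuf(vqi);
        spin_unlock(&vci->lock);

        virtqueue_kick(vq);
}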
So yes, we can do the above and (ab)use gratuitous data transfers to trigger those IRQs, and then the handler has some kind of locking to ensure that we can't reenter userspace until the user-visible structure is updated?

Obviously the "interrupt all CPUs" approach doesn't work well for hardware implementations, but those won't have to deal with live migration, so that's OK. A hypothetical hardware implementation would map something like the ART to reference time. Obviously there's no LM, but are there any other causes of 'disruption' in the same sense? Not that I can think of.

But I don't think we actually do want to have separate ABIs for hypervisor-kernel and kernel-user. The ABI we have in the structure is just fine, and extensible. And having the hypervisor update it directly is fine. It means that applications can use it with a *simple* externally buildable kernel driver, which is easy to persuade OSVs to add to their product kernels.

If Linux wants to adopt it and map it directly into the vDSO area and use it from vDSO functions, that's *great*, and still entirely possible. But there's no need for us to take a dependency on Linux doing that kind of thing before this functionality is even usable at all.

I think something like the above could well be how we expose the vmclock_abi struct in virtio-rtc, and that's fine. It even fixes the PAGE_SIZE problem that the ACPI model has. But I see it as just another transport, completely orthogonal to the actual contents of the vmclock_abi.
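[Editorial note: as a hedged illustration of how an application might consume the mapped structure through such a driver, here is a lockless seqcount-style reader loop. The layout and field names below are invented for the example; the real ones are whatever the vmclock_abi definition specifies.]

/*
 * Hypothetical userspace reader of an mmap()ed vmclock_abi page.
 * Field names (seq_count, disruption_marker) are illustrative only.
 */
#include <stdint.h>
#include <stdatomic.h>

struct vmclock_abi {
        uint32_t seq_count;             /* odd while the writer is mid-update */
        uint32_t disruption_marker;     /* bumped on migration/disruption */
        /* ... clock rate, error bounds, etc. ... */
};

static struct vmclock_abi vmclock_read(const volatile struct vmclock_abi *map)
{
        struct vmclock_abi snap;
        uint32_t seq;

        do {
                seq = map->seq_count;
                atomic_thread_fence(memory_order_acquire);
                snap = *map;            /* copy the whole structure */
                atomic_thread_fence(memory_order_acquire);
        } while ((seq & 1) || seq != map->seq_count);

        return snap;
}

If Linux does eventually expose this via the vDSO instead, a reader loop of this shape looks much the same; only where the page is mapped changes.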