Re: [PATCH] ptp: Add vDSO-style vmclock support

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Thu, 25 Jul 2024 10:56:05 +0100

Hi Michael, thanks for the review!

On Thu, 2024-07-25 at 01:48 -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2024 at 06:16:37PM +0100, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > ---
> > QEMU implementation at
> > https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> > 
> > Although the ACPI device implemented in QEMU (and some other
> > hypervisor) stands alone, most of the fields and values herein are
> > aligned as much as possible with the nascent virtio-rtc specification,
> > with the intent that a version of the same structure can be
> > incorporated into that standard.
> 
> Do you want to just help complete virtio-rtc then? Would be easier than
> trying to keep two specs in sync.

The ACPI version is much more lightweight and doesn't take up a
valuable PCI slot#. (I know, you can do virtio without PCI but that's
complex in other ways).

So I do see these existing in parallel, and I'm happy for the VMCLOCK
device to defer to the shared memory structure in the virtio-rtc
specification once that's final.

I've done my best to ensure we have a fair chance of it being adopted
as-is, and it's versioned so if virtio-rtc gets published with v2 of
the structure, that's also OK.
> 

> > +
> > +       if (st->clk->magic != VMCLOCK_MAGIC ||
> > +           st->clk->size < sizeof(*st->clk) ||
> 
> 
> This is a problem and what I mention below.
> As structure will evolve, sizeof(*st->clk) will grow,
> then it will fail on old hosts.

Potentially, yes. At the time a guest starts to support a larger
version of the struct, it would need to check for the size of the v1
structure.

We certainly *allow* for the structure to evolve, but as you say this
is ABI. So we've tried to plan ahead and avoid having to change it
unless something *really* surprising happens in the field of leap
seconds.

> > +
> > +       /* If the structure is big enough, it can be mapped to userspace */
> > +       if (st->clk->size >= PAGE_SIZE) {
> 
> This is also part of ABI. And a problematic one since linux can
> increase PAGE_SIZE at any time for its own reasons.

Yes. For it to be mapped to userspace, the hypervisor has to put it
into a memory region which is large enough for the page granularity
that the guest happens to use.

The hypervisor implementations so far (for x86 in QEMU since there
seems to be no ACPI support for Arm?, and for x86+Arm in my other pet
hypervisor) expose a 4KiB region.

So if the guest uses larger pages (and can't cope with using 4KiB PTEs
for MMIO mappings), then it can't use this feature unless the
hypervisor gives it a larger region. I think that's OK.

> > +               st->miscdev.minor = MISC_DYNAMIC_MINOR;
> > +               st->miscdev.fops = &vmclock_miscdev_fops;
> > +               st->miscdev.name = st->name;
> > +
> > +               ret = misc_register(&st->miscdev);
> > +               if (ret)
> > +                       goto out;
> > +       }
> 
> Hmm so you want to map a page to userspace, but who will
> keep the clock in that memory page up to date?
> Making hypervisor do it will mean stealing a bunch
> of hypervisor time and we don't *know* userspace is
> going to use it.

I think it should scale OK. The bulk of the actual *calculations* are
done only once by the hypervisor on an NTP skew adjustment anyway,
rather than per-VM. That gives the period of the actual host counter,
and the reference time.

Where TSC scaling *isn't* occurring, all that the VMM needs to do is
copy that information, applying the appropriate TSC offset. I think it
should be mostly in the noise compared with other VMM overhead?

Where TSC scaling is in use, it's a little more complex. The counter
value at the reference time needs to be converted using the standard
mul-and-shift TSC scaling method. But the counter *period* needs to
undergo the reverse conversion. That would be a bit more work if done
naïvely with a right shift and then a *division*, but an appropriate
efficient conversion to match the frequency scaling could be
precalculated, much as KVM already does in kvm_get_time_scale()? Then
the conversion of the counter period could just be a mul and shift too.

If it's just the TSC scaling which pushes it over the edge, then
*maybe* we could add the raw host→guest TSC scaling information to the
structure and make it the guest's problem? I'm not sure I'm overly fond
of that approach.

(You mention feature negotiation later, so we'll come back to that...)

> > +static const struct acpi_device_id vmclock_acpi_ids[] = {
> > +       { "VMCLOCK", 0 },
> > +       {}
> > +};
> > +MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);
> 
> This is also part of ABI.
> How do we know no one in the world happens to have a CID like this?
> We should just reserve a valid HID, I think.

Ack. This was largely just based on the existing VMGENID code. Can I
assign a QEMUxxxx HID for it?

> > +struct vmclock_abi {
> > +       /* CONSTANT FIELDS */
> > +       uint32_t magic;
> > +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
> > +       uint32_t size;          /* Size of region containing this structure */
> > +       uint16_t version;       /* 1 */
> 
> I think we need to document how will compatibility be handled,
> otherwise people will just go
> (assert(size == sizeof(struct vmclock_abi))
> and we can't extend it ever.

As you noted, the existing guest code only checks for it being large
*enough*. I've let this "specification" be implementation-led for the
time being, as I'm planning for Peter to accept it into the virtio-rtc
spec and for *that* to be the formal definition, where such admonitions
would reside. 

> I also feel some kind of negotiation so host can detect guest
> expectations would be a good idea, basically a la virtio feature
> negotiation.
> 
> Also, I think it's not great that host is just expected to
> poke at this memory at all times. Being able to know that e.g.
> system is being shut down and so no need to poke
> at memory would be good.

Fair enough. I think that when used in virtio-rtc, doing it through
feature negotiation is probably the right answer?

For the ACPI device, perhaps we can just add an ACPI method to
enable/disable the timekeeping (the disruption signal can always be
operational).

<<attachment: smime.p7s>>