Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

On Tue, 2024-06-25 at 15:22 -0700, John Stultz wrote:
> On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> > On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> > > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > > > 
> > > > The vmclock "device" provides a shared memory region with precision clock
> > > > information. By using shared memory, it is safe across Live Migration.
> > > > 
> > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > > actually helpful.
> > > > 
> > > > The memory region of the device is also exposed to userspace so it can be
> > > > read or memory mapped by applications which need reliable notification of
> > > > clock disruptions.
> > > 
> > > There is effort underway to expose PTP clocks to user space via VDSO.
> > 
> > Ooh, interesting. Got a reference to that please?
> > 
> > >  Can we please not expose an ad hoc interface for that?
> > 
> > Absolutely. I'm explicitly trying to intercept the virtio-rtc
> > specification here, to *avoid* having to do anything ad hoc.
> > 
> > Note that this is a "vDSO-style" interface from hypervisor to guest via
> > a shared memory region, not necessarily an actual vDSO.
> > 
> > But yes, it *is* intended to be exposed to userspace, so that userspace
> > can know the *accurate* time without a system call, and know that it
> > hasn't been perturbed by live migration.
> 
> Yea, I was going to raise a concern that just defining an mmaped
> structure means it has to trust the guest logic is as expected. It's
> good that it's versioned! :)

Right. Although it's basically a pvclock, and we've had those for ages.

The main difference here is that we add an indicator that tells the
guest that it's been live migrated, so any additional NTP/PTP
refinement that the *guest* has done of its oscillator should now be
discarded.

It's also designed to be useful in "disruption-only" mode, where the
pvclock information isn't actually populated, so *all* it does is tell
guests that their clock is now hosed due to live migration.

That part is why it needs to be mappable directly to userspace, so that
userspace can not only get a timestamp but *also* know that it's
actually valid. All without a system call.

The critical use cases are financial systems where they incur massive
fines if they submit mis-timestamped transactions, and distributed
databases which rely on accurate timestamps (and error bounds) for
eventual coherence. Live migration can screw those completely.

I'm open to changing pretty much anything about the proposal as long as
we can address those use cases (which the existing virtio-rtc and other
KVM enlightenments do not).

> I'd fret a bit about exposing this to userland. It feels very similar
> to the old powerpc systemcfg implementation that similarly mapped just
> kernel data out to userland and was difficult to maintain as changes
> were made. Would including a code page like a proper vdso make sense
> to make this more flexible of an UABI to maintain?

I think the structure itself should be stable once we've bikeshedded it
a bit. But there is certainly some potential for vDSO functions which
help us expose it to the user...

This structure exposes a 'disruption count' which is updated every time
the TSC/counter is messed with by live migration. But what is userspace
actually going to *compare* it with?

It basically needs to compare it with the disruption count when the
clock was last synchronized, so maybe the kernel could export *that* to
vDSO too, then expose a simple vDSO function which reports whether the
clock is valid?
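
As a minimal sketch of that comparison: the struct layout, field names
and function names below are hypothetical and are not taken from the
patch. The idea is only that whoever last synchronized the clock records
the disruption count at that moment, and a later check compares it with
the live value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the shared region; the real layout is defined
 * by the vmclock patch, not by this sketch. */
struct vmclock_view {
	volatile uint64_t disruption_count; /* assumed: bumped on live migration */
};

/* Hypothetical state kept by whoever disciplines the clock. */
struct sync_state {
	uint64_t disruption_at_sync;
};

/* Remember the disruption count seen when the clock was last synchronized. */
void record_sync(const struct vmclock_view *clk, struct sync_state *sync)
{
	sync->disruption_at_sync = clk->disruption_count;
}

/* True if no disruption has been signalled since the last synchronization. */
bool clock_refinement_valid(const struct vmclock_view *clk,
			    const struct sync_state *sync)
{
	return clk->disruption_count == sync->disruption_at_sync;
}
```

A caller would treat a false return as "discard the NTP/PTP refinement
and resynchronize" before trusting any further timestamps.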

The 'invalid' code path could turn into an actual system call which
makes the kernel (check for itself and) call ntp_clear() when the
disruption occurs. Or maybe not just ntp_clear() but actually consume
the pvclock rate information directly and apply the *new* calibration?

That kind of thing would be great, and I've definitely tried to design
the structure so that it *can* be made a first-class citizen within the
kernel's timekeeping code and used like that.

But I was going to start with a more modest proposal that it's "just a
device", and applications which care about reliable time after LM would
have to open /dev/vmclock0, mmap it, and check for themselves. (Which
would be assisted by things like the ClockBound library).
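
For illustration, the userspace side of that could look roughly like the
sketch below. The helper name map_vmclock() is hypothetical, and mapping
exactly one page is an assumption; the real size and contents of the
region are whatever the device ends up defining.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the device read-only and return a pointer to it, or NULL on failure.
 * Assumption: a single page covers the region; on success *len_out (if
 * non-NULL) is set to the mapped length. */
const void *map_vmclock(const char *path, size_t *len_out)
{
	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return NULL;

	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	void *p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	close(fd); /* the mapping remains valid after the fd is closed */
	if (p == MAP_FAILED)
		return NULL;

	if (len_out)
		*len_out = (size_t)page;
	return p;
}
```

An application would call map_vmclock("/dev/vmclock0", &len) once at
startup and then read the disruption indicator from the mapping without
any further system calls.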

