Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Fri, 08 Mar 2024 12:33:20 +0000

On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> On 07.03.24 15:02, David Woodhouse wrote:
> > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> > > RFC v3 updates
> > > --------------
> > > 
> > > This series implements a driver for a virtio-rtc device conforming to spec
> > > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> > > the PTP clock driver already present before.
> > > 
> > > This patch series depends on the patch series "treewide: Use clocksource id
> > > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> > > series on top of mainline.
> > > 
> > > Overview
> > > --------
> > > 
> > > This patch series adds the virtio_rtc module, and related bugfixes. The
> > > virtio_rtc module implements a driver compatible with the proposed Virtio
> > > RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> > > provides information about current time. The device can provide different
> > > clocks, e.g. for the UTC or TAI time standards, or for physical time
> > > elapsed since some past epoch. 
> > 
> > Hm, should we allow UTC? If you tell me the time in UTC, then
> > (sometimes) I still don't actually know what the time is, because some
> > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > offset, surely? Should the virtio_rtc specification make it mandatory
> > to provide such?
> > 
> > Otherwise you're just designing it to allow crappy hypervisors to
> > expose incomplete information.
> > 
> 
> Hi David,
> 
> (adding virtio-comment@xxxxxxxxxxxxxxxxxxxx for spec discussion),
> 
> thank you for your insightful comments. I think I take a broadly similar
> view. The reason why the current spec and driver is like this is that I
> took a pragmatic approach at first and only included features which work
> out-of-the-box for the current Linux ecosystem.
> 
> The current virtio_rtc features work similar to ptp_kvm, and therefore can
> work out-of-the-box with time sync daemons such as chrony.
> 
> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
> well, I am afraid that
> 
> - in some (embedded) scenarios, the TAI clock may not be available
> 
> - crappy hypervisors will pass off the UTC clock as the TAI clock.
> 
> For the same reasons, I am also not sure about adding a *mandatory* TAI
> offset to each readout. I don't know user-space software which would
> leverage this already (at least not through the PTP clock interface). And
> why would such software not go straight for the TAI clock instead?
> 
> How about adding a requirement to the spec that the virtio-rtc device
> SHOULD expose the TAI clock whenever it is available - would this address
> your concerns?

I think that would be too easy for implementors to miss, or decide not
to obey. Or to get *wrong*, by exposing a TAI clock but actually
putting UTC in it.

I think I prefer to mandate the tai_offset field with the UTC clock.
Crappy implementations will just set it to zero, but at least that
gives a clear signal to the guests that it's *their* problem to
resolve.

> > > PTP clock interface
> > > -------------------
> > > 
> > > virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> > > If both the Virtio RTC device and this driver have special support for the
> > > current clocksource, time synchronization programs can use
> > > cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> > > PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> > > with single-digit ns precision is possible with a quiescent reference clock
> > > (from the Virtio RTC device). This works even when the Virtio device
> > > response is slow compared to ptp_kvm hypercalls.
> > 
> > Is PTP the right mechanism for this? As I understand it, PTP is a way
> > to precisely synchronize one clock with another. But in the case of
> > virt guests synchronizing against the host, it isn't really *another*
> > clock. It really is the *same* underlying clock. As the host clock
> > varies with temperature, for example, so does the guest clock. The only
> > difference is an offset and (on x86 perhaps) a mathematical scaling of
> > the frequency.
> > 
> > I was looking at this another way, when I came across this virtio-rtc
> > work.
> > 
> > My idea was just for the hypervisor to expose its own timekeeping
> > information — the counter/TSC value and TAI time at a given moment,
> > frequency of the counter, and the precision of both that frequency
> > (±PPM) and the TAI timestamp (±µs).
> > 
> > By putting that in a host/guest shared data structure with a seqcount
> > for lockless updates, we can update it as time synchronization on the
> > host is refined, and we can even cleanly handle live migration where
> > the guest ends up on a completely different host. It allows for use
> > cases which *really* care (e.g. timestamping financial transactions) to
> > ensure that there is never even a moment of getting *wrong* timestamps
> > if they haven't yet resynced after a migration.
> 
> I considered a similar approach as well, but integrating that with the
> kernel timekeeping seemed too much effort for the first step. However,
> reading the clock from user space would be much simpler.

Right. In fact my *first* use case was userspace, specifically in the
context of https://github.com/aws/clock-bound — but anything we design
for this absolutely has to be usable for kernel timekeeping too.

It's also critical to solve the Live Migration problem.

But is it so hard to integrate into the kernel timekeeping? My plan
would have given us effectively an infinite number of cross-reads of
the realtime clock vs. TSC. You don't have to actually read from a
virtio device; you just read the TSC and do the maths, using the values
in the shared memory region. Couldn't that be used to present a PTP
device to the guest kernel just the same as you do at the moment?

You could probably even simulate PPS with it. Typically with PPS we
have to catch the hardware interrupt and then read the TSC as soon as
possible thereafter. With this, you'd be able to *calculate* the TSC
value at the start of the next second, and wouldn't have to suffer the
real hardware latency :)

> > 
> > Now I'm trying to work out if I should attempt to reconcile with your
> > existing virtio-rtc work, or just decide that virtio-rtc isn't trying
> > to solve the actual problem that we have, and go ahead with something
> > different... ?
> > 
> 
> We are certainly interested into the discussed, say, "virtual timekeeper"
> mechanism, which would also solve a lot of problems for us (especially if
> it would be integrated with kernel timekeeping). Even without Linux kernel
> timekeeping, the virtual timekeeper would be useful to us for guests with
> simpler timekeeping, and potentially for user space applications.
> 
> Our current intent is to at first try to upstream the current (RFC spec v3)
> feature set. I think the virtual timekeeper would be suitable as an
> optional feature of virtio_rtc (with Virtio, this could easily be added
> after initial upstreaming). It is also possible to have a virtio-rtc device
> only implement the virtual timekeeper, but not the other clock reading
> methods, if these are of no interest.

Yeah, that might make sense. I was thinking of a simple ACPI/DT device
exposing a page of memory and *maybe* an interrupt for when an update
happens. (With the caveat that the interrupt would always occur too
late by definition, so it's no substitute for using the seqlock
correctly in applications that *really* care and are going to get fined
millions of dollars for mis-timestamping their transactions.)

But using the virtio-rtc device as the vehicle for that shared memory
page is reasonable too. It's not even mutually exclusive; we could
expose the *same* data structure in memory via whatever mechanisms we
wanted.

One other thing to note is I think we're being very naïve about the TSC
on x86 hosts. Theoretically, the TSC for every vCPU might run at a
different frequency, and even if they run at the same frequency they
might be offset from each other. I'm happy to be naïve but I think we
should be *explicitly* so, and just say for example that it's defined
against vCPU0 so if other vCPUs are different then all bets are off. 

We *can* cope with TSC frequencies changing. Fundamentally, that's the
whole *point*; NTP calibrates itself as the underlying frequency does
change due to temperature changes, etc. — so a deliberate frequency
scaling, or even a live migration, are just a slightly special case of
the same thing.

One thing I have added to the memory region is a migration counter. In
the ideal case, guests will be happy just to use the hypervisor's
synchronization. But in some cases the guests may want to do NTP (or
PPS, PTP or something else) for themselves, to have more precise
timekeeping than the host. Even if the host is advertising itself as
stratum 16, the guest still needs to know of *migration*, because it
has to consider itself unsynchronized when that happens.
<<attachment: smime.p7s>>