Changing virtio-dev address to the new one. The discussion might also be relevant for virtio-comment, but it is discouraged to forward it to both. On 15.06.24 10:40, David Woodhouse wrote: > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote: >> RFC v3 updates >> -------------- >> >> This series implements a driver for a virtio-rtc device conforming to spec >> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to >> the PTP clock driver already present before. >> >> This patch series depends on the patch series "treewide: Use clocksource id >> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined >> series on top of mainline. >> >> Overview >> -------- >> >> This patch series adds the virtio_rtc module, and related bugfixes. The >> virtio_rtc module implements a driver compatible with the proposed Virtio >> RTC device specification [1]. The Virtio RTC (Real Time Clock) device >> provides information about current time. The device can provide different >> clocks, e.g. for the UTC or TAI time standards, or for physical time >> elapsed since some past epoch. The driver can read the clocks with simple >> or more accurate methods. Optionally, the driver can set an alarm. >> >> The series first fixes some bugs in the get_device_system_crosststamp() >> interpolation code, which is required for reliable virtio_rtc operation. >> Then, add the virtio_rtc implementation. >> >> For the Virtio RTC device, there is currently a proprietary implementation, >> which has been used for provisional testing. > > As discussed before, I don't think it makes sense to design a new high- > precision virtual clock which only gets it right *most* of the time. We > absolutely need to address the issue of live migration. > > When live migration occurs, the guest's time precision suffers in two > ways. > > First, even when migrating to a supposedly identical host the precise > rate of the underlying counter (TSC, arch counter, etc.) can change > within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural > changes that NTP normally manages to track, this is a *step* change, > potentially from -50PPM to +50PPM from one host to the next. > > Second, the accuracy of the counter as preserved across migration is > limited by the accuracy of each host's NTP synchronization. So there is > also a step change in the value of the counter itself. > > At the moment of migration, the guest's timekeeping should be > considered invalid. Any previous cross-timestamps are meaningless, and > blindly using the previously-known relationship between the counter and > real time can lead to problems such as corruption in distributed > databases, fines for mis-timestamped transactions, and other general > unhappiness. > > We obviously can't get a new timestamp from the virtio_rtc device every > time an application wants to know the time reliably. We don't even want > *system* calls for that, which is why we have it in a vDSO. > > We can take the same approach to warning the guest about clock > disruption due to live migration. A shared memory region from the > virtual clock device even be mapped all the way to userspace, for those > applications which need precise and *reliable* time to check a > 'disruption' marker in it, and do whatever is appropriate (e.g. abort > transactions and wait for time to resync) when it happens. > > We can do better than just letting the guest know about disruption, of > course. We can put the actual counter→realtime relationship into the > memory region too. As well as being able to provide a PTP driver with > this, the applications which care about *reliable* timestamps can mmap > the page directly and use it, vDSO-style, to have accurate timestamps > even from the first cycle after migration. > > When disruption is signalled, the guest needs to throw away any > *additional* refinement that it's done with NTP/PTP/PPS/etc. and revert > to knowing nothing more than what the hypervisor advertises here. > > Here's a first attempt at defining such a memory structure. For now > I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I > think it makes most sense for this to be part of the virtio_rtc spec. > Ultimately it doesn't matter *how* the guest finds the memory region. This looks sensible to me. I also think it would be possible to adapt this for the virtio-rtc spec. The proposal also supports some other use cases which are not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and others such as indication of a past leap second. Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to support leap second smearing. Also, it would be helpful to allow indicating when some of the fields are not valid (such as leapsecond_counter_value, leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...). Do you have plans to contribute this to the Virtio specification and Linux driver implementation? I also added a few comments below. Thanks for sharing, Peter [2] https://lore.kernel.org/virtio-comment/20240522170003.52565-1-peter.hilber@xxxxxxxxxxxxxxx/ > > Very preliminary QEMU hacks at > https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock > (I still need to do the KVM side helper for actually filling in the > host clock information, converted to the *guest* TSC) > > This is the guest side. H aving heckled your assumption that we can use > the virt counter on Arm, I concede I'm doing the same thing for now. > The structure itself is at the end, or see > https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock > > From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001 > From: David Woodhouse <dwmw@xxxxxxxxxxxx> > Date: Mon, 10 Jun 2024 15:10:11 +0100 > Subject: [PATCH] ptp: Add vDSO-style vmclock support > > The vmclock "device" provides a shared memory region with precision clock > information. By using shared memory, it is safe across Live Migration. > > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx> > --- > drivers/ptp/Kconfig | 13 ++ > drivers/ptp/Makefile | 1 + > drivers/ptp/ptp_vmclock.c | 404 +++++++++++++++++++++++++++++++++++ > include/uapi/linux/vmclock.h | 102 +++++++++ > 4 files changed, 520 insertions(+) > create mode 100644 drivers/ptp/ptp_vmclock.c > create mode 100644 include/uapi/linux/vmclock.h > [...] > + > +static const struct ptp_clock_info ptp_vmclock_info = { > + .owner = THIS_MODULE, > + .max_adj = 0, > + .n_ext_ts = 0, > + .n_pins = 0, > + .pps = 0, > + .adjfine = ptp_vmclock_adjfine, > + .adjtime = ptp_vmclock_adjtime, > + .gettime64 = ptp_vmclock_gettime, Should implement .gettimex64 instead. > + .settime64 = ptp_vmclock_settime, > + .enable = ptp_vmclock_enable, > + .getcrosststamp = ptp_vmclock_getcrosststamp, > +}; > + [...] > +/* > + * This structure provides a vDSO-style clock to VM guests, exposing the > + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch > + * counter, etc.) and real time. It is designed to address the problem of > + * live migration, which other clock enlightenments do not. > + * > + * When a guest is live migrated, this affects the clock in two ways. > + * > + * First, even between identical hosts the actual frequency of the underlying > + * counter will change within the tolerances of its specification (typically > + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the > + * same host, but can be tracked by NTP as it generally varies slowly. With > + * live migration there is a step change in the frequency, with no warning. > + * > + * Second, there may be a step change in the value of the counter itself, as > + * its accuracy is limited by the precision of the NTP synchronization on the > + * source and destination hosts. > + * > + * So any calibration (NTP, PTP, etc.) which the guest has done on the source > + * host before migration is invalid, and needs to be redone on the new host. > + * > + * In its most basic mode, this structure provides only an indication to the > + * guest that live migration has occurred. This allows the guest to know that > + * its clock is invalid and take remedial action. For applications that need > + * reliable accurate timestamps (e.g. distributed databases), the structure > + * can be mapped all the way to userspace. This allows the application to see > + * directly for itself that the clock is disrupted and take appropriate > + * action, even when using a vDSO-style method to get the time instead of a > + * system call. > + * > + * In its more advanced mode. this structure can also be used to expose the > + * precise relationship of the CPU counter to real time, as calibrated by the > + * host. This means that userspace applications can have accurate time > + * immediately after live migration, rather than having to pause operations > + * and wait for NTP to recover. This mode does, of course, rely on the > + * counter being reliable and consistent across CPUs. > + */ > + > +#ifdef __KERNEL__ > +#include <linux/types.h> > +#else > +#include <stdint.h> > +#endif > + > +struct vmclock_abi { > + uint32_t magic; > +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ > + uint16_t size; /* Size of page containing this structure */ > + uint16_t version; /* 1 */ > + > + /* Sequence lock. Low bit means an update is in progress. */ > + uint64_t seq_count; > + > + /* > + * This field changes to another non-repeating value when the CPU > + * counter is disrupted, for example on live migration. > + */ > + uint64_t disruption_marker; > + > + uint8_t clock_status; > +#define VMCLOCK_STATUS_UNKNOWN 0 > +#define VMCLOCK_STATUS_INITIALIZING 1 > +#define VMCLOCK_STATUS_SYNCHRONIZED 2 > +#define VMCLOCK_STATUS_FREERUNNING 3 > +#define VMCLOCK_STATUS_UNRELIABLE 4 > + > + uint8_t counter_id; > +#define VMCLOCK_COUNTER_INVALID 0 > +#define VMCLOCK_COUNTER_X86_TSC 1 > +#define VMCLOCK_COUNTER_ARM_VCNT 2 > + > + uint8_t pad[3]; > + > + /* > + * UTC on its own is non-monotonic and ambiguous. > + * > + * Inform the guest about the TAI offset, so that it can have an > + * actual monotonic and reliable time reference if it needs it. > + * > + * Also indicate a nearby leap second, if one exists. Unlike in > + * NTP, this may indicate a leap second in the past in order to > + * indicate that it *has* been taken into account. > + */ > + int8_t leapsecond_direction; > + int16_t tai_offset_sec; > + uint64_t leapsecond_counter_value; > + uint64_t leapsecond_tai_time; > + > + /* Paired values of counter and UTC at a given point in time. */ > + uint64_t counter_value; > + uint64_t utc_time_sec; > + uint64_t utc_time_frac_sec; > + > + /* Counter frequency, and error margin. Units of (second >> 64) */ > + uint64_t counter_period_frac_sec; AFAIU this might limit the precision in case of high counter frequencies. Could the unit be aligned to the expected frequency band of counters? > + uint64_t counter_period_error_rate_frac_sec; > + > + /* Error margin of UTC reading above (± picoseconds) */ > + uint64_t utc_time_maxerror_picosec; > +};