Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Peter Hilber <peter.hilber@xxxxxxxxxxxxxxx> · Thu, 20 Jun 2024 14:37:45 +0200

Changing virtio-dev address to the new one. The discussion might also be
relevant for virtio-comment, but it is discouraged to forward it to both.

On 15.06.24 10:40, David Woodhouse wrote:
> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>> RFC v3 updates
>> --------------
>>
>> This series implements a driver for a virtio-rtc device conforming to spec
>> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
>> the PTP clock driver already present before.
>>
>> This patch series depends on the patch series "treewide: Use clocksource id
>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
>> series on top of mainline.
>>
>> Overview
>> --------
>>
>> This patch series adds the virtio_rtc module, and related bugfixes. The
>> virtio_rtc module implements a driver compatible with the proposed Virtio
>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
>> provides information about current time. The device can provide different
>> clocks, e.g. for the UTC or TAI time standards, or for physical time
>> elapsed since some past epoch. The driver can read the clocks with simple
>> or more accurate methods. Optionally, the driver can set an alarm.
>>
>> The series first fixes some bugs in the get_device_system_crosststamp()
>> interpolation code, which is required for reliable virtio_rtc operation.
>> Then, add the virtio_rtc implementation.
>>
>> For the Virtio RTC device, there is currently a proprietary implementation,
>> which has been used for provisional testing.
> 
> As discussed before, I don't think it makes sense to design a new high-
> precision virtual clock which only gets it right *most* of the time. We
> absolutely need to address the issue of live migration.
> 
> When live migration occurs, the guest's time precision suffers in two
> ways.
> 
> First, even when migrating to a supposedly identical host the precise
> rate of the underlying counter (TSC, arch counter, etc.) can change
> within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural
> changes that NTP normally manages to track, this is a *step* change,
> potentially from -50PPM to +50PPM from one host to the next.
> 
> Second, the accuracy of the counter as preserved across migration is
> limited by the accuracy of each host's NTP synchronization. So there is
> also a step change in the value of the counter itself.
> 
> At the moment of migration, the guest's timekeeping should be
> considered invalid. Any previous cross-timestamps are meaningless, and
> blindly using the previously-known relationship between the counter and
> real time can lead to problems such as corruption in distributed
> databases, fines for mis-timestamped transactions, and other general
> unhappiness.
> 
> We obviously can't get a new timestamp from the virtio_rtc device every
> time an application wants to know the time reliably. We don't even want
> *system* calls for that, which is why we have it in a vDSO.
> 
> We can take the same approach to warning the guest about clock
> disruption due to live migration. A shared memory region from the
> virtual clock device even be mapped all the way to userspace, for those
> applications which need precise and *reliable* time to check a
> 'disruption' marker in it, and do whatever is appropriate (e.g. abort
> transactions and wait for time to resync) when it happens.
> 
> We can do better than just letting the guest know about disruption, of
> course. We can put the actual counter→realtime relationship into the
> memory region too. As well as being able to provide a PTP driver with
> this, the applications which care about *reliable* timestamps can mmap
> the page directly and use it, vDSO-style, to have accurate timestamps
> even from the first cycle after migration.
> 
> When disruption is signalled, the guest needs to throw away any
> *additional* refinement that it's done with NTP/PTP/PPS/etc. and revert
> to knowing nothing more than what the hypervisor advertises here.
> 
> Here's a first attempt at defining such a memory structure. For now
> I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
> think it makes most sense for this to be part of the virtio_rtc spec.
> Ultimately it doesn't matter *how* the guest finds the memory region.

This looks sensible to me. I also think it would be possible to adapt this for
the virtio-rtc spec. The proposal also supports some other use cases which are
not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and
others such as indication of a past leap second.

Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to
support leap second smearing. Also, it would be helpful to allow indicating
when some of the fields are not valid (such as leapsecond_counter_value,
leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...).

Do you have plans to contribute this to the Virtio specification and Linux
driver implementation?

I also added a few comments below.

Thanks for sharing,

Peter

[2] https://lore.kernel.org/virtio-comment/20240522170003.52565-1-peter.hilber@xxxxxxxxxxxxxxx/

> 
> Very preliminary QEMU hacks at 
> https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> (I still need to do the KVM side helper for actually filling in the
> host clock information, converted to the *guest* TSC)
> 
> This is the guest side. H aving heckled your assumption that we can use
> the virt counter on Arm, I concede I'm doing the same thing for now.
> The structure itself is at the end, or see
> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock
> 
> From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> Date: Mon, 10 Jun 2024 15:10:11 +0100
> Subject: [PATCH] ptp: Add vDSO-style vmclock support
> 
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
> 
> Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> ---
>  drivers/ptp/Kconfig          |  13 ++
>  drivers/ptp/Makefile         |   1 +
>  drivers/ptp/ptp_vmclock.c    | 404 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vmclock.h | 102 +++++++++
>  4 files changed, 520 insertions(+)
>  create mode 100644 drivers/ptp/ptp_vmclock.c
>  create mode 100644 include/uapi/linux/vmclock.h
> 

[...]

> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> +	.owner		= THIS_MODULE,
> +	.max_adj	= 0,
> +	.n_ext_ts	= 0,
> +	.n_pins		= 0,
> +	.pps		= 0,
> +	.adjfine	= ptp_vmclock_adjfine,
> +	.adjtime	= ptp_vmclock_adjtime,
> +	.gettime64	= ptp_vmclock_gettime,

Should implement .gettimex64 instead.

> +	.settime64	= ptp_vmclock_settime,
> +	.enable		= ptp_vmclock_enable,
> +	.getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +

[...]

> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode. this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + */
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> +	uint32_t magic;
> +#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
> +	uint16_t size;		/* Size of page containing this structure */
> +	uint16_t version;	/* 1 */
> +
> +	/* Sequence lock. Low bit means an update is in progress. */
> +	uint64_t seq_count;
> +
> +	/*
> +	 * This field changes to another non-repeating value when the CPU
> +	 * counter is disrupted, for example on live migration.
> +	 */
> +	uint64_t disruption_marker;
> +
> +	uint8_t clock_status;
> +#define VMCLOCK_STATUS_UNKNOWN		0
> +#define VMCLOCK_STATUS_INITIALIZING	1
> +#define VMCLOCK_STATUS_SYNCHRONIZED	2
> +#define VMCLOCK_STATUS_FREERUNNING	3
> +#define VMCLOCK_STATUS_UNRELIABLE	4
> +
> +	uint8_t counter_id;
> +#define VMCLOCK_COUNTER_INVALID		0
> +#define VMCLOCK_COUNTER_X86_TSC		1
> +#define VMCLOCK_COUNTER_ARM_VCNT	2
> +
> +	uint8_t pad[3];
> +
> +	/*
> +	 * UTC on its own is non-monotonic and ambiguous.
> +	 *
> +	 * Inform the guest about the TAI offset, so that it can have an
> +	 * actual monotonic and reliable time reference if it needs it.
> +	 *
> +	 * Also indicate a nearby leap second, if one exists. Unlike in
> +	 * NTP, this may indicate a leap second in the past in order to
> +	 * indicate that it *has* been taken into account.
> +	 */
> +	int8_t leapsecond_direction;
> +	int16_t tai_offset_sec;
> +	uint64_t leapsecond_counter_value;
> +	uint64_t leapsecond_tai_time;
> +
> +	/* Paired values of counter and UTC at a given point in time. */
> +	uint64_t counter_value;
> +	uint64_t utc_time_sec;
> +	uint64_t utc_time_frac_sec;
> +
> +	/* Counter frequency, and error margin. Units of (second >> 64) */
> +	uint64_t counter_period_frac_sec;

AFAIU this might limit the precision in case of high counter frequencies.
Could the unit be aligned to the expected frequency band of counters?

> +	uint64_t counter_period_error_rate_frac_sec;
> +
> +	/* Error margin of UTC reading above (± picoseconds) */
> +	uint64_t utc_time_maxerror_picosec;
> +};