On 25.06.24 21:01, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
>
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
>
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by applications which need reliable notification of
> clock disruptions.
>
> Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> ---
>
> v2:
> • Add gettimex64() support
> • Convert TSC values to KVM clock when appropriate
> • Require int128 support
> • Add counter_period_shift
> • Add timeout when seq_count is invalid
> • Add flags field
> • Better comments in vmclock ABI structure
> • Explicitly forbid smearing (as clock rates would need to change)

Leap second smearing information could still be conveyed through the
vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
enough to indicate whether the driver should apply linear or cosine
smearing, plus the start time and end time of the smearing window.
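Just to illustrate, something like the below in struct vmclock_abi might
already cover that (a rough sketch only; the field names and encoding are
made up here, not something this patch defines):

	/* Leap second smearing hint: 0 = none, 1 = linear, 2 = cosine */
	uint8_t leap_smear_hint;
	/* UTC seconds at which the smearing window starts and ends */
	uint64_t leap_smear_start_sec;
	uint64_t leap_smear_end_sec;

A guest that wants a smeared clock could then apply the smear itself,
without the advertised counter period having to change.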
>
>  drivers/ptp/Kconfig          |  13 +
>  drivers/ptp/Makefile         |   1 +
>  drivers/ptp/ptp_vmclock.c    | 516 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vmclock.h | 138 ++++++++++
>  4 files changed, 668 insertions(+)
>  create mode 100644 drivers/ptp/ptp_vmclock.c
>  create mode 100644 include/uapi/linux/vmclock.h
>
[...]
> +
> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> + * the low 64 bits are (seconds >> 64).
> + *
> + * If __int128 isn't available, perform the calculation 32 bits at a time to
> + * avoid overflow.
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> +						uint64_t period, uint8_t shift,
> +						uint64_t frac_sec)
> +{
> +	unsigned __int128 res = (unsigned __int128)delta * period;
> +
> +	res >>= shift;
> +	res += frac_sec;
> +	*res_hi = res >> 64;
> +	return (uint64_t)res;
> +}
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> +				   struct ptp_system_timestamp *sts,
> +				   struct system_counterval_t *system_counter,
> +				   struct timespec64 *tspec)
> +{
> +	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> +	struct system_time_snapshot systime_snapshot;
> +	uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> +	/*
> +	 * We'd expect the hypervisor to know this and to report the clock
> +	 * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> +	 */
> +	if (check_tsc_unstable())
> +		return -EINVAL;
> +#endif
> +
> +	while (1) {
> +		seq = st->clk->seq_count & ~1ULL;
> +		virt_rmb();
> +
> +		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> +			return -EINVAL;
> +
> +		/*
> +		 * When invoked for gettimex64(), fill in the pre/post system
> +		 * times. The simple case is when system time is based on the
> +		 * same counter as st->cs_id, in which case all three times
> +		 * will be derived from the *same* counter value.
> +		 *
> +		 * If the system isn't using the same counter, then the value
> +		 * from ktime_get_snapshot() will still be used as pre_ts, and
> +		 * ptp_read_system_postts() is called to populate postts after
> +		 * calling get_cycles().
> +		 *
> +		 * The conversion to timespec64 happens further down, outside
> +		 * the seq_count loop.
> +		 */
> +		if (sts) {
> +			ktime_get_snapshot(&systime_snapshot);
> +			if (systime_snapshot.cs_id == st->cs_id) {
> +				cycle = systime_snapshot.cycles;
> +			} else {
> +				cycle = get_cycles();
> +				ptp_read_system_postts(sts);
> +			}
> +		} else
> +			cycle = get_cycles();
> +
> +		delta = cycle - st->clk->counter_value;

AFAIU, in the general case this needs to be masked for non-64-bit counters.
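Something along these lines, assuming the shared structure also exposed the
counter width or a mask derived from it (counter_mask / counter_width are
hypothetical here; this patch doesn't define them):

		/* e.g. counter_mask == GENMASK_ULL(counter_width - 1, 0) */
		delta = (cycle - st->clk->counter_value) & st->clk->counter_mask;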
> +
> +		frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> +						   st->clk->counter_period_frac_sec,
> +						   st->clk->counter_period_shift,
> +						   st->clk->utc_time_frac_sec);
> +		tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> +		tspec->tv_sec += st->clk->utc_time_sec;
> +
> +		virt_rmb();
> +		if (seq == st->clk->seq_count)
> +			break;
> +
> +		if (ktime_after(ktime_get(), deadline))
> +			return -ETIMEDOUT;
> +	}
> +
> +	if (system_counter) {
> +		system_counter->cycles = cycle;
> +		system_counter->cs_id = st->cs_id;
> +	}
> +
> +	if (sts) {
> +		sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> +		if (systime_snapshot.cs_id == st->cs_id)
> +			sts->post_ts = sts->pre_ts;
> +	}
> +
> +	return 0;
> +}
> +
[...]
>
> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> +	.owner		= THIS_MODULE,
> +	.max_adj	= 0,
> +	.n_ext_ts	= 0,
> +	.n_pins		= 0,
> +	.pps		= 0,
> +	.adjfine	= ptp_vmclock_adjfine,
> +	.adjtime	= ptp_vmclock_adjtime,
> +	.gettime64	= ptp_vmclock_gettime,

The .gettime64 op is now unneeded, since .gettimex64 is implemented.

> +	.gettimex64	= ptp_vmclock_gettimex,
> +	.settime64	= ptp_vmclock_settime,
> +	.enable		= ptp_vmclock_enable,
> +	.getcrosststamp	= ptp_vmclock_getcrosststamp,
> +};
> +
[...]
> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> new file mode 100644
> index 000000000000..cf0f22205e79
> --- /dev/null
> +++ b/include/uapi/linux/vmclock.h
> @@ -0,0 +1,138 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> +
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode, this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + *
> + * Note that this must be true UTC, never with smeared leap seconds. If a
> + * guest wishes to construct a smeared clock, it can do so. Presenting a
> + * smeared clock through this interface would be problematic because it
> + * actually messes with the apparent counter *period*. A linear smearing
> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> + * at the start/end of the smearing period, while a sinusoidal smear would
> + * basically be impossible to represent.

Clock types other than UTC could also be supported: TAI, monotonic.

> + */
> +
> +#ifndef __VMCLOCK_H__
> +#define __VMCLOCK_H__
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> +	uint32_t magic;
> +#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
> +	uint16_t size;		/* Size of page containing this structure */
> +	uint16_t version;	/* 1 */
> +
> +	/* Sequence lock. Low bit means an update is in progress. */
> +	uint64_t seq_count;
> +
> +	/*
> +	 * This field changes to another non-repeating value when the CPU
> +	 * counter is disrupted, for example on live migration.
> +	 */
> +	uint64_t disruption_marker;

The field could also change when the clock is stepped (leap seconds
excepted), or when the clock frequency is slewed.
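FWIW, the usage pattern I'd expect from a userspace consumer of the mapped
page, where these cases matter, is roughly the following (a sketch only;
how the page is mapped, the error handling and the exact barrier choice are
all glossed over here):

	const volatile struct vmclock_abi *clk = map;	/* mmap()ed region */
	uint64_t seq, marker;
	struct timespec ts;

	do {
		seq = clk->seq_count & ~1ULL;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		marker = clk->disruption_marker;
		clock_gettime(CLOCK_REALTIME, &ts);	/* or the vDSO path */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while (clk->seq_count != seq || clk->disruption_marker != marker);

	/* last_trusted_marker: value recorded at the last (re)calibration */
	if (marker != last_trusted_marker) {
		/* clock was disrupted since then; don't trust ts until recalibrated */
	}

If a step or a slew doesn't bump the marker, an application doing the above
keeps emitting timestamps it believes are within its error bound when they
may not be.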