On Tue, 2023-10-10 at 09:40 +0000, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@xxxxxxxxxx>
>
> Unless explicitly told to do so (by passing 'clocksource=tsc' and
> 'tsc=stable:socket', and then jumping through some hoops concerning
> potential CPU hotplug) Xen will never use TSC as its clocksource.
> Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set
> in either the primary or secondary pvclock memory areas. This has
> led to bugs in some guest kernels which only become evident if
> PVCLOCK_TSC_STABLE_BIT *is* set in the pvclock.

Specifically, some OL7 kernels backported the whole pvclock vDSO thing
but *forgot* https://git.kernel.org/torvalds/c/9f08890ab and thus kill
init with a SIGBUS the first time it tries to read a clock, because
they don't actually map the pvclock pages to userspace :)

They apparently never noticed because evidently *their* Xen fleet
doesn't actually jump through all those hoops to use the TSC as its
clocksource either.

It's a fairly safe bet that there are more broken guest kernels out
there too, hence needing to work around it.

> Hence, to support
> such guests, give the VMM a new attribute to tell KVM to forcibly
> clear the bit in the Xen pvclocks.

I frowned at the "PVCLOCK" part of the new attribute for a while,
thinking that perhaps if we're going to have a set of flags to tweak
behaviour, we shouldn't be so specific. Call it 'XEN_FEATURES' or
something... but then I realised we'd want to *advertise* the set of
bits which is available for userspace to set...

... and then I realised we already do. That's exactly what the set of
bits returned, and *set*, with KVM_CAP_XEN_HVM is for.

So let's ditch the new *attribute*, and just add your new (renamed)
KVM_XEN_HVM_CONFIG_PVCLOCK_NO_STABLE_TSC cap to the set of
permitted_flags in kvm_xen_hvm_config() so that userspace can enable it
that way like it does the INTERCEPT_HYPERCALL and EVTCHN_SEND
behaviours.
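
A minimal, untested userspace sketch of that suggestion follows. It is
not the kernel code. The macro definitions are local stand-ins whose
bit positions are placeholders rather than the real uapi values, and
both helper functions are hypothetical names made up for illustration.
It only shows the shape of the idea: the new flag joins the
permitted_flags mask, and when set it clears the stable bit from the
flags the guest sees.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Local stand-ins for illustration only. The bit positions below are
 * placeholders (assumptions), not taken from the kernel uapi headers.
 */
#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL        (1u << 1)
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND            (1u << 5)
#define KVM_XEN_HVM_CONFIG_PVCLOCK_NO_STABLE_TSC  (1u << 7)
#define PVCLOCK_TSC_STABLE_BIT                    (1u << 0)

/*
 * Hypothetical helper: reject any flag that is not in the permitted
 * set, in the style of a permitted_flags check.
 */
static int xen_hvm_config_check_flags(uint32_t flags)
{
    const uint32_t permitted_flags =
        KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
        KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
        KVM_XEN_HVM_CONFIG_PVCLOCK_NO_STABLE_TSC;

    if (flags & ~permitted_flags)
        return -EINVAL;

    return 0;
}

/*
 * Hypothetical helper: compute the pvclock flags a Xen guest would
 * see, clearing the stable bit when the VMM asked for that.
 */
static uint8_t xen_pvclock_flags(uint8_t host_flags, uint32_t xen_flags)
{
    if (xen_flags & KVM_XEN_HVM_CONFIG_PVCLOCK_NO_STABLE_TSC)
        host_flags &= (uint8_t)~PVCLOCK_TSC_STABLE_BIT;

    return host_flags;
}
```

The point of going through the existing flags mask is that userspace
can discover whether the behaviour is available the same way it already
discovers the others, with no separate attribute needed.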