On Mon, Jun 22, 2020 at 10:51:47AM +0100, Marc Zyngier wrote:
> On 2020-06-22 09:41, Andrew Jones wrote:
> > On Mon, Jun 22, 2020 at 09:20:02AM +0100, Marc Zyngier wrote:
> > > Hi Andrew,
> > >
> > > On 2020-06-19 19:46, Andrew Jones wrote:
> > > > arm64 requires a vcpu fd (KVM_HAS_DEVICE_ATTR vcpu ioctl) to probe
> > > > support for steal time. However this is unnecessary and complicates
> > > > userspace (userspace may prefer delaying vcpu creation until after
> > > > feature probing). Since probing steal time only requires a KVM fd,
> > > > we introduce a cap that can be checked.
> > >
> > > So this is purely an API convenience, right? You want a way to
> > > identify the presence of steal time accounting without having to
> > > create a vcpu? It would have been nice to have this requirement
> > > before we merged this code :-(.
> >
> > Yes. I wish I had considered it more closely when I was reviewing the
> > patches. And, I believe we have yet another user interface issue that
> > I'm looking at now. Without the VCPU feature bit I'm not sure how easy
> > it will be for a migration to fail when attempting to migrate from a
> > host with steal-time enabled to one that does not support steal-time.
> > So it's starting to look like steal-time should have followed the pmu
> > pattern completely, not just the vcpu device ioctl part.
>
> Should we consider disabling steal time altogether until this is worked out?

I think we can leave it alone and just try to resolve it before merging
QEMU patches (which I'm working on now). It doesn't look like kvmtool or
rust-vmm (the only other two KVM userspaces I'm paying some attention to)
do anything with steal-time yet, so they won't notice. And, I'm not sure
disabling steal-time for any other userspaces is better than just trying
to keep them working the best we can while improving the uapi.
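
For illustration, here is a rough sketch of what probing the proposed
cap from userspace could look like, using only a /dev/kvm fd and no
vcpu. KVM_CHECK_EXTENSION is the existing ioctl; kvm_has_steal_time()
and probe_steal_time() are hypothetical helper names made up for this
example, and the #ifdef covers headers that predate the cap. Treat it
as a sketch under those assumptions, not as what any VMM actually does.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: returns 1 if KVM reports steal-time support,
 * 0 if it reports none, and -1 if no answer could be obtained (bad fd,
 * or uapi headers that do not define the cap).
 */
int kvm_has_steal_time(int kvm_fd)
{
#ifdef KVM_CAP_STEAL_TIME
	int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_STEAL_TIME);

	if (r < 0)
		return -1;
	return r > 0;
#else
	(void)kvm_fd;
	return -1;
#endif
}

/*
 * Hypothetical helper: open /dev/kvm, probe, close. No VM and no vcpu
 * is created along the way.
 */
int probe_steal_time(void)
{
	int fd = open("/dev/kvm", O_RDWR);
	int r;

	if (fd < 0)
		return -1;
	r = kvm_has_steal_time(fd);
	close(fd);
	return r;
}
```

A VMM could call probe_steal_time() during feature probing and only
later decide whether to set up the per-vcpu steal-time area.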

> > > >
> > > > Additionally, when probing
> > > > steal time we should check delayacct_on, because even though
> > > > CONFIG_KVM selects TASK_DELAY_ACCT, it's possible for the host
> > > > kernel to have delay accounting disabled with the 'nodelayacct'
> > > > command line option. x86 already determines support for steal time
> > > > by checking delayacct_on and can already probe steal time support
> > > > with a kvm fd (KVM_GET_SUPPORTED_CPUID), but we add the cap there
> > > > too for consistency.
> > > >
> > > > Signed-off-by: Andrew Jones <drjones@xxxxxxxxxx>
> > > > ---
> > > >  Documentation/virt/kvm/api.rst | 11 +++++++++++
> > > >  arch/arm64/kvm/arm.c           |  3 +++
> > > >  arch/x86/kvm/x86.c             |  3 +++
> > > >  include/uapi/linux/kvm.h       |  1 +
> > > >  4 files changed, 18 insertions(+)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index 9a12ea498dbb..05b1fdb88383 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -6151,3 +6151,14 @@ KVM can therefore start protected VMs.
> > > >  This capability governs the KVM_S390_PV_COMMAND ioctl and the
> > > >  KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
> > > >  guests when the state change is invalid.
> > > > +
> > > > +8.24 KVM_CAP_STEAL_TIME
> > > > +-----------------------
> > > > +
> > > > +:Architectures: arm64, x86
> > > > +
> > > > +This capability indicates that KVM supports steal time accounting.
> > > > +When steal time accounting is supported it may be enabled with
> > > > +architecture-specific interfaces. For x86 see
> > > > +Documentation/virt/kvm/msr.rst "MSR_KVM_STEAL_TIME". For arm64 see
> > > > +Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL"
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index 90cb90561446..f6dca6d09952 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -222,6 +222,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > >  		 */
> > > >  		r = 1;
> > > >  		break;
> > > > +	case KVM_CAP_STEAL_TIME:
> > > > +		r = sched_info_on();
> > > > +		break;
> > > >  	default:
> > > >  		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
> > > >  		break;
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 00c88c2f34e4..ced6335e403e 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -3533,6 +3533,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > >  	case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
> > > >  		r = kvm_x86_ops.nested_ops->enable_evmcs != NULL;
> > > >  		break;
> > > > +	case KVM_CAP_STEAL_TIME:
> > > > +		r = sched_info_on();
> > > > +		break;
> > > >  	default:
> > > >  		break;
> > > >  	}
> > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > index 4fdf30316582..121fb29ac004 100644
> > > > --- a/include/uapi/linux/kvm.h
> > > > +++ b/include/uapi/linux/kvm.h
> > > > @@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
> > > >  #define KVM_CAP_PPC_SECURE_GUEST 181
> > > >  #define KVM_CAP_HALT_POLL 182
> > > >  #define KVM_CAP_ASYNC_PF_INT 183
> > > > +#define KVM_CAP_STEAL_TIME 184
> > > >
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >
> > > Shouldn't you also add the same check of sched_info_on() to
> > > the various pvtime attribute handling functions? It feels odd
> > > that the capability can say "no", and yet we'd accept userspace
> > > messing with the steal time parameters...
> >
> > I considered that, but the 'has attr' interface is really only asking
> > if the interface exists, not if it should be used. I'm not sure what
> > we should do about it other than document that the cap needs to say
> > it's usable, rather than just the attr presence. But, since we've had
> > the attr merged quite a while without the cap, then maybe we can't
> > rely on a doc change alone?
>
> Accepting the pvtime attributes (setting up the per-vcpu area) has two
> effects: we promise both the guest and userspace that we will provide
> the guest with steal time. By not checking sched_info_on(), we lie to
> both, with potential consequences. It really feels like a bug.

Yes, I agree now. Again, following the pmu pattern looks best here. The
pmu will report that it doesn't have the attr support when its underlying
kernel support (perf counters) doesn't exist. That's a direct analogy
with steal-time relying on sched_info_on(). I'll work up another version
of this series doing that, but before posting I'll look at the migration
issue a bit more and likely post something for that as well.

Thanks,
drew
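
To make the pmu-pattern idea above concrete, here is a small userspace
model of it. This is not kernel code: every mock_* name, the attribute
id, and the choice of -ENXIO as the "not supported" value are
assumptions made for the sketch. It only shows the shape of the gating,
where both the 'has attr' and the 'set attr' paths refuse when the
underlying host accounting is off.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Stand-in for the kernel's sched_info_on(); a plain flag here so the
 * sketch can run in userspace.
 */
bool mock_sched_info_enabled = true;

bool mock_sched_info_on(void)
{
	return mock_sched_info_enabled;
}

/* Hypothetical attribute id, for illustration only. */
enum { MOCK_PVTIME_IPA = 0 };

/*
 * 'has attr' following the pmu pattern: report "no such attribute"
 * both for unknown ids and when host support is missing.
 */
int mock_pvtime_has_attr(int attr)
{
	if (!mock_sched_info_on())
		return -ENXIO;

	switch (attr) {
	case MOCK_PVTIME_IPA:
		return 0;
	default:
		return -ENXIO;
	}
}

/*
 * 'set attr' reuses the same gate, so userspace cannot configure a
 * steal-time area that the host would never update.
 */
int mock_pvtime_set_attr(int attr, unsigned long long ipa,
			 unsigned long long *slot)
{
	int r = mock_pvtime_has_attr(attr);

	if (r)
		return r;
	*slot = ipa;
	return 0;
}
```

With the flag cleared, both calls fail the same way, so a VMM that
probes with 'has attr' gets the same answer it would get from the cap.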