On Thu, Feb 24, 2022 at 03:40:15PM +0000, Marc Zyngier wrote: > On Wed, 23 Feb 2022 04:18:38 +0000, > Oliver Upton <oupton@xxxxxxxxxx> wrote: > > > > Add a new system event type, KVM_SYSTEM_EVENT_SUSPEND, which indicates > > to userspace that the guest has requested the VM be suspended. Userspace > > can decide whether or not it wants to honor the guest's request by > > changing the MP state of the vCPU. If it does not, userspace is > > responsible for configuring the vCPU to return an error to the guest. > > Document these expectations in the KVM API documentation. > > > > To preserve ABI, this new exit requires explicit opt-in from userspace. > > Add KVM_CAP_ARM_SYSTEM_SUSPEND which grants userspace the ability to > > opt-in to these exits on a per-VM basis. > > > > Signed-off-by: Oliver Upton <oupton@xxxxxxxxxx> > > --- > > Documentation/virt/kvm/api.rst | 39 +++++++++++++++++++++++++++++++ > > arch/arm64/include/asm/kvm_host.h | 3 +++ > > arch/arm64/kvm/arm.c | 5 ++++ > > arch/arm64/kvm/psci.c | 5 ++++ > > include/uapi/linux/kvm.h | 2 ++ > > 5 files changed, 54 insertions(+) > > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > > index 2b4bdbc2dcc0..1e207bbc01f5 100644 > > --- a/Documentation/virt/kvm/api.rst > > +++ b/Documentation/virt/kvm/api.rst > > @@ -5930,6 +5930,7 @@ should put the acknowledged interrupt vector into the 'epr' field. > > #define KVM_SYSTEM_EVENT_RESET 2 > > #define KVM_SYSTEM_EVENT_CRASH 3 > > #define KVM_SYSTEM_EVENT_WAKEUP 4 > > + #define KVM_SYSTEM_EVENT_SUSPENDED 5 > > __u32 type; > > __u64 flags; > > } system_event; > > @@ -5957,6 +5958,34 @@ Valid values for 'type' are: > > - KVM_SYSTEM_EVENT_WAKEUP -- the guest is in a suspended state and KVM > > has recognized a wakeup event. Userspace may honor this event by marking > > the exiting vCPU as runnable, or deny it and call KVM_RUN again. > > + - KVM_SYSTEM_EVENT_SUSPENDED -- the guest has requested a suspension of > > + the VM. > > + > > +For arm/arm64: > > +^^^^^^^^^^^^^^ > > + > > + KVM_SYSTEM_EVENT_SUSPENDED exits are enabled with the > > + KVM_CAP_ARM_SYSTEM_SUSPEND VM capability. If a guest successfully > > + invokes the PSCI SYSTEM_SUSPEND function, KVM will exit to userspace > > + with this event type. > > + > > + The guest's x2 register contains the 'entry_address' where execution > > x1? > > > + should resume when the VM is brought out of suspend. The guest's x3 > > x2? > > > + register contains the 'context_id' corresponding to the request. When > > + the guest resumes execution at 'entry_address', x0 should contain the > > + 'context_id'. For more details on the SYSTEM_SUSPEND PSCI call, see > > + ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND". > > I'd refrain from paraphrasing too much of the spec, and direct the > user to it. It will also avoid introducing bugs... ;-) > > Overall, "the guest" is super ambiguous, and echoes the questions I > had earlier about what this means for an SMP system. Only one vcpu can > restart the system, but which one? > > > + > > + Userspace is _required_ to take action for such an exit. It must > > + either: > > + > > + - Honor the guest request to suspend the VM. Userspace must reset > > + the calling vCPU, then set PC to 'entry_address' and x0 to > > + 'context_id'. Userspace may request in-kernel emulation of the > > + suspension by setting the vCPU's state to KVM_MP_STATE_SUSPENDED. > > So here, you are actively saying that the calling vcpu should be the > one being resumed. If that's the case (and assuming that this is a > behaviour intended by the spec), something should prevent the other > vcpus from being started. > > > + > > + - Deny the guest request to suspend the VM. Userspace must set > > + registers x1-x3 to 0 and set x0 to PSCI_RET_INTERNAL_ERROR (-6). > > Do you have any sort of userspace code that demonstrates this? It'd be > super useful to see how that works on any publicly available VMM > (qemu, kvmtool, or any of the ferric oxide based monsters). > > > > > :: > > > > @@ -7580,3 +7609,13 @@ The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset > > of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace > > the hypercalls whose corresponding bit is in the argument, and return > > ENOSYS for the others. > > + > > +8.35 KVM_CAP_ARM_SYSTEM_SUSPEND > > +------------------------------- > > + > > +:Capability: KVM_CAP_ARM_SYSTEM_SUSPEND > > +:Architectures: arm64 > > +:Type: vm > > + > > +When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of > > +type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request. > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > index d32cab0c9752..e1c2ec18d1aa 100644 > > --- a/arch/arm64/include/asm/kvm_host.h > > +++ b/arch/arm64/include/asm/kvm_host.h > > @@ -146,6 +146,9 @@ struct kvm_arch { > > > > /* Memory Tagging Extension enabled for the guest */ > > bool mte_enabled; > > + > > + /* System Suspend Event exits enabled for the VM */ > > + bool system_suspend_exits; > > Gah... More of these. Please pick this patch: > > https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=kvm-arm64/mmu/guest-MMIO-guard&id=7dd0a13a4217b870f2e83cdc6045e5ce482a5340 > > > }; > > > > struct kvm_vcpu_fault_info { > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > index d2b190f32651..ce3f14a77a49 100644 > > --- a/arch/arm64/kvm/arm.c > > +++ b/arch/arm64/kvm/arm.c > > @@ -101,6 +101,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > } > > mutex_unlock(&kvm->lock); > > break; > > + case KVM_CAP_ARM_SYSTEM_SUSPEND: > > + r = 0; > > + kvm->arch.system_suspend_exits = true; > > + break; > > default: > > r = -EINVAL; > > break; > > @@ -209,6 +213,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > > case KVM_CAP_SET_GUEST_DEBUG: > > case KVM_CAP_VCPU_ATTRIBUTES: > > case KVM_CAP_PTP_KVM: > > + case KVM_CAP_ARM_SYSTEM_SUSPEND: > > r = 1; > > break; > > case KVM_CAP_SET_GUEST_DEBUG2: > > diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c > > index 2bb8d047cde4..a7de84cec2e4 100644 > > --- a/arch/arm64/kvm/psci.c > > +++ b/arch/arm64/kvm/psci.c > > @@ -245,6 +245,11 @@ static int kvm_psci_system_suspend(struct kvm_vcpu *vcpu) > > return 1; > > } > > > > + if (kvm->arch.system_suspend_exits) { > > + kvm_vcpu_set_system_event_exit(vcpu, KVM_SYSTEM_EVENT_SUSPEND); > > + return 0; > > + } > > + > > So there really is a difference in behaviour here. Userspace sees the > WFI behaviour before reset (it implements it), while when not using > the SUSPEND event, reset occurs before anything else. > > They really should behave in a similar way (WFI first, reset next). I mentioned this on the other patch, but I think the conversation should continue here as UAPI context is in this one. If SUSPEND exits are disabled and SYSTEM_SUSPEND is implemented in the kernel, userspace cannot observe any intermediate state. I think it is necessary for migration, otherwise if userspace were to save the vCPU post-WFI, pre-reset the pending reset would get lost along the way. As far as userspace is concerned, I think the WFI+reset operation is atomic. SUSPEND exits just allow userspace to intervene before said atomic operation. Perhaps I'm missing something: assuming SUSPEND exits are disabled, what value is provided to userspace if it can see WFI behavior before the reset? -- Thanks, Oliver