Hello David, On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote: > On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote: > > Hello David, > > > > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote: > > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote: > > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote: > > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm: > > > > > > From: "Gautham R. Shenoy" <ego@xxxxxxxxxxxxxxxxxx> > > > > > > > > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the > > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before > > > > > > scheduling in the guest vCPU. > > > > > > > > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits > > > > > > set. This patch changes the behaviour to enter the guest with > > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we > > > > > > unconditionally clear these bits. Ideally this should be done > > > > > > conditionally on platforms where the guest stop instruction has no > > > > > > Bugs (starting POWER9 DD2.3). > > > > > > > > > > How will guests know that they can use this facility safely after your > > > > > series? You need both DD2.3 and a patched KVM. > > > > > > > > > > > > Yes, this is something that isn't addressed in this series (mentioned > > > > in the cover letter), which is a POC demonstrating that the stop0lite > > > > state in guest works. > > > > > > > > However, to answer your question, this is the scheme that I had in > > > > mind : > > > > > > > > OPAL: > > > > On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest" > > > > > > > > Hypervisor Kernel: > > > > 1. If "idle-stop-guest" dt-cpu-feature is discovered, then > > > > we set bool enable_guest_stop = true; > > > > > > > > 2. During KVM guest entry, clear PSSCR[ESL|EC] iff > > > > enable_guest_stop == true. > > > > > > > > 3. In kvm_vm_ioctl_check_extension(), for a new capability > > > > KVM_CAP_STOP, return true iff enable_guest_top == true. > > > > > > > > QEMU: > > > > Check with the hypervisor if KVM_CAP_STOP is present. If so, > > > > indicate the presence to the guest via device tree. > > > > > > Nack. Presenting different capabilities to the guest depending on > > > host capabilities (rather than explicit options) is never ok. It > > > means that depending on the system you start on you may or may not be > > > able to migrate to other systems that you're supposed to be able to, > > > > I agree that blocking migration for the unavailability of this feature > > is not desirable. Could you point me to some other capabilities in KVM > > which have been implemented via explicit options? > > TBH, most of the options for the 'pseries' machine type are in this > category: cap-vsx, cap-dfp, cap-htm, a bunch related to various > Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash > guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT > resizing extension), ic-mode (which irq controllers are available to > the guest). Thanks. I will follow this suit. > > > The ISA 3.0 allows the guest to execute the "stop" instruction. > > So, this was a bug in DD2.2's implementation of the architecture? Yes, the previous versions could miss wakeup events when stop was executed in HV=0,PR=0 mode. So, the hypervisor had to block that. > > > If the > > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the > > "stop" instruction in the causes a Hypervisor Facility Unavailable > > Exception, thus giving the hypervisor a chance to emulate the > > instruction. However, in the current code, when the hypervisor > > receives this exception, it sends a PROGKILL to the guest which > > results in crashing the guest. > > > > Patch 1 of this series emulates wakeup from the "stop" > > instruction. Would the following scheme be ok? > > > > OPAL: > > On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest" > > > > Hypervisor Kernel: > > > > If "idle-stop-guest" dt feature is available, then, before > > entering the guest, the hypervisor clears the PSSCR[EC|ESL] > > bits allowing the guest to safely execute stop instruction. > > > > If "idle-stop-guest" dt feature is not available, then, the > > Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a > > guest "stop" instruction execution to trap back into the > > hypervisor. We then emulate a wakeup from the stop > > instruction (Patch 1 of this series). > > > > Guest Kernel: > > If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the > > stop0lite cpuidle state. > > > > This allows us to migrate the KVM guest across any POWER9 > > Hypervisor. The minimal addition that the Hypervisor would need is > > Patch 1 of this series. > > That could be workable. Some caveats, though: > > * How does the latency of the trap-and-emulate compare to the guest > using H_CEDE in the first place? i.e. how big a negative impact > will this have for guests running on DD2.2 hosts? The wakeup latency of trap-and-emulated stop0lite (referred to as "stop0lite Emulated" in the tables below) the compares favorably compared to H_CEDE. It is in the order of 5-6us while the wakeup latency of H_CEDE is ~25-30us. ====================================================================== Wakeup Latency measured using a timer (in ns) [Lower is better] ====================================================================== Idle state | Nr samples | Min | Max | Median | Avg | Stddev| ---------------------------------------------------------------------- snooze | 60 | 787 | 1059 | 938 | 937.4 | 42.27 | ---------------------------------------------------------------------- stop0lite | 60 | 770 | 1182 | 948 | 946.4 | 67.41 | ---------------------------------------------------------------------- stop0lite | 60 | 2378 | 7659 | 5006 |5093.6 |1578.7 | Emulated | | | | | | | ---------------------------------------------------------------------- Shared CEDE| 60 | 9550 | 36694 | 29219 |28564.1|3545.9 | ====================================================================== ====================================================================== Wakeup latency measured using an IPI (in ns) [Lower is better] ====================================================================== Idle state | Nr | Min | Max | Median | Avg | Stddev | |samples | | | | | | ---------------------------------------------------------------------- snooze | 60 | 2089| 4228| 2259| 2342.31| 316.56| ---------------------------------------------------------------------- stop0lite | 60 | 1947| 3674| 2653| 2610.57| 266.73| ---------------------------------------------------------------------- stop0lite | 60 | 3580| 8154| 5596| 5644.95| 1368.44| Emulated | | | | | | | ---------------------------------------------------------------------- Shared CEDE| 60 | 20147| 36305| 21827| 26762.65| 6875.01| ====================================================================== > > * We'll only be able to enable this in a new qemu machine type > version (say, pseries-5.1.0). Otherwise a guest could start > thinking it can use stop states, then be migrated to an older qemu > or host kernel without the support and crash. That makese sense. In fact in the case of not being able to backport Patch 1 to all the older hypervisor kernels, we will need a way of gating the guest from using stop-states and then migrating onto an older hypervisor kernel. Associating this with a new qemu machine type version should solve this problem, assuming that all the newer qemus will also be running on newer hypervisor kernels. > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson -- Thanks and Regards gautham.