From: "Gautham R. Shenoy" <ego@xxxxxxxxxxxxxxxxxx> *** RFC Only. Not intended for inclusion ************ Motivation ~~~~~~~~~~~~~~~ The POWER ISA v3.0 allows stop instruction to be executed from a Guest Kernel (HV=0,PR=0) context. If the hypervisor has cleared PSSCR[ESL|EC] bits, then the stop instruction thus executed will cause the vCPU thread to "pause", thereby donating its cycles to the other threads in the core until the paused thread is woken up by an interrupt. If the hypervisor has set the PSSCR[ESL|EC] bits, then execution of the "stop" instruction will raise a Hypervisor Facility Unavailable exception. The stop idle state in the guest (henceforth referred to as stop0lite) when enabled * has a very small wakeup latency (1-3us) comparable to that of snooze and considerably better compared the Shared CEDE state (25-30us). Results are provided below for wakeup latency measured by waking up an idle CPU in a given state using a timer as well as using an IPI. ====================================================================== Wakeup Latency measured using a timer (in ns) [Lower is better] ====================================================================== Idle state | Nr samples | Min | Max | Median | Avg | Stddev| ====================================================================== snooze | 60 | 787 | 1059 | 938 | 937.4 | 42.27 | ====================================================================== stop0lite | 60 | 770 | 1182 | 948 | 946.4 | 67.41 | ====================================================================== Shared CEDE| 60 | 9550 | 36694 | 29219 |28564.1|3545.9 | ====================================================================== ====================================================================== Wakeup Latency measured using a timer (in ns) [Lower is better] ====================================================================== Idle state | Nr samples | Min | Max | Median | Avg | Stddev| ====================================================================== snooze | 60 | 787 | 1059 | 938 | 937.4 | 42.27 | ====================================================================== stop0lite | 60 | 770 | 1182 | 948 | 946.4 | 67.41 | ====================================================================== Shared CEDE| 60 | 9550 | 36694 | 29219 |28564.1|3545.9 | ====================================================================== * provides an improved single threaded performance compared to snooze since the idle state completely relinquishes the core cycles. The single threaded performance is observed to be better even when compared to "Shared CEDE", since in the latter case something else can scheduled on the ceded CPU, while "stop0lite" doesn't give up the CPU. On a KVM guest with smp 8,sockets=1,cores=2,threads=4 with vCPUs of a vCore bound to a physical core, we run a single-threaded ebizzy pinned to one of the guest vCPUs while the sibling vCPUs in the core are idling. We enable only one guest idle state at a time to measure the single-threaded performance benefit that the idle state provides by giving up the core resources to the non-idle thread. we obtain ~13% improvement in the throughput compared to that with "snooze" and ~8% improvement in the throughput compared to "Shared CEDE". ======================================================================= | ebizzy records/s : [Higher the better] | ======================================================================= |Idle state | Nr | Min | Max | Median | Avg | Stddev | | |samples | | | | | | ======================================================================= |snooze | 10 | 1378988| 1379358| 1379032|1379067.3| 113.47| ======================================================================= |stop0lite | 10 | 1561836| 1562058| 1561906|1561927.5| 81.87| ======================================================================= |Shared CEDE| 10 | 1446584| 1447383| 1447037|1447009.0| 244.16| ======================================================================= Is stop0lite a replacement for snooze ? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Not yet. snooze is a polling state, and can respond much faster to a need_resched() compared to stop0lite which needs an IPI to wakeup from idle state. This can be seen in the results below: With the context_switch2 pipe test, we can see that with stop0lite, the number of context switches are 32.47% lesser than with snooze. This is due to the fact that snooze is a polling state which polls for TIF_NEED_RESCHED. Thus it does not require an interrupt to exit the state and start executing the scheduler code. However, stop0lite needs an IPI. Compared to the "Shared CEDE" state, we see that with stop0lite, we have 82.7% improvement in the number of context switches. This is due to the low wakeup latency compared to Shared CEDE. ====================================================================== context switch2 : Number of context switches/s [Higher the better] ====================================================================== Idle state | Nr | Min | Max | Median | Avg | Stddev | |samples | | | | | | ====================================================================== snooze | 100 | 210480| 221578| 219860|219684.88| 1344.97| ====================================================================== stop0lite | 100 | 146730| 150266| 148258|148331.70| 871.50| ====================================================================== Shared CEDE| 100 | 75812| 82792| 81232| 81187.16| 832.99| ====================================================================== Is stop0lite a replacement for Shared CEDE ? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ No. For longer idle durations, Shared CEDE is a better option compared to "stop0lite", both from a performance (CEDEd CPUs can be put into deeper idle states such as stop2, which can provide SMT folding benefits) and utilization (Hypervisor can utilize the idle CPUs for running something useful). What this patch-set does: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The patchset has 3 patches Patch 1: Allows the guest to run "stop" instruction without crashing even if the hypervisor has set the PSSCR[ESL|EC] bits. This is done by handling the Hypervisor Facility Unavailable exception and incrementing the program counter by 4 bytes, thus emulating the wakeup from a PSSCR[ESL = EC = 0] stop. Patch 2: Clears the PSSCR[ESL|EC] bits unconditionally before dispatching a vCPU, thereby allowing the vCPU to execute a "stop" instruction. Patch 3: Defines a cpuidle state for pseries guest named "stop0lite" to be invoked by the cpuidle driver. What this patch-set does not do: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * It does not define an interface by which the guest discovers the stop-capability. Should this be defined via device-tree? * It does address the problem of guest migration. i.e, a guest started on a hypervisor which supports guest stop state, if migrated to a hypervisor which does not support guest stop state will crash, unless it has Patch 1 above. I would like to seek feedback and comments with respect to how to go about implementing the issues that have not been addressed in this patchset. Gautham R. Shenoy (3): powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop. pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry cpuidle/pseries: Add stop0lite state arch/powerpc/include/asm/reg.h | 1 + arch/powerpc/kvm/book3s_hv.c | 8 ++++++-- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------ drivers/cpuidle/cpuidle-pseries.c | 27 +++++++++++++++++++++++++++ 4 files changed, 47 insertions(+), 14 deletions(-) -- 1.9.4