Hi,

I observed performance degradation when running some parallel programs on a VM that has (1) KVM_FEATURE_PV_UNHALT, (2) KVM_FEATURE_STEAL_TIME, and (3) a multi-core topology. The benchmark results are shown at the bottom. An example of the libvirt XML for creating such a VM is

```
[...]
<vcpu placement='static'>8</vcpu>
<cpu mode='host-model'>
  <topology sockets='1' cores='8' threads='1'/>
</cpu>
<qemu:commandline>
  <qemu:arg value='-cpu'/>
  <qemu:arg value='host,l3-cache=on,+kvm-pv-unhalt,+kvm-steal-time'/>
</qemu:commandline>
[...]
```

I investigated the cause and found that the problem occurs as follows:

- vCPU1 schedules thread A, and vCPU2 schedules thread B. vCPU1 and vCPU2 share the LLC.
- Thread A tries to acquire a lock but fails and goes to sleep (via futex).
- vCPU1 becomes idle because it has no runnable threads and executes HLT, which leads to an HLT VMEXIT (assuming idle=halt and KVM does not disable HLT exits via KVM_CAP_X86_DISABLE_EXITS).
- KVM sets vCPU1's st->preempted to 1 in kvm_steal_time_set_preempted().
- Thread C wakes up on vCPU2. vCPU2 looks for an idle core in select_idle_core(). Although vCPU1 is idle, it is not a candidate for load balancing because vcpu_is_preempted(vCPU1) is true, and hence available_idle_cpu(vCPU1) is false.
- As a result, both thread B and thread C stay in vCPU2's runqueue, and vCPU1 remains unused.

The patch changes kvm_arch_vcpu_put() so that it does not set st->preempted to 1 when a vCPU exits due to HLT. As a result, vcpu_is_preempted(vCPU) returns 0, and the vCPU becomes a candidate for CFS load balancing (a small toy model of both sides of this behavior is appended at the end of this mail).

The following are parts of the benchmark results of NPB-OMP (https://www.nas.nasa.gov/publications/npb.html), which contains several parallel computing programs. My machine has two NUMA nodes, and each CPU has 24 cores (Intel Xeon Platinum 8160, hyper-threading disabled). I created a VM with 48 vCPUs, and each vCPU is pinned to the corresponding pCPU. I also created virtual NUMA nodes so that the guest topology is as close to the host as possible. Values in the table are execution times (seconds; lower is better).

| environment \ benchmark name | lu.C   | mg.C  | bt.C  | cg.C  |
|------------------------------+--------+-------+-------+-------|
| host (Linux v5.13-rc3)       | 50.67  | 14.67 | 54.77 | 20.08 |
| VM (sockets=48, cores=1)     | 51.37  | 14.88 | 55.99 | 20.05 |
| VM (sockets=2, cores=24)     | 170.12 | 23.86 | 75.95 | 40.15 |
| w/ this patch                | 48.92  | 14.95 | 55.23 | 20.09 |

vcpu_is_preempted() is also used in PV spinlock implementations to mitigate lock-holder preemption problems, etc. A vCPU holding a lock does not execute HLT, so I think this patch does not affect that use. However, the pCPU may be running a host thread that has higher priority than the vCPU thread; in that case, vcpu_is_preempted() should ideally still return 1, since the halted vCPU cannot resume immediately. I guess implementing that would be somewhat complicated, so I wonder whether this patch's approach is acceptable.

Thanks,

Masanori Misono (1):
  KVM: x86: Don't set preempted when vCPU does HLT VMEXIT

 arch/x86/kvm/x86.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)


base-commit: c4681547bcce777daf576925a966ffa824edd09d
--
2.31.1
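
For completeness, below is a small self-contained toy model (plain userspace C, not kernel code) of the two halves of the issue described above: the guest-side check that rejects an idle vCPU whose st->preempted is set (the real logic lives in available_idle_cpu() and the PV vcpu_is_preempted() path), and the host-side idea of this patch, namely skipping the preempted hint when a vCPU is scheduled out only because it executed HLT. The exited_on_hlt flag and all data structures here are simplified stand-ins for illustration; this is not the actual patch.

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define KVM_VCPU_PREEMPTED (1 << 0)
#define NR_VCPUS 2

/* Stand-in for the per-vCPU struct kvm_steal_time record shared with the host. */
struct steal_time {
    uint8_t preempted;
};

static struct steal_time st[NR_VCPUS];
static bool rq_empty[NR_VCPUS];   /* stand-in for "this vCPU's runqueue is empty" */

/* Guest side: the PV hint the scheduler consults (cf. __kvm_vcpu_is_preempted()). */
static bool vcpu_is_preempted(int cpu)
{
    return st[cpu].preempted & KVM_VCPU_PREEMPTED;
}

/*
 * Guest side: an idle vCPU that is still flagged as preempted is not a
 * wake-up target (cf. available_idle_cpu(), used by select_idle_core()).
 */
static bool available_idle_cpu(int cpu)
{
    return rq_empty[cpu] && !vcpu_is_preempted(cpu);
}

/*
 * Host side: sketch of the idea of this patch.  exited_on_hlt is hypothetical
 * bookkeeping; the point is only that the preempted hint is left clear when
 * the vCPU is scheduled out because it halted, not because it was preempted.
 */
static void vcpu_put(int cpu, bool exited_on_hlt)
{
    if (exited_on_hlt)
        return;
    st[cpu].preempted |= KVM_VCPU_PREEMPTED;
}

int main(void)
{
    rq_empty[1] = true;     /* vCPU1 went idle and executed HLT */

    vcpu_put(1, false);     /* current behaviour: flagged as preempted */
    printf("before: vCPU1 usable for thread C? %s\n",
           available_idle_cpu(1) ? "yes" : "no");   /* prints "no" */

    st[1].preempted = 0;
    vcpu_put(1, true);      /* with the patch idea: flag left clear */
    printf("after:  vCPU1 usable for thread C? %s\n",
           available_idle_cpu(1) ? "yes" : "no");   /* prints "yes" */
    return 0;
}
```

Built with a stock gcc, it prints "no" before and "yes" after, mirroring the select_idle_core() behaviour described in the bullet list above.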