On Wed, May 26, 2021 at 10:37:26PM +0900, Masanori Misono wrote:
> Hi,
>
> I observed performance degradation when running some parallel programs on a
> VM that has (1) KVM_FEATURE_PV_UNHALT, (2) KVM_FEATURE_STEAL_TIME, and (3)
> multi-core architecture. The benchmark results are shown at the bottom. An
> example of libvirt XML for creating such a VM is
>
> ```
> [...]
> <vcpu placement='static'>8</vcpu>
> <cpu mode='host-model'>
>   <topology sockets='1' cores='8' threads='1'/>
> </cpu>
> <qemu:commandline>
>   <qemu:arg value='-cpu'/>
>   <qemu:arg value='host,l3-cache=on,+kvm-pv-unhalt,+kvm-steal-time'/>
> </qemu:commandline>
> [...]
> ```
>
> I investigated the cause and found that the problem occurs in the following
> way:
>
> - vCPU1 schedules thread A, and vCPU2 schedules thread B. vCPU1 and vCPU2
>   share LLC.
> - Thread A tries to acquire a lock but fails, resulting in a sleep state
>   (via futex).
> - vCPU1 becomes idle because there are no runnable threads and does HLT,
>   which leads to HLT VMEXIT (if idle=halt, and KVM doesn't disable HLT
>   VMEXIT using KVM_CAP_X86_DISABLE_EXITS).
> - KVM sets vCPU1's st->preempted as 1 in kvm_steal_time_set_preempted().
> - Thread C wakes on vCPU2. vCPU2 tries to do load balancing in
>   select_idle_core(). Although vCPU1 is idle, vCPU1 is not a candidate for
>   load balancing because is_vcpu_preempted(vCPU1) is true, hence
>   available_idle_cpu(vCPU1) is false.
> - As a result, both thread B and thread C stay in the vCPU2's runqueue, and
>   vCPU1 is not utilized.
>
> The patch changes kvm_arch_cpu_put() so that it does not set st->preempted
> as 1 when a vCPU does HLT VMEXIT. As a result, is_vcpu_preempted(vCPU)
> becomes 0, and the vCPU becomes a candidate for CFS load balancing.

I'm conflicted on this; the vcpu stops running, the pcpu can go do
anything, it might start the next task. There is no saying how quickly
the vcpu task can return to running.

I'm guessing your setup doesn't actually overload the system; and when
it doesn't have the vcpu thread to run, the pcpu actually goes idle too.
But for those 1:1 cases we already have knobs to disable much of this
IIRC.

So I'm tempted to say things are working as expected and you're just not
configured right.
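
For reference, the check under discussion can be modelled in a few lines of
user-space C. This is only a sketch under stated assumptions, not the kernel
code: struct vcpu_model and the model_*() helpers are hypothetical names, and
the logic reflects my understanding that available_idle_cpu() requires the
CPU to be idle and not flagged as preempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a vCPU as the guest scheduler sees it. */
struct vcpu_model {
	bool idle;      /* no runnable task on this vCPU's runqueue */
	bool preempted; /* steal-time "preempted" flag written by the host */
};

/* Assumed shape of the check: idle AND not reported as preempted. */
static bool model_available_idle_cpu(const struct vcpu_model *v)
{
	return v->idle && !v->preempted;
}

/* Return the index of the first usable idle vCPU, or -1 if there is none. */
static int model_pick_idle_vcpu(const struct vcpu_model *v, size_t n)
{
	assert(v != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (model_available_idle_cpu(&v[i]))
			return (int)i;
	}
	return -1;
}
```

With vCPU1 modelled as { .idle = true, .preempted = true } and vCPU2 as busy,
model_pick_idle_vcpu() finds no candidate, which is the situation the report
describes; clearing the preempted flag on HLT makes vCPU1 eligible again. The
model cannot say whether the pcpu behind that vCPU is actually free, which is
the concern raised above.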