Hi,

I observed performance degradation when running some parallel programs on a VM that has (1) KVM_FEATURE_PV_UNHALT, (2) KVM_FEATURE_STEAL_TIME, and (3) a multi-core topology. The benchmark results are shown at the bottom. An example of the libvirt XML for creating such a VM is

```
[...]
<vcpu placement='static'>8</vcpu>
<cpu mode='host-model'>
  <topology sockets='1' cores='8' threads='1'/>
</cpu>
<qemu:commandline>
  <qemu:arg value='-cpu'/>
  <qemu:arg value='host,l3-cache=on,+kvm-pv-unhalt,+kvm-steal-time'/>
</qemu:commandline>
[...]
```

I investigated the cause and found that the problem occurs as follows:

- vCPU1 schedules thread A, and vCPU2 schedules thread B. vCPU1 and vCPU2 share the LLC.
- Thread A tries to acquire a lock but fails and goes to sleep (via futex).
- vCPU1 becomes idle because it has no runnable threads and executes HLT, which leads to an HLT VMEXIT (assuming idle=halt and KVM does not disable HLT exits via KVM_CAP_X86_DISABLE_EXITS).
- KVM sets vCPU1's st->preempted to 1 in kvm_steal_time_set_preempted().
- Thread C wakes up on vCPU2. vCPU2 looks for an idle core in select_idle_core(). Although vCPU1 is idle, it is not a candidate for load balancing because vcpu_is_preempted(vCPU1) is true, and hence available_idle_cpu(vCPU1) is false.
- As a result, both thread B and thread C stay in vCPU2's runqueue, and vCPU1 remains unused.

The patch changes kvm_arch_vcpu_put() so that it does not set st->preempted to 1 when a vCPU exits due to HLT. As a result, vcpu_is_preempted(vCPU) returns 0, and the vCPU becomes a candidate for CFS load balancing (a small toy model of both sides of this behavior is appended at the end of this mail).

The following are parts of the benchmark results of NPB-OMP (https://www.nas.nasa.gov/publications/npb.html), which contains several parallel computing programs. My machine has two NUMA nodes, and each CPU has 24 cores (Intel Xeon Platinum 8160, hyper-threading disabled). I created a VM with 48 vCPUs, and each vCPU is pinned to the corresponding pCPU. I also created virtual NUMA nodes so that the guest topology is as close to the host as possible. Values in the table are execution times (seconds; lower is better).

| environment \ benchmark name | lu.C   | mg.C  | bt.C  | cg.C  |
|------------------------------+--------+-------+-------+-------|
| host (Linux v5.13-rc3)       | 50.67  | 14.67 | 54.77 | 20.08 |
| VM (sockets=48, cores=1)     | 51.37  | 14.88 | 55.99 | 20.05 |
| VM (sockets=2, cores=24)     | 170.12 | 23.86 | 75.95 | 40.15 |
| w/ this patch                | 48.92  | 14.95 | 55.23 | 20.09 |

vcpu_is_preempted() is also used in PV spinlock implementations to mitigate lock-holder preemption problems, etc. A vCPU holding a lock does not execute HLT, so I think this patch does not affect that use. However, the pCPU may be running a host thread that has higher priority than the vCPU thread; in that case, vcpu_is_preempted() should ideally still return 1, since the halted vCPU cannot resume immediately. I guess implementing that would be somewhat complicated, so I wonder whether this patch's approach is acceptable.

Thanks,

Masanori Misono (1):
  KVM: x86: Don't set preempted when vCPU does HLT VMEXIT

 arch/x86/kvm/x86.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)


base-commit: c4681547bcce777daf576925a966ffa824edd09d
--
2.31.1
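
For completeness, below is a small self-contained toy model (plain userspace C, not kernel code) of the two halves of the issue described above: the guest-side check that rejects an idle vCPU whose st->preempted is set (the real logic lives in available_idle_cpu() and the PV vcpu_is_preempted() path), and the host-side idea of this patch, namely skipping the preempted hint when a vCPU is scheduled out only because it executed HLT. The exited_on_hlt flag and all data structures here are simplified stand-ins for illustration; this is not the actual patch.

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define KVM_VCPU_PREEMPTED (1 << 0)
#define NR_VCPUS 2

/* Stand-in for the per-vCPU struct kvm_steal_time record shared with the host. */
struct steal_time {
    uint8_t preempted;
};

static struct steal_time st[NR_VCPUS];
static bool rq_empty[NR_VCPUS];   /* stand-in for "this vCPU's runqueue is empty" */

/* Guest side: the PV hint the scheduler consults (cf. __kvm_vcpu_is_preempted()). */
static bool vcpu_is_preempted(int cpu)
{
    return st[cpu].preempted & KVM_VCPU_PREEMPTED;
}

/*
 * Guest side: an idle vCPU that is still flagged as preempted is not a
 * wake-up target (cf. available_idle_cpu(), used by select_idle_core()).
 */
static bool available_idle_cpu(int cpu)
{
    return rq_empty[cpu] && !vcpu_is_preempted(cpu);
}

/*
 * Host side: sketch of the idea of this patch.  exited_on_hlt is hypothetical
 * bookkeeping; the point is only that the preempted hint is left clear when
 * the vCPU is scheduled out because it halted, not because it was preempted.
 */
static void vcpu_put(int cpu, bool exited_on_hlt)
{
    if (exited_on_hlt)
        return;
    st[cpu].preempted |= KVM_VCPU_PREEMPTED;
}

int main(void)
{
    rq_empty[1] = true;     /* vCPU1 went idle and executed HLT */

    vcpu_put(1, false);     /* current behaviour: flagged as preempted */
    printf("before: vCPU1 usable for thread C? %s\n",
           available_idle_cpu(1) ? "yes" : "no");   /* prints "no" */

    st[1].preempted = 0;
    vcpu_put(1, true);      /* with the patch idea: flag left clear */
    printf("after:  vCPU1 usable for thread C? %s\n",
           available_idle_cpu(1) ? "yes" : "no");   /* prints "yes" */
    return 0;
}
```

Built with a stock gcc, it prints "no" before and "yes" after, mirroring the select_idle_core() behaviour described in the bullet list above.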