On Thu, 22 Apr 2021 at 06:13, Kenta Ishiguro
<kentaishiguro@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Dear KVM developers and maintainers,
>
> In our research work presented last week at the VEE 2021 conference [1],
> we found that a lot of continuous Pause-Loop-Exiting (PLE) events occur
> due to three problems we have identified: 1) Linux CFS ignores hints
> from KVM; 2) IPI-receiver vCPUs in user mode are not boosted; 3) an IPI
> receiver that has halted is always a candidate for boost. We have
> introduced two mitigations for these problems.
>
> To solve problem (1), patch 1 increases the vruntime of the yielded vCPU
> so that it passes the check `if (cfs_rq->next &&
> wakeup_preempt_entity(cfs_rq->next, left) < 1)` in `struct sched_entity *
> pick_next_entity()` when the cfs_rq's skip and next buddies are both
> vCPUs in the same VM (see the first sketch below). To keep fairness, it
> does not prioritize the guest VM that causes the PLE; nevertheless, it
> improves performance by eliminating unnecessary PLE events. We have also
> confirmed that `yield_to_task_fair` is called only from KVM.
>
> To solve problems (2) and (3), patch 2 monitors IPI communication
> between vCPUs and leverages the relationship between vCPUs to select
> boost candidates (see the second sketch below). The "[PATCH] KVM: Boost
> vCPU candidate in user mode which is delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@xxxxxxxxxxxxxx/T/)
> seems to be effective for (2), although it only uses the IPI receiver's
> information.
>
> Our approach reduces the total number of PLE events by up to 87.6% in
> four 8-vCPU VMs in an over-subscribed scenario on Linux kernel 5.6.0.
> Please find the patch below.

You should mention that this improvement comes mainly from your problem
(1) scheduler hacking; however, a kvm task is just an ordinary task, and
the scheduler maintainers generally do not accept special treatment for
it. The worst case of problem (1) mentioned in your paper is, I guess, the
vCPU stacking issue; I tried to mitigate it before
(https://lore.kernel.org/kvm/1564479235-25074-1-git-send-email-wanpengli@xxxxxxxxxxx/).

For your problem (3), we evaluated hackbench, which heavily contends rq
locks and generates heavy async IPIs (reschedule IPIs); the async-IPI
influence is around 0.X%, so I don't expect normal workloads to feel any
effect. In addition, four 8-vCPU VMs are not suitable for a scalability
evaluation. I don't think the complexity introduced by your patch 2 is
worth it, since it achieves a similar effect to my version with the
current heuristic algorithm.

    Wanpeng
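For context, the check quoted above lives in kernel/sched/fair.c. Below is
an abridged rendering of pick_next_entity() from around v5.6 (the
skip-buddy handling is trimmed and the comments are paraphrased), plus the
helper behind the "< 1" test; this shows the upstream code path that
patch 1 targets, not the patch itself:

/* kernel/sched/fair.c, ~v5.6 (abridged): choose which entity runs next */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	struct sched_entity *left = __pick_first_entity(cfs_rq);
	struct sched_entity *se;

	/* Ideally run the leftmost (smallest-vruntime) entity. */
	if (!left || (curr && entity_before(curr, left)))
		left = curr;
	se = left;

	/*
	 * ... skip-buddy handling trimmed: avoid running cfs_rq->skip,
	 * which the yield_to_task_fair()/yield_task_fair() path sets to
	 * the yielding vCPU ...
	 */

	/* Prefer the last buddy, to return the CPU to a preempted task. */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	/*
	 * The check quoted in the mail: cfs_rq->next is the vCPU that the
	 * directed yield asked CFS to run.  If the yielding vCPU still
	 * holds the smallest vruntime, this fairness test fails and CFS
	 * ignores KVM's hint.  Patch 1's idea is to raise the yielding
	 * vCPU's vruntime first (only when skip and next are vCPUs of the
	 * same VM) so that the hinted vCPU passes.
	 */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	clear_buddies(cfs_rq, se);
	return se;
}

/*
 * The "< 1" above: take the candidate unless its vruntime is more than
 * one wakeup granularity ahead of the leftmost entity's.
 */
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;

	return 0;
}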
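The heuristic that problems (2) and (3) concern is kvm_vcpu_on_spin() in
virt/kvm/kvm_main.c. The second sketch below is hypothetical: it
compresses the real function's two boost passes and last_boosted_vcpu
bookkeeping into a single loop, and the last_ipi_sender / pending_ipi
fields and the vcpu_is_halted() helper are invented names for the
IPI-tracking state a patch like patch 2 would have to add; none of them
exist upstream.

/*
 * Heavily abridged, hypothetical sketch of kvm_vcpu_on_spin(): pick a
 * directed-yield target after vCPU 'me' takes a pause-loop exit.
 */
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me || !READ_ONCE(vcpu->ready))
			continue;

		/*
		 * Problem (3): upstream, a halted vCPU with a pending
		 * interrupt remains an eligible boost target indefinitely.
		 * An IPI-aware filter could boost it only when the
		 * spinning vCPU is the one that just sent it an IPI.
		 * (vcpu_is_halted() and last_ipi_sender are hypothetical.)
		 */
		if (vcpu_is_halted(vcpu) &&
		    READ_ONCE(vcpu->last_ipi_sender) != me->vcpu_id)
			continue;

		/*
		 * Problem (2): upstream skips user-mode vCPUs when the
		 * spinner wants a kernel-mode target; a receiver with an
		 * IPI in flight is worth boosting anyway.
		 * (pending_ipi is hypothetical.)
		 */
		if (yield_to_kernel_mode && !kvm_arch_vcpu_in_kernel(vcpu) &&
		    !READ_ONCE(vcpu->pending_ipi))
			continue;

		if (kvm_vcpu_yield_to(vcpu) > 0)
			break;
	}
}

The patch Wanpeng references attacks (2) from the receiver side only (a
pending interrupt on a vCPU preempted in user mode), which matches the
observation in the quoted mail that it helps (2) but not (3).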