2017-07-03 17:28+0800, Yang Zhang: > The background is that we(Alibaba Cloud) do get more and more complaints > from our customers in both KVM and Xen compare to bare-mental.After > investigations, the root cause is known to us: big cost in message passing > workload(David show it in KVM forum 2015) > > A typical message workload like below: > vcpu 0 vcpu 1 > 1. send ipi 2. doing hlt > 3. go into idle 4. receive ipi and wake up from hlt > 5. write APIC time twice 6. write APIC time twice to > to stop sched timer reprogram sched timer One write is enough to disable/re-enable the APIC timer -- why does Linux use two? > 7. doing hlt 8. handle task and send ipi to > vcpu 0 > 9. same to 4. 10. same to 3 > > One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The > cost of such vmexits will degrades performance severely. Yeah, sounds like too much ... I understood that there are IPI from 1 to 2 4 * APIC timer IPI from 2 to 1 which adds to 6 MSR writes -- what are the other 4? > Linux kernel > already provide idle=poll to mitigate the trend. But it only eliminates the > IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A > compromise would be to turn off NOHZ kernel, but it is not the default > config for new distributions. Same for halt-poll in KVM, it only solve the > cost from schedule in/out in host and can not help such workload much. > > The purpose of this patch we want to improve current idle=poll mechanism to Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow down the sibling hyperthread. MWAIT solves the IPI problem, but doesn't get rid of the timer one. > use dynamic polling and do poll before touch sched timer. It should not be a > virtualization specific feature but seems bare mental have low cost to > access the MSR. So i want to only enable it in VM. Though the idea below the > patch may not so perfect to fit all conditions, it looks no worse than now. It adds code to hot-paths (interrupt handlers) while trying to optimize an idle-path, which is suspicious. > How about we keep current implementation and i integrate the patch to > para-virtualize part as Paolo suggested? We can continue discuss it and i > will continue to refine it if anyone has better suggestions? I think there is a nicer solution to avoid the expensive timer rewrite: Linux uses one-shot APIC timers and getting the timer interrupt is about as expensive as programming the timer, so the guest can keep the timer armed, but not re-arm it after the expiration if the CPU is idle. This should also mitigate the problem with short idle periods, but the optimized window is anywhere between 0 to 1ms. Do you see disadvantages of this combined with MWAIT? Thanks.