2017-07-03 17:28 GMT+08:00 Yang Zhang <yang.zhang.wz@xxxxxxxxx>:
> On 2017/6/27 22:22, Radim Krčmář wrote:
>>
>> 2017-06-27 15:56+0200, Paolo Bonzini:
>>>
>>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>>>
>>>>> ... which is not necessarily _wrong_. It's just a different heuristic.
>>>>
>>>> Right, it's just harder to use than the host's single_task_running() --
>>>> the VCPU calling vcpu_is_preempted() is never preempted, so we have to
>>>> look at other VCPUs that are not halted, but still preempted.
>>>>
>>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>>> yield to the host, working under the assumption that there is work for
>>>> this PCPU if other VCPUs have stuff to do. The downside is that it
>>>> misses information about the host's topology, so it would be hard to
>>>> make it work well.
>>>
>>> I would just use vcpu_is_preempted on the current CPU. From the guest's
>>> POV this option is really a "f*** everyone else" setting just like
>>> idle=poll, only a little more polite.
>>
>> vcpu_is_preempted() on the current cpu cannot return true, AFAIK.
>>
>>> If we've been preempted and we were polling, there are two cases. If an
>>> interrupt was queued while the guest was preempted, the poll will be
>>> treated as successful anyway.
>>
>> I think the poll should be treated as invalid if the window has expired
>> while the VCPU was preempted -- the guest can't tell whether the
>> interrupt arrived still within the poll window (unless we added paravirt
>> for that), so it shouldn't be wasting time waiting for it.
>>
>>> If it hasn't, let others run---but really that's not because the guest
>>> wants to be polite, it's to avoid that the scheduler penalizes it
>>> excessively.
>>
>> This sounds like a VM entry just to do an immediate VM exit, so paravirt
>> seems better here as well ... (the guest telling the host about its
>> window -- which could also be used to rule it out as a target in the
>> pause-loop random kick.)
>>
>>> So until it's preempted, I think it's okay if the guest doesn't care
>>> about others. You wouldn't use this option anyway in overcommitted
>>> situations.
>>>
>>> (I'm still not very convinced about the idea.)
>>
>> Me neither. (The same mechanism is applicable to bare metal, but was
>> never used there, so I would rather bring the guest behavior closer to
>> bare metal.)
>
> The background is that we (Alibaba Cloud) are getting more and more
> complaints from our customers about both KVM and Xen compared to bare
> metal. After investigation, the root cause is known to us: the big cost
> of message-passing workloads (David showed it at KVM Forum 2015).
>
> A typical message-passing workload looks like this:
>
>   vcpu 0                         vcpu 1
>   1. send ipi                    2. doing hlt
>   3. go into idle                4. receive ipi and wake up from hlt
>   5. write APIC timer twice      6. write APIC timer twice to
>      to stop sched timer            reprogram sched timer

I couldn't find where these two steps program the APIC timer twice
rather than once; could you point out the code?

Regards,
Wanpeng Li

>   7. doing hlt                   8. handle task and send ipi to
>                                     vcpu 0
>   9. same as 4.                  10. same as 3.
>
> One transaction introduces about 12 vmexits (2 hlt and 10 MSR writes).
> The cost of these vmexits degrades performance severely. The Linux
> kernel already provides idle=poll to mitigate this, but it only
> eliminates the IPI and hlt vmexits; it does nothing about starting and
> stopping the sched timer. A compromise would be to turn off NOHZ, but
> that is not the default config for new distributions. The same goes for
> halt-polling in KVM: it only removes the schedule in/out cost on the
> host and cannot help such a workload much.
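(As an aside, to make the heuristic Radim describes at the top of the
thread concrete: a minimal sketch, assuming a made-up helper
other_vcpu_preempted() built on the existing vcpu_is_preempted()
interface, might look like the code below. This is only an illustration
of the idea, not a proposed patch.)

  /* Sketch only: abort guest-side polling as soon as any other online
   * VCPU looks preempted, taking that as a hint that the host's PCPUs
   * are contended and our poll cycles would be better yielded. */
  static bool other_vcpu_preempted(void)
  {
          int cpu, self = smp_processor_id();

          for_each_online_cpu(cpu) {
                  /* The calling VCPU is running, so skip it. */
                  if (cpu != self && vcpu_is_preempted(cpu))
                          return true;
          }
          return false;
  }

As Radim notes, this sees only preemption, not host topology, so it is
at best a coarse signal that the PCPU has other work to do.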
> The purpose of this patch is to improve the current idle=poll mechanism
> to use dynamic polling, and to poll before touching the sched timer. It
> need not be a virtualization-specific feature, but since bare metal
> appears to pay only a low cost to access the MSR, I want to enable it
> only in a VM. Though the idea behind the patch may not fit all
> conditions perfectly, it looks no worse than what we have now. How
> about we keep the current implementation and I integrate the patch into
> the paravirt part, as Paolo suggested? We can continue to discuss it,
> and I will keep refining it if anyone has better suggestions.
>
> --
> Yang
> Alibaba Cloud Computing
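(For reference, the dynamic polling Yang describes could be sketched
roughly as below. This is a sketch under stated assumptions, not the
actual patch: the per-CPU window poll_ns and the grow/shrink constants
are hypothetical, loosely modelled on KVM's host-side halt_poll_ns
heuristic.)

  /* Sketch only: poll for an adaptive window before halting, so that a
   * wakeup arriving shortly after idle entry avoids the hlt vmexit and
   * the sched-timer reprogramming around it. */
  static DEFINE_PER_CPU(u64, poll_ns);

  static void dynamic_poll_idle(void)
  {
          u64 window = this_cpu_read(poll_ns);
          u64 start = ktime_get_ns();

          while (ktime_get_ns() - start < window) {
                  if (need_resched()) {
                          /* Woken during the window: polling paid off,
                           * so grow the window up to a (made-up) cap. */
                          this_cpu_write(poll_ns,
                                         min_t(u64, window * 2 + 1000,
                                               500000));
                          return;
                  }
                  cpu_relax();
          }

          /* Window expired with no work: shrink it and halt for real. */
          this_cpu_write(poll_ns, window / 2);
          safe_halt();
  }

The point of doing this in the guest, before the tick is stopped, is
that a successful poll skips both the hlt exit and the APIC timer MSR
writes that idle=poll alone cannot avoid.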