Re: [PATCH 2/2] x86/idle: use dynamic halt poll

Radim Krčmář <rkrcmar@xxxxxxxxxx> · Tue, 4 Jul 2017 16:13:23 +0200

2017-07-03 17:28+0800, Yang Zhang:
> The background is that we (Alibaba Cloud) get more and more complaints
> from our customers on both KVM and Xen compared to bare metal. After
> investigation, the root cause is known to us: the high cost of
> message-passing workloads (David showed this at KVM Forum 2015).
> 
> A typical message-passing workload looks like this:
> vcpu 0                          vcpu 1
> 1. send ipi                     2.  doing hlt
> 3. go into idle                 4.  receive ipi and wake up from hlt
> 5. write APIC timer twice       6.  write APIC timer twice to
>    to stop sched timer              reprogram sched timer

One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?

> 7. doing hlt                    8.  handle task and send ipi to
>                                     vcpu 0
> 9. same as 4.                   10. same as 3
> 
> One transaction introduces about 12 vmexits (2 hlt and 10 MSR writes). The
> cost of these vmexits degrades performance severely.

Yeah, sounds like too much ... I understood that there are

  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1

which adds to 6 MSR writes -- what are the other 4?
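
To make the tally explicit, here is a back-of-the-envelope sketch.  The
function name and the per-operation write counts are assumptions for
illustration, not measurements:

```python
# Rough tally of MSR-write vmexits in one ping-pong transaction.
# The helper name and default counts are made up for this sketch.

def msr_writes_per_transaction(writes_per_timer_op, ipis=2, timer_ops=4):
    """Count one write per IPI plus writes_per_timer_op writes for each
    timer stop/reprogram operation (assumed, not measured)."""
    return ipis + timer_ops * writes_per_timer_op

for writes in (1, 2):
    print(writes, msr_writes_per_transaction(writes))
```

With one write per timer operation this gives the 6 counted above; two
writes per operation would give 10, which would match the quoted figure
if that is indeed where the extra writes come from.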

>                                                          The Linux kernel
> already provides idle=poll to mitigate this, but it only eliminates the
> IPI and hlt vmexits. It does nothing about starting/stopping the sched
> timer. A compromise would be to turn off NOHZ, but that is not the default
> config for new distributions. The same goes for halt-poll in KVM: it only
> removes the cost of scheduling in/out on the host and cannot help such
> workloads much.
> 
> The purpose of this patch is to improve the current idle=poll mechanism to

Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
get rid of the timer one.

> use dynamic polling and to poll before touching the sched timer. It should
> not be a virtualization-specific feature, but bare metal seems to have a low
> cost for accessing the MSR, so I want to enable it only in VMs. Though the
> idea behind the patch may not fit all conditions perfectly, it looks no
> worse than now.

It adds code to hot-paths (interrupt handlers) while trying to optimize
an idle-path, which is suspicious.

> How about we keep the current implementation and I integrate the patch into
> the para-virtualized part, as Paolo suggested? We can continue to discuss
> it, and I will keep refining it if anyone has better suggestions.

I think there is a nicer solution to avoid the expensive timer rewrite:
Linux uses one-shot APIC timers and getting the timer interrupt is about
as expensive as programming the timer, so the guest can keep the timer
armed, but not re-arm it after the expiration if the CPU is idle.

This should also mitigate the problem with short idle periods, but the
optimized window is anywhere between 0 and 1ms.
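
A toy model of that idea, only to show where the saving would come from.
It assumes one vmexit per timer write and one per timer interrupt, and
the function names are invented for the sketch:

```python
# Toy count of timer-related vmexits for a single idle period.
# Assumption: a timer write and a timer interrupt each cost one vmexit.

def exits_stop_and_reprogram(idle_us, until_expiry_us):
    # Behaviour described in the quoted workload: stop the timer when
    # going idle and reprogram it on wakeup, whatever the idle length.
    return 2

def exits_keep_armed(idle_us, until_expiry_us):
    # Suggested alternative: leave the timer armed when going idle.
    if idle_us < until_expiry_us:
        return 0  # woke up before expiry, timer never touched
    # The timer fired while idle (one exit) and was not re-armed,
    # so the wakeup path has to arm it again (one exit).
    return 2

for idle_us in (50, 500, 5000):
    print(idle_us,
          exits_stop_and_reprogram(idle_us, 1000),
          exits_keep_armed(idle_us, 1000))
```

In this model an idle period that ends before the pending expiry costs
no timer exits at all, while a longer one costs the same as today.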

Do you see disadvantages of this combined with MWAIT?

Thanks.