Re: [PATCH] sched: introduce configurable delay before entering idle

Wanpeng Li <kernellwp@xxxxxxxxx> · Thu, 16 May 2019 09:07:32 +0800



On Thu, 16 May 2019 at 02:42, Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
> On 5/14/19 6:50 AM, Marcelo Tosatti wrote:
> > On Mon, May 13, 2019 at 05:20:37PM +0800, Wanpeng Li wrote:
> >> On Wed, 8 May 2019 at 02:57, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> >>>
> >>>
> >>> Certain workloads perform poorly on KVM compared to baremetal
> >>> due to baremetal's ability to perform mwait on NEED_RESCHED
> >>> bit of task flags (therefore skipping the IPI).
> >>
> >> KVM supports exposing mwait to the guest; could that solve this?
> >>
> >> Regards,
> >> Wanpeng Li
> >
> > Unfortunately mwait in guest is not feasible (incompatible with multiple
> > guests). Checking whether a paravirt solution is possible.
>
> Hi Marcelo,
>
> I was also looking at making MWAIT available to guests in a safe manner:
> whether through emulation or a PV-MWAIT. My (unsolicited) thoughts

MWAIT emulation is not simple; here is some research on it:
https://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/mwait.html

Regards,
Wanpeng Li

> follow.
>
> We basically want to handle this sequence:
>
>      monitor(monitor_address);
>      if (*monitor_address == base_value)
>           mwaitx(max_delay);
>
> Emulation seems problematic because, AFAICS this would happen:
>
>      guest                                   hypervisor
>      =====                                   ====
>
>      monitor(monitor_address);
>          vmexit  ===>                        monitor(monitor_address)
>      if (*monitor_address == base_value)
>           mwait();
>                vmexit    ====>               mwait()
>
> There's a context switch back to the guest in this sequence which seems
> problematic. Both the AMD and Intel specs list system calls and
> far calls as events which would lead to the MWAIT being woken up:
> "Voluntary transitions due to fast system call and far calls (occurring
> prior to issuing MWAIT but after setting the monitor)".
>
>
> We could do this instead:
>
>      guest                                   hypervisor
>      =====                                   ====
>
>      monitor(monitor_address);
>          vmexit  ===>                        cache monitor_address
>      if (*monitor_address == base_value)
>           mwait();
>                vmexit    ====>              monitor(monitor_address)
>                                             mwait()
>
> But, this would miss the "if (*monitor_address == base_value)" check in
> the host which is problematic if *monitor_address changed simultaneously
> when monitor was executed.
> (Similar problem if we cache both the monitor_address and
> *monitor_address.)
>
>
> So, AFAICS, the only thing that would work is the guest offloading the
> whole PV-MWAIT operation.
>
> AFAICS, that could be a paravirt operation which needs three parameters:
> (monitor_address, base_value, max_delay.)
>
> This would allow the guest to offload this whole operation to
> the host:
>      monitor(monitor_address);
>      if (*monitor_address == base_value)
>           mwaitx(max_delay);
>
> I'm guessing you are thinking along similar lines?
>
>
> High level semantics: If the CPU doesn't have any runnable threads, then
> we actually do this version of PV-MWAIT -- arming a timer if necessary
> so we only sleep until the time-slice expires or the MWAIT max_delay does.
>
> If the CPU has any runnable threads then the vCPU could still finish its
> time quantum or we could just do a schedule-out.
>
>
> So the semantics guaranteed to the guest would be that PV-MWAIT returns
> after >= max_delay OR with *monitor_address changed.
>
>
>
> Ankur
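
For reference, the three-parameter PV-MWAIT semantics quoted above
(monitor_address, base_value, max_delay) can be modelled in user space
roughly as follows. This is only a sketch under stated assumptions: the
name pv_mwait(), the result enum and the nanosecond unit for max_delay
are hypothetical and not part of any existing KVM interface, and a
polling loop stands in for the host-side MONITOR/MWAIT pair.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result codes; not part of any existing KVM ABI. */
enum pv_mwait_result {
	PV_MWAIT_CHANGED,	/* *monitor_address != base_value was observed */
	PV_MWAIT_TIMEOUT,	/* max_delay elapsed with no observed change */
};

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * User-space model of the proposed operation. The value check and the
 * wait are done in one place, so a store that lands between the two
 * cannot be missed. A real host would arm MONITOR and re-check the
 * value before MWAIT; polling is an assumption made for illustration.
 */
enum pv_mwait_result pv_mwait(_Atomic uint64_t *monitor_address,
			      uint64_t base_value, uint64_t max_delay_ns)
{
	const uint64_t deadline = now_ns() + max_delay_ns;

	for (;;) {
		if (atomic_load_explicit(monitor_address,
					 memory_order_acquire) != base_value)
			return PV_MWAIT_CHANGED;
		if (now_ns() >= deadline)
			return PV_MWAIT_TIMEOUT;
	}
}
```

The caller passes the value it last saw as base_value, so a change that
raced with the call is reported immediately instead of being slept
through.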