On Thu, 16 May 2019 at 02:42, Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
> On 5/14/19 6:50 AM, Marcelo Tosatti wrote:
> > On Mon, May 13, 2019 at 05:20:37PM +0800, Wanpeng Li wrote:
> >> On Wed, 8 May 2019 at 02:57, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> >>>
> >>>
> >>> Certain workloads perform poorly on KVM compared to baremetal
> >>> due to baremetal's ability to perform mwait on NEED_RESCHED
> >>> bit of task flags (therefore skipping the IPI).
> >>
> >> KVM supports expose mwait to the guest, if it can solve this?
> >>
> >> Regards,
> >> Wanpeng Li
> >
> > Unfortunately mwait in guest is not feasible (uncompatible with multiple
> > guests). Checking whether a paravirt solution is possible.
>
> Hi Marcelo,
>
> I was also looking at making MWAIT available to guests in a safe manner:
> whether through emulation or a PV-MWAIT. My (unsolicited) thoughts

MWAIT emulation is not simple; here is some research on it:
https://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/mwait.html

Regards,
Wanpeng Li

> follow.
>
> We basically want to handle this sequence:
>
>     monitor(monitor_address);
>     if (*monitor_address == base_value)
>          mwaitx(max_delay);
>
> Emulation seems problematic because, AFAICS this would happen:
>
>     guest                                   hypervisor
>     =====                                   ====
>
>     monitor(monitor_address);
>         vmexit  ===>                        monitor(monitor_address)
>     if (*monitor_address == base_value)
>          mwait();
>               vmexit    ====>               mwait()
>
> There's a context switch back to the guest in this sequence which seems
> problematic. Both the AMD and Intel specs list system calls and
> far calls as events which would lead to the MWAIT being woken up:
> "Voluntary transitions due to fast system call and far calls (occurring
> prior to issuing MWAIT but after setting the monitor)".
>
>
> We could do this instead:
>
>     guest                                   hypervisor
>     =====                                   ====
>
>     monitor(monitor_address);
>         vmexit  ===>                        cache monitor_address
>     if (*monitor_address == base_value)
>          mwait();
>               vmexit    ====>               monitor(monitor_address)
>                                             mwait()
>
> But, this would miss the "if (*monitor_address == base_value)" check in
> the host which is problematic if *monitor_address changed simultaneously
> when monitor was executed.
> (Similar problem if we cache both the monitor_address and
> *monitor_address.)
>
>
> So, AFAICS, the only thing that would work is the guest offloading the
> whole PV-MWAIT operation.
>
> AFAICS, that could be a paravirt operation which needs three parameters:
> (monitor_address, base_value, max_delay.)
>
> This would allow the guest to offload this whole operation to
> the host:
>     monitor(monitor_address);
>     if (*monitor_address == base_value)
>          mwaitx(max_delay);
>
> I'm guessing you are thinking on similar lines?
>
>
> High level semantics: If the CPU doesn't have any runnable threads, then
> we actually do this version of PV-MWAIT -- arming a timer if necessary
> so we only sleep until the time-slice expires or the MWAIT max_delay does.
>
> If the CPU has any runnable threads then this could still finish its
> time-quanta or we could just do a schedule-out.
>
>
> So the semantics guaranteed to the host would be that PV-MWAIT returns
> after >= max_delay OR with the *monitor_address changed.
>
>
>
> Ankur
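
To make the three-parameter proposal quoted above concrete, here is a rough
user-space sketch in C of the return semantics only: return as soon as
*monitor_address differs from base_value, or once max_delay has elapsed.
Everything named here (pv_mwait_sim, the result enum, the nanosecond unit for
max_delay) is hypothetical and not an existing KVM interface. A real host
would arm MONITOR/MWAIT or schedule out instead of polling; the point is only
that the check and the wait sit on the same side, so no update can slip in
between them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result codes; not part of any real KVM ABI. */
enum pv_mwait_result {
    PV_MWAIT_CHANGED = 0, /* *monitor_address != base_value was observed */
    PV_MWAIT_TIMEOUT = 1  /* max_delay elapsed with the value unchanged  */
};

/* Current time in nanoseconds. C11 timespec_get() reads the wall clock,
 * which is good enough for a sketch; a real implementation would want a
 * monotonic clock. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Simulates the offloaded operation
 *     monitor(monitor_address);
 *     if (*monitor_address == base_value)
 *          mwaitx(max_delay);
 * by polling. The value is checked before the timeout on every iteration,
 * so a value that already differs on entry returns immediately. */
enum pv_mwait_result pv_mwait_sim(_Atomic uint32_t *monitor_address,
                                  uint32_t base_value,
                                  uint64_t max_delay_ns)
{
    assert(monitor_address != NULL);
    const uint64_t start = now_ns();

    for (;;) {
        if (atomic_load_explicit(monitor_address,
                                 memory_order_acquire) != base_value)
            return PV_MWAIT_CHANGED;
        if (now_ns() - start >= max_delay_ns)
            return PV_MWAIT_TIMEOUT;
        /* Busy-poll; a host would MWAIT or run another task here. */
    }
}
```

A caller would pass the address of the word it waits on (for example a flags
word carrying a need-resched bit), the value it last observed, and the
longest time it is willing to sleep, then re-check the word after return in
either case.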