On 5/14/19 12:37 AM, Peter Zijlstra wrote:
On Mon, May 13, 2019 at 07:07:36PM -0700, Andy Lutomirski wrote:
On Mon, May 13, 2019 at 2:09 PM Liran Alon <liran.alon@xxxxxxxxxx> wrote:
The hope is that the vast majority of #VMExit handlers will be
able to run to completion without switching to the full address
space, thereby avoiding the performance hit of (2).
However, for the very few #VMExits that do require running in the full
kernel address space, we must first kick the sibling hyperthread
out of the guest and only then switch to the full kernel address space.
Only once all hyperthreads have returned to the KVM address space do we
allow them to enter the guest again.
What exactly does "kick" mean in this context? It sounds like you're
going to need to be able to kick sibling VMs from extremely atomic
contexts like NMI and MCE.
Yeah, doing the full synchronous thing from NMI/MCE context sounds
exceedingly dodgy, however..
Realistically they only need to send an IPI to the other sibling; they
don't need to wait for the VMExit to complete or anything else.
And that is something we can do from NMI context -- with a bit of care.
See also arch_irq_work_raise(); specifically we need to ensure we leave
the APIC in an idle state, such that if we interrupted an APIC sequence
it will not suddenly fail/violate the APIC write/state etc.
I've been experimenting with IPI'ing siblings on vmexit, primarily
because we know we'll need it if ASI turns out to be viable, but also
because I wanted to understand why previous experiments resulted in such
poor performance.
You're correct that you don't need to wait for the sibling to come out
once you send the IPI. That hardware thread will not do anything other
than process the IPI once it's sent. There is still some need for
synchronization, at least for the every vmexit case, since you always
want to make sure that one thread is actually doing work while the other
one is held. I have this working for some cases, but not enough to call
it a general solution. I'm not at all sure that the every vmexit case
can be made to perform for the general case. Even the non-general case
uses synchronization that I fear might be overly complex.
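
To make the shape of that synchronization concrete, here is a minimal
userspace sketch in C11, not kernel code and not the code I am running.
Every name in it (struct core_state, core_enter_full_kernel(),
core_may_enter_guest(), and so on) is hypothetical, and the kick_pending
flag only stands in for a real IPI. It assumes two hyperthreads per core
and models just the two rules discussed here: the kicker does not wait for
the sibling, and nobody enters the guest while any sibling is in the full
kernel address space.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-core state shared by two sibling hyperthreads. */
struct core_state {
	atomic_int  full_users;       /* siblings now in the full kernel address space */
	atomic_bool kick_pending[2];  /* stand-in for an IPI sent to sibling 0 or 1 */
};

static inline void core_init(struct core_state *c)
{
	atomic_init(&c->full_users, 0);
	atomic_init(&c->kick_pending[0], false);
	atomic_init(&c->kick_pending[1], false);
}

/*
 * Thread `self` is about to switch to the full kernel address space.
 * It marks the core as busy and "sends the IPI" to its sibling, then
 * returns immediately without waiting for the sibling to react.
 */
static inline void core_enter_full_kernel(struct core_state *c, int self)
{
	assert(self == 0 || self == 1);
	atomic_fetch_add(&c->full_users, 1);
	atomic_store(&c->kick_pending[1 - self], true);
}

/* Thread has switched back to the restricted (KVM) address space. */
static inline void core_exit_full_kernel(struct core_state *c)
{
	atomic_fetch_sub(&c->full_users, 1);
}

/* Sibling's "IPI handler": returns true if a kick was pending, and clears it. */
static inline bool core_ack_kick(struct core_state *c, int self)
{
	return atomic_exchange(&c->kick_pending[self], false);
}

/* Guest entry is allowed only when no sibling is in the full address space. */
static inline bool core_may_enter_guest(struct core_state *c)
{
	return atomic_load(&c->full_users) == 0;
}
```

In this model a kicked sibling would spin on core_may_enter_guest() before
re-entering the guest. What the sketch leaves out is exactly the part I
called overly complex above: guaranteeing that one thread always makes
progress while the other is held.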
For the cases I do have working, simply not pinning the sibling when
we exit due to the guest idling is a big enough win to put performance
into a much more reasonable range.
Based on this, I believe that pinning a sibling HT in a subset of cases,
when we interact with full kernel address space, is almost certainly
reasonable.
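
As a rough illustration of that "subset of cases" policy, the decision can
be sketched as a function of the exit reason. The enum and function names
are made up for this example and do not correspond to real KVM exit
reasons. The only assumption is that exits can be split into three groups:
guest idle, handled entirely in the KVM address space, and needing the full
kernel address space.

```c
#include <stdbool.h>

/* Hypothetical classification of a vmexit; not real KVM exit reasons. */
enum exit_class {
	EXIT_GUEST_IDLE,            /* guest went idle (e.g. it halted) */
	EXIT_HANDLED_IN_KVM_SPACE,  /* handler runs in the restricted address space */
	EXIT_NEEDS_FULL_KERNEL,     /* handler must switch to the full address space */
};

/*
 * Pin (kick and hold) the sibling hyperthread only when this thread is
 * about to touch the full kernel address space. Idle exits and exits
 * handled inside the KVM address space leave the sibling running.
 */
static inline bool sibling_must_be_pinned(enum exit_class reason)
{
	switch (reason) {
	case EXIT_NEEDS_FULL_KERNEL:
		return true;
	case EXIT_GUEST_IDLE:
	case EXIT_HANDLED_IN_KVM_SPACE:
	default:
		return false;
	}
}
```

The point is only that the common exits never pay for the kick.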
-jan