Re: [PATCH v3] KVM: x86: Fix recording of guest steal time / preempted status

On Fri, 2021-11-12 at 10:31 +0100, Paolo Bonzini wrote:
> On 11/12/21 09:28, David Woodhouse wrote:
> > I do not recall that we'd actually reached a conclusion that we *will*
> > make the gfn_to_pfn cache generally usable in that fashion. The latest
> > I knew of that discussion was my message at
> > https://lore.kernel.org/kvm/55a5d4e3fbd29dd55e276b97eeaefd0411b3290b.camel@xxxxxxxxxxxxx/
> > 
> > in which I said I'd be a whole lot happier with that if we could do it
> > with RCU instead of an rwlock — but I don't think we can because we'd
> > need to call synchronize_srcu() in the MMU notifier callback that might
> > not be permitted to sleep?
> 
> Why do you have a problem with the rwlock?  If it's per-cache, and it's 
> mostly taken within vCPU context (with the exception of Xen), contention 
> should be nonexistent.

My problem with using the rwlock instead of RCU is not the
contention, it's...

> > I'm also slightly less comfortable with having the MMU notifier work
> > through an arbitrary *list* of gfn_to_pfn caches that it potentially
> > needs to invalidate, but that is very much a minor concern compared
> > with the first.
> > 
> > I started looking through the nested code which is the big user of this
> > facility.
> 
> Yes, that's also where I got stuck in my first attempt a few months ago. 
>   I agree that it can be changed to use gfn-to-hva caches, except for 
> the vmcs12->posted_intr_desc_addr and vmcs12->virtual_apic_page_addr.
> 

... that anything accessing these will *still* need to do so in atomic
context. There's an atomic access which might fail, and then you fall
back to a context in which you can sleep to refresh the mapping. And
you *still* need to perform the actual access with the lock held to
protect against concurrent invalidation.


So let's take a look... for posted_intr_desc_addr, that host physical
address is actually written to the VMCS02, isn't it? 

Thinking about the case where the target page is being invalidated
while the vCPU is running... surely in that case the only 'correct'
solution is that the vCPU needs to be kicked out of non-root mode
before the invalidate_range() notifier completes?

That would have worked nicely if the MMU notifier could call
synchronize_srcu() on invalidation. Can it kick the vCPU and wait for
it to exit though?

Or maybe there's a variant where we only have to ensure that no posted
interrupts will actually be *delivered* after the invalidation?

Don't get me wrong, a big part of me *loves* the idea that the hairiest
part of my Xen event channel delivery is actually a bug fix that we
need in the kernel anyway, and then the rest of it is simple and
uncontentious.

I'm just not quite sure I see how to provide a generic mechanism that
actually *fixes* the bugs that already exist elsewhere — at least not
without them having their own special cases for invalidation anyway.


(ISTR the virtual apic page is a bit different because it's only an
*address* and it doesn't even have to be backed by real memory at the
corresponding HPA? Otherwise it's basically the same issue?)


