On Fri, 2021-11-12 at 13:27 +0100, Paolo Bonzini wrote:
> On 11/12/21 12:29, David Woodhouse wrote:
> > We *could* use the rwlock thing for steal time reporting, but I still
> > don't really think it's worth doing so. Again, if it was truly going to
> > be a generic mechanism that would solve lots of other problems, I'd be
> > up for it. But if steal time would be the *only* other user of a
> > generic version of the rwlock thing, that just seems like
> > overengineering. I'm still mostly inclined to stand by my original
> > observation that it has a perfectly serviceable HVA that it can use
> > instead.
>
> Well yeah, I'd have to see the code to decide. But maybe this is where
> we disagree: what I want from generic KVM is a nice and easy-to-use API.
> I only care to a certain extent how messy it is inside, because a nice
> generic API means not reinventing the wheel across architectures.

I'm happy enough with that if I understand how to make it make sense :)

> That said, I think that your patch is much more complicated than it
> should be, because it hooks in the wrong place. There are two cases:
>
> 1) for kmap/kunmap, the invalidate_range() notifier is the right place
> to remove any references taken after invalidate_range_start().

Right, I added it in invalidate_range_start() because it was *simpler*
that way (I get given a GFN, not an HVA), but that's wrong; it really
does need to be in invalidate_range(), as you say.

> For the
> APIC access page it needs a kvm_make_all_cpus_request, but for the
> shinfo page it can be a simple back-to-back write_lock/write_unlock
> (a super-poor-man RCU, if you wish). And it can be extended to a
> per-cache rwlock.

A back-to-back write_lock/write_unlock *without* setting the address to
KVM_UNMAPPED_PAGE? I'm not sure I see how that protects the IRQ
delivery from accessing the (now stale) physical page after the MMU
notifier has completed?
Not unless it's going to call hva_to_pfn again for itself under the
read_lock, every time it delivers an IRQ? My version marked the PFN
dirty and invalidated the hva pointer so that nothing will touch it
again, but left it actually *mapped* because we can't necessarily sleep
to unmap. But that's all it did in the notifier:

+	if (static_branch_unlikely(&kvm_xen_enabled.key)) {
+		write_lock(&kvm->arch.xen.shinfo_lock);
+
+		if (kvm->arch.xen.shared_info &&
+		    kvm->arch.xen.shinfo_gfn >= range->start &&
+		    kvm->arch.xen.shinfo_cache.gfn < range->end) {
+			/*
+			 * If kvm_xen_shared_info_init() had *finished* mapping the
+			 * page and assigned the pointer for real, then mark the page
+			 * dirty now instead of via the eventual cache teardown.
+			 */
+			if (kvm->arch.xen.shared_info != KVM_UNMAPPED_PAGE) {
+				kvm_set_pfn_dirty(kvm->arch.xen.shinfo_cache.pfn);
+				kvm->arch.xen.shinfo_cache.dirty = false;
+			}
+
+			kvm->arch.xen.shared_info = NULL;
+		}
+
+		write_unlock(&kvm->arch.xen.shinfo_lock);
+	}

> 2) for memremap/memunmap, all you really care about is reacting to
> changes in the memslots, so the MMU notifier integration has nothing
> to do. You still need to call the same hook as
> kvm_mmu_notifier_invalidate_range() when memslots change, so that
> the update is done outside atomic context.

Hm, we definitely *do* care about reacting to MMU notifiers in this
case too. Userspace can do memory overcommit / ballooning etc.
*without* changing the memslots, using only mmap/munmap/userfaultfd on
the corresponding HVA ranges.

> So as far as short-term uses of the cache are concerned, all it
> takes (if I'm right...) is a list_for_each_entry in
> kvm_mmu_notifier_invalidate_range, visiting the list of
> gfn-to-pfn caches and lock-unlock each cache's rwlock. Plus
> a way to inform the code of memslot changes before any atomic
> code tries to use an invalid cache.

I do the memslot part already.
It basically ends up being

	if (!map->hva || map->hva == KVM_UNMAPPED_PAGE ||
	    slots->generation != ghc->generation ||
	    kvm_is_error_hva(ghc->hva) || !ghc->memslot) {
		ret = -EWOULDBLOCK;	/* Try again when you can sleep */
		goto out_unlock;
	}

> > It looks like they want their own way of handling it; if the GPA being
> > invalidated matches posted_intr_desc_addr or virtual_apic_page_addr
> > then the MMU notifier just needs to call kvm_make_all_cpus_request()
> > with some suitable checking/WARN magic around the "we will never need
> > to sleep when we shouldn't" assertion that you made above.
>
> I was thinking of an extra flag to decide whether (in addition
> to the write_lock/write_unlock) the MMU notifier also needs to do
> the kvm_make_all_cpus_request:

Yeah, OK. I think we want to kill the struct kvm_host_map completely,
merge its extra 'hva' and 'page' fields into the (possibly renamed)
gfn_to_pfn_cache along with your 'guest_uses_pa' flag, and take it from
there.

I'll go knock up something that we can heckle further...