On Mon, 2021-10-25 at 11:39 +0100, David Woodhouse wrote:
> > One possible solution (which I even have unfinished patches for) is to
> > put all the gfn_to_pfn_caches on a list, and refresh them when the MMU
> > notifier receives an invalidation.
>
> For this use case I'm not even sure why I'd *want* to cache the PFN and
> explicitly kmap/memremap it, when surely by *definition* there's a
> perfectly serviceable HVA which already points to it?

That's indeed true for *this* use case, but my *next* use case is
actually implementing the event channel delivery.

What we have in-kernel already is everything we absolutely *need* in
order to host Xen guests, but I really do want to fix the fact that
even IPIs and timers are bouncing up through userspace.

Xen 2-level event channel delivery is a series of test-and-set
operations. For delivering a given port#, we:

 • Test-and-set the corresponding port# bit in the shared info page.
   If it was already set, we're done.

 • Test the corresponding 'masked' bit in the shared info page. If it
   was set, we're done.

 • Test-and-set the bit in the target vcpu_info 'evtchn_pending_sel'
   which corresponds to the *word* in which the port# resides. If it
   was already set, we're done.

 • Set the 'evtchn_upcall_pending' bit in the target vcpu_info to
   trigger the vector delivery.

In João and Ankur's original version¹ this was really simple; it
looked like this:

        if (test_and_set_bit(p, (unsigned long *) shared_info->evtchn_pending))
                return 1;

        if (!test_bit(p, (unsigned long *) shared_info->evtchn_mask) &&
            !test_and_set_bit(p / BITS_PER_EVTCHN_WORD,
                              (unsigned long *) &vcpu_info->evtchn_pending_sel))
                return kvm_xen_evtchn_2l_vcpu_set_pending(vcpu_info);

Yay for permanently pinned pages! :)

So, with a fixed version of kvm_map_gfn() I suppose I could do the
same, but that's *two* maps/unmaps for each interrupt? That's probably
worse than just bouncing out and letting userspace do it!

So even for the event channel delivery use case, if I'm not allowed to
just pin the pages permanently then I stand by the observation that I
*have* a perfectly serviceable HVA for it already.

I can even do the test-and-set in userspace based on the futex
primitives, but the annoying part is that if the page does end up
absent, I need to *store* the pending operation, because there will be
times when we're trying to deliver interrupts but *can't* sleep and
wait for the page.

So that probably means 512 bytes of evtchn bitmap *per vCPU* in order
to store the event channels which are pending for each vCPU, and a way
to replay them from a context which *can* sleep (rough sketch at the
end of this mail).

And if I have *that*, then I might as well use it to solve the problem
of the gpa_to_hva_cache being single-threaded, and let a vCPU do its
own writes to its vcpu_info *every* time. With perhaps a little more
thinking about how I use a gpa_to_hva_cache for the shinfo page (which
you removed in commit 319afe68), but perhaps starting with the
observation that it's only not thread-capable when it's *invalid* and
needs to be refreshed...

¹ https://lore.kernel.org/lkml/20190220201609.28290-12-joao.m.martins@xxxxxxxxxx/
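
P.S. To make that "shadow bitmap plus replay" idea slightly more
concrete, here's a rough, untested sketch of the shape of it. None of
the names below (xen_vcpu_shadow, xen_evtchn_defer, xen_evtchn_replay,
SHADOW_EVTCHN_PORTS) exist anywhere today; they're purely illustrative.

        #include <linux/bitmap.h>
        #include <linux/bitops.h>

        /* 2-level ABI, 64-bit guest: 64 words of 64 bits = 4096 ports,
         * i.e. the 512 bytes of bitmap per vCPU mentioned above. */
        #define SHADOW_EVTCHN_PORTS 4096

        struct xen_vcpu_shadow {
                /* Ports we failed to deliver because the target page was
                 * absent and we couldn't sleep; set_bit/clear_bit are atomic. */
                DECLARE_BITMAP(evtchn_pending, SHADOW_EVTCHN_PORTS);
        };

        /* Fast path: record the port and kick the vCPU so that the
         * replay below gets run from a sleepable context soon. */
        static void xen_evtchn_defer(struct xen_vcpu_shadow *shadow, int port)
        {
                set_bit(port, shadow->evtchn_pending);
        }

        /* Slow path: can sleep, so it can fault the shinfo/vcpu_info
         * pages back in and do the test-and-set dance quoted above. */
        static void xen_evtchn_replay(struct xen_vcpu_shadow *shadow)
        {
                int port;

                for_each_set_bit(port, shadow->evtchn_pending,
                                 SHADOW_EVTCHN_PORTS) {
                        clear_bit(port, shadow->evtchn_pending);
                        /* ...deliver port via shinfo/vcpu_info here... */
                }
        }

Presumably the replay would be driven from somewhere on the vCPU entry
path (or a workqueue), once sleeping is allowed again.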