On Fri, 2021-10-29 at 12:31 +0100, Joao Martins wrote:
> On 10/28/21 23:22, David Woodhouse wrote:
> > On Mon, 2021-10-25 at 13:19 +0100, David Woodhouse wrote:
> > > On Mon, 2021-10-25 at 11:39 +0100, David Woodhouse wrote:
> > > > > One possible solution (which I even have unfinished patches for) is to
> > > > > put all the gfn_to_pfn_caches on a list, and refresh them when the MMU
> > > > > notifier receives an invalidation.
> > > >
> > > > For this use case I'm not even sure why I'd *want* to cache the PFN and
> > > > explicitly kmap/memremap it, when surely by *definition* there's a
> > > > perfectly serviceable HVA which already points to it?
> > >
> > > That's indeed true for *this* use case but my *next* use case is
> > > actually implementing the event channel delivery.
> > >
> > > What we have in-kernel already is everything we absolutely *need* in
> > > order to host Xen guests, but I really do want to fix the fact that
> > > even IPIs and timers are bouncing up through userspace.
> >
> > Here's a completely untested attempt, in which all the complexity is
> > based around the fact that I can't just pin the pages as João and
> > Ankur's original did.
> >
> > It adds a new KVM_IRQ_ROUTING_XEN_EVTCHN with an ABI that allows for us
> > to add FIFO event channels, but for now only supports 2-level.
> >
> > In kvm_xen_set_evtchn() I currently use kvm_map_gfn() *without* a cache
> > at all, but I'll work something out for that. I think I can use a
> > gfn_to_hva_cache (like the one removed in commit 319afe685) and in the
> > rare case that it's invalid, I can take kvm->lock to revalidate it.
> >
> > It sets the bit in the global shared info but doesn't touch the target
> > vCPU's vcpu_info; instead it sets a bit in an *in-kernel* shadow of the
> > target's evtchn_pending_sel word, and kicks the vCPU.
> >
> > That shadow is actually synced to the guest's vcpu_info struct in
> > kvm_xen_has_interrupt(). There's a little bit of fun asm there to set
> > the bits in the userspace struct and then clear the same set of bits in
> > the kernel shadow *if* the first op didn't fault. Or such is the
> > intent; I didn't hook up a test yet.
> >
> > As things stand, I should be able to use this for delivery of PIRQs
> > from my VMM, where things like passed-through PCI MSI gets turned into
> > Xen event channels. As well as KVM unit tests, of course.
> >
> Cool stuff!! I remember we only made IPIs and timers work but not PIRQ
> event channels.

PIRQs turn out to be fairly trivial in the end, once you get your head
around the fact that Xen guests don't actually *unmask* MSI-X when
they're routed to PIRQ; they rely on Xen to do that for them!

If you already have a virtual IOMMU doing interrupt remapping, hooking
it up to remap the magic Xen MSI 'this is really a PIRQ' message is
fairly simple. Right now I fail that translation and inject the evtchn
manually from userspace, but this will allow me to translate to
KVM_IRQ_ROUTING_XEN_EVTCHN and hook the vfio eventfd directly up to it.

> > The plan is then to hook up IPIs and timers — again based on the Oracle
> > code from before, but using eventfds for the actual evtchn delivery.
> >
> I recall the eventfd_signal() was there should the VMM choose to supply
> an eventfd. But working without one was mainly for IPIs/timers, for
> performance reasons (avoiding the call to eventfd_signal()). We saw
> some slight overhead there -- but I can't find the data right now :(

Hm, there shouldn't have been *much* additional overhead, since you did
hook up the evtchn delivery from kvm_arch_set_irq_inatomic() so you
weren't incurring the workqueue latency every time. I'll play.

Given that we can't pin pages, I spent a while lying awake at night
pondering *how* we defer the evtchn delivery if we take a page fault
and can't just sleep. I was pondering a shadow of the whole
evtchn_pending bitmap, perhaps one per vCPU because the deferred
delivery would need to know *which* vCPU to deliver it to. And more
complexity about whether it was masked or not, too (what if we set the
pending bit but then take a fault when reading whether it was masked?)

Then I remembered that irqfd_wakeup() will *try* the inatomic version
and then fall back to a workqueue, and the whole horridness just went
away :)

Perhaps as an optimisation I can do the same kind of thing — when an
IPI or timer wants to raise an evtchn, it can *try* to do so atomically
but fall back to the eventfd if it needs to wait.

I think I've just conceded that I have to map/unmap the page at a
kernel virtual address from the MMU notifier when it becomes
present/absent, so I might get *some* of the benefits of pinning from
that (at least if I protect it with a spinlock it can't go away
*between* two consecutive accesses).

> Perhaps I am not reading it right (or I forgot), but don't you need to
> use the vcpu_info in the shared_info page when the guest hasn't
> explicitly registered a *separate* vcpu_info?

We can't, because we don't know the *Xen* CPU numbering scheme. It may
differ from both kvm_vcpu->vcpu_id and kvm_vcpu->idx. I lost a week or
so of my life to that, delivering interrupts to the wrong vCPUs :)

> Or maybe this relies on the API contract (?) that the VMM needs to
> register the vcpu_info in addition to the shared_info, regardless of
> whether the Xen guest explicitly asked for a separate vcpu_info. If so,
> might be worth a comment...

Indeed, I washed my hands of that complexity in the kernel and left it
entirely up to the VMM. From Documentation/virt/kvm/api.rst (a rough
userspace sketch of the VMM side is appended at the end of this mail):

  KVM_XEN_ATTR_TYPE_SHARED_INFO
    Sets the guest physical frame number at which the Xen "shared info"
    page resides. Note that although Xen places vcpu_info for the first
    32 vCPUs in the shared_info page, KVM does not automatically do so
    and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used
    explicitly even when the vcpu_info for a given vCPU resides at the
    "default" location in the shared_info page. This is because KVM is
    not aware of the Xen CPU id which is used as the index into the
    vcpu_info[] array, so cannot know the correct default location.

> > +static inline int max_evtchn_port(struct kvm *kvm)
> > +{
> > +        if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode)
> > +                return 4096;
> > +        else
> > +                return 1024;
> > +}
> > +
> Maybe worth using the Xen ABI interface macros that help tie this in
> to the Xen guest ABI, particularly the macros in
> include/xen/interface/event_channel.h:
>
> #define EVTCHN_2L_NR_CHANNELS (sizeof(xen_ulong_t) * sizeof(xen_ulong_t) * 64)
>
> Sadly that doesn't cover the 32-bit case :( given the xen_ulong_t.

Yeah, that's a bit of a placeholder and wants cleanup, but only after
I've made it *work*. :) For KVM I have definitions of the compat stuff
in arch/x86/kvm/xen.h, so I could add it there.
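
Something along these lines, perhaps. Untested sketch only;
COMPAT_EVTCHN_2L_NR_CHANNELS is just an illustrative name for a compat
definition that would live next to the other compat bits in
arch/x86/kvm/xen.h:

/*
 * The native case is covered by include/xen/interface/event_channel.h:
 * EVTCHN_2L_NR_CHANNELS = sizeof(xen_ulong_t) * sizeof(xen_ulong_t) * 64,
 * i.e. 8 * 8 * 64 = 4096 ports for a 64-bit guest.
 *
 * Hypothetical compat counterpart for arch/x86/kvm/xen.h: a 32-bit
 * guest uses 32-bit event words, so 4 * 4 * 64 = 1024 ports.
 */
#define COMPAT_EVTCHN_2L_NR_CHANNELS (4 * 4 * 64)

static inline int max_evtchn_port(struct kvm *kvm)
{
        if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode)
                return EVTCHN_2L_NR_CHANNELS;
        else
                return COMPAT_EVTCHN_2L_NR_CHANNELS;
}

That keeps the port limits tied to the ABI definitions rather than
hard-coding 4096/1024.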
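
And to make the vcpu_info contract above concrete from the VMM side, a
minimal userspace sketch using the documented KVM_XEN_HVM_SET_ATTR /
KVM_XEN_VCPU_SET_ATTR ioctls; vm_fd, vcpu_fd, shinfo_gfn and
vcpu_info_gpa are placeholders for whatever the VMM already tracks:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Set the shared_info gfn once for the VM, then register an explicit
 * vcpu_info gpa for every vCPU, even when that vcpu_info lives at its
 * "default" slot inside the shared_info page.
 */
static int register_xen_info(int vm_fd, int vcpu_fd,
                             uint64_t shinfo_gfn, uint64_t vcpu_info_gpa)
{
        struct kvm_xen_hvm_attr ha = {
                .type = KVM_XEN_ATTR_TYPE_SHARED_INFO,
                .u.shared_info.gfn = shinfo_gfn,
        };
        struct kvm_xen_vcpu_attr va = {
                .type = KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO,
                .u.gpa = vcpu_info_gpa,
        };

        if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha))
                return -1;

        return ioctl(vcpu_fd, KVM_XEN_VCPU_SET_ATTR, &va);
}

(This assumes the VM's Xen HVM configuration has already been enabled;
error handling is omitted for brevity.)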