On Fri, 2021-10-29 at 12:31 +0100, Joao Martins wrote:
> On 10/28/21 23:22, David Woodhouse wrote:
> > On Mon, 2021-10-25 at 13:19 +0100, David Woodhouse wrote:
> > > On Mon, 2021-10-25 at 11:39 +0100, David Woodhouse wrote:
> > > > > One possible solution (which I even have unfinished patches for) is to
> > > > > put all the gfn_to_pfn_caches on a list, and refresh them when the MMU
> > > > > notifier receives an invalidation.
> > > >
> > > > For this use case I'm not even sure why I'd *want* to cache the PFN and
> > > > explicitly kmap/memremap it, when surely by *definition* there's a
> > > > perfectly serviceable HVA which already points to it?
> > >
> > > That's indeed true for *this* use case but my *next* use case is
> > > actually implementing the event channel delivery.
> > >
> > > What we have in-kernel already is everything we absolutely *need* in
> > > order to host Xen guests, but I really do want to fix the fact that
> > > even IPIs and timers are bouncing up through userspace.
> >
> > Here's a completely untested attempt, in which all the complexity is
> > based around the fact that I can't just pin the pages as João and
> > Ankur's original did.
> >
> > It adds a new KVM_IRQ_ROUTING_XEN_EVTCHN with an ABI that allows for us
> > to add FIFO event channels, but for now only supports 2-level.
> >
> > In kvm_xen_set_evtchn() I currently use kvm_map_gfn() *without* a cache
> > at all, but I'll work something out for that. I think I can use a
> > gfn_to_hva_cache (like the one removed in commit 319afe685) and in the
> > rare case that it's invalid, I can take kvm->lock to revalidate it.
> >
> > It sets the bit in the global shared info but doesn't touch the target
> > vCPU's vcpu_info; instead it sets a bit in an *in-kernel* shadow of the
> > target's evtchn_pending_sel word, and kicks the vCPU.
> >
> > That shadow is actually synced to the guest's vcpu_info struct in
> > kvm_xen_has_interrupt(). There's a little bit of fun asm there to set
> > the bits in the userspace struct and then clear the same set of bits in
> > the kernel shadow *if* the first op didn't fault. Or such is the
> > intent; I didn't hook up a test yet.
> >
> > As things stand, I should be able to use this for delivery of PIRQs
> > from my VMM, where things like passed-through PCI MSI gets turned into
> > Xen event channels. As well as KVM unit tests, of course.
> >
> Cool stuff!! I remember we only made IPIs and timers work but not PIRQ
> event channels.

PIRQs turn out to be fairly trivial in the end, once you get your head
around the fact that Xen guests don't actually *unmask* MSI-X when
they're routed to PIRQ; they rely on Xen to do that for them!

If you already have a virtual IOMMU doing interrupt remapping, hooking
it up to remap the magic Xen MSI 'this is really a PIRQ' message is
fairly simple. Right now I fail that translation and inject the evtchn
manually from userspace, but this will allow me to translate to
KVM_IRQ_ROUTING_XEN_EVTCHN and hook the vfio eventfd directly up to it.

> > The plan is then to hook up IPIs and timers — again based on the Oracle
> > code from before, but using eventfds for the actual evtchn delivery.
> >
> I recall the eventfd_signal() was there should the VMM choose to supply
> an eventfd. But working without one was mainly for IPIs/timers, for
> performance reasons (avoiding the call to eventfd_signal()). We saw
> some slight overhead there -- but I can't find the data right now :(

Hm, there shouldn't have been *much* additional overhead, since you did
hook up the evtchn delivery from kvm_arch_set_irq_inatomic() so you
weren't incurring the workqueue latency every time. I'll play.

Given that we can't pin pages, I spent a while lying awake at night
pondering *how* we defer the evtchn delivery if we take a page fault
and can't just sleep. I was pondering a shadow of the whole
evtchn_pending bitmap, perhaps one per vCPU because the deferred
delivery would need to know *which* vCPU to deliver it to. And more
complexity about whether it was masked or not, too (what if we set the
pending bit but then take a fault when reading whether it was masked?)

Then I remembered that irqfd_wakeup() will *try* the inatomic version
and then fall back to a workqueue, and the whole horridness just went
away :)

Perhaps as an optimisation I can do the same kind of thing — when an
IPI or timer wants to raise an evtchn, it can *try* to do so atomically
but fall back to the eventfd if it needs to wait.

I think I've just conceded that I have to map/unmap the page at a
kernel virtual address from the MMU notifier when it becomes
present/absent, so I might get *some* of the benefits of pinning from
that (at least if I protect it with a spinlock it can't go away
*between* two consecutive accesses).

> Perhaps I am not reading it right (or I forgot), but don't you need to
> use the vcpu_info in the shared_info page when the guest hasn't
> explicitly registered a *separate* vcpu_info?

We can't, because we don't know the *Xen* CPU numbering scheme. It may
differ from both kvm_vcpu->vcpu_id and kvm_vcpu->idx. I lost a week or
so of my life to that, delivering interrupts to the wrong vCPUs :)

> Or maybe this relies on the API contract (?) that the VMM needs to
> register the vcpu_info in addition to the shared_info, regardless of
> whether the Xen guest explicitly asked for a separate vcpu_info. If so,
> might be worth a comment...

Indeed, I washed my hands of that complexity in the kernel and left it
entirely up to the VMM. From Documentation/virt/kvm/api.rst (a rough
userspace sketch of the VMM side is appended at the end of this mail):

  KVM_XEN_ATTR_TYPE_SHARED_INFO
    Sets the guest physical frame number at which the Xen "shared info"
    page resides. Note that although Xen places vcpu_info for the first
    32 vCPUs in the shared_info page, KVM does not automatically do so
    and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used
    explicitly even when the vcpu_info for a given vCPU resides at the
    "default" location in the shared_info page. This is because KVM is
    not aware of the Xen CPU id which is used as the index into the
    vcpu_info[] array, so cannot know the correct default location.

> > +static inline int max_evtchn_port(struct kvm *kvm)
> > +{
> > +        if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode)
> > +                return 4096;
> > +        else
> > +                return 1024;
> > +}
> > +
> Maybe worth using the Xen ABI interface macros that help tie this in
> to the Xen guest ABI, particularly the macros in
> include/xen/interface/event_channel.h:
>
> #define EVTCHN_2L_NR_CHANNELS (sizeof(xen_ulong_t) * sizeof(xen_ulong_t) * 64)
>
> Sadly that doesn't cover the 32-bit case :( given the xen_ulong_t.

Yeah, that's a bit of a placeholder and wants cleanup, but only after
I've made it *work*. :) For KVM I have definitions of the compat stuff
in arch/x86/kvm/xen.h, so I could add it there.
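
Something along these lines, perhaps. Untested sketch only;
COMPAT_EVTCHN_2L_NR_CHANNELS is just an illustrative name for a compat
definition that would live next to the other compat bits in
arch/x86/kvm/xen.h:

/*
 * The native case is covered by include/xen/interface/event_channel.h:
 * EVTCHN_2L_NR_CHANNELS = sizeof(xen_ulong_t) * sizeof(xen_ulong_t) * 64,
 * i.e. 8 * 8 * 64 = 4096 ports for a 64-bit guest.
 *
 * Hypothetical compat counterpart for arch/x86/kvm/xen.h: a 32-bit
 * guest uses 32-bit event words, so 4 * 4 * 64 = 1024 ports.
 */
#define COMPAT_EVTCHN_2L_NR_CHANNELS (4 * 4 * 64)

static inline int max_evtchn_port(struct kvm *kvm)
{
        if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode)
                return EVTCHN_2L_NR_CHANNELS;
        else
                return COMPAT_EVTCHN_2L_NR_CHANNELS;
}

That keeps the port limits tied to the ABI definitions rather than
hard-coding 4096/1024.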
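
And to make the vcpu_info contract above concrete from the VMM side, a
minimal userspace sketch using the documented KVM_XEN_HVM_SET_ATTR /
KVM_XEN_VCPU_SET_ATTR ioctls; vm_fd, vcpu_fd, shinfo_gfn and
vcpu_info_gpa are placeholders for whatever the VMM already tracks:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Set the shared_info gfn once for the VM, then register an explicit
 * vcpu_info gpa for every vCPU, even when that vcpu_info lives at its
 * "default" slot inside the shared_info page.
 */
static int register_xen_info(int vm_fd, int vcpu_fd,
                             uint64_t shinfo_gfn, uint64_t vcpu_info_gpa)
{
        struct kvm_xen_hvm_attr ha = {
                .type = KVM_XEN_ATTR_TYPE_SHARED_INFO,
                .u.shared_info.gfn = shinfo_gfn,
        };
        struct kvm_xen_vcpu_attr va = {
                .type = KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO,
                .u.gpa = vcpu_info_gpa,
        };

        if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha))
                return -1;

        return ioctl(vcpu_fd, KVM_XEN_VCPU_SET_ATTR, &va);
}

(This assumes the VM's Xen HVM configuration has already been enabled;
error handling is omitted for brevity.)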