On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx> wrote: > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >>> Do not read a pci device's msi message data to see if a pirq was >>> previously configured for the device's msi/msix, as the old pirq was >>> unmapped and may now be in use by another pci device. The previous >>> pirq should never be re-used; instead a new pirq should always be >>> allocated from the hypervisor. >> Won't this cause a starvation problem? That is we will run out of PIRQs >> as we are not reusing them? > > Don't we free the pirq when we unmap it? I think this is actually a bit worse than I initially thought. After looking a bit closer, and I think there's an asymmetry with pirq allocation: tl;dr: pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq allocated, and reserved in the hypervisor request_irq() -> an event channel is opened for the specific pirq, and maps the pirq with the hypervisor free_irq() -> the event channel is closed, and the pirq is unmapped, but that unmap function also frees the pirq! The hypervisor can/will give it away to the next call to get_free_pirq. However, the pci msi/msix data area still contains the pirq number, and the next call to request_irq() will re-use the pirq. pci_disable_msix() -> this has no effect on the pirq in the hypervisor (it's already been freed), and it also doesn't clear anything from the msi/msix data area, so the pirq is still cached there. It seems like the hypervisor needs to be fixed to *not* unmap the pirq when the event channel is closed - or at least, not to change it to IRQ_UNBOUND state? And, the pci_disable_msix call should eventually call into something in the Xen guest kernel that actually does the pirq unmapping, and clear it from the msi data area (i.e. pci_write_msi_msg) Alternately, if the hypervisor doesn't change, then the call into the hypervisor to actually allocate a pirq needs to move from the pci_enable_msix_range() call to the request_irq() call? So that when the msi/msix range is enabled, it doesn't actually reserve any pirq's for any of the vectors; each request_irq/free_irq pair do the pirq allocate-and-map/unmap... longer details: The chain of function calls starts in the initial call to configure the msi vectors, which eventually calls __pci_enable_msix_range (or msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which either tries to re-use any cached pirq in the MSI data area, or (for the first time setup) allocates a new pirq from the hypervisor via PHYSDEVOP_get_free_pirq. That pirq is then reserved from the hypervisor's perspective, but it's not mapped to anything in the guest kernel. Then, the driver calls request_irq to actually start using the irq, which calls __setup_irq to irq_startup to startup_pirq. The startup_pirq call actually creates the evtchn and binds it to the previously allocated pirq via EVTCHNOP_bind_pirq. At this point, the pirq is bound to a guest kernel evtchn (and irq) and is in use. But then, when the driver doesn't want it anymore, it calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that function closes the evtchn via EVTCHNOP_close. Inside the hypervisor, in xen/common/event_channel.c in evtchn_close(), if the channel is type ECS_PIRQ (which our pirq channel is) then it unmaps the pirq mapping via unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back to state IRQ_UNBOUND, which makes it available for the hypervisor to give away to anyone requesting a new pirq! > > -boris > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's >>> msi descriptor message data for each msi/msix vector it sets up, and if >>> it finds the vector was previously configured with a pirq, and that pirq >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq >>> from the hypervisor. However, that pirq was unmapped when the pci device >>> disabled its msi/msix, and it cannot be re-used; it may have been given >>> to a different pci device. >> Hm, but this implies that we do keep track of it. >> >> >> while (true) >> do >> rmmod nvme >> modprobe nvme >> done >> >> Things go boom without this patch. But with this patch does this >> still work? As in we don't run out of PIRQs? >> >> Thanks. >>> This exact situation is happening in a Xen guest where multiple NVMe >>> controllers (pci devices) are present. The NVMe driver configures each >>> pci device's msi/msix twice; first to configure a single vector (to >>> talk to the controller for its configuration info), and then it disables >>> that msi/msix and re-configures with all the msi/msix it needs. When >>> multiple NVMe controllers are present, this happens concurrently on all >>> of them, and in the time between controller A calling pci_disable_msix() >>> and then calling pci_enable_msix_range(), controller B enables its msix >>> and gets controller A's pirq allocated from the hypervisor. Then when >>> controller A re-configures its msix, its first vector tries to re-use >>> the same pirq that it had before; but that pirq was allocated to >>> controller B, and thus the Xen event channel for controller A's re-used >>> pirq fails to map its irq to that pirq; the hypervisor already has the >>> pirq mapped elsewhere. >>> >>> Signed-off-by: Dan Streetman <dan.streetman@xxxxxxxxxxxxx> >>> --- >>> arch/x86/pci/xen.c | 23 +++++++---------------- >>> 1 file changed, 7 insertions(+), 16 deletions(-) >>> >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >>> index bedfab9..a00a6c0 100644 >>> --- a/arch/x86/pci/xen.c >>> +++ b/arch/x86/pci/xen.c >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >>> return 1; >>> >>> for_each_pci_msi_entry(msidesc, dev) { >>> - __pci_read_msi_msg(msidesc, &msg); >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >>> - xen_irq_from_pirq(pirq) < 0) { >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >>> - if (pirq < 0) { >>> - irq = -ENODEV; >>> - goto error; >>> - } >>> - xen_msi_compose_msg(dev, pirq, &msg); >>> - __pci_write_msi_msg(msidesc, &msg); >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> - } else { >>> - dev_dbg(&dev->dev, >>> - "xen: msi already bound to pirq=%d\n", pirq); >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >>> + if (pirq < 0) { >>> + irq = -ENODEV; >>> + goto error; >>> } >>> + xen_msi_compose_msg(dev, pirq, &msg); >>> + __pci_write_msi_msg(msidesc, &msg); >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >>> (type == PCI_CAP_ID_MSIX) ? >>> -- >>> 2.9.3 >>> > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > https://lists.xen.org/xen-devel -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html