On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini <sstabellini@xxxxxxxxxx> wrote:
> On Tue, 10 Jan 2017, Dan Streetman wrote:
>> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
>> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini <sstabellini@xxxxxxxxxx> wrote:
>> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx> wrote:
>> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>> >>> > >>> Do not read a pci device's msi message data to see if a pirq was previously configured for the device's msi/msix, as the old pirq was unmapped and may now be in use by another pci device. The previous pirq should never be re-used; instead a new pirq should always be allocated from the hypervisor.
>> >>> > >>
>> >>> > >> Won't this cause a starvation problem? That is, we will run out of PIRQs as we are not reusing them?
>> >>> > >
>> >>> > > Don't we free the pirq when we unmap it?
>> >>> >
>> >>> > I think this is actually a bit worse than I initially thought. After looking a bit closer, I think there's an asymmetry with pirq allocation:
>> >>>
>> >>> Let's include Stefano.
>> >>>
>> >>> Thank you for digging into this! This has quite the deja-vu feeling, as I believe I hit this at some point in the past and posted some possible ways of fixing it. But sadly I can't find the thread.
>> >>
>> >> This issue seems to be caused by:
>> >>
>> >>     commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>> >>     Author: Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>
>> >>     Date:   Wed Dec 1 14:51:44 2010 +0000
>> >>
>> >>         xen: fix MSI setup and teardown for PV on HVM guests
>> >>
>> >> which was a fix to a bug:
>> >>
>> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>> >>     trying to enable the same MSI for the second time: the old MSI to pirq
>> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>> >>     try to assign a new pirq anyway.
>> >>     A simple way to reproduce this bug is to assign an MSI capable network
>> >>     card to a PV on HVM guest, if the user brings down the corresponding
>> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
>> >>     device.
>> >>
>> >> I don't remember any of the details. From the description of this bug, it seems that Xen changed behavior in the past few years: before, it used to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the old MSI to pirq mapping is still valid at this point", the pirq couldn't have been completely freed, then reassigned to somebody else the way it is described in this email.
>> >>
>> >> I think we should identify the changeset or Xen version that introduced the new behavior. If it is old enough, we might be able to just revert af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the behavior conditional on the Xen version.
>> >
>> > Are PT devices the only MSI-capable devices available in a Xen guest? That's where I'm seeing this problem, with PT NVMe devices.
>
> They are the main ones. It is possible to have emulated virtio devices with emulated MSIs, for example virtio-net.
> They are not in the Xen Project CI-loop, though, so I wouldn't be surprised if they are broken too.
>
>> > I can say that on the Xen guest with NVMe PT devices I'm testing on, with the patch from this thread (which essentially reverts your commit above) as well as some added debug to see the pirq numbers, cycles of 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the hypervisor provides essentially the same pirqs for each modprobe, since they were freed by the rmmod.
>
> I am fine with reverting the old patch, but we need to understand what caused the change in behavior first. Maybe somebody else with a Xen PCI passthrough setup at hand can help with that.
>
> In the Xen code I can still see:
>
>     case ECS_PIRQ: {
>         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>
>         if ( !pirq )
>             break;
>         if ( !is_hvm_domain(d1) )
>             pirq_guest_unbind(d1, pirq);
>
> which means that pirq_guest_unbind should only be called on evtchn_close if the guest is not an HVM guest.

I tried an experiment to call get_free_pirq on both sides of an evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I see something interesting: each nic uses 3 MSIs, and it looks like each nic's 3 pirqs are not available until after the nic is done shutting down, so it seems like closing the evtchn isn't what makes the pirq free.

[3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
[3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
[3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
[3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
[3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
[3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
[3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
[3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps

nic 3 uses pirqs 99-101, while nic 4 uses pirqs 96-98.

[3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == 0, xen_pvh_domain() == 0

just to be sure, a bit of dbg so I know what domain this is :-)

[3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
[3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
[3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
[3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
[3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
[3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
[3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
[3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
[3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88

nic 4 is shut down first, and closes its evtchns for pirqs 96-98; but none of those become available for get_free_pirq.

[3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98

aha, now nic 4 has fully finished shutting down, and nic 3 has started its shutdown; we see those pirqs from nic 4 are now available. So it must not be evtchn closing that frees the pirqs.
[3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
[3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
[3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
[3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
[3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
[3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
[3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
[3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85

so, since pci_disable_msix eventually calls xen_teardown_msi_irq(), which calls xen_destroy_irq(), I moved the dbg to xen_destroy_irq() (and recompiled/rebooted) and found the pirqs have already been freed before that is called:

[3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
[3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
[3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
[3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
[3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
[3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
[3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
[3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
[3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
[3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
[3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
[3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95

I'm still following through the kernel code, but it's not immediately obvious what exactly is telling the hypervisor to free the pirqs; any idea?

From the hypervisor code, it seems that the pirq is "available" via is_free_pirq():

    return !pirq || (!pirq->arch.irq &&
        (!is_hvm_domain(d) || pirq->arch.hvm.emuirq == IRQ_UNBOUND));

when the evtchn is closed, it does:

    if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
        unmap_domain_pirq_emuirq(d1, pirq->pirq);

and that call to unmap_domain_pirq_emuirq does:

    info->arch.hvm.emuirq = IRQ_UNBOUND;

so the only thing left is to clear pirq->arch.irq, but the only place I can find that does that is clear_domain_irq_pirq(), which is only called from pirq_guest_unbind() and unmap_domain_pirq() - and I'm not seeing where either of those would be called when all the kernel is doing is disabling a pci device.

>
>
>> >>> > tl;dr:
>> >>> >
>> >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq allocated, and reserved in the hypervisor
>> >>> >
>> >>> > request_irq() -> an event channel is opened for the specific pirq, and maps the pirq with the hypervisor
>> >>> >
>> >>> > free_irq() -> the event channel is closed, and the pirq is unmapped, but that unmap function also frees the pirq! The hypervisor can/will give it away to the next call to get_free_pirq. However, the pci msi/msix data area still contains the pirq number, and the next call to request_irq() will re-use the pirq.
>> >>> >
>> >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor (it's already been freed), and it also doesn't clear anything from the msi/msix data area, so the pirq is still cached there.
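To make that asymmetry concrete, here is a minimal, untested driver-side sketch of the lifecycle above; the function example_msix_cycle(), the "example" irq name, the handler/priv arguments and the single-vector usage are all made up for illustration, and the comments just restate where I believe each Xen-side operation happens (per the summary above, not verified against any real driver):

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Untested sketch of the enable/request/free/disable lifecycle. */
static int example_msix_cycle(struct pci_dev *pdev, irq_handler_t handler, void *priv)
{
	struct msix_entry entry = { .entry = 0 };
	int ret;

	/* xen_hvm_setup_msi_irqs(): PHYSDEVOP_get_free_pirq reserves a pirq */
	ret = pci_enable_msix_range(pdev, &entry, 1, 1);
	if (ret < 0)
		return ret;

	/* startup_pirq(): EVTCHNOP_bind_pirq opens an evtchn and maps the pirq */
	ret = request_irq(entry.vector, handler, 0, "example", priv);
	if (ret)
		goto disable;

	/* shutdown_pirq(): EVTCHNOP_close - and the hypervisor also frees the pirq */
	free_irq(entry.vector, priv);

disable:
	/* no hypercall here; the now-stale pirq number stays cached in the MSI
	 * message, so a later pci_enable_msix_range() may try to re-use it */
	pci_disable_msix(pdev);
	return ret;
}

The point of the sketch is that the pirq's hypervisor-side lifetime ends at free_irq() while the guest-side bookkeeping (the MSI message data) lives until the next enable, which is exactly the window the NVMe case hits.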
>> >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq when the event channel is closed - or at least, not to change it to the IRQ_UNBOUND state? And, the pci_disable_msix call should eventually call into something in the Xen guest kernel that actually does the pirq unmapping, and clear it from the msi data area (i.e. pci_write_msi_msg).
>> >>>
>> >>> The problem is that the Xen changes sailed a long, long time ago.
>> >>> >
>> >>> > Alternately, if the hypervisor doesn't change, then the call into the hypervisor to actually allocate a pirq needs to move from the pci_enable_msix_range() call to the request_irq() call? So that when the msi/msix range is enabled, it doesn't actually reserve any pirqs for any of the vectors; each request_irq/free_irq pair does the pirq allocate-and-map/unmap...
>> >>>
>> >>> Or a third one: we keep a pirq->device lookup and inhibit free_irq() from actually calling evtchn_close() until the pci_disable_msix() is done?
>> >>
>> >> I think that's a reasonable alternative: we mask the evtchn, but do not call xen_evtchn_close in shutdown_pirq for PV on HVM guests. Something like (not compiled, untested):
>> >>
>> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> >> index 137bd0e..3174923 100644
>> >> --- a/drivers/xen/events/events_base.c
>> >> +++ b/drivers/xen/events/events_base.c
>> >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
>> >>                 return;
>> >>
>> >>         mask_evtchn(evtchn);
>> >> -       xen_evtchn_close(evtchn);
>> >> +       if (!xen_hvm_domain())
>> >> +               xen_evtchn_close(evtchn);
>> >
>> > wouldn't we need to also add code to the pci_disable_msix path that would actually close the evtchn? Would this leave the evtchn around forever?
>> >
>> >>         xen_irq_info_cleanup(info);
>> >>  }
>> >>
>> >> We want to keep the pirq allocated to the device - not closing the evtchn seems like the right thing to do. I suggest we test for the original bug too: enable/disable the network interface of an MSI-capable network card.
>> >
>> > I don't have a Xen hypervisor setup myself, I'm just using AWS, but I'll try to test this on an instance with an SRIOV/PT nic.
>>
>> I started an instance with SRIOV nics, with the latest upstream kernel plus this patch and debug to show the pirqs. Taking an interface down and back up uses the same pirq, because the driver only does free_irq/request_irq, which just closes/reopens the evtchn using the same pirq (which is cached in the struct irq_info) - it doesn't disable/reenable the MSIX vectors. Doing a complete rmmod/modprobe of the driver does disable and reenable the MSIX vectors, but the hypervisor provides the same pirqs from the get_free_pirq() call that were used before (since nothing else is asking for free pirqs).
>
> Good! Thanks for testing, it's very helpful. I believe it should work even if the hypervisor returned a different pirq though.
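Regarding my question above about also closing the evtchn from the pci_disable_msix path: this untested sketch is roughly what I had in mind. It assumes shutdown_pirq() would also skip xen_irq_info_cleanup() for HVM domains (so the irq->evtchn association survives until the MSI teardown path, xen_teardown_msi_irq()/xen_destroy_irq(), runs), it assumes it lives in drivers/xen/events/events_base.c next to the existing helpers, and the function name xen_close_msi_evtchn() is made up; evtchn_from_irq()/VALID_EVTCHN()/xen_evtchn_close()/xen_irq_info_cleanup() are the existing helpers as I understand them:

/* Untested sketch; would be called from the MSI teardown path once the
 * device is really done with the vector, to close the event channel
 * that shutdown_pirq() intentionally left open for HVM domains. */
static void xen_close_msi_evtchn(unsigned int irq)
{
	unsigned int evtchn = evtchn_from_irq(irq);

	if (!xen_hvm_domain() || !VALID_EVTCHN(evtchn))
		return;

	xen_evtchn_close(evtchn);
	xen_irq_info_cleanup(info_for_irq(irq));
}

If that's not the right place to hook it, the same idea could presumably go wherever xen_destroy_irq() does its cleanup.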
>> >>> > longer details:
>> >>> >
>> >>> > The chain of function calls starts in the initial call to configure the msi vectors, which eventually calls __pci_enable_msix_range (or msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which either tries to re-use any cached pirq in the MSI data area, or (for the first time setup) allocates a new pirq from the hypervisor via PHYSDEVOP_get_free_pirq. That pirq is then reserved from the hypervisor's perspective, but it's not mapped to anything in the guest kernel.
>> >>> >
>> >>> > Then, the driver calls request_irq to actually start using the irq, which calls __setup_irq to irq_startup to startup_pirq. The startup_pirq call actually creates the evtchn and binds it to the previously allocated pirq via EVTCHNOP_bind_pirq.
>> >>> >
>> >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq) and is in use. But then, when the driver doesn't want it anymore, it calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that function closes the evtchn via EVTCHNOP_close.
>> >>> >
>> >>> > Inside the hypervisor, in xen/common/event_channel.c in evtchn_close(), if the channel is type ECS_PIRQ (which our pirq channel is) then it unmaps the pirq mapping via unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back to state IRQ_UNBOUND, which makes it available for the hypervisor to give away to anyone requesting a new pirq!
>> >>> >
>> >>> > >
>> >>> > > -boris
>> >>> > >
>> >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's msi descriptor message data for each msi/msix vector it sets up, and if it finds the vector was previously configured with a pirq, and that pirq is mapped to an irq, it re-uses the pirq instead of requesting a new pirq from the hypervisor. However, that pirq was unmapped when the pci device disabled its msi/msix, and it cannot be re-used; it may have been given to a different pci device.
>> >>> > >>
>> >>> > >> Hm, but this implies that we do keep track of it.
>> >>> > >>
>> >>> > >>     while (true)
>> >>> > >>     do
>> >>> > >>         rmmod nvme
>> >>> > >>         modprobe nvme
>> >>> > >>     done
>> >>> > >>
>> >>> > >> Things go boom without this patch. But with this patch does this still work? As in we don't run out of PIRQs?
>> >>> > >>
>> >>> > >> Thanks.
>> >>> > >>
>> >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe controllers (pci devices) are present. The NVMe driver configures each pci device's msi/msix twice; first to configure a single vector (to talk to the controller for its configuration info), and then it disables that msi/msix and re-configures with all the msi/msix it needs. When multiple NVMe controllers are present, this happens concurrently on all of them, and in the time between controller A calling pci_disable_msix() and then calling pci_enable_msix_range(), controller B enables its msix and gets controller A's pirq allocated from the hypervisor.
>> >>> > >>> Then when controller A re-configures its msix, its first vector tries to re-use the same pirq that it had before; but that pirq was allocated to controller B, and thus the Xen event channel for controller A's re-used pirq fails to map its irq to that pirq; the hypervisor already has the pirq mapped elsewhere.
>> >>> > >>>
>> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@xxxxxxxxxxxxx>
>> >>> > >>> ---
>> >>> > >>>  arch/x86/pci/xen.c | 23 +++++++----------------
>> >>> > >>>  1 file changed, 7 insertions(+), 16 deletions(-)
>> >>> > >>>
>> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>> >>> > >>> index bedfab9..a00a6c0 100644
>> >>> > >>> --- a/arch/x86/pci/xen.c
>> >>> > >>> +++ b/arch/x86/pci/xen.c
>> >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>> >>> > >>>                 return 1;
>> >>> > >>>
>> >>> > >>>         for_each_pci_msi_entry(msidesc, dev) {
>> >>> > >>> -               __pci_read_msi_msg(msidesc, &msg);
>> >>> > >>> -               pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>> >>> > >>> -                       ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>> >>> > >>> -               if (msg.data != XEN_PIRQ_MSI_DATA ||
>> >>> > >>> -                   xen_irq_from_pirq(pirq) < 0) {
>> >>> > >>> -                       pirq = xen_allocate_pirq_msi(dev, msidesc);
>> >>> > >>> -                       if (pirq < 0) {
>> >>> > >>> -                               irq = -ENODEV;
>> >>> > >>> -                               goto error;
>> >>> > >>> -                       }
>> >>> > >>> -                       xen_msi_compose_msg(dev, pirq, &msg);
>> >>> > >>> -                       __pci_write_msi_msg(msidesc, &msg);
>> >>> > >>> -                       dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>> >>> > >>> -               } else {
>> >>> > >>> -                       dev_dbg(&dev->dev,
>> >>> > >>> -                               "xen: msi already bound to pirq=%d\n", pirq);
>> >>> > >>> +               pirq = xen_allocate_pirq_msi(dev, msidesc);
>> >>> > >>> +               if (pirq < 0) {
>> >>> > >>> +                       irq = -ENODEV;
>> >>> > >>> +                       goto error;
>> >>> > >>>                 }
>> >>> > >>> +               xen_msi_compose_msg(dev, pirq, &msg);
>> >>> > >>> +               __pci_write_msi_msg(msidesc, &msg);
>> >>> > >>> +               dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>> >>> > >>>                 irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>> >>> > >>>                                                (type == PCI_CAP_ID_MSI) ? nvec : 1,
>> >>> > >>>                                                (type == PCI_CAP_ID_MSIX) ?
>> >>> > >>> --
>> >>> > >>> 2.9.3
>> >>> > >>>
>> >>> > >
>> >>> > >
>> >>> > > _______________________________________________
>> >>> > > Xen-devel mailing list
>> >>> > > Xen-devel@xxxxxxxxxxxxx
>> >>> > > https://lists.xen.org/xen-devel
>> >>>
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html