Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data

Stefano Stabellini <sstabellini@xxxxxxxxxx> · Tue, 10 Jan 2017 11:03:45 -0800 (PST)

On Tue, 10 Jan 2017, Dan Streetman wrote:
> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> > <sstabellini@xxxxxxxxxx> wrote:
> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> >>> > <boris.ostrovsky@xxxxxxxxxx> wrote:
> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
> >>> > >>> unmapped and may now be in use by another pci device.  The previous
> >>> > >>> pirq should never be re-used; instead a new pirq should always be
> >>> > >>> allocated from the hypervisor.
> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
> >>> > >> as we are not reusing them?
> >>> > >
> >>> > > Don't we free the pirq when we unmap it?
> >>> >
> >>> > I think this is actually a bit worse than I initially thought.  After
> >>> > looking a bit closer, and I think there's an asymmetry with pirq
> >>> > allocation:
> >>>
> >>> Lets include Stefano,
> >>>
> >>> Thank you for digging in this! This has quite the deja-vu
> >>> feeling as I believe I hit this at some point in the past and
> >>> posted some possible ways of fixing this. But sadly I can't
> >>> find the thread.
> >>
> >> This issue seems to be caused by:
> >>
> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
> >> Author: Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>
> >> Date:   Wed Dec 1 14:51:44 2010 +0000
> >>
> >>     xen: fix MSI setup and teardown for PV on HVM guests
> >>
> >> which was a fix to a bug:
> >>
> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
> >>     trying to enable the same MSI for the second time: the old MSI to pirq
> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
> >>     try to assign a new pirq anyway.
> >>     A simple way to reproduce this bug is to assign an MSI capable network
> >>     card to a PV on HVM guest, if the user brings down the corresponding
> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
> >>     device.
> >>
> >> I don't remember any of the details. From the description of this bug,
> >> it seems that Xen changed behavior in the past few years: before it used
> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
> >> have been completely freed, then reassigned to somebody else the way it
> >> is described in this email.
> >>
> >> I think we should indentify the changeset or Xen version that introduced
> >> the new behavior. If it is old enough, we might be able to just revert
> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
> >> behavior conditional to the Xen version.
> >
> > Are PT devices the only MSI-capable devices available in a Xen guest?
> > That's where I'm seeing this problem, with PT NVMe devices.

They are the main ones. It is possible to have emulated virtio devices
with emulated MSIs, for example virtio-net. Althought they are not in
the Xen Project CI-loop, so I wouldn't be surprised if they are broken
too.

> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
> > with the patch from this thread (which essentially reverts your commit
> > above) as well as some added debug to see the pirq numbers, cycles of
> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> > hypervisor provides essentially the same pirqs for each modprobe,
> > since they were freed by the rmmod.

I am fine with reverting the old patch, but we need to understand what
caused the change in behavior first. Maybe somebody else with a Xen PCI
passthrough setup at hand can help with that.

In the Xen code I can still see:

    case ECS_PIRQ: {
        struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);

        if ( !pirq )
            break;
        if ( !is_hvm_domain(d1) )
            pirq_guest_unbind(d1, pirq);

which means that pirq_guest_unbind should only be called on evtchn_close
if the guest is not an HVM guest.

> >>> > tl;dr:
> >>> >
> >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
> >>> > allocated, and reserved in the hypervisor
> >>> >
> >>> > request_irq() -> an event channel is opened for the specific pirq, and
> >>> > maps the pirq with the hypervisor
> >>> >
> >>> > free_irq() -> the event channel is closed, and the pirq is unmapped,
> >>> > but that unmap function also frees the pirq!  The hypervisor can/will
> >>> > give it away to the next call to get_free_pirq.  However, the pci
> >>> > msi/msix data area still contains the pirq number, and the next call
> >>> > to request_irq() will re-use the pirq.
> >>> >
> >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
> >>> > (it's already been freed), and it also doesn't clear anything from the
> >>> > msi/msix data area, so the pirq is still cached there.
> >>> >
> >>> >
> >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
> >>> > when the event channel is closed - or at least, not to change it to
> >>> > IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
> >>> > call into something in the Xen guest kernel that actually does the
> >>> > pirq unmapping, and clear it from the msi data area (i.e.
> >>> > pci_write_msi_msg)
> >>>
> >>> The problem is that Xen changes have sailed a long long time ago.
> >>> >
> >>> > Alternately, if the hypervisor doesn't change, then the call into the
> >>> > hypervisor to actually allocate a pirq needs to move from the
> >>> > pci_enable_msix_range() call to the request_irq() call?  So that when
> >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
> >>> > for any of the vectors; each request_irq/free_irq pair do the pirq
> >>> > allocate-and-map/unmap...
> >>>
> >>>
> >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq()
> >>> from actually calling evtchn_close() until the pci_disable_msix() is done?
> >>
> >> I think that's a reasonable alternative: we mask the evtchn, but do not
> >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
> >> Something like (not compiled, untested):
> >>
> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> >> index 137bd0e..3174923 100644
> >> --- a/drivers/xen/events/events_base.c
> >> +++ b/drivers/xen/events/events_base.c
> >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
> >>                 return;
> >>
> >>         mask_evtchn(evtchn);
> >> -       xen_evtchn_close(evtchn);
> >> +       if (!xen_hvm_domain())
> >> +               xen_evtchn_close(evtchn);
> >
> > wouldn't we need to also add code to the pci_disable_msix path that
> > would actually close the evtchn?  Would this leave the evtchn around
> > forever?
> >
> >>         xen_irq_info_cleanup(info);
> >>  }
> >>
> >>
> >> We want to keep the pirq allocated to the device - not closing the
> >> evtchn seems like the right thing to do. I suggest we test for the
> >> original bug too: enable/disable the network interface of an MSI capable
> >> network card.
> >
> > I don't have a Xen hypervisor setup myself, I'm just using AWS, but
> > I'll try to test this on an instance with a SRIOV/PT nic.
> 
> I started an instance with SRIOV nics, with the latest upstream kernel
> plus this patch and debug to show the pirqs.  Taking an interface down
> and back up uses the same pirq, because the driver only does
> free_irq/request_irq, which just closes/reopens the evtchn using the
> same pirq (which is cached in the struct irq_info) - it doesn't
> disable/reenable the MSIX vectors.  Doing a complete rmmod/modprobe of
> the driver does disable and reenable the MSIX vectors, but the
> hypervisor provides the same pirqs from the get_free_pirq() call that
> were used before (since nothing else is asking for free pirqs).

Good! Thanks for testing, it's very helpful. I believe it should work
even if the hypervisor returned a different pirq though.

> >>> > longer details:
> >>> >
> >>> > The chain of function calls starts in the initial call to configure
> >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or
> >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
> >>> > either tries to re-use any cached pirq in the MSI data area, or (for
> >>> > the first time setup) allocates a new pirq from the hypervisor via
> >>> > PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
> >>> > hypervisor's perspective, but it's not mapped to anything in the guest
> >>> > kernel.
> >>> >
> >>> > Then, the driver calls request_irq to actually start using the irq,
> >>> > which calls __setup_irq to irq_startup to startup_pirq.  The
> >>> > startup_pirq call actually creates the evtchn and binds it to the
> >>> > previously allocated pirq via EVTCHNOP_bind_pirq.
> >>> >
> >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
> >>> > and is in use.  But then, when the driver doesn't want it anymore, it
> >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
> >>> > function closes the evtchn via EVTCHNOP_close.
> >>> >
> >>> > Inside the hypervisor, in xen/common/event_channel.c in
> >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
> >>> > channel is) then it unmaps the pirq mapping via
> >>> > unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
> >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to
> >>> > give away to anyone requesting a new pirq!
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > >
> >>> > > -boris
> >>> > >
> >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
> >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if
> >>> > >>> it finds the vector was previously configured with a pirq, and that pirq
> >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
> >>> > >>> from the hypervisor.  However, that pirq was unmapped when the pci device
> >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given
> >>> > >>> to a different pci device.
> >>> > >> Hm, but this implies that we do keep track of it.
> >>> > >>
> >>> > >>
> >>> > >> while (true)
> >>> > >> do
> >>> > >>  rmmod nvme
> >>> > >>  modprobe nvme
> >>> > >> done
> >>> > >>
> >>> > >> Things go boom without this patch. But with this patch does this
> >>> > >> still work? As in we don't run out of PIRQs?
> >>> > >>
> >>> > >> Thanks.
> >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe
> >>> > >>> controllers (pci devices) are present.  The NVMe driver configures each
> >>> > >>> pci device's msi/msix twice; first to configure a single vector (to
> >>> > >>> talk to the controller for its configuration info), and then it disables
> >>> > >>> that msi/msix and re-configures with all the msi/msix it needs.  When
> >>> > >>> multiple NVMe controllers are present, this happens concurrently on all
> >>> > >>> of them, and in the time between controller A calling pci_disable_msix()
> >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix
> >>> > >>> and gets controller A's pirq allocated from the hypervisor.  Then when
> >>> > >>> controller A re-configures its msix, its first vector tries to re-use
> >>> > >>> the same pirq that it had before; but that pirq was allocated to
> >>> > >>> controller B, and thus the Xen event channel for controller A's re-used
> >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the
> >>> > >>> pirq mapped elsewhere.
> >>> > >>>
> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@xxxxxxxxxxxxx>
> >>> > >>> ---
> >>> > >>>  arch/x86/pci/xen.c | 23 +++++++----------------
> >>> > >>>  1 file changed, 7 insertions(+), 16 deletions(-)
> >>> > >>>
> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> >>> > >>> index bedfab9..a00a6c0 100644
> >>> > >>> --- a/arch/x86/pci/xen.c
> >>> > >>> +++ b/arch/x86/pci/xen.c
> >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> >>> > >>>              return 1;
> >>> > >>>
> >>> > >>>      for_each_pci_msi_entry(msidesc, dev) {
> >>> > >>> -            __pci_read_msi_msg(msidesc, &msg);
> >>> > >>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> >>> > >>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> >>> > >>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
> >>> > >>> -                xen_irq_from_pirq(pirq) < 0) {
> >>> > >>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
> >>> > >>> -                    if (pirq < 0) {
> >>> > >>> -                            irq = -ENODEV;
> >>> > >>> -                            goto error;
> >>> > >>> -                    }
> >>> > >>> -                    xen_msi_compose_msg(dev, pirq, &msg);
> >>> > >>> -                    __pci_write_msi_msg(msidesc, &msg);
> >>> > >>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> >>> > >>> -            } else {
> >>> > >>> -                    dev_dbg(&dev->dev,
> >>> > >>> -                            "xen: msi already bound to pirq=%d\n", pirq);
> >>> > >>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
> >>> > >>> +            if (pirq < 0) {
> >>> > >>> +                    irq = -ENODEV;
> >>> > >>> +                    goto error;
> >>> > >>>              }
> >>> > >>> +            xen_msi_compose_msg(dev, pirq, &msg);
> >>> > >>> +            __pci_write_msi_msg(msidesc, &msg);
> >>> > >>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> >>> > >>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
> >>> > >>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
> >>> > >>>                                             (type == PCI_CAP_ID_MSIX) ?
> >>> > >>> --
> >>> > >>> 2.9.3
> >>> > >>>
> >>> > >
> >>> > >
> >>> > > _______________________________________________
> >>> > > Xen-devel mailing list
> >>> > > Xen-devel@xxxxxxxxxxxxx
> >>> > > https://lists.xen.org/xen-devel
> >>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html