Solution for access to device registers in PCI quirk for proxy devfn aliasing

The primary intent of this email is to pass on my experience getting a
PCI quirk working with a PCIe NTB switch, in the hope that somebody in
the community can benefit from this experience.

The other intent of this email is to thank Bjorn Helgaas and Logan
Gunthorpe for helping out on the investigation and, ultimately,
finding the solution. Thank you, gentlemen.

Summary Overview:
I needed a way for a PCI switch to access host memory with TLPs that
had requester IDs (BDFs) that the host did not know about (had not
enumerated). The un-enumerated IDs had to be read from the PCI switch.

The solution implemented here is a function in drivers/pci/quirks.c
which uses pci_add_dma_alias(). But access to the PCI switch registers
was not possible until a call to pci_enable_device() was added at the
top of the function.

Problem:
With the IOMMU on, TLPs originating from the switch were rejected
because their requester IDs named PCI device-functions that the host
had never enumerated. This happens because of the way the PCIe NTB
switch allows hosts to communicate with each other via non-transparent
bridges.

Essentially, Host A would enumerate its bus and see a non-transparent
endpoint (NT EP) at some BDF, such as 03:00.1. It is non-transparent,
so nothing is enumerated behind that EP. Host B does the same thing,
and sees its NT EP, perhaps also at its own BDF 03:00.1 (or something
else if the machines are not identically configured). When Host B
tries to access memory on Host A, a "proxy ID" is used internal to the
switch. The proxy ID is the devfn portion of the BDF. So, if Host B is
internally given a proxy ID of 04.2, then a memory access from Host B
to Host A would have a TLP requester ID of 03:04.2 (the bus number of
Host A's NT EP combined with the proxy devfn). That BDF was never
enumerated by Host A.


TLP = TLP with Host B requester ID
TLP' = TLP with requester ID changed to a proxy ID for internal chip routing

                     [          SWITCH            ]

Host B ---> TLP ---> [NTB EP ---> TLP' ---> NTB EP] ---> TLP' ---> Host A

e.g. BDF    00:00.0                                      03:04.2
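
As a sanity check on the requester ID arithmetic, here is a small
user-space sketch. The three macros use the same encoding as the
kernel's PCI_DEVFN(), PCI_SLOT() and PCI_FUNC(); the helper names
make_requester_id(), print_requester_id() and requester_id_example()
are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Same encoding as the kernel's PCI_DEVFN(), PCI_SLOT() and PCI_FUNC(). */
#define DEVFN(slot, func)  ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define SLOT(devfn)        (((devfn) >> 3) & 0x1f)
#define FUNC(devfn)        ((devfn) & 0x07)

/*
 * Hypothetical helper: a requester ID carries the bus number in the
 * upper byte and the devfn in the lower byte.
 */
static inline uint16_t make_requester_id(uint8_t bus, uint8_t devfn)
{
        return (uint16_t)((bus << 8) | devfn);
}

/* Print a requester ID in the bb:dd.f form used by dmesg and lspci. */
void print_requester_id(uint16_t rid)
{
        printf("%02x:%02x.%x\n", (unsigned)(rid >> 8),
               (unsigned)SLOT(rid & 0xff), (unsigned)FUNC(rid & 0xff));
}

/* The example from the text: bus 03 on Host A, proxy devfn 04.2. */
void requester_id_example(void)
{
        uint16_t rid = make_requester_id(0x03, DEVFN(4, 2));

        assert(rid == 0x0322);
        print_requester_id(rid);        /* prints 03:04.2 */
}
```

The devfn byte (0x22 here) is the value that ends up being passed to
pci_add_dma_alias().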


By default, the IOMMU does just what it is supposed to do: it blocks
the TLP. This is the sort of thing you'd see in dmesg/syslog:

        [ 1923.060446] DMAR: [DMA Read] Request device [03:04.2] fault addr ffa00000 [fault reason 02] Present bit in context entry is clear


Proposal:
The proposed solution was to use pci_add_dma_alias() to alias the
proxy ID of any valid requester to the NT switch device on the target
host.

As a topic for another day, my initial naive attempt was to call this
in the switch's device driver. The call seemed to work, but there was
no change in behavior, as if the aliasing wasn't actually happening.
It was then that Logan suggested that the aliasing needed to happen
much earlier, and so I was pointed to drivers/pci/quirks.c.

[Note: The idea of being able to do this aliasing in the driver is of
interest to me, should somebody know how.]

Solution:
There could be more than one NT peer (host) in the system, and each peer
could have one or more proxy IDs. The proxy IDs are set by the switch
itself when it performs internal configuration after reset is
released. So, it is necessary for the quirk (on Host A in the above
example) to read this proxy configuration information from the switch
chip at runtime.

Fortunately, the switch supports a management capability which
provides access to the internal registers. This management capability
is located in BAR0. It was straightforward to create a quirk with
this basic concept. The following code is simplified/scrubbed to focus
on the essentials.

    static void quirk_ntb_dma_alias(struct pci_dev *pdev)
    {
            void __iomem *mmio;
            u32 id_info;

            /* iomap all of BAR0 */
            mmio = pci_iomap(pdev, 0, 0);
            if (mmio == NULL) {
                    dev_err(&pdev->dev, ...);
                    return;
            }

            /* read the proxy ID information */
            [...]

            id_info = ioread32(mmio + various offsets);
            [...]

            /* extract the proxy ID and alias it to this device */
            pci_add_dma_alias(pdev, (id_info >> 1) & 0xFF);

            pci_iounmap(pdev, mmio);
            return;
    }
    DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_XXX,
            PCI_DEVICE_ID_YYY,
            PCI_CLASS_BRIDGE_OTHER, 8,
            quirk_ntb_dma_alias);


The issue with the above code is that it did not work. Reads of the
mapped iomem space returned all-Fs, which typically indicates that the
reads timed out. Apparently the device was not responding to the read
TLPs.
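
One consequence worth spelling out: with an all-Fs read, the
extraction in the quirk above yields devfn 0xFF, so the quirk would
quietly alias 1f.7 instead of the real proxy ID. Below is a minimal
user-space sketch of a guard against that. The helper names are
hypothetical, and the check assumes the register can never
legitimately read back as all ones, which may not hold for every
register. I believe recent kernels provide PCI_POSSIBLE_ERROR() for
this kind of test, but check your tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a 32-bit MMIO read looks like a failed read. */
static inline bool read_looks_failed(uint32_t val)
{
        return val == UINT32_MAX;
}

/* Same shift and mask as the quirk above: proxy devfn taken from bits 8:1. */
static inline uint8_t proxy_devfn(uint32_t id_info)
{
        return (uint8_t)((id_info >> 1) & 0xFF);
}

void all_ones_example(void)
{
        uint32_t id_info = 0xFFFFFFFF;  /* what the failing reads returned */

        /* The guard catches the failed read ... */
        assert(read_looks_failed(id_info));
        /* ... which would otherwise have produced devfn 0xFF (1f.7). */
        assert(proxy_devfn(id_info) == 0xFF);
}
```

In a quirk, the equivalent check would sit between the ioread32() and
the pci_add_dma_alias() call, bailing out with an error message.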

Now, interestingly, a little test quirk (similar to the above, but
without the aliasing) was run on several different machines. These
machines did differ in terms of CPU (i7 vs a couple flavors of Xeon)
and PCI topology. In only one case was the BAR0 register space
accessible. In all other cases it was not (Fs were returned). That
mystery remains to this day.

In the end, I was pointed to the PCI command register. You can find
this register in the PCIe Base Specification section 7.5.1.1. This has
memory, I/O, and bus master enables that need to be properly set up.
Bjorn pointed me to pci_enable_device(), which sets the memory and I/O
enables (bus mastering is enabled separately, by pci_set_master()). To
quote Bjorn:

"The most likely reason it didn't respond here is that the
PCI_COMMAND_MEMORY bit in its command register is not set. That is
normally done when the driver calls pci_enable_device().  Quirks are
run before the driver claims the device, so if you need to access BARs
from a quirk, you would need to call pci_enable_device() from the quirk
itself."
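
To make the bits Bjorn mentions concrete, here is a small user-space
sketch that decodes a command register value, such as one read with
lspci or setpci. The bit values match the kernel's PCI_COMMAND_IO,
PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER definitions; the function
names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bit values as defined in the kernel's pci_regs.h. */
#define PCI_COMMAND_IO      0x1   /* respond to I/O space accesses */
#define PCI_COMMAND_MEMORY  0x2   /* respond to memory space accesses */
#define PCI_COMMAND_MASTER  0x4   /* allow the device to master the bus */

/* Hypothetical helper: memory BARs only respond when this bit is set. */
static inline bool bars_decode_memory(uint16_t cmd)
{
        return (cmd & PCI_COMMAND_MEMORY) != 0;
}

/* Print the three enables in a form similar to lspci's "Control:" line. */
void print_command(uint16_t cmd)
{
        printf("I/O%c Mem%c BusMaster%c\n",
               (cmd & PCI_COMMAND_IO) ? '+' : '-',
               (cmd & PCI_COMMAND_MEMORY) ? '+' : '-',
               (cmd & PCI_COMMAND_MASTER) ? '+' : '-');
}

void command_example(void)
{
        /* Memory decoding off: reads of BAR0 go unanswered. */
        assert(!bars_decode_memory(0x0000));
        /* Memory decoding on: BAR0 reads can complete. */
        assert(bars_decode_memory(0x0002));
        print_command(0x0002);          /* prints I/O- Mem+ BusMaster- */
}
```

A command register with the memory bit clear at quirk time is
consistent with the all-Fs reads described earlier.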


So I added the following to the top of the function, and the quirk
worked on all machines I tested it on.

    static void quirk_ntb_dma_alias(struct pci_dev *pdev)
    {
            void __iomem *mmio;
            u32 id_info;

            if (pci_enable_device(pdev)) {
                    dev_err(&pdev->dev, ...);
                    return;
            }

            /* iomap all of BAR0 */
            mmio = pci_iomap(pdev, 0, 0);
            [...]


Again, thanks to Bjorn and Logan.
Hopefully this will be a help to somebody else.

Closing repeat of previous note: If somebody knows a way to accomplish
this aliasing later so that it could be done in a device driver, I
would like to understand that.

Blessings,
Doug


