Re: Problem with PCIe enumeration of Google/Coral TPU Edge module on Linux

[+cc Todd]

On Thu, Apr 09, 2020 at 11:30:10AM -0500, Bjorn Helgaas wrote:
> On Thu, Apr 09, 2020 at 04:25:40PM +0100, Luís Mendes wrote:
> > Hi Bjorn,
> > 
> > I've good news. I've found the culprit and it is a pretty simple
> > issue, however the good solution is not obvious to me.
> > Can you help in finding the best way to patch this issue?
> > 
> > First, the problem in detail: setup-bus.c has this *if condition*
> > that ignores resources from classless devices, and this Google/Coral
> > Edge TPU is a classless device with class 0xff:
> > 
> > static void __dev_sort_resources(struct pci_dev *dev, struct list_head *head)
> > {
> >     u16 class = dev->class >> 8;
> > 
> >     pci_info(dev, "%s\n", __func__);
> >     /* Don't touch classless devices or host bridges or IOAPICs */
> >     if (class == PCI_CLASS_NOT_DEFINED || class == PCI_CLASS_BRIDGE_HOST)
> >         return;
> >    ....
> > 
> > One trivial, non-generic change that works is:
> > static void __dev_sort_resources(struct pci_dev *dev, struct list_head *head)
> > {
> >     u16 class = dev->class >> 8;
> > 
> >     pci_info(dev, "%s\n", __func__);
> >     /* Don't touch classless devices or host bridges or IOAPICs */
> >     if ((class == PCI_CLASS_NOT_DEFINED &&
> >          !(dev->vendor == 0x1ac1 && dev->device == 0x089a)) ||
> >         class == PCI_CLASS_BRIDGE_HOST)
> >         return;
> >    ....
> > 
> > What is your suggestion to make the solution generic? Create a
> > whitelist? Remove this verification? I have no idea... nothing sounds
> > good to me...
> 
> Good detective work, thanks for chasing this down!
> 
> I should have seen that check when adding the debug.  Guess I thought
> "sort", hmmm, that just re-orders things without actually changing the
> content.  But pdev_sort_resources() in fact *adds* resources to a
> list, and if resources aren't on the list, we apparently don't assign
> space for them.
> 
> In any event, I would first check to see if there's an Edge TPU
> firmware update that might set the class code.
> 
> If not, we should probably add a quirk to override the class code,
> similar to quirk_eisa_bridge(), fixup_rev1_53c810(),
> fixup_ti816x_class(), quirk_tw686x_class().

In fact, apex_pci_fixup_class() already exists!  But it's in
apex_driver.c.  Do you happen to have CONFIG_STAGING_APEX_DRIVER=m
(built as a module)?  If so, that quirk won't be run until the module
is loaded, and that happens long after resource assignment.
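
To make the ordering problem concrete, here is a rough userspace
sketch (not kernel code) of the class check and the class fixup.
struct fake_pci_dev, skipped_by_sort() and fixup_class() are
hypothetical stand-ins I made up for illustration; the class constants
are the values from include/linux/pci_ids.h.  A starting class of
0x0000ff (prog-if 0xff) is only an assumption based on the "class
0xff" mentioned above.

```c
/* Userspace sketch only: models the ordering problem, not real kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined in include/linux/pci_ids.h */
#define PCI_CLASS_NOT_DEFINED  0x0000
#define PCI_CLASS_BRIDGE_HOST  0x0600
#define PCI_CLASS_SYSTEM_OTHER 0x0880

/* Hypothetical stand-in for the few struct pci_dev fields used here */
struct fake_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint32_t class;		/* 24 bits: base class, sub-class, prog-if */
};

/* Mirrors the check in __dev_sort_resources(): true means BARs are skipped */
static bool skipped_by_sort(const struct fake_pci_dev *dev)
{
	uint16_t class;

	assert(dev != NULL);
	class = dev->class >> 8;	/* drop prog-if, keep base/sub-class */
	return class == PCI_CLASS_NOT_DEFINED ||
	       class == PCI_CLASS_BRIDGE_HOST;
}

/* Mirrors apex_pci_fixup_class(): set base/sub-class, keep prog-if */
static void fixup_class(struct fake_pci_dev *dev)
{
	assert(dev != NULL);
	dev->class = (PCI_CLASS_SYSTEM_OTHER << 8) | dev->class;
}
```

If skipped_by_sort() is evaluated before fixup_class() has run, which
is what happens when the quirk lives in a module, the device is
skipped and its BARs never get space assigned.  Once the fixup has run,
the same check lets the device through.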

Building with CONFIG_STAGING_APEX_DRIVER=y (not =m) should be a
workaround.  But I think the real fix would be moving
apex_pci_fixup_class() from apex_driver.c to drivers/pci/quirks.c,
like the following.  Would you mind testing it?


commit 59f3165318b3 ("PCI: Move Apex Edge TPU class quirk to fix BAR assignment")
Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Date:   Thu Apr 9 12:43:45 2020 -0500

    PCI: Move Apex Edge TPU class quirk to fix BAR assignment
    
    Some Google Apex Edge TPU devices have a class code of 0
    (PCI_CLASS_NOT_DEFINED).  This prevents the PCI core from assigning
    resources for the Apex BARs because __dev_sort_resources() ignores
    classless devices, host bridges, and IOAPICs.
    
    On x86, firmware typically assigns those resources, so this was not a
    problem.  But on some architectures, firmware does *not* assign BARs, and
    since the PCI core didn't do it either, the Apex device didn't work
    correctly:
    
      apex 0000:01:00.0: can't enable device: BAR 0 [mem 0x00000000-0x00003fff 64bit pref] not claimed
      apex 0000:01:00.0: error enabling PCI device
    
    f390d08d8b87 ("staging: gasket: apex: fixup undefined PCI class") added a
    quirk to fix the class code, but it was in the apex driver, and if the
    driver was built as a module, it was too late to help.
    
    Move the quirk to the PCI core, where it will always run early enough that
    the PCI core will assign resources if necessary.
    
    Link: https://lore.kernel.org/r/CAEzXK1r0Er039iERnc2KJ4jn7ySNUOG9H=Ha8TD8XroVqiZjgg@xxxxxxxxxxxxxx
    Fixes: f390d08d8b87 ("staging: gasket: apex: fixup undefined PCI class")
    Reported-by: Luís Mendes <luis.p.mendes@xxxxxxxxx>
    Debugged-by: Luís Mendes <luis.p.mendes@xxxxxxxxx>
    Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 28c9a2409c50..ca9ed5774eb1 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5567,3 +5567,10 @@ static void pci_fixup_no_d0_pme(struct pci_dev *dev)
 	dev->pme_support &= ~(PCI_PM_CAP_PME_D0 >> PCI_PM_CAP_PME_SHIFT);
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASMEDIA, 0x2142, pci_fixup_no_d0_pme);
+
+static void apex_pci_fixup_class(struct pci_dev *pdev)
+{
+	pdev->class = (PCI_CLASS_SYSTEM_OTHER << 8) | pdev->class;
+}
+DECLARE_PCI_FIXUP_CLASS_HEADER(0x1ac1, 0x089a,
+			       PCI_CLASS_NOT_DEFINED, 8, apex_pci_fixup_class);
diff --git a/drivers/staging/gasket/apex_driver.c b/drivers/staging/gasket/apex_driver.c
index 46199c8ca441..f12f81c8dd2f 100644
--- a/drivers/staging/gasket/apex_driver.c
+++ b/drivers/staging/gasket/apex_driver.c
@@ -570,13 +570,6 @@ static const struct pci_device_id apex_pci_ids[] = {
 	{ PCI_DEVICE(APEX_PCI_VENDOR_ID, APEX_PCI_DEVICE_ID) }, { 0 }
 };
 
-static void apex_pci_fixup_class(struct pci_dev *pdev)
-{
-	pdev->class = (PCI_CLASS_SYSTEM_OTHER << 8) | pdev->class;
-}
-DECLARE_PCI_FIXUP_CLASS_HEADER(APEX_PCI_VENDOR_ID, APEX_PCI_DEVICE_ID,
-			       PCI_CLASS_NOT_DEFINED, 8, apex_pci_fixup_class);
-
 static int apex_pci_probe(struct pci_dev *pci_dev,
 			  const struct pci_device_id *id)
 {
