On Thu, Sep 11, 2014 at 3:24 PM, Dirk Gouders <dirk@xxxxxxxxxxx> wrote:
> Bjorn Helgaas <bhelgaas@xxxxxxxxxx> writes:
>
>> On Thu, Sep 11, 2014 at 2:33 PM, Dirk Gouders <dirk@xxxxxxxxxxx> wrote:
>>> What I was currently trying was to construct a test-environment so that
>>> I do not need to do tests and diagnosis on a busy machine.
>>>
>>> I noticed that this problem seems to start with the narrow Root
>>> Bridge window (00-07) but every other machine that I had a look at,
>>> starts with (00-ff), so those will not trigger my problem.
>>>
>>> I thought I could perhaps try to shrink the window in
>>> acpi_pci_root_add() to trigger the problem and that kind of works: it
>>> triggers it but not exactly the same way, because it basically ends at
>>> this code in pci_scan_bridge():
>>>
>>>         if (max >= bus->busn_res.end) {
>>>                 dev_warn(&dev->dev, "can't allocate child bus %02x from %pR (pass %d)\n",
>>>                          max, &bus->busn_res, pass);
>>>                 goto out;
>>>         }
>>>
>>> If this could work but I am just missing a small detail, I would be
>>> glad to hear about it and do the first tests this way. If it is
>>> complete nonsense, I will just use the machine that triggers the problem
>>> for the tests.
>>
>> I was about to suggest the same thing. If the problem is related to
>> the bus number change, we should be able to force that to happen on a
>> different machine. Your approach sounds good, so I'm guessing we just
>> need a tweak.
>>
>> I would first double-check that the PCI adapters are identical,
>> including the firmware on the card. Can you also include your patch
>> and the resulting dmesg (with debug enabled as before)?
>
> Currently I am at home doing just tests for understanding and that I can
> hopefully use when I am back in the office.
>
> I already noticed that the backup FC Adapter on the test machine is not
> exactly the same: it is Rev. 1 whereas the one on the failing machine is
> Rev. 2.
>
> So, here at home my tests let a NIC disappear.
> Different from the
> original problem but I was just trying to reconstruct the scenario of a
> misconfigured bridge causing a reconfiguration.
>
> What I was trying is:
>
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index e6ae603..fd146b3 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -556,6 +556,7 @@ static int acpi_pci_root_add(struct acpi_device *device,
>         strcpy(acpi_device_name(device), ACPI_PCI_ROOT_DEVICE_NAME);
>         strcpy(acpi_device_class(device), ACPI_PCI_ROOT_CLASS);
>         device->driver_data = root;
> +       root->secondary.end = 0x02;
>
>         pr_info(PREFIX "%s [%s] (domain %04x %pR)\n",
>                 acpi_device_name(device), acpi_device_bid(device),
>
> The device that disappears is a NIC:
>
> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
> 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
> 00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
> 00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
> 00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
> 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
> 00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
> 00:1c.5 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 6 (rev c4)
> 00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a4)
> 00:1f.0 ISA bridge: Intel Corporation B75
> Express Chipset LPC Controller (rev 04)
> 00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
> 00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
> 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
>
> This is the one that is missing with the above change:
> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

This situation is a little different, so I don't think you're
reproducing the situation we want to test.

On this box, you have:

  pci_bus 0000:00: root bus resource [bus 00-02]
  pci 0000:00:1c.0: PCI bridge to [bus 01]
  pci 0000:00:1c.4: PCI bridge to [bus 02]

so we find all the devices on bus 00 and bus 02 (there's nothing on
bus 01). My guess is the 03:00.0 device is normally behind the
00:1c.5 bridge, but we don't even scan behind that bridge because we
can't allocate a secondary bus number for it (we're not smart enough
to take advantage of the empty bus 01).

On the failing box, it's different because we *do* have unused bus
number space, and we do actually reconfigure the bridge to use it.
It's just that the FC adapter doesn't respond when we use the new bus
number for it.

You might be able to do something similar on the test box by:

  - Keeping your root->secondary.end = 02 patch, so you still have
    [bus 00-02].

  - Ignoring bridges 00:1c.0 and 00:1c.4. I would just test for those
    devfns in pci_scan_device() and when you see them, return NULL
    instead of trying to read the vendor ID.

Then 00:1c.5 is probably configured by the BIOS for [bus 03], but
that's outside the root bridge range, so we should reconfigure it to
use [bus 01]. Then we should scan behind it, and we'll probably
discover the NIC that was previously at 03:00.0.
The device *should* just work at the new bus number, since it probably
doesn't have the same bug the FC adapter does.

Bjorn