Hi Harry,

On Tue, Jan 03, 2017 at 03:35:13PM +0000, Harry Mallon wrote:
> Hi,
>
> In this email I am asking for advice on how to diagnose problems in assigning memory for PCI-E devices and bridges. I can reproduce my issue on the current mainline kernel but I am not currently planning to target any fixes for the mainline kernel (unless they prove to be useful outside of this machine). I am planning to target the CentOS 3.10.0 based kernel. I understand that I am not owed any help/patches etc from anyone here, especially not as I am using a non mainline kernel.
>
> I am working on a machine with an odd PCI structure, it has 4 different PLX bridges and requires hotplug to work on at least 2 of these. We were previously using a kernel based on 3.3 and had to add a few (hacky, machine specific) patches to that to make it work correctly. We also use "pci=realloc,pcie_bus_safe" in the kernel cmdline. I am currently using that kernel as a reference to compare to my development version.
>
> On my current setup some devices don't work, using "lspci -vvv" I can see that they are not all receiving the memory allocations that they need. They report like one of the two following (it changes depending on another PCI-E card being in or out):
>
> 05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN X] (rev a1) (prog-if 00 [VGA controller])
> 	Subsystem: NVIDIA Corporation Device 119a
> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Interrupt: pin A routed to IRQ 11
> 	Region 0: Memory at <ignored> (32-bit, non-prefetchable) [disabled]
> 	Region 1: Memory at e0000000 (64-bit, prefetchable) [disabled] [size=256M]
> 	Region 3: Memory at f0000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Region 5: I/O ports at 4000 [disabled] [size=128]
> 	Expansion ROM at <ignored> [disabled]
>
> 05:00.0 VGA compatible controller: NVIDIA Corporation GK110B [GeForce GTX TITAN Black] (rev ff) (prog-if ff)
> 	!!! Unknown header type 7f
>
> What tools and techniques can anyone recommend for diagnosing this type of problem? Is there a way to export all the bridge memory ranges in a way that can be visualised (maybe the newer kernel cannot allocate enough aligned space)? Is there a way to enable extra PCI debug in the kernel? Is there a way to make the kernel panic and report when it fails to assign memory on boot instead of continuing with non-functional hardware?

A complete dmesg log from a current kernel and complete "lspci -vv" output is the best place to start.

The goal is that all your devices should work without requiring any machine-specific patches or command-line parameters. Our resource allocation code is not really very robust, so unusual topologies don't always work out of the box.

You can open a bug report at https://bugzilla.kernel.org in the drivers/PCI area, attach the dmesg and lspci output, and respond with the URL here.

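
As a quick first pass over saved "lspci -vv" output, a small Python sketch along the following lines could list the devices whose regions were not assigned. It is only an illustration, not an existing tool: the function name find_unassigned is made up here, and it assumes lspci's usual text layout (unindented device header lines, indented detail lines) and looks only for the markers "<ignored>", "<unassigned>" and "Unknown header type".

```python
import re

# A device header line is not indented and starts with [domain:]bus:dev.fn.
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")

# Markers that suggest a region got no address or the device is unreadable.
PROBLEM_RE = re.compile(r"<ignored>|<unassigned>|Unknown header type")


def find_unassigned(lspci_text):
    """Map each device address to the detail lines that look problematic.

    Hypothetical helper: it only scans text, it does not touch hardware.
    """
    problems = {}
    current = None
    for raw in lspci_text.splitlines():
        header = DEVICE_RE.match(raw)
        if header:
            # Start collecting lines for a new device.
            current = header.group(1)
            continue
        if current is not None and PROBLEM_RE.search(raw):
            problems.setdefault(current, []).append(raw.strip())
    return problems
```

Calling find_unassigned() on the text of a saved lspci dump returns a dictionary keyed by device address, which can make it easier to see which devices to look for in the dmesg log.
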
Bjorn