On Thu, Jan 12, 2023 at 09:06:15PM -0600, Bjorn Helgaas wrote: > On Thu, Jan 12, 2023 at 8:39 PM Yang Su <yang.su@xxxxxxxxxxxxxxxxx> wrote: > > I also test pci_bridge_wait_for_secondary_bus in NVIDIA GPU T4 > > which bind vfio passthrough in VMM, I found the > > pci_bridge_wait_for_secondary_bus can not wait the enough time > > as pci spec requires, the reasons are described as below. [...] > I'll wait for you and Lukas to continue this conversation on the > mailing list. Yang Su reports that pci_bridge_secondary_bus_reset() is called for an Nvidia T4 GPU. That's an endpoint device but the function should only ever be called for bridges. It's unclear to me how that can happen. The call stack looks like this: vfio_pci_open() pci_set_power_state() pci_clear_master() pci_enable_device() pci_try_reset_function() pci_dev_trylock() pci_dev_save_and_disable() pci_dev_specific_reset() pcie_has_flr() pcie_af_flr() pci_read_config_word() pci_dev_reset_slot_function() pci_parent_bus_reset() pci_bridge_secondary_bus_reset() pcibios_reset_secondary_bus() pci_reset_secondary_bus() pci_read_config_word() pci_write_config_word() pci_write_config_word() pcie_wait_for_link_delay() pci_dev_wait() pci_dev_restore() pci_cfg_access_unlock() pci_save_state() pci_store_saved_state() vfio_config_init() Note that vfio_pci_open() was renamed to vfio_pci_open_device() with commit 2cd8b14aaa66, which went into v5.15. So apparently this call stack is from an earlier kernel. The GPU is located below a PEX9797 switch, which is connected to a SkyLake-E Root Port. So pci_bridge_secondary_bus_reset() should be called for the Switch Downstream Port but Yang Su says it's called for the GPU endpoint device. pci_parent_bus_reset() finds the parent by following dev->bus->self. I've suggested to Yang Su to replace that with pci_upstream_bridge(dev) and see if it fixes the issue. It does make a difference in SR-IOV scenarios (see the comment in include/linux/pci.h above pci_is_root_bus()) though Yang Su reports that SR-IOV is not used on this machine, only vfio passthrough. I believe that endpoint devices don't have a Bridge Control Register (Type 0 Config Space has Max_Lat / Min_Gnt instead), so setting the Secondary Bus Reset bit should have no effect. But apparently it does have an effect because Yang Su is witnessing issues with the delay after reset. Specifically, Yang Su says that pci_bridge_wait_for_secondary_bus() bails out because the GPU endpoint device fails the !pci_is_bridge() check at the top of the function. Also, the calculation of pci_bus_max_d3cold_delay() fails because the GPU lacks a subordinate bus. Apparently the unconditional 1 second delay in pci_reset_secondary_bus() papers over the issue because it waits long enough for the GPU endpoint device to come out of reset. Maybe the information that pci_bridge_secondary_bus_reset() is executed for an endpoint device is not correct and it's in fact executed for the Downstream Port of the PEX9797 switch, but perhaps config space of the port is broken and contains an incorrect Header Type field, which would cause pci_is_bridge() to return false for it. Though we wouldn't scan the secondary bus below the switch in that case, so the GPU wouldn't be enumerated. I've requested "lspci -vvv" output for the GPU and the Downstream Port off-list but haven't received it yet. Thanks, Lukas