On Mon, Feb 24, 2025 at 05:13:15PM +0200, Ilpo Järvinen wrote: > On Tue, 18 Feb 2025, Lukas Wunner wrote: > > On Mon, Feb 17, 2025 at 06:52:58PM +0200, Ilpo Järvinen wrote: > > > PCIe BW controller enables BW notifications for Downstream Ports by > > > setting Link Bandwidth Management Interrupt Enable (LBMIE) and Link > > > Autonomous Bandwidth Interrupt Enable (LABIE) (PCIe Spec. r6.2 sec. > > > 7.5.3.7). > > > > > > It was discovered that performing a reset can lead to the device > > > underneath the Downstream Port becoming unavailable if BW notifications > > > are left enabled throughout the reset sequence (at least LBMIE was > > > found to cause an issue). > > > > What kind of reset? FLR? SBR? This needs to be specified in the > > commit message so that the reader isn't forced to sift through a > > bugzilla with dozens of comments and attachments. > > Heh, I never really tried to figure out it because the reset disable > patch was just a stab into the dark style patch. To my surprise, it ended > up working (after the initial confusion was resolved) and I just started > to prepare this patch from that knowledge. If the present patch is of the type "changing this somehow makes the problem go away" instead of a complete root-cause analysis, it would have been appropriate to mark it as an RFC. I've started to dig into the bugzilla and the very first attachment (dmesg for the non-working case) shows: vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway That message is emitted by pcie_flr(). Perhaps the Nvidia GPU takes more time than usual to finish pending transactions, so the first thing I would have tried would be to raise the timeout significantly and see if that helps. Yet I'm not seeing any patch or comment in the bugzilla where this was attempted. Please provide a patch for the reporter to verify this hypothesis. > Logs do mention this: > > [ 21.560206] pcieport 0000:00:01.1: unlocked secondary bus reset via: pciehp_reset_slot+0x98/0x140 > > ...so it seems to be SBR. Looking at the vfio code, vfio_pci_core_enable() (which is called on binding the vfio driver to the GPU) invokes pci_try_reset_function(). This will execute the reset method configured via sysfs. The same is done on unbind via vfio_pci_core_disable(). So you should have asked the reporter for the contents of: /sys/bus/pci/devices/0000:01:00.0/reset_method /sys/bus/pci/devices/0000:01:00.1/reset_method In particular, I would like to know whether the contents differ across different kernel versions. There's another way to perform a reset: Via an ioctl. This ends up calling vfio_pci_dev_set_hot_reset(), which invokes pci_reset_bus() to perform an SBR. Looking at dmesg output in log_linux_6.13.2-arch1-1_pcie_port_pm_off.log it seems that vfio first performs a function reset of the GPU on bind... [ 40.171564] vfio-pci 0000:01:00.0: resetting [ 40.276485] vfio-pci 0000:01:00.0: reset done ...and then goes on to perform an SBR both of the GPU and its audio device... [ 40.381082] vfio-pci 0000:01:00.0: resetting [ 40.381180] vfio-pci 0000:01:00.1: resetting [ 40.381228] pcieport 0000:00:01.1: unlocked secondary bus reset via: pciehp_reset_slot+0x98/0x140 [ 40.620442] vfio-pci 0000:01:00.0: reset done [ 40.620479] vfio-pci 0000:01:00.1: reset done ...which is odd because the audio device apparently wasn't bound to vfio-pci, otherwise there would have been a function reset. So why does vfio think it can safely reset it? Oddly, there is a third function reset of only the GPU: [ 40.621894] vfio-pci 0000:01:00.0: resetting [ 40.724430] vfio-pci 0000:01:00.0: reset done The reporter writes that pcie_port_pm=off avoids the PME messages. If the reset_method is "pm", I could imagine that the Nvidia GPU signals a PME event during the D0 -> D3hot -> D0 transition. I also note that the vfio-pci driver allows runtime PM. So both the GPU and its audio device may runtime suspend to D3hot. This in turn lets the Root Port runtime suspend to D3hot. It looks like the reporter is using a laptop with an integrated AMD GPU and a discrete Nvidia GPU. On such products the platform often allows powering down the discrete GPU and this is usually controlled through ACPI Power Resources attached to the Root Port. Those are powered off after the Root Port goes to D3hot. You should have asked the reporter for an acpidump. pcie_bwnotif_irq() accesses the Link Status register without acquiring a runtime PM reference on the PCIe port. This feels wrong and may also contribute to the issue reported here. Acquiring a runtime PM ref may sleep, so I think you need to change the driver to use a threaded IRQ handler. Nvidia GPUs are known to hide the audio device if no audio-capable display is attached (e.g. HDMI). quirk_nvidia_hda() unhides the audio device on boot and resume. It might be necessary to also run the quirk after resetting the GPU. Knowing which reset_method was used is important to decide if that's necessary, and also whether a display was attached. Moreover Nvidia GPUs are known to change the link speed on idle to reduce power consumption. Perhaps resetting the GPU causes a change of link speed and thus execution of pcie_bwnotif_irq()? > > This approach won't work if the reset is performed without software > > intervention. E.g. if a DPC event occurs, the device likewise undergoes > > a reset but there is no prior system software involvement. Software only > > becomes involved *after* the reset has occurred. > > > > I think it needs to be tested if that same issue occurs with DPC. > > It's easy to simulate DPC by setting the Software Trigger bit: > > > > setpci -s 00:01.1 ECAP_DPC+6.w=40:40 > > > > If the issue does occur with DPC then this fix isn't sufficient. > > Looking into lspci logs, I don't see DPC capability being there for > 00:01.1?! Hm, so we can't verify whether your approach is safe for DPC. > > Instead of putting this in the PCI core, amend pcie_portdrv_err_handler > > with ->reset_prepare and ->reset_done callbacks which call down to all > > the port service drivers, then amend bwctrl.c to disable/enable > > interrupts in these callbacks. > > Will it work? I mean if the port itself is not reset (0000:00:01.1 in this > case), do these callbacks get called for it? Never mind, indeed this won't work. Thanks, Lukas