Re: [PATCH 1/1] PCI/bwctrl: Disable PCIe BW controller during reset

Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> · Mon, 3 Mar 2025 13:58:23 +0200 (EET)

On Thu, 27 Feb 2025, Lukas Wunner wrote:

> On Mon, Feb 24, 2025 at 05:13:15PM +0200, Ilpo Järvinen wrote:
> > On Tue, 18 Feb 2025, Lukas Wunner wrote:
> > > On Mon, Feb 17, 2025 at 06:52:58PM +0200, Ilpo Järvinen wrote:
> > > > PCIe BW controller enables BW notifications for Downstream Ports by
> > > > setting Link Bandwidth Management Interrupt Enable (LBMIE) and Link
> > > > Autonomous Bandwidth Interrupt Enable (LABIE) (PCIe Spec. r6.2 sec.
> > > > 7.5.3.7).
> > > > 
> > > > It was discovered that performing a reset can lead to the device
> > > > underneath the Downstream Port becoming unavailable if BW notifications
> > > > are left enabled throughout the reset sequence (at least LBMIE was
> > > > found to cause an issue).
> > > 
> > > What kind of reset?  FLR?  SBR?  This needs to be specified in the
> > > commit message so that the reader isn't forced to sift through a
> > > bugzilla with dozens of comments and attachments.
> > 
> > Heh, I never really tried to figure out it because the reset disable 
> > patch was just a stab into the dark style patch. To my surprise, it ended 
> > up working (after the initial confusion was resolved) and I just started 
> > to prepare this patch from that knowledge.
> 
> If the present patch is of the type "changing this somehow makes the
> problem go away" instead of a complete root-cause analysis, it would
> have been appropriate to mark it as an RFC.

I'll keep that mind in the future.

I don't think your depiction is entirely accurate of the situtation 
though. The reporter had confirmed that if bwctrl is change(s) are 
reverted, the problem is not observed.

So I set to understand why bwctrl has any impact here at all since it 
should touch only a different device (and even that in relatively limited 
ways). And that is how I found that if bwctrl is not enabled during reset, 
the problem is also not observed.

I also noted that the patch just works around the problems and there was 
also informal speculation about the suspected root cause in the patch (the 
only other theory I've a about the root cause relates to extra interrupts 
causing a problem through hp/pme interrupt handlers).

> I've started to dig into the bugzilla and the very first attachment
> (dmesg for the non-working case) shows:
> 
>   vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway
> 
> That message is emitted by pcie_flr().  Perhaps the Nvidia GPU takes
> more time than usual to finish pending transactions, so the first
> thing I would have tried would be to raise the timeout significantly
> and see if that helps.  Yet I'm not seeing any patch or comment in
> the bugzilla where this was attempted.  Please provide a patch for
> the reporter to verify this hypothesis.

I've problem in understanding how reverting bwctrl change does "solve" to 
this. Bwctrl is not even supposed to touch Nvidia GPU at all AFAIK.

> > Logs do mention this:
> > 
> > [   21.560206] pcieport 0000:00:01.1: unlocked secondary bus reset via: pciehp_reset_slot+0x98/0x140
> > 
> > ...so it seems to be SBR.
> 
> Looking at the vfio code, vfio_pci_core_enable() (which is called on
> binding the vfio driver to the GPU) invokes pci_try_reset_function().
> This will execute the reset method configured via sysfs.  The same
> is done on unbind via vfio_pci_core_disable().
> 
> So you should have asked the reporter for the contents of:
> /sys/bus/pci/devices/0000:01:00.0/reset_method
> /sys/bus/pci/devices/0000:01:00.1/reset_method
> 
> In particular, I would like to know whether the contents differ across
> different kernel versions.
> 
> There's another way to perform a reset:   Via an ioctl.  This ends up
> calling vfio_pci_dev_set_hot_reset(), which invokes pci_reset_bus()
> to perform an SBR.
> 
> Looking at dmesg output in log_linux_6.13.2-arch1-1_pcie_port_pm_off.log
> it seems that vfio first performs a function reset of the GPU on bind...
> 
> [   40.171564] vfio-pci 0000:01:00.0: resetting
> [   40.276485] vfio-pci 0000:01:00.0: reset done
> 
> ...and then goes on to perform an SBR both of the GPU and its audio
> device...
> 
> [   40.381082] vfio-pci 0000:01:00.0: resetting
> [   40.381180] vfio-pci 0000:01:00.1: resetting
> [   40.381228] pcieport 0000:00:01.1: unlocked secondary bus reset via: pciehp_reset_slot+0x98/0x140
> [   40.620442] vfio-pci 0000:01:00.0: reset done
> [   40.620479] vfio-pci 0000:01:00.1: reset done
> 
> ...which is odd because the audio device apparently wasn't bound to
> vfio-pci, otherwise there would have been a function reset.  So why
> does vfio think it can safely reset it?
> 
> Oddly, there is a third function reset of only the GPU:
> 
> [   40.621894] vfio-pci 0000:01:00.0: resetting
> [   40.724430] vfio-pci 0000:01:00.0: reset done
> 
> The reporter writes that pcie_port_pm=off avoids the PME messages.
> If the reset_method is "pm", I could imagine that the Nvidia GPU
> signals a PME event during the D0 -> D3hot -> D0 transition.
>
> I also note that the vfio-pci driver allows runtime PM.  So both the
> GPU and its audio device may runtime suspend to D3hot.  This in turn
> lets the Root Port runtime suspend to D3hot.  It looks like the
> reporter is using a laptop with an integrated AMD GPU and a
> discrete Nvidia GPU.  On such products the platform often allows
> powering down the discrete GPU and this is usually controlled
> through ACPI Power Resources attached to the Root Port.
> Those are powered off after the Root Port goes to D3hot.
> You should have asked the reporter for an acpidump.

A lots of these suggestions do not make much sense to me given that 
reverting bwctrl alone does not exhibit the problem. E.g., the reset 
method should be exactly the same.

I can see there could be some way through either hp or PME interrupt 
handlers, where an extra interrupt that comes due to bwctrl (LBMIE) being 
enabled triggers one of the such behaviors. But none of the above 
description theoritizes anything to that direction.

> pcie_bwnotif_irq() accesses the Link Status register without
> acquiring a runtime PM reference on the PCIe port.  This feels
> wrong and may also contribute to the issue reported here.
> Acquiring a runtime PM ref may sleep, so I think you need to
> change the driver to use a threaded IRQ handler.
> 
> Nvidia GPUs are known to hide the audio device if no audio-capable
> display is attached (e.g. HDMI).  quirk_nvidia_hda() unhides the
> audio device on boot and resume.  It might be necessary to also run
> the quirk after resetting the GPU.  Knowing which reset_method
> was used is important to decide if that's necessary, and also
> whether a display was attached.

...You seem to have a lot of ideas on this. ;-)

> Moreover Nvidia GPUs are known to change the link speed on idle
> to reduce power consumption.  Perhaps resetting the GPU causes
> a change of link speed and thus execution of pcie_bwnotif_irq()?

Why would execution of pcie_bwnotif_irq() be a problem, it's not that
complicated?

I'm more worried that having LBMIE enabled causes also the other interrupt 
handlers to execute due to the shared interrupt.

> > > This approach won't work if the reset is performed without software
> > > intervention.  E.g. if a DPC event occurs, the device likewise undergoes
> > > a reset but there is no prior system software involvement.  Software only
> > > becomes involved *after* the reset has occurred.
> > > 
> > > I think it needs to be tested if that same issue occurs with DPC.
> > > It's easy to simulate DPC by setting the Software Trigger bit:
> > > 
> > > setpci -s 00:01.1 ECAP_DPC+6.w=40:40
> > > 
> > > If the issue does occur with DPC then this fix isn't sufficient.
> > 
> > Looking into lspci logs, I don't see DPC capability being there for 
> > 00:01.1?!
> 
> Hm, so we can't verify whether your approach is safe for DPC.
> 
> 
> > > Instead of putting this in the PCI core, amend pcie_portdrv_err_handler
> > > with ->reset_prepare and ->reset_done callbacks which call down to all
> > > the port service drivers, then amend bwctrl.c to disable/enable
> > > interrupts in these callbacks.
> > 
> > Will it work? I mean if the port itself is not reset (0000:00:01.1 in this 
> > case), do these callbacks get called for it?
> 
> Never mind, indeed this won't work.
> 
> Thanks,
> 
> Lukas
> 

I'm sorry for the delay, I've been sick for a while.

-- 
 i.