Re: [PATCH 1/1] PCI/bwctrl: Disable PCIe BW controller during reset

Lukas Wunner <lukas@xxxxxxxxx> · Thu, 27 Feb 2025 06:31:08 +0100

On Mon, Feb 24, 2025 at 05:13:15PM +0200, Ilpo Järvinen wrote:
> On Tue, 18 Feb 2025, Lukas Wunner wrote:
> > On Mon, Feb 17, 2025 at 06:52:58PM +0200, Ilpo Järvinen wrote:
> > > PCIe BW controller enables BW notifications for Downstream Ports by
> > > setting Link Bandwidth Management Interrupt Enable (LBMIE) and Link
> > > Autonomous Bandwidth Interrupt Enable (LABIE) (PCIe Spec. r6.2 sec.
> > > 7.5.3.7).
> > > 
> > > It was discovered that performing a reset can lead to the device
> > > underneath the Downstream Port becoming unavailable if BW notifications
> > > are left enabled throughout the reset sequence (at least LBMIE was
> > > found to cause an issue).
> > 
> > What kind of reset?  FLR?  SBR?  This needs to be specified in the
> > commit message so that the reader isn't forced to sift through a
> > bugzilla with dozens of comments and attachments.
> 
> Heh, I never really tried to figure out it because the reset disable 
> patch was just a stab into the dark style patch. To my surprise, it ended 
> up working (after the initial confusion was resolved) and I just started 
> to prepare this patch from that knowledge.

If the present patch is of the type "changing this somehow makes the
problem go away" instead of a complete root-cause analysis, it would
have been appropriate to mark it as an RFC.

I've started to dig into the bugzilla and the very first attachment
(dmesg for the non-working case) shows:

  vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway

That message is emitted by pcie_flr().  Perhaps the Nvidia GPU takes
more time than usual to finish pending transactions, so the first
thing I would have tried would be to raise the timeout significantly
and see if that helps.  Yet I'm not seeing any patch or comment in
the bugzilla where this was attempted.  Please provide a patch for
the reporter to verify this hypothesis.

> Logs do mention this:
> 
> [   21.560206] pcieport 0000:00:01.1: unlocked secondary bus reset via: pciehp_reset_slot+0x98/0x140
> 
> ...so it seems to be SBR.

Looking at the vfio code, vfio_pci_core_enable() (which is called on
binding the vfio driver to the GPU) invokes pci_try_reset_function().
This will execute the reset method configured via sysfs.  The same
is done on unbind via vfio_pci_core_disable().

So you should have asked the reporter for the contents of:
/sys/bus/pci/devices/0000:01:00.0/reset_method
/sys/bus/pci/devices/0000:01:00.1/reset_method

In particular, I would like to know whether the contents differ across
different kernel versions.

There's another way to perform a reset:   Via an ioctl.  This ends up
calling vfio_pci_dev_set_hot_reset(), which invokes pci_reset_bus()
to perform an SBR.

Looking at dmesg output in log_linux_6.13.2-arch1-1_pcie_port_pm_off.log
it seems that vfio first performs a function reset of the GPU on bind...

[   40.171564] vfio-pci 0000:01:00.0: resetting
[   40.276485] vfio-pci 0000:01:00.0: reset done

...and then goes on to perform an SBR both of the GPU and its audio
device...

[   40.381082] vfio-pci 0000:01:00.0: resetting
[   40.381180] vfio-pci 0000:01:00.1: resetting
[   40.381228] pcieport 0000:00:01.1: unlocked secondary bus reset via: pciehp_reset_slot+0x98/0x140
[   40.620442] vfio-pci 0000:01:00.0: reset done
[   40.620479] vfio-pci 0000:01:00.1: reset done

...which is odd because the audio device apparently wasn't bound to
vfio-pci, otherwise there would have been a function reset.  So why
does vfio think it can safely reset it?

Oddly, there is a third function reset of only the GPU:

[   40.621894] vfio-pci 0000:01:00.0: resetting
[   40.724430] vfio-pci 0000:01:00.0: reset done

The reporter writes that pcie_port_pm=off avoids the PME messages.
If the reset_method is "pm", I could imagine that the Nvidia GPU
signals a PME event during the D0 -> D3hot -> D0 transition.

I also note that the vfio-pci driver allows runtime PM.  So both the
GPU and its audio device may runtime suspend to D3hot.  This in turn
lets the Root Port runtime suspend to D3hot.  It looks like the
reporter is using a laptop with an integrated AMD GPU and a
discrete Nvidia GPU.  On such products the platform often allows
powering down the discrete GPU and this is usually controlled
through ACPI Power Resources attached to the Root Port.
Those are powered off after the Root Port goes to D3hot.
You should have asked the reporter for an acpidump.

pcie_bwnotif_irq() accesses the Link Status register without
acquiring a runtime PM reference on the PCIe port.  This feels
wrong and may also contribute to the issue reported here.
Acquiring a runtime PM ref may sleep, so I think you need to
change the driver to use a threaded IRQ handler.

Nvidia GPUs are known to hide the audio device if no audio-capable
display is attached (e.g. HDMI).  quirk_nvidia_hda() unhides the
audio device on boot and resume.  It might be necessary to also run
the quirk after resetting the GPU.  Knowing which reset_method
was used is important to decide if that's necessary, and also
whether a display was attached.

Moreover Nvidia GPUs are known to change the link speed on idle
to reduce power consumption.  Perhaps resetting the GPU causes
a change of link speed and thus execution of pcie_bwnotif_irq()?

> > This approach won't work if the reset is performed without software
> > intervention.  E.g. if a DPC event occurs, the device likewise undergoes
> > a reset but there is no prior system software involvement.  Software only
> > becomes involved *after* the reset has occurred.
> > 
> > I think it needs to be tested if that same issue occurs with DPC.
> > It's easy to simulate DPC by setting the Software Trigger bit:
> > 
> > setpci -s 00:01.1 ECAP_DPC+6.w=40:40
> > 
> > If the issue does occur with DPC then this fix isn't sufficient.
> 
> Looking into lspci logs, I don't see DPC capability being there for 
> 00:01.1?!

Hm, so we can't verify whether your approach is safe for DPC.

> > Instead of putting this in the PCI core, amend pcie_portdrv_err_handler
> > with ->reset_prepare and ->reset_done callbacks which call down to all
> > the port service drivers, then amend bwctrl.c to disable/enable
> > interrupts in these callbacks.
> 
> Will it work? I mean if the port itself is not reset (0000:00:01.1 in this 
> case), do these callbacks get called for it?

Never mind, indeed this won't work.

Thanks,

Lukas