On Thu, 6 Apr 2023 16:50:49 -0500 Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > [+cc Sajid, author of 3e347969a577] > > On Tue, Mar 28, 2023 at 04:59:30PM -0600, Alex Williamson wrote: > > Assignment of NVIDIA Ampere-based GPUs have seen a regression since the > > below referenced commit, where the reduced D3hot transition delay appears > > to introduce a small window where a D3hot->D0 transition followed by a bus > > reset can wedge the device. The entire device is subsequently unavailable, > > returning -1 on config space read and is unrecoverable without a host reset. > > > > This has been observed with RTX A2000 and A5000 GPU and audio functions > > assigned to a Windows VM, where shutdown of the VM places the devices in > > D3hot prior to vfio-pci performing a bus reset when userspace releases the > > devices. The issue has roughly a 2-3% chance of occurring per shutdown. > > > > Restoring the HDA controller d3hot_delay to the effective value before the > > below commit has been shown to resolve the issue. > > Interesting. This sounds like it was a hassle to track down. I guess > we knew there was some risk in reducing those delays. > > Did you by chance notice whether the actual delay when the device gets > wedged is sufficient per spec? > > If there's a case where the usleep_range() doesn't quite wait the > spec-mandated time, we should adjust that in case we have the same > problem with other devices. That would have been a good test, unfortunately I didn't check and don't currently have access to the system anymore. Perhaps this is something the NVIDIA folks can check as they're investigating the scope of affected hardware. Thanks, Alex > > I'm looking for input from NVIDIA whether this issue is unique to > > Ampere-based HDA controllers or should be assumed to linger in both older > > and newer controllers as well. Currently we've not been able to reproduce > > the issue other than on Ampere HDA controllers, however the implementation > > here includes all NVIDIA HDA controllers based on PCI vendor and device > > class. > > > > If we were to limit the quirk to Ampere HDA controllers, I think that would > > include: > > > > 1aef GA102 High Definition Audio Controller > > 228b GA104 High Definition Audio Controller > > 228e GA106 High Definition Audio Controller > > > > Cc: Abhishek Sahu <abhsahu@xxxxxxxxxx> > > Cc: Tarun Gupta <targupta@xxxxxxxxxx> > > Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()") > > Reported-by: Zhiyi Guo <zhguo@xxxxxxxxxx> > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > --- > > drivers/pci/quirks.c | 13 +++++++++++++ > > 1 file changed, 13 insertions(+) > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > index 44cab813bf95..f4e2a88729fd 100644 > > --- a/drivers/pci/quirks.c > > +++ b/drivers/pci/quirks.c > > @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev) > > } > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm); > > > > +/* > > + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus > > + * reset is performed too soon after transition to D0, extend d3hot_delay > > + * to previous effective default for all NVIDIA HDA controllers. > > + */ > > +static void quirk_nvidia_hda_pm(struct pci_dev *dev) > > +{ > > + quirk_d3hot_delay(dev, 20); > > +} > > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > > + PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, > > + quirk_nvidia_hda_pm); > > + > > /* > > * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle. > > * https://bugzilla.kernel.org/show_bug.cgi?id=205587 > > > > >