On Thu, 13 Apr 2023 00:40:17 +0530 "Tarun Gupta (SW-GPU)" <targupta@xxxxxxxxxx> wrote: > On 4/7/2023 3:31 AM, Alex Williamson wrote: > > External email: Use caution opening links or attachments > > > > > > On Thu, 6 Apr 2023 16:50:49 -0500 > > Bjorn Helgaas<helgaas@xxxxxxxxxx> wrote: > > > >> [+cc Sajid, author of 3e347969a577] > >> > >> On Tue, Mar 28, 2023 at 04:59:30PM -0600, Alex Williamson wrote: > >>> Assignment of NVIDIA Ampere-based GPUs have seen a regression since the > >>> below referenced commit, where the reduced D3hot transition delay appears > >>> to introduce a small window where a D3hot->D0 transition followed by a bus > >>> reset can wedge the device. The entire device is subsequently unavailable, > >>> returning -1 on config space read and is unrecoverable without a host reset. > >>> > >>> This has been observed with RTX A2000 and A5000 GPU and audio functions > >>> assigned to a Windows VM, where shutdown of the VM places the devices in > >>> D3hot prior to vfio-pci performing a bus reset when userspace releases the > >>> devices. The issue has roughly a 2-3% chance of occurring per shutdown. > >>> > >>> Restoring the HDA controller d3hot_delay to the effective value before the > >>> below commit has been shown to resolve the issue. > >> Interesting. This sounds like it was a hassle to track down. I guess > >> we knew there was some risk in reducing those delays. > >> > >> Did you by chance notice whether the actual delay when the device gets > >> wedged is sufficient per spec? > >> > >> If there's a case where the usleep_range() doesn't quite wait the > >> spec-mandated time, we should adjust that in case we have the same > >> problem with other devices. > > That would have been a good test, unfortunately I didn't check and > > don't currently have access to the system anymore. Perhaps this is > > something the NVIDIA folks can check as they're investigating the scope > > of affected hardware. Thanks, > > > > Alex > > > >>> I'm looking for input from NVIDIA whether this issue is unique to > >>> Ampere-based HDA controllers or should be assumed to linger in both older > >>> and newer controllers as well. Currently we've not been able to reproduce > >>> the issue other than on Ampere HDA controllers, however the implementation > >>> here includes all NVIDIA HDA controllers based on PCI vendor and device > >>> class. > > Hi, I believe it is safe to include this patch for all HDA controllers and not > limit to only Ampere-based HDA controllers. This patch looks fine to me. > > Haven't explicitly tested whether usleep_range() performs as mandated, but > from our testing it doesn't seem to be the problem. On the same platform, > the problem reproduces on certain GPU configs but doesn't reproduce on other. Thanks, Tarun. Feel free to also send an explicit Acked-by: or Reviewed-by: if you'd like. Bjorn, I can resend this as a non-RFC if you prefer, the only change would be deleting "I'm looking for input..." through to the Cc:s. Thanks, Alex > >>> > >>> If we were to limit the quirk to Ampere HDA controllers, I think that would > >>> include: > >>> > >>> 1aef GA102 High Definition Audio Controller > >>> 228b GA104 High Definition Audio Controller > >>> 228e GA106 High Definition Audio Controller > >>> > >>> Cc: Abhishek Sahu<abhsahu@xxxxxxxxxx> > >>> Cc: Tarun Gupta<targupta@xxxxxxxxxx> > >>> Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()") > >>> Reported-by: Zhiyi Guo<zhguo@xxxxxxxxxx> > >>> Signed-off-by: Alex Williamson<alex.williamson@xxxxxxxxxx> > >>> --- > >>> drivers/pci/quirks.c | 13 +++++++++++++ > >>> 1 file changed, 13 insertions(+) > >>> > >>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > >>> index 44cab813bf95..f4e2a88729fd 100644 > >>> --- a/drivers/pci/quirks.c > >>> +++ b/drivers/pci/quirks.c > >>> @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev) > >>> } > >>> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm); > >>> > >>> +/* > >>> + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus > >>> + * reset is performed too soon after transition to D0, extend d3hot_delay > >>> + * to previous effective default for all NVIDIA HDA controllers. > >>> + */ > >>> +static void quirk_nvidia_hda_pm(struct pci_dev *dev) > >>> +{ > >>> + quirk_d3hot_delay(dev, 20); > >>> +} > >>> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > >>> + PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, > >>> + quirk_nvidia_hda_pm); > >>> + > >>> /* > >>> * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle. > >>> *https://bugzilla.kernel.org/show_bug.cgi?id=205587 > >>> > >>>