[+cc Mika, Sathy, Lukas since they've been looking at similar delays]

On Thu, Apr 13, 2023 at 01:40:42PM -0600, Alex Williamson wrote:
> Assignment of NVIDIA Ampere-based GPUs have seen a regression since the
> below referenced commit, where the reduced D3hot transition delay appears
> to introduce a small window where a D3hot->D0 transition followed by a bus
> reset can wedge the device. The entire device is subsequently unavailable,
> returning -1 on config space read and is unrecoverable without a host reset.
>
> This has been observed with RTX A2000 and A5000 GPU and audio functions
> assigned to a Windows VM, where shutdown of the VM places the devices in
> D3hot prior to vfio-pci performing a bus reset when userspace releases the
> devices. The issue has roughly a 2-3% chance of occurring per shutdown.
>
> Restoring the HDA controller d3hot_delay to the effective value before the
> below commit has been shown to resolve the issue. NVIDIA confirms this
> change should be safe for all of their HDA controllers.
>
> Cc: Abhishek Sahu <abhsahu@xxxxxxxxxx>
> Cc: Tarun Gupta <targupta@xxxxxxxxxx>
> Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
> Reported-by: Zhiyi Guo <zhguo@xxxxxxxxxx>
> Reviewed-by: Tarun Gupta <targupta@xxxxxxxxxx>
> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>

Applied to pci/reset for v6.4, thanks, Alex!

I guess there's no real risk here since we're waiting *longer*. It
only makes NVIDIA GPU resets take longer.

Mika has some patches in flight that increase delays generically in
some cases, but I think that applies to D3cold -> D0 transitions,
which I don't *think* you're doing here.

> ---
>
> Unfortunately Tarun's reply with confirmation doesn't show up on lore,
> possibly due to html email, or else I'd provide that as a Link:.
>
>  drivers/pci/quirks.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..f4e2a88729fd 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
>
> +/*
> + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
> + * reset is performed too soon after transition to D0, extend d3hot_delay
> + * to previous effective default for all NVIDIA HDA controllers.
> + */
> +static void quirk_nvidia_hda_pm(struct pci_dev *dev)
> +{
> +	quirk_d3hot_delay(dev, 20);
> +}
> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> +			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
> +			      quirk_nvidia_hda_pm);
> +
>  /*
>   * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
>   * https://bugzilla.kernel.org/show_bug.cgi?id=205587
> --
> 2.39.2
>
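
For anyone following along, the "we're only ever waiting longer" point
can be sketched in userspace. This is only a rough model, not the kernel
code: struct fake_pci_dev, model_d3hot_delay() and model_nvidia_hda_pm()
are hypothetical names made up for illustration, and the sketch assumes
the quirk helper raises d3hot_delay to the requested value and never
lowers it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for struct pci_dev; only the field we need. */
struct fake_pci_dev {
	const char *name;
	unsigned int d3hot_delay;	/* msec to wait after D3hot -> D0 */
};

/*
 * Assumed behavior of the delay helper: extend the delay to at least
 * 'delay' msec, and leave a device that already waits longer untouched.
 */
static void model_d3hot_delay(struct fake_pci_dev *dev, unsigned int delay)
{
	assert(dev != NULL);

	if (dev->d3hot_delay >= delay)
		return;

	dev->d3hot_delay = delay;
	printf("%s: extending D3hot -> D0 delay to %u msec\n",
	       dev->name, delay);
}

/* Mirrors the shape of the quirk in the patch: request 20 msec. */
static void model_nvidia_hda_pm(struct fake_pci_dev *dev)
{
	model_d3hot_delay(dev, 20);
}
```

Under that assumption, a device starting at 10 msec (assumed here as the
default) ends up at 20 msec, while a device already configured for a
longer wait is left alone, so the quirk can't shorten anybody's delay.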