On Wed, 27 May 2020 at 22:42, Deucher, Alexander <Alexander.Deucher@xxxxxxx> wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> > -----Original Message-----
> > From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
> > Sent: Wednesday, May 27, 2020 5:32 PM
> > To: Kevin Buettner <kevinb@xxxxxxxxxx>
> > Cc: linux-pci@xxxxxxxxxxxxxxx; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; Alex
> > Williamson <alex.williamson@xxxxxxxxxx>; Deucher, Alexander
> > <Alexander.Deucher@xxxxxxx>; Koenig, Christian
> > <Christian.Koenig@xxxxxxx>
> > Subject: Re: [PATCH] PCI: Avoid FLR for AMD Starship USB 3.0
> >
> > [+cc Alex D, Christian -- do you guys have any contacts or insight into why we
> > suddenly have three new AMD devices that advertise FLR support but it
> > doesn't work? Are we doing something wrong in Linux, or are these devices
> > defective?
>
> +Nehal who handles our USB drivers.
>
> Nehal any ideas about FLR or whether it should be advertised?
>
> Alex
>

I had read somewhere that the IO dies in the Ryzen/Threadripper packages are identical to the ones used in the motherboard chipsets. Since the latter do reset OK, it would seem a BIOS update of the AGESA might fix the issue. Unfortunately, it's not something motherboard manufacturers' customer support people know how to deal with or pass back up the chain to AMD engineers. Actual use of this feature seems to be fairly niche.

After I added the workaround for the USB and audio controllers on the 3rd-gen Ryzen, I tried contacting Kim Phillips (who I found as a kernel committer to x86/cpu/amd), but haven't heard back.

It would be wonderful to know if this can be fixed in CPU firmware, and whether there's any likelihood of it actually being distributed by motherboard manufacturers.

Marcos

> >
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
> > kernel.org%2Fr%2F20200524003529.598434ff%40f31-
> > 4.lan&data=02%7C01%7Calexander.deucher%40amd.com%7Ccb77b56b
> > 62ae47f60f8808d802855759%7C3dd8961fe4884e608e11a82d994e183d%7C0%
> > 7C0%7C637262119015438912&sdata=3z%2Btn%2Bv2pvUl3X0Tzk%2BLoi
> > Mk06dLZCmgUOrsGf3kLpY%3D&reserved=0
> >   AMD Starship USB 3.0 host controller
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
> > kernel.org%2Fr%2FCAAri2DpkcuQZYbT6XsALhx2e6vRqPHwtbjHYeiH7MNp4z
> > mt1RA%40mail.gmail.com&data=02%7C01%7Calexander.deucher%40a
> > md.com%7Ccb77b56b62ae47f60f8808d802855759%7C3dd8961fe4884e608e11
> > a82d994e183d%7C0%7C0%7C637262119015438912&sdata=69GsHB0HCp
> > 6x0xW0tA%2FrAln0Vy0Yc9I8QSHowebdIxI%3D&reserved=0
> >   AMD Matisse HD Audio & USB 3.0 host controller ]
> >
> > On Sun, May 24, 2020 at 12:35:29AM -0700, Kevin Buettner wrote:
> > > This commit adds an entry to the quirk_no_flr table for the AMD
> > > Starship USB 3.0 host controller.
> > >
> > > Tested on a Micro-Star International Co., Ltd. MS-7C59/Creator TRX40
> > > motherboard with an AMD Ryzen Threadripper 3970X.
> > >
> > > Without this patch, when attempting to assign (pass through) an AMD
> > > Starship USB 3.0 host controller to a guest OS, the system becomes
> > > increasingly unresponsive over the course of several minutes,
> > > eventually requiring a hard reset.
> > >
> > > Shortly after attempting to start the guest, I see these messages:
> > >
> > > May 23 22:59:46 mesquite kernel: vfio-pci 0000:05:00.3: not ready 1023ms after FLR; waiting
> > > May 23 22:59:48 mesquite kernel: vfio-pci 0000:05:00.3: not ready 2047ms after FLR; waiting
> > > May 23 22:59:51 mesquite kernel: vfio-pci 0000:05:00.3: not ready 4095ms after FLR; waiting
> > > May 23 22:59:56 mesquite kernel: vfio-pci 0000:05:00.3: not ready 8191ms after FLR; waiting
> > >
> > > And then eventually:
> > >
> > > May 23 23:01:00 mesquite kernel: vfio-pci 0000:05:00.3: not ready 65535ms after FLR; giving up
> > > May 23 23:01:05 mesquite kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 0.000 msecs
> > > May 23 23:01:06 mesquite kernel: perf: interrupt took too long (642744 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
> > > May 23 23:01:07 mesquite kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 82.270 msecs
> > > May 23 23:01:08 mesquite kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 680.608 msecs
> > > May 23 23:01:08 mesquite kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 100.952 msecs
> > > ...
> > > kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [qemu-system-x86:7487]
> > > May 23 23:01:25 mesquite kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [qemu-system-x86:7487]
> > >
> > > The above log snippets were obtained using the aforementioned hardware
> > > running Fedora 32 w/ kernel package kernel-5.6.13-300.fc32.x86_64. My
> > > fix was applied to a local copy of the F32 kernel package, then
> > > rebuilt, etc.
> > >
> > > With this patch in place, the host kernel doesn't exhibit these
> > > problems. The guest OS (also Fedora 32) starts up and works as
> > > expected with the passed-through USB host controller.
> > >
> > > Signed-off-by: Kevin Buettner <kevinb@xxxxxxxxxx>
> >
> > Applied to pci/virtualization for v5.8, thanks!
> >
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 43a0c2ce635e..b1db58d00d2b 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -5133,6 +5133,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);
> > >   * FLR may cause the following to devices to hang:
> > >   *
> > >   * AMD Starship/Matisse HD Audio Controller 0x1487
> > > + * AMD Starship USB 3.0 Host Controller 0x148c
> > >   * AMD Matisse USB 3.0 Host Controller 0x149c
> > >   * Intel 82579LM Gigabit Ethernet Controller 0x1502
> > >   * Intel 82579V Gigabit Ethernet Controller 0x1503
> > > @@ -5143,6 +5144,7 @@ static void quirk_no_flr(struct pci_dev *dev)
> > >  	dev->dev_flags |= PCI_DEV_FLAGS_NO_FLR_RESET;
> > >  }
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x148c, quirk_no_flr);
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
> > >
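
For anyone reading the quoted diff without the kernel tree at hand, a rough userspace model of what the quirk appears to do may help. This is only a sketch under stated assumptions, not kernel code: `struct fake_pci_dev`, `FAKE_DEV_FLAGS_NO_FLR_RESET`, `apply_early_fixups()` and `flr_allowed()` are made-up names for illustration. The device IDs are the ones listed in the patch comment; the vendor IDs 0x1022 (AMD) and 0x8086 (Intel) are the standard PCI vendor IDs and do not appear in the thread itself. The idea is that an early fixup sets a per-device flag for listed IDs, and the reset path later consults that flag before attempting FLR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's PCI_DEV_FLAGS_NO_FLR_RESET bit. */
#define FAKE_DEV_FLAGS_NO_FLR_RESET (1u << 0)

/* Hypothetical, heavily reduced model of a PCI device. */
struct fake_pci_dev {
	uint16_t vendor;
	uint16_t device;
	unsigned int dev_flags;
};

struct no_flr_id {
	uint16_t vendor;
	uint16_t device;
};

/* Device IDs as listed in the quoted patch comment. */
static const struct no_flr_id no_flr_ids[] = {
	{ 0x1022, 0x1487 }, /* AMD Starship/Matisse HD Audio Controller */
	{ 0x1022, 0x148c }, /* AMD Starship USB 3.0 Host Controller */
	{ 0x1022, 0x149c }, /* AMD Matisse USB 3.0 Host Controller */
	{ 0x8086, 0x1502 }, /* Intel 82579LM Gigabit Ethernet Controller */
	{ 0x8086, 0x1503 }, /* Intel 82579V Gigabit Ethernet Controller */
};

/* Models the early fixup pass: flag every device whose ID is in the table. */
void apply_early_fixups(struct fake_pci_dev *dev)
{
	assert(dev != NULL);
	for (size_t i = 0; i < sizeof(no_flr_ids) / sizeof(no_flr_ids[0]); i++) {
		if (no_flr_ids[i].vendor == dev->vendor &&
		    no_flr_ids[i].device == dev->device)
			dev->dev_flags |= FAKE_DEV_FLAGS_NO_FLR_RESET;
	}
}

/* Models the check a reset path would make before issuing FLR. */
bool flr_allowed(const struct fake_pci_dev *dev)
{
	assert(dev != NULL);
	return (dev->dev_flags & FAKE_DEV_FLAGS_NO_FLR_RESET) == 0;
}
```

In this model, a device with ID 1022:148c ends up flagged after `apply_early_fixups()`, so `flr_allowed()` returns false and a caller would have to use some other reset method or skip the reset; an unlisted device is left untouched.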