Hi Alex, I believe in the patch file, this + (pdev->subsystem_device == 0x0c19 || + pdev->subsystem_device == 0x0c10)) Has to be changed to: + (pdev->subsystem_device == 0xce19 || + pdev->subsystem_device == 0xcc10)) Because our SSIDs are "ea50:ce19" and "ea50:cc10" respectively and another one would "ea50:cc08". I will apply that patch and feedback the results soon plus the patch file that I actually had applied. -----Original Message----- From: Deucher, Alexander <Alexander.Deucher@xxxxxxx> Sent: Montag, 30. November 2020 19:36 To: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@xxxxxxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx> Cc: Will Deacon <will@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; Zhu, Changfeng <Changfeng.Zhu@xxxxxxx> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken [AMD Public Use] > -----Original Message----- > From: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@xxxxxxxxxxx> > Sent: Thursday, November 26, 2020 4:24 AM > To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray > <Ray.Huang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx> > Cc: Will Deacon <will@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; > linux- pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn > Helgaas <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; Zhu, > Changfeng <Changfeng.Zhu@xxxxxxx> > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as > broken > > Alex, > > This is pretty much the same patch as what I have received from Joerg > previously, except that it is tied to the particular Emerson platform > and its derivatives (listed with Subsystem IDs). Right. As per my original point, I don't want to disable ATS on all Picasso chips because doing so would break GPU compute on them, so I'd like to apply this quirk as narrowly as possible. > > Below patch was what Joerg provided me and I successfully tested. > > This diff to the kernel should do that: > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index > f70692ac79c5..3911b0ec57ba 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -5176,6 +5176,8 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, > 0x6900, quirk_amd_harvest_no_ats); > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, > quirk_amd_harvest_no_ats); > /* AMD Navi14 dGPU */ > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, > quirk_amd_harvest_no_ats); > +/* AMD Raven platform iGPU */ > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, > +quirk_amd_harvest_no_ats); > #endif /* CONFIG_PCI_ATS */ > > /* Freescale PCIe doesn't support MSI in RC mode */ > > So far I have seen this issue on two instances of this chip, but I > must admit that I did test only two of them to this extent, so I guess > it is not a bad chip in particular, but the chips we use are from the > same production lot, so it might be a systematical problem of that production lot? > > UEFI-Setup shows: > Processor Family: 17h > Procossor Model: 20h - 2Fh > CPUID: 00820F01 > Microcode Patch Level: 8200103 > > Looking at the chip-die I found that this is a fully qualified IP > Silicon (according to Ryzen Embedded R1000 SOC Interlock). > YE1305C9T20FG > BI2015SUY > 9JB6496P00123 > 2016 AMD > DIFFUSED IN USA > MADE IN CHINA > > Currently used SBIOS is a branch from "EmbeddedPI-FP5 1.2.0.3RC3". > > In the future our SBIOS might merge with EmbeddedPI-FP5_1.2.0.5RC3. > I think it's more likely an sbios issue, so hopefully the new release fixes it. Alex > > > > -----Original Message----- > From: Deucher, Alexander <Alexander.Deucher@xxxxxxx> > Sent: Mittwoch, 25. November 2020 17:08 > To: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@xxxxxxxxxxx>; > Huang, Ray <Ray.Huang@xxxxxxx>; Kuehling, Felix > <Felix.Kuehling@xxxxxxx> > Cc: Will Deacon <will@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; > linux- pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn > Helgaas <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; Zhu, > Changfeng <Changfeng.Zhu@xxxxxxx> > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as > broken > > [AMD Public Use] > > > -----Original Message----- > > From: Merger, Edgar [AUTOSOL/MAS/AUGS] > <Edgar.Merger@xxxxxxxxxxx> > > Sent: Wednesday, November 25, 2020 5:04 AM > > To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray > > <Ray.Huang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx> > > Cc: Will Deacon <will@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; > > linux- pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn > > Helgaas <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; Zhu, > > Changfeng <Changfeng.Zhu@xxxxxxx> > > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as > > broken > > > > I do have also other problems with this unit, when IOMMU is enabled > > and pci=noats is not set as kernel parameter. > > > > [ 2004.265906] amdgpu 0000:0b:00.0: [drm:amdgpu_ib_ring_tests > > [amdgpu]] > > *ERROR* IB test failed on gfx (-110). > > [ 2004.266024] [drm:amdgpu_device_delayed_init_work_handler > [amdgpu]] > > *ERROR* ib ring test failed (-110). > > > > Is this seen on all instances of this chip or only specific silicon? > I.e., could this be a bad chip? Would it be possible to test a newer > sbios? I think the attached patch should work if we can't get it > fixed on the platform side. It should only enable the quirk on your particular platform. > > Alex > > > > -----Original Message----- > > From: Merger, Edgar [AUTOSOL/MAS/AUGS] > > Sent: Mittwoch, 25. November 2020 10:16 > > To: 'Deucher, Alexander' <Alexander.Deucher@xxxxxxx>; 'Huang, Ray' > > <Ray.Huang@xxxxxxx>; 'Kuehling, Felix' <Felix.Kuehling@xxxxxxx> > > Cc: 'Will Deacon' <will@xxxxxxxxxx>; 'linux-kernel@xxxxxxxxxxxxxxx' > > <linux- kernel@xxxxxxxxxxxxxxx>; 'linux-pci@xxxxxxxxxxxxxxx' <linux- > > pci@xxxxxxxxxxxxxxx>; 'iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx' > > <iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx>; 'Bjorn Helgaas' > > <bhelgaas@xxxxxxxxxx>; 'Joerg Roedel' <jroedel@xxxxxxx>; 'Zhu, > > Changfeng' <Changfeng.Zhu@xxxxxxx> > > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as > > broken > > > > Remark: > > > > Systems with R1305G APU (which show the issue) have the following > > VGA- > > Controller: > > 0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. > > [AMD/ATI] Picasso (rev cf) > > > > Systems with V1404I APU (which do not show the issue) have the > > following > > VGA-Controller: > > 0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. > > [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile > > Series] (rev 83) > > > > "rev cf" vs. "ref 83" is probably what you where referring to with > > PCI Revision ID. > > > > -----Original Message----- > > From: Merger, Edgar [AUTOSOL/MAS/AUGS] > > Sent: Mittwoch, 25. November 2020 07:05 > > To: 'Deucher, Alexander' <Alexander.Deucher@xxxxxxx>; Huang, Ray > > <Ray.Huang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx> > > Cc: Will Deacon <will@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; > > linux- pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn > > Helgaas <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; Zhu, > > Changfeng <Changfeng.Zhu@xxxxxxx> > > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as > > broken > > > > I see that problem only on systems that use a R1305G APU > > > > sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info > > > > shows > > > > VCE feature version: 0, firmware version: 0x00000000 UVD feature > > version: 0, firmware version: 0x00000000 MC feature version: 0, > > firmware > version: > > 0x00000000 ME feature version: 50, firmware version: 0x000000a3 PFP > > feature version: 50, firmware version: 0x000000bb CE feature version: > > 50, firmware version: 0x0000004f RLC feature version: 1, firmware version: > > 0x00000049 RLC SRLC feature version: 1, firmware version: 0x00000001 > > RLC SRLG feature version: 1, firmware version: 0x00000001 RLC SRLS > > feature > > version: 1, firmware version: 0x00000001 MEC feature version: 50, > > firmware > > version: 0x000001b5 > > MEC2 feature version: 50, firmware version: 0x000001b5 SOS feature > version: > > 0, firmware version: 0x00000000 ASD feature version: 0, firmware version: > > 0x21000030 TA XGMI feature version: 0, firmware version: 0x00000000 > > TA RAS feature version: 0, firmware version: 0x00000000 SMC feature > > version: 0, firmware version: 0x00002527 > > SDMA0 feature version: 41, firmware version: 0x000000a9 VCN feature > > version: 0, firmware version: 0x0110901c DMCU feature version: 0, > > firmware > > version: 0x00000001 VBIOS version: 113-RAVEN2-117 > > > > We are also using V1404I APU on the same boards and I haven´t seen > > the issue on those boards > > > > These boards give me slightly different info: sudo cat > > /sys/kernel/debug/dri/0/amdgpu_firmware_info > > > > VCE feature version: 0, firmware version: 0x00000000 UVD feature > > version: 0, firmware version: 0x00000000 MC feature version: 0, > > firmware > version: > > 0x00000000 ME feature version: 47, firmware version: 0x000000a2 PFP > > feature version: 47, firmware version: 0x000000b9 CE feature version: > > 47, firmware version: 0x0000004e RLC feature version: 1, firmware version: > > 0x00000213 RLC SRLC feature version: 1, firmware version: 0x00000001 > > RLC SRLG feature version: 1, firmware version: 0x00000001 RLC SRLS > > feature > > version: 1, firmware version: 0x00000001 MEC feature version: 47, > > firmware > > version: 0x000001ab > > MEC2 feature version: 47, firmware version: 0x000001ab SOS feature > version: > > 0, firmware version: 0x00000000 ASD feature version: 0, firmware version: > > 0x21000013 TA XGMI feature version: 0, firmware version: 0x00000000 > > TA RAS feature version: 0, firmware version: 0x00000000 SMC feature > > version: 0, firmware version: 0x00001e5b > > SDMA0 feature version: 41, firmware version: 0x000000a9 VCN feature > > version: 0, firmware version: 0x0110901c DMCU feature version: 0, > > firmware > > version: 0x00000000 VBIOS version: 113-RAVEN-116 > > > > > > > > > > 00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Root Complex > > 00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > IOMMU > > 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h > > (Models > > 00h-1fh) PCIe Dummy Host Bridge > > 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > PCIe GPP Bridge [6:0] > > 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin > > Switch Upstream (PCIE SW.US) > > 00:01.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > PCIe GPP Bridge [6:0] > > 00:01.5 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin > > Switch Upstream (PCIE SW.US) > > 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h > > (Models > > 00h-1fh) PCIe Dummy Host Bridge > > 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Internal PCIe GPP Bridge 0 to Bus A > > 00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Internal PCIe GPP Bridge 0 to Bus B > > 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus > > Controller (rev 61) > > 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC > > Bridge (rev > > 51) > > 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 0 > > 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 1 > > 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 2 > > 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 3 > > 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 4 > > 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 5 > > 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 6 > > 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 > > Device 24: Function 7 > > 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. > > RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0e) > > 01:00.1 Serial controller: Realtek Semiconductor Co., Ltd. Device > > 816a (rev 0e) > > 01:00.2 Serial controller: Realtek Semiconductor Co., Ltd. Device > > 816b (rev 0e) > > 01:00.3 IPMI Interface: Realtek Semiconductor Co., Ltd. Device 816c > > (rev 0e) > > 01:00.4 USB controller: Realtek Semiconductor Co., Ltd. Device 816d > > (rev 0e) > > 02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network > > Connection (rev 03) > > 03:00.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 > > 6-Port/8- Lane Packet Switch > > 04:01.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 > > 6-Port/8- Lane Packet Switch > > 04:02.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 > > 6-Port/8- Lane Packet Switch > > 04:03.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 > > 6-Port/8- Lane Packet Switch > > 04:04.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 > > 6-Port/8- Lane Packet Switch > > 04:05.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 > > 6-Port/8- Lane Packet Switch > > 06:00.0 Serial controller: Asix Electronics Corporation Device 9100 > > 06:00.1 Serial controller: Asix Electronics Corporation Device 9100 > > 07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network > > Connection (rev 03) > > 0a:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network > > Connection (rev 03) > > 0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. > > [AMD/ATI] Picasso (rev cf) > > 0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] > > Raven/Raven2/Fenghuang HDMI/DP Audio Controller > > 0b:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] > > Family 17h (Models 10h-1fh) Platform Security Processor > > 0b:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven2 > > USB > > 3.1 > > 0b:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] > > Raven/Raven2/FireFlight/Renoir Audio Processor > > 0b:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. > > [AMD] Raven/Raven2/Renoir Non-Sensor Fusion Hub KMDF driver > > 0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA > > Controller [AHCI mode] (rev 61) > > > > PCI Revision ID is 06 I believe. Got that from this lspci -xx > > > > 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin > > Switch Upstream (PCIE SW.US) > > 00: 22 10 5d 14 07 04 10 00 00 00 04 06 10 00 81 00 > > 10: 00 00 00 00 00 00 00 00 00 02 02 00 f1 01 00 00 > > 20: e0 fc e0 fc f1 ff 01 00 00 00 00 00 00 00 00 00 > > 30: 00 00 00 00 50 00 00 00 00 00 00 00 ff 00 12 00 > > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 50: 01 58 03 c8 00 00 00 00 10 a0 42 01 22 80 00 00 > > 60: 1f 29 00 00 13 38 73 03 42 00 11 30 00 00 04 00 > > 70: 00 00 40 01 18 00 01 00 00 00 00 00 bf 01 70 00 > > 80: 06 00 00 00 0e 00 00 00 03 00 01 00 00 00 00 00 > > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > a0: 05 c0 81 00 00 00 e0 fe 00 00 00 00 00 00 00 00 > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > c0: 0d c8 00 00 22 10 34 12 08 00 03 a8 00 00 00 00 > > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > e0: 00 00 00 00 4c 8a 05 00 00 00 00 00 00 00 00 00 > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > > -----Original Message----- > > From: Deucher, Alexander <Alexander.Deucher@xxxxxxx> > > Sent: Dienstag, 24. November 2020 16:06 > > To: Merger, Edgar [AUTOSOL/MAS/AUGS] > <Edgar.Merger@xxxxxxxxxxx>; > > Huang, Ray <Ray.Huang@xxxxxxx>; Kuehling, Felix > > <Felix.Kuehling@xxxxxxx> > > Cc: Will Deacon <will@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; > > linux- pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn > > Helgaas <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; Zhu, > > Changfeng <Changfeng.Zhu@xxxxxxx> > > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as > > broken > > > > [AMD Public Use] > > > > > -----Original Message----- > > > From: Merger, Edgar [AUTOSOL/MAS/AUGS] > > <Edgar.Merger@xxxxxxxxxxx> > > > Sent: Tuesday, November 24, 2020 2:29 AM > > > To: Huang, Ray <Ray.Huang@xxxxxxx>; Kuehling, Felix > > > <Felix.Kuehling@xxxxxxx> > > > Cc: Will Deacon <will@xxxxxxxxxx>; Deucher, Alexander > > > <Alexander.Deucher@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux- > > > pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn > > > Helgaas <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; > > > Zhu, > > Changfeng > > > <Changfeng.Zhu@xxxxxxx> > > > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS > > > as broken > > > > > > Module Version : PiccasoCpu 10 > > > AGESA Version : PiccasoPI 100A > > > > > > I did not try to enter the system in any other way (like via ssh) > > > than via Desktop. > > > > You can get this information from the amdgpu driver. E.g., sudo cat > > /sys/kernel/debug/dri/0/amdgpu_firmware_info . Also what is the PCI > > revision id of your chip (from lspci)? Also are you just seeing > > this on specific versions of the sbios? > > > > Thanks, > > > > Alex > > > > > > > > > > -----Original Message----- > > > From: Huang Rui <ray.huang@xxxxxxx> > > > Sent: Dienstag, 24. November 2020 07:43 > > > To: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > > > Cc: Will Deacon <will@xxxxxxxxxx>; Deucher, Alexander > > > <Alexander.Deucher@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux- > > > pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn > > > Helgaas <bhelgaas@xxxxxxxxxx>; Merger, Edgar [AUTOSOL/MAS/AUGS] > > > <Edgar.Merger@xxxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; > > Changfeng > > > Zhu <changfeng.zhu@xxxxxxx> > > > Subject: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as > > broken > > > > > > On Tue, Nov 24, 2020 at 06:51:11AM +0800, Kuehling, Felix wrote: > > > > On 2020-11-23 5:33 p.m., Will Deacon wrote: > > > > > On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander > wrote: > > > > >> [AMD Public Use] > > > > >> > > > > >>> -----Original Message----- > > > > >>> From: Will Deacon <will@xxxxxxxxxx> > > > > >>> Sent: Monday, November 23, 2020 8:44 AM > > > > >>> To: linux-kernel@xxxxxxxxxxxxxxx > > > > >>> Cc: linux-pci@xxxxxxxxxxxxxxx; > > > > >>> iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Will Deacon > > > > >>> <will@xxxxxxxxxx>; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; > > > > >>> Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Edgar > Merger > > > > >>> <Edgar.Merger@xxxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx> > > > > >>> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken > > > > >>> > > > > >>> Edgar Merger reports that the AMD Raven GPU does not work > > > > >>> reliably on his system when the IOMMU is enabled: > > > > >>> > > > > >>> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx > > > > >>> timeout, signaled seq=1, emitted seq=3 > > > > >>> | [...] > > > > >>> | amdgpu 0000:0b:00.0: GPU reset begin! > > > > >>> | AMD-Vi: Completion-Wait loop timed out > > > > >>> | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT > > > > >>> device=0b:00.0 address=0x38edc0970] > > > > >>> > > > > >>> This is indicative of a hardware/platform configuration > > > > >>> issue so, since disabling ATS has been shown to resolve the > > > > >>> problem, add a quirk to match this particular device while > > > > >>> Edgar follows-up with AMD > > > for more information. > > > > >>> > > > > >>> Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > > > > >>> Cc: Alex Deucher <alexander.deucher@xxxxxxx> > > > > >>> Reported-by: Edgar Merger <Edgar.Merger@xxxxxxxxxxx> > > > > >>> Suggested-by: Joerg Roedel <jroedel@xxxxxxx> > > > > >>> Link: > > > > >>> > > > > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__nam11.safelinks.p > rotection.outlook.com_-3Furl-3Dhttps-253A-252F-252Furld&d=DwIFAw&c=jOU > RTkCZzT8tVB5xPEYIm3YJGoxoTaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_-862rdSP1 > 3_P6LVp7j_9l1xmg&m=BIpm40CYGVSJNrmoqPI4DeIayU0tYU2D5NpRwfbkZvA&s=tmsZ3 > ihrSXZ3g6wdJ2maxU9mJ1TGcRxd91z9IQTP00A&e= > efense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps- > 3A__nam11.safelinks.p&data=04%7C01%7CAlexander.Deucher%40amd > .com%7C9194a443d95c4ffcb7f708d891ed0889%7C3dd8961fe4884e608e11a82 > d994e183d%7C0%7C1%7C637419794843309283%7CUnknown%7CTWFpbGZsb > 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0 > %3D%7C1000&sdata=4JGSwn7au4u%2FBB69mmq0%2BrWfVG12sEyd5H > oBUeiut9o%3D&reserved=0 > > rotection.outlook.com_-3Furl-3Dhttps-253A-252F- > 252Furld&d=DwIFAw&c=jOU > > > RTkCZzT8tVB5xPEYIm3YJGoxoTaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_- > 862rdSP1 > > 3_P6LVp7j_9l1xmg&m=ZhN0Jau6oCc4cnz64IhGK2- > XDiD5D_6vW6ZYbifWYF0&s=ndvE- > > ezxTBweMMUjyMWdiCyPB6GDIS_eWs0kmZwqtpY&e= > > > efense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps- > > 3A__nam11.safelinks.p& > > > > > > ;data=04%7C01%7CAlexander.Deucher%40amd.com%7C1d797071822d47ce6 > > c9808d8 > > > > > > 9129698f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637418954 > > 3633797 > > > > > > 99%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luM > > zIiLCJBTiI > > > > > > 6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=VLlzQtS3KWZqQslcJKZYrG > > sj6eMk3 > > > VWaE%2BXhbNdRx80%3D&reserved=0 > > > rotection.outlook.com_-3Furl-3Dhttps-253A-252F- > > 252Furld&d=DwIFAw&c=jOU > > > > > > RTkCZzT8tVB5xPEYIm3YJGoxoTaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_- > > 862rdSP1 > > > 3_P6LVp7j_9l1xmg&m=MMI_EgCqeOX4EvIftpL7agRxJ- > > udp1CLokf2QWuzFgE&s=ZLdz6 > > > OgavzNn2vSzsgyL1IB6MbK7hPKavOYwbLhyTPU&e= > > > efense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps- > > > > > > 3A__lore%26d%3DDwIDAw%26c%3DjOURTkCZzT8tVB5xPEYIm3YJGoxoTaQs > > > QPzPKJGaWbo%26r%3DBJxhacqqa4K1PJGm6_- > > > > > > 862rdSP13_P6LVp7j_9l1xmg%26m%3DlNXu2xwvyxEZ3PzoVmXMBXXS55jsmf > > > > > > DicuQFJqkIOH4%26s%3D_5VDNCRQdA7AhsvvZ3TJJtQZ2iBp9c9tFHIleTYT_ZM > > > > > > %26e%3D&data=04%7C01%7CAlexander.Deucher%40amd.com%7C6d5f > > > > > > a241f9634692c03908d8904a942c%7C3dd8961fe4884e608e11a82d994e183d%7 > > > > > > C0%7C0%7C637417997272974427%7CUnknown%7CTWFpbGZsb3d8eyJWIjoi > > > > > > MC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C100 > > > > > > 0&sdata=OEgYlw%2F1YP0C%2FnWBRQUxwBH56mGOJxYMWSQ%2Fj1Y > > > 9f6Q%3D&reserved=0 . > > > > >>> kernel.org/linux- > > > > >>> > > > iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10M > > > > >>> B1310.namprd10.prod.outlook.com > > > > >>> > > > > > > her%40amd.com%7C1a883fe14d0c408e7d9508d88fb5df4e%7C3dd8961fe488 > > > > >>> > > > > > > 4e608e11a82d994e183d%7C0%7C0%7C637417358593629699%7CUnknown%7 > > > > >>> > > > > > > CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwi > > > > >>> > > > > > > LCJXVCI6Mn0%3D%7C1000&sdata=TMgKldWzsX8XZ0l7q3%2BszDWXQJJ > > > > >>> LOUfX5oGaoLN8n%2B8%3D&reserved=0 > > > > >>> Signed-off-by: Will Deacon <will@xxxxxxxxxx> > > > > >>> --- > > > > >>> > > > > >>> Hi all, > > > > >>> > > > > >>> Since Joerg is away at the moment, I'm posting this to try > > > > >>> to make some progress with the thread in the Link: tag. > > > > >> + Felix > > > > >> > > > > >> What system is this? Can you provide more details? Does a > > > > >> sbios update fix this? Disabling ATS for all Ravens will > > > > >> break GPU compute for a lot of people. I'd prefer to just > > > > >> black list this particular system (e.g., just SSIDs or revision) if possible. > > > > > > > > +Ray > > > > > > > > There are already many systems where the IOMMU is disabled in > > > > the BIOS, or the CRAT table reporting the APU compute > > > > capabilities is broken. Ray has been working on a fallback to > > > > make APUs behave like dGPUs on such systems. That should also > > > > cover this case where ATS is blacklisted. That said, it affects > > > > the programming model, because we don't support the unified and > > > > coherent memory model on dGPUs like we do on APUs with IOMMUv2. > > > > So it would be good to > make > > > > the conditions for this workaround as narrow as possible. > > > > > > Yes, besides the comments from Alex and Felix, may we get your > > > firmware version (SMC firmware which is from SBIOS) and device id? > > > > > > > >>> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx > > > > >>> timeout, signaled seq=1, emitted seq=3 > > > > > > It looks only gfx ib test passed, and fails to lanuch desktop, am I right? > > > > > > We would like to see whether it is Raven, Raven kicker (new > > > Raven), or Picasso. In our side, per the internal test result, we > > > didn't see the similiar issue on Raven kicker and Picasso platform. > > > > > > Thanks, > > > Ray > > > > > > > > > > > These are the relevant changes in KFD and Thunk for reference: > > > > > > > > ### KFD ### > > > > > > > > commit 914913ab04dfbcd0226ecb6bc99d276832ea2908 > > > > Author: Huang Rui <ray.huang@xxxxxxx> > > > > Date: Tue Aug 18 14:54:23 2020 +0800 > > > > > > > > drm/amdkfd: implement the dGPU fallback path for apu (v6) > > > > > > > > We still have a few iommu issues which need to address, so > > > > force raven > > > > as "dgpu" path for the moment. > > > > > > > > This is to add the fallback path to bypass IOMMU if IOMMU > > > > v2 is disabled > > > > or ACPI CRAT table not correct. > > > > > > > > v2: Use ignore_crat parameter to decide whether it will go > > > > with IOMMUv2. > > > > v3: Align with existed thunk, don't change the way of > > > > raven, only renoir > > > > will use "dgpu" path by default. > > > > v4: don't update global ignore_crat in the driver, and > > > > revise fallback > > > > function if CRAT is broken. > > > > v5: refine acpi crat good but no iommu support case, and > > > > rename the > > > > title. > > > > v6: fix the issue of dGPU initialized firstly, just modify > > > > the report > > > > value in the node_show(). > > > > > > > > Signed-off-by: Huang Rui <ray.huang@xxxxxxx> > > > > Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > > > > Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx> > > > > > > > > ### Thunk ### > > > > > > > > commit e32482fa4b9ca398c8bdc303920abfd672592764 > > > > Author: Huang Rui <ray.huang@xxxxxxx> > > > > Date: Tue Aug 18 18:54:05 2020 +0800 > > > > > > > > libhsakmt: remove is_dgpu flag in the hsa_gfxip_table > > > > > > > > Whether use dgpu path will check the props which exposed > > > > from > > kernel. > > > > We won't need hard code in the ASIC table. > > > > > > > > Signed-off-by: Huang Rui <ray.huang@xxxxxxx> > > > > Change-Id: I0c018a26b219914a41197ff36dbec7a75945d452 > > > > > > > > commit 7c60f6d912034aa67ed27b47a29221422423f5cc > > > > Author: Huang Rui <ray.huang@xxxxxxx> > > > > Date: Thu Jul 30 10:22:23 2020 +0800 > > > > > > > > libhsakmt: implement the method that using flag which > > > > exposed by kfd to configure is_dgpu > > > > > > > > KFD already implemented the fallback path for APU. Thunk > > > > will use flag > > > > which exposed by kfd to configure is_dgpu instead of > > > > hardcode > > before. > > > > > > > > Signed-off-by: Huang Rui <ray.huang@xxxxxxx> > > > > Change-Id: I445f6cf668f9484dd06cd9ae1bb3cfe7428ec7eb > > > > > > > > Regards, > > > > Felix > > > > > > > > > > > > > Cheers, Alex. I'll have to defer to Edgar for the details, as > > > > > my understanding from the original thread over at: > > > > > > > > > > > > > > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__nam11.safelinks.p > rotection.outlook.com_-3Furl-3Dhttps-253A-252F-252Furld&d=DwIFAw&c=jOU > RTkCZzT8tVB5xPEYIm3YJGoxoTaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_-862rdSP1 > 3_P6LVp7j_9l1xmg&m=BIpm40CYGVSJNrmoqPI4DeIayU0tYU2D5NpRwfbkZvA&s=tmsZ3 > ihrSXZ3g6wdJ2maxU9mJ1TGcRxd91z9IQTP00A&e= > efense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps- > 3A__nam11.safelinks.p&data=04%7C01%7CAlexander.Deucher%40amd > .com%7C9194a443d95c4ffcb7f708d891ed0889%7C3dd8961fe4884e608e11a82 > d994e183d%7C0%7C1%7C637419794843309283%7CUnknown%7CTWFpbGZsb > 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0 > %3D%7C1000&sdata=4JGSwn7au4u%2FBB69mmq0%2BrWfVG12sEyd5H > oBUeiut9o%3D&reserved=0 > > rotection.outlook.com_-3Furl-3Dhttps-253A-252F- > 252Furld&d=DwIFAw&c=jOU > > > RTkCZzT8tVB5xPEYIm3YJGoxoTaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_- > 862rdSP1 > > 3_P6LVp7j_9l1xmg&m=ZhN0Jau6oCc4cnz64IhGK2- > XDiD5D_6vW6ZYbifWYF0&s=ndvE- > > ezxTBweMMUjyMWdiCyPB6GDIS_eWs0kmZwqtpY&e= > > > efense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps- > > 3A__nam11.safelinks.p& > > > > > > ;data=04%7C01%7CAlexander.Deucher%40amd.com%7C1d797071822d47ce6 > > c9808d8 > > > > > > 9129698f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637418954 > > 3633797 > > > > > > 99%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luM > > zIiLCJBTiI > > > > > > 6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=VLlzQtS3KWZqQslcJKZYrG > > sj6eMk3 > > > VWaE%2BXhbNdRx80%3D&reserved=0 > > > rotection.outlook.com_-3Furl-3Dhttps-253A-252F- > > 252Fur&d=DwIFAw&c=jOURT > > > > kCZzT8tVB5xPEYIm3YJGoxoTaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_- > > 862rdSP13_ > > > P6LVp7j_9l1xmg&m=MMI_EgCqeOX4EvIftpL7agRxJ- > > udp1CLokf2QWuzFgE&s=IPZRolk > > > y3TYlbWPsOkY37MbDdzwhc1b_LaE6JkaOkOo&e= > > > > > ldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps- > > > 3A__lore.kernel.org&a > > > > > > > > > > > mp;data=04%7C01%7CAlexander.Deucher%40amd.com%7C6d5fa241f963469 > > > 2c039 > > > > > > > > > > > 08d8904a942c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63741 > > > 79972 > > > > > > > > > > > 72974427%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoi > > > V2luMzI > > > > > > > > > > > iLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=iKTPucGQqcRXET > > > QZiQz > > > > > j90WdJeCYDytdZHJ1ZiUyR%2FM%3D&reserved=0 > > > > > _linux-2Diommu_MWHPR10MB1310CDB6829DDCF5EA84A14689150- > > > 40MWHPR10MB131 > > > > > > > > > > > 0.namprd10.prod.outlook.com_&d=DwIDAw&c=jOURTkCZzT8tVB5xPEYIm3Y > > > JGoxo > > > > > TaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_- > > > 862rdSP13_P6LVp7j_9l1xmg&m=lNXu > > > > > > > > > > > 2xwvyxEZ3PzoVmXMBXXS55jsmfDicuQFJqkIOH4&s=dsAVVJbD7gJIj3ctZpnnU > > > 60y21 > > > > > ijWZmZ8xmOK1cO_O0&e= > > > > > > > > > > is that this is a board developed by his company. > > > > > > > > > > Edgar -- please can you answer Alex's questions? > > > > > > > > > > Will