[AMD Public Use]

> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@xxxxxxxxxxx>
> Sent: Tuesday, November 24, 2020 2:29 AM
> To: Huang, Ray <Ray.Huang@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn Helgaas
> <bhelgaas@xxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>; Zhu, Changfeng
> <Changfeng.Zhu@xxxxxxx>
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as
> broken
>
> Module Version : PiccasoCpu 10
> AGESA Version : PiccasoPI 100A
>
> I did not try to enter the system in any other way (like via ssh) than via
> Desktop.

You can get this information from the amdgpu driver. E.g.,
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info .
Also what is the PCI revision id of your chip (from lspci)? Also are you
just seeing this on specific versions of the sbios?

Thanks,

Alex

>
> -----Original Message-----
> From: Huang Rui <ray.huang@xxxxxxx>
> Sent: Tuesday, 24 November 2020 07:43
> To: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Bjorn Helgaas
> <bhelgaas@xxxxxxxxxx>; Merger, Edgar [AUTOSOL/MAS/AUGS]
> <Edgar.Merger@xxxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>;
> Changfeng Zhu <changfeng.zhu@xxxxxxx>
> Subject: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
>
> On Tue, Nov 24, 2020 at 06:51:11AM +0800, Kuehling, Felix wrote:
> > On 2020-11-23 5:33 p.m., Will Deacon wrote:
> > > On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander wrote:
> > >> [AMD Public Use]
> > >>
> > >>> -----Original Message-----
> > >>> From: Will Deacon <will@xxxxxxxxxx>
> > >>> Sent: Monday, November 23, 2020 8:44 AM
> > >>> To: linux-kernel@xxxxxxxxxxxxxxx
> > >>> Cc: linux-pci@xxxxxxxxxxxxxxx; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx;
> > >>> Will Deacon <will@xxxxxxxxxx>; Bjorn Helgaas
> > >>> <bhelgaas@xxxxxxxxxx>; Deucher, Alexander
> > >>> <Alexander.Deucher@xxxxxxx>; Edgar Merger
> > >>> <Edgar.Merger@xxxxxxxxxxx>; Joerg Roedel <jroedel@xxxxxxx>
> > >>> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> > >>>
> > >>> Edgar Merger reports that the AMD Raven GPU does not work reliably
> > >>> on his system when the IOMMU is enabled:
> > >>>
> > >>> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
> > >>> | [...]
> > >>> | amdgpu 0000:0b:00.0: GPU reset begin!
> > >>> | AMD-Vi: Completion-Wait loop timed out
> > >>> | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x38edc0970]
> > >>>
> > >>> This is indicative of a hardware/platform configuration issue so,
> > >>> since disabling ATS has been shown to resolve the problem, add a
> > >>> quirk to match this particular device while Edgar follows up with AMD
> > >>> for more information.
> > >>>
> > >>> Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> > >>> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> > >>> Reported-by: Edgar Merger <Edgar.Merger@xxxxxxxxxxx>
> > >>> Suggested-by: Joerg Roedel <jroedel@xxxxxxx>
> > >>> Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> > >>> Signed-off-by: Will Deacon <will@xxxxxxxxxx>
> > >>> ---
> > >>>
> > >>> Hi all,
> > >>>
> > >>> Since Joerg is away at the moment, I'm posting this to try to make
> > >>> some progress with the thread in the Link: tag.
> > >> + Felix
> > >>
> > >> What system is this? Can you provide more details? Does a sbios
> > >> update fix this? Disabling ATS for all Ravens will break GPU
> > >> compute for a lot of people. I'd prefer to just black list this
> > >> particular system (e.g., just SSIDs or revision) if possible.
> >
> > +Ray
> >
> > There are already many systems where the IOMMU is disabled in the
> > BIOS, or the CRAT table reporting the APU compute capabilities is
> > broken. Ray has been working on a fallback to make APUs behave like
> > dGPUs on such systems. That should also cover this case where ATS is
> > blacklisted. That said, it affects the programming model, because we
> > don't support the unified and coherent memory model on dGPUs like we
> > do on APUs with IOMMUv2. So it would be good to make the conditions
> > for this workaround as narrow as possible.
>
> Yes, besides the comments from Alex and Felix, may we get your firmware
> version (SMC firmware which is from SBIOS) and device id?
>
> > >>> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
>
> It looks like only the gfx IB test passed and it fails to launch the
> desktop, am I right?
>
> We would like to see whether it is Raven, Raven kicker (new Raven), or
> Picasso. On our side, per the internal test results, we didn't see a similar
> issue on the Raven kicker and Picasso platforms.
>
> Thanks,
> Ray
>
> >
> > These are the relevant changes in KFD and Thunk for reference:
> >
> > ### KFD ###
> >
> > commit 914913ab04dfbcd0226ecb6bc99d276832ea2908
> > Author: Huang Rui <ray.huang@xxxxxxx>
> > Date:   Tue Aug 18 14:54:23 2020 +0800
> >
> >     drm/amdkfd: implement the dGPU fallback path for apu (v6)
> >
> >     We still have a few iommu issues which need to address, so force raven
> >     as "dgpu" path for the moment.
> >
> >     This is to add the fallback path to bypass IOMMU if IOMMU v2 is disabled
> >     or ACPI CRAT table not correct.
> >
> >     v2: Use ignore_crat parameter to decide whether it will go with IOMMUv2.
> >     v3: Align with existed thunk, don't change the way of raven, only renoir
> >         will use "dgpu" path by default.
> >     v4: don't update global ignore_crat in the driver, and revise fallback
> >         function if CRAT is broken.
> >     v5: refine acpi crat good but no iommu support case, and rename the
> >         title.
> >     v6: fix the issue of dGPU initialized firstly, just modify the report
> >         value in the node_show().
> >
> >     Signed-off-by: Huang Rui <ray.huang@xxxxxxx>
> >     Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
> >     Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> >
> > ### Thunk ###
> >
> > commit e32482fa4b9ca398c8bdc303920abfd672592764
> > Author: Huang Rui <ray.huang@xxxxxxx>
> > Date:   Tue Aug 18 18:54:05 2020 +0800
> >
> >     libhsakmt: remove is_dgpu flag in the hsa_gfxip_table
> >
> >     Whether use dgpu path will check the props which exposed from kernel.
> >     We won't need hard code in the ASIC table.
> >
> >     Signed-off-by: Huang Rui <ray.huang@xxxxxxx>
> >     Change-Id: I0c018a26b219914a41197ff36dbec7a75945d452
> >
> > commit 7c60f6d912034aa67ed27b47a29221422423f5cc
> > Author: Huang Rui <ray.huang@xxxxxxx>
> > Date:   Thu Jul 30 10:22:23 2020 +0800
> >
> >     libhsakmt: implement the method that using flag which exposed by
> >     kfd to configure is_dgpu
> >
> >     KFD already implemented the fallback path for APU. Thunk will use flag
> >     which exposed by kfd to configure is_dgpu instead of hardcode before.
> >
> >     Signed-off-by: Huang Rui <ray.huang@xxxxxxx>
> >     Change-Id: I445f6cf668f9484dd06cd9ae1bb3cfe7428ec7eb
> >
> > Regards,
> >   Felix
> >
> >
> > > Cheers, Alex. I'll have to defer to Edgar for the details, as my
> > > understanding from the original thread over at:
> > >
> > > https://lore.kernel.org/linux-iommu/MWHPR10MB1310CDB6829DDCF5EA84A14689150@MWHPR10MB1310.namprd10.prod.outlook.com/
> > >
> > > is that this is a board developed by his company.
> > >
> > > Edgar -- please can you answer Alex's questions?
> > >
> > > Will
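
For reference, the identifiers requested in this thread (device id, PCI
revision, SSIDs) are exposed by the kernel as standard sysfs attributes
under /sys/bus/pci/devices/<BDF>/. What follows is only a minimal Python
sketch, not part of any patch discussed here, of reading those attributes
and testing them against a narrowly scoped match. The function names
read_pci_ids and matches_narrow_quirk are hypothetical, and any ID values
passed to them are placeholders; a real quirk would live in the kernel's
PCI quirk code and use whatever IDs Edgar reports.

```python
from pathlib import Path

# Standard PCI sysfs attributes; each file holds a hex string such as "0x1002".
ID_FIELDS = ("vendor", "device", "revision",
             "subsystem_vendor", "subsystem_device")


def read_pci_ids(dev_dir):
    """Return the ID attributes of one PCI device directory as integers."""
    dev_dir = Path(dev_dir)
    return {name: int((dev_dir / name).read_text().strip(), 16)
            for name in ID_FIELDS}


def matches_narrow_quirk(ids, quirk):
    """True only if every field named in `quirk` equals the device's value.

    `quirk` is a hypothetical dict such as
    {"device": ..., "revision": ..., "subsystem_vendor": ...};
    fields left out of it are not checked.
    """
    return all(ids.get(name) == value for name, value in quirk.items())
```

For the device in the log above, this would be called as
read_pci_ids("/sys/bus/pci/devices/0000:0b:00.0"). Matching on the revision
and subsystem IDs in addition to the device id is one way to keep an ATS
workaround off other Raven systems, which is the narrowing Alex and Felix
ask for.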