Am 2021-08-05 um 4:51 a.m. schrieb Zhu, Changfeng: > [AMD Official Use Only] > > Hi Felix, > > Can we set noretry=1 for dgpu path(ignore_crat=1) which doesn’t to through iommuv2 path on raven as below: There are other possible reasons than ignore_crat for Raven to work in dGPU mode (broken CRAT, disabled IOMMU). However, those are not known until later in the initialization. Regards, Felix >> + case CHIP_RAVEN: >> + /* >> + * TODO: Raven currently can fix most SVM issues with >> + * noretry =1. However it has two issues with noretry = 1 >> + * on kfd migrate tests. It still needs to root causes >> + * with these two migrate fails on raven with noretry = 1. >> + */ >> if (amdgpu_noretry == -1) { >> If(ignore_crat) >> gmc->noretry = 1; >> else >> gmc->noretry = 0; >> } >> else >> gmc->noretry = amdgpu_noretry; >> break; > BR, > Changfeng. > > -----Original Message----- > From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > Sent: Wednesday, July 28, 2021 10:22 PM > To: Zhu, Changfeng <Changfeng.Zhu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Huang, Ray <Ray.Huang@xxxxxxx>; Zhang, Yifan <Yifan1.Zhang@xxxxxxx> > Subject: Re: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven > > Doesn't this break IOMMUv2? Applications that run using IOMMUv2 for system memory access depend on correct retry handling in the SQ. > Therefore noretry must be 0 on Raven. > > I believe the reason that SVM has trouble with retry enabled is, that > IOMMUv2 is catching the page faults, so the driver never gets to handle the page fault interrupts. That breaks page-fault based migration in the SVM code. I think the better solution is to disable SVM on APUs where > IOMMUv2 is enabled. > > Alternatively, we could give up on IOMMUv2 entirely and always rely on SVM to provide that functionality. But that requires more changes in the amdgpu_vm code. > > Regards, > Felix > > > Am 2021-07-28 um 2:36 a.m. schrieb Changfeng: >> From: changzhu <Changfeng.Zhu@xxxxxxx> >> >> From: Changfeng <Changfeng.Zhu@xxxxxxx> >> >> It can't find any issues with noretry=1 except two SVM migrate issues. >> Oppositely, it will cause most SVM cases fail with noretry=0. >> The two SVM migrate issues also happen with noretry=0. So it can set >> default noretry=1 for raven firstly to fix most SVM fails. >> >> Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc >> Signed-off-by: Changfeng <Changfeng.Zhu@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 15 ++++++++------- >> 1 file changed, 8 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >> index 09edfb64cce0..d7f69dbd48e6 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >> @@ -606,19 +606,20 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev) >> * noretry = 0 will cause kfd page fault tests fail >> * for some ASICs, so set default to 1 for these ASICs. >> */ >> + case CHIP_RAVEN: >> + /* >> + * TODO: Raven currently can fix most SVM issues with >> + * noretry =1. However it has two issues with noretry = 1 >> + * on kfd migrate tests. It still needs to root causes >> + * with these two migrate fails on raven with noretry = 1. >> + */ >> if (amdgpu_noretry == -1) >> gmc->noretry = 1; >> else >> gmc->noretry = amdgpu_noretry; >> break; >> - case CHIP_RAVEN: >> default: >> - /* Raven currently has issues with noretry >> - * regardless of what we decide for other >> - * asics, we should leave raven with >> - * noretry = 0 until we root cause the >> - * issues. >> - * >> + /* >> * default this to 0 for now, but we may want >> * to change this in the future for certain >> * GPUs as it can increase performance in