Thanks Monk,

I just updated the patch, and it passed 1000 rounds of the TDR test. I sent out a review email.

Regards,
Jack

-----Original Message-----
From: Liu, Monk <Monk.Liu@xxxxxxx>
Sent: Friday, April 3, 2020 11:38 AM
To: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Zhang, Jack (Jian) <Jack.Zhang1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

Thanks Felix

Hi Jack

I think the changes below can resolve your problem. We already have this on our customer branch; it fixes the memory leak and also fixes my previous bug. Can you apply this change to gfx_v10/v9? Thanks!

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
index 29749502..532258445 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
@@ -543,6 +543,8 @@ static int kgd_hqd_destroy(struct kgd_dev *kgd, void *mqd,
 	uint32_t temp;
 	struct v10_compute_mqd *m = get_mqd(mqd);
 
+	if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset)
+		return 0;
 #if 0
 	unsigned long flags;
 	int retry;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 35b32ad..f6479e1 100755
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3653,6 +3653,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	amdgpu_amdkfd_pre_reset(adev);
+
 	/* Resume IP prior to SMC */
 	r = amdgpu_device_ip_reinit_early_sriov(adev);
 	if (r)

_____________________________________
Monk Liu|GPU Virtualization Team |AMD

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Friday, April 3, 2020 1:26 AM
To: Zhang, Jack (Jian) <Jack.Zhang1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Liu, Monk <Monk.Liu@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

[+Monk]

This looks reasonable to me. However, you're effectively reverting this commit by Monk:

a03eb637d2a5 drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

In hindsight, Monk's commit was broken. Removing the call to pre_reset has other consequences, such as breaking notifications about reset to user mode, and probably invalidating some assumptions in kfd_post_reset. Can you coordinate with Monk to work out why his change was needed, and whether you'll need a different solution for the problem he was trying to address?

In the meantime, this patch is

Acked-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>

On 2020-04-02 at 3:20 a.m., Jack Zhang wrote:
> kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate
>
> Without this change, the sriov tdr code path will never free that
> allocated memory, resulting in a memory leak.
>
> Signed-off-by: Jack Zhang <Jack.Zhang1@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8faaa17..832daf7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3847,6 +3847,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  {
>  	int r;
>
> +	amdgpu_amdkfd_pre_reset(adev);
> +
>  	if (from_hypervisor)
>  		r = amdgpu_virt_request_full_gpu(adev, true);
>  	else

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
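
For readers following the thread, the leak described in Jack's commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_dev`, `toy_alloc`, `toy_pre_reset`, `toy_post_reset`, and `toy_run_resets` are hypothetical names invented for illustration, not amdgpu or amdkfd functions, and the model assumes each reset cycle allocates one object that only the pre-reset step releases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy device; it only counts objects, it is not driver code. */
struct toy_dev {
	bool has_obj;   /* an object from the current session is outstanding */
	int live_objs;  /* objects allocated and not yet freed */
};

/* Stand-in for an allocation made while bringing the device up. */
static void toy_alloc(struct toy_dev *dev)
{
	dev->has_obj = true;
	dev->live_objs++;
}

/* Models the pre-reset step: free what the previous session allocated. */
static void toy_pre_reset(struct toy_dev *dev)
{
	if (dev->has_obj) {
		assert(dev->live_objs > 0);
		dev->has_obj = false;
		dev->live_objs--;
	}
}

/* Models the post-reset step: allocate again for the new session. */
static void toy_post_reset(struct toy_dev *dev)
{
	toy_alloc(dev);
}

/* Run `rounds` reset cycles and return the objects still live afterwards. */
int toy_run_resets(int rounds, bool call_pre_reset)
{
	struct toy_dev dev = { false, 0 };

	toy_alloc(&dev); /* initial bring-up */
	for (int i = 0; i < rounds; i++) {
		if (call_pre_reset)
			toy_pre_reset(&dev);
		toy_post_reset(&dev);
	}
	return dev.live_objs;
}
```

In this model, skipping the pre-reset step leaves one extra live object per reset round, while calling it keeps the count constant. It says nothing about the KIQ ring test failure that Monk's original commit addressed.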