[Public]
Will do.
Alex
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Monday, July 18, 2022 11:29 AM To: Mike Lothian <mike@xxxxxxxxxxxxxx>; Christian König <ckoenig.leichtzumerken@xxxxxxxxx> Cc: Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence Xinhui submitted this patch instead, which should address the same
issue: "drm/amdgpu: Remove one duplicated ef removal" Alex, can you pick up that patch for drm-fixes for 5.19, if it's not too late? Thanks, Felix On 2022-07-18 10:58, Mike Lothian wrote: > Is this likely to land before 5.19 final? It's been nearly 2 weeks > since I said if fixed an issue I was seeing > https://gitlab.freedesktop.org/drm/amd/-/issues/2074 > > On Fri, 8 Jul 2022 at 10:05, Christian König > <ckoenig.leichtzumerken@xxxxxxxxx> wrote: >> Hi guys, >> >> well the practice to remove all fences by adding a NULL exclusive fence >> was pretty much illegal in the first place because this also removes >> kernel internal fences which can lead to freeing up memory which is >> still accessed. >> >> I've just didn't noticed that this was used by the KFD code as well >> otherwise I would have pushed to clean that up much earlier. >> >> Regards, >> Christian. >> >> Am 08.07.22 um 03:08 schrieb Pan, Xinhui: >>> [AMD Official Use Only - General] >>> >>> Felix, >>> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉 >>> From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken. >>> >>> And I also try the debug_evictions parameter. No unexpected eviction shows anyway. >>> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it. >>> >>> So I would make a new patch to remove this duplicated ef removal. >>> >>> -----Original Message----- >>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >>> Sent: Thursday, July 7, 2022 11:47 PM >>> To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> >>> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence >>> >>> Am 2022-07-07 um 05:54 schrieb Christian König: >>>> Am 07.07.22 um 11:50 schrieb xinhui pan: >>>>> Fence is accessed by dma_resv_add_fence() now. >>>>> Use amdgpu_amdkfd_remove_eviction_fence instead. >>>>> >>>>> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- >>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> index 0036c9e405af..1e25c400ce4f 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct >>>>> amdgpu_device *adev, >>>>> if (!process_info) >>>>> return; >>>>> - >>>>> /* Release eviction fence from PD */ >>>>> amdgpu_bo_reserve(pd, false); >>>>> - amdgpu_bo_fence(pd, NULL, false); >>>>> + amdgpu_amdkfd_remove_eviction_fence(pd, >>>>> + process_info->eviction_fence); >>>> Good catch as well, but Felix needs to take a look at this. >>> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. >>> >>> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with >>> >>> echo Y > /sys/module/amdgpu/parameters/debug_evictions >>> >>> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. >>> >>> Regards, >>> Felix >>> >>> >>>> Regards, >>>> Christian. >>>> >>>>> amdgpu_bo_unreserve(pd); >>>>> /* Update process info */ |