Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]


Will do.

Alex


From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Monday, July 18, 2022 11:29 AM
To: Mike Lothian <mike@xxxxxxxxxxxxxx>; Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Cc: Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
 
Xinhui submitted this patch instead, which should address the same
issue: "drm/amdgpu: Remove one duplicated ef removal"

Alex, can you pick up that patch for drm-fixes for 5.19, if it's not too
late?

Thanks,
   Felix


On 2022-07-18 10:58, Mike Lothian wrote:
> Is this likely to land before 5.19 final? It's been nearly 2 weeks
> since I said if fixed an issue I was seeing
> https://gitlab.freedesktop.org/drm/amd/-/issues/2074
>
> On Fri, 8 Jul 2022 at 10:05, Christian König
> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>> Hi guys,
>>
>> well the practice to remove all fences by adding a NULL exclusive fence
>> was pretty much illegal in the first place because this also removes
>> kernel internal fences which can lead to freeing up memory which is
>> still accessed.
>>
>> I've just didn't noticed that this was used by the KFD code as well
>> otherwise I would have pushed to clean that up much earlier.
>>
>> Regards,
>> Christian.
>>
>> Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
>>> [AMD Official Use Only - General]
>>>
>>> Felix,
>>> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
>>>   From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
>>>
>>> And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
>>> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
>>>
>>> So I would make a new patch to remove this duplicated ef removal.
>>>
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>>> Sent: Thursday, July 7, 2022 11:47 PM
>>> To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
>>>
>>> Am 2022-07-07 um 05:54 schrieb Christian König:
>>>> Am 07.07.22 um 11:50 schrieb xinhui pan:
>>>>> Fence is accessed by dma_resv_add_fence() now.
>>>>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>>>>
>>>>> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> index 0036c9e405af..1e25c400ce4f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
>>>>> amdgpu_device *adev,
>>>>>           if (!process_info)
>>>>>             return;
>>>>> -
>>>>>         /* Release eviction fence from PD */
>>>>>         amdgpu_bo_reserve(pd, false);
>>>>> -    amdgpu_bo_fence(pd, NULL, false);
>>>>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>>>>> +                    process_info->eviction_fence);
>>>> Good catch as well, but Felix needs to take a look at this.
>>> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
>>>
>>> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
>>>
>>>        echo Y > /sys/module/amdgpu/parameters/debug_evictions
>>>
>>> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
>>>
>>> Regards,
>>>      Felix
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>         amdgpu_bo_unreserve(pd);
>>>>>           /* Update process info */

[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux