RE: [PATCH] drm/amdgpu: change the fence ring wait timeout

"Deng, Emily" <Emily.Deng@xxxxxxx> · Mon, 18 Jan 2021 11:56:50 +0000



[AMD Official Use Only - Internal Distribution Only]

>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: Monday, January 18, 2021 3:49 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>;
>amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>
>Mhm, we could change amdgpu_fence_wait_empty() to timeout. But I think
>that waiting forever here is intentional and the right thing to do.
>
>What happens is that we wait for the hardware to make sure that nothing is
>writing to any memory before we unload the driver.
>
>Now the VCN block has crashed and doesn't respond, but we can't guarantee
>that it is not accidentally writing anywhere.
>
>The only alternative we have is to time out and proceed with the driver unload,
>risking corrupting the memory we free during that should the hardware
>continue to do something.
Hi Christian,
    Thanks your suggestion, but still not quite clearly, could you detail the solution to avoid kernel not lockup?
>
>Regards,
>Christian.
>
>Am 18.01.21 um 03:01 schrieb Deng, Emily:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Sent: Thursday, January 14, 2021 9:50 PM
>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>;
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>>
>>> Am 14.01.21 um 03:00 schrieb Deng, Emily:
>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>
>>>>> -----Original Message-----
>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>>>> Christian König
>>>>> Sent: Wednesday, January 13, 2021 10:03 PM
>>>>> To: Sun, Roy <Roy.Sun@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>>>>
>>>>> Am 13.01.21 um 07:36 schrieb Roy Sun:
>>>>>> This fix bug where when the engine hang, the fence ring will wait
>>>>>> without quit and cause kernel crash
>>>>> NAK, this blocking is intentional unlimited because otherwise we
>>>>> will cause a memory corruption.
>>>>>
>>>>> What is the actual bug you are trying to fix here?
>>>> When some engine hang during initialization, the IB test will fail,
>>>> and fence will never come back, then never could wait the fence back.
>>>> Why we need to wait here forever? We'd better not use forever wait
>>>> which
>>> will cause kernel crash and lockup. And we have
>>> amdgpu_fence_driver_force_completion will let memory free. We should
>>> remove all those forever wait, and replace them with one timeout,
>>> and do the correct error process if timeout.
>>>
>>> This wait here is to make sure we never overwrite the software fence
>>> ring buffer. Without it we would not signal all fences in
>>> amdgpu_fence_driver_force_completion() and cause either memory leak
>>> or corruption.
>>>
>>> Waiting here forever is the right thing to do even when that means
>>> that the submission thread gets stuck forever, cause that is still
>>> better than memory corruption.
>>>
>>> But this should never happen in practice and is only here for
>>> precaution. So do you really see that in practice?
>> Yes, we hit the issue when vcn ib test fail. Could you give some suggestions
>about how to fix this?
>> [  958.301685] failed to read reg:1a6c0
>>
>> [  959.036645] gmc_v10_0_process_interrupt: 42 callbacks suppressed
>>
>> [  959.036653] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.038043] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  959.039014] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.040202] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  959.041174] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.042353] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  959.043325] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.044508] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  959.045480] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.046659] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  959.047631] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.048815] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  959.049787] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.050973] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  959.051950] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>
>> [  959.053123] amdgpu 0000:00:07.0:   in page starting at address
>0x0000000000567000 from client 18
>>
>> [  967.208705] amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]]
>*ERROR* IB test failed on vcn_enc0 (-110).
>>
>> [  967.209879] [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test
>failed (-110).
>>
>>
>>
>> [ 1209.384668] INFO: task modprobe:23957 blocked for more than 120
>seconds.
>>
>> [ 1209.385605]       Tainted: G           OE     5.4.0-58-generic #64~18.04.1-
>Ubuntu
>>
>> [ 1209.386451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>disables this message.
>>
>> [ 1209.387342] modprobe        D    0 23957   1221 0x80004006
>>
>> [ 1209.387344] Call Trace:
>>
>> [ 1209.387354]  __schedule+0x293/0x720
>>
>> [ 1209.387356]  schedule+0x33/0xa0
>>
>> [ 1209.387357]  schedule_timeout+0x1d3/0x320
>>
>> [ 1209.387359]  ? schedule+0x33/0xa0
>>
>> [ 1209.387360]  ? schedule_timeout+0x1d3/0x320
>>
>> [ 1209.387363]  dma_fence_default_wait+0x1b2/0x1e0
>>
>> [ 1209.387364]  ? dma_fence_release+0x130/0x130
>>
>> [ 1209.387366]  dma_fence_wait_timeout+0xfd/0x110
>>
>> [ 1209.387773]  amdgpu_fence_wait_empty+0x90/0xc0 [amdgpu]
>>
>> [ 1209.387839]  amdgpu_fence_driver_fini+0xd6/0x110 [amdgpu]
>>> Regards,
>>> Christian.
>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Roy Sun <Roy.Sun@xxxxxxx>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48
>>>>> ++++++++++++++++++++---
>>>>>>     1 file changed, 43 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index 6b0aeee61b8b..738ea65077ea 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -41,6 +41,8 @@
>>>>>>     #include "amdgpu.h"
>>>>>>     #include "amdgpu_trace.h"
>>>>>>
>>>>>> +#define AMDGPU_FENCE_TIMEOUT  msecs_to_jiffies(1000) #define
>>>>>> +AMDGPU_FENCE_GFX_XGMI_TIMEOUT msecs_to_jiffies(2000)
>>>>>>     /*
>>>>>>      * Fences
>>>>>>      * Fences mark an event in the GPUs pipeline and are used @@
>>>>>> -104,6
>>>>>> +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring
>>>>>> +*ring,
>>>>>> +u32
>>>>> seq)
>>>>>>     *drv->cpu_addr = cpu_to_le32(seq);
>>>>>>     }
>>>>>>
>>>>>> +/**
>>>>>> + * amdgpu_fence_wait_timeout - get the fence wait timeout
>>>>>> + *
>>>>>> + * @ring: ring the fence is associated with
>>>>>> + *
>>>>>> + * Returns the value of the fence wait timeout.
>>>>>> + */
>>>>>> +long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring) { long
>>>>>> +tmo_gfx, tmo_mm, tmo; struct amdgpu_device *adev = ring->adev;
>>>>>> +tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT; if
>>> (amdgpu_sriov_vf(adev))
>>>>>> +{ tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT; } if
>>>>>> +(amdgpu_sriov_runtime(adev)) { tmo_gfx = 8 *
>>> AMDGPU_FENCE_TIMEOUT;
>>>>>> +} else if (adev->gmc.xgmi.hive_id) { tmo_gfx =
>>>>>> +AMDGPU_FENCE_GFX_XGMI_TIMEOUT; } if (ring->funcs->type ==
>>>>>> +AMDGPU_RING_TYPE_UVD ||
>>>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCE || type ==
>>>>>> +ring->funcs->AMDGPU_RING_TYPE_UVD_ENC || type ==
>>>>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_DEC || type ==
>>>>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_ENC || type ==
>>>>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_JPEG)
>>>>>> +tmo = tmo_mm;
>>>>>> +else
>>>>>> +tmo = tmo_gfx;
>>>>>> +return tmo;
>>>>>> +}
>>>>>> +
>>>>>>     /**
>>>>>>      * amdgpu_fence_read - read a fence value
>>>>>>      *
>>>>>> @@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring
>>>>>> *ring,
>>>>> struct dma_fence **f,
>>>>>>     rcu_read_unlock();
>>>>>>
>>>>>>     if (old) {
>>>>>> -r = dma_fence_wait(old, false);
>>>>>> +long timeout;
>>>>>> +timeout = amdgpu_fence_wait_timeout(ring); r =
>>>>>> +dma_fence_wait_timeout(old, false, timeout);
>>>>>>     dma_fence_put(old);
>>>>>>     if (r)
>>>>>> -return r;
>>>>>> +return r < 0 ? r : 0;
>>>>>>     }
>>>>>>     }
>>>>>>
>>>>>> @@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct
>>>>> amdgpu_ring *ring)
>>>>>>     return 0;
>>>>>>     }
>>>>>>     rcu_read_unlock();
>>>>>> -
>>>>>> -r = dma_fence_wait(fence, false);
>>>>>> +
>>>>>> +long timeout;
>>>>>> +timeout = amdgpu_fence_wait_timeout(ring); r =
>>>>>> +dma_fence_wait_timeout(fence, false, timeout);
>>>>>>     dma_fence_put(fence);
>>>>>> -return r;
>>>>>> +return r < 0 ? r : 0;
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>>>>> is
>>>>> ts.f
>>>>> reedesktop.org%2Fmailman%2Flistinfo%2Famd-
>>>>>
>>>
>gfx&amp;data=04%7C01%7CEmily.Deng%40amd.com%7C8b116229938b463
>>>
>df87f08d8b7cbf607%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7
>>>
>C637461433936049544%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
>>>
>MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sda
>>>
>ta=HOcLHmmblOUHXATFBl5HC6LOmFq0oXglAh2GFwd6sus%3D&amp;reserve
>>>>> d=0

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx