RE: [PATCH] drm/amdgpu: change the fence ring wait timeout

"Deng, Emily" <Emily.Deng@xxxxxxx> · Mon, 18 Jan 2021 02:01:17 +0000

[AMD Official Use Only - Internal Distribution Only]

>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: Thursday, January 14, 2021 9:50 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>;
>amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>
>On 14.01.21 at 03:00, Deng, Emily wrote:
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>> Christian König
>>> Sent: Wednesday, January 13, 2021 10:03 PM
>>> To: Sun, Roy <Roy.Sun@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>>
>>> On 13.01.21 at 07:36, Roy Sun wrote:
>>>> This fixes a bug where, when the engine hangs, the fence ring wait
>>>> never returns and causes a kernel crash.
>>> NAK, this blocking is intentionally unbounded because otherwise we
>>> would cause memory corruption.
>>>
>>> What is the actual bug you are trying to fix here?
>> When an engine hangs during initialization, the IB test fails and the
>> fence never signals, so the wait on it never returns.
>> Why do we need to wait here forever? We had better not use an unbounded
>> wait, which causes a kernel crash and lockup. We also have
>> amdgpu_fence_driver_force_completion, which lets the memory be freed.
>> We should remove all of these unbounded waits, replace them with a
>> timeout, and handle the error properly when the timeout expires.
>
>This wait here is to make sure we never overwrite the software fence ring
>buffer. Without it we would not signal all fences in
>amdgpu_fence_driver_force_completion() and cause either a memory leak or
>corruption.
>
>Waiting here forever is the right thing to do even when that means that the
>submission thread gets stuck forever, because that is still better than
>memory corruption.
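
To make the slot-reuse argument above concrete, here is a small user-space
model. It is only a sketch under assumptions: the toy_* names, the slot
count of 4, and the bookkeeping are hypothetical and are not the driver's
code. It models a fixed array of fence slots indexed by sequence number,
an engine that never signals, and a force-completion pass that can only
signal fences still present in the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_SLOTS 4u                    /* hypothetical, power of two */
#define TOY_SLOT_MASK (TOY_NUM_SLOTS - 1u)

struct toy_fence {
	uint32_t seq;
	bool in_use;
	bool signaled;
};

struct toy_ring {
	struct toy_fence slots[TOY_NUM_SLOTS];
	uint32_t emitted;   /* last sequence number handed out */
	unsigned int lost;  /* fences overwritten while still unsignaled */
};

/*
 * Emit the next fence into slot (seq & mask). If the slot still holds an
 * unsignaled fence, either refuse (the point where a real submitter would
 * block) or overwrite it, after which nobody can signal the old fence.
 */
static bool toy_emit(struct toy_ring *ring, bool wait_for_old)
{
	uint32_t seq = ring->emitted + 1;
	struct toy_fence *slot = &ring->slots[seq & TOY_SLOT_MASK];

	if (slot->in_use && !slot->signaled) {
		if (wait_for_old)
			return false;
		ring->lost++;
	}
	slot->seq = seq;
	slot->in_use = true;
	slot->signaled = false;
	ring->emitted = seq;
	return true;
}

/* Signal every fence still reachable through the slot array. */
static unsigned int toy_force_completion(struct toy_ring *ring)
{
	unsigned int i, n = 0;

	for (i = 0; i < TOY_NUM_SLOTS; i++) {
		struct toy_fence *slot = &ring->slots[i];

		if (slot->in_use && !slot->signaled) {
			slot->signaled = true;
			n++;
		}
	}
	return n;
}

/*
 * Try a number of submissions against an engine that never signals, then
 * force completion. Returns how many fences were lost; *signaled receives
 * how many fences the force-completion pass could still reach.
 */
static unsigned int toy_run(unsigned int submissions, bool wait_for_old,
			    unsigned int *signaled)
{
	struct toy_ring ring = {0};
	unsigned int i;

	assert(signaled != NULL);
	for (i = 0; i < submissions; i++)
		if (!toy_emit(&ring, wait_for_old))
			break;
	*signaled = toy_force_completion(&ring);
	return ring.lost;
}
```

In this model, six submissions without waiting overwrite two unsignaled
fences, and force completion only reaches the four left in the array.
With the wait, the fifth submission stops instead and nothing is lost.
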
>
>But this should never happen in practice and is only here as a precaution.
>So do you really see that in practice?
Yes, we hit this issue when the VCN IB test fails. Could you give some suggestions on how to fix it?
[  958.301685] failed to read reg:1a6c0
[  959.036645] gmc_v10_0_process_interrupt: 42 callbacks suppressed
[  959.036653] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.038043] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.039014] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.040202] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.041174] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.042353] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.043325] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.044508] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.045480] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.046659] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.047631] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.048815] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.049787] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.050973] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.051950] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.053123] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  967.208705] amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc0 (-110).
[  967.209879] [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
[ 1209.384668] INFO: task modprobe:23957 blocked for more than 120 seconds.
[ 1209.385605]       Tainted: G           OE     5.4.0-58-generic #64~18.04.1-Ubuntu
[ 1209.386451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1209.387342] modprobe        D    0 23957   1221 0x80004006
[ 1209.387344] Call Trace:
[ 1209.387354]  __schedule+0x293/0x720
[ 1209.387356]  schedule+0x33/0xa0
[ 1209.387357]  schedule_timeout+0x1d3/0x320
[ 1209.387359]  ? schedule+0x33/0xa0
[ 1209.387360]  ? schedule_timeout+0x1d3/0x320
[ 1209.387363]  dma_fence_default_wait+0x1b2/0x1e0
[ 1209.387364]  ? dma_fence_release+0x130/0x130
[ 1209.387366]  dma_fence_wait_timeout+0xfd/0x110
[ 1209.387773]  amdgpu_fence_wait_empty+0x90/0xc0 [amdgpu]
[ 1209.387839]  amdgpu_fence_driver_fini+0xd6/0x110 [amdgpu]
>
>Regards,
>Christian.
>
>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: Roy Sun <Roy.Sun@xxxxxxx>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48 ++++++++++++++++++++---
>>>>    1 file changed, 43 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 6b0aeee61b8b..738ea65077ea 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -41,6 +41,8 @@
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_trace.h"
>>>>
>>>> +#define AMDGPU_FENCE_TIMEOUT  msecs_to_jiffies(1000)
>>>> +#define AMDGPU_FENCE_GFX_XGMI_TIMEOUT msecs_to_jiffies(2000)
>>>>    /*
>>>>     * Fences
>>>>     * Fences mark an event in the GPUs pipeline and are used
>>>> @@ -104,6 +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring *ring, u32 seq)
>>>>    *drv->cpu_addr = cpu_to_le32(seq);
>>>>    }
>>>>
>>>> +/**
>>>> + * amdgpu_fence_wait_timeout - get the fence wait timeout
>>>> + *
>>>> + * @ring: ring the fence is associated with
>>>> + *
>>>> + * Returns the value of the fence wait timeout.
>>>> + */
>>>> +long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring)
>>>> +{
>>>> +long tmo_gfx, tmo_mm, tmo;
>>>> +struct amdgpu_device *adev = ring->adev;
>>>> +tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT;
>>>> +if (amdgpu_sriov_vf(adev)) {
>>>> +tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT;
>>>> +}
>>>> +if (amdgpu_sriov_runtime(adev)) {
>>>> +tmo_gfx = 8 * AMDGPU_FENCE_TIMEOUT;
>>>> +} else if (adev->gmc.xgmi.hive_id) {
>>>> +tmo_gfx = AMDGPU_FENCE_GFX_XGMI_TIMEOUT;
>>>> +}
>>>> +if (ring->funcs->type == AMDGPU_RING_TYPE_UVD ||
>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCE ||
>>>> +ring->funcs->type == AMDGPU_RING_TYPE_UVD_ENC ||
>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCN_DEC ||
>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC ||
>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCN_JPEG)
>>>> +tmo = tmo_mm;
>>>> +else
>>>> +tmo = tmo_gfx;
>>>> +return tmo;
>>>> +}
>>>> +
>>>>    /**
>>>>     * amdgpu_fence_read - read a fence value
>>>>     *
>>>> @@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>>>>    rcu_read_unlock();
>>>>
>>>>    if (old) {
>>>> -r = dma_fence_wait(old, false);
>>>> +long timeout;
>>>> +timeout = amdgpu_fence_wait_timeout(ring);
>>>> +r = dma_fence_wait_timeout(old, false, timeout);
>>>>    dma_fence_put(old);
>>>>    if (r)
>>>> -return r;
>>>> +return r < 0 ? r : 0;
>>>>    }
>>>>    }
>>>>
>>>> @@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct amdgpu_ring *ring)
>>>>    return 0;
>>>>    }
>>>>    rcu_read_unlock();
>>>> -
>>>> -r = dma_fence_wait(fence, false);
>>>> +
>>>> +long timeout;
>>>> +timeout = amdgpu_fence_wait_timeout(ring);
>>>> +r = dma_fence_wait_timeout(fence, false, timeout);
>>>>    dma_fence_put(fence);
>>>> -return r;
>>>> +return r < 0 ? r : 0;
>>>>    }
>>>>
>>>>    /**
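
A side note on the return handling in the two hunks above, offered as a
sketch and not as a reviewed fix. dma_fence_wait_timeout() returns a
negative error code if the wait is interrupted, 0 if the wait timed out,
and the remaining jiffies on success. With "return r < 0 ? r : 0;" a
timeout is therefore reported to the caller as success. The helper below
is hypothetical (the toy_* names are not kernel API) and only illustrates,
in user-space C, one way to keep the three cases apart. -ETIMEDOUT is the
same -110 that the IB test reports in the log earlier in this mail.

```c
#include <assert.h>
#include <errno.h>

/*
 * Map a dma_fence_wait_timeout()-style result to the usual
 * "0 or negative errno" convention:
 *   r < 0  -> error from the wait itself, passed through
 *   r == 0 -> the timeout expired before the fence signaled
 *   r > 0  -> fence signaled with r jiffies to spare
 */
static int toy_wait_result_to_errno(long r)
{
	if (r < 0)
		return (int)r;
	if (r == 0)
		return -ETIMEDOUT;
	return 0;
}

/* Small self-check; -EINTR stands in for any negative wait error. */
static void toy_selfcheck(void)
{
	assert(toy_wait_result_to_errno(-EINTR) == -EINTR);
	assert(toy_wait_result_to_errno(0) == -ETIMEDOUT);
	assert(toy_wait_result_to_errno(250) == 0);
}
```

Whether a caller may do anything other than keep waiting after such a
timeout is exactly the open question in this thread, so the sketch only
shows the mapping and not the recovery.
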

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx