[AMD Official Use Only - Internal Distribution Only]

>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: Monday, January 18, 2021 3:49 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>;
>amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>
>Mhm, we could change amdgpu_fence_wait_empty() to time out. But I think
>that waiting forever here is intentional and the right thing to do.
>
>What happens is that we wait for the hardware to make sure that nothing is
>writing to any memory before we unload the driver.
>
>Now the VCN block has crashed and doesn't respond, but we can't guarantee
>that it is not accidentally writing anywhere.
>
>The only alternative we have is to time out and proceed with the driver unload,
>risking corrupting the memory we free during that should the hardware
>continue to do something.

Hi Christian,
Thanks for your suggestion, but it is still not quite clear to me. Could you describe in more detail how to avoid the kernel lockup?

>
>Regards,
>Christian.
>
>On 18.01.21 at 03:01, Deng, Emily wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Sent: Thursday, January 14, 2021 9:50 PM
>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>;
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>>
>>> On 14.01.21 at 03:00, Deng, Emily wrote:
>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>
>>>>> -----Original Message-----
>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>>>> Christian König
>>>>> Sent: Wednesday, January 13, 2021 10:03 PM
>>>>> To: Sun, Roy <Roy.Sun@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>>>>
>>>>> On 13.01.21 at 07:36, Roy Sun wrote:
>>>>>> This fixes a bug where, when the engine hangs, the fence ring wait
>>>>>> never returns and causes a kernel crash
>>>>> NAK, this blocking is intentionally unlimited because otherwise we
>>>>> will cause a memory corruption.
>>>>>
>>>>> What is the actual bug you are trying to fix here?
>>>> When some engine hangs during initialization, the IB test will fail
>>>> and the fence will never come back, so the wait on the fence never
>>>> returns. Why do we need to wait here forever? We'd better not use a
>>>> forever wait, which will cause kernel crash and lockup. And we have
>>>> amdgpu_fence_driver_force_completion, which will let the memory be
>>>> freed. We should remove all those forever waits, replace them with a
>>>> timeout, and do the correct error handling on timeout.
>>>
>>> This wait here is to make sure we never overwrite the software fence
>>> ring buffer. Without it we would not signal all fences in
>>> amdgpu_fence_driver_force_completion() and cause either memory leak
>>> or corruption.
>>>
>>> Waiting here forever is the right thing to do even when that means
>>> that the submission thread gets stuck forever, because that is still
>>> better than memory corruption.
>>>
>>> But this should never happen in practice and is only here for
>>> precaution. So do you really see that in practice?
>> Yes, we hit the issue when the VCN IB test fails. Could you give some
>> suggestions about how to fix this?
>>
>> [ 958.301685] failed to read reg:1a6c0
>> [ 959.036645] gmc_v10_0_process_interrupt: 42 callbacks suppressed
>> [ 959.036653] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.038043] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 959.039014] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.040202] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 959.041174] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.042353] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 959.043325] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.044508] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 959.045480] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.046659] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 959.047631] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.048815] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 959.049787] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.050973] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 959.051950] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0)
>> [ 959.053123] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18
>> [ 967.208705] amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc0 (-110).
>> [ 967.209879] [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
>>
>> [ 1209.384668] INFO: task modprobe:23957 blocked for more than 120 seconds.
>> [ 1209.385605] Tainted: G OE 5.4.0-58-generic #64~18.04.1-Ubuntu
>> [ 1209.386451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 1209.387342] modprobe D 0 23957 1221 0x80004006
>> [ 1209.387344] Call Trace:
>> [ 1209.387354] __schedule+0x293/0x720
>> [ 1209.387356] schedule+0x33/0xa0
>> [ 1209.387357] schedule_timeout+0x1d3/0x320
>> [ 1209.387359] ? schedule+0x33/0xa0
>> [ 1209.387360] ? schedule_timeout+0x1d3/0x320
>> [ 1209.387363] dma_fence_default_wait+0x1b2/0x1e0
>> [ 1209.387364] ? dma_fence_release+0x130/0x130
>> [ 1209.387366] dma_fence_wait_timeout+0xfd/0x110
>> [ 1209.387773] amdgpu_fence_wait_empty+0x90/0xc0 [amdgpu]
>> [ 1209.387839] amdgpu_fence_driver_fini+0xd6/0x110 [amdgpu]
>>> Regards,
>>> Christian.
>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Roy Sun <Roy.Sun@xxxxxxx>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48 ++++++++++++++++++++---
>>>>>> 1 file changed, 43 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index 6b0aeee61b8b..738ea65077ea 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -41,6 +41,8 @@
>>>>>> #include "amdgpu.h"
>>>>>> #include "amdgpu_trace.h"
>>>>>>
>>>>>> +#define AMDGPU_FENCE_TIMEOUT msecs_to_jiffies(1000)
>>>>>> +#define AMDGPU_FENCE_GFX_XGMI_TIMEOUT msecs_to_jiffies(2000)
>>>>>> /*
>>>>>> * Fences
>>>>>> * Fences mark an event in the GPUs pipeline and are used
>>>>>> @@ -104,6 +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring *ring, u32 seq)
>>>>>> *drv->cpu_addr = cpu_to_le32(seq);
>>>>>> }
>>>>>>
>>>>>> +/**
>>>>>> + * amdgpu_fence_wait_timeout - get the fence wait timeout
>>>>>> + *
>>>>>> + * @ring: ring the fence is associated with
>>>>>> + *
>>>>>> + * Returns the value of the fence wait timeout.
>>>>>> + */
>>>>>> +long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring)
>>>>>> +{
>>>>>> +long tmo_gfx, tmo_mm, tmo;
>>>>>> +struct amdgpu_device *adev = ring->adev;
>>>>>> +tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT;
>>>>>> +if (amdgpu_sriov_vf(adev)) {
>>>>>> +tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT;
>>>>>> +}
>>>>>> +if (amdgpu_sriov_runtime(adev)) {
>>>>>> +tmo_gfx = 8 * AMDGPU_FENCE_TIMEOUT;
>>>>>> +} else if (adev->gmc.xgmi.hive_id) {
>>>>>> +tmo_gfx = AMDGPU_FENCE_GFX_XGMI_TIMEOUT;
>>>>>> +}
>>>>>> +if (ring->funcs->type == AMDGPU_RING_TYPE_UVD ||
>>>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCE ||
>>>>>> +ring->funcs->type == AMDGPU_RING_TYPE_UVD_ENC ||
>>>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCN_DEC ||
>>>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC ||
>>>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCN_JPEG)
>>>>>> +tmo = tmo_mm;
>>>>>> +else
>>>>>> +tmo = tmo_gfx;
>>>>>> +return tmo;
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>> * amdgpu_fence_read - read a fence value
>>>>>> *
>>>>>> @@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>>>>>> rcu_read_unlock();
>>>>>>
>>>>>> if (old) {
>>>>>> -r = dma_fence_wait(old, false);
>>>>>> +long timeout;
>>>>>> +timeout = amdgpu_fence_wait_timeout(ring);
>>>>>> +r = dma_fence_wait_timeout(old, false, timeout);
>>>>>> dma_fence_put(old);
>>>>>> if (r)
>>>>>> -return r;
>>>>>> +return r < 0 ? r : 0;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> @@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct amdgpu_ring *ring)
>>>>>> return 0;
>>>>>> }
>>>>>> rcu_read_unlock();
>>>>>> -
>>>>>> -r = dma_fence_wait(fence, false);
>>>>>> +
>>>>>> +long timeout;
>>>>>> +timeout = amdgpu_fence_wait_timeout(ring);
>>>>>> +r = dma_fence_wait_timeout(fence, false, timeout);
>>>>>> dma_fence_put(fence);
>>>>>> -return r;
>>>>>> +return r < 0 ? r : 0;
>>>>>> }
>>>>>>
>>>>>> /**
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
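
The trade-off discussed in this thread can be sketched in plain userspace C. This is only an illustration, not driver code: it mirrors the timeout selection from the quoted patch and assumes the documented return convention of dma_fence_wait_timeout() (negative errno on error, 0 on timeout, remaining jiffies on success). The names ring_kind, wait_cfg, pick_timeout_ms, classify_wait and patch_return are hypothetical stand-ins, and timeouts are expressed in milliseconds rather than jiffies.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the ring types named in the patch. */
enum ring_kind { RING_GFX, RING_UVD, RING_VCE, RING_VCN_DEC, RING_VCN_ENC };

/* Hypothetical flags standing in for amdgpu_sriov_vf(),
 * amdgpu_sriov_runtime() and a non-zero adev->gmc.xgmi.hive_id. */
struct wait_cfg {
    bool sriov_vf;
    bool sriov_runtime;
    bool in_xgmi_hive;
};

#define FENCE_TIMEOUT_MS          1000L
#define FENCE_GFX_XGMI_TIMEOUT_MS 2000L

/* Same selection logic as the patch, in milliseconds instead of jiffies:
 * multimedia rings use tmo_mm, everything else uses tmo_gfx. */
long pick_timeout_ms(const struct wait_cfg *cfg, enum ring_kind kind)
{
    long tmo_mm = FENCE_TIMEOUT_MS;
    long tmo_gfx = FENCE_TIMEOUT_MS;

    if (cfg->sriov_vf)
        tmo_mm = 8 * FENCE_TIMEOUT_MS;
    if (cfg->sriov_runtime)
        tmo_gfx = 8 * FENCE_TIMEOUT_MS;
    else if (cfg->in_xgmi_hive)
        tmo_gfx = FENCE_GFX_XGMI_TIMEOUT_MS;

    return kind == RING_GFX ? tmo_gfx : tmo_mm;
}

enum wait_result { WAIT_SIGNALED, WAIT_TIMED_OUT, WAIT_ERROR };

/* Classify a dma_fence_wait_timeout()-style return value:
 * < 0 is an error, 0 means the wait timed out, > 0 means signaled. */
enum wait_result classify_wait(long r)
{
    if (r < 0)
        return WAIT_ERROR;
    return r == 0 ? WAIT_TIMED_OUT : WAIT_SIGNALED;
}

/* The value the patched amdgpu_fence_wait_empty() hands to its caller. */
int patch_return(long r)
{
    return r < 0 ? (int)r : 0;
}
```

Under these assumptions patch_return(0) is 0, so a timed-out wait looks like success to the caller. That is the case Christian objects to above: the unload path would go on to free memory while the hung engine might still be writing to it.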