[AMD Official Use Only - Internal Distribution Only] >-----Original Message----- >From: Koenig, Christian <Christian.Koenig@xxxxxxx> >Sent: Thursday, January 14, 2021 9:50 PM >To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>; >amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout > >Am 14.01.21 um 03:00 schrieb Deng, Emily: >> [AMD Official Use Only - Internal Distribution Only] >> >>> -----Original Message----- >>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >>> Christian König >>> Sent: Wednesday, January 13, 2021 10:03 PM >>> To: Sun, Roy <Roy.Sun@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout >>> >>> Am 13.01.21 um 07:36 schrieb Roy Sun: >>>> This fix bug where when the engine hang, the fence ring will wait >>>> without quit and cause kernel crash >>> NAK, this blocking is intentional unlimited because otherwise we will >>> cause a memory corruption. >>> >>> What is the actual bug you are trying to fix here? >> When some engine hang during initialization, the IB test will fail, >> and fence will never come back, then never could wait the fence back. >> Why we need to wait here forever? We'd better not use forever wait which >will cause kernel crash and lockup. And we have >amdgpu_fence_driver_force_completion will let memory free. We should >remove all those forever wait, and replace them with one timeout, and do >the correct error process if timeout. > >This wait here is to make sure we never overwrite the software fence ring >buffer. Without it we would not signal all fences in >amdgpu_fence_driver_force_completion() and cause either memory leak or >corruption. > >Waiting here forever is the right thing to do even when that means that the >submission thread gets stuck forever, cause that is still better than memory >corruption. > >But this should never happen in practice and is only here for precaution. So do >you really see that in practice? Yes, we hit the issue when vcn ib test fail. Could you give some suggestions about how to fix this? [ 958.301685] failed to read reg:1a6c0 [ 959.036645] gmc_v10_0_process_interrupt: 42 callbacks suppressed [ 959.036653] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.038043] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 959.039014] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.040202] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 959.041174] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.042353] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 959.043325] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.044508] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 959.045480] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.046659] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 959.047631] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.048815] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 959.049787] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.050973] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 959.051950] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process pid 0 thread pid 0) [ 959.053123] amdgpu 0000:00:07.0: in page starting at address 0x0000000000567000 from client 18 [ 967.208705] amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc0 (-110). [ 967.209879] [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110). [ 1209.384668] INFO: task modprobe:23957 blocked for more than 120 seconds. [ 1209.385605] Tainted: G OE 5.4.0-58-generic #64~18.04.1-Ubuntu [ 1209.386451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1209.387342] modprobe D 0 23957 1221 0x80004006 [ 1209.387344] Call Trace: [ 1209.387354] __schedule+0x293/0x720 [ 1209.387356] schedule+0x33/0xa0 [ 1209.387357] schedule_timeout+0x1d3/0x320 [ 1209.387359] ? schedule+0x33/0xa0 [ 1209.387360] ? schedule_timeout+0x1d3/0x320 [ 1209.387363] dma_fence_default_wait+0x1b2/0x1e0 [ 1209.387364] ? dma_fence_release+0x130/0x130 [ 1209.387366] dma_fence_wait_timeout+0xfd/0x110 [ 1209.387773] amdgpu_fence_wait_empty+0x90/0xc0 [amdgpu] [ 1209.387839] amdgpu_fence_driver_fini+0xd6/0x110 [amdgpu] > >Regards, >Christian. > >> >>> Regards, >>> Christian. >>> >>>> Signed-off-by: Roy Sun <Roy.Sun@xxxxxxx> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48 >>> ++++++++++++++++++++--- >>>> 1 file changed, 43 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> index 6b0aeee61b8b..738ea65077ea 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> @@ -41,6 +41,8 @@ >>>> #include "amdgpu.h" >>>> #include "amdgpu_trace.h" >>>> >>>> +#define AMDGPU_FENCE_TIMEOUT msecs_to_jiffies(1000) #define >>>> +AMDGPU_FENCE_GFX_XGMI_TIMEOUT msecs_to_jiffies(2000) >>>> /* >>>> * Fences >>>> * Fences mark an event in the GPUs pipeline and are used @@ >>>> -104,6 >>>> +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring *ring, >>>> +u32 >>> seq) >>>> *drv->cpu_addr = cpu_to_le32(seq); >>>> } >>>> >>>> +/** >>>> + * amdgpu_fence_wait_timeout - get the fence wait timeout >>>> + * >>>> + * @ring: ring the fence is associated with >>>> + * >>>> + * Returns the value of the fence wait timeout. >>>> + */ >>>> +long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring) { long >>>> +tmo_gfx, tmo_mm, tmo; struct amdgpu_device *adev = ring->adev; >>>> +tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT; if >(amdgpu_sriov_vf(adev)) >>>> +{ tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT; } if >>>> +(amdgpu_sriov_runtime(adev)) { tmo_gfx = 8 * >AMDGPU_FENCE_TIMEOUT; >>>> +} else if (adev->gmc.xgmi.hive_id) { tmo_gfx = >>>> +AMDGPU_FENCE_GFX_XGMI_TIMEOUT; } if (ring->funcs->type == >>>> +AMDGPU_RING_TYPE_UVD || >>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCE || type == >>>> +ring->funcs->AMDGPU_RING_TYPE_UVD_ENC || type == >>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_DEC || type == >>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_ENC || type == >>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_JPEG) >>>> +tmo = tmo_mm; >>>> +else >>>> +tmo = tmo_gfx; >>>> +return tmo; >>>> +} >>>> + >>>> /** >>>> * amdgpu_fence_read - read a fence value >>>> * >>>> @@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring >>>> *ring, >>> struct dma_fence **f, >>>> rcu_read_unlock(); >>>> >>>> if (old) { >>>> -r = dma_fence_wait(old, false); >>>> +long timeout; >>>> +timeout = amdgpu_fence_wait_timeout(ring); r = >>>> +dma_fence_wait_timeout(old, false, timeout); >>>> dma_fence_put(old); >>>> if (r) >>>> -return r; >>>> +return r < 0 ? r : 0; >>>> } >>>> } >>>> >>>> @@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct >>> amdgpu_ring *ring) >>>> return 0; >>>> } >>>> rcu_read_unlock(); >>>> - >>>> -r = dma_fence_wait(fence, false); >>>> + >>>> +long timeout; >>>> +timeout = amdgpu_fence_wait_timeout(ring); r = >>>> +dma_fence_wait_timeout(fence, false, timeout); >>>> dma_fence_put(fence); >>>> -return r; >>>> +return r < 0 ? r : 0; >>>> } >>>> >>>> /** >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis >>> ts.f >>> reedesktop.org%2Fmailman%2Flistinfo%2Famd- >>> >gfx&data=04%7C01%7CEmily.Deng%40amd.com%7C8b116229938b463 >>> >df87f08d8b7cbf607%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7 >>> >C637461433936049544%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw >>> >MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sda >>> >ta=HOcLHmmblOUHXATFBl5HC6LOmFq0oXglAh2GFwd6sus%3D&reserve >>> d=0 _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx