Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

On 19.01.21 at 04:23, Deng, Emily wrote:
[AMD Official Use Only - Internal Distribution Only]

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
Christian König
Sent: Tuesday, January 19, 2021 12:04 AM
To: Deng, Emily <Emily.Deng@xxxxxxx>; Koenig, Christian
<Christian.Koenig@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>; amd-
gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

On 18.01.21 at 12:56, Deng, Emily wrote:

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Monday, January 18, 2021 3:49 PM
To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy <Roy.Sun@xxxxxxx>;
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

Mhm, we could change amdgpu_fence_wait_empty() to time out. But I
think that waiting forever here is intentional and the right thing to do.

What happens is that we wait for the hardware to make sure that
nothing is writing to any memory before we unload the driver.

Now the VCN block has crashed and doesn't respond, but we can't
guarantee that it is not accidentally writing anywhere.

The only alternative we have is to time out and proceed with the
driver unload, risking corruption of the memory we free during unload
should the hardware continue to write.
Hi Christian,
      Thanks for your suggestion, but it is still not quite clear to me. Could you
detail the solution to avoid the kernel lockup?

Well, as I said, the kernel lockup here is intentional.
So you think the lockup is better than only some memory corruption?

Yes exactly.

We could give more time to wait, such as 60s. I don't think the fence will fail to signal within 60s if the engine is good, so
with the timeout the good-engine case won't cause memory corruption. When the engine is bad, the fence will never be signaled, so even if we force completion it still won't cause memory corruption.

Unfortunately it's not that easy. For example, some engines keep updating statistics (how many bytes written, samples encoded etc...) even when the engine is stuck in an endless loop.

As for SR-IOV, when the engine
is bad we can still do a recovery and reload the driver to make it work again, so we don't want the kernel lockup.

Well, that is certainly something you could try, e.g. wait for empty with a timeout, and if that times out trigger a GPU reset and wait again.

An alternative would be to see if we could disable system memory access from the device altogether through the PCIe config registers if we can't get it idle again.

Regards,
Christian.

Regards,
Christian.

Regards,
Christian.

On 18.01.21 at 03:01, Deng, Emily wrote:

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Thursday, January 14, 2021 9:50 PM
To: Deng, Emily <Emily.Deng@xxxxxxx>; Sun, Roy
<Roy.Sun@xxxxxxx>;
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

On 14.01.21 at 03:00, Deng, Emily wrote:

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
Of Christian König
Sent: Wednesday, January 13, 2021 10:03 PM
To: Sun, Roy <Roy.Sun@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait
timeout

On 13.01.21 at 07:36, Roy Sun wrote:
This fixes a bug where, when the engine hangs, the fence ring
waits without quitting and causes a kernel crash.
NAK, this blocking is intentionally unlimited because otherwise we
will cause a memory corruption.

What is the actual bug you are trying to fix here?
When some engine hangs during initialization, the IB test will
fail, the fence will never come back, and then we can never wait the fence
back.
Why do we need to wait here forever? We'd better not use a forever wait,
which will cause a kernel crash and lockup. And we have
amdgpu_fence_driver_force_completion to let the memory be freed. We
should remove all those forever waits, replace them with a
timeout, and do the correct error handling on timeout.

This wait here is to make sure we never overwrite the software
fence ring buffer. Without it we would not signal all fences in
amdgpu_fence_driver_force_completion() and would cause either a memory
leak or corruption.

Waiting here forever is the right thing to do even when that means
that the submission thread gets stuck forever, because that is still
better than memory corruption.

But this should never happen in practice and is only here as a
precaution. So do you really see that in practice?
Yes, we hit the issue when the VCN IB test fails. Could you give some
suggestions about how to fix this?
[  958.301685] failed to read reg:1a6c0
[  959.036645] gmc_v10_0_process_interrupt: 42 callbacks suppressed
[  959.036653] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.038043] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.039014] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.040202] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.041174] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.042353] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.043325] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.044508] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.045480] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.046659] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.047631] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.048815] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.049787] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.050973] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  959.051950] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0 ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  959.053123] amdgpu 0000:00:07.0:   in page starting at address 0x0000000000567000 from client 18
[  967.208705] amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc0 (-110).
[  967.209879] [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
[ 1209.384668] INFO: task modprobe:23957 blocked for more than 120 seconds.
[ 1209.385605]       Tainted: G           OE     5.4.0-58-generic #64~18.04.1-Ubuntu
[ 1209.386451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1209.387342] modprobe        D    0 23957   1221 0x80004006
[ 1209.387344] Call Trace:
[ 1209.387354]  __schedule+0x293/0x720
[ 1209.387356]  schedule+0x33/0xa0
[ 1209.387357]  schedule_timeout+0x1d3/0x320
[ 1209.387359]  ? schedule+0x33/0xa0
[ 1209.387360]  ? schedule_timeout+0x1d3/0x320
[ 1209.387363]  dma_fence_default_wait+0x1b2/0x1e0
[ 1209.387364]  ? dma_fence_release+0x130/0x130
[ 1209.387366]  dma_fence_wait_timeout+0xfd/0x110
[ 1209.387773]  amdgpu_fence_wait_empty+0x90/0xc0 [amdgpu]
[ 1209.387839]  amdgpu_fence_driver_fini+0xd6/0x110 [amdgpu]
Regards,
Christian.

Regards,
Christian.

Signed-off-by: Roy Sun <Roy.Sun@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48 ++++++++++++++++++++---
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 6b0aeee61b8b..738ea65077ea 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -41,6 +41,8 @@
 #include "amdgpu.h"
 #include "amdgpu_trace.h"

+#define AMDGPU_FENCE_TIMEOUT		msecs_to_jiffies(1000)
+#define AMDGPU_FENCE_GFX_XGMI_TIMEOUT	msecs_to_jiffies(2000)
 /*
  * Fences
  * Fences mark an event in the GPUs pipeline and are used
@@ -104,6 +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring *ring, u32 seq)
 	*drv->cpu_addr = cpu_to_le32(seq);
 }

+/**
+ * amdgpu_fence_wait_timeout - get the fence wait timeout
+ *
+ * @ring: ring the fence is associated with
+ *
+ * Returns the value of the fence wait timeout.
+ */
+long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring)
+{
+	long tmo_gfx, tmo_mm, tmo;
+	struct amdgpu_device *adev = ring->adev;
+
+	tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT;
+	if (amdgpu_sriov_vf(adev))
+		tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT;
+	if (amdgpu_sriov_runtime(adev))
+		tmo_gfx = 8 * AMDGPU_FENCE_TIMEOUT;
+	else if (adev->gmc.xgmi.hive_id)
+		tmo_gfx = AMDGPU_FENCE_GFX_XGMI_TIMEOUT;
+	if (ring->funcs->type == AMDGPU_RING_TYPE_UVD ||
+	    ring->funcs->type == AMDGPU_RING_TYPE_VCE ||
+	    ring->funcs->type == AMDGPU_RING_TYPE_UVD_ENC ||
+	    ring->funcs->type == AMDGPU_RING_TYPE_VCN_DEC ||
+	    ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC ||
+	    ring->funcs->type == AMDGPU_RING_TYPE_VCN_JPEG)
+		tmo = tmo_mm;
+	else
+		tmo = tmo_gfx;
+	return tmo;
+}
+
 /**
  * amdgpu_fence_read - read a fence value
  *
@@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
 	rcu_read_unlock();

 	if (old) {
-		r = dma_fence_wait(old, false);
+		long timeout;
+		timeout = amdgpu_fence_wait_timeout(ring);
+		r = dma_fence_wait_timeout(old, false, timeout);
 		dma_fence_put(old);
 		if (r)
-			return r;
+			return r < 0 ? r : 0;
 	}
 }

@@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct amdgpu_ring *ring)
 		return 0;
 	}
 	rcu_read_unlock();
-
-	r = dma_fence_wait(fence, false);
+
+	long timeout;
+	timeout = amdgpu_fence_wait_timeout(ring);
+	r = dma_fence_wait_timeout(fence, false, timeout);
 	dma_fence_put(fence);
-	return r;
+	return r < 0 ? r : 0;
 }

 /**

      /**
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
