On 03.02.21 13:24, Daniel Vetter wrote:
On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
On 03.02.21 12:45, Daniel Gomez wrote:
On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@xxxxxxxx> wrote:
On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@xxxxxxxx> wrote:
On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@xxxxxxx> wrote:
On 03.02.21 09:48, Daniel Vetter wrote:
On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@xxxxxxx> wrote:
Hi Daniel,
this is not a deadlock, but rather a hardware lockup.
Are you sure? In my experience, getting stuck in dma_fence_wait generally has a good
chance of being a dma_fence deadlock. A GPU hang should never result in
a forever-stuck dma_fence.
Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
this.
Maybe to clarify: it could be both. TDR should notice and get us out of
this, but if there's a dma_fence deadlock and we can't re-emit or
force-complete the pending things, then we're stuck for good.
-Daniel
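(To spell out what "force complete" means at the dma_fence level, here is a
minimal sketch assuming a generic driver reset path; this is not amdgpu's
actual recovery code, which goes through the drm_sched timeout handling.)

/* Minimal sketch, not amdgpu code: force-complete a pending fence
 * from a reset path so that dma_fence_wait() callers wake up instead
 * of blocking forever. */
#include <linux/dma-fence.h>
#include <linux/errno.h>

static void force_complete_fence(struct dma_fence *fence)
{
        /* Record that the fence completed abnormally; this must be
         * done before signalling. */
        dma_fence_set_error(fence, -ETIME);
        /* Signal the fence, waking every waiter. */
        dma_fence_signal(fence);
}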
The question is rather why we end up in the userptr handling for GFX. Our
ROCm OpenCL stack shouldn't use this.
Daniel, can you please re-hang your machine, dump backtraces of all tasks
into dmesg with sysrq-t, and attach that? Without all the backtraces it's
tricky to reconstruct the full dependency chain of what's going on. Also,
is this plain -rc6, with no more patches on top?
Yeah, that's still a good idea to have.
Here are the full backtrace dmesg logs after the hang:
https://pastebin.com/raw/kzivm2L3
This is another dmesg log with the backtraces after sending SIGKILL to the
matrix process (I didn't have sysrq enabled at the time):
https://pastebin.com/raw/pRBwGcj1
I've now removed all our v4l2 patches and did the same test with the 'plain'
mainline version (-rc6).
Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
Same error, same behaviour. Full dmesg log attached:
https://pastebin.com/raw/KgaEf7Y1
Note:
dmesg with sysrq-t before running the test starts at [ 122.016502] sysrq: Show State
dmesg with sysrq-t after the test starts at [ 495.587671] sysrq: Show State
There is nothing amdgpu-related in there except for waiting for the
hardware.
Yeah, but there's also no other driver that could cause a stuck dma_fence,
so why is reset not cleaning up the mess here? Irrespective of why the GPU
is stuck, the kernel should at least complete all the dma_fences, even if
the GPU is for some reason terminally ill ...
That's a good question as well. I'm digging into this.
My best theory is that the amdgpu packages disabled GPU reset for some
reason.
But the much more interesting question is why we end up in this call
path. I've pinged internally, but east coast is not awake yet :)
Christian.
-Daniel
This is a pretty standard hardware lockup, but I'm still waiting for an
explanation of why we end up in this call path in the first place.
Christian.
Christian.
-Daniel
Which OpenCL stack are you using?
Regards,
Christian.
On 03.02.21 09:33, Daniel Gomez wrote:
Hi all,
I have a deadlock with the mainline amdgpu driver when running two OpenCL
applications in parallel. So far, we've been able to replicate it easily by
executing clinfo and MatrixMultiplication (from the AMD opencl-samples). The
opencl-samples are quite old, so if you have any other suggestion for
testing I'd be very happy to try it as well.
How to replicate the issue:
# while true; do /usr/bin/MatrixMultiplication --device gpu \
--deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
# while true; do clinfo; done
Output:
After a minute or less (sometimes more) I can see that
MatrixMultiplication and clinfo hang. In addition, with radeontop you can see
how the graphics pipe goes from ~50% to 100%, and the shader clocks
go up from ~35% to ~96%.
clinfo keeps printing:
ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
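(That -ETIME is just a finite syncobj wait expiring and being retried. A
hedged sketch of such a wait-and-retry loop, with hypothetical drm_fd,
syncobj_handle and deadline_ns; struct fields follow include/uapi/drm/drm.h:)

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <libdrm/drm.h>

static int wait_syncobj(int drm_fd, uint32_t syncobj_handle,
                        int64_t deadline_ns)
{
        struct drm_syncobj_wait wait = {
                .handles = (uintptr_t)&syncobj_handle, /* ptr to handle array */
                .timeout_nsec = deadline_ns,           /* absolute deadline */
                .count_handles = 1,
                .flags = DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL,
        };
        int ret;

        do {
                ret = ioctl(drm_fd, DRM_IOCTL_SYNCOBJ_WAIT, &wait);
        } while (ret == -1 && errno == ETIME); /* loops forever if the
                                                  fence never signals */
        return ret;
}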
And MatrixMultiplication prints the following (strace) if you try to
kill the process:
sched_yield() = 0
futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
<detached ...>
After this, the GPU is not functional at all and you need a power cycle
to restore the system.
Hardware info:
CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)
DeviceName: Broadcom 5762
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Linux kernel info:
root@qt5222:~# uname -a
Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
By enabling the kernel lock stats I could see that MatrixMultiplication is
hung in the amdgpu_mn_invalidate_gfx function:
[ 738.359202] 1 lock held by MatrixMultiplic/653:
[ 738.359206] #0: ffff88810e364fe0 (&adev->notifier_lock){+.+.}-{3:3}, at: amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
Looking at the amdgpu_mn_invalidate_gfx function, dma_resv_wait_timeout_rcu
is called with wait_all (all fences) and MAX_SCHEDULE_TIMEOUT, so I guess
the code gets stuck there waiting forever. According to the
documentation: "When somebody tries to invalidate the page tables we block the
update until all operations on the pages in question are completed, then those
pages are marked as accessed and also dirty if it wasn’t a read only access."
Looks like the fences are deadlocked and therefore the wait never returns.
Could that be the case? Any hint as to where I can look to fix this?
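For reference, here is the wait in question, condensed from
amdgpu_mn_invalidate_gfx() in drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c as I
read it in v5.11 (error handling trimmed, check the tree for the exact code):

/* Condensed from v5.11 amdgpu_mn_invalidate_gfx(); not verbatim. */
mutex_lock(&adev->notifier_lock);
mmu_interval_set_seq(mni, cur_seq);
r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
                              true,   /* wait_all: wait for every fence */
                              false,  /* intr: uninterruptible sleep */
                              MAX_SCHEDULE_TIMEOUT);
mutex_unlock(&adev->notifier_lock);

With an uninterruptible wait and MAX_SCHEDULE_TIMEOUT this only returns once
every fence on the BO's reservation object has signalled, which matches both
the state-D task and the held notifier_lock in the lockdep output below.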
Thank you in advance.
Here the full dmesg output:
[ 738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
[ 738.344937] Not tainted 5.11.0-rc6-qtec-standard #2
[ 738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 738.358240] task:MatrixMultiplic state:D stack: 0 pid: 653 ppid: 1 flags:0x00004000
[ 738.358254] Call Trace:
[ 738.358261] ? dma_fence_default_wait+0x1eb/0x230
[ 738.358276] __schedule+0x370/0x960
[ 738.358291] ? dma_fence_default_wait+0x117/0x230
[ 738.358297] ? dma_fence_default_wait+0x1eb/0x230
[ 738.358305] schedule+0x51/0xc0
[ 738.358312] schedule_timeout+0x275/0x380
[ 738.358324] ? dma_fence_default_wait+0x1eb/0x230
[ 738.358332] ? mark_held_locks+0x4f/0x70
[ 738.358341] ? dma_fence_default_wait+0x117/0x230
[ 738.358347] ? lockdep_hardirqs_on_prepare+0xd4/0x180
[ 738.358353] ? _raw_spin_unlock_irqrestore+0x39/0x40
[ 738.358362] ? dma_fence_default_wait+0x117/0x230
[ 738.358370] ? dma_fence_default_wait+0x1eb/0x230
[ 738.358375] dma_fence_default_wait+0x214/0x230
[ 738.358384] ? dma_fence_release+0x1a0/0x1a0
[ 738.358396] dma_fence_wait_timeout+0x105/0x200
[ 738.358405] dma_resv_wait_timeout_rcu+0x1aa/0x5e0
[ 738.358421] amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
[ 738.358688] __mmu_notifier_release+0x1bb/0x210
[ 738.358710] exit_mmap+0x2f/0x1e0
[ 738.358723] ? find_held_lock+0x34/0xa0
[ 738.358746] mmput+0x39/0xe0
[ 738.358756] do_exit+0x5c3/0xc00
[ 738.358763] ? find_held_lock+0x34/0xa0
[ 738.358780] do_group_exit+0x47/0xb0
[ 738.358791] get_signal+0x15b/0xc50
[ 738.358807] arch_do_signal_or_restart+0xaf/0x710
[ 738.358816] ? lockdep_hardirqs_on_prepare+0xd4/0x180
[ 738.358822] ? _raw_spin_unlock_irqrestore+0x39/0x40
[ 738.358831] ? ktime_get_mono_fast_ns+0x50/0xa0
[ 738.358844] ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
[ 738.359044] exit_to_user_mode_prepare+0xf2/0x1b0
[ 738.359054] syscall_exit_to_user_mode+0x19/0x60
[ 738.359062] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 738.359069] RIP: 0033:0x7f6b89a51887
[ 738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
[ 738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
[ 738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
[ 738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
[ 738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
[ 738.359129]
Showing all locks held in the system:
[ 738.359141] 1 lock held by khungtaskd/54:
[ 738.359148] #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x183
[ 738.359187] 1 lock held by systemd-journal/174:
[ 738.359202] 1 lock held by MatrixMultiplic/653:
[ 738.359206] #0: ffff88810e364fe0 (&adev->notifier_lock){+.+.}-{3:3}, at: amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/