On 2021-07-14 at 11:25 a.m., Oak Zeng wrote:
> If GPU is during a resetting cycle, writing to GPU can cause
> unpredictable protection fault, see below call trace. Disallow using kfd debugfs
> hang_hws to hang hws if GPU is resetting.
>
> [12808.234114] general protection fault: 0000 [#1] SMP NOPTI
> [12808.234119] CPU: 13 PID: 6334 Comm: tee Tainted: G OE 5.4.0-77-generic #86-Ubuntu
> [12808.234121] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI, BIOS 0211 11/27/2020
> [12808.234220] RIP: 0010:kq_submit_packet+0xd/0x50 [amdgpu]
> [12808.234222] Code: 8b 45 d0 48 c7 00 00 00 00 00 b8 f4 ff ff ff eb df 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 8b 17 48 8b 47 48 <48> 8b 52 08 48 89 e5 83 7a 20 08 74 14 8b 77 20 89 30 48 8b 47 10
> [12808.234224] RSP: 0018:ffffb0bf4954bdc0 EFLAGS: 00010216
> [12808.234226] RAX: ffffb0bf4a1a5a00 RBX: ffff99302895c0c8 RCX: 0000000000000000
> [12808.234227] RDX: c3156d43d3a04949 RSI: 0000000000000055 RDI: ffff99302584c300
> [12808.234228] RBP: ffffb0bf4954bdf8 R08: 0000000000000543 R09: ffffb0bf4a1a4230
> [12808.234229] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
> [12808.234230] R13: ffff99302895c0d8 R14: 00007ffebb3d18f0 R15: 0000000000000005
> [12808.234232] FS: 00007f0d822ef580(0000) GS:ffff99307d340000(0000) knlGS:0000000000000000
> [12808.234233] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12808.234234] CR2: 00007ffebb3d1908 CR3: 0000001efe1ec000 CR4: 0000000000340ee0
> [12808.234235] Call Trace:
> [12808.234324] ? pm_debugfs_hang_hws+0x71/0xd0 [amdgpu]
> [12808.234408] kfd_debugfs_hang_hws+0x2e/0x50 [amdgpu]
> [12808.234494] kfd_debugfs_hang_hws_write+0xb6/0xc0 [amdgpu]
> [12808.234499] full_proxy_write+0x5c/0x90
> [12808.234502] __vfs_write+0x1b/0x40
> [12808.234504] vfs_write+0xb9/0x1a0
> [12808.234506] ksys_write+0x67/0xe0
> [12808.234508] __x64_sys_write+0x1a/0x20
> [12808.234511] do_syscall_64+0x57/0x190
> [12808.234514] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Oak Zeng <Oak.Zeng@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 9e4a05e..fc77d03 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -1390,6 +1390,11 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
>  		return -EINVAL;
>  	}
>
> +	if (dev->dqm->is_resetting) {

Checking dev->dqm->is_resetting without holding the dqm_lock is
incorrect. The problem is not really the fact that it's resetting, but
that dqm->packets (the packet manager) is not initialized at that time.

A more general solution would be to move the pm_debugfs_hang_hws call
into dqm_debugfs_execute_queues, which does take the dqm_lock, and add a
check for dqm->packets while holding the lock.

Regards,
  Felix

> +		pr_err("HWS is already resetting, please wait for the current reset to finish\n");
> +		return -EBUSY;
> +	}
> +
>  	r = pm_debugfs_hang_hws(&dev->dqm->packets);
>  	if (!r)
>  		r = dqm_debugfs_execute_queues(dev->dqm);

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
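
For illustration only, the locking pattern suggested in the review (take the lock first, check that the packet manager is usable, then send the hang packet and execute queues in the same critical section) could be sketched in user space roughly as follows. This is not the amdgpu code: struct dqm_model, the packets_ready flag, the counters, and all function names are hypothetical stand-ins, a pthread mutex plays the role of dqm_lock, and returning -EBUSY when the packet manager is not ready is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the device queue manager. */
struct dqm_model {
	pthread_mutex_t lock;	/* plays the role of dqm_lock */
	bool packets_ready;	/* false while the packet manager is not initialized */
	int hang_packets_sent;	/* number of simulated HWS hang packets */
	int queues_executed;	/* number of simulated execute-queues calls */
};

static void dqm_model_init(struct dqm_model *dqm)
{
	pthread_mutex_init(&dqm->lock, NULL);
	dqm->packets_ready = false;
	dqm->hang_packets_sent = 0;
	dqm->queues_executed = 0;
}

/* Simulated "send hang packet" step. Caller must hold dqm->lock. */
static int pm_model_hang_hws(struct dqm_model *dqm)
{
	dqm->hang_packets_sent++;
	return 0;
}

/* Simulated "execute queues" step. Caller must hold dqm->lock. */
static int dqm_model_execute_queues_locked(struct dqm_model *dqm)
{
	dqm->queues_executed++;
	return 0;
}

/*
 * Debugfs-style entry point: the readiness check and both steps run
 * under one lock, so a concurrent reset path that clears packets_ready
 * while holding the same lock cannot interleave with them.
 */
static int dqm_model_debugfs_hang_hws(struct dqm_model *dqm)
{
	int r;

	assert(dqm != NULL);

	pthread_mutex_lock(&dqm->lock);
	if (!dqm->packets_ready) {
		r = -EBUSY;	/* assumed error code for this sketch */
		goto out;
	}
	r = pm_model_hang_hws(dqm);
	if (!r)
		r = dqm_model_execute_queues_locked(dqm);
out:
	pthread_mutex_unlock(&dqm->lock);
	return r;
}
```

In this model, a call made while packets_ready is false returns -EBUSY and touches neither counter. Once the flag is set (under the lock, in a real reset path), the same call sends one hang packet and executes queues once.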