Re: [PATCH 1/3] drm/amdkfd: Disallow debugfs to hang hws when GPU is resetting

Felix Kuehling <felix.kuehling@xxxxxxx> · Wed, 14 Jul 2021 12:04:21 -0400

On 2021-07-14 at 11:25 a.m., Oak Zeng wrote:
> If the GPU is in the middle of a reset, writing to it can cause an
> unpredictable protection fault; see the call trace below. Disallow using
> the kfd debugfs hang_hws file to hang the HWS while the GPU is resetting.
>
> [12808.234114] general protection fault: 0000 [#1] SMP NOPTI
> [12808.234119] CPU: 13 PID: 6334 Comm: tee Tainted: G           OE 5.4.0-77-generic #86-Ubuntu
> [12808.234121] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI, BIOS 0211 11/27/2020
> [12808.234220] RIP: 0010:kq_submit_packet+0xd/0x50 [amdgpu]
> [12808.234222] Code: 8b 45 d0 48 c7 00 00 00 00 00 b8 f4 ff ff ff eb df 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 8b 17 48 8b 47 48 <48> 8b 52 08 48 89 e5 83 7a 20 08 74 14 8b 77 20 89 30 48 8b 47 10
> [12808.234224] RSP: 0018:ffffb0bf4954bdc0 EFLAGS: 00010216
> [12808.234226] RAX: ffffb0bf4a1a5a00 RBX: ffff99302895c0c8 RCX: 0000000000000000
> [12808.234227] RDX: c3156d43d3a04949 RSI: 0000000000000055 RDI: ffff99302584c300
> [12808.234228] RBP: ffffb0bf4954bdf8 R08: 0000000000000543 R09: ffffb0bf4a1a4230
> [12808.234229] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
> [12808.234230] R13: ffff99302895c0d8 R14: 00007ffebb3d18f0 R15: 0000000000000005
> [12808.234232] FS:  00007f0d822ef580(0000) GS:ffff99307d340000(0000) knlGS:0000000000000000
> [12808.234233] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12808.234234] CR2: 00007ffebb3d1908 CR3: 0000001efe1ec000 CR4: 0000000000340ee0
> [12808.234235] Call Trace:
> [12808.234324]  ? pm_debugfs_hang_hws+0x71/0xd0 [amdgpu]
> [12808.234408]  kfd_debugfs_hang_hws+0x2e/0x50 [amdgpu]
> [12808.234494]  kfd_debugfs_hang_hws_write+0xb6/0xc0 [amdgpu]
> [12808.234499]  full_proxy_write+0x5c/0x90
> [12808.234502]  __vfs_write+0x1b/0x40
> [12808.234504]  vfs_write+0xb9/0x1a0
> [12808.234506]  ksys_write+0x67/0xe0
> [12808.234508]  __x64_sys_write+0x1a/0x20
> [12808.234511]  do_syscall_64+0x57/0x190
> [12808.234514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Oak Zeng <Oak.Zeng@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 9e4a05e..fc77d03 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -1390,6 +1390,11 @@ int kfd_debugfs_hang_hws(struct kfd_dev *dev)
>  		return -EINVAL;
>  	}
>  
> +	if (dev->dqm->is_resetting) {

Checking dev->dqm->is_resetting without holding the dqm_lock is
incorrect. The problem is not really that the GPU is resetting, but
that dqm->packets (the packet manager) is not initialized at that time.

A more general solution would be to move the pm_debugfs_hang_hws call
into dqm_debugfs_execute_queues, which does take the dqm_lock, and add a
check for dqm->packets while holding the lock.
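
To illustrate the shape of that change, here is a rough user-space
sketch, not the actual patch. It assumes a pthread mutex in place of
dqm_lock and a hypothetical "ready" flag in place of whatever check on
dqm->packets the real code would use. All *_model names are invented
for this example, and the -EBUSY return value is simply carried over
from the original patch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-ins; these are NOT the real amdkfd structures. */
struct pm_model {
	bool ready;            /* hypothetical: false while the GPU is resetting */
	int hang_packets_sent; /* counts submissions so behavior is observable */
};

struct dqm_model {
	pthread_mutex_t lock;  /* plays the role of dqm_lock */
	struct pm_model packets;
};

static void dqm_model_init(struct dqm_model *dqm, bool pm_ready)
{
	pthread_mutex_init(&dqm->lock, NULL);
	dqm->packets.ready = pm_ready;
	dqm->packets.hang_packets_sent = 0;
}

/* Stand-in for pm_debugfs_hang_hws: only safe once the packet manager is ready. */
static int pm_hang_hws_model(struct pm_model *pm)
{
	pm->hang_packets_sent++;
	return 0;
}

/* Stand-in for dqm_debugfs_execute_queues with the hang call moved inside. */
static int dqm_execute_queues_model(struct dqm_model *dqm)
{
	int r;

	pthread_mutex_lock(&dqm->lock);
	if (!dqm->packets.ready) {
		/* Assumed error code, taken from the original patch. */
		r = -EBUSY;
		goto out_unlock;
	}
	r = pm_hang_hws_model(&dqm->packets);
	/* The real function would go on to execute the queues here. */
out_unlock:
	pthread_mutex_unlock(&dqm->lock);
	return r;
}
```

Because the check and the packet submission happen under the same
lock, a reset that tears down the packet manager cannot slip in between
them, which is the race an unlocked is_resetting test leaves open.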

Regards,
  Felix

> +		pr_err("HWS is already resetting, please wait for the current reset to finish\n");
> +		return -EBUSY;
> +	}
> +
>  	r = pm_debugfs_hang_hws(&dev->dqm->packets);
>  	if (!r)
>  		r = dqm_debugfs_execute_queues(dev->dqm);