Hi Andrey,
First, I really appreciate the discussion! It helped me understand the driver code a great deal. Thank you so much :)
Please see my inline comments.
An application may have kfd open without actually accessing the device, so killing it at unplug could terminate the process unnecessarily.
However, the latest version I had, with the sleep function, did get rid of the IP block fini hang.
So I have basically reverted to the original solution that you suggested:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..5504a18b5a45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 		return;
 
 	/* for runtime suspend, skip locking kfd */
-	if (!run_pm) {
+	if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
 		/* For first KFD device suspend all the KFD processes */
 		if (atomic_inc_return(&kfd_locked) == 1)
 			kfd_suspend_all_processes(force);
This patch works well for a planned plugout, where all the ROCm processes are killed before the plugout, and the device can be added back without a problem.
However, an unplanned plugout while ROCm processes are still running just doesn't work.
Scenario 1: Kill before plug back
1. echo 1 > /sys/bus/pci/…/remove finishes, but the application won't exit until it receives a kill signal.
2. Kill the process. The application does several things and seems to trigger drm_release in the kernel, which hits a kernel NULL pointer dereference related to sysfs removal (amdgpu_ras_sysfs_remove in the trace below). Then the whole filesystem just freezes.
[ +0.002498] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ +0.000486] #PF: supervisor read access in kernel mode
[ +0.000545] #PF: error_code(0x0000) - not-present page
[ +0.000551] PGD 0 P4D 0
[ +0.000553] Oops: 0000 [#1] SMP NOPTI
[ +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G W E 5.13.0-kfd #1
[ +0.000559] Hardware name: INGRASYS TURING /MB , BIOS K71FQ28A 10/05/2021
[ +0.000567] Workqueue: events delayed_fput
[ +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
[ +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6 20 41 0f
[ +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
[ +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 0000000000000000
[ +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 0000000000000000
[ +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 0000000000000000
[ +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 0000000000000000
[ +0.000702] FS: 0000000000000000(0000) GS:ffffa0ffbfcc0000(0000) knlGS:0000000000000000
[ +0.000666] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 0000000000770ee0
[ +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ +0.000592] PKRU: 55555554
[ +0.000580] Call Trace:
[ +0.000591] kernfs_find_and_get_ns+0x2f/0x50
[ +0.000584] sysfs_remove_file_from_group+0x20/0x50
[ +0.000580] amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
[ +0.000737] amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
[ +0.000750] amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
[ +0.000742] ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
[ +0.000738] sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
[ +0.000717] amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[ +0.000704] amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[ +0.000687] drm_dev_release+0x20/0x40 [drm]
[ +0.000583] drm_release+0xa8/0xf0 [drm]
[ +0.000584] __fput+0xa5/0x250
[ +0.000621] delayed_fput+0x1f/0x30
[ +0.000726] process_one_work+0x26e/0x580
[ +0.000581] ? process_one_work+0x580/0x580
[ +0.000611] worker_thread+0x4d/0x3d0
[ +0.000611] ? process_one_work+0x580/0x580
[ +0.000605] kthread+0x117/0x150
[ +0.000611] ? kthread_park+0x90/0x90
[ +0.000619] ret_from_fork+0x1f/0x30
[ +0.000625] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables
ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]
3. echo 1 > /sys/bus/pci/rescan. This just hangs; I assume sysfs is broken at this point.
Based on steps 1 and 2, it seems that step 1 does not let amdgpu exit gracefully, because step 2 performs some cleanup that probably should have happened before step 1.
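To make the ordering problem concrete, here is a minimal userspace sketch of how I read the trace above: the faulting address is 0x98 and RDI is 0 in kernfs_find_ns, which suggests the parent sysfs node was already gone when the late cleanup ran. This is only a model under that assumption, not driver code. All names (model_node, model_dev, model_pci_remove, model_late_sysfs_remove, model_scenario) are hypothetical, and the NULL check is there to show the failure shape, not to propose a fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sysfs directory (not the kernel's kernfs_node). */
struct model_node {
    int nr_files;
};

/* Hypothetical stand-in for a device that owns a sysfs directory. */
struct model_dev {
    struct model_node *sysfs_dir; /* NULL once the directory is torn down */
};

/* Models step 1: the PCI remove tears down the sysfs directory. */
static void model_pci_remove(struct model_dev *dev)
{
    assert(dev != NULL);
    dev->sysfs_dir = NULL;
}

/*
 * Models the late cleanup reached from the final file release (step 2).
 * Returns 0 if a file was removed, -1 if the directory was already gone.
 * Without the NULL check, the decrement below would dereference a NULL
 * parent, which is the assumed shape of the fault in the trace.
 */
static int model_late_sysfs_remove(struct model_dev *dev)
{
    assert(dev != NULL);
    if (dev->sysfs_dir == NULL)
        return -1; /* directory already gone: skip instead of dereferencing */
    dev->sysfs_dir->nr_files--;
    return 0;
}

/* Runs both steps in the chosen order and reports how the late cleanup fared. */
int model_scenario(bool kill_before_remove)
{
    struct model_node dir = { .nr_files = 1 };
    struct model_dev dev = { .sysfs_dir = &dir };
    int ret;

    if (kill_before_remove) {
        /* Planned plugout: cleanup runs while the directory still exists. */
        ret = model_late_sysfs_remove(&dev);
        model_pci_remove(&dev);
    } else {
        /* Unplanned plugout: the directory is gone before cleanup runs. */
        model_pci_remove(&dev);
        ret = model_late_sysfs_remove(&dev);
    }
    return ret;
}
```

In this model, model_scenario(true) returns 0 (the planned case), while model_scenario(false) returns -1, which corresponds to the point where the real driver faults.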
Scenario 2: Kill after plug back
If I perform the rescan before the kill, the driver seems to probe fine. But the kill then hits the same issue and messes up sysfs in the same way as in the first scenario.
Final Comments:
1. For planned hotplug, this patch should work as long as a protocol is followed, i.e. kill the processes before plugout. Is this patch acceptable, given that it adds functionality that was not available before?
2. For unplanned hotplug while a ROCm app is running, the patch that kills all processes and waits for 5 sec works consistently, but it seems to be an unacceptable solution for an official release. I can keep it for our own internal use. It seems that killing after removal causes problems, and I don't know whether I can come up with a quick fix given my limited understanding of the amdgpu driver. Maybe AMD has a quick fix, or maybe it is a really difficult problem. This feature may or may not become a blocking issue in our GPU disaggregation research down the road. Please let us know in either case; we would like to learn and help as much as we can!
Thank you so much!
Best regards,
Shuotao