Label: Fix 3
I deleted this and it seems to be OK. It was previously added to suppress restore_userptr_work, which keeps updating PTEs.
This is now gone with Fix 3. Please let us know if it is OK :) @Felix
The problem is that it cannot find an event with a type that matches HW_EXCEPTION_TYPE, so with the default parameter value send_sigterm = false the driver does **nothing**.
After all, if a "zombie" process (zombie in the sense that it no longer has a GPU dev) does not exit, kfd resources do not seem to be released properly, and a new kfd process cannot run after plugback.
(I still need to look hard into the rocr/hsakmt/kfd driver code to understand the reason. At least I am seeing that the kfd topology won't be cleaned up without the process exiting, so there would be a "zombie" kfd node in the topology, which may or may not cause issues in hsakmt.)
@Felix Do you have any suggestions/insight on this "zombie" process issue? @Andrey suggests it should be OK to have a "zombie" kfd process and a "zombie" kfd dev, and that a new kfd process should be OK to run on the new kfd dev after plugback.
May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel restore_userptr_work
May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw exception to pasid = 0x800
May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu: Process 25894 (pasid 0x8001) got unhandled exception
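The "got unhandled exception" line above fits the behaviour described earlier: no event of the hardware-exception type is registered, and send_sigterm is false, so nothing reaches the process. Below is a minimal user-space sketch of that decision, not the actual kfd code. The enum names and notify_hw_exception() are hypothetical and only model the three possible outcomes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event types; the real driver has its own enum. */
enum ev_type { EV_SIGNAL, EV_HW_EXCEPTION };

/* What the process ends up seeing in this model. */
enum outcome { OUT_EVENT_SIGNALED, OUT_SIGTERM_SENT, OUT_NOTHING };

/*
 * Model of the notification path: signal a matching event if the process
 * registered one; otherwise fall back to SIGTERM only when send_sigterm
 * is set. With no matching event and send_sigterm == false, the process
 * is not notified at all.
 */
static enum outcome notify_hw_exception(const enum ev_type *events, size_t n,
                                        bool send_sigterm)
{
    for (size_t i = 0; i < n; i++) {
        if (events[i] == EV_HW_EXCEPTION)
            return OUT_EVENT_SIGNALED;
    }
    return send_sigterm ? OUT_SIGTERM_SENT : OUT_NOTHING;
}
```

In this model the only ways to get the zombie process to react are to have user space register a matching event or to load the driver with send_sigterm enabled.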
OK understood.
I tried short-circuiting them, but that later caused a BUG related to GPU reset. I added the following, which solves the issue on plugout.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b583026dc893..d78a06d74759 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5317,7 +5317,8 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
 {
 	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
 
-	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
+	if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
+		recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
 }
 /*
  * Serialize gpu recover into reset domain single threaded wq
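
To make the effect of the guard explicit, here is a small user-space model of the patched work function. It is only a sketch: mock_dev, mock_recover_work and mock_gpu_recover() are hypothetical stand-ins for the amdgpu structures and amdgpu_device_gpu_recover_imp(), and the unplugged flag stands in for drm_dev_is_unplugged().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device; counts how often recovery ran. */
struct mock_dev {
    bool unplugged;
    int recover_calls;
};

/* Hypothetical stand-in for struct amdgpu_recover_work_struct. */
struct mock_recover_work {
    struct mock_dev *dev;
    int ret;
};

/* Stand-in for the real recovery routine; always "succeeds" here. */
static int mock_gpu_recover(struct mock_dev *dev)
{
    dev->recover_calls++;
    return 0;
}

/* Same shape as the patched function: skip recovery once the device is gone. */
static void mock_queue_gpu_recover_work(struct mock_recover_work *w)
{
    if (!w->dev->unplugged)
        w->ret = mock_gpu_recover(w->dev);
    /* When skipped, w->ret keeps whatever value the caller initialised it to. */
}
```

One thing the model makes visible: when recovery is skipped, ret is never written, so the caller sees its initial value. Whether that should read as success or as an error after unplug is a choice the real patch would have to make.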
However, after killing the zombie process, the driver failed to evict the queues of the process.
[ +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
[ +9.002503] amdgpu: qcm fence wait loop timeout expired
[ +0.001364] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[ +0.001343] amdgpu: Failed to evict process queues
[ +0.001355] amdgpu: Failed to evict queues of pasid 0x8001
This causes a driver BUG, triggered by a new kfd process after plugback. I am pasting the errors from dmesg after plugback below.
May 11 10:25:16 NETSYS26 kernel: [ 688.445332] amdgpu: Evicting PASID 0x8001 queues
May 11 10:25:16 NETSYS26 kernel: [ 688.445359] BUG: unable to handle page fault for address: 000000020000006e
May 11 10:25:16 NETSYS26 kernel: [ 688.447516] #PF: supervisor read access in kernel mode
May 11 10:25:16 NETSYS26 kernel: [ 688.449627] #PF: error_code(0x0000) - not-present page
May 11 10:25:16 NETSYS26 kernel: [ 688.451661] PGD 80000020892a8067 P4D 80000020892a8067 PUD 0
May 11 10:25:16 NETSYS26 kernel: [ 688.453741] Oops: 0000 [#1] PREEMPT SMP PTI
May 11 10:25:16 NETSYS26 kernel: [ 688.455904] CPU: 25 PID: 9236 Comm: tf_cnn_benchmar Tainted: G W OE 5.16.0+ #3
May 11 10:25:16 NETSYS26 kernel: [ 688.457406] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
May 11 10:25:16 NETSYS26 kernel: [ 688.457798] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
May 11 10:25:16 NETSYS26 kernel: [ 688.461458] RIP: 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [ 688.465238] Code: bd 13 8a dd 85 c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 74 ea c6 43 6e 00 41 83 ac 24 70 01 00 00
May 11 10:25:16 NETSYS26 kernel: [ 688.470516] RSP: 0018:ffffb2674c8afbf0 EFLAGS: 00010203
May 11 10:25:16 NETSYS26 kernel: [ 688.473255] RAX: ffff91c65cca3800 RBX: 0000000200000000 RCX: 0000000000000001
May 11 10:25:16 NETSYS26 kernel: [ 688.475691] RDX: 0000000000000000 RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
May 11 10:25:16 NETSYS26 kernel: [ 688.478564] RBP: ffffb2674c8afc20 R08: 0000000000000000 R09: 000000000006ba18
May 11 10:25:16 NETSYS26 kernel: [ 688.481409] R10: 00007fe5a0000000 R11: ffffb2674c8af918 R12: ffff91c66d6f5800
May 11 10:25:16 NETSYS26 kernel: [ 688.484254] R13: ffff91c66d6f5938 R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
May 11 10:25:16 NETSYS26 kernel: [ 688.487184] FS: 00007fe62124a700(0000) GS:ffff92053fd00000(0000) knlGS:0000000000000000
May 11 10:25:16 NETSYS26 kernel: [ 688.490308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 11 10:25:16 NETSYS26 kernel: [ 688.493122] CR2: 000000020000006e CR3: 0000002095284004 CR4: 00000000001706e0
May 11 10:25:16 NETSYS26 kernel: [ 688.496142] Call Trace:
May 11 10:25:16 NETSYS26 kernel: [ 688.499199] <TASK>
May 11 10:25:16 NETSYS26 kernel: [ 688.502261] kfd_process_evict_queues+0x43/0xf0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [ 688.506378] kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [ 688.510539] amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [ 688.514110] amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [ 688.518247] __mmu_notifier_invalidate_range_start+0x136/0x1e0
May 11 10:25:16 NETSYS26 kernel: [ 688.521252] change_protection+0x41d/0xcd0
May 11 10:25:16 NETSYS26 kernel: [ 688.524310] change_prot_numa+0x19/0x30
May 11 10:25:16 NETSYS26 kernel: [ 688.527366] task_numa_work+0x1ca/0x330
May 11 10:25:16 NETSYS26 kernel: [ 688.530157] task_work_run+0x6c/0xa0
May 11 10:25:16 NETSYS26 kernel: [ 688.533124] exit_to_user_mode_prepare+0x1af/0x1c0
May 11 10:25:16 NETSYS26 kernel: [ 688.536058] syscall_exit_to_user_mode+0x2a/0x40
May 11 10:25:16 NETSYS26 kernel: [ 688.538989] do_syscall_64+0x46/0xb0
May 11 10:25:16 NETSYS26 kernel: [ 688.541830] entry_SYSCALL_64_after_hwframe+0x44/0xae
May 11 10:25:16 NETSYS26 kernel: [ 688.544701] RIP: 0033:0x7fe6585ec317
May 11 10:25:16 NETSYS26 kernel: [ 688.547297] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
May 11 10:25:16 NETSYS26 kernel: [ 688.553183] RSP: 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 11 10:25:16 NETSYS26 kernel: [ 688.556105] RAX: ffffffffffffffc2 RBX: 0000000000000000 RCX: 00007fe6585ec317
May 11 10:25:16 NETSYS26 kernel: [ 688.558970] RDX: 00007fe621249540 RSI: 00000000c0584b02 RDI: 0000000000000003
May 11 10:25:16 NETSYS26 kernel: [ 688.561950] RBP: 00007fe621249540 R08: 0000000000000000 R09: 0000000000040000
May 11 10:25:16 NETSYS26 kernel: [ 688.564563] R10: 00007fe617480000 R11: 0000000000000246 R12: 00000000c0584b02
May 11 10:25:16 NETSYS26 kernel: [ 688.567494] R13: 0000000000000003 R14: 0000000000000064 R15: 00007fe621249920
May 11 10:25:16 NETSYS26 kernel: [ 688.570470] </TASK>
May 11 10:25:16 NETSYS26 kernel: [ 688.573380] Modules linked in: amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_intel
intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si
ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
May 11 10:25:16 NETSYS26 kernel: [ 688.573543] async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea hid_generic
crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
May 11 10:25:16 NETSYS26 kernel: [ 688.611083] CR2: 000000020000006e
May 11 10:25:16 NETSYS26 kernel: [ 688.614454] ---[ end trace 349cf28efb6268bc ]---
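
A note on reading the oops above, offered as an interpretation rather than a confirmed diagnosis. The bytes marked in the Code: line, `<80> 7b 6e 00`, decode to `cmpb $0x0,0x6e(%rbx)`, a byte read at RBX + 0x6e. With RBX = 0000000200000000 from the register dump, this gives exactly the faulting address in CR2. So evict_process_queues_cpsch appears to be dereferencing a pointer that is not a valid kernel address, possibly a stale list entry left behind by the earlier failed eviction. The helper name below is made up for illustration; the arithmetic is the only claim.

```c
#include <stdint.h>

/*
 * Effective address of an instruction operand of the form disp8(%reg):
 * the base register plus the sign-extended 8-bit displacement.
 */
static uint64_t effective_address(uint64_t base, int8_t disp8)
{
    return base + (uint64_t)(int64_t)disp8;
}
```

Calling effective_address(0x0000000200000000, 0x6e) reproduces the CR2 value 0x000000020000006e reported in the trace.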
Looking forward to the comments.
Regards,
Shuotao