Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

Shuotao Xu <shuotaoxu@xxxxxxxxxxxxx> · Wed, 11 May 2022 03:35:42 +0000

On May 11, 2022, at 4:31 AM, Felix Kuehling <felix.kuehling@xxxxxxx> wrote:





On 2022-05-10 at 07:03, Shuotao Xu wrote:






On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky

<andrey.grodzovsky@xxxxxxx> wrote:



On 2022-04-27 05:20, Shuotao Xu wrote:



Hi Andrey,



Sorry that I did not have time to work on this for a few days.



I just tried the sysfs crash fix on Radeon VII and it seems that it
worked. It did not pass the last hotplug test, but my version has 4
tests instead of 3 in your case.






That's because the 4th one is only enabled when there are 2 cards in the
system, to test DRI_PRIME export. I tested this time with only one card.




Yes, I only had one Radeon VII in my system, so this 4th test should

have been skipped. I am ignoring this issue.








Suite: Hotunplug Tests

Test: Unplug card and rescan the bus to plug it back

.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed

Test: Same as first test but with command submission

.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed

Test: Unplug with exported bo

.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed

Test: Unplug with exported fence

.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)






On the kernel side, the IOCTL returning this is drm_getclient. Maybe
take a look at why it can't find the client? I didn't have such an
issue as far as I remember when testing.





FAILED

1. ../tests/amdgpu/hotunplug_tests.c:368 - CU_ASSERT_EQUAL(r,0)
2. ../tests/amdgpu/hotunplug_tests.c:411 - CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, &sync_obj_handle2),0)
3. ../tests/amdgpu/hotunplug_tests.c:423 - CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 100000000, 0, NULL),0)
4. ../tests/amdgpu/hotunplug_tests.c:425 - CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)



Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      4      3      1        0
             asserts     39     39     35      4      n/a



Elapsed time = 17.321 seconds



For kfd compute, there is a problem which I did not see on MI100
after I killed the hung application following hot plug-out. I was using
the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a
regression from the newer driver.
After pkill, one child of the user process would be stuck in zombie
state (Z), understandably because of the bug, and any future rocm
application after plug-back would be in uninterruptible sleep (D)
because it would not return from the syscall to kfd.

The drm test for amdgpu, however, would run just fine without issues
after plug-back with the dangling kfd state.






I am not clear when the crash below happens. Is it related to what
you describe above?







I don't know if there is a quick fix for it. I was thinking of adding
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.






Try adding a drm_dev_enter/exit pair at the highest level of attempting
to access HW; in this case it's amdgpu_amdkfd_set_compute_idle. We
always try to avoid accessing any HW functions after the backing device
is gone.
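
As a rough illustration of the advice above (guard the highest-level entry point rather than the register accessor), here is a minimal userspace sketch in C. Everything prefixed mock_ is hypothetical and only models the shape of the drm_dev_enter()/drm_dev_exit() pattern; the real kernel helpers also take an SRCU read lock and pass an index between enter and exit, which this sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for a DRM device; not the kernel struct. */
struct mock_dev {
    bool unplugged;   /* set once the device has been hot-unplugged */
    int  hw_writes;   /* counts simulated register writes */
};

/* Models drm_dev_enter(): succeeds only while the device is still present.
 * (Assumption: the SRCU locking of the real helper is omitted here.) */
static bool mock_dev_enter(const struct mock_dev *dev)
{
    return !dev->unplugged;
}

/* Models drm_dev_exit(): nothing to release in this mock. */
static void mock_dev_exit(const struct mock_dev *dev)
{
    (void)dev;
}

/* Lowest level: in a real driver this would touch the PCI BAR. */
static void mock_hw_write(struct mock_dev *dev)
{
    dev->hw_writes++;
}

/* Highest-level entry point: the guard lives here, so nothing below it
 * runs once the backing device is gone. Returns true if HW was touched. */
bool mock_set_compute_idle(struct mock_dev *dev, bool idle)
{
    (void)idle;   /* the power-profile argument is irrelevant to the guard */

    if (!mock_dev_enter(dev))
        return false;

    mock_hw_write(dev);
    mock_dev_exit(dev);
    return true;
}
```

With the guard at the top, the lower layers (here mock_hw_write) need no checks of their own, which is the point of placing it at the highest level instead of in the register read path.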





Also, I have been attempting to fix the hotplug issue for kfd
applications for a long time.
I don't know 1) if I will be able to get to MI100 (fixing Radeon
VII would mean something, but MI100 is more important for us); 2)
in what direction the patch for this issue will move forward.






I will go to the office tomorrow to pick up the MI-100. With time and
priorities permitting, I will then try to test it and fix any
bugs such that it passes all hot plug libdrm tests at the
tip of public amd-staging-drm-next. After that you can try
to continue working on ROCm enabling on top of that.

For now I suggest you move on with Radeon VII as your development
ASIC and use the fix I mentioned above.




I finally got some time to continue on the kfd hotplug patch attempt.
The following patch seems to work for kfd hotplug on Radeon VII. After
hot plug-out, the tf process exits because of a vm fault.
A new tf process runs without issues after plug-back.



It has the following fixes.



1. ras sysfs regression;
2. skip setting compute idle after the dev is unplugged, otherwise it will
   try to write the pci bar, and thus the driver faults;
3. stop the actual work of invalidating the memory map triggered by
   userptrs (returning false would trigger a warning, so I returned true.
   Not sure if it is correct);
4. send exceptions to all the events/signals that a "zombie"
   process is waiting for. (Not sure if the hw_exception is
   worthwhile; it did not do anything in my case since there is no such
   event type associated with that process.)

Please take a look and let me know if it is acceptable.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 1f8161cd507f..2f7858692067 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -33,6 +33,7 @@
 #include <uapi/linux/kfd_ioctl.h>
 #include "amdgpu_ras.h"
 #include "amdgpu_umc.h"
+#include <drm/drm_drv.h>

 /* Total memory size in system memory and all GPU VRAM. Used to
  * estimate worst case amount of memory to reserve for page tables
@@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,

 void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
 {
-       amdgpu_dpm_switch_power_profile(adev,
-                                       PP_SMC_POWER_PROFILE_COMPUTE,
-                                       !idle);
+       if (!drm_dev_is_unplugged(adev_to_drm(adev)))
+               amdgpu_dpm_switch_power_profile(adev,
+                                               PP_SMC_POWER_PROFILE_COMPUTE,
+                                               !idle);
 }

 bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 4b153daf283d..fb4c9e55eace 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -46,6 +46,7 @@
 #include <linux/firmware.h>
 #include <linux/module.h>
 #include <drm/drm.h>
+#include <drm/drm_drv.h>

 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
@@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct mmu_interval_notifier *mni,
        struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
        struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);

+       if (drm_dev_is_unplugged(adev_to_drm(adev)))
+               return true;
+

Label: Fix 3

        if (!mmu_notifier_range_blockable(range))
                return false;



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index cac56f830aed..fbbaaabf3a67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct amdgpu_device *adev)
                }
        }

-       amdgpu_ras_sysfs_remove_all(adev);
        return 0;
 }
 /* ras fs end */
@@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct amdgpu_device *adev,
        if (!ras_block)
                return;

-       amdgpu_ras_sysfs_remove(adev, ras_block);
-
        ras_obj = container_of(ras_block, struct amdgpu_ras_block_object, ras_comm);
        if (ras_obj->ras_cb)
                amdgpu_ras_interrupt_remove_handler(adev, ras_block);
@@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
        /* Need disable ras on all IPs here before ip [hw/sw]fini */
        amdgpu_ras_disable_all_features(adev, 0);
        amdgpu_ras_recovery_fini(adev);
+       amdgpu_ras_sysfs_remove_all(adev);
        return 0;
 }



diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index f1a225a20719..4b789bec9670 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,16 +714,37 @@ bool kfd_is_locked(void)

 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+
        if (!kfd->init_complete)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes();
        }

+       if (drm_dev_is_unplugged(kfd->ddev)){
+               int idx = srcu_read_lock(&kfd_processes_srcu);
+               pr_debug("cancel restore_userptr_work\n");
+               hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
+                               p_info = p->kgd_process_info;
+                               pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
+                               cancel_delayed_work_sync(&p_info->restore_userptr_work);




Is this really necessary? If it is, there are probably other workers,
e.g. related to our SVM code, that would need to be canceled as well.








I deleted this and it seems to be OK. It was previously added to suppress restore_userptr_work, which kept updating PTEs.
That is now handled by Fix 3. Please let us know if it is OK :) @Felix







+
+                               /* send exception signals to the kfd events waiting in user space */
+                               kfd_signal_hw_exception_event(p->pasid);




This makes sense. It basically tells user mode that the application's
GPU state is lost due to a RAS error or a GPU reset, or now a GPU
hot-unplug.






The problem is that it cannot find an event with a type that matches HW_EXCEPTION_TYPE, so the driver does **nothing** with the default parameter value of send_sigterm = false.
After all, if a "zombie" process (zombie in the sense that it does not have a GPU dev) does not exit, kfd resources seem not to be released properly and a new kfd process cannot run after plug-back.
(I still need to look hard into the rocr/hsakmt/kfd driver code to understand the reason. At least I am seeing that the kfd topology won't be cleaned up without the process exiting, so there would be a "zombie" kfd node in the topology, which may or may not cause issues in hsakmt.)
@Felix Do you have any suggestion/insight on this "zombie" process issue? @Andrey suggests it should be OK to have a "zombie" kfd process and a "zombie" kfd dev, and the new kfd process should be OK to run on the new kfd dev after plug-back.




May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel restore_userptr_work
May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw exception to pasid = 0x800
May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu: Process 25894 (pasid 0x8001) got unhandled exception
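
The behaviour described above (no event of the requested type, send_sigterm off, so only the "got unhandled exception" message appears) can be modelled roughly as follows. This is a simplified userspace sketch in C, not the kfd code; all mock_ names are made up for illustration, and the real event lookup in the driver may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event types; stand-ins for the kfd event types. */
enum mock_event_type { MOCK_EVENT_SIGNAL, MOCK_EVENT_HW_EXCEPTION };

/* Hypothetical per-process event record. */
struct mock_event {
    enum mock_event_type type;
    bool signaled;
};

/* What happened to the process as a result of the notification. */
enum mock_outcome {
    MOCK_SIGNALED,      /* at least one matching event was signaled */
    MOCK_SIGTERM_SENT,  /* no match, fallback SIGTERM requested */
    MOCK_UNHANDLED      /* no match, no fallback: only a log message */
};

/* Signal every event of the given type. If the process registered no
 * event of that type, fall back to SIGTERM only when send_sigterm is set;
 * otherwise nothing reaches user space. */
enum mock_outcome mock_signal_by_type(struct mock_event *events, size_t n,
                                      enum mock_event_type type,
                                      bool send_sigterm)
{
    bool found = false;

    for (size_t i = 0; i < n; i++) {
        if (events[i].type == type) {
            events[i].signaled = true;
            found = true;
        }
    }

    if (found)
        return MOCK_SIGNALED;
    return send_sigterm ? MOCK_SIGTERM_SENT : MOCK_UNHANDLED;
}
```

In this model, a process that never created a HW-exception event ends up in the MOCK_UNHANDLED case with the default setting, which matches the log line above: the notification is sent but the process is neither woken nor terminated.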










+                               kfd_signal_vm_fault_event(kfd, p->pasid, NULL);




This does not make sense. A VM fault indicates an access to a bad
virtual address by the GPU. If a debugger is attached to the process, it
notifies the debugger to investigate what went wrong. If the GPU is
gone, that doesn't make any sense. There is no GPU that could have
issued a bad memory request. And the debugger won't be happy either to
find a VM fault from a GPU that doesn't exist any more.






OK understood.





If the HW-exception event doesn't terminate your process, we may need to
look into how ROCr handles the HW-exception events.






+                       }
+               }
+               srcu_read_unlock(&kfd_processes_srcu, idx);
+       }
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);




Should DQM stop and IOMMU suspend still be executed? Or should the
hot-unplug case short-circuit them?






I tried short-circuiting them, but that later caused a BUG related to GPU reset. I added the following, which solves the issue on plug-out.




diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b583026dc893..d78a06d74759 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5317,7 +5317,8 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
 {
        struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);



-       recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
+       if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
+               recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
 }
 /*
  * Serialize gpu recover into reset domain single threaded wq




However, after killing the zombie process, it failed to evict the queues of the process.




[  +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
[  +9.002503] amdgpu: qcm fence wait loop timeout expired
[  +0.001364] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  +0.001343] amdgpu: Failed to evict process queues
[  +0.001355] amdgpu: Failed to evict queues of pasid 0x8001







This causes a driver BUG, triggered by a new kfd process after plug-back. I am pasting the errors from dmesg after plug-back below.










May 11 10:25:16 NETSYS26 kernel: [  688.445332] amdgpu: Evicting PASID 0x8001 queues
May 11 10:25:16 NETSYS26 kernel: [  688.445359] BUG: unable to handle page fault for address: 000000020000006e
May 11 10:25:16 NETSYS26 kernel: [  688.447516] #PF: supervisor read access in kernel mode
May 11 10:25:16 NETSYS26 kernel: [  688.449627] #PF: error_code(0x0000) - not-present page
May 11 10:25:16 NETSYS26 kernel: [  688.451661] PGD 80000020892a8067 P4D 80000020892a8067 PUD 0
May 11 10:25:16 NETSYS26 kernel: [  688.453741] Oops: 0000 [#1] PREEMPT SMP PTI
May 11 10:25:16 NETSYS26 kernel: [  688.455904] CPU: 25 PID: 9236 Comm: tf_cnn_benchmar Tainted: G        W  OE     5.16.0+ #3
May 11 10:25:16 NETSYS26 kernel: [  688.457406] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
May 11 10:25:16 NETSYS26 kernel: [  688.457798] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
May 11 10:25:16 NETSYS26 kernel: [  688.461458] RIP: 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.465238] Code: bd 13 8a dd 85 c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 74 ea c6 43 6e 00 41 83 ac 24 70 01 00 00
May 11 10:25:16 NETSYS26 kernel: [  688.470516] RSP: 0018:ffffb2674c8afbf0 EFLAGS: 00010203
May 11 10:25:16 NETSYS26 kernel: [  688.473255] RAX: ffff91c65cca3800 RBX: 0000000200000000 RCX: 0000000000000001
May 11 10:25:16 NETSYS26 kernel: [  688.475691] RDX: 0000000000000000 RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
May 11 10:25:16 NETSYS26 kernel: [  688.478564] RBP: ffffb2674c8afc20 R08: 0000000000000000 R09: 000000000006ba18
May 11 10:25:16 NETSYS26 kernel: [  688.481409] R10: 00007fe5a0000000 R11: ffffb2674c8af918 R12: ffff91c66d6f5800
May 11 10:25:16 NETSYS26 kernel: [  688.484254] R13: ffff91c66d6f5938 R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
May 11 10:25:16 NETSYS26 kernel: [  688.487184] FS:  00007fe62124a700(0000) GS:ffff92053fd00000(0000) knlGS:0000000000000000
May 11 10:25:16 NETSYS26 kernel: [  688.490308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 11 10:25:16 NETSYS26 kernel: [  688.493122] CR2: 000000020000006e CR3: 0000002095284004 CR4: 00000000001706e0
May 11 10:25:16 NETSYS26 kernel: [  688.496142] Call Trace:
May 11 10:25:16 NETSYS26 kernel: [  688.499199]  <TASK>
May 11 10:25:16 NETSYS26 kernel: [  688.502261]  kfd_process_evict_queues+0x43/0xf0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.506378]  kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.510539]  amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.514110]  amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.518247]  __mmu_notifier_invalidate_range_start+0x136/0x1e0
May 11 10:25:16 NETSYS26 kernel: [  688.521252]  change_protection+0x41d/0xcd0
May 11 10:25:16 NETSYS26 kernel: [  688.524310]  change_prot_numa+0x19/0x30
May 11 10:25:16 NETSYS26 kernel: [  688.527366]  task_numa_work+0x1ca/0x330
May 11 10:25:16 NETSYS26 kernel: [  688.530157]  task_work_run+0x6c/0xa0
May 11 10:25:16 NETSYS26 kernel: [  688.533124]  exit_to_user_mode_prepare+0x1af/0x1c0
May 11 10:25:16 NETSYS26 kernel: [  688.536058]  syscall_exit_to_user_mode+0x2a/0x40
May 11 10:25:16 NETSYS26 kernel: [  688.538989]  do_syscall_64+0x46/0xb0
May 11 10:25:16 NETSYS26 kernel: [  688.541830]  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 11 10:25:16 NETSYS26 kernel: [  688.544701] RIP: 0033:0x7fe6585ec317
May 11 10:25:16 NETSYS26 kernel: [  688.547297] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
May 11 10:25:16 NETSYS26 kernel: [  688.553183] RSP: 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 11 10:25:16 NETSYS26 kernel: [  688.556105] RAX: ffffffffffffffc2 RBX: 0000000000000000 RCX: 00007fe6585ec317
May 11 10:25:16 NETSYS26 kernel: [  688.558970] RDX: 00007fe621249540 RSI: 00000000c0584b02 RDI: 0000000000000003
May 11 10:25:16 NETSYS26 kernel: [  688.561950] RBP: 00007fe621249540 R08: 0000000000000000 R09: 0000000000040000
May 11 10:25:16 NETSYS26 kernel: [  688.564563] R10: 00007fe617480000 R11: 0000000000000246 R12: 00000000c0584b02
May 11 10:25:16 NETSYS26 kernel: [  688.567494] R13: 0000000000000003 R14: 0000000000000064 R15: 00007fe621249920
May 11 10:25:16 NETSYS26 kernel: [  688.570470]  </TASK>
May 11 10:25:16 NETSYS26 kernel: [  688.573380] Modules linked in: amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_intel
 intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si
 ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
May 11 10:25:16 NETSYS26 kernel: [  688.573543]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea hid_generic
 crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.611083] CR2: 000000020000006e
May 11 10:25:16 NETSYS26 kernel: [  688.614454] ---[ end trace 349cf28efb6268bc ]---



Looking forward to the comments.



Regards,
Shuotao







Regards,

Felix






}



Regards,

Shuotao



Andrey







Regards,

Shuotao



[  +0.001645] BUG: unable to handle page fault for address: 0000000000058a68
[  +0.001298] #PF: supervisor read access in kernel mode
[  +0.001252] #PF: error_code(0x0000) - not-present page
[  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 PMD 0
[  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
[  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G   W   E     5.16.0+ #3
[  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001422] Call Trace:
[  +0.001407]  <TASK>
[  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
[  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
[  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
[  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
[  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
[  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
[  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
[  +0.001829]  ? kvfree+0x1e/0x30
[  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
[  +0.001868]  ? kvfree+0x1e/0x30
[  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
[  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
[  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
[  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
[  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
[  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
[  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
[  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
[  +0.001718]  __mmu_notifier_release+0x77/0x1f0
[  +0.001411]  exit_mmap+0x1b5/0x200
[  +0.001396]  ? __switch_to+0x12d/0x3e0
[  +0.001388]  ? __switch_to_asm+0x36/0x70
[  +0.001372]  ? preempt_count_add+0x74/0xc0
[  +0.001364]  mmput+0x57/0x110
[  +0.001349]  do_exit+0x33d/0xc20
[  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
[  +0.001346]  do_group_exit+0x43/0xa0
[  +0.001341]  get_signal+0x131/0x920
[  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
[  +0.001303]  ? do_futex+0x125/0x190
[  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
[  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.001264]  do_syscall_64+0x46/0xb0
[  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.001219] RIP: 0033:0x7f6aff1d2ad3
[  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
[  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 00007f6aff1d2ad3
[  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000004f542d8
[  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 0000000000000000
[  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000004f542d8
[  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 0000000000000000
[  +0.001152]  </TASK>
[  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
[  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
[  +0.016626] CR2: 0000000000058a68
[  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
[  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001740] Fixing recursive fault but reboot is needed!





On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky

<andrey.grodzovsky@xxxxxxx> wrote:



I retested the hot plug tests at the commit I mentioned below and it looks
OK. My ASIC is Navi 10; I also tested using Vega 10 and older
Polaris ASICs (whatever I had at home at the time). It's possible
there are extra issues in ASICs like yours which I didn't cover during
tests.



andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory





The ASIC NOT support UVD, suite disabled

/usr/local/share/libdrm/amdgpu.ids: No such file or directory





The ASIC NOT support VCE, suite disabled

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory





The ASIC NOT support UVD ENC, suite disabled.

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory





Don't support TMZ (trust memory zone), security suite disabled

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

/usr/local/share/libdrm/amdgpu.ids: No such file or directory

Peer device is not opened or has ASIC not supported by the suite,

skip all Peer to Peer tests.





CUnit - A unit testing framework for C - Version 2.1-3






Suite: Hotunplug Tests
Test: Unplug card and rescan the bus to plug it back
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Same as first test but with command submission
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported bo
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed



Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      3      3      0        1
             asserts     21     21     21      0      n/a



Elapsed time = 9.195 seconds





Andrey



On 2022-04-20 11:44, Andrey Grodzovsky wrote:



The only one on Radeon VII I see is the same sysfs crash we already
fixed, so you can use the same fix. I haven't seen the MI200 issue
yet, but I also haven't tested MI200, so I never saw it before. I need
to test when I get the time.

So try that fix with Radeon VII again to see if you pass the tests
(the warnings should all be minor issues).



Andrey





On 2022-04-20 05:24, Shuotao Xu wrote:




That's a problem. The latest working baseline I tested and confirmed
passing hotplug tests is this branch and commit, which
is amd-staging-drm-next. 5.14 was the branch where we upstreamed the
hotplug code, but it had a lot of regressions over time due to
new changes (that's why I added the hotplug test, to try and catch
them early). It would be best to run this branch on MI-100 so we
have a clean baseline, and only after confirming that this particular
branch at this commit passes the libdrm tests start
adding the KFD-specific add-ons. Another option, if you can't work
with MI-100 and this branch, is to try a different ASIC that does
work with this branch (if possible).



Andrey




OK, I tried both this commit and the HEAD of amd-staging-drm-next
on two GPUs (MI100 and Radeon VII); both did not pass the hotplugout
libdrm test. I might be able to gain access to MI200, but I
suspect it would work.



I copied the complete dmesgs as follows. I highlighted the OOPSES

for you.



Radeon VII: