However, those are all based on my own observations. Please explain why it does not need to exit, if you believe so.
I first plug out the device, then kill the rocm user process. That then triggers other OOPSes related to ttm_bo_cleanup_refs.
[ +0.000006] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ +0.000349] #PF: supervisor read access in kernel mode
[ +0.000340] #PF: error_code(0x0000) - not-present page
[ +0.000341] PGD 0 P4D 0
[ +0.000336] Oops: 0000 [#1] SMP NOPTI
[ +0.000345] CPU: 9 PID: 95 Comm: kworker/9:1 Tainted: G W E 5.13.0-kfd #1
[ +0.000367] Hardware name: INGRASYS TURING /MB , BIOS K71FQ28A 10/05/2021
[ +0.000376] Workqueue: events delayed_fput
[ +0.000422] RIP: 0010:ttm_resource_free+0x24/0x40 [ttm]
[ +0.000464] Code: 00 00 0f 1f 40 00 0f 1f 44 00 00 53 48 89 f3 48 8b 36 48 85 f6 74 21 48 8b 87 28 02 00 00 48 63 56 10 48 8b bc d0 b8 00 00 00 <48> 8b 47 10 ff 50 08 48 c7 03 00 00 00 00 5b c3 66 66 2e 0f 1f 84
[ +0.001009] RSP: 0018:ffffb21c59413c98 EFLAGS: 00010282
[ +0.000515] RAX: ffff8b1aa4285f68 RBX: ffff8b1a823b5ea0 RCX: 00000000002a000c
[ +0.000536] RDX: 0000000000000000 RSI: ffff8b1acb84db80 RDI: 0000000000000000
[ +0.000539] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffc03c3e00
[ +0.000543] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b1a823b5ec8
[ +0.000542] R13: 0000000000000000 R14: ffff8b1a823b5d90 R15: ffff8b1a823b5ec8
[ +0.000544] FS: 0000000000000000(0000) GS:ffff8b187f440000(0000) knlGS:0000000000000000
[ +0.000559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000575] CR2: 0000000000000010 CR3: 00000076e6812004 CR4: 0000000000770ee0
[ +0.000575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ +0.000579] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ +0.000575] PKRU: 55555554
[ +0.000568] Call Trace:
[ +0.000567] ttm_bo_cleanup_refs+0xe4/0x290 [ttm]
[ +0.000588] ttm_bo_delayed_delete+0x147/0x250 [ttm]
[ +0.000589] ttm_device_fini+0xad/0x1b0 [ttm]
[ +0.000590] amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
[ +0.000730] gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[ +0.000753] amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[ +0.000734] amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[ +0.000737] drm_dev_release+0x20/0x40 [drm]
[ +0.000626] drm_release+0xa8/0xf0 [drm]
[ +0.000625] __fput+0xa5/0x250
[ +0.000606] delayed_fput+0x1f/0x30
[ +0.000607] process_one_work+0x26e/0x580
[ +0.000608] ? process_one_work+0x580/0x580
[ +0.000616] worker_thread+0x4d/0x3d0
[ +0.000614] ? process_one_work+0x580/0x580
[ +0.000617] kthread+0x117/0x150
[ +0.000615] ? kthread_park+0x90/0x90
[ +0.000621] ret_from_fork+0x1f/0x30
[ +0.000603] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper
drm_ttm_helper iommu_v2 ttm gpu_sched drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]
[ +0.002840] CR2: 0000000000000010
[ +0.000755] ---[ end trace 9737737402551e39 ]--
3. echo 1 > /sys/bus/pci/rescan. This would just hang. I assume the sysfs is broken.
Based on 1 & 2, it seems that 1 won’t let amdgpu exit gracefully, because 2 does some cleanup that maybe should have happened before 1.
or you kill after plug back does it makes a difference).
Scenario 2: Kill after plug back
If I perform the rescan before the kill, the driver seems to probe fine. But the kill then hits the same issue, which messes up the sysfs the same way as in Scenario 2.
Final Comments:
0. cancel_delayed_work_sync(&p_info->restore_userptr_work) would make the repetition of
amdgpu_vm_bo_update failures go away, but it does not solve the issues in those scenarios.
Still - it's better to do it this way, even if only for those failures to go away.
cancel_delayed_work is insufficient; you will need to make sure the work won’t be processed after plugout. Please see my patch.
1. For planned hotplug, this patch should work as long as you follow some protocol, i.e. kill before plugout. Is this patch an acceptable one, since it adds functionality compared to before?
Let's try to fix more as I advised above.
2. For unplanned hotplug when there is a rocm app running, the patch that kills all processes and waits for 5 sec would work consistently. But it seems that it is an unacceptable solution for an official release. I can hold it for our own internal usage.
It seems that kill after removal would cause problems, and I don’t know if there is a quick fix by me because of my limited understanding of the amdgpu driver. Maybe AMD could have a quick fix; or maybe it is really a difficult one. This feature may or may not
be a blocking issue in our GPU disaggregation research down the way. Please let us know in either case, and we would like to learn and help as much as we can!
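The kill-then-wait approach from point 2 can be sketched in user space as below. This is only an illustration under stated assumptions, not the driver code: inflight_kills and creation_locked are hypothetical stand-ins for the counters in the patch, wait_step() stands in for msleep(10), and the max_polls bound is an addition made here so that the wait cannot spin forever if a process never releases.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the counters used by the patch. */
static atomic_int inflight_kills;   /* signalled but not yet released */
static atomic_bool creation_locked; /* new processes are refused while set */

/* Called once for every process that has been sent the signal. */
static void note_kill_sent(void)
{
	atomic_fetch_add(&inflight_kills, 1);
}

/* Called from the release path of a signalled process. */
static void note_process_released(void)
{
	atomic_fetch_sub(&inflight_kills, 1);
}

/*
 * Poll until all signalled processes are released, calling wait_step()
 * between polls (at most max_polls times).
 * Returns 0 when drained, -ETIMEDOUT otherwise.
 */
static int drain_inflight_kills(int max_polls, void (*wait_step)(void))
{
	int ret = -ETIMEDOUT;

	atomic_store(&creation_locked, true);
	for (int polls = 0; ; polls++) {
		if (atomic_load(&inflight_kills) == 0) {
			ret = 0;
			break;
		}
		if (polls == max_polls)
			break;
		wait_step();
	}
	atomic_store(&creation_locked, false);
	return ret;
}

/* Example wait steps: one process exits per poll, or nothing ever exits. */
static void simulate_one_exit(void)
{
	note_process_released();
}

static void simulate_stuck(void)
{
}
```

With two signalled processes and simulate_one_exit as the wait step, the drain completes after two polls; with simulate_stuck it gives up and reports -ETIMEDOUT instead of blocking the remove path indefinitely.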
I am currently not sure why it helps. I will need to set up my own ROCm setup and retest hot plug to check this in more depth, but currently I have higher priorities. Please try to confirm that ASIC reset always takes place on plug back
and fix the sysfs OOPs as I advised above to clear up at least some of the issues. Also, please describe to me exactly what your steps to reproduce this scenario are, so that later I might be able to do it myself.
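Point 0 above (cancelling the pending work is not enough, the work must also be prevented from running after plugout) can be illustrated with a small user-space sketch. It assumes a simplified, hypothetical struct proc_info in place of struct amdkfd_process_info, and models only the flag check, not the workqueue itself. Setting the flag before cancelling is a choice made in this sketch; the patch in this thread cancels first and sets the flag afterwards.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct amdkfd_process_info. */
struct proc_info {
	atomic_int invalid;     /* non-zero once the device is unplugged */
	atomic_int evicted_bos; /* pending restore requests */
	int restores_done;      /* bookkeeping for this example only */
};

/*
 * Worker body: check the invalid flag before touching anything else.
 * Returns true only if a restore was actually performed.
 */
static bool restore_worker(struct proc_info *info)
{
	if (atomic_load(&info->invalid))
		return false;
	if (atomic_load(&info->evicted_bos) == 0)
		return false;

	atomic_store(&info->evicted_bos, 0);
	info->restores_done++;
	return true;
}

/*
 * Unplug path: mark the process info invalid so that any worker that is
 * queued later returns immediately. In the kernel this would be paired
 * with cancel_delayed_work_sync() for work that is already pending.
 */
static void mark_unplugged(struct proc_info *info)
{
	atomic_store(&info->invalid, 1);
}
```

After mark_unplugged(), a call to restore_worker() is a no-op even if evicted_bos is non-zero, which is the property the invalid field is meant to provide.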
I can still try to help to fix the bug in my spare time. My setup is as follows
- I have a server with 4 AMD MI100 GPUs. I think 1 GPU would also work.
- I used https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/roc-5.0.x as the starting point, and applied Mukul’s patch and my patch.
- Then I run a tensorflow benchmark from a docker.
- docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:rocm4.5.2-tf1.15-dev
- And run the following benchmark in the docker: python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=4 --batch_size=32 --model=resnet50 --variable_update=parameter_server
- You might need to adjust the num_gpus parameter based on your setup
- Remove a GPU at a random time.
- Do whatever is needed before plugging it back, and re-verify that the benchmark can still run.
Also, we have a hotplug test suite in libdrm (graphics stack), so maybe you can install libdrm and run that test suite to see if it exposes more issues.
OK I could try it some time.
The following is the new diff.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 182b7eae598a..48c3cd4054de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1327,7 +1327,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
* ASICs macro.
*/
#define amdgpu_asic_set_vga_state(adev, state) (adev)->asic_funcs->set_vga_state((adev), (state))
-#define amdgpu_asic_reset(adev) (adev)->asic_funcs->reset((adev))
+#define amdgpu_asic_reset(adev) ({int r; pr_info("performing amdgpu_asic_reset\n"); r = (adev)->asic_funcs->reset((adev));r;})
#define amdgpu_asic_reset_method(adev) (adev)->asic_funcs->reset_method((adev))
#define amdgpu_asic_get_xclk(adev) (adev)->asic_funcs->get_xclk((adev))
#define amdgpu_asic_set_uvd_clocks(adev, v, d) (adev)->asic_funcs->set_uvd_clocks((adev), (v), (d))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..842abd7150a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -134,6 +134,7 @@ struct amdkfd_process_info {
/* MMU-notifier related fields */
atomic_t evicted_bos;
+ atomic_t invalid;
struct delayed_work restore_userptr_work;
struct pid *pid;
};
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99d2b15bcbf3..2a588eb9f456 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
atomic_set(&info->evicted_bos, 0);
+ atomic_set(&info->invalid, 0);
INIT_DELAYED_WORK(&info->restore_userptr_work,
amdgpu_amdkfd_restore_userptr_worker);
@@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
struct mm_struct *mm;
int evicted_bos;
+ if (atomic_read(&process_info->invalid))
+ return;
+
evicted_bos = atomic_read(&process_info->evicted_bos);
if (!evicted_bos)
return;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ec38517ab33f..e7d85d8d282d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1054,6 +1054,7 @@ void amdgpu_device_program_register_sequence(struct amdgpu_device *adev,
*/
void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
{
+ pr_debug("%s called\n",__func__);
pci_write_config_dword(adev->pdev, 0x7c, AMDGPU_ASIC_RESET_DATA);
}
@@ -1066,6 +1067,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
*/
int amdgpu_device_pci_reset(struct amdgpu_device *adev)
{
+ pr_debug("%s called\n",__func__);
return pci_reset_function(adev->pdev);
}
@@ -4702,6 +4704,8 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
bool need_full_reset, skip_hw_reset, vram_lost = false;
int r = 0;
+ pr_debug("%s called\n",__func__);
+
/* Try reset handler method first */
tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
reset_list);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 49bdf9ff7350..b469acb65c1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2518,7 +2518,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device *adev,
if (!ras_block || !ih_info)
return;
- amdgpu_ras_sysfs_remove(adev, ras_block);
if (ih_info->cb)
amdgpu_ras_interrupt_remove_handler(adev, ih_info);
}
@@ -2577,6 +2576,7 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev)
int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
{
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+ struct ras_manager *obj, *tmp;
if (!adev->ras_enabled || !con)
return 0;
@@ -2585,6 +2585,13 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
/* Need disable ras on all IPs here before ip [hw/sw]fini */
amdgpu_ras_disable_all_features(adev, 0);
amdgpu_ras_recovery_fini(adev);
+
+ /* remove sysfs before pci_remove to avoid OOPSES from sysfs_remove_groups */
+ list_for_each_entry_safe(obj, tmp, &con->head, node) {
+ amdgpu_ras_sysfs_remove(adev, &obj->head);
+ put_obj(obj);
+ }
+
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..0fa806a78e39 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -693,16 +693,35 @@ bool kfd_is_locked(void)
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
{
+ struct kfd_process *p;
+ struct amdkfd_process_info *p_info;
+ unsigned int temp;
+
if (!kfd->init_complete)
return;
/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes(force);
}
+ if (drm_dev_is_unplugged(kfd->ddev)){
+ int idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_debug("cancel restore_userptr_wor\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ if ( kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0 ) {
+ p_info = p->kgd_process_info;
+ pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
+ cancel_delayed_work_sync(&p_info->restore_userptr_work);
+ /* block all future restore_userptr_work */
+ atomic_inc(&p_info->invalid);
+ }
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+ }
+
kfd->dqm->ops.stop(kfd->dqm);
kfd_iommu_suspend(kfd);
}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 600ba2a728ea..7e3d1848eccc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -669,11 +669,12 @@ static void kfd_remove_sysfs_node_entry(struct kfd_topology_device *dev)
#ifdef HAVE_AMD_IOMMU_PC_SUPPORTED
if (dev->kobj_perf) {
list_for_each_entry(perf, &dev->perf_props, list) {
+ sysfs_remove_group(dev->kobj_perf, perf->attr_group);
kfree(perf->attr_group);
perf->attr_group = NULL;
}
kobject_del(dev->kobj_perf);
- kobject_put(dev->kobj_perf);
+ /* kobject_put(dev->kobj_perf); */
dev->kobj_perf = NULL;
}
#endif
Thank you so much! Looking forward to your comments!
Regards,
Shuotao
Andrey
Thank you so much!
Best regards,
Shuotao
Andrey
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8fa9b86ac9d2..c0b27f722281 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
}
+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
+{
+ if (adev->kfd.dev)
+ kgd2kfd_kill_all_user_processes(adev->kfd.dev);
+}
+
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
{
if (adev->kfd.dev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..f4e485d60442 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -141,6 +141,7 @@ struct amdkfd_process_info {
int amdgpu_amdkfd_init(void);
void amdgpu_amdkfd_fini(void);
+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
@@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
const struct kgd2kfd_shared_resources *gpu_resources);
void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
@@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
{
}
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd){
+}
+
static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3d5fc0751829..af6fe5080cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
{
struct drm_device *dev = pci_get_drvdata(pdev);
+ /* kill all kfd processes before drm_dev_unplug */
+ amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
+
#ifdef HAVE_DRM_DEV_UNPLUG
drm_dev_unplug(dev);
#else
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5504a18b5a45..480c23bef5e2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -691,6 +691,12 @@ bool kfd_is_locked(void)
return (atomic_read(&kfd_locked) > 0);
}
+inline void kgd2kfd_kill_all_user_processes(struct kfd_dev* dev)
+{
+ kfd_kill_all_user_processes();
+}
+
+
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
{
if (!kfd->init_complete)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..a35a2cb5bb9f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
void kfd_unref_process(struct kfd_process *p);
int kfd_process_evict_queues(struct kfd_process *p, bool force);
int kfd_process_restore_queues(struct kfd_process *p);
+void kfd_kill_all_user_processes(void);
void kfd_suspend_all_processes(bool force);
/*
* kfd_resume_all_processes:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..17e769e6951d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -46,6 +46,9 @@ struct mm_struct;
#include "kfd_trace.h"
#include "kfd_debug.h"
+static atomic_t kfd_process_locked = ATOMIC_INIT(0);
+static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
+
/*
* List of struct kfd_process (field kfd_process).
* Unique/indexed by mm_struct*
@@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
struct kfd_process *process;
int ret;
+ if ( atomic_read(&kfd_process_locked) > 0 )
+ return ERR_PTR(-EINVAL);
+
if (!(thread->mm && mmget_not_zero(thread->mm)))
return ERR_PTR(-EINVAL);
@@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
put_task_struct(p->lead_thread);
kfree(p);
+
+ if ( atomic_read(&kfd_process_locked) > 0 ){
+ atomic_dec(&kfd_inflight_kills);
+ }
}
static void kfd_process_ref_release(struct kref *ref)
@@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
}
+void kfd_kill_all_user_processes(void)
+{
+ struct kfd_process *p;
+ /* struct amdkfd_process_info *p_info; */
+ unsigned int temp;
+ int idx;
+ atomic_inc(&kfd_process_locked);
+
+ idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_info("Killing all processes\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ dev_warn(kfd_device,
+ "Sending SIGBUS to process %d (pasid 0x%x)",
+ p->lead_thread->pid, p->pasid);
+ send_sig(SIGBUS, p->lead_thread, 0);
+ atomic_inc(&kfd_inflight_kills);
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+
+ while ( atomic_read(&kfd_inflight_kills) > 0 ){
+ dev_warn(kfd_device,
+ "kfd_processes_table is not empty, going to sleep for 10ms\n");
+ msleep(10);
+ }
+
+ atomic_dec(&kfd_process_locked);
+ pr_info("all processes has been fully released");
+}
+
void kfd_suspend_all_processes(bool force)
{
struct kfd_process *p;
Regards,
Shuotao
Andrey
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
int kfd_resume_all_processes(bool sync)
{
struct kfd_process *p;
Andrey
Really appreciate your help!
Best,
Shuotao
2. Remove redundant p2p/io links in sysfs when device is hotplugged
out.
3. New kfd node_id is not properly assigned after a new device is
added after a gpu is hotplugged out in a system. libhsakmt will
find this anomaly, (i.e. node_from != <dev node id> in iolinks),
when taking a topology_snapshot, thus returns fault to the rocm
stack.
-- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}
+int amdgpu_amdkfd_resume_processes(void)
+{
+ return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}
+static inline int kgd2kfd_resume_processes(void)
+{
+ return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);
+ amdgpu_amdkfd_resume_processes();
}
void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}
+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+ int count;
+
+ count = atomic_dec_return(&kfd_locked);
+ WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+ if (count == 0)
+ return kfd_resume_all_processes();
+
+ return 0;
+}
It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend to only decrement it again a few functions down the
road.
I suggest this instead - you only increment if not during PCI remove:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;
/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();
Andrey
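The reference-count behaviour being discussed (lock on the first suspend, unlock on the last resume, and skip the lock entirely during PCI remove) can be sketched in user space as follows. All names here are hypothetical stand-ins: locked plays the role of kfd_locked, and the two call counters replace kfd_suspend_all_processes() and kfd_resume_all_processes().

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int locked;     /* stand-in for kfd_locked */
static int suspend_all_calls; /* how often "suspend all processes" ran */
static int resume_all_calls;  /* how often "resume all processes" ran */

/* Suspend one device; the count is not taken for runtime PM or unplug. */
static void device_suspend(bool run_pm, bool unplugged)
{
	if (!run_pm && !unplugged) {
		/* The first device to suspend stops all processes. */
		if (atomic_fetch_add(&locked, 1) + 1 == 1)
			suspend_all_calls++;
	}
}

/* Resume one device; the last one to resume restarts all processes. */
static void device_resume(bool run_pm)
{
	if (!run_pm) {
		if (atomic_fetch_sub(&locked, 1) - 1 == 0)
			resume_all_calls++;
	}
}
```

Because the unplug path never increments the counter, there is no matching decrement needed later in the teardown, which is the point of the suggestion above.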
+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;