On 2023-03-27 14:43, Jonathan Kim wrote:
The debugger for GFX9.4.1 uses kfd_suspend_all_processes to pause the
compute pipeline so it can safely toggle the SQ's implicit wait on
barrier setting during debug attach/detach to work around the wave
exception s_barrier race condition.
For mGPU setups, repeated calls to cancel all outstanding restore work can
result in an asymmetric permanent cancelling of the restore work from the
debug device after it has toggled the HW workaround settings.
This is a bit hard to follow. Not sure what you mean by asymmetric.
I think this is a general bug in how kfd_suspend_all_processes and
kfd_resume_all_processes interact. The latter schedules restore work. If
that gets cancelled before it gets a chance to run, it will result in
the queues staying preempted forever. It just happened that the barrier
waitcount setting workaround on GFXv9.4.1 was good at triggering the bug.
I would simplify the description like this:
Flush delayed restore work in kfd_suspend_all_processes instead of
cancelling. Cancelling the work before it runs results in the queues
becoming permanently disabled. Flushing the work ensures that the
queue suspend/resume state stays balanced.
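
To make the imbalance concrete, here is a small user-space toy model. It is
only a sketch, not driver code: struct toy_process and every toy_* function
are hypothetical names invented for illustration, and it assumes that
eviction is reference counted, so that queues run only when the count is
back at zero. The pending flag stands in for a restore_work item that has
been scheduled but has not run yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-process queue state. */
struct toy_process {
    int evicted;           /* assumed eviction refcount; 0 means queues run */
    bool restore_pending;  /* models restore work scheduled but not yet run */
};

void toy_evict(struct toy_process *p)
{
    p->evicted++;
}

void toy_restore(struct toy_process *p)
{
    assert(p->evicted > 0);
    p->evicted--;
}

/* Models the worker eventually running the delayed restore work. */
void toy_run_pending(struct toy_process *p)
{
    if (p->restore_pending) {
        p->restore_pending = false;
        toy_restore(p);
    }
}

/* Resume only schedules the restore; it does not run it. */
void toy_resume(struct toy_process *p)
{
    p->restore_pending = true;
}

/* Old behaviour: drop the pending restore, then evict again. */
void toy_suspend_cancel(struct toy_process *p)
{
    p->restore_pending = false;  /* this restore never runs */
    toy_evict(p);
}

/* Patched behaviour: let the pending restore run, then evict again. */
void toy_suspend_flush(struct toy_process *p)
{
    toy_run_pending(p);
    toy_evict(p);
}

/*
 * suspend, resume, suspend again before the restore ran, resume,
 * then let the worker run. Returns the final eviction count.
 */
int toy_scenario(bool use_flush)
{
    struct toy_process p = { 0, false };

    if (use_flush) toy_suspend_flush(&p); else toy_suspend_cancel(&p);
    toy_resume(&p);
    if (use_flush) toy_suspend_flush(&p); else toy_suspend_cancel(&p);
    toy_resume(&p);
    toy_run_pending(&p);

    return p.evicted;
}
```

In this model toy_scenario(false) ends with a count of 1, so the queues
never come back, while toy_scenario(true) ends at 0. The real driver has
more state than one counter, so treat this only as an illustration of why
cancelling unbalances suspend/resume and flushing does not.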
With the updated description, the patch is
Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
Instead of cancelling the outstanding restore work, just flush it as it
will be properly evicted anyway by the current suspend call.
Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 1e3795e7e18d..55a4ddd35e12 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2008,7 +2008,7 @@ void kfd_suspend_all_processes(void)
 	WARN(debug_evictions, "Evicting all processes");
 	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
 		cancel_delayed_work_sync(&p->eviction_work);
-		cancel_delayed_work_sync(&p->restore_work);
+		flush_delayed_work(&p->restore_work);
 		if (kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SUSPEND))
 			pr_err("Failed to suspend process 0x%x\n", p->pasid);