Re: [PATCH 08/34] drm/amdkfd: fix kfd_suspend_all_processes for gfx941 debugging

On 2023-03-27 14:43, Jonathan Kim wrote:
The debugger for GFX9.4.1 uses kfd_suspend_all_processes to pause the
compute pipeline so it can safely toggle the SQ's implicit wait on
barrier setting during debug attach/detach to work around the wave
exception s_barrier race condition.

For mGPU setups, repeated calls to cancel all outstanding restore work can
result in an asymmetric permanent cancelling of the restore work from the
debug device after it has toggled the HW workaround settings.

This is a bit hard to follow. Not sure what you mean by asymmetric.

I think this is a general bug in how kfd_suspend_all_processes and kfd_resume_all_processes interact. The latter schedules restore work. If that gets cancelled before it gets a chance to run, it will result in the queues staying preempted forever. It just happened that the barrier waitcount setting workaround on GFXv9.4.1 was good at triggering the bug.

I would simplify the description like this:

Flush delayed restore work in kfd_suspend_all_processes instead of cancelling. Cancelling the work before it runs results in the queues becoming permanently disabled. Flushing the work ensures that the queue suspend/resume state stays balanced.
With the updated description, the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>


Instead of cancelling the outstanding restore work, just flush it as it
will be properly evicted anyways by the current suspend call.

Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
---
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 1e3795e7e18d..55a4ddd35e12 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2008,7 +2008,7 @@ void kfd_suspend_all_processes(void)
  	WARN(debug_evictions, "Evicting all processes");
  	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
  		cancel_delayed_work_sync(&p->eviction_work);
-		cancel_delayed_work_sync(&p->restore_work);
+		flush_delayed_work(&p->restore_work);
  		if (kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SUSPEND))
  			pr_err("Failed to suspend process 0x%x\n", p->pasid);
