Re: [PATCH] drm/amdkfd: fix missed queue reset on queue destroy

On 2024-08-22 11:17, Jonathan Kim wrote:
If a queue is being destroyed but causes a HWS hang on removal, the KFD
may issue an unnecessary gpu reset if the destroyed queue can be fixed
by a queue reset.

This is because the queue has been removed from the KFD's queue list
prior to the preemption action on destroy so the reset call will fail to
match the HQD PQ reset information against the KFD's queue record to do
the actual reset.

To fix this, deactivate the queue prior to preemption, since it's being
destroyed anyway, and remove the queue from the KFD's queue list after
preemption.

v2: deactivate the queue early and delete it from the list later, as per
the description, instead of the destroy queue referencing hack.

Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 577d121cc6d1..6d5a632b95eb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2407,10 +2407,10 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
  		pdd->sdma_past_activity_counter += sdma_val;
  	}
-	list_del(&q->list);
  	qpd->queue_count--;

You may need to move the queue_count update as well to keep things consistent. Please make sure this passes KFD queue tests on GPUs with HWS and MES.

Other than that, this patch is

Reviewed-by: Felix Kuehling <felix.kuehling@xxxxxxx>


  	if (q->properties.is_active) {
  		decrement_queue_count(dqm, qpd, q);
+		q->properties.is_active = false;
  		if (!dqm->dev->kfd->shared_resources.enable_mes) {
  			retval = execute_queues_cpsch(dqm,
  						      KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
@@ -2421,6 +2421,7 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
  			retval = remove_queue_mes(dqm, q, qpd);
  		}
  	}
+	list_del(&q->list);
  	/*
  	 * Unconditionally decrement this counter, regardless of the queue's


