On 2025-03-07 03:53, Yifan Zha wrote:
[Why]
If reset is detected and kfd need to evict working queues, HWS moving queue will be failed.
Then remaining queues are not evicted and in active state.
After reset done, kfd uses HWS to termination remaining activated queues but HWS is resetted.
So remove queue will be failed again.
[How]
Keep removing all queues even if HWS returns failed.
It will not affect cpsch as it checks reset_domain->sem.
v2: If any queue failed, evict queue returns error.
Signed-off-by: Yifan Zha <Yifan.Zha@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index f3f2fd6ee65c..b647745ee0a5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1189,7 +1189,7 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
struct queue *q;
struct device *dev = dqm->dev->adev->dev;
struct kfd_process_device *pdd;
- int retval = 0;
+ int retval, err = 0;
You should still initialize retval to 0, otherwise the function may
return an uninitialized value if there are no MES queues. err does not
need to be initialized because you're always assigning a value just
before checking it below.
In fact, you could declare err inside the if-block below, since it is
only needed in that scope. It is preferred to declare variables in a
more limited scope if they are not needed outside.
Regards,
Felix
dqm_lock(dqm);
if (qpd->evicted++ > 0) /* already evicted, do nothing */
@@ -1219,11 +1219,11 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
decrement_queue_count(dqm, qpd, q);
if (dqm->dev->kfd->shared_resources.enable_mes) {
- retval = remove_queue_mes(dqm, q, qpd);
- if (retval) {
+ err = remove_queue_mes(dqm, q, qpd);
+ if (err) {
dev_err(dev, "Failed to evict queue %d\n",
q->properties.queue_id);
- goto out;
+ retval = err;
}
}
}