Am 2022-04-14 um 11:08 schrieb Yat Sin, David:
-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, April 14, 2022 10:52 AM
To: Yat Sin, David <David.YatSin@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count
Am 2022-04-13 um 11:51 schrieb David Yat Sin:
Queue can be inactive during process termination. This would cause
dqm->gws_queue_count to not be decremented. There can only be 1 GWS
queue per device process so moving the logic out of loop.
Signed-off-by: David Yat Sin <david.yatsin@xxxxxxx>
---
.../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 12 ++++++-----
-
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index acf4f7975850..7c107b88d944 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct
device_queue_manager *dqm,
else if (q->properties.type ==
KFD_QUEUE_TYPE_SDMA_XGMI)
deallocate_sdma_queue(dqm, q);
- if (q->properties.is_active) {
+ if (q->properties.is_active)
decrement_queue_count(dqm, q->properties.type);
- if (q->properties.is_gws) {
- dqm->gws_queue_count--;
- qpd->mapped_gws_queue = false;
- }
- }
This looks like the original intention was to decrement the counter for inactive
GWS queues. This makes sense because this counter is used to determine
whether the runlist is oversubscribed. Inactive queues are not in the runlist,
so they should not be counted.
I see that the counter is updated when queues are activated and deactivated
in update_queue. So IMO this patch is both incorrect and unnecessary. Did
you see an actual problem with the GWS counter during process termination?
If so, I'd like to know more to understand the root cause.
Regards,
Felix
Yes, when using a unit test (for example KFDGWSTest.Semaphore),
1. Put a sleep in the middle of the application (after calling hsaKmtAllocQueueGWS)
2. Run application and kill the application which it is in sleep (application never calls queue.Destroy()).
Then inside kernel dqm->gws_queue_count keeps incrementing each time the experiment is repeated and never goes back to 0. This seems to be because the sleep causes the process to be evicted and q->properties.is_active is false so code does not enter that if statement.
That's weird. Can you find out why it's not getting there? In the test
you describe, I would expect the queue to be active, so the counter
should be decremented by the existing code.
Does the test evict the queues for some reason? is_active gets set to
false when a queue is evicted. Looks like we're missing code to update
the gws_queue_count in evict/restore_process_queues_cpsch (it is present
in evict/restore_process_queues_nocpsch).
Maybe the management of this counter should be moved into
increment/decrement_queue_count, so we don't need to duplicate it in
many places.
Regards,
Felix
Regards,
David
dqm->total_queue_count--;
}
+ if (qpd->mapped_gws_queue) {
+ dqm->gws_queue_count--;
+ qpd->mapped_gws_queue = false;
+ }
+
/* Unregister process */
list_for_each_entry_safe(cur, next_dpn, &dqm->queues, list) {
if (qpd == cur->qpd) {