-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, August 15, 2024 2:25 PM
To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Subject: Re: [PATCHv2 2/3] drm/amdkfd: Update queue unmap after VM fault
with MES
On 2024-08-14 19:27, Mukul Joshi wrote:
MEC FW expects MES to unmap all queues when a VM fault is observed on
a queue and then resumed once the affected process is terminated.
Use the MES Suspend and Resume APIs to achieve this.
Signed-off-by: Mukul Joshi <mukul.joshi@xxxxxxx>
---
v1->v2:
- Add MES FW version check.
- Separate out the kfd_dqm_evict_pasid into another function.
- Use amdgpu_mes_suspend/amdgpu_mes_resume to suspend/resume
queues.
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 79
++++++++++++++++++-
1 file changed, 77 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index f6e211070299..cb5b866eee3b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -319,6 +319,42 @@ static int remove_all_queues_mes(struct
device_queue_manager *dqm)
return retval;
}
+static int suspend_all_queues_mes(struct device_queue_manager *dqm) {
+ struct amdgpu_device *adev = (struct amdgpu_device *)dqm->dev-
adev;
+ int r = 0;
+
+ if (dqm->is_hws_hang)
+ return -EIO;
+
+ r = amdgpu_mes_suspend(adev);
+ if (r) {
+ dev_err(adev->dev, "failed to suspend gangs from MES\n");
+ dev_err(adev->dev, "MES might be in unrecoverable state,
issue a GPU reset\n");
+ kfd_hws_hang(dqm);
+ }
+
+ return r;
+}
+
+static int resume_all_queues_mes(struct device_queue_manager *dqm) {
+ struct amdgpu_device *adev = (struct amdgpu_device *)dqm->dev-
adev;
+ int r = 0;
+
+ if (dqm->is_hws_hang)
+ return -EIO;
+
+ r = amdgpu_mes_resume(adev);
+ if (r) {
+ dev_err(adev->dev, "failed to resume gangs from MES\n");
+ dev_err(adev->dev, "MES might be in unrecoverable state,
issue a GPU reset\n");
+ kfd_hws_hang(dqm);
+ }
+
+ return r;
+}
+
static void increment_queue_count(struct device_queue_manager *dqm,
struct qcm_process_device *qpd,
struct queue *q)
@@ -2835,6 +2871,40 @@ void device_queue_manager_uninit(struct
device_queue_manager *dqm)
kfree(dqm);
}
+static int kfd_dqm_evict_pasid_mes(struct device_queue_manager *dqm,
+ struct qcm_process_device *qpd) {
+ struct device *dev = dqm->dev->adev->dev;
+ int ret = 0;
+
+ /* Check if process is already evicted */
+ dqm_lock(dqm);
+ if (qpd->evicted) {
+ dqm_unlock(dqm);
+ goto out;
qpd->evicted is a reference count. Without this shortcut,
dqm->ops.evict_process_queues will increment the ref count. You probably
need to increment it here before dropping the lock. Otherwise two things can
go wrong:
1. The corresponding dqm->ops.restore_process_queues will underflow the
reference count
2. A race condition where the queues get restored too early