Re: [PATCH] drm/amdkfd: Have kfd driver use same PASID values from graphic driver

"Chen, Xiaogang" <xiaogang.chen@xxxxxxx> · Thu, 21 Nov 2024 10:11:45 -0600

On 11/21/2024 9:22 AM, Felix Kuehling wrote:

On 2024-11-20 22:58, Chen, Xiaogang wrote:
   @@ -1822,15 +1804,20 @@ struct kfd_process *kfd_lookup_process_by_pasid(u32 pasid)
   {
       struct kfd_process *p, *ret_p = NULL;
       unsigned int temp;
+    int i;
         int idx = srcu_read_lock(&kfd_processes_srcu);
         hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
-        if (p->pasid == pasid) {
-            kref_get(&p->ref);
-            ret_p = p;
-            break;
+        for (i = 0; i < p->n_pdds; i++) {
+            if (p->pdds[i]->pasid == pasid) {
+                kref_get(&p->ref);
+                ret_p = p;
+                break;
+            }
I think this won't work correctly. The same PASID can be used for different processes on different GPUs because each adev manages its own PASID->amdgpu_vm lookup table. So kfd_lookup_process_by_pasid needs a new parameter that identifies the GPU adev, and you should only compare pasids, if the adev matches.
I think it is the main concern here: the pasid used here is global in driver by amdgpu_pasid_alloc(16) at amdgpu_driver_open_kms.  Each time a render node(partition) got opened, a new pasid value is generated. Its lifetime is until render node got closed. A pdd just uses this pasid.  And each adev has its own xarray which saves pasids for this adev.
I think I had a misunderstanding here. I saw the xarray in adev and assumed that each adev allocated PASIDs independently. But there is also a global amdgpu_pasid_ida that I missed. If the PASID allocation is global in the amdgpu_pasid_ida, then each PASID uniquely identifies a VM your code should be fine.

We use 16 bits to generate pasid, so could have 64k pasid values. 
Suppose system has 64 total partitions and each process use all of them, 
in theory we could have 1k processes.

Regards

Xiaogang

@Christian, we discussed the number of PASIDs consumed on systems with many GPUs and partitions. If the PASIDs are managed globally, then running out of PASIDs is a concern. Do you think we should change this?

Regards,
   Felix

Regards

Xiaogang