Re: [PATCH 21/24] drm/amdkfd: add queue remapping

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2023-11-23 18:01, Felix Kuehling wrote:
On 2023-11-23 17:41, Greathouse, Joseph wrote:
[Public]

-----Original Message-----
From: Zhu, James <James.Zhu@xxxxxxx>
Sent: Thursday, November 23, 2023 1:49 PM

On 2023-11-23 14:02, Felix Kuehling wrote:
On 2023-11-23 11:25, James Zhu wrote:
On 2023-11-22 17:35, Felix Kuehling wrote:
On 2023-11-03 09:11, James Zhu wrote:
Add queue remapping to force the waves in any running
processes to complete a CWSR trap.
Please add an explanation why this is needed.
[JZ] Even though the profiling-enabled bits is turned off, the CWSR
trap handlers for some kernels with this process may still in running
stage, this will

force the waves in any running processes to complete a CWSR trap, and
make sure pc sampling is completely stopped with this process.   I
will add it later.
It may be confusing to talk specifically about "CWSR trap handler".
There is only one trap handler that is triggered by different events:
CWSR, host trap, s_trap instructions, exceptions, etc. When a new trap
triggers, it serializes with any currently running trap handler in
that wavefront. So it seems that you're using CWSR as a way to ensure
that any host trap has completed: CWSR will wait for previous traps to
finish before trapping again for CWSR, the HWS firmware waits for CWSR
completion and the driver waits for HWS to finish CWSR with a fence on
a HIQ QUERY_STATUS packet. Is that correct?
[JZ] I think your explanation is more detail. Need Joseph to confirm.
Felix, your summary is correct. The reason we are trying to perform a queue unmap/map cycle as part of the PC sampling stop is to prevent the following:

1. A PC sampling request arrives to Wave X, sending it to 1st-level trap handler 2. User thread asks KFD to stop sampling for this process, which leads to kfd_pc_sample_stop() 3. kfd_pc_sample_stop() decrements the sampling refcent. If this is the last process to stop sampling, it stops any further sampling traps from being generated 4. kfd_pc_sample_stop() sets this process's TMA flag to false so waves in the 1st-level trap handler know sampling is disabled      4.1. Wave X may be in 1st-level handler and not yet checked the TMA flag. If so, it will exit the 1st-level handler when it sees flag is false      4.2. Wave X may have already passed the 1st-level TMA flag check and entered the 2nd-level trap handler to do the PC sample 5. kfd_pc_sample_stop() returns, eventually causing ioctl to return, back to user-space 6. Because the stop ioctl has returned, user-land deallocates user-space buffer the 2nd level trap handler uses to output sample data 7. Wave X that was in the 2nd-level handler tries to finish its sample output and writes to the now-freed location, causing a use-after-free

Note that Step 3 does not always stop further traps from arriving -- if another process still wants to do sampling, the driver or HW might still send traps to every wave on the device after Step 3. As such, to avoid going into the 2nd-level handler for non-sampled processes, all 1st-level handlers must check their TMA flag to see if they should allow the sample to flow to the 2nd-level handler.

By removing the queue from the HW after Step 4, we can be sure that any existing waves from this process that entered the PC sampling 2nd-level handler before Step 4 are done. Any waves that were still in the 1st-level handler at Step 4.1 will be filtered by the TMA flag being set to false. CWSR will wait until they exit. Any waves that were already in the 2nd-level handler (4.2) must complete before the CWSR save will complete and allow this queue removal request to complete. Any waves that enter the 1st-level trap handler after Step 4 won't go into the PC sampling logic in the 2nd-level handler because the TMA flag is set to false. CWSR will wait until they exit.

When we then put the queue back on the hardware, any further traps that might show up (e.g. because another process is sampling) will get filtered by the TMA flag.

So once the queue removal (and thus CWSR save cycle) has completed, we can be sure that no other traps to this process will try to use its PC sample data buffer, so it's safe to return to user-space and let them potentially free that buffer.

I don't know how to summarize this nicely in a comment, but hopefully y'all can figure that out. :)

My best summary: We need to ensure that any waves executing the PC sampling part of the trap handler are done before kfd_pc_sample_stop returns, and that no new waves enter that part of the trap handler afterwards. This avoids race conditions that could lead to use-after-free. Unmapping and remapping the queues either waits for the waves to drain, or preempts them with CWSR, which itself executes a trap and waits for previous traps to finish.

[JZ]  Thanks all!

Regards,
  Felix



Thanks,
-Joe

Regards,
   Felix


Signed-off-by: James Zhu <James.Zhu@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 11
+++++++++++
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h |  5 +++++
drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c |  3 +++
   3 files changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index c0e71543389a..a3f57be63f4f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -3155,6 +3155,17 @@ int debug_refresh_runlist(struct
device_queue_manager *dqm)
       return debug_map_and_unlock(dqm);
   }
   +void remap_queue(struct device_queue_manager *dqm,
+                enum kfd_unmap_queues_filter filter,
+                uint32_t filter_param,
+                uint32_t grace_period)
Not sure if you need the filter and grace period parameters in this
function. What's the point of exposing that to callers who just want
to trigger a CWSR? You could also change the function name to
reflect the purpose of the function, rather than the implementation.
[JZ] Just want to create a general function in case that used by
others. I am fine to remove passing filter_param/grace_period
Regards,
   Felix


+{
+    dqm_lock(dqm);
+    if (!dqm->dev->kfd->shared_resources.enable_mes)
+        execute_queues_cpsch(dqm, filter, filter_param,
grace_period);
+    dqm_unlock(dqm);
+}
+
   #if defined(CONFIG_DEBUG_FS)
     static void seq_reg_dump(struct seq_file *m,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index cf7e182588f8..f8aae3747a36 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -303,6 +303,11 @@ int debug_lock_and_unmap(struct
device_queue_manager *dqm);
   int debug_map_and_unlock(struct device_queue_manager *dqm);
   int debug_refresh_runlist(struct device_queue_manager *dqm);
   +void remap_queue(struct device_queue_manager *dqm,
+                enum kfd_unmap_queues_filter filter,
+                uint32_t filter_param,
+                uint32_t grace_period);
+
   static inline unsigned int get_sh_mem_bases_32(struct
kfd_process_device *pdd)
   {
       return (pdd->lds_base >> 16) & 0xFF;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
b/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
index e8f0559b618e..66670cdb813a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
@@ -24,6 +24,7 @@
   #include "kfd_priv.h"
   #include "amdgpu_amdkfd.h"
   #include "kfd_pc_sampling.h"
+#include "kfd_device_queue_manager.h"
     struct supported_pc_sample_info {
       uint32_t ip_version;
@@ -164,6 +165,8 @@ static int kfd_pc_sample_stop(struct
kfd_process_device *pdd,
cancel_work_sync(&pdd->dev->pcs_data.hosttrap_entry.base.pc_sampling_work);

kfd_process_set_trap_pc_sampling_flag(&pdd->qpd,
pdd->dev->pcs_data.hosttrap_entry.base.pc_sample_info.method, false);
+        remap_queue(pdd->dev->dqm,
+            KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0,
USE_DEFAULT_GRACE_PERIOD);
mutex_lock(&pdd->dev->pcs_data.mutex);
pdd->dev->pcs_data.hosttrap_entry.base.target_simd = 0;



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux