Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2023-08-11 09:26, Felix Kuehling wrote:
Am 2023-08-10 um 18:27 schrieb Eric Huang:
There is not UNMAP_QUEUES command sending for queue preemption because the queue is suspended and test is closed to the end. Function unmap_queue_cpsch will do nothing after that.

How do you suspend queues without sending an UNMAP_QUEUES command?
Now I understand what you mean, I was only thinking of UNMAP_QUEUES sending after clearing call. So MEC FW should clear the control register unconditionally on every UNMAP_QUEUES command. We can request it for gfx v9.4.3 to avoid the awkward workaround in KFD.

Thanks,
Eric

Regards,
  Felix



The workaround is new and only for gfx v9.4.2, because debugger tests has changed to check if all address watch points are correctly set, i.e. test A sets more than one watchpoint and leave, the following test B only sets one watchpoint, and test A's setting will cause more than one watchpoint event, so test B check out and report error on second or third watchpoint not set by itself.

Regards,
Eric

On 2023-08-10 17:56, Felix Kuehling wrote:
I think Jon is suggesting that the UNMAP_QUEUES command should clear the address watch registers. Requesting such a change from the the HWS team may take a long time.

That said, when was this workaround implemented and reviewed? Did I review it as part of Jon's debugger upstreaming patch series? Or did this come later? This patch only enables the workaround for v9.4.2.

Regards,
  Felix


On 2023-08-10 17:52, Eric Huang wrote:
The problem is the queue is suspended before clearing address watch call in KFD, there is not queue preemption and queue resume after clearing call, and the test ends. So there is not chance to send MAP_PROCESS to HWS. At this point FW has nothing to do. We have several test FWs from Tej, none of them works, so I recalled the kernel debug log and found out the problem.

GFX11 has different scheduler, when calling clear address watch, KFD directly sends the MES_MISC_OP_SET_SHADER_DEBUGGER to MES, it doesn't consider if the queue is suspended. So GFX11 doesn't have this issue.

Regards,
Eric

On 2023-08-10 17:27, Kim, Jonathan wrote:
[AMD Official Use Only - General]

This is a strange solution because the MEC should set watch controls as non-valid automatically on queue preemption to avoid this kind of issue in the first place by design. MAP_PROCESS on resume will take whatever the driver requests.
GFX11 has no issue with letting the HWS do this.

Are we sure we're not working around some HWS bug?

Thanks,

Jon

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, August 10, 2023 5:03 PM
To: Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; amd-
gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
different because it needs to support multiple XCCs.

That said, this patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>


On 2023-08-10 16:47, Eric Huang wrote:
KFD currently relies on MEC FW to clear tcp watch control
register by sending MAP_PROCESS packet with 0 of field
tcp_watch_cntl to HWS, but if the queue is suspended, the
packet will not be sent and the previous value will be
left on the register, that will affect the following apps.
So the solution is to clear the register as gfx v9 in KFD.

Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +-------
   1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
index e2fed6edbdd0..aff08321e976 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
@@ -163,12 +163,6 @@ static uint32_t
kgd_gfx_aldebaran_set_address_watch(
     return watch_address_cntl;
   }

-static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
amdgpu_device *adev,
- uint32_t watch_id)
-{
-   return 0;
-}
-
   const struct kfd2kgd_calls aldebaran_kfd2kgd = {
     .program_sh_mem_settings =
kgd_gfx_v9_program_sh_mem_settings,
     .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
@@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
     .set_wave_launch_trap_override =
kgd_aldebaran_set_wave_launch_trap_override,
     .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
     .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
-   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
+   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
     .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
     .build_grace_period_packet_info =
kgd_gfx_v9_build_grace_period_packet_info,
.program_trap_handler_settings =
kgd_gfx_v9_program_trap_handler_settings,






[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux