[Public] Sounds good. I'd also change: >>>>>> KFD currently relies on MEC FW to clear tcp watch control >>>>>> register by sending MAP_PROCESS packet with 0 of field >>>>>> tcp_watch_cntl to HWS, but if the queue is suspended, the >>>>>> packet will not be sent and the previous value will be >>>>>> left on the register, that will affect the following apps. >>>>>> So the solution is to clear the register as gfx v9 in KFD. To something like: KFD currently relies on MEC FW to clear tcp watch control register on UNMAP_QUEUES. Due to a FW bug, MEC does not do this. So the solution is to clear the register as gfx v9 in KFD. With those fixed, this patch is Reviewed-by: Jonathan Kim <jonathan.kim@xxxxxxx> Hopefully we can get away with this since every watch instance register is supposed to be 1-1 to a process ... And that there's no race scenarios with trailing exceptions on dynamic watch point address changes ... Thanks, Jon > -----Original Message----- > From: Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx> > Sent: Thursday, August 10, 2023 6:31 PM > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Kuehling, Felix > <Felix.Kuehling@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2 > > I will change title to "drm/amdkfd: workaround address watch clearing > bug for gfx v9.4.2". is it OK? > > Regards, > Eric > > On 2023-08-10 18:25, Kim, Jonathan wrote: > > [Public] > > > > Yeah this is a recent bug so this workaround is new. More rigorous tests > revealed this is probably a miss on the FW side. We explicitly requested > UNMAP_QUEUES unconditionally invalidate watch controls during the > beginning of design to prevent any watch point racing. > > > > Note GFX11 MES calls are different on the surface but under the hood it's > the same (registers get invalidated on unmap then get updated on map. > Only difference it's at the queue level). > > > > I'm fine with this solution but I think it'd be good to describe this as a > workaround somewhere (as opposed to a driver issue) so that folks aren't > scratching their heads later on looking at code for GFX11 and up and > wondering why we don't nuke the control setting with the KFD for those > devices. > > > > Thanks, > > > > Jon > > > >> -----Original Message----- > >> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > >> Sent: Thursday, August 10, 2023 5:56 PM > >> To: Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; Kim, Jonathan > >> <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx > v9.4.2 > >> > >> I think Jon is suggesting that the UNMAP_QUEUES command should clear > the > >> address watch registers. Requesting such a change from the the HWS team > >> may take a long time. > >> > >> That said, when was this workaround implemented and reviewed? Did I > >> review it as part of Jon's debugger upstreaming patch series? Or did > >> this come later? This patch only enables the workaround for v9.4.2. > >> > >> Regards, > >> Felix > >> > >> > >> On 2023-08-10 17:52, Eric Huang wrote: > >>> The problem is the queue is suspended before clearing address watch > >>> call in KFD, there is not queue preemption and queue resume after > >>> clearing call, and the test ends. So there is not chance to send > >>> MAP_PROCESS to HWS. At this point FW has nothing to do. We have > >>> several test FWs from Tej, none of them works, so I recalled the > >>> kernel debug log and found out the problem. > >>> > >>> GFX11 has different scheduler, when calling clear address watch, KFD > >>> directly sends the MES_MISC_OP_SET_SHADER_DEBUGGER to MES, it > >> doesn't > >>> consider if the queue is suspended. So GFX11 doesn't have this issue. > >>> > >>> Regards, > >>> Eric > >>> > >>> On 2023-08-10 17:27, Kim, Jonathan wrote: > >>>> [AMD Official Use Only - General] > >>>> > >>>> This is a strange solution because the MEC should set watch controls > >>>> as non-valid automatically on queue preemption to avoid this kind of > >>>> issue in the first place by design. MAP_PROCESS on resume will take > >>>> whatever the driver requests. > >>>> GFX11 has no issue with letting the HWS do this. > >>>> > >>>> Are we sure we're not working around some HWS bug? > >>>> > >>>> Thanks, > >>>> > >>>> Jon > >>>> > >>>>> -----Original Message----- > >>>>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > >>>>> Sent: Thursday, August 10, 2023 5:03 PM > >>>>> To: Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; amd- > >>>>> gfx@xxxxxxxxxxxxxxxxxxxxx > >>>>> Cc: Kim, Jonathan <Jonathan.Kim@xxxxxxx> > >>>>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for > >>>>> gfx v9.4.2 > >>>>> > >>>>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a > bit > >>>>> different because it needs to support multiple XCCs. > >>>>> > >>>>> That said, this patch is > >>>>> > >>>>> Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > >>>>> > >>>>> > >>>>> On 2023-08-10 16:47, Eric Huang wrote: > >>>>>> KFD currently relies on MEC FW to clear tcp watch control > >>>>>> register by sending MAP_PROCESS packet with 0 of field > >>>>>> tcp_watch_cntl to HWS, but if the queue is suspended, the > >>>>>> packet will not be sent and the previous value will be > >>>>>> left on the register, that will affect the following apps. > >>>>>> So the solution is to clear the register as gfx v9 in KFD. > >>>>>> > >>>>>> Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx> > >>>>>> --- > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 > +----- > >> -- > >>>>>> 1 file changed, 1 insertion(+), 7 deletions(-) > >>>>>> > >>>>>> diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c > >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c > >>>>>> index e2fed6edbdd0..aff08321e976 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c > >>>>>> @@ -163,12 +163,6 @@ static uint32_t > >>>>> kgd_gfx_aldebaran_set_address_watch( > >>>>>> return watch_address_cntl; > >>>>>> } > >>>>>> > >>>>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct > >>>>> amdgpu_device *adev, > >>>>>> - uint32_t watch_id) > >>>>>> -{ > >>>>>> - return 0; > >>>>>> -} > >>>>>> - > >>>>>> const struct kfd2kgd_calls aldebaran_kfd2kgd = { > >>>>>> .program_sh_mem_settings = > >>>>> kgd_gfx_v9_program_sh_mem_settings, > >>>>>> .set_pasid_vmid_mapping = > kgd_gfx_v9_set_pasid_vmid_mapping, > >>>>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd > = > >> { > >>>>>> .set_wave_launch_trap_override = > >>>>> kgd_aldebaran_set_wave_launch_trap_override, > >>>>>> .set_wave_launch_mode = > kgd_aldebaran_set_wave_launch_mode, > >>>>>> .set_address_watch = kgd_gfx_aldebaran_set_address_watch, > >>>>>> - .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch, > >>>>>> + .clear_address_watch = kgd_gfx_v9_clear_address_watch, > >>>>>> .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times, > >>>>>> .build_grace_period_packet_info = > >>>>> kgd_gfx_v9_build_grace_period_packet_info, > >>>>>> .program_trap_handler_settings = > >>>>> kgd_gfx_v9_program_trap_handler_settings,