RE: [PATCH] drm/amdgpu:Check gfx poweron when skip flush_gpu_tlb

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - General]

Hi,

Can I get the RB for this patch? To fix the reset failure with following calltrace?

[   72.743001] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:160 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[   72.743009] [drm] ring compute_32769.2.2 was added
[   72.743024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00000000004a2000 from client 10
[   72.743038] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B40
[   72.743050] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
[   72.743056] [drm] ring sdma_32769.3.3 was added
[   72.743061] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x0
[   72.743069] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x0
[   72.743077] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x4
[   72.743086] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x1
[   72.743095] amdgpu 0000:03:00.0: amdgpu:      RW: 0x1
[   72.743105] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:144 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[   72.743122] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00000000004a2000 from client 10
[   72.743135] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B21
[   72.743145] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
[   72.743155] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x1
[   72.743164] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x0
[   72.743173] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x2
[   72.743181] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x1
[   72.743189] amdgpu 0000:03:00.0: amdgpu:      RW: 0x0

Thanks,
Feifei

-----Original Message-----
From: Xu, Feifei
Sent: Tuesday, October 10, 2023 6:14 PM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: RE: [PATCH] drm/amdgpu:Check gfx poweron when skip flush_gpu_tlb

If the behavior is correct, this patch looks like workaround HW reset not flushed the TLB or something can be workaround by adding a gpu TLB flush.

Thanks,
Feifei

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Tuesday, October 10, 2023 5:07 PM
To: Xu, Feifei <Feifei.Xu@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu:Check gfx poweron when skip flush_gpu_tlb

Hi Feifei,

yeah, that is correct behavior. The GMC callback should *not* get called during resume in a reset, because the reset needs to take care of invalidating the TLB anyway.

If the later doesn't work any more we need to re-iterate the reset procedure and not mess with this here.

Regards,
Christian.

Am 10.10.23 um 04:27 schrieb Xu, Feifei:
> [AMD Official Use Only - General]
>
> Yes, adev->gfx.is_poweron check will be applied in gmc_v11 callback, which will be called after the generic gmc part: amdgpu_gmc_flush_gpu_tlb() function.
> But in commit: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2"), the flush is moved at a higher level amdgpu_gmc_flush_gpu_tlb() function.
>
> Thus the gmc_v11 callback will never be called in the resume because adev->reset_domain->sem not released and returned ahead. Adding a check of adev->gfx.is_poweron will let GFX11 not breaking ahead, like following:
>
> if (!down_read_trylock(&adev->reset_domain->sem) && //--> true in
> gfx11
> +!adev->gfx.is_poweron)         //-->false in gfx11, and the whole if statement will be false, not return ahead. The following gmc v11 callback will be executed later.
>
> Thanks,
> Feifei
>
> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> Sent: Monday, October 9, 2023 4:58 PM
> To: Xu, Feifei <Feifei.Xu@xxxxxxx>; Wang, Yang(Kevin)
> <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu:Check gfx poweron when skip
> flush_gpu_tlb
>
> [AMD Official Use Only - General]
>
> adev->gfx.is_poweron check should already be applied in IP specific (gmc v11) callback. If gfx is not power on, it does nothing but just returns. I didn't see how it helps resolve the issue if we just move the check from one function to another.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Xu, Feifei <Feifei.Xu@xxxxxxx>
> Sent: Monday, October 9, 2023 09:51
> To: Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu:Check gfx poweron when skip
> flush_gpu_tlb
>
> [AMD Official Use Only - General]
>
> Hi,
>
>>> Based on your description, the above code should use "||" instead of
>>> "&&",
> && is to add more restriction here.  To avoid skipping necessary TLB flush by return.
> For Asics < GFX11, !adev->gfx.is_poweron is always true (this paremeter is intrudoced from GFX11), only depends on reset_domain->sem; For Asics = GFX11, !adev->gfx.is_poweron might be false (which gfx might already poweron in the reset), this will make the if () not ture, return will not be executed, thus flush TLB.
>
>>> And after merging code into one line may result in the lock not being released if the lock can be acquired success.
> If !adev->gfx.is_poweron is true, the reset_domin->sem will not be down_read_trylock, thus could avoid this deadlock.
>
> Thanks,
> Feifei
>
> -----Original Message-----
> From: Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> Sent: Sunday, October 8, 2023 9:36 PM
> To: Xu, Feifei <Feifei.Xu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Xu, Feifei <Feifei.Xu@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>;
> Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu:Check gfx poweron when skip
> flush_gpu_tlb
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> Feifei Xu
> Sent: Sunday, October 8, 2023 6:07 PM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Xu, Feifei <Feifei.Xu@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>;
> Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>
> Subject: [PATCH] drm/amdgpu:Check gfx poweron when skip flush_gpu_tlb
>
> To fix the gpu recovery failed on GFX11 with gfxhub pagefault, flush gpu tlb after reset on GFX11.
> Gfxhub tlb flush need check if adev->gfx.is_poweron set.
>
> Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb
> v2")
>
> Signed-off-by: Feifei Xu <Feifei.Xu@xxxxxxx>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 2f9bb86edd71..048d32edee88 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -611,8 +611,9 @@ void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
>                  /*
>                   * A GPU reset should flush all TLBs anyway, so no need to do
>                   * this while one is ongoing.
> +                * For GFX11, gfxhub flush check if adev->gfx.is_poweron is set.
>                   */
> -               if (!down_read_trylock(&adev->reset_domain->sem))
> +               if (!down_read_trylock(&adev->reset_domain->sem) &&
> +!adev->gfx.is_poweron)
>                          return;
>
> [Kevin]:
> Based on your description, the above code should use "||" instead of "&&", And after merging code into one line may result in the lock not being released if the lock can be acquired success.
>
> Best Regards,
> Kevin
>
>                  if (adev->gmc.flush_tlb_needs_extra_type_2)
> --
> 2.34.1
>
>
>





[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux