RE: [PATCH] Increase tlb flush timeout for sriov

"Milinkovic, Dusica" <Dusica.Milinkovic@xxxxxxx> · Tue, 9 Aug 2022 11:19:41 +0000

[AMD Official Use Only - General]

Hello all, 

I just checked this with a 1200 ms timeout, and the test case passed 10 loops successfully.

If you all agree, I will change the value in the patch.

Regards, 

Dusica

From: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx> 

Sent: Tuesday, August 9, 2022 12:31 AM

To: Alex Deucher <alexdeucher@xxxxxxxxx>; Milinkovic, Dusica <Dusica.Milinkovic@xxxxxxx>

Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx

Subject: Re: [PATCH] Increase tlb flush timeout for sriov

As I discussed with Alice, this change is for the case where multiple VFs run a compute benchmark (LuxMark) at the same time, which means multiple VFs perform TLB invalidation concurrently. They observed a KIQ timeout after submitting the TLB invalidate command. Although each VF has its own invalidate register set, in hardware the invalidate requests are queued for execution.

Alice, as we discussed, we can use a maximum of 12 * 100 ms for the timeout; it shouldn't be 6000 ms. Did you see issues with a 1200 ms timeout?
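
For reference, the arithmetic behind the two values can be sketched as below. This is only a minimal illustration: it assumes the timeout is expressed in microseconds, as the name usec_timeout and the 6000000 literal in the patch suggest, and timeout_budget_us is a hypothetical helper, not an existing amdgpu function.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of amdgpu): convert a budget of `slots`
 * waits of `ms_per_slot` milliseconds each into microseconds.
 */
static inline uint32_t timeout_budget_us(uint32_t slots, uint32_t ms_per_slot)
{
	return slots * ms_per_slot * 1000u;
}
```

With this helper, timeout_budget_us(12, 100) gives 1200000 (the 1200 ms proposed here), while timeout_budget_us(12, 500) gives the 6000000 used in the original patch.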

Regards

Shaoyun.liu

From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Alex Deucher <alexdeucher@xxxxxxxxx>

Sent: August 8, 2022 4:49 PM

To: Milinkovic, Dusica <Dusica.Milinkovic@xxxxxxx>

Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>

Subject: Re: [PATCH] Increase tlb flush timeout for sriov 

On Wed, Aug 3, 2022 at 5:02 AM Dusica Milinkovic <dusica.milinkovic@xxxxxxx> wrote:

>

Please include a patch description. Why do you need a longer timeout? What problem does it fix?

> Signed-off-by: Dusica Milinkovic <dusica.milinkovic@xxxxxxx>

> ---

>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +++++-

>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 +++++-

>  2 files changed, 10 insertions(+), 2 deletions(-)

>

> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

> index 9ae8cdaa033e..6ab7d329916f 100644

> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

> @@ -419,6 +419,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>         uint32_t seq;

>         uint16_t queried_pasid;

>         bool ret;

> +       uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for SRIOV */

>         struct amdgpu_ring *ring = &adev->gfx.kiq.ring;

>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;

>

> @@ -437,7 +438,10 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>

>                 amdgpu_ring_commit(ring);

>                 spin_unlock(&adev->gfx.kiq.ring_lock);

> -               r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

> +               if (amdgpu_sriov_vf(adev))

> +                       r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);

> +               else

> +                       r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

What about something like this?

u32 usec_timeout = amdgpu_sriov_vf(adev) ? 6000000 : adev->usec_timeout;  /* wait for 12 * 500ms for SRIOV */

...

r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);
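
As a stand-alone illustration of that selection, here is a minimal sketch that can be compiled outside the kernel. Everything in it is an assumption made for illustration: struct model_dev is a hypothetical stand-in for the two pieces of state involved (it is not struct amdgpu_device), and pick_fence_timeout_us is a made-up name. The SR-IOV value is passed in as a parameter so the sketch does not commit to a particular number.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the state the decision depends on.
 * This is NOT struct amdgpu_device.
 */
struct model_dev {
	bool is_sriov_vf;       /* models the result of amdgpu_sriov_vf(adev) */
	uint32_t usec_timeout;  /* models adev->usec_timeout */
};

/*
 * Pick the fence wait timeout once, so the wait itself needs only a
 * single call instead of an if/else around two identical calls.
 */
static uint32_t pick_fence_timeout_us(const struct model_dev *dev,
				      uint32_t sriov_usec_timeout)
{
	return dev->is_sriov_vf ? sriov_usec_timeout : dev->usec_timeout;
}
```

A device modelled as an SR-IOV VF gets the value passed in; any other device keeps its own usec_timeout, which matches the behaviour of the if/else in the patch.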

>                 if (r < 1) {

>                         dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);

>                         return -ETIME;

> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

> index 22761a3bb818..941a6b52fa72 100644

> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

> @@ -896,6 +896,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>         uint32_t seq;

>         uint16_t queried_pasid;

>         bool ret;

> +       uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for SRIOV */

>         struct amdgpu_ring *ring = &adev->gfx.kiq.ring;

>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;

>

> @@ -935,7 +936,10 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>

>                 amdgpu_ring_commit(ring);

>                 spin_unlock(&adev->gfx.kiq.ring_lock);

> -               r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

> +               if (amdgpu_sriov_vf(adev))

> +                       r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);

> +               else

> +                       r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

Same comment here.

Alex

>                 if (r < 1) {

>                         dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);

>                         up_read(&adev->reset_domain->sem);

> --

> 2.25.1

>