Re: [PATCH] Increase tlb flush timeout for sriov

"Liu, Shaoyun" <Shaoyun.Liu@xxxxxxx> · Mon, 8 Aug 2022 22:30:53 +0000

As I discussed with Alice ,this change is when multi-vf running compute benchmark (Luxmark) at the same time, which involves multiple vf  do the tlb invalidation at the same time. They observed kiq timeout after submit the tlb invalidate command. Although
 each vf has the invalidate register set, but from hw, the invalidate requests are queue to execute.

Alice, as we discussed, we can use maximum 12*100ms for the timeout , it shouldn't be 6000ms. Did you see issues with 1200 ms timeout?

Regards
Shaoyun.liu

From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Alex Deucher <alexdeucher@xxxxxxxxx>

Sent: August 8, 2022 4:49 PM

To: Milinkovic, Dusica <Dusica.Milinkovic@xxxxxxx>

Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>

Subject: Re: [PATCH] Increase tlb flush timeout for sriov

On Wed, Aug 3, 2022 at 5:02 AM Dusica Milinkovic

<dusica.milinkovic@xxxxxxx> wrote:

>

Please include a patch description.  Why do you need a longer timeout?

 What problem does it fix?

> Signed-off-by: Dusica Milinkovic <dusica.milinkovic@xxxxxxx>

> ---

>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +++++-

>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 +++++-

>  2 files changed, 10 insertions(+), 2 deletions(-)

>

> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

> index 9ae8cdaa033e..6ab7d329916f 100644

> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

> @@ -419,6 +419,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>         uint32_t seq;

>         uint16_t queried_pasid;

>         bool ret;

> +       uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for SRIOV */

>         struct amdgpu_ring *ring = &adev->gfx.kiq.ring;

>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;

>

> @@ -437,7 +438,10 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>

>                 amdgpu_ring_commit(ring);

>                 spin_unlock(&adev->gfx.kiq.ring_lock);

> -               r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

> +               if (amdgpu_sriov_vf(adev))

> +                       r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);

> +               else

> +                       r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

What about something like this?

u32 usec_timeout = amdgpu_sriov_vf(adev) ? 6000000 :

adev->usec_timeout;  /* wait for 12 * 500ms for SRIOV */

...

r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);

>                 if (r < 1) {

>                         dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);

>                         return -ETIME;

> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

> index 22761a3bb818..941a6b52fa72 100644

> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

> @@ -896,6 +896,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>         uint32_t seq;

>         uint16_t queried_pasid;

>         bool ret;

> +       uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for SRIOV */

>         struct amdgpu_ring *ring = &adev->gfx.kiq.ring;

>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;

>

> @@ -935,7 +936,10 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

>

>                 amdgpu_ring_commit(ring);

>                 spin_unlock(&adev->gfx.kiq.ring_lock);

> -               r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

> +               if (amdgpu_sriov_vf(adev))

> +                       r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);

> +               else

> +                       r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

Same comment here.

Alex

>                 if (r < 1) {

>                         dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);

>                         up_read(&adev->reset_domain->sem);

> --

> 2.25.1

>