Re: [PATCH] drm/i915/gt: Reset twice

Gwan-gyeong Mun <gwan-gyeong.mun@xxxxxxxxx> · Thu, 22 Dec 2022 11:28:34 +0200

On 12/15/22 10:07 PM, Rodrigo Vivi wrote:
On Wed, Dec 14, 2022 at 11:37:19PM +0100, Andi Shyti wrote:
Hi Rodrigo,

On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
Hi Rodrigo,

On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

After applying an engine reset, on some platforms like
Jasperlake, we
occasionally detect that the engine state is not cleared until
shortly
after the resume. As we try to resume the engine with volatile
internal
state, the first request fails with a spurious CS event (it looks
like
it reports a lite-restore to the hung context, instead of the
expected
idle->active context switch).

Signed-off-by: Chris Wilson <hris@xxxxxxxxxxxxxxxxxx>

There's a typo in the signature email I'm afraid...

oh yes, I forgot the 'C' :)

you forgot?
who signed it off?

Chris, but as I was copy/pasting SoB's I might have
unintentionally removed the 'c'.

Other than that, have we checked the possibility of using the
driver-initiated-flr bit
instead of this second loop? That should be the right way to
guarantee everything is
cleared on gen11+...

maybe I am misinterpreting it, but is FLR the same as resetting
hardware domains individually?

No, it is bigger than that... almost the PCI FLR with some exceptions:

https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html

yes, exactly... I would use FLR feedback if I was performing an
FLR reset. But here I'm not doing that, here I'm simply gating
off some power domains. It happens that those power domains turn
on and off engines making them reset.

is this issue only seeing when this reset is called from the
sanitize functions at probe and resumes?
Or from any kind of gt reset?

I don't remember seeing any reference link to the bug in the patch,
hence I'm assuming this is happening in any kind of gt reset that
ends up in this function.


FLR doesn't have anything to do here, also because if you want to
reset a single engine you go through this function, instead of
resetting the whole GPU with whatever is annexed.

yeap. That might be to extreme depending on the case. But all that
I asked was if we were considering this option since this is the
recommended way of reseting our engines nowadays.


This patch is not fixing the "reset" concept of i915, but it's
fixing a missing feedback that happens in one single platform
when trying to gate on/off a domain.

But it is changing the reset concept and timeouts for all the reset
cases in all the platforms.


Maybe I am completely off track, but I don't see connection with
FLR here.

The point is that if a reset is needed, for any reason,
the recommended way for Jasperlake, and any other newer platforms,
is to use the FLR rather than the engine reset. But we are using
the engine reset, and now twice, rather then attempt the recommended
way.


(besides FLR might not be present in all the platforms)

This issue is also not present in all the platforms and you are still
increasing the loops and delay for all the platforms.


Thanks a lot for your inputs,

have we looked to the Jasperlake workarounds to see if we are missing
anything there that could help us in this case instead of this extreme
approach of randomly increasing timeouts and attempts for all the
platforms?

Hi, Rodrigo
JSL_WA (Bspec: 33451) doesn't seem to have a WA similar to this issue. 
(Please correct me if I can't find it.)

Andi

How am I supposed to use driver_initiated_flr() in this context?

Some drivers are not even using this gt reset anymore and going
directly with the driver-initiated FLR in case that guc reset failed.

I believe we can still keep the gt reset in our case as we currently
have, but instead of keep retrying it until it succeeds we probably
should go to the next level and do the driver initiated FLR when the GT
reset failed.


Thanks,
Andi

Cc: stable@xxxxxxxxxxxxxxx
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
Signed-off-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
---
  drivers/gpu/drm/i915/gt/intel_reset.c | 34
++++++++++++++++++++++-----
  1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
b/drivers/gpu/drm/i915/gt/intel_reset.c
index ffde89c5835a4..88dfc0c5316ff 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
intel_engine_mask_t engine_mask,
  static int gen6_hw_domain_reset(struct intel_gt *gt, u32
hw_domain_mask)
  {
         struct intel_uncore *uncore = gt->uncore;
+       int loops = 2;
         int err;
  
         /*
@@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
intel_gt *gt, u32 hw_domain_mask)
          * for fifo space for the write or forcewake the chip for
          * the read
          */
-       intel_uncore_write_fw(uncore, GEN6_GDRST,
hw_domain_mask);
+       do {
+               intel_uncore_write_fw(uncore, GEN6_GDRST,
hw_domain_mask);
  
-       /* Wait for the device to ack the reset requests */
-       err = __intel_wait_for_register_fw(uncore,
-                                          GEN6_GDRST,
hw_domain_mask, 0,
-                                          500, 0,
-                                          NULL);
+               /*
+                * Wait for the device to ack the reset requests.
+                *
+                * On some platforms, e.g. Jasperlake, we see see
that the
+                * engine register state is not cleared until
shortly after
+                * GDRST reports completion, causing a failure as
we try
+                * to immediately resume while the internal state
is still
+                * in flux. If we immediately repeat the reset,
the second
+                * reset appears to serialise with the first, and
since
+                * it is a no-op, the registers should retain
their reset
+                * value. However, there is still a concern that
upon
+                * leaving the second reset, the internal engine
state
+                * is still in flux and not ready for resuming.
+                */
+               err = __intel_wait_for_register_fw(uncore,
GEN6_GDRST,
+
hw_domain_mask, 0,
+                                                  2000, 0,
+                                                  NULL);
Andi, fast_timeout_us is increased from 500 to 2000, and if it fails, it 
tries to reset it once more. How was this value of 2000 calculated?
+       } while (err == 0 && --loops);
         if (err)
                 GT_TRACE(gt,
                          "Wait for 0x%08x engines reset
failed\n",
                          hw_domain_mask);
Did GT_TRACE report an error in a situation where the problem was reported?
  
+       /*
+        * As we have observed that the engine state is still
volatile
+        * after GDRST is acked, impose a small delay to let
everything settle.
+        */
+       udelay(50);
udelay(50) affects all platforms that can call gen6_hw_domain_reset(), 
is that intended?

Br,

G.G.
+
         return err;
  }
  
--
2.38.1