Re: [PATCH] drm/i915/gt: Reset twice

Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx> · Wed, 14 Dec 2022 23:37:19 +0100

Hi Rodrigo,

On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
> On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
> > Hi Rodrigo,
> > 
> > On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
> > > On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
> > > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > > > 
> > > > After applying an engine reset, on some platforms like Jasperlake, we
> > > > occasionally detect that the engine state is not cleared until shortly
> > > > after the resume. As we try to resume the engine with volatile internal
> > > > state, the first request fails with a spurious CS event (it looks like
> > > > it reports a lite-restore to the hung context, instead of the expected
> > > > idle->active context switch).
> > > > 
> > > > Signed-off-by: Chris Wilson <hris@xxxxxxxxxxxxxxxxxx>
> > > 
> > > There's a typo in the signature email I'm afraid...
> > 
> > oh yes, I forgot the 'C' :)
> 
> you forgot?
> who signed it off?

Chris did, but while copy/pasting the SoBs I might have
unintentionally removed the 'c'.

> > > Other than that, have we checked the possibility of using the
> > > driver-initiated-flr bit instead of this second loop? That should be
> > > the right way to guarantee everything is cleared on gen11+...
> > 
> > maybe I am misinterpreting it, but is FLR the same as resetting
> > hardware domains individually?
> 
> No, it is bigger than that... almost the PCI FLR with some exceptions:
> 
> https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html

yes, exactly... I would use the FLR feedback if I were performing
an FLR. But that's not what I'm doing here: I'm simply gating off
some power domains. It happens that those power domains turn the
engines off and on, which resets them.

FLR has nothing to do with this, also because if you want to
reset a single engine you go through this function, instead of
resetting the whole GPU and everything attached to it.

This patch is not fixing the "reset" concept of i915; it's
fixing a missing feedback that happens on one single platform
when trying to gate a domain on/off.
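
To make the control flow of the patch easier to follow, here is a
minimal user-space sketch of the "write GDRST, wait for ack, repeat
once" loop. It only models the logic: struct fake_hw, fake_write_reset(),
fake_wait_for_ack() and domain_reset_twice() are hypothetical stand-ins,
not i915 functions, and the poll counter replaces the real timeout. The
trailing udelay(50) from the patch is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware behind GEN6_GDRST: software sets
 * request bits, the "hardware" clears them once the reset is acked. */
struct fake_hw {
	uint32_t gdrst;       /* pending reset request bits */
	int polls_until_ack;  /* polls the fake hardware needs before acking */
	int polls_left;       /* countdown for the request currently in flight */
	int resets_issued;    /* number of times the request was written */
};

/* Models intel_uncore_write_fw(uncore, GEN6_GDRST, mask). */
static void fake_write_reset(struct fake_hw *hw, uint32_t mask)
{
	hw->gdrst |= mask;
	hw->polls_left = hw->polls_until_ack;
	hw->resets_issued++;
}

/* Models the wait for (GDRST & mask) == 0.
 * Returns 0 on ack, -1 if the poll budget (our fake timeout) runs out. */
static int fake_wait_for_ack(struct fake_hw *hw, uint32_t mask, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (hw->polls_left == 0) {
			hw->gdrst &= ~mask;
			return 0;
		}
		hw->polls_left--;
	}
	return -1;
}

/* Same loop shape as the patch: repeat the reset once more only if the
 * first one was acked; bail out immediately on a timeout. */
static int domain_reset_twice(struct fake_hw *hw, uint32_t mask, int max_polls)
{
	int loops = 2;
	int err;

	assert(max_polls > 0);

	do {
		fake_write_reset(hw, mask);
		err = fake_wait_for_ack(hw, mask, max_polls);
	} while (err == 0 && --loops);

	return err;
}
```

Note that the second pass only happens when the first one succeeded, so
a timeout is still reported after a single attempt.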

Maybe I am completely off track, but I don't see the connection
with FLR here.

(besides, FLR might not be present on all platforms)
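
Just to have the two levels under discussion side by side, this is how
I read the "escalate to FLR when the GT reset fails" idea, as a rough
user-space sketch. Everything here is hypothetical (struct fake_device,
the has_flr flag, fake_domain_reset(), reset_with_escalation()); none of
it is an existing i915 interface. It also shows the case I mention
above, where there is no FLR to fall back to.

```c
#include <assert.h>
#include <stdbool.h>

/* Which reset level ended up handling the request. */
enum reset_level { RESET_NONE, RESET_DOMAIN, RESET_FLR };

struct fake_device {
	bool has_flr;             /* hypothetical capability flag */
	bool domain_reset_works;  /* whether the fake domain reset is acked */
};

/* Stand-in for the per-domain reset; 0 on success, -1 on failure. */
static int fake_domain_reset(const struct fake_device *dev)
{
	return dev->domain_reset_works ? 0 : -1;
}

/* Try the narrow reset first and escalate only if it fails and the
 * platform has the bigger hammer available. */
static enum reset_level reset_with_escalation(const struct fake_device *dev)
{
	assert(dev);

	if (fake_domain_reset(dev) == 0)
		return RESET_DOMAIN;

	if (dev->has_flr)
		return RESET_FLR; /* a real driver would trigger the FLR here */

	return RESET_NONE;
}
```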

Thanks a lot for your inputs,
Andi

> > How am I supposed to use driver_initiated_flr() in this context?
> 
> Some drivers are not even using this gt reset anymore and going
> directly with the driver-initiated FLR in case that guc reset failed.
> 
> I believe we can still keep the gt reset in our case as we currently
> have, but instead of keep retrying it until it succeeds we probably
> should go to the next level and do the driver initiated FLR when the GT
> reset failed.
> 
> > 
> > Thanks,
> > Andi
> > 
> > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> > > > Signed-off-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_reset.c | 34 ++++++++++++++++++++++-----
> > > >  1 file changed, 28 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > index ffde89c5835a4..88dfc0c5316ff 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > > > @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
> > > >  static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
> > > >  {
> > > >         struct intel_uncore *uncore = gt->uncore;
> > > > +       int loops = 2;
> > > >         int err;
> > > >  
> > > >         /*
> > > > @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
> > > >          * for fifo space for the write or forcewake the chip for
> > > >          * the read
> > > >          */
> > > > -       intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
> > > > +       do {
> > > > +               intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
> > > >  
> > > > -       /* Wait for the device to ack the reset requests */
> > > > -       err = __intel_wait_for_register_fw(uncore,
> > > > -                                          GEN6_GDRST, hw_domain_mask, 0,
> > > > -                                          500, 0,
> > > > -                                          NULL);
> > > > +               /*
> > > > +                * Wait for the device to ack the reset requests.
> > > > +                *
> > > > +                * On some platforms, e.g. Jasperlake, we see that the
> > > > +                * engine register state is not cleared until shortly after
> > > > +                * GDRST reports completion, causing a failure as we try
> > > > +                * to immediately resume while the internal state is still
> > > > +                * in flux. If we immediately repeat the reset, the second
> > > > +                * reset appears to serialise with the first, and since
> > > > +                * it is a no-op, the registers should retain their reset
> > > > +                * value. However, there is still a concern that upon
> > > > +                * leaving the second reset, the internal engine state
> > > > +                * is still in flux and not ready for resuming.
> > > > +                */
> > > > +               err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
> > > > +                                                  hw_domain_mask, 0,
> > > > +                                                  2000, 0,
> > > > +                                                  NULL);
> > > > +       } while (err == 0 && --loops);
> > > >         if (err)
> > > >                 GT_TRACE(gt,
> > > >                          "Wait for 0x%08x engines reset failed\n",
> > > >                          hw_domain_mask);
> > > >  
> > > > +       /*
> > > > +        * As we have observed that the engine state is still volatile
> > > > +        * after GDRST is acked, impose a small delay to let everything settle.
> > > > +        */
> > > > +       udelay(50);
> > > > +
> > > >         return err;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.38.1
> > > > 
>