Re: [CI 1/4] drm/i915/gt: Try to more gracefully quiesce the system before resets

Quoting Mika Kuoppala (2019-10-23 14:21:01)
> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
> 
> > If we are doing a normal GPU reset triggered after detecting a long
> > period of stalled work, we can take our time and allow the engines to
> > quiesce. Since we've stopped submission to the engine, and if we wait
> > long enough an innocent context should complete, leaving the engine idle.
> > So by waiting a short amount of time, we should prevent clobbering other
> > users when resetting a stuck context.
> >
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
> > ---
> >  drivers/gpu/drm/i915/Kconfig.profile         | 11 +++++++++++
> >  drivers/gpu/drm/i915/gt/intel_engine_cs.c    | 20 +++++++++++++++++++-
> >  drivers/gpu/drm/i915/gt/intel_engine_types.h |  4 ++++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index 48df8889a88a..97f01bfeda41 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -25,3 +25,14 @@ config DRM_I915_SPIN_REQUEST
> >         May be 0 to disable the initial spin. In practice, we estimate
> >         the cost of enabling the interrupt (if currently disabled) to be
> >         a few microseconds.
> > +
> > +config DRM_I915_STOP_TIMEOUT
> > +     int "How long to wait for an engine to quiesce gracefully before reset (ms)"
> > +     default 100 # milliseconds
> > +     help
> > +       By stopping submission and sleeping for a short time before resetting
> > +       the GPU, we allow the innocent contexts also on the system to quiesce.
> > +       It is then less likely for a hanging context to cause collateral
> > +       damage as the system is reset in order to recover. The colorary is
> 
> s/colorary/corollary/
> 
> I am not claiming that I would know a better value for this tunable.
> 
> But at least with the hangcheck periods we currently have, I think
> there is room to give more time to the actual reset processing.
> 
> We could go as far as starting to idle the other engines
> in parallel when one shows symptoms. But perhaps
> the effect is the same as shortening the detection cycle.

True, the other idea I think I may experiment with is pushing the
stalled flag down. There's no point waiting for the engine if we've
declared it hung already, and that should eliminate the need for the if
(in_atomic). I think the essence of the patch stands -- we can reset
more gracefully if we wait.
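
To make the stalled-flag idea concrete, here is a rough standalone
simulation. It is not the i915 code: struct fake_engine,
quiesce_before_reset() and the 1ms tick are hypothetical names made up
for illustration, and stop_timeout_ms only mirrors the role of
CONFIG_DRM_I915_STOP_TIMEOUT from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an engine; not the real i915 structures. */
struct fake_engine {
	unsigned int busy_ms;         /* time until in-flight work would complete */
	unsigned int stop_timeout_ms; /* plays the role of DRM_I915_STOP_TIMEOUT */
};

/*
 * Submission is assumed to be stopped already. Give the engine up to
 * stop_timeout_ms to drain before the reset, unless the caller has
 * already declared it hung (stalled), in which case waiting is pointless.
 *
 * Returns the number of milliseconds spent waiting and reports through
 * *idle whether the engine drained on its own.
 */
static unsigned int quiesce_before_reset(struct fake_engine *e, bool stalled,
					 bool *idle)
{
	unsigned int waited = 0;

	assert(e != NULL && idle != NULL);

	if (!stalled) {
		/* Each iteration simulates 1ms of the engine making progress. */
		while (e->busy_ms > 0 && waited < e->stop_timeout_ms) {
			e->busy_ms--;
			waited++;
		}
	}

	*idle = (e->busy_ms == 0);
	return waited;
}
```

In this sketch an innocent context with less work left than the timeout
drains on its own, a hung one costs the full timeout, and passing
stalled skips the wait entirely.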

I probably should make it a Suggested-by Joonas & Jon.
-Chris