Re: [PATCH] drm/i915: Replace hangcheck by heartbeats

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Fri, 26 Jul 2019 00:52:17 +0100

Quoting Bloomfield, Jon (2019-07-26 00:41:49)
> > -----Original Message-----
> > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Sent: Thursday, July 25, 2019 4:28 PM
> > To: Bloomfield, Jon <jon.bloomfield@xxxxxxxxx>;
> > intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>; Ursulin, Tvrtko
> > <tvrtko.ursulin@xxxxxxxxx>
> > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > 
> > Quoting Bloomfield, Jon (2019-07-26 00:21:47)
> > > > -----Original Message-----
> > > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > > > Sent: Thursday, July 25, 2019 4:17 PM
> > > > To: intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>; Joonas Lahtinen
> > > > <joonas.lahtinen@xxxxxxxxxxxxxxx>; Ursulin, Tvrtko
> > > > <tvrtko.ursulin@xxxxxxxxx>; Bloomfield, Jon <jon.bloomfield@xxxxxxxxx>
> > > > Subject: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > > >
> > > > Replace sampling the engine state every so often with a periodic
> > > > heartbeat request to measure the health of an engine. This is coupled
> > > > with the forced-preemption to allow long running requests to survive so
> > > > long as they do not block other users.
> > >
> > > Can you explain why we would need this at all if we have forced-preemption?
> > > Forced preemption guarantees that an engine cannot interfere with the timely
> > > execution of other contexts. If it hangs, but nothing else wants to use the
> > > engine then do we care?
> > 
> > We may not have something else waiting to use the engine, but we may
> > have users waiting for the response where we need to detect the GPU hang
> > to prevent an infinite wait / stuck processes and infinite power drain.
> 
> I'm not sure I buy that logic. Being able to pre-empt doesn't imply it will
> ever end. As written a context can sit forever, apparently making progress
> but never actually returning a response to the user. If the user isn't happy
> with the progress they will kill the process. So we haven't solved the
> user responsiveness here. All we've done is eliminated the potential to
> run one class of otherwise valid workload.

Indeed, one of the conditions I have in mind for endless workloads is
rlimits. The user + admin should be able to specify that a context must
not exceed a given amount of runtime, and if we ever get a scheduler, we
can write that as a budget (along with deadlines).
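
To make the budget idea concrete, here is a minimal sketch of per-context
runtime accounting. It is an illustration only, not i915 code: the
struct, its fields and both helpers are hypothetical names, and a limit
of 0 is assumed to mean "unlimited".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-context accounting; not an i915 structure. */
struct ctx_budget {
	uint64_t runtime_ns;	/* GPU time consumed so far */
	uint64_t limit_ns;	/* user/admin limit; 0 is assumed to mean unlimited */
};

/* Charge a slice of execution time, saturating instead of wrapping. */
static void ctx_budget_charge(struct ctx_budget *b, uint64_t delta_ns)
{
	if (b->runtime_ns > UINT64_MAX - delta_ns)
		b->runtime_ns = UINT64_MAX;
	else
		b->runtime_ns += delta_ns;
}

/* True once the context has used more runtime than its limit allows. */
static bool ctx_budget_exceeded(const struct ctx_budget *b)
{
	return b->limit_ns != 0 && b->runtime_ns > b->limit_ns;
}
```

A scheduler would presumably charge the budget whenever a context is
switched out and check it before running the context again; what happens
once the budget is exceeded is a policy decision this sketch leaves open.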

> Same argument goes for power. Just because it yields when other contexts
> want to run doesn't mean it won't consume lots of power indefinitely. I can
> equally write a CPU program to burn lots of power, forever, and it won't get
> nuked.

I agree, and continue to dislike letting hogs have free rein.

> TDR made sense when it was the only way to ensure contexts could always
> make forward progress. But force-preemption does everything we need to
> ensure that as far as I can tell.

No. Force-preemption (preemption-by-reset) is arbitrarily shooting mostly
innocent contexts that had the misfortune not to yield quickly enough. It
is data loss and a DoS (with enough concentration, third parties could
probably use it to shoot down completely innocent clients), and so it
should be used as a last-resort shotgun and not be confused with a
scalpel. And given our history and current situation, resets are still a
liability.
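
As a rough sketch of what "last resort" could mean in practice, the
function below orders the possible responses to an unanswered heartbeat
from gentlest to most destructive. This is an assumption for
illustration, not what the patch implements: the enum, the function and
both thresholds are hypothetical.

```c
/* Hypothetical responses, ordered from gentlest to most destructive. */
enum hb_action {
	HB_NONE,	/* heartbeat completed; engine looks healthy */
	HB_WAIT,	/* heartbeat pending, still within the grace period */
	HB_PREEMPT,	/* ask the running context to yield */
	HB_RESET,	/* last resort: reset the engine, losing work */
};

/*
 * Pick a response from the number of consecutive heartbeat periods that
 * elapsed without the heartbeat completing. Both thresholds are made-up
 * tunables; reset_after is expected to be larger than preempt_after.
 */
static enum hb_action hb_next_action(unsigned int missed,
				     unsigned int preempt_after,
				     unsigned int reset_after)
{
	if (missed == 0)
		return HB_NONE;
	if (missed >= reset_after)
		return HB_RESET;
	if (missed >= preempt_after)
		return HB_PREEMPT;
	return HB_WAIT;
}
```

With preempt_after = 2 and reset_after = 4, one missed period only
waits, two or three request preemption, and the reset is reached only
after four.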
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx