Quoting Bloomfield, Jon (2019-07-26 00:41:49)
> > -----Original Message-----
> > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Sent: Thursday, July 25, 2019 4:28 PM
> > To: Bloomfield, Jon <jon.bloomfield@xxxxxxxxx>;
> > intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>;
> > Ursulin, Tvrtko <tvrtko.ursulin@xxxxxxxxx>
> > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> >
> > Quoting Bloomfield, Jon (2019-07-26 00:21:47)
> > > > -----Original Message-----
> > > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > > > Sent: Thursday, July 25, 2019 4:17 PM
> > > > To: intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>;
> > > > Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>;
> > > > Ursulin, Tvrtko <tvrtko.ursulin@xxxxxxxxx>;
> > > > Bloomfield, Jon <jon.bloomfield@xxxxxxxxx>
> > > > Subject: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > > >
> > > > Replace sampling the engine state every so often with a periodic
> > > > heartbeat request to measure the health of an engine. This is coupled
> > > > with the forced-preemption to allow long running requests to survive so
> > > > long as they do not block other users.
> > >
> > > Can you explain why we would need this at all if we have forced-preemption?
> > > Forced preemption guarantees that an engine cannot interfere with the
> > > timely execution of other contexts. If it hangs, but nothing else wants
> > > to use the engine then do we care?
> >
> > We may not have something else waiting to use the engine, but we may
> > have users waiting for the response where we need to detect the GPU hang
> > to prevent an infinite wait / stuck processes and infinite power drain.
>
> I'm not sure I buy that logic. Being able to pre-empt doesn't imply it will
> ever end. As written a context can sit forever, apparently making progress
> but never actually returning a response to the user. If the user isn't happy
> with the progress they will kill the process. So we haven't solved the
> user responsiveness here. All we've done is eliminated the potential to
> run one class of otherwise valid workload.

Indeed, one of the conditions I have in mind for endless is rlimits. The
user + admin should be able to specify that a context not exceed so much
runtime, and if we ever get a scheduler, we can write that as a budget
(along with deadlines).

> Same argument goes for power. Just because it yields when other contexts
> want to run doesn't mean it won't consume lots of power indefinitely. I can
> equally write a CPU program to burn lots of power, forever, and it won't get
> nuked.

I agree, and continue to dislike letting hogs have free rein.

> TDR made sense when it was the only way to ensure contexts could always
> make forward progress. But force-preemption does everything we need to
> ensure that as far as I can tell.

No. Force-preemption (preemption-by-reset) is arbitrarily shooting
mostly innocent contexts that had the misfortune not to yield quickly
enough. It is data loss and a DoS (given enough concentration, it could
probably be used by third parties to shoot down completely innocent
clients), and so should be used as a last-resort shotgun and not be
confused with a scalpel. And given our history and current situation,
resets are still a liability.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx