Hmmn. We're still on orthogonal perspectives as far as our previous arguments stand. But it doesn't matter because while thinking through your replies, I realized there is one argument in favour, which trumps all my previous arguments against this patch - it makes things deterministic. Without this patch (or hangcheck), whether a context gets nuked depends on what else is running. And that's a recipe for confused support emails. So I retract my other arguments, thanks for staying with me :-) BTW: TDR==Timeout-Detection and Reset. Essentially hangcheck and recovery. > -----Original Message----- > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Sent: Friday, July 26, 2019 2:30 PM > To: Bloomfield, Jon <jon.bloomfield@xxxxxxxxx>; intel- > gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>; Ursulin, Tvrtko > <tvrtko.ursulin@xxxxxxxxx> > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats > > Quoting Bloomfield, Jon (2019-07-26 21:58:38) > > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > > It's no more often than before, you have to fail to advance within an > > > interval, and then deny preemption request. > > > > It's entrapment. You are creating an artificial workload for the context to > impede. Before that artificial workload was injected, the context would have > run to completion, the world would be at peace (and the user would have her > new bitcoin). Instead she stays poor because the DoS police launched an illegal > sting on her otherwise genius operation. > > It's housekeeping; it's the cost of keeping the engine alive. This > argument is why CPU-isolation came about (aiui). > > > > > I do like the rlimit suggestion, but until we have that, just disabling TDR > feels > > > like a better stop-gap than nuking innocent compute contexts just in case > they > > > might block something. > > > > > > They're not innocent though :-p > > > > They are innocent until proven guilty :-) > > > > > > > > Disabling hangcheck (no idea why you confuse that with the recovery > > > procedure) makes igt unhappy, but they are able to do that today with > > > the modparam. This patch makes it easier to move that to an engine > > > parameter, but to disable it entirely you still need to be able to reset > > > the GPU on demand (suspend, eviction, oom). Without hangcheck we need > to > > > evaluate all MAX_SCHEDULE_TIMEOUT waits and supply a reset-on- > timeout > > > along critical paths. > > > -Chris > > > > I don't think I'm confusing hang-check with the recovery. I've talked about > TDR, which to me is a periodic hangcheck, combined with a recovery by engine > reset. I don't argue against being able to reset, just against the blunt > classification that hangcheck itself provides. > > > > TDR was originally looking for regressive workloads that were not making > forward progress to protect against DoS. But it was always a very blunt tool. It's > never been able to differentiate long running, but otherwise valid, compute > workloads from genuine BB hangs, but that was fine at the time, and as you say > users could always switch the modparam. > > To be honest, I still have no idea what TDR is. But I take it that you > agree that we're only talking about hangcheck :) What I think you are > missing out on is that we have some more or less essential (itself > depending on client behaviour) housekeeping that goes along side it. > > My claim is that without a guarantee of isolation, anything else that > wants to use that engine will need the background housekeeping. (And if > we don't go as far as performing complete isolation, I expect userspace > is going to need the kernel to cleanup as they go along, as they are > unlikely to be prepared to do the system maintenance themselves.) > > > Now we have more emphasis on compute we need a solution that doesn't > involve a modparam. This was specifically requested by the compute team - > they know that they can flip the tdr switch, but that means their workload will > only run if user modifies the system. That's hardly ideal. > > It means they can adjust things to their admins' hearts' content, and > it's a udev rule away from setting permissions to allow the compute group > to freely reconfigure the settings. > > > Without the rlimit concept I don't think we can't prevent power hogs > whatever we do, any more than the core kernel can prevent CPU power hogs. > So, if we can prevent a workload from blocking other contexts, then it is > unhelpful to continue either with the blunt tool that TDR is, or the similarly > blunt heartbeat. If we have neither of these, but can guarantee forward > progress when we need to, then we can still protect the system against DoS, > without condemning any contexts before a crime has been committed. > > rlimits is orthogonal to the problem of preventing an isolated compute > task from turning into a system-wide dos. For endless, we need > suspend-to-idle at a very minimum before even considering the dilemma of > hangcheck. For eviction/oom, suspend-to-idle is not enough, and we need > to have the same sort of concept as oomkiller; kill a task to save the > system (or not depending on admin selection). > -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx