> From: Intel-gfx [mailto:intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf Of Jeff McGee
> Sent: Thursday, March 22, 2018 12:09 PM
> To: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> Cc: Kondapally, Kalyan <kalyan.kondapally@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; ben@xxxxxxxxxxxx
> Subject: Re: [RFC 0/8] Force preemption
>
> On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> >
> > On 22/03/2018 16:01, Jeff McGee wrote:
> > >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> > >>
> > >>On 22/03/2018 14:34, Jeff McGee wrote:
> > >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > >>>>>
> > >>>>>On 21/03/2018 17:26, jeff.mcgee@xxxxxxxxx wrote:
> > >>>>>>From: Jeff McGee <jeff.mcgee@xxxxxxxxx>
> > >>>>>>
> > >>>>>>Force preemption uses engine reset to enforce a limit on the time
> > >>>>>>that a request targeted for preemption can block. This feature is
> > >>>>>>a requirement in automotive systems where the GPU may be shared by
> > >>>>>>clients of critically high priority and clients of low priority that
> > >>>>>>may not have been curated to be preemption friendly. There may be
> > >>>>>>more general applications of this feature. I'm sharing it as an RFC to
> > >>>>>>stimulate that discussion and also to get any technical feedback
> > >>>>>>that I can before submitting to the product kernel that needs this.
> > >>>>>>I have developed the patches for ease of rebase, given that this is
> > >>>>>>for the moment considered a non-upstreamable feature. It would be
> > >>>>>>possible to refactor hangcheck to fully incorporate force preemption
> > >>>>>>as another tier of patience (or impatience) with the running request.
> > >>>>>
> > >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this work
> > >>>>>only with stateless clients - or in other words, what would happen to
> > >>>>>stateful clients which would be force preempted? Or is the answer that we
> > >>>>>don't care since they are misbehaving?
> > >>>>
> > >>>>They get notified of being guilty for causing a gpu reset; three strikes
> > >>>>and they are out (banned from using the gpu) using the current rules.
> > >>>>This is a very blunt hammer that requires the rest of the system to be
> > >>>>robust; one might argue time spent making the system robust would be
> > >>>>better served making sure that the timer never expired in the first place,
> > >>>>thereby eliminating the need for a forced gpu reset.
> > >>>>-Chris
> > >>>
> > >>>Yes, for simplification the policy applied to force-preempted contexts
> > >>>is the same as for hanging contexts. It is known that this feature
> > >>>should not be required in a fully curated system. It's a requirement
> > >>>if the end user will be allowed to install 3rd-party apps to run in the
> > >>>non-critical domain.
> > >>
> > >>My concern is whether it is safe to call this force _preemption_, while
> > >>it is not really expected to work as preemption from the point of
> > >>view of the preempted context. I may be missing some angle here, but I
> > >>think a better name would include words like maximum request
> > >>duration or something.
> > >>
> > >>I can see a difference between the allowed maximum duration when there
> > >>is something else pending, and when there isn't, but I don't
> > >>immediately see that we should consider this distinction for any
> > >>real benefit?
> > >>
> > >>So should the feature just be "maximum request duration"?
> > >>This would perhaps make it just a special case of hangcheck, which ignores
> > >>head progress, or whatever we do in there.
> > >>
> > >>Regards,
> > >>
> > >>Tvrtko
> > >
> > >I think you might be unclear about how this works. We're not starting a
> > >preemption to see if we can cleanly remove a request that has begun to
> > >exceed its normal time slice, i.e. hangcheck. This is about bounding
> > >the time that a normal preemption can take. So we first start preemption
> > >in response to higher-priority request arrival, then wait for the
> > >preemption to complete within a certain amount of time. If it does not,
> > >we resort to reset.
> > >
> > >So it's really "force the resolution of a preemption", shortened to
> > >"force preemption".
> >
> > You are right, I veered off in my thinking and ended up with
> > something different. :)
> >
> > I however still think the name is potentially misleading, since the
> > request/context is not getting preempted. It is getting effectively
> > killed (sooner or later, directly or indirectly).
> >
> > Maybe that is OK for the specific use case where everything is only
> > broken and not malicious.
> >
> > In a more general-purpose system it would be a bit random when
> > something would work and when it wouldn't, depending on system
> > setup and even timings.
> >
> > Hm, maybe you don't even really benefit from the standard
> > three-strikes-and-you-are-out policy, and for this specific use case you
> > should just kill it straight away. If it couldn't be preempted once,
> > why pay the penalty any more?
> >
> > If you don't have it already, devising a solution which blacklists
> > the process (if it creates more contexts), or even a parent (if
> > forking is applicable and implementation feasible), for offenders
> > could also be beneficial.
> >
> > Regards,
> >
> > Tvrtko
>
> Fair enough. There wasn't a lot of deliberation on this name. We
> referred to it in various ways during development. I think I started
> using "force preemption" because it was short. "reset to preempt" was
> another phrase that was used.
>
> The handling of the guilty client/context could be tailored more. Like
> I said, it was easiest to start with the same sort of handling that we
> have already for hang scenarios. Simple is good when you are rebasing
> without much hope to upstream. :(
>
> If there was interest in upstreaming this capability, we could certainly
> incorporate it nicely within a refactoring of hangcheck. And then we
> wouldn't even need a special name for it. The whole thing could be recast
> as time-slice management, where your slice is condition-based. You get
> unlimited time if no one wants the engine, X time if someone of equal
> priority wants the engine, and Y time if someone of higher priority
> wants the engine, etc., where 'Y' is analogous to the fpreempt_timeout
> value in my RFC.
>

On the subject of it being "a bit random when something would work":
Currently it's the well-behaved app (and everything else behind it) that
pays the price for the badly-behaved umd being a bad citizen. 12s is a
really long time for the GUI to freeze before a bad context can be nuked.
Yes, with an enforced pre-emption latency it is the bad umd that pays the
price: if it's lucky it will run to completion, and if unlucky it gets
nuked. But it's a bad umd at the end of the day. And we should find it
and fix it.
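To make the flow described above a little more concrete, here is a toy
model in plain, userspace C of the condition-based time slice plus the
reset fallback. Every identifier except fpreempt_timeout is made up for
illustration, the numbers are arbitrary, and none of this is the actual
RFC or i915 code - it is only a sketch of the policy being discussed.

/*
 * Toy model of the condition-based time slice: unlimited time if nobody
 * else wants the engine, X time for an equal-priority waiter, Y time
 * (the fpreempt_timeout) for a higher-priority waiter.  If a requested
 * preemption has not completed within the allowed time, fall back to
 * engine reset.  Illustrative names and values only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum waiter { NO_WAITER, EQUAL_PRIO_WAITER, HIGHER_PRIO_WAITER };

#define EQUAL_PRIO_SLICE_MS	40	/* "X" in the mail above */
#define FPREEMPT_TIMEOUT_MS	10	/* "Y", analogous to fpreempt_timeout */

/* How long the running request may hold the engine once asked to yield. */
static int64_t allowed_slice_ms(enum waiter w)
{
	switch (w) {
	case NO_WAITER:
		return INT64_MAX;		/* unlimited time */
	case EQUAL_PRIO_WAITER:
		return EQUAL_PRIO_SLICE_MS;	/* X time */
	case HIGHER_PRIO_WAITER:
		return FPREEMPT_TIMEOUT_MS;	/* Y time */
	}
	return 0;
}

/*
 * Preemption was requested at t_preempt_ms; if it has not completed by
 * t_preempt_ms plus the allowed slice, resort to reset.
 */
static bool should_reset_engine(int64_t now_ms, int64_t t_preempt_ms,
				bool preempt_done, enum waiter w)
{
	if (preempt_done)
		return false;
	return now_ms - t_preempt_ms > allowed_slice_ms(w);
}

int main(void)
{
	/* Higher-priority work arrived at t=100ms, preemption still pending. */
	printf("reset at t=105ms? %d\n",
	       should_reset_engine(105, 100, false, HIGHER_PRIO_WAITER));
	printf("reset at t=115ms? %d\n",
	       should_reset_engine(115, 100, false, HIGHER_PRIO_WAITER));
	return 0;
}

Under such a policy a context that yields within the timeout never sees a
reset, whatever its priority.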
With fair scheduling, this becomes even more important - all umds need
to ensure that they are capable of pre-empting to the finest granularity
supported by the h/w for their respective workloads. If they don't, they
really should be obliterated (aka detected and fixed).

The big benefit I see with the new approach (vs TDR) is that there's
actually no need to enforce the (slightly arbitrarily determined) 3x4s
timeout for a context at all, provided it plays nicely. If you want to
run a really long GPGPU workload, do so by all means, just don't hog the
GPU.

BTW, the same context banning should be applied to a context nuked by
forced preemption as with TDR.

That's the intention at least. This is just an RFC right now.
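For comparison, here is a similarly rough sketch (again plain C, with
made-up names, and budgets taken loosely from the "3x4s"/12s and
fpreempt_timeout figures mentioned above - not driver code) of the
difference between the TDR-style fixed budget and the forced-preemption
rule:

/*
 * Old model: a context is nuked once its total runtime exceeds a fixed
 * budget, however well it behaves.  New model: total runtime is
 * irrelevant; a context is only nuked if it fails to yield within the
 * timeout once somebody else wants the engine.  Illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TDR_BUDGET_MS		(3 * 4000)	/* the "3x4s" above, ~12s */
#define FPREEMPT_TIMEOUT_MS	10		/* illustrative value */

/* TDR-style check: kill anything that has simply run too long. */
static bool tdr_would_kill(int64_t runtime_ms)
{
	return runtime_ms > TDR_BUDGET_MS;
}

/* Forced-preemption check: only failing to yield on request matters. */
static bool fpreempt_would_kill(bool preempt_requested,
				int64_t ms_since_request)
{
	return preempt_requested && ms_since_request > FPREEMPT_TIMEOUT_MS;
}

int main(void)
{
	/* A well-behaved 60s GPGPU job that yields within 2ms when asked. */
	printf("TDR kills long job: %d\n", tdr_would_kill(60000));
	printf("fpreempt kills long job: %d\n", fpreempt_would_kill(true, 2));

	/* A job that ignores the preemption request for 50ms. */
	printf("fpreempt kills non-yielding job: %d\n",
	       fpreempt_would_kill(true, 50));
	return 0;
}

The point of the toy model is just that, under the new approach, total
runtime stops mattering; only failing to yield within the preemption
timeout does.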