Quoting Bloomfield, Jon (2018-03-22 21:59:33)
> > From: Intel-gfx [mailto:intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf Of Jeff McGee
> > Sent: Thursday, March 22, 2018 12:09 PM
> > To: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> > Cc: Kondapally, Kalyan <kalyan.kondapally@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; ben@xxxxxxxxxxxx
> > Subject: Re: [RFC 0/8] Force preemption
> >
> > On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> > >
> > > On 22/03/2018 16:01, Jeff McGee wrote:
> > > >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> > > >>
> > > >>On 22/03/2018 14:34, Jeff McGee wrote:
> > > >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > > >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > > >>>>>
> > > >>>>>On 21/03/2018 17:26, jeff.mcgee@xxxxxxxxx wrote:
> > > >>>>>>From: Jeff McGee <jeff.mcgee@xxxxxxxxx>
> > > >>>>>>
> > > >>>>>>Force preemption uses engine reset to enforce a limit on the time that a request targeted for preemption can block. This feature is a requirement in automotive systems where the GPU may be shared by clients of critically high priority and clients of low priority that may not have been curated to be preemption friendly. There may be more general applications of this feature. I'm sharing this as an RFC to stimulate that discussion and also to get any technical feedback that I can before submitting to the product kernel that needs this. I have developed the patches for ease of rebase, given that this is for the moment considered a non-upstreamable feature. It would be possible to refactor hangcheck to fully incorporate force preemption as another tier of patience (or impatience) with the running request.
> > > >>>>>
> > > >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this work only with stateless clients - or in other words, what would happen to stateful clients which would be force preempted? Or is the answer that we don't care since they are misbehaving?
> > > >>>>
> > > >>>>They get notified of being guilty of causing a gpu reset; three strikes and they are out (banned from using the gpu) under the current rules. This is a very blunt hammer that requires the rest of the system to be robust; one might argue that time spent making the system robust would be better served making sure that the timer never expired in the first place, thereby eliminating the need for a forced gpu reset.
> > > >>>>-Chris
> > > >>>
> > > >>>Yes, for simplification the policy applied to force-preempted contexts is the same as for hanging contexts. It is known that this feature should not be required in a fully curated system. It's a requirement if the end user will be allowed to install 3rd party apps to run in the non-critical domain.
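
To make the mechanism under discussion concrete, below is a minimal standalone C model of the idea from the cover letter: when higher-priority work arrives, preemption is requested and a deadline is armed; if the preemption has not resolved by that deadline, the engine is reset and the running context is treated as guilty. The names and values here (fpreempt_timeout_ms, request_preemption(), engine_reset(), the 100 ms budget) are illustrative assumptions, not the actual RFC code.

/*
 * Toy model (userspace C, not the RFC patches): arm a deadline when a
 * preemption is requested on behalf of higher-priority work; if the
 * preemption has not resolved by then, reset the engine and mark the
 * running context guilty.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct engine {
	bool preempt_pending;      /* higher-priority work is waiting */
	uint64_t preempt_deadline; /* time (ms) by which preemption must finish */
	int running_ctx;           /* id of the context currently on the engine */
};

static const uint64_t fpreempt_timeout_ms = 100; /* illustrative budget */

/* Blunt hammer: kick the hog off the engine and record it as guilty. */
static void engine_reset(struct engine *e)
{
	printf("ctx %d blocked preemption too long: engine reset, context marked guilty\n",
	       e->running_ctx);
	e->running_ctx = -1;
	e->preempt_pending = false;
}

/* Higher-priority request arrived: inject preemption and arm the timer. */
static void request_preemption(struct engine *e, uint64_t now_ms)
{
	e->preempt_pending = true;
	e->preempt_deadline = now_ms + fpreempt_timeout_ms;
}

/* Periodic check, in spirit another "tier of patience" in hangcheck. */
static void fpreempt_tick(struct engine *e, uint64_t now_ms)
{
	if (e->preempt_pending && now_ms >= e->preempt_deadline)
		engine_reset(e);
}

int main(void)
{
	struct engine e = { .running_ctx = 7 };

	request_preemption(&e, 0); /* high-priority work shows up at t=0 */
	fpreempt_tick(&e, 50);     /* still within budget: keep waiting */
	fpreempt_tick(&e, 150);    /* budget exceeded: reset the engine */
	return 0;
}
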
> > > >>
> > > >>My concern is whether it is safe to call this force _preemption_, while it is not really expected to work as preemption from the point of view of the preempted context. I may be missing some angle here, but I think a better name would include words like maximum request duration or something.
> > > >>
> > > >>I can see a difference between the allowed maximum duration when there is something else pending, and when there isn't, but I don't immediately see that we should consider this distinction for any real benefit?
> > > >>
> > > >>So should the feature just be "maximum request duration"? This would perhaps make it just a special case of hangcheck, which ignores head progress, or whatever we do in there.
> > > >>
> > > >>Regards,
> > > >>
> > > >>Tvrtko
> > > >
> > > >I think you might be unclear about how this works. We're not starting a preemption to see if we can cleanly remove a request that has begun to exceed its normal time slice, i.e. hangcheck. This is about bounding the time that a normal preemption can take. So we first start preemption in response to higher-priority request arrival, then wait for the preemption to complete within a certain amount of time. If it does not, we resort to reset.
> > > >
> > > >So it's really "force the resolution of a preemption", shortened to "force preemption".
> > >
> > > You are right, I veered off in my thinking and ended up with something different. :)
> > >
> > > I however still think the name is potentially misleading, since the request/context is not getting preempted. It is getting effectively killed (sooner or later, directly or indirectly).
> > >
> > > Maybe that is OK for the specific use case when everything is only broken and not malicious.
> > >
> > > In a more general purpose system it would be a bit random when something would work, and when it wouldn't, depending on system setup and even timings.
> > >
> > > Hm, maybe you don't even really benefit from the standard three-strikes-and-you-are-out policy, and for this specific use case you should just kill it straight away. If it couldn't be preempted once, why pay the penalty any more?
> > >
> > > If you don't have it already, devising a solution which blacklists the process (if it creates more contexts), or even a parent (if forking is applicable and implementation feasible), for offenders could also be beneficial.
> > >
> > > Regards,
> > >
> > > Tvrtko
> >
> > Fair enough. There wasn't a lot of deliberation on this name. We referred to it in various ways during development. I think I started using "force preemption" because it was short. "Reset to preempt" was another phrase that was used.
> >
> > The handling of the guilty client/context could be tailored more. Like I said, it was easiest to start with the same sort of handling that we already have for hang scenarios. Simple is good when you are rebasing without much hope to upstream. :(
> >
> > If there was interest in upstreaming this capability, we could certainly incorporate it nicely within a refactoring of hangcheck. And then we wouldn't even need a special name for it. The whole thing could be recast as time slice management, where your slice is condition-based. You get unlimited time if no one wants the engine, X time if someone of equal priority wants the engine, and Y time if someone of higher priority wants the engine, etc., where 'Y' is analogous to the fpreempt_timeout value in my RFC.
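
A rough sketch of that condition-based time slice framing, with the budget chosen by who else wants the engine. The enum, helper name, and the X/Y numbers are invented for illustration; only the idea comes from the mail above.

/*
 * Sketch of a condition-based time slice: the budget for the running
 * request depends on who else wants the engine. Names and numbers are
 * invented for illustration.
 */
#include <stdint.h>
#include <stdio.h>

enum pending_state {
	PENDING_NONE,        /* nobody else wants the engine */
	PENDING_EQUAL_PRIO,  /* equal-priority work is queued */
	PENDING_HIGHER_PRIO, /* higher-priority work is queued */
};

#define BUDGET_UNLIMITED UINT64_MAX

static uint64_t request_budget_ms(enum pending_state pending)
{
	switch (pending) {
	case PENDING_NONE:
		return BUDGET_UNLIMITED; /* run as long as you like */
	case PENDING_EQUAL_PRIO:
		return 10; /* "X": an ordinary time slice */
	case PENDING_HIGHER_PRIO:
		return 1;  /* "Y": analogous to the fpreempt_timeout above */
	}
	return BUDGET_UNLIMITED;
}

int main(void)
{
	printf("budget with higher-priority work pending: %llu ms\n",
	       (unsigned long long)request_budget_ms(PENDING_HIGHER_PRIO));
	return 0;
}
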
>
> On the subject of it being "a bit random when something would work": Currently it's the well-behaved app (and everything else behind it) that pays the price for the badly-behaved umd being a bad citizen. 12s is a really long time for the GUI to freeze before a bad context can be nuked.

But the price is not that high when you get a momentary freeze of the system, compared to applications dying on the user and unsaved work getting lost, for example.

> Yes, with enforced pre-emption latency, the bad umd pays a price in that, if it's lucky, it will run to completion. If unlucky, it gets nuked. But it's a bad umd at the end of the day. And we should find it and fix it.

I think the key issue is the difference in difficulty between finding an offending GPU hogger in a completely controlled product environment vs. in the wild in a desktop environment. In the wild an application could have worked for years without problems (being a GPU hogger, whatever the limit is set at), and then stops working when another application is introduced that demands the GPU at a higher priority; the GPU-hogging application won't slow down, it will just die.

Requiring all userspace applications to suddenly have a short ARB check period, or be in danger of getting killed, is not a light change to make in the generic desktop environment. So some clear opt-in, in the form of a "Sacrifice everything to keep running at 60 FPS [ ]" tick-box in the compositor, would be required.

So far nobody has been successful in selling this to the userspace compositors (the most likely user) - or has somebody?

Regards,
Joonas

> With fair scheduling this becomes even more important - all umds need to ensure that they are capable of pre-empting to the finest granularity supported by the h/w for their respective workloads. If they don't, they really should be obliterated (aka detected and fixed).
>
> The big benefit I see with the new approach (vs TDR) is that there's actually no need to enforce the (slightly arbitrarily determined) 3x4s timeout for a context at all, providing it plays nicely. If you want to run a really long GPGPU workload, do so by all means, just don't hog the GPU.
>
> BTW, the same context banning should be applied to a context nuked by forced preemption as with TDR. That's the intention at least. This is just an RFC right now.
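
For completeness, a toy model of that shared banning policy: a context found guilty of a forced-preemption reset is scored the same way as a hanging context and is banned after three strikes. The structure and function names are invented for illustration and do not reflect the i915 implementation.

/*
 * Toy model of the shared banning policy: a context nuked by forced
 * preemption is scored exactly like one nuked by hang detection, and is
 * banned after three strikes. Names are invented, not the i915 code.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_STRIKES 3

struct context_score {
	int strikes;
	bool banned;
};

/* Called whenever the context is found guilty of a reset, for any reason. */
static void context_mark_guilty(struct context_score *ctx)
{
	if (++ctx->strikes >= MAX_STRIKES)
		ctx->banned = true; /* further submissions are rejected */
}

int main(void)
{
	struct context_score ctx = { 0 };

	context_mark_guilty(&ctx);
	context_mark_guilty(&ctx);
	context_mark_guilty(&ctx); /* third strike */
	printf("banned: %s\n", ctx.banned ? "yes" : "no");
	return 0;
}
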