Re: [PATCH 41/53] drm/i915/bdw: Avoid non-lite-restore preemptions

Daniel Vetter <daniel@xxxxxxxx> · Mon, 7 Jul 2014 14:47:11 +0200



On Mon, Jun 23, 2014 at 11:52:11AM +0000, Mateo Lozano, Oscar wrote:
> > -----Original Message-----
> > From: Daniel Vetter [mailto:daniel.vetter@xxxxxxxx] On Behalf Of Daniel
> > Vetter
> > Sent: Wednesday, June 18, 2014 9:49 PM
> > To: Mateo Lozano, Oscar
> > Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re:  [PATCH 41/53] drm/i915/bdw: Avoid non-lite-restore
> > preemptions
> > 
> > On Fri, Jun 13, 2014 at 04:37:59PM +0100, oscar.mateo@xxxxxxxxx wrote:
> > > From: Oscar Mateo <oscar.mateo@xxxxxxxxx>
> > >
> > > In the current Execlists feeding mechanism, full preemption is not
> > > supported yet: only lite-restores are allowed (this is: the GPU simply
> > > samples a new tail pointer for the context currently in execution).
> > >
> > > But we have identified an scenario in which a full preemption occurs:
> > > 1) We submit two contexts for execution (A & B).
> > > 2) The GPU finishes with the first one (A), switches to the second one
> > > (B) and informs us.
> > > 3) We submit B again (hoping to cause a lite restore) together with C,
> > > but in the time we spend writing to the ELSP, the GPU finishes B.
> > > 4) The GPU start executing B again (since we told it so).
> > > 5) We receive a B finished interrupt and, mistakenly, we submit C
> > > (again) and D, causing a full preemption of B.
> > >
> > > By keeping a better track of our submissions, we can avoid the
> > > scenario described above.
> > 
> > How? I don't see a way to fundamentally avoid the above race, and I don't
> > really see an issue with it - the gpu should notice that there's not really any
> > work done and then switch to C.
> > 
> > Or am I completely missing the point here?
> > 
> > With no clue at all this looks really scary.
> 
> The race is avoided by keeping track of how many times a context has
> been submitted to the hardware and by better discriminating the received
> context switch interrupts: in the example, when we have submitted B
> twice, we won´t submit C and D as soon as we receive the notification
> that B is completed because we were expecting to get a LITE_RESTORE and
> we didn´t, so we know a second completion will be received shortly.
> 
> Without this explicit checking, the race condition happens and, somehow,
> the batch buffer execution order gets messed with. This can be verified
> with the IGT test I sent together with the series. I don´t know the
> exact mechanism by which the pre-emption messes with the execution order
> but, since other people is working on the Scheduler + Preemption on
> Execlists, I didn´t try to fix it. In these series, only Lite Restores
> are supported (other kind of preemptions WARN).
> 
> I´ll add this clarification to the commit message.

Yeah, please elaborate more clearly in the commit message what exactly
seems to go wrong (and where you're unsure with your model of how the
hardware reacts). And please also reference the precise testcase (and
precise failure mode without this patch) so that if anyone stumbles over
this again we have some breadcrumbs to figure things out. And some stern
warnings as comments in the code to warn the unsuspecting reader.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx