> -----Original Message-----
> From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Sent: Friday, September 20, 2019 9:04 AM
> To: Bloomfield, Jon <jon.bloomfield@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> Subject: RE: [PATCH] drm/i915: Prevent bonded requests from overtaking each other on preemption
>
> Quoting Bloomfield, Jon (2019-09-20 16:50:57)
> > > -----Original Message-----
> > > From: Intel-gfx <intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Tvrtko Ursulin
> > > Sent: Friday, September 20, 2019 8:12 AM
> > > To: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Subject: Re: [PATCH] drm/i915: Prevent bonded requests from overtaking each other on preemption
> > >
> > > On 20/09/2019 15:57, Chris Wilson wrote:
> > > > Quoting Chris Wilson (2019-09-20 09:36:24)
> > > >> Force bonded requests to run on distinct engines so that they cannot be
> > > >> shuffled onto the same engine where timeslicing will reverse the order.
> > > >> A bonded request will often wait on a semaphore signaled by its master,
> > > >> creating an implicit dependency -- if we ignore that implicit dependency
> > > >> and allow the bonded request to run on the same engine and before its
> > > >> master, we will cause a GPU hang.
> > > >
> > > > Thinking more, it should not directly cause a GPU hang, as the stuck
> > > > request should be timesliced away, and each preemption should be enough
> > > > to keep hangcheck at bay (though we have evidence it may not). So at best
> > > > it runs at half-speed, at worst a third (if my model is correct).
> > >
> > > But I think it is still correct to do since we don't have the coupling
> > > information on re-submit. Hm.. but don't we need to prevent the slave from
> > > changing engines as well?
> >
> > Unless I'm missing something, the proposal here is to set the engines in
> > stone at first submission, and never change them?
>
> For submission here, think execution (submission to the actual HW). (We have
> 2 separate phases that all like to be called submit()!)
>
> > If so, that does sound overly restrictive, and will prevent any kind of
> > rebalancing as workloads (of varying slave counts) come and go.
>
> We are only restricting this request, not the contexts. We still have
> balancing overall, just not instantaneous balancing if we timeslice out
> of this request -- we put it back onto the "same" engine and not another.
> Which is in some ways less than ideal, although strictly we are only
> saying don't put it back onto an engine we have earmarked for our bonded
> request, and so we avoid contending with our parallel request, reducing
> that to serial (and often bad) behaviour.
>
> [So at the end of this statement, I'm more happy with the restriction ;]
>
> > During the original design it was called out that the workloads should be
> > pre-empted atomically. That allows the entire bonding mask to be
> > re-evaluated at every context switch and so we can then rebalance. Still
> > not easy to achieve, I agree :-(
>
> The problem with that statement is that atomic implies a global
> scheduling decision. Blood, sweat and tears.

Agreed - it isn't fun. Perhaps it doesn't matter anyway. Once GuC is
offloading the scheduling it should be able to do a little more wrt
rebalancing. Let's make it a GuC headache instead.

> Of course, with your endless scheme, scheduling is all in the purview of
> the user :)

Hey, don't tarnish me with that brush. I don't like it either.
Actually, it's your scheme technically. I just asked for a way to enable HPC
workloads, and you enthusiastically offered heartbeats & non-persistence. So
shall history be written :-)

> -Chris
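
For readers following the thread, here is a tiny standalone sketch of the
restriction being discussed. It is not the i915 code -- the struct, field and
function names below are invented for illustration -- but it models the idea:
once a master/bonded pair has been chosen for execution, each request drops
the engine earmarked for its partner from its own allowed-engine mask, so a
resubmission after a timeslice preemption cannot shuffle both onto the same
engine and let the bond overtake its master.

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch only, not the i915 implementation; all names are
 * made up.  Each request carries a bitmask of engines it may run on; when
 * a master/bonded pair first executes, each side masks out the engine
 * chosen for its partner, so a later resubmission (e.g. after being
 * timesliced out) cannot land both requests on the same engine.
 */
struct fake_request {
	const char *name;
	uint32_t allowed_engines;	/* one bit per physical engine */
	int engine;			/* engine picked at first execution */
};

/* Pick the lowest-numbered engine still permitted by the mask. */
static int pick_engine(uint32_t mask)
{
	for (int i = 0; i < 32; i++)
		if (mask & (1u << i))
			return i;
	return -1;
}

static void execute_pair(struct fake_request *master, struct fake_request *bond)
{
	master->engine = pick_engine(master->allowed_engines);
	/* the bonded request must start on a distinct engine */
	bond->engine = pick_engine(bond->allowed_engines & ~(1u << master->engine));

	/* earmark: neither request may later migrate onto its partner's engine */
	master->allowed_engines &= ~(1u << bond->engine);
	bond->allowed_engines &= ~(1u << master->engine);
}

int main(void)
{
	/* both start out allowed on four engines (bits 0-3) */
	struct fake_request master = { "master", 0xf, -1 };
	struct fake_request bond   = { "bond",   0xf, -1 };

	execute_pair(&master, &bond);
	printf("%s: engine %d, mask now %#x\n",
	       master.name, master.engine, master.allowed_engines);
	printf("%s: engine %d, mask now %#x\n",
	       bond.name, bond.engine, bond.allowed_engines);
	return 0;
}

Compiled as plain C, this puts the master on engine 0 and the bond on engine 1
and narrows each mask so neither can be resubmitted onto the other's engine;
the discussion above is about whether, and for how long, to keep such a
narrowing in place rather than rebalancing on every context switch.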