Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Thu, 12 Jan 2023 17:48:33 +0100

On Thu, 12 Jan 2023 16:38:18 +0100
Daniel Vetter <daniel@xxxxxxxx> wrote:

> > >
> > > Also if you do the allocation in ->prepare_job with dma_fence and not
> > > run_job, then I think can sort out fairness issues (if they do pop up) in
> > > the drm/sched code instead of having to think about this in each driver.  
> >
> > By allocation, you mean assigning a FW slot ID? If we do this allocation
> > in ->prepare_job(), couldn't we mess up ordering? Like,
> > lower-prio/later-queuing entity being scheduled before its pairs,
> > because there's no guarantee on the job completion order (and thus the
> > queue idleness order). I mean, completion order depends on the kind of
> > job being executed by the queues, the time the FW actually lets the
> > queue execute things and probably other factors. You can use metrics
> > like the position in the LRU list + the amount of jobs currently
> > queued to a group to guess which one will be idle first, but that's
> > just a guess. And I'm not sure I see what doing this slot selection in  
> > ->prepare_job() would bring us compared to doing it in ->run_job(),  
> > where we can just pick the least recently used slot.  
> 
> In ->prepare_job you can let the scheduler code do the stalling (and
> ensure fairness), in ->run_job it's your job.

Yeah returning a fence in ->prepare_job() to wait for a FW slot to
become idle sounds good. This fence would be signaled when one of the
slots becomes idle. But I'm wondering why we'd want to select the slot
so early. Can't we just do the selection in ->run_job()? After all, if
the fence has been signaled, that means we'll find at least one slot
that's ready when we hit ->run_job(), and we can select it at that
point.

> The current RFC doesn't
> really bother much with getting this very right, but if the scheduler
> code tries to make sure it pushes higher-prio stuff in first before
> others, you should get the right outcome.

Okay, so I'm confused again. We said we had a 1:1
drm_gpu_scheduler:drm_sched_entity mapping, meaning that entities are
isolated from each other. I can see how I could place the dma_fence
returned by ->prepare_job() in a driver-specific per-priority list, so
the driver can pick the highest-prio/first-inserted entry and signal the
associated fence when a slot becomes idle. But I have a hard time
seeing how common code could do that if it doesn't see the other
entities. Right now, drm_gpu_scheduler only selects the best entity
among the registered ones, and there's only one entity per
drm_gpu_scheduler in this case.

> 
> The more important functional issue is that you must only allocate the
> fw slot after all dependencies have signalled.

Sure, but it doesn't have to be a specific FW slot, it can be any FW
slot, as long as we don't signal more fences than we have slots
available, right?

> Otherwise you might get
> a nice deadlock, where job A is waiting for the fw slot of B to become
> free, and B is waiting for A to finish.

Got that part, and that's ensured by the fact we wait for all
regular deps before returning the FW-slot-available dma_fence in
->prepare_job(). This exact same fence will be signaled when a slot
becomes idle.

> 
> > > Few fw sched slots essentially just make fw scheduling unfairness more
> > > prominent than with others, but I don't think it's fundamentally something
> > > else really.
> > >
> > > If every ctx does that and the lru isn't too busted, they should then form
> > > a nice orderly queue and cycle through the fw scheduler, while still being
> > > able to get some work done. It's essentially the exact same thing that
> > > happens with ttm vram eviction, when you have a total working set where
> > > each process fits in vram individually, but in total they're too big and
> > > you need to cycle things through.  
> >
> > I see.
> >  
> > >  
> > > > > I'll need to make sure this still works with the concept of group (it's
> > > > > not a single queue we schedule, it's a group of queues, meaning that we
> > > > > have N fences to watch to determine if the slot is busy or not, but
> > > > > that should be okay).  
> > > >
> > > > Oh, there's one other thing I forgot to mention: the FW scheduler is
> > > > not entirely fair, it does take the slot priority (which has to be
> > > > unique across all currently assigned slots) into account when
> > > > scheduling groups. So, ideally, we'd want to rotate group priorities
> > > > when they share the same drm_sched_priority (probably based on the
> > > > position in the LRU).  
> > >
> > > Hm that will make things a bit more fun I guess, especially with your
> > > constraint to not update this too often. How strict is that priority
> > > difference? If it's a lot, we might need to treat this more like execlist
> > > and less like a real fw scheduler ...  
> >
> > Strict as in, if two groups with same priority try to request an
> > overlapping set of resources (cores or tilers), it can deadlock, so
> > pretty strict I would say :-).  
> 
> So it first finishes all the higher priority tasks and only then it
> runs the next one, so no round-robin? Or am I just confused what this
> all is about. Or is it more that the order in the group determines how
> it tries to schedule on the hw, and if the earlier job needs hw that
> also the later one needs, then the earlier one has to finish first?
> Which would still mean that for these overlapping cases there's just
> no round-robin in the fw scheduler at all.

Okay, so my understanding is: FW scheduler always takes the highest
priority when selecting between X groups requesting access to a
resource, but if 2 groups want the same resource and have the same
priority, there's no ordering guarantee. The deadlock happens when both
group A and B claim resources X and Y. Group A might get resource X
and group B might get resource Y, both waiting for the other resource
they claimed. If they have different priorities one of them would back
off and let the other run, if they have the same priority, none of them
would, and that's where the deadlock comes from. Note that we don't
control the order resources get acquired from the CS, so there's no way
to avoid this deadlock without assigning different priorities.

And you're right, if you pick different priorities, the only time lower
priority groups get to run is when the highest priority group is
waiting on an asynchronous operation to complete (can be a
compute/frag/tiler job completion, some inter queue synchronization,
waiting for an already acquired resource, ...), or when it's idle. I
suspect queues from different groups can run concurrently if there's
enough command-stream processing slots available, and those groups
request resources that don't overlap, but I'm speculating here. So, no
round-robin if slots are assigned unique priorities. Not even sure
scheduling is time-slice based to be honest, it could be some
cooperative scheduling where groups with the same priorities get to
wait for the currently running group to be blocked to get access to
the HW. In any case, there's no easy way to prevent deadlocks if we
don't assign unique priorities.