Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Jason Ekstrand <jason@xxxxxxxxxxxxxx> · Mon, 9 Jan 2023 09:45:09 -0600

On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:

> On Fri, 30 Dec 2022 12:55:08 +0100

> Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> 

> > On Fri, 30 Dec 2022 11:20:42 +0100

> > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> > 

> > > Hello Matthew,

> > > 

> > > On Thu, 22 Dec 2022 14:21:11 -0800

> > > Matthew Brost <matthew.brost@xxxxxxxxx> wrote:

> > >   

> > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1

> > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this

> > > > seems a bit odd but let us explain the reasoning below.

> > > > 

> > > > 1. In XE the submission order from multiple drm_sched_entity is not

> > > > guaranteed to match the completion order, even when targeting the same hardware

> > > > engine. This is because in XE we have a firmware scheduler, the GuC,

> > > > which is allowed to reorder, timeslice, and preempt submissions. If using a

> > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls

> > > > apart as the TDR expects submission order == completion order. Using a

> > > > dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.

> > > 

> > > Oh, that's interesting. I've been trying to solve the same sort of

> > > issues to support Arm's new Mali GPU which is relying on a FW-assisted

> > > scheduling scheme (you give the FW N streams to execute, and it does

> > > the scheduling between those N command streams, the kernel driver

> > > does timeslice scheduling to update the command streams passed to the

> > > FW). I must admit I gave up on using drm_sched at some point, mostly

> > > because the integration with drm_sched was painful, but also because I

> > > felt trying to bend drm_sched to make it interact with a

> > > timeslice-oriented scheduling model wasn't really future proof. Giving

> > > drm_sched_entity exclusive access to a drm_gpu_scheduler might

> > > help for a few things (didn't think it through yet), but I feel it's

> > > coming short on other aspects we have to deal with on Arm GPUs.  

> > 

> > Ok, so I just had a quick look at the Xe driver and how it

> > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I

> > have a better understanding of how you get away with using drm_sched

> > while still controlling how scheduling is really done. Here

> > drm_gpu_scheduler is just a dummy abstraction that lets you use the

> > drm_sched job queuing/dep/tracking mechanism. The whole run-queue

You nailed it here, we use the DRM scheduler for queuing jobs,

dependency tracking and releasing jobs to be scheduled when dependencies

are met, and lastly a tracking mechanism for in-flight jobs that need to

be cleaned up if an error occurs. It doesn't actually do any scheduling

aside from the most basic level of not overflowing the submission ring

buffer. In this sense, a 1 to 1 relationship between entity and

scheduler fits quite well.
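
To make the "front-end only" role concrete, here is a minimal sketch in plain C of the two things described above: holding a job back until its dependencies are met, and refusing to overflow a fixed-size submission ring. This is not the drm_sched API. The names `RING_SLOTS`, `struct job`, `struct frontend`, `try_submit()` and `job_done()` are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SLOTS 2	/* hypothetical ring capacity, kept tiny on purpose */

struct job {
	int unmet_deps;		/* dependencies that have not signaled yet */
	bool submitted;		/* already handed to the backend */
};

struct frontend {
	int inflight;		/* jobs currently occupying ring slots */
};

/* Hand the job to the backend only if its deps are met and the ring has room. */
static bool try_submit(struct frontend *fe, struct job *job)
{
	if (job->submitted || job->unmet_deps > 0 || fe->inflight >= RING_SLOTS)
		return false;

	job->submitted = true;
	fe->inflight++;
	return true;
}

/* Called when the backend reports a completion; frees one ring slot. */
static void job_done(struct frontend *fe)
{
	assert(fe->inflight > 0);
	fe->inflight--;
}
```

Nothing here decides which context runs next. With one entity per scheduler there is nothing to choose between, so the only decisions left are "are the deps met?" and "is there room in the ring?".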

Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel want here and what you need for Arm thanks to the number of FW queues available. I don't remember the exact number of GuC queues but it's at least 1k. This puts it in an entirely different class from what you have on Mali. Roughly, there are three categories here:

 1. Hardware where the kernel is placing jobs on actual HW rings. This is old Mali, Intel Haswell and earlier, and probably a bunch of others.  (Intel BDW+ with execlists is a weird case that doesn't fit in this categorization.)

 2. Hardware (or firmware) with a very limited number of queues where you're going to have to juggle in the kernel in order to run desktop Linux.

 3. Firmware scheduling with a high queue count. In this case, you don't want the kernel scheduling anything. Just throw it at the firmware and let it go brrrrr.  If we ever run out of queues (unlikely), the kernel can temporarily pause some low-priority contexts and do some juggling or, frankly, just fail userspace queue creation and tell the user to close some windows.

The existence of this 2nd class is a bit annoying but it's where we are. I think it's worth recognizing that Xe and panfrost are in different places here and will require different designs. For Xe, we really are just using drm/scheduler as a front-end and the firmware does all the real scheduling.

How do we deal with class 2? That's an interesting question.  We may eventually want to break that off into a separate discussion and not litter the Xe thread but let's keep going here for a bit.  I think there are some pretty reasonable solutions but they're going to look a bit different.

The way I did this for Xe with execlists was to keep the 1:1:1 mapping between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.  Instead of feeding a GuC ring, though, it would feed a fixed-size execlist ring and then there was a tiny kernel which operated entirely in IRQ handlers which juggled those execlists by smashing HW registers.  For Panfrost, I think we want something slightly different but can borrow some ideas here.  In particular, have the schedulers feed kernel-side SW queues (they can even be fixed-size if that helps) and then have a kthread which juggles those and feeds the limited FW queues.  In the case where you have few enough active contexts to fit them all in FW, I do think it's best to have them all active in FW and let it schedule. But with only 31, you need to be able to juggle if you run out.
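
A rough sketch of that juggling, in plain C and under stated assumptions: kernel-side SW queues are mapped onto a small, fixed number of FW slots, everything stays resident while it fits, and one slot is rotated per tick once the slots are oversubscribed. All names here (`FW_SLOTS`, `struct sw_queue`, `struct juggler`, `juggle_tick()`) are hypothetical, `FW_SLOTS` is set to 2 only to keep the example small, and the round-robin rotation policy is an assumption rather than anything Panfrost or Xe actually implements.

```c
#include <assert.h>
#include <stddef.h>

#define FW_SLOTS   2	/* hypothetical; tiny so the rotation is easy to follow */
#define MAX_QUEUES 8

struct sw_queue {
	int pending;	/* jobs waiting in this kernel-side SW queue */
	int slot;	/* FW slot index, or -1 when not resident in FW */
};

struct juggler {
	struct sw_queue *slots[FW_SLOTS];
	struct sw_queue *queues[MAX_QUEUES];
	size_t nqueues;
	size_t cursor;	/* round-robin position for picking waiters */
	int victim;	/* next slot to preempt when oversubscribed */
};

static void bind(struct juggler *j, int s, struct sw_queue *q)
{
	assert(j->slots[s] == NULL && q->slot == -1);
	j->slots[s] = q;
	q->slot = s;
}

static void evict(struct juggler *j, int s)
{
	if (j->slots[s]) {
		j->slots[s]->slot = -1;
		j->slots[s] = NULL;
	}
}

/* Next queue with pending work that is not resident, in round-robin order. */
static struct sw_queue *next_waiter(struct juggler *j)
{
	for (size_t i = 0; i < j->nqueues; i++) {
		struct sw_queue *q = j->queues[(j->cursor + i) % j->nqueues];

		if (q->pending > 0 && q->slot == -1) {
			j->cursor = (j->cursor + i + 1) % j->nqueues;
			return q;
		}
	}
	return NULL;
}

/* One timeslice tick; returns how many queues are resident afterwards. */
static int juggle_tick(struct juggler *j)
{
	int free_slots = 0, resident = 0;

	/* 1. Release slots whose queue has drained. */
	for (int s = 0; s < FW_SLOTS; s++) {
		if (j->slots[s] && j->slots[s]->pending == 0)
			evict(j, s);
		if (!j->slots[s])
			free_slots++;
	}

	/* 2. Oversubscribed: preempt one resident queue in favour of a waiter. */
	if (free_slots == 0) {
		struct sw_queue *q = next_waiter(j);

		if (q) {
			evict(j, j->victim);
			bind(j, j->victim, q);
			j->victim = (j->victim + 1) % FW_SLOTS;
		}
	}

	/* 3. Otherwise just fill free slots and let the FW schedule. */
	for (int s = 0; s < FW_SLOTS; s++) {
		if (!j->slots[s]) {
			struct sw_queue *q = next_waiter(j);

			if (!q)
				break;
			bind(j, s, q);
		}
	}

	for (int s = 0; s < FW_SLOTS; s++)
		if (j->slots[s])
			resident++;
	return resident;
}
```

With three busy queues and two slots, the first tick makes two queues resident and later ticks rotate the third one in. With two or fewer busy queues, step 2 never fires and the FW does all the scheduling.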

FWIW this design was also run by AMD quite a while ago (off the list)

and we didn't get any serious push back. Things can change however...

Yup, AMD and NVIDIA both want this, more-or-less.

> > selection is dumb because there's only one entity ever bound to the

> > scheduler (the one that's part of the xe_guc_engine object which also

> > contains the drm_gpu_scheduler instance). I guess the main issue we'd

> > have on Arm is the fact that the stream doesn't necessarily get

> > scheduled when ->run_job() is called, it can be placed in the runnable

> > queue and be picked later by the kernel-side scheduler when a FW slot

> > gets released. That can probably be sorted out by manually disabling the

> > job timer and re-enabling it when the stream gets picked by the

> > scheduler. But my main concern remains, we're basically abusing

> > drm_sched here.

> > 

That's a matter of opinion, yes we are using it slightly differently

than anyone else but IMO the fact the DRM scheduler works for the Xe use

case with barely any changes is a testament to its design.

> > For the Arm driver, that means turning the following sequence

> > 

> > 1. wait for job deps

> > 2. queue job to ringbuf and push the stream to the runnable

> >    queue (if it wasn't queued already). Wake up the timeslice scheduler

> >    to re-evaluate (if the stream is not on a FW slot already)

> > 3. stream gets picked by the timeslice scheduler and sent to the FW for

> >    execution

> > 

> > into

> > 

> > 1. queue job to entity which takes care of waiting for job deps for

> >    us

> > 2. schedule a drm_sched_main iteration

> > 3. the only available entity is picked, and the first job from this

> >    entity is dequeued. ->run_job() is called: the job is queued to the

> >    ringbuf and the stream is pushed to the runnable queue (if it wasn't

> >    queued already). Wake up the timeslice scheduler to re-evaluate (if

> >    the stream is not on a FW slot already)

> > 4. stream gets picked by the timeslice scheduler and sent to the FW for

> >    execution

> >

Yes, an extra step but you get to use all the nice DRM scheduler

functions for dependency tracking. Also in our case we really want a

single entry point in the backend (the work queue). Also see [1] which

helped us seal a bunch of races we had in the i915 by using a single

entry point. All these benefits are why we landed on the DRM scheduler

and it has worked out rather nicely compared to the i915.

[1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1

> > That's one extra step we don't really need. To sum up, yes, all the

> > job/entity tracking might be interesting to share/re-use, but I wonder

> > if we couldn't have that without pulling out the scheduling part of

> > drm_sched, or maybe I'm missing something, and there's something in

> > drm_gpu_scheduler you really need.

> 

> On second thought, that's probably an acceptable overhead (not even

> sure the extra step I was mentioning exists in practice, because dep

> fence signaled state is checked as part of the drm_sched_main

> iteration, so that's basically replacing the worker I schedule to

> check job deps), and I like the idea of being able to re-use drm_sched

> dep-tracking without resorting to invasive changes to the existing

> logic, so I'll probably give it a try.

Let me know how this goes.

Matt