On Fri, 30 Dec 2022 12:55:08 +0100
Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> On Fri, 30 Dec 2022 11:20:42 +0100
> Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:
>
> > Hello Matthew,
> >
> > On Thu, 22 Dec 2022 14:21:11 -0800
> > Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
> >
> > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > seems a bit odd but let us explain the reasoning below.
> > >
> > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > guaranteed to match the completion order, even when targeting the same
> > > hardware engine. This is because in XE we have a firmware scheduler, the
> > > GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> > > a drm_gpu_scheduler is shared across multiple drm_sched_entity, the TDR
> > > falls apart as the TDR expects submission order == completion order.
> > > Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> > > problem.
> >
> > Oh, that's interesting. I've been trying to solve the same sort of
> > issues to support Arm's new Mali GPU, which relies on a FW-assisted
> > scheduling scheme (you give the FW N streams to execute, and it does
> > the scheduling between those N command streams; the kernel driver
> > does timeslice scheduling to update the command streams passed to the
> > FW). I must admit I gave up on using drm_sched at some point, mostly
> > because the integration with drm_sched was painful, but also because I
> > felt trying to bend drm_sched to make it interact with a
> > timeslice-oriented scheduling model wasn't really future proof. Giving
> > drm_sched_entity exclusive access to a drm_gpu_scheduler might help
> > with a few things (I haven't thought it through yet), but I feel it
> > comes up short on other aspects we have to deal with on Arm GPUs.
>
> Ok, so I just had a quick look at the Xe driver and how it
> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> have a better understanding of how you get away with using drm_sched
> while still controlling how scheduling is really done. Here
> drm_gpu_scheduler is just a dummy abstraction that lets you use the
> drm_sched job queuing/dep-tracking mechanism. The whole run-queue
> selection is dumb because there's only one entity ever bound to the
> scheduler (the one that's part of the xe_guc_engine object, which also
> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> have on Arm is the fact that the stream doesn't necessarily get
> scheduled when ->run_job() is called; it can be placed in the runnable
> queue and be picked later by the kernel-side scheduler when a FW slot
> gets released. That can probably be sorted out by manually disabling the
> job timer and re-enabling it when the stream gets picked by the
> scheduler. But my main concern remains: we're basically abusing
> drm_sched here.
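
Thinking a bit more about the "disable/re-enable the job timer" part:
drm_sched already exposes drm_sched_suspend_timeout() and
drm_sched_resume_timeout(), so the dance could look roughly like the
sketch below. This is only meant to make the idea concrete: all the
my_* names (my_queue, my_job, my_queue_push_job(), ...) are made up,
only the two timeout helpers and the ->run_job() prototype are existing
drm_sched bits, and I haven't checked how well suspending the timeout
from inside ->run_job() actually behaves.

static struct dma_fence *my_queue_run_job(struct drm_sched_job *sched_job)
{
	/* Hypothetical driver objects, not actual drm_sched structures. */
	struct my_job *job = to_my_job(sched_job);
	struct my_queue *queue = job->queue;

	/* Copy the job to the stream's ring buffer. */
	my_queue_push_job(queue, job);

	if (!my_queue_on_fw_slot(queue)) {
		/*
		 * The stream is only runnable, not running: freeze the
		 * drm_sched timeout until the kernel-side timeslice
		 * scheduler actually hands us a FW slot.
		 */
		queue->remaining_timeout =
			drm_sched_suspend_timeout(&queue->sched);
		my_timeslice_sched_wakeup(queue->dev);
	}

	return dma_fence_get(job->hw_fence);
}

/* Called when the timeslice scheduler assigns the stream to a FW slot. */
static void my_queue_fw_slot_assigned(struct my_queue *queue)
{
	drm_sched_resume_timeout(&queue->sched, queue->remaining_timeout);
}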

> For the Arm driver, that means turning the following sequence
>
> 1. wait for job deps
> 2. queue job to ringbuf and push the stream to the runnable
>    queue (if it wasn't queued already). Wake up the timeslice scheduler
>    to re-evaluate (if the stream is not on a FW slot already)
> 3. stream gets picked by the timeslice scheduler and sent to the FW for
>    execution
>
> into
>
> 1. queue job to entity, which takes care of waiting for job deps for
>    us
> 2. schedule a drm_sched_main iteration
> 3. the only available entity is picked, and the first job from this
>    entity is dequeued. ->run_job() is called: the job is queued to the
>    ringbuf and the stream is pushed to the runnable queue (if it wasn't
>    queued already). Wake up the timeslice scheduler to re-evaluate (if
>    the stream is not on a FW slot already)
> 4. stream gets picked by the timeslice scheduler and sent to the FW for
>    execution
>
> That's one extra step we don't really need. To sum up, yes, all the
> job/entity tracking might be interesting to share/re-use, but I wonder
> if we couldn't have that without pulling out the scheduling part of
> drm_sched. Or maybe I'm missing something, and there's something in
> drm_gpu_scheduler you really need.

On second thought, that's probably an acceptable overhead (I'm not even
sure the extra step I was mentioning exists in practice, because the dep
fence signaled state is checked as part of the drm_sched_main iteration,
so it basically replaces the worker I schedule to check job deps), and I
like the idea of being able to re-use the drm_sched dep-tracking without
resorting to invasive changes to the existing logic, so I'll probably
give it a try.
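
To make "give it a try" a bit more concrete, the 1:1 setup would
probably boil down to something like the sketch below, mirroring what
Xe does with its xe_guc_engine: one drm_gpu_scheduler and one
drm_sched_entity embedded in each queue object, with the entity's
sched_list containing just that scheduler. Again, all the my_* / MY_*
names are invented, and I'm writing the drm_sched_init() argument list
from memory, so it may not match your tree exactly.

struct my_queue {
	/* One scheduler and one entity per queue, like Xe's xe_guc_engine. */
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	/* ... ring buffer, FW slot state, etc. ... */
};

static int my_queue_init(struct my_device *mydev, struct my_queue *queue)
{
	struct drm_gpu_scheduler *sched = &queue->sched;
	int ret;

	ret = drm_sched_init(sched, &my_sched_ops, MY_RING_SLOTS,
			     0 /* hang_limit */,
			     msecs_to_jiffies(MY_JOB_TIMEOUT_MS),
			     NULL /* timeout_wq */, NULL /* score */,
			     "my-queue", mydev->dev);
	if (ret)
		return ret;

	/* Run-queue selection is trivial: one entity, one scheduler. */
	ret = drm_sched_entity_init(&queue->entity, DRM_SCHED_PRIORITY_NORMAL,
				    &sched, 1, NULL);
	if (ret)
		drm_sched_fini(sched);

	return ret;
}

Job submission would then go through the usual drm_sched_job_init() /
drm_sched_job_arm() / drm_sched_entity_push_job() path, with
->run_job() doing the ringbuf write and runnable-queue push described
above.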