Hello Matthew,

On Thu, 22 Dec 2022 14:21:11 -0800
Matthew Brost <matthew.brost@xxxxxxxxx> wrote:

> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> seems a bit odd but let us explain the reasoning below.
>
> 1. In XE the submission order from multiple drm_sched_entity is not
> guaranteed to match the completion order, even if targeting the same
> hardware engine. This is because in XE we have a firmware scheduler, the
> GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> a shared drm_gpu_scheduler is used across multiple drm_sched_entity, the
> TDR falls apart as the TDR expects submission order == completion order.
> Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> problem.

Oh, that's interesting. I've been trying to solve the same sort of
issues to support Arm's new Mali GPU, which relies on a FW-assisted
scheduling scheme (you give the FW N streams to execute, and it does
the scheduling between those N command streams; the kernel driver does
timeslice scheduling to update the command streams passed to the FW).

I must admit I gave up on using drm_sched at some point, mostly because
the integration with drm_sched was painful, but also because I felt
trying to bend drm_sched to make it interact with a timeslice-oriented
scheduling model wasn't really future proof. Giving a drm_sched_entity
exclusive access to a drm_gpu_scheduler might help with a few things
(I haven't thought it through yet), but I feel it comes up short on
other aspects we have to deal with on Arm GPUs. Here are a few things I
noted while working on the drm_sched-based PoC:

- The complexity of suspending/resuming streams and recovering from
  failures remains significant (because everything is still very
  asynchronous under the hood). Sure, you don't have to do this fancy
  timeslice-based scheduling, but that's still a lot of code, and
  AFAICT, it didn't integrate well with the drm_sched TDR (my previous
  attempt at reconciling them was unsuccessful, but maybe your patches
  would help there).

- You lose one of the nice things brought by timeslice-based
  scheduling: a tiny bit of fairness. That is, if one stream queues a
  compute job that monopolizes the GPU cores, you know the kernel part
  of the scheduler will eventually evict it and let other streams with
  the same or higher priority run, even before the job timeout kicks
  in.

- Stream slots exposed by the Arm FW are not exactly HW queues that run
  things concurrently. The FW can decide to let only the stream with
  the highest priority get access to the various HW resources (GPU
  cores, tiler, ...), and let other streams starve. That means you
  might get spurious timeouts on some jobs/sched-entities even though
  they never got a chance to run.

So overall, and given I'm no longer the only one having to deal with a
FW scheduler that's designed with timeslice scheduling in mind, I'm
wondering if it's not time to design a common timeslice-based scheduler
instead of trying to bend drivers to use the model enforced by
drm_sched. But that's just my 2 cents, of course.

Regards,

Boris
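
P.S.: To make sure I understand the 1:1 model correctly, here is a
minimal sketch of what I imagine a per-entity scheduler setup looks
like from the driver side. The xe_exec_queue struct and the helper
below are made up for the sake of illustration (I haven't looked at
the actual XE code), and the drm_sched_init()/drm_sched_entity_init()
arguments follow the current (~v6.1) prototypes, so they might not
match your tree exactly:

#include <drm/gpu_scheduler.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>

/* Hypothetical per-queue object: one scheduler, one entity, nothing shared. */
struct xe_exec_queue {
	struct drm_gpu_scheduler sched;	/* dedicated scheduler...  */
	struct drm_sched_entity entity;	/* ...for a single entity  */
};

static int xe_exec_queue_init(struct xe_exec_queue *q,
			      const struct drm_sched_backend_ops *ops,
			      struct device *dev)
{
	struct drm_gpu_scheduler *sched_list[] = { &q->sched };
	int err;

	/*
	 * Each queue gets its own scheduler, so the TDR of that scheduler
	 * only ever sees jobs whose submission order matches their
	 * completion order, even if the GuC reorders across queues.
	 */
	err = drm_sched_init(&q->sched, ops,
			     64,			/* hw_submission */
			     0,				/* hang_limit */
			     msecs_to_jiffies(5000),	/* job timeout */
			     NULL,			/* timeout_wq */
			     NULL,			/* score */
			     "xe-exec-queue", dev);
	if (err)
		return err;

	/* The entity can only ever be scheduled on its own scheduler. */
	err = drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
				    sched_list, ARRAY_SIZE(sched_list),
				    NULL /* guilty */);
	if (err)
		drm_sched_fini(&q->sched);

	return err;
}

If that's roughly what your series does, then I see how it preserves
the submission order == completion order assumption in the TDR; my
remaining concerns are more about the fairness/starvation points above.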