Hello Matthew,

On Thu, 22 Dec 2022 14:21:11 -0800
Matthew Brost <matthew.brost@xxxxxxxxx> wrote:

> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> seems a bit odd but let us explain the reasoning below.
>
> 1. In XE the submission order from multiple drm_sched_entity is not
> guaranteed to match the completion order, even if targeting the same
> hardware engine. This is because in XE we have a firmware scheduler, the
> GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> a shared drm_gpu_scheduler is used across multiple drm_sched_entity, the
> TDR falls apart as the TDR expects submission order == completion order.
> Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> problem.

Oh, that's interesting. I've been trying to solve the same sort of
issues to support Arm's new Mali GPU, which relies on a FW-assisted
scheduling scheme (you give the FW N streams to execute, and it does
the scheduling between those N command streams; the kernel driver does
timeslice scheduling to update the command streams passed to the FW).

I must admit I gave up on using drm_sched at some point, mostly because
the integration with drm_sched was painful, but also because I felt
trying to bend drm_sched to make it interact with a timeslice-oriented
scheduling model wasn't really future proof. Giving a drm_sched_entity
exclusive access to a drm_gpu_scheduler might help with a few things
(I haven't thought it through yet), but I feel it comes up short on
other aspects we have to deal with on Arm GPUs. Here are a few things I
noted while working on the drm_sched-based PoC:

- The complexity of suspending/resuming streams and recovering from
  failures remains significant (because everything is still very
  asynchronous under the hood). Sure, you don't have to do this fancy
  timeslice-based scheduling, but that's still a lot of code, and
  AFAICT, it didn't integrate well with the drm_sched TDR (my previous
  attempt at reconciling them was unsuccessful, but maybe your patches
  would help there).

- You lose one of the nice things brought by timeslice-based
  scheduling: a tiny bit of fairness. That is, if one stream queues a
  compute job that monopolizes the GPU cores, you know the kernel part
  of the scheduler will eventually evict it and let other streams with
  the same or higher priority run, even before the job timeout kicks
  in.

- Stream slots exposed by the Arm FW are not exactly HW queues that run
  things concurrently. The FW can decide to let only the stream with
  the highest priority get access to the various HW resources (GPU
  cores, tiler, ...), and let other streams starve. That means you
  might get spurious timeouts on some jobs/sched-entities even though
  they never got a chance to run.

So overall, and given I'm no longer the only one having to deal with a
FW scheduler that's designed with timeslice scheduling in mind, I'm
wondering if it's not time to design a common timeslice-based scheduler
instead of trying to bend drivers to use the model enforced by
drm_sched. But that's just my 2 cents, of course.

Regards,

Boris
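
P.S.: To make sure I understand the 1:1 model correctly, here is a
minimal sketch of what I imagine a per-entity scheduler setup looks
like from the driver side. The xe_exec_queue struct and the helper
below are made up for the sake of illustration (I haven't looked at
the actual XE code), and the drm_sched_init()/drm_sched_entity_init()
arguments follow the current (~v6.1) prototypes, so they might not
match your tree exactly:

#include <drm/gpu_scheduler.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>

/* Hypothetical per-queue object: one scheduler, one entity, nothing shared. */
struct xe_exec_queue {
	struct drm_gpu_scheduler sched;	/* dedicated scheduler...  */
	struct drm_sched_entity entity;	/* ...for a single entity  */
};

static int xe_exec_queue_init(struct xe_exec_queue *q,
			      const struct drm_sched_backend_ops *ops,
			      struct device *dev)
{
	struct drm_gpu_scheduler *sched_list[] = { &q->sched };
	int err;

	/*
	 * Each queue gets its own scheduler, so the TDR of that scheduler
	 * only ever sees jobs whose submission order matches their
	 * completion order, even if the GuC reorders across queues.
	 */
	err = drm_sched_init(&q->sched, ops,
			     64,			/* hw_submission */
			     0,				/* hang_limit */
			     msecs_to_jiffies(5000),	/* job timeout */
			     NULL,			/* timeout_wq */
			     NULL,			/* score */
			     "xe-exec-queue", dev);
	if (err)
		return err;

	/* The entity can only ever be scheduled on its own scheduler. */
	err = drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
				    sched_list, ARRAY_SIZE(sched_list),
				    NULL /* guilty */);
	if (err)
		drm_sched_fini(&q->sched);

	return err;
}

If that's roughly what your series does, then I see how it preserves
the submission order == completion order assumption in the TDR; my
remaining concerns are more about the fairness/starvation points above.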