On Fri, 30 Dec 2022 12:55:08 +0100
Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> On Fri, 30 Dec 2022 11:20:42 +0100
> Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:
>
> > Hello Matthew,
> >
> > On Thu, 22 Dec 2022 14:21:11 -0800
> > Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
> >
> > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > seems a bit odd but let us explain the reasoning below.
> > >
> > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > guaranteed to match the completion order, even when targeting the same
> > > hardware engine. This is because in XE we have a firmware scheduler, the
> > > GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> > > a drm_gpu_scheduler is shared across multiple drm_sched_entity, the TDR
> > > falls apart as the TDR expects submission order == completion order.
> > > Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> > > problem.
> >
> > Oh, that's interesting. I've been trying to solve the same sort of
> > issues to support Arm's new Mali GPU, which relies on a FW-assisted
> > scheduling scheme (you give the FW N streams to execute, and it does
> > the scheduling between those N command streams; the kernel driver
> > does timeslice scheduling to update the command streams passed to the
> > FW). I must admit I gave up on using drm_sched at some point, mostly
> > because the integration with drm_sched was painful, but also because I
> > felt trying to bend drm_sched to make it interact with a
> > timeslice-oriented scheduling model wasn't really future proof. Giving
> > drm_sched_entity exclusive access to a drm_gpu_scheduler might help
> > with a few things (I haven't thought it through yet), but I feel it
> > comes up short on other aspects we have to deal with on Arm GPUs.
>
> Ok, so I just had a quick look at the Xe driver and how it
> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> have a better understanding of how you get away with using drm_sched
> while still controlling how scheduling is really done. Here
> drm_gpu_scheduler is just a dummy abstraction that lets you use the
> drm_sched job queuing/dep-tracking mechanism. The whole run-queue
> selection is dumb because there's only one entity ever bound to the
> scheduler (the one that's part of the xe_guc_engine object, which also
> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> have on Arm is the fact that the stream doesn't necessarily get
> scheduled when ->run_job() is called; it can be placed in the runnable
> queue and be picked later by the kernel-side scheduler when a FW slot
> gets released. That can probably be sorted out by manually disabling the
> job timer and re-enabling it when the stream gets picked by the
> scheduler. But my main concern remains: we're basically abusing
> drm_sched here.
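
Thinking a bit more about the "disable/re-enable the job timer" part:
drm_sched already exposes drm_sched_suspend_timeout() and
drm_sched_resume_timeout(), so the dance could look roughly like the
sketch below. This is only meant to make the idea concrete: all the
my_* names (my_queue, my_job, my_queue_push_job(), ...) are made up,
only the two timeout helpers and the ->run_job() prototype are existing
drm_sched bits, and I haven't checked how well suspending the timeout
from inside ->run_job() actually behaves.

static struct dma_fence *my_queue_run_job(struct drm_sched_job *sched_job)
{
	/* Hypothetical driver objects, not actual drm_sched structures. */
	struct my_job *job = to_my_job(sched_job);
	struct my_queue *queue = job->queue;

	/* Copy the job to the stream's ring buffer. */
	my_queue_push_job(queue, job);

	if (!my_queue_on_fw_slot(queue)) {
		/*
		 * The stream is only runnable, not running: freeze the
		 * drm_sched timeout until the kernel-side timeslice
		 * scheduler actually hands us a FW slot.
		 */
		queue->remaining_timeout =
			drm_sched_suspend_timeout(&queue->sched);
		my_timeslice_sched_wakeup(queue->dev);
	}

	return dma_fence_get(job->hw_fence);
}

/* Called when the timeslice scheduler assigns the stream to a FW slot. */
static void my_queue_fw_slot_assigned(struct my_queue *queue)
{
	drm_sched_resume_timeout(&queue->sched, queue->remaining_timeout);
}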

> For the Arm driver, that means turning the following sequence
>
> 1. wait for job deps
> 2. queue job to ringbuf and push the stream to the runnable
>    queue (if it wasn't queued already). Wake up the timeslice scheduler
>    to re-evaluate (if the stream is not on a FW slot already)
> 3. stream gets picked by the timeslice scheduler and sent to the FW for
>    execution
>
> into
>
> 1. queue job to entity, which takes care of waiting for job deps for
>    us
> 2. schedule a drm_sched_main iteration
> 3. the only available entity is picked, and the first job from this
>    entity is dequeued. ->run_job() is called: the job is queued to the
>    ringbuf and the stream is pushed to the runnable queue (if it wasn't
>    queued already). Wake up the timeslice scheduler to re-evaluate (if
>    the stream is not on a FW slot already)
> 4. stream gets picked by the timeslice scheduler and sent to the FW for
>    execution
>
> That's one extra step we don't really need. To sum up, yes, all the
> job/entity tracking might be interesting to share/re-use, but I wonder
> if we couldn't have that without pulling out the scheduling part of
> drm_sched. Or maybe I'm missing something, and there's something in
> drm_gpu_scheduler you really need.

On second thought, that's probably an acceptable overhead (I'm not even
sure the extra step I was mentioning exists in practice, because the dep
fence signaled state is checked as part of the drm_sched_main iteration,
so it basically replaces the worker I schedule to check job deps), and I
like the idea of being able to re-use the drm_sched dep-tracking without
resorting to invasive changes to the existing logic, so I'll probably
give it a try.
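
To make "give it a try" a bit more concrete, the 1:1 setup would
probably boil down to something like the sketch below, mirroring what
Xe does with its xe_guc_engine: one drm_gpu_scheduler and one
drm_sched_entity embedded in each queue object, with the entity's
sched_list containing just that scheduler. Again, all the my_* / MY_*
names are invented, and I'm writing the drm_sched_init() argument list
from memory, so it may not match your tree exactly.

struct my_queue {
	/* One scheduler and one entity per queue, like Xe's xe_guc_engine. */
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	/* ... ring buffer, FW slot state, etc. ... */
};

static int my_queue_init(struct my_device *mydev, struct my_queue *queue)
{
	struct drm_gpu_scheduler *sched = &queue->sched;
	int ret;

	ret = drm_sched_init(sched, &my_sched_ops, MY_RING_SLOTS,
			     0 /* hang_limit */,
			     msecs_to_jiffies(MY_JOB_TIMEOUT_MS),
			     NULL /* timeout_wq */, NULL /* score */,
			     "my-queue", mydev->dev);
	if (ret)
		return ret;

	/* Run-queue selection is trivial: one entity, one scheduler. */
	ret = drm_sched_entity_init(&queue->entity, DRM_SCHED_PRIORITY_NORMAL,
				    &sched, 1, NULL);
	if (ret)
		drm_sched_fini(sched);

	return ret;
}

Job submission would then go through the usual drm_sched_job_init() /
drm_sched_job_arm() / drm_sched_entity_push_job() path, with
->run_job() doing the ringbuf write and runnable-queue push described
above.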