Hi,

On Tue, 3 Jan 2023 13:02:15 +0000
Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:

> On 02/01/2023 07:30, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:
> >
> >> On Fri, 30 Dec 2022 11:20:42 +0100
> >> Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:
> >>
> >>> Hello Matthew,
> >>>
> >>> On Thu, 22 Dec 2022 14:21:11 -0800
> >>> Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
> >>>
> >>>> In XE, the new Intel GPU driver, a choice has been made to have a
> >>>> 1 to 1 mapping between a drm_gpu_scheduler and drm_sched_entity.
> >>>> At first this seems a bit odd, but let us explain the reasoning
> >>>> below.
> >>>>
> >>>> 1. In XE the submission order from multiple drm_sched_entity is
> >>>> not guaranteed to be the same as the completion order, even if
> >>>> targeting the same hardware engine. This is because in XE we have
> >>>> a firmware scheduler, the GuC, which is allowed to reorder,
> >>>> timeslice, and preempt submissions. If a shared drm_gpu_scheduler
> >>>> is used across multiple drm_sched_entity, the TDR falls apart as
> >>>> the TDR expects submission order == completion order. Using a
> >>>> dedicated drm_gpu_scheduler per drm_sched_entity solves this
> >>>> problem.
> >>>
> >>> Oh, that's interesting. I've been trying to solve the same sort of
> >>> issues to support Arm's new Mali GPUs, which rely on a FW-assisted
> >>> scheduling scheme (you give the FW N streams to execute, and it
> >>> does the scheduling between those N command streams; the kernel
> >>> driver does timeslice scheduling to update the command streams
> >>> passed to the FW). I must admit I gave up on using drm_sched at
> >>> some point, mostly because the integration with drm_sched was
> >>> painful, but also because I felt trying to bend drm_sched to make
> >>> it interact with a timeslice-oriented scheduling model wasn't
> >>> really future proof. Giving a drm_sched_entity exclusive access to
> >>> a drm_gpu_scheduler might help for a few things (I haven't thought
> >>> it through yet), but I feel it comes up short on other aspects we
> >>> have to deal with on Arm GPUs.
> >>
> >> Ok, so I just had a quick look at the Xe driver and how it
> >> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think
> >> I have a better understanding of how you get away with using
> >> drm_sched while still controlling how scheduling is really done.
> >> Here drm_gpu_scheduler is just a dummy abstraction that lets you use
> >> the drm_sched job queuing/dep-tracking mechanism. The whole
> >> run-queue selection is dumb because there's only one entity ever
> >> bound to the scheduler (the one that's part of the xe_guc_engine
> >> object, which also contains the drm_gpu_scheduler instance). I guess
> >> the main issue we'd have on Arm is the fact that the stream doesn't
> >> necessarily get scheduled when ->run_job() is called: it can be
> >> placed in the runnable queue and picked later by the kernel-side
> >> scheduler when a FW slot gets released. That can probably be sorted
> >> out by manually disabling the job timer and re-enabling it when the
> >> stream gets picked by the scheduler. But my main concern remains:
> >> we're basically abusing drm_sched here.
> >>
> >> For the Arm driver, that means turning the following sequence
> >>
> >> 1. wait for job deps
> >> 2. queue job to ringbuf and push the stream to the runnable queue
> >>    (if it wasn't queued already). Wake up the timeslice scheduler
> >>    to re-evaluate (if the stream is not on a FW slot already)
> >> 3. stream gets picked by the timeslice scheduler and sent to the FW
> >>    for execution
> >>
> >> into
> >>
> >> 1. queue job to entity, which takes care of waiting for job deps
> >>    for us
> >> 2. schedule a drm_sched_main iteration
> >> 3. the only available entity is picked, and the first job from this
> >>    entity is dequeued. ->run_job() is called: the job is queued to
> >>    the ringbuf and the stream is pushed to the runnable queue (if it
> >>    wasn't queued already). Wake up the timeslice scheduler to
> >>    re-evaluate (if the stream is not on a FW slot already)
> >> 4. stream gets picked by the timeslice scheduler and sent to the FW
> >>    for execution
> >>
> >> That's one extra step we don't really need. To sum up, yes, all the
> >> job/entity tracking might be interesting to share/re-use, but I
> >> wonder if we couldn't have that without pulling out the scheduling
> >> part of drm_sched. Or maybe I'm missing something, and there's
> >> something in drm_gpu_scheduler you really need.
> >
> > On second thought, that's probably an acceptable overhead (not even
> > sure the extra step I was mentioning exists in practice, because dep
> > fence signaled state is checked as part of the drm_sched_main
> > iteration, so that's basically replacing the worker I schedule to
> > check job deps), and I like the idea of being able to re-use the
> > drm_sched dep-tracking without resorting to invasive changes to the
> > existing logic, so I'll probably give it a try.
>
> I agree with the concerns and think that how Xe proposes to integrate
> with drm_sched is a problem, or at least significantly inelegant.

Okay, so it looks like I'm not the only one to be bothered by the way Xe
tries to bypass the drm_sched limitations :-).

> AFAICT it proposes to have 1:1 between *userspace* created contexts
> (per context _and_ engine) and drm_sched. I am not sure avoiding
> invasive changes to the shared code is in the spirit of the overall
> idea; instead the opportunity should be used to look at ways to
> refactor/improve drm_sched.
>
> Even on the low level, the idea to replace drm_sched threads with
> workers has a few problems.
>
> To start with, the pattern of:
>
>    while (not_stopped) {
>        keep picking jobs
>    }
>
> feels fundamentally in disagreement with workers (while it obviously
> fits perfectly with the current kthread design).
>
> Secondly, it probably demands separate workers (not optional);
> otherwise the behaviour of shared workqueues has either the potential
> to explode the number of kernel threads anyway, or to add latency.
>
> What would be interesting to learn is whether the option of
> refactoring drm_sched to deal with out-of-order completion was
> considered, and what the conclusions were.

I might be wrong, but I don't think the fundamental issue here is the
out-of-order completion thing that's mentioned in the commit message.
It just feels like this is a symptom of the impedance mismatch we have
between priority+FIFO-based job scheduling and priority+timeslice-based
queue scheduling (a queue being represented by a drm_sched_entity in
drm_sched).

> A second option would perhaps be to split out the drm_sched code into
> parts which would lend themselves more to a "pick and choose" of its
> functionalities. Specifically, Xe wants frontend dependency tracking,
> but not any scheduling really (neither least-busy drm_sched selection,
> nor FIFO/RQ entity picking), so even having all these data structures
> in memory is a waste.

Same thing for the panfrost+CSF driver I was mentioning in my previous
emails.
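To make that a bit more concrete: in this FW-assisted model, if we go
the Xe way (1:1 entity/scheduler), ->run_job() boils down to something
like the sketch below, where drm_sched is effectively only used for its
dep tracking. All the pan_* names are made up for the sake of the
example; this is not code from Xe or from any existing driver.

/*
 * Hypothetical sketch only: pan_queue, pan_job and the pan_*() helpers
 * below are made-up names standing in for the driver-side objects
 * discussed in this thread.
 */
#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

struct pan_queue;	/* one FW command stream (ring buffer) */

struct pan_job {
	struct drm_sched_job base;	/* 1:1 entity/scheduler, as in Xe */
	struct pan_queue *queue;
	struct dma_fence *done_fence;
};

/* Hypothetical driver helpers, not implemented here. */
void pan_queue_write_ringbuf(struct pan_queue *queue, struct pan_job *job);
bool pan_queue_is_resident(struct pan_queue *queue);
void pan_queue_make_runnable(struct pan_queue *queue);

static struct dma_fence *pan_sched_run_job(struct drm_sched_job *sched_job)
{
	struct pan_job *job = container_of(sched_job, struct pan_job, base);
	struct pan_queue *queue = job->queue;

	/* Deps are already signaled when we get here, so just copy the
	 * job into the stream's ring buffer (step 3 of the second
	 * sequence above). */
	pan_queue_write_ringbuf(queue, job);

	/*
	 * If the stream isn't on a FW slot yet, mark it runnable and
	 * wake the kernel-side timeslice scheduler so it can
	 * re-evaluate. The job timeout would have to stay disabled
	 * until the stream actually lands on a slot.
	 */
	if (!pan_queue_is_resident(queue))
		pan_queue_make_runnable(queue);

	/* Execution happens asynchronously once the FW picks the
	 * stream; the returned fence signals on completion. */
	return dma_fence_get(job->done_fence);
}

Nothing is really "run" here; we just ring the doorbell and let the
FW/timeslice scheduler do the rest, which is, as far as I can tell,
pretty much what Xe does with the GuC.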
> With the first option then the end result could be a drm_sched per
> engine class (hardware view), which I think fits with the GuC model.
> Give all schedulable contexts (entities) to the GuC and then mostly
> forget about them. Timeslicing, re-ordering and so on all happen
> transparently to the kernel from that point until completion.

Yep, that would work. I guess it would mean creating an intermediate
abstraction/interface to schedule entities, and then implementing this
interface for the simple HW-engine + job-scheduling case, so that
existing drm_sched users don't see a difference, while new drivers that
need to interface with FW-assisted schedulers can implement the
higher-level entity-scheduling interface. I don't know what this
interface would look like yet, though.

> Or with the second option you would build on some smaller refactored
> sub-components of drm_sched, by maybe splitting the dependency
> tracking from the scheduling (RR/FIFO entity-picking) code.

What I've done so far is duplicate the dep-tracking logic in the driver.
It's not that much code, but it would be nice not to have to duplicate
it in the first place...

Regards,

Boris
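P.S.: to illustrate what I mean by duplicating the dep-tracking logic,
here's roughly the kind of boilerplate that ends up in the driver (the
names are made up, and the dma_resv/implicit-sync handling is left out);
it's more or less what drm_sched_job_add_dependency() and the entity dep
handling already give you when using drm_sched.

/*
 * Hypothetical illustration only: a minimal driver-side dep list.
 */
#include <linux/dma-fence.h>
#include <linux/xarray.h>

struct pan_job_deps {
	struct xarray fences;
};

static void pan_job_deps_init(struct pan_job_deps *deps)
{
	/* XA_FLAGS_ALLOC is required for xa_alloc(). */
	xa_init_flags(&deps->fences, XA_FLAGS_ALLOC);
}

/* Takes ownership of the fence reference on success. */
static int pan_job_add_dep(struct pan_job_deps *deps, struct dma_fence *fence)
{
	u32 id;

	return xa_alloc(&deps->fences, &id, fence, xa_limit_32b, GFP_KERNEL);
}

/* Return the first unsignaled dep, dropping the signaled ones. */
static struct dma_fence *pan_job_peek_dep(struct pan_job_deps *deps)
{
	struct dma_fence *fence;
	unsigned long index;

	xa_for_each(&deps->fences, index, fence) {
		if (!dma_fence_is_signaled(fence))
			return fence;

		xa_erase(&deps->fences, index);
		dma_fence_put(fence);
	}

	return NULL;
}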