On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote: > Hi Jason, > > On Mon, 9 Jan 2023 09:45:09 -0600 > Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote: > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@xxxxxxxxx> > > wrote: > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote: > > > > On Fri, 30 Dec 2022 12:55:08 +0100 > > > > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote: > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100 > > > > > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote: > > > > > > > > > > > Hello Matthew, > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800 > > > > > > Matthew Brost <matthew.brost@xxxxxxxxx> wrote: > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first > > > this > > > > > > > seems a bit odd but let us explain the reasoning below. > > > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not > > > > > > > guaranteed to be the same completion even if targeting the same > > > hardware > > > > > > > engine. This is because in XE we have a firmware scheduler, the > > > GuC, > > > > > > > which allowed to reorder, timeslice, and preempt submissions. If a > > > using > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR > > > falls > > > > > > > apart as the TDR expects submission order == completion order. > > > Using a > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this > > > problem. > > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of > > > > > > issues to support Arm's new Mali GPU which is relying on a > > > FW-assisted > > > > > > scheduling scheme (you give the FW N streams to execute, and it does > > > > > > the scheduling between those N command streams, the kernel driver > > > > > > does timeslice scheduling to update the command streams passed to the > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly > > > > > > because the integration with drm_sched was painful, but also because > > > I > > > > > > felt trying to bend drm_sched to make it interact with a > > > > > > timeslice-oriented scheduling model wasn't really future proof. > > > Giving > > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably > > > might > > > > > > help for a few things (didn't think it through yet), but I feel it's > > > > > > coming short on other aspects we have to deal with on Arm GPUs. > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I > > > > > have a better understanding of how you get away with using drm_sched > > > > > while still controlling how scheduling is really done. Here > > > > > drm_gpu_scheduler is just a dummy abstract that let's you use the > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs, > > > dependency tracking and releasing jobs to be scheduled when dependencies > > > are met, and lastly a tracking mechanism of inflights jobs that need to > > > be cleaned up if an error occurs. It doesn't actually do any scheduling > > > aside from the most basic level of not overflowing the submission ring > > > buffer. In this sense, a 1 to 1 relationship between entity and > > > scheduler fits quite well. > > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel > > want here and what you need for Arm thanks to the number of FW queues > > available. I don't remember the exact number of GuC queues but it's at > > least 1k. This puts it in an entirely different class from what you have on > > Mali. Roughly, there's about three categories here: > > > > 1. Hardware where the kernel is placing jobs on actual HW rings. This is > > old Mali, Intel Haswell and earlier, and probably a bunch of others. > > (Intel BDW+ with execlists is a weird case that doesn't fit in this > > categorization.) > > > > 2. Hardware (or firmware) with a very limited number of queues where > > you're going to have to juggle in the kernel in order to run desktop Linux. > > > > 3. Firmware scheduling with a high queue count. In this case, you don't > > want the kernel scheduling anything. Just throw it at the firmware and let > > it go brrrrr. If we ever run out of queues (unlikely), the kernel can > > temporarily pause some low-priority contexts and do some juggling or, > > frankly, just fail userspace queue creation and tell the user to close some > > windows. > > > > The existence of this 2nd class is a bit annoying but it's where we are. I > > think it's worth recognizing that Xe and panfrost are in different places > > here and will require different designs. For Xe, we really are just using > > drm/scheduler as a front-end and the firmware does all the real scheduling. > > > > How do we deal with class 2? That's an interesting question. We may > > eventually want to break that off into a separate discussion and not litter > > the Xe thread but let's keep going here for a bit. I think there are some > > pretty reasonable solutions but they're going to look a bit different. > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine. > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist > > ring and then there was a tiny kernel which operated entirely in IRQ > > handlers which juggled those execlists by smashing HW registers. For > > Panfrost, I think we want something slightly different but can borrow some > > ideas here. In particular, have the schedulers feed kernel-side SW queues > > (they can even be fixed-size if that helps) and then have a kthread which > > juggles those feeds the limited FW queues. In the case where you have few > > enough active contexts to fit them all in FW, I do think it's best to have > > them all active in FW and let it schedule. But with only 31, you need to be > > able to juggle if you run out. > > That's more or less what I do right now, except I don't use the > drm_sched front-end to handle deps or queue jobs (at least not yet). The > kernel-side timeslice-based scheduler juggling with runnable queues > (queues with pending jobs that are not yet resident on a FW slot) > uses a dedicated ordered-workqueue instead of a thread, with scheduler > ticks being handled with a delayed-work (tick happening every X > milliseconds when queues are waiting for a slot). It all seems very > HW/FW-specific though, and I think it's a bit premature to try to > generalize that part, but the dep-tracking logic implemented by > drm_sched looked like something I could easily re-use, hence my > interest in Xe's approach. So another option for these few fw queue slots schedulers would be to treat them as vram and enlist ttm. Well maybe more enlist ttm and less treat them like vram, but ttm can handle idr (or xarray or whatever you want) and then help you with all the pipelining (and the drm_sched then with sorting out dependencies). If you then also preferentially "evict" low-priority queus you pretty much have the perfect thing. Note that GuC with sriov splits up the id space and together with some restrictions due to multi-engine contexts media needs might also need this all. If you're balking at the idea of enlisting ttm just for fw queue management, amdgpu has a shoddy version of id allocation for their vm/tlb index allocation. Might be worth it to instead lift that into some sched helper code. Either way there's two imo rather solid approaches available to sort this out. And once you have that, then there shouldn't be any big difference in driver design between fw with defacto unlimited queue ids, and those with severe restrictions in number of queues. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch