On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote: > On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote: > > > > > > On 09/01/2023 17:27, Jason Ekstrand wrote: > > > > [snip] > > > > > >>> AFAICT it proposes to have 1:1 between *userspace* created > > > contexts (per > > > >>> context _and_ engine) and drm_sched. I am not sure avoiding > > > invasive changes > > > >>> to the shared code is in the spirit of the overall idea and instead > > > >>> the opportunity should be used to look at ways to refactor/improve > > > drm_sched. > > > > > > > > > Maybe? I'm not convinced that what Xe is doing is an abuse at all or > > > really needs to drive a re-factor. (More on that later.) There's only > > > one real issue which is that it fires off potentially a lot of kthreads. > > > Even that's not that bad given that kthreads are pretty light and you're > > > not likely to have more kthreads than userspace threads which are much > > > heavier. Not ideal, but not the end of the world either. Definitely > > > something we can/should optimize but if we went through with Xe without > > > this patch, it would probably be mostly ok. > > > > > > >> Yes, it is 1:1 *userspace* engines and drm_sched. > > > >> > > > >> I'm not really prepared to make large changes to DRM scheduler > > > at the > > > >> moment for Xe as they are not really required nor does Boris > > > seem to think they > > > >> will be required for his work either. I am interested to see > > > what Boris > > > >> comes up with. > > > >> > > > >>> Even on the low level, the idea to replace drm_sched threads > > > with workers > > > >>> has a few problems. > > > >>> > > > >>> To start with, the pattern of: > > > >>> > > > >>> while (not_stopped) { > > > >>> keep picking jobs > > > >>> } > > > >>> > > > >>> Feels fundamentally in disagreement with workers (while it > > > obviously fits > > > >>> perfectly with the current kthread design). > > > >> > > > >> The while loop breaks and the worker exits if no jobs are ready. > > > > > > > > > I'm not very familiar with workqueues. What are you saying would fit > > > better? One scheduling job per work item rather than one big work item > > > which handles all available jobs? > > > > Yes and no, it indeed IMO does not fit to have a work item which is > > potentially unbound in runtime. But it is a bit of a moot conceptual mismatch > > because it is a worst case / theoretical, and I think due to more fundamental > > concerns. > > > > If we have to go back to the low level side of things, I've picked this > > random spot to consolidate what I have already mentioned and perhaps expand. > > > > To start with, let me pull out some thoughts from workqueue.rst: > > > > """ > > Generally, work items are not expected to hog a CPU and consume many cycles. > > That means maintaining just enough concurrency to prevent work processing > > from stalling should be optimal. > > """ > > > > For unbound queues: > > """ > > The responsibility of regulating concurrency level is on the users. > > """ > > > > Given the unbound queues will be spawned on demand to service all queued > > work items (more interesting when mixing up with the system_unbound_wq), in > > the proposed design the number of instantiated worker threads does not > > correspond to the number of user threads (as you have elsewhere stated), but > > pessimistically to the number of active user contexts. That is the number > > which drives the maximum number of not-runnable jobs that can become > > runnable at once, and hence spawn that many work items, and in turn unbound > > worker threads.
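To put the above in concrete terms, the "one work item per kick, concurrency regulated by the user" shape could look roughly like this - an untested sketch with invented names, not the actual patch:

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct toy_sched {
        struct workqueue_struct *wq;    /* driver-owned, concurrency-capped */
        struct work_struct run_work;    /* one of these per entity/scheduler */
};

/* Runs once per kick: drain whatever is ready, then return. */
static void toy_run_work(struct work_struct *w)
{
        struct toy_sched *s = container_of(w, struct toy_sched, run_work);

        /*
         * Pick and submit all currently runnable jobs for this entity.
         * There is no while (!stopped) loop pinning a worker; when a new
         * job becomes runnable the submit path queues the work again.
         */
        (void)s;
}

static int toy_sched_init(struct toy_sched *s)
{
        /*
         * max_active (4 here, purely illustrative) caps how many of these
         * work items run concurrently (per NUMA node for unbound queues) -
         * the "regulating concurrency is on the users" knob - instead of
         * letting system_unbound_wq spawn a worker per active context in
         * the worst case.
         */
        s->wq = alloc_workqueue("toy-sched", WQ_UNBOUND, 4);
        if (!s->wq)
                return -ENOMEM;
        INIT_WORK(&s->run_work, toy_run_work);
        return 0;
}

/* Called whenever a job becomes runnable. */
static void toy_sched_kick(struct toy_sched *s)
{
        queue_work(s->wq, &s->run_work);
}

Whether such a cap should be per device, per GT or global is exactly the sort of knob being debated below.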
> > > > Several problems there. > > > > It is fundamentally pointless to have potentially that many more threads > > than the number of CPU cores - it simply creates a scheduling storm. > > > > We can use a different work queue if this is an issue - we have a FIXME > which indicates we should allow the user to pass in the work queue. > > > Unbound workers have no CPU / cache locality either and no connection with > > the CPU scheduler to optimize scheduling patterns. This may matter either on > > large systems or on small ones. Whereas the current design allows for the > > scheduler to notice that a userspace CPU thread keeps waking up the same drm > > scheduler kernel thread, and so it can keep them on the same CPU, the > > unbound workers lose that ability and so a 2nd CPU might be getting woken up > > from deep sleep for every submission. > > > > I guess I don't understand kthread vs. workqueue scheduling internals. > Looked into this and we are not using unbound workers; rather we are just using the system_wq, which is indeed bound. Again we can change this so a user can just pass in a worker too. After doing a bit of research, bound workers allow the scheduler to use locality to avoid the exact problem you are describing. TL;DR I'm not buying any of these arguments although it is possible I am missing something. Matt > > Hence, apart from being a bit of an impedance mismatch, the proposal has the > > potential to change performance and power patterns on both large and small > > machines. > > > > We are going to have to test this out I suppose and play around to see > if this design has any real world impacts. As Jason said, yea probably > will need a bit of help here from others. Will CC relevant parties on > next rev. > > > > >>> Secondly, it probably demands separate workers (not optional), > > > otherwise > > > >>> behaviour of shared workqueues has either the potential to > > > explode the number of > > > >>> kernel threads anyway, or to add latency. > > > >>> > > > >> > > > >> Right now the system_unbound_wq is used which does have a limit > > > on the > > > >> number of threads, right? I do have a FIXME to allow a worker to be > > > >> passed in similar to TDR. > > > >> > > > >> WRT to latency, the 1:1 ratio could actually have lower latency > > > as 2 GPU > > > >> schedulers can be pushing jobs into the backend / cleaning up > > > jobs in > > > >> parallel. > > > >> > > > > > > > > Thought of one more point here on why in Xe we absolutely want > > > a 1 to > > > > 1 ratio between entity and scheduler - the way we implement > > > timeslicing > > > > for preempt fences. > > > > > > > > Let me try to explain. > > > > > > > > Preempt fences are implemented via the generic messaging > > > interface [1] > > > > with suspend / resume messages. If a suspend message is received too > > > > soon after calling resume (this is per entity) we simply sleep in the > > > > suspend call thus giving the entity a timeslice. This completely > > > falls > > > > apart with a many to 1 relationship as now an entity waiting for a > > > > timeslice blocks the other entities. Could we work around this? > > > Sure, but > > > > just another bunch of code we'd have to add in Xe. Being able to > > > freely sleep > > > > in the backend without affecting other entities is really, really > > > nice IMO > > > > and I bet Xe isn't the only driver that is going to feel this way.
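For reference, the timeslicing trick described above boils down to something like this (illustrative sketch only - invented names, not the actual Xe suspend/resume handlers):

#include <linux/delay.h>
#include <linux/jiffies.h>

#define TOY_TIMESLICE_MS        10      /* arbitrary example value */

struct toy_entity {
        unsigned long resumed;  /* jiffies when the last resume was handled */
};

static void toy_handle_resume_msg(struct toy_entity *e)
{
        e->resumed = jiffies;
        /* ... ask the FW to put this context back on the hardware ... */
}

static void toy_handle_suspend_msg(struct toy_entity *e)
{
        unsigned long deadline = e->resumed + msecs_to_jiffies(TOY_TIMESLICE_MS);

        /*
         * Suspend arrived too soon after resume: sleep here, in this
         * entity's own backend, until the timeslice has elapsed.  With a
         * 1:1 entity:scheduler mapping only this entity waits; with N:1
         * every other entity queued behind it would be blocked as well.
         */
        if (time_before(jiffies, deadline))
                msleep(jiffies_to_msecs(deadline - jiffies));

        /* ... ask the FW to take this context off the hardware ... */
}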
> > > > > > > > Last thing I'll say: regardless of how anyone feels about Xe using > > > a 1 to > > > > 1 relationship, this patch IMO makes sense, as I hope we can all > > > agree a > > > > workqueue scales better than kthreads. > > > > > > I don't know for sure what will scale better and for what use case, > > > combination of CPU cores vs number of GPU engines to keep busy vs other > > > system activity. But I wager someone is bound to ask for some > > > numbers to > > > make sure the proposal is not negatively affecting any other drivers. > > > > > > > > > Then let them ask. Waving your hands vaguely in the direction of the > > > rest of DRM and saying "Uh, someone (not me) might object" is profoundly > > > unhelpful. Sure, someone might. That's why it's on dri-devel. If you > > > think there's someone in particular who might have a useful opinion on > > > this, throw them in the CC so they don't miss the e-mail thread. > > > > > > Or are you asking for numbers? If so, what numbers are you asking for? > > > > It was a heads up to the Xe team in case people weren't appreciating how the > > proposed change has the potential to influence power and performance across the > > board. And nothing in the follow up discussion made me think it was > > considered so I don't think it was redundant to raise it. > > > > In my experience it is typical that such core changes come with some > > numbers. Which in the case of the drm scheduler is tricky and probably requires > > explicitly asking everyone to test (rather than count on "don't miss the > > email thread"). Real products can fail to ship due to ten mW here or there. > > Like suddenly an extra core prevented from getting into deep sleep. > > > > If that was "profoundly unhelpful" so be it. > > > > > Also, if we're talking about a design that might paint us into an > > > Intel-HW-specific hole, that would be one thing. But we're not. We're > > > talking about switching which kernel threading/task mechanism to use for > > > what's really a very generic problem. The core Xe design works without > > > this patch (just with more kthreads). If we land this patch or > > > something like it and get it wrong and it causes a performance problem > > > for someone down the line, we can revisit it. > > > > For some definition of "it works" - I really wouldn't suggest shipping a > > kthread per user context at any point. > > > > Yea, this is why using a workqueue rather than a kthread was suggested > to me by AMD. I should've put a Suggested-by on the commit message; need > to dig through my emails and figure out who exactly suggested this. > > > > In any case that's a low level question caused by the high level design > > > decision. So I'd think first focus on the high level - which is the 1:1 > > > mapping of entity to scheduler instance proposal. > > > > > > Fundamentally it will be up to the DRM maintainers and the community to > > > bless your approach. And it is important to stress 1:1 is about > > > userspace contexts, so I believe unlike any other current scheduler > > > user. And also important to stress this effectively does not make Xe > > > _really_ use the scheduler that much. > > > > > > > > > I don't think this makes Xe nearly as much of a one-off as you think it > > > does.
I've already told the Asahi team working on Apple M1/2 hardware > > > to do it this way and it seems to be a pretty good mapping for them. I > > > believe this is roughly the plan for nouveau as well. It's not the way > > > it currently works for anyone because most other groups aren't doing FW > > > scheduling yet. In the world of FW scheduling and hardware designed to > > > support userspace direct-to-FW submit, I think the design makes perfect > > > sense (see below) and I expect we'll see more drivers move in this > > > direction as those drivers evolve. (AMD is doing some customish thing > > > for how they use gpu_scheduler on the front-end somehow. I've not dug into > > > those details.) > > > > > > I can only offer my opinion, which is that the two options mentioned in > > > this thread (either improve drm scheduler to cope with what is > > > required, > > > or split up the code so you can use just the parts of drm_sched which > > > you want - which is frontend dependency tracking) shouldn't be so > > > readily dismissed, given how I think the idea was for the new driver to > > > work less in a silo and more in the community (not do kludges to > > > workaround stuff because it is thought to be too hard to improve common > > > code), but fundamentally, "goto previous paragraph" as far as I am > > > concerned. > > > > > > > > > Meta comment: It appears as if you're falling into the standard i915 > > > team trap of having an internal discussion about what the community > > > discussion might look like instead of actually having the community > > > discussion. If you are seriously concerned about interactions with > > > other drivers or about setting common direction, the right way to > > > do that is to break a patch or two out into a separate RFC series and > > > tag a handful of driver maintainers. Trying to predict the questions > > > other people might ask is pointless. Cc them and ask for their input > > > instead. > > > > I don't follow you here. It's not an internal discussion - I am raising my > > concerns on the design publicly. I am supposed to write a patch to show > > something, but am not allowed to comment on an RFC series? > > > > It is "drm/sched: Convert drm scheduler to use a work queue rather than > > kthread" which should have Cc-ed _everyone_ who uses the drm scheduler. > > > > Yea, will do on next rev. > > > > > > > Regards, > > > > > > Tvrtko > > > > > > P.S. And as a related side note, there are more areas where drm_sched > > > could be improved, like for instance priority handling. > > > Take a look at msm_submitqueue_create / msm_gpu_convert_priority / > > > get_sched_entity to see how msm works around the drm_sched hardcoded > > > limit of available priority levels, in order to avoid having to leave a > > > hw capability unused. I suspect msm would be happier if they could have > > > all priority levels equal in terms of whether they apply only at the > > > frontend level or completely throughout the pipeline. > > > > [1] > > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 > > > > > >>> What would be interesting to learn is whether the option of > > > refactoring > > > >>> drm_sched to deal with out of order completion was considered > > > and what were > > > >>> the conclusions. > > > >>> > > > >> > > > >> I coded this up a while back when trying to convert the i915 to > > > the DRM > > > >> scheduler; it isn't all that hard either. The free flow control > > > on the > > > >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is > > > really what > > > >> sold me on this design.
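The flow control Matt mentions is small enough to show inline - values invented, but the idea is just that the scheduler's hw_submission limit is derived from the FW ring geometry, so drm_sched throttles submissions before the ring can overflow:

#include <linux/sizes.h>

#define TOY_RING_BYTES          SZ_16K  /* per-context FW ring size */
#define TOY_MAX_JOB_BYTES       SZ_256  /* worst-case ring space per job */

/*
 * Passed as the hw_submission limit when initialising the per-entity
 * scheduler: drm_sched then never has more jobs in flight than fit in
 * the ring, and no extra driver-side throttling is needed.
 */
#define TOY_HW_SUBMISSION_LIMIT (TOY_RING_BYTES / TOY_MAX_JOB_BYTES)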
> > > > > > You're not the only one to suggest supporting out-of-order completion. > > > However, it's tricky and breaks a lot of internal assumptions of the > > > scheduler. It also reduces functionality a bit because it can no longer > > > automatically rate-limit HW/FW queues which are often fixed-size. (Ok, > > > yes, it probably could but it becomes a substantially harder problem.) > > > > > > It also seems like a worse mapping to me. The goal here is to turn > > > submissions on a userspace-facing engine/queue into submissions to a FW > > > queue, sorting out any dma_fence dependencies. Matt's > > > description of saying this is a 1:1 mapping between sched/entity doesn't > > > tell the whole story. It's a 1:1:1 mapping between xe_engine, > > > gpu_scheduler, and GuC FW engine. Why make it a 1:something:1 mapping? > > > Why is that better? > > > > As I have stated before, what I think would fit well for Xe is one > > drm_scheduler per engine class. In specific terms on our current hardware, > > one drm scheduler instance for render, compute, blitter, video and video > > enhance. Userspace contexts remain scheduler entities. > > > > I disagree. > > > That way you avoid the whole kthread/kworker story and you have it actually > > use the entity picking code in the scheduler, which may be useful when the > > backend is congested. > > > > In practice the backend shouldn't be congested but if it is, a mutex > provides fairness, probably better than using a shared scheduler. Also > what you are suggesting doesn't make sense at all as the congestion is > per-GT, so if anything we should use 1 scheduler per-GT, not per engine > class. > > > Yes you have to solve the out of order problem so in my mind that is > > something to discuss. What the problem actually is (just TDR?), how tricky > > and why etc. > > > > Cleanup of jobs, TDR, replaying jobs, etc... It has a decent amount of > impact. > > > And yes you lose the handy LRCA ring buffer size management so you'd have to > > make those entities not runnable in some other way. > > > > We also lose our preempt fence implementation. Again I don't see how > the design you are suggesting is a win. > > > Regarding the argument you raise below - would any of that make the frontend > > / backend separation worse and why? Do you think it is less natural? If > > neither is true then all that remains is that it appears the extra work to support > > out of order completion of entities has been discounted in favour of an easy > > but IMO inelegant option. > > > > > There are two places where this 1:1:1 mapping is causing problems: > > > > > > 1. It creates lots of kthreads. This is what this patch is trying to > > > solve. IDK if it's solving it the best way but that's the goal. > > > > > > 2. There are a far more limited number of communication queues between > > > the kernel and GuC for more meta things like pausing and resuming > > > queues, getting events back from GuC, etc. Unless we're in a weird > > > pressure scenario, the amount of traffic on this queue should be low so > > > we can probably just have one per physical device. The vast majority of > > > kernel -> GuC communication should be on the individual FW queue rings > > > and maybe smashing in-memory doorbells.
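To illustrate the 1:1:1 shape being described (userspace engine : gpu_scheduler : FW queue) - fields and names invented here, not the actual Xe layout:

#include <linux/types.h>
#include <drm/gpu_scheduler.h>

/* One of these per userspace-created engine/queue. */
struct toy_engine {
        struct drm_gpu_scheduler sched; /* dedicated scheduler instance */
        struct drm_sched_entity entity; /* the sole entity feeding it   */

        /* The matching FW-side queue: ring plus in-memory doorbell. */
        void *ring;
        u32 ring_tail;
        u32 *doorbell;
};

Day-to-day submission traffic then stays on the per-engine ring and doorbell; only the rarer pause/resume style messages have to go over the shared, far more limited channel.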
> > > > I don't follow your terminology here. I suppose you are talking about global > > GuC CT and context ringbuffers. If so then isn't "far more limited" actually > > one? > > > > We have 1 GuC CT per-GT. > > Matt > > > Regards, > > > > Tvrtko > > > > > Doing out-of-order completion sort-of solves 1 but does nothing for > > > 2 and actually makes managing FW queues harder because we no longer have > > > built-in rate limiting. Seems like a net loss to me. > > > > > > >>> Second option is perhaps to split out the drm_sched code into > > > parts which would > > > >>> lend themselves more to "pick and choose" of its functionalities. > > > >>> Specifically, Xe wants frontend dependency tracking, but not > > > any scheduling > > > >>> really (neither least-busy drm_sched selection, nor FIFO/RQ entity > > > picking), so > > > >>> even having all these data structures in memory is a waste. > > > >>> > > > >> > > > >> I don't think that we are wasting memory is a very good argument for > > > >> making intrusive changes to the DRM scheduler. > > > > > > > > > Worse than that, I think the "we could split it up" kind-of misses the > > > point of the way Xe is using drm/scheduler. It's not just about > > > re-using a tiny bit of dependency tracking code. Using the scheduler in > > > this way provides a clean separation between front-end and back-end. > > > The job of the userspace-facing ioctl code is to shove things on the > > > scheduler. The job of the run_job callback is to encode the job into > > > the FW queue format, stick it in the FW queue ring, and maybe smash a > > > doorbell. Everything else happens in terms of managing those queues > > > side-band. The gpu_scheduler code manages the front-end queues and Xe > > > manages the FW queues via the Kernel <-> GuC communication rings. From > > > a high level, this is a really clean design. There are potentially some > > > sticky bits around the dual-use of dma_fence for scheduling and memory > > > management but none of those are solved by breaking the DRM scheduler > > > into chunks or getting rid of the 1:1:1 mapping. > > > > > > If we split it out, we're basically asking the driver to implement a > > > bunch of kthread or workqueue stuff, all the ring rate-limiting, etc. > > > It may not be all that much code but also, why? To save a few bytes of > > > memory per engine? Each engine already has 32K(ish) worth of context > > > state and a similar size ring to communicate with the FW. No one is > > > going to notice an extra CPU data structure. > > > > > > I'm not seeing a solid argument against the 1:1:1 design here other than > > > that it doesn't seem like the way DRM scheduler was intended to be > > > used. I won't argue that. It's not. But it is a fairly natural way to > > > take advantage of the benefits the DRM scheduler does provide while also > > > mapping it to hardware that was designed for userspace direct-to-FW > > > submit. > > > > > > --Jason > > > > > > >>> With the first option then the end result could be drm_sched > > > per engine > > > >>> class (hardware view), which I think fits with the GuC model. > > > Give all > > > >>> schedulable contexts (entities) to the GuC and then mostly > > > forget about > > > >>> them. Timeslicing and re-ordering and all happens transparently > > > to the > > > >>> kernel from that point until completion. > > > >>> > > > >> > > > >> Out-of-order problem still exists here.
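The front-end/back-end split described above means the run_job callback ends up being little more than this (sketch with invented helpers, not the real Xe backend):

#include <linux/dma-fence.h>
#include <linux/kernel.h>
#include <drm/gpu_scheduler.h>

struct toy_engine;      /* per-queue structure as sketched earlier */

struct toy_job {
        struct drm_sched_job base;
        struct toy_engine *engine;
        struct dma_fence *hw_fence;     /* signalled by the FW on completion */
};

/* Placeholder ring helpers, standing in for the real FW queue plumbing. */
static void toy_ring_emit(struct toy_engine *e, struct toy_job *job) { }
static void toy_ring_doorbell(struct toy_engine *e) { }

/*
 * drm_sched has already resolved dma_fence dependencies and rate-limited
 * us against the ring size, so the backend only encodes the job into the
 * FW queue format and kicks the queue.
 */
static struct dma_fence *toy_run_job(struct drm_sched_job *sched_job)
{
        struct toy_job *job = container_of(sched_job, struct toy_job, base);

        toy_ring_emit(job->engine, job);        /* stick it in the FW ring */
        toy_ring_doorbell(job->engine);         /* smash the doorbell */

        return dma_fence_get(job->hw_fence);
}

static const struct drm_sched_backend_ops toy_sched_ops = {
        .run_job = toy_run_job,
        /* .timedout_job and .free_job omitted from the sketch */
};

Everything else (pause, resume, reset) happens side-band via the per-GT CT channel, which is why that traffic can stay low.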
> > > >> > > > >>> Or with the second option you would build on some smaller > > > refactored > > > >>> sub-components of drm_sched, by maybe splitting the dependency > > > tracking from > > > >>> scheduling (RR/FIFO entity picking code). > > > >>> > > > >>> Second option is especially a bit vague and I haven't thought > > > about the > > > >>> required mechanics, but it just appeared too obvious that the > > > proposed design has > > > >>> a bit too much impedance mismatch. > > > >>> > > > >> > > > >> IMO ROI on this is low and again let's see what Boris comes up with. > > > >> > > > >> Matt > > > >> > > > >>> Oh and as a side note, when I went into the drm_sched code base > > > to remind > > > >>> myself how things worked, it is quite easy to find some FIXME > > > comments which > > > >>> suggest people working on it are unsure of the locking design there > > > and such. So > > > >>> perhaps that all needs cleanup too, I mean it would benefit from the > > > >>> refactoring/improving work as brainstormed above anyway. > > > >>> > > > >>> Regards, > > > >>> > > > >>> Tvrtko > > >
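For completeness, since "frontend dependency tracking" keeps coming up: that part of drm_sched is essentially the drm_sched_job_add_dependency() / drm_sched_job_add_implicit_dependencies() machinery. A submit path that leans on drm_sched mostly for that could look roughly like this (sketch; ioctl plumbing and most error handling elided, names invented):

#include <linux/dma-fence.h>
#include <drm/drm_gem.h>
#include <drm/gpu_scheduler.h>

static int toy_submit(struct drm_sched_job *job, struct drm_sched_entity *entity,
                      struct dma_fence *in_fence, struct drm_gem_object *bo,
                      bool writes_bo)
{
        int err;

        err = drm_sched_job_init(job, entity, NULL);
        if (err)
                return err;

        /* Explicit in-fence dependency; the scheduler consumes the reference. */
        err = drm_sched_job_add_dependency(job, dma_fence_get(in_fence));
        if (err)
                goto err_cleanup;

        /* Implicit dependencies taken from the BO's reservation object. */
        err = drm_sched_job_add_implicit_dependencies(job, bo, writes_bo);
        if (err)
                goto err_cleanup;

        drm_sched_job_arm(job);
        drm_sched_entity_push_job(job);
        return 0;

err_cleanup:
        drm_sched_job_cleanup(job);
        return err;
}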