On Wed, Jan 11, 2023 at 09:17:01AM +0000, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 19:01, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 10/01/2023 15:55, Matthew Brost wrote:
> > > > On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > > > 
> > > > > > 
> > > > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > > > 
> > > > > > [snip]
> > > > > > > 
> > > > > > > >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > > > > >>> contexts (per context _and_ engine) and drm_sched. I am not
> > > > > > > >>> sure avoiding invasive changes to the shared code is in the
> > > > > > > >>> spirit of the overall idea and instead the opportunity should
> > > > > > > >>> be used to look at ways to refactor/improve drm_sched.
> > > > > > > 
> > > > > > > Maybe? I'm not convinced that what Xe is doing is an abuse at all
> > > > > > > or really needs to drive a re-factor. (More on that later.)
> > > > > > > There's only one real issue which is that it fires off potentially
> > > > > > > a lot of kthreads. Even that's not that bad given that kthreads are
> > > > > > > pretty light and you're not likely to have more kthreads than
> > > > > > > userspace threads which are much heavier. Not ideal, but not the
> > > > > > > end of the world either. Definitely something we can/should
> > > > > > > optimize, but if we went through with Xe without this patch, it
> > > > > > > would probably be mostly ok.
> > > > > > > 
> > > > > > > >> Yes, it is 1:1 between *userspace* engines and drm_sched.
> > > > > > > >>
> > > > > > > >> I'm not really prepared to make large changes to the DRM
> > > > > > > >> scheduler at the moment for Xe as they are not really required,
> > > > > > > >> nor does Boris seem to think they will be required for his work
> > > > > > > >> either. I am interested to see what Boris comes up with.
> > > > > > > >>
> > > > > > > >>> Even on the low level, the idea to replace drm_sched threads
> > > > > > > >>> with workers has a few problems.
> > > > > > > >>>
> > > > > > > >>> To start with, the pattern of:
> > > > > > > >>>
> > > > > > > >>>   while (not_stopped) {
> > > > > > > >>>     keep picking jobs
> > > > > > > >>>   }
> > > > > > > >>>
> > > > > > > >>> Feels fundamentally in disagreement with workers (while it
> > > > > > > >>> obviously fits perfectly with the current kthread design).
> > > > > > > >>
> > > > > > > >> The while loop breaks and the worker exits if no jobs are ready.
> > > > > > > 
> > > > > > > I'm not very familiar with workqueues. What are you saying would
> > > > > > > fit better? One scheduling job per work item rather than one big
> > > > > > > work item which handles all available jobs?
> > > > > > 
> > > > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > > > potentially unbound in runtime. But it is a bit of a moot conceptual
> > > > > > mismatch because it is a worst case / theoretical, and I think due to
> > > > > > more fundamental concerns.
> > > > > > 
> > > > > > If we have to go back to the low level side of things, I've picked
> > > > > > this random spot to consolidate what I have already mentioned and
> > > > > > perhaps expand.
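(Side note for reference: the two work item shapes being debated above would
look roughly like the sketch below. This is illustrative only, not the actual
drm_sched or Xe code; struct example_sched, example_pick_ready_job() and
example_run_job() are made-up names.)

#include <linux/container_of.h>
#include <linux/workqueue.h>

/* Illustrative only: minimal state for the sketch. */
struct example_sched {
	struct workqueue_struct *wq;
	struct work_struct work;
	/* job queue, locks, etc. elided */
};

struct example_job;

/* Hypothetical helpers, not real APIs. */
struct example_job *example_pick_ready_job(struct example_sched *sched);
void example_run_job(struct example_sched *sched, struct example_job *job);

/* 1) "One big work item": drain everything that is ready, then return. */
static void example_drain_work(struct work_struct *w)
{
	struct example_sched *sched = container_of(w, struct example_sched, work);
	struct example_job *job;

	while ((job = example_pick_ready_job(sched)))
		example_run_job(sched, job);
}

/* 2) "One work item per job": run a single job, re-queue for the next one. */
static void example_step_work(struct work_struct *w)
{
	struct example_sched *sched = container_of(w, struct example_sched, work);
	struct example_job *job = example_pick_ready_job(sched);

	if (!job)
		return;

	example_run_job(sched, job);
	queue_work(sched->wq, &sched->work);
}

With shape 2 the runtime of each work item is bounded, at the cost of an extra
queue_work() per job; shape 1 is what the series does today, with the loop
exiting once nothing is ready.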
> > > > > > 
> > > > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > > > 
> > > > > > """
> > > > > > Generally, work items are not expected to hog a CPU and consume many
> > > > > > cycles. That means maintaining just enough concurrency to prevent
> > > > > > work processing from stalling should be optimal.
> > > > > > """
> > > > > > 
> > > > > > For unbound queues:
> > > > > > """
> > > > > > The responsibility of regulating concurrency level is on the users.
> > > > > > """
> > > > > > 
> > > > > > Given that unbound queues will be spawned on demand to service all
> > > > > > queued work items (more interesting when mixing up with the
> > > > > > system_unbound_wq), in the proposed design the number of instantiated
> > > > > > worker threads does not correspond to the number of user threads (as
> > > > > > you have elsewhere stated), but pessimistically to the number of
> > > > > > active user contexts. That is the number which drives the maximum
> > > > > > number of not-runnable jobs that can become runnable at once, and
> > > > > > hence spawn that many work items, and in turn unbound worker threads.
> > > > > > 
> > > > > > Several problems there.
> > > > > > 
> > > > > > It is fundamentally pointless to have potentially that many more
> > > > > > threads than the number of CPU cores - it simply creates a
> > > > > > scheduling storm.
> > > > > 
> > > > > To make matters worse, if I follow the code correctly, all these per
> > > > > user context worker threads / work items end up contending on the same
> > > > > lock or circular buffer, both of which are one instance per GPU:
> > > > > 
> > > > > guc_engine_run_job
> > > > >   -> submit_engine
> > > > >      a) wq_item_append
> > > > >         -> wq_wait_for_space
> > > > >            -> msleep
> > > > 
> > > > a) is dedicated per xe_engine
> > > 
> > > Hah, true, what is it for then? I thought throttling the LRCA ring is
> > > done via:
> > 
> > This is a per guc_id 'work queue' which is used for parallel submission
> > (e.g. multiple LRC tail values need to be written atomically by the GuC).
> > Again, in practice there should always be space.
> 
> Speaking of guc ids, where does blocking when none are available happen in
> the non-parallel case?

We have 64k guc_ids on native, 1k guc_ids with 64k VFs. Either way we think
that is more than enough and we can just reject xe_engine creation if we run
out of guc_ids. If this proves to be false, we can fix it, but the guc_id
stealing in the i915 is rather complicated and hopefully not needed. We will
limit the number of guc_ids allowed per user pid to a reasonable number to
prevent a DoS. Elevated pids (e.g. IGTs) will be able to do whatever they
want.

> > > drm_sched_init(&ge->sched, &drm_sched_ops,
> > >                e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,
> > > 
> > > Is there something more to throttle other than the ring? It is throttling
> > > something using msleeps..
> > > 
> > > > Also you missed the step of programming the ring, which is dedicated
> > > > per xe_engine.
> > > 
> > > I was trying to quickly find places which serialize on something in the
> > > backend; ringbuffer emission did not seem to do that, but maybe I missed
> > > something.
> > 
> > The xe_ring_ops vfunc emit_job is called to write the ring.
> 
> Right, but does it serialize between different contexts? I didn't spot that
> it does, in which case it wasn't relevant to the sub-story.

Right, just saying this is an additional step that is done in parallel
between xe_engines.
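Coming back to the guc_id point above, a rough sketch of what "just reject
xe_engine creation if we run out of guc_ids" could look like (illustrative
only, not the actual Xe code; all example_* names and fields are made up):

#include <linux/gfp.h>
#include <linux/idr.h>
#include <linux/types.h>

/* Illustrative only: minimal state for the sketch. */
struct example_guc {
	struct ida guc_ids;
	u32 num_guc_ids;	/* e.g. 64k native, 1k with VFs enabled */
};

struct example_engine {
	int guc_id;
};

/*
 * Fail engine creation when no guc_id is free, rather than stealing ids as
 * the i915 does. The per-pid quota mentioned above would just be an extra
 * check in front of the allocation.
 */
static int example_guc_id_alloc(struct example_guc *guc,
				struct example_engine *e)
{
	int id;

	id = ida_alloc_max(&guc->guc_ids, guc->num_guc_ids - 1, GFP_KERNEL);
	if (id < 0)
		return -ENOSPC;	/* out of guc_ids: reject xe_engine creation */

	e->guc_id = id;
	return 0;
}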
> > > > > 
> > > > > b) xe_guc_ct_send
> > > > >    -> guc_ct_send
> > > > >       -> mutex_lock(&ct->lock);
> > > > >       -> later a potential msleep in h2g_has_room
> > > > 
> > > > Technically there is 1 instance per GT, not GPU. Yes this is shared,
> > > > but in practice there will always be space in the CT channel so
> > > > contention on the lock should be rare.
> > > 
> > > Yeah, I used the term GPU to be more understandable to an outside
> > > audience.
> > > 
> > > I am somewhat disappointed that the Xe opportunity hasn't been used to
> > > improve upon the CT communication bottlenecks. I mean those backoff
> > > sleeps and lock contention. I wish there would be a single thread in
> > > charge of the CT channel and internal users (other parts of the driver)
> > > would be able to send their requests to it in a more efficient manner,
> > > with less lock contention and centralized backoff.
> > 
> > Well, the CT backend was more or less a complete rewrite. Mutexes
> > actually work rather well to ensure fairness compared to the spin locks
> > used in the i915. This code was pretty heavily reviewed by Daniel and
> > both of us landed on a big mutex for all of the CT code compared to the
> > 3 or 4 spin locks used in the i915.
> 
> Are the "nb" sends gone? But that aside, I wasn't meaning just the locking
> but the high level approach. Never mind.

xe_guc_ct_send is non-blocking, xe_guc_ct_send_block is blocking. I don't
think the latter is used yet.

> > > > I haven't read your rather long reply yet, but also FWIW using a
> > > > workqueue was suggested by AMD (original authors of the DRM scheduler)
> > > > when we ran this design by them.
> > > 
> > > Commit message says nothing about that. ;)
> > 
> > Yea, I missed that, will fix in the next rev. Just dug through my emails
> > and Christian suggested a work queue and Andrey also gave some input on
> > the DRM scheduler design.
> > 
> > Also in the next rev I will likely update the run_wq to be passed in by
> > the user.
> 
> Yes, and IMO that may need to be non-optional.

Yea, will fix.

Matt

> Regards,
> 
> Tvrtko