On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>
> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> >
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > [snip]
> >
> > > >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > >>> contexts (per context _and_ engine) and drm_sched. I am not sure
> > > >>> avoiding invasive changes to the shared code is in the spirit of
> > > >>> the overall idea and instead the opportunity should be used to
> > > >>> look at ways to refactor/improve drm_sched.
> > >
> > > Maybe? I'm not convinced that what Xe is doing is an abuse at all or
> > > really needs to drive a re-factor. (More on that later.) There's only
> > > one real issue, which is that it fires off potentially a lot of
> > > kthreads. Even that's not that bad given that kthreads are pretty
> > > light and you're not likely to have more kthreads than userspace
> > > threads, which are much heavier. Not ideal, but not the end of the
> > > world either. Definitely something we can/should optimize, but if we
> > > went through with Xe without this patch, it would probably be mostly
> > > ok.
> > >
> > > >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > >>
> > > >> I'm not really prepared to make large changes to DRM scheduler at
> > > >> the moment for Xe as they are not really required, nor does Boris
> > > >> seem to think they will be required for his work either. I am
> > > >> interested to see what Boris comes up with.
> > > >>
> > > >>> Even on the low level, the idea to replace drm_sched threads
> > > >>> with workers has a few problems.
> > > >>>
> > > >>> To start with, the pattern of:
> > > >>>
> > > >>>    while (not_stopped) {
> > > >>>        keep picking jobs
> > > >>>    }
> > > >>>
> > > >>> Feels fundamentally in disagreement with workers (while it
> > > >>> obviously fits perfectly with the current kthread design).
> > > >>
> > > >> The while loop breaks and the worker exits if no jobs are ready.
> > >
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work
> > > item which handles all available jobs?
> >
> > Yes and no. It indeed IMO does not fit to have a work item which is
> > potentially unbounded in runtime. But it is a bit of a moot conceptual
> > mismatch, because it is a worst case / theoretical issue, and I think
> > there are more fundamental concerns.
> >
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps
> > expand.
> >
> > To start with, let me pull out some thoughts from workqueue.rst:
> >
> > """
> > Generally, work items are not expected to hog a CPU and consume many
> > cycles. That means maintaining just enough concurrency to prevent work
> > processing from stalling should be optimal.
> > """
> >
> > For unbound queues:
> >
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> >
> > Given that unbound workers will be spawned on demand to service all
> > queued work items (more interesting when mixing up with the
> > system_unbound_wq), in the proposed design the number of instantiated
> > worker threads does not correspond to the number of user threads (as
> > you have elsewhere stated), but pessimistically to the number of active
> > user contexts. That is the number which drives the maximum number of
> > not-runnable jobs that can become runnable at once, and hence spawn
> > that many work items, and in turn unbound worker threads.
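(Illustrative aside, to make the scaling argument concrete: below is a
minimal sketch, not the actual Xe code and with made-up names, of the 1:1
pattern under discussion, i.e. one submission work item per userspace
engine, queued on an unbound workqueue.)

#include <linux/workqueue.h>

/* Stands in for a userspace-created engine/context (hypothetical). */
struct example_engine {
	struct work_struct submit_work;	/* INIT_WORK()ed at engine creation */
};

/* Placeholder for "pick the next runnable job and push it to the HW". */
static bool example_run_one_job(struct example_engine *e)
{
	return false;	/* pretend no more jobs are runnable */
}

static void example_submit_worker(struct work_struct *w)
{
	struct example_engine *e =
		container_of(w, struct example_engine, submit_work);

	/* Keep picking jobs until none are ready, then the work item exits. */
	while (example_run_one_job(e))
		;
}

/* Called whenever a job on this engine becomes runnable. */
static void example_job_ready(struct example_engine *e)
{
	queue_work(system_unbound_wq, &e->submit_work);
}

If N engines have jobs become runnable at roughly the same time, N work
items get queued and the unbound workqueue is free to back them with up to
N worker threads, which is the scaling described above.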
> > Several problems there.
> >
> > It is fundamentally pointless to have potentially that many more
> > threads than the number of CPU cores - it simply creates a scheduling
> > storm.
>
> To make matters worse, if I follow the code correctly, all these
> per-user-context worker threads / work items end up contending on the
> same lock or circular buffer, both of which are one instance per GPU:
>
> guc_engine_run_job
>  -> submit_engine
>     a) wq_item_append
>         -> wq_wait_for_space
>             -> msleep

a) is dedicated per xe_engine

Also, you missed the step of programming the ring, which is dedicated per
xe_engine.

> b) xe_guc_ct_send
>     -> guc_ct_send
>         -> mutex_lock(&ct->lock);
>         -> later a potential msleep in h2g_has_room

Technically there is 1 instance per GT, not per GPU. Yes, this is shared,
but in practice there will always be space in the CT channel, so
contention on the lock should be rare. (A schematic sketch of both the a)
and b) paths is appended at the end of this mail.)

I haven't read your rather long reply yet, but also FWIW, using a
workqueue was suggested by AMD (the original authors of the DRM scheduler)
when we ran this design by them.

Matt

>
> Regards,
>
> Tvrtko
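Appendix (referenced above): a schematic sketch of the a) and b) paths
from the quoted call chains. The function names are taken from this mail;
the struct layouts and bodies are illustrative guesses only, not the
actual Xe code. The point is that the a) state (work queue ring) is per
xe_engine, while the b) path funnels every engine on a GT through the same
CT channel and its single mutex.

#include <linux/circ_buf.h>
#include <linux/delay.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct example_engine {
	u32 wq_head, wq_tail, wq_size;	/* a) per-engine work queue ring */
};

struct example_guc_ct {
	struct mutex lock;		/* b) one instance per GT */
	u32 h2g_head, h2g_tail, h2g_size;
};

/* a) sleeps only when this engine's own ring is out of space */
static void wq_wait_for_space(struct example_engine *e, u32 len)
{
	while (CIRC_SPACE(e->wq_head, e->wq_tail, e->wq_size) < len)
		msleep(1);
}

static bool h2g_has_room(struct example_guc_ct *ct, u32 len)
{
	return CIRC_SPACE(ct->h2g_head, ct->h2g_tail, ct->h2g_size) >= len;
}

/* b) every engine's submission on this GT takes the same mutex */
static void guc_ct_send(struct example_guc_ct *ct, const u32 *msg, u32 len)
{
	mutex_lock(&ct->lock);
	/* potential sleep while waiting for room, modeled as a retry loop */
	while (!h2g_has_room(ct, len))
		msleep(1);
	/* ... copy msg into the CT buffer and notify the GuC ... */
	mutex_unlock(&ct->lock);
}

As long as h2g_has_room() normally succeeds on the first try, the mutex is
only held briefly, which is the "contention should be rare" argument above.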