On Fri, 2025-01-03 at 13:31 +0100, Christian König wrote:
> On 03.01.25 at 13:02, Tvrtko Ursulin wrote:
> > > > </tldr>
> > > > 
> > > > If I remember correctly Christian mentioned recently (give or take)
> > > > that maybe round-robin could be removed. That got me thinking how
> > > > and what could be improved and simplified. So I played a bit in the
> > > > scheduler code and came up with something which appears to not
> > > > crash at least. Whether or not there are significant advantages
> > > > apart from maybe code consolidation and reduction is the main thing
> > > > to be determined.
> > > 
> > > Hi,
> > > 
> > > so first of all: Happy new year and thx for putting in all that
> > > effort :)
> > > 
> > > I gave the series a first look; since this is an RFC, let's abstain
> > > from doing deeper reviews of the exact code for now.
> > 
> > Ditto and thank you for taking a look!
> 
> Ah, yes. Happy new year guys :)
> 
> > > > One big question is whether round-robin can really be removed. Does
> > > > anyone use it, rely on it, or what are even use cases where it is
> > > > much better than FIFO?
> > > 
> > > So AFAICS Round Robin is not used anymore by anyone. And my
> > > understanding indeed is, too, that there is not really any use-case
> > > where one would like anything except for FIFO.
> > > 
> > > Looking at 977d97f18b5b ("drm/scheduler: Set the FIFO scheduling
> > > policy as the default"), it seems to me that RR just was easy to
> > > implement, and it had the disadvantage that systems under high load
> > > caused the oldest job to be starved to death, which is why FIFO was
> > > introduced.
> > > 
> > > So my guess would be that RR just is a relic.
> > > 
> > > If we agree on that, then we could remove RR in any case, and the
> > > subsequent question would be whether FIFO should be replaced with
> > > deadline (or: if there should be FIFO *and* deadline?), wouldn't it?
> > 
> > I am unsure about RR but I agree with the second part of the question.
> 
> Well, we came up with FIFO because we found that RR performed quite
> badly when you have a huge number of submitting applications.
> 
> E.g. one of our cloud test cases ran 100 instances of a single game and
> the worst response time improved massively by switching from RR to FIFO.
> 
> Different priorities, on the other hand, were originally invented to
> make sure the kernel has precedence over userspace.

Should it, in your opinion, be *guaranteed* that the kernel always takes
precedence, or is "kernel takes precedence almost always" enough? Because
this patchset does seem to do the latter.

> But later we also exposed the priorities to userspace, which results in
> the problem that higher priority queues can starve low priority ones.
> 
> That's the other reason why I said that RR should probably be removed
> and FIFO changed in a way that the priority is basically just a bonus to
> the score used for sorting the FIFO. I haven't taken a deeper look yet,
> but I think that this is more or less what this patch set here does.
> 
> What FIFO is still missing compared to RR is some sort of fairness
> between queues.

Before I forget, let me note it right away: if we go for such a thing we
should get our terminology straight. A FIFO that suddenly takes deadlines
into account is not a FIFO anymore, so we shouldn't call it that.
However, a deadline scheduler with priority levels also doesn't seem to be
a deadline scheduler. I think, no matter what we do, we should aim for
strict congruency between the GPU scheduler and the CPU scheduler with
regard to terms and naming. The CPU scheduler has SCHED_FIFO, which is
pure FIFO with priorities (behaving exactly like the GPU scheduler's FIFO
currently), and SCHED_DEADLINE, which is purely based on deadlines (not
like in this patch set).

So we should find a new, suitable term if we go for that, to avoid any
misunderstandings.

P.

> E.g. a queue which hasn't submitted something in a while might get a
> bonus for its submissions compared to a queue which submits stuff all
> the time (or something like that).
> 
> Regards,
> Christian.
> 
> > There may be nuances with different drivers depending on how much they
> > can queue to the hardware/firmware at once. Modern drivers which use
> > 1:1 sched:entity I don't expect to care about the DRM scheduler
> > scheduling mode. The fewer jobs a driver can queue to the backend, the
> > more it cares. The question is whether FIFO is ever better, keeping in
> > mind that for the same priority this deadline and FIFO are actually
> > identical.
> > 
> > > > See the "drm/sched: Add deadline policy" commit message for a short
> > > > description of what flavour of deadline scheduling it is. But in
> > > > essence it should be a more fair FIFO where higher priorities
> > > > cannot forever starve lower priorities.
> > > 
> > > See my answer on that patch.
> > > 
> > > As you can imagine I'm wondering if that "better FIFO" would be worth
> > > it considering that we are running into a certain risk of regressing
> > > stuff through this rework.
> > 
> > I will reply to that part there then.
> > 
> > > > "drm/sched: Connect with dma-fence deadlines" wires up dma-fence
> > > > deadlines to the scheduler because it is easy and makes logical
> > > > sense with this. And I noticed userspace already uses it, so why
> > > > not wire it up fully.
> > > 
> > > Userspace uses the dma-fence deadlines, you mean? Do you have a
> > > pointer for us?
> > 
> > I've noticed it empirically and the one I could find is this:
> > 
> > https://invent.kde.org/plasma/kwin/-/commit/4ad5670ddfcd7400c8b84c12cbf8bd97a0590f43
> > 
> > > > Otherwise the series is a bit of a progression from consolidating
> > > > RR into the FIFO code paths, and going from there to deadline and
> > > > then to a change in how dependencies are handled. And code
> > > > simplification to a 1:1 run queue to scheduler relationship,
> > > > because deadline does not need per priority run queues.
> > > > 
> > > > There is quite a bit of code to go through here, so I think it
> > > > could be even better if other drivers could give it a spin as is
> > > > and see if some improvements can be detected. Or at least no
> > > > regressions.
> > > 
> > > I hope I can dive deeper into the Nouveau side soon.
> > 
> > Fantastic!
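
To make "priority is basically just a bonus to the score used for sorting"
concrete, here is a minimal, userspace-compilable sketch of the idea. All
names and bonus values are made up for illustration and this is not the
actual drm/sched code from the series: each job is keyed by its submission
time minus a bounded, priority-dependent bonus, so higher priority can only
move a job forward in the queue by a limited amount.

    /*
     * Illustrative only: "priority as a bonus to the sorting score".
     * Hypothetical names and values, not the drm/sched code from this series.
     */
    #include <stdint.h>
    #include <stdio.h>

    enum sketch_prio { PRIO_LOW, PRIO_NORMAL, PRIO_HIGH, PRIO_KERNEL };

    /* The bonus shifts the effective queue position by a bounded amount,
     * so it can never starve lower priorities forever. */
    static const uint64_t prio_bonus_us[] = {
            [PRIO_LOW]    = 0,
            [PRIO_NORMAL] = 1000,
            [PRIO_HIGH]   = 5000,
            [PRIO_KERNEL] = 10000,
    };

    struct sketch_job {
            uint64_t submit_us;     /* submission timestamp in microseconds */
            enum sketch_prio prio;
    };

    /* Sort key for a single run queue: smallest key is picked first. */
    static uint64_t sketch_job_key(const struct sketch_job *job)
    {
            uint64_t bonus = prio_bonus_us[job->prio];

            return job->submit_us > bonus ? job->submit_us - bonus : 0;
    }

    int main(void)
    {
            /* A low priority job that has waited long enough still beats a
             * freshly submitted high priority one. */
            struct sketch_job old_low  = { .submit_us = 1000, .prio = PRIO_LOW  };
            struct sketch_job new_high = { .submit_us = 9000, .prio = PRIO_HIGH };

            printf("old low key:  %llu\n",
                   (unsigned long long)sketch_job_key(&old_low));
            printf("new high key: %llu\n",
                   (unsigned long long)sketch_job_key(&new_high));

            return 0;
    }

With a key like this, jobs of equal priority degenerate to plain
submission-order FIFO, which matches the observation above that for the
same priority the deadline flavour and FIFO behave identically; the bonus
only bounds how far a higher priority job may overtake lower priority ones.
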
> > > > 
> > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > Cc: Philipp Stanner <pstanner@xxxxxxxxxx>
> > > > 
> > > > Tvrtko Ursulin (14):
> > > >   drm/sched: Delete unused update_job_credits
> > > >   drm/sched: Remove idle entity from tree
> > > >   drm/sched: Implement RR via FIFO
> > > >   drm/sched: Consolidate entity run queue management
> > > >   drm/sched: Move run queue related code into a separate file
> > > >   drm/sched: Ignore own fence earlier
> > > >   drm/sched: Resolve same scheduler dependencies earlier
> > > >   drm/sched: Add deadline policy
> > > >   drm/sched: Remove FIFO and RR and simplify to a single run queue
> > > >   drm/sched: Queue all free credits in one worker invocation
> > > >   drm/sched: Connect with dma-fence deadlines
> > > >   drm/sched: Embed run queue singleton into the scheduler
> > > >   dma-fence: Add helper for custom fence context when merging fences
> > > >   drm/sched: Resolve all job dependencies in one go
> > > 
> > > It seems to me that this series is a "port RR and FIFO to deadline"-
> > > series with some additional patches that could be branched out and be
> > > reviewed through a separate series?
> > > 
> > > Patch 1 ("Delete unused...") for example. Other candidates are
> > > Patch 5 ("Move run queue related..."), 13 ("Add helper for...").
> > > 
> > > A few patches might be mergeable even if the main idea with deadline
> > > doesn't get approved, that's why I'm suggesting that.
> > 
> > Yes some of those could be possible and I am happy to extract and
> > rebase in principle. But not yet I think. If and when something gets a
> > positive nod.
> > 
> > Regards,
> > 
> > Tvrtko
> > 
> > > > 
> > > >   drivers/dma-buf/dma-fence-unwrap.c          |   8 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      |   6 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  27 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |   5 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h   |   8 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |   8 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     |   8 +-
> > > >   drivers/gpu/drm/scheduler/Makefile          |   2 +-
> > > >   drivers/gpu/drm/scheduler/sched_entity.c    | 316 ++++++-----
> > > >   drivers/gpu/drm/scheduler/sched_fence.c     |   5 +-
> > > >   drivers/gpu/drm/scheduler/sched_main.c      | 587 +++++---------------
> > > >   drivers/gpu/drm/scheduler/sched_rq.c        | 199 +++++++
> > > >   include/drm/gpu_scheduler.h                 |  74 ++-
> > > >   include/linux/dma-fence-unwrap.h            |  31 +-
> > > >   14 files changed, 606 insertions(+), 678 deletions(-)
> > > >   create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c