On 03/01/2025 12:31, Christian König wrote:
Am 03.01.25 um 13:02 schrieb Tvrtko Ursulin:
</tldr>
If I remember correctly Christian mentioned recently (give or take)
that maybe
round-robin could be removed. That got me thinking how and what could
be
improved and simplified. So I played a bit in the scheduler code and
came up
with something which appears to not crash at least. Whether or not
there are
significant advantages apart from maybe code consolidation and
reduction is the
main thing to be determined.
Hi,
so first of all: Happy new year and thx for putting in all that effort
:)
I gave the series a first look; Since this is an RFC, let's abstain
from doing deeper reviews of the exact code for now.
Ditto and thank you for taking a look!
Ah, yes. Happy new year guys :)
One big question is whether round-robin can really be removed. Does
anyone use
it, rely on it, or what are even use cases where it is much better
than FIFO.
So AFAICS Round Robin is not used anymore by anyone. And my
understanding indeed is, too, that there is not really any use-case
where one would like anything except for FIFO.
Looking at 977d97f18b5b ("drm/scheduler: Set the FIFO scheduling policy
as the default"), it seems to me that RR just was easy to implement and
it had the disadvantage of systems under high load cause the oldest job
to be starved to death, which is why FIFO was introduced.
So my guess would be that RR just is a relict.
If we agree on that, then we could remove RR in any case, and the
subsequent question would be whether FIFO should be replaced with
deadline (or: if there should be FIFO *and* deadline?), wouldn't it?
I am unsure about RR but I agree what is the second part of the question.
Well we came up with FIFO because we found that RR performed quite badly
when you have a huge number of submitting applications.
E.g. one of our cloud test cases ran 100 instances of a single game and
the worst response time improved massively by switching from RR to FIFO.
Different priorities on the other hand were originally invented to make
sure the kernel has precedence over userspace. But later we also exposed
the priorities to userspace which results in the problem that higher
priority queues can starve low priority ones.
That's the other reason why I said that RR should probably be removed
and FIFO changed in a way that the priority is basically just a bonus to
the score used for sorting the FIFO. I haven't taken a deeper look yet,
but I think that this is more or less what this patch set here does.
What FIFO is still missing compared to RR is some sort of fairness
between queues. E.g. a queues which hasn't submitted something in a
while might get a bonus for their submissions compared to a queue which
submits stuff all the time (or something like that).
I can try to add something along these lines. Like dynamically scale the
relative deadlines based on entity queue depth so infrequent short
bursts compete better with persistent hogs. I'll wait for some more
feedback and then see where to go.
Regards,
Tvrtko
There may be nuances with different drivers depending on how much they
can queue to the hardware/firmware at once. Modern drivers which use
1:1 sched:entity I don't expect care about DRM scheduler scheduling
mode. The fewer jobs driver can queue to the backend the more it
cares. Question is FIFO ever better. Keeping in mind that for same
priority this deadline and FIFO are actually identical.
See "drm/sched: Add deadline policy" commit message for a short
description on
what flavour of deadline scheduling it is. But in essence it should a
more fair
FIFO where higher priority can not forever starve lower priorities.
See my answer on that patch.
As you can imagine I'm wondering if that "better FIFO" would be worth
it considering that we are running into a certain risk of regressing
stuff through this rework.
I will reply to that part there then.
"drm/sched: Connect with dma-fence deadlines" wires up dma-fence
deadlines to
the scheduler because it is easy and makes logical sense with this.
And I
noticed userspace already uses it so why not wire it up fully.
Userspace uses the dma-fence deadlines you mean? Do you have a pointer
for us?
I've noticed it empirically and the one I could fine is this:
https://invent.kde.org/plasma/kwin/-/commit/4ad5670ddfcd7400c8b84c12cbf8bd97a0590f43
Otherwise the series is a bit of progression from consolidating RR
into FIFO
code paths and going from there to deadline and then to a change in
how
dependencies are handled. And code simplification to 1:1 run queue to
scheduler
relationship, because deadline does not need per priority run queues.
There is quite a bit of code to go throught here so I think it could
be even
better if other drivers could give it a spin as is and see if some
improvements
can be detected. Or at least no regressions.
I hope I can dive deeper into the Nouveau side soon.
Fantastic!
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
Cc: Philipp Stanner <pstanner@xxxxxxxxxx>
Tvrtko Ursulin (14):
drm/sched: Delete unused update_job_credits
drm/sched: Remove idle entity from tree
drm/sched: Implement RR via FIFO
drm/sched: Consolidate entity run queue management
drm/sched: Move run queue related code into a separate file
drm/sched: Ignore own fence earlier
drm/sched: Resolve same scheduler dependencies earlier
drm/sched: Add deadline policy
drm/sched: Remove FIFO and RR and simplify to a single run queue
drm/sched: Queue all free credits in one worker invocation
drm/sched: Connect with dma-fence deadlines
drm/sched: Embed run queue singleton into the scheduler
dma-fence: Add helper for custom fence context when merging fences
drm/sched: Resolve all job dependencies in one go
It seems to me that this series is a "port RR and FIFO to deadline"-
series with some additional patches that could be branched out and be
reviewed through a separate series?
Patch 1 ("Delete unused...") for example. Other candidates are Patch 5
("Move run queue related..."), 13 ("Add helper for...").
A few patches might be mergeable even if the main idea with deadline
doesn't get approved, that's why I'm suggesting that.
Yes some of those could be possible and I am happy to extract and
rebase in principle. But not yet I think. If and when something gets a
positive nod.
Regards,
Tvrtko
drivers/dma-buf/dma-fence-unwrap.c | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 5 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 8 +-
drivers/gpu/drm/scheduler/Makefile | 2 +-
drivers/gpu/drm/scheduler/sched_entity.c | 316 ++++++-----
drivers/gpu/drm/scheduler/sched_fence.c | 5 +-
drivers/gpu/drm/scheduler/sched_main.c | 587 +++++-------------
--
drivers/gpu/drm/scheduler/sched_rq.c | 199 +++++++
include/drm/gpu_scheduler.h | 74 ++-
include/linux/dma-fence-unwrap.h | 31 +-
14 files changed, 606 insertions(+), 678 deletions(-)
create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c