Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

On 06.04.23 at 08:37, Daniel Vetter wrote:
On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@xxxxxxx> wrote:
On 05.04.23 at 11:07, Daniel Vetter wrote:
[SNIP]
I would approach it from the completely opposite side. This component here is a
tool to decide which job should run next.

How that is then signaled and run should not be part of the scheduler, but of
another, higher-level component.

This way you also don't have a problem with not using DMA-fences as
dependencies, as well as constraints for running more jobs.
I think we're talking about two things here and mixing them up.

For the dependencies I agree with you, and imo that higher level tool
should probably just be an on-demand submit thread in userspace for the
rare case where the kernel would need to sort out a dependency otherwise
(due to running out of ringspace in the per-ctx ringbuffer).

The other thing is the message passing stuff, and this is what I was
talking about above. This has nothing to do with handling dependencies,
but with talking to the gpu fw. Here the intel design issue is that the fw
only provides a single queue, and it's in-order. Which means it
fundamentally has the stalling issue you describe as a point against a
message passing design. And fundamentally we need to be able to talk to
the fw in the scheduler ->run_job callback.

The proposal here for the message passing part is that since it has the
stalling issue already anyway, and the scheduler needs to be involved
anyway, it makes sense to integrate this (as an optional thing, only for
drivers which have this kind of fw interface) into the scheduler.
Otherwise you just end up with two layers for no reason and more ping-pong
delay because the ->run_job needs to kick off the subordinate driver layer
first. Note that for this case the optional message passing support in the
drm/scheduler actually makes things better, because it allows you to cut
out one layer.

Of course if a driver with better fw interface uses this message passing
support, then that's bad. Hence the big warning in the kerneldoc.
Well what I wanted to say is that if you design the dependency handling
/ scheduler properly you don't need the message passing through it.

For example, if the GPU scheduler component uses a work item to do its
handling instead of a kthread, you could also let the driver specify the
workqueue that this work item is executed on.

When you design it like this, the driver specifies the thread context of
execution for its job. In other words, it can specify a single-threaded
firmware workqueue as well.

When you then have other messages which need to be passed to the
firmware, you can also use the same single-threaded workqueue for this.

Drivers which have a different firmware interface would just use one of
the system work queues instead.

This approach basically decouples the GPU scheduler component from the
message passing functionality.
Hm, I guess we've been talking past each other big time, because
that's really what I thought was under discussion? Essentially the
current RFC, but implemented with some polish.

I think Daniel pretty much nailed it here (thanks), to recap:

1. I want the messages in the same worker so that run_job / free_job /
process_msg execution is mutually exclusive, and also so that during reset
paths, once the worker is stopped, none of the entry points can be entered.

If this is a NAK, then another worker is fine, I guess. A lock between
run_job / free_job + process_msg should solve the exclusion issue, and the
reset paths can stop this new worker too. That being said, I'd
rather leave this as-is, but I will not fight this point.

2. process_msg is just used to communicate with the firmware using the
same queue as submission. Waiting for space in this queue is the only
place this function can block (same as submission). Well, actually we
have the concept of a preempt time slice, but that sleeps for 10 ms by
default. Also, preempt is only used in LR entities, so I don't think it is
relevant in this case either.

3. Agree this is in the dma-fence signaling path (if process_msg is in
the submission worker), so we can't block indefinitely or for an
unreasonable period of time (i.e. we must obey the dma-fence rules).
Just to hammer this in: not just process_msg is in the dma_fence signaling
path; the entire fw queue that everything is funneled through is too,
including whatever the fw is doing to process these.

Yes this is terrible and blew up a few times already :-/

But also, this is probably something that the docs really need to hammer in, to
make sure people don't look at this and think "hey, this seems to be the
recommended way to do this on Linux". We don't want hw people to build
more of these designs; they're an absolute pain to deal with under Linux's
dma_fence signaling and gpu job scheduling rules.

It's just that if you're stuck with such fw, then integrating the flow
into drm/sched instead of having an extra layer of workers seems the
better of two pretty bad solutions.

Yeah, and if you have such fw limitations, make sure that whatever you use to feed into it is understood by lockdep.

In other words, use either locks or a work item/queue, and not some message passing functionality through the scheduler.

As far as I can see the approach with the work item/queue should fit your needs here.

Christian.

-Daniel

4. Agree the documentation for the usage of the messaging interface
needs to be clear.

5. Agree that my code could always use polishing.

Let's close on #1; then can I get a general Ack on this part of the RFC
and apply the polish in the full review process?

Matt

iow I agree with you (I think at least).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch



