On Thu, Apr 06, 2023 at 12:14:36PM +0200, Christian König wrote:
> On 06.04.23 at 08:37, Daniel Vetter wrote:
> > On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
> > > On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> > > > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@xxxxxxx> wrote:
> > > > > On 05.04.23 at 11:07, Daniel Vetter wrote:
> > > > > > [SNIP]
> > > > > > > I would approach it from the opposite side. This component here is a
> > > > > > > tool to decide what job should run next.
> > > > > > >
> > > > > > > How that is then signaled and run should not be part of the scheduler, but
> > > > > > > of another, higher-level component.
> > > > > > >
> > > > > > > This way you also don't have a problem with not using DMA-fences as
> > > > > > > dependencies, nor as constraints for running more jobs.
> > > > > > I think we're talking about two things here and mixing them up.
> > > > > >
> > > > > > For the dependencies I agree with you, and imo that higher-level tool
> > > > > > should probably just be an on-demand submit thread in userspace for the
> > > > > > rare case where the kernel would otherwise need to sort out a dependency
> > > > > > (due to running out of ringspace in the per-ctx ringbuffer).
> > > > > >
> > > > > > The other thing is the message passing stuff, and this is what I was
> > > > > > talking about above. This has nothing to do with handling dependencies,
> > > > > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > > > > only provides a single queue, and it's in-order. Which means it
> > > > > > fundamentally has the stalling issue you describe as a point against a
> > > > > > message passing design. And fundamentally we need to be able to talk to
> > > > > > the fw in the scheduler ->run_job callback.
> > > > > >
> > > > > > The proposal here for the message passing part is that since it has the
> > > > > > stalling issue already anyway, and the scheduler needs to be involved
> > > > > > anyway, it makes sense to integrate this (as an optional thing, only for
> > > > > > drivers which have this kind of fw interface) into the scheduler.
> > > > > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > > > > delay, because ->run_job needs to kick off the subordinate driver layer
> > > > > > first. Note that for this case the optional message passing support in
> > > > > > the drm/scheduler actually makes things better, because it allows you to
> > > > > > cut out one layer.
> > > > > >
> > > > > > Of course if a driver with a better fw interface uses this message passing
> > > > > > support, then that's bad. Hence the big warning in the kerneldoc.
> > > > > Well what I wanted to say is that if you design the dependency handling
> > > > > / scheduler properly you don't need the message passing through it.
> > > > >
> > > > > For example, if the GPU scheduler component uses a work item to do its
> > > > > handling instead of a kthread, you could also let the driver specify the
> > > > > workqueue on which this work item is executed.
> > > > >
> > > > > When you design it like this the driver specifies the thread context of
> > > > > execution for its job. In other words it can specify a single-threaded
> > > > > firmware workqueue as well.
> > > > >
> > > > > When you then have other messages which need to be passed to the
> > > > > firmware you can also use the same single-threaded workqueue for this.
> > > > >
> > > > > Drivers which have a different firmware interface would just use one of
> > > > > the system workqueues instead.
> > > > >
> > > > > This approach basically decouples the GPU scheduler component from the
> > > > > message passing functionality.
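
FWIW, to make this concrete, a rough sketch of the ordered-workqueue
version (completely untested, and all the fw_sched_* names below are
made up for illustration, not actual drm/sched API):

#include <linux/errno.h>
#include <linux/workqueue.h>

struct fw_sched {
	struct workqueue_struct *submit_wq;	/* ordered: one item at a time */
	struct work_struct run_job_work;	/* pushes jobs into the fw ring */
	struct work_struct process_msg_work;	/* non-job fw messages */
};

static void run_job_worker(struct work_struct *w)
{
	/* write the next job into the single in-order fw queue */
}

static void process_msg_worker(struct work_struct *w)
{
	/* write a non-job message into the same fw queue */
}

static int fw_sched_init(struct fw_sched *s)
{
	/* ordered workqueue: max_active is fixed at 1, items run in order */
	s->submit_wq = alloc_ordered_workqueue("fw-submit", 0);
	if (!s->submit_wq)
		return -ENOMEM;

	INIT_WORK(&s->run_job_work, run_job_worker);
	INIT_WORK(&s->process_msg_work, process_msg_worker);
	return 0;
}

/*
 * Jobs and messages funnel through the same single thread of
 * execution, so run_job vs. process_msg mutual exclusion comes for
 * free, and a reset path only has to stop a single workqueue.
 */
static void fw_sched_queue_job(struct fw_sched *s)
{
	queue_work(s->submit_wq, &s->run_job_work);
}

static void fw_sched_queue_msg(struct fw_sched *s)
{
	queue_work(s->submit_wq, &s->process_msg_work);
}

A driver with a saner fw interface would just hand in a normal (e.g.
system) workqueue instead, which is exactly the decoupling you describe.
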
> > > > Hm I guess we've been talking past each other big time, because
> > > > that's really what I thought was under discussion? Essentially the
> > > > current rfc, but implemented with some polish.
> > > >
> > > I think Daniel pretty much nailed it here (thanks), to recap:
> > >
> > > 1. I want the messages in the same worker so that run_job / free_job /
> > > process_msg execution is mutually exclusive, and also so that during
> > > reset paths, if the worker is stopped, none of the entry points can be
> > > entered.
> > >
> > > If this is a NAK, then another worker is fine I guess. A lock between
> > > run_job / free_job + process_msg should solve the exclusion issue, and
> > > the reset paths can stop this new worker too. That being said, I'd
> > > rather leave this as is, but will not fight this point.
> > >
> > > 2. process_msg is just used to communicate with the firmware using the
> > > same queue as submission. Waiting for space in this queue is the only
> > > place this function can block (same as submission). Well, actually we
> > > have the concept of a preempt time slice, but that sleeps for 10 ms by
> > > default. Also, preempt is only used in LR entities, so I don't think it
> > > is relevant in this case either.
> > >
> > > 3. Agree this is in the dma-fence signaling path (if process_msg is in
> > > the submission worker) so we can't block indefinitely or for an
> > > unreasonable period of time (i.e. we must obey dma-fence rules).
> > Just to hammer this in: not just process_msg is in the dma_fence signaling
> > path, but the entire fw queue that everything is being funneled through,
> > including whatever the fw is doing to process these.
> >
> > Yes this is terrible and blew up a few times already :-/
> >
> > But also, probably something that the docs really need to hammer in, to
> > make sure people don't look at this and think "hey, this seems to be the
> > recommended way to do this on linux". We don't want hw people to build
> > more of these designs; they're an absolute pain to deal with under Linux'
> > dma_fence signalling and gpu job scheduling rules.
> >
> > It's just that if you're stuck with such fw, then integrating the flow
> > into drm/sched instead of having an extra layer of workers seems the
> > better of two pretty bad solutions.
>
> Yeah, and if you have such fw limitations, make sure that you use something
> which is understood by lockdep to feed into it.
>
> In other words, either locks or a work item/queue, and not some message
> passing functionality through the scheduler.
>
> As far as I can see the approach with the work item/queue should fit your
> needs here.

dma_fence signalling annotations would also make the scheduler thread
suitable to catch issues with lockdep, just to make double-sure it can
catch issues.
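
Roughly like this, i.e. just wrap the entire worker function in the
begin/end annotations (untested sketch; submit_worker is a stand-in
name for wherever the scheduler work item ends up):

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/*
 * Everything in this worker runs in the dma_fence signalling critical
 * path, so annotate the whole thing: lockdep will then complain about
 * any lock taken here that is also held while waiting on a dma_fence.
 */
static void submit_worker(struct work_struct *w)
{
	bool fence_cookie = dma_fence_begin_signalling();

	/*
	 * run_job / free_job / process_msg handling goes here; nothing
	 * in this section may block on reclaim or wait on other fences
	 * beyond what the dma-fence rules allow.
	 */

	dma_fence_end_signalling(fence_cookie);
}
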
-Daniel

>
> Christian.
>
> > -Daniel
> >
> > > 4. Agree the documentation for the usage of the messaging interface
> > > needs to be clear.
> > >
> > > 5. Agree that my code could always use polishing.
> > >
> > > Let's close on #1 then: can I get a general Ack on this part of the RFC
> > > and apply the polish in the full review process?
> > >
> > > Matt
> > >
> > > > iow I agree with you (I think at least).
> > > > -Daniel
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch