On Tue, Feb 13, 2024 at 06:39:20PM +0100, Danilo Krummrich wrote:
> On 2/9/24 19:52, Daniel Vetter wrote:
> > On Fri, Feb 09, 2024 at 06:41:32PM +0100, Danilo Krummrich wrote:
> > > On 2/6/24 15:03, Daniel Vetter wrote:
> > > > On Mon, Feb 05, 2024 at 11:00:04PM +0100, Danilo Krummrich wrote:
> > > > > On 2/5/24 22:08, Dave Airlie wrote:
> > > > > > On Tue, 6 Feb 2024 at 02:22, Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On 1/29/24 02:50, Dave Airlie wrote:
> > > > > > > > From: Dave Airlie <airlied@xxxxxxxxxx>
> > > > > > > >
> > > > > > > > This should break the deadlock between the fctx lock and the irq lock.
> > > > > > > >
> > > > > > > > This offloads the processing of the work from the irq into a workqueue.
> > > > > > > >
> > > > > > > > Signed-off-by: Dave Airlie <airlied@xxxxxxxxxx>
> > > > > > >
> > > > > > > Nouveau's scheduler uses a dedicated wq, hence from this perspective it's
> > > > > > > safe deferring fence signalling to the kernel global wq. However, I wonder
> > > > > > > if we could create deadlocks by building dependency chains into other
> > > > > > > drivers / kernel code that, by chance, makes use of the kernel global wq as
> > > > > > > well.
> > > > > > >
> > > > > > > Admittedly, even if we could, it's gonna be extremely unlikely given that
> > > > > > > WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a dedicated wq.
> > > > > > >
> > > > > > > Also, do we need to CC stable?
> > > > > >
> > > > > > I pushed this to Linus at the end of last week, since the hangs in 6.7
> > > > > > take out the complete system and I wanted it in stable.
> > > > > >
> > > > > > It might be safer to use a dedicated wq. Is the concern that someone is
> > > > > > waiting on a fence in a workqueue somewhere else so we will never
> > > > > > signal it?
> > > > >
> > > > > Yes, if some other work is waiting for this fence (or something else in the same
> > > > > dependency chain) to signal, it can prevent executing the work signaling this fence,
> > > > > in case both are scheduled on the same wq. As mentioned, with the kernel global wq
> > > > > this would be rather unlikely to lead to an actual stall with WQ_MAX_ACTIVE == 512,
> > > > > but formally the race condition exists. I guess a malicious attacker could try to
> > > > > intentionally push jobs directly or indirectly depending on this fence to a driver
> > > > > which queues them up on a scheduler using the kernel global wq.
> > > >
> > > > I think if you add dma_fence_signalling annotations (aside, there's some
> > > > patch from iirc Thomas Hellstrom to improve them and cut down on some
> > > > false positives, but I lost track) then I think you won't get any splats
> > > > because the wq subsystem assumes that WQ_MAX_ACTIVE is close enough to
> > > > infinity to not matter.
> > >
> > > As mentioned, for the kernel global wq it's 512. (Intentionally) feeding the kernel
> > > with enough jobs to provoke a deadlock doesn't seem impossible to me.
> > >
> > > I think it'd be safer to just establish not to use the kernel global wq for executing
> > > work in the fence signalling critical path.
> > >
> > > We could also run into similar problems with a dedicated wq, e.g. when drivers share
> > > a wq between drm_gpu_scheduler instances (see [1]); however, I'm not sure we can catch
> > > that with lockdep.
> >
> > I think if you want to fix it perfectly you'd need to set the max number
> > of work items of the wq to the number of engines (or for dynamic/fw scheduled
> > engines to the number of contexts) you have. Or whatever limit to the number
> > of parallel timelines there is.
> >
> > I guess this would need a new wq function to update? drm/sched code could
> > set that for drivers, so drivers cannot get this wrong.
>
> Not sure I can follow. The scheduler instance might be per context and bind
> queue. In this case it gets the shared wq passed, but doesn't know how many
> other scheduler instances share the same one.

Yeah that's why maybe more of that logic should be in the drm/sched code
instead of drivers just cleverly using what's there ...
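For illustration only, roughly what that accounting could look like if drm/sched
owned it. The struct and helper below are made up, not existing drm/sched API;
workqueue_set_max_active() does already exist, so no new wq function would be
strictly needed:

#include <linux/atomic.h>
#include <linux/workqueue.h>

/*
 * Made-up sketch: every scheduler attaching to a shared wq bumps the wq's
 * max_active, so the number of concurrently executing work items can never
 * drop below the number of timelines feeding it.
 */
struct drm_sched_shared_wq {
        struct workqueue_struct *wq;
        atomic_t num_scheds;
};

static void drm_sched_shared_wq_attach(struct drm_sched_shared_wq *swq)
{
        /* allow at least one in-flight work item per scheduler/timeline */
        workqueue_set_max_active(swq->wq, atomic_inc_return(&swq->num_scheds));
}
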
> Additionally, there might be drivers not using the DRM scheduler for bind
> queues at all (I think Xe does not).

Uh ... maybe we should do this the same across all drivers? But I also
thought that Xe was flat-out synchronous and only had an out-fence since you
need a userspace thread in mesa for vk semantics anyway. If xe hand-rolled a
scheduler I'm not going to be very amused.

> > If we don't do something like that then I'm not sure there's really much
> > benefit - instead of carefully timing 512 dma_fence on the global wq you
> > just need to create a pile of contexts (at least on intel's guc that's
> > absolutely no issue) and then carefully time them.
>
> Well, that's true. I'd still argue that there is a slight difference. From a
> driver's isolated perspective using the global kernel wq might be entirely fine,
> as in this patch. However, in combination with another driver doing the same
> thing, things can blow up. For the case you illustrated it's at least possible
> to spot it from a driver's perspective.
>
> > I still feel like we have bigger fish to fry ... But might be worth the
> > effort to at least make the parallel timeline limit a lot more explicit?
>
> I agree, and it'd be great if we can find a solution where such bugs can be
> detected systematically (e.g. through lockdep), but maybe we can start to at
> least document that we should never use the kernel global wq and where we need
> to be careful in sharing driver wqs.

Yeah I guess the above two are other reasons why maybe we need a bit more
structure in scheduler APIs instead of just allowing drivers to hand in shared
wq pointers. Something like a struct drm_sched_domain, which contains the wq +
a list of drm_sched for it.

Would also make stuff like reliably stopping the right number of schedulers in
tdr much more robust.
-Sima

>
> - Danilo
>
> > Cheers, Sima
> >
> > > [1] https://elixir.bootlin.com/linux/v6.8-rc3/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L313
> > >
> > > > I'm not sure we should care differently, but I guess it'd be good to
> > > > annotate it all in case the wq subsystem's idea of how much such deadlocks
> > > > are real changes.
> > > >
> > > > Also Teo is on a mission to get rid of all the global wq flushes, so there
> > > > should also be no source of deadlocks from that kind of cross-driver
> > > > dependency. Or at least shouldn't be in the future, I'm not sure it all
> > > > landed.
> > > > -Sima

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
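For illustration, a minimal sketch of what the drm_sched_domain mentioned above
could look like. Nothing below exists in drm/sched today, and the domain_link
member referenced on drm_gpu_scheduler is made up as well; drm_sched_stop() is
the real existing helper:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/*
 * Made-up sketch: a domain owns the shared submit wq and tracks every
 * scheduler feeding it, so TDR can stop exactly the schedulers sharing
 * that wq and the wq limits can be sized to the number of timelines.
 */
struct drm_sched_domain {
        struct workqueue_struct *submit_wq;
        struct mutex lock;
        struct list_head scheds;        /* drm_gpu_scheduler via domain_link */
};

static void drm_sched_domain_stop(struct drm_sched_domain *dom,
                                  struct drm_sched_job *bad)
{
        struct drm_gpu_scheduler *sched;

        mutex_lock(&dom->lock);
        /* domain_link would be a new (hypothetical) member of drm_gpu_scheduler */
        list_for_each_entry(sched, &dom->scheds, domain_link)
                drm_sched_stop(sched, bad);
        mutex_unlock(&dom->lock);
}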