On Mon, Feb 05, 2024 at 11:00:04PM +0100, Danilo Krummrich wrote:
> On 2/5/24 22:08, Dave Airlie wrote:
> > On Tue, 6 Feb 2024 at 02:22, Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
> > >
> > > On 1/29/24 02:50, Dave Airlie wrote:
> > > > From: Dave Airlie <airlied@xxxxxxxxxx>
> > > >
> > > > This should break the deadlock between the fctx lock and the irq lock.
> > > >
> > > > This offloads the processing of the work from the irq into a workqueue.
> > > >
> > > > Signed-off-by: Dave Airlie <airlied@xxxxxxxxxx>
> > >
> > > Nouveau's scheduler uses a dedicated wq, hence from this perspective it's
> > > safe deferring fence signalling to the kernel global wq. However, I wonder
> > > if we could create deadlocks by building dependency chains into other
> > > drivers / kernel code that, by chance, makes use of the kernel global wq as
> > > well.
> > >
> > > Admittedly, even if we did, it's gonna be extremely unlikely given that
> > > WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a dedicated wq.
> > >
> > > Also, do we need to CC stable?
> >
> > I pushed this to Linus at the end of last week, since the hangs in 6.7
> > take out the complete system and I wanted it in stable.
> >
> > It might be safer to use a dedicated wq, is the concern someone is
> > waiting on a fence in a workqueue somewhere else so we will never
> > signal it?
>
> Yes, if some other work is waiting for this fence (or something else in the
> same dependency chain) to signal, it can prevent executing the work that
> signals this fence, in case both are scheduled on the same wq. As mentioned,
> with the kernel global wq this would be rather unlikely to lead to an actual
> stall with WQ_MAX_ACTIVE == 512, but formally the race condition exists. I
> guess a malicious attacker could try to intentionally push jobs directly or
> indirectly depending on this fence to a driver which queues them up on a
> scheduler using the kernel global wq.

I think if you add dma_fence_signalling annotations (aside, there's some
patch from iirc Thomas Hellstrom to improve them and cut down on some false
positives, but I lost track) then you won't get any splats, because the wq
subsystem assumes that WQ_MAX_ACTIVE is close enough to infinity to not
matter. I'm not sure we should care differently, but I guess it'd be good to
annotate it all in case the wq subsystem's idea of how real such deadlocks
are changes.

Also Tejun is on a mission to get rid of all the global wq flushes, so there
should also be no source of deadlocks from that kind of cross-driver
dependency. Or at least there shouldn't be in the future, I'm not sure it
all landed.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
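
For illustration, a minimal sketch of the dedicated-wq variant being
discussed: allocate a driver-private workqueue and have the irq handler only
queue the signalling work on it, instead of either signalling under the irq
lock or relying on the kernel global wq. The names (nouveau_fence_wq,
nouveau_fctx, nouveau_fence_work_handler, nouveau_fence_irq_kick) are made up
here for the example and are not taken from the actual patch.

	#include <linux/container_of.h>
	#include <linux/workqueue.h>

	/* driver-private wq, so fence signalling never shares a pool with
	 * unrelated work that might (transitively) wait on these fences */
	static struct workqueue_struct *nouveau_fence_wq;

	struct nouveau_fctx {
		struct work_struct uevent_work;
		/* ... fence context state ... */
	};

	static void nouveau_fence_work_handler(struct work_struct *w)
	{
		struct nouveau_fctx *fctx =
			container_of(w, struct nouveau_fctx, uevent_work);

		/* walk the pending fences and signal the completed ones,
		 * outside the irq lock */
		(void)fctx;
	}

	static void nouveau_fctx_init(struct nouveau_fctx *fctx)
	{
		INIT_WORK(&fctx->uevent_work, nouveau_fence_work_handler);
	}

	static int nouveau_fence_wq_init(void)
	{
		/* WQ_HIGHPRI keeps signalling latency low */
		nouveau_fence_wq = alloc_workqueue("nouveau_fence",
						   WQ_HIGHPRI, 0);
		if (!nouveau_fence_wq)
			return -ENOMEM;
		return 0;
	}

	/* called from the irq handler instead of signalling directly */
	static void nouveau_fence_irq_kick(struct nouveau_fctx *fctx)
	{
		queue_work(nouveau_fence_wq, &fctx->uevent_work);
	}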
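
And a minimal sketch of the dma_fence signalling annotations Sima refers to:
dma_fence_begin_signalling()/dma_fence_end_signalling() mark the critical
section so lockdep can complain if the signalling path ever waits on
something that may itself depend on this fence. The work-handler name and
its body are again hypothetical, assuming a handler along the lines above.

	#include <linux/dma-fence.h>
	#include <linux/workqueue.h>

	static void fence_signalling_work_fn(struct work_struct *w)
	{
		/* everything between begin/end must not block on anything
		 * that could depend on the fences signalled here */
		bool cookie = dma_fence_begin_signalling();

		/* ... dma_fence_signal() the completed fences ... */

		dma_fence_end_signalling(cookie);
	}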