On Tue, Feb 13, 2024 at 06:39:20PM +0100, Danilo Krummrich wrote:
> On 2/9/24 19:52, Daniel Vetter wrote:
> > On Fri, Feb 09, 2024 at 06:41:32PM +0100, Danilo Krummrich wrote:
> > > On 2/6/24 15:03, Daniel Vetter wrote:
> > > > On Mon, Feb 05, 2024 at 11:00:04PM +0100, Danilo Krummrich wrote:
> > > > > On 2/5/24 22:08, Dave Airlie wrote:
> > > > > > On Tue, 6 Feb 2024 at 02:22, Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On 1/29/24 02:50, Dave Airlie wrote:
> > > > > > > > From: Dave Airlie <airlied@xxxxxxxxxx>
> > > > > > > >
> > > > > > > > This should break the deadlock between the fctx lock and the irq lock.
> > > > > > > >
> > > > > > > > This offloads the processing of the work from the irq into a workqueue.
> > > > > > > >
> > > > > > > > Signed-off-by: Dave Airlie <airlied@xxxxxxxxxx>
> > > > > > >
> > > > > > > Nouveau's scheduler uses a dedicated wq, hence from this perspective it's
> > > > > > > safe deferring fence signalling to the kernel global wq. However, I wonder
> > > > > > > if we could create deadlocks by building dependency chains into other
> > > > > > > drivers / kernel code that, by chance, makes use of the kernel global wq as
> > > > > > > well.
> > > > > > >
> > > > > > > Admittedly, even if we could, it's gonna be extremely unlikely given that
> > > > > > > WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a dedicated wq.
> > > > > > >
> > > > > > > Also, do we need to CC stable?
> > > > > >
> > > > > > I pushed this to Linus at the end of last week, since the hangs in 6.7
> > > > > > take out the complete system and I wanted it in stable.
> > > > > >
> > > > > > It might be safer to use a dedicated wq. Is the concern that someone is
> > > > > > waiting on a fence in a workqueue somewhere else so we will never
> > > > > > signal it?
> > > > >
> > > > > Yes, if some other work is waiting for this fence (or something else in the same
> > > > > dependency chain) to signal, it can prevent executing the work signaling this fence,
> > > > > in case both are scheduled on the same wq. As mentioned, with the kernel global wq
> > > > > this would be rather unlikely to lead to an actual stall with WQ_MAX_ACTIVE == 512,
> > > > > but formally the race condition exists. I guess a malicious attacker could try to
> > > > > intentionally push jobs directly or indirectly depending on this fence to a driver
> > > > > which queues them up on a scheduler using the kernel global wq.
> > > >
> > > > I think if you add dma_fence_signalling annotations (aside, there's some
> > > > patch from iirc Thomas Hellstrom to improve them and cut down on some
> > > > false positives, but I lost track) then I think you won't get any splats
> > > > because the wq subsystem assumes that WQ_MAX_ACTIVE is close enough to
> > > > infinity to not matter.
> > >
> > > As mentioned, for the kernel global wq it's 512. (Intentionally) feeding the kernel
> > > with enough jobs to provoke a deadlock doesn't seem impossible to me.
> > >
> > > I think it'd be safer to just establish not to use the kernel global wq for executing
> > > work in the fence signalling critical path.
> > >
> > > We could also run into similar problems with a dedicated wq, e.g. when drivers share
> > > a wq between drm_gpu_scheduler instances (see [1]); however, I'm not sure we can catch
> > > that with lockdep.
> >
> > I think if you want to fix it perfectly you'd need to set the max number
> > of work items of the wq to the number of engines (or for dynamic/fw scheduled
> > engines to the number of contexts) you have. Or whatever limit to the number
> > of parallel timelines there is.
> >
> > I guess this would need a new wq function to update? drm/sched code could
> > set that for drivers, so drivers cannot get this wrong.
>
> Not sure I can follow. The scheduler instance might be per context and bind
> queue. In this case it gets the shared wq passed, but doesn't know how many
> other scheduler instances share the same one.

Yeah that's why maybe more of that logic should be in the drm/sched code
instead of drivers just cleverly using what's there ...
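For illustration only, roughly what that accounting could look like if drm/sched
owned it. The struct and helper below are made up, not existing drm/sched API;
workqueue_set_max_active() does already exist, so no new wq function would be
strictly needed:

#include <linux/atomic.h>
#include <linux/workqueue.h>

/*
 * Made-up sketch: every scheduler attaching to a shared wq bumps the wq's
 * max_active, so the number of concurrently executing work items can never
 * drop below the number of timelines feeding it.
 */
struct drm_sched_shared_wq {
        struct workqueue_struct *wq;
        atomic_t num_scheds;
};

static void drm_sched_shared_wq_attach(struct drm_sched_shared_wq *swq)
{
        /* allow at least one in-flight work item per scheduler/timeline */
        workqueue_set_max_active(swq->wq, atomic_inc_return(&swq->num_scheds));
}
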
> Additionally, there might be drivers not using the DRM scheduler for bind
> queues at all (I think Xe does not).

Uh ... maybe we should do this the same across all drivers? But I also
thought that Xe was flat-out synchronous and only had an out-fence since you
need a userspace thread in mesa for vk semantics anyway. If xe hand-rolled a
scheduler I'm not going to be very amused.

> > If we don't do something like that then I'm not sure there's really much
> > benefit - instead of carefully timing 512 dma_fence on the global wq you
> > just need to create a pile of contexts (at least on intel's guc that's
> > absolutely no issue) and then carefully time them.
>
> Well, that's true. I'd still argue that there is a slight difference. From a
> driver's isolated perspective using the global kernel wq might be entirely fine,
> as in this patch. However, in combination with another driver doing the same
> thing, things can blow up. For the case you illustrated it's at least possible
> to spot it from a driver's perspective.
>
> > I still feel like we have bigger fish to fry ... But might be worth the
> > effort to at least make the parallel timeline limit a lot more explicit?
>
> I agree, and it'd be great if we can find a solution where such bugs can be
> detected systematically (e.g. through lockdep), but maybe we can start to at
> least document that we should never use the kernel global wq and where we need
> to be careful in sharing driver wqs.

Yeah I guess the above two are other reasons why maybe we need a bit more
structure in scheduler APIs instead of just allowing drivers to hand in shared
wq pointers. Something like a struct drm_sched_domain, which contains the wq +
a list of drm_sched for it.

Would also make stuff like reliably stopping the right number of schedulers in
tdr much more robust.
-Sima

>
> - Danilo
>
> > Cheers, Sima
> >
> > > [1] https://elixir.bootlin.com/linux/v6.8-rc3/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L313
> > >
> > > > I'm not sure we should care differently, but I guess it'd be good to
> > > > annotate it all in case the wq subsystem's idea of how much such deadlocks
> > > > are real changes.
> > > >
> > > > Also Teo is on a mission to get rid of all the global wq flushes, so there
> > > > should also be no source of deadlocks from that kind of cross-driver
> > > > dependency. Or at least shouldn't be in the future, I'm not sure it all
> > > > landed.
> > > > -Sima

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
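For illustration, a minimal sketch of what the drm_sched_domain mentioned above
could look like. Nothing below exists in drm/sched today, and the domain_link
member referenced on drm_gpu_scheduler is made up as well; drm_sched_stop() is
the real existing helper:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/*
 * Made-up sketch: a domain owns the shared submit wq and tracks every
 * scheduler feeding it, so TDR can stop exactly the schedulers sharing
 * that wq and the wq limits can be sized to the number of timelines.
 */
struct drm_sched_domain {
        struct workqueue_struct *submit_wq;
        struct mutex lock;
        struct list_head scheds;        /* drm_gpu_scheduler via domain_link */
};

static void drm_sched_domain_stop(struct drm_sched_domain *dom,
                                  struct drm_sched_job *bad)
{
        struct drm_gpu_scheduler *sched;

        mutex_lock(&dom->lock);
        /* domain_link would be a new (hypothetical) member of drm_gpu_scheduler */
        list_for_each_entry(sched, &dom->scheds, domain_link)
                drm_sched_stop(sched, bad);
        mutex_unlock(&dom->lock);
}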