On Wed, 4 Sep 2024 12:03:24 +0200
Simona Vetter <simona.vetter@xxxxxxxx> wrote:

> On Wed, Sep 04, 2024 at 11:46:54AM +0200, Simona Vetter wrote:
> > On Wed, Sep 04, 2024 at 09:40:36AM +0200, Christian König wrote:
> > > On 03.09.24 10:13, Simona Vetter wrote:
> > > > [SNIP]
> > > > > > So I think the issue is much, much bigger, and there's more. And the
> > > > > > issue is, I think, a fundamental design issue of dma_fence itself, not
> > > > > > of individual users.
> > > > > IIRC both Alex and I pointed out this issue on the very first dma_fence
> > > > > code and nobody really cared.
> > > > I guess way back then we didn't really sort out any of the hotunplug
> > > > issues, and there weren't any fw ctx schedulers on our horizons yet.
> > > > Thin excuse, I know ...
> > >
> > > Well it's just, when you have a bee sting and a broken leg, which do you
> > > attend to first? :)
> >
> > Yeah ...
> >
> > > > > > I think at the core it's two constraints:
> > > > > >
> > > > > > - dma_fence can stick around practically forever in various container
> > > > > >   objects. We only garbage collect when someone looks, and not even then
> > > > > >   consistently.
> > > > > >
> > > > > > - fences are meant to be cheap, so they do not have the big refcounting
> > > > > >   that other shared objects like dma_buf have.
> > > > > >
> > > > > > Specifically, there's also no refcounting on the module itself with the
> > > > > > ->owner and try_module_get stuff. So even if we fix all these issues on
> > > > > > the data structure lifetime side of things, you might still oops calling
> > > > > > into dma_fence->ops->release.
> > > > > >
> > > > > > Oops.
> > > > > Yes, exactly that. I'm a bit surprised you're only realizing that now :)
> > > > >
> > > > > We've had this issue for at least 10 years or so and it pops up every now
> > > > > and then on my desk because people complain that unloading amdgpu crashes.
> > > > Yeah, I knew about the issue. The new idea that popped into my mind is that
> > > > I think we cannot plug this properly unless we do it in dma_fence.c for
> > > > everyone, and essentially reshape the lifetime rules for that from yolo
> > > > to something actually well-defined.
> > > >
> > > > Kinda similar work to how dma_resv locking rules and fence book-keeping
> > > > were unified to something that actually works across drivers ...
> > >
> > > Well, sounds like I've just got more items on my TODO list.
> > >
> > > I have patches waiting to be sent out that go in this direction anyway; I'll
> > > try to get them out by the end of the week and then we can discuss what's
> > > still missing.
> >
> > Quick addition, another motivator from the panthor userspace submit
> > discussion: if the preempt ctx fence concept spreads, that's another
> > non-drm_sched fence that drivers will need and are pretty much guaranteed
> > to get wrong.
> >
> > Also, maybe Boris volunteers to help out with some of the work here? Or
> > perhaps some of the nova folks, it seems to be even more of a pain for rust
> > drivers ...
>
> I forgot to add: I think it'd be really good to record the rough consensus
> on the problem and the long term solution we're aiming for in a kerneldoc
> or TODO patch. I think recording those design goals helped us a _lot_ in
> making the dma_resv_usage/lock and dma_buf api cleanups and cross-driver
> consistent semantics happen. Maybe as a WARNING/TODO block in the
> dma_fence_ops kerneldoc?
>
> Boris, can you volunteer perhaps?

Sure, I won't be able to do that this week though.
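
To make the module-lifetime hazard discussed above concrete, here is a
minimal sketch with a made-up "foo" driver. dma_fence_init(),
dma_fence_context_alloc() and the dma_fence_ops callbacks are the real
API; all foo_* names are hypothetical and only illustrate the pattern
every driver ends up writing:

    #include <linux/dma-fence.h>
    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    /* Hypothetical driver fence wrapping the core struct dma_fence. */
    struct foo_fence {
            struct dma_fence base;
            spinlock_t lock;
    };

    static const char *foo_get_driver_name(struct dma_fence *f)
    {
            return "foo";
    }

    static const char *foo_get_timeline_name(struct dma_fence *f)
    {
            return "foo-timeline";
    }

    static void foo_fence_release(struct dma_fence *f)
    {
            /*
             * Called from the final dma_fence_put(). The fence may still be
             * referenced from a dma_resv or sync_file long after the foo
             * module has been unloaded: dma_fence_ops has no ->owner field
             * and the core never does try_module_get(), so this function,
             * the callbacks above and the const ops table itself may already
             * be freed module text/data by the time we get here -> oops.
             */
            kfree(container_of(f, struct foo_fence, base));
    }

    static const struct dma_fence_ops foo_fence_ops = {
            .get_driver_name = foo_get_driver_name,
            .get_timeline_name = foo_get_timeline_name,
            .release = foo_fence_release,
    };

    static struct dma_fence *foo_fence_create(void)
    {
            struct foo_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

            if (!f)
                    return NULL;

            spin_lock_init(&f->lock);
            dma_fence_init(&f->base, &foo_fence_ops, &f->lock,
                           dma_fence_context_alloc(1), 1);
            return &f->base;
    }

Nothing here pins the foo module for as long as the fence is still
referenced somewhere, which is exactly the ->owner / try_module_get gap
the thread is about.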