On Fri, 30 Aug 2024 11:37:21 +0200 Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote: > > > With the introduction of a new model where each entity has its own > > > drm_gpu_scheduler instance, this situation is likely to happen every time > > > a GPU context is destroyed and some of its fences remain attached to > > > dma_buf objects still owned by other drivers/processes. > > > > > > In order to make drm_sched_fence_get_timeline_name() safe, we need to > > > copy the scheduler name into our own refcounted object that's only > > > destroyed when both the scheduler and all its fences are gone. > > > > > > The fact drm_sched_fence might have a reference to the drm_gpu_scheduler > > > even after it's been released is worrisome though, but I'd rather > > > discuss that with everyone than come up with a solution that's likely > > > to end up being rejected. > > > > > > Note that the bug was found while repeatedly reading dma_buf's debugfs > > > file, which, at some point, calls dma_resv_describe() on a resv that > > > contains signalled fences coming from a destroyed GPU context. > > > AFAIK, there's nothing invalid there. > > > > Yeah but reading debugfs is not guaranteed to crash the kernel. > > > > On the other hand the approach with a kref'ed string looks rather sane > > to me. One comment on this below. > > There's still the problem I mentioned above (unloading drm_sched can > make things crash). Are there any plans to fix that? The simple option > would be to prevent compiling drm_sched as a module, but that's not an > option because it depends on DRM which is a tristate too. Maybe we > could have drm_sched_fence.o linked statically, just like dma-fence.c > is linked statically to prevent the stub ops from disappearing. > Not sure if drm_sched_fence.c depends on symbols defined in > sched_{main,entity}.c or other parts of the DRM subsystem though. For the record, I gave it a try, and linking drm_sched_fence.o statically while leaving the rest of drm_sched as a module seems to work. I just sent an RFC for that [1]. [1]https://lore.kernel.org/dri-devel/20240830104057.1479321-1-boris.brezillon@xxxxxxxxxxxxx/T/#u