Hi Christian, On Fri, 30 Aug 2024 10:14:18 +0200 Christian König <christian.koenig@xxxxxxx> wrote: > Am 29.08.24 um 19:12 schrieb Boris Brezillon: > > dma_fence objects created by an entity might outlive the > > drm_gpu_scheduler this entity was bound to if those fences are retained > > by other other objects, like a dma_buf resv. This means that > > drm_sched_fence::sched might be invalid when the resv is walked, which > > in turn leads to a UAF when dma_fence_ops::get_timeline_name() is called. > > > > This probably went unnoticed so far, because the drm_gpu_scheduler had > > the lifetime of the drm_device, so, unless you were removing the device, > > there were no reasons for the scheduler to be gone before its fences. > > Nope, that is intentional design. get_timeline_name() is not safe to be > called after the fence signaled because that would causes circular > dependency problems. Do you mean the dma_fence layer should not call get_timeline_name() after it's been signalled (looking at the code/doc, it doesn't seem to be the case), or do you mean the drm_sched implementation of the fence interface is wrong and should assume the fence can live longer than its creator? > > E.g. when you have hardware fences it can happen that fences reference a > driver module (for the function printing the name) and the module in > turn keeps fences around. > > So you easily end up with a module you can never unload. On the other hand, I think preventing the module from being unloaded is the right thing to do, because otherwise the dma_fence_ops might be gone when they get dereferenced in the release path. That's also a problem I noticed when I started working on the initial panthor driver without drm_sched. To solve that I ended up retaining a module ref for each fence created, and releasing this ref in the dma_fence_ops::release() function. drm_sched adds an indirection that allows drivers to not care, but that's still a problem if you end up unloading drm_sched while some of its drm_sched_fence fences are owned by external components. > > > > With the introduction of a new model where each entity has its own > > drm_gpu_scheduler instance, this situation is likely to happen every time > > a GPU context is destroyed and some of its fences remain attached to > > dma_buf objects still owned by other drivers/processes. > > > > In order to make drm_sched_fence_get_timeline_name() safe, we need to > > copy the scheduler name into our own refcounted object that's only > > destroyed when both the scheduler and all its fences are gone. > > > > The fact drm_sched_fence might have a reference to the drm_gpu_scheduler > > even after it's been released is worrisome though, but I'd rather > > discuss that with everyone than come up with a solution that's likely > > to end up being rejected. > > > > Note that the bug was found while repeatedly reading dma_buf's debugfs > > file, which, at some point, calls dma_resv_describe() on a resv that > > contains signalled fences coming from a destroyed GPU context. > > AFAIK, there's nothing invalid there. > > Yeah but reading debugfs is not guaranteed to crash the kernel. > > On the other hand the approach with a kref'ed string looks rather sane > to me. One comment on this below. There's still the problem I mentioned above (unloading drm_sched can make things crash). Are there any plans to fix that? The simple option would be to prevent compiling drm_sched as a module, but that's not an option because it depends on DRM which is a tristate too. Maybe we could have drm_sched_fence.o linked statically, just like dma-fence.c is linked statically to prevent the stub ops from disappearing. Not sure if drm_sched_fence.c depends on symbols defined in sched_{main,entity}.c or other parts of the DRM subsystem though. > > +/** > > + * struct drm_sched_fence_timeline - Wrapped around the timeline name > > + * > > + * This is needed to cope with the fact dma_fence objects created by > > + * an entity might outlive the drm_gpu_scheduler this entity was bound > > + * to, making drm_sched_fence::sched invalid and leading to a UAF when > > + * dma_fence_ops::get_timeline_name() is called. > > + */ > > +struct drm_sched_fence_timeline { > > + /** @kref: Reference count of this timeline object. */ > > + struct kref kref; > > + > > + /** > > + * @name: Name of the timeline. > > + * > > + * This is currently a copy of drm_gpu_scheduler::name. > > + */ > > + const char *name; > > Make that a char name[] and embed the name into the structure. The macro > struct_size() can be used to calculate the size. Sure I can do that. Regards, Boris