On Fri, 30 Aug 2024 21:43:44 +0000 Matthew Brost <matthew.brost@xxxxxxxxx> wrote: > On Fri, Aug 30, 2024 at 10:14:18AM +0200, Christian König wrote: > > Am 29.08.24 um 19:12 schrieb Boris Brezillon: > > > dma_fence objects created by an entity might outlive the > > > drm_gpu_scheduler this entity was bound to if those fences are retained > > > by other other objects, like a dma_buf resv. This means that > > > drm_sched_fence::sched might be invalid when the resv is walked, which > > > in turn leads to a UAF when dma_fence_ops::get_timeline_name() is called. > > > > > > This probably went unnoticed so far, because the drm_gpu_scheduler had > > > the lifetime of the drm_device, so, unless you were removing the device, > > > there were no reasons for the scheduler to be gone before its fences. > > > > Nope, that is intentional design. get_timeline_name() is not safe to be > > called after the fence signaled because that would causes circular > > dependency problems. > > > > I'm quite sure happens, ftrace for example can and will call > get_timeline_name in trace_dma_fence_destroy which is certainly after > the fence is signaled. There are likely other cases too - this just > quickly came to mind. > > > E.g. when you have hardware fences it can happen that fences reference a > > driver module (for the function printing the name) and the module in turn > > keeps fences around. > > > > I am almost positive without this patch this problematic in Xe or any > driver in which schedulers are tied to IOCTLs rather than kernel module. > > In Xe 'fence->sched' maps to an xe_exec_queue which can be freed once > the destroy exec queue IOCTL is called and all jobs are free'd (i.e. > 'fence' signals). The fence could be live on after in dma-resv objects, > drm syncobjs, etc... > > I know this issue has been raised before and basically NACK'd but I have > a strong opinion this is valid and in fact required. IIUC, Christian recognized that it's more problematic now that schedulers lifetime is no longer bound to the device lifetime but instead the GPU context lifetime. So I think we all agree that this needs fixing :-). How about I send a v2 of this patch, as it seems Christian was more or less happy with the approach, baring the allocation scheme. And then we can discuss how we want to fix the module-unload issue.