On Mon, Sep 02, 2024 at 04:18:33PM +0200, Christian König wrote:
> Am 02.09.24 um 15:23 schrieb Daniel Vetter:
> > On Mon, Sep 02, 2024 at 12:43:45PM +0200, Christian König wrote:
> > > Am 30.08.24 um 23:43 schrieb Matthew Brost:
> > > > On Fri, Aug 30, 2024 at 10:14:18AM +0200, Christian König wrote:
> > > > > Am 29.08.24 um 19:12 schrieb Boris Brezillon:
> > > > > > dma_fence objects created by an entity might outlive the
> > > > > > drm_gpu_scheduler this entity was bound to if those fences are retained
> > > > > > by other objects, like a dma_buf resv. This means that
> > > > > > drm_sched_fence::sched might be invalid when the resv is walked, which
> > > > > > in turn leads to a UAF when dma_fence_ops::get_timeline_name() is called.
> > > > > >
> > > > > > This probably went unnoticed so far, because the drm_gpu_scheduler had
> > > > > > the lifetime of the drm_device, so, unless you were removing the device,
> > > > > > there were no reasons for the scheduler to be gone before its fences.
> > > > > Nope, that is intentional design. get_timeline_name() is not safe to be
> > > > > called after the fence signaled because that would cause circular
> > > > > dependency problems.
> > So I don't think knowingly crashing in debugfs is ok. debugfs can break
> > stuff like secure boot, and if you go about things very wrongly it can
> > upset the kernel (like touching pci mappings from userspace can). But just
> > going boom due to a race essentially means debugfs is unusable. Because
> > there's no way to avoid the boom with dma_fence:
> >
> > - they're guaranteed to signal in finite time (unless driver bugs)
> >
> > - the moment they've signalled, looking too closely at them is undefined
> > behaviour.
> >
> > > > I'm quite sure this happens, ftrace for example can and will call
> > > > get_timeline_name in trace_dma_fence_destroy which is certainly after
> > > > the fence is signaled. There are likely other cases too - this just
> > > > quickly came to mind.
> > > Good point, completely forgotten about ftrace.
> > >
> > > > > E.g. when you have hardware fences it can happen that fences reference a
> > > > > driver module (for the function printing the name) and the module in turn
> > > > > keeps fences around.
> > > > >
> > > > I am almost positive that without this patch this is problematic in Xe or any
> > > > driver in which schedulers are tied to IOCTLs rather than the kernel module.
> > > >
> > > > In Xe 'fence->sched' maps to an xe_exec_queue which can be freed once
> > > > the destroy exec queue IOCTL is called and all jobs are freed (i.e.
> > > > 'fence' signals). The fence could live on after that in dma-resv objects,
> > > > drm syncobjs, etc...
> > > >
> > > > I know this issue has been raised before and basically NACK'd but I have
> > > > a strong opinion this is valid and in fact required.
> > > I've NACK'd automatically signaling pending fences on destruction of the
> > > scheduler (that reminds me that I wanted to add a warning for that) and
> > > copying the name into every scheduler fence.
> > >
> > > As long as we don't do any of that I'm perfectly fine fixing this issue. The
> > > approach of creating a reference counted object for the name looks rather
> > > valid to me.
> > >
> > > Especially since we then pretty much get the module references correct for
> > > free as well.
> > So I think the issue is much, much bigger, and there's more. And the
> > issue is I think a fundamental design issue of dma_fence itself, not
> > individual users.
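
Just to make that part really concrete: the scheduler's fence callback is
more or less just a pointer chase into the scheduler (paraphrasing what's in
drivers/gpu/drm/scheduler/sched_fence.c, modulo details per kernel version):

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
	struct drm_sched_fence *fence = to_drm_sched_fence(f);

	/*
	 * Nothing here (or anywhere else) keeps fence->sched alive: if the
	 * drm_gpu_scheduler is torn down while the signalled fence still
	 * sits in a dma_resv, sync_file or drm_syncobj, this dereference
	 * is a straight use-after-free.
	 */
	return fence->sched->name;
}

The callback itself is trivial, the problem is purely that it ties the fence
to an object with a completely unrelated (and potentially much shorter)
lifetime.
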
>
> IIRC both Alex and me pointed out this issue on the very first dma_fence code and nobody really cared.

I guess way back then we didn't really sort out any of the hotunplug
issues, and there weren't any fw ctx schedulers on our horizons yet, at
least. Thin excuse, I know ...

> > I think at the core it's two constraints:
> >
> > - dma_fence can stick around practically forever in various container
> > objects. We only garbage collect when someone looks, and not even then
> > consistently.
> >
> > - fences are meant to be cheap, so they do not have the big refcount going
> > on like other shared objects like dma_buf
> >
> > Specifically there's also no refcounting on the module itself with the
> > ->owner and try_module_get stuff. So even if we fix all these issues on
> > the data structure lifetime side of things, you might still oops calling
> > into dma_fence->ops->release.
> >
> > Oops.
>
> Yes, exactly that. I'm a bit surprised that you realize that only now :)
>
> We have the issue for at least 10 years or so and it pops up every now and
> then on my desk because people complain that unloading amdgpu crashes.

Yeah I knew about the issue. The new idea that popped into my mind is that
I think we cannot plug this properly unless we do it in dma_fence.c for
everyone, and essentially reshape the lifetime rules for that from yolo to
something actually well-defined. Kinda similar work to how dma_resv locking
rules and fence book-keeping were unified to something that actually works
across drivers ...

> > I think the complete solution is if we change this all so that the core
> > dma-fence.c code guarantees to never ever again call into any driver code
> > after dma_fence_signal has been called, and takes over the final kfree_rcu
> > itself. But that's a gigantic change. But I think it's the only way to
> > really fix this mess:
> >
> > - drivers will clean up any of their own references in a timely fashion,
> > so no more accidentally lingering gpu contexts or vms and the bo they
> > have mapped lying around.
> >
> > - there's no lifetime or other use-after-free issues anywhere for fences
> > anymore
> >
> > Downside is that some of the debugging stuff becomes a bit less useful.
> > But e.g. tracepoints could just dump the timeline once at creation or when
> > signalling, and so you don't need to dump it anymore when freeing. And a
> > signalled fence is generally not a problem anymore, so in a compositor
> > that's also all fine (iirc you can get at some of this stuff through the
> > sync_file interfaces too).
> >
> > The other downside is that it's a huge pile of work, but I don't think we
> > can get to an actually solid design with less headaches and pain ...
> >
> > Thoughts?
>
> The alternative is to use the scheduler fence(s) to decouple hardware fences
> from the containers. That would be rather cheap to implement.
>
> The only downside would be that the scheduler module probably stays loaded
> forever once used. But at least I can live with that.

Yeah I think interim it's an ok stop-gap. But aside from keeping the
scheduler code pinned forever I think there are some more issues:

- I'm not sure we can do it without digging into dma_fence.c locking
internals too much.

- It de facto means the only dma_fences you can use are fence containers
and drm_sched_job_fence, and nothing else. And drivers will get this wrong
and do ad-hoc dma_fences for stuff like tlb flushing, or pte writing, and
whatever else that won't necessarily go through a drm_sched.
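
To make the dma_fence.c idea from above a bit less hand-wavy, the kind of
thing I mean is roughly the sketch below - completely untested, and the
helper plus the snapshot fields are made up for illustration, they don't
exist today. The point is only the rule: at signal time the core grabs
whatever it might still need from the driver, and from then on never
dereferences fence->ops again, with the final free being a plain
kfree_rcu() in dma-fence.c:

/*
 * Hypothetical helper, called from dma_fence_signal() while the fence is
 * still guaranteed to be backed by a live driver module.
 */
static void dma_fence_detach_driver(struct dma_fence *fence)
{
	/*
	 * Snapshot the strings that debugfs, tracepoints and sync_file
	 * might still want later. ->snap_driver/->snap_timeline would be
	 * new core-owned fields; allocation failure handling is omitted,
	 * and GFP_ATOMIC because signalling can happen in irq context.
	 */
	fence->snap_driver = kstrdup(fence->ops->get_driver_name(fence),
				     GFP_ATOMIC);
	fence->snap_timeline = kstrdup(fence->ops->get_timeline_name(fence),
				       GFP_ATOMIC);

	/*
	 * From here on the fence is core-owned: no ->release(), no
	 * ->get_timeline_name(), final free is kfree_rcu() in dma-fence.c.
	 * In reality this would need a "signalled" ops table or similar
	 * rather than a bare NULL, since ->ops is used for more than this.
	 */
	WRITE_ONCE(fence->ops, NULL);
}

Whether the snapshot happens at creation or at signalling, and how this
lines up with the existing RCU rules around dma_fence_put(), is exactly the
part that needs the real design work.
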
So the scheduler-fence stop-gap is not great imo, and hence why I've shifted
towards the view that we should fix this in dma_fence.c for everyone.
-Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch