On Mon, Sep 02, 2024 at 04:18:33PM +0200, Christian König wrote:
> Am 02.09.24 um 15:23 schrieb Daniel Vetter:
> > On Mon, Sep 02, 2024 at 12:43:45PM +0200, Christian König wrote:
> > > Am 30.08.24 um 23:43 schrieb Matthew Brost:
> > > > On Fri, Aug 30, 2024 at 10:14:18AM +0200, Christian König wrote:
> > > > > Am 29.08.24 um 19:12 schrieb Boris Brezillon:
> > > > > > dma_fence objects created by an entity might outlive the
> > > > > > drm_gpu_scheduler this entity was bound to if those fences are retained
> > > > > > by other objects, like a dma_buf resv. This means that
> > > > > > drm_sched_fence::sched might be invalid when the resv is walked, which
> > > > > > in turn leads to a UAF when dma_fence_ops::get_timeline_name() is called.
> > > > > >
> > > > > > This probably went unnoticed so far, because the drm_gpu_scheduler had
> > > > > > the lifetime of the drm_device, so, unless you were removing the device,
> > > > > > there were no reasons for the scheduler to be gone before its fences.
> > > > > Nope, that is intentional design. get_timeline_name() is not safe to be
> > > > > called after the fence signaled because that would cause circular
> > > > > dependency problems.
> > So I don't think knowingly crashing in debugfs is ok. debugfs can break
> > stuff like secure boot, and if you go about things very wrongly it can
> > upset the kernel (like touching pci mappings from userspace can). But just
> > going boom due to a race essentially means debugfs is unusable. Because
> > there's no way to avoid the boom with dma_fence:
> >
> > - they're guaranteed to signal in finite time (unless driver bugs)
> >
> > - the moment they've signalled, looking too closely at them is undefined
> > behaviour.
> >
> > > > I'm quite sure this happens, ftrace for example can and will call
> > > > get_timeline_name in trace_dma_fence_destroy which is certainly after
> > > > the fence is signaled. There are likely other cases too - this just
> > > > quickly came to mind.
> > > Good point, completely forgotten about ftrace.
> > >
> > > > > E.g. when you have hardware fences it can happen that fences reference a
> > > > > driver module (for the function printing the name) and the module in turn
> > > > > keeps fences around.
> > > > >
> > > > I am almost positive that without this patch this is problematic in Xe or any
> > > > driver in which schedulers are tied to IOCTLs rather than the kernel module.
> > > >
> > > > In Xe 'fence->sched' maps to an xe_exec_queue which can be freed once
> > > > the destroy exec queue IOCTL is called and all jobs are freed (i.e.
> > > > 'fence' signals). The fence could live on after that in dma-resv objects,
> > > > drm syncobjs, etc...
> > > >
> > > > I know this issue has been raised before and basically NACK'd but I have
> > > > a strong opinion this is valid and in fact required.
> > > I've NACK'd automatically signaling pending fences on destruction of the
> > > scheduler (that reminds me that I wanted to add a warning for that) and
> > > copying the name into every scheduler fence.
> > >
> > > As long as we don't do any of that I'm perfectly fine fixing this issue. The
> > > approach of creating a reference counted object for the name looks rather
> > > valid to me.
> > >
> > > Especially since we then pretty much get the module references correct for
> > > free as well.
> > So I think the issue is much, much bigger, and there's more. And the
> > issue is I think a fundamental design issue of dma_fence itself, not
> > individual users.
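
Just to make that part really concrete: the scheduler's fence callback is
more or less just a pointer chase into the scheduler (paraphrasing what's in
drivers/gpu/drm/scheduler/sched_fence.c, modulo details per kernel version):

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
	struct drm_sched_fence *fence = to_drm_sched_fence(f);

	/*
	 * Nothing here (or anywhere else) keeps fence->sched alive: if the
	 * drm_gpu_scheduler is torn down while the signalled fence still
	 * sits in a dma_resv, sync_file or drm_syncobj, this dereference
	 * is a straight use-after-free.
	 */
	return fence->sched->name;
}

The callback itself is trivial, the problem is purely that it ties the fence
to an object with a completely unrelated (and potentially much shorter)
lifetime.
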
>
> IIRC both Alex and me pointed out this issue on the very first dma_fence code and nobody really cared.

I guess way back then we didn't really sort out any of the hotunplug
issues, and there weren't any fw ctx schedulers on our horizons yet, at
least. Thin excuse, I know ...

> > I think at the core it's two constraints:
> >
> > - dma_fence can stick around practically forever in various container
> > objects. We only garbage collect when someone looks, and not even then
> > consistently.
> >
> > - fences are meant to be cheap, so they do not have the big refcount going
> > on like other shared objects like dma_buf
> >
> > Specifically there's also no refcounting on the module itself with the
> > ->owner and try_module_get stuff. So even if we fix all these issues on
> > the data structure lifetime side of things, you might still oops calling
> > into dma_fence->ops->release.
> >
> > Oops.
>
> Yes, exactly that. I'm a bit surprised that you realize that only now :)
>
> We have the issue for at least 10 years or so and it pops up every now and
> then on my desk because people complain that unloading amdgpu crashes.

Yeah I knew about the issue. The new idea that popped into my mind is that
I think we cannot plug this properly unless we do it in dma_fence.c for
everyone, and essentially reshape the lifetime rules for that from yolo to
something actually well-defined. Kinda similar work to how dma_resv locking
rules and fence book-keeping were unified to something that actually works
across drivers ...

> > I think the complete solution is if we change this all so that the core
> > dma-fence.c code guarantees to never ever again call into any driver code
> > after dma_fence_signal has been called, and takes over the final kfree_rcu
> > itself. But that's a gigantic change. But I think it's the only way to
> > really fix this mess:
> >
> > - drivers will clean up any of their own references in a timely fashion,
> > so no more accidentally lingering gpu contexts or vms and the bo they
> > have mapped lying around.
> >
> > - there's no lifetime or other use-after-free issues anywhere for fences
> > anymore
> >
> > Downside is that some of the debugging stuff becomes a bit less useful.
> > But e.g. tracepoints could just dump the timeline once at creation or when
> > signalling, and so you don't need to dump it anymore when freeing. And a
> > signalled fence is generally not a problem anymore, so in a compositor
> > that's also all fine (iirc you can get at some of this stuff through the
> > sync_file interfaces too).
> >
> > The other downside is that it's a huge pile of work, but I don't think we
> > can get to an actually solid design with less headaches and pain ...
> >
> > Thoughts?
>
> The alternative is to use the scheduler fence(s) to decouple hardware fences
> from the containers. That would be rather cheap to implement.
>
> The only downside would be that the scheduler module probably stays loaded
> forever once used. But at least I can live with that.

Yeah I think interim it's an ok stop-gap. But aside from keeping the
scheduler code pinned forever I think there are some more issues:

- I'm not sure we can do it without digging into dma_fence.c locking
internals too much.

- It de facto means the only dma_fences you can use are fence containers
and drm_sched_job_fence, and nothing else. And drivers will get this wrong
and do ad-hoc dma_fences for stuff like tlb flushing, or pte writing, and
whatever else that won't necessarily go through a drm_sched.
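
To make the dma_fence.c idea from above a bit less hand-wavy, the kind of
thing I mean is roughly the sketch below - completely untested, and the
helper plus the snapshot fields are made up for illustration, they don't
exist today. The point is only the rule: at signal time the core grabs
whatever it might still need from the driver, and from then on never
dereferences fence->ops again, with the final free being a plain
kfree_rcu() in dma-fence.c:

/*
 * Hypothetical helper, called from dma_fence_signal() while the fence is
 * still guaranteed to be backed by a live driver module.
 */
static void dma_fence_detach_driver(struct dma_fence *fence)
{
	/*
	 * Snapshot the strings that debugfs, tracepoints and sync_file
	 * might still want later. ->snap_driver/->snap_timeline would be
	 * new core-owned fields; allocation failure handling is omitted,
	 * and GFP_ATOMIC because signalling can happen in irq context.
	 */
	fence->snap_driver = kstrdup(fence->ops->get_driver_name(fence),
				     GFP_ATOMIC);
	fence->snap_timeline = kstrdup(fence->ops->get_timeline_name(fence),
				       GFP_ATOMIC);

	/*
	 * From here on the fence is core-owned: no ->release(), no
	 * ->get_timeline_name(), final free is kfree_rcu() in dma-fence.c.
	 * In reality this would need a "signalled" ops table or similar
	 * rather than a bare NULL, since ->ops is used for more than this.
	 */
	WRITE_ONCE(fence->ops, NULL);
}

Whether the snapshot happens at creation or at signalling, and how this
lines up with the existing RCU rules around dma_fence_put(), is exactly the
part that needs the real design work.
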
So the scheduler-fence stop-gap is not great imo, and hence why I've shifted
towards the view that we should fix this in dma_fence.c for everyone.
-Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch