On Tue, Mar 20, 2018 at 11:54:18AM +0100, Christian König wrote:
> On 20.03.2018 08:44, Daniel Vetter wrote:
> > On Mon, Mar 19, 2018 at 5:23 PM, Christian König
> > <ckoenig.leichtzumerken at gmail.com> wrote:
> > > On 19.03.2018 16:53, Chris Wilson wrote:
> > > > Quoting Christian König (2018-03-16 14:22:32)
> > > > [snip, probably lost too much context]
> > > > > This allows for full-grown pipelining, e.g. the exporter can say
> > > > > I need to move the buffer for some operation. Then let the move
> > > > > operation wait for all existing fences in the reservation object
> > > > > and install the fence of the move operation as exclusive fence.
> > > >
> > > > Ok, the situation I have in mind is the non-pipelined case: revoking
> > > > dma-buf for mmu_invalidate_range or shrink_slab. I would need a
> > > > completion event that the cpu can wait on for all the invalidate
> > > > callbacks. (Essentially an atomic_t counter plus struct completion;
> > > > a lighter version of dma_fence, I wonder where I've seen that
> > > > before ;)
> > >
> > > Actually that is harmless.
> > >
> > > When you need to unmap a DMA-buf because of mmu_invalidate_range or
> > > shrink_slab you need to wait for its reservation object anyway.
> >
> > reservation_object only prevents adding new fences, you still have to
> > wait for all the current ones to signal. Also, we have dma-access
> > without fences in i915. "I hold the reservation_object" does not imply
> > you can just go and nuke the backing storage.
>
> I was not talking about taking the lock, but rather using
> reservation_object_wait_timeout_rcu().
>
> To be more precise, you actually can't take the reservation object lock
> in an mmu_invalidate_range callback, and you can only trylock it in a
> shrink_slab callback.

Ah ok, and yes agreed. Kinda. See below.

> > > This needs to be done to make sure that the backing memory is now
> > > idle; it doesn't matter whether the jobs were submitted by DMA-buf
> > > importers or your own driver.
> > >
> > > The sg tables pointing to the now released memory might live a bit
> > > longer, but that is unproblematic and actually intended.
> >
> > I think that's very problematic. One reason for an IOMMU is that you
> > have device access isolation, and a broken device can't access memory
> > it shouldn't be able to access. From that security-in-depth point of
> > view it's not cool that there are some sg tables hanging around still
> > that a broken GPU could use. And let's not pretend hw is perfect,
> > especially GPUs :-)
>
> I completely agree on that, but there is unfortunately no other way.
>
> See, you simply can't take a reservation object lock in an mmu or slab
> callback, you can only trylock them.
>
> For example, it would require changing all allocations done while
> holding any reservation lock to GFP_NOIO.

Yeah, mmu and slab callbacks can only trylock, and they need to skip to
the next object when the trylock fails. But once you have the lock you
imo should be able to clean up the entire mess still. We definitely do
that for the i915 shrinkers, and I don't see how going to the importers
through the ->invalidate_mapping callback changes anything about that.

For the in-driver reservation path (CS), having a slow-path that grabs a
temporary reference, drops the vram lock and then locks the reservation
normally (using the acquire context already used for the entire CS) is a
bit tricky, but totally feasible (rough sketch of both below). Ttm
doesn't do that though.
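Completely untested sketch of what I mean, with made-up my_* structs and
helpers; only the reservation_object_* calls and the shrinker hook are
the real kernel API:

#include <linux/reservation.h>
#include <linux/shrinker.h>

/* Shrinker (and mmu notifier) side: trylock only, skip the object when
 * that fails, but once we do hold the lock, wait for the fences and
 * tear everything down. */
static unsigned long my_shrink_scan(struct shrinker *shrinker,
				    struct shrink_control *sc)
{
	struct my_device *dev = container_of(shrinker, struct my_device,
					     shrinker);
	struct my_bo *bo, *tmp;
	unsigned long freed = 0;

	list_for_each_entry_safe(bo, tmp, &dev->shrink_list, shrink_link) {
		if (!reservation_object_trylock(bo->resv))
			continue;	/* never block here, just skip */

		/* all fences, exclusive and shared, must have signaled
		 * before the backing storage can be nuked */
		reservation_object_wait_timeout_rcu(bo->resv, true, false,
						    MAX_SCHEDULE_TIMEOUT);
		freed += my_bo_unmap_and_free_pages(bo);
		reservation_object_unlock(bo->resv);

		if (freed >= sc->nr_to_scan)
			break;
	}

	return freed;
}

/* CS side: the trylock under the vram lock failed, so take a temporary
 * reference, drop the vram lock and lock the reservation object the
 * normal sleeping way. */
static int my_cs_lock_bo_slow(struct my_cs *cs, struct my_bo *bo)
{
	int ret;

	kref_get(&bo->refcount);		/* temporary reference */
	mutex_unlock(&cs->dev->vram_lock);

	/* reuse the ww_acquire_ctx of the entire CS so the normal
	 * -EDEADLK backoff dance keeps working */
	ret = reservation_object_lock(bo->resv, &cs->ticket);

	mutex_lock(&cs->dev->vram_lock);
	kref_put(&bo->refcount, my_bo_release);
	return ret;
}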
So there are completely feasible ways to make sure the sg list is all
properly released, all DMA gone and the IOMMU mappings torn down.
Anything else is just a bit of shoddy device driver programming imo.

> > > When we would try to destroy the sg tables in an mmu_invalidate_range
> > > or shrink_slab callback we would run into a lockdep horror.
> >
> > So I'm no expert on this, but I think this is exactly what we're doing
> > in i915. Kinda no other way to actually free the memory without
> > throwing all the nice isolation aspects of an IOMMU into the wind.
> > Can you please paste the lockdep splats you've seen with amdgpu when
> > trying to do that?
>
> Taking a quick look at i915, I can definitely say that what you guys do
> here is actually quite buggy.

Note that there are 2 paths for i915 userptr. One is the mmu notifier,
the other one is the root-only hack we have for dubious reasons (or that
I really don't see the point in myself).

> For coherent usage you need to install some lock to prevent concurrent
> get_user_pages(), command submission and
> invalidate_range_start/invalidate_range_end from the MMU notifier.
>
> Otherwise you can't guarantee that you are actually accessing the right
> page in the case of a fork() or mprotect().

Yeah, doing that with a full lock will create endless amounts of issues,
but I don't see why we need that. Userspace racing stuff with itself
gets to keep all the pieces. This is like racing DIRECT_IO against
mprotect and fork.

Leaking the IOMMU mappings otoh means rogue userspace could do a bunch
of stray writes (I don't see any code in amdgpu_mn.c to unmap at least
the gpu side PTEs to make stuff inaccessible) and wreck the core
kernel's book-keeping.

In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
after all the mappings are really gone (both GPU PTEs and sg mapping),
guaranteeing that any stray writes from either the GPU or IOMMU will
result in faults (except bugs in the IOMMU, but can't have it all,
"IOMMU actually works" is an assumption behind device isolation).
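For reference, the ordering I mean, again as a completely untested
sketch with made-up my_* helpers and fields; dma_unmap_sg,
sg_free_table, set_page_dirty, mark_page_accessed and put_page are the
real calls:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/mm.h>
#include <linux/swap.h>

static void my_userptr_revoke_mappings(struct my_bo *bo)
{
	unsigned int i;

	/* 1. kill the GPU PTEs so the GPU itself can't write anymore */
	my_gpu_clear_ptes(bo);

	/* 2. tear down the dma/sg mapping, i.e. the IOMMU entries */
	dma_unmap_sg(bo->dma_dev, bo->sgt->sgl, bo->sgt->nents,
		     DMA_BIDIRECTIONAL);
	sg_free_table(bo->sgt);

	/* 3. only now hand the pages back to the core mm; any stray
	 * write after this point faults instead of corrupting pages the
	 * kernel already considers clean */
	for (i = 0; i < bo->num_pages; i++) {
		if (bo->gpu_wrote)
			set_page_dirty(bo->pages[i]);
		mark_page_accessed(bo->pages[i]);
		put_page(bo->pages[i]);
	}
}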
> Felix and I hammered for quite some time on amdgpu until all of this
> was handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.

Maybe we should have more shared code for this, it seems to be a source
of endless amounts of fun ...

> I can try to gather the lockdep splat from my mail history, but it
> essentially took us multiple years to get rid of all of them.

I'm very much interested in specifically the splat that makes it
impossible for you folks to remove the sg mappings. That one sounds bad.
It would essentially make mmu_notifiers useless for their primary use
case, which is handling virtual machines, where you really have to make
sure the IOMMU mapping is gone before you claim it's gone, because
there's no 2nd level of device checks (like GPU PTEs, or a command
checker) catching stray writes.

Cheers, Daniel

> > > > Even so, it basically means passing a fence object down to the
> > > > async callbacks for them to signal when they are complete. Just
> > > > to handle the non-pipelined version. :|
> > > > -Chris
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch