On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:

Granted, let me try to improve this. Here is a real world example of one of the issues we ran into and why CPU mappings of importers are redirected to the exporter.

We have a good bunch of different exporters who track the CPU mappings of their backing store using address_space objects in one way or another and then use unmap_mapping_range() to invalidate those CPU mappings. But when importers get the PFNs of the backing store they can look behind the curtain and directly insert those PFNs into the CPU page tables.

We had literally tons of cases like this where driver developers caused access-after-free issues because the importer created CPU mappings on its own without the exporter knowing about it. This is just one example of what we ran into. In addition to that, basically the whole synchronization between drivers was overhauled as well because we found that we can't trust importers to always do the right thing.

But this, fundamentally, is importers creating attachments and then *ignoring the lifetime rules of DMABUF*. If you created an attachment, got a move notification and *ignored the move* because you put the PFN in your own VMA, then you are not following the attachment lifetime rules!
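To make the failure mode concrete, here is a minimal exporter-side sketch (the struct and function names are hypothetical; unmap_mapping_range() is the real kernel API referred to above). The exporter can tear down every CPU mapping it knows about in one call, but a PFN an importer inserted into its own VMA is invisible to it and survives the teardown:

#include <linux/fs.h>	/* struct address_space */
#include <linux/mm.h>	/* unmap_mapping_range() */

/* Hypothetical buffer object; the names are illustrative only. */
struct my_bo {
	struct address_space *mapping;	/* every CPU mmap() goes through this */
	loff_t offset;			/* offset of this BO inside the mapping */
	loff_t size;
};

/*
 * Exporter-side eviction: all CPU mappings created through bo->mapping
 * are torn down in a single call before the backing store is moved or
 * freed.  A PFN that an importer inserted into its own VMA is not
 * tracked here and survives as a stale mapping -- exactly the
 * access-after-free described above.
 */
static void my_bo_evict(struct my_bo *bo)
{
	unmap_mapping_range(bo->mapping, bo->offset, bo->size, 1);
	/* ...now it is safe to actually move or free the backing pages. */
}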
Move notify is solely for informing the importer that they need to refresh their DMA mappings and potentially block for ongoing DMA to finish.
These semantics don't work well for CPU mappings because you need to hold the reservation lock to make sure that the information stays valid, and you can't hold a lock while returning from a page fault.
In other words, page faults are opportunistic and happen concurrently with invalidation, while the move_notify approach is serialized through a common lock.
I think that's what Sima tried to point out as well.
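For reference, a minimal sketch of the importer side of that contract, using the real dynamic-attachment API (struct dma_buf_attach_ops, dma_buf_unmap_attachment()); the my_* names are made up. The exporter calls move_notify with the reservation lock already held, and the importer only waits for its in-flight DMA and drops its DMA mapping:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>	/* MAX_SCHEDULE_TIMEOUT */

/* Hypothetical importer state; the my_* names are made up. */
struct my_import {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;		/* current DMA mapping, if any */
};

/*
 * Called by the exporter with the dma_resv lock already held: the
 * importer waits for its in-flight DMA and drops its DMA mapping; it
 * must not assume anything about the backing store afterwards.
 */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	struct my_import *imp = attach->importer_priv;

	dma_resv_assert_held(attach->dmabuf->resv);

	/* Block for ongoing DMA to finish (error handling omitted). */
	dma_resv_wait_timeout(attach->dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
			      false, MAX_SCHEDULE_TIMEOUT);

	if (imp->sgt) {
		dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
		imp->sgt = NULL;	/* re-created lazily on next use */
	}
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,
	.move_notify	 = my_move_notify,
};

The attachment would be created with dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp); the pin/map paths are left out to keep the sketch short.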
To implement this safely the driver would need to use unmap_mapping_range() on the driver VMA inside the move callback, and hook into the VMA fault callback to re-attach the dmabuf.
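Roughly, the pattern described in the preceding paragraph would look like this; a sketch only, with all my_* names hypothetical, while unmap_mapping_range(), the vm_operations_struct fault hook and vmf_insert_pfn() are the real kernel pieces involved:

#include <linux/dma-buf.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* Hypothetical importer object; all my_* names are made up. */
struct my_imported_bo {
	struct dma_buf_attachment *attach;
	struct address_space *mapping;	/* the importer's own CPU mappings */
	pgoff_t num_pages;
};

/* Hypothetical helper: re-validate the attachment and look up the
 * current backing PFN for this offset. */
static int my_import_get_pfn(struct my_imported_bo *bo, pgoff_t pgoff,
			     unsigned long *pfn)
{
	return -ENODEV;	/* placeholder only */
}

/* Move callback: zap the CPU mappings the importer created itself... */
static void my_import_move_notify(struct dma_buf_attachment *attach)
{
	struct my_imported_bo *bo = attach->importer_priv;

	unmap_mapping_range(bo->mapping, 0,
			    (loff_t)bo->num_pages << PAGE_SHIFT, 1);
}

/* ...and re-resolve the PFN lazily on the next CPU page fault. */
static vm_fault_t my_import_fault(struct vm_fault *vmf)
{
	struct my_imported_bo *bo = vmf->vma->vm_private_data;
	unsigned long pfn;

	if (my_import_get_pfn(bo, vmf->pgoff, &pfn))
		return VM_FAULT_SIGBUS;

	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static const struct vm_operations_struct my_import_vm_ops = {
	.fault = my_import_fault,
};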
Yeah, and exactly that is something we don't want to allow, because it means that every importer needs to get things right to prevent exporters from running into problems.
This is where I get into trouble with your argument. It is not that the API has an issue, or that the rules of the API are not logical and functional. You are arguing that even a logical and functional API will be misused by some people and that reviewers will not catch it.
Well, it's not misused, it's just a very bad design decision to let every importer implement functionality which actually belongs in a single place in the exporter.
Honestly, I don't think that is consistent with the kernel philosophy. We should do our best to make APIs that are hard to misuse, but if we can't achieve that it doesn't mean we stop and give up on problems; we go into the world of APIs that can be misused and we are supposed to rely on the reviewer system to catch it.
This is not giving up, but rather just applying good design principles.
See, I can only repeat myself: we came up with this approach because of experience, after finding that what you suggest here doesn't work.

In other words, we have already tried what you suggest here and it doesn't work.
You can already turn both a TEE allocated buffer as well as a memfd into a DMA-buf. So basically TEE and memfd already provide different interfaces which go beyond what DMA-buf does and allows.

In other words, if you want to do things like direct I/O to block or network devices you can mmap() your memfd and do that, while at the same time handing the same memfd to your GPU, V4L or neural accelerator as a DMA-buf. Would this be a way you could work with as well?

I guess, but this still requires creating a dmabuf2 type thing with very similar semantics and then shimming dmabuf2 to 1 for DRM consumers. I don't see how it addresses your fundamental concern that the semantics we want are an API that is too easy for drivers to abuse. And being more functional and efficient, we'd just see people wanting to use dmabuf2 directly instead of bothering with 1.
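For concreteness, the existing memfd path looks roughly like this from userspace, e.g. via the udmabuf driver (assuming CONFIG_UDMABUF is enabled; error handling omitted, size must be page aligned, and the memfd must be sealed against shrinking before udmabuf accepts it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int memfd_as_dmabuf(size_t size, void **cpu_map)
{
	struct udmabuf_create create = { 0 };
	int memfd, devfd, dmabuf_fd;

	memfd = memfd_create("buffer", MFD_ALLOW_SEALING);
	ftruncate(memfd, size);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	/* CPU mapping, usable for e.g. O_DIRECT block or network I/O */
	*cpu_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			memfd, 0);

	/* Wrap the same memory into a DMA-buf for a device importer. */
	devfd = open("/dev/udmabuf", O_RDWR);
	create.memfd  = memfd;
	create.offset = 0;
	create.size   = size;
	dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

	close(devfd);
	return dmabuf_fd;	/* hand this to the GPU/V4L/accelerator */
}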
Why would you want to do a dmabuf2 here?
A separate file descriptor representing the private MMIO which iommufd and KVM use, but which you can turn into a DMA-buf whenever you need to give it to a DMA-buf importer?

Well, it would end up just being used everywhere. I think one person wanted to use this with DRM drivers for some reason, but RDMA would use the dmabuf2 directly because it will be much more efficient than using scatterlist.

Honestly, I'd much rather extend dmabuf and see DRM institute some rule that DRM drivers may not use XYZ parts of the improvement. Like maybe we could use some symbol namespaces to really enforce it, e.g. MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE).

Some of the improvements we want, like the revoke rules for lifetime, seem to be agreeable. Block the API that gives you the non-scatterlist attachment. Only VFIO/RDMA/kvm/iommufd will get to implement it.
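As a sketch of that namespace idea, with dma_buf_attach_pfn() as a purely hypothetical placeholder for the restricted interface (note that recent kernels spell the namespace as a string literal):

#include <linux/export.h>
#include <linux/module.h>

/* In the DMA-buf core: export the restricted interface into a
 * dedicated symbol namespace. */
EXPORT_SYMBOL_NS_GPL(dma_buf_attach_pfn, DMABUF_NOT_FOR_DRM_USAGE);

/* In the modules that are allowed to use it (VFIO/RDMA/KVM/iommufd): */
MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE);

/* A DRM driver that has not declared the import is flagged by modpost
 * at build time and its module is rejected at load time. */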
I don't mind improving the scatterlist approach in any way possible. I'm just rejecting things which we have already tried and which turned out to be a bad idea.
If you make an interface which gives importers DMA addresses plus additional information like address space, access hints, etc., that would be really welcome.
But exposing PFNs and letting the importers create their DMA mappings themselves and make their CPU mappings themselves is an absolutely clear no-go.
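Something along these lines, purely as an illustration of the shape such an interface could take; none of the names below exist in the kernel:

#include <linux/dma-buf.h>
#include <linux/types.h>

enum my_dma_addr_space {
	MY_ADDR_SPACE_SYSTEM,	/* system memory, mapped through the IOMMU */
	MY_ADDR_SPACE_P2P,	/* peer-to-peer PCIe BAR address */
};

struct my_dma_range {
	dma_addr_t addr;		/* DMA address, never a raw PFN */
	size_t len;
	enum my_dma_addr_space space;
	unsigned int access_hints;	/* e.g. read-mostly, uncached, ... */
};

/* The exporter fills in the ranges for an attachment; the importer
 * never learns the backing PFNs and never builds its own mapping. */
int my_dma_buf_map_ranges(struct dma_buf_attachment *attach,
			  struct my_dma_range *ranges,
			  unsigned int *nr_ranges);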
In this case Xu is exporting MMIO from VFIO and importing to KVM and iommufd.

So basically a portion of a PCIe BAR is imported into iommufd?

And KVM. We need to get the CPU address into KVM and IOMMU page tables. It must go through a private FD path and not a VMA because of the CC rules about machine check I mentioned earlier. The private FD must have a lifetime model to ensure we don't UAF the PCIe BAR memory.
Then create an interface between VFIO and KVM/iommufd which allows passing data between those two.
We already do this between DMA-buf exporters/importers all the time. Just don't make it a general DMA-buf API.
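As a purely hypothetical illustration of such a private contract (every name below is made up), the consumer would register a revoke callback together with the BAR range, so VFIO can guarantee the memory is never touched after the device is reset or unplugged:

#include <linux/types.h>
#include <linux/vfio.h>

struct my_vfio_bar_handle {
	phys_addr_t base;		/* BAR physical address */
	resource_size_t size;
	void (*revoke)(void *priv);	/* called before the BAR goes away */
	void *priv;
};

int my_vfio_pci_share_bar(struct vfio_device *vdev, unsigned int bar,
			  struct my_vfio_bar_handle *handle);
void my_vfio_pci_unshare_bar(struct vfio_device *vdev, unsigned int bar);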
Someone else had some use case where they wanted to put the VFIO MMIO PCIe BAR into a DMABUF and ship it into a GPU driver for something-something virtualization, but I didn't understand it.
Yeah, that is already perfectly supported.
Let's just say that both the ARM guys as well as the GPU people already have some pretty "interesting" ways of doing digital rights management and content protection.

Well, that is TEE stuff. TEE and CC are not the same thing, though they have some high level conceptual overlap. In a certain sense CC is a TEE that is built using KVM instead of the TEE subsystem. Using KVM and integrating with the MM brings a whole set of unique challenges that TEE got to avoid.
Please go over those challenges in more detail. I need to get a better understanding of what's going on here.
E.g. who manages the encryption keys, who raises the machine check on violations, etc.
Regards,
Christian.
Jason