[SNIP]
>>> For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have a full understanding of what is being mapped and do the correct things, safely.

>> I can say from experience that this is clearly not going to work for all use cases. It would mean that we have to pull a massive amount of driver-specific functionality into the DMA API.

> That isn't what I mean. There are two distinct parts: the means to describe the source (PFN + P2P source information) that is compatible with the DMA API, and the DMA API itself that works with a few general P2P source information types. Private source information would be detected by co-operating drivers and go down driver-private paths. It would be rejected by other drivers. This broadly follows how the new API is working.

> So here I mean you can use the same PFN + source API between importer and exporter, and the importer can simply detect the special source and do the private stuff. It is not shifting things under the DMA API, it is building alongside it using compatible design approaches. You would match the source information, cast it to a driver structure, do whatever driver math is needed to compute the local DMA address and then write it to the device. Nothing is hard or "not going to work" here.
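For illustration, a minimal sketch of the "match the source information, cast it to a driver structure" pattern described above; every type and function name here is hypothetical and not an existing kernel API:

/*
 * Hypothetical sketch only: none of these names exist in the kernel.
 * The importer looks at the source information attached to a PFN range;
 * only a co-operating driver follows the private path, everybody else
 * rejects it.
 */
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/mm.h>

enum my_dma_src_type {
	MY_DMA_SRC_CPU,
	MY_DMA_SRC_P2P,
	MY_DMA_SRC_PRIVATE,		/* driver-private address space */
};

struct my_dma_source {
	enum my_dma_src_type type;
};

/* Exporter-private wrapper, only meaningful to co-operating drivers. */
struct my_exporter_source {
	struct my_dma_source base;
	u64 aperture_base;		/* base of the private address space */
};

static int my_importer_map(struct my_dma_source *src, unsigned long pfn,
			   u64 *dev_addr)
{
	switch (src->type) {
	case MY_DMA_SRC_CPU:
	case MY_DMA_SRC_P2P:
		/* Normal path: hand PFN + source info to the DMA API. */
		return -EOPNOTSUPP;	/* elided in this sketch */
	case MY_DMA_SRC_PRIVATE: {
		/*
		 * Co-operating driver: cast to the private structure and do
		 * the driver math to compute the device-local address.
		 */
		struct my_exporter_source *priv =
			container_of(src, struct my_exporter_source, base);

		*dev_addr = priv->aperture_base + ((u64)pfn << PAGE_SHIFT);
		return 0;
	}
	}
	return -EINVAL;			/* unknown source type: reject */
}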
Well to be honest that sounds like an absolutely horrible design.
You are moving all responsibilities for inter-driver handling into the drivers themselves without any supervision by the core OS.
Drivers are notoriously buggy and should absolutely not do things like that on their own.
Do you have pointers to this new API?
>>> So, no, we don't lose private address space support when moving to importer mapping; in fact it works better because the importer gets more information about what is going on.

>> Well, sounds like I wasn't able to voice my concern. Let me try again: we should not give importers information they don't need, especially not information about the backing store of buffers. So importers getting more information about what's going on is a bad thing.

> I strongly disagree because we are suffering today in mlx5 because of this viewpoint. You cannot predict in advance what importers are going to need. I already listed many examples where it does not work today as is.
Well there is no need to predict in advance what importers are going to do. We just gather the requirements of the importers and see how we can fulfill them.
And yes, it is absolutely intentional that maintainers reject some of those requirements because they find that the importer tries to do something totally nasty which shouldn't happen in the first place.
>>> I have imagined a staged approach where DMABUF gets a new API that works with the new DMA API to do importer mapping with "P2P source information" and a gradual conversion.

>> To make it clear: as maintainer of that subsystem I would reject such a step with all I have.

> This is unexpected. So you want to just leave dmabuf broken? Do you have any plan to fix it, to fix the misuse of the DMA API, and all the problems I listed below? This is a big deal, it is causing real problems today.
Well as maintainer of DMA-buf I don't think it is broken. It's just missing requirements.
When you have new requirements we can certainly talk about them, but giving drivers a green light to do whatever nasty thing they want to do is not going to fly at all.
>>> Exporter mapping falls down in too many cases already:
>>> 1) Private address spaces don't work fully well because many devices need some indication of what address space is being used, and a scatterlist can't really properly convey that. If the DMABUF has a mixture of CPU and private it becomes a PITA.

>> Correct, yes. That's why I said that scatterlist was a bad choice for the interface. But exposing the backing store to importers and then letting them do whatever they want with it sounds like an even worse idea.

> You keep saying this without real justification.
Well I have really strong justification for that. Please look at the code here and how it came to be: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/dma-buf/dma-buf.c#L776
We intentionally mangle the struct page pointers in the scatterlist to prevent importers from trying to access it.
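For reference, the mangling looks roughly like this (paraphrased from mangle_sg_table() in drivers/dma-buf/dma-buf.c, guarded by CONFIG_DMABUF_DEBUG; see the link above for the exact code):

#include <linux/scatterlist.h>

/* Paraphrased from drivers/dma-buf/dma-buf.c (CONFIG_DMABUF_DEBUG). */
static void mangle_sg_table(struct sg_table *sg_table)
{
#ifdef CONFIG_DMABUF_DEBUG
	struct scatterlist *sg;
	int i;

	/*
	 * Scramble the struct page pointers so an importer that tries to
	 * dereference them blows up immediately instead of silently
	 * working. The low SG_* bits are preserved so the scatterlist
	 * itself stays walkable, and the mangling is undone before the
	 * table is handed back to the exporter.
	 */
	for_each_sgtable_sg(sg_table, sg, i)
		sg->page_link ^= ~0xffUL;
#endif
}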
> To me it is a nanny style of API design.
Yes, absolutely and that is really necessary.
Inter-driver APIs are hard work because the framework not only needs to define rules but also has to make sure they are properly enforced.
That's why we not only added tons of documentation on how to implement things, but also annotations which give big warnings whenever somebody tries to do something nasty.
> But also I don't see how you can possibly fix the above without telling the importer a lot more information.
Attaching additional information to the DMA addresses is perfectly ok. It's just that this additional information should be the output of the DMA API or DMA-buf and not something drivers try to figure out on their own.
>>> 2) Multi-path PCI can require the importer to make mapping decisions unique to the device and program device-specific information for the multi-path. We are doing this in mlx5 today and have hacks because DMABUF is destroying the information the importer needs to choose the correct PCI path.

>> That's why the exporter gets the struct device of the importer, so that it can plan how those accesses are made. Where exactly is the problem with that?

> A single struct device does not convey the multipath options. We have multiple struct devices (and multiple PCI endpoints) doing DMA concurrently under one driver. Multipath always needs additional meta information on the importer side to tell the device which path to select. A naked dma address is not sufficient. Today we guess that DMABUF will be using P2P and hack to choose a P2P struct device to pass to the exporter. We need to know what is in the dmabuf before we can choose which of the multiple struct devices the driver has to use for DMA mapping. But even in simple CPU-centric cases we will eventually want to select the struct device matching the proper NUMA-local PCI channel for CPU-only buffers.
So basically you want something which tells you if device A or device B should be used to access something?
That was discussed before as well, but so far the resolution was that userspace should decide that stuff and not the kernel.
Would it be sufficient to have a query API where you can ask the DMA-buf exporter for information about the buffer? E.g. NUMA distance etc...
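As a rough illustration of what such a query interface could look like; none of these names (dma_buf_placement_info, dma_buf_query_placement, pick_dma_path) exist in the kernel today, this is purely a sketch:

/*
 * Hypothetical sketch: a query interface so the importer can ask the
 * exporter about buffer placement instead of poking at the backing
 * store itself.
 */
#include <linux/dma-buf.h>
#include <linux/pci.h>
#include <linux/topology.h>
#include <linux/limits.h>
#include <linux/errno.h>

struct dma_buf_placement_info {
	int numa_node;	/* NUMA node of the backing memory, or NUMA_NO_NODE */
	bool is_p2p;	/* backed by PCI device memory */
};

/* Stub for the sketch; a real implementation would live in dma-buf. */
static int dma_buf_query_placement(struct dma_buf *dmabuf,
				   struct dma_buf_placement_info *info)
{
	return -EOPNOTSUPP;
}

/*
 * An importer driving several DMA-capable PCI functions could then pick
 * the path closest to the buffer instead of guessing. Assumes npaths >= 1.
 */
static struct pci_dev *pick_dma_path(struct pci_dev **paths, int npaths,
				     struct dma_buf *dmabuf)
{
	struct dma_buf_placement_info info;
	struct pci_dev *best = paths[0];
	int best_dist = INT_MAX;
	int i;

	if (dma_buf_query_placement(dmabuf, &info) || info.numa_node < 0)
		return best;	/* no placement info, fall back to a default */

	for (i = 0; i < npaths; i++) {
		int dist = node_distance(dev_to_node(&paths[i]->dev),
					 info.numa_node);

		if (dist < best_dist) {
			best_dist = dist;
			best = paths[i];
		}
	}
	return best;
}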
>> When you have a use case which is not covered by the existing DMA-buf interfaces then please voice that to me and the other maintainers instead of implementing some hack.

> Do you have any suggestion for any of this then?
Yeah, compared to other requirements, what you have at hand here sounds trivial.
>>> 5) iommufd and kvm are both using CPU addresses without DMA. No exporter mapping is possible.

>> We have customers using both KVM and XEN with DMA-buf, so I can clearly confirm that this isn't true.

> Today they are mmaping the dma-buf into a VMA and then using KVM's follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable dma-buf must have a CPU PFN.
Yeah and I'm the one who forced the KVM guys to do it like that.
> Here Xu implements basically the same path, except without the VMA indirection, and suddenly it is not OK? Illogical.
Well I'm the designer of this KVM solution and it's not illogical to reject that when you are missing quite a bunch of stuff.
Sima correctly pointed out that you need an mmu notifier if you want to work with PFNs, but that is just the tip of the iceberg here.
If you want to go down the CPU access path you also need to cover things like caching flags, lifetime, dynamically changing backing store, etc.
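For context, the mmu notifier part boils down to a pattern like the following minimal sketch against the existing mmu_notifier API; the my_* structures and helpers are made up for illustration, and all the dma-buf-specific plumbing is omitted:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>

/* Hypothetical importer state; only the notifier part is sketched. */
struct my_pfn_user {
	struct mmu_notifier mn;
	/* ... cached PFNs, device mappings, locking ... */
};

/* Placeholder: a real driver would tear down device mappings here. */
static void my_zap_device_mappings(struct my_pfn_user *user,
				   unsigned long start, unsigned long end)
{
}

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_pfn_user *user = container_of(mn, struct my_pfn_user, mn);

	/*
	 * The PFNs in [range->start, range->end) may be about to go away or
	 * change: the device must stop using them and any cached
	 * translations must be dropped before this callback returns.
	 */
	my_zap_device_mappings(user, range->start, range->end);
	return 0;
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

static int my_pfn_user_register(struct my_pfn_user *user, struct mm_struct *mm)
{
	user->mn.ops = &my_mn_ops;
	return mmu_notifier_register(&user->mn, mm);
}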
Regards,
Christian.
> Jason