On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:

Granted, let me try to improve this. Here is a real world example of one of the issues we ran into and why CPU mappings of importers are redirected to the exporter.

We have a good bunch of different exporters who track the CPU mappings of their backing store using address_space objects in one way or another and then use unmap_mapping_range() to invalidate those CPU mappings. But when importers get the PFNs of the backing store they can look behind the curtain and directly insert those PFNs into the CPU page tables.

We had literally tons of cases like this where driver developers caused access-after-free issues because the importer created CPU mappings on its own without the exporter knowing about it. This is just one example of what we ran into. In addition to that, basically the whole synchronization between drivers was overhauled as well because we found that we can't trust importers to always do the right thing.

But this, fundamentally, is importers creating attachments and then *ignoring the lifetime rules of DMABUF*. If you created an attachment, got a move notification and *ignored the move* because you put the PFN in your own VMA, then you are not following the attachment lifetime rules!
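To make the failure mode concrete, here is a minimal exporter-side sketch (the struct and function names are hypothetical; unmap_mapping_range() is the real kernel API referred to above). The exporter can tear down every CPU mapping it knows about in one call, but a PFN an importer inserted into its own VMA is invisible to it and survives the teardown:

#include <linux/fs.h>	/* struct address_space */
#include <linux/mm.h>	/* unmap_mapping_range() */

/* Hypothetical buffer object; the names are illustrative only. */
struct my_bo {
	struct address_space *mapping;	/* every CPU mmap() goes through this */
	loff_t offset;			/* offset of this BO inside the mapping */
	loff_t size;
};

/*
 * Exporter-side eviction: all CPU mappings created through bo->mapping
 * are torn down in a single call before the backing store is moved or
 * freed.  A PFN that an importer inserted into its own VMA is not
 * tracked here and survives as a stale mapping -- exactly the
 * access-after-free described above.
 */
static void my_bo_evict(struct my_bo *bo)
{
	unmap_mapping_range(bo->mapping, bo->offset, bo->size, 1);
	/* ...now it is safe to actually move or free the backing pages. */
}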
Move notify is solely for informing the importer that they need to refresh their DMA mappings and potentially block for ongoing DMA to finish.
These semantics don't work well for CPU mappings because you need to hold the reservation lock to make sure that the information stays valid, and you can't hold a lock while returning from a page fault.
In other words, page faults are opportunistic and happen concurrently with invalidation, while the move_notify approach is serialized through a common lock.
I think that's what Sima tried to point out as well.
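For reference, a minimal sketch of the importer side of that contract, using the real dynamic-attachment API (struct dma_buf_attach_ops, dma_buf_unmap_attachment()); the my_* names are made up. The exporter calls move_notify with the reservation lock already held, and the importer only waits for its in-flight DMA and drops its DMA mapping:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>	/* MAX_SCHEDULE_TIMEOUT */

/* Hypothetical importer state; the my_* names are made up. */
struct my_import {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;		/* current DMA mapping, if any */
};

/*
 * Called by the exporter with the dma_resv lock already held: the
 * importer waits for its in-flight DMA and drops its DMA mapping; it
 * must not assume anything about the backing store afterwards.
 */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	struct my_import *imp = attach->importer_priv;

	dma_resv_assert_held(attach->dmabuf->resv);

	/* Block for ongoing DMA to finish (error handling omitted). */
	dma_resv_wait_timeout(attach->dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
			      false, MAX_SCHEDULE_TIMEOUT);

	if (imp->sgt) {
		dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
		imp->sgt = NULL;	/* re-created lazily on next use */
	}
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,
	.move_notify	 = my_move_notify,
};

The attachment would be created with dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp); the pin/map paths are left out to keep the sketch short.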
To implement this safely the driver would need to use unmap_mapping_range() on the driver VMA inside the move callback, and hook into the VMA fault callback to re-attach the dmabuf.
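Roughly, the pattern described in the preceding paragraph would look like this; a sketch only, with all my_* names hypothetical, while unmap_mapping_range(), the vm_operations_struct fault hook and vmf_insert_pfn() are the real kernel pieces involved:

#include <linux/dma-buf.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* Hypothetical importer object; all my_* names are made up. */
struct my_imported_bo {
	struct dma_buf_attachment *attach;
	struct address_space *mapping;	/* the importer's own CPU mappings */
	pgoff_t num_pages;
};

/* Hypothetical helper: re-validate the attachment and look up the
 * current backing PFN for this offset. */
static int my_import_get_pfn(struct my_imported_bo *bo, pgoff_t pgoff,
			     unsigned long *pfn)
{
	return -ENODEV;	/* placeholder only */
}

/* Move callback: zap the CPU mappings the importer created itself... */
static void my_import_move_notify(struct dma_buf_attachment *attach)
{
	struct my_imported_bo *bo = attach->importer_priv;

	unmap_mapping_range(bo->mapping, 0,
			    (loff_t)bo->num_pages << PAGE_SHIFT, 1);
}

/* ...and re-resolve the PFN lazily on the next CPU page fault. */
static vm_fault_t my_import_fault(struct vm_fault *vmf)
{
	struct my_imported_bo *bo = vmf->vma->vm_private_data;
	unsigned long pfn;

	if (my_import_get_pfn(bo, vmf->pgoff, &pfn))
		return VM_FAULT_SIGBUS;

	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static const struct vm_operations_struct my_import_vm_ops = {
	.fault = my_import_fault,
};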
Yeah, and exactly that is something we don't want to allow, because it means that every importer needs to get things right to prevent exporters from running into problems.
This is where I get into trouble with your argument. It is not that the API has an issue, or that the rules of the API are not logical and functional. You are arguing that even a logical and functional API will be misused by some people and that reviewers will not catch it.
Well, it's not misused, it's just a very bad design decision to let every importer implement functionality which actually belongs in a single place in the exporter.
Honestly, I don't think that is consistent with the kernel philosophy. We should do our best to make APIs that are hard to misuse, but if we can't achieve that it doesn't mean we stop and give up on problems; we go into the world of APIs that can be misused and we are supposed to rely on the reviewer system to catch it.
This is not giving up, but rather just applying good design principles.
See, I can only repeat myself: we came up with this approach because of experience, after finding that what you suggest here doesn't work.

In other words, we have already tried what you suggest here and it doesn't work.
You can already turn both a TEE allocated buffer as well as a memfd into a DMA-buf. So basically TEE and memfd already provide different interfaces which go beyond what DMA-buf does and allows.

In other words, if you want to do things like direct I/O to block or network devices you can mmap() your memfd and do that, while at the same time handing the same memfd to your GPU, V4L or neural accelerator as a DMA-buf. Would this be a way you could work with as well?

I guess, but this still requires creating a dmabuf2 type thing with very similar semantics and then shimming dmabuf2 to 1 for DRM consumers. I don't see how it addresses your fundamental concern that the semantics we want are an API that is too easy for drivers to abuse. And being more functional and efficient, we'd just see people wanting to use dmabuf2 directly instead of bothering with 1.
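For concreteness, the existing memfd path looks roughly like this from userspace, e.g. via the udmabuf driver (assuming CONFIG_UDMABUF is enabled; error handling omitted, size must be page aligned, and the memfd must be sealed against shrinking before udmabuf accepts it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int memfd_as_dmabuf(size_t size, void **cpu_map)
{
	struct udmabuf_create create = { 0 };
	int memfd, devfd, dmabuf_fd;

	memfd = memfd_create("buffer", MFD_ALLOW_SEALING);
	ftruncate(memfd, size);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	/* CPU mapping, usable for e.g. O_DIRECT block or network I/O */
	*cpu_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			memfd, 0);

	/* Wrap the same memory into a DMA-buf for a device importer. */
	devfd = open("/dev/udmabuf", O_RDWR);
	create.memfd  = memfd;
	create.offset = 0;
	create.size   = size;
	dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

	close(devfd);
	return dmabuf_fd;	/* hand this to the GPU/V4L/accelerator */
}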
Why would you want to do a dmabuf2 here?
A separate file descriptor representing the private MMIO which iommufd and KVM use, but which you can turn into a DMA-buf whenever you need to give it to a DMA-buf importer?

Well, it would end up just being used everywhere. I think one person wanted to use this with DRM drivers for some reason, but RDMA would use the dmabuf2 directly because it will be much more efficient than using scatterlist.

Honestly, I'd much rather extend dmabuf and see DRM institute some rule that DRM drivers may not use XYZ parts of the improvement. Like maybe we could use some symbol namespaces to really enforce it, e.g. MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE).

Some of the improvements we want, like the revoke rules for lifetime, seem to be agreeable. Block the API that gives you the non-scatterlist attachment. Only VFIO/RDMA/kvm/iommufd will get to implement it.
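As a sketch of that namespace idea, with dma_buf_attach_pfn() as a purely hypothetical placeholder for the restricted interface (note that recent kernels spell the namespace as a string literal):

#include <linux/export.h>
#include <linux/module.h>

/* In the DMA-buf core: export the restricted interface into a
 * dedicated symbol namespace. */
EXPORT_SYMBOL_NS_GPL(dma_buf_attach_pfn, DMABUF_NOT_FOR_DRM_USAGE);

/* In the modules that are allowed to use it (VFIO/RDMA/KVM/iommufd): */
MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE);

/* A DRM driver that has not declared the import is flagged by modpost
 * at build time and its module is rejected at load time. */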
I don't mind improving the scatterlist approach in any way possible. I'm just rejecting things which we have already tried and which turned out to be a bad idea.
If you make an interface which gives importers DMA addresses plus additional information like address space, access hints, etc., that would be really welcome.
But exposing PFNs and letting the importers create their DMA mappings themselves and make their CPU mappings themselves is an absolutely clear no-go.
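Something along these lines, purely as an illustration of the shape such an interface could take; none of the names below exist in the kernel:

#include <linux/dma-buf.h>
#include <linux/types.h>

enum my_dma_addr_space {
	MY_ADDR_SPACE_SYSTEM,	/* system memory, mapped through the IOMMU */
	MY_ADDR_SPACE_P2P,	/* peer-to-peer PCIe BAR address */
};

struct my_dma_range {
	dma_addr_t addr;		/* DMA address, never a raw PFN */
	size_t len;
	enum my_dma_addr_space space;
	unsigned int access_hints;	/* e.g. read-mostly, uncached, ... */
};

/* The exporter fills in the ranges for an attachment; the importer
 * never learns the backing PFNs and never builds its own mapping. */
int my_dma_buf_map_ranges(struct dma_buf_attachment *attach,
			  struct my_dma_range *ranges,
			  unsigned int *nr_ranges);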
In this case Xu is exporting MMIO from VFIO and importing to KVM and iommufd.

So basically a portion of a PCIe BAR is imported into iommufd?

And KVM. We need to get the CPU address into KVM and IOMMU page tables. It must go through a private FD path and not a VMA because of the CC rules about machine check I mentioned earlier. The private FD must have a lifetime model to ensure we don't UAF the PCIe BAR memory.
Then create an interface between VFIO and KVM/iommufd which allows passing data between those two.
We already do this between DMA-buf exporters/importers all the time. Just don't make it a general DMA-buf API.
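As a purely hypothetical illustration of such a private contract (every name below is made up), the consumer would register a revoke callback together with the BAR range, so VFIO can guarantee the memory is never touched after the device is reset or unplugged:

#include <linux/types.h>
#include <linux/vfio.h>

struct my_vfio_bar_handle {
	phys_addr_t base;		/* BAR physical address */
	resource_size_t size;
	void (*revoke)(void *priv);	/* called before the BAR goes away */
	void *priv;
};

int my_vfio_pci_share_bar(struct vfio_device *vdev, unsigned int bar,
			  struct my_vfio_bar_handle *handle);
void my_vfio_pci_unshare_bar(struct vfio_device *vdev, unsigned int bar);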
Someone else had some use case where they wanted to put the VFIO MMIO PCIe BAR into a DMABUF and ship it into a GPU driver for something-something virtualization, but I didn't understand it.
Yeah, that is already perfectly supported.
Let's just say that both the ARM guys as well as the GPU people already have some pretty "interesting" ways of doing digital rights management and content protection.

Well, that is TEE stuff. TEE and CC are not the same thing, though they have some high level conceptual overlap. In a certain sense CC is a TEE that is built using KVM instead of the TEE subsystem. Using KVM and integrating with the MM brings a whole set of unique challenges that TEE got to avoid.
Please go over those challenges in more detail. I need to get a better understanding of what's going on here.
E.g. who manages the encryption keys, who raises the machine check on violations, etc.
Regards,
Christian.
Jason