On Fri, Jan 10, 2025 at 04:38:38PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:
> 
> > So if I'm getting this right, what you need from a functional pov is a
> > dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap

I'm not sure 'mmap' is the right word here. What we need is a mapping
from (FD + offset) to the backend memory, provided directly by the
memory provider rather than via a VMA and the CPU page table.

The VMA & CPU page table exist so the host can access the memory, but
the VMM/host doesn't access most of the guest memory, so why build them
at all?

> > is not going to work I guess?
> 
> Don't want something TDX specific!
> 
> There is a general desire, and CC is one, but there are other
> motivations like performance, to stop using VMAs and mmaps as a way to
> exchanage memory between two entities. Instead we want to use FDs.

Exactly.

> We now have memfd and guestmemfd that are usable with
> memfd_pin_folios() - this covers pinnable CPU memory.
> 
> And for a long time we had DMABUF which is for all the other wild
> stuff, and it supports movable memory too.
> 
> So, the normal DMABUF semantics with reservation locking and move
> notifiers seem workable to me here. They are broadly similar enough to
> the mmu notifier locking that they can serve the same job of updating
> page tables.

Yes. With this new sharing model, the lifecycle of the shared
memory/pfn/page is controlled directly by the dma-buf exporter, not by
the CPU mapping. So I also think the reservation lock & move_notify
work well for lifecycle control, with no conflict (nothing to do) with
follow_pfn() & mmu_notifier. (A rough importer-side sketch of what I
mean is further down.)

> > Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> > memory model:
> > - permanently pinned dma-buf, they never move
> > - dynamic dma-buf, they move through ->move_notify and importers can remap
> > - revocable dma-buf, which thus far only exist for pci mmio resources
> 
> I would like to see the importers be able to discover which one is
> going to be used, because we have RDMA cases where we can support 1
> and 3 but not 2.
> 
> revocable doesn't require page faulting as it is a terminal condition.
> 
> > Since we're leaning even more on that 3rd model I'm wondering whether we
> > should make it something official. Because the existing dynamic importers
> > do very much assume that re-acquiring the memory after move_notify will
> > work. But for the revocable use-case the entire point is that it will
> > never work.
> 
> > I feel like that's a concept we need to make explicit, so that dynamic
> > importers can reject such memory if necessary.
> 
> It strikes me as strange that HW can do page faulting, so it can
> support #2, but it can't handle a non-present fault?
> 
> > So yeah there's a bunch of tricky lifetime questions that need to be
> > sorted out with proper design I think, and the current "let's just use pfn
> > directly" proposal hides them all under the rug.
> 
> I don't think these two things are connected. The lifetime model that
> KVM needs to work with the EPT, and that VFIO needs for it's MMIO,
> definately should be reviewed and evaluated.
> 
> But it is completely orthogonal to allowing iommufd and kvm to access
> the CPU PFN to use in their mapping flows, instead of the
> dma_addr_t.
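
To make my earlier point about reusing the dma_resv / move_notify
machinery concrete, below is roughly how I picture the importer side of
the revocable model. Only a sketch: the my_importer_* names and the
'revoked' flag are made-up placeholders; struct dma_buf_attach_ops,
move_notify and the dma_resv calls are the existing API.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

struct my_importer {
	bool revoked;
	/* importer page-table / mapping state lives here */
};

/* placeholder: tear down whatever mappings the importer created */
static void my_importer_unmap(struct my_importer *imp);

static void my_importer_move_notify(struct dma_buf_attachment *attach)
{
	struct my_importer *imp = attach->importer_priv;

	/* the exporter calls this with the dma_resv lock already held */
	dma_resv_assert_held(attach->dmabuf->resv);

	/*
	 * In the revocable model this is a terminal event: tear down our
	 * mappings and never try to re-acquire the memory afterwards.
	 */
	my_importer_unmap(imp);
	imp->revoked = true;
}

static const struct dma_buf_attach_ops my_importer_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = my_importer_move_notify,
};

The only difference from the existing dynamic case is what the importer
does after the notification, so the attach and locking paths can stay
the same.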
> 
> What I want to get to is a replacement for scatter list in DMABUF that
> is an array of arrays, roughly like:
> 
>   struct memory_chunks {
>       struct memory_p2p_provider *provider;
>       struct bio_vec addrs[];
>   };
>   int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);

Maybe we need to specify which object the API is operating on: struct
dma_buf, struct dma_buf_attachment, or a new kind of attachment.

I think:

  int (*dmabuf_get_memory)(struct dma_buf_attachment *attach,
                           struct memory_chunks **chunks,
                           size_t *num_chunks);

works, but maybe a new attachment is conceptually clearer to importers
and harder to abuse? (A rough sketch of how an importer would use the
attachment-based form is at the end of this mail, below your quoted
text.)

Thanks,
Yilun

> This can represent all forms of memory: P2P, private, CPU, etc and
> would be efficient with the new DMA API.
> 
> This is similar to the structure BIO has, and it composes nicely with
> a future pin_user_pages() and memfd_pin_folios().
> 
> Jason
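
Here is the importer-side sketch I mentioned above. Purely
illustrative: struct memory_chunks and the get_memory op are just the
interface proposed in this thread and don't exist today,
dmabuf_get_memory() stands in for however the op ends up being exposed
to importers, and the my_importer_* names are placeholders.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* proposed interface from this thread, does not exist yet */
struct memory_chunks;
int dmabuf_get_memory(struct dma_buf_attachment *attach,
		      struct memory_chunks **chunks, size_t *num_chunks);

/* placeholder for the importer's own per-chunk mapping logic */
int my_importer_map_chunks(void *priv, struct memory_chunks *chunks,
			   size_t num_chunks);

static int my_importer_map(struct dma_buf_attachment *attach)
{
	struct memory_chunks *chunks;
	size_t num_chunks;
	int ret;

	/* same locking rule as the existing dynamic importer path */
	dma_resv_assert_held(attach->dmabuf->resv);

	/*
	 * The exporter fills in chunks for this specific attachment, so
	 * it can hand out P2P, private or CPU memory depending on what
	 * this importer said it can handle.
	 */
	ret = dmabuf_get_memory(attach, &chunks, &num_chunks);
	if (ret)
		return ret;

	/* each chunk carries its provider, map it accordingly */
	return my_importer_map_chunks(attach->importer_priv,
				      chunks, num_chunks);
}

The idea behind passing the attachment (or a new attachment object) is
that the exporter knows which importer is asking and can tailor the
chunk list to it, instead of publishing one global description of the
buffer.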