On Tue, Nov 22, 2022 at 07:08:25PM +0100, Daniel Vetter wrote:
> On Tue, 22 Nov 2022 at 19:04, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >
> > On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
> > > tldr; DMA buffers aren't normal memory, expecting that you can use
> > > them like that (like calling get_user_pages works, or that they're
> > > accounted like any other normal memory) cannot be guaranteed.
> > >
> > > Since some userspace only runs on integrated devices, where all
> > > buffers are actually resident system memory, there's a huge
> > > temptation to assume that a struct page is always present and usable
> > > like for any other pagecache-backed mmap. This has the potential to
> > > result in a uapi nightmare.
> > >
> > > To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which
> > > blocks get_user_pages and all the other struct page based
> > > infrastructure for everyone. In spirit this is the uapi counterpart to
> > > the kernel-internal CONFIG_DMABUF_DEBUG.
> > >
> > > Motivated by a recent patch which wanted to switch the system dma-buf
> > > heap to vm_insert_page instead of vm_insert_pfn.
> > >
> > > v2:
> > >
> > > Jason brought up that we also want to guarantee that all ptes have the
> > > pte_special flag set, to catch fast get_user_pages (on architectures
> > > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> > >
> > > From auditing the various functions to insert pfn pte entries
> > > (vm_insert_pfn_prot, remap_pfn_range and all its callers like
> > > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > > this should be the correct flag to check for.
> >
> > I didn't look at how this actually gets used, but it is a bit of a
> > pain to insert a lifetime-controlled object like a struct page as a
> > special PTE/VM_PFNMAP.
> >
> > How is the lifetime model implemented here? How do you know when
> > userspace has finally unmapped the page?
>
> The vma has a filp which is the refcounted dma_buf. With dma_buf you
> never get an individual page; it's always the entire object. And it's
> up to the allocator how exactly it wants to use or not use the page's
> refcount. So if gup goes in and elevates the refcount, you can break
> stuff, which is why I'm doing this.

But how does move work?

Jason
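[For context, a minimal sketch of the enforcement the commit message
describes, modeled on dma_buf_mmap_internal() in
drivers/dma-buf/dma-buf.c. The bounds checks are assumed from the
upstream file; this illustrates the kind of check being proposed, not
the exact diff under discussion.]

#include <linux/dma-buf.h>
#include <linux/mm.h>

static int dma_buf_mmap_internal(struct file *file,
				 struct vm_area_struct *vma)
{
	struct dma_buf *dmabuf = file->private_data;
	int ret;

	/* check if buffer supports mmap */
	if (!dmabuf->ops->mmap)
		return -EINVAL;

	/* check for overflowing the buffer's size */
	if (vma->vm_pgoff + vma_pages(vma) >
	    dmabuf->size >> PAGE_SHIFT)
		return -EINVAL;

	/* let the exporter set up the mapping */
	ret = dmabuf->ops->mmap(dmabuf, vma);

	/*
	 * Enforce the uapi guarantee: exporters must map through
	 * pfn-based helpers (remap_pfn_range, vm_insert_pfn, ...),
	 * which set VM_PFNMAP and make every pte pte_special, so
	 * get_user_pages (fast and slow) can never grab these pages.
	 * vm_insert_page would leave VM_PFNMAP unset and trip this.
	 */
	WARN_ON(!(vma->vm_flags & VM_PFNMAP));

	return ret;
}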