On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel@xxxxxxxx> wrote: > > On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote: > > Am 24.02.21 um 10:31 schrieb Daniel Vetter: > > > On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel) > > > <thomas_os@xxxxxxxxxxxx> wrote: > > > > > > > > On 2/24/21 9:45 AM, Daniel Vetter wrote: > > > > > On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel) > > > > > <thomas_os@xxxxxxxxxxxx> wrote: > > > > > > On 2/23/21 11:59 AM, Daniel Vetter wrote: > > > > > > > tldr; DMA buffers aren't normal memory, expecting that you can use > > > > > > > them like that (like calling get_user_pages works, or that they're > > > > > > > accounting like any other normal memory) cannot be guaranteed. > > > > > > > > > > > > > > Since some userspace only runs on integrated devices, where all > > > > > > > buffers are actually all resident system memory, there's a huge > > > > > > > temptation to assume that a struct page is always present and useable > > > > > > > like for any more pagecache backed mmap. This has the potential to > > > > > > > result in a uapi nightmare. > > > > > > > > > > > > > > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which > > > > > > > blocks get_user_pages and all the other struct page based > > > > > > > infrastructure for everyone. In spirit this is the uapi counterpart to > > > > > > > the kernel-internal CONFIG_DMABUF_DEBUG. > > > > > > > > > > > > > > Motivated by a recent patch which wanted to swich the system dma-buf > > > > > > > heap to vm_insert_page instead of vm_insert_pfn. > > > > > > > > > > > > > > v2: > > > > > > > > > > > > > > Jason brought up that we also want to guarantee that all ptes have the > > > > > > > pte_special flag set, to catch fast get_user_pages (on architectures > > > > > > > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would > > > > > > > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that. > > > > > > > > > > > > > > From auditing the various functions to insert pfn pte entires > > > > > > > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like > > > > > > > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so > > > > > > > this should be the correct flag to check for. > > > > > > > > > > > > > If we require VM_PFNMAP, for ordinary page mappings, we also need to > > > > > > disallow COW mappings, since it will not work on architectures that > > > > > > don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()). > > > > > Hm I figured everyone just uses MAP_SHARED for buffer objects since > > > > > COW really makes absolutely no sense. How would we enforce this? > > > > Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that > > > > or allowing MIXEDMAP. > > > > > > > > > > Also worth noting is the comment in ttm_bo_mmap_vma_setup() with > > > > > > possible performance implications with x86 + PAT + VM_PFNMAP + normal > > > > > > pages. That's a very old comment, though, and might not be valid anymore. > > > > > I think that's why ttm has a page cache for these, because it indeed > > > > > sucks. The PAT changes on pages are rather expensive. > > > > IIRC the page cache was implemented because of the slowness of the > > > > caching mode transition itself, more specifically the wbinvd() call + > > > > global TLB flush. > > > > Yes, exactly that. The global TLB flush is what really breaks our neck here > > from a performance perspective. > > > > > > > There is still an issue for iomem mappings, because the PAT validation > > > > > does a linear walk of the resource tree (lol) for every vm_insert_pfn. > > > > > But for i915 at least this is fixed by using the io_mapping > > > > > infrastructure, which does the PAT reservation only once when you set > > > > > up the mapping area at driver load. > > > > Yes, I guess that was the issue that the comment describes, but the > > > > issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP. > > > > > > > > > Also TTM uses VM_PFNMAP right now for everything, so it can't be a > > > > > problem that hurts much :-) > > > > Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP? > > > > > > > > https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554 > > > Uh that's bad, because mixed maps pointing at struct page wont stop > > > gup. At least afaik. > > > > Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have > > already seen tons of problems with the page cache. > > On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed > boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path. > > But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how > you're stopping gup slow path. See check_vma_flags() in mm/gup.c. > > Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think > vm_insert_mixed even works on iomem pfns. There's the devmap exception, > but we're not devmap. Worse ttm abuses some accidental codepath to smuggle > in hugepte support by intentionally not being devmap. > > So I'm really not sure this works as we think it should. Maybe good to do > a quick test program on amdgpu with a buffer in system memory only and try > to do direct io into it. If it works, you have a problem, and a bad one. That's probably impossible, since a quick git grep shows that pretty much anything reasonable has special ptes: arc, arm, arm64, powerpc, riscv, s390, sh, sparc, x86. I don't think you'll have a platform where you can plug an amdgpu in and actually exercise the bug :-) So maybe we should just switch over to VM_PFNMAP for ttm for more clarity? -Daniel > > > > > Regards, > > Christian. > > > > > Christian, do we need to patch this up, and maybe fix up ttm fault > > > handler to use io_mapping so the vm_insert_pfn stuff is fast? > > > -Daniel > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch