Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf: Require VM_PFNMAP vma for mmap

On 2/26/21 2:28 PM, Daniel Vetter wrote:
On Fri, Feb 26, 2021 at 10:41 AM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:

On 2/25/21 4:49 PM, Daniel Vetter wrote:
On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
Am 24.02.21 um 10:31 schrieb Daniel Vetter:
On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:
On 2/24/21 9:45 AM, Daniel Vetter wrote:
On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:
On 2/23/21 11:59 AM, Daniel Vetter wrote:
tl;dr: DMA buffers aren't normal memory; expecting that you can use
them like that (e.g. that calling get_user_pages works, or that they're
accounted like any other normal memory) cannot be guaranteed.

Since some userspace only runs on integrated devices, where all
buffers are actually resident in system memory, there's a huge
temptation to assume that a struct page is always present and usable
like for any other pagecache-backed mmap. This has the potential to
result in a uapi nightmare.

To stop this gap, require that DMA buffer mmaps are VM_PFNMAP, which
blocks get_user_pages and all the other struct page based
infrastructure for everyone. In spirit this is the uapi counterpart to
the kernel-internal CONFIG_DMABUF_DEBUG.

Motivated by a recent patch which wanted to switch the system dma-buf
heap to vm_insert_page instead of vm_insert_pfn.

v2:

Jason brought up that we also want to guarantee that all ptes have the
pte_special flag set, to catch fast get_user_pages (on architectures
that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.

From auditing the various functions to insert pfn pte entries
(vm_insert_pfn_prot, remap_pfn_range and all its callers like
dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
this should be the correct flag to check for.
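
For illustration, the flag requirement described above can be modelled in
plain userspace C. This is only a sketch: the helper name
dmabuf_check_vma_flags() is hypothetical, the flag values are copied here
so the example is self-contained (in the kernel they come from
<linux/mm.h>), and whether the actual patch warns or fails the mmap is not
something this sketch asserts.

```c
#include <errno.h>

/* Illustrative copies of the vma flag bits (assumed values). */
#define VM_PFNMAP   0x00000400UL
#define VM_MIXEDMAP 0x10000000UL

/*
 * Hypothetical helper: accept a dma-buf mapping only if the exporter's
 * mmap callback marked the vma VM_PFNMAP. A VM_MIXEDMAP-only vma is
 * rejected, because it would still permit vm_insert_page().
 */
static int dmabuf_check_vma_flags(unsigned long vm_flags)
{
	return (vm_flags & VM_PFNMAP) ? 0 : -EINVAL;
}
```

With this model, a vma carrying only VM_MIXEDMAP is refused while one
carrying VM_PFNMAP passes.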

If we require VM_PFNMAP for ordinary page mappings, we also need to
disallow COW mappings, since it will not work on architectures that
don't have CONFIG_ARCH_HAS_PTE_SPECIAL (see the docs for vm_normal_page()).
Hm I figured everyone just uses MAP_SHARED for buffer objects since
COW really makes absolutely no sense. How would we enforce this?
Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
or allowing MIXEDMAP.
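
A minimal userspace sketch of that -EINVAL idea, assuming the usual
definition of is_cow_mapping() (a private mapping that may become
writable). The flag values are copied for self-containment and
bo_mmap_reject_cow() is a hypothetical name, not an existing kernel
function.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative copies of the vma flag bits (assumed values). */
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* COW mapping: not shared, but may be made writable. */
static bool is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical mmap-time check: refuse MAP_PRIVATE writable mappings. */
static int bo_mmap_reject_cow(unsigned long vm_flags)
{
	return is_cow_mapping(vm_flags) ? -EINVAL : 0;
}
```

A MAP_SHARED buffer mapping passes this check unchanged; only private,
potentially writable mappings are refused.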

Also worth noting is the comment in ttm_bo_mmap_vma_setup() with
possible performance implications with x86 + PAT + VM_PFNMAP + normal
pages. That's a very old comment, though, and might not be valid anymore.
I think that's why ttm has a page cache for these, because it indeed
sucks. The PAT changes on pages are rather expensive.
IIRC the page cache was implemented because of the slowness of the
caching mode transition itself, more specifically the wbinvd() call +
global TLB flush.
Yes, exactly that. The global TLB flush is what really breaks our neck here
from a performance perspective.

There is still an issue for iomem mappings, because the PAT validation
does a linear walk of the resource tree (lol) for every vm_insert_pfn.
But for i915 at least this is fixed by using the io_mapping
infrastructure, which does the PAT reservation only once when you set
up the mapping area at driver load.
Yes, I guess that was the issue that the comment describes, but the
issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.

Also TTM uses VM_PFNMAP right now for everything, so it can't be a
problem that hurts much :-)
Hmm, both 5.11 and drm-tip appear to still use MIXEDMAP?

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
Uh that's bad, because mixed maps pointing at struct page won't stop
gup. At least afaik.
Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
already seen tons of problems with the page cache.
On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.

But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
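
As a rough userspace model of the slow-path check referred to here: only
VM_IO and VM_PFNMAP make it bail out, so VM_MIXEDMAP alone does not. The
flag values are copied for self-containment, the function name is
hypothetical, and the real check_vma_flags() tests considerably more than
this.

```c
#include <errno.h>

/* Illustrative copies of the vma flag bits (assumed values). */
#define VM_PFNMAP   0x00000400UL
#define VM_IO       0x00004000UL
#define VM_MIXEDMAP 0x10000000UL

/*
 * Simplified model of the vma-flag gate in the gup slow path:
 * returns -EFAULT when the vma must not be pinned, 0 otherwise.
 */
static int gup_slow_path_vma_check(unsigned long vm_flags)
{
	if (vm_flags & (VM_IO | VM_PFNMAP))
		return -EFAULT;
	return 0;
}
```

Under this model a VM_MIXEDMAP-only vma gets past the flag gate, which is
the concern raised above; whether gup then succeeds depends on what the
page table walk finds.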

Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
vm_insert_mixed even works on iomem pfns. There's the devmap exception,
but we're not devmap. Worse ttm abuses some accidental codepath to smuggle
in hugepte support by intentionally not being devmap.

So I'm really not sure this works as we think it should. Maybe good to do
a quick test program on amdgpu with a buffer in system memory only and try
to do direct io into it. If it works, you have a problem, and a bad one.
That's probably impossible, since a quick git grep shows that pretty
much anything reasonable has special ptes: arc, arm, arm64, powerpc,
riscv, s390, sh, sparc, x86. I don't think you'll have a platform
where you can plug an amdgpu in and actually exercise the bug :-)
Hm. AFAIK _insert_mixed() doesn't set PTE_SPECIAL on system pages, so I
don't see what should be stopping gup to those?
If you have an arch with pte special we use insert_pfn(), which afaict
will use pte_mkspecial for the !devmap case. And ttm isn't devmap
(otherwise our hugepte abuse of devmap hugeptes would go rather
wrong).

So I think it stops gup. But I haven't verified at all. Would be good
if Christian can check this with some direct io to a buffer in system
memory.
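
The decision being described can be sketched as a small userspace model.
It reflects a reading of __vm_insert_mixed() and insert_pfn() in
mm/memory.c around 5.11 and should be treated as an assumption to verify
against the tree; the enum and function names are made up for this
example.

```c
#include <stdbool.h>

/* Which insertion path a vm_insert_mixed()-style call would take. */
enum mixed_path {
	PATH_INSERT_PAGE,        /* refcounted struct page, normal pte */
	PATH_INSERT_PFN_SPECIAL, /* pte_mkspecial(); no-op without pte special */
	PATH_INSERT_PFN_DEVMAP,  /* pte_mkdevmap() */
};

/*
 * Model: only architectures without pte special insert a valid,
 * non-devmap pfn as a normal refcounted page. Everything else goes
 * through the insert_pfn() path.
 */
static enum mixed_path pick_mixed_path(bool arch_has_pte_special,
				       bool pfn_is_devmap,
				       bool pfn_has_struct_page)
{
	if (!arch_has_pte_special && !pfn_is_devmap && pfn_has_struct_page)
		return PATH_INSERT_PAGE;
	return pfn_is_devmap ? PATH_INSERT_PFN_DEVMAP
			     : PATH_INSERT_PFN_SPECIAL;
}
```

On an architecture with pte special and a non-devmap pfn (the TTM case
discussed here), the model always lands on the special-pte path, even for
system pages.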

Hmm,

Docs (again, for vm_normal_page()) say:

 * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
 * page" backing, however the difference is that _all_ pages with a struct
 * page (that is, those where pfn_valid is true) are refcounted and considered
 * normal pages by the VM. The disadvantage is that pages are refcounted
 * (which can be slower and simply not an option for some PFNMAP users). The
 * advantage is that we don't have to follow the strict linearity rule of
 * PFNMAP mappings in order to support COWable mappings.

But it's true that __vm_insert_mixed() ends up in the insert_pfn() path,
so the above isn't really accurate. That makes me wonder whether, and if
so why, there could still be a significant performance difference between
MIXEDMAP and PFNMAP.

BTW regarding the TTM hugeptes, I don't think we ever landed that devmap hack, so they are (for the non-gup case) relying on vma_is_special_huge(). For the gup case, I think the bug is still there.

/Thomas

-Daniel


