On Thu, Mar 11, 2021 at 2:12 PM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:
>
> Hi!
>
> On 3/11/21 2:00 PM, Daniel Vetter wrote:
> > On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/1/21 3:09 PM, Daniel Vetter wrote:
> >>> On Mon, Mar 1, 2021 at 11:17 AM Christian König
> >>> <christian.koenig@xxxxxxx> wrote:
> >>>>
> >>>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
> >>>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
> >>>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
> >>>>>> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
> >>>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
> >>>>>>>> <thomas_os@xxxxxxxxxxxx> wrote:
> >>>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
> >>>>>>>>>> So I think it stops gup. But I haven't verified at all. Would
> >>>>>>>>>> be good if Christian can check this with some direct io to a
> >>>>>>>>>> buffer in system memory.
> >>>>>>>>> Hmm,
> >>>>>>>>>
> >>>>>>>>> Docs (again, vm_normal_page()) say:
> >>>>>>>>>
> >>>>>>>>>  * VM_MIXEDMAP mappings can likewise contain memory with or
> >>>>>>>>>  * without "struct page" backing, however the difference is that
> >>>>>>>>>  * _all_ pages with a struct page (that is, those where pfn_valid
> >>>>>>>>>  * is true) are refcounted and considered normal pages by the VM.
> >>>>>>>>>  * The disadvantage is that pages are refcounted (which can be
> >>>>>>>>>  * slower and simply not an option for some PFNMAP users). The
> >>>>>>>>>  * advantage is that we don't have to follow the strict linearity
> >>>>>>>>>  * rule of PFNMAP mappings in order to support COWable mappings.
> >>>>>>>>>
> >>>>>>>>> but it's true that __vm_insert_mixed() ends up in the
> >>>>>>>>> insert_pfn() path, so the above isn't really true, which makes
> >>>>>>>>> me wonder whether, and in that case why, there could still be a
> >>>>>>>>> significant performance difference between MIXEDMAP and PFNMAP.
> >>>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and
> >>>>>>>> see what sticks.
> >>>>>>>>
> >>>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed
> >>>>>>>>> that devmap hack, so they are (for the non-gup case) relying on
> >>>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is
> >>>>>>>>> still there.
> >>>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert
> >>>>>>>> functions do use PFN_DEV and all that. And I think that stops
> >>>>>>>> gup_fast from trying to find the underlying page.
> >>>>>>>> -Daniel
> >>>>>>> Hmm, perhaps it might, but I don't think so. The fix I tried out
> >>>>>>> was to set PFN_DEV | PFN_MAP for huge PTEs, which causes
> >>>>>>> pfn_devmap() to be true, and then
> >>>>>>> follow_devmap_pmd()->get_dev_pagemap() returns NULL and gup_fast()
> >>>>>>> backs off.
> >>>>>>>
> >>>>>>> In the end that would mean setting in stone that "if there is a
> >>>>>>> huge devmap page table entry for which we haven't registered any
> >>>>>>> devmap struct pages (get_dev_pagemap() returns NULL), we should
> >>>>>>> treat that as a 'special' huge page table entry".
> >>>>>>>
> >>>>>>> From what I can tell, all code calling get_dev_pagemap() already
> >>>>>>> does that; it's just a question of getting it accepted and
> >>>>>>> formalizing it.
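
Fwiw, that backoff is iirc this bit of the gup_fast path. Simplified
sketch of __gup_device_huge() from mm/gup.c of roughly that vintage,
trimmed and not verbatim:

/*
 * Simplified sketch (not verbatim): gup_fast's handler for devmap huge
 * entries. Every pfn covered by the entry must resolve to a registered
 * dev_pagemap, otherwise everything grabbed so far is undone and the
 * whole huge entry is rejected.
 */
static int __gup_device_huge(unsigned long pfn, unsigned long addr,
			     unsigned long end, unsigned int flags,
			     struct page **pages, int *nr)
{
	int nr_start = *nr;
	struct dev_pagemap *pgmap = NULL;

	do {
		struct page *page = pfn_to_page(pfn);

		pgmap = get_dev_pagemap(pfn, pgmap);
		if (unlikely(!pgmap)) {
			/* No registered devmap range: back off, i.e. the
			 * de-facto "special" treatment described above. */
			undo_dev_pagemap(nr, nr_start, flags, pages);
			return 0;
		}
		SetPageReferenced(page);
		pages[*nr] = page;
		if (unlikely(!try_grab_page(page, flags))) {
			undo_dev_pagemap(nr, nr_start, flags, pages);
			return 0;
		}
		(*nr)++;
		pfn++;
	} while (addr += PAGE_SIZE, addr != end);

	if (pgmap)
		put_dev_pagemap(pgmap);
	return 1;
}

Once this returns 0, gup_fast falls back to the slow path, which does
the same get_dev_pagemap() check in follow_devmap_pmd(). Which matches
your "all code calling get_dev_pagemap() already does that".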
> >>>>>> Oh I thought that's already how it works, since I didn't spot
> >>>>>> anything else that would block gup_fast from falling over. I guess
> >>>>>> we really would need some testcases to make sure direct i/o (that's
> >>>>>> the easiest to test) fails like we expect.
> >>>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
> >>>>> Otherwise pmd_devmap() will not return true and since there is no
> >>>>> pmd_special() things break.
> >>>> Is that maybe the issue we have seen with amdgpu and huge pages?
> >>> Yeah, essentially when you have a hugepte inserted by ttm, and it
> >>> happens to point at system memory, then gup will work on that. And
> >>> create all kinds of havoc.
> >>>
> >>>> Apart from that I'm lost, guys; that devmap and gup stuff is not
> >>>> something I have a good knowledge of apart from a one mile high view.
> >>> I'm not really better, hence it would be good to do a testcase and
> >>> see. This should provoke it:
> >>> - allocate a nicely aligned bo in system memory
> >>> - mmap it, again nicely aligned to 2M
> >>> - do some direct io from a filesystem into that mmap, which should
> >>>   trigger gup
> >>> - before the gup completes, free the mmap and bo so that ttm recycles
> >>>   the pages, which should trip up on the elevated refcount. If you
> >>>   wait until the direct io has completed, then I think nothing bad
> >>>   can be observed.
> >>>
> >>> Ofc if your amdgpu+hugepte issue is something else, then maybe we
> >>> have another issue.
> >>>
> >>> Also the usual caveat: I'm not an mm hacker either, so I might be
> >>> completely wrong.
> >>> -Daniel
> >> So I did the following quick experiment on vmwgfx, and it turns out
> >> that with it, fast gup never succeeds. Without the "| PFN_MAP", it
> >> typically succeeds.
> >>
> >> I should probably craft an RFC formalizing this.
> > Yeah I think that would be good. Maybe even more formalized if we also
> > switch over to VM_PFNMAP, since afaiui these pte flags here only stop
> > the fast gup path. And slow gup can still peek through VM_MIXEDMAP. Or
> > something like that.
> >
> > Otoh your description of when it only sometimes succeeds would
> > indicate my understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.
>
> My understanding from reading the vmf_insert_mixed() code is that iff
> the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's
> not consistent with the vm_normal_page() doc. For architectures without
> pte_special(), VM_PFNMAP must be used, and then we must also block COW
> mappings.
>
> If we can get someone to commit to verifying that the potential PAT WC
> performance issue is gone with PFNMAP, I can put together a series with
> that included.

Iirc when I checked there aren't many archs without pte_special, so I
guess that's why we luck out. Hopefully.

> As for existing userspace using COW TTM mappings, I once had a couple
> of test cases to verify that it actually worked, in particular together
> with huge PMDs and PUDs where breaking COW would imply splitting those,
> but I can't think of anything else actually wanting to do that other
> than by mistake.

Yeah, disallowing MAP_PRIVATE mappings would be another good thing to
lock down. Really doesn't make much sense.
-Daniel

> /Thomas
>
>
> > Christian, what's your take?
> > -Daniel
> >
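
Btw, to make my testcase recipe from earlier in the thread concrete,
roughly something like the below. Sketch only: bo_alloc_sysmem_mmap()
and bo_free() are made-up stand-ins for the driver-specific bo ioctls
(every driver spells those differently); the libaio part is real. The
point is that the bo gets freed while the O_DIRECT read may still be in
flight:

/*
 * Sketch of the "provoke gup vs ttm recycling" testcase. NOT runnable
 * as-is: the two bo_* helpers below are hypothetical. Build with -laio.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BO_SIZE (2UL << 20)	/* 2M, so ttm can insert a huge entry */

void *bo_alloc_sysmem_mmap(int drm_fd, size_t size);	/* hypothetical */
void bo_free(int drm_fd, void *map, size_t size);	/* hypothetical */

int main(void)
{
	int drm_fd = open("/dev/dri/renderD128", O_RDWR);
	/* 1) + 2): nicely aligned bo in system memory, mmapped 2M-aligned */
	void *map = bo_alloc_sysmem_mmap(drm_fd, BO_SIZE);

	/* 3): direct io from a filesystem into that mmap triggers gup on
	 * the bo pages as soon as the read is submitted */
	int fd = open("scratch.dat", O_RDWR | O_CREAT | O_DIRECT, 0600);
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;

	ftruncate(fd, BO_SIZE);
	io_setup(1, &ctx);
	io_prep_pread(&cb, fd, map, BO_SIZE, 0);
	io_submit(ctx, 1, cbs);

	/* 4): free mmap and bo before the read completes, so ttm recycles
	 * pages the direct io may still hold an elevated refcount on */
	bo_free(drm_fd, map, BO_SIZE);

	io_getevents(ctx, 1, 1, &ev, NULL);
	io_destroy(ctx);
	close(fd);
	close(drm_fd);
	return 0;
}

With the huge entry treated as special, the gup should instead simply
fail and the direct io error out (-EFAULT), which is the boring, safe
outcome.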
> >> /Thomas
> >>
> >> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> index 6dc96cf66744..72b6fb17c984 100644
> >> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>          pfn_t pfnt;
> >>          struct ttm_tt *ttm = bo->ttm;
> >>          bool write = vmf->flags & FAULT_FLAG_WRITE;
> >> +        struct dev_pagemap *pagemap;
> >>
> >>          /* Fault should not cross bo boundary. */
> >>          page_offset &= ~(fault_page_size - 1);
> >> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>          if ((pfn & (fault_page_size - 1)) != 0)
> >>                  goto out_fallback;
> >>
> >> +        /*
> >> +         * Huge entries must be special, that is marking them as devmap
> >> +         * with no backing device map range. If there is a backing
> >> +         * range, don't insert a huge entry.
> >> +         */
> >> +        pagemap = get_dev_pagemap(pfn, NULL);
> >> +        if (pagemap) {
> >> +                put_dev_pagemap(pagemap);
> >> +                goto out_fallback;
> >> +        }
> >> +
> >>          /* Check that memory is contiguous. */
> >>          if (!bo->mem.bus.is_iomem) {
> >>                  for (i = 1; i < fault_page_size; ++i) {
> >> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>                  }
> >>          }
> >>
> >> -        pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
> >> +        pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
> >>          if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
> >>                  ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
> >> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>          if (ret != VM_FAULT_NOPAGE)
> >>                  goto out_fallback;
> >>
> >> +#if 1
> >> +        {
> >> +                int npages;
> >> +                struct page *page;
> >> +
> >> +                npages = get_user_pages_fast_only(vmf->address, 1, 0, &page);
> >> +                if (npages == 1) {
> >> +                        DRM_WARN("Fast gup succeeded. Bad.\n");
> >> +                        put_page(page);
> >> +                } else {
> >> +                        DRM_INFO("Fast gup failed. Good.\n");
> >> +                }
> >> +        }
> >> +#endif
> >> +
> >>          return VM_FAULT_NOPAGE;
> >> out_fallback:
> >>          count_vm_event(THP_FAULT_FALLBACK);
> >>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch