On Thu, Mar 11, 2021 at 2:12 PM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:
>
> Hi!
>
> On 3/11/21 2:00 PM, Daniel Vetter wrote:
> > On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/1/21 3:09 PM, Daniel Vetter wrote:
> >>> On Mon, Mar 1, 2021 at 11:17 AM Christian König
> >>> <christian.koenig@xxxxxxx> wrote:
> >>>>
> >>>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
> >>>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
> >>>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
> >>>>>> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
> >>>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
> >>>>>>>> <thomas_os@xxxxxxxxxxxx> wrote:
> >>>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
> >>>>>>>>>> So I think it stops gup. But I haven't verified at all. Would
> >>>>>>>>>> be good if Christian can check this with some direct io to a
> >>>>>>>>>> buffer in system memory.
> >>>>>>>>> Hmm,
> >>>>>>>>>
> >>>>>>>>> Docs (again, vm_normal_page()) say:
> >>>>>>>>>
> >>>>>>>>>  * VM_MIXEDMAP mappings can likewise contain memory with or
> >>>>>>>>>  * without "struct page" backing, however the difference is that
> >>>>>>>>>  * _all_ pages with a struct page (that is, those where pfn_valid
> >>>>>>>>>  * is true) are refcounted and considered normal pages by the VM.
> >>>>>>>>>  * The disadvantage is that pages are refcounted (which can be
> >>>>>>>>>  * slower and simply not an option for some PFNMAP users). The
> >>>>>>>>>  * advantage is that we don't have to follow the strict linearity
> >>>>>>>>>  * rule of PFNMAP mappings in order to support COWable mappings.
> >>>>>>>>>
> >>>>>>>>> but it's true that __vm_insert_mixed() ends up in the
> >>>>>>>>> insert_pfn() path, so the above isn't really true, which makes
> >>>>>>>>> me wonder whether, and in that case why, there could still be a
> >>>>>>>>> significant performance difference between MIXEDMAP and PFNMAP.
> >>>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and
> >>>>>>>> see what sticks.
> >>>>>>>>
> >>>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed
> >>>>>>>>> that devmap hack, so they are (for the non-gup case) relying on
> >>>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is
> >>>>>>>>> still there.
> >>>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert
> >>>>>>>> functions do use PFN_DEV and all that. And I think that stops
> >>>>>>>> gup_fast from trying to find the underlying page.
> >>>>>>>> -Daniel
> >>>>>>> Hmm, perhaps it might, but I don't think so. The fix I tried out
> >>>>>>> was to set PFN_DEV | PFN_MAP for huge PTEs, which causes
> >>>>>>> pfn_devmap() to be true, and then
> >>>>>>> follow_devmap_pmd()->get_dev_pagemap() returns NULL and gup_fast()
> >>>>>>> backs off.
> >>>>>>>
> >>>>>>> In the end that would mean setting in stone that "if there is a
> >>>>>>> huge devmap page table entry for which we haven't registered any
> >>>>>>> devmap struct pages (get_dev_pagemap() returns NULL), we should
> >>>>>>> treat that as a 'special' huge page table entry".
> >>>>>>>
> >>>>>>> From what I can tell, all code calling get_dev_pagemap() already
> >>>>>>> does that; it's just a question of getting it accepted and
> >>>>>>> formalizing it.
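
Fwiw, that backoff is iirc this bit of the gup_fast path. Simplified
sketch of __gup_device_huge() from mm/gup.c of roughly that vintage,
trimmed and not verbatim:

/*
 * Simplified sketch (not verbatim): gup_fast's handler for devmap huge
 * entries. Every pfn covered by the entry must resolve to a registered
 * dev_pagemap, otherwise everything grabbed so far is undone and the
 * whole huge entry is rejected.
 */
static int __gup_device_huge(unsigned long pfn, unsigned long addr,
			     unsigned long end, unsigned int flags,
			     struct page **pages, int *nr)
{
	int nr_start = *nr;
	struct dev_pagemap *pgmap = NULL;

	do {
		struct page *page = pfn_to_page(pfn);

		pgmap = get_dev_pagemap(pfn, pgmap);
		if (unlikely(!pgmap)) {
			/* No registered devmap range: back off, i.e. the
			 * de-facto "special" treatment described above. */
			undo_dev_pagemap(nr, nr_start, flags, pages);
			return 0;
		}
		SetPageReferenced(page);
		pages[*nr] = page;
		if (unlikely(!try_grab_page(page, flags))) {
			undo_dev_pagemap(nr, nr_start, flags, pages);
			return 0;
		}
		(*nr)++;
		pfn++;
	} while (addr += PAGE_SIZE, addr != end);

	if (pgmap)
		put_dev_pagemap(pgmap);
	return 1;
}

Once this returns 0, gup_fast falls back to the slow path, which does
the same get_dev_pagemap() check in follow_devmap_pmd(). Which matches
your "all code calling get_dev_pagemap() already does that".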
> >>>>>> Oh I thought that's already how it works, since I didn't spot
> >>>>>> anything else that would block gup_fast from falling over. I guess
> >>>>>> we really would need some testcases to make sure direct i/o (that's
> >>>>>> the easiest to test) fails like we expect.
> >>>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
> >>>>> Otherwise pmd_devmap() will not return true and since there is no
> >>>>> pmd_special() things break.
> >>>> Is that maybe the issue we have seen with amdgpu and huge pages?
> >>> Yeah, essentially when you have a hugepte inserted by ttm, and it
> >>> happens to point at system memory, then gup will work on that. And
> >>> create all kinds of havoc.
> >>>
> >>>> Apart from that I'm lost, guys; that devmap and gup stuff is not
> >>>> something I have a good knowledge of apart from a one mile high view.
> >>> I'm not really better, hence it would be good to do a testcase and
> >>> see. This should provoke it:
> >>> - allocate a nicely aligned bo in system memory
> >>> - mmap it, again nicely aligned to 2M
> >>> - do some direct io from a filesystem into that mmap, which should
> >>>   trigger gup
> >>> - before the gup completes, free the mmap and bo so that ttm recycles
> >>>   the pages, which should trip up on the elevated refcount. If you
> >>>   wait until the direct io has completed, then I think nothing bad
> >>>   can be observed.
> >>>
> >>> Ofc if your amdgpu+hugepte issue is something else, then maybe we
> >>> have another issue.
> >>>
> >>> Also the usual caveat: I'm not an mm hacker either, so I might be
> >>> completely wrong.
> >>> -Daniel
> >> So I did the following quick experiment on vmwgfx, and it turns out
> >> that with it, fast gup never succeeds. Without the "| PFN_MAP", it
> >> typically succeeds.
> >>
> >> I should probably craft an RFC formalizing this.
> > Yeah I think that would be good. Maybe even more formalized if we also
> > switch over to VM_PFNMAP, since afaiui these pte flags here only stop
> > the fast gup path. And slow gup can still peek through VM_MIXEDMAP. Or
> > something like that.
> >
> > Otoh your description of when it only sometimes succeeds would
> > indicate my understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.
>
> My understanding from reading the vmf_insert_mixed() code is that iff
> the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's
> not consistent with the vm_normal_page() doc. For architectures without
> pte_special(), VM_PFNMAP must be used, and then we must also block COW
> mappings.
>
> If we can get someone to commit to verifying that the potential PAT WC
> performance issue is gone with PFNMAP, I can put together a series with
> that included.

Iirc when I checked there aren't many archs without pte_special, so I
guess that's why we luck out. Hopefully.

> As for existing userspace using COW TTM mappings, I once had a couple
> of test cases to verify that it actually worked, in particular together
> with huge PMDs and PUDs where breaking COW would imply splitting those,
> but I can't think of anything else actually wanting to do that other
> than by mistake.

Yeah, disallowing MAP_PRIVATE mappings would be another good thing to
lock down. Really doesn't make much sense.
-Daniel

> /Thomas
>
>
> > Christian, what's your take?
> > -Daniel
> >
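
Btw, to make my testcase recipe from earlier in the thread concrete,
roughly something like the below. Sketch only: bo_alloc_sysmem_mmap()
and bo_free() are made-up stand-ins for the driver-specific bo ioctls
(every driver spells those differently); the libaio part is real. The
point is that the bo gets freed while the O_DIRECT read may still be in
flight:

/*
 * Sketch of the "provoke gup vs ttm recycling" testcase. NOT runnable
 * as-is: the two bo_* helpers below are hypothetical. Build with -laio.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BO_SIZE (2UL << 20)	/* 2M, so ttm can insert a huge entry */

void *bo_alloc_sysmem_mmap(int drm_fd, size_t size);	/* hypothetical */
void bo_free(int drm_fd, void *map, size_t size);	/* hypothetical */

int main(void)
{
	int drm_fd = open("/dev/dri/renderD128", O_RDWR);
	/* 1) + 2): nicely aligned bo in system memory, mmapped 2M-aligned */
	void *map = bo_alloc_sysmem_mmap(drm_fd, BO_SIZE);

	/* 3): direct io from a filesystem into that mmap triggers gup on
	 * the bo pages as soon as the read is submitted */
	int fd = open("scratch.dat", O_RDWR | O_CREAT | O_DIRECT, 0600);
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;

	ftruncate(fd, BO_SIZE);
	io_setup(1, &ctx);
	io_prep_pread(&cb, fd, map, BO_SIZE, 0);
	io_submit(ctx, 1, cbs);

	/* 4): free mmap and bo before the read completes, so ttm recycles
	 * pages the direct io may still hold an elevated refcount on */
	bo_free(drm_fd, map, BO_SIZE);

	io_getevents(ctx, 1, 1, &ev, NULL);
	io_destroy(ctx);
	close(fd);
	close(drm_fd);
	return 0;
}

With the huge entry treated as special, the gup should instead simply
fail and the direct io error out (-EFAULT), which is the boring, safe
outcome.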
> >> /Thomas
> >>
> >> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> index 6dc96cf66744..72b6fb17c984 100644
> >> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>          pfn_t pfnt;
> >>          struct ttm_tt *ttm = bo->ttm;
> >>          bool write = vmf->flags & FAULT_FLAG_WRITE;
> >> +        struct dev_pagemap *pagemap;
> >>
> >>          /* Fault should not cross bo boundary. */
> >>          page_offset &= ~(fault_page_size - 1);
> >> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>          if ((pfn & (fault_page_size - 1)) != 0)
> >>                  goto out_fallback;
> >>
> >> +        /*
> >> +         * Huge entries must be special, that is marking them as devmap
> >> +         * with no backing device map range. If there is a backing
> >> +         * range, don't insert a huge entry.
> >> +         */
> >> +        pagemap = get_dev_pagemap(pfn, NULL);
> >> +        if (pagemap) {
> >> +                put_dev_pagemap(pagemap);
> >> +                goto out_fallback;
> >> +        }
> >> +
> >>          /* Check that memory is contiguous. */
> >>          if (!bo->mem.bus.is_iomem) {
> >>                  for (i = 1; i < fault_page_size; ++i) {
> >> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>                  }
> >>          }
> >>
> >> -        pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
> >> +        pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
> >>          if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
> >>                  ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
> >> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> >>          if (ret != VM_FAULT_NOPAGE)
> >>                  goto out_fallback;
> >>
> >> +#if 1
> >> +        {
> >> +                int npages;
> >> +                struct page *page;
> >> +
> >> +                npages = get_user_pages_fast_only(vmf->address, 1, 0, &page);
> >> +                if (npages == 1) {
> >> +                        DRM_WARN("Fast gup succeeded. Bad.\n");
> >> +                        put_page(page);
> >> +                } else {
> >> +                        DRM_INFO("Fast gup failed. Good.\n");
> >> +                }
> >> +        }
> >> +#endif
> >> +
> >>          return VM_FAULT_NOPAGE;
> >> out_fallback:
> >>          count_vm_event(THP_FAULT_FALLBACK);
> >>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch