On Fri, Jan 28, 2022 at 11:26:16AM +0800, Peter Xu wrote: > diff --git a/mm/gup.c b/mm/gup.c > index 8ebc04058e97..b3a109114968 100644 > +++ b/mm/gup.c > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma, > static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > pte_t *pte, unsigned int flags) > { > - /* No page to get reference */ > - if (flags & (FOLL_GET | FOLL_PIN)) > - return -EFAULT; > - > if (flags & FOLL_TOUCH) { > pte_t entry = *pte; > > Then the convertion will make sure it behaves like before. There is a very > slight side effect on FOLL_TOUCH above but it shouldn't be a major concern, at > least not in the vfio use case You mean that it touches the page even though it will not return the page? This is what devmap is doing already :\ > IMHO the other site follow_devmap_pud() is tricker, because we can't simply > remove the two lines here: Sure, but I didn't ask to fix everything, just trend toward something sane - if you drop the 4 line above then your patch will do what it needs to do, we don't need to jump into devmap to fix this bug. > - /* > - * device mapped pages can only be returned if the > - * caller will manage the page reference count. > - * > - * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here: > - */ > - if (!(flags & (FOLL_GET | FOLL_PIN))) > - return ERR_PTR(-EEXIST); > - > pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; > *pgmap = get_dev_pagemap(pfn, *pgmap); > if (!*pgmap) > > Because with the old code we'll 100% return -EEXIST as long as we don't have > GET|PIN (e.g. mlock), but then later when the two lines removed we'll start to > try call get_dev_pagemap(), if it returned something we'll start to fetch the > page. This is part of the "return codes don't make any sense" I was complaining about.. devmap has a populated PTE and a struct page, it shouldn't be returning -EFAULT. Failure to get the struct page is either EEXIST or NULL (take a fault to resolve the race). [as we've discussed before the get_dev_pagemap shouldn't even be here at all, I suppose this is why it is so weird looking] > That part looks fine, __get_user_pages() will still continue but ignore the > page* returned. IMHO it would be clearer if the __get_user_pages() checked the struct page zone and rejected the cases it doesn't want than to sprinkle these checks all over the place based on PMD flags. We want to go in this direction anyhow so we can remove the pmd_dax() flag.. > But what if get_dev_pagemap() returned NULL? Then we'll start to return > -EFAULT where we used to constantly return -EEXIST. IHMO most of the EFAULTs in the walker call chain are pretty questionable.. they should ideally be EEXIST or NULL. eg why does get_gate_page() return EFAULT for a non-present page? That should be NULL and faultin_page() should generate the EFAULT > > For instance, should we return -EEXIST in other cases like devmap and > > more that have PTEs present but are not returnable? It is already > > really confusing (and probably wrong) that we sometimes return NULL > > for populated PTEs. NULL causes faultin_page() to be called on > > something we already know is populated, seems like nonsense. > > Could you elaborate what's the "return NULL" confusion you mentioned? Returning null for a popoulated PTE as a permanent failure is nonsense. NULL means 'call faultin_page()', it should be used on populated pages that ought to have a struct page but for some racy reason is unavailable. The faultin should resolve that race and make the pte either fully non-present or restore the struct page. > > It is not about useful, it is about understandable and > > consistency.. There should be clear rules about when and what errno to > > return in every case, we should trend in that direction. > > I see that both you and John has a strong preference on at least the > WARN_ON_ONCE() in the patch. Well, I wanted to see the WARN_ON_ONCE avoided by making it unnecessary by construction, not just deleted :\ Jason