Re: [PATCH] mm: Fix invalid page pointer returned with FOLL_PIN gups

Peter Xu <peterx@xxxxxxxxxx> · Fri, 28 Jan 2022 11:26:16 +0800

On Thu, Jan 27, 2022 at 10:31:27PM -0400, Jason Gunthorpe wrote:
> We don't, but that isn't the point - it is about understandability not
> utility.
> 
> There are only two call chains that involve follow_page_mask(), one
> always wants the -EEXIST and the other wants -EEXIST sometimes and
> sometimes wants to return -EFAULT for -EEXIST
> 
> The solution is not to change follow_page_mask() but to be consistent
> throughout that -EEXIST means we encountered a populated but
> non-struct page PTE and the single place that wants -EEXIST to be
> reported as -EFAULT (__get_user_pages) should make that
> transformation.

Currently -EEXIST is not defined like that.  To me, it's defined as:

  This pte is populated but there's no page struct that binds to it. Let's skip
  this pte and continue (because we've checked there's no **pages requested).

That services use cases like mlock() very well.  IMHO that's fine.

If I understand correctly, what is proposed above is actually to redefine
-EEXIST into:

  This pte is populated but there's no page struct that binds to it.

Then the caller will decide how we should react on it.

I kind of agree that the latter one looks cleaner, but I don't have extremely
strong reason to refactor it immediately.

More importantly, I still don't know what will be the side effect if we
immediately remove the two checks in follow_page_mask(), and how much changeset
we'll need to redefine -EEXIST here.

Let's discuss the two sites that returns -EEXIST one by one..

follow_pfn_pte() should be relatively easy.  If with the -EEXIST to -EFAULT
convertion proposed in __get_user_pages(), we could have dropped below two
lines:

---8<---

diff --git a/mm/gup.c b/mm/gup.c
index 8ebc04058e97..b3a109114968 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
                pte_t *pte, unsigned int flags)
 {
-       /* No page to get reference */
-       if (flags & (FOLL_GET | FOLL_PIN))
-               return -EFAULT;
-
        if (flags & FOLL_TOUCH) {
                pte_t entry = *pte;
---8<---

Then the convertion will make sure it behaves like before.  There is a very
slight side effect on FOLL_TOUCH above but it shouldn't be a major concern, at
least not in the vfio use case - I believe most of the guest pages should
already have the dirty bit set anyways, and since they're pinned they won't be
a good candidate for reclaims already.

IMHO the other site follow_devmap_pud() is tricker, because we can't simply
remove the two lines here:

---8<---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..0fe2253f922c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1171,15 +1171,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
        if (flags & FOLL_TOUCH)
                touch_pud(vma, addr, pud, flags);
 
-       /*
-        * device mapped pages can only be returned if the
-        * caller will manage the page reference count.
-        *
-        * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
-        */
-       if (!(flags & (FOLL_GET | FOLL_PIN)))
-               return ERR_PTR(-EEXIST);
-
        pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
        *pgmap = get_dev_pagemap(pfn, *pgmap);
        if (!*pgmap)
---8<---

Because with the old code we'll 100% return -EEXIST as long as we don't have
GET|PIN (e.g. mlock), but then later when the two lines removed we'll start to
try call get_dev_pagemap(), if it returned something we'll start to fetch the
page.

That part looks fine, __get_user_pages() will still continue but ignore the
page* returned.

But what if get_dev_pagemap() returned NULL?  Then we'll start to return
-EFAULT where we used to constantly return -EEXIST.  I'm afraid it'll stop
working for some mlock() or MAP_POPULATE case that Kirill wanted to fix before,
then it could break things.

We could definitely add more code to fix up above to make sure mlock() on huge
pud of DAX will keep working.  However I hope I have somehow clarified the
point that it's not as trivial as it looks like to change the semantics of
-EEXIST for follow_page_mask() immediately, and to fix the vfio mdev breakage
we'd better merge the oneliner fix first even if we want to rework -EEXIST.

Summary: I see no problem at all on either way, it's just that for this bug fix
it's straightforward to keep the -EEXIST definition, rather than clean it up,
because otherwise the change will grow.  Note that we will need to backport
this patch, I think it's proper we keep the changeset small and leave the
cleanups for later.

> 
> For instance, should we return -EEXIST in other cases like devmap and
> more that have PTEs present but are not returnable? It is already
> really confusing (and probably wrong) that we sometimes return NULL
> for populated PTEs. NULL causes faultin_page() to be called on
> something we already know is populated, seems like nonsense.

Could you elaborate what's the "return NULL" confusion you mentioned?

> 
> I certainly don't want to see every place like that doing some if to
> transform -EEXIST into -EFAULT.
> 
> > > Callers that want an early failure must pass in NULL for pages, it is
> > > just that simple. It has nothing to do with the FOLL flags.
> > > 
> > > A WARN_ON would be appropriate to compare the FOLL flags against the
> > > pages. eg FOLL_GET without a pages is nonsense and should be
> > > immediately aborted. On the other hand, we avoid this by construction
> > > internal to gup.c
> > 
> > We have something like that already, although it's only a VM_BUG_ON() not a
> > BUG_ON() or WARN_ON() at the entry of __get_user_pages():
> > 
> > 	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
> 
> Right
> 
> > I believe this works too and I think I get your point, but as stated above it's
> > just not used yet so the path is not useful to any real code path.
> 
> It is not about useful, it is about understandable and
> consistency.. There should be clear rules about when and what errno to
> return in every case, we should trend in that direction.

I see that both you and John has a strong preference on at least the
WARN_ON_ONCE() in the patch.

Do you think it's okay I repost with only the one-liner fix, which will keep
the Fixes but drop the WARN_ON_ONCE?  Then we can leave the rest as follow up.

For my own preference, I really like to keep the patch as-is, because as I
mentioned it helps to identify issues with existing code already (it'll even
help the reader when looking at the -EEXIST handling because that's not
obvious).  But I don't strongly ask for that, I still sincerely wish the
discussion won't block the real simple bugfix.

Thanks,

-- 
Peter Xu