On Thu, Jul 27, 2023 at 03:28:49PM +0200, David Hildenbrand wrote:
> > > > Therefore, when obtaining pages through the follow_trans_huge_pmd
> > > > interface, add the FOLL_FORCE flag to count the pages corresponding to
> > > > PROTNONE to solve the above problem.
> > > >
> > >
> > > We really want to avoid the usage of FOLL_FORCE, and ideally limit it
> > > to ptrace only.
> >
> > Fundamentally, when removing FOLL_NUMA we already assumed that !FORCE
> > means FOLL_NUMA.  It means to me that after the removal it's not possible
> > for a gup walker to say "it's not FORCEd, but I don't want to trigger
> > NUMA, just get the page".
> >
> > Is that what we want?  Shall we document in FOLL_FORCE that we intend to
> > enforce NUMA balancing as long as !FORCE?
>
> That was the idea, yes. I could have sworn we had that at least in some
> patch description.
>
> Back then, I played with special-casing gup_can_follow_protnone() on
> FOLL_GET | FOLL_PIN. But it's all just best guesses.
>
> Can always be added if deemed necessary and worth it.
>
> Here, it's simply an abuse of that GUP function that I wasn't aware of --
> otherwise I'd have removed that beforehand.
>
> >
> > >
> > > >
> > > > Signed-off-by: liubo <liubo254@xxxxxxxxxx>
> > > > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> > > > ---
> > > >  fs/proc/task_mmu.c | 6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index c1e6531cb02a..ed08f9b869e2 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > >          bool migration = false;
> > > >
> > > >          if (pmd_present(*pmd)) {
> > > > -                /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > > -                page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > > +                /* FOLL_DUMP will return -EFAULT on huge zero page
> > > > +                 * FOLL_FORCE follow a PROT_NONE mapped page
> > > > +                 */
> > > > +                page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
> > > >          } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > > >                  swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > Might do as an easy fix. But we really should get rid of that
> > > absolutely disgusting usage of follow_trans_huge_pmd().
> > >
> > > We don't need 99% of what follow_trans_huge_pmd() does here.
> > >
> > > Would the following also fix your issue?
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 507cd4e59d07..fc744964816e 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -587,8 +587,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > >          bool migration = false;
> > >
> > >          if (pmd_present(*pmd)) {
> > > -                /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > -                page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > +                page = vm_normal_page_pmd(vma, addr, *pmd);
> > >          } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > >                  swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > It also skips the shared zeropage and pmd_devmap().
> > >
> > > Otherwise, a simple pmd_page(*pmd) + is_huge_zero_pmd(*pmd) check will do, but I
> > > suspect vm_normal_page_pmd() might be what we actually want to have here.
> > >
> > > Because smaps_pte_entry() properly checks for vm_normal_page().
> > There are indeed some very trivial details in vm_normal_page_pmd() that
> > are different, but maybe not so relevant.  E.g.,
> >
> >         if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
> >                 return -ENOMEM;
>
> Note that we're not even passing FOLL_GET | FOLL_PIN. Because we're not
> actually doing GUP. So the refcount is not that relevant.
>
> >
> >         if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> >                 return -EREMOTEIO;
> >
> > I'm not sure whether the p2pdma page would matter in any form here.  E.g.,
> > whether it can be mapped privately.
>
> Good point, but I don't think that people messing with GUP even imagined
> that we would call that function from a !GUP place.
>
> This was wrong from the very start. If we're not in GUP, we shouldn't call
> GUP functions.

My understanding is that !GET && !PIN is still called gup.. otherwise we
wouldn't need GET at all and it could just always be implied.

The other proof is that try_grab_page() doesn't fail hard on !GET && !PIN, so
I don't know whether that usage is really "wrong"..

Back to the topic: I'd say either of the patches looks good to solve the
problem.  If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess the
vm_normal_page_pmd() proposed here will also work on them, so I don't see
anything wrong with the 2nd one yet.

It indeed looks nicer not to have FOLL_FORCE here, but it also makes me wonder
whether we should document the NUMA behavior for FOLL_* somewhere, because
right now !FOLL_FORCE implies NUMA balancing enforcement, which is not obvious
to me..

Looking further at that aspect, see follow_page(): previously it could follow
a protnone page (as it never applies FOLL_NUMA), but now it won't (it never
applies FOLL_FORCE either, so it seems to "accidentally" imply FOLL_NUMA now).
Not sure whether that's intended, though..

-- 
Peter Xu
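
For reference, a rough sketch of the behavior being discussed, paraphrased and
simplified from the post-474098edac26 GUP code (not a verbatim copy of the
kernel sources):

        /* Only FOLL_FORCE lets GUP look through a protnone (NUMA-hint) entry. */
        static inline bool gup_can_follow_protnone(unsigned int flags)
        {
                return flags & FOLL_FORCE;
        }

        /* In follow_trans_huge_pmd(), a protnone PMD is then skipped: */
        if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
                return NULL;    /* smaps_pmd_entry() sees NULL and skips the THP */

This is why smaps can under-count THPs that currently carry NUMA-hinting
(PROT_NONE) protection, and why either passing FOLL_FORCE or switching to
vm_normal_page_pmd() (which takes no FOLL_* flags at all) restores the
accounting.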