Re: A mapcount riddle

On Tue, Jan 24, 2023 at 03:29:53PM -0800, Yang Shi wrote:
> On Tue, Jan 24, 2023 at 3:00 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > Q How can a page be mapped into multiple processes and have a
> > >   mapcount of 1?
> > >
> > > A It is a hugetlb page referenced by a shared PMD.
> > >
> > > I was looking to expose some basic information about PMD sharing via
> > > /proc/smaps.  After adding the code, I started a couple of processes
> > > sharing a large hugetlb mapping that would result in the use of
> > > shared PMDs.  When I looked at the output of /proc/smaps, I saw
> > > my new metric counting the number of shared PMDs.  However, what
> > > stood out was that the entire mapping was listed as Private_Hugetlb.
> > > WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> > > decides between Private_Hugetlb and Shared_Hugetlb with this code:
> > >
> > >       if (page) {
> > >               int mapcount = page_mapcount(page);
> > >
> > >               if (mapcount >= 2)
> > >                       mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > >               else
> > >                       mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > >       }
> >
> > This is definitely unfortunate..
> >
> > >
> > > After spending some time looking for issues in the page_mapcount code,
> > > I came to the realization that the mapcount of hugetlb pages referenced
> > > only by a shared PMD would be 1 no matter how many processes had mapped
> > > the page.  When a page is first faulted, the mapcount is set to 1.  When
> > > the page is faulted in other processes, the shared PMD is simply added to
> > > their page tables, so the mapcount never increases.
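
To make that concrete, here is a minimal userspace sketch (the file name and
sizes are illustrative; it assumes a hugetlbfs mount at /dev/hugepages, 2MB
huge pages, and enough pages reserved to back a 1GB mapping, since pmd
sharing only applies to PUD_SIZE-aligned, PUD_SIZE-sized ranges of a
MAP_SHARED hugetlb mapping -- and mmap() does not guarantee that alignment,
so a robust reproducer may need MAP_FIXED at a pre-aligned address):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define MAP_LEN	(1UL << 30)	/* 1GB: one PUD's worth of PMDs */

	/* Print only the hugetlb lines of our own smaps. */
	static void dump_hugetlb_smaps(void)
	{
		char line[256];
		FILE *f = fopen("/proc/self/smaps", "r");

		if (!f)
			return;
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "Hugetlb"))
				printf("[%d] %s", (int)getpid(), line);
		fclose(f);
	}

	int main(void)
	{
		int fd = open("/dev/hugepages/mapcount-riddle",
			      O_CREAT | O_RDWR, 0600);
		pid_t pid;
		char *p;

		if (fd < 0 || ftruncate(fd, MAP_LEN)) {
			perror("setup");
			return 1;
		}

		pid = fork();
		/* Parent and child each mmap and fault the file on their
		 * own, so each goes through hugetlb_fault() and can pick
		 * up the already-populated shared pmd. */
		p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		p[0] = 1;	/* fault in the first huge page */
		dump_hugetlb_smaps();

		if (pid > 0) {
			wait(NULL);
			unlink("/dev/hugepages/mapcount-riddle");
		}
		return 0;
	}

With the behavior described above, both processes report the entire range
under Private_Hugetlb even though it is clearly shared.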
> > >
> > > At first thought this seems bad.  However, I believe this has been the
> > > behavior since hugetlb PMD sharing was introduced in 2006, and I am
> > > unaware of any reported issues.  I did an audit of code looking at
> > > mapcount.  In addition to the above issue with smaps, there appears
> > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > without appropriate privilege.
> > >
> > >       /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > >       if (flags & (MPOL_MF_MOVE_ALL) ||
> > >           (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > >               if (isolate_hugetlb(page, qp->pagelist) &&
> > >                       (flags & MPOL_MF_STRICT))
> > >                       /*
> > >                        * Failed to isolate page but allow migrating pages
> > >                        * which have been queued.
> > >                        */
> > >                       ret = 1;
> > >       }
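
As a hedged illustration of how that path is reached from userspace
(assuming <numaif.h> from libnuma, at least two NUMA nodes, and a shared
mapping like the sketch above; the helper name is hypothetical): an
unprivileged caller passes MPOL_MF_MOVE, which is only supposed to migrate
pages mapped solely by the caller, yet the page_mapcount(page) == 1 check
treats pages mapped via a shared PMD as unshared and queues them for
migration anyway:

	#include <numaif.h>	/* mbind(); link with -lnuma */

	/* Hypothetical helper: bind [p, p+len) to "node" and ask the
	 * kernel to migrate the pages already resident elsewhere. */
	static long move_to_node(void *p, unsigned long len, int node)
	{
		unsigned long nodemask = 1UL << node;

		/* MPOL_MF_MOVE needs no special privilege; only
		 * MPOL_MF_MOVE_ALL requires CAP_SYS_NICE. */
		return mbind(p, len, MPOL_BIND, &nodemask,
			     sizeof(nodemask) * 8, MPOL_MF_MOVE);
	}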
> > >
> > > I will prepare fixes for both of these.  However, I wanted to ask
> > > whether anyone has ideas about other potential issues with this.
> >
> > This made me wonder whether things should be checked before the sharing
> > ever happens.  E.g., when trying to share a pmd, would it make sense to
> > check the vma mempolicy first?
> >
> > Then the question is: if pmd sharing only happens between vmas that share
> > the same memory policy, would the above mapcount==1 check be acceptable
> > even if the page is shared by multiple processes?
> 
> I don't think so. One process might change its policy, for example by
> binding to another node, which would then result in migration of the
> hugepage due to the incorrect mapcount. The above example code pasted by
> Mike actually comes from mbind if I remember correctly.

Yes, or any page migration.  The above was purely a wild idea: we share the
pmd based on matching vma attributes first (shared, mapping alignment,
etc.), and that check could also take the mempolicy into account.  Then,
when migrating one page on a shared pmd, a decision could be made for all
sharers even with mapcount==1, because that single mapcount would stand for
all the sharers of the page as long as they share the same mempolicy.

If that idea were applied, we would also need to unshare during mbind()
when the mempolicy of a hugetlb vma changes along this path: the moment the
mempolicy changes, the vma attributes no longer match, so the pmd sharing
no longer holds.

But please also ignore this whole thing - I don't think it would resolve
the generic mapcount problem with pmd sharing no matter what.  It's just
something that came to mind when I read it.

> 
> I'm wondering whether we could use the refcount instead of the mapcount
> to determine whether a hugetlb page is shared or not, assuming refcounting
> for a hugetlb page behaves similarly to a base page (incremented when
> mapped by a new process or pinned). If it is pinned (for example, by GUP)
> we can't migrate it either.

I think the refcount has the same issue, because it is not incremented
either when the pmd page is shared.

-- 
Peter Xu




