On Wed, Jan 25, 2023 at 8:02 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Tue, Jan 24, 2023 at 03:29:53PM -0800, Yang Shi wrote:
> > On Tue, Jan 24, 2023 at 3:00 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > > Q How can a page be mapped into multiple processes and have a
> > > > mapcount of 1?
> > > >
> > > > A It is a hugetlb page referenced by a shared PMD.
> > > >
> > > > I was looking to expose some basic information about PMD sharing via
> > > > /proc/smaps. After adding the code, I started a couple processes
> > > > sharing a large hugetlb mapping that would result in the use of
> > > > shared PMDs. When I looked at the output of /proc/smaps, I saw
> > > > my new metric counting the number of shared PMDs. However, what
> > > > stood out was that the entire mapping was listed as Private_Hugetlb.
> > > > WTH??? It certainly was shared! The routine smaps_hugetlb_range
> > > > decides between Private_Hugetlb and Shared_Hugetlb with this code:
> > > >
> > > > 	if (page) {
> > > > 		int mapcount = page_mapcount(page);
> > > >
> > > > 		if (mapcount >= 2)
> > > > 			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > 		else
> > > > 			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > 	}
> > >
> > > This is definitely unfortunate..
> > >
> > > > After spending some time looking for issues in the page_mapcount code,
> > > > I came to the realization that the mapcount of hugetlb pages only
> > > > referenced by a shared PMD would be 1 no matter how many processes had
> > > > mapped the page. When a page is first faulted, the mapcount is set to 1.
> > > > When faulted in other processes, the shared PMD is added to the page
> > > > table of the other processes. No increase of mapcount will occur.
> > > >
> > > > At first thought this seems bad. However, I believe this has been the
> > > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > > unaware of any reported issues. I did an audit of code looking at
> > > > mapcount. In addition to the above issue with smaps, there appears
> > > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > > without appropriate privilege.
> > > >
> > > > 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > > 	if (flags & (MPOL_MF_MOVE_ALL) ||
> > > > 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > > 		if (isolate_hugetlb(page, qp->pagelist) &&
> > > > 		    (flags & MPOL_MF_STRICT))
> > > > 			/*
> > > > 			 * Failed to isolate page but allow migrating pages
> > > > 			 * which have been queued.
> > > > 			 */
> > > > 			ret = 1;
> > > > 	}
> > > >
> > > > I will prepare fixes for both of these. However, I wanted to ask if
> > > > anyone has ideas about other potential issues with this?
> > >
> > > This reminded me whether things should be checked already before this
> > > happens. E.g. when trying to share pmd, whether it makes sense to check
> > > vma mempolicy before doing so?
> > >
> > > Then the question is if pmd sharing only happens with the vma that shares
> > > the same memory policy, whether above mapcount==1 check would be acceptable
> > > even if it's shared by multiple processes.
> >
> > I don't think so. One process might change its policy, for example,
> > bind to another node, then result in migration for the hugepage due to
> > the incorrect mapcount. The above example code pasted by Mike actually
> > comes from mbind if I remember correctly.
>
> Yes, or any page migrations. Above was a purely wild idea that we share
> pmd based on vma attributes matching first (shared, mapping alignments,
> etc.). It can also take mempolicy into account so that when migrating one
> page on the shared pmd, one can make a decision for all with mapcount==1
> because that single mapcount may stand for all the sharers of the page as
> long as they share the same mempolicy.
>
> If above idea applies, we'll also need to unshare during mbind() when the
> mempolicy of vma changes for hugetlb in this path, because right after the
> mempolicy changed the vma attribute changed, so pmd sharing doesn't hold.

Makes sense to me. The vmas with different mempolicies can't be merged, so
they shouldn't share a PMD either.

> But please also ignore this whole thing - I don't think that'll resolve the
> generic problem of mapcount issue on pmd sharing no matter what. It's just
> something come up to mind when I read it.
>
> > I'm wondering whether we could use refcount instead of mapcount to
> > determine if hugetlb page is shared or not, assuming refcounting for
> > hugetlb page behaves similar to base page (inc when mapped by a new
> > process or pinned). If it is pinned (for example, GUP) we can't
> > migrate it either.
>
> I think refcount has the same issue because it's not accounted either when
> pmd page is shared.

Good to know that. It is a little bit counter-intuitive TBH.

> --
> Peter Xu
>
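
For readers who want to see the effect Mike describes above first hand, here is
a minimal userspace sketch. It is an illustration added for this write-up, not
code from the thread; the file name, the MAP_LEN/STEP constants, and the
dump_hugetlb_smaps() helper are made up for the example. It assumes x86-64 with
2MB default hugetlb pages, a hugetlb pool of at least 1024 pages (for example,
echo 1024 > /proc/sys/vm/nr_hugepages), and a kernel with hugetlb PMD sharing
enabled (CONFIG_ARCH_WANT_HUGE_PMD_SHARE). It maps 2GB of shared hugetlb memory
so that at least one fully PUD-aligned 1GB block is covered, faults the range
from a parent and a child, and prints the *_Hugetlb counters from
/proc/self/smaps. With PMD sharing in effect, the shared middle of the range is
expected to be reported as Private_Hugetlb in both processes, because
page_mapcount() stays at 1 there.

/* smaps-hugetlb-demo.c: illustrative sketch only, see the caveats above. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_LEN	(2UL << 30)	/* 2GB: always contains one PUD-aligned 1GB block */
#define STEP	(2UL << 20)	/* touch one byte per 2MB hugetlb page */

/* Print the Shared_Hugetlb/Private_Hugetlb lines of the vma starting at addr. */
static void dump_hugetlb_smaps(void *addr)
{
	char line[256];
	unsigned long start, end, target = (unsigned long)addr;
	int in_target = 0;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f) {
		perror("fopen");
		return;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			in_target = (start == target);	/* a new vma header line */
		else if (in_target && strstr(line, "_Hugetlb"))
			printf("pid %d: %s", getpid(), line);
	}
	fclose(f);
}

int main(void)
{
	unsigned long off;
	pid_t pid;
	char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap (is the hugetlb pool large enough?)");
		return 1;
	}

	/* Parent faults the whole range first so its PUD entries are populated. */
	for (off = 0; off < MAP_LEN; off += STEP)
		p[off] = 1;

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		/* Child faults the same range; the kernel can now reuse the
		 * parent's PMD page for the PUD-aligned part of the mapping. */
		for (off = 0; off < MAP_LEN; off += STEP)
			p[off] = 2;
		dump_hugetlb_smaps(p);
		_exit(0);
	}
	wait(NULL);
	dump_hugetlb_smaps(p);
	munmap(p, MAP_LEN);
	return 0;
}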
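
And, purely to make the shape of Peter's "match vma attributes before sharing"
idea concrete, a rough kernel-style sketch (again an illustration, not a real
or proposed patch): hugetlb_pmd_share_policy_ok() is a made-up name, while
mpol_equal() and vma_policy() are the existing mempolicy helpers. The thought
would be to apply a check like this, in addition to the existing alignment and
VM_MAYSHARE conditions, when picking a neighbouring vma to share a PMD page
with, and to unshare in the mbind() path when a vma's policy changes.

#include <linux/mempolicy.h>
#include <linux/mm.h>

/*
 * Illustrative sketch only: one possible shape for "only share a hugetlb
 * PMD between vmas that have the same mempolicy", as discussed above.
 * The existing sharing checks would remain; this would merely be an extra
 * condition on the candidate vma (svma).
 */
static bool hugetlb_pmd_share_policy_ok(struct vm_area_struct *vma,
					struct vm_area_struct *svma)
{
	/*
	 * If the two vmas have different NUMA policies, migrating a page
	 * for one sharer could violate the policy of another, and the
	 * mapcount==1 shortcut above would be unsafe, so refuse to share
	 * the PMD page in that case.
	 */
	return mpol_equal(vma_policy(vma), vma_policy(svma));
}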