On Thu, Dec 16, 2021 at 12:37:37PM +0300, Kirill A. Shutemov wrote:
> On Wed, Dec 15, 2021 at 09:55:20PM +0000, Matthew Wilcox wrote:
> > I've been trying to understand whether we can simplify the mapcount
> > handling for folios from the current situation with THPs.  Let me
> > quote the commit message from 53f9263baba6:
> > 
> > > mm: rework mapcount accounting to enable 4k mapping of THPs
> > > 
> > > We're going to allow mapping of individual 4k pages of THP compound.
> > > It means we need to track mapcount on per small page basis.
> > > 
> > > Straight-forward approach is to use ->_mapcount in all subpages to
> > > track how many time this subpage is mapped with PMDs or PTEs
> > > combined.  But this is rather expensive: mapping or unmapping of a
> > > THP page with PMD would require HPAGE_PMD_NR atomic operations
> > > instead of single we have now.
> > > 
> > > The idea is to store separately how many times the page was mapped
> > > as whole -- compound_mapcount.  This frees up ->_mapcount in
> > > subpages to track PTE mapcount.
> > > 
> > > We use the same approach as with compound page destructor and
> > > compound order to store compound_mapcount: use space in first tail
> > > page, ->mapping this time.
> > > 
> > > Any time we map/unmap whole compound page (THP or hugetlb) -- we
> > > increment/decrement compound_mapcount.  When we map part of
> > > compound page with PTE we operate on ->_mapcount of the subpage.
> > > 
> > > page_mapcount() counts both: PTE and PMD mappings of the page.
> > > 
> > > Basically, we have mapcount for a subpage spread over two counters.
> > > It makes tricky to detect when last mapcount for a page goes away.
> > > 
> > > We introduced PageDoubleMap() for this.  When we split THP PMD for
> > > the first time and there's other PMD mapping left we offset up
> > > ->_mapcount in all subpages by one and set PG_double_map on the
> > > compound page.  These additional references go away with last
> > > compound_mapcount.
> > > 
> > > This approach provides a way to detect when last mapcount goes away
> > > on per small page basis without introducing new overhead for most
> > > common cases.
> > 
> > What breaks if we simply track any mapping (whether by PMD or PTE)
> > as an increment to the head page (aka folio's) refcount?
> 
> The obvious answer is CoW: as discussed yesterday we need exact mapcount
> to know if the page can be re-used or has to be copied.
> 
> Consider the case when you have folio mapped as PMD and then split into
> PTE page table (like with mprotect()). You get WP page fault on a page
> that has mapcount == 512. How would you know if we can re-use the 4k?

I was trying to say the exact opposite of that ... fortunately I
rephrased it below.  The scenario I think you're describing here is:

	p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
	mprotect(p, 4KB, PROT_READ): THP split

And in that case, I would say the THP now has mapcount of 2 because
there are 2 VMAs mapping it.

> Also we need to detect case when the last mapping of a 4k in the folio
> has gone to trigger deferred_split_huge_page() logic.

I think you're referring to this logic in rmap.c:

		if (TestClearPageDoubleMap(page)) {
			/*
			 * Subpages can be mapped with PTEs too. Check how many of
			 * them are still mapped.
			 */
			for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
				if (atomic_add_negative(-1, &page[i]._mapcount))
					nr++;
			}

			/*
			 * Queue the page for deferred split if at least one small
			 * page of the compound page is unmapped, but at least one
			 * small page is still mapped.
			 */
			if (nr && nr < thp_nr_pages(page))
				deferred_split_huge_page(page);

The 'partial_mapcount' idea I mentioned below could help with this.
Checking that they're identical might be racy, though.

> > Essentially, we make the head mapcount 'the number of VMAs which
> > contain a reference to any page in this folio'.
> 
> Okay, so you will have mapcount == 2 or 3 for mprotect case above, not
> 512.
> But it doesn't help with answering question if the page can be
> re-used. You would need to do rmap walk to get the answer.
> 
> Note also that VMA lifecycle is different from page lifecycle:
> MADV_DONTNEED removes mapping, but leaves VMA intact. Who would
> decrement mapcount here?

I think page_remove_rmap() does when called from zap_huge_pmd().

The tricky part is handling partial DONTNEED calls.  eg, we could have
a 2MB page (if we're talking about shmem, it could even be mapped
askew), MADV_DONTNEED the first 512KB, then MADV_DONTNEED the last
512KB, then finally MADV_DONTNEED the middle 1MB, and only at the third
call should the mapcount be decremented.  So a zap call has to check
all the PTE/PMD entries in the range of the (folio intersect vma) to be
sure that there are no more references to any part of this folio.  It
might not be terribly fast, but it probably won't be noticeable compared
to all the other costs of doing a munmap().

> > We can remove PageDoubleMap.  The tail refcounts will all be 0.  If
> > it's useful, we could introduce a 'partial_mapcount' which would be
> > <= mapcount (but I don't know if it's useful).  Splitting a PMD would
> > not change ->_mapcount.  Splitting the folio already causes the folio
> > to be unmapped, so page faults will naturally re-increment ->_mapcount
> > of each subpage.
> > 
> > We might need some additional logic to treat a large folio (aka
> > compound page) as a single unit; that is, when we fault on one page,
> > we place entries for all pages in this folio (that fit ...) into the
> > page tables, so that we only account it once, even if it's not
> > compatible with using a PMD.
> 
> I still don't see a way to simplify mapcount for THP. But I'm
> preconceived because I'm the author of the current scheme.
> 
> Please, prove me wrong. I want to be mistaken. :)

I'm just trying to learn enough to make sensible suggestions for
simplification.
As yesterday's call proved, there are all kinds of corner cases when messing with mapcount and refcount.