Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

David Hildenbrand <david@xxxxxxxxxx> · Fri, 12 Apr 2024 21:36:32 +0200

On 12.04.24 20:29, Yang Shi wrote:
On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <ziy@xxxxxxxxxx> wrote:

On 12 Apr 2024, at 10:21, Zi Yan wrote:

On 11 Apr 2024, at 17:59, Yang Shi wrote:

On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 11.04.24 21:01, Yang Shi wrote:
On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 11.04.24 17:32, Zi Yan wrote:
From: Zi Yan <ziy@xxxxxxxxxx>

In __folio_remove_rmap(), a large folio is added to the deferred split list
if any page in the folio loses its final mapping. The folio may be fully
unmapped at that point, in which case adding it to the deferred split list
is unnecessary. Fix it by checking the folio mapcount before adding a folio
to the deferred split list.

Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
---
    mm/rmap.c | 9 ++++++---
    1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..d599a772e282 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                enum rmap_level level)
    {
        atomic_t *mapped = &folio->_nr_pages_mapped;
-     int last, nr = 0, nr_pmdmapped = 0;
+     int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
        enum node_stat_item idx;

        __folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                        break;
                }

-             atomic_sub(nr_pages, &folio->_large_mapcount);
+             mapcount = atomic_sub_return(nr_pages,
+                                          &folio->_large_mapcount) + 1;

That becomes a new memory barrier on some archs. Rather just re-read it
below. Re-reading should be fine here.
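
The ordering concern can be illustrated in userspace with C11 atomics. This is only an analogy, not kernel code: atomic_int stands in for the kernel's atomic_t, a relaxed atomic_fetch_sub_explicit() stands in for atomic_sub(), and a sequentially consistent atomic_fetch_sub() stands in for the fully ordered atomic_sub_return(). The helper names are made up for illustration.

```c
#include <stdatomic.h>

/*
 * Userspace stand-in for folio->_large_mapcount. The "+ 1" mirrors the
 * patch, where the stored value is one less than the actual mapcount.
 */

/* Suggested shape: unordered subtract, then a plain re-read. */
static int sub_then_reread(atomic_int *large_mapcount, int nr_pages)
{
	atomic_fetch_sub_explicit(large_mapcount, nr_pages,
				  memory_order_relaxed);
	return atomic_load_explicit(large_mapcount,
				    memory_order_relaxed) + 1;
}

/* Shape used in the patch: a single, fully ordered read-modify-write. */
static int sub_return_ordered(atomic_int *large_mapcount, int nr_pages)
{
	/* atomic_fetch_sub() returns the old value and is seq_cst. */
	return atomic_fetch_sub(large_mapcount, nr_pages) - nr_pages + 1;
}
```

Both helpers return the same value when nothing else touches the counter. With concurrent updaters the re-read variant may observe a later value, which the comment above considers acceptable here.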

                do {
                        last = atomic_add_negative(-1, &page->_mapcount);
                        if (last) {
@@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                 * is still mapped.
                 */
                if (folio_test_large(folio) && folio_test_anon(folio))
-                     if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
+                     if ((level == RMAP_LEVEL_PTE &&
+                          mapcount != 0) ||
+                         (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
                                deferred_split_folio(folio);
        }

But I do wonder if we really care? Usually the folio will simply get
freed afterwards, where we simply remove it from the list.

If it's pinned, we won't be able to free or reclaim, but it's rather a
corner case ...

Is it really worth the added code? Not convinced.

It is actually not only an optimization, but also fixes the broken
thp_deferred_split_page counter in /proc/vmstat.

The counter actually counted the partially unmapped huge pages (so
they are on deferred split queue), but it counts the fully unmapped
mTHP as well now. For example, when a 64K THP is fully unmapped, the
thp_deferred_split_page is not supposed to get inc'ed, but it does
now.

The counter is also useful for performance analysis, for example,
to tell whether a workload did a lot of partial unmaps or not. So fixing
the counter seems worthwhile. Zi Yan should have mentioned this in the
commit log.

Yes, all that is information that is missing from the patch description.
If it's a fix, there should be a "Fixes:".

Likely we want to have a folio_large_mapcount() check in the code below.
(I have yet to digest the condition where this happens -- can we have an
example where we used to do the wrong thing and now would do the right
thing?)

For example, map 1G memory with 64K mTHP, then unmap the whole 1G or
some full 64K areas, you will see thp_deferred_split_page increased,
but it shouldn't.
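
A rough userspace sketch of that experiment follows. It is a minimal sketch under assumptions, not a tested reproducer: the helper names are hypothetical, the 4096-byte touch stride assumes 4K base pages, and 64K mTHP is assumed to have been enabled beforehand (for example through /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled on kernels that provide it). Whether the counter actually moves depends on the kernel version, as the rest of this thread discusses.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Return the value of counter `name` in a vmstat-style file, or -1. */
static long read_counter(const char *path, const char *name)
{
	char key[128];
	long val, found = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	/* Each line has the form "<name> <value>". */
	while (fscanf(f, "%127s %ld", key, &val) == 2) {
		if (strcmp(key, name) == 0) {
			found = val;
			break;
		}
	}
	fclose(f);
	return found;
}

/* Map `len` bytes of anonymous memory, touch it, then unmap all of it. */
static int map_touch_unmap(size_t len)
{
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
#ifdef MADV_HUGEPAGE
	madvise(p, len, MADV_HUGEPAGE);	/* best effort, result ignored */
#endif
	for (size_t off = 0; off < len; off += 4096)
		p[off] = 1;	/* fault the memory in */
	return munmap(p, len);	/* full unmap, no partial ranges */
}
```

To run the experiment, read thp_deferred_split_page from /proc/vmstat with read_counter(), call map_touch_unmap(1UL << 30), and read the counter again. Since the whole range is unmapped at once, any increase would come from fully unmapped folios.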

It looks like __folio_remove_rmap() incorrectly detects whether the mTHP
is fully unmapped or partially unmapped by comparing the number of
still-mapped subpages to ENTIRELY_MAPPED, which only works for
PMD-mappable THP.

However, I just realized this problem was kind of worked around by commit:

commit 98046944a1597f3a02b792dbe9665e9943b77f28
Author: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Date:   Fri Mar 29 14:59:33 2024 +0800

     mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics

     Now the mTHP can also be split or added into the deferred list, so add
     folio_test_pmd_mappable() validation for PMD mapped THP, to avoid
     confusion with PMD mapped THP related statistics.

     Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@xxxxxxxxxxxxxxxxx
     Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
     Acked-by: David Hildenbrand <david@xxxxxxxxxx>
     Cc: Muchun Song <muchun.song@xxxxxxxxx>
     Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>

This commit made thp_deferred_split_page stop counting mTHP, and it
made thp_split_page stop counting mTHP as well.

However, Zi Yan's patch does make the code more robust, and we won't
need to worry about the miscounting issue anymore if we add
deferred_split_page and split_page counters for mTHP in the future.

Actually, the patch above does not fix everything. A fully unmapped
PTE-mapped order-9 THP is also added to deferred split list and
counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512
(non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
the order-9 folio is folio_test_pmd_mappable().
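
The old and new conditions from the diff can be compared in a small userspace model. The sketch below is not kernel code: the enum is a stand-in for the kernel's rmap level, the helper names queued_before() and queued_after() are made up for illustration, and mapcount is assumed to be the folio's mapcount after the removal, as in the patch. Only the inner condition shown in the diff is modelled.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's enum rmap_level. */
enum rmap_level { RMAP_LEVEL_PTE, RMAP_LEVEL_PMD };

/* Condition before the patch: any PTE-level removal queues the folio. */
static bool queued_before(enum rmap_level level, int nr, int nr_pmdmapped)
{
	return level == RMAP_LEVEL_PTE || nr < nr_pmdmapped;
}

/* Condition after the patch: PTE-level removal must leave mappings behind. */
static bool queued_after(enum rmap_level level, int nr, int nr_pmdmapped,
			 int mapcount)
{
	return (level == RMAP_LEVEL_PTE && mapcount != 0) ||
	       (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped);
}
```

For the case described above (level RMAP_LEVEL_PTE, nr of 512, mapcount of 0 after the removal), queued_before() returns true while queued_after() returns false. A partial unmap that leaves the mapcount non-zero is queued by both.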

I will add this information in the next version.

It might be
Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"),
but even before this commit, fully unmapping a PTE-mapped order-9 THP increased
THP_DEFERRED_SPLIT_PAGE, because PTEs were unmapped individually and the first PTE
unmapping added the THP to the deferred split list. This means commit b06dc281aa99
did not change anything, and the earlier THP_DEFERRED_SPLIT_PAGE increase was
due to the implementation. I will add this to the commit log as well, without a
Fixes tag.

Thanks for digging deeper. The problem may not have been that obvious
before mTHP because a PMD-mappable THP is converted to PTE-mapped due to
partial unmap in most cases. But mTHP is always PTE-mapped in the
first place. The other reason is that batched rmap removal was not
supported before David's optimization.

Yes.


Now we do have reasonable motivation to make it precise and it is also
easier to do so than before.

If by "precise" you mean "less unreliable in some cases", yes. See my 
other mail.

--
Cheers,

David / dhildenb