Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

David Hildenbrand <david@xxxxxxxxxx> · Fri, 12 Apr 2024 21:32:58 +0200

On 12.04.24 16:35, Zi Yan wrote:
On 11 Apr 2024, at 11:46, David Hildenbrand wrote:

On 11.04.24 17:32, Zi Yan wrote:
From: Zi Yan <ziy@xxxxxxxxxx>

In __folio_remove_rmap(), a large folio is added to deferred split list
if any page in a folio loses its final mapping. It is possible that
the folio is unmapped fully, but it is unnecessary to add the folio
to deferred split list at all. Fix it by checking folio mapcount before
adding a folio to deferred split list.

Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
---
   mm/rmap.c | 9 ++++++---
   1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..d599a772e282 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
   		enum rmap_level level)
   {
   	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int last, nr = 0, nr_pmdmapped = 0;
+	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
   	enum node_stat_item idx;
    	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
   			break;
   		}
  -		atomic_sub(nr_pages, &folio->_large_mapcount);
+		mapcount = atomic_sub_return(nr_pages,
+					     &folio->_large_mapcount) + 1;

That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here.

Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
below, but to save an atomic op, I chose to read mapcount here.

Some points:

(1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
RMW that return a value -- and how they interact with memory barriers.
Further, how relaxed variants are only optimized on some architectures.

atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
access that should not be refetched. Usually cheaper than most other stuff
that involves atomics.

(2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
to figure out if the folio is now completely unmapped.

(3) There is one fundamental issue: if we are not batch-unmapping the whole
thing, we will still add the folios to the deferred split queue. Migration
would still do that, or if there are multiple VMAs covering a folio.

(4) We should really avoid making common operations slower only to make
some unreliable stats less unreliable.


We should likely do something like the following, which might even be a bit
faster in some cases because we avoid a function call in case we unmap
individual PTEs by checking _deferred_list ahead of time

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..356598b3dc3c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                 * page of the folio is unmapped and at least one page
                 * is still mapped.
                 */
-               if (folio_test_large(folio) && folio_test_anon(folio))
-                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
-                               deferred_split_folio(folio);
+               if (folio_test_large(folio) && folio_test_anon(folio) &&
+                   (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
+                   atomic_read(mapped) &&
+                   data_race(list_empty(&folio->_deferred_list)))
+                       deferred_split_folio(folio);
        }
 

I also thought about handling the scenario where we unmap the whole
think in smaller chunks. We could detect "!atomic_read(mapped)" and
detect that it is on the deferred split list, and simply remove it
from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event.

But it would be racy with concurrent remapping of the folio (might happen with
anon folios in corner cases I guess).

What we can do is the following, though:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dc30139590e6..f05cba1807f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
        ds_queue = get_deferred_split_queue(folio);
        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (!list_empty(&folio->_deferred_list)) {
+               if (folio_test_pmd_mappable(folio))
+                       count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
                ds_queue->split_queue_len--;
                list_del_init(&folio->_deferred_list);
        }

Adding the right event of course.


Then it's easy to filter out these "temporarily added to the list, but never split
before the folio was freed" cases.


--
Cheers,

David / dhildenb