On Thu, 6 Mar 2025, Zi Yan wrote:
> On 5 Mar 2025, at 17:38, Hugh Dickins wrote:
> > On Wed, 5 Mar 2025, Zi Yan wrote:
> >> On 5 Mar 2025, at 16:03, Hugh Dickins wrote:
> >>>
> >>> Beyond checking that, I didn't have time yesterday to investigate
> >>> further, but I'll try again today (still using last weekend's mm.git).
> >>
> >> I am trying to replicate your runs locally. Can you clarify your steps
> >> of “kernel builds on huge tmpfs while swapping to SSD”? Do you impose
> >> a memory limit so that anonymous memory is swapped to SSD, or make
> >> tmpfs swap to SSD?
> >
> > Yeah, my heart sank a bit when I saw Andrew (with good intention) asking
> > you to repeat my testing.
> >
> > We could spend weeks going back and forth on that, and neither of us has
> > weeks to spare.
> >
> > "To fulfil contractual obligations" I'll mail you the tarfile I send
> > out each time I'm asked for this; but I haven't updated that tarfile
> > in four years, whereas I'm frequently tweaking things to match what's
> > needed (most recently and relevantly, I guess enabling 64kB hugepages
> > for anon and shmem in addition to the PMD-sized).
> >
> > Please don't waste much of your time over trying to replicate what
> > I'm doing: just give the scripts a glance, as a source for "oh,
> > I could exercise something like that in my testing too" ideas.
> >
> > Yes, I limit physical memory by booting with mem=1G, and also apply
> > lower memcg v1 limits.
> >
> > I made a point of saying "SSD" there because I'm not testing zram or
> > zswap at all, whereas many others are testing those rather than disk.
> >
> > swapoff, and ext4 on loop0 on tmpfs, feature in what I exercise, but are
> > NOT relevant to the corruption I'm seeing here - that can occur before
> > any swapoff, and it's always on the kernel build in tmpfs: the parallel
> > build in ext4 on loop0 on tmpfs completes successfully.
>
> Thanks for the scripts. I kinda replicate your setup as follows:
>
> 1. boot a VM with 1GB memory and 8 cores;
> 2. mount a tmpfs with huge=always and 200GB;
> 3. clone the mainline kernel and use x86_64 defconfig (my gcc 14 gives
>    errors during the old kernel builds); this takes about 2GB of space,
>    so some of tmpfs is already swapped to SSD;
> 4. create a new cgroupv2 and set memory.high to 700MB to induce memory
>    swap during kernel compilation;
> 5. run “while true; do echo 1 | sudo tee /proc/sys/vm/compact_memory
>    >/dev/null; done” to trigger compaction all the time;
> 6. build the kernel with make -j20.
>
> I ran the above on mm-everything-2025-03-05-03-54 plus the xarray fix v3,
> folio_split() with your fixes, and the "Minimize xa_node allocation
> during xarray split" patches. The repo is at:
> https://github.com/x-y-z/linux-dev/tree/shmem_fix-mm-everything-2025-03-05-03-54
>
> It has run overnight for 30 kernel builds and no crash has happened so
> far. I wonder if you can give my repo a shot.

Thanks for trying; I hope you didn't waste too much time on it. I was
not optimistic that it could be adapted easily.

You appeared to be suggesting above that I try your setup, which did not
reproduce the corruption, instead of mine, which did. And later you
appeared to conclude that all is good because you saw no corruption. No.

I continued with my setup (working on mm-everything-2025-03-08-00-43),
have now root-caused the corruption, and have the fix: as so often, it
is obvious in retrospect.
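For reference in what follows: the lookup side of the race is the RCU
path in filemap_get_entry(). Stripped of retry and shadow-entry detail,
it goes roughly like this (a simplified sketch, not the exact source):

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (!folio || xa_is_value(folio))
		goto out;
	if (!folio_try_get(folio))	/* fails while refcount is frozen at 0 */
		goto repeat;
	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);	/* slot changed under us: retry */
		goto repeat;
	}
out:
	rcu_read_unlock();

Note that xas_reload() can only catch a stale slot whose contents have
already been rewritten by the time the folio was pinned: that window is
what matters below.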
After looking at enough of the fixdep failures and their .o.d files, I
found them consistent with seeing the head page of a large folio in
place of its first tail (presumably while racing a split). And the
debug patch to filemap_get_entry() below (not something we want to go
into the tree, but good for confirmation - maybe it will even show
warnings on your setup) confirmed that: head in place of first tail was
the majority of cases, but in general head in place of any tail.

There's a race between RCU lookup of the xarray and your splitting.
That's something that normally the xas_reload(&xas) check guards
against, but here it was not effective. I wasted a day getting it
exactly back to front: I thought the problem was that
__split_unmapped_folio() needed to __xa_store the former tail before
unfreezing it; but the patch reversing that ordering still issued some
warnings. No, it's that __split_unmapped_folio() must not unfreeze the
original head until all the xa slots which used to point to it have
been updated with their new contents (as __split_huge_page() always
did).

I've taken the liberty of simplifying the unfreeze calculation in terms
of mapping and swap_cache, as we did elsewhere (and after fiddling
around with the comment above it, just dropped it as adding nothing
beyond what the code already does).

And there's one other, unrelated change in there: I've changed the
folio_put() after __filemap_remove_folio() to folio_put_refs(). I
believe that is what's correct there, but please check carefully: I'm a
little worried that my testing (which I expect to be exercising that
"beyond EOF" case plenty) has run well both before and after that
change, whereas I thought there should be a noticeable leak of memory
while it was just folio_put(). Or maybe my testing is barely exercising
anything more than uniform splits to 0-order, in which case there's no
difference: I imagine your selftests (I've not tried them) will do much
better on that.

Please fold in the mm/huge_memory.c mods if you are in agreement.

Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>

diff --git a/mm/filemap.c b/mm/filemap.c
index f7281ad22743..34b4fdafec40 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1871,6 +1871,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
 		folio_put(folio);
 		goto repeat;
 	}
+
+	if (mapping->host /* filter out swap cache */ &&
+	    /* !folio_contains(folio, index), but that requires lock */
+	    WARN_ON(index - folio->index >= folio_nr_pages(folio))) {
+		pr_warn("Mismatched index:%#lx\n", index);
+		dump_page(&folio->page, NULL);
+		folio_put(folio);
+		goto repeat;
+	}
 out:
 	rcu_read_unlock();
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3e05e62fdccb..be0c9873019c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3787,18 +3787,13 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 					MTHP_STAT_NR_ANON, 1);
 		}
 
-		/*
-		 * Unfreeze refcount first. Additional reference from
-		 * page cache.
-		 */
-		folio_ref_unfreeze(release,
-			1 + ((!folio_test_anon(origin_folio) ||
-			     folio_test_swapcache(origin_folio)) ?
-				     folio_nr_pages(release) : 0));
-
 		if (release == origin_folio)
 			continue;
 
+		folio_ref_unfreeze(release, 1 +
+			((mapping || swap_cache) ?
+				folio_nr_pages(release) : 0));
+
 		lru_add_page_tail(origin_folio, &release->page,
 				lruvec, list);
 
@@ -3810,7 +3805,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 				folio_account_cleaned(release,
 					inode_to_wb(mapping->host));
 			__filemap_remove_folio(release, NULL);
-			folio_put(release);
+			folio_put_refs(release, folio_nr_pages(release));
 		} else if (mapping) {
 			__xa_store(&mapping->i_pages,
 					release->index, release, 0);
 
@@ -3822,6 +3817,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		}
 	}
 
+	folio_ref_unfreeze(origin_folio, 1 +
+		((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
+
 	unlock_page_lruvec(lruvec);
 
 	if (swap_cache)
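To spell out the ordering the fix enforces, schematically (this is not
the actual function: new_folios[], nr_new and extra_pins() are made-up
stand-ins for the real iteration and for the mapping/swap_cache
refcount calculation above):

	int i;

	for (i = 0; i < nr_new; i++) {
		struct folio *release = new_folios[i];

		if (release == origin_folio)
			continue;
		/*
		 * Unfreezing a tail before its slot is rewritten is safe:
		 * the stale slot still points at the frozen head, so a
		 * racing lookup fails folio_try_get() there and retries.
		 */
		folio_ref_unfreeze(release, 1 + extra_pins(release));
		__xa_store(&mapping->i_pages, release->index, release, 0);
	}
	/*
	 * Only once every slot which used to point at the head carries
	 * its new contents may the head be unfrozen: unfreeze it earlier
	 * and a racing filemap_get_entry() can pin the head through a
	 * stale slot, with xas_reload() confirming the match - which is
	 * exactly the "head in place of tail" corruption seen.
	 */
	folio_ref_unfreeze(origin_folio, 1 + extra_pins(origin_folio));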