Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

On Mon, Mar 11, 2024 at 12:36:00PM +0000, Ryan Roberts wrote:
> On 11/03/2024 12:26, Matthew Wilcox wrote:
> > On Mon, Mar 11, 2024 at 09:01:16AM +0000, Ryan Roberts wrote:
> >> [  153.499149] Call trace:
> >> [  153.499470]  uncharge_folio+0x1d0/0x2c8
> >> [  153.500045]  __mem_cgroup_uncharge_folios+0x5c/0xb0
> >> [  153.500795]  move_folios_to_lru+0x5bc/0x5e0
> >> [  153.501275]  shrink_lruvec+0x5f8/0xb30
> > 
> >> And that code is from your commit 29f3843026cf ("mm: free folios directly
> >> in move_folios_to_lru()"), which is another patch in the same series. This
> >> suffers from the same problem: uncharge before removing the folio from the
> >> deferred list, so the wrong lock is used - there are 2 sites in this
> >> function that do this.
> > 
> > Two sites, but basically the same thing; one is for "the batch is full"
> > and the other is "we finished the list".
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a0e53999a865..f60c5b3977dc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1842,6 +1842,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> >  		if (unlikely(folio_put_testzero(folio))) {
> >  			__folio_clear_lru_flags(folio);
> >  
> > +			if (folio_test_large(folio) &&
> > +			    folio_test_large_rmappable(folio))
> > +				folio_undo_large_rmappable(folio);
> >  			if (folio_batch_add(&free_folios, folio) == 0) {
> >  				spin_unlock_irq(&lruvec->lru_lock);
> >  				mem_cgroup_uncharge_folios(&free_folios);
> > 
> >> A quick grep over the entire series has a lot of hits for "uncharge". I
> >> wonder if we need a full audit of that series for other places that
> >> could potentially be doing the same thing?
> > 
> > I think this assertion will catch all occurrences of the same thing,
> > as long as people who are testing are testing in a memcg.  My setup
> > doesn't use a memcg, so I never saw any of this ;-(
> > 
> > If you confirm this fixes it, I'll send two patches: a respin of the patch
> > I sent on Sunday that calls undo_large_rmappable in this one extra place,
> > and then a patch to add the assertions.
> 
> Good timing on your response - I've just finished testing! Although my patch
> included both the site you have above and another that I fixed up speculatively
> in shrink_folio_list() based on reviewing all mem_cgroup_uncharge_folios() call
> sites.

I've been running some tests with this:

+++ b/mm/huge_memory.c
@@ -3223,6 +3221,7 @@ void folio_undo_large_rmappable(struct folio *folio)
        struct deferred_split *ds_queue;
        unsigned long flags;
 
+       folio_clear_large_rmappable(folio);
        if (folio_order(folio) <= 1)
                return;
 
+++ b/mm/memcontrol.c
@@ -7483,6 +7483,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
        struct obj_cgroup *objcg;

        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+       VM_BUG_ON_FOLIO(folio_test_large(folio) &&
+                       folio_test_large_rmappable(folio), folio);

        /*
         * Nobody should be changing or seriously looking at


which I think is too aggressive to be added.  It does "catch" one spot
where we don't call folio_undo_large_rmappable(), in
__filemap_add_folio(), but since that's the "we allocated a folio,
charged it, but failed to add it to the pagecache" path, there's no way
the folio can have been mmapped, so it can't be on the deferred split list.
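
(If we wanted to keep an assertion in uncharge_folio() without tripping
over that path, one option - sketch only, untested - would be to complain
only when the folio is actually still queued for deferred split:

+       VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
+                       folio_test_large_rmappable(folio) &&
+                       !list_empty(&folio->_deferred_list), folio);

A freshly allocated large folio has an empty _deferred_list, so the
__filemap_add_folio() failure path wouldn't trigger it.)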

With this patch, it took about 700 seconds in my xfstests run to find
the one in shrink_folio_list().  It took 3260 seconds to catch the one
in move_folios_to_lru().

The thought occurs that we already know these folios are large rmappable
(if they're large at all).  It's a little late to make that optimisation
now, but during the upcoming development cycle I'm going to remove that
half of the test, i.e. I'll make it look like this:

@@ -1433,6 +1433,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
                 */
                nr_reclaimed += nr_pages;

+               if (folio_test_large(folio))
+                       folio_undo_large_rmappable(folio);
                if (folio_batch_add(&free_folios, folio) == 0) {
                        mem_cgroup_uncharge_folios(&free_folios);
                        try_to_unmap_flush();
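
Presumably the same trimming applies to the move_folios_to_lru() hunk
quoted earlier in this thread (again only a sketch, not something I've
tested):

			if (folio_test_large(folio))
				folio_undo_large_rmappable(folio);
			if (folio_batch_add(&free_folios, folio) == 0) {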

> There is also a call to mem_cgroup_uncharge() in delete_from_lru_cache(), which
> I couldn't convince myself was safe. Perhaps you could do a quick audit of the
> call sites?

That one is only probably OK.  We usually manage to split a folio before
we get to this point, so we should remove it from the deferred list then.
You'd need to buy a lottery ticket if you managed to get hwpoison in a
folio that was deferred split ...

I reviewed all the locations that call mem_cgroup_uncharge:

__filemap_add_folio	Never mmapable
free_huge_folio		hugetlb, never splittable
delete_from_lru_cache	Discussed above
free_zone_device_page	compound pages not yet supported
destroy_large_folio	folio_undo_large_rmappable already called
__folio_put_small	not a large folio

Then all the places that call mem_cgroup_uncharge_folios:

folios_put_refs		Just fixed
shrink_folio_list	Just fixed
move_folios_to_lru	Just fixed

So I think we're good.

> But taking a step back, I wonder whether the charge() and uncharge() functions
> should really be checking to see if the folio is on a deferred split list and if
> so, move the folio to the correct list?

I don't think that's the right thing to do.  All of these places which
uncharge a folio are part of the freeing path, so we always want it
removed, not moved to a different deferred list.
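
(For reference, folio_undo_large_rmappable() - roughly as it currently
reads in mm/huge_memory.c, quoted from memory, so treat it as a sketch -
only ever takes the folio off its deferred split queue:

void folio_undo_large_rmappable(struct folio *folio)
{
	struct deferred_split *ds_queue;
	unsigned long flags;

	if (folio_order(folio) <= 1)
		return;

	/*
	 * Nobody can be adding this folio to the deferred list at this
	 * point, so an unlocked emptiness check is safe.
	 */
	if (data_race(list_empty(&folio->_deferred_list)))
		return;

	ds_queue = get_deferred_split_queue(folio);
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}

which is exactly what we want on the freeing paths.)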

But what about mem_cgroup_move_account()?  Looks like that's memcg v1
only?  Should still be fixed though ...



