Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Mon, 11 Mar 2024 17:49:09 +0000

On Mon, Mar 11, 2024 at 04:14:06PM +0000, Ryan Roberts wrote:
> > With this patch, it took about 700 seconds in my xfstests run to find
> > the one in shrink_folio_list().  It took 3260 seconds to catch the one
> > in move_folios_to_lru().
> 
> 54 hours?? That's before I even reported that we still had a bug! Either you're
> anticipating my every move, or you have a lot of stuff running in parallel :)

One hour is 3600 seconds ;-)  So more like 53 minutes.

> > That one is only probably OK.  We usually manage to split a folio before
> > we get to this point, so we should remove it from the deferred list then.
> > You'd need to buy a lottery ticket if you managed to get hwpoison in a
> > folio that was deferred split ...
> 
> OK, but "probably"? Given hwpoison is surely not a hot path, why not just be
> safe and call folio_undo_large_rmappable()?

Yes, I've added it; just commenting on the likelihood of being able to
hit it in testing, even with aggressive error injection.

> >> But taking a step back, I wonder whether the charge() and uncharge() functions
> >> should really be checking to see if the folio is on a deferred split list and if
> >> so, then move the folio to the corect list?
> > 
> > I don't think that's the right thing to do.  All of these places which
> > uncharge a folio are part of the freeing path, so we always want it
> > removed, not moved to a different deferred list.
> 
> Well I'm just thinking about trying to be robust. Clearly you would prefer that
> folio_undo_large_rmappable() has been called before uncharge(), then uncharge()
> notices that there is nothing on the deferred list (and doesn't take the lock).
> But if it's not, is it better to randomly crash (costing best part of a week to
> debug) or move the folio to the right list?

Neither ;-)  The right option is to include the assertion that the
deferred list is empty.  That way we get to see the backtrace of whoever
forgot to take the folio off the deferred list.

> Alternatively, can we refactor so that there aren't 9 separate uncharge() call
> sites. Those sites are all trying to free the folio so is there a way to better
> refactor that into a single place (I guess the argument for the current
> arrangement is reducing the number of times that we have to iterate through the
> batch?). Then we only have to get it right once.

I have been wondering about a better way to do it.  I've also been
looking a bit askance at put_pages_list() which doesn't do memcg
uncharging ...

> > 
> > But what about mem_cgroup_move_account()?  Looks like that's memcg v1
> > only?  Should still be fixed though ...
> 
> Right.
> 
> And what about the first bug you found with the local list corruption? I'm not
> running with that fix so its obviously not a problem here. But I still think its
> a bug that we should fix? list_for_each_entry_safe() isn't safe against
> *concurrent* list modification, right?

I've been thinking about that too.  I decided that the local list is
actually protected by the lock after all.  It's a bit fiddly to prove,
but:

1. We have a reference on every folio ahead on the list (not behind us,
but see below)
2. If split_folio succeeds, it takes the lock that would protect the
list we are on.
3. If it doesn't, and folio_put() turns out to be the last reference,
__folio_put_large -> destroy_large_folio -> folio_undo_large_rmappable
takes the lock that protects the list we would be on.

So we can analyse this loop as:

	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		if (random() & 1)
			continue;
		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		list_del_init(&folio->_deferred_list);
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}

We're guaranteed that 'next' is a valid folio because we hold a refcount
on it.  Anything left on the list between &list and next may have been
removed from the list, but we don't look at those entries until after
we take the split_queue_lock again to do the list_splice_tail().

I'm too scared to write a loop like this, but I don't think it contains
a bug.