On 11/03/2024 17:49, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 04:14:06PM +0000, Ryan Roberts wrote:
>>> With this patch, it took about 700 seconds in my xfstests run to find
>>> the one in shrink_folio_list(). It took 3260 seconds to catch the one
>>> in move_folios_to_lru().
>>
>> 54 hours?? That's before I even reported that we still had a bug! Either
>> you're anticipating my every move, or you have a lot of stuff running in
>> parallel :)
>
> One hour is 3600 seconds ;-) So more like 53 minutes.
>
>>> That one is only probably OK. We usually manage to split a folio before
>>> we get to this point, so we should remove it from the deferred list then.
>>> You'd need to buy a lottery ticket if you managed to get hwpoison in a
>>> folio that was deferred split ...
>>
>> OK, but "probably"? Given hwpoison is surely not a hot path, why not just
>> be safe and call folio_undo_large_rmappable()?
>
> Yes, I've added it; just commenting on the likelihood of being able to
> hit it in testing, even with aggressive error injection.
>
>>>> But taking a step back, I wonder whether the charge() and uncharge()
>>>> functions should really be checking to see if the folio is on a
>>>> deferred split list and if so, then move the folio to the correct list?
>>>
>>> I don't think that's the right thing to do. All of these places which
>>> uncharge a folio are part of the freeing path, so we always want it
>>> removed, not moved to a different deferred list.
>>
>> Well, I'm just thinking about trying to be robust. Clearly you would
>> prefer that folio_undo_large_rmappable() has been called before
>> uncharge(), so that uncharge() notices that there is nothing on the
>> deferred list (and doesn't take the lock). But if it's not, is it better
>> to randomly crash (costing the best part of a week to debug) or to move
>> the folio to the right list?
>
> Neither ;-) The right option is to include the assertion that the
> deferred list is empty.
> That way we get to see the backtrace of whoever
> forgot to take the folio off the deferred list.
>
>> Alternatively, can we refactor so that there aren't 9 separate uncharge()
>> call sites? Those sites are all trying to free the folio, so is there a
>> way to refactor that into a single place (I guess the argument for the
>> current arrangement is reducing the number of times that we have to
>> iterate through the batch?). Then we only have to get it right once.
>
> I have been wondering about a better way to do it. I've also been
> looking a bit askance at put_pages_list() which doesn't do memcg
> uncharging ...
>
>>> But what about mem_cgroup_move_account()? Looks like that's memcg v1
>>> only? Should still be fixed though ...
>>
>> Right.
>>
>> And what about the first bug you found with the local list corruption?
>> I'm not running with that fix, so it's obviously not a problem here. But
>> I still think it's a bug that we should fix? list_for_each_entry_safe()
>> isn't safe against *concurrent* list modification, right?
>
> I've been thinking about that too. I decided that the local list is
> actually protected by the lock after all. It's a bit fiddly to prove,
> but:
>
> 1. We have a reference on every folio ahead of us on the list (not behind
>    us, but see below).
> 2. If split_folio() succeeds, it takes the lock that would protect the
>    list we are on.
> 3. If it doesn't, and folio_put() turns out to drop the last reference,
>    __folio_put_large -> destroy_large_folio -> folio_undo_large_rmappable
>    takes the lock that protects the list we would be on.
>
> So we can analyse this loop as:
>
>	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>		if (random() & 1)
>			continue;
>		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>		list_del_init(&folio->_deferred_list);
>		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>	}
>
> We're guaranteed that 'next' is a valid folio because we hold a refcount
> on it.
> Anything left on the list between &list and next may have been
> removed from the list, but we don't look at those entries until after
> we take the split_queue_lock again to do the list_splice_tail().
>
> I'm too scared to write a loop like this, but I don't think it contains
> a bug.

OK, wow. Now that I'm looking at the implementation of
list_for_each_entry_safe() along with your reasoning, that is clear. But
it's certainly not obvious looking at deferred_split_scan().