On 11/03/2024 15:50, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 12:36:00PM +0000, Ryan Roberts wrote:
>> On 11/03/2024 12:26, Matthew Wilcox wrote:
>>> On Mon, Mar 11, 2024 at 09:01:16AM +0000, Ryan Roberts wrote:
>>>> [  153.499149] Call trace:
>>>> [  153.499470]  uncharge_folio+0x1d0/0x2c8
>>>> [  153.500045]  __mem_cgroup_uncharge_folios+0x5c/0xb0
>>>> [  153.500795]  move_folios_to_lru+0x5bc/0x5e0
>>>> [  153.501275]  shrink_lruvec+0x5f8/0xb30
>>>
>>>> And that code is from your commit 29f3843026cf ("mm: free folios
>>>> directly in move_folios_to_lru()"), which is another patch in the same
>>>> series. This suffers from the same problem; uncharge before removing
>>>> the folio from the deferred list, so using the wrong lock - there are
>>>> 2 sites in this function that do this.
>>>
>>> Two sites, but basically the same thing; one is for "the batch is full"
>>> and the other is "we finished the list".
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index a0e53999a865..f60c5b3977dc 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1842,6 +1842,9 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>>>  		if (unlikely(folio_put_testzero(folio))) {
>>>  			__folio_clear_lru_flags(folio);
>>>  
>>> +			if (folio_test_large(folio) &&
>>> +			    folio_test_large_rmappable(folio))
>>> +				folio_undo_large_rmappable(folio);
>>>  			if (folio_batch_add(&free_folios, folio) == 0) {
>>>  				spin_unlock_irq(&lruvec->lru_lock);
>>>  				mem_cgroup_uncharge_folios(&free_folios);
>>>
>>>> A quick grep over the entire series has a lot of hits for "uncharge". I
>>>> wonder if we need a full audit of that series for other places that
>>>> could potentially be doing the same thing?
>>>
>>> I think this assertion will catch all occurrences of the same thing,
>>> as long as people who are testing are testing in a memcg.  My setup
>>> doesn't use a memcg, so I never saw any of this ;-(
>>>
>>> If you confirm this fixes it, I'll send two patches; a respin of the patch
>>> I sent on Sunday that calls undo_large_rmappable in this one extra place,
>>> and then a patch to add the assertions.
>>
>> Good timing on your response - I've just finished testing! Although my patch
>> included both the site you have above and another that I fixed up speculatively
>> in shrink_folio_list() based on reviewing all mem_cgroup_uncharge_folios() call
>> sites.
> 
> I've been running some tests with this:
> 
> +++ b/mm/huge_memory.c
> @@ -3223,6 +3221,7 @@ void folio_undo_large_rmappable(struct folio *folio)
>  	struct deferred_split *ds_queue;
>  	unsigned long flags;
>  
> +	folio_clear_large_rmappable(folio);
>  	if (folio_order(folio) <= 1)
>  		return;
> 
> +++ b/mm/memcontrol.c
> @@ -7483,6 +7483,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	struct obj_cgroup *objcg;
>  
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> +	VM_BUG_ON_FOLIO(folio_test_large(folio) &&
> +			folio_test_large_rmappable(folio), folio);
>  
>  	/*
>  	 * Nobody should be changing or seriously looking at
> 
> which I think is too aggressive to be added.  It does "catch" one spot
> where we don't call folio_undo_large_rmappable() in
> __filemap_add_folio() but since it's the "we allocated a folio, charged
> it, but failed to add it to the pagecache" path, there's no way that
> it can have been mmaped, so it can't be on the split list.
> 
> With this patch, it took about 700 seconds in my xfstests run to find
> the one in shrink_folio_list().  It took 3260 seconds to catch the one
> in move_folios_to_lru().

54 minutes?? That's before I even reported that we still had a bug!
Either you're anticipating my every move, or you have a lot of stuff
running in parallel :)

> The thought occurs that we know these folios are large rmappable (if
> they're large).  It's a little late to make that optimisation, but
> during the upcoming development cycle, I'm going to remove that
> half of the test.  ie I'll make it look like this:
> 
> @@ -1433,6 +1433,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		 */
>  		nr_reclaimed += nr_pages;
>  
> +		if (folio_test_large(folio))
> +			folio_undo_large_rmappable(folio);
>  		if (folio_batch_add(&free_folios, folio) == 0) {
>  			mem_cgroup_uncharge_folios(&free_folios);
>  			try_to_unmap_flush();
> 
>> There is also a call to mem_cgroup_uncharge() in delete_from_lru_cache(), which
>> I couldn't convince myself was safe. Perhaps you could do a quick audit of the
>> call sites?
> 
> That one is only probably OK.  We usually manage to split a folio before
> we get to this point, so we should remove it from the deferred list then.
> You'd need to buy a lottery ticket if you managed to get hwpoison in a
> folio that was deferred split ...

OK, but "probably"? Given hwpoison is surely not a hot path, why not just
be safe and call folio_undo_large_rmappable()?

> I reviewed all the locations that call mem_cgroup_uncharge:
> 
> __filemap_add_folio		Never mmapable
> free_huge_folio			hugetlb, never splittable
> delete_from_lru_cache		Discussed above
> free_zone_device_page		compound pages not yet supported
> destroy_large_folio		folio_undo_large_rmappable already called
> __folio_put_small		not a large folio
> 
> Then all the places that call mem_cgroup_uncharge_folios:
> 
> folios_put_refs			Just fixed
> shrink_folio_list		Just fixed
> move_folios_to_lru		Just fixed

OK, same conclusion as me, except delete_from_lru_cache().

> So I think we're good.
> 
>> But taking a step back, I wonder whether the charge() and uncharge() functions
>> should really be checking to see if the folio is on a deferred split list and if
>> so, then move the folio to the correct list?
> 
> I don't think that's the right thing to do.  All of these places which
> uncharge a folio are part of the freeing path, so we always want it
> removed, not moved to a different deferred list.

Well, I'm just thinking about trying to be robust. Clearly you would prefer
that folio_undo_large_rmappable() has been called before uncharge(), so that
uncharge() notices that there is nothing on the deferred list (and doesn't
take the lock). But if it hasn't been, is it better to randomly crash
(costing the best part of a week to debug) or to move the folio to the
right list?

Alternatively, can we refactor so that there aren't 9 separate uncharge()
call sites? Those sites are all trying to free the folio, so is there a way
to refactor that into a single place (I guess the argument for the current
arrangement is reducing the number of times we have to iterate through the
batch)? Then we only have to get it right once.

> But what about mem_cgroup_move_account()?  Looks like that's memcg v1
> only?  Should still be fixed though ...

Right.

And what about the first bug you found, with the local list corruption? I'm
not running with that fix, so it's obviously not a problem here. But I still
think it's a bug that we should fix? list_for_each_entry_safe() isn't safe
against *concurrent* list modification, right?
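To spell out that last point, here is a tiny standalone userspace sketch of
the same "safe" iteration pattern. It's a made-up model for illustration
only - not the kernel's list_for_each_entry_safe() macro and not the
deferred-split code - showing that caching the next pointer only protects
against the iterating thread deleting the *current* entry:

#include <stdio.h>

struct node {
	int val;
	struct node *next;
};

int main(void)
{
	/* Build a tiny singly-linked list: 1 -> 2 -> 3 */
	struct node c = { 3, NULL };
	struct node b = { 2, &c };
	struct node a = { 1, &b };
	struct node *head = &a;
	struct node *pos, *n;

	/*
	 * The "safe" idiom: cache the next element before running the loop
	 * body, so the body is free to unlink/free 'pos' without breaking
	 * the walk.
	 */
	for (pos = head, n = pos ? pos->next : NULL;
	     pos;
	     pos = n, n = pos ? pos->next : NULL) {
		printf("visiting %d\n", pos->val);
		/*
		 * Deleting 'pos' here is fine: 'n' was cached above.
		 *
		 * But nothing protects 'n' itself. If another thread unlinks
		 * or frees the element that 'n' points at while this body
		 * runs (concurrent modification without holding the same
		 * lock), the next iteration dereferences a stale pointer.
		 */
	}

	return 0;
}

So "safe" only means safe against this thread's own deletions inside the
loop body; any concurrent writer still has to be excluded by the same lock,
which is exactly what goes wrong once the folio has been uncharged and the
two paths end up taking different deferred-split locks.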