On 24/10/2024 05:10, Hugh Dickins wrote:
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without). The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
> 
> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> I gave bad advice, it looks plausible since that's a local on-stack
> list, but the fact is that it can race with a third party freeing or
> migrating the preceding folio (properly unqueueing it with refcount 0
> while holding split_queue_lock), thereby corrupting the list linkage.
> 
> The obvious answer would be to take split_queue_lock there: but it has
> a long history of contention, so I'm reluctant to add to that. Instead,
> make sure that there is always one safe (raised refcount) folio before,
> by delaying its folio_put(). (And of course I was wrong to suggest
> updating split_queue_len without the lock: leave that until the splice.)
> 

Thanks for this, I imagine this was quite tricky to debug. Avoiding taking
the split_queue_lock by keeping a reference on the preceding folio is really
nice (a condensed sketch of the reworked loop follows below, after the diff).
And I should have spotted in the original patch that split_queue_len
shouldn't be changed without holding split_queue_lock.

Acked-by: Usama Arif <usamaarif642@xxxxxxxxx>

> And remove two over-eager partially_mapped checks, restoring those tests
> to how they were before: if uncharge_folio() or free_tail_page_prepare()
> finds _deferred_list non-empty, it's in trouble whether or not that folio
> is partially_mapped (and the flag was already cleared in the latter case).
> 
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> ---
>  mm/huge_memory.c | 21 +++++++++++++++++----
>  mm/memcontrol.c  |  3 +--
>  mm/page_alloc.c  |  5 ++---
>  3 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2fb328880b50..a1d345f1680c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3718,8 +3718,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
>  	LIST_HEAD(list);
> -	struct folio *folio, *next;
> -	int split = 0;
> +	struct folio *folio, *next, *prev = NULL;
> +	int split = 0, removed = 0;
>  
>  #ifdef CONFIG_MEMCG
>  	if (sc->memcg)
> @@ -3775,15 +3775,28 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		 */
>  		if (!did_split && !folio_test_partially_mapped(folio)) {
>  			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +			removed++;
> +		} else {
> +			/*
> +			 * That unlocked list_del_init() above would be unsafe,
> +			 * unless its folio is separated from any earlier folios
> +			 * left on the list (which may be concurrently unqueued)
> +			 * by one safe folio with refcount still raised.
> +			 */
> +			swap(folio, prev);
>  		}
> -		folio_put(folio);
> +		if (folio)
> +			folio_put(folio);
>  	}
>  
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	list_splice_tail(&list, &ds_queue->split_queue);
> +	ds_queue->split_queue_len -= removed;
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> +	if (prev)
> +		folio_put(prev);
> +
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>  	 * This can happen if pages were freed under us.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7845c64a2c57..2703227cce88 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4631,8 +4631,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
>  			!folio_test_hugetlb(folio) &&
> -			!list_empty(&folio->_deferred_list) &&
> -			folio_test_partially_mapped(folio), folio);
> +			!list_empty(&folio->_deferred_list), folio);
>  
>  	/*
>  	 * Nobody should be changing or seriously looking at
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8afab64814dc..4b21a368b4e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -961,9 +961,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		break;
>  	case 2:
>  		/* the second tail page: deferred_list overlaps ->mapping */
> -		if (unlikely(!list_empty(&folio->_deferred_list) &&
> -			     folio_test_partially_mapped(folio))) {
> -			bad_page(page, "partially mapped folio on deferred list");
> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> +			bad_page(page, "on deferred list");
>  			goto out;
>  		}
>  		break;
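
For anyone following the control flow, here is a condensed sketch of how the
reworked loop hangs together. This is my own illustration rather than the
literal kernel source: attempt_split() is a hypothetical stand-in for the
folio_trylock()/split_folio() handling, the memcg setup and the split counter
are omitted, and the local list and ds_queue are assumed to have been set up
earlier in the function as in the real code:

	struct folio *folio, *next, *prev = NULL;
	int removed = 0;
	unsigned long flags;

	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		bool did_split = attempt_split(folio);	/* hypothetical helper */

		if (!did_split && !folio_test_partially_mapped(folio)) {
			/*
			 * Unlocked del: safe because prev, whose reference we
			 * still hold, separates us from any earlier folios
			 * that a third party might be unqueueing.
			 */
			list_del_init(&folio->_deferred_list);
			removed++;
		} else {
			/* This folio stays queued: make it the new barrier. */
			swap(folio, prev);
		}
		if (folio)
			folio_put(folio);	/* ref on the removed folio, or on the old barrier */
	}

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_splice_tail(&list, &ds_queue->split_queue);
	ds_queue->split_queue_len -= removed;	/* only adjusted under the lock */
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	if (prev)
		folio_put(prev);	/* finally drop the last barrier's reference */

The key point is that prev is always the most recent folio left on the local
list behind the cursor, and its raised refcount stops anyone else from
unqueueing it, so the unlocked list_del_init() never touches linkage that can
change underneath us.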