On 12/03/2024 03:45, Matthew Wilcox wrote: > On Mon, Mar 11, 2024 at 03:58:48PM -0400, Zi Yan wrote: >> @@ -1168,6 +1172,17 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >> folio_lock(src); >> } >> locked = true; >> + if (folio_test_large_rmappable(src) && I think you also need to check that the order > 1, now that we support order-1 pagecache folios? _deferred_list only exists if order > 1. >> + !list_empty(&src->_deferred_list)) { >> + struct deferred_split *ds_queue = get_deferred_split_queue(src); >> + >> + spin_lock(&ds_queue->split_queue_lock); >> + ds_queue->split_queue_len--; >> + list_del_init(&src->_deferred_list); >> + spin_unlock(&ds_queue->split_queue_lock); >> + old_page_state |= PAGE_WAS_ON_DEFERRED_LIST; >> + } > > I have a few problems with this ... > > Trivial: your whitespace is utterly broken. You can't use a single tab > for both indicating control flow change and for line-too-long. > > Slightly more important: You're checking list_empty outside the lock > (which is fine in order to avoid unnecessarily acquiring the lock), > but you need to re-check it inside the lock in case of a race. And you > didn't mark it as data_race(), so KMSAN will whinge. I've seen data_race() used around list_empty() without the lock held inconsistently (see deferred_split_folio()). What are the rules? Given that we are not doing a memory access here, I don't really understand why it is needed? list_empty() uses READ_ONCE() which I thought would be sufficient? (I've just added a similar lockless check in my swap-out series - will add data_race() if needed, but previously concluded it's not). > > Much more important: You're doing this with a positive refcount, which > breaks the (undocumented) logic in deferred_split_scan() that a folio > with a positive refcount will not be removed from the list. > > Maximally important: Wer shouldn't be doing any of this! This folio is > on the deferred split list. We shouldn't be migrating it as a single > entity; we should be splitting it now that we're in a context where we > can do the right thing and split it. Documentation/mm/transhuge.rst > is clear that we don't split it straight away due to locking context. > Splitting it on migration is clearly the right thing to do. > > If splitting fails, we should just fail the migration; splitting fails > due to excess references, and if the source folio has excess references, > then migration would fail too. This comment makes me wonder what we do in split_huge_page_to_list_to_order() if the target order is greater than 1 and the input folio is on the deferred split list. Looks like we currently just remove it from the deferred list. Is there a case for putting any output folios that are still partially mapped back on the deferred list, or splitting them to a lower order such that all output folios are fully mapped, and all unmapped portions are freed?