On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>
>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>>>>
>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>>>> Could this also happen with a normal 4K page? I mean, when a user
>>>>>>>>>>>>>> tries to munlock a normal 4K page and this 4K page is isolated, so
>>>>>>>>>>>>>> it becomes an unevictable page?
>>>>>>>>>>>>> Looks like it is possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>>>
>>>>>>>>>>>>> cpu1                            cpu2
>>>>>>>>>>>>>                                 isolate folio
>>>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>>>                                 putback folio // add to unevictable list
>>>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>>> folio_set_lru()
>>>>> Let's wait for the response from Hugh and Yu. :)
>>>>
>>>> I haven't been able to give it enough thought, but I suspect you are right:
>>>> the current __munlock_folio() is deficient when folio_test_clear_lru()
>>>> fails.
>>>>
>>>> (Though it has not been reported as a problem in practice: perhaps because
>>>> so few places try to isolate from the unevictable "list".)
>>>>
>>>> I forget what my order of development was, but it's likely that I first
>>>> wrote the version for our own internal kernel - which used our original
>>>> lruvec locking, which did not depend on getting PG_lru first (having got
>>>> lru_lock, it checked memcg, then tried again if that had changed).
>>>
>>> Right. Just holding the lruvec lock without clearing PG_lru would not
>>> protect against memcg movement in this case.
>>>
>>>> I was uneasy with the PG_lru aspect of the upstream lru_lock
>>>> implementation, but it turned out to work okay - elsewhere; it looks as
>>>> if I missed its implication when adapting __munlock_page() for upstream.
>>>>
>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>>>> always behaving like a trylock, could a "folio_wait_clear_lru()" or
>>>> whatever spin waiting for PG_lru here?
>>>
>>> +Matthew Wilcox
>>>
>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
>>> folio_set_lru() before checking folio_evictable(). While this is
>>> probably extraneous since folio_batch_move_lru() will set it again
>>> afterwards, it's probably harmless given that the lruvec lock is held
>>> throughout (so no one can complete the folio isolation anyway), and
>>> given that there were no problems introduced by this extra
>>> folio_set_lru() as far as I can tell.
>>
>> After checking the related code: yes, it looks fine to move
>> folio_set_lru() before the folio_evictable(folio) check in lru_add_fn(),
>> because the lru lock is held.
>>
>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
>>> ("mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()") to restore
>>> the strict ordering between manipulating PG_lru and PG_mlocked, I
>>> suppose we can get away without having to spin. Again, that would only
>>> be possible if reworking mlock_count [1] is acceptable. Otherwise, we
>>> can't clear PG_mlocked before PG_lru in __munlock_folio().
>>
>> What about the following change, which moves the mlocked operation
>> before the lru check in __munlock_folio()?
>
> It seems correct to me on a high level, but I think there is a subtle problem:
>
> We clear PG_mlocked before trying to isolate, to make sure that if
> someone already has the folio isolated they will put it back on an
> evictable list; then, if we are able to isolate the folio ourselves and
> find that the mlock_count is > 0, we set PG_mlocked again.
>
> There is a small window where PG_mlocked might be temporarily cleared
> but the folio is not actually munlocked (i.e. we don't update the
> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> may find VM_LOCKED in a different vma, and call mlock_folio(). In
> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> PG_mlocked is clear, so we will increment the MLOCK stats, even though
> the folio was already mlocked. This can cause MLOCK stats to be
> unbalanced (increments more than decrements), no?

Looks like NR_MLOCK is always tied to the PG_mlocked bit, so it's not
possible for it to become unbalanced. Let's say:

  mlock_folio()        NR_MLOCK increased and mlocked set
  mlock_folio()        NR_MLOCK unchanged as folio is already mlocked

  __munlock_folio()    with isolated folio: NR_MLOCK decreased (to 0)
                       and mlocked cleared

  folio_putback_lru()
  reclaimed
  mlock_folio()        NR_MLOCK increased and mlocked set

  munlock_folio()      NR_MLOCK decreased (to 0) and mlocked cleared
  munlock_folio()      NR_MLOCK unchanged as folio has mlocked not set

Regards
Yin, Fengwei

>
>>
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 0a0c996c5c21..514f0d5bfbfd 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
>>  static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
>>  {
>>          int nr_pages = folio_nr_pages(folio);
>> -        bool isolated = false;
>> +        bool isolated = false, mlocked = true;
>> +
>> +        mlocked = folio_test_clear_mlocked(folio);
>>
>>          if (!folio_test_clear_lru(folio))
>>                  goto munlock;
>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
>>                  /* Then mlock_count is maintained, but might undercount */
>>                  if (folio->mlock_count)
>>                          folio->mlock_count--;
>> -                if (folio->mlock_count)
>> +                if (folio->mlock_count) {
>> +                        if (mlocked)
>> +                                folio_set_mlocked(folio);
>>                          goto out;
>> +                }
>>          }
>>          /* else assume that was the last mlock: reclaim will fix it if not */
>>
>>  munlock:
>> -        if (folio_test_clear_mlocked(folio)) {
>> +        if (mlocked) {
>>                  __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
>>                  if (isolated || !folio_test_unevictable(folio))
>>                          __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>>
>>
>>>
>>> I am not saying this is necessarily better than spinning, just a note
>>> (and perhaps selfishly making [1] more appealing ;)).
>>>
>>> [1] https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@xxxxxxxxxx/
>>>
>>>>
>>>> Hugh
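
To make the balance argument above concrete, here is a minimal userspace
model. This is plain C, not kernel code: pg_mlocked, nr_mlock and the two
helpers are simplified single-page stand-ins for the folio's PG_mlocked
flag, the zone's NR_MLOCK counter, and the kernel's mlock_folio() /
munlock_folio(). It replays the sequence listed above and checks that the
counter returns to zero:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool pg_mlocked;  /* stand-in for the folio's PG_mlocked flag */
static long nr_mlock;    /* stand-in for the zone's NR_MLOCK counter */

/* Models mlock_folio(): the stat moves only on a 0 -> 1 bit transition. */
static void mlock_folio(void)
{
        if (!pg_mlocked) {      /* like folio_test_set_mlocked() == false */
                pg_mlocked = true;
                nr_mlock++;     /* like __zone_stat_mod_folio(+nr_pages) */
        }
}

/* Models munlock_folio(): the stat moves only on a 1 -> 0 bit transition. */
static void munlock_folio(void)
{
        if (pg_mlocked) {       /* like folio_test_clear_mlocked() == true */
                pg_mlocked = false;
                nr_mlock--;     /* like __zone_stat_mod_folio(-nr_pages) */
        }
}

int main(void)
{
        mlock_folio();          /* NR_MLOCK 0 -> 1, mlocked set */
        mlock_folio();          /* no change: already mlocked */
        munlock_folio();        /* __munlock_folio() step: 1 -> 0, cleared */

        /* folio_putback_lru(), reclaim, then a fresh mlock: */
        mlock_folio();          /* NR_MLOCK 0 -> 1, mlocked set */

        munlock_folio();        /* NR_MLOCK 1 -> 0, mlocked cleared */
        munlock_folio();        /* no change: mlocked not set */

        assert(nr_mlock == 0);  /* balanced, as argued above */
        printf("NR_MLOCK = %ld\n", nr_mlock);
        return 0;
}

Because every increment is gated on the bit being clear and every
decrement on the bit being set, increments and decrements must alternate
for a given folio, so the counter cannot drift in either direction; that
is the invariant the reply above relies on.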