On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>>
>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>> Could this also happen with a normal 4K page? I mean, when a user
>>>>>>>>>>>> tries to munlock a normal 4K page while that 4K page is isolated,
>>>>>>>>>>>> does it then become an unevictable page?
>>>>>>>>>>> Looks like it is possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>
>>>>>>>>>>> cpu1                            cpu2
>>>>>>>>>>>                                 isolate folio
>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>                                 putback folio // add to unevictable list
>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>                                    folio_set_lru()
>>> Let's wait for the response from Hugh and Yu. :)
>>
>> I haven't been able to give it enough thought, but I suspect you are right:
>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>> fails.
>>
>> (Though it has not been reported as a problem in practice: perhaps because
>> so few places try to isolate from the unevictable "list".)
>>
>> I forget what my order of development was, but it's likely that I first
>> wrote the version for our own internal kernel - which used our original
>> lruvec locking, which did not depend on getting PG_lru first (having got
>> lru_lock, it checked memcg, then tried again if that had changed).
>
> Right. Just holding the lruvec lock without clearing PG_lru would not
> protect against memcg movement in this case.
>
>>
>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>> but it turned out to work okay - elsewhere; but it looks as if I missed
>> its implication when adapting __munlock_page() for upstream.
>>
>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>> spin waiting for PG_lru here?
>
> +Matthew Wilcox
>
> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> folio_set_lru() before checking folio_evictable(). While this is
> probably extraneous since folio_batch_move_lru() will set it again
> afterwards, it's probably harmless given that the lruvec lock is held
> throughout (so no one can complete the folio isolation anyway), and
> given that there were no problems introduced by this extra
> folio_set_lru() as far as I can tell.

After checking the related code: yes, it looks fine to move folio_set_lru()
before the if (folio_evictable(folio)) check in lru_add_fn(), because the
lru lock is held throughout.
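Concretely, something like this rough sketch (untested, condensed from my
reading of lru_add_fn() in mm/swap.c, and leaving aside for now whether the
smp_mb() needs restoring):

static void lru_add_fn(struct lruvec *lruvec, struct folio *folio)
{
        int was_unevictable = folio_test_clear_unevictable(folio);
        long nr_pages = folio_nr_pages(folio);

        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /*
         * Set PG_lru before testing evictability: the lruvec lock is
         * held throughout, so nobody can complete an isolation of this
         * folio in between.  folio_batch_move_lru() will set PG_lru
         * again after we return, which is redundant but harmless.
         */
        folio_set_lru(folio);

        if (folio_evictable(folio)) {
                if (was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
        } else {
                folio_clear_active(folio);
                folio_set_unevictable(folio);
                /* Err on the side of undercounting; reclaim fixes it up */
                folio->mlock_count = 0;
                if (!was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
        }

        lruvec_add_folio(lruvec, folio);
        trace_mm_lru_insertion(folio);
}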
> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> ("mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()") to restore
> the strict ordering between manipulating PG_lru and PG_mlocked, I
> suppose we can get away without having to spin. Again, that would only
> be possible if reworking mlock_count [1] is acceptable. Otherwise, we
> can't clear PG_mlocked before PG_lru in __munlock_folio().

What about the following change, which moves the mlocked handling before
the LRU check in __munlock_folio()?

diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c21..514f0d5bfbfd 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
 static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
 {
         int nr_pages = folio_nr_pages(folio);
-        bool isolated = false;
+        bool isolated = false, mlocked = true;
+
+        mlocked = folio_test_clear_mlocked(folio);
 
         if (!folio_test_clear_lru(folio))
                 goto munlock;
@@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
                 /* Then mlock_count is maintained, but might undercount */
                 if (folio->mlock_count)
                         folio->mlock_count--;
-                if (folio->mlock_count)
+                if (folio->mlock_count) {
+                        if (mlocked)
+                                folio_set_mlocked(folio);
                         goto out;
+                }
         }
         /* else assume that was the last mlock: reclaim will fix it if not */
 
 munlock:
-        if (folio_test_clear_mlocked(folio)) {
+        if (mlocked) {
                 __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
                 if (isolated || !folio_test_unevictable(folio))
                         __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);

>
> I am not saying this is necessarily better than spinning, just a note
> (and perhaps selfishly making [1] more appealing ;)).
>
> [1] https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@xxxxxxxxxx/
>
>>
>> Hugh
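To be clear about the intent of the diff above: it does clear PG_mlocked
before testing PG_lru, which is the ordering you warn about, but when
mlock_count shows the folio is still mlocked, PG_mlocked is set again
while we hold the lruvec lock with the folio isolated (after
folio_test_clear_lru() succeeded), so no concurrent isolation can run in
between. The change is untested, just to show the idea; whether the
transient !PG_mlocked window is acceptable to reclaim still needs checking.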