On Sat, Nov 16, 2024 at 10:26 PM Chen Ridong <chenridong@xxxxxxxxxxxxxxx> wrote:
>
> From: Chen Ridong <chenridong@xxxxxxxxxx>
>
> An issue was found with the following testing steps:
> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> 2. Mount memcg v1, create a memcg named test_memcg, and set
>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
>
> It was found that:
>
> cat memory.usage_in_bytes
> 2144940032
> cat memory.memsw.usage_in_bytes
> 2255056896
>
> free -h
>         total   used    free
> Mem:     31Gi   2.1Gi   27Gi
> Swap:   1.0Gi   618Mi   405Mi
>
> As shown above, test_memcg accounted for only about 100M of swap, but
> 600M+ of swap was in use system-wide, which means roughly 500M may be
> wasted because other memcgs cannot use that swap space.
>
> It can be explained as follows:
> 1. When entering shrink_inactive_list, folios are isolated from the lru
>    from tail to head. Suppose, for simplicity, that it only takes folioN
>    from the lru.
>
>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
>    isolated list: folioN
>
> 2. In the shrink_page_list function, if folioN is a THP (2M), it may be
>    split and added to the swap cache folio by folio. After being added to
>    the swap cache, io is submitted to write each folio back to swap, which
>    is asynchronous. When shrink_page_list finishes, the isolated folio
>    list is moved back to the head of the inactive lru. The inactive lru
>    may then look like this, with 512 folios having been moved to the head
>    of the inactive lru:
>
>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
>    Io was committed from folioN1 to folioN512; folios committed later were
>    added to the head of 'ret_folios' in the shrink_page_list function. As
>    a result, the order is folioN512->folioN511->...->folioN1.
>
> 3. When a folio's writeback io completes, the folio may be rotated to the
>    tail of the lru, one by one. Assuming folioN1, folioN2, ..., folioN512
>    complete in order (io was committed in this order), they are rotated to
>    the tail of the LRU in order (folioN1<->...folioN511<->folioN512).
>    Therefore, the folios at the tail of the lru will be reclaimed as soon
>    as possible.
>
>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> 4. However, shrink_page_list and folio writeback are asynchronous. If a
>    THP is split, shrink_page_list loops at least 512 times, which means
>    that some folios can finish writeback before shrink_page_list
>    completes, and this may lead to a failure to rotate those folios to the
>    tail of the lru. The lru may then look like this:
>
>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>    folioN51<->folioN52<->...folioN511<->folioN512
>
>    Although those folios (N1-N50) have finished writing back, they are
>    still at the head of the lru. This is because their writeback ended
>    while shrink_folio_list() was still looping, so folio_end_writeback()'s
>    folio_rotate_reclaimable() failed to move these folios, which were not
>    on the LRU but still on 'folio_list', to the tail of the LRU. Since
>    isolation scans the lru from tail to head, it is difficult to scan
>    those folios again.
>
> All of the above can leave a large number of folios sitting in the swap
> cache without being reclaimed in time, which may reduce reclaim
> efficiency and prevent other memcgs from using this swap space even if
> they trigger OOM.
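
For context on step 4: the rotation fails because folio_end_writeback() ->
folio_rotate_reclaimable() only rotates folios that are actually on an LRU
list, and folios still sitting on shrink_folio_list()'s private 'folio_list'
are not. A simplified sketch of that guard (paraphrased from mm/swap.c; the
exact code differs across kernel versions, so treat it as illustration only):

	void folio_rotate_reclaimable(struct folio *folio)
	{
		/*
		 * Only clean, evictable folios that are currently on an LRU
		 * list get queued for rotation to the tail.  A folio that is
		 * still on the reclaim path's private 'folio_list' fails the
		 * LRU test, is left where it is, and is later spliced back
		 * to the head of the inactive LRU.
		 */
		if (folio_test_locked(folio) || folio_test_dirty(folio) ||
		    folio_test_unevictable(folio) || !folio_test_lru(folio))
			return;

		/* ... queue the folio to be moved to the tail of its LRU ... */
	}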
>
> To fix this issue, the folios whose writeback has completed should be
> moved to the tail of the LRU, instead of always being placed at the head
> of the LRU, when shrink_page_list finishes. This can be realized by the
> following steps.
> 1. In the shrink_page_list function, the folios whose are committed to

It seems like there's a grammatical error here - whose something?

>    are added to the head of 'folio_list', which will be return to the
>    caller.
> 2. When shrink_page_list finishes, it is known how many folios have been
>    paged out, and they are all at the head of 'folio_list', ready to be
>    moved back to the LRU. So, in the 'move_folios_to_lru' function, if the
>    first 'nr_io' folios (which have been paged out) have been written back
>    completely, move them to the tail of the LRU. Otherwise, move them to
>    the head of the LRU.
>
> Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
> ---
>  mm/vmscan.c | 37 +++++++++++++++++++++++++++++--------
>  1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 76378bc257e3..04f7eab9d818 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1046,6 +1046,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>          struct folio_batch free_folios;
>          LIST_HEAD(ret_folios);
>          LIST_HEAD(demote_folios);
> +        LIST_HEAD(pageout_folios);
>          unsigned int nr_reclaimed = 0;
>          unsigned int pgactivate = 0;
>          bool do_demote_pass;
> @@ -1061,7 +1062,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                  struct address_space *mapping;
>                  struct folio *folio;
>                  enum folio_references references = FOLIOREF_RECLAIM;
> -                bool dirty, writeback;
> +                bool dirty, writeback, is_pageout = false;
>                  unsigned int nr_pages;
>
>                  cond_resched();
> @@ -1384,6 +1385,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                          nr_pages = 1;
>                  }
>                  stat->nr_pageout += nr_pages;
> +                is_pageout = true;
>
>                  if (folio_test_writeback(folio))
>                          goto keep;
> @@ -1508,7 +1510,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  keep_locked:
>                  folio_unlock(folio);
>  keep:
> -                list_add(&folio->lru, &ret_folios);
> +                if (is_pageout)
> +                        list_add(&folio->lru, &pageout_folios);
> +                else
> +                        list_add(&folio->lru, &ret_folios);
>                  VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
>                                  folio_test_unevictable(folio), folio);
>          }
> @@ -1551,6 +1556,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                  free_unref_folios(&free_folios);
>
>          list_splice(&ret_folios, folio_list);
> +        list_splice(&pageout_folios, folio_list);

Do we really need this pageout_folios list? Is the below not sufficient?

+                if (nr_io > 0 &&
+                    !folio_test_reclaim(folio) &&
+                    !folio_test_writeback(folio))
+                        lruvec_add_folio_tail(lruvec, folio);
+                else
+                        lruvec_add_folio(lruvec, folio);

>          count_vm_events(PGACTIVATE, pgactivate);
>
>          if (plug)
> @@ -1826,11 +1832,14 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
>
>  /*
>   * move_folios_to_lru() moves folios from private @list to appropriate LRU list.
> + * @lruvec: The lruvec the folios are moved to.
> + * @list: The folio list to be moved to the lruvec.
> + * @nr_io: The number of folios at the head of the list that have been committed for io.
>   *
>   * Returns the number of pages moved to the given lruvec.
>   */
>  static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> -                struct list_head *list)
> +                struct list_head *list, unsigned int nr_io)
>  {
>          int nr_pages, nr_moved = 0;
>          struct folio_batch free_folios;
> @@ -1880,9 +1889,21 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>                   * inhibits memcg migration).
>                   */
>                  VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
> -                lruvec_add_folio(lruvec, folio);
> +                /*
> +                 * If the folio has been committed for io and has been written
> +                 * back completely, add it to the tail of the lru, so it can
> +                 * be reclaimed as soon as possible.
> +                 */
> +                if (nr_io > 0 &&
> +                    !folio_test_reclaim(folio) &&
> +                    !folio_test_writeback(folio))
> +                        lruvec_add_folio_tail(lruvec, folio);
> +                else
> +                        lruvec_add_folio(lruvec, folio);
> +
>                  nr_pages = folio_nr_pages(folio);
>                  nr_moved += nr_pages;
> +                nr_io = nr_io > nr_pages ? (nr_io - nr_pages) : 0;
>                  if (folio_test_active(folio))
>                          workingset_age_nonresident(lruvec, nr_pages);
>          }
> @@ -1960,7 +1981,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>          nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>
>          spin_lock_irq(&lruvec->lru_lock);
> -        move_folios_to_lru(lruvec, &folio_list);
> +        move_folios_to_lru(lruvec, &folio_list, stat.nr_pageout);
>
>          __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>                             stat.nr_demoted);
> @@ -2111,8 +2132,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>           */
>          spin_lock_irq(&lruvec->lru_lock);
>
> -        nr_activate = move_folios_to_lru(lruvec, &l_active);
> -        nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
> +        nr_activate = move_folios_to_lru(lruvec, &l_active, 0);
> +        nr_deactivate = move_folios_to_lru(lruvec, &l_inactive, 0);
>
>          __count_vm_events(PGDEACTIVATE, nr_deactivate);
>          __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
> @@ -4627,7 +4648,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>
>          spin_lock_irq(&lruvec->lru_lock);
>
> -        move_folios_to_lru(lruvec, &list);
> +        move_folios_to_lru(lruvec, &list, 0);

I'm not entirely convinced about using the 'nr' argument here. Is the goal to
differentiate between two cases?
1. we need to take care of written-back folios
2. we don't need to take care of written-back folios?
Would a bool be better, or should we provide separate helpers?

>
>          walk = current->reclaim_state->mm_walk;
>          if (walk && walk->batched) {
> --
> 2.34.1
>

Thanks
Barry