On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>
> Hi Ridong,
>
> This should be the first version for upstream, and the issue only
> occurs when a large folio is split.
>
> Adding more CCs to see if there's more feedback.
>
>
> On 2024/10/10 16:18, Chen Ridong wrote:
> > From: Chen Ridong <chenridong@xxxxxxxxxx>
> >
> > An issue was found with the following testing steps:
> > 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> > 2. Mount memcg v1, create a memcg named test_memcg, and set
> >    memory.limit_in_bytes=2.1G and memory.memsw.limit_in_bytes=3G.
> > 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
> >
> > It was found that:
> >
> > cat memory.usage_in_bytes
> > 2144940032
> > cat memory.memsw.usage_in_bytes
> > 2255056896
> >
> > free -h
> >              total        used        free
> > Mem:          31Gi       2.1Gi        27Gi
> > Swap:        1.0Gi       618Mi       405Mi
> >
> > As shown above, test_memcg used only about 100M of swap, but 600M+ of
> > swap was in use system-wide, which means roughly 500M may be wasted
> > because other memcgs cannot use that swap space.
> >
> > It can be explained as follows:
> > 1. When entering shrink_inactive_list, folios are isolated from the
> >    lru from tail to head. Suppose it takes just folioN from the lru
> >    (to keep it simple).
> >
> >    inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >    isolated list: folioN
> >
> > 2. In shrink_page_list, if folioN is a THP, it may be split and added
> >    to the swap cache folio by folio. After a folio is added to the
> >    swap cache, io is submitted to write it back to swap, which is
> >    asynchronous. When shrink_page_list finishes, the isolated folio
> >    list is moved back to the head of the inactive lru. The inactive
> >    lru may then look like this, with the 512 folios moved to the head
> >    of the inactive lru:
> >
> >    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >
> > 3. When a folio's writeback io completes, the folio may be rotated to
> >    the tail of the lru. The following lru list is expected, with the
> >    folios that were added to the swap cache rotated to the tail of
> >    the lru, so that they can be reclaimed as soon as possible:
> >
> >    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > 4. However, shrink_page_list and folio writeback are asynchronous. If
> >    a THP is split, shrink_page_list loops at least 512 times, which
> >    means that shrink_page_list has not finished while writeback of
> >    some folios has already completed, and this may lead to a failure
> >    to rotate those folios to the tail of the lru. The lru may then
> >    look like this:

I assume you're referring to PMD-mapped THP, but your code also modifies
mTHP, which might not be that large. For instance, it could be a 16KB mTHP.

> >
> >    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >    folioN51<->folioN52<->...folioN511<->folioN512
> >
> > Although those folios (N1-N50) have finished writeback, they are
> > still at the head of the lru. Since isolation scans the lru from tail
> > to head, it is difficult to reach those folios again.
> >
> > As described above, a large number of folios may be added to the swap
> > cache but not be reclaimed in time, which reduces reclaim efficiency
> > and prevents other memcgs from using this swap space even if they
> > trigger OOM.
> >
> > To fix this issue, stop looping if a THP has been split and
> > nr_pageout is greater than nr_to_reclaim.
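
For anyone else trying to reproduce this, here is the setup as I
understand it in shell form (the mount point, file names and the use of
limit_in_bytes are my assumptions based on the description above, not
taken from an actual script):

  # memcg v1 setup; memory.memsw.* needs swap accounting enabled
  # (e.g. booting with swapaccount=1)
  mount -t cgroup -o memory none /sys/fs/cgroup/memory
  mkdir /sys/fs/cgroup/memory/test_memcg
  echo 2100M > /sys/fs/cgroup/memory/test_memcg/memory.limit_in_bytes
  echo 3G > /sys/fs/cgroup/memory/test_memcg/memory.memsw.limit_in_bytes

  # 1G swap file
  dd if=/dev/zero of=/swapfile bs=1M count=1024
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile

  # join the memcg, then allocate ~2.2G of anonymous memory from this
  # shell, e.g. with a small mmap()+memset() test program
  echo $$ > /sys/fs/cgroup/memory/test_memcg/cgroup.procs
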
> >
> > Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
> > ---
> >  mm/vmscan.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 749cdc110c74..fd8ad251eda2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >          LIST_HEAD(demote_folios);
> >          unsigned int nr_reclaimed = 0;
> >          unsigned int pgactivate = 0;
> > -        bool do_demote_pass;
> > +        bool do_demote_pass, splited = false;
> >          struct swap_iocb *plug = NULL;
> >
> >          folio_batch_init(&free_folios);
> > @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >
> >                  cond_resched();
> >
> > +                /*
> > +                 * If a large folio has been split, many folios are added
> > +                 * to folio_list. Looping through the entire list takes
> > +                 * too much time, which may prevent folios that have completed
> > +                 * writeback from rotating to the tail of the lru. Just
> > +                 * stop looping if nr_pageout is greater than nr_to_reclaim.
> > +                 */
> > +                if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> > +                        break;

I'm not entirely sure about the theory behind comparing stat->nr_pageout
with sc->nr_to_reclaim. However, the condition might still hold true even
if you've split a relatively small "large folio," such as a 16KB one?
(See the worked example at the end of this mail.)

> > +
> >                  folio = lru_to_folio(folio_list);
> >                  list_del(&folio->lru);
> >
> > @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                  if ((nr_pages > 1) && !folio_test_large(folio)) {
> >                          sc->nr_scanned -= (nr_pages - 1);
> >                          nr_pages = 1;
> > +                        splited = true;
> >                  }
> >
> >                  /*
> > @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                          if (nr_pages > 1 && !folio_test_large(folio)) {
> >                                  sc->nr_scanned -= (nr_pages - 1);
> >                                  nr_pages = 1;
> > +                                splited = true;
> >                          }
> >                          goto activate_locked;
> >                  case PAGE_SUCCESS:
> >                          if (nr_pages > 1 && !folio_test_large(folio)) {
> >                                  sc->nr_scanned -= (nr_pages - 1);
> >                                  nr_pages = 1;
> > +                                splited = true;
> >                          }
> >                          stat->nr_pageout += nr_pages;
> >
> > @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                  if (nr_pages > 1) {
> >                          sc->nr_scanned -= (nr_pages - 1);
> >                          nr_pages = 1;
> > +                        splited = true;
> >                  }
> >  activate_locked:
> >                  /* Not a candidate for swapping, so reclaim swap space. */
>
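
To put some rough numbers on that concern (these are assumptions for
illustration, not measurements from the report): in direct reclaim
sc->nr_to_reclaim is often just SWAP_CLUSTER_MAX, i.e. 32 base pages.
With 4KB base pages a 16KB mTHP is only 4 pages, so splitting one sets
splited = true while adding just three extra entries to folio_list.
After nine such mTHPs have been split and paged out, stat->nr_pageout
is 9 * 4 = 36, which already exceeds sc->nr_to_reclaim = 32, so the new
check would break out of the loop even though folio_list was never
inflated by a 512-entry PMD-sized split. That makes me wonder whether
nr_to_reclaim is really the right threshold here.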