On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@xxxxxxxxxx> wrote:
>
>
> On 2024/10/11 0:17, Barry Song wrote:
> > On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
> >>
> >> Hi Ridong,
> >>
> >> This should be the first version for upstream, and the issue only
> >> occurs when a large folio is split.
> >>
> >> Adding more CCs to see if there's more feedback.
> >>
> >>
> >> On 2024/10/10 16:18, Chen Ridong wrote:
> >>> From: Chen Ridong <chenridong@xxxxxxxxxx>
> >>>
> >>> An issue was found with the following testing steps:
> >>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> >>> 2. Mount memcg v1, create a memcg named test_memcg, and set
> >>> memory.limit_in_bytes=2.1G and memory.memsw.limit_in_bytes=3G.
> >>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
> >>>
> >>> It was found that:
> >>>
> >>> cat memory.usage_in_bytes
> >>> 2144940032
> >>> cat memory.memsw.usage_in_bytes
> >>> 2255056896
> >>>
> >>> free -h
> >>>         total   used   free
> >>> Mem:     31Gi  2.1Gi   27Gi
> >>> Swap:   1.0Gi  618Mi  405Mi
> >>>
> >>> As shown above, test_memcg was charged for only about 100M of swap, but
> >>> 600M+ of swap was actually in use, which means about 500M may be wasted
> >>> because other memcgs cannot use that swap memory.
> >>>
> >>> It can be explained as follows:
> >>> 1. When entering shrink_inactive_list, folios are isolated from the lru
> >>> from tail to head. Suppose it just takes folioN from the lru (to keep it
> >>> simple).
> >>>
> >>> inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>> isolated list: folioN
> >>>
> >>> 2. In shrink_page_list, if folioN is a THP, it may be split and added to
> >>> the swap cache folio by folio. After a folio is added to the swap cache,
> >>> io is submitted to write it back to swap, which is asynchronous. When
> >>> shrink_page_list finishes, the isolated folio list is moved back to the
> >>> head of the inactive lru. The inactive lru may then look like this, with
> >>> 512 folios having been moved to the head of the inactive lru:
> >>>
> >>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>>
> >>> 3. When a folio's writeback io completes, the folio may be rotated to
> >>> the tail of the lru. The following lru list is expected, with the folios
> >>> that have been added to the swap cache rotated to the tail of the lru so
> >>> that they can be reclaimed as soon as possible:
> >>>
> >>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>
> >>> 4. However, shrink_page_list and folio writeback are asynchronous. If
> >>> the THP is split, shrink_page_list loops at least 512 times, which means
> >>> that shrink_page_list is not yet finished when some folios' writeback
> >>> has already completed, and this may lead to a failure to rotate those
> >>> folios to the tail of the lru. The lru may look like this:
> >
> > I assume you're referring to PMD-mapped THP, but your code also modifies
> > mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >
> >>>
> >>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >>> folioN51<->folioN52<->...folioN511<->folioN512
> >>>
> >>> Although those folios (N1-N50) have finished writing back, they are
> >>> still at the head of the lru. When isolating folios from the lru, the
> >>> scan goes from tail to head, so it is difficult to scan those folios
> >>> again.
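For reference, the "rotation" mentioned in steps 3 and 4 is the PG_reclaim /
folio_rotate_reclaimable() mechanism. A minimal sketch of how it is expected
to work, simplified from the pageout(), folio_end_writeback() and
folio_rotate_reclaimable() paths rather than copied from the upstream code,
looks roughly like this:

/*
 * Illustrative sketch only -- not the literal upstream code.
 */
#include <linux/mm.h>	/* folio flag helpers */
#include <linux/swap.h>	/* folio_rotate_reclaimable() */

/* 1) Reclaim tags the folio before submitting the (async) swap write. */
static void sketch_start_pageout(struct folio *folio)
{
	/* "move me to the lru tail once writeback completes" */
	folio_set_reclaim(folio);
	/* ... submit the swap write ... */
}

/* 2) The end-of-writeback path tries to honour that tag. */
static void sketch_end_pageout(struct folio *folio)
{
	if (folio_test_reclaim(folio)) {
		folio_clear_reclaim(folio);
		/*
		 * This only moves the folio if it is actually on an lru
		 * list. While shrink_folio_list() is still working through
		 * a long folio_list (e.g. 512 entries from a split THP),
		 * the folio is isolated, so the rotation is lost and the
		 * folio is later put back at the head of the inactive lru
		 * instead of its tail.
		 */
		folio_rotate_reclaimable(folio);
	}
}

The patch below tries to leave the shrink loop earlier, so that more
writeback completions find their folio already back on the lru and the
rotation actually takes effect.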
> >>>
> >>> What is described above may lead to a large number of folios that have
> >>> been added to the swap cache but cannot be reclaimed in time, which may
> >>> reduce reclaim efficiency and prevent other memcgs from using this swap
> >>> memory even if they trigger OOM.
> >>>
> >>> To fix this issue, it's better to stop looping if a THP has been split
> >>> and nr_pageout is greater than nr_to_reclaim.
> >>>
> >>> Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
> >>> ---
> >>>   mm/vmscan.c | 16 +++++++++++++++-
> >>>   1 file changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 749cdc110c74..fd8ad251eda2 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>   	LIST_HEAD(demote_folios);
> >>>   	unsigned int nr_reclaimed = 0;
> >>>   	unsigned int pgactivate = 0;
> >>> -	bool do_demote_pass;
> >>> +	bool do_demote_pass, splited = false;
> >>>   	struct swap_iocb *plug = NULL;
> >>>
> >>>   	folio_batch_init(&free_folios);
> >>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>
> >>>   		cond_resched();
> >>>
> >>> +		/*
> >>> +		 * If a large folio has been split, many folios are added
> >>> +		 * to folio_list. Looping through the entire list takes
> >>> +		 * too much time, which may prevent folios that have completed
> >>> +		 * writeback from rotating to the tail of the lru. Just
> >>> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
> >>> +		 */
> >>> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>> +			break;
> >
> > I'm not entirely sure about the theory behind comparing stat->nr_pageout
> > with sc->nr_to_reclaim. However, the condition might still hold true even
> > if you've split a relatively small "large folio," such as 16kB?
> >
>
> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> the pages that have been paged out can be reclaimed, then enough pages can
> be reclaimed once they have all finished writeback. Thus, it may not be
> necessary to page out more.
>
> If a small large folio (16kB) has been split, shrink_folio_list may return
> early without all of the pages in folio_list being paged out, but I think
> that is fine. It can page out more pages the next time it enters
> shrink_folio_list if not enough pages have been reclaimed.
>
> However, if pages that have been paged out are still at the head of the
> LRU, it is difficult to scan those pages again. In this case, not only
> might some swap memory be "wasted", but more pages also have to be paged
> out.
>
> Considering the above, I sent this patch. It may not be a perfect
> solution, but I think it's a good option to consider. And I am wondering
> if anyone has a better solution.

Hi Ridong,

My overall understanding is that you have not described your problem
clearly; in particular, I don't understand what your points 3 and 4 mean:

> 3. When a folio's writeback io completes, the folio may be rotated to
> the tail of the lru. The following lru list is expected, with the folios
> that have been added to the swap cache rotated to the tail of the lru so
> that they can be reclaimed as soon as possible:
>
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> 4. However, shrink_page_list and folio writeback are asynchronous.
> If the THP is split, shrink_page_list loops at least 512 times, which
> means that shrink_page_list is not yet finished when some folios'
> writeback has already completed, and this may lead to a failure to
> rotate those folios to the tail of the lru. The lru may look like this:

Can you please describe it in a more readable way? I feel the diagram
below is somehow wrong:

folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

You mentioned "rotate"; how could "rotate" make
folioN512<->folioN511<->...folioN1 in (2) become
folioN1<->...folioN511<->folioN512 in (3)?

BTW, writeback isn't always async. It can be synchronous for zram and
SWP_SYNCHRONOUS_IO swap. In that case, your patch might change the order
of the LRU. I mean, for example, when an mTHP becomes cold, we currently
reclaim all of it, rather than reclaiming only part of it and putting some
of its small folios back at the head of the lru.

>
> Best regards,
> Ridong
>
> >>> +
> >>>   		folio = lru_to_folio(folio_list);
> >>>   		list_del(&folio->lru);
> >>>
> >>> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>   			if ((nr_pages > 1) && !folio_test_large(folio)) {
> >>>   				sc->nr_scanned -= (nr_pages - 1);
> >>>   				nr_pages = 1;
> >>> +				splited = true;
> >>>   			}
> >>>
> >>>   			/*
> >>> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>   				if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>   					sc->nr_scanned -= (nr_pages - 1);
> >>>   					nr_pages = 1;
> >>> +					splited = true;
> >>>   				}
> >>>   				goto activate_locked;
> >>>   			case PAGE_SUCCESS:
> >>>   				if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>   					sc->nr_scanned -= (nr_pages - 1);
> >>>   					nr_pages = 1;
> >>> +					splited = true;
> >>>   				}
> >>>   				stat->nr_pageout += nr_pages;
> >>>
> >>> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>   		if (nr_pages > 1) {
> >>>   			sc->nr_scanned -= (nr_pages - 1);
> >>>   			nr_pages = 1;
> >>> +			splited = true;
> >>>   		}
> >>>  activate_locked:
> >>>   		/* Not a candidate for swapping, so reclaim swap space. */
> >>
>
Thanks
barry
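For readers who find the scattered hunks hard to follow, the change boils
down to roughly the following flow. This is a heavily abridged sketch of the
proposed shrink_folio_list() behaviour (the helper name and the elided steps
are placeholders), not the literal upstream function:

/*
 * Abridged sketch of shrink_folio_list() with the proposed change;
 * see the diff quoted above for the real hunks.
 */
static void sketch_shrink_folio_list(struct list_head *folio_list,
				     struct scan_control *sc,
				     struct reclaim_stat *stat)
{
	bool splited = false;

	while (!list_empty(folio_list)) {
		struct folio *folio;
		unsigned int nr_pages;

		cond_resched();

		/*
		 * Proposed early exit: once a large folio has been split
		 * and enough pages have been queued for writeback, stop
		 * walking the (now very long) list so that the remaining
		 * folios are put back on the lru sooner and later writeback
		 * completions can rotate them to its tail.
		 */
		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
			break;

		folio = lru_to_folio(folio_list);
		list_del(&folio->lru);
		nr_pages = folio_nr_pages(folio);

		/* ... add to swap cache; a large folio may be split here ... */

		if (nr_pages > 1 && !folio_test_large(folio)) {
			/* the large folio was split into single pages */
			nr_pages = 1;
			splited = true;
		}

		/* ... pageout(); on PAGE_SUCCESS: stat->nr_pageout += nr_pages ... */
	}

	/* the caller puts any leftover folios back at the head of the lru */
}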