On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2024/10/21 12:44, Barry Song wrote:
> > On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2024/10/11 0:17, Barry Song wrote:
> >>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Ridong,
> >>>>
> >>>> This should be the first version for upstream, and the issue only
> >>>> occurs when a large folio is split.
> >>>>
> >>>> Adding more CCs to see if there's more feedback.
> >>>>
> >>>>
> >>>> On 2024/10/10 16:18, Chen Ridong wrote:
> >>>>> From: Chen Ridong <chenridong@xxxxxxxxxx>
> >>>>>
> >>>>> An issue was found with the following testing steps:
> >>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> >>>>> 2. Mount memcg v1, create a memcg named test_memcg, and set its
> >>>>>    usage_in_bytes=2.1G and memsw.usage_in_bytes=3G.
> >>>>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
> >>>>>
> >>>>> It was found that:
> >>>>>
> >>>>> cat memory.usage_in_bytes
> >>>>> 2144940032
> >>>>> cat memory.memsw.usage_in_bytes
> >>>>> 2255056896
> >>>>>
> >>>>> free -h
> >>>>>        total    used    free
> >>>>> Mem:    31Gi   2.1Gi    27Gi
> >>>>> Swap:  1.0Gi   618Mi   405Mi
> >>>>>
> >>>>> As shown above, test_memcg used only about 100M of swap, but 600M+ of
> >>>>> swap was in use, which means about 500M may be wasted because other
> >>>>> memcgs cannot use that swap space.
> >>>>>
> >>>>> It can be explained as follows:
> >>>>> 1. When entering shrink_inactive_list, folios are isolated from the
> >>>>> lru from tail to head. Suppose, to keep it simple, that it only takes
> >>>>> folioN from the lru.
> >>>>>
> >>>>> inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>>> isolated list: folioN
> >>>>>
> >>>>> 2. In the shrink_page_list function, if folioN is a THP, it may be
> >>>>> split and added to the swap cache folio by folio. After a folio is
> >>>>> added to the swap cache, its writeback io to swap is submitted, which
> >>>>> is asynchronous. When shrink_page_list finishes, the isolated folio
> >>>>> list is moved back to the head of the inactive lru. The inactive lru
> >>>>> may then look like this, with 512 folios having been moved to the head
> >>>>> of the inactive lru:
> >>>>>
> >>>>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>>>>
> >>>>> 3. When a folio's writeback io completes, the folio may be rotated to
> >>>>> the tail of the lru. The following lru list is expected, with the
> >>>>> folios that were added to the swap cache rotated to the tail of the
> >>>>> lru, so that they can be reclaimed as soon as possible:
> >>>>>
> >>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>>>
> >>>>> 4. However, shrink_page_list and folio writeback run asynchronously.
> >>>>> If the THP is split, shrink_page_list loops at least 512 times, which
> >>>>> means that some folios' writeback may complete before shrink_page_list
> >>>>> itself completes, and this can cause the rotation of those folios to
> >>>>> the tail of the lru to fail. The lru may then look as below:
> >>>
> >>> I assume you're referring to PMD-mapped THP, but your code also modifies
> >>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >>>
> >>>>>
> >>>>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >>>>> folioN51<->folioN52<->...folioN511<->folioN512
> >>>>>
> >>>>> Although those folios (N1-N50) have finished writing back, they are
> >>>>> still at the head of the lru. When isolating folios from the lru, it
> >>>>> scans from tail to head, so it is difficult to scan those folios again.
> >>>>>
> >>>>> What is described above can leave a large number of folios that have
> >>>>> been added to the swap cache but cannot be reclaimed in time, which
> >>>>> may reduce reclaim efficiency and prevent other memcgs from using this
> >>>>> swap memory even if they trigger OOM.
> >>>>>
> >>>>> To fix this issue, it's better to stop looping if a THP has been split
> >>>>> and nr_pageout is greater than nr_to_reclaim.
> >>>>>
> >>>>> Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
> >>>>> ---
> >>>>>  mm/vmscan.c | 16 +++++++++++++++-
> >>>>>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>> index 749cdc110c74..fd8ad251eda2 100644
> >>>>> --- a/mm/vmscan.c
> >>>>> +++ b/mm/vmscan.c
> >>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>  	LIST_HEAD(demote_folios);
> >>>>>  	unsigned int nr_reclaimed = 0;
> >>>>>  	unsigned int pgactivate = 0;
> >>>>> -	bool do_demote_pass;
> >>>>> +	bool do_demote_pass, splited = false;
> >>>>>  	struct swap_iocb *plug = NULL;
> >>>>>
> >>>>>  	folio_batch_init(&free_folios);
> >>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>
> >>>>>  		cond_resched();
> >>>>>
> >>>>> +		/*
> >>>>> +		 * If a large folio has been split, many folios are added
> >>>>> +		 * to folio_list. Looping through the entire list takes
> >>>>> +		 * too much time, which may prevent folios that have completed
> >>>>> +		 * writeback from rotating to the tail of the lru. Just
> >>>>> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
> >>>>> +		 */
> >>>>> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>>>> +			break;
> >>>
> >>> I'm not entirely sure about the theory behind comparing stat->nr_pageout
> >>> with sc->nr_to_reclaim. However, the condition might still hold true even
> >>> if you've split a relatively small "large folio," such as 16kB?
> >>>
> >>
> >> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> >> the pages that have been paged out can be reclaimed, then enough pages
> >> will have been reclaimed once they all finish writeback. Thus, it may
> >> not be necessary to page out more.
> >>
> >> If a small large folio (16kB) has been split, it may return early
> >> without all the pages in the folio_list being paged out, but I think
> >> that is fine. It can page out more pages the next time it enters
> >> shrink_folio_list if there are not enough pages to reclaim.
> >>
> >> However, if pages that have been paged out are still at the head of the
> >> LRU, it is difficult to scan these pages again. In this case, not only
> >> might it "waste" some swap memory, but it also has to page out more
> >> pages.
> >>
> >> Considering the above, I sent this patch. It may not be a perfect
> >> solution, but I think it's a good option to consider. And I am wondering
> >> if anyone has a better solution.
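The hunks quoted above do not show where "splited" is set; the diffstat
counts 15 insertions, more than the two hunks shown. A sketch of what the
remaining hunk presumably looks like, assuming the flag is set where
shrink_folio_list() falls back to splitting a large folio onto folio_list.
This is an illustration paraphrased from mm/vmscan.c, not the unquoted
part of the patch:

	if (!add_to_swap(folio)) {
		if (!folio_test_large(folio))
			goto activate_locked_split;
		/* Fallback to swap normal pages */
		if (split_folio_to_list(folio, folio_list))
			goto activate_locked;
		/*
		 * Assumed addition: record that a split has expanded
		 * folio_list so the early-exit check above can fire.
		 */
		splited = true;
		/* ... */
	}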
> >
> > Hi Ridong,
> > My overall understanding is that you haven't clearly described your
> > problem; in particular, I don't understand what your steps 3 and 4 mean:
> >
> >> 3. When a folio's writeback io completes, the folio may be rotated to
> >> the tail of the lru. The following lru list is expected, with the
> >> folios that were added to the swap cache rotated to the tail of the
> >> lru, so that they can be reclaimed as soon as possible:
> >>
> >> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > > 4. However, shrink_page_list and folio writeback run asynchronously.
> > > If the THP is split, shrink_page_list loops at least 512 times, which
> > > means that some folios' writeback may complete before shrink_page_list
> > > itself completes, and this can cause the rotation of those folios to
> > > the tail of the lru to fail. The lru may then look as below:
> >
> > Can you please describe it in a more readable way?
> >
> > I feel the diagram below is somehow wrong:
> > folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > You mentioned "rotate"; how could "rotate" make
> > folioN512<->folioN511<->...folioN1 in (2)
> > become
> > folioN1<->...folioN511<->folioN512 in (3)?
> >
>
> I am sorry for any confusion.
>
> If the THP is split, folioN1, folioN2, folioN3, ... folioN512 are
> submitted for writeback one by one. Assume that folioN1, folioN2,
> folioN3, ... folioN512 complete in that order.
>
> Original:
> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
> folioN1 finishes and is rotated to the tail of the LRU:
> folioN512<->folioN511<->...folioN2<->folio1<->folio2...<->folioN-1<->folioN1
>
> folioN2 finishes:
> folioN512<->folioN511<->...folioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>
> folioN3 finishes:
> folioN512<->folioN511<->...folioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->folioN3
>
> ...
>
> folioN512 finishes:
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> When all the folios have finished, the LRU might look like this:
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

Understood, thanks! Let me try to understand the following part:

> 4:
> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> folioN51<->folioN52<->...folioN511<->folioN512
> Although those folios (N1-N50) have finished writing back, they are
> still at the head of the lru. When isolating folios from the lru, it
> scans from tail to head, so it is difficult to scan those folios again.

What is the reason that "those folios (N1-N50) have finished writing
back, yet they remain at the head of the LRU"? Is it because their
writeback completed while we were still looping in shrink_folio_list(),
causing folio_end_writeback()'s folio_rotate_reclaimable() to fail to
move these folios, which are still on the "folio_list", to the tail of
the LRU?
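For reference, the check involved. A simplified sketch paraphrased from
mm/swap.c around v6.10, with the per-CPU batching details elided: a folio
that is not on the LRU is never rotated.

	void folio_rotate_reclaimable(struct folio *folio)
	{
		/*
		 * A folio still sitting on shrink_folio_list()'s private
		 * folio_list has been isolated, so folio_test_lru() is
		 * false here and writeback completion cannot move it to
		 * the tail of the inactive lru.
		 */
		if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
		    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
			/*
			 * Queue the folio on a per-cpu batch that moves
			 * it to the tail of the lru (elided).
			 */
		}
	}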
> > By the way, writeback isn't always async. It can be sync for zram and
> > sync_io swap. In that case, your patch might change the order of the
> > LRU. I mean, for example, when an mTHP becomes cold, we currently
> > reclaim all of it, rather than reclaiming part of it and putting the
> > remaining small folios back at the head of the lru.
> >
>
> Yes, this can change.
> Although it may put some of the small folios back at the head of the
> lru, it can return from shrink_folio_list in time without causing much
> additional I/O.
>
> If you have understood this issue, do you have any suggestions for
> fixing it? My patch may not be a perfect way to fix this issue.
>

My point is that synchronous I/O, like zRAM, doesn't have this issue and
doesn't require this fix, as writeback always completes without
asynchronous latency.

> Best regards,
> Ridong
>

Thanks
Barry
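For context on why the synchronous case is different: when pageout()
completes the write synchronously, as zram does, the folio is no longer
under writeback by the time control returns to shrink_folio_list(), so it
can be freed in the same pass and never needs the rotation. A simplified
sketch of the PAGE_SUCCESS handling, paraphrased from mm/vmscan.c with
details elided:

	case PAGE_SUCCESS:
		stat->nr_pageout += nr_pages;

		/*
		 * Asynchronous I/O (the problematic case): the folio is
		 * still under writeback, so keep it and rely on rotation
		 * when writeback completes.
		 */
		if (folio_test_writeback(folio))
			goto keep;
		if (folio_test_dirty(folio))
			goto keep;

		/*
		 * A synchronous write - probably a ramdisk such as zram.
		 * Go ahead and try to reclaim the folio in this pass.
		 */
		if (!folio_trylock(folio))
			goto keep;
		/* ... fall through to try to free the folio below ... */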