On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong <chenridong@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2024/10/21 12:44, Barry Song wrote:
> > On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2024/10/11 0:17, Barry Song wrote:
> >>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Ridong,
> >>>>
> >>>> This should be the first version for upstream, and the issue only
> >>>> occurs when a large folio is split.
> >>>>
> >>>> Adding more CCs to see if there's more feedback.
> >>>>
> >>>>
> >>>> On 2024/10/10 16:18, Chen Ridong wrote:
> >>>>> From: Chen Ridong <chenridong@xxxxxxxxxx>
> >>>>>
> >>>>> An issue was found with the following testing steps:
> >>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> >>>>> 2. Mount memcg v1, create a memcg named test_memcg, and set its
> >>>>>    usage_in_bytes=2.1G and memsw.usage_in_bytes=3G.
> >>>>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
> >>>>>
> >>>>> It was found that:
> >>>>>
> >>>>> cat memory.usage_in_bytes
> >>>>> 2144940032
> >>>>> cat memory.memsw.usage_in_bytes
> >>>>> 2255056896
> >>>>>
> >>>>> free -h
> >>>>>        total    used    free
> >>>>> Mem:    31Gi   2.1Gi    27Gi
> >>>>> Swap:  1.0Gi   618Mi   405Mi
> >>>>>
> >>>>> As shown above, test_memcg used only about 100M of swap, but 600M+ of
> >>>>> swap was in use, which means about 500M may be wasted because other
> >>>>> memcgs cannot use that swap space.
> >>>>>
> >>>>> It can be explained as follows:
> >>>>> 1. When entering shrink_inactive_list, folios are isolated from the
> >>>>> lru from tail to head. Suppose, to keep it simple, that it only takes
> >>>>> folioN from the lru.
> >>>>>
> >>>>> inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>>> isolated list: folioN
> >>>>>
> >>>>> 2. In the shrink_page_list function, if folioN is a THP, it may be
> >>>>> split and added to the swap cache folio by folio. After a folio is
> >>>>> added to the swap cache, its writeback io to swap is submitted, which
> >>>>> is asynchronous. When shrink_page_list finishes, the isolated folio
> >>>>> list is moved back to the head of the inactive lru. The inactive lru
> >>>>> may then look like this, with 512 folios having been moved to the head
> >>>>> of the inactive lru:
> >>>>>
> >>>>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>>>>
> >>>>> 3. When a folio's writeback io completes, the folio may be rotated to
> >>>>> the tail of the lru. The following lru list is expected, with the
> >>>>> folios that were added to the swap cache rotated to the tail of the
> >>>>> lru, so that they can be reclaimed as soon as possible:
> >>>>>
> >>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>>>
> >>>>> 4. However, shrink_page_list and folio writeback run asynchronously.
> >>>>> If the THP is split, shrink_page_list loops at least 512 times, which
> >>>>> means that some folios' writeback may complete before shrink_page_list
> >>>>> itself completes, and this can cause the rotation of those folios to
> >>>>> the tail of the lru to fail. The lru may then look as below:
> >>>
> >>> I assume you're referring to PMD-mapped THP, but your code also modifies
> >>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >>>
> >>>>>
> >>>>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >>>>> folioN51<->folioN52<->...folioN511<->folioN512
> >>>>>
> >>>>> Although those folios (N1-N50) have finished writing back, they are
> >>>>> still at the head of the lru. When isolating folios from the lru, it
> >>>>> scans from tail to head, so it is difficult to scan those folios again.
> >>>>>
> >>>>> What is described above can leave a large number of folios that have
> >>>>> been added to the swap cache but cannot be reclaimed in time, which
> >>>>> may reduce reclaim efficiency and prevent other memcgs from using this
> >>>>> swap memory even if they trigger OOM.
> >>>>>
> >>>>> To fix this issue, it's better to stop looping if a THP has been split
> >>>>> and nr_pageout is greater than nr_to_reclaim.
> >>>>>
> >>>>> Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
> >>>>> ---
> >>>>>  mm/vmscan.c | 16 +++++++++++++++-
> >>>>>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>> index 749cdc110c74..fd8ad251eda2 100644
> >>>>> --- a/mm/vmscan.c
> >>>>> +++ b/mm/vmscan.c
> >>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>  	LIST_HEAD(demote_folios);
> >>>>>  	unsigned int nr_reclaimed = 0;
> >>>>>  	unsigned int pgactivate = 0;
> >>>>> -	bool do_demote_pass;
> >>>>> +	bool do_demote_pass, splited = false;
> >>>>>  	struct swap_iocb *plug = NULL;
> >>>>>
> >>>>>  	folio_batch_init(&free_folios);
> >>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>
> >>>>>  		cond_resched();
> >>>>>
> >>>>> +		/*
> >>>>> +		 * If a large folio has been split, many folios are added
> >>>>> +		 * to folio_list. Looping through the entire list takes
> >>>>> +		 * too much time, which may prevent folios that have completed
> >>>>> +		 * writeback from rotating to the tail of the lru. Just
> >>>>> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
> >>>>> +		 */
> >>>>> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>>>> +			break;
> >>>
> >>> I'm not entirely sure about the theory behind comparing stat->nr_pageout
> >>> with sc->nr_to_reclaim. However, the condition might still hold true even
> >>> if you've split a relatively small "large folio," such as 16kB?
> >>>
> >>
> >> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> >> the pages that have been paged out can be reclaimed, then enough pages
> >> will have been reclaimed once they all finish writeback. Thus, it may
> >> not be necessary to page out more.
> >>
> >> If a small large folio (16kB) has been split, it may return early
> >> without all the pages in the folio_list being paged out, but I think
> >> that is fine. It can page out more pages the next time it enters
> >> shrink_folio_list if there are not enough pages to reclaim.
> >>
> >> However, if pages that have been paged out are still at the head of the
> >> LRU, it is difficult to scan these pages again. In this case, not only
> >> might it "waste" some swap memory, but it also has to page out more
> >> pages.
> >>
> >> Considering the above, I sent this patch. It may not be a perfect
> >> solution, but I think it's a good option to consider. And I am wondering
> >> if anyone has a better solution.
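The hunks quoted above do not show where "splited" is set; the diffstat
counts 15 insertions, more than the two hunks shown. A sketch of what the
remaining hunk presumably looks like, assuming the flag is set where
shrink_folio_list() falls back to splitting a large folio onto folio_list.
This is an illustration paraphrased from mm/vmscan.c, not the unquoted
part of the patch:

	if (!add_to_swap(folio)) {
		if (!folio_test_large(folio))
			goto activate_locked_split;
		/* Fallback to swap normal pages */
		if (split_folio_to_list(folio, folio_list))
			goto activate_locked;
		/*
		 * Assumed addition: record that a split has expanded
		 * folio_list so the early-exit check above can fire.
		 */
		splited = true;
		/* ... */
	}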
> >
> > Hi Ridong,
> > My overall understanding is that you haven't clearly described your
> > problem; in particular, I don't understand what your steps 3 and 4 mean:
> >
> >> 3. When a folio's writeback io completes, the folio may be rotated to
> >> the tail of the lru. The following lru list is expected, with the
> >> folios that were added to the swap cache rotated to the tail of the
> >> lru, so that they can be reclaimed as soon as possible:
> >>
> >> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > > 4. However, shrink_page_list and folio writeback run asynchronously.
> > > If the THP is split, shrink_page_list loops at least 512 times, which
> > > means that some folios' writeback may complete before shrink_page_list
> > > itself completes, and this can cause the rotation of those folios to
> > > the tail of the lru to fail. The lru may then look as below:
> >
> > Can you please describe it in a more readable way?
> >
> > I feel the diagram below is somehow wrong:
> > folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > You mentioned "rotate"; how could "rotate" make
> > folioN512<->folioN511<->...folioN1 in (2)
> > become
> > folioN1<->...folioN511<->folioN512 in (3)?
> >
>
> I am sorry for any confusion.
>
> If the THP is split, folioN1, folioN2, folioN3, ... folioN512 are
> submitted for writeback one by one. Assume that folioN1, folioN2,
> folioN3, ... folioN512 complete in that order.
>
> Original:
> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
> folioN1 finishes and is rotated to the tail of the LRU:
> folioN512<->folioN511<->...folioN2<->folio1<->folio2...<->folioN-1<->folioN1
>
> folioN2 finishes:
> folioN512<->folioN511<->...folioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>
> folioN3 finishes:
> folioN512<->folioN511<->...folioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->folioN3
>
> ...
>
> folioN512 finishes:
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> When all the folios have finished, the LRU might look like this:
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

Understood, thanks! Let me try to understand the following part:

> 4:
> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> folioN51<->folioN52<->...folioN511<->folioN512
> Although those folios (N1-N50) have finished writing back, they are
> still at the head of the lru. When isolating folios from the lru, it
> scans from tail to head, so it is difficult to scan those folios again.

What is the reason that "those folios (N1-N50) have finished writing
back, yet they remain at the head of the LRU"? Is it because their
writeback completed while we were still looping in shrink_folio_list(),
causing folio_end_writeback()'s folio_rotate_reclaimable() to fail to
move these folios, which are still on the "folio_list", to the tail of
the LRU?
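For reference, the check involved. A simplified sketch paraphrased from
mm/swap.c around v6.10, with the per-CPU batching details elided: a folio
that is not on the LRU is never rotated.

	void folio_rotate_reclaimable(struct folio *folio)
	{
		/*
		 * A folio still sitting on shrink_folio_list()'s private
		 * folio_list has been isolated, so folio_test_lru() is
		 * false here and writeback completion cannot move it to
		 * the tail of the inactive lru.
		 */
		if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
		    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
			/*
			 * Queue the folio on a per-cpu batch that moves
			 * it to the tail of the lru (elided).
			 */
		}
	}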
> > By the way, writeback isn't always async. It can be sync for zram and
> > sync_io swap. In that case, your patch might change the order of the
> > LRU. I mean, for example, when an mTHP becomes cold, we currently
> > reclaim all of it, rather than reclaiming part of it and putting the
> > remaining small folios back at the head of the lru.
> >
>
> Yes, this can change.
> Although it may put some of the small folios back at the head of the
> lru, it can return from shrink_folio_list in time without causing much
> additional I/O.
>
> If you have understood this issue, do you have any suggestions for
> fixing it? My patch may not be a perfect way to fix this issue.
>

My point is that synchronous I/O, like zRAM, doesn't have this issue and
doesn't require this fix, as writeback always completes without
asynchronous latency.

> Best regards,
> Ridong
>

Thanks
Barry
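For context on why the synchronous case is different: when pageout()
completes the write synchronously, as zram does, the folio is no longer
under writeback by the time control returns to shrink_folio_list(), so it
can be freed in the same pass and never needs the rotation. A simplified
sketch of the PAGE_SUCCESS handling, paraphrased from mm/vmscan.c with
details elided:

	case PAGE_SUCCESS:
		stat->nr_pageout += nr_pages;

		/*
		 * Asynchronous I/O (the problematic case): the folio is
		 * still under writeback, so keep it and rely on rotation
		 * when writeback completes.
		 */
		if (folio_test_writeback(folio))
			goto keep;
		if (folio_test_dirty(folio))
			goto keep;

		/*
		 * A synchronous write - probably a ramdisk such as zram.
		 * Go ahead and try to reclaim the folio in this pass.
		 */
		if (!folio_trylock(folio))
			goto keep;
		/* ... fall through to try to free the folio below ... */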