Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out

On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>
> Hi Ridong,
>
> This should be the first version for upstream, and the issue only
> occurs when a large folio is split.
>
> Adding more CCs to see if there's more feedback.
>
>
> On 2024/10/10 16:18, Chen Ridong wrote:
> > From: Chen Ridong <chenridong@xxxxxxxxxx>
> >
> > An issue was found with the following testing steps:
> > 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> > 2. Mount memcg v1, create a memcg named test_memcg, and set
> >     limit_in_bytes=2.1G, memsw.limit_in_bytes=3G.
> > 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
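
(In case it helps others reproduce this, the steps above translate roughly
to the commands below; paths assume memcg v1 is mounted at
/sys/fs/cgroup/memory, and "alloc_anon" is a hypothetical stand-in for any
program that mmaps and touches 2.2G of anonymous memory:)

	mkdir /sys/fs/cgroup/memory/test_memcg
	echo 2150M > /sys/fs/cgroup/memory/test_memcg/memory.limit_in_bytes
	echo 3G > /sys/fs/cgroup/memory/test_memcg/memory.memsw.limit_in_bytes
	dd if=/dev/zero of=/swapfile bs=1M count=1024
	mkswap /swapfile && swapon /swapfile
	echo $$ > /sys/fs/cgroup/memory/test_memcg/cgroup.procs
	./alloc_anon 2200M	# hypothetical 2.2G anonymous-memory allocator
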
> >
> > It was found that:
> >
> > cat memory.usage_in_bytes
> > 2144940032
> > cat memory.memsw.usage_in_bytes
> > 2255056896
> >
> > free -h
> >                total        used        free
> > Mem:           31Gi       2.1Gi        27Gi
> > Swap:         1.0Gi       618Mi       405Mi
> >
> > As shown above, test_memcg is charged for only about 105M of swap
> > (2255056896 - 2144940032 bytes), yet 618M of swap is in use system-wide.
> > Roughly 500M of swap may therefore be wasted, because other memcgs
> > cannot use that swap space.
> >
> > It can be explained as follows:
> > 1. When entering shrink_inactive_list, folios are isolated from the lru,
> >     scanning from tail to head. Suppose, for simplicity, that only folioN
> >     is taken:
> >
> >     inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >     isolated list: folioN
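
(For anyone following along: isolation takes folios from the tail of the
lru, roughly like the loop in isolate_lru_folios() in mm/vmscan.c, heavily
simplified here:)

	struct list_head *src = &lruvec->lists[lru];

	while (scan < nr_to_scan && !list_empty(src)) {
		struct folio *folio = lru_to_folio(src); /* tail of the lru */

		scan += folio_nr_pages(folio);
		list_move(&folio->lru, dst); /* onto the isolated list */
	}
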
> >
> > 2. In shrink_folio_list, if folioN is a THP, it may be split and added
> >     to the swap cache folio by folio. After a folio is added to the swap
> >     cache, io is submitted to write it back to swap, which is asynchronous.
> >     When shrink_folio_list finishes, the isolated folio list is moved back
> >     to the head of the inactive lru, which may then look like this, with
> >     512 folios moved to the head of the inactive lru:
> >
> >     folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
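
(The split in step 2 happens in the add_to_swap path of shrink_folio_list;
condensed from memory, so details may be off:)

	if (!add_to_swap(folio)) {
		if (!folio_test_large(folio))
			goto activate_locked_split;
		/* sub-folios of the split THP are queued on folio_list */
		if (split_folio_to_list(folio, folio_list))
			goto activate_locked;
		/* retry swap allocation with the now-small head folio */
		if (!add_to_swap(folio))
			goto activate_locked_split;
	}
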
> >
> > 3. When a folio's writeback io completes, the folio may be rotated to
> >     the tail of the lru. The expected lru list is shown below, with the
> >     folios that were added to the swap cache rotated to the tail so that
> >     they can be reclaimed as soon as possible:
> >
> >     folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
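
(The rotation in step 3 is driven from writeback completion; when the
writeback bit is cleared, folio_end_writeback() in mm/filemap.c does,
roughly:)

	if (folio_test_reclaim(folio)) {
		folio_clear_reclaim(folio);
		/* move the folio to the tail of the inactive lru */
		folio_rotate_reclaimable(folio);
	}

Note that folio_rotate_reclaimable() only rotates a folio that is actually
on an lru list; a folio still sitting on the isolated list is skipped,
which is exactly what step 4 below runs into.
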
> >
> > 4. However, shrink_folio_list and folio writeback run asynchronously. If
> >     a THP is split, shrink_folio_list loops at least 512 times, so some
> >     folios may finish writeback while shrink_folio_list is still running.
> >     Those folios are still on the isolated list, so they cannot be
> >     rotated to the tail of the lru, which may end up looking like this:

I assume you’re referring to PMD-mapped THP, but this change also affects
mTHP, which might not be that large; it could be a 16KB mTHP, for instance.

> >
> >     folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >     folioN51<->folioN52<->...folioN511<->folioN512
> >
> >     Although folios N1-N50 have finished writing back, they are still at
> >     the head of the lru. Since isolation scans the lru from tail to head,
> >     those folios are unlikely to be scanned again any time soon.
> >
> > All of the above can leave a large number of folios sitting in the swap
> > cache without being reclaimed in time, which reduces reclaim efficiency
> > and prevents other memcgs from using that swap space even when they
> > trigger OOM.
> >
> > To fix this issue, stop the loop once a THP has been split and
> > nr_pageout exceeds nr_to_reclaim.
> >
> > Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
> > ---
> >   mm/vmscan.c | 16 +++++++++++++++-
> >   1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 749cdc110c74..fd8ad251eda2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >       LIST_HEAD(demote_folios);
> >       unsigned int nr_reclaimed = 0;
> >       unsigned int pgactivate = 0;
> > -     bool do_demote_pass;
> > +     bool do_demote_pass, splited = false;
> >       struct swap_iocb *plug = NULL;
> >
> >       folio_batch_init(&free_folios);
> > @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >
> >               cond_resched();
> >
> > +             /*
> > +              * If a large folio has been split, many folios are added
> > +              * to folio_list. Looping through the entire list takes
> > +              * too much time, which may prevent folios that have completed
> > +              * writeback from rotating to the tail of the lru. Just
> > +              * stop looping if nr_pageout is greater than nr_to_reclaim.
> > +              */
> > +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> > +                     break;

I’m not entirely sure about the theory behind comparing stat->nr_pageout
with sc->nr_to_reclaim. Note also that the condition can hold even after
you’ve split a relatively small “large folio”, such as a 16KB one.
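
If the intent is specifically pmd-mapped THP, one (untested) way to narrow
the bail-out would be to arm the flag only for pmd-order splits, e.g.:

	if ((nr_pages > 1) && !folio_test_large(folio)) {
		sc->nr_scanned -= (nr_pages - 1);
		/* hypothetical: only arm the early break for full THP splits */
		if (nr_pages >= HPAGE_PMD_NR)
			splited = true;
		nr_pages = 1;
	}

Otherwise even a single 16KB split would arm the early-break check.
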

> > +
> >               folio = lru_to_folio(folio_list);
> >               list_del(&folio->lru);
> >
> > @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >               if ((nr_pages > 1) && !folio_test_large(folio)) {
> >                       sc->nr_scanned -= (nr_pages - 1);
> >                       nr_pages = 1;
> > +                     splited = true;
> >               }
> >
> >               /*
> > @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                               if (nr_pages > 1 && !folio_test_large(folio)) {
> >                                       sc->nr_scanned -= (nr_pages - 1);
> >                                       nr_pages = 1;
> > +                                     splited = true;
> >                               }
> >                               goto activate_locked;
> >                       case PAGE_SUCCESS:
> >                               if (nr_pages > 1 && !folio_test_large(folio)) {
> >                                       sc->nr_scanned -= (nr_pages - 1);
> >                                       nr_pages = 1;
> > +                                     splited = true;
> >                               }
> >                               stat->nr_pageout += nr_pages;
> >
> > @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >               if (nr_pages > 1) {
> >                       sc->nr_scanned -= (nr_pages - 1);
> >                       nr_pages = 1;
> > +                     splited = true;
> >               }
> >   activate_locked:
> >               /* Not a candidate for swapping, so reclaim swap space. */
>




