Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM

Hi Chris,

Before releasing V4, I also ran the swap stress test you gave me
(-j32, 1G memcg) on my local branch:

With the V4 patch:
1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata
920100maxresident)k

Without the patch:
1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata
919880maxresident)k

My results match yours: the slightly higher system time with the patch is
reproducible, so it should not be test noise. I am analyzing whether it
can be optimized further.

Jingxiang Zeng

On Wed, 9 Oct 2024 at 01:12, Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> Hi Jingxiang,
>
> I ran the same swap stress test on V4 and it is much better than V3.
> The V3 test hung (timed out); V4 no longer hangs and finishes in
> about the same time.
>
> Looking more closely at V4, the numbers suggest that its system time
> is slightly worse. Is that expected, or might it be noise in my test?
> Just trying to understand it better; it is not a NACK by any means.
>
> Here is the number on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> Without (10 times):
> user    2688.328
> system  6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> 6057.21 6063.31 6075.76 6123.18
> real    277.145
>
> With V4:
> First run (10 times):
> user    2688.537
> system  6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> 6197.26 6202.98 6204.64 6213.74
> real    280.174
> Second run (10 times):
> user    2771.498
> system  6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> 6204.03 6212.9 6216.32 6256.84
> real    284.854
>
> Chris
>
> On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> <jingxiangzeng.cas@xxxxxxxxx> wrote:
> >
> > From: Jingxiang Zeng <linuszeng@xxxxxxxxxxx>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation
> > on MGLRU.
> >
> > This leads to a premature OOM if there are too many dirty pages in the cgroup:
> > Killed
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x5f/0x80
> >   dump_stack+0x14/0x20
> >   dump_header+0x46/0x1b0
> >   oom_kill_process+0x104/0x220
> >   out_of_memory+0x112/0x5a0
> >   mem_cgroup_out_of_memory+0x13b/0x150
> >   try_charge_memcg+0x44f/0x5c0
> >   charge_memcg+0x34/0x50
> >   __mem_cgroup_charge+0x31/0x90
> >   filemap_add_folio+0x4b/0xf0
> >   __filemap_get_folio+0x1a4/0x5b0
> >   ? srso_return_thunk+0x5/0x5f
> >   ? __block_commit_write+0x82/0xb0
> >   ext4_da_write_begin+0xe5/0x270
> >   generic_perform_write+0x134/0x2b0
> >   ext4_buffered_write_iter+0x57/0xd0
> >   ext4_file_write_iter+0x76/0x7d0
> >   ? selinux_file_permission+0x119/0x150
> >   ? srso_return_thunk+0x5/0x5f
> >   ? srso_return_thunk+0x5/0x5f
> >   vfs_write+0x30c/0x440
> >   ksys_write+0x65/0xe0
> >   __x64_sys_write+0x1e/0x30
> >   x64_sys_call+0x11c2/0x1d50
> >   do_syscall_64+0x47/0x110
> >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> >  memory: usage 308224kB, limit 308224kB, failcnt 2589
> >  swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> >   ...
> >   file_dirty 303247360
> >   file_writeback 0
> >   ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wakeup was removed to decrease SSD wear, but if all the
> > folios at the tail of an LRU are dirty, not waking up the flusher can
> > easily lead to thrashing.  So wake it up when a memory cgroup is about
> > to OOM due to dirty caches.
> >
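> > For reference, a minimal userspace sketch of the kind of workload that
> > hits this case (a hypothetical reproducer, not the exact one above; the
> > report used dd under a ~300MB memcg, and the file path, sizes, and
> > memcg limit here are illustrative only):
> >
> > /*
> >  * Buffered writes that dirty more page cache than the memcg limit
> >  * allows; run inside a memory cgroup with a small memory.max.
> >  */
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <unistd.h>
> >
> > int main(void)
> > {
> >         static char buf[1 << 20];            /* 1MB per write */
> >         long long total = 1LL << 30;         /* 1GB in total */
> >         int fd = open("/mnt/test/dirty.dat",
> >                       O_WRONLY | O_CREAT | O_TRUNC, 0644);
> >
> >         if (fd < 0) {
> >                 perror("open");
> >                 return 1;
> >         }
> >         memset(buf, 0x5a, sizeof(buf));
> >         /*
> >          * No fsync(): the pages stay dirty until the flusher writes
> >          * them back. Before this patch, MGLRU reclaim kept rotating
> >          * the dirty folios without waking the flusher, so the memcg
> >          * charge eventually failed and the OOM killer was invoked.
> >          */
> >         for (long long n = 0; n < total; n += sizeof(buf)) {
> >                 if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
> >                         perror("write");
> >                         break;
> >                 }
> >         }
> >         close(fd);
> >         return 0;
> > }
> >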
> > ---
> > Changes from v3:
> > - Avoid taking the lock and reduce overhead on folio isolation by
> >   checking the right flags, and rework the wakeup condition, fixing
> >   the performance regression reported by Chris Li.
> >   [Chris Li, Kairui Song]
> > - Move the wakeup check to try_to_shrink_lruvec to cover the kswapd
> >   case as well, and update comments. [Kairui Song]
> > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@xxxxxxxxx/
> > Changes from v2:
> > - Acquire the lock before calling the folio_check_dirty_writeback
> >   function. [Wei Xu, Jingxiang Zeng]
> > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@xxxxxxxxx/
> > Changes from v1:
> > - Add code to count the number of unqueued_dirty in the sort_folio
> >   function. [Wei Xu, Jingxiang Zeng]
> > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@xxxxxxxxx/
> > ---
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang <linuszeng@xxxxxxxxxxx>
> > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> > Cc: T.J. Mercier <tjmercier@xxxxxxxxxx>
> > Cc: Wei Xu <weixugc@xxxxxxxxxx>
> > Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
> > ---
> >  mm/vmscan.c | 19 ++++++++++++++++---
> >  1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dc7a285b256b..2a5c2fe81467 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                        int tier_idx)
> >  {
> >         bool success;
> > +       bool dirty, writeback;
> >         int gen = folio_lru_gen(folio);
> >         int type = folio_is_file_lru(folio);
> >         int zone = folio_zonenum(folio);
> > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                 return true;
> >         }
> >
> > +       dirty = folio_test_dirty(folio);
> > +       writeback = folio_test_writeback(folio);
> > +       if (type == LRU_GEN_FILE && dirty && !writeback)
> > +               sc->nr.unqueued_dirty += delta;
> > +
> >         /* waiting for writeback */
> > -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > +       if (folio_test_locked(folio) || writeback ||
> > +           (type == LRU_GEN_FILE && dirty)) {
> >                 gen = folio_inc_gen(lruvec, folio, true);
> >                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> >                 return true;
> > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> >         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> >                                 scanned, skipped, isolated,
> >                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > -
> > +       sc->nr.taken += scanned;
> >         /*
> >          * There might not be eligible folios due to reclaim_idx. Check the
> >          * remaining to prevent livelock if it's not making progress.
> > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                 cond_resched();
> >         }
> >
> > +       /*
> > +        * If too many file cache folios in the coldest generation can't
> > +        * be evicted due to being dirty, wake up the flusher.
> > +        */
> > +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
> >         /* whether this lruvec should be rotated */
> >         return nr_to_scan < 0;
> >  }
> > --
> > 2.43.5
> >




