On Tue, Oct 8, 2024 at 9:52 PM jingxiang zeng <jingxiangzeng.cas@xxxxxxxxx> wrote: > > On Tue, 8 Oct 2024 at 11:26, Wei Xu <weixugc@xxxxxxxxxx> wrote: > > > > On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng > > <jingxiangzeng.cas@xxxxxxxxx> wrote: > > > > > > From: Jingxiang Zeng <linuszeng@xxxxxxxxxxx> > > > > > > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle") > > > removed the opportunity to wake up flushers during the MGLRU page > > > reclamation process can lead to an increased likelihood of triggering OOM > > > when encountering many dirty pages during reclamation on MGLRU. > > > > > > This leads to premature OOM if there are too many dirty pages in cgroup: > > > Killed > > > > > > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), > > > order=0, oom_score_adj=0 > > > > > > Call Trace: > > > <TASK> > > > dump_stack_lvl+0x5f/0x80 > > > dump_stack+0x14/0x20 > > > dump_header+0x46/0x1b0 > > > oom_kill_process+0x104/0x220 > > > out_of_memory+0x112/0x5a0 > > > mem_cgroup_out_of_memory+0x13b/0x150 > > > try_charge_memcg+0x44f/0x5c0 > > > charge_memcg+0x34/0x50 > > > __mem_cgroup_charge+0x31/0x90 > > > filemap_add_folio+0x4b/0xf0 > > > __filemap_get_folio+0x1a4/0x5b0 > > > ? srso_return_thunk+0x5/0x5f > > > ? __block_commit_write+0x82/0xb0 > > > ext4_da_write_begin+0xe5/0x270 > > > generic_perform_write+0x134/0x2b0 > > > ext4_buffered_write_iter+0x57/0xd0 > > > ext4_file_write_iter+0x76/0x7d0 > > > ? selinux_file_permission+0x119/0x150 > > > ? srso_return_thunk+0x5/0x5f > > > ? srso_return_thunk+0x5/0x5f > > > vfs_write+0x30c/0x440 > > > ksys_write+0x65/0xe0 > > > __x64_sys_write+0x1e/0x30 > > > x64_sys_call+0x11c2/0x1d50 > > > do_syscall_64+0x47/0x110 > > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > > > > > memory: usage 308224kB, limit 308224kB, failcnt 2589 > > > swap: usage 0kB, limit 9007199254740988kB, failcnt 0 > > > > > > ... > > > file_dirty 303247360 > > > file_writeback 0 > > > ... > > > > > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test, > > > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0 > > > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB, > > > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB > > > oom_score_adj:0 > > > > > > The flusher wake up was removed to decrease SSD wearing, but if we are > > > seeing all dirty folios at the tail of an LRU, not waking up the flusher > > > could lead to thrashing easily. So wake it up when a mem cgroups is about > > > to OOM due to dirty caches. > > > > > > --- > > > Changes from v3: > > > - Avoid taking lock and reduce overhead on folio isolation by > > > checking the right flags and rework wake up condition, fixing the > > > performance regression reported by Chris Li. > > > [Chris Li, Kairui Song] > > > - Move the wake up check to try_to_shrink_lruvec to cover kswapd > > > case as well, and update comments. [Kairui Song] > > > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@xxxxxxxxx/ > > > Changes from v2: > > > - Acquire the lock before calling the folio_check_dirty_writeback > > > function. [Wei Xu, Jingxiang Zeng] > > > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@xxxxxxxxx/ > > > Changes from v1: > > > - Add code to count the number of unqueued_dirty in the sort_folio > > > function. [Wei Xu, Jingxiang Zeng] > > > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@xxxxxxxxx/ > > > --- > > > > > > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle") > > > Signed-off-by: Zeng Jingxiang <linuszeng@xxxxxxxxxxx> > > > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx> > > > Cc: T.J. Mercier <tjmercier@xxxxxxxxxx> > > > Cc: Wei Xu <weixugc@xxxxxxxxxx> > > > Cc: Yu Zhao <yuzhao@xxxxxxxxxx> > > > --- > > > mm/vmscan.c | 19 ++++++++++++++++--- > > > 1 file changed, 16 insertions(+), 3 deletions(-) > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index dc7a285b256b..2a5c2fe81467 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > > > int tier_idx) > > > { > > > bool success; > > > + bool dirty, writeback; > > > int gen = folio_lru_gen(folio); > > > int type = folio_is_file_lru(folio); > > > int zone = folio_zonenum(folio); > > > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > > > return true; > > > } > > > > > > + dirty = folio_test_dirty(folio); > > > + writeback = folio_test_writeback(folio); > > > + if (type == LRU_GEN_FILE && dirty && !writeback) > > > + sc->nr.unqueued_dirty += delta; > > > + > > > > This sounds good. BTW, when shrink_folio_list() in evict_folios() > > returns, we should add stat.nr_unqueued_dirty to sc->nr.unqueued_dirty > > there as well. > > Thank you for your valuable feedback, I will implement it in the next version. > > > > > /* waiting for writeback */ > > > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > > + if (folio_test_locked(folio) || writeback || > > > + (type == LRU_GEN_FILE && dirty)) { > > > gen = folio_inc_gen(lruvec, folio, true); > > > list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > > > return true; > > > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, > > > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH, > > > scanned, skipped, isolated, > > > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); > > > - > > > + sc->nr.taken += scanned; > > > > I think we should only include file pages (in sort_folio) and isolated > > pages into sc->nr.taken, instead of all scanned pages. For example, if > > Unqueued dirty pages are not isolated, but promoted to the newer generation > in the sort_folio function. So I tend to wake up the flusher thread when the > number of scanned pages is equal to the number of unqueued dirty pages. > > > there are only unevictable and unqueued dirty pages, we would still > > like to wake up the flusher threads, but because nr.taken counts > > unevictable pages as well, the wakeup condition in > > try_to_shrink_lruvec() won't be met. > > The situation you mentioned will not happen because the number of > scanned pages does not include unevicatble pages. > However, there is another situation that needs attention. When the > scanned pages contain anonymous pages and unqueued dirty pages, > the flusher cannot be woken up. I will fix this situation in the next version. > Pages in unevictable LRUs are not scanned, but unevictable pages that are not yet moved to unevictable LRUs can be included in scanned pages. See the !folio_evictable(folio) check in sort_folio(). > > > > > /* > > > * There might not be eligible folios due to reclaim_idx. Check the > > > * remaining to prevent livelock if it's not making progress. > > > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) > > > cond_resched(); > > > } > > > > > > + /* > > > + * If too many file cache in the coldest generation can't be evicted > > > + * due to being dirty, wake up the flusher. > > > + */ > > > + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken) > > > + wakeup_flusher_threads(WB_REASON_VMSCAN); > > > + > > > > try_to_shrink_lruvec() can be called from shrink_node() for global > > reclaim as well. We need to reset sc->nr before calling > > lru_gen_shrink_node() there. MGLRU didn't need that because it didn't > > use sc->nr until this change. > > Thank you for your valuable feedback, I will implement it in the next version. > > > > > /* whether this lruvec should be rotated */ > > > return nr_to_scan < 0; > > > } > > > -- > > > 2.43.5 > > >