On Wed, Dec 18, 2024 at 2:19 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> Thanks! Looking
>
> On Tue, Dec 17, 2024 at 02:39, Sasha Levin <sashal@xxxxxxxxxx> wrote:
> >
> > On Sun, Dec 15, 2024 at 07:45:38PM -0700, Yu Zhao wrote:
> > >Hi Kairui,
> > >
> > >On Sun, Dec 15, 2024 at 10:45 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >>
> > >> On Sun, Dec 15, 2024 at 3:43 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >> >
> > >> > On Sat, Dec 14, 2024 at 2:06 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > >> > >
> > >> > > On Fri, Dec 13, 2024 at 8:56 PM syzbot
> > >> > > <syzbot+38a0cbd267eff2d286ff@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > >> > > >
> > >> > > > Hello,
> > >> > > >
> > >> > > > syzbot found the following issue on:
> > >> > > >
> > >> > > > HEAD commit:    7cb1b4663150 Merge tag 'locking_urgent_for_v6.13_rc3' of g..
> > >> > > > git tree:       upstream
> > >> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=16e96b30580000
> > >> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=fee25f93665c89ac
> > >> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff
> > >> > > > compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > >> > > >
> > >> > > > Unfortunately, I don't have any reproducer for this issue yet.
> > >> > > >
> > >> > > > Downloadable assets:
> > >> > > > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-7cb1b466.raw.xz
> > >> > > > vmlinux: https://storage.googleapis.com/syzbot-assets/13e083329dab/vmlinux-7cb1b466.xz
> > >> > > > kernel image: https://storage.googleapis.com/syzbot-assets/fe3847d08513/bzImage-7cb1b466.xz
> > >> > > >
> > >> > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > >> > > > Reported-by: syzbot+38a0cbd267eff2d286ff@xxxxxxxxxxxxxxxxxxxxxxxxx
> > >> > > >
> > >> > > > ------------[ cut here ]------------
> > >> > > > WARNING: CPU: 0 PID: 80 at mm/list_lru.c:97 lock_list_lru_of_memcg+0x395/0x4e0 mm/list_lru.c:97
> > >> > > > Modules linked in:
> > >> > > > CPU: 0 UID: 0 PID: 80 Comm: kswapd0 Not tainted 6.13.0-rc2-syzkaller-00018-g7cb1b4663150 #0
> > >> > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > >> > > > RIP: 0010:lock_list_lru_of_memcg+0x395/0x4e0 mm/list_lru.c:97
> > >> > > > Code: e9 22 fe ff ff e8 9b cc b6 ff 4c 8b 7c 24 10 45 84 f6 0f 84 40 ff ff ff e9 37 01 00 00 e8 83 cc b6 ff eb 05 e8 7c cc b6 ff 90 <0f> 0b 90 eb 97 89 e9 80 e1 07 80 c1 03 38 c1 0f 8c 7a fd ff ff 48
> > >> > > > RSP: 0018:ffffc9000105e798 EFLAGS: 00010093
> > >> > > > RAX: ffffffff81e891c4 RBX: 0000000000000000 RCX: ffff88801f53a440
> > >> > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > >> > > > RBP: ffff888042e70054 R08: ffffffff81e89156 R09: 1ffffffff2032cae
> > >> > > > R10: dffffc0000000000 R11: fffffbfff2032caf R12: ffffffff81e88e5e
> > >> > > > R13: ffffffff9a3feb20 R14: 0000000000000000 R15: ffff888042e70000
> > >> > > > FS:  0000000000000000(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
> > >> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> > > > CR2: 0000000020161000 CR3: 0000000032d12000 CR4: 0000000000352ef0
> > >> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >> > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >> > > > Call Trace:
> > >> > > >  <TASK>
> > >> > > >  list_lru_add+0x59/0x270 mm/list_lru.c:164
> > >> > > >  list_lru_add_obj+0x17b/0x250 mm/list_lru.c:187
> > >> > > >  workingset_update_node+0x1af/0x230 mm/workingset.c:634
> > >> > > >  xas_update lib/xarray.c:355 [inline]
> > >> > > >  update_node lib/xarray.c:758 [inline]
> > >> > > >  xas_store+0xb8f/0x1890 lib/xarray.c:845
> > >> > > >  page_cache_delete mm/filemap.c:149 [inline]
> > >> > > >  __filemap_remove_folio+0x4e9/0x670 mm/filemap.c:232
> > >> > > >  __remove_mapping+0x86f/0xad0 mm/vmscan.c:791
> > >> > > >  shrink_folio_list+0x30a6/0x5ca0 mm/vmscan.c:1467
> > >> > > >  evict_folios+0x3c86/0x5800 mm/vmscan.c:4593
> > >> > > >  try_to_shrink_lruvec+0x9a6/0xc70 mm/vmscan.c:4789
> > >> > > >  shrink_one+0x3b9/0x850 mm/vmscan.c:4834
> > >> > > >  shrink_many mm/vmscan.c:4897 [inline]
> > >> > > >  lru_gen_shrink_node mm/vmscan.c:4975 [inline]
> > >> > > >  shrink_node+0x37c5/0x3e50 mm/vmscan.c:5956
> > >> > > >  kswapd_shrink_node mm/vmscan.c:6785 [inline]
> > >> > > >  balance_pgdat mm/vmscan.c:6977 [inline]
> > >> > > >  kswapd+0x1ca9/0x36f0 mm/vmscan.c:7246
> > >> > > >  kthread+0x2f0/0x390 kernel/kthread.c:389
> > >> > > >  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
> > >> > > >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> > >> > > >  </TASK>
> > >> > >
> > >> > > This one seems to be related to "mm/list_lru: split the lock to
> > >> > > per-cgroup scope".
> > >> > >
> > >> > > Kairui, can you please take a look? Thanks.
> > >> >
> > >> > Thanks for pinging, yes that's a new sanity check added by me.
> > >> >
> > >> > Which is supposed to mean, a list_lru is being reparented while the
> > >> > memcg it belongs to isn't dying.
> > >> >
> > >> > More concretely, list_lru is marked dead by memcg_offline_kmem ->
> > >> > memcg_reparent_list_lrus, if the function is called for one memcg, but
> > >> > now the memcg is not dying, this WARN triggers. I'm not sure how this
> > >> > is caused. One possibility is if alloc_shrinker_info() in
> > >> > mem_cgroup_css_online failed, then memcg_offline_kmem is called early?
> > >> > Doesn't seem to fit this case though.. Or maybe just sync issues with
> > >> > the memcg dying flag so the user saw the list_lru dying before seeing
> > >> > memcg dying? The object might be leaked to the parent cgroup, seems
> > >> > not too terrible though.
> > >> >
> > >> > I'm not sure how to reproduce this. I will keep looking.
> > >>
> > >> Managed to boot the image and using the kernel config provided by bot,
> > >> so far local tests didn't trigger any issue. Is there any way I can
> > >> reproduce what the bot actually did?
> > >
> > >If syzbot doesn't have a repro, it might not be productive for you to
> > >try to find one. Personally, I would analyze stacktraces and double
> > >check the code, and move on if I can't find something obviously wrong.
> > >
> > >> Or provide some patch for the bot
> > >> to test?
> > >
> > >syzbot only can try patches after it finds a repro. So in this case,
> > >no, it can't try your patches.
> > >
> > >Hope the above clarifies things for you.
> >
> > Chiming in here as LKFT seems to be able to hit a nearby warning on
> > boot.
> >
> > The link below contains the full log as well as additional information
> > on the run.
> >
> > https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.13-rc2-232-g4800575d8c0b/testrun/26323524/suite/log-parser-test/test/exception-warning-cpu-pid-at-mmlist_lruc-list_lru_del/details/
>

After some investigation, the mm/list_lru.c:80 warning should be fixed by:

diff --git a/mm/list_lru.c b/mm/list_lru.c
index f93ada6a207b..7d69434c70e0 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -77,7 +77,6 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 		spin_lock(&l->lock);
 		nr_items = READ_ONCE(l->nr_items);
 		if (likely(nr_items != LONG_MIN)) {
-			WARN_ON(nr_items < 0);
 			rcu_read_unlock();
 			return l;
 		}
@@ -450,6 +449,7 @@ static void memcg_reparent_list_lru_one(struct list_lru *lru, int nid,
 
 	list_splice_init(&src->list, &dst->list);
 	if (src->nr_items) {
+		WARN_ON(src->nr_items < 0);
 		dst->nr_items += src->nr_items;
 		set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
 	}

This warning should be caused by the short time window between `mlru =
xas_store(&xas, NULL);` and memcg_reparent_list_lru_one() in
memcg_reparent_list_lrus(): if a user deletes an item from the list_lru
during that window, the lookup falls back to the parent, so the parent's
nr_items can go negative and trigger the warning. The child list_lru
still holds the actual item, but it is the parent's counter that gets
updated (a tiny userspace replay of this interleaving is attached as the
first sketch at the end of this mail). The counters are synced again by
the reparenting itself, so this is not a real problem. We can keep the
WARN_ON and just move it to reparent time; that removes the false
warning while still catching misuse.

I'm not 100% sure this is exactly the LKFT warning, but I will send this
out after double-checking, as it does need to be fixed either way.

The mm/list_lru.c:97 warning seems to be a different problem. I suspect
memcg_list_lru_alloc() wasn't called for shadow_nodes, but kswapd
started early. If that is the case, it might not be a new issue, just
one exposed by this new sanity check, and it can be bypassed with:

diff --git a/mm/list_lru.c b/mm/list_lru.c
index f93ada6a207b..5f124a661ee8 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -81,6 +81,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 			rcu_read_unlock();
 			return l;
 		}
+		VM_WARN_ON(!css_is_dying(&memcg->css));
 		if (irq)
 			spin_unlock_irq(&l->lock);
 		else
@@ -94,7 +95,6 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 			rcu_read_unlock();
 			return NULL;
 		}
-		VM_WARN_ON(!css_is_dying(&memcg->css));
 		memcg = parent_mem_cgroup(memcg);
 		goto again;
 	}

But I'm not sure whether it indicates a potential (and pre-existing)
list_lru leak; keeping this sanity check at its current place could be
helpful for catching a missing memcg_list_lru_alloc() call (the second
sketch at the end of this mail illustrates the difference between the
two placements). I will try to send a proper fix after confirming the
root cause and reproducing it locally.
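For reference, here is a minimal single-threaded userspace replay of the
interleaving described above (first sketch). The struct name lru_one and
the child/parent variables are made up for illustration; only the counter
arithmetic loosely mirrors the kernel, not the locking, RCU, the list, or
the xarray:

#include <assert.h>
#include <stdio.h>

/* Loose stand-in for a per-memcg list_lru_one: only nr_items is modeled. */
struct lru_one {
        long nr_items;
};

int main(void)
{
        struct lru_one child  = { .nr_items = 1 };  /* one object on the child's list */
        struct lru_one parent = { .nr_items = 0 };

        /* Reparent step 1: the child's xarray slot is cleared
         * (mlru = xas_store(&xas, NULL)), so a concurrent lookup now
         * resolves to the parent. The list splice has not happened yet. */

        /* A deleter hits the window: the lookup falls back to the parent,
         * so the parent's counter is decremented although the object
         * still physically sits on the child's list. */
        parent.nr_items--;
        printf("parent.nr_items inside the window: %ld\n", parent.nr_items);

        /* Reparent step 2: memcg_reparent_list_lru_one() splices the
         * child's list over and transfers the count, making the counters
         * consistent again. */
        parent.nr_items += child.nr_items;
        child.nr_items = 0;

        assert(parent.nr_items == 0);
        printf("parent.nr_items after the splice: %ld\n", parent.nr_items);
        return 0;
}

So a WARN_ON(nr_items < 0) at lookup time can fire on this legal
transient state, while checking src->nr_items < 0 at reparent time should
only fire when the counters genuinely fail to balance.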
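And a second sketch for the check relocation in the second patch. The
names struct slot and why_no_lru() are hypothetical; the point is only to
contrast the two states a lookup can see when there is no usable lru: a
slot never allocated (memcg_list_lru_alloc() not called yet) vs. a slot
marked dead (nr_items == LONG_MIN) by reparenting:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

struct slot {
        bool allocated;
        long nr_items;  /* LONG_MIN means dead/reparented */
};

static const char *why_no_lru(const struct slot *s, bool memcg_dying)
{
        if (!s->allocated) {
                /* The old check placement also warned when walking to the
                 * parent from here, although this case can simply be a
                 * missing memcg_list_lru_alloc() call. */
                return "slot never allocated";
        }
        if (s->nr_items == LONG_MIN) {
                /* The moved check warns only here: lru already marked
                 * dead while its memcg is not dying. */
                if (!memcg_dying)
                        fprintf(stderr, "WARN: lru dead but memcg not dying\n");
                return "lru marked dead by reparenting";
        }
        return "lru usable";
}

int main(void)
{
        struct slot unallocated = { .allocated = false };
        struct slot dead = { .allocated = true, .nr_items = LONG_MIN };

        printf("%s\n", why_no_lru(&unallocated, false)); /* no warning */
        printf("%s\n", why_no_lru(&dead, false));        /* warns */
        return 0;
}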