On Thu, Aug 22, 2024 at 11:13 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Tue, Aug 20, 2024 at 11:42 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > On Wed, Aug 21, 2024 at 1:49 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >
> > > On Tue, Aug 20, 2024 at 9:02 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Aug 20, 2024 at 4:47 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot
> > > > > > <syzbot+ce6029250d7fd4d0476d@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > syzbot found the following issue on:
> > > > > > >
> > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816
> > > > >
> > > > > I can't find this commit, seems this commit is not in linux-next any more?
> > > > >
> > > > > > > git tree: linux-next
> > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000
> > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467
> > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d
> > > > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > > > > > >
> > > > > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > > > > >
> > > > > > > Downloadable assets:
> > > > > > > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz
> > > > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz
> > > > > > > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz
> > > > > > >
> > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > > > > Reported-by: syzbot+ce6029250d7fd4d0476d@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > > > > >
> > > > > > > ------------[ cut here ]------------
> > > > > > > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> > > > > > > Modules linked in:
> > > > > > > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0
> > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
> > > > > > > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> > > > > > > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c
> > > > > > > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293
> > > > > > > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00
> > > > > > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248
> > > > > > > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac
> > > > > > > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000
> > > > > > > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002
> > > > > > > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000
> > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0
> > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > > Call Trace:
> > > > > > > <TASK>
> > > > > > > __do_sys_swapoff mm/swapfile.c:2837 [inline]
> > > > > > > __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706
> > > > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > > > > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > > > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > > > > RIP: 0033:0x7f9e629feb37
> > > > > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48
> > > > > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8
> > > > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37
> > > > > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20
> > > > > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94
> > > > > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188
> > > > > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80
> > > > > > > </TASK>
> > > > > > >
> > > > > > I am hoping syzbot would find a reproducer and bisect this for us.
> > > > > > Meanwhile, from a high-level it looks to me like we are missing a
> > > > > > zswap_invalidate() call in some paths.
> > > > > >
> > > > > > If I have to guess, I would say it's related to the latest mTHP swap
> > > > > > changes, but I am not following closely. Perhaps one of the following
> > > > > > things happened:
> > > > > >
> > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths.
> > > > > > It used to not be called for the cluster freeing path, so maybe we end
> > > > > > up with some order-0 swap entries in a cluster? or maybe there is an
> > > > > > entirely new invalidation path that does not go through
> > > > > > free_swap_slot() for order-0 entries?
> > > > > >
> > > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap
> > > > > > somehow. zswap_store() has a warning to cover that though. Maybe
> > > > > > somehow some swap entries are allocated as a cluster, but then pages
> > > > > > are swapped out one-by-one as order-0 (which can go to zswap), but
> > > > > > then we still free the swap entries as a cluster?
> > > > > >
> > > > > Hi Yosry, thanks for the report.
> > > > >
> > > > > There are many mTHP related optimizations recently, for this problem I
> > > > > can reproduce this locally. Can confirm the problem is gone for me
> > > > > after reverting:
> > > > >
> > > > > "mm: attempt to batch free swap entries for zap_pte_range()"
> > > > >
> > > > > Hi Barry,
> > > > >
> > > > > If a set of contiguous slots have the same value, they are considered
> > > > > a mTHP and freed, bypassing the slot cache, and causing a zswap leak.
> > > > > This didn't happen in put_swap_folio because that function is
> > > > > expecting an actual mTHP folio behind the slots, but
> > > > > free_swap_and_cache_nr is simply walking the slots.
> > > > >
> > > > > For the testing, I actually have to disable mTHP, because linux-next
> > > > > will panic with mTHP due to the lack of the following fixes:
> > > > > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@xxxxxxxxx/
> > > > > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@xxxxxxxxx/
> > > > >
> > > > > > I am not closely following the latest changes so I am not sure. CCing
> > > > > > folks who have done work in that area recently.
> > > > > >
> > > > > > I am starting to think maybe it would be more reliable to just call
> > > > > > zswap_invalidate() for all freed swap entries anyway. Would that be
> > > > > > too expensive? We used to do that before the zswap_invalidate() call
> > > > > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > > > > > when swap entry free"), and that was before we started using the
> > > > > > xarray (so it was arguably worse than it would be now).
> > > > > >
> > > > > That might be a good idea, I suggest moving zswap_invalidate to
> > > > > swap_range_free and calling it for every freed slot.
> > > > >
> > > > > The below patch can be squashed into or put before "mm: attempt to batch
> > > > > free swap entries for zap_pte_range()".
> > > > >
> > > > Hmm, on second thought, the commit message in the attached commit
> > > > might not be suitable; the current zswap_invalidate is also designed to
> > > > only work for order 0 ZSWAP, so things are not clean even after this.
> > > >
> > > Kairui, what about the below? We don't touch the path of __try_to_reclaim_swap(), where
> > > you have one folio backing the entries?
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index c1638a009113..8ff58be40544 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si,
> > >          unlock_cluster_or_swap_info(si, ci);
> > >
> > >          if (!has_cache) {
> > > +                for (i = 0; i < nr; i++)
> > > +                        zswap_invalidate(swp_entry(si->type, offset + i));
> > >                  spin_lock(&si->lock);
> > >                  swap_entry_range_free(si, entry, nr);
> > >                  spin_unlock(&si->lock);
> >
> > Hi Barry,
> >
> > Thanks for updating this thread, I'm thinking maybe something would
> > better be done on the zswap side?
> >
> > The concern of using zswap_invalidate is that it calls xa_erase which
> > requires the xa spin lock. But if we are calling zswap_invalidate in
> > swap_entry_range_free, and ensure the slot is HAS_CACHE pinned, doing
> > a lockless read first with xa_load should be OK for checking if the
> > slot needs a ZSWAP invalidation. The performance cost will be minimal
> > and we only need to call zswap_invalidate in one place, something like
> > this (haven't tested, comments are welcome). Also, ZSWAP mTHP will
> > still store entries in order 0, so this should be OK for the future.
> >
> While I do agree with this change on a high level, it's essentially
> reverting commit 0827a1fb143f ("mm/zswap: invalidate zswap entry when
> swap entry free") which fixed a small problem with zswap writeback.
> I'd prefer that we don't if possible.
>
> One thing that I always wanted to do is to pull some of the work done
> in swap_entry_range_free() and swap_range_free() before the slots
> caching layer. The memcg uncharging, clearing shadow entries from the
> swap cache, arch invalidation, zswap invalidation, etc. If we can have
> a hook for these pre-free callbacks we can call it for single entries
> before we add them to the slots cache, and call them for the clusters
> as we do today. This should also reduce the amount of work done under
> the lock, and move more work to where the freeing is actually
> happening vs. the cache draining.
>
> I remember discussing this briefly with Ying before. Anyone have any thoughts?

Hi Yosry,

If I understand correctly, the lock you are talking about is the si->lock, right?

Kairui has some WIP patches removing the swap slot cache in the swap entry freeing path.
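For illustration only, the pre-free hook idea quoted above might look
roughly like the sketch below. The helper name and placement are
assumptions, not the actual WIP code: free_swap_slot() would call it
with nr == 1 before stashing the slot in the cache, and the
cluster/batch free paths would call it once per contiguous range.

/*
 * Sketch only, hypothetical helper: one pre-free hook that runs before
 * entries reach the slots cache for order-0 frees, and once per range
 * for cluster/batch frees.  The memcg uncharge, swap cache shadow
 * clearing and arch invalidation could move here as well.
 */
static void swap_entries_pre_free(struct swap_info_struct *si,
                                  swp_entry_t entry, int nr)
{
        unsigned long offset = swp_offset(entry);
        int i;

        for (i = 0; i < nr; i++)
                zswap_invalidate(swp_entry(si->type, offset + i));
}

With something like this, the slot cache and the batched free paths
would share one place that does the per-entry invalidation, instead of
each path having to remember to call zswap_invalidate() itself.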
Basically the si->lock is only used to protect the cluster lists. Most
of the time, freeing a swap entry will only take the ci->lock, with no
need to take the si->lock. Only when the cluster moves to another list
will it require the si->lock, e.g. the cluster moves to the free list
when all 512 entries are freed. Because each cluster has 512 entries,
the need to take the si->lock is dramatically reduced.

That patch is based on the new cluster swap allocator series. Kairui
can share more details.

I don't think the ci->lock has heavy contention.

Chris

>
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 13ab3b771409..d7bb3caa9d4e 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -273,9 +273,6 @@ void free_swap_slot(swp_entry_t entry)
> >  {
> >          struct swap_slots_cache *cache;
> >
> > -        /* Large folio swap slot is not covered. */
> > -        zswap_invalidate(entry);
> > -
> >          cache = raw_cpu_ptr(&swp_slots);
> >          if (likely(use_swap_slot_cache && cache->slots_ret)) {
> >                  spin_lock_irq(&cache->free_lock);
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f947f4dd31a9..fbc25d38a27e 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -242,9 +242,6 @@ static int __try_to_reclaim_swap(struct
> > swap_info_struct *si,
> >          folio_set_dirty(folio);
> >
> >          spin_lock(&si->lock);
> > -        /* Only sinple page folio can be backed by zswap */
> > -        if (nr_pages == 1)
> > -                zswap_invalidate(entry);
> >          swap_entry_range_free(si, entry, nr_pages);
> >          spin_unlock(&si->lock);
> >          ret = nr_pages;
> > @@ -1545,6 +1542,10 @@ static void swap_entry_range_free(struct
> > swap_info_struct *si, swp_entry_t entry
> >          unsigned char *map_end = map + nr_pages;
> >          struct swap_cluster_info *ci;
> >
> > +        /* Slots are pinned with SWAP_HAS_CACHE, safe to invalidate */
> > +        for (int i = 0; i < nr_pages; ++i)
> > +                zswap_invalidate(swp_entry(si->type, offset + i));
> > +
> >          ci = lock_cluster(si, offset);
> >          do {
> >                  VM_BUG_ON(*map != SWAP_HAS_CACHE);
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index df66ab102d27..100ad04397fe 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1656,15 +1656,18 @@ bool zswap_load(struct folio *folio)
> >          return true;
> >  }
> >
> > +/* Caller need to pin the slot to prevent parallel store */
> >  void zswap_invalidate(swp_entry_t swp)
> >  {
> >          pgoff_t offset = swp_offset(swp);
> >          struct xarray *tree = swap_zswap_tree(swp);
> >          struct zswap_entry *entry;
> >
> > -        entry = xa_erase(tree, offset);
> > -        if (entry)
> > -                zswap_entry_free(entry);
> > +        if (xa_load(tree, offset)) {
> > +                entry = xa_erase(tree, offset);
> > +                if (entry)
> > +                        zswap_entry_free(entry);
> > +        }
> >  }
> >
> >  int zswap_swapon(int type, unsigned long nr_pages)
> > --
> > 2.45.2
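To make the cluster locking described above a bit more concrete, here
is a minimal sketch, assuming a per-cluster lock and count plus a
free_clusters list on the swap_info_struct (the names are assumptions
based on the description, not the actual WIP series):

/*
 * Sketch only, with assumed field and list names; the locking details
 * are simplified compared to the real series.
 */
static void cluster_free_entry_sketch(struct swap_info_struct *si,
                                      struct swap_cluster_info *ci,
                                      unsigned long offset)
{
        bool cluster_empty;

        spin_lock(&ci->lock);           /* common case: cluster lock only */
        si->swap_map[offset] = 0;
        cluster_empty = --ci->count == 0;
        spin_unlock(&ci->lock);

        if (cluster_empty) {
                /* rare case: all 512 entries freed, move the cluster to the free list */
                spin_lock(&si->lock);
                list_move_tail(&ci->list, &si->free_clusters);
                spin_unlock(&si->lock);
        }
}

The per-entry work stays under the ci->lock, and the si->lock only
serializes the rare list movement, which is why the need to take it is
dramatically reduced.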