Re: [syzbot] [mm?] WARNING in zswap_swapoff

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Mon, 19 Aug 2024 13:12:20 -0700

On Fri, Aug 16, 2024 at 12:52 PM syzbot
<syzbot+ce6029250d7fd4d0476d@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    367b5c3d53e5 Add linux-next specific files for 20240816
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467
> dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d
> compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+ce6029250d7fd4d0476d@xxxxxxxxxxxxxxxxxxxxxxxxx
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> Modules linked in:
> CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
> RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c
> RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293
> RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00
> RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248
> RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac
> R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000
> R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002
> FS:  00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  __do_sys_swapoff mm/swapfile.c:2837 [inline]
>  __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f9e629feb37
> Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48
> RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37
> RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20
> RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94
> R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188
> R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80
>  </TASK>

I am hoping syzbot will find a reproducer and bisect this for us.
Meanwhile, at a high level, it looks to me like we are missing a
zswap_invalidate() call in some paths.

If I had to guess, I would say it's related to the latest mTHP swap
changes, but I am not following closely. Perhaps one of the following
things happened:

(1) We are not calling zswap_invalidate() in some invalidation paths.
It used to not be called for the cluster freeing path, so maybe we end
up with some order-0 swap entries in a cluster? Or maybe there is an
entirely new invalidation path that does not go through
free_swap_slot() for order-0 entries?

(2) Some higher order swap entries (i.e. a cluster) end up in zswap
somehow. zswap_store() has a warning to cover that though. Maybe
some swap entries are allocated as a cluster, the pages are then
swapped out one-by-one as order-0 (which can go to zswap), and we
still free the swap entries as a cluster?
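
To make scenario (2) concrete, here is a toy userspace model, not
kernel code. Every name in it (toy_swap, toy_store(), and so on) is
made up for illustration, and it assumes the warning in
zswap_swapoff() fires because zswap entries are still present for the
swap device at swapoff time. It only shows how a cluster free path
that skips invalidation would leave stale entries behind.

```c
/* Toy model only; all names are hypothetical, none exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_SLOTS 8

struct toy_swap {
	bool slot_used[TOY_NR_SLOTS];   /* swap slot is allocated */
	bool zswap_entry[TOY_NR_SLOTS]; /* compressed copy exists for slot */
};

/* Swap out one order-0 page; in this model it always lands in zswap. */
static void toy_store(struct toy_swap *s, size_t slot)
{
	assert(slot < TOY_NR_SLOTS);
	s->slot_used[slot] = true;
	s->zswap_entry[slot] = true;
}

/* Stand-in for zswap_invalidate(): drop the compressed copy, if any. */
static void toy_invalidate(struct toy_swap *s, size_t slot)
{
	assert(slot < TOY_NR_SLOTS);
	s->zswap_entry[slot] = false;
}

/* Order-0 free path: invalidates before releasing the slot. */
static void toy_free_slot(struct toy_swap *s, size_t slot)
{
	toy_invalidate(s, slot);
	s->slot_used[slot] = false;
}

/* Suspected problem: a cluster free path that skips invalidation. */
static void toy_free_cluster_no_invalidate(struct toy_swap *s,
					   size_t start, size_t nr)
{
	assert(start + nr <= TOY_NR_SLOTS);
	for (size_t i = start; i < start + nr; i++)
		s->slot_used[i] = false;
}

/* Number of entries still present at swapoff; nonzero models the WARN. */
static size_t toy_swapoff_stale(const struct toy_swap *s)
{
	size_t stale = 0;

	for (size_t i = 0; i < TOY_NR_SLOTS; i++)
		if (s->zswap_entry[i])
			stale++;
	return stale;
}
```

Storing slots 0-3 one by one and then freeing them through
toy_free_cluster_no_invalidate() leaves all four entries behind,
while freeing through toy_free_slot() leaves none.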

I am not closely following the latest changes so I am not sure. CCing
folks who have done work in that area recently.

I am starting to think maybe it would be more reliable to just call
zswap_invalidate() for all freed swap entries anyway. Would that be
too expensive? We used to do that before the zswap_invalidate() call
was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry
when swap entry free"), and that was before we started using the
xarray (so it was arguably worse than it would be now).
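
Roughly what I have in mind, again as a toy userspace sketch rather
than a patch. The names (model_swap, model_invalidate(),
model_free_range()) are hypothetical, and the lookups counter is only
a stand-in for the per-entry cost being asked about. The idea is that
every free path funnels through one helper that invalidates each slot
unconditionally, and invalidation is a harmless no-op for slots that
never made it into zswap.

```c
/* Toy model only; all names are hypothetical, none exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SLOTS 8

struct model_swap {
	bool zswap_entry[MODEL_NR_SLOTS]; /* compressed copy exists for slot */
	size_t lookups;                   /* invalidate calls, models the cost */
};

/*
 * Unconditional invalidate: safe to call whether or not an entry exists.
 * Returns true if an entry was actually dropped.
 */
static bool model_invalidate(struct model_swap *s, size_t slot)
{
	bool had;

	assert(slot < MODEL_NR_SLOTS);
	s->lookups++;
	had = s->zswap_entry[slot];
	s->zswap_entry[slot] = false;
	return had;
}

/*
 * Single free helper for both order-0 (nr == 1) and cluster frees.
 * Returns how many entries were dropped.
 */
static size_t model_free_range(struct model_swap *s, size_t start, size_t nr)
{
	size_t dropped = 0;

	assert(start + nr <= MODEL_NR_SLOTS);
	for (size_t i = start; i < start + nr; i++)
		if (model_invalidate(s, i))
			dropped++;
	return dropped;
}
```

The cost in this model is one lookup per freed slot, whether or not
the slot ever had an entry.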