On Tue, Aug 20, 2024 at 9:29 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Tue, Aug 20, 2024 at 5:22 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > On Tue, Aug 20, 2024 at 8:47 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >
> > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot
> > > > <syzbot+ce6029250d7fd4d0476d@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > syzbot found the following issue on:
> > > > >
> > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816
> > >
> > > I can't find this commit, seems this commit is not in linux-next any more?
> > >
> > > > > git tree: linux-next
> > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000
> > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467
> > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d
> > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > > > >
> > > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > > >
> > > > > Downloadable assets:
> > > > > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz
> > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz
> > > > > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz
> > > > >
> > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > > Reported-by: syzbot+ce6029250d7fd4d0476d@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > > >
> > > > > ------------[ cut here ]------------
> > > > > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> > > > > Modules linked in:
> > > > > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0
> > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
> > > > > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> > > > > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c
> > > > > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293
> > > > > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00
> > > > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248
> > > > > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac
> > > > > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000
> > > > > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002
> > > > > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000
> > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0
> > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > Call Trace:
> > > > >  <TASK>
> > > > >  __do_sys_swapoff mm/swapfile.c:2837 [inline]
> > > > >  __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706
> > > > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > > >  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > > > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > > RIP: 0033:0x7f9e629feb37
> > > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48
> > > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8
> > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37
> > > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20
> > > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94
> > > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188
> > > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80
> > > > >  </TASK>
> > > >
> > > > I am hoping syzbot would find a reproducer and bisect this for us.
> > > > Meanwhile, from a high-level it looks to me like we are missing a
> > > > zswap_invalidate() call in some paths.
> > > >
> > > > If I have to guess, I would say it's related to the latest mTHP swap
> > > > changes, but I am not following closely. Perhaps one of the following
> > > > things happened:
> > > >
> > > > (1) We are not calling zswap_invalidate() in some invalidation paths.
> > > > It used to not be called for the cluster freeing path, so maybe we end
> > > > up with some order-0 swap entries in a cluster? or maybe there is an
> > > > entirely new invalidation path that does not go through
> > > > free_swap_slot() for order-0 entries?
> > > >
> > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap
> > > > somehow. zswap_store() has a warning to cover that though.
> > > > Maybe somehow some swap entries are allocated as a cluster, but then pages
> > > > are swapped out one-by-one as order-0 (which can go to zswap), but
> > > > then we still free the swap entries as a cluster?
> > > >
> > > Hi Yosry, thanks for the report.
> > >
> > > There are many mTHP related optimizations recently, for this problem I
> > > can reproduce this locally. Can confirm the problem is gone for me
> > > after reverting:
> > >
> > > "mm: attempt to batch free swap entries for zap_pte_range()"
> > >
> > > Hi Barry,
> > >
> > > If a set of continuous slots are having the same value, they are
> > > considered a mTHP and freed, bypassing the slot cache, and causing
> > > zswap leak.
> > > This didn't happen in put_swap_folio because that function is
> > > expecting an actual mTHP folio behind the slots but
> > > free_swap_and_cache_nr is simply walking the slots.
> > >
> > Hi Kairui,
> >
> > I don't understand, if anyone has a folio backend, the code will
> > go fallback to __try_to_reclaim_swap(), it won't call
> > swap_entry_range_free().
> >
> >         ci = lock_cluster_or_swap_info(si, offset);
> >         if (!swap_is_last_map(si, offset, nr, &has_cache)) {
> >                 unlock_cluster_or_swap_info(si, ci);
> >                 goto fallback;
> >         }
> >         for (i = 0; i < nr; i++)
> >                 WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
> >         unlock_cluster_or_swap_info(si, ci);
> >
> >         if (!has_cache) {
> >                 spin_lock(&si->lock);
> >                 swap_entry_range_free(si, entry, nr);
> >                 spin_unlock(&si->lock);
> >         }
> >         return has_cache;
> >
> > Am i missing something?
>
> Hi Barry,
>
> Per my understanding, ZSWAP invalidation could happen after the folio
> is gone from the swap cache, especially in free_swap_and_cache_nr, it
> will iterate and zap the swap slots without swapping them in.
> So a slot doesn't have a folio backed doesn't mean it doesn't have ZSWAP data.

well. thanks! the original non-batched code always has a
zswap_invalidate() in free_swap_slot().

void free_swap_slot(swp_entry_t entry)
{
        struct swap_slots_cache *cache;

        /* Large folio swap slot is not covered. */
        zswap_invalidate(entry);
        ...
}

But the benefit of batched free is huge (on phones, almost 100% of PTEs
have no swapcache because they use sync I/O swap; on SSDs, only a small
part of the PTEs have a folio backend in the swapcache, so they would
still benefit significantly from batched free).

So let's try to find a proper fix for this before we have to fall back
to the ugly workaround below (which would at least still benefit phones,
which never use zswap):

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f947f4dd31a9..c8c70a9bf6d6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1498,6 +1498,9 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	unsigned char count;
 	int i;
 
+	if (!zswap_never_enabled())
+		goto fallback;
+
 	if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1)
 		goto fallback;
 	/* cross into another cluster */

Thanks
Barry
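
To make the leak discussed above concrete, here is a minimal userspace
sketch. It is not kernel code: every toy_* name, the two arrays and
NR_SLOTS are hypothetical stand-ins invented for illustration. It only
models the idea that the per-slot path invalidates zswap before
releasing a slot, while a batched range free that skips that step can
leave compressed data behind for slots that are already free.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SLOTS 8

/* Toy state (hypothetical): per-slot swap count and "zswap holds data" flag. */
static unsigned char toy_swap_map[NR_SLOTS];
static bool toy_zswap_present[NR_SLOTS];

/* Mark every slot as in use and backed by a compressed copy. */
static void toy_fill(void)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		toy_swap_map[i] = 1;
		toy_zswap_present[i] = true;
	}
}

/* Stand-in for zswap_invalidate(): drop the compressed copy of one slot. */
static void toy_zswap_invalidate(int slot)
{
	toy_zswap_present[slot] = false;
}

/* Per-slot path: invalidate first, then release the slot. */
static void toy_free_slot(int slot)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	toy_zswap_invalidate(slot);
	toy_swap_map[slot] = 0;
}

/* Batched path without invalidation: releases the range only. */
static void toy_free_range_batched(int start, int nr)
{
	assert(start >= 0 && nr >= 0 && start + nr <= NR_SLOTS);
	for (int i = 0; i < nr; i++)
		toy_swap_map[start + i] = 0;
}

/* Batched path that also invalidates each slot (one possible direction). */
static void toy_free_range_fixed(int start, int nr)
{
	assert(start >= 0 && nr >= 0 && start + nr <= NR_SLOTS);
	for (int i = 0; i < nr; i++) {
		toy_zswap_invalidate(start + i);
		toy_swap_map[start + i] = 0;
	}
}

/* Count free slots that still have zswap data, i.e. leaked entries. */
static int toy_count_leaked(void)
{
	int leaked = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		if (toy_swap_map[i] == 0 && toy_zswap_present[i])
			leaked++;
	return leaked;
}
```

Calling toy_fill() and then toy_free_range_batched(0, 4) leaves four
leaked entries in this model, whereas toy_free_range_fixed(0, 4) or
repeated toy_free_slot() calls leave none. Whether per-slot invalidation
inside the batched path is cheap enough in the real kernel is exactly
the open question in this thread; the sketch takes no position on that.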