Kindly ping.

On 2023/8/16 11:52, mawupeng wrote:
> page_remove_rmap in wp_page_copy only clears the mlocked flag of this page
> if the page's mapcount drops to -1, as can be seen below:
>
> wp_page_copy
>   page_remove_rmap
>     if (!atomic_add_negative(-1, &page->_mapcount))
>       goto out;
>     clear_page_mlock(page); // clear mlocked flag
>
> In our test, checking this mapcount before mlocking the kpage closes this
> race:
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 62feb478a367..347f4c0339c2 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>  		if (!PageMlocked(kpage)) {
>  			unlock_page(page);
>  			lock_page(kpage);
> -			mlock_vma_page(kpage);
> +			if (page_mapcount(kpage) > 0)
> +				mlock_vma_page(kpage);
>  			page = kpage; /* for final unlock */
>  		}
>  	}
>
>
> On 2023/8/15 15:07, mawupeng wrote:
>> Our syzbot reports a warning on bad page state: the mlocked flag is not
>> cleared before the page is freed.
>>
>> In ksm's try_to_merge_one_page, kpage is re-mlocked if the vma has the
>> VM_LOCKED flag, even though kpage's mlocked flag has just been cleared in
>> wp_page_copy. Since the mapcount of kpage is then -1 (it is no longer
>> mapped), no one will clear its mlocked flag before it is freed, which leads
>> to the bad page report.
>>
>> Since mlock changed a lot in v5.18-rc1[1], the latest Linux does not have
>> this problem. The 5.10/5.15 LTS kernels do have this issue.
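To illustrate the sequence described above, below is a minimal userspace model in C. It is not kernel code: struct model_page, model_remove_rmap(), model_mlock_kpage() and replay() are hypothetical names made up for this sketch. It only assumes the two rules quoted here, namely that _mapcount is -1 for an unmapped page and that page_remove_rmap clears PG_mlocked only when the last mapping goes away, and it replays the interleaving serially instead of racing two threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two struct page fields involved; not kernel code. */
struct model_page {
	atomic_int mapcount; /* like page->_mapcount: -1 means no mappings */
	bool mlocked;        /* like PG_mlocked */
};

/* Mirrors the quoted page_remove_rmap rule: only the last unmap clears mlocked. */
static void model_remove_rmap(struct model_page *p)
{
	assert(atomic_load(&p->mapcount) >= 0); /* cannot unmap an unmapped page */

	/* atomic_fetch_sub returns the old value, so the new value is old - 1. */
	if (atomic_fetch_sub(&p->mapcount, 1) - 1 >= 0)
		return; /* still mapped by someone else */
	p->mlocked = false;
}

/* The mlock step at the end of try_to_merge_one_page, with or without the proposed check. */
static void model_mlock_kpage(struct model_page *kpage, bool apply_fix)
{
	/* For a small page, page_mapcount() is _mapcount + 1. */
	int mapcount = atomic_load(&kpage->mapcount) + 1;

	if (kpage->mlocked)
		return; /* mirrors: if (!PageMlocked(kpage)) */
	if (apply_fix && mapcount <= 0)
		return; /* proposed fix: do not mlock a kpage that is no longer mapped */
	kpage->mlocked = true;
}

/*
 * Replays the reported interleaving serially. Returns true if kpage ends up
 * unmapped but still mlocked, i.e. the state flagged at free time.
 */
static bool replay(bool apply_fix)
{
	struct model_page kpage;

	atomic_init(&kpage.mapcount, 0); /* mapped exactly once */
	kpage.mlocked = true;            /* assumption for illustration: already mlocked */

	model_remove_rmap(&kpage);            /* wp_page_copy: last unmap clears mlocked */
	model_mlock_kpage(&kpage, apply_fix); /* ksm then re-mlocks kpage */

	return atomic_load(&kpage.mapcount) < 0 && kpage.mlocked;
}
```

replay(false) ends with an unmapped page that still has PG_mlocked set, which is the state reported as "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"; replay(true), with the page_mapcount(kpage) > 0 check, does not. Because the model is serial, it does not show whether the check itself is free of other races.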
>>
>> Here is the simplified calltrace:
>>
>> try_to_merge_one_page                  wp_page_copy
>>
>> try_to_merge_one_page
>>                                        // clear page mlocked during rmap removal
>>   replace_page
>>                                        page_remove_rmap
>>                                          if (unlikely(PageMlocked(page)))
>>                                            clear_page_mlock(compound_head(page));
>>   if ((vma->vm_flags & VM_LOCKED) && kpage && !err)
>>                                        lock_page(old_page);
>>                                        if (vma->vm_flags & VM_LOCKED)
>>                                          if (PageMlocked(old_page))
>>                                            munlock_vma_page(old_page);
>>     if (!PageMlocked(kpage))
>>       lock_page(kpage);
>>       mlock_vma_page(kpage);
>>       unlock_page(kpage);
>> -------------------------------------------------
>>
>> This problem can be easily reproduced with the following modifications:
>> 1. enable the following configs
>>    a) CONFIG_DEBUG_VM
>>    b) CONFIG_KSM
>>    c) CONFIG_MEMORY_FAILURE
>>
>> 2. add delays in try_to_merge_one_page
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index a5716fdec1aa..f9ee2ec615ac 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>> 
>>  	if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
>>  		munlock_vma_page(page);
>> +		mdelay(10);
>>  		if (!PageMlocked(kpage)) {
>>  			unlock_page(page);
>> +			mdelay(100);
>>  			lock_page(kpage);
>>  			mlock_vma_page(kpage);
>>  			page = kpage; /* for final unlock */
>>
>> 3. run syzbot with the following program:
>>
>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
>> mlockall(0x1)
>> mlockall(0x5)
>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)
>>
>> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
>> mlockall(0x1)
>> mlockall(0xa5)
>> mlockall(0x0)
>> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
>>
>> -------------------------------------------------
>> The detailed bug report is as follows:
>>
>> BUG: Bad page state in process rs:main Q:Reg pfn:11406a
>> page:fffff7b004501a80 refcount:0 mapcount:0 mapping:0000000000000000 index:0x20ff4 pfn:0x11406a
>> flags: 0x30000000028000e(referenced|uptodate|dirty|swapbacked|mlocked|node=0|zone=3)
>> raw: 030000000028000e fffff7b00456aec8 fffff7b011439908 0000000000000000
>> Soft offlining pfn 0x455e8f at process virtual address 0x20ff6000
>> raw: 0000000000020ff4 0000000000000000 00000000ffffffff 0000000000000000
>> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>> Modules linked in:
>> CPU: 1 PID: 239 Comm: rs:main Q:Reg Not tainted 5.15.126+ #580
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
>> Call Trace:
>>  <TASK>
>>  dump_stack_lvl+0x33/0x46
>>  bad_page+0x9e/0xe0
>>  free_pcp_prepare+0x14b/0x1f0
>>  free_unref_page_list+0x7c/0x210
>>  release_pages+0x2fe/0x3c0
>>  __pagevec_lru_add+0x21a/0x360
>>  lru_cache_add+0x80/0xe0
>>  add_to_page_cache_lru+0x71/0xd0
>>  pagecache_get_page+0x245/0x460
>>  grab_cache_page_write_begin+0x1a/0x40
>>  ext4_da_write_begin+0xb7/0x280
>>  generic_perform_write+0xb4/0x1e0
>>  ext4_buffered_write_iter+0x9c/0x140
>>  ext4_file_write_iter+0x5b/0x840
>>  ? do_futex+0x1af/0xb60
>>  ? check_preempt_curr+0x21/0x60
>>  ? ttwu_do_wakeup.isra.140+0xd/0xf0
>>  new_sync_write+0x117/0x1b0
>>  vfs_write+0x1ff/0x260
>>  ksys_write+0xa0/0xe0
>>  do_syscall_64+0x37/0x90
>>  entry_SYSCALL_64_after_hwframe+0x67/0xd1
>> RIP: 0033:0x7fb815cef32f
>> Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 5c fd ff ff 48
>> RSP: 002b:00007fb814b2b860 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
>> RAX: ffffffffffffffda RBX: 00007fb808004f20 RCX: 00007fb815cef32f
>> RDX: 000000000000006e RSI: 00007fb808004f20 RDI: 0000000000000007
>> RBP: 00007fb808004c40 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000293 R12: 00007fb808009550
>> R13: 000000000000006e R14: 0000000000000000 R15: 0000000000000000
>>  </TASK>
>>
>> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@xxxxxxx/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460