Re: [5.10/5.15 LTS] Question on mlock race between ksm and cow

Hugh Dickins <hughd@xxxxxxxxxx> · Thu, 19 Oct 2023 20:57:04 -0700 (PDT)

On Wed, 18 Oct 2023, mawupeng wrote:

> This problem can be easily reproduced with syz in LTS 5.10/5.15.
> 
> Kindly ping.

Sorry, but it takes me a long time to find a long enough slot
to sink back deep enough into understanding these things.

> 
> On 2023/9/18 9:07, mawupeng wrote:
> > Kindly ping.
> > 
> > On 2023/8/16 11:52, mawupeng wrote:
> >> page_remove_rmap in wp_page_copy only clears the mlocked flag of this page
> >> iff the page's _mapcount drops to -1, as can be seen below:
> >>
> >> wp_page_copy
> >>   page_remove_rmap
> >>     if (!atomic_add_negative(-1, &page->_mapcount))
> >>       goto out;
> >>     clear_page_mlock(page);	// clear mlocked flag
> >>
> >> In our testing, checking this mapcount before mlocking the kpage
> >> closes this race.
> >>
> >> diff --git a/mm/ksm.c b/mm/ksm.c
> >> index 62feb478a367..347f4c0339c2 100644
> >> --- a/mm/ksm.c
> >> +++ b/mm/ksm.c
> >> @@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
> >>                 if (!PageMlocked(kpage)) {
> >>                         unlock_page(page);
> >>                         lock_page(kpage);
> >> -                       mlock_vma_page(kpage);
> >> +                       if (page_mapcount(kpage) > 0)
> >> +                               mlock_vma_page(kpage);

Yes, that's safe enough; though if we thought for longer, we might
devise other such races which this would not be enough for.

But I'm not going to spend longer on it, this is good enough, if you
find that it enables you to get much further with your syzbottery.

If you prepare a proper Signed-off patch to send GregKH & stable,
Cc me and I shall probably be able to support it with a grudging Ack.
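
To make the mapcount reasoning concrete, here is a minimal userspace sketch
in C.  It is a toy model, not kernel code: struct toy_page and every toy_*
function are hypothetical names invented for illustration, and the ordering
(the unmap in wp_page_copy completing before KSM reaches mlock_vma_page) is
assumed rather than raced for.  It only shows why a page whose _mapcount has
already dropped to -1 keeps PG_mlocked if it is mlocked afterwards, and how
the proposed page_mapcount(kpage) > 0 check avoids that in this ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the two fields this race touches. */
struct toy_page {
	atomic_int mapcount;	/* like page->_mapcount: -1 means "not mapped" */
	bool mlocked;		/* like PG_mlocked */
};

static void toy_page_init(struct toy_page *p, int nr_mappings)
{
	assert(nr_mappings >= 0);
	atomic_init(&p->mapcount, nr_mappings - 1);
	p->mlocked = false;
}

/* Models page_remove_rmap(): the flag is cleared only on the last unmap. */
static void toy_remove_rmap(struct toy_page *p)
{
	/* atomic_add_negative(-1, ...) is true when the new value is negative */
	if (atomic_fetch_sub(&p->mapcount, 1) - 1 >= 0)
		return;
	p->mlocked = false;
}

/* Models the KSM re-mlock of kpage, with or without the proposed check. */
static void toy_ksm_mlock(struct toy_page *p, bool check_mapcount)
{
	/* page_mapcount() is _mapcount + 1 */
	if (check_mapcount && atomic_load(&p->mapcount) + 1 <= 0)
		return;
	p->mlocked = true;
}

/* Returns true if the page would reach the allocator with mlocked set. */
static bool toy_race_leaves_mlocked(bool check_mapcount)
{
	struct toy_page kpage;

	toy_page_init(&kpage, 1);		/* mapped once by replace_page() */
	toy_remove_rmap(&kpage);		/* wp_page_copy() unmaps it first */
	toy_ksm_mlock(&kpage, check_mapcount);	/* KSM then tries to mlock it */
	return kpage.mlocked;
}
```

Calling toy_race_leaves_mlocked(false) models the unpatched path and leaves
the flag set; toy_race_leaves_mlocked(true) models the patched path and
leaves it clear.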

> >>                         page = kpage;           /* for final unlock */
> >>                 }
> >>         }
> >>
> >>
> >> On 2023/8/15 15:07, mawupeng wrote:
> >>> Our syzbot reports a warning on bad page state. The mlocked flag is not
> >>> cleared during page free.
> >>>
> >>> During try_to_merge_one_page in ksm, kpage is mlocked again if the vma
> >>> has VM_LOCKED set; however, its mlocked flag has just been cleared in
> >>> wp_page_copy. Since the mapcount of this kpage is already -1, nothing
> >>> can clear its mlocked flag before it is freed, which leads to the bad
> >>> page report.
> >>>
> >>> Since mlock changed a lot in v5.18-rc1 [1], the latest Linux does not
> >>> have this problem. The 5.10/5.15 LTS kernels do have this issue.

I was glad to hear that it happens to have been fixed by 5.18 changes.
I was going to complain that you're asking me for a fix to old kernels,
just because I fixed it in new kernels.  But that would be unfair: you'll
have noticed that I was guilty of this KSM PageMlocked code too (and one
of the pleasures of those mlock changes was being able to delete this).

> >>>
> >>> Here is the simplified calltrace:
> >>> try_to_merge_one_page				wp_page_copy
> >>>
> >>> try_to_merge_one_page
> >>>   // clear page mlocked during rmap removal
> >>>   replace_page
> >>>     page_remove_rmap
> >>>       if (unlikely(PageMlocked(page)))
> >>>         clear_page_mlock(compound_head(page));
> >>>
> >>>   if ((vma->vm_flags & VM_LOCKED)
> >>> 									lock_page(old_page);
> >>> 									if (vma->vm_flags & VM_LOCKED)
> >>> 									  if (PageMlocked(old_page))
> >>> 									    munlock_vma_page(old_page);
> >>>     if (!PageMlocked(kpage))
> >>>       lock_page(kpage);
> >>>       mlock_vma_page(kpage);
> >>>       unlock_page(kpage);
> >>> -------------------------------------------------
> >>>
> >>> This problem can be easily reproduced with the following modifications:
> >>> 1. enable the following CONFIG options
> >>>   a) CONFIG_DEBUG_VM
> >>>   b) CONFIG_KSM
> >>>   c) CONFIG_MEMORY_FAIALURE

CONFIG_MEMORY_FAILURE

> >>>
> >>> 2. add delay in try_to_merge_one_page
> >>> diff --git a/mm/ksm.c b/mm/ksm.c
> >>> index a5716fdec1aa..f9ee2ec615ac 100644
> >>> --- a/mm/ksm.c
> >>> +++ b/mm/ksm.c
> >>> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
> >>>  
> >>>         if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
> >>>                 munlock_vma_page(page);
> >>> +               mdelay(10);
> >>>                 if (!PageMlocked(kpage)) {
> >>>                         unlock_page(page);
> >>> +                       mdelay(100);
> >>>                         lock_page(kpage);
> >>>                         mlock_vma_page(kpage);
> >>>                         page = kpage;           /* for final unlock */
> >>>
> >>> 3. run syzbot with the following content:
> >>>
> >>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
> >>> mlockall(0x1)
> >>> mlockall(0x5)
> >>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)

0x65: MADV_SOFT_OFFLINE /* soft offline page for testing */

Whereby CAP_SYS_ADMIN allows syzbot to take advantage of memory-failure
testing backdoors to inject unexpected memory failures.  Which gives it
the right to TTU_IGNORE_MLOCK unmap pages which would ordinarily be kept
mapped by VM_LOCKED.
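
For readers decoding the syz program, the raw numbers map to flag names as
sketched below.  The values are the x86 ones from the uapi headers; other
architectures (those with their own mman.h) may differ.  The TOY_* macros
and toy_* helpers are hypothetical names used for this sketch only.

```c
#include <assert.h>

/* x86 values from the Linux uapi headers; prefixed TOY_ to avoid clashes. */
#define TOY_MADV_MERGEABLE	12	/* 0xc: opt the range in to KSM */
#define TOY_MADV_SOFT_OFFLINE	101	/* 0x65: needs CAP_SYS_ADMIN */
#define TOY_MCL_CURRENT		1
#define TOY_MCL_FUTURE		2
#define TOY_MCL_ONFAULT		4

static_assert(TOY_MADV_MERGEABLE == 0xc, "matches madvise(..., 0xc)");
static_assert(TOY_MADV_SOFT_OFFLINE == 0x65, "matches madvise(..., 0x65)");

/* Name the madvise advice values that appear in the reproducer. */
static const char *toy_madvise_name(int advice)
{
	switch (advice) {
	case TOY_MADV_MERGEABLE:
		return "MADV_MERGEABLE";
	case TOY_MADV_SOFT_OFFLINE:
		return "MADV_SOFT_OFFLINE";
	default:
		return "(not decoded here)";
	}
}

/* Return the bits of an mlockall() argument that are not known MCL_* flags. */
static int toy_mlockall_unknown_bits(int flags)
{
	return flags & ~(TOY_MCL_CURRENT | TOY_MCL_FUTURE | TOY_MCL_ONFAULT);
}
```

With these, mlockall(0x5) decodes to MCL_CURRENT | MCL_ONFAULT with no
unknown bits, while mlockall(0xa5) carries the unknown bits 0xa0.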

I said "grudging Ack" above, because I don't enjoy the time spent on
working around such issues.  There's lots of other damage that can be
done with CAP_SYS_ADMIN, and with memory failures on kernel memory.
But you've devised a patch which takes you further along... so okay.

Please mention the CAP_SYS_ADMIN MADV_SOFT_OFFLINE error injection
dependence in your commitlog - thanks.

Hugh

> >>>
> >>> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
> >>> mlockall(0x1)
> >>> mlockall(0xa5)
> >>> mlockall(0x0)
> >>> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
> >>>
> >>> -------------------------------------------------
> >>> The detailed bug report is as follows:
> >>>
> >>> BUG: Bad page state in process rs:main Q:Reg  pfn:11406a
> >>> page:fffff7b004501a80 refcount:0 mapcount:0 mapping:0000000000000000 index:0x20ff4 pfn:0x11406a
> >>> flags: 0x30000000028000e(referenced|uptodate|dirty|swapbacked|mlocked|node=0|zone=3)
> >>> raw: 030000000028000e fffff7b00456aec8 fffff7b011439908 0000000000000000
> >>> Soft offlining pfn 0x455e8f at process virtual address 0x20ff6000
> >>> raw: 0000000000020ff4 0000000000000000 00000000ffffffff 0000000000000000
> >>> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> >>> Modules linked in:
> >>> CPU: 1 PID: 239 Comm: rs:main Q:Reg Not tainted 5.15.126+ #580
> >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> >>> Call Trace:
> >>>  <TASK>
> >>>  dump_stack_lvl+0x33/0x46
> >>>  bad_page+0x9e/0xe0
> >>>  free_pcp_prepare+0x14b/0x1f0
> >>>  free_unref_page_list+0x7c/0x210
> >>>  release_pages+0x2fe/0x3c0
> >>>  __pagevec_lru_add+0x21a/0x360
> >>>  lru_cache_add+0x80/0xe0
> >>>  add_to_page_cache_lru+0x71/0xd0
> >>>  pagecache_get_page+0x245/0x460
> >>>  grab_cache_page_write_begin+0x1a/0x40
> >>>  ext4_da_write_begin+0xb7/0x280
> >>>  generic_perform_write+0xb4/0x1e0
> >>>  ext4_buffered_write_iter+0x9c/0x140
> >>>  ext4_file_write_iter+0x5b/0x840
> >>>  ? do_futex+0x1af/0xb60
> >>>  ? check_preempt_curr+0x21/0x60
> >>>  ? ttwu_do_wakeup.isra.140+0xd/0xf0
> >>>  new_sync_write+0x117/0x1b0
> >>>  vfs_write+0x1ff/0x260
> >>>  ksys_write+0xa0/0xe0
> >>>  do_syscall_64+0x37/0x90
> >>>  entry_SYSCALL_64_after_hwframe+0x67/0xd1
> >>> RIP: 0033:0x7fb815cef32f
> >>> Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 5c fd ff ff 48
> >>> RSP: 002b:00007fb814b2b860 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> >>> RAX: ffffffffffffffda RBX: 00007fb808004f20 RCX: 00007fb815cef32f
> >>> RDX: 000000000000006e RSI: 00007fb808004f20 RDI: 0000000000000007
> >>> RBP: 00007fb808004c40 R08: 0000000000000000 R09: 0000000000000000
> >>> R10: 0000000000000000 R11: 0000000000000293 R12: 00007fb808009550
> >>> R13: 000000000000006e R14: 0000000000000000 R15: 0000000000000000
> >>>  </TASK>
> >>>
> >>> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@xxxxxxx/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460