Re: [5.10/5.15 LTS] Question on mlock race between ksm and cow

mawupeng <mawupeng1@xxxxxxxxxx> · Sat, 21 Oct 2023 14:59:46 +0800

On 2023/10/20 11:57, Hugh Dickins wrote:
> On Wed, 18 Oct 2023, mawupeng wrote:
> 
>> This problem can be easily reproduced with syz in LTS 5.10/5.15.
>>
>> Kindly ping.
> 
> Sorry, but it takes me a long time to find a long enough slot
> to sink back deep enough into understanding these things.
> 
>>
>> On 2023/9/18 9:07, mawupeng wrote:
>>> Kindly ping.
>>>
>>> On 2023/8/16 11:52, mawupeng wrote:
>>>> Since page_remove_rmap in wp_page_copy only clear this mlocked page iff
>>>> page's mapcount is -1. which can be seen as follow.
>>>>
>>>> wp_page_copy
>>>>   page_remove_rmap
>>>>     if (!atomic_add_negative(-1, &page->_mapcount))
>>>>       goto out;
>>>>     clear_page_mlock(page);	// clear mlocked flag
>>>>
>>>> During out test, we can test this mapcount before mlock the kpage, this
>>>> can close this race.
>>>>
>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>> index 62feb478a367..347f4c0339c2 100644
>>>> --- a/mm/ksm.c
>>>> +++ b/mm/ksm.c
>>>> @@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>>>>                 if (!PageMlocked(kpage)) {
>>>>                         unlock_page(page);
>>>>                         lock_page(kpage);
>>>> -                       mlock_vma_page(kpage);
>>>> +                       if (page_mapcount(kpage) > 0)
>>>> +                               mlock_vma_page(kpage);
> 
> Yes, that's safe enough; though if we thought for longer, we might
> devise other such races which this would not be enough for.
> 
> But I'm not going to spend longer on it, this is good enough, if you
> find that it enables you to get much further with your syzbottery.
> 
> If you prepare a proper Signed-off patch to send GregKH & stable,
> Cc me and I shall probably be able to support it with a grudging Ack.
> 
>>>>                         page = kpage;           /* for final unlock */
>>>>                 }
>>>>         }
>>>>
>>>>
>>>> On 2023/8/15 15:07, mawupeng wrote:
>>>>> Our syzbot reports a warning on bad page state. The mlocked flag is not
>>>>> cleared during page free.
>>>>>
>>>>> During try_to_merge_one_page in ksm, kpage will be remlocked if vma
>>>>> contains flag VM_LOCKED, however this flag is just cleared in wp_page_copy.
>>>>> Since the mapcount of this kpage is -1, no one can remove its mlocked flag
>>>>> before free, this lead to the bad page report.
>>>>>
>>>>> Since mlock changes a lot in v5.18-rc1[1], the latest linux do not have
>>>>> this problem. The 5.10/5.15 LTS do have this issue.
> 
> I was glad to hear that it happens to have been fixed by 5.18 changes.
> I was going to complain that you're asking me for a fix to old kernels,
> just because I fixed it in new kernels.  But that would be unfair: you'll
> have noticed that I was guilty of this KSM PageMlocked code too (and one
> of the pleasures of those mlock changes was being able to delete this).
> 
>>>>>
>>>>> Here is the simplified calltrace:
>>>>> try_to_merge_one_page				wp_page_copy
>>>>>
>>>>> try_to_merge_one_page
>>>>> // clear page mlocked during rmap removal
>>>>> replace_page
>>>>> page_remove_rmap
>>>>> if (unlikely(PageMlocked(page)))
>>>>> clear_page_mlock(compound_head(page));
>>>>>
>>>>> if ((vma->vm_flags & VM_LOCKED)
>>>>> 									lock_page(old_page);
>>>>> 									if (vma->vm_flags & VM_LOCKED)
>>>>> 									if (PageMlocked(old_page))
>>>>> 									munlock_vma_page(old_page);
>>>>> if (!PageMlocked(kpage))
>>>>> lock_page(kpage);
>>>>> mlock_vma_page(kpage);
>>>>> unlock_page(kpage);
>>>>> -------------------------------------------------
>>>>>
>>>>> This problem can be easily reproduced with the following modifies:
>>>>> 1. enable the following CONFIG
>>>>>   a) CONFIG_DEBUG_VM
>>>>>   b) CONFIG_KSM
>>>>>   c) CONFIG_MEMORY_FAIALURE
> 
> CONFIG_MEMORY_FAILURE
> 
>>>>>
>>>>> 2. add delay in try_to_merge_one_page
>>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>>> index a5716fdec1aa..f9ee2ec615ac 100644
>>>>> --- a/mm/ksm.c
>>>>> +++ b/mm/ksm.c
>>>>> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>>>>>  
>>>>>         if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
>>>>>                 munlock_vma_page(page);
>>>>> +               mdelay(10);
>>>>>                 if (!PageMlocked(kpage)) {
>>>>>                         unlock_page(page);
>>>>> +                       mdelay(100);
>>>>>                         lock_page(kpage);
>>>>>                         mlock_vma_page(kpage);
>>>>>                         page = kpage;           /* for final unlock */
>>>>>
>>>>> 3. run syzbot with the following content:
>>>>>
>>>>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
>>>>> mlockall(0x1)
>>>>> mlockall(0x5)
>>>>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)
> 
> 0x65: MADV_SOFT_OFFLINE /* soft offline page for testing */
> 
> Whereby CAP_SYS_ADMIN allows syzbot to take advantage of memory-failure
> testing backdoors to inject unexpected memory failures.  Which gives it
> the right to TTU_IGNORE_MLOCK unmap pages which would ordinarily be kept
> mapped by VM_LOCKED.
> 
> I said "grudging Ack" above, because I don't enjoy the time spent on
> working around such issues.  There's lots of other damage that can be
> done with CAP_SYS_ADMIN, and with memory failures on kernel memory.
> But you've devised a patch which takes you further along... so okay.
> 
> Please mention the CAP_SYS_ADMIN MADV_SOFT_OFFLINE error injection
> dependence in your commitlog - thanks.
> 
> Hugh
> 
>>>>>
>>>>> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
>>>>> mlockall(0x1)
>>>>> mlockall(0xa5)
>>>>> mlockall(0x0)
>>>>> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
>>>>>
>>>>> -------------------------------------------------
>>>>> The detail bug report can be seen as follow:
>>>>>
>>>>> BUG: Bad page state in process rs:main Q:Reg  pfn:11406a
>>>>> page:fffff7b004501a80 refcount:0 mapcount:0 mapping:0000000000000000 index:0x20ff4 pfn:0x11406a
>>>>> flags: 0x30000000028000e(referenced|uptodate|dirty|swapbacked|mlocked|node=0|zone=3)
>>>>> raw: 030000000028000e fffff7b00456aec8 fffff7b011439908 0000000000000000
>>>>> Soft offlining pfn 0x455e8f at process virtual address 0x20ff6000
>>>>> raw: 0000000000020ff4 0000000000000000 00000000ffffffff 0000000000000000
>>>>> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>>>>> Modules linked in:
>>>>> CPU: 1 PID: 239 Comm: rs:main Q:Reg Not tainted 5.15.126+ #580
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
>>>>> Call Trace:
>>>>>  <TASK>
>>>>>  dump_stack_lvl+0x33/0x46
>>>>>  bad_page+0x9e/0xe0
>>>>>  free_pcp_prepare+0x14b/0x1f0
>>>>>  free_unref_page_list+0x7c/0x210
>>>>>  release_pages+0x2fe/0x3c0
>>>>>  __pagevec_lru_add+0x21a/0x360
>>>>>  lru_cache_add+0x80/0xe0
>>>>>  add_to_page_cache_lru+0x71/0xd0
>>>>>  pagecache_get_page+0x245/0x460
>>>>>  grab_cache_page_write_begin+0x1a/0x40
>>>>>  ext4_da_write_begin+0xb7/0x280
>>>>>  generic_perform_write+0xb4/0x1e0
>>>>>  ext4_buffered_write_iter+0x9c/0x140
>>>>>  ext4_file_write_iter+0x5b/0x840
>>>>>  ? do_futex+0x1af/0xb60
>>>>>  ? check_preempt_curr+0x21/0x60
>>>>>  ? ttwu_do_wakeup.isra.140+0xd/0xf0
>>>>>  new_sync_write+0x117/0x1b0
>>>>>  vfs_write+0x1ff/0x260
>>>>>  ksys_write+0xa0/0xe0
>>>>>  do_syscall_64+0x37/0x90
>>>>>  entry_SYSCALL_64_after_hwframe+0x67/0xd1
>>>>> RIP: 0033:0x7fb815cef32f
>>>>> Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 5c fd ff ff 48
>>>>> RSP: 002b:00007fb814b2b860 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
>>>>> RAX: ffffffffffffffda RBX: 00007fb808004f20 RCX: 00007fb815cef32f
>>>>> RDX: 000000000000006e RSI: 00007fb808004f20 RDI: 0000000000000007
>>>>> RBP: 00007fb808004c40 R08: 0000000000000000 R09: 0000000000000000
>>>>> R10: 0000000000000000 R11: 0000000000000293 R12: 00007fb808009550
>>>>> R13: 000000000000006e R14: 0000000000000000 R15: 0000000000000000
>>>>>  </TASK>
>>>>>
>>>>> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@xxxxxxx/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460
> 

Hi Hugh.

Thanks for you kindly reply.

I will cc to you if a proper patch is considered.

Wupeng.