Re: bug: move_pages(2) does not update "status" if no pages are moved

John Hubbard <jhubbard@xxxxxxxxxx> · Wed, 4 Dec 2019 16:03:39 -0800

On 12/4/19 12:17 PM, Yang Shi wrote:
> On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@xxxxxxxxxx> wrote:
>>
>> Hello all,
>>
>> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
>> pages happen to be on the right node already, this function returns 0 but the
>> "status" array is not updated. This array potentially contains garbage values
>> (e.g. from malloc(3)), and I don't see a way to detect this.
>>
>> Looking at the kernel code, we are probably exiting do_pages_move here:
>> out_flush:
>>     if (list_empty(&pagelist))
>>         return err;
> 
> Could you please give the patch below a try? I have only build-tested it.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a8f87cb..f2f1279 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
>   * the target node
>   */
>  static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> - int node, struct list_head *pagelist, bool migrate_all)
> + int node, struct list_head *pagelist, bool migrate_all,
> + int __user *status, int start)
>  {
>   struct vm_area_struct *vma;
>   struct page *page;
> @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>   goto out;
> 
>   err = 0;
> - if (page_to_nid(page) == node)
> + if (page_to_nid(page) == node) {
> + err = store_status(status, start, node, 1);
>   goto out_putpage;
> + }
> 
>   err = -EACCES;
>   if (page_mapcount(page) > 1 && !migrate_all)
> @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>   * report them via status
>   */
>   err = add_page_for_migration(mm, addr, current_node,
> - &pagelist, flags & MPOL_MF_MOVE_ALL);
> + &pagelist, flags & MPOL_MF_MOVE_ALL, status,
> + i);
> +
>   if (!err)
>   continue;
> 

Hi Yang,

The patch looks correct, and I *think* the following lockdep report
is a pre-existing problem, but it happened with your patch applied to today's
linux.git (commit aedc0650f9135f3b92b39cbed1a8fe98d8088825), using the
unmodified version of Felix's test program:

============================================
WARNING: possible recursive locking detected
5.4.0-hubbard-github+ #552 Not tainted
--------------------------------------------
move_pages_bug/1286 is trying to acquire lock:
ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90

but task is already holding lock:
ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&mm->mmap_sem#2);
  lock(&mm->mmap_sem#2);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by move_pages_bug/1286:
 #0: ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0

stack backtrace:
CPU: 6 PID: 1286 Comm: move_pages_bug Not tainted 5.4.0-hubbard-github+ #552
Hardware name: ASUS X299-A/PRIME X299-A, BIOS 2002 09/25/2019
Call Trace:
 dump_stack+0x71/0xa0
 validate_chain.cold+0x122/0x15f
 ? find_held_lock+0x2b/0x80
 __lock_acquire+0x39c/0x790
 lock_acquire+0x95/0x190
 ? __might_fault+0x3e/0x90
 __might_fault+0x68/0x90
 ? __might_fault+0x3e/0x90
 do_pages_move+0x2c4/0x6a0
 kernel_move_pages+0x1f5/0x3e0
 ? do_syscall_64+0x1c/0x230
 __x64_sys_move_pages+0x25/0x30
 do_syscall_64+0x5a/0x230
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7efd42f581ad
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f08
RSP: 002b:00007ffffb207c78 EFLAGS: 00000216 ORIG_RAX: 0000000000000117
RAX: ffffffffffffffda RBX: 0000556eb240cd28 RCX: 00007efd42f581ad
RDX: 0000556eb240ccf0 RSI: 0000000000000008 RDI: 0000000000000000
RBP: 00007ffffb207d10 R08: 0000556eb240cd70 R09: 0000000000000002
R10: 0000556eb240cd40 R11: 0000000000000216 R12: 0000556eb04b70a0
R13: 00007ffffb207df0 R14: 0000000000000000 R15: 0000000000000000
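
As a userspace stopgap for the original report, it may be possible to detect
entries the kernel never wrote by pre-filling "status" with a sentinel before
the call. The sketch below is only that: it assumes move_pages(2) stores either
a node id (>= 0) or a negative errno in each slot, so INT_MIN is never a
legitimate value. STATUS_UNTOUCHED, status_prefill() and
status_count_untouched() are hypothetical names, not part of libnuma.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical sentinel. Assumption: move_pages(2) only ever stores a
 * node id (>= 0) or a negative errno, so INT_MIN cannot come from the kernel.
 */
#define STATUS_UNTOUCHED INT_MIN

/* Fill every status slot with the sentinel before calling move_pages(). */
static void status_prefill(int *status, size_t count)
{
	assert(status != NULL || count == 0);

	for (size_t i = 0; i < count; i++)
		status[i] = STATUS_UNTOUCHED;
}

/* Count the slots that still hold the sentinel after the call. */
static size_t status_count_untouched(const int *status, size_t count)
{
	size_t untouched = 0;

	assert(status != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (status[i] == STATUS_UNTOUCHED)
			untouched++;
	}
	return untouched;
}
```

The intended use is status_prefill(status, count) right before
move_pages(pid, count, pages, nodes, status, flags). If the call returns 0 and
status_count_untouched(status, count) is nonzero, those slots were skipped
rather than reported.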

thanks,
-- 
John Hubbard
NVIDIA