----- Original Message -----
> On 1/2/19 1:24 PM, Mike Kravetz wrote:
> > On 1/2/19 12:30 PM, Jan Stancek wrote:
> >> Hi,
> >>
> >> LTP move_pages12 [1] started failing recently.
> >>
> >> The test maps/unmaps some anonymous private huge pages
> >> and migrates them between 2 nodes.  This now reliably
> >> hits a NULL ptr deref:
> >>
> >> [  194.819357] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
> >> [  194.864410] #PF error: [WRITE]
> >> [  194.881502] PGD 22c758067 P4D 22c758067 PUD 235177067 PMD 0
> >> [  194.913833] Oops: 0002 [#1] SMP NOPTI
> >> [  194.935062] CPU: 0 PID: 865 Comm: move_pages12 Not tainted 4.20.0+ #1
> >> [  194.972993] Hardware name: HP ProLiant SL335s G7/, BIOS A24 12/08/2012
> >> [  195.005359] RIP: 0010:down_write+0x1b/0x40
> >> [  195.028257] Code: 00 5c 01 00 48 83 c8 03 48 89 43 20 5b c3 90 0f 1f 44 00 00 53 48 89 fb e8 d2 d7 ff ff 48 89 d8 48 ba 01 00 00 00 ff ff ff ff <f0> 48 0f c1 10 85 d2 74 05 e8 07 26 ff ff 65 48 8b 04 25 00 5c 01
> >> [  195.121836] RSP: 0018:ffffb87e4224fd00 EFLAGS: 00010246
> >> [  195.147097] RAX: 0000000000000030 RBX: 0000000000000030 RCX: 0000000000000000
> >> [  195.185096] RDX: ffffffff00000001 RSI: ffffffffa69d30f0 RDI: 0000000000000030
> >> [  195.219251] RBP: 0000000000000030 R08: ffffe7d4889d8008 R09: 0000000000000003
> >> [  195.258291] R10: 000000000000000f R11: ffffe7d4889d8008 R12: ffffe7d4889d0008
> >> [  195.294547] R13: ffffe7d490b78000 R14: ffffe7d4889d0000 R15: ffff8be9b2ba4580
> >> [  195.332532] FS:  00007f1670112b80(0000) GS:ffff8be9b7a00000(0000) knlGS:0000000000000000
> >> [  195.373888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [  195.405938] CR2: 0000000000000030 CR3: 000000023477e000 CR4: 00000000000006f0
> >> [  195.443579] Call Trace:
> >> [  195.456876]  migrate_pages+0x833/0xcb0
> >> [  195.478070]  ? __ia32_compat_sys_migrate_pages+0x20/0x20
> >> [  195.506027]  do_move_pages_to_node.isra.63.part.64+0x2a/0x50
> >> [  195.536963]  kernel_move_pages+0x667/0x8c0
> >> [  195.559616]  ? __handle_mm_fault+0xb95/0x1370
> >> [  195.588765]  __x64_sys_move_pages+0x24/0x30
> >> [  195.611439]  do_syscall_64+0x5b/0x160
> >> [  195.631901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [  195.657790] RIP: 0033:0x7f166f5ff959
> >> [  195.676365] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 45 2c 00 f7 d8 64 89 01 48
> >> [  195.772938] RSP: 002b:00007ffd8d77bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
> >> [  195.810207] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f166f5ff959
> >> [  195.847522] RDX: 0000000002303400 RSI: 0000000000000400 RDI: 0000000000000360
> >> [  195.882327] RBP: 0000000000000400 R08: 0000000002306420 R09: 0000000000000004
> >> [  195.920017] R10: 0000000002305410 R11: 0000000000000246 R12: 0000000002303400
> >> [  195.958053] R13: 0000000002305410 R14: 0000000002306420 R15: 0000000000000003
> >> [  195.997028] Modules linked in: sunrpc amd64_edac_mod ipmi_ssif edac_mce_amd kvm_amd ipmi_si igb ipmi_devintf k10temp kvm pcspkr ipmi_msghandler joydev irqbypass sp5100_tco dca hpwdt hpilo i2c_piix4 xfs libcrc32c radeon i2c_algo_bit drm_kms_helper ttm ata_generic pata_acpi drm serio_raw pata_atiixp
> >> [  196.134162] CR2: 0000000000000030
> >> [  196.152788] ---[ end trace 4420ea5061342d3e ]---
> >>
> >> Suspected commit is:
> >>   b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
> >> which adds to unmap_and_move_huge_page():
> >>
> >> +	struct address_space *mapping = page_mapping(hpage);
> >> +
> >> +	/*
> >> +	 * try_to_unmap could potentially call huge_pmd_unshare.
> >> +	 * Because of this, take semaphore in write mode here and
> >> +	 * set TTU_RMAP_LOCKED to let lower levels know we have
> >> +	 * taken the lock.
> >> +	 */
> >> +	i_mmap_lock_write(mapping);
> >>
> >> If I'm reading this right, 'mapping' will be NULL for anon mappings.
> >
> > Not exactly.
>
> Well, yes exactly.  Sorry, I was confusing this with something else.
>
> That commit does cause BUGs for migration and page poisoning of anon huge
> pages.  The patch was trying to take care of i_mmap_rwsem locking outside
> the try_to_unmap infrastructure.  This is because try_to_unmap will take
> the semaphore in read mode (for file mappings), and we really need it to
> be taken in write mode.
>
> The patch below continues to take the semaphore outside try_to_unmap for
> the file mapping case.  For anon mappings, the locking is done as a
> special case in try_to_unmap_one.  This is something I was trying to
> avoid, as it is harder to follow/understand.  Any suggestions on how to
> restructure this or make it clearer are welcome.
>
> Adding Andrew on Cc as he already sent the commit causing the BUGs
> upstream.
>
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>
> hugetlbfs: fix migration and poisoning of anon huge pages
>
> Expanded use of i_mmap_rwsem for pmd sharing synchronization incorrectly
> used page_mapping() of anon huge pages to get to the address_space
> i_mmap_rwsem.  Since page_mapping() is NULL for pages of anon mappings,
> an "unable to handle kernel NULL pointer" BUG would occur with a stack
> similar to:
>
>   RIP: 0010:down_write+0x1b/0x40
>   Call Trace:
>    migrate_pages+0x81f/0xb90
>    __ia32_compat_sys_migrate_pages+0x190/0x190
>    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
>    kernel_move_pages+0x566/0x7b0
>    __x64_sys_move_pages+0x24/0x30
>    do_syscall_64+0x5b/0x180
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> To fix, only use page_mapping() for non-anon (file) pages.  For anon
> pages, wait until we find a vma in which the page is mapped and get the
> address_space from vm_file.
>
> Fixes: b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>

Mike,

1) with LTP move_pages12 (MAP_PRIVATE version of reproducer)
The patch below fixes the panic for me.  It didn't apply cleanly to latest
master, but the conflicts were easy to resolve.

2) with MAP_SHARED version of reproducer
It still hangs in user-space.  A v4.19 kernel appears to work fine, so
I've started a bisect.

Regards,
Jan

> ---
>  mm/memory-failure.c | 15 +++++++++++----
>  mm/migrate.c        | 34 +++++++++++++++++++++++-----------
>  mm/rmap.c           | 15 ++++++++++++---
>  3 files changed, 46 insertions(+), 18 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 93558fb981fb..f229cbd0b347 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1032,14 +1032,21 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  		unmap_success = try_to_unmap(hpage, ttu);
>  	} else if (mapping) {
>  		/*
> -		 * For hugetlb pages, try_to_unmap could potentially call
> -		 * huge_pmd_unshare.  Because of this, take semaphore in
> -		 * write mode here and set TTU_RMAP_LOCKED to indicate we
> -		 * have taken the lock at this higer level.
> +		 * For file mappings, take semaphore in write mode here and
> +		 * set TTU_RMAP_LOCKED to let lower levels know we have taken
> +		 * the lock.  This is in case lower levels call
> +		 * huge_pmd_unshare.  Without this, try_to_unmap would only
> +		 * take the semaphore in read mode.
>  		 */
>  		i_mmap_lock_write(mapping);
>  		unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
>  		i_mmap_unlock_write(mapping);
> +	} else {
> +		/*
> +		 * For huge page anon mappings, try_to_unmap_one will take the
> +		 * i_mmap_rwsem before calling huge_pmd_unshare if necessary.
> +		 */
> +		unmap_success = try_to_unmap(hpage, ttu);
>  	}
>  	if (!unmap_success)
>  		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 725edaef238a..45d7dd0c9479 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1309,17 +1309,29 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  	if (page_mapped(hpage)) {
>  		struct address_space *mapping = page_mapping(hpage);
>
> -		/*
> -		 * try_to_unmap could potentially call huge_pmd_unshare.
> -		 * Because of this, take semaphore in write mode here and
> -		 * set TTU_RMAP_LOCKED to let lower levels know we have
> -		 * taken the lock.
> -		 */
> -		i_mmap_lock_write(mapping);
> -		try_to_unmap(hpage,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
> -			TTU_RMAP_LOCKED);
> -		i_mmap_unlock_write(mapping);
> +		if (mapping) {
> +			/*
> +			 * For file mappings, take semaphore in write mode here
> +			 * and set TTU_RMAP_LOCKED to let lower levels know we
> +			 * have taken the lock.  This is in case lower levels
> +			 * call huge_pmd_unshare.  Without this, try_to_unmap
> +			 * would only take the semaphore in read mode.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * For anon mappings, try_to_unmap_one will take the
> +			 * i_mmap_rwsem before calling huge_pmd_unshare if
> +			 * necessary.
> +			 */
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS);
> +		}
>  		page_was_mapped = 1;
>  	}
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c566bd552535..b267cc084f92 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1375,11 +1375,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		/*
>  		 * If sharing is possible, start and end will be adjusted
>  		 * accordingly.
> -		 *
> -		 * If called for a huge page, caller must hold i_mmap_rwsem
> -		 * in write mode as it is possible to call huge_pmd_unshare.
>  		 */
>  		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
> +
> +		/*
> +		 * If called for a huge page file mapping, caller will hold
> +		 * i_mmap_rwsem in write mode.  For anon mappings, we must
> +		 * take the semaphore here.  All this is necessary because
> +		 * it is possible huge_pmd_unshare will 'unshare' a pmd.
> +		 */
> +		if (PageAnon(page))
> +			i_mmap_lock_write(vma->vm_file->f_mapping);
>  	}
>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>
> @@ -1655,6 +1661,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +	/* For anon huge pages, we must unlock. */
> +	if (PageHuge(page) && PageAnon(page))
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
>
>  	return ret;
> }
> --
> 2.17.2