On Thu, Nov 05, 2020 at 11:50:58AM -0800, Mike Kravetz wrote:
> Qian Cai reported the following BUG in [1]
>
> [ 6147.019063][T45242] LTP: starting move_pages12
> [ 6147.475680][T64921] BUG: unable to handle page fault for address: ffffffffffffffe0
> ...
> [ 6147.525866][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
> avc_start_pgoff at mm/interval_tree.c:63
> [ 6147.620914][T64921] Call Trace:
> [ 6147.624078][T64921] rmap_walk_anon+0x141/0xa30
> rmap_walk_anon at mm/rmap.c:1864
> [ 6147.628639][T64921] try_to_unmap+0x209/0x2d0
> try_to_unmap at mm/rmap.c:1763
> [ 6147.633026][T64921] ? rmap_walk_locked+0x140/0x140
> [ 6147.637936][T64921] ? page_remove_rmap+0x1190/0x1190
> [ 6147.643020][T64921] ? page_not_mapped+0x10/0x10
> [ 6147.647668][T64921] ? page_get_anon_vma+0x290/0x290
> [ 6147.652664][T64921] ? page_mapcount_is_zero+0x10/0x10
> [ 6147.657838][T64921] ? hugetlb_page_mapping_lock_write+0x97/0x180
> [ 6147.663972][T64921] migrate_pages+0x1005/0x1fb0
> [ 6147.668617][T64921] ? remove_migration_pte+0xac0/0xac0
> [ 6147.673875][T64921] move_pages_and_store_status.isra.47+0xd7/0x1a0
> [ 6147.680181][T64921] ? migrate_pages+0x1fb0/0x1fb0
> [ 6147.685002][T64921] __x64_sys_move_pages+0xa5c/0x1100
> [ 6147.690176][T64921] ? trace_hardirqs_on+0x20/0x1b5
> [ 6147.695084][T64921] ? move_pages_and_store_status.isra.47+0x1a0/0x1a0
> [ 6147.701653][T64921] ? rcu_read_lock_sched_held+0xaa/0xd0
> [ 6147.707088][T64921] ? switch_fpu_return+0x196/0x400
> [ 6147.712083][T64921] ? lockdep_hardirqs_on_prepare+0x38c/0x550
> [ 6147.717954][T64921] ? do_syscall_64+0x24/0x310
> [ 6147.722513][T64921] do_syscall_64+0x5f/0x310
> [ 6147.726897][T64921] ? trace_hardirqs_off+0x12/0x1a0
> [ 6147.731894][T64921] ? asm_exc_page_fault+0x8/0x30
> [ 6147.736714][T64921] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Hugh Dickins diagnosed this as a migration bug caused by code introduced
> to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
> routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
> flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
> anon pages, as the anon_vma lock should be held in that case. Further
> analysis suggested that i_mmap_rwsem is not required to be held at all
> when calling try_to_unmap() for anon pages, as an anon page can never be
> part of a shared pmd mapping.
>
> Discussion also revealed that the hack in hugetlb_page_mapping_lock_write()
> to drop the page lock and acquire i_mmap_rwsem is wrong. There is no way
> to keep the mapping valid while dropping the page lock.
>
> This patch does the following:
> - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
>   calling try_to_unmap.
> - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
>   will now simply do a 'trylock' while still holding the page lock. If
>   the trylock fails, it will return NULL. This could impact the callers:
>   - migration calling code will receive -EAGAIN and retry up to the
>     hard-coded limit (10).
>   - memory error code will treat the page as BUSY. This will force
>     killing (SIGKILL) of any mapping tasks instead of SIGBUS.
> Do note that this change in behavior only happens when there is a race.
> None of the standard kernel testing suites actually hit this race, but
> it is possible.
>
> [1] https://lore.kernel.org/lkml/20200708012044.GC992@xxxxxx/
> [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
>
> Reported-by: Qian Cai <cai@xxxxxx>
> Suggested-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>

This approach looks simpler and better than the earlier ones. Thank you for the update.

Acked-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>