On Wed, 2020-03-04 at 16:26 -0800, Mike Kravetz wrote: > While discussing the issue with huge_pte_offset [1], I remembered that > there were more outstanding hugetlb races. These issues are: > > 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become > invalid via a call to huge_pmd_unshare by another thread. > 2) hugetlbfs page faults can race with truncation causing invalid global > reserve counts and state. > > A previous attempt was made to use i_mmap_rwsem in this manner as described > at [2]. However, those patches were reverted starting with [3] due to > locking issues. > > To effectively use i_mmap_rwsem to address the above issues it needs to > be held (in read mode) during page fault processing. However, during > fault processing we need to lock the page we will be adding. Lock > ordering requires we take page lock before i_mmap_rwsem. Waiting until > after taking the page lock is too late in the fault process for the > synchronization we want to do. > > To address this lock ordering issue, the following patches change the > lock ordering for hugetlb pages. This is not too invasive as hugetlbfs > processing is done separate from core mm in many places. However, I > don't really like this idea. Much ugliness is contained in the new > routine hugetlb_page_mapping_lock_write() of patch 1. > > The only other way I can think of to address these issues is by catching > all the races. After catching a race, cleanup, backout, retry ... etc, > as needed. This can get really ugly, especially for huge page reservations. > At one time, I started writing some of the reservation backout code for > page faults and it got so ugly and complicated I went down the path of > adding synchronization to avoid the races. Any other suggestions would > be welcome. Reverted this series on the top of today's linux-next fixed the hang with LTP move_pages12 on both powerpc and arm64, # /opt/ltp/testcases/bin/move_pages12 tst_test.c:1217: INFO: Timeout per run is 0h 05m 00s move_pages12.c:263: INFO: Free RAM 260577280 kB move_pages12.c:281: INFO: Increasing 2048kB hugepages pool on node 0 to 4 move_pages12.c:291: INFO: Increasing 2048kB hugepages pool on node 8 to 4 move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 0 move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 8 <hang> [ 3948.791155][ T688] INFO: task move_pages12:32930 can't die for more than 3072 seconds. [ 3948.791181][ T688] move_pages12 D26224 32930 1 0x00040002 [ 3948.791199][ T688] Call Trace: [ 3948.791210][ T688] [c000200816b4f680] [c0000000010b7a68] cpufreq_update_util_data+0x0/0x8 (unreliable) [ 3948.791234][ T688] [c000200816b4f860] [c00000000002615c] __switch_to+0x38c/0x520 [ 3948.791247][ T688] [c000200816b4f8d0] [c0000000009a1c94] __schedule+0x4b4/0xba0 [ 3948.791268][ T688] [c000200816b4f9a0] [c0000000009a2428] schedule+0xa8/0x170 [ 3948.791288][ T688] [c000200816b4f9d0] [c0000000009a2d0c] io_schedule+0x2c/0x50 [ 3948.791300][ T688] [c000200816b4fa00] [c000000000331020] __lock_page+0x150/0x3c0 [ 3948.791322][ T688] [c000200816b4fac0] [c000000000420164] hugetlb_no_page+0xb04/0xd40 lock_page at include/linux/pagemap.h:480 (inlined by) hugetlb_no_page at mm/hugetlb.c:4286 [ 3948.791342][ T688] [c000200816b4fc10] [c000000000420bd8] hugetlb_fault+0x738/0xc00 [ 3948.791363][ T688] [c000200816b4fcd0] [c0000000003b9c44] handle_mm_fault+0x444/0x450 [ 3948.791384][ T688] [c000200816b4fd20] [c000000000070b98] __do_page_fault+0x2b8/0xf90 [ 3948.791406][ T688] [c000200816b4fe20] [c00000000000aa88] handle_page_fault+0x10/0x30 > > [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@xxxxxxxxxx/ > [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@xxxxxxxxxx/ > [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@xxxxxxxxxx > > Mike Kravetz (2): > hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization > hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race > > fs/hugetlbfs/inode.c | 34 +++++--- > include/linux/fs.h | 5 ++ > include/linux/hugetlb.h | 8 ++ > mm/hugetlb.c | 175 +++++++++++++++++++++++++++++++++++----- > mm/memory-failure.c | 29 ++++++- > mm/migrate.c | 24 +++++- > mm/rmap.c | 17 +++- > mm/userfaultfd.c | 11 ++- > 8 files changed, 264 insertions(+), 39 deletions(-) >