I am sending this as a v2 RFC for the following reasons:
- The original RFC was incomplete and had a few issues.
- The only comments on the original RFC suggested eliminating huge pmd
  sharing to eliminate the associated complexity. I do not believe this
  is possible, as user space code will notice its absence. In any case,
  if we want to remove i_mmap_rwsem from the fault path to address fault
  scalability, we will need to address fault/truncate races. Patches 3
  and 4 of this series do that.

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7. At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program that does the
following in an infinite loop:
- mmap a 4GB hugetlb file (size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file

The hugetlb fault code was then instrumented to collect the number of
times the mutex was locked and the wait time. Samples are from 10 second
intervals on a 4 CPU VM with 8GB memory. Eight instances of the
map/fault/unmap program are running.

next-20220420
-------------
[  690.117843] Wait_debug: faults sec      3506
[  690.118788]             num faults      35062
[  690.119825]             num locks       35062
[  690.120956]             intvl wait time 54688 msecs
[  690.122330]             max_wait_time   24000 usecs

next-20220420 + this series
---------------------------
[  484.965960] Wait_debug: faults sec      1419429
[  484.967294]             num faults      14194293
[  484.968656]             num locks       5611
[  484.969893]             intvl wait time 21087 msecs
[  484.971388]             max_wait_time   34000 usecs

As can be seen, fault time suffers when there are other operations
taking i_mmap_rwsem in write mode, such as unmap.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.
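For reference, one pass of the map/fault/unmap loop described above can
be sketched in userspace C roughly as below. This is a minimal sketch,
not the exact instrumented test program; the 2MB huge page size and the
driver details are assumptions.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE   (4UL << 30)	/* 4GB: large enough to allow pmd sharing */
#define HPAGE_SIZE (2UL << 20)	/* assume 2MB huge pages */

/* Number of huge pages covering len bytes. */
static unsigned long npages(unsigned long len, unsigned long hpsz)
{
	return len / hpsz;
}

/* One map/fault/unmap pass over an already open hugetlbfs file. */
static int fault_pass(int fd)
{
	unsigned long i;
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return -1;

	/* Touch one byte per huge page to fault every page in. */
	for (i = 0; i < npages(MAP_SIZE, HPAGE_SIZE); i++)
		p[i * HPAGE_SIZE] = 1;

	return munmap(p, MAP_SIZE);
}
```

A driver would open and ftruncate a file on a hugetlbfs mount, then call
fault_pass() in an infinite loop; the numbers above came from eight such
processes running concurrently.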
This moves acquisition of i_mmap_rwsem in the fault path back to
huge_pmd_share, where it is only taken when necessary.

After reverting these patches, we still need to handle:
- fault and truncate races
  Catch and properly back out faults beyond i_size. Backing out
  reservations is much easier after commit 846be08578ed expanded the
  restore_reserve_on_error functionality.
- unshare and fault/lookup races
  Since the pointer returned from huge_pte_offset or huge_pte_alloc may
  become invalid before we lock the page table, we must revalidate after
  taking the lock. Code paths must back out and possibly retry if the
  page table pointer changes.

Patch 5 in this series makes basic changes to page table locking for
hugetlb mappings. Currently, the code uses (split) pmd locks for hugetlb
mappings if the page size is PMD_SIZE. A pointer to the pmd is required
to find the page struct containing the lock. However, with pmd sharing
the pmd pointer is not stable until we hold the pmd lock. To solve this
chicken/egg problem, we use the page_table_lock in mm_struct if the pmd
pointer is associated with a mapping where pmd sharing is possible. A
study of the performance implications of this change still needs to be
performed.

Please help with comments or suggestions. I would like to come up with
something that is performant AND safe.
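The lock selection for the chicken/egg problem above might be shaped
roughly like the following pseudocode sketch. The helper name
pmd_sharing_possible() is illustrative and not necessarily what the
patch uses.

```c
/*
 * Sketch: pick the page table lock for a hugetlb pte.  If the mapping
 * allows pmd sharing, the pmd pointer (and thus the page struct holding
 * the split lock) is not stable until a lock is held, so fall back to
 * mm->page_table_lock in that case.
 */
static spinlock_t *hugetlb_pte_lockptr(struct hstate *h,
				       struct mm_struct *mm,
				       struct vm_area_struct *vma,
				       pte_t *pte)
{
	if (pmd_sharing_possible(vma))		/* illustrative helper */
		return &mm->page_table_lock;
	if (huge_page_size(h) == PMD_SIZE)
		return pmd_lockptr(mm, (pmd_t *)pte);
	return &mm->page_table_lock;
}
```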
Mike Kravetz (6):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlbfs: Do not use pmd locks if hugetlb sharing possible
  hugetlb: Check for pmd unshare and fault/lookup races

 arch/powerpc/mm/pgtable.c |   2 +-
 fs/hugetlbfs/inode.c      | 136 +++++++++++--------
 include/linux/hugetlb.h   |  30 ++---
 mm/damon/vaddr.c          |   4 +-
 mm/hmm.c                  |   2 +-
 mm/hugetlb.c              | 273 ++++++++++++++++++++++----------------
 mm/mempolicy.c            |   2 +-
 mm/migrate.c              |   2 +-
 mm/page_vma_mapped.c      |   2 +-
 mm/rmap.c                 |   8 +-
 mm/userfaultfd.c          |  11 +-
 11 files changed, 261 insertions(+), 211 deletions(-)

-- 
2.35.1