[RFC PATCH 0/5] hugetlb: Change huge pmd sharing

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Wed, 6 Apr 2022 13:48:18 -0700

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program that does the
following in an infinite loop:
- mmap a 4GB hugetlb file (size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file
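
For reference, one pass of such a loop could look roughly like the
sketch below.  This is not the actual test program: the function name,
the error handling and the use of MAP_SHARED are assumptions, and the
path, length and page size are left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * One map/fault/unmap pass over a file (hypothetical helper).
 * Returns the number of pages touched, or -1 on error.
 */
long map_fault_unmap(const char *path, size_t len, size_t page_size)
{
	assert(page_size > 0 && len % page_size == 0);

	int fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	/* Assumption: a shared mapping, since pmd sharing needs one. */
	char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps the file referenced */
	if (addr == MAP_FAILED)
		return -1;

	long touched = 0;
	for (size_t off = 0; off < len; off += page_size) {
		/* Write one byte per page to fault it in. */
		((volatile char *)addr)[off] = 1;
		touched++;
	}

	if (munmap(addr, len) < 0)
		return -1;
	return touched;
}
```

Calling map_fault_unmap("/dev/hugepages/test", 4UL << 30, 2UL << 20)
in a for (;;) loop would approximate the test, where the mount point,
file name and 2MB huge page size are assumptions about the machine.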

The hugetlb fault code was then instrumented to collect the number of
times the mutex was locked and the time spent waiting for it.  Samples
are from 10 second intervals on a 4 CPU VM with 8GB memory, with eight
instances of the map/fault/unmap program running.

v5.17
-----
[  708.763114] Wait_debug: faults sec  3622
[  708.764010]             num faults  36220
[  708.765016]             num waits   36220
[  708.766054]             intvl wait time 54074 msecs
[  708.767287]             max_wait_time   31000 usecs

v5.17 + this series (similar to v5.6)
-------------------------------------
[  282.191391] Wait_debug: faults sec  1777939
[  282.192571]             num faults  17779393
[  282.193746]             num locks   5517
[  282.194858]             intvl wait time 19907 msecs
[  282.196226]             max_wait_time   43000 usecs

As can be seen, v5.17 handles roughly 490 times fewer faults per second
(3622 vs 1777939).  Fault time suffers when other operations, such as
unmap, take i_mmap_rwsem in write mode.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5 which
depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
fault path back to huge_pmd_share where it is only taken when necessary.
After reverting these patches, we still need to handle:
fault and truncate races
- Catch and properly back out faults beyond i_size.
  Backing out reservations is much easier after 846be08578ed expanded
  restore_reserve_on_error functionality.
unshare and fault/lookup races
- Since the pointer returned from huge_pte_offset or huge_pte_alloc may
  become invalid at any time before we lock the page table, we must
  revalidate it after taking the lock.  Code paths must back out and
  possibly retry if the page table pointer changes.
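
The lookup/lock/revalidate/retry pattern can be modelled in userspace
as sketched below.  This is not code from the series: struct table,
struct slot, lookup_and_lock and the swap_in parameter are all
hypothetical names.  The model also assumes that whoever swaps a table
holds the old table's lock while doing so, and that tables are never
freed while a lookup is in flight.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table page with its own lock. */
struct table {
	pthread_mutex_t lock;
	int entry[4];
};

/* Hypothetical stand-in for the upper level entry pointing at a table. */
struct slot {
	_Atomic(struct table *) tbl;
};

/* Unlocked lookup: the result may be stale once the lock is taken. */
static int *lookup(struct slot *s, size_t idx, struct table **tblp)
{
	struct table *t = atomic_load(&s->tbl);

	assert(idx < 4);
	*tblp = t;
	return t ? &t->entry[idx] : NULL;
}

/*
 * Look up an entry, lock its table, then repeat the lookup under the
 * lock.  If the pointer changed in between, back out and retry.  On
 * success the table is returned locked in *tblp.  If swap_in is
 * non-NULL, an "unshare" that installs it is simulated in the unlocked
 * window of the first attempt so the race can be provoked on demand.
 */
int *lookup_and_lock(struct slot *s, size_t idx, struct table **tblp,
		     struct table *swap_in, int *retries)
{
	for (;;) {
		struct table *t, *t2;
		int *p = lookup(s, idx, &t);

		if (!p)
			return NULL;
		if (swap_in) {
			/* Simulated unshare: swap under the old table's lock. */
			pthread_mutex_lock(&t->lock);
			atomic_store(&s->tbl, swap_in);
			pthread_mutex_unlock(&t->lock);
			swap_in = NULL;
		}
		pthread_mutex_lock(&t->lock);
		if (lookup(s, idx, &t2) == p) {
			*tblp = t;	/* still valid under the lock */
			return p;
		}
		pthread_mutex_unlock(&t->lock);	/* stale: back out, retry */
		(*retries)++;
	}
}
```

The caller unlocks (*tblp)->lock when done with the entry.  Whether the
kernel paths can retry in place or must return to their callers depends
on the individual code path, which this model does not attempt to show.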

The commit message in patch 5 suggests that it is not safe to use
SPLIT_PMD_PTLOCKS for hugetlb mappings if sharing is possible.  If
others confirm/agree then there will need to be additional work.

Please help with comments or suggestions.  I would like to come up with
something that is performant and safe.

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@xxxxxxxxxxxxx/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@xxxxxxxxxx/

Mike Kravetz (5):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlb: Check for pmd unshare and fault/lookup races

 fs/hugetlbfs/inode.c    |  84 ++++++++------------
 include/linux/hugetlb.h |   3 +-
 mm/hugetlb.c            | 169 +++++++++++++++++++---------------------
 mm/rmap.c               |  14 +---
 mm/userfaultfd.c        |  11 +--
 5 files changed, 118 insertions(+), 163 deletions(-)

-- 
2.35.1