I'm kicking this back to RFC status as there may be a couple other options we want to explore. This expanded use of i_mmap_rwsem in hugetlbfs was recently pushed upstream and reverted due to locking issues. http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@xxxxxxxxxx http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@xxxxxxxxxx The biggest problem with those patches was lock ordering. Recall, the two issues we are trying to address: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. To effectively use i_mmap_rwsem to address these issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not difficult as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. If this approach of extending the usage of i_mmap_rwsem is not acceptable, there are two other options I can think of. - Add a new rw-semaphore that lives in the hugetlbfs specific inode extension. The good thing is that this semaphore would be hugetlbfs specific and not directly exposed to the rest of the mm code. Therefore, interaction with other locking schemes is minimized. - Don't really add any new synchronization, but notice and catch all the races. After catching the race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Suggestions on how to proceed would be appreciated. If you think the following patches are not too ugly, comments on those would also be welcome. Mike Kravetz (2): hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race fs/hugetlbfs/inode.c | 34 +++++--- include/linux/fs.h | 5 ++ include/linux/hugetlb.h | 8 ++ mm/hugetlb.c | 186 ++++++++++++++++++++++++++++++++++------ mm/memory-failure.c | 29 ++++++- mm/migrate.c | 24 +++++- mm/rmap.c | 17 +++- mm/userfaultfd.c | 11 ++- 8 files changed, 270 insertions(+), 44 deletions(-) -- 2.17.2