[PATCH RFC v3 0/2] hugetlbfs: use i_mmap_rwsem for more synchronization

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Fri, 1 Feb 2019 14:17:03 -0800

I'm kicking this back to RFC status as there may be a couple other
options we want to explore.  This expanded use of i_mmap_rwsem in
hugetlbfs was recently pushed upstream and reverted due to locking
issues.
http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@xxxxxxxxxx
http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@xxxxxxxxxx

The biggest problem with those patches was lock ordering.  Recall, the
two issues we are trying to address:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
   invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
   reserve counts and state.

To effectively use i_mmap_rwsem to address these issues it needs to
be held (in read mode) during page fault processing.  However, during
fault processing we need to lock the page we will be adding.  Lock
ordering requires we take page lock before i_mmap_rwsem.  Waiting until
after taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the
lock ordering for hugetlb pages.  This is not difficult as hugetlbfs
processing is done separate from core mm in many places.  However, I
don't really like this idea.  Much ugliness is contained in the new
routine hugetlb_page_mapping_lock_write() of patch 1.

If this approach of extending the usage of i_mmap_rwsem is not acceptable,
there are two other options I can think of.
- Add a new rw-semaphore that lives in the hugetlbfs specific inode
  extension.  The good thing is that this semaphore would be hugetlbfs
  specific and not directly exposed to the rest of the mm code.  Therefore,
  interaction with other locking schemes is minimized.
- Don't really add any new synchronization, but notice and catch all the
  races.  After catching the race, cleanup, backout, retry ... etc, as
  needed.  This can get really ugly, especially for huge page reservations.
  At one time, I started writing some of the reservation backout code for
  page faults and it got so ugly and complicated I went down the path of
  adding synchronization to avoid the races.

Suggestions on how to proceed would be appreciated.  If you think the
following patches are not too ugly, comments on those would also be
welcome.

Mike Kravetz (2):
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race

 fs/hugetlbfs/inode.c    |  34 +++++---
 include/linux/fs.h      |   5 ++
 include/linux/hugetlb.h |   8 ++
 mm/hugetlb.c            | 186 ++++++++++++++++++++++++++++++++++------
 mm/memory-failure.c     |  29 ++++++-
 mm/migrate.c            |  24 +++++-
 mm/rmap.c               |  17 +++-
 mm/userfaultfd.c        |  11 ++-
 8 files changed, 270 insertions(+), 44 deletions(-)

-- 
2.17.2