On Fri, 9 Oct 2020, Mike Kravetz wrote:
> On 10/8/20 10:50 PM, Hugh Dickins wrote:
> >
> > It's a problem I've faced before in tmpfs, keeping a hold on the
> > mapping while page lock is dropped. Quite awkward: igrab() looks as
> > if it's the right thing to use, but turns out to give no protection
> > against umount. Last time around, I ended up with a stop_eviction
> > count in the shmem inode, which shmem_evict_inode() waits on if
> > necessary. Something like that could be done for hugetlbfs too,
> > but I'd prefer to do it without adding extra, if there is a way.
>
> Thanks.

I failed to come up with anything neater than a stop_eviction count
in the hugetlbfs inode.

But that's not unlike a very special purpose rwsem added into the
hugetlbfs inode: and now that you're reconsidering how i_mmap_rwsem
got repurposed, perhaps you will end up with an rwsem of your own in
the hugetlbfs inode.

So I won't distract you with a stop_eviction patch unless you ask:
that's easy, what you're looking into is hard - good luck!

Hugh

> >>
> >> As mentioned above, I hope all this can be removed.
> >
> > If you continue to nest page lock inside i_mmap_rwsem for hugetlbfs,
> > then I think that part of hugetlb_page_mapping_lock_write() has to
> > remain. I'd much prefer that hugetlbfs did not reverse the usual
> > nesting, but accept that you had reasons for doing it that way.
>
> Yes, that is necessary with the change to lock order.
>
> Yesterday I found another issue/case to handle in the hugetlb COW code
> caused by the difference in lock nesting. As a result, I am rethinking
> the decision to change the locking order.
>
> The primary reason for changing the lock order was to make the hugetlb
> page fault code not have to worry about pte pointers changing. The issue
> is that the pte pointer you get (or create) while walking the table
> without the page table lock could go away due to unsharing the PMD. We
> can walk the table again after acquiring the lock and validate and possibly
> retry. However, the backout code in this area, which previously had to
> deal with truncation as well, was quite fragile and did not work in all
> corner cases. This was mostly due to lovely huge page reservations.
> I thought that adding more complexity to the backout code was going to
> cause more issues. Changing the locking order eliminated the pte pointer
> race as well as truncation. However, it created new locking issues. :(
>
> In parallel to working the locking issue, I am also exploring enhanced
> backout code to handle all the needed cases.
> --
> Mike Kravetz
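
For reference, the stop_eviction scheme Hugh describes looks roughly
like the following. This is a minimal sketch modeled on the existing
shmem version in mm/shmem.c (which pairs wait_var_event() with
wake_up_var()); the hugetlbfs field and the call sites shown here are
hypothetical, not code from either series.

	/* hypothetical field in struct hugetlbfs_inode_info */
	atomic_t stop_eviction;	/* held while page lock is dropped */

	/* holder side: pin the inode before dropping the page lock */
	atomic_inc(&info->stop_eviction);
	/* ... drop page lock, do the work, reacquire locks ... */
	if (atomic_dec_and_test(&info->stop_eviction))
		wake_up_var(&info->stop_eviction);

	/* eviction side: hugetlbfs_evict_inode() waits out all holders */
	wait_var_event(&info->stop_eviction,
		       !atomic_read(&info->stop_eviction));

The count behaves like a tiny special-purpose refcount: eviction cannot
complete while any path that dropped the page lock still holds a
reference, which is the "very special purpose rwsem" flavour Hugh
alludes to above.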
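
Similarly, the validate-and-retry that Mike describes for the pte
pointer race would look roughly like this, assuming the existing
huge_pte_offset()/huge_pte_lock() interfaces and the mm/vma/haddr
variables as in hugetlb_fault(); again only a sketch, not code from
the series:

	struct hstate *h = hstate_vma(vma);
	spinlock_t *ptl;
	pte_t *ptep;

retry:
	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (!ptep)
		return 0;	/* nothing mapped at this address */

	ptl = huge_pte_lock(h, mm, ptep);
	/*
	 * A shared PMD may have been unshared while we walked without
	 * the page table lock: re-walk and compare, retrying on
	 * mismatch.  Note this sketch glosses over the lifetime of the
	 * lock itself once the PMD page can be freed, which is part of
	 * what makes the real problem hard.
	 */
	if (ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) {
		spin_unlock(ptl);
		goto retry;
	}

	/* ... ptep is stable for as long as ptl is held ... */
	spin_unlock(ptl);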