On Fri, 9 Oct 2020, Mike Kravetz wrote:
> On 10/8/20 10:50 PM, Hugh Dickins wrote:
> >
> > It's a problem I've faced before in tmpfs, keeping a hold on the
> > mapping while page lock is dropped. Quite awkward: igrab() looks as
> > if it's the right thing to use, but turns out to give no protection
> > against umount. Last time around, I ended up with a stop_eviction
> > count in the shmem inode, which shmem_evict_inode() waits on if
> > necessary. Something like that could be done for hugetlbfs too,
> > but I'd prefer to do it without adding extra, if there is a way.
>
> Thanks.

I failed to come up with anything neater than a stop_eviction count
in the hugetlbfs inode.

But that's not unlike a very special purpose rwsem added into the
hugetlbfs inode: and now that you're reconsidering how i_mmap_rwsem
got repurposed, perhaps you will end up with an rwsem of your own in
the hugetlbfs inode.

So I won't distract you with a stop_eviction patch unless you ask:
that's easy, what you're looking into is hard - good luck!

Hugh

> >>
> >> As mentioned above, I hope all this can be removed.
> >
> > If you continue to nest page lock inside i_mmap_rwsem for hugetlbfs,
> > then I think that part of hugetlb_page_mapping_lock_write() has to
> > remain. I'd much prefer that hugetlbfs did not reverse the usual
> > nesting, but accept that you had reasons for doing it that way.
>
> Yes, that is necessary with the change to lock order.
>
> Yesterday I found another issue/case to handle in the hugetlb COW code
> caused by the difference in lock nesting. As a result, I am rethinking
> the decision to change the locking order.
>
> The primary reason for changing the lock order was to make the hugetlb
> page fault code not have to worry about pte pointers changing. The issue
> is that the pte pointer you get (or create) while walking the table
> without the page table lock could go away due to unsharing the PMD. We
> can walk the table again after acquiring the lock and validate and possibly
> retry. However, the backout code in this area, which previously had to
> deal with truncation as well, was quite fragile and did not work in all
> corner cases. This was mostly due to lovely huge page reservations.
> I thought that adding more complexity to the backout code was going to
> cause more issues. Changing the locking order eliminated the pte pointer
> race as well as truncation. However, it created new locking issues. :(
>
> In parallel to working the locking issue, I am also exploring enhanced
> backout code to handle all the needed cases.
> --
> Mike Kravetz
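
For reference, the stop_eviction scheme Hugh describes looks roughly
like the following. This is a minimal sketch modeled on the existing
shmem version in mm/shmem.c (which pairs wait_var_event() with
wake_up_var()); the hugetlbfs field and the call sites shown here are
hypothetical, not code from either series.

	/* hypothetical field in struct hugetlbfs_inode_info */
	atomic_t stop_eviction;	/* held while page lock is dropped */

	/* holder side: pin the inode before dropping the page lock */
	atomic_inc(&info->stop_eviction);
	/* ... drop page lock, do the work, reacquire locks ... */
	if (atomic_dec_and_test(&info->stop_eviction))
		wake_up_var(&info->stop_eviction);

	/* eviction side: hugetlbfs_evict_inode() waits out all holders */
	wait_var_event(&info->stop_eviction,
		       !atomic_read(&info->stop_eviction));

The count behaves like a tiny special-purpose refcount: eviction cannot
complete while any path that dropped the page lock still holds a
reference, which is the "very special purpose rwsem" flavour Hugh
alludes to above.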
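
Similarly, the validate-and-retry that Mike describes for the pte
pointer race would look roughly like this, assuming the existing
huge_pte_offset()/huge_pte_lock() interfaces and the mm/vma/haddr
variables as in hugetlb_fault(); again only a sketch, not code from
the series:

	struct hstate *h = hstate_vma(vma);
	spinlock_t *ptl;
	pte_t *ptep;

retry:
	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (!ptep)
		return 0;	/* nothing mapped at this address */

	ptl = huge_pte_lock(h, mm, ptep);
	/*
	 * A shared PMD may have been unshared while we walked without
	 * the page table lock: re-walk and compare, retrying on
	 * mismatch.  Note this sketch glosses over the lifetime of the
	 * lock itself once the PMD page can be freed, which is part of
	 * what makes the real problem hard.
	 */
	if (ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) {
		spin_unlock(ptl);
		goto retry;
	}

	/* ... ptep is stable for as long as ptl is held ... */
	spin_unlock(ptl);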