On 10/9/20 3:23 PM, Hugh Dickins wrote: > On Fri, 9 Oct 2020, Mike Kravetz wrote: >> On 10/8/20 10:50 PM, Hugh Dickins wrote: >>> >>> It's a problem I've faced before in tmpfs, keeping a hold on the >>> mapping while page lock is dropped. Quite awkward: igrab() looks as >>> if it's the right thing to use, but turns out to give no protection >>> against umount. Last time around, I ended up with a stop_eviction >>> count in the shmem inode, which shmem_evict_inode() waits on if >>> necessary. Something like that could be done for hugetlbfs too, >>> but I'd prefer to do it without adding extra, if there is a way. >> >> Thanks. > > I failed to come up with anything neater than a stop_eviction count > in the hugetlbfs inode. > > But that's not unlike a very special purpose rwsem added into the > hugetlbfs inode: and now that you're reconsidering how i_mmap_rwsem > got repurposed, perhaps you will end up with an rwsem of your own in > the hugetlbfs inode. We have this in the Oracle UEK kernel. https://github.com/oracle/linux-uek/commit/89260f55df008bb01c841714fb2ad26c300c8c88 Usage has been expanded to cover more cases. When I proposed the same upstream, the suggestion was to try and use i_mmap_rwsem. That is what I have been trying to do. Certainly, a hugetlbfs specific rwsem is easier to manage and less disruptive. > So I won't distract you with a stop_eviction patch unless you ask: > that's easy, what you're looking into is hard - good luck! Thanks Hugh! >> In parallel to working the locking issue, I am also exploring enhanced >> backout code to handle all the needed cases. I'm now confident that we need this or some other locking in place. Why? I went back to a code base before locking changes. My idea was to check for races and backout changes as necessary. Page fault handling code will do something like this: ptep = huge_pte_alloc(mm, haddr, huge_page_size(h)); ... do some stuff, possibly allocate a page ... ptl = huge_pte_lock(h, mm, ptep); Because of pmd sharing, we can not be sure the ptep is valid until after holding the ptl. So, I was going to add something like this after obtaining the ptl. while(ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) { spin_unlock(ptl); ptep = huge_pte_alloc(mm, haddr, huge_page_size(h)); if (!ptep) { ret = VM_FAULT_OOM; goto backout_unlocked; } ptl = huge_pte_lock(h, mm, ptep); } Unfortunately, the problem is worse than I thought. I knew the PMD pointed to by the ptep could be unshared before attempting to acquire the ptl. However, it is possible that the page which was the PMD may be repurposed before we even try to acquire the ptl. Since the ptl is a spinlock within the struct page of what was the PMD, it may no longer be valid. I actually experienced addressing exceptions in the spinlock code within huge_pte_lock. :( So, it seems that we need some way to prevent PMD unsharing between the huge_pte_alloc() and huge_pte_lock(). It is going to have to be i_mmap_rwsem or some other rwsem. -- Mike Kravetz