Re: [PATCH v3 7/8] mm/mempolicy: use a standard migration target allocation callback

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Fri, 9 Oct 2020 17:25:30 -0700

On 10/9/20 3:23 PM, Hugh Dickins wrote:
> On Fri, 9 Oct 2020, Mike Kravetz wrote:
>> On 10/8/20 10:50 PM, Hugh Dickins wrote:
>>>
>>> It's a problem I've faced before in tmpfs, keeping a hold on the
>>> mapping while page lock is dropped.  Quite awkward: igrab() looks as
>>> if it's the right thing to use, but turns out to give no protection
>>> against umount.  Last time around, I ended up with a stop_eviction
>>> count in the shmem inode, which shmem_evict_inode() waits on if
>>> necessary.  Something like that could be done for hugetlbfs too,
>>> but I'd prefer to do it without adding extra, if there is a way.
>>
>> Thanks.
> 
> I failed to come up with anything neater than a stop_eviction count
> in the hugetlbfs inode.
> 
> But that's not unlike a very special purpose rwsem added into the
> hugetlbfs inode: and now that you're reconsidering how i_mmap_rwsem
> got repurposed, perhaps you will end up with an rwsem of your own in
> the hugetlbfs inode.

We have this in the Oracle UEK kernel.
https://github.com/oracle/linux-uek/commit/89260f55df008bb01c841714fb2ad26c300c8c88
Usage has been expanded to cover more cases.

When I proposed the same upstream, the suggestion was to try and use
i_mmap_rwsem.  That is what I have been trying to do.  Certainly, a
hugetlbfs specific rwsem is easier to manage and less disruptive.

> So I won't distract you with a stop_eviction patch unless you ask:
> that's easy, what you're looking into is hard - good luck!

Thanks Hugh!

>> In parallel to working the locking issue, I am also exploring enhanced
>> backout code to handle all the needed cases.

I'm now confident that we need this or some other locking in place.  Why?

I went back to a code base before locking changes.  My idea was to check
for races and backout changes as necessary.  Page fault handling code will
do something like this:

ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
... do some stuff, possibly allocate a page ...
ptl = huge_pte_lock(h, mm, ptep);

Because of pmd sharing, we can not be sure the ptep is valid until
after holding the ptl.  So, I was going to add something like this
after obtaining the ptl.

while(ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) {
	spin_unlock(ptl);
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	if (!ptep) {
		ret = VM_FAULT_OOM;
		goto backout_unlocked;
	}
	ptl = huge_pte_lock(h, mm, ptep);
}

Unfortunately, the problem is worse than I thought.  I knew the PMD
pointed to by the ptep could be unshared before attempting to acquire
the ptl.  However, it is possible that the page which was the PMD may
be repurposed before we even try to acquire the ptl.  Since the ptl is
a spinlock within the struct page of what was the PMD, it may no longer
be valid.  I actually experienced addressing exceptions in the spinlock
code within huge_pte_lock. :(

So, it seems that we need some way to prevent PMD unsharing between the
huge_pte_alloc() and huge_pte_lock().  It is going to have to be
i_mmap_rwsem or some other rwsem.
-- 
Mike Kravetz