On 3/25/22 06:33, Ray Fucillo wrote:
>
>> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>
>> I will continue to look at this.  A quick check of the fork code shows the
>> semaphore held in read mode for the duration of the page table copy.
>
> Thank you for looking into it.
>

Adding some mm people on cc:

Just a quick update on some thoughts and a possible approach.

Note that regressions were reported when code was originally added to take
i_mmap_rwsem at fault time.  A limited way of addressing the issue was
proposed here:
https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@xxxxxxxxxx/
I do not think such a change would help in this case, as the hugetlb pages
are used via a shared memory segment.  Hence, sharing (and pmd sharing) is
happening.

After some thought, I believe the synchronization needed for pmd sharing,
as outlined in commit c0d0381ade79, is limited to a single address
space/mm_struct.  We only need to worry about one thread of a process
causing an unshare while another thread in the same process is faulting.
That is because the unshare only tears down the page tables in the calling
process.  Also, the page table modifications associated with pmd sharing
are constrained by the virtual address range of a vma describing the
sharable area.  Therefore, pmd sharing synchronization can be done at the
vma level.

My 'plan' is to hang a rw_sema off the vm_private_data of hugetlb vmas
that can possibly have shared pmds.  We will use this new semaphore
instead of i_mmap_rwsem at fault and pmd_unshare time.  (A very rough,
untested sketch of the idea is appended at the end of this mail.)  The
only time we should see contention on this semaphore is if one thread of
a process is doing something to cause unsharing for an address range while
another thread is faulting in the same range.  This seems unlikely, and
much, much less common than one process unmapping pages while another
process wants to fault them in on a large shared area.

There will also be a little code shuffling, as the fault code is also
synchronized with truncation and hole punch via i_mmap_rwsem.  But this is
much easier to address.

Comments or other suggestions welcome.
-- 
Mike Kravetz
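
For illustration only, here is a very rough, untested sketch of the kind of
structure I have in mind.  All of the names (hugetlb_vma_lock,
hugetlb_vma_lock_alloc, hugetlb_vma_lock_read, etc.) are made up for this
example, and real details such as freeing the structure, the existing use
of vm_private_data for reservation maps, and vmas that cannot share pmds
are ignored:

/* Rough, untested sketch only: all names below are hypothetical. */
#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/slab.h>

/* Per-vma synchronization for hugetlb pmd sharing, hung off vm_private_data. */
struct hugetlb_vma_lock {
	struct rw_semaphore rw_sema;
};

/* Allocate the lock when a sharable hugetlb vma is set up. */
static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock;

	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
	if (!vma_lock)
		return -ENOMEM;

	init_rwsem(&vma_lock->rw_sema);
	vma->vm_private_data = vma_lock;
	return 0;
}

/* Fault path: take the vma lock in read mode instead of i_mmap_rwsem. */
static void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	down_read(&vma_lock->rw_sema);
}

static void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	up_read(&vma_lock->rw_sema);
}

/* pmd unshare path: take the vma lock in write mode. */
static void hugetlb_vma_lock_write(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	down_write(&vma_lock->rw_sema);
}

static void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	up_write(&vma_lock->rw_sema);
}

With something like this, only threads operating on the same vma (and hence
the same mm) contend on the semaphore, instead of every process mapping the
file contending on i_mmap_rwsem.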