On 3/25/22 06:33, Ray Fucillo wrote:
>
>> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>
>> I will continue to look at this.  A quick check of the fork code shows the
>> semaphore held in read mode for the duration of the page table copy.
>
> Thank you for looking into it.
>

Adding some mm people on cc:

Just a quick update on some thoughts and a possible approach.

Note that regressions were reported when code was originally added to take
i_mmap_rwsem at fault time.  A limited way of addressing the issue was
proposed here:
https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@xxxxxxxxxx/
I do not think such a change would help in this case, as the hugetlb pages
are used via a shared memory segment.  Hence, sharing (and pmd sharing) is
happening.

After some thought, I believe the synchronization needed for pmd sharing,
as outlined in commit c0d0381ade79, is limited to a single address
space/mm_struct.  We only need to worry about one thread of a process
causing an unshare while another thread in the same process is faulting.
That is because the unshare only tears down the page tables in the calling
process.  Also, the page table modifications associated with pmd sharing
are constrained by the virtual address range of a vma describing the
sharable area.  Therefore, pmd sharing synchronization can be done at the
vma level.

My 'plan' is to hang a rw_sema off the vm_private_data of hugetlb vmas
that can possibly have shared pmds.  We will use this new semaphore
instead of i_mmap_rwsem at fault and pmd_unshare time.  (A very rough,
untested sketch of the idea is appended at the end of this mail.)  The
only time we should see contention on this semaphore is if one thread of
a process is doing something to cause unsharing for an address range while
another thread is faulting in the same range.  This seems unlikely, and
much, much less common than one process unmapping pages while another
process wants to fault them in on a large shared area.

There will also be a little code shuffling, as the fault code is also
synchronized with truncation and hole punch via i_mmap_rwsem.  But this is
much easier to address.

Comments or other suggestions welcome.
-- 
Mike Kravetz
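
For illustration only, here is a very rough, untested sketch of the kind of
structure I have in mind.  All of the names (hugetlb_vma_lock,
hugetlb_vma_lock_alloc, hugetlb_vma_lock_read, etc.) are made up for this
example, and real details such as freeing the structure, the existing use
of vm_private_data for reservation maps, and vmas that cannot share pmds
are ignored:

/* Rough, untested sketch only: all names below are hypothetical. */
#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/slab.h>

/* Per-vma synchronization for hugetlb pmd sharing, hung off vm_private_data. */
struct hugetlb_vma_lock {
	struct rw_semaphore rw_sema;
};

/* Allocate the lock when a sharable hugetlb vma is set up. */
static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock;

	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
	if (!vma_lock)
		return -ENOMEM;

	init_rwsem(&vma_lock->rw_sema);
	vma->vm_private_data = vma_lock;
	return 0;
}

/* Fault path: take the vma lock in read mode instead of i_mmap_rwsem. */
static void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	down_read(&vma_lock->rw_sema);
}

static void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	up_read(&vma_lock->rw_sema);
}

/* pmd unshare path: take the vma lock in write mode. */
static void hugetlb_vma_lock_write(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	down_write(&vma_lock->rw_sema);
}

static void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	up_write(&vma_lock->rw_sema);
}

With something like this, only threads operating on the same vma (and hence
the same mm) contend on the semaphore, instead of every process mapping the
file contending on i_mmap_rwsem.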