On Thu, Nov 07, 2019 at 02:06:28PM -0500, Waiman Long wrote:
> A customer with large SMP systems (up to 16 sockets) with application
> that uses large amount of static hugepages (~500-1500GB) are experiencing
> random multisecond delays. These delays was caused by the long time it
> took to scan the VMA interval tree with mmap_sem held.
>
> The sharing of huge PMD does not require changes to the i_mmap at all.
> As a result, we can just take the read lock and let other threads
> searching for the right VMA to share in parallel. Once the right
> VMA is found, either the PMD lock (2M huge page for x86-64) or the
> mm->page_table_lock will be acquired to perform the actual PMD sharing.
>
> Lock contention, if present, will happen in the spinlock. That is much
> better than contention in the rwsem where the time needed to scan the
> the interval tree is indeterminate.

I don't think this description really explains the contention argument
well.  There are _more_ PMD locks than there are i_mmap_sem locks, so
processes accessing different parts of the same file can work in parallel.

Are there other current users of the write lock that could use a read lock?
At first blush, it would seem that unmap_ref_private() also only needs
a read lock on the i_mmap tree.  I don't think hugetlb_change_protection()
needs the write lock either.  Nor retract_page_tables().